Anne van Kesteren

Fragment of HTML?

5 June 2004

At #atom there was a bit of discussion this afternoon about RSS 1.0, which appears to be the RDF version of RSS. In the spirit of RSS, the specification is unclear and not compatible with other specifications of RSS, but that may not surprise you. I was wondering about something that applies to Atom as well.

What is a fragment of HTML?

In feeds you can embed HTML. Note that I'm really talking about HTML this time, not about XHTML. From now on I will refer to it as text/html, which might be more appropriate. text/html in feeds needs to be encoded, since it contains syntax that is incompatible with the XML (application/xml) syntax. But I couldn't find a place where it is actually defined what is allowed from the HTML specification. I'm not completely sure whether Atom defines this (note that it isn't ATOM: it is a name, not an abbreviation), but nobody could tell me that it is defined somewhere, so for the moment I assume it is not.
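
To make "encoded" concrete, here is a minimal sketch in Python using only the standard library. The element name `content` and its `type` attribute are illustrative assumptions, not taken from any particular feed specification. The sketch only shows that a text/html string which is not well-formed XML can be carried as escaped character data and recovered unchanged.

```python
import xml.etree.ElementTree as ET

# A text/html fragment that is NOT well-formed XML:
# a bare "&" and a <br> that is never closed.
html_fragment = "<p>Tom & Jerry<br>second line</p>"

# Hypothetical element and attribute names, chosen for illustration only.
content = ET.Element("content", {"type": "text/html"})
content.text = html_fragment

# Serializing escapes "<", ">" and "&" in the text node.
encoded = ET.tostring(content, encoding="unicode")

# Any XML parser gives the original string back, as plain text.
decoded = ET.fromstring(encoded).text
```

Note what the XML layer does and does not do here: it transports the string faithfully, but it says nothing about which HTML elements the string may contain. That gap is the subject of this post.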

The problem is that no syndication specification currently defines what is allowed and what isn't. This leads me to the following conclusion: interoperable implementations are only possible when programmers agree among themselves on how to fill the gap the specification leaves. That should not be necessary, since such informal agreements always go wrong.

Specifications should be comprehensive enough to cover all the little details, so that interoperable implementations won't be a problem. Specifications should define what the error handling mechanism is when people provide, for example, invalid feeds. Specifications should provide a solid test suite that covers every single little detail of the specification.

So what can we include in such a "fragment" of text/html encoded bytes? SCRIPT, TITLE, META? Note that telling me how it should work doesn't really work. That is what specifications are for.

Quick: Harry Potter 3 was nice and I like W3C over IETF, since more people know it and it seems more solid to me.

Comments

Combining XML-based languages like RSS or Atom with SGML-based languages like HTML? This is evil as far as I'm concerned. Unless I misunderstand, now you suddenly cannot parse the whole file with an XML parser anymore. Sounds like a real mess to me...
Although XML namespaces are incredibly handy, there are a few things which I don't really like about them. For one, the fact that you can now start to do all kinds of strange stuff like putting script elements inside of a feed while still (technically, although maybe not according to the specification) being correct. Let's say a browser now starts to parse stuff inside of a script element inside a feed because of the "right" namespaces: what a mess that would make.
I would much rather not mix namespaces inside of one file. I opt for a single namespace with a proper DTD or XML Schema. If you must now start to define which elements of HTML you can put inside of a feed, why not define those elements inside of the actual namespace of the feed instead? For example, the script element would then be part of the RSS/Atom specification itself rather than being pulled in by combining namespaces with HTML or XHTML or whatever. (Using script as an example is stupid, since it is almost definitely a bad idea to use it anyway. It is only for argument's sake.)
I always like to have a DTD or XML Schema for each document, because then I can validate and I know that I didn't make any mistakes with my element/attribute names. As far as I understand, if you allow specific HTML elements inside of, for example, an Atom feed, you must add them to your DTD for the file if you want it to validate. Again, you might as well make those elements part of the Atom namespace if you have to first define them in the specification (to be proper) and then put them specifically inside your Atom DTD. Note that I'm now talking about a DTD, but an XML Schema is just as good. Also, I don't currently know of any DTD or XML Schema for Atom or RSS, but they may exist (and I would definitely like to see them). I might even think about creating them myself someday.
A last argument: XML namespaces can make a document more easily machine parsable. Yes, but that can also create a lot of anomalies (for example with the script element inside of a feed) if not treated specially. At the end of the day it might still be simpler and safer (although not necessarily easier) to keep namespaces separate. But that is probably not so in all cases, and it may not always be practical either. I don't really know myself.
PS: Sorry for the long comment. I should probably start my own Weblog. :-)
Posted by Charl van Niekerk at 2:25PM
Sorry for another comment, but just so that people don't misunderstand, there is definitely a need for XML namespaces, especially when it comes to RDF. I'm only saying that in a lot of cases (IMHO) it isn't optimal and is rather something to avoid when possible. Thanks.
Posted by Charl van Niekerk at 2:33PM
In the case of RSS, XML has always allowed for non-XML data to be included, via the CDATA mechanism. You'll see this on many sites whose feeds contain the original posts as HTML. Thus they remain fully XML-compliant and can be parsed properly. In other words, I don't see a problem here.
Posted by Chris Hester at 7:15PM
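
A small sketch of the CDATA mechanism described in the comment above, in Python with the standard library. The `description` element and the sample markup are made up for illustration. It shows that, to an XML parser, a CDATA section and entity-escaped text carry the same character data, and that neither produces child elements.

```python
import xml.etree.ElementTree as ET

# The same HTML carried two ways: in a CDATA section and entity-escaped.
cdata_item = "<description><![CDATA[<p>Hello <b>world</b></p>]]></description>"
escaped_item = (
    "<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</description>"
)

cdata_element = ET.fromstring(cdata_item)
escaped_element = ET.fromstring(escaped_item)

# Both parse to one text node holding the HTML as an opaque string.
cdata_text = cdata_element.text
escaped_text = escaped_element.text

# Neither form creates XML child elements for <p> or <b>.
child_count = len(list(cdata_element)) + len(list(escaped_element))
```

The feed stays well-formed either way. What the string is allowed to contain once a consumer hands it to an HTML parser is a separate question.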
The problem isn't that HTML is allowed, Chris; the problem is that it isn't defined what elements are allowed. Specifications should define things in much more detail and should not assume that people know what is meant.
Charl, validating isn't really important for XML as long as the correct content-type and namespaces are used. It is also impossible to define a subset of elements, since you might want to have support for SVG, a 700-page specification... (more about (ugly) SVG later)
Posted by Anne at 8:44PM
Chris: I have been looking around at some Atom feeds and I indeed see what you mean. This is a very ugly, yet effective method.
Anne: I think you misunderstand me. If some application/service could tell me "Hey, that element is not part of that particular namespace!" I would be happy, but I don't know of many that can. So although validating doesn't matter once your document is correct, how do I know it is correct in the first place if I can't validate?
Posted by Charl van Niekerk at 10:09PM
In practice, "fragment of HTML" seems to mean "anything that can appear as the body of an HTML document" (that is, whatever comes between <body> and </body>). A common behavior for a feed consumer is to incorporate one or more such fragments into the body of an HTML document to be sent to a standard HTML browser or interpreter.
Elements and attributes such as <script> and onmouseover may appear within an HTML fragment, but many feed consumers will wish to filter them out before displaying the HTML fragment, for security reasons. (One security practice is to filter out all elements and attributes except those on a whitelist.)
Of course, none of this is explicitly specified, and there are definitely issues brought up by the above.
Posted by Matt Brubeck at 12:13PM
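
The whitelist practice mentioned in the comment above could be sketched as follows, in Python with the standard library's `html.parser`. The particular tag and attribute lists, the class name `WhitelistFilter`, and the function name `filter_fragment` are assumptions made for illustration. This is not a complete sanitizer: for instance, it does not check URL schemes in `href` values.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative whitelists; a real consumer would choose its own.
ALLOWED_TAGS = {"p", "a", "em", "strong", "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"href", "title"}
SKIP_CONTENT = {"script", "style"}  # dropped together with their text


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            kept = ""
            for name, value in attrs:
                if name in ALLOWED_ATTRS:
                    # Re-escape the value so it cannot break out of the quotes.
                    kept += ' {}="{}"'.format(name, escape(value or ""))
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth and tag != "br":
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            # Text was decoded by the parser, so escape it again on output.
            self.out.append(escape(data, quote=False))


def filter_fragment(fragment):
    """Return the fragment with non-whitelisted tags and attributes removed."""
    parser = WhitelistFilter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


sample = (
    '<p onmouseover="x()">Hi <script>alert(1)</script>'
    '<a href="/a" onclick="y()">link</a></p>'
)
cleaned = filter_fragment(sample)
```

Tags that are not on the whitelist are dropped but their text is kept, except for `script` and `style`, whose content is discarded as well. That choice is itself one of the details no specification currently settles.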
A simple solution for that problem would be to import the XHTML DTD into the RSS/Atom/whatever DTD and use either of the parameter entities %Block or %Flow for the content model of the element that should contain (X)HTML. If you really want to use HTML 4.x you have to use the CDATA solution, or you have to define RSS or Atom as SGML subsets, but then they're not XML anymore. I have used the %Block/%Flow solution in several projects and it works quite well.
By the way, XML namespaces are XSLT programmers' hell ;]
Posted by ben at 4:52PM
I agree with Matt that an HTML fragment is anything inside BODY. But this should of course be explicitly stated in the specification. Now, the Atom CONTENT element is far from perfect, and discussion is ongoing to polish it more. Hence, we don't yet know what kind of content you may or may not squeeze into it. Maybe only XHTML will be allowed. Who knows.
But when we settle on something, we should specify that "something" very thoroughly. I hope you will be a watchdog in this area, Anne, so we don't end up in a mess similar to the one RSS is in at the moment. :-)
Posted by Asbjørn Ulsberg at 5:01PM
The problem isn't that HTML is allowed, Chris; the problem is that it isn't defined what elements are allowed. Specifications should define things in much more detail and should not assume that people know what is meant.

The way I think of it is this (and I could be naive here): CDATA is defined as "not part of the XML document, but character data such as HTML". Therefore, it doesn't matter what the CDATA HTML is. What relevance do the actual tags used have to the XML? They are not part of the structure.
CDATA allows anything to go in, safe in the knowledge that (at least in theory) the XML parser will not parse the CDATA content. Browsers displaying it as a web page is something of a kludge.
Posted by Chris Hester at 5:58PM
I personally think that the CDATA method is very ugly. And it would be nice to be able to validate (yes, here I go again) the (X)HTML like normal using XML Schema (more about that later). Why can't we rather combine namespaces (yes, I came to other insights) with XHTML inside a feed and use that inside the content element?
Ben: It also took me quite some time to figure out the business with XML namespaces in XSLT. The problem is that online tutorials don't seem to talk about that... they all assume that your source document (the one that calls the stylesheet) is not using namespaces. But at least when you figure it out it's not so bad (in my opinion). Is this the same problem you were having?
It seems like you can actually validate a document with multiple namespaces, both by importing into the DTD (thanks Ben) and by using XML Schema. I haven't had the time to try it out myself yet, but it seems like you can use xsi:schemaLocation (where xsi is the prefix conventionally bound to the XML Schema instance namespace). Does anybody have any experience with this?
The only problem with the schemaLocation method is that there doesn't currently seem to be an XML Schema definition file set up for XHTML 1.1. One would probably have to set up one's own schema for XHTML if one is really that desperate to validate. I did find a tool to convert a DTD to XML Schema, though, called "dtd2xs" at Lumrix. If that works it should be no problem.
And sorry about my obsession with validation; I'm self-admittedly a heavy perfectionist. Please excuse me if I'm irritating you with this by now (I must be!). :-)
Posted by Charl van Niekerk at 10:41PM
I just think that validation isn't that important with XML MIME types. XPath has support for namespaces, by the way, and since XSLT uses it as the selector language, there shouldn't be a problem. (I have some O'Reilly books here covering those subjects, but I'm rather busy with different things at the moment.)
Chris, that wasn't the point. Eventually, some parser has to handle that HTML. But how can that parser, or feed reader, handle it without knowing what is inside?
Posted by Anne at 2:00AM
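
As an illustration of namespace-aware selection, here is a minimal sketch using the XPath subset in Python's `xml.etree.ElementTree` rather than XSLT. The sample feed is made up, and the Atom namespace URI used is the one from the later Atom 1.0 specification, which is an assumption relative to the drafts discussed here. The point is the one the XSLT tutorials tend to skip: elements in a default namespace only match when the path names that namespace.

```python
import xml.etree.ElementTree as ET

# Made-up feed mixing two namespaces (Atom 1.0 URI assumed, see above).
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Fragment of HTML?</title>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml"><p>Hello</p></div>
    </content>
  </entry>
</feed>"""

# Prefixes are local to this mapping; they need not appear in the document.
ns = {
    "atom": "http://www.w3.org/2005/Atom",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

root = ET.fromstring(feed)

# Qualified paths find the elements in each namespace.
titles = [t.text for t in root.findall("atom:entry/atom:title", ns)]
paragraphs = [p.text for p in root.findall(".//xhtml:p", ns)]

# An unqualified path matches nothing, because "entry" and "title"
# live in the Atom namespace, not in no namespace.
unprefixed = root.findall("entry/title")
```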
@Charl: actually namespaces are not that bad. But if you work with several namespaces in several files and a complex XSLT stylesheet that is itself distributed over several files, you lose the overview fast. The fact that there may be an implicit namespace in any document doesn't make that easier. So XSLT programming with namespaces decreases the beauty of XSLT quite a lot.
And as far as I understand namespaces, they are mainly meant to be used when working with several document classes in one document, AND only if both have elements with the same names. Actually, I have never found this condition to be true.
Posted by ben at 4:17PM