Anne van Kesteren

HTML, XML and the DOM

24 September 2005

I have been thinking about parsers lately, in the context of building a browser. That idea came up in Norway and originated with Christian. Now I am not really a programmer (yet?) so this is all highly theoretical, but I talked about it for a bit with Jorgen and it seems to be quite interesting to do. This post will mainly talk about implementation details in the light of HTML 5.

When a browser retrieves an HTML document from a server it does not just display it. Before the browser can display the information, the document first needs to be parsed. You can see the HTML or XML you create as a large string. From this string the parser builds objects with properties, child objects, et cetera, also known as the DOM. As it is eventually about the DOM, it does not really matter how you come to that point, whether by HTML or XML. There are some fundamental differences though.
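To make the "large string to objects" step a bit more concrete, here is a minimal sketch in Python. It leans on the tokenizer from the standard library's `html.parser` module and builds a toy tree on top of it. The `Node` and `TreeBuilder` names are invented for this example, and the end-tag handling assumes well-nested markup, so this is nowhere near what a real browser parser has to do (error recovery, implied elements, void elements, and so on).

```python
from html.parser import HTMLParser


class Node:
    """A toy stand-in for a DOM node (hypothetical, not a real DOM API)."""

    def __init__(self, name, attributes=None, parent=None):
        self.name = name
        self.attributes = dict(attributes or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Turns a markup string into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # html.parser hands us tag and attribute names already lowercased.
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Naive assumption: input is well nested, so just step back up.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def parse(markup):
    """Parse a markup string and return the root of the toy tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root
```

Calling `parse('<HTML><BODY><P ID="intro">Hello</P></BODY></HTML>')` gives back a `#document` node whose descendants are `html`, `body` and `p` objects with their own attributes and text, which is the general shape of what the DOM exposes.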

In HTML, node names are always returned in uppercase per DOM Level 3 Core. (Previous DOM levels said the same thing.) By HTML I mean text/html, by the way. Everything with that media type goes through the same “string to DOM” parser. (Actually, even within that media type there are some differences for the parser depending on which DOCTYPE, if any, the site has declared.) For text/html there are also some special CSS cases, like the background property on the body element, which propagates straight to the canvas if the canonicalized value of the background property on the html element is transparent or rgba(0, 0, 0, 0), depending on whether rgba() is supported or not.

However, despite these inconsistencies, you would like the HTML DOM to be roughly similar to the DOM you would create for a similar document in XHTML. HTML 5 will make this easier by stating that, in terms of the DOM, HTML elements will be in the http://www.w3.org/1999/xhtml namespace. There are obviously some problems to be worked out with the case-insensitivity of HTML, but I think it might work out very well and help with the migration from HTML to XHTML.
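A small Python model can show how the uppercase node names and the shared namespace might fit together. Everything in it is an assumption made for illustration: the `Element` class, the `from_html` and `from_xhtml` helpers, and in particular the choice to lowercase HTML names before storing them, which is only one possible answer to the case-insensitivity problem.

```python
from dataclasses import dataclass

XHTML_NS = "http://www.w3.org/1999/xhtml"


@dataclass
class Element:
    """Toy element: just enough state to compare the two parsers."""

    local_name: str
    namespace_uri: str
    in_html_document: bool

    @property
    def tag_name(self):
        # Assumption: only XHTML-namespace elements in a text/html
        # document report their name in uppercase.
        if self.in_html_document and self.namespace_uri == XHTML_NS:
            return self.local_name.upper()
        return self.local_name


def from_html(name):
    # HTML names are case-insensitive, so normalise to lowercase first.
    return Element(name.lower(), XHTML_NS, in_html_document=True)


def from_xhtml(name):
    # XML is case-sensitive: keep the name exactly as written.
    return Element(name, XHTML_NS, in_html_document=False)
```

With this model, `from_html("BoDy")` and `from_xhtml("body")` end up with the same local name and namespace, and differ only in what `tag_name` reports.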

Basically the same thing applies to the XML “string to DOM” parser, only this one adheres to the rules of the XML specification and the XML Namespaces specification. The DOM that is eventually generated will be mostly equivalent. On both the HTML DOM and the XML DOM you should be able to set the xml:id attribute. The difference is that in XML you can also set it declaratively, directly in the markup, as that markup language is namespace aware.
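The declarative route can be tried out with a namespace-aware XML parser. The sketch below uses Python's `xml.dom.minidom` purely as a stand-in for a browser's XML parser; the sample markup and variable names are made up for the example.

```python
from xml.dom.minidom import parseString

XML_NS = "http://www.w3.org/XML/1998/namespace"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# The namespace and xml:id are both declared in the markup itself.
source = '<p xmlns="http://www.w3.org/1999/xhtml" xml:id="intro">Hello</p>'

document = parseString(source)
paragraph = document.documentElement

# The parser resolved the default namespace and the reserved xml prefix.
namespace = paragraph.namespaceURI
identifier = paragraph.getAttributeNS(XML_NS, "id")
```

Here `namespace` holds the XHTML namespace URI and `identifier` holds the value written in the markup, without any script having touched the tree.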

So namespace mixing is always possible through the DOM, and if you are using XML you can also do it in a declarative way. Thanks to the DOM, HTML and XHTML become roughly equivalent (one of the goals of HTML 5), with several small exceptions.
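For the scripted route, the namespace-aware DOM methods do the work that the markup cannot do in text/html. The following sketch again uses Python's `xml.dom.minidom` as a stand-in for a browser DOM; the `make_paragraph_with_id` helper is a hypothetical name chosen for the example.

```python
from xml.dom.minidom import getDOMImplementation

XML_NS = "http://www.w3.org/XML/1998/namespace"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def make_paragraph_with_id(identifier):
    """Build an XHTML-namespace <p> and attach xml:id through the DOM."""
    document = getDOMImplementation().createDocument(XHTML_NS, "p", None)
    paragraph = document.documentElement
    # The namespace is passed explicitly; no markup declaration is involved.
    paragraph.setAttributeNS(XML_NS, "xml:id", identifier)
    return paragraph
```

The resulting element carries the same namespaced attribute that an XML parser would have produced from `xml:id="..."` in the source, which is the sense in which the two DOMs end up roughly equivalent.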

Comments

Great article. I have a question though. Maybe it's because I haven't been paying attention but why continue developing HTML when we have XHTML?
Posted by Ara Pehlivanian at 6:34PM
I've got to say, I never really understood those who liked coding HTML in uppercase. When I code (and I think this will also apply to most others), I mark up the content as I produce it. That means I write some tags, then some text, then some more tags, then some more text, etc. Why on earth would I want to keep toggling Caps Lock if I don't have to?
Anyway, that said, I was glad to see that Slashdot's new code (which is shooting for HTML 4 compliance) is at least in lowercase. Saving themselves a lot of work down the road. But what is the benefit (apart from backwards compatibility) of HTML 5 continuing to be case-insensitive?
Posted by nickster at 9:24PM
Maybe it's because I haven't been paying attention but why continue developing HTML when we have XHTML?

Because most everybody still uses HTML, either as HTML itself, as XHTML appendix C, or as a mishmash of elements adorably known as tag soup. There's obviously still a demand for HTML out there, so why leave it stagnant?
Posted by J. King at 11:18PM
Life would be better if the text/html DOM was the same as the application/xhtml+xml DOM—that is, if the HTMLness was confined to the parser level and did not leak into scripting and styling.
Unfortunately, it cannot be done in browsers, due to the backwards compatibility issues. However, it makes a lot of sense for non-browser apps like content management systems and bots that do not look at styles or scripts to leave the HTMLness on the parser level and work with XHTML internally. The biggest browser DOM issue is namespacing.
BTW, I feel uncomfortable about the idea of bringing colonified names like xml:id into text/html.
I've got to say, I never really understood those who liked coding HTML in uppercase.

Me neither.
Anyway, that said, I was glad to see that Slashdot's new code (which is shooting for HTML 4 compliance) is at least in lowercase.

I don’t understand those who bother making an issue of the HTML 4 tag case on the sites of others. The upper case tags do no harm. I have upper case tags on my site, because I write new stuff in NeoOffice/J and OOo Writer/Web uses upper case for output.
But what is the benefit (apart from backwards compatibility) of HTML 5 continuing to be case-insensitive?

Nothing except not having to define error handling for upper case. However, backwards compatibility is the main benefit, so there is no reason to show any other benefit.
Posted by Henri Sivonen at 12:18AM
You have already started that project. Very good!
Posted by Christian at 12:06AM