Anne van Kesteren

XHTML versus HTML

18 February 2004

I have already written something about XML versus XHTML, to explain the difference between the two. Now, I would like to discuss and explain what the reasons are to choose XHTML over HTML, even if you can't send it as application/xhtml+xml. This is related to the DTD I "discussed" yesterday.

On the current web, XHTML doesn't have that many advantages over HTML. Some people think that it hasn't got any advantage over HTML, 'cause the correct content-type for XHTML, application/xhtml+xml, isn't supported by Internet Explorer (how many times I have written that sentence...). Other browsers, like Mozilla and Opera (includes scripting from 7.5), do support it (Safari as well, I believe). There are ways to send XHTML with the correct MIME type to "good" browsers and text/html to "bad" browsers (like IE), but those methods are rarely used (I just see that Cinnamon is added to the list, great work!).
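
A minimal sketch of such a method, in Python. The function name choose_content_type is hypothetical and not taken from any existing script, and the logic is deliberately simplified: it only checks whether the browser lists application/xhtml+xml in its Accept header with a non-zero q value, and falls back to text/html otherwise. A fuller implementation would probably compare q values between the two types.

```python
def choose_content_type(accept_header: str) -> str:
    """Pick a MIME type for an XHTML page from the request's Accept header.

    Simplified assumption: any explicit mention of application/xhtml+xml
    with q > 0 is treated as support; everything else gets text/html.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if media_type != "application/xhtml+xml":
            continue
        q = 1.0  # per HTTP, a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        if q > 0:
            return "application/xhtml+xml"
    return "text/html"


# A browser that advertises XHTML support gets the XML MIME type.
print(choose_content_type("text/html,application/xhtml+xml,*/*;q=0.8"))
# A browser that only sends generic types gets text/html.
print(choose_content_type("image/gif, image/jpeg, */*"))
```

A wildcard such as */* is not treated as support here, which is what keeps browsers like IE on text/html.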

I think we can state that this is the biggest "failure" of XHTML and a point of criticism. A good point (they exist) is the Document Type Definition of XHTML. Where HTML allows you to do:

<ul>
 <li>A
 <li>Unordered
 <li>List</li>
</ul>

XHTML is stricter (you have to close the list-item-open tag). A HTML document that looks like the example above has to be "re-mark-upped" by the browser so that the browser can have a normal DOM for JavaScript and CSS. A good example of that is that you may omit both start and end tag of the BODY element, but the CSS type selector body still applies. To tell you the truth, the same parser applies to XHTML documents sent as text/html, so it doesn't really matter that much (now). In the future, however, you can have (if you start writing now) completely valid XHTML documents (or well-formed) and send them as application/xhtml+xml so browsers will render your pages faster (not an extra parser, they expect your document to be valid, et cetera). What I'm trying to say (write*) is that XHTML is more forward compatible.
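
To make the strictness concrete, here is a small sketch using the XML parser from Python's standard library (the helper name is_well_formed is hypothetical). It only shows how an XML parser treats the two spellings of the list; it says nothing about whether the first one is valid as HTML 4.01.

```python
import xml.etree.ElementTree as ET

# The list from the example above, with the end tags HTML lets you omit.
HTML_STYLE = "<ul><li>A<li>Unordered<li>List</li></ul>"
# The same list with every list item closed, as XHTML requires.
XHTML_STYLE = "<ul><li>A</li><li>Unordered</li><li>List</li></ul>"


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # An XML parser must stop at the first well-formedness error.
        return False
    return True


print(is_well_formed(HTML_STYLE))
print(is_well_formed(XHTML_STYLE))
```

The first snippet is rejected because the XML parser sees an end tag that does not match the innermost open element; the second is accepted.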

Did I miss a point? In short: The XHTML DTD is better to work with, 'cause it is stricter than the HTML one.

The complete post was about XHTML1.0 (!) and HTML4.01.

Comments

I don't think you can compare XHTML and HTML just like that. I'd rather see a comparison between semantic XHTML and semantic HTML. In that case there are hardly any differences, except for the closing of tags (<br />) and some attributes (xml:lang).
Good, semantic HTML will be parsed fast as well, since there are no errors which cause the additional work. That said, if you send XHTML as text/html to IE6 you are sending bad HTML.
Seeing how IE6 rapidly reaches the state of NS4 I don't see this as a problem. Therefore, the debate should be more focused on what XHTML provides over HTML: namespaces. Of course, namespaces come from XHTML being XML, but they provide a lot more semantic functionality than HTML has. For that reason and because XHTML is more easily parsed (think: non-browser applications) as it is XML I would use XHTML.
Posted by Mark Wubben at 8:54PM
For me, the true power of XHTML is its extensibility, and this will become easier when we don't have to rely on Document Type Definitions at all. Since XHTML is supposed to be XML, we really should be using XML Schema instead.
As far as I am concerned, HTML and XHTML are completely different things. HTML is just a presentational language; a way of presenting content to the Web. XHTML, conversely, is an application of XML; a language to describe the structure of content - and these descriptions should be applicable and understood by any conceivable user agent.
Information retrieval is the most important quality of the Web's future. If content (data) is properly described, information retrieval will be easy. XHTML does a far better job of this (especially in strict flavors), and consequently it means that content will be more accessible. Content aggregators can easily be created that can scrape an XHTML document, but the same cannot be said for an HTML document.
Maybe XHTML's advantages are not immediately obvious to Joe User, but 10 years from now they certainly will be.
Posted by Simon Jessey at 9:22PM
"Good, semantic HTML will be parsed fast as well, since there are no errors which cause the additional work."
From what I understand, and correct me if I'm wrong here, the XML parser in modern browsers is a lighter weight, or should I say, inherently quicker device.
If you deliver either html or xhtml as text/html, you're using a slower parser whether your code is valid or not.
And for what it's worth, valid markup does not equal good semantic markup.
Posted by Mike P. at 11:41PM
I prefer using XHTML as it is stricter than HTML. Unclosed paragraph tags, and list items, or unquoted attributes may be valid HTML 4.01 but can do strange things when your HTML is styled with CSS.
Using the validator on XHTML pages traps more potential problems, which from a design point of view means more time spent on getting the site to look good and to work well, and less time tearing hair out trying to work out where that CSS problem came from.
Posted by Matthew Farrand at 3:50AM
developer-x.com/journal/2004/02/19/
Comments posted here because your comments system would not accept my input. It gave me a cryptic error message.
Posted by Tim Scarfe at 6:57AM
Whether I use the HTML 4.01 Strict doctype or XHTML 1.0 Strict doesn't really matter to me, as long as I have to send it as text/html to most browsers. The only difference will be a slash before the > in empty elements.
This, of course, is only true if I write the HTML as meticulously as I would XHTML, i.e. closing every element, quoting all attributes, etc.
I'm thinking about redoing my site as a weblog, and I've written a PHP script that will allow me to send pages as XHTML 1.1 with application/xhtml+xml to compliant browsers, while serving HTML 4.01 Strict with text/html to the others.
An XML parser is definitely more lightweight than a full SGML parser, so valid, well-formed XHTML should render faster in a browser that takes advantage of this difference.
/Tommy
Posted by TOOLman at 1:42PM
Well, I suppose in reality XML is a subset of SGML but it is true the XML Parser is faster because it is written to spit out errors instead of clunking along and rendering a malformed syntax as with some mainstream HTML Based Browsers.
Posted by Robert Wellock at 11:03PM
You guys can't be serious.
The difference in parsing time makes an imperceptible difference to the speed of the browser. Network download time is the dominant effect in how long it takes a page to appear in your browser. Rendering (as opposed to parsing) also takes vastly longer, for all but the simplest pages. And the difference in parsing time is negligible. You can easily verify this yourself with two local files (one HTML 4, one XHTML) of the "same" web page. I doubt you have a stopwatch accurate enough to measure the time-difference.
And there are enough differences (in the way scripts, CSS, the body element, etc. are handled) when your page is sent as application/xhtml+xml, to make it incredibly stupid to think that just because your pages validate as XHTML and work OK when sent as text/html, it will still do so when sent as XML.
This "future-proofing" argument is a crock, because the differences that matter are far more subtle, and harder to catch, than some dumb unclosed tags.
(And, of course, it begs the question of why, if your page "works" as text/html, you should ever feel the need to send it as application/xhtml+xml at some time in the misty future.)
Posted by Jacques Distler at 1:36AM
"A HTML document that looks like the example above has to be 're-mark-upped' by the browser so that the browser can have a normal DOM for JavaScript and CSS."

Nonsense.
Your example is valid HTML4, and an SGML parser has no more trouble parsing it than an XML parser would have with the corresponding snippet of XHTML (in which all the <li> elements have closing </li> tags).
You do realize that there are SGML parsers, right?
"The XHTML DTD is better to work with, 'cause it is stricter than the HTML one."

Again, you have misunderstood. XML parsers are stricter (they are required to bail on the first well-formedness error) than SGML parsers.
But it is meaningless to say that XML is "stricter" than SGML. It is simpler than SGML, which makes it easier to write parsers for. But in no possible sense is it "stricter." It just has a more complicated syntax.
Posted by Jacques Distler at 8:17AM
Whoops! I meant to say that the syntax of SGML is more complicated than that of XML.
Posted by Jacques Distler at 11:17AM
"And there are enough differences (in the way scripts, CSS, the body element, etc. are handled) when your page is sent as application/xhtml+xml, to make it incredibly stupid to think that just because your pages validate as XHTML and work OK when sent as text/html, it will still do so when sent as XML."

This is not entirely true. The BODY element can be treated as a DIV in HTML as well. When you are scripting according to the DOM it is also possible to achieve the same effects, as far as I have experimented with it.
You are indeed completely right about the SGML parser. I was more thinking how browsers handle the "tag-soup" HTML. And as far as I know, Mozilla will "fix" it, closing the empty elements et cetera.
Posted by Anne at 5:37PM
"I was more thinking how browsers handle the 'tag-soup' HTML. And as far as I know, Mozilla will 'fix' it, closing the empty elements et cetera."

Valid markup is always better. It is processed faster and it is processed more reliably. And it is more "future-proof", in that you are not relying on the error handling routines of future browsers being the same as those of today's browsers.
But, presumably, no one is advocating authoring invalid HTML. I thought the discussion was about valid HTML versus valid XHTML.
Posted by Jacques Distler at 1:19PM
It is, but the example above is valid HTML, though the unclosed LI elements, have to be closed first, before Mozilla can handle it correctly. I like to call that "tag-soup" HTML, but I understand it is confusing.
Posted by Anne at 1:46PM
"...though the unclosed LI elements, have to be closed first, before Mozilla can handle it correctly"

That is where you are mistaken. The LI element is closed automatically by the SGML syntax of HTML. If you are an SGML parser, cruising along parsing an LI element, and you encounter another <li> tag, then you know that you have reached the end of the current LI element. Since LI elements cannot be nested, the current one must end before the new one can begin.
These rules, if you stop to think about them, are perfectly unambiguous. They are different (more general) than the rules of XML syntax. But that does not make the above example "tag soup." Tag soup is error-filled; tag-soup is ambiguous. The above is perfectly error-free and unambiguous. And, no, it doesn't need to be "fixed" before Mozilla can handle it.
Posted by Jacques Distler at 3:22PM
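
The implied end-tag rule described in the last comment can be sketched in a few lines of Python. This is only an illustration, not how Mozilla or any real SGML parser is implemented: it uses the tokenizer from the standard library's html.parser module and adds one rule on top (a new <li> start tag ends the currently open LI). The class name ImpliedLiCloser and the helper normalize are hypothetical.

```python
from html.parser import HTMLParser


class ImpliedLiCloser(HTMLParser):
    """Turn markup with omitted </li> tags into an explicit event stream."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.events = []         # ("start" | "end" | "text", value) tuples

    def handle_starttag(self, tag, attrs):
        # An LI cannot directly contain another LI, so a new <li> start tag
        # unambiguously implies the end of the LI that is currently open.
        if tag == "li" and self.open_elements and self.open_elements[-1] == "li":
            self._close_top()
        self.open_elements.append(tag)
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if tag not in self.open_elements:
            return  # stray end tag; ignored in this sketch
        # Close anything still open inside the element (e.g. the last LI
        # when </ul> arrives), then close the element itself.
        while self.open_elements[-1] != tag:
            self._close_top()
        self._close_top()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.events.append(("text", text))

    def _close_top(self):
        self.events.append(("end", self.open_elements.pop()))


def normalize(markup):
    """Return the explicit start/end/text events for the given markup."""
    parser = ImpliedLiCloser()
    parser.feed(markup)
    parser.close()
    return parser.events


omitted = "<ul>\n <li>A\n <li>Unordered\n <li>List</li>\n</ul>"
explicit = "<ul><li>A</li><li>Unordered</li><li>List</li></ul>"
print(normalize(omitted) == normalize(explicit))
```

Both spellings produce the same sequence of events, which is the point being made: the omitted end tags leave nothing to guess.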