Anne van Kesteren

MIME types matter; DOCTYPEs don't

31 July 2004

My point today, as over the last few days (and it will probably stay the same in the future), is that the distinction between HTML and XHTML is made not by the DOCTYPE but by the MIME type. But first I should mention that Molly of molly.com asked the following question on her weblog: "DOCTYPES not relevant?" It was a reply to something I said on her weblog, which I already wanted to elaborate on here.

I should also tell you why I have been complaining about XHTML a bit. I believe it started with "XHTML is invalid HTML", followed by some weblog comments and other posts that some people just couldn't agree with. I don't think there is anything wrong with XHTML, but as long as you have to support Internet Explorer, a crappy CMS or ad software that sucks, you can't use it. To be forward compatible (the argument made most often, and also a myth, I believe) you must have well-formed documents, and being well-formed is tough. Not to mention that XML, and therefore any language based on XML (such as XHTML), is tough.

Limpid did make a site for a client that is always well-formed and valid, uses application/xhtml+xml and can be edited by the client (who uses IE). In fact, it is number 76 of the X-Philes. But that was before I acknowledged that there aren't any advantages in using XHTML, especially when you don't need MathML or XHTML Ruby. (And even then, if you don't use them heavily, you can always fall back on the OBJECT element.)

Note that there aren't (m)any semantic differences between HTML and XHTML. They are almost interchangeable, as Molly puts it; aside from the MIME types and browser support, that is, obviously.

Back to my point. Whether you use an HTML or an XHTML DOCTYPE just doesn't matter as long as you are sending your document as text/html. The browser will treat it as HTML either way, seriously. There is no advantage in using an XHTML DOCTYPE just to keep up with Zeldman ;-) The W3C tells browser makers (Mozilla, of course) that they should (OK, not a must, but nearly as strong) treat XHTML sent as text/html as (invalid) HTML. You read that correctly (again): you are just using tag soup.
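To make this concrete, here is a minimal Python sketch of how a consumer might pick its parsing mode. It is an illustration only: the function name `parsing_mode` and the `XML_TYPES` set are assumptions of this sketch, not any browser's actual code. The DOCTYPE is passed in purely to show that it is never consulted.

```python
# Hypothetical sketch: choose a parsing mode from the Content-Type header alone.

XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def parsing_mode(content_type: str, doctype: str = "") -> str:
    """Return "xml", "html" or "other", based only on the MIME type."""
    # Drop parameters such as ";charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in XML_TYPES:
        return "xml"
    if mime == "text/html":
        return "html"
    return "other"  # the doctype argument was never looked at


xhtml_doctype = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)
print(parsing_mode("text/html", xhtml_doctype))
print(parsing_mode("application/xhtml+xml;charset=utf-8", xhtml_doctype))
```

The same XHTML DOCTYPE ends up in "html" mode in the first call and in "xml" mode in the second; only the MIME type differs.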

The only way to really, really use XHTML is to start using the right MIME type for it. The most appropriate MIME type is application/xhtml+xml. You can also use application/xml, but you had better avoid text/xml (for everything), since it has character encoding problems. Only if you use XHTML the real way can you take advantage of it:

Embedding MathML directly in the markup, instead of using a separate file embedded with the OBJECT element. (Neither of those works in Internet Explorer, and neither do any of the other XHTML advantages.)
Extending XHTML with other namespaces. You might want to invent some code elements.
Using XML tools (directly) such as XSLT to transform it into XHTML 2.0.

Not using XHTML means that you are sending an appropriate and supported format to any browser: HTML. It also means you can't use any of the XHTML advantages, which almost nobody needs and which are not that well supported. Internet Explorer doesn't support them, for instance.

And if the time comes, in 20 years or so, when every browser supports XHTML, you can run Tidy and switch, or leave your documents as they are, since browsers will probably still support HTML, unless something bad (good) happens and the internet, along with browsers, needs to be remade or redesigned. But for now the MIME type decides, not the DOCTYPE.

Want to know how to use the correct MIME type? If you are running on Apache you could test it with the following:

AddType application/xhtml+xml;charset=utf-8 .xht

(Note that setting an explicit charset is not necessary, but is considered good practice. It is important for HTML documents, since those default to iso-8859-1.) This basically means that every file with the extension '.xht' will be sent to the browser as XHTML.
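If you don't have Apache at hand, a rough equivalent for local testing can be sketched with Python's standard library. This is only a sketch under assumptions: the class name `XhtmlHandler`, the `serve` helper, the loopback address and port 8000 are all made up for illustration, and it is not meant as a production setup.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class XhtmlHandler(SimpleHTTPRequestHandler):
    """Static file handler that sends '.xht' files as XHTML."""

    # Copy the default extension map and add the same mapping as the
    # AddType line: '.xht' -> application/xhtml+xml with an explicit charset.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".xht": "application/xhtml+xml;charset=utf-8",
    }


def serve(port: int = 8000) -> None:
    """Serve the current directory on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), XhtmlHandler).serve_forever()
```

Calling `serve()` in a directory containing `test.xht` and loading it in a browser should show whether that browser accepts the XML MIME type.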

Comments

Brilliantly said, Anne!
Posted by Basil Crow at 10:31PM
A completely lucid argument that I fully agree with, and, I'm sure, will be ignored by most who think they know better. It's a shame.
Too many designers wear that "XHTML Validation" icon like a badge of honour yet completely ignore the fact that they are sending out tag soup... but at the same time will complain when other people's sites don't validate... It's the pot calling the kettle black, IMHO.
Posted by MikeyC at 12:21AM
I feel at a crossroads with this ongoing discussion. I certainly understand the perspectives here from a technical standpoint, but the educator and practical person in me still doesn't buy it.
Certainly if you have control over your servers, a deeper understanding of MIME types, and are able to tap into the added value that XHTML gives us in terms of true extensibility you're going to be delivering documents as they are meant to be delivered. That's great.
But let's not forget that HTML itself is not a language that encourages rigor. These discussions keep overlooking that value. HTML is sloppy as hell, and we all know it. XHTML helps with that, and that help is invaluable both in the educational process and in practice.
First, it assists in bringing a strong understanding of better structure and semantics. An example is ensuring that non-empty elements are closed. Validation helps enforce that understanding. Maybe not important to those of us who have been working with these languages for a long time, but think about those folks who are trying to undo a lot of bad habits born of problematic tools, or just coming into the field.
XML's influence on HTML and the resulting XHTML has actually helped a lot of people learn markup in such a way that at least makes them ready to be forward compatible even if they aren't delivering the correct MIME type. Their documents are well-formed.
Authoring XHTML is a more disciplined process, and that's helpful when someone is teaching it. That discipline also helps reduce errors, speed up debugging time, and encourages better practices overall.
Forgetting these values is a huge mistake. Does it mean these things can't be done in HTML? No, of course not, but it is in my experience vastly easier to do in XHTML.
I don't want these important issues to get lost in the MIME type argument. If you have a more structured document, authors that understand the importance of logic within that document, all that's necessary at that point for the type of compliance being discussed here is to fix the MIME type. That's all. One step from being completely ready for anything the future might demand of those documents.
That, to me, is strategically a lot more effective for the betterment of the web as a whole.
So while one simply cannot argue the rationale being presented here, I can argue some of the nuances for the practical world in terms of educating others. Should they have the whole story? Absolutely, but these are more advanced topics. There has to be an opportunity for folks to transition into better practices, and I for one believe that using XHTML to do that can be an extremely effective method.
Posted by molly e. holzschlag at 1:21AM
But let's not forget that HTML itself is not a language that encourages rigor. These discussions keep overlooking that value. HTML is sloppy as hell, and we all know it. XHTML helps with that, and that help is invaluable both in the educational process and in practice.

Invalid ("tag soup") XHTML is every bit as sloppy as invalid HTML. Conversely, valid HTML is every bit as strict in its syntax as valid XHTML.
For an educator, XHTML has an advantage: its syntax is simpler than HTML's (with its sometimes-optional opening and closing tags, etc). So you can teach a monkey the syntax of XHTML, whereas (correct!) HTML is demonstrably harder.
As an author of software which consumes X(HT)ML, you have an advantage, in that you don't have to deal with the rat's nest of error-recovery that has become the norm for consumers of HTML. XML parsers are required to fail on ill-formed content.
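The contrast can be shown with a small Python sketch. It uses the standard library's `html.parser` as a stand-in for a tolerant HTML consumer (it is only a tokenizer, not a browser's error-recovery algorithm) and `xml.etree.ElementTree` as the XML consumer; the `TagCollector` class and the sample markup are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Ill-formed on purpose: the <p> element is never closed.
markup = "<div><p>Hello</div>"


class TagCollector(HTMLParser):
    """Record start tags; the HTML parser never rejects the input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(markup)
collector.close()
print("HTML parser saw:", collector.tags)

try:
    ET.fromstring(markup)
    xml_ok = True
except ET.ParseError as err:
    # The XML parser stops at the first well-formedness error.
    xml_ok = False
    print("XML parser refused the document:", err)
```

The lenient parser reports both start tags and carries on, while the XML parser raises an error on the mismatched end tag.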
[XHTML]...at least makes them ready to be forward compatible even if they aren't delivering the correct MIME type. Their documents are well-formed.

Their documents are almost never well-formed. (Far) less than 1% of XHTML sites are well-formed. I'd go so far as to wager that 100% of XHTML sites currently served as text/html would break immediately, if served as application/xhtml+xml, if not because they're ill-formed, then because of subtle DOM or CSS issues.
Posted by Jacques Distler at 1:54AM
Molly wrote:
XML's influence on HTML and the resulting XHTML has actually helped a lot of people learn markup in such a way that at least makes them ready to be forward compatible even if they aren't delivering the correct MIME type. Their documents are well-formed.
Authoring XHTML is a more disciplined process, and that's helpful when someone is teaching it. That discipline also helps reduce errors, speed up debugging time, and encourages better practices overall.

So basically, if XHTML validation rules apply to HTML, it's fine?
Posted by Mark Wubben at 2:06AM
I totally agree...
Posted by Sime at 2:10AM
Invalid ("tag soup") XHTML is every bit as sloppy as invalid HTML.

You're making a big mistake in assuming that an XHTML page that is not sent with a correct MIME type is an invalid XHTML page, "and thus tag soup". A valid (i.e. it passes the Validator) XHTML document, even when sent as text/html, is still a valid XHTML document, and is NOT "tag soup" (nor invalid). If you're not sending the right MIME type, then only your presentation of the (valid) XHTML document is invalid. The document itself, however, is still perfectly valid (and, again, not "tag soup").
Their documents are almost never well-formed. (Far) less than 1% of XHTML sites are well-formed. I'd go so far as to wager that 100% of XHTML sites currently served as text/html would break immediately, if served as application/xhtml+xml, if not because they're ill-formed, then because of subtle DOM or CSS issues.

Most people who use XHTML doctypes also try their hardest to keep their pages valid in the validator, so a claim that only 1% of the XHTML sites are actually valid is ridiculous and poorly concluded, at best. Yes, there will still be a lot of sites that use XHTML doctypes yet do not validate, but to say that they are 100% of the entire Internet (the XHTML-serving part thereof) is enough to make me wonder just what kind of sites you go to.
Molly mentions one of the most important aspects of using XHTML (and staying valid), whereas Anne, while making good points which are mostly well-founded, completely misses it: the importance and true value of (using) XHTML lies not in being perfectly valid according to the W3C specifications, but in coming to a better understanding of how the web works and how to create proper websites.
After all, sending up a validating XHTML document with the right MIME type still means absolutely jack shit when you use <div class="bigtype"> for every heading in your documents, or don't know what the semantic value of all the elements is. The importance of XHTML, today, lies in understanding the whole meaning and logic behind web authoring - not to please the W3C, and not (as of yet) to be ready to add MathML / Ruby with ease.
Posted by Faruk Ates at 5:25AM
Molly,
Just to clarify, if you want to use XHTML-as-text/html as a stepping stone to "real" XHTML, more power to you.
Anne's position, if I understand him, is that "real" XHTML (and the features it makes possible) is not going to be usable on the mainstream web anytime in the foreseeable future. So there's no point in having a stepping stone to a technology you're not going to be able to deploy anytime soon.
Me, I don't give a rat's ass about the mainstream web. I care about that small part of it involved in creating, serving and consuming mathematical and scientific content. In my little neck-of-the-woods, this technology is important today.
Posted by Jacques Distler at 5:40AM
Jacques,

As far as I can understand from Anne, that is indeed his position. But that position completely misses the importance of XHTML towards the more generic part of the web (the part that isn't your "little neck-of-the-woods", basically :)), as it is this generic part of the web that is in dire need of being educated about markup, being told to use proper markup, and being told why. And XHTML is a far better tool in that than HTML is. At least, that's my position. :)
Posted by Faruk Ates at 5:46AM
Anne, I've been following your anti-XHTML advocacy for a while now, and I have to express my strong disagreement.
You state that using XHTML on the real-world-web (with all its limitations) would have no advantages at all. As an example you list the missing ability of using XML tools on XHTML documents sent as text/html, specifically the missing ability to transform it using XSLT.
Right. That's true, of course, because XHTML sent as text/html is not well-formed XML (and how could it be: the MIME-type says it's HTML and, as we all know, HTML is not XML).
But think about it: where are you most likely to XSL-transform an XHTML-Document into something else? Right, on the server. And how are you going to retrieve said XHTML-Document there? Right, via the file-system of course. Whoopsie, no HTTP Content-Type header there! Nothing that says this document is text/html (assuming that you're not absent-mindedly putting a superfluous meta-element declaring its content-type as text/html into the document's head). Just pure, delicious, well-formed XML. Your XML-parser will be delighted to eat that. No yucky tag-soup there.
My point is: XHTML can be both HTML and XML. You can take advantage of an XML feature while still retaining backwards-compatibility with legacy user agents by following the HTML Compatibility Guidelines.
And that scenario isn't far-fetched at all: imagine a content-producer who wants to serve up different versions of his documents. If they're using XHTML they could use XSLT to easily generate WML and XHTML Basic versions of their content from a single source-document.
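As a rough illustration of that kind of server-side processing, here is a Python sketch. Note that it does not use XSLT (the Python standard library has no XSLT processor); it uses `xml.etree.ElementTree` instead, and the `<summary>` output format and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Stand-in for a file read from disk; no HTTP Content-Type is involved.
source = (
    f'<html xmlns="{XHTML}"><head><title>News</title></head>'
    "<body><h1>Headline</h1><p>First paragraph.</p></body></html>"
)

# Any XML parser can read the document, because it is well-formed.
root = ET.fromstring(source)
ns = {"x": XHTML}

# Build a much smaller derived document from the same source.
summary = ET.Element("summary")
ET.SubElement(summary, "title").text = root.findtext("x:head/x:title", namespaces=ns)
ET.SubElement(summary, "headline").text = root.findtext("x:body/x:h1", namespaces=ns)

result = ET.tostring(summary, encoding="unicode")
print(result)
```

The same source could still be sent to browsers as text/html; the XML tooling only ever sees the stored, well-formed document.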
So there are advantages of using XHTML over HTML. Even though it's only going to be HTML when it's sent over the wire.
Even for the author who doesn't currently plan to use XML tools on the server, it is beneficial to use XHTML and ensure well-formedness of the document. Because when he decides to use said tools or when he decides to send his documents as XHTML (with the right MIME-type) he will be ready. No (costly) conversion is necessary.
I do, however, agree with Ian Hickson, whose essay you referenced in another post, in that the decision to use XHTML should always be an informed decision. I do not agree that XHTML is harmful in any way. If it's sent as text/html it's just that, HTML and in no way different than plain old HTML for the recipient. So if there are XHTML-documents out there which are ill-formed XML: so what?
Let me instead ask a different question: What are the disadvantages of sending XHTML as HTML?
Posted by Gerrit at 6:25AM
I'm pretty sure Anne would be the last person to downplay the importance of valid markup.
But I don't see what the "X" in XHTML has to do with the matter. The only advantage, as far as I can see, that XHTML has in this regard is that its "dumbed-down" syntax makes it easier to learn to use correctly.
Posted by Jacques Distler at 6:35AM
You're making a big mistake in assuming that an XHTML page that is not sent with a correct MIME type is an invalid XHTML page, "and thus tag soup".

I made no such mistake. I said that invalid XHTML was just as easy to author as invalid HTML. I was alluding to the fact that the vast majority of web pages (HTML or XHTML) are invalid. Only 0.7% of HTML pages validate. I am pretty sure that a similar survey of XHTML pages will yield a similar result. If you consider that a "site" is made of many pages...
Most people who use XHTML doctypes also try their hardest to keep their pages valid in the validator, so a claim that only 1% of the XHTML sites are actually valid is ridiculous and poorly concluded, at best.

Sorry, but Evan Goer's done the experiment, and it's been replicated by many others. The vast majority (90+%) of XHTML websites run by self-avowed Web Standards advocates, markup geeks and the like fail to validate. Add in millions more sites driven by MovableType, WordPress and the like, run by ordinary folk who are not markup geeks, and I'm sure the percentage of valid XHTML sites is well below 1%.
Admittedly, I was making the stronger statement: that the vast majority of XHTML sites are, in fact, ill-formed. I suggest you try replicating Evan's experiment, and keeping track of the well-formedness errors. It sounds like you will be in for a surprise.
...but to say that they are 100% of the entire Internet (the XHTML-serving part thereof)...

I didn't say that. I said that even the tiny percentage of XHTML sites which are well-formed would most likely fail in more subtle (if less catastrophic) ways when switched to application/xhtml+xml. There are significant differences in the way CSS is handled and the way javascript works (not to mention the bugbear that you cannot hide scripts and CSS directives inside XML comments anymore). The only way to ensure that an XHTML site will function correctly when served as application/xhtml+xml is to actually serve it that way.
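The point about hiding scripts inside comments can be illustrated with a short Python sketch. It models an XML consumer with `xml.etree.ElementTree` rather than a real browser, so treat it as an approximation; the sample page is invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# A script "hidden" in a comment, as is common in pages written for text/html.
page = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title>"
    '<script type="text/javascript"><!--\n'
    'document.title = "changed";\n'
    "//--></script>"
    "</head><body/></html>"
)

root = ET.fromstring(page)
script = root.find(f"{XHTML_NS}head/{XHTML_NS}script")

# To an XML parser the comment is a real comment, so the script element
# ends up with no text content at all.
print(repr(script.text))
```

In this model the script element exists but is empty, which is why a page relying on the old comment trick can silently lose its scripts when parsed as XML.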
For that reason (among others) I am highly sceptical of any claim that XHTML served as text/html brings us any closer to nirvana. I probably disagree with Anne that it's worse than serving HTML. But it's certainly no better.
Posted by Jacques Distler at 10:11AM
It is exactly because of Evan Goer's experiment and similar surveys that I think XHTML is the wrong choice most of the time. (This is the result of sending XHTML as text/html, obviously.)
People use XHTML so that they are in some way forward compatible, but they are not and never will be as long as publishing tools aren't modified and fixed. For a small site, it is quite easy to do so; see the example in my post. But for weblogs it already becomes much tougher. Just read about all the things Jacques Distler had to do before he got things to work. (Note that most of the websites on the X-Philes list are technical weblogs and have a technical audience.)
Imagine how large news sites, with ad software, WYSIWYG editors et cetera, stay valid. Not. (Another example is Blogger, none of whose users care about being valid. That gives mostly invalid weblogs, even if we don't take into account the ads at the top of the page, which make the page invalid before you have made a post.)
And, HTML is just as good as XHTML is, if not better for most websites. Converting between the two is easy, so why not pick the right format that every browser supports?
Posted by Anne at 1:33PM
Gerrit wrote:
So there are advantages of using XHTML over HTML. Even though it's only going to be HTML when it's sent over the wire.

Exactly. What you fail to recognize, however, is that Anne is talking about the client-side use of XHTML. Storing content as XHTML (2.0) on the server side gives you all the power you want. Sending out HTML means giving the client something it can handle.
Posted by Mark Wubben at 6:06PM
(Re: main article)
What about those applications where no MIME type is transmitted? Such examples involve pipes between different programs through stdio or even some older file systems. Hypertext is often still useful in the context of non-internet applications.
And what's so special about MIME types and HTML flavors? Your arguments could be applied to any language like DocBook or PostScript. DocBook is especially informative, since it also used to be SGML.
Posted by Jimmy Cerra at 10:20AM
I am Sándor from Kocsord. Árpád is surely a good working man.
Posted by Sándor at 8:03PM
You're making a big mistake in assuming that an XHTML page that is not sent with a correct MIME type is an invalid XHTML page, "and thus tag soup". A valid (i.e. it passes the Validator) XHTML document, even when sent as text/html, is still a valid XHTML document, and is NOT "tag soup" (nor invalid). If you're not sending the right MIME type, then only your presentation of the (valid) XHTML document is invalid. The document itself, however, is still perfectly valid (and, again, not "tag soup").

The document isn't tagsoup but will always be treated as such. It's tagsoup to the client, therefore you've lost all advantages of it not being tagsoup. In short, no, it's not, but it might as well be.
But where there is a mistake is when you assume that invalid XHTML is as sloppy as HTML. Not true. Ill-formed XHTML is, but that's a whole different kettle of fish. Well-formed XHTML is significantly more rigorous than HTML, and it is (to the client) much, much easier (and therefore faster, cheaper) to parse.
That's the reason I'm using XHTML: to be nice to the client. If it's no trouble to serve as XHTML, then you should (ignoring the minuscule bandwidth cost). If that's treated as tagsoup by IE, fine by me, but Firefox, Opera, Safari and every other XML-compliant browser are getting lovely, well-formed, node-based XML.
Of course, there's also the namespacing and modularization, but I have yet to take advantage of that.
Posted by David House at 10:15PM
David, I don't get your point. As long as you are not sending XHTML as application/xhtml+xml it is just tag soup and less valid than HTML is, since HTML is supposed to have the text/html MIME type and is supposed to be parsed as such. XHTML isn't supposed to be parsed as HTML; that is why it has a different name.
You are also ignoring the fact that most websites that claim to be XHTML with a DOCTYPE are invalid and would give a fatal error in XML mode.
Posted by Anne at 6:26PM
[...] on whether it is an HTML or an XHTML document. Anne van Kesteren writes on his weblog that if you really want to use XHTML [...]
Posted by Koodaus.net Blog - DOCTYPE:n merkitys at 5:49PM
David, I don't get your point. As long as you are not sending XHTML as application/xhtml+xml it is just tag soup

I disagree. It would be tag soup if you interpreted it as SGML-based HTML. But the text/html media type is not limited to SGML-based HTML. RFC 2854 states that some XHTML is also permitted to be labelled as text/html.
Your claim that XHTML as text/html is nothing but tag-soup is dependent upon the assumption that a text/html document should be SGML-based. The text/html specification disagrees with you.
Now, if you modified your statement to say that Internet Explorer, Mozilla, or any specific UA you care to name would treat XHTML documents as tag soup, I would not disagree (providing, of course, that the UA in question did behave in that way). But you can't say that in general it is tag soup, because it isn't. The specification says so.
Posted by Jim Dabell at 7:42PM
I stand corrected. (I will mention the modified version in the future.)
Posted by Anne at 10:30PM