Anne van Kesteren

Quick guide to XHTML

19 August 2004

There is so much to tell about XHTML and HTML and how to do it the correct way that I'm not sure where to start. I'm an advocate of using XHTML only in the correct way, which basically means you have to use HTML. Period. This also means Tim Bray (co-author of XML) should fix his ongoing. (Thanks Dave.)

A while ago, Evan Goer tested 119 "XHTML" sites for validation. 89% did not validate and 99% did not use the correct MIME type. These results look quite similar to a larger survey from Mark Pilgrim published in XML on the Web Has Failed. (A correct MIME type for XHTML would be application/xml or application/xhtml+xml; text/xml is in the process of being deprecated, and that is a good thing.)

You might think that 1% (or 11%, if you are forgiving) is quite a lot, but maybe you didn't read the part where Evan told you he only tested Alpha Geeks and friends. That includes people like Jeffrey Zeldman, Douglas Bowman, Dave Shea and probably people like me and Jacques Distler (being bulletproof is near impossible). Indeed, of the people who tell you how to use this stuff, only 1% is a winner.

Designers are now warning you when new "XHTML" sites don't validate. The reason why people are using XHTML is probably based upon an illusion. They thought it was new and would therefore remove things like FONT elements and table based layouts. Well, wrong. XHTML allows that too. Most people are not taking advantage of using XHTML; it does exactly the same for them as HTML only they think it is different. It won't save you extra bandwidth. The advantages are based on semantic and structured markup, not on XHTML. People might think XHTML is a nice short abbreviation for that, but that is wrong, sorry. When you are using XHTML, you can't rely on META elements to declare the character encoding. You have to know things about character encoding, MIME types and more. XHTML is not forgiving, it is tough.

As you can read, XHTML has been used for the wrong reason. There is no known advantage in sending XHTML as text/html, only disadvantages. However, using XHTML as HTML is what people are doing. The reason you really want to use XHTML is when you are going to use other markup languages like MathML and mix them together. I know only one person who does such a thing and various people who have the intention. But their target audience is very small:

Anyone who hasn’t been asleep for the past 6 years knows that quantum gravity in asymptotically anti-de Sitter space has unitary time evolution. Blackholes may form and evaporate in interior, but the overall evolution is unitary and is holographically dual to the evolution in a gauge theory on the boundary.

... which means they can probably make people either download the browser or download some terrific plugin for Internet Explorer, the most insecure browser on the web. (This also means they have enough knowledge to cope with RFC 3023 and several related issues to which we'll get in a minute.)

If you are a Jacques Distler wannabe or just someone who likes to irritate the internet (just kidding) then here are some guidelines for using XHTML when you want to be compatible with basic XML parsers (it has to be a challenge, right?). Your document must be well-formed:

Use the correct MIME type.
Use UTF-8 - You really should be using UTF-8 and nothing else. XML parsers are required to support it so there you go. They are also required to support UTF-16, but that will probably bloat your file, whether you are from a western country or not.
Don't use entities. Don't use &mdash;, &nbsp;, &copy; et cetera (this includes MathML entities). Do you actually think an XML parser is going to download the external DTD and parse it? You know we are talking about XHTML here, right? A browser is not going to download a 2 MB file and parse it, every time it visits a page. You cannot rely on the DTD. (You may even omit it, it is almost useless.)
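
As a rough illustration of the "don't use entities" guideline, here is a minimal Python sketch that rewrites HTML named entities as numeric character references, which an XML parser understands without any DTD. The function name named_to_numeric is hypothetical, and the lookup table is Python's standard html.entities.name2codepoint (the HTML 4 set), so MathML entity names are not covered.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by the XML specification; these are always safe.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches references of the form &name; (numeric references start with '#').
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Replace DTD-defined named entities with numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Predefined or unknown name: leave it untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY.sub(replace, markup)


print(named_to_numeric("&copy; 2004 &mdash; A &amp; B"))
```

Unknown names are deliberately left alone in this sketch, so a later well-formedness check would still reject them rather than have them silently disappear.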

For some people this might look like an easy list of requirements, but when people from external sites are going to add content it gets tricky. Actually, it just is tricky. Evan can tell you, Jacques can tell you and Mark loves to tell you. I know I mentioned these issues before, but the message didn't completely arrive yet, I guess.

In the end, I guess most people are better off using HTML. You really need to have access to server configuration if you want to do something useful with XHTML and even then it isn't really useful since you have to support Internet Explorer (I'm talking about client-side here by the way, using XHTML on the server is fine). You know you are supposed to use HTML. Now you just need to accept it.

Think about it.

Comments

What exactly are you trying to accomplish?
Posted by Matt at 3:48PM
Are you saying that if your document isn't well-formed, you shouldn't use the XHTML notation at all?
As long as IE doesn't support the correct MIME types, I'm planning to use XHTML without your quick guide.
The only thing I have to do, in the 'future', is change some settings to correctly use XHTML.
Posted by BtM909 at 3:55PM
Anne, you said you shouldn't use entities (I have some problems quoting you, keep getting a well-formedness error although it *is* well-formed....) So the following question arises: how do I need to make an ampersand? If I just use & (not the entity, but just the character) it's wrong...
I also find it kind of strange, since you said ampersands matter...
Posted by Blizt at 4:40PM
Blitz, I meant entities defined in a DTD. &amp;, &lt;, &gt;, &quot; and &apos; are defined by the XML specification and must be supported by parsers. (You know you have to use block level elements inside a BLOCKQUOTE element?)
Posted by Anne at 5:08PM
I agree with you, Anne. XHTML isn't of much use if you send it as HTML. But if you send it as XHTML, it is useful even if you're not extending it through XML namespaces with MathML, SVG and such. Having the markup as XML makes it easier to find problems in almost every part of a document; CSS, JavaScript, the textual content and the markup itself.
If the CSS is written with minor errors, like UPPERCASE tag names, it won't work when you're using XHTML (sent as XML). If the JavaScript does über-hacks with your markup, it won't work when you're using XHTML. Hixie has more on this. If the content contains unescaped ampersands or something similar, it won't work when you're using XHTML. If the markup is invalid (not wellformed), it won't work when you're using XHTML.
And all of the above will be discovered right in your browser while authoring the document, instead of when (or if) you validate it. That is an advantage. The forgiveness of HTML parsers and browsers makes HTML less useful when trying to find bugs. With XHTML and XML, bugs are thrown violently in your face. There's no hiding from them.
But, of course, if you suspect that your documents will ever contain invalid markup, you shouldn't use XHTML (at least not send it as application/xhtml+xml). If you always have wellformed documents, you should serve them as XHTML to browsers that understand it, since that will make the documents parse (and thus render) faster.
Posted by Asbjørn Ulsberg at 5:15PM
Asbjørn, while I agree some of those are advantages, authoring tools should ensure valid markup.
Posted by Anne at 5:40PM
i see your point.
with one thing i'm not clear: my impression is that xhtml is quicker than html. could it be that this is true because of the more complicated parsing of html documents ( add and close missing tags ) ?
Posted by aleto at 5:42PM
As Anne knows, I agree. You either use XHTML as it is supposed to, with all the “limitations” of XML, or you use HTML. I now believe it is unprofessional to do otherwise. And yes, I understand that designers should not have to deal with this, and that we should have tools that Just Work™. But the harsh reality is that things are not that easy, so we need to be aware of these issues and deal with them the right way.
I also wrote up something after my conversation with Anne last night, in which I put the question: what are the benefits of using XHTML? Why would we use it, if doing so correctly is (still) so hard?
Posted by Ben at 5:58PM
I agree with Asbjørn that there may be some advantages to using XHTML even without mixing in other namespaces. As long as it's served with the proper MIME type, of course.
I don't currently have any MathML, SVG or similar mixed in. But one day I may have.
At the office, we don't have the "luxury" of content negotiation. Therefore those pages are HTML 4.01 Strict.
Anne, although I agree that authoring tools should ensure valid markup, not all of us are using WYSIWYG tools. Having Mozilla display the Yellow Screen of Death is a good eye-opener. :)
Posted by Tommy Olsson at 6:09PM
I am using XHTML and I think I'll continue as long as I can send the right mime type via content negotiation to browser supporting application/xhtml+xml.
But this is not because I think I'll ever need to use MathML, but rather because I want to make sure I don't make tiny mistakes, that maybe aren't mistakes in HTML (not closing tags etc.), because the parser will tell me something IS wrong.
I'm no web professional, and probably will never be one, but I do it as a hobby. That's why I try to be semantically correct etc. AND use XHTML to get a good coding style.
Posted by Christoph Wagner at 7:13PM
Just to inject a teensy note of realism: XHTML as text/html is indispensable as a stepping-stone to the real thing.
You gotta walk before you can run.
And tools which generate XHTML (like the blogging tools, WordPress and MovableType) are also invaluable, even though I don't think there's a tool out there, in which you can just flip a switch and start sending application/xhtml+xml without something breaking horribly.
Also, even I'm still using named entities. It would be a lot of work getting rid of them (in all of their possible sources). And the benefit — being able to send application/xhtml+xml to user-agents which support XHTML but not XHTML+MathML — just doesn't outweigh the cost.
Posted by Jacques Distler at 8:21PM
First of all, I must honestly say that I am totally disgusted with the fact that you didn't mention me too when talking about irritating the Internet. How hard must one try? ;-)
Second of all, I think that I actually do comply to all of your guidelines with one of my sites. Correct me if I'm wrong.
You know you are supposed to use HTML. Now you just need to accept it.

Actually I agree, it's just that the thought is too horrible to even think of! :-)
Posted by Charl van Niekerk at 8:36PM
Users of one of the most popular blogging tools -- Movable Type -- are caught in a trap here. Either they can emit invalid HTML, or they can annoy AvK and emit XHTML (along with all the other problems it introduces). Here's the thing: MT uses RDF and DC namespaces to support trackback; that namespaced metadata is getting mixed right in with the (x)HTML. You can decide to turn off trackback because it interferes with purity of markup, I suppose, but that's a draconian solution.
Posted by Adam Rice at 9:23PM
"There is no known advantage in sending XHTML as text/html, only disadvantages."

Backwards-compatibility isn't a benefit? A site accessible to more than a small percentage running bleeding-edge browsers isn't a benefit? You seem to have abstracted this discussion far enough to lose sight of the reason why text/html is supported to begin with.
To seriously advocate your position, you need to come right out and say it: you disagree with the W3C. You disagree that supporting text/html is worthwhile, which means you disagree with their decision to allow backwards compatibility.
Posted by Dave S. at 9:29PM
Adam wrote:
Users of one of the most popular blogging tools -- Movable Type -- are caught in a trap here.

I don't think Anne is seriously advocating that users of tools which emit XHTML convert them to emit HTML. (That would, as I like to say, be a lot of work for not a lot of benefit.) Rather, I think he's saying that, when you have a choice of markup languages, choose HTML4, unless you actually have a need for the "X" in XHTML. (And, in the latter case, that you probably need to use the correct MIME type.)
Dave wrote:
Backwards-compatibility isn't a benefit?

I don't think Anne is deprecating content-negotiation: sending application/xhtml+xml to browsers which support it and text/html to the rest. I think he's talking about XHTML sent exclusively as text/html.
Posted by Jacques Distler at 9:47PM
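
The content negotiation described in this thread (application/xhtml+xml to browsers that say they support it, text/html to the rest) could look roughly like the following Python sketch. choose_mime_type is a hypothetical helper, not part of any weblog tool named here; it only inspects the Accept request header, and it deliberately ignores wildcards such as */* so that browsers which never mention XHTML keep getting text/html.

```python
def choose_mime_type(accept_header: str) -> str:
    """Pick a Content-Type for an XHTML document from an HTTP Accept header."""
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        quality = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q value as "not accepted"
        if quality > 0:
            return "application/xhtml+xml"
    # No explicit, non-zero support for XHTML: fall back to tag soup.
    return "text/html"


print(choose_mime_type("text/html,application/xhtml+xml,application/xml;q=0.9"))
print(choose_mime_type("text/html, */*"))
```

A server that varies its response this way would normally also send a Vary: Accept header so that caches keep the two variants apart.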
And tools which generate XHTML (like the blogging tools, WordPress and MovableType) are also invaluable, even though I don't think there's a tool out there, in which you can just flip a switch and start sending application/xhtml+xml without something breaking horribly.

When I switched via content negotiation to application/xhtml+xml I didn't encounter any problems with my template.. (using Wordpress)
Posted by Christoph Wagner at 10:52PM
(being bulletproof is near impossible)

No, it is not if bullet-proof is taken to mean well-formed output.
The problem is that many content management systems (including WP and MT) use templating methodology that is not suited for producing XML. Mixing stuff together by doing string substitutions (on the byte level even!) is a recipe for failure if the goal is to produce XML. (String substitution isn't suitable for producing proper HTML, either, but tag soup browsers are forgiving, which is why the uptake of tag-souping tools such as PHP has been possible in the first place.)
In order to guarantee that elements have proper closing tags, you need a stack or a tree. In order to get character and escaping issues right, you need to use unescaped Unicode internally and escape at the latest possible stage in a well-isolated serializer. Unless you are comfortable with programming SAX state machines, the tractable way is to use a server-side object representation of the XML document tree (XOM, DOM or similar).
BTW, Asbjørn, upper-case CSS selectors are not an error with text/xhtml. If you want well-formedness checking, you could write XHTML and then parse it and re-serialize as HTML for serving. :-)
Anne, what's the deal with requiring XHTML but not allowing CDATA sections in comments? Surely a proper XML processor (in your CMS) would abstract away the particular method of escaping.
Posted by Henri Sivonen at 11:45PM
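
To make the tree-versus-string-substitution point concrete, here is a minimal Python sketch that builds a comment as a document tree and leaves all escaping to the serializer. It uses the standard xml.etree.ElementTree module in place of XOM or DOM; render_comment and the div/p/cite structure are assumptions made for illustration, not how any CMS in this thread works.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize XHTML elements with a default namespace instead of an ns0: prefix.
ET.register_namespace("", XHTML_NS)


def render_comment(author: str, text: str) -> str:
    """Build a comment as a tree; the serializer escapes text at the last step."""
    div = ET.Element("{%s}div" % XHTML_NS, {"class": "comment"})
    paragraph = ET.SubElement(div, "{%s}p" % XHTML_NS)
    paragraph.text = text  # stored unescaped, as plain Unicode
    cite = ET.SubElement(div, "{%s}cite" % XHTML_NS)
    cite.text = author
    # Closing tags and escaping come from the tree, not from string pasting.
    return ET.tostring(div, encoding="unicode")


print(render_comment("Anne", "AT&T says 1 < 2"))
```

Because the text is kept unescaped inside the tree, there is exactly one place where escaping happens, which is the property the comment above argues for.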
The problem is that many content management systems (including WP and MT) use templating methodology that is not suited for producing XML.

Hey, Henri, write a CMS with the functionality I need, but which does things right internally and the world (or, at least, I) will beat a path to your door.
Posted by Jacques Distler at 12:06AM
Just watch out, Henri, or you will be overrun by hardcore markup geeks! Or even worse, string theorists.
Well, okay, add me to the list too -- I would love to see a cheap or free CMS that really "grokked" XML at its core. Then I could finally close up shop on the X-Philes, once and for all.
Posted by Evan at 4:27AM
(being bulletproof is near impossible)

No, it is not if bullet-proof is taken to mean well-formed output.

Ensuring well-formed output gets pretty tough when you have people other than yourself adding content willy-nilly to your documents. Pingbacks, TrackBacks and comments can be a real headache if you're not exceptionally careful.

BTW, Asbjørn, upper-case CSS selectors are not an error with text/xhtml.

Last I checked, text/xhtml wasn't an official content-type, so its behaviour cannot be counted upon.

Anne, what's the deal with requiring XHTML but not allowing CDATA sections in comments? Surely a proper XML processor (in your CMS) would abstract away the particular method of escaping.

That has something to do with PHP4's XML parser really sucking, I believe. Simon Willison could tell you why conclusively---he wrote the checker.
Posted by J. King at 1:20PM
Anne - just going back to your comment about entities; what then would be the "correct" way to enter a copyright (©) sign or similar within the markup?
Posted by P. Oldham at 3:02PM
In reality &copy; is the correct way to use the symbol character entity for XHTML, as it is with MathML.
Though to be on the safe side for flexibility with other applications of XML then the Numeric Entity &#169; is typically applied as stated above - DTD dependence.
Though it amuses me for the WELL-FORMED article it mentions changing the ' (U+0027) with &apos;
Posted by Robert Wellock at 4:14PM
Robert, you mean character references. (I did that wrong too.) However, like I mentioned you don't have to use character references or entities. You can just use the actual character.
&apos; and four other entities are defined in the XML specification as mentioned above and therefore it doesn't matter.
Posted by Anne at 4:18PM
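
The distinction between predefined entities, character references and DTD-defined entities can be checked with any non-validating XML parser. The following is a small Python sketch using the standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper name, and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if a parser that reads no DTD accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


samples = [
    "<p>&amp; &lt; &gt; &quot; &apos;</p>",  # the five predefined entities
    "<p>&#169; 2004</p>",                    # numeric character reference
    "<p>\u00a9 2004</p>",                    # the actual character
    "<p>&copy; 2004</p>",                    # named entity that needs a DTD
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

Only the last sample should be rejected, since nothing in the document declares an entity named copy.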
The reason you really want to use XHTML is when you are going to use other markup languages like MathML and mix them together.

I haven't actually read it put this clearly and precisely before. [...]
Posted by DenkZEIT :: Wann ist es sinnvoll XHTML einzusetzen? at 6:06PM
Ensuring well-formed output gets pretty tough when you have people other than yourself adding content willy-nilly to your documents. Pingbacks, TrackBacks and comments can be a real headache if you're not exceptionally careful.

With document tree-based templating you merge content from different sources on the document tree-level. In order for third-party data to be merged with yours, the third-party data needs to parse into element nodes and text nodes. If the data is so corrupt that you can't get that far, you drop the junk without passing it on.
Last I checked, text/xhtml wasn't an official content-type, so its behaviour cannot be counted upon.

Oops. That's an embarrassing typo. I meant text/html.
Posted by Henri Sivonen at 10:41PM
In reality &copy; is the correct way to use the symbol character entity for XHTML, as it is with MathML.

In the reality of Opera 7.1, Safari 1.0 and Netscape 6.something (IIRC), nope, as far as application/xhtml+xml goes.
OTOH, &apos; does not work with the tag soup parser of IE, regardless of what the infamous Appendix C allows…
Posted by Henri Sivonen at 10:48PM
With document tree-based templating you merge content from different sources on the document tree-level. In order for third-party data to be merged with yours, the third-party data needs to parse into element nodes and text nodes. If the data is so corrupt that you can't get that far, you drop the junk without passing it on.

In all current weblogging systems (such as this one), when someone posts a comment, it is passed as a string (a query parameter to an HTTP POST). Anne and I and a few other lonely individuals pass that string through an XML parser before letting it into the database. (For us, this is just validation; you'd actually pass the parsed tree.)
You would make this a standard feature? That, already, is a vast improvement.
The only problem I, personally (no one else here will care), have with this is mathematical input. MathML is not easily human-readable, or human-editable. I much prefer to keep mathematical equations stored in a LaTeX-like syntax, which is expanded programmatically to MathML. One could store these little blobs of LaTeX as CDATA elements, but one would still (at least, on the way into the database) have to expand them to verify that the (error-free) LaTeX expands to (well-formed) MathML.
I know that ruins the purity of the system, but I don't see the alternatives as palatable.
Posted by Jacques Distler at 12:39AM
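
Passing a submitted comment through an XML parser before it reaches the database, as described above, might be sketched in Python as follows. parse_comment and the wrapping div are assumptions for illustration; the weblog systems mentioned in this thread may do it quite differently.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def parse_comment(fragment: str):
    """Parse a comment body as an XHTML fragment.

    Returns the parsed tree, or None if the fragment is not well-formed,
    in which case the caller would reject it instead of storing it.
    """
    # Wrap the fragment so that plain text and sibling elements are allowed.
    wrapped = '<div xmlns="%s">%s</div>' % (XHTML_NS, fragment)
    try:
        return ET.fromstring(wrapped)
    except ET.ParseError:
        return None


for body in ("<p>Hello <em>world</em></p>", "<p>Unclosed <em>tag</p>", "AT&T"):
    tree = parse_comment(body)
    print("accepted" if tree is not None else "rejected", body)
```

Returning the tree rather than a yes/no answer leaves room for the tree-level merging suggested earlier, while callers that only validate can simply test for None.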
Jacques--Thanks, I think. But I don't feel like that gets me any closer to an answer of what the right thing to do here is.
It's not a big deal to rig MT to emit HTML4, but it isn't possible to emit HTML4, use trackback, and be valid. For almost everyone (you excepted) valid markup is nice in theory but buys you no real-world benefits. So we could just emit invalid HTML and shrug.
It's not easy to emit XHTML w/ the correct browser-sniffed MIME type, and again, doing so has almost no real-world benefits. So we could put up with the incorrect MIME type problem and shrug.
Neither outcome does anything for our posture. What's the right thing to do?
Posted by Adam Rice at 3:39AM
Mark Pilgrim (before he switched to WordPress) was emitting mostly-valid HTML4 with MovableType, so it's certainly doable. I'm not sure that there's any advantage in doing so.
I tend to disagree somewhat with Anne on the subject of XHTML-as-text/html. We agree that, except as a stepping-stone to "real" XHTML, it is of no advantage whatsoever (and that anyone who tells you that it is advantageous, is selling you snake oil).
Anne sees XHTML-as-text/html as distinctly inferior to HTML4. I see it as more or less a wash. They are both handled as tag-soup. And all of the user agents, you'll ever care about, handle them just fine. Given a choice, you might as well use HTML 4, but if your CMS already spits out XHTML, there's no point in going through the trouble of converting it to spit out HTML4 instead.
Posted by Jacques Distler at 5:14AM
Anne sees XHTML-as-text/html as distinctly inferior to HTML4. I see it as more or less a wash. They are both handled as tag-soup. And all of the user agents, you'll ever care about, handle them just fine. Given a choice, you might as well use HTML 4, but if your CMS already spits out XHTML, there's no point in going through the trouble of converting it to spit out HTML4 instead.

Sane words, with respect to output. So, why do I prefer XHTML? Input.
I run a homegrown CMS. It simply doesn't allow malformed markup into the database, which means it won't allow a lot of valid HTML. My templates are run through an XSLT engine, which also means they're run through an XML parser, which means my templates must be well-formed XML, too, which means they can't contain a lot of valid HTML. But this also means that I can do template manipulations that I couldn't with HTML -- including transforming my XHTML to HTML for output! (Simon Willison has talked about this before.) And it means that I can be reasonably sure the next comment to my site will be well-formed.
Keep in mind that all of this happens within the app. I happen to serve my documents as application/xhtml+xml (to browsers that can handle it); but even if I didn't, the bulk of the benefit of using XHTML is realized before the first person sees my latest post.
My tool is buggy and sloppy, but it's the only thing I've got that gets XHTML.
Posted by Wayne at 7:01AM
Wayne, John Cowan's TagSoup makes it possible to use XHTML internally but allow HTML input. (Useful if you want to allow input from HTML editors).
Posted by Henri Sivonen at 3:15PM
Something just occurred to me. I've been using numbered entities, for example: &#8212;, throughout many of my files. Are these also a no-no? They don't require an external DTD in order for the processor to know how to render them, do they? I've never thought about this before.
Posted by Devon at 11:28AM
XHTML isn't difficult to be bulletproof in, unless you're accustomed to HTML rules.
It's simple to be bulletproof with HTML. Why? Because all software supports it (good enough) since it's been around so long. No configurations or script editing is needed.
The actual problem is that most software and scripts aren't made to support XHTML out of the box yet - because they don't have to. There's no demand (beyond us geeks) since HTML works good enough for most people in their current browser. Until people see the limitations of their browser or the product they make, they don't think about upgrading or improving the product.
We live in a world of supply and demand. I think if there's ever a growing supply of pure application/xhtml+xml sites, there will suddenly be a demand for improved and upgraded software. That's how HTML support became so abundant.
Posted by Devon at 11:59AM
Yeah, software is a problem.
No, character references (not entities), won't give any problems as mentioned several times.
Posted by Anne at 3:39PM
Next time I'll read the comments fully before asking. Knee jerk reaction. Sorry 'bout that. :-P
Posted by Devon at 3:18AM
Let's not forget one thing - whilst the W3C has done XHTML, the use of HTML 4.01 is still fine by them - as is HTML 3.2.
And it's about using the right doctype for the right reasons. And none of the arguments for XHTML have ever convinced me.
For future compatibility? Well that's about it. But if you code in HTML 4.01 strict now with sensible, coherent, valid markup, future compatibility will be a doddle anyway.
Personally I'll use XHTML when I need to use XHTML. Which is not now.
Thank goodness for people like Anne for going against this "you must have XHTML or else" attitude that's going round.
Posted by Andrew B at 6:27PM
XHTML forces you to use end tags. In theory, this is an advantage when using DOM, because objects are always explicitly defined.
A mature CMS should only accept content after it's been validated against a schema or DTD. And a good CMS is XML based. In that case, XHTML does a much better job than HTML. Choosing HTML means that you will produce 'tomorrow's legacy'.
That brings me to durability. In the Netherlands there's legislation (for archive purposes) that allows you to use XML, see Article 6-b of the 'Regeling geordende en toegankelijke staat archiefbescheiden'. XHTML is compliant with this regulation, HTML isn't. That means that XHTML is a safer choice for government websites, since legislation is in the making that makes it compulsory for governments to archive (the content of) their websites.
I suggest reversal of the question: what are the real disadvantages of using XHTML (even with the text/html MIME-type), when compared to HTML?
Posted by Raph at 6:31PM
That means that XHTML is a safer choice for government websites, since legislation is in the making that makes it compulsory for governments to archive (the content of) their websites.

So does that mean Dutch government web sites are going to be consistently well-formed? (Maybe the first government in the world to achieve anything approaching that.) Or will you be able to satisfy the requirements by sticking the contents of the web page in a CDATA section of an XML document (in which case, it doesn't matter what markup was used for the page itself)?
Posted by Jacques Distler at 8:04PM
Jacques,
So does that mean Dutch government web sites are going to be consistently well-formed? (Maybe the first government in the world to achieve anything approaching that.) Or will you be able to satisfy the requirements by sticking the contents of the web page in a CDATA section of an XML document (in which case, it doesn't matter what markup was used for the page itself)?

To answer your first question: in all likelihood, no. The second question isn't even relevant anymore.
What has happened is that the Dutch Government's Advisory website has sent out a note about how Government websites should be made according to Web Standards, for sake of accessibility, easier maintenance, etc. etc. What has not happened is that they made a law forbidding Gov't websites to be poorly made. What the result of this is: in a few years, probably, most Gov't websites will be either valid HTML 4 or XHTML 1.0 Trans/Strict (most of which with a text/html MIME type). And even further down the road, they may mostly be semantically rich and some even well-formed XHTML (sent as application/xhtml+xml).
If we'd be the first Government (I speak of "we" because in a way I can consider myself officially part of it) with more valid, mostly-well-formed websites than tag-soup sites that only work acceptably well in IE, then I'll be various degrees of happy and not care whatsoever that there's a bunch of XHTML 1.0 Strict sites in there that send a text/html MIME type. As you said so yourself:
Just to inject a teensy note of realism: XHTML as text/html is indispensable as a stepping-stone to the real thing.

On the scale of Government websites, it can easily be a stepping-stone process that stretches across 2-3 years. Complaining about a lack of application/xhtml+xml header won't do much good, if any, in such a process. This is something that requires a large portion of the entire industry (here in the Netherlands, anyway) to completely 'redesign' their practice.
Baby steps. Just keep thinking: baby steps.
Posted by Faruk Ates at 5:07PM