Anne van Kesteren

XHTML is invalid HTML

13 June 2004

First, when you are X-Philed, this post does not apply to you. Second, if you are not X-Philed, but you think XHTML has other benefits, this post probably applies to you. Third, if you are happily using HTML 4.01, keep it that way.

As you might know, the "real" Internet Explorer will remain updateless. More specifically: if the W3C doesn't create comprehensive test cases for XHTML, Microsoft isn't sure they will implement the standards in a correct way (just look at their CSS 2 support compared to their CSS 1 support). And they might be right; not every browser has a Hixie, who will complain when they fail. This will also mean that people won't stop sending their (XHTML) content as text/html to Internet Explorer. (Some Internet Explorer users might have MathPlayer installed, which seems to have some support for application/xhtml+xml.)

The problem is that when you send your XHTML syntax based documents as text/html they will be treated as HTML by the browser. Let me rephrase that: all your documents are handled as tag soup by the browser because they contain invalid HTML. Yes, I said HTML. XHTML is an XML-based language and should be used and treated as such. You can claim that there are backwards compatibility guidelines in the XHTML specification, but then you must also acknowledge that those guidelines are not compatible with SGML and, for that matter, not with HTML. (See the issue where <br /> is treated as <br>> by a conforming parser.)

So if we want to create valid documents people should either switch to application/xhtml+xml or switch to HTML 4.01.

Comments

So if we want to create valid documents people should either switch to application/xhtml+xml or switch to HTML 4.01.

What does the Content-Type have to do with the validity of the document?
Posted by David Håsäther at 2:55AM
While you are correct in your assertion, I would consider people using XHTML to generally be a "good" thing.
The W3C's media types document has this to say:
The use of 'text/html' for XHTML SHOULD be limited for the purpose of rendering on existing HTML user agents, and SHOULD be limited to [XHTML1] documents which follow the HTML Compatibility Guidelines.

I read this as saying: "if you have to support a browser (like IE), then you can serve it as normal HTML. Try not to, but if you have to we won't spank you." The compatibility guidelines are there to make sure your code doesn't cause old browsers to explode, not to make valid HTML. Is it a crutch? Yeah. Does it bother me that I am writing invalid HTML? Not really.
I feel that having people move slowly towards application/xhtml+xml (even if the first few years are a bit rocky) is better than not trying to move forward at all. There will be no HTML5; 4.01 is the end of the road.
With that said, we really have to get Microsoft on board here. Looking at the W3C's (X)HTML Test Suite Page is not very reassuring. If we can't get them to comply, these standards are in big trouble. Any suggestions?
Posted by Josh King at 3:51AM
XHTML 1.0’s Appendix C is utter complete nonsense all right, but the conclusion that X(HT)ML syntax is not compatible with SGML is plain wrong.
XML is SGML. :)
The problem is exclusively the lexical level that is handled by the SGML declaration. The SGML declarations of XML and HTML are quite incompatible indeed.
In terms of validation, if not present in the document entity itself, you simply (ahem) point to the right SGML declaration manually—e.g. on the command line—or more likely generally in the catalog. Content types do not matter much here.
User agents are another story; the SGML declaration has simply been considered fixed for text/html since RFC 1866 (not that this kept them from changing it later on). That is why Emacs/W3 correctly spawns >’s all over the place when it is fed with xhtml syntax. :)
XHTML syntax is not handled as tag salad because it is incompatible with the lexical trivia of HTML, by the way, but because it was advertised as such by the server. But then, that is still better than tilting the bogometer with homegrown ‘content negotiation’ based on strstr() and the dislikes. Grumble.
Posted by Eric at 4:05AM
> if you are happily using HTML 4.01, keep it that way.
Yay!
/me moves on and lets other people fight this fight
Posted by Mark at 4:16AM
Happily listed on the X-philes here. Furthermore, I'm not touching this discussion with a 10 foot pole. ;)
Posted by Arthur! at 5:21AM
Because we all know that being perfect is the only way to go. Utopia or bust.
No. I endorse the use of standards, and they are a valuable tool. But I'm more concerned about things actually working.
Nothing works by standards. There is no example, real world or otherwise, of systems that work perfectly because they conform to standards, and standards only. Sometimes, you have to work around that, find a solution. Even if you didn't create the problem.
Flexibility is a needed asset if you as a business want to succeed. If my pages work now, and are future-proof, that works great for me. No need to put in extra effort just to make my life more miserable and decrease the value of my product (it just Does Not Work), just because some rules said so.
Yeah, I know you don't want to evangelise perfect compliance, but I needed to vent.
PS. Why doesn't your nice validator eat &trade;?
Posted by [m] at 5:49AM
Well, I'd like to serve my documents as application/xhtml+xml, but the first issue is that my Firefox 0.9RC then wants to download the webpage...
The second issue is that I don't get how to serve PHP docs as application/xhtml+xml at all, because if I do, they are no longer parsed as PHP.. ;)
I'm sorry, but I don't understand this Apache internal stuff anyway, so this is kinda too big for me.
Help appreciated.
Posted by Christoph Wagner at 7:57AM
This is a pet peeve of mine. I've tried to ask people who insist on writing XHTML markup and then serve it as tag soup why they do it, but they can't give a good answer. Mostly because they're unaware of the issues.
Some seem to believe that XHTML is somehow better than HTML 4.01 Strict; that it's more semantic or more strict.
I do what Eric called homegrown content negotiation, serving XHTML 1.1 (as application/xhtml+xml) to compliant user agents and HTML 4.01 Strict (as text/html) to the others. No problem using PHP.
Josh King's interpretation of the term SHOULD is quite common, but in fact it's covered by RFC2119, which explicitly defines what it means. It says:
[T]here may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

Whether "it doesn't work in Internet Explorer if I don't" is a valid reason is a matter of interpretation. I think that IE should be served HTML, but that's me.
Posted by Tommy Olsson at 12:52PM
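The "homegrown content negotiation" mentioned in the comments above can be sketched roughly as follows. This is only an illustration in Python, not the code any commenter actually runs: the function names `parse_accept` and `choose_content_type` are made up for this example, and the tie-breaking rule (fall back to text/html unless XHTML is strictly preferred) is an assumption. The point is that the Accept header is parsed with its q-values instead of being searched for a substring.

```python
def parse_accept(header):
    """Parse an HTTP Accept header value into {media_type: q}."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the parameter is absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable q as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_content_type(accept_header):
    """Pick application/xhtml+xml only when the client strictly prefers it."""
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get("application/xhtml+xml", 0.0)
    # Fall back to wildcards when text/html is not listed explicitly.
    html_q = prefs.get("text/html", prefs.get("text/*", prefs.get("*/*", 0.0)))
    if xhtml_q > html_q:
        return "application/xhtml+xml"
    return "text/html"
```

A plain substring test would serve XHTML to a client sending `application/xhtml+xml;q=0`, which explicitly refuses it; the sketch above does not. A response chosen this way should also carry a `Vary: Accept` header so that caches keep the two variants apart.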
That is a W3C NOTE and not a specification. There is no way people can say it is normative and should be followed (I have done this and I was wrong). I also think people should not try to look for a way to claim what they do is valid when they know they are wrong but just don't want to admit it (again, I have done this in the past).
Posted by Anne at 1:48PM
[m], the validator doesn't like HTML entities. Only the 5 XML entities are allowed. You can just use that trademark, by the way: '™' without encoding it, since my weblog uses a character encoding that can handle almost all characters. And in case utf-8 doesn't support a character (doubtful) you can use decimal or hexadecimal entities.
Posted by Anne at 1:55PM
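To illustrate the point about entities: an XML parser without a DTD only knows the five predefined entities (amp, lt, gt, quot, apos), so named HTML entities have to become either literal characters or numeric references. A small Python sketch of that conversion follows; the function name `to_numeric_references` is invented for this example, and it relies on the HTML 4 entity table in Python's standard `html.entities` module.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(text):
    """Replace named HTML entities with decimal character references."""
    def replace(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return ENTITY.sub(replace, text)
```

With this, `&trade;` becomes `&#8482;`, while `&amp;` is left alone.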
What is valid? I know not, any longer. On day one the validator says my document is valid, then a crew comes in, and improves the validator, and several hundred documents are no longer valid. That is, if I were to believe the validator, which I don't.
The CSS validator is broken since August 2003. Now the markup validator is broken, too. And before some helpful soul tells me to file a bug report, I will say this: it is not my business. I am not responsible for the quality of the validator service. I will happily live with the so-called invalid documents.
Validity, my antler.
M.
Posted by Moose at 3:02PM
So if we want to create valid documents people should either switch to application/xhtml+xml or switch to HTML 4.01.

Real world limitations are what justify the use of text/html for XHTML. It is permitted even, in XHTML1.0. So, I do not buy your either/or validity statement.
What is more of a thorn in my side is the fact that people new to webstandards jump on the XHTML bandwagon too easily. I have been there myself and so have you Anne, as I think I recall. If coders start using XHTML, they should do it for the right reasons and be aware of what transitional phase their markup is in when they serve the XHTML as text/html. A lot of ‘benefits’ of XHTML go out of the window when it is served with the ‘transitional’ (not called ‘wrong’ by me on purpose) MIME type.
When I talk to other coders who are thinking about starting to use XHTML, I advise them to forget about the X-factor and focus on the HTML Strict variant first. Because it is that variant of HTML that makes up the strictness that is currently being sold as a benefit of XHTML.
On a different note: if you were to advise government contracted web agencies on what markup to use, what would you tell them and why?
Different note deux: could you enlarge the size of your comment’s textarea box a bit? Thanks. Also, your comments validator tries to convince me that the I and SPAN element are invalid. I guess I will have to change them to some bogus element instead then. Language attributes seem to cause hiccups too.
Posted by Kris at 3:28PM
Argh... I had written a big comment and then I accidentally pressed alt+f4 instead of alt+f3 on some other page. :(
Summarized: i is style, you can use em instead which isn't style.
Valid HTML 4.01 for government agencies.
~~Lang attribute not allowed. Stupid.~~
Posted by Frenzie at 4:02PM
i is style, you can use em instead which isn't style.

But I have no intention to emphasize my text, just to mark up a foreign loanword.

Valid HTML 4.01 for government agencies.

Why? I asked Anne because of the “why?”.

Lang attribute not allowed. Stupid.

Last time I checked the lang attribute is still valid in XHTML1.0, which the markup of this page says it is. Please refrain from calling someone stupid until you first know him.
Posted by Kris at 4:25PM
Please refrain from calling someone stupid until you first know him.

Sorry if you felt offended, but I was referring to the comment system, ~~and if I was calling anybody stupid (which I wasn't), then it would be Anne.~~
Posted by Frenzie at 4:47PM
And just to clarify (sorry for not thinking about this earlier), before Anne feels offended, I just think it's stupid that the lang attribute isn't allowed in the comment system, for the obvious reason of giving dit Nederlandse citaat een aparte stijl, mocht je besluiten dat in de toekomst te doen. That has nothing to do with thinking anything is stupid, except a small thing in the comment system which I would hardly use more than once a month anyway.
Valid HTML 4.01 because it is hardly different from valid XHTML 1.0, but does not give the extra application/xhtml+xml, text/html et cetera work.
Btw, although I initially started sending Watchzine as application/xhtml+xml, later refined to if accepts (hell, how could I know IE didn't support it 8 months ago when I just decided to start with using XHTML :P), later switching to text/html for all (still too many things to fix), I have now finally switched to if accepts again. The only problem is currently my allow HTML script which allows correct HTML which is incorrect XML, but I hope most people will use the available BBcodes. If they don't, I would either need some kind of automatic post validation, or an automatic code changer, which makes href=http://something into href="http://something" automatically (before inserting it into the database however). But that's for later.
Posted by Frenzie at 4:57PM
No offence taken. I suspected that I might be wrong about taking offence at your remark, so I tried to take it lightly. ~~I agree that the comment system should support the (HTML) lang attribute.~~
My opinion about government sites and XHTML: with the use of XHTML and serving it as application/xhtml+xml comes great responsibility. In fact, it is an accessibility hazard, since all browsers that I know of that support XHTML stop rendering if the markup of the document served to them is not well-formed and thus display an error rather than a best try. Not having to try to second-guess markup is of course a great advantage (it reduces the need for resources the browser has), but web developers need to have a constant eye on their site and correct errors in the markup where they occur.
As you may or may not know, the site of the company where I work, Cinnamon Interactive, is served as application/xhtml+xml. Anne, among others, has been very helpful pointing out the occasional error that sometimes occurred with updates, but I don't expect many visitors to do that. Especially when it comes to commercial websites, which my company has business from, this practice of an endless watch is economically not very viable.
I would like to reformulate Anne’s statement to this: if we want to serve XHTML documents as application/xhtml+xml we should either take responsibility for making sure all of our documents are valid at all times, or go into a transitional stage until we are ready (text/html), or forget the insanity and use HTML 4.01.
The Web would already be such a better place if people would start to code valid HTML4.01 Transitional, let alone Strict.
The reason the Cinnamon site is served as application/xhtml+xml is simply experimentation. I do it because I am convinced I can, while playing around with techniques that one day may prove the effort was worth it.
Posted by Kris at 5:20PM
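The "constant eye" on well-formedness described above can be partly automated by running every page through an XML parser before it is published. A minimal sketch in Python, using the standard library's `xml.etree.ElementTree`; the function name `well_formedness_error` is made up for this example. Note that it checks well-formedness only, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(markup):
    """Return None if markup parses as XML, else the parser's error message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None
```

For example, `<p>Hello<br /></p>` passes, while `<p>Hello<br></p>` and `<p>Fish & chips</p>` both produce an error message, which is the same class of mistake that makes a browser show an error page instead of the document.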

Kris: The reason … is simply experimentation.

Accept: text/html, application/xml;q=0.9, application/xhtml+xml, [etc] 
Content-Type: application/xhtml+xml

The lab exploded. :(

Posted by Eric at 6:00PM

~~Is there any particular reason, by the way, that PRE does only work as expected in the preview and is rendered useless upon submission?~~
Posted by Eric at 6:03PM
Eric, can you send me an e-mail at k.thivessen#cinnamon.nl with some more details? Like, the URL of the page you went to, what browser you use and the results (‘exploded’?). Thanks in advance.
I think this clearly illustrates the dangers of using XHTML with a certain MIME type and the responsibilities it puts on the author.
Posted by Kris at 6:13PM
Kris, I'm glad you didn't feel too offended. :)
About Cinnamon, I browsed around on a few pages, all were served as application/xhtml+xml and all seemed fine. However, Onze core business is websites is just... well, I don't know what exactly. First of all, I don't like core business (hell, where do I live, in a Dutch speaking country I hope?), secondly is websites. Your core business is websites. I presume that means everything from design to PHP to CSS. Anyway, regardless, I just don't like the sentence. Sorry for being unable to tell you why.
To return to Watchzine, to me it was served as application/xhtml+xml all along, but only now I feel that I finally fixed everything (every time the XML parser failed I knew something had to be fixed).
The main problem for me was actually PHP, it loves to output all kinds of invalid code. My second problem was Opera, but since 7.50 it only has one CSS quirk and no other quirks, so that's not an issue anymore either. They still haven't reached Mozilla's level here though.
With invalid I do not only mean font tags, unclosed br tags and such (hey, most of them are valid HTML 4.01 strict and all of them are valid HTML 4.01 Transitional), but unescaped ampersands is also something it loves, which is not valid HTML 4.01 either, or if it is valid, it has a great potential of causing problems, so it is annoying either way.
Anyway, my site validates (although I only care so much about that - it's well-formed-ness which is important) and it will continue to do so. ;)
Btw, I've been coding valid HTML 4.01 Transitional for years (since sometime in 2002 anyway).
Btw 2, I have no intention of being X-Philed, although I'm pretty sure WZ could do it.
Posted by Frenzie at 6:49PM
~~Frenzie, your comments about the site are happily appreciated in an e-mail to info@cinnamon.nl. There it will reach the people to which it matters most and who make the decisions about it.~~
~~I too like clear, Dutch language, rather than marketeer talk. Not everything is up to me though, which is a good thing. And yes, we do design to PHP to CSS.~~
..but unescaped ampersands is also something it loves

url_encode($string);, if I recall correctly, for URLs. html_specialchars($string); for stray ampersands and taghooks in one's ‘regular’ content. Not that this is all that easy though.
I've been coding valid HTML 4.01 Transitional for years

Strict is the sweetness, not the X in XHTML. You can go and collect your 77 virgins. :)
Being an X-phile is like being gay in showbiz: gives people something to talk about.
Posted by Kris at 7:14PM
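On the escaping functions recalled above: PHP's built-ins are spelled urlencode() and htmlspecialchars(). The two steps they perform are distinct, and both are needed when a URL with a query string ends up inside an attribute. The sketch below shows the same idea in Python with the standard library's `urllib.parse.urlencode` and `html.escape`; the helper `build_link` and its parameters are hypothetical, chosen only for illustration.

```python
from html import escape
from urllib.parse import urlencode


def build_link(base_url, params, label):
    """Build an <a> element whose href and text are safe in (X)HTML."""
    # Step 1: percent-encode each value and join the pairs with "&".
    query = urlencode(params)
    # Step 2: escape the whole URL for use in markup, so "&" becomes "&amp;".
    href = escape(base_url + "?" + query, quote=True)
    return '<a href="%s">%s</a>' % (href, escape(label))
```

Skipping the second step is what leaves unescaped ampersands in the output, which an XML parser rejects outright.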
1. This is not the place to discuss about my comment system. You can use my contact form for that.
2. You can use xml:lang as mentioned in the guidelines for other languages. I will decide what I think is forward compatible and best for my site. (Although I have an option now to change the style to your preference.)
3. This is also not the place to discuss marketing.
4. PRE does indeed not work properly. I have asked the WordPress people more than 5 times and stopped doing so, since nobody gave me a solid answer that worked.
5. I believe I have covered all the off topic comments now, please stop making them.
Posted by Anne at 7:29PM
url_encode($string);, if I recall correctly, for URLs.
Yes, but sessions, etc etc etc. Look, I didn't say it was impossible or even difficult (hence the validation), but it's so much extra unnecessary work.
Strict may be sweet, but IE isn't. Besides, I do not want to rewrite everything not written by me into (X)HTML strict, which is why transitional is more or less a requirement.
Anyway, does anyone know how PHP 5 is in regard to strict (X)HTML?
Posted by Frenzie at 7:53PM
Euhm.. this just has to do with PHP configuration. You can configure the session separator somewhere. I don't have that many problems with PHP and XML actually, more with PHP and Unicode, although I resolved most of the issues I had.
Posted by Anne at 8:01PM
Just check the output of something like highlight_string($string), just to name an example. Of course you can configure everything, but regardless of your mimetype, an unencoded ampersand is just ridiculous, it's bound to give problems, no matter if you use application/xhtml+xml or not. I want the ability to configure everything, of course, I also appreciate that not everybody likes the same things, but I have to use ini_set all the time, which is of course slower (not notably, but still). If I do not, my code would be oh so portable... is that annoying? Definitely!
Posted by Frenzie at 8:15PM
Every time I think about Microsoft's anti attitude towards updating IE, I remember this idea I had a few months ago -
Can you imagine how quickly IE would handle application/xhtml+xml if a lot of blogs suddenly started spitting out ONLY that content type? People would be switching to Mozilla and Opera in swarms and IE would have to do something.
It's all about business for them, and until they have a business related reason to change what they're doing - they won't ever do it.

Posted by Devon at 9:22AM
Quite frankly, you are wrong! In practice, web browsers only turn on standards compliant mode after reading the doctype declaration string.
In theory, HTML is a family of specifications including both SGML and XML versions. As stated above, every valid XML 1.0 document is also a valid SGML document. Therefore, any valid XHTML document can be read by an SGML parser. User agents should understand the semantics of anchors, titles, style-sheets, and whatnot because that's what they are designed for! So all you can say is that the SGML encoded version could be invalid XHTML - which is always XML - but I reiterate that XHTML is always valid HTML.
Your statement is much less insightful in the specific case. Certainly XHTML 1.0 is invalid HTML 4.01; however, so is HTML 3.2 or version 2.0. Every version is invalid with every other version in some way! The only way to differentiate between the SGML encoded versions is via the doctype; however, that must be read after any HTTP headers. Therefore, those headers can't tell you which version to expect in any case. Perhaps trusting content-type should be considered harmful instead?
Posted by Jimmy Cerra at 9:40AM
Just a quick note, if you're serving up app/xhtml+xml then you MUST IMO use a publishing system that ensures validity, the error barfing is just too severe. If you're using generated script, then Nick Kew has an Apache module that can enforce validity (by running validate and, if that fails, tidy-like things within an Apache module). This is the sort of thing that commercial grade (ie anything where you care about not annoying your visitors) should be running. Of course a publish only CMS could be run instead that enforces validity if your site is static.
Another thing to watch out for with serving application/xhtml+xml is content inserting proxies like Norton Anti-Virus which may modify your page and render it invalid, there's no real hope against these...
Posted by Jim Ley at 6:53PM
Damn the post doesn't apply to me... I suppose the main problem was XHTML came out before there was an appropriate MIME especially geared for XHTML 1.x onwards.
Well, I suppose one can always dream about M$ Explorer usage dropping, or them playing catch-up to Mozilla and Opera.
Posted by Robert Wellock at 12:42AM