Anne van Kesteren

Well-Formed

Before I write something about utf-8, I want to discuss error handling. Because of a bug in Mozilla I wasn't really aware of the fact that XML was even stricter than I thought. When I was going through my archives today to fix some broken links I even found a couple of characters that were incorrectly encoded. A real XML parser would have stopped with a fatal error on my pages and wouldn't have shown any of the contents, none. Fortunately, I'm now using utf-8 and I haven't had a single encoding issue since I switched.

According to the XML specification parsers MUST support the character encodings utf-8 and utf-16, so if you are using something else, like iso-8859-1 (in combination with an XML MIME type, like application/xhtml+xml), there is no guarantee that parsers will be able to handle your file and you should switch to utf-8 (you could switch to utf-16 as well, but if you are using iso-8859-1 now, that wouldn't make much sense). So parsers MUST give a fatal XML error on this example.

Note that Mozilla has another bug: it incorrectly treats text/xml as utf-8, which is a problem, since text/xml without a charset parameter should default to us-ascii. Agreed, the requirement that text/xml has to default to us-ascii is ridiculous, and therefore you should never use it; application/xml is just fine. Before you continue reading, add this to your .htaccess file:

AddType application/xml .xml
AddDefaultCharset utf-8

Tim Bray was kind enough to list the rules for generating a well-formed XML file. Note that these rules also apply to both XHTML and Atom; there are no exceptions:

There's just no nice way to say this: Anyone who can't make a syndication feed that's well-formed XML is an incompetent fool. Here are the rules:

  1. For the tags you write, make sure that begin-tags and end-tags match up, and all the attribute values are quoted.
  2. Make sure that you generate correct utf-8 or utf-16 text.
  3. Filter out characters that aren't legal in XML. Don't get fancy, just lose them.
  4. Clean up any text you're passing through by replacing < with &lt;, & with &amp;, > with &gt;, " with &quot;, and ' with &apos;. This applies to attribute values and character data in elements.
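Rules 3 and 4 above can be sketched in a few lines. This is only a minimal illustration in Python, not anything Tim Bray published: the helper name `clean_text` is made up for this example, and the character ranges are taken from the `Char` production of XML 1.0.

```python
import re
from xml.sax.saxutils import escape

# Everything outside the XML 1.0 "Char" production is illegal:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text(text: str) -> str:
    """Hypothetical helper: apply rule 3, then rule 4, to pass-through text."""
    # Rule 3: don't get fancy, just lose the illegal characters.
    legal = _ILLEGAL_XML_CHARS.sub("", text)
    # Rule 4: escape() handles &, < and >; the extra mapping adds both quotes,
    # so the result is safe in attribute values as well as element content.
    return escape(legal, {'"': "&quot;", "'": "&apos;"})


if __name__ == "__main__":
    print(clean_text('fish & chips < "dinner"\x00'))
```

For rule 2, the cleaned string would still have to be written out with an explicit encoding, for example `clean_text(s).encode("utf-8")`.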

Those rules seem to be impossible to follow for most software. Maybe that is because the software gets so complicated that fixing these kinds of bugs becomes an impossibility, but I can't really believe that, since most things would be fixed by using an equivalent of the PHP htmlspecialchars function (never use htmlentities, since PHP has no native support for Unicode and it will therefore ruin your content).

To sum it up: in order to have a well-formed XML document, whether it is "just" XML, RDF, SVG, Atom or XHTML, your document needs to be in utf-8 or utf-16 and all characters must match that encoding; you must have escaped the five characters covered by the predefined XML entities; and last, but not least, your tags need to be opened, closed and nested correctly. If you can't do that, you are an incompetent fool.
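One way to see how strict a real XML parser is about the points above is to feed it a few broken documents. The following is a rough sketch using the expat-based parser in Python's standard library; the function name `is_well_formed` is invented for this example, and it checks well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: bytes) -> bool:
    """Hypothetical helper: True if the parser accepts the raw bytes."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # Any well-formedness error is fatal; the parser gives up entirely.
        return False
    return True


if __name__ == "__main__":
    samples = {
        "correct utf-8": b'<?xml version="1.0" encoding="utf-8"?><p>caf\xc3\xa9 &amp; tea</p>',
        "iso-8859-1 byte declared as utf-8": b'<?xml version="1.0" encoding="utf-8"?><p>caf\xe9</p>',
        "unescaped ampersand": b"<p>fish & chips</p>",
        "badly nested tags": b"<p><b>bold</p></b>",
    }
    for label, doc in samples.items():
        print(label, "->", is_well_formed(doc))
```

Only the first sample is accepted; each of the others trips over one of the requirements summed up above.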

Of course, there has been some response to this post, in a thought experiment written by Mark Pilgrim that tells people who believe in strict error handling that they failed. Quite amusing, actually. Of course, he misses (well, doesn't mention is more like it) the fact that their XHTML isn't really sent as application/xhtml+xml, as it should be, but as text/html. This creates the problem that authors think their markup is valid just because it displays okay, and they don't bother fixing their software even when it is broken. But that isn't really an argument either, since software should just output 100% well-formed XHTML (or XML, more generally) no matter what. If you notice that the CMS or ad software you are using for your client outputs markup that is not well-formed, you should reconsider your choice for XHTML, really.

All of the above raises the question of how bad we are at writing software. XML has been in existence for a long time, XHTML was released four and a half years ago and was revised two years ago, and people still create software that can't get it right (it outputs data that is neither well-formed nor valid). There are some people who think we should follow the so-called Postel's Law:

Be conservative in what you do, be liberal in what you accept from others.

That the above rule applies to RSS documents is probably irreversible, thanks to the badly written specifications, the lack of a MIME type (text/xml without a charset parameter all over the place) and a validator that doesn't work. Fortunately, there is now the feed validator; try it today and see if you are a fool or a normal person ;-). Most of the newly created XHTML sites have the same problem. What to do?

Comments

  1. try it today and see if you are a fool or a normal person

    Does this "sound advice" also apply to other validators? Because if so, then everyone is potentially a fool. As long as validators are broken, you can wallpaper your room with such statements, or frame them and hang above your desk, for all they are worth.

    There is only one misery worse than an invalid validator page, and that is a broken validator.

    Posted by Moose at

  2. The problem is not at the level of the weblog software, it's at the level of the tools weblog software employs. The first XML Rec was released over six years ago, and the best thing that either WordPress or Movable Type can rely on to build XML and XHTML is string concatenation.

    You cannot guarantee that you are creating well-formed XML unless you are using XML tools to create it, and in the deployed world of scripting languages, whether it's PHP, Perl, or Python, you cannot guarantee that you will have any XML tool available, even after six years of XML being the most-hyped thing in computing.

    I think it's quite significant that after he started calling people who produce ill-formed XML bozos, Tim Bray discovered that there wasn't any generally available XML writing tool, and has spent months since writing a C library to write XML. Once that's finished, how long would you guess it will be before all scripting languages have an interface to it, and have it in the core, so it's available to anyone who is using one of the three Ps? Three years?

    Posted by Phil Ringnalda at

  3. Even though he is Mr. XML, Tim Bray's rules are a little stricter than necessary. Technically, '>' doesn't need to be encoded as '&gt;' in most situations according to the specification for XML 1.0.

    Posted by Jimmy Cerra at

  4. While it's true that '>' doesn't have to be encoded in most situations, it does have to be encoded in some (specifically when it follows the character sequence ']]' and isn't the end of a marked section).

    Now, that's uncommon, but it could happen. So the safest thing to do is simply encode it whenever you use it or generate it.

    Posted by Norman Walsh at

  5. Your blanket prohibition on htmlentities() is overdone. I tend to use htmlentities($string, ENT_QUOTES, 'UTF-8'), and it works fine :-) .

    Posted by Aidan Kehoe at