Anne van Kesteren

text/xml is seriously broken over HTTP

1 March 2005

So I thought I read RFC 3023 carefully, but apparently I did not. As Mark (the one with the hobby that does not involve electricity) points out, you can only change the character encoding of an XML document with a text/xml Content-Type using the optional charset parameter. Otherwise the document is encoded in US-ASCII. No matter what your browser says.

So the following examples are encoded in US-ASCII although the document seems to tell otherwise. This is based on section 8.5 of the above-mentioned RFC.

Content-Type:text/xml
{BOM}<?xml version="1.0" encoding="UTF-8"?>
<test xmlns="tag:example.org,2005-03:test"/>

Content-Type:text/xml
{BOM}<?xml version="1.0"?>
<test xmlns="tag:example.org,2005-03:test"/>

Content-Type:text/xml
<?xml version="1.0" encoding="UTF-8"?>
<test xmlns="tag:example.org,2005-03:test"/>

This example is encoded in UTF-8:

Content-Type:text/xml;charset=utf-8
<?xml version="1.0"?>
<test xmlns="tag:example.org,2005-03:test"/>
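The rule behind these examples can be sketched in a few lines of Python. This is only an illustration of the reading of RFC 3023 given above, not a complete implementation; the function name effective_charset is made up for this sketch, and it only looks at the Content-Type header value, never at the BOM or the XML declaration.

```python
from email.message import Message


def effective_charset(content_type: str) -> str:
    """Charset a strict consumer would assume for a text/xml response.

    Hypothetical helper: only the Content-Type value is consulted,
    the BOM and the XML declaration are deliberately ignored.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/xml":
        raise ValueError("this sketch only covers text/xml")
    charset = msg.get_param("charset")
    # No charset parameter means US-ASCII, whatever the document claims.
    return charset.lower() if isinstance(charset, str) else "us-ascii"


print(effective_charset("text/xml"))
print(effective_charset("text/xml;charset=utf-8"))
```

The first call covers the three US-ASCII examples, the second the UTF-8 one.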

Fortunately (as I have mentioned before) text/xml is being deprecated. However, that does not mean most of the problems are solved. Read for example Sam Ruby’s latest post on Yahoo Search and Iñtërnâtiônàlizætiøn. Not that Yahoo is a bad company.

Comments

To be absolutely clear, you don't mean that the documents are literally encoded in US-ASCII, right? There's no process that strips 8-bit characters or anything like that. Perhaps it would be better to state this as follows:

the following examples are decoded as US-ASCII by the browser as that's what the server is telling the browser the document is encoded as, even though the document is actually encoded as UTF-8, UTF-16, etc.

That doesn't seem quite right either. The point I'm trying to make is that the representation never goes through an encoding phase where it is converted to US-ASCII. It is simply decoded incorrectly because the browser doesn't know the correct encoding (or is not allowed to use it).
Or am I smoking crack?
Posted by Ryan Tomayko at 7:23AM
Ryan: the user-agent on the receiving end is supposed to behave as if it's been handed us-ascii, so (for example) if the document contains non-ascii characters that means it's not well-formed.
Posted by James at 8:12AM
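A rough way to see what James describes: decode the body strictly as US-ASCII and treat any failure as "not well-formed". The helper name decodes_as_us_ascii is invented for this sketch, and a real XML processor does far more than this single check.

```python
def decodes_as_us_ascii(body: bytes) -> bool:
    """True if every byte of the body is a valid US-ASCII character."""
    try:
        body.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# No BOM and only ASCII bytes: survives being read as US-ASCII.
plain = b'<?xml version="1.0" encoding="UTF-8"?><test/>'
# The UTF-8 BOM is the bytes EF BB BF, none of which are ASCII.
with_bom = b'\xef\xbb\xbf<?xml version="1.0"?><test/>'

print(decodes_as_us_ascii(plain))
print(decodes_as_us_ascii(with_bom))
```

Under this check a document without a BOM that happens to contain only ASCII bytes still gets through; the trouble starts with the BOM or with the first non-ASCII character.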
Content-Type:text/xml;charset=utf-8 is exactly the right answer to get your data intact to the other side, no matter what is listening on the other side!
Posted by xiffy at 8:40AM
Anne, your href feed responds cache-control: no-store, no-cache. That's pretty much as broken as text/xml. It makes USM impossible.
Posted by Randy Charles Morin at 11:04AM
That is indeed correct; however, the only user agent I'm aware of that actually implements that rule is the validator. AFAIK, Mozilla, Opera, IE, Safari and every other browser that supports text/xml will obey the XML declaration (if present) and/or the BOM, with the default of UTF-8.
It gets really interesting when an intermediary server transcodes the document from US-ASCII (or whatever encoding is specified by the charset parameter) to another, because (even though that is allowed) the process wouldn't update the XML declaration, making the document non-conformant. See the Architecture of the World Wide Web - Media types for XML for more information.
Perhaps if browsers had implemented it correctly in the first place, no-one would be using text/xml anyway, but my guess is that it's probably a mistake we're stuck with, just like we're stuck with disasters like quirks mode.
Posted by Lachlan Hunt at 11:16AM
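The behaviour Lachlan attributes to browsers (BOM and/or XML declaration, with UTF-8 as the default) might look roughly like the sketch below. It is a simplification for illustration, not any browser's actual algorithm: sniff_encoding is a made-up name, and it only handles UTF-8 and UTF-16 BOMs plus an ASCII-compatible XML declaration.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very start.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(body: bytes) -> str:
    """Guess the encoding from the document itself, ignoring HTTP headers."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # the default described in the comment above


print(sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><test/>'))
print(sniff_encoding(b'<?xml version="1.0"?><test/>'))
```

Note that the Content-Type header plays no part here, which is exactly where this behaviour and the RFC 3023 rule part ways.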
Randy, I saw that on your weblog already. I took it out.
Lachlan, I am fixing that bug for Mozilla.
Ryan, for the parser they seem to be encoded in US-ASCII.
Posted by Anne at 4:32PM
Thanks Anne! You are on a roll.
Posted by Randy Charles Morin at 8:49PM
And there are of course two new lines between the HTTP header and body, instead of just the one.
Posted by XS at 3:51AM
Yeah!!! So you are booting wordpress then?
I've been trying to send this message to Matt on his PhotoMatt blog, but he keeps deleting it. Tell me what you think
- "I am very disappointed that you have left in that callback for the firefox image to the wordpress site. Could you tell me what all the instances of these callbacks are in wordpress, as I’d like to take them out.
  I really like wordpress, but instead of making that call default, it would be much nicer if you did what a pure freedom gpl program like NVU does and ask the user if they want to allow a ping, or pull an image, from another server. I do not believe anyone should control what is viewed on another’s webpage, not even a tiny icon. The fact that a user can easily install wordpress and not be aware of this issue is very unfortunate.
  If you could fix that somehow it would really be great for the freedoms of the users of this software.
  Keep up the great work, and consider fixing this minor quibble. I’m also very excited about the bbpress project and wish you success in that."
So at the end I added, "You can either answer these questions here or I could spread them all over the net, your choice."
It just really annoyed me that he had absolutely no respect for the freedom of speech.
Posted by AQ at 6:38AM
A note of annoyance: archive.org's feeds are more or less broken because of this. (Or at least, the speed runs feed is.) What little knowledge I have (or want, for that matter) points to the feed being Windows-1252 (other pages on the site are autodetected as this by Firefox). But no encoding is given in the headers or the declaration, and non-ASCII characters end up as question marks. (Mark's feed parser detects the feed as ASCII as well.)
The odd thing is, Bloglines switches between correctly detecting characters and not every few hours, which means the same European men keep appearing over and over again as their items change.
Posted by Josh at 3:23PM
I take that back, it's declared as UTF-8 (which is still incorrect, with the question marks and all). Ugh, this makes my head hurt.
Posted by Josh at 3:32PM