Anne van Kesteren

text/xml is seriously broken over HTTP

So I thought I had read RFC 3023 carefully, but apparently I had not. As Mark (the one with the hobby that does not involve electricity) points out, the only way to change the character encoding of an XML document served with a text/xml Content-Type is the optional charset parameter. Otherwise the document is encoded in US-ASCII, no matter what your browser says.

So the following examples are encoded in US-ASCII, although the documents themselves seem to say otherwise. This is based on section 8.5 of the above-mentioned RFC.

This example is encoded in UTF-8.
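
A sketch of what such a response might look like (an illustrative reconstruction, not the original example's exact bytes; the document body is a placeholder of my own):

    HTTP/1.1 200 OK
    Content-Type: text/xml

    <?xml version="1.0" encoding="UTF-8"?>
    <text>Iñtërnâtiônàlizætiøn</text>

The bytes really are UTF-8 and the XML declaration says so, but because the Content-Type carries no charset parameter, RFC 3023 says the entity is US-ASCII and the declaration is simply overridden.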

Fortunately (as I have mentioned before) text/xml is being deprecated. However, that does not mean most problems are solved. Read, for example, Sam Ruby’s latest post on Yahoo Search and Iñtërnâtiônàlizætiøn. Not that Yahoo is a bad company.

Comments

  1. To be absolutely clear, you don't mean that the documents are literally encoded in US-ASCII, right? There's no process that strips 8-bit characters or anything like that. Perhaps it would be better to state this as follows:

    the following examples are decoded as US-ASCII by the browser, as that's what the server is telling the browser the document is encoded as, even though the document is actually encoded as UTF-8, UTF-16, etc.

    That doesn't seem quite right either. The point I'm trying to make is that the representation never goes through an encoding phase where it is converted to US-ASCII. It is simply decoded incorrectly because the browser doesn't know the correct encoding (or is not allowed to use it).

    Or am I smoking crack?

    Posted by Ryan Tomayko at

  2. Ryan: the user-agent on the receiving end is supposed to behave as if it's been handed US-ASCII, so (for example) if the document contains non-ASCII characters, that means it's not well-formed.

    Posted by James at

  3. Content-Type: text/xml; charset=utf-8 is exactly the right answer to get your data to the other side intact, no matter what is listening on the other side. (A sketch of this appears after the comments.)

    Posted by xiffy at

  4. Anne, your href feed responds with Cache-Control: no-store, no-cache. That's pretty much as broken as text/xml. It makes USM impossible.

    Posted by Randy Charles Morin at

  5. That is indeed correct; however, the only user agent I'm aware of that actually implements that rule is the validator. AFAIK, Mozilla, Opera, IE, Safari and every other browser that supports text/xml will obey the XML declaration (if present) and/or the BOM, with a default of UTF-8.

    It gets really interesting when an intermediary server transcodes the document from US-ASCII (or whatever encoding is specified by the charset parameter) to another encoding, because (even though that is allowed) the process wouldn't update the XML declaration, making the document non-conformant; a sketch of this appears after the comments. See the Architecture of the World Wide Web, Media types for XML, for more information.

    Perhaps if browsers had implemented it correctly in the first place, no one would be using text/xml anyway, but my guess is that it's probably a mistake we're stuck with, just like we're stuck with disasters like quirks mode.

    Posted by Lachlan Hunt at

  6. Randy, I saw that on your weblog already. I took it out.

    Lachlan, I am fixing that bug for Mozilla.

    Ryan, for the parser they seem to be encoded in US-ASCII.

    Posted by Anne at

  7. Thanks Anne! You are on a roll.

    Posted by Randy Charles Morin at

  8. And there are of course two newlines between the HTTP headers and the body, instead of just one.

    Posted by XS at

  9. Yeah!!! So you are booting WordPress, then?

    I've been trying to send this message to Matt on his PhotoMatt blog, but he keeps deleting it. Tell me what you think.

    So at the end I added, "You can either answer these questions here or I could spread them all over the net, your choice."

    It just really annoyed me that he had absolutely no respect for freedom of speech.

    Posted by AQ at

  10. A note of annoyance: archive.org's feeds are more or less broken because of this. (Or at least, the speed runs feed is.) What little knowledge I have (or want, for that matter) points to the feed being Windows-1252 (other pages on the site are autodetected as this by Firefox). But no encoding is given in the headers or the declaration, and non-ASCII characters end up as question marks. (Mark's feed parser detects the feed as ASCII as well.)

    The odd thing is, Bloglines switches between correctly detecting characters and not every few hours, which means the same European men keep appearing over and over again as their items change.

    Posted by Josh at

  11. I take that back; it's declared as UTF-8 (which is still incorrect, with the question marks and all). Ugh, this makes my head hurt.

    Posted by Josh at
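
To illustrate xiffy's suggestion in comment 3 (and the rule James states in comment 2), here is a sketch, not taken from the original post, of the same document served with an explicit charset parameter:

    HTTP/1.1 200 OK
    Content-Type: text/xml; charset=utf-8

    <?xml version="1.0" encoding="UTF-8"?>
    <text>Iñtërnâtiônàlizætiøn</text>

With the charset parameter present, the US-ASCII default never applies: the receiver decodes the body as UTF-8, the HTTP metadata and the XML declaration agree, and the non-ASCII characters no longer break well-formedness.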
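
And a sketch of the transcoding hazard Lachlan describes in comment 5 (again illustrative): suppose the response above passes through an intermediary that transcodes the body to ISO-8859-1 and updates the charset parameter, but leaves the XML declaration alone:

    HTTP/1.1 200 OK
    Content-Type: text/xml; charset=iso-8859-1

    <?xml version="1.0" encoding="UTF-8"?>
    <text>Iñtërnâtiônàlizætiøn</text>

The charset parameter is authoritative, so a conforming processor still decodes the body correctly, but the XML declaration now claims an encoding the document no longer uses; that inconsistency is the non-conformance Lachlan points out.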