Anne van Kesteren

URLs are tough

RFC 3986 defines URIs: basically a syntax for addressing, limited to a subset of ASCII characters. This works well over HTTP. It also defines an escape syntax for bytes, named percent-encoding. RFC 3987 defines IRIs. The IRI specification defines a mapping from a set of characters found in a particular encoding to a URI. The URI in question uses a percent-encoded sequence of bytes for characters that fall outside the subset of allowed ASCII characters. To get the character back you need to decode the sequence of bytes using UTF-8. RFC 3987 has a catch that Björn pointed out to me in 2005 and that I did not quite get until I looked at it again earlier today. (Hint: Unicode Normalization strikes again.) In section 3.1, where the mapping of IRIs to URIs is defined, the specification says to construct URIs in a different way if the input is found in a document encoded in a non-Unicode encoding (e.g. Windows-1252).
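To make the percent-encoding step and the normalization hint concrete, here is a minimal sketch in Python. It assumes the standard library's `urllib.parse.quote` is a fair stand-in for the UTF-8 mapping described above; the helper name `percent_encode_utf8` is made up for illustration and is not taken from either RFC. It shows that the same visible character produces different URIs depending on whether it is normalized to NFC or NFD before encoding, which is why it matters whether a normalization step is applied.

```python
import unicodedata
from urllib.parse import quote, unquote


def percent_encode_utf8(text: str) -> str:
    """Hypothetical helper: percent-encode everything outside the
    unreserved ASCII set, using the UTF-8 bytes of each character."""
    return quote(text, safe="", encoding="utf-8")


# The same visible character, "é", in two Unicode normalization forms.
composed = unicodedata.normalize("NFC", "e\u0301")   # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # "e" followed by U+0301

print(percent_encode_utf8(composed))    # %C3%A9
print(percent_encode_utf8(decomposed))  # e%CC%81

# Going the other way: decode the percent-encoded bytes as UTF-8.
print(unquote("%C3%A9", encoding="utf-8"))  # é
```

Two strings that look identical on screen end up as two different URIs on the wire, so whether and when normalization happens is not a cosmetic detail.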

As part of working on HTML5 Ian reverse engineered URL handling in browsers. Handling of URLs in browsers is not quite like how IRIs work. First of all, browsers accept a wider set of characters. Whitespace is fine, for instance. Most browsers also have special handling for the backslash due to Microsoftisms. Just like IRIs, the URL is mapped to a URI using UTF-8. Except for the query component of the URL (the bit after the question mark). Here, for legacy reasons, the encoding of the document is used instead. Except if the encoding of the document is UTF-16, in which case UTF-8 is used. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least. Yay for browsers!
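The query component rule can be sketched roughly as follows. This is a simplification under stated assumptions, not what any browser actually implements: the function name `sketch_browser_uri` is hypothetical, whitespace and backslash handling are ignored, and characters that the document encoding cannot represent simply raise an error here.

```python
from urllib.parse import quote


def sketch_browser_uri(path: str, query: str, document_encoding: str) -> str:
    """Hypothetical helper: the path is always percent-encoded via UTF-8,
    while the query uses the document encoding, unless that is UTF-16."""
    name = document_encoding.lower().replace("_", "-")
    # UTF-16 documents fall back to UTF-8 for the query component.
    query_encoding = "utf-8" if name.startswith("utf-16") else document_encoding
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # Assumption: keep "=" and "&" literal so the query structure survives.
    encoded_query = quote(query, safe="=&", encoding=query_encoding)
    return f"{encoded_path}?{encoded_query}"


print(sketch_browser_uri("/café", "q=café", "windows-1252"))  # /caf%C3%A9?q=caf%E9
print(sketch_browser_uri("/café", "q=café", "utf-8"))         # /caf%C3%A9?q=caf%C3%A9
print(sketch_browser_uri("/café", "q=café", "utf-16"))        # /caf%C3%A9?q=caf%C3%A9
```

In the Windows-1252 case the same character ends up as `%C3%A9` in the path and `%E9` in the query, which is exactly the kind of surprise the server then has to untangle.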

When Ian integrated support for URLs I went to figure out what the handling for XMLHttpRequest was. It turns out that user agents deal with them more or less identically to HTML, except for Internet Explorer, which is probably because their implementation is done by a separate component. Well, and Opera and Firefox always encoded the query component using UTF-8, which is a nice bonus. The XMLHttpRequest Object specification now (editor's draft) requires the Opera and Firefox behavior.

Since details of URL processing are better defined separately from HTML5, Dan Connolly from the W3C created the Web addresses in HTML 5 draft, introducing the term Web address where HTML5 used URL. I would prefer simply using URL (see e.g. url() in CSS or <input type="url"> in HTML), but it does not matter much.

Next I want to figure out where else this URL handling is performed. I’m quite confident it affects HTTP and CSS, but I haven’t done enough testing yet. Any takers?

(Needless to say a bunch of things have been simplified in this post.)

Comments

  1. Effectively, using non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results, to say the least.

    Or, for us that came in from the other side: Trying to deal with non-ASCII characters in URLs in documents not encoded as UTF-8 or UTF-16 will give you surprising results. Especially when different groups of users expect different, and incompatible, behaviour. I believe IE 5 came with different defaults on western and CJK systems because of this, and Opera has struggled in Russia and Korea for similar reasons...

    Posted by Peter Krefting at

  2. I'd be interested in testing this, I've been doing some poking in this area already. What'd you have in mind?

    Posted by Chris Weber at

  3. I completed the testing already. But basically it is about referencing resources from CSS (or e.g. the HTTP Location header) using a variety of characters and encodings and figuring out what the user agent is actually requesting from the server. I.e. what URI is crossing the wire.

    Posted by Anne van Kesteren at

  4. Were you talking about CSS like this: body { background: url(http://x.y.z) } Where the x and y are made of Unicode characters > U+007F and potentially encoded using stuff other than UTF-8?

    Posted by Chris Weber at

  5. The email from Björn mentioned in the post has an example.

    Posted by Anne van Kesteren at

  6. If this post makes it to reddit, you'll set a world record for most "Let's go shopping!" references in a single comment thread.

    Posted by Alex at