Anne van Kesteren

IDNA Hell

In DNS, a domain is a sequence of domain labels, terminated by the empty domain label. A domain label in turn is a sequence of bytes, preceded by a single byte indicating the length of the domain label, with an upper limit of 63. A domain itself is limited to 255 bytes, but whether that is true in practice I have not tested. Although the domain label bytes can be anything, the recommendation has been to limit usage to ASCII (0x00 to 0x7F) and to treat 0x41 to 0x5A identically to 0x61 to 0x7A (case-insensitive matching if you will).
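To make the byte layout concrete, here is a minimal Python sketch of the encoding described above. The function names encode_domain and ascii_lower are made up for illustration, and the limits are taken straight from the paragraph rather than from testing against real servers.

```python
def encode_domain(labels):
    """Encode a list of byte-string labels as length-prefixed DNS labels."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("a domain label must be 1 to 63 bytes")
        out.append(len(label))  # length byte comes first
        out += label            # then the label bytes themselves
    out.append(0)               # the empty label terminates the domain
    if len(out) > 255:
        raise ValueError("a domain is limited to 255 bytes")
    return bytes(out)


def ascii_lower(label):
    """Fold 0x41-0x5A onto 0x61-0x7A and leave every other byte alone."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)


wire = encode_domain([b"example", b"com"])
print(wire)                                            # b'\x07example\x03com\x00'
print(ascii_lower(b"EXAMPLE") == ascii_lower(b"example"))  # True
```

Note that only the ASCII letters are folded; a byte such as 0xC9 is compared as-is.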

In March 2003 a new layer on top of the DNS was introduced, IDNA, now commonly referred to as IDNA2003. It defined an algorithm that took a Unicode code point sequence and either converted it to something that can be used in DNS or returned failure. For each domain label, the algorithm would perform pre-processing, use Punycode to match DNS recommendations, and then prefix the result with xn-- to prevent clashes with existing domain labels.
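The per-label steps can be sketched with Python's standard library, which ships an IDNA2003 implementation in encodings.idna along with a "punycode" codec. The helper to_ascii_label below is a hypothetical name and a simplification: it skips several checks the real algorithm performs (for instance rejecting input that already starts with xn--), so treat it as an illustration rather than a replacement for the built-in "idna" codec.

```python
from encodings.idna import nameprep

ACE_PREFIX = "xn--"


def to_ascii_label(label):
    """Simplified IDNA2003 conversion of a single domain label."""
    prepared = nameprep(label)          # pre-processing (mapping, NFKC, prohibited checks)
    if prepared.isascii():
        result = prepared               # already fits DNS recommendations
    else:
        # Punycode the label, then add the prefix that marks it as encoded.
        result = ACE_PREFIX + prepared.encode("punycode").decode("ascii")
    if not 1 <= len(result) <= 63:
        raise ValueError("label does not fit in a DNS domain label")
    return result


print(to_ascii_label("Bücher"))   # xn--bcher-kva
print(to_ascii_label("Example"))  # example
print("Bücher".encode("idna"))    # b'xn--bcher-kva' (the built-in codec agrees)
```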

Commonly when writing domain names, "." is used to separate domain labels. IDNA2003 introduced three more separators to match input method editor usage. The pre-processing of IDNA2003 was also rather involved and included ignoring certain code points and transforming some code points into others. IDNA2003 was in theory also restricted to Unicode 3.2, but in practice that is not the case.
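As a rough illustration, Python's built-in IDNA2003 support shows all three behaviours: the extra separators, ignored code points, and code points mapped to others. The separator list below is the one from RFC 3490; split_labels is a hypothetical helper, and the soft hyphen and trademark sign are just two examples I picked.

```python
import re
from encodings.idna import nameprep

# ".", ideographic full stop, fullwidth full stop, halfwidth ideographic full stop
SEPARATORS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def split_labels(domain):
    """Split a domain on any of the four IDNA2003 label separators."""
    return SEPARATORS.split(domain)


print(split_labels("www\u3002example\uFF0Ecom"))       # ['www', 'example', 'com']
print("www\u3002example\uFF0Ecom".encode("idna"))      # b'www.example.com'

# Pre-processing: U+00AD SOFT HYPHEN is ignored, uppercase is mapped to lowercase.
print(nameprep("Ex\u00ADample"))                        # example
# U+2122 TRADE MARK SIGN is transformed into two ASCII letters.
print(nameprep("\u2122"))                               # tm
```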

In September 2010 an update to IDNA2003 was issued, IDNA2008. IDNA2008 defines a lot less than IDNA2003. What it defines is valid code point sequences for domain labels (U-label) and their mapping to DNS recommendations (A-label). The pre-processing step is gone (™ no longer becomes tm), the domain label separators other than "." are gone, and a few details have changed. Effectively, compared to IDNA2003, what should be done with a domain found in a URL is now undefined. ☺ is disallowed, as is ☃, so ☃.net might go away if Verisign starts to enforce IDNA2008. IDNA2003 with updated Unicode maps ß to ss; IDNA2008 has no such mapping.
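The ß difference is easy to see in Python. The standard library only implements IDNA2003, so the IDNA2008 side below is approximated by Punycoding the label without any mapping; raw_a_label is a hypothetical helper that performs none of the IDNA2008 validity checks (it would happily encode ☃ too), so it only shows what the A-label looks like when nothing is mapped away.

```python
def idna2003(label):
    """IDNA2003 conversion via the standard library's "idna" codec."""
    return label.encode("idna").decode("ascii")


def raw_a_label(label):
    """Punycode the label as-is: no mapping, no IDNA2008 validity checks."""
    return "xn--" + label.encode("punycode").decode("ascii")


print(idna2003("straße"))     # strasse (ß was mapped to ss before encoding)
print(raw_a_label("straße"))  # xn--strae-oqa (ß survives as its own code point)
print(idna2003("☃"))          # xn--n3h (accepted by IDNA2003)
```

The two results for straße name different domains, which is why the choice of algorithm matters for anything that resolves URLs.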

Unicode then came up with Unicode IDNA Compatibility Processing (commonly referred to as UTS #46). It effectively defines pre-processing for IDNA2008, re-introduces the domain label separators from IDNA2003, and makes IDNA2008 far more backwards compatible.

Then there are the implementations. Firefox, Internet Explorer, Chrome, and Safari implement IDNA2003 with an updated version of Unicode (ß becomes ss). Opera implements IDNA2008, with the pre-processing from RFC 5895 (a non-normative part of IDNA2008 that is recommended against), the domain label separators from IDNA2003, and no restrictions on Unicode code points (☃.net works). Firefox has plans to implement IDNA2008, possibly via UTS #46, but has no resources at the moment. WebKit has, to my knowledge, very little interest in touching this, and from what I have seen the same is true for Internet Explorer.

The other component in implementations is what to display to the user: Unicode or Punycode? This does not affect interoperability and is therefore of less interest to me, but you can get a general idea by reading IDN in Google Chrome.

In the end though, I have absolutely no idea which of these various practices to require in the URL Standard, so I guess I will wait a bit for someone to make a move here.