Anne van Kesteren

URL: IDNA2003

Previously, in reverse chronological order: IDNA Hell, URL: IDNA2008, and URL: domain names.

IDNA2003 consists of two important algorithms: ToASCII and ToUnicode. Both operate on a single domain label, not on a whole domain name. To obtain the labels, split the domain name on dots (U+002E, U+3002, U+FF0E, and U+FF61).
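
As a rough illustration of the splitting step, here is a minimal Python sketch. The names `DOTS` and `split_labels` are made up for this example and do not come from any specification.

```python
import re

# The four code points IDNA2003 treats as label separators:
# U+002E FULL STOP, U+3002 IDEOGRAPHIC FULL STOP,
# U+FF0E FULLWIDTH FULL STOP, U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP.
DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def split_labels(domain: str) -> list[str]:
    """Split a domain name into labels on any of the four dots."""
    return DOTS.split(domain)


print(split_labels("www\u3002example\uFF0Ecom"))  # ['www', 'example', 'com']
```

Each resulting label can then be handed to ToASCII or ToUnicode on its own.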

Apart from a range check and checks for certain code points, ToASCII encompasses two major algorithms: Nameprep and Punycode (see Wikipedia’s Punycode). Nameprep is a specific profile of Stringprep. Stringprep, in turn, does a number of things: it maps code points, applies Unicode normalization (NFKC, “Die, heretic scum!”), and checks for forbidden code points, proper use of bidirectional code points, and unassigned code points (although this last check will not happen in browsers).
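
To make the shape of ToASCII a little more concrete, here is a simplified sketch built on Python's standard library, which ships a Nameprep implementation (`encodings.idna.nameprep`) and a `punycode` codec. The function name `to_ascii_sketch` is invented for this example, the STD3 ASCII checks are left out, and Python's Stringprep tables are based on Unicode 3.2, so this is not what the URL Standard asks of browsers.

```python
from encodings.idna import nameprep  # stdlib Nameprep (Unicode 3.2 tables)

ACE_PREFIX = "xn--"


def to_ascii_sketch(label: str) -> str:
    """Simplified ToASCII for one label; omits the STD3 ASCII rules."""
    if label.isascii():
        # All-ASCII labels skip Nameprep and Punycode entirely.
        result = label
    else:
        # Nameprep: map, normalize (NFKC), check forbidden and bidi code points.
        # Raises UnicodeError if the label is not allowed.
        prepared = nameprep(label)
        if prepared.isascii():
            result = prepared
        else:
            if prepared.lower().startswith(ACE_PREFIX):
                raise ValueError("label already starts with the ACE prefix")
            # Punycode-encode and prepend the ACE prefix.
            result = ACE_PREFIX + prepared.encode("punycode").decode("ascii")
    # Final length check: a label is 1 to 63 code points long.
    if not 1 <= len(result) <= 63:
        raise ValueError("label is empty or too long")
    return result


print(to_ascii_sketch("bücher"))  # xn--bcher-kva
```

Python also exposes a ready-made `encodings.idna.ToASCII`; the steps are spelled out here only to show where Nameprep and Punycode fit.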

ToUnicode does the reverse, with the caveat that it cannot fail: if any step goes wrong, the original input is returned instead.
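
The "cannot fail" behavior can be sketched as a thin wrapper. Python's own `encodings.idna.ToUnicode` raises `UnicodeError` when a step fails, so the hypothetical helper below (the name `to_unicode_lenient` is made up for this example) catches that and falls back to the input.

```python
from encodings.idna import ToUnicode  # stdlib IDNA2003 ToUnicode; raises on failure


def to_unicode_lenient(label: str) -> str:
    """Return the Unicode form of a label, or the label unchanged on failure."""
    try:
        return ToUnicode(label)
    except UnicodeError:
        # Any failure (bad Punycode, forbidden code point, failed round trip)
        # means the original input is returned instead.
        return label


print(to_unicode_lenient("xn--bcher-kva"))  # bücher
```

A label with a broken Punycode payload, such as `"xn--!"`, would come back unchanged rather than raising.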

The URL Standard standardizes on IDNA2003 as that is what the most widely deployed clients implement. It does override one requirement, namely to use the latest version of Unicode rather than Unicode 3.2.

The IDNA section of the URL Standard references IDNA2003’s ToASCII and ToUnicode and places appropriate requirements around them. The status quo is now better documented than before. It seems unlikely that clients will update to IDNA2008, as it is not a straightforward replacement (it has nothing equivalent to ToASCII and ToUnicode) and is not backwards compatible.