In hierarchical URLs (e.g. those using the http or ws URL scheme), between the URL scheme and the path there is either a domain name, an IPv4 address, or an IPv6 address. An IPv6 address is denoted by square brackets; whether the host is a domain name or an IPv4 address is determined by best match. Let’s leave IP addresses alone for now and focus on domain names. You know, the bit that looks like example.org.
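To make the host distinction concrete, here is a minimal Python sketch. The function name classify_host is my own, and I am assuming that "best match" means "try IPv4 first, otherwise treat it as a domain name". Browsers accept more IPv4 spellings than Python's ipaddress module does, so take this as an illustration rather than what any browser actually does.

```python
import ipaddress


def classify_host(host: str) -> str:
    """Roughly classify a URL host as 'ipv6', 'ipv4', or 'domain'.

    Hypothetical helper: real URL parsers are more lenient about IPv4
    spellings than the ipaddress module used here.
    """
    # IPv6 addresses are the only hosts wrapped in square brackets.
    if host.startswith("[") and host.endswith("]"):
        ipaddress.IPv6Address(host[1:-1])  # raises ValueError if malformed
        return "ipv6"
    # Assumed reading of "best match": try IPv4, else fall back to domain.
    try:
        ipaddress.IPv4Address(host)
        return "ipv4"
    except ValueError:
        return "domain"


if __name__ == "__main__":
    for host in ("[::1]", "127.0.0.1", "example.org"):
        print(host, "is", classify_host(host))
```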
Quite recently internationalized domain names were changed (the IDNA2003 standard became the IDNA2008 standard in late 2010) and most browsers have not implemented the change. And as Unicode Technical Standard #46 (Unicode IDNA Compatibility Processing) indicates, they have reason not to. Under IDNA2008, ™.com would no longer resolve to tm.com, but rather to xn--y2g.com. The IDNA2003 preprocessing step that mapped and normalized a range of Unicode code points was removed in IDNA2008. As another example, only IDNA2008-capable browsers (currently only Opera) can access faß.de. Everywhere else that would result in a request for fass.de instead. (This mismatch exists because DENIC, the registry in charge of .de domain names, moved to IDNA2008 whereas most browsers have not.)
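The difference is easy to reproduce in Python, whose built-in "idna" codec implements IDNA2003 (RFC 3490), including the mapping step. The second function below, to_ascii_unmapped, is a hypothetical helper of mine that only Punycode-encodes non-ASCII labels as they are. It is a rough stand-in for "no preprocessing", not an IDNA2008 implementation: it skips the code point validation IDNA2008 requires, which may reject some labels outright.

```python
def to_ascii_idna2003(domain: str) -> str:
    """Convert a domain with IDNA2003 rules via Python's built-in codec."""
    # The "idna" codec applies Nameprep (case folding, NFKC) before Punycode.
    return domain.encode("idna").decode("ascii")


def to_ascii_unmapped(domain: str) -> str:
    """Punycode-encode each non-ASCII label without any mapping.

    Hypothetical helper: skips the validation a real IDNA2008
    implementation would perform.
    """
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


if __name__ == "__main__":
    for domain in ("™.com", "faß.de"):
        print(domain, "->", to_ascii_idna2003(domain), "vs", to_ascii_unmapped(domain))
    # ™.com -> tm.com vs xn--y2g.com
    # faß.de -> fass.de vs xn--fa-hia.de
```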
UTS #46 provides a useful overview of this problem and a reasonably clear suggested algorithm for canonicalizing domain names, but which algorithm will end up in browsers is still unclear. For now I will leave this as an open issue in the upcoming URL standard.