Anne van Kesteren

Unicode Normalization

27 February 2009

Unicode is great. It defines a bunch of code points with a plethora of different encodings. One with security issues (UTF-7). One that is good (UTF-8). Two we could have avoided (UTF-16BE and UTF-16LE), but are now more or less stuck with because parts of JavaScript and the DOM are defined in 16-bit units. Got to love that. Two we are trying to kill (in e.g. HTML5) and for which we actually recently removed support in Opera (UTF-32BE and UTF-32LE). After all, just like fighting license proliferation, fighting character encoding proliferation is the good fight. In summary, Web software should use UTF-8, and Web browsers are probably stuck with UTF-16 internally, though using UTF-8 with 16-bit unit indexing internally might be better in theory.
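To make the "16-bit units" remark concrete, here is a small Python sketch that counts how many units the same three characters take in each encoding. The sample string is made up for illustration. The point is only that a character outside the Basic Multilingual Plane takes two UTF-16 units, so indexing by 16-bit unit and indexing by code point give different answers.

```python
# Illustrative sample: 'a', 'ë' (U+00EB), and U+1D11E, which lies outside the BMP.
text = "a\u00eb\U0001d11e"

code_points = len(text)                            # Python strings index by code point
utf8_bytes = len(text.encode("utf-8"))             # 1 + 2 + 4 bytes
utf16_units = len(text.encode("utf-16-le")) // 2   # 1 + 1 + 2 sixteen-bit units
utf32_units = len(text.encode("utf-32-le")) // 4   # one 32-bit unit per code point

print(code_points, utf8_bytes, utf16_units, utf32_units)  # 3 7 4 3
```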

I have been involved in two small battles between the CSS WG and the Internationalization Core WG (i18n WG). The first was about case-insensitive matching. You see, when the grass was green, the earth flat, and US-ASCII the only character encoding that really mattered, case-insensitive matching was a simple matter. A matches a and c matches C. HTTP is in fact still restricted to a very limited character set that can do only slightly more than US-ASCII: ISO-8859-1, also known as Latin-1 or l1, and actually treated by Web browsers as Windows-1252 due to our friends in Redmond. Unicode gave a different meaning to case-insensitive. It would make sense that e.g. ë case-insensitively matches Ë, right? Well yes, and this was the argument from the i18n WG. The thing is though, we were not dealing with a search engine of some sort, but rather with the design of a computer language. And although we get more processing power and such, it is hardly useful to waste that on marginal, complex features given that most of the language is US-ASCII compatible anyway. Worse is better.

The CSS WG ended up making user defined constants (e.g. namespace prefixes) case-sensitive and language constants (e.g. property names) ASCII case-insensitive. Yay for sanity.
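As a rough illustration of what ASCII case-insensitive means, here is a minimal Python sketch. The helper names ascii_lower and ascii_case_insensitive_equal are invented for this example and are not taken from any specification or browser. Only A to Z are folded; everything else has to match exactly.

```python
def ascii_lower(s: str) -> str:
    # Fold only A-Z to a-z; every other code point is left untouched.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    # Two strings match if they are identical after ASCII-only folding.
    return ascii_lower(a) == ascii_lower(b)


print(ascii_case_insensitive_equal("COLOR", "color"))    # True
print(ascii_case_insensitive_equal("\u00cb", "\u00eb"))  # False: Ë and ë are not folded
```

The second result is the trade-off described above: a Unicode-aware match would treat Ë and ë as equal, at the cost of case-mapping tables in every comparison.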

The second battle is going on now and it has been escalated to the near-useless and private Hypertext Coordination Group (Hypertext CG). It started with the i18n WG raising a seemingly innocent Last Call comment against the Selectors draft. It is about comparing strings again. Now some may think that comparing two strings is a simple matter. You ensure that both are in the same encoding (likely UTF-16 because, you know, legacy) and then put the == operator to use. Maybe you lowercase both strings first in case of a case-insensitive match. Well, as it turns out, some people think this should be more complex, because otherwise the matching is biased towards the Western crowd, which is not affected by, drum drum drum, Unicode Normalization. As it turns out, character encoding nonsense is not all there is to Unicode. Also, beware of bridges.

The potential problem here is that two people work on something together and one of them generates NFC HTML content and the other generates NFD CSS content. This problem is highly theoretical, by the way: according to non-scientific studies by Google, NFC dominates Web markup by 99.9999% versus, well, nothing. (Maybe all those pesky non-NFC people tried to cross a bridge before publishing.)
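The mismatch is easy to reproduce with Python's standard unicodedata module. This is only a sketch of the scenario: the class name "café" and the variable names are made up, and no browser compares selectors this way today.

```python
import unicodedata

# The same visible text, produced by two hypothetical authors.
html_class = unicodedata.normalize("NFC", "caf\u00e9")    # é as one code point (U+00E9)
css_selector = unicodedata.normalize("NFD", "caf\u00e9")  # e followed by U+0301

# A plain comparison looks at code points, so the two do not match.
naive_match = html_class == css_selector

# Normalizing both sides to the same form first makes them match.
normalized_match = (
    unicodedata.normalize("NFC", html_class)
    == unicodedata.normalize("NFC", css_selector)
)

print(len(html_class), len(css_selector))  # 4 5
print(naive_match, normalized_match)       # False True
```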

Going further, XML does not normalize, HTML does not normalize, ECMAScript does not normalize, and CSS does not normalize. And nobody has complained so far. Nobody. Well, apart from the i18n WG. Making Web browsers more complex here seems like the wrong solution. What is next, treating U+FF41 as identical to U+0061? Make validators flag non-NFC content, but please do not require huge comparison functions where a simple pointer comparison should do. It is just not worth it.
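For what it is worth, the validator side of that suggestion is cheap to sketch. The functions is_nfc and non_nfc_lines below are hypothetical names for illustration, not part of any existing validator. The last two lines show that NFC leaves U+FF41 alone; only the compatibility forms (NFKC and NFKD) fold it into U+0061.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    # Text is in NFC exactly when normalizing it to NFC changes nothing.
    return unicodedata.normalize("NFC", text) == text


def non_nfc_lines(source: str) -> list:
    # Return the 1-based numbers of the lines a validator would flag.
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if not is_nfc(line)
    ]


# Made-up stylesheet: line 1 uses precomposed é, line 2 uses e + U+0301.
stylesheet = ".caf\u00e9 { color: red }\n.cafe\u0301 { color: blue }"
print(non_nfc_lines(stylesheet))  # [2]

fullwidth_a = "\uff41"
print(unicodedata.normalize("NFC", fullwidth_a) == "a")   # False
print(unicodedata.normalize("NFKC", fullwidth_a) == "a")  # True
```

On Python 3.8 and later, unicodedata.is_normalized("NFC", text) performs the same check as is_nfc without building the normalized copy.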

Comments

Seems like the "how much Web content is NFC" question should really be split into two parts: (1) how much would be different if it were NFC or NFD and (2) how much of that is NFC? It's worth remembering that (1) may change over time due to the regional distribution of Web users changing.
Posted by David Baron at 9:06AM
If I understand it correctly, the whole upper/lower case thing is highly specific to Western scripts anyways, right? As far as I know, most other languages don't even have a concept of case, so doesn't that make the problem case of using NFD and performing case insensitive matches even less likely?
Posted by Martin Probst at 4:17PM
Can't UAs normalize at parse time, instead of doing it in the comparison functions?
Posted by David at 1:50AM
Good points about normalization.
"And nobody complained so far." - who would you complain to and have any hope of making a change? Is there a "Vice President of Normalization" at Microsoft? LOL
I'm not a language specialist, but I am aware of several other languages that not only have case sensitivity but also have special characters and punctuation. However, English is the standard for computing and internet. That may change in the future, but it is true for now.
Posted by Jillian at 5:49AM
Doing normalization at parse time might be feasible. I sort of suspect, though, that it might break scripts that do something with normalization or try to do certain checks. Besides that, is it really necessary? It does not seem necessary now, and if we encourage everyone to use NFC it will not be a problem going forward.
Posted by Anne van Kesteren at 10:13AM
The namespace prefixes being case sensitive comes from XML. The property names being case insensitive comes from HTML, so it is no wonder that the choice was so made.
Your FF41 versus 0061 is a good example of how Unicode, wanting to register the world's characters, has started registering glyphs. Why have half- and full-width forms in the standard anyway?
Worse are the backward-modifying accents, which turn 0061+0308 into 00E4. The standard ought to have done all accents with modifiers, but then it was based on ISO-8859-1.
Note that in RFC3023 text/xml defaults to US-ASCII when no charset is present. This goes to show how silly the standards people are. Either they pay too much attention to history or not enough.
You'll just have to live with it - not for historical reasons, but for hysterical reasons
Posted by BigRat at 8:17PM
Anne, could you give another example of where this proposed normalization might occur? Are you saying it's limited to CSS selectors right now? The only other place I know of where normalization occurs in the browser is in the URI/IRI, when domain names are processed.
Posted by Chrs Weber at 12:27AM
There is a proposal for browsers to perform normalization during comparison when matching selectors. It is not currently done.
Posted by Anne van Kesteren at 12:00AM