Anne van Kesteren

Unicode’s dirty little secret

I guess it’s like that with everything: the moment you get to know something well, you discover its flaws. Unicode seemed a simple concept. You choose one of its encoding forms, preferably UTF-8 for its compatibility with US-ASCII, its use in IRIs, and its ability to represent almost every character that exists. (With the notable exception of Klingon.)

In the past I wrote a quick guide to UTF-8.

So what is Unicode’s dirty little secret that has been safely hidden and that almost no one is talking about? Unicode normalization forms. Apparently, Unicode defines two equivalences between characters: canonical and compatibility equivalence. Unicode Technical Report #15 defines four normalization forms, of which normalization form C (NFC) seems to be the most popular and useful. It is used in the IRI specification and in the Character Model for the World Wide Web.
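For what it is worth, here is a sketch of what normalizing a string to form C could look like in PHP, assuming an implementation is available through the intl extension’s Normalizer class (this is just an illustration, not something this weblog does today):

    <?php
    // Normalize arbitrary UTF-8 input to normalization form C (NFC).
    // Assumes the intl extension (Normalizer class) is installed.
    // Normalizer::normalize() returns false for malformed UTF-8,
    // so fall back to the original string in that case.
    $input = "Äffin";
    $nfc = Normalizer::normalize($input, Normalizer::FORM_C);
    if ($nfc === false) {
        $nfc = $input;
    }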

The other normalization forms can cause data loss when no proper markup or other stylistic information is available for display. One of the examples in the specification is U+FB03, ‘LATIN SMALL LIGATURE FFI’, which normalization forms KD and KC replace with the string ‘ffi’. An analogy made in the specification is that KD and KC are like uppercase and lowercase mappings of characters: such mappings are very useful for searching documents, but some of the meaning can get lost. KD and KC might therefore be useful to search engines for the same reasons, but keep in mind that it is only an analogy. For example, the character ‘Ⅳ’ is turned into ‘IV’ by KD and KC, whereas C (and D) do not decompose it. The difference between C and D is that (example taken straight from the specification) ‘Äffin’ turns into ‘A\u0308ffin’ in D and stays the same in normalization form C.
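To make those differences concrete, here is how the examples behave when run through the four forms; again a sketch assuming the intl extension’s Normalizer class, with the expected results (per the specification) noted in comments:

    <?php
    // The examples above, run through the four normalization forms.
    // Assumes the intl extension (Normalizer class) is available.
    $ligature = "\u{FB03}";     // 'ﬃ', LATIN SMALL LIGATURE FFI
    $roman    = "\u{2163}";     // 'Ⅳ', ROMAN NUMERAL FOUR
    $umlaut   = "\u{00C4}ffin"; // 'Äffin' with a precomposed Ä

    echo Normalizer::normalize($ligature, Normalizer::FORM_KC), "\n"; // "ffi" (three separate letters)
    echo Normalizer::normalize($ligature, Normalizer::FORM_C),  "\n"; // the ligature, unchanged
    echo Normalizer::normalize($roman,    Normalizer::FORM_KD), "\n"; // "IV"
    echo Normalizer::normalize($umlaut,   Normalizer::FORM_D),  "\n"; // "A" + U+0308 + "ffin" (decomposed)
    echo Normalizer::normalize($umlaut,   Normalizer::FORM_C),  "\n"; // "Äffin", unchanged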

And it gets worse. I’m not sure what the state of this public review is at the moment, but currently there are cases where toNFC(toNFC(x)) ≠ toNFC(x). Now fortunately this is almost completely theoretical, but still.
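A quick way to see whether this ever bites you with the data you actually have is to normalize twice and compare; a sketch, again assuming the intl Normalizer (the issue described above is in the standard’s definition, not in any particular API):

    <?php
    // Idempotence check: applying NFC a second time should change nothing.
    $input = "whatever text you are storing";
    $once  = Normalizer::normalize($input, Normalizer::FORM_C);
    $twice = Normalizer::normalize($once, Normalizer::FORM_C);
    var_dump($once === $twice); // bool(true) for virtually all real-world input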

All this means I probably need to have a “Unicode normalizer” for my weblog. Comparable to Charlint, but written in PHP. Any takers?
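In case someone does pick this up, here is a rough sketch of what the entry point could look like; the function name and the fallback behaviour are made up, and it leans on the intl extension’s Normalizer rather than a pure-PHP implementation:

    <?php
    // Hypothetical helper for normalizing weblog comment input to NFC.
    function normalizeCommentText(string $text): string
    {
        // Reject malformed UTF-8 up front; Normalizer::normalize()
        // would return false for it anyway.
        if (!mb_check_encoding($text, 'UTF-8')) {
            return '';
        }
        // Skip the work if the text is already in normalization form C.
        if (Normalizer::isNormalized($text, Normalizer::FORM_C)) {
            return $text;
        }
        $nfc = Normalizer::normalize($text, Normalizer::FORM_C);
        return $nfc === false ? $text : $nfc;
    }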

(With thanks to people who responded on Mark Pilgrim’s post on the issue.)

Comments

  1. Sorry, I missed something.

    Where, in your weblog, do you need to use normalization? That is, why do you care whether "identical" text in different spots on your weblog has the same binary representation?

    This is important for searching, for digital signatures, ...

    What's its importance for you?

    Posted by Jacques Distler at

  2. See this comment by Henri Sivonen and the follow-up comments there. I guess the problem is mostly on the browser side, but it would be nice not to have that problem.

    Posted by Anne at

  3. Wow, this is indeed very interesting. I'm especially interested because, when I'm writing Afrikaans, ʼn can also be written as 'n (ʼn), so this is something I need to keep in mind.

    I guess this all does make sense to a certain extent, but it's getting far too complex for me anyway. ;-)

    But yes, I think it would be very cool if this weblog supported all that, since you never know what kinds of languages you might get.

    Posted by Charl van Niekerk at

  4. Anne, there is a Unicode normalizer written entirely in PHP as part of MediaWiki... which is what I found when I was looking for one. Just Google it, and you can grab it from CVS (I believe it's called "UTFNormal.php").

    I understand that NFC is best for normalizing user-entered data, but I think NFKD is probably closest to the "ideal" subset of Unicode. Personally, I quite like NFKC :D

    Posted by Porges at

  5. Henri Sivonen wrote

    As for real-world usefulness, I have seen a case in the wild where an author using Safari and WordPress had copied and pasted decomposed umlauts into his blog ... No problem was visible in Safari, but in Mozilla it looked ugly.

    So the decomposed and composed forms render differently in Mozilla? Sounds like a Mozilla bug to me. (Or, to be more charitable, a problem with the glyphs available for rendering the decomposed form.)

    Posted by Jacques Distler at

  6. So the decomposed and composed forms render differently in Mozilla?

    Yes.

    Sounds like a Mozilla bug to me.

    I agree. Like I have said before, the Mac gfx is broken.

    Since non-robust software shipped by others is a reality, I think it makes sense to make one’s own software robust when feasible in order to improve the user experience. In the case of PHP, the feasibility is questionable. You start hitting the glass walls with PHP pretty fast. However, I think it would be lazy and sloppy of me not to normalize when I develop in Java, because adding ICU4J to the classpath is not a big deal.

    Posted by Henri Sivonen at