Anne van Kesteren

Encodings: presentation and more big5

The other day I gave a presentation at a Fronteers meetup here in the Netherlands on bytes and code points, primarily to repeat a point I have been making for the past eight years: use utf-8! (Thanks to Sam Ruby and Peter Krefting for indirectly helping with this presentation.) This time it came with some new information on what will go wrong if you don’t, and on which new features will be unavailable to you. I also briefly explained how JavaScript and APIs are the only places where you will be exposed to surrogate pairs and isolated surrogates. As explained in detail by Norbert Lindenberg, this might change, but the additional convenience of working with code points still falls short, as you really want to operate on grapheme clusters.
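To make the difference between code units, code points, and grapheme clusters concrete, here is a minimal sketch. It is written in Python rather than JavaScript, and the helper name utf16_units is made up for illustration; it only simulates the 16-bit code units a JavaScript string would expose.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text, as a JavaScript string would store them.

    Hypothetical helper for illustration. "surrogatepass" lets isolated
    surrogates through instead of raising an error.
    """
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


pile_of_poo = "\U0001F4A9"  # one code point outside the Basic Multilingual Plane
e_acute = "e\u0301"         # "e" followed by COMBINING ACUTE ACCENT
lone = "\ud83d"             # an isolated (unpaired) lead surrogate

# One code point, but two code units: a surrogate pair.
print(len(pile_of_poo), [hex(u) for u in utf16_units(pile_of_poo)])

# Two code points (and two code units), yet a reader sees a single character.
print(len(e_acute), [hex(u) for u in utf16_units(e_acute)])

# An isolated surrogate is a single code unit that is not valid on its own.
print([hex(u) for u in utf16_units(lone)])
```

The second case is why counting code points is not enough: the user-perceived character is a grapheme cluster, and splitting between the "e" and its accent is just as wrong as splitting a surrogate pair.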

Today big5 finally got fully defined in the Encoding Standard. Philip Jägenstedt emailed some more analysis on the big5-uao and big5-2003 extensions of big5. It seems the only extension we want to support is big5-hkscs, which is what the standard defines today, with a slightly restricted encoder.
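As a rough illustration of what decoding and encoding big5 bytes looks like, here is a small sketch using Python's built-in big5hkscs codec. That codec is an assumption of convenience: it is Python's own table, not the one from the Encoding Standard, so the two may disagree on some byte sequences, and it does not model the restricted encoder.

```python
# Bytes 0xA4 0xA4 are the plain big5 encoding of U+4E2D (the character "middle").
raw = b"\xa4\xa4"

# Decode with Python's big5hkscs codec (not the Encoding Standard's big5 table).
text = raw.decode("big5hkscs")
print(hex(ord(text)))

# For a character from the original big5 range, encoding round-trips.
round_tripped = text.encode("big5hkscs")
print(round_tripped == raw)
```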

Apart from some minor bugs, the encoding part of the Encoding Standard is now complete, with work on limited/broad encoding sniffing lined up. We just got a little bit closer to a fully predictable and understandable platform.