Anne van Kesteren

Encodings: status update and big5

17 April 2012

The Encoding Standard now defines nearly all encodings user agents have to support to work with the platform, including all the idiosyncrasies of the decoders and encoders: their indexes, end-of-file handling, and error handling. To recap, a decoder maps bytes to Unicode code points and an encoder does the reverse. Either might make use of an index if no algorithmic conversion to Unicode is possible. Some indexes are huge. Over 9000!

In the end each encoding consists of a decoder and an encoder, but the complexity differs. Single-byte encodings are rather trivial: each byte either maps to a code point via an index, or is in error. iso-8859-3, macintosh, and windows-874 are examples of single-byte encodings. The two algorithmic encodings (utf-8 and utf-16) were not too hard to figure out as they are reasonably well documented (minus some error details). In the remaining encodings one or more bytes map to a code point (either directly or via an index or expression). These are the legacy encodings from China, Japan, and Korea. Determining the exact boundaries and testing/reverse engineering them cross-browser was a rather involved process. Examples would be euc-kr, shift_jis, and hz-gb-2312.
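A rough sketch of the single-byte case in Python, to make the "index or error" idea concrete. It is not the standard's algorithm text: it assumes ASCII bytes pass through unchanged, that the index only covers bytes 0x80 to 0xFF, and that errors become U+FFFD. The index is borrowed from Python's own `iso8859_3` codec as a stand-in, and the function names are made up for this example.

```python
REPLACEMENT = "\ufffd"  # assumption: a decoder error is surfaced as U+FFFD


def build_index(codec):
    """Index for bytes 0x80..0xFF: pointer -> code point, or None if unmapped.

    The table is taken from a Python codec purely as a stand-in; it is not
    guaranteed to match the index the Encoding Standard defines.
    """
    index = []
    for byte in range(0x80, 0x100):
        try:
            index.append(ord(bytes([byte]).decode(codec)))
        except UnicodeDecodeError:
            index.append(None)
    return index


INDEX = build_index("iso8859_3")


def decode_single_byte(data, index=INDEX):
    """Map each byte to a code point via the index, or to an error."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))  # ASCII passes through (assumption)
            continue
        code_point = index[byte - 0x80]
        out.append(REPLACEMENT if code_point is None else chr(code_point))
    return "".join(out)


def encode_single_byte(text, index=INDEX):
    """The reverse direction: look up each code point's pointer in the index."""
    reverse = {cp: pointer + 0x80
               for pointer, cp in enumerate(index) if cp is not None}
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif cp in reverse:
            out.append(reverse[cp])
        else:
            raise ValueError("U+%04X is not in the index" % cp)
    return bytes(out)
```

The multi-byte legacy encodings follow the same pattern, except that the pointer into the index is computed from two or more bytes and the boundary checks are far messier.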

These complex encodings have proprietary extensions that gained widespread use due to the dominance of Internet Explorer. Other browsers copied the extensions over time, reaching a somewhat stable equilibrium. There is one encoding, however, that is worse off. The interpretation of proprietary extensions to big5 is a regional affair. The same byte sequence can have a different meaning to a Taiwanese user and a user from Hong Kong. In Taiwan an extension called "big5-uao" (Unicode-at-on) got traction, whereas in Hong Kong "big5-hkscs" (Hong Kong Supplementary Character Set) is used. Sites, however, typically use "big5" as their label (presumably because only "big5-hkscs" exists as a distinct label and Internet Explorer handles that as "big5").
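A quick way to see this kind of disagreement is to feed the same bytes to several big5 flavours. The sketch below uses the codecs that ship with Python as stand-ins: there is no big5-uao codec in the standard library, so `cp950` (Microsoft's code page for Traditional Chinese) and `big5hkscs` play the regional variants here, and their tables are Python's, not necessarily what any browser does. The sample bytes are simply a pair on which I would expect these variants to differ; `decode_outcome` is a helper made up for this example.

```python
def decode_outcome(data, codec):
    """Return ("ok", text) if `data` decodes under `codec`, else ("error", None)."""
    try:
        return ("ok", data.decode(codec))
    except UnicodeDecodeError:
        return ("error", None)


# Assumption: a two-byte sequence from the area where big5 extensions disagree.
SAMPLE = b"\xf9\xfe"

# Python codec names used as stand-ins for the regional interpretations.
CODECS = ("big5", "cp950", "big5hkscs")

if __name__ == "__main__":
    for codec in CODECS:
        status, text = decode_outcome(SAMPLE, codec)
        if status == "ok":
            # Print code points rather than glyphs so the result is font-independent.
            shown = " ".join("U+%04X" % ord(ch) for ch in text)
        else:
            shown = "(decoder error)"
        print("%-10s %s" % (codec, shown))
```

Running the same comparison over bytes harvested from real pages labelled "big5" is essentially what is needed to decide which interpretation breaks the fewest sites.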

In Hong Kong, Microsoft has for a while provided a patch for Windows that changed system fonts and may or may not have changed the "big5" index in order to support "big5-hkscs" under the "big5" label. In Taiwan something similar happened for "big5-uao".

For the Encoding Standard I went through the dotnetdotcom.org data with help from Simon (part 1 and part 2). Being more fluent in Chinese, Philip Jägenstedt happily took over, analysed my data and even gathered more. 呂康豪 (Kang-Hao Lu) is doing the same and both are reporting progress on public-html-ig-zh@w3.org. I hope the conclusion will be that we can define a single "big5" at the cost of breaking a few pages, as regional differences for decoding royally suck, but time will tell.