(If you found this post wondering what encoding to use on the Web: UTF-8. UTF-8 is your only sensible choice.)
For my job at Opera I am currently researching character encodings (aside from the normal work). As with most Web technology, the details turn out to be quite complicated, not at all implemented in the same manner by different user agents, and, on top of that, poorly documented. The reason I started looking into this was that several Japanese sites on the Web were labelled with EUC_JP as their encoding in the HTTP Content-Type header, but in the HTML they were labelled as UTF-8 (or similar). Now, EUC_JP is not a character encoding, but EUC-JP is. Opera matched the two because of the charset alias matching rules of UTS #22. Firefox and Internet Explorer implement stricter matching rules and therefore do not recognize EUC_JP. (Because of feedback from us, HTML5 now aligns with Internet Explorer in its requirements for charset matching rules. Opera plans to comply.)
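To illustrate the difference between the two approaches, here is a rough sketch in Python of loose, UTS #22-style label matching (lowercase everything and ignore non-alphanumeric characters; the real algorithm has further rules, e.g. for leading zeros in numbers, which are omitted here) next to a stricter comparison:

```python
import re

def loose_normalize(label: str) -> str:
    # Sketch of UTS #22-style loose matching: lowercase the label and
    # strip everything that is not an ASCII letter or digit.
    return re.sub(r"[^a-z0-9]", "", label.lower())

def loose_match(a: str, b: str) -> bool:
    # "EUC_JP", "EUC-JP", and "eucjp" all normalize to "eucjp".
    return loose_normalize(a) == loose_normalize(b)

def strict_match(a: str, b: str) -> bool:
    # Stricter matching: only an ASCII case-insensitive comparison,
    # so punctuation differences are significant.
    return a.lower() == b.lower()
```

Under the loose rules loose_match("EUC_JP", "EUC-JP") is true, while strict_match("EUC_JP", "EUC-JP") is false, which is exactly the behavioral difference described above.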
Let me give a more complete introduction to character encodings. The way character encodings work is that given a byte stream and a character encoding label (such as UTF-8 or Windows-1252), a conversion process is applied turning the byte stream into a (Unicode) character stream. This process can also be reversed, e.g. when submitting a form or saving a file to disk. The exact conversion process depends on the character encoding label and unfortunately also on the implementation that is being used. The latter is especially true for the more obscure encodings. What further varies between implementations is the list of character encodings that are supported and the labels associated with those encodings (an encoding can have several labels, e.g. l1, latin1, and ISO-8859-1 all represent the same encoding).
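Python's codec registry makes for a convenient demonstration of both points, since it too maps multiple labels (including l1 and latin1) onto one encoding:

```python
# Decoding a byte stream into a (Unicode) character stream.
data = "héllo".encode("ISO-8859-1")  # the bytes on the wire

# Several labels resolve to the same underlying encoding:
assert data.decode("l1") == data.decode("latin1") == data.decode("ISO-8859-1")

# The process can be reversed, e.g. when a form is submitted:
assert "héllo".encode("latin1") == data
```

Note that which labels are recognized is itself implementation-specific; Python's alias list is not the same as any particular browser's.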
Now, before such a conversion process can even happen for a given resource, the character encoding label has to be found. The exact way a character encoding label is found depends on the media type of the resource. Typically the charset parameter of the Content-Type header is taken into account and, when specified, is authoritative over other declarations (i.e. it wins). If the charset parameter does not provide an encoding (or its value is not recognized), "sniffing" of the entity body occurs. The sniffing algorithm depends on the type of resource: for XML resources (including media types such as application/xhtml+xml) the XML declaration is used, for CSS resources the @charset construct, and for HTML resources the meta element. Most resource types also accept a BOM at the start of the file, i.e. a couple of bytes that indicate you are dealing with UTF-16BE, UTF-16LE, or UTF-8.
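The precedence order described above can be sketched as a small Python function. This is an illustration of the general shape, not any particular browser's algorithm; the charset-parameter parsing in particular is deliberately naive:

```python
import codecs
from typing import Optional

def detect_label(content_type: Optional[str], body: bytes) -> Optional[str]:
    # 1. The charset parameter of Content-Type is authoritative.
    #    (Naive parsing for illustration; a real header parser is more involved.)
    if content_type and "charset=" in content_type:
        return content_type.split("charset=")[1].split(";")[0].strip().strip('"')
    # 2. A byte order mark at the start of the entity body.
    if body.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if body.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if body.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    # 3. Otherwise, type-specific sniffing (XML declaration, @charset,
    #    meta element) would run here; omitted in this sketch.
    return None
```

For example, detect_label("text/html; charset=EUC-JP", body) returns "EUC-JP" regardless of what the body says, since the header wins.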
If no character encoding label can be found, user agents typically resort to content heuristics to guess the encoding. These heuristics differ among user agents as well. If that fails too, some kind of default is used, often based on the region the user agent was distributed in. I and others have been trying to document the issues and current implementations on the WHATWG Wiki: Web Encodings. The goal is to reduce the amount of implementation-specific detail and make encoding detection and support more consistent across implementations. The result should be more predictable rendering and the ability for tools other than Web browsers to better deal with the content out there. It also allows new Web browsers to enter the market more easily.
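A minimal version of such a heuristic, assuming a Western-locale default of windows-1252 (real detectors score many candidate encodings, which is far beyond this sketch):

```python
def guess_encoding(body: bytes, locale_default: str = "windows-1252") -> str:
    # Valid UTF-8 is a strong signal: bytes from legacy encodings
    # rarely happen to form well-formed UTF-8 sequences.
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Fall back to a region-dependent default; windows-1252 is a
        # common choice for Western locales (an assumption here).
        return locale_default
```

So b"h\xc3\xa9llo" would be guessed as UTF-8, while b"h\xe9llo" would fall through to the locale default.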
If only UTF-8 had been the accepted encoding well before the Web took off, we would not be in this mess, but alas. Fortunately, trying to work out how it all fits together and then documenting it has a certain charm to it.