Anne van Kesteren

Encodings: Tested three-hundred-eight labels!

21 December 2010

A fortnight ago I did some research into 8-bit encoding labels and published a few findings here on my blog. Last week I managed to get a whole lot further. I now have a huge document listing eighty-six tables accurately mapping 8-bit labels and their encodings among seven browsers. (I tested Opera and Safari twice as their Windows and Mac mappings are different.)

I also published all the scripts I created and data I gathered for this research in my Bitbucket account: annevk / webencodings. The file named table.html is the huge document mentioned above. I will briefly outline the setup below so that it can be reproduced.

eightbitlabels.json contains a list of three-hundred-eight lowercased encoding labels collected from the WHATWG Wiki Web Encodings project page. These labels are used by gatherdata.html to figure out in a browser, for each label, which octet maps to which code point. If a label produces a mapping identical to that of another label, it is considered an alias. Since these are all assumed to be 8-bit labels, octets %00-%FF should always produce 256 code points. (I did not find any surrogate code points in any browser for these labels.)
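To make the alias idea concrete, here is a minimal sketch in Python rather than in a browser. It assumes Python's own codec tables as a stand-in for a browser's decoder, so the groups it finds need not match any browser's. The function names octet_mapping and group_aliases are made up for this illustration and are not part of the published scripts.

```python
from collections import defaultdict


def octet_mapping(label):
    """Return the 256 code points that octets 0x00-0xFF decode to under `label`."""
    return tuple(
        # errors="replace" turns undefined octets into U+FFFD instead of raising.
        ord(bytes([octet]).decode(label, errors="replace"))
        for octet in range(256)
    )


def group_aliases(labels):
    """Group labels whose 256-entry mappings are identical."""
    groups = defaultdict(list)
    for label in labels:
        groups[octet_mapping(label)].append(label)
    return list(groups.values())


# Labels that end up in the same inner list are aliases of one another
# (according to Python's codecs, not according to any browser).
print(group_aliases(["iso-8859-1", "latin1", "windows-1252", "cp1252"]))
```

Comparing the whole 256-entry tuple is what makes two labels aliases here; a label that differs in even one octet ends up in its own group.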

The mapping is found by looking at the response of a request made using XMLHttpRequest to raw.php. Unfortunately, using an iframe generates problems for certain labels. In fact, even using XMLHttpRequest can generate exceptions on reading responseText in Firefox and Internet Explorer (search for “fatal error” in the overview), but these are easier to deal with. (A way to work around this would be to remove the problematic labels. So far I have not found a need to do that, however.)

Once gatherdata.html has found the mappings and which labels are aliases it does an HTTP POST request to storedata.php with all the data encoded as JSON and passes along the name of the browser. This PHP script will then generate output-mac-chrome.json if the browser was Chrome on the Mac. (Make sure the directory has the write permissions set and all.)
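The storage step is done by a PHP script, but the naming scheme can be sketched in a few lines of Python. This is only an illustration under assumptions: the function store_data, its parameters, and the payload shape are hypothetical, since the actual JSON structure is not described here.

```python
import json
import tempfile
from pathlib import Path


def store_data(directory, platform, browser, data):
    """Write one browser's results to output-<platform>-<browser>.json."""
    path = Path(directory) / f"output-{platform}-{browser}.json"
    path.write_text(json.dumps(data), encoding="utf-8")
    return path


# Hypothetical payload; the real JSON structure is not described in the post.
payload = {"aliases": {"latin1": ["iso-8859-1"]}}
with tempfile.TemporaryDirectory() as tmp:
    written = store_data(tmp, "mac", "chrome", payload)
    print(written.name)
```

A real script would also want to restrict the platform and browser names to a known list before using them in a file name.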

The final piece of the puzzle is processoutput.py, which takes all the output-*.json files and creates a nice HTML document. To reduce the total amount of data it does a fair amount of filtering so that you end up with “only” eighty-six data tables. Given that the intersection of browser data yields over a hundred and fifty unique encodings and over a hundred and eighty sets of label aliases, I suppose it is not too bad.

Basically it goes through all three-hundred-eight labels and finds label aliases common among all browsers for each of them. Then labels already flagged as aliases are ignored. And labels mapping to an encoding that has already been seen are ignored as well. This does mean that if you want to look up e.g. "windows-1252" you have to look in three different places. This is because in Opera and Firefox "ascii" is an alias for "windows-1252". In Internet Explorer "ascii" is special and "windows-1252" is an alias for "x-user-defined" whereas in Chrome and Safari "ascii" and "windows-1252" are treated like any other label they do not recognize. (Which happens to match what Opera and Firefox do for these two labels, but Opera and Firefox have a different fallback for unrecognized labels.) It is only this bad for a few labels, though.
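The first two filtering steps can be sketched roughly as follows. This is not the code from processoutput.py; the function select_labels and the shape of its input (per browser, a set of aliases for each label, including the label itself) are assumptions made for the example. The third step, skipping encodings already seen, is left out because it depends on how an encoding is compared across browsers.

```python
def select_labels(labels, aliases):
    """Pick the labels that are not already covered as a common alias.

    aliases[browser][label] is assumed to be the set of labels (including
    `label` itself) that the browser maps identically to `label`.
    """
    browsers = sorted(aliases)
    flagged = set()  # labels already covered as an alias of an earlier label
    selected = []
    for label in labels:
        if label in flagged:
            continue
        # Only aliases that every browser agrees on count as common.
        common = set.intersection(*(aliases[b][label] for b in browsers))
        flagged |= common
        selected.append((label, sorted(common)))
    return selected


# Toy data with made-up browsers and labels.
toy = {
    "browser_a": {"x": {"x", "y"}, "y": {"x", "y"}, "z": {"z"}},
    "browser_b": {"x": {"x", "y", "z"}, "y": {"x", "y", "z"}, "z": {"x", "y", "z"}},
}
print(select_labels(["x", "y", "z"], toy))
```

In the toy data "z" is an alias of "x" in only one of the two browsers, so it still gets its own entry, which is the same effect that makes "windows-1252" show up in more than one place.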

Tips to improve the situation, especially as patches to processoutput.py, are most definitely welcome! Or even better, fork my Bitbucket project and publish results on your own site.