Anne van Kesteren

Encodings: non-aliases of ISO-8859-1

Today I created a simple script that takes a list of 7- and 8-bit character encoding labels. It applies each label to a resource containing the octets in the range %01 to %FF and then identifies which labels have a unique mapping to code points and which are duplicates of other labels. 7- and 8-bit encodings make for easy testing because each byte maps to exactly one code point. At some point multibyte encodings will have to be tested as well, but this makes for a pretty good start.
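The grouping idea can be sketched outside the browser. The following is a rough illustration, not the script described above: it uses Python's own codec tables (which differ from what any browser does), and the names `fingerprint` and `group_labels` are made up for this example. Each label is reduced to the sequence of code points it produces for the octets 0x01 to 0xFF, and labels with identical sequences are treated as duplicates.

```python
from collections import defaultdict

# The octets 0x01 through 0xFF, as in the test resource.
OCTETS = bytes(range(0x01, 0x100))


def fingerprint(label):
    """Return the code points a label yields for OCTETS, or None if unknown.

    Undefined octets decode to U+FFFD because of errors="replace".
    """
    try:
        text = OCTETS.decode(label, errors="replace")
    except LookupError:
        # Python does not recognize this label at all.
        return None
    return tuple(ord(ch) for ch in text)


def group_labels(labels):
    """Group labels that produce the same byte-to-code-point mapping."""
    groups = defaultdict(list)
    unknown = []
    for label in labels:
        fp = fingerprint(label)
        if fp is None:
            unknown.append(label)
        else:
            groups[fp].append(label)
    return list(groups.values()), unknown


if __name__ == "__main__":
    labels = ["iso-8859-1", "latin1", "cp819", "windows-1252", "cp1252"]
    groups, unknown = group_labels(labels)
    print(len(groups), "distinct encodings")
    for group in groups:
        print(group)
```

The number of groups corresponds to the count of distinct encodings for a given implementation. In a browser the same comparison would have to be done on the decoded response text instead.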

I found out that Internet Explorer treats the label "iso-8859-1" in a very special way. It is the only alias that maps to ISO-8859-1 over XMLHttpRequest. Indeed, "cp819", "latin1", "iso-ir-100", and others all map to Windows-1252 instead. I will have to test loading through an iframe tomorrow.
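To make the distinction concrete: ISO-8859-1 and Windows-1252 only disagree on the octets 0x80 to 0x9F, so a single octet such as 0x80 is enough to tell which mapping was applied (U+0080 versus U+20AC). A small sketch, again using Python's codecs as a stand-in for a browser, with `differing_octets` being a hypothetical helper name:

```python
def differing_octets(label_a, label_b):
    """Return the octets in 0x01-0xFF that the two labels decode differently."""
    diff = []
    for octet in range(0x01, 0x100):
        raw = bytes([octet])
        # errors="replace" turns undefined octets into U+FFFD instead of raising.
        a = raw.decode(label_a, errors="replace")
        b = raw.decode(label_b, errors="replace")
        if a != b:
            diff.append(octet)
    return diff


if __name__ == "__main__":
    diff = differing_octets("iso-8859-1", "windows-1252")
    print([hex(octet) for octet in diff])
    # A single probe octet is enough to tell the two apart.
    print(hex(ord(b"\x80".decode("iso-8859-1"))))
    print(hex(ord(b"\x80".decode("windows-1252"))))
```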

Overall the results show a lot of diversity among browsers and therefore, I think, room for improvement. Of the 212 labels I selected, Opera gives 34 distinct encodings, Firefox 53, Chrome 28, Safari 46, and Internet Explorer 44. Messy. Next is increasing the number of labels and finding a way to present the data so it becomes clearer where the bugs are.

(I plan on publishing these scripts when they are more stable. Feel free to contact me if you want to have them now.)