Via Fronteers I discovered that even now not everyone is convinced of the merits of UTF-8. A little over five years ago I wrote a quick guide to UTF-8, and it seemed worthwhile to note some technical points I have become aware of since then as to why using UTF-8 is a good idea.
I tend to think it is pretty self-explanatory that you want to use some encoding that can encode all of Unicode. After all, at some point you might want to sell your software abroad, you might want to accept comments in any given language, or accept any kind of user-contributed content for that matter, and you simply do not want to keep an encoding label around every time you deal with a string. Given that people are trying to phase out UTF-7 (security issues) and UTF-32 (bloat) (both gone in Opera 10), this leaves UTF-8 and UTF-16 as options.
Here are two reasons to use UTF-8:
XMLHttpRequest: the query component of a URL is always encoded in UTF-8, which could result in confusion if you have the same link in the page and in a script. If you need to process such links on the server, or want to link to external pages, the easiest option is to simply align with the encoding of URLs, i.e. by using UTF-8.
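To illustrate the mismatch (a small Python sketch of my own, not from the original post): the same query value percent-encodes to different bytes depending on the encoding used, which is where the confusion between page links and script-issued requests comes from.

```python
from urllib.parse import quote

# A query value containing a non-ASCII character.
value = "café"

# Scripts percent-encode query components as UTF-8...
print(quote(value, encoding="utf-8"))    # caf%C3%A9

# ...while a link in a Latin-1 page may be encoded with the page's
# encoding instead, yielding different bytes for the same URL.
print(quote(value, encoding="latin-1"))  # caf%E9
```

Serving the page as UTF-8 makes both paths produce the same bytes.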
XMLHttpRequest: the encoding XMLHttpRequest uses for text strings when sending data to the server is always UTF-8. This means your server had better deal with UTF-8 input correctly. Always using UTF-8 again means less work for you, since you do not have to figure out whether the request came from a form element or an XMLHttpRequest object.
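As a rough server-side sketch (Python, my own illustration, not from the post): when the page and its scripts all use UTF-8, one decode path handles every request body, whether it came from a form or from XMLHttpRequest.

```python
def decode_body(raw: bytes) -> str:
    # With everything served as UTF-8, form submissions and
    # XMLHttpRequest payloads arrive in the same encoding, so a
    # single decode suffices; no per-request sniffing is needed.
    return raw.decode("utf-8")

# "héllo" sent by XMLHttpRequest is always UTF-8 on the wire:
print(decode_body("héllo".encode("utf-8")))  # héllo
```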
Here are two reasons to use UTF-8 over UTF-16:
On the ietf-charsets@iana.org mailing list, Erik van der Poel (Google) briefly describes a security issue with UTF-16 on the Web and says: "Google Web Search has stopped serving UTF-16." To be fair, the issue is actually in Internet Explorer, though obviously that does not make UTF-16 any less dangerous given IE's market share.
Talking about people being against UTF-8... what about the Japanese? I know they've traditionally been against Unicode, but why, really?
I believe Han unification (unified CJK characters) is seen as the main issue. Given that more and more software normalizes everything to Unicode internally nowadays, that point is becoming moot though.
BTW, even if you don't care at all about languages other than English, there are still many reasons to ♥ Unicode, e.g. you may want to allow your users to be creative in their comments: ★ ✿ ☮ ← ↑ → ↓ © ™ € ±x² ≤ ½ ☻ ☺ ► ♪ ♫ ◆ ♂ ♀ …
i � unicode
☺
Additional reason: if you don’t use UTF-8 and you have a form, and someone enters a character that your chosen encoding can’t encode, the browser submits it as a numeric character reference (NCR), so it’s impossible to tell on the server side whether they entered the character itself or a literal string that happens to look like an NCR for that character.
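The ambiguity is easy to reproduce (a Python sketch; the `xmlcharrefreplace` error handler mimics the NCR fallback browsers apply to un-encodable form characters):

```python
# U+263A WHITE SMILING FACE cannot be encoded in Latin-1, so the
# fallback substitutes a numeric character reference:
submitted = "☺".encode("latin-1", errors="xmlcharrefreplace")

# A user who literally typed the string "&#9786;" produces the
# exact same bytes:
typed = "&#9786;".encode("latin-1")

print(submitted == typed)  # True: the server cannot tell them apart
```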
What I don't get about this discussion is why this is a topic at all. Aren't browser vendors free to read the input of the page correctly (i.e. in UTF-16, SHIFT-JIS, or whatever), and then transcode that to UTF-8 for internal storage?
I know JavaScript is standardized in terms of UTF-16, but it should be possible to implement that using UTF-8 and be completely transparent, or not?
Browsers would still need to understand UTF-16, but hey, that's not really that big a deal, is it?
Martin, this is why Web developers should use UTF-8. Existing browsers will likely be stuck with using UTF-16 internally.
I read about an issue with Unicode regarding regexes in Friedl's book. A character can be written as two code points, like an e (U+0065) followed by a combining acute accent (U+0301) to form the single character é, and usually (but not always) the same character also exists as a single precomposed code point (U+00E9).
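The two forms the comment describes can be compared with Python's unicodedata module (my example, not Friedl's): the decomposed sequence and the precomposed character compare unequal until you normalize them.

```python
import unicodedata

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Code point by code point they differ, which is what trips up
# naive regexes and string comparisons:
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1

# Normalizing both to NFC (or NFD) makes them comparable:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```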
I didn't see anything on Fronteers' blog about people not using utf-8, but I'm stuck with a colleague who set up the database and servers and everything in Latin-1, while all the XML we get from clients is in utf-8 and my HTML pages are saved as utf-8. It's stupid. It should all have been utf-8, especially when we all knew we were going into international markets : ( And anything new we make, he sets to Latin-1 every time. Maybe it's just something people are used to?
Oh I miss ANSI... ░▒▓██▓▒░