Anne van Kesteren

Technical reasons to use UTF-8

Via Fronteers I discovered that even now not everyone is convinced of the merits of UTF-8. A little over five years ago I wrote a quick guide to UTF-8, so it seemed worthwhile to set out some technical points I have become aware of since then as to why using UTF-8 is a good idea.

I tend to think it is pretty self-explanatory that you want to use an encoding that can encode all of Unicode. After all, at some point you might want to sell your software abroad, accept comments in any given language, or accept any kind of user-contributed content for that matter, and you simply do not want to keep an encoding label around every time you deal with a string. Given that people are trying to phase out UTF-7 (security issues) and UTF-32 (bloat), both gone in Opera 10, this leaves UTF-8 and UTF-16 as options.

Here are two reasons to use UTF-8 over UTF-16:

Comments

  1. Talking about people being against UTF-8... what about the Japanese? I know that they have traditionally been against Unicode, but really, why?

    Posted by Dustin Wilson at

  2. I believe unified characters (Han unification) are seen as the main issue. Given that more and more software normalizes everything to Unicode internally nowadays, that point is becoming moot though.

    Posted by Anne van Kesteren at

  3. BTW, even if you don't care at all about languages other than English, there are still many reasons to ♥ Unicode, e.g. you may want to allow your users to be creative in their comments: ★ ✿ ☮ ← ↑ → ↓ © ™ € ±x² ≤ ½ ☻ ☺ ► ♪ ♫ ◆ ♂ ♀ …

    Posted by Lino Mastrodomenico at

  4. i � unicode

    Posted by Miles at

  5. Additional reason: If you don’t use UTF-8 and you have a form and someone enters a character that the encoding you used can’t encode, it’s impossible to tell on the server side whether they entered the character or a string that looks like an NCR for the character. (A sketch of this follows after the comments.)

    Posted by Henri Sivonen at

  6. What I don't get about this discussion is why this is a topic at all. Aren't browser vendors free to read the input of the page correctly (e.g. as UTF-16, Shift_JIS, or whatever), and then transcode that to UTF-8 for internal storage?

    I know JavaScript is standardized in terms of UTF-16, but it should be possible to implement that using UTF-8 and be completely transparent, or not?

    Browsers would still need to understand UTF-16, but hey, that's not really much of a deal, is it?

    Posted by Martin Probst at

  7. Martin, this is why Web developers should use UTF-8. Existing browsers will likely be stuck with using UTF-16 internally.

    Posted by Anne van Kesteren at

  8. I read about an issue with Unicode regarding regexes in Friedl's book. A character can be written as two characters (like an e (U+0065) followed by a combining acute accent (U+0301) to make the single character é) and usually (but not always) there is also a single precomposed character (U+00E9) for the same thing. (There is a sketch of this after the comments.)

    I didn't see anything at Fronteers' blog about people not using utf-8, but I'm stuck with a colleague who set the database and servers and everything up in Latin-1, while all the XML we get from clients is in utf-8 and my HTML pages are saved as utf-8. It's stupid. It should have all been utf-8, especially when we all knew we were going into international markets : ( And anything new we make, he sets as Latin-1 every time. Maybe it's just something people are used to??

    Posted by Stomme poes at

  9. Oh I miss ANSI... ░▒▓██▓▒░

    Posted by Jonny at
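
A minimal sketch of the problem Henri describes in comment 5, in TypeScript. The string literals are hypothetical form payloads; the point is that once a non-UTF-8 form has turned an unencodable character into a numeric character reference, the server can no longer tell it apart from a user who literally typed that reference.

    // Hypothetical payloads illustrating comment 5. When a form's encoding
    // cannot represent a character (here U+2603 SNOWMAN), browsers replace it
    // with a numeric character reference before submitting.
    const typedByUser = "I like \u2603";          // what the user actually typed
    const submittedViaLatin1 = "I like &#9731;";  // what a Latin-1 form submits
    const typedLiterally = "I like &#9731;";      // a user who typed the NCR itself

    // Both payloads arrive at the server byte-for-byte identical, so the
    // original input cannot be recovered. A UTF-8 form submits the real
    // character and the ambiguity never arises.
    console.log(typedByUser);                            // "I like ☃" — only survives with a UTF-8 form
    console.log(submittedViaLatin1 === typedLiterally);  // true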
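
And a sketch of the combining-character point from comment 8 (again TypeScript; the literals are illustrative): the same visible character é can be either the precomposed code point U+00E9 or the sequence U+0065 plus the combining acute accent U+0301, and plain string or regex comparisons only agree after normalization.

    // "é" as one precomposed code point versus base letter plus combining accent.
    const precomposed = "\u00e9";  // é, single code point
    const decomposed = "e\u0301";  // e + COMBINING ACUTE ACCENT

    console.log(precomposed === decomposed);                   // false: different code points
    console.log(precomposed === decomposed.normalize("NFC"));  // true after normalization
    console.log(/^\u00e9$/.test(decomposed));                  // false unless the input is normalized first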