Via Fronteers I discovered that even now not everyone is convinced of the merits of UTF-8. A little over five years ago I wrote a quick guide to UTF-8, and it seemed worthwhile to note some technical points I have become aware of since then as to why using UTF-8 is a good idea.
I tend to think it is pretty self-explanatory that you want to use some encoding that can encode all of Unicode. After all, at some point you might want to sell your software abroad, you might want to accept comments in any given language, or accept any kind of user-contributed content for that matter, and you simply do not want to keep an encoding label around every time you deal with a string. Given that people are trying to phase out UTF-7 (security issues) and UTF-32 (bloat) (both gone in Opera 10), this leaves UTF-8 and UTF-16 as options.
Here are two reasons to use UTF-8:
XMLHttpRequest: the query component of a URL is always encoded in UTF-8, which could result in confusion if you have the same link in the page and in a script. If you need to process such links on the server, or want to link to external pages, the easiest option is to simply align with the encoding of URLs, i.e. by using UTF-8.
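To illustrate the mismatch (a small Python sketch of my own, not from the original post): the same query value percent-encodes to different bytes depending on the encoding used, which is where the confusion between page links and script-issued requests comes from.

```python
from urllib.parse import quote

# A query value containing a non-ASCII character.
value = "café"

# Scripts percent-encode query components as UTF-8...
print(quote(value, encoding="utf-8"))    # caf%C3%A9

# ...while a link in a Latin-1 page may be encoded with the page's
# encoding instead, yielding different bytes for the same URL.
print(quote(value, encoding="latin-1"))  # caf%E9
```

Serving the page as UTF-8 makes both paths produce the same bytes.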
XMLHttpRequest: the encoding XMLHttpRequest uses for text strings when sending data to the server is always UTF-8. This means your server had better deal with UTF-8 input correctly. Always using UTF-8 again means less work for you, since you do not have to figure out whether the request came from a form element or an XMLHttpRequest object.
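As a rough server-side sketch (Python, my own illustration, not from the post): when the page and its scripts all use UTF-8, one decode path handles every request body, whether it came from a form or from XMLHttpRequest.

```python
def decode_body(raw: bytes) -> str:
    # With everything served as UTF-8, form submissions and
    # XMLHttpRequest payloads arrive in the same encoding, so a
    # single decode suffices; no per-request sniffing is needed.
    return raw.decode("utf-8")

# "héllo" sent by XMLHttpRequest is always UTF-8 on the wire:
print(decode_body("héllo".encode("utf-8")))  # héllo
```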
Here are two reasons to use UTF-8 over UTF-16:
On the ietf-charsets@iana.org mailing list, Erik van der Poel (Google) briefly describes a security issue with UTF-16 on the Web and says: "Google Web Search has stopped serving UTF-16." To be fair, the issue is actually in Internet Explorer, though obviously that does not make UTF-16 any less dangerous given IE's market share.
Talking about people being against UTF-8... what about the Japanese? I know they've traditionally been against Unicode, but why, really?
I believe Han unification (unified CJK characters) is seen as the main issue. Given that more and more software normalizes everything to Unicode internally nowadays, that point is becoming moot though.
BTW, even if you don't care at all about languages other than English, there are still many reasons to ♥ Unicode, e.g. you may want to allow your users to be creative in their comments: ★ ✿ ☮ ← ↑ → ↓ © ™ € ±x² ≤ ½ ☻ ☺ ► ♪ ♫ ◆ ♂ ♀ …
i � unicode
☺
Additional reason: if you don’t use UTF-8 and you have a form, and someone enters a character that your chosen encoding can’t encode, the browser submits it as a numeric character reference (NCR), so it’s impossible to tell on the server side whether they entered the character itself or a literal string that happens to look like an NCR for that character.
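The ambiguity is easy to reproduce (a Python sketch; the `xmlcharrefreplace` error handler mimics the NCR fallback browsers apply to un-encodable form characters):

```python
# U+263A WHITE SMILING FACE cannot be encoded in Latin-1, so the
# fallback substitutes a numeric character reference:
submitted = "☺".encode("latin-1", errors="xmlcharrefreplace")

# A user who literally typed the string "&#9786;" produces the
# exact same bytes:
typed = "&#9786;".encode("latin-1")

print(submitted == typed)  # True: the server cannot tell them apart
```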
What I don't get about this discussion is why this is a topic at all. Aren't browser vendors free to read the input of the page correctly (i.e. in UTF-16, SHIFT-JIS, or whatever), and then transcode that to UTF-8 for internal storage?
I know JavaScript is standardized in terms of UTF-16, but it should be possible to implement that using UTF-8 and be completely transparent, or not?
Browsers would still need to understand UTF-16, but hey, that's not really that big a deal, is it?
Martin, this is why Web developers should use UTF-8. Existing browsers will likely be stuck with using UTF-16 internally.
I read about an issue with Unicode regarding regexes in Friedl's book. A character can be written as two code points, like an e (U+0065) followed by a combining acute accent (U+0301) to form the single character é, and usually (but not always) the same character also exists as a single precomposed code point (U+00E9).
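The two forms the comment describes can be compared with Python's unicodedata module (my example, not Friedl's): the decomposed sequence and the precomposed character compare unequal until you normalize them.

```python
import unicodedata

decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Code point by code point they differ, which is what trips up
# naive regexes and string comparisons:
print(decomposed == precomposed)          # False
print(len(decomposed), len(precomposed))  # 2 1

# Normalizing both to NFC (or NFD) makes them comparable:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```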
I didn't see anything on Fronteers' blog about people not using utf-8, but I'm stuck with a colleague who set up the database and servers and everything in Latin-1, while all the XML we get from clients is in utf-8 and my HTML pages are saved as utf-8. It's stupid. It should all have been utf-8, especially when we all knew we were going into international markets : ( And anything new we make, he sets to Latin-1 every time. Maybe it's just something people are used to?
Oh I miss ANSI... ░▒▓██▓▒░