Anne van Kesteren

Quick guide to UTF-8

Even if you are not a software developer, Unicode is still important. Especially on the web, where there is quite some trouble with character encodings. UTF-8, also known as the 8-bit Unicode Transformation Format, solves most of these problems, since it can handle every single character in existence and covers new characters as they are standardized. Another advantage is that the first 128 (2^7) characters cover us-ascii, the most commonly used encoding in western languages. us-ascii is also identical to the first 128 characters of the iso-8859-* series. Therefore it is easy for people publishing in western languages to switch to utf-8 encoding: you just change the header and re-encode the few characters that are no longer displayed correctly. Of course, loading the text in your editor and clicking the "convert to utf-8" option will work perfectly as well, especially if you have a lot of special characters in your database or if you want to convert non-western characters, which usually don't map to us-ascii at all.

One of the most important aspects of conversion is understanding that just changing your HTTP headers isn't going to solve your encoding problems; it actually creates them. You need to re-encode the characters that don't map to utf-8, since a character that takes one byte in one encoding may take two bytes in utf-8, or use a different byte sequence altogether. If your site uses a database you could export all your data to a single file, save that file as utf-8 and overwrite the existing data in the database. After that, change the charset parameter of the Content-Type header to utf-8 and everything will work as expected. There are various ways to do this. You could add a line to your .htaccess file: AddDefaultCharset utf-8, or you could use the PHP header function: header("content-type:text/html;charset=utf-8");. If you don't have access to either, you could choose the most evil of all solutions, although you really want to avoid the http-equiv attribute always and everywhere:

<meta http-equiv="content-type" content="text/html;charset=utf-8">
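To illustrate the one-byte-versus-two-bytes point: the sketch below uses Python purely for illustration (the actual conversion can be done with any tool or editor), showing why exported iso-8859-1 data has to be re-encoded rather than just relabeled.

```python
# The same character occupies different byte sequences in different
# encodings: "é" is the single byte 0xE9 in iso-8859-1, but the two
# bytes 0xC3 0xA9 in utf-8.
text = "café"

assert text.encode("iso-8859-1") == b"caf\xe9"
assert text.encode("utf-8") == b"caf\xc3\xa9"

# Re-encoding exported data: decode with the old charset, then
# encode as utf-8. Relabeling the header alone changes no bytes.
def to_utf8(raw: bytes, old_charset: str = "iso-8859-1") -> bytes:
    return raw.decode(old_charset).encode("utf-8")

assert to_utf8(b"caf\xe9") == b"caf\xc3\xa9"
```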

So after you have modified all your existing content and changed the headers, it is time to make sure your new data doesn't get crunched by some user or yourself (lots of validation/encoding problems are caused by the person who owns the site). All websites using iso-8859-* already have the problem that a user can enter one or more of the 27 characters that differ between iso-8859-1 and windows-1252. Fortunately, this is not the case with utf-8. When you are using utf-8 you don't need to worry about those subtle differences and how a user can easily invalidate your site. Browsers will honor utf-8 and only submit utf-8 bytes. To make sure they do, send all admin pages, every comment page, comment preview page and all other pages that allow user input, from either visitors or yourself, with utf-8 encoding.
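The iso-8859-1/windows-1252 trap is easy to demonstrate. A small Python sketch (illustration only):

```python
# Bytes 0x80-0x9F are invisible control characters in iso-8859-1,
# but windows-1252 assigns printable characters to 27 of those 32
# positions. A user pasting text from Windows can therefore submit
# bytes your iso-8859-1 page never declared.
assert b"\x80".decode("windows-1252") == "\u20ac"  # the euro sign
assert b"\x80".decode("iso-8859-1") == "\x80"      # a control character

# Count the positions where windows-1252 actually defines a character:
defined = 0
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("windows-1252")
        defined += 1
    except UnicodeDecodeError:  # 0x81, 0x8D, 0x8F, 0x90, 0x9D are unused
        pass
assert defined == 27
```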

If you have any further questions, don't be afraid to ask. Make the change today. Thanks.

Comments

  1. Okay, I've been waiting for this post to do just this very thing. I make the changes, all looks good, try posting Iñtërnâtiônàlizætiøn and it appears fine on screen but in my HTML I have the third example of garbled output that Sam refers to (in the small 3 row table).

    In fact it would appear that your site here does the same. Sam mentions If you have “utf-8”, you are actually ahead of the game and that's all. Any idea what is happening with this? Is this right?

    Posted by Mike P. at

  2. No, you are using the evil htmlentities instead of the htmlspecialchars function you should really be using. I have posted about that before though (and mentioned it again), so maybe that isn't the problem?

    Posted by Anne at

  3. You gonna turn off Trackbacks (which don't declare a charset, and could be sent in any charset imaginable, but very frequently are Windows-1252)? Unless you have a way to guess the charset and re-encode the result to UTF-8, they will invalidate your pages as quick as you can sneeze.

    Using ISO-8859-1, I can (at least) guarantee validity, even though I may turn the foreign trackbacks into gobbledygook.

    Posted by Jacques Distler at

  4. Hey Anne, I'm not using that anywhere, still not sure what's up. You... wait a sec.

    Yep, it's an Opera thing. Opera, on my machine, 'views source' in WordPad, which apparently can't handle utf-8 characters.

    I noticed this while coding up a russian site this weekend - I had to use Firefox to check the source. How timely!

    Well, there's nothing stopping me now from going utf-8. Give me 5 minutes... ;-]

    Posted by Mike P. at

  5. Maybe it's good to mention these values for a .htaccess file, since the php.ini settings overrule the Apache mime/charset defaults:

    php_value default_charset UTF-8
    php_value default_mimetype application/xhtml+xml

    From the PHP documentation: As of 4.0b4, PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty.

    On a development server I prefer these values to be us-ascii and text/plain in the php.ini file, to make sure files within my site's or application's directory take care of these headers themselves.

    Posted by Robbert Broersma at

  6. I just thought this might be a good spot to point people over to a great unicode font, 'cause some people (including myself at one time) think that using unicode automatically means any user can view any character without any further alterations, which isn't exactly true.

    For example — I was redesigning my site this month & used some Chinese characters in part of it. Most people just saw question marks or square boxes even tho I was seeing Chinese and using UTF-8.

    Beyond that, I have a question. If I remember right, when a page is viewed offline a browser will (by default) display it in US-ASCII. This being the case, wouldn't it make sense to add a meta element to the page when you're using a different encoding just in case someone downloads your page and views it locally?

    Posted by Devon at

  7. No, offline is utf-8. That META element is for the server, not for your browser (although it isn't exactly supported like that...).

    Posted by Anne at

  8. No, offline is utf-8. That META element is for the server, not for your browser.

    That's not entirely true. Opera, Mozilla and IE6 all assume an encoding other than UTF-8 (probably windows-1252) if none is specified. The only time you get UTF-8 properly detected is with the presence of a byte order mark, which isn't always inserted by software and can cause problems for applications that are not aware of what the byte sequence is used for.

    So, although the meta element probably shouldn't be used for the purpose of specifying the document encoding as UTF-8, one should keep in mind that it won't magically happen without a BOM.

    Posted by J. King at
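    [For reference, the byte order mark mentioned above is simply U+FEFF encoded as utf-8, i.e. three bytes at the very start of the file. A Python sketch, illustration only:]

```python
import codecs

# The utf-8 byte order mark: U+FEFF encoded as utf-8.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8

# Software that isn't BOM-aware sees three junk characters; read as
# windows-1252 the BOM shows up as "ï»¿" at the top of the document.
assert codecs.BOM_UTF8.decode("windows-1252") == "\u00ef\u00bb\u00bf"
```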

  9. Ah, but I was talking about XML documents, since after all, this weblog uses that.

    Posted by Anne at

  10. Ahh, I was thinking most people would likely save an XHTML file as .html locally, and that might cause encoding problems if the file doesn't declare a charset in a meta element.

    Posted by Devon at

  11. Hmmm, unicode. The ultima ratio of encoding... :]

    At first: UTF-8 cannot handle every single character in existence, as 8-bit unicode still only covers 256 characters. And even UTF-16, which seems the real advantage, only covers 65536 characters, which seems enough to cover most alphabets, but is actually just a bit too small to cover them all.

    In addition to that, several institutions [like dictionary publishing houses] have many special symbols and signs that need to be represented in some kind of encoding too. And another problem that springs from that fact is that some signs have the same character, but a different semantic in a different context. [For example, in dictionaries there's at least one sign for a weekday that is the same sign as for a planet.] And for good reasons, you want to keep the semantics of the sign in your document, not the reference to a unicode number. This problem is solved by using self-defined XML entities and mapping them [as needed] to an encoding or something else, as general entities could also contain a reference to an XML file...

    Right now, this is only a problem of "closed" XML solutions. But sooner or later, this problem is coming to the web too. Especially in the context of accessibility, a named entity like &saturn; or &monday; looks much more accessible to me...

    Posted by ben at

  12. Ben, you are wrong. utf-8 does cover all characters. Just check comment 5 on iñtërnâtiônàlizætiøn before you jump to conclusions that don't make sense. Also, read what UTF stands for and think again ;-)

    And a named entity isn't accessible. You don't need entities with utf-8 and those (utf-8 encoded) characters should just be recognized.

    And please, don't get off-topic talking about XML.

    Posted by Anne at

  13. It is probably not widely known, but using charset iso-8859-1 induces a bug in IE5 and IE6 with file uploads (in certain situations), in combination with a multi-byte character (such as the euro sign). IE then uses 2 different boundaries in the multipart/form-data request, which of course is wrong and usually not repairable on the server.

    Replacing the charset with windows-1252 solves the problem.

    Replacing the charset with utf-8 solves the problem too, but creates another problem. That might have an easy solution, but I don't see it right now:
    For example, after posting the euro sign it gets returned as 3 characters: (â‚¬) (&#226; &#8218; &#172;)
    This also applies to standard form submission. And not only to IE, but also to Mozilla and Opera.

    Posted by Taka at
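    [Those three garbage characters are exactly what you get when the utf-8 bytes of the euro sign are re-interpreted as windows-1252. A Python sketch, illustration only:]

```python
# "€" (U+20AC) encodes to the three utf-8 bytes E2 82 AC.
euro_utf8 = "\u20ac".encode("utf-8")
assert euro_utf8 == b"\xe2\x82\xac"

# Misread those bytes as windows-1252 and you get the three characters
# â + ‚ + ¬, i.e. &#226; &#8218; &#172; from the comment above.
garbled = euro_utf8.decode("windows-1252")
assert garbled == "\u00e2\u201a\u00ac"  # "â‚¬"
```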

  14. Ah, but I was talking about XML documents, since after all, this weblog uses that.

    Ah, excusez-moi, monsieur. In that case, there's no doubt at all. I myself haven't been using XHTML for very long, so I don't typically think in those terms.

    Posted by J. King at

  15. Sorry, I didn't think it was off subject (if you were referring to my comment).

    The way I see it, XHTML is XML that can be viewed as HTML. How many people will save an XHTML file as XML directly off the web? Probably less than 5%. So the idea of a character encoding problem in this situation, would be directly related to XML and UTF-8. I wouldn't want someone saving my XHTML files and losing any data.

    Posted by Devon at

  16. Devon, it is far more likely that someone prints your page than that he saves it. But even then browsers will add the necessary information to view it offline, like a byte order mark or perhaps that ugly META element. Otherwise "save this page" would be seriously broken.

    And no, I wasn't pointing at you.

    Posted by Anne at

  17. What are common methods for handling syndicated content (like trackbacks and RSS) that is in a character encoding other than utf-8?

    Conversion of some kind? How?

    Opera, Mozilla and IE6 all assume an encoding other than UTF-8 (probably windows-1252) if none is specified.

    I tested with IE6 and indeed it defaults to windows-1252 when no character set is specified. This may explain why the problem of wrong characters in websites is so omnipresent, yet hardly any webdeezigner is aware of it; their development software as well as their browser defaults to windows-1252, so they see nothing wrong happening.

    Posted by Kris at

  18. Well, according to the XML specification you are only guaranteed that a parser can read your XML when it is either utf-8 or utf-16. Those are the only encodings required to be supported.

    Posted by Anne at