Even if you are not a software developer, Unicode is still important, especially on the web, where there is quite a bit of trouble with character encoding. UTF-8, also known as the 8-bit Unicode Transformation Format, solves most of these problems, since it can handle every single character in existence and can be extended when more characters are standardized. Another advantage is that the first 128 (2^7) characters cover us-ascii, which is the most commonly used encoding format in western languages. us-ascii is also identical to the first 128 characters of the iso-8859-* series. Therefore it will be easier for people who are publishing in western languages to switch to utf-8 encoding. You just change the header and re-encode the few characters that are not displayed correctly when you see them. Of course, loading the text in your editor and clicking the convert to option will work perfectly as well, especially if you have a lot of strange characters in your database or if you want to convert non-western characters, which usually never map to us-ascii or any other utf-8-encoded character.
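If you prefer to script that conversion, it is a few lines of PHP; a minimal sketch, assuming the file really is iso-8859-1 and that iconv support is available (the filename is made up):
<?php
// read the legacy file, rewrite it as utf-8 and save it back;
// the us-ascii bytes come through unchanged, only the accented
// characters become multi-byte utf-8 sequences
$text = file_get_contents('page.html');
file_put_contents('page.html', iconv('ISO-8859-1', 'UTF-8', $text));
?>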
One of the most important aspects of conversion is understanding that just changing your HTTP headers isn't going to solve your encoding problems; it actually creates them. You need to re-encode the characters that don't map directly to utf-8, since in some character encodings a character takes one byte where it takes two bytes in another encoding, or the bytes within a character come in a different order. If your site uses a database you could easily export all your data to a single file, save that file as utf-8 and overwrite the existing data in the database. After that you change the charset parameter of the Content-Type header to utf-8 and everything will work as expected. There are various ways to do this. You could add a line to your .htaccess file: AddDefaultCharset utf-8. Or you could use the PHP header function: header("Content-Type: text/html; charset=utf-8");. If you don't have access to either of those you could choose the most evil of all solutions, although you actually want to avoid the HTTP-EQUIV attribute always and everywhere:
<meta http-equiv="content-type" content="text/html;charset=utf-8">
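The database pass described above can be sketched the same way; the table and column names here are invented for the example, and it assumes the stored data really is iso-8859-1:
<?php
// walk the legacy rows, convert each text column to utf-8 and write it back
$db = new PDO('mysql:host=localhost;dbname=weblog', 'user', 'pass');
$update = $db->prepare('UPDATE posts SET body = ? WHERE id = ?');
foreach ($db->query('SELECT id, body FROM posts') as $row) {
    $update->execute(array(iconv('ISO-8859-1', 'UTF-8', $row['body']), $row['id']));
}
?>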
So after you have modified all your existing content and changed the headers, it is time to make sure your new data doesn't get crunched by some user or by yourself (lots of validating/encoding problems are caused by the person who owns the site). All websites using iso-8859-* already have the problem that a user can enter one or more of the 27 characters that differ between iso-8859-1 and windows-1252. Fortunately, this is not the case with utf-8. When you are using utf-8 you don't need to worry about those subtle differences and how a user can easily invalidate your site. Browsers will honor utf-8 and only submit utf-8 bytes. To make sure browsers do that, see to it that all admin pages, every comment page, comment preview page and every other page that accepts input from either visitors or yourself is sent with utf-8 encoding.
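In PHP that comes down to sending the header before any other output; a minimal sketch of such a comment form (comment.php is a made-up target):
<?php
// the page itself is served as utf-8, so browsers will submit
// the form below as utf-8 as well
header('Content-Type: text/html; charset=utf-8');
?>
<form method="post" action="comment.php">
<textarea name="comment"></textarea>
<input type="submit" value="Post comment">
</form>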
If you have any further questions, don't be afraid to ask. Make the change today. Thanks.
Okay, I've been waiting for this post to do just this very thing. I make the changes, all looks good, try posting Iñtërnâtiônàlizætiøn and it appears fine on screen, but in my HTML I have the third example of garbled output that Sam refers to (in the small 3-row table).
In fact, it would appear that your site here does the same. Sam mentions "If you have utf-8, you are actually ahead of the game" and that's all. Any idea what is happening with this? Is this right?
No, you are using the evil htmlentities instead of the htmlspecialchars function you should really be using. I have posted about that before though (and mentioned it again), so maybe that isn't the problem?
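For anyone wondering what the difference looks like, a small illustration; the explicit charset arguments are only there to make the behaviour visible:
<?php
$s = "Iñtërnâtiônàlizætiøn & <tags>";
// htmlentities rewrites every character it has an entity for; fed utf-8
// while told iso-8859-1, it escapes each byte separately and garbles the text
echo htmlentities($s, ENT_QUOTES, 'ISO-8859-1');
// htmlspecialchars only touches &, <, > and quotes, so utf-8 passes through intact
echo htmlspecialchars($s, ENT_QUOTES, 'UTF-8'); // Iñtërnâtiônàlizætiøn &amp; &lt;tags&gt;
?>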
You gonna turn off Trackbacks (which don't declare a charset, and could be sent in any charset imaginable, but very frequently are Windows-1252)? Unless you have a way to guess the charset and re-encode the result to UTF-8, they will invalidate your pages as quick as you can sneeze.
Using ISO-8859-1, I can (at least) guarantee validity, even though I may turn the foreign trackbacks into gobbledygook.
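For what it's worth, a best-effort guess-and-re-encode pass is possible before the trackback touches the page; a sketch assuming the mbstring extension is loaded (detection is only a heuristic, so this is damage control, not a guarantee):
<?php
function trackback_to_utf8($text) {
    // keep valid utf-8 as-is, otherwise assume the usual unlabelled suspect
    $guess = mb_detect_encoding($text, array('UTF-8', 'Windows-1252'), true);
    if ($guess === false) {
        $guess = 'Windows-1252'; // the most common unlabelled case
    }
    return mb_convert_encoding($text, 'UTF-8', $guess);
}
?>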
Hey Anne, I'm not using that anywhere, still not sure what's up. You... wait a sec.
Yep, it's an Opera thing. Opera, on my machine, 'views source' in Wordpad, which apparently can't handle utf-8 characters.
I noticed this while coding up a Russian site this weekend; I had to use Firefox to check the source. How timely!
Well, there's nothing stopping me now from going utf-8. Give me 5 minutes... ;-]
Maybe it's good to mention these values for a .htaccess file, since the php.ini settings overrule the Apache mime/charset defaults:
php_value default_charset UTF-8
php_value default_mimetype application/xhtml+xml
As of 4.0b4, PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty.
On a development server I prefer these values to be us-ascii and text/plain in the php.ini file, to make sure files within my site's or application's directory take care of these headers.
I just thought this might be a good spot to point people over to a great Unicode font, 'cause some people (including myself at one time) think that using Unicode automatically means any user can view any character without any further alterations, which isn't exactly true.
For example: I was redesigning my site this month and used some Chinese characters in part of it. Most people just saw question marks or square boxes even though I was seeing Chinese and using UTF-8.
Beyond that, I have a question. If I remember right, when a page is viewed offline a browser will (by default) display it in US-ASCII. This being the case, wouldn't it make sense to add a meta element to the page when you're using a different encoding, just in case someone downloads your page and views it locally?
No, offline is utf-8. That META element is for the server, not for your browser (although it isn't exactly supported like that...).
No, offline is utf-8. That META element is for the server, not for your browser.
That's not entirely true. Opera, Mozilla and IE6 all assume an encoding other than UTF-8 (probably windows-1252) if none is specified. The only time you would get UTF-8 properly detected is with the presence of a byte order mark, which isn't always inserted by software and can cause problems for applications that are not aware of what the byte sequence is used for.
So, although the meta element probably shouldn't be used for the purpose of specifying the document encoding as UTF-8, one should keep in mind that it won't magically happen without a BOM.
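Checking for one is trivial, by the way: a utf-8 byte order mark is just the fixed bytes EF BB BF at the very start of the file.
<?php
function has_utf8_bom($file) {
    // read the first three bytes and compare them against the utf-8 BOM
    $fh = fopen($file, 'rb');
    $prefix = fread($fh, 3);
    fclose($fh);
    return $prefix === "\xEF\xBB\xBF";
}
?>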
Ah, but I was talking about XML documents, since after all, this weblog uses that.
Ahh, I was thinking most people would likely save an XHTML file as .html locally, and that might cause encoding problems if the file doesn't declare a charset in a meta element.
Hmmm, unicode. The Ultima Ratio of Encoding... :]
First of all, UTF-8 cannot handle every single character in existence, as 8-bit Unicode still only covers 256 characters. And even UTF-16, which seems the real advantage, only covers 65536 characters, which seems enough for most alphabets but is actually just a bit too small to cover them all.
In addition to that, several institutions [like dictionary publishing houses] have many special symbols and signs that need to be represented in some kind of encoding too. And another problem that springs from that fact is that some signs have the same character but a different semantic in a different context. [For example, in dictionaries there's at least one sign for a weekday that is the same sign as for a planet.] And for good reasons, you want to keep the semantics of the sign in your document, not the reference to a Unicode number. This problem is solved by using self-defined XML entities and mapping them [as needed] to an encoding or something else, as general entities could also contain a reference to an XML file...
Right now, this is only a problem of "closed" XML solutions. But sooner or later, this problem is coming to the web too. Especially in the context of accessibility, a named entity like &saturn; or &monday; looks much more accessible to me...
Ben, you are wrong. utf-8 does cover all characters. Just check comment 5 on iñtërnâtiônàlizætiøn before you jump to conclusions that don't make sense. Also, read what UTF stands for and think again ;-)
And a named entity isn't accessible. You don't need entities with utf-8 and those (utf-8 encoded) characters should just be recognized.
And please, don't get off-topic talking about XML.
It is probably not widely known, but using charset iso-8859-1 induces a bug in IE5 and IE6 with file upload (in certain situations), in combination with a multi-byte character (such as the euro sign). IE then uses 2 different boundaries in the multipart/form-data request, which of course is wrong and usually not repairable on the server.
Replacing the charset with windows-1252 solves the problem.
Replacing the charset with utf-8 solves the problem too, but creates another problem. That might have an easy solution, but I don't see it right now:
For example, after posting the euro sign (€) it gets returned as 3 characters (â‚¬).
This applies (also) to standard form submission, and not only to IE, but also to Mozilla and Opera.
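That is the classic double-encoding pattern: the three utf-8 bytes of the euro sign are read back as windows-1252. It can be reproduced in a few lines (assuming mbstring):
<?php
$euro = "\xE2\x82\xAC"; // the euro sign as utf-8 bytes
// reinterpreting those bytes as windows-1252 yields three characters
echo mb_convert_encoding($euro, 'UTF-8', 'Windows-1252'); // â‚¬
?>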
Ah, but I was talking about XML documents, since after all, this weblog uses that.
Ah, excuse me, monsieur. In that case, there's no doubt at all. I myself haven't been using XHTML for very long, so I don't typically think in those terms.
Sorry, I didn't think it was off subject (if you were referring to my comment).
The way I see it, XHTML is XML that can be viewed as HTML. How many people will save an XHTML file as XML directly off the web? Probably less than 5%. So the idea of a character encoding problem in this situation would be directly related to XML and UTF-8. I wouldn't want someone saving my XHTML files and losing any data.
Devon, it is far more likely that someone prints your page than that they save it. But even then, browsers will add the necessary information to view it offline, like a byte order mark or perhaps that ugly META element. Otherwise "save this page" would be seriously broken.
And no, I wasn't pointing at you.
What are common methods for content syndication (like trackbacks and RSS) where the syndicated content is in a different character encoding than utf-8? Conversion of some kind? How?
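One approach, sketched below: most feeds do declare their encoding, either in the HTTP Content-Type header or in the XML prolog, so you can read that out and convert. The URL is made up and error handling is left out:
<?php
$body = file_get_contents('http://example.org/feed.xml');
$charset = 'UTF-8'; // the default when nothing is declared
foreach ($http_response_header as $header) {
    // pick the charset parameter out of the Content-Type header, if any
    if (preg_match('/^Content-Type:.*charset=([\w-]+)/i', $header, $m)) {
        $charset = $m[1];
    }
}
if (strcasecmp($charset, 'UTF-8') !== 0) {
    $body = iconv($charset, 'UTF-8', $body);
}
?>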
Opera, Mozilla and IE6 all assume an encoding other than UTF-8 (probably windows-1252) if none is specified.
I tested with IE6 and indeed it defaults to windows-1252 when no character set is specified. This may explain why the problem of wrong characters on websites is so omnipresent, yet hardly any webdeezigner is aware of it; their development software as well as their browser defaults to windows-1252, so they see nothing wrong happening.
Well, you are only guaranteed that a parser can read your XML when it is either utf-8 or utf-16, according to the XML specification. Those are required to be supported.
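In practice that means a document in any other encoding has to announce it in its XML declaration; only utf-8 and utf-16 may go without one:
<?xml version="1.0" encoding="iso-8859-1"?>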