When people build websites, character encoding is usually something that does not really concern them. In fact, most of HTTP is ignored altogether and META elements are used instead, especially the ones with an HTTP-EQUIV attribute. However, a much more useful way to declare the character encoding of a document is the charset parameter of the Content-Type header. If you want all your HTML files to be served with the UTF-8 character encoding you could use:
AddType text/html;charset=utf-8 .html
This is for Apache, by the way; similar rules presumably exist for other server software. You could also use:
AddCharset utf-8 .html
Or even better (this assigns UTF-8 as the default encoding for all file types):
AddDefaultCharset utf-8
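Whichever directive you use, what ends up on the wire is a header like Content-Type: text/html; charset=utf-8. As an illustration (the function name is mine, not from the post), here is a small Python sketch that extracts the charset parameter from such a header value using the standard library's email parser, which handles the same parameter syntax:

```python
from email.message import EmailMessage

def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = EmailMessage()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # normalised to lower case

print(charset_of("text/html;charset=utf-8"))  # utf-8
print(charset_of("text/html"))                # None
```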
If your host has already configured your server like this, you cannot alter the character encoding using a META element. Every document that suggests otherwise is incorrect. (Now I have to admit that Sam Ruby’s DevCon 2004 slide show is excellent and should be read by everyone.) However, when your document is served as text/html without an explicit charset parameter, HTTP is no longer authoritative. HTTP does suggest a default encoding of ISO-8859-1 for HTML documents (which is actually treated as win-1252), but browsers choose the value of the CONTENT attribute instead. (And only the charset part of that attribute’s value, as specified in Web Applications 1.0.)
A minor note on ISO-8859-1: do not use it if your site contains any form of user input, whether from yourself or from other people. Those users will be able to submit win-1252 encoded characters to your site and your pages will no longer validate. You could choose to make win-1252 the default encoding, but it would be much wiser to use UTF-8 instead.
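The practical difference shows up in the bytes 0x80–0x9F, which are C1 control codes in ISO-8859-1 but printable characters in win-1252. A quick Python illustration of my own, just to make the point concrete:

```python
# Bytes 0x80-0x9F are C1 controls in ISO-8859-1 but printable in win-1252.
smart_quote = b"\x93"  # a curly quote as typically pasted from Windows

print(smart_quote.decode("windows-1252"))        # the left double quotation mark
print(repr(smart_quote.decode("iso-8859-1")))    # '\x93', an invisible control code
```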
Please, don't ever use win-1252 character encoding for anything. For those that haven't noticed, it's proprietary to Microsoft and could screw non-Windows users, although I think some non-Windows systems also support it because of its wide use. UTF-8 is indeed the only way to go. :-)
Anne, I'm a bit confused by you stating that
...ISO-8859-1 for HTML documents (which is actually treated as win-1252)... Is that only on Windows, or is that on Linux too? Mac OS X?
Maybe it is worth mentioning that Mozilla (at least) ignores the charset in META when the document is processed by the XML parser.
Rimantas, every browser that does not ignore that is in error (see also the link to WA1 in the post). Please file a bug if you ever see that in action.
I think it's also worth mentioning that the charset parameter should be omitted when serving application/*+xml, since XML documents are self-describing. I'm not sure about text/xml though, since there are problems when specifications collide; XML as text/* is therefore not recommended.
See also the page from the W3C's I18N Activity.
Charl, it is a common myth that treating ISO-8859-1 as Windows-1252 was a MS-only anomaly. It is actually very common and implemented in real-world non-MS software on non-Windows platforms as well. If you serve forms as ISO-8859-1, you should expect Windows-1252 submissions. But like Anne said, UTF-8 is the way to go.
I read something interesting the other day about encodings:
What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8-bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working.
No wonder people don't give a crap about proper encodings because it "looks good" to them.
I do second UTF-8, btw.
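The guessing Milan quotes can be caricatured in a couple of lines. This is only a sketch of the simplest possible fallback, nothing like IE's statistical detector: because strict UTF-8 rejects most byte sequences produced by legacy code pages, trying UTF-8 first and falling back to cp1252 already guesses right surprisingly often.

```python
def guess_decode(raw):
    """Naive sniffing sketch: strict UTF-8 first, cp1252 as the fallback.
    Real detectors use per-language byte-frequency statistics instead."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("cp1252"), "cp1252"

print(guess_decode("café".encode("utf-8")))  # ('café', 'utf-8')
print(guess_decode(b"caf\xe9"))              # ('café', 'cp1252')
```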
The charset parameter in the Content-Type header works for most browsers, but not MSIE. As Milan points out, Internet Explorer (6.0 on WinXP) tries to guess and so it was guessing Turkish (Windows) for a Movable Type cgi comment preview page containing UTF-8 English and Urdu text!
I was sending the correct HTTP headers and had the encoding set in
<?xml version='1.0' encoding='utf-8'?>. So I had to include the META tag just for Internet Explorer.
Zack, I would love to see a proper test case of that, since I cannot really believe it. (This page uses a preview too; you can enter everything you want and still I do not use a META element.)
I think Zack may be correct: IE does ignore the Content-Type header in some circumstances. I set up a test case serving the document with Content-Type: text/html; charset=iso-8859-1 but also starting with a UTF-8 BOM. IE incorrectly parses the file as UTF-8, while Firefox and Opera correctly obeyed the Content-Type header.
I know the test isn't exactly what he described, with Urdu text detected as Turkish, but I don't know those languages, nor whether the file he was talking about was correctly encoded as UTF-8. This test does, however, show that Internet Explorer breaks the rules yet again.
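For what it's worth, the BOM check that test case relies on is trivial to reproduce. A sketch of my own (the function name is an assumption, not from the test case) of sniffing a leading byte order mark, which is presumably what IE prioritises over the Content-Type header:

```python
def sniff_bom(payload):
    """Return the encoding implied by a leading byte order mark, if any."""
    if payload.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if payload.startswith(b"\xff\xfe") or payload.startswith(b"\xfe\xff"):
        return "utf-16"
    return None

print(sniff_bom(b"\xef\xbb\xbf<!DOCTYPE html>"))  # utf-8
print(sniff_bom(b"<!DOCTYPE html>"))              # None
```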
Yes, it has its flaws. However, to truly test it you should have used some non-US-ASCII characters: for example, UTF-8-encoded characters in one test case and ISO-8859-1-encoded characters in the other.
Tssk, all this cp 1252 bashing is just so unfair, M$IE actually prefers UTF-7 (just be sure to have automatic encoding detection enabled :-).
Perhaps I can clarify the cp1252/8859-1 issue. The difference between these two character sets is that Microsoft has used several character values that are left unused in 8859-1. In order to enforce this difference, browsers would have to
censor these byte values. It's less work, and doesn't really break anything, to pretend that the character sets are identical.
I don't see how advertising your pages as UTF-8 instead of 8859-1 solves this kind of problem. If a US Windows user innocently pastes a 1252-only value such as 0x97 (—), you still don't have a valid page, since that value isn't used in UTF-8. All you've accomplished is to hide the — from all your users, instead of just your non-Western users.
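That 0x97 example is easy to verify: in cp1252 the byte maps to the em dash, while a strict UTF-8 decoder rejects it as malformed (0x97 is a continuation byte with no lead byte). A quick check in Python:

```python
raw = b"\x97"  # the em dash as a lone cp1252 byte

print(raw.decode("cp1252"))  # the em dash character
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("0x97 is not a valid UTF-8 sequence")
```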
The optimum solution would be to translate from your user's encoding to UTF-8 on input. Except that browsers don't report user encodings. And maintaining the necessary translation tables is pretty daunting!
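If you do assume an input encoding, the transcoding step itself is simple; the hard part, as noted, is knowing which encoding to assume. A sketch under that assumption (cp1252 as the guess is only a heuristic, and the function name is mine):

```python
def to_utf8_text(raw, assumed="cp1252"):
    """Decode a form submission under an assumed legacy encoding.
    errors='replace' keeps malformed bytes from rejecting the whole input."""
    return raw.decode(assumed, errors="replace")

print(to_utf8_text(b"caf\xe9 \x97 ok"))  # 'café — ok'
```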
Incidentally, IE's attempts to guess character encodings often have bizarre results. If you edit a document in Word with
smart quotes enabled, then paste it into a web page, IE is very likely to decide that you're using a Japanese encoding, and replace some of your punctuation with Kanji!
I was talking about online input. If you can enter anything in my comment form that is not UTF-8 encoded I would love to see it.
I forgot to mention that my Internet Explorer has the Mathplayer 2.0 plugin installed. I have no idea whether that affects things or not.
The behavior is strange in that I was not seeing it for all my pages. Also, Turkish (Roman alphabet, left-to-right) is nowhere close to Urdu (Arabic script, right-to-left).
If you are really interested in seeing that MSIE behavior, let me know by email and I can remove the meta tag from some page to show it to you.
I was talking about online input too. But I was also thinking in terms of more typical web sites that don't validate their input as well-formed XHTML. Not a criticism of this site; people who come here should be able to compose XHTML. But most web sites can't impose that kind of restriction on their users. If they want to present valid XHTML, they have to transform user input somehow. I assumed that's what you meant by "validate".
Basically, Anne, you're creating valid web pages by forcing your users to run their input past a validator for well-formed XHTML. That would work for any character set (though it would be rude to use a proprietary character set, of course). So it's not your choice of character set that keeps out proprietary characters. It's the validation procedure.
Isn't the browser meant to reply in the same encoding that the form had? I guess if you didn't specify to the liking of the browser (however that may be done) then it can return almost anything.
Although, to be fair, why anybody should use anything other than UTF-8 is completely beyond me.
[This should be a HR or BR but the site doesn't allow it (even as <ht />)]
I came to this page because I was wondering about the correct syntax for the Content-Type header. My thinking is that case is extremely important; it isn't enough to just have a Unicode string. Case conversion involves knowledge of the language (which is why Microsoft had to break Turkish on Windows) and can't be specified by character encoding alone.
Shouldn't the UTF-8 be uppercase in the stated examples?
Not really. See section 3.4 of RFC 2616:
HTTP character sets are identified by case-insensitive tokens.
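So utf-8 and UTF-8 name the same thing. In code, the safe way to compare charset tokens is to normalise them rather than compare the strings directly; Python's codec registry, for instance, does this for you:

```python
import codecs

# Case differences and common aliases normalise to one canonical codec name.
print(codecs.lookup("UTF-8").name)  # utf-8
print(codecs.lookup("utf8").name)   # utf-8
print(codecs.lookup("utf-8").name)  # utf-8
```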
I realise it would be perverse to use any other language's case conversion rules, but the standard should also state that the case conversion rules should be English.