I just love it! As you can see in my previous post, WordPress handles all kinds of characters flawlessly. I always wanted to post something about this, but I had problems with the comment text field. The author field went fine after I switched to UTF-8 encoding, but the comment field was just embarrassing (every time you previewed again, the bytes got more messed up). I probably still need to add something to clean up Windows characters, but that is just a matter of time.
The problem with the comments field was PHP. First of all, PHP has no native support for Unicode, although there are ways to work around that. I was using the htmlentities function to convert entered comments into something that can be placed within both TEXTAREA and input[value], and that function seems to mess up UTF-8 strings. After I replaced it with htmlspecialchars, it all works fine.
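To make the failure mode concrete, here is a rough simulation in Python rather than the actual PHP from my comments file. It assumes the entity encoder treats its input as ISO-8859-1 (the default for htmlentities in PHP versions before 5.4) while the page really sends UTF-8. The function names are made up for illustration, and the sketch emits numeric entities where PHP would emit named ones.

```python
import html


def legacy_entities(utf8_bytes: bytes) -> str:
    """Hypothetical stand-in for an entity encoder that assumes ISO-8859-1."""
    # Wrong guess about the charset: every byte becomes its own character.
    text = utf8_bytes.decode("latin-1")
    # Escape markup, then turn every non-ASCII character into a numeric entity.
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def special_chars_only(utf8_bytes: bytes) -> str:
    """Hypothetical stand-in for escaping only the markup characters."""
    # Decode with the real charset and leave non-ASCII characters alone.
    return html.escape(utf8_bytes.decode("utf-8"), quote=True)


def preview_round_trip(utf8_bytes: bytes) -> bytes:
    """What the browser would send back after displaying the broken preview."""
    shown = html.unescape(legacy_entities(utf8_bytes))
    return shown.encode("utf-8")


comment = "é".encode("utf-8")           # two bytes: 0xC3 0xA9
print(legacy_entities(comment))         # two entities for one character
print(special_chars_only(comment))      # the character survives untouched

once = preview_round_trip(comment)
twice = preview_round_trip(once)
print(len(comment), len(once), len(twice))  # byte count doubles per preview
```

Under these assumptions a single accented character turns into two entities, and each further preview doubles the damage, which matches what I saw in the comment field. Escaping only the markup characters leaves the UTF-8 text intact.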
I would recommend sending all your files as UTF-8 (and making sure the bytes are actually UTF-8 bytes, of course) and never using htmlentities again. You might want to avoid PHP altogether, but I won't pass judgment on that. I care about standards and Unicode (UTF-8) for the moment. If you have read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), you will know that UTF-8 won't take extra bytes for most European and American pages, since ASCII characters still use one byte each. Also note that Klingon is not in Unicode.
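The point about byte counts is easy to check. The small Python sketch below (the helper name and the sample strings are just my own picks for illustration) counts how many bytes a string needs in UTF-8 compared with how many characters it has.

```python
def utf8_size(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample strings chosen for illustration only.
for sample in ["hello", "café", "€"]:
    print(sample, len(sample), utf8_size(sample))
```

Plain ASCII text costs one byte per character, an accented Latin letter such as é costs two, and a symbol like the euro sign costs three. A mostly English or Western European page therefore grows very little, if at all.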
In the end it was partly my own problem, not that of WordPress, since I hacked the comments file to add validation and comment preview support (I want comment editing as well). Some further reading.
You're probably already aware of this, but there is a PHP extension for Multibyte Strings which I've been using (on supported servers) with great success. Multibyte String Functions.
Great links.