Anne van Kesteren


How does my weblog perform using Unicode? See also: Survival guide to i18n. Some tests:

Let's see how Unicode and weblogs do with Japanese :) これは日本語のテキストです。読めますか?...

Let us test some Hindi Text

देखें हिन्दी कैसी नजर आती है। अरे वाह ये तो नजर आती है।

And check...


  1. Excellent job!


    Posted by Basje at

  2. Added support for xml:lang on more elements now, thanks for letting me know.

    Posted by Anne at

  3. [...] Unicode support in WordPress I just love it! As you can see in my previous post WordPress handles all kinds of characters [...]

    Posted by Unicode support in WordPress <Anne's Weblog about Markup & Style> at

  4. Supporting other character sets is a good start for an i18n-ed site. This allows you to add some text in other languages.
    The difficulty comes when you need to serve a completely Japanese page to your Japanese visitors, and a completely English page to your English visitors. Having your navigation and layout work in multiple languages is hell.

    Posted by Jeroen at

  5. 西安话 说粤语的人可以使用汉字书写出篇在北京可以被理解的文章 上 海南 头 诗经集 这引起了很大的反对声 对多和多对的转换都是不容易的 例如 相反 找 中文等称呼都是指汉语 但客家话是在北方移民南下影响中形成的 体 文言 既然汉语不是表音的 它保存在了周德清在世纪初撰写的韵书中原音韵中, 前者用于台湾 以厦门话为代表 因为它被设计成个比更大的字集

    わった ティのい にによる ンタネット協会 ンテ アクセシビ を始めよう インフォテ めよう アク, どら アクセ プロトコル サイト作成のヒント コンテン よる ユザエ 丸山亮仕 ビリティにる レイティングサ

    Σελίδων διασφαλίζεται αν δύο, και τελειώσει εξοργιστικά με. Παραπάνω πρόσληψη σου ας, αλφα αποφάσισε μας το, ανά τη πολλοί άλγεβρα.

    सदस्य अत्यंत कम्प्युटर ढांचामात्रुभाषा कारन सीमित शुरुआत पहोचने आपको विवरन ब्रौशर तकरीबन पुष्टिकर्ता पडता उन्हे संस्था लचकनहि करती आवश्यक है।अभी

    Нас силы осколки удивительно не, от мог когда имеет промежуток, он дурак заботит предварительного не. Очень прийти заведено вы опа.

    Ik pera tempo dum. Ia istan asterisko mallongigo des, sepen gentonomo tutampleksa da jes. Ene kvin armo kune uk, mia.

    Technic internet tu que, sed ultra gymnasios conferentias un. Duo altere instruction professional es, tu que titulo nomina summarios iste.

    LINk kl1k 4lvvAyz y0 n0n. |235u|7z 5umm4|213z kUm 47, joo 1+ p4g3, p|33z pR0dUc+, WI5h 717|3 b3 c4n. D0n't l1||k.

    Mä voll Léift oft. Wait d'Vullen d'Vioule si ons, ke Hierz d'Sonn gebotzt mir. Op hie Well d'Mier Hämmelsbrot, ech.

    ·--· ····- -···· - --- · --·- -··- ·-· - ·-- ·-·· -··- ··--- -· ··--·· ···-- --- -·- --

    Lav sa luhta vandel goneheca, soica cotumo racinë hón ma. Lá occo amanyar iel, antorya ambarmetta har be. Cé mac.

    Te robenie roditelis gde, znat lubijsx studentis li eda, do mai edat tugde probudijm. Da cielis trasil moi. Velju slovis.

    Ipe zn pece urazo, hegi undan kayasada kon in, iraga imagi tonni di men. Aga di pani dite jakine, dubi.

    Mod seli cont selo vt, n oth pini wawa, sama nasin kin e. Sitelen toki e oko, mun a unpa.

    Blümiks olenükobs-li cil nu, ix leklära okälom pejonedon köp. Fe elesedom lätikan magot güä, dugans mans panemof-li bos nu ön.

    Using the generator found in the post: Lorem Ipsum with other charsets on wg. Thanks Nate!

    Posted by Anne at

  6. ľšýáýľšýáťľíééíáčŽôúä

    Posted by dusoft at

  7. Looks like it worked! Although I'm not sure I understand the subject well enough to be sure.

    Posted by Nate at

  8. Initiate Swedish umlaut test:

    å ä ö

    Swedish umlaut test completed.

    So do you speak Japanese or Hindi?

    Posted by Lars at

  9. क्या बात है। अच्छा लगा आप की प्रविष्टि देख कर। English: Wow that is cool. It is good to see your post.
    पंकज Pankaj

    Posted by Pankaj Narula at

  10. Adding to PHP's Unicode trouble, MSIE is buggy when rendering Japanese kanji and therefore a large part of the Chinese character set (at least via UTF-8; there may be proprietary solutions I don't want to know about). For Japanese you can circumvent this only by limiting yourself to the hiragana and katakana characters.

    Being a programmer and not a linguist, I have to say that it is a mess, really. Just have a look at Joel Spolsky's site in Chinese. If not even he can deal with it, who can?

    On I just left out all Japanese/Chinese chars that MSIE can't handle (took me hours to find out), so don't get fooled...

    Posted by Marek Möhling at

  11. Just noticed: The Chinese example in post no.5 needs the xml:lang="zh" attribute, strictly speaking.

    Speaking ...ah, sloppily, it needs lang="zh" too -which is not a valid xml attribute- to display well with MSIE...

    Posted by Marek Möhling at

  12. This is a test of Hebrew language support.

    אני מדבר עברית

    Posted by Abiola Lapite at

  13. Contrary to my earlier remark, the "lang" attribute gets validated as XHTML Strict by, nevertheless it gets rejected as "not well-formed" here.

    Testing Hebrew and Arabic, I see that the dir attribute and the bdo tag get rejected too, so right-to-left text orientation is not possible in this weblog, though words and chars are displayed in the right order.




    Posted by Marek Möhling at

  14. Yeah, the LANG attribute is a leftover from HTML 4.01, so I won't allow it here. I care about semantics, not about what validates. (At least, for my personal site I do.)

    Are you saying I need to allow both DIR attribute and BDO element or is support just fine already?

    Posted by Anne at

  15. To enable right-to-left text orientation you should allow the dir attribute, else messages look funny to Arabic/Hebrew etc. speakers, even if they can read them. (By the way, I added Arabic/Hebrew to my generator.)

    As for the despised lang attribute, you can do without it, as only MSIE needs it for CJK languages, and MSIE's Unicode-based CJK support is buggy anyway (unless you use CJK MSIE or WinOS versions, I guess); nevertheless attributes do not add semantic clutter, it's tags that do IMO.

    Though you are certainly free to do as you like, I don't subscribe to the concept of forcing visitors to know and apply (X)HTML - as a general concept for cms/blog software it's not applicable, though it certainly gives us geeks a way to show and demand prowess.

    In the end the web is about enabling communication, so plain text + optionally some simple markup should rule; URL to href conversion and XML/Unicode charset handling should be up to the maintainer's software. Leaving it to the user to set the appropriate dir attributes and xml:lang ISO language codes is technically easier, but that's not failsafe as users are error-prone; as of yet I don't know of systems that check if the entered language codes/dir attributes actually match the entered chars.

    Posted by Marek Möhling at

  16. re: Chinese example in post no.5
    Fiddling around, I just noticed:

    BTW: I see that the b tag gets rejected here too. AFAIK it's not deprecated but reserved for physical stress, in contrast to logical emphasis; dunno, in Germany we call that being more popish than the pope (no offense meant, long live John Paul :)

    Posted by Marek Möhling at

  17. Hmm... Bug 75011 seems to be showing up in a lot of these posts.

    Posted by dolphinling at

  18. Funny, I can reproduce Bug 75011 neither here nor at bugzilla's example site. (Moz1.5 on WinXPSP1)

    ...anyway, the bug is Mozilla related, it's not about Unicode or its implementation on this site or elsewhere

    Posted by Marek Möhling at
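
Comment 15 wonders whether a system could check that the entered language codes and dir attributes actually match the entered characters. A rough sketch of such a check, in Python using only the standard `unicodedata` module, might look like the following. The function names and the `EXPECTED_SCRIPT` table are hypothetical and made up for illustration; matching on Unicode character names is a heuristic, not a real script-detection API, and nothing here reflects what this weblog's software actually does.

```python
import unicodedata

# Strong right-to-left bidi classes (Hebrew is "R", Arabic letters are "AL").
RTL_CLASSES = {"R", "AL"}

# Hypothetical, deliberately tiny table: language code -> prefix that the
# Unicode names of its letters are assumed to start with.
EXPECTED_SCRIPT = {
    "he": "HEBREW",
    "ar": "ARABIC",
    "hi": "DEVANAGARI",
    "el": "GREEK",
    "ru": "CYRILLIC",
}


def dominant_direction(text):
    """Return 'rtl', 'ltr', or None, counting strongly directional chars."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            rtl += 1
        elif bidi == "L":
            ltr += 1
    if rtl == 0 and ltr == 0:
        return None  # only digits, spaces, punctuation and the like
    return "rtl" if rtl > ltr else "ltr"


def dir_matches(text, declared_dir):
    """True if the declared dir value agrees with the text's dominant direction."""
    detected = dominant_direction(text)
    return detected is None or detected == declared_dir


def lang_matches(text, lang):
    """True if at least half of the letters look like the script assumed for lang.

    Languages missing from EXPECTED_SCRIPT are not checked and pass.
    """
    prefix = EXPECTED_SCRIPT.get(lang)
    letters = [ch for ch in text if ch.isalpha()]
    if prefix is None or not letters:
        return True
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(prefix))
    return hits * 2 >= len(letters)
```

With this sketch, the Hebrew test from comment 12 would be flagged if it were submitted with dir="ltr", and English text marked as xml:lang="he" would fail the language check. Languages that share a script (most of the Latin-script samples in comment 5, for example) cannot be told apart this way, so a check like this could only catch gross mismatches.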