Anne van Kesteren

Language tags

Language tags are defined in RFC3066 and allow you to construct a language identifier for the document. By the way, they are not tags in the HTML/XML sense of the word. Like Technorati tags, they are a way of ‘tagging’ content. A W3C article about language tags in XML and HTML explains how they work. If you use just a two- or three-letter code, you are only denoting the language of the document. If you add a hyphen followed by a two-letter country code, you can also indicate the country whose variety of the language you are using. For example, you can use nl-NL to ‘tag’ the Dutch language as spoken in the Netherlands or nl-BE for the Dutch language as spoken in Belgium (also called Vlaams). There isn’t much difference between those, so you can safely use nl, but for some languages there really is variation between countries.
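To make the structure of a tag concrete, here is a minimal Python sketch that splits a tag into its primary subtag, an optional two-letter country code, and whatever follows. The function name split_language_tag is made up for this illustration, and the code checks only the RFC3066 syntax (a primary subtag of one to eight letters, further subtags of one to eight letters or digits). It does not check whether the codes are actually listed in ISO639 or ISO3166.

```python
import re

# RFC3066 syntax: Primary-subtag = 1*8ALPHA, Subtag = 1*8(ALPHA / DIGIT)
_TAG_SYNTAX = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def split_language_tag(tag):
    """Split a language tag into (primary, country, rest).

    Hypothetical helper for illustration: it validates syntax only and
    does not look the codes up in any registry.
    """
    if not _TAG_SYNTAX.fullmatch(tag):
        raise ValueError("not a syntactically valid language tag: %r" % tag)

    subtags = tag.split("-")
    primary = subtags[0].lower()
    rest = subtags[1:]
    country = None

    # A two-letter second subtag is a country code, except after the
    # special x (private use) and i (IANA-registered) primary subtags.
    if primary not in ("x", "i") and rest:
        if len(rest[0]) == 2 and rest[0].isalpha():
            country = rest[0].upper()
            rest = rest[1:]

    return primary, country, rest


print(split_language_tag("nl-BE"))
print(split_language_tag("en-GB-hixie"))
```

Under these assumptions nl-BE splits into the language nl and the country BE, while en-GB-hixie keeps hixie as a trailing subtag that the syntax permits but no registry defines.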

The biggest problem of RFC3066 is probably that it refers to ISO639, which doesn’t cover every language. An organization called SIL International does try to cover every language (including Venetian, for example) and recommends usage like x-sil-silcode, where silcode represents one of the language codes they issued. For example, it provides language codes for all dialects spoken in the Netherlands, along with detailed information. The problem with these language tags is that they are not really official, hence the x prefix, and are therefore of little practical use.

(By the way, I was wondering: would en-NL-anne be valid? There is no document that specifies the exact relation between the language tag and the primary subtag, so I guess it is. Still, it looks kind of wrong.)

For HTML4 documents there are some small problems. HTML4 currently references an older RFC (RFC1766) and doesn’t add the magic line “or its successor” afterwards. Fortunately the older RFC, though obsoleted, is quite compatible and doesn’t really cause any interoperability problems. Still, you can safely use the new RFC as reference, as that is what all UAs have implemented and what the HTML4 errata (if ever published) will reference. It is also what HTML5 will specify. (Although HTML5 will probably reference the even newer, as yet unpublished, RFC.) The other problem is that the language of elements inherits and, in HTML4, the LANG attribute can’t be empty (xml:lang can), so you can’t set the language of an element to ‘null’. Being able to do so would be especially useful when you can’t determine the language of some piece of text you are including in your document, or when you have a piece of text that has no language.

As for retrieving the document language: basically you look at the Content-Language header first. After that, attributes can override the language of individual elements (lang for HTML documents, xml:lang for XML documents, and lang when xml:lang is absent in XHTML documents), and descendant elements inherit that specified language unless it is overridden again. It is very fortunate that browsers first have to look at the Content-Language header, because it means you can safely omit the language attribute from elements. (This also allows you to omit optional tags in HTML4.)
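The lookup order described above can be sketched in a few lines of Python. This is only an illustration under some loud assumptions: the function name resolve_languages is invented here, the Content-Language header is treated as a single tag (a real header may list several), and the document is well-formed enough for the standard xml.etree.ElementTree parser.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def resolve_languages(root, content_language=None):
    """Return (tag name, effective language) pairs in document order.

    Hypothetical helper: content_language stands in for the value of
    the Content-Language header, simplified to a single language tag.
    """
    resolved = []

    def walk(element, inherited):
        # xml:lang takes precedence; lang is used when xml:lang is absent.
        declared = element.get(XML_LANG)
        if declared is None:
            declared = element.get("lang")
        # Without a declaration the element inherits from its parent.
        # An empty xml:lang="" is kept as "" (no language).
        language = inherited if declared is None else declared
        resolved.append((element.tag, language))
        for child in element:
            walk(child, language)

    walk(root, content_language)
    return resolved


document = ET.fromstring(
    '<html><body><p lang="nl-BE">Hallo <em>wereld</em></p>'
    '<p>Hello world</p></body></html>'
)
for name, language in resolve_languages(document, "en"):
    print(name, language)
```

Here the header value en applies to html, body and the second p, while the first p and the em inside it resolve to nl-BE.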

(Parts that are described in the above paragraph are derived from the Authoring Techniques for XHTML & HTML Internationalization working draft.)

And yet, there is so much more to say…

Comments

  1. Mr Ian Hickson btw often tags documents he edits as en-GB-hixie (the Web Applications 1.0 specification for example). Perhaps I should create my own language tag as well. ;-)

    Posted by nick at

  2. I believe there is a small error in the W3C document by the way. One of the examples talks about en-scouse, where RFC3066 says that the primary subtag, which scouse represents in this example, must be either two- or three-letter or it must be prefixed with i or x.

    No, en is the primary subtag in that example: Language-Tag = Primary-subtag *( "-" Subtag )

    Posted by David Håsäther at

  3. Yes, I just discovered that after rereading the specification to make sure I was correct. As I interpreted the word subtag incorrectly I removed the relevant paragraph from the post.

    Posted by Anne at

  4. Of additional interest is the fact that the language codes for all sign languages begin with sgn- and are only passably legal in some documents for the reasons you describe.

    Posted by Joe Clark at

  5. I am too lazy now to look for any official reference, but you can of course even create an x-anne language code, CMIIW. (Gah, need to take a look at the specs again, too.)

    Posted by Jens Meiert at