Anne van Kesteren

HREFLANG and TYPE considered harmful

HREFLANG and TYPE are both attributes of the A and LINK elements. TYPE is also an attribute of OBJECT. Neither attribute applies to AREA; I wonder why. Although embedding can be interesting, I'm going to focus on applying them to the LINK and A elements. Both have some nice features, especially in combination with CSS generated content:

a[hreflang]::after{
 content:" ["attr(hreflang)"]";
}

Using the TYPE attribute you could tell the user that the linked file is a PDF, so they will know that Mozilla almost crashes with the Acrobat 6.0 plugin and that loading will take some time:

a[type="application/pdf"]::after{
 content:" (PDF)";
}

Or the more ironical example of showing IE users that they can't view the site, because it uses XHTML:

a[type="application/xhtml+xml"]::after{
 content:" (XHTML)";
}

(Before you comment on that, reread the word "ironical". Thanks.) Those examples show the value of the attributes, so why do I consider them harmful? Let me explain: you never know for sure what content you will get back. This is, obviously, entirely down to content negotiation. For example: you have an English weblog, but your browser of choice has been configured to say it prefers Dutch content over English content, because you are from the Netherlands. So you visit a site (Google pages are the best example) and you see it has Dutch content. Great, you can read that very easily, it is interesting as well, so let's link to it!

<a href="http://www.google.com/press/zeitgeist.html" hreflang="nl">Google Zeitgeist</a>

As well as pointing to that link, I mention that I miss some browser statistics on that page, which would be nonsense to English users, who will see that information on the page. This would also be an argument that Google needs better architecture, especially since I can't retrieve all information without modifying my browser's Accept-Language header, but what I actually wanted to show is that saying "that content is Dutch" is stupid, since you never know for sure. It might have been a useful attribute if HREFLANG had been specified so that it temporarily altered the Accept-Language header of the user clicking the link, which would allow constructs like the following:

<link rel="alternate" href="/" hreflang="fr"/>
<link rel="alternate" href="/" hreflang="de"/>
<link rel="alternate" href="/" hreflang="nl"/>

Now that would have been nice, but HREFLANG doesn't do that. It specifies the base language of the resource designated by HREF, which isn't useful. So HREFLANG is harmful: guessing what the language might be is stupid, because you can't rely on it. (Unless you stress test the target document with all possible Accept-Language headers, including languages that are 'x-' prefixed (which makes it impossible to test), and it keeps returning only one document in one language.)
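
To make the problem concrete, here is a minimal Python sketch of server-side language negotiation. The function names and the list of available languages are assumptions made for illustration, not how Google or any particular server does it. The point is only that one URL can answer in different languages depending on the Accept-Language header, so a fixed hreflang value on a link to that URL is a guess.

```python
def parse_accept_language(header):
    """Return (language range, q) pairs from an Accept-Language header, best first."""
    ranges = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # a range without an explicit q parameter counts as q=1
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat an unparsable q as "not acceptable"
        ranges.append((tag, quality, position))
    # Highest q first; ties keep the order in which the client listed them.
    ranges.sort(key=lambda entry: (-entry[1], entry[2]))
    return [(tag, quality) for tag, quality, _ in ranges]


def negotiate_language(header, available, default):
    """Pick the language variant a server might return for this header."""
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag == "*":
            return default
        for language in available:
            candidate = language.lower()
            # The range "nl" is satisfied by "nl" itself and by "nl-be" style tags.
            if candidate == tag or candidate.startswith(tag + "-"):
                return language
    return default


AVAILABLE = ["en", "nl"]  # hypothetical variants behind a single URL
print(negotiate_language("nl, en;q=0.8", AVAILABLE, "en"))     # nl
print(negotiate_language("en-us, en;q=0.5", AVAILABLE, "en"))  # en
```

The same request URL yields "nl" for one visitor and "en" for another, which is exactly why the person writing the link cannot know what the person clicking it will receive.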

Now is probably a good time to say that all of the above applies to TYPE as well. Try making a link to my weblog using TYPE ;-). Note that this use of TYPE is different from the use of TYPE on OBJECT; see the current XHTML 2.0 TYPE attribute description for how that works. Note that in XHTML 1.0 it isn't possible to give a list of possible MIME types or to use asterisks; we can use nested OBJECT elements though. So this applies to the TYPE attribute on the A and LINK elements only, since you can't give metadata about a link when that metadata relies on content negotiation. Again, it would be nice if TYPE could change the Accept header:

<link rel="alternate" href="/" type="application/xhtml+xml"/>
<link rel="alternate" href="/" type="application/atom+xml"/>

But I may as well dream on.
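
The wished-for behaviour could be sketched in Python as follows. To be clear, no browser does this: `headers_for_link`, its parameters, and the fallback q-value of 0.1 are all hypothetical, and the sketch only shows what "temporarily altering the header for one click" might mean.

```python
def headers_for_link(default_headers, hreflang=None, media_type=None):
    """Return request headers for following one link, honouring its hints.

    Hypothetical user-agent behaviour: HREFLANG and TYPE are treated as
    requests rather than descriptions, and only for this single request.
    """
    headers = dict(default_headers)  # copy, so the user's defaults stay untouched
    if hreflang:
        # Ask for the hinted language first, anything else as a weak fallback.
        headers["Accept-Language"] = f"{hreflang}, *;q=0.1"
    if media_type:
        headers["Accept"] = f"{media_type}, */*;q=0.1"
    return headers


browser_defaults = {"Accept-Language": "nl, en;q=0.8", "Accept": "text/html"}

# <link rel="alternate" href="/" hreflang="fr"/>
print(headers_for_link(browser_defaults, hreflang="fr"))

# <link rel="alternate" href="/" type="application/atom+xml"/>
print(headers_for_link(browser_defaults, media_type="application/atom+xml"))
```

Under this assumption the three LINK elements with the same HREF would stop being redundant, because each one would produce a different request.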

Comments

  1. Even without bringing content negotiation into the picture it does seem hard to keep these attributes updated on links to sites that you do not control yourself.

    Posted by Peter Winnberg at

  2. Don't forget that you could get a list of available languages for a page easily by looking at that page's <link hreflang=""> elements. However, most pages don't have proper link elements anyway (including that page at Google).

    Also, keep in mind that the available languages for a page can change very easily over time. For example, somebody could just have made a perfect list of all the available languages for a particular page, and just after that I upload an Afrikaans version of the page. That means that the list would already be outdated. Too complicated if you ask me.

    The beauty of using HTTP_ACCEPT_LANGUAGE is that you can automatically have a page displayed in the language of your preference without any further human intervention. However, if you follow Anne's idea of having hreflang influence that, the beauty is taken away.

    For example, Anne could link to the Dutch version of any particular page, but I would prefer to see the Afrikaans version. So when I click on the link, I want to be taken to the Afrikaans version, not the Dutch one.

    So, as far as I'm concerned, hreflang is only useful on link elements (for when you provide a list of possible languages for any particular page on that page itself where it can easily be updated as necessary).

    Posted by Charl van Niekerk at

  3. But if you look at a site that uses content negotiation in a "real" way this is not a problem.

    Take http://www.debian.org for example: when I go to that URL I get a Swedish page. The Vary HTTP header field indicates that content negotiation is used, and the site also sends a Content-Location HTTP header field to indicate the "real" URL of the Swedish index page (/index.sv.html).

    So if I could link to that site using the type and hreflang attributes it would look like this:

    <a href="http://www.debian.org/index.sv.html" hreflang="sv" type="text/html">Debian</a>

    And it would work just fine.

    But let's say that I have a site that sends XHTML 1.1 to clients that accept it and HTML 4.01 to the rest, and that uses the Content-Location header field in the same way as debian.org does. Google, for example, would then link to URLs that look like this: http://example.com/index.html and http://example.com/index.xhtml instead of my nice clean URL.

    So doesn't this come down to whether you want something in the URL to indicate the content type or not?

    Posted by Peter Winnberg at
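
A rough Python sketch of the response headers described in comment 3 follows. The `VARIANTS` table and the `respond` function are assumptions for illustration, not Debian's actual configuration, and q-values are ignored for brevity (the header is assumed to list languages in order of preference).

```python
# Hypothetical file layout behind one negotiated URL.
VARIANTS = {"en": "/index.en.html", "sv": "/index.sv.html"}


def respond(accept_language, default="en"):
    """Return the headers a negotiating server might send for the index page."""
    wanted = [item.split(";")[0].strip().lower() for item in accept_language.split(",")]
    language = next((tag for tag in wanted if tag in VARIANTS), default)
    return {
        "Content-Type": "text/html",
        "Content-Language": language,
        # The "real", non-negotiated URL of the variant that was chosen.
        "Content-Location": VARIANTS[language],
        # Tells caches and clients that the response depends on this request header.
        "Vary": "Accept-Language",
    }


print(respond("sv, en;q=0.5")["Content-Location"])  # /index.sv.html
print(respond("fr")["Content-Location"])            # /index.en.html
```

A link that carries hreflang="sv" is only trustworthy when it points at the Content-Location URL, not at the negotiated one.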

  4. I'm trying not to go too far off-topic, so here goes... In reply to comment 3:

    I think this opens up a far bigger can of worms. One of the things I am also concerned about is the indexing by Google. How will Google find these different versions of the same page? Technically, Google must know about both the text/html and application/xhtml+xml versions and about all of the different available languages.

    Maybe the best way to do this is by putting querystring parameters in the URL. Then you can see that they aren't really necessary, but that they MAY (as defined by some RFC) be put inside the URL if desired.

    For example:

    Then you could also have:

    Or of course a combination:

    <a href="http://annevankesteren.nl/?content-type=application/xhtml+xml&amp;language=en" type="application/xhtml+xml" hreflang="en">Anne van Kesteren in English (XHTML)</a>

    Note that I did not mark up the abbreviations properly, but that isn't meant to be part of the example.

    The point is that, if the querystring values are specified correctly, then they should be used. Those that are not specified, can be left up to content negotiation if necessary.

    But I think for normal links it is better to leave all of this up to content negotiation, unless there is some special reason not to. For example, if I ask Google for pages only in Afrikaans, it might decide to return the extra language querystring parameter inside the URLs.

    The most important thing is to have proper link elements to specify all alternatives. Stress testing wastes bandwidth IMHO.

    Posted by Charl van Niekerk at
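
The rule proposed in comment 4 (explicit querystring values win, everything else is left to content negotiation) might be sketched like this in Python. The parameter names `language` and `content-type` come from the example link above; the function name and the example.com URLs are assumptions. One caveat worth knowing: in a query string a literal "+" is decoded as a space by `parse_qs`, so the sketch escapes the plus sign in application/xhtml+xml as %2B.

```python
from urllib.parse import parse_qs, urlsplit


def choose_variant(url, negotiated_language, negotiated_type):
    """Prefer explicit querystring values; fall back to the negotiated ones."""
    query = parse_qs(urlsplit(url).query)
    language = query.get("language", [negotiated_language])[0]
    content_type = query.get("content-type", [negotiated_type])[0]
    return language, content_type


# Nothing specified: content negotiation decides everything.
print(choose_variant("http://example.com/", "af", "text/html"))

# Only the language is pinned; the type is still negotiated.
print(choose_variant("http://example.com/?language=en", "af", "text/html"))

# Both pinned (note the escaped plus sign).
print(choose_variant(
    "http://example.com/?content-type=application/xhtml%2Bxml&language=en",
    "af",
    "text/html",
))
```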

  5. Documents should reside at one unique location and stay there.

    Different language versions of documents are different documents, and therefore should reside at different locations.

    This basically is what Peter Winnberg wrote. If people would follow these guidelines, HREFLANG and TYPE wouldn't be harmful.

    Don't you just love thinking about utopia?

    Posted by Mark Wubben at

  6. If a resource is available in multiple languages, using content negotiation to determine which language the client wants is a fine method, very RESTful, and not bad at all. But what most people seem to forget is announcing, at the resource URI, which languages the resource is available in. Content-Language is afaik supposed to solve that, and if clients supported it, the user could be given the available choices before the final GET was executed.

    As it stands, there's no negotiation. A resource which is available in many languages only serves the language it thinks is best for the client, even if the client would actually want something else. But if the user doesn't know what else is available, that negotiation is really a one-step process, like saying «I want red or green apples, in that specific order» and getting green apples in return.

    If the resource beforehand (by doing a HEAD on it) could state that «I have yellow and green apples», the user could be presented these options, and choose one of them. Maybe the user would rather want yellow than green apples, for instance. Then, the chosen option could be ranked first in the Accept-Language header on the following GET request. The yellow apple would then be returned to the user.

    Posted by Asbjørn Ulsberg at
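
The second step of the exchange described in comment 6 (the user picks one of the advertised options and it is ranked first on the following GET) could look roughly like this in Python. The function name and the way q-values are reassigned are assumptions; the sketch also assumes the existing header already lists languages in order of preference.

```python
def promote_language(accept_language, chosen):
    """Return a new Accept-Language value with the chosen language ranked first."""
    tags = [item.split(";")[0].strip() for item in accept_language.split(",")]
    others = [tag for tag in tags if tag and tag.lower() != chosen.lower()]
    parts = [chosen]  # no q parameter means q=1, the highest preference
    for index, tag in enumerate(others):
        # Step the remaining languages down from 0.9, never below 0.1.
        quality = max(0.1, round(0.9 - 0.1 * index, 1))
        parts.append(f"{tag};q={quality}")
    return ", ".join(parts)


# The server advertised "af" as available and the user picked it.
print(promote_language("en, nl;q=0.8", "af"))  # af, en;q=0.9, nl;q=0.8
```

The user's usual preferences are kept as fallbacks, so the override affects only which variant comes first.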

  7. In reply to comment 5 & 6:

    I don't think keeping separate documents for different languages is optimal. You are then losing out on the benefits of automatic content negotiation.

    When I go to Google, it is in Afrikaans because I set my preferences in my browser correctly. I don't want to have to click on something to get my language of choice every time. I think that this would be highly impractical, irritating, and unnecessary.

    I personally would like the computer to decide for me by default, unless I specify otherwise. The computer knows what my typical language preferences are by the settings I set, so why is human intervention necessary every time?

    However, a manual override is always good. The list of available languages can be in the link elements, and they can then be displayed to me in some way through my browser. I can then select an alternate language if I like from the list.

    Posted by Charl van Niekerk at

  8. The beauty of using HTTP_ACCEPT_LANGUAGE is that you can automatically have a page displayed in the language of your preference without any further human intervention. However, if you follow Anne's idea of having hreflang influence that, the beauty is taken away.

    Why is it taken away? You get the page you want first, of course (using content negotiation). But if you want to view the English content you would at least have the possibility.

    But if you look at a site that uses content negotiation in a "real" way this is not a problem.

    The problem is that no browser supports Content-Location, except for Opera, and that with bugs. If I read the comments on the Mozilla bug thread correctly, it seems that the RFC is incorrect.

    Google must know about both the text/html and application/xhtml+xml versions and about all of the different available languages.

    Why? I don't think Google has application/xhtml+xml in its Accept header, so it will never get that version.

    The point is that, if the querystring values are specified correctly, then they should be used. Those that are not specified, can be left up to content negotiation if necessary.

    That isn't different from the solution provided earlier, using extensions for differentiation.

    The most important thing is to have proper link elements to specify all alternatives. Stress testing wastes bandwidth IMHO.

    Stress testing is impossible, like I said. Having the alternatives specified doesn't work when the URL is the same, though.

    Different language versions of documents are different documents, and therefore should reside at different locations.

    +1 (You have the same opinion for different types of documents having the same content, right?)

    I don't think keeping separate documents for different languages is optimal. You are then losing out on the benefits of automatic content negotiation.

    Not at all, how would you lose these advantages? You could still use content negotiation even if you have multiple separate documents.

    When I go to Google, it is in Afrikaans because I set my preferences in my browser correctly. I don't want to have to click on something to get my language of choice every time. I think that this would be highly impractical, irritating, and unnecessary.

    You misunderstand the concept. Google will give you the localized version, but that version will have a permanent location as well, allowing you to choose the English version too.

    Posted by Anne at

  9. Anne, as long as you have one URL for each different alternate version there is no problem.

    Then you can specify link rel="alternate" (along with type and hreflang) without any problem.

    Posted by Peter Winnberg at

  10. This is the problem:

    Posted by Anne at

  11. Different language versions of documents are different documents, and therefore should reside at different locations.

    +1 (You have the same opinion for different types of documents having the same content, right?)

    Yes, Atom files, for example, should have a different location (preferably by adding .atom).

    Posted by Mark Wubben at

  12. The reliability of a link is low. It seems that half of the links referred to in scientific documents since around 1995 are no longer accessible today, making a lot of scientific publications impossible to understand because the information they rely on is simply not there. Given this relative unreliability of links without a hreflang or type attribute, consider the number of times a MIME type or a language might change [1].

    However, does something like this fall under reliability? I don't think so. Just because you have once chosen a language or a MIME-type does in no way mean that you would be unreliable by changing that. You are only unreliable if you remove the way to access a page at a certain location (of course not meaning IP-addresses).

    Still, this doesn't render these attributes useless. As stated by Charl van Niekerk, they are definitely useful for linking to and from pages on your own site. As with almost any element and attribute, these can be abused (just think of the alt attribute on the img tag).

    ------------
    [1] If you consider the latter unlikely, just take me or Anne. How unlikely would it be for us to switch from English to Dutch or the other way around? Or to German? Only a little less likely than changing MIME-type. Hereby I just mean that www.google.com would display Dutch as default instead of English and the English version would reside at www.google.co.uk (stupid example, but I needed something).

    Posted by Frenzie at

  13. The beauty of using HTTP_ACCEPT_LANGUAGE is that you can automatically have a page displayed in the language of your preference without any further human intervention. However, if you follow Anne's idea of having hreflang influence that, the beauty is taken away.

    Why is it taken away? You get the page you want first, of course (using content negotiation).

    Anne, I don't really understand what you mean.

    My point was, that if you specify what language the user must view the page in, you are taking away the advantage of having the user's user agent figure out what language is best for him. Of course he can later choose a different language, but I would suggest rather linking to a language-neutral URL and let the user agent or the user figure it out himself.

    But if you want to view the English content you would at least have the possibility.

    This is precisely what I described in previous comments.

    Why? I don't think Google has application/xhtml+xml in its Accept header, so it will never get that version.

    I know, but it could have. And I believe that it will have one day, when XHTML is commonly used and understood.

    There could be many reasons why one would only like to retrieve XHTML documents. For example, I might want to know how many people have Afrikaans websites with XHTML on them.

    The point is that, if the querystring values are specified correctly, then they should be used. Those that are not specified, can be left up to content negotiation if necessary.

    That isn't different from the solution provided earlier, using extensions for differentiation.

    It is a little different. File extensions aren't properly defined. A full stop that is used to divide the name of a file from its extension isn't limited to that use. You could also have:

    http://www.google.com/some.uri.that/doesnot.exist

    In the above example, is "exist" a file extension?

    In other words, full stops must be seen as part of the document path.

    However, querystring parameters are a different story. They are not meant to be part of the document path. However, the boundaries for this aren't properly set either.

    For example, many content management systems abuse querystrings by serving completely different documents depending on querystring parameters. I think this also probably has to do with URI semantics and abuse of it. But then, these things aren't properly defined either (as far as I know).

    You can always remember the Vary HTTP header. However, it doesn't specify to what extent the changes are inside the document. In other words, is it only different content-types or completely different documents? I think this is a little under-defined as well.

    I don't think keeping separate documents for different languages is optimal. You are then losing out on the benefits of automatic content negotiation.

    Not at all, how would you lose these advantages? You could still use content negotiation even if you have multiple separate documents.

    Sorry, I think I didn't express myself properly. What I meant by different documents was actually different file names. Coming back again to rather using querystrings.

    You misunderstand the concept. Google will give you the localized version, but it will have a permanent location as well, allowing you to choose for the English version as well.

    What I was suggesting was something like this:

    Therefore, every language does have a different location, but if you type in http://www.google.com it will be left up to content negotiation.

    I think the only question that still remains is whether different translations of one document should each be seen as a document in its own right. If they are seen as separate documents, giving each a separate filename is probably best. Otherwise, use querystrings.

    Posted by Charl van Niekerk at

  14. For example, many content management systems abuse querystrings by serving completely different documents depending on querystring parameters. I think this also probably has to do with URI semantics and abuse of it. But then, these things aren't properly defined either (as far as I know).

    And that is exactly the reason one should use different files, using something like name.language.type for files. Since we are talking about different documents here. The English version is different from the Dutch one and Atom has different semantics compared to XHTML.

    Posted by Anne at
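
The name.language.type convention from comment 14 can be read back out of a filename with a few lines of Python. This is only a sketch under the assumption that every variant really follows that three-part pattern; the function name is made up for the example.

```python
def split_variant_name(filename):
    """Split 'name.language.type' into its three parts, or return None."""
    parts = filename.rsplit(".", 2)  # split on the last two full stops only
    if len(parts) != 3 or not all(parts):
        return None
    name, language, type_extension = parts
    return name, language, type_extension


print(split_variant_name("index.sv.html"))  # ('index', 'sv', 'html')
print(split_variant_name("doesnot.exist"))  # None
print(split_variant_name("some.uri.that"))  # ('some', 'uri', 'that')
```

The last call shows the ambiguity raised in comment 13: the full stops alone cannot tell a language code from any other path segment, so the convention only works on a site that applies it consistently.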

  15. I think different translations are separate documents. Google does form a sort of exception/shady area here because except for a few little details (Google search in English as opposed to Google zoeken in Dutch) the page is 99% the same after entering a search term.

    Posted by Frenzie at

  16. Frenzie, please see the example I posted in the entry (from Google), I don't call that the same page.

    Posted by Anne at

  17. And that is exactly the reason one should use different files, using something like name.language.type for files.

    Ok, I see your point. I agree.

    Posted by Charl van Niekerk at

  18. Anne, it's my fault, I had the window open for about an hour before I returned to it again and pressed post. It's a bit stupid, because a lot of people could have posted in between.

    I just noticed a remarkable difference between Google English and Google Dutch, just check this:

    Results 1 - 10 of about 6,010,000 for zoek. (0.23 seconds)

    Resultaten 1 - 10 van circa 6,190,000 voor zoek (0.16 seconden)

    Looking at the results it also appears that Google Dutch automatically searches for Dutch results, even though I didn't tell it to do so (zoeken in pagina's in het Nederlands was off).

    The Zeitgeist is of course very different, but I would have expected the search results to be the same except for some language specific details (search, images and such) if entering the same search query. I was wrong. Of course there is a need for different versions in the way you stated, but I don't think that takes away that it would have been a sort of shadow area - would the results be the same, which they are not. Therefore, it's not a shadow area and the need for two different locations is much more clear.

    Posted by Frenzie at

  19. I would have expected the search results to be the same except for some language specific details (search, images and such) if entering the same search query.

    Though this is getting a bit off-topic: Google's various versions not only differ language-wise but also content-wise. Google removes pages from their index as required by local law, which leads to the indexes available to users of a specific language-version (whether they are from that country or not) being of different sizes.

    Posted by Gerrit at