Anne van Kesteren

URI (IRI) design

Lots of things have been written about URI design; probably the most well-known is Cool URIs don't change by Tim Berners-Lee. Although lots of companies and people who are just starting to build websites don't invest time in creating a good URI structure, doing so can certainly help usability and marketing, for instance. Easy to remember: http://example.org/contact; difficult: http://example.org/index.php?page=contact. That last example is not only difficult; it also looks like a URI that is not meant to stay. The next company that redesigns the site and doesn't have much knowledge of PHP, but likes Perl and generates static pages, might use http://example.org/contact.htm. (Or the other way around, just as evil.) These things happen all the time on the web. The little things that make a website better, like proper encoding, guessable URIs and correct HTTP headers, often don't get the attention they deserve.
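To make the "keep old URIs working" idea concrete, here is a minimal sketch of how a site might map retired addresses onto the clean one and answer them with a 301 redirect. The table of old paths, the set of known pages and the function name are all made up for illustration; a real site would fill them from its own history.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical: pages that exist on the site under their clean names.
KNOWN_PAGES = {"contact", "about"}

# Hypothetical: retired paths and the clean path that replaced them.
LEGACY_PATHS = {
    "/contact.htm": "/contact",
    "/contact.html": "/contact",
}

def canonical_location(url):
    """Return the path to send in a 301 Location header, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.path == "/index.php":
        # Old script-style address such as /index.php?page=contact
        page = parse_qs(parts.query).get("page", [""])[0]
        # Only redirect to pages we know, so arbitrary input cannot steer the redirect.
        return "/" + page if page in KNOWN_PAGES else None
    return LEGACY_PATHS.get(parts.path)

for old in ("http://example.org/index.php?page=contact",
            "http://example.org/contact.htm",
            "http://example.org/contact"):
    print(old, "->", canonical_location(old))
```

The point of the sketch is only that every address that was ever published keeps resolving, whatever technology happens to sit behind it this year.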

Even if your company is paying attention to it, it can still be difficult. Let's say you have a simple page called 'mother'. The website you make will have three languages: French, English and Dutch. The page 'mother' is part of 'about' and will have the following URI: http://example.org/about/mother. That looks great, doesn't it? But how about the French page? 'mother' in French is 'mère', which has one character that isn't allowed in a URI but is allowed in an IRI, something we can't use, unfortunately. What is the URI for the French file going to look like?

  1. http://example.org/%C3%A0-propos/m%C3%A8re
  2. http://example.org/a-propos/mere
  3. http://example.org/about/mother.fr
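The first two options in the list above can be produced mechanically from the path one would like to write. The following is a small sketch using Python's standard library; the helper names are invented for this example, and the ASCII folding is only one possible way to arrive at 'mere'.

```python
import unicodedata
from urllib.parse import quote, unquote

def percent_encode_path(path):
    """Option 1: encode the path as UTF-8 and percent-escape every non-ASCII byte."""
    return quote(path, safe="/")

def ascii_fold_path(path):
    """Option 2: strip accents by decomposing characters and dropping the non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", path)
    return decomposed.encode("ascii", "ignore").decode("ascii")

wanted = "/à-propos/mère"
encoded = percent_encode_path(wanted)
folded = ascii_fold_path(wanted)

print(encoded)
print(folded)
# Percent-encoding is reversible, folding is not.
print(unquote(encoded) == wanted)
```

Note the difference in what is lost: the percent-encoded form still carries the real French spelling, while the folded form throws the accents away for good.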

If we could use IRIs, the IRI would most likely look like http://example.org/à-propos/mère, which might even be possible today if we look at the implementation recommendations for non-ASCII characters in URI attribute values, which are implemented in all major browsers according to the IRI document referenced before. I'm not sure which method I would use. The first is probably the most correct, but very difficult to spell, although that doesn't have to be the case in modern browsers. The second looks OK, but is plain wrong from a French point of view. If I had taken a different country, like Japan, what kind of URI would you use for ウェブ (which is Japanese for "web", I hope)? The URI-encoded string for that Japanese word is '%E3%82%A6%E3%82%A7%E3%83%96'. Although it is probably better, looks good in most western languages and is good for Google to actually translate the word, even if you only make it 'mere' instead of 'mère', there are some disadvantages to this method:

These disadvantages, except for the first (although you could have redirects or a special 404 for those cases), are taken away when you have a script that handles HTTP headers and content negotiation.
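As an illustration of what such a script might do for languages, here is a rough sketch that reads an Accept-Language header and picks one of the three languages of the example site. The function names and the default language are assumptions made for the example, and the parsing is simplified compared to what the HTTP specification allows.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language header, most preferred first."""
    weighted = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without an explicit q value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            weighted.append((quality, tag))
    # The sort is stable, so tags with equal weight keep their header order.
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return [tag for _, tag in weighted]

def choose_language(header, available, default):
    """Pick the best available language, falling back to the primary subtag (fr-CH -> fr)."""
    for tag in parse_accept_language(header):
        if tag == "*":
            return default
        if tag in available:
            return tag
        primary = tag.split("-")[0]
        if primary in available:
            return primary
    return default

AVAILABLE = ("en", "fr", "nl")  # the three languages of the example site
print(choose_language("fr-CH, fr;q=0.9, en;q=0.8", AVAILABLE, "en"))
```

With something like this in place, a request for http://example.org/about/mother could be answered in the visitor's language without the language appearing in the URI at all.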

Now there isn't much of a problem distinguishing between documents when the words differ among languages, but what practice should be used when the words are the same? In the Netherlands, for example, we have borrowed quite a few words from other languages, and I guess that every language has had many influences from others. (The contrary could also be true: two words spelled the same in two countries but with a different meaning; figure that out ;-).) Here we use the same word for 'contact' as people from Great Britain and other English-speaking countries do. Again, there are different options:

Personally, I would very much prefer the first. Adding directories doesn't seem like a nice solution to me, and using arguments is not very clear either. It is quite easy to pass arguments like 'contact.nl' to some script on the server and let that script handle a database request to return the appropriate content.
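A sketch of the 'contact.nl' idea, assuming the server hands the requested path to a script: the script splits off a trailing language code and looks up the content. The dictionary stands in for the database, and its contents, the language set and the function names are placeholders invented for this example.

```python
LANGUAGES = {"en", "fr", "nl"}  # languages the example site offers

# Stand-in for the database: (page, language) -> content. Placeholder text only.
CONTENT = {
    ("/contact", "en"): "Get in touch.",
    ("/contact", "nl"): "Neem contact met ons op.",
}

def split_language(path, default="en"):
    """Split '/contact.nl' into ('/contact', 'nl'); leave other paths alone."""
    head, dot, suffix = path.rpartition(".")
    if dot and suffix in LANGUAGES:
        return head, suffix
    return path, default

def lookup(path):
    """Return the content for a path such as '/contact.nl', or None if there is none."""
    return CONTENT.get(split_language(path))

print(split_language("/contact.nl"))
print(split_language("/contact"))
print(lookup("/contact.nl"))
```

Because only known language codes are treated as a suffix, a path like '/logo.png' passes through untouched.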

Above I mentioned some of the internationalization problems, but there is a lot more that is needed for a good URI:

Extensions like '.htm', '.xht', '.xml', '.png', '.css' and others don't make much sense either, although it is easier to configure Apache and other web servers if files have them. However, you do need them when you have multiple files where the content isn't different, but the media type is. There are some weblogs, for example, that have both XHTML and HTML content and serve them depending on the browser you use. That is nice, of course. I could point someone to http://example.org/archive/2004/05/google and the server decides whether the client gets 'google.xht' or 'google.htm' (it doesn't really matter if those are physical files or dynamically generated ones). At least you can request the XHTML-specific version if you want to. Most sites (including Google) don't offer this option. I guess this was it: go design your URIs and make sure the current URIs keep working.
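For the 'google.xht' versus 'google.htm' case, the decision could be sketched roughly like this. It is a simplified reading of the Accept header, not a complete implementation of HTTP content negotiation, and the function names and the tie-breaking rule (prefer HTML when the client is indifferent) are my own assumptions.

```python
# Variants of the same document, keyed by media type. On a tie the first one wins.
VARIANTS = {
    "text/html": "google.htm",
    "application/xhtml+xml": "google.xht",
}

def parse_accept(header):
    """Return (media range, q value) pairs from an Accept header."""
    ranges = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        ranges.append((media_range, quality))
    return ranges

def quality_for(media_type, ranges):
    """Look up the q value for a media type; the most specific matching range wins."""
    major = media_type.split("/")[0]
    for candidate in (media_type, major + "/*", "*/*"):
        for media_range, quality in ranges:
            if media_range == candidate:
                return quality
    return 0.0

def choose_variant(header, fallback="google.htm"):
    """Return the file name of the variant the client prefers."""
    ranges = parse_accept(header)
    best_file, best_quality = fallback, 0.0
    for media_type, file_name in VARIANTS.items():
        quality = quality_for(media_type, ranges)
        if quality > best_quality:
            best_file, best_quality = file_name, quality
    return best_file

print(choose_variant("application/xhtml+xml, text/html;q=0.9, */*;q=0.8"))
print(choose_variant("text/html, */*;q=0.1"))
```

A server that answers this way should also send a `Vary: Accept` response header, so that caches know the same URI can yield different representations.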

Disclaimer: the French translation might be wrong, but it is clear what I meant, right?

Comments

  1. I sooo disagree with you about the extensions. Where did you get the concept that they are evil? Evil, as in — destructive, or bringing misery and harm upon the unsuspecting? Bring us to our doom, or something? I hope this is not Yet Another Holy Crusade for Nothing. New religions are invented all too fast nowadays.

    You obviously never handled graphic files in a substantive quantity, or worked with them daily, or you would never, ever declare what you just have. No one who works with graphics and multimedia is going to abandon extensions for the chimera of — what exactly?

    That is the entire problem. Removing extensions from static files, all kinds of them, is a hassle of a magnitude directly proportional to the usage. What I see is costs, costs, costs. Maintaining the Apache rewrites for such trivial tasks is a horrible burden which does not benefit the user, nor - under any circumstances - the author. Several stages of time wasting are introduced in between the stage of creation and the stage of publication. And just who will write new tools to sniff the content type, formats, et alia, for local disks?

    If you mean only removing the extensions by rewrites, then this is particularly harmful and counterproductive. Are you going to change the ways of everyone who ever produced graphics and prohibit them from generating multiple image versions with the same name and different extensions because you want them to get converted to a new religion? How are you going to convince them to make their life miserable? What is the reward? Angelic afterlife, or wot? And yeah, everyone is just going to love the internal URIs for images etc. Say /image . That a directory, or a PNG file, or wot? A document? Why, if there are to be no extensions, then we must hire some new staff who would keep track of it all for us. The users will also welcome this, no doubt. Anne versus the rest of the world.

    No, Anne, thank you very much, I'll have what I have now. There are nobler purposes for shedding my blood as it is.

    Posted by Moose at

  2. Having extensions is not that bad (I mentioned use cases, but not particularly for graphics). Example: you have a PNG and an SVG file. You could have both 'contact.png' and 'contact.svg', point to it as 'contact' and let the server negotiate with the browser which file it wants to accept. (XHTML 2.0 is heavily based on this concept, I believe.)

    Posted by Anne at

  3. The French is correct. Why do you prefer a hyphen? Ease of typing it compared to an underscore?

    I recently started thinking more about my URIs. I find .htaccess especially useful when it comes to this kind of stuff. I can specifically define a file, group of files, or a directory of files as a certain content type. Later I can change it without having to track down all the links and change the extensions and worry about 404s and 410s. I assume other servers have similar concepts built in.

    Posted by Devon at

  4. When you drop the extensions you also lose the difference between files and directories. I haven't decided yet if this is a good thing or a bad thing.

    For example, http://annevankesteren.nl/archives looks like a file, especially since /archives/ gets redirected to /archives. But then in the case of http://annevankesteren.nl/archives/2004/08 it would seem /archives is a directory. So I would guess that requesting /archives shows me the default page for the /archives directory (/archives/index or something).

    All this seems to me to be very confusing, I hope there is a solution that is both consistent and easy.

    Posted by Tom at

  5. My favorite: /x/y/z/en, or if you want to add an extension, /x/y/z/en.html.

    Posted by Tonico at

  6. Tom: /archives/ is a directory, /archives is not. /mother.png is a file, /mother.png/ a directory. There's no confusion about that, IMHO.

    Posted by Robbert Broersma at

  7. A directory is just a metaphor. In Zope-land, for instance, you can access an action for an object like /x/y/object/document_view or /x/y/object/document_edit_form

    Posted by Tonico at

  8. I think the "default view" of a browser is a web document. This document might be HTML or XHTML; it might have a php or aspx or cfm extension. As the HTML page is the default view, why would you need the extension? Of course this is highly subjective, let's not fight over it.

    On the /archives/ subject, when you request that resource -- and it's a directory -- you actually get /archives/index.html. Because of this I think /archives is clearer. Again, of course, this is highly subjective.

    Posted by Mark Wubben at

  9. How do you know that /dir is not a directory? It could quite easily be; adding a trailing slash is just a convention. If you're suggesting we should take that convention and make it law (or butter it up with 'Recommendation' if you prefer) then I agree with you, but you can't know that /dir is not a directory just by the lack of a slash.

    Likewise, you can't know in /dir/x/y/x that dir is a physical directory (it certainly is a 'virtual' directory, or a directory clientside of .htaccess, that is, you could be .htaccessing it to a file called /dir, but the client doesn't care, it still looks like a directory).

    ps, Anne: I'm getting a load of validation errors. I don't want to fight your system, but come on. \n\n = </p><p>. Don't make me put paragraph tags around my text, I'm lazy.

    Posted by David House at

  10. I'm on the fence here. I prefer not to see URLs with lots of crufty garbage; however, aggressive rewriting is time-consuming, and as Moose reminds us, expensive.

    A querystring is not a bad thing - it just needs to be avoided if possible. I don't believe file extensions are necessary for pages, and quite frankly I like hiding the technology with which pages are built, but I do think they must remain for other files, such as graphics.

    Posted by Simon Jessey at

  11. Like Moose, I disagree completely. I can, however, see some uses for it:

    The first example is self-explanatory, but for the second I'd like to explain a bit. Let's say you wanted to visit PHP.net and refresh your memory on the order of the arguments in preg_match. You'd normally go to the website and type in the search block at the top right preg_match. But there's an easier way. You can type in your browser http://www.php.net/preg_match, and it would take you to the same page. Formatting URIs in that manner is useful for extra usability. It's relatively useless for me as an Opera user, though, because I just type php preg_match and it searches it XD.

    Posted by Dustin Wilson at

  12. Referring to Dustin's example, Firefox can sensibly find php preg_match too. I believe it uses Google's I'm feeling lucky thingy.

    Posted by Simon Jessey at

  13. If there ever was one, we can now stop the discussion about hyphens versus underscores. I was right all along and I'm happy with that.

    Posted by Anne at

  14. Hyphens are better of course. And I agree in everything you write, Anne, other than that I think every resource but static files should have some form of rewrite mechanism on top of them.

    Having .php, .asp, .aspx, .jsp etc. in the URI is so horrific that I don't even know how to express it in decent English (without being censored ;), and having querystrings on top of that is just so thoroughly evil that it should be illegal and punishable.

    Of course, querystrings have their usage, but I'd say that most of the time, they're utterly useless and should much rather have been rewritten from a nice path separator scheme. This makes the URIs so much simpler for the users to read and type, makes Google like them better, and also separates the URIs from your web application implementation, so that you can re-implement everything in another language in the future, and still keep the old URIs.

    Posted by Asbjørn Ulsberg at

  15. I rewrote all my links that used queries on the end because I read Google didn't like them. I now believe Google doesn't mind them at all, but they may rank your pages lower. Either way, the links are much neater now.

    I try to use dashes instead of underscores in links for the simple reason that they remain visible when the link is underlined.

    I don't believe showing an extension such as ".php" is wrong, it informs the user how the page was made. (Ie: dynamically, not static.) Though you could argue the user should not know about this.

    IRI, something we can't use unfortunately...

    [On non-ASCII characters in URI attribute values] ...which is implemented in all major browsers according to the IRI document referenced before.

    Why can't we use IRI? The second part of the above quote suggests IRI will work in all modern browsers.

    I recently saw a website for a domain company saying you could now buy domain names with international characters in them. Some examples used Japanese characters. But these are not OK to use?

    http://example.org/contact.nl

    I do not like this suggestion. It implies the document is in a format ".nl", which is of course nonsense. (Same goes for ".fr".) Seems a bit like a hack to me. Someone mentioned www.php.net, well they have UK versions of their pages, referenced by changing the start of the URL, like this:

    http://uk.php.net

    I imagine a German version might be written as "http://de.php.net/", and so on.

    Make sure they [your page links] are permanent. Even if you need to use 10 permanent redirects, do it. Always keep your URIs working.

    This is sound advice. I cannot agree more. Every time I move a key page of my site, I try and remember to include a page saying it has moved, and offering the new address. There's nothing worse than googling a link only to find it's been changed.

    Posted by Chris Hester at

  16. Quoniam id fieri quod vis non potest, velis id quod possit... ;-)

    Posted by Robbert Broersma at

  17. What about using http://example.org/contact.html or http://example.org/contact.de to get a specific version of a page (be it language or HTML/XML) and http://example.org/contact to let the server decide via content negotiation or whatever?

    Posted by Christoph Wagner at

  18. Anne, do you (or anyone else here) know of a way to make Movable Type use hyphens instead of underscores?

    Posted by Devon at

  19. I do. Back in the day, when I used Movable Type, I searched on Google and found a nice post explaining the subject. You probably want to do some more hacking to make them even better! ;-)

    Posted by Anne at

    • Keep everything lowercase.

    Why do you recommend all lowercase URIs?

    And why are you talking about “.xht” and “.htm” file extensions? I think “.xhtml” and “.html” are much more widely used on the Web. Those three-letter extensions leave some ancient DOS feeling … :-)

    Posted by Lars Kasper at