Anne van Kesteren

Good IRIs (URI design)

7 June 2005

When trying to figure out what IRIs to use it’s generally a good idea to avoid IDs in them, especially IDs that come straight out of the database and are autogenerated when you add a new post. IDs are an easy way to create a unique IRI and are supposed to give you permanent IRIs, sure, but they have several drawbacks. They can, however, be trivially implemented using some Apache-fu and PHP:

RewriteRule ^item/([0-9]+)$ /item.php?id=$1

In item.php you can have something like the following:

$q = mysql_query("SELECT content FROM items WHERE id = {$_GET['id']}");

… and you’re done. Or maybe not. An obvious argument against the usage of IDs is that they don’t convey any information. They are solely a unique identifier for the content, but can’t tell you what the content is about. A much larger problem, however, is that IDs can change. Say you’re using a 1-based ID index and then you move to some other system. You export all your content to some XML file (doing what cool people do) and later import it into a new database table. This table happens to use a 0-based ID index, so every new ID is the old ID minus 1. This sounds like fiction, but it happened to me. The weblog system I was using generated permanent links in this form:

/weblog/item/id

Where id is an ID. The new IRIs looked like:

/archives/year/month/day/slug

Back then I never took the time to redirect the old IRIs to the new ones. I’m not sure if I had the knowledge to do it, but I was certainly un-cool. After viewing my Apache logs for the past couple of weeks it became clear to me that there were still resources on the web pointing to my very old scheme. I quickly wrote a redirector to take care of it. (It takes the ID, subtracts 1, and then queries the database for the permalink field of the row with that ID. The result ends up in an HTTP 301 permanent redirect.)
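A minimal sketch of such a redirector, written in Python rather than the PHP used on the site. The `PERMALINKS` dictionary stands in for the database table, and its entries, like the function name, are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical stand-in for the new table: new (0-based) ID -> stored permalink.
PERMALINKS = {
    0: "/archives/2004/01/05/hello-world",
    1: "/archives/2004/01/09/second-post",
}

# Old permanent links had the form /weblog/item/<id>.
OLD_PATH = re.compile(r"^/weblog/item/([0-9]+)$")


def redirect_old_permalink(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_location) for an old-style path, or None if there is no match."""
    match = OLD_PATH.match(path)
    if match is None:
        return None
    new_id = int(match.group(1)) - 1  # old IDs were 1-based, new ones are 0-based
    location = PERMALINKS.get(new_id)
    if location is None:
        return None  # unknown ID: the caller should answer with a 404
    return 301, location
```

Looking up a stored permalink, instead of rebuilding the IRI from the date and slug, means the redirect keeps working even if the IRI scheme changes again.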

Now of course when you’re doing everything yourself you can make sure IDs stay unique by storing them twice for example. Storing the permanent link in a separate field is always a good thing. But for most people this is not the case.

Also, as pointed out above, IDs don’t say anything about the content at all. Slugs do. Slugs are often just a hyphenated title, but they don’t have to be. Good slugs describe the content of the page they are associated with, and for extra Google points they use hyphens rather than underscores.
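For the common case of a slug that is just a hyphenated title, a rough sketch in Python could look like this. The function name and the choice to drop non-ASCII characters are assumptions of the example, not a description of any particular weblog system.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a post title into a lowercase, hyphen-separated slug."""
    # Decompose accented characters and drop whatever is not ASCII afterwards.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters (spaces, underscores,
    # punctuation) into one hyphen, then trim hyphens from both ends.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
```

A hand-written slug that describes the content can still be better than anything derived mechanically from the title.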

Another advantage of using slugs over IDs is that slugs aren’t IDs. This means they are stored separately and will be saved when exporting your data. This also means slugs don’t have to be unique on their own. They do have to be unique in combination with the current year, month and perhaps even day, hour, et cetera. Unless you fuck up, but then there was never a good solution for you anyway. The potential disadvantage is that publishers let people edit slugs without storing the old slug (and the publish timestamp) in a separate redirect table. Therefore people can break things. And if things can be broken, people will break them. No doubt about that. To name a common example: people edit the title of a post because it contained a typo, and then quite a few publishers change the slug as well. You say: Ugh! I say: Reality is like that.
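To make the redirect-table idea concrete, here is a small in-memory sketch in Python. The class and method names are hypothetical; a real publishing system would keep both mappings in database tables.

```python
from datetime import date
from typing import Dict, Optional, Tuple

Key = Tuple[date, str]  # (publish date, slug) identifies a post


class SlugStore:
    """In-memory stand-in for a posts table plus a redirect table."""

    def __init__(self) -> None:
        self.posts: Dict[Key, str] = {}      # current key -> content
        self.redirects: Dict[Key, Key] = {}  # retired key -> current key

    def _check_free(self, key: Key) -> None:
        # A key that was ever used stays reserved, so old links never change meaning.
        if key in self.posts or key in self.redirects:
            raise ValueError(f"slug {key[1]!r} is already taken on {key[0]}")

    def publish(self, day: date, slug: str, content: str) -> None:
        key = (day, slug)
        self._check_free(key)
        self.posts[key] = content

    def rename(self, day: date, old_slug: str, new_slug: str) -> None:
        old_key, new_key = (day, old_slug), (day, new_slug)
        self._check_free(new_key)
        self.posts[new_key] = self.posts.pop(old_key)
        # Re-point earlier redirects so nothing has to be followed in a chain.
        for source, target in list(self.redirects.items()):
            if target == old_key:
                self.redirects[source] = new_key
        self.redirects[old_key] = new_key

    def resolve(self, day: date, slug: str) -> Optional[Tuple[int, Key]]:
        """Return (200, key) for a live post, (301, key) for a renamed one, else None."""
        key = (day, slug)
        if key in self.posts:
            return 200, key
        if key in self.redirects:
            return 301, self.redirects[key]
        return None
```

The same slug can be published on two different days, while a slug that was renamed keeps answering with a permanent redirect.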

(My personal publish system doesn’t let me edit slugs, ever. When I check the publish checkbox (note that I can create an entry first, without publishing), and only then, the slug is stored, never to be edited again, as are the publish timestamp and the unique tag URI (not IRI).)

With this kind of mess all over the web, and new people every day using un-cool software that breaks the principles of IRI design, I don’t see any feasible solution. Some people were thinking about a meta solution where each resource gets a URN besides its IRI, as some resources have multiple IRIs (really, resources should have one canonical IRI), but I don’t think that’s going to work. I also think it solves another problem, because:

Resources disappear. Every once in a while someone takes his site offline and starts again, or accidentally loses his domain name thanks to evil hosting providers breaking parts of the web. People die too, although the history of the web is probably too short for a personal webpage no longer being maintained (and taken offline due to lack of payments) because someone died. (It doesn’t sound impossible either.)
Software sucks. Microsoft isn’t releasing the same thing every couple of years just to make you pay (well, that too); they are innovating as well, adding requested features or fixing long-standing bugs. Software is never finished. There is no such thing as perfect in the software market, I suppose. And that may be because of hardware limitations, which shifts the problem, but doesn’t prevent it. The point here is that web software sucks too. The first WordPress version didn’t create the IRIs it creates now. A new WordPress upgrade might let you invent a new IRI scheme without adding proper redirects. No non-freaking end user is going to care about such stuff. Really, unless you help him with it, and then he might think you’re even more pedantic than he thought. And of course, he’s right.
People don’t care. Google indexes their new stuff anyway, and the blast of referrers they got from Slashdot (Word recognizes this word…) a month ago is over anyway. And if they didn’t get any link from Slashdot (or equivalent) they care even less. Did I mention already they don’t?

To end this little write-up with a question as I tend to think that answers are less relevant: Should a resource have one canonical IRI? If so, why does the ‘cool document’ have two? (For the few — perhaps masses — who don’t get it: You can add .html behind the link or remove it.)

Comments

You don't know how useful that post is for me right now! :-D
Posted by Jimmy Cerra at 4:34AM
The interesting thing is that, in my experience, personal sites adhere to good IRI design far more often than corporate sites do, and they're the ones that matter least as far as Joe Average is concerned.
Good IRI design is important, though not as important as some other issues. The likes of Google make it less of a problem, but dead links are everywhere — and a very high portion of them seem to be to things which haven't gone away, just moved. A dead link to content which is no longer available is unavoidable without some sort of subscription/pingback mechanism for every page that gets linked to. Dead links to pages which have just moved are in many cases completely avoidable. Whether it's redirects (perfectly reasonable), or just sensible IRI design in the first place (i.e., so that your IRIs don't have to change when you redesign the site), it can be done one way or another.
Honestly, though, I don't think this is going to get fixed any more quickly or easily than the whole web consisting of pages which validate.
Random thought: perhaps a solution would be a content-management system focussed entirely on organisation of content, but open-ended as far as the makeup of the content itself was concerned (leaving room for that to be scripted or managed too, if necessary and desired). Some sort of variation on a Wiki maybe?
Posted by Mo at 4:39AM
Yeah, I had to do the permalink-change dance several times going from B2 betas to WordPress betas to WordPress 1.0. I just added a new set of redirect rules each time. (And complained about it, but stuck with it because they really were getting better.) Once post slugs were in place, I was set, and when they enabled custom permalink schemes I had no interest in changing.
Unfortunately there used to be a bug in the way post slugs were constructed, and if you had HTML tags in your title (silly, perhaps, but I've used them from time to time for things like mentioning a book or movie in the post title) it would strip the brackets first, leaving the name of the tag. I've left several posts with slugs like isomebooki-i-read because I didn't want to break potential links, and Apache seemed to be applying the rewrite rules before the redirects.
Of course, very few of my blog posts actually get incoming links. Most of my traffic is from searches. It's my other, non-CMSed pages that get most of the links and traffic. Ironically, because I don't need to compete with the built-in rewrite rules, I have more freedom to create redirects on the built-from-scratch plain HTML site.
And now I'm finding WP 1.5 sporadically stops using the permalink structure when it sends out pingbacks, so I'm starting to leave a lot of index.php?p=xyz links. I think it's connected to upgrading. The Permalink redirect plugin helps a lot there, but sometimes I wonder if it would be worth contacting each site's authors and asking them to please fix my link...
Posted by Kelson at 6:41AM
You're right that IDs are bad for IRI design. But when designing for better SERPs the use of:
```
/archives/year/month/day/slug
```
isn't great either: Google uses the IRIs for the SERPs, and year, month and day are not really indicative of the content. It might be better to leave out the 'day', or even better to categorize by subject. Even though your archiving method is screwed up then...
Posted by Jammer at 9:11AM
A new WordPress upgrade might let you invent a new IRI scheme without adding proper redirects.

Or it might add new redirects that cause trouble.
Pity Matthew Thomas. His eight-month-old satire on the behavior of Safari and Firefox regarding Atom has become unreadable in the new version of WordPress. The answer is in the IRI.
http://mpt.net.nz/archive/2004/09/30/atom
Thomas intended atom to be a part of his slug, but since WordPress now helpfully offers an Atom feed for almost any part of a blog, it interprets Thomas's IRI as a request for the Atom feed for September 30, 2004. Any post slug that is only feed, rdf, rss, rss2, or atom will likely be affected.
Anne, didn’t you run into this problem a few months back? Was this one of the reasons you abandoned ship?
Posted by Mike Mariano at 10:15AM
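One way a publishing system could sidestep the collision described in the comment above is to refuse bare feed names as slugs. The following Python sketch only illustrates that idea. The reserved list is taken from the comment, while the function name and the suffix are assumptions, and none of it reflects how WordPress actually behaves.

```python
# Feed names listed in the comment above; treating them as reserved is an
# assumption of this sketch, not WordPress's actual behaviour.
RESERVED_SLUGS = frozenset({"feed", "rdf", "rss", "rss2", "atom"})


def safe_slug(slug: str, suffix: str = "-post") -> str:
    """Return the slug unchanged unless it is exactly a reserved feed name."""
    if slug.lower() in RESERVED_SLUGS:
        return slug + suffix
    return slug
```

This only helps for slugs chosen after the check exists. Posts that were already published under such a slug would still need a redirect.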
Offtopic, but your one line of PHP has a major security hole in it. Hope your personal publish system does not include code like that.
Posted by Asdf at 1:50PM
Based on Anne's article The perfect weblog system, I use the following rules on one of my article websites, pilgrim.maladoc.org:

List of articles

http://pilgrim.maladoc.org/articles/

Feed for the latest articles

http://pilgrim.maladoc.org/articles.atom

Feed for the latest comments

http://pilgrim.maladoc.org/comments.atom

Front page of an article

http://pilgrim.maladoc.org/articles/article-slug/

A chapter's page

http://pilgrim.maladoc.org/articles/article-slug/chapter-slug

http://pilgrim.maladoc.org/articles/article-slug/chapter-slug/subchapter-slug

Printer friendly version (HTML) of an article

http://pilgrim.maladoc.org/articles/article-slug.html

Printer friendly version (PDF) of an article

http://pilgrim.maladoc.org/articles/article-slug.pdf

Feed for the latest comments of an article

http://pilgrim.maladoc.org/articles/article-slug.atom

I don't know if it's the perfect solution but it seems to be a good one regarding all the previous comments of this post.
Posted by David Duret at 3:05PM
A security hole in what way? OK, he has to check the incoming arguments (but Anne could have just left it out in this example), but other than that, enlighten me please...
Posted by Momos at 3:16PM
I'm assuming it was left out for brevity, but the “security hole” is that mysql_escape_string should have been used, at the very least. (By preference, if I know the parameter is a numeric identifier, I just use intval instead.)
Posted by Mo at 4:33PM
The ‘cool document’ has 3 IRIs! You can add .html.en too. This is a content negotiation problem.
Posted by Sjoerd Visscher at 4:50PM
Hey, I was going to write this post since I just switched from /archives/year/month/day/slug/ to /archive/year/month/slug on my site. You stole my article, you bastard! :P Oh, and you should talk about your switch to non-/archives IRIs, and the new HREF permalinking system. Uber cool.
Posted by Mathias Bynens at 6:00PM
You can also add ?foo=bar to that document, so it has not only four IRIs, but endless numbers of them. That applies to this very document as well. I don't see why it is a problem though.
Posted by zcorpan at 6:17PM
4.3.0 The function mysql_escape_string() became deprecated, do not use this function. Instead, use mysql_real_escape_string().

Posted by Jeroen Brussich at 6:21PM
I do not understand the argument against IDs and migrating to a different system. Even when a transition to a new system wouldn’t allow the IDs to be preserved (which would be pretty lame, an ID is a normal record field which happens to auto increment but as for the rest is just as solid as the other fields, and should be retained), what would prevent me from turning the ID into a regular field and just using that from then on?
And I’ll say text links have their problems as well, e.g. with regard to compactness, and when the post title changes :). And the fact that there’s a date in the link pretty much negates the ease of remembering as well.
~Grauw
Posted by Laurens Holst at 8:02PM
Security could have been handled still?!
```
$_GET['id']=mysql_real_escape_string($_GET['id']);
```
Posted by Momos at 12:42AM
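The safer lookup discussed in these comments can be sketched as follows. This is an illustration in Python using the standard `sqlite3` module instead of the PHP and MySQL of the original post. The `items` table and `content` column come from the post's query, and the function name is made up.

```python
import sqlite3
from typing import Optional


def fetch_content(conn: sqlite3.Connection, raw_id: str) -> Optional[str]:
    """Look up an item's content by ID, treating raw_id as untrusted input."""
    # Accept only plain ASCII digits, mirroring the [0-9]+ of the rewrite rule.
    if not (raw_id.isascii() and raw_id.isdigit()):
        return None
    # The ? placeholder keeps the value out of the SQL text entirely.
    row = conn.execute(
        "SELECT content FROM items WHERE id = ?", (int(raw_id),)
    ).fetchone()
    return row[0] if row else None
```

Validating the value and binding it as a parameter are two independent safeguards. Either one alone would stop the injection. Using both keeps the code safe if the rewrite rule is later loosened.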
I use this (simplified):
# .htaccess
RewriteCond %{REQUEST_URI} ^/articles [OR]
RewriteCond %{REQUEST_URI} ^/register [OR]
RewriteCond %{REQUEST_URI} ^/contact [OR]
RewriteCond %{REQUEST_URI} ^/links
RewriteRule (.*) /index.php

// PHP (ltrim drops the leading slash so $page is not an empty string)
list($page, $variable1, $variable2, $etc) = explode("/", ltrim($_SERVER['REQUEST_URI'], "/"));
Requests for the IRIs listed in the .htaccess are now all rewritten to index.php, so you can create all sorts of virtual IRIs like http://host.com/links and http://host.com/articles/category/articleid. It's a really simple and multifunctional method to handle IRIs.
Posted by Kees at 1:53AM
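The path-splitting step of this front-controller approach can be sketched in Python as well. The function name is hypothetical. The example drops empty segments and the query string, which the one-line PHP version does not do.

```python
from typing import List
from urllib.parse import unquote, urlsplit


def path_segments(request_uri: str) -> List[str]:
    """Split a request URI into decoded, non-empty path segments."""
    path = urlsplit(request_uri).path  # drops any ?query or #fragment
    return [unquote(part) for part in path.split("/") if part]
```

The first segment then selects the page, and the remaining ones act as its parameters.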
When trying to figure out what IRIs to use it’s generally a good idea to avoid the usage of IDs in them; especially IDs that come straight out of the database and are autogenerated when you add a new post. IDs are an easy way to create a unique IRI and supposed to be permanent IRIs, sure, but they have several drawbacks.

That’s so very interesting, seeing as you’re (still) using IDs for your comment permalinks rather than the more logical #comment-1, #comment-2 mathematical stuff (IDs depending on the page rather than site-wide). :P
Oh, and since you’re showing your Apache-fu skillz, I think you also might want to point to the mod_alias Redirect directive, which is extremely easy to use, and fast. For example, you could put the following in your .htaccess rather than using a combination of mod_rewrite and PHP:
```
Redirect 301 /archives http://annevankesteren.nl/
```
Posted by Mathias Bynens at 2:51AM
The permalinks for my comments are actually stored in the database. I might change them for new entries eventually though. I just haven’t got around to doing it. And using Apache for redirects is fine, and occasionally it is useful, but I also changed several slugs in the process so using a server side scripting language is a better way here. (Adding 300 lines in your .htaccess looks so crazy that only Mark Pilgrim would do it and has done so.)
Posted by Anne at 3:00AM
Note also that, as I’ve pointed out before, comments are not really numeric and your logical way of thinking might clash with the real world. Actually, it will fail in the real world, as sometimes comments have to be removed when comments after that comment have already been added. And there goes your logical structure. Or perhaps you require moderation on trackbacks and let comments appear instantly. What do you use as the link for the comments? What if you remove the trackback or keep it? All these questions are without a solid answer, and therefore the only better alternative there is now is time. Basing them on time will make them both unique and logical, but like I said: Things take time.
Posted by Anne at 3:06AM
Belatedly, to answer the question itself: a resource should have as many canonical URIs as you're willing to maintain. If nobody ever sees the '.html' version, it doesn't matter that it exists. If people do, though, you've got to be willing to keep it around, long after you've switched pages from being '.html' to '.whizzyscript'. At least content negotiation relieves that particular headache.
Ideally, users should never need to know what sort of “server technology” is used to serve a website, though sometimes it's helpful to explicitly give some URIs a file extension.
Posted by Mo at 3:40PM
Personally, I hate the date components in permalinks. They’re fine for aggregate pages like monthly archives, but otherwise they really have no business showing up in links. Of course, my own permalinks right now are /log/1234/-style, which is suboptimal. I’m just not yet sure what I’ll do about the slugs; much of the backend is still in flux.
People die too, although the history of the web is probably too short for a personal webpage no longer being maintained (and taken offline due to lack of payments) because someone died.

I know of two cases where this has happened. The web is way beyond the required age for that.
If so, why does the ‘cool document’ have two? (For the few — perhaps masses — who don’t get it: You can add .html behind the link or remove it.)

Does it really? Is the .html version used in links in other documents or just an artifact that you found out about despite lack of advertising?
In any case, I agree with Mo that as long as all URIs are maintained, it’s fine for a document to have a multitude.
(Adding 300 lines in your .htaccess looks so crazy that only Mark Pilgrim would do it and has done so.)

Oh, there are more people than just him…
Ideally, users should never need to know what sort of “server technology” is used to serve a website, though sometimes it's helpful to explicitly give some URIs a file extension.

Absolutely.
Personally I prefer to end all of my permalinks in slashes, for two reasons.
1. It gives me a namespace to attach document-specific transient resources to, so a post with kitten pictures in an entry under /archive/kittenpics/ would have its pictures in /archive/kittenpics/1.jpg or such.
2. It is an easy way to distinguish permanent from transient URLs: I guarantee that those that end in slashes will never produce a 404 for as long as I live, while URLs that do not end in a slash may go missing at my discretion. (But if you hack off the last part, you get a URL with a slash, so you are never completely lost. Hmm, it might be worthwhile to adjust my 404 page to provide an explanation and link.)
Posted by Aristotle Pagaltzis at 12:58PM
As Aristotle mentioned, multiple IRIs for one document are not a problem unless the different IRIs are being linked to. As for your /test/ directory, every file in there has at least two IRIs that are linked to (due to Apache dir listings): one with the extension, and one without. That is wrong, since Google might index "both" documents.
Posted by Mathias Bynens at 2:26PM
You might be wrong with that. The content-location header always returns the file on the server. Google is likely to index two IRIs, but the same document.
Posted by Anne at 3:40PM