When trying to figure out what IRIs to use it’s generally a good idea to avoid the usage of IDs in them; especially IDs that come straight out of the database and are autogenerated when you add a new post. IDs are an easy way to create a unique IRI and supposed to be permanent IRIs, sure, but they have several drawbacks. They can however be trivially implemented using some Apache-fu and PHP:
RewriteRule ^item/([0-9]+)$ /item.php?id=$1
In item.php you can have something like the following:
$q = mysql_query("SELECT content FROM items WHERE id = $_GET['id']");
… and you’re done. Or maybe not. An obvious argument against the usage of IDs is that they don’t convey any information. They are solely a unique identifier for the content, but can’t tell you what the content is about. A much larger problem however is that IDs can change. Consider you’re using a 1-based ID index and then you’re moving to some other system. You export all your content to some XML file (doing what cool people do) and later import it into a new database table. This table happens to use a 0-based ID index and all your new IDs are the old ID minus 1. This sounds like fiction, but it happened to me. The weblog system I was using generated permanent links in this form:
/weblog/item/id
Where id is an ID. The new IRIs looked like:
/archives/year/month/day/slug
Back then I never took the time to redirect the old IRIs to the new ones. I’m not sure if I had the knowledge to do it, but I was certainly un-cool. After viewing my Apache logs for the past couple of weeks it appeared to me there were still resources on the web pointing to my very old scheme. Quickly I wrote a redirector to take care of it. (It takes the ID, subtracts 1 and then queries the database for the permalink field of the row with that ID. The result ends up in an HTTP 301 permanent redirect.)
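The redirector described above is only a few lines of logic. Here is a minimal sketch of it in Python (not the original code, which is not shown in this post); the table name `items`, the sample row and the function name are assumptions made up for illustration:

```python
import re
import sqlite3

# Hypothetical stand-in for the new database: 0-based ids, each with a
# stored permalink in the /archives/year/month/day/slug form.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, permalink TEXT)")
db.execute("INSERT INTO items VALUES (41, '/archives/2005/06/01/iri-design')")

OLD_PATTERN = re.compile(r"^/weblog/item/([0-9]+)$")

def redirect_old_iri(path):
    """Return (status, location) for an old-style IRI, or (404, None)."""
    match = OLD_PATTERN.match(path)
    if match is None:
        return 404, None
    new_id = int(match.group(1)) - 1  # old IDs were 1-based, new ones 0-based
    row = db.execute(
        "SELECT permalink FROM items WHERE id = ?", (new_id,)
    ).fetchone()
    if row is None:
        return 404, None
    return 301, row[0]
```

With the made-up row above, `/weblog/item/42` maps to the stored permalink of row 41 with a 301, and anything that does not match the old scheme falls through to a 404.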
Now of course when you’re doing everything yourself you can make sure IDs stay unique, by storing them twice for example. Storing the permanent link in a separate field is always a good thing. But for most people this is not the case.
Also, as pointed out above, IDs don’t say anything about the content at all. Slugs do. Slugs are often just a hyphenated title, but they don’t have to be. Good slugs describe the content of the page they are associated with and for extra Google points they use hyphens and no underscores.
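For the common case where the slug is just a hyphenated title, something like the following would do. It is a rough sketch in Python, and the function name `slugify` and the exact normalisation rules are my assumptions, not what any particular weblog system does:

```python
import re
import unicodedata

def slugify(title):
    """Turn a post title into a lowercase, hyphen-separated slug."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters (underscores
    # included) with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower())
    return slug.strip("-")
```

A hand-written slug that actually describes the page will usually beat whatever this produces; the sketch only covers the mechanical part.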
Another advantage of using slugs over IDs is that slugs aren’t IDs. This means they are stored separately and will be saved when exporting your data. This also means slugs don’t have to be unique. They do have to be unique, however, in combination with the current year, month and perhaps even day, hour, et cetera. Unless you fuck up, but then there was never a good solution for you anyway. The potential disadvantage of this is that publishers let people edit slugs without storing the old slug (and the publish timestamp) in a separate redirect table. Therefore people can break things. And if things can be broken, people will do it. No doubt about that. To name a common example: people edit the title of a post because it contained a typo and then quite a few publishers change the slug as well. You say: Ugh!
I say: Reality is like that.
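If a publisher does let people edit slugs, the redirect table mentioned above is not much work. A minimal sketch in Python with SQLite, where the table names, column names and function names are all hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, published TEXT, slug TEXT);
    CREATE TABLE slug_redirects (published TEXT, old_slug TEXT, post_id INTEGER);
""")

def rename_slug(post_id, new_slug):
    """Change a slug, but remember the old one so existing links keep working."""
    published, old_slug = db.execute(
        "SELECT published, slug FROM posts WHERE id = ?", (post_id,)
    ).fetchone()
    db.execute("INSERT INTO slug_redirects VALUES (?, ?, ?)",
               (published, old_slug, post_id))
    db.execute("UPDATE posts SET slug = ? WHERE id = ?", (new_slug, post_id))

def resolve(published, slug):
    """Return (200, slug) for a current slug, (301, current_slug) for an old one, else (404, None)."""
    if db.execute("SELECT 1 FROM posts WHERE published = ? AND slug = ?",
                  (published, slug)).fetchone():
        return 200, slug
    row = db.execute(
        "SELECT p.slug FROM slug_redirects r JOIN posts p ON p.id = r.post_id "
        "WHERE r.published = ? AND r.old_slug = ?", (published, slug)
    ).fetchone()
    return (301, row[0]) if row else (404, None)
```

Because the redirect row points at the post rather than at the next slug, a slug that has been edited several times still redirects straight to the current one.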
(My personal publishing system never lets me edit slugs. Only when I check the publish checkbox (note that I can create an entry first, without publishing) is the slug stored, never to be edited again, as are the publish timestamp and the unique tag URI (not IRI).)
With this kind of mess all over the web, and new people every day using un-cool software that breaks the principles of IRI design, I don’t see any feasible solution. Some people were thinking about a meta solution where each resource got a URN besides its IRI, as some resources have multiple IRIs (really, resources should have one canonical IRI), but I don’t think that’s going to work. I also think it solves another problem, because:
To end this little write-up with a question, as I tend to think that answers are less relevant: Should a resource have one canonical IRI? If so, why does the ‘cool document’ have two? (For the few, perhaps masses, who don’t get it: You can add .html behind the link or remove it.)
You don't know how useful that post is for me right now! :-D
The interesting thing is that, in my experience, personal sites adhere to good IRI design far more often than corporate sites do, and they're the ones that matter least as far as Joe Average is concerned.
Good IRI design is important, though not as important as some other issues. The likes of Google make it less of a problem, but dead links are everywhere — and a very high portion of them seem to be to things which haven't gone away, just moved. A dead link to content which is no longer available is unavoidable without some sort of subscription/pingback mechanism for every page that gets linked to. Dead links to pages which have just moved are in many cases completely avoidable. Whether it's redirects (perfectly reasonable), or just sensible IRI design in the first place (i.e., so that your IRIs don't have to change when you redesign the site), it can be done one way or another.
Honestly, though, I don't think this is going to get fixed any more quickly or easily than the whole web consisting of pages which validate.
Random thought: perhaps a solution would be a content-management system focussed entirely on organisation of content, but open-ended as far as the makeup of the content itself was concerned (leaving room for that to be scripted or managed too, if necessary and desired). Some sort of variation on a Wiki maybe?
Yeah, I had to do the permalink-change dance several times going from B2 betas to WordPress betas to WordPress 1.0. I just added a new set of redirect rules each time. (And complained about it, but stuck with it because they really were getting better.) Once post slugs were in place, I was set, and when they enabled custom permalink schemes I had no interest in changing.
Unfortunately there used to be a bug in the way post slugs were constructed, and if you had HTML tags in your title (silly, perhaps, but I've used them from time to time for things like mentioning a book or movie in the post title) it would strip the brackets first, leaving the name of the tag. I've left several posts with slugs like isomebooki-i-read because I didn't want to break potential links, and Apache seemed to be applying the rewrite rules before the redirects.
Of course, very few of my blog posts actually get incoming links. Most of my traffic is from searches. It's my other, non-CMSed pages that get most of the links and traffic. Ironically, because I don't need to compete with the built-in rewrite rules, I have more freedom to create redirects on the built-from-scratch plain HTML site.
And now I'm finding WP 1.5 sporadically stops using the permalink structure when it sends out pingbacks, so I'm starting to leave a lot of index.php?p=xyz links. I think it's connected to upgrading. The Permalink redirect plugin helps a lot there, but sometimes I wonder if it would be worth contacting each site's authors and asking them to please fix my link...
You're right that IDs are bad for IRI design. But when designing for better SERPs the use of:
/archives/year/month/day/slug
isn't great either. Google uses the IRIs for the SERPs, and year, month and day are not really indicative of content. It might be better to leave out the 'day' or, even better, to categorize by subject. Even though your archiving method is screwed up then...
A new WordPress upgrade might let you invent a new IRI scheme without adding proper redirects.
Or it might add new redirects that cause trouble.
Pity Matthew Thomas. His eight-month-old satire on the behavior of Safari and Firefox regarding Atom has become unreadable in the new version of WordPress. The answer is in the IRI.
http://mpt.net.nz/archive/2004/09/30/atom
Thomas intended atom to be a part of his slug, but since WordPress now helpfully offers an Atom feed for almost any part of a blog, it interprets Thomas's IRI as a request for the Atom feed for September 30, 2004. Any post slug that is only feed, rdf, rss, rss2, or atom will likely be affected.
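A publishing system could guard against this kind of collision when the slug is first stored. The following Python sketch only illustrates the idea; the function name and the `-post` suffix are hypothetical, and this is not how WordPress itself handles it:

```python
# The slugs listed above as colliding with feed requests.
RESERVED_SLUGS = {"feed", "rdf", "rss", "rss2", "atom"}

def safe_slug(slug, suffix="-post"):
    """Return the slug unchanged unless it is exactly a reserved feed name."""
    if slug.lower() in RESERVED_SLUGS:
        return slug + suffix  # hypothetical disambiguation; any suffix would do
    return slug
```

Only a slug that consists solely of a reserved word is touched, so `atom-and-safari` stays as it is while a bare `atom` becomes `atom-post`.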
Anne, didn’t you run into this problem a few months back? Was this one of the reasons you abandoned ship?
Offtopic, but your one line of PHP has a major security hole in it. Hope your personal publish system does not include code like that.
Based on Anne's article The perfect weblog system, I use the following rules on my articles website, pilgrim.maladoc.org:
http://pilgrim.maladoc.org/articles/
http://pilgrim.maladoc.org/articles.atom
http://pilgrim.maladoc.org/comments.atom
http://pilgrim.maladoc.org/articles/article-slug/
http://pilgrim.maladoc.org/articles/article-slug/chapter-slug
http://pilgrim.maladoc.org/articles/article-slug/chapter-slug/subchapter-slug
http://pilgrim.maladoc.org/articles/article-slug.html
http://pilgrim.maladoc.org/articles/article-slug.pdf
http://pilgrim.maladoc.org/articles/article-slug.atom
I don't know if it's the perfect solution but it seems to be a good one regarding all the previous comments of this post.
A security hole in what way? OK, he has to check the incoming arguments (but Anne could have just left it out in this example), but other than that, enlighten me please...
I'm assuming it was left out for brevity, but the “security hole” is that mysql_escape_string should have been used, at the very least. (By preference, if I know the parameter is a numeric identifier, I just use intval instead.)
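To make the point concrete: the danger is that whatever arrives in the request is pasted straight into the SQL text. Besides escaping or an integer cast, a parameterised query keeps the value out of the SQL altogether. This is a sketch in Python with SQLite rather than the PHP mysql_* functions discussed here, and the table contents and function name are assumptions:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, content TEXT)")
db.execute("INSERT INTO items VALUES (1, 'first post')")

def fetch_content(raw_id):
    """Look up an item by the id taken from the request; None if invalid or missing."""
    try:
        item_id = int(raw_id)  # same idea as an integer cast, but rejects junk outright
    except (TypeError, ValueError):
        return None
    # The ? placeholder means the value is never spliced into the SQL string.
    row = db.execute(
        "SELECT content FROM items WHERE id = ?", (item_id,)
    ).fetchone()
    return row[0] if row else None
```

A request for `1 OR 1=1` simply yields nothing here, instead of becoming part of the query.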
The ‘cool document’ has 3 IRIs! You can add .html.en too. This is a content negotiation problem.
Hey, I was about to write this post since I just switched from /archives/year/month/day/slug/ to /archive/year/month/slug on my site. You stole my article, you bastard! :P Oh, and you should talk about your switch to non-/archives IRIs, and the new HREF permalinking system. Uber cool.
You can also add ?foo=bar to that document, so it has not only four IRIs, but endless numbers of them. That applies to this very document as well. I don't see why it is a problem though.
4.3.0: The function mysql_escape_string() became deprecated, do not use this function. Instead, use mysql_real_escape_string().
I do not understand the argument against IDs and migrating to a different system. Even when a transition to a new system wouldn’t allow the IDs to be preserved (which would be pretty lame, an ID is a normal record field which happens to auto increment but as for the rest is just as solid as the other fields, and should be retained), what would prevent me from turning the ID into a regular field and just using that from then on?
And I’ll say text links have their problems as well, e.g. with regard to compactness, and when the post title changes :). And the fact that there’s a date in the link pretty much negates the ease of remembering as well.
~Grauw
Security could have been handled still?!
$_GET['id']=mysql_real_escape_string($_GET['id']);
I use this (simplified):
# .htaccess
RewriteCond %{REQUEST_URI} ^/articles [OR]
RewriteCond %{REQUEST_URI} ^/register [OR]
RewriteCond %{REQUEST_URI} ^/contact [OR]
RewriteCond %{REQUEST_URI} ^/links
RewriteRule (.*) /index.php
// PHP
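// Trim the leading slash so $page is not an empty string; pad so list() always gets four values.
list($page, $variable1, $variable2, $etc) = array_pad(explode("/", trim($_SERVER['REQUEST_URI'], "/")), 4, null);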
The IRIs in the .htaccess will now all be rewritten to index.php, so you can create all sorts of virtual IRIs like http://host.com/links and http://host.com/articles/category/articleid. It's a really simple and multifunctional method to handle IRIs.
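The path-splitting half of this approach can be sketched in isolation. The version below is in Python rather than PHP, and the function names are made up for illustration; it ignores the query string and empty segments, which a plain split on "/" does not:

```python
from urllib.parse import urlsplit

def split_request(request_uri):
    """Split a request URI into its non-empty path segments."""
    path = urlsplit(request_uri).path  # drops any ?query part
    return [segment for segment in path.split("/") if segment]

def route(request_uri):
    """Return (page, arguments) for a request URI; page is None for the site root."""
    segments = split_request(request_uri)
    if not segments:
        return None, []
    return segments[0], segments[1:]
```

For `/articles/category/articleid` this gives the page `articles` with `category` and `articleid` as arguments, and a trailing slash or a `?foo=bar` suffix does not change the result.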
When trying to figure out what IRIs to use it’s generally a good idea to avoid the usage of IDs in them; especially IDs that come straight out of the database and are autogenerated when you add a new post. IDs are an easy way to create a unique IRI and supposed to be permanent IRIs, sure, but they have several drawbacks.
That’s so very interesting, seeing as you’re (still) using IDs for your comment permalinks rather than the more logical #comment-1, #comment-2 mathematical stuff (IDs depending on the page rather than site-wide). :P
Oh, and since you’re showing your Apache-fu skillz, I think you also might want to point to the mod_alias Redirect directive, which is extremely easy to use, and fast. For example, you could put the following in your .htaccess rather than using a combination of mod_rewrite and PHP:
Redirect 301 /archives http://annevankesteren.nl/
The permalinks for my comments are actually stored in the database. I might change them for new entries eventually though. I just haven’t got around to doing it. And using Apache for redirects is fine, and occasionally it is useful, but I also changed several slugs in the process so using a server side scripting language is a better way here. (Adding 300 lines in your .htaccess looks so crazy that only Mark Pilgrim would do it and has done so.)
Note also that, as I’ve pointed out before, comments are not really numeric and your logical way of thinking might clash with the real world. Actually, it will fail in the real world, as sometimes comments have to be removed when comments after that comment have already been added. And away your logical structure goes. Or perhaps you request moderation on trackbacks and let comments appear instantly. What do you use as link for the comments? What if you remove the trackback or keep it? All these questions are without a solid answer and therefore the only better alternative there is now is time. Basing them on time will make them both unique and logical, but like I said: Things take time.
Belatedly, to answer the question itself: a resource should have as many canonical URIs as you're willing to maintain. If nobody ever sees the '.html' version, it doesn't matter that it exists. If people do, though, you've got to be willing to keep it around, long after you've switched pages from being '.html' to '.whizzyscript'. At least content negotiation relieves that particular headache.
Ideally, users should never need to know what sort of “server technology” is used to serve a website, though sometimes it's helpful to explicitly give some URIs a file extension.
Personally, I hate the date components in permalinks. They’re fine for aggregate pages like monthly archives, but otherwise they really have no business showing up in links. Of course, my own permalinks right now are /log/1234/-style, which is suboptimal. I’m just not yet sure what I’ll do about the slugs; much of the backend is still in flux.
People die too, although the history of the web is probably too short for a personal webpage no longer being maintained (and taken offline due to lack of payments) because someone died.
I know of two cases where this has happened. The web is way beyond the required age for that.
If so, why does the ‘cool document’ have two? (For the few, perhaps masses, who don’t get it: You can add .html behind the link or remove it.)
Does it really? Is the .html version used in links in other documents or just an artifact that you found out about despite lack of advertising?
In any case, I agree with Mo that as long as all URIs are maintained, it’s fine for a document to have a multitude.
(Adding 300 lines in your .htaccess looks so crazy that only Mark Pilgrim would do it and has done so.)
Oh, there are more people than just him…
Ideally, users should never need to know what sort of “server technology” is used to serve a website, though sometimes it's helpful to explicitly give some URIs a file extension.
Absolutely.
Personally I prefer to end all of my permalinks in slashes, for two reasons.
It gives me a namespace to attach document-specific transient resources to, so a post with kitten pictures in an entry under /archive/kittenpics/ would have its pictures in /archive/kittenpics/1.jpg or such.
It is an easy way to distinguish permanent from transient URLs: I guarantee that those that end in slashes will never produce a 404 for as long as I live, while URLs that do not end in a slash may go missing at my discretion. (But if you hack off the last part, you get a URL with a slash, so you are never completely lost. Hmm, it might be worthwhile to adjust my 404 page to provide an explanation and link.)
As Aristotle mentioned, multiple IRIs for one document are not a problem unless the different IRIs are being linked to. As for your /test/ directory, every file in there has at least two IRIs that are linked to (due to Apache dir listings): one with the extension, and one without. That is wrong, since Google might index "both" documents.
You might be wrong with that. The content-location header always returns the file on the server. Google is likely to index two IRIs, but the same document.