Anne van Kesteren

Avoid 404

Another appropriate title for this post could be "fix it before you break it". Although a properly written 404 page can help you out, it is better to redirect the user to a new location (301) or to return a page that says the page did exist, but will never return (410). The problem is that a lot of web companies start completely from scratch when building a new web site and do not really look at the existing one. The result is that once the new site has been released, the old links are outdated and return a 404.

Such things happen quite often. Think of a site that uses ASP today and has somewhat friendly URIs: /contact.asp. That URI is not perfect (/contact), but it comes close and is far better than the URI it will get when the new site, using PHP, is released: /index.php?page=contact. Since the file contact.asp becomes redundant, it is removed by a webmaster with limited knowledge of how the web works. And so another broken link is created, without a proper status code. The correct status code in this case would have been a 301: /contact.asp should have returned a permanent redirect to /index.php?page=contact, and the browser would then have looked up that location. Instead, it returns a 404.
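
To give an idea, a minimal sketch of such a redirect in PHP could look like this, assuming the server still hands contact.asp to a script (one way to arrange that is mentioned below; the host name is just a placeholder):

<?php
// contact.asp - kept around only so that old links keep working.
// Tell clients the page has moved permanently, and where it went.
header('Location: http://example.org/index.php?page=contact', true, 301);
exit;
?>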

When someone tries his bookmark contact.asp, he will get a 404 page telling him that the page he is looking for does not exist and that he might want to check the sitemap (if it was a proper 404 page, at least). Such things are very bad for usability, and I personally think that if you are in a web company that does this kind of thing you should be ashamed of yourself. Redirecting people to a new location is trivial. However, before even starting with things like /index.php?page=contact and redirecting old links, everyone would benefit if you invested some time in URI design. A much better location for the contact page is obvious: /contact.

Having such URIs also makes them more future-proof. Although you can always change your server side technology and, say, parse .asp as PHP, it is much better to have URIs that are free from such nonsense and can be used forever.
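
For the curious, here is a rough sketch of a front controller that could serve such extension-free URIs from one script. It assumes the server routes requests like /contact to index.php (mod_rewrite can do that), and the included file names are made up:

<?php
// index.php - rough front controller for extension-free URIs.
$path = trim(strtok($_SERVER['REQUEST_URI'], '?'), '/'); // e.g. "contact"
$pages = array('' => 'home.php', 'contact' => 'contact.php'); // whitelist
if (isset($pages[$path])) {
    include $pages[$path];
} else {
    header('HTTP/1.1 404 Not Found');
    include 'not-found.php';
}
?>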

Comments

  1. Sound advice.

    But how are you going to avoid 404s when moving a site with hundreds of pages and non-future-proof URIs from a case-insensitive ASP server to a modern, standards-and-nice-URI-based, case-sensitive Linux/PHP server? I've been racking my brain over this, but have not found a satisfying answer...

    Posted by Ben de Groot at

  2. I completely agree, Anne. Abstracting the URI space from the application is a very good idea and, not to mention, very simple. Both Apache and IIS support it, although IIS requires you to configure it locally (through the IIS MMC console) and not just through a text file in the file system.

    I always do rewriting in my applications, not only because it makes the application more future-proof, but also because it looks better, more professional and is more user-friendly.

    Posted by Asbjørn Ulsberg at

  3. *feels good to have practised this a long time before he read about it on Anne's weblog*

    Ben, I don't know how the current system works, but what if you also tell the server to parse .asp files as PHP, put a script in there that reads out the URL data, and have it redirect to the new file? But well, without seeing the current system at work I doubt I can say anything really sensible about it. Making a non-future-proof PHP-based site into a future-proof PHP-based site would be really easy, though.
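
    Roughly something like this, I imagine (the mapping and the host name are made up, of course):

    <?php
    // Saved as (or included from) each old .asp file, with the server
    // told to parse .asp as PHP; it maps old locations to new ones.
    $old = strtok($_SERVER['REQUEST_URI'], '?'); // e.g. "/contact.asp"
    $map = array(
        '/contact.asp' => '/index.php?page=contact',
        '/about.asp'   => '/index.php?page=about',
    );
    if (isset($map[$old])) {
        header('Location: http://example.org' . $map[$old], true, 301);
    } else {
        header('HTTP/1.1 410 Gone'); // it existed, but will never return
    }
    exit;
    ?>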

    Posted by Frenzie at

  4. I think this post makes two points:

    1. Don't break existing URIs;
    2. Use sustainable URIs.

    The first doesn't apply to all sites, and isn't worth the time in quite a few cases, looking at my own experience. (I should have switched hosts in some of these cases, for simple sites, and that's not at all what most clients want.) The second point is of far more importance. It isn't that hard to do, and, just as with creating valid XHTML sites, in time you do it without even thinking about it.

    Posted by Robbert Broersma at

  5. Frenzie: it would be very labour-intensive, because in most cases the filenames change radically. In the present system there is no one-to-one relationship between the filename and the page title. Also, it works with flat files (no DB), with ASP only used to draw in the header and footer and to generate some indexes. The new system (a hacked-up-by-me version of WordPress 1.3) generates nice hyphenated URIs based on the title. And I don't really feel like maintaining a 500-line .htaccess file full of lines like this:

    Redirect 301 /YuckyName.asp http://newsite.org/article/the-real-title

    Anyway, a good 404 page and search are our friends, and Google will pick up the new URIs fast enough, in my experience.

    Posted by Ben de Groot at

  6. Ben: Frankly, the 500-line .htaccess file is the way to go (or, better yet, 500 lines in httpd.conf so it's only parsed once). It's the only way to really handle all the possible links and bookmarks that might be out there. Of course, if most of your hits come from search engines, it may not be worth the effort. YMMV.

    One thing I've done is set up my 404 page using SSI (to keep it low-impact, rather than using PHP), looking for a few keywords and outputting suggestions based on those. Not as elaborate as The Perfect 404, but I suspect it helps.

    Posted by Kelson at

  7. Well Ben, a 500-line .htaccess file wouldn't be really efficient (as you say), but with a 404 page - possibly with a little explanation that the site has been improved but was too big to give every old page a 301 (well, put less technically, of course) - together with Google, I guess it would turn out all right.

    Posted by Frenzie at

  8. There is an even more interesting use for this. Why use redirects at all? Why not just have the 404 page return the data directly? This allows you to have a fully dynamic site structure.

    You can also rewrite the headers, so the 404 page can handle XML, RSS, images and other files as well. Not to mention that you can have multiple paths to the same content - faceted navigation.

    My latest sites - including my own - use the 404 page not only to catch and handle old URIs, but also to manage the current URIs. In general such a site contains 3-4 actual pages - the rest does not exist, but is handled through the 404 page and the back-end databases and/or XML/XSLT files.

    The 4 pages are: front page, search page, email handling and the 404 page.
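
    Stripped of all the real logic, the heart of such a 404 page could look something like this, with a plain array standing in for the back end:

    <?php
    // 404.php - doubles as the router; the server sends every request
    // for a non-existent file here (ErrorDocument 404 /404.php in Apache).
    $uri = strtok($_SERVER['REQUEST_URI'], '?');
    $pages = array('/articles/avoid-404' => 'content/avoid-404.html');
    if (isset($pages[$uri])) {
        header('HTTP/1.1 200 OK'); // override the default 404 status
        header('Content-Type: text/html; charset=utf-8');
        readfile($pages[$uri]);
    } else {
        header('HTTP/1.1 404 Not Found');
        echo '<p>Sorry, nothing lives at this address.</p>';
    }
    ?>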

    Posted by Thomas Baekdal at

  9. There's the thing... a 500-line .htaccess file isn't really needed... if you use the right expressions. In theory you could shrink that immensely if you used a few callbacks to previous rules, some good expressions, and some RewriteConds. Bottom line: think about the whole, instead of each individual file you're moving.

    PS: Nice touch on requiring well-formed markup in comments... though I got caught by it at first (didn't know it was required)... I love it...

    Posted by DemonicPuffin at

  10. This post by Anne van Kesteren is a good reminder why abstract URLs are the only way to ensure that you won’t break them in the future.

    Posted by B. Adam (Howell) at

  11. It almost seems like file extensions should go away, if possible, and every static HTML file should be served as index.htm from its own directory.

    Posted by pb at

  12. Even /contact isn't perfect as you may rename the page to something else later, move it to a new folder, or remove it altogether.

    One idea is to have a permanent database of every link. When a new page is created (or an old page is renamed or moved) then the database is updated. Any files not found go to a 404 page that checks the database and pulls out the correct link for you. This is then shown on screen so the user knows the link has changed. (It's not a good policy to redirect the user automatically. Plus they'll probably not spot the change and carry on using the old link.)
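
    A rough sketch of such a 404 page, with a plain array standing in for the link database:

    <?php
    // 404 page: look the requested URI up in the link database and
    // suggest the new location, rather than redirecting automatically.
    $old = strtok($_SERVER['REQUEST_URI'], '?');
    $moved = array('/contact.asp' => '/contact'); // stand-in for the database
    if (isset($moved[$old])) {
        $new = htmlspecialchars($moved[$old]);
        echo '<p>This page has moved to <a href="' . $new . '">' . $new . '</a>.';
        echo ' Please update your bookmark or link.</p>';
    } else {
        echo '<p>Sorry, this page does not exist. Try the sitemap.</p>';
    }
    ?>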

    I changed a ton of links once but was lucky in that I was using a single page with a variable in the address. So I was able to create a message suggesting what the new link might be based on the variable. Otherwise whenever I delete a key page I like to leave a copy behind which gives a message saying the page has changed, offering the new address.

    Posted by Chris Hester at

  13. Chris, that is totally untrue. If you move the content of a page called Contact to a new folder you can still keep that URI. Ever read Cool URIs don't change? Ever used the Apache mod_rewrite module?

    Also, when you actually want to change the location of a link, from /index.php?page=contact to the much more appropriate /contact you should certainly not give the end user a 404. That is just braindead. You should redirect them with a 301 which means that the link in question has permanently changed. Such redirections are cached by the browser and the user will probably not see the old location again.

    It is very bad to let a user notice you are changing everything. He does not need to know that. Heck, he does not even need to know about locations and stuff and that they can change.

    Posted by Anne at

  14. Just wondering, are there browsers which automatically adjust bookmarks when given a 301?

    Posted by Frenzie at

  15. Opera does, at the very least, Frenzie. Well, I've not noticed it working for bookmarks, but I have noticed it working for newsfeeds; it's quite likely that the functionality is not isolated.

    I can only assume that Mozilla is just as capable. There's no good reason it shouldn't be when HTTP/1.1 is so clear on the matter.

    Posted by J. King at

  16. Hi, just a little thing... It's not about 404s, but about broken links... One thing you could do to avoid broken links is to use full URIs in your posts. That way I (and countless others, I'm sure) wouldn't have to come here from Bloglines to follow your other in-site links. href="/archives/2004/08/uri-design" is no good if I'm not on annevankesteren.nl.

    By the way, I tried http://annevankesteren.nl/archives/2004/1205 (the link Bloglines gave me for your *next* post), and guess what I got. Yup. A 404...

    Anyway, I love this blog, so keep it up :-)

    Posted by César at

  17. "It is very bad to let a user notice you are changing everything. He does not need to know that. Heck, he does not even need to know about locations and stuff and that they can change."

    Really? But this seems to go against the W3C Accessibility Guidelines which say something about not redirecting users without their knowledge.

    Posted by Chris Hester at

  18. César: if there is one basic thing Bloglines should be able to handle, then it is a relative URI. You should actually go tell them to change; we are not doing anything wrong!

    Posted by Robbert Broersma at

  19. mod_rewrite is the way to go. You can write a few regular expressions that will take care of thousands of old URIs.

    RewriteRule ^(.*)\.asp$ /index.php?page=$1 [R=301,L]

    or, for the second case:

    RewriteRule ^(.*)\.asp$ /$1 [R=301,L]

    Posted by porneL at

  20. César, they handle it perfectly in the Atom feed. (I use xml:base to tell them what the base URI is, although technically, the feed URI already is the base URI so xml:base is not needed at all.)

    Chris, the W3C writes that you should not use a frigging META element, something I have been saying here for ages. Instead, you should use a server-side redirect. (Exactly: a 301.)

    Posted by Anne at

  21. Thank you. I'm switching feeds right now :-)

    Posted by César at

  22. If you have a high-traffic site, I suggest putting those commands in your httpd.conf.

    This way they are read and parsed only once, at server startup, and the server will not need to read and parse that extra .htaccess file on every request, thus saving serving time.

    It could be a dramatic difference for some sites.

    Posted by Website at