Anne van Kesteren

Data persistence

26 October 2006

So let’s get a few things straight. I’m not really against people using XHTML, as long as they don’t do it in a crappy way. I mean, if the whole world used XHTML right now, that would be great for browser developers. It would save us some trouble figuring out the insanely complicated table parsing, the fact that <image> becomes <img>, et cetera. Anyway, the world doesn’t play ball and neither does Internet Explorer for now, and that’s fine.

Now if you don’t really care about data preservation, this probably doesn’t apply to you. But if you’re publishing documents on the web right now, don’t have some fancy backend that will survive for fifty years and update itself to export new data formats automagically, and do care about your data, it probably does.

The thing is that we can’t just move away from HTML and leave its “tag soup” an undefined mess. Leaving things undefined isn’t really a great way to preserve data. Not defining how HTML has to be interpreted guarantees data loss in the future: browsers are likely to change their parsing rules somewhat over time, and authors code against different browsers, so content stops being interoperable, or even interpretable in the right way by a single browser. So defining exactly how HTML works doesn’t imply that it has to be used; it just means that we have a way to understand all the HTML files out there. And let’s be clear, that’s about 98 percent of the web. The rest is Word and PDF.