Anne van Kesteren

Why generic XML on the web is a bad idea

Here’s a thought (emphasis mine):

Some people (Anne) are moving back to SGML HTML. But I’m starting to think it is time to move to generic XML. The epiphany came when I was having trouble converting a PowerPoint slide to HTML. I decided on a whim to just make up my own tags and use CSS. It just worked… in MSIE6, Mozilla Firefox, and Opera.

Sure, centering an image with text embedded in it works too, but that’s not what using HTML is about. The same goes for XHTML rendered as tag soup. There is something HTML has, and it’s very easy to explain: semantics. Of course, using XML, or even RDF serialized as XML, you can describe your content much better and in far more detail, but there is no search engine out there that will understand you. For RDF there is a chance that one day they will. Generic XML, on the other hand, will always fail to work: its semantics will not be extracted.

An example that shows the difference more clearly:

<em>Look at me when I talk to you!</em>

… and:

<angry>Look at me when I talk to you!</angry>

The latter element probably describes the content more accurately, but on ‘the web’ it means close to nothing. On the web it is not humans who come by and try to parse the text; they already know how to read it correctly. No, software comes along and tries to make something meaningful of the above. Since the latter element lives in a namespace no software knows, and is not defined in any specification, it will simply be ignored. The former, however, has been here since the beginning of HTML (it even predates i, which is often wrongly considered its presentational equivalent) and will be recognized by software.
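A minimal sketch of what that looks like in practice (the element names and the angry.css file are made up for illustration): the browser happily applies the styling, but no software learns anything about the content from it.

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="angry.css"?>
<rant>
  <angry>Look at me when I talk to you!</angry>
</rant>

… where angry.css contains something like:

rant  { display: block; margin: 1em; }
angry { display: block; font-weight: bold; color: red; }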

(Real XHTML is a more difficult issue, as its semantics are not really recognized by the major search engines at the moment. It’s treated like generic XML basically.)

Comments

  1. Whilst I agree with what you say (almost entirely, in fact — see the caveat included with my follow-up comment), I believe there's room for maneuver. It's certainly the case that blanket use of home-grown markup languages isn't going to do anybody any favours very quickly, but I do think the fact that you can take a lump of almost arbitrary XML and apply styling to it could prove to be exceptionally useful in select contexts.

    I see it almost as a middle-ground. Traditionally many sites publish all the meaty documents as Word or PDF (more commonly the former, though the latter is gaining ground for various reasons). The rationale is simple: most people can read them, and they're easy to produce. As we move towards XML-based document formats, we're nearly at the point where we can have our cake and eat it. The documents can be produced using “normal” office applications, and served up as-is to consumers, who can read them without additional plugins. To boot, they're relatively easy to convert to other formats if necessary, and search engines can at least parse the XML and index the content, even if they have no idea of the semantics.

    No, it's not ideal. “Ideal” would perhaps be some attribute in the xml: namespace that you could apply to any element (or indeed a collection of elements — maybe even in a schema definition) that indicated the meaning of the element somehow. To keep the W3C happy, it could even use URNs, working on the principle that any consuming software which is going to have an understanding of the meanings of the various elements is going to need its own list of relevant behaviour types anyway (be they URNs, or something else).
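    Purely as an illustration of what such an attribute might look like (xml:meaning does not exist, and the element names and URNs are made up):

      <report xml:meaning="urn:example:doc:report">
        <heading xml:meaning="urn:example:doc:title">Quarterly figures</heading>
        <para xml:meaning="urn:example:doc:paragraph">Numbers go here.</para>
      </report>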

    Even so, it's better than application/msword by a long way.

    Posted by Mo at

  2. Point me to that piece of software that understands em more than angry.

    Posted by Sjoerd Visscher at

  3. Real XHTML is a more difficult issue, as its semantics are not really recognized by the major search engines at the moment. It’s treated like generic XML basically

    I doubt that. XHTML 1.x is, by design, close enough to HTML that search engines can extract just as much in the way of semantics from it as from HTML.

    I strongly suspect, based on the way Google treats my site, it knows what an <h2> in XHTML 1.1 is (for instance).

    Now, it was not always true that Google knew how to handle XHTML sites. But it does now (and has for a while).

    Posted by Jacques Distler at

  4. Now, it was not always true that Google knew how to handle XHTML sites. But it does now (and has for a while).

    Seems to me it is still not indexing XHTML that’s really only available as XHTML.

    Posted by Henri Sivonen at

  5. Well, I couldn't agree more with Anne. The web is a presentation medium. You can use it to pass data, but in a browser that does not make sense! You want to see something, so you use the language the browser will understand: HTML. It suits that purpose perfectly.

    OK, it can be very useful to define what something means, but then it should be clear to the receiver. Defining your own elements can be useful for your own interpreter, but not for a browser. The browser expects a document it can read.

    You should use the right tool for the right purpose and let the right person use it. Just like you do not give a fly swatter to a blind person. ;)

    Posted by Bart Verkoeijen at

  6. Seems to me it is still not indexing XHTML that’s really only available as XHTML.

    Is that true if you include a DOCTYPE declaration, too? (I ask, because I really don't know how Google decides when a document is in a(n) "(un)known format.")

    Posted by Jacques Distler at

  7. Basically, yes.

    Posted by Anne at

  8. But that page got indexed, unlike Henri's example (and unlike Google's old behaviour). Most inscrutable!

    Posted by Jacques Distler at

  9. I'm now serving XHTML as application/xhtml+xml (no negotiation). I found that no search engines (or other services like AltaVista/Google translation) understand me either... :-(

    Posted by minghong at

  10. I see the problem now. Maybe that's because I'm using a custom DOCTYPE... :-P

    Posted by minghong at

  11. Do you have pingbacks/trackbacks enabled yet? If not, consider this a "manual pingback" :-)

    dolphinling's weblog—Making generic XML work

    So I just read Anne’s Why generic XML on the web is a bad idea, and got an idea. Suppose there were a way, probably using RDF or something, to map certain elements in it to well-accepted elements in another namespace?

    Posted by dolphinling at

  12. That's taken a little out of context. :-( Dave Shea's post was about the limited semantics of HTML and XHTML compared with common conventions:

    We don’t exactly have a rich vocabulary of element types capable of capturing the meaning and nuance behind every piece of text: We have code, but we don't have caption; We have kbd, but we don't have childlikescrawl; We have emphasis, but we don't have publicationtitle. And so on.

    I responded (emphasis mine):

    Some people (Anne) are moving back to SGML HTML. But I’m starting to think it is time to move to generic XML. The epiphany came when I was having trouble converting a PowerPoint slide to HTML. I decided on a whim to just make up my own tags and use CSS. It just worked... in MSIE6, Mozilla Firefox, and Opera.

    It wasn't a well thought-out decision - born out of frustration rather than some ideal. I generally think it is a good thing to use standards rather than create your own. Note these code fragments:

    1. <p class="caption">...</p>
    2. <div class="caption"><p>...</p></div>
    3. <c:caption><p>...</p></c:caption>

    There is little semantic difference between them! Meaning is assigned by convention - not by whether you use personal classes or personal tags. That's why microformats are being developed and why the Semantic Web is useful too. They are specifications for these conventions. The SW is more useful, since it has the tools for defining ontological schemas, so software doesn't need major rewrites to understand new formats.

    It's also why the Semantic Web is not a silver bullet for AI pipe dreams or other nonsense - which, BTW, are red herring arguments against the SW. (I say this as an AI aficionado: AI is why I learned to program in the first place.)

    Posted by Jimmy Cerra at

  13. Generic XML on the other hand will always fail to work.

    Always is a long time - check out GRDDL. But there's a more fundamental question: why create data/content to target generic search engines?

    Posted by Danny at

  14. Face it, generic XML (think REST APIs) is the new semantic web, but without the capitals. While it's true that no browsers or search engines are able to extract meaning from it, I think that'll come. Imagine special XSLT to transform your namespaced XML into RDF, for instance.

    What if you wrote XSLT to transform your funky XML into XHTML (with microformats, no less)? What if screen readers could figure out that they need to transform the XML in order to understand it, and actually did it? It's a whole different ballgame then, IMO.
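    A rough sketch of the kind of transform meant here, reusing the c:caption fragment from the comment above (the c: namespace URI is made up):

      <xsl:stylesheet version="1.0"
          xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
          xmlns="http://www.w3.org/1999/xhtml"
          xmlns:c="http://example.org/custom">
        <!-- Map the custom element to an XHTML equivalent; text nodes
             fall through via the built-in template rules. -->
        <xsl:template match="c:caption">
          <p class="caption"><xsl:apply-templates/></p>
        </xsl:template>
      </xsl:stylesheet>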

    But before that happens, generic XML on websites serves little purpose, I believe.

    Posted by Mark Wubben at

    (Nice, that's dolphinling's idea too... I should read the trackbacks before I comment ;-)

    Posted by Mark Wubben at

  16. Why don't you just say that it (generic XML) lacks officially and/or broadly accepted semantics?

    Posted by Jens Meiert at

  17. I agree that right now it wouldn't do much good. However, we'll have to take the step towards it at some point or another if we wish to have broader support for it, which is something that can greatly benefit us, and we might as well start on it now.

    Perhaps not in a very complete, radical manner, but in little steps at a time, it could definitely help.

    Posted by Faruk Ateş at

  18. GRDDL provides a well-specified approach to taking HTML, XHTML and arbitrary XML and extracting any semantics defined using the RDF (+OWL etc). The usual approach is using XSLT to convert the data to an RDF format. With (X)HTML it uses the head @profile to locate the definition of the transform. Works fine with 'microformats'. With arbitrary XML, if namespaces are used it's usually straightforward to define the mapping; without namespaces you have to rely on heuristics a bit.
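    A minimal sketch of the (X)HTML hook described, assuming a made-up stylesheet URL:

      <head profile="http://www.w3.org/2003/g/data-view">
        <title>Example</title>
        <link rel="transformation" href="http://example.org/extract-rdf.xsl" />
      </head>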

    Posted by Danny at

  19. Oops, should read "...using the RDF model..."

    Posted by Danny at

  20. (Testing new things.)

    Posted by Anne at

  21. Still, I remember the first time I tried a custom tag and styled it with CSS. It only worked in Mozilla then...

    Posted by dusoft at

  22. RDF is metadata for resources. Search engines will be able to understand your generic XML if you document it using RDF. There's some cool RDF/Semantic Web stuff out there - give Google a search.

    Generic XML on the web is primarily used as a storage mechanism for so-called "web applications". Not necessarily stuff you want showing up in a search engine.

    One type of generic XML (although it's anything but generic) is DocBook. DocBook is a rich, extremely flexible XML doctype that can easily be transformed into other types of web media such as HTML and PDF.
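    For instance, a tiny fragment of that kind of markup (content made up) looks like this, and the standard DocBook stylesheets will turn it into HTML, or into XSL-FO on its way to PDF:

      <article>
        <title>An example document</title>
        <para>A paragraph of body text.</para>
      </article>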

    The point is there's more than one way to skin this cat.

    Posted by Scott L Holmes at

  23. Anne, your CSS is invalid.

    Posted by praetorian at