Anne van Kesteren

Content Negotiation

12 June 2004

It is a pity that most really useful techniques have (major) drawbacks (XHTML, for example, is not correctly supported in Internet Explorer) or design issues (I think content negotiation belongs to this group). The name content negotiation describes itself quite well: the client negotiates with the server about what type of content it will retrieve. Let's say I have the following URI:

http://example.ex/archives/2002/07/post-name

Depending on what the client has in its Accept header, the server will give the client the application/xhtml+xml, application/atom+xml, application/rdf+xml, image/svg+xml, application/xml, text/html, text/plain or application/pdf document. Of course, it will never return the evil text/xml, which isn't really necessary, since we already have XML output in 5 ways... (Making the web more complex can be fun too.) This is all fiction for me at the moment, but I should probably try to add something like that (along with cached pages) to this weblog (and WordPress) to come a bit closer to norman.walsh.name, who is miles ahead of me. Note that my XHTML handling is superior though, and it is all about XHTML these days... (Not really, actually; using valid minimal HTML 4.01 seems to be cool.)
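A minimal sketch of how a server might choose a representation from the Accept header. The function names and the list of available types are assumptions made for illustration. It handles exact types, type/* and */* ranges, and q-values, and it ignores every other media type parameter.

```python
def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not wanted"
        entries.append((media_range, q))
    return entries


def quality_for(media_type, accepted):
    """Return the q-value of the most specific range matching media_type."""
    best_specificity, best_q = -1, 0.0
    for media_range, q in accepted:
        if media_range == media_type:
            specificity = 2
        elif (media_range != "*/*" and media_range.endswith("/*")
              and media_type.startswith(media_range[:-1])):
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def negotiate(accept_header, available):
    """Pick the available type with the highest q; server order breaks ties."""
    accepted = parse_accept(accept_header)
    best_type, best_q = None, 0.0
    for media_type in available:
        q = quality_for(media_type, accepted)
        if q > best_q:
            best_type, best_q = media_type, q
    return best_type


# Hypothetical set of representations for one URI, in server preference order.
AVAILABLE = ["application/xhtml+xml", "application/atom+xml", "text/html"]
```

With this sketch, negotiate("application/atom+xml", AVAILABLE) selects the Atom document, while a header that matches none of the available types yields None, which a server would typically answer with 406 Not Acceptable.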

To get back to content negotiation: there are more Accept headers, namely Accept-Language, Accept-Encoding and Accept-Charset. I think Accept-Charset is redundant, since we can use Unicode (UTF-8) for that purpose and have something every client should understand. Clients should just support that, in my opinion. Note that negotiating on content type is different from sending multiple versions of exactly the same content, since the use case of application/atom+xml is totally different from that of application/xhtml+xml or, let's say, image/svg+xml. In such cases we are talking about different clients that want different kinds of documents. My (optimal) feed reader wants an application/atom+xml document, for example, since that can be parsed in the best way for the purpose of the feed reader (providing easily scannable, readable content in styling consistent with other parsed feeds). The document is also organized slightly differently from the XHTML version. In detail: without the unnecessary navigation and cat pictures. It might even include slightly different content that is specifically meant for the feed reader and not important for the end user, like a modified and created date (the issued date is probably more important for the end user, since it could be part of the permanent entry link).

The browser, on the other hand, wants something like application/xhtml+xml if the browser is advanced, text/html if it is Internet Explorer, image/svg+xml for the mobile phone and application/pdf if it is Safari (just kidding). Actually, since Internet Explorer has a screwed up Accept header, which hopefully gets fixed in their next release, you might want to exclude application/pdf from the content negotiation process.

The other two headers, Accept-Language and Accept-Encoding, have their use cases. You can check the Accept-Language header of the client to send back the content in the language the user prefers. If you are doing this correctly, you make sure the content is only translated, and translated correctly, not altered, since otherwise it wouldn't be that useful anymore. (I would like to read the English version if the Dutch version was incomplete or just wrong.) Accept-Encoding can be used to send the client smaller (compressed) versions of the same file if the client accepts those formats. This is really interesting if you want to save some bandwidth, and who doesn't these days? After this small introduction we still haven't got to the problems of content negotiation.
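As a rough illustration of the Accept-Encoding case, the sketch below compresses a response body with gzip only when the client lists gzip with a non-zero q-value. The function names are assumptions, and the "*" wildcard and other codings are deliberately ignored to keep it short.

```python
import gzip


def accepts_gzip(accept_encoding):
    """True if the header explicitly lists gzip with a q-value above zero."""
    for part in accept_encoding.split(","):
        fields = [field.strip().lower() for field in part.split(";")]
        if fields[0] != "gzip":
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        return q > 0
    return False


def encode_body(body, accept_encoding):
    """Return (body, headers), compressing the body when the client allows it."""
    # Vary tells caches that the response depends on Accept-Encoding.
    headers = {"Vary": "Accept-Encoding"}
    if accepts_gzip(accept_encoding):
        headers["Content-Encoding"] = "gzip"
        return gzip.compress(body), headers
    return body, headers
```

The content itself is the same in both cases; only the transfer size differs, which is what sets this header apart from negotiating on content type.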

There is one single problem, and fortunately it isn't a problem for end users. End users could configure (in the optimal client) exactly what kind of documents they want to retrieve from the server and which versions of documents they dislike. For example, Mozilla already has quite a good Accept header, albeit a bit long. I specified the Accept-Language header myself and prefer Dutch at the moment, I believe; not sure why. If I had some kind of client to browse feeds, I would put application/atom+xml in the Accept header (and application/xml for evil Mark Pilgrim, who claims in the HTML LINK element that his Atom feeds have the application/atom+xml content-type (I always wonder if I have to say that people shouldn't take everything very seriously) (I also wonder if I should take a screenshot of his source code)).

The problem is this: how do I tell the validator to validate my XHTML document (as in: not the PDF document, stupid!)? The second problem has to do with a user agent that supports multiple content-types. This all leads to a need for specifying which version you want to use as input. I believe that current practice is to have a file extension or some other way to tell the server which document to return, which doesn't seem completely optimal to me, but it probably is.
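The file extension practice mentioned above could be sketched roughly as follows: a known extension on the URI path forces one representation, and a path without one is left to the Accept header. The extension table and the function name are hypothetical.

```python
import posixpath

# Hypothetical mapping from URI extension to the representation it forces.
EXTENSION_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".html": "text/html",
    ".atom": "application/atom+xml",
    ".pdf": "application/pdf",
}


def split_variant(path):
    """Return (resource_path, forced_type).

    forced_type is None when the path has no known extension, in which
    case the server would fall back to the Accept header.
    """
    base, extension = posixpath.splitext(path)
    forced_type = EXTENSION_TYPES.get(extension.lower())
    if forced_type is None:
        return path, None
    return base, forced_type
```

A validator could then be pointed at /archives/2002/07/post-name.xhtml to get the XHTML document regardless of what it sends in its Accept header.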

Comments

Both Opera 7.51 and Camino (any version) have application/xhtml+xml in their Accept headers. Both will barf if you send them XHTML+MathML content with that MIME-type.
I think the day is still a ways off when one can reliably use the Accept headers (alone) in determining what to send the client, both for this reason and for the ones you cite.
Posted by Jacques Distler at 5:12AM
Jacques, does your Opera 7.51 barf on this page?
Posted by Moose at 5:55AM
Opera also has video/x-mng in its Accept header, despite not supporting MNG at all.
Posted by mawic at 10:39AM
Moose, your experiments with getting Opera to render a small subset of MathML using CSS are well-known, but beside the point.
I could list all the features of the MathML specification that your CSS hack doesn't support, but I'm sure you are as aware as I of its limitations.
Ultimately, if you worked hard adding the missing features, all you would do is reproduce what Gecko does to support MathML. What Gecko does, in essence, is convert the MathML "box model" to the CSS Box Model, and ship that off to be rendered.
Implementation
A mathematical expression can be represented as an aggregate set of boxes. These are the bounding boxes that would enclose mathematical entities (literal symbol, operator, delimiter, etc). With rules governing the positioning of these entities (subscript, superscript, fraction, etc), it is possible to construct the box-model in a recursive manner by traversing the parsing tree of the expression.
With the object-oriented paradigm, each box can be viewed as an object that has its own specific properties and shares a common set of properties with other objects. With the CSS paradigm, each box can be viewed as a CSS frame that possibly embeds other CSS frames. Hence there is a direct correspondence between the two paradigms.
MathML offers two formats for representing an equation: presentational tags and semantic/content tags. Given an equation in either format, the MathML project will ultimately aim at constructing a lump of CSS frames that can then be passed onto Gecko for layout and display.

If you succeeded, that would be a great thing, but you've made the task artificially difficult by not having the parse-tree available to you. Some things you clearly are not going to be able to do without it. (Not to mention the inability to deal with MathML named entities.)
Opera 7.51 will barf on perfectly valid XHTML+MathML pages. I don't mean not render the equations, I mean throw an XML parser error. Your experiments with carefully crafted simple MathML pages do not change that.
Posted by Jacques Distler at 2:16PM
Damn you Jacques, I was writing an e-mail to him about the same thing, but I had to go to the store :).
In your examples you completely forget about the Content Markup. It's that way of marking up things that makes MathML exciting. It's easy to make <msup/> and stuff work with a stylesheet, but I don't think it will be the same when you're planning to make equations with that wonderful Content Markup. And using a stylesheet to make MathML work is, in my opinion, a way of hacking around.
You are doing a great job with promoting Opera, but sometimes you have to face that there are missing things.
Posted by Pieter Belmans at 2:32PM
There have been thoughts about MathML and CSS though, but that is a bit off topic.
Jacques Distler, note that MathML isn't really part of the application/xhtml+xml content-type. It is the correct MIME for the document, since the root element is in the XHTML namespace, but that doesn't mean the browser supports the other mixed fragments as well.
You should actually check if application/mathml+xml is in the Accept header if you want to be sure, but Mozilla doesn't support that yet.
Note also that Opera (and probably Camino as well (the same for the new 0.8?)), according to the XML specification, does nothing wrong when it "barfs" on the page, since unrecognized entities should give a parse error per XML design.
Posted by Anne at 3:58PM
Jacques, I do not believe in MathML. I have been discussing this topic with various people over the last year, and have come to favor a custom DTD and pure XML+CSS.
The only reason why I wasted a lot of my time on applying CSS to MathML was because people actively claimed it was impossible. So I proved them wrong. My examples do not save the world, nor do they have utilitarian value. They were done for a specific reason.
I didn't know you were aware of my math-related work before, but since you say you are, then perhaps you are aware of George Chavchanidze's work. If we both had not been as busy as we are, perhaps we would push it further than we did. George continues to pioneer alone, and I wish I could chip in.
I do not believe in MathML as a standard. Just for the record. There is a brighter present and future than what it has to offer.
Anne, my sincere apologies for veering off-topic. I'll bugger off eventually :)
M.
Posted by Moose at 4:29PM
Anne, I looked up the post by Tim Scarfe which gave me the idea for content negotiation, which in turn led me to tell you about it. I'm not trying to bring up the whole XHTML/HTML debate again, just read the two updates to the post.
I agree with Tim (and indirectly with Paul Sowden) about opaque content negotiation:
To make my position clear; I would like to make an assumption about the type of resource I get from a URI. In other words, the relationship between the URI and the resource type should be one to one.

Posted by Mark Wubben at 6:57PM
You should actually check if application/mathml+xml is in the Accept header if you want to be sure

And how is the server supposed to know whether the file in question contains embedded MathML content?
application/xhtml+xml is absolutely the correct MIME type for an XHTML document containing embedded fragments from any other XML namespace(s). There is no provision (in any Spec I know of) for associating multiple MIME types with a given URL. So there's no way for the server to programmatically know it needs to check for multiple MIME types in the User-Agent's Accept headers.
This is not supposed to be a problem. When the User-Agent encounters elements from a namespace it does not support, it is simply supposed to ignore those elements and parse their contents.
That's the way things are supposed to work. But, the two examples I know of which Accept application/xhtml+xml, but which do not support MathML, both barf when they encounter embedded MathML.
Not a pretty picture for Content Negotiation.
Posted by Jacques Distler at 10:51PM
Jacques, do they barf on elements or entities? There is a subtle, yet important difference between the two.
Posted by Anne at 11:28PM
I did some quick tests with Camino. As far as I can tell, it barfs on named entities (even HTML named entities). Only XML named entities (&amp;, ...) are allowed. I expect Opera 7.51 does the same.
Posted by Jacques Distler at 12:26AM
And that, as opposed to barfing on elements, is exactly what should happen when a browser isn't able to parse the DTD (and believe me, not a single (used) browser parses DTDs at the moment).
Posted by Anne at 12:32AM
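The difference between unknown elements and unknown entities discussed in the comments above can be reproduced with any XML parser that does not read the DTD. The sketch below uses Python's standard library parser as a stand-in, not a browser, and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """True if the document is accepted by a parser that reads no DTD."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# One of the five entities predefined by XML itself.
PREDEFINED_ENTITY = "<p>Fish &amp; chips</p>"

# An HTML named entity; without a DTD the parser has no definition for it.
NAMED_ENTITY = "<p>Fish&nbsp;and chips</p>"

# Elements from a namespace the consumer may not support are still well-formed.
FOREIGN_ELEMENT = (
    '<p xmlns:m="http://www.w3.org/1998/Math/MathML">'
    "<m:math><m:mi>x</m:mi></m:math></p>"
)
```

Here the document with the predefined entity and the one with foreign elements both parse, while the one with the undefined named entity is rejected as not well-formed.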
So, since no User-Agent parses DTDs, but instead works from a "built-in" set of recognized DTDs, it's really irrelevant what XML MIME-type the User-Agent accepts. What counts is what DOCTYPEs it accepts.
Say I have a document which validates both as XHTML 1.1 and as XHTML 1.1 + fooML. If I send it to a User-Agent which recognizes both XHTML 1.1 and XHTML 1.1 + fooML DOCTYPEs, I can send it with either DOCTYPE. But if the User-Agent doesn't recognize the latter DOCTYPE, it will barf when I send the document as XHTML 1.1 + fooML.
Have I got that right?
Clearly, we have bigger problems than MIME-type negotiation. Now we have to worry about "DOCTYPE negotiation" too?
Posted by Jacques Distler at 1:12AM
It might be possible that it is implemented that way, but I don't think that would be the correct implementation.
If a browser supported both fooML and XHTML and you used the XHTML DOCTYPE on a document that contains entities from fooML the browser should barf. If you have the same document, but use the XHTML + fooML DOCTYPE it should not barf, since the entities of fooML are defined in that DOCTYPE.
If Mozilla had application/mathml+xml in their Accept header, you could check for that when someone requests the page (along with a check for application/xhtml+xml support) and if they support (both) you send the document as application/xhtml+xml, since you know they support MathML and XHTML.
Posted by Anne at 2:14AM
What I think is missing is a way to state which content types the URI in question is available in. That way, a client can state with Accept what MIME types it wants, and the server can, with an X-Available-Types header, state what types the resource is available in.
If the client (probably with the user's intervention) doesn't like the first MIME type it gets in return, it can negotiate another by altering its Accept header. All this content negotiation can be done with the HEAD method, to save bandwidth.
But since the server doesn't, at least as far as I know, have an opportunity to assert what MIME types the resource is available in, there will never be much negotiation going on. You're not negotiating when you're saying «I want this URI in that MIME type». Negotiation needs to be a dialogue in at least three parts, imho.
Posted by Asbjørn Ulsberg at 2:44AM
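The dialogue proposed in the comment above might look roughly like this on the server side. X-Available-Types is the hypothetical header from that comment, not an existing standard, and this sketch ignores q-values and partial wildcards to stay short.

```python
def head_response(accept_header, available):
    """Return (status, headers) for a HEAD request on a negotiated URI."""
    accepted = [part.split(";")[0].strip().lower()
                for part in accept_header.split(",")]
    # The first available type the client lists wins; "*/*" accepts anything.
    chosen = next(
        (media_type for media_type in available
         if media_type in accepted or "*/*" in accepted),
        None,
    )
    headers = {
        # Hypothetical header advertising every representation of the URI.
        "X-Available-Types": ", ".join(available),
        "Vary": "Accept",
    }
    if chosen is None:
        return 406, headers  # Not Acceptable, but the list is still sent
    headers["Content-Type"] = chosen
    return 200, headers
```

A client that gets a 406, or a type it doesn't like, could read the advertised list and repeat the request with a different Accept header.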
If a browser supported both fooML and XHTML and you used the XHTML DOCTYPE on a document that contains entities from fooML the browser should barf.

We're not talking about that. We're talking about a document with entities from XHTML. In fact, we're talking about a document which is valid under either DOCTYPE (because it's secretly just an XHTML document).
It's bizarre but true that the browser will barf just because you change the DOCTYPE to XHTML+fooML.
If Mozilla had application/mathml+xml in their Accept header, you could check for that when someone requests the page

Again, that requires "knowing" whether the document in question contains embedded MathML. Short of parsing the file (at least, reading its DTD), there's no way the Server can know that.
And, anyway, since existing browsers don't send out application/mathml+xml in their Accept headers, you're still stuck browser-sniffing.
Posted by Jacques Distler at 2:51AM