Anne van Kesteren

XML problems on the web

30 November 2006

It seems to me that certain groups of people in standardization take a “wait and see” stance when it comes to things on the web. See for instance this lovely thread on the HTTP Working Group mailing list. I sort of have the feeling this applies to the XML Working Group as well. There are quite a few issues with XML on the web today, and they aren’t really trying to address them, nor are they declaring that XML shouldn’t be used on the web, which would be another way of solving the problem. Here’s a list of issues:

The text/xml media type doesn’t work as prescribed.
User agents sniff for certain DOCTYPEs to support entities they come across in documents. They never use that DOCTYPE for anything else though. Except for the entities thing it’s completely meaningless. (They have to do that too, for what it’s worth. Lots of duplicate bug reports and unhappy customers guaranteed otherwise.)
Some user agents don’t throw for invalid characters (I think only Internet Explorer does).
The biggest consumers of XML data have a bozo switch, meaning that calling something an XML document no longer implies that it’s well-formed.

Yet, not much is done to try to revert all this (or to accept reality for that matter). I suppose some people think that all problems will magically go away. Besides this, I have a question. Can someone tell me if the following positively has to throw a parse error (if so, I’ll try getting it fixed in Opera):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><p>&test;</p>

Thanks!

Comments

Can someone tell me if the following positively has to throw a parse error (if so, I’ll try getting it fixed in Opera):
```
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><p>&test;</p>
```
Only if standalone="yes", see Well-formedness constraint: Entity Declared in the spec. (Or are both errors but only the one with standalone="yes" fatal? That's what I seem to get in libxml2.)
Posted by David Håsäther at 9:02PM
I think that is always an error. The reference to 'standalone' only tries to say that it's an error if there is a definition for that entity but it's not available from within the document. (Presuming that the XHTML DTD does not contain an entity called test.)
Posted by Martin Probst at 9:27PM
It would seem to make sense for it to be an error, and my interpretation is similar to Martin's.
Posted by Robert Wellock at 9:57PM
User agents sniff for certain DOCTYPEs to support entities they come across in documents.

I am not sure what Opera and Safari do, but in Firefox there’s no sniffing (as understood from text/html) nor XML processor corruption. The entity resolver gives the parser an abridged (i.e. fake) DTD for certain public ids. That is, the trickery is in the DTD catalog. (This makes a difference compared to just breaking the parser, because you can reproduce the behavior with any conforming XML parser that allows pluggable entity resolvers.)
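The pluggable-resolver point can be reproduced outside Gecko. What follows is a minimal sketch using Python's `xml.parsers.expat` bindings, not Gecko's actual code. The `FAKE_DTDS` table, its two entity declarations, and the `parse_text` helper are made up for illustration. The resolver hands the parser a tiny stand-in DTD for a known public identifier, and the unmodified parser then expands `&nbsp;` on its own.

```python
import xml.parsers.expat as expat

# Hypothetical "abridged DTD" catalog keyed by public identifier.
# A real catalog would declare the full XHTML entity set.
FAKE_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        '<!ENTITY nbsp "&#160;"><!ENTITY copy "&#169;">',
}

DOC = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
       '<p>a&nbsp;b</p>')


def parse_text(doc):
    """Return the character data of doc, resolving the external DTD
    subset from FAKE_DTDS instead of fetching the system identifier."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to call the external entity handler for the DTD subset.
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    def resolve(context, base, system_id, public_id):
        dtd = FAKE_DTDS.get(public_id)
        if dtd is not None:
            # Feed the stand-in DTD to a sub-parser; nothing is fetched.
            sub = parser.ExternalEntityParserCreate(context)
            sub.Parse(dtd, True)
        return 1  # non-zero tells expat to carry on

    parser.ExternalEntityRefHandler = resolve
    parser.CharacterDataHandler = chunks.append
    parser.Parse(doc, True)
    return "".join(chunks)


if __name__ == "__main__":
    print(repr(parse_text(DOC)))
```

All of the trickery lives in `resolve` and the catalog; the parser itself is used as shipped.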
Some user agents don’t throw for invalid characters

Yeah. Sloppy character decoder integration is to blame.
Can someone tell me if the following positively has to throw a parse error

If the XML processor reads the external entity (which it isn’t required to do if it doesn’t claim to validate against the DTD) and the external entity does not declare an entity called test, the XML processor has to throw a fatal error. If the XML processor does not read the external entity, it has to report to the application that it skipped an entity. It is up to the application to decide what to do with this report. A Web browser probably wants to render a place holder while many other reasonable apps may opt to halt and catch fire. (Note that reading the external entity doesn’t mean that the XML processor read the real one—just that the entity resolver gave it something that it read. So Gecko’s behavior must be blamed on the entity resolver—not expat.)
Posted by Henri Sivonen at 3:44AM
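The distinction between a fatal error and a skipped-entity report can be observed with a non-validating parser. The sketch below uses Python's `xml.parsers.expat` bindings and assumes expat's default configuration, in which the external subset is not fetched. The helper names `skipped_entities` and `outcome` are made up for illustration. It feeds the parser the example from the post, the same markup without a DOCTYPE, and the same markup with `standalone="yes"`.

```python
import xml.parsers.expat as expat

DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')


def skipped_entities(doc):
    """Parse doc without reading the external subset and return the
    names of the entities the parser reported as skipped."""
    skipped = []
    parser = expat.ParserCreate()
    parser.SkippedEntityHandler = (
        lambda name, is_parameter: skipped.append(name))
    parser.Parse(doc, True)
    return skipped


def outcome(doc):
    """Classify the result as a skipped-entity report or a fatal error."""
    try:
        return ("skipped", skipped_entities(doc))
    except expat.ExpatError as err:
        return ("fatal", expat.ErrorString(err.code))


if __name__ == "__main__":
    body = "<p>&test;</p>"
    # External subset declared but not read: the application is told.
    print(outcome(DOCTYPE + body))
    # No DTD at all: nothing could have declared the entity.
    print(outcome(body))
    # standalone="yes": the external subset cannot be used as an excuse.
    print(outcome('<?xml version="1.0" standalone="yes"?>' + DOCTYPE + body))
```

What the application does with the skipped-entity report (render a placeholder, or stop) is its own decision, which is why different consumers built on the same parser can behave differently.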
For what it's worth, the example also violates Validity constraint: Root Element Type, but that's a validity constraint rather than a well-formedness constraint.
Posted by David Baron at 4:04AM
If your example is served as text/html, then it should be parsed as HTML, no matter what the doctype says. An entity like "test" is a parse error in HTML, but the parser should recover from that.
If the example is served with an XML MIME type, then it's a parse error (unrecoverable).
Posted by Alexey Feldgendler at 3:44PM
Alexey, it’s served as XML, but it’s unclear to me that it has to throw a parse error. Did you read what Henri wrote?
Posted by Anne van Kesteren at 10:58PM