Anne van Kesteren

Inherently Unstable

People have been arguing for a long time that if only HTML had been strict from the start, the web would be a neater, less complex place. I have my doubts about that. I think the web is in an inherently unstable equilibrium. There are a number of problems with strict error handling:

  1. It complicates extensibility. Introducing a new feature would require a new version and a new version flag for very strict systems, such as Silverlight. This is also true for the XML syntax, although XML is not a language itself. If you have a color feature that accepts colors and you want to introduce a new color, orange, that would not be possible without introducing a new version.

  2. Market pressure leads user agents to implement new (experimental) features. In a strict world this would mean that pages using those new features would completely break in other user agents.

  3. There is also the factor of human error. At the level of the page author, the implementor and the specification writer. This will inevitably lead to small error checking mistakes in implementations which will lead to a less strict system when pages start depending on it.

To contrast HTML with XHTML, XHTML is only stricter in syntax which does not help you much. You can still write <input type="foobar" xyz=""/> for instance or put a div element inside script. What would have been nice is if the error handling rules for HTML parsing (most notably the tree construction phase) had been more predictable than they are now. However, I am afraid it is an unfortunate artifact of web history we have to live with.
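
As a rough sketch of that distinction (mine, not part of the original argument), consider feeding such markup to a draconian XML parser; Python's xml.etree stands in for any such parser here. Well-formed but meaningless vocabulary goes through untouched, while a single syntax-level slip is fatal:

    # Hypothetical illustration: an XML parser only enforces syntax, not vocabulary.
    import xml.etree.ElementTree as ET

    # Well-formed, so a draconian parser accepts it, even though type="foobar"
    # and xyz="" mean nothing in HTML.
    ET.fromstring('<input type="foobar" xyz=""/>')

    # Also well-formed: a div element inside script raises no objection either.
    ET.fromstring('<script><div>fine as far as XML is concerned</div></script>')

    # A purely syntactic slip, such as an unescaped &, is fatal though.
    try:
        ET.fromstring('<p>fish & chips</p>')
    except ET.ParseError as error:
        print('draconian failure:', error)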

Comments

  1. You could argue that today's Web is a byproduct of a flawed, imperfect, "unstable" specification for HTML. So perhaps spending so much time trying to perfect things today could prevent comparable growth in the future. ;-)

    Posted by Rahul at

  2. Crucial to the growth of the web was that amateurs could create HTML-imperfect pages that would render in browsers.

    If an errant </p> tag had resulted in an error message in MSIE and Netscape, instead of a dutifully-rendered web page, it would have slowed innovation and growth to a crawl.

    I'd take a vibrant, diverse, amateur universe of tag soup over a sterile, expert-only, limited speck of well-formedness any day.

    The web's tolerance for messy HTML is a net good.

    Posted by Joe Grossberg at

  3. On the other hand, errant tags causing problems might have caused more innovation in tag parsing and WYSIWYG editing and then Xopus might never have needed to exist.

    Posted by Rahul at

  4. Surely there's a third way.

    If things had 'worked' but warned you that there was a potential problem, then it wouldn't have scared anyone away and would have presented an easy-to-climb gradient for amateurs to learn HTML even faster. I think we've all got to that stage in code or markup where you make one more small mistake and the whole thing stops working. That's why you need lints and unit tests and compiler warnings to stop it all collapsing in an avalanche of errors at an arbitrary point.

    On the other side, having to explicitly catch errors and warn about them in a human-readable manner effectively forces you to document both your support for standards and your error handling. This not only allows easier interop but also limits the growth of such error handling, as at some point you have to say "that's just ridiculous" and stop trying to make things work when the users have made multiple errors, all of which you've flagged individually and provided an easy way for them to fix.

    I've never understood how handling garbage-in without even allowing experts to understand what magic you're performing under the covers to fix their errors can be considered a good thing.

    And this seems to be the new way forward with things like Atom, HTML5 and validators that support them.

    Posted by dave at

  5. It complicates extensibility. Introducing a new feature would require a new version and a new version flag for very strict systems, such as Silverlight. This is also true for the XML syntax, although XML is not a language itself. If you have a color feature that accepts colors and you want to introduce a new color, orange, that would not be possible without introducing a new version.

    Maybe you need to make explicit something that you are assuming here: namely, that your "strict error handling" involves throwing an error upon encountering an unknown element, attribute or attribute value.

    This is not the way the web works, even in application/xhtml+xml (which most people think of as having "strict error handling"). There, unknown elements, attributes and attribute values are silently ignored, which means that extensibility is decidedly not a problem. Indeed, to the contrary, it's why XHTML is touted for its extensibility.

    Posted by Jacques Distler at
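
A minimal sketch of the behaviour described in the comment above, using Python's xml.etree as a stand-in for a generic XML consumer (the made-up element and attribute names are obviously not real XHTML): the parser only checks well-formedness, so unknown vocabulary is carried along rather than rejected, and a naive consumer that merely extracts text never even notices it.

    # Hypothetical illustration of "unknown markup is silently ignored":
    # well-formedness is all the parser enforces, so extensions pass through.
    import xml.etree.ElementTree as ET

    fragment = ET.fromstring(
        '<p xmlns="http://www.w3.org/1999/xhtml" made-up-attribute="x">'
        'Hello <made-up-element>extensible</made-up-element> world.</p>'
    )

    # A naive "renderer" that only extracts text still sees everything;
    # the unknown wrapper simply contributes no meaning of its own.
    print(''.join(fragment.itertext()))  # Hello extensible world.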

  6. Meh. Again with the drum-beating and straw men?

    This time you are confusing draconian error handling for the syntax with MustIgnore/MustUnderstand in the semantics. A confused premise leads to a confused conclusion. XML is not incompatible with MustIgnore; cf. Atom, which has no requirement for a new version for the introduction of new core features and where documents with extension elements do not break in feed readers that do not understand them.

    Anyway, it is a known fact that the #1 usability issue for programming languages is the quality of error messages produced by the compiler/interpreter. The better they are, the easier it is to write (syntactically correct) code. Now consider the browser environment and then draw your own conclusions.

    Posted by Aristotle Pagaltzis at

  7. Jacques, Aristotle, I thought I addressed that case in the last paragraph? I am not really convinced that some draconian handling is better than none at all.

    Posted by Anne van Kesteren at

  8. Jacques, Aristotle, I thought I addressed that case in the last paragraph?

    I don't see how:

    To contrast HTML with XHTML, XHTML is only stricter in syntax which does not help you much.

    is either true, or addresses the point.

    I am not really convinced that some draconian handling is better than none at all.

    Which is hard to reconcile with your argument about extensibility. "Some draconian error handling" (XHTML) promotes extensibility in ways that you yourself argue are not possible with fully-draconian error handling (your straw man, nonexistent web markup language). Nor are they possible with "none at all" (HTML).

    I'm gonna have to agree with Aristotle that this sort of straw man advocacy is not the least bit helpful.

    Posted by Jacques Distler at

  9. Whereas Aristotle and Jacques are right that any given XML vocabulary can be designed to be extensible via MustIgnore, and thus Anne's argument is technically illiterate, Anne is nonetheless right in that the Web would not have been helped, and might possibly have never taken off, if it had required strict well-formedness (not that the concept then existed) in the early days. You really want strict WF-ness for machine-to-machine messages.

    Note also that HTML Classic has always had an implicit MustIgnore rule; browsers have historically silently ignored anything they saw but didn't understand.

    -Tim

    Posted by Tim Bray at

  10. If HTML parsers had been as strict regarding well-formedness errors as XML parsers, it wouldn't have been possible to introduce XHTML 1.0 in a backwards-compatible way (as per Appendix C).

    Posted by Olav Junker Kjær at

  11. There is also the factor of human error

    The biggest problem with draconian error handling is indeed human error, because it punishes the user of the website for a mistake made by the author.

    Why should the user be denied access to content just because the author forgot to use &amp; in place of &?

    Posted by Tom Pike at

  12. What I get out of this is that basically... HTML was originally too unlimited in certain ways no one realized until too late. So they made a "fix" and called it XML, which was too limited in certain other ways no one realized until too late. Now, learning from both, we end up with WHATWG's HTML/XHTML 5.0.

    Posted by Devon Young at

  13. It complicates extensibility. Introducing a new feature would require a new version and a new version flag

    Just like any other language. How many major milestones since the introduction of HTML? HTML 1 (did it ever get out of CERN?), 2, 3.2, 4. I count four in 15 years. You can draw comparisons to MS Office formats, Perl, PHP, Windows DirectX API versions, Flash versions. A technology with a 4-year turnover rate (and thus needing software updates) does not show extensibility problems. New browser releases happen more often than HTML versions. The issue is already solved for XHTML with XML namespaces.

    Market pressure leads user agents to implement new (experimental) features. In a strict world this would mean that pages using those new features would completely break in other user agents.

    Now, it seems the argument sums up as "either require a tyrannically strict parser or be absolutely loose". Well, I think we don't need more than an SGML well-formedness parser for HTML (XHTML is extended through standard XML namespaces). A well-formedness parser doesn't prevent extensibility, but ensures at least some markup coherence and simplifies the parser a lot. Just like spoken languages allow the addition of new words but prevent sentences like (in English) "fox over jumped dog the". Are English language rules preventing authors from writing? Is the spellchecking software business doomed to fail because the English language evolves?

    There is also the factor of human error. At the level of the page author, the implementor and the specification writer. This will inevitably lead to small error checking mistakes in implementations which will lead to a less strict system when pages start depending on it.

    So, a strict system leads to... a non-strict system? Because something might not be perfect, we must not even try?

    Compilers for well-defined languages (C, C++, FORTRAN) have all shown different glitches in certain versions. When notified about the problem, the people coding the compiler correct their parsers, so the lifetime of non-standard code that masquerades as valid code (limited to one specific compiler at one precise version) is kept to a minimum.

    Anyway, any web developer trying at least 2 browsers would immediately catch these errors.

    XHTML is only stricter in syntax which does not help you much

    Well, a missing or extra closing div caught early can reduce headaches. Also, as more and more (X)HTML is generated by a scripting language, a well-formedness parser can show logic flaws in the script.

    Posted by Patrice Levesque at

  14. It complicates extensibility. Introducing a new feature would require a new version and a new version flag for very strict systems, such as Silverlight. This is also true for the XML syntax, although XML is not a language itself. If you have a color feature that accepts colors and you want to introduce a new color, orange, that would not be possible without introducing a new version.

    Well, orange was added to a new version of CSS (2.1), and only added because all current browsers supported it. Following your colour example though, let’s say we want to add a new one “lemon” — how can that be done without versioning?

    Without versioning, what is an older browser to do when it encounters the previously unknown “lemon” colour — should it display the text in a default colour like black? What if the background colour is a dark brown and this makes the text hard to read? Wouldn’t it be better if the browser just reported an error because its version number does not match and it doesn’t know what to do?

    And upon seeing that error, wouldn’t it be better if an HTML author could easily determine which versions are supported by current browsers, and pull up the documentation for the lowest version to see the colours that can be used? How is this more complicated than the current system of trial and error to see which browsers support what? Yes, it is still too complicated for an amateur, but they’re using tools like WordPress and Dreamweaver, not coding HTML by hand.

    The idea that this complicates extensibility is odd. There is no extensibility in the current landscape. Versioning enables extensibility in a straightforward way. How would the W3C or anybody else add the colour “lemon” to HTML if they wanted to?

    Posted by adriand at

  15. The issue is whether error handling should be undefined (like classic HTML), draconian (like XML parsing) or specified with unambiguous rules for error recovery (like CSS). I believe draconian is better than undefined, but specified error recovery is better than both, not least because it provides extensibility in a backwards-compatible way. (Obviously it's also much more work to specify and test.)

    I don't believe in versioning since browser vendors are adding features incrementally and out of sync with spec versions.

    Posted by Olav Junker Kjær at
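
A toy sketch of that third option, loosely modelled on the way CSS drops a declaration it does not understand and keeps going (the colour sets and function below are hypothetical, purely for illustration). It also speaks to the “lemon” question from the earlier comments: with specified error recovery, an older browser falls back to the earlier declaration instead of erroring out, while a newer one picks up the new value, all without a version flag.

    # Hypothetical mini-model of specified error recovery, CSS-style:
    # every implementation applies the same rule -- drop a declaration whose
    # value it does not recognise, keep the rest, last recognised one wins.
    KNOWN_COLORS_OLD = {'black', 'white', 'orange'}           # an older browser
    KNOWN_COLORS_NEW = KNOWN_COLORS_OLD | {'lemon'}           # a newer browser

    def used_color(declared_values, known_colors):
        """Return the last recognised value; unrecognised ones are dropped."""
        color = None
        for value in declared_values:
            if value in known_colors:
                color = value
        return color

    style = ['black', 'lemon']   # as in "color: black; color: lemon;"
    print(used_color(style, KNOWN_COLORS_OLD))   # black (graceful fallback)
    print(used_color(style, KNOWN_COLORS_NEW))   # lemon (new feature used)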

  16. I don't believe in versioning since browser vendors are adding features incrementally and out of sync with spec versions.

    Well, versioning in small increments would still allow browser vendors to implement features incrementally, and give them clear goals to work towards. I think that would provide something far more interoperable than what we have currently, where browsers implement a giant spec in bits and pieces.

    Yeah, the issue is about strict error handling, but you can’t have that without versioning.

    Posted by adriand at

  17. What version is the English language currently at (British English, Oxford dialect)? I need to know because maybe I need to update my brain first to be able to understand all these comments correctly... If you are in a different namespace or version, please state that in advance so I can switch to one of my other brains.

    Posted by Tino Zijdel at