Anne van Kesteren

Notes on HTML5 Parser History

We reached a new milestone with HTML5 this week. Henri Sivonen flipped a switch and now Firefox nightly builds ship with an HTML5 parser by default. It might not seem like a milestone to everyone, however. As Henri puts it: “A key feature of the HTML5 parser is that you don’t notice that anything has changed.” Defining HTML to this level of detail was not something we planned on doing from the beginning. Initially we envisioned and worked on simple extensions to HTML, such as <input type="date">. As the work progresses you realize that without understanding the fundamentals certain extensions (changes, if you will) are not possible. Now you could change the fundamentals, and much software has been rewritten over the years to do exactly that. But there is no starting over with the web.

Therefore, in the first half of 2006, Ian Hickson wrote down an HTML parser in English prose, as a section of Web Applications 1.0, as HTML5 was known then (what is now HTML5 plus several other drafts). It was not universally thought to be a good idea. I remember Dave Hyatt expressing doubt that we could ever capture the complexity accurately. Not long after, we had a little victory dance as WebKit was able to fix a bug thanks to simply reading the HTML5 parser draft. Likewise, Opera has been able to fix numerous parsing interoperability issues by following the specification, sometimes suggesting subtle improvements to it. However, until recently nobody had shipped a fully compliant HTML5 parser in a web browser.

At the end of 2006 James Graham and I (mostly James) started html5lib, a library written in Python for parsing HTML. Performance-wise it is much slower than equivalent libraries written in C, but it copes a lot better with the HTML out there. Best of all, it comes with a test suite for parsing HTML according to HTML5. This has helped Henri develop the code that in turn drives the new HTML parser in Firefox. The specification, the tests, html5lib, Firefox’s new HTML parser: all incrementally evolving, feeding changes and ideas in all directions. It is exemplary of how the web platform is made.

Having defined the fundamentals, we also started reaping the benefits. Inline vector graphics and mathematics, with no toolchain change required. New features for the script element that allow asynchronous loading and execution of scripts. Lower costs for browser QA and implementors, allowing them to focus on other activities. No longer is there a need to reverse engineer other browsers; everyone will attempt to follow the specification instead. And finally, a lower entry cost for new players. We do not want another browser monoculture.