Anne van Kesteren

Project html5lib: performance

Between visiting family, cooking, and eating I managed to do some optimization on the html5lib code. For profiling I used the Web Applications 1.0 document. Parsing it initially took 47 seconds, measured using the hotshot library. Forty-seven seconds for a total of 15,773,345 function calls. Loading the same file from the hard drive, Opera parses and renders the page in about three seconds, though it should be noted that rendering is a far more involved piece of work than parsing. For now we plan to compete with the actual “competition” first: BeautifulSoup, htmllib and HTMLParser fall into that category, I think. Is there an easy way to run similar tests on those?
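For quick, comparable numbers, something like the following works against the standard library's HTMLParser (htmllib and BeautifulSoup could be timed the same way). This is a rough sketch; in modern Python the module lives at html.parser:

```python
# Rough timing sketch for comparing parsers. Substitute the actual
# Web Applications 1.0 document for the repeated snippet below.
import time
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Minimal subclass that just counts start tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1


def time_parse(html):
    parser = CountingParser()
    start = time.perf_counter()
    parser.feed(html)
    parser.close()
    return time.perf_counter() - start, parser.start_tags


elapsed, tags = time_parse("<p>Hello <b>world</b></p>" * 1000)
```

The same `time_parse` wrapper can then be pointed at any other parser's feed function to get side-by-side numbers.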

I did manage to speed us up, by the way. In the end (as of today) parsing a slightly more complicated version of that document (including all the newlines) takes around five seconds. Yay! Over forty-two seconds saved (fact about 42 #3). Only a tenth of the function calls survived. Much of the performance was gained by consuming multiple characters in a row from the inputstream in the tokenizer before emitting a token to the parser. I also rewrote parts of the parser to not initialize certain objects over and over, but keep them in memory instead. For very large documents with lots of phase switches this saves a lot of overhead. Yesterday I folded the insertion modes and phases as specified in Web Applications 1.0 into a single concept (the specification might change to reflect this), which makes our parser a lot easier to understand, in my opinion. The downside is that the code reflects the prose of the specification slightly less closely. Then again, the specification is likely to change in our favor.
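The character-batching idea can be sketched roughly like this; the class and method names are illustrative, not html5lib's actual API:

```python
# Sketch of the optimization described above: instead of emitting one
# character token per input character, the tokenizer consumes a whole
# run of ordinary characters before emitting a single token.

class InputStream:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def chars_until(self, stop_chars):
        """Consume characters up to (not including) the next stop char."""
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] not in stop_chars:
            self.pos += 1
        return self.text[start:self.pos]


def tokenize(stream):
    while stream.pos < len(stream.text):
        if stream.text[stream.pos] == "<":
            tag = stream.chars_until(">")  # includes the leading "<"
            stream.pos += 1                # skip the ">"
            yield ("tag", tag + ">")
        else:
            # One token for the whole character run instead of one per char.
            yield ("chars", stream.chars_until("<"))


tokens = list(tokenize(InputStream("<p>Hello world</p>")))
```

The saving comes from making one function call (and one token) per run of text rather than per character.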

People who want to dive into optimization should create test cases first, or while doing the optimization. That, and some patience, will help. See also the performance tuning tips from Mark Pilgrim.
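As for the profiling itself: the hotshot module mentioned above was later removed from Python, but cProfile produces the same kind of call-count and timing report. A minimal sketch:

```python
# Profiling a parse run; hotshot (used in the post) is gone from
# modern Python, so this uses cProfile, which reports similar data.
import cProfile
import io
import pstats
from html.parser import HTMLParser


def parse(html):
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


profiler = cProfile.Profile()
profiler.enable()
parse("<p>Hello <b>world</b></p>" * 100)
profiler.disable()

# Print the five most expensive functions by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

The header of the report gives the total number of function calls, which is exactly the figure quoted above for the Web Applications 1.0 document.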

Parsing XML should be pretty trivial, except for the annoying internal subset part. You basically have an inputstream which gives characters to the tokenizer, and the parser checks that the tokens come in the right order, creates a “correct tree” and all that. So all pretty straightforward. Parsing HTML is different. You still have an inputstream. You still have a tokenizer stage, but it is more complicated. It has to deal with error handling, but also with input from the parser stage, as some states within the tokenizer do different things depending on which token has just been emitted. For instance, after you have encountered a script element start tag you have to consume characters and append them to the element until you see a script element closing tag (basically some lookahead handling after </). You also can’t simply tell a treebuilder that a start tag has been seen. Sometimes you need to insert elements directly before the last opened table element, for instance (simply put). James is going to look into building some type of API on top of the parser so that everyone can implement that API and produce the tree they need, such as an ElementTree. This might clarify it a little (the former is XML, the latter HTML):

inputstream -> tokenizer -> parser -> treebuilder
inputstream -> tokenizer <-> parser <-> treebuilder
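The tokenizer/parser feedback for script elements can be sketched as follows; this is a toy tokenizer, not html5lib's, and it glosses over attributes and error handling:

```python
# Sketch of the feedback described above: after a <script> start tag,
# the tokenizer switches to a mode that treats everything up to the
# matching close tag as raw character data, so a "<" inside the
# script does not start a new tag.

def tokenize(html):
    pos = 0
    content_mode = None  # set to a tag name after e.g. <script>
    while pos < len(html):
        if content_mode:
            # Raw-text mode: look ahead for the matching close tag only.
            end = html.find("</" + content_mode, pos)
            if end == -1:
                end = len(html)
            yield ("chars", html[pos:end])
            pos = end
            content_mode = None
        elif html[pos] == "<":
            close = html.index(">", pos)
            name = html[pos + 1:close]
            yield ("tag", name)
            pos = close + 1
            if name == "script":  # the feedback from the "parser" stage
                content_mode = "script"
        else:
            nxt = html.find("<", pos)
            if nxt == -1:
                nxt = len(html)
            yield ("chars", html[pos:nxt])
            pos = nxt


tokens = list(tokenize("<script>if (a < b) x();</script>"))
```

Note how the `<` inside the script body is emitted as character data rather than being treated as the start of a tag, which is exactly the behavior a stateless XML-style tokenizer cannot provide.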

Comments

  1. I have written a Java HTML parser for a customer, to be used in a crawler. To make the best use of it, I went with a SAX-like API. I can very much recommend such an approach.

    As for obvious performance optimizations in such an API, I avoided object creation and copying as much as I could, meaning that ContentHandler calls passed my internal buffer with offsets for text events. The HTMLElement implementation was reused for the next call and internally just held offsets into my buffer.

    The buffer was a sliding window over the input stream, keeping track of which chars had already been consumed. When the scan for the next token hit the end of the buffer, the window was slid forward and the still-to-be-used chars were moved.

    Cheers, Stefan

    Posted by Stefan at

  2. A pure SAX API doesn’t give you a conforming HTML parser. That was sort of one of the points I was trying to get across in this post.

    Posted by Anne van Kesteren at

  3. I did not mean "pure"; maybe "inspired by" would describe it more exactly.

    If I understood you correctly, you are thinking about building the HTML "fixing" code into the parser/treebuilder stack. If you separate this code from the rest and use a SAX-like API, then your stack could maybe look like this:

      tokenizer - parser -*- fixer -*- treebuilder
    

    where -*- is a SAX-like API. You could then use the fixer as a sort of SAX filter in places other than parsing.

    Posted by Stefan at
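Stefan's sliding-window buffer from the first comment could be sketched in Python along these lines; the names and the whitespace-token example are illustrative, and words longer than the buffer are not handled:

```python
# Sketch of a sliding-window buffer: characters are read into a
# fixed-size buffer, and when a token scan reaches the end, the
# unconsumed tail is moved to the front before refilling.
import io


class SlidingBuffer:
    def __init__(self, stream, size=16):
        self.stream = stream   # any object with a .read(n) method
        self.buf = ""
        self.size = size
        self.consumed = 0      # chars already handed out as tokens

    def refill(self):
        """Slide unconsumed chars to the front and read more.

        Returns the number of new chars read (0 at end of input)."""
        self.buf = self.buf[self.consumed:]
        self.consumed = 0
        new = self.stream.read(self.size - len(self.buf))
        self.buf += new
        return len(new)

    def next_word(self):
        """Return the next space-delimited token, or None at EOF."""
        while True:
            end = self.buf.find(" ", self.consumed)
            if end != -1:
                word = self.buf[self.consumed:end]
                self.consumed = end + 1
                return word
            if self.refill() == 0:
                # End of input: whatever is left is the last token.
                word = self.buf[self.consumed:]
                self.consumed = len(self.buf)
                return word or None


buf = SlidingBuffer(io.StringIO("one two three four five"), size=8)
words = []
while (w := buf.next_word()) is not None:
    words.append(w)
```

The point of the design is that no string the size of the whole document is ever built; only a small window of the input is in memory at once.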
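The pipeline Stefan sketches in the last comment, with a fixer stage speaking a SAX-like event API on both sides, might look roughly like this; all names here are made up for illustration:

```python
# Sketch of a SAX-like "fixer" filter stage: it receives events,
# repairs obvious errors (here: misnested and unclosed elements),
# and forwards the fixed events to any downstream handler.

class TreeBuilderHandler:
    """Terminal stage: records events as a flat list for inspection."""

    def __init__(self):
        self.events = []

    def start_element(self, name):
        self.events.append(("start", name))

    def end_element(self, name):
        self.events.append(("end", name))


class Fixer:
    """Filter stage: exposes the same event interface it consumes."""

    def __init__(self, downstream):
        self.downstream = downstream
        self.open_elements = []

    def start_element(self, name):
        self.open_elements.append(name)
        self.downstream.start_element(name)

    def end_element(self, name):
        if name in self.open_elements:
            # Implicitly close anything opened after this element.
            while self.open_elements[-1] != name:
                self.downstream.end_element(self.open_elements.pop())
            self.open_elements.pop()
            self.downstream.end_element(name)
        # Unmatched end tags are simply dropped.

    def end_document(self):
        while self.open_elements:
            self.downstream.end_element(self.open_elements.pop())


builder = TreeBuilderHandler()
fixer = Fixer(builder)
# Events for "<b><i>text</b>" -- the </i> is missing.
fixer.start_element("b")
fixer.start_element("i")
fixer.end_element("b")
fixer.end_document()
```

Because both sides of the fixer speak the same event API, it can be dropped into any such pipeline, which is what makes it usable as a filter outside of parsing too.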