Between visiting family, cooking, and food I managed to do some optimization on the html5lib code. For profiling I used the Web Applications 1.0 document. Initially, parsing it took 47 seconds, measured using the hotshot library. Forty-seven seconds for a total of 15773345 function calls. Loading the same file from the hard drive, Opera parses and renders the page in about three seconds. It should be noted that rendering is a far more involved piece of work than parsing. For now, we’re planning to compete with the actual “competition” first.
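The hotshot module mentioned above has long since been removed from Python, but the same call-count measurement can be sketched with the stdlib cProfile instead. This is only an illustration; the `parse` function here is a stand-in for the real parser call being profiled:

```python
import cProfile
import pstats

def parse(document):
    # Stand-in for the html5lib parse call being profiled.
    return document.count("<")

profiler = cProfile.Profile()
profiler.runcall(parse, "<p>big document</p>" * 1000)

stats = pstats.Stats(profiler)
# Total function calls, the kind of number quoted above for the real parser.
print(stats.total_calls)
stats.sort_stats("cumulative").print_stats(10)
```

Sorting by cumulative time points at the stages worth optimizing first.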
HTMLParser falls into that category, I think. Is there an easy way to run similar tests on those?
I did manage to speed us up, by the way. In the end (today) parsing a slightly more complicated version of that document (including all the new lines) takes around five seconds. Yay! Saved over forty-two seconds (fact about 42 #3). One tenth of the function calls survived. Lots of performance was gained by consuming multiple characters in a row from the inputstream in the tokenizer before emitting the token to the parser. I also rewrote parts of the parser to not initialize certain objects over and over, but instead keep them in memory. For very large documents with lots of phase switches this can save a lot of overhead. Yesterday I folded the insertion modes and phases as specified in Web Applications 1.0 into a single concept (the specification might change to reflect this), which makes our parser a lot easier to understand, in my opinion. The downside is that it is slightly worse at reflecting the prose of the specification in code. Then again, the specification is likely to change in our favor.
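The multiple-characters trick can be sketched roughly like this. This is a simplified illustration, not the actual html5lib code; the `Tokenizer` class and the token tuples are invented for the example:

```python
import re

# A run of characters that cannot start a tag or an entity reference.
DATA_RUN = re.compile(r"[^<&]+")

class Tokenizer:
    def __init__(self, data):
        self.data = data
        self.pos = 0

    def emit_characters(self):
        # Instead of emitting one token per input character (one function
        # call each), grab the whole run of plain characters in one step.
        match = DATA_RUN.match(self.data, self.pos)
        if match is None:
            return None
        self.pos = match.end()
        return ("Characters", match.group())

tok = Tokenizer("hello world<p>")
print(tok.emit_characters())  # ('Characters', 'hello world')
```

One regex match replaces a dozen per-character calls, which is where most of the saved time comes from.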
People who want to dive into optimization should create test cases first, or while doing the optimization. That and some patience will help. See also the performance tuning tips from Mark Pilgrim.
Parsing XML should be pretty trivial, except for the annoying internal subset part. You basically have an inputstream which gives characters to the tokenizer, and the parser checks if the tokens come in the right order, creates a “correct tree”, and all that. So all pretty straightforward. Parsing HTML is different. You still have an inputstream. You still have a tokenizer stage, but it’s more complicated. It has to deal with error handling, but also with input from the parser stage, as some states within the tokenizer do different things depending on which token has just been emitted. For instance, after you have encountered a script element start tag you have to consume characters and append them to the element until you see a script element closing tag (basically some lookahead handling after </). You also can’t simply tell a treebuilder that a start tag has been seen. Sometimes you need to insert elements directly before the last opened table element, for instance (simply said). James is going to look into building some type of API on top of the parser so that everyone can implement that API and produce the tree he or she needs, such as ElementTree. This might clarify it a little (the former is XML, the latter HTML):
inputstream -> tokenizer -> parser -> treebuilder
inputstream -> tokenizer <-> parser <-> treebuilder
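The feedback arrows in the HTML pipeline — the parser flipping the tokenizer into a different state after a script start tag — can be sketched like this. It is a toy with invented names, not the real html5lib state machine (which compares against the last start tag emitted, among other things):

```python
class Tokenizer:
    """Toy tokenizer: the parser can flip it into a raw-text state."""

    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.state = self.data_state

    def __iter__(self):
        while self.pos < len(self.data):
            token = self.state()
            if token is not None:
                yield token

    def data_state(self):
        if self.data[self.pos] == "<":
            end = self.data.index(">", self.pos)
            name = self.data[self.pos + 1:end]
            self.pos = end + 1
            if name.startswith("/"):
                return ("EndTag", name[1:])
            return ("StartTag", name)
        start = self.pos
        self.pos = self.data.find("<", start)
        if self.pos == -1:
            self.pos = len(self.data)
        return ("Characters", self.data[start:self.pos])

    def script_data_state(self):
        # Raw-text mode: everything up to "</" is character data,
        # even things that would otherwise look like tags.
        start = self.pos
        end = self.data.find("</", start)
        self.pos = end if end != -1 else len(self.data)
        self.state = self.data_state  # hand control back for the end tag
        return ("Characters", self.data[start:self.pos])


def parse(data):
    tokenizer = Tokenizer(data)
    tokens = []
    for token in tokenizer:
        tokens.append(token)
        # The "parser" feeds back into the tokenizer: after a script
        # start tag, switch it into the raw-text state.
        if token == ("StartTag", "script"):
            tokenizer.state = tokenizer.script_data_state
    return tokens


print(parse("<script>if (a < b) {}</script>"))
# [('StartTag', 'script'), ('Characters', 'if (a < b) {}'), ('EndTag', 'script')]
```

The `<` inside the script is not mistaken for a tag, which is exactly the behaviour a one-directional XML-style pipeline cannot express.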
I have written a Java HTML parser for a customer, to be used in a crawler. To make the best use of it, I built a SAX-like API. I can very much recommend such an approach.
As for obvious performance optimizations in such an API, I avoided object creation and copying as much as I could, meaning that ContentHandler calls passed my internal buffer with offsets for text events. The HTMLElement implementation was reused for the next call and internally just held offsets into my buffer.
The buffer was a sliding window on the input stream, keeping track of which chars had already been consumed. When the scan for the next token hit the end of the buffer, the window was moved along with the chars still to be used.
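The sliding-window idea described above could be sketched like this (in Python rather than Java, with invented names; it is an illustration of the technique, not the commenter's actual code):

```python
import io

class SlidingBuffer:
    """Keep a window onto the input stream; when a token scan runs off
    the end, slide the window, carrying unconsumed chars along."""

    def __init__(self, stream, size=16):
        self.stream = stream
        self.size = size
        self.buf = ""
        self.consumed = 0  # chars of buf already turned into tokens

    def fill(self):
        # Drop consumed chars, slide the window, read more input.
        self.buf = self.buf[self.consumed:]
        self.consumed = 0
        chunk = self.stream.read(self.size - len(self.buf))
        self.buf += chunk
        return bool(chunk)

    def next_word(self):
        # Example token scan: a whitespace-delimited word.
        while True:
            end = self.buf.find(" ", self.consumed)
            if end != -1:
                word = self.buf[self.consumed:end]
                self.consumed = end + 1
                return word
            if not self.fill():  # ran off the end and input is exhausted
                word = self.buf[self.consumed:]
                self.consumed = len(self.buf)
                return word or None

buf = SlidingBuffer(io.StringIO("one two three four"), size=8)
words = []
while (w := buf.next_word()) is not None:
    words.append(w)
print(words)  # ['one', 'two', 'three', 'four']
```

Only offsets move in the common case; copying happens just when the window slides, which keeps allocation out of the inner loop.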
A pure SAX API doesn’t give you a conforming HTML parser. That was sort of one of the points I was trying to get across in this post.
I did not mean "pure"; maybe "inspired by" would describe it more exactly.
If I understood you correctly, you are thinking about implanting HTML "fixing" code into the parser/treebuilder stack. If you separate this code from the rest and use a SAX-like API, then your stack could maybe look like this:
tokenizer - parser -*- fixer -*- treebuilder
-*- is a SAX-like API. And you can use the fixer as a sort of SAXFilter in other places than parsing.
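A fixer stage plugged in like that could look roughly like this — a toy event filter with invented names, not any real SAX or html5lib API. It consumes a stream of start/end events and repairs the nesting before the treebuilder sees it:

```python
def fixer(events):
    """Toy SAX-like filter: drop stray end tags and close open ones."""
    open_tags = []
    for kind, name in events:
        if kind == "start":
            open_tags.append(name)
        elif kind == "end":
            if name not in open_tags:
                continue  # stray end tag: drop it
            # Implicitly close anything opened after this tag.
            while open_tags[-1] != name:
                yield ("end", open_tags.pop())
            open_tags.pop()
        yield (kind, name)
    while open_tags:  # close whatever is still open at end of input
        yield ("end", open_tags.pop())

events = [("start", "body"), ("start", "p"), ("start", "b"),
          ("end", "p")]  # <b> and <body> are never closed
print(list(fixer(events)))
# [('start', 'body'), ('start', 'p'), ('start', 'b'),
#  ('end', 'b'), ('end', 'p'), ('end', 'body')]
```

Because it is just a generator over events, the same stage works anywhere an event stream flows, which is the SAXFilter point above.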