html5lib: performance

Between visiting family, cooking and food I managed to do some optimization on the html5lib code. For profiling I used the Web Applications 1.0 document. Parsing it initially took 47 seconds, measured using the hotshot library. Forty-seven seconds, for a total of 15,773,345 function calls. Loading the same file from the hard drive, Opera parses and renders the page in about three seconds. It should be noted that rendering is a far more involved piece of work than parsing, but for now we’re planning to compete with the actual “competition” first: BeautifulSoup, htmllib and HTMLParser fall in that category, I think. Is there an easy way to run similar tests on those?
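For reference, a profiling run of this kind is easy to reproduce with the standard library. hotshot is Python-2-only these days; cProfile gives comparable call counts on current Pythons. The parse function below is just a stand-in for whatever parser you want to measure:

```python
import cProfile
import pstats

def parse(document):
    # Stand-in for the real work, e.g. html5lib.parse(document).
    return document.count("<")

profiler = cProfile.Profile()
profiler.enable()
parse("<p>Hello</p>" * 1000)
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
print(stats.total_calls)  # the total-function-calls figure quoted above
stats.print_stats(5)      # the five most expensive entries
```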
I did manage to speed us up, by the way. In the end (today) parsing a slightly more complicated version of that document (including all the newlines) takes around five seconds. Yay! Saved over forty-two seconds (fact about 42 #3). Only a tenth of the function calls survived. A lot of the performance was gained by consuming multiple characters in a row from the inputstream in the tokenizer before emitting the token to the parser. I also rewrote parts of the parser to not initialize certain objects over and over, but instead keep them in memory. For very large documents with lots of phase switches this can save a lot of overhead. Yesterday I folded the insertion modes and phases as specified in Web Applications 1.0 into a single concept (the specification might change to reflect this), which makes our parser a lot easier to understand, in my opinion. The downside is that it is slightly worse at reflecting the prose of the specification in code. Then again, the specification is likely to change in our favor.
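The multiple-characters trick can be sketched like this (an illustrative stand-in, not the actual html5lib code): instead of emitting one character token per character, the tokenizer grabs the whole run up to the next character it actually cares about and emits it as a single token.

```python
def chars_until(stream, pos, stop_chars):
    """Return the run of characters from pos up to (not including) the
    first character in stop_chars, plus the new position."""
    start = pos
    while pos < len(stream) and stream[pos] not in stop_chars:
        pos += 1
    return stream[start:pos], pos

# One characters-token for the whole run, instead of eleven one-char tokens:
text, pos = chars_until("hello world<b>", 0, "<&")
# text == "hello world", pos == 11
```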
People who want to dive into optimization should create testcases first, or while doing the optimization. That, and some patience, will help. See also the performance tuning tips from Mark Pilgrim.
Parsing XML should be pretty trivial, except for the annoying internal subset part. You basically have an inputstream which gives characters to the tokenizer, and the parser checks that the tokens come in the right order, creates a “correct tree” and all that. So all pretty straightforward. Parsing HTML is different. You still have an inputstream. You still have a tokenizer stage, but it’s more complicated. It has to deal with error handling, but also with input from the parser stage, as some states within the tokenizer do different things depending on which token has just been emitted. For instance, after you have encountered a script element start tag, you have to consume characters and append them to the element until you see a script element closing tag (basically some lookahead handling after </). You also can’t simply tell a treebuilder that a start tag has been seen. Sometimes you need to insert elements directly before the last opened table element, for instance (simply said). James is going to look into building some type of API on top of the parser so that everyone can implement that API and produce the tree he or she needs, such as an ElementTree. This might clarify it a little (the former is XML, the latter HTML):
inputstream -> tokenizer -> parser -> treebuilder
inputstream -> tokenizer <-> parser <-> treebuilder
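The backwards arrow can be sketched with a toy tokenizer that switches into a raw-text state once a script start tag has been emitted (illustrative only; in the real pipeline the parser stage tells the tokenizer to do this, and the real states are far more involved):

```python
import re

def tokenize(html):
    """Toy HTML tokenizer: after a script start tag it switches to a
    raw-text state and consumes everything up to the closing tag,
    so markup-like characters inside the script are not misread as tags.
    Assumes well-formed tags; a real tokenizer also handles errors."""
    tokens = []
    pos = 0
    raw_text_for = None  # in reality set via feedback from the parser stage
    while pos < len(html):
        if raw_text_for:
            end = html.find("</" + raw_text_for, pos)
            end = len(html) if end == -1 else end
            tokens.append(("raw", html[pos:end]))
            pos = end
            raw_text_for = None
        elif html[pos] == "<":
            match = re.match(r"</?(\w+)[^>]*>", html[pos:])
            name = match.group(1)
            kind = "end" if html[pos + 1] == "/" else "start"
            tokens.append((kind, name))
            if kind == "start" and name == "script":
                raw_text_for = "script"
            pos += match.end()
        else:
            end = html.find("<", pos)
            end = len(html) if end == -1 else end
            tokens.append(("text", html[pos:end]))
            pos = end
    return tokens

tokens = tokenize("<p>hi</p><script>if (a < b) x();</script>")
# The "a < b" inside the script comes out as one raw-text token, not a tag.
```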
I have written a Java HTML parser for a customer to be used in a crawler. To make best use of that, I did a SAX-like API. I can very much recommend such an approach.
As for obvious performance optimizations in such an API, I avoided object creation and copying as much as I could, meaning that ContentHandler calls passed my internal buffer with offsets for text events. The HTMLElement implementation was reused for the next call and internally just held offsets into my buffer.
The buffer was a sliding window on the input stream, keeping track of which chars had already been consumed. When finding the next token hit the end of the buffer, the window slid forward and the still-to-be-used chars were moved along with it.
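In Python, a sliding-window buffer along those lines might look roughly like this (names invented for illustration; Stefan's Java version handed out offsets into the reused buffer, where this sketch returns slices, since Python strings are immutable):

```python
import io

class SlidingBuffer:
    """A sliding window over an input stream. Consumed characters are
    discarded when the buffer refills, so only the still-to-be-used
    tail is kept around between reads."""

    def __init__(self, stream, chunk_size=8):
        self.stream = stream
        self.chunk_size = chunk_size
        self.buffer = ""
        self.pos = 0  # index of the first unconsumed character

    def _slide_and_fill(self):
        # Move the window: drop consumed chars, append a fresh chunk.
        self.buffer = self.buffer[self.pos:]
        self.pos = 0
        chunk = self.stream.read(self.chunk_size)
        self.buffer += chunk
        return bool(chunk)

    def next_until(self, stop):
        """Consume and return characters up to (not including) stop."""
        while True:
            i = self.buffer.find(stop, self.pos)
            if i != -1:
                token, self.pos = self.buffer[self.pos:i], i
                return token
            if not self._slide_and_fill():  # stream exhausted
                token, self.pos = self.buffer[self.pos:], len(self.buffer)
                return token

buf = SlidingBuffer(io.StringIO("hello<b>world</b>"), chunk_size=4)
first = buf.next_until("<")
# first == "hello"
```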
Cheers, Stefan
A pure SAX API doesn’t give you a conforming HTML parser. That was sort of one of the points I was trying to get across in this post.
I did not mean "pure"; maybe "inspired by" would describe it more exactly.
If I understood you correctly, you are thinking about implanting HTML "fixing" code into the parser/treebuilder stack. If you separate this code from the rest and use a SAX-like API, then your stack could maybe look like this:
tokenizer - parser -*- fixer -*- treebuilder
where -*- is a SAX-like API. And you can use the fixer as a sort of SAXFilter in other places than parsing.
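Such a fixer-as-filter could be sketched like this (invented names, not an actual SAX or html5lib API); here the fixer sits between two handlers in the chain and closes elements that were left open:

```python
class TreeBuilderHandler:
    """End of the chain: just records the events it receives."""
    def __init__(self):
        self.events = []
    def start_element(self, name):
        self.events.append(("start", name))
    def end_element(self, name):
        self.events.append(("end", name))

class FixerFilter:
    """SAX-like filter: forwards events downstream, closing still-open
    elements where needed. Because it only speaks the event API, it can
    sit anywhere in a handler chain, not just behind a parser."""
    def __init__(self, downstream):
        self.downstream = downstream
        self.open_elements = []
    def start_element(self, name):
        self.open_elements.append(name)
        self.downstream.start_element(name)
    def end_element(self, name):
        if name in self.open_elements:
            # Close anything left open inside the element being closed.
            while self.open_elements:
                top = self.open_elements.pop()
                self.downstream.end_element(top)
                if top == name:
                    break
    def end_document(self):
        while self.open_elements:
            self.downstream.end_element(self.open_elements.pop())

builder = TreeBuilderHandler()
fixer = FixerFilter(builder)
fixer.start_element("p")
fixer.start_element("b")
fixer.end_element("p")   # <b> was never closed; the fixer closes it first
fixer.end_document()
# builder.events == [("start", "p"), ("start", "b"), ("end", "b"), ("end", "p")]
```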