Anne van Kesteren

Project: HTML Parser in Python

6 December 2006

Over the weekend and in some spare time during the past few days I've been implementing an HTML5 tokenizer (based on work by James Graham). The tokenizer tells a potential tree constructor when a new start tag is found and what the attributes of that tag are. It also notifies you when it encounters a doctype, comment, et cetera. Pretty cool stuff. The project is called html5lib and you’re free to contribute.
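
To give a rough idea of what such notifications look like, here is a minimal sketch. It does not use html5lib. As a stand-in it uses the HTMLParser class from Python's standard library, which is not an HTML5-conforming tokenizer. The names EventRecorder and tokenize are made up for this illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Collects tokenizer-style events in the order they are encountered.

    Hypothetical helper for illustration; this is not html5lib's API.
    """

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Called for declarations such as <!DOCTYPE html>
        self.events.append(("doctype", decl))

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the event list short
        if data.strip():
            self.events.append(("text", data))


def tokenize(markup):
    """Feed markup to the recorder and return the list of events."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    for event in tokenize('<!DOCTYPE html><!-- hi --><p class="x">Hello</p>'):
        print(event)
```

A tree constructor would consume these events one at a time instead of printing them.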

Also note that HTML5 does not give you another way of parsing HTML. It in fact defines a way of parsing documents that is compatible with as many documents out there as possible while remaining compatible with the DOM specifications.

Comments

It should be noted that there is also some (less finished, but partially functional) work on implementing the parser/treebuilder itself. At present the parser constructs a custom tree format that's based on the minimal amount of DOM Core needed to test the parser code. Longer term, the goal is to abstract out the treebuilding so that treebuilders can be implemented for e.g. ElementTree, xml.dom.minidom and, perhaps, event-based models like SAX. The latter is harder, though, because the HTML5 parser spec requires elements to be moved around after they have been created, so some tree-like buffering is needed, with events only being emitted once there is no possibility of the subtree changing.
Posted by jgraham at 11:22PM
Hi Anne, I added your code to the list of other parsers.
Posted by Karl Dubost at 10:38PM
Good news!
Two questions:
1. Should we expect an XML version of the HTML parser?
2. Should we expect an HTML version of the SLiP (Python) format?
Posted by Juan R. at 5:20PM