
The old HTML parser is dead! Long live the HTML parser!

August 12th, 2013 | Category: Uncategorized

Many years ago, I got my start contributing to Mozilla by working on the new HTML parser. When I started, big changes were happening in Mozilla-land, most notably AOL laying off basically all of its Netscape engineers. With that stroke, the manpower available to work on Gecko shrank to a shadow of its former self, leaving large portions of the codebase unowned. Naturally, this meant that the least sexy portions of the code received even less attention than before, and the parser, which had not been actively worked on (except to fix critical bugs), fell even further onto the back burner. It was abandoned not only because HTML parsers are not the most exciting thing in the world to work on, but also because of a few other factors:

  • The code was very poorly documented, with many comments being incomprehensible. Understanding the code required reading both the immediately surrounding code and most of the rest of the class. In addition, there were often many ways of stating a simple constraint (e.g. HTML element A can contain HTML element B), each one with its own subtle side-effects.
  • The algorithm used was non-deterministic with regard to HTTP packet boundaries! In other words, the same document could be parsed differently depending on how the network happened to split it up. When HTML5 was introduced, this one factor was the reason that Ian Hickson ignored Gecko’s algorithm entirely when specifying what should happen.
  • The coding style used in the relevant source files was unlike any other code in the tree (as well as being both terse and overly verbose at the same time).
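
To make the packet-boundary problem concrete, here is a toy sketch in Python. It is not Gecko's algorithm, and the names (naive_tags, buffered_tags) are made up for illustration. It only shows how a scanner that looks at each network chunk in isolation can give different answers for the same bytes, while one that carries unfinished input forward cannot.

```python
import re
from typing import Iterable, List

# Toy definition of a "tag": "<", anything but angle brackets, then ">".
TAG = re.compile(r"<[^<>]*>")


def naive_tags(chunks: Iterable[str]) -> List[str]:
    """Scan each chunk in isolation; a tag split across chunks is lost."""
    found: List[str] = []
    for chunk in chunks:
        found.extend(TAG.findall(chunk))
    return found


def buffered_tags(chunks: Iterable[str]) -> List[str]:
    """Carry an unfinished "<..." tail over to the next chunk."""
    found: List[str] = []
    pending = ""
    for chunk in chunks:
        data = pending + chunk
        found.extend(TAG.findall(data))
        # If the last "<" comes after the last ">", that tag is still open,
        # so keep it around until more data arrives.
        last_open = data.rfind("<")
        last_close = data.rfind(">")
        pending = data[last_open:] if last_open > last_close else ""
    return found


doc = "<p>Hello <b>world</b></p>"
one_chunk = [doc]
split_chunks = ["<p>Hello <", "b>world</b", "></p>"]  # same bytes, different boundaries

print(naive_tags(one_chunk))        # all four tags
print(naive_tags(split_chunks))     # the <b> and </b> tags go missing
print(buffered_tags(split_chunks))  # all four tags again
```

The naive version's output depends on where the chunks happen to break, which is exactly the kind of behavior that is impossible to write a specification around.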

With this in mind, when Mozilla made the decision to switch to an HTML5 parser, we decided to go with a new parser entirely. Henri Sivonen was kind enough to figure out how to hook his existing implementation up to Gecko (including an automatic translation from Java to C++). We were able to switch over to this new parser for everything except parsing the magical about:blank page. Henri has been working on fixing that; in the meantime, however, we’ve been schlepping the old code around with us. The other day, I decided to remove the parts of the old parser that its one remaining job no longer needed (that is, everything that didn’t simply output a blank document).

The result of all this is that, now that bug 903912 has landed, the old parser is basically gone. The most complicated bits have been torn out and we now have only one HTML parser in the tree.
