Aug 12

The old HTML parser is dead! Long live the HTML parser!

Many years ago, I got my start contributing to Mozilla by working on the new HTML parser. At the time, there were big changes happening in Mozilla-land, most notably AOL laying off basically all of its Netscape engineers. With that stroke, the available manpower working on Gecko shrank to a shadow of its former self, leaving large portions of the codebase unowned. Naturally, this meant that the least sexy portions of the code received even less attention than before, and the parser, which had not been actively worked on (except to fix critical bugs), fell even further onto the back burner. HTML parsers are not the most exciting thing in the world to work on, but that was not the only reason for this abandonment. A few other factors contributed:

  • The code was very poorly documented, with many comments being incomprehensible. Understanding the code required reading both the immediately surrounding code and most of the rest of the class. In addition, there were often many ways of stating a simple constraint (e.g. HTML element A can contain HTML element B), each one with its own subtle side-effects.
  • The algorithm used was non-deterministic with regard to HTTP packet boundaries! When HTML5 was introduced, this one factor was the reason that Ian Hickson ignored Gecko’s algorithm entirely when specifying what should happen.
  • The coding style used in the relevant source files was unlike any other code in the tree (as well as being both terse and overly verbose at the same time).
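
To make the second point concrete, here is a minimal sketch of what being deterministic with regard to chunk boundaries means. It is a toy tokenizer written for illustration only. It is neither the old Gecko algorithm nor the HTML5 one, and the names `ChunkSafeTokenizer`, `feed`, `close`, and `tokenize` are hypothetical. The idea is to emit a token only once its end has been seen, and to carry any unfinished tag or text run over to the next chunk.

```python
class ChunkSafeTokenizer:
    """Toy tokenizer (illustration only, not real HTML parsing).

    Splits input into ("tag", name) and ("text", data) tokens. Anything
    that might continue in the next chunk is held back in a buffer.
    """

    def __init__(self):
        self._pending = ""  # unfinished input carried over between chunks

    def feed(self, chunk):
        data = self._pending + chunk
        tokens = []
        pos = 0
        while pos < len(data):
            if data[pos] == "<":
                end = data.find(">", pos)
                if end == -1:
                    break  # tag is incomplete: wait for more input
                tokens.append(("tag", data[pos + 1:end]))
                pos = end + 1
            else:
                nxt = data.find("<", pos)
                if nxt == -1:
                    break  # text run may continue in the next chunk
                tokens.append(("text", data[pos:nxt]))
                pos = nxt
        self._pending = data[pos:]
        return tokens

    def close(self):
        # At end of input, whatever is left is emitted as plain text.
        tokens = [("text", self._pending)] if self._pending else []
        self._pending = ""
        return tokens


def tokenize(chunks):
    """Feed the chunks in order and return the full token list."""
    tokenizer = ChunkSafeTokenizer()
    tokens = []
    for chunk in chunks:
        tokens.extend(tokenizer.feed(chunk))
    tokens.extend(tokenizer.close())
    return tokens
```

With this structure, `tokenize(["<p>Hel", "lo</p>"])` and `tokenize(["<p>Hello</p>"])` produce the same tokens, because no decision is ever made based on where a chunk happens to end.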

With this in mind, when Mozilla made the decision to switch to an HTML5 parser, we decided to go with a new parser entirely. Henri Sivonen was kind enough to figure out how to hook his existing implementation up to Gecko (including an automatic translation from Java to C++). We were able to switch over to this new parser for everything except parsing the magical about:blank page. Henri has been working on fixing that; however, in the meantime, we’ve been schlepping the old code around with us. The other day, I decided to remove the parts of the old parser that were no longer useful to its remaining use (that is, everything that didn’t simply output a blank document).

The result of all this is that with bug 903912 landed, the old parser is basically gone. The most complicated bits have been torn out and we now only have one HTML parser in the tree.

13 comments

  1. SalmanKhan August 13th, 2013 2:05 am

    I think every now and then it’s worth reminding others just how grateful we are (however invisibly) for their hard work. You, Henri & the team at Mozilla are awesome & we are your fans!

  2. njn August 13th, 2013 2:55 am

    Nice!

  3. Stu August 13th, 2013 3:07 am

    Nice 🙂

    What was used to convert the Java to C++? This sounds incredibly useful...

  4. zcorpan August 13th, 2013 5:42 am

    “When HTML5 was introduced, this one factor was the reason that Ian Hickson ignored Gecko’s algorithm entirely when specifying what should happen.”

    This isn’t quite true. The HTML parsers of IE, Firefox, Safari and Opera were all taken into consideration. In the case of misnested formatting elements, the approach of Safari’s parser was chosen.

  5. mrbkap August 13th, 2013 6:32 am

    @stu: Henri took an existing Java -> C++ converter and added a bunch of Gecko-specific changes. You can find more about the process at this link.

    @zcorpan: Yeah, I guess it might have been better for me to say “…that Ian Hickson discarded Gecko’s algorithm when deciding which algorithm to specify.” The end result was the same 🙂

  6. Steffen August 13th, 2013 6:56 am

    What’s magical about “about:blank” that the new parser can’t handle?

  7. tim peterson August 13th, 2013 7:37 am

    this is a really great story, thanks for sharing!

  8. Arnab August 13th, 2013 7:56 am

    Yes now the mystery becomes clear. This is why every time I load about:blank, I see nothing – I always wondered what was wrong.

  9. Anders August 13th, 2013 8:46 am

    I got curious as to what was so hard about “about:blank”. A search found http://hsivonen.iki.fi/about-blank/ which seems relevant and includes the wonderful quotes “about:blank is probably the hardest Web page to load” and “We want to remove the old HTML parser from the code base entirely after Firefox 4, so special-casing about:blank to use the old parser is not a reasonable long-term solution.” 🙂

  10. Mark August 13th, 2013 4:20 pm

    You removed the blink tag didn’t you?

  11. taytay August 13th, 2013 10:15 pm

    Nice theme for a 90s hackers movie

  12. Tristan August 13th, 2013 10:34 pm

    Yay for getting rid of the old “new parser”!

    “many years ago”: 10 years last month. Netscape layoffs took place mid-July 2003.

    Keep up the good work!

    –Tristan

  13. Jet Villegas August 15th, 2013 9:55 am

    This is a nice reminder of the actual lifetime of things people put on the Web. It’s encouraging to see the evolution come full-cycle. Time to build the new “new HTML6 parser?”