One C++ Tokenizer Too Many: A DXR Story

Erik Rose

When your codebase is 2GB, grep doesn’t cut it anymore. It’s slow, and, in such a large corpus, many attempts to find a symbol get drowned out by false positives. Even modern IDEs begin to choke under the load. This is the domain of DXR, Mozilla’s tool for doing structured queries, free-text searches, and even trigram-accelerated regex matching on large projects like Firefox.

Of course, it’s a software engineering truism that providing speed at a moment’s notice exacts a price in pre-computation, and DXR is no exception. Every night, we run the entire mozilla-central codebase through the clang compiler, injecting a custom plugin which sees what the compiler sees and writes it all down in a database that can dish out fast answers later.

Except when things go awry.

During the Mozilla Summit, DXR had a conveniently timed series of failed indexing runs. A bit of digging revealed that, while the mozilla-central compilation was going off without a hitch, a run of the source through our custom C++ tokenizer was exploding in a later phase.

Wait—custom C++ tokenizer?!

This worn but dutiful little fossil harkens back to DXR’s pre-clang days. In the early Cretaceous, when gcc ruled the earth, we didn’t have an easy framework for compiler plugins; we had to get by on the clever application of heuristics. But, as the millennia wore on and the clang ecosystem evolved, the uses of the custom tokenizer eroded, until its only remaining purpose was to find #include directives so we could guess where they pointed—which we got wrong half the time anyway. It was time to toss that strategy in a tarpit.

And so, after a little compiler plugin tinkering, I’m pleased to announce that DXR now resolves all includes simply by lifting the correct answer out of clang. Before, we would often throw up our hands whenever a source file included a header whose name wasn’t unique in the tree (which happened a lot). Now, with only a few exceptions for weird macro corner cases, we successfully link all non-generated, tree-dwelling includes. And, of course, we lay the maintenance burden of tokenizing C++ squarely on the compiler’s shoulders, where it belongs.
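For a sense of why the old approach struggled, here is a minimal Python sketch of a guess-by-name heuristic. It is not DXR’s actual tokenizer code: the regex, the function names, and the example paths are all assumptions made up for illustration.

```python
import re

# Hypothetical pattern: matches #include "foo.h" and #include <foo.h>.
# It cannot see macro-expanded includes, and it is fooled by
# directives that sit inside comments.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def find_includes(source):
    """Return the paths named by #include directives in a source string."""
    return INCLUDE_RE.findall(source)


def guess_target(include_path, tree_files):
    """Guess which file in the tree an include refers to.

    Returns the single file whose path ends with the included path,
    or None when there are zero or several candidates.
    """
    candidates = [f for f in tree_files
                  if f == include_path or f.endswith('/' + include_path)]
    return candidates[0] if len(candidates) == 1 else None


# Made-up paths, for illustration only.
tree = ['gfx/layers/Utils.h', 'dom/base/Utils.h', 'xpcom/string/nsString.h']
source = '#include "nsString.h"\n#include "Utils.h"\n'

for path in find_includes(source):
    print(path, '->', guess_target(path, tree))
```

In this sketch, nsString.h resolves because only one file in the tree has that name, while Utils.h comes back as None because two files match. The compiler never has to guess: it already knows the include search paths for each translation unit, so it can report exactly which file it opened.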

Want to join us in hacking on compiler plugins, with a generous dollop of Python back-end code? Pitch in at