{"id":3624,"date":"2013-11-14T11:50:14","date_gmt":"2013-11-14T19:50:14","guid":{"rendered":"http:\/\/blog.mozilla.org\/webdev\/?p=3624"},"modified":"2013-11-14T11:32:01","modified_gmt":"2013-11-14T19:32:01","slug":"one-c-tokenizer-too-many-a-dxr-story","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/","title":{"rendered":"One C++ Tokenizer Too Many: A DXR Story"},"content":{"rendered":"<p>When your codebase is 2GB, grep doesn&#8217;t cut it anymore. It&#8217;s slow, and, in such a large corpus, many attempts to find a symbol get drowned out by false positives. Even modern IDEs begin to choke under the load. This is the domain of <a href=\"http:\/\/dxr.mozilla.org\/\">DXR<\/a>, Mozilla&#8217;s tool for doing structured queries, free-text searches, and even trigram-accelerated regex matching on large projects like Firefox.<\/p>\n<p>Of course, it&#8217;s a software engineering truism that providing speed at a moment&#8217;s notice exacts a price in pre-computation, and DXR is no exception. Every night, we run the entire mozilla-central codebase through the clang compiler, injecting a custom plugin which sees what the compiler sees and writes it all down in a database that can dish out fast answers later.<\/p>\n<p>Except when things go awry.<\/p>\n<p>During the Mozilla Summit, DXR had a conveniently timed series of failed indexing runs. A bit of digging revealed that, while the mozilla-central compilation was going off without a hitch, a run of the source through our custom C++ tokenizer was exploding in a later phase.<\/p>\n<p>Wait\u2014custom C++ tokenizer?!<\/p>\n<p>This worn but dutiful little fossil harkens back to DXR&#8217;s pre-clang days. In the early Cretaceous, when gcc ruled the earth, we didn&#8217;t have an easy framework for compiler plugins; we had to get by on the clever application of heuristics. 
But, as the millennia wore on and the clang ecosystem evolved, the uses of the custom tokenizer eroded, until its only remaining purpose was to find <code>#include<\/code> directives so we could guess where they pointed\u2014which we got wrong half the time anyway. It was time to toss that strategy in a tarpit.<\/p>\n<p>And so, after a little compiler plugin tinkering, I&#8217;m pleased to announce that DXR now resolves all includes simply by lifting the correct answer out of clang. Before, we would often throw up our hands when an include named a file without a totally unique name (which happened a <i>lot<\/i>). Now, with only a few exceptions for weird macro corner cases, we successfully link all non-generated, tree-dwelling includes. And, of course, we lay the maintenance burden of tokenizing C++ squarely on the compiler&#8217;s shoulders, where it belongs.<\/p>\n<p>Want to join us in hacking on compiler plugins, with a generous dollop of Python back-end code? Pitch in at <a href=\"https:\/\/wiki.mozilla.org\/DXR\">https:\/\/wiki.mozilla.org\/DXR<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Wait\u2014a custom C++ tokenizer?! What&#8217;s that doing in there? 
<a class=\"go\" href=\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/\">Continue reading<\/a><\/p>\n","protected":false},"author":213,"featured_media":3502,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[311],"tags":[20275,91],"coauthors":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>One C++ Tokenizer Too Many: A DXR Story - Mozilla Web Development<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Erik Rose\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/\",\"url\":\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/\",\"name\":\"One C++ Tokenizer Too Many: A DXR Story - Mozilla Web 
Development\",\"isPartOf\":{\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.mozilla.org\/webdev\/files\/2013\/06\/Screen-Shot-2013-06-13-at-12.12.17-.png\",\"datePublished\":\"2013-11-14T19:50:14+00:00\",\"dateModified\":\"2013-11-14T19:32:01+00:00\",\"author\":{\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/#\/schema\/person\/e99c85edf86c46b46e5284384d5a7c12\"},\"breadcrumb\":{\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#primaryimage\",\"url\":\"https:\/\/blog.mozilla.org\/webdev\/files\/2013\/06\/Screen-Shot-2013-06-13-at-12.12.17-.png\",\"contentUrl\":\"https:\/\/blog.mozilla.org\/webdev\/files\/2013\/06\/Screen-Shot-2013-06-13-at-12.12.17-.png\",\"width\":273,\"height\":209},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.mozilla.org\/webdev\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"One C++ Tokenizer Too Many: A DXR Story\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/#website\",\"url\":\"https:\/\/blog.mozilla.org\/webdev\/\",\"name\":\"Mozilla Web Development\",\"description\":\"For make benefit of glorious 
tubes\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.mozilla.org\/webdev\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/#\/schema\/person\/e99c85edf86c46b46e5284384d5a7c12\",\"name\":\"Erik Rose\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/webdev\/#\/schema\/person\/image\/1c7953cea7e690a9e31cf08a4d68d829\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/73bfa51d6f44afed026160b59299faf2?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/73bfa51d6f44afed026160b59299faf2?s=96&d=mm&r=g\",\"caption\":\"Erik Rose\"},\"description\":\"Erik chips away at the barrier between human cognition and machine execution, through projects like DXR (search &amp; static analysis on Mozilla codebases), Fathom (semantic extraction from web pages), parsers, new languages, and a whole mess of Python libraries.\",\"sameAs\":[\"https:\/\/www.grinchcentral.com\/\",\"https:\/\/x.com\/ErikRose\"],\"url\":\"https:\/\/blog.mozilla.org\/webdev\/author\/erosemozilla-com\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"One C++ Tokenizer Too Many: A DXR Story - Mozilla Web Development","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/","twitter_misc":{"Written by":"Erik Rose","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/","url":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/","name":"One C++ Tokenizer Too Many: A DXR Story - Mozilla Web Development","isPartOf":{"@id":"https:\/\/blog.mozilla.org\/webdev\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#primaryimage"},"image":{"@id":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.mozilla.org\/webdev\/files\/2013\/06\/Screen-Shot-2013-06-13-at-12.12.17-.png","datePublished":"2013-11-14T19:50:14+00:00","dateModified":"2013-11-14T19:32:01+00:00","author":{"@id":"https:\/\/blog.mozilla.org\/webdev\/#\/schema\/person\/e99c85edf86c46b46e5284384d5a7c12"},"breadcrumb":{"@id":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#primaryimage","url":"https:\/\/blog.mozilla.org\/webdev\/files\/2013\/06\/Screen-Shot-2013-06-13-at-12.12.17-.png","contentUrl":"https:\/\/blog.mozilla.org\/webdev\/files\/2013\/06\/Screen-Shot-2013-06-13-at-12.12.17-.png","width":273,"height":209},{"@type":"BreadcrumbList","@id":"https:\/\/blog.mozilla.org\/webdev\/2013\/11\/14\/one-c-tokenizer-too-many-a-dxr-story\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.mozilla.org\/webdev\/"},{"@type":"ListItem","position":2,"name":"One C++ Tokenizer Too Many: A DXR 
Story"}]},{"@type":"WebSite","@id":"https:\/\/blog.mozilla.org\/webdev\/#website","url":"https:\/\/blog.mozilla.org\/webdev\/","name":"Mozilla Web Development","description":"For make benefit of glorious tubes","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.mozilla.org\/webdev\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/blog.mozilla.org\/webdev\/#\/schema\/person\/e99c85edf86c46b46e5284384d5a7c12","name":"Erik Rose","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/webdev\/#\/schema\/person\/image\/1c7953cea7e690a9e31cf08a4d68d829","url":"https:\/\/secure.gravatar.com\/avatar\/73bfa51d6f44afed026160b59299faf2?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/73bfa51d6f44afed026160b59299faf2?s=96&d=mm&r=g","caption":"Erik Rose"},"description":"Erik chips away at the barrier between human cognition and machine execution, through projects like DXR (search &amp; static analysis on Mozilla codebases), Fathom (semantic extraction from web pages), parsers, new languages, and a whole mess of Python 
libraries.","sameAs":["https:\/\/www.grinchcentral.com\/","https:\/\/x.com\/ErikRose"],"url":"https:\/\/blog.mozilla.org\/webdev\/author\/erosemozilla-com\/"}]}},"jetpack_featured_media_url":"https:\/\/blog.mozilla.org\/webdev\/files\/2013\/06\/Screen-Shot-2013-06-13-at-12.12.17-.png","_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/posts\/3624"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/users\/213"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/comments?post=3624"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/posts\/3624\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/media\/3502"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/media?parent=3624"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/categories?post=3624"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/tags?post=3624"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/webdev\/wp-json\/wp\/v2\/coauthors?post=3624"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}