DXR gets more correct, less case-sensitive

Erik Rose


DXR, Mozilla’s evolving codebase search engine, has been taking patches at a furious rate these last two months. A great deal of work has gone into a UI refit, still in progress, which will improve discoverability, consistency, and power. Meanwhile, we have kept pushing more immediately enjoyable enhancements into production.

Cleaning Out The Pipes

One of these is a complete rewrite of our HTML generation pipeline. DXR pulls metadata about code from a number of disparate sources: syntax coloring from Pygments, locations of functions and macros from clang, links to Bugzilla bugs from a glorified regex. It then encodes those as begin-end pairs of text offsets, which it stitches together to make the final markup. However, the stitching was previously handled by a teetering state machine, stuffed into a single monolithic function with zero test coverage, replete with terrible mystery. As it turned out, it had been generating grossly invalid markup for some time. Fortunately, modern browsers are equally replete with terrible mystery and managed to make some semblance of sense out of things like </a></a></a>.

But now that’s gone away. The rewrite brings…

  • Correct markup
  • Support for line-spanning regions, as for multi-line comments or strings
  • Support for Windows line-endings (of which we did have a few in mozilla-central)
  • Full test coverage
  • And, perhaps most importantly in the long term, it modernizes our plugin contract by supporting annotation regions which overlap. This lets us enjoy truly decoupled plugins which no longer have to care if they’re used with others that emit overlapping regions. We can add plugins that support more languages and more types of analysis without having to worry about whether they’ll play nicely with the existing ecosystem. It also makes development of plugins outside the DXR codebase more practical.

    Other Improvements

    Other user-visible improvements include…

  • Case-insensitive searching for plain text. This is now the default.
  • Exposing values of constants using tooltips
  • Results now show in alphabetical order by path rather than in random order, so you can rule out entire directory trees more easily.
  • Searching for Layers.cpp:45 takes you straight to that line of the file.
  • Lexing .h files as C++ rather than C means we now highlight all those pesky C++ keywords.
  • We now syntax-color preprocessor directives in JS.
  • We’ve introduced override and overridden queries.
  • No more “l” in line-number anchors means no more mistaking them for “1”.
  • Fixed an off-by-one in line annotation position.
  • No longer consider uninitialized struct or class members to be var refs.
  • Support non-UTF-8 encodings of source files.
  • Distinguish identically named functions in different anonymous namespaces.
Thanks to James Abbatiello for lots of analysis improvements, Nick Cameron for the handy line-number search and syntax coloring, jonasac for several great fixes, and Schalk Neethling for a huge amount of work toward getting the UI refit out the door. If you’d like to join the DXR hacking community, we’ve got a nice ramp-up paved out for you and some easy bugs tagged.

    New UI Teasers

    As for the upcoming UI refit, there’s plenty in store:

  • A natural integration of the now fairly disjoint browse and search modes
  • Easy discoverability of all 26-or-so search filters: no more figuring them out through hearsay or by spelunking through the code
  • No more unpredictability of interface elements like the Advanced Search panel
  • First-class support for multiple trees, to be followed by more actual trees
  • A real query parser. You can express quotation marks without resorting to regexes, and you can use quoted strings as arguments to filters.
  • No more astonishing, disappearing splash page
    Check out our mockups and our sometimes-broken staging site, and do keep the feedback coming. All of the above work was motivated by the comments you’ve already given us.

    Happy hacking!

    November/December Accomplishments of the WebProd Team

    Chris More

    Wow. It has been a really busy second half of Q4 for the Web Productions team! I wanted to share some recent accomplishments from the team and what we are up to next.

    Recent Key Accomplishments

    Snapshot of Upcoming Projects

    Want More?

    Enjoy the rest of 2013 and see you all in 2014!

    Improving Mozilla.org user experience for people all over the world

    Kohei Yoshino

    Every day, thousands of people visit Mozilla.org from across the globe. As the hub site of the global Mozilla community, it has a variety of content including the Firefox download page which has been localized into 80+ languages — like Firefox itself — thanks to the tremendous contributions of volunteer localizers.

    I can remember Mozilla.org in its early days — a developer-oriented, documentation-centric, geeky site. Some community members were translating those documents into their languages (I myself translated hundreds of docs into Japanese), but their work was done outside of the official site. Over time, Mozilla.org has become one of the most popular multilingual consumer sites on the Internet.

    Mozilla.org is still under active development, though. With the increasing number of translations, how can visitors from around the world make the most of this content? In this article I’ll briefly explain some great new features I have contributed over the past few months that may lead to a better user experience for people who speak different languages.

    Language Switcher

    Mozilla.org is migrating from the original PHP-based site to the new Python-based robust platform called Bedrock. The legacy site had a language switcher at the bottom of each page, but there was a problem: the switcher showed all supported languages regardless of whether the current page was actually localized. When a user chose Français from the list but the page wasn’t translated into French, they were taken back to the original English page. That behavior confused them because the page just reloaded without changing anything.

    The new language switcher on Bedrock only shows the languages each page is actually localized into, and it works as expected. The number of languages will continue to increase as localizers add new translations. (Bug 773371)

    We believe we can improve the language switcher even further. A simple dropdown menu is accessible but difficult to use when the list becomes too long. Also, my recent research showed that the experience of switching languages varies among the Mozilla properties. It’s obvious we need a better solution, like the Tabzilla universal site navigation widget. (Bug 919869)

    Search Engine Optimization

    The Language Switcher is not a collection of links but rather a dropdown menu using <select>, so it cannot tell search engines about our localization. Googlebot and others are smart enough to mechanically submit such a simple form, but they definitely prefer a better way to crawl.

    We have implemented alternate URLs to solve the issue with just a little more HTML. Visit the home page and hit Ctrl+U (or Cmd+U) to view the source, and you’ll find a list of <link rel="alternate" hreflang="x"> elements. Search engines will recognize the list and show a localized page, if available, in their search results based on the searcher’s language. (Bug 481550)
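For illustration, such a list looks something like this (the locales and URLs here are made up; view source on the live site for the real list):

```html
<!-- Alternate URLs for a page localized into French and German.
     Illustrative only; the live page lists many more locales. -->
<link rel="canonical" href="https://www.mozilla.org/en-US/firefox/" />
<link rel="alternate" hreflang="fr" href="https://www.mozilla.org/fr/firefox/" />
<link rel="alternate" hreflang="de" href="https://www.mozilla.org/de/firefox/" />
```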

    We’ll also soon serve comprehensive XML sitemaps with the alternate URLs as part of our SEO efforts. (Bug 906882)

    Translation Bar

    This is the latest cool addition to Mozilla.org. You might have seen similar functionality if you have installed Google Toolbar or the lovely Chrome browser. It may ask you if you’d like to translate a foreign-language page into your language with Google Translate. While it’s useful in general, the quality of the translation largely depends on the language. For example, Japanese, my mother tongue, is one of the most difficult languages for machine translation. Here at Mozilla.org, we can enjoy the pages manually translated by native localizers, so why not offer our visitors the nicely localized page? That was the motivation for the new Translation Bar.

    The implementation was straightforward. As described above, we already have the alternate, localized URLs in the page source. A script compares the browser locale (navigator.language) against the list, then shows the bar if a translation is available in that language. If the user selects Yes, please, they will be promptly redirected to the localized page. If the user selects No, thanks, sessionStorage will remember the preference and the bar will stay hidden for the rest of the browsing session.

    Of course, the labels on the bar are also localized. Visit a localized page to give it a try! The Translation Bar has just been deployed on Mozilla.org and other Mozilla sites may adopt it soon. (Bug 906943)

    Beyond Translation

    As a Japanese Web developer and localizer, I do know localization is not just translation. Each language and country has a different culture, customs and perspectives. Under the Mozilla mission, the Web Production team is working hard to deliver a great experience for everyone. The ongoing challenges include localized news and promotions on the home page, better fonts for multibyte characters, layout improvements for RTL languages, and more. I’m very glad to help the team.

    Mozilla is a lively, global, successful open-source community, and Mozilla.org is not merely a corporate site. You can contribute in many ways, like me. Do you speak a language other than English? Be one of the awesome localizers! Did you find any bugs on Mozilla.org or do you have any feedback on how to improve the site? Let us know via Bugzilla! Are you a Web developer with knowledge of HTML, CSS, JavaScript, or Python? Fork the GitHub repository, browse the bugs and send us pull requests!

    Tracking Deploys in Git Log

    Mike Cooper

    Knowing what is going on with git across many environments can be hard. In particular, it can be hard to know where the server environments are in the git history, and how the rest of the world relates to that. I’ve set up a couple of interlocking gears of tooling that help me know what’s going on.


    One thing that I love about GitHub is its network view, which gives a nice high-level overview of branches and forks in a project. One thing I don’t like about it is that it only shows what is on GitHub, and it is a bit light on details. So I did some hunting, and I found a set of git log options that does a pretty good job of replicating GitHub’s network view.

    $ git log --graph --all --decorate

    I have this aliased to git net. Let’s break it down:
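Setting up that alias takes one git config command (shown here with --global so it applies to every repo; drop the flag to scope it to one repo):

```shell
# Define "git net" as shorthand for the network-view command above.
git config --global alias.net "log --graph --all --decorate"
```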

    • git log – This shows the history of commits.
    • --graph – This adds lines between commits showing merging, branching, and
      all the rest of the non-linearity git allows in history.
    • --all – This shows all refs in your repo, instead of only your current branch.
    • --decorate – This shows the name of each ref next to each commit, like
      “origin/master” or “upstream/master”.

    This isn’t that novel, but it is really nice. I often get asked what tool I’m using for this when I pull this up where other people can see it.

    Cron Jobs

    Having all the extra detail in my view of git’s history is nice, but it doesn’t help if I can only see what is on my laptop. I generally know what I’ve committed (on a good day), so the real goal here is to see what is in all of my remotes.

    In practice, I only have this done for my main day-job project, so the update script is specific to that project. It could be expanded to all my git repos, but I haven’t done that. To pull this off, I have this line in my crontab:

    */10 * * * * python2 /home/mythmon/src/kitsune/scripts/update-git.py

    I’ll get to the details of this script in the next section, but the important part is that it runs git fetch --all for the repo in question. To run this from a cron job, I had to switch all my remotes from ssh to the https protocol for git, since my SSH keys aren’t unlocked. Git knows the passwords to my https remotes thanks to its gnome-keychain integration, so this all works without user interaction.

    This has the result of keeping git up to date on what refs exist in the world. I have my teammates’ repos as remotes, as well as our central master. This makes it easier for me to see what is going on in the world.

    Deployment Refs

    The last bit of information I wanted to see in my local network is the state of deployment on our servers. We have three environments that run our code, and knowing what I’m about to deploy is really useful. If you look in the screenshot above, you’ll notice a couple of refs that are likely unfamiliar: deployed/stage and deployed/prod, in green. This is the second part of the update-git.py script I mentioned above.

    As part of the SUMO deploy process, we put a file on the server that contains the current git sha. This script reads that file and creates local refs in my git repo that correspond to those shas.
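In outline, such a script can look like this sketch (written in Python 3 for brevity; the URLs, paths, and ref names are illustrative, not SUMO's actual configuration):

```python
# Sketch of an update-git.py: fetch every remote, then record each
# server's deployed sha as a local git ref. URLs, paths, and ref
# names are illustrative placeholders.
import subprocess
import urllib.request

REPO = "/home/mythmon/src/kitsune"

# Hypothetical endpoints where the deploy process publishes the sha.
ENVIRONMENTS = {
    "refs/heads/deployed/stage": "https://stage.example.com/media/revision.txt",
    "refs/heads/deployed/prod": "https://prod.example.com/media/revision.txt",
}

def update_ref_command(ref, sha):
    """The git invocation that points `ref` at `sha`."""
    return ["git", "update-ref", ref, sha]

def fetch_deployed_sha(url):
    """Read the sha the server publishes for its current deploy."""
    return urllib.request.urlopen(url).read().decode().strip()

def main():
    # Keep local knowledge of every remote fresh.
    subprocess.check_call(["git", "fetch", "--all", "--quiet"], cwd=REPO)
    # Turn each environment's deployed sha into a local ref.
    for ref, url in ENVIRONMENTS.items():
        sha = fetch_deployed_sha(url)
        subprocess.check_call(update_ref_command(ref, sha), cwd=REPO)

if __name__ == "__main__":
    main()
```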

    Aside: What’s a git ref?

    A git ref is anything that has a commit sha. So master is a ref. So
    are any other branches you create. Git also tracks remote content in
    the same way, in refs under refs/remotes.

    In short, a git ref is a generalization of tags and branches, both
    remote and local. It is how git keeps track of things with names, and
    it is what is written on the graph when --decorate is
    passed to log.

    Wait, creating git refs out of thin air? Yeah. This is a cool trick my friend Jordan Evans taught me about git. Since git’s references are just files on the file system, you can make new ones easily. For example, in any git repo, the file .git/refs/heads/master contains a commit sha, which is how git knows where your master branch is. You could make new refs by editing these files manually, creating files and overwriting them to manipulate git. That’s a little messy, though. Instead, we should use git’s tools to do this.

    Git provides git update-ref to manipulate refs. For example, to make my deployment refs, I run something like git update-ref refs/heads/deployed/prod 895e1e5ae. The last argument can be any sort of commit reference, including HEAD or branch names. If the ref doesn’t exist, it will be created, and if you want to delete a ref, you can add -d. Cool stuff.

    All Together Now

    Now, finally, the entire script. Here I am using a git helper that I wrote, which I have omitted for space. It works how you would expect, translating git.log('some-branch', all=True) to git log --all some-branch. I made a gist of it for the curious.

    The basic strategy is to fetch all remotes, then add or update the refs for the various server environments using git update-ref. This runs from cron every few minutes, and it makes knowing what is going on a little easier, and git in a distributed team a little nicer.

    That’s It

    The general idea is really easy:

    1. Fetch remotes often.
    2. Write down deployment shas.
    3. Actually look at it all.

    The fact that it requires a little bit of cleverness and a bit of git magic along the way means it took some time to figure out. I think it was well worth it, though.

    Originally from mythmon.com

    One C++ Tokenizer Too Many: A DXR Story

    Erik Rose

    When your codebase is 2GB, grep doesn’t cut it anymore. It’s slow, and, in such a large corpus, many attempts to find a symbol get drowned out by false positives. Even modern IDEs begin to choke under the load. This is the domain of DXR, Mozilla’s tool for doing structured queries, free-text searches, and even trigram-accelerated regex matching on large projects like Firefox.

    Of course, it’s a software engineering truism that providing speed at a moment’s notice exacts a price in pre-computation, and DXR is no exception. Every night, we run the entire mozilla-central codebase through the clang compiler, injecting a custom plugin which sees what the compiler sees and writes it all down in a database that can dish out fast answers later.

    Except when things go awry.

    During the Mozilla Summit, DXR had a conveniently timed series of failed indexing runs. A bit of digging revealed that, while the mozilla-central compilation was going off without a hitch, a run of the source through our custom C++ tokenizer was exploding in a later phase.

    Wait—custom C++ tokenizer?!

    This worn but dutiful little fossil harkens back to DXR’s pre-clang days. In the early Cretaceous, when gcc ruled the earth, we didn’t have an easy framework for compiler plugins; we had to get by on the clever application of heuristics. But, as the millennia wore on and the clang ecosystem evolved, the uses of the custom tokenizer eroded, until its only remaining purpose was to find #include directives so we could guess where they pointed—which we got wrong half the time anyway. It was time to toss that strategy in a tarpit.

    And so, after a little compiler plugin tinkering, I’m pleased to announce that DXR now resolves all includes simply by lifting the correct answer out of clang. Before, we would often throw up our hands when including a file without a totally unique name (which happened a lot). Now, with only a few exceptions for weird macro corner cases, we successfully link all non-generated, tree-dwelling includes. And, of course, we lay the maintenance burden of tokenizing C++ squarely on the compiler’s shoulders, where it belongs.

    Want to join us in hacking on compiler plugins, with a generous dollop of Python back-end code? Pitch in at https://wiki.mozilla.org/DXR.

    Updates from the Web Productions team

    Chris More

    Recent Key Accomplishments

    Snapshot of Upcoming Projects

    Interesting team statistic

    • Mozilla has roughly 50 websites, and the Web Productions team manages and develops 8 websites that account for 73% of all Mozilla web traffic!

    More Info

    Have any questions? Chat with us in IRC in #www or #webprod.

    Full-text search in Air Mozilla with PostgreSQL

    Peter Bengtsson


    In a previous post I explained why and how we migrated Air Mozilla to use PostgreSQL as the default database. We did this so we can leverage PostgreSQL’s powerful full-text search feature.

    First off, on a tangent we go… Why not use the popular and powerful full-text search engine Elasticsearch? Surely, since it’s built on top of Apache Lucene, it’s bound to have some amazing full-text search and indexing features. I’m sure it does, but we don’t need them.

    All we want to do is find records whose title, description or short_description contain certain words sharing the same stem. We also want highlighting so we can display a neat search results page with the matches emphasized (something that isn’t easy to do with regular expressions in Python once the results come back).

    PostgreSQL can do all of that and it’s fast. Very fast! By far, the biggest win of using the same database we already connect the Django ORM to is that we simply don’t have to worry about indexing. Like, at all. All you do is set this up as a migration:
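Something along these lines (a sketch, not the actual Air Mozilla migration; the table name main_event is an assumption, and the South Migration class that would run this SQL is omitted):

```python
# Sketch: generate the raw CREATE INDEX statements such a migration
# would execute. "main_event" is an assumed table name; each index is
# a GIN expression index over to_tsvector for one searchable column.
COLUMNS = ["title", "short_description", "description"]

def fulltext_index_sql(table, column, config="english"):
    """CREATE INDEX statement for a to_tsvector expression index."""
    return (
        "CREATE INDEX {table}_{column}_{config}_idx "
        "ON {table} USING gin(to_tsvector('{config}', {column}));"
    ).format(table=table, column=column, config=config)

for column in COLUMNS:
    print(fulltext_index_sql("main_event", column))
```

Swapping 'english' for another PostgreSQL text-search configuration yields the per-language indexes mentioned below.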

    At the moment Air Mozilla only has English content, but some day there might be more languages. How to add indexes for different languages is pretty clear; you run the same migration as above with different languages named.

    That means that any inserts, updates or deletions automatically update the full-text index for these columns in the database. We don’t have to worry about this at all, at any point in the ORM code. It just works!

    Now, let’s explain how the search works. A user types in a search query. E.g. “community”.

    What we want to do is to return an ORM QuerySet that:

    • contains all events that the user is allowed to see, depending on privacy or publishing-workflow criteria, and
    • whose title or short_description or description contains the search term.

    And we want it to be ranked so that matches in the title count “higher” than matches in the short_description or description. So let’s add that to the filtering:
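A rough sketch of how the where-part can look with the Django ORM’s .extra() (the column names follow the post; the Event model and its privacy filtering are placeholders, not the real Air Mozilla code):

```python
# Sketch: build the SQL condition for the full-text "where" part,
# passed to the Django ORM via .extra(where=..., params=...).
# The Event model and privacy criteria are not reproduced here.
SEARCH_COLUMNS = ["title", "short_description", "description"]

def fulltext_where(columns, config="english"):
    """OR together a tsvector match for each searchable column."""
    return " OR ".join(
        "to_tsvector('{config}', {col}) @@ plainto_tsquery('{config}', %s)".format(
            config=config, col=col
        )
        for col in columns
    )

# In the view, roughly:
#   qs = Event.objects.filter(<privacy/publishing criteria>)
#   qs = qs.extra(where=[fulltext_where(SEARCH_COLUMNS)],
#                 params=[search_term] * len(SEARCH_COLUMNS))
```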

    Now, that satisfies the “where part”. Next, we need to do something about the ranking, so we extend the code with this:
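Roughly like this sketch, with illustrative weights (1.0 for title, 0.5 for the description; the real code’s weighting may differ):

```python
# Sketch: a rank expression weighting title matches above description
# matches, added via .extra(select=...). The 1.0/0.5 weights are
# illustrative placeholders.
def rank_select(config="english"):
    return {
        "rank": (
            "ts_rank(to_tsvector('{c}', title), plainto_tsquery('{c}', %s)) * 1.0"
            " + ts_rank(to_tsvector('{c}', description),"
            " plainto_tsquery('{c}', %s)) * 0.5"
        ).format(c=config)
    }

# qs = qs.extra(select=rank_select(),
#               select_params=[search_term, search_term]).order_by("-rank")
```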

    Last but not least, we want to let PostgreSQL work out the highlighting of matches so you can show extracts on the search result page with the matched words emphasized. So you extend select with some more code to look like this:
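A sketch using PostgreSQL’s ts_headline (the StartSel/StopSel markup options are illustrative; the annotation names match those used below):

```python
# Sketch: ask PostgreSQL for highlighted extracts via ts_headline,
# exposed as extra annotations on each result row.
def highlight_select(config="english"):
    options = "StartSel=<b>, StopSel=</b>"
    template = "ts_headline('{c}', {col}, plainto_tsquery('{c}', %s), '{o}')"
    return {
        "title_highlit": template.format(c=config, col="title", o=options),
        "desc_highlit": template.format(c=config, col="description", o=options),
    }

# qs = qs.extra(select=highlight_select(), select_params=[term, term])
```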

    And there you have it. Note that PostgreSQL inserts HTML markup into these title_highlit and desc_highlit extra annotations, and it also escapes away any previous HTML, so they’re safe to display in raw form in the Django template code. So it can look like this in the search results template:
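A sketch of the results template (the events variable and surrounding markup are illustrative; title_highlit and desc_highlit are the extra annotations just described):

```html
{# Illustrative search-results loop; |safe is fine because
   PostgreSQL already escaped any pre-existing HTML. #}
{% for event in events %}
  <h3>{{ event.title_highlit|safe }}</h3>
  <p>{{ event.desc_highlit|safe }}</p>
{% endfor %}
```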

    In plain PostgreSQL SQL there are actually ways to “combine” the rank calculation with the “where criteria” so that you don’t have to do both the rank calculation and the where operation separately. However that’s way out of scope for the Django ORM API and even though it’s possible to achieve, the code will quickly get messy.

    So, how long does it take to do this query? On my laptop, with a snapshot of the production database containing over 600 events, that big query takes 30-35 milliseconds. That’s fast enough.

    Migrating Air Mozilla from MySQL to PostgreSQL

    Peter Bengtsson

    Before we dig into the how let’s take a look at the why.

    From the beginning, Air Mozilla has been a straightforward Django project that uses the ORM without requiring any database-specific features. It didn’t really matter what database you used. Here at Mozilla, we currently prefer MySQL because we have a rock solid and mature infrastructure set up around running it. (Thank you database team!)

    This week we launched full-text search in Air Mozilla. Here’s an example search. PostgreSQL supports very powerful features specifically for full-text search, including stemming, highlighting, ranking and custom dictionaries. (Note: MySQL has full-text search indexing too as of MySQL 5.6 but it does not yet support stemming or highlighting).

    So how did we migrate the database? In short, this tool: py-mysql2pgsql. What the tool does internally is that it connects to both MySQL and PostgreSQL and reads one table and record at a time to convert over to PostgreSQL. You can check out the code on github.com/philipsoutham/py-mysql2pgsql/.

    To run it, all you have to do is fill in the connection details into a YAML file for both MySQL and PostgreSQL and you should now have a working “clone”.
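The config file looks roughly like this (hostnames and credentials are placeholders, and the exact keys may differ; check the py-mysql2pgsql README for the authoritative format):

```yaml
# Illustrative py-mysql2pgsql config; values are placeholders.
mysql:
  hostname: mysql1.db.example.com
  port: 3306
  username: airmozilla
  password: secret
  database: airmozilla
destination:
  postgres:
    hostname: pg1.db.example.com
    port: 5432
    username: airmozilla
    password: secret
    database: airmozilla
```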

    There was one caveat though that irked me. MySQL does not support timestamps with time zone and PostgreSQL does. Django can work around this by applying the time zone with a Django settings variable. By having the time zone information in the database, we don’t have to fake the time zone information any more. It’s also bound to be more performant because you’ve moved the conversion to aware date times nearer to the database. To make this change, we wrote a simple conversion script that you can afterwards throw into PostgreSQL. You can see the rest of the instructions for the migration here.
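As a sketch of such a conversion script (the table and column names are placeholders, not Air Mozilla’s schema; the USING … AT TIME ZONE 'UTC' clause tells PostgreSQL to interpret the existing naive values as UTC):

```python
# Sketch: emit ALTER TABLE statements converting naive timestamp
# columns to "timestamp with time zone". Table/column pairs are
# illustrative placeholders.
NAIVE_DATETIME_COLUMNS = [
    ("main_event", "start_time"),
    ("main_event", "created"),
]

def to_timestamptz_sql(table, column, assume_tz="UTC"):
    """ALTER statement reinterpreting stored values in assume_tz."""
    return (
        "ALTER TABLE {t} ALTER COLUMN {c} TYPE timestamp with time zone "
        "USING {c} AT TIME ZONE '{tz}';"
    ).format(t=table, c=column, tz=assume_tz)

for table, column in NAIVE_DATETIME_COLUMNS:
    print(to_timestamptz_sql(table, column))
```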

    And here’s a little bonus: the time it took to actually run the migration was approximately 10-15 seconds to migrate over 25,000 rows across 42 tables. That’s connecting to a MySQL and a PostgreSQL in two different locations within the same data center.

    In a follow-up post I will try to explain more about how we do the full-text search in PostgreSQL with Django.

    Beer and Tell – September 2013 Edition

    Michael Kelly

    Gather ’round, children. Your distant cousin mkelly is going to share a tale of excitement and mystery, of heroes and villains, of action and adventure.

    That’s right, it’s the Beer and Tell Recap! You can also check out the wiki page or the recording.


    Simon Wex and Robert Richter from the Mozilla Foundation presented Appmaker, an experiment into whether we can make it fun for non-developers to quickly make working apps. Widgets send “mail” to other widgets, which triggers them to do stuff like display cat pictures or take a photo. You can track development of Appmaker on their GitHub repo.


    My own Beer and Tell project is diecast, a grunt-init template for single-page frontend apps. diecast sets you up with Grunt commands for building and publishing your site to Github Pages, and includes require.js for JavaScript module loading and LESS for CSS preprocessing. It also uses Bower for downloading JavaScript libraries that you want to use. It’s a cornucopia of JavaScript buzzwords!


    Brian Brennan showed us nodeschool.io, a community of learning focused on node.js. It’s centered around terminal-based challenges where you write code to solve realistic problems. Planned improvements include user accounts and Open Badges support.

    OpenCL in the Browser

    Scott Michaud presented a super-secret project (no links, sorry) that demoed a software renderer powered by OpenCL code running in Firefox. Presumably he was using the WebCL prototype plugin released by Nokia Research recently. Check it out!

    Dennis Dubstep

    While he was absent and thus didn’t present, Will Kahn-Greene still added a screenshot of SUMO localized to dubstep on the wiki. This was achieved by using a dubstep locale in dennis, a set of localization tools that can, among other things, help you test out how your site looks with excessively large strings that may come from certain languages.


    Matt Basta told us about Crass, a CSS minifier that actually parses CSS instead of applying textual transformations, as most minifiers do. This allows it to perform optimizations that most other minifiers can’t, such as reordering properties and transforming values. Crass can also pretty-print the parsed CSS, and it is written in both Python and JavaScript.


    Basta also shared Panopticon (app is down as of writing), which is sort of like Skype and IRC combined. You join a room, select a user in that room, and get to see an animated GIF captured from their webcam, live. There’s a Github repository as well, for those who are interested or confused.

    Updated Nunjucks Documentation

    James Long demoed an upcoming update to the documentation for Nunjucks, a jinja2-inspired templating system for JavaScript. Keep an eye out for the update, which includes a full overview of the template language and API!


    Finally, Michael Cooper presented BATCH, a short programming game written in JavaScript. The main purpose of the game was to see if he could write a Beer and Tell project starting at noon on the day of the event. I highly recommend that other people try this as well, if for nothing else than to stuff the Beer and Tell project list for next month.

    See y’all next month!

    Where Mozilla.org and Firefox Intersect

    Holly Habstritt Gaal

    written with Chris More and Jennifer Bertsch

    Engagement’s Web Productions team has grown from a small team of technical project managers to a multidisciplinary web development team that initiates projects, uses metrics and qualitative testing to learn from our users, and takes an iterative approach to web development. We’ve had a chance to see the influence that this growth, paired with the collaboration of teams across Mozilla, has had on our work. We can be proactive instead of reactive, share our knowledge and tools, and facilitate design process and collaboration. We would like to reintroduce ourselves to Mozilla and the UX community, expose where our team intersects with the user experience of our products, and invite you to collaborate with us.

    Not your typical web product
    Mozilla.org does not exist solely for marketing our products. It is unlike most websites in that, to support our products and users when they need us, we must stay in line with the roadmaps and release cycles of our products. For example, there are many touch points with our users that take place on Mozilla.org, some of which are part of the onboarding process. Onboarding is more than just downloading our products; it extends to the first “unboxing” experience, updating Firefox, and sharing helpful information about new product features. This ultimately contributes to retention and an understanding of our Firefox and Mozilla brands.

    In our user tests, we’ve found that users are more likely to respond positively when they have an expectation of both when and how a message is delivered. For Mozilla.org and our products, this expectation can be set by previewing an upcoming feature or new design on Mozilla.org for Firefox users. It can also be handled with a consistent pattern for how we present content updates, notifications, and new features across all of our products. The WebProd team doesn’t accomplish this alone and this is one example of how our users can benefit by our teams staying connected.

    Staying connected: the intersection of our roles and roadmaps
    What I’ve found at Mozilla is that separate teams are often working on similar challenges and share common goals. Collaborating across teams has been a great way to meet and learn from each other and is key to addressing our intersecting issues efficiently while ultimately creating a better end product. Most recently we have worked with SUMO to stay better aligned on presenting Firefox help messaging and we have also been collaborating on a cross-team effort to improve the First Run and Update experiences.

    At Mozilla we are all part of the chain of reactions that results in what our users experience. The WebProd team has been keeping the following in mind to better support Firefox users:

    • Align our websites to product roadmaps so we can offer support to our end users
    • Optimize user onboarding flows
    • Work in parallel with Engagement and Product teams’ goals
    • Ensure website content is localized in many languages
    • Complete the migration of all legacy pages to Bedrock and our Sandstone theme, which is responsive by nature
    • Support users on any device or operating system
    • Evaluate > Test > Improve

    A significant way we can all support our users is to recognize the intersections between our teams at Mozilla and the overlapping initiatives in our roadmaps. If the WebProd team can collaborate with you to create a better experience for our users, don’t hesitate to reach out to us.

    We’re easy to find!