DXR gets faster hardware, VCS integration, and snazzier indexing

DXR is Mozilla’s fast, full-featured code search tool for doing structured queries, free-text searches, and regex matching on huge codebases, like Firefox’s. Since my last post, we’ve completely replaced the production hardware, integrated with VCSs, and made scads of indexing improvements. The last quarter’s highlights include…

  • New stage and production hardware, with our own dedicated build box so we can support multiple codebases
  • Nightly updates from the mozilla-central tree, for both stage and prod
  • No more hours of downtime when a mozilla-central build fails
  • Blame, log, diff, and raw-file links in the sidebar
  • Much better JS syntax highlighting
  • Linkifying much more C++ template stuff: references to class templates, template params, and base classes of class templates when the base class is also a template and dependent on the type params of the derived class template (phew!)
  • Searching for namespaces and namespace aliases
  • Better finding of C definitions and declarations
  • Clang 3.3 support

Tell us what you want.

Now we want to hear from you. We’ve got a big, juicy fourth quarter coming up, and we want to make you happy. What would you, as a current or potential DXR user, like to see happen? Some ideas already high on our list are…

  • Indexing multiple trees, like Aurora and comm-central, eventually targeting the full list from MXR
  • A UI refit that will resolve client-side bugs, improve consistency, and add power. Take a look at the wireframes!
  • Structured search for JS: finding function definitions and calls, variable refs, and so on

What are your highest priorities? What would help you hack better on Mozilla code today? Leave a comment about what’s most important to you, whether it’s in the above list or not, and we will build our Q4 goals based on what you say.

Also, I’ll be manning a table at the Innovation Fair at the Mozilla Summit in Santa Clara. Stop by and make DXR wishes in person!

Finally, thanks to the people who make DXR possible: fubar, for all his ops work; and jcranmer, abbeyj, nrc, Bruce Stephens, jonasac, and nicolaisi, who keep the patches rolling in faster than I can review them!

12 responses

  1. Blair McBride wrote:

    Firstly: AWESOME.

    Secondly: Is there any documentation for the URL parameters?

  2. Nicholas Nethercote wrote:

    I wish you didn’t have to delimit regexps when you enter them in the “regexp:” box.

    Also, if you type “one two” (without the quotes) into the main search box, you get (I think?) hits only from files that contain both “one” and “two”, which I find astonishing. (Someone had to explain this to me, because I had no idea what it was doing.) I can’t imagine ever wanting to do a search like that, but I often want to search for an exact phrase like “one two”.

  3. David Blewett wrote:

    Would love it if DXR could index Python code.

  4. Erik Rose wrote:

    Blair: Do you really mean URL parameters? If so, the first is “tree”, which specifies the codebase to search inside. At the moment, this is always “mozilla-central”. The other is “q”, which holds the search query. For example, “callers:GetBounds” gets straightforwardly urlencoded into “q=callers%3AGetBounds”. The URL format is considered a private API at the moment; I have loose plans to replace it with a versioned REST API that returns results as JSON rather than piles of markup. Were you thinking of calling DXR from another tool?
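To make the encoding step concrete, here is a minimal Python sketch of how a query string with those two parameters could be assembled. Only the parameter names “tree” and “q” come from the description above; the helper name `build_search_query` is made up for illustration, and since the URL format is described as private, the exact shape may change.

```python
from urllib.parse import urlencode


def build_search_query(tree, query):
    """Return the query-string part of a DXR-style search URL.

    Hypothetical helper: only the "tree" and "q" parameter names are
    taken from the comment above.
    """
    # urlencode percent-encodes reserved characters, so ":" becomes "%3A".
    return urlencode({"tree": tree, "q": query})


query_string = build_search_query("mozilla-central", "callers:GetBounds")
print(query_string)  # tree=mozilla-central&q=callers%3AGetBounds
```

A location-bar keyword search would do the same substitution, with the browser encoding whatever is typed in place of the query.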

    I suppose you actually meant to ask about the query syntax: “callers:”, “var:”, “function:”, etc. Many of these are listed in the Advanced Search panel, which shows (usually) when you press Return in the query field (a blooper that’ll go away in the UI refit). I’m embarrassed to say that there is no comprehensive list of them outside the source code at the moment, but you can see them at https://github.com/mozilla/dxr/blob/master/dxr/query.py#L14, and they’re not too hard to figure out. I actually have a half-finished document explaining the semantics of querying, which got put on hold when I discovered a lot of nonsensical things about query behavior; see Nicholas’s comment. We’ll have to do something about that soon, which will probably coincide with a new, non-SQLite backend.

  5. Erik Rose wrote:

    Nicholas: Yep, the regex thing is silly and will go away, along with the rest of the silly UI bloopers, in the UI refresh. Schalk is building it all back up from scratch with modern libs and practices, and we’ve got a first pull req already laid down on a “ui” branch. We’ll soon be continuously deploying that branch to our staging site at dxr.allizom.org so you can give feedback.

    The “one two” thing is indeed astonishing. I had to read the source for an afternoon before I figured out what’s going on. It selects all the files that contain both words and then shows any lines from them that contain either. Pretty bizarre. The one nice side effect is that it’s great for finding long phrases from multi-line comments, but everyone seems to agree that a per-line search would be less astonishing. Would you agree? This will probably entail switching from SQLite to elasticsearch, which should give us a performance boost as well. (If we really liked the multi-line comment behavior, we could even tweak the indexer of each language to merge multi-line comments or adjacent lexer-concatenated strings into the equivalent of a single line.)
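The difference between the two behaviors can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not DXR’s actual implementation (which runs against SQLite); the function names and sample data are invented for the example.

```python
def file_level_search(files, terms):
    """Behavior as described above: pick files containing ALL terms,
    then show every line in those files containing ANY term."""
    hits = []
    for path, text in files.items():
        if all(term in text for term in terms):
            for number, line in enumerate(text.splitlines(), start=1):
                if any(term in line for term in terms):
                    hits.append((path, number, line))
    return hits


def per_line_search(files, terms):
    """Proposed alternative: show only lines containing ALL terms."""
    hits = []
    for path, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if all(term in line for term in terms):
                hits.append((path, number, line))
    return hits


# Invented sample data.
files = {
    "a.txt": "one fish\ntwo fish\none two three",
    "b.txt": "one only\nnothing here",
}

print(file_level_search(files, ["one", "two"]))  # all three lines of a.txt
print(per_line_search(files, ["one", "two"]))    # only line 3 of a.txt
```

The file-level version returns lines that mention just one of the words, which is the surprising part; the per-line version returns only the line where both appear.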

    Is it fair to say that it’s the usability failures that bug you most? If so, I can broaden your vote to all the UI confusions that are buzzing around in my head and in the issue tracker.

  6. patrick mcmanus wrote:

    dxr is cool – I hesitate to comment – I don’t want to be a stop energy hater 🙂 but in the spirit of “what does dxr need to do to replace mxr in your workflow?”:

    dxr seems to lack case insensitive text searches – which is my usual lazy way of telling mxr to figure out what I mean and I’m generally happy with what mxr does with that.

    the fact that dxr apparently considers the url scheme to be a private API is also a shortcoming for me compared to mxr – a location bar custom keyword is my normal workflow for searching mxr (and that’s anecdotally true for many others).

  7. Erik Rose wrote:

    No stop energy taken! Yep, it definitely needs case-insensitive search. Coming soon.

    You know what? I forgot about custom keyword searches, though I use them myself all the time for other sites. Thanks! I’ve been trying to decide whether, as we get a more discoverable UI (https://wiki.mozilla.org/DXR_UI_Refresh#Basic_Search), to keep a text-only query representation. This is a big reason in favor. Hmm, back to the old wireframe drawing board… 🙂

  8. patrick mcmanus wrote:

    thanks erik! Oh, and https:// on the hosting please !

  9. Jonas Finnemann Jensen wrote:

    As the former intern who worked on DXR last year and who this year started using DXR for work as a full-time employee, I have to agree that search should have been single-line. Also, you can probably blame me for most things bad 🙂

    Now, when I designed the query language it was not designed to be private, I guess it’s just my fault for not documenting it. And yes, they are not very easily discovered, for example My
    Having used these, I’ve come to realize that many of the keywords are too long and not super well selected.
    But I have found it very useful to combine parameters like regexp:, function:, ext:, path:, “phrase search”, including the use of – for negation.
    (I think at some point query parameters were prefixed with + for case-sensitive search)

    My point is, I love the new design, but I hope search parameters aren’t going away: rename them, make single-line search the default… And definitely come up with a better syntax for entering regular expressions; I stole the current syntax from vim and sed. It might be smart if it was documented properly.
    (Note, I’ve found it very useful to combine regexp: with path: and ext:, but then again, I’m also the only one who knows how they work)

    By the way, I’m not familiar with elasticsearch, but substring search is very different from full text search…
    Regular expressions are not hard to support, if you just have substring search.
    Anyways, I would suggest you prototype with elasticsearch before saying it’s faster 🙂
    It’s totally plausible that it is faster, but the entire database fits in memory, so if you want to distribute it and throw more hardware at it, that should be easy…

    Also I’m pretty sure trilite could be updated to do single line search only. Either by adding each line as a separate document or by changing the code. Maybe multiline should be available for phrase searches with \n, and regular expressions with a special parameter like regexp-multiline:
    (Okay, I’ll never be good at choosing concise parameter names)

    Anyways, I would be interested to hear how/if elasticsearch works out. Again, I would suggest trying it out first; it feels like it’s aimed at big data, and DXR can hardly be called big data. Once the sqlite data is in memory, I think it’ll be hard to beat with Java, maybe if line search is done incorrectly with trilite, but trilite is pretty tight. As far as I can remember I couldn’t beat trilite with postgresql, which also has trigram support.
    – Okay, to be fair I wrote trilite, so don’t take my word for it; put it to the test 🙂
    (I’d love to hear about the results)

  10. capella wrote:

    Tracking messages created / passed / acted on through the code by searching for “Reader:Add” (quotes included on purpose) generates false positive results that include “Reader:Added” … a different text string / target message …

    MXR behaves correctly here … Is there a different DXR way or is regex required in this situation?

  11. Erik Rose wrote:

    capella: It’s finding Reader:Added because Reader:Add is a substring of it. If you want to exclude those, you can use the word-break sequence in a regex:

    regexp:/Reader:Add\b/

    I’m not sure it’ll match properly at the end of a line, but you can twiddle it further.
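As a rough check of how the word-break sequence behaves, here is a small sketch using Python’s `re` module. DXR’s own regex engine may differ in details, so treat this as an illustration of `\b` in general rather than a guarantee about DXR; the sample lines are invented.

```python
import re

# \b matches between a word character and a non-word character,
# or between a word character and the start or end of the string.
pattern = re.compile(r"Reader:Add\b")

# Invented sample lines.
lines = [
    'sendMessage("Reader:Add")',    # followed by a quote: matches
    'sendMessage("Reader:Added")',  # followed by "e": no match
    "Reader:Add",                   # at end of line: matches in Python
]

matches = [line for line in lines if pattern.search(line)]
print(matches)
```

In Python, at least, the pattern does match at the end of a line, and the “Reader:Added” line is excluded.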

    Cheers!

  12. Erik Rose wrote:

    Jonas and I talked at the Summit. Here’s a summary for anybody else following this thread:

    To clarify, I wasn’t talking about the query language being private; that would be silly. Rather, the DXR URLs aren’t well-specified yet and shouldn’t be depended on too much. We’ll replace them with something well-documented and which returns JSON rather than a big pile of HTML, so people can write additional tools on top of it. And obviously, we’d benchmark before changing anything as foundational as the index implementation.

    We’ve put together extensive plans for the future at https://wiki.mozilla.org/DXR_UI_Refresh#Plans_And_Priorities. Feel free to participate.