MXR Improvements

jakem

Over the last few weeks we’ve been making a number of small changes to the MXR web tool (https://mxr.mozilla.org/), and it seems about time to highlight some of the bigger changes.

Background

MXR is the Mozilla Cross-Reference system. It’s basically a convenient way to search a whole lot of Mozilla code for certain things. I’m sure a proper Mozilla developer could tell you all about how awesome and terrible it is, and what it can and cannot do for you. As a sysadmin, I can tell you it’s a rather intensive, complicated set of scripts that have to handle a lot of different code. It pulls code written in many different languages from multiple revision control systems, and indexes it all. Some code trees are processed more than once a day, the rest are done daily.

As you can imagine, this results in a lot of special cases. MXR breaks it down into basically 3 steps (conveniently split into 3 scripts) – update the tree, generate the cross-reference identifier database, and generate the searchable index.

  1. The first step (update-src.pl) is relatively straightforward. It basically consists of updating the code for whichever tree you pass it as an argument. There are a lot of special cases, since most trees are handled slightly differently from one another, but all in all it’s about what you’d expect. It’s the equivalent of “hg pull” or “svn update”. The only trouble is some trees are not nearly that simple, and involve quite a bit more work to update properly. In fact some trees are actually nested source repos, which must be updated independently.
  2. The second step (update-xref.pl) is where most of the time is spent. The ‘genxref’ script is a giant mess of Perl regexes to detect identifiers (function names, etc.) in various programming languages. You give it a tree, and for every file in that tree it determines the file type and scans it appropriately.
  3. The third step (update-search.pl) is once again relatively simple. Searching is done via “glimpse”, so this is basically running “glimpseindex” on whichever tree you call it with.

The three steps are tied together with an overall calling shell script (update-full-onetree.sh). This is responsible for feeding each of these three scripts the appropriate arguments (the name of the tree), and reporting the overall output.
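For illustration’s sake, the wrapper boils down to something like this (a simplified sketch- the real script also handles option parsing and output reporting, which I’ve omitted here):

#!/bin/sh
# Simplified sketch of update-full-onetree.sh; options and reporting omitted.
TREE="$1"

./update-src.pl    "$TREE"    # step 1: pull the latest code
./update-xref.pl   "$TREE"    # step 2: rebuild the identifier database
./update-search.pl "$TREE"    # step 3: rebuild the glimpse search index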

This calling script is in turn called by another calling script, which calls it for every tree. This is the final link in the chain, and this script goes in cron. There are actually 2 of these- one that cron calls every 4 hours, one that cron calls daily. They’re basically identical except for the list of trees they process.
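In crontab terms, the setup looks roughly like this- the script names, paths, and exact times here are hypothetical, only the every-4-hours/daily split is real:

# Hypothetical script names, paths, and times.
0 */4 * * *    /data/mxr/update-all-4hour.sh    # the frequently-processed trees
30 3 * * *     /data/mxr/update-all-daily.sh    # everything else, daily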

Now that you know how it works, I can tell you about how it’s better now than it was a month ago. :)

Improvements

First things first: the daily script was no longer reporting its output. It was working, but not telling us about it. This was due to a newly-introduced “rsync” job, which was running with the verbose flag. This caused the output to be far larger than it should be, which broke the reporting. I removed the verbose flag, and all is back to normal.

With that out of the way, I could see that the “4-hour” job was taking approximately 2 hours to run, and the daily job as much as 30 hours in some cases. That’ll typically make any sysadmin a bit squeamish, and indeed it is non-ideal. However, these jobs have no built-in parallelism, and this server has multiple CPU cores available… meaning that overlapping runs weren’t normally a big problem.

At this point I wanted to improve the running time directly. Step 2 above is by far the longest-running (and hardest on CPU time), so I started there with the very nice NYTProf Perl profiler. This made it clear that most of the time was being spent in a few of the more complicated regexes used in this script. I was able to make some very small improvements, but ultimately nothing appeared to be severely wrong- they weren’t doing anything terrible, and in fact appeared to have already been tuned by someone who is frankly better at it than I am. I did get some great suggestions on ways to improve the whole process, however, which may yet be implemented some day.
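If you haven’t used it before, NYTProf is trivial to invoke- here’s a quick sketch, where the genxref arguments are a guess on my part:

# Profile one run of the cross-referencer (exact arguments are a guess).
perl -d:NYTProf ./genxref some-tree

# Turn the resulting nytprof.out into a browsable HTML report.
nytprofhtml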

Having quickly scanned over this issue and found nothing of significance, I moved on to “update-search.pl”… the 3rd and final step. I skipped NYTProf here, because the time is quite obviously spent inside “glimpseindex”. I was able to make a few small tweaks here, which I believe may ultimately result in faster searching. Specifically, I upgraded our indexes from the default “tiny” size to the mid-level “small” size. This roughly doubled our disk space used for indexes, but should have provided a nice boost to search performance. This also makes the indexing take longer, but it still ends up being much faster than step 2 above, so I consider it a good trade. Unfortunately, IT doesn’t have a lot of face-time with this app, so it’s hard to judge for myself just how much faster searching really is.
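For the curious, the index size is controlled by a flag to glimpseindex, so the change amounts to something along these lines (the paths are hypothetical, and the exact flags are worth double-checking against the glimpse documentation):

# -o requests the mid-level "small" index instead of the default "tiny" one;
# -H sets the directory the index is written to. Paths are hypothetical.
glimpseindex -o -H /data/mxr/glimpseindex/$TREE /data/mxr/source/$TREE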

Finally, the biggest change- parallelization of MXR jobs. After some time looking around, I couldn’t see any reason why we could not process multiple trees simultaneously. So, I set out to do just that.

I wanted to start simple, just to get something in place to prove that it would work. Thus, I re-worked the 4-hour cron job into a simple loop over the list of 4-hour trees, and had the loop execute 4 tasks at once and background them, then “wait” for completion before continuing. This is extremely simple using nothing more than standard shell semantics:

PARALLEL_JOBS=4
count=0
for i in $TREES; do
    echo -n "Starting $i at "; date
    # Kick off this tree's update in the background, at minimum priority.
    nice -n 19 ./update-full-onetree.sh -cron $i &
    let count+=1
    # After every batch of $PARALLEL_JOBS jobs, wait for the whole batch.
    [[ $((count%PARALLEL_JOBS)) -eq 0 ]] && wait
done
wait    # catch the final, possibly partial, batch

This works beautifully, for what it is. The problem is that it doesn’t always keep 4 jobs running. It starts 4 jobs, waits for all 4 to complete, then starts 4 more. This results in large gaps where we would ideally like to be running another tree, but instead just sit and wait. If the 4 jobs started at the same time all take about the same amount of time to run, it’s pretty good. But if one job takes an hour and the other 3 take only 5 minutes, you’ve got a lot of idle time. Effectively, in some cases it’s not much better than serial execution.

Still, this was enough to bring the running time on the 4-hour job down from 2 hours to about 45 minutes. A very nice win. I knew I would have to come back to this later.

You may notice this doesn’t actually change the amount of work being done, it just crams it all into a smaller amount of time. While that’s absolutely correct, I believe it still results in a performance win for searching- these update jobs are rather disk-intensive, and tend to obliterate the cache. By grouping them up, we effectively obliterate the cache for a shorter amount of time, meaning the *rest* of the time should benefit from somewhat less turbulent disk caching. Basically, longer periods of “good” and shorter periods of “bad”.

What I really wanted to do though was to actually do less work during an update cycle. One simple way to get this is to detect whether anything has changed since the last execution, and if nothing has, to skip as much as possible.

Clearly, we still need to do step 1, or we won’t actually know if anything has changed. Once that’s out of the way we can make a determination. Theoretically it should be possible to do this very efficiently… *if* the tree’s revision control system will tell you that. Unfortunately, this is a brick wall for a generic solution. I could write up specific checks for each individual tree, but I really wanted to have something more easily maintainable- I already knew it was a pain to do this, simply by working on the script for step 1. In the end, I settled for a simple “find” command, to look for any files modified more recently than the last time the tree was processed. It’s a lot more overhead, but “works every time”. The overhead is actually negligible compared to the runtime of step 2 and 3 anyway, so it actually ends up not mattering too much.
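The resulting check looks something like this sketch- the stamp-file path is made up, and the real find command also excludes certain files we don’t care about:

# Keep a per-tree stamp file, touched after each successful run. GNU find's
# -quit bails out at the first match, so this is just a cheap
# changed/unchanged test. The stamp path is hypothetical.
STAMP="/data/mxr/stamps/$TREE.last-run"

if [ -n "$(find "/data/mxr/source/$TREE" -newer "$STAMP" -print -quit)" ]; then
    ./update-xref.pl   "$TREE"    # step 2
    ./update-search.pl "$TREE"    # step 3
fi
touch "$STAMP"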

This results in a significant reduction in IOPS and CPU cycles consumed. It should also cause far less cache destruction, as most of the files are never actually read. Of course there are still some caching problems caused by this “find” command, but all in all it should be a vast reduction from actually reading each and every file.

With this in place, I was really starting to feel the inefficiency from the simple shell-based parallelization scheme. When a tree doesn’t get changed, its runtime gets very short- but this is a worst-case for that algorithm, so it devolves almost back into serial execution! Of course the overall runtime is still pretty good (30-45min, depending on which trees have or haven’t changed), but the efficiency is getting worse- there’s more dead time in between jobs. So I took the next step.

GNU Parallel to the Rescue

If you haven’t heard of GNU Parallel, I highly recommend it. There is an exceptionally good tutorial video on using parallel. Suffice it to say, I rewrote the above loop to look like this:

echo "$TREES" | parallel -j+0 'echo -n "{}: Starting " && date && nice -n 19 ./update-full-onetree.sh -cron {}; echo -n "{}: Ending " && date'

I’ll admit it’s not nearly as pretty to look at, but it has one very nice improvement- it will make sure that there is always the right number of jobs executing (the -j+0 flag means “run one job per CPU core”). If a job finishes, it will start the next one right away.

With this in place, the 4-hour job is now reduced to a maximum of 30 minutes… and a minimum of eight. Compare that to the original time of about 2 hours, every time, and you can easily see this is a massive improvement.

I’m making the same set of changes to the ‘daily’ job, and although I expect some great things, it unfortunately has one significant weakness: one of the trees alone takes about 17 hours to process, and gets pretty constant usage. So while this will largely eliminate the situation where the job can ‘overrun’ itself, and will allow the vast majority of the trees to complete very quickly, the overall job will still be bounded by this one tree.

Future Improvements

There are quite a few places where MXR could be improved further, but unfortunately much of it may be beyond my capabilities and I’ll need to seek some outside assistance.

  • Update step 2 to be multi-threaded internally. This will speed up jobs like the 17-hour monster.
  • Update step 2 to intelligently skip files within a tree if they haven’t changed since last processed. This would cut down on the per-job runtime significantly. As you can imagine, most files don’t change every day- changes are concentrated into only a very small percentage of files. If we can quickly skip the others, we can save a whole lot of wasted effort. The trick will be to do this without relying on a revision-control system, since there’s so little uniformity in that area.
  • Update step 3 to be more intelligent about generating indexes. This will take some playing with “glimpseindex”, but the short version is I believe we are discarding the previous index every time and starting from scratch. Obviously this is non-ideal.
  • Move MXR to a better home. It should be quite possible for the web app to live on a separate machine from the backend processing, which should help with caching. For that matter the web frontend could likely be a normal load-balanced cluster, and gain improved query time through horizontal scaling. Even the processing could be split up across multiple machines- GNU parallel in fact makes this incredibly straightforward, as the sketch after this list shows.
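On that last point, here is a quick sketch of what spreading the trees across machines with GNU parallel could look like. The hostnames are hypothetical, and it assumes each host has the MXR scripts and source trees in place:

# -S spreads the jobs across the listed hosts over ssh. Hostnames are
# hypothetical; each host needs the scripts and trees available locally.
echo "$TREES" | parallel -S mxr1,mxr2,mxr3 ./update-full-onetree.sh -cron {}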

New Trees Soon!

Lastly, there are at least 2 trees I’m looking very closely at adding in the very near future. I don’t want to spoil the surprise, but both have been requested more than once, and have very good reasons for being added. Having dug into MXR quite a bit recently, I’m in a much better position now to do this than in the past. Look for an update on this soon. :)

11 responses

  1. Tony Mechelynck wrote:

    Update step 2 to intelligently skip files within a tree if they haven’t changed since last processed. […] The trick will be to do this without relying on a revision-control system, since there’s so little uniformity in that area.

    Sounds like a job for the “make” utility. Or did I miss some important info?

    1. jakem wrote:

      I’ve only used ‘make’ in the context of building code, but I could see how it might be usable. I would need to investigate this to see how feasible it really is.

      What I really had in mind was a simple hack to the update-xref.pl script to build in some logic to the main loop(s)… something along the lines of “if (mtime < last-processed-time) then skip”. At that point the only trick is how to keep track of when files were last processed. Lots of ways to do that, just need to pick a good one.

  2. Dirkjan Ochtman wrote:

    Wouldn’t all of this time be better spent working on dxr, making it more feasible to replace MXR with DXR? AFAICT DXR is better in most regards…

    1. mrz wrote:

      Maybe, but MXR is currently in use and folks are depending on it for their work.

  3. Ole Tange wrote:

    If you dislike the one-liner in GNU Parallel consider using ‘sem’ (which is part of GNU Parallel):

    for i in $TREES; do
        echo -n "Starting $i at "; date
        nice -n 19 sem --id myid ./update-full-onetree.sh -cron $i
    done
    sem --wait --id myid

  4. Ole Tange wrote:

    Oops should be:

    nice -n 19 sem -j +0 --id myid ./update-full-onetree.sh -cron $i

  5. Neil Rashbrook wrote:

    Update step 2 to intelligently skip files within a tree if they haven’t changed since last processed. […] The trick will be to do this without relying on a revision-control system, since there’s so little uniformity in that area.

    Doesn’t your find command already output the files that need to be processed?

    1. jakem wrote:

      Not quite… it uses the GNU “-quit” flag, which quits after the first match. So it’s used as a trigger. Of course, it could be trivially altered to do this.

      The problem is that as far as I know there isn’t a good way to process one specific file. There’s a lot of logic coded up around update-xref.pl, and it all expects to be told what *tree* to process. So to do this would probably be non-trivial. That’s basically why I opted for this design. It’s non-ideal, but also very non-invasive to the core MXR scripts. It’s just a bolt-on enhancement.

      The right way to fix this, IMO, is to patch the update-xref.pl script to do the change detection itself. Then we could just always process every tree, and MXR would intelligently skip files that don’t need any work. This is the second bullet point in “Future Improvements”.

  6. Michael Crews wrote:

    Would using inotify or similar for detecting the changes rather than using find be worthwhile, especially for the large codebases?

    1. jakem wrote:

      Possibly, I haven’t investigated this. Thanks for the idea!

      One thing that would need to be taken into consideration is that certain files are excluded by find… that is, we don’t care if they change. I don’t expect that would be an issue, though.

      In most cases though, this “find” algorithm is already so vastly superior to the previous “always process everything” approach that I’m still riding high on that. Baby steps. :)

  7. Ted Mielczarek wrote:

    Thanks for the hard work! We developers use MXR on a daily basis. It may be outdated, crufty technology, but it has to work.