Mochitests are now Valgrind-clean

One of the things I’ve been doing these past few months is running Mochitests on Valgrind on a semi-regular basis — roughly weekly — and trying to drive the resultant error count down to zero. I did this some years back, in the approach to Firefox 4, but the situation has drifted since then.

Anyway, I am pleased to say that Mochitests on 64-bit Linux is once again Valgrind/Memcheck clean. That is, free from invalid memory accesses and uses of undefined values. This involved fixing around 25 bugs, mostly to do with uninitialised values. They fall into three categories:

  • Forgetting to initialise a field in a constructor. Pretty simple once you figure out which constructor is involved.
  • Forgetting to write values to an out-parameter — an ever-popular failure mode. Typically a called function intends to return a value through a pointer to a caller-supplied buffer, and also returns an success/failure indication. But there’s some path through it which returns “success” without writing to the buffer.
  • More complex sequencing errors, in which an object is allocated, and only partly initialised, in the expectation that other fields will be set up later, but before use, using an Init() method or some such. But the sequencing doesn’t quite work out, and Init() is called only after those fields are read.

The majority of the bugs are not to do with memory accesses in invalid places. I suspect most invalid accesses have already been picked up by our ASan runs.

The severity of the bugs varies. Any bug picked up by Valgrind might have serious consequences, as a result of making the computation depend on unknown values, either undefined ones, or ones taken from unknown places in memory. Some of the bugs seem fairly harmless.  Others looked like they could cause user visible weirdness, for example decisions along the lines of “is the download rate sufficient to support this video resolution, or should we request a lower resolution?” At least one would have been an obvious crash had the uninitialised value pulled out of memory been suitably unfavourable.

So, is it truly clean? Well, not entirely, for various reasons.  For one thing this is dynamic testing, of course, so it only covers scenarios that the test set exercises.

Secondly, the high level of C++ churn in the tree means there’s a constant, if low, influx of new memory errors, so we can only really get to “approaching zero” detectable errors. Doing this on the Aurora or Beta branches might help, but that doesn’t fit with our “mozilla-central first” development model.

Thirdly, this tests core Gecko code and the Linux specifics. It unfortunately doesn’t cover Mac specifics, Windows specifics or FirefoxOS specifics, although the latter is in progress.

One noticeable thing, if you play the run-Valgrind-and-fix-what-you-see game long enough, is that we can drive the observable memory errors in our code down to zero. That is, in anything that we build from source. But we can’t do anything about the various proprietary binary blobs we have to live with. Eventually we arrive at a situation where — at least so it seems — our code is cleaner than some of the other libraries it relies on.

Valgrinding Mochitests is quicker than it used to be. Back in the Firefox 4 days, it took about 14 CPU hours. Now it completes in about 5 hours. This is due to a combination of faster hardware (Intel Haswell), Valgrind generally being faster and scaling better for large processes, and having advanced to the point where it’s feasible to test “gcc -O2” compiled code with a near-zero false error rate. It may also be that Gecko is simply faster, giving less work for Valgrind to do.

I’m beginning to run Valgrind tests of FirefoxOS startup on phones, in particular on the Flame. Given how resource-hungry Valgrind is, doing so is something of a challenge, requiring dodging timeouts, memory limits, sandboxes and proprietary driver blobs. But it is just about doable, and various bits of FirefoxOS-specific badness are in the process of getting cleaned up.

From a personal standpoint, this is very satisfying. Memcheck was originally conceived as a tool for finding bugs you didn’t yet know you had, rather than as a post-crash diagnostic tool. So it’s great to see it being used to remove a whole class of bugs from our tree.

We also have toolery (TSan, Helgrind) to find low level data races.  Nathan Froyd and Christian Holler have made good progress in using them to detect races and in getting folks to fix those races, as tracked by metabug 929478. I look forward to the day when we can say that Mochitests is also free from detectable data races.

3 responses

  1. mccr8 wrote on :

    This is great!

  2. Rhys wrote on :

    Well done. That’s a great outcome and endorsement.
    Presumably there were a few V fixes or feature enhancements that might have been picked up through these latest runs, i.e. benefits flowing back in V?
    Can you share the background to some of these?

  3. RyanVM wrote on :

    The ATeam has been working this quarter to define what a Tier 2 platform would look like in our CI. It’s still my hope that we’ll progress to the point where we can run mochitests-on-Valgrind on a semi-regular basis in automation, even if per-push isn’t practical.