Finding data races in the JS engine

Some background

Back in March last year I spent some time using Valgrind’s Helgrind tool
to look for data races in the browser.  Data races happen when two
threads access the same piece of memory without any form of
synchronisation (either locking or the use of atomic operations), and at
least one of the accesses is a write.  That can lead to all manner of
data structure corruption and crashes.  What’s really bad is that such
bugs are often nearly impossible to reproduce, because they are timing
dependent.  They therefore fall into the Very Scary Bugs category, and
are something we really want to get rid of by any means possible.
Before release!

Helgrind is a runtime analysis tool that looks for such races.  It is far
from perfect, but better than nothing.  For those familiar with these
things, it’s a pure happens-before race detector capable of showing full
stacks for both memory accesses involved in a race.  It also checks for
lock ordering inconsistencies (a.k.a. potential deadlocks) and various
misuses of the POSIX pthreads API.

So what happened with Helgrinding the browser?  In short, I didn’t get
far.  Mostly I just got greyer hair.  I had hoped to get the browser
Helgrind-clean, in the same way it is now pretty much Memcheck-clean,
but the effort got mired in difficulties:

  • Race-checking is resource-intensive, more so than the memory checking
    that Valgrind (Memcheck) is normally used for.  A critical data
    structure in Helgrind (vector timestamps) turned out not to work well
    at the scale demanded for full browser runs, and they became
    infeasibly slow and memory hungry.
  • Happens-before race detection has the advantage of not giving false
    positives. But the downside is it is scheduling-sensitive, so
    identical runs sometimes report different subsets of the races that
    are really present. Add to that the nondeterminism of the browser and
    the slowness of Helgrind, and I had a major repeatability problem.
  • The browser has a number of deliberate or apparently-harmless races.
    Making sense of the race reports required examining bits of source
    code all over the tree that I’d never seen before and didn’t
    understand.  This proved to be difficult and time consuming.

So I left it at that, resolving one day to come back and fix the vector
timestamp representations, so as to at least avoid the resource
problems.

Nothing happened for some months.  Then, in December, Paul Biggar wrote
a nice self-contained threaded Javascript test case, bug 619595.

And I thought: hmm, maybe I should Helgrindify just the threaded jsshell
running Paul’s test.  The standalone JS engine is smaller and more
tractable than the browser, and I knew that Jason Orendorff had
successfully used Helgrind on it earlier in the year.  Also, there’s
been a vast amount of JS engine hackery in the past year, with
particular emphasis on the threading aspects.  And 4.0 is coming up
fast.  So, I thought, now might be a good time to give it a spin.

Preparation

To work properly, Helgrind needs to see all inter-thread synchronisation
events in the program under test.  It can do that without help for
programs which use only the POSIX pthreads API.  But many programs roll
their own synchronisation primitives, and since it can’t see those,
Helgrind reports huge numbers of races which don’t exist.  Both NSPR and
the JS engine do this (eg ThinLocks).  A small amount of markup using
client requests (bug 551155) provides Helgrind with the information it needs.

Paul’s test runs, or, at least, tries to run, all the Sunspider tests in
parallel.  I modified it trivially to make it run up to 10 copies of
each test in parallel.  This stresses Helgrind to the limit of
feasibility on my machine, but shakes out more races.

Then we’re ready to go.

Results

I found a number of races — some expected, some not.  The unintended
and dangerous-looking ones are:

  • allocator for YARR-generated code is not thread safe (bug 587288)
    — potential crasher
  • race on JSContext::defaultCompartmentIsLocked (bug 622691)
    — consequences unknown to me, but doesn’t look correct

Then there are two which are unintended but probably harmless.  In both
cases each thread initialises a shared data structure to some value
which never changes after that, so multiple initialisations are harmless:

  • jsdate.cpp: global “static jsdouble LocalTZA;” is raced
  • nanojit::Assembler::nHints[] is raced.

Then there are races which are intended and, so, presumably harmless, on
the following fields:

  • JSRuntime::gcMallocBytes (also gcBytes, I think)
  • JSRuntime::gcPoke
  • JSRuntime::protoHazardShape
  • JSThreadData::requestDepth
  • JSRuntime::gcIsNeeded

Finally, there’s one I can’t decide about:

  • The GC’s stack scanner races against other functions that touch the
    stack — that is, just about everything.  I don’t know if I expect
    that or not.  I would have thought that if one thread is scanning the
    stack, all the other threads are blocked waiting for it, hence there
    is no race.  So either (1) my understanding is wrong, (2) helgrind
    doesn’t see the inter-thread sync events causing other threads to
    wait, or (3) the stack scanner is borked.  I suspect (1) or (2).

Comments

It’s pleasing to have a list of at least some of the observable races in
the JS engine, since it provides something to cross-check assumed racey
behaviour against.  It’s also good to have found some unintended races
before release.

I’m a little concerned about the intended races, eg JSRuntime::gcPoke.
My sketchy understanding of the C++0x draft standard is that accesses to
shared locations must be mediated either by locking or by machine-level
atomic operations.  All other shared accesses count as races.  C++0x, in
the words of Hans J Boehm, ‘guarantees nothing in the event of a data
race.  Any program allowing a data race produces “undefined behavior”.’

Boehm has good presentation which summarises the proposed C++0x memory
model, at http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf.
See in particular slides 7, 9 and 11.

From a Helgrind-usage point of view, these races are easily suppressed
by adding client requests to specify that the fields in question should
not be race-checked.  But, overall, I still don’t like them: deliberate
races are a hindrance to understandability and to automated checking.

2 responses

  1. Brendan Eich wrote on :

    JSRuntime::gcPoke is synchronized by locking at a much higher level, namely by the “request model”. The GC, which reads gcPoke, runs only after all JS “requests” (concurrent executions of scripts and batches of API calls) have ended or suspended.

    See https://developer.mozilla.org/En/SpiderMonkey/Internals/Thread_Safety for some (slightly out of date) docs.

    /be

  2. jseward wrote on :

    Brendan: I wonder why Helgrind doesn’t see the GC entry/exit
    synchronisation properly, which is the impression I’m getting.
    Something to look into.