Profiler backend news, 10 April 2013

Here is some news about work on Firefox’s built-in time profiler (SPS), and in particular the work being done to use native stack unwinding by using the in-tree copy of the Breakpad unwinding library.

There has been a lot of activity in the two and a half weeks since we got back from Paris.  Some of that has been fixing fallout from the initial patch (779291), which proved difficult to land.

The immediate goals are to have the new profiling backend available in nightly builds for desktop Linux, Android and B2G.  Currently 64- and 32-bit Linux work.  If you want to give it a try, grab a nightly and follow the instructions shown here.

  • Benoit committed a patch (851748) to remove some of the worst code duplication resulting from the 779291 landing.  He also cleaned up some of the profiler headers (851611).
  • Currently the profiler backend is controlled by environment variables that it reads at startup.  Benoit took a step towards making these configurable from within the GUI (856331).
  • One current problem is that it’s difficult to establish why unwinding gives poor results.  Breakpad sometimes fails to find or read CFI information for a shared object.  But that’s difficult to diagnose because the profiler’s logging facilities are poor.  I improved Breakpad’s logging facilities somewhat (853851, 857242) so we can send diagnostic information to the Android log or Linux console.  An as-yet unactioned item is to extract CFI-coverage statistics from Breakpad (859775) so as to make it obvious where objects lack adequate CFI.
  • Another problem that quickly appeared is that, as a last-ditch measure, Breakpad will try to unwind by stack scanning.  This is inherently imprecise and often adds non-existent frames to the trace, which seriously confuses the profile results; the sketch after this list shows why.  I added a patch (855977) which disables stack scanning by default but allows it to be selectively re-enabled if required.
  • There was some evaluation of unwinding ability on desktop Linux (855466).  64-bit works well; 32-bit works, but not so well, for reasons that are not yet entirely clear.
  • There is ongoing work to allow the profiler to work well on Android nightlies.  The problem is that nightlies use Mike Hommey’s alternative runtime linker (faulty.lib), and that requires some plumbing work to allow Breakpad to read unwind information directly out of the APK file.  Currently we have a proof-of-concept patch (802240) that appears to work, but it requires some revision before landing.
  • There was investigation of a problem causing Firefox to livelock when starting any external program (837390).  This seems to be a bad interaction between the fork() syscall and the profiler signals.  An initial patch was posted.
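
To see why stack scanning misleads, here is a sketch of the general heuristic (my illustration, not Breakpad’s actual code; the names and the text-range test are invented).  Any word on the stack that happens to point into executable code gets reported as a return address, so stale words left over from dead frames show up as phantom callers:

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical bounds of the process's executable text. */
  extern uintptr_t text_start, text_end;

  static int looks_like_return_address(uintptr_t w) {
    return w >= text_start && w < text_end;
  }

  /* Walk stack words from 'sp' up to the stack base, keeping anything
     that could be a return address.  Stale words from dead frames pass
     the test too, hence the phantom frames. */
  size_t scan_stack(const uintptr_t *sp, const uintptr_t *base,
                    uintptr_t *frames, size_t max_frames) {
    size_t n = 0;
    for (const uintptr_t *p = sp; p < base && n < max_frames; p++)
      if (looks_like_return_address(*p))
        frames[n++] = *p;
    return n;
  }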

Ongoing activities include: finishing up support for faulty.lib on Android, fixing the hang-at-fork problem, and trying to get a handle on why Breakpad gives us poorer unwind results than we expect, especially when doing frame-pointer based unwinding.

Valgrind now supports JEMalloc builds directly

Following some sterling hackery by Alexander Potapenko and Philippe Waroquiers, Valgrind/Memcheck now has direct support for JEMalloc, tcmalloc, and any other allocation library offering a malloc-style interface.

This means you no longer need to build Firefox using --disable-jemalloc on Linux.  You do still need --enable-valgrind, to stop V freaking out when JS GC scans the stack, but that’s no big deal.

To use this, you need to give Valgrind the flag --soname-synonyms=somalloc=NONE.  This tells it that the bundle of functions it needs to intercept (malloc, etc) lives in the main executable instead of in libc.
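
For example, a typical invocation looks like this (a sketch; the target binary is illustrative):

  valgrind --soname-synonyms=somalloc=NONE ./firefox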

You need a Valgrind trunk build of 11 May or later (svn r12559).

On MacOSX 10.7, all of this is irrelevant, since V appears to run a normal JEMalloc’d build just fine.

On Android, the same trickery gives support for JEMalloc, but requires some extra hacks which are not in the repo, so you can’t use it as-is.  I also have a bunch of code which checks for cross-DSO mismatches (eg, alloc by JEMalloc, free by libc), but that is also not committed yet.

Other recent developments that you might want to upgrade to a trunk svn build for:

  • Translation chaining — improves performance noticeably in some circumstances, particularly on lower-end hardware and for the lighter-weight tools.
  • AVX support — trunk now contains partial AVX instruction set support, enough to run a Firefox built with “gcc-4.7.0 -mavx -O2”.  Work to give more complete support is ongoing.
  • GDB server — this has been around for a while, but doesn’t seem widely known.  Using it, you can control Valgrind with GDB, examine data when V reports errors, single-step, all the usual stuff.  To use it, start Valgrind with --vgdb-error=0 and follow the directions it prints (see the example after this list).
  • On MacOSX, reduced noise level when running Clang/LLVM-generated code.
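
For instance, a GDB server session looks roughly like this (a sketch; ./myprog stands in for whatever you are debugging):

  valgrind --vgdb-error=0 ./myprog
  # then, in a second terminal:
  gdb ./myprog
  (gdb) target remote | vgdb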

Visualising profiling results by aggregating stack traces

On Android, we have a couple of ways to periodically extract stack traces from a running Firefox: the built-in profiler, and Valgrind.  In principle oprofile could also produce stacks at intervals, although it doesn’t.

Looking at zillions of stacks, or text-only output derived from them, isn’t much fun.  What I’d really like to do is take a large number of stacks sampled at equal cost intervals, and use Josef Weidendorfer’s excellent KCachegrind GUI to visualise the data.  KCachegrind, if you haven’t tried it, makes it easy to examine the dynamic call graph and to see how costs flow between callers and callees.

To do this, the call stacks need to be merged into an approximate dynamic call graph, cycles found and collapsed, and costs propagated back up from leaf nodes.  This produces the approximate inclusive and exclusive costs for each node (function) encountered.
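
For concreteness, here is a minimal sketch of the merging step (my own illustration, not the actual tool; the names are invented).  Rather than propagating costs up an explicit graph, this simplified version charges one unit of inclusive cost to every frame of each sample, one unit of exclusive cost to the leaf, and accumulates caller-to-callee edge counts; cycle collapsing is omitted:

  #include <cstddef>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  struct Cost { long incl = 0; long excl = 0; };

  // Function name -> aggregated costs; caller -> callee -> edge count.
  static std::map<std::string, Cost> nodes;
  static std::map<std::string, std::map<std::string, long>> edges;

  // 'stack' is ordered root..leaf; each sample is one unit of cost.
  void addSample(const std::vector<std::string>& stack) {
    if (stack.empty()) return;
    std::set<std::string> seen;            // charge each function once per
    for (const std::string& fn : stack)    // sample, even if it recurses
      if (seen.insert(fn).second)
        nodes[fn].incl += 1;
    nodes[stack.back()].excl += 1;         // the leaf did the actual work
    for (std::size_t i = 1; i < stack.size(); i++)
      edges[stack[i - 1]][stack[i]] += 1;  // caller -> callee edge
  }

Writing the nodes and edges out in callgrind’s profile format is then enough for KCachegrind to display them.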

So I wrote a program to do this.  It takes Valgrind-style call stacks and produces KCachegrind output files.  It could easily be generalised to other call stack generators — the call stack elements only need to be comparable for equality.

I ran Firefox on Android, used the STR in bug 728846 as a test workload, and obtained about 32000 16-entry stacks, sampled at approximately million-instruction intervals.  Merging them and viewing the results in KCachegrind produces output which at least roughly tallies with the ad-hoc observations listed in the bug report.

There’s clearly room for improvement, particularly in the handling of cycles; the results are also limited by how often the stack unwinder produces an incomplete trace.  Nevertheless it’s an interesting bit of code to have around.  Here’s a snapshot of a bit of the graph leading up to fast_composite_over_8888_0565, which features prominently in the report.

[image: call-graph fragment leading up to fast_composite_over_8888_0565]

And here’s a picture of KCachegrind showing the caller and callee relationships for moz_pixman_image_composite32.

Valgrind for OSX 10.7 (Lion) status update

Various sets of fixes have been committed for Valgrind on Lion.  It now works well enough to run 64-bit Firefox builds and get sane results.  32-bit builds run too, but appear to hang, for threading-related reasons I cannot figure out, despite quite some investigative effort.

There may be some false positives from Memcheck as a result of kludged-up syscall wrappers for some new syscalls that are 10.7-specific.  Let me know if you see errors which you think are obviously bogus.

Feedback is welcome.  If you’re developing on Mac and have migrated to 10.7, I’d be interested to hear if it works for you.  If you’re still on 10.6, I’d be interested to hear if I broke anything :-)  Btw, 10.5 support is pretty much dropped now — is anybody still using 10.5 for development?

The tracking bug report is valgrind bug 275168.

Quick reminder of how to check out and build:

  svn co svn://svn.valgrind.org/valgrind/trunk
  cd trunk
  ./autogen.sh
  ./configure --prefix=`pwd`/Inst
  make -j2
  make -j2 install

Valgrind on Android — Current Status

This is a long post.  Here’s a summary.

  • the good news: Valgrind’s Memcheck tool now works well enough on Nexus S that it can run Firefox and find real bugs, for example bug 688733.  The false error rate from Memcheck is pretty low, certainly low enough to be usable.
  • as of this morning, all required Valgrind fixes are in the Valgrind SVN trunk repo.
  • the bad news: this requires building a custom ROM and kernel for the Nexus S, that is to say, mucho hoop jumping.  Not for the faint-hearted.

The rest of this post explains what the constraints are and approximately what is necessary to get started.

Constraints

The main difficulty is the need to build a custom ROM and kernel.  There are three reasons for this:

  • To have a swap-enabled kernel.  Starting a debuggable build of Firefox on Memcheck gives a process size of around 800MB.  Without swap, it gets OOM-killed at around 270MB.  But the default kernel isn’t configured for swap, hence a rebuild is required, as per Mike Hommey’s instructions.  Once in place, I gave the kernel a 1GB swap file parked in /sdcard, and that seemed to work ok.
  • Libraries with symbols.  Memcheck needs to have symbol information for /system/lib/libc.so and /system/bin/linker (libc and the dynamic linker).  Without these it generates huge numbers of false error reports and is unusable.  Symbols for the other libraries are nice (better stack traces) but not essential.
  • To make it possible to start Firefox on Valgrind.  Valgrind can only insert itself into a process at an exec() transition, that is, when the process starts afresh.  On normal Unixes this is no problem, since the shell from which you start Valgrind does the normal fork-exec thing.  But application start on Android is completely different, and doesn’t involve exec().

Instead, there is a master process called the Zygote.  To start an application (eg Firefox), a message is sent via a socket to the Zygote.  This creates a child with fork(), and the child then goes on to load the relevant bytecode and (presumably) native code and “becomes” Firefox.  So there’s no exec() boundary for Valgrind to enter at.

Fortunately the AOSP folks provided a solution a couple of months back.  They modified Zygote so that it can start selected processes under the control of a user-specified wrapper, which is precisely the hook we need.  The AOSP tree now has this fix.

Overview of getting started

Here’s an overview of the process.  It doesn’t contain enough details to simply copy and paste, but it does give some idea of the hoop jumping that is unfortunately still required.

Download sources and build Android images, as per directions at http://source.android.com/source/building.html.  This in itself is a major exercise.  The relevant “lunch” flavour is full_crespo-eng, I think.  At the end of this stage, you’ll have (amongst other things) libraries with symbols and a wrapper-enabled Zygote.  But not a swap-enabled kernel.

Build a swap-enabled kernel as per Mike Hommey’s instructions, and incorporate it into the images built in the previous stage.  In fact, I skipped this step — Mike kindly did it for me.

Push the images onto the phone, reboot, check it’s still alive.

Check out a copy of the Valgrind trunk from svn://svn.valgrind.org/valgrind/trunk, and build as described in detail in README.android.  If you complete that successfully, you’ll have a working installation of Valgrind on the phone at /data/local/Inst/bin/valgrind.

Install busybox on the phone, to make life a little less austere in the shell.

On the Linux host, generate a 1GB swap file and transfer it to /sdcard on the phone (that’s the only place it will fit; one way to create the file is sketched after the note below).  Then enable swapping:

  cat /proc/swaps
  /data/local/Bin/busybox swapon /sdcard/swapfile1G
  cat /proc/swaps

Note you’ll have to manually re-enable swapping every time the phone is rebooted.
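
For reference, creating and pushing the swap file from the host can be done along these lines (a sketch; dd, mkswap and adb are standard tools, and the file name matches the one used above):

  dd if=/dev/zero of=swapfile1G bs=1M count=1024
  mkswap swapfile1G
  adb push swapfile1G /sdcard/swapfile1G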

Copy the contents of out/target/product/crespo/symbols/system on the host to /sdcard/symbols/system on the phone.  These are the debuginfo objects for the system libraries.  Valgrind expects them to be present, as per the comments above, so it can read symbols for libc.so and /system/bin/linker.  This copies far more than those two, which is not essential but is nice for debugging.

Build a Firefox you want to debug.  That of course means with line number info and with the flags --disable-jemalloc --enable-valgrind.  I strongly suggest “-O -g” as a good compromise between speed and debuggability.  When the build finishes, ask the build system to make an .apk file with the debug info in place, and install it.  The .apk will be huge, about 125MB:

  (cd $objdir && make package PKG_SKIP_STRIP=1)
  adb install -r $objdir/dist/fennec-9.0a1.en-US.eabi-arm.apk

We’re nearly there.  We have a device which is all set up, and a debuggable Firefox on it.  But we need to tell Zygote that we want to start Firefox with a wrapper, namely Valgrind.  In the shell on the phone, do this:

  setprop wrap.org.mozilla.fennec_sewardj "logwrapper /data/local/start_valgrind_fennec"

This tells Zygote that any startup of “org.mozilla.fennec_sewardj” should be done via an exec() of /data/local/start_valgrind_fennec applied to Zygote-specified arguments.  So, now we can put any old thing in a shell script, and Zygote will run it.  Here’s what I have for /data/local/start_valgrind_fennec:

  #!/system/bin/sh
  VGPARAMS='--error-limit=no'
  export TMPDIR=/data/data/org.mozilla.fennec_sewardj
  exec /data/local/Inst/bin/valgrind $VGPARAMS $*

Obviously you can put any Valgrind params you want in VGPARAMS; you get the general idea.  Note that this is ARM, so you don’t need the --smc-check= flag that’s necessary on x86 targets.

Only two more hoops to jump through now.  One question is where the Valgrind output should go.  Initially I tried using Valgrind’s little-known but very useful --log-socket= parameter (see here for details), but it seemed to crash the phone on a regular basis.

So I abandoned that.  By default, Valgrind’s output winds up in the phone’s system log, mixed up with lots of other stuff.  In the end I wound up running the following on the host, which works pretty well:

  adb logcat -c ; adb logcat | grep --line-buffered start_valgrind \
    | sed -u sQ/data/local/start_valgrind_QQg | tee logfile.txt

And finally .. we need to start Firefox.  Now, due to recent changes in how the libraries are packaged for Android, you can’t start it by pressing on the Fennec icon (well, you can, but Valgrind won’t read the debuginfo.)  Instead, issue this command in a shell on the phone:

  am start -a org.mozilla.gecko.DEBUG -n org.mozilla.fennec_sewardj/.App

This requests a “debug intent” startup of Firefox, which sidesteps the fancy dynamic unpacking of libraries into ashmem, and instead does the old style thing of unpacking them into /data/data/org.mozilla.fennec_sewardj.  From there Valgrind can read debuginfo in the normal way.

One minor last hint: run “top -d 2 -s cpu -m 19”.  Because Valgrind runs slowly on the phone, I’m often in the situation of wondering am-I-waiting-for-it or is-it-waiting-for-me?  Running top pretty much answers that question.

And .. so .. it works!  It’s slow, but it appears to be stable, and, crucially, the false error rate from Memcheck is low enough to be usable.

So, what’s next?  Writing this reminded me what a hassle it is to get all the ducks lined up right.  We need to streamline it.  Suggestions welcome!

One thing I’ve been thinking about is to avoid the need to have debuginfo on the target, by allowing Valgrind to query the host somehow.  Another thing I plan to do is make the Callgrind tool work, so we can get profile information too.

 

A low overhead, always-on, system-wide memory event log

This is really an RFC.

It seems to me that many end-user bugzilla reports about excessive memory use share a common structure.  First, the user reports that “I did A, B and C, had a coffee, went back to my machine and found that Firefox was sitting on N gigabytes of memory.”  Then follows lots of discussion along the lines of “oh, so the cycle collector did/didn’t run while Fx was idle”, and “but no, that’s not right, because the tertiary FooBar timer inhibit mechanism will have made the XYZ collector run instead” kind of thing.

This goes on and on, while everybody tries to figure out what Firefox was actually doing in the period leading up to the too-much-memory observation.  Meanwhile the original reporter gets bored with the discussion and moves on to something else, and there’s general confusion, annoyance, and lack of reproducibility all round.

So the idea is simple: make a log file listing events which are known to have a significant impact on memory usage.  Nothing heavyweight, just one line per event, timestamped, plus brief relevant stats.  For example:

  • GC started / ended, total size N, reclaimed M bytes
  • GC mapped in new heap / unmapped heap
  • CC started / ended, total size N, reclaimed M bytes
  • image discard ran
  • jemalloc mmap’d more pages / munmapped some pages
  • nanojit / mjit mapped / unmapped code pages

Perhaps with some indirectly relevant events such as

  • downloaded another chunk of the phishing database
  • no user input seen for 17 minutes
  • new tab created; now there are 23 of them
  • user requested about:memory (+ what it produced)
  • extension Xyzzy loaded/initialised

That way, we’d at least have some information about the space history leading up to a situation where a user says “urk!  massive memory leak!”.

The log would be low overhead, so it could be used in production.  For sure we don’t want to log more than a couple of events per second.  Perhaps a moderate-sized circular buffer (64KB? 1MB?) that could be dumped to disk on request would do.  Then, when someone reports excessive memory use, the first thing we ask for is the log file.
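
To make that concrete, here is a minimal sketch of such a logger (entirely illustrative; the class name, capacity and locking strategy are invented, not an existing Firefox API):

  #include <cstddef>
  #include <cstdio>
  #include <ctime>
  #include <mutex>
  #include <string>
  #include <vector>

  // A moderate-sized circular buffer of timestamped, one-line events.
  class MemEventLog {
    static constexpr std::size_t kCapacity = 4096;
    std::vector<std::string> ring_;
    std::size_t next_ = 0;
    std::mutex mu_;
  public:
    MemEventLog() : ring_(kCapacity) {}
    void log(const std::string& msg) {
      std::lock_guard<std::mutex> g(mu_);
      ring_[next_ % kCapacity] = std::to_string(std::time(nullptr)) + " " + msg;
      next_++;
    }
    // Dump the buffer, oldest entry first, when a report comes in.
    void dumpTo(std::FILE* f) {
      std::lock_guard<std::mutex> g(mu_);
      std::size_t start = next_ > kCapacity ? next_ - kCapacity : 0;
      for (std::size_t i = start; i < next_; i++)
        std::fprintf(f, "%s\n", ring_[i % kCapacity].c_str());
    }
  };

  // Usage, e.g.:  gLog.log("GC ended, total 123MB, reclaimed 45MB");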

Partial implementations of this exist already.  There’s javascript.options.mem.log, for example, and I’m sure other subsystems have their own schemes.  But AFAIK there’s no uniform, system-wide, lightweight, easy-to-use mechanism.  Would one be useful?

MVP: A Memory Value Profiler

I’ve been wondering how much of the data we keep in memory is actually useful, as opposed to being bits which are mostly or entirely constant.  Recently I hacked up a Valgrind tool to get some initial measurements.

Imagine a machine instruction that does a 32-bit memory read.  If the read value is observed only ever to be one of a small set, let’s say, 1, 2 or 3, then we might suspect that most of the bits in the word are unused.  Should we be concerned?  It depends whether the instruction accesses a wide range of addresses: if it does, that might be a clue that we’ve got some inefficient representation going on over a large area of memory.

MVP puts this into practice.  It watches all 1-, 2-, 4- and 8-byte non-stack memory read instructions, observing both the spread of read values and the spread of accessed addresses.  For each monitored instruction, it computes an aggregate wastefulness score.  This is the number of always-constant bits in the values multiplied by (some approximation of) the number of different accessed addresses.
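
The heart of the constant-bit bookkeeping can be sketched in a few lines (my reconstruction, not the tool’s actual code; the sign-bit special-casing mentioned below is omitted).  Any bit that ever differs between two observed values is knocked out of a per-instruction mask:

  #include <stdint.h>

  /* Per-instruction state for one monitored read site. */
  typedef struct {
    uint64_t first;       /* first value seen               */
    uint64_t const_mask;  /* bits never observed to change  */
    int      seen_any;
  } ReadSite;

  void observe_read(ReadSite* s, uint64_t value) {
    if (!s->seen_any) {
      s->first = value;
      s->const_mask = ~(uint64_t)0;  /* all bits provisionally constant */
      s->seen_any = 1;
    } else {
      s->const_mask &= ~(value ^ s->first);  /* clear bits that differed */
    }
  }

  /* wastefulness ~ popcount(const_mask) * (number of distinct
     128-byte sectors touched by this instruction) */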

You can level various kinds of “Bogus!” accusations at it, but it does give some indication of structure and array fields that are not working very hard.  For example, given this:

#define MILLION 1000000
int a[MILLION];

int main(void) {
  int i, sum = 0;
  for (i = 0; i < MILLION; i++)   a[i] = (i & 1) ? 1 : -1;

  // ... later ...
  for (i = 0; i < MILLION; i++)   sum += a[i];
  return sum;
}

it will tell you that the “sum += a[i]” reads are mostly pulling in constant bits:

  wasted 968,781  (31/32 cbs, 31,251 sects)  0x40057E: main (mvp1.c:25)

31 out of the 32 bits read are constant (it has some sophistication about ignoring constant sign bits), and the accesses are spread over 31,251 different “sectors” (128-byte chunks of address space).  Hence this read instruction gets a relatively high wastefulness metric of 968,781.

MVP runs Firefox without difficulty.  Results are interesting.  A couple of high-roller tidbits:

  wasted 1,143,304  (8/32 cbs, 142,913 sects)
     0x6431A7B: fast_composite_src_x888_0565 (pixman-fast-path.c:904)

fast_composite_src_x888_0565 converts image data from 24-bit 8:8:8 format to 16-bit 5:6:5 format, and the 24-bit values are parked in 32-bit words.  The data is spread widely (142,913 sectors) and the top 8 bits of the read data are observed to be constant, as we’d expect.

  wasted 989,696  (16/16 cbs, 61,856 sects) 
     0x652B054: js::NameNode::create(JSAtom*, JSTreeContext*)
     (jsparse.cpp:798)

This has to do with

  pn_dflags = (!tc->topStmt || tc->topStmt->type == STMT_BLOCK)
              ? PND_BLOCKCHILD : 0;

16 constant bits out of 16 implies that tc->topStmt->type (a uint16) has only ever been observed to be STMT_BLOCK here.

The most striking observation from the initial numbers, though, is somewhat obvious: on a 64-bit platform, pointer-intensive structures are space-inefficient.  There are large numbers of 64-bit load instructions that pull in values with 36 or so constant bits.  That would correspond to fetching addresses of objects scattered across a few hundred megabytes of address space.

Perhaps we should work to minimise the number of heap objects.  Maybe what we need is a way to detect object pairs with coupled lifetimes, so they can be merged into a single object, and the connecting pointers removed.

Anyway, enough speculation.  The tool is very much WIP.  If anybody is interested in trying it out, or has comments/suggestions, let me know.

 

A thread-checking toolkit for Firefox

Back in January I blogged about using Helgrind to check for threading errors in Firefox’s JS engine.  That effort was the first step towards a bigger goal, namely to find and remove all unintended data races in the browser proper.

I have been wanting to get to a point where our C++ developers can routinely use Helgrind to check for threading bugs in code, both new and old, in the same way that Valgrind’s Memcheck tool is now widely used to check for memory errors.  For the reasons discussed in my January posting, race checking is more difficult than memory checking.  Now, though, I believe we’re approaching the point where routine Helgrinding is feasible.

I’d like to introduce what amounts to a kit for thread-checking Firefox.  The main resource for this is the MDC page “Debugging Mozilla with Helgrind”.  Here’s a summary.

There are three parts to the kit:

  • A markup patch for the Mozilla code base.  This describes to Helgrind the effect of some synchronisation events it doesn’t understand and stops it complaining about some harmless races in the JS engine.
  • A suppression file that hides error reports in system libraries.
  • A development version of Helgrind.  This contains a bunch of correctness, diagnostic and scalability improvements.  A stock Valgrind installation won’t work.

With this framework in place, I completed a first run through Mochitests with Helgrind.  It took 32 CPU hours.  Around 15 bugs have been filed.  Some of them are now fixed, and others have been declared harmless.  But that’s just a beginning: there are many more uninvestigated reports lurking in the mochitests output.

Have a look at the MDC page for more details, including directions on how to get started.  And, of course, if you want help with any of this, please feel free to contact me.

Profiling the browser’s virtual memory behaviour

We’ve been chipping away at memory use of Firefox 4 for a couple of
months now, with good results.  Recently, though, I’ve been wondering
if we’re measuring the right things.  It seems to me there are two
important things to measure:

  • Maximum virtual address space use for the process.  Why is this
    important?  Because if the process runs out of address space, it’s
    in serious trouble.  Ditto, perhaps worse, if the process uses up
    all the machine’s swap.
  • But the normal case is different: we don’t run out of address space
    or swap.  In this case I don’t care how much memory the browser
    uses.  Really.  When we talk about memory use in the non-OOM
    situation, we’re using that measure as a proxy for responsiveness.
    Excessive memory use isn’t intrinsically bad.  Rather, it’s the side
    effect that’s the problem: it causes paging, both for the browser
    and for everything else running on the machine, slowing
    everything down.

Trying to gauge responsiveness by looking at peak RSS figures strikes
me as a losing prospect.  The RSS values are set by some more-or-less
opaque kernel page discard algorithm, and depend on the behaviour of
all processes in the system, not just Firefox.  Worse, it’s uninformative:
we get no information about which parts of our code base are causing
paging.

So I hacked up a VM profiler.  This tells me the page fault behaviour
when running Firefox using a given amount of real memory.  It isn’t as
big a task as it sounds, since we already have 99.9% of the required
code in place: Valgrind’s Cachegrind tool.  It just required replacing
the cache simulator with a virtual-to-physical address map simulator.

The profiler does a pretty much textbook pseudo-LRU clock algorithm
simulation.  It differentiates between page faults caused by data and
instruction accesses, since these require different fixes — make the
data smaller vs make the code smaller.  It also differentiates between
clean (page unmodified) and dirty (page modified, requires writeback)
faults.
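
For the curious, the core of such a simulation looks roughly like this (a sketch under my own assumptions, not the profiler’s actual code; a real version would index pages with a hash table rather than this linear probe):

  #include <stdint.h>

  #define NFRAMES 25600   /* e.g. 100MB of real memory in 4KB pages */

  typedef struct { uint64_t vpage; int valid, referenced, dirty; } Frame;

  static Frame    frames[NFRAMES];
  static unsigned hand;   /* the clock hand */
  static uint64_t clean_faults, dirty_faults;

  /* Simulate one access to virtual page 'vpage'.  On a miss, sweep
     the clock, clearing reference bits, until a page whose bit is
     already clear is found; that page is the victim. */
  void access_page(uint64_t vpage, int is_write) {
    for (unsigned i = 0; i < NFRAMES; i++) {
      if (frames[i].valid && frames[i].vpage == vpage) {
        frames[i].referenced = 1;
        frames[i].dirty |= is_write;
        return;             /* hit: no fault */
      }
    }
    while (frames[hand].valid && frames[hand].referenced) {
      frames[hand].referenced = 0;
      hand = (hand + 1) % NFRAMES;
    }
    if (frames[hand].valid && frames[hand].dirty)
      dirty_faults++;       /* victim was modified: needs writeback */
    else
      clean_faults++;
    frames[hand].vpage = vpage;
    frames[hand].valid = 1;
    frames[hand].referenced = 1;
    frames[hand].dirty = is_write;
    hand = (hand + 1) % NFRAMES;
  }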

Here are some preliminary results.  Bear in mind the profiler has only
just started to work, so the potential for bogosity is still large.

First question is: we know that 4.0 uses more memory than 3.6.x.  But
does that result in more paging?  I profiled both, loading 5 cad-comic
tabs (http://www.cad-comic.com/cad/random) and idling for a while, for
about 8 billion instructions.  Results, simulating 100MB of real memory:

3.6.x, release build, using jemalloc:

VM I accesses: 8,250,840,547  (3,186 clean faults + 350 dirty faults)
VM D accesses: 3,089,412,941  (5,239 clean faults + 552 dirty faults)

M-C, release build, using jemalloc:

VM I accesses: 8,473,182,041  ( 8,140 clean faults +  4,979 dirty faults)
VM D accesses: 3,372,806,043  (22,720 clean faults + 14,335 dirty faults)

Apparently it does page more.  Most of the paging is due to data
rather than instruction accesses.  Requires further investigation.

Second question is: where does that paging come from?  Are we missing
any easy wins?  From a somewhat longer run with a bigger workload, I
got this:

Da (# data accesses)   Dfc (# clean data faults)   function
------------------------------------------------------------
      18,921,574,436   382,023   PROGRAM TOTALS

          19,339,625    60,583   js::Shape::trace
           2,228,649    51,635   JSCompartment::purge
          32,583,809    22,223   js_TraceScript
          16,306,348    18,404   js::mjit::JITScript::purgePICs
          18,160,249    12,847   js::mjit::JITScript::purgePICs
          52,155,631    11,727   memset
          27,229,391    10,813   js::PropertyTree::sweepShapes
         120,482,308    10,256   js::gc::MarkChildren
         138,049,859     9,134   memcpy
           2,228,649     8,779   JSCompartment::sweep
         179,083,731     8,057   js_TraceObject
           6,269,454     5,949   js::mjit::JITScript::sweepCallICs

18%-ish of the faults come from js::Shape::trace.

And quite a few come from js::mjit::JITScript::purgePICs (two
versions) and js::mjit::JITScript::sweepCallICs.  According to Dave
Anderson and Chris Leary, there might be some opportunity to poke
the code pages in a less jumping-around-y fashion.

Finding data races in the JS engine

Some background

Back in March last year I spent some time using Valgrind’s Helgrind tool
to look for data races in the browser.  Data races happen when two
threads access the same piece of memory without any form of
synchronisation (either locking or the use of atomic operations), and at
least one of the accesses is a write.  That can lead to all manner of
data structure corruption and crashes.  What’s really bad is that such
bugs are often nearly impossible to reproduce, because they are timing
dependent.  They therefore fall into the Very Scary Bugs category, and
are something we really want to get rid of by any means possible.
Before release!

Helgrind is a runtime analysis tool that looks for such races.  It is far
from perfect, but better than nothing.  For those familiar with these
things, it’s a pure happens-before race detector capable of showing full
stacks for both memory accesses involved in a race.  It also checks for
lock ordering inconsistencies (a.k.a. potential deadlocks) and various
misuses of the POSIX pthreads API.

So what happened with Helgrinding the browser?  In short, I didn’t get
far.  Mostly I just got greyer hair.  I had hoped to get the browser
Helgrind-clean, in the same way it is now pretty much Memcheck-clean,
but the effort got mired in difficulties:

  • Race-checking is resource-intensive, more so than the memory checking
    that Valgrind (Memcheck) is normally used for.  A critical data
    structure in Helgrind (vector timestamps) turned out not to work well
    at the scale demanded by full browser runs, becoming infeasibly
    slow and memory-hungry.
  • Happens-before race detection has the advantage of not giving false
    positives. But the downside is it is scheduling-sensitive, so
    identical runs sometimes report different subsets of the races that
    are really present. Add to that the nondeterminism of the browser and
    the slowness of Helgrind, and I had a major repeatability problem.
  • The browser has a number of deliberate or apparently-harmless races.
    Making sense of the race reports required examining bits of source
    code all over the tree that I’d never seen before and didn’t
    understand.  This proved to be difficult and time consuming.

So I left it at that, resolving one day to come back and fix the vector
timestamp representations, so as to at least avoid the resource
problems.

Nothing happened for some months.  Then, in December, Paul Biggar wrote
a nice self-contained threaded Javascript test case, bug 619595.

And I thought: hmm, maybe I should Helgrindify just the threaded jsshell
running Paul’s test.  The standalone JS engine is smaller and more
tractable than the browser, and I knew that Jason Orendorff had
successfully used Helgrind on it earlier in the year.  Also, there’s
been a vast amount of JS engine hackery in the past year, with
particular emphasis on the threading aspects.  And 4.0 is coming up
fast.  So, I thought, now might be a good time to give it a spin.

Preparation

To work properly, Helgrind needs to see all inter-thread synchronisation
events in the program under test.  It can do that without help for
programs which use only the POSIX pthreads API.  But many programs roll
their own synchronisation primitives, and since it can’t see those,
Helgrind reports huge numbers of races which don’t exist.  Both NSPR and
the JS engine do this (eg ThinLocks).  A small amount of markup using
client requests (bug 551155) provides Helgrind with the information it needs.
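
For illustration, the markup boils down to a few client-request macros from valgrind/helgrind.h placed around the custom primitives.  Here is a sketch for a hypothetical hand-rolled spinlock (the lock itself is invented; ANNOTATE_HAPPENS_BEFORE/AFTER are real Helgrind requests):

  #include "valgrind/helgrind.h"

  /* A hypothetical hand-rolled spinlock. */
  typedef struct { volatile int held; } SpinLock;

  void spin_lock(SpinLock* l) {
    while (__sync_lock_test_and_set(&l->held, 1))
      ;
    /* Tell Helgrind this acquire synchronises with the most
       recent release on the same address. */
    ANNOTATE_HAPPENS_AFTER(l);
  }

  void spin_unlock(SpinLock* l) {
    ANNOTATE_HAPPENS_BEFORE(l);
    __sync_lock_release(&l->held);
  }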

Paul’s test runs, or, at least, tries to run, all the Sunspider tests in
parallel.  I modified it trivially to make it run up to 10 copies of
each test in parallel.  This stresses Helgrind to the limit of
feasibility on my machine, but shakes out more races.

Then we’re ready to go.

Results

I found a number of races — some expected, some not.  The unintended
and dangerous-looking ones are:

  • allocator for YARR-generated code is not thread safe (bug 587288)
    — potential crasher
  • race on JSContext::defaultCompartmentIsLocked (bug 622691)
    — consequences unknown to me, but doesn’t look correct

Then there are two which are unintended but probably harmless.  In both
cases each thread initialises a shared data structure to some value
which never changes after that, so multiple initialisations are harmless:

  • jsdate.cpp: global “static jsdouble LocalTZA;” is raced
  • nanojit::Assembler::nHints[] is raced.

Then there are races which are intended and, so, presumably harmless, on
the following fields:

  • JSRuntime::gcMallocBytes (also gcBytes, I think)
  • JSRuntime::gcPoke
  • JSRuntime::protoHazardShape
  • JSThreadData::requestDepth
  • JSRuntime::gcIsNeeded

Finally, there’s one I can’t decide about:

  • The GC’s stack scanner races against other functions that touch the
    stack — that is, just about everything.  I don’t know if I expect
    that or not.  I would have thought that if one thread is scanning the
    stack, all the other threads are blocked waiting for it, hence there
    is no race.  So either (1) my understanding is wrong, (2) helgrind
    doesn’t see the inter-thread sync events causing other threads to
    wait, or (3) the stack scanner is borked.  I suspect (1) or (2).

Comments

It’s pleasing to have a list of at least some of the observable races in
the JS engine, since it provides something to cross-check assumed racey
behaviour against.  It’s also good to have found some unintended races
before release.

I’m a little concerned about the intended races, eg JSRuntime::gcPoke.
My sketchy understanding of the C++0x draft standard is that accesses to
shared locations must be mediated either by locking or by machine-level
atomic operations.  All other shared accesses count as races.  C++0x, in
the words of Hans J Boehm, ‘guarantees nothing in the event of a data
race.  Any program allowing a data race produces “undefined behavior”.’

Boehm has a good presentation which summarises the proposed C++0x memory
model, at http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf.
See in particular slides 7, 9 and 11.

From a Helgrind-usage point of view, these races are easily suppressed
by adding client requests to specify that the fields in question should
not be race-checked.  But, overall, I still don’t like them: deliberate
races are a hindrance to understandability and to automated checking.
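
For example, marking one such field would look something like this (a sketch; VALGRIND_HG_DISABLE_CHECKING is a real request from valgrind/helgrind.h, and rt stands for the JSRuntime in question):

  #include "valgrind/helgrind.h"

  /* During runtime setup: tell Helgrind not to race-check this
     deliberately racey field. */
  VALGRIND_HG_DISABLE_CHECKING(&rt->gcPoke, sizeof(rt->gcPoke));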