29 Mar 17

on mutex performance and WTF::Lock

One of the things I’ve been doing this quarter is removing Gecko’s dependence on NSPR locks.  Gecko’s (non-recursive) mutexes and condition variables now use platform-specific constructs, rather than indirecting through NSPR.  This change makes things smaller, especially on POSIX platforms, and eliminates dynamic memory allocation, so there are fewer untested failure paths.  I haven’t rigorously benchmarked things yet, but I suspect various operations are faster, too.

As I’ve done this, I’ve fielded questions about why we’re not using something like WTF::Lock or the Rust equivalent in parking_lot.  My response has always been some variant of the following: the benchmarks for the WTF::Lock blog post were conducted on OS X.  We have anecdotal evidence that mutex overhead can be quite high on OS X, and that changing locking strategies on OS X can be beneficial.  The blog post also says things like:

One advantage of OS mutexes is that they guarantee fairness: All threads waiting for a lock form a queue, and, when the lock is released, the thread at the head of the queue acquires it. It’s 100% deterministic. While this kind of behavior makes mutexes easier to reason about, it reduces throughput because it prevents a thread from reacquiring a mutex it just released.

This is certainly true for mutexes on OS X, as the measurements in the blog post show.  But fairness is not guaranteed for all OS mutexes; in fact, fairness isn’t even guaranteed in the pthreads standard (which OS X mutexes follow).  Fairness in OS X mutexes is an implementation detail.

These observations are not intended to denigrate the WTF::Lock work: the blog post and the work it describes are excellent.  But it’s not at all clear that the conclusions reached in that post necessarily carry over to other operating systems.

As a partial demonstration of the non-cross-platform applicability of some of the conclusions, I ported WebKit’s lock fairness benchmark to use raw pthreads constructs; the code is available on GitHub.  The benchmark sets up a number of threads that are all contending for a single lock.  The number of lock acquisitions for each thread over a given period of time is then counted.  While both of these qualities are configurable via command-line parameters in WebKit’s benchmark, they are fixed at 10 threads and 100ms in mine, mostly because I was lazy. The output I get on my Mac mini running OS X 10.10.5 is as follows:

1509
1509
1509
1509
1509
1509
1509
1508
1508
1508

Each line indicates the number of lock acquisitions performed by a given thread.  Notice the nearly-identical output for all the threads; this result follows from the fairness of OS X’s mutexes.

The output I get on my Linux box is quite different (aside from each thread performing significantly more lock acquisitions because of differences in processor speed, etc.):

108226
99025
103122
105539
101885
104715
104608
105590
103170
105476

The counts vary significantly between threads: Linux mutexes are not fair by default, and that’s perfectly OK.

What’s more, the developers of OS X have recognized this and added a way to make their mutexes non-fair.  In <pthread_spis.h>, there’s an OS X-only function, pthread_mutexattr_setpolicy_np.  (pthread mutex attributes control various qualities of pthread mutexes: normal, recursively acquirable, etc.)  This particular function, supported since OS X 10.7, enables setting the fairness policy of mutexes to either _PTHREAD_MUTEX_POLICY_FAIRSHARE (the default) or _PTHREAD_MUTEX_POLICY_FIRSTFIT.  The firstfit policy is not documented anywhere, but I’m guessing that it’s something akin to the “barging” locks described in the WTF::Lock blog post: the lock is made available to whatever thread happens to get to it first, rather than enforcing a queue to acquire the lock.  (If you’re curious, the code dealing with firstfit policy can be found in Apple’s pthread_mutex.c.)

Running the benchmark on OS X with mutexes configured with the firstfit policy yields quite different numbers:

14627
13239
13503
13720
13989
13449
13943
14376
13927
14142

The variation in these numbers is more akin to what we saw with the non-fair locks on Linux, and what’s more, the counts are almost an order of magnitude higher than with the fair locks. Maybe we should start using firstfit locks in Gecko!  I don’t know how firstfit policy locks compare to something like WTF::Lock on my Mac mini, but it’s clear that simply saying “OS mutexes are slow” doesn’t tell the whole story. And of course there are other concerns, such as the size required by locks, that motivated the WTF::Lock work.

I have vague plans of doing more benchmarking, especially on Windows, where we may want to use slim reader/writer locks rather than critical sections, and evaluating Rust’s parking_lot on more platforms.  Pull requests welcome.