{"id":482,"date":"2017-03-29T11:25:21","date_gmt":"2017-03-29T15:25:21","guid":{"rendered":"http:\/\/blog.mozilla.org\/nfroyd\/?p=482"},"modified":"2017-03-29T11:25:21","modified_gmt":"2017-03-29T15:25:21","slug":"on-mutex-performance-part-1","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/nfroyd\/2017\/03\/29\/on-mutex-performance-part-1\/","title":{"rendered":"on mutex performance and WTF::Lock"},"content":{"rendered":"<p>One of the things I&#8217;ve been doing this quarter is removing <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1312087\">Gecko&#8217;s dependence on NSPR locks<\/a>.\u00a0 Gecko&#8217;s (non-recursive) mutexes and condition variables now use platform-specific constructs, rather than indirecting through NSPR.\u00a0 This change makes things smaller, especially on POSIX platforms, and uses no dynamic memory allocation, so there are fewer untested failure paths.\u00a0 I haven&#8217;t rigorously benchmarked things yet, but I suspect various operations are faster, too.<\/p>\n<p>As I&#8217;ve done this, I&#8217;ve fielded questions about why we&#8217;re not using something like <a href=\"https:\/\/webkit.org\/blog\/6161\/locking-in-webkit\/\"><code>WTF::Lock<\/code><\/a> or the Rust equivalent in <a href=\"https:\/\/github.com\/Amanieu\/parking_lot\"><code>parking_lot<\/code><\/a>.\u00a0 My response has always been some variant of the following: the benchmarks for the <code>WTF::Lock<\/code> blog post were conducted on OS X.\u00a0 We have anecdotal evidence that <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1137963#c0\">mutex overhead can be quite high on OS X<\/a>, and that <a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=1195767\">changing locking strategies on OS X can be beneficial<\/a>.\u00a0 The blog post also says things like:<\/p>\n<blockquote><p>One advantage of OS mutexes is that they guarantee fairness: All threads waiting for a lock form a queue, and, when the lock is released, the thread at the head of the queue acquires it. It\u2019s 100% deterministic. While this kind of behavior makes mutexes easier to reason about, it reduces throughput because it prevents a thread from reacquiring a mutex it just released.<\/p><\/blockquote>\n<p>This is certainly true for mutexes on OS X, as the measurements in the blog post show.\u00a0 But fairness is not guaranteed for all OS mutexes; in fact, fairness isn&#8217;t even guaranteed in the pthreads standard (which OS X mutexes follow).\u00a0 Fairness in OS X mutexes is an implementation detail.<\/p>\n<p>These observations are not intended to denigrate the <code>WTF::Lock<\/code> work: the blog post and the work it describes are excellent.\u00a0 But it&#8217;s not at all clear that the conclusions reached in that post necessarily carry over to other operating systems.<\/p>\n<p>As a partial demonstration of the non-cross-platform applicability of some of the conclusions, I ported WebKit&#8217;s <a href=\"http:\/\/trac.webkit.org\/browser\/trunk\/Source\/WTF\/benchmarks\/LockFairnessTest.cpp?rev=200444\">lock fairness benchmark<\/a> to use raw pthreads constructs; <a href=\"https:\/\/github.com\/froydnj\/locking-benchmark\">the code is available on GitHub<\/a>.\u00a0 The benchmark sets up a number of threads that are all contending for a single lock.\u00a0 The number of lock acquisitions for each thread over a given period of time is then counted.\u00a0 While both of these qualities are configurable via command-line parameters in WebKit&#8217;s benchmark, they are fixed at 10 threads and 100ms in mine, mostly because I was lazy. The output I get on my Mac mini running OS X 10.10.5 is as follows:<\/p>\n<pre>1509\r\n1509\r\n1509\r\n1509\r\n1509\r\n1509\r\n1509\r\n1508\r\n1508\r\n1508\r\n<\/pre>\n<p>Each line indicates the number of lock acquisitions performed by a given thread.\u00a0 Notice the nearly-identical output for all the threads; this result follows from the fairness of OS X&#8217;s mutexes.<\/p>\n<p>The output I get on my Linux box is quite different (aside from each thread performing significantly more lock acquisitions because of differences in processor speed, etc.):<\/p>\n<pre>108226\r\n99025\r\n103122\r\n105539\r\n101885\r\n104715\r\n104608\r\n105590\r\n103170\r\n105476\r\n<\/pre>\n<p>The counts vary significantly between threads: Linux mutexes are not fair by default&#8211;and that&#8217;s perfectly OK.<\/p>\n<p>What&#8217;s more, the developers of OS X have recognized this and added a way to make their mutexes non-fair.\u00a0 In <a href=\"https:\/\/opensource.apple.com\/\/source\/libpthread\/libpthread-218.30.1\/pthread\/pthread_spis.h.auto.html\"><code>&lt;pthread_spis.h&gt;<\/code><\/a>, there&#8217;s a OS X-only function, <code>pthread_mutexattr_setpolicy_np<\/code>.\u00a0 (pthread mutex attributes control various qualities of pthread mutexes: normal, recursively acquirable, etc.\u00a0 This particular function, supported since OS X 10.7, enables setting the fairness policy of mutexes to either <code>_PTHREAD_MUTEX_POLICY_FAIRSHARE<\/code> (the default) or <code>_PTHREAD_MUTEX_POLICY_FIRSTFIT<\/code>.\u00a0 The firstfit policy is not documented anywhere, but I&#8217;m guessing that it&#8217;s something akin to the &#8220;barging&#8221; locks described in the <code>WTF::Lock<\/code> blog post: the lock is made available to whatever thread happens to get to it first, rather than enforcing a queue to acquire the lock.\u00a0 (If you&#8217;re curious, the code dealing with firstfit policy can be found in Apple&#8217;s <a href=\"https:\/\/opensource.apple.com\/\/source\/libpthread\/libpthread-218.30.1\/src\/pthread_mutex.c.auto.html\">pthread_mutex.c<\/a>.)<\/p>\n<p>Running the benchmark on OS X with mutexes configured with the firstfit policy yields quite different numbers:<\/p>\n<pre>14627\r\n13239\r\n13503\r\n13720\r\n13989\r\n13449\r\n13943\r\n14376\r\n13927\r\n14142\r\n<\/pre>\n<p>The variation in these numbers are more akin to what we saw with the non-fair locks on Linux, and what&#8217;s more, they&#8217;re almost an order of magnitude higher than the fair locks. Maybe we should start using firstfit locks in Gecko!\u00a0 I don&#8217;t know how firstfit policy locks compare to something like <code>WTF::Lock<\/code> on my Mac mini, but it&#8217;s clear that saying simply &#8220;OS mutexes are slow&#8221; doesn&#8217;t tell the whole story.  And of course there are other concerns, such as the size required by locks, that motivated the <code>WTF::Lock<\/code> work.<\/p>\n<p>I have vague plans of doing more benchmarking, especially on Windows, where we may want to use <a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/aa904937(v=vs.85).aspx\">slim reader\/writer locks<\/a> rather than <a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/ms682530(v=vs.85).aspx\">critical sections<\/a>, and evaluating Rust&#8217;s <code>parking_lot<\/code> on more platforms.\u00a0 Pull requests welcome.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the things I&#8217;ve been doing this quarter is removing Gecko&#8217;s dependence on NSPR locks.\u00a0 Gecko&#8217;s (non-recursive) mutexes and condition variables now use platform-specific constructs, rather than indirecting through NSPR.\u00a0 This change makes things smaller, especially on POSIX platforms, and uses no dynamic memory allocation, so there are fewer untested failure paths.\u00a0 I haven&#8217;t [&hellip;]<\/p>\n","protected":false},"author":320,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/posts\/482"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/users\/320"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/comments?post=482"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/posts\/482\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/media?parent=482"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/categories?post=482"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nfroyd\/wp-json\/wp\/v2\/tags?post=482"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}