27
Apr 15

recent tsan improvements for firefox

One of my goals for Q2 was to make Firefox usable with “modern” (version 3.6+) clang. For reasons previously unknown, Firefox would only work with relatively old clang (version ~3.3); indeed, our wiki page for TSan recommends checking out specific SVN versions!

I’m happy to say that goal has been met.  If you like, you can read (some of) the gory details in bug 1153244.  Checking out a development version of Clang and building that should enable you to test Firefox (from mozilla-central, since that’s where the aforementioned bug has landed thus far) with TSan.  Upgrading through several major releases has its benefits: TSan appears to be somewhat more stable—tests that would previously timeout reliably now run without complaint.  I’m not sure that it’s quite reliable enough to run in automation yet, but running TSan tests in automation is definitely more feasible than it was last quarter.

With an eye towards doing that, and also with an eye towards improving individual developer experience, I’ve recently landed several patches that teach our test harnesses about TSan-specific things.  For instance, when we’re running mochitests, we only care about races in the browser itself; we don’t care about races while running any of the supporting tools (particularly xpcshell, which shares a lot of code with the browser through libxul).  So TSan should only be enabled for the actual browser executable.  We already do similar things for our ASan builds.

I also discovered that my local builds weren’t using the llvm-symbolizer tool to produce backtraces in TSan’s reports, but instead defaulting to using addr2lineaddr2line is a fine tool, but llvm-symbolizer understands debug information better (for instance, it shows you synthesized frames for inlined functions, which is a huge win) and llvm-symbolizer is somewhat faster than addr2line. In my case, it was a combination of llvm-symbolizer not being in my PATH, and not informing TSan about the location of llvm-symbolizer ($OBJDIR/dist/bin/, usually). The former was addressed by setting LLVM_SYMBOLIZER in my mozconfig (not necessary if llvm-symbolizer is on your PATH), and the latter was addressed by adding a bit of code to our test scripts.

Now, to update the wiki with better instructions!


31
Mar 15

tsan bug finding update

At the beginning of Q1, I set a goal to investigate races with Thread Sanitizer and to fix the “top” 10 races discovered with the tool.  Ten races seemed like a conservative number; we didn’t know how many races there were, their impact, or how difficult fixing them would be.  We also weren’t sure how much participation we could count on from area experts, and it might have turned out that I would spend a significant amount of time figuring out what was going on with the races and fixing them myself.

I’m happy to report that according to Bugzilla, nearly 30 of the races reported this quarter have been fixed.  Folks in the networking stack, JavaScript GC, JavaScript JIT, and ImageLib have been super-responsive in addressing problems.  There are even a few bugs in the afore-linked query that have been fixed by folks on the WebRTC team running TSan or similar thread-safety tools (Clang’s Thread Safety Analysis, to be specific) to detect bugs themselves, which is heartening to see.  And it’s also worth mentioning that at least one of the unfixed DOM media bugs detected by TSan has seen some significant threadsafety refactoring work in its dependent bugs.  All this attention to data races has been very encouraging.

I plan on continuing the TSan runs in Q2 along with getting TSan-on-Firefox working with more-or-less current versions of Clang.  Having to download specific Subversion revisions of the tools or precompiled Clang 3.3 (!) binaries to make TSan work is discouraging, and that could obviously work much better.


20
Feb 15

finding races in Firefox with ThreadSanitizer

We use a fair number of automated tools for memory errors (AddressSanitizer/Leak Sanitizer for use-after-free and buffer overflows; custom leak checking on refcounted objects; Valgrind tests and Julian Seward’s mochitests on Valgrind periodic testing), but we do very little in terms of checking for data races between threads.  As more and more components of the browser use threads in earnest (parts of the JavaScript engine, including the GC; graphics; networking; Web features like indexedDB, workers, and WebRTC; I have probably left out some others), preventing and/or fixing data races become more and more important as a way of ensuring both correctness and stability. One of my goals this quarter is running mochitests and reftests with ThreadSanitizer (TSan), reporting any races that it finds, and either fixing some of them myself or convincing other people to fix them.

What is a data race? Informally, data races occur when two different threads operate on the same location in memory without any synchronization between the threads. So if you do:

*ptr = 1;

in one thread and:

if (*ptr == 1) {
...
}

in another thread (without locks or similar), that’s a data race. It’s not guaranteed that the second thread will see the value written by the first thread, and if the code was written with that assumption, things can (and usually do) work as expected, but can also go badly wrong. When things do go badly wrong, it can be extremely frustrating to find the actual problem, because the problems don’t typically show up at the point where the data race happened. Of course, since the bug is dependent on caches and timing issues, the problem doesn’t always reproduce on a developer’s machine, either. You can see one of Chromium’s experiences with data races producing a common crash, and it took them nearly three months to find a fix. TSan told them exactly where to look. Google has written blog posts about their experiences using TSan with Chromium and more generally with other software they develop, including Go and their own internal tools.

When faced with evidence of a data race, people sometimes say, “well, int/long/pointer accesses are atomic, so it’s OK.” This is decidedly not true, at least not in the sense that writes to memory locations of such types are immediately visible to all threads. People sometimes try using volatile to fix the problem; that doesn’t work either (volatile says nothing about concurrent accesses between threads, or visibility of operations to other threads). Even if you think your data race is benign–that it can’t cause problems–you might be surprised at what can go wrong with benign data races. Hans Boehm, who led the effort to define the C++ memory model, has written a paper describing why there is no such thing as a benign data race at the C/C++ language level. Data races are always real bugs and are explicitly undefined behavior according to the C++ standard.

I’m not the first person to try using TSan on Firefox inside Mozilla; Christian Holler started filing bugs about races detected by TSan over a year ago.  So far, I’ve filed about 30 bugs from running “interesting” groups of mochitests: mostly mochitests under dom/ and layout/. That doesn’t sound like that many, and there are a couple reasons for that. One is that the same races tend to get reported over and over again. I’ve applied some local fixes to silence some of the really obnoxious races, but you still have to sort through the same races to find the interesting ones. (I should probably spend quality time setting up suppression files, but I’ve avoided doing that thus far for the fear of inadvertently silencing other races that I haven’t seen/reported before.) Test runs are also slow (about 10x slower than a non-TSan run), and it’s fairly common for the browser to simply hang during testing. I’ve been able to fiddle around with the mochitest harness to increase timeouts or the number of timeouts permitted, but that makes tests go even slower, as tests that do timeout take longer and longer to do so.

Fortunately, people have been responsive to the bugs I’ve filed; Honza Bambas, Valentin Gosu, and Patrick McManus on the networking side of things; Terrence Cole, Jon Coppeard, Andrew McCreight, and Shu-yu Guo on the JavaScript side of things; and Milan Sreckovic on the graphics side of things have all been fixing bugs and/or helping analyze what’s going wrong as they’ve been filed.

If you want to try out TSan locally, there’s an excellent introduction to using TSan on MDN.  I strongly recommend following the instructions for building your own, local clang; my experiences have shown that the released versions of clang don’t always work (at least with Firefox) when trying to use TSan. If you find new bugs in your favorite piece of code, please file a bug, CC me, and make it block bug 929478.