We use a fair number of automated tools for memory errors (AddressSanitizer/Leak Sanitizer for use-after-free and buffer overflows; custom leak checking on refcounted objects; Valgrind tests and Julian Seward’s mochitests on Valgrind periodic testing), but we do very little in terms of checking for data races between threads. As more and more components of the browser use threads in earnest (parts of the JavaScript engine, including the GC; graphics; networking; Web features like indexedDB, workers, and WebRTC; I have probably left out some others), preventing and/or fixing data races become more and more important as a way of ensuring both correctness and stability. One of my goals this quarter is running mochitests and reftests with ThreadSanitizer (TSan), reporting any races that it finds, and either fixing some of them myself or convincing other people to fix them.
What is a data race? Informally, data races occur when two different threads operate on the same location in memory without any synchronization between the threads. So if you do:
*ptr = 1;
in one thread and:
if (*ptr == 1) {
...
}
in another thread (without locks or similar), that’s a data race. It’s not guaranteed that the second thread will see the value written by the first thread, and if the code was written with that assumption, things can (and usually do) work as expected, but can also go badly wrong. When things do go badly wrong, it can be extremely frustrating to find the actual problem, because the problems don’t typically show up at the point where the data race happened. Of course, since the bug is dependent on caches and timing issues, the problem doesn’t always reproduce on a developer’s machine, either. You can see one of Chromium’s experiences with data races producing a common crash, and it took them nearly three months to find a fix. TSan told them exactly where to look. Google has written blog posts about their experiences using TSan with Chromium and more generally with other software they develop, including Go and their own internal tools.
When faced with evidence of a data race, people sometimes say, “well, int
/long
/pointer accesses are atomic, so it’s OK.” This is decidedly not true, at least not in the sense that writes to memory locations of such types are immediately visible to all threads. People sometimes try using volatile
to fix the problem; that doesn’t work either (volatile
says nothing about concurrent accesses between threads, or visibility of operations to other threads). Even if you think your data race is benign–that it can’t cause problems–you might be surprised at what can go wrong with benign data races. Hans Boehm, who led the effort to define the C++ memory model, has written a paper describing why there is no such thing as a benign data race at the C/C++ language level. Data races are always real bugs and are explicitly undefined behavior according to the C++ standard.
I’m not the first person to try using TSan on Firefox inside Mozilla; Christian Holler started filing bugs about races detected by TSan over a year ago. So far, I’ve filed about 30 bugs from running “interesting” groups of mochitests: mostly mochitests under dom/
and layout/
. That doesn’t sound like that many, and there are a couple reasons for that. One is that the same races tend to get reported over and over again. I’ve applied some local fixes to silence some of the really obnoxious races, but you still have to sort through the same races to find the interesting ones. (I should probably spend quality time setting up suppression files, but I’ve avoided doing that thus far for the fear of inadvertently silencing other races that I haven’t seen/reported before.) Test runs are also slow (about 10x slower than a non-TSan run), and it’s fairly common for the browser to simply hang during testing. I’ve been able to fiddle around with the mochitest harness to increase timeouts or the number of timeouts permitted, but that makes tests go even slower, as tests that do timeout take longer and longer to do so.
Fortunately, people have been responsive to the bugs I’ve filed; Honza Bambas, Valentin Gosu, and Patrick McManus on the networking side of things; Terrence Cole, Jon Coppeard, Andrew McCreight, and Shu-yu Guo on the JavaScript side of things; and Milan Sreckovic on the graphics side of things have all been fixing bugs and/or helping analyze what’s going wrong as they’ve been filed.
If you want to try out TSan locally, there’s an excellent introduction to using TSan on MDN. I strongly recommend following the instructions for building your own, local clang; my experiences have shown that the released versions of clang don’t always work (at least with Firefox) when trying to use TSan. If you find new bugs in your favorite piece of code, please file a bug, CC me, and make it block bug 929478.