There’s more than one way to null a pointer
I haven’t blogged in a while because I’ve been in heads-down mode working mostly on getting closures and the arguments keyword to trace in TraceMonkey and investigating topcrashes. But we’ve just solved the most common JS-engine topcrash, so I’m resurfacing with the story, and hopefully some useful tricks and analysis for anyone else working on anything like this.
The Crash Reports. The crash in question is a crash in js_Interpret, which is a huge function with cases for every bytecode in SpiderMonkey. js_Interpret has been placing around 16-19 on the Firefox 3.5 topcrashes list. Looking at the stack traces, I found that a certain crash point (line of code where we crash) accounted for the majority of the crashes:
0 js3250.dll js_Interpret js/src/jsinterp.cpp:4436 1 js3250.dll js_Invoke js/src/jsinterp.cpp:1394 2 xul.dll nsXPCWrappedJSClass::CallMethod js/src/xpconnect/src/xpcwrappedjsclass.cpp:1697 3 xul.dll nsXPCWrappedJS::CallMethod js/src/xpconnect/src/xpcwrappedjs.cpp:569
The line of code on which the crash occurs, jsinterp.cpp:4436, is for the fast path of reading a JS object property.
There was no bug on file with steps to reproduce, or any other kinds of data that would let us directly debug the problem.
Crash Addresses. Clicking on a few of these reports, I saw that the Windows crash address was usually 0x0, 0x4, 0x8, or 0xc. I wrote a script that verified that guess on a larger sample. That pattern told me the crash was probably due to an attempt to read from an array of word-sized elements through a null pointer. Something like this:
// a is NULL int x = a[i];
If i is usually in the range 0-3, then most crash addresses would be 0, 4, 8, or 12.
If you expand all the macros on the crash line, there is in fact an array read, reading a JS object property as
obj->dslots[slot-5]. dslots is a dynamically allocated array of 4-byte JS values that holds object properties after the first two (depending on the exact object type). If the object has few enough properties, then dslots is null.
Minidumps. Using minidumps and disassembling code around the crash point, I was able to prove the crash was in fact caused by null dslots.
Disassembly also showed that about 25% of the crash reports reported the wrong line of code, apparently because the compiler’s profile-guided optimization mangled the x86 code too much for the crash analyzer to handle. This confused me at first, because I had happened to start with a crash report that listed the wrong line.
I also found a few of the local variables’ values using the minidump. The minidump doesn’t have enough information for the debugger to find those variables. I got them by reverse-engineering the x86 code, finding stores to stack locations (stores through ebp), matching them up with variables, and then reading the values out of the minidump. I only got 3 values, and they weren’t terribly informative, but they did rule out certain code paths and special cases, narrowing significantly the kinds of JS objects I needed to think about. For example, the variables showed that the crash was not related to dense arrays (array objects represented by C arrays) or DOM objects.
At this point, I knew that the crash was because dslots was null, and I knew it wasn’t related to certain advanced features, but that’s it. In particular, dslots absolutely shouldn’t be null on that code path, because that is a code path for a property lookup that finds a property in the property cache with a slot greater than 5. Somehow, the property cache was giving us the wrong answer for the current object, but there was no indication why. One possibility is that the property cache had wrong data. Another is that something set dslots to null when it shouldn’t.
Bits and grapes and dates and cakes. Now, if I had steps to reproduce and could debug, one of my first questions would be, “What code sets dslots to null for the object we are looking at?”
Thinking about that, I got an idea so dumb it was clever. I knew I wanted to know the code point that set dslots to null. And I knew the only real piece of data about the crash I could get was the crash address. So, clearly, the answer is to make each code point that sets dslots to null use a different null address. Then, the crash address would give the answer. Like this:
// someplace that sets dslots to null obj->dslots = (void*) 256; // some other place that sets dslots to null obj->dslots = (void*) 512;
If the crash addresses come back 256, 260, 264, 268, then the first point is the one setting dslots to null. If they are 512, 516, 520, 524, then it’s the second point. Of course, it could also be a mixture.
There are a bunch of other details that need to be handled to make this work, like changing code that tests if dslots is null, which can be found in the patches posted in bug 519363.
The reason this works is that the “zero page” (page with base address 0) is protected on all popular operating systems, and the null pointer is defined as address 0. Thus, in normal operation, any attempt to read through a null pointer, or an array based with a null pointer (as long as the index isn’t so large it gets to another page), will crash with a protection fault. But this means that any pointer with address less than 4096 will also crash. By using multiple null pointers, and changing tests for null to comparisons with 4096, we can send back up to 12 bits of information through the crash address. For our bug, we actually only get 9 or 10 bits, because 2 or 3 bits are taken up by the array index part of the crash address. But there were only about 6 places dslots was set to null, so we had more than enough bits.
The first try showed that all the objects for the crash were created by the same function: js_NewObjectWithGivenProto. This immediately (well, after the 1 or 2 days between pushing a patch and having nightly test users experience the crash) ruled out some guesses–we had thought the problem might be that objects were getting GC’d too soon, but the crash addresses proved that all the crash objects were “fresh” and had never been GC’d.
I noticed that js_NewObjectWithGivenProto had a few significantly different code paths related to object creation: (1) a path for “special” objects (like DOM objects), (2) a path for normal objects that are “similar” to their prototypes, and (3) a path for normal objects that are not “similar” to their prototypes. I tweaked the null pointer instrumentation to distinguish these paths, and found that all the crash objects went through path (2). This was very interesting, because path (2) uses special code to set the “shape” of the object, which is also key to the property cache lookup just before the crash.
(Aside: an object’s “shape” is a number that represents its collection of properties (and the attributes of those properties). Two objects that have the same shape are required to have the same properties and the same dslots layout for those properties. See also https://developer.mozilla.org/En/SpiderMonkey/Internals/Property_cache#Shape.)
At this point, my working hypothesis was that somehow, using that initialization path, we ended up with two objects that had the same shape but different layouts. But I had no idea how that could happen.
Software archaeology. Next, I tired a lot of things that didn’t work. It took a long time and it was really frustrating. I also read a bunch of code and got some reasonable guesses as to how the shape problem could occur, but none could be tested through the null pointer trick, but only through more complicated methods that also required upgrades to minidump collection that hadn’t occurred yet.
Eventually, I got a better understanding of the kinds of data I could get from the crash database, and I realized I might be able to find out when this crash had started occurring for trunk nightly users. I dug through all the data and found 3 patches closely related to object shapes and prototypes that were checked in just before the crash started occurring.
The issues with those patches are pretty technical, and some are for possibly security-sensitive bugs, so I won’t go into great detail. But we got a better guess about what was happening and where to look. I posted my findings, then started reading the patches and trying to think about what might be wrong with it and how I could test those things, but I didn’t get any particularly good ideas right away.
Jorendorff. Then, magically, Jason Orendorff figured out the problem. He read the code around one of the patches and found a significant bug. He didn’t tell me exactly why he was reading that code, so I can’t be absolutely sure it was because I pointed at it, but I think it was.
From there, he wrote a patch, which we checked in. And then waited. After 4 days passed and there were no more crashes at this crash point on trunk, I decided it was fixed. Yay!!!
I think I learned a few lessons about debugging crashes in the field with only crash report data and basically none of the things you normally hope to have debugging a crash.
One lesson is that it’s important to use all the sources of data available: crash reports, patterns in crash report attributes, minidumps, revision history, bug comments, other articles and docs, source code, and data collected using instrumentation patches. And all these have to be used together.
A second lesson is the null pointer trick, and also the value of being able to push instrumentation patches and get back field data in 1-2 days. This would be much more powerful if we could collect additional information in the crash reports, as controlled by the programmer. (E.g., crash report skidmarks.) I think we can actually do that now with minidumps by printing useful information to buffers allocated on the C stack, which are then part of the minidump.
A third lesson is the value of the history. Being able to see the day on which a bug started and find the patches that went in before that day was huge. One problem we have here is that TraceMonkey generally merges to mozilla-central only every week, so if the problem was introduced in a patch first pushed to TM, there is an entire week of patches that might have caused the problem. We should consider going to a more frequent merge schedule.
A side point is that sifting crash report data and history would be a lot easier if we had query access to the underlying crash data. For now, everything has to go through the web interface, which is limiting (e.g., you can only get back 500 crash reports of a given kind of a given day) and slow (you need to write a script that downloads a lot of crash reports, parses the data, and then run the query you actually care about, so it’s slow in both developer time and computer time).
The final lesson, which may be obvious, is the value of getting more eyes on the problem. I got substantial advice from both Brendan and of course Jason on this. I also benefited a lot from being able to talk through the issues with Andreas, David Anderson, and Luke Wagner. I read the code that contained the bug (which was ultimately a flipped
if condition), but I didn’t see it because I assumed that other parts of the code did the right thing so the
if would be correct. But they didn’t, and Jason knew it or saw it.
Debugging this kind of crash is all about trying to sort things out that make no sense with almost no data, and it requires both creativity (to think of weird failure modes and experiments to try) and perceptiveness (to see the bug once you’ve narrowed down the search). I suppose creativity and perceptiveness are related in that they are both about being able to see an especially interesting straw in a (sensory or imaginary) haystack. More unique viewers means more chances to see that funny straw.