Limits of reliability – Nicholas Nethercote

Julian Seward asked me an interesting question a while ago: “what are the factors that limit Firefox’s reliability?” (You can use “crash rate” as a reasonable definition of “reliability”.)

He suggested two things:

Firefox depends on external code, such as plug-ins.
Many crashes are hard to reproduce and so don’t get fixed.

For the first, Electrolysis (a.k.a. process separation) is on track to pretty much make it a non-problem. It’s already in place for Flash, and will eventually be for other plug-ins. So that’s good.

For the second, I see two main sub-factors.

First, Firefox is implemented in C++, which is prone to memory-related bugs and data races, both of which can make crash reproduction difficult. Using a safer language like Rust would make many (all?) of these bugs impossible. Unfortunately, Rust isn’t production-ready, and rewriting even parts of the browser is a huge undertaking. So we’d better get started ASAP 🙂
Second, Firefox has some nasty low-level code like the garbage collector; bugs in it can be very difficult to reproduce. I don’t see an obvious way to improve this other than the usual: testing, code review, using simple algorithms, etc.

Maybe we need to brainstorm ways to make crashes easier to reproduce. My 11am brain is weaker than your 4am brain, but here are some stream-of-consciousness ideas:

– debug builds fail faster than opt builds, making crashes happen closer to their causes. We could move some of the debug-build brittleness into the opt builds — poisoning GC’d objects, for example.

– expanding roc’s idea, run our entire user base inside of a record+replay VM (along with their computers, of course). Then we can blame our users for the crashes — if you hadn’t gotten up early and fired up your browser, it wouldn’t have crashed on you. Keanu Reeves can help us with this one.

– write out space-limited audit trails. When GC discards something, log a minimalistic description of what it was in a circular queue that gets submitted with a minidump.

– on a crash, save out horrendously detailed information on the state of everything at the time of the crash, privacy and disk space be damned. Then ask the user if they’d like to help reproduce the problem. If so, restore as much of the state as possible from the dump and try again. If that succeeds, do it *again*, only this time enable some runtime instrumentation (“brittle mode”; hey, just download the corresponding debug build and run that!). Or skip the middle step.

– post-Electrolysis: when you free a chunk of memory, hand it over to a different process to use and unmap it from the origin process. Then you get a seg fault when you reuse collected memory instead of random corruption, without wasting gobs of memory.
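
To make two of the GC ideas above a bit more concrete (poisoning discarded objects, and the space-limited audit trail), here is a rough sketch in Python. It is a toy, not how Firefox's collector works: `ToyHeap`, the `POISON` byte, the fixed-size cells and `crash_report` are all names and choices I made up for illustration. The real thing would live in C++ inside the allocator.

```python
from collections import deque

POISON = 0xDB  # hypothetical poison byte; any recognizable pattern would do


class ToyHeap:
    """A toy heap of fixed-size cells standing in for a real GC heap."""

    def __init__(self, cells, cell_size, trail_len):
        self.cell_size = cell_size
        self.memory = bytearray(cells * cell_size)
        # Reversed so that pop() hands out cell 0 first.
        self.free_cells = list(range(cells - 1, -1, -1))
        self.kinds = {}  # live cell index -> short description
        # Bounded audit trail: once full, the oldest entry is dropped.
        self.trail = deque(maxlen=trail_len)

    def alloc(self, kind, payload):
        if len(payload) > self.cell_size:
            raise ValueError("payload too large for a cell")
        cell = self.free_cells.pop()
        start = cell * self.cell_size
        self.memory[start:start + len(payload)] = payload
        self.kinds[cell] = kind
        return cell

    def discard(self, cell):
        """What the collector would do when it frees a cell."""
        kind = self.kinds.pop(cell)
        # Log a minimal description of what was discarded.
        self.trail.append((cell, kind))
        start = cell * self.cell_size
        # Poison the whole cell so a stale read sees an obvious pattern.
        self.memory[start:start + self.cell_size] = bytes([POISON]) * self.cell_size
        self.free_cells.append(cell)

    def read(self, cell):
        start = cell * self.cell_size
        data = bytes(self.memory[start:start + self.cell_size])
        # Fail fast on the poison pattern instead of returning garbage.
        if data == bytes([POISON]) * self.cell_size:
            raise RuntimeError(f"read of poisoned cell {cell}")
        return data

    def crash_report(self):
        """The audit trail that would be submitted alongside a minidump."""
        return list(self.trail)
```

The poison check means a use-after-free fails at the point of the bad read rather than somewhere far downstream, and the `deque(maxlen=...)` keeps the trail at a fixed size no matter how long the session runs. A live cell that legitimately holds nothing but the poison byte would trip the check too, which a real implementation would have to deal with.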

I feel like I’m being too GC-focused.

– put guard words before & after array allocations. Check them during GC, or on a background thread, or something.

– catch fatal signals and overwrite the entire process image with a program that displays calming waterfall videos. (Or porn, depending on browser history?) If the crashes aren’t reported, they don’t exist.

– add some hysteresis to event queuing — wait a few milliseconds before adding an event to the queue, and if more events come in and there is no forced ordering among them (which is hard to detect, but never mind, I’m brainstorming), insert them into the queue in “sorted” order.

The goal is to make different runs of the same actions behave more similarly. In particular, the original crashing run and the developer’s attempting-to-reproduce run should be as similar as possible.
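
A rough sketch of the hysteresis idea, in Python, under some big assumptions: events are plain `(arrival_ms, name)` tuples, "sorted" just means sorted by name, and there are no forced orderings to respect (the hard part, which I'm ignoring). `canonical_order` and `window_ms` are made-up names.

```python
def canonical_order(events, window_ms):
    """Return event names in the order they would be queued.

    `events` is a list of (arrival_ms, name) tuples in arrival order.
    Events arriving within `window_ms` of the first event in a batch are
    held back together and queued in sorted order.
    """
    queued = []
    batch = []
    batch_start = None
    for arrival_ms, name in events:
        # The window has closed: flush the held batch in sorted order.
        if batch and arrival_ms - batch_start > window_ms:
            queued.extend(sorted(batch))
            batch = []
        if not batch:
            batch_start = arrival_ms
        batch.append(name)
    # Flush whatever is still being held at the end.
    queued.extend(sorted(batch))
    return queued


# Two runs of "the same actions" with slightly different timing.
run1 = [(0, "timer"), (1, "network"), (2, "paint"), (50, "click")]
run2 = [(0, "paint"), (2, "timer"), (3, "network"), (51, "click")]
```

With a 5 ms window, both runs come out as `network`, `paint`, `timer`, `click`, even though the first three events arrived in a different order each time. The cost is a few milliseconds of added latency on every batch.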

– similar, but applied to random number generation: generate random seeds for each page load and use the appropriate generator for all randomness within that page. Save the seeds into the minidump.

– or applied to parallel network loads: record the order (or timestamp) that each page was loaded, and feed that into a transparent proxy that the test system goes through.
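
The per-page random seed idea is the easiest of these to sketch. Here is a minimal Python version, assuming each page load has some stable identifier; `PageRandomness`, `page_id` and `seeds_for_minidump` are hypothetical names, and a real browser would need to route every source of randomness through something like this.

```python
import random
import secrets


class PageRandomness:
    """Hands out one seeded generator per page load and remembers the seeds."""

    def __init__(self, seeds=None):
        # Passing in seeds recovered from a crash report replays the same streams.
        self.seeds = dict(seeds or {})
        self._generators = {}

    def generator_for(self, page_id):
        if page_id not in self._generators:
            # Pick a fresh random seed unless one was supplied for this page.
            seed = self.seeds.setdefault(page_id, secrets.randbits(64))
            self._generators[page_id] = random.Random(seed)
        return self._generators[page_id]

    def seeds_for_minidump(self):
        """The seeds that would be saved into the crash report."""
        return dict(self.seeds)


# The crashing run: seeds are chosen at random and recorded.
original = PageRandomness()
crashing_values = [original.generator_for("tab-1").random() for _ in range(3)]

# The developer's run: the same seeds are fed back in.
replay = PageRandomness(original.seeds_for_minidump())
replayed_values = [replay.generator_for("tab-1").random() for _ in range(3)]
```

Here `replayed_values` matches `crashing_values` exactly, because the developer's run draws from a generator seeded with the value the crashing run recorded. The same record-then-feed-back shape would apply to the network load order, with a proxy playing the role of the generator.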
