Cumulative heap profiling in Firefox with DMD

DMD is a tool that I originally created to help identify where new memory reporters should be added to Firefox in order to reduce the “heap-unclassified” measurement in about:memory. (The name is actually short for “Dark Matter Detector”, because we sometimes call the “heap-unclassified” measurement “dark matter”.)

Recently, I’ve modified DMD to become a more general heap profiling tool. It now has three distinct modes of operation.

  1. “Dark matter”: this mode gives you DMD’s original behaviour.
  2. “Live”: this mode tracks all the live blocks on the system heap, and lets you take snapshots at particular points in time.
  3. “Cumulative”: this mode tracks all the blocks that have ever been allocated on the system heap, and so gives you information about all the allocations done by Firefox during an entire session.

Most memory profilers (including about:memory) are snapshot-based, and so work much like DMD’s “live” mode. But “cumulative” mode is equally interesting.

In particular, unlike “live” mode, “cumulative” mode tells you about parts of the code that are responsible for allocating many short-lived heap blocks (sometimes called “heap churn”). Such allocations can hurt performance: allocations and deallocations themselves aren’t free, particularly because they require obtaining a global lock; such code often involves unnecessary initialization or copying of heap data; and if these allocations come in a variety of sizes they can cause additional heap fragmentation.
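
To make the difference concrete, here is a toy sketch in Python. It is not how DMD is implemented: the ToyHeapTracker class, the call-site names, and the block sizes are all made up for illustration. It only shows why a call site that allocates and immediately frees many small blocks disappears from a live snapshot but dominates the cumulative totals.

```python
from collections import defaultdict


class ToyHeapTracker:
    """Toy model of live vs. cumulative heap tracking (hypothetical, not DMD's design)."""

    def __init__(self):
        # Block id -> (site, size) for blocks that have not been freed yet.
        self.live = {}
        # Site -> [block count, byte count] for every block ever allocated.
        self.cumulative = defaultdict(lambda: [0, 0])
        self._next_id = 0

    def malloc(self, site, size):
        block = self._next_id
        self._next_id += 1
        self.live[block] = (site, size)
        self.cumulative[site][0] += 1
        self.cumulative[site][1] += size
        return block

    def free(self, block):
        # Freed blocks leave the live table but stay in the cumulative totals.
        del self.live[block]

    def live_bytes_by_site(self):
        totals = defaultdict(int)
        for site, size in self.live.values():
            totals[site] += size
        return dict(totals)

    def cumulative_bytes_by_site(self):
        return {site: nbytes for site, (_, nbytes) in self.cumulative.items()}


tracker = ToyHeapTracker()

# One long-lived block (made-up call site).
cache = tracker.malloc("build_cache", 4096)

# Heap churn: many short-lived blocks from another made-up call site.
for _ in range(1000):
    tmp = tracker.malloc("format_temp_string", 64)
    tracker.free(tmp)

live = tracker.live_bytes_by_site()
cumulative = tracker.cumulative_bytes_by_site()
print("live:", live)
print("cumulative:", cumulative)
```

In this sketch the live view contains only the build_cache block, while the cumulative view also records the 1000 blocks (64000 bytes in total) that format_temp_string allocated and freed along the way.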

Another nice thing about cumulative heap profiling is that, unlike live heap profiling, you don’t have to decide when to take snapshots. You can just profile an entire workload of interest and get the results at the end.

I’ve used DMD’s cumulative mode to find inefficiencies in SpiderMonkey’s source compression and code generation, SQLite, NSS, nsTArray, XDR encoding, Gnome start-up, IPC messaging, nsStreamLoader, cycle collection, and safe-browsing. There are “start doing something clever” optimizations and then there are “stop doing something stupid” optimizations, and every one of these fixes has been one of the latter. Each change has avoided cumulative heap allocations ranging from tens to thousands of MiBs.

It’s typically difficult to quantify any speed-ups from these changes, because the workloads are noisy and non-deterministic, but I’m convinced that simple changes to fix these issues are worthwhile. For one, I used cumulative profiling (via a different tool) to drive the major improvements I made to pdf.js earlier this year. Also, Chrome developers have found that “Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag”.

So far I have only profiled a few simple workloads. There are all sorts of things I haven’t tried: text-heavy pages, image-heavy pages, audio and video, WebRTC, WebGL, popular benchmarks… the list goes on. I intend to do more profiling and fix things where I can, but it would be great to have help from domain experts with this stuff. If you want to try out cumulative heap profiling in Firefox, please read the DMD instructions and feel free to ask me for help. In particular, I now have a good feel for which hot allocations are unavoidable and reasonable — plenty of them, of course — and which are avoidable. Let’s get rid of the avoidable ones.

7 Responses to Cumulative heap profiling in Firefox with DMD

  1. Emanuel Hoogeveen

    Good stuff. One thing I thought this would do, but which dmd.py doesn’t appear to support, is take the number of blocks into account when sorting the entries. A large array that’s only allocated once isn’t that interesting from the perspective of cumulative allocations. Perhaps a useful addition to dmd.py would be the ability to sort by the number of blocks? Particularly tiny allocations might not be that interesting either though (even if there are a lot of them).

    I noticed that Firefox takes a *very* long time to start with |--sample-below 1| (at least on Windows). It just sits there using 2920 KiB for ages, using no CPU, so I’m wondering what it’s waiting for.

    • Nicholas Nethercote

      Good suggestion on sorting by block counts. I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1110455.

      How long is “very” long? --sample-below=1 does make DMD substantially slower, but it shouldn’t be ridiculous…

      • Emanuel Hoogeveen

        I didn’t time it, but it felt like at least 5 minutes. With the default sample-below it takes a few seconds. What makes me most suspicious is that it didn’t use a noticeable amount of CPU and was sitting at the same amount of memory for ages. Maybe it’s making system calls that for some reason take a long time? I don’t know enough about how DMD works to do more than guess.

        • Nicholas Nethercote

          5 minutes is definitely too long. Something weird must be happening 🙁

          • Emanuel Hoogeveen

            I sampled the stack in MSVC: it’s hitting the timeout at [1] over and over again. Perhaps it’s calling this too early during startup? The stack looks like this:

            > dmd.dll!EnsureWalkThreadReady() Line 306 C++
            dmd.dll!NS_StackWalk(void (unsigned int, void *, void *, void *) * aCallback, unsigned int aSkipFrames, unsigned int aMaxFrames, void * aClosure, unsigned int aThread, void * aPlatformData) Line 515 C++
            dmd.dll!mozilla::dmd::StackTrace::Get(mozilla::dmd::Thread * aT) Line 744 C++
            dmd.dll!mozilla::dmd::AllocCallback(void * aPtr, unsigned int aReqSize, mozilla::dmd::Thread * aT) Line 1125 C++
            dmd.dll!replace_malloc(unsigned int aSize) Line 1200 C++
            mozglue.dll!malloc_impl(unsigned int size) Line 152 C

            [1] http://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsStackWalk.cpp#306

          • Nicholas Nethercote

            Using --sample-below=1 causes DMD to request many more stack traces. But otherwise, things are much the same. It shouldn’t be requesting stack traces any earlier than when sampling is used. So I don’t understand what’s happening there.

            I don’t like the Windows stack-walking code 🙁

          • Emanuel Hoogeveen

            Well, I think it’s happening without --sample-below=1 as well – just a lot less, two or three times in total – so I think there just aren’t a lot of big allocations during startup.