Category Archives: Memory allocation

Cumulative heap profiling in Firefox with DMD

DMD is a tool that I originally created to help identify where new memory reporters should be added to Firefox in order to reduce the “heap-unclassified” measurement in about:memory. (The name is actually short for “Dark Matter Detector”, because we sometimes call the “heap-unclassified” measurement “dark matter“.)

Recently, I’ve modified DMD to become a more general heap profiling tool. It now has three distinct modes of operation.

  1. “Dark matter”: this mode gives you DMD’s original behaviour.
  2. “Live”: this mode tracks all the live blocks on the system heap, and lets you take snapshots at particular points in time.
  3. Cumulative“: this mode tracks all the blocks that have ever been allocated on the system heap, and so gives you information about all the allocations done by Firefox during an entire session.

Most memory profilers (including as about:memory) are snapshot-based, and so work much like DMD’s “live” mode. But “cumulative” mode is equally interesting.

In particular, unlike “live” mode, “cumulative” mode tells you about parts of the code that are responsible for allocating many short-lived heap blocks (sometimes called “heap churn”). Such allocations can hurt performance: allocations and deallocations themselves aren’t free, particularly because they require obtaining a global lock; such code often involves unnecessary initialization or copying of heap data; and if these allocations come in a variety of sizes they can cause additional heap fragmentation.

Another nice thing about cumulative heap profiling is that, unlike live heap profiling, you don’t have to decide when to take snapshots. You can just profile an entire workload of interest and get the results at the end.

I’ve used DMD’s cumulative mode to find inefficiencies in SpiderMonkey’s source compression  and code generation, SQLite, NSS, nsTArray, XDR encoding, Gnome start-up, IPC messaging, nsStreamLoader, cycle collection, and safe-browsing. There are “start doing something clever” optimizations and then there are “stop doing something stupid” optimizations, and every one of these fixes has been one of the latter. Each change has avoided cumulative heap allocations ranging from tens to thousands of MiBs.

It’s typically difficult to quantify any speed-ups from these changes, because the workloads are noisy and non-deterministic, but I’m convinced that simple changes to fix these issues are worthwhile. For one, I used cumulative profiling (via a different tool) to drive the major improvements I made to pdf.js earlier this year. Also, Chrome developers have found that “Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag”.

So far I have only profiled a few simple workloads. There are all sorts of things I haven’t tried: text-heavy pages, image-heavy pages, audio and video, WebRTC, WebGL, popular benchmarks… the list goes on. I intend to do more profiling and fix things where I can, but it would be great to have help from domain experts with this stuff. If you want to try out cumulative heap profiling in Firefox, please read the DMD instructions and feel free to ask me for help. In particular, I now have a good feel for which hot allocations are unavoidable and reasonable — plenty of them, of course — and which are avoidable. Let’s get rid of the avoidable ones.

Better documentation for memory profiling and leak detection tools

Until recently, the documentation for all of Mozilla’s memory profiling and leak detection tools had some major problems.

  • It was scattered across MDN, the Mozilla Wiki, and the Mozilla archive site (yes, really).
  • Documentation for several tools was spread across multiple pages.
  • Documentation for some tools was meagre, non-existent, or overly verbose.
  • Some of the documentation was out of date, e.g. describing tools that no longer exist.

A little while back I fixed these problems.

  • The documentation for these tools is now all on MDN. If you look at the MDN Performance page in the “Memory profiling and leak detection tools” section, you’ll see a brief description of each tool that explains the circumstances in which it is useful, and a link to the relevant documentation.
  • The full list of documented tools includes: about:memory, DMD, areweslimyet.com, BloatView, Refcount tracing and balancing, GC and CC logs, Valgrind, LeakSanitizer, Apple tools, TraceMalloc, Leak Gauge, and LogAlloc.
  • As well as consolidating all the pages in one place, I also improved some of the pages (with the help of people like Andrew McCreight). In particular, about:memory now has reasonably detailed documentation, something it has lacked until now.

Please take a look, and if you see any problems let me know. Or, if you’re feeling confident just fix things yourself! Thanks.

Please grow your buffers exponentially

If you record every heap allocation and re-allocation done by Firefox you find some interesting things. In particular, you find some sub-optimal buffer growth strategies that cause a lot of heap churn.

Think about a data structure that involves a contiguous, growable buffer, such as a string or a vector. If you append to it and it doesn’t have enough space for the appended elements, you need to allocate a new buffer, copy the old contents to the new buffer, and then free the old buffer. realloc() is usually used for this, because it does these three steps for you.

The crucial question: when you have to grow a buffer, how much do you grow it? One obvious answer is “just enough for the new elements”. That might seem  space-efficient at first glance, but if you have to repeatedly grow the buffer it can quickly turn bad.

Consider a simple but not outrageous example. Imagine you have a buffer that starts out 1 byte long and you add single bytes to it until it is 1 MiB long. If you use the “just-enough” strategy you’ll cumulatively allocate this much memory:

1 + 2 + 3 + … + 1,048,575 + 1,048,576 = 549,756,338,176 bytes

Ouch. O(n2) behaviour really hurts when n gets big enough. Of course the peak memory usage won’t be nearly this high, but all those reallocations and copying will be slow.

In practice it won’t be this bad because heap allocators round up requests, e.g. if you ask for 123 bytes you’ll likely get something larger like 128 bytes. The allocator used by Firefox (an old, extensively-modified version of jemalloc) rounds up all requests between 4 KiB and 1 MiB to the nearest multiple of 4 KiB. So you’ll actually allocate approximately this much memory:

4,096 + 8,192 + 12,288 + … + 1,044,480 + 1,048,576 = 134,742,016 bytes

(This ignores the sub-4 KiB allocations, which in total are negligible.) Much better. And if you’re lucky the OS’s virtual memory system will do some magic with page tables to make the copying cheap. But still, it’s a lot of churn.

A strategy that is usually better is exponential growth. Doubling the buffer each time is the simplest strategy:

4,096 + 8,192 + 16,384 + 32,768 + 65,536 + 131,072 + 262,144 + 524,288 + 1,048,576 = 2,093,056 bytes

That’s more like it; the cumulative size is just under twice the final size, and the series is short enough now to write it out in full, which is nice — calls to malloc() and realloc() aren’t that cheap because they typically require acquiring a lock. I particularly like the doubling strategy because it’s simple and it also avoids wasting usable space due to slop.

Recently I’ve converted “just enough” growth strategies to exponential growth strategies in XDRBuffer and nsTArray, and I also found a case in SQLite that Richard Hipp has fixed. These pieces of code now match numerous places that already used exponential growth: pldhash, JS::HashTable, mozilla::Vector, JSString, nsString, and pdf.js.

Pleasingly, the nsTArray conversion had a clear positive effect. Not only did the exponential growth strategy reduce the amount of heap churn and the number of realloc() calls, it also reduced heap fragmentation: the “heap-overhead” part of the purple measurement on AWSY (a.k.a. “RSS: After TP5, tabs closed [+30s, forced GC]”) dropped by 4.3 MiB! This makes sense if you think about it: an allocator can fulfil power-of-two requests like 64 KiB, 128 KiB, and 256 KiB with less waste than it can awkward requests like 244 KiB, 248 KiB, 252 KiB, etc.

So, if you know of some more code in Firefox that uses a non-exponential growth strategy for a buffer, please fix it, or let me know so I can look at it. Thank you.

MemShrink progress, week 31

Lots of good stuff happened this week in MemShrink-land.

Necko Buffer Cache

I removed the Necko buffer cache.  This cache used nsRecyclingAllocator to delay the freeing of up to 24 short-lived 32KB chunks (24 x 16KB on Mobile/B2G) in the hope that they could be reused soon.  The cache would fill up very quickly and the chunks wouldn’t be freed unless zero re-allocations occurred for 15 minutes.  This idea is to avoid malloc/free churn, but the re-allocations weren’t that frequent, and heap allocators like jemalloc tend to handle cases like this pretty well, so performance wasn’t affected noticeably.  Removing the cache has the following additional benefits.

  • nsRecyclingAllocator added a clownshoe-ish extra word to each chunk, which meant that the 32KB (or 16KB) allocations were being rounded up by jemalloc to 36KB (or 20KB), resulting in 96KB (24 x 4KB) of wasted memory.
  • This cache wasn’t covered by memory reporters, so about:memory’s “heap-unclassified” number has dropped by 864KB (24 x 36KB) on desktop and 480KB (24 x 20KB) on mobile.

I also removed the one other use of nsRecyclingAllocator in libjar, for which this recycling behaviour also had no noticeable effect.  This meant I could then remove nsRecyclingAllocator itself.  Taras Glek summarized things nicely:  “I’m happy to see placebos go away”.

Other Stuff

I gave a talk at the linux.conf.au Browser MiniConf entitled “Notes on Reducing Firefox’s Memory Consumption”, which got some attention on Slashdot.

Henrik Skupin reported that Ghostery 2.6.2 and 2.7beta2 have memory leaks (zombie compartments).  I contacted the authors today and they said they are looking into the problem, so hopefully we’ll see a fix soon.

Gian-Carlo Pascutto reworked the code that downloads the safe browsing database to use less memory, and to have a fallback if memory runs out.  This memory usage is transient and so the main benefit is that it prevents some out-of-memory crashes that were happening frequently.

Last week I mentioned bug 703427, which held up the possibility of a simple, big reduction in SQLite memory usage.  Marco Bonardo did some analysis, and unfortunately the patch caused large increases in the number of disk accesses, and so it won’t work.  A shame.

Kyle Huey fixed a zombie compartment that occurred when right-clicking on a single-line textbox.  The fun thing about this was that in only 3 hours and 35 minutes, the following events happened: the bug report was filed, the problem was confirmed by two people, the bug report was re-categorized into the appropriate component, a patch was posted, the patch was reviewed, the patch landed on mozilla-central, and the bug report was marked as fixed!  And approval for back-porting to Aurora was granted 8 hours later.  Not bad.  Kyle has also made progress on a more frequent zombie compartment caused by searching for text.

Jonathan Kew made the shaped-word caches (which are involved in text rendering) discard their data on memory pressure events.

Quote of the week

A commenter on my blog named jas said (the full comment is here):

a year ago, FF’s memory usage was about 10x what chrome was using in respect to the sites we were running…

so we have switched to chrome…

i just tested FF 9.0.1 against chrome, and it actually is running better than chrome in the memory department, which is good. but, it’s not good enough to make us switch back (running maybe 20% better in terms of memory). a tenfold difference would warrant a switch. in our instance, it was too little, too late.

but glad you are making improvements.

So that’s good, I guess?

I also like this comment from the aforementioned Slashdot thread!

Bug Counts

Here are the current bug counts.

  • P1: 24 (-3/+1)
  • P2: 131 (-8/+7)
  • P3: 69 (-1/+3)
  • Unprioritized: 3 (-3/+2)

That’s a net drop of two, largely because I went through and closed some P1 and P2 bugs that were stale or had been fixed elsewhere.

Notes on Reducing Firefox’s Memory Consumption

I gave a talk yesterday at the linux.conf.au Browser MiniConf, held in Ballarat, Australia.  Its title was “Notes On Reducing Firefox’s Memory Consumption”.

Below are the slides and notes in a SlideShare embedding. If you find that embedding problematic (some people do) you may prefer to download the PDF version directly.

MemShrink progress, week 9

Firefox 8 graduated to the Aurora channel this week, and the development period for what will become Firefox 9 began.  Lots of MemShrink activity happened this week, and I think all the changes listed below will make it into Firefox 8.

Avoiding Wasted Memory

I have blogged previously about memory wasted by “clownshoes” bugs.   Ed Morley found a webpage that resulted in 700MB of memory being wasted by the PLArena clownshoes bug.  Basically, on platforms where jemalloc is used (Windows, Linux), half the memory allocated by nsPresArena (which is built on top of PLArena) was wasted.  (On Mac the waste was 11%, because the Mac allocator rounds up less aggressively than jemalloc).

Fixing this problem properly for all PLArenas takes time because it requires changes to NSPR, so I made a spot-fix for the nsPresArena case.  This is a particularly big win on very large pages, but it saves around 3MB even on Gmail. This spot-fix has been granted beta approval and so will, barring disaster, make it into Firefox 7.

A Firefox Nightly user did some measurements with different browsers on the problematic page:

  • Firefox 8.0a1 before patch: 2.0 GB
  • Firefox 8.0a1 after patch: 1.3 GB
  • Latest Chrome canary build and dev (15.0.849.0): 1.1GB
  • Webkit2Process of Safari 5.1: 1.05 GB
  • Internet Explorer 9.0.2: 838 MB
  • Latest Opera Next 12.00: 727 MB

So this fix gets Firefox within spitting distance of other browsers, which is good!

In other developments related to avoiding wasted memory:

  • Luke Wagner discovered that, on typical websites, most JSScripts are byte-compiled but never run.  A JSScript roughly corresponds to a JavaScript function.  In hindsight, it’s not such a surprising result — Firefox byte-compiles all loaded JavaScript code, and you can imagine lots of websites use libraries like jQuery but only use a small fraction of the functions in the library.  Making byte-compilation lazy could potentially save MBs of memory per compartment.  But that will require some non-trivial reworking of the JS engine, and so is unlikely to happen in the short-term.
  • Kyle Huey avoided a small amount (~100KB per browser process) of waste due to rounding up in XPT arenas.

Improving about:memory

I made some progress on a Valgrind tool to help identify the memory that is currently reported only by the “heap-unclassified” bucket in about:memory.  It’s called “DMD”, short for “Dark Matter Detector”.  It’s in early stages and I still need to teach it about most of Firefox’s memory reporters, but it’s already spitting out useful data, which led to me and Ehsan Akhgari landing memory reporters for the JS atom table and the Hunspell spell checker.  We also now have some insight (here and here) about memory usage for very large pages.

Mounir Lamouri turned on the memory reporter for the DOM that he’s been working on for some time.  This shows up under “dom” in about:memory.  There are still some cases that require handling;  you can follow the progress of these here.

Andrew McCreight replaced about:memory’s buttons so you can force a cycle collection without also forcing a garbage collection, which may be useful in hunting down certain problems.

Finally, Sander van Veen added the existing “js-compartments-user” and “js-compartments-system” to the statistics collected by telemetry (his first landed patch!), and I did likewise for the “storage/sqlite” reporter.  I also added a new “tjit-data/trace-monitor” memory reporter that accounts for some of the memory used by TraceMonkey.

Miscellaneous

Igor Bukanov tweaked the handling of empty chunks by the JavaScript garbage collector.  That sounds boring until you see the results on Gregor Wagner’s 150-tab stress test: resident memory usage dropped 9.5% with all 150 tabs open, and dropped by 27% after all those tabs were closed.

Brian Hackett fixed a memory leak in type inference, which gets it one step closer to being ready to land.

Christian Höltje fixed a leak in his “It’s All Text” add-on that was causing zombie compartments.  This fix will be in version 1.6.0, which is currently awaiting to receive AMO approval, but can be obtained here in the meantime.  This fix and last week’s fix of a memory leak in LastPass are very encouraging — per-compartment reporters in about:memory have, for the first time, given add-on developers a reasonable tool for identifying memory leaks.  I hope we can continue to improve the situation here.  Several people have asked me for documentation on how to avoid memory leaks in add-ons.  I’m not the person to write that guide (I’m not a Gecko expert and I know almost nothing about add-ons) but hopefully someone else can step up to the plate.

Bug counts

Here’s the change in MemShrink bug counts.

  • P1: 30 (-0, +1)
  • P2: 64 (-4, +6)
  • P3: 36 (-5, +0)
  • Unprioritized: 1 (-2, +1)

Good progress on P3 bugs, but they’re the least important ones.  Other than that, new bugs are still being reported faster than they’re being fixed.  If you want to help but don’t know where to start, feel free to email me or ping me on IRC and I’ll do my best to help get you involved.

 

Clownshoes available in sizes 2^10+1 and up!

I’ve been working a lot on about:memory lately.  It’s a really useful tool, but one frustrating thing about it is that a decent chunk of our memory usage is not covered by any of the existing memory reporters, and so is falling into the “heap-unclassified” bucket:

This bucket typically accounts for 30–45% of about:memory’s “explicit” measurement.  We’ve discussed this “dark matter” several times in MemShrink meetings.  There’s lots of work underway to add new reporters to uninstrumented parts of Gecko, but I’ve seen cases where it looks like a decent chunk of the missing memory is due to the JS engine, which is already seemingly thoroughly instrumented.

This week I realized that some of the dark matter could be due to “slop” from jemalloc, the browser’s heap allocator.  What is slop?  When you ask a heap allocator for a certain number of bytes, it’ll often give you back more than you asked for, rounding the size up.  This wastes some memory, but there are good reasons for it — it makes the heap allocator much faster and simpler, and helps avoid fragmentation when the memory is freed.

The following comment from jemalloc.c shows jemalloc’s size classes.  Any request that’s not for one of the listed sizes below is rounded up to the nearest size.

*   |=====================================|
*   | Category | Subcategory    |    Size |
*   |=====================================|
*   | Small    | Tiny           |       2 |
*   |          |                |       4 |
*   |          |                |       8 |
*   |          |----------------+---------|
*   |          | Quantum-spaced |      16 |
*   |          |                |      32 |
*   |          |                |      48 |
*   |          |                |     ... |
*   |          |                |     480 |
*   |          |                |     496 |
*   |          |                |     512 |
*   |          |----------------+---------|
*   |          | Sub-page       |    1 kB |
*   |          |                |    2 kB |
*   |=====================================|
*   | Large                     |    4 kB |
*   |                           |    8 kB |
*   |                           |   12 kB |
*   |                           |     ... |
*   |                           | 1012 kB |
*   |                           | 1016 kB |
*   |                           | 1020 kB |
*   |=====================================|
*   | Huge                      |    1 MB |
*   |                           |    2 MB |
*   |                           |    3 MB |
*   |                           |     ... |
*   |=====================================|

In extreme cases, jemalloc will return almost double what you asked for.  For example, if you ask for 1,025 bytes, it’ll give you 2,048.  A lot of the time you have to just live with slop;  if you need to heap-allocate an object that’s 680 bytes, jemalloc will give you 1,024 bytes, and you just have to accept the 344 bytes of waste.  But if you have some flexibility in your request size, it’s a good idea to pick a size that’s a power-of-two, because that always gives you zero slop.  (Remember this, it’s important later on.)

So I instrumented jemalloc to print out an entry for every heap allocation, showing the requested amount, the actual allocated amount, and the resulting slop.  I then started up Firefox, opened Gmail, then shut down Firefox.  Next, I ran the resulting log through a wonderful little concordance-like script I have called “counts” which analyzes a file and tells you how many times each distinct line occurs.  Here’s the top ten lines of the output:

909062 numbers:
( 1 ) 205920 (22.7%, 22.7%): small:     24 ->     32 (     8 )
( 2 )  66162 ( 7.3%, 29.9%): small:     72 ->     80 (     8 )
( 3 )  61772 ( 6.8%, 36.7%): small:     40 ->     48 (     8 )
( 4 )  54386 ( 6.0%, 42.7%): small:   1056 ->   2048 (   992 )
( 5 )  48501 ( 5.3%, 48.0%): small:     18 ->     32 (    14 )
( 6 )  47668 ( 5.2%, 53.3%): small:     15 ->     16 (     1 )
( 7 )  24938 ( 2.7%, 56.0%): large:   4095 ->   4096 (     1 )
( 8 )  24278 ( 2.7%, 58.7%): small:     56 ->     64 (     8 )
( 9 )  13064 ( 1.4%, 60.1%): small:    104 ->    112 (     8 )
(10 )  12852 ( 1.4%, 61.6%): small:    136 ->    144 (     8 )

There were 909,062 lines, which means there were 909,062 allocations.  (I didn’t print information about frees.)  The most common line was “small: 24 -> 32 ( 8 )” which occurred 205,920 times, accounting for 22.7% of all the lines in the file.  The “small” refers to jemalloc’s class size categories (see above), and every allocation request of 24 bytes was rounded up to 32 bytes, resulting in 8 bytes of slop.

Looking through the list, I saw a case where 1,048,578 (2^20+2) bytes had been requested, and jemalloc had returned 2,097,152 (2^21) bytes.  That’s a huge (1MB) waste, and also smelled very fishy.  And on further inspection there were a lot of smaller but still suspicious cases where a number slightly larger than a power-of-two was being rounded up to the next power-of-two:  1,032 to 2048;  1,056 to 2,048;  2,087 to 4,096;  4,135 to 8,192, etc.

I investigated a number of these by adding some code to jemalloc to detect these suspicious request sizes, and then using Valgrind to print out a stack trace every time one occurred.  What I found was four distinct places in the codebase where the code in question had flexibility in the amount of memory it allocated, and so had tried to ask for a power-of-two, but had botched the request and thus ended up asking for slightly more than a power-of-two!

  • In js/src/vm/String.cpp, when concatenating two strings, the capacity of the resultant string is rounded up to the next power-of-two (unless the string is really big, in which case the rounding is less aggressive).  This is a reasonable thing to do, as it means that if a further concatenation occurs, there’s a good chance there’ll be space for the extra chars to be added in-line, avoiding an extra allocation and copy.  But the code first did its power-of-two round-up, and then added sizeof(jschar) (which is two bytes) to the size, to allow space for the terminating NULL char.  This code was responsible for the 1,048,578 byte request I mentioned earlier.  Luke Wagner fixed this bug with a minimal change earlier this week.  I was unable to easily measure the overall cost of this, but avoiding 1MB of slop with some frequency can only be a good thing.
  • nsprpub/lib/ds/plarena.c implements an arena allocator.  Each arena pool is given a size for each arena when it’s created, and the size passed in is almost always a power-of-two such as 1,024 or 4,096.  Which is good, except that the allocator then has to add a few more bytes onto the request because each arena has a header that holds book-keeping information.  When I start Firefox and load Gmail on my Linux64 box, I see that approximately 3MB of space is wasted because of this bug.  The fix is simple, it just requires that the arena payload be reduced by the size of the header;  my patch is awaiting review.  Sadly enough, this problem was identified 3.5 years ago but not fixed.
  • js/src/jsarena.cpp is almost identical to nsprpub/lib/ds/plarena.c and the story is the same:  it has the same problem; it was first identified 3.5 years ago but not fixed; and my patch is awaiting review.  This didn’t make much difference in practice because this problem had been separately identified for the particular JSArenaPool that allocates the most memory, and worked around using an ill-documented hack.
  • db/sqlite3/src/sqlite3.c is the best one of all.  SQLite is very careful about measuring its memory usage and provides an API to access those measurements.  But in order to measure its own memory usage accurately, it adds 8 bytes to every allocation request in order to store the requested size at the start of the resulting heap block.  In doing so, it converts a lot of 1,024 byte requests into 1,032 byte requests.  Ironically enough, the slop caused by storing the requested size rendered the stored size inaccurate.  I determined that as a result, SQLite is using roughly 1.15–1.20x more memory in Firefox than it thinks it is.  So, go look at the “storage/sqlite” number in about:memory and mentally adjust it accordingly.   (On my Mac laptop which has an old profile and 11 tabs open it’s currently 70MB, so there’s probably an additional 10–14MB of slop).  SQLite is an upstream product, and the authors are working on a fix.

Once all these fixes have landed, there will be two benefits.  First, Firefox’s memory usage will go down, which is good for everyone.  Second, the proportion of dark matter in about:memory will also drop, which makes our memory profiling more accurate and useful.

After that, converting existing reporters to use malloc_usable_size will account for much of the legitimate slop, further reducing the amount of dark matter. And then I have a master plan for identifying whatever’s left.

clownshoes

So what’s up with this post’s title?  Andreas Gal uses the term “clownshoes” to refer to code that is laughably bad.  When I found and filed the first of these bugs, in a fit of whimsy I put “[clownshoes]” in its whiteboard.  Only later did a realize how suitable this was: a clown’s shoes and these botched power-of-two-sized allocations both have lots of empty space in them, and thus make their owner less nimble than they could be.  Perfect.