Clownshoes available in sizes 2^10+1 and up!

I’ve been working a lot on about:memory lately.  It’s a really useful tool, but one frustrating thing about it is that a decent chunk of our memory usage is not covered by any of the existing memory reporters, and so is falling into the “heap-unclassified” bucket:

This bucket typically accounts for 30–45% of about:memory’s “explicit” measurement.  We’ve discussed this “dark matter” several times in MemShrink meetings.  There’s lots of work underway to add new reporters to uninstrumented parts of Gecko, but I’ve seen cases where it looks like a decent chunk of the missing memory is due to the JS engine, which is already seemingly thoroughly instrumented.

This week I realized that some of the dark matter could be due to “slop” from jemalloc, the browser’s heap allocator.  What is slop?  When you ask a heap allocator for a certain number of bytes, it’ll often give you back more than you asked for, rounding the size up.  This wastes some memory, but there are good reasons for it — it makes the heap allocator much faster and simpler, and helps avoid fragmentation when the memory is freed.

The following comment from jemalloc.c shows jemalloc’s size classes.  Any request that’s not for one of the listed sizes below is rounded up to the nearest size.

*   |=====================================|
*   | Category | Subcategory    |    Size |
*   |=====================================|
*   | Small    | Tiny           |       2 |
*   |          |                |       4 |
*   |          |                |       8 |
*   |          |----------------+---------|
*   |          | Quantum-spaced |      16 |
*   |          |                |      32 |
*   |          |                |      48 |
*   |          |                |     ... |
*   |          |                |     480 |
*   |          |                |     496 |
*   |          |                |     512 |
*   |          |----------------+---------|
*   |          | Sub-page       |    1 kB |
*   |          |                |    2 kB |
*   |=====================================|
*   | Large                     |    4 kB |
*   |                           |    8 kB |
*   |                           |   12 kB |
*   |                           |     ... |
*   |                           | 1012 kB |
*   |                           | 1016 kB |
*   |                           | 1020 kB |
*   |=====================================|
*   | Huge                      |    1 MB |
*   |                           |    2 MB |
*   |                           |    3 MB |
*   |                           |     ... |
*   |=====================================|

In extreme cases, jemalloc will return almost double what you asked for.  For example, if you ask for 1,025 bytes, it’ll give you 2,048.  A lot of the time you have to just live with slop;  if you need to heap-allocate an object that’s 680 bytes, jemalloc will give you 1,024 bytes, and you just have to accept the 344 bytes of waste.  But if you have some flexibility in your request size, it’s a good idea to pick a size that’s a power-of-two, because that always gives you zero slop.  (Remember this, it’s important later on.)

So I instrumented jemalloc to print out an entry for every heap allocation, showing the requested amount, the actual allocated amount, and the resulting slop.  I then started up Firefox, opened Gmail, then shut down Firefox.  Next, I ran the resulting log through a wonderful little concordance-like script I have called “counts” which analyzes a file and tells you how many times each distinct line occurs.  Here’s the top ten lines of the output:

909062 numbers:
( 1 ) 205920 (22.7%, 22.7%): small:     24 ->     32 (     8 )
( 2 )  66162 ( 7.3%, 29.9%): small:     72 ->     80 (     8 )
( 3 )  61772 ( 6.8%, 36.7%): small:     40 ->     48 (     8 )
( 4 )  54386 ( 6.0%, 42.7%): small:   1056 ->   2048 (   992 )
( 5 )  48501 ( 5.3%, 48.0%): small:     18 ->     32 (    14 )
( 6 )  47668 ( 5.2%, 53.3%): small:     15 ->     16 (     1 )
( 7 )  24938 ( 2.7%, 56.0%): large:   4095 ->   4096 (     1 )
( 8 )  24278 ( 2.7%, 58.7%): small:     56 ->     64 (     8 )
( 9 )  13064 ( 1.4%, 60.1%): small:    104 ->    112 (     8 )
(10 )  12852 ( 1.4%, 61.6%): small:    136 ->    144 (     8 )

There were 909,062 lines, which means there were 909,062 allocations.  (I didn’t print information about frees.)  The most common line was “small: 24 -> 32 ( 8 )” which occurred 205,920 times, accounting for 22.7% of all the lines in the file.  The “small” refers to jemalloc’s class size categories (see above), and every allocation request of 24 bytes was rounded up to 32 bytes, resulting in 8 bytes of slop.

Looking through the list, I saw a case where 1,048,578 (2^20+2) bytes had been requested, and jemalloc had returned 2,097,152 (2^21) bytes.  That’s a huge (1MB) waste, and also smelled very fishy.  And on further inspection there were a lot of smaller but still suspicious cases where a number slightly larger than a power-of-two was being rounded up to the next power-of-two:  1,032 to 2048;  1,056 to 2,048;  2,087 to 4,096;  4,135 to 8,192, etc.

I investigated a number of these by adding some code to jemalloc to detect these suspicious request sizes, and then using Valgrind to print out a stack trace every time one occurred.  What I found was four distinct places in the codebase where the code in question had flexibility in the amount of memory it allocated, and so had tried to ask for a power-of-two, but had botched the request and thus ended up asking for slightly more than a power-of-two!

  • In js/src/vm/String.cpp, when concatenating two strings, the capacity of the resultant string is rounded up to the next power-of-two (unless the string is really big, in which case the rounding is less aggressive).  This is a reasonable thing to do, as it means that if a further concatenation occurs, there’s a good chance there’ll be space for the extra chars to be added in-line, avoiding an extra allocation and copy.  But the code first did its power-of-two round-up, and then added sizeof(jschar) (which is two bytes) to the size, to allow space for the terminating NULL char.  This code was responsible for the 1,048,578 byte request I mentioned earlier.  Luke Wagner fixed this bug with a minimal change earlier this week.  I was unable to easily measure the overall cost of this, but avoiding 1MB of slop with some frequency can only be a good thing.
  • nsprpub/lib/ds/plarena.c implements an arena allocator.  Each arena pool is given a size for each arena when it’s created, and the size passed in is almost always a power-of-two such as 1,024 or 4,096.  Which is good, except that the allocator then has to add a few more bytes onto the request because each arena has a header that holds book-keeping information.  When I start Firefox and load Gmail on my Linux64 box, I see that approximately 3MB of space is wasted because of this bug.  The fix is simple, it just requires that the arena payload be reduced by the size of the header;  my patch is awaiting review.  Sadly enough, this problem was identified 3.5 years ago but not fixed.
  • js/src/jsarena.cpp is almost identical to nsprpub/lib/ds/plarena.c and the story is the same:  it has the same problem; it was first identified 3.5 years ago but not fixed; and my patch is awaiting review.  This didn’t make much difference in practice because this problem had been separately identified for the particular JSArenaPool that allocates the most memory, and worked around using an ill-documented hack.
  • db/sqlite3/src/sqlite3.c is the best one of all.  SQLite is very careful about measuring its memory usage and provides an API to access those measurements.  But in order to measure its own memory usage accurately, it adds 8 bytes to every allocation request in order to store the requested size at the start of the resulting heap block.  In doing so, it converts a lot of 1,024 byte requests into 1,032 byte requests.  Ironically enough, the slop caused by storing the requested size rendered the stored size inaccurate.  I determined that as a result, SQLite is using roughly 1.15–1.20x more memory in Firefox than it thinks it is.  So, go look at the “storage/sqlite” number in about:memory and mentally adjust it accordingly.   (On my Mac laptop which has an old profile and 11 tabs open it’s currently 70MB, so there’s probably an additional 10–14MB of slop).  SQLite is an upstream product, and the authors are working on a fix.

Once all these fixes have landed, there will be two benefits.  First, Firefox’s memory usage will go down, which is good for everyone.  Second, the proportion of dark matter in about:memory will also drop, which makes our memory profiling more accurate and useful.

After that, converting existing reporters to use malloc_usable_size will account for much of the legitimate slop, further reducing the amount of dark matter. And then I have a master plan for identifying whatever’s left.

clownshoes

So what’s up with this post’s title?  Andreas Gal uses the term “clownshoes” to refer to code that is laughably bad.  When I found and filed the first of these bugs, in a fit of whimsy I put “[clownshoes]” in its whiteboard.  Only later did a realize how suitable this was: a clown’s shoes and these botched power-of-two-sized allocations both have lots of empty space in them, and thus make their owner less nimble than they could be.  Perfect.

11 Responses to Clownshoes available in sizes 2^10+1 and up!

  1. Brilliant as ever Nicholas. I enjoy seeing these strides made in memory.

    As I see the different posts over time for memory reduction, it seems like there are different categories (some being rediscovered?) for reducing memory.
    I wonder if it would be helpful to create a sort of bible listing the areas of code to watch/analyze for likely (and historically probable) memory issues?
    This is no area of expertise so I could be off, but in recent times you guys have been on a quest to reduce memory and I would hate if in 5 or 10 years that some of the lessons learned now would be forgotten while there’s an opportunity now to mark the trail.

    I guess my bottom-line question is: What are we doing to prevent this from happening again?

  2. Amazing work Nicholas. You sure are working really hard to down these numbers. Keep up the good work. My best wishes for you.\
    Clownshoes is perfect for such bugs.

  3. Hey :-),

    Very interesting investigations and works. You have mentioned a bug from SQLite with a link to Mozilla’s bugstracker, but is there a link to SQLite’s bugstracker?

    Thanks :-).

  4. In regards to sqlite, this crossed my mind in the past but I never bothered to look deeper, so I’m going to go ahead and ask here:

    How is the page cache allocated?

    That’s by far the largest use of memory by the library. On disk, database pages are always power-of-two in size. So any header, including the 8-byte-size one, will cause this problem, unless the cache is allocated in larger blocks or something.

  5. To sort-of-answer my own question, it looks like the page cache allocations are done in pcache1Alloc and their size is “sizeof(PgHdr1) + pCache->szPage” where PgHdr1 is just a few bytes, so it seems this would hit the problem pretty hard.

    However the two main DBs in Fx (places and urlclassifier3) use 32KB pages, so it’d round up to 36KB and waste “only” about 12% memory.

    • Nicholas Nethercote

      Mark: you’re spot on! I did some more measurements, which you can see in https://bugzilla.mozilla.org/show_bug.cgi?id=676189#c15. pCache->szPage is 208 bytes more than 32768; that 208 bytes holds “metadata about the underlying database page on disk”. So we’ve got (a) the PgHdr1, and (b) the metadata, and (c) the size field all contributing to making the pcache1Alloc allocations a bit more than 32768 (33024 bytes to be exact), which results in 3804 bytes of slop. In practice I saw this account for 55–83% of SQLite’s total slop.

  6. Just want to say thank you so much for all the work your doing in the MemShrink effort. I have been following your blog for a while now and are eagerly waiting for every update.

    The memory problems with Firefox has many times almost forced me to switch. Your work has made the wait worth it. I am so looking forward to Firefox 7 when I will be able to enjoy this myself.

    /thomas

  7. Hi. A quick question. Do you know what memory allocator is used in the Android version? Is it jemalloc, too? Thanks.

  8. Clearly one developer on the Memshrink team is worth more than 2 on any other team. Unsung heroes!