
Please grow your buffers exponentially

If you record every heap allocation and re-allocation done by Firefox you find some interesting things. In particular, you find some sub-optimal buffer growth strategies that cause a lot of heap churn.

Think about a data structure that involves a contiguous, growable buffer, such as a string or a vector. If you append to it and it doesn’t have enough space for the appended elements, you need to allocate a new buffer, copy the old contents to the new buffer, and then free the old buffer. realloc() is usually used for this, because it does these three steps for you.
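
In case a concrete example helps, here is roughly what such an append looks like. This is a hypothetical sketch, not Firefox code; the Buf type and the append() and computeNewCapacity() names are invented for the example.

    #include <cstdlib>

    struct Buf {
      char*  data = nullptr;
      size_t len = 0;   // bytes in use
      size_t cap = 0;   // bytes allocated
    };

    // How much bigger to make the buffer is the crucial question discussed
    // below; this version uses the "just enough" answer.
    static size_t computeNewCapacity(size_t oldCap, size_t needed) {
      return oldCap + needed;
    }

    bool append(Buf& b, char byte) {
      if (b.len == b.cap) {
        size_t newCap = computeNewCapacity(b.cap, 1);
        // realloc() does the allocate/copy/free dance in one call.
        char* p = static_cast<char*>(std::realloc(b.data, newCap));
        if (!p) {
          return false;
        }
        b.data = p;
        b.cap = newCap;
      }
      b.data[b.len++] = byte;
      return true;
    }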

The crucial question: when you have to grow a buffer, how much do you grow it? One obvious answer is “just enough for the new elements”. That might seem space-efficient at first glance, but if you have to repeatedly grow the buffer it can quickly turn bad.

Consider a simple but not outrageous example. Imagine you have a buffer that starts out 1 byte long and you add single bytes to it until it is 1 MiB long. If you use the “just-enough” strategy you’ll cumulatively allocate this much memory:

1 + 2 + 3 + … + 1,048,575 + 1,048,576 = 549,756,338,176 bytes
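
That total is just the arithmetic series sum, which grows quadratically in the final size n:

\[
\sum_{k=1}^{n} k = \frac{n(n+1)}{2} = \frac{2^{20}\,(2^{20}+1)}{2} = 549{,}756{,}338{,}176
\qquad \text{for } n = 2^{20} = 1{,}048{,}576.
\]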

Ouch. O(n²) behaviour really hurts when n gets big enough. Of course the peak memory usage won’t be nearly this high, but all that reallocating and copying will be slow.

In practice it won’t be this bad because heap allocators round up requests, e.g. if you ask for 123 bytes you’ll likely get something larger like 128 bytes. The allocator used by Firefox (an old, extensively-modified version of jemalloc) rounds up all requests between 4 KiB and 1 MiB to the nearest multiple of 4 KiB. So you’ll actually allocate approximately this much memory:

4,096 + 8,192 + 12,288 + … + 1,044,480 + 1,048,576 = 134,742,016 bytes

(This ignores the sub-4 KiB allocations, which in total are negligible.) Much better. And if you’re lucky the OS’s virtual memory system will do some magic with page tables to make the copying cheap. But still, it’s a lot of churn.
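
If you want to check that figure, here’s a tiny stand-alone sketch. The roundUp() function models the 4 KiB size classes described above (with the same “ignore the sub-4 KiB classes” simplification); it isn’t jemalloc code.

    #include <cstdio>
    #include <cstdint>

    // Round a request up to the next multiple of 4 KiB.
    static uint64_t roundUp(uint64_t n) { return (n + 4095) / 4096 * 4096; }

    int main() {
      const uint64_t finalSize = 1 << 20;   // 1 MiB
      uint64_t total = 0;
      uint64_t prevClass = 0;
      for (uint64_t n = 1; n <= finalSize; n++) {    // grow one byte at a time
        uint64_t c = roundUp(n);
        if (c != prevClass) {   // only a bigger size class costs a new allocation
          total += c;
          prevClass = c;
        }
      }
      printf("%llu\n", (unsigned long long)total);   // prints 134742016
    }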

A strategy that is usually better is exponential growth. Doubling the buffer each time is the simplest strategy:

4,096 + 8,192 + 16,384 + 32,768 + 65,536 + 131,072 + 262,144 + 524,288 + 1,048,576 = 2,093,056 bytes

That’s more like it; the cumulative size is just under twice the final size, and the series is short enough now to write it out in full, which is nice — calls to malloc() and realloc() aren’t that cheap because they typically require acquiring a lock. I particularly like the doubling strategy because it’s simple and it also avoids wasting usable space due to slop.
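
In terms of the hypothetical sketch above, switching to doubling is just a different computeNewCapacity(). The 64-byte floor here is an arbitrary choice for the example, and a real implementation would also want to guard against overflow.

    // Doubling policy for the earlier hypothetical sketch.
    static size_t computeNewCapacity(size_t oldCap, size_t needed) {
      size_t newCap = oldCap ? oldCap * 2 : 64;
      return newCap >= oldCap + needed ? newCap : oldCap + needed;
    }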

Recently I’ve converted “just enough” growth strategies to exponential growth strategies in XDRBuffer and nsTArray, and I also found a case in SQLite that Richard Hipp has fixed. These pieces of code now match numerous places that already used exponential growth: pldhash, JS::HashTable, mozilla::Vector, JSString, nsString, and pdf.js.

Pleasingly, the nsTArray conversion had a clear positive effect. Not only did the exponential growth strategy reduce the amount of heap churn and the number of realloc() calls, it also reduced heap fragmentation: the “heap-overhead” part of the purple measurement on AWSY (a.k.a. “RSS: After TP5, tabs closed [+30s, forced GC]”) dropped by 4.3 MiB! This makes sense if you think about it: an allocator can fulfil power-of-two requests like 64 KiB, 128 KiB, and 256 KiB with less waste than it can awkward requests like 244 KiB, 248 KiB, 252 KiB, etc.

So, if you know of some more code in Firefox that uses a non-exponential growth strategy for a buffer, please fix it, or let me know so I can look at it. Thank you.

23 replies on “Please grow your buffers exponentially”

Another useful trick: grow your buffers exponentially, then once you’re “done” putting data in them, shrink them back to the right size.

…where shrinking involves allocating a buffer of exactly the right size and copying the data over. It saves memory, but it’s slower: you end up allocating the data twice.
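
In terms of the hypothetical Buf sketch near the top of the post, the shrink step is just another realloc(), which, as noted, may itself allocate and copy:

    // Shrink the earlier hypothetical Buf back to its exact size.
    void shrinkToFit(Buf& b) {
      if (b.len > 0 && b.cap > b.len) {
        if (char* p = static_cast<char*>(std::realloc(b.data, b.len))) {
          b.data = p;
          b.cap = b.len;
        }
      }
    }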

Yes, I’ve heard this before too, based on the idea that you can reuse some of the previously freed allocations for some of the later allocations. But this assumes that the allocator is able to combine adjacent, different-sized freed blocks into larger blocks later on, and as I understand it, jemalloc segregates the heap in such a way that such combinations are impossible.

Strictly speaking, exponential growth with an arbitrary factor will work in about the same way, at least compared to “just enough” growth (by a constant). The differences between the various factors are much more subtle and depend on exactly how the allocator is implemented (e.g. whether it reuses the allocated blocks).

For example, if you grow the buffer by a factor of K, you’ll waste (K−1)/(2K) of the buffer on average, assuming a uniform distribution of request sizes of course. So with a growth factor of 2 you’ll waste 1/4 of the buffer on average: e.g. the average size of requests between 1024 B and 2048 B is 1536 B, so 25% of the 2048 B buffer is wasted. By lowering the growth factor the amount of wasted space gets smaller: with a growth factor of 1.5 you only waste about 1/6 of the buffer (roughly 16.7%) on average.
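
A sketch of that calculation, assuming the final buffer has capacity C and the live size n is uniformly distributed between the previous capacity C/K and C:

\[
\mathbb{E}[\text{waste}] = C - \mathbb{E}[n] = C - \frac{C/K + C}{2} = \frac{K-1}{2K}\,C,
\]

which comes to C/4 for K = 2 and C/6 (about 16.7%) for K = 1.5.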

But does that result in lower memory usage on average? Not necessarily. For example, if the allocator keeps the allocated chunks for reuse, it usually does so by matching the buffer size: if you ask for 100 B, the appropriate chunk size is 128 B, so the allocator checks whether it already has a free chunk of this size and either reuses it or allocates a new one. With a lower growth factor there will be many more “sizes” and the hit ratio will get worse, so while a lower growth factor results in less waste at the block level, it may easily result in more allocated memory in total. Similarly for realloc(): a lower growth factor means hitting the current buffer size more frequently.

So the exact growth factor is just half the story – the other parts of the allocator are very important too. And it’s impossible to come up with a perfect allocator, working equally well for all possible workloads.

I wrote a series of posts about the overhead when using palloc (which is the PostgreSQL allocator interface) a few weeks ago, giving a few more examples: http://blog.pgaddict.com/posts/palloc-overhead-examples

In the case of Firefox, could defaulting realloc to automatically do this help?

(i.e. if I ask for 300 bytes then give me 512. If I ask for 700 give me 1024)

Or does realloc get used for things other than buffers?

Yeah, it does get used for other things. No single strategy will fit all cases.

Modifying the standard C library is almost always a bad idea.

I think the problem here is people reinventing wheels. Unless you know what you’re doing, implementing container types from scratch is a remarkably bad idea.

Switching to two different exponential ratios for reallocation, one for growing and one for shrinking, is one thing. Tuning those parameters is another. If we’re not doing that, and it seems we obviously haven’t been, we really should make nsTArray wrap std::vector.

Note that the point of exponential reallocation is to ensure amortized O(1) complexity per insertion.
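
The usual argument, for the doubling case: over n appends, the copies done at each doubling sum to a geometric series,

\[
\frac{n}{2} + \frac{n}{4} + \frac{n}{8} + \cdots < n,
\]

so the copying work averages out to O(1) per append.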

It’s really surprising/scary that Firefox didn’t already do this. Wow, just Wow…

That can be problematic too. That sort of allocation was responsible for one of the first things fixed by MemShrink a few years ago. NJN’s “clownshoe” allocation bugs were cases where a developer asked for, e.g., a 1024-byte container (forgetting that the container had a few bytes of overhead) and the allocator rounded the resulting 1028-byte request up to 2048 bytes.

Note: I haven’t looked into exactly why, but libgobject reallocates at least some things in increments of 8 bytes (I spotted that in allocation logs of Firefox startup on Linux).

Facebook’s FBVector seems to have been carefully tuned, with and without jemalloc – see https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md . You can see its growth strategy at https://github.com/facebook/folly/blob/8470cb6ed9524328bffa359522dba32b5f12e746/folly/FBVector.h#L1150 . In summary:

– If the size is small enough that jemalloc won’t resize in place (i.e. less than 4096 bytes), then double the size.

– If the size is bigger than that, it resizes to (old size * 3 + 1) / 2, i.e. multiplies by 1.5 (I guess the +1 is there to make it round up).
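
A rough sketch of that policy as described in this comment (a paraphrase for illustration, not folly’s actual code):

    #include <cstddef>

    // Paraphrase of the growth policy described above; not folly's code.
    size_t fbvectorLikeGrowth(size_t oldCapBytes) {
      if (oldCapBytes < 4096) {
        return oldCapBytes * 2;           // small buffers: double
      }
      return (oldCapBytes * 3 + 1) / 2;   // larger buffers: grow by ~1.5x, rounded up
    }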

A lot of us have been saying for many years now that Firefox has some serious performance problems, and does not use memory effectively.

Yet we were told, “No! These problems don’t exist!” or we were told, “They’ve all been fixed!”

Now I read this informative article, and it confirms what many of us were saying all along: Firefox does have serious performance problems, and it does not use memory effectively.

I am very pleased to see these problems are getting fixed, but it pains me greatly that it took so long! If only the Firefox developers had listened to us users so many years ago, these issues could have been dealt with then, rather than now!

May this be a lesson going forward: if users say there is a problem, there most likely is!

Jakov: the problem is that the vast majority of those complaints are useless because, like your comment, there’s no technical detail which can be used to see where a problem is. In some cases, the problems were external or measurement error (e.g. Windows Update chewing 100% CPU in the background -> “Firefox is slow!”) and in many cases the problems were caused by a poorly designed extension such as Ad Block Plus which used a large amount of data per tab rather than something the Firefox developers have control over.

If you actually want to help, it’s a lot more likely to be fixed if you can write a detailed bug report showing a repeatable way to trigger the problem, confirming that it happens without any extensions enabled, and including before and after about:memory reports to help the developers diagnose it. In my experience the Mozilla developers are extremely motivated to fix problems as long as you don’t expect them to be magicians who can figure out what “Firefox is using too much memory” meant to you.

The doubling strategy is what Go does for its goroutine stacks these days, due to the same/similar problem.

This is true, but the recent change was about something else. The stacks used to be segmented (discontiguous), but that led to random inefficiencies when a boundary between stack segments was crossed back and forth very often. Thus stacks were made contiguous (and were made to grow exponentially).

Jakov, that doesn’t seem like a very accurate sketch of reality.

1. You’re posting this on the blog of someone who, within those “many years”, spearheaded a vigorous, public, and successful programme to improve Firefox’s memory usage. That doesn’t sound like denying that memory-use problems existed. It sounds like the exact reverse.

2. You write as if NN is now saying something like “yes, there are serious problems that we have been denying for years”, but he isn’t. There is a big difference between “there are some bits of our code that do something suboptimal” and “we have serious performance problems”. This article does not confirm what you say it does.

3. Empirically, Firefox’s memory usage these days is somewhat better than, say, Chrome’s.

Exponential growth still sucks from a spatial efficiency standpoint, wasting a large fraction of memory whenever you push a byte onto the back of a full buffer.

The main reason we actually get away with this strategy for large allocations is that virtual memory paging means an allocation isn’t really happening until you scribble into the next 4K page and fault. Once those pages are dirtied, though, anything that flip-flops in size is going to unavoidably cause bloat, massively disincentivizing keeping buffers around for reuse. This in turn means more memory fragmentation and the need for smarter, fragmentation-resistant allocators.

IMHO it’s a nasty cycle that stems from using overly simplistic data structures in the first place for things that need to be highly elastic.

All these cases start with a small size like 8 bytes and double from there: 16, 32, 64, 128, etc. I omitted the smaller numbers from the series just to simplify the presentation. I even explained this: “This ignores the sub-4 KiB allocations, which in total are negligible.”

It really depends on the use case. The best universal approach is exponential growth. But since our hardware resources are not infinite, you should not keep this strategy when the empty part of the buffer is much larger than the new elements you will insert. Please think about it and find a strategy that doesn’t only look at the allocation counts, but also at the amount of wasted resources in relation to the available resources.

A very simple strategy that is already better than simply doubling would be to keep doubling until your buffer takes up more than 5% of the system’s resources, and to switch to a linear strategy at that point.
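
A minimal sketch of that idea; the budget, the 5% threshold, and the linear increment are placeholders taken from this suggestion, not anything Firefox actually uses:

    #include <cstddef>

    // Double until the buffer exceeds a fraction of some memory budget, then
    // grow by a fixed chunk instead. All numbers are illustrative placeholders;
    // the budget is assumed to be at least a few KiB.
    size_t cappedGrowth(size_t oldCap, size_t budget) {
      const size_t threshold = budget / 20;    // ~5% of the budget
      if (oldCap < threshold) {
        return oldCap ? oldCap * 2 : 64;       // exponential phase
      }
      return oldCap + threshold;               // linear phase
    }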

Please keep thinking about it and I’m sure you will come up with an even better solution.

I wanted to comment on the ‘slimmer pdf.js’ post, but comments there seem to be closed.

I’m trying to modify PDF.js to basically disable all cache. I submitted an issue here: https://github.com/mozilla/pdf.js/issues/5469

It’d be great if you could tell me where to start or what to look for; it’s OK if you can’t/won’t, though.
