Categories
MemShrink Performance Personal Rust

Farewell, Mozilla

Today is my last day working for Mozilla. I will soon be starting a new job with Apple.

I have worked on a lot of different things over my twelve years at Mozilla. Some numbers:

  • Three years as a contractor, and nine as an employee.
  • 4,441 commits to mozilla-central, 560 to rustc, 148 to rustc-perf, and smaller numbers to several other repositories.
  • 2,561 bugs filed in Bugzilla, 2,118 bugs assigned to me, 27,647 comments, 2,411 patches reviewed.
  • Three module peerages and one module ownership.
  • 277 blog posts.
  • Six managers and four managees, across three teams. (One of my managees later became my manager. Thankfully, it worked well!)
  • More trans-Pacific air miles than I want to count.

Two areas of work stand out for me.

  • I started the MemShrink project and for several years played the roles of tech lead, engineering project manager, engineer, and publicist. It changed Firefox’s memory consumption from its biggest technical weakness into a strength, and enabled the use of more processes in Electrolysis (for responsiveness) and Fission (for security).
  • My work on the Rust compiler, rustc-perf, and related profilers helped the compiler become roughly 2.5x faster over a three year period, and laid a foundation for ongoing future improvements.

I have a lot of memories, and the ones relating to these two projects are at the forefront. Thank you to everyone I’ve worked with. It’s been a good time.

As I understand it, this blog will stay up in read-only mode indefinitely. I will make a copy of all the posts and if it ever goes down I will rehost them at my personal site.

All the best to everyone.

Categories
Firefox Memory consumption MemShrink Programming

Slimmer and simpler static atoms

String interning is:

a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. The single copy of each string is called its intern.

In Firefox’s code we use the term atom rather than intern, and atom table rather than string intern pool. I don’t know why; those names have been used for a long time.

Furthermore, Firefox distinguishes between static atoms, which are those that are chosen at compile time and can be directly referred to via an identifier, and dynamic atoms, which are added on-demand at runtime. This post is about the former.
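
To make the terminology concrete, here is a minimal sketch of what interning a dynamic atom involves. This is illustrative C++ only — the Atom and AtomTable names are invented — not Firefox’s actual implementation.

#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Minimal sketch of string interning; not Firefox's actual code.
// Each distinct string gets exactly one Atom, so its characters are stored
// once and equality checks become pointer comparisons.
struct Atom {
  std::u16string mString;
  uint32_t mHash;
};

class AtomTable {
 public:
  // Return the unique Atom for aString, creating it on first use. This is a
  // dynamic atom in Firefox's terminology; static atoms are instead chosen
  // at compile time and baked into the binary.
  const Atom* Intern(const std::u16string& aString) {
    auto it = mTable.find(aString);
    if (it != mTable.end()) {
      return it->second.get();
    }
    uint32_t hash = uint32_t(std::hash<std::u16string>{}(aString));
    auto atom = std::make_unique<Atom>(Atom{aString, hash});
    const Atom* raw = atom.get();
    mTable.emplace(aString, std::move(atom));
    return raw;
  }

 private:
  std::unordered_map<std::u16string, std::unique_ptr<Atom>> mTable;
};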

In 2016, Firefox’s implementation of static atoms was complex and inefficient. I filed a bug about this that included the following ASCII diagram showing all the data structures involved for a single atom for the string “foobar”.

static nsFakeStringBuffer<N=7> foobar_buffer (.data, 8+2N bytes)
/-----------------------------------------\ <------+
| int32_t mRefCnt = 1 // never reaches 0  |        | 
| uint32_t mSize = 14 // 7 x 16-bit chars |        | 
| u"foobar"           // the actual chars | <----+ | 
\-----------------------------------------/      | | 
                                                 | | 
PermanentAtomImpl (heap, 32 bytes)               | | 
/----------------------------------------------\ | | <-+
| void* vtablePtr    // implicit               | | |   | 
| uint32_t mLength = 6                         | | |   | 
| uint32_t mHash = ...                         | | |   | 
| char16_t* mString = @------------------------|-+ |   | 
| uintptr_t mRefCnt  // from NS_DECL_ISUPPORTS |   |   | 
\----------------------------------------------/   |   | 
                                                   |   | 
static nsIAtom* foobar (.bss, 8 bytes)             |   | 
/---\ <-----------------------------------+        |   | 
| @-|-------------------------------------|------------+
\---/                                     |        |   | 
                                          |        |   | 
static nsStaticAtom (.d.r.ro.l, 16 bytes) |        |   | 
(this element is part of a larger array)  |        |   | 
/------------------------------------\    |        |   | 
| nsStringBuffer* mStringBuffer = O--|----|--------+   | 
| nsIAtom** mAtom = @----------------|----+            | 
\------------------------------------/                 | 
                                                       | 
AtomTableEntry (heap, ~2 x 16 bytes[*])                | 
(this entry is part of gAtomTable)                     | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| AtomImpl* mAtom = @-----|----------------------------+
\-------------------------/                            | 
                                                       | 
StaticAtomEntry (heap, ~2 x 16 bytes[*])               | 
(this entry is part of gStaticAtomTable)               | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| nsIAtom* mAtom = @------|----------------------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

There is a lot going on in that diagram, but putting that all together gave the following overhead per atom.

  • Static shared: 0 bytes
  • Static unshared: 8 + 2(length+1) + 8 + 16 bytes
  • Dynamic: 32 + ~32 + ~32 bytes
  • Total bytes: (2(length+1) + 64 + ~64) * num_processes

(Although these atoms are “static” in the sense of being known at compile-time, a lot of the associated data was allocated dynamically.)

At the time there were about 2,700 static atoms, and avg_length was about 11, so the overhead was roughly:

  • 0 bytes fixed, and
  • 410,400 bytes per process. (Or more, depending on how the relocations required for the static pointers were represented, which depended on the platform.)
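
As a rough sanity check of those figures, assuming every atom has the average length (illustrative arithmetic only):

#include <cstddef>

// Back-of-the-envelope check of the per-process figure above.
constexpr std::size_t kNumAtoms = 2700;
constexpr std::size_t kAvgLength = 11;
constexpr std::size_t kPerAtom = 2 * (kAvgLength + 1) + 64 + 64;  // 152 bytes
constexpr std::size_t kPerProcess = kNumAtoms * kPerAtom;         // 410,400 bytes
static_assert(kPerProcess == 410400, "matches the estimate above");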

Today, things have improved greatly and now look like the following.

const char16_t[7] (.rodata, 2(N+1) bytes)
(this is detail::gGkAtoms.foobar_string)
/-----------------------------------------\ <--+
| u"foobar"           // the actual chars |    | 
\-----------------------------------------/    | 
                                               | 
const nsStaticAtom (.rodata, 12 bytes)         | 
(this is within detail::gGkAtoms.mAtoms[])     | 
/-------------------------------------\ <---+  | 
| uint32_t mLength:30 = 6             |     |  | 
| uint32_t mKind:2 = AtomKind::Static |     |  | 
| uint32_t mHash = ...                |     |  | 
| uint32_t mStringOffset = @----------|-----|--+
\-------------------------------------/     | 
                                            | 
constexpr nsStaticAtom* (0 bytes) @---------+
(this is nsGkAtoms::foobar)                 | 
                                            | 
AtomTableEntry (heap, ~2 x 16 bytes[*])     | 
(this entry is part of gAtomTable)          | 
/-------------------------\                 | 
| uint32_t mKeyHash = ... |                 | 
| nsAtom* mAtom = @-------|-----------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

That gives the following overhead per atom.

  • Static shared: 12 + 2(length+1) bytes
  • Static unshared: 0 bytes
  • Dynamic: ~32 bytes
  • Total: 12 + 2(length+1) + (~32 * num_processes) bytes

We now have about 2,300 static atoms and avg_length is still around 11, so the overhead is roughly:

  • 82,800 bytes fixed, and
  • 73,600 bytes per process.
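
The same back-of-the-envelope check for the new numbers (again, illustrative only):

#include <cstddef>

constexpr std::size_t kNumAtoms = 2300;
constexpr std::size_t kAvgLength = 11;
constexpr std::size_t kFixed = kNumAtoms * (12 + 2 * (kAvgLength + 1));  // 82,800 bytes, shared
constexpr std::size_t kPerProcess = kNumAtoms * 32;                      // 73,600 bytes per process
static_assert(kFixed == 82800 && kPerProcess == 73600, "matches the estimates above");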

I won’t explain all the parts of the two diagrams, but it can be seen that we’ve gone from six pieces per static atom to four; the size and complexity of the remaining pieces are greatly reduced; there are no static pointers (only constexpr pointers and integral offsets) and thus no relocations; and there is a lot more interprocess sharing thanks to more use of const. Also, there is no need for a separate static atom table any more, because the main atom table is thread-safe and the HTML5 parser (the primary user of the separate static atom table) now has a small but highly effective static atoms cache.
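
The key trick is that each static atom stores an integral offset to its characters rather than a pointer. The following is a simplified, hypothetical sketch based on the diagram above; the real code is generated and differs in detail.

#include <cstdint>

struct nsStaticAtom {
  uint32_t mLength : 30;
  uint32_t mKind : 2;      // e.g. AtomKind::Static
  uint32_t mHash;
  uint32_t mStringOffset;  // byte offset to the characters, not a pointer

  const char16_t* String() const {
    // An offset is position-independent, so this structure needs no
    // relocations; the whole atom table can live in .rodata and be shared
    // between processes.
    return reinterpret_cast<const char16_t*>(
        reinterpret_cast<uintptr_t>(this) + mStringOffset);
  }
};

static_assert(sizeof(nsStaticAtom) == 12, "three 32-bit words, as in the diagram");

All the character arrays and nsStaticAtom entries are packed into a single const structure (detail::gGkAtoms in the diagram), and each nsGkAtoms::foobar is a constexpr pointer into it, so none of this needs per-process writable memory.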

Things that aren’t visible from the diagrams: atoms are no longer exposed to JavaScript code via XPIDL, there are no longer any virtual methods involved, and all atoms are defined in a single place (with no duplicates) instead of 7 or 8 different places. Notably, the last few steps were blocked for some time by a bug in MSVC involving the handling of constexpr.

The bug dependency tree gives a good indication of how many separate steps were involved in this work. If there is any lesson to be had here, it’s that small improvements add up over time.

Categories
MemShrink

MemShrink status

Mozilla’s MemShrink project started almost six years ago. It successfully reduced Firefox’s memory consumption to the point where Firefox is now widely (and accurately) seen as the least memory-hungry browser.

Most of the big improvements were made in the first two or three years of the MemShrink project, and it doesn’t get much publicity any more, but it is still ticking along quietly: meetings occur, regressions are tracked, measurements are made, and so on. Some time ago I passed leadership of it to the capable hands of Eric Rahm, who has just written a status update that is worth reading.

Eric also wrote recently about the reincarnation of areweslimyet.com. He maintained that site for several years, which was a thankless but important task, but he won’t need to any more because its functionality has been folded into Mozilla’s main performance monitoring tools. Thank you, Eric!

Categories
Firefox Memory consumption MemShrink Uptime

Firefox 64-bit for Windows can take advantage of more memory

By default, on Windows, Firefox is a 32-bit application. This means that it is limited to using at most 4 GiB of memory, even on machines that have more than 4 GiB of physical memory (RAM). In fact, depending on the OS configuration, the limit may be as low as 2 GiB.

Now, 2–4 GiB might sound like a lot of memory, but it’s not that unusual for power users to use that much. This includes:

  • users with many tabs open (dozens or even hundreds);
  • users with many extensions (dozens);
  • users of memory-hungry web sites and web apps; and
  • users who do all of the above!

Furthermore, in practice it’s not possible to totally fill up this available space because fragmentation inevitably occurs. For example, Firefox might need to make a 10 MiB allocation and there might be more than 10 MiB of unused memory, but if that available memory is divided into many pieces all of which are smaller than 10 MiB, then the allocation will fail.

When an allocation does fail, Firefox can sometimes handle it gracefully. But often this isn’t possible, in which case Firefox will abort. Although this is a controlled abort, the effect for the user is basically identical to an uncontrolled crash, and they’ll have to restart Firefox. A significant fraction of Firefox crashes/aborts are due to this problem, known as address space exhaustion.

Fortunately, there is a solution to this problem available to anyone using a 64-bit version of Windows: use a 64-bit version of Firefox. Now, 64-bit applications typically use more memory than 32-bit applications. This is because pointers, a common data type, are twice as big; a rough estimate for 64-bit Firefox is that it might use 25% more memory. However, 64-bit applications also have a much larger address space, which means they can access vast amounts of physical memory, and address space exhaustion is all but impossible. (In this way, switching from a 32-bit version of an application to a 64-bit version is the closest you can get to downloading more RAM!)

Therefore, if you have a machine with 4 GiB or less of RAM, switching to 64-bit Firefox probably won’t help. But if you have 8 GiB or more, switching to 64-bit Firefox probably will help the memory usage situation.

Official 64-bit versions of Firefox have been available since December 2015. If the above discussion has interested you, please try them out. But note the following caveats.

  • Flash and Silverlight are the only supported 64-bit plugins.
  • There are some Flash content regressions due to our NPAPI sandbox (for content that uses advanced features like GPU acceleration or microphone APIs).

On the flip side, as well as avoiding address space exhaustion problems, a security feature known as ASLR works much better in 64-bit applications than in 32-bit applications, so 64-bit Firefox will be slightly more secure.

Work is ongoing to fix or minimize these caveats, and it is expected that 64-bit Firefox will be rolled out in increasing numbers in the not-too-distant future.

UPDATE: Chris Peterson gave me the following measurements about daily active users on Windows.

  • 66.0% are running 32-bit Firefox on 64-bit Windows. These users could switch to a 64-bit Firefox.
  • 32.3% are running 32-bit Firefox on 32-bit Windows. These users cannot switch to a 64-bit Firefox.
  • 1.7% are running 64-bit Firefox already.

UPDATE 2: Also from Chris Peterson, here are links to 64-bit builds for all the channels:

Categories
Firefox Garbage Collection Memory consumption MemShrink

More compacting GC

Jon Coppeard recently extended SpiderMonkey’s compacting GC abilities. Previously, the GC could only compact arenas containing JavaScript objects. Now it can also compact arenas containing shapes (a data structure used within SpiderMonkey which isn’t visible to user code) and strings, which are two of the largest users of memory in the GC heap after objects.

These improvements should result in savings of multiple MiBs in most workloads, and they are on track to ship in Firefox 48, which will be released in early August. Great work, Jon!

Categories
Firefox Garbage Collection Memory consumption MemShrink

Compacting GC

Go read Jon Coppeard’s description of the compacting GC algorithm now used by SpiderMonkey!

Categories
about:memory AdBlock Plus Firefox Memory consumption MemShrink

Firefox 41 will use less memory when running AdBlock Plus

Last year I wrote about AdBlock Plus’s effect on Firefox’s memory usage. The most important part was the following.

First, there’s a constant overhead just from enabling ABP of something like 60–70 MiB. […] This appears to be mostly due to additional JavaScript memory usage, though there’s also some due to extra layout memory.

Second, there’s an overhead of about 4 MiB per iframe, which is mostly due to ABP injecting a giant stylesheet into every iframe. Many pages have multiple iframes, so this can add up quickly. For example, if I load TechCrunch and roll over the social buttons on every story […], without ABP, Firefox uses about 194 MiB of physical memory. With ABP, that number more than doubles, to 417 MiB.

An even more extreme example is this page, which contains over 400 iframes. Without ABP, Firefox uses about 370 MiB. With ABP, that number jumps to 1960 MiB.

(This description was imprecise; the overhead is actually per document, which includes both top-level documents in a tab and documents in iframes.)

Last week Mozilla developer Cameron McCormack landed patches to fix bug 77999, which was filed more than 14 years ago. These patches enable sharing of CSS-related data — more specifically, they add data structures that share the results of cascading user agent style sheets — and in doing so they entirely fix the second issue, which is the more important of the two.
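
Conceptually, the fix is a cache: every document that cascades the same list of user agent sheets now shares one immutable result instead of building and holding its own copy. The sketch below is hypothetical — the type names are invented, and Gecko’s real structure (which shows up as “layout/rule-processor-cache” in about:memory) differs in detail.

#include <map>
#include <memory>
#include <vector>

struct StyleSheet;        // a parsed user agent style sheet
struct CascadedRules {};  // the expensive, immutable result of cascading

class RuleProcessorCache {
 public:
  using SheetList = std::vector<const StyleSheet*>;

  // Documents with the same user agent sheets get the same shared result.
  std::shared_ptr<const CascadedRules> GetOrCreate(const SheetList& aSheets) {
    auto& entry = mCache[aSheets];
    if (!entry) {
      entry = std::make_shared<const CascadedRules>(Cascade(aSheets));
    }
    return entry;
  }

 private:
  static CascadedRules Cascade(const SheetList&) { return {}; }  // placeholder
  std::map<SheetList, std::shared_ptr<const CascadedRules>> mCache;
};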

For example, on the above-mentioned “extreme example” (a.k.a. the Vim Color Scheme Test) memory usage dropped by 3.62 MiB per document. There are 429 documents on that page, which is a total reduction of about 1,550 MiB, reducing memory usage for that page down to about 450 MiB, which is not that much more than when AdBlock Plus is absent. (All these measurements are on a 64-bit build.)

I also did measurements on various other sites and confirmed the consistent saving of ~3.6 MiB per document when AdBlock Plus is enabled. The number of documents varies widely from page to page, so the exact effect depends greatly on workload. (I wanted to test TechCrunch again, but its front page has been significantly changed so it no longer triggers such high memory usage.) For example, for one of my measurements I tried opening the front page and four articles from each of nytimes.com, cnn.com and bbc.co.uk, for a total of 15 tabs. With Cameron’s patches applied Firefox with AdBlock Plus used about 90 MiB less physical memory, which is a reduction of over 10%.

Even when AdBlock Plus is not enabled this change has a moderate benefit. For example, in the Vim Color Scheme Test the memory usage for each document dropped by 0.09 MiB, reducing memory usage by about 40 MiB.

If you want to test this change out yourself, you’ll need a Nightly build of Firefox and a development build of AdBlock Plus. (Older versions of AdBlock Plus don’t work with Nightly due to a recent regression related to JavaScript parsing). In Firefox’s about:memory page you’ll see the reduction in the “style-sets” measurements. You’ll also see a new entry under “layout/rule-processor-cache”, which is the measurement of the newly shared data; it’s usually just a few MiB.

This improvement is on track to make it into Firefox 41, which is scheduled for release on September 22, 2015. For users on other release channels, Firefox 41 Beta is scheduled for release on August 11, and Firefox 41 Developer Edition is scheduled to be released in the next day or two.

Categories
DMD Firefox Memory allocation MemShrink Performance

Cumulative heap profiling in Firefox with DMD

DMD is a tool that I originally created to help identify where new memory reporters should be added to Firefox in order to reduce the “heap-unclassified” measurement in about:memory. (The name is actually short for “Dark Matter Detector”, because we sometimes call the “heap-unclassified” measurement “dark matter”.)

Recently, I’ve modified DMD to become a more general heap profiling tool. It now has three distinct modes of operation.

  1. “Dark matter”: this mode gives you DMD’s original behaviour.
  2. “Live”: this mode tracks all the live blocks on the system heap, and lets you take snapshots at particular points in time.
  3. “Cumulative”: this mode tracks all the blocks that have ever been allocated on the system heap, and so gives you information about all the allocations done by Firefox during an entire session.

Most memory profilers (such as about:memory) are snapshot-based, and so work much like DMD’s “live” mode. But “cumulative” mode is equally interesting.

In particular, unlike “live” mode, “cumulative” mode tells you about parts of the code that are responsible for allocating many short-lived heap blocks (sometimes called “heap churn”). Such allocations can hurt performance: allocations and deallocations themselves aren’t free, particularly because they require obtaining a global lock; such code often involves unnecessary initialization or copying of heap data; and if these allocations come in a variety of sizes they can cause additional heap fragmentation.
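
Here is the kind of pattern “cumulative” mode is good at exposing; this is an illustrative sketch, not real Firefox code.

#include <cstddef>
#include <cstdint>
#include <vector>

// Churny version: allocates and frees a scratch buffer on every call. "Live"
// mode never sees more than one of these alive at a time, but "cumulative"
// mode reports the total volume, which is enormous if this is a hot path.
uint32_t ChecksumChurny(const uint8_t* aData, size_t aLen) {
  std::vector<uint8_t> scratch(aLen);  // one heap allocation per call
  uint32_t sum = 0;
  for (size_t i = 0; i < aLen; i++) {
    scratch[i] = aData[i] ^ 0x5a;      // some transformation of the input
    sum += scratch[i];
  }
  return sum;
}

// Less churny version: the caller owns a scratch buffer and reuses it across
// calls, so repeated calls cause no further allocations.
uint32_t Checksum(const uint8_t* aData, size_t aLen,
                  std::vector<uint8_t>& aScratch) {
  aScratch.resize(aLen);               // reuses existing capacity when possible
  uint32_t sum = 0;
  for (size_t i = 0; i < aLen; i++) {
    aScratch[i] = aData[i] ^ 0x5a;
    sum += aScratch[i];
  }
  return sum;
}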

Another nice thing about cumulative heap profiling is that, unlike live heap profiling, you don’t have to decide when to take snapshots. You can just profile an entire workload of interest and get the results at the end.

I’ve used DMD’s cumulative mode to find inefficiencies in SpiderMonkey’s source compression and code generation, SQLite, NSS, nsTArray, XDR encoding, Gnome start-up, IPC messaging, nsStreamLoader, cycle collection, and safe-browsing. There are “start doing something clever” optimizations and then there are “stop doing something stupid” optimizations, and every one of these fixes has been one of the latter. Each change has avoided cumulative heap allocations ranging from tens to thousands of MiBs.

It’s typically difficult to quantify any speed-ups from these changes, because the workloads are noisy and non-deterministic, but I’m convinced that simple changes to fix these issues are worthwhile. For one, I used cumulative profiling (via a different tool) to drive the major improvements I made to pdf.js earlier this year. Also, Chrome developers have found that “Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag”.

So far I have only profiled a few simple workloads. There are all sorts of things I haven’t tried: text-heavy pages, image-heavy pages, audio and video, WebRTC, WebGL, popular benchmarks… the list goes on. I intend to do more profiling and fix things where I can, but it would be great to have help from domain experts with this stuff. If you want to try out cumulative heap profiling in Firefox, please read the DMD instructions and feel free to ask me for help. In particular, I now have a good feel for which hot allocations are unavoidable and reasonable — plenty of them, of course — and which are avoidable. Let’s get rid of the avoidable ones.

Categories
about:memory Firefox Memory allocation Memory consumption MemShrink Performance

Better documentation for memory profiling and leak detection tools

Until recently, the documentation for all of Mozilla’s memory profiling and leak detection tools had some major problems.

  • It was scattered across MDN, the Mozilla Wiki, and the Mozilla archive site (yes, really).
  • Documentation for several tools was spread across multiple pages.
  • Documentation for some tools was meagre, non-existent, or overly verbose.
  • Some of the documentation was out of date, e.g. describing tools that no longer exist.

A little while back I fixed these problems.

  • The documentation for these tools is now all on MDN. If you look at the MDN Performance page in the “Memory profiling and leak detection tools” section, you’ll see a brief description of each tool that explains the circumstances in which it is useful, and a link to the relevant documentation.
  • The full list of documented tools includes: about:memory, DMD, areweslimyet.com, BloatView, Refcount tracing and balancing, GC and CC logs, Valgrind, LeakSanitizer, Apple tools, TraceMalloc, Leak Gauge, and LogAlloc.
  • As well as consolidating all the pages in one place, I also improved some of the pages (with the help of people like Andrew McCreight). In particular, about:memory now has reasonably detailed documentation, something it has lacked until now.

Please take a look, and if you see any problems let me know. Or, if you’re feeling confident just fix things yourself! Thanks.

Categories
DMD Firefox Memory allocation Memory consumption MemShrink

Please grow your buffers exponentially

If you record every heap allocation and re-allocation done by Firefox you find some interesting things. In particular, you find some sub-optimal buffer growth strategies that cause a lot of heap churn.

Think about a data structure that involves a contiguous, growable buffer, such as a string or a vector. If you append to it and it doesn’t have enough space for the appended elements, you need to allocate a new buffer, copy the old contents to the new buffer, and then free the old buffer. realloc() is usually used for this, because it does these three steps for you.

The crucial question: when you have to grow a buffer, how much do you grow it? One obvious answer is “just enough for the new elements”. That might seem space-efficient at first glance, but if you have to repeatedly grow the buffer it can quickly turn bad.

Consider a simple but not outrageous example. Imagine you have a buffer that starts out 1 byte long and you add single bytes to it until it is 1 MiB long. If you use the “just-enough” strategy you’ll cumulatively allocate this much memory:

1 + 2 + 3 + … + 1,048,575 + 1,048,576 = 549,756,338,176 bytes

Ouch. O(n²) behaviour really hurts when n gets big enough. Of course the peak memory usage won’t be nearly this high, but all those reallocations and copying will be slow.

In practice it won’t be this bad because heap allocators round up requests, e.g. if you ask for 123 bytes you’ll likely get something larger like 128 bytes. The allocator used by Firefox (an old, extensively-modified version of jemalloc) rounds up all requests between 4 KiB and 1 MiB to the nearest multiple of 4 KiB. So you’ll actually allocate approximately this much memory:

4,096 + 8,192 + 12,288 + … + 1,044,480 + 1,048,576 = 134,742,016 bytes

(This ignores the sub-4 KiB allocations, which in total are negligible.) Much better. And if you’re lucky the OS’s virtual memory system will do some magic with page tables to make the copying cheap. But still, it’s a lot of churn.

A strategy that is usually better is exponential growth. Doubling the buffer each time is the simplest strategy:

4,096 + 8,192 + 16,384 + 32,768 + 65,536 + 131,072 + 262,144 + 524,288 + 1,048,576 = 2,093,056 bytes

That’s more like it; the cumulative size is just under twice the final size, and the series is short enough now to write it out in full, which is nice — calls to malloc() and realloc() aren’t that cheap because they typically require acquiring a lock. I particularly like the doubling strategy because it’s simple and it also avoids wasting usable space due to slop.
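
Here is a minimal sketch of a growable buffer that uses the doubling strategy, roughly the shape of what the changes described below do; it is illustrative code, not the actual nsTArray or XDRBuffer implementation.

#include <cstddef>
#include <cstdlib>
#include <cstring>

class GrowableBuffer {
 public:
  ~GrowableBuffer() { free(mData); }

  bool Append(const char* aBytes, size_t aLen) {
    if (mLength + aLen > mCapacity) {
      // Exponential growth: at least double, and at least big enough.
      size_t newCapacity = mCapacity ? mCapacity * 2 : 64;
      while (newCapacity < mLength + aLen) {
        newCapacity *= 2;
      }
      // realloc() allocates the new buffer, copies the old contents into it,
      // and frees the old buffer, all in one step.
      char* newData = static_cast<char*>(realloc(mData, newCapacity));
      if (!newData) {
        return false;  // allocation failed; leave the buffer unchanged
      }
      mData = newData;
      mCapacity = newCapacity;
    }
    memcpy(mData + mLength, aBytes, aLen);
    mLength += aLen;
    return true;
  }

 private:
  char* mData = nullptr;
  size_t mLength = 0;
  size_t mCapacity = 0;
};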

Recently I’ve converted “just enough” growth strategies to exponential growth strategies in XDRBuffer and nsTArray, and I also found a case in SQLite that Richard Hipp has fixed. These pieces of code now match numerous places that already used exponential growth: pldhash, JS::HashTable, mozilla::Vector, JSString, nsString, and pdf.js.

Pleasingly, the nsTArray conversion had a clear positive effect. Not only did the exponential growth strategy reduce the amount of heap churn and the number of realloc() calls, it also reduced heap fragmentation: the “heap-overhead” part of the purple measurement on AWSY (a.k.a. “RSS: After TP5, tabs closed [+30s, forced GC]”) dropped by 4.3 MiB! This makes sense if you think about it: an allocator can fulfil power-of-two requests like 64 KiB, 128 KiB, and 256 KiB with less waste than it can awkward requests like 244 KiB, 248 KiB, 252 KiB, etc.

So, if you know of some more code in Firefox that uses a non-exponential growth strategy for a buffer, please fix it, or let me know so I can look at it. Thank you.