Categories
DMD Firefox Memory consumption Performance Rust

Better stack fixing for Firefox

I recently undertook a project to improve the stack fixing tools used for Firefox. This has resulted in some large performance wins (e.g. 10x-100x) and a significant improvement in code quality. The story involves Rust, Python, executable and debug info formats, Taskcluster, and many unexpected complications.

What is a stack fixer?

Within the Firefox code base, a stack fixer is a program that post-processes (“fixes”) the stack frames produced by MozFormatCodeAddress(), which often lack one or more of: function name, file name, or line number. It reads debug info from binaries (libraries and executables) to do so. It reads from standard input and writes to standard output. Lines matching the special stack frame format are modified appropriately. For example, a line like this in the input that names an executable or library:

#01: ???[tests/example +0x43a0]

is changed to a line in the output that names a function, source file, and line number:

#01: main (/home/njn/moz/fix-stacks/tests/example.c:24)

Lines that do not match the special stack frame format are passed through unchanged.

This process is sometimes called “symbolication”, though I will use “stack fixing” in this post because that’s the term used within the Firefox code base.

Stack fixing is used in two main ways for Firefox.

  • When tests are run on debug builds, a stack trace is produced if a crash or assertion failure happens.
  • The heap profiling tool DMD records many stack traces at heap allocation points. The stack frames from these stack traces are written to an output file.

A developer needs high-quality stack fixing for the stack traces to be useful in either case.

That doesn’t sound that complicated

The idea is simple, but the reality isn’t.

  • The debug info format is different on each of the major platforms: Windows (PE/PDB), Mac (Mach-O), and Linux (ELF/DWARF).
  • We also support Breakpad symbols, a cross-platform debug info format that we use on automation. (Using it on local builds is something of a pain.)
  • Each debug info format is complicated.
  • Firefox is built from a number of libraries, but most of its code is in a single library called libxul, whose size ranges from 100 MiB to 2 GiB, depending on the platform and the particular kind of build. This stresses stack fixers.

Before I started this work, we had three different Python scripts for stack fixing.

  • fix_linux_stack.py: This script does native stack fixing on Linux. It farms out most of the work to addr2line, readelf, and objdump.
  • fix_macosx_stack.py: This script does native stack fixing on Mac. It farms out most of the work to atos, otool, and c++filt.
  • fix_stack_using_bpsyms.py: This script does stack fixing using Breakpad symbols. It does the work itself.

Note that there is no fix_windows_stack.py script. We did not have a native stack-fixing option for Windows.

This was an inelegant mishmash. More importantly, the speed of these scripts was poor and highly variable. Stack fixing could take anywhere from tens of seconds to tens of minutes, depending on the platform, build configuration, and number of stack frames that needed fixing. For example, on my fast 28-core Linux box I would often have to wait 20 minutes or more to post-process the files from a DMD run.

One tool to rule them all

It would be nice to have a single program that could handle all the necessary formats. It would also be nice if it was much faster than the existing scripts.

Fortunately, the Symbolic Rust crate written by Sentry provided the perfect foundation for such a tool. It provides the multi-platform debug info functionality needed for stack fixing, and also has high performance. In November last year I started a project to implement a new stack fixer in Rust, called fix-stacks.

Implementing the tool

First I got it working on Linux. I find Linux is often the easiest platform to get new code working on, at least partly because it’s the platform I’m most familiar with. In this case it was also helped by the fact that on Linux debug info is most commonly stored within the binary (library or executable) that it describes, which avoids the need to find a separate debug info file. The code was straightforward. The Symbolic crate did the hard part of reading the debug info, and my code just had to use the APIs provided to iterate over the parsed data and build up some data structures that could then be searched.

Then I got it working on Windows. I find Windows is often the hardest platform to get new code working on, but that wasn’t the case here. The only complication was that Windows debug info is stored in a PDB file that is separate from the binary, but Symbolic has a function for getting the name of that file from the binary, so it wasn’t hard to add code to look in that separate file.

Then I got it working on Mac. This was by far the hardest platform, for two reasons. First, the code had to handle fat binaries, which contain code for multiple architectures. Fortunately, Symbolic has direct support for fat binaries so that wasn’t too bad.

Second, the normal approach on Mac is to read debug info from the files produced by dsymutil, in which the debug info is neatly packaged. Unfortunately, dsymutil is very slow and we try to avoid running it in the Firefox build system if possible. So I took an alternative approach: read the binary’s symbol table and then read debug info from the object files and archive files it mentions. I knew that atos used this approach, but unfortunately its source code isn’t available, so I couldn’t see exactly what it did. If I couldn’t get the approach working myself the whole project was at risk; a one-tool-to-rule-them-all strategy falls short it if doesn’t work on one platform.

I spent quite some time reading about the Mach-O file format and using the MachOView utility to inspect Mach-O binaries. Symbolic doesn’t provide an API for reading symbol tables, so I had to use the lower-level goblin crate for that part. (Symbolic uses goblin itself, which means that fix-stacks is using goblin both directly and indirectly.) First I got it working on some very small test files, then on some smaller libraries within Firefox, and finally (to great relief!) on libxul. At each step I had to deal with new complications in the file format that I hadn’t known about in advance. I also had to modify Symbolic itself to handle some edge cases in .o files.

After that, I got fix-stacks working on Breakpad symbols. This was more straightforward; the only tricky part was navigating the directory structure that Firefox uses for storing the Breakpad symbols files. (I found out the hard way that the directory structure is different on Windows.)

One final complication is that DMD’s output, which gets run through the stack fixer, is in JSON format. So fix-stacks has a JSON mode (enabled with --json) that does the appropriate things with JSON escape characters on both input and output. This took three attempts to get completely right.

The end result is a single program that can fix stacks on all four of the formats we need. The stack traces produced by fix-stacks are sometimes different to those produced by the old stack fixing scripts. In my experience these differences are minor and you won’t notice them if you aren’t looking for them.

Code size

The source code for the first version of fix-stacks, which only supported Linux, was 275 lines (excluding tests). The current version, with support for Windows, Mac, Breakpad symbols, and JSON handling, is 891 lines (excluding tests).

In comparison, the Symbolic crate is about 20,000 lines of Rust code in total (including tests), and the three sub-crates that fix-stacks uses (debuginfo, demangle, and common) are 11,400 lines of Rust code. goblin is another 18,000 lines of code. (That’s what I call “leveraging the ecosystem”!)

Beyond Symbolic and goblin, the only other external crates that fix-stacks uses are fxhash, regex, and serde_json.

Testing

Testing is important for a tool like this. It’s hard to write test inputs manually in formats like ELF/DWARF, PE/PDB, and Mach-O, so I used clang to generate inputs from some simple C programs. Both the C programs and the binary files generated from them are in the repository.

Some of the generated inputs needed additional changes after they were generated by clang. This is explained by the testing README file:

The stack frames produced by `MozFormatCodeAddress()` contain absolute paths
and refer to build files, which means that `fix-stacks` can only be sensibly
run on the same machine that produced the stack frames.

However, the test inputs must work on any machine, not just the machine that
produced those inputs. Furthermore, it is convenient when developing if all the
tests works on all platforms, e.g. the tests involving ELF/DWARF files should
work on Windows, and the tests involving PE/PDB files should work on Linux.

To allow this requires the following.
- All paths in inputs must be relative, rather than absolute.
- All paths must use forward slashes rather than backslashes as directory
  separators. (This is because Windows allows both forward slashes and
  backslashes, but Linux and Mac only allow forward slashes.) This includes the
  paths in text inputs, and also some paths within executables (such as a PE
  file's reference to a PDB file).

To satisfy these constraints required some hex-editing of the generated input files. Quoting the README again:

`example-windows.exe` and `example-windows.pdb` were produced on a Windows 10
laptop by clang 9.0 with this command within `tests/`:
```
clang -g example.c -o example-windows.exe
```
`example-windows.exe` was then hex-edited to change the PDB reference from the
absolute path `c:\Users\njn\moz\fix-stacks\tests\example-windows.pdb` to the
relative path `tests/////////////////////////////example-windows.pdb`. (The use
of many redundant forward slashes is a hack to keep the path the same length,
which avoids the need for more complex changes to that file.)

A hack, to be sure, but an effective one.

The steps required to produce the Mac test inputs were even more complicated because they involve fat binaries. I was careful to make that README file clearly describe the steps I took to generate all the test inputs. The effort has paid off multiple times when modifying the tests.

Integrating the tool

Once I had the fix-stacks working well, I thought that most of the work was done and integrating it into the Firefox build and test system would be straightforward. I was mistaken! The integration ended up being a similar amount of work.

First, I added three new jobs to Mozilla’s Taskcluster instance to build fix-stacks and make it available for downloading on Windows, Mac, and Linux; this is called a “toolchain”. This required making changes to various Taskcluster configuration files, and writing a shell script containing the build instructions. All of this was new to me, and it isn’t documented, so I had to cargo-cult from similar existing toolchains while asking lots of questions of the relevant experts. You can’t test jobs like these on your own machine so it took me dozens of “try” pushes to Mozilla’s test machines to get it working, with each push taking roughly 10 minutes to complete.

Then I added a wrapper script (fix_stacks.py) and changed the native stack fixing path in DMD to use it instead of fix_linux_stack.py or fix_macosx_stack.py. This took some care, with numerous try pushes to manually check that the stacks produced by fix_stacks.py were as good as or better than the ones produced by the old scripts. To do this manual checking I first had to deliberately break the DMD test, because the stacks produced are not printed in the test log when the test passes. I also had to update mach bootstrap so it would install a pre-built fix-stacks executable in the user’s .mozbuild directory, which was another unfamiliar part of the code for me. Plus I fixed a problem with the fix-stacks toolchain for Mac: the fix-stacks executable was being cross-compiled on a Linux machine, but some errors meant it was not actually cross-compiling, but simply building a Linux executable. Plus I fixed a problem with the fix-stacks toolchain for Windows: it was building a 64-bit executable, but that wouldn’t work on our 32-bit test jobs; cross-compiling a 32-bit Windows executable on Linux turned out to be the easiest way to fix it. Again, these toolchain fixes took numerous trial-and-error try pushes to get things working. Once it was all working, native stack fixing on Windows was available for DMD for the first time.

Then I changed the native stack fixing path in tests to use fix_stacks.py. This required some minor changes to fix_stacks.py‘s output, to make it more closely match that of the old scripts, to satisfy some tests. I also had to modify the Taskcluster configuration to install the fix-stacks executable in a few extra places; again this required some trial-and-error with try pushes. (Some of those modifications I added after my initial landing attempt was backed out due to causing failures in a tier 2 job that doesn’t run by default on try, *cough*.) At this point, native stack fixing on Windows was available for test output for the first time.

Then I re-enabled stack-fixing for local test runs on Mac. It had been disabled in December 2019 because fixing a single stack typically took at least 15 minutes. With fix_stacks.py it takes about 30 seconds, and it also now prints out a “this may take a while” message to prepare the user for their 30 second wait.

Along the way, I noticed that one use point of the old stack fixing scripts, in automation.py.in, was dead code. Geoff Brown kindly removed this dead code.

Then I changed the Breakpad symbols stack fixing path in DMD to use fix_stacks.py, which was simple.

And then Henrik Skupin noticed that the fix-stacks executable wasn’t installed when you ran mach bootstrap for artifact builds, so I fixed that.

And then I was told that I had broken the AWSY-DMD test jobs on Windows. This wasn’t noticed for weeks because those jobs don’t run by default, and to run them on try you must opt into the “full” job list, which is unusual. The problem was some gnarly file locking caused by the way file descriptors are inherited when a child process is spawned on Windows in Python 2; working this out took some time. (It wouldn’t be a problem on Python 3, but unfortunately this code is Python 2 and that cannot be easily changed.) I thought I had a fix, but it caused other problems, and so I ended up disabling stack fixing on Windows for this job, which was a shame, but put us back where we started, with no stack fixing on Windows for that particular job.

And then I changed the Breakpad symbols stack fixing path in tests to use fix_stacks.py, which seemed simple. But it turns out that tests on Android partly run using code from the current Firefox repository, and partly using code from the “host utils”, which is a snapshot of the Firefox repository from… the last time someone updated the snapshot. (This has something to do with part of the tests actually taking place on Linux machines; I don’t understand the details and have probably mis-described the setup.) The host utils in use at the time was several months old and lacked the fix_stacks.py script. So Andrew Erickson kindly updated the host utils for me. And then I fixed a few more Taskcluster configuration issues, and then the “simple” fix could land. And then I fixed another configuration issues that showed up later, in a follow-up bug.

And then I removed the old stack fixing scripts because they weren’t being used any more.

And then I found a better solution to the Windows + Python 2 file descriptor issue, allowing me to re-enable stack fixing for the Windows AWSY-DMD job. (With another host utils update, to keep the Android tests working.)

And then I updated all the online documentation I could find that referred to the old scripts, all of it on MDN.

And then I closed the meta-bug that had been tracking all of this work. Hooray!

And then I was told of another obscure test output issue relating to web platform tests, which I have not yet landed a fix for. One lesson here is that changing code that potentially affects the output of every test suite is a fraught endeavour, with the possibility of a long tail of problems showing up intermittently.

Performance

I did some speed and peak memory measurements on the two common use cases: fixing many stack frames in a DMD file, and fixing a single stack trace from an assertion failure in a test. The machines I used are: a fast 28-core Linux desktop machine, a 2019 16-inch 8-core MacBook Pro, and an old Lenovo ThinkPad Windows laptop. The fix-stacks executable is compiled with LTO, because I found it gives speed-ups of up to 30%.

First, the following measurements are for fixing a DMD output file produced by an optimized Firefox build, old vs. new.

  • Linux native: 4m44s / 4.8 GB vs. 21s / 2.4 GB
  • Mac native: 15m47s / 1.0 GB vs. 31s / 2.4 GB
  • Windows native: N/A vs. 29s / 2.6 GB
  • Linux Breakpad symbols: 25s / 2.1 GB vs. 13s / 0.6 GB

(Each platform had a different input file, with some variations in the sizes, so cross-platform comparisons aren’t meaningful.)

On Linux we see a 13x speed-up, and I have seen up to 100x improvements on larger inputs. This is because the old script started quickly, but then each additional stack frame fixed was relatively slow. In comparison, the new script has a slightly higher overhead at start-up but then each additional stack frame fixed is very fast. Memory usage is halved, but still high, because libxul is so large.

On Mac the new script is 30x faster than the old script, but memory usage is more than doubled, interestingly. atos must have a particularly compact representation of the data.

On Windows we couldn’t natively fix stacks before.

For Breakpad symbols we see a 2x speed-up and peak memory usage is less than one-third.

Second, the following measurements are for fixing a single stack trace produced by a debug Firefox build, old vs. new.

  • Linux native: 9s / 1.5 GB vs. 13s / 2.5 GB
  • Mac native: 15m01s / 1.1 GB vs. 27s / 2.6 GB
  • Win native: N/A vs. 30s / 3.2 GB
  • Linux Breakpad symbols: 27s / 3.5 GB vs. 13s / 1.1 GB

On Linux, both speed and peak memory usage are somewhat worse. Perhaps addr2line is optimized for doing a small number of lookups.

On Mac the new script is again drastically faster, 33x this time, but memory usage is again more than doubled.

On Windows, again, we couldn’t natively fix stacks before.

For Breakpad symbols we again see a 2x speed-up and peak memory usage of less than one-third.

You might have noticed that the memory usage for the single stack trace was generally higher than for the DMD output. I think this is because the former is an optimized build, while the latter is a debug build.

In summary:

  • The speed of native stack fixing is massively improved in many cases, with 10x-100x improvements typical, and slightly slower in only one case. This represents some drastic time savings for Firefox developers.
  • The peak memory usage of native stack fixing is sometimes lower, sometimes higher, and still quite high in general. But the amount of memory needed is still much less than that required to compile Firefox, so it shouldn’t be a problem for Firefox developers.
  • Native stack fixing is now possible on Windows, which makes things easier for Firefox developers on Windows.
  • For Breakpad symbols stack fixing is 2x faster and takes 3x less memory. This represents some significant savings in machine time on automation, and will also reduce the chance of failures caused by running out of memory, which can be a problem in practice.

My experience with Rust

Much of my work using Rust has been on the Rust compiler itself, but that mostly involves making small edits to existing code. fix-stacks is the third production-quality Rust project I have written from scratch, the others being Firefox’s new prefs parser (just under 1000 lines of code) and counts (just under 100 lines of code).

My experience in all cases has been excellent.

  • I have high confidence in the code’s correctness, and that I’m not missing edge cases that could occur in either C++ (due to lack of safety checks) or Python (due to dynamic typing).
  • The deployed code has been reliable.
  • Rust is a very pleasant language to write code in: expressive, powerful, and many things just feel “right”.
  • I have been writing C++ a lot longer than Rust but I feel more competent and effective in Rust, due to its safety and expressiveness.
  • Performance is excellent.
  • As mentioned above, the entire fix-stacks project wouldn’t have happened without the third-party Symbolic crate.

Rust gives me a feeling of “no compromises” that other languages don’t.

Conclusion

Stack fixing is much better now, and it took more work than I expected!

Many thanks to Mike Hommey, Eric Rahm, and Gabriele Svelto for answering lots of questions and reviewing many patches along the way.

Categories
DMD

DMD is usable again on all Tier 1 platforms

DMD is heap profiler built into Firefox, best known for being the tool used to  diagnose the sources of “heap-unclassified” memory in about:memory.

It’s been unusable on Win32 for a long time due to incredibly slow start-up times. And recently it became very slow on Mac due to a performance regression in libunwind.

Fortunately I have been able to fix this in both cases (Win32, Mac) by using FramePointerStackWalk() instead of MozStackWalk() to do the stack tracing within DMD. (The Gecko Profiler likewise uses FramePointerStackWalk() on those two platforms, and it was my recent work on the profiler that taught me that there was an alternative stack walker available.)

So DMD should be usable and effective on all Tier 1 platforms. I have tested it on  Win32, Win64, Linux64 and Mac. I haven’t tested it in Linux32 or Android. Please let me know if you encounter any problems.

Categories
DMD Firefox Memory allocation MemShrink Performance

Cumulative heap profiling in Firefox with DMD

DMD is a tool that I originally created to help identify where new memory reporters should be added to Firefox in order to reduce the “heap-unclassified” measurement in about:memory. (The name is actually short for “Dark Matter Detector”, because we sometimes call the “heap-unclassified” measurement “dark matter“.)

Recently, I’ve modified DMD to become a more general heap profiling tool. It now has three distinct modes of operation.

  1. “Dark matter”: this mode gives you DMD’s original behaviour.
  2. “Live”: this mode tracks all the live blocks on the system heap, and lets you take snapshots at particular points in time.
  3. Cumulative“: this mode tracks all the blocks that have ever been allocated on the system heap, and so gives you information about all the allocations done by Firefox during an entire session.

Most memory profilers (including as about:memory) are snapshot-based, and so work much like DMD’s “live” mode. But “cumulative” mode is equally interesting.

In particular, unlike “live” mode, “cumulative” mode tells you about parts of the code that are responsible for allocating many short-lived heap blocks (sometimes called “heap churn”). Such allocations can hurt performance: allocations and deallocations themselves aren’t free, particularly because they require obtaining a global lock; such code often involves unnecessary initialization or copying of heap data; and if these allocations come in a variety of sizes they can cause additional heap fragmentation.

Another nice thing about cumulative heap profiling is that, unlike live heap profiling, you don’t have to decide when to take snapshots. You can just profile an entire workload of interest and get the results at the end.

I’ve used DMD’s cumulative mode to find inefficiencies in SpiderMonkey’s source compression  and code generation, SQLite, NSS, nsTArray, XDR encoding, Gnome start-up, IPC messaging, nsStreamLoader, cycle collection, and safe-browsing. There are “start doing something clever” optimizations and then there are “stop doing something stupid” optimizations, and every one of these fixes has been one of the latter. Each change has avoided cumulative heap allocations ranging from tens to thousands of MiBs.

It’s typically difficult to quantify any speed-ups from these changes, because the workloads are noisy and non-deterministic, but I’m convinced that simple changes to fix these issues are worthwhile. For one, I used cumulative profiling (via a different tool) to drive the major improvements I made to pdf.js earlier this year. Also, Chrome developers have found that “Chrome in general is *very* close to the threshold where heap lock contention causes noticeable UI lag”.

So far I have only profiled a few simple workloads. There are all sorts of things I haven’t tried: text-heavy pages, image-heavy pages, audio and video, WebRTC, WebGL, popular benchmarks… the list goes on. I intend to do more profiling and fix things where I can, but it would be great to have help from domain experts with this stuff. If you want to try out cumulative heap profiling in Firefox, please read the DMD instructions and feel free to ask me for help. In particular, I now have a good feel for which hot allocations are unavoidable and reasonable — plenty of them, of course — and which are avoidable. Let’s get rid of the avoidable ones.

Categories
DMD Firefox Memory allocation Memory consumption MemShrink

Please grow your buffers exponentially

If you record every heap allocation and re-allocation done by Firefox you find some interesting things. In particular, you find some sub-optimal buffer growth strategies that cause a lot of heap churn.

Think about a data structure that involves a contiguous, growable buffer, such as a string or a vector. If you append to it and it doesn’t have enough space for the appended elements, you need to allocate a new buffer, copy the old contents to the new buffer, and then free the old buffer. realloc() is usually used for this, because it does these three steps for you.

The crucial question: when you have to grow a buffer, how much do you grow it? One obvious answer is “just enough for the new elements”. That might seem  space-efficient at first glance, but if you have to repeatedly grow the buffer it can quickly turn bad.

Consider a simple but not outrageous example. Imagine you have a buffer that starts out 1 byte long and you add single bytes to it until it is 1 MiB long. If you use the “just-enough” strategy you’ll cumulatively allocate this much memory:

1 + 2 + 3 + … + 1,048,575 + 1,048,576 = 549,756,338,176 bytes

Ouch. O(n2) behaviour really hurts when n gets big enough. Of course the peak memory usage won’t be nearly this high, but all those reallocations and copying will be slow.

In practice it won’t be this bad because heap allocators round up requests, e.g. if you ask for 123 bytes you’ll likely get something larger like 128 bytes. The allocator used by Firefox (an old, extensively-modified version of jemalloc) rounds up all requests between 4 KiB and 1 MiB to the nearest multiple of 4 KiB. So you’ll actually allocate approximately this much memory:

4,096 + 8,192 + 12,288 + … + 1,044,480 + 1,048,576 = 134,742,016 bytes

(This ignores the sub-4 KiB allocations, which in total are negligible.) Much better. And if you’re lucky the OS’s virtual memory system will do some magic with page tables to make the copying cheap. But still, it’s a lot of churn.

A strategy that is usually better is exponential growth. Doubling the buffer each time is the simplest strategy:

4,096 + 8,192 + 16,384 + 32,768 + 65,536 + 131,072 + 262,144 + 524,288 + 1,048,576 = 2,093,056 bytes

That’s more like it; the cumulative size is just under twice the final size, and the series is short enough now to write it out in full, which is nice — calls to malloc() and realloc() aren’t that cheap because they typically require acquiring a lock. I particularly like the doubling strategy because it’s simple and it also avoids wasting usable space due to slop.

Recently I’ve converted “just enough” growth strategies to exponential growth strategies in XDRBuffer and nsTArray, and I also found a case in SQLite that Richard Hipp has fixed. These pieces of code now match numerous places that already used exponential growth: pldhash, JS::HashTable, mozilla::Vector, JSString, nsString, and pdf.js.

Pleasingly, the nsTArray conversion had a clear positive effect. Not only did the exponential growth strategy reduce the amount of heap churn and the number of realloc() calls, it also reduced heap fragmentation: the “heap-overhead” part of the purple measurement on AWSY (a.k.a. “RSS: After TP5, tabs closed [+30s, forced GC]”) dropped by 4.3 MiB! This makes sense if you think about it: an allocator can fulfil power-of-two requests like 64 KiB, 128 KiB, and 256 KiB with less waste than it can awkward requests like 244 KiB, 248 KiB, 252 KiB, etc.

So, if you know of some more code in Firefox that uses a non-exponential growth strategy for a buffer, please fix it, or let me know so I can look at it. Thank you.

Categories
about:memory DMD Firefox Memory consumption MemShrink

Measuring memory used by third-party code

Firefox’s memory reporting infrastructure, which underlies about:memory, is great. And when it lacks coverage — causing the “heap-unclassified” number to get large — we can use DMD to identify where the unreported allocations are coming from. Using this information, we can extend existing memory reporters or write new ones to cover the missing heap blocks.

But there is one exception: third-party code. Well… some libraries support custom allocators, which is great, because it lets us provide a counting allocator. And if we have a copy of the third-party code within Firefox, we can even use some pre-processor hacks to forcibly provide custom counting allocators for code that doesn’t support them.

But some heap allocations are done by code we have no control over, like OpenGL drivers. For example, after opening a simple WebGL demo on my Linux box, I have over 50% “heap-unclassified”.

208.11 MB (100.0%) -- explicit
├──107.76 MB (51.78%) ── heap-unclassified

DMD’s output makes it clear that the OpenGL drivers are responsible. The following record is indicative.

Unreported: 1 block in stack trace record 2 of 3,597
 15,486,976 bytes (15,482,896 requested / 4,080 slop)
 6.92% of the heap (20.75% cumulative); 10.56% of unreported (31.67% cumulative)
 Allocated at
 replace_malloc (/home/njn/moz/mi8/co64dmd/memory/replace/dmd/../../../../memory/replace/dmd/DMD.cpp:1245) 0x7bf895f1
 _swrast_CreateContext (??:?) 0x3c907f03
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3cd84fa8
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3cd9fa2c
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3cd8b996
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3ce1f790
 ??? (/usr/lib/x86_64-linux-gnu/dri/i965_dri.so) 0x3ce1f935
 glXGetDriverConfig (??:?) 0x3dce1827
 glXDestroyGLXPixmap (??:?) 0x3dcbc213
 glXCreateNewContext (??:?) 0x3dcbc48a
 mozilla::gl::GLContextGLX::CreateGLContext(mozilla::gfx::SurfaceCaps const&, mozilla::gl::GLContextGLX*, bool, _XDisplay*, unsigned long, __GLXFBConfigRec*, bool, gfxXlibSurface*) (/home/njn/moz/mi8/co64dmd/gfx/gl/../../../gfx/gl/GLContextProviderGLX.cpp:783) 0x753c99f4

The bottom-most frame is for a function (CreateGLContext) within Firefox’s codebase, and then control passes to the OpenGL driver, which eventually does a heap allocation, which ends up in DMD’s replace_malloc function.

The following DMD report is a similar case that shows up on Firefox OS.

Unreported: 1 block in stack trace record 1 of 463
 1,454,080 bytes (1,454,080 requested / 0 slop)
 9.75% of the heap (9.75% cumulative); 21.20% of unreported (21.20% cumulative)
 Allocated at
 replace_calloc /Volumes/firefoxos/B2G/gecko/memory/replace/dmd/DMD.cpp:1264 (0xb6f90744 libdmd.so+0x5744)
 os_calloc (0xb25aba16 libgsl.so+0xda16) (no addr2line)
 rb_alloc_primitive_lists (0xb1646ebc libGLESv2_adreno.so+0x74ebc) (no addr2line)
 rb_context_create (0xb16446c6 libGLESv2_adreno.so+0x726c6) (no addr2line)
 gl2_context_create (0xb16216f6 libGLESv2_adreno.so+0x4f6f6) (no addr2line)
 eglCreateClientApiContext (0xb25d3048 libEGL_adreno.so+0x1a048) (no addr2line)
 qeglDrvAPI_eglCreateContext (0xb25c931c libEGL_adreno.so+0x1031c) (no addr2line)
 eglCreateContext (0xb25bfb58 libEGL_adreno.so+0x6b58) (no addr2line)
 eglCreateContext /Volumes/firefoxos/B2G/frameworks/native/opengl/libs/EGL/eglApi.cpp:527 (0xb423dda2 libEGL.so+0xeda2)
 mozilla::gl::GLLibraryEGL::fCreateContext(void*, void*, void*, int const*) /Volumes/firefoxos/B2G/gecko/gfx/gl/GLLibraryEGL.h:180 (discriminator 3) (0xb4e88f4c libxul.so+0x789f4c)

We can’t traverse these allocations in the usual manner to measure them, because we have no idea about the layout of the relevant data structures. And we can’t provide a custom counting allocator to code outside of Firefox’s codebase.

However, although we pass control to the driver, control eventually comes back to the heap allocator, and that is something that we do have some power to change. So I had an idea to toggle some kind of mode that records all the allocations that occur within a section of code, as the following code snippet demonstrates.

SetHeapBlockTagForThread("webgl-create-new-context");
context = glx.xCreateNewContext(display, cfg, LOCAL_GLX_RGBA_TYPE, glxContext, True);
ClearHeapBlockTagForThread();

The calls on either side of glx.xCreateNewContext tell the allocator that it should tag all allocations done within that call. And later on, the relevant memory reporter can ask the allocator how many of these allocations remain and how big they are. I’ve implemented a draft version of this, and it basically works, as the following about:memory output shows.

216.97 MB (100.0%) -- explicit
├───78.50 MB (36.18%) ── webgl-contexts
├───32.37 MB (14.92%) ── heap-unclassified

The implementation is fairly simple.

  • There’s a global hash table which records which live heap blocks have a tag associated with them. (Most heap blocks don’t have a tag, so this table stays small.)
  • When SetHeapBlockTagForThread is called, the given tag is stored in thread-local storage. When ClearHeapBlockTagForThread is called, the tag is cleared.
  • When an allocation happens, we (quickly) check if there’s a tag set for the current thread and if so, put a (pointer,tag) pair into the table. Otherwise, we do nothing extra.
  • When a deallocation happens, we check if the deallocated block is in the table, and remove it if so.
  • To find all the live heap blocks with a particular tag, we simply iterate over the table looking for tag matches. This can be used by a memory reporter.

Unfortunately, the implementation isn’t suitable for landing in Firefox’s code, for several reasons.

  • It uses Mike Hommey’s replace_malloc infrastructure to wrap the default allocator (jemalloc). This works well — DMD does the same thing — but using it requires doing a special build and then setting some environment variables at start-up. This is ok for an occasional-use tool that’s only used by Firefox developers, but it’s important that about:memory works in vanilla builds without any additional effort.
  • Alternatively, I could modify jemalloc directly, but we’re hoping to one day move away from our old, heavily-modified version of jemalloc and start using an unmodified jemalloc3.
  • It may have a non-trivial performance hit. Although I haven’t measured performance yet — the above points are a bigger show-stopper at the moment — I’m worried about having to do a hash table lookup on every deallocation. Alternative implementations (store a marker in each block, or store tagged blocks in their own zone) are possible but present their own difficulties.
  • It can miss some allocations. When adding a tag for a particular section of code, you don’t want to mark every allocation that occurs while that section executes, because there could be multiple threads running and you don’t want to mark allocations from other threads. So it restricts the marking to a single thread, but if the section creates a new thread itself, any allocations done on that new thread will be missed. This might sound unlikely, but my implementation appears to miss some allocations and this is my best theory as to why.

This issue of OpenGL drivers and other kinds of third-party code has been a long-term shortcoming with about:memory. For the first time I have a partial solution, though it still has major problems. I’d love to hear if anyone has additional ideas on how to make it better.

Categories
about:memory DMD Firefox MemShrink

DMD now works on Windows

DMD is our tool for improving Firefox’s memory reporting.  It helps identify where new memory reporters need to be added in order to reduce the “heap-unclassified” value in about:memory.

DMD has always worked well on Linux, and moderately well on Mac (it is crashy for some people).  And it works on Android and B2G.  But it has never worked on Windows.

So I’m happy to report that DMD now does work on Windows, thanks to the excellent efforts of Catalin Iacob.  If you’re on Windows and you’ve been seeing high “heap-unclassified” values, and you’re able to build Firefox yourself, please give DMD a try.

Categories
DMD Firefox Garbage Collection Memory consumption MemShrink

MemShrink progress, week 85–86

Lots of news today.

Fixed Regressions

I wrote last time about a couple of bad regressions that AWSY identified.

The ongoing DOM bindings work will hopefully fully fix the second regression before the end of this development cycle (February 18).

AWSY

John Schoenick made three big improvements to AWSY.

  • It now measures every push to mozilla-inbound.  Previously it measured mozilla-central once per day.  This will make it easier and faster to identify patches responsible for regressions.
  • It’s now possible to trigger an AWSY run for any try build.  Unfortunately John hasn’t yet written instructions on how to do this;  I hope he will soon…
  • AWSY now measures Fennec as well.  Kartikaya Gupta created the benchmark that is used for this.  He also fixed a 4 MB regression that it identified.

Leaks Fixed

Benoit Jacob fixed a CC leak that he found with his refgraph tool.

Johnny Stenback fixed a leak involving SVG that he found with DMD.  This was a very slow leak that Johnny had seen repeatedly, which manifested as slowly increasing “heap-unclassified” values in about:memory over days or even weeks.  It’s a really nice case because it shows that DMD can be used on long-running sessions with minimal performance impact.

Justin Lebar fixed a B2G leak relating to forms.js.

Randall Jesup fixed a leak relating to WebRTC.

Andrew McCreight fixed a leak relating to HTMLButtonElement.

Erik Vold fixed a leak in the Restartless Restart add-on.

Miscellaneous

Brian Hackett optimized the representation of JS objects that feature both array (indexed) elements and named properties.  Previously, if an object had both elements and named properties, it would use a sparse representation that was very memory-inefficient if many array elements were present.  This performance fault had been known for a long time, and it caused bad memory blow-ups every once in a while, so it’s great to have it fixed.

As a follow-up, Brian also made it possible for objects that use the sparse representation to change back to the dense array representation if enough array elements are subsequently added.  This should also avoid some occasional blow-ups that occur when arrays get filled in in complex ways.

Gregory Szorc reduced the memory consumption of the new Firefox Health Report feature, from ~3 MB to ~1–1.5 MB: here and here and here and here. On a related note, Bill McCloskey is making good progress with reducing compartment overhead, which should be a sizeable win once it lands.

Gregory also reduced the memory consumption of Firefox Sync:  https: here and here.

Jonathan Kew reduced the amount of memory used by textruns when Facebook Messenger is enabled.

The Add-on SDK is now present in mozilla-central, which is a big step towards getting all add-ons that use it to always use the latest version.  This is nice because it will mean that when memory leaks in the SDK are fixed (and there have been many) all add-ons that use it will automatically get the benefit, without having to be repacked.

Generational GC

Generational garbage collection is an ongoing JS engine project that should reap big wins once it’s completed.  I don’t normally write about things that haven’t been finished, but this is a big project and I’ve had various people asking about it recently, so here’s an update.

Generational GC is one the JS teams two major goals for the near-term.  (The other is benchmark and/or game performance, I can’t remember which.)  You can see from the plan that there are eight people working on it (though not all of them are working on it all the time).

Brian Hackett implemented a static analysis that can determine which functions in the JS engine can trigger garbage collection.  On top of that, he then implemented a static analysis that can identify rooting hazards and unnecessary roots.  This may sound esoteric, but it has massively reduced the amount of work required to complete exact rooting, which is the key prerequisite for generational GC.  To give you an idea:  Terrence Cole estimated that it reduced the number of distinct code locations that need to be looked at and possibly modified from ~10,000 to ~200!  Great stuff.

Another good step was taken when I removed support for E4X from the JS engine.  E4X is an old JavaScript language extension that never gained wide support and was only implemented in Firefox.  The code implementing it was complicated, and an ongoing source of many bugs and security flaws.  The removal cut almost 13,000 lines of code and over 16,000 lines of tests.  It’s been destined for the chopping block for a long time, and its presence has been blocking generational GC, so all the JS team members are glad to see it go.

Bug Counts

Here are the current bug counts.

  • P1: 16 (-5/+0)
  • P2: 119 (-6/+0)
  • P3: 104 (-0/+0)
  • Unprioritized: 22 (-0/+18)

Three of the P1 “fixes” weren’t actual fixes, but cases where a bug was WONTFIXed, or downgraded.  The unprioritized number is high because we skipped this week’s MemShrink meeting due to the DOM work week in London, which occupied three of our regular contributors.

Categories
about:memory B2G DMD Fennec Firefox Memory consumption MemShrink

MemShrink progress, week 79–82

I skipped the last MemShrink report due to Christmas, so we have four weeks’ worth of bug fixes today.

LEAKS

Joe Walker fixed a bad leak found by Jesse Ruderman:  if you closed a browser window with the developer toolbar open it would leak “everything”.  This was a MemShrink P1 bug.

Anton Kovalyov fixed a leak involving scratchpad.  This bug was also found by Jesse Ruderman.

Randell Jesup fixed some WebRTC leaks.

John Schoenick fixed a leak involving plugins.

Josh Aas fixed a leak in some networking code.

DMD

I wrote a more detailed blog post about DMD.  Here is the take-away message.

about:memory is MemShrink’s not-so-secret weapon when it comes to understanding Firefox’s memory consumption… and DMD is how we make about:memory better.

Lots of under-the-hood improvements have been made to DMD since I wrote that.  Users on Mac, Linux and B2G who aren’t afraid of doing their own builds should try it out.  Also, Ehsan Akhgari got it to build on Windows, though it’s not yet clear how well it works on that platform.  If anyone wants to try it out, please let me know how it goes.

Memory Reporters

Ben Turner made the workers memory reporter be able to handle workers that use ctypes.  This was important, especially on B2G, because each process can have one or two or more such workers — this is for Firefox chrome stuff, not web content — and we weren’t measuring them at all, and they can take multiple MiB each.

I fixed the orphan DOM node memory reporter.  The introduction of WebIDL had changed the layout of some paired JS/DOM objects, and such objects weren’t being reported.  (DMD discovered the unreported memory, and Boris Zbarsky helped me interpret what it meant.)  I see this accounting for multiple MiB of orphan nodes when using Gmail.

I added a memory reporter for the event listenener manager’s hash table.  It starts off small, but I’ve seen it go as high as 1.5 MiB after lots of browsing.  (DMD helped me identify this too.)

Kartikaya Gupta added a memory reporter for graphics textures on Android.

I added a memory reporter for any ctypes data that is hanging off JS objects.  I added it because one DMD profile on B2G showed non-trivial amounts of ctypes data, but that seems to have been a fluke and it rarely shows much memory now.  Oh well.

Miscellaneous

Andrew McCreight improved CC shutdown logging, which will make it easier to identify shutdown leaks.

Rail Aliiev and Kartikaya Gupta enabled a new NDK for Fennec builds on releng machines.  This might result in smaller binary sizes, which saves memory.

Bug Counts

Here are the current bug counts.

  • P1: 15 (-2/+0)
  • P2: 116 (-10/+0)
  • P3: 102 (-4/+0)
  • Unprioritized: 18 (-0/+18)

The number of unprioritized bugs is high because we didn’t have a MemShrink meeting this week.  This was because Justin Lebar and Kyle Huey are in Berlin for the B2G work week.  We’ll have our next meeting two weeks from today.

Categories
about:memory DMD Firefox Memory consumption MemShrink

DMD

I recently landed a new version of DMD on mozilla-central.  DMD is a tool helps us understand and thus reduce Firefox’s memory consumption.  But in order to understand DMD, you first have to understand about:memory.

about:memory

The MemShrink project started about 18 months ago, and it has been very successful in reducing Firefox’s memory consumption.  about:memory is MemShrink’s not-so-secret weapon when it comes to understanding Firefox’s memory consumption.

about:memory screenshot

about:memory has some wonderful characteristics:  it provides literally thousands of measurements;  it’s available in ordinary release builds;  and it’s trivial to run (just type “about:memory” in the address bar).  This means that non-expert users can easily provide developers with detailed measurements if they are having problems.

However, about:memory has two shortcomings.  First, it simply visualizes the data provided by the memory reporter infrastructure.  The coverage provided by this infrastructure is good, but there are still some gaps.  These gaps manifest primarily in the “heap-unclassified” number in about:memory, which represents all the heap allocations that the memory reporters didn’t cover.  Unfortunately, about:memory can provide zero insight into what is within “heap-unclassified”.

Second, it’s hard to verify.  There’s no obvious way to tell if the numbers it gives are accurate (with a few exceptions;  e.g. negative numbers are obviously wrong).  We have established good practices for writing memory reporters (measure sizes, don’t compute then;  traverse data structures rather than maintaining size counters) that prevent most errors, but it’s still quite easy to accidentally count a block of memory twice.

This is where DMD comes in.  It helps with both the heap-unclassified and double-counting problems.

DMD version 1

I wrote the first version of DMD over a year ago.  “DMD” is short for “dark matter detector”, because “heap-unclassified” memory is sometimes jokingly called “dark matter”.

DMD works by intercepting all calls to malloc/free/etc.  This lets DMD track extra information about every heap block, such as where it was allocated.  Furthermore, DMD has hooks into the memory reporters (via functions created with the NS_MEMORY_REPORTER_MALLOC_SIZEOF_FUN macro, for those who are interested) so it knows when any heap block is measured by a memory reporter.

With that in place, DMD knows how many times each heap block has been reported.  So, after running the memory reporters, we can up each block into one of the following three groups.

  • Blocks that haven’t been reported indicate gaps in existing memory reporters.  They contribute to “heap-unclassified”.
  • Blocks that have been reported once are good.
  • Blocks that have been reported twice or more indicate defects in existing memory reporters.  They cause some measurements in about:memory to be too high, and “heap-unclassified” to be too low.

DMD presents its results in a sorted fashion that lists the unreported and twice-reported cases that are more important first.

This first version of DMD was implemented as a Valgrind tool, and what’s more, it required a special, patched build of Valgrind to run.  This meant it was difficult to set up, ran very slowly, and could only be used on Linux and Mac.  Despite these shortcomings, it has proved itself extremely useful, helping us get “heap-unclassified” down greatly (on my Linux desktop machine it’s typically between 8 and 12%) and identifying several cases of double-counting.

DMD Version 2

Some time after I wrote the first version of DMD, I realized that all it needed to work was the ability to intercept malloc/free/etc. and to obtain stack traces.  Valgrind was overkill for these purposes — it’s capable of supporting much more invasive tools — and, furthermore, there were existing tools (e.g. trace-malloc) in the Mozilla codebase that did exactly these things.  In other words, it would be possible to build a new version of DMD that integrated directly into the browser.

Work on this new version stalled for a while, because the old one was working well enough.  But then B2G’s Operation Slim Fast started, and it quickly became obvious that “heap-unclassified” on B2G was typically much higher than it was on desktop.  This is primarily because B2G has lots of small processes, and so various unmeasured things that weren’t a big deal on desktop became much more significant on B2G.  And DMD version 1 doesn’t work on B2G devices.

So that provided the impetus to complete the new version.  Mike Hommey finished his new replace-malloc infrastructure recently, which helped greatly, and the new version of DMD landed on mozilla-central last week.  The old version will soon be retired.

DMD now works on Linux, Android, Mac, B2G, and is just shy of working on Windows (where Ehsan Akhgari has been making gradual progress).  It’s also much faster, partly because it avoids the overhead of Valgrind, and partly because it now uses sampling.  The sampling causes blocks smaller than a certain size (4 KiB by default, though you can change that) to be sampled, while blocks larger than that are measured accurately;  this sacrifices a moderate amount of accuracy for a large amount of speed.

Try it

about:memory is wonderful, and DMD is how we make about:memory better.  There are now detailed instructions on how to build DMD, run it, and interpret its output.  It’s quite easy now, requiring only minor changes to the usual build and run steps, and several people have already done it successfully.  Please try it out, and help us further improve about:memory.

Categories
B2G DMD Firefox Memory consumption MemShrink

MemShrink progress, week 77–78

DMD

The big news this week is the landing of the native version of DMD.  The old version of DMD is a Valgrind-based tool that has been instrumental in reducing the size of about:memory’s “heap-unclassified”.  (For example, with trunk builds on my Linux desktop machine it’s now frequently less than 10%.)

However, the old version of DMD wasn’t easy to run.  For example, it required patching Valgrind’s source code and re-building it, among other things.  As a result, only a handful of people ever ran it.  Furthermore, because it’s a Valgrind tool it is very slow and will never run on Windows.

In contrast, the new version is much easier to use.  It just requires a slight configuration change at build time and a slight change in how the browser is invoked.

It’s also much faster, especially if you use the sampling mode which trades a small amount of precision for a great deal of speed.  Crucially, this means it is usable on B2G.  Also, it should be possible for people to use it for long-running sessions, which can be helpful for identifying slow leaks.

And while the new version doesn’t currently run on Windows, it should be fairly straightforward to get it working.  If anyone is interested in helping with this, please contact me, or take a look at the open bug.

I will write a more detailed post about the new version some time in the next few days, once I have updated the documentation and finalized a few more tweaks that should improve usability.  In the meantime, Justin Lebar is already using it in earnest to better understand B2G’s memory consumpion (e.g. see here, here, here, here, here, here, here).

Thanks to Mike Hommey for his excellent replace-malloc infrastructure, on which the new version is built, and to Justin for lots of help, especially with Android and B2G details.

B2G

The option that allows B2G to merge system compartments, previously implemented and landed by Kyle Huey, was enabled by default.  This is a big deal, because it’s the single biggest memory consumption improvement B2G has seen and is likely to see before version 1 is released.

Gabriel Svelto reduced the number of unused dirty pages kept around by jemalloc on B2G.  This can save ~4 MiB of memory across all processes, at a potential cost of making some allocations a little slower.  On a device as memory-constrained as the B2G phones, this is a good trade-off.

Social API

Felipe Gomes reduced the memory consumption of the social API when multiple browser windows are open.  This change has been backported to the beta channel, so it will be present in Firefox 18, in time for the bigger publicity push for this feature.  (For those of you that don’t like Facebook, please note that if you don’t enable the social API it will not affect Firefox’s performance in any way.)

Miscellaneous

I added a MEMORY_VSIZE telemetry reporters, which measures virtual memory consumption.  This might help us understand how many Windows users would benefit from 64-bit builds.

Bug counts

Here are the current bug counts.

  • P1: 17 (-3/+3)
  • P2: 126 (-2/+14)
  • P3: 106 (-4/+7)
  • Unprioritized: 0 (-2/+0)