Category: Memory consumption

Better stack fixing for Firefox

Post author By Nicholas Nethercote
Post date April 15, 2020
4 Comments on Better stack fixing for Firefox

I recently undertook a project to improve the stack fixing tools used for Firefox. This has resulted in some large performance wins (e.g. 10x-100x) and a significant improvement in code quality. The story involves Rust, Python, executable and debug info formats, Taskcluster, and many unexpected complications.

What is a stack fixer?

Within the Firefox code base, a stack fixer is a program that post-processes (“fixes”) the stack frames produced by MozFormatCodeAddress(), which often lack one or more of: function name, file name, or line number. It reads debug info from binaries (libraries and executables) to do so. It reads from standard input and writes to standard output. Lines matching the special stack frame format are modified appropriately. For example, a line like this in the input that names an executable or library:

#01: ???[tests/example +0x43a0]

is changed to a line in the output that names a function, source file, and line number:

#01: main (/home/njn/moz/fix-stacks/tests/example.c:24)

Lines that do not match the special stack frame format are passed through unchanged.

This process is sometimes called “symbolication”, though I will use “stack fixing” in this post because that’s the term used within the Firefox code base.

Stack fixing is used in two main ways for Firefox.

When tests are run on debug builds, a stack trace is produced if a crash or assertion failure happens.
The heap profiling tool DMD records many stack traces at heap allocation points. The stack frames from these stack traces are written to an output file.

A developer needs high-quality stack fixing for the stack traces to be useful in either case.

That doesn’t sound that complicated

The idea is simple, but the reality isn’t.

The debug info format is different on each of the major platforms: Windows (PE/PDB), Mac (Mach-O), and Linux (ELF/DWARF).
We also support Breakpad symbols, a cross-platform debug info format that we use on automation. (Using it on local builds is something of a pain.)
Each debug info format is complicated.
Firefox is built from a number of libraries, but most of its code is in a single library called libxul, whose size ranges from 100 MiB to 2 GiB, depending on the platform and the particular kind of build. This stresses stack fixers.

Before I started this work, we had three different Python scripts for stack fixing.

fix_linux_stack.py: This script does native stack fixing on Linux. It farms out most of the work to addr2line, readelf, and objdump.
fix_macosx_stack.py: This script does native stack fixing on Mac. It farms out most of the work to atos, otool, and c++filt.
fix_stack_using_bpsyms.py: This script does stack fixing using Breakpad symbols. It does the work itself.

Note that there is no fix_windows_stack.py script. We did not have a native stack-fixing option for Windows.

This was an inelegant mishmash. More importantly, the speed of these scripts was poor and highly variable. Stack fixing could take anywhere from tens of seconds to tens of minutes, depending on the platform, build configuration, and number of stack frames that needed fixing. For example, on my fast 28-core Linux box I would often have to wait 20 minutes or more to post-process the files from a DMD run.

One tool to rule them all

It would be nice to have a single program that could handle all the necessary formats. It would also be nice if it was much faster than the existing scripts.

Fortunately, the Symbolic Rust crate written by Sentry provided the perfect foundation for such a tool. It provides the multi-platform debug info functionality needed for stack fixing, and also has high performance. In November last year I started a project to implement a new stack fixer in Rust, called fix-stacks.

Implementing the tool

First I got it working on Linux. I find Linux is often the easiest platform to get new code working on, at least partly because it’s the platform I’m most familiar with. In this case it was also helped by the fact that on Linux debug info is most commonly stored within the binary (library or executable) that it describes, which avoids the need to find a separate debug info file. The code was straightforward. The Symbolic crate did the hard part of reading the debug info, and my code just had to use the APIs provided to iterate over the parsed data and build up some data structures that could then be searched.

Then I got it working on Windows. I find Windows is often the hardest platform to get new code working on, but that wasn’t the case here. The only complication was that Windows debug info is stored in a PDB file that is separate from the binary, but Symbolic has a function for getting the name of that file from the binary, so it wasn’t hard to add code to look in that separate file.

Then I got it working on Mac. This was by far the hardest platform, for two reasons. First, the code had to handle fat binaries, which contain code for multiple architectures. Fortunately, Symbolic has direct support for fat binaries so that wasn’t too bad.

Second, the normal approach on Mac is to read debug info from the files produced by dsymutil, in which the debug info is neatly packaged. Unfortunately, dsymutil is very slow and we try to avoid running it in the Firefox build system if possible. So I took an alternative approach: read the binary’s symbol table and then read debug info from the object files and archive files it mentions. I knew that atos used this approach, but unfortunately its source code isn’t available, so I couldn’t see exactly what it did. If I couldn’t get the approach working myself the whole project was at risk; a one-tool-to-rule-them-all strategy falls short it if doesn’t work on one platform.

I spent quite some time reading about the Mach-O file format and using the MachOView utility to inspect Mach-O binaries. Symbolic doesn’t provide an API for reading symbol tables, so I had to use the lower-level goblin crate for that part. (Symbolic uses goblin itself, which means that fix-stacks is using goblin both directly and indirectly.) First I got it working on some very small test files, then on some smaller libraries within Firefox, and finally (to great relief!) on libxul. At each step I had to deal with new complications in the file format that I hadn’t known about in advance. I also had to modify Symbolic itself to handle some edge cases in .o files.

After that, I got fix-stacks working on Breakpad symbols. This was more straightforward; the only tricky part was navigating the directory structure that Firefox uses for storing the Breakpad symbols files. (I found out the hard way that the directory structure is different on Windows.)

One final complication is that DMD’s output, which gets run through the stack fixer, is in JSON format. So fix-stacks has a JSON mode (enabled with --json) that does the appropriate things with JSON escape characters on both input and output. This took three attempts to get completely right.

The end result is a single program that can fix stacks on all four of the formats we need. The stack traces produced by fix-stacks are sometimes different to those produced by the old stack fixing scripts. In my experience these differences are minor and you won’t notice them if you aren’t looking for them.

Code size

The source code for the first version of fix-stacks, which only supported Linux, was 275 lines (excluding tests). The current version, with support for Windows, Mac, Breakpad symbols, and JSON handling, is 891 lines (excluding tests).

In comparison, the Symbolic crate is about 20,000 lines of Rust code in total (including tests), and the three sub-crates that fix-stacks uses (debuginfo, demangle, and common) are 11,400 lines of Rust code. goblin is another 18,000 lines of code. (That’s what I call “leveraging the ecosystem”!)

Beyond Symbolic and goblin, the only other external crates that fix-stacks uses are fxhash, regex, and serde_json.

Testing

Testing is important for a tool like this. It’s hard to write test inputs manually in formats like ELF/DWARF, PE/PDB, and Mach-O, so I used clang to generate inputs from some simple C programs. Both the C programs and the binary files generated from them are in the repository.

Some of the generated inputs needed additional changes after they were generated by clang. This is explained by the testing README file:

The stack frames produced by `MozFormatCodeAddress()` contain absolute paths
and refer to build files, which means that `fix-stacks` can only be sensibly
run on the same machine that produced the stack frames.

However, the test inputs must work on any machine, not just the machine that
produced those inputs. Furthermore, it is convenient when developing if all the
tests works on all platforms, e.g. the tests involving ELF/DWARF files should
work on Windows, and the tests involving PE/PDB files should work on Linux.

To allow this requires the following.
- All paths in inputs must be relative, rather than absolute.
- All paths must use forward slashes rather than backslashes as directory
  separators. (This is because Windows allows both forward slashes and
  backslashes, but Linux and Mac only allow forward slashes.) This includes the
  paths in text inputs, and also some paths within executables (such as a PE
  file's reference to a PDB file).

To satisfy these constraints required some hex-editing of the generated input files. Quoting the README again:

`example-windows.exe` and `example-windows.pdb` were produced on a Windows 10
laptop by clang 9.0 with this command within `tests/`:
```
clang -g example.c -o example-windows.exe
```
`example-windows.exe` was then hex-edited to change the PDB reference from the
absolute path `c:\Users\njn\moz\fix-stacks\tests\example-windows.pdb` to the
relative path `tests/////////////////////////////example-windows.pdb`. (The use
of many redundant forward slashes is a hack to keep the path the same length,
which avoids the need for more complex changes to that file.)

A hack, to be sure, but an effective one.

The steps required to produce the Mac test inputs were even more complicated because they involve fat binaries. I was careful to make that README file clearly describe the steps I took to generate all the test inputs. The effort has paid off multiple times when modifying the tests.

Integrating the tool

Once I had the fix-stacks working well, I thought that most of the work was done and integrating it into the Firefox build and test system would be straightforward. I was mistaken! The integration ended up being a similar amount of work.

First, I added three new jobs to Mozilla’s Taskcluster instance to build fix-stacks and make it available for downloading on Windows, Mac, and Linux; this is called a “toolchain”. This required making changes to various Taskcluster configuration files, and writing a shell script containing the build instructions. All of this was new to me, and it isn’t documented, so I had to cargo-cult from similar existing toolchains while asking lots of questions of the relevant experts. You can’t test jobs like these on your own machine so it took me dozens of “try” pushes to Mozilla’s test machines to get it working, with each push taking roughly 10 minutes to complete.

Then I added a wrapper script (fix_stacks.py) and changed the native stack fixing path in DMD to use it instead of fix_linux_stack.py or fix_macosx_stack.py. This took some care, with numerous try pushes to manually check that the stacks produced by fix_stacks.py were as good as or better than the ones produced by the old scripts. To do this manual checking I first had to deliberately break the DMD test, because the stacks produced are not printed in the test log when the test passes. I also had to update mach bootstrap so it would install a pre-built fix-stacks executable in the user’s .mozbuild directory, which was another unfamiliar part of the code for me. Plus I fixed a problem with the fix-stacks toolchain for Mac: the fix-stacks executable was being cross-compiled on a Linux machine, but some errors meant it was not actually cross-compiling, but simply building a Linux executable. Plus I fixed a problem with the fix-stacks toolchain for Windows: it was building a 64-bit executable, but that wouldn’t work on our 32-bit test jobs; cross-compiling a 32-bit Windows executable on Linux turned out to be the easiest way to fix it. Again, these toolchain fixes took numerous trial-and-error try pushes to get things working. Once it was all working, native stack fixing on Windows was available for DMD for the first time.

Then I changed the native stack fixing path in tests to use fix_stacks.py. This required some minor changes to fix_stacks.py‘s output, to make it more closely match that of the old scripts, to satisfy some tests. I also had to modify the Taskcluster configuration to install the fix-stacks executable in a few extra places; again this required some trial-and-error with try pushes. (Some of those modifications I added after my initial landing attempt was backed out due to causing failures in a tier 2 job that doesn’t run by default on try, *cough*.) At this point, native stack fixing on Windows was available for test output for the first time.

Then I re-enabled stack-fixing for local test runs on Mac. It had been disabled in December 2019 because fixing a single stack typically took at least 15 minutes. With fix_stacks.py it takes about 30 seconds, and it also now prints out a “this may take a while” message to prepare the user for their 30 second wait.

Along the way, I noticed that one use point of the old stack fixing scripts, in automation.py.in, was dead code. Geoff Brown kindly removed this dead code.

Then I changed the Breakpad symbols stack fixing path in DMD to use fix_stacks.py, which was simple.

And then Henrik Skupin noticed that the fix-stacks executable wasn’t installed when you ran mach bootstrap for artifact builds, so I fixed that.

And then I was told that I had broken the AWSY-DMD test jobs on Windows. This wasn’t noticed for weeks because those jobs don’t run by default, and to run them on try you must opt into the “full” job list, which is unusual. The problem was some gnarly file locking caused by the way file descriptors are inherited when a child process is spawned on Windows in Python 2; working this out took some time. (It wouldn’t be a problem on Python 3, but unfortunately this code is Python 2 and that cannot be easily changed.) I thought I had a fix, but it caused other problems, and so I ended up disabling stack fixing on Windows for this job, which was a shame, but put us back where we started, with no stack fixing on Windows for that particular job.

And then I changed the Breakpad symbols stack fixing path in tests to use fix_stacks.py, which seemed simple. But it turns out that tests on Android partly run using code from the current Firefox repository, and partly using code from the “host utils”, which is a snapshot of the Firefox repository from… the last time someone updated the snapshot. (This has something to do with part of the tests actually taking place on Linux machines; I don’t understand the details and have probably mis-described the setup.) The host utils in use at the time was several months old and lacked the fix_stacks.py script. So Andrew Erickson kindly updated the host utils for me. And then I fixed a few more Taskcluster configuration issues, and then the “simple” fix could land. And then I fixed another configuration issues that showed up later, in a follow-up bug.

And then I removed the old stack fixing scripts because they weren’t being used any more.

And then I found a better solution to the Windows + Python 2 file descriptor issue, allowing me to re-enable stack fixing for the Windows AWSY-DMD job. (With another host utils update, to keep the Android tests working.)

And then I updated all the online documentation I could find that referred to the old scripts, all of it on MDN.

And then I closed the meta-bug that had been tracking all of this work. Hooray!

And then I was told of another obscure test output issue relating to web platform tests, which I have not yet landed a fix for. One lesson here is that changing code that potentially affects the output of every test suite is a fraught endeavour, with the possibility of a long tail of problems showing up intermittently.

Performance

I did some speed and peak memory measurements on the two common use cases: fixing many stack frames in a DMD file, and fixing a single stack trace from an assertion failure in a test. The machines I used are: a fast 28-core Linux desktop machine, a 2019 16-inch 8-core MacBook Pro, and an old Lenovo ThinkPad Windows laptop. The fix-stacks executable is compiled with LTO, because I found it gives speed-ups of up to 30%.

First, the following measurements are for fixing a DMD output file produced by an optimized Firefox build, old vs. new.

Linux native: 4m44s / 4.8 GB vs. 21s / 2.4 GB
Mac native: 15m47s / 1.0 GB vs. 31s / 2.4 GB
Windows native: N/A vs. 29s / 2.6 GB
Linux Breakpad symbols: 25s / 2.1 GB vs. 13s / 0.6 GB

(Each platform had a different input file, with some variations in the sizes, so cross-platform comparisons aren’t meaningful.)

On Linux we see a 13x speed-up, and I have seen up to 100x improvements on larger inputs. This is because the old script started quickly, but then each additional stack frame fixed was relatively slow. In comparison, the new script has a slightly higher overhead at start-up but then each additional stack frame fixed is very fast. Memory usage is halved, but still high, because libxul is so large.

On Mac the new script is 30x faster than the old script, but memory usage is more than doubled, interestingly. atos must have a particularly compact representation of the data.

On Windows we couldn’t natively fix stacks before.

For Breakpad symbols we see a 2x speed-up and peak memory usage is less than one-third.

Second, the following measurements are for fixing a single stack trace produced by a debug Firefox build, old vs. new.

Linux native: 9s / 1.5 GB vs. 13s / 2.5 GB
Mac native: 15m01s / 1.1 GB vs. 27s / 2.6 GB
Win native: N/A vs. 30s / 3.2 GB
Linux Breakpad symbols: 27s / 3.5 GB vs. 13s / 1.1 GB

On Linux, both speed and peak memory usage are somewhat worse. Perhaps addr2line is optimized for doing a small number of lookups.

On Mac the new script is again drastically faster, 33x this time, but memory usage is again more than doubled.

On Windows, again, we couldn’t natively fix stacks before.

For Breakpad symbols we again see a 2x speed-up and peak memory usage of less than one-third.

You might have noticed that the memory usage for the single stack trace was generally higher than for the DMD output. I think this is because the former is an optimized build, while the latter is a debug build.

In summary:

The speed of native stack fixing is massively improved in many cases, with 10x-100x improvements typical, and slightly slower in only one case. This represents some drastic time savings for Firefox developers.
The peak memory usage of native stack fixing is sometimes lower, sometimes higher, and still quite high in general. But the amount of memory needed is still much less than that required to compile Firefox, so it shouldn’t be a problem for Firefox developers.
Native stack fixing is now possible on Windows, which makes things easier for Firefox developers on Windows.
For Breakpad symbols stack fixing is 2x faster and takes 3x less memory. This represents some significant savings in machine time on automation, and will also reduce the chance of failures caused by running out of memory, which can be a problem in practice.

My experience with Rust

Much of my work using Rust has been on the Rust compiler itself, but that mostly involves making small edits to existing code. fix-stacks is the third production-quality Rust project I have written from scratch, the others being Firefox’s new prefs parser (just under 1000 lines of code) and counts (just under 100 lines of code).

My experience in all cases has been excellent.

I have high confidence in the code’s correctness, and that I’m not missing edge cases that could occur in either C++ (due to lack of safety checks) or Python (due to dynamic typing).
The deployed code has been reliable.
Rust is a very pleasant language to write code in: expressive, powerful, and many things just feel “right”.
I have been writing C++ a lot longer than Rust but I feel more competent and effective in Rust, due to its safety and expressiveness.
Performance is excellent.
As mentioned above, the entire fix-stacks project wouldn’t have happened without the third-party Symbolic crate.

Rust gives me a feeling of “no compromises” that other languages don’t.

Conclusion

Stack fixing is much better now, and it took more work than I expected!

Many thanks to Mike Hommey, Eric Rahm, and Gabriele Svelto for answering lots of questions and reviewing many patches along the way.

Memory allocation Memory consumption Performance Valgrind

A better DHAT

Post author By Nicholas Nethercote
Post date April 17, 2019

DHAT is a heap profiler that comes with Valgrind. (The name is short for “Dynamic Heap Analysis Tool”.) It tells your where all your heap allocations come from, and can help you find the following: places that cause excessive numbers of allocations; leaks; unused and under-used allocations; short-lived allocations; and allocations with inefficient data layouts. This old blog post goes into some detail.

In the new Valgrind 3.15 release I have given DHAT a thorough overhaul.

The old DHAT was very useful and I have used it a lot while profiling the Rust compiler. But it had some rather annoying limitations, which the new DHAT overcomes.

First, the old DHAT dumped its data as text at program termination. The new DHAT collects its data in a file which is read by a graphical viewer that runs in a web browser. This gives several advantages.

The separation of data collection and data presentation means you can run a program once under DHAT and then sort and filter the data in various ways, instead of having to choose a particular sort order in advance. Also, full data is in the output file, and the graphical viewer chooses what to omit.
The data can be sorted in more ways than previously. Some of these sorts involve useful filters such as “short-lived” and “zero reads or zero writes”.
The graphical viewer allows parts of the data to be hidden and unhidden as necessary.

Second, the old DHAT divided its output into records, where each record consisted of all the heap allocations that have the same allocation stack trace. The choice of stack trace depth could greatly affect the output.

In contrast, the new DHAT is based around trees of stack traces that avoid the need to choose stack trace depth. This avoids both the problem of not enough depth (when records that should be distinct are combined, and may not contain enough information to be actionable) and the problem of too much depth (when records that should be combined are separated, making them seem less important than they really are).

Third, the new DHAT also collects and/or shows data that the old DHAT did not.

Byte and block measurements are shown with a percentage relative to the global measurements, which helps gauge relative significance of different parts of the profile.
Byte and block measurements are also shown with an allocation rate (bytes and blocks per million instructions), which enables comparisons across multiple profiles, even if those profiles represent different workloads.
Both global and per-node measurements are taken at the global heap peak, which gives Massif-like insight into the point of peak memory use.
The final/lifetimes stats are a bit more useful than the old deaths stats. (E.g. the old deaths stats didn’t take into account lifetimes of unfreed blocks.)

Finally, the new DHAT has a better handling of realloc. The sequence p = malloc(100); realloc(p, 200); now increases the total block count by 2 and the total byte count by 300. In the old DHAT it increased them by 1 and 200. The new handling is a more operational view that better reflects the effect of allocations on performance. It makes a significant difference in the results, giving paths involving reallocation (e.g. repeated pushing to a growing vector) more prominence.

Overall these changes make DHAT more powerful and easier to use.

The following screenshot gives an idea of what the new graphical viewer looks like.

The new DHAT can be run using the --tool=dhat flag, in contrast with the old DHAT, which was an “experimental” tool and so used the --tool=exp-dhat flag. For more details see the documentation.

Memory consumption Performance Programming Rust

How to get the size of Rust types with -Zprint-type-sizes

Post author By Nicholas Nethercote
Post date November 9, 2018

When optimizing Rust code it’s sometimes useful to know how big a type is, i.e. how many bytes it takes up in memory. std::mem::size_of can tell you, but often you want to know the exact layout as well. For example, an enum might be surprisingly big, in which case you probably will want to know if, for example, there is one variant that is much bigger than the others.

The -Zprint-type-sizes option does exactly this. It isn’t enabled on release versions of rustc, so you’ll need to a nightly version of rustc. Here is one possible invocation:

cargo +nightly rustc -- -Zprint-type-sizes

It will print out details of the size, layout, and alignment of all types in use. For example, for this type:

enum E {
    A,
    B(i32),
    C(u64, u8, u64, u8),
    D(Vec<u32>),
}

it prints the following, plus info about a few built-in types:

print-type-size type: `E`: 32 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `A`: 0 bytes
print-type-size     variant `B`: 7 bytes
print-type-size         padding: 3 bytes
print-type-size         field `.0`: 4 bytes, alignment: 4 bytes
print-type-size     variant `C`: 23 bytes
print-type-size         field `.1`: 1 bytes
print-type-size         field `.3`: 1 bytes
print-type-size         padding: 5 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size         field `.2`: 8 bytes
print-type-size     variant `D`: 31 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 24 bytes, alignment: 8 bytes

It shows:

the size and alignment of the type;
for enums, the size of the discriminant;
for enums, the size of each variant;
the size, alignment, and ordering of all fields (note that the compiler has reordered variant C‘s fields to minimize the size of E);
the size and location of all padding.

Every detail you could possibly want is there. Brilliant!

For rustc developers, there’s an extra-special trick for getting the size of a type within rustc itself. Put code like this into a file a.rs:

#![feature(rustc_private)]
extern crate syntax;
use syntax::ast::Expr;
fn main() {
    let _x = std::mem::size_of::<Expr>();
}

and then compile it like this:

RUSTC_BOOTSTRAP=1 rustc -Zprint-type-sizes a.rs

I won’t pretend to understand how it works, but the use of rustc_private and RUSTC_BOOTSTRAP somehow let you see inside rustc while using it, rather than while compiling it. I have used this trick for PRs such as this one.

Memory consumption Performance Rust

How to speed up the Rust compiler in 2018: NLL edition

Post author By Nicholas Nethercote
Post date November 6, 2018
1 Comment on How to speed up the Rust compiler in 2018: NLL edition

Niko Matsakis recently blogged about the Rust compiler’s new borrow checker, which implements non-lexical lifetimes (NLL). The new borrow checker is a really nice improvement to Rust, because it accepts many sound programs that the old borrow checker rejected.

In the blog post, Niko wrote briefly about the performance of the new borrow checker.

Finally, those of you who read the previous posts may remember that the performance of the NLL checker was a big stumbling block. I’m happy to report that the performance issues were largely addressed: there remains some slight overhead to using NLL, but it is largely not noticeable in practice, and I expect we’ll continue to improve it over time.

This paragraph is true, but glosses over a lot of details! This post will be about my contributions to this performance work.

Wins

Before I describe individual improvements, it’s worth mentioning that the new borrow checker uses bitsets (1D) and bit matrices (2D) heavily. A number of my wins involved these data structures.

#51869: This PR changed some code so that it overwrote an existing dense bitset rather than replacing it with a newly created one of the same size, reducing instruction counts for most benchmarks, the best by 1.5%.

#51870: This PR reused a structure containing two bitsets rather than recreating it afresh for every statement in a basic block, reducing instruction counts for numerous benchmarks, the best by 1%.

#52250: The compiler has a SparseBitMatrix type. Rows were added on demand, and each row was implemented as a sparse bitset using a BTreeMap. In practice, many of the rows were relatively dense, with 10–90% of the bits being set. This PR changed SparseBitMatrix to use a dense representation for rows, reducing instruction counts on one benchmark by 33% and by 1% on a few others. The PR had a mixed effect on memory usage, increasing the peak on some benchmarks and reducing it on others. (Never fear! #54318 below ended up fixing the regressions.)

#52342: This PR avoided a bunch of allocations in Canonicalizer methods, reducing instruction counts on numerous benchmarks, the best by 2%.

#53383: Further profiling showed that some dense bitsets were large, but had very few bits set within them, so the dense representation was wasteful. This PR implemented a new hybrid bitset type that uses a sparse representation for bitsets with up to 8 bits set and switches to a dense representation beyond that, and used it to replace dense bitsets in several places, reducing instruction counts on the five slowest benchmarks by 55%, 21%, 16%, 10% and 9%, and reducing peak memory usage of three benchmarks by 53%, 33%, and 9%.

#53513: This PR force-inlined a function at one hot callsite, reducing instruction counts on two benchmarks by 1%.

#53551: This PR avoided some clone calls, reducing instruction counts for one benchmark by 0.5%.

#53733: This PR added special handling for a very common and simple case in unroll_place, reducing the instruction counts on one benchmark by 25%.

#53942: A function called precompute_borrows_out_of_scope does a traversal of one or more basic blocks. In order to detect whether a basic block had been previously visited, it recorded the ID of every visited statement in a hash table. Some basic blocks can have many statements, resulting in many hash table lookups. This PR changed the code to record the ID of visited basic blocks in the hash table instead of visited statements — trickier than it sounds because the analysis can start in the middle of a basic block, in which case the first half might need to be eventually visited — reducing instruction counts on one benchmark by 60%.

#54211: Liveness analysis created an array that could get very large. Each element in the array was a struct containing two u32s and a bool. Each of those elements took up 12 bytes, but only 9 of the bytes held data. This PR split the array into three separate arrays, one per field, making the code slightly less readable but reducing peak memory usage on one benchmark by 20%. (Fun fact: those u32s used to be usizes, but I shrunk them back in May.)

#54213: This PR tweaked some code so that the lifetimes of two large data structures didn’t overlap, reducing peak memory usage by 27% on one benchmark and 8% on another. (Note: those data structures are dominated by, you guessed it, bitsets!)

#54318: This PR changed SparseBitMatrix so that each instantiated row used the hybrid bitset representation from #53383 instead of the dense representation, reducing peak memory usage by 14–45% on four benchmarks, and 96% (from 29.1GB to 1.2GB) on one external crate! This PR also fixed the peak memory regression that #52250 introduced for a few benchmarks.

#54420: I subsequently realized that #54211 didn’t go far enough. Some debugging println! statements showed that both of the u32s in each liveness entry almost always held a special value that meant “invalid”. When data is repetitive, compression is possible: I could use a packed representation where the common (INVALID, INVALID, true) and (INVALID, INVALID, false) cases were represented by special u32 values, and all other triples were represented by a u32 index into an auxiliary table. This PR changed the representation as described, reducing instruction counts on numerous benchmarks, the best by 16%, and reducing peak memory usage on numerous benchmarks, the best by 38%. (I also tried a more compact representation where each element was a single byte; it reduced peak memory usage some more by the instruction count reduction was less, so I went with the earlier approach.)

Progress and current status

You probably noticed that some of the improvements in the previous section were large, and I wasn’t the only one working on NLL performance; Niko Matsakis and David Wood also contributed some big wins. This is because the new borrow checker’s performance was originally, to be honest, terrible. This is understandable; the focus had been on functionality and correctness, which is fair enough for a large and complex new component. Nonetheless, in June I was very nervous about its performance.

To be more specific, “check” builds (which don’t generate code) ran as much as 50x slower with the new borrow checker on some benchmarks. And multiple benchmarks were disabled on CI because they were simply too slow or used too much memory.

Issue #52028 tells a representative story. It was originally filed because the html5ever benchmark was triggering out-of-memory failures on CI. Measurements with Massif and DHAT showed that its peak heap memory usage was over 14 GB, largely caused by a single 12 GB allocation! In comparison, the peak with the old borrow checker was roughly 200–300 MB. If you read through that issue, you can see that over a period of 2.5 months we reduced the memory usage from 14 GB, to 10 GB, to 2 GB, to 1.2 GB, to 600 MB, to 501 MB, and finally to 266 MB.

And things are pretty good now. The instruction counts on “check” builds for all benchmarks are at most 18% higher with the new borrow checker than the old borrow checker, and are typically around 5%. (And note that “check” builds are the worst-case scenario; non-“check” builds will see a smaller relative slowdown because of the extra time needed for code generation, which is unaffected by the borrow checker.) Memory usage is similar: all benchmarks except one have peak memory usage that is at most 20% higher, with the typical value around 3%. (The one remaining exceptional benchmark uses 2.7x memory.) The worse numbers generally occur on programs containing very large constants.

I’m not entirely happy even with this level of performance regression, but for now I have run out of ideas for improving it further. The new borrow checker is a lot more sophisticated and tracks a lot more data, so strict performance parity is a tough ask. Nonetheless, given how bad performance was a few months ago, I’m happy that we’ve got it down to a level where most people probably won’t notice any difference. Given that the new borrow checker makes Rust a significantly nicer and easier language to use, I hope it’s an acceptable trade-off.

Firefox Memory consumption MemShrink Programming

Slimmer and simpler static atoms

Post author By Nicholas Nethercote
Post date October 12, 2018
2 Comments on Slimmer and simpler static atoms

String interning is:

a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. The single copy of each string is called its intern.

In Firefox’s code we use the term atom rather than intern, and atom table rather than string intern pool. I don’t know why; those names have been used for a long time.

Furthermore, Firefox distinguishes between static atoms, which are those that are chosen at compile time and can be directly referred to via an identifier, and dynamic atoms, which are added on-demand at runtime. This post is about the former.

In 2016, Firefox’s implementation of static atoms was complex and inefficient. I filed a bug about this that included the following ASCII diagram showing all the data structures involved for a single atom for the string “foobar”.

static nsFakeStringBuffer<N=7> foobar_buffer (.data, 8+2N bytes)
/-----------------------------------------\ <------+
| int32_t mRefCnt = 1 // never reaches 0  |        | 
| uint32_t mSize = 14 // 7 x 16-bit chars |        | 
| u"foobar"           // the actual chars | <----+ | 
\-----------------------------------------/      | | 
                                                 | | 
PermanentAtomImpl (heap, 32 bytes)               | | 
/----------------------------------------------\ | | <-+
| void* vtablePtr    // implicit               | | |   | 
| uint32_t mLength = 6                         | | |   | 
| uint32_t mHash = ...                         | | |   | 
| char16_t* mString = @------------------------|-+ |   | 
| uintptr_t mRefCnt  // from NS_DECL_ISUPPORTS |   |   | 
\----------------------------------------------/   |   | 
                                                   |   | 
static nsIAtom* foobar (.bss, 8 bytes)             |   | 
/---\ <-----------------------------------+        |   | 
| @-|-------------------------------------|------------+
\---/                                     |        |   | 
                                          |        |   | 
static nsStaticAtom (.d.r.ro.l, 16 bytes) |        |   | 
(this element is part of a larger array)  |        |   | 
/------------------------------------\    |        |   | 
| nsStringBuffer* mStringBuffer = O--|----|--------+   | 
| nsIAtom** mAtom = @----------------|----+            | 
\------------------------------------/                 | 
                                                       | 
AtomTableEntry (heap, ~2 x 16 bytes[*])                | 
(this entry is part of gAtomTable)                     | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| AtomImpl* mAtom = @-----|----------------------------+
\-------------------------/                            | 
                                                       | 
StaticAtomEntry (heap, ~2 x 16 bytes[*])               | 
(this entry is part of gStaticAtomTable)               | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| nsIAtom* mAtom = @------|----------------------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

There is a lot going on in that diagram, but putting that all together gave the following overhead per atom.

Static shared: 0 bytes
Static unshared: 8 + 2(length+1) + 8 + 16
Dynamic: 32 + ~32 + ~32 bytes
Total bytes: (2(length+1) + 64 + ~64) * num_processes

(Although these atoms are “static” in the sense of being known at compile-time, a lot of the associated data was allocated dynamically.)

At the time there were about 2,700 static atoms, and avg_length was about 11, so the overhead was roughly:

0 bytes fixed, and
410,400 bytes per process. (Or more, depending on how the relocations required for the static pointers were represented, which depended on the platform.)

Today, things have improved greatly and now look like the following.

const char16_t[7] (.rodata, 2(N+1) bytes)
(this is detail::gGkAtoms.foobar_string)
/-----------------------------------------\ <--+
| u"foobar"           // the actual chars |    | 
\-----------------------------------------/    | 
                                               | 
const nsStaticAtom (.rodata, 12 bytes)         | 
(this is within detail::gGkAtoms.mAtoms[])     | 
/-------------------------------------\ <---+  | 
| uint32_t mLength:30 = 6             |     |  | 
| uint32_t mKind:2 = AtomKind::Static |     |  | 
| uint32_t mHash = ...                |     |  | 
| uint32_t mStringOffset = @----------|-----|--+
\-------------------------------------/     | 
                                            | 
constexpr nsStaticAtom* (0 bytes) @---------+
(this is nsGkAtoms::foobar)                 | 
                                            | 
AtomTableEntry (heap, ~2 x 16 bytes[*])     | 
(this entry is part of gAtomTable)          | 
/-------------------------\                 | 
| uint32_t mKeyHash = ... |                 | 
| nsAtom* mAtom = @-------|-----------------+
\-------------------------/

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

That gives the following overhead per atom.

Static shared: 12 + 2(length+1) bytes
Static unshared: 0 bytes
Dynamic: ~32 bytes
Total: 12 + 2(length+1) + ~32 * num_processes

We now have about 2,300 static atoms and avg_length is still around 11, so the overhead is roughly:

82,800 bytes fixed, and
73,600 bytes per process.

I won’t explain all the parts of the two diagrams, but it can be seen that we’ve gone from six pieces per static atom to four; the size and complexity of the remaining pieces are greatly reduced; there are no static pointers (only constexpr pointers and integral offsets) and thus no relocations; and there is a lot more interprocess sharing thanks to more use of const. Also, there is no need for a separate static atom table any more, because the main atom table is thread-safe and the HTML5 parser (the primary user of the separate static atom table) now has a small but highly effective static atoms cache.

Things that aren’t visible from the diagrams: atoms are no longer exposed to JavaScript code via XPIDL, there are no longer any virtual methods involved, and all atoms are defined in a single place (with no duplicates) instead of 7 or 8 different places. Notably, the last few steps were blocked for some time by a bug in MSVC involving the handling of constexpr.

The bug dependency tree gives a good indication of how many separate steps were involved in this work. If there is any lesson to be had here, it’s that small improvements add up over time.

Firefox Memory consumption MemShrink Uptime

Firefox 64-bit for Windows can take advantage of more memory

Post author By Nicholas Nethercote
Post date July 22, 2016
16 Comments on Firefox 64-bit for Windows can take advantage of more memory

By default, on Windows, Firefox is a 32-bit application. This means that it is limited to using at most 4 GiB of memory, even on machines that have more than 4 GiB of physical memory (RAM). In fact, depending on the OS configuration, the limit may be as low as 2 GiB.

Now, 2–4 GiB might sound like a lot of memory, but it’s not that unusual for power users to use that much. This includes:

users with many (dozens or even hundreds) of tabs open;
users with many (dozens) of extensions;
users of memory-hungry web sites and web apps; and
users who do all of the above!

Furthermore, in practice it’s not possible to totally fill up this available space because fragmentation inevitably occurs. For example, Firefox might need to make a 10 MiB allocation and there might be more than 10 MiB of unused memory, but if that available memory is divided into many pieces all of which are smaller than 10 MiB, then the allocation will fail.

When an allocation does fail, Firefox can sometimes handle it gracefully. But often this isn’t possible, in which case Firefox will abort. Although this is a controlled abort, the effect for the user is basically identical to an uncontrolled crash, and they’ll have to restart Firefox. A significant fraction of Firefox crashes/aborts are due to this problem, known as address space exhaustion.

Fortunately, there is a solution to this problem available to anyone using a 64-bit version of Windows: use a 64-bit version of Firefox. Now, 64-bit applications typically use more memory than 32-bit applications. This is because pointers, a common data type, are twice as big; a rough estimate for 64-bit Firefox is that it might use 25% more memory. However, 64-bit applications also have a much larger address space, which means they can access vast amounts of physical memory, and address space exhaustion is all but impossible. (In this way, switching from a 32-bit version of an application to a 64-bit version is the closest you can get to downloading more RAM!)

Therefore, if you have a machine with 4 GiB or less of RAM, switching to 64-bit Firefox probably won’t help. But if you have 8 GiB or more, switching to 64-bit Firefox probably will help the memory usage situation.

Official 64-bit versions of Firefox have been available since December 2015. If the above discussion has interested you, please try them out. But note the following caveats.

Flash and Silverlight are the only supported 64-bit plugins.
There are some Flash content regressions due to our NPAPI sandbox (for content that uses advanced features like GPU acceleration or microphone APIs).

On the flip side, as well as avoiding address space exhaustion problems, a security feature known as ASLR works much better in 64-bit applications than in 32-bit applications, so 64-bit Firefox will be slightly more secure.

Work is being ongoing to fix or minimize the mentioned caveats, and it is expected that 64-bit Firefox will be rolled out in increasing numbers in the not-too-distant future.

UPDATE: Chris Peterson gave me the following measurements about daily active users on Windows.

66.0% are running 32-bit Firefox on 64-bit Windows. These users could switch to a 64-bit Firefox.
32.3% are running 32-bit Firefox on 32-bit Windows. These users cannot switch to a 64-bit Firefox.
1.7% are running 64-bit Firefox already.

UPDATE 2: Also from Chris Peterson, here are links to 64-bit builds for all the channels:

Release
Beta
Developer Edition
Nightly (en-US only)
Nightly (other locales)

Firefox Garbage Collection Memory consumption MemShrink

More compacting GC

Jon Coppeard recently extended SpiderMonkey’s compacting GC abilities. Previously, the GC could only compact GC arena containing JavaScript objects. Now it can also compact arenas containing shapes (a data structure used within SpiderMonkey which isn’t visible to user code) and strings, which are two of the largest users of memory in the GC heap after objects.

These improvements should result in savings of multiple MiBs in most workloads, and they are on track to ship in Firefox 48, which will be released in early August. Great work, Jon!

Firefox Garbage Collection Memory consumption MemShrink

Compacting GC

Go read Jon Coppeard’s description of the compacting GC algorithm now used by SpiderMonkey!

about:memory AdBlock Plus Firefox Memory consumption MemShrink

Firefox 41 will use less memory when running AdBlock Plus

Post author By Nicholas Nethercote
Post date July 1, 2015
21 Comments on Firefox 41 will use less memory when running AdBlock Plus

Last year I wrote about AdBlock Plus’s effect on Firefox’s memory usage. The most important part was the following.

First, there’s a constant overhead just from enabling ABP of something like 60–70 MiB. […] This appears to be mostly due to additional JavaScript memory usage, though there’s also some due to extra layout memory.

Second, there’s an overhead of about 4 MiB per iframe, which is mostly due to ABP injecting a giant stylesheet into every iframe. Many pages have multiple iframes, so this can add up quickly. For example, if I load TechCrunch and roll over the social buttons on every story […], without ABP, Firefox uses about 194 MiB of physical memory. With ABP, that number more than doubles, to 417 MiB.

An even more extreme example is this page, which contains over 400 iframes. Without ABP, Firefox uses about 370 MiB. With ABP, that number jumps to 1960 MiB.

(This description was imprecise; the overhead is actually per document, which includes both top-level documents in a tab and documents in iframes.)

Last week Mozilla developer Cameron McCormack landed patches to fix bug 77999, which was filed more than 14 years ago. These patches enable sharing of CSS-related data — more specifically, they add data structures that share the results of cascading user agent style sheets — and in doing so they entirely fix the second issue, which is the more important of the two.

For example, on the above-mentioned “extreme example” (a.k.a. the Vim Color Scheme Test) memory usage dropped by 3.62 MiB per document. There are 429 documents on that page, which is a total reduction of about 1,550 MiB, reducing memory usage for that page down to about 450 MiB, which is not that much more than when AdBlock Plus is absent. (All these measurements are on a 64-bit build.)

I also did measurements on various other sites and confirmed the consistent saving of ~3.6 MiB per document when AdBlock Plus is enabled. The number of documents varies widely from page to page, so the exact effect depends greatly on workload. (I wanted to test TechCrunch again, but its front page has been significantly changed so it no longer triggers such high memory usage.) For example, for one of my measurements I tried opening the front page and four articles from each of nytimes.com, cnn.com and bbc.co.uk, for a total of 15 tabs. With Cameron’s patches applied Firefox with AdBlock Plus used about 90 MiB less physical memory, which is a reduction of over 10%.

Even when AdBlock Plus is not enabled this change has a moderate benefit. For example, in the Vim Color Scheme Test the memory usage for each document dropped by 0.09 MiB, reducing memory usage by about 40 MiB.

If you want to test this change out yourself, you’ll need a Nightly build of Firefox and a development build of AdBlock Plus. (Older versions of AdBlock Plus don’t work with Nightly due to a recent regression related to JavaScript parsing). In Firefox’s about:memory page you’ll see the reduction in the “style-sets” measurements. You’ll also see a new entry under “layout/rule-processor-cache”, which is the measurement of the newly shared data; it’s usually just a few MiB.

This improvement is on track to make it into Firefox 41, which is scheduled for release on September 22, 2015. For users on other release channels, Firefox 41 Beta is scheduled for release on August 11, and Firefox 41 Developer Edition is scheduled to be released in the next day or two.

about:memory Firefox Memory consumption Rust Servo

Measuring data structure sizes: Firefox (C++) vs. Servo (Rust)

Post author By Nicholas Nethercote
Post date June 3, 2015

Firefox’s about:memory page presents fine-grained measurements of memory usage. Here’s a short example.

725.84 MB (100.0%) -- explicit
├──504.36 MB (69.49%) -- window-objects
│ ├──115.84 MB (15.96%) -- top(https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound, id=2147483655)
│ │ ├───85.30 MB (11.75%) -- active
│ │ │ ├──84.75 MB (11.68%) -- window(https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound)
│ │ │ │ ├──36.51 MB (05.03%) -- dom
│ │ │ │ │ ├──16.46 MB (02.27%) ── element-nodes
│ │ │ │ │ ├──13.08 MB (01.80%) ── orphan-nodes
│ │ │ │ │ └───6.97 MB (00.96%) ++ (4 tiny)
│ │ │ │ ├──25.17 MB (03.47%) -- js-compartment(https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound)
│ │ │ │ │ ├──23.29 MB (03.21%) ++ classes
│ │ │ │ │ └───1.87 MB (00.26%) ++ (7 tiny)
│ │ │ │ ├──21.69 MB (02.99%) ++ layout
│ │ │ │ └───1.39 MB (00.19%) ++ (2 tiny)
│ │ │ └───0.55 MB (00.08%) ++ window(https://login.persona.org/communication_iframe)
│ │ └───30.54 MB (04.21%) ++ js-zone(0x7f131ed6e000)

A typical about:memory invocation contains many thousands of measurements. Although they can be hard for non-experts to interpret, they are immensely useful to Firefox developers. For this reason, I’m currently implementing a similar system in Servo, which is a next-generation browser engine that’s implemented in Rust. Although the implementation in Servo is heavily based on the Firefox implementation, Rust has some features that make the Servo implementation a lot nicer than the Firefox implementation, which is written in C++. This blog post is a deep dive that explains how and why.

Measuring data structures in Firefox

A lot of the measurements done for about:memory are of heterogeneous data structures that live on the heap and contain pointers. We want such data structures to be able to measure themselves. Consider the following simple example.

struct CookieDomainTuple
{
  nsCookieKey key;
  nsRefPtr<nsCookie> cookie;
 
  size_t SizeOfExcludingThis(mozilla::MallocSizeOf aMallocSizeOf) const;
};

The things to immediately note about this type are as follows.

The details of nsCookieKey and nsCookie don’t matter here.
nsRefPtr is a smart pointer type.
There is a method, called SizeOfExcludingThis, for measuring the size of a CookieDomainTuple.

That measurement method has the following form.

size_t
CookieDomainTuple::SizeOfExcludingThis(MallocSizeOf aMallocSizeOf) const
{
  size_t amount = 0;
  amount += key.SizeOfExcludingThis(aMallocSizeOf);
  amount += cookie->SizeOfIncludingThis(aMallocSizeOf);
  return amount;
}

Things to note here are as follows.

aMallocSizeOf is a pointer to a function that takes a pointer to a heap block and returns the size of that block in bytes. Under the covers it’s implemented with a function like malloc_usable_size. Using a function like this is superior to computing the size analytically, because (a) it’s less error-prone and (b) it measures the actual size of heap blocks, which is often larger than the requested size because heap allocators round up some request sizes. It will also naturally measure any padding between members.
The two data members are measured by invocations to size measurement methods that they provide.
The first of these is called SizeOfExcludingThis. The “excluding this” here is necessary because key is an nsCookieKey that sits within a CookieDomainTuple. We don’t want to measure the nsCookieKey struct itself, just any additional heap blocks that it has pointers to.
The second of these is called SizeOfIncludingThis. The “including this” here is necessary because cookie is just a pointer to an nsCookie struct, which we do want to measure, along with any additional heap blocks it has pointers to.
We need to be careful with these calls. If we call SizeOfIncludingThis when we should call SizeOfExcludingThis, we’ll likely get a crash due to calling aMallocSizeOf on a non-heap pointer. And if we call SizeOfExcludingThis when we should call SizeOfIncludingThis, we’ll miss measuring the struct.
If this struct had a pointer to a raw heap buffer — e.g. a char* member — it would measure it by calling aMallocSizeOf directly on the pointer.

With that in mind, you can see that this method is itself a SizeOfExcludingThis method, and indeed, it doesn’t measure the memory used by the struct instance itself. A method that did include that memory would look like the following.

size_t
CookieDomainTuple::SizeOfIncludingThis(MallocSizeOf aMallocSizeOf)
{ 
  return aMallocSizeOf(this) + SizeOfExcludingThis(aMallocSizeOf);
}

All it does is measure the CookieDomainTuple struct itself — i.e. this — and then call the SizeOfExcludingThis method, which measures all child structures.

There are a few other wrinkles.

Often we want to ignore a data member. Perhaps it’s a scalar value, such as an integer. Perhaps it’s a non-owning pointer to something and that thing would be better measured as part of the measurement of another data structure. Perhaps it’s something small that isn’t worth measuring. In these cases we generally use comments in the measurement method to explain why a field isn’t measured, but it’s easy for these comments to fall out-of-date. It’s also easy to forget to update the measurement method when a new data member is added.
Every SizeOfIncludingThis method body looks the same: return aMallocSizeOf(this) + SizeOfExcludingThis(aMallocSizeOf);
Reference-counting complicates things, because you end up with pointers that conceptually own a fraction of another structure.
Inheritance complicates things.

(The full documentation goes into more detail.)

Even with all the wrinkles, it all works fairly well. Having said that, there are a lot of SizeOfExcludingThis and SizeOfIncludingThis methods that are boilerplate-y and tedious to write.

Measuring data structures in SERVO

When I started implementing a similar system in Servo, I naturally followed a similar design. But I soon found I was able to improve upon it.

With the same functions defined for lots of types, it was natural to define a Rust trait, like the following.

pub trait HeapSizeOf { 
  fn size_of_including_self(&self) -> usize;
  fn size_of_excluding_self(&self) -> usize; 
}

Having to repeatedly define size_of_including_self when its definition always looks the same is a pain. But heap pointers in Rust are handled via the parameterized Box type, and it’s possible to implement traits for this type. This means we can implement size_of_excluding_this for all Box types — thus removing the need for size_of_including_this — in one fell swoop, as the following code shows.

impl<T: HeapSizeOf> HeapSizeOf for Box<T> {
  fn size_of_excluding_self(&self) -> usize {
    heap_size_of(&**self as *const T as *const c_void) + (**self).size_of_excluding_self()
  }
}

The pointer manipulations are hairy, but basically it says that if T implements the HeapSizeOf trait, then we can measure Box<T> by measuring the T struct itself (via heap_size_of, which is similar to the aMallocSizeOf function in the Firefox example), and then measuring the things hanging off T (via the size_of_excluding_self call). Excellent!

With the including/excluding distinction gone, I renamed size_of_excluding_self as heap_size_of_children, which I thought communicated the same idea more clearly; it seems better for the name to describe what it is measuring rather than what it is not measuring.

But there was still a need for a lot of tedious boilerplate code, as this example shows.

pub struct DisplayList {
  pub background_and_borders: LinkedList<DisplayItem>,
  pub block_backgrounds_and_borders: LinkedList<DisplayItem>,
  pub floats: LinkedList<DisplayItem>,
  pub content: LinkedList<DisplayItem>,
  pub positioned_content: LinkedList<DisplayItem>,
  pub outlines: LinkedList<DisplayItem>,
  pub children: LinkedList<Arc<StackingContext>>,
}

impl HeapSizeOf for DisplayList {
  fn heap_size_of_children(&self) -> usize {
    self.background_and_borders.heap_size_of_children() +
    self.block_backgrounds_and_borders.heap_size_of_children() +
    self.floats.heap_size_of_children() +
    self.content.heap_size_of_children() +
    self.positioned_content.heap_size_of_children() +
    self.outlines.heap_size_of_children() +
    self.children.heap_size_of_children()
  }
}

However, the Rust compiler has the ability to automatically derive implementations for some built-in traits. Even better, the compiler lets you write plug-ins that do arbitrary transformations of the syntax tree, which makes it possible to write a plug-in that does the same for non-built-in traits on request. And the delightful Manish Goregaokar has done exactly that. This allows the example above to be reduced to the following.

#[derive(HeapSizeOf)]
pub struct DisplayList {
  pub background_and_borders: LinkedList<DisplayItem>,
  pub block_backgrounds_and_borders: LinkedList<DisplayItem>,
  pub floats: LinkedList<DisplayItem>,
  pub content: LinkedList<DisplayItem>,
  pub positioned_content: LinkedList<DisplayItem>,
  pub outlines: LinkedList<DisplayItem>,
  pub children: LinkedList<Arc<StackingContext>>,
}

The first line is an annotation that triggers the plug-in to do the obvious thing: generate a heap_size_of_children definition that just calls heap_size_of_children on all the struct fields. Wonderful!

But you may remember that I mentioned that sometimes in Firefox’s C++ code we want to ignore a particular member. This is also true in Servo’s Rust code, so the plug-in supports an ignore_heap_size annotation which can be applied to any field in the struct definition; the plug-in will duly ignore any such field.

If a new field is added which has a type for which HeapSizeOf has not been implemented, the compiler will complain. This means that we can’t add a new field to a struct and forget to measure it. The ignore_heap_size_of annotation also requires a string argument, which (by convention) holds a brief explanation why the member is ignored, as the following example shows.

#[ignore_heap_size_of = "Because it is a non-owning reference."]
pub image: Arc<Image>,

(An aside: the best way to handle Arc is an open question. If one of the references is clearly the owner, it probably makes sense to count the full size for that one reference. Otherwise, it is probably best to divide the size equally among all the references.)

The plug-in also has a known_heap_size_of! macro that lets us easily dictate the heap size of built-in types (such as integral types, whose heap size is zero). This works because Rust allows implementations of custom traits for built-in types. It provides additional uniformity because built-in types don’t need special treatment. The following line says that all the built-in signed integer types have a heap_size_of_children value of zero.

known_heap_size_of!(0, i8, i16, i32, i64, isize);

Finally, if there is a type for which the measurement needs to do something more complicated, we can still implement heap_size_of_children manually.

Conclusion

The Servo implementation is much nicer than the Firefox implementation, in the following ways.

There is no need for an including/excluding split thanks to trait implementations on Box. This avoids boilerplate some code and makes it impossible to accidentally call the wrong method.
Struct fields that use built-in types are handled the same way as all others, because Rust trait implementations can be defined for built-in types.
Even more boilerplate is avoided thanks to the compiler plug-in that auto-derives HeapSizeOf implementations; it can even ignore fields.
For ignored fields, the required string parameter makes it impossible to forget to explain why the field is ignored.

These are possible due to several powerful language and compiler features of Rust that C++ lacks. There may be some C++ features that could improve the Firefox code — and I’d love to hear suggestions — but it’s never going to be as nice as the Rust code.