Categories
Memory allocation Memory consumption Performance Valgrind

A better DHAT

DHAT is a heap profiler that comes with Valgrind. (The name is short for “Dynamic Heap Analysis Tool”.) It tells you where all your heap allocations come from, and can help you find the following: places that cause excessive numbers of allocations; leaks; unused and under-used allocations; short-lived allocations; and allocations with inefficient data layouts. This old blog post goes into some detail.

In the new Valgrind 3.15 release I have given DHAT a thorough overhaul.

The old DHAT was very useful and I have used it a lot while profiling the Rust compiler. But it had some rather annoying limitations, which the new DHAT overcomes.

First, the old DHAT dumped its data as text at program termination. The new DHAT collects its data in a file which is read by a graphical viewer that runs in a web browser. This gives several advantages.

  • The separation of data collection and data presentation means you can run a program once under DHAT and then sort and filter the data in various ways, instead of having to choose a particular sort order in advance. Also, full data is in the output file, and the graphical viewer chooses what to omit.
  • The data can be sorted in more ways than previously. Some of these sorts involve useful filters such as “short-lived” and “zero reads or zero writes”.
  • The graphical viewer allows parts of the data to be hidden and unhidden as necessary.

Second, the old DHAT divided its output into records, where each record consisted of all the heap allocations that had the same allocation stack trace. The choice of stack trace depth could greatly affect the output.

In contrast, the new DHAT is based around trees of stack traces that avoid the need to choose stack trace depth. This avoids both the problem of not enough depth (when records that should be distinct are combined, and may not contain enough information to be actionable) and the problem of too much depth (when records that should be combined are separated, making them seem less important than they really are).

Third, the new DHAT also collects and/or shows data that the old DHAT did not.

  • Byte and block measurements are shown with a percentage relative to the global measurements, which helps gauge relative significance of different parts of the profile.
  • Byte and block measurements are also shown with an allocation rate (bytes and blocks per million instructions), which enables comparisons across multiple profiles, even if those profiles represent different workloads.
  • Both global and per-node measurements are taken at the global heap peak, which gives Massif-like insight into the point of peak memory use.
  • The final/lifetimes stats are a bit more useful than the old deaths stats. (E.g. the old deaths stats didn’t take into account lifetimes of unfreed blocks.)

Finally, the new DHAT handles realloc better. The sequence p = malloc(100); realloc(p, 200); now increases the total block count by 2 and the total byte count by 300. In the old DHAT it increased them by 1 and 200. The new handling is a more operational view that better reflects the effect of allocations on performance. It makes a significant difference in the results, giving paths involving reallocation (e.g. repeated pushing to a growing vector) more prominence.

Overall these changes make DHAT more powerful and easier to use.

The following screenshot gives an idea of what the new graphical viewer looks like.

Sample output from DHAT's viewer

The new DHAT can be run using the --tool=dhat flag, in contrast with the old DHAT, which was an “experimental” tool and so used the --tool=exp-dhat flag. For more details see the documentation.

Categories
Memory consumption Performance Programming Rust

How to get the size of Rust types with -Zprint-type-sizes

When optimizing Rust code it’s sometimes useful to know how big a type is, i.e. how many bytes it takes up in memory. std::mem::size_of can tell you, but often you want to know the exact layout as well. For example, an enum might be surprisingly big, in which case you will probably want to know whether one variant is much bigger than the others.
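
For instance, here is a quick sketch of what std::mem::size_of alone tells you (and what it doesn’t):

fn main() {
    // Typically prints 16 on a 64-bit target: the compiler reorders the fields
    // to (u64, u8, u8) and pads to 8-byte alignment, but size_of doesn't say so.
    println!("{}", std::mem::size_of::<(u8, u64, u8)>());
    // Prints 8 on a 64-bit target: the "niche" optimization hides the Option tag
    // inside the pointer, which again size_of alone doesn't reveal.
    println!("{}", std::mem::size_of::<Option<Box<u8>>>());
}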

The -Zprint-type-sizes option does exactly this. It isn’t enabled on release versions of rustc, so you’ll need to use a nightly version of rustc. Here is one possible invocation:

cargo +nightly rustc -- -Zprint-type-sizes

It will print out details of the size, layout, and alignment of all types in use. For example, for this type:

enum E {
    A,
    B(i32),
    C(u64, u8, u64, u8),
    D(Vec<u32>),
}

it prints the following, plus info about a few built-in types:

print-type-size type: `E`: 32 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `A`: 0 bytes
print-type-size     variant `B`: 7 bytes
print-type-size         padding: 3 bytes
print-type-size         field `.0`: 4 bytes, alignment: 4 bytes
print-type-size     variant `C`: 23 bytes
print-type-size         field `.1`: 1 bytes
print-type-size         field `.3`: 1 bytes
print-type-size         padding: 5 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size         field `.2`: 8 bytes
print-type-size     variant `D`: 31 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 24 bytes, alignment: 8 bytes

It shows:

  • the size and alignment of the type;
  • for enums, the size of the discriminant;
  • for enums, the size of each variant;
  • the size, alignment, and ordering of all fields (note that the compiler has reordered variant C’s fields to minimize the size of E, as the sketch below shows);
  • the size and location of all padding.

Every detail you could possibly want is there. Brilliant!
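
To get a feel for what that field reordering buys, here is a small sketch (assuming a 64-bit target; the type names are made up) comparing the default representation, which may reorder fields, with #[repr(C)], which preserves declaration order.

#[allow(dead_code)]
struct Reordered(u64, u8, u64, u8); // the compiler is free to lay this out as u64, u64, u8, u8

#[allow(dead_code)]
#[repr(C)]
struct DeclOrder(u64, u8, u64, u8); // declaration order: padding after each u8 to realign the next u64

fn main() {
    // Typically prints 24 and 32.
    println!("{}", std::mem::size_of::<Reordered>());
    println!("{}", std::mem::size_of::<DeclOrder>());
}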

For rustc developers, there’s an extra-special trick for getting the size of a type within rustc itself. Put code like this into a file a.rs:

#![feature(rustc_private)]
extern crate syntax;
use syntax::ast::Expr;
fn main() {
    let _x = std::mem::size_of::<Expr>();
}

and then compile it like this:

RUSTC_BOOTSTRAP=1 rustc -Zprint-type-sizes a.rs

I won’t pretend to understand how it works, but the use of rustc_private and RUSTC_BOOTSTRAP somehow let you see inside rustc while using it, rather than while compiling it. I have used this trick for PRs such as this one.

Categories
Memory consumption Performance Rust

How to speed up the Rust compiler in 2018: NLL edition

Niko Matsakis recently blogged about the Rust compiler’s new borrow checker, which implements non-lexical lifetimes (NLL). The new borrow checker is a really nice improvement to Rust, because it accepts many sound programs that the old borrow checker rejected.

In the blog post, Niko wrote briefly about the performance of the new borrow checker.

Finally, those of you who read the previous posts may remember that the performance of the NLL checker was a big stumbling block. I’m happy to report that the performance issues were largely addressed: there remains some slight overhead to using NLL, but it is largely not noticeable in practice, and I expect we’ll continue to improve it over time.

This paragraph is true, but glosses over a lot of details! This post will be about my contributions to this performance work.

Wins

Before I describe individual improvements, it’s worth mentioning that the new borrow checker uses bitsets (1D) and bit matrices (2D) heavily. A number of my wins involved these data structures.

#51869: This PR changed some code so that it overwrote an existing dense bitset rather than replacing it with a newly created one of the same size, reducing instruction counts for most benchmarks, the best by 1.5%.

#51870: This PR reused a structure containing two bitsets rather than recreating it afresh for every statement in a basic block, reducing instruction counts for numerous benchmarks, the best by 1%.

#52250: The compiler has a SparseBitMatrix type. Rows were added on demand, and each row was implemented as a sparse bitset using a BTreeMap. In practice, many of the rows were relatively dense, with 10–90% of the bits being set. This PR changed SparseBitMatrix to use a dense representation for rows, reducing instruction counts on one benchmark by 33% and by 1% on a few others. The PR had a mixed effect on memory usage, increasing the peak on some benchmarks and reducing it on others. (Never fear! #54318 below ended up fixing the regressions.)

#52342: This PR avoided a bunch of allocations in Canonicalizer methods, reducing instruction counts on numerous benchmarks, the best by 2%.

#53383: Further profiling showed that some dense bitsets were large, but had very few bits set within them, so the dense representation was wasteful. This PR implemented a new hybrid bitset type that uses a sparse representation for bitsets with up to 8 bits set and switches to a dense representation beyond that, and used it to replace dense bitsets in several places, reducing instruction counts on the five slowest benchmarks by 55%, 21%, 16%, 10% and 9%, and reducing peak memory usage of three benchmarks by 53%, 33%, and 9%.
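
Here is a minimal sketch of the hybrid idea (the names are invented, not rustc’s actual types): keep a small sorted list of set bits while the set is tiny, and promote to a dense bit vector once it grows past a threshold.

const SPARSE_MAX: usize = 8;

enum HybridSet {
    Sparse(Vec<usize>), // sorted indices of the set bits; at most SPARSE_MAX of them
    Dense(Vec<u64>),    // one bit per element of the domain, in 64-bit words
}

impl HybridSet {
    fn insert(&mut self, bit: usize, domain_size: usize) {
        match self {
            HybridSet::Sparse(bits) => {
                if let Err(pos) = bits.binary_search(&bit) {
                    bits.insert(pos, bit);
                    if bits.len() > SPARSE_MAX {
                        // Too many bits set: promote to the dense representation.
                        let mut words = vec![0u64; (domain_size + 63) / 64];
                        for &b in bits.iter() {
                            words[b / 64] |= 1u64 << (b % 64);
                        }
                        *self = HybridSet::Dense(words);
                    }
                }
            }
            HybridSet::Dense(words) => words[bit / 64] |= 1u64 << (bit % 64),
        }
    }

    fn contains(&self, bit: usize) -> bool {
        match self {
            HybridSet::Sparse(bits) => bits.binary_search(&bit).is_ok(),
            HybridSet::Dense(words) => (words[bit / 64] & (1u64 << (bit % 64))) != 0,
        }
    }
}

fn main() {
    let mut s = HybridSet::Sparse(Vec::new());
    for b in 0..20 {
        s.insert(b, 1024); // switches to Dense once more than 8 bits are set
    }
    assert!(s.contains(19) && !s.contains(100));
}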

#53513: This PR force-inlined a function at one hot callsite, reducing instruction counts on two benchmarks by 1%.

#53551: This PR avoided some clone calls, reducing instruction counts for one benchmark by 0.5%.

#53733: This PR added special handling for a very common and simple case in unroll_place, reducing the instruction counts on one benchmark by 25%.

#53942: A function called precompute_borrows_out_of_scope does a traversal of one or more basic blocks. In order to detect whether a basic block had been previously visited, it recorded the ID of every visited statement in a hash table. Some basic blocks can have many statements, resulting in many hash table lookups. This PR changed the code to record the ID of visited basic blocks in the hash table instead of visited statements — trickier than it sounds because the analysis can start in the middle of a basic block, in which case the first half might need to be eventually visited — reducing instruction counts on one benchmark by 60%.
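
A sketch of the idea, with the types simplified to integers (the real change also has to handle an analysis that starts partway through a block, which is omitted here):

use std::collections::HashSet;

type BasicBlock = u32;

fn traverse(start: BasicBlock, successors: impl Fn(BasicBlock) -> Vec<BasicBlock>) {
    // One hash table entry per *block*, no matter how many statements it has,
    // rather than one entry per *statement*.
    let mut visited: HashSet<BasicBlock> = HashSet::new();
    let mut work = vec![start];
    while let Some(bb) = work.pop() {
        if !visited.insert(bb) {
            continue; // this whole block has already been processed
        }
        // ... process every statement in `bb` with no per-statement lookup ...
        work.extend(successors(bb));
    }
}

fn main() {
    // A tiny diamond-shaped control-flow graph: 0 -> {1, 2} -> 3.
    traverse(0, |bb| match bb {
        0 => vec![1, 2],
        1 | 2 => vec![3],
        _ => vec![],
    });
}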

#54211: Liveness analysis created an array that could get very large. Each element in the array was a struct containing two u32s and a bool. Each of those elements took up 12 bytes, but only 9 of the bytes held data. This PR split the array into three separate arrays, one per field, making the code slightly less readable but reducing peak memory usage on one benchmark by 20%. (Fun fact: those u32s used to be usizes, but I shrunk them back in May.)
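
The layout change looks roughly like this (the field names are invented): the array-of-structs form pays for alignment padding in every element, while the struct-of-arrays form stores each field contiguously with no per-element padding.

#[allow(dead_code)]
struct Appearance {
    reader: u32,
    writer: u32,
    used: bool,
} // 9 bytes of data, but 12 bytes per element once padded to 4-byte alignment

#[allow(dead_code)]
struct Appearances {
    reader: Vec<u32>, // 4 bytes per element
    writer: Vec<u32>, // 4 bytes per element
    used: Vec<bool>,  // 1 byte per element: 9 bytes per logical entry in total
}

fn main() {
    // Typically prints 12.
    println!("{}", std::mem::size_of::<Appearance>());
}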

#54213: This PR tweaked some code so that the lifetimes of two large data structures didn’t overlap, reducing peak memory usage by 27% on one benchmark and 8% on another. (Note: those data structures are dominated by, you guessed it, bitsets!)

#54318: This PR changed SparseBitMatrix so that each instantiated row used the hybrid bitset representation from #53383 instead of the dense representation, reducing peak memory usage by 14–45% on four benchmarks, and 96% (from 29.1GB to 1.2GB) on one external crate! This PR also fixed the peak memory regression that #52250 introduced for a few benchmarks.

#54420: I subsequently realized that #54211 didn’t go far enough. Some debugging println! statements showed that both of the u32s in each liveness entry almost always held a special value that meant “invalid”. When data is repetitive, compression is possible: I could use a packed representation where the common (INVALID, INVALID, true) and (INVALID, INVALID, false) cases were represented by special u32 values, and all other triples were represented by a u32 index into an auxiliary table. This PR changed the representation as described, reducing instruction counts on numerous benchmarks, the best by 16%, and reducing peak memory usage on numerous benchmarks, the best by 38%. (I also tried a more compact representation where each element was a single byte; it reduced peak memory usage some more, but the instruction count reduction was less, so I went with the earlier approach.)
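
An illustrative sketch of the packing (the names and sentinel values here are invented): the two overwhelmingly common triples become reserved u32 values, and everything else becomes an index into a side table.

const INVALID: u32 = u32::MAX;              // hypothetical "invalid" field value
const PACKED_INVALID_FALSE: u32 = u32::MAX;
const PACKED_INVALID_TRUE: u32 = u32::MAX - 1;

struct Unpacked {
    reader: u32,
    writer: u32,
    used: bool,
} // 12 bytes per element

struct Packed(u32); // 4 bytes per element; a real version must keep the side
                    // table smaller than the reserved values

fn pack(e: Unpacked, table: &mut Vec<Unpacked>) -> Packed {
    match (e.reader, e.writer, e.used) {
        (INVALID, INVALID, false) => Packed(PACKED_INVALID_FALSE),
        (INVALID, INVALID, true) => Packed(PACKED_INVALID_TRUE),
        _ => {
            // Rare case: spill the full triple into the side table.
            table.push(e);
            Packed((table.len() - 1) as u32)
        }
    }
}

fn main() {
    let mut table = Vec::new();
    let common = pack(Unpacked { reader: INVALID, writer: INVALID, used: true }, &mut table);
    let rare = pack(Unpacked { reader: 7, writer: 8, used: false }, &mut table);
    assert_eq!(common.0, PACKED_INVALID_TRUE);
    assert_eq!(table[rare.0 as usize].reader, 7);
}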

Progress and current status

You probably noticed that some of the improvements in the previous section were large, and I wasn’t the only one working on NLL performance; Niko Matsakis and David Wood also contributed some big wins. This is because the new borrow checker’s performance was originally, to be honest, terrible. This is understandable; the focus had been on functionality and correctness, which is fair enough for a large and complex new component. Nonetheless, in June I was very nervous about its performance.

To be more specific, “check” builds (which don’t generate code) ran as much as 50x slower with the new borrow checker on some benchmarks. And multiple benchmarks were disabled on CI because they were simply too slow or used too much memory.

Issue #52028 tells a representative story. It was originally filed because the html5ever benchmark was triggering out-of-memory failures on CI. Measurements with Massif and DHAT showed that its peak heap memory usage was over 14 GB, largely caused by a single 12 GB allocation! In comparison, the peak with the old borrow checker was roughly 200–300 MB. If you read through that issue, you can see that over a period of 2.5 months we reduced the memory usage from 14 GB, to 10 GB, to 2 GB, to 1.2 GB, to 600 MB, to 501 MB, and finally to 266 MB.

And things are pretty good now. The instruction counts on “check” builds for all benchmarks are at most 18% higher with the new borrow checker than the old borrow checker, and are typically around 5%. (And note that “check” builds are the worst-case scenario; non-“check” builds will see a smaller relative slowdown because of the extra time needed for code generation, which is unaffected by the borrow checker.) Memory usage is similar: all benchmarks except one have peak memory usage that is at most 20% higher, with the typical value around 3%. (The one remaining exceptional benchmark uses 2.7x memory.) The worse numbers generally occur on programs containing very large constants.

I’m not entirely happy even with this level of performance regression, but for now I have run out of ideas for improving it further. The new borrow checker is a lot more sophisticated and tracks a lot more data, so strict performance parity is a tough ask. Nonetheless, given how bad performance was a few months ago, I’m happy that we’ve got it down to a level where most people probably won’t notice any difference. Given that the new borrow checker makes Rust a significantly nicer and easier language to use, I hope it’s an acceptable trade-off.

Categories
Performance Programming Work habits

Ad Hoc Profiling

I have used a variety of profiling tools over the years, including several I wrote myself.

But there is one profiling tool I have used more than any other. It is capable of providing invaluable, domain-specific profiling data of a kind not obtainable by any general-purpose profiler.

It’s a simple text processor implemented in a few dozen lines of code. I use it in combination with logging print statements in the programs I am profiling. No joke.

Post-processing

The tool is called counts, and it tallies line frequencies within text files, like an improved version of the Unix command chain sort | uniq -c. For example, given the following input.

a 1
b 2
b 2
c 3
c 3
c 3
d 4
d 4
d 4
d 4

counts produces the following output.

10 counts:
(  1)        4 (40.0%, 40.0%): d 4
(  2)        3 (30.0%, 70.0%): c 3
(  3)        2 (20.0%, 90.0%): b 2
(  4)        1 (10.0%,100.0%): a 1

It gives a total line count, and shows all the unique lines, ordered by frequency, with individual and cumulative percentages.

Alternatively, when invoked with the -w flag, it assigns each line a weight, determined by the last integer that appears on the line (or 1 if there is no such integer).  On the same input, counts -w produces the following output.

30 counts:
(  1)       16 (53.3%, 53.3%): d 4
(  2)        9 (30.0%, 83.3%): c 3
(  3)        4 (13.3%, 96.7%): b 2
(  4)        1 ( 3.3%,100.0%): a 1

The total and per-line counts are now weighted; the output incorporates both frequency and a measure of magnitude.

That’s it. That’s all counts does. I originally implemented it in 48 lines of Perl, then later rewrote it as 48 lines of Python, and then later again rewrote it as 71 lines of Rust.
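
For the curious, here is a minimal sketch of the unweighted mode (this is not the real counts source, and the output formatting is only approximate):

use std::collections::HashMap;
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    // Tally identical lines.
    let mut tally: HashMap<String, u64> = HashMap::new();
    let mut total: u64 = 0;
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        *tally.entry(line?).or_insert(0) += 1;
        total += 1;
    }

    // Print them by descending frequency, with individual and cumulative percentages.
    let mut entries: Vec<(String, u64)> = tally.into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1));
    println!("{} counts:", total);
    let mut cumulative: u64 = 0;
    for (i, (line, n)) in entries.iter().enumerate() {
        cumulative += *n;
        println!(
            "({:3}) {:8} ({:4.1}%,{:5.1}%): {}",
            i + 1,
            n,
            *n as f64 * 100.0 / total as f64,
            cumulative as f64 * 100.0 / total as f64,
            line
        );
    }
    Ok(())
}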

In terms of benefit-to-effort ratio, it is by far the best code I have ever written.

counts in action

As an example, I added print statements to Firefox’s heap allocator so it prints a line for every allocation that shows its category, requested size, and actual size. A short run of Firefox with this instrumentation produced a 77 MB file containing 5.27 million lines. counts produced the following output for this file.

5270459 counts:
( 1) 576937 (10.9%, 10.9%): small 32 (32)
( 2) 546618 (10.4%, 21.3%): small 24 (32)
( 3) 492358 ( 9.3%, 30.7%): small 64 (64)
( 4) 321517 ( 6.1%, 36.8%): small 16 (16)
( 5) 288327 ( 5.5%, 42.2%): small 128 (128)
( 6) 251023 ( 4.8%, 47.0%): small 512 (512)
( 7) 191818 ( 3.6%, 50.6%): small 48 (48)
( 8) 164846 ( 3.1%, 53.8%): small 256 (256)
( 9) 162634 ( 3.1%, 56.8%): small 8 (8)
( 10) 146220 ( 2.8%, 59.6%): small 40 (48)
( 11) 111528 ( 2.1%, 61.7%): small 72 (80)
( 12) 94332 ( 1.8%, 63.5%): small 4 (8)
( 13) 91727 ( 1.7%, 65.3%): small 56 (64)
( 14) 78092 ( 1.5%, 66.7%): small 168 (176)
( 15) 64829 ( 1.2%, 68.0%): small 96 (96)
( 16) 60394 ( 1.1%, 69.1%): small 88 (96)
( 17) 58414 ( 1.1%, 70.2%): small 80 (80)
( 18) 53193 ( 1.0%, 71.2%): large 4096 (4096)
( 19) 51623 ( 1.0%, 72.2%): small 1024 (1024)
( 20) 45979 ( 0.9%, 73.1%): small 2048 (2048)

Unsurprisingly, small allocations dominate. But what happens if we weight each entry by its size? counts -w produced the following output.

2554515775 counts:
( 1) 501481472 (19.6%, 19.6%): large 32768 (32768)
( 2) 217878528 ( 8.5%, 28.2%): large 4096 (4096)
( 3) 156762112 ( 6.1%, 34.3%): large 65536 (65536)
( 4) 133554176 ( 5.2%, 39.5%): large 8192 (8192)
( 5) 128523776 ( 5.0%, 44.6%): small 512 (512)
( 6) 96550912 ( 3.8%, 48.3%): large 3072 (4096)
( 7) 94164992 ( 3.7%, 52.0%): small 2048 (2048)
( 8) 52861952 ( 2.1%, 54.1%): small 1024 (1024)
( 9) 44564480 ( 1.7%, 55.8%): large 262144 (262144)
( 10) 42200576 ( 1.7%, 57.5%): small 256 (256)
( 11) 41926656 ( 1.6%, 59.1%): large 16384 (16384)
( 12) 39976960 ( 1.6%, 60.7%): large 131072 (131072)
( 13) 38928384 ( 1.5%, 62.2%): huge 4864000 (4866048)
( 14) 37748736 ( 1.5%, 63.7%): huge 2097152 (2097152)
( 15) 36905856 ( 1.4%, 65.1%): small 128 (128)
( 16) 31510912 ( 1.2%, 66.4%): small 64 (64)
( 17) 24805376 ( 1.0%, 67.3%): huge 3097600 (3100672)
( 18) 23068672 ( 0.9%, 68.2%): huge 1048576 (1048576)
( 19) 22020096 ( 0.9%, 69.1%): large 524288 (524288)
( 20) 18980864 ( 0.7%, 69.9%): large 5432 (8192)

This shows that the cumulative count of allocated bytes (2.55GB) is dominated by a mixture of larger allocation sizes.

This example gives just a taste of what counts can do.

(An aside: in both cases it’s good to see there isn’t much slop, i.e. the differences between the requested sizes and actual sizes are mostly 0. That 5432 entry at the bottom of the second table is curious, though.)

Other Uses

This technique is often useful when you already know something — e.g. a general-purpose profiler showed that a particular function is hot — but you want to know more.

  • Exactly how many times are paths X, Y and Z executed? For example, how often do lookups succeed or fail in data structure D? Print an identifying string each time a path is hit.
  • How many times does loop L iterate? What does the loop count distribution look like? Is it executed frequently with a low loop count, or infrequently with a high loop count, or a mix? Print the iteration count before or after the loop.
  • How many elements are typically in hash table H at this code location? Few? Many? A mixture? Print the element count.
  • What are the contents of vector V at this code location? Print the contents.
  • How many bytes of memory are used by data structure D at this code location? Print the byte size.
  • Which call sites of function F are the hot ones? Print an identifying string at the call site.

Then use counts to aggregate the data. Often this domain-specific data is critical to fully optimize hot code.
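
A hedged sketch of the pattern (the function, prefix, and fields are invented): each event prints one tagged line, and putting an integer at the end means counts -w can use it as a weight.

fn note_alloc(category: &str, requested: usize, actual: usize) {
    // The "AL" prefix makes it easy to isolate these lines later with
    // `grep AL log | counts` or `grep AL log | counts -w`.
    eprintln!("AL {} {} {}", category, requested, actual);
}

fn main() {
    note_alloc("small", 24, 32);
    note_alloc("small", 24, 32);
    note_alloc("large", 4096, 4096);
}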

Worse is better

Print statements are an admittedly crude way to get this kind of information, profligate with I/O and disk space. In many cases you could do it in a way that uses machine resources much more efficiently, e.g. by creating a small table data structure in the code to track frequencies, and then printing that table at program termination.

But that would require:

  • writing the custom table (collection and printing);
  • deciding where to define the table;
  • possibly exposing the table to multiple modules;
  • deciding where to initialize the table; and
  • deciding where to print the contents of the table.

That is a pain, especially in a large program you don’t fully understand.

Alternatively, sometimes you want information that a general-purpose profiler could give you, but running that profiler on your program is a hassle because the program you want to profile is actually layered under something else, and setting things up properly takes effort.

In contrast, inserting print statements is trivial. Any measurement can be set up in no time at all. (Recompiling is often the slowest part of the process.) This encourages experimentation. You can also kill a running program at any point with no loss of profiling data.

Don’t feel guilty about wasting machine resources; this is temporary code. You might sometimes end up with output files that are gigabytes in size. But counts is fast because it’s so simple… and the Rust version is 3–4x faster than the Python version, which is nice. Let the machine do the work for you. (It does help if you have a machine with an SSD.)

Ad Hoc Profiling

For a long time I have, in my own mind, used the term ad hoc profiling to describe this combination of logging print statements and frequency-based post-processing. Wikipedia defines “ad hoc” as follows.

In English, it generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes

The process of writing custom code to collect this kind of profiling data — in the manner I disparaged in the previous section — truly matches this definition of “ad hoc”.

But counts is valuable specifically because it makes this type of custom profiling less ad hoc and more repeatable. I should arguably call it “generalized ad hoc profiling” or “not so ad hoc profiling”… but those names don’t have quite the same ring to them.

Tips

Use unbuffered output for the print statements. In C and C++ code, use fprintf(stderr, ...). In Rust code use eprintln!. (Update: Rust 1.32 added the dbg! macro, which also works well.)

Pipe the stderr output to a file, e.g. firefox 2> log.

Sometimes programs print other lines of output to stderr that should be ignored by counts. (Especially if they include integer IDs that counts -w would interpret as weights!) Prepend all logging lines with a short identifier, and then use grep $ID log | counts to ignore the other lines. If you use more than one prefix, you can grep for each prefix individually or all together.

Occasionally output lines get munged together when multiple threads print at the same time. Because there are typically many lines of output, having a few garbage ones almost never matters.

It’s often useful to use both counts and counts -w on the same log file; each one gives different insights into the data.

To find which call sites of a function are hot, you can instrument the call sites directly. But it’s easy to miss one, and the same print statements need to be repeated multiple times. An alternative is to add an extra string or integer argument to the function, pass in a unique value from each call site, and then print that value within the function.
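
A sketch of that call-site tagging (the extra site argument is hypothetical): every caller passes a distinct label, and a single print statement inside the function covers them all.

fn hot_function(n: usize, site: &str) {
    eprintln!("HF {}", site); // later: grep HF log | counts
    let _ = n; // ... the real work ...
}

fn main() {
    hot_function(1, "parser");
    hot_function(2, "typeck");
    hot_function(3, "typeck");
}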

It’s occasionally useful to look at the raw logs as well as the output of counts, because the sequence of output lines can be informative. For example, I recently diagnosed an occurrence of quadratic behaviour in the Rust compiler by seeing that a loop iterated 1, 2, 3, …, 9000+ times.

The Code

counts is available here.

Conclusion

I use counts to do ad hoc profiling all the time. It’s the first tool I reach for any time I have a question about code execution patterns. I have used it extensively for every bout of major performance work I have done in the past few years, as well as in plenty of other circumstances. I even built direct support for it into rustc-perf, the Rust compiler’s benchmark suite, via the profile eprintln subcommand. Give it a try!

Categories
Performance Rust

How to speed up the Rust compiler some more in 2018

I recently wrote about some work I’ve done to speed up the Rust compiler. Since then I’ve done some more.

rustc-perf improvements

Since my last post, rustc-perf — the benchmark suite, harness and visualizer — has seen some improvements. First, some new benchmarks were added: cargo, ripgrep, sentry-cli, and webrender. Also, the parser benchmark has been removed because it was a toy program and thus not a good benchmark.

Second, I added support for several new profilers: Callgrind, Massif, rustc’s own -Ztime-passes, and the use of ad hoc eprintln! statements added to rustc. (This latter case is more useful than it might sound; in combination with post-processing it can be very helpful, as we will see below.)

Finally, the graphs shown on the website now have better y-axis scaling, which makes many of them easier to read. Also, there is a new dashboard view that shows performance across rustc releases.

Failures and incompletes

After my last post, multiple people said they would be interested to hear about optimization attempts of mine that failed. So here is an incomplete selection. I suggest that rustc experts read through these, because there is a chance they will be able to identify alternative approaches that I have overlooked.

nearest_common_ancestors 1: I managed to speed up this hot function in a couple of ways, but a third attempt failed. The scope tree is not represented with a typical tree data structure; instead there is a HashMap of child/parent pairs. This means that moving from a child to its parent node requires a HashMap lookup, which is expensive. I spent some time designing and implementing an alternative data structure that stored nodes in a vector, with the child-to-parent links represented as indices into that vector. This meant that child-to-parent moves only required stepping through the vector. It worked, but the speed-up turned out to be very small, and the new code was significantly more complicated, so I abandoned it.

nearest_common_ancestors 2: A different part of the same function involves storing seen nodes in a vector. Searching this unsorted vector is O(n), so I tried instead keeping it in sorted order and using binary search, which gives O(log n) search. However, this change meant that node insertion changed from amortized O(1) to O(n) — instead of a simple push onto the end of the vector, insertion could be at any point, which required shifting all subsequent elements along. Overall this change made things slightly worse.

PredicateObligation SmallVec: There is a type Vec<PredicateObligation> that is instantiated frequently, and the vectors often have few elements. I tried using a SmallVec instead, which avoids the heap allocations when the number of elements is below a threshold. (A trick I’ve used multiple times.) But this made things significantly slower! It turns out that these Vecs are copied around quite a bit, and a SmallVec is larger than a Vec because the elements are inline. Furthermore, PredicateObligation is a large type, over 100 bytes. So what happened was that memcpy calls were inserted to copy these SmallVecs around. The slowdown from the extra function calls and memory traffic easily outweighed the speedup from avoiding the Vec heap allocations.
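
A small sketch of the size difference behind this (assuming a 64-bit target): a Vec is always three words regardless of its element size, whereas anything that stores elements inline is at least as big as its inline buffer, so every move copies all of that data.

fn main() {
    println!("{}", std::mem::size_of::<Vec<[u8; 104]>>()); // 24: pointer + length + capacity
    println!("{}", std::mem::size_of::<[[u8; 104]; 4]>()); // 416: the inline storage itself
}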

SipHasher128: Incremental compilation does a lot of hashing of data structures in order to determine what has changed from previous compilation runs. As a result, the hash function used for this is extremely hot. I tried various things to speed up the hash function, including LEB128-encoding of usize inputs (a trick that worked in the past) but I failed to speed it up.

LEB128 encoding: Speaking of LEB128 encoding, it is used a lot when writing metadata to file. I tried optimizing the LEB128 functions by special-casing the common case where the value is less than 128 and so can be encoded in a single byte. It worked, but gave a negligible improvement, so I decided it wasn’t worth the extra complication.

Token shrinking: A previous PR shrunk the Token type from 32 to 24 bytes, giving a small win. I tried also replacing the Option<ast::Name> in Literal with just ast::Name and using an empty name to represent “no name”. That  change reduced it to 16 bytes, but produced a negligible speed-up and made the code uglier, so I abandoned it.

#50549: rustc’s string interner was structured in such a way that each interned string was duplicated twice. This PR changed it to use a single Rc’d allocation, speeding up numerous benchmark runs, the best by 4%. But just after I posted the PR, @Zoxc posted #50607, which allocated the strings out of an arena, as an alternative. This gave better speed-ups and was landed instead of my PR.

#50491: This PR introduced additional uses of LazyBTreeMap (a type I had previously introduced to reduce allocations) speeding up runs of multiple benchmarks, the best by 3%. But at around the same time, @porglezomp completed a PR that changed BTreeMap to have the same lazy-allocation behaviour as LazyBTreeMap, rendering my PR moot. I subsequently removed the LazyBTreeMap type because it was no longer necessary.

#51281: This PR, by @Mark-Simulacrum, removed an unnecessary heap allocation from the RcSlice type. I had looked at this code because DHAT’s output showed it was hot in some cases, but I erroneously concluded that the extra heap allocation was unavoidable, and moved on! I should have asked about it on IRC.

Wins

#50339: rustc’s pretty-printer has a buffer that can contain up to 55 entries, and each entry is 48 bytes on 64-bit platforms. (The 55 somehow comes from the algorithm being used; I won’t pretend to understand how or why.) Cachegrind’s output showed that the pretty printer is invoked frequently (when writing metadata?) and that the zero-initialization of this buffer was expensive. I inserted some eprintln! statements and found that in the vast majority of cases only the first element of the buffer was ever touched. So this PR changed the buffer to default to length 1 and extend when necessary, speeding up runs for numerous benchmarks, the best by 3%.

#50365: I had previously optimized the nearest_common_ancestor function. Github user @kirillkh kindly suggested a tweak to the code from that PR which reduced the number of comparisons required. This PR implemented that tweak, speeding up runs of a couple of benchmarks by another 1–2%.

#50391: When compiling certain annotations, rustc needs to convert strings from unescaped form to escaped form. It was using the escape_unicode function to do this, which unnecessarily converts every char to \u{1234} form, bloating the resulting strings greatly. This PR changed the code to use the escape_default function — which only escapes chars that genuinely need escaping — speeding up runs of most benchmarks, the best by 13%. It also slightly reduced the on-disk size of produced rlibs, in the best case by 15%.

#50525: Cachegrind showed that even after the previous PR, the above string code was still hot, because string interning was happening on the resulting string, which was unnecessary in the common case where escaping didn’t change the string. This PR added a scan to determine if escaping is necessary, thus avoiding the re-interning in the common case, speeding up a few benchmark runs, the best by 3%.

#50407: Cachegrind’s output showed that the trivial methods for the simple BytePos and CharPos types in the parser are (a) extremely hot and (b) not being inlined. This PR annotated them so they are inlined, speeding up most benchmarks, the best by 5%.

#50564: This PR did the same thing for the methods of the Span type, speeding up incremental runs of a few benchmarks, the best by 3%.

#50931: This PR did the same thing for the try_get function, speeding up runs of many benchmarks, the best by 1%.

#50418: DHAT’s output showed that there were many heap allocations of the cmt type, which is refcounted. Some code inspection and ad hoc instrumentation with eprintln! showed that many of these allocated cmt instances were very short-lived. However, some of them ended up in longer-lived chains, in which the refcounting was necessary. This PR changed the code so that cmt instances were mostly created on the stack by default, and then promoted to the heap only when necessary, speeding up runs of three benchmarks by 1–2%. This PR was a reasonably large change that took some time, largely because it took me five(!) attempts (the final git branch was initially called cmt5) to find the right dividing line between where to use stack allocation and where to use refcounted heap allocation.

#50565: DHAT’s output showed that the dep_graph structure, which is an IndexVec<DepNodeIndex,Vec<DepNodeIndex>>, caused many allocations, and some eprintln! instrumentation showed that the inner Vecs mostly had only a few elements. This PR changed the Vec<DepNodeIndex> to SmallVec<[DepNodeIndex;8]>, which avoids heap allocations when the number of elements is less than 8, speeding up incremental runs of many benchmarks, the best by 2%.

#50566: Cachegrind’s output shows that the hottest part of rustc’s lexer is the bump function, which is responsible for advancing the lexer onto the next input character. This PR streamlined it slightly, speeding up most runs of a couple of benchmarks by 1–3%.

#50818: Both Cachegrind and DHAT’s output showed that the opt_normalize_projection_type function was hot and did a lot of heap allocations. Some eprintln! instrumentation showed that there was a hot path involving this function that could be explicitly extracted that would avoid unnecessary HashMap lookups and the creation of short-lived Vecs. This PR did just that, speeding up most runs of serde and futures by 2–4%.

#50855: DHAT’s output showed that the macro parser performed a lot of heap allocations, particularly on the html5ever benchmark. This PR implemented ways to avoid three of them: (a) by storing a slice of a Vec in a struct instead of a clone of the Vec; (b) by introducing a “ref or box” type that allowed stack allocation of the MatcherPos type in the common case, but heap allocation when necessary; and (c) by using Cow to avoid cloning a PathBuf that is rarely modified. These changes sped up runs of html5ever by up to 10%, and crates.io by up to 3%. I was particularly pleased with these changes because they all involved non-trivial changes to memory management that required the introduction of additional explicit lifetimes. I’m starting to get the hang of that stuff… explicit lifetimes no longer scare me the way they used to. It certainly helps that rustc’s error messages do an excellent job of explaining where explicit lifetimes need to be added.

#50932: DHAT’s output showed that a lot of HashSet instances were being created in order to de-duplicate the contents of a commonly used vector type. Some eprintln! instrumentation showed that most of these vectors only had 1 or 2 elements, in which case the de-duplication can be done trivially without involving a HashSet. (Note that the order of elements within this vector is important, so de-duplication via sorting wasn’t an option.) This PR added special handling of these trivial cases, speeding up runs of a few benchmarks, the best by 2%.
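
A sketch of the trivial-case handling (this is not the compiler’s code): vectors with zero, one, or two elements can be de-duplicated directly, preserving order, without ever building a HashSet.

use std::collections::HashSet;
use std::hash::Hash;

fn dedup_preserving_order<T: Hash + Eq + Clone>(v: &mut Vec<T>) {
    match v.len() {
        0 | 1 => {} // nothing to do
        2 => {
            if v[0] == v[1] {
                v.pop(); // drop the duplicate
            }
        }
        _ => {
            // The general (rare) case still uses a HashSet.
            let mut seen = HashSet::new();
            v.retain(|x| seen.insert(x.clone()));
        }
    }
}

fn main() {
    let mut v = vec!["a", "a"];
    dedup_preserving_order(&mut v);
    assert_eq!(v, vec!["a"]);
}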

#50981: The compiler does a liveness analysis that involves vectors of indices that represent variables and program points. In rare cases these vectors can be enormous; compilation of the inflate benchmark involved one that has almost 6 million 24-byte elements, resulting in 143MB of data. This PR changed the type used for the indices from usize to u32, which is still more than large enough, speeding up “clean incremental” builds of inflate by about 10% on 64-bit platforms, as well as reducing their peak memory usage by 71MB.

What’s next?

These improvements, along with those recently done by others, have significantly sped up the compiler over the past month or so: many workloads are 10–30% faster, and some even more than that. I have also seen some anecdotal reports from users about the improvements over recent versions, and I would be interested to hear more data points, especially those involving rustc nightly.

The profiles produced by Cachegrind, Callgrind, and DHAT are now all looking very “flat”, i.e. with very little in the way of hot functions that stick out as obvious optimization candidates. (The main exceptions are the SipHasher128 functions I mentioned above, which I haven’t been able to improve.) As a result, it has become difficult for me to make further improvements via “bottom-up” profiling and optimization, i.e. optimizing hot leaf and near-leaf functions in the call graph.

Therefore, future improvement will likely come from “top-down” profiling and optimization, i.e. observations such as “rustc spends 20% of its time in phase X, how can that be improved”? The obvious place to start is the part of compilation taken up by LLVM. In many debug and opt builds LLVM accounts for up to 70–80% of instructions executed. It doesn’t necessarily account for that much time, because the LLVM parts of execution are parallelized more than the rustc parts, but it still is the obvious place to focus next. I have looked at a small amount of generated MIR and LLVM IR, but there’s a lot to take in. Making progress will likely require a much broader understanding of things than many of the optimizations described above, most of which require only a small amount of local knowledge about a particular part of the compiler’s code.

If anybody reading this is interested in trying to help speed up rustc, I’m happy to answer questions and provide assistance as much as I can. The #rustc IRC channel is also a good place to ask for help.

Categories
Performance Rust

The Rust compiler is getting faster

TL;DR: The Rust compiler has gotten 1.06x–4x faster over the past month.

As changes are made to the Rust compiler, a suite of benchmarks measuring compile time is run regularly on the development version. The data is viewable at http://perf.rust-lang.org. The default view is graphical, showing data from the past month.

Screenshot of perf.rust-lang.org showing measurements of the html5ever benchmark

The screenshot above shows the graphs for a single benchmark called “html5ever”, which consists of an old version of the project of the same name. Each one shows measurements for a different kind of build: a debug build, a “check” build (which detects errors but doesn’t generate code), and an optimized build. Within each graph there are the following three data series.

  • Clean: a normal build.
  • Baseline incremental: an incremental build with no prior incremental runs. Such a build is a little slower than a normal build, because it does normal compilation and also gathers information to guide subsequent incremental builds.
  • Clean incremental: an incremental build run immediately after a baseline incremental build. This is the best-case scenario for incremental compilation in which the minimal amount of work is done.

If you visit the site yourself you’ll see that most of the benchmarks have more than three data series, including ones for incremental builds done after small code changes (a more realistic use case), and one for builds with non-lexical lifetimes enabled.

The x-axis shows time and the y-axis shows instruction counts. Other units of measurement are available, including cycles, time, and memory usage. Instruction counts are shown as the default; this isn’t ideal because it’s only a proxy for the measurement that really matters (time)… but it’s a pretty good proxy, and it has much lower variation than the time measurements, which is important when detecting changes.

This graphical view is particularly useful for detecting major changes. For example, you can see that in early May there was a major regression for “clean” and “baseline incremental” builds, which Alex Crichton fixed a few days later.

As well as the graphical view, the site also provides a textual “compare” view, which can be reached via the link at the top left of each page. This view compares measurements from two revisions of the compiler; by default it compares the most recently measured revision with one from a month ago. (It can also be used locally, which is very useful to evaluate changes that speed up the compiler.)

Screenshot of perf.rust-lang.org’s “compare” view

The screenshot above is of the “compare” view at the time of writing. Each line corresponds to a single graph from the graphical view. (If you visit the site and click on an individual entry it will expand and show all of the measurements. The resemblance between those measurements and this screenshot will of course diminish over time.) The “avg” column shows the average change across all the data series. The “min” and “max” columns show the minimum and maximum changes for any of the data series. The “serde” and “script-servo” lines are empty because those benchmarks were added to the suite less than a month ago, so no comparison can be made.

The table has many numbers, but the thing to take away is that they are almost all significantly negative, meaning that compile time has reduced. The “avg” numbers range from 6% to 38%; the “min” numbers (i.e. best result) go as high as 75%; the “max” numbers (i.e. worst result) go as high as 36%.

In conclusion: the Rust compiler has gotten significantly faster in the past month. Across a wide range of programs, and a wide range of build configurations, compile times have reduced by between 6% and 75%. To put it another way, the compiler has gotten between 1.06x and 4x faster.

These benefits are available right now to users of the Nightly channel. Users of the Release channel will see them more gradually, spread across one or two versions released over the next few months.

Categories
Performance Rust

How to speed up the Rust compiler in 2018

18 months ago I wrote about some work I did to speed up the Rust compiler (rustc). I’ve recently taken this work up again. Also, in the meantime rustc’s build system has been replaced and its benchmark suite has been overhauled. So it’s a good time for an update.

Getting the code

The steps for getting the rustc code haven’t changed. First, I fork the main Rust repository on GitHub. Then I make two local clones: a base clone that I won’t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). I use commands something like this:

user=nnethercote
for r in rust0 rust1 ; do
  git clone https://github.com/$user/rust $r
  cd $r
  git remote add upstream https://github.com/rust-lang/rust
  git remote set-url origin git@github.com:$user/rust
done

Building the Rust compiler

The compiler’s build system is complex with many possible invocations and configurations. I’ll cover the absolute minimum information required to understand how I’ve been using it.

(UPDATE, December 2019: I have updated this to account for config.toml changes since this post was written.)

First, you need a config.toml file, which sits at the root of the repository and dictates the compiler’s configuration. I used the provided config.toml.example as a starting point, cut it down a lot, and ended up with the following.

# Last updated in December 2019
[llvm]
release-debuginfo = true
[rust]
debuginfo-level = 1

This configuration matches the compiler that ships (in particular, in terms of optimization levels), with the exception that it has maximal debug info present, which ensures that profiles are as detailed as possible.

Building rustc can be confusing, particularly because of the multiple compiler stages and the terminology around them. Here is the command I use.

(UPDATE, September 2020: I have updated this to account for the src directory being renamed compiler.)

./x.py build --stage 2 compiler/rustc

This produces a stage 2 compiler that can handle procedural macros and so is suitable for profiling the benchmark suite.

The benchmark suite

rustc’s benchmark suite is called rustc-perf. It consists of two parts:

  • collector: The 24 benchmark programs, 14 of which are “real” code (i.e. crates used in real applications), and 10 of which are toy programs or stress test microbenchmarks. Also, the harness code that runs and measures them.
  • site: Code for displaying measurements as a website.

The test harness is very thorough. For each benchmark, it measures Debug, Opt, and Check (no code generation) invocations. Furthermore, within each of those categories, it does five or more runs, including a normal build and various kinds of incremental builds. A full benchmarking run measures over 400 invocations of rustc. Note however that only the compilation of the final crate is measured in each benchmark; compilation of dependent crates is not included.

rustc-perf was created primarily for the perf.rust-lang.org site, which tracks rustc’s performance over time. I recently modified it so it could be used to compare two builds of rustc on a local machine, which is a fundamental operation when optimizing. This can be done by running the suite and site locally and navigating to the local “compare” page, which looks like this:

Screenshot of rustc-perf's compare page

Note that rustc-perf uses perf-stat for its measurements, so the benchmarking functionality currently only works on Linux.

I also extended rustc-perf so the benchmarks can be run under a profiler. I implemented support for perf-record, Cachegrind, and DHAT, because they are the profilers that I am most familiar with; it isn’t hard to add support for other profilers (including non-Linux ones). The advantage of integrating support for a profiler into rustc-perf is that it gets the profiler invocations underneath the Cargo invocations, ensuring that the right rustc invocations are measured.

Wins

Here are the improvements I’ve made to rustc over the past few weeks.

#49993: Cachegrind’s output showed that derived methods for the Token type were taking up significant time. This PR changed the ‘#’ counts in raw Lit variants from usize to u16, which reduced the size of Token from 32 to 24 bytes, speeding up some of the runs for coercions and html5ever by 1%.

#50051: Rust’s Option::ok_or function transforms an Option<T> value into a Result<T, E>. It’s a bit of a footgun… as the docs say:

Arguments passed to ok_or are eagerly evaluated; if you are passing the result of a function call, it is recommended to use ok_or_else, which is lazily evaluated.

DHAT’s output showed that one hot allocation site involved the creation of a String while getting the value of the MIRI_BACKTRACE environment variable. This seemed to me like a strange thing to be happening frequently, and it turns out that it was at the end of a chain of calls that were only needed in the (rare) error case of an ok_or call. This PR changed the code to use a trivial closure with ok_or_else, speeding up runs for a lot of benchmarks, the best by 6%.

#50052: DHAT’s output showed that the char_lit function, which parses “\u{…}” literals, was doing a lot of allocations while stripping out ‘_’ chars. This PR avoided that allocation, speeding up various runs — particularly ones for regex, futures, clap, coercions, hyper, and encoding — the best by 6%.

#50106: Cachegrind’s output showed that the nearest_common_ancestor function, which computes the lowest common ancestor of two nodes in a scope tree, was very hot. The algorithm in use constructed the full scope chain for each node, and then worked backward from the end of the two scope chains until a difference was found. This is a reasonable algorithm in many circumstances, but some ad hoc instrumentation (eprintln! statements plus some simple post-processing) showed that the scope chains usually only differ by a handful of elements at the front and then have very long common tails, with dozens or even hundreds of elements. This PR switched to a different algorithm that looks for differences from the front of the scope chain, speeding up runs for many benchmarks, the best by 8%.

#50174: By default, Rust’s HashSet and HashMap use a hash function that is very high quality but also very slow. Therefore, the Rust compiler internally uses different types, FxHashSet and FxHashMap, which are identical to the standard ones except they use a much faster hash function. Unfortunately, it’s easy to forget about them and use the standard hash tables instead. Cachegrind’s output showed that the default hash function code (SipHash) was executed a lot, and that one particularly hot hash table (the symbol interner) was using HashMap. This PR (trivially) changed that table to an FxHashMap, speeding up runs for numerous benchmarks, the best by 5%.

#50240: Some of Rust’s standard containers, such as Vec, HashSet, and HashMap, have the nice property that by default they don’t allocate until an element is inserted. This is good because it’s surprising how often such containers are created but never inserted into. DHAT’s output showed that such behaviour would also help with a couple of the compiler’s uses of BTreeMap. I tried and failed to implement this behaviour directly in BTreeMap; according to Gankro, “BTreeMap is some of the most complex unsafe code in libstd” and “I just scared off a grizzled firefox dev explaining it”! Instead this PR introduced a thin wrapper type (LazyBTreeMap) around BTreeMap and used it in the handful of relevant places within the compiler, speeding up the runs for several benchmarks, the best by 3%. #50266 is open to do the general fix for BTreeMap, whereupon LazyBTreeMap can be removed.
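
The wrapper is conceptually just this (a sketch, not the exact type that landed): hold an Option<BTreeMap> and only create the map, and thus allocate, on the first insertion.

use std::collections::BTreeMap;

struct LazyBTreeMap<K, V>(Option<BTreeMap<K, V>>);

impl<K: Ord, V> LazyBTreeMap<K, V> {
    fn new() -> Self {
        LazyBTreeMap(None) // no BTreeMap yet, so no allocation
    }

    fn insert(&mut self, k: K, v: V) -> Option<V> {
        // The first insertion creates the underlying map.
        self.0.get_or_insert_with(BTreeMap::new).insert(k, v)
    }

    fn get(&self, k: &K) -> Option<&V> {
        self.0.as_ref().and_then(|m| m.get(k))
    }
}

fn main() {
    let mut m: LazyBTreeMap<u32, &str> = LazyBTreeMap::new();
    assert!(m.get(&1).is_none());
    m.insert(1, "one");
    assert_eq!(m.get(&1), Some(&"one"));
}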

#50246: Cachegrind’s output showed that a function named dump_allocs was hot for some benchmarks. This sounded to me like a logging or debugging function of some kind, and investigation confirmed that it was traversing data structures in order to build up strings that went unused in the standard case where logging is disabled. This PR (trivially) changed this function and a couple of related ones to be no-ops if logging is disabled, speeding up runs for coercions, tuple-stress, html5ever, and encoding, the best by almost 15%! This shows how not doing unclever things is often as important as doing clever things when it comes to optimizing software.

Update: It’s worth noting that I also made three or four optimization attempts that didn’t work out — where I made a change that seemed like it should help, based on profiling data, but the effect was negligible. Success isn’t guaranteed!

Future work

All of the PRs mentioned above (except for the aborted BTreeMap change) involved small, simple changes to the Rust compiler’s code. I’m not a rustc expert, but I do know how to use a couple of profilers well, and I’ve been able to make a difference. I’m sure there are more improvements of this nature to be made, and I encourage other people to try profiling rustc with their favourite profilers to see what they can find. This is valuable because rustc’s speed is something that Rust users often complain about. And it’s fun, if you like that sort of thing 🙂  I’m happy to help people, and the members of the #rustc IRC channel are very friendly and helpful.

Having said that, in a lot of cases, especially for opt builds, the majority of execution time is within LLVM, which rustc uses for code generation. Speeding up LLVM itself may be difficult, but I hope/suspect there is room for improvement in the way that rustc interacts with LLVM. If anyone has ideas on that front I’d love to hear about them.

Categories
Gecko Profiler Performance

Improving the Gecko Profiler

Over the last three months I landed 167 patches, via 41 Bugzilla bugs, for the Gecko Profiler. These included crash fixes, assertion failure fixes, data race fixes, optimization fixes, and a great many refactorings.

Background

The Gecko Profiler is a profiler built into Firefox. It can be used with the Gecko Profiler Addon to profile Firefox. It also provides the core profiling mechanism that is used by Firefox’s devtools to profile JavaScript content code.

It’s a crucial component.  It was originally written 5 years ago in something of a hurry, because we desperately needed a built-in profiler, one that could give more detailed, custom information than is possible with an external profiler. As I understand it, part of it was imported from V8, so it was a mix of previously existing code and new code. And in the years since it has been extended by multiple people, in a variety of ways that didn’t necessarily mesh well.

As a result, at the start of Q1 it was in pretty bad shape. Crashes and assertion failures were frequent, and the code itself was hard to read and maintain. Markus Stange had recently taken over ownership, and had made some great improvements to the Addon, but the core code needed work as well. So I started digging in to see what I could find and improve. There was a lot!

Fixes

Bug 1331571. The profiler had code for incorporating power consumption estimates from the Intel Power Gadget. Unfortunately, this integration had major flaws: Intel Power Gadget only gives very coarse power consumption estimates; the profiler samples at 1000Hz and is CPU intensive and so is likely to skew the power consumption estimates significantly; and nobody had ever used it meaningfully. So I removed it.

Bug 1317771. The profiler had a “standalone” configuration that allowed it to be used in programs other than Firefox. But it was complex (lots of #ifdef statements) and broken and unlikely to be of use. So I removed it.

Bug 1328369, Bug 1328373: Dead code removal.

Bug 1332577. The public API for the profiler was a mess. It was split across several header files, and most API functions had an “outer” version with a profiler_ prefix that immediately called into an inner version with a mozilla_sampler_ prefix (though there were some inconsistencies). So I combined the various header files into a single file, GeckoProfiler.h, and simplified the API functions into a single level, all consistently named with a profiler_ prefix.

Bug 1333296. Even the name of the profiler was a source of confusion. It was originally known as “SPS”, which I believe is short for “simple profiling system”. At some point that changed to “the Gecko Profiler”, although it was also occasionally referred to as “the built-in profiler”! Because of this history, the code was littered with references to SPS. In this bug I updated them all to refer to the Gecko Profiler. (I updated the MDN docs, too. The page name still uses “Built-in Profiler” because I don’t know how to change MDN page names.)

Bug 1329684. I removed some mutex wrapper classes that I think were necessary at one point for the “standalone” configuration.

Bug 1328365. Thread-local storage was being used to store a pointer to an object that was only accessed on the main thread, so I changed it to be a global variable. I also renamed some variables whose names referred to a type that had been renamed a long time ago.

Bug 1333655. The profiler had a cross-platform thread abstraction that was clumsy and over-engineered, so I streamlined it.

Bug 1334466. The profiler had a class called Sampler, which I think was imported from V8, and a subclass called GeckoSampler. Both classes were fairly complex, and we only ever instantiated the subclass. The separation merely obscured things, so I merged the subclass into Sampler. Having all that code in a single class and a single module made it much easier to see exactly what it was doing.

Bug 1335595. Two classes, ThreadInfo and ThreadProfile, were used for per-thread data. They were hopelessly intertwined: each one had a pointer to the other, and multiple fields were present (i.e. duplicated) in both of them. So I just merged them.

Bug 1336326. Three minor clean-ups.

Bug 816598. I implemented a memory reporter for the profiler. This was first requested in 2012, and five duplicate bugs had been filed in the interim!

Bug 1126576. I removed some very grotty manual refcounting from the PseudoStack class, which simplified things. A little too much, in fact… I misunderstood how things worked, causing a crash, which I subsequently fixed in bug 1340161.

Bug 1337189. The aforementioned Sampler class was still over-engineered. It only ever had 0 or 1 instantiations, and was essentially an unnecessary level of abstraction. In this bug I got rid of it by merging it into another file, which took 27 patches! (One of these patches introduced a regression, which I later fixed in bug 1340327.) At this point a lot of the core code that had previously been spread across multiple files and classes was now in a single file, tools/profiler/core/platform.cpp, and it was becoming increasingly obvious that there was a lot of global state being accessed from multiple threads with insufficient thread synchronization.

Bug 1338957. The profiler tracks which threads are “sleeping” (i.e. blocked on some operation), to support an optimization where it samples sleeping threads in a cheaper-than-normal fashion. It was using two counters and a boolean to track the sleep state of each thread. These three values were accessed from multiple threads; two of them were atomic, and one wasn’t, so the whole setup was very racy. I managed to condense the three values into a single atomic tri-state value, which made things simpler and thread-safe.
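
As a rough illustration of the shape of that change, a single atomic tri-state might look like the following. The state names and helper functions below are mine, not the profiler’s actual code.

#include <atomic>

// Illustrative per-thread sleep state; the real states and semantics differ.
enum SleepState : int {
  AWAKE,
  SLEEPING_NOT_YET_SAMPLED,  // asleep; the next sample must still be taken fully
  SLEEPING_ALREADY_SAMPLED,  // asleep; further samples can reuse the previous one
};

std::atomic<int> sSleepState{AWAKE};  // replaces two counters and a plain boolean

void OnThreadSleep() { sSleepState.store(SLEEPING_NOT_YET_SAMPLED); }
void OnThreadWake()  { sSleepState.store(AWAKE); }

// Called by the sampler thread: returns true if the previous sample can be reused.
bool CanDuplicateLastSample()
{
  int expected = SLEEPING_NOT_YET_SAMPLED;
  // Claim the first post-sleep sample exactly once; after that, duplication is fine.
  return !sSleepState.compare_exchange_strong(expected, SLEEPING_ALREADY_SAMPLED) &&
         expected == SLEEPING_ALREADY_SAMPLED;
}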

Bug 1339327. I landed eight refactoring patches with no particular common theme, mostly involving renaming things and removing unnecessary stuff.

Bug 1339435. I removed two erroneous assertions that I had added in an earlier patch — two functions that I thought only ran on the main thread turned out to run off the main thread as well.

Bug 1339695. The profiler has a lot of code that is specific to a particular architecture (e.g. x86), OS (e.g. Windows), or platform (e.g. x86/Windows). The #ifdef statements used to select these were massively inconsistent — turns out there are many ways to detect this stuff — so I fixed this up. Among other things, this involved using the nice constants in tools/profiler/core/PlatformMacros.h consistently throughout the profiler’s code. (I fixed a regression — caused by mistyping one of the #ifdef conditions, alas! — from this change in bug 1350211. And another one involving --disable-profiling in bug 1348776.) I also renamed some files that had .cc extensions instead of the usual .cpp because they had (I think) been imported from V8.

Bug 1340928. At this point I had started working on a major change to the handling of the profiler’s core global state. It would inevitably be a big patch, but I wanted it to be as small as possible, so I started aggressively carving off small changes that could be landed separately. This bug featured 16 of them.

Bug 1328378. The profiler has two kinds of samples: periodic, which are taken by a separate thread in response to a timer firing, and synchronous, which a thread takes itself in response to a request via the profiler API. There are a lot of similarities between the two, but also some important differences. This bug took some steps to simplify the messy handling of synchronous samples.

Bug 1344118. I mentioned earlier that the profiler tracks which threads are “sleeping” to support an optimization: when a thread is asleep, we can mostly duplicate its last sample without unwinding its stack. But the optimization was buggy and would become a catastrophic pessimization in certain circumstances, due to what should have been a short O(1)-ish buffer search becoming O(n²)-ish, which would quickly peg one CPU at 100% usage. As far as I can tell, this bug had been present ever since the optimization was implemented three years ago. (It’s possible it wasn’t noticed because its effect increases as more threads are profiled, but the profiler defaults to only profiling the main thread and the compositor thread.) The fix was straightforward once the diagnosis was made, and Julian Seward did a follow-up that made the optimization even more effective.

Bug 1342306. In this bug I put almost all of the profiler’s global state into a single class and protected accesses to it with a mutex. Unlike the old code, the new code is simple and obviously thread-safe. The final patch in this bug was much bigger than I would have liked, at 142 KiB, even after I carved off as many precursor patches as I could. Unsurprisingly, there were some follow-up fixes required: bug 1346356 (a leak and a deadlock), bug 1347044 (another deadlock), bug 1348374 (yet another deadlock), and bug 1350967 (surprise! another deadlock).

Bug 1345262. I fixed an assertion failure caused by the profiler and the JS engine having differing views about what functions should be called on what threads.

Bug 1347348. Five more assorted clean-ups.

Bug 1349856. I fixed a minor error involving a call to the profiler from Gecko.

Bug 1346132. I removed the profiler’s bespoke logging system, replacing it with the standard Mozilla one. I also made the logging output more concise and informative.

Bug 1350212. I cleaned up a class and its uses a bit.

Bug 1351523. I reordered one function’s arguments to match the order used in two related functions.

Bug 1351528. I removed some unused values from an enum.

Bug 1348024. I simplified some environment variables used by the profiler.

Bug 1351946. I removed some gnarly code for starting the profiler on B2G.

Bug 1351136. The profiler’s testing coverage is not very good, as can be seen from the numerous regressions I introduced and fixed. So I added a gtest that improves coverage. There’s still room for more test coverage improvement.

Bug 1351963. I further clarified the handling of synchronous vs. periodic samples, and simplified the ownership of some per-thread data structures.

Discussion

I learned some interesting things while doing this work.

Learning a component

Three months ago I knew almost nothing about the profiler’s code. Today I’m a module peer.

At the start of January I had been told that the profiler needed work, and I had assigned myself a Q1 deliverable to “land three improvements to the Gecko Profiler”. I started just by looking for easy, obvious code clean-ups, such as dead code removal, fixing inconsistent naming of things, and removing unnecessary indirections. (These are the kinds of clean-ups you can make with only shallow understanding.) The profiler proved to be a target-rich environment for such changes!

After I made a few dozen such changes I started understanding more deeply how the pieces fit together. (Partly because I’d been staring at the code a lot, and partly because my changes were making the code easier to understand. Refactorings add up.) I started interleaving my easy clean-up patches with ones that required more insight into how the profiler worked. I made numerous mistakes along the way, as the various regression fixes above show. But that’s ok.

I also kept a text file in which I had a list of ideas for things to fix. Every time I saw something that looked like it could be improved, I added it to the file, and I repeatedly checked the file when deciding what to work on next. As my understanding of the code improved, multiple times I realized that items I had written down were wrong, or incomplete, or that seemingly separate things were related. (In fact, I’m still using the file, because I still have numerous things I want to improve.)

Multi-threaded programming basics

Although I first learned C and C++ about 20 years ago, and I have worked at Mozilla for more than 8 years, this was the first time I’ve ever done serious multi-threaded programming, i.e. at a level where I needed a reasonably deep understanding of how various threads can interact. I got the following two great tips from Julian Seward, which helped a lot.

  • Write down pseudocode for each thread.
  • Write down potential worst-case thread operation interleavings.

I also found it helpful to add comments (or assertions, where appropriate) to the top of functions that indicate which thread or threads they run on. For example:

void profiler_gathered_OOP_profile()
{
  MOZ_RELEASE_ASSERT(NS_IsMainThread());
  ...
}

and:

void profiler_thread_sleep()
{ 
  // This function runs both on and off the main thread.
  ...
}

A useful idiom: proof-of-lock tokens

I also employed a programming idiom that turned out to be extremely helpful. Most of the global profiler state is in a single class called ProfilerState. There is a single instance of this class, gPS, and a single mutex that protects it, gPSMutex. To guarantee that no code is able to access gPS’s contents without first locking the mutex, for every field in ProfilerState there is a getter and a setter, both of which require a “proof-of-lock” token, which takes the form of a const PS::AutoLock&, where PS::AutoLock is an RAII type that locks and unlocks a mutex.

For example, consider this function, which checks if the profiler is paused.

bool profiler_is_paused()
{
  PS::AutoLock lock(gPSMutex);
  if (!gPS->IsActive(lock)) {
    return false;
  }
  return gPS->IsPaused(lock);
}

The PS::AutoLock locks the mutex. IsActive() and IsPaused() both access fields within gPS, and so they are passed lock, which serves as the proof-of-lock token. IsPaused() and SetIsPaused() are implemented as follows.

bool IsPaused(const PS::AutoLock&) const { return mIsPaused; }
void SetIsPaused(const PS::AutoLock&, bool aIsPaused) { mIsPaused = aIsPaused; }

Neither function actually uses the proof-of-lock token. Nonetheless, any function that calls a ProfilerState getter or setter must either lock gPSMutex, or have an ancestor that does. This idiom has two very nice benefits.

  • You can’t access gPS‘s contents without having first locked gPSMutex. (Well, it is possible to subvert the protection, but not by accident.)
  • It’s obvious that all functions that have a proof-of-lock argument are called only while gPSMutex is locked.

Functions that are called from multiple places sometimes must be split in two: an outer function in which the mutex is initially unlocked, and an inner function that takes a proof-of-lock token. This isn’t hard, though.
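
A sketch of that split, reusing the names from the examples above (locked_profiler_pause is a hypothetical inner function, not necessarily the real one):

// Inner function: may only be called while gPSMutex is already locked.
void locked_profiler_pause(const PS::AutoLock& aLock)
{
  gPS->SetIsPaused(aLock, true);
}

// Outer function: public entry point, called with the mutex unlocked.
void profiler_pause()
{
  PS::AutoLock lock(gPSMutex);
  locked_profiler_pause(lock);
}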

Deadlocks vs. data races

After my big change to the profiler’s global state, I had to fix numerous deadlocks. This wasn’t too hard. Deadlocks (caused by too much thread synchronization) are obvious, easy to diagnose, and these ones weren’t hard to fix. It’s useful to contrast them with data races (caused by too little thread synchronization) which typically have subtle effects and are difficult to diagnose.
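
The easiest way to introduce such a deadlock after centralizing everything under one mutex is accidental re-entry: code that already holds gPSMutex calls back into a public entry point that tries to lock it again. A contrived sketch, assuming a non-recursive mutex (this is not one of the actual bugs, and the outer function is hypothetical):

void profiler_do_shutdown_work()
{
  PS::AutoLock lock(gPSMutex);  // gPSMutex acquired here...
  // ... tear-down work ...
  profiler_is_paused();         // ...and this call tries to acquire gPSMutex
                                // again, which blocks forever: a deadlock.
}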

Patch discipline

For this work I wrote a lot of small patches. This is my preferred way to work, for two reasons. First, small patches make life easier for reviewers, which in turn results in faster reviews. Second, small patches make regression hunting easy. It’s always nice when you bisect a regression to a small patch.

Future work and Thanks

The profiler still has plenty of room for improvement, and I’m planning to do more work on it in Q2. In the meantime, if you’ve tried the profiler in the past and had problems it might be worth trying again. It’s in much better shape now.

Finally, many thanks to Markus Stange for reviewing the majority of the patches and answering lots of questions, and Julian Seward for reviewing most of the remainder and for numerous helpful discussions about threaded programming.

Categories
Performance Rust

How to speed up the Rust compiler some more

I recently wrote about some work I’ve done to speed up the Rust compiler. Since then I’ve done some more.

Heap allocations

My last post mentioned how heap allocations were frequent within rustc. This led to some wild speculation in some venues: about whether Rust should use heap allocation at all, about whether garbage collection was necessary, and so on. I have two comments about this.

The first is that rustc is a compiler, and like most compilers its execution is dominated by complex traversals of large tree structures: ASTs, IRs, etc. Tree structures typically require heap allocation. In particular, a lot of these tree structures contain Vec, HashMap and HashSet fields, all of which unavoidably use heap allocation.

The second is that although some heap allocation is unavoidable, the amount that rustc was doing was excessive. It was clear that nobody had made a concerted effort to minimize heap allocations, at least not for a long time. With the help of DHAT I’ve been able to greatly reduce the amount done. Any time you throw a new profiler at a large codebase that hasn’t been heavily optimized there’s a good chance you’ll be able to make some sizeable improvements.

Tools

As in my previous post, I focused almost entirely on the benchmarks present in rustc-benchmarks to guide my efforts.

Once again I mostly used Cachegrind and DHAT to profile these benchmarks. I also used Massif (plus the excellent massif-visualizer) to profile peak memory usage on one workload (see below).

I have also recently started using perf a bit more. Huon Wilson told me to add the --call-graph=dwarf flag to my perf record invocations. That improves things significantly, though I still find perf puzzling and frustrating at times, even after reading Brendan Gregg’s thorough examples page.

Wins

#37229: rustc spends a lot of time doing hash table lookups, enough that the cost of hashing is significant. In this PR I changed the hash function used by rustc (FNV) to one inspired by the hash function used within Firefox. The new function is faster because it can process an entire word at a time, rather than one byte at a time. The new hash function is slightly worse in terms of the number of collisions but the change sped up compilation of most workloads by 3–6%.
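
As a rough sketch of the word-at-a-time idea (this is not the code that landed; the rotate/xor/multiply structure and the constant follow the FxHash-style design described above, and should be treated as illustrative):

#include <cstddef>
#include <cstdint>
#include <cstring>

uint64_t HashBytes(const uint8_t* aData, size_t aLen)
{
  const uint64_t kMul = 0x517cc1b727220a95ULL;   // FxHash-style 64-bit multiplier
  uint64_t hash = 0;
  while (aLen >= 8) {
    uint64_t word;
    std::memcpy(&word, aData, 8);                // consume 8 bytes per step...
    hash = ((hash << 5) | (hash >> 59)) ^ word;  // ...instead of 1, as FNV does
    hash *= kMul;
    aData += 8;
    aLen -= 8;
  }
  while (aLen > 0) {                             // trailing bytes, one at a time
    hash = ((hash << 5) | (hash >> 59)) ^ *aData++;
    hash *= kMul;
    aLen--;
  }
  return hash;
}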

#37161 & #37318 & #37373: These three PRs removed a lot of unnecessary heap allocations (mostly due to clone()) in the macro parser. They sped up the compilation of html5ever by 20% and 7% and 2%, respectively.

#37267 & #37298: rustc uses “deflate” compression in a couple of places: crate metadata and LLVM bitcode. In the metadata case rustc was often compressing metadata and then throwing away the result! (Thanks to eddyb for diagnosing this and explaining to me how to fix it.) Avoiding this unnecessary work sped up compilation of syntex-incr by 6% and several others by 1–2%.

For the LLVM bitcode case I tweaked the deflate settings so that it ran almost twice as fast while the compression ratio was only slightly worse. This sped up compilation of syntex-incr by another 8% and a couple of others by 1–2%. It’s possible that switching to a different compression algorithm would help some more, though that would be a much larger change.

#37108: This PR avoided interning values of the Substs type in cases where it was easy to tell that the value had previously been interned. It sped up compilation of several benchmarks by 1–4%.

#37705: This PR avoided some unnecessary calls to mk_ty. It sped up compilation of one benchmark by 5% and a few others by 1–2%.

#37083: This PR inlined various methods involved with uleb128 decoding, which is used when reading crate metadata. This sped up compilation of several benchmarks by 1%.

#36973: This PR fixed things so that a data structure that is only required for incremental compilation is not touched during non-incremental compilation. This sped up compilation of several non-incremental benchmarks by 1%.

#37445 & #37764: One unusual workload was found to make rustc consume excessive amounts of memory (4.5 GiB!) which made it OOM on some machines. I used Massif to identify ways to improve this.

The first PR reduced the size of the Expr enum by shrinking the outsized InlineAsm variant, which reduced peak memory usage by 9%. The second PR removed scope_auxiliary, a data structure used only during MIR dumping (and of marginal utility even then). This reduced peak memory usage by another 10%.
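
The general trick behind the first PR is that a sum type is as large as its largest variant, so boxing an outsized variant shrinks every value of the type. A hedged C++ analogue of the idea (rustc’s actual Expr and InlineAsm types are Rust, and rather more involved):

#include <array>
#include <memory>
#include <variant>

struct Call { int callee; int args[4]; };             // typical small variants
struct Lit  { long value; };
struct InlineAsmBig { std::array<char, 256> text; };  // one outsized variant

// Every value pays for the largest alternative...
using ExprFat  = std::variant<Call, Lit, InlineAsmBig>;

// ...unless the outsized alternative is stored behind a pointer.
using ExprSlim = std::variant<Call, Lit, std::unique_ptr<InlineAsmBig>>;

static_assert(sizeof(ExprSlim) < sizeof(ExprFat),
              "boxing the big variant shrinks the whole type");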

I have filed a PR to add a cut-down version of this workload to rustc-benchmarks.

#36993: This PR did some manual inlining and restructuring of several ObligationForest functions. It sped up compilation of inflate by 2%.

#37642: This PR reduced some excessive indirection present in the representation of HIR. It sped up compilation of some benchmarks by 1, 2 and 4%.

#37427: rustc uses Blake2b hashing to determine when a function’s code has changed. This PR reduced the number of bytes to be hashed by (a) avoiding hashing filenames twice for each span, and (b) pre-uleb128-encoding 32-bit and 64-bit integers, which are usually small. This sped up compilation of syntex-incr by 2%.
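
ULEB128 is a variable-length encoding that stores seven payload bits per byte with a continuation bit, so the small integers that dominate this data hash as a single byte rather than four or eight. A minimal sketch of the encoder (my own, not rustc’s):

#include <cstdint>
#include <vector>

// Unsigned LEB128: 7 payload bits per byte, high bit set on all but the last byte.
// Values below 128 therefore occupy a single byte.
void AppendULEB128(uint64_t aValue, std::vector<uint8_t>& aOut)
{
  do {
    uint8_t byte = aValue & 0x7f;
    aValue >>= 7;
    if (aValue != 0) {
      byte |= 0x80;  // more bytes follow
    }
    aOut.push_back(byte);
  } while (aValue != 0);
}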

Future work

As the compiler front-end gets faster, the proportion of rustc execution time spent in the LLVM back-end increases. For some benchmarks it now exceeds 50% (when doing debug builds), though there is still plenty of variation. Thanks to mrhota there is a new --enable-llvm-release-debuginfo configure option that (unsurprisingly) enables debuginfo within LLVM. This means that profilers can now give filenames and line numbers for LLVM code. I’ve looked at a few places in LLVM that show up high in profiles, though I haven’t yet managed to make any useful changes to it.

Another interesting new development is pnkfelix’s -Zprint-type-sizes option, which should land soon, and will potentially be useful for any program written in Rust, not just rustc. This option will make it trivial to see how each type is laid out in memory, which will make it easy to see how types can be rearranged to be smaller. Up until now this has been a painful and imprecise exercise.

Finally, I want to encourage anyone else with the slightest knack and/or enthusiasm for optimizing code to take a look at rustc. You might be thinking that there isn’t that much low-hanging fruit left, but I am confident there is. It’s getting harder for me to find things to improve, but I am just one person with a particular background and a few preferred profiling tools. I am confident that other people with different backgrounds and tools will find plenty of stuff that I cannot. Rust compile speed still isn’t great, even after these improvements, but the more people who pitch in, the faster it will improve. Don’t be shy, and if you have any questions please contact me or ask in the #rust or #rustc IRC channels.

Categories
add-ons Firefox Performance

Faster Firefox startup & shutdown with add-ons present

Thanks to some recent improvements by Kris Maglione, add-ons built with Firefox’s add-ons SDK are now much faster to start up and shut down. For users with multiple add-ons installed, this can make a large difference. For example, one user reported the following.

I’ve noticed a dramatic improvement in startup and shutdown time since the fixes landed on aurora.

Configuration: Two windows with 15 tabs each and 22 addons.

  • Startup: 25 seconds -> 14 seconds (44% reduction)
  • Shutdown: 30 seconds -> 5 seconds (83% reduction)

Shutdown used to take so long it was triggering a minidump.

Your mileage may vary, but these improvements have the potential to benefit many users. They are present in Firefox 50 (currently in Beta, and due for release on November 15) and all later versions.