How to speed up the Rust compiler in 2018: NLL edition

Niko Matsakis recently blogged about the Rust compiler’s new borrow checker, which implements non-lexical lifetimes (NLL). The new borrow checker is a really nice improvement to Rust, because it accepts many sound programs that the old borrow checker rejected.

In the blog post, Niko wrote briefly about the performance of the new borrow checker.

Finally, those of you who read the previous posts may remember that the performance of the NLL checker was a big stumbling block. I’m happy to report that the performance issues were largely addressed: there remains some slight overhead to using NLL, but it is largely not noticeable in practice, and I expect we’ll continue to improve it over time.

This paragraph is true, but glosses over a lot of details! This post will be about my contributions to this performance work.

Wins

Before I describe individual improvements, it’s worth mentioning that the new borrow checker uses bitsets (1D) and bit matrices (2D) heavily. A number of my wins involved these data structures.

#51869: This PR changed some code so that it overwrote an existing dense bitset rather than replacing it with a newly created one of the same size, reducing instruction counts for most benchmarks, the best by 1.5%.

#51870: This PR reused a structure containing two bitsets rather than recreating it afresh for every statement in a basic block, reducing instruction counts for numerous benchmarks, the best by 1%.

#52250: The compiler has a SparseBitMatrix type. Rows were added on demand, and each row was implemented as a sparse bitset using a BTreeMap. In practice, many of the rows were relatively dense, with 10–90% of the bits being set. This PR changed SparseBitMatrix to use a dense representation for rows, reducing instruction counts on one benchmark by 33% and by 1% on a few others. The PR had a mixed effect on memory usage, increasing the peak on some benchmarks and reducing it on others. (Never fear! #54318 below ended up fixing the regressions.)

#52342: This PR avoided a bunch of allocations in Canonicalizer methods, reducing instruction counts on numerous benchmarks, the best by 2%.

#53383: Further profiling showed that some dense bitsets were large, but had very few bits set within them, so the dense representation was wasteful. This PR implemented a new hybrid bitset type that uses a sparse representation for bitsets with up to 8 bits set and switches to a dense representation beyond that, and used it to replace dense bitsets in several places, reducing instruction counts on the five slowest benchmarks by 55%, 21%, 16%, 10% and 9%, and reducing peak memory usage of three benchmarks by 53%, 33%, and 9%.

#53513: This PR force-inlined a function at one hot callsite, reducing instruction counts on two benchmarks by 1%.

#53551: This PR avoided some clone calls, reducing instruction counts for one benchmark by 0.5%.

#53733: This PR added special handling for a very common and simple case in unroll_place, reducing the instruction counts on one benchmark by 25%.

#53942: A function called precompute_borrows_out_of_scope does a traversal of one or more basic blocks. In order to detect whether a basic block had been previously visited, it recorded the ID of every visited statement in a hash table. Some basic blocks can have many statements, resulting in many hash table lookups. This PR changed the code to record the ID of visited basic blocks in the hash table instead of visited statements — trickier than it sounds because the analysis can start in the middle of a basic block, in which case the first half might need to be eventually visited — reducing instruction counts on one benchmark by 60%.

#54211: Liveness analysis created an array that could get very large. Each element in the array was a struct containing two u32s and a bool. Each of those elements took up 12 bytes, but only 9 of the bytes held data. This PR split the array into three separate arrays, one per field, making the code slightly less readable but reducing peak memory usage on one benchmark by 20%. (Fun fact: those u32s used to be usizes, but I shrunk them back in May.)

#54213: This PR tweaked some code so that the lifetimes of two large data structures didn’t overlap, reducing peak memory usage by 27% on one benchmark and 8% on another. (Note: those data structures are dominated by, you guessed it, bitsets!)

#54318: This PR changed SparseBitMatrix so that each instantiated row used the hybrid bitset representation from #53383 instead of the dense representation, reducing peak memory usage by 14–45% on four benchmarks, and 96% (from 29.1GB to 1.2GB) on one external crate! This PR also fixed the peak memory regression that #52250 introduced for a few benchmarks.

#54420: I subsequently realized that #54211 didn’t go far enough. Some debugging println! statements showed that both of the u32s in each liveness entry almost always held a special value that meant “invalid”. When data is repetitive, compression is possible: I could use a packed representation where the common (INVALID, INVALID, true) and (INVALID, INVALID, false) cases were represented by special u32 values, and all other triples were represented by a u32 index into an auxiliary table. This PR changed the representation as described, reducing instruction counts on numerous benchmarks, the best by 16%, and reducing peak memory usage on numerous benchmarks, the best by 38%. (I also tried a more compact representation where each element was a single byte; it reduced peak memory usage some more by the instruction count reduction was less, so I went with the earlier approach.)

Progress and current status

You probably noticed that some of the improvements in the previous section were large, and I wasn’t the only one working on NLL performance; Niko Matsakis and David Wood also contributed some big wins. This is because the new borrow checker’s performance was originally, to be honest, terrible. This is understandable; the focus had been on functionality and correctness, which is fair enough for a large and complex new component. Nonetheless, in June I was very nervous about its performance.

To be more specific, “check” builds (which don’t generate code) ran as much as 50x slower with the new borrow checker on some benchmarks. And multiple benchmarks were disabled on CI because they were simply too slow or used too much memory.

Issue #52028 tells a representative story. It was originally filed because the html5ever benchmark was triggering out-of-memory failures on CI. Measurements with Massif and DHAT showed that its peak heap memory usage was over 14 GB, largely caused by a single 12 GB allocation! In comparison, the peak with the old borrow checker was roughly 200–300 MB. If you read through that issue, you can see that over a period of 2.5 months we reduced the memory usage from 14 GB, to 10 GB, to 2 GB, to 1.2 GB, to 600 MB, to 501 MB, and finally to 266 MB.

And things are pretty good now. The instruction counts on “check” builds for all benchmarks are at most 18% higher with the new borrow checker than the old borrow checker, and are typically around 5%. (And note that “check” builds are the worst-case scenario; non-“check” builds will see a smaller relative slowdown because of the extra time needed for code generation, which is unaffected by the borrow checker.) Memory usage is similar: all benchmarks except one have peak memory usage that is at most 20% higher, with the typical value around 3%. (The one remaining exceptional benchmark uses 2.7x memory.) The worse numbers generally occur on programs containing very large constants.

I’m not entirely happy even with this level of performance regression, but for now I have run out of ideas for improving it further. The new borrow checker is a lot more sophisticated and tracks a lot more data, so strict performance parity is a tough ask. Nonetheless, given how bad performance was a few months ago, I’m happy that we’ve got it down to a level where most people probably won’t notice any difference. Given that the new borrow checker makes Rust a significantly nicer and easier language to use, I hope it’s an acceptable trade-off.

Wins

Progress and current status

One reply on “How to speed up the Rust compiler in 2018: NLL edition”