How to speed up the Rust compiler some more

I recently wrote about some work I’ve done to speed up the Rust compiler. Since then I’ve done some more.

Heap allocations

My last post mentioned how heap allocations were frequent within rustc. This led to some wild speculation in some venues: about whether Rust should use heap allocation at all, about whether garbage collection was necessary, and so on. I have two comments about this.

The first is that rustc is a compiler, and like most compilers its execution is dominated by complex traversals of large tree structures: ASTs, IRs, etc. Tree structures typically require heap allocation. In particular, a lot of these tree structures contain Vec, HashMap and HashSet fields, all of which unavoidably use heap allocation.

The second is that although some heap allocation is unavoidable, the amount that rustc was doing was excessive. It is clear that nobody had (or not for a long time) made a concerted effort to minimize heap allocations. With the help of DHAT I’ve been able to greatly reduce the amount done. Any time you throw a new profiler at any large codebase that hasn’t been heavily optimized there’s a good chance you’ll be able to make some sizeable improvements.

Tools

As in my previous post, I focused almost entirely on the benchmarks present in rustc-benchmarks to guide my efforts.

Once again I mostly used Cachegrind and DHAT to profile these benchmarks. I also used Massif (plus the excellent massif-visualizer) to profile peak memory usage on one workload (see below).

I have also recently started using perf a bit more. Huon Wilson told me to add the --callgraph=dwarf flag to my perf record invocations. That improves things significantly, though I still find perf puzzling and frustrating at times, even after reading Brendan Gregg’s thorough examples page.

Wins

#37229: rustc spends a lot of time doing hash table lookups, enough that the cost of hashing is significant. In this PR I changed the hash function used by rustc (FNV) to one inspired by the hash function used within Firefox. The new function is faster because it can process an entire word at a time, rather than one byte at a time. The new hash function is slightly worse in terms of the number of collisions but the change sped up compilation of most workloads by 3–6%.

#37161 & #37318 & #37373: These three PRs removed a lot of unnecessary heap allocations (mostly due to clone()) in the macro parser. They sped up the compilation of html5ever by 20% and 7% and 2%, respectively.

#37267 & #37298: rustc uses “deflate” compression in a couple of places: crate metadata and LLVM bitcode. In the metadata case rustc was often compressing metadata and then throwing away the result! (Thanks to eddyb for diagnosing this and explaining to me how to fix it.) Avoiding this unnecessary work speed up compilation of syntex-incr by 6% and several others by 1–2%.

For the LLVM bitcode case I tweaked the deflate settings so that it ran almost twice as fast while the compression ratio was only slightly worse. This sped up compilation of syntex-incr by another 8% and a couple of others by 1–2%. It’s possible that switching to a different compression algorithm would help some more, thought that would be a much larger change.

#37108: This PR avoided interning values of the Substs type in cases where it was easy to tell that the value had previously been interned. It sped up compilation of several benchmarks by 1–4%.

#37705: This PR avoided some unnecessary calls to mk_ty. It sped up compilation of one benchmark by 5% and a few others by 1–2%.

#37083: This PR inlined various methods involved with uleb128 decoding, which is used when reading crate metadata. This sped up compilation of several benchmarks by 1%.

#36973: This PR fixed things so that a data structure that is only required for incremental compilation is not touched during non-incremental compilation. This sped up compilation of several non-incremental benchmarks by 1%.

#37445 & #37764: One unusual workload was found to make rustc consume excessive amounts of memory (4.5 GiB!) which made it OOM on some machines. I used Massif to identify ways to improve this.

The first PR reduced the size of the Expr enum by shrinking the outsized InlineAsm variant, which reduced peak memory usage by 9%. The second PR removed scope_auxiliary, a data structure used only during MIR dumping (and of marginal utility even then). This reduced peak memory usage by another 10%.

I have filed a PR to add a cut-down version of this workload to rustc-benchmarks.

#36993: This PR did some manual inlining and restructuring of several ObligationForest functions. It sped up compilation of inflate by 2%.

#37642: This PR reduced some excessive indirection present in the representation of HIR. It sped up compilation of some benchmarks by 1, 2 and 4%.

#37427: rustc uses Blake2b hashing to determine when a function’s code has changed. This PR reduced the number of bytes to be hashed by (a) avoiding hashing filenames twice for each span, and (b) pre-uleb128-encoding 32-bit and 64-bit integers, which are usually small. This sped up compilation of syntex-incr by 2%.

Future work

As the compiler front-end gets faster, the proportion of rustc execution time spent in the LLVM back-end increases. For some benchmarks it now exceeds 50% (when doing debug builds), though there is still plenty of variation. Thanks to mrhota there is a new –enable-llvm-release-debuginfo configure option that (unsurprisingly) enabled debuginfo within LLVM. This means that profilers can now give filenames and line numbers for LLVM. I’ve looked at a few places in LLVM that show up high in profiles, though I haven’t yet managed to make any useful changes to it.

Another interesting new development is pnkfelix’s -Zprint-type-sizes option, which should land soon, and will potentially be useful for any program written in Rust, not just rustc. This option will make it trivial to see how each type is laid out in memory, which will make it easy to see how types can be rearranged to be smaller. Up until now this has been a painful and imprecise exercise.

Finally, I want to encourage anyone else with the slightest knack and/or enthusiasm for optimizing code to take a look at rustc. You might be thinking that there isn’t that much low-hanging fruit left, but I am confident there is. It’s getting harder for me to find things to improve, but I am one just person with a particular background and a few preferred profiling tools. I am confident that other people with different backgrounds and tools will find plenty of stuff that I cannot. Rust compile speed still isn’t great, even after these improvements, but the more people who pitch in the faster they’ll improve. Don’t be shy, and if you have any questions please contact me or ask in the #rust or #rustc IRC channels.

2 Responses to How to speed up the Rust compiler some more

  1. Using perf directly is painful. If you’re just looking at CPU time, sysprof is really nice to work with. And I bet if you wanted to be able to look at other metrics (cache misses, etc.), the current maintainer would be up for working with you on a UI for choosing those.

    Thanks for your work on rustc!