I recently wrote about some work I’ve done to speed up the Rust compiler. Since then I’ve done some more.
My last post mentioned how heap allocations were frequent within rustc. This led to some wild speculation in some venues: about whether Rust should use heap allocation at all, about whether garbage collection was necessary, and so on. I have two comments about this.
The first is that rustc is a compiler, and like most compilers its execution is dominated by complex traversals of large tree structures: ASTs, IRs, etc. Tree structures typically require heap allocation. In particular, a lot of these tree structures contain
HashSet fields, all of which unavoidably use heap allocation.
The second is that although some heap allocation is unavoidable, the amount that rustc was doing was excessive. It is clear that nobody had (or not for a long time) made a concerted effort to minimize heap allocations. With the help of DHAT I’ve been able to greatly reduce the amount done. Any time you throw a new profiler at any large codebase that hasn’t been heavily optimized there’s a good chance you’ll be able to make some sizeable improvements.
As in my previous post, I focused almost entirely on the benchmarks present in rustc-benchmarks to guide my efforts.
Once again I mostly used Cachegrind and DHAT to profile these benchmarks. I also used Massif (plus the excellent massif-visualizer) to profile peak memory usage on one workload (see below).
I have also recently started using perf a bit more. Huon Wilson told me to add the
--callgraph=dwarf flag to my
perf record invocations. That improves things significantly, though I still find perf puzzling and frustrating at times, even after reading Brendan Gregg’s thorough examples page.
#37229: rustc spends a lot of time doing hash table lookups, enough that the cost of hashing is significant. In this PR I changed the hash function used by rustc (FNV) to one inspired by the hash function used within Firefox. The new function is faster because it can process an entire word at a time, rather than one byte at a time. The new hash function is slightly worse in terms of the number of collisions but the change sped up compilation of most workloads by 3–6%.
#37161 & #37318 & These three PRs removed a lot of unnecessary heap allocations (mostly due to
clone()) in the macro parser. They sped up the compilation of html5ever by 20% and 7% and 2%, respectively.
#37267 & #37298: rustc uses “deflate” compression in a couple of places: crate metadata and LLVM bitcode. In the metadata case rustc was often compressing metadata and then throwing away the result! (Thanks to eddyb for diagnosing this and explaining to me how to fix it.) Avoiding this unnecessary work speed up compilation of syntex-incr by 6% and several others by 1–2%.
For the LLVM bitcode case I tweaked the deflate settings so that it ran almost twice as fast while the compression ratio was only slightly worse. This sped up compilation of syntex-incr by another 8% and a couple of others by 1–2%. It’s possible that switching to a different compression algorithm would help some more, thought that would be a much larger change.
#37108: This PR avoided interning values of the
Substs type in cases where it was easy to tell that the value had previously been interned. It sped up compilation of several benchmarks by 1–4%.
& One unusual workload was found to make rustc consume excessive amounts of memory (4.5 GiB!) which made it OOM on some machines. I used Massif to identify ways to improve this.
The first PR reduced the size of the
Expr enum by shrinking the outsized
InlineAsm variant, which reduced peak memory usage by 9%. The second PR removed
scope_auxiliary, a data structure used only during MIR dumping (and of marginal utility even then). This reduced peak memory usage by another 10%.
I have filed a PR to add a cut-down version of this workload to rustc-benchmarks.
to determine when a function’s code has changed. This PR reduced the number of bytes to be hashed by (a) avoiding hashing filenames twice for each span, and (b) pre-uleb128-encoding 32-bit and 64-bit integers, which are usually small. This s
As the compiler front-end gets faster, the proportion of rustc execution time spent in the LLVM back-end increases. For some benchmarks it now exceeds 50% (when doing debug builds), though there is still plenty of variation. Thanks to mrhota there is a new –enable-llvm-release-debuginfo configure option that (unsurprisingly) enabled debuginfo within LLVM. This means that profilers can now give filenames and line numbers for LLVM. I’ve looked at a few places in LLVM that show up high in profiles, though I haven’t yet managed to make any useful changes to it.
Another interesting new development is pnkfelix’s -Zprint-type-sizes option, which should land soon, and will potentially be useful for any program written in Rust, not just rustc. This option will make it trivial to see how each type is laid out in memory, which will make it easy to see how types can be rearranged to be smaller. Up until now this has been a painful and imprecise exercise.
Finally, I want to encourage anyone else with the slightest knack and/or enthusiasm for optimizing code to take a look at rustc. You might be thinking that there isn’t that much low-hanging fruit left, but I am confident there is. It’s getting harder for me to find things to improve, but I am one just person with a particular background and a few preferred profiling tools. I am confident that other people with different backgrounds and tools will find plenty of stuff that I cannot. Rust compile speed still isn’t great, even after these improvements, but the more people who pitch in the faster they’ll improve. Don’t be shy, and if you have any questions please contact me or ask in the #rust or #rustc IRC channels.