I last wrote in April about my work on speeding up the Rust compiler. Time for another update.
Weekly performance triage
First up is a process change: I have started doing weekly performance triage. Each Tuesday I have been looking at the performance results of all the PRs merged in the past week. For each PR that has regressed or improved performance by a non-negligible amount, I add a comment to the PR with a link to the measurements. I also gather these results into a weekly report, which is mentioned in This Week in Rust, and also looked at in the weekly compiler team meeting.
The goal of this is to ensure that regressions are caught quickly and appropriate action is taken, and to raise awareness of performance issues in general. It takes me about 45 minutes each time. The instructions are written in such a way that anyone can do it, though it will take a bit of practice for newcomers to become comfortable with the process. I have started sharing the task around, with Mark Rousskov doing the most recent triage.
This process change was inspired by the “Regressions prevented” section of an excellent blost post from Nikita Popov (a.k.a. nikic), about the work they have been doing to improve the speed of LLVM. (The process also takes some ideas from the Firefox Nightly crash triage that I set up a few years ago when I was leading Project Uptime.)
The speed of LLVM directly impacts the speed of rustc, because rustc uses LLVM for its backend. This is a big deal in practice. The upgrade to LLVM 10 caused some significant performance regressions for rustc, though enough other performance improvements landed around the same time that the relevant rustc release was still faster overall. However, thanks to nikic’s work, the upgrade to LLVM 11 will win back much of the performance lost in the upgrade to LLVM 10.
It seems that LLVM performance perhaps hasn’t received that much attention in the past, so I am pleased to see this new focus. Methodical performance work takes a lot of time and effort, and can’t effectively be done by a single person over the long-term. I strongly encourage those working on LLVM to make this a team effort, and anyone with the relevant skills and/or interest to get involved.
Better benchmarking and profiling
There have also been some major improvements to rustc-perf, the performance suite and harness that drives perf.rust-lang.org, and which is also used for local benchmarking and profiling.
#683: The command-line interface for the local benchmarking and profiling commands was ugly and confusing, so much so that one person mentioned on Zulip that they tried and failed to use them. We really want people to be doing local benchmarking and profiling, so I filed this issue and then implemented PRs #685 and #687 to fix it. To give you an idea of the improvement, the following shows the minimal commands to benchmark the entire suite.
# Old target/release/collector --db <DB> bench_local --rustc <RUSTC> --cargo <CARGO> <ID>
# New target/release/collector bench_local <RUSTC> <ID>
Full usage instructions are available in the README.
#675: Joshua Nelson added support for benchmarking rustdoc. This is good because rustdoc performance has received little attention in the past.
#699, #702, #727, #730: These PRs added some proper CI testing for the local benchmarking and profiling commands, which had a history of being unintentionally broken.
Mark Rousskov also made many small improvements to rustc-perf, including reducing the time it takes to run the suite, and improving the presentation of status information.
Last year I wrote about inlining and code bloat, and how they can have a major effect on compile times. I mentioned that tooling to measure code size would be helpful. So I was happy to learn about the wonderful cargo-llvm-lines, which measures how many lines of LLVM IR generated for each function. The results can be surprising, because generic functions (especially commons ones like
Result::map_err()) can be instantiated dozens or even hundreds of times in a single crate.
I worked on multiple PRs involving cargo-llvm-lines.
#15: This PR added percentages to the output of cargo-llvm-lines, making it easier to tell how important each function’s contribution is to the total amount of code.
#20, #663: These PRs added support for cargo-llvm-lines within rustc-perf, which made it easy to measure the LLVM IR produced for the standard benchmarks.
RawVec::grow() is a function that gets called by
Vec::push(). It’s a large generic function that deals with various cases relating to the growth of vectors. This PR moved most of the non-generic code into a separate non-generic function, for wins of up to 5%.
(Even after that PR, measurements show that the vector growth code still accounts for a non-trivial amount of code, and it feels like there is further room for improvement. I made several failed attempts to improve it further: #72189, #73912, #75093, #75129. Even though they reduced the amount of LLVM IR generated, they were performance losses. I suspect this is because these additional changes affected the inlining of some of these functions, which can be hot.)
#72166: This PR added some specialized
Iterator methods (
find_map()) for slices, winning up to 9% on
clap-rs, and up to 2% on various other benchmarks.
#72139: This PR added a direct implementation for
Iterator::fold(), replacing the old implementation that called the more general
Iterator::try_fold(). This won up to 2% on several benchmarks.
#73882: This PR streamlined the code in
RawVec::allocate_in(), winning up to 1% on numerous benchmarks.
cargo-llvm-lines is also useful to application/crate authors. For example, Simon Sapin managed to speed up compilation of the largest crate in Servo by 28%! Install it with
cargo install cargo-llvm-lines and then run it with
cargo llvm-lines (for debug builds) or
cargo llvm-lines --release (for release builds).
#71942: this PR shrunk the
LocalDecl type from 128 bytes to 56 bytes, reducing peak memory usage of a few benchmarks by a few percent.
#72227: If you push multiple elements onto an empty
Vec it has to repeatedly reallocate memory. The growth strategy in use resulted in the following sequence of capacities: 0, 1, 2, 4, 8, 16, etc. “Tiny vecs are dumb”, so this PR changed it to 0, 4, 8, 16, etc., in most cases, which reduced the number of allocations done by rustc itself by 10% or more and sped up many benchmarks by up to 4%. In theory, the change could increase memory usage, but in practice it doesn’t.
#74214: This PR eliminated some symbol interner accesses, for wins of up to 0.5%.
#74310: This PR changed
SparseBitSet to use an
ArrayVec instead of a
SmallVec for its storage, which is possible because the maximum length is known in advance, for wins of up to 1%.
#75133: This PR eliminated two fields in a struct that were only used in the printing of an error message in the case of an internal compiler error, for wins of up to 2%.
Speed changes since April 2020
Since my last blog post, changes in compile times have been mixed (table, graphs). It’s disappointing to not see a sea of green in the table results like last time, and there are many regressions that seem to be alarming. But it’s not as bad as it first seems! Understanding this requires knowing a bit about the details of the benchmark suite.
Most of the benchmarks that saw large percentage regressions are extremely short-running. (The benchmark descriptions help make this clearer.) For example a non-incremental check build of
helloworld went from 0.03s to 0.08s. (#70107 and #74682) are two major causes.) In practice, a tiny additional overhead of a few 10s of milliseconds per crate isn’t going to be noticeable when many crates take seconds or tens of seconds to compile.
Among the “real-world” benchmarks, some of them saw mixed results (e.g.
ripgrep), while some of them saw clear improvement, some of which were large (e.g.
With all that in mind, since my last post, the compiler is probably either no slower or somewhat faster for most real-world cases.
Another interesting data point about the speed of rustc over the long-term came from Hacker News: compilation of one project (lewton) got 2.5x faster over the past three years.
LLVM 11 hasn’t landed yet, so that will give some big improvements for real-world cases soon. Hopefully for my next post the results will be more uniformly positive.