libsyntax has three tables in a global data structure, called
Globals, storing information about spans (code locations), symbols, and hygiene data (which relates to macro expansion). Accessing these tables is moderately expensive, so I found various ways to improve things.
#59693: Every element in the AST has a span, which describes its position in the original source code. Each span consists of an offset, a length, and a third value that is related to macro expansion. The three fields are 12 bytes in total, which is a lot to attach to every AST element, and much of the time the three fields can fit in much less space. So the compiler used a 4 byte compressed form with a fallback to a hash table stored in
Globals for spans that didn’t fit in 4 bytes. This PR changed that to 8 bytes. This increased memory usage and traffic slightly, but reduced the fallback rate from roughly 10-20% to less than 1%, speeding up many workloads, the best by an amazing 14%.
#61253: There are numerous operations that accessed the hygiene data, and often these were called in pairs or trios, thus repeating the hygiene data lookup. This PR introduced compound operations that avoid the repeated lookups. This won 10% on packed-simd, up to 3% on numerous other workloads.
#61484: Similar to #61253, this won up to 2% on many benchmarks.
#60630: The compiler has an interned string type, called symbol. It used this inconsistently. As a result, lots of comparisons were made between symbols and ordinary strings, which required a lookup of the string in the symbols table and then a char-by-char comparison. A symbol-to-symbol comparison is much cheaper, requiring just an integer comparison. This PR removed the symbol-to-string comparison operations, forcing more widespread use of the symbol type. (Fortunately, most of the introduced symbol uses involved statically-known, pre-interned strings, so there weren’t additional interning costs.) This won up to 1% on various benchmarks, and made the use of symbols more consistent.
#60815: Similar to #60630, this also won up to 1% on various benchmarks.
The following improvements didn’t have any common theme.
#57719: This PR inlined a very hot function, for a 4% win on one workload.
#58210: This PR changed a hot assertion to run only in debug builds, for a 20%(!) win on one workload.
#58207: I mentioned string interning earlier. The Rust compiler also uses interning for a variety of other types where duplicate values are common, including a type called
LazyConst. However, the
intern_lazy_const function was buggy and didn’t actually do any interning — it just allocated a new
LazyConst without first checking if it had been seen before! This PR fixed that problem, reducing peak memory usage and page faults by 59% on one benchmark.
#59507: The pretty-printer was calling
write! for every space of
indentation, and on some workloads the indentation level can exceed 100. This PR reduced it to a single
write! call in the vast majority of cases, for up to a
7% win on a few benchmarks.
#59626: This PR changed the preallocated size of one data structure to better match what was needed in practice, reducing peak memory usage by 20 MiB on some workloads.
#61612: This PR optimized a hot path within the parser, whereby constant tokens were uselessly subjected to repeated “is it a keyword?” tests, for up to a 7% win on programs with large constants.
The following changes involved improvements to our profiling tools.
#59899: I modified the output of
-Zprint-type-sizes so that enum variants are listed from largest to smallest. This makes it much easier to see outsized variants, especially for enums with many variants.
#62110: I improved the output of the
-Ztime-passes flag by removing some uninteresting entries that bloated the output and adding a measurement for the total compilation time.
Also, I improved the profiling support within the rustc-perf benchmark suite. First, I added support for profiling with OProfile. I admit I haven’t used it enough yet to gain any wins. It seg faults about half the time when I run it, which isn’t encouraging.
Second, I added support for profiling with the new version of DHAT. This blog post is about 2019, but it’s worth mentioning some improvements I made with the new DHAT’s help in Q4 of 2018, since I didn’t write a blog post about that period: #55167, #55346, #55383, #55384, #55501, #55525, #55556, #55574, #55604, #55777, #55558, #55745, #55778, #55905, #55906, #56268, #56090, #56269, #56336, #56369, #56737, and (ena crate) #14.
Finally, I wrote up brief descriptions for all the benchmarks in rustc-perf.
The improvements above (and all the improvements I’ve done before that) can be described as micro-optimizations, where I used profiling data to optimize a small piece of code.
But it’s also worth thinking about larger, systemic improvements to Rust compiler speed. In this vein, I worked in Q2 with Alex Crichton on pipelined compilation, a feature that increases the amount of parallelism available when building a multi-crate Rust project by overlapping the compilation of dependent crates. In diagram form, a compilation without pipelining looks like this:
metadata metadata [-libA----|--------][-libB----|--------][-binary-----------] 0s 5s 10s 15s 20s 30s
With pipelined compilation, it looks like this:
[-libA----|--------] [-libB----|--------] [-binary-----------] 0s 5s 10s 15s 25s
For more details on how it works, how to use it, and lots of measurements, see this thread. The effects are highly dependent on a project’s crate structure and the compiling machine’s configuration. We have seen speed-ups as high as 1.84x, while some projects see no speed-up at all. At worst, it should make things only negligibly slower, because it’s not causing any additional work, just changing the order in which certain things happen.
Pipelined compilation is currently a Nightly-only feature. There is a tracking issue for stabilizing the feature here.
I have a list of things I want to investigate in Q3.
- For pipelined compilation, I want to try pushing metadata creation even earlier in the compiler front-end, which may increase the speed-ups some more.
- The compiler uses
memcpya lot; not directly, but the generated code uses it for value moves and possibly other reasons. In “check” builds that don’t do any code generation, typically 2-8% of all instructions executed occur within
memcpy. I want to understand why this is and see if it can be improved. One possibility is moves of excessively large types within the compiler; another possibility is poor code generation. The former would be easier to fix. The latter would be harder to fix, but would benefit many Rust programs.
- Incremental compilation sometimes isn’t very effective. On some workloads, if you make a tiny change and recompile incrementally it takes about the same time as a full non-incremental compilation. Perhaps a small change to the incremental implementation could result in some big wins.
- I want to see if there are other hot paths within the parser that could be improved, like in #61612.
I also have various pieces of Firefox work that I need to do in Q3, so I might not get to all of these. If you are interested in working on these ideas, or anything else relating to Rust compiler speed, please get in touch.