How to speed up the Rust compiler one last time in 2019

I last wrote in October about my work on speeding up the Rust compiler. With the year’s end approaching, it’s time for an update.

Doc comments

#65750: Normal comments are stripped out during parsing and not represented in the AST. But doc comments have semantic meaning (`/// blah` is equivalent to the attribute #[doc="blah"]) and so must be represented explicitly in the AST. Furthermore:

  • In a multi-line doc comment using /// or //! comment markers every separate line is treated as a separate attribute;
  • Multi-line doc comments using /// and //! markers are very common, much more so than those using /** */ or /*! */ markers, particularly in the standard library where doc comment lines often outnumber code lines;
  • The AST representation of attributes is quite expensive, partly because it retains the original tokens for complex reasons involving procedural macros.

As a result, doc comments had a surprisingly high cost. This PR introduced a special, cheaper representation for doc comments, giving wins on many benchmarks, some in excess of 10%.

Token streams

#65455: This PR avoided some unnecessary conversions from TokenTree type to the closely related TokenStream type, avoiding allocations and giving wins on many benchmarks of up to 5%. It included one of the most satisfying commits I’ve made to rustc.

Screenshot of the satisfying commit, as viewed on GitHub

Up to 5% wins by changing only three lines. But what intricate lines they are! There is a lot going on, and they’re probably incomprehensible to anyone not deeply familiar with the relevant types. My eyes used to just bounce off those lines. As is typical for this kind of change, the commit message is substantially longer than the commit itself.

It’s satisfying partly because it shows my knowledge of Rust has improved, particularly Into and iterators. (One thing I worked out along the way: collect is just another name for FromIterator::from_iter.) Having said that, now that a couple of months have passed, it still takes me effort to remember how those lines work.

But even more so, it’s satisfying because it wouldn’t have happened had I not repeatedly simplified the types involved over the past year. They used to look like this:

pub struct TokenStream { kind: TokenStreamKind };
pub enum TokenStreamKind {

(Note the recursion!)

After multiple simplifying PRs (#56737, #57004, #57486, #58476, #65261), they now look like this:

pub struct TokenStream(pub Lrc<Vec<TreeAndJoint>>);
pub type TreeAndJoint = (TokenTree, IsJoint);
pub enum IsJoint { Joint, NonJoint }

(Note the lack of recursion!)

I didn’t set out to simplify the type for speed, I just found it ugly and confusing. But now it’s simple enough that understanding and optimizing intricate uses is much easier. When those three lines appeared in a profile, I could finally understand them. Refactoring FTW!

(A third reason for satisfaction: I found the inefficiency using the improved DHAT I worked on earlier this year. Hooray for nice tools!)

#65641: This was a nice follow-on from the previous PR. TokenStream had custom impls of the RustcEncodable/RustcDecodable traits. These were necessary when TokenStream was more complicated, but now that it’s much simpler the derived impls suffice. As well as simplifying the code, the PR won up to 3% on some benchmarks because the custom impls created some now-unnecessary intermediate structures.

#65198: David Tolnay reported that basic concatenation of tokens, as done by many procedural macros, could be exceedingly slow, and that operating directly on strings could be 100x faster. This PR removed quadratic behaviour in two places, both of which duplicated a token vector when appending a new token. (Raise a glass to the excellent Rc::make_mut function, which I used in both cases.) The PR gave some very small (< 1%) speed-ups on the standard benchmarks but sped up a microbenchmark that exhibited the problem by over 1000x, and made it practical for procedural benchmarks to use tokens. I also added David’s microbenchmark to rustc-perf to avoid future regressions.


#65260: This PR added a special-case check to skip an expensive function call in a hot function, giving a 7% win on the unicode_normalization benchmark.

#65480: LexicalResolver::iterate_until_fixed_point() was a hot function that iterated over some constraints, gradually removing them until none remain. The constraints were stored in a SmallVec and retain was used to remove the elements. This PR (a) changed the function so it stored the constraints in an immutable Vec and then used a BitSet to record which which constraints were still live, and (b) inlined the function at its only call site. These changes won another 7% on unicode_normalization, but only after I had sped up BitSet iteration with some micro-optimizations in #65425.

#66537: This PR moved an expensive function call after a test and early return that is almost always taken, giving a 2% win on unicode_normalization.


#66408: The ObligationForest type has a vector of obligations that it processes in a fixpoint manner. Each fixpoint iteration involves running an operation on every obligation, which can cause new obligations to be appended to the vector. Previously, newly-added obligations would not be considered until the next fixpoint iteration. This PR changed the code so it would consider those new obligations in the iteration in which they were added, reducing the number of iterations required to reach a fixpoint, and winning up to 8% on a few benchmarks.

#66013: The Rust compiler is designed around a demand-driven query system. Query execution is memoized: when a query is first invoked the compiler will perform the computation and store the result in a hash table, and on subsequent invocations the compiler will return the result from the hash table. For the parallel configuration of rustc this hash table experiences a lot of contention, and so it is sharded; each query lookup has two parts, one to get the shard, and one within the shard. This PR changed things so that the key was only hashed once and the hash value reused for both parts of the lookup, winning up to 3% on the single-threaded configuration of parallel compiler. (It had no benefit for the non-parallel compiler, which is what currently ships, because it does not use sharding.)

#66012: This PR removed the trivial_dropck_outlives query from the query system because the underlying computation is so simple that it is faster to redo it whenever necessary than to look up the result in the query hash table. This won up to 1% on some benchmarks, and possibly more for the parallel compiler, due to the abovementioned contention over the query hash table.

#65463: The function expand_pattern() caused many arena allocations where a 4 KiB arena chunk was allocated that only ever held a single small and short-lived allocation. This PR avoided these, reducing the number of bytes allocated by up to 2% for various benchmarks, though the effect on runtime was minimal.

#66540 – This PR changed a Vec to a SmallVec in Candidate::match_pairs, reducing allocation rates for a tiny speed win.


The compiler has an interned string type called Symbol that is widely used to represent identifiers and other strings. The compiler also had two variants of that type, confusingly called InternedString and LocalInternedString. In a series of PRs (#65426, #65545, #65657, #65776) I removed InternedString and minimized the functionality and uses of LocalInternedString (and also renamed it as SymbolStr). This was made possible by Matthew Jasper’s work  eliminating gensyms. The changes won up to 1% on various benchmarks, but the real benefit was code simplicity, as issue #60869 described. Like the TokenStream type I mentioned above, these types had annoyed me for some time, so it was good to get them into a nicer state.


The storage of historical benchmark data changed recently, which (as far as I can tell) makes comparisons to versions earlier than early November difficult. Nonetheless, wall-time results for the past month (2019-11-07 to 2019-12-08) cover the second half of that period, and they are generally good, as the following screenshot shows.

Walltime results from showing lots of benchmark improvements

Overall, lots of improvements (green), some in the double digits, and only a small number of regressions (red). It’s a good trend, and it’s not just due to the PRs mentioned above.  For example, MIR constant propagation won 2-10% in a lot of cases. It duplicates an optimization that LLVM also does, but performs it earlier in the pipeline when there’s less code. This saves time because it makes LLVM’s job easier, and it also helps the code quality of debug builds where the LLVM constant propagation isn’t run.

Final words

All of the above improvements (and most of the ones in my previous posts) I found by profiling with Cachegrind, Callgrind, DHAT, and counts, but there are plenty of other profilers out there. I encourage anyone who has interest and even slight familiarity with profiling (or a willingness to learn) to give it a try. The Rust compiler is an easy target, profiling-wise, because it’s a batch program that is mostly single-threaded and almost deterministic. (Compare it to Firefox, for example, which is interactive, multi-threaded, and highly non-deterministic.) Give it a go!

How to speed up the Rust compiler some more in 2019

In July I wrote about my efforts to speed up the Rust compiler in 2019. I also described how the Rust compiler has gotten faster in 2019, with compile time reductions of 20-50% on most benchmarks. Now that Q3 is finished it’s a good time to see how things have changed since then.

Speed improvements in Q3 2019

The following image shows changes in time taken to compile many of the standard benchmarks used on the Rust performance tracker. It compares a revision of the the compiler from 2019-07-23 with a revision of the compiler from 2019-10-09.

Screenshot of Rust compiler benchmark improvements for Q3

These are the wall-time results. There are three different build kinds measured for each one: a debug build, an optimized build, and a check build (which detects errors but doesn’t generate code). For each build kind there is a mix of incremental and non-incremental runs done. The numbers for the individual runs aren’t shown here but you can see them if you view the results directly on the site and click around. (Note that the site has had some reliability issues lately. Apologies if you have difficulty with that link.) The “avg” column shows the average change for those runs. The “min” and “max” columns show the minimum and maximum changes among those same runs.

There are a few regressions, mostly notably for the ctfe-stress-2 benchmark, which is an artificial stress test of compile-time function evaluation and so isn’t too much of a concern. But there are many more improvements, including double-digit improvements for clap-rs, inflate, unicode_normalization, keccak, wg-grammar, serde, deep-vector, script-servo, and style-servo. There have been many interesting things going on.


For a long time, profilers like Cachegrind and Callgrind have shown that 2-6% of the instructions executed by the Rust compiler occur in calls to memcpy. This seems high! Curious about this, I modified DHAT to track calls to memcpy, much in the way it normally tracks calls to malloc.

The results showed that most of the memcpy calls come from a relatively small number of code locations. Also, all the memcpy calls involved values that exceed 128 bytes. It turns out that LLVM will use inline code for copies of values that are 128 bytes or smaller. (Inline code will generally be faster, but memcpy calls will be more compact above a certain copy size.)

I was able to eliminate some of these memcpy calls in the following PRs.

#64302: This PR shrank the ObligationCauseCode type from 56 bytes to 32 bytes by boxing two of its variants, speeding up many benchmarks by up to 2.6%. The benefit mostly came because the PredicateObligation type (which contains an ObligationCauseCode) shrank from 136 bytes to 112 bytes, which dropped it below the 128 byte memcpy threshold. I also tried reducing the size of ObligationCauseCode to 24 bytes by boxing two additional variants, but this had worse performance because more allocations were required.

#64374: The compiler’s parser has this type:

pub type PResult<'a, T> = Result<T, DiagnosticBuilder<'a>

It’s used as the return type for a lot of parsing functions. The T value is always small, but DiagnosticBuilder was 176 bytes, so PResult had a minimum size of  184 bytes. And DiagnosticBuilder is only needed when there is a parsing error, so this was egregiously inefficient. This PR boxed DiagnosticBuilder so that PResult has a minimum size of 16 bytes, speeding up a number of benchmarks by up to 2.6%.

#64394: This PR reduced the size of the SubregionOrigin type from 120 bytes to 32 bytes by boxing its largest variant, which sped up many benchmarks slightly (by less than 1%). If you are wondering why this type caused memcpy calls despite being less than 128 bytes, it’s because it is used in a BTreeMap and the tree nodes exceeded 128 bytes.


One of the biggest causes of memcpy calls is within a data structure called ObligationForest, which represents a bunch of constraints (relating to type checking and trait resolution, I think) that take the form of a collection of N-ary trees. ObligationForest uses a single vector to store the tree nodes, and links between nodes are represented as numeric indices into that vector.

Nodes are regularly removed from this vector by a function called ObligationForest::compress. This operation is challenging to implement efficiently because the vector can contain thousands of nodes and nodes are removed only a few at a time, and order must be preserved, so there is a lot of node shuffling that occurs. (The numeric indices of all remaining nodes must be updated appropriately afterwards, which further constrains things.) The shuffling requires lots of  swap calls, and each one of those does three memcpy calls (let tmp = a; a = b; b = tmp, more or less). And each node is 176 bytes! While trying to get rid of these memcpy calls, I got very deep into ObligationForest and made the following PRs that aren’t related to the copying.

#64420: This PR inlined a hot function, speeding up a few benchmarks by up to 2.8%. The function in question is indirectly recursive, and LLVM will normally refuse to inline such functions. But I was able to work around this by using a trick: creating two variants of the function, one marked with #[inline(always)] (for the hot call sites) and one marked with #[inline(never)] (for the cold call sites).

#64500: This PR did a bunch of code clean-ups, some of which helped performance to the tune of up to 1.7%. The improvements came from factoring out some repeated expressions, and using iterators and retain instead of while loops in some places.

#64545: This PR did various things, improving performance by up to 13.8%. The performance wins came from: combining a split parent/descendants representation to avoid frequent chaining of iterators (chained iterators are inherently slower than non-chained iterators); adding a variant of the shallow_resolve function specialized for the calling pattern at a hot call site; and using explicit iteration instead of Iterator::all. (More about that last one below.)

#64627: This PR also did various things, improving performance by up to 18.4%. The biggest improvements came from: changing some code that dealt with a vector to special-case the 0-element and 1-element cases, which dominated; and inlining an extremely hot function (using a variant of the abovementioned #[inline(always)]#[inline(never)] trick).

These PRs account for most of the improvements to the following benchmarks: inflatekeccak, cranelift-codegen, and serde. Parts of the ObligationForest code was so hot for these benchmarks (especially inflate and keccak) that it was worth micro-optimizing them to the nth degree. When I find hot code like this, there are always two approaches: (a) try to speed it up, or (b) try to avoid calling it. In this case I did (a), but I do wonder if the users of ObligationForest could be more efficient in how they use it.

The above PRs are a nice success story, but I should also mention that I tried a ton of other micro-optimizations that didn’t work.

  • I tried drain_filter in compress. It was slower.
  • I tried several invasive changes to the data representation, all of which ended up slowing things down.
  • I tried using swap_and_remove instead of swap in compress, This gave speed-ups, but changed the order that predicates are processed in, which changed the order and/or contents of error messages produced in lots of tests. I was unable to tell if these error message changes were legitimate — some were simple, but some were not — so I abandoned all approaches that altered predicate order.
  • I tried boxing ObligationForest nodes, to reduce the number of bytes copied in compress. It reduced the amount of copying, but was a net slowdown because it increased the number of allocations performed.
  • I tried inlining some other functions, for no benefit.
  • I used unsafe code to remove the swap calls, but the speed-up was only 1% in the best case and I wasn’t confident that my code was panic-safe, so I abandoned that effort.
  • There were even a number of seemingly innocuous code clean-ups that I had to abandon because they hurt performance measurably. I think this is because the code is so hot in some benchmarks that even tiny changes can affect code generation adversely. (I generally use instruction counts rather than wall time to make these evaluations, because instruction counts have very low variance.)

Amusingly enough, the memcpy calls in compress were what started all this, and despite the big wins, I didn’t manage to get rid of them!

Inlining and code bloat

I mentioned above that in #64545 I got an improvement by replacing a hot call to Iterator::all with explicit iteration. The reason I tried this was that I looked at the implementation of Iterator::all and saw that it was surprisingly complicated: it wrapped the given predicate in a closure that
returned a LoopState, passed that closure to try_for_each which
wrapped the first closure in a second closure, and passed that second closure
to try_fold which did the actual iteration using the second
closure. Phew!

Just for kicks I tried replacing this complex implementation with the obvious, simple implementation, and got a small speed-up on keccak, which I was using for most of my performance testing. So I did the same thing for three similar Iterator methods (any, find and find_map), submitted #64572, and did a CI perf run. The results were surprising and extraordinary: 1-5% reductions for many benchmarks, but double-digits for some, and 20-50% reductions for some clap-rs runs. Uh, what? Investigation showed that the reduction came from LLVM doing less work during code generation. These functions are all marked with #[inline] and so the simpler versions result in less code for LLVM to process. Sure enough, the big wins all came in debug and opt builds, with little effect on check builds.

This surprised me greatly. There’s been a long-running theory that the LLVM IR produced by the Rust compiler’s front end is low quality, that LLVM takes a long time to optimize it, and more front-end optimization could speed up LLVM’s code generation. #64572 demonstrates a related, but much simpler prospect: we can speed up compilation by making commonly inlined library functions smaller. In hindsight, it makes sense that this would have an effect, but the size of the effect is nonetheless astounding to me.

But there’s a trade-off. Sometimes a simpler, smaller function is slower. For the iterator methods there are some cases where that is true, so the library experts were unwilling to land #64572 as is. Fortunately, it was possible to obtain much of the potential compile time improvements without compromising runtime.

  • In #64600, scottmcm removed an aggressive specialization of try_fold  for slices that had an unrolled loop that called the given closure four times. This got about 60% of the improvements of #64572.
  • In #64885, andjo403 simplified the four Iterator methods to call try_fold directly, removing one closure layer. This got about another 15% of the improvements of #64572.

I had a related idea, which was to use simpler versions for debug builds and complex versions for opt builds. I tried three different ways of doing this.

  • Use if cfg!(debug_assertions) within the method bodies.
  • Have two versions of each method, one marked with #[cfg(debug_assertions)], the other marked with #[cfg(not(debug_assertions))].
  • Mark each method with #[cfg_attr(debug_assertions, inline)] so that the methods are inlined only in optimized builds.

None of these worked; they either had little effect or made things worse. I’m hazy on the details of how library functions get incorporated; maybe there’s another way to make this idea work.

In a similar vein, Alex Crichton opened #64846, which changes hashbrown (Rust’s hash table implementation) so it is less aggressive about inlining. This got some sizeable improvements on some benchmarks (up to 18% on opt builds of cargo) but also caused small regressions for a lot of other benchmarks. In this case, the balance between “slower hash tables” and “less code to compile” is delicate, and a final decision is yet to be made.

Overall, this is an exciting new area of investigation for improving Rust compile times. We definitely want new tooling to help identify which library functions are causing the most code bloat. Hopefully these functions can be tweaked so that compile times improve without hurting runtime performance much.


As well as all the above stuff, which all arose due to my investigations into memcpy calls, I had a few miscellaneous improvements that arose from normal profiling.

#65089: In #64673, simulacrum got up to 30% wins on the unicode_normalization benchmark by special-casing a type size computation that is extremely hot. (That benchmark is dominated by large match expressions containing many integral patterns.) Inspired by this, in this PR I made a few changes that moved the special case to a slightly earlier point that avoided even more unnecessary operations, for wins of up to 11% on that same benchmark.

#64949: The following pattern occurs in a few places in the compiler.

let v = self.iter().map(|p| p.fold_with(folder)).collect::<SmallVec<[_; 8]>>()

I.e. we map some values into a SmallVec. A few of these places are very hot, and in most cases the number of elements produced is 0, 1, or 2. This PR changed those hot locations to handle one or more of the 0/1/2 cases directly without using iteration and SmallVec::collect, speeding up numerous benchmarks by up to 7.8%.

#64801: This PR avoided a chained iterator in a hot location, speeding up the wg-grammar benchmark by up to 1.9%.

Finally, in #64112 I tried making pipelinined compilation more aggressive by moving crate metadata writing before type checking and borrow checking. Unfortunately, it wasn’t much of a win, and it would slightly delay error message emission when compiling code with errors, so I abandoned the effort.

Visualizing Rust compilation

Speeding up the Rust compiler isn’t the only way to make a Rust project build faster. Changing the crate structure of a project can also make a big difference. The good news here is that Eric Huss has implemented an amazing tool for visualizing Rust compilation, which can be used to identify inefficient crate structures in Rust projects.

The tool extremely easy to use. First, update to the latest Nightly:

rustup update nightly

Then just add -Ztimings to your build command, e.g.:

cargo +nightly build -Ztimings

At the end of the build it will print the name of an HTML file containing the data. Here’s part of the visualization for the Rust compiler itself:

Screenshot of -Ztimings output when compiling the Rust compiler

Full data is available here. (I recommend moving the “Scale” slider to 7 or 8 so that horizontal scrolling isn’t necessary.)

Two things leap out from this visualization.

  • The rustc crate takes about twice as long as any other crate to compile. It is the “long pole” of the build, and its presence serializes the build significantly. Breaking it up could improve compilation time quite a bit. I filed #65031 about this.
  • Pipelined compilation (released in Rust 1.38) is a huge win for the compiler itself. Pipelining allows a dependent crate to start building as soon as metadata is produced. In the visualization, this corresponds to point where the bar for a graph changes colour from light blue to purple. Imagine if the rustc crate had to finish before all the crates below it could even start! It used to take about 45 minutes for an optimized stage 2 build on my fast Linux desktop machine; thanks to pipelining it now takes about 26 minutes.

I also filed #65088 to add -Ztimings support to the Rust compiler’s own build system. (Enabling the visualization isn’t as simple for the compiler as it is for most Rust projects. The compiler’s build system is complicated by the fact that it’s a bootstrapping compiler that has to be built multiple times.)

We have already heard from multiple people that they used it fix inefficiencies in their crate structure, speeding up their builds significantly. Anyone who works on a sizeable Rust project should try out this tool.

The Rust compiler is still getting faster

A key theme of the Rust 2019 roadmap is maturity. This covers a variety of topics, but a crucial one is compile times. For example, the roadmap itself has the following as the first main theme for the compiler team.

Improving “core strength” by lowering raw compilation times and also generating better code (which in turn can help with compilation times)

The roadmap explainer post has a “polish” section that has the following as the first example.

Compile times and IDE support

I previously wrote about one period of improvement in Rust compiler speed. How are things going in 2019?

Speed improvements in 2019

The following image shows changes in time taken to compile the standard benchmarks used on the Rust performance tracker. It compares the compiler from 2019-01-01 with the compiler from 2019-07-24 (the most recent data at the time of writing).

Table showing Rust compiler speedups between 2019-01-01 and 2019-07-24

These are the wall-time results for 29 benchmarks. There are three different build kinds measured for each one: a debug build, an optimized build, and a check build (which detects errors but doesn’t generate code). For each build kind there is a mix of incremental and non-incremental runs done. The numbers for the individual runs aren’t shown here but you can see them if you view the results directly on the site and click around. The “avg” column shows the average change for those runs. The “min” and “max” columns show the minimum and maximum changes among those same runs.

The table has 261 numbers. The thing to take away is that 258 of them are negative, representing a decrease in compile time. Most of the “avg” values are in the range -20% to -40%. The “min” values (representing the best time reduction for each build kind) range from -12.4% to -51.3%. Even the “max” values (representing the worst time reduction for each build kind) are mostly better than -10%. These are pleasing results.

speed improvements since late 2017

What happens if we look further back? The image below compares the compiler from 2017-11-12 (the earliest date for which I could get data from the site) against the compiler from 2019-07-24, a period of just over 20 months.

Table showing Rust compiler speedups between 2017-11-12 and 2019-07-24

These are the wall-time results for only 18 benchmarks, because the benchmark suite was smaller in late 2017. Check builds were also not measured then. You can view the results directly on the site.

My initial thought from looking at the “avg” results was “the compiler is twice as fast” but closer inspection shows that’s not quite true; the average “avg” result is 42%. (I know that averaging averages is statistically dubious, I did it just to get a rough feel.) Overall, the results are significantly better than those for 2019: the “avg” values range from -19.9% to -61.3%, and the “min” values are mostly better than -60%.

(And don’t forget that time reduction percentages can be misleading when they get large. A 50% time reduction means the compiler is twice as fast; a 75% time reduction means the compiler is four times as fast; a 90% time reduction means the compiler is ten times as fast.)

All this is good news. The Rust compiler has long had a reputation for being slow. I still wouldn’t describe it as fast, but it is clearly a lot faster than it used to be. Many thanks to all those who made this happen, and I would be happy to hear from anyone who wants to help continue the trend!

Thanks to theZcuber for a Reddit post that was the starting point for this article.

How to speed up the Rust compiler in 2019

I have written previously about my efforts to speed up the Rust compiler in 2016 (part 1, part 2) and 2018 (part 1, part 2, NLL edition). It’s time for an update on the first half of 2019.

Faster globals

libsyntax has three tables in a global data structure, called Globals, storing information about spans (code locations), symbols, and hygiene data (which relates to macro expansion). Accessing these tables is moderately expensive, so I found various ways to improve things.

#59693: Every element in the AST has a span, which describes its position in the original source code. Each span consists of an offset, a length, and a third value that is related to macro expansion. The three fields are 12 bytes in total, which is a lot to attach to every AST element, and much of the time the three fields can fit in much less space. So the compiler used a 4 byte compressed form with a fallback to a hash table stored in Globals for spans that didn’t fit in 4 bytes. This PR changed that to 8 bytes. This increased memory usage and traffic slightly, but reduced the fallback rate from roughly 10-20% to less than 1%, speeding up many workloads, the best by an amazing 14%.

#61253: There are numerous operations that accessed the hygiene data, and often these were called in pairs or trios, thus repeating the hygiene data lookup. This PR introduced compound operations that avoid the repeated lookups. This won 10% on packed-simd, up to 3% on numerous other workloads.

#61484: Similar to #61253, this won up to 2% on many benchmarks.

#60630: The compiler has an interned string type, called symbol. It used this inconsistently. As a result, lots of comparisons were made between symbols and ordinary strings, which required a lookup of the string in the symbols table and then a char-by-char comparison. A symbol-to-symbol comparison is much cheaper, requiring just an integer comparison. This PR removed the symbol-to-string comparison operations, forcing more widespread use of the symbol type. (Fortunately, most of the introduced symbol uses involved statically-known, pre-interned strings, so there weren’t additional interning costs.) This won up to 1% on various benchmarks, and made the use of symbols more consistent.

#60815: Similar to #60630, this also won up to 1% on various benchmarks.

#60467, #60910, #61035, #60973: These PRs avoiding some more unnecessary symbol interning, for sub-1% wins.


The following improvements didn’t have any common theme.

#57719: This PR inlined a very hot function, for a 4% win on one workload.

#58210: This PR changed a hot assertion to run only in debug builds, for a 20%(!) win on one workload.

#58207: I mentioned string interning earlier. The Rust compiler also uses interning for a variety of other types where duplicate values are common, including a type called LazyConst. However, the intern_lazy_const function was buggy and didn’t actually do any interning — it just allocated a new LazyConst without first checking if it had been seen before! This PR fixed that problem, reducing peak memory usage and page faults by 59% on one benchmark.

#59507: The pretty-printer was calling write! for every space of
indentation, and on some workloads the indentation level can exceed 100. This PR reduced it to a single write! call in the vast majority of cases, for up to a
7% win on a few benchmarks.

#59626: This PR changed the preallocated size of one data structure to better match what was needed in practice, reducing peak memory usage by 20 MiB on some workloads.

#61612: This PR optimized a hot path within the parser, whereby constant tokens were uselessly subjected to repeated “is it a keyword?” tests, for up to a 7% win on programs with large constants.

Profiling improvements

The following changes involved improvements to our profiling tools.

#59899: I modified the output of -Zprint-type-sizes so that enum variants are listed from largest to smallest. This makes it much easier to see outsized variants, especially for enums with many variants.

#62110: I improved the output of the -Ztime-passes flag by removing some uninteresting entries that bloated the output and adding a measurement for the total compilation time.

Also, I improved the profiling support within the rustc-perf benchmark suite. First, I added support for profiling with OProfile. I admit I haven’t used it enough yet to gain any wins. It seg faults about half the time when I run it, which isn’t encouraging.

Second, I added support for profiling with the new version of DHAT. This blog post is about 2019, but it’s worth mentioning some improvements I made with the new DHAT’s help in Q4 of 2018, since I didn’t write a blog post about that period: #55167, #55346, #55383, #55384, #55501, #55525, #55556, #55574, #55604, #55777, #55558, #55745, #55778, #55905, #55906, #56268, #56090, #56269, #56336, #56369, #56737, and (ena crate) #14.

Finally, I wrote up brief descriptions for all the benchmarks in rustc-perf.

pipelined compilation

The improvements above (and all the improvements I’ve done before that) can be described as micro-optimizations, where I used profiling data to optimize a small piece of code.

But it’s also worth thinking about larger, systemic improvements to Rust compiler speed. In this vein, I worked in Q2 with Alex Crichton on pipelined compilation, a feature that increases the amount of parallelism available when building a multi-crate Rust project by overlapping the compilation of dependent crates. In diagram form, a compilation without pipelining looks like this:

         metadata            metadata
0s        5s       10s       15s       20s                30s

With pipelined compilation, it looks like this:

0s        5s       10s       15s                25s

I did the work on the Rust compiler side, and Alex did the work on the Cargo side.

For more details on how it works, how to use it, and lots of measurements, see this thread. The effects are highly dependent on a project’s crate structure and the compiling machine’s configuration. We have seen speed-ups as high as 1.84x, while some projects see no speed-up at all. At worst, it should make things only negligibly slower, because it’s not causing any additional work, just changing the order in which certain things happen.

Pipelined compilation is currently a Nightly-only feature. There is a tracking issue for stabilizing the feature here.

Future work

I have a list of things I want to investigate in Q3.

  • For pipelined compilation, I want to try pushing metadata creation even earlier in the compiler front-end, which may increase the speed-ups some more.
  • The compiler uses memcpy a lot; not directly, but the generated code uses it for value moves and possibly other reasons. In “check” builds that don’t do any code generation, typically 2-8% of all instructions executed occur within memcpy. I want to understand why this is and see if it can be improved. One possibility is moves of excessively large types within the compiler; another possibility is poor code generation. The former would be easier to fix. The latter would be harder to fix, but would benefit many Rust programs.
  • Incremental compilation sometimes isn’t very effective. On some workloads, if you make a tiny change and recompile incrementally it takes about the same time as a full non-incremental compilation. Perhaps a small change to the incremental implementation could result in some big wins.
  • I want to see if there are other hot paths within the parser that could be improved, like in #61612.

I also have various pieces of Firefox work that I need to do in Q3, so I might not get to all of these. If you are interested in working on these ideas, or anything else relating to Rust compiler speed, please get in touch.

A better DHAT

DHAT is a heap profiler that comes with Valgrind. (The name is short for “Dynamic Heap Analysis Tool”.) It tells your where all your heap allocations come from, and can help you find the following: places that cause excessive numbers of allocations; leaks; unused and under-used allocations; short-lived allocations; and allocations with inefficient data layouts. This old blog post goes into some detail.

In the new Valgrind 3.15 release I have given DHAT a thorough overhaul.

The old DHAT was very useful and I have used it a lot while profiling the Rust compiler. But it had some rather annoying limitations, which the new DHAT overcomes.

First, the old DHAT dumped its data as text at program termination. The new DHAT collects its data in a file which is read by a graphical viewer that runs in a web browser. This gives several advantages.

  • The separation of data collection and data presentation means you can run a program once under DHAT and then sort and filter the data in various ways, instead of having to choose a particular sort order in advance. Also, full data is in the output file, and the graphical viewer chooses what to omit.
  • The data can be sorted in more ways than previously. Some of these sorts involve useful filters such as “short-lived” and “zero reads or zero writes”.
  • The graphical viewer allows parts of the data to be hidden and unhidden as necessary.

Second, the old DHAT divided its output into records, where each record consisted of all the heap allocations that have the same allocation stack trace. The choice of stack trace depth could greatly affect the output.

In contrast, the new DHAT is based around trees of stack traces that avoid the need to choose stack trace depth. This avoids both the problem of not enough depth (when records that should be distinct are combined, and may not contain enough information to be actionable) and the problem of too much depth (when records that should be combined are separated, making them seem less important than they really are).

Third, the new DHAT also collects and/or shows data that the old DHAT did not.

  • Byte and block measurements are shown with a percentage relative to the global measurements, which helps gauge relative significance of different parts of the profile.
  • Byte and block measurements are also shown with an allocation rate (bytes and blocks per million instructions), which enables comparisons across multiple profiles, even if those profiles represent different workloads.
  • Both global and per-node measurements are taken at the global heap peak, which gives Massif-like insight into the point of peak memory use.
  • The final/lifetimes stats are a bit more useful than the old deaths stats. (E.g. the old deaths stats didn’t take into account lifetimes of unfreed blocks.)

Finally, the new DHAT has a better handling of realloc. The sequence p = malloc(100); realloc(p, 200); now increases the total block count by 2 and the total byte count by 300. In the old DHAT it increased them by 1 and 200. The new handling is a more operational view that better reflects the effect of allocations on performance. It makes a significant difference in the results, giving paths involving reallocation (e.g. repeated pushing to a growing vector) more prominence.

Overall these changes make DHAT more powerful and easier to use.

The following screenshot gives an idea of what the new graphical viewer looks like.

Sample output from DHAT's viewer

The new DHAT can be run using the --tool=dhat flag, in contrast with the old DHAT, which was an “experimental” tool and so used the --tool=exp-dhat flag. For more details see the documentation.

How to get the size of Rust types with -Zprint-type-sizes

When optimizing Rust code it’s sometimes useful to know how big a type is, i.e. how many bytes it takes up in memory. std::mem::size_of can tell you, but often you want to know the exact layout as well. For example, an enum might be surprisingly big, in which case you probably will want to know if, for example, there is one variant that is much bigger than the others.

The -Zprint-type-sizes option does exactly this. Just pass it to a nightly version of rustc — it isn’t enabled on release versions, unfortunately — and it’ll print out details of the size, layout, and alignment of all types in use. For example, for this type:

enum E {
    C(u64, u8, u64, u8),

it prints the following, plus info about a few built-in types:

print-type-size type: `E`: 32 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `A`: 0 bytes
print-type-size     variant `B`: 7 bytes
print-type-size         padding: 3 bytes
print-type-size         field `.0`: 4 bytes, alignment: 4 bytes
print-type-size     variant `C`: 23 bytes
print-type-size         field `.1`: 1 bytes
print-type-size         field `.3`: 1 bytes
print-type-size         padding: 5 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size         field `.2`: 8 bytes
print-type-size     variant `D`: 31 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 24 bytes, alignment: 8 bytes

It shows:

  • the size and alignment of the type;
  • for enums, the size of the discriminant;
  • for enums, the size of each variant;
  • the size, alignment, and ordering of all fields (note that the compiler has reordered variant C‘s fields to minimize the size of E);
  • the size and location of all padding.

Every detail you could possibly want is there. Brilliant!

For rustc developers, there’s an extra-special trick for getting the size of a type within rustc itself. Put code like this into a file

extern crate syntax;
use syntax::ast::Expr;
fn main() {
    let _x = std::mem::size_of::<Expr>();

and then compile it like this:

RUSTC_BOOTSTRAP=1 rustc -Zprint-type-sizes

I won’t pretend to understand how it works, but the use of rustc_private and RUSTC_BOOTSTRAP somehow let you see inside rustc while using it, rather than while compiling it. I have used this trick for PRs such as this one.

How to speed up the Rust compiler in 2018: NLL edition

Niko Matsakis recently blogged about the Rust compiler’s new borrow checker, which implements non-lexical lifetimes (NLL). The new borrow checker is a really nice improvement to Rust, because it accepts many sound programs that the old borrow checker rejected.

In the blog post, Niko wrote briefly about the performance of the new borrow checker.

Finally, those of you who read the previous posts may remember that the performance of the NLL checker was a big stumbling block. I’m happy to report that the performance issues were largely addressed: there remains some slight overhead to using NLL, but it is largely not noticeable in practice, and I expect we’ll continue to improve it over time.

This paragraph is true, but glosses over a lot of details! This post will be about my contributions to this performance work.


Before I describe individual improvements, it’s worth mentioning that the new borrow checker uses bitsets (1D) and bit matrices (2D) heavily. A number of my wins involved these data structures.

#51869: This PR changed some code so that it overwrote an existing dense bitset rather than replacing it with a newly created one of the same size, reducing instruction counts for most benchmarks, the best by 1.5%.

#51870: This PR reused a structure containing two bitsets rather than recreating it afresh for every statement in a basic block, reducing instruction counts for numerous benchmarks, the best by 1%.

#52250: The compiler has a SparseBitMatrix type. Rows were added on demand, and each row was implemented as a sparse bitset using a BTreeMap. In practice, many of the rows were relatively dense, with 10–90% of the bits being set. This PR changed SparseBitMatrix to use a dense representation for rows, reducing instruction counts on one benchmark by 33% and by 1% on a few others. The PR had a mixed effect on memory usage, increasing the peak on some benchmarks and reducing it on others. (Never fear! #54318 below ended up fixing the regressions.)

#52342: This PR avoided a bunch of allocations in Canonicalizer methods, reducing instruction counts on numerous benchmarks, the best by 2%.

#53383: Further profiling showed that some dense bitsets were large, but had very few bits set within them, so the dense representation was wasteful. This PR implemented a new hybrid bitset type that uses a sparse representation for bitsets with up to 8 bits set and switches to a dense representation beyond that, and used it to replace dense bitsets in several places, reducing instruction counts on the five slowest benchmarks by 55%, 21%, 16%, 10% and 9%, and reducing peak memory usage of three benchmarks by 53%, 33%, and 9%.

#53513: This PR force-inlined a function at one hot callsite, reducing instruction counts on two benchmarks by 1%.

#53551: This PR avoided some clone calls, reducing instruction counts for one benchmark by 0.5%.

#53733: This PR added special handling for a very common and simple case in unroll_place, reducing the instruction counts on one benchmark by 25%.

#53942: A function called precompute_borrows_out_of_scope does a traversal of one or more basic blocks. In order to detect whether a basic block had been previously visited, it recorded the ID of every visited statement in a hash table. Some basic blocks can have many statements, resulting in many hash table lookups. This PR changed the code to record the ID of visited basic blocks in the hash table instead of visited statements — trickier than it sounds because the analysis can start in the middle of a basic block, in which case the first half might need to be eventually visited — reducing instruction counts on one benchmark by 60%.

#54211: Liveness analysis created an array that could get very large. Each element in the array was a struct containing two u32s and a bool. Each of those elements took up 12 bytes, but only 9 of the bytes held data. This PR split the array into three separate arrays, one per field, making the code slightly less readable but reducing peak memory usage on one benchmark by 20%. (Fun fact: those u32s used to be usizes, but I shrunk them back in May.)

#54213: This PR tweaked some code so that the lifetimes of two large data structures didn’t overlap, reducing peak memory usage by 27% on one benchmark and 8% on another. (Note: those data structures are dominated by, you guessed it, bitsets!)

#54318: This PR changed SparseBitMatrix so that each instantiated row used the hybrid bitset representation from #53383 instead of the dense representation, reducing peak memory usage by 14–45% on four benchmarks, and 96% (from 29.1GB to 1.2GB) on one external crate! This PR also fixed the peak memory regression that #52250 introduced for a few benchmarks.

#54420: I subsequently realized that #54211 didn’t go far enough. Some debugging println! statements showed that both of the u32s in each liveness entry almost always held a special value that meant “invalid”. When data is repetitive, compression is possible: I could use a packed representation where the common (INVALID, INVALID, true) and (INVALID, INVALID, false) cases were represented by special u32 values, and all other triples were represented by a u32 index into an auxiliary table. This PR changed the representation as described, reducing instruction counts on numerous benchmarks, the best by 16%, and reducing peak memory usage on numerous benchmarks, the best by 38%. (I also tried a more compact representation where each element was a single byte; it reduced peak memory usage some more by the instruction count reduction was less, so I went with the earlier approach.)

Progress and current status

You probably noticed that some of the improvements in the previous section were large, and I wasn’t the only one working on NLL performance; Niko Matsakis and David Wood also contributed some big wins. This is because the new borrow checker’s performance was originally, to be honest, terrible. This is understandable; the focus had been on functionality and correctness, which is fair enough for a large and complex new component. Nonetheless, in June I was very nervous about its performance.

To be more specific, “check” builds (which don’t generate code) ran as much as 50x slower with the new borrow checker on some benchmarks. And multiple benchmarks were disabled on CI because they were simply too slow or used too much memory.

Issue #52028 tells a representative story. It was originally filed because the html5ever benchmark was triggering out-of-memory failures on CI. Measurements with Massif and DHAT showed that its peak heap memory usage was over 14 GB, largely caused by a single 12 GB allocation! In comparison, the peak with the old borrow checker was roughly 200–300 MB. If you read through that issue, you can see that over a period of 2.5 months we reduced the memory usage from 14 GB, to 10 GB, to 2 GB, to 1.2 GB, to 600 MB, to 501 MB, and finally to 266 MB.

And things are pretty good now. The instruction counts on “check” builds for all benchmarks are at most 18% higher with the new borrow checker than the old borrow checker, and are typically around 5%. (And note that “check” builds are the worst-case scenario; non-“check” builds will see a smaller relative slowdown because of the extra time needed for code generation, which is unaffected by the borrow checker.) Memory usage is similar: all benchmarks except one have peak memory usage that is at most 20% higher, with the typical value around 3%. (The one remaining exceptional benchmark uses 2.7x memory.) The worse numbers generally occur on programs containing very large constants.

I’m not entirely happy even with this level of performance regression, but for now I have run out of ideas for improving it further. The new borrow checker is a lot more sophisticated and tracks a lot more data, so strict performance parity is a tough ask. Nonetheless, given how bad performance was a few months ago, I’m happy that we’ve got it down to a level where most people probably won’t notice any difference. Given that the new borrow checker makes Rust a significantly nicer and easier language to use, I hope it’s an acceptable trade-off.

Slimmer and simpler static atoms

String interning is:

a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. The single copy of each string is called its intern.

In Firefox’s code we use the term atom rather than intern, and atom table rather than string intern pool. I don’t know why; those names have been used for a long time.

Furthermore, Firefox distinguishes between static atoms, which are those that are chosen at compile time and can be directly referred to via an identifier, and dynamic atoms, which are added on-demand at runtime. This post is about the former.

In 2016, Firefox’s implementation of static atoms was complex and inefficient. I filed a bug about this that included the following ASCII diagram showing all the data structures involved for a single atom for the string “foobar”.

static nsFakeStringBuffer<N=7> foobar_buffer (.data, 8+2N bytes)
/-----------------------------------------\ <------+
| int32_t mRefCnt = 1 // never reaches 0  |        | 
| uint32_t mSize = 14 // 7 x 16-bit chars |        | 
| u"foobar"           // the actual chars | <----+ | 
\-----------------------------------------/      | | 
                                                 | | 
PermanentAtomImpl (heap, 32 bytes)               | | 
/----------------------------------------------\ | | <-+
| void* vtablePtr    // implicit               | | |   | 
| uint32_t mLength = 6                         | | |   | 
| uint32_t mHash = ...                         | | |   | 
| char16_t* mString = @------------------------|-+ |   | 
| uintptr_t mRefCnt  // from NS_DECL_ISUPPORTS |   |   | 
\----------------------------------------------/   |   | 
                                                   |   | 
static nsIAtom* foobar (.bss, 8 bytes)             |   | 
/---\ <-----------------------------------+        |   | 
| @-|-------------------------------------|------------+
\---/                                     |        |   | 
                                          |        |   | 
static nsStaticAtom (, 16 bytes) |        |   | 
(this element is part of a larger array)  |        |   | 
/------------------------------------\    |        |   | 
| nsStringBuffer* mStringBuffer = O--|----|--------+   | 
| nsIAtom** mAtom = @----------------|----+            | 
\------------------------------------/                 | 
AtomTableEntry (heap, ~2 x 16 bytes[*])                | 
(this entry is part of gAtomTable)                     | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| AtomImpl* mAtom = @-----|----------------------------+
\-------------------------/                            | 
StaticAtomEntry (heap, ~2 x 16 bytes[*])               | 
(this entry is part of gStaticAtomTable)               | 
/-------------------------\                            | 
| uint32_t mKeyHash = ... |                            | 
| nsIAtom* mAtom = @------|----------------------------+

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

There is a lot going on in that diagram, but putting that all together gave the following overhead per atom.

  • Static shared: 0 bytes
  • Static unshared: 8 + 2(length+1) + 8 + 16
  • Dynamic: 32 + ~32 + ~32 bytes
  • Total bytes: (2(length+1) + 64 + ~64) * num_processes

(Although these atoms are “static” in the sense of being known at compile-time, a lot of the associated data was allocated dynamically.)

At the time there were about 2,700 static atoms, and avg_length was about 11, so the overhead was roughly:

  • 0 bytes fixed, and
  •  410,400 bytes per process. (Or more, depending on how the relocations required for the static pointers were represented, which depended on the platform.)

Today, things have improved greatly and now look like the following.

const char16_t[7] (.rodata, 2(N+1) bytes)
(this is detail::gGkAtoms.foobar_string)
/-----------------------------------------\ <--+
| u"foobar"           // the actual chars |    | 
\-----------------------------------------/    | 
const nsStaticAtom (.rodata, 12 bytes)         | 
(this is within detail::gGkAtoms.mAtoms[])     | 
/-------------------------------------\ <---+  | 
| uint32_t mLength:30 = 6             |     |  | 
| uint32_t mKind:2 = AtomKind::Static |     |  | 
| uint32_t mHash = ...                |     |  | 
| uint32_t mStringOffset = @----------|-----|--+
\-------------------------------------/     | 
constexpr nsStaticAtom* (0 bytes) @---------+
(this is nsGkAtoms::foobar)                 | 
AtomTableEntry (heap, ~2 x 16 bytes[*])     | 
(this entry is part of gAtomTable)          | 
/-------------------------\                 | 
| uint32_t mKeyHash = ... |                 | 
| nsAtom* mAtom = @-------|-----------------+

[*] Each hash table is half full on average, so each entry takes up
approximately twice its actual size.

That gives the following overhead per atom.

  • Static shared: 12 + 2(length+1) bytes
  • Static unshared: 0 bytes
  • Dynamic: ~32 bytes
  • Total: 12 + 2(length+1) + ~32 * num_processes

We now have about 2,300 static atoms and avg_length is still around 11, so the overhead is roughly:

  • 82,800 bytes fixed, and
  • 73,600 bytes per process.

I won’t explain all the parts of the two diagrams, but it can be seen that we’ve gone from six pieces per static atom to four; the size and complexity of the remaining pieces are greatly reduced; there are no static pointers (only constexpr pointers and integral offsets) and thus no relocations; and there is a lot more interprocess sharing thanks to more use of const. Also, there is no need for a separate static atom table any more, because the main atom table is thread-safe and the HTML5 parser (the primary user of the separate static atom table) now has a small but highly effective static atoms cache.

Things that aren’t visible from the diagrams: atoms are no longer exposed to JavaScript code via XPIDL, there are no longer any virtual methods involved, and all atoms are defined in a single place (with no duplicates) instead of 7 or 8 different places. Notably, the last few steps were blocked for some time by a bug in MSVC involving the handling of constexpr.

The bug dependency tree gives a good indication of how many separate steps were involved in this work. If there is any lesson to be had here, it’s that small improvements add up over time.

Ad Hoc Profiling

I have used a variety of profiling tools over the years, including several I wrote myself.

But there is one profiling tool I have used more than any other. It is capable of providing invaluable, domain-specific profiling data of a kind not obtainable by any general-purpose profiler.

It’s a simple text processor implemented in a few dozen lines of code. I use it in combination with logging print statements in the programs I am profiling. No joke.


The tool is called counts, and it tallies line frequencies within text files, like an improved version of the Unix command chain sort | uniq -c. For example, given the following input.

a 1
b 2
b 2
c 3
c 3
c 3
d 4
d 4
d 4
d 4

counts produces the following output.

10 counts:
(  1)        4 (40.0%, 40.0%): d 4
(  2)        3 (30.0%, 70.0%): c 3
(  3)        2 (20.0%, 90.0%): b 2
(  4)        1 (10.0%,100.0%): a 1

It gives a total line count, and shows all the unique lines, ordered by frequency, with individual and cumulative percentages.

Alternatively, when invoked with the -w flag, it assigns each line a weight, determined by the last integer that appears on the line (or 1 if there is no such integer).  On the same input, counts -w produces the following output.

30 counts:
(  1)       16 (53.3%, 53.3%): d 4
(  2)        9 (30.0%, 83.3%): c 3
(  3)        4 (13.3%, 96.7%): b 2
(  4)        1 ( 3.3%,100.0%): a 1

The total and per-line counts are now weighted; the output incorporates both frequency and a measure of magnitude.

That’s it. That’s all counts does. I originally implemented it in 48 lines of Perl, then later rewrote it as 48 lines of Python, and then later again rewrote it as 71 lines of Rust.

In terms of benefit-to-effort ratio, it is by far the best code I have ever written.

counts in action

As an example, I added print statements to Firefox’s heap allocator so it prints a line for every allocation that shows its category, requested size, and actual size. A short run of Firefox with this instrumentation produced a 77 MB file containing 5.27 million lines. counts produced the following output for this file.

5270459 counts:
( 1) 576937 (10.9%, 10.9%): small 32 (32)
( 2) 546618 (10.4%, 21.3%): small 24 (32)
( 3) 492358 ( 9.3%, 30.7%): small 64 (64)
( 4) 321517 ( 6.1%, 36.8%): small 16 (16)
( 5) 288327 ( 5.5%, 42.2%): small 128 (128)
( 6) 251023 ( 4.8%, 47.0%): small 512 (512)
( 7) 191818 ( 3.6%, 50.6%): small 48 (48)
( 8) 164846 ( 3.1%, 53.8%): small 256 (256)
( 9) 162634 ( 3.1%, 56.8%): small 8 (8)
( 10) 146220 ( 2.8%, 59.6%): small 40 (48)
( 11) 111528 ( 2.1%, 61.7%): small 72 (80)
( 12) 94332 ( 1.8%, 63.5%): small 4 (8)
( 13) 91727 ( 1.7%, 65.3%): small 56 (64)
( 14) 78092 ( 1.5%, 66.7%): small 168 (176)
( 15) 64829 ( 1.2%, 68.0%): small 96 (96)
( 16) 60394 ( 1.1%, 69.1%): small 88 (96)
( 17) 58414 ( 1.1%, 70.2%): small 80 (80)
( 18) 53193 ( 1.0%, 71.2%): large 4096 (4096)
( 19) 51623 ( 1.0%, 72.2%): small 1024 (1024)
( 20) 45979 ( 0.9%, 73.1%): small 2048 (2048)

Unsurprisingly, small allocations dominate. But what happens if we weight each entry by its size? counts -w produced the following output.

2554515775 counts:
( 1) 501481472 (19.6%, 19.6%): large 32768 (32768)
( 2) 217878528 ( 8.5%, 28.2%): large 4096 (4096)
( 3) 156762112 ( 6.1%, 34.3%): large 65536 (65536)
( 4) 133554176 ( 5.2%, 39.5%): large 8192 (8192)
( 5) 128523776 ( 5.0%, 44.6%): small 512 (512)
( 6) 96550912 ( 3.8%, 48.3%): large 3072 (4096)
( 7) 94164992 ( 3.7%, 52.0%): small 2048 (2048)
( 8) 52861952 ( 2.1%, 54.1%): small 1024 (1024)
( 9) 44564480 ( 1.7%, 55.8%): large 262144 (262144)
( 10) 42200576 ( 1.7%, 57.5%): small 256 (256)
( 11) 41926656 ( 1.6%, 59.1%): large 16384 (16384)
( 12) 39976960 ( 1.6%, 60.7%): large 131072 (131072)
( 13) 38928384 ( 1.5%, 62.2%): huge 4864000 (4866048)
( 14) 37748736 ( 1.5%, 63.7%): huge 2097152 (2097152)
( 15) 36905856 ( 1.4%, 65.1%): small 128 (128)
( 16) 31510912 ( 1.2%, 66.4%): small 64 (64)
( 17) 24805376 ( 1.0%, 67.3%): huge 3097600 (3100672)
( 18) 23068672 ( 0.9%, 68.2%): huge 1048576 (1048576)
( 19) 22020096 ( 0.9%, 69.1%): large 524288 (524288)
( 20) 18980864 ( 0.7%, 69.9%): large 5432 (8192)

This shows that the cumulative count of allocated bytes (2.55GB) is dominated by a mixture of larger allocation sizes.

This example gives just a taste of what counts can do.

(An aside: in both cases it’s good the see there isn’t much slop, i.e. the difference between the requested sizes and actual sizes are mostly 0. That 5432 entry at the bottom of the second table is curious, though.)

Other Uses

This technique is often useful when you already know something — e.g. a general-purpose profiler showed that a particular function is hot — but you want to know more.

  • Exactly how many times are paths X, Y and Z executed? For example, how often do lookups succeed or fail in data structure D? Print an identifying string each time a path is hit.
  • How many times does loop L iterate? What does the loop count distribution look like? Is it executed frequently with a low loop count, or infrequently with a high loop count, or a mix? Print the iteration count before or after the loop.
  • How many elements are typically in hash table H at this code location? Few? Many? A mixture? Print the element count.
  • What are the contents of vector V at this code location? Print the contents.
  • How many bytes of memory are used by data structure D at this code location? Print the byte size.
  • Which call sites of function F are the hot ones? Print an identifying string at the call site.

Then use counts to aggregate the data. Often this domain-specific data is critical to fully optimize hot code.

Worse is better

Print statements are an admittedly crude way to get this kind of information, profligate with I/O and disk space. In many cases you could do it in a way that uses machine resources much more efficiently, e.g. by creating a small table data structure in the code to track frequencies, and then printing that table at program termination.

But that would require:

  • writing the custom table (collection and printing);
  • deciding where to define the table;
  • possibly exposing the table to multiple modules;
  • deciding where to initialize the table; and
  • deciding where to print the contents of the table.

That is a pain, especially in a large program you don’t fully understand.

Alternatively, sometimes you want information that a general-purpose profiler could give you, but running that profiler on your program is a hassle because the program you want to profile is actually layered under something else, and setting things up properly takes effort.

In contrast, inserting print statements is trivial. Any measurement can be set up in no time at all. (Recompiling is often the slowest part of the process.) This encourages experimentation. You can also kill a running program at any point with no loss of profiling data.

Don’t feel guilty about wasting machine resources; this is temporary code. You might sometimes end up with output files that are gigabytes in size. But counts is fast because it’s so simple… and the Rust version is 3–4x faster than the Python version, which is nice. Let the machine do the work for you. (It does help if you have a machine with an SSD.)

Ad Hoc Profiling

For a long time I have, in my own mind, used the term ad hoc profiling to describe this combination of logging print statements and frequency-based post-processing. Wikipedia defines “ad hoc” as follows.

In English, it generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes

The process of writing custom code to collect this kind of profiling data — in the manner I disparaged in the previous section — truly matches this definition of “ad hoc”.

But counts is valuable specifically because it makes this type of custom profiling less ad hoc and more repeatable. I should arguably call it “generalized ad hoc profiling” or “not so ad hoc profiling”… but those names don’t have quite the same ring to them.


Use unbuffered output for the print statements. In C and C++ code, use fprintf(stderr, ...). In Rust code use eprintln!. (Update: Rust 1.32 added the dbg! macro, which also works well.)

Pipe the stderr output to file, e.g. firefox 2> log.

Sometimes programs print other lines of output to stderr that should be ignored by counts. (Especially if they include integer IDs that counts -w would interpret as weights!) Prepend all logging lines with a short identifier, and then use grep $ID log | counts to ignore the other lines. If you use more than one prefix, you can grep for each prefix individually or all together.

Occasionally output lines get munged together when multiple print statements are present. Because there are typically many lines of output, having a few garbage ones almost never matters.

It’s often useful to use both counts and counts -w on the same log file; each one gives different insights into the data.

To find which call sites of a function are hot, you can instrument the call sites directly. But it’s easy to miss one, and the same print statements need to be repeated multiple times. An alternative is to add an extra string or integer argument to the function, pass in a unique value from each call site, and then print that value within the function.

It’s occasionally useful to look at the raw logs as well as the output of counts, because the sequence of output lines can be informative. For example, I recently diagnosed an occurrences of quadratic behaviour in the Rust compiler by seeing that a loop iterated 1, 2, 3, …, 9000+ times.

The Code

counts is available here.


I use counts to do ad hoc profiling all the time. It’s the first tool I reach for any time I have a question about code execution patterns. I have used it extensively for every bout of major performance work I have done in the past few years, as well as in plenty of other circumstances. I even built direct support for it into rustc-perf, the Rust compiler’s benchmark suite, via the profile eprintln subcommand. Give it a try!