{"id":3148,"date":"2016-10-14T16:35:02","date_gmt":"2016-10-14T05:35:02","guid":{"rendered":"http:\/\/blog.mozilla.org\/nnethercote\/?p=3148"},"modified":"2020-11-02T10:49:10","modified_gmt":"2020-11-01T23:49:10","slug":"how-to-speed-up-the-rust-compiler","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/nnethercote\/2016\/10\/14\/how-to-speed-up-the-rust-compiler\/","title":{"rendered":"How to speed up the Rust compiler"},"content":{"rendered":"<p><a href=\"http:\/\/rust-lang.org\/\">Rust<\/a> is a great language, and Mozilla plans to <a href=\"https:\/\/wiki.mozilla.org\/Oxidation\">use it extensively<\/a> in Firefox. However, the Rust compiler (rustc) is <a href=\"https:\/\/www.reddit.com\/r\/rust\/comments\/4vbmv4\/can_we_talk_about_build_times\/\">quite slow<\/a> and compile times are a pain point for many Rust users. Recently I&#8217;ve been working on improving that. This post covers how I&#8217;ve done this, and should be of interest to anybody else who wants to help speed up the Rust compiler. Although I&#8217;ve done all this work on Linux it should be mostly applicable to other platforms as well.<\/p>\n<h3>Getting the code<\/h3>\n<p>The first step is to get the rustc code. First, I fork the <a href=\"https:\/\/github.com\/rust-lang\/rust\/\">main Rust repository<\/a> on GitHub. Then I make two local clones: a base clone that I won&#8217;t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). 
I use commands something like this:<\/p>\n<pre>user=nnethercote\r\nfor r in rust0 rust1 ; do\r\n  cd ~\/moz\r\n  git clone https:\/\/github.com\/$user\/rust $r\r\n  cd $r\r\n  git remote add upstream https:\/\/github.com\/rust-lang\/rust\r\n  git remote set-url origin git@github.com:$user\/rust\r\ndone<\/pre>\n<h3>Building the Rust compiler<\/h3>\n<p>Within the two repositories, I first configure:<\/p>\n<pre>.\/configure --enable-optimize --enable-debuginfo<\/pre>\n<p>I configure with optimizations enabled because that matches release versions of rustc. And I configure with debug info enabled so that I get good information from profilers.<\/p>\n<p>[<strong>Update:<\/strong> I now add <code>--enable-llvm-release-debuginfo<\/code>, which builds the LLVM back-end with debug info too.]<\/p>\n<p>Then I build:<\/p>\n<pre>RUSTFLAGS='' make -j8<\/pre>\n<p>[<strong>Update:<\/strong> I previously had <code>-Ccodegen-units=8<\/code> in <code>RUSTFLAGS<\/code> because it speeds up compile times. But Lars Bergstrom informed me that it can slow down the resulting program significantly. I measured and he was right &#8212; the resulting rustc was about 5&#8211;10% slower. So I&#8217;ve stopped using it now.]<\/p>\n<p>That does a full build, which does the following:<\/p>\n<ul>\n<li>Downloads a stage0 compiler, which will be used to build the stage1 local compiler.<\/li>\n<li>Builds LLVM, which will become part of the local compilers.<\/li>\n<li>Builds the stage1 compiler with the stage0 compiler.<\/li>\n<li>Builds the stage2 compiler with the stage1 compiler.<\/li>\n<\/ul>\n<p>It can be mind-bending to grok all the stages, especially with regard to how libraries work. (One notable example: the stage1 compiler uses the system allocator, but the stage2 compiler uses jemalloc.) I&#8217;ve found that the stage1 and stage2 compilers have similar performance. 
Therefore, I mostly measure the stage1 compiler because it&#8217;s much faster to just build the stage1 compiler, which I do with the following command.<\/p>\n<pre>RUSTFLAGS='' make -j8 rustc-stage1<\/pre>\n<p>Building the compiler takes a while, which isn&#8217;t surprising. What is more surprising is that rebuilding the compiler after a small change also takes a while. That&#8217;s because a lot of code gets recompiled after any change. There are two reasons for this.<\/p>\n<ul>\n<li>Rust&#8217;s unit of compilation is the <em>crate<\/em>. Each crate can consist of multiple files. If you modify a crate, the whole crate must be rebuilt. This isn&#8217;t surprising.<\/li>\n<li>rustc&#8217;s dependency checking is very coarse. If you modify a crate, every other crate that depends on it will also be rebuilt, no matter how trivial the modification. This surprised me greatly. For example, any modification to the parser (which is in a crate called libsyntax) causes multiple other crates to be recompiled, a process which takes 6 minutes on my fast desktop machine. Almost any change to the compiler will result in a rebuild that takes at least 2 or 3 minutes.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/blog.rust-lang.org\/2016\/09\/08\/incremental.html\">Incremental compilation<\/a> should greatly improve the dependency situation, but it&#8217;s still in an experimental state and I haven&#8217;t tried it yet.<\/p>\n<p>To run all the tests I do this (after a full build):<\/p>\n<pre>ulimit -c 0 &amp;&amp; make check<\/pre>\n<p>The checking aborts if you don&#8217;t do the ulimit, because the tests produce lots of core files and it doesn&#8217;t want to swamp your disk.<\/p>\n<p>The build system is complex, with lots of options. This command gives a nice overview of some common invocations:<\/p>\n<pre>make tips<\/pre>\n<h3>Basic profiling<\/h3>\n<p>The next step is to do some basic profiling. 
I like to be careful about which rustc I am invoking at any time, especially if there&#8217;s a system-wide version installed, so I avoid relying on <code>PATH<\/code> and instead define some environment variables like this:<\/p>\n<pre>export RUSTC01=\"$HOME\/moz\/rust0\/x86_64-unknown-linux-gnu\/stage1\/bin\/rustc\"\r\nexport RUSTC02=\"$HOME\/moz\/rust0\/x86_64-unknown-linux-gnu\/stage2\/bin\/rustc\"\r\nexport RUSTC11=\"$HOME\/moz\/rust1\/x86_64-unknown-linux-gnu\/stage1\/bin\/rustc\"\r\nexport RUSTC12=\"$HOME\/moz\/rust1\/x86_64-unknown-linux-gnu\/stage2\/bin\/rustc\"<\/pre>\n<p>In the examples that follow I will use <code>$RUSTC01<\/code> as the version of rustc that I invoke.<\/p>\n<p>rustc has the ability to produce some basic stats about the time and memory used by each compiler pass. It is enabled with the <code>-Ztime-passes<\/code> flag. If you are invoking rustc directly you&#8217;d do it like this:<\/p>\n<pre>$RUSTC01 -Ztime-passes a.rs<\/pre>\n<p>If you are building with Cargo you can instead do this:<\/p>\n<pre>RUSTC=$RUSTC01 cargo rustc -- -Ztime-passes<\/pre>\n<p>The <code>RUSTC=<\/code> part tells Cargo you want to use a non-default rustc, and the part after the <code>--<\/code> is flags that will be passed to rustc when it builds the final crate. 
(A bit weird, but useful.)<\/p>\n<p>Here is some sample output from <code>-Ztime-passes<\/code>:<\/p>\n<pre>time: 0.056; rss: 49MB parsing\r\ntime: 0.000; rss: 49MB recursion limit\r\ntime: 0.000; rss: 49MB crate injection\r\ntime: 0.000; rss: 49MB plugin loading\r\ntime: 0.000; rss: 49MB plugin registration\r\ntime: 0.103; rss: 87MB expansion\r\ntime: 0.000; rss: 87MB maybe building test harness\r\ntime: 0.002; rss: 87MB maybe creating a macro crate\r\ntime: 0.000; rss: 87MB checking for inline asm in case the target doesn't support it\r\ntime: 0.005; rss: 87MB complete gated feature checking\r\ntime: 0.008; rss: 87MB early lint checks\r\ntime: 0.003; rss: 87MB AST validation\r\ntime: 0.026; rss: 90MB name resolution\r\ntime: 0.019; rss: 103MB lowering ast -&gt; hir\r\ntime: 0.004; rss: 105MB indexing hir\r\ntime: 0.003; rss: 105MB attribute checking\r\ntime: 0.003; rss: 105MB language item collection\r\ntime: 0.004; rss: 105MB lifetime resolution\r\ntime: 0.000; rss: 105MB looking for entry point\r\ntime: 0.000; rss: 105MB looking for plugin registrar\r\ntime: 0.015; rss: 109MB region resolution\r\ntime: 0.002; rss: 109MB loop checking\r\ntime: 0.002; rss: 109MB static item recursion checking\r\ntime: 0.060; rss: 109MB compute_incremental_hashes_map\r\ntime: 0.000; rss: 109MB load_dep_graph\r\ntime: 0.021; rss: 109MB type collecting\r\ntime: 0.000; rss: 109MB variance inference\r\ntime: 0.038; rss: 113MB coherence checking\r\ntime: 0.126; rss: 114MB wf checking\r\ntime: 0.219; rss: 118MB item-types checking\r\ntime: 1.158; rss: 125MB item-bodies checking\r\ntime: 0.000; rss: 125MB drop-impl checking\r\ntime: 0.092; rss: 127MB const checking\r\ntime: 0.015; rss: 127MB privacy checking\r\ntime: 0.002; rss: 127MB stability index\r\ntime: 0.011; rss: 127MB intrinsic checking\r\ntime: 0.007; rss: 127MB effect checking\r\ntime: 0.027; rss: 127MB match checking\r\ntime: 0.014; rss: 127MB liveness checking\r\ntime: 0.082; rss: 127MB rvalue checking\r\ntime: 0.145; rss: 
161MB MIR dump\r\n time: 0.015; rss: 161MB SimplifyCfg\r\n time: 0.033; rss: 161MB QualifyAndPromoteConstants\r\n time: 0.034; rss: 161MB TypeckMir\r\n time: 0.001; rss: 161MB SimplifyBranches\r\n time: 0.006; rss: 161MB SimplifyCfg\r\ntime: 0.089; rss: 161MB MIR passes\r\ntime: 0.202; rss: 161MB borrow checking\r\ntime: 0.005; rss: 161MB reachability checking\r\ntime: 0.012; rss: 161MB death checking\r\ntime: 0.014; rss: 162MB stability checking\r\ntime: 0.000; rss: 162MB unused lib feature checking\r\ntime: 0.101; rss: 162MB lint checking\r\ntime: 0.000; rss: 162MB resolving dependency formats\r\n time: 0.001; rss: 162MB NoLandingPads\r\n time: 0.007; rss: 162MB SimplifyCfg\r\n time: 0.017; rss: 162MB EraseRegions\r\n time: 0.004; rss: 162MB AddCallGuards\r\n time: 0.126; rss: 164MB ElaborateDrops\r\n time: 0.001; rss: 164MB NoLandingPads\r\n time: 0.012; rss: 164MB SimplifyCfg\r\n time: 0.008; rss: 164MB InstCombine\r\n time: 0.003; rss: 164MB Deaggregator\r\n time: 0.001; rss: 164MB CopyPropagation\r\n time: 0.003; rss: 164MB AddCallGuards\r\n time: 0.001; rss: 164MB PreTrans\r\ntime: 0.182; rss: 164MB Prepare MIR codegen passes\r\n time: 0.081; rss: 167MB write metadata\r\n time: 0.590; rss: 177MB translation item collection\r\n time: 0.034; rss: 180MB codegen unit partitioning\r\n time: 0.032; rss: 300MB internalize symbols\r\ntime: 3.491; rss: 300MB translation\r\ntime: 0.000; rss: 300MB assert dep graph\r\ntime: 0.000; rss: 300MB serialize dep graph\r\n time: 0.216; rss: 292MB llvm function passes [0]\r\n time: 0.103; rss: 292MB llvm module passes [0]\r\n time: 4.497; rss: 308MB codegen passes [0]\r\n time: 0.004; rss: 308MB codegen passes [0]\r\ntime: 5.185; rss: 308MB LLVM passes\r\ntime: 0.000; rss: 308MB serialize work products\r\ntime: 0.257; rss: 297MB linking<\/pre>\n<p>As far as I can tell, the indented passes are sub-passes, and the parent pass is the first non-indented pass afterwards.<\/p>\n<h3>More serious profiling<\/h3>\n<p>The 
<code>-Ztime-passes<\/code> flag gives a good overview, but you really need a profiling tool that gives finer-grained information to get far. I&#8217;ve done most of my profiling with two Valgrind tools, Cachegrind and DHAT. I invoke Cachegrind like this:<\/p>\n<pre>valgrind \\\r\n --tool=cachegrind --cache-sim=no --branch-sim=yes \\\r\n --cachegrind-out-file=$OUTFILE $RUSTC01 ...<\/pre>\n<p>where <code>$OUTFILE<\/code> specifies an output filename. I find the instruction counts measured by Cachegrind to be highly useful; the branch simulation results are occasionally useful, and the cache simulation results are almost never useful.<\/p>\n<p>The Cachegrind output looks like this:<\/p>\n<pre>--------------------------------------------------------------------------------\r\n            Ir \r\n--------------------------------------------------------------------------------\r\n22,153,170,953 PROGRAM TOTALS\r\n\r\n--------------------------------------------------------------------------------\r\n         Ir file:function\r\n--------------------------------------------------------------------------------\r\n923,519,467 \/build\/glibc-GKVZIf\/glibc-2.23\/malloc\/malloc.c:_int_malloc\r\n879,700,120 \/home\/njn\/moz\/rust0\/src\/rt\/miniz.c:tdefl_compress\r\n629,196,933 \/build\/glibc-GKVZIf\/glibc-2.23\/malloc\/malloc.c:_int_free\r\n394,687,991 ???:???\r\n379,869,259 \/home\/njn\/moz\/rust0\/src\/libserialize\/leb128.rs:serialize::leb128::read_unsigned_leb128\r\n376,921,973 \/build\/glibc-GKVZIf\/glibc-2.23\/malloc\/malloc.c:malloc\r\n263,083,755 \/build\/glibc-GKVZIf\/glibc-2.23\/string\/::\/sysdeps\/x86_64\/multiarch\/memcpy-avx-unaligned.S:__memcpy_avx_unaligned\r\n257,219,281 \/home\/njn\/moz\/rust0\/src\/libserialize\/opaque.rs:&lt;serialize::opaque::Decoder&lt;'a&gt; as serialize::serialize::Decoder&gt;::read_usize\r\n217,838,379 \/build\/glibc-GKVZIf\/glibc-2.23\/malloc\/malloc.c:free\r\n217,006,132 
\/home\/njn\/moz\/rust0\/src\/librustc_back\/sha2.rs:rustc_back::sha2::Engine256State::process_block\r\n211,098,567 ???:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults&amp;, llvm::CodeGenOpt::Level)\r\n185,630,213 \/home\/njn\/moz\/rust0\/src\/libcore\/hash\/sip.rs:&lt;rustc_incremental::calculate_svh::hasher::IchHasher as core::hash::Hasher&gt;::write\r\n171,360,754 \/home\/njn\/moz\/rust0\/src\/librustc_data_structures\/fnv.rs:&lt;rustc::ty::subst::Substs&lt;'tcx&gt; as core::hash::Hash&gt;::hash\r\n150,026,054 ???:llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)<\/pre>\n<p>Here &#8220;Ir&#8221; is short for &#8220;I-cache reads&#8221;, which corresponds to the number of instructions executed. Cachegrind also gives line-by-line annotations of the source code.<\/p>\n<p>The Cachegrind results indicate that malloc and free are usually the two hottest functions in the compiler. So I also use DHAT, which is a malloc profiler that <a href=\"https:\/\/blog.mozilla.org\/jseward\/2010\/12\/05\/fun-n-games-with-dhat\/\">tells you exactly where all your malloc calls are coming from<\/a>.\u00a0 I invoke DHAT like this:<\/p>\n<pre>\/home\/njn\/grind\/ws3\/vg-in-place \\\r\n --tool=exp-dhat --show-top-n=1000 --num-callers=4 \\\r\n --sort-by=tot-blocks-allocd $RUSTC01 ... 2&gt; $OUTFILE<\/pre>\n<p>I sometimes also use <code>--sort-by=tot-bytes-allocd<\/code>. 
DHAT&#8217;s output looks like this:<\/p>\n<pre>==16425== -------------------- 1 of 1000 --------------------\r\n==16425== max-live: 30,240 in 378 blocks\r\n==16425== tot-alloc: 20,866,160 in 260,827 blocks (avg size 80.00)\r\n==16425== deaths: 260,827, at avg age 113,438 (0.00% of prog lifetime)\r\n==16425== acc-ratios: 0.74 rd, 1.00 wr (15,498,021 b-read, 20,866,160 b-written)\r\n==16425== at 0x4C2BFA6: malloc (vg_replace_malloc.c:299)\r\n==16425== by 0x5AD392B: &lt;syntax::ptr::P&lt;T&gt; as serialize::serialize::Decodable&gt;::decode (heap.rs:59)\r\n==16425== by 0x5AD4456: &lt;core::iter::Map&lt;I, F&gt; as core::iter::iterator::Iterator&gt;::next (serialize.rs:201)\r\n==16425== by 0x5AE2A52: rustc_metadata::decoder::&lt;impl rustc_metadata::cstore::CrateMetadata&gt;::get_attributes (vec.rs:1556)\r\n==16425== \r\n==16425== -------------------- 2 of 1000 --------------------\r\n==16425== max-live: 1,360 in 17 blocks\r\n==16425== tot-alloc: 10,378,160 in 129,727 blocks (avg size 80.00)\r\n==16425== deaths: 129,727, at avg age 11,622 (0.00% of prog lifetime)\r\n==16425== acc-ratios: 0.47 rd, 0.92 wr (4,929,626 b-read, 9,599,798 b-written)\r\n==16425== at 0x4C2BFA6: malloc (vg_replace_malloc.c:299)\r\n==16425== by 0x881136A: &lt;syntax::ptr::P&lt;T&gt; as core::clone::Clone&gt;::clone (heap.rs:59)\r\n==16425== by 0x88233A7: syntax::ext::tt::macro_parser::parse (vec.rs:1105)\r\n==16425== by 0x8812E66: syntax::tokenstream::TokenTree::parse (tokenstream.rs:230)<\/pre>\n<p>The &#8220;deaths&#8221; value here indicates the total number of calls to malloc for each call stack, which is usually the metric of most interest. The &#8220;acc-ratios&#8221; value can also be interesting, especially if the &#8220;rd&#8221; value is 0.00, because that indicates the allocated blocks are never read. 
(See below for examples of problems that I found this way.)<\/p>\n<p>For both profilers I also pipe <code>$OUTFILE<\/code> through eddyb&#8217;s <a href=\"https:\/\/gist.github.com\/eddyb\/3a233c4709018e92b866\">rustfilt.sh<\/a> script, which demangles ugly Rust symbols like this:<\/p>\n<pre>_$LT$serialize..opaque..Decoder$LT$$u27$a$GT$$u20$as$u20$serialize..serialize..Decoder$GT$::read_usize::h87863ec7f9234810<\/pre>\n<p>to something much nicer, like this:<\/p>\n<pre>&lt;serialize::opaque::Decoder&lt;'a&gt; as serialize::serialize::Decoder&gt;::read_usize<\/pre>\n<p>[<strong>Update:<\/strong> native support for Rust demangling recently landed in Valgrind&#8217;s repo. I use a trunk version of Valgrind so I no longer need to use rustfilt.sh in combination with Valgrind.]<\/p>\n<p>For programs that use Cargo, sometimes it&#8217;s useful to know the exact rustc invocations that Cargo uses. Find out with either of these commands:<\/p>\n<pre>RUSTC=$RUSTC01 cargo build -v\r\nRUSTC=$RUSTC01 cargo rustc -v<\/pre>\n<p>I have also done a decent amount of ad hoc println profiling, where I insert <code>println!<\/code> calls in hot parts of the code and then I use a script to post-process them. This can be very useful when I want to know exactly how many times particular code paths are hit.<\/p>\n<p>I&#8217;ve also tried <code>perf<\/code>. It works, but I&#8217;ve never established much of a rapport with it. YMMV. In general, any profiler that works with C or C++ code should also work with Rust code.<\/p>\n<h3>Finding suitable benchmarks<\/h3>\n<p>Once you know how you&#8217;re going to profile you need some good workloads. You could use the compiler itself, but it&#8217;s big and complicated and reasoning about the various stages can be confusing, so I have avoided that myself.<\/p>\n<p>Instead, I have focused entirely on <a href=\"https:\/\/github.com\/rust-lang-nursery\/rustc-benchmarks\">rustc-benchmarks<\/a>, a pre-existing rustc benchmark suite. 
It contains 13 benchmarks of various sizes. It has been used to track rustc&#8217;s performance at <a href=\"http:\/\/perf.rust-lang.org\/\">perf.rust-lang.org<\/a> for some time, but it wasn&#8217;t easy to use locally until I <a href=\"https:\/\/github.com\/rust-lang-nursery\/rustc-benchmarks\/pull\/17\/\">wrote a script<\/a> for that purpose. I invoke it something like this:<\/p>\n<pre>.\/compare.py \\\r\n  \/home\/njn\/moz\/rust0\/x86_64-unknown-linux-gnu\/stage1\/bin\/rustc \\\r\n  \/home\/njn\/moz\/rust1\/x86_64-unknown-linux-gnu\/stage1\/bin\/rustc<\/pre>\n<p>It compares the two given compilers, doing debug builds, on the benchmarks. See the next section for example output. If you want to run a subset of the benchmarks you can specify them as additional arguments.<\/p>\n<p>Each benchmark in rustc-benchmarks has a makefile with three targets. See the <a href=\"https:\/\/github.com\/rust-lang-nursery\/rustc-benchmarks\/blob\/master\/README.md\">README<\/a> for details on these targets, which can be helpful.<\/p>\n<h3>Wins<\/h3>\n<p>Here are the results if I compare the following two versions of rustc with compare.py.<\/p>\n<ul>\n<li>The commit just before my first commit (on September 12).<\/li>\n<li>A commit from October 13.<\/li>\n<\/ul>\n<pre>futures-rs-test  5.028s vs  4.433s --&gt; 1.134x faster (variance: 1.020x, 1.030x)\r\nhelloworld       0.283s vs  0.235s --&gt; 1.202x faster (variance: 1.012x, 1.025x)\r\nhtml5ever-2016-  6.293s vs  5.652s --&gt; 1.113x faster (variance: 1.011x, 1.008x)\r\nhyper.0.5.0      6.182s vs  5.039s --&gt; 1.227x faster (variance: 1.002x, 1.018x)\r\ninflate-0.1.0    5.168s vs  4.935s --&gt; 1.047x faster (variance: 1.001x, 1.002x)\r\nissue-32062-equ  0.457s vs  0.347s --&gt; 1.316x faster (variance: 1.010x, 1.007x)\r\nissue-32278-big  2.046s vs  1.706s --&gt; 1.199x faster (variance: 1.003x, 1.007x)\r\njld-day15-parse  1.793s vs  1.538s --&gt; 1.166x faster (variance: 1.059x, 1.020x)\r\npiston-image-0. 
13.871s vs 11.885s --&gt; 1.167x faster (variance: 1.005x, 1.005x)\r\nregex.0.1.30     2.937s vs  2.516s --&gt; 1.167x faster (variance: 1.010x, 1.002x)\r\nrust-encoding-0  2.414s vs  2.078s --&gt; 1.162x faster (variance: 1.006x, 1.005x)\r\nsyntex-0.42.2   36.526s vs 32.373s --&gt; 1.128x faster (variance: 1.003x, 1.004x)\r\nsyntex-0.42.2-i 21.500s vs 17.916s --&gt; 1.200x faster (variance: 1.007x, 1.013x)<\/pre>\n<p>Not all of the improvement is due to my changes, but I have managed a few nice wins, including the following.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/36592\">#36592<\/a>: There is an arena allocator called TypedArena. rustc creates <em>many<\/em> of these, mostly short-lived. On creation, each arena would allocate a 4096 byte chunk, in preparation for the first arena allocation request. But DHAT&#8217;s output showed me that the vast majority of arenas never received such a request! So I made TypedArena lazy &#8212; the first chunk is now only allocated when necessary. This reduced the number of calls to malloc greatly, which sped up compilation of several rustc-benchmarks by 2&#8211;6%.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/36734\">#36734<\/a>: This one was similar. Rust&#8217;s HashMap implementation is lazy &#8212; it doesn&#8217;t allocate any memory for elements until the first one is inserted. This is a good thing because it&#8217;s surprisingly common in large programs to create HashMaps that are never used. However, Rust&#8217;s HashSet implementation (which is just a layer on top of the HashMap) didn&#8217;t have this property, and guess what? rustc also creates large numbers of HashSets that are never used. (Again, DHAT&#8217;s output made this obvious.) So I fixed that, which sped up compilation of several rustc-benchmarks by 1&#8211;4%. 
Even better, because this change is to Rust&#8217;s stdlib, rather than rustc itself, it will speed up any program that creates HashSets without using them.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/36917\">#36917<\/a>: This one involved avoiding some useless data structure manipulation when a particular table was empty. Again, DHAT pointed out a table that was created but never read, which was the clue I needed to identify this improvement. This sped up two benchmarks by 16% and a couple of others by 3&#8211;5%.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/37064\">#37064<\/a>: This one changed a hot function in serialization code to return a Cow&lt;str&gt; instead of a String, which avoided a lot of allocations.<\/p>\n<h3>Future work<\/h3>\n<p>Profiles indicate that the following parts of the compiler account for a lot of its runtime.<\/p>\n<ul>\n<li>malloc and free are still the two hottest functions in most benchmarks. Avoiding heap allocations can be a win.<\/li>\n<li>Compression is used for crate metadata and LLVM bitcode. (This shows up in profiles under a function called <code>tdefl_compress<\/code>.)\u00a0 There is an <a href=\"https:\/\/github.com\/rust-lang\/rust\/issues\/37086\">issue<\/a> open about this.<\/li>\n<li>Hash table operations are hot. A lot of this comes from the interning of various values during type checking; see the CtxtInterners type for details.<\/li>\n<li>Crate metadata decoding is also costly.<\/li>\n<li>LLVM execution is a big chunk, especially when doing optimized builds. So far I have treated LLVM as a black box and haven&#8217;t tried to change it, at least partly because I don&#8217;t know how to build it with debug info, which is necessary to get source files and line numbers in profiles. 
[<strong>Update:<\/strong> there is a new <a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/37742\">--enable-llvm-release-debuginfo<\/a> configure option that causes LLVM to be built with debug info.]<\/li>\n<\/ul>\n<p>A lot of programs have broadly similar profiles, but occasionally you get an odd one that stresses a different part of the compiler. For example, in rustc-benchmarks, inflate-0.1.0 is dominated by operations involving the (delightfully named) ObligationsForest (see <a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/36993\">#36993<\/a>), and html5ever-2016-08-25 is dominated by what I think is macro processing. So it&#8217;s worth profiling the compiler on new codebases.<\/p>\n<h3>Caveat lector<\/h3>\n<p>I&#8217;m still a newcomer to Rust development. Although I&#8217;ve had lots of help on the #rustc IRC channel &#8212; big thanks to eddyb and simulacrum in particular &#8212; there may be things I am doing wrong or sub-optimally. Nonetheless, I hope this is a useful starting point for newcomers who want to speed up the Rust compiler.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Rust is a great language, and Mozilla plans to use it extensively in Firefox. However, the Rust compiler (rustc) is quite slow and compile times are a pain point for many Rust users. Recently I&#8217;ve been working on improving that. 
This post covers how I&#8217;ve done this, and should be of interest to anybody else [&hellip;]<\/p>\n","protected":false},"author":139,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[311,16179],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/3148"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/users\/139"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/comments?post=3148"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/3148\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/media?parent=3148"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/categories?post=3148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/tags?post=3148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}