{"id":3175,"date":"2018-04-30T15:13:45","date_gmt":"2018-04-30T04:13:45","guid":{"rendered":"http:\/\/blog.mozilla.org\/nnethercote\/?p=3175"},"modified":"2020-09-08T16:21:29","modified_gmt":"2020-09-08T05:21:29","slug":"how-to-speed-up-the-rust-compiler-in-2018","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/nnethercote\/2018\/04\/30\/how-to-speed-up-the-rust-compiler-in-2018\/","title":{"rendered":"How to speed up the Rust compiler in 2018"},"content":{"rendered":"<p>18 months ago I <a href=\"https:\/\/blog.mozilla.org\/nnethercote\/2016\/10\/14\/how-to-speed-up-the-rust-compiler\/\">wrote about<\/a> <a href=\"https:\/\/blog.mozilla.org\/nnethercote\/2016\/11\/23\/how-to-speed-up-the-rust-compiler-some-more\/\">some work<\/a> I did to speed up the Rust compiler (rustc). I&#8217;ve recently taken this work up again. Also, in the meantime rustc&#8217;s build system has been replaced and its benchmark suite has been overhauled. So it&#8217;s a good time for an update.<\/p>\n<h3>Getting the code<\/h3>\n<p>The steps for getting the rustc code haven&#8217;t changed. First, I fork the <a href=\"https:\/\/github.com\/rust-lang\/rust\/\">main Rust repository<\/a> on GitHub. Then I make two local clones: a base clone that I won\u2019t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). I use commands something like this:<\/p>\n<pre>user=nnethercote\r\nfor r in rust0 rust1 ; do\r\n  git clone https:\/\/github.com\/$user\/rust $r\r\n  cd $r\r\n  git remote add upstream https:\/\/github.com\/rust-lang\/rust\r\n  git remote set-url origin git@github.com:$user\/rust\r\ndone<\/pre>\n<h3>Building the Rust compiler<\/h3>\n<p>The compiler&#8217;s build system is <a href=\"https:\/\/forge.rust-lang.org\/x-py.html\">complex<\/a> with many possible invocations and configurations. 
I&#8217;ll cover the absolute minimum information required to understand how I&#8217;ve been using it.<\/p>\n<p>(<strong>UPDATE, December 2019:<\/strong> I have updated this to account for <code>config.toml<\/code> changes since this post was written.)<\/p>\n<p>First, you need a <code>config.toml<\/code> file, which sits at the root of the repository and dictates the compiler&#8217;s configuration. I used the provided <code>config.toml.example<\/code> as a starting point, cut it down a lot, and ended up with the following.<\/p>\n<pre># Last updated in December 2019\r\n[llvm]\r\nrelease-debuginfo = true\r\n[rust]\r\ndebuginfo-level = 1\r\n<\/pre>\n<p>This configuration matches the compiler that ships (in particular, in terms of optimization levels), with the exception that it has maximal debug info present, which ensures that profiles are as detailed as possible.<\/p>\n<p>Building rustc can be confusing, particularly because of the multiple compiler stages and the terminology around them. Here is the command I use.<\/p>\n<p>(<strong>UPDATE, September 2020:<\/strong> I have updated this to account for the <code>src<\/code> directory being renamed <code>compiler<\/code>.)<\/p>\n<pre>.\/x.py build --stage 2 compiler\/rustc<\/pre>\n<p>This produces a stage 2 compiler that can handle procedural macros and so is suitable for profiling the benchmark suite.<\/p>\n<h3>The benchmark suite<\/h3>\n<p>rustc&#8217;s benchmark suite is called <a href=\"https:\/\/github.com\/rust-lang-nursery\/rustc-perf\">rustc-perf<\/a>. It consists of two parts:<\/p>\n<ul>\n<li>collector: The 24 benchmark programs, 14 of which are &#8220;real&#8221; code (i.e. crates used in real applications), and 10 of which are toy programs or stress-test microbenchmarks. Also, the harness code that runs and measures them.<\/li>\n<li>site: Code for displaying measurements as a website.<\/li>\n<\/ul>\n<p>The test harness is very thorough. 
For each benchmark, it measures Debug, Opt, and Check (no code generation) invocations. Furthermore, within each of those categories, it does five or more runs, including a normal build and various kinds of incremental builds. A full benchmarking run measures over 400 invocations of rustc. Note however that only the compilation of the final crate is measured in each benchmark; compilation of dependent crates is not included.<\/p>\n<p>rustc-perf was created primarily for the <a href=\"http:\/\/perf.rust-lang.org\/\">perf.rust-lang.org<\/a>\u00a0site, which tracks rustc&#8217;s performance over time. I recently <a href=\"https:\/\/github.com\/rust-lang-nursery\/rustc-perf\/pull\/207\">modified<\/a> it so it could be used to <a href=\"https:\/\/github.com\/rust-lang-nursery\/rustc-perf\/tree\/master\/collector\">compare two builds of rustc on a local machine<\/a>, which is a fundamental operation when optimizing. This can be done by running the suite and site locally and navigating to the local <a href=\"http:\/\/perf.rust-lang.org\/compare.html\"> &#8220;compare&#8221; page<\/a>, which looks like this:<\/p>\n<p><a href=\"http:\/\/blog.mozilla.org\/nnethercote\/2018\/04\/30\/how-to-speed-up-the-rust-compiler-in-2018\/compare\/\" rel=\"attachment wp-att-3176\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-3176\" src=\"https:\/\/blog.mozilla.org\/nnethercote\/files\/2018\/04\/compare.png\" alt=\"Screenshot of rustc-perf's compare page\" width=\"667\" height=\"526\" srcset=\"https:\/\/blog.mozilla.org\/nnethercote\/files\/2018\/04\/compare.png 667w, https:\/\/blog.mozilla.org\/nnethercote\/files\/2018\/04\/compare-252x199.png 252w, https:\/\/blog.mozilla.org\/nnethercote\/files\/2018\/04\/compare-600x473.png 600w\" sizes=\"(max-width: 667px) 100vw, 667px\" \/><\/a><\/p>\n<p>Note that rustc-perf uses perf-stat for its measurements, so the benchmarking functionality currently only works on Linux.<\/p>\n<p>I also <a 
href=\"https:\/\/github.com\/rust-lang-nursery\/rustc-perf\/pull\/209\">extended<\/a> rustc-perf so the benchmarks can be run under a profiler. I implemented support for perf-record, Cachegrind, and DHAT, because they are the profilers that I am most familiar with; it isn&#8217;t hard to add support for other profilers (including non-Linux ones). The advantage of integrating support for a profiler into rustc-perf is that it gets the profiler invocations underneath the Cargo invocations, ensuring that the right rustc invocations are measured.<\/p>\n<h3>Wins<\/h3>\n<p>Here are the improvements I&#8217;ve made to rustc over the past few weeks.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/49993\">#49993<\/a>: Cachegrind&#8217;s output showed that derived methods for the <code>Token<\/code> type were taking up significant time. This PR changed the &#8216;#&#8217; counts in raw <code>Lit<\/code> variants from <code>usize<\/code> to <code>u16<\/code>, which reduced the size of <code>Token<\/code> from 32 to 24 bytes, speeding up some of the runs for coercions and html5ever by 1%.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/50051\">#50051<\/a>: Rust&#8217;s <code>Option::ok_or<\/code> function transforms an <code>Option&lt;T&gt;<\/code> value into a <code>Result&lt;T, E&gt;<\/code>. 
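<\/p>
<p>To illustrate the eager-versus-lazy distinction, here is a small stand-alone sketch (the function names are mine, not the compiler&#8217;s): the argument to <code>ok_or<\/code> is evaluated on every call, while the closure passed to <code>ok_or_else<\/code> only runs in the <code>None<\/code> case.<\/p>

```rust
// Hypothetical sketch: `expensive_default` stands in for any costly
// error-value computation, such as one that allocates a String.
fn expensive_default() -> String {
    "something went wrong".to_string() // allocates
}

fn eager(x: Option<i32>) -> Result<i32, String> {
    // The argument is evaluated before `ok_or` runs, so the allocation
    // happens even when `x` is `Some`.
    x.ok_or(expensive_default())
}

fn lazy(x: Option<i32>) -> Result<i32, String> {
    // The closure is only called in the `None` case.
    x.ok_or_else(expensive_default)
}

fn main() {
    assert_eq!(eager(Some(1)), Ok(1));
    assert!(lazy(None).is_err());
}
```

<p>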
It&#8217;s a bit of a footgun&#8230; as the <a href=\"https:\/\/doc.rust-lang.org\/std\/option\/enum.Option.html#method.ok_or\">docs say<\/a>:<\/p>\n<blockquote><p>Arguments passed to <code>ok_or<\/code> are eagerly evaluated; if you are passing the result of a function call, it is recommended to use <a href=\"https:\/\/doc.rust-lang.org\/std\/option\/enum.Option.html#method.ok_or_else\"><code>ok_or_else<\/code><\/a>, which is lazily evaluated.<\/p><\/blockquote>\n<p>DHAT&#8217;s output showed that one hot allocation site involved the creation of a <code>String<\/code> while getting the value of the <code>MIRI_BACKTRACE<\/code> environment variable. This seemed to me like a strange thing to be happening frequently, and it turns out that it was at the end of a chain of calls that were only needed in the (rare) error case of an <code>ok_or<\/code> call. This PR changed the code to use a trivial closure with <code>ok_or_else<\/code>, speeding up runs for a lot of benchmarks, the best by 6%.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/50052\">#50052<\/a>: DHAT&#8217;s output showed that the <code>char_lit<\/code> function, which parses &#8220;\\u{&#8230;}&#8221; literals, was doing a lot of allocations while stripping out &#8216;_&#8217; chars. This PR avoided that allocation, speeding up various runs &#8212; particularly ones for regex, futures, clap, coercions, hyper, and encoding &#8212; the best by 6%.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/50106\">#50106<\/a>: Cachegrind&#8217;s output showed that the <code>nearest_common_ancestor<\/code> function, which computes the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Lowest_common_ancestor\"> lowest common ancestor<\/a> of two nodes in a scope tree, was very hot. The algorithm in use constructed the full scope chain for each node, and then worked backward from the end of the two scope chains until a difference was found. 
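<\/p>
<p>That backward-scan approach can be sketched in stand-alone form (this uses a hypothetical parent map rather than the compiler&#8217;s actual scope-tree types):<\/p>

```rust
use std::collections::HashMap;

// Hypothetical parent map: child -> parent; the root has no entry.
// Builds the chain from `node` up to the root (node first, root last).
fn chain_to_root(parents: &HashMap<u32, u32>, mut node: u32) -> Vec<u32> {
    let mut chain = vec![node];
    while let Some(&p) = parents.get(&node) {
        chain.push(p);
        node = p;
    }
    chain
}

// Build both full chains, then work backward from the root end until the
// chains differ; the last element they agree on is the nearest common
// ancestor. Assumes both nodes are in the same tree.
fn nearest_common_ancestor(parents: &HashMap<u32, u32>, a: u32, b: u32) -> u32 {
    let ca = chain_to_root(parents, a);
    let cb = chain_to_root(parents, b);
    let mut nca = *ca.last().unwrap(); // both chains end at the root
    for (&x, &y) in ca.iter().rev().zip(cb.iter().rev()) {
        if x != y {
            break;
        }
        nca = x;
    }
    nca
}

fn main() {
    // Tree: 1 is the root; 2 and 3 are children of 1; 4 and 5 are children of 2.
    let parents: HashMap<u32, u32> =
        [(2, 1), (3, 1), (4, 2), (5, 2)].iter().cloned().collect();
    assert_eq!(nearest_common_ancestor(&parents, 4, 5), 2);
    assert_eq!(nearest_common_ancestor(&parents, 4, 3), 1);
}
```

<p>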
This is a reasonable algorithm in many circumstances, but some ad hoc instrumentation (<code>eprintln!<\/code> statements plus some simple post-processing) showed that the scope chains usually only differ by a handful of elements at the front and then have very long common tails, with dozens or even hundreds of elements. This PR switched to a different algorithm that looks for differences from the front of the scope chain, speeding up runs for many benchmarks, the best by 8%.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/50174\">#50174<\/a>: By default, Rust&#8217;s <code>HashSet<\/code> and <code>HashMap<\/code> use a hash function that is very high quality but also very slow. Therefore, the Rust compiler internally uses different types, <code>FxHashSet<\/code> and <code>FxHashMap<\/code>, which are identical to the standard ones except they use a much faster hash function. Unfortunately, it&#8217;s easy to forget about them and use the standard hash tables instead. Cachegrind&#8217;s output showed that the default hash function code (SipHash) was executed a lot, and that one particularly hot hash table (the symbol interner) was using <code>HashMap<\/code>. This PR (trivially) changed that table to an <code>FxHashMap<\/code>, speeding up runs for numerous benchmarks, the best by 5%.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/50240\">#50240<\/a>: Some of Rust&#8217;s standard containers, such as <code>Vec<\/code>, <code>HashSet<\/code>, and <code>HashMap<\/code>, have the nice property that by default they don&#8217;t allocate until an element is inserted. This is good because it&#8217;s surprising how often such containers are created but never inserted into. DHAT&#8217;s output showed that such behaviour would also help with a couple of the compiler&#8217;s uses of BTreeMap. 
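<\/p>
<p>This lazy-allocation property is easy to observe directly (a small stand-alone sketch):<\/p>

```rust
use std::collections::HashMap;

fn main() {
    // `Vec::new` and `HashMap::new` perform no heap allocation: capacity
    // stays at zero until the first insertion forces an allocation.
    let v: Vec<u32> = Vec::new();
    assert_eq!(v.capacity(), 0);

    let mut m: HashMap<u32, u32> = HashMap::new();
    assert_eq!(m.capacity(), 0);

    m.insert(1, 2); // the first insertion allocates
    assert!(m.capacity() > 0);
}
```

<p>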
I tried and failed to implement this behaviour directly in BTreeMap; according to Gankro, &#8220;BTreeMap is some of the <a href=\"https:\/\/github.com\/rust-lang\/rust\/blob\/7f3444e1baf0d335b4bf379f845dbc28cdd0509c\/src\/liballoc\/btree\/node.rs#L257-L280\">most complex unsafe code in libstd<\/a>&#8221; and &#8220;<a href=\"https:\/\/twitter.com\/Gankro\/status\/989683166658121728\">I just scared off a grizzled firefox dev explaining it<\/a>&#8221;! Instead this PR introduced a thin wrapper type (<code>LazyBTreeMap<\/code>) around <code>BTreeMap<\/code> and used it in the handful of relevant places within the compiler, speeding up the runs for several benchmarks, the best by 3%. <a href=\"https:\/\/github.com\/rust-lang\/rust\/issues\/50266\">#50266<\/a> is open to do the general fix for <code>BTreeMap<\/code>, whereupon <code>LazyBTreeMap<\/code> will be able to be removed.<\/p>\n<p><a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/50246\">#50246<\/a>: Cachegrind&#8217;s output showed that a function named <code>dump_allocs<\/code> was hot for some benchmarks. This sounded to me like a logging or debugging function of some kind, and investigation confirmed that it was traversing data structures in order to build up strings that went unused in the standard case where logging is disabled. This PR (trivially) changed this function and a couple of related ones to be no-ops if logging is disabled, speeding up runs for coercions, tuple-stress, html5ever, and encoding, the best by almost 15%! This shows how <em>not doing unclever things<\/em> is often as important as <em>doing clever things<\/em> when it comes to optimizing software.<\/p>\n<p><strong>Update:<\/strong> It&#8217;s worth noting that I also made three or four optimization attempts that didn&#8217;t work out &#8212; where I made a change that seemed like it should help, based on profiling data, but the effect was negligible. 
Success isn&#8217;t guaranteed!<\/p>\n<h3>Future work<\/h3>\n<p>All of the PRs mentioned above (except for the aborted <code>BTreeMap<\/code> change) involved small, simple changes to the Rust compiler&#8217;s code. I&#8217;m not a rustc expert, but I do know how to use a couple of profilers well, and I&#8217;ve been able to make a difference. I&#8217;m sure there are more improvements of this nature to be made, and I encourage other people to try profiling rustc with their favourite profilers to see what they can find. This is valuable because rustc&#8217;s speed is something that Rust users often complain about. And it&#8217;s fun, if you like that sort of thing \ud83d\ude42\u00a0 I&#8217;m happy to help people, and the members of the #rustc IRC channel are very friendly and helpful.<\/p>\n<p>Having said that, in a lot of cases, especially for opt builds, the majority of execution time is within LLVM, which rustc uses for code generation. Speeding up LLVM itself may be difficult, but I hope\/suspect there is room for improvement in the way that rustc interacts with LLVM. If anyone has ideas on that front I&#8217;d love to hear about them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>18 months ago I wrote about some work I did to speed up the Rust compiler (rustc). I&#8217;ve recently taken this work up again. Also, in the meantime rustc&#8217;s build system has been replaced and its benchmark suite has been overhauled. So it&#8217;s a good time for an update. 
Getting the code The steps for [&hellip;]<\/p>\n","protected":false},"author":139,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[311,16179],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/3175"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/users\/139"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/comments?post=3175"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/3175\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/media?parent=3175"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/categories?post=3175"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/tags?post=3175"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}