The Rust compiler is getting faster

TL;DR: The Rust compiler has gotten 1.06x–4x faster over the past month.

As changes are made to the Rust compiler, a suite of benchmarks measuring compile time is run regularly on the development version. The data is viewable at http://perf.rust-lang.org. The default view is graphical, showing data from the past month.

Screenshot of perf.rust-lang.org showing measurements of the html5ever benchmark

The screenshot above shows the graphs for a single benchmark called “html5ever”, which consists of an old version of the project of the same name. Each graph shows measurements for a different kind of build: a debug build, a “check” build (which detects errors but doesn’t generate code), and an optimized build. Within each graph there are the following three data series.

  • Clean: a normal build.
  • Baseline incremental: an incremental build with no prior incremental runs. Such a build is a little slower than a normal build, because it does normal compilation and also gathers information to guide subsequent incremental builds.
  • Clean incremental: an incremental build run immediately after a baseline incremental build. This is the best-case scenario for incremental compilation in which the minimal amount of work is done.

If you visit the site yourself you’ll see that most of the benchmarks have more than three data series, including ones for incremental builds done after small code changes (a more realistic use case), and one for builds with non-lexical lifetimes enabled.

The x-axis shows time and the y-axis shows instruction counts. Other units of measurement are available, including cycles, wall-clock time, and memory usage. Instruction counts are shown as the default; this isn’t ideal because it’s only a proxy for the measurement that really matters (time)… but it’s a pretty good proxy, and it has much lower variation than the time measurements, which is important when detecting changes.

This graphical view is particularly useful for detecting major changes. For example, you can see that in early May there was a major regression for “clean” and “baseline incremental” builds, which Alex Crichton fixed a few days later.

As well as the graphical view, the site also provides a textual “compare” view, which can be reached via the link at the top left of each page. This view compares measurements from two revisions of the compiler; by default it compares the most recently measured revision with one from a month ago. (It can also be used locally, which is very useful to evaluate changes that speed up the compiler.)

Screenshot of perf.rust-lang.org’s “compare” page

The screenshot above is of the “compare” view at the time of writing. Each line corresponds to a single graph from the graphical view. (If you visit the site and click on an individual entry it will expand and show all of the measurements. The resemblance between those measurements and this screenshot will of course diminish over time.) The “avg” column shows the average change across all the data series. The “min” and “max” columns show the minimum and maximum changes for any of the data series. The “serde” and “script-servo” lines are empty because those benchmarks were added to the suite less than a month ago, so no comparison can be made.

The table has many numbers, but the thing to take away is that they are almost all significantly negative, meaning that compile times have decreased. The “avg” numbers range from 6% to 38%; the “min” numbers (i.e. the best results) go as high as 75%; the “max” numbers (i.e. the worst results) go as high as 36%.

In conclusion: the Rust compiler has gotten significantly faster in the past month. Across a wide range of programs, and a wide range of build configurations, compile times have reduced by between 6% and 75%. To put it another way, the compiler has gotten between 1.06x and 4x faster.

These benefits are available right now to users of the Nightly channel. Users of the Release channel will see them more gradually, spread across one or two versions released over the next few months.

How to speed up the Rust compiler in 2018

18 months ago I wrote about some work I did to speed up the Rust compiler (rustc). I’ve recently taken this work up again. Also, in the meantime rustc’s build system has been replaced and its benchmark suite has been overhauled. So it’s a good time for an update.

Getting the code

The steps for getting the rustc code haven’t changed. First, I fork the main Rust repository on GitHub. Then I make two local clones: a base clone that I won’t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). I use commands something like this:

user=nnethercote
for r in rust0 rust1 ; do
  git clone https://github.com/$user/rust $r
  cd $r
  git remote add upstream https://github.com/rust-lang/rust
  git remote set-url origin git@github.com:$user/rust
done

Building the Rust compiler

The compiler’s build system is complex with many possible invocations and configurations. I’ll cover the absolute minimum information required to understand how I’ve been using it.

First, you need a config.toml file, which sits at the root of the repository and dictates the compiler’s configuration. I used the provided config.toml.example as a starting point, cut it down a lot, and ended up with the following.

[llvm]
optimize = true
release-debuginfo = true
assertions = false

[rust]
optimize = true
codegen-units = 1
debug-assertions = false
debuginfo = true
debuginfo-lines = true
use-jemalloc = false

It’s possible that some of these lines just restate defaults, but I figure it doesn’t hurt to be explicit. This configuration has the following characteristics.

  • It results in the production of a fully optimized rustc, which is important for profiling. The exception to this is that I disable jemalloc, because DHAT doesn’t work when jemalloc is enabled.
  • It has maximal debug info present, which ensures that profiles are as detailed as possible.

Building rustc can be confusing, particularly because of the multiple compiler stages and the terminology around them. Here is the command I use.

./x.py build --stage 2 src/rustc

This produces a stage 2 compiler that can handle procedural macros and so is suitable for profiling the benchmark suite.

The benchmark suite

rustc’s benchmark suite is called rustc-perf. It consists of two parts:

  • collector: The 24 benchmark programs, 14 of which are “real” code (i.e. crates used in real applications), and 10 of which are toy programs or stress test microbenchmarks. Also, the harness code that runs and measures them.
  • site: Code for displaying measurements as a website.

The test harness is very thorough. For each benchmark, it measures Debug, Opt, and Check (no code generation) invocations. Furthermore, within each of those categories, it does five or more runs, including a normal build and various kinds of incremental builds. A full benchmarking run measures over 400 invocations of rustc. Note however that only the compilation of the final crate is measured in each benchmark; compilation of dependent crates is not included.

rustc-perf was created primarily for the perf.rust-lang.org site, which tracks rustc’s performance over time. I recently modified it so it could be used to compare two builds of rustc on a local machine, which is a fundamental operation when optimizing. This can be done by running the suite and site locally and navigating to the local “compare” page, which looks like this:

Screenshot of rustc-perf's compare page

Note that rustc-perf uses perf-stat for its measurements, so the benchmarking functionality currently only works on Linux.

I also extended rustc-perf so the benchmarks can be run under a profiler. I implemented support for perf-record, Cachegrind, and DHAT, because they are the profilers that I am most familiar with; it isn’t hard to add support for other profilers (including non-Linux ones). The advantage of integrating support for a profiler into rustc-perf is that it gets the profiler invocations underneath the Cargo invocations, ensuring that the right rustc invocations are measured.

Wins

Here are the improvements I’ve made to rustc over the past few weeks.

#49993: Cachegrind’s output showed that derived methods for the Token type were taking up significant time. This PR changed the ‘#’ counts in raw Lit variants from usize to u16, which reduced the size of Token from 32 to 24 bytes, speeding up some of the runs for coercions and html5ever by 1%.
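
The size effect is easy to reproduce outside the compiler: an enum is roughly as big as its largest variant plus a discriminant, so shrinking one oversized field shrinks every token that gets copied around. The types below are hypothetical stand-ins, not rustc’s actual Token.

use std::mem::size_of;

// Hypothetical stand-ins for illustration; not rustc's actual Token type.
enum WideToken {
    Ident(u32),
    RawStr { symbol: u32, hash_count: usize }, // usize is 8 bytes on 64-bit targets
}

enum NarrowToken {
    Ident(u32),
    RawStr { symbol: u32, hash_count: u16 },   // a u16 is plenty for a '#' count
}

fn main() {
    // On a 64-bit target the first enum is larger than the second, purely
    // because of the usize field and the padding it drags in.
    println!("WideToken:   {} bytes", size_of::<WideToken>());
    println!("NarrowToken: {} bytes", size_of::<NarrowToken>());
}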

#50051: Rust’s Option::ok_or function transforms an Option<T> value into a Result<T, E>. It’s a bit of a footgun… as the docs say:

Arguments passed to ok_or are eagerly evaluated; if you are passing the result of a function call, it is recommended to use ok_or_else, which is lazily evaluated.

DHAT’s output showed that one hot allocation site involved the creation of a String while getting the value of the MIRI_BACKTRACE environment variable. This seemed to me like a strange thing to be happening frequently, and it turns out that it was at the end of a chain of calls that were only needed in the (rare) error case of an ok_or call. This PR changed the code to use a trivial closure with ok_or_else, speeding up runs for a lot of benchmarks, the best by 6%.
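
Here is a minimal illustration of the difference; expensive_default is a hypothetical stand-in for the environment-variable lookup and String construction that rustc was doing on every call.

use std::env;

// `expensive_default` is a hypothetical stand-in for the work rustc was
// doing on every call: reading an environment variable and building a
// String, even though the value was only needed in the rare error case.
fn expensive_default() -> String {
    env::var("MIRI_BACKTRACE").unwrap_or_default()
}

fn lookup(x: Option<i32>) -> Result<i32, String> {
    // Eager: with `x.ok_or(expensive_default())`, expensive_default() runs
    // even when `x` is Some and its result is thrown away.
    // Lazy: with ok_or_else, the closure only runs in the None case.
    x.ok_or_else(expensive_default)
}

fn main() {
    assert_eq!(lookup(Some(3)), Ok(3));
    assert!(lookup(None).is_err());
}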

#50052: DHAT’s output showed that the char_lit function, which parses “\u{…}” literals, was doing a lot of allocations while stripping out ‘_’ chars. This PR avoided that allocation, speeding up various runs — particularly ones for regex, futures, clap, coercions, hyper, and encoding — the best by 6%.
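
A sketch of the general idea (not the PR’s exact code): fold the hex digits straight into the value, skipping the ‘_’ separators, instead of first building a new String with the separators removed and then parsing that.

// A sketch of the idea, not the PR's exact code: accumulate the hex digits
// directly, skipping '_' separators, with no intermediate String.
fn parse_unicode_escape(digits: &str) -> Option<char> {
    let mut value: u32 = 0;
    for c in digits.chars() {
        if c == '_' {
            continue; // separators are skipped, never copied anywhere
        }
        value = value.checked_mul(16)?.checked_add(c.to_digit(16)?)?;
    }
    std::char::from_u32(value)
}

fn main() {
    assert_eq!(parse_unicode_escape("6_1"), Some('a'));
    assert_eq!(parse_unicode_escape("11_0000"), None); // beyond char::MAX
}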

#50106: Cachegrind’s output showed that the nearest_common_ancestor function, which computes the lowest common ancestor of two nodes in a scope tree, was very hot. The algorithm in use constructed the full scope chain for each node, and then worked backward from the end of the two scope chains until a difference was found. This is a reasonable algorithm in many circumstances, but some ad hoc instrumentation (eprintln! statements plus some simple post-processing) showed that the scope chains usually only differ by a handful of elements at the front and then have very long common tails, with dozens or even hundreds of elements. This PR switched to a different algorithm that looks for differences from the front of the scope chain, speeding up runs for many benchmarks, the best by 8%.
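
Here is one way to exploit that observation, sketched with a hypothetical parent map rather than rustc’s actual scope tree: walk upward from both nodes at once, remembering what each walk has seen, and stop at the first scope the other walk has already visited. Because the chains usually diverge only near the nodes, this stops after a handful of steps instead of traversing hundreds of shared ancestors.

use std::collections::{HashMap, HashSet};

// A hypothetical scope tree: each scope maps to its parent, and the root is
// absent from the map. Walk upward from both nodes at once, remembering what
// each walk has seen, and return the first scope the other walk has visited.
fn nearest_common_ancestor(parent: &HashMap<u32, u32>, a: u32, b: u32) -> u32 {
    let mut seen_a = HashSet::new();
    let mut seen_b = HashSet::new();
    let (mut cur_a, mut cur_b) = (Some(a), Some(b));
    loop {
        if let Some(x) = cur_a {
            if seen_b.contains(&x) {
                return x;
            }
            seen_a.insert(x);
            cur_a = parent.get(&x).copied();
        }
        if let Some(y) = cur_b {
            if seen_a.contains(&y) {
                return y;
            }
            seen_b.insert(y);
            cur_b = parent.get(&y).copied();
        }
    }
}

fn main() {
    // Scope 1 is the root; 2 and 5 hang off 1; 3 and 4 hang off 2.
    let parent = HashMap::from([(2u32, 1u32), (3, 2), (4, 2), (5, 1)]);
    assert_eq!(nearest_common_ancestor(&parent, 3, 4), 2);
    assert_eq!(nearest_common_ancestor(&parent, 3, 5), 1);
    assert_eq!(nearest_common_ancestor(&parent, 3, 3), 3);
}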

#50174: By default, Rust’s HashSet and HashMap use a hash function that is very high quality but also very slow. Therefore, the Rust compiler internally uses different types, FxHashSet and FxHashMap, which are identical to the standard ones except they use a much faster hash function. Unfortunately, it’s easy to forget about them and use the standard hash tables instead. Cachegrind’s output showed that the default hash function code (SipHash) was executed a lot, and that one particularly hot hash table (the symbol interner) was using HashMap. This PR (trivially) changed that table to an FxHashMap, speeding up runs for numerous benchmarks, the best by 5%.
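
The pattern is just the standard HashMap with its hasher type parameter swapped out. The hasher below is deliberately trivial and is only there to show the plumbing; rustc’s real FxHasher is a proper (and fast) hash function.

use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A deliberately trivial hasher, standing in for rustc's FxHasher just to
// show the plumbing. It is not DoS-resistant, which is fine for a compiler
// hashing its own internal data.
#[derive(Default)]
struct ToyHasher(u64);

impl Hasher for ToyHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.rotate_left(8) ^ b as u64;
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// The FxHashMap pattern: the same HashMap type, with the default
// SipHash-based hasher replaced via the third type parameter.
type FastHashMap<K, V> = HashMap<K, V, BuildHasherDefault<ToyHasher>>;

fn main() {
    let mut interner: FastHashMap<&'static str, u32> = FastHashMap::default();
    interner.insert("foo", 0);
    interner.insert("bar", 1);
    assert_eq!(interner.get("bar"), Some(&1));
}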

#50240: Some of Rust’s standard containers, such as Vec, HashSet, and HashMap, have the nice property that by default they don’t allocate until an element is inserted. This is good because it’s surprising how often such containers are created but never inserted into. DHAT’s output showed that such behaviour would also help with a couple of the compiler’s uses of BTreeMap. I tried and failed to implement this behaviour directly in BTreeMap; according to Gankro, “BTreeMap is some of the most complex unsafe code in libstd” and “I just scared off a grizzled firefox dev explaining it“! Instead this PR introduced a thin wrapper type (LazyBTreeMap) around BTreeMap and used it in the handful of relevant places within the compiler, speeding up the runs for several benchmarks, the best by 3%. #50266 is open to do the general fix for BTreeMap, whereupon LazyBTreeMap will be able to be removed.
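
A minimal sketch of the wrapper idea (hypothetical and simplified, not the actual LazyBTreeMap): hold an Option<BTreeMap> and only construct the inner map, which is what allocates, on the first insertion.

use std::collections::BTreeMap;

// Hypothetical, simplified sketch: the inner BTreeMap (which, at the time,
// allocated as soon as it was created) is only constructed on first insert.
struct LazyBTreeMap<K, V>(Option<BTreeMap<K, V>>);

impl<K: Ord, V> LazyBTreeMap<K, V> {
    fn new() -> Self {
        LazyBTreeMap(None) // no allocation here
    }

    fn insert(&mut self, key: K, value: V) -> Option<V> {
        self.0.get_or_insert_with(BTreeMap::new).insert(key, value)
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.0.as_ref().and_then(|m| m.get(key))
    }
}

fn main() {
    // Many such maps are created and never inserted into; those now cost nothing.
    let mut m: LazyBTreeMap<u32, &str> = LazyBTreeMap::new();
    assert!(m.get(&1).is_none());
    m.insert(1, "one"); // the inner BTreeMap is created here
    assert_eq!(m.get(&1), Some(&"one"));
}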

#50246: Cachegrind’s output showed that a function named dump_allocs was hot for some benchmarks. This sounded to me like a logging or debugging function of some kind, and investigation confirmed that it was traversing data structures in order to build up strings that went unused in the standard case where logging is disabled. This PR (trivially) changed this function and a couple of related ones to be no-ops if logging is disabled, speeding up runs for coercions, tuple-stress, html5ever, and encoding, the best by almost 15%! This shows how not doing unclever things is often as important as doing clever things when it comes to optimizing software.
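
The shape of the fix is generic: check the cheap “is logging enabled?” condition up front, before doing any of the work needed to build the output. The log_enabled function and the Allocation type here are hypothetical stand-ins, not the compiler’s actual logging machinery.

// Hypothetical stand-ins; rustc's real code uses its logging framework's
// own level checks rather than an environment variable like this.
fn log_enabled() -> bool {
    std::env::var_os("RUSTC_DUMP_ALLOCS").is_some()
}

struct Allocation {
    id: u32,
    bytes: usize,
}

// Before: the output was always built, then thrown away when logging was off.
fn dump_allocs_eager(allocs: &[Allocation]) {
    let mut msg = String::new();
    for a in allocs {
        msg.push_str(&format!("alloc {}: {} bytes\n", a.id, a.bytes));
    }
    if log_enabled() {
        print!("{}", msg);
    }
}

// After: bail out immediately in the common case where logging is disabled,
// so no traversal and no string building happens at all.
fn dump_allocs(allocs: &[Allocation]) {
    if !log_enabled() {
        return;
    }
    for a in allocs {
        println!("alloc {}: {} bytes", a.id, a.bytes);
    }
}

fn main() {
    let allocs = [Allocation { id: 1, bytes: 64 }, Allocation { id: 2, bytes: 128 }];
    dump_allocs_eager(&allocs);
    dump_allocs(&allocs);
}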

Update: It’s worth noting that I also made three or four optimization attempts that didn’t work out — where I made a change that seemed like it should help, based on profiling data, but the effect was negligible. Success isn’t guaranteed!

Future work

All of the PRs mentioned above (except for the aborted BTreeMap change) involved small, simple changes to the Rust compiler’s code. I’m not a rustc expert, but I do know how to use a couple of profilers well, and I’ve been able to make a difference. I’m sure there are more improvements of this nature to be made, and I encourage other people to try profiling rustc with their favourite profilers to see what they can find. This is valuable because rustc’s speed is something that Rust users often complain about. And it’s fun, if you like that sort of thing 🙂  I’m happy to help people, and the members of the #rustc IRC channel are very friendly and helpful.

Having said that, in a lot of cases, especially for opt builds, the majority of execution time is within LLVM, which rustc uses for code generation. Speeding up LLVM itself may be difficult, but I hope/suspect there is room for improvement in the way that rustc interacts with LLVM. If anyone has ideas on that front I’d love to hear about them.

A New Preferences Parser for Firefox

Firefox’s preferences system uses data files to store information about default preferences within Firefox, and user preferences in a user’s profile (such as prefs.js, which records changes to preference values, and user.js, which allows users to override default preference values).

A new parser

These data files use a custom format, and therefore Firefox has a custom parser for them. I recently rewrote the parser. The new parser has the following benefits over the old parser.

  • It is faster (raw parsing speed is close to 2x faster).
  • It is safer (because it’s written in Rust rather than C++).
  • It is more correct and better tested (the old one got various obscure edge cases wrong).
  • It is more readable, and easier to modify.
  • It issues no warnings, only errors.
  • It is slightly stricter (e.g. doesn’t allow any malformed input, and it catches integer overflow).
  • It has error recovery and better error messages (including correct line numbers).

Modifiability

Modifiability was the prime motivation for the change. I wanted to make some adjustments to the preferences file grammar, but this would have been very difficult in the old parser, because it was written in an awkward style.

It was essentially a single loop containing a giant switch statement on a state variable. This switch was executed for every single char in a file. The states held by the state variable had names like PREF_PARSE_QUOTED_STRING, PREF_PARSE_UNTIL_OPEN_PAREN, PREF_PARSE_COMMENT_BLOCK_MAYBE_END. It also had a second state variable, because in some places a single one wasn’t enough; the parser had to return to the previous state after exiting the current state. Furthermore, lexing and parsing were not separate, so code to handle comments and whitespace was spread around in various places.

The new parser is a recursive descent parser — even though the grammar doesn’t actually have any recursion — in which the structure of the code reflects the structure of the grammar, and lexing is distinct from parsing. As a result, the new parser is much easier to read and modify. In particular, after landing it I added error recovery without too much effort; that would have been almost impossible in the old parser.
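
To give a flavour of the difference, here is a schematic sketch in Rust (not Mozilla’s actual parser, which handles a much richer grammar): each grammar rule becomes its own small function, and the lexer has already discarded comments and whitespace, so the parser never sees them.

// A schematic sketch of the recursive descent style, not Mozilla's parser.
// The only grammar handled here is:  pref ( "name" , true|false ) ;
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Pref,
    OpenParen,
    CloseParen,
    Comma,
    Semicolon,
    Str(String),
    Bool(bool),
}

#[derive(Debug, PartialEq)]
struct PrefLine {
    name: String,
    value: bool,
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    // Lexing is separate from parsing: by the time we get here, comments and
    // whitespace are already gone (the lexer is omitted from this sketch).
    fn new(tokens: Vec<Token>) -> Parser {
        Parser { tokens, pos: 0 }
    }

    fn expect(&mut self, t: Token) -> Result<(), String> {
        if self.tokens.get(self.pos) == Some(&t) {
            self.pos += 1;
            Ok(())
        } else {
            Err(format!("expected {:?} at token {}", t, self.pos))
        }
    }

    // One function per grammar rule: the code structure mirrors the grammar.
    fn parse_pref(&mut self) -> Result<PrefLine, String> {
        self.expect(Token::Pref)?;
        self.expect(Token::OpenParen)?;
        let name = match self.tokens.get(self.pos).cloned() {
            Some(Token::Str(s)) => { self.pos += 1; s }
            _ => return Err("expected a string".into()),
        };
        self.expect(Token::Comma)?;
        let value = match self.tokens.get(self.pos).cloned() {
            Some(Token::Bool(b)) => { self.pos += 1; b }
            _ => return Err("expected a boolean".into()),
        };
        self.expect(Token::CloseParen)?;
        self.expect(Token::Semicolon)?;
        Ok(PrefLine { name, value })
    }
}

fn main() {
    // Tokens for:  pref("browser.example", true);   (a made-up pref name)
    let tokens = vec![
        Token::Pref,
        Token::OpenParen,
        Token::Str("browser.example".to_string()),
        Token::Comma,
        Token::Bool(true),
        Token::CloseParen,
        Token::Semicolon,
    ];
    let pref = Parser::new(tokens).parse_pref().unwrap();
    assert_eq!(pref, PrefLine { name: "browser.example".to_string(), value: true });
}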

Note that the idea of error recovery for preferences parsing was first proposed in bug 107264, filed in 2001! After landing the change I tweeted about it, and amazingly enough the original reporter is on Twitter and responded!

Strictness

The new parser is slightly stricter and rejects some malformed input that the old parser accepted.

Junk chars

Disconcertingly, the old parser allowed arbitrary junk between preferences (including at the start and end of the prefs file) so long as that junk didn’t include any of the following chars: ‘/’, ‘#’, ‘u’, ‘s’, ‘p’. This means that lines like these:

!foo@bar&pref("prefname", true);
ticky_pref("prefname", true);    // missing 's' at start
User_pref("prefname", true);     // should be 'u' at start

would all be treated the same as this:

pref("prefname", true);

The new parser disallows such junk because it isn’t necessary and seems like an unintentional botch by the old parser. In practice, this caught a couple of prefs that accidentally had an extra ‘;’ at the end.

SUB char

The old parser allowed the SUB (0x1a) character between tokens and treated it like ‘\n’.

The new parser does not allow this character. SUB was used to indicate end-of-file (not end-of-line) in some old operating systems such as MS-DOS, but this doesn’t seem necessary today.

Invalid escapes

The old parser tolerated (with a warning) invalid escape sequences within string literals — such as “\q” (not a valid escape) and “\x1” and “\u12” (both of which have insufficient hex digits) — accepting them literally.

The new parser does not tolerate invalid escape sequences because it doesn’t seem necessary and would complicate things.

NUL char

The old parser tolerated the NUL character (0x00) within string literals; this is dangerous because C++ code that manipulates string values with embedded NULs will almost certainly consider those chars as end-of-string markers.

The new parser treats the NUL character as end-of-file, to avoid this danger. (The escape sequences “\x00” and “\u0000” are also disallowed.)

Integer overflow

The old parser allowed integer literals to overflow, silently wrapping them.

The new parser treats integer overflow as a parse error. This seems better, and it caught overflows of several existing prefs.

Consequences

Error recovery minimizes the risk of data loss caused by the increased strictness: malformed pref lines in prefs.js will be removed, but well-formed pref lines after them are preserved.

Nonetheless, please keep an eye out for any other problems that might arise from this change.

Attributes

I mentioned before that I wanted to make some adjustments to the preferences file grammar. Specifically, I changed the grammar used by default preference files (but not user preference files) to support annotating each preference with one or more boolean attributes. The attributes supported so far are ‘sticky’ and ‘locked’. For example:

pref("sticky.pref", true, sticky);
pref("locked.pref", 123, locked);
pref("sticky-and-locked-pref", "blah", sticky, locked);

Note that the addition of the ‘locked’ attribute fixed a 10 year old bug.

When will this ship?

All of these changes are on track to ship in Firefox 60, which is due to release on May 9th.

Nicer commands for content processes

The current Firefox release creates content processes with very long and ugly commands. Here’s an example, as reported by ‘ps’ on my Linux64 machine (with wrapping to make it show up entirely on your screen):

/home/njn/moz/mi8/o64/dist/bin/firefox -contentproc -childID 2 -isForBrowser -in
tPrefs 6:50|7:-1|20:0|35:1000|43:20|44:5|45:10|52:0|58:128|59:10000|64:0|66:400|
67:1|68:0|69:0|70:100|75:0|76:120|77:120|162:2|163:1|167:60|168:30|169:512000|17
8:5000|180:6|194:8192|195:524288|196:5|209:10000|230:24|231:32768|233:0|234:0|24
3:5|247:1048576|249:100|250:5000|252:600|253:4|254:1|262:20|279:4|291:60000|309:
300|310:30| -boolPrefs 1:0|2:0|4:1|5:1|25:1|28:1|29:1|30:1|32:1|33:1|34:1|37:1|3
8:1|39:1|42:1|46:1|47:0|48:0|49:1|50:1|51:1|53:0|56:1|57:1|60:1|61:0|62:0|63:0|6
5:0|71:1|72:1|73:1|74:1|78:1|79:1|80:0|81:0|82:1|83:1|84:1|85:1|86:1|89:0|90:0|9
3:1|94:1|98:1|99:1|100:0|101:1|102:1|103:1|104:0|105:0|107:0|108:0|109:1|110:1|1
11:1|114:1|115:1|116:1|117:1|118:1|119:0|120:0|121:1|122:0|123:0|124:1|125:1|126
:0|127:1|128:1|129:0|131:1|132:0|133:0|134:1|135:1|136:0|137:0|138:0|139:1|140:1
|141:1|142:1|143:0|144:1|145:1|146:1|147:1|148:1|149:1|150:0|151:1|152:1|153:0|1
54:1|155:0|156:0|157:0|158:1|159:1|160:1|161:1|164:1|165:0|172:1|175:0|176:0|177
:1|181:0|183:1|184:0|185:1|187:1|189:0|190:0|193:0|197:1|198:0|199:1|200:1|201:0
|204:1|208:1|210:1|211:0|213:1|216:0|228:0|229:0|232:1|235:0|237:1|238:1|240:1|2
41:0|248:1|251:1|256:0|257:0|258:0|259:1|260:1|261:0|266:1|269:1|270:1|271:1|272
:1|273:1|274:0|275:1|281:1|284:0|285:1|286:1|287:0|288:1|289:1|290:1|292:0|293:0
|295:1|304:1|305:1|306:0|307:0|308:0| -stringPrefs 3:7;default|215:3;1.0|226:332
;  ¼½¾ǃː??։֊׃״؉؊٪۔܁܂܃܄ᅟᅠ᜵           ???‐ ’․‧??????? ‹›⁁⁄⁒ ⅓⅔⅕⅖⅗⅘⅙⅚?⅜⅝⅞⅟∕∶⎮╱⧶⧸⫻⫽>
⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻ 。〔〕〳゠ㅤ㈝㈞㎮㎯㏆㏟꞉︔︕︿﹝﹞?./。ᅠ???�|227:4;
high|280:36;97c47c17-80b5-497c-b5ee-1c5db2d06af6| -schedulerPrefs 0001,2 -appdir
 /home/njn/moz/mi8/o64/dist/bin/browser 4500 true tab

That’s 1733 characters! Most of it is for three flags, called -intPrefs, -boolPrefs, and -stringPrefs, which are used to pass values of prefs that are required very early in the content process’s existence, before the first normal IPC message is received. These pref values are encoded compactly, but there are enough of them that they make the command quite long.

Furthermore, there is one pref (network.IDN.blacklist_chars) that contains a bunch of unusual characters. That explains the gobbledygook on the 4th and 3rd last lines. Unsurprisingly, this gobbledygook was reported as a bug… which has been duplicated five times.

The good news is that I recently fixed this, by changing things so that only pref values that have changed since startup get passed in this way. (Content processes are able to see the startup values, and so don’t need to be told them.) On my machine and profile, this reduces the number of pref values passed via the command from ~222 to ~3. It also reduces the number sent later via IPC from ~3165 to ~180.

The command now looks like this:

/home/njn/local/install/firefox/firefox -contentproc -childID 4 -isForBrowser -b
oolPrefs 37:1|235:1|257:1|294:1| -stringPrefs 280:36;8bdf7a6e-d39a-494c-b55a-829
2c4fc6254| -schedulerPrefs 0001,2 -greomni /home/njn/local/install/firefox/omni.
ja -appomni /home/njn/local/install/firefox/browser/omni.ja -appdir /home/njn/lo
cal/install/firefox/browser 3237 true tab

That’s 361 characters, all of them ASCII. Much better! And I have a plan to reduce things even further.

This change is in Firefox 60, which is on track to be released around May 9th.

DMD is usable again on all Tier 1 platforms

DMD is a heap profiler built into Firefox, best known as the tool used to diagnose the sources of “heap-unclassified” memory in about:memory.

It’s been unusable on Win32 for a long time due to incredibly slow start-up times. And recently it became very slow on Mac due to a performance regression in libunwind.

Fortunately I have been able to fix this in both cases (Win32, Mac) by using FramePointerStackWalk() instead of MozStackWalk() to do the stack tracing within DMD. (The Gecko Profiler likewise uses FramePointerStackWalk() on those two platforms, and it was my recent work on the profiler that taught me that there was an alternative stack walker available.)

So DMD should be usable and effective on all Tier 1 platforms. I have tested it on Win32, Win64, Linux64, and Mac. I haven’t tested it on Linux32 or Android. Please let me know if you encounter any problems.

How we made compiler warnings fatal in Firefox

Compiler warnings are mostly good: they identify real problems, and when false positives do occur they are usually easy to work around. However, if they’re not fatal, they tend to be ignored and build up. (See bug 187528 for an idea of how bad this can get!)

One way to prevent the build-up is to make them fatal, so they become errors. But won’t that cause problems? Not if you’re careful. Here’s how we did it for Firefox.

  • Choose with some care which warnings end up fatal. Don’t be afraid to modify your choices as time goes on.
  • Introduce a mechanism for enabling fatal warnings on a per-directory basis. Mozilla’s custom build system used to have a directive called FAIL_ON_WARNINGS for this purpose.
  • Set things up so that fatal warnings are off by default, but enabled on continuous integration (CI). This means the primary coverage is via CI. You don’t want fatal warnings on by default because it causes problems for developers who use non-standard compilers (e.g. pre-release versions with new warning classes). Developers using the same compilers as CI can turn it on locally if they want without problem.
  • Allow per-file exceptions for particular kinds of warnings, because there are occasionally warnings you just want to ignore.
  • Fix warnings one directory at a time, and turn on fatal warnings for that directory as soon as it’s warning-free.
  • Invert the sense of the per-directory mechanism once you’ve converted more than half of the directories. For Mozilla code we now have the ALLOW_COMPILER_WARNINGS directive. It’s almost exclusively used for directories containing third-party code which is not under our control.
  • Gradually expand the coverage of which compilers you have fatal warnings for. Mozilla code now does this for GCC, clang, and MSVC.
  • Congratulations! You now have fatal warnings on everywhere that is practical.

With a setup like this, it’s possible for a patch to compile on a developer’s machine but fail to compile on CI. But that’s just one of many ways in which full CI runs may fail when local runs don’t. So it’s not as bad as it seems.

Also, before upgrading the compilers on CI you need to address any new warnings, by fixing them or suppressing them or filing a compiler bug report. But this isn’t a bad thing.

It took a long time for Firefox to reach this stage, but I think it was worth the effort. Thank you to Chris Peterson and all the others who helped with this.

MemShrink status

Mozilla’s MemShrink project started almost six years ago. It successfully reduced Firefox’s memory consumption to the point where Firefox is now widely (and accurately) seen as the least memory-hungry browser.

Most of the big improvements were made in the first two or three years of the MemShrink project, and it doesn’t get much publicity any more, but it is still ticking along quietly: meetings occur, regressions are tracked, measurements are made, and so on. Some time ago I passed leadership of it to the capable hands of Eric Rahm, who has just written a status update that is worth reading.

Eric also wrote recently about the reincarnation of areweslimyet.com. He maintained that site for several years, which was a thankless but important task, but he won’t need to any more because its functionality has been folded into Mozilla’s main performance monitoring tools. Thank you, Eric!

Improving the Gecko Profiler

Over the last three months I landed 167 patches, via 41 Bugzilla bugs, for the Gecko Profiler. These included crash fixes, assertion failure fixes, data race fixes, optimization fixes, and a great many refactorings.

Background

The Gecko Profiler is a profiler built into Firefox. It can be used with the Gecko Profiler Addon to profile Firefox. It also provides the core profiling mechanism that is used by Firefox’s devtools to profile JavaScript content code.

It’s a crucial component.  It was originally written 5 years ago in something of a hurry, because we desperately needed a built-in profiler, one that could give more detailed, custom information than is possible with an external profiler. As I understand it, part of it was imported from V8, so it was a mix of previously existing code and new code. And in the years since it has been extended by multiple people, in a variety of ways that didn’t necessarily mesh well.

As a result, at the start of Q1 it was in pretty bad shape. Crashes and assertion failures were frequent, and the code itself was hard to read and maintain. Markus Stange had recently taken over ownership, and had made some great improvements to the Addon, but the core code needed work as well. So I started digging in to see what I could find and improve. There was a lot!

Fixes

Bug 1331571. The profiler had code for incorporating power consumption estimates from the Intel Power Gadget. Unfortunately, this integration had major flaws: Intel Power Gadget only gives very coarse power consumption estimates; the profiler samples at 1000Hz and is CPU intensive and so is likely to skew the power consumption estimates significantly; and nobody had ever used it meaningfully. So I removed it.

Bug 1317771. The profiler had a “standalone” configuration that allowed it to be used in programs other than Firefox. But it was complex (lots of #ifdef statements) and broken and unlikely to be of use. So I removed it.

Bug 1328369, Bug 1328373: Dead code removal.

Bug 1332577. The public API for the profiler was a mess. It was split across several header files, and most API functions had an “outer” version with a profiler_ prefix that immediately called into an inner version with a mozilla_sampler_ prefix (though there were some inconsistencies). So I combined the various header files into a single file, GeckoProfiler.h, and simplified the API functions into a single level, all consistently named with a profiler_ prefix.

Bug 1333296. Even the name of the profiler was a source of confusion. It was originally known as “SPS”, which I believe is short for “simple profiling system”. At some point that changed to “the Gecko Profiler”, although it was also occasionally referred to as “the built-in profiler”! Because of this history, the code was littered with references to SPS. In this bug I updated them all to refer to the Gecko Profiler. (I updated the MDN docs, too. The page name still uses “Built-in Profiler” because I don’t know how to change MDN page names.)

Bug 1329684. I removed some mutex wrapper classes that I think were necessary at one point for the “standalone” configuration.

Bug 1328365. Thread-local storage was being used to store a pointer to an object that was only accessed on the main thread, so I changed it to be a global variable. I also renamed some variables whose names referred to a type that had been renamed a long time ago.

Bug 1333655. The profiler had a cross-platform thread abstraction that was clumsy and over-engineered, so I streamlined it.

Bug 1334466. The profiler had a class called Sampler, which I think was imported from V8, and a subclass called GeckoSampler. Both classes were fairly complex, and we only ever instantiated the subclass. The separation merely obscured things, so I merged the subclass into Sampler. Having all that code in a single class and a single module made it much easier to see exactly what it was doing.

Bug 1335595. Two classes, ThreadInfo and ThreadProfile, were used for per-thread data. They were hopelessly intertwined: each one had a pointer to the other, and multiple fields were present (i.e. duplicated) in both of them. So I just merged them.

Bug 1336326. Three minor clean-ups.

Bug 816598. I implemented a memory reporter for the profiler. This was first requested in 2012, and five duplicate bugs had been filed in the interim!

Bug 1126576. I removed some very grotty manual refcounting from the PseudoStack class, which simplified things. A little too much, in fact… I misunderstood how things worked, causing a crash, which I subsequently fixed in bug 1340161.

Bug 1337189. The aforementioned Sampler class was still over-engineered. It only ever had 0 or 1 instantiations, and basically was an unnecessary level of abstraction. In this bug I basically got rid of it by merging it into another file. Which took 27 patches! (One of these patches introduced a regression, which I later fixed in bug 1340327.) At this point a lot of the core code that had previously been spread across multiple files and classes was now in a single file, tools/profiler/core/platform.cpp, and it was becoming increasingly obvious that there was a lot of global state being accessed from multiple threads with insufficient thread synchronization.

Bug 1338957. The profiler tracks which threads are “sleeping” (i.e. blocked on some operation), to support an optimization where it samples sleeping threads in a cheaper-than-normal fashion. It was using two counters and a boolean to track the sleep state of each thread. These three values were accessed from multiple threads; two of them were atomic, and one wasn’t, so the whole setup was very racy. I managed to condense the three values into a single atomic tri-state value, which made things simpler and thread-safe.

Bug 1339327. I landed eight refactoring patches with no particular common theme, mostly involving renaming things and removing unnecessary stuff.

Bug 1339435. I removed two erroneous assertions that I had added in an earlier patch — two functions that I thought only ran on the main thread turned out to run off the main thread as well.

Bug 1339695. The profiler has a lot of code that is specific to a particular architecture (e.g. x86), OS (e.g. Windows), or platform (e.g. x86/Windows). The #ifdef statements used to select these were massively inconsistent — turns out there are many ways to detect this stuff — so I fixed this up. Among other things, this involved using the nice constants in tools/profiler/core/PlatformMacros.h consistently throughout the profiler’s code. (I fixed a regression — caused by mistyping one of the #ifdef conditions, alas! — from this change in bug 1350211. And another one involving --disable-profiling in bug 1348776.) I also renamed some files that had .cc extensions instead of the usual .cpp because they had (I think) been imported from V8.

Bug 1340928. At this point I had started working on a major change to the handling of the profiler’s core global state. It would inevitably be a big patch, but I wanted it to be as small as possible, so I started aggressively carving off small changes that could be landed separately. This bug featured 16 of them.

Bug 1328378. The profiler has two kinds of samples: periodic, which are taken by a separate thread in response to a timer firing, and synchronous, which a thread takes itself in response to a request via the profiler API. There are a lot of similarities between the two, but also some important differences. This bug took some steps to simplify the messy handling of synchronous samples.

Bug 1344118. I mentioned earlier that the profiler tracks which threads are “sleeping” to support an optimization: when a thread is asleep, we can mostly duplicate its last sample without unwinding its stack. But the optimization was buggy and would become a catastrophic pessimization in certain circumstances, due to what should have been a short O(1)-ish buffer search becoming O(n²)-ish, which would quickly peg one CPU at 100% usage. As far as I can tell, this bug was present in the optimization ever since it was implemented three years ago. (It’s possible it wasn’t noticed because its effect increases as more threads are profiled, but the profiler defaults to only profiling the main thread and the compositor thread.) The fix was straightforward once the diagnosis was made, and Julian Seward did a follow-up that made the optimization even more effective.

Bug 1342306. In this bug I put almost all of the profiler’s global state into a single class and protected accesses to it with a mutex. Unlike the old code, the new code is simple and obviously thread-safe. The final patch in this bug was much bigger than I would have liked, at 142 KiB, even after I carved off as many precursor patches as I could. Unsurprisingly, there were some follow-up fixes required: bug 1346356 (a leak and a deadlock), bug 1347044 (another deadlock), bug 1348374 (yet another deadlock), and bug 1350967 (surprise! another deadlock).

Bug 1345262. I fixed an assertion failure caused by the profiler and the JS engine having differing views about what functions should be called on what threads.

Bug 1347348. Five more assorted clean-ups.

Bug 1349856. I fixed a minor error involving a call to the profiler from Gecko.

Bug 1346132. I removed the profiler’s bespoke logging system, replacing it with the standard Mozilla one. I also made the logging output more concise and informative.

Bug 1350212. I cleaned up a class and its uses a bit.

Bug 1351523. I reordered one function’s arguments to match the order used in two related functions.

Bug 1351528. I removed some unused values from an enum.

Bug 1348024. I simplified some environment variables used by the profiler.

Bug 1351946. I removed some gnarly code for starting the profiler on B2G.

Bug 1351136. The profiler’s testing coverage is not very good, as can be seen from the numerous regressions I introduced and fixed. So I added a gtest that improves coverage. There’s still room for more test coverage improvement.

Bug 1351963. I further clarified the handling of synchronous vs. periodic samples, and simplified the ownership of some per-thread data structures.

Discussion

I learned some interesting things while doing this work.

Learning a component

Three months ago I knew almost nothing about the profiler’s code. Today I’m a module peer.

At the start of January I had been told that the profiler needed work, and I had assigned myself a Q1 deliverable to “land three improvements to the Gecko Profiler”. I started just by looking for easy, obvious code clean-ups, such as dead code removal, fixing inconsistent naming of things, and removing unnecessary indirections. (These are the kinds of clean-ups you can make with only shallow understanding.) The profiler proved to be a target-rich environment for such changes!

After I made a few dozen such changes I started understanding more deeply how the pieces fit together. (Partly because I’d been staring at the code a lot, and partly because my changes were making the code easier to understand. Refactorings add up.) I started interleaving my easy clean-up patches with ones that required more insight into how the profiler worked. I made numerous mistakes along the way, as the various regression fixes above show. But that’s ok.

I also kept a text file in which I had a list of ideas for things to fix. Every time I saw something that looked like it could be improved, I added it to the file, and I repeatedly checked the file when deciding what to work on next. As my understanding of the code improved, multiple times I realized that items I had written down were wrong, or incomplete, or that seemingly separate things were related. (In fact, I’m still using the file, because I still have numerous things I want to improve.)

Multi-threaded programming basics

Although I first learned C and C++ about 20 years ago, and I have worked at Mozilla for more than 8 years, this was the first time I’ve ever done serious multi-threaded programming, i.e. at a level where I needed a reasonably deep understanding of how various threads can interact. I got the following two great tips from Julian Seward, which helped a lot.

  • Write down pseudocode for each thread.
  • Write down potential worst-case thread operation interleavings.

I also found it helpful to add comments (or assertions, where appropriate) to the top of functions that indicate which thread or threads they run on. For example:

void profiler_gathered_OOP_profile()
{
  MOZ_RELEASE_ASSERT(NS_IsMainThread());
  ...
}

and:

void profiler_thread_sleep()
{ 
  // This function runs both on and off the main thread.
  ...
}

A useful idiom: proof-of-lock tokens

I also employed a programming idiom that turned out to be extremely helpful. Most of the global profiler state is in a single class called ProfilerState. There is a single instance of this class, gPS, and a single mutex that protects it, gPSMutex. To guarantee that no code is able to access gPS‘s contents without first locking the mutex, for every field in ProfilerState there is a getter and a setter, both of which require a “proof-of-lock” token, which takes the form of a const PS::AutoLock&, where PS::AutoLock is an RAII type that locks and unlocks a mutex.

For example, consider this function, which checks if the profiler is paused.

bool profiler_is_paused()
{
  PS::AutoLock lock(gPSMutex);
  if (!gPS->IsActive(lock)) {
    return false;
  }
  return gPS->IsPaused(lock);
}

The PS::AutoLock locks the mutex. IsActive() and IsPaused() both access fields within gPS, and so they are passed lock, which serves as the proof-of-lock value. IsPaused() and SetIsPaused() are implemented as follows.

bool IsPaused(const PS::AutoLock&) const { return mIsPaused; }
void SetIsPaused(const PS::AutoLock&, bool aIsPaused) { mIsPaused = aIsPaused; }

Neither function actually uses the proof-of-lock token. Nonetheless, any function that calls a ProfilerState getter or setter must either lock gPSMutex, or have an ancestor that does. This idiom has two very nice benefits.

  • You can’t access gPS‘s contents without having first locked gPSMutex. (Well, it is possible to subvert the protection, but not by accident.)
  • It’s obvious that all functions that have a proof-of-lock argument are called only while gPSMutex is locked.

Functions that are called from multiple places sometimes must be split in two: an outer function in which the mutex is initially unlocked, and an inner function that takes a proof-of-lock token. This isn’t hard, though.

Deadlocks vs. data races

After my big change to the profiler’s global state, I had to fix numerous deadlocks. This wasn’t too hard. Deadlocks (caused by too much thread synchronization) are obvious, easy to diagnose, and these ones weren’t hard to fix. It’s useful to contrast them with data races (caused by too little thread synchronization) which typically have subtle effects and are difficult to diagnose.

Patch discipline

For this work I wrote a lot of small patches. This is my preferred way to work, for two reasons. First, small patches make life easier for reviewers, which in turn results in faster reviews. Second, small patches make regression hunting easy. It’s always nice when you bisect a regression to a small patch.

Future work and Thanks

The profiler still has plenty of room for improvement, and I’m planning to do more work on it in Q2. In the meantime, if you’ve tried the profiler in the past and had problems it might be worth trying again. It’s in much better shape now.

Finally, many thanks to Markus Stange for reviewing the majority of the patches and answering lots of questions, and Julian Seward for reviewing most of the remainder and for numerous helpful discussions about threaded programming.

How to speed up the Rust compiler some more

I recently wrote about some work I’ve done to speed up the Rust compiler. Since then I’ve done some more.

Heap allocations

My last post mentioned how heap allocations were frequent within rustc. This led to some wild speculation in some venues: about whether Rust should use heap allocation at all, about whether garbage collection was necessary, and so on. I have two comments about this.

The first is that rustc is a compiler, and like most compilers its execution is dominated by complex traversals of large tree structures: ASTs, IRs, etc. Tree structures typically require heap allocation. In particular, a lot of these tree structures contain Vec, HashMap and HashSet fields, all of which unavoidably use heap allocation.

The second is that although some heap allocation is unavoidable, the amount that rustc was doing was excessive. It is clear that nobody had made a concerted effort to minimize heap allocations, at least not for a long time. With the help of DHAT I’ve been able to greatly reduce the amount done. Any time you throw a new profiler at any large codebase that hasn’t been heavily optimized there’s a good chance you’ll be able to make some sizeable improvements.

Tools

As in my previous post, I focused almost entirely on the benchmarks present in rustc-benchmarks to guide my efforts.

Once again I mostly used Cachegrind and DHAT to profile these benchmarks. I also used Massif (plus the excellent massif-visualizer) to profile peak memory usage on one workload (see below).

I have also recently started using perf a bit more. Huon Wilson told me to add the --call-graph=dwarf flag to my perf record invocations. That improves things significantly, though I still find perf puzzling and frustrating at times, even after reading Brendan Gregg’s thorough examples page.

Wins

#37229: rustc spends a lot of time doing hash table lookups, enough that the cost of hashing is significant. In this PR I changed the hash function used by rustc (FNV) to one inspired by the hash function used within Firefox. The new function is faster because it can process an entire word at a time, rather than one byte at a time. The new hash function is slightly worse in terms of the number of collisions but the change sped up compilation of most workloads by 3–6%.
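
The difference is in the inner loop: instead of feeding the hash state one byte per step, the input is consumed eight bytes at a time. The two functions below are a toy illustration of that contrast, not the exact hash functions involved.

// Toy illustration of byte-at-a-time vs. word-at-a-time hashing; not the
// exact functions used by rustc.

// FNV-style: one xor and one multiply per input byte.
fn hash_bytewise(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h = (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

// Word-at-a-time: one rotate/xor/multiply per 8 bytes of input, so the loop
// body runs roughly 8x fewer times on long keys.
fn hash_wordwise(data: &[u8]) -> u64 {
    let mut h: u64 = 0;
    for chunk in data.chunks(8) {
        let mut word = [0u8; 8];
        word[..chunk.len()].copy_from_slice(chunk);
        h = (h.rotate_left(5) ^ u64::from_le_bytes(word))
            .wrapping_mul(0x517c_c1b7_2722_0a95);
    }
    h
}

fn main() {
    let key = b"core::option::Option<alloc::string::String>";
    println!("bytewise: {:016x}", hash_bytewise(key));
    println!("wordwise: {:016x}", hash_wordwise(key));
}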

#37161 & #37318 & #37373: These three PRs removed a lot of unnecessary heap allocations (mostly due to clone()) in the macro parser. They sped up the compilation of html5ever by 20%, 7%, and 2%, respectively.

#37267 & #37298: rustc uses “deflate” compression in a couple of places: crate metadata and LLVM bitcode. In the metadata case rustc was often compressing metadata and then throwing away the result! (Thanks to eddyb for diagnosing this and explaining to me how to fix it.) Avoiding this unnecessary work sped up compilation of syntex-incr by 6% and several others by 1–2%.

For the LLVM bitcode case I tweaked the deflate settings so that it ran almost twice as fast while the compression ratio was only slightly worse. This sped up compilation of syntex-incr by another 8% and a couple of others by 1–2%. It’s possible that switching to a different compression algorithm would help some more, though that would be a much larger change.

#37108: This PR avoided interning values of the Substs type in cases where it was easy to tell that the value had previously been interned. It sped up compilation of several benchmarks by 1–4%.

#37705: This PR avoided some unnecessary calls to mk_ty. It sped up compilation of one benchmark by 5% and a few others by 1–2%.

#37083: This PR inlined various methods involved with uleb128 decoding, which is used when reading crate metadata. This sped up compilation of several benchmarks by 1%.

#36973: This PR fixed things so that a data structure that is only required for incremental compilation is not touched during non-incremental compilation. This sped up compilation of several non-incremental benchmarks by 1%.

#37445 & #37764: One unusual workload was found to make rustc consume excessive amounts of memory (4.5 GiB!) which made it OOM on some machines. I used Massif to identify ways to improve this.

The first PR reduced the size of the Expr enum by shrinking the outsized InlineAsm variant, which reduced peak memory usage by 9%. The second PR removed scope_auxiliary, a data structure used only during MIR dumping (and of marginal utility even then). This reduced peak memory usage by another 10%.

I have filed a PR to add a cut-down version of this workload to rustc-benchmarks.

#36993: This PR did some manual inlining and restructuring of several ObligationForest functions. It sped up compilation of inflate by 2%.

#37642: This PR reduced some excessive indirection present in the representation of HIR. It sped up compilation of some benchmarks by 1%, 2%, and 4%.

#37427: rustc uses Blake2b hashing to determine when a function’s code has changed. This PR reduced the number of bytes to be hashed by (a) avoiding hashing filenames twice for each span, and (b) pre-uleb128-encoding 32-bit and 64-bit integers, which are usually small. This sped up compilation of syntex-incr by 2%.
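
LEB128 encoding is what makes the “usually small” part pay off: each output byte carries seven bits of the value plus a continuation bit, so small integers take one or two bytes regardless of whether they started out as 32-bit or 64-bit values. A minimal encoder for the unsigned variant (this is the standard algorithm, not rustc’s implementation):

// Unsigned LEB128: seven value bits per byte, with the high bit set on all
// but the last byte. Standard algorithm, not rustc's implementation.
fn write_uleb128(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    let mut buf = Vec::new();
    write_uleb128(3, &mut buf);        // small values take a single byte...
    write_uleb128(300, &mut buf);      // ...slightly larger ones take two...
    write_uleb128(1 << 40, &mut buf);  // ...and only big ones need more.
    assert_eq!(buf, [0x03u8, 0xac, 0x02, 0x80, 0x80, 0x80, 0x80, 0x80, 0x20]);
}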

Future work

As the compiler front-end gets faster, the proportion of rustc execution time spent in the LLVM back-end increases. For some benchmarks it now exceeds 50% (when doing debug builds), though there is still plenty of variation. Thanks to mrhota there is a new --enable-llvm-release-debuginfo configure option that (unsurprisingly) enables debuginfo within LLVM. This means that profilers can now give filenames and line numbers for LLVM. I’ve looked at a few places in LLVM that show up high in profiles, though I haven’t yet managed to make any useful changes to it.

Another interesting new development is pnkfelix’s -Zprint-type-sizes option, which should land soon, and will potentially be useful for any program written in Rust, not just rustc. This option will make it trivial to see how each type is laid out in memory, which will make it easy to see how types can be rearranged to be smaller. Up until now this has been a painful and imprecise exercise.

Finally, I want to encourage anyone else with the slightest knack and/or enthusiasm for optimizing code to take a look at rustc. You might be thinking that there isn’t that much low-hanging fruit left, but I am confident there is. It’s getting harder for me to find things to improve, but I am just one person with a particular background and a few preferred profiling tools. I am confident that other people with different backgrounds and tools will find plenty of stuff that I cannot. Rust compile times still aren’t great, even after these improvements, but the more people who pitch in the faster they’ll improve. Don’t be shy, and if you have any questions please contact me or ask in the #rust or #rustc IRC channels.

Faster Firefox startup & shutdown with add-ons present

Thanks to some recent improvements by Kris Maglione, add-ons built with Firefox’s Add-on SDK are now much faster to start up and shut down. For users with multiple add-ons installed, this can make a large difference. For example, one user reported the following.

I’ve noticed a dramatic improvement in startup and shutdown time since the fixes landed on aurora.

Configuration: Two windows with 15 tabs each and 22 addons.

  • Startup: 25 seconds -> 14 seconds (44% reduction)
  • Shutdown: 30 seconds -> 5 seconds (83% reduction)

Shutdown used to take so long it was triggering a minidump.

Your mileage may vary, but these improvements have the potential to benefit many users. They are present in Firefox 50 (currently in Beta, and due for release on November 15) and all later versions.