DMD is usable again on all Tier 1 platforms

DMD is heap profiler built into Firefox, best known for being the tool used to  diagnose the sources of “heap-unclassified” memory in about:memory.

It’s been unusable on Win32 for a long time due to incredibly slow start-up times. And recently it became very slow on Mac due to a performance regression in libunwind.

Fortunately I have been able to fix this in both cases (Win32, Mac) by using FramePointerStackWalk() instead of MozStackWalk() to do the stack tracing within DMD. (The Gecko Profiler likewise uses FramePointerStackWalk() on those two platforms, and it was my recent work on the profiler that taught me that there was an alternative stack walker available.)

So DMD should be usable and effective on all Tier 1 platforms. I have tested it on  Win32, Win64, Linux64 and Mac. I haven’t tested it in Linux32 or Android. Please let me know if you encounter any problems.

How we made compiler warnings fatal in Firefox

Compiler warnings are mostly good: they identify real problems, and when false positives do occur they are usually easy to work around. However, if they’re not fatal, they tend to be ignored and build up. (See bug 187528 for an idea!)

One way to prevent the build-up is to make them fatal, so they become errors. But won’t that cause problems? Not if you’re careful. Here’s how we did it for Firefox.

  • Choose with some care which warnings end up fatal. Don’t be afraid to modify your choices as time goes on.
  • Introduce a mechanism for enabling fatal warnings on a per-directory basis. Mozilla’s custom build system used to have a directive called FAIL_ON_WARNINGS for this purpose.
  • Set things up so that fatal warnings are off by default, but enabled on continuous integration (CI). This means the primary coverage is via CI. You don’t want fatal warnings on by default because it causes problems for developers who use non-standard compilers (e.g. pre-release versions with new warning classes). Developers using the same compilers as CI can turn it on locally if they want without problem.
  • Allow per-file exceptions for particular kinds of warnings, because there are occasionally warnings you just want to ignore.
  • Fix warnings one directory at a time, and turn on fatal warnings for that directory as soon as it’s warning-free.
  • Invert the sense of the per-directory mechanism once you’ve converted more than half of the directories. For Mozilla code we now have the ALLOW_COMPILER_WARNINGS directive. It’s almost exclusively used for directories containing third-party code which is not under our control.
  • Gradually expand the coverage of which compilers you have fatal warnings for. Mozilla code now does this for GCC, clang, and MSVC.
  • Congratulations! You now have fatal warnings on everywhere that is practical.

With a setup like this, it’s possible for a patch to compile on a developer’s machine but fail to compile on CI. But that’s just one of many ways in which full CI runs may fail when local runs don’t. So it’s not as bad as it seems.

Also, before upgrading the compilers on CI you need to address any new warnings, by fixing them or suppressing them or filing a compiler bug report. But this isn’t a bad thing.

It took a long time for Firefox to reach this stage, but I think it was worth the effort. Thank you to Chris Peterson and all the others who helped with this.

MemShrink status

Mozilla’s MemShrink project started almost six years ago. It successfully reduced Firefox’s memory consumption to the point where Firefox is now widely (and accurately) seen as the least memory-hungry browser.

Most of the big improvements were made in the first two or three years of the MemShrink project, and it doesn’t get much publicity any more, but it is still ticking along quietly: meetings occur, regressions are tracked, measurements are made, and so on. Some time ago I passed leadership of it to the capable hands of Eric Rahm, who has just written a status update that is worth reading.

Eric also wrote recently about the reincarnation of areweslimyet.com. He maintained that site for several years, which was a thankless but important task, but he won’t need to any more because its functionality has been folded into Mozilla’s main performance monitoring tools. Thank you, Eric!

Improving the Gecko Profiler

Over the last three months I landed 167 patches, via 41 Bugzilla bugs, for the Gecko Profiler. These included crash fixes, assertion failure fixes, data race fixes, optimization fixes, and a great many refactorings.

Background

The Gecko Profiler is a profiler built into Firefox. It can be used with the Gecko Profiler Addon to profile Firefox. It also provides the core profiling mechanism that is used by Firefox’s devtools to profile JavaScript content code.

It’s a crucial component.  It was originally written 5 years ago in something of a hurry, because we desperately needed a built-in profiler, one that could give more detailed, custom information than is possible with an external profiler. As I understand it, part of it was imported from V8, so it was a mix of previously existing code and new code. And in the years since it has been extended by multiple people, in a variety of ways that didn’t necessarily mesh well.

As a result, at the start of Q1 it was in pretty bad shape. Crashes and assertion failures were frequent, and the code itself was hard to read and maintain. Markus Stange had recently taken over ownership, and had made some great improvements to the Addon, but the core code needed work as well. So I started digging in to see what I could find and improve. There was a lot!

Fixes

Bug 1331571. The profiler had code for incorporating power consumption estimates from the Intel Power Gadget. Unfortunately, this integration had major flaws: Intel Power Gadget only gives very coarse power consumption estimates; the profiler samples at 1000Hz and is CPU intensive and so is likely to skew the power consumption estimates significantly; and nobody had ever used it meaningfully. So I removed it.

Bug 1317771. The profiler had a “standalone” configuration that allowed it to be used in programs other than Firefox. But it was complex (lots of #ifdef statements) and broken and unlikely to be of use. So I removed it.

Bug 1328369, Bug 1328373: Dead code removal.

Bug 1332577. The public API for the profiler was a mess. It was split across several header files, and most API functions had an “outer” version with a profiler_ prefix that immediately called into an inner version with a mozilla_sampler_ prefix (though there were some inconsistencies). So I combined the various header files into a single file, GeckoProfiler.h, and simplified the API functions into a single level, all consistently named with a profiler_ prefix.

Bug 1333296. Even the name of the profiler was a source of confusion. It was originally known as “SPS”, which I believe is short for “simple profiling system”. At some point that changed to “the Gecko Profiler”, although was also occasionally referred to as “the built-in profiler”! Because of this history, the code was littered with references to SPS. In this bug I updated them all to refer to the Gecko Profiler. (I updated the MDN docs, too. The page name still uses “Built-in Profiler” because I don’t know how to change MDN page names.)

Bug 1329684. I removed some mutex wrapper classes that I think were necessary at one point for the “standalone” configuration.

Bug 1328365. Thread-local storage was being used to store a pointer to an object that was only accessed on the main thread, so I changed it to be a global variable. I also renamed some variables whose names referred to a type that had been renamed a long time ago.

Bug 1333655. The profiler had a cross-platform thread abstraction that was clumsy and over-engineered, so I streamlined it.

Bug 1334466. The profiler had a class called Sampler, which I think was imported from V8, and a subclass called GeckoSampler. Both classes were fairly complex, and we only ever instantiated the subclass. The separation merely obscured things, so I merged the subclass into Sampler. Having all that code in a single class and a single module made it much easier to see exactly what it was doing.

Bug 1335595. Two classes, ThreadInfo and ThreadProfile, were used for per-thread data. They were hopelessly intertwined: each one had a pointer to the other, and multiple fields were present (i.e. duplicated) in both of them. So I just merged them.

Bug 1336326. Three minor clean-ups.

Bug 816598. I implemented a memory reporter for the profiler. This was first requested in 2012, and five duplicate bugs had been filed in the interim!

Bug 1126576. I removed some very grotty manual refcounting from the PseudoStack class, which simplified things. A little too much, in fact… I misunderstood how things worked, causing a crash, which I subsequently fixed in bug 1340161.

Bug 1337189. The aforementioned Sampler class was still over-engineered. It only ever had 0 or 1 instantiations, and basically was an unnecessary level of abstraction. In this bug I basically got rid of it by merging it into another file. Which took 27 patches! (One of these patches introduced a regression, which I later fixed in bug 1340327.) At this point a lot of the core code that had previously been spread across multiple files and classes was now in a single file, tools/profiler/core/platform.cpp, and it was becoming increasingly obvious that there was a lot of global state being accessed from multiple threads with insufficient thread synchronization.

Bug 1338957. The profiler tracks which threads are “sleeping” (i.e. blocked on some operation), to support an optimization where it samples sleeping threads in a cheaper-than-normal fashion. It was using two counters and a boolean to track the sleep state of each thread. These three values were accessed from multiple threads; two of them were atomic, and one wasn’t, so the whole setup was very racy. I managed to condense the three values into a single atomic tri-state value, which made things simpler and thread-safe.

Bug 1339327. I landed eight refactoring patches with no particular common theme, mostly involving renaming things and removing unnecessary stuff.

Bug 1339435. I removed two erroneous assertions that I had added in an earlier patch — two functions that I thought only ran on the main thread turned out to run off the main thread as well.

Bug 1339695. The profiler has a lot of code that is specific to a particular architecture (e.g. x86), OS (e.g. Windows), or platform (e.g. x86/Windows). The #ifdef statements used to select these were massively inconsistent — turns out there are many ways to detect this stuff — so I fixed this up. Among other things, this involved using the nice constants in tools/profiler/core/PlatformMacros.h consistently throughout the profiler’s code. (I fixed a regression — caused by mistyping one of the #ifdef conditions, alas! — from this change in bug 1350211. And another one involving --disable-profiling in bug 1348776.) I also renamed some files that had .cc extensions instead of the usual .cpp because they had (I think) been imported from V8.

Bug 1340928. At this point I had started working on a major change to the handling of the profiler’s core global state. It would inevitably be a big patch, but I wanted it to be as small as possible, so I started aggressively carving off small changes that could be landed separately. This bug featured 16 of them.

Bug 1328378. The profiler has two kinds of samples: periodic, which are taken by a separate thread in response to a timer firing, and synchronous, which a thread takes itself in response to a request via the profiler API. There are a lot of similarities between the two, but also some important differences. This bug took some steps to simplify the messy handling of synchronous samples.

Bug 1344118. I mentioned earlier that the profiler tracks which threads are “sleeping” to support an optimization: when a thread is asleep, we can mostly duplicate its last sample without unwinding its stack. But the optimization was buggy and would become a catastrophic pessimization in certain circumstances, due to what should have been a short O(1)-ish buffer search becoming O(n²)-ish, which would quickly peg one CPU at 100% usage. As far as I can tell, this bug was present in the optimization ever since it was implemented three years ago. (It’s possible it wasn’t noticed because its effect increase as more threads are profiled, but the profiler defaults to only profiling the main thread and the compositor thread.) The fix was straightforward once the diagnosis was made, and Julian Seward did a follow-up that made the optimization even more effective.

Bug 1342306. In this bug I put almost all of the profiler’s global state into a single class and protected accesses to it with a mutex. Unlike the old code, the new code is simple and obviously thread-safe. The final patch in this bug was much bigger than I would have liked, at 142 KiB, even after I carved off as many precursor patches as I could. Unsurprisingly, there were some follow-up fixes required: bug 1346356 (a leak and a deadlock), bug 1347044 (another deadlock), bug 1348374 (yet another deadlock), and bug 1350967 (surprise! another deadlock).

Bug 1345262. I fixed an assertion failure caused by the profiler and the JS engine having differing views about what functions should be called on what threads.

Bug 1347348. Five more assorted clean-ups.

Bug 1349856. I fixed a minor error involving a call to the profiler from Gecko.

Bug 1346132. I removed the profiler’s bespoke logging system, replacing it with the standard Mozilla one. I also made the logging output more concise and informative.

Bug 1350212. I cleaned up a class and its uses a bit.

Bug 1351523. I reordered one function’s arguments to match the order used in two related functions.

Bug 1351528. I removed some unused values from an enum.

Bug 1348024. I simplified some environment variables used by the profiler.

Bug 1351946. I removed some gnarly code for starting the profiler on B2G.

Bug 1351136. The profiler’s testing coverage is not very good, as can be seen from the numerous regressions I introduced and fixed. So I added a gtest that improves coverage. There’s still room for more test coverage improvement.

Bug 1351963. I further clarified the handling of synchronous vs. periodic samples, and simplified the ownership of some per-thread data structures.

Discussion

I learned some interesting things while doing this work.

Learning a component

Three months ago I knew almost nothing about the profiler’s code. Today I’m a module peer.

At the start of January I had been told that the profiler needed work, and I had assigned myself a Q1 deliverable to “land three improvements to the Gecko Profiler”. I started just by looking for easy, obvious code clean-ups, such as dead code removal, fixing inconsistent naming of things, and removing unnecessary indirections. (These are the kinds of clean-ups you can make with only shallow understanding.) The profiler proved to be a target-rich environment for such changes!

After I made a few dozen such changes I started understanding more deeply how the pieces fit together. (Partly because I’d been staring at the code a lot, and partly because my changes were making the code easier to understand. Refactorings add up.) I started interleaving my easy clean-up patches with ones that required more insight into how the profiler worked. I made numerous mistakes along the way, as the various regression fixes above show. But that’s ok.

I also kept a text file in which I had a list of ideas for things to fix. Every time I saw something that looked like it could be improved, I added it to the file, and I repeatedly checked the file when deciding what to work on next. As my understanding of the code improved, multiple times I realized that items I had written down were wrong, or incomplete, or that seemingly separate things were related. (In fact, I’m still using the file, because I still have numerous things I want to improve.)

Multi-threaded programming basics

Although I first learned C and C++ about 20 years ago, and I have worked at Mozilla for more than 8 years, this was the first time I’ve ever done serious multi-threaded programming, i.e. at a level where I needed a reasonably deep understanding of how various threads can interact. I got the following two great tips from Julian Seward, which helped a lot.

  • Write down pseudocode for each thread.
  • Write down potential worst-case thread operation interleavings.

I also found it helpful to add comments (or assertions, where appropriate) to the top of functions that indicate which thread or threads they run on. For example:

void profiler_gathered_OOP_profile()
{
  MOZ_RELEASE_ASSERT(NS_IsMainThread());
  ...
}

and:

void profiler_thread_sleep()
{ 
  // This function runs both on and off the main thread.
  ...
}

A useful idiom: proof-of-lock tokens

I also employed a programming idiom that turned out to be extremely helpful.  Most of the global profiler state is in single class called ProfilerState. There is a single instance of this class, gPS, and a single mutex that protects it, gPSMutex. To guarantee that no code is able to access gPS‘s contents without first locking the mutex, for every field in ProfilerState there is a getter and a setter, both of which require a “proof-of-lock” token, which takes the form of a const PS::AutoLock&, where PS::AutoLock is an RAII type that locks and unlocks a mutex.

For example, consider this function, which checks if the profiler is paused.

bool profiler_is_paused()
{
  PS::AutoLock lock(gPSMutex);
  if (!gPS->IsActive(lock)) {
    return false;
  }
  return gPS->IsPaused(lock);
}

The PS::AutoLock locks the mutex. IsActive() and IsPaused() both access fields within gPS, and so they are passed lock, which serves as the proof-of-lock value. IsPaused() and SetIsPaused() are implemented as follow.

bool IsPaused(const PS::AutoLock&) const { return mIsPaused; }
void SetIsPaused(const PS::AutoLock&, bool aIsPaused) { mIsPaused = aIsPaused; }

Neither function actually uses the proof-of-lock token. Nonetheless, any function that calls a ProfilerState getter or setter must either lock gPSMutex, or have an ancestor that does. This idiom has two very nice benefits.

  • You can’t access gPS‘s contents without having first locked gPSMutex. (Well, it is possible to subvert the protection, but not by accident.)
  • It’s obvious that all functions that have a proof-of-lock argument are called only while gPSMutex is locked.

Functions that are called from multiple places sometimes must be split in two: an outer function in which the mutex is initially unlocked, and an inner function that takes a proof-of-lock token. This isn’t hard, though.

Deadlocks vs. data races

After my big change to the profiler’s global state, I had to fix numerous deadlocks. This wasn’t too hard. Deadlocks (caused by too much thread synchronization) are obvious, easy to diagnose, and these ones weren’t hard to fix. It’s useful to contrast them with data races (caused by too little thread synchronization) which typically have subtle effects and are difficult to diagnose.

Patch discipline

For this work I wrote a lot of small patches. This is my preferred way to work, for two reasons. First, small patches make life easier for reviewers, which in turn results in faster reviews. Second, small patches make regression hunting easy. It’s always nice when you bisect a regression to a small patch.

Future work and Thanks

The profiler still has plenty of room for improvement, and I’m planning to do more work on it in Q2. In the meantime, if you’ve tried the profiler in the past and had problems it might be worth trying again. It’s in much better shape now.

Finally, many thanks to Markus Stange for reviewing the majority of the patches and answering lots of questions, and Julian Seward for reviewing most of the remainder and for numerous helpful discussions about threaded programming.

How to speed up the Rust compiler some more

I recently wrote about some work I’ve done to speed up the Rust compiler. Since then I’ve done some more.

Heap allocations

My last post mentioned how heap allocations were frequent within rustc. This led to some wild speculation in some venues: about whether Rust should use heap allocation at all, about whether garbage collection was necessary, and so on. I have two comments about this.

The first is that rustc is a compiler, and like most compilers its execution is dominated by complex traversals of large tree structures: ASTs, IRs, etc. Tree structures typically require heap allocation. In particular, a lot of these tree structures contain Vec, HashMap and HashSet fields, all of which unavoidably use heap allocation.

The second is that although some heap allocation is unavoidable, the amount that rustc was doing was excessive. It is clear that nobody had (or not for a long time) made a concerted effort to minimize heap allocations. With the help of DHAT I’ve been able to greatly reduce the amount done. Any time you throw a new profiler at any large codebase that hasn’t been heavily optimized there’s a good chance you’ll be able to make some sizeable improvements.

Tools

As in my previous post, I focused almost entirely on the benchmarks present in rustc-benchmarks to guide my efforts.

Once again I mostly used Cachegrind and DHAT to profile these benchmarks. I also used Massif (plus the excellent massif-visualizer) to profile peak memory usage on one workload (see below).

I have also recently started using perf a bit more. Huon Wilson told me to add the --callgraph=dwarf flag to my perf record invocations. That improves things significantly, though I still find perf puzzling and frustrating at times, even after reading Brendan Gregg’s thorough examples page.

Wins

#37229: rustc spends a lot of time doing hash table lookups, enough that the cost of hashing is significant. In this PR I changed the hash function used by rustc (FNV) to one inspired by the hash function used within Firefox. The new function is faster because it can process an entire word at a time, rather than one byte at a time. The new hash function is slightly worse in terms of the number of collisions but the change sped up compilation of most workloads by 3–6%.

#37161 & #37318 & #37373: These three PRs removed a lot of unnecessary heap allocations (mostly due to clone()) in the macro parser. They sped up the compilation of html5ever by 20% and 7% and 2%, respectively.

#37267 & #37298: rustc uses “deflate” compression in a couple of places: crate metadata and LLVM bitcode. In the metadata case rustc was often compressing metadata and then throwing away the result! (Thanks to eddyb for diagnosing this and explaining to me how to fix it.) Avoiding this unnecessary work speed up compilation of syntex-incr by 6% and several others by 1–2%.

For the LLVM bitcode case I tweaked the deflate settings so that it ran almost twice as fast while the compression ratio was only slightly worse. This sped up compilation of syntex-incr by another 8% and a couple of others by 1–2%. It’s possible that switching to a different compression algorithm would help some more, thought that would be a much larger change.

#37108: This PR avoided interning values of the Substs type in cases where it was easy to tell that the value had previously been interned. It sped up compilation of several benchmarks by 1–4%.

#37705: This PR avoided some unnecessary calls to mk_ty. It sped up compilation of one benchmark by 5% and a few others by 1–2%.

#37083: This PR inlined various methods involved with uleb128 decoding, which is used when reading crate metadata. This sped up compilation of several benchmarks by 1%.

#36973: This PR fixed things so that a data structure that is only required for incremental compilation is not touched during non-incremental compilation. This sped up compilation of several non-incremental benchmarks by 1%.

#37445 & #37764: One unusual workload was found to make rustc consume excessive amounts of memory (4.5 GiB!) which made it OOM on some machines. I used Massif to identify ways to improve this.

The first PR reduced the size of the Expr enum by shrinking the outsized InlineAsm variant, which reduced peak memory usage by 9%. The second PR removed scope_auxiliary, a data structure used only during MIR dumping (and of marginal utility even then). This reduced peak memory usage by another 10%.

I have filed a PR to add a cut-down version of this workload to rustc-benchmarks.

#36993: This PR did some manual inlining and restructuring of several ObligationForest functions. It sped up compilation of inflate by 2%.

#37642: This PR reduced some excessive indirection present in the representation of HIR. It sped up compilation of some benchmarks by 1, 2 and 4%.

#37427: rustc uses Blake2b hashing to determine when a function’s code has changed. This PR reduced the number of bytes to be hashed by (a) avoiding hashing filenames twice for each span, and (b) pre-uleb128-encoding 32-bit and 64-bit integers, which are usually small. This sped up compilation of syntex-incr by 2%.

Future work

As the compiler front-end gets faster, the proportion of rustc execution time spent in the LLVM back-end increases. For some benchmarks it now exceeds 50% (when doing debug builds), though there is still plenty of variation. Thanks to mrhota there is a new –enable-llvm-release-debuginfo configure option that (unsurprisingly) enabled debuginfo within LLVM. This means that profilers can now give filenames and line numbers for LLVM. I’ve looked at a few places in LLVM that show up high in profiles, though I haven’t yet managed to make any useful changes to it.

Another interesting new development is pnkfelix’s -Zprint-type-sizes option, which should land soon, and will potentially be useful for any program written in Rust, not just rustc. This option will make it trivial to see how each type is laid out in memory, which will make it easy to see how types can be rearranged to be smaller. Up until now this has been a painful and imprecise exercise.

Finally, I want to encourage anyone else with the slightest knack and/or enthusiasm for optimizing code to take a look at rustc. You might be thinking that there isn’t that much low-hanging fruit left, but I am confident there is. It’s getting harder for me to find things to improve, but I am one just person with a particular background and a few preferred profiling tools. I am confident that other people with different backgrounds and tools will find plenty of stuff that I cannot. Rust compile speed still isn’t great, even after these improvements, but the more people who pitch in the faster they’ll improve. Don’t be shy, and if you have any questions please contact me or ask in the #rust or #rustc IRC channels.

Faster Firefox startup & shutdown with add-ons present

Thanks to some recent improvements by Kris Maglione, add-ons built with Firefox’s add-ons SDK are now much faster to start up and shut down. For users with multiple add-ons installed, this can make a large difference. For example, one user reported the following.

I’ve noticed a dramatic improvement in startup and shutdown time since the fixes landed on aurora.

Configuration: Two windows with 15 tabs each and 22 addons.

  • Startup: 25 seconds -> 14 seconds (44% reduction)
  • Shutdown: 30 seconds -> 5 seconds (83% reduction)

Shutdown used to take so long it was triggering a minidump.

Your mileage may vary, but these improvements have the potential to benefit many users. They are present in Firefox 50 (currently in Beta, and due for release on November 15) and all later versions.

How to speed up the Rust compiler

Rust is a great language, and Mozilla plans to use it extensively in Firefox. However, the Rust compiler (rustc) is quite slow and compile times are a pain point for many Rust users. Recently I’ve been working on improving that. This post covers how I’ve done this, and should be of interest to anybody else who wants to help speed up the Rust compiler. Although I’ve done all this work on Linux it should be mostly applicable to other platforms as well.

Getting the code

The first step is to get the rustc code. First, I fork the main Rust repository on GitHub. Then I make two local clones: a base clone that I won’t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). I use commands something like this:

user=nnethercote
for r in rust0 rust1 ; do
  cd ~/moz
  git clone https://github.com/$user/rust $r
  cd $r
  git remote add upstream https://github.com/rust-lang/rust
  git remote set-url origin git@github.com:$user/rust
done

Building the Rust compiler

Within the two repositories, I first configure:

./configure --enable-optimize --enable-debuginfo

I configure with optimizations enabled because that matches release versions of rustc. And I configure with debug info enabled so that I get good information from profilers.

[Update: I now add –enable-llvm-release-debuginfo which builds the LLVM back-end with debug info too.]

Then I build:

RUSTFLAGS='' make -j8

[Update: I previously had -Ccodegen-units=8 in RUSTFLAGS because it speeds up compile times. But Lars Bergstrom informed me that it can slow down the resulting program significantly. I measured and he was right — the resulting rustc was about 5–10% slower. So I’ve stopped using it now.]

That does a full build, which does the following:

  • Downloads a stage0 compiler, which will be used to build the stage1 local compiler.
  • Builds LLVM, which will become part of the local compilers.
  • Builds the stage1 compiler with the stage0 compiler.
  • Builds the stage2 compiler with the stage1 compiler.

It can be mind-bending to grok all the stages, especially with regards to how libraries work. (One notable example: the stage1 compiler uses the system allocator, but the stage2 compiler uses jemalloc.) I’ve found that the stage1 and stage2 compilers have similar performance. Therefore, I mostly measure the stage1 compiler because it’s much faster to just build the stage1 compiler, which I do with the following command.

RUSTFLAGS='' make -j8 rustc-stage1

Building the compiler takes a while, which isn’t surprising. What is more surprising is that rebuilding the compiler after a small change also takes a while. That’s because a lot of code gets recompiled after any change. There are two reasons for this.

  • Rust’s unit of compilation is the crate. Each crate can consist of multiple files. If you modify a crate, the whole crate must be rebuilt. This isn’t surprising.
  • rustc’s dependency checking is very coarse. If you modify a crate, every other crate that depends on it will also be rebuilt, no matter how trivial the modification. This surprised me greatly. For example, any modification to the parser (which is in a crate called libsyntax) causes multiple other crates to be recompiled, a process which takes 6 minutes on my fast desktop machine. Almost any change to the compiler will result in a rebuild that takes at least 2 or 3 minutes.

Incremental compilation should greatly improve the dependency situation, but it’s still in an experimental state and I haven’t tried it yet.

To run all the tests I do this (after a full build):

ulimit -c 0 && make check

The checking aborts if you don’t do the ulimit, because the tests produces lots of core files and it doesn’t want to swamp your disk.

The build system is complex, with lots of options. This command gives a nice overview of some common invocations:

make tips

Basic profiling

The next step is to do some basic profiling. I like to be careful about which rustc I am invoking at any time, especially if there’s a system-wide version installed, so I avoid relying on PATH and instead define some environment variables like this:

export RUSTC01="$HOME/moz/rust0/x86_64-unknown-linux-gnu/stage1/bin/rustc"
export RUSTC02="$HOME/moz/rust0/x86_64-unknown-linux-gnu/stage2/bin/rustc"
export RUSTC11="$HOME/moz/rust1/x86_64-unknown-linux-gnu/stage1/bin/rustc"
export RUSTC12="$HOME/moz/rust1/x86_64-unknown-linux-gnu/stage2/bin/rustc"

In the examples that follow I will use $RUSTC01 as the version of rustc that I invoke.

rustc has the ability to produce some basic stats about the time and memory used by each compiler pass. It is enabled with the -Ztime-passes flag. If you are invoking rustc directly you’d do it like this:

$RUSTC01 -Ztime-passes a.rs

If you are building with Cargo you can instead do this:

RUSTC=$RUSTC01 cargo rustc -- -Ztime-passes

The RUSTC= part tells Cargo you want to use a non-default rustc, and the part after the -- is flags that will be passed to rustc when it builds the final crate. (A bit weird, but useful.)

Here is some sample output from -Ztime-passes:

time: 0.056; rss: 49MB parsing
time: 0.000; rss: 49MB recursion limit
time: 0.000; rss: 49MB crate injection
time: 0.000; rss: 49MB plugin loading
time: 0.000; rss: 49MB plugin registration
time: 0.103; rss: 87MB expansion
time: 0.000; rss: 87MB maybe building test harness
time: 0.002; rss: 87MB maybe creating a macro crate
time: 0.000; rss: 87MB checking for inline asm in case the target doesn't support it
time: 0.005; rss: 87MB complete gated feature checking
time: 0.008; rss: 87MB early lint checks
time: 0.003; rss: 87MB AST validation
time: 0.026; rss: 90MB name resolution
time: 0.019; rss: 103MB lowering ast -> hir
time: 0.004; rss: 105MB indexing hir
time: 0.003; rss: 105MB attribute checking
time: 0.003; rss: 105MB language item collection
time: 0.004; rss: 105MB lifetime resolution
time: 0.000; rss: 105MB looking for entry point
time: 0.000; rss: 105MB looking for plugin registrar
time: 0.015; rss: 109MB region resolution
time: 0.002; rss: 109MB loop checking
time: 0.002; rss: 109MB static item recursion checking
time: 0.060; rss: 109MB compute_incremental_hashes_map
time: 0.000; rss: 109MB load_dep_graph
time: 0.021; rss: 109MB type collecting
time: 0.000; rss: 109MB variance inference
time: 0.038; rss: 113MB coherence checking
time: 0.126; rss: 114MB wf checking
time: 0.219; rss: 118MB item-types checking
time: 1.158; rss: 125MB item-bodies checking
time: 0.000; rss: 125MB drop-impl checking
time: 0.092; rss: 127MB const checking
time: 0.015; rss: 127MB privacy checking
time: 0.002; rss: 127MB stability index
time: 0.011; rss: 127MB intrinsic checking
time: 0.007; rss: 127MB effect checking
time: 0.027; rss: 127MB match checking
time: 0.014; rss: 127MB liveness checking
time: 0.082; rss: 127MB rvalue checking
time: 0.145; rss: 161MB MIR dump
 time: 0.015; rss: 161MB SimplifyCfg
 time: 0.033; rss: 161MB QualifyAndPromoteConstants
 time: 0.034; rss: 161MB TypeckMir
 time: 0.001; rss: 161MB SimplifyBranches
 time: 0.006; rss: 161MB SimplifyCfg
time: 0.089; rss: 161MB MIR passes
time: 0.202; rss: 161MB borrow checking
time: 0.005; rss: 161MB reachability checking
time: 0.012; rss: 161MB death checking
time: 0.014; rss: 162MB stability checking
time: 0.000; rss: 162MB unused lib feature checking
time: 0.101; rss: 162MB lint checking
time: 0.000; rss: 162MB resolving dependency formats
 time: 0.001; rss: 162MB NoLandingPads
 time: 0.007; rss: 162MB SimplifyCfg
 time: 0.017; rss: 162MB EraseRegions
 time: 0.004; rss: 162MB AddCallGuards
 time: 0.126; rss: 164MB ElaborateDrops
 time: 0.001; rss: 164MB NoLandingPads
 time: 0.012; rss: 164MB SimplifyCfg
 time: 0.008; rss: 164MB InstCombine
 time: 0.003; rss: 164MB Deaggregator
 time: 0.001; rss: 164MB CopyPropagation
 time: 0.003; rss: 164MB AddCallGuards
 time: 0.001; rss: 164MB PreTrans
time: 0.182; rss: 164MB Prepare MIR codegen passes
 time: 0.081; rss: 167MB write metadata
 time: 0.590; rss: 177MB translation item collection
 time: 0.034; rss: 180MB codegen unit partitioning
 time: 0.032; rss: 300MB internalize symbols
time: 3.491; rss: 300MB translation
time: 0.000; rss: 300MB assert dep graph
time: 0.000; rss: 300MB serialize dep graph
 time: 0.216; rss: 292MB llvm function passes [0]
 time: 0.103; rss: 292MB llvm module passes [0]
 time: 4.497; rss: 308MB codegen passes [0]
 time: 0.004; rss: 308MB codegen passes [0]
time: 5.185; rss: 308MB LLVM passes
time: 0.000; rss: 308MB serialize work products
time: 0.257; rss: 297MB linking

As far as I can tell, the indented passes are sub-passes, and the parent pass is the first non-indented pass afterwards.

More serious profiling

The -Ztime-passes flag gives a good overview, but you really need a profiling tool that gives finer-grained information to get far. I’ve done most of my profiling with two Valgrind tools, Cachegrind and DHAT. I invoke Cachegrind like this:

valgrind \
 --tool=cachegrind --cache-sim=no --branch-sim=yes \
 --cachegrind-out-file=$OUTFILE $RUSTC01 ...

where $OUTFILE specifies an output filename. I find the instruction counts measured by Cachegrind to be highly useful; the branch simulation results are occasionally useful, and the cache simulation results are almost never useful.

The Cachegrind output looks like this:

--------------------------------------------------------------------------------
            Ir 
--------------------------------------------------------------------------------
22,153,170,953 PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir file:function
--------------------------------------------------------------------------------
923,519,467 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:_int_malloc
879,700,120 /home/njn/moz/rust0/src/rt/miniz.c:tdefl_compress
629,196,933 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:_int_free
394,687,991 ???:???
379,869,259 /home/njn/moz/rust0/src/libserialize/leb128.rs:serialize::leb128::read_unsigned_leb128
376,921,973 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:malloc
263,083,755 /build/glibc-GKVZIf/glibc-2.23/string/::/sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:__memcpy_avx_unaligned
257,219,281 /home/njn/moz/rust0/src/libserialize/opaque.rs:<serialize::opaque::Decoder<'a> as serialize::serialize::Decoder>::read_usize
217,838,379 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:free
217,006,132 /home/njn/moz/rust0/src/librustc_back/sha2.rs:rustc_back::sha2::Engine256State::process_block
211,098,567 ???:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults&, llvm::CodeGenOpt::Level)
185,630,213 /home/njn/moz/rust0/src/libcore/hash/sip.rs:<rustc_incremental::calculate_svh::hasher::IchHasher as core::hash::Hasher>::write
171,360,754 /home/njn/moz/rust0/src/librustc_data_structures/fnv.rs:<rustc::ty::subst::Substs<'tcx> as core::hash::Hash>::hash
150,026,054 ???:llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)

Here “Ir” is short for “I-cache reads”, which corresponds to the number of instructions executed. Cachegrind also gives line-by-line annotations of the source code.

The Cachegrind results indicate that malloc and free are usually the two hottest functions in the compiler. So I also use DHAT, which is a malloc profiler that tells you exactly where all your malloc calls are coming from.  I invoke DHAT like this:

/home/njn/grind/ws3/vg-in-place \
 --tool=exp-dhat --show-top-n=1000 --num-callers=4 \
 --sort-by=tot-blocks-allocd $RUSTC01 ... 2> $OUTFILE

I sometimes also use --sort-by=tot-bytes-allocd. DHAT’s output looks like this:

==16425== -------------------- 1 of 1000 --------------------
==16425== max-live: 30,240 in 378 blocks
==16425== tot-alloc: 20,866,160 in 260,827 blocks (avg size 80.00)
==16425== deaths: 260,827, at avg age 113,438 (0.00% of prog lifetime)
==16425== acc-ratios: 0.74 rd, 1.00 wr (15,498,021 b-read, 20,866,160 b-written)
==16425== at 0x4C2BFA6: malloc (vg_replace_malloc.c:299)
==16425== by 0x5AD392B: <syntax::ptr::P<T> as serialize::serialize::Decodable>::decode (heap.rs:59)
==16425== by 0x5AD4456: <core::iter::Map<I, F> as core::iter::iterator::Iterator>::next (serialize.rs:201)
==16425== by 0x5AE2A52: rustc_metadata::decoder::<impl rustc_metadata::cstore::CrateMetadata>::get_attributes (vec.rs:1556)
==16425== 
==16425== -------------------- 2 of 1000 --------------------
==16425== max-live: 1,360 in 17 blocks
==16425== tot-alloc: 10,378,160 in 129,727 blocks (avg size 80.00)
==16425== deaths: 129,727, at avg age 11,622 (0.00% of prog lifetime)
==16425== acc-ratios: 0.47 rd, 0.92 wr (4,929,626 b-read, 9,599,798 b-written)
==16425== at 0x4C2BFA6: malloc (vg_replace_malloc.c:299)
==16425== by 0x881136A: <syntax::ptr::P<T> as core::clone::Clone>::clone (heap.rs:59)
==16425== by 0x88233A7: syntax::ext::tt::macro_parser::parse (vec.rs:1105)
==16425== by 0x8812E66: syntax::tokenstream::TokenTree::parse (tokenstream.rs:230)

The “deaths” value here indicate the total number of calls to malloc for each call stack, which is usually the metric of most interest. The “acc-ratios” value can also be interesting, especially if the “rd” value is 0.00, because that indicates the allocated blocks are never read. (See below for example of problems that I found this way.)

For both profilers I also pipe $OUTFILE through eddyb’s rustfilt.sh script which demangles ugly Rust symbols like this:

_$LT$serialize..opaque..Decoder$LT$$u27$a$GT$$u20$as$u20$serialize..serialize..Decoder$GT$::read_usize::h87863ec7f9234810

to something much nicer, like this:

<serialize::opaque::Decoder<'a> as serialize::serialize::Decoder>::read_usize

[Update: native support for Rust demangling recently landed in Valgrind’s repo. I use a trunk version of Valgrind so I no longer need to use rustfilt.sh in combination with Valgrind.]

For programs that use Cargo, sometimes it’s useful to know the exact rustc invocations that Cargo uses. Find out with either of these commands:

RUSTC=$RUSTC01 cargo build -v
RUSTC=$RUSTC01 cargo rust -v

I also have done a decent amount of ad hoc println profiling, where I insert println! calls in hot parts of the code and then I use a script to post-process them. This can be very useful when I want to know exactly how many times particular code paths are hit.

I’ve also tried perf. It works, but I’ve never established much of a rapport with it. YMMV. In general, any profiler that works with C or C++ code should also work with Rust code.

Finding suitable benchmarks

Once you know how you’re going to profile you need some good workloads. You could use the compiler itself, but it’s big and complicated and reasoning about the various stages can be confusing, so I have avoided that myself.

Instead, I have focused entirely on rustc-benchmarks, a pre-existing rustc benchmark suite. It contains 13 benchmarks of various sizes. It has been used to track rustc’s performance at perf.rust-lang.org for some time, but it wasn’t easy to use locally until I wrote a script for that purpose. I invoke it something like this:

./compare.py \
  /home/njn/moz/rust0/x86_64-unknown-linux-gnu/stage1/bin/rustc \
  /home/njn/moz/rust1/x86_64-unknown-linux-gnu/stage1/bin/rustc

It compares the two given compilers, doing debug builds, on the benchmarks See the next section for example output. If you want to run a subset of the benchmarks you can specify them as additional arguments.

Each benchmark in rustc-benchmarks has a makefile with three targets. See the README for details on these targets, which can be helpful.

Wins

Here are the results if I compare the following two versions of rustc with compare.py.

  • The commit just before my first commit (on September 12).
  • A commit from October 13.
futures-rs-test  5.028s vs  4.433s --> 1.134x faster (variance: 1.020x, 1.030x)
helloworld       0.283s vs  0.235s --> 1.202x faster (variance: 1.012x, 1.025x)
html5ever-2016-  6.293s vs  5.652s --> 1.113x faster (variance: 1.011x, 1.008x)
hyper.0.5.0      6.182s vs  5.039s --> 1.227x faster (variance: 1.002x, 1.018x)
inflate-0.1.0    5.168s vs  4.935s --> 1.047x faster (variance: 1.001x, 1.002x)
issue-32062-equ  0.457s vs  0.347s --> 1.316x faster (variance: 1.010x, 1.007x)
issue-32278-big  2.046s vs  1.706s --> 1.199x faster (variance: 1.003x, 1.007x)
jld-day15-parse  1.793s vs  1.538s --> 1.166x faster (variance: 1.059x, 1.020x)
piston-image-0. 13.871s vs 11.885s --> 1.167x faster (variance: 1.005x, 1.005x)
regex.0.1.30     2.937s vs  2.516s --> 1.167x faster (variance: 1.010x, 1.002x)
rust-encoding-0  2.414s vs  2.078s --> 1.162x faster (variance: 1.006x, 1.005x)
syntex-0.42.2   36.526s vs 32.373s --> 1.128x faster (variance: 1.003x, 1.004x)
syntex-0.42.2-i 21.500s vs 17.916s --> 1.200x faster (variance: 1.007x, 1.013x)

Not all of the improvement is due to my changes, but I have managed a few nice wins, including the following.

#36592: There is an arena allocator called TypedArena. rustc creates many of these, mostly short-lived. On creation, each arena would allocate a 4096 byte chunk, in preparation for the first arena allocation request. But DHAT’s output showed me that the vast majority of arenas never received such a request! So I made TypedArena lazy — the first chunk is now only allocated when necessary. This reduced the number of calls to malloc greatly, which sped up compilation of several rustc-benchmarks by 2–6%.

#36734: This one was similar. Rust’s HashMap implementation is lazy — it doesn’t allocate any memory for elements until the first one is inserted. This is a good thing because it’s surprisingly common in large programs to create HashMaps that are never used. However, Rust’s HashSet implementation (which is just a layer on top of the HashMap) didn’t have this property, and guess what? rustc also creates large numbers of HashSets that are never used. (Again, DHAT’s output made this obvious.) So I fixed that, which sped up compilation of several rustc-benchmarks by 1–4%. Even better, because this change is to Rust’s stdlib, rather than rustc itself, it will speed up any program that creates HashSets without using them.

#36917: This one involved avoiding some useless data structure manipulation when a particular table was empty. Again, DHAT pointed out a table that was created but never read, which was the clue I needed to identify this improvement. This sped up two benchmarks by 16% and a couple of others by 3–5%.

#37064: This one changed a hot function in serialization code to return a Cow<str> instead of a String, which avoided a lot of allocations.

Future work

Profiles indicate that the following parts of the compiler account for a lot of its runtime.

  • malloc and free are still the two hottest functions in most benchmarks. Avoiding heap allocations can be a win.
  • Compression is used for crate metadata and LLVM bitcode. (This shows up in profiles under a function called tdefl_compress.)  There is an issue open about this.
  • Hash table operations are hot. A lot of this comes from the interning of various values during type checking; see the CtxtInterners type for details.
  • Crate metadata decoding is also costly.
  • LLVM execution is a big chunk, especially when doing optimized builds. So far I have treated LLVM as a black box and haven’t tried to change it, at least partly because I don’t know how to build it with debug info, which is necessary to get source files and line numbers in profiles. [Update: there is a new –enable-llvm-release-debuginfo configure option that causes LLVM to be build with debug info.]

A lot of programs have broadly similar profiles, but occasionally you get an odd one that stresses a different part of the compiler. For example, in rustc-benchmarks, inflate-0.1.0 is dominated by operations involving the (delighfully named) ObligationsForest (see #36993), and html5ever-2016-08-25 is dominated by what I think is macro processing. So it’s worth profiling the compiler on new codebases.

Caveat lector

I’m still a newcomer to Rust development. Although I’ve had lots of help on the #rustc IRC channel — big thanks to eddyb and simulacrum in particular — there may be things I am doing wrong or sub-optimally. Nonetheless, I hope this is a useful starting point for newcomers who want to speed up the Rust compiler.

How to get localized Firefox Nightly builds

One of the easiest and best ways that someone can help Mozilla and Firefox is to run Firefox Nightly. I’ve been doing it on my Windows, Mac and Linux machines for the past couple of months. It requires daily restarts, but otherwise it has been a smooth experience for me.

Unfortunately the number of Nightly users has been steadily dropping for some time, which hurts our ability to catch crashes and other regressions early. Pascal Chevrel and Marcia Knous are leading efforts underway to reverse this trend.

One problem with Nightly builds has been their visibility. In particular, finding localized (non-English) builds was difficult. That situation has just improved: thanks to Kohei Yoshino there is now a single page containing Nightly builds for all platforms and locales. As far as I know there are no other pages that currently link to that page, but perhaps that will happen as part of the planned work to give Nightly builds a place on mozilla.org.

If you have friends and family who would like to help Mozilla and are willing to use pre-release versions of Firefox, please suggest Firefox Nightly to them.

How to switch to a 64-bit Firefox on Windows

I recently wrote about 64-bit Firefox builds on Windows, explaining why you might want to switch — it can reduce the likelihood of out-of-memory crashes — and also some caveats.

However, I didn’t explain how to switch, so I will do that now.

First, if you want to make sure that you aren’t already running a 64-bit Firefox, type “about:support” in the address bar and then look at the User Agent field in the Application Basics table near the top of the page.

  • If it contains the string “Win64”, you are already running a 64-bit Firefox.
  • If it contains the string “WOW64“, you are running a 32-bit Firefox on a 64-bit Windows installation, which means you can switch to a 64-bit build.
  • Otherwise, you are running a 32-bit Firefox on a 32-bit Windows installation, and cannot switch to a 64-bit Firefox.

Here are links to pages contain 64-bit builds for all the different release channels.

  • Release
  • Beta
  • Developer Edition
  • Nightly:  This is a user-friendly page, but it only has the en-US locale.
  • Nightly: This is a more intimidating page, but it has all locales. Look for a file with a name of the form firefox-<VERSION>.<LOCALE>.win64.installer.exe, e.g. firefox-50.0a1.de.win64.installer.exe for Nightly 50 in German.

By default, 32-bit Firefox and 64-bit Firefox are installed to different locations:

  • C:\Program Files (x86)\Mozilla Firefox\
  • C:\Program Files\Mozilla Firefox\

If you are using a 32-bit Firefox and then you download and install a 64-bit Firefox, by default you will end up with two versions of  Firefox installed. (But note that if you let the 64-bit Firefox installer add shortcuts to the desktop and/or taskbar, these shortcuts will replace any existing shortcuts to 32-bit Firefox.)

Both the 32-bit Firefox and the 64-bit Firefox will use the same profile, which means all your history, bookmarks, extensions, etc., will be available in either version. You’ll be able to run both versions, though not at the same time with the same profile. If you decide you don’t need both versions you can simply remove the unneeded version through the Windows system settings, as normal; your profile will not be touched when you do this.

Finally, there is a plan to gradually roll out 64-bit Firefox to Windows users in increasing numbers.

Firefox 64-bit for Windows can take advantage of more memory

By default, on Windows, Firefox is a 32-bit application. This means that it is limited to using at most 4 GiB of memory, even on machines that have more than 4 GiB of physical memory (RAM). In fact, depending on the OS configuration, the limit may be as low as 2 GiB.

Now, 2–4 GiB might sound like a lot of memory, but it’s not that unusual for power users to use that much. This includes:

  • users with many (dozens or even hundreds) of tabs open;
  • users with many (dozens) of extensions;
  • users of memory-hungry web sites and web apps; and
  • users who do all of the above!

Furthermore, in practice it’s not possible to totally fill up this available space because fragmentation inevitably occurs. For example, Firefox might need to make a 10 MiB allocation and there might be more than 10 MiB of unused memory, but if that available memory is divided into many pieces all of which are smaller than 10 MiB, then the allocation will fail.

When an allocation does fail, Firefox can sometimes handle it gracefully. But often this isn’t possible, in which case Firefox will abort. Although this is a controlled abort, the effect for the user is basically identical to an uncontrolled crash, and they’ll have to restart Firefox. A significant fraction of Firefox crashes/aborts are due to this problem, known as address space exhaustion.

Fortunately, there is a solution to this problem available to anyone using a 64-bit version of Windows: use a 64-bit version of Firefox. Now, 64-bit applications typically use more memory than 32-bit applications. This is because pointers, a common data type, are twice as big; a rough estimate for 64-bit Firefox is that it might use 25% more memory. However, 64-bit applications also have a much larger address space, which means they can access vast amounts of physical memory, and address space exhaustion is all but impossible. (In this way, switching from a 32-bit version of an application to a 64-bit version is the closest you can get to downloading more RAM!)

Therefore, if you have a machine with 4 GiB or less of RAM, switching to 64-bit Firefox probably won’t help. But if you have 8 GiB or more, switching to 64-bit Firefox probably will help the memory usage situation.

Official 64-bit versions of Firefox have been available since December 2015. If the above discussion has interested you, please try them out. But note the following caveats.

  • Flash and Silverlight are the only supported 64-bit plugins.
  • There are some Flash content regressions due to our NPAPI sandbox (for content that uses advanced features like GPU acceleration or microphone APIs).

On the flip side, as well as avoiding address space exhaustion problems, a security feature known as ASLR works much better in 64-bit applications than in 32-bit applications, so 64-bit Firefox will be slightly more secure.

Work is being ongoing to fix or minimize the mentioned caveats, and it is expected that 64-bit Firefox will be rolled out in increasing numbers in the not-too-distant future.

UPDATE: Chris Peterson gave me the following measurements about daily active users on Windows.

  • 66.0% are running 32-bit Firefox on 64-bit Windows. These users could switch to a 64-bit Firefox.
  • 32.3% are running 32-bit Firefox on 32-bit Windows. These users cannot switch to a 64-bit Firefox.
  • 1.7% are running 64-bit Firefox already.

UPDATE 2: Also from Chris Peterson, here are links to 64-bit builds for all the channels: