Author Archives: Nicholas Nethercote

How to speed up the Rust compiler

Rust is a great language, and Mozilla plans to use it extensively in Firefox. However, the Rust compiler (rustc) is quite slow and compile times are a pain point for many Rust users. Recently I’ve been working on improving that. This post covers how I’ve done this, and should be of interest to anybody else who wants to help speed up the Rust compiler. Although I’ve done all this work on Linux it should be mostly applicable to other platforms as well.

Getting the code

The first step is to get the rustc code. First, I fork the main Rust repository on GitHub. Then I make two local clones: a base clone that I won’t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). I use commands something like this:

for r in rust0 rust1 ; do
  cd ~/moz
  git clone https://github.com/$user/rust $r
  cd $r
  git remote add upstream https://github.com/rust-lang/rust
  git remote set-url origin git@github.com:$user/rust.git
done

Building the Rust compiler

Within the two repositories, I first configure:

./configure --enable-optimize --enable-debuginfo

I configure with optimizations enabled because that matches release versions of rustc. And I configure with debug info enabled so that I get good information from profilers.

Then I build:

RUSTFLAGS='' make -j8

[Update: I previously had -Ccodegen-units=8 in RUSTFLAGS because it speeds up compile times. But Lars Bergstrom informed me that it can slow down the resulting program significantly. I measured and he was right — the resulting rustc was about 5–10% slower. So I’ve stopped using it now.]

That does a full build, which does the following:

  • Downloads a stage0 compiler, which will be used to build the stage1 local compiler.
  • Builds LLVM, which will become part of the local compilers.
  • Builds the stage1 compiler with the stage0 compiler.
  • Builds the stage2 compiler with the stage1 compiler.

It can be mind-bending to grok all the stages, especially with regards to how libraries work. (One notable example: the stage1 compiler uses the system allocator, but the stage2 compiler uses jemalloc.) I’ve found that the stage1 and stage2 compilers have similar performance. Therefore, I mostly measure the stage1 compiler because it’s much faster to just build the stage1 compiler, which I do with the following command.

RUSTFLAGS='-Ccodegen-units=8' make -j8 rustc-stage1

Building the compiler takes a while, which isn’t surprising. What is more surprising is that rebuilding the compiler after a small change also takes a while. That’s because a lot of code gets recompiled after any change. There are two reasons for this.

  • Rust’s unit of compilation is the crate. Each crate can consist of multiple files. If you modify a crate, the whole crate must be rebuilt. This isn’t surprising.
  • rustc’s dependency checking is very coarse. If you modify a crate, every other crate that depends on it will also be rebuilt, no matter how trivial the modification. This surprised me greatly. For example, any modification to the parser (which is in a crate called libsyntax) causes multiple other crates to be recompiled, a process which takes 6 minutes on my fast desktop machine. Almost any change to the compiler will result in a rebuild that takes at least 2 or 3 minutes.

Incremental compilation should greatly improve the dependency situation, but it’s still in an experimental state and I haven’t tried it yet.

To run all the tests I do this (after a full build):

ulimit -c 0 && make check

The checking aborts if you don’t set the ulimit, because the tests produce lots of core files and the test harness doesn’t want them to swamp your disk.

The build system is complex, with lots of options. This command gives a nice overview of some common invocations:

make tips

Basic profiling

The next step is to do some basic profiling. I like to be careful about which rustc I am invoking at any time, especially if there’s a system-wide version installed, so I avoid relying on PATH and instead define some environment variables like this:

export RUSTC01="$HOME/moz/rust0/x86_64-unknown-linux-gnu/stage1/bin/rustc"
export RUSTC02="$HOME/moz/rust0/x86_64-unknown-linux-gnu/stage2/bin/rustc"
export RUSTC11="$HOME/moz/rust1/x86_64-unknown-linux-gnu/stage1/bin/rustc"
export RUSTC12="$HOME/moz/rust1/x86_64-unknown-linux-gnu/stage2/bin/rustc"

In the examples that follow I will use $RUSTC01 as the version of rustc that I invoke.

rustc has the ability to produce some basic stats about the time and memory used by each compiler pass. It is enabled with the -Ztime-passes flag. If you are invoking rustc directly you’d do it like this:

$RUSTC01 -Ztime-passes

If you are building with Cargo you can instead do this:

RUSTC=$RUSTC01 cargo rustc -- -Ztime-passes

The RUSTC= part tells Cargo you want to use a non-default rustc, and the part after the -- is flags that will be passed to rustc when it builds the final crate. (A bit weird, but useful.)

Here is some sample output from -Ztime-passes:

time: 0.056; rss: 49MB parsing
time: 0.000; rss: 49MB recursion limit
time: 0.000; rss: 49MB crate injection
time: 0.000; rss: 49MB plugin loading
time: 0.000; rss: 49MB plugin registration
time: 0.103; rss: 87MB expansion
time: 0.000; rss: 87MB maybe building test harness
time: 0.002; rss: 87MB maybe creating a macro crate
time: 0.000; rss: 87MB checking for inline asm in case the target doesn't support it
time: 0.005; rss: 87MB complete gated feature checking
time: 0.008; rss: 87MB early lint checks
time: 0.003; rss: 87MB AST validation
time: 0.026; rss: 90MB name resolution
time: 0.019; rss: 103MB lowering ast -> hir
time: 0.004; rss: 105MB indexing hir
time: 0.003; rss: 105MB attribute checking
time: 0.003; rss: 105MB language item collection
time: 0.004; rss: 105MB lifetime resolution
time: 0.000; rss: 105MB looking for entry point
time: 0.000; rss: 105MB looking for plugin registrar
time: 0.015; rss: 109MB region resolution
time: 0.002; rss: 109MB loop checking
time: 0.002; rss: 109MB static item recursion checking
time: 0.060; rss: 109MB compute_incremental_hashes_map
time: 0.000; rss: 109MB load_dep_graph
time: 0.021; rss: 109MB type collecting
time: 0.000; rss: 109MB variance inference
time: 0.038; rss: 113MB coherence checking
time: 0.126; rss: 114MB wf checking
time: 0.219; rss: 118MB item-types checking
time: 1.158; rss: 125MB item-bodies checking
time: 0.000; rss: 125MB drop-impl checking
time: 0.092; rss: 127MB const checking
time: 0.015; rss: 127MB privacy checking
time: 0.002; rss: 127MB stability index
time: 0.011; rss: 127MB intrinsic checking
time: 0.007; rss: 127MB effect checking
time: 0.027; rss: 127MB match checking
time: 0.014; rss: 127MB liveness checking
time: 0.082; rss: 127MB rvalue checking
time: 0.145; rss: 161MB MIR dump
 time: 0.015; rss: 161MB SimplifyCfg
 time: 0.033; rss: 161MB QualifyAndPromoteConstants
 time: 0.034; rss: 161MB TypeckMir
 time: 0.001; rss: 161MB SimplifyBranches
 time: 0.006; rss: 161MB SimplifyCfg
time: 0.089; rss: 161MB MIR passes
time: 0.202; rss: 161MB borrow checking
time: 0.005; rss: 161MB reachability checking
time: 0.012; rss: 161MB death checking
time: 0.014; rss: 162MB stability checking
time: 0.000; rss: 162MB unused lib feature checking
time: 0.101; rss: 162MB lint checking
time: 0.000; rss: 162MB resolving dependency formats
 time: 0.001; rss: 162MB NoLandingPads
 time: 0.007; rss: 162MB SimplifyCfg
 time: 0.017; rss: 162MB EraseRegions
 time: 0.004; rss: 162MB AddCallGuards
 time: 0.126; rss: 164MB ElaborateDrops
 time: 0.001; rss: 164MB NoLandingPads
 time: 0.012; rss: 164MB SimplifyCfg
 time: 0.008; rss: 164MB InstCombine
 time: 0.003; rss: 164MB Deaggregator
 time: 0.001; rss: 164MB CopyPropagation
 time: 0.003; rss: 164MB AddCallGuards
 time: 0.001; rss: 164MB PreTrans
time: 0.182; rss: 164MB Prepare MIR codegen passes
 time: 0.081; rss: 167MB write metadata
 time: 0.590; rss: 177MB translation item collection
 time: 0.034; rss: 180MB codegen unit partitioning
 time: 0.032; rss: 300MB internalize symbols
time: 3.491; rss: 300MB translation
time: 0.000; rss: 300MB assert dep graph
time: 0.000; rss: 300MB serialize dep graph
 time: 0.216; rss: 292MB llvm function passes [0]
 time: 0.103; rss: 292MB llvm module passes [0]
 time: 4.497; rss: 308MB codegen passes [0]
 time: 0.004; rss: 308MB codegen passes [0]
time: 5.185; rss: 308MB LLVM passes
time: 0.000; rss: 308MB serialize work products
time: 0.257; rss: 297MB linking

As far as I can tell, the indented passes are sub-passes, and the parent pass is the first non-indented pass afterwards.

More serious profiling

The -Ztime-passes flag gives a good overview, but you really need a profiling tool that gives finer-grained information to get far. I’ve done most of my profiling with two Valgrind tools, Cachegrind and DHAT. I invoke Cachegrind like this:

valgrind \
 --tool=cachegrind --cache-sim=no --branch-sim=yes \
 --cachegrind-out-file=$OUTFILE $RUSTC01 ...

where $OUTFILE specifies an output filename. I find the instruction counts measured by Cachegrind to be highly useful; the branch simulation results are occasionally useful, and the cache simulation results are almost never useful.

The Cachegrind output looks like this:

22,153,170,953 PROGRAM TOTALS

         Ir file:function
923,519,467 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:_int_malloc
879,700,120 /home/njn/moz/rust0/src/rt/miniz.c:tdefl_compress
629,196,933 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:_int_free
394,687,991 ???:???
379,869,259 /home/njn/moz/rust0/src/libserialize/
376,921,973 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:malloc
263,083,755 /build/glibc-GKVZIf/glibc-2.23/string/::/sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:__memcpy_avx_unaligned
257,219,281 /home/njn/moz/rust0/src/libserialize/<serialize::opaque::Decoder<'a> as serialize::serialize::Decoder>::read_usize
217,838,379 /build/glibc-GKVZIf/glibc-2.23/malloc/malloc.c:free
217,006,132 /home/njn/moz/rust0/src/librustc_back/
211,098,567 ???:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults&, llvm::CodeGenOpt::Level)
185,630,213 /home/njn/moz/rust0/src/libcore/hash/<rustc_incremental::calculate_svh::hasher::IchHasher as core::hash::Hasher>::write
171,360,754 /home/njn/moz/rust0/src/librustc_data_structures/<rustc::ty::subst::Substs<'tcx> as core::hash::Hash>::hash
150,026,054 ???:llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)

Here “Ir” is short for “I-cache reads”, which corresponds to the number of instructions executed. Cachegrind also gives line-by-line annotations of the source code.

The Cachegrind results indicate that malloc and free are usually the two hottest functions in the compiler. So I also use DHAT, which is a malloc profiler that tells you exactly where all your malloc calls are coming from.  I invoke DHAT like this:

/home/njn/grind/ws3/vg-in-place \
 --tool=exp-dhat --show-top-n=1000 --num-callers=4 \
 --sort-by=tot-blocks-allocd $RUSTC01 ... 2> $OUTFILE

I sometimes also use --sort-by=tot-bytes-allocd. DHAT’s output looks like this:

==16425== -------------------- 1 of 1000 --------------------
==16425== max-live: 30,240 in 378 blocks
==16425== tot-alloc: 20,866,160 in 260,827 blocks (avg size 80.00)
==16425== deaths: 260,827, at avg age 113,438 (0.00% of prog lifetime)
==16425== acc-ratios: 0.74 rd, 1.00 wr (15,498,021 b-read, 20,866,160 b-written)
==16425== at 0x4C2BFA6: malloc (vg_replace_malloc.c:299)
==16425== by 0x5AD392B: <syntax::ptr::P<T> as serialize::serialize::Decodable>::decode (
==16425== by 0x5AD4456: <core::iter::Map<I, F> as core::iter::iterator::Iterator>::next (
==16425== by 0x5AE2A52: rustc_metadata::decoder::<impl rustc_metadata::cstore::CrateMetadata>::get_attributes (
==16425== -------------------- 2 of 1000 --------------------
==16425== max-live: 1,360 in 17 blocks
==16425== tot-alloc: 10,378,160 in 129,727 blocks (avg size 80.00)
==16425== deaths: 129,727, at avg age 11,622 (0.00% of prog lifetime)
==16425== acc-ratios: 0.47 rd, 0.92 wr (4,929,626 b-read, 9,599,798 b-written)
==16425== at 0x4C2BFA6: malloc (vg_replace_malloc.c:299)
==16425== by 0x881136A: <syntax::ptr::P<T> as core::clone::Clone>::clone (
==16425== by 0x88233A7: syntax::ext::tt::macro_parser::parse (
==16425== by 0x8812E66: syntax::tokenstream::TokenTree::parse (

The “deaths” value here indicates the total number of calls to malloc for each call stack, which is usually the metric of most interest. The “acc-ratios” value can also be interesting, especially if the “rd” value is 0.00, because that indicates the allocated blocks are never read. (See below for examples of problems that I found this way.)

For both profilers I also pipe $OUTFILE through eddyb’s script, which demangles ugly Rust symbols into something much nicer, like this:

<serialize::opaque::Decoder<'a> as serialize::serialize::Decoder>::read_usize

For programs that use Cargo, sometimes it’s useful to know the exact rustc invocations that Cargo uses. Find out with either of these commands:

RUSTC=$RUSTC01 cargo build -v
RUSTC=$RUSTC01 cargo rustc -v

I also have done a decent amount of ad hoc println profiling, where I insert println! calls in hot parts of the code and then I use a script to post-process them. This can be very useful when I want to know exactly how many times particular code paths are hit.
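
For example, here is a minimal sketch of that pattern (the function name and the “PROF:” tag are hypothetical, not from rustc): each hit on a hot code path prints a line with a greppable prefix, and a post-processing script can count or aggregate those lines afterwards.

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical hot function; the "PROF:" prefix makes the output easy to
// grep and aggregate with a post-processing script.
static HITS: AtomicUsize = AtomicUsize::new(0);

fn intern_type(name: &str) {
    let n = HITS.fetch_add(1, Ordering::Relaxed) + 1;
    println!("PROF: intern_type hit #{}: {}", n, name);
}

fn main() {
    for name in ["u32", "String", "u32"] {
        intern_type(name);
    }
}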

I’ve also tried perf. It works, but I’ve never established much of a rapport with it. YMMV. In general, any profiler that works with C or C++ code should also work with Rust code.

Finding suitable benchmarks

Once you know how you’re going to profile you need some good workloads. You could use the compiler itself, but it’s big and complicated and reasoning about the various stages can be confusing, so I have avoided that myself.

Instead, I have focused entirely on rustc-benchmarks, a pre-existing rustc benchmark suite. It contains 13 benchmarks of various sizes. It has been used to track rustc’s performance for some time, but it wasn’t easy to use locally until I wrote a script for that purpose. I invoke it something like this:

./ \
  /home/njn/moz/rust0/x86_64-unknown-linux-gnu/stage1/bin/rustc \
  /home/njn/moz/rust1/x86_64-unknown-linux-gnu/stage1/bin/rustc
It compares the two given compilers, doing debug builds, on the benchmarks. See the next section for example output. If you want to run a subset of the benchmarks you can specify them as additional arguments.

Each benchmark in rustc-benchmarks has a makefile with three targets. See the README for details on these targets, which can be helpful.


Here are the results when I compare the following two versions of rustc with the benchmarking script.

  • The commit just before my first commit (on September 12).
  • A commit from October 13.

futures-rs-test  5.028s vs  4.433s --> 1.134x faster (variance: 1.020x, 1.030x)
helloworld       0.283s vs  0.235s --> 1.202x faster (variance: 1.012x, 1.025x)
html5ever-2016-  6.293s vs  5.652s --> 1.113x faster (variance: 1.011x, 1.008x)
hyper.0.5.0      6.182s vs  5.039s --> 1.227x faster (variance: 1.002x, 1.018x)
inflate-0.1.0    5.168s vs  4.935s --> 1.047x faster (variance: 1.001x, 1.002x)
issue-32062-equ  0.457s vs  0.347s --> 1.316x faster (variance: 1.010x, 1.007x)
issue-32278-big  2.046s vs  1.706s --> 1.199x faster (variance: 1.003x, 1.007x)
jld-day15-parse  1.793s vs  1.538s --> 1.166x faster (variance: 1.059x, 1.020x)
piston-image-0. 13.871s vs 11.885s --> 1.167x faster (variance: 1.005x, 1.005x)
regex.0.1.30     2.937s vs  2.516s --> 1.167x faster (variance: 1.010x, 1.002x)
rust-encoding-0  2.414s vs  2.078s --> 1.162x faster (variance: 1.006x, 1.005x)
syntex-0.42.2   36.526s vs 32.373s --> 1.128x faster (variance: 1.003x, 1.004x)
syntex-0.42.2-i 21.500s vs 17.916s --> 1.200x faster (variance: 1.007x, 1.013x)

Not all of the improvement is due to my changes, but I have managed a few nice wins, including the following.

#36592: There is an arena allocator called TypedArena. rustc creates many of these, mostly short-lived. On creation, each arena would allocate a 4096 byte chunk, in preparation for the first arena allocation request. But DHAT’s output showed me that the vast majority of arenas never received such a request! So I made TypedArena lazy — the first chunk is now only allocated when necessary. This reduced the number of calls to malloc greatly, which sped up compilation of several rustc-benchmarks by 2–6%.
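
Here is a minimal sketch of the lazy-first-chunk idea (this is not rustc’s actual TypedArena, just a single-chunk illustration): creating the arena allocates nothing, and the 4096-byte chunk is only requested from malloc on the first allocation.

// Not rustc's TypedArena; a simplified single-chunk illustration of laziness.
struct LazyArena {
    chunk: Option<Vec<u8>>, // None until the first allocation request
    used: usize,
}

impl LazyArena {
    const CHUNK_SIZE: usize = 4096;

    fn new() -> LazyArena {
        // No heap allocation here: many arenas are created but never used.
        LazyArena { chunk: None, used: 0 }
    }

    fn alloc(&mut self, size: usize) -> &mut [u8] {
        assert!(self.used + size <= Self::CHUNK_SIZE);
        let start = self.used;
        self.used += size;
        // The chunk is allocated only when it is first needed.
        let chunk = self.chunk.get_or_insert_with(|| vec![0u8; Self::CHUNK_SIZE]);
        &mut chunk[start..start + size]
    }
}

fn main() {
    let mut arena = LazyArena::new(); // no malloc yet
    let _bytes = arena.alloc(16);     // the chunk is allocated here
}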

#36734: This one was similar. Rust’s HashMap implementation is lazy — it doesn’t allocate any memory for elements until the first one is inserted. This is a good thing because it’s surprisingly common in large programs to create HashMaps that are never used. However, Rust’s HashSet implementation (which is just a layer on top of the HashMap) didn’t have this property, and guess what? rustc also creates large numbers of HashSets that are never used. (Again, DHAT’s output made this obvious.) So I fixed that, which sped up compilation of several rustc-benchmarks by 1–4%. Even better, because this change is to Rust’s stdlib, rather than rustc itself, it will speed up any program that creates HashSets without using them.
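
A minimal sketch of the same idea, assuming a set implemented as a thin wrapper over a map (hypothetical names, not the actual std code): because the underlying map allocates nothing until its first insertion, the set inherits that laziness for free.

use std::collections::HashMap;
use std::hash::Hash;

struct LazySet<T> {
    map: HashMap<T, ()>, // HashMap::new() allocates nothing until the first insert
}

impl<T: Eq + Hash> LazySet<T> {
    fn new() -> LazySet<T> {
        // Creating a never-used set therefore costs no malloc calls.
        LazySet { map: HashMap::new() }
    }

    fn insert(&mut self, value: T) -> bool {
        // The first insert triggers the map's first allocation.
        self.map.insert(value, ()).is_none()
    }

    fn contains(&self, value: &T) -> bool {
        self.map.contains_key(value)
    }
}

fn main() {
    let mut set = LazySet::new(); // no heap allocation yet
    assert!(set.insert(42));
    assert!(set.contains(&42));
}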

#36917: This one involved avoiding some useless data structure manipulation when a particular table was empty. Again, DHAT pointed out a table that was created but never read, which was the clue I needed to identify this improvement. This sped up two benchmarks by 16% and a couple of others by 3–5%.
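
The general pattern looks something like this (hypothetical names, not the actual rustc code): check for the empty case up front and skip the manipulation entirely.

use std::collections::HashMap;

// Hypothetical example: apply per-index overrides to a vector of values.
fn apply_overrides(values: &mut Vec<u32>, overrides: &HashMap<usize, u32>) {
    // Bail out early: don't build temporary structures for a table that was
    // created but never populated.
    if overrides.is_empty() {
        return;
    }
    for (i, v) in values.iter_mut().enumerate() {
        if let Some(&new) = overrides.get(&i) {
            *v = new;
        }
    }
}

fn main() {
    let mut values = vec![1, 2, 3];
    apply_overrides(&mut values, &HashMap::new()); // cheap no-op
}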

#37064: This one changed a hot function in serialization code to return a Cow<str> instead of a String, which avoided a lot of allocations.
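
The Cow<str> pattern looks something like this (a sketch with assumed names, not the actual serialization code): borrow the input in the common case and only allocate a String when the value actually needs to be modified.

use std::borrow::Cow;

fn normalize(name: &str) -> Cow<'_, str> {
    if name.contains(' ') {
        // Rare case: an allocation is needed to build the modified string.
        Cow::Owned(name.replace(' ', "_"))
    } else {
        // Common case: no allocation, just borrow the input.
        Cow::Borrowed(name)
    }
}

fn main() {
    assert!(matches!(normalize("foo"), Cow::Borrowed(_)));
    assert!(matches!(normalize("foo bar"), Cow::Owned(_)));
}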

Future work

Profiles indicate that the following parts of the compiler account for a lot of its runtime.

  • malloc and free are still the two hottest functions in most benchmarks. Avoiding heap allocations can be a win.
  • Compression is used for crate metadata and LLVM bitcode. (This shows up in profiles under a function called tdefl_compress.)  There is an issue open about this.
  • Hash table operations are hot. A lot of this comes from the interning of various values during type checking; see the CtxtInterners type for details. (A sketch of the general interning pattern follows this list.)
  • Crate metadata decoding is also costly.
  • LLVM execution is a big chunk, especially when doing optimized builds. So far I have treated LLVM as a black box and haven’t tried to change it, at least partly because I don’t know how to build it with debug info, which is necessary to get source files and line numbers in profiles.
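
As mentioned in the list above, much of the hash table traffic comes from interning. Here is a minimal sketch of the general interning pattern (hypothetical code, not rustc’s actual CtxtInterners): every interned value involves at least one hash lookup, and possibly an insertion, so these tables are hit constantly during type checking.

use std::collections::HashMap;

// Hash-cons values so each distinct value is stored once; later lookups
// return a cheap index instead of re-allocating the value.
struct Interner {
    map: HashMap<String, u32>, // value -> index
    values: Vec<String>,       // index -> value
}

impl Interner {
    fn new() -> Interner {
        Interner { map: HashMap::new(), values: Vec::new() }
    }

    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&idx) = self.map.get(s) {
            return idx; // already interned: one hash lookup, no allocation
        }
        let idx = self.values.len() as u32;
        self.values.push(s.to_string());
        self.map.insert(s.to_string(), idx);
        idx
    }
}

fn main() {
    let mut interner = Interner::new();
    assert_eq!(interner.intern("i32"), interner.intern("i32"));
}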

A lot of programs have broadly similar profiles, but occasionally you get an odd one that stresses a different part of the compiler. For example, in rustc-benchmarks, inflate-0.1.0 is dominated by operations involving the (delightfully named) ObligationsForest (see #36993), and html5ever-2016-08-25 is dominated by what I think is macro processing. So it’s worth profiling the compiler on new codebases.

Caveat lector

I’m still a newcomer to Rust development. Although I’ve had lots of help on the #rustc IRC channel — big thanks to eddyb and simulacrum in particular — there may be things I am doing wrong or sub-optimally. Nonetheless, I hope this is a useful starting point for newcomers who want to speed up the Rust compiler.

How to get localized Firefox Nightly builds

One of the easiest and best ways that someone can help Mozilla and Firefox is to run Firefox Nightly. I’ve been doing it on my Windows, Mac and Linux machines for the past couple of months. It requires daily restarts, but otherwise it has been a smooth experience for me.

Unfortunately the number of Nightly users has been steadily dropping for some time, which hurts our ability to catch crashes and other regressions early. Pascal Chevrel and Marcia Knous are leading efforts to reverse this trend.

One problem with Nightly builds has been their visibility. In particular, finding localized (non-English) builds was difficult. That situation has just improved: thanks to Kohei Yoshino there is now a single page containing Nightly builds for all platforms and locales. As far as I know there are no other pages that currently link to that page, but perhaps that will happen as part of the planned work to give Nightly builds a place on mozilla.org.

If you have friends and family who would like to help Mozilla and are willing to use pre-release versions of Firefox, please suggest Firefox Nightly to them.

How to switch to a 64-bit Firefox on Windows

I recently wrote about 64-bit Firefox builds on Windows, explaining why you might want to switch — it can reduce the likelihood of out-of-memory crashes — and also some caveats.

However, I didn’t explain how to switch, so I will do that now.

First, if you want to make sure that you aren’t already running a 64-bit Firefox, type “about:support” in the address bar and then look at the User Agent field in the Application Basics table near the top of the page.

  • If it contains the string “Win64”, you are already running a 64-bit Firefox.
  • If it contains the string “WOW64”, you are running a 32-bit Firefox on a 64-bit Windows installation, which means you can switch to a 64-bit build.
  • Otherwise, you are running a 32-bit Firefox on a 32-bit Windows installation, and cannot switch to a 64-bit Firefox.

Here are links to pages containing 64-bit builds for all the different release channels.

  • Release
  • Beta
  • Developer Edition
  • Nightly:  This is a user-friendly page, but it only has the en-US locale.
  • Nightly: This is a more intimidating page, but it has all locales. Look for a file with a name of the form firefox-<VERSION>.<LOCALE>.win64.installer.exe, e.g. firefox-50.0a1.de.win64.installer.exe for Nightly 50 in German.

By default, 32-bit Firefox and 64-bit Firefox are installed to different locations:

  • C:\Program Files (x86)\Mozilla Firefox\
  • C:\Program Files\Mozilla Firefox\

If you are using a 32-bit Firefox and then you download and install a 64-bit Firefox, by default you will end up with two versions of Firefox installed. (But note that if you let the 64-bit Firefox installer add shortcuts to the desktop and/or taskbar, these shortcuts will replace any existing shortcuts to 32-bit Firefox.)

Both the 32-bit Firefox and the 64-bit Firefox will use the same profile, which means all your history, bookmarks, extensions, etc., will be available in either version. You’ll be able to run both versions, though not at the same time with the same profile. If you decide you don’t need both versions you can simply remove the unneeded version through the Windows system settings, as normal; your profile will not be touched when you do this.

Finally, there is a plan to gradually roll out 64-bit Firefox to Windows users in increasing numbers.

Firefox 64-bit for Windows can take advantage of more memory

By default, on Windows, Firefox is a 32-bit application. This means that it is limited to using at most 4 GiB of memory, even on machines that have more than 4 GiB of physical memory (RAM). In fact, depending on the OS configuration, the limit may be as low as 2 GiB.

Now, 2–4 GiB might sound like a lot of memory, but it’s not that unusual for power users to use that much. This includes:

  • users with many (dozens or even hundreds) of tabs open;
  • users with many (dozens) of extensions;
  • users of memory-hungry web sites and web apps; and
  • users who do all of the above!

Furthermore, in practice it’s not possible to totally fill up this available space because fragmentation inevitably occurs. For example, Firefox might need to make a 10 MiB allocation and there might be more than 10 MiB of unused memory, but if that available memory is divided into many pieces all of which are smaller than 10 MiB, then the allocation will fail.

When an allocation does fail, Firefox can sometimes handle it gracefully. But often this isn’t possible, in which case Firefox will abort. Although this is a controlled abort, the effect for the user is basically identical to an uncontrolled crash, and they’ll have to restart Firefox. A significant fraction of Firefox crashes/aborts are due to this problem, known as address space exhaustion.

Fortunately, there is a solution to this problem available to anyone using a 64-bit version of Windows: use a 64-bit version of Firefox. Now, 64-bit applications typically use more memory than 32-bit applications. This is because pointers, a common data type, are twice as big; a rough estimate for 64-bit Firefox is that it might use 25% more memory. However, 64-bit applications also have a much larger address space, which means they can access vast amounts of physical memory, and address space exhaustion is all but impossible. (In this way, switching from a 32-bit version of an application to a 64-bit version is the closest you can get to downloading more RAM!)

Therefore, if you have a machine with 4 GiB or less of RAM, switching to 64-bit Firefox probably won’t help. But if you have 8 GiB or more, switching to 64-bit Firefox probably will help the memory usage situation.

Official 64-bit versions of Firefox have been available since December 2015. If the above discussion has interested you, please try them out. But note the following caveats.

  • Flash and Silverlight are the only supported 64-bit plugins.
  • There are some Flash content regressions due to our NPAPI sandbox (for content that uses advanced features like GPU acceleration or microphone APIs).

On the flip side, as well as avoiding address space exhaustion problems, a security feature known as ASLR works much better in 64-bit applications than in 32-bit applications, so 64-bit Firefox will be slightly more secure.

Work is ongoing to fix or minimize the caveats mentioned above, and it is expected that 64-bit Firefox will be rolled out to increasing numbers of users in the not-too-distant future.

UPDATE: Chris Peterson gave me the following measurements about daily active users on Windows.

  • 66.0% are running 32-bit Firefox on 64-bit Windows. These users could switch to a 64-bit Firefox.
  • 32.3% are running 32-bit Firefox on 32-bit Windows. These users cannot switch to a 64-bit Firefox.
  • 1.7% are running 64-bit Firefox already.

UPDATE 2: Chris Peterson also provided links to 64-bit builds for all the channels.

I want more users on the Nightly channel

I have been working recently on a new Platform Engineering initiative called Uptime, the goal of which is to reduce Firefox’s crash rate on both desktop and mobile. As a result I’ve been spending a lot of time looking at crash reports, particularly on the Nightly channel. This in turn has increased my appreciation of how important Nightly channel users are.

A crash report from a Nightly user is much more useful than a crash report from a non-Nightly user, for two reasons.

  • If a developer lands a change that triggers crashes for Nightly users, they will get fast feedback via crash reports, often within a day or two.  This maximizes the likelihood of a fix, because the particular change will be fresh in the developer’s mind. Also, backing out changes is usually easy at this point. In contrast, finding out about a crash weeks or months later is less useful.
  • Because a new Nightly build is done every night, if a new crash signature appears, we have a fairly small regression window. This makes it easier to identify which change caused the new crashes.

Also, Nightly builds contain some extra diagnostics and checks that can also be helpful with identifying a range of problems. (See MOZ_DIAGNOSTIC_ASSERT for one example.)

If we could significantly increase the size of our Nightly user population, that would definitely help reduce crash rates. We would get data about a wider range of crashes. We would also get stronger signals for specific crash-causing defects. This is important because the number of crash reports received for each Nightly build is relatively low, and it’s often the case that a cluster of crash reports that come from two or more different users will receive more attention than a cluster that comes from a single user.

(You might be wondering how we distinguish those two cases. Each crash report doesn’t contain enough information to individually identify the user — unless the user entered their email address into the crash reporting form — but crash reports do contain enough information that you can usually tell if two different crash reports have come from two different users. For example, the installation time alone is usually enough, because it’s measured to the nearest second.)

All this is doubly true on Android, where the number of Nightly users is much smaller than on Windows, Mac and Linux.

Using the Nightly channel is not the best choice for everyone. There are some disadvantages.

  • Nightly is less stable than later channels, but not drastically so. The crash rate is typically 1.5–2.5 times higher than Beta or Release, though occasionally it spikes higher for a short period. So a Nightly user should be comfortable with the prospect of less stability.
  • Nightly gets updated every 24 hours, which some people would find annoying.

There are also advantages.

  • Nightly users get to experience new features and fixes immediately.
  • Nightly users get the satisfaction that they are helping produce a better Firefox. The frustration of any crash is offset by the knowledge that the information in the corresponding crash report is disproportionately valuable. Indeed, there’s a non-trivial likelihood that a single crash report from a Nightly user will receive individual attention from an engineer.

If you, or somebody you know, thinks that those advantages outweigh the disadvantages, please consider switching. Thank you.

More compacting GC

Jon Coppeard recently extended SpiderMonkey’s compacting GC abilities. Previously, the GC could only compact GC arenas containing JavaScript objects. Now it can also compact arenas containing shapes (a data structure used within SpiderMonkey which isn’t visible to user code) and strings, which are two of the largest users of memory in the GC heap after objects.

These improvements should result in savings of multiple MiBs in most workloads, and they are on track to ship in Firefox 48, which will be released in early August. Great work, Jon!

Talky is a nice WebRTC client

I’ve written before about using Firefox Hello, the video chat feature that is now built into Firefox. Firefox Hello is built on top of WebRTC, which is now part of the web platform. This means that video chat can also be implemented in ordinary webpages.

I’ve been using Talky recently for lots of 1-on-1 meetings and even some groups meetings. It has some really nice features.

  • You can choose a room name, which becomes part of the URL.
  • There’s an optional tab-sharing feature.
  • The UI is simple and provides a symmetric experience for all participants.

Great stuff!

Fast starring of oranges on try pushes

If you do lots of try pushes it’s worth learning to star oranges using the keyboard instead of the mouse. It’s much faster that way.

The important keystrokes are as follows.

  • ‘j’ and ‘n’ move focus to the next unstarred failure. (‘k’ and ‘p’ move focus to the previous unstarred failure.)
  • Space adds the selected failure to the pinboard.
  • Ctrl-Enter stars all jobs in the pinboard.

So a typical keystroke sequence to star multiple jobs would be: j, space, j, space, j, space, ctrl-enter. Between each j and space you should, of course, check that the failure matches an existing one.

This information, along with lots of other interesting tidbits, is in the Treeherder User Guide.

Thank you to Phil Ringnalda for teaching me this.

Getting my bluetooth keyboard working again after an Ubuntu upgrade

I have a wireless bluetooth keyboard (a Logitech diNovo ultra-flat, about seven years old) that I love. Every time I update my Ubuntu installation there’s a ~50% chance that it’ll stop working, and on the update to 15.10 that I just did I got unlucky.

To get it working again I had to comment out the following two lines in /lib/udev/rules.d/97-hid2hci.rules and then reboot.

KERNEL=="hiddev*", ATTRS{idVendor}=="046d", ATTRS{idProduct}=="c70[345abce]|c71[3bc]", \
  RUN+="hid2hci --method=logitech-hid --devpath=%p"

I’ve had to do something similar on more than one previous occasion. The idea originated here, but note that the name of the rules file has changed since that was written.

(When I updated to 15.04 this problem did not manifest. However, I got unlucky and the batteries in the keyboard died while the update was occurring. Batteries in this keyboard typically last 4–5 months, and diagnosing dead batteries is normally easy — hey, the keyboard stopped working suddenly! — but because Ubuntu updates had caused troubles with this keyboard in the past I assumed the update was the cause. I didn’t think to try new batteries until I’d spent a couple of tedious hours deep in the bluetooth configuration weeds. Lesson learned.)

I rewrote Firefox’s BMP decoder

Recently I’ve been deliberately working on some areas of Firefox I’m unfamiliar with, particularly relating to graphics. This led me to rewriting Firefox’s BMP decoder, and I learned a number of interesting things along the way.

Image decoding

Image decoding is basically the process of taking an image encoded in a file and extracting its pixels. In principle it’s simple. You start by reading some information about the image, such as its size and colour depth, which typically comes in some kind of fixed-size header. Then you read the pixel data, which is variable-sized.

This isn’t hard if you have all the data available at the start. But in the context of a browser it makes sense to decode incrementally as data comes in over the network. In that situation you have to be careful and constantly check if you have enough data yet to safely read the next chunk of data. This checking is error-prone and tends to spread itself all over the image decoder.

For this reason, Seth Fowler recently wrote a new class called StreamingLexer that encapsulates this checking and exposes a nice state-based interface to image decoders. When a decoder changes state (e.g. it finishes reading the header) it tells StreamingLexer how many bytes it needs to safely enter the next state (e.g. to read the first row of pixels) and StreamingLexer won’t return control to the decoder until that many bytes are available.
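
To make the idea concrete, here is a minimal Rust sketch of that kind of state-based interface (Firefox’s actual StreamingLexer is C++ and its API differs; the names and the toy header layout below are made up): each state declares how many bytes it needs, and the driver only hands the decoder a buffer once that many bytes are available, so the decoder itself never has to re-check lengths.

// Hypothetical image format: an 8-byte header (width and height as
// little-endian u32s), followed by `height` rows of `width` bytes each.
enum State {
    Header,
    Row,
    Done,
}

struct Decoder {
    state: State,
    width: usize,
    rows_left: usize,
}

impl Decoder {
    // How many bytes the current state needs before it can run.
    fn bytes_needed(&self) -> usize {
        match self.state {
            State::Header => 8,
            State::Row => self.width,
            State::Done => 0,
        }
    }

    // The driver guarantees that data.len() == self.bytes_needed().
    fn step(&mut self, data: &[u8]) {
        match self.state {
            State::Header => {
                self.width = u32::from_le_bytes([data[0], data[1], data[2], data[3]]) as usize;
                self.rows_left = u32::from_le_bytes([data[4], data[5], data[6], data[7]]) as usize;
                self.state = if self.rows_left == 0 { State::Done } else { State::Row };
            }
            State::Row => {
                // ... decode one row of pixels from `data` ...
                self.rows_left -= 1;
                if self.rows_left == 0 {
                    self.state = State::Done;
                }
            }
            State::Done => {}
        }
    }
}

fn main() {
    // Header says 2x1 pixels, followed by one 2-byte row.
    let mut input: &[u8] = &[2, 0, 0, 0, 1, 0, 0, 0, 0xAA, 0xBB];
    let mut d = Decoder { state: State::Header, width: 0, rows_left: 0 };
    while !matches!(d.state, State::Done) {
        // In a browser the driver would wait here until enough bytes have
        // arrived from the network.
        let (chunk, rest) = input.split_at(d.bytes_needed());
        d.step(chunk);
        input = rest;
    }
}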

Another consideration when decoding images is that you can’t trust them. E.g. an image might claim to be 100 x 100 pixels but actually contain less data than that. If you’re not careful you could easily read memory you shouldn’t, which could cause crashes or security problems. StreamingLexer helps with this, too.

StreamingLexer makes image decoders simpler and safer, and converting the BMP decoder to use it was my starting point.

The BMP format

The BMP format comes from Windows. On the web it’s mostly used for favicons, though it can also be used for normal images.

There’s no specification for BMP. There are eight in-use versions of the format that I know of, with later versions mostly(!) extending earlier versions. If you’re interested, you can read the brief description of all these versions that I wrote in a big comment at the top of nsBMPDecoder.cpp.

Because the format is so gnarly I started getting nervous that my rewrite might introduce bugs in some of the darker corners, especially once Seth told me that our BMP test coverage wasn’t that good.

So I searched around and found Jason Summers’ wonderful BMP Suite, which exercises pretty much every corner of the BMP format. Version 2.3 of the BMP Suite contains 57 images, 23 of which are “good” (obviously valid), 14 of which are “bad” (obviously invalid) and 20 of which are “questionable” (not obviously valid or invalid). The presence of this last category demonstrates just how ill-specified BMP is as a format, and some of the “questionable” tests have two or three reference images, any of which could be considered a correct rendering. (Furthermore, it’s possible to render a number of the “bad” images in a reasonable way.)

This test suite was enormously helpful. As well as giving me greater confidence in my changes, it immediately showed that we had several defects in the existing BMP decoder, particularly relating to the scaling of 16-bit colors and an almost complete lack of transparency handling. In comparison, Chrome rendered pretty much all the images in the BMP Suite reasonably, and Safari and Edge got a few wrong but still did better than Firefox.

Fixing the problems

So I fixed these problems as part of my rewrite. The following images show a number of test images that Firefox used to render incorrectly; in each case a correct rendering is on the left, and our old incorrect rendering is on the right.

[Images: bad-bmp-2, bad-bmp-3, bad-bmp-4, bad-bmp-5]

It’s clear that the old defects were mostly related to colour-handling, though the first pair of images shows a problem relating to the starting point of the pixel data.

(These images are actually from an old version of Firefox with version 2.4 of BMP Suite, which I just discovered was released only a few days ago. I just filed a bug to update the copy we use in automated testing. Happily, it looks like the new code does reasonable things with all the images added in v2.4.)

These improvements will ship in Firefox 44, which is scheduled to be released in late January, 2016. And with that done I now need to start thinking about rewriting the GIF decoder…