Author Archives: Nicholas Nethercote

San Francisco Oxidation meeting notes

At last week’s Mozilla All Hands meeting in San Francisco we had an Oxidation meeting about the use of Rust in Firefox. It was low-key, being mostly about status and progress. The notes are here for those who are interested.

How to speed up the Rust compiler some more in 2018

I recently wrote about some work I’ve done to speed up the Rust compiler. Since then I’ve done some more.

rustc-perf improvements

Since my last post, rustc-perf — the benchmark suite, harness and visualizer — has seen some improvements. First, some new benchmarks were added: cargo, ripgrep, sentry-cli, and webrender. Also, the parser benchmark has been removed because it was a toy program and thus not a good benchmark.

Second, I added support for several new profilers: Callgrind, Massif, rustc’s own -Ztime-passes, and the use of ad hoc eprintln! statements added to rustc. (This latter case is more useful than it might sound; in combination with post-processing it can be very helpful, as we will see below.)

Finally, the graphs shown on the website now have better y-axis scaling, which makes many of them easier to read. Also, there is a new dashboard view that shows performance across rustc releases.

Failures and incompletes

After my last post, multiple people said they would be interested to hear about optimization attempts of mine that failed. So here is an incomplete selection. I suggest that rustc experts read through these, because there is a chance they will be able to identify alternative approaches that I have overlooked.

nearest_common_ancestors 1: I managed to speed up this hot function in a couple of ways, but a third attempt failed. The representation of the scope tree is not done via a typical tree data structure; instead there is a HashMap of child/parent pairs. This means that moving from a child to its parent node requires a HashMap lookup, which is expensive. I spent some time designing and implementing an alternative data structure that stored nodes in a vector, with the child-to-parent links represented as indices into that vector, so that moving from a child to its parent only required an indexed access. It worked, but the speed-up turned out to be very small, and the new code was significantly more complicated, so I abandoned it.
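
Here is a minimal sketch of the kind of representation I tried (hypothetical types, not the actual rustc code): nodes live in a vector, and each child-to-parent link is just an index into that same vector.

struct ScopeTree {
    // parents[i] is the index of node i's parent; the root points to itself.
    parents: Vec<usize>,
}

impl ScopeTree {
    // Moving from a child to its parent is an indexed load, not a HashMap lookup.
    fn parent(&self, node: usize) -> usize {
        self.parents[node]
    }

    // Collect the chain of ancestors from `node` up to the root.
    fn ancestors(&self, mut node: usize) -> Vec<usize> {
        let mut chain = vec![node];
        while self.parents[node] != node {
            node = self.parents[node];
            chain.push(node);
        }
        chain
    }
}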

nearest_common_ancestors 2: A different part of the same function involves storing seen nodes in a vector. Searching this unsorted vector is O(n), so I tried instead keeping it in sorted order and using binary search, which gives O(log n) search. However, this change meant that node insertion changed from amortized O(1) to O(n) — instead of a simple push onto the end of the vector, insertion could be at any point, which required shifting all subsequent elements along. Overall this change made things slightly worse.
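
For reference, the sorted-vector approach looks something like this simplified sketch: lookup becomes O(log n), but insertion has to shift elements to keep the order.

// Simplified sketch: binary search gives O(log n) lookup, but the insert
// must shift every later element along, which is O(n).
fn insert_if_absent(v: &mut Vec<u32>, x: u32) {
    if let Err(i) = v.binary_search(&x) {
        v.insert(i, x);
    }
}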

PredicateObligation SmallVec: There is a type Vec<PredicateObligation> that is instantiated frequently, and the vectors often have few elements. I tried using a SmallVec instead, which avoids the heap allocations when the number of elements is below a threshold. (A trick I’ve used multiple times.) But this made things significantly slower! It turns out that these Vecs are copied around quite a bit, and a SmallVec is larger than a Vec because the elements are stored inline. Furthermore, PredicateObligation is a large type, over 100 bytes. So what happened was that memcpy calls were inserted to copy these SmallVecs around. The slowdown from the extra function calls and memory traffic easily outweighed the speedup from avoiding the Vec heap allocations.
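
The size difference is easy to demonstrate. This is a hedged illustration using the smallvec crate and a stand-in ~100-byte type, not the real PredicateObligation:

use smallvec::SmallVec;
use std::mem::size_of;

// Stand-in for a large element type like PredicateObligation.
struct Big([u8; 104]);

fn main() {
    // A Vec is three words no matter how big its elements are...
    println!("Vec<Big>:           {} bytes", size_of::<Vec<Big>>());
    // ...but a SmallVec must reserve space for its inline elements, so every
    // move or copy of the vector drags those bytes along via memcpy.
    println!("SmallVec<[Big; 1]>: {} bytes", size_of::<SmallVec<[Big; 1]>>());
}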

SipHasher128: Incremental compilation does a lot of hashing of data structures in order to determine what has changed from previous compilation runs. As a result, the hash function used for this is extremely hot. I tried various things to speed up the hash function, including LEB128-encoding of usize inputs (a trick that worked in the past) but I failed to speed it up.

LEB128 encoding: Speaking of LEB128 encoding, it is used a lot when writing metadata to file. I tried optimizing the LEB128 functions by special-casing the common case where the value is less than 128 and so can be encoded in a single byte. It worked, but gave a negligible improvement, so I decided it wasn’t worth the extra complication.
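
For the record, the special case is tiny. Here is a hedged sketch of an unsigned LEB128 writer with the one-byte fast path (not the actual rustc code):

// Sketch of unsigned LEB128 encoding with a fast path for values < 128,
// which fit in a single byte with the continuation bit clear.
fn write_leb128_u64(out: &mut Vec<u8>, mut value: u64) {
    if value < 0x80 {
        // Common case: one byte, no loop.
        out.push(value as u8);
        return;
    }
    loop {
        let mut byte = (value & 0x7f) as u8;
        value >>= 7;
        if value != 0 {
            byte |= 0x80; // More bytes follow.
        }
        out.push(byte);
        if value == 0 {
            break;
        }
    }
}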

Token shrinking: A previous PR shrunk the Token type from 32 to 24 bytes, giving a small win. I also tried replacing the Option<ast::Name> in Literal with just ast::Name and using an empty name to represent “no name”. That change reduced it to 16 bytes, but produced a negligible speed-up and made the code uglier, so I abandoned it.

#50549: rustc’s string interner was structured in such a way that each interned string was duplicated twice. This PR changed it to use a single Rc’d allocation, speeding up numerous benchmark runs, the best by 4%. But just after I posted the PR, @Zoxc posted #50607, which allocated the strings out of an arena, as an alternative. This gave better speed-ups and was landed instead of my PR.

#50491: This PR introduced additional uses of LazyBTreeMap (a type I had previously introduced to reduce allocations) speeding up runs of multiple benchmarks, the best by 3%. But at around the same time, @porglezomp completed a PR that changed BTreeMap to have the same lazy-allocation behaviour as LazyBTreeMap, rendering my PR moot. I subsequently removed the LazyBTreeMap type because it was no longer necessary.

#51281: This PR, by @Mark-Simulacrum, removed an unnecessary heap allocation from the RcSlice type. I had looked at this code because DHAT’s output showed it was hot in some cases, but I erroneously concluded that the extra heap allocation was unavoidable, and moved on! I should have asked about it on IRC.

Wins

#50339: rustc’s pretty-printer has a buffer that can contain up to 55 entries, and each entry is 48 bytes on 64-bit platforms. (The 55 somehow comes from the algorithm being used; I won’t pretend to understand how or why.) Cachegrind’s output showed that the pretty printer is invoked frequently (when writing metadata?) and that the zero-initialization of this buffer was expensive. I inserted some eprintln! statements and found that in the vast majority of cases only the first element of the buffer was ever touched. So this PR changed the buffer to default to length 1 and extend when necessary, speeding up runs for numerous benchmarks, the best by 3%.

#50365: I had previously optimized the nearest_common_ancestor function. Github user @kirillkh kindly suggested a tweak to the code from that PR which reduced the number of comparisons required. This PR implemented that tweak, speeding up runs of a couple of benchmarks by another 1–2%.

#50391: When compiling certain annotations, rustc needs to convert strings from unescaped form to escaped form. It was using the escape_unicode function to do this, which unnecessarily converts every char to \u{1234} form, bloating the resulting strings greatly. This PR changed the code to use the escape_default function — which only escapes chars that genuinely need escaping — speeding up runs of most benchmarks, the best by 13%. It also slightly reduced the on-disk size of produced rlibs, in the best case by 15%.
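
Both functions are in the standard library, and the difference is easy to see:

fn main() {
    let s = "abc\n";
    // escape_unicode converts every char to \u{...} form.
    let unicode: String = s.chars().map(|c| c.escape_unicode().to_string()).collect();
    // escape_default only escapes chars that genuinely need it.
    let default: String = s.chars().map(|c| c.escape_default().to_string()).collect();
    println!("{}", unicode); // \u{61}\u{62}\u{63}\u{a}
    println!("{}", default); // abc\n
}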

#50525: Cachegrind showed that even after the previous PR, the above string code was still hot, because string interning was happening on the resulting string, which was unnecessary in the common case where escaping didn’t change the string. This PR added a scan to determine if escaping is necessary, thus avoiding the re-interning in the common case, speeding up a few benchmark runs, the best by 3%.

#50407: Cachegrind’s output showed that the trivial methods for the simple BytePos and CharPos types in the parser are (a) extremely hot and (b) not being inlined. This PR annotated them so they are inlined, speeding up most benchmarks, the best by 5%.
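
Cross-crate inlining of non-generic functions generally requires an annotation, so the change amounts to adding attributes like this (a hedged sketch, not the exact rustc definitions):

// Hedged sketch: trivial newtype methods like these benefit from #[inline],
// which makes them candidates for inlining in other crates.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct BytePos(pub u32);

impl BytePos {
    #[inline]
    pub fn to_usize(self) -> usize {
        self.0 as usize
    }

    #[inline]
    pub fn from_usize(n: usize) -> BytePos {
        BytePos(n as u32)
    }
}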

#50564: This PR did the same thing for the methods of the Span type, speeding up incremental runs of a few benchmarks, the best by 3%.

#50931: This PR did the same thing for the try_get function, speeding up runs of many benchmarks, the best by 1%.

#50418: DHAT’s output showed that there were many heap allocations of the cmt type, which is refcounted. Some code inspection and ad hoc instrumentation with eprintln! showed that many of these allocated cmt instances were very short-lived. However, some of them ended up in longer-lived chains, in which the refcounting was necessary. This PR changed the code so that cmt instances were mostly created on the stack by default, and then promoted to the heap only when necessary, speeding up runs of three benchmarks by 1–2%. This PR was a reasonably large change that took some time, largely because it took me five(!) attempts (the final git branch was initially called cmt5) to find the right dividing line between where to use stack allocation and where to use refcounted heap allocation.

#50565: DHAT’s output showed that the dep_graph structure, which is an IndexVec<DepNodeIndex, Vec<DepNodeIndex>>, caused many allocations, and some eprintln! instrumentation showed that the inner Vecs mostly had only a few elements. This PR changed the Vec<DepNodeIndex> to SmallVec<[DepNodeIndex; 8]>, which avoids heap allocations when the number of elements is no more than 8, speeding up incremental runs of many benchmarks, the best by 2%.

#50566: Cachegrind’s output showed that the hottest part of rustc’s lexer is the bump function, which is responsible for advancing the lexer onto the next input character. This PR streamlined it slightly, speeding up most runs of a couple of benchmarks by 1–3%.

#50818: Both Cachegrind and DHAT’s output showed that the opt_normalize_projection_type function was hot and did a lot of heap allocations. Some eprintln! instrumentation showed that there was a hot path involving this function that could be explicitly extracted that would avoid unnecessary HashMap lookups and the creation of short-lived Vecs. This PR did just that, speeding up most runs of serde and futures by 2–4%.

#50855: DHAT’s output showed that the macro parser performed a lot of heap allocations, particularly on the html5ever benchmark. This PR implemented ways to avoid three of them: (a) by storing a slice of a Vec in a struct instead of a clone of the Vec; (b) by introducing a “ref or box” type that allowed stack allocation of the MatcherPos type in the common case, but heap allocation when necessary; and (c) by using Cow to avoid cloning a PathBuf that is rarely modified. These changes sped up runs of html5ever by up to 10%, and crates.io by up to 3%. I was particularly pleased with these changes because they all involved non-trivial changes to memory management that required the introduction of additional explicit lifetimes. I’m starting to get the hang of that stuff… explicit lifetimes no longer scare me the way they used to. It certainly helps that rustc’s error messages do an excellent job of explaining where explicit lifetimes need to be added.
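
The “ref or box” type is the most interesting of the three. Here is a hedged sketch of the idea (hypothetical names, not the exact type from the PR):

use std::ops::Deref;

// Sketch of a "ref or box" type: borrow a stack-allocated value in the common
// case, but own a heap-allocated Box when the value must outlive the caller.
enum RefOrBox<'a, T: 'a> {
    Ref(&'a T),
    Boxed(Box<T>),
}

impl<'a, T> Deref for RefOrBox<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        match *self {
            RefOrBox::Ref(r) => r,
            RefOrBox::Boxed(ref b) => b,
        }
    }
}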

#50932: DHAT’s output showed that a lot of HashSet instances were being created in order to de-duplicate the contents of a commonly used vector type. Some eprintln! instrumentation showed that most of these vectors only had 1 or 2 elements, in which case the de-duplication can be done trivially without involving a HashSet. (Note that the order of elements within this vector is important, so de-duplication via sorting wasn’t an option.) This PR added special handling of these trivial cases, speeding up runs of a few benchmarks, the best by 2%.
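
Here is a hedged sketch of the shape of that special-case handling (not the rustc code): the 1- and 2-element cases are handled directly, and a HashSet is only built for longer vectors.

use std::collections::HashSet;

// Sketch: handle the common tiny cases without a HashSet, preserving order.
fn dedup_preserving_order<T: Clone + Eq + std::hash::Hash>(v: &mut Vec<T>) {
    match v.len() {
        0 | 1 => {}                // Nothing to do.
        2 => {
            if v[0] == v[1] {
                v.truncate(1);     // Trivial de-duplication.
            }
        }
        _ => {
            let mut seen = HashSet::new();
            v.retain(|x| seen.insert(x.clone()));
        }
    }
}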

#50981: The compiler does a liveness analysis that involves vectors of indices that represent variables and program points. In rare cases these vectors can be enormous; compilation of the inflate benchmark involved one that has almost 6 million 24-byte elements, resulting in 143MB of data. This PR changed the type used for the indices from usize to u32, which is still more than large enough, speeding up “clean incremental” builds of inflate by about 10% on 64-bit platforms, as well as reducing their peak memory usage by 71MB.

What’s next?

These improvements, along with those recently done by others, have significantly sped up the compiler over the past month or so: many workloads are 10–30% faster, and some even more than that. I have also seen some anecdotal reports from users about the improvements over recent versions, and I would be interested to hear more data points, especially those involving rustc nightly.

The profiles produced by Cachegrind, Callgrind, and DHAT are now all looking very “flat”, i.e. with very little in the way of hot functions that stick out as obvious optimization candidates. (The main exceptions are the SipHasher128 functions I mentioned above, which I haven’t been able to improve.) As a result, it has become difficult for me to make further improvements via “bottom-up” profiling and optimization, i.e. optimizing hot leaf and near-leaf functions in the call graph.

Therefore, future improvement will likely come from “top-down” profiling and optimization, i.e. observations such as “rustc spends 20% of its time in phase X; how can that be improved?” The obvious place to start is the part of compilation taken up by LLVM. In many debug and opt builds LLVM accounts for up to 70–80% of instructions executed. It doesn’t necessarily account for that much time, because the LLVM parts of execution are parallelized more than the rustc parts, but it is still the obvious place to focus next. I have looked at a small amount of generated MIR and LLVM IR, but there’s a lot to take in. Making progress will likely require a much broader understanding of things than many of the optimizations described above, most of which required only a small amount of local knowledge about a particular part of the compiler’s code.

If anybody reading this is interested in trying to help speed up rustc, I’m happy to answer questions and provide assistance as much as I can. The #rustc IRC channel is also a good place to ask for help.

The Rust compiler is getting faster

TL;DR: The Rust compiler has gotten 1.06x–4x faster over the past month.

As changes are made to the Rust compiler, a suite of benchmarks measuring compile time is run regularly on the development version. The data is viewable at http://perf.rust-lang.org. The default view is graphical, showing data from the past month.

Screenshot of perf.rust-lang.org showing measurements of the html5ever benchmark

The screenshot above shows the graphs for a single benchmark called “html5ever”, which consists of an old version of the project of the same name. Each graph shows measurements for a different kind of build: a debug build, a “check” build (which detects errors but doesn’t generate code), and an optimized build. Within each graph there are the following three data series.

  • Clean: a normal build.
  • Baseline incremental: an incremental build with no prior incremental runs. Such a build is a little slower than a normal build, because it does normal compilation and also gathers information to guide subsequent incremental builds.
  • Clean incremental: an incremental build run immediately after a baseline incremental build. This is the best-case scenario for incremental compilation in which the minimal amount of work is done.

If you visit the site yourself you’ll see that most of the benchmarks have more than three data series, including ones for incremental builds done after small code changes (a more realistic use case), and one for builds with non-lexical lifetimes enabled.

The x-axis shows time and the y-axis shows instruction counts. Other units of measurement are available, including cycles, time, and memory usage. Instruction counts are shown by default; this isn’t ideal, because they are only a proxy for the measurement that really matters (time)… but they are a pretty good proxy, and they have much lower variation than time measurements, which is important when detecting changes.

This graphical view is particularly useful for detecting major changes. For example, you can see that in early May there was a major regression for “clean” and “baseline incremental” builds, which Alex Crichton fixed a few days later.

As well as the graphical view, the site also provides a textual “compare” view, which can be reached via the link at the top left of each page. This view compares measurements from two revisions of the compiler; by default it compares the most recently measured revision with one from a month ago. (It can also be used locally, which is very useful to evaluate changes that speed up the compiler.)

The screenshot above is of the “compare” view at the time of writing. Each line corresponds to a single graph from the graphical view. (If you visit the site and click on an individual entry it will expand and show all of the measurements. The resemblance between those measurements and this screenshot will of course diminish over time.) The “avg” column shows the average change across all the data series. The “min” and “max” columns show the minimum and maximum changes for any of the data series. The “serde” and “script-servo” lines are empty because those benchmarks were added to the suite less than a month ago, so no comparison can be made.

The table has many numbers, but the thing to take away is that they are almost all significantly negative, meaning that compile time has reduced. The “avg” numbers range from 6% to 38%; the “min” numbers (i.e. best result) go as high as 75%; the “max” numbers (i.e. worst result) go as high as 36%.

In conclusion: the Rust compiler has gotten significantly faster in the past month. Across a wide range of programs, and a wide range of build configurations, compile times have reduced by between 6% and 75%. To put it another way, the compiler has gotten between 1.06x and 4x faster.

These benefits are available right now to users of the Nightly channel. Users of the Release channel will see them more gradually, spread across one or two versions released over the next few months.

How to speed up the Rust compiler in 2018

18 months ago I wrote about some work I did to speed up the Rust compiler (rustc). I’ve recently taken this work up again. Also, in the meantime rustc’s build system has been replaced and its benchmark suite has been overhauled. So it’s a good time for an update.

Getting the code

The steps for getting the rustc code haven’t changed. First, I fork the main Rust repository on GitHub. Then I make two local clones: a base clone that I won’t modify, which serves as a stable comparison point (rust0), and a second clone where I make my modifications (rust1). I use commands something like this:

user=nnethercote
for r in rust0 rust1 ; do
  git clone https://github.com/$user/rust $r
  cd $r
  git remote add upstream https://github.com/rust-lang/rust
  git remote set-url origin git@github.com:$user/rust
done

Building the Rust compiler

The compiler’s build system is complex with many possible invocations and configurations. I’ll cover the absolute minimum information required to understand how I’ve been using it.

First, you need a config.toml file, which sits at the root of the repository and dictates the compiler’s configuration. I used the provided config.toml.example as a starting point, cut it down a lot, and ended up with the following.

[llvm]
optimize = true
release-debuginfo = true
assertions = false

[rust]
optimize = true
codegen-units = 1
debug-assertions = false
debuginfo = true
debuginfo-lines = true
use-jemalloc = false

It’s possible that some of these lines just restate defaults, but I figure it doesn’t hurt to be explicit. This configuration has the following characteristics.

  • It results in the production of a fully optimized rustc, which is important for profiling. The exception to this is that I disable jemalloc, because DHAT doesn’t work when jemalloc is enabled.
  • It has maximal debug info present, which ensures that profiles are as detailed as possible.

Building rustc can be confusing, particularly because of the multiple compiler stages and the terminology around them. Here is the command I use.

./x.py build --stage 2 src/rustc

This produces a stage 2 compiler that can handle procedural macros and so is suitable for profiling the benchmark suite.

The benchmark suite

rustc’s benchmark suite is called rustc-perf. It consists of two parts:

  • collector: The 24 benchmark programs, 14 of which are “real” code (i.e. crates used in real applications), and 10 of which are toy programs or stress test microbenchmarks. Also, the harness code that runs and measures them.
  • site: Code for displaying measurements as a website.

The test harness is very thorough. For each benchmark, it measures Debug, Opt, and Check (no code generation) invocations. Furthermore, within each of those categories, it does five or more runs, including a normal build and various kinds of incremental builds. A full benchmarking run measures over 400 invocations of rustc. Note however that only the compilation of the final crate is measured in each benchmark; compilation of dependent crates is not included.

rustc-perf was created primarily for the perf.rust-lang.org site, which tracks rustc’s performance over time. I recently modified it so it could be used to compare two builds of rustc on a local machine, which is a fundamental operation when optimizing. This can be done by running the suite and site locally and navigating to the local “compare” page, which looks like this:

Screenshot of rustc-perf's compare page

Note that rustc-perf uses perf-stat for its measurements, so the benchmarking functionality currently only works on Linux.

I also extended rustc-perf so the benchmarks can be run under a profiler. I implemented support for perf-record, Cachegrind, and DHAT, because they are the profilers that I am most familiar with; it isn’t hard to add support for other profilers (including non-Linux ones). The advantage of integrating support for a profiler into rustc-perf is that it gets the profiler invocations underneath the Cargo invocations, ensuring that the right rustc invocations are measured.

Wins

Here are the improvements I’ve made to rustc over the past few weeks.

#49993: Cachegrind’s output showed that derived methods for the Token type were taking up significant time. This PR changed the ‘#’ counts in raw Lit variants from usize to u16, which reduced the size of Token from 32 to 24 bytes, speeding up some of the runs for coercions and html5ever by 1%.

#50051: Rust’s Option::ok_or function transforms an Option<T> value into a Result<T, E>. It’s a bit of a footgun… as the docs say:

Arguments passed to ok_or are eagerly evaluated; if you are passing the result of a function call, it is recommended to use ok_or_else, which is lazily evaluated.

DHAT’s output showed that one hot allocation site involved the creation of a String while getting the value of the MIRI_BACKTRACE environment variable. This seemed to me like a strange thing to be happening frequently, and it turns out that it was at the end of a chain of calls that were only needed in the (rare) error case of an ok_or call. This PR changed the code to use a trivial closure with ok_or_else, speeding up runs for a lot of benchmarks, the best by 6%.
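
The difference is easy to demonstrate (a minimal illustration, not the actual rustc code):

fn expensive_error_message() -> String {
    // Imagine this reads an environment variable and builds a String.
    "something went wrong".to_string()
}

fn main() {
    let value: Option<i32> = Some(42);

    // Eager: expensive_error_message() runs even though the value is Some
    // and the error is immediately discarded.
    let _a: Result<i32, String> = value.ok_or(expensive_error_message());

    // Lazy: the closure only runs in the (rare) error case.
    let _b: Result<i32, String> = value.ok_or_else(expensive_error_message);
}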

#50052: DHAT’s output showed that the char_lit function, which parses “\u{…}” literals, was doing a lot of allocations while stripping out ‘_’ chars. This PR avoided that allocation, speeding up various runs — particularly ones for regex, futures, clap, coercions, hyper, and encoding — the best by 6%.

#50106: Cachegrind’s output showed that the nearest_common_ancestor function, which computes the lowest common ancestor of two nodes in a scope tree, was very hot. The algorithm in use constructed the full scope chain for each node, and then worked backward from the end of the two scope chains until a difference was found. This is a reasonable algorithm in many circumstances, but some ad hoc instrumentation (eprintln! statements plus some simple post-processing) showed that the scope chains usually only differ by a handful of elements at the front and then have very long common tails, with dozens or even hundreds of elements. This PR switched to a different algorithm that looks for differences from the front of the scope chain, speeding up runs for many benchmarks, the best by 8%.
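
Here is a hedged sketch of the “search from the front” idea (not the exact algorithm from the PR): walk upward from both nodes in lockstep, remembering what each walk has seen, and stop at the first node reached from both sides. The parent_of closure stands in for the scope tree’s child-to-parent lookup.

use std::collections::HashSet;

// Hedged sketch, not the code that landed: because the two ancestor chains
// usually share a long tail and differ only near the front, walking up from
// both nodes at once finds the meeting point after a handful of steps,
// without building either full chain.
fn nearest_common_ancestor(
    mut a: u32,
    mut b: u32,
    parent_of: impl Fn(u32) -> Option<u32>,
) -> u32 {
    let (mut seen_a, mut seen_b) = (HashSet::new(), HashSet::new());
    loop {
        if a == b || seen_b.contains(&a) {
            return a;
        }
        if seen_a.contains(&b) {
            return b;
        }
        seen_a.insert(a);
        seen_b.insert(b);
        if let Some(p) = parent_of(a) { a = p; }
        if let Some(p) = parent_of(b) { b = p; }
    }
}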

#50174: By default, Rust’s HashSet and HashMap use a hash function that is very high quality but also very slow. Therefore, the Rust compiler internally uses different types, FxHashSet and FxHashMap, which are identical to the standard ones except they use a much faster hash function. Unfortunately, it’s easy to forget about them and use the standard hash tables instead. Cachegrind’s output showed that the default hash function code (SipHash) was executed a lot, and that one particularly hot hash table (the symbol interner) was using HashMap. This PR (trivially) changed that table to an FxHashMap, speeding up runs for numerous benchmarks, the best by 5%.
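
The same fast hash tables are available outside the compiler via the rustc-hash crate, and swapping one in is a drop-in change because the API is identical:

// FxHashMap has the same API as std's HashMap; only the hash function differs.
// (Within rustc it comes from an internal module; the rustc-hash crate
// provides the same types for external use.)
use rustc_hash::FxHashMap;

fn main() {
    let mut interner: FxHashMap<String, u32> = FxHashMap::default();
    interner.insert("foo".to_string(), 0);
    assert_eq!(interner.get("foo"), Some(&0));
}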

#50240: Some of Rust’s standard containers, such as Vec, HashSet, and HashMap, have the nice property that by default they don’t allocate until an element is inserted. This is good because it’s surprising how often such containers are created but never inserted into. DHAT’s output showed that such behaviour would also help with a couple of the compiler’s uses of BTreeMap. I tried and failed to implement this behaviour directly in BTreeMap; according to Gankro, “BTreeMap is some of the most complex unsafe code in libstd” and “I just scared off a grizzled firefox dev explaining it”! Instead this PR introduced a thin wrapper type (LazyBTreeMap) around BTreeMap and used it in the handful of relevant places within the compiler, speeding up the runs for several benchmarks, the best by 3%. #50266 is open to do the general fix for BTreeMap, whereupon LazyBTreeMap will be able to be removed.
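
The wrapper is simple; here is a hedged, cut-down sketch of the idea (not the exact type that landed): no BTreeMap exists, and hence no allocation happens, until the first insertion.

use std::collections::BTreeMap;

// Hedged sketch with a minimal API: the inner map is only created (and only
// allocates) on first insertion.
struct LazyBTreeMap<K, V>(Option<BTreeMap<K, V>>);

impl<K: Ord, V> LazyBTreeMap<K, V> {
    fn new() -> Self {
        LazyBTreeMap(None)
    }

    fn insert(&mut self, key: K, value: V) -> Option<V> {
        self.0.get_or_insert_with(BTreeMap::new).insert(key, value)
    }

    fn get(&self, key: &K) -> Option<&V> {
        self.0.as_ref().and_then(|map| map.get(key))
    }
}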

#50246: Cachegrind’s output showed that a function named dump_allocs was hot for some benchmarks. This sounded to me like a logging or debugging function of some kind, and investigation confirmed that it was traversing data structures in order to build up strings that went unused in the standard case where logging is disabled. This PR (trivially) changed this function and a couple of related ones to be no-ops if logging is disabled, speeding up runs for coercions, tuple-stress, html5ever, and encoding, the best by almost 15%! This shows how not doing unclever things is often as important as doing clever things when it comes to optimizing software.
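
The pattern is worth spelling out, because it applies to any logging or diagnostic code that builds its output eagerly. A hedged sketch, using the log crate for illustration (the rustc change used the compiler’s own logging checks and a different function signature):

// Sketch of the pattern: bail out before doing any work if the output would
// be thrown away anyway.
fn dump_allocs(alloc_ids: &[u64]) {
    if !log::log_enabled!(log::Level::Trace) {
        return; // Logging is disabled: don't build the strings at all.
    }
    let mut out = String::new();
    for id in alloc_ids {
        out.push_str(&format!("alloc {}\n", id));
    }
    log::trace!("{}", out);
}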

Update: It’s worth noting that I also made three or four optimization attempts that didn’t work out — where I made a change that seemed like it should help, based on profiling data, but the effect was negligible. Success isn’t guaranteed!

Future work

All of the PRs mentioned above (except for the aborted BTreeMap change) involved small, simple changes to the Rust compiler’s code. I’m not a rustc expert, but I do know how to use a couple of profilers well, and I’ve been able to make a difference. I’m sure there are more improvements of this nature to be made, and I encourage other people to try profiling rustc with their favourite profilers to see what they can find. This is valuable because rustc’s speed is something that Rust users often complain about. And it’s fun, if you like that sort of thing. 🙂 I’m happy to help people, and the members of the #rustc IRC channel are very friendly and helpful.

Having said that, in a lot of cases, especially for opt builds, the majority of execution time is within LLVM, which rustc uses for code generation. Speeding up LLVM itself may be difficult, but I hope/suspect there is room for improvement in the way that rustc interacts with LLVM. If anyone has ideas on that front I’d love to hear about them.

A New Preferences Parser for Firefox

Firefox’s preferences system uses data files to store information about default preferences within Firefox, and user preferences in a user’s profile (such as prefs.js, which records changes to preference values, and user.js, which allows users to override default preference values).

A new parser

These data files use a custom format, and therefore Firefox has a custom parser for them. I recently rewrote the parser. The new parser has the following benefits over the old parser.

  • It is faster (raw parsing speed is close to 2x faster).
  • It is safer (because it’s written in Rust rather than C++).
  • It is more correct and better tested (the old one got various obscure edge cases wrong).
  • It is more readable, and easier to modify.
  • It issues no warnings, only errors.
  • It is slightly stricter (e.g. doesn’t allow any malformed input, and it catches integer overflow).
  • It has error recovery and better error messages (including correct line numbers).

Modifiability

Modifiability was the prime motivation for the change. I wanted to make some adjustments to the preferences file grammar, but this would have been very difficult in the old parser, because it was written in an awkward style.

It was essentially a single loop containing a giant switch statement on a state variable. This switch was executed for every single char in a file. The states held by the state variable had names like PREF_PARSE_QUOTED_STRING, PREF_PARSE_UNTIL_OPEN_PAREN, PREF_PARSE_COMMENT_BLOCK_MAYBE_END. It also had a second state variable, because in some places a single one wasn’t enough; the parser had to return to the previous state after exiting the current state. Furthermore, lexing and parsing were not separate, so code to handle comments and whitespace was spread around in various places.

The new parser is a recursive descent parser — even though the grammar doesn’t actually have any recursion — in which the structure of the code reflects the structure of the grammar. Lexing is distinct from parsing. As a result, the new parser is much easier to read and modify. In particular, after landing it I added error recovery without too much effort; that would have been almost impossible in the old parser.
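
To make the contrast concrete, here is a hedged, heavily cut-down sketch of the new parser’s shape (hypothetical types, not the real code): the lexer and parser are separate, and there is one parsing function per grammar production, so the code mirrors the grammar.

#[derive(Debug, PartialEq)]
enum Token { PrefKeyword, Str(String), Bool(bool), LParen, RParen, Comma, Semi, Eof }

struct Lexer { /* input text, position, ... */ }

impl Lexer {
    fn next_token(&mut self) -> Token {
        // Whitespace and comments are skipped here, in one place, rather than
        // being handled in every parser state as in the old parser.
        Token::Eof
    }
}

struct Parser { lexer: Lexer, lookahead: Token }

impl Parser {
    // file ::= pref* EOF
    fn parse_file(&mut self) {
        while self.lookahead != Token::Eof {
            self.parse_pref();
        }
    }

    // pref ::= 'pref' '(' STRING ',' value ')' ';'
    fn parse_pref(&mut self) {
        self.expect(Token::PrefKeyword);
        self.expect(Token::LParen);
        // ...and so on, token by token, in grammar order.
    }

    fn expect(&mut self, token: Token) {
        assert_eq!(self.lookahead, token, "parse error");
        self.lookahead = self.lexer.next_token();
    }
}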

Note that the idea of error recovery for preferences parsing was first proposed in bug 107264, filed in 2001! After landing it, I tweeted the following.

Amazingly enough, the original reporter is on Twitter and responded!

Strictness

The new parser is slightly stricter and rejects some malformed input that the old parser accepted.

Junk chars

Disconcertingly, the old parser allowed arbitrary junk between preferences (including at the start and end of the prefs file) so long as that junk didn’t include any of the following chars: ‘/’, ‘#’, ‘u’, ‘s’, ‘p’. This means that lines like these:

!foo@bar&pref("prefname", true);
ticky_pref("prefname", true);    // missing 's' at start
User_pref("prefname", true);     // should be 'u' at start

would all be treated the same as this:

pref("prefname", true);

The new parser disallows such junk because it isn’t necessary and seems like an unintentional botch by the old parser. In practice, this caught a couple of prefs that accidentally had an extra ‘;’ at the end.

SUB char

The old parser allowed the SUB (0x1a) character between tokens and treated it like ‘\n’.

The new parser does not allow this character. SUB was used to indicate end-of-file (not end-of-line) in some old operating systems such as MS-DOS, but this doesn’t seem necessary today.

Invalid escapes

The old parser tolerated (with a warning) invalid escape sequences within string literals — such as “\q” (not a valid escape) and “\x1” and “\u12” (both of which have insufficient hex digits) — accepting them literally.

The new parser does not tolerate invalid escape sequences because it doesn’t seem necessary and would complicate things.

NUL char

The old parser tolerated the NUL character (0x00) within string literals; this is dangerous because C++ code that manipulates string values with embedded NULs will almost certainly consider those chars as end-of-string markers.

The new parser treats the NUL character as end-of-file, to avoid this danger. (The escape sequences “\x00” and “\u0000” are also disallowed.)

Integer overflow

The old parser allowed integer literals to overflow, silently wrapping them.

The new parser treats integer overflow as a parse error. This seems better, and it caught overflows of several existing prefs.

Consequences

Error recovery minimizes the risk of data loss caused by the increased strictness: malformed pref lines in prefs.js will be removed, but well-formed pref lines after them are preserved.

Nonetheless, please keep an eye out for any other problems that might arise from this change.

Attributes

I mentioned before that I wanted to make some adjustments to the preferences file grammar. Specifically, I changed the grammar used by default preference files (but not user preference files) to support annotating each preference with one or more boolean attributes. The attributes supported so far are ‘sticky’ and ‘locked’. For example:

pref("sticky.pref", true, sticky);
pref("locked.pref", 123, locked);
pref("sticky-and-locked-pref", "blah", sticky, locked);

Note that the addition of the ‘locked’ attribute fixed a 10 year old bug.

When will this ship?

All of these changes are on track to ship in Firefox 60, which is due to release on May 9th.

Nicer commands for content processes

The current Firefox release creates content processes with very long and ugly commands. Here’s an example, as reported by ‘ps’ on my Linux64 machine (with wrapping to make it show up entirely on your screen):

/home/njn/moz/mi8/o64/dist/bin/firefox -contentproc -childID 2 -isForBrowser -in
tPrefs 6:50|7:-1|20:0|35:1000|43:20|44:5|45:10|52:0|58:128|59:10000|64:0|66:400|
67:1|68:0|69:0|70:100|75:0|76:120|77:120|162:2|163:1|167:60|168:30|169:512000|17
8:5000|180:6|194:8192|195:524288|196:5|209:10000|230:24|231:32768|233:0|234:0|24
3:5|247:1048576|249:100|250:5000|252:600|253:4|254:1|262:20|279:4|291:60000|309:
300|310:30| -boolPrefs 1:0|2:0|4:1|5:1|25:1|28:1|29:1|30:1|32:1|33:1|34:1|37:1|3
8:1|39:1|42:1|46:1|47:0|48:0|49:1|50:1|51:1|53:0|56:1|57:1|60:1|61:0|62:0|63:0|6
5:0|71:1|72:1|73:1|74:1|78:1|79:1|80:0|81:0|82:1|83:1|84:1|85:1|86:1|89:0|90:0|9
3:1|94:1|98:1|99:1|100:0|101:1|102:1|103:1|104:0|105:0|107:0|108:0|109:1|110:1|1
11:1|114:1|115:1|116:1|117:1|118:1|119:0|120:0|121:1|122:0|123:0|124:1|125:1|126
:0|127:1|128:1|129:0|131:1|132:0|133:0|134:1|135:1|136:0|137:0|138:0|139:1|140:1
|141:1|142:1|143:0|144:1|145:1|146:1|147:1|148:1|149:1|150:0|151:1|152:1|153:0|1
54:1|155:0|156:0|157:0|158:1|159:1|160:1|161:1|164:1|165:0|172:1|175:0|176:0|177
:1|181:0|183:1|184:0|185:1|187:1|189:0|190:0|193:0|197:1|198:0|199:1|200:1|201:0
|204:1|208:1|210:1|211:0|213:1|216:0|228:0|229:0|232:1|235:0|237:1|238:1|240:1|2
41:0|248:1|251:1|256:0|257:0|258:0|259:1|260:1|261:0|266:1|269:1|270:1|271:1|272
:1|273:1|274:0|275:1|281:1|284:0|285:1|286:1|287:0|288:1|289:1|290:1|292:0|293:0
|295:1|304:1|305:1|306:0|307:0|308:0| -stringPrefs 3:7;default|215:3;1.0|226:332
;  ¼½¾ǃː??։֊׃״؉؊٪۔܁܂܃܄ᅟᅠ᜵           ???‐ ’․‧??????? ‹›⁁⁄⁒ ⅓⅔⅕⅖⅗⅘⅙⅚?⅜⅝⅞⅟∕∶⎮╱⧶⧸⫻⫽>
⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻ 。〔〕〳゠ㅤ㈝㈞㎮㎯㏆㏟꞉︔︕︿﹝﹞?./。ᅠ???�|227:4;
high|280:36;97c47c17-80b5-497c-b5ee-1c5db2d06af6| -schedulerPrefs 0001,2 -appdir
 /home/njn/moz/mi8/o64/dist/bin/browser 4500 true tab

That’s 1733 characters! Most of it is for three flags, called -intPrefs, -boolPrefs, and -stringPrefs, which are used to pass values of prefs that are required very early in the content process’s existence, before the first normal IPC message is received. These pref values are encoded compactly, but there are enough of them that they make the command quite long.

Furthermore, there is one pref (network.IDN.blacklist_chars) that contains a bunch of unusual characters. That explains the gobbledygook on the 4th and 3rd last lines. Unsurprisingly, this gobbledygook was reported as a bug… which has been duplicated five times.

The good news is that I recently fixed this, by changing things so that only pref values that have changed since startup get passed in this way. (Content processes are able to see the startup values, and so don’t need to be told them.) On my machine and profile, this reduces the number of pref values passed via the command from ~222 to ~3. It also reduces the number sent later via IPC from ~3165 to ~180.

The command now looks like this:

/home/njn/local/install/firefox/firefox -contentproc -childID 4 -isForBrowser -b
oolPrefs 37:1|235:1|257:1|294:1| -stringPrefs 280:36;8bdf7a6e-d39a-494c-b55a-829
2c4fc6254| -schedulerPrefs 0001,2 -greomni /home/njn/local/install/firefox/omni.
ja -appomni /home/njn/local/install/firefox/browser/omni.ja -appdir /home/njn/lo
cal/install/firefox/browser 3237 true tab

That’s 361 characters, all of them ASCII. Much better! And I have a plan to reduce things even further.

This change is in Firefox 60, which is on track to be released around May 9th.

DMD is usable again on all Tier 1 platforms

DMD is a heap profiler built into Firefox, best known for being the tool used to diagnose the sources of “heap-unclassified” memory in about:memory.

It’s been unusable on Win32 for a long time due to incredibly slow start-up times. And recently it became very slow on Mac due to a performance regression in libunwind.

Fortunately I have been able to fix this in both cases (Win32, Mac) by using FramePointerStackWalk() instead of MozStackWalk() to do the stack tracing within DMD. (The Gecko Profiler likewise uses FramePointerStackWalk() on those two platforms, and it was my recent work on the profiler that taught me that there was an alternative stack walker available.)

So DMD should be usable and effective on all Tier 1 platforms. I have tested it on Win32, Win64, Linux64 and Mac. I haven’t tested it on Linux32 or Android. Please let me know if you encounter any problems.

How we made compiler warnings fatal in Firefox

Compiler warnings are mostly good: they identify real problems, and when false positives do occur they are usually easy to work around. However, if they’re not fatal, they tend to be ignored and build up. (See bug 187528 for an idea!)

One way to prevent the build-up is to make them fatal, so they become errors. But won’t that cause problems? Not if you’re careful. Here’s how we did it for Firefox.

  • Choose with some care which warnings end up fatal. Don’t be afraid to modify your choices as time goes on.
  • Introduce a mechanism for enabling fatal warnings on a per-directory basis. Mozilla’s custom build system used to have a directive called FAIL_ON_WARNINGS for this purpose.
  • Set things up so that fatal warnings are off by default, but enabled on continuous integration (CI). This means the primary coverage is via CI. You don’t want fatal warnings on by default because it causes problems for developers who use non-standard compilers (e.g. pre-release versions with new warning classes). Developers using the same compilers as CI can turn it on locally if they want without problem.
  • Allow per-file exceptions for particular kinds of warnings, because there are occasionally warnings you just want to ignore.
  • Fix warnings one directory at a time, and turn on fatal warnings for that directory as soon as it’s warning-free.
  • Invert the sense of the per-directory mechanism once you’ve converted more than half of the directories. For Mozilla code we now have the ALLOW_COMPILER_WARNINGS directive. It’s almost exclusively used for directories containing third-party code which is not under our control.
  • Gradually expand the coverage of which compilers you have fatal warnings for. Mozilla code now does this for GCC, clang, and MSVC.
  • Congratulations! You now have fatal warnings on everywhere that is practical.

With a setup like this, it’s possible for a patch to compile on a developer’s machine but fail to compile on CI. But that’s just one of many ways in which full CI runs may fail when local runs don’t. So it’s not as bad as it seems.

Also, before upgrading the compilers on CI you need to address any new warnings, by fixing them or suppressing them or filing a compiler bug report. But this isn’t a bad thing.

It took a long time for Firefox to reach this stage, but I think it was worth the effort. Thank you to Chris Peterson and all the others who helped with this.

MemShrink status

Mozilla’s MemShrink project started almost six years ago. It successfully reduced Firefox’s memory consumption to the point where Firefox is now widely (and accurately) seen as the least memory-hungry browser.

Most of the big improvements were made in the first two or three years of the MemShrink project, and it doesn’t get much publicity any more, but it is still ticking along quietly: meetings occur, regressions are tracked, measurements are made, and so on. Some time ago I passed leadership of it to the capable hands of Eric Rahm, who has just written a status update that is worth reading.

Eric also wrote recently about the reincarnation of areweslimyet.com. He maintained that site for several years, which was a thankless but important task, but he won’t need to any more because its functionality has been folded into Mozilla’s main performance monitoring tools. Thank you, Eric!

Improving the Gecko Profiler

Over the last three months I landed 167 patches, via 41 Bugzilla bugs, for the Gecko Profiler. These included crash fixes, assertion failure fixes, data race fixes, optimization fixes, and a great many refactorings.

Background

The Gecko Profiler is a profiler built into Firefox. It can be used with the Gecko Profiler Addon to profile Firefox. It also provides the core profiling mechanism that is used by Firefox’s devtools to profile JavaScript content code.

It’s a crucial component.  It was originally written 5 years ago in something of a hurry, because we desperately needed a built-in profiler, one that could give more detailed, custom information than is possible with an external profiler. As I understand it, part of it was imported from V8, so it was a mix of previously existing code and new code. And in the years since it has been extended by multiple people, in a variety of ways that didn’t necessarily mesh well.

As a result, at the start of Q1 it was in pretty bad shape. Crashes and assertion failures were frequent, and the code itself was hard to read and maintain. Markus Stange had recently taken over ownership, and had made some great improvements to the Addon, but the core code needed work as well. So I started digging in to see what I could find and improve. There was a lot!

Fixes

Bug 1331571. The profiler had code for incorporating power consumption estimates from the Intel Power Gadget. Unfortunately, this integration had major flaws: Intel Power Gadget only gives very coarse power consumption estimates; the profiler samples at 1000Hz and is CPU intensive and so is likely to skew the power consumption estimates significantly; and nobody had ever used it meaningfully. So I removed it.

Bug 1317771. The profiler had a “standalone” configuration that allowed it to be used in programs other than Firefox. But it was complex (lots of #ifdef statements) and broken and unlikely to be of use. So I removed it.

Bug 1328369, Bug 1328373: Dead code removal.

Bug 1332577. The public API for the profiler was a mess. It was split across several header files, and most API functions had an “outer” version with a profiler_ prefix that immediately called into an inner version with a mozilla_sampler_ prefix (though there were some inconsistencies). So I combined the various header files into a single file, GeckoProfiler.h, and simplified the API functions into a single level, all consistently named with a profiler_ prefix.

Bug 1333296. Even the name of the profiler was a source of confusion. It was originally known as “SPS”, which I believe is short for “simple profiling system”. At some point that changed to “the Gecko Profiler”, although it was also occasionally referred to as “the built-in profiler”! Because of this history, the code was littered with references to SPS. In this bug I updated them all to refer to the Gecko Profiler. (I updated the MDN docs, too. The page name still uses “Built-in Profiler” because I don’t know how to change MDN page names.)

Bug 1329684. I removed some mutex wrapper classes that I think were necessary at one point for the “standalone” configuration.

Bug 1328365. Thread-local storage was being used to store a pointer to an object that was only accessed on the main thread, so I changed it to be a global variable. I also renamed some variables whose names referred to a type that had been renamed a long time ago.

Bug 1333655. The profiler had a cross-platform thread abstraction that was clumsy and over-engineered, so I streamlined it.

Bug 1334466. The profiler had a class called Sampler, which I think was imported from V8, and a subclass called GeckoSampler. Both classes were fairly complex, and we only ever instantiated the subclass. The separation merely obscured things, so I merged the subclass into Sampler. Having all that code in a single class and a single module made it much easier to see exactly what it was doing.

Bug 1335595. Two classes, ThreadInfo and ThreadProfile, were used for per-thread data. They were hopelessly intertwined: each one had a pointer to the other, and multiple fields were present (i.e. duplicated) in both of them. So I just merged them.

Bug 1336326. Three minor clean-ups.

Bug 816598. I implemented a memory reporter for the profiler. This was first requested in 2012, and five duplicate bugs had been filed in the interim!

Bug 1126576. I removed some very grotty manual refcounting from the PseudoStack class, which simplified things. A little too much, in fact… I misunderstood how things worked, causing a crash, which I subsequently fixed in bug 1340161.

Bug 1337189. The aforementioned Sampler class was still over-engineered. It only ever had 0 or 1 instantiations, and basically was an unnecessary level of abstraction. In this bug I basically got rid of it by merging it into another file. Which took 27 patches! (One of these patches introduced a regression, which I later fixed in bug 1340327.) At this point a lot of the core code that had previously been spread across multiple files and classes was now in a single file, tools/profiler/core/platform.cpp, and it was becoming increasingly obvious that there was a lot of global state being accessed from multiple threads with insufficient thread synchronization.

Bug 1338957. The profiler tracks which threads are “sleeping” (i.e. blocked on some operation), to support an optimization where it samples sleeping threads in a cheaper-than-normal fashion. It was using two counters and a boolean to track the sleep state of each thread. These three values were accessed from multiple threads; two of them were atomic, and one wasn’t, so the whole setup was very racy. I managed to condense the three values into a single atomic tri-state value, which made things simpler and thread-safe.

Bug 1339327. I landed eight refactoring patches with no particular common theme, mostly involving renaming things and removing unnecessary stuff.

Bug 1339435. I removed two erroneous assertions that I had added in an earlier patch — two functions that I thought only ran on the main thread turned out to run off the main thread as well.

Bug 1339695. The profiler has a lot of code that is specific to a particular architecture (e.g. x86), OS (e.g. Windows), or platform (e.g. x86/Windows). The #ifdef statements used to select these were massively inconsistent — turns out there are many ways to detect this stuff — so I fixed this up. Among other things, this involved using the nice constants in tools/profiler/core/PlatformMacros.h consistently throughout the profiler’s code. (I fixed a regression — caused by mistyping one of the #ifdef conditions, alas! — from this change in bug 1350211. And another one involving --disable-profiling in bug 1348776.) I also renamed some files that had .cc extensions instead of the usual .cpp because they had (I think) been imported from V8.

Bug 1340928. At this point I had started working on a major change to the handling of the profiler’s core global state. It would inevitably be a big patch, but I wanted it to be as small as possible, so I started aggressively carving off small changes that could be landed separately. This bug featured 16 of them.

Bug 1328378. The profiler has two kinds of samples: periodic, which are taken by a separate thread in response to a timer firing, and synchronous, which a thread takes itself in response to a request via the profiler API. There are a lot of similarities between the two, but also some important differences. This bug took some steps to simplify the messy handling of synchronous samples.

Bug 1344118. I mentioned earlier that the profiler tracks which threads are “sleeping” to support an optimization: when a thread is asleep, we can mostly duplicate its last sample without unwinding its stack. But the optimization was buggy and would become a catastrophic pessimization in certain circumstances, due to what should have been a short O(1)-ish buffer search becoming O(n²)-ish, which would quickly peg one CPU at 100% usage. As far as I can tell, this bug was present in the optimization ever since it was implemented three years ago. (It’s possible it wasn’t noticed because its effect increases as more threads are profiled, but the profiler defaults to only profiling the main thread and the compositor thread.) The fix was straightforward once the diagnosis was made, and Julian Seward did a follow-up that made the optimization even more effective.

Bug 1342306. In this bug I put almost all of the profiler’s global state into a single class and protected accesses to it with a mutex. Unlike the old code, the new code is simple and obviously thread-safe. The final patch in this bug was much bigger than I would have liked, at 142 KiB, even after I carved off as many precursor patches as I could. Unsurprisingly, there were some follow-up fixes required: bug 1346356 (a leak and a deadlock), bug 1347044 (another deadlock), bug 1348374 (yet another deadlock), and bug 1350967 (surprise! another deadlock).

Bug 1345262. I fixed an assertion failure caused by the profiler and the JS engine having differing views about what functions should be called on what threads.

Bug 1347348. Five more assorted clean-ups.

Bug 1349856. I fixed a minor error involving a call to the profiler from Gecko.

Bug 1346132. I removed the profiler’s bespoke logging system, replacing it with the standard Mozilla one. I also made the logging output more concise and informative.

Bug 1350212. I cleaned up a class and its uses a bit.

Bug 1351523. I reordered one function’s arguments to match the order used in two related functions.

Bug 1351528. I removed some unused values from an enum.

Bug 1348024. I simplified some environment variables used by the profiler.

Bug 1351946. I removed some gnarly code for starting the profiler on B2G.

Bug 1351136. The profiler’s testing coverage is not very good, as can be seen from the numerous regressions I introduced and fixed. So I added a gtest that improves coverage. There’s still room for more test coverage improvement.

Bug 1351963. I further clarified the handling of synchronous vs. periodic samples, and simplified the ownership of some per-thread data structures.

Discussion

I learned some interesting things while doing this work.

Learning a component

Three months ago I knew almost nothing about the profiler’s code. Today I’m a module peer.

At the start of January I had been told that the profiler needed work, and I had assigned myself a Q1 deliverable to “land three improvements to the Gecko Profiler”. I started just by looking for easy, obvious code clean-ups, such as dead code removal, fixing inconsistent naming of things, and removing unnecessary indirections. (These are the kinds of clean-ups you can make with only shallow understanding.) The profiler proved to be a target-rich environment for such changes!

After I made a few dozen such changes I started understanding more deeply how the pieces fit together. (Partly because I’d been staring at the code a lot, and partly because my changes were making the code easier to understand. Refactorings add up.) I started interleaving my easy clean-up patches with ones that required more insight into how the profiler worked. I made numerous mistakes along the way, as the various regression fixes above show. But that’s ok.

I also kept a text file in which I had a list of ideas for things to fix. Every time I saw something that looked like it could be improved, I added it to the file, and I repeatedly checked the file when deciding what to work on next. As my understanding of the code improved, multiple times I realized that items I had written down were wrong, or incomplete, or that seemingly separate things were related. (In fact, I’m still using the file, because I still have numerous things I want to improve.)

Multi-threaded programming basics

Although I first learned C and C++ about 20 years ago, and I have worked at Mozilla for more than 8 years, this was the first time I’ve ever done serious multi-threaded programming, i.e. at a level where I needed a reasonably deep understanding of how various threads can interact. I got the following two great tips from Julian Seward, which helped a lot.

  • Write down pseudocode for each thread.
  • Write down potential worst-case thread operation interleavings.

I also found it helpful to add comments (or assertions, where appropriate) to the top of functions that indicate which thread or threads they run on. For example:

void profiler_gathered_OOP_profile()
{
  MOZ_RELEASE_ASSERT(NS_IsMainThread());
  ...
}

and:

void profiler_thread_sleep()
{ 
  // This function runs both on and off the main thread.
  ...
}

A useful idiom: proof-of-lock tokens

I also employed a programming idiom that turned out to be extremely helpful. Most of the global profiler state is in a single class called ProfilerState. There is a single instance of this class, gPS, and a single mutex that protects it, gPSMutex. To guarantee that no code is able to access gPS’s contents without first locking the mutex, for every field in ProfilerState there is a getter and a setter, both of which require a “proof-of-lock” token, which takes the form of a const PS::AutoLock&, where PS::AutoLock is an RAII type that locks and unlocks a mutex.

For example, consider this function, which checks if the profiler is paused.

bool profiler_is_paused()
{
  PS::AutoLock lock(gPSMutex);
  if (!gPS->IsActive(lock)) {
    return false;
  }
  return gPS->IsPaused(lock);
}

The PS::AutoLock locks the mutex. IsActive() and IsPaused() both access fields within gPS, and so they are passed lock, which serves as the proof-of-lock value. IsPaused() and SetIsPaused() are implemented as follows.

bool IsPaused(const PS::AutoLock&) const { return mIsPaused; }
void SetIsPaused(const PS::AutoLock&, bool aIsPaused) { mIsPaused = aIsPaused; }

Neither function actually uses the proof-of-lock token. Nonetheless, any function that calls a ProfilerState getter or setter must either lock gPSMutex, or have an ancestor that does. This idiom has two very nice benefits.

  • You can’t access gPS’s contents without having first locked gPSMutex. (Well, it is possible to subvert the protection, but not by accident.)
  • It’s obvious that all functions that have a proof-of-lock argument are called only while gPSMutex is locked.

Functions that are called from multiple places sometimes must be split in two: an outer function in which the mutex is initially unlocked, and an inner function that takes a proof-of-lock token. This isn’t hard, though.

Deadlocks vs. data races

After my big change to the profiler’s global state, I had to fix numerous deadlocks. This wasn’t too hard. Deadlocks (caused by too much thread synchronization) are obvious, easy to diagnose, and these ones weren’t hard to fix. It’s useful to contrast them with data races (caused by too little thread synchronization) which typically have subtle effects and are difficult to diagnose.

Patch discipline

For this work I wrote a lot of small patches. This is my preferred way to work, for two reasons. First, small patches make life easier for reviewers, which in turn results in faster reviews. Second, small patches make regression hunting easy. It’s always nice when you bisect a regression to a small patch.

Future work and Thanks

The profiler still has plenty of room for improvement, and I’m planning to do more work on it in Q2. In the meantime, if you’ve tried the profiler in the past and had problems it might be worth trying again. It’s in much better shape now.

Finally, many thanks to Markus Stange for reviewing the majority of the patches and answering lots of questions, and Julian Seward for reviewing most of the remainder and for numerous helpful discussions about threaded programming.