Performance Programming Work habits

Ad Hoc Profiling

I have used a variety of profiling tools over the years, including several I wrote myself.

But there is one profiling tool I have used more than any other. It is capable of providing invaluable, domain-specific profiling data of a kind not obtainable by any general-purpose profiler.

It’s a simple text processor implemented in a few dozen lines of code. I use it in combination with logging print statements in the programs I am profiling. No joke.


The tool is called counts, and it tallies line frequencies within text files, like an improved version of the Unix command chain sort | uniq -c. For example, given the following input.

a 1
b 2
b 2
c 3
c 3
c 3
d 4
d 4
d 4
d 4

counts produces the following output.

10 counts:
(  1)        4 (40.0%, 40.0%): d 4
(  2)        3 (30.0%, 70.0%): c 3
(  3)        2 (20.0%, 90.0%): b 2
(  4)        1 (10.0%,100.0%): a 1

It gives a total line count, and shows all the unique lines, ordered by frequency, with individual and cumulative percentages.

Alternatively, when invoked with the -w flag, it assigns each line a weight, determined by the last integer that appears on the line (or 1 if there is no such integer).  On the same input, counts -w produces the following output.

30 counts:
(  1)       16 (53.3%, 53.3%): d 4
(  2)        9 (30.0%, 83.3%): c 3
(  3)        4 (13.3%, 96.7%): b 2
(  4)        1 ( 3.3%,100.0%): a 1

The total and per-line counts are now weighted; the output incorporates both frequency and a measure of magnitude.

That’s it. That’s all counts does. I originally implemented it in 48 lines of Perl, then later rewrote it as 48 lines of Python, and then later again rewrote it as 71 lines of Rust.

In terms of benefit-to-effort ratio, it is by far the best code I have ever written.

counts in action

As an example, I added print statements to Firefox’s heap allocator so it prints a line for every allocation that shows its category, requested size, and actual size. A short run of Firefox with this instrumentation produced a 77 MB file containing 5.27 million lines. counts produced the following output for this file.

5270459 counts:
( 1) 576937 (10.9%, 10.9%): small 32 (32)
( 2) 546618 (10.4%, 21.3%): small 24 (32)
( 3) 492358 ( 9.3%, 30.7%): small 64 (64)
( 4) 321517 ( 6.1%, 36.8%): small 16 (16)
( 5) 288327 ( 5.5%, 42.2%): small 128 (128)
( 6) 251023 ( 4.8%, 47.0%): small 512 (512)
( 7) 191818 ( 3.6%, 50.6%): small 48 (48)
( 8) 164846 ( 3.1%, 53.8%): small 256 (256)
( 9) 162634 ( 3.1%, 56.8%): small 8 (8)
( 10) 146220 ( 2.8%, 59.6%): small 40 (48)
( 11) 111528 ( 2.1%, 61.7%): small 72 (80)
( 12) 94332 ( 1.8%, 63.5%): small 4 (8)
( 13) 91727 ( 1.7%, 65.3%): small 56 (64)
( 14) 78092 ( 1.5%, 66.7%): small 168 (176)
( 15) 64829 ( 1.2%, 68.0%): small 96 (96)
( 16) 60394 ( 1.1%, 69.1%): small 88 (96)
( 17) 58414 ( 1.1%, 70.2%): small 80 (80)
( 18) 53193 ( 1.0%, 71.2%): large 4096 (4096)
( 19) 51623 ( 1.0%, 72.2%): small 1024 (1024)
( 20) 45979 ( 0.9%, 73.1%): small 2048 (2048)

Unsurprisingly, small allocations dominate. But what happens if we weight each entry by its size? counts -w produced the following output.

2554515775 counts:
( 1) 501481472 (19.6%, 19.6%): large 32768 (32768)
( 2) 217878528 ( 8.5%, 28.2%): large 4096 (4096)
( 3) 156762112 ( 6.1%, 34.3%): large 65536 (65536)
( 4) 133554176 ( 5.2%, 39.5%): large 8192 (8192)
( 5) 128523776 ( 5.0%, 44.6%): small 512 (512)
( 6) 96550912 ( 3.8%, 48.3%): large 3072 (4096)
( 7) 94164992 ( 3.7%, 52.0%): small 2048 (2048)
( 8) 52861952 ( 2.1%, 54.1%): small 1024 (1024)
( 9) 44564480 ( 1.7%, 55.8%): large 262144 (262144)
( 10) 42200576 ( 1.7%, 57.5%): small 256 (256)
( 11) 41926656 ( 1.6%, 59.1%): large 16384 (16384)
( 12) 39976960 ( 1.6%, 60.7%): large 131072 (131072)
( 13) 38928384 ( 1.5%, 62.2%): huge 4864000 (4866048)
( 14) 37748736 ( 1.5%, 63.7%): huge 2097152 (2097152)
( 15) 36905856 ( 1.4%, 65.1%): small 128 (128)
( 16) 31510912 ( 1.2%, 66.4%): small 64 (64)
( 17) 24805376 ( 1.0%, 67.3%): huge 3097600 (3100672)
( 18) 23068672 ( 0.9%, 68.2%): huge 1048576 (1048576)
( 19) 22020096 ( 0.9%, 69.1%): large 524288 (524288)
( 20) 18980864 ( 0.7%, 69.9%): large 5432 (8192)

This shows that the cumulative count of allocated bytes (2.55GB) is dominated by a mixture of larger allocation sizes.

This example gives just a taste of what counts can do.

(An aside: in both cases it’s good the see there isn’t much slop, i.e. the difference between the requested sizes and actual sizes are mostly 0. That 5432 entry at the bottom of the second table is curious, though.)

Other Uses

This technique is often useful when you already know something — e.g. a general-purpose profiler showed that a particular function is hot — but you want to know more.

  • Exactly how many times are paths X, Y and Z executed? For example, how often do lookups succeed or fail in data structure D? Print an identifying string each time a path is hit.
  • How many times does loop L iterate? What does the loop count distribution look like? Is it executed frequently with a low loop count, or infrequently with a high loop count, or a mix? Print the iteration count before or after the loop.
  • How many elements are typically in hash table H at this code location? Few? Many? A mixture? Print the element count.
  • What are the contents of vector V at this code location? Print the contents.
  • How many bytes of memory are used by data structure D at this code location? Print the byte size.
  • Which call sites of function F are the hot ones? Print an identifying string at the call site.

Then use counts to aggregate the data. Often this domain-specific data is critical to fully optimize hot code.

Worse is better

Print statements are an admittedly crude way to get this kind of information, profligate with I/O and disk space. In many cases you could do it in a way that uses machine resources much more efficiently, e.g. by creating a small table data structure in the code to track frequencies, and then printing that table at program termination.

But that would require:

  • writing the custom table (collection and printing);
  • deciding where to define the table;
  • possibly exposing the table to multiple modules;
  • deciding where to initialize the table; and
  • deciding where to print the contents of the table.

That is a pain, especially in a large program you don’t fully understand.

Alternatively, sometimes you want information that a general-purpose profiler could give you, but running that profiler on your program is a hassle because the program you want to profile is actually layered under something else, and setting things up properly takes effort.

In contrast, inserting print statements is trivial. Any measurement can be set up in no time at all. (Recompiling is often the slowest part of the process.) This encourages experimentation. You can also kill a running program at any point with no loss of profiling data.

Don’t feel guilty about wasting machine resources; this is temporary code. You might sometimes end up with output files that are gigabytes in size. But counts is fast because it’s so simple… and the Rust version is 3–4x faster than the Python version, which is nice. Let the machine do the work for you. (It does help if you have a machine with an SSD.)

Ad Hoc Profiling

For a long time I have, in my own mind, used the term ad hoc profiling to describe this combination of logging print statements and frequency-based post-processing. Wikipedia defines “ad hoc” as follows.

In English, it generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes

The process of writing custom code to collect this kind of profiling data — in the manner I disparaged in the previous section — truly matches this definition of “ad hoc”.

But counts is valuable specifically because it makes this type of custom profiling less ad hoc and more repeatable. I should arguably call it “generalized ad hoc profiling” or “not so ad hoc profiling”… but those names don’t have quite the same ring to them.


Use unbuffered output for the print statements. In C and C++ code, use fprintf(stderr, ...). In Rust code use eprintln!. (Update: Rust 1.32 added the dbg! macro, which also works well.)

Pipe the stderr output to file, e.g. firefox 2> log.

Sometimes programs print other lines of output to stderr that should be ignored by counts. (Especially if they include integer IDs that counts -w would interpret as weights!) Prepend all logging lines with a short identifier, and then use grep $ID log | counts to ignore the other lines. If you use more than one prefix, you can grep for each prefix individually or all together.

Occasionally output lines get munged together when multiple print statements are present. Because there are typically many lines of output, having a few garbage ones almost never matters.

It’s often useful to use both counts and counts -w on the same log file; each one gives different insights into the data.

To find which call sites of a function are hot, you can instrument the call sites directly. But it’s easy to miss one, and the same print statements need to be repeated multiple times. An alternative is to add an extra string or integer argument to the function, pass in a unique value from each call site, and then print that value within the function.

It’s occasionally useful to look at the raw logs as well as the output of counts, because the sequence of output lines can be informative. For example, I recently diagnosed an occurrences of quadratic behaviour in the Rust compiler by seeing that a loop iterated 1, 2, 3, …, 9000+ times.

The Code

counts is available here.


I use counts to do ad hoc profiling all the time. It’s the first tool I reach for any time I have a question about code execution patterns. I have used it extensively for every bout of major performance work I have done in the past few years, as well as in plenty of other circumstances. I even built direct support for it into rustc-perf, the Rust compiler’s benchmark suite, via the profile eprintln subcommand. Give it a try!

Humans pdf.js Work habits

“Thank you” is a wonderful phrase

Last year I contributed a number of patches to pdf.js. Most of my patches were reviewed and merged to the codebase by Yury Delendik. Every single time he merged one of my patches, Yury wrote “Thank you for the patch”. It was never “thanks” or “thank you!” or “thank you :)”. Nor was it “awesome” or “great work” or “+1” or “\o/” or “\m/”. Just “Thank you for the patch”.

Oddly enough, this unadorned use of one of the most simple and important phrases in the English language struck me as quaint, slightly formal, and perhaps a little old-fashioned. Not a bad thing by any means, but… notable. And then, as Yury merged more of my patches, I started getting used to it. Tthen I started enjoying it. Each time he wrote it — I’m pretty sure he wrote it every time — it made me smile. I felt a small warm glow inside. All because of a single, simple, specific phrase.

So I started using it myself. (“Thank you for the patch.”) I tried to use it more often, in situations I previously wouldn’t have. (“Thank you for the fast review”.) I mostly kept to this simple form and eschewed variations. (“Thank you for the additional information.”) I even started using it each time somebody answered one of my questions on IRC. (“glandium: thank you”)

I’m still doing this. I don’t always use this exact form, and I don’t always remember to thank people who have helped me. But I do think it has made my gratitude to those around me more obvious, more consistent, and more sincere. It feels good.

Humans Work habits

Two suggestions for the Portland work week

Mozilla is having a company-wide work week in Portland next week. It’s extremely rare to have this many Mozilla employees in the same place at the same time, and I have two suggestions.

  • Write down a list of people that you want to meet. This will probably contain people you’ve interacted with online but not in person. And send an email to everybody on that list saying “I would like to meet you in person in Portland next week”. I’ve done this at previous work weeks and it has worked well. (And I did it again earlier today.)
  • During the week, don’t do much work you could do at home. This includes most solo coding tasks. If you’re tempted to do such work, stand up and try to find someone to talk to (or listen to) who you couldn’t normally talk to easily. (This is a rule of thumb; if a zero-day security exploit is discovered in your code on Tuesday morning, yes, you should fix it.) Failing that, gee, you might as well do something that you can only do in Portland.

That’s it. Have a great week!

Work habits

Lightweight branches aren’t always appropriate

Recently, I wrote about how I use multiple (10!) clones of the mozilla-inbound repository, with one Mercurial queue per clone, to work on multiple changes to the Mozilla codebase concurrently.

At times, I’ve felt almost guilty about using such a heavyweight branching mechanism, as opposed to a lightweight (i.e. intra-clone) branching mechanism such as git branches, or Mercurial bookmarks, or multiple Mercurial queues in a single clone (managed via hg qqueue). It seemed clumsy, like I was missing out on a compelling feature of modern version control systems.

But I now have come to understand that each approach is appropriate in a particular circumstances. In parcticular, lightweight branches are not appropriate when code modifications incur a non-trivial build-time cost.

Consider a Mozilla developer who works only on code that doesn’t need compilation, such as JavaScript, HTML, or Python. After modifying code, such a developer might incur a zero, or almost-zero build-time cost. For example, they might not have to do anything for their changes to propagate into the built product, or they might merely have to copy the modified file into the build directory.

For this developer, lightweight branches are entirely appropriate, because they can switch between branches with hardly a care in the world.

In contrast, consider a Mozilla developer (such as me!) who works mostly on C++ code within Gecko. After modifying code, this developer incurs a decidedly non-zero build cost — on my machine, just linking libxul takes around ten seconds. So any change to Gecko’s C++ code will require at least this much time, and it’s often substantially more, especially for anyone using a slow machine and/or OS.

For this developer, lightweight branches are not appropriate, because they will have to wait for rebuilding to occur every time they switch. ccache mitigates this problem, but it doesn’t solve it. In particular, the developer may well have switched away from one branch precisely because they are waiting for a long-running build to complete, and lightweight branches certainly don’t allow that.

These two distinct cases may be obvious to some people, but they weren’t to me. If nothing else, as someone who mostly works on C++ Mozilla code, I now can feel content with the heavyweight branching approach I use.

Work habits

How I work on Mozilla code

Most Mozilla developers have their own particular set-ups and work-flows, and over time develop various scripts, shortcuts, and habits to make their lives easier.  But we rarely talk about them.

In this post I will describe various interesting aspects of how I work on Mozilla code, in the hope that (a) it will give other people useful ideas, and (b) other people will in turn give me useful ideas.


I have two machines:  a quite new and very fast Linux desktop machine, on which I do 99% of my coding work, and a two-year old Macbook Pro, on which I do little coding but a lot of browsing and other non-development stuff.  In theory my Linux desktop also has a Windows VM on it, though in practice that hasn’t happened yet!

I use Ubuntu on my Linux machine.  I don’t really enjoy sysadmin-type stuff, so I use the most widely-used, vanilla distribution available.  That way, if something goes wrong there’s a decent chance somebody else will have had the same problem.

Mercurial Repositories

I do most of my work on mozilla-inbound.  I clone that into a “master” repository,  that I leave untouched, called ws0. Then I have nine local clones of ws0, called ws1..ws9. I don’t have trouble filling the nine local clones because I usually have multiple different coding tasks in flight at any one time. Indeed, this is a necessity when dealing with the various latencies involved with development, such as compilation and local test runs (minutes), try server runs (hours), and reviews (days).


I use Mercurial queues and am quite content with them, though I am looking forward to changeset evolution becoming stable.  I tried git once but didn’t like it much;  the CLI is awful, the speed wasn’t noticeably better than Mercurial (and this was before I upgraded from Mercurial 1.8 to 2.7, which is much faster), and it was inconvenient to have to move patches over to a Mercurial repo before landing.

One problem I had with Mercurial queues was that I would often type hg qref patchname when I meant hg qnew patchname. This can lead to surprising and annoying mix-ups with patches, so I wrote a pre-hook for hg qref — if I give it an argument that isn’t -e or -u, I almost certainly meant hg qnew, so it aborts with a reminder. This has saved me some hassle on numerous occasions.

Some Mercurial extensions that I particularly like are color (colour output), progress (progress bars for slow operations), record (commit part of a change) and bzexport (upload patches to Bugzilla from the command line).

Try Server

Even though I do most of my work with mozilla-inbound, pushing to try from mozilla-inbound isn’t a great idea, because every so often you’ll catch some test breakage caused by someone else’s patch and then you have to work out if it’s your fault or was pre-existing.  So recently I took RyanVM’s advice and wrote a script that transplants the patch queue from a mozilla-inbound repo to a mozilla-central repo, and then pushes to try.  This avoids the test bustage problem, but occasionally the patches don’t apply cleanly.  In that case I just push from the mozilla-inbound repo and deal with the risk of test bustage.


I mostly use Clang, though I also sometimes use GCC.  Clang is substantially faster than GCC, and its error messages are much better (though GCC’s have improved recently).  Clang’s generated code is slightly worse (~5–10% slower), but that’s not much of an issue for me while developing.  In fact, faster compilation time is important enough that my most common build configuration has the following line:

ac_add_options --enable-optimize='-O0'  # worse code, but faster builds

Last time I measured (which was before unified builds were introduced) this shaved a couple of minutes off build times, as compared to a vanilla --enable-optimize build.


I have a lot of mozconfig files.  I switch frequently between different kinds of builds, so much so that all my custom short-cut commands (e.g. for building and running the browser) have a mandatory argument that specifies the relevant mozconfig.  As a result, the mozconfig names I use are much shorter than the default names.  For example, the following is a selection of some of my more commonly-used mozconfigs for desktop Firefox.

  • d64: debug 64-bit build with clang
  • o64: optimized 64-bit build with clang
  • gd64: debug 64-bit build with GCC
  • go64: optimized 64-bit build with GCC
  • cd64: debug 64-bit build with clang and ccache
  • co64: optimized 64-bit build with clang and ccache
  • o64v: optimized 64-bit build with clang and –enable-valgrind
  • o64dmd: optimized 64-bit build with clang and –enable-dmd

Although I never do 32-bit browser builds, I do sometimes do 32-bit JS shell builds, so the ’64’ isn’t entirely redundant!


I have a script called mmq that I use to build the browser.  I invoke it like this:

mmq o64

The argument is the mozconfig/objdir name.  This script invokes the build and redirects the output to an errors.err file (for use with quickfix, see below).  Once compilation finishes, the script also does some hacky grepping to reprint the first five compilation errors, if they exist, once compilation finishes.  I do this to make it easier to find the errors — sometimes they get swamped by the subsequent output.  (My use of Quickfix has made this feature less important than it once was, though it’s still a good thing to have.)


I have multiple profiles.

  • default: used for my normal browsing.
  • dev: my standard development profile.
  • dev2: a second development profile, mostly used if I already am using the dev profile in another browser invocation.
  • e10s: a profile with Electrolysis enabled.

Starting Firefox

I have a script called ff with which I start Firefox like the following.

ff o64 dev

The first argument is the mozconfig, and the second is the profile.  Much of the time, this invokes Firefox in the usual way, e.g.:

o64/dist/bin/firefox -P dev -no-remote

However, this script knows about my mozconfigs and it automatically does more elaborate invocations when necessary, e.g. for DMD (which requires the setting of some environment variables).  I also wrote it so that I can tack on gdb as a third argument and it’ll run under GDB.

Virtual desktop and window layout

I use a 2×2 virtual desktop layout.

In each of the top-left and bottom-left screens I have three xterms — a full-height one on the left side in which I do editing, and two half-height ones on the right side in which I invoke builds, hg commands, and sundry other stuff.

In the top-right screen I have Firefox open for general use.

In the bottom-right screen I have a Chatzilla window open.

Text Editing

I use Vim.  This is largely due to path dependence;  it’s what they taught in my introductory programming classes at university, and I’ve happily used it ever since.  My setup isn’t particularly advanced, and most of it isn’t worth writing about.  However, I have done a few things that I think are worth mentioning.


Tags are fantastic — they let you jump immediately to the definition of a function/type/macro.  I use ctags and I have an alias for the following command.

ctags -R --langmap=C++:.c.h.cpp.idl --languages=C++ --exclude='*dist\/include*' --exclude='*[od]32/*' --exclude='*[od]64/*'

I have to re-run this command periodically to keep up with changes to the codebase.  Fortunately, it only takes about 5 seconds on my fast SSD.  (My old machine with a mechanical HD took much longer).  The coverage isn’t perfect but it’s good enough, and the specification of .idl files in the –langmap option was a recent tweak I made that improved coverage quite a bit.


I now use quickfix, which is a special mode to speed up the edit-compile-edit cycle.  The commands I use to build Firefox redirect the output to a special file, and if there are any compile errors, I use Vim’s quickfix command to quickly jump to their locations.  This saves enormous amounts of manual file and line traversal — I can’t recommend it highly enough.

In order to use quickfix you need to tell Vim what the syntax of the compile errors is.  I have the following command in my .vimrc for this.

" Multiple entries (separated by commas):
" - compile (%f:%l:%c) errors for different levels of file nesting
" - linker (%f:%l) errors for different levels of file nesting
set efm=../../../../../%f:%l:%c:\ error:\ %m,../../../../%f:%l:%c:\ error:\ %m,../../../%f:%l:%c:\ error:\ %m,../../%f:%l:%c:\ error:\ %m,../%f:%l:%c:\ error:\ %m,%f:%l:%c:\ error:\ %m,../../../../../%f:%l:\ error:\ %m,../../../../%f:%l:\ error:\ %m,../../../%f:%l:\ error:\ %m,../../%f:%l:\ error:\ %m,../%f:%l:\ error:\ %m,%f:%l:\ error:\ %m

This isn’t pretty, but it works well for Mozilla code.  Then it’s just a matter of doing :cf to load the new errors file (which also takes me to the first error) and then :cn/:cp to move forward and backward through the list.  Occasionally I get an error in a header file that is present in the objdir and the matching fails, and so I have to navigate to that file manually, but this is rare enough that I haven’t bothered trying to fix it properly.

One nice thing about quickfix is that it lets me start fixing errors before compilation has finished!  As soon as I see the first error message I can do :cf.  This means I have to re-do :cf and skip over the already-fixed errors if more errors occur later, but this is still often a win.

If you use Vim, work on Mozilla C++ code, and don’t use it, you should set it up right now.  There are many additional commands and options, but what I’ve written above is enough to get you started, and covers 95% of my usage.  (In case you’re curious, the :cnf, :cpf, :copen and :cclose commands cover the other 5%.)


I also set up Vim’s :grep command for use with Firefox.  I put the following script in ~/bin/.

#! /bin/sh
if [ -z "$pattern" ]; then
    echo "usage: $FUNCNAME <pattern>";
    return 1;
grep -n -r -s \
    --exclude-dir="*[od]32*" \
    --exclude-dir="*[od]64*" \
    --exclude-dir=".hg*" \
    --exclude-dir=".svn*" \
    --include="*.cpp" \
    --include="*.h" \
    --include="*.c" \
    --include="*.idl" \
    --include="*.html" \
    --include="*.xul" \
    --include="*.js" \
    --include="*.jsm" \

It does a recursive grep for a pattern through various kinds of source files, ignoring my objdirs and repository directories.  To use it, I just type “:grep <pattern>” and Vim runs the script and sends the results to quickfix, so I can again use :cn and :cp to navigate through the matches.  (Vim already knows about grep’s output format, so you don’t need to configure anything for that.)

I can also use that script from the command line, which is useful when I want to see all the matches at once, rather than stepping through them one at a time in Vim.

Trailing whitespace detection

This line in my .vimrc tells Vim to highlight any trailing whitespace in red.

highlight ExtraWhitespace ctermbg=red guibg=red
autocmd BufWinEnter *.{c,cc,cpp,h,py,js,idl} match ExtraWhitespace /\s\+$/
autocmd InsertEnter *.{c,cc,cpp,h,py,js,idl} match ExtraWhitespace /\s\+\%#\@<!$/
autocmd InsertLeave *.{c,cc,cpp,h,py,js,idl} match ExtraWhitespace /\s\+$/
autocmd BufWinLeave *.{c,cc,cpp,h,py,js,idl} call clearmatches()

[Update: I accidentally omitted the final four lines of this when I first published this post.]

The good thing about this is that I now never submit patches with trailing whitespace.  The bad thing is that I can see where other people have left behind trailing whitespace 🙂


Finally, I installed the Ctrl-P plugin, which aims to speed up the opening of files by letting you type in portions of the file’s path.  This is potentially quite useful for Mozilla code, where the directory nesting can get quite deep.  However, Ctrl-P has been less of a win than I hoped, for two reasons.

First, it’s quite slow, even on my very fast desktop with its very fast SSD.  While typing, there are often pauses of hundreds of milliseconds after each keystroke, which is annoying.

Second, I find it too eager to find matches. If you type a sequence of characters, it will match against any file that has those characters present in that order, no matter how many other characters separate them, and it will do so case-insensitively.  This might work well for some people, but if I’m opening a file such as content/base/src/nsContentUtils.cpp, I always end up typing just the filename in full.  By the time I’ve typed “nsContentU” ideally it would be down to the two files in the repository that match that exact string.  But instead I get the following.

> widget/tests/window_composition_text_querycontent.xul
> dom/interfaces/base/nsIQueryContentEventResult.idl
> dom/base/nsQueryContentEventResult.cpp
> dom/base/nsQueryContentEventResult.h
> content/base/public/nsIContentSecurityPolicy.idl
> content/base/public/nsContentCreatorFunctions.h
> dom/interfaces/base/nsIContentURIGrouper.idl
> content/base/public/nsContentPolicyUtils.h
> content/base/src/nsContentUtils.cpp
> content/base/public/nsContentUtils.h

I wish I could say “case-insensitive exact matches, please”, but I don’t think that’s possible.  As a result, I don’t use Ctrl-P that much, though it’s still occasionally useful if I want to open a file for which I know the name but not the path — it’s faster for that than dropping back to a shell and using find.


That’s everything of note that I can think of right now.  Please feel free to steal as many ideas as you wish from this post.

I haven’t given full details for a lot of the things I mentioned above. I’m happy to give more details (e.g. what various scripts look like) if people are interested.

Finally, I encourage other developers to write posts like this one explaining how they work on Mozilla code.  Thanks!

Firefox Work habits

Working on the browser sucks

I’ve joked before that I hoped to never work on anything in Firefox that requires me to build the browser regularly.  That probably sounds weird unless you are a member of the JavaScript team, because it’s possible to do close to 100% of SpiderMonkey development just building the JS shell.

Working on the JS shell is great.  On my 2.5 year old Linux box it takes maybe 1 minute to build from scratch, rebuilds are almost instantaneous, the regression tests take a few minutes, shell start-up is instantaneous, you rarely have to do try server runs, tools like GDB and Valgrind are easy to run, and landing patches on the TraceMonkey repo is low-stress because breakage doesn’t affect too many people.

In comparison, working on the browser sucks.  Builds from scratch take 25 minutes, zero-change rebuilds take 1.5 minutes, single-change rebuilds take 3 or 4 minutes, the linking stage grinds my machine to a halt, cold starts take up to 20 seconds or more (warm starts are much better), the test suites are gargantuan, every change requires a try server run, tools like GDB and Valgrind require jumping though hoops (--disable-jemalloc, anyone?), and landing patches on mozilla-central is stressful.

Thanks to my recent about:memory work, I’ve had to experience this pain first-hand.  It’s awful.  Debugging experiments that would take 20 seconds in the shell take 5 minutes in the browser. I avoid ‘hg up’ as much as possible due to the slow rebuilds it usually entails.  How do all you non-JS people deal with it?  Maybe you just get used to it… but I figure there have to be some tips and tricks I’m not aware of.

(Nb: Why are rebuilds so slow?  Because configure is invoked every time?  Imprecise dependencies in makefiles?  Too much use of recursive make?  Bug 629668? Why do Fennec rebuilds seem to be slower than Firefox rebuilds?)

Kyle Huey told me how you can get away with rebuilding only parts of the browser.  Eg. if I only modify code under xpcom/, I can just do make -C <build>/xpcom && make -C <build>/toolkit/library and this reduces the rebuild time a bit.  The down-side is that when I screw it up, eg. by forgetting to rebuild a directoy that I changed, it creates Frankenbuilds, and realizing what I’ve done can end up taking a lot more time than I saved.

Another trick I worked out:  implement things in JavaScript.  Seriously!  While doing my about:memory revamp I originally had a big chunk of the work done on the C++ side.  I quickly realized that if I did most of it on the JavaScript side I could see the effect of most changes by simply copying aboutMemory.js from my source dir to my build dir and then reloading the page.  Much better than re-building and re-starting.

What else can I do?  Get a faster machine is the obvious option, I guess.  More cores would help, though linking would still be a bottleneck.  Do SSDs make a big difference?

Also, there’s talk of using more project branches and introducing mozilla-staging.  That would avoid the stress of landing on mozilla-central, but that’s really the smallest part of what I’m complaining about.

Any and all suggestions are welcome!  Please, I’m begging you.

Correctness Programming Tracemonkey Work habits

Safer refactoring in Nanojit

Recently I’ve been making a lot of refactoring changes to Nanojit that I think of as “code hygiene” — changes that clean up the code, but don’t change its behaviour at all.  (Example 1, example 2.)  When making changes like these I like to do the changes carefully and gradually to make sure that I don’t affect behaviour unintentionally.  In Nanojit’s case, this means that the generated code shouldn’t change.  But it’s always possible to make mistakes, and sometimes a particular change is disruptive enough that you can’t evolve the code from one state to another without breaking it temporarily.

In such cases, if the regression tests pass it tells me that the new code works, but it doesn’t tell me if I’ve managed to avoid changing the behaviour as I intended.  So I’ve started doing comparisons of the native code generated by Nanojit for parts of SunSpider using TraceMonkey’s verbose output, which is controlled with the TMFLAGS environment variable.  For such changes, the generated code should be identical to an unchanged version.  The way I organise my workspaces is such that I always have an unchanged repository available, which makes such comparisons easy.

Except, the generated code is never quite identical because it contains addresses and the slightest change in the executable can affect these.  So I run the verbose output through this script:

perl -p -e 's/[0-9a-fA-F]{4,}/....../g'

It just replaces numbers with four or more digits with “……”, which is enough to filter out these address differences.

I tried this technique on all of SunSpider, but it turns out it doesn’t work reliably, because crypto-aes.js is non-deterministic — it contains a call to Date.getTime() which introduces enough non-determinism that TraceMonkey sometimes traces different code.  A couple of the V8 benchmarks have similar non-deterministic behaviours.  Non-determinism is a rather undesirable feature of a benchmark suite so I filed a SunSpider bug which is yet to be acted on.

Fortunately, experience has shown me that I don’t need to do code diffs on all of SunSpider to be confident that I haven’t changed Nanojit’s behaviour — doing code diffs on two or three of the bigger benchmarks suffices, as enough code is generated that even small behavioural differences are quickly found.

In fact, this code diff technique is useful enough that I’ve started using it sometimes even when I do want to change behaviour — I split such changes into two parts, one which doesn’t change the behaviour and one which does.  For example, when I added an opcode to distinguish between 64-bit integer loads and 64-bit integer floats I landed a non-behaviour-changing patch first, then did a follow-up change that improved x86-64 code generation.  This turned out to be a good idea because I introduced a regression with the follow-up change, and the fact that it was separated from the non-behavioural changes made it easy to determine what had gone wrong, because the regressing patch was much smaller than a combined patch would have been.

I also recently wrote a script to automate the diff process, which makes it that much easier to perform, which in turn makes it that much more likely that I’ll actually do it.

In summary, it’s always worth thinking about new ways to test your code, and when you find a good one, it’s worth making it as easy as possible to run those tests.

Mercurial Tracemonkey Work habits

How I Work on Tracemonkey

After six months of working on Tracemonkey, I’ve built up a particular workflow — how I use Mercurial, arrange my workspaces, run tests, and commit code.  I thought it would be worth describing this in case it helps other developers improve their workflow, or perhaps so they can give me ideas on how to improve my own workflow.

Workspace Structure

I have two machines, an Ubuntu Linux desktop and a Mac laptop.  For both machines I use the same workspace structure.  All my Mozilla work is in a directory ~/moz/.  At any one time I have up to 10 workspaces.  ~/moz/ws0/ always contains an unmodified clone of the tracemonkey repository, created like so:

hg clone ~/moz/ws0

Workspaces ~/moz/ws1 through to ~/moz/ws9 are local clones of ~/moz/ws0/ in which I make modifications.  I create these workspaces like this:

hg clone ~/moz/ws0 ~/moz/wsN

Local hg clones are much cheaper than ones done over the network.  On my Linux box it takes about 45 seconds, on my Mac somewhere over 2 minutes;  it seems that laptops have slower hard disks than desktops.  In comparison, cloning can take anywhere from 5 to 30 minutes or more (I don’t know why there’s so much variation there).

I mostly work with the Javascript shell, called ‘js’, so I do most of my work in ~/moz/wsN/js/src/.  There are three ways I commonly build ‘js’.

  • Debug builds go in ~/moz/wsN/js/src/debug/.  I use these for most of my development and testing.
  • Optimised builds go in ~/moz/wsN/js/src/opt/.  I use these for measuring performance.
  • Optimised builds with symbols go in ~/moz/wsN/js/src/optg/.  I use these with Cachegrind, which needs optimised code with symbols to be useful.

I have a number of bash aliases I use to move around these directories:

alias m="cd ~/moz/"
alias m0="cd ~/moz/ws0/"
alias j0="cd ~/moz/ws0/js/src/"
alias j0d="cd ~/moz/ws0/js/src/debug/"
alias j0o="cd ~/moz/ws0/js/src/opt/"

and so on for the remaining workspaces ws1 through ws9.  I have a common bash config file that I use on both my machines;  whenever I change it I copy it to the other machine.  This is a manual process, which is not ideal, but in practice it works well enough.

I find nine workspaces for making changes is enough to cover everything I’m doing;  if I find myself needing more it’s because some of the existing ones have stagnated and I need to do some cleaning up.

Building ‘js’

I have three scripts, js_conf_debug, js_conf_opt, js_conf_optg, which configure and build from scratch.  Here is js_conf_debug, the others are similar:

#! /bin/sh

if [ -z $1 ] ; then
    echo "usage: $0 <dirname>"
elif [ -d $1 ] ; then
    echo "directory $1 already exists"
    mkdir $1
    cd $1
    CC='gcc -m32' CXX='g++ -m32' AR=ar ../configure \
        --enable-debug --disable-optimize --target=i686-pc-linux-gnu
    make --quiet -j 2

These are scripts rather than bash aliases or functions because they are quite different on the Linux machine and the Mac.

I also have this alias for incremental builds:

alias mq="make --quiet -j 2"

Testing ‘js’

The program I run most is trace-test.js.  So much so that I have more aliases for it:

alias jsott="time opt/js -j trace-test.js"
alias jsdtt="time debug/js -j trace-test.js"

I don’t need an alias for the optg build because that’s only used with Cachegrind, which I run in a different way (see below).

I run the JS unit test with the following script:

function js_regtest
    if [ -z $x ] || [ -z $y ] ; then
        echo "usage: js_regtest <ws-number-1> <ws-number-2>"
        echo "############################"
        echo "## COMPILING $xdir"
        echo "############################"
        cd $xdir && mq
        echo "############################"
        echo "## COMPILING $ydir"
        echo "############################"
        cd $ydir && mq
        cd $ydir/../../tests
        echo "############################"
        echo "## TESTING $xdir"
        echo "############################"
        time \
            -k \
            -e smdebug \
            --opt '-j' \
            -L spidermonkey-n.tests slow-n.tests \
            -f base.html \
            -s $xdir/js && \
        echo "############################"
        echo "## TESTING $ydir"
        echo "############################"
        time \
             -k \
             -e smdebug \
             --opt '-j' \
             -L spidermonkey-n.tests slow-n.tests \
             -L base-failures.txt \
             -s $ydir/js

An example invocation would be:

js_regtest 0 3

The above invocation first ensures a debug ‘js’ is built in workspaces 0 and 3.  Then it runs ~/moz/ws0/js/src/debug/js in order to get the baseline failures, which are put in base-failures.txt.  Then it runs ~/moz/ws3/js/src/debug/js and compares the results against the baseline.  The -L lines skip the tests that are really slow;  without them it takes hours to run.  I time each invocation just so I always know roughly how long it takes;  it’s a bit over 10 minutes to do both runs.  It assumes that workspace 0 and 3 correspond to the same hg revision;  perhaps I could automate that to guarantee it but I haven’t (knowingly) got that wrong yet so haven’t bothered to do so.

Timing ‘js’

I time ‘js’ by running SunSpider.  I obtained it like so:

svn ~/moz/SunSpider

I haven’t updated it in a while, I hope it hasn’t changed recently!

I run it with this bash function:

function my_sunspider
    if [ -z $x ] || [ -z $y ] || [ -z $n ] ; then
        echo "usage: my_sunspider <ws-number-1> <ws-number-2> <number-of-runs>"
        for i in $x $y ; do
            dir = $HOME/moz/ws$i/js/src/opt
            cd $dir || exit 1
            make --quiet || exit 1
            cd ~/moz/SunSpider
            echo "############################"
            echo "####### TESTING ws$i #######"
            echo "############################"
            time sunspider --runs=$n --args='-j' --shell $dir/js > opt$i

         my_sunspider_compare_results $x $y

function my_sunspider_compare_results
    if [ -z $x ] || [ -z $y ] ; then
        echo "usage: my_sunspider_compare_results <ws-number-1> <ws-number-2>"
        sunspider-compare-results \
            --shell $HOME/moz/ws$x/js/src/opt/js opt$x opt$y

An invocation like this:

my_sunspider 0 3 100

will ensure that optimised builds in both workspaces are present, and then compare them by doing SunSpider 100 runs.  That usually gives what SunSpider claims as +/-0.1% variation (I don’t believe it, though).  On my Mac this takes about 3.5 minutes, and 100 runs is enough that the results are fairly reliable, certainly more so than the default of 10 runs.  But when testing a performance-affecting change I like to do some timings, wait until a few more patches have landed in the tree, then update and rerun the timings — on my Mac I see variations of 5-10ms regularly due to minor code differences.  Timing multiple versions like this gives me a better idea of whether a timing difference is real or not.  Even then, it’s still not easy to know for sure, and this can be frustrating when trying to work out if an optimisation I applied is really giving a 5ms speed-up or not.

On my Linux box, I have to use 1000 runs to get +/-0.1% variation.  This takes about 25 minutes, so I rarely do performance-related work on this machine.  I don’t know why Linux causes greater timing variation.

Profiling ‘js’ with Cachegrind

I run Cachegrind on ‘js’ running SunSpider with this bash function:

function cg_sunspider
    if [ -z $x ] || [ -z $y ] ; then
        echo "usage: cg_sunspider <ws-number-1> <ws-number-2>"
        for i in $x $y ; do
            dir = $HOME/moz/ws$i/js/src/optg
            cd $dir || exit 1
            make --quiet || exit 1
            cd ~/moz/SunSpider
            time valgrind --tool=cachegrind --branch-sim=yes --smc-check=all \
                --cachegrind-out-file=cachegrind.out.optg$i \
                --auto-run-dsymutil=yes \
                $dir/js `cat ss0-args.txt`
            cg_annotate --auto=yes cachegrind.out.optg$i > ann-optg$i

ss0-args.txt contains this text:

-j -f tmp/sunspider-test-prefix.js -f resources/sunspider-standalone-driver.js

What this does is run just the main SunSpider program, once, avoiding all the start-up processes and all that.  This is important for Cachegrind — it means that I can safely use –cachegrind-out-file to name a specific file, which is not safe if running Cachegrind on a program involving multiple processes.   (I think this is slightly dangerous… if you run ‘sunspider –ubench’ it seems to change one of the above .js files and you have to rerun SunSpider normally to get them back to normal.)  I use –branch-sim=yes because I often find it to be useful; at least twice recently it has helped me identify performance problems.

If I want to focus on a particular Cachegrind statistic, e.g. D2mr (level 2 data read misses) or Bim (indirect branch mispredictions) then I rerun cg_annotate like this:

cg_annotate --auto=yes --show=I2mr --sort=I2mr cachegrind.out.optgN > ann-optgN-I2mr

Profiling ‘js’ with Shark

To profile ‘js’ with Shark, I use SunSpider’s –shark20 and –test options.  I don’t have this automated yet, I probably should.

Managing Changes with Mercurial

Most of my changes are not that large, so I leave them uncommitted in a workspace.  This is primitive, but has one really nice feature:  when pulling and updating, hg merges the changes and marks conflicts in the nice “<<<” “>>>” way.

In comparison, with Mercurial queues (which I tried for a while) you have to pop your patches, update, then push them, and it uses ‘patch’ to do the merging.  And I hate ‘patch’ because conflicted areas tend to be larger, and because they go in a separate reject file rather than being inserted inline.

I also avoid doing local commits unless I’m working on something really large just because the subsequent merging is difficult (at least, I think it’s difficult;  my Mercurial knowledge still isn’t great).  In that case I do local commits until the change is finished, then apply the patch (using ‘hg diff’ and ‘patch’) in a single hit to a newly cloned tree — given Mozilla’s use of Bugzilla, the change will have to be a single patch anyway so this aggregation step has to happen at some point.

Pre-push Checklist

Before landing any patch, I do my best to work through the following check-list.  I created this list recently after having to back out several commits due to missing one of the above steps;  I give examples of breakage I’ve caused in square brackets.

  • Ensure there are no new compiler warnings for ‘js’ for optimised and debug builds.  [I managed to introduce some warnings on an optimised build recently for what was supposedly a whitespace-only change!]
  • Ensure ‘js’ runs trace-test.js without failures, for optimised builds, debug builds, debug builds with TMFLAGS=full (to test the verbose output) under Valgrind (to test for memory errors).  [I’ve had to back out several patches due to breaking TMFLAGS=full]
  • Ensure lirasm builds and passes its tests for both optimised and debug builds.  [I’ve forgotten this numerous times, leaving lirasm in a broken state, which is why I created bug 503449].
  • Ensure unit tests pass with a debug build.  [Amusingly enough, I don’t think I’ve ever caused breakage by forgetting this step!]
  • (For any commit that might affect performance) Check SunSpider timings with an optimised build.
  • (For complex changes) Check the patch on the try servers. (Nb: they run optimised builds, so will miss assertion failures among other things)
  • (For changes affecting the ARM backend) Check the patch builds and runs trace-test.js (using a debug build) on my virtual qemu+ARM/Linux machine.
  • Check tinderbox to make sure the tree is open for commits.  [When the tree is closed, there’s no mechanism that actually prevents you from committing.  I had to back-out a patch during a tinderbox upgrade because of this.]

It’s quite a list, and I don’t usually do anything with a browser build, when I probably should, so that would make it even longer.  And there are other things to get wrong… for example, I never test the –disable-jit configuration and I broke it once.


When I’m ready to push a change, I make sure my workspaces are up-to-date with respect to the Mozilla repo.  I then commit the change to my modified repo, then push it from there into ~/moz/ws0/, then check ‘hg outgoing -p’ on that repo to make sure it looks ok, and then push to the Mozilla repo from there.  I try to do this quickly so that no-one else lands something in the meantime;  this has only happened to me once and I tried to use ‘hg rollback’ to undo my local changes which I think should have worked but seemingly didn’t.

Post-push Checklist

After committing, I do these steps:

  • Mark the bug’s “whiteboard” field as “fixed-in-tracemonkey”.
  • Put a link to the commit in a comment for the bug, of the form<revhash>/.  I always test the link before submitting the comment.


That’s a lot of stuff.  Two of my more notable conclusions are:

  • Automation is a wonderful thing.  In particular, having scripts for the complicated tasks (e.g. running the unit tests, running sunspider, running sunspider under Cachegrind) has saved me lots of time and typing (and lots of head-scratching and re-running when I realised I forgot some command line option somewhere).  And this automation was made much easier once I settled on a standard workspace+build layout.
  • The pre-push checklist is both disconcertingly long and disconcertingly incomplete.  And I had to work it out almost entirely by myself — I’m not aware of any such check-list documented anywhere else.  Having lots of possible configurations really hurts testability.  I’m not sure how to improve this.

If you made it this far, congratulations!  That was pretty dry, especially if you’re not a Tracemonkey developer.  I’d love to hear suggestions for improving what I’m doing.