We build a lot of code at Mozilla. Every time someone pushes changes to the code that makes up Firefox, we build the application on multiple platforms in a variety of build configurations. This means we’re constantly looking for ways to make builds faster: to get results from our builds and tests sooner, and to use less machine time so that we need fewer machines and spend less money.

A few years ago my colleague Mike Hommey did some work to see if we could deploy a shared compiler cache. We had been using ccache for many of our builds, but since we use ephemeral build machines in AWS and we also have a large pool of build machines, it doesn’t help as much as it does on a developer’s local machine. If you’re interested in the details, I’d recommend you go read his series of blog posts: Shared compilation cache experiment, Shared compilation cache experiment, part 2, Testing shared cache on try and Analyzing shared cache on try. The short version is that the project (which he named sccache) was extremely successful and improved our build times in automation quite a bit. Another nice win was that he added support for Microsoft Visual C++ in sccache, which is not supported by ccache, so we were finally able to use a compiler cache on our Windows builds.

This year we started a concerted effort to drive build times down even more, and we’ve made some great headway. Some of the ideas for improvement we came up with would involve changes to sccache. I started looking at making changes to the existing Python sccache codebase and got a bit frustrated. This is not to say that Mike wrote bad code; he does fantastic work! But by the nature of its design sccache does a lot of concurrent work, and Python just does not excel at that workload. When I talked with Mike, he mentioned that he had originally planned to write sccache in Rust, but at the time Rust had not had its 1.0 release and the ecosystem just wasn’t ready for the work he needed to do. I had spent several months learning Rust after attending an “introduction to Rust” training session, and I thought it’d be a good time to revisit that choice. (I went back and looked at some meeting notes, and in late April I wrote a bullet point “Got distracted and started rewriting sccache in Rust”.)

As with all good software rewrites, the reality of things turned it into a much longer project than anticipated. (In fairness to myself, I did set it aside for a few months to spend time on another project.) After seven months of part-time work, the project has finally gotten to the point where I’m ready to put it into production, replacing the existing Python tool. I did a series of builds on our Try server to compare the performance of the existing sccache and the new version, mostly to make sure that I wasn’t going to cause regressions in build time. I was pleasantly surprised to find that the Rust version gave us a noticeable improvement in build times! I hadn’t done any explicit optimization work, but some of the improvement is likely due to process startup overhead being much lower for a Rust binary than for a Python script. It actually lowered the time we spend running our configure script by about 40% on our Linux builds and 20% on our OS X builds. That makes some sense: configure invokes the compiler quite a few times, and when ccache or sccache is in use, each of those invocations goes through the cache tool.

My next steps are to tackle the improvements that were initially discussed. One is making sccache usable for local developers; since Windows developers can’t currently use ccache at all, this should help quite a bit there. We also want to make it possible for developers to use sccache and get cache hits from builds that our automation has already done. I’d also like to spend some time polishing the tool a bit so that it’s usable to a wider audience outside of Mozilla. It solves real problems that I’m sure other organizations face as well, and it’d be great for others to benefit from our work. Plus, it’s pretty nice to have an excuse to work in Rust. 🙂 You can find the code for the rewritten sccache on GitHub.

Overall I’ve really enjoyed the experience of working in Rust on this project. Compared to working on the Python version of the tool, it was nice to have static typing catch my mistakes at compile time. I’ve really grown to love Rust as a language; I miss things like the match expression when I’m working in other languages now! There were certainly some growing pains: I hit a few cases where the crates.io ecosystem just didn’t have something I expected, or where the Rust standard library was missing a feature I needed, but those were not very common occurrences for me. I would definitely reach for Rust again for a project like this!

Most smartphones use ARM processors. Much like how most PCs use x86 processors, for various reasons ARM has become the CPU of choice for mobile devices. Similar to x86, there are different versions of ARM processors that support different features. One of the biggest differences is which instruction set is supported. Instructions are the smallest units of work a processor can do, and an instruction set is the particular set of instructions a given processor knows how to run. For Intel, the instruction set changed as they went from the 386 to the 486 to the Pentium and so on. For ARM, the instruction sets are numbered, with the most current one in use being ARMv7 (with ARMv8 in development). Confusingly, ARM’s processors themselves have similar naming: the ARM11 is the generation that supports the ARMv6 instruction set, and ARM Cortex is the generation that supports the ARMv7 instruction set. All high-end smartphones currently shipping use processors that support the ARMv7 instruction set. The Apple iPhone 4S, the Samsung Galaxy S2 and Galaxy Nexus, and others all come with similar processors. They’re all similarly fast as smartphone processors go, and ARMv7 contains lots of features that allow programs to run very quickly.
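To make the distinction concrete, here’s a rough sketch (not code from Firefox Mobile, just an illustration) of how a native program on an ARM Linux or Android device might check which architecture version the kernel reports, by reading /proc/cpuinfo. The exact contents of /proc/cpuinfo vary between kernels and devices, so a real check would need to be more defensive.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Return the ARM architecture version reported by the kernel, or 0 if it
// can't be determined. On ARM Linux/Android kernels, /proc/cpuinfo typically
// contains a line like "CPU architecture: 7".
static int GetARMArchVersion() {
  std::ifstream cpuinfo("/proc/cpuinfo");
  std::string line;
  while (std::getline(cpuinfo, line)) {
    if (line.compare(0, 16, "CPU architecture") == 0) {
      std::string::size_type colon = line.find(':');
      if (colon != std::string::npos) {
        return std::atoi(line.c_str() + colon + 1);
      }
    }
  }
  return 0;
}

int main() {
  int version = GetARMArchVersion();
  if (version >= 7) {
    std::cout << "ARMv7 (or newer) instruction set available\n";
  } else if (version == 6) {
    std::cout << "ARMv6 only\n";
  } else {
    std::cout << "couldn't determine ARM architecture version\n";
  }
  return 0;
}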

How is this relevant to Firefox Mobile? Currently the builds we’re producing only run on processors that support ARMv7. This is partially because we’ve been working on performance for quite a while, and it’s much harder to get acceptable performance on a slower processor, so targeting only faster processors makes sense. (This is the same reason that Chrome for Android only runs on Android phones running the latest version of Android.) It’s also partially because all modern JavaScript engines ship with a JIT, which is a highly specialized piece of code that needs to know intimate details about the type of processor it’s running on. We used to produce additional builds that supported ARMv6 alongside our ARMv7 builds, but we saw lots of ARMv6-specific crashes in our crash reporting system, and we didn’t have the resources to tackle them all. Additionally, we were focused on making Firefox Mobile run well on ARMv7 processors, so making it run well on ARMv6 seemed like a stretch at the time.

Coming back to the present, we’ve got a revitalized mobile team working on a revamped Firefox Mobile that’s much faster than previous versions, so the performance target seems much more within reach. We also had people attending MozCamps and other Mozilla events across the globe last year. Dietrich visited Nairobi for some Mozilla Kenya events and found that the most widely used Android phones in Kenya are all ARMv6 devices. In addition, there are lots of Android phones being sold in China that are ARMv6. Even in the USA there are some low-end Android devices being released that are still ARMv6, like the LG Optimus Hub, which shipped in October of 2011. As of that date, roughly 58% of the Android install base consisted of ARMv6 phones. That’s a huge segment of the market that we’re not supporting.

Because of this, during the Firefox Mobile revamp Doug roped me in and asked if I would look into getting our ARMv6 builds back up and running. I started working on it figuring it wouldn’t be too bad, since we used to produce these builds. As it turns out, I was wrong. We had managed to break things in quite a few ways since disabling those builds. A few of them were simple fixes in our build configuration (although one of those took Mike Hommey and me a solid week of debugging to track down), but I also ran into a few problems with our custom linker. Firefox Mobile ships with a replacement for the system dynamic linker on Android. It’s pretty complicated, but it’s the reason that Firefox only takes up about 15MB, whereas Chrome for Android takes up nearly 50MB after installation. Being a complicated piece of code, it had some hard-to-diagnose bugs. Thankfully, with some input from Jacob Bramley from ARM we were able to track down the remaining problem and get builds working again.

With all the setbacks and other issues it’s not unreasonable to ask why we’re doing this. Clearly this isn’t the end of the process by any means. We still have to get automated builds back up and running on our build farm. We will undoubtedly have to shake out more ARMv6-specific bugs in our JavaScript engine and elsewhere. We’ll almost assuredly have to do some work to make performance acceptable. It’s a lot of work and it will take time, but this seems like the right thing to do given the number of users we can reach. You can follow along in Bugzilla if you’re interested in this work.

Measuring UI Responsiveness

June 27th, 2011

One of the goals for the Firefox team is to ensure that the user interface remains responsive to input at all times. Clearly a responsive interface is incredibly important to making the browser a useful application, but how do we measure “responsiveness”?

Dietrich has done some work on this, writing an add-on that measures the time that various UI actions take. This covers the direct case, where a user initiates an action and expects a response in a reasonable amount of time. Clearly we want to make sure that individual actions don’t take an extraordinary amount of time.

I took the opposite tack, with an eye toward detecting when the application is not responsive to user input, regardless of what actions the user is taking. Building on some work by Chris Jones and Alon Zakai, I wrote some code that instruments the main thread event loop to find out how long it takes to respond to events, which ought to be a reasonable proxy for responsiveness. When the instrumentation detects that the event loop takes too long to respond (currently more than 50 milliseconds), it writes a data point to a log giving the current timestamp and the amount of time the event loop was unresponsive.
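To give a feel for the technique, here’s a simplified, self-contained sketch of the idea (this is not the instrumentation that landed in mozilla-central, and the queue here is just a stand-in for the real platform event loop): a background thread posts a timestamped “tracer” event every few milliseconds, and the main thread measures how late each one runs. Any delay over the threshold gets logged as an unresponsive period.

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

using Clock = std::chrono::steady_clock;

// A toy main-thread event queue standing in for the real widget event loop.
struct EventQueue {
  std::mutex lock;
  std::condition_variable cv;
  std::deque<std::function<void()>> events;

  void Post(std::function<void()> event) {
    {
      std::lock_guard<std::mutex> guard(lock);
      events.push_back(std::move(event));
    }
    cv.notify_one();
  }

  std::function<void()> WaitForEvent() {
    std::unique_lock<std::mutex> guard(lock);
    cv.wait(guard, [this] { return !events.empty(); });
    std::function<void()> event = std::move(events.front());
    events.pop_front();
    return event;
  }
};

int main() {
  EventQueue mainThread;
  const std::chrono::milliseconds kThreshold(50);

  // A background thread posts a "tracer" event every 10ms that carries the
  // time it was posted. If the main thread is busy, the event sits in the
  // queue and the measured delay grows.
  std::thread tracer([&mainThread, kThreshold] {
    for (;;) {
      Clock::time_point posted = Clock::now();
      mainThread.Post([posted, kThreshold] {
        auto delay = std::chrono::duration_cast<std::chrono::milliseconds>(
            Clock::now() - posted);
        if (delay > kThreshold) {
          std::printf("unresponsive for %lld ms\n",
                      static_cast<long long>(delay.count()));
        }
      });
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  });
  tracer.detach();

  // The "main thread": run events forever. Anything that blocks inside an
  // event handler here shows up as a logged unresponsive period.
  for (;;) {
    mainThread.WaitForEvent()();
  }
}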

When I implemented this I had my eye on Talos integration, where we could run the browser through some automated UI tests with this instrumentation enabled, then correlate “UI actions” with “unresponsive periods” and ensure that the browser did not become unresponsive during those actions. Talos integration has been pushed out as a longer-term goal; the more immediate goal is to find the UI actions that are the worst offenders for unresponsiveness. To that end we’ve filed some other bugs about correlating this unresponsiveness data with JavaScript execution, and correlating it with C++ execution. If you’ve got any ideas please feel free to contribute to those bugs!

If you’d like to try out the responsiveness instrumentation I implemented, it landed on mozilla-central a while ago, and there’s some reasonably complete documentation in the source code. There are implementations for Windows, Linux/GTK2 and OS X currently. (And a patch for an Android implementation in a bug.)

I was just looking at some data produced from our crash reporting system, and I continue to be amazed at the amount of third-party code that gets loaded into Firefox on Windows. That data file contains a list of all unique binary files (EXE or DLL) that were listed in Windows crash reports in a single day. A quick look at it shows:

$ cut -f1 -d, 20110613-modulelist.txt  | sort -u | wc -l
10385

There are over 10,000 unique filenames in a single day’s worth of crash reports. That sure seems like a lot! Now, certainly, a lot of these modules look like they’ve been randomly named, which probably indicates that they’re some kind of virus (0eYZf0QFDSGEAbTRWD3F.dll, for example), so those are likely to inflate the number. There’s a bug on file asking that we collect MD5 hashes of every DLL in our crash reports so we could more easily detect malware/virus DLLs that use these tactics, as well as integrate with lists of known malware and viruses from antivirus vendors.
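For illustration only (this is not what the crash reporter currently does, and the hashing idea is still just a proposal in that bug), computing an MD5 hash of a module file on Windows is straightforward with the CryptoAPI. The file path below is a placeholder, and error handling is abbreviated:

#include <windows.h>
#include <wincrypt.h>
#include <stdio.h>

// Compute the MD5 hash of a file and format it as a lowercase hex string.
// Returns false on any failure. Link with advapi32.lib.
static bool MD5File(const wchar_t* path, char out[33]) {
  HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
  if (file == INVALID_HANDLE_VALUE)
    return false;

  HCRYPTPROV prov = 0;
  HCRYPTHASH hash = 0;
  bool ok = false;
  if (CryptAcquireContextW(&prov, NULL, NULL, PROV_RSA_FULL,
                           CRYPT_VERIFYCONTEXT) &&
      CryptCreateHash(prov, CALG_MD5, 0, 0, &hash)) {
    BYTE buf[64 * 1024];
    DWORD read = 0;
    ok = true;
    // Feed the file contents through the hash in chunks.
    while (ReadFile(file, buf, sizeof(buf), &read, NULL) && read > 0) {
      if (!CryptHashData(hash, buf, read, 0)) {
        ok = false;
        break;
      }
    }
    if (ok) {
      BYTE digest[16];
      DWORD len = sizeof(digest);
      ok = CryptGetHashParam(hash, HP_HASHVAL, digest, &len, 0) != 0;
      if (ok) {
        for (DWORD i = 0; i < len; i++)
          sprintf(out + i * 2, "%02x", digest[i]);
      }
    }
  }
  if (hash) CryptDestroyHash(hash);
  if (prov) CryptReleaseContext(prov, 0);
  CloseHandle(file);
  return ok;
}

int main() {
  char hex[33];
  if (MD5File(L"C:\\Windows\\System32\\kernel32.dll", hex))
    printf("%s\n", hex);
  return 0;
}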

In the past, we have had problems with plugins and extensions causing crashes for many Firefox users. We have ways of mitigating those through blacklisting. We can also blacklist specific DLLs from loading in the Firefox process, which is not used as often because it’s harder to get right and provides little feedback to users about what’s been disabled. However, given the sheer number of possible things that can be loaded in our process, it’s unlikely that we’ll ever be able to block all software that causes crashes for users. This is unfortunate, because any one of these pieces of software can cause a crash in Firefox, and all the user sees is “Firefox crashed”. I suppose we now know how Microsoft feels when users blame Windows for crashes caused by faulty drivers.

moz-headless-screenshot

July 29th, 2010

On a personal project of mine, I have a need to generate thumbnails of web pages. Until recently my solution was to use Firefox running in Xvfb, a virtual X server. This is not an ideal solution, as it requires you to have lots of X client libraries installed on your server. Additionally, Firefox is not intended for this purpose, so there are lots of ways for things to go wrong.

Because of this, I’d been following Chris Lord’s work on the offscreen branch of Mozilla for some time, but never tried it out until recently. The offscreen branch provides a widget backend for Mozilla that can render web content to an offscreen buffer. Chris wrote it in support of Clutter, which is a pretty neat use case. Conveniently, he also provided a sample embedding client application called moz-headless-screenshot. This is a simple command line tool that takes a URL, image size, and output filename and generates a PNG screenshot of the webpage. This being exactly what I wanted, and having my poorly-written Firefox+Xvfb solution fall apart due to a server migration, I decided to give his solution a shot.

I hit a few speed bumps along the way, since there wasn’t much documentation to be found on actually building and using moz-headless-screenshot. I’ve attempted to fix this by providing detailed steps and a Makefile in my own moz-headless-screenshot repository. I’ve also modified the code slightly so that it’s easier to run (at least in my use case). I have heard from others over the years who have this same need, so hopefully someone else finds it useful!

Easy branch-landing of patches

December 2nd, 2009

I often find myself landing patches on our 1.9.2 and 1.9.1 branches, both of which are in Mercurial repositories. This generally involves getting a changeset from the mozilla-central repository into one of these repositories, and also amending the changeset message to include the name of the person who gave me approval to land. There was some discussion recently about how it’s kind of a pain to do that. I’ve cooked up an easy solution for my own needs; perhaps it will serve yours as well.

I use “hg transplant” to get changesets from one repository to another. This assumes that you have local clones of both the source repository (mozilla-central in this case) and the destination repository (mozilla-1.9.2 or 1.9.1, usually). Assuming you have both clones side-by-side in a directory, you can run the transplant command in the destination repository’s working directory like so (where xxx is the changeset identifier of the changeset you want transplanted; you can also specify more than one changeset):

hg transplant -s ../mozilla-central xxx

The transplant command conveniently includes a “--filter” option that will let you alter the commit message or patch while transplanting. This requires you to have some sort of script for transplant to run. Here’s what I’m using (on Linux):

#!/bin/sh
# Filter script for "hg transplant --filter". Transplant runs this script
# with the file containing the changeset message as $1. If APPEND is set,
# append its contents to the message; otherwise open an editor so the
# message can be edited by hand.

if test -n "$APPEND"; then
  echo " $APPEND" >> "$1"
else
  if test -n "$EDITOR"; then
    $EDITOR "$1"
  else
    editor "$1"
  fi
fi

Save this as “transplant.sh” somewhere (and ensure that it’s executable), then in your ~/.hgrc, add a section:

[transplant]
filter = /path/to/transplant.sh

Now, when you run “hg transplant”, by default it will open an editor so you can edit the commit message for each changeset and add approval information. Even better, if the “APPEND” environment variable is set, transplant.sh will append its contents to the message instead, so you can run transplant like so and add the approval information in one shot:

APPEND="a=someone" hg transplant -s ../mozilla-central xxx

I find that this saves me a bunch of time, so hopefully it’s useful to someone else!

Source Server, back on trunk

October 5th, 2009

Some time ago, Lukas Blakk implemented support for a source server on our Windows builds as a class project in Dave Humphrey’s class at Seneca College. Of course, soon after that we switched our main VCS from CVS to Mercurial, which broke all of her hard work. Thankfully, we got another one of Dave’s students, Jesse Valianes, to fix things to make it work with Mercurial. We landed his patch, but as it turns out we never enabled a setting on our build machines to make it actually work. However, when we finally tried to do so, I found out that another patch we had landed in the interim had broken things. I finally landed a fix for that, we flipped the setting back on, and today’s trunk build is source-enabled again.

If you have no idea what any of this means: you can now download a Windows nightly build, attach a debugger, have it automatically fetch the debug symbols from our symbol server, and the debugger will also download the matching source for you.

I hope to get this backported to our 1.9.2 and 1.9.1 branches ASAP, so that our 3.5.x and 3.6 release builds will be similarly debuggable.

Firefox Packaging

September 17th, 2009

I recently landed some changes (on trunk and 1.9.2) to the way Firefox packaging works. There are two immediate consequences of this you should be aware of:

  1. Mac builds now use a packaging manifest just like Windows and Linux. If you add a file that you intend to ship on Mac, it needs to wind up in a packaging manifest. (bug 463605)
  2. All the packaging manifest files have been combined into a single file: browser/installer/package-manifest.in. This should save everyone some time and annoyance. (bug 511642)

These changes had no effect on applications other than Firefox.

SSL in Mochitest

September 22nd, 2008

Without a lot of fanfare, a patch landed recently that enables the use of SSL with the test HTTP server we use in our Mochitest test harness.

About five months ago, I read an article about how Fedora wanted to standardize on NSS as the cryptography solution for their distro, in order to be able to leverage a common certificate database, among other things. The article went into detail on how they wrote an OpenSSL-compatible wrapper around NSS so they could easily port applications that only supported OpenSSL to use NSS instead. As a concrete example, they showed a ported version of stunnel using NSS. This caught my attention, as one of the things we were lacking in our Mochitest harness was SSL support, and stunnel would do exactly what we needed in this case. Considering that we already build and ship NSS with every copy of Firefox, and that it was clearly possible to implement the functionality we needed using NSS, I set out to figure out how to implement a bare-bones version of stunnel from scratch. After a bit of poking through the online NSPR and NSS documentation, I had a proof-of-concept application which I called “ssltunnel.” After some insightful review comments from NSS developers I committed it to CVS.
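To give a rough idea of what a bare-bones tunnel involves, here’s a heavily simplified sketch of the server side using NSPR and NSS. This is not the actual ssltunnel code: the certificate nickname, database path, and ports are placeholders, error handling is mostly omitted, and it only copies bytes in one direction, whereas a real tunnel polls and copies both ways.

#include <nspr.h>
#include <nss.h>
#include <cert.h>
#include <keyhi.h>
#include <pk11pub.h>
#include <ssl.h>

// Accept one SSL connection, decrypt it with NSS, and forward the plaintext
// to a local HTTP server. Build against NSPR and NSS.
int main() {
  // "sql:/path/to/certdb" and "mochitest-cert" are placeholders.
  if (NSS_Init("sql:/path/to/certdb") != SECSuccess)
    return 1;

  CERTCertificate* cert = PK11_FindCertFromNickname("mochitest-cert", NULL);
  SECKEYPrivateKey* key = cert ? PK11_FindKeyByAnyCert(cert, NULL) : NULL;
  if (!cert || !key)
    return 1;

  // Listen for the incoming (encrypted) connection.
  PRFileDesc* listenSock = PR_OpenTCPSocket(PR_AF_INET);
  PRNetAddr addr;
  PR_InitializeNetAddr(PR_IpAddrAny, 4443, &addr);
  PR_Bind(listenSock, &addr);
  PR_Listen(listenSock, 5);

  PRNetAddr clientAddr;
  PRFileDesc* client = PR_Accept(listenSock, &clientAddr,
                                 PR_INTERVAL_NO_TIMEOUT);

  // Layer SSL on top of the accepted TCP socket and configure it as a
  // server using our certificate and key.
  PRFileDesc* ssl = SSL_ImportFD(NULL, client);
  SSL_ConfigSecureServer(ssl, cert, key, NSS_FindCertKEAType(cert));
  SSL_OptionSet(ssl, SSL_HANDSHAKE_AS_SERVER, PR_TRUE);
  SSL_ResetHandshake(ssl, PR_TRUE);

  // Connect to the plain HTTP server that Mochitest runs.
  PRFileDesc* server = PR_OpenTCPSocket(PR_AF_INET);
  PRNetAddr serverAddr;
  PR_InitializeNetAddr(PR_IpAddrLoopback, 8888, &serverAddr);
  PR_Connect(server, &serverAddr, PR_INTERVAL_NO_TIMEOUT);

  // Copy decrypted bytes from the SSL side to the HTTP server.
  char buf[4096];
  PRInt32 n;
  while ((n = PR_Read(ssl, buf, sizeof(buf))) > 0) {
    PR_Write(server, buf, n);
  }

  PR_Close(server);
  PR_Close(ssl);
  PR_Close(listenSock);
  NSS_Shutdown();
  return 0;
}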

Unfortunately, that wasn’t the end. We still needed to hook this program up to the test harness, and I just didn’t have the motivation to do so. I filed the bug and hoped someone else would do the work (as I often do!). Thankfully, that someone appeared in the person of Honza Bambas, whom I can only describe as a “programming rockstar.” He not only integrated ssltunnel into Mochitest, he also rewrote large sections of it to make it more robust, and he made it work as an HTTP proxy while he was at it. After some reviews, a couple of landings and backouts due to unrelated test failures, and some time spent languishing in Bugzilla, we finally made his patch stick.

Of course, now that we have this capability, we need tests to use it! Honza has written some great documentation on what is currently available via Mochitest, and on how to add custom servers, certificates, and other things you might want. If you get motivated to write some tests and hit a rough spot, feel free as always to track me down on IRC and ask me about it.

MochiTest Maker

April 18th, 2008

Just something I threw together this morning: MochiTest Maker. It’s a pure HTML+JavaScript environment for writing MochiTests. It’s not as full-featured as the real MochiTest, as you can’t set HTTP headers or include external files, but it should serve for a lot of simple web content tests.

Ideally at some point I’d like to add a CGI backend to this so you could specify a directory, and have it generate a patch against current CVS to include your test in that directory. That would lower the bar even further for getting new tests into the tree. Another cool addition would be to integrate this with my regression search buildbot (currently offline), so that you could write a mochitest and then with one click submit it to find out when something regressed. That shouldn’t be hard to do, but my buildbot needs to find a more permanent home first.

I think there’s still a lot more we can (and must) do to lower the bar for writing tests. We need all the tests we can get!