01
Nov 19

evaluating bazel for building firefox, part 2

In our last post, we highlighted some of the advantages that Bazel would bring.  The remote execution and caching benefits Bazel bring look really attractive, but it’s difficult to tell exactly how much they would benefit Firefox.  I looked for projects that had switched to Bazel, and a brief summary of each project’s experience is written below.

The Bazel rules for nodejs highlight Dataform’s switch to Bazel, which took about 2 months.  Their build involves some combination of “NPM packages, Webpack builds, Node services, and Java pipelines”. Switching plus enabling remote caching reduced the average time for a build in CI from 30 minutes to 5 minutes; incremental builds for local development have been “reduced to seconds from minutes”.  It’s not clear whether the local development experience is also hooked up to the caching infrastructure as well.

Pinterest recently wrote about their switch to Bazel for iOS.  While they call out remote caching leading to “build times [dropping] under a minute and as low as 30 seconds”, they state their “time to land code” only decreased by 27%.  I wasn’t sure how to reconcile such fast builds with (relatively) modest decreases in CI time.  Tests have gotten a lot faster, given that test results can be cached and reused if the tests in question have their transitive dependencies unchanged.

One of the most complete (relatively speaking) descriptions I found was Redfin’s switch from Maven to Bazel, for building a large amount of JavaScript modules and Java code, nearly 30,000 files in all.  Their CI builds went from 40-90 minutes to 5-6 minutes; in fairness, it must be mentioned that their Maven builds were not parallelized (for correctness reasons) whereas their Bazel builds were.  But it’s worth highlighting that they managed to do this incrementally, by generating Bazel build definitions from their Maven ones, and that the quoted build times did not enable caching.  The associated tech talk slides/video indicates builds would be roughly in the 1-2 minute range with caching, although they hadn’t deployed that yet.

None of the above accounts talked about how long the conversion took, which I found peculiar.  Both Pinterest and Redfin called out how much more reliable their builds were once they switched to Bazel; Pinterest said, “we haven’t performed a single clean build on CI in over a year.”

In some negative results, which are helpful as well, Dropbox wrote about evaluating Bazel for their Android builds.  What’s interesting here is that other parts of Dropbox are heavily invested in Bazel, so there’s a lot of in-house experience, and that Bazel was significantly faster than their current build system (assuming caching was turned on; Bazel was significantly slower for clean builds without caching).  Yet Dropbox decided to not switch to Bazel due to tooling and development experience concerns.  They did leave open the possibility of switching in the future once the ecosystem matures.

The oddly-named Bazel Fawlty describes a conversion to Bazel from Go’s native tooling, and then a switch back after a litany of problems, including slower builds (but faster tests), a poor development experience (especially on OS X), and various things not being supported in Bazel leading to the native Go tooling still being required in some cases.  This post was also noteworthy for noting the amount of porting effort required to switch: eight months plus “many PR’s accepted into the bazel go rules git repo”.  I haven’t used Go, but I’m willing to discount some of the negative experience here due to the native Go tools being so good.

Neither one of these negative experiences translate exactly to Firefox: different languages/ecosystems, different concerns, different scales.  But both of them cite the developer experience specifically, suggesting that not only is there a large investment required to actually do the switchover, but you also need to write tooling around Bazel to make it more convenient to use.

Finally, a 2018 BazelCon talk discusses two Google projects that made the switch to Bazel and specifically to use remote caching and remote execution on Google’s public-facing cloud infrastructure: Android Studio and TensorFlow.  (You may note that this is the first instance where somebody has called out supporting remote execution as part of the switch; I think that implies getting a build to the point of supporting remote execution is more complicated than just supporting remote caching, which makes a certain amount of sense.)  Android Studio increased their test presubmit coverage by 4x, presumably by being able to run more than 4x test jobs than previously due to remote execution.  In the same vein, TensorFlow decreased their build and test times by 80%, and they could use significantly less powerful machines to actually run the builds, given that large machines in the cloud were doing the actual heavy lifting.

Unfortunately, I don’t think expecting those same reductions in test time, were Firefox to switch to Bazel, is warranted.  I can’t speak to Android Studio, but TensorFlow has a number of unit tests whose test results can be cached.  In the Firefox context, these would correspond to cppunittests, which a) we don’t have that many of and b) don’t take that long to run.  The bulk of our tests depend in one way or another on kitchen-sink-style artifacts (e.g. libxul, the JS shell, omni.ja) which essentially depend on everything else.  We could get some reductions for OS-specific modifications; Windows-specific changes wouldn’t require re-running OS X tests, for instance, but my sense is that these sorts of changes are not common enough to lead to an 80% reduction in build + test time.  I suppose it’s also possible that we could teach Bazel that e.g. devtools changes don’t affect, say, non-devtools mochitests/reftests/etc. (presumably?), which would make more test results cacheable.

I want to believe that Bazel + remote caching (+ remote execution if we could get there) will bring Firefox build (and maybe even test) times down significantly, but the above accounts don’t exactly move the needle from belief to certainty.


28
Oct 19

evaluating bazel for building firefox, part 1

After the Whistler All-Hands this past summer, I started seriously looking at whether Firefox should switch to using Bazel for its build system.

The motivation behind switching build systems was twofold.  The first motivation was that build times are one of the most visible developer-facing aspects of the build system and everybody appreciates faster builds.  What’s less obvious, but equally important, is that making builds faster improves automation: less time waiting for try builds, more flexibility to adjust infrastructure spending, and less turnaround time with automated reviews on patches submitted for review.  The second motivation was that our build system is used by exactly one project (ok, two projects), so there’s a lot of onboarding cost both in terms of developers who use the build system and in terms of developers who need to develop the build system.  If we could switch to something more off-the-shelf, we could improve the onboarding experience and benefit from work that other parties do with our chosen build system.

You may have several candidates that we should have evaluated instead.  We did look at other candidates (although perhaps none so deeply as Bazel), and all of them have various issues that make them unsuitable for a switch.  The reasons for rejecting other possibilities fall into two broad categories: not enough platform support (read: Windows support) and unlikely to deliver on making builds faster and/or improving the onboarding/development experience.  I’ll cover the projects we looked at in a separate post.

With that in mind, why Bazel?

Bazel advertises itself with the tagline “{Fast, Correct} – Choose two”.  What’s sitting behind that tagline is that when building software via, say, Make, it’s very easy to write Makefiles in such a way that builds are fast, but occasionally (or not-so-occasionally) fail because somebody forgot to specify “to build thing X, you need to have built thing Y”.  The build doesn’t usually fail because thing Y is built before thing X: maybe the scheduling algorithm for parallel execution in make chooses to build Y first 99.9% of the time, and 99% of those times, building Y finishes prior to even starting to build X.

The typical solution is to become more conservative in how you build things such that you can be sure that Y is always built before X…but typically by making the dependency implicit by, say, ordering the build commands Just So, and not by actually making the dependency explicit to make itself.  Maybe specifying the explicit dependency is rather difficult, or maybe somebody just wants to make things work.  After several rounds of these kind of fixes, you wind up with Makefiles that are (probably) correct, but probably not as fast as it could be, because you’ve likely serialized build steps that could have been executed in parallel.  And untangling such systems to the point that you can properly parallelize things and that you don’t regress correctness can be…challenging.

(I’ve used make in the above example because it’s a lowest-common denominator piece of software and because having a concrete example makes differentiating between “the software that runs the build” and “the specification of the build” easier.  Saying “the build system” can refer to either one and sometimes it’s not clear from context which is in view.  But you should not assume that the problems described above are necessarily specific to make; the problems can happen no matter what software you rely on.)

Bazel advertises a way out of the quagmire of probably correct specifications for building your software.  It does this—at least so far as I understand things, and I’m sure the Internet will come to correct me if I’m wrong—by asking you to explicitly specify dependencies up front.  Build commands can then be checked for correctness by executing the commands in a “sandbox” containing only those files specified as dependencies: if you forgot to specify something that was actually needed, the build will fail because the file(s) in question aren’t present.

Having a complete picture of the dependency graph enables faster builds in three different ways.  The first is that you can maximally parallelize work across the build.  The second is that Bazel comes with built-in facilities for farming out build tasks to remote machines.  Note that all build tasks can be distributed, not just C/C++/Rust compilation as via sccache.  So even if you don’t have a particularly powerful development machine, you can still pretend that you have a large multi-core system at your disposal.  The third is that Bazel also comes with built-in facilities for aggressive caching of build artifacts.  Again, like remote execution, this caching applies across all build tasks, not just C/C++/Rust compilation.  In Firefox development terms, this is Firefox artifact builds done “correctly”: given appropriate setup, your local build would simply download whatever was appropriate for the changes in your current local tree and rebuild the rest.

Having a complete picture of the dependency graph enables a number of other nifty features.  Bazel comes with a query language for the dependency graph, enabling you to ask questions like “what jobs need to run given that these files changed?”  This sort of query would be valuable for determining what jobs to run in automation; we have a half-hearted (and hand-updated) version of this in things like files-changed in Taskcluster job specifications.  But things like “run $OS tests for $OS-only changes” or “run just the mochitest chunk that contains the changed mochitest” become easy.

It’s worth noting here that we could indeed work towards having the entire build graph available all at once in the current Firefox build system.  And we have remote execution and caching abilities via sccache, even moreso now that sccache-dist is being deployed in Mozilla offices.  We think we have a reasonable idea of what it would take to work towards Bazel-esque capabilities with our current system; the question at hand is how a switch to Bazel compares to that and whether a switch would be more worthwhile for the health of the Firefox build system over the long term.  Future posts are going to explore that question in more detail.