In our last post, we highlighted some of the advantages that Bazel would bring. The remote execution and caching benefits Bazel brings look really attractive, but it’s difficult to tell exactly how much they would benefit Firefox. I looked for projects that had switched to Bazel, and a brief summary of each project’s experience is written below.
The Bazel rules for nodejs highlight Dataform’s switch to Bazel, which took about 2 months. Their build involves some combination of “NPM packages, Webpack builds, Node services, and Java pipelines”. Switching plus enabling remote caching reduced the average time for a build in CI from 30 minutes to 5 minutes; incremental builds for local development have been “reduced to seconds from minutes”. It’s not clear whether the local development experience is hooked up to the caching infrastructure as well.
Pinterest recently wrote about their switch to Bazel for iOS. While they call out remote caching leading to “build times [dropping] under a minute and as low as 30 seconds”, they state their “time to land code” only decreased by 27%. I wasn’t sure how to reconcile such fast builds with (relatively) modest decreases in CI time. Tests have gotten a lot faster, given that test results can be cached and reused if the tests in question have their transitive dependencies unchanged.
One of the most complete (relatively speaking) descriptions I found was Redfin’s switch from Maven to Bazel, for building a large number of JavaScript modules and Java code, nearly 30,000 files in all. Their CI builds went from 40-90 minutes to 5-6 minutes; in fairness, it must be mentioned that their Maven builds were not parallelized (for correctness reasons) whereas their Bazel builds were. But it’s worth highlighting that they managed to do this incrementally, by generating Bazel build definitions from their Maven ones, and that the quoted build times were measured without caching enabled. The associated tech talk slides/video indicate builds would be roughly in the 1-2 minute range with caching, although they hadn’t deployed that yet.
Apart from Dataform’s, none of the above accounts talked about how long the conversion took, which I found peculiar. Both Pinterest and Redfin called out how much more reliable their builds were once they switched to Bazel; Pinterest said, “we haven’t performed a single clean build on CI in over a year.”
In some negative results, which are helpful as well, Dropbox wrote about evaluating Bazel for their Android builds. What’s interesting here is that other parts of Dropbox are heavily invested in Bazel, so there’s a lot of in-house experience, and that Bazel was significantly faster than their current build system (assuming caching was turned on; Bazel was significantly slower for clean builds without caching). Yet Dropbox decided to not switch to Bazel due to tooling and development experience concerns. They did leave open the possibility of switching in the future once the ecosystem matures.
The oddly-named Bazel Fawlty describes a conversion to Bazel from Go’s native tooling, and then a switch back after a litany of problems, including slower builds (but faster tests), a poor development experience (especially on OS X), and various things not being supported in Bazel leading to the native Go tooling still being required in some cases. This post was also noteworthy for spelling out the amount of porting effort required to switch: eight months plus “many PR’s accepted into the bazel go rules git repo”. I haven’t used Go, but I’m willing to discount some of the negative experience here due to the native Go tools being so good.
Neither one of these negative experiences translates exactly to Firefox: different languages/ecosystems, different concerns, different scales. But both of them cite the developer experience specifically, suggesting that not only is there a large investment required to actually do the switchover, but you also need to write tooling around Bazel to make it more convenient to use.
Finally, a 2018 BazelCon talk discusses two Google projects that made the switch to Bazel and specifically to use remote caching and remote execution on Google’s public-facing cloud infrastructure: Android Studio and TensorFlow. (You may note that this is the first instance where somebody has called out supporting remote execution as part of the switch; I think that implies getting a build to the point of supporting remote execution is more complicated than just supporting remote caching, which makes a certain amount of sense.) Android Studio increased their test presubmit coverage by 4x, presumably by being able to run more than 4x as many test jobs as before thanks to remote execution. In the same vein, TensorFlow decreased their build and test times by 80%, and they could use significantly less powerful machines to actually run the builds, given that large machines in the cloud were doing the actual heavy lifting.
Unfortunately, I don’t think expecting those same reductions in test time, were Firefox to switch to Bazel, is warranted. I can’t speak to Android Studio, but TensorFlow has a number of unit tests whose test results can be cached. In the Firefox context, these would correspond to cppunittests, which a) we don’t have that many of and b) don’t take that long to run. The bulk of our tests depend in one way or another on kitchen-sink-style artifacts (e.g. libxul, the JS shell, omni.ja) which essentially depend on everything else. We could get some reductions for OS-specific modifications; Windows-specific changes wouldn’t require re-running OS X tests, for instance, but my sense is that these sorts of changes are not common enough to lead to an 80% reduction in build + test time. I suppose it’s also possible that we could teach Bazel that e.g. devtools changes don’t affect, say, non-devtools mochitests/reftests/etc. (presumably?), which would make more test results cacheable.
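A toy model of the kitchen-sink problem, sketched in Python, may help show why the cache would miss so often. The graph below is hypothetical and heavily simplified; it is not Firefox's real dependency structure, and the target names are only borrowed for illustration. The point is that once a test depends on libxul or omni.ja, a change to almost any component invalidates its cached result.

```python
# Hypothetical, heavily simplified graph: target -> direct dependencies.
DEPS = {
    "mfbt": [],
    "devtools": [],
    "gfx": ["mfbt"],
    "dom": ["mfbt"],
    "libxul": ["mfbt", "gfx", "dom"],  # kitchen-sink artifact
    "omni.ja": ["devtools"],           # kitchen-sink artifact
    "test_mfbt_cppunit": ["mfbt"],
    "mochitest_dom": ["libxul", "omni.ja"],
    "reftest_gfx": ["libxul", "omni.ja"],
    "mochitest_devtools": ["libxul", "omni.ja"],
}

TESTS = ["test_mfbt_cppunit", "mochitest_dom", "reftest_gfx", "mochitest_devtools"]


def depends_on(target, changed, deps):
    """True if `changed` is reachable from `target` in the dependency graph."""
    stack = [target]
    seen = set()
    while stack:
        node = stack.pop()
        if node == changed:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps[node])
    return False


def invalidated_tests(changed, deps, tests):
    """Tests whose cached results can no longer be reused after `changed` is edited."""
    return [test for test in tests if depends_on(test, changed, deps)]


# A gfx-only change still invalidates every browser-level test via libxul.
after_gfx_change = invalidated_tests("gfx", DEPS, TESTS)
# A devtools-only change does the same via omni.ja.
after_devtools_change = invalidated_tests("devtools", DEPS, TESTS)
```

In this model, editing `gfx` or `devtools` invalidates all three browser-level tests and spares only the cppunittest. Teaching the build that devtools changes cannot affect non-devtools tests would amount to removing or narrowing the `omni.ja` edge for those tests.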
I want to believe that Bazel + remote caching (+ remote execution if we could get there) will bring Firefox build (and maybe even test) times down significantly, but the above accounts don’t exactly move the needle from belief to certainty.
Bazel Fawlty is named after the main character from Fawlty Towers, an English sitcom from the 1970s: https://en.wikipedia.org/wiki/Fawlty_Towers
Recent discussion on LWN is interesting, and mostly negative:
https://lwn.net/Articles/802541/
I don’t find it surprising that uncached/non-remote-executed builds are significantly slower with Bazel: all that isolation and complete dependency specification enforcement must have quite an overhead. I don’t know how exactly they do it (chroot, symlinks, copying?) but it must be significant, especially on something like Windows.