After the Whistler All-Hands this past summer, I started seriously looking at whether Firefox should switch to using Bazel for its build system.
The motivation for switching build systems was twofold. First, build times are one of the most visible developer-facing aspects of the build system, and everybody appreciates faster builds. What’s less obvious, but equally important, is that making builds faster improves automation: less time waiting for try builds, more flexibility to adjust infrastructure spending, and faster turnaround on automated checks for patches submitted for review. Second, our build system is used by exactly one project (ok, two projects), so there’s a lot of onboarding cost, both for developers who use the build system and for developers who need to work on the build system itself. If we could switch to something more off-the-shelf, we could improve the onboarding experience and benefit from work that other parties do with our chosen build system.
You may have in mind several candidates that we should have evaluated instead. We did look at other candidates (although perhaps none so deeply as Bazel), and all of them have various issues that make them unsuitable for a switch. The reasons for rejecting the other possibilities fall into two broad categories: not enough platform support (read: Windows support), and being unlikely to deliver on making builds faster and/or improving the onboarding/development experience. I’ll cover the projects we looked at in a separate post.
With that in mind, why Bazel?
Bazel advertises itself with the tagline “{Fast, Correct} – Choose two”. What’s sitting behind that tagline is that when building software via, say, Make, it’s very easy to write Makefiles in such a way that builds are fast, but occasionally (or not-so-occasionally) fail because somebody forgot to specify “to build thing X, you need to have built thing Y”. The build usually doesn’t fail, because thing Y usually does get built before thing X: maybe the scheduling algorithm for parallel execution in make chooses to build Y first 99.9% of the time, and 99% of those times, building Y finishes prior to even starting to build X.
The typical solution is to become more conservative in how you build things, such that you can be sure that Y is always built before X…but typically by making the dependency implicit, say, by ordering the build commands Just So, and not by actually making the dependency explicit to make itself. Maybe specifying the explicit dependency is rather difficult, or maybe somebody just wants to make things work. After several rounds of these kinds of fixes, you wind up with Makefiles that are (probably) correct, but probably not as fast as they could be, because you’ve likely serialized build steps that could have been executed in parallel. And untangling such systems to the point where you can properly parallelize things without regressing correctness can be…challenging.
(I’ve used make in the above example because it’s a lowest-common-denominator piece of software and because having a concrete example makes differentiating between “the software that runs the build” and “the specification of the build” easier. Saying “the build system” can refer to either one, and sometimes it’s not clear from context which is in view. But you should not assume that the problems described above are necessarily specific to make; the problems can happen no matter what software you rely on.)
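To make that failure mode concrete, here’s a minimal sketch (all file names here are made up): gen.h is produced by a generator, app.o actually needs it, but the Makefile never says so.

```make
all: app

gen.h: gen.py
	python gen.py > gen.h

# BUG: app.o really needs gen.h, but that dependency is not declared,
# so under `make -jN` this compile can start before gen.h exists and
# fail -- or not, depending on scheduling luck.
app.o: app.c
	$(CC) -c app.c -o app.o

app: app.o
	$(CC) app.o -o app
```

The proper fix is the one-liner `app.o: app.c gen.h`; the conservative fix is to force gen.h to be built in an earlier, serialized phase, which is exactly the kind of lost parallelism described above.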
Bazel advertises a way out of the quagmire of probably correct specifications for building your software. It does this—at least so far as I understand things, and I’m sure the Internet will come to correct me if I’m wrong—by asking you to explicitly specify dependencies up front. Build commands can then be checked for correctness by executing the commands in a “sandbox” containing only those files specified as dependencies: if you forgot to specify something that was actually needed, the build will fail because the file(s) in question aren’t present.
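As a sketch of what that looks like in practice (target and file names here are hypothetical):

```python
# BUILD file: every input to every step must be declared.
cc_library(
    name = "y",
    srcs = ["y.cc"],
    hdrs = ["y.h"],
)

cc_binary(
    name = "x",
    srcs = ["x.cc"],  # x.cc does `#include "y.h"`
    # Omit this deps line and the sandboxed compile of x.cc fails
    # outright, because y.h is simply not present in the sandbox.
    deps = [":y"],
)
```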
Having a complete picture of the dependency graph enables faster builds in three different ways. The first is that you can maximally parallelize work across the build. The second is that Bazel comes with built-in facilities for farming out build tasks to remote machines. Note that all build tasks can be distributed, not just C/C++/Rust compilation as with sccache. So even if you don’t have a particularly powerful development machine, you can still pretend that you have a large multi-core system at your disposal. The third is that Bazel also comes with built-in facilities for aggressive caching of build artifacts. Again, as with remote execution, this caching applies across all build tasks, not just C/C++/Rust compilation. In Firefox development terms, this is Firefox artifact builds done “correctly”: given appropriate setup, your local build would simply download whatever was appropriate for the changes in your current local tree and rebuild the rest.
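For a sense of what enabling that looks like, a .bazelrc sketch (the flags are real Bazel flags; the endpoints are made up):

```
build --remote_cache=grpcs://cache.example.com    # reuse artifacts built elsewhere
build --remote_executor=grpcs://exec.example.com  # run build actions on remote workers
build --jobs=200                                  # parallelism beyond your local cores
```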
Having a complete picture of the dependency graph enables a number of other nifty features. Bazel comes with a query language for the dependency graph, enabling you to ask questions like “what jobs need to run given that these files changed?” This sort of query would be valuable for determining what jobs to run in automation; we have a half-hearted (and hand-updated) version of this in things like files-changed in Taskcluster job specifications. But things like “run $OS tests for $OS-only changes” or “run just the mochitest chunk that contains the changed mochitest” become easy.
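For instance, something along these lines (the target names are hypothetical, not real mozilla-central targets):

```shell
# Everything a given library depends on:
bazel query 'deps(//gfx/layers:layers)'

# The reverse question: which targets are affected if this file changes?
bazel query 'rdeps(//..., //dom/media:decoder.cpp)'

# Narrowed to just the affected test targets:
bazel query 'kind(".*_test", rdeps(//..., //dom/media:decoder.cpp))'
```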
It’s worth noting here that we could indeed work towards having the entire build graph available all at once in the current Firefox build system. And we have remote execution and caching abilities via sccache, even more so now that sccache-dist is being deployed in Mozilla offices. We think we have a reasonable idea of what it would take to work towards Bazel-esque capabilities with our current system; the question at hand is how a switch to Bazel compares to that, and whether a switch would be more worthwhile for the health of the Firefox build system over the long term. Future posts are going to explore that question in more detail.
One additional point: Bazel would also subsume the “./mach bootstrap” portion of the build, since its dependency graph extends down to toolchains and compilers. The only exceptions would be iOS and macOS builds, since those require local system libraries. You can also bring your own Clang + linker etc. This is what Chrome does with GN (and it’s fairly easy to port a GN build to Bazel). I think Chrome itself is not built with Bazel/Blaze because Bazel didn’t support Windows at the time.
On the output side, Bazel’s dependency graph also extends very far: to test results, as you imply, but also to things like Docker containers and installers.
Take a look at this to see it combining languages and test results:
https://github.com/sayrer/bazel-lesson-2
For clarification, does “toolchains and compilers” include things like libraries (e.g. PulseAudio or GTK+ under Linux)? I know for something like Chrome, this doesn’t matter so much, but Firefox doesn’t bundle all its dependencies into a monorepo.
Having tests in the dependency graph looked very interesting, but AFAICT it wasn’t that useful for Firefox because many of the tests that we care about aren’t unit tests, but integration tests. And our integration tests wouldn’t benefit so much. 🙁
The “monorepo” question is sort of tangential. Bazel is pretty agnostic about how you split up your repos. It can refer to rules in remote repositories and download things. For example, my repo downloads GTest: https://github.com/sayrer/bazel-lesson-2/blob/master/WORKSPACE#L98
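(Roughly like this, in a WORKSPACE file; the exact version pinned here is just for illustration:)

```python
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Bazel downloads and caches the archive; targets can then depend on
# labels like @gtest//:gtest as if they lived in the local tree.
http_archive(
    name = "gtest",
    urls = ["https://github.com/google/googletest/archive/release-1.8.1.tar.gz"],
    strip_prefix = "googletest-release-1.8.1",
)
```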
I think what you’re intending to get at is whether Bazel makes sense if Firefox is built with an unspecified, Debian-supplied libpng, or something like that. I don’t think it does, but I also think Firefox should not be built this way during development.
The question of whether integration tests can benefit depends on how precisely their dependencies can be calculated. At the limit, it seems unlikely that a check-in to macOS keychain code would cause a JS integration test on Linux to fail. If there are a lot of integration tests with woolly dependencies, I think you might want to go with running tests that are likely to fail or have recently failed. Then, do full (cached…) builds and test runs in the background.
Bazel has to make sense when we’re building with system libraries because some people (i.e. Linux distributors) actually build things that way. So we need to figure out how to make things work for them as well.
As far as integration tests go, the problem was that almost everything ultimately winds up depending on libxul, which depends on all the compiled code in the repo. I think a lot of JS testing has the same sort of issue with depending on the JS shell. You can argue that we should be building differently, or that our tests should be structured differently, but that is a separate issue.
I agree that it should be possible to build with system libraries, but I do not agree it should be in the per-commit developer workflow. Exposing developers to that by default means poor caching and “works on my machine” problems. Let the average workflow run on predictable libraries, compilers, linkers, and sysroots.
As for the tests, I recall that Firefox had a way to build in “very shared” mode, where libxul was not a monolith. I’d suggest running integration tests against those, and also running them against static builds in the background.
Oh, I see what you mean as far as bootstrapped libraries go. Sure, we could do that.
Non-monolithic libxul has not been a thing for a long time, for better or for worse.
Weren’t we going to switch to tup at some point? What happened to that?
I’m not very familiar with Bazel, but from your description it sounds like you can run into the problem of specifying dependencies that are not actually needed, hurting parallelism. I would really like a build system that uses strace or equivalent to monitor what the compiler is doing and uses that to dynamically prune the dependencies based on the specific build environment/flags in your build. The first build with a new environment/flags might be slow, but incremental builds should be maximally parallelized.
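(A rough sketch of the kind of observation being proposed here; file names are hypothetical:)

```shell
# Record every file the compiler actually opens during one compile:
strace -f -e trace=openat -o cc.trace cc -c x.cpp

# The headers that were really read form the pruned dependency set:
grep -o '"[^"]*\.h"' cc.trace | sort -u
```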
“it sounds like you can run into the problem of specifying dependencies that are not actually needed”
I don’t think this is correct. For example, if you have a repo with Go and Rust targets, it won’t even download any Rust stuff if you’re just building Go stuff, let alone build it.
I do not see why Mozilla would choose anything other than Bazel right now. Better options may emerge in the future, and Tup has some nice properties, but it’s not better.
Here’s the best paper I know of:
https://www.microsoft.com/en-us/research/uploads/prod/2018/03/build-systems-final.pdf
Can’t you run into the problem where C/C++ dependencies are overspecified? ISTR seeing code in Bazel specifically to prune overzealous dependency declarations for C/C++.
Do you mean something like unnecessary includes or linked libraries? I am not sure how Bazel deals with that. I think it doesn’t penalize you for depending on libraries you don’t actually use, and that does sort of bug me. But I think I have seen language-specific tools that check for this.
What it does take care of is undeclared dependencies that are linked or included. This is the more powerful property, as it avoids “clean” build steps.
No, I mean that in Bazel you can say x.cpp depends on x.h, y.h, and z.h, when in reality x.cpp compiles just fine without the presence of z.h. And that would lead to spurious rebuilds, at least (although maybe Bazel would detect that the preprocessed source hasn’t changed?).
You should read the Microsoft paper. You’re discussing what’s termed “cutoff” in it.
Yeah that’s what I meant. strace’ing the compilation would reveal that the compiler never touches z.h and so the next time around the build system could prune that dependency.
Bazel simply runs $(CC) -M to ask the compiler to output the list of files that the compilation would include through the preprocessor. https://gcc.gnu.org/onlinedocs/gcc/Preprocessor-Options.html
Note that tricks like scanning the file for #include yourself don’t work in general, because arbitrary Turing-complete (yes, really) preprocessor logic may be used to compute the string that is then included. Bazel only does this to prune out dependencies that are declared but unnecessary (library A depends on library B, which exports a great many headers, but A only uses a few of them); it never discovers new dependencies that were not declared.
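For illustration (file names hypothetical), the preprocessor’s answer comes back as a make-style rule:

```shell
$ cc -M x.cpp
x.o: x.cpp x.h y.h
```

If z.h was declared as a dependency but never shows up in that list, Bazel can drop it from the action’s inputs, so touching z.h no longer triggers a spurious rebuild of x.o.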
Note that Bazel also has a couple of issues:
1. Being Java-based means a substantial increase in the amount of software that people will have to install before they can build anything.
2. The Bazel ‘BUILD’ language itself has odd limitations in terms of what you can or can’t express, such as which boolean operators are supported in select() rules, which matters since select() is the only thing that can consume a config_setting(). Before considering Bazel for mozilla-central, you might want to take a part of the current build system that does nontrivial things conditionally on configuration settings and see how that could be expressed in Bazel BUILD. See https://docs.bazel.build/versions/master/configurable-attributes.html and specifically the ‘OR chaining’ and ‘AND chaining’ paragraphs, sketched below.
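For reference, the workarounds described in those paragraphs look roughly like this (the config_setting names are hypothetical; selects comes from bazel-skylib):

```python
load("@bazel_skylib//lib:selects.bzl", "selects")

# select() has no boolean operators, so OR-ing two settings means either
# duplicating the value under both keys or using skylib's with_or:
cc_library(
    name = "widget",
    srcs = ["widget.cc"],
    copts = selects.with_or({
        (":linux", ":macos"): ["-DPOSIX"],
        "//conditions:default": [],
    }),
)

# AND-ing settings requires deriving an entirely new config_setting:
selects.config_setting_group(
    name = "linux_opt",
    match_all = [":linux", ":opt_build"],
)
```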
It comes with a Java runtime bundled so you don’t have to install anything separately before using it.
Please don’t, at least not yet.
Bazel is a nightmare to package, which means that getting newer Firefox into distributions will be complicated.
Have you considered alternatives, such as Meson+Ninja?
Meson and Ninja were considered, yes, but rejected because neither one improves sufficiently on the status quo.
We are aware of the contortions that e.g. TensorFlow has had to go through to get packaged in Debian; we would rather not go through the same sort of gymnastics for Firefox. 🙂
I wonder if you’ve looked at build2[1]? It seems to tick most of your checkboxes. In particular, it has first-class Windows support and, in the upcoming release, even native Clang-targeting-MSVC support (not the clang-cl wrapper)[2] which I believe is important to you. While some of the features you are looking for (like faster build times which I assume means distributed compilation and caching) are not there yet, we are working on them (and would welcome help)[3]. If you have or would like to check build2 out and have any questions, I would be happy to help (boris@build2.org).
[1] https://build2.org
[2] https://stage.build2.org/?builds=&cf=windows_10-clang*
[3] https://github.com/build2/README/blob/master/roadmap.md
I looked briefly at build2. It didn’t seem to have the feature set that we wanted, and the insistence on canonical project structures is a bit limiting when our current project structure and the structure of imported third-party libraries don’t line up with build2’s ideas of what projects should look like. It also wasn’t clear to me whether build2 really supports compiling for multiple architectures simultaneously.
Thanks for the reply. build2 doesn’t really insist on a canonical project structure, rather merely suggests it for new projects. It is able to handle quite a wide variety of structures, in fact[1]. And it does support compiling multiple configurations (which can use different compilers/targets/etc) simultaneously — that was one of the goals from the outset. I would also like to hear what other features you found missing, if you have that information handy (we do need a proper “features” page, I must admit).
[1] https://github.com/build2-packaging/
We’ve been using Bazel in the DeepSpeech project for a couple of years now. (Not because we chose to, but because TensorFlow uses it and Bazel has a world-consuming appetite by design.)
I predict the result of this evaluation will be that Bazel is not suitable for Firefox. It works great if, and only if, your entire world is captured in its graph. If you need a special case for e.g. distro builds, you’ll have to contort the system to extremes or fork it. The entire system is built on the assumption of having all your dependencies handled by Bazel itself.
It also tends to be lacking in some pretty basic features when you’re developing desktop applications, like… debugging a cc_binary on macOS: https://github.com/bazelbuild/bazel/issues/2537
Feel free to make issues early and often in https://github.com/bazelbuild/rules_rust if you have questions or problems w/ Rust + Bazel.
Thanks! When we looked at things, the problem was not necessarily the Rust rules (although there are likely changes that need to be made there), but rather cargo-raze having lots and lots of issues with non-trivial crates.