01 Nov 19

evaluating bazel for building firefox, part 2

In our last post, we highlighted some of the advantages that Bazel would bring.  The remote execution and caching benefits Bazel brings look really attractive, but it’s difficult to tell exactly how much they would benefit Firefox.  I looked for projects that had switched to Bazel; a brief summary of each project’s experience is written below.

The Bazel rules for nodejs highlight Dataform’s switch to Bazel, which took about 2 months.  Their build involves some combination of “NPM packages, Webpack builds, Node services, and Java pipelines”. Switching plus enabling remote caching reduced the average time for a build in CI from 30 minutes to 5 minutes; incremental builds for local development have been “reduced to seconds from minutes”.  It’s not clear whether the local development experience is hooked up to the caching infrastructure as well.

Pinterest recently wrote about their switch to Bazel for iOS.  While they call out remote caching leading to “build times [dropping] under a minute and as low as 30 seconds”, they state their “time to land code” only decreased by 27%.  I wasn’t sure how to reconcile such fast builds with (relatively) modest decreases in CI time.  Tests have gotten a lot faster, given that test results can be cached and reused as long as the transitive dependencies of the tests in question are unchanged.

One of the most complete (relatively speaking) descriptions I found was Redfin’s switch from Maven to Bazel, for building a large number of JavaScript modules and Java code, nearly 30,000 files in all.  Their CI builds went from 40-90 minutes to 5-6 minutes; in fairness, it must be mentioned that their Maven builds were not parallelized (for correctness reasons) whereas their Bazel builds were.  But it’s worth highlighting that they managed to do this incrementally, by generating Bazel build definitions from their Maven ones, and that the quoted build times did not enable caching.  The associated tech talk slides/video indicate builds would be roughly in the 1-2 minute range with caching, although they hadn’t deployed that yet.

None of the above accounts talked about how long the conversion took, which I found peculiar.  Both Pinterest and Redfin called out how much more reliable their builds were once they switched to Bazel; Pinterest said, “we haven’t performed a single clean build on CI in over a year.”

In some negative results, which are helpful as well, Dropbox wrote about evaluating Bazel for their Android builds.  What’s interesting here is that other parts of Dropbox are heavily invested in Bazel, so there’s a lot of in-house experience, and that Bazel was significantly faster than their current build system (assuming caching was turned on; Bazel was significantly slower for clean builds without caching).  Yet Dropbox decided to not switch to Bazel due to tooling and development experience concerns.  They did leave open the possibility of switching in the future once the ecosystem matures.

The oddly-named Bazel Fawlty describes a conversion to Bazel from Go’s native tooling, and then a switch back after a litany of problems, including slower builds (but faster tests), a poor development experience (especially on OS X), and various things not being supported in Bazel leading to the native Go tooling still being required in some cases.  This post was also notable for quantifying the porting effort required to switch: eight months plus “many PR’s accepted into the bazel go rules git repo”.  I haven’t used Go, but I’m willing to discount some of the negative experience here due to the native Go tools being so good.

Neither of these negative experiences translates exactly to Firefox: different languages/ecosystems, different concerns, different scales.  But both of them cite the developer experience specifically, suggesting that not only is there a large investment required to actually do the switchover, but you also need to write tooling around Bazel to make it more convenient to use.

Finally, a 2018 BazelCon talk discusses two Google projects that made the switch to Bazel and specifically to use remote caching and remote execution on Google’s public-facing cloud infrastructure: Android Studio and TensorFlow.  (You may note that this is the first instance where somebody has called out supporting remote execution as part of the switch; I think that implies getting a build to the point of supporting remote execution is more complicated than just supporting remote caching, which makes a certain amount of sense.)  Android Studio increased their test presubmit coverage by 4x, presumably by being able to run more than 4x as many test jobs as before thanks to remote execution.  In the same vein, TensorFlow decreased their build and test times by 80%, and they could use significantly less powerful machines to actually run the builds, given that large machines in the cloud were doing the actual heavy lifting.

Unfortunately, I don’t think Firefox could expect those same reductions in test time were it to switch to Bazel.  I can’t speak to Android Studio, but TensorFlow has a number of unit tests whose test results can be cached.  In the Firefox context, these would correspond to cppunittests, which a) we don’t have that many of and b) don’t take that long to run.  The bulk of our tests depend in one way or another on kitchen-sink-style artifacts (e.g. libxul, the JS shell, omni.ja), which essentially depend on everything else.  We could get some reductions for OS-specific modifications; Windows-specific changes wouldn’t require re-running OS X tests, for instance, but my sense is that these sorts of changes are not common enough to lead to an 80% reduction in build + test time.  I suppose it’s also possible that we could teach Bazel that e.g. devtools changes don’t affect, say, non-devtools mochitests/reftests/etc. (presumably?), which would make more test results cacheable.

I want to believe that Bazel + remote caching (+ remote execution if we could get there) will bring Firefox build (and maybe even test) times down significantly, but the above accounts don’t exactly move the needle from belief to certainty.


28 Oct 19

evaluating bazel for building firefox, part 1

After the Whistler All-Hands this past summer, I started seriously looking at whether Firefox should switch to using Bazel for its build system.

The motivation behind switching build systems was twofold.  The first motivation was that build times are one of the most visible developer-facing aspects of the build system, and everybody appreciates faster builds.  What’s less obvious, but equally important, is that making builds faster improves automation: less time waiting for try builds, more flexibility to adjust infrastructure spending, and shorter turnaround for automated review of submitted patches.  The second motivation was that our build system is used by exactly one project (ok, two projects), so there’s a lot of onboarding cost, both for developers who use the build system and for developers who need to develop the build system.  If we could switch to something more off-the-shelf, we could improve the onboarding experience and benefit from work that other parties do with our chosen build system.

You may have several candidates in mind that we should have evaluated instead.  We did look at other candidates (although perhaps none so deeply as Bazel), and all of them have various issues that make them unsuitable for a switch.  The reasons for rejecting other possibilities fall into two broad categories: not enough platform support (read: Windows support) and being unlikely to deliver on making builds faster and/or improving the onboarding/development experience.  I’ll cover the projects we looked at in a separate post.

With that in mind, why Bazel?

Bazel advertises itself with the tagline “{Fast, Correct} – Choose two”.  What’s sitting behind that tagline is that when building software via, say, Make, it’s very easy to write Makefiles in such a way that builds are fast, but occasionally (or not-so-occasionally) fail because somebody forgot to specify “to build thing X, you need to have built thing Y”.  The build doesn’t usually fail, because thing Y usually does get built before thing X: maybe the scheduling algorithm for parallel execution in make chooses to build Y first 99.9% of the time, and 99% of those times, building Y finishes prior to even starting to build X.

The typical solution is to become more conservative in how you build things such that you can be sure that Y is always built before X…but typically by making the dependency implicit by, say, ordering the build commands Just So, and not by actually making the dependency explicit to make itself.  Maybe specifying the explicit dependency is rather difficult, or maybe somebody just wants to make things work.  After several rounds of these kinds of fixes, you wind up with Makefiles that are (probably) correct, but probably not as fast as they could be, because you’ve likely serialized build steps that could have been executed in parallel.  And untangling such systems to the point that you can properly parallelize things and that you don’t regress correctness can be…challenging.

(I’ve used make in the above example because it’s a lowest-common denominator piece of software and because having a concrete example makes differentiating between “the software that runs the build” and “the specification of the build” easier.  Saying “the build system” can refer to either one and sometimes it’s not clear from context which is in view.  But you should not assume that the problems described above are necessarily specific to make; the problems can happen no matter what software you rely on.)

Bazel advertises a way out of the quagmire of probably correct specifications for building your software.  It does this—at least so far as I understand things, and I’m sure the Internet will come to correct me if I’m wrong—by asking you to explicitly specify dependencies up front.  Build commands can then be checked for correctness by executing the commands in a “sandbox” containing only those files specified as dependencies: if you forgot to specify something that was actually needed, the build will fail because the file(s) in question aren’t present.
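
For a sense of what “specify dependencies up front” looks like in practice, here is a tiny, purely hypothetical BUILD file; nothing about it is Firefox-specific, and the target names are invented for illustration:

# Hypothetical BUILD file: every target declares exactly what it consumes.
cc_library(
    name = "y",
    srcs = ["y.cc"],
    hdrs = ["y.h"],
)

cc_library(
    name = "x",
    srcs = ["x.cc"],
    # If this deps entry were missing, the sandboxed compile of x.cc would
    # fail to find y.h, instead of silently relying on build ordering.
    deps = [":y"],
)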

Having a complete picture of the dependency graph enables faster builds in three different ways.  The first is that you can maximally parallelize work across the build.  The second is that Bazel comes with built-in facilities for farming out build tasks to remote machines.  Note that all build tasks can be distributed, not just C/C++/Rust compilation as via sccache.  So even if you don’t have a particularly powerful development machine, you can still pretend that you have a large multi-core system at your disposal.  The third is that Bazel also comes with built-in facilities for aggressive caching of build artifacts.  Again, like remote execution, this caching applies across all build tasks, not just C/C++/Rust compilation.  In Firefox development terms, this is Firefox artifact builds done “correctly”: given appropriate setup, your local build would simply download whatever was appropriate for the changes in your current local tree and rebuild the rest.
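
As a sketch of the setup involved, both facilities are enabled through configuration rather than build-file changes; something like the following lines in a .bazelrc would turn them on (the endpoints are placeholders, not anything Mozilla actually runs):

# Share a cache of build artifacts across machines...
build --remote_cache=grpcs://cache.example.com
# ...and, where available, farm build actions out to remote workers.
build --remote_executor=grpcs://remote.example.com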

Having a complete picture of the dependency graph enables a number of other nifty features.  Bazel comes with a query language for the dependency graph, enabling you to ask questions like “what jobs need to run given that these files changed?”  This sort of query would be valuable for determining what jobs to run in automation; we have a half-hearted (and hand-updated) version of this in things like files-changed in Taskcluster job specifications.  But things like “run $OS tests for $OS-only changes” or “run just the mochitest chunk that contains the changed mochitest” become easy.
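
To make that concrete, here is roughly what such queries look like, using Bazel’s rdeps and tests query functions; the target names are made up, since Firefox has no Bazel targets today:

# Everything that transitively depends on a (hypothetical) changed file:
bazel query 'rdeps(//..., //devtools/client/inspector:inspector.js)'

# Just the test targets affected by that change:
bazel query 'tests(rdeps(//..., //devtools/client/inspector:inspector.js))'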

It’s worth noting here that we could indeed work towards having the entire build graph available all at once in the current Firefox build system.  And we have remote execution and caching abilities via sccache, even more so now that sccache-dist is being deployed in Mozilla offices.  We think we have a reasonable idea of what it would take to work towards Bazel-esque capabilities with our current system; the question at hand is how a switch to Bazel compares to that and whether a switch would be more worthwhile for the health of the Firefox build system over the long term.  Future posts are going to explore that question in more detail.


25 Apr 19

an unexpected benefit of standardizing on clang-cl

I wrote several months ago about our impending decision to switch to clang-cl on Windows.  In the intervening months, we did that, and we also dropped MSVC as a supported compiler.  (We still build on Linux with GCC, and will probably continue to do that for some time.)  One (extremely welcome) consequence of the switch to clang-cl has only become clear to me in the past couple of weeks: using assembly language across platforms is no longer painful.

First, a little bit of background: GCC (and Clang) support a feature called inline assembly, which enables you to write little snippets of assembly code directly in your C/C++ program.  The syntax is baroque, it’s incredibly easy to shoot yourself in the foot with it, and it’s incredibly useful for a variety of low-level things.  MSVC supports inline assembly as well, but only on x86, and with a completely different syntax than GCC.

OK, so maybe you want to put your code in a separate assembly file instead.  The complementary assembler for GCC (courtesy of binutils) is called gas, with its own specific syntax for various low-level details.  If you give gcc an assembly file, it knows to pass it directly to gas, and will even run the C preprocessor on the assembly before invoking gas if you request that.  So you only ever need to invoke gcc to compile everything, and the right thing will just happen. MSVC, by contrast, requires you to invoke a separate, differently-named assembler for each architecture, with different assembly language syntaxes (e.g. directives for the x86-64 assembler are quite different from those for the arm64 assembler), and preprocessing files beforehand requires you to jump through hoops.  (To be fair, a number of these details are handled for you if you’re building from inside Visual Studio; the differences are only annoying to handle in cross-platform build systems.)

In short, dealing with assembler in a world where you have to support MSVC is somewhat painful.  You have to copy-and-paste code, or maybe you write Perl scripts to translate from the gas syntax to whatever flavor of syntax the Microsoft assembler you’re using expects.  Your build system needs to handle Windows and non-Windows differently for assembly files, and may even need to handle different architectures for Windows differently.  Things like our ICU data generation have been made somewhat more complex than necessary to support Windows platforms.

Enter clang-cl.  Since clang-cl is just clang under the hood, it handles being passed assembly files on the command line in the same way and will even preprocess them for you.  Additionally, clang-cl contains a gas-compatible assembly syntax parser, so assembly files that you pass on the command line are parsed by clang-cl and therefore you can now write a single assembly syntax that works on Unix-y and Windows platforms.  (You do, of course, have to handle differing platform calling conventions and the like, but that’s simplified by having a preprocessor available all the time.)  Finally, clang-cl supports GCC-style inline assembly, so you don’t even have to drop into separate assembly files if you don’t want to.
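
Concretely, this means a single gas-syntax, preprocessed assembly file can be handed to the same compiler driver on every platform; as I understand it, something like the following is all the build system has to do (the file names are invented for illustration):

# The same gas-syntax assembly file on every platform:
clang -c -o memcpy_fast.o memcpy_fast.S        # Linux, macOS, Android
clang-cl /c /Fomemcpy_fast.obj memcpy_fast.S   # Windows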

In short, clang-cl solves every problem that made assembly usage painful on Windows. Might we have a future world where open source projects that have to deal with any amount of assembly standardize on clang-cl for their Windows support, and declare MSVC unsupported?


28 Mar 19

a thousand and one quite modest ones

From The Reckoning, by David Halberstam:

Shaiken’s studies showed that the Japanese had made their great surge in the sixties and seventies, by which time the financial men had climbed to eminence within America’s industrial companies and had successfully subordinated the power of the manufacturing men. When the Japanese advantage in quality became obvious in the early eighties, it was fashionable among American managers to attribute it to the Japanese lead in robots, and it was true that Japanese were somewhat more robotized than the Americans. But in Shaiken’s opinion the Japanese success had come not from technology but from manufacturing skills. The Japanese had moved ahead of America when they were at a distinct disadvantage in technology. They had done it by slowly and systematically improving the process of the manufacturing in a thousand tiny increments. They had done it by being there, on the factory floor, as the Americans were not.

In that opinion Shaiken was joined by Don Lennox, the former Ford manufacturing man who had ended up at Harvester. Lennox had gone to Japan in the mid-seventies and been dazzled by what the Japanese had achieved in modernizing their factories. He was amazed not by the brilliance and originality of what they had done but by the practicality of it. Lennox’s visit had been an epiphany: He had suddenly envisioned the past twenty years in Japan, two decades of Japanese manufacturing engineers coming to work every day, busy, serious, being taken seriously by their superiors, being filled with the importance of the mission, improving the manufacturing in countless small ways. It was not that they had made one giant breakthrough, Lennox realized; they had made a thousand and one quite modest ones.


29 May 18

when an implementation monoculture might be the right thing

It’s looking increasingly likely that Firefox will, in the not-too-distant future, build with a single C++ compiler across the four major platforms we support.  I’m uneasy with this, but I think I’ve made my peace with it, partly as a result of writing the piece below.

Firefox currently builds with three major C++ compilers across four platforms: Microsoft’s Visual C++ compiler (MSVC), GCC, and Clang.  A fair amount of work has been done to deal with peculiar bugs in all three compilers: you can go search the source code and/or Bugzilla to find hacks that were needed for one reason or another.  A fair amount of work has also been stalled or shelved because one or two compilers don’t quite measure up in some required area (e.g. standards support).  As you might imagine, many a Firefox engineer has bemoaned the need for cross-compiler compatibility.

Cross-implementation compatibility is something that Mozilla expends a lot of effort on in a different context.  We have a Tech Evangelism bugzilla component for outreach to sites that use techniques that don’t translate across browsers.  When new sites appear that deliberately block Firefox (whether because the launch team took the time to test with Firefox and determine the user experience wouldn’t be acceptable, or because cross-browser compatibility was an explicit non-goal), Firefox engineers go find the performance cliffs and fix them.  Mozilla has a long history of promoting the benefits of multiple implementations of the web platform; some of the old guard might remember “Works best in all browsers” campaigns and the like.  If you squint properly, you can even see this promotion in the manifesto (principles 2, 5, 6, 7, and 9, by my reckoning).

So as nice as a single implementation might be, dealing with multiple implementations was a fact of life in building a high-quality open-source browser.  We dealt with it, because it seemed like we would always need to support MSVC; who would invest the time to create an open-source, MSVC-compatible compiler?

Well, Google, mostly, and a host of other people, because the past several releases of Clang have included an MSVC-compatible frontend, clang-cl.  (Indeed, Firefox has been using clang-cl for Windows static analysis builds for some time.)  And now that we have a usable non-MSVC compiler on Windows, we can contemplate using an open-source compiler to create our release Windows builds.  And once we have that, we can consider using (and potentially only supporting) a single compiler (Clang) for all of the major platforms we support; Linux would be the remaining holdout.  (Chrome already ships on Windows with clang and requires clang everywhere, FWIW.)

We might continue to require that things build with MSVC and GCC on relevant platforms, even if we’re not shipping those builds; but even if we did, such requirements seem unlikely to last very long, for all the reasons that we wanted those compilers dropped in the first place.  I imagine we’d probably continue to accept patches to make things build with non-Clang compilers, as long as the patches were not intrusive, just like we accept patches for non-tier 1 platforms.

Supporting a single compiler has a number of advantages:

  • Cross-language LTO (i.e. inlining) between Rust and C++ (we could, of course, do this today, but we wouldn’t get the win on all platforms);
  • Mozilla engineers can fix bugs in Clang/LLVM if need be;
  • Fixes can be more easily backported from the Clang/LLVM development tree;
  • Contributors have fewer compiler quirks to hold up their patches;
  • Integrating and/or upgrading local copies of upstream projects becomes easier;
  • Performance tuning becomes somewhat more straightforward when you have a single compiler to worry about.

I am probably forgetting some along the way.  (I don’t think it’s true that we’ll be able to entirely eliminate hacks to pacify the compiler; you push on C++ hard enough and long enough, and you find yourself doing all manner of unusual things.  We might even find ourselves doing more hacks, since we can justify it via, “Since we can/can’t rely on the compiler to do X…”)

I can see all the advantages.  I can even admire the sheer coolness of some of them; cross-language inlining sounds fantastic!  But the analogy between the Web situation and the C++ compiler situation makes me uneasy: we ask web developers to write cross-browser compatible websites, with all the time and energy that requires.  We tout the goodness of supporting multiple implementations of the web platform.  However, in the implementation of that web platform, we are in the process of deciding that the benefits of supporting a single C++ implementation are greater than whatever benefits (engineering, philosophical, etc.) might accrue from supporting multiple implementations.

To be explicit: we are making the exact style of decision that we ask web development teams not to make.

After having proposed this and thought about it for a while, I think the analogy is a bit strained.  We make the argument that websites should be cross-browser compatible because we support the freedom of users to access those sites with whatever browser they like.  Firefox engineering, by contrast, is the only “consumer” of the compiler(s), and so we should optimize for that single consumer.  Indeed, we don’t really concern ourselves with cross-engine compatibility for the JavaScript that lies behind our UI.  Firefox users (generally) don’t care too much what compiler gets used to build Firefox, and they’d probably support a switch to a compiler monoculture if that meant the browser got faster!

(I’m not completely at ease with calling the two situations dissimilar; it’d be all too easy for a website to say they only care about a single “user”, viz. users of $BROWSER, and dispense with cross-browser support.  I want to have a stronger argument for this case, but I don’t at the moment…)

At the end of the day, I think I’m mostly in support (0.6 on the Apache voting scale?).  I think it will be cool when it’s done, and I will probably wind up doing some work in support of the project.  But I can’t completely shake my uneasiness.  What do you think?


15 Nov 16

efficiently passing the buck with needinfo requests

A while back, Bugzilla added this great tool called needinfo requests: you set a flag on the bug indicating that a particular person’s input is desired; call that person X. X will then get something dropped into their requests page and a separate email notifying them of the needinfo request. Then, when X responds, clearing the needinfo request, you get an email notifying you that the request has been dealt with. This mechanism works much better than merely saying “X, what do you think?” in a bug comment and expecting that X will see the comment in their bugmail and respond.

My needinfo-related mail, along with all review-related mail, gets filtered into a separate folder in my email client.  It is then very obvious when I get needinfo requests, or when needinfo requests that I have made have been answered.

Occasionally, however, when you get a needinfo, you will not be the correct person to answer the question, and you will need to needinfo someone else who has the appropriate knowledge…or is at least one step closer to providing the appropriate knowledge.

There is a right way and a wrong way to accomplish this. The wrong way is to clear your own needinfo request and request needinfo from someone else:

[screenshot: the wrong way]

Why is this bad? Because the original requester will receive a notification that their request has been dealt with appropriately, when it has not! So now they have to remember to watch the bug, or poll their bugmail, or similar to figure out when their request has actually been dealt with.  Additionally, you’ll get an email notification when your new needinfo request has been answered, which you don’t necessarily want.

The right way (which I just discovered this week) is to uncheck the “Clear the needinfo request” box, which turns the second checkbox into a “Redirect my needinfo request”:

[screenshot: the right way]

This method appropriately redirects the needinfo without notifying the original requester, and the original requester will (ideally) now receive a notification only when the request has been dealt with.


29 Jul 16

a git pre-commit hook for tooltool manifest checking

I’ve recently been uploading packages to tooltool for my work on Rust-in-Gecko and Android toolchains. The steps I usually follow are:

  1. Put together tarball of files.
  2. Call tooltool.py from build-tooltool to create a tooltool manifest.
  3. Upload files to tooltool with said manifest.
  4. Copy bits from said manifest into one of the manifest files automation uses.
  5. Do try push with new manifest.
  6. Admire my completely green try push.

That would be the ideal, anyway.  What usually happens at step 4 is that I forget a comma, or I forget a field in the manifest, and so step 5 winds up going awry, and I end up taking several times as long as I would have liked.

After running into this again today, I decided to implement some minimal validation for automation manifests.  I use a fork of gecko-dev for development, as I prefer Git to Mercurial. Git supports running programs when certain things occur; these programs are known as hooks and are usually implemented as shell scripts. The hook I’m interested in is the pre-commit hook, which is looked for at .git/hooks/pre-commit in any git repository. Repositories come with a sample hook for every hook supported by Git, so I started with:

cp .git/hooks/pre-commit.sample .git/hooks/pre-commit

The sample pre-commit hook checks for trailing whitespace in files, which I sometimes leave around, especially when I’m editing Python, and can check for non-ASCII filenames being added.  I then added the following lines to that file:

if git diff --cached --name-only | grep -q releng.manifest; then
    for f in $(git diff --cached --name-only | grep releng.manifest); do
        if ! python - <<EOF
import json
import sys
try:
    with open("$f", 'r') as f:
        json.loads(f.read())
    sys.exit(0)
except:
    sys.exit(1)
EOF
        then
            echo $f is not valid JSON
            exit 1
        fi
    done
fi

In prose, we’re checking to see if the current commit has any releng.manifest files being changed in any way. If so, then we’ll try parsing each of those files as JSON, and throwing an error if one doesn’t parse.

There are several ways this check could be more robust:

  • The check will error if a commit is removing a releng.manifest, because that file won’t exist for the script to check;
  • The check could ensure that the unpack field is set for all files, as the manifest file used for the upload in step 3, above, doesn’t include that field: it needs to be added manually.
  • The check could ensure that all of the digest fields are the correct length for the specified digest in use.
  • …and so on.

So far, though, simple syntax errors are the greatest source of pain for me, so that’s what’s getting checked for.  (Mismatched sizes have also been an issue, but I’m unsure of how to check that…)
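
If I ever want to go further, a rough sketch of the unpack/digest checks suggested above might look like the following standalone script, which the hook could call in place of the inline Python.  (This is just a sketch: it assumes each manifest entry is a JSON object with filename, algorithm, digest, and unpack fields.)

# check_manifest.py: a sketch of a slightly more robust manifest check.
import json
import sys

# Expected hex digest lengths for the algorithms I've seen in manifests.
DIGEST_LENGTHS = {"sha256": 64, "sha512": 128}

def check(path):
    with open(path) as f:
        entries = json.load(f)          # still catches plain syntax errors
    for entry in entries:
        name = entry.get("filename", "<unknown>")
        if "unpack" not in entry:
            return "%s: missing 'unpack' field" % name
        expected = DIGEST_LENGTHS.get(entry.get("algorithm"))
        if expected is not None and len(entry.get("digest", "")) != expected:
            return "%s: digest has the wrong length" % name
    return None

if __name__ == "__main__":
    error = check(sys.argv[1])
    if error:
        print(error)
        sys.exit(1)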

What pre-commit hooks have you found useful in your own projects?


06 Jul 16

on the usefulness of computer books

I have a book, purchased during my undergraduate days, entitled Introduction to Algorithms. Said book contains a wealth of information about algorithms and data structures, has its own Wikipedia page, and even a snappy acronym people use (“CLRS”, for the first letters of its authors’ last names).

When I bought it, I expected it to be both an excellent textbook and a book I would refer to many times throughout my professional career.  I cannot remember whether it was a good textbook in the context of my classes, and I cannot remember the last time I opened it to find some algorithm or verify some subtle point.  Mostly, it has served two purposes: as an excellent support for raising my monitor closer to eye level, and as extra weight to move around when I have had to transfer my worldly possessions from place to place.

Whether this reflects on the sort of code I have worked on, or the rise of the Internet for answering questions, I am unsure.

I have another book, also purchased during my undergraduate days, entitled Programming with POSIX Threads.  Said book contains a wealth of information about POSIX threads (“pthreads”), is only mentioned in “Further Reading” on the Wikipedia page for POSIX threads, and has no snappy acronym associated with it.

I purchased this book because I thought I might assemble a library of programming knowledge, and of course threads would be a part of that.  Mostly, it would sit on the shelves to show people I was a Real Programmer(tm).

Instead, I have found it to be one of those books to always have close at hand, particularly working on Gecko.  Its explanations of the basic concepts of synchronization are clear and extensive, its examples of how to structure multithreaded algorithms are excellent, and its secondary coverage of “real-world” things such as memory ordering and signals + threads (short version: “don’t”) has been helpful when people have asked me for opinions or to review multi-threaded code.  When I have not followed the advice of this book, I have found myself in trouble later on.

My sense when searching for some of the same topics the book covers is that finding the same quality of coverage for those topics online is rather difficult, even taking into account that topics might be covered by disparate people.

If I had to trim my computer book library down significantly, I’m pretty sure I know what book I would choose.

What book have you found unexpectedly (un)helpful in your programming life?


31 May 16

why gecko data structures should be preferred to std:: ones

In light of the recent announcement that all of our Tier-1 platforms now have a C++11-supporting standard library, I received some questions about whether we should continue encouraging the use of Gecko-specific data structures. My answer was “yes”, and as I was writing the justification for said answer, I felt that the justification was worth broadcasting to a wider audience. Here are the reasons I came up with; feel free to agree or disagree in the comments.

  • Gecko’s data structures can be customized extensively for our purposes, whereas we don’t have the same control over the standard library.  Our string classes, for instance, permit sharing structure between strings (whether via something like nsDependentString or reference-counted string buffers); that functionality isn’t currently supported in the standard library.  While the default behavior on allocation failure in Gecko is to crash, our data structures also provide interfaces for failing gracefully when an allocation can’t be satisfied (see the sketch after this list).  Allocation failures in standard library data structures are reported via exceptions, which we don’t use; if you’re not using exceptions, allocation failures in those data structures simply crash, which isn’t acceptable in a number of places throughout Gecko.
  • Gecko data structures can assume things about the environment that the standard library can’t.  We ship the same memory allocator on all our platforms, so our hashtables and our arrays can attempt to make their allocation behavior line up with what the memory allocator efficiently supports.  It’s possible that the standard library implementations we’re using do things like this, but it’s not guaranteed by the standard.
  • Along similar lines as the first two, Gecko data structures provide better visibility for things like debug checks and memory reporting.  Some standard libraries we support come with built-in debug modes, but not all of them, and not all debug modes are equally complete. Where possible, we should have consistent support for these sorts of things across all our platforms.
  • Custom data structures may provide better behavior than standard data structures by relaxing the specifications provided by the standard.  The WebKit team had a great blog post on their new mutex implementation, which optimizes for cases that OS-provided mutexes aren’t optimized for, either because of compatibility constraints or because of outside specifications.  Chandler Carruth has a CppCon talk where he mentions the non-ideal interfaces in many of the standard library data structures.  We can do better with custom data structures.
  • Data structures in the standard library may provide inconsistent performance across platforms, or disagree on the finer points of the standard.  Love them or hate them, Gecko’s data structures at least provide consistent behavior everywhere.
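
To illustrate the allocation-failure point in the first bullet, here is a minimal sketch using nsTArray’s fallible append; it’s an illustrative helper I made up for this post, not code from the tree:

#include "nsTArray.h"
#include "mozilla/fallible.h"

// Sketch: append to an array without crashing the process on OOM.
bool AppendIfPossible(nsTArray<int>& aArray, int aValue) {
  // The default AppendElement aborts on allocation failure; passing
  // mozilla::fallible makes the failure observable so the caller can
  // degrade gracefully instead.
  if (!aArray.AppendElement(aValue, mozilla::fallible)) {
    return false;
  }
  return true;
}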

Most of these arguments are not new; if you look at the documentation for Facebook’s open-source Folly library, for instance, you’ll find a number of these arguments, if not expressed in quite the same way.  Browsing through WebKit’s WTF library shows they have a number of the same things that we do in xpcom/ or mfbt/ as well, presumably for some of the same reasons.

All of this is not to say that our data structures are perfect: the APIs for our hashtables could use some improvements, our strings and nsTArray do a poor job of separating “data structure” from “algorithm”, nsDeque serves as an excellent excuse to go use the standard library instead, and XPCOM’s synchronization primitives should stop going through NSPR and use the underlying OS’s primitives directly (or simply be rewritten to use something like WebKit’s locking primitives, above).  This is a non-exhaustive list; I have more ideas if people are interested.

Having a C++11 standard library on all platforms brings opportunities to remove dead polyfills; MFBT contains a number of these (Atomics.h, Tuple.h, TypeTraits.h, UniquePtr.h, etc.).  But we shouldn’t flock to the standard library’s functionality just because it’s the standard.  If the standard library’s functionality doesn’t fit our use cases, we should definitely write our own replacement(s) and use them widely.


18 Apr 16

rr talk post-mortem

On Wednesday last week, I gave an invited talk on rr to a group of interested students and faculty at Rose-Hulman. The slides I used are available, though I doubt they make a lot of sense without the talk itself to go with them. Things I was pleased with:

  • I didn’t overrun my time limit, which was pretty satisfying.  I would have liked to have an hour (40 minutes talk/20 minutes for questions or overrun), but the slot was for a standard class period of 50 minutes.  I also wanted to leave some time for questions at the end, of which there were a few. Despite the talk being scheduled for the last class period of the day, it was well-attended.
  • The slides worked well.  My slides are inspired by Lawrence Lessig’s style of presenting, which I also used for my lightning talk in Orlando.  It forces you to think about what you’re putting on each slide and make each slide count.  (I realize I didn’t use this for my Gecko onboarding presentation; I’m not sure if the Lessig method would work for things like that.  Maybe at the next onboarding…)
  • The level of sophistication was just about right, and I think the story approach to creating rr helped guide people through the presentation.  At least, it didn’t look as though many people were nodding off or completely confused, despite rr being a complex systems-heavy program.

Most of the above I credit to practicing the talk repeatedly.  I forget where I heard it, but a rule of thumb I use for presentations is 10 hours of prep time minimum (!) for every 1 hour of talk time.  The prep time always winds up helping: improving the material, refining the presentation, and boosting my confidence giving the presentation.  Despite all that practice, opportunities for improvement remain:

  • The talk could have used some amount of introduction on “here’s how debuggers work”.  This is kind of old hat to me, but I realized after the fact that to many students (perhaps even some faculty), blithely asserting that rr can start and stop threads at will, for instance, might seem mysterious.  A slide or two on the differences between how rr record works vs. how rr replay works and interacts with GDB would have been clarifying as well.
  • The above is an instance where a diagram or two might have been helpful.  I dislike putting diagrams in my talks because I dislike the thought of spending all that time to find a decent, simple app for drawing things, actually drawing them, and then exporting a non-awful version into a presentation.  It’s just a hurdle that I have to clear once, though, so I should just get over it.
  • Checkpointing and the actual mechanisms by which rr can run forwards or backwards in your program got short shrift and should have been explained in a little more detail.  (Diagrams again…)  Perhaps not surprisingly, the checkpointing material got added later during the talk prep and therefore didn’t get practiced as much.
  • The demo received very little practice (I’m sensing a theme here) and while it was able to show off a few of rr’s capabilities, it wasn’t very polished or impressive.  Part of that was due to rr mysteriously deciding to cease working on my virtual machine, but part of that was just my own laziness and assuming things would work out just fine at the actual talk.  Always practice!