This post represents my own opinion, not the official position of Mozilla.
I’ve been working on Firefox for over five years and I’ve been unhappy with the state of browser benchmarking the entire time.
The current problems
The existing benchmark suites are bad
Most of the benchmark suites are terrible. Consider Hennessy and Patterson’s categorization of benchmarks, from best to worst:
- Real applications.
- Modified applications (e.g. with I/O removed to make it CPU-bound).
- Kernels (key fragments of real applications).
- Toy benchmarks (e.g. sieve of Eratosthenes).
- Synthetic benchmarks (code created artificially to fit a profile of particular operations, e.g. Dhrystone).
In my opinion, benchmark suites should only contain benchmarks in categories 1 and 2. There are certainly shades of grey in these categories, but I personally have a high bar, and I’m likely to consider most programs whose line count is in the hundreds as closer to a “toy benchmark” than a “real application”.
Very few browser benchmark suites contain benchmarks from category 1 or 2. I think six or seven of the benchmarks in Octane meet this standard. I’m having trouble thinking of any other existing benchmark suite that contains any benchmarks that do. (This includes Mozilla’s own Kraken suite.)
Bad benchmark suites hurt the web, because every hour that browser vendors spend optimizing bad benchmarks is an hour they don’t spend improving browsers in ways that actually help users. I’ve seen first-hand how much time this wastes.
Conflicts of interest
Some of the benchmark suites — including the best-known ones — come from browser vendors. The conflict of interest is so obvious I won’t bother discussing it further.
Opaque scoring systems
Some of them have opaque scoring systems. Octane (derived from V8bench) is my least favourite here. It also suffers from much higher levels of variance between runs than the benchmarks with time-based scoring systems.
A better way
Here’s how I think a new browser benchmark suite should be created. Some of these ideas are taken from the SPEC benchmark suites, which have been used for over 20 years to benchmark the performance of C, C++, Fortran and Java programs, among other things.
A collaborative and ongoing process
The suite shouldn’t be owned by a single browser vendor. Ideally, all the major browser vendors (Apple, Google, Microsoft, and Mozilla) would be involved. Involvement of experts from outside those companies would also be desirable.
This is a major political and organizational challenge, but it would avoid bias (both in appearance and reality), and would likely lead to a higher-quality suite, because low-quality benchmarks are likely to be shot down.
There should be a public submissions process, where people submit web apps/sites for consideration. And obviously there needs to be a selection process to go with that.
The benchmark suite should be open source and access should be free of charge. It should be hosted on a public site not owned by a browser vendor.
There should also be an official process for modifying the suite: modifying benchmarks, removing benchmarks, and adding new benchmarks. This process shouldn’t be too fluid. Perhaps allowing changes once per year would be reasonable. The current trend of continually augmenting old suites should be resisted.
High-quality
All benchmarks should be challenging and based on real websites/apps; they should all belong to category 1 or 2 on the above scale.
Each benchmark should have two or three inputs of different sizes. Only the largest input would be used in official runs, but the smaller inputs would be useful to browser developers for testing and profiling purposes.
All benchmarks should have side-effects of some kind that depend on most or all of the computation. There should be no large chunks of computation whose results are unused, because such benchmarks can produce silly results once optimizations such as dead code elimination kick in.
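To make that concrete, here’s a minimal sketch (hypothetical code, not taken from any existing suite) of the difference between a benchmark whose work an engine can silently discard and one whose measured time depends on all of its computation:

```js
// Hypothetical illustration, not from any real benchmark suite.

// Bad: the result of the hot loop is never observed, so a sufficiently
// clever engine may dead-code-eliminate most of the work and report an
// absurdly fast time.
function badBenchmark(iterations) {
  for (let i = 0; i < iterations; i++) {
    Math.sqrt(i) * Math.sin(i); // result discarded
  }
}

// Better: fold every intermediate result into a checksum that the harness
// can record or verify, so the timing depends on all of the work running.
function betterBenchmark(iterations) {
  let checksum = 0;
  for (let i = 0; i < iterations; i++) {
    checksum += Math.sqrt(i) * Math.sin(i);
  }
  return checksum; // a side-effect the harness reports
}
```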
Categorization
Benchmarks should not be categorized by browser subsystem. E.g. there shouldn’t be groups like “DOM”, “CSS”, “hardware acceleration”, or “code loading”, because that encourages artificial benchmarks that stress a single subsystem. Instead, any categories should be based on application domain: “games”, “productivity”, “multimedia”, “social”, etc. If a grouping like that ends up stressing some browser subsystems more than others, well, that’s fine, because it means that’s what real websites/apps are doing (assuming the benchmark selection is balanced).
There is one possible exception to this rule: the JavaScript engine. At least some of the benchmarks will be dominated by JS execution time, and because JS engines can be built standalone, the relevant JS engine teams will inevitably want to separate the JS code from everything else so that it can be run in JS shells. It may be worthwhile to pre-emptively do that, where appropriate. If that were done, these JS-only sub-applications wouldn’t be run as part of an official suite run, but they would be present in the repository.
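As a rough sketch of what that separation might look like (everything here is hypothetical; the point is a DOM-free core with thin per-environment drivers, not a concrete proposal):

```js
// Hypothetical sketch. In a real suite the core and the drivers might live
// in separate files; they're shown together here for brevity.

// Pure computation with no DOM or other browser APIs, so it can run
// unchanged in a standalone JS shell.
function runWorkload(size) {
  let checksum = 0;
  for (let i = 0; i < size; i++) {
    checksum = (checksum * 31 + i) >>> 0;
  }
  return checksum;
}

if (typeof document !== "undefined") {
  // Browser driver: hook the workload up to the page and the suite harness.
  document.getElementById("run").addEventListener("click", () => {
    document.getElementById("output").textContent = String(runWorkload(1e7));
  });
} else {
  // Shell driver (e.g. the SpiderMonkey or V8 shell, both of which provide
  // a global print function): just run the workload and print the checksum.
  print(runWorkload(1e7));
}
```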
Scoring
Scoring should be as straightforward as possible. Times are best because their meaning is more obvious than anything else, *especially* when multiple times are combined. Opaque points systems are awful.
Using times for everything makes the most sense when the benchmarks are all throughput-oriented. If some benchmarks are latency-oriented (or measure FPS, or something else), this becomes trickier. Even so, a simple system that humans can intuitively understand is best.
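To show how simple a transparent, time-based summary can be, here’s a small sketch. The benchmark names and numbers are invented, and reporting a SPEC-style geometric mean alongside a plain total is my own assumption, not a settled proposal:

```js
// Hypothetical scoring sketch: per-benchmark times in seconds are reported
// directly, then summarized two ways (made-up data).
const timesInSeconds = { mail: 2.41, spreadsheet: 3.07, game: 1.86 };

const values = Object.values(timesInSeconds);
const totalTime = values.reduce((sum, t) => sum + t, 0);
const geoMean =
  Math.exp(values.reduce((sum, t) => sum + Math.log(t), 0) / values.length);

console.log("per-benchmark times (s):", timesInSeconds);
console.log("total time (s):", totalTime.toFixed(2));
console.log("geometric mean (s):", geoMean.toFixed(2));
```

Either summary keeps the units in seconds, so anyone can see what a change in the score actually means, which is the whole point.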
Conclusion
It’s time the browser world paid attention to the performance and benchmarking lessons that the CPU and C/C++/Fortran worlds learned years ago. It’s time to grow up.
If you want to learn more, I highly recommend getting your hands on a copy of Hennessy & Patterson’s Computer Architecture and reading the section on measuring performance. This is section 1.5 (“Measuring and Reporting Performance”) in the 3rd edition, but I expect later editions have similar sections.