
A browser benchmarking manifesto

This post represents my own opinion, not the official position of Mozilla.

I’ve been working on Firefox for over five years and I’ve been unhappy with the state of browser benchmarking the entire time.

The current problems

The existing benchmark suites are bad

Most of the benchmark suites are terrible. Consider Hennessy and Patterson’s categorization of benchmarks, ordered from best to worst.

  1. Real applications.
  2. Modified applications (e.g. with I/O removed to make it CPU-bound).
  3. Kernels (key fragments of real applications).
  4. Toy benchmarks (e.g. sieve of Eratosthenes).
  5. Synthetic benchmarks (code created artificially to fit a profile of particular operations, e.g. Dhrystone).

In my opinion, benchmark suites should only contain benchmarks in categories 1 and 2. There are certainly shades of grey in these categories, but I personally have a high bar, and I’m likely to consider most programs whose line count is in the hundreds as closer to a “toy benchmark” than a “real application”.

Very few browser benchmark suites contain benchmarks from category 1 or 2. I think six or seven of the benchmarks in Octane meet this standard, and I’m having trouble thinking of any other existing benchmark suite that contains even one that does. (This includes Mozilla’s own Kraken suite.)

Bad benchmark suites hurt the web, because every hour that browser vendors spend optimizing bad benchmarks is an hour they don’t spend improving browsers in ways that actually help users. I’ve seen first-hand how much time this wastes.

Conflicts of interest

Some of the benchmark suites — including the best-known ones — come from browser vendors. The conflict of interest is so obvious I won’t bother discussing it further.

Opaque scoring systems

Some suites have opaque scoring systems. Octane (derived from V8bench) is my least favourite here. It also suffers from much higher run-to-run variance than suites with time-based scoring systems.

A better way

Here’s how I think a new browser benchmark suite should be created. Some of these ideas are taken from the SPEC benchmark suites, which have been used for over 20 years to benchmark the performance of C, C++, Fortran and Java programs, among other things.

A collaborative and ongoing process

The suite shouldn’t be owned by a single browser vendor. Ideally, all the major browser vendors (Apple, Google, Microsoft, and Mozilla) would be involved. Involvement of experts from outside those companies would also be desirable.

This is a major political and organizational challenge, but it would avoid bias (both in appearance and reality), and would likely lead to a higher-quality suite, because low-quality benchmarks are likely to be shot down.

There should be a public submissions process, where people submit web apps/sites for consideration. And obviously there needs to be a selection process to go with that.

The benchmark suite should be open source and access should be free of charge. It should be hosted on a public site not owned by a browser vendor.

There should also be an official process for modifying the suite: modifying benchmarks, removing benchmarks, and adding new benchmarks. This process shouldn’t be too fluid. Perhaps allowing changes once per year would be reasonable. The current trend of continually augmenting old suites should be resisted.

High-quality

All benchmarks should be challenging, and based on real websites/apps; they should all belong to category 1 or 2 on the above scale.

Each benchmark should have two or three inputs of different sizes. Only the largest input would be used in official runs, but the smaller inputs would be useful for browser developers for testing and profiling purposes.
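
To illustrate what I mean (the names and sizes below are invented, not a proposal for specific values), a benchmark might expose its sizes like this:

    // Invented example of a benchmark exposing multiple input sizes.
    // Only "large" would count for the official score; "small" and
    // "medium" exist for quick testing and profiling by developers.
    const INPUTS = {
      small:  { frames: 50 },
      medium: { frames: 500 },
      large:  { frames: 5000 },   // official runs use this one
    };

    function officialRun(runFrame) {
      const t0 = performance.now();
      for (let i = 0; i < INPUTS.large.frames; i++) runFrame(i);
      return performance.now() - t0;   // time in milliseconds
    }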

All benchmarks should have side-effects of some kind that depend on most or all of the computation. There should be no large chunks of computation whose results are unused, because such benchmarks are subject to silly results due to certain optimizations such as dead code elimination.
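
As a toy illustration of the difference (not taken from any real benchmark): in the first version nothing observes the computed values, so an engine could legitimately throw the work away; in the second, every iteration feeds a checksum that is checked and reported.

    // BAD: nothing depends on the computed values, so dead code elimination
    // could legally remove the loop body (or the whole function).
    function fragileBench() {
      for (let i = 0; i < 1e7; i++) {
        Math.sqrt(i) * Math.sin(i);   // result discarded
      }
    }

    // BETTER: every iteration contributes to a checksum that is verified and
    // reported, so the measured time has to include the real work.
    function robustBench() {
      let checksum = 0;
      for (let i = 0; i < 1e7; i++) {
        checksum += Math.sqrt(i) * Math.sin(i);
      }
      if (!Number.isFinite(checksum)) throw new Error("bad result");
      return checksum;   // reported alongside the time
    }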

Categorization

Benchmarks should not be categorized by browser subsystem. E.g. there shouldn’t be groups like “DOM”, “CSS”, “hardware acceleration”, or “code loading”, because that encourages artificial benchmarks that stress a single subsystem. Instead, any categories should be based on application domain: “games”, “productivity”, “multimedia”, “social”, etc. If a grouping like that ends up stressing some browser subsystems more than others, that’s fine, because it means that’s what real websites/apps are doing (assuming the benchmark selection is balanced).

There is one possible exception to this rule, which is the JavaScript engine. At least some of the benchmarks will be dominated by JS execution time, and because JS engines can be built standalone, the relevant JS engine teams will inevitably want to separate JS code from everything else so that the code can be run in JS shells. It may be worthwhile to pre-emptively do that, where appropriate. If that were done, you wouldn’t run these JS-only sub-applications as part of an official suite run, but they would be present in the repository.
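
A rough sketch of the kind of separation I have in mind (the names here are invented for illustration): the compute-heavy core takes plain data in and returns plain data out, and only a thin wrapper touches the DOM, so the core can run unchanged in a JS shell.

    // core.js: pure computation, no DOM access, runnable in a JS shell.
    // (The names and the trivial "workload" are invented for illustration.)
    function simulateStep(state) {
      // heavy numeric work on plain arrays/objects
      return { positions: state.positions.map(p => p + 1) };
    }

    // browser.js: thin wrapper used only in the full in-browser run.
    function renderFrame(state) {
      const next = simulateStep(state);
      document.getElementById("status").textContent =
        next.positions.length + " particles updated";
      return next;
    }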

Scoring

Scoring should be as straightforward as possible. Times are best because their meaning is more obvious than anything else, *especially* when multiple times are combined. Opaque points systems are awful.

Using times for everything makes most sense when the benchmarks are all throughput-oriented. If some benchmarks are latency-oriented (or measure FPS, or something else), this becomes trickier. Even so, a simple system that humans can intuitively understand is best.
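
For illustration only (the numbers are made up, and this isn’t a proposal for the exact formula), here’s what a transparent time-based score might look like; the geometric-mean-of-ratios variant is the SPEC-style way of combining times without letting one long benchmark dominate.

    // Per-benchmark times in milliseconds (made-up numbers).
    const times = { "webmail": 430.2, "photo-editor": 612.8, "rts-game": 951.3 };

    // Simplest combined score: total time, in the same units as the parts.
    const total = Object.values(times).reduce((a, b) => a + b, 0);

    // SPEC-style alternative: geometric mean of ratios against reference
    // times, so no single benchmark dominates the combined score.
    const reference = { "webmail": 500, "photo-editor": 500, "rts-game": 1000 };
    const ratios = Object.keys(times).map(k => reference[k] / times[k]);
    const geomean = Math.exp(
      ratios.reduce((sum, r) => sum + Math.log(r), 0) / ratios.length);

    console.log(`total: ${total.toFixed(1)} ms; ` +
                `geomean speedup vs. reference: ${geomean.toFixed(2)}x`);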

Conclusion

It’s time the browser world paid attention to the performance and benchmarking lessons that the CPU and C/C++/Fortran worlds learned years ago. It’s time to grow up.

If you want to learn more, I highly recommend getting your hands on a copy of Hennessy & Patterson’s Computer Architecture and reading the section on measuring performance. This is section 1.5 (“Measuring and Reporting Performance”) in the 3rd edition, but I expect later editions have similar sections.

12 replies on “A browser benchmarking manifesto”

I haven’t looked closely at Speedometer, but the description says it “simulate[s] user actions for adding, completing, and removing to-do items” which sounds rather microbenchmark-ish.

I did look more closely at JetStream, also released recently by Apple, and also available at the generic yet authoritatively-named browserbench.org site. It is, alas, a rehash of half of SunSpider, all of Octane, plus some terrible benchmarks (including Dhrystone, Fibonacci, and Towers of Hanoi!) from the LLVM suite translated via Emscripten, and a translation of an Apache hash table benchmark originally implemented in Java.

A suite that looks much more promising is Massive (http://kripken.github.io/Massive/), an asm.js suite that is currently under development. It features four real-world benchmarks: a database engine, a Lua VM, a PDF viewer, and a physics engine. (See the FAQ for details.) It’s also publicly visible on GitHub.

AIUI Speedometer actually implements a small application using several popular frameworks, and tries to automate the resulting UI. Given that, I don’t see that it’s fair to call it a “microbenchmark”. However, you don’t define exactly what you think that term means, so it’s very hard to tell what criteria of usefulness you think Speedometer fails.

Massive does indeed look useful, but as noted on the page itself, sounds like it is heavily weighted to a particular style of code. Which is actually an interesting point because a few years ago, when everyone was writing fast JS engines against the V8 and Sunspider benchmarks, Microsoft (Research, probably) released a paper noting that these benchmarks didn’t fit the workloads of typical webapps. Which is of course true of the kind of apps that were around at the time, and indeed of many apps that are around now. But it rather missed the point, simply because it wasn’t possible to make the kinds of apps that needed an order of magnitude better JS performance. So even though those benchmarks are deeply flawed, competition on them drove performance improvements that opened up the possibility of whole new classes of apps. Therefore one should be careful about systematic biases in only using type 1 and type 2 benchmarks because they can’t drive you toward the performance profile you need for classes of apps that don’t exist on your target medium yet. On the web today that might be advanced media editing tools, for example.

I’m a bit bemused about MS denouncing Sunspider as a poor benchmark considering that it appears to be the IE team’s favorite, probably because unlike Kraken/Octane, they’ve optimized it to run faster on their JS engine than any of their rivals have.

Let’s not get bogged down in the exact meanings, because I bet everyone interprets those terms slightly differently. My basic complaint is that the Speedometer app is synthetic — it was built from scratch specifically for benchmarking purposes. I contend that any such app is likely to have quite different characteristics to real-world code.

Something I’ve said before elsewhere: if you want to know how a browser performs on workload X, measure X — don’t measure Y, which you think is a good proxy for X. In this case, I want to know how a browser performs on challenging, real-world code, so that’s what we should be measuring.

While a “proper” full benchmark (which tests several aspects) is useful for anecdotal comparison of different browsers, it’s not very useful as a development aid IMO.

Also, there’s some distance between a proper multi-aspect benchmark and a micro-benchmark. This is a gap into which “subject-benchmarks” can fit, such as rendering-speed benchmarks, WebGL benchmarks, Web-Apps benchmarks, memory-usage benchmarks, etc.

Since the results of these subject-benchmarks are often orthogonal to each other (even within the same browser), it makes sense not to combine them into a “one number to rule them all”.

Performance numbers and benchmarks are not black and white, and there are good uses to several different “levels” of them, including very high level ones, as well as highly specific ones. It all depends on your needs.

My only comment, as an ex-compiler engineer, is that your proscription against benchmarks with dead code sounds like a good idea for preventing meaningless “wins”, but in actuality such cases come up in real code all the time.

One day I was working on a benchmark submitted by one of our company’s customers, and realized that much of its code was simply dead due to a constant controlling expression (basically “if (foo) { …” when |foo| was known at compile time to be zero). The compiler was unfortunately not noticing this early enough to prune the entire body out of the intermediate representation, so even though we later removed some of the basic blocks during the optimizer phase, we still had a lot of dead code.

I implemented an early pass to recognize and remove these sorts of cases before they even made it to the intermediate representation, and the benchmark’s code size dropped hugely, as I expected. What I didn’t expect was that, when run on the remainder of our (enormous) library of benchmarks, tests, and other pieces of code, the change was also widely useful — in fact, it was the biggest single win I ever had.

It turns out that real-world code, at least in the cases we were looking at, contains a lot of cases where huge pieces of functionality are either not run at all, or run but have their results ignored, for a variety of reasons (not just stupid coders, though that certainly was one).

I’m not familiar enough with the Javascript world to know how it compares, qualitatively, with the sort of C code I was trying to optimize. But I wouldn’t be so quick to assume that dead code in benchmarks should be avoided as nonrepresentative.

Dead code happens in practice, sure.

The problem is when every benchmark has zero side-effects, and so replacing the entire benchmark with an empty function would be a valid transformation. This is the kind of thing that synthetic benchmarks and microbenchmarks suffer from.
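
As a toy illustration of the degenerate case I mean:

    // Nothing observes `sum`, so replacing doWork() with an empty function
    // preserves the program's behaviour, and a benchmark built like this
    // ends up measuring how willing the optimizer is to do exactly that.
    function doWork() {
      let sum = 0;
      for (let i = 0; i < 1e8; i++) sum += i * i;
      // `sum` is never returned, printed, or compared against anything.
    }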

A “simple” benchmark that’d be way better than the current ones from an end user’s perspective would be to measure the startup time of opening facebook, youtube and gmail inbox, and in addition to measure the idle CPU usage (including occasional spikes) of facebook (Firefox is particularly atrocious here). Of course, getting stable snapshots of those might not be exactly trivial, so this is mostly wishful thinking.

The beauty of the web is that websites are public. Would it be possible to create a “Web 30” basket of the most popular websites, and then once a year cache copies of them to be used for benchmarking? Sort of like a Dow 30 for websites, with ratings for popular browsers.

An interesting data point: a paper recently presented at PLDI 2014 on JavaScript types/shapes in V8 showed that V8 had over-optimized for the synthetic benchmark suites at the expense of performance on real sites.
