TL;DR: Cross-browser comparisons of memory consumption should be avoided. If you want to evaluate how efficiently browsers use memory, you should do cross-browser comparisons of performance across several machines featuring a range of memory configurations.
Cross-browser Memory Comparisons are Bad
Various tech sites periodically compare the performance of browsers. These often involve some cross-browser comparisons of memory efficiency. A typical one would be this: open a bunch of pages in tabs, measure memory consumption, then close all of them except one and wait two minutes, and then measure memory consumption again. Users sometimes do similar testing.
I think comparisons of memory consumption like these are (a) very difficult to make correctly, and (b) very difficult to interpret meaningfully. I have suggestions below for alternative ways to measure memory efficiency of browsers, but first I’ll explain why I think these comparisons are a bad idea.
Cross-browser Memory Comparisons are Difficult to Make
Getting apples-to-apples comparisons is really difficult.
1. Browser memory measurements aren’t easy. In particular, all browsers use multiple processes, and accounting for shared memory is difficult.
2. Browsers are non-deterministic programs, so memory consumption measurements can vary from run to run.
3. Browsers can exhibit adaptive memory behaviour. If running on a machine with lots of free RAM, a browser may choose to take advantage of it; if running on a machine with little free RAM, a browser may choose to discard regenerable data more aggressively.
If you are comparing two versions of the same browser, problems (1) and (3) are avoided, and so if you are careful with problem (2) you can get reasonable results. But comparing different browsers hits all three problems.
Indeed, Tom’s Hardware de-emphasized memory consumption measurements in their latest Web Browser Grand Prix due to problem (3). Kudos to them!
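To illustrate why shared memory makes the first problem hard, here is a minimal sketch, assuming Linux and a kernel whose /proc/<pid>/smaps files include Pss lines. Rss counts every shared page in full for each process that maps it, while Pss divides each shared page among the processes mapping it. The function names are my own, not part of any tool.

```python
import re

# Matches smaps lines such as "Rss:      1024 kB" or "Pss:       512 kB".
# Anchoring at line start avoids fields like "SwapPss:".
_FIELD = re.compile(r"^(Rss|Pss):\s+(\d+) kB$", re.MULTILINE)

def sum_rss_pss_kb(smaps_text):
    """Return (total Rss, total Pss) in kB for one process's smaps text."""
    totals = {"Rss": 0, "Pss": 0}
    for name, kb in _FIELD.findall(smaps_text):
        totals[name] += int(kb)
    return totals["Rss"], totals["Pss"]

def browser_totals_kb(pids):
    """Sum Rss and Pss over all of a browser's processes (Linux only)."""
    rss = pss = 0
    for pid in pids:
        # Reading another process's smaps needs the same user or root.
        with open(f"/proc/{pid}/smaps") as f:
            r, p = sum_rss_pss_kb(f.read())
        rss += r
        pss += p
    return rss, pss
```

Summing Rss across processes counts shared pages more than once, so the two totals can differ a lot for a multi-process browser. Which one is "the" memory consumption is exactly the kind of judgment call that makes cross-browser numbers hard to compare.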
Cross-browser Memory Comparisons are Difficult to Interpret
Even if you could get the measurements right, memory consumption is still not a good thing to compare. Before I can explain why, I’ll introduce a couple of terms.
- A primary metric is one a user can directly perceive. Metrics that measure performance and crash rate are good examples.
- A secondary metric is one that a user can only indirectly perceive via some kind of tool. Memory consumption is one example. The L2 cache miss rate is another example.
(I made up these terms; I don’t know if there are existing terms for these concepts.)
Primary metrics are obviously important, precisely because users can detect them. They measure things that users notice: “this browser is fast/slow”, “this browser crashes all the time”, etc.
Secondary metrics are important because they can affect primary metrics: memory consumption can affect performance and crash rate; the L2 cache miss rate can affect performance.
Secondary metrics are also difficult to interpret. They can certainly be suggestive, but there are lots of secondary metrics that affect each primary metric of interest, so focusing too strongly on any single secondary metric is not a good idea. For example, if browser A has a higher L2 cache miss rate than browser B, that’s suggestive, but you’d be unwise to draw any strong conclusions from it.
And I haven’t even discussed which memory consumption metric you should use. Physical memory consumption is an obvious choice, but I’ll discuss this more below.
A Better Methodology
So, I’ve explained why I think you shouldn’t do cross-browser memory comparisons. That doesn’t mean that efficient usage of memory isn’t important! However, instead of directly measuring memory consumption — a secondary metric — it’s far better to measure the effect of memory consumption on primary metrics such as performance.
In particular, I think people often use memory consumption measurements as a proxy for performance on machines that don’t have much RAM. If you care about performance on machines that don’t have much RAM, you should measure performance on a machine that doesn’t have much RAM instead of trying to infer it from another measurement.
I did exactly this by doing something I call memory sensitivity testing, which involves measuring browser performance across a range of memory configurations. My test machine had the following characteristics.
- CPU: Intel i7-2600 3.4GHz (quad core with hyperthreading)
- RAM: 16GB DDR3
- OS: Ubuntu 11.10, Linux kernel version 3.0.0.
I used a Linux machine because Linux has a feature called cgroups that allows you to restrict the machine resources available to one or more processes. I followed Justin Lebar’s instructions to create the following configurations that limited the amount of physical memory available: 1024MiB, 768MiB, 512MiB, 448MiB, 384MiB, 320MiB, 256MiB, 192MiB, 160MiB, 128MiB, 96MiB, 64MiB, 48MiB, 32MiB.
(The more obvious way to do this is to use ulimit, but as far as I can tell it doesn’t work on recent versions of Linux or on Mac. And I don’t know of any way to do this on Windows. So my experiments had to be on Linux.)
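For readers who haven’t used cgroups, here is a rough sketch of what capping a process’s physical memory can look like. It assumes the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory and that the script runs as root. It is not a reproduction of Justin Lebar’s instructions, and the group name in the usage note is made up.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup/memory"  # assumed cgroup v1 mount point

# The memory configurations listed above, in MiB.
CONFIGS_MIB = [1024, 768, 512, 448, 384, 320, 256,
               192, 160, 128, 96, 64, 48, 32]

def mib_to_bytes(mib):
    return mib * 1024 * 1024

def limit_memory(group, mib, pid, root=CGROUP_ROOT):
    """Create (if needed) a memory cgroup, cap it, and move pid into it."""
    path = os.path.join(root, group)
    os.makedirs(path, exist_ok=True)
    # cgroup v1 control file holding the memory limit in bytes.
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mib_to_bytes(mib)))
    # Writing a PID to "tasks" moves that process into the group.
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))
```

For example, `limit_memory("browsertest", 128, os.getpid())` would put the calling process into a 128MiB group, and any browser it then launches would inherit that group. On systems using cgroup v2 the file names differ (`memory.max` and `cgroup.procs`), so treat the paths here as assumptions to check against your own machine.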
I used the following browsers.
- Firefox 12 Nightly, from 2012-01-10 (64-bit)
- Firefox 9.0.1 (64-bit)
- Chrome 16.0.912.75 (64-bit)
- Opera 11.60 (64-bit)
IE and Safari aren’t represented because they don’t run on Linux. Firefox is over-represented because that’s the browser I work on and care about the most. The versions are a bit old because I did this testing about six months ago.
The following graph shows the Sunspider results.
As the lines move from right to left, the amount of physical memory available drops. Firefox was clearly the fastest in most configurations, with only minor differences between Firefox 9 and Firefox 12pre, but it slowed down drastically below 160MiB; this is exactly the kind of curve I was expecting. Opera was next fastest in most configurations, and then Chrome, and neither of them showed any noticeable degradation at any memory size, which was surprising and impressive.
All the browsers crashed/aborted if memory was reduced enough. The point at which each graph stops on the left-hand side indicates the lowest size that each browser successfully handled. None of the browsers ran Sunspider with 48MiB available, and FF12pre failed to run it with 64MiB available.
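The overall procedure is just a sweep over memory sizes, recording either a result or a failure at each one. A minimal sketch, assuming a hypothetical `run_benchmark(mib)` callable that launches the browser under the given limit and returns a time, or raises if the browser crashes or aborts:

```python
def sweep(configs_mib, run_benchmark):
    """Run the benchmark at each memory size, largest first.

    Returns {size_mib: result}, with None where the run failed.
    """
    results = {}
    for mib in sorted(configs_mib, reverse=True):
        try:
            results[mib] = run_benchmark(mib)
        except Exception:
            # Treat a crash/abort as "did not run" at this size.
            results[mib] = None
    return results

def lowest_successful(results):
    """Smallest memory size (MiB) at which the benchmark completed."""
    ok = [mib for mib, r in results.items() if r is not None]
    return min(ok) if ok else None
```

The value from `lowest_successful` corresponds to the point where a line stops on the left-hand side of a graph.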
The next graph shows the V8 results.
The curves go the opposite way because V8 produces a score rather than a time, and bigger is better. Chrome easily got the best scores. Both Firefox versions degraded significantly. Chrome and Opera degraded somewhat, and only at lower sizes. Oddly enough, FF9 was the only browser that managed to run V8 with 128MiB available; the other three only ran it with 160MiB or more available.
I don’t particularly like V8 as a benchmark. I’ve always found that it doesn’t give consistent results if you run it multiple times, and these results concur with that observation. Furthermore, I don’t like that it gives a score rather than a time or inverse-time (such as runs per second), because it’s unclear how different scores relate.
The final graph shows the Kraken results.
As with Sunspider, Chrome barely degraded and both Firefoxes degraded significantly. Opera was easily the slowest to begin with and degraded massively; nonetheless, it managed to run with 128MiB available (as did Chrome), which neither Firefox managed.
Overall, Chrome did well, and Opera and the two Firefoxes had mixed results. But I did this experiment to test a methodology, not to crown a winner. (And don’t forget that these experiments were done with browser versions that are now over six months old.) My main conclusion is that Sunspider, V8 and Kraken are not good benchmarks when it comes to gauging how efficiently browsers use memory. For example, none of the browsers slowed down on Sunspider until memory was restricted to 128MiB, which is a ridiculously small amount of memory for a desktop or laptop machine; it’s small even for a smartphone. V8 clearly stresses memory consumption more, but it’s still not great.
What would a better benchmark look like? I’m not completely sure, but it would certainly involve opening multiple tabs and simulating real-world browsing. Something like Membench (see here and here) might be a reasonable starting point. To test the impact of memory consumption on performance, a clear performance measure would be required, because Membench lacks one currently. To test the impact of memory consumption on crash rate, Membench could be modified to just keep opening pages until the browser crashes. (The trouble with that is that you’d lose your count when the browser crashed! You’d need to log the current count to a file or something like that.)
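Logging the count so that it survives a crash could look something like the following. This is only a sketch, assuming the page-opening loop is driven by an external script that can write to a local file; the function names are hypothetical and nothing here is part of Membench.

```python
import os

def record_count(path, count):
    """Append the count and force it to disk before opening the next page."""
    with open(path, "a") as f:
        f.write(f"{count}\n")
        f.flush()
        os.fsync(f.fileno())

def last_count(path):
    """Return the last count recorded, or 0 if nothing was logged."""
    try:
        with open(path) as f:
            lines = f.read().split()
    except FileNotFoundError:
        return 0
    return int(lines[-1]) if lines else 0
```

The driver would call `record_count` after each page opens successfully, and `last_count` after the crash gives the number of pages the browser managed.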
BTW, if you are thinking “you’ve just measured the working set size”, you’re exactly right! I think working set size is probably the best metric to use when evaluating memory consumption of a browser. Unfortunately it’s hard to measure (as we’ve seen) and it is best measured via a curve rather than a single number.
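One way to keep the curve while making browsers easier to compare is to express each time relative to the least restricted run. A small sketch, assuming a time-based benchmark where lower is better and the largest configuration ran successfully; the numbers in the usage note are invented purely for illustration and are not measurements.

```python
def slowdown_curve(times):
    """Map each memory size to its time relative to the largest size.

    1.0 means no slowdown; None marks sizes where the run failed.
    """
    baseline = times[max(times)]
    return {mib: (None if t is None else t / baseline)
            for mib, t in sorted(times.items(), reverse=True)}
```

With made-up input `{1024: 200.0, 256: 210.0, 128: 400.0, 64: None}`, the run at 128MiB comes out at twice the baseline time and the 64MiB entry stays `None` because that run failed.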
A Simpler Methodology
I think memory sensitivity testing is an excellent way to gauge the memory efficiency of different browsers. (In fact, the same methodology can be used for any kind of program, not just browsers.)
But the above experiment wasn’t easy: it required a Linux machine, some non-trivial configuration of that machine that took me a while to get working, and at least 13 runs of each benchmark suite for each browser. I understand that tech sites would be reluctant to do this kind of testing, especially when longer-running benchmark suites such as Dromaeo and Peacekeeper are involved.
A simpler alternative that would still be quite good would be to perform all the performance tests on several machines with different memory configurations. For example, a good experimental setup might involve the following machines.
- A fast desktop with 8GB or 16GB of RAM.
- A mid-range laptop with 4GB of RAM.
- A low-end netbook with 1GB or even 512MB of RAM.
This wouldn’t require nearly as many runs as full-scale memory sensitivity testing would. It would avoid all the problems of cross-browser memory consumption comparisons: difficult measurements, non-determinism, and adaptive behaviour. It would avoid secondary metrics in favour of primary metrics. And it would give results that are easy for anyone to understand.
(In some ways it’s even better than memory sensitivity testing because it involves real machines — a machine with a 3.4GHz i7-2600 CPU and only 128MiB of RAM isn’t a realistic configuration!)
I’d love it if tech sites started doing this.