We’ve been chipping away at memory use of Firefox 4 for a couple of
months now, with good results. Recently, though, I’ve been wondering
if we’re measuring the right things. It seems to me there are two
important things to measure:
- Maximum virtual address space use for the process. Why is this
important? Because if the process runs out of address space, it’s
in serious trouble. Ditto, perhaps worse, if the process uses up
all the machine’s swap.
- But the normal case is different: we don’t run out of address space
or swap. In this case I don’t care how much memory the browser
uses. Really. When we talk about memory use in the non-OOM
situation, we’re using that measure as a proxy for responsiveness.
Excessive memory use isn’t intrinsically bad. Rather, it’s the side
effect that’s the problem: it causes paging, both for the browser
and for everything else running on the machine, slowing
everything down.
Trying to gauge responsiveness by looking at peak RSS figures strikes
me as a losing prospect. The RSS values are set by some more-or-less
opaque kernel page discard algorithm, and depend on the behaviour of
all processes in the system, not just Firefox. Worse, it’s uninformative:
we get no information about which parts of our code base are causing
paging.
So I hacked up a VM profiler. This tells me the page fault behaviour
when running Firefox using a given amount of real memory. It isn’t as
big a task as it sounds, since we already have 99.9% of the required
code in place: Valgrind’s Cachegrind tool. It just required replacing
the cache simulator with a virtual-to-physical address map simulator.
The profiler does a pretty much textbook pseudo-LRU clock algorithm
simulation. It differentiates between page faults caused by data and
instruction accesses, since these require different fixes — make the
data smaller vs make the code smaller. It also differentiates between
clean (page unmodified) and dirty (page modified, requires writeback)
faults.
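For the curious, here’s a minimal sketch of the sort of simulation I
mean. It is not the code I actually bolted onto Cachegrind: the class
name, the 4KB page size, and the rule that a fault counts as dirty when
the evicted page needs writing back are all just illustrative
assumptions.

// vm_clock_sim.cpp -- a minimal sketch of a clock-style paging simulator.
// NOT the actual Cachegrind modification; names, the 4KB page size and the
// fault classification rule are assumptions made purely for illustration.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Frame {
    uint64_t page;       // virtual page number currently held in this frame
    bool     referenced; // cleared as the clock hand sweeps past
    bool     dirty;      // set by any store to the page
    bool     valid;      // does this frame hold a page at all?
};

class ClockSim {
public:
    // mem_bytes: the amount of "real memory" being simulated, e.g. 100MB.
    // Assumed to be at least one page.
    explicit ClockSim(size_t mem_bytes, size_t page_bytes = 4096)
        : frames_(mem_bytes / page_bytes), page_shift_(log2_of(page_bytes)) {}

    // One simulated access.  is_insn selects the I vs D counters; is_store
    // marks the page dirty so that evicting it later forces a writeback.
    void access(uint64_t addr, bool is_store, bool is_insn) {
        uint64_t page = addr >> page_shift_;
        auto it = map_.find(page);
        if (it != map_.end()) {                   // hit: just refresh the bits
            Frame& f = frames_[it->second];
            f.referenced = true;
            f.dirty = f.dirty || is_store;
            return;
        }
        bool writeback = evict_and_map(page, is_store);   // page fault
        (is_insn ? (writeback ? i_dirty_ : i_clean_)
                 : (writeback ? d_dirty_ : d_clean_))++;
    }

    void report() const {
        std::printf("I faults: %llu clean + %llu dirty\n",
                    (unsigned long long)i_clean_, (unsigned long long)i_dirty_);
        std::printf("D faults: %llu clean + %llu dirty\n",
                    (unsigned long long)d_clean_, (unsigned long long)d_dirty_);
    }

private:
    static unsigned log2_of(size_t n) {
        unsigned s = 0;
        while ((size_t(1) << s) < n) s++;
        return s;
    }

    // Textbook clock / second-chance replacement: advance the hand, clearing
    // referenced bits, until an unreferenced (or empty) frame turns up, then
    // map the faulting page into it.  Returns true if the victim page was
    // dirty, i.e. the fault would have required a writeback.
    bool evict_and_map(uint64_t page, bool is_store) {
        for (;;) {
            size_t idx = hand_;
            hand_ = (hand_ + 1) % frames_.size();
            Frame& f = frames_[idx];
            if (!f.valid || !f.referenced) {
                bool writeback = f.valid && f.dirty;
                if (f.valid) map_.erase(f.page);
                f = Frame{page, true, is_store, true};
                map_[page] = idx;
                return writeback;
            }
            f.referenced = false;                 // give it a second chance
        }
    }

    std::vector<Frame> frames_;                   // one entry per physical frame
    std::unordered_map<uint64_t, size_t> map_;    // virtual page -> frame index
    size_t   hand_ = 0;                           // clock hand position
    unsigned page_shift_;
    uint64_t i_clean_ = 0, i_dirty_ = 0, d_clean_ = 0, d_dirty_ = 0;
};

Driven by the instruction and data address streams that Cachegrind
already traces, and instantiated as, say, ClockSim sim(100 * 1024 * 1024)
to match the 100MB runs below, something along these lines produces the
four counters (I/D crossed with clean/dirty) that the results report.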
Here are some preliminary results. Bear in mind the profiler has only
just started to work, so the potential for bogosity is still large.
The first question is: we know that 4.0 uses more memory than 3.6.x. But
does that result in more paging? I profiled both, loading 5 cad-comic
tabs (http://www.cad-comic.com/cad/random) and idling for a while, for
about 8 billion instructions. Results, simulating 100MB of real memory:
3.6.x, release build, using jemalloc:
VM I accesses: 8,250,840,547  ( 3,186 clean faults +    350 dirty faults)
VM D accesses: 3,089,412,941  ( 5,239 clean faults +    552 dirty faults)
M-C, release build, using jemalloc:
VM I accesses: 8,473,182,041  ( 8,140 clean faults +  4,979 dirty faults)
VM D accesses: 3,372,806,043  (22,720 clean faults + 14,335 dirty faults)
Apparently it does page more: 9,327 faults in total for 3.6.x against
50,174 for M-C, more than a five-fold increase. Most of the paging is due
to data rather than instruction accesses. Requires further investigation.
The second question is: where does that paging come from? Are we missing
any easy wins? From a somewhat longer run with a bigger workload, I got
this:
Da (# data accesses)  Dfc (# clean data faults)  function
---------------------------------------------------------
      18,921,574,436                    382,023  PROGRAM TOTALS
          19,339,625                     60,583  js::Shape::trace
           2,228,649                     51,635  JSCompartment::purge
          32,583,809                     22,223  js_TraceScript
          16,306,348                     18,404  js::mjit::JITScript::purgePICs
          18,160,249                     12,847  js::mjit::JITScript::purgePICs
          52,155,631                     11,727  memset
          27,229,391                     10,813  js::PropertyTree::sweepShapes
         120,482,308                     10,256  js::gc::MarkChildren
         138,049,859                      9,134  memcpy
           2,228,649                      8,779  JSCompartment::sweep
         179,083,731                      8,057  js_TraceObject
           6,269,454                      5,949  js::mjit::JITScript::sweepCallICs
About 18% of the faults come from js::Shape::trace, and quite a few
come from js::mjit::JITScript::purgePICs (two
versions) and js::mjit::JITScript::sweepCallICs. According to Dave
Anderson and Chris Leary, there might be some opportunity to poke
the code pages in a less jumping-around-y fashion.