Profiling the browser’s virtual memory behaviour

We’ve been chipping away at memory use of Firefox 4 for a couple of
months now, with good results.  Recently, though, I’ve been wondering
if we’re measuring the right things.  It seems to me there are two
important things to measure:

  • Maximum virtual address space use for the process.  Why is this
    important?  Because if the process runs out of address space, it’s
    in serious trouble.  Ditto, perhaps worse, if the process uses up
    all the machine’s swap.
  • But the normal case is different: we don’t run out of address space
    or swap.  In this case I don’t care how much memory the browser
    uses.  Really.  When we talk about memory use in the non-OOM
    situation, we’re using that measure as a proxy for responsiveness.
    Excessive memory use isn’t intrinsically bad.  Rather, it’s the side
    effect that’s the problem: it causes paging, both for the browser
    and for everything else running on the machine, slowing
    everything down.

Trying to gauge responsiveness by looking at peak RSS figures strikes
me as a losing prospect.  The RSS values are set by some more-or-less
opaque kernel page discard algorithm, and depend on the behaviour of
all processes in the system, not just Firefox.  Worse, it’s uninformative:
we get no information about which parts of our code base are causing
the paging.

So I hacked up a VM profiler.  This tells me the page fault behaviour
when running Firefox using a given amount of real memory.  It isn’t as
big a task as it sounds, since we already have 99.9% of the required
code in place: Valgrind’s Cachegrind tool.  It just required replacing
the cache simulator with a virtual-to-physical address map simulator.

The profiler does a pretty much textbook pseudo-LRU clock algorithm
simulation.  It differentiates between page faults caused by data and
instruction accesses, since these require different fixes — make the
data smaller vs make the code smaller.  It also differentiates between
clean (page unmodified) and dirty (page modified, requires writeback)

Here are some preliminary results.  Bear in mind the profiler has only
just started to work, so the potential for bogosity is still large.

First question is: we know that 4.0 uses more memory than 3.6.x.  But
does that result in more paging?  I profiled both, loading 5 cad-comic
tabs and idling for a while, for about 8 billion instructions.
Results, simulating 100MB of real memory:

3.6.x, release build, using jemalloc:

VM I accesses: 8,250,840,547  (3,186 clean faults + 350 dirty faults)
VM D accesses: 3,089,412,941  (5,239 clean faults + 552 dirty faults)

M-C, release build, using jemalloc:

VM I accesses: 8,473,182,041  ( 8,140 clean faults +  4,979 dirty faults)
VM D accesses: 3,372,806,043  (22,720 clean faults + 14,335 dirty faults)

Apparently it does page more.  Most of the paging is due to data
rather than instruction accesses.  Requires further investigation.
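
For a quick comparison of the two runs, the arithmetic can be sketched in a few lines of Python.  The figures are copied from the output above; the helper names total_faults and faults_per_million are made up for this example.

```python
# (accesses, clean faults, dirty faults), copied from the two runs above
runs = {
    "3.6.x": {"I": (8_250_840_547, 3_186, 350),
              "D": (3_089_412_941, 5_239, 552)},
    "M-C":   {"I": (8_473_182_041, 8_140, 4_979),
              "D": (3_372_806_043, 22_720, 14_335)},
}


def total_faults(build, kind):
    """Clean plus dirty faults for one build and access kind."""
    _, clean, dirty = runs[build][kind]
    return clean + dirty


def faults_per_million(build, kind):
    """Faults per million accesses, to normalise for run length."""
    accesses, clean, dirty = runs[build][kind]
    return (clean + dirty) / accesses * 1e6


if __name__ == "__main__":
    for kind in ("I", "D"):
        old = total_faults("3.6.x", kind)
        new = total_faults("M-C", kind)
        print(f"{kind}: {old:,} -> {new:,} faults ({new / old:.1f}x), "
              f"{faults_per_million('3.6.x', kind):.2f} -> "
              f"{faults_per_million('M-C', kind):.2f} per million accesses")
```

On these figures, data faults grow about 6.4x and instruction faults about 3.7x between the two builds.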

Second question is: where does that paging come from?  Are we missing
any easy wins?  From a somewhat longer run with a bigger workload, I got
this (w/ apologies for terrible formatting):

Da  = number of data accesses
Dfc = number of clean data faults

            Da       Dfc   function
18,921,574,436   382,023   PROGRAM TOTALS

    19,339,625    60,583   js::Shape::trace
     2,228,649    51,635   JSCompartment::purge
    32,583,809    22,223   js_TraceScript
    16,306,348    18,404   js::mjit::JITScript::purgePICs
    18,160,249    12,847   js::mjit::JITScript::purgePICs
    52,155,631    11,727   memset
    27,229,391    10,813   js::PropertyTree::sweepShapes
   120,482,308    10,256   js::gc::MarkChildren
   138,049,859     9,134   memcpy
     2,228,649     8,779   JSCompartment::sweep
   179,083,731     8,057   js_TraceObject
     6,269,454     5,949   js::mjit::JITScript::sweepCallICs

About 16% of the clean data faults (60,583 of 382,023) come from
js::Shape::trace.

And quite a few come from js::mjit::JITScript::purgePICs (two
versions) and js::mjit::JITScript::sweepCallICs.  According to Dave
Anderson and Chris Leary, there might be some opportunity to poke
the code pages in a less jumping-around-y fashion.
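
One way to read the table is to look at two numbers per function: its share of all clean data faults, and its fault density (faults per thousand data accesses), which hints at how much a function jumps around between pages.  The following Python sketch computes both from the figures in the table.  The helper names share and per_thousand are made up for this example, and the density metric is just one possible way to rank the entries.

```python
TOTAL_FAULTS = 382_023  # Dfc for PROGRAM TOTALS

# (function, data accesses, clean data faults), copied from the table;
# purgePICs appears twice because the table lists two versions of it.
rows = [
    ("js::Shape::trace",                   19_339_625, 60_583),
    ("JSCompartment::purge",                2_228_649, 51_635),
    ("js_TraceScript",                     32_583_809, 22_223),
    ("js::mjit::JITScript::purgePICs",     16_306_348, 18_404),
    ("js::mjit::JITScript::purgePICs",     18_160_249, 12_847),
    ("memset",                             52_155_631, 11_727),
    ("js::PropertyTree::sweepShapes",      27_229_391, 10_813),
    ("js::gc::MarkChildren",              120_482_308, 10_256),
    ("memcpy",                            138_049_859,  9_134),
    ("JSCompartment::sweep",                2_228_649,  8_779),
    ("js_TraceObject",                    179_083_731,  8_057),
    ("js::mjit::JITScript::sweepCallICs",   6_269_454,  5_949),
]


def share(faults):
    """Percentage of all clean data faults."""
    return 100.0 * faults / TOTAL_FAULTS


def per_thousand(accesses, faults):
    """Clean data faults per thousand data accesses."""
    return 1000.0 * faults / accesses


if __name__ == "__main__":
    ranked = sorted(rows, key=lambda r: per_thousand(r[1], r[2]), reverse=True)
    for name, da, dfc in ranked:
        print(f"{share(dfc):5.1f}%  {per_thousand(da, dfc):7.3f}  {name}")
```

By this measure JSCompartment::purge stands out, at about 23 faults per thousand accesses, whereas memcpy is below 0.1.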

6 responses

  1. mmc wrote:

    It’s just fantastic to have people like you working on Firefox! And the benefit of having such tools reflects on all open source projects!

  2. Alessandro Pignotti wrote:

    Is the code for this cachegrind mod available somewhere?

  3. Luke Wagner wrote:

    Wow, if I read that list correctly, all the top offenders seem to be GC related. Is this for the whole browser? Surprising, but it makes sense; when we GC, we touch oodles of memory that we haven’t touched in a while. On the bright side, generational GC should help this a bunch — touch less memory and more recently after it has last been touched.

  4. jseward wrote:

    Luke: yes, this is for the whole browser. Paging from non-JS-engine
    parts was also listed but is further down the list and pretty small.
    Note also, there are entries here for memset and memcpy and it’s not
    clear where that traffic comes from; it could be any part of the
    browser. Really what we need to do is put this functionality into
    Callgrind (which can attribute costs along call chains, a la gprof)
    instead of Cachegrind (which can’t).

  5. Anonymous wrote:

    Ideally, rather than a simulator, you could hook this up to the x86 performance counters that count page faults, cache misses, TLB misses, and so on.

    Have you tried the Linux “perf” tool? It can record those various counters and tie that information to program line numbers via debug information, resulting in an annotated profile of your code’s page fault performance.

  6. Anonymous wrote:

    If this is GC related then coupling the whole thing with compartments might be a great win. Repeatedly GCing compartments that haven’t seen *any* script activity at all would seem like a waste because it won’t find anything new to discard and thus touches pages unnecessarily.