Building a page fault benchmark

I wrote a while ago about the importance of avoiding page faults for browser performance.  Despite this, I’ve been focusing specifically on reducing Firefox’s memory usage.  This is not a terrible thing;  page fault rates and memory usage are obviously strongly linked.  But they don’t have perfect correlation.  Not all memory reductions will have equal effect on page faults, and you can easily imagine changes that reduce page fault rates — by changing memory layout and access patterns — without reducing memory consumption.

A couple of days ago, Luke Wagner initiated an interesting email conversation with me about his desire for a page fault benchmark, and I want to write about some of the things we discussed.

It’s not obvious how to design a page fault benchmark, and to understand why, I first need to talk about more typical time-based benchmarks like SunSpider.  SunSpider does the same amount of work every time it runs, and you want it to run as fast as possible.  It might take 200ms to run on your beefy desktop machine, 900ms on your netbook, and 2000ms on your smartphone.  In all cases, you have a useful baseline against which you can measure optimizations.  Also, any optimization that reduces the time on one device has a good chance of reducing the time on the other devices.  The performance curve across devices is fairly flat.

In contrast, if you’re measuring page faults, these things probably won’t be true on a benchmark that does a constant amount of work.  If my desktop machine has 16GB of RAM, I’ll probably get close to zero page faults no matter what happens.  But on a smartphone with 512MB of RAM, the same benchmark may lead to a page fault death spiral;  the number will be enormous, assuming you even bother waiting for it to finish (or the OS doesn’t kill it).  And the netbook will probably lie unhelpfully on one side or the other of the cliff in the performance curve.  Such a benchmark will be of limited use.

However, maybe we can instead concoct a benchmark that repeats a sequence of interesting operations until a certain number of page faults has occurred, and reports how many operations it completed.  The desktop machine might get 1000 operations, the netbook 400, the smartphone 100.  The performance curve is fairly flat again.
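
As a rough illustration, here is a minimal Python sketch of that loop. It is not an existing harness: `run_until_fault_budget`, `major_faults`, `read_faults` and `max_rounds` are hypothetical names. It assumes a Unix system where `resource.getrusage` reports hard faults in `ru_majflt`. That counter covers the calling process only, so a real harness would have to pass in a `read_faults` function that reads the browser's counter instead.

```python
import resource


def major_faults():
    """Hard (major) page faults of the calling process, as reported by getrusage."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt


def run_until_fault_budget(operations, fault_budget, read_faults=major_faults,
                           max_rounds=100_000):
    """Repeat the operation sequence until fault_budget new hard faults are seen.

    operations   -- list of zero-argument callables (one browsing sequence)
    fault_budget -- number of new hard faults at which the run stops
    read_faults  -- callable returning a cumulative hard-fault count
    max_rounds   -- safety cap for machines that never reach the budget

    Returns the number of complete rounds, which is the benchmark score.
    """
    start = read_faults()
    rounds = 0
    while rounds < max_rounds and read_faults() - start < fault_budget:
        for op in operations:
            op()
        rounds += 1
    return rounds
```

The fault budget is only checked between rounds, so a run can overshoot it by up to one sequence; that seems acceptable as long as a sequence is small compared with the budget.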

The operations should be representative of realistic browsing behaviour.  Obviously, the memory consumption has to increase each time you finish a sequence, but you don’t want to just open new pages.  A better sequence might look like “open foo.com in a new tab, follow some links, do some interaction, open three child pages, close two of them”.

And it would be interesting to run this test on a range of physical memory sizes, to emulate different machines such as smartphones, netbooks, desktops.  Fortunately, you can do this on Linux;  I’m not sure about other OSes.
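
I haven't settled on a mechanism for this. One option on Linux, and this is my assumption rather than something tested for this purpose, is to put the browser process in a cgroup v2 group with a `memory.max` limit. The sketch below only writes the two control files; `limit_memory` and the group path are hypothetical, and it assumes root privileges, a cgroup v2 mount and the memory controller enabled for the parent group.

```python
from pathlib import Path


def limit_memory(cgroup_dir, limit_bytes, pid):
    """Cap the memory of process `pid` using a cgroup v2 group.

    cgroup_dir  -- group directory, e.g. somewhere under /sys/fs/cgroup (assumed mount point)
    limit_bytes -- value written to memory.max
    pid         -- process to move into the group
    """
    cgroup = Path(cgroup_dir)
    # Creating the directory is what creates the group on a real cgroup v2 mount.
    cgroup.mkdir(parents=True, exist_ok=True)
    # memory.max is the hard limit for the group, in bytes.
    (cgroup / "memory.max").write_text(f"{limit_bytes}\n")
    # Writing a PID to cgroup.procs moves that process into the group.
    (cgroup / "cgroup.procs").write_text(f"{pid}\n")
```

Calling it with limits such as 512MB, 2GB and 16GB would stand in for the smartphone, netbook and desktop cases.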

I think a benchmark (or several benchmarks) like this would be challenging but not impossible to create.  It would be very valuable, because it measures a metric that directly affects users (page faults) rather than one that indirectly affects them (memory consumption).  It would be great to use as the workload under Julian Seward’s VM simulator, in order to find out which parts of the browser are causing page faults.  It might make areweslimyet.com catch managers’ eyes as much as arewefastyet.com does.  And finally, it would provide an interesting way to compare the memory “usage” (i.e. the stress put on the memory system) of different browsers, in contrast to comparisons of memory consumption which are difficult to interpret meaningfully.

 

12 Responses to Building a page fault benchmark

  1. Please avoid saying page faults when you really mean hard page faults.

  2. So, thinking a bit more about it, I’m not sure this is the right metric for a benchmark. Basically, once you have swapped, you have already lost. But then, swapping may not be entirely your fault. Some other application may be sucking up memory, etc. So what we really want is to avoid using memory (thanks, Captain Obvious). But I don’t think we need a metric that tells us how long, or how many operations, it takes before swapping starts. I think we need a metric that tells us how much memory some typical script uses. Then reducing internal copies, struct/object sizes, etc. would make the number go down.

    Now, once you hit the disk as memory storage, you may not have entirely lost: what was swapped out could well be stuff you don’t use or haven’t used recently, and you may simply want to avoid touching it, or at least avoid touching a lot of it. There, memory locality would help limit the worst effects of swapping, and we probably need to find some way to measure it in typical situations. For example, if you have a lot of tabs open on a system with a limited amount of memory, you will undoubtedly have a lot of objects from a lot of tabs swapped out. When you switch to one of the tabs you haven’t looked at for a while, you’ll need to grab all the corresponding memory pages from disk, and the more widely the corresponding data is spread across different pages, mixed with other content, the more pages you’ll have to swap in, and thus the longer it will take.

  3. Hello,
    there is something strange happening in Firefox 7.
    I opened Firefox with all the plugins disabled and didn’t use it for an hour.
    Initially (not just after startup, but after everything was loaded and working):
    Memory: 362880 KB
    Page faults: 678471
    After an hour (it wasn’t used during this hour):
    Memory: 432164 KB
    Page faults: 1302443

    Also, even when I don’t use it, it constantly allocates and frees memory (for example, it’s now using 380 MB), causing a lot of page faults. Maybe if we find out why this happens, we can reduce page faults.

  4. I know you can run a VM with different amounts of memory on Windows as well…. but I really want to ask: what do you mean by page faults or hard page faults?

  5. A hard page fault is when the page isn’t loaded in physical memory (for example, if it has been swapped out to disk).

  6. Luke Wagner

    Thanks for writing this Nick; I hope it stirs someone up :)

    > But then, swapping may not be entirely your fault.
    > Some other application may be sucking memory, etc.

    That is the point of having a test harness that limits available physical memory: when it is run on a machine with sufficient physical memory, one can expect any faulting to be caused by this set limit, not by some unrelated process.

  7. I guess I’m puzzled as to how this differs from something that simply measures the rate of memory growth per operation (or per set of operations)? Do you have any measures like that currently? i.e.:

    -> Start Firefox -> (measure)
    -> open foo.com in a new tab -> (measure)
    -> follow some links -> (measure)
    -> do some interaction -> (measure)
    -> open three child pages -> (measure)
    -> close two of them -> (measure)

    Repeat daily, and plot either as a surface/intensity map or plot lines for individual measurement stages.

  8. Nicholas Nethercote

    voracity: we have benchmarks for measuring memory consumption, and we’re working on more.

    Recall how I said that memory consumption and page fault rates aren’t perfectly correlated — memory consumption could go up or down without the page fault rate moving the same way. The point of the page fault benchmark is to account for that.

  9. Renaud Durlin

    If hard page faults increase, you should see a regression in perf benchmarks or memory benchmarks, no? I’m not sure why you need another kind of benchmark.

    I think it’s much easier to measure time and memory than page faults.

    • Nicholas Nethercote

      Renaud: You haven’t understood the point about the uneven performance curve for page faults.

      Consider a change that increases page faults, but it only manifests on smaller devices like netbooks. If you’re only regression testing on desktop machines, you won’t realize what has happened.

      Plus, as I said to voracity, the correlation between page faults and memory consumption is strong but not perfect.

  10. Renaud Durlin

    Nicholas: yes, but the problem here is that you need to run benchmarks on both desktops and netbooks, not that you need another kind of benchmark.

    The problem with measuring page faults is that it may be hard to find regressions.

    If one test is suddenly two times slower or uses 100MB more memory, you know that it’s bad. But if one test produces one hundred more page faults, what’s the conclusion? Is it a real regression or only noise?