Category Archives: Valgrind

A better DHAT

DHAT is a heap profiler that comes with Valgrind. (The name is short for “Dynamic Heap Analysis Tool”.) It tells you where all your heap allocations come from, and can help you find the following: places that cause excessive numbers of allocations; leaks; unused and under-used allocations; short-lived allocations; and allocations with inefficient data layouts. This old blog post goes into some detail.

In the new Valgrind 3.15 release I have given DHAT a thorough overhaul.

The old DHAT was very useful and I have used it a lot while profiling the Rust compiler. But it had some rather annoying limitations, which the new DHAT overcomes.

First, the old DHAT dumped its data as text at program termination. The new DHAT collects its data in a file which is read by a graphical viewer that runs in a web browser. This gives several advantages.

  • The separation of data collection and data presentation means you can run a program once under DHAT and then sort and filter the data in various ways, instead of having to choose a particular sort order in advance. Also, full data is in the output file, and the graphical viewer chooses what to omit.
  • The data can be sorted in more ways than previously. Some of these sorts involve useful filters such as “short-lived” and “zero reads or zero writes”.
  • The graphical viewer allows parts of the data to be hidden and unhidden as necessary.

Second, the old DHAT divided its output into records, where each record consisted of all the heap allocations that have the same allocation stack trace. The choice of stack trace depth could greatly affect the output.

In contrast, the new DHAT is based around trees of stack traces that avoid the need to choose stack trace depth. This avoids both the problem of not enough depth (when records that should be distinct are combined, and may not contain enough information to be actionable) and the problem of too much depth (when records that should be combined are separated, making them seem less important than they really are).
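
To make the depth problem concrete, here is a toy sketch in Python of aggregating allocations into a tree keyed by stack frames. It is not DHAT's implementation: the frame names, the byte counts, and the helpers `new_node`, `add_trace` and `bytes_at` are all invented for illustration, and ordering frames from the allocation site outward is an assumption of the sketch. Each node accumulates the bytes of every allocation whose stack passes through it, so a shallow node gives the combined view and its children give the separated view, with no depth chosen up front.

```python
def new_node():
    """One tree node: bytes allocated via this path, plus child frames."""
    return {"bytes": 0, "children": {}}


def add_trace(root, frames, nbytes):
    """Record one allocation.

    `frames` runs from the allocation site outward to its callers
    (an assumption of this sketch).
    """
    root["bytes"] += nbytes
    node = root
    for frame in frames:
        node = node["children"].setdefault(frame, new_node())
        node["bytes"] += nbytes


def bytes_at(root, frames):
    """Total bytes for all allocations whose stack starts with `frames`."""
    node = root
    for frame in frames:
        node = node["children"][frame]
    return node["bytes"]


# Hypothetical allocations: (stack trace, bytes).
root = new_node()
add_trace(root, ["malloc_wrapper", "vec_push", "parse"], 300)
add_trace(root, ["malloc_wrapper", "vec_push", "lex"], 100)
add_trace(root, ["malloc_wrapper", "str_dup", "parse"], 50)

print(bytes_at(root, ["malloc_wrapper"]))                       # 450
print(bytes_at(root, ["malloc_wrapper", "vec_push"]))           # 400
print(bytes_at(root, ["malloc_wrapper", "vec_push", "parse"]))  # 300
```

At depth 1 everything collapses into one uninformative record; at depth 3 the `vec_push` total is split across callers. The tree holds both views at once.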

Third, the new DHAT also collects and/or shows data that the old DHAT did not.

  • Byte and block measurements are shown with a percentage relative to the global measurements, which helps gauge relative significance of different parts of the profile.
  • Byte and block measurements are also shown with an allocation rate (bytes and blocks per million instructions), which enables comparisons across multiple profiles, even if those profiles represent different workloads.
  • Both global and per-node measurements are taken at the global heap peak, which gives Massif-like insight into the point of peak memory use.
  • The final/lifetimes stats are a bit more useful than the old deaths stats. (E.g. the old deaths stats didn’t take into account lifetimes of unfreed blocks.)
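
As a rough illustration of the rate measurement in the list above, the sketch below normalises byte counts by instructions executed. The two profiles and all their numbers are invented, and `per_million_instructions` is a hypothetical helper; only the unit (bytes per million instructions) comes from the text.

```python
def per_million_instructions(count, instructions):
    """Rate of `count` (bytes or blocks) per million executed instructions."""
    return count / (instructions / 1_000_000)


# Hypothetical profiles of two different workloads.
small = {"bytes": 2_000_000, "instructions": 50_000_000}
large = {"bytes": 30_000_000, "instructions": 1_500_000_000}

for name, profile in (("small", small), ("large", large)):
    rate = per_million_instructions(profile["bytes"], profile["instructions"])
    print(f"{name}: {rate:,.0f} bytes per million instructions")
```

In this made-up case the larger workload allocates 15 times as many bytes in total but at half the rate, which is the kind of comparison raw totals would hide.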

Finally, the new DHAT handles realloc better. The sequence p = malloc(100); realloc(p, 200); now increases the total block count by 2 and the total byte count by 300. In the old DHAT it increased them by 1 and 200. The new handling is a more operational view that better reflects the effect of allocations on performance. It makes a significant difference in the results, giving paths involving reallocation (e.g. repeated pushing to a growing vector) more prominence.
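
The two accounting rules can be modelled with a few lines of Python. This is a toy model, not DHAT's code: `count_allocations` is a hypothetical helper, and the "old" rule (a realloc adds only the growth in size to the byte total) is a guess chosen because it reproduces the 1 block and 200 bytes quoted above. The "new" rule simply treats every realloc as a fresh allocation.

```python
def count_allocations(events, realloc_is_new_block=True):
    """Return (total_blocks, total_bytes) for (op, size) events on one pointer."""
    blocks = 0
    total_bytes = 0
    current = 0  # size of the live block
    for op, size in events:
        if op == "malloc":
            blocks += 1
            total_bytes += size
        elif op == "realloc":
            if realloc_is_new_block:
                # New-style accounting: a realloc counts as a whole new block.
                blocks += 1
                total_bytes += size
            else:
                # Assumed old-style accounting: same block, only growth is added.
                total_bytes += max(size - current, 0)
        else:
            raise ValueError(f"unknown op: {op}")
        current = size
    return blocks, total_bytes


# The sequence from the text: p = malloc(100); realloc(p, 200);
events = [("malloc", 100), ("realloc", 200)]
print(count_allocations(events, realloc_is_new_block=True))   # (2, 300)
print(count_allocations(events, realloc_is_new_block=False))  # (1, 200)

# A vector that doubles its capacity as elements are pushed.
vector = [("malloc", 8), ("realloc", 16), ("realloc", 32), ("realloc", 64)]
print(count_allocations(vector, realloc_is_new_block=True))   # (4, 120)
print(count_allocations(vector, realloc_is_new_block=False))  # (1, 64)
```

The growing-vector case shows why reallocation-heavy paths gain prominence: four blocks and 120 bytes under the new rule against one block and 64 bytes under the assumed old one.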

Overall these changes make DHAT more powerful and easier to use.

The following screenshot gives an idea of what the new graphical viewer looks like.

Sample output from DHAT's viewer

The new DHAT can be run using the --tool=dhat flag, in contrast with the old DHAT, which was an “experimental” tool and so used the --tool=exp-dhat flag. For more details see the documentation.

Notes on Reducing Firefox’s Memory Consumption

I gave a talk yesterday at the Browser MiniConf, held in Ballarat, Australia.  Its title was “Notes On Reducing Firefox’s Memory Consumption”.

Below are the slides and notes in a SlideShare embedding. If you find that embedding problematic (some people do) you may prefer to download the PDF version directly.

Reducing about:memory’s “heap-unclassified” measurement with DMD

about:memory is a really useful tool, but everyone always complains about the “heap-unclassified” number being too high.  It’s too high because we don’t have enough memory reporters implemented.

DMD is a tool that identifies where new memory reporters should be added to reduce “heap-unclassified”.  I’ve been using it for a couple of months now, and I’ve just written instructions on how to use it.  It’s not easy, but hopefully it’s doable.

It’s probably worth pointing out that we have a pretty good handle on what needs to be done to reduce “heap-unclassified” significantly — check out the list of memory reporters to be implemented.  Almost all of those bugs were filed based on data from DMD.  The single most important bug in that list is the one to add missing style reporters;  I see multiple megabytes of CSS stuff all the time in DMD’s output.

MemShrink progress, week 9

Firefox 8 graduated to the Aurora channel this week, and the development period for what will become Firefox 9 began.  Lots of MemShrink activity happened this week, and I think all the changes listed below will make it into Firefox 8.

Avoiding Wasted Memory

I have blogged previously about memory wasted by “clownshoes” bugs.   Ed Morley found a webpage that resulted in 700MB of memory being wasted by the PLArena clownshoes bug.  Basically, on platforms where jemalloc is used (Windows, Linux), half the memory allocated by nsPresArena (which is built on top of PLArena) was wasted.  (On Mac the waste was 11%, because the Mac allocator rounds up less aggressively than jemalloc).

Fixing this problem properly for all PLArenas takes time because it requires changes to NSPR, so I made a spot-fix for the nsPresArena case.  This is a particularly big win on very large pages, but it saves around 3MB even on Gmail. This spot-fix has been granted beta approval and so will, barring disaster, make it into Firefox 7.

A Firefox Nightly user did some measurements with different browsers on the problematic page:

  • Firefox 8.0a1 before patch: 2.0 GB
  • Firefox 8.0a1 after patch: 1.3 GB
  • Latest Chrome canary build and dev (15.0.849.0): 1.1 GB
  • Webkit2Process of Safari 5.1: 1.05 GB
  • Internet Explorer 9.0.2: 838 MB
  • Latest Opera Next 12.00: 727 MB

So this fix gets Firefox within spitting distance of other browsers, which is good!

In other developments related to avoiding wasted memory:

  • Luke Wagner discovered that, on typical websites, most JSScripts are byte-compiled but never run.  A JSScript roughly corresponds to a JavaScript function.  In hindsight, it’s not such a surprising result — Firefox byte-compiles all loaded JavaScript code, and you can imagine lots of websites use libraries like jQuery but only use a small fraction of the functions in the library.  Making byte-compilation lazy could potentially save MBs of memory per compartment.  But that will require some non-trivial reworking of the JS engine, and so is unlikely to happen in the short-term.
  • Kyle Huey avoided a small amount (~100KB per browser process) of waste due to rounding up in XPT arenas.

Improving about:memory

I made some progress on a Valgrind tool to help identify the memory that is currently reported only by the “heap-unclassified” bucket in about:memory.  It’s called “DMD”, short for “Dark Matter Detector”.  It’s in early stages and I still need to teach it about most of Firefox’s memory reporters, but it’s already spitting out useful data, which led to me and Ehsan Akhgari landing memory reporters for the JS atom table and the Hunspell spell checker.  We also now have some insight (here and here) about memory usage for very large pages.

Mounir Lamouri turned on the memory reporter for the DOM that he’s been working on for some time.  This shows up under “dom” in about:memory.  There are still some cases that require handling;  you can follow the progress of these here.

Andrew McCreight replaced about:memory’s buttons so you can force a cycle collection without also forcing a garbage collection, which may be useful in hunting down certain problems.

Finally, Sander van Veen added the existing “js-compartments-user” and “js-compartments-system” to the statistics collected by telemetry (his first landed patch!), and I did likewise for the “storage/sqlite” reporter.  I also added a new “tjit-data/trace-monitor” memory reporter that accounts for some of the memory used by TraceMonkey.


Igor Bukanov tweaked the handling of empty chunks by the JavaScript garbage collector.  That sounds boring until you see the results on Gregor Wagner’s 150-tab stress test: resident memory usage dropped 9.5% with all 150 tabs open, and dropped by 27% after all those tabs were closed.

Brian Hackett fixed a memory leak in type inference, which gets it one step closer to being ready to land.

Christian Höltje fixed a leak in his “It’s All Text” add-on that was causing zombie compartments.  This fix will be in version 1.6.0, which is currently awaiting AMO approval, but can be obtained here in the meantime.  This fix and last week’s fix of a memory leak in LastPass are very encouraging: per-compartment reporters in about:memory have, for the first time, given add-on developers a reasonable tool for identifying memory leaks.  I hope we can continue to improve the situation here.  Several people have asked me for documentation on how to avoid memory leaks in add-ons.  I’m not the person to write that guide (I’m not a Gecko expert and I know almost nothing about add-ons) but hopefully someone else can step up to the plate.

Bug counts

Here’s the change in MemShrink bug counts.

  • P1: 30 (-0, +1)
  • P2: 64 (-4, +6)
  • P3: 36 (-5, +0)
  • Unprioritized: 1 (-2, +1)

Good progress on P3 bugs, but they’re the least important ones.  Other than that, new bugs are still being reported faster than they’re being fixed.  If you want to help but don’t know where to start, feel free to email me or ping me on IRC and I’ll do my best to help get you involved.


Using Valgrind to get stack traces

Sometimes I want to do some printf-style debugging where I print not only some values, but also the stack trace each time a particular code point is hit. GNU provides a backtrace() function that supposedly does this, but I tried it and got hopeless results, little more than code addresses.

Fortunately, you can do this pretty easily with Valgrind.  First, add this line somewhere in your source code:

  #include <valgrind/valgrind.h>

Then, at the point where you want to print the stack trace, add this:

  VALGRIND_PRINTF_BACKTRACE("foo");

You can of course print something other than “foo”.  In fact, VALGRIND_PRINTF_BACKTRACE is a variadic printf-style function, so you can do stuff like this:

  VALGRIND_PRINTF_BACKTRACE("%s: %d\n", str, i);

You then have to run the program under Valgrind as usual, except you probably should use --tool=none because that’ll run the quickest.

This is a trick I find occasionally invaluable.

Memory profiling Firefox with Massif, part 2

To follow up from this post: we’ve made some good progress on reducing JaegerMonkey’s memory consumption in Firefox 4, though there’s still a way to go.  Julian Seward will blog about this shortly.  In the meantime, I thought I’d share a particularly useful Massif invocation that Rob Sayre inspired me to concoct:

  valgrind \
  --smc-check=all --trace-children=yes \
  --tool=massif \
  --pages-as-heap=yes --detailed-freq=1000000 \
  --threshold=0.5 \
  --alloc-fn=mmap \
  --alloc-fn=syscall \
  --alloc-fn=pages_map \
  --alloc-fn=chunk_alloc \
  --alloc-fn=arena_run_alloc \
  --alloc-fn=arena_bin_malloc_hard \
  --alloc-fn=malloc \
  --alloc-fn=realloc \
  --alloc-fn='operator new(unsigned long)' \
  --alloc-fn=huge_malloc \
  --alloc-fn=posix_memalign \
  --alloc-fn=moz_xmalloc \
  --alloc-fn=JS_ArenaAllocate \
  --alloc-fn=PL_ArenaAllocate \
  --alloc-fn=NS_Alloc_P \
  --alloc-fn=NS_Realloc_P \
  --alloc-fn='XPConnectGCChunkAllocator::doAlloc()' \
  --alloc-fn='PickChunk(JSRuntime*)' \
  --alloc-fn='RefillFinalizableFreeList(JSContext*, unsigned int)' \
  --alloc-fn=sqlite3MemMalloc \
  --alloc-fn=mallocWithAlarm \
  --alloc-fn=sqlite3Malloc \
  <executable>

Good grief!  What a mess.  Don’t blame Massif for this, though;  it’s because Firefox has so many custom memory allocators.

With that invocation, the output of ms_print becomes something that is comprehensible to people other than Massif’s author 🙂  Here’s an extract of the output which gives a high-level view of Firefox’s memory consumption on 64-bit Linux after loading 20 tabs, each with a random comic from a JavaScript-heavy site:

31.04% (366,878,720B) _dl_map_object_from_fd (dl-load.c:1195)
15.73% (185,998,724B) in 3693 places, all below massif's threshold (00.00%)
15.62% (184,639,488B) pthread_create@@GLIBC_2.2.5 (allocatestack.c:483)
05.68% (67,112,960B) pa_shm_create_rw (in /usr/lib/
04.35% (51,372,032B) JSC::ExecutablePool::systemAlloc(unsigned long) (ExecutableAllocatorPosix.cpp:43)
03.30% (38,993,920B) js::InitJIT(js::TraceMonitor*) (jstracer.cpp:7644)
03.11% (36,741,120B) js::InitJIT(js::TraceMonitor*) (jstracer.cpp:7643)
02.87% (33,935,360B) js::PropertyTree::newShape(JSContext*, bool) (jspropertytree.cpp:97)
02.84% (33,554,432B) js_NewFunction(JSContext*, JSObject*, int (*)(JSContext*, unsigned int, js::Value*), unsigned int, unsigned int, JSObject*, JSAtom*) (jsgcinlines.h:127)
02.79% (32,923,648B) js::InitJIT(js::TraceMonitor*) (jstracer.cpp:7642)
01.99% (23,555,684B) js::mjit::Compiler::finishThisUp(js::mjit::JITScript**) (jsutil.h:213)
01.69% (19,934,784B) JSScript::NewScript(JSContext*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned short, unsigned short) (jsutil.h:209)
01.53% (18,067,456B) pcache1Alloc (sqlite3.c:33368)
01.48% (17,457,388B) nsStringBuffer::Alloc(unsigned long) (nsSubstring.cpp:206)
01.31% (15,478,784B) g_mapped_file_new (in /lib/
00.89% (10,486,784B) JS_NewObject (jsgcinlines.h:127)
00.71% (8,388,608B) js::StackSpace::init() (jscntxt.cpp:164)
00.68% (8,093,696B) GCGraphBuilder::NoteScriptChild(unsigned int, void*) (mozalloc.h:229)
00.68% (8,024,064B) NewOrRecycledNode(JSTreeContext*) (jsparse.cpp:495)
00.67% (7,974,936B) js::Vector<unsigned short, 32ul, js::ContextAllocPolicy>::growStorageBy(unsigned long) (jsutil.h:217)
00.53% (6,291,456B) js_CloneRegExpObject(JSContext*, JSObject*, JSObject*) (jsgcinlines.h:127)
00.52% (6,190,836B) nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:68)

The total is 1,182,094,880 bytes.

  • 31.04% is from _dl_map_object_from_fd.  This corresponds to code and data segments, mostly from libraries.
  • 15.73% is from allocation points small enough that they fell below the threshold (0.5%) that I used for this run.
  • 15.62% is from pthread_create, i.e. thread stacks.  Hopefully most of this space also won’t be mapped in.
  • 5.68% is from pa_shm_create_rw.  Bug 617852 is open about this.  It won’t be fixed until after Firefox 4.0, but that’s not so bad because /proc/pid/smaps tells me that hardly any of it is mapped into physical memory.
  • That leaves 31.93% of big, heap-ish allocations.  It’s pretty obvious that for this workload, the JS engine is being greedy, accounting for 26.42% of that 31.93%.  One piece of good news is that the three js::InitJIT() entries, which together account for 9.2%, will be greatly improved by bug 623428;  I’m hoping to reduce them by a factor of 10 or more.
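
The figures in the list above can be re-derived from the ms_print extract with a little arithmetic. In the sketch below the percentages are copied from that extract; the grouping of entries under "JS engine" is an assumption, chosen because it reproduces the 26.42% figure, and the shortened entry names are only labels.

```python
# Percentages copied from the ms_print extract.
non_heap = {
    "_dl_map_object_from_fd": 31.04,
    "below threshold": 15.73,
    "pthread_create": 15.62,
    "pa_shm_create_rw": 5.68,
}

# The three js::InitJIT() entries.
init_jit = [3.30, 3.11, 2.79]

# Other entries assumed to belong to the JS engine.
other_js = {
    "JSC::ExecutablePool::systemAlloc": 4.35,
    "js::PropertyTree::newShape": 2.87,
    "js_NewFunction": 2.84,
    "js::mjit::Compiler::finishThisUp": 1.99,
    "JSScript::NewScript": 1.69,
    "JS_NewObject": 0.89,
    "js::StackSpace::init": 0.71,
    "NewOrRecycledNode": 0.68,
    "js::Vector::growStorageBy": 0.67,
    "js_CloneRegExpObject": 0.53,
}

heap_ish = 100.0 - sum(non_heap.values())
init_jit_total = sum(init_jit)
js_total = init_jit_total + sum(other_js.values())

print(f"heap-ish:  {heap_ish:.2f}%")        # heap-ish:  31.93%
print(f"JS engine: {js_total:.2f}%")        # JS engine: 26.42%
print(f"InitJIT:   {init_jit_total:.2f}%")  # InitJIT:   9.20%
```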

If anyone wants Massif’s full output, I’ll be happy to give it to them.  The full output contains full stack traces, which can be useful.

Some conclusions.

  • I’m still worried about our memory consumption, and I intend to keep pushing on it, both before Firefox 4.0 is released and afterwards.
  • Massif takes a bit of getting used to, particularly when you are profiling a huge, messy program like Firefox.  But it’s the only space profiler I know of that gives information that is detailed enough to be really useful in reducing memory consumption.  Without it, we wouldn’t have made much progress on reducing Firefox 4.0’s space consumption.  I’d love for other people to run it;  it works on Linux and Mac (not Windows, unfortunately).  I’m happy to help anyone who wants to try it via IRC or email.  For all the improvements done lately, I’ve only looked at a single workload on a single machine!  There’s much more analysis to be done.
  • If anyone knows of other decent memory profilers that can handle programs as complex as Firefox, I’d love to hear about them.  In particular, note that if you only measure the heap (malloc et al) you’re only getting part of the story;  this is again because we have multiple allocators which bypass malloc and use mmap/VirtualAlloc directly.
  • I wonder if we need better memory benchmarks.  I’d like to have some that are as easy to run as, say, SunSpider.  Better telemetry would also be great.

cg_diff: a differential profiling tool

I frequently use the SunSpider and V8 benchmark suites to compare the speed of different versions of TraceMonkey.  The best metric for speed comparisons is always execution time.  However, measuring execution time on modern machines is unreliable: you get lots of variation between runs.  This is a particular problem in this case because the run-times of these benchmarks are very short: SunSpider takes less than 700 ms on my laptop, and V8 takes about 6.5 seconds.  Run-to-run variations can be larger than the difference I’m trying to measure.  This is annoying:  the best speed metric cannot be measured exactly.

So I frequently use Cachegrind to measure the number of executed instructions.  This is a worse metric than execution time — the number of instructions doesn’t directly relate to the execution time, although it’s usually a good indicator — but it has the advantage that it can be measured exactly.  Most of the SunSpider and V8 tests are deterministic, and if I measure them twice in a row I’ll get the same result.  Cachegrind also gives instruction counts on a per-function and per-line basis, which is very useful.

So I often run Cachegrind on two different versions of TraceMonkey:  an unchanged copy of the current repository tip, and a copy of the current repository tip with a patch applied.  I can then compare the results and get a very precise idea of how the patch affects performance.

However, comparing the output of two Cachegrind runs manually is a pain.  For example, here is part of Cachegrind’s output (lightly edited for clarity) for crypto-md5.js with an unchanged repository tip (as of a day or two ago):

48,923,280  PROGRAM TOTALS
        Ir  file:function
 5,638,362  ???:???
 4,746,990  /build/buildd/eglibc-2.10.1/string/../sysdeps/i386/i686/strcmp.S:strcmp
 2,032,069  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
 1,899,298  jstracer.cpp:bool js::VisitFrameSlots<js::CountSlotsVisitor>(...)
 1,759,932  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
 1,232,425  jsops.cpp:js_Interpret
   885,168  jstracer.cpp:bool js::VisitFrameSlots<js::DetermineTypesVisitor>(...)
   871,197  jstracer.cpp:js::TraceRecorder::set(int*, nanojit::LIns*, bool)
   812,419  /build/buildd/eglibc-2.10.1/iconv/gconv_conf.c:insert_module
   758,034  jstracer.cpp:js::TraceRecorder::monitorRecording(JSOp)

At the top we have the total instruction count, and then we have the instruction counts for the top 10 functions.  The ???:??? entry represents code generated by TraceMonkey’s JIT compiler, for which there is no debug information.  “Ir” is short for “I-cache reads”, which is equivalent to “instructions executed”.

Cachegrind tracks a lot more than just instruction counts, but I’m showing only instruction counts here to keep things simple. It also gives per-line counts, but I’ve omitted them as well.

And here is the corresponding output when a patch from bug 575529 is applied:

42,332,998  PROGRAM TOTALS
       Ir  file:function
4,746,990  /build/buildd/eglibc-2.10.1/string/../sysdeps/i386/i686/strcmp.S:strcmp
4,100,366  ???:???
1,687,434  jstracer.cpp:bool js::VisitFrameSlots(js::CountSlotsVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
1,343,085  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
1,229,853  jsops.cpp:js_Interpret
1,137,981  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
  868,855  jstracer.cpp:js::TraceRecorder::set(int*, nanojit::LIns*, bool)
  812,419  /build/buildd/eglibc-2.10.1/iconv/gconv_conf.c:insert_module
  755,753  jstracer.cpp:js::TraceRecorder::monitorRecording(JSOp)
  575,200  jsscan.cpp:js::TokenStream::getTokenInternal()

It’s easy to see that the total instruction count has dropped from 48.9M to 42.3M, but seeing the changes at a per-function level is more difficult. For a long time I would make this comparison manually by opening the two files side-by-side and reading carefully.  Sometimes I’d also do some cutting-and-pasting to reorder entries. The whole process was tedious, but the information revealed is so useful that I did it anyway.

Then three months ago David Baron asked on Mozilla’s dev-platform mailing list if anybody knew of any good differential profiling tools. This prompted me to realise that I wanted exactly such a tool for Cachegrind. Furthermore, as Cachegrind’s author, I was in a good place to understand exactly what was necessary 🙂

The end result is a new script, cg_diff, that can be used to compute the difference between two Cachegrind output files. Here’s part of the difference between the above two versions:

-6,590,282  PROGRAM TOTALS
        Ir  file:function
-1,537,996  ???:???
  -894,088  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
  -416,847  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
  -405,271  jstracer.cpp:bool js::VisitFrameSlots(js::DetermineTypesVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
  -246,047  nanojit/Containers.h:nanojit::StackFilter::read()
  -238,121  nanojit/Assembler.cpp:nanojit::Assembler::registerAlloc(nanojit::LIns*, int, int)
   230,419  nanojit/LIR.cpp:nanojit::interval::of(nanojit::LIns*)
  -226,070  nanojit/Assembler.cpp:nanojit::Assembler::asm_leave_trace(nanojit::LIns*)
  -211,864  jstracer.cpp:bool js::VisitFrameSlots(js::CountSlotsVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
  -200,742  nanojit/Assembler.cpp:nanojit::Assembler::findRegFor(nanojit::LIns*, int)

This makes it really easy to see what’s changed. Negative values mean that the instruction count dropped, positive numbers mean that the instruction count increased.
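
The core idea, subtracting per-function counts and sorting by the size of the change, can be sketched in a few lines of Python. This is not cg_diff itself, which works on Cachegrind output files; `diff_counts` is a hypothetical helper, and the dictionaries hold only a handful of the counts shown in the two listings above.

```python
STRCMP = "/build/buildd/eglibc-2.10.1/string/../sysdeps/i386/i686/strcmp.S:strcmp"
SLOT_TYPE = "jstracer.cpp:js::TraceRecorder::determineSlotType(int*)"
REALLOC_CHECK = "jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()"
INTERPRET = "jsops.cpp:js_Interpret"

# Instruction counts (Ir) copied from the two listings above.
before = {
    "???:???": 5_638_362,
    STRCMP: 4_746_990,
    SLOT_TYPE: 2_032_069,
    REALLOC_CHECK: 1_759_932,
    INTERPRET: 1_232_425,
}
after = {
    "???:???": 4_100_366,
    STRCMP: 4_746_990,
    SLOT_TYPE: 1_137_981,
    REALLOC_CHECK: 1_343_085,
    INTERPRET: 1_229_853,
}


def diff_counts(before, after):
    """Return (name, after - before) pairs, largest absolute change first."""
    names = set(before) | set(after)
    # A function missing from one profile is treated as having count 0 there.
    deltas = {n: after.get(n, 0) - before.get(n, 0) for n in names}
    return sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))


for name, delta in diff_counts(before, after):
    print(f"{delta:>10,}  {name}")
```

The first three lines of output match the corresponding cg_diff entries (-1,537,996, -894,088 and -416,847). Functions that did not change, such as strcmp, sort to the bottom with a difference of 0.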

I’ve been using this script for a while now, and it’s really helped me analyse the performance effects of my patches. Indeed, I have some scripts set up so that, with a single command, I can run all of SunSpider through two different versions of TraceMonkey and produce both normal profiles and difference profiles. I can also get high-level instruction comparisons such as the one in this Bugzilla comment.

And now everybody else can use cg_diff too, because I just landed it on the Valgrind trunk. If you want to try it, follow these instructions to set up a copy of the trunk.  And note that if you want to compare two versions of a program that sit in different directories (as opposed to profiling a program, modifying it, then reprofiling it) you’ll need to use cg_diff’s --mod-filename option to get useful results.  Feel free to ask me questions (via email, IRC or in the comments below) if you have trouble.

Happy differencing!

Valgrind and Mac OS 10.6

A new entry in the annals of unfortunate software release dates:

  • On August 19, Valgrind 3.5.0 was released. It added support for Mac OS 10.5.
  • On August 28, Mac OS 10.6 was released.
  • Valgrind 3.5.0 does not support Mac OS 10.6.

If you try to install Valgrind on a machine running Mac OS 10.6, it will fail at configure-time.  If you hack the configure file appropriately so that the install completes, Valgrind will run but crash quickly on any program.  Bug 205241 has the details.  Greg Parker says he has a series of patches to make Valgrind work and he’s just waiting for the open source release of xnu (the core of Mac OS X) before making them public.  With some luck, these fixes will make it into Valgrind 3.5.1 relatively soon.

However, once that’s fixed, there’s another problem.  Mac OS 10.6 uses 64-bit executables by default.  In comparison, 10.5 uses 32-bit executables by default, even though it’s capable of creating and running 64-bit executables.  Unfortunately Valgrind’s support for 64-bit executables on Mac OS X isn’t very good.  The main problem is that start-up is sloooooow, which means that even Hello World takes over four seconds to run on my MacBook Pro.  Fixing this one will be harder, as it will require reworking the Mac OS X start-up sequence.  Bug 205938 is tracking this problem.

Related to this: does anyone know if there is an easy way to have both 10.5 and 10.6 installed on a single machine?  That would be a big help when it comes to developing and testing Valgrind’s Mac OS X support.

Valgrind 3.5.0 has been released

Valgrind 3.5.0 has been released!  It’s the first release that supports Mac OS X.  It also adds a number of other new features and a whole lot of bug-fixes.  See the release notes for details.  Many thanks to everyone who contributed to this release.

Valgrind + Mac OS X update (July 17, 2009)

We’re now in the preparation phase for the 3.5.0 release of Valgrind, which will be the first release with Mac OS X support.  We’ve absorbed some Mozilla culture in the Valgrind development process: we’re now using Bugzilla much more effectively.  We have 17 open blockers (and 18 closed blockers), and 41 open “wanted” bugs (and 7 closed ones).  Any contributions towards fixing these bugs are most welcome!  We’re hoping to release in early August.