In last week’s episode of The Wonderful World of CFI Unwinding we looked at how fast it could be done.  This week we’re going to look at how compactly the unwind information can be stored.

But I know you want a summary up front, so here it is: I reckon it’s possible to store enough CFI to unwind all of libxul.so on x86_64-linux in just 5.4MB.  At least, if my numbers are not wrong.

Background

Representing CFI and EXIDX unwind information compactly in memory is critical for memory constrained devices running FirefoxOS and Fennec.

The title of this posting is misleading, though.  The question is not “how compactly can it be stored?”, but rather “how compactly can it be stored and still give us the really fast access we need?”.

I did some digging.  Valgrind will, if asked nicely, print out the CFI unwinding rules as it stores them.  The previous posting described how it only stores unwind information for a limited register set — on x86_64-linux, which I’m experimenting with here — %rip, %rsp and %rbp.  It manages to summarise the unwind rule for each address range into just one line of text:

  [0x53eea71 .. 0x53eea84]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)

This says: for %rip values in the range 0x53eea71 to 0x53eea84, first compute

  cfa = %rbp + 16

then

  previous %rip = *(cfa-8)
  previous %rsp = cfa
  previous %rbp = *(cfa-16)
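
To make that concrete, here is a minimal C++ sketch of applying just that one rule to step from a frame to its caller.  read_mem_u64 is an assumed helper that safely reads 8 bytes from the sampled thread's stack, and the names are illustrative rather than anything Valgrind actually uses.

   #include <cstdint>

   // Assumed helper: safely read 8 bytes from the sampled thread's stack.
   uint64_t read_mem_u64(uint64_t addr);

   struct Regs { uint64_t rip, rsp, rbp; };

   // Step from the current frame to its caller using the rule shown above.
   Regs apply_rule(const Regs& now) {
      uint64_t cfa = now.rbp + 16;              // let cfa = oldBP + 16
      Regs prev;
      prev.rip = read_mem_u64(cfa - 8);         // RA = *(cfa + -8)
      prev.rsp = cfa;                           // SP = cfa + 0
      prev.rbp = read_mem_u64(cfa - 16);        // BP = *(cfa + -16)
      return prev;
   }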

For libxul.so from a recent m-c, built with “-g -O”, there are 890,065 of these records — that is, the compiler gives unwind descriptions for 890,065 different code address ranges in libxul.so.

That sounds like a lot of data.  But those descriptions are immensely repetitive.  Just how repetitive we can see by generating the descriptions, cutting off the address range part and throwing the rest through sort -u:

  valgrind --smc-check=all-non-file --tool=none --trace-cfi=yes \
     --trace-symtab-patt="*libxul.so*" \
     ./ff-opt-linux/dist/bin/firefox-bin &> logfile

  cat logfile | grep "let cfa=" | cut -c 27- | sort -u | wc

There are just 75 different ones in nearly 900k address ranges!

In hindsight, that shouldn’t be a big surprise.  A description of how to recover the return address, stack pointer and frame pointer must be pretty boring.  Either they’re parked in memory at a small handful of 8-aligned offsets from the stack pointer, or they are the value of one of the previous registers plus or minus a similarly constrained offset.

In other words, the set of descriptions is small because GCC generates only a small set of stack frame layouts, if we restrict ourselves to considering just the parts of the frame needed for unwinding.

I was surprised by these figures, so I also tested the CFI on the main Valgrind executable.  That has 35,430 address ranges containing 160 unique descriptions.  Not quite as striking as the libxul.so case, but not far off.

Storing the address ranges compactly

The obvious storage optimisation is to park the 75 descriptions in a dictionary, the size of which is insignificant, and represent the 890,065 address ranges and dictionary-entry number as compactly as possible.  Then, sort these entries by address range and put them in a flat array, for fast binary search.

How compactly can we represent an (address, length, dictionary-entry-number) triple?  Naively, the address is a 64 bit word, and length and entry number could be 32 bits, giving 16 bytes in total.

That’s way overkill.  Since we’ll have one table per mapped object, the base address can be replaced by the offset from the base of the object.  libxul.so has about 36MB of text, so unfortunately that’s 4 bytes.  The address range lengths are tiny, though, mostly less than 256, so we could store them in a byte, and duplicate the descriptor for the occasional longer run.  And the dictionary entry number in the two cases I tested would fit in 8 bits.

So that’s 6 bytes per description.  For 890,065 descriptions, 5,340,390 bytes in total.
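
As a sanity check on that arithmetic, here is a minimal C++ sketch of such a 6-byte record, plus the binary search over the sorted array.  The layout and names are hypothetical, not what Valgrind or anything else actually stores.

   #include <cstddef>
   #include <cstdint>

   #pragma pack(push, 1)
   struct CfiRecord {
      uint32_t offset;    // code address, as an offset from the object's base
      uint8_t  len;       // range length; longer runs get duplicate records
      uint8_t  dict_ix;   // index into the ~75-entry rule dictionary
   };
   #pragma pack(pop)
   static_assert(sizeof(CfiRecord) == 6, "6 bytes per description");

   // Binary search over records sorted by offset.
   const CfiRecord* find_record(const CfiRecord* tab, size_t n, uint32_t off) {
      size_t lo = 0, hi = n;
      while (lo < hi) {
         size_t mid = lo + (hi - lo) / 2;
         if (off < tab[mid].offset)                       hi = mid;
         else if (off >= tab[mid].offset + tab[mid].len)  lo = mid + 1;
         else return &tab[mid];
      }
      return nullptr;   // PC not covered by any description
   }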

It would be interesting to see if the same level of duplication occurs for CFI and EXIDX on ARM.  But given that it merely reflects the non-diversity of frame layouts, I find it hard to believe we’d see anything much different.

EXIDX contains less information than CFI, yet — as experiments on Fennec/ARM have shown — it gives good unwinding results.  So this analysis surely applies equally to our EXIDX-based targets, Fennec/ARM and FirefoxOS/ARM.

This is a long post.  Here’s a summary.

Valgrind unwinds CFI typically 30 times faster than Breakpad.  Bad though that sounds, it’s really encouraging.  If we can get even half that speedup — that is, 15 times faster than at present — CFI/EXIDX unwinding in SPS will be a lot more useful than it currently is.

There are a number of reasons for the discrepancy, which the post discusses in detail.

Background

For some time now, we’ve been producing nightlies that have the ability to time-profile themselves using the built-in SPS profiler.  SPS is a statistical sample-based profiler that samples specified (C++) threads by unwinding their stacks using a variety of mechanisms.

In the past couple of months, Fennec and Linux nightlies have acquired the ability to take samples by unwinding the stacks using the same “native” unwind information that would be used for diagnosing a crash with (for example) GDB.  On Linux that information is Dwarf2 CFI, and on ARM/Fennec we use the ARM-specific EXIDX format.  The unwinding is done using the Google Breakpad library that lives in our tree.

Breakpad works well in the sense that it unwinds reliably through libxul, system libraries and, apparently, JIT-produced code.  But it’s slow.  After considerable efforts at speeding it up, including some as-yet uncommitted changes (bug 893542 plus other hacks), Breakpad can unwind at a cost of about 6,600 instructions per frame for x86_64/Linux, and probably a bit worse for ARM/Android.

That’s expensive — for a typical 18 frame unwind that comes to around 120,000 instructions.  On a desktop processor that can sustain perhaps 2000 million instructions/second, that means we’d saturate one core at about 16,500 unwinds/second.  Supposing we’d like 90% of the CPU to be used for real work, not unwinding, that leaves a budget of only 1,650 unwinds/second, which is pretty hopeless considering we want to sample multiple threads at a 1 KHz rate.  On a low end phone, the problem is an order of magnitude worse.

I spent quite some time profiling Breakpad’s core CFI unwind loop on x86_64/Linux.  One thing that becomes very clear from this is that Breakpad is designed for generality, flexibility and correctness, but it is not designed for fast in-process unwinding.

Valgrind also does CFI based unwinding, and has been highly tuned over the years, particularly to support Helgrind, which is very unwind-intensive.  I wondered how it compared, so I profiled it — a bit of a tricky exercise, running a big test app on an inner Valgrind (which does a lot of unwinding) on an outer Valgrind, running Callgrind, to get profile data.

The numbers really surprised me.  Valgrind’s unwinder achieves about 220 instructions/frame.  That’s 30 times faster than my best Breakpad results so far.  Why the huge difference?  Surely some mistake?  I started digging.  First, though, a look at the algorithm.

The core CFI unwinding algorithm

The core algorithm is simple to understand.  We start off with registers taken from the innermost frame — as a minimum, three: the program counter, the stack pointer and the frame pointer.  The CFI data provides, for each possible instruction, a set of rules which say how to recover the register values in the calling frame.  So we apply those rules to our three registers, and repeat.  The stack trace we’re after is then the resulting sequence of PC values.

One thing to note is that, although we only want the program counter values, we need to compute values for multiple registers, including at the very least the stack pointer.  That’s because return addresses — hence, previous program counter values — are typically in memory at some SP offset, and so we need to know the SP values.

Diehard CFI-heads will recognise that the above description omits a lot of details.  Nonetheless it encapsulates what the unwinder needs to be fast at.  Viz:

   (PC, SP, FP) = <values taken from CPU registers at start of unwind>

   repeat
      output_array[index++] = PC
      current_ruleset = Lookup_in_huge_table(PC)
      new_PC = evaluate_rule(current_ruleset.rule_for_PC,  PC, SP, FP)
      new_SP = evaluate_rule(current_ruleset.rule_for_SP,  PC, SP, FP)
      new_FP = evaluate_rule(current_ruleset.rule_for_FP,  PC, SP, FP)
      PC = new_PC ; SP = new_SP; FP = new_FP;
   until
      end of stack reached, or we have enough frames

The rule for each register is usually simple, having one of the following two forms:

   new_REG =                 old_PC (or old_SP or old_FP) + constant
   new_REG = read-memory-at( old_PC (or old_SP or old_FP) + constant )

For example, it might well be the case that, at some point

   new_PC = read-memory-at( old_SP + 64 )

if we know that the return address is 64 bytes back up the stack.
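
Those two forms are simple enough that a rule can be a tiny struct and its evaluation a handful of instructions.  Here is an illustrative C++ sketch, again with an assumed read_mem_u64 helper; neither Valgrind nor Breakpad encodes rules exactly like this.

   #include <cstdint>

   // Assumed helper: safely read 8 bytes from the sampled thread's stack.
   uint64_t read_mem_u64(uint64_t addr);

   // new_REG = base + constant, optionally dereferenced.
   enum class Base : uint8_t { PC, SP, FP };
   struct Rule { Base base; bool deref; int32_t offset; };

   uint64_t evaluate_rule(const Rule& r,
                          uint64_t pc, uint64_t sp, uint64_t fp) {
      uint64_t base = (r.base == Base::PC) ? pc
                    : (r.base == Base::SP) ? sp : fp;
      uint64_t v = base + (int64_t)r.offset;
      return r.deref ? read_mem_u64(v) : v;
   }

   // For example, "new_PC = read-memory-at(old_SP + 64)" would be
   // Rule{ Base::SP, true, 64 }.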

Implementation differences

With that in mind, the differences between the Breakpad and Valgrind implementations are as follows:

  • Lookup_in_huge_table(PC)

Valgrind maintains a 509-entry direct-mapped cache containing previously used rule-sets.  If we are doing a lot of unwinds, most of them will involve the same few hundred instructions, because the outer parts of the traces are identical or very similar.  It achieves a near-100% hit rate in practice.  Hence finding the entry comes down to computing “PC % 509”, a couple of loads and a conditional branch to check we’ve got the right entry.
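
A direct-mapped cache like that is only a few lines of code.  Here is a rough C++ sketch of the idea, with invented names; the real Valgrind code differs in detail.

   #include <cstdint>

   // One cached rule-set, tagged with the address range it covers.
   struct RuleSet { uint64_t pc_lo, pc_hi; /* rules for PC, SP, FP ... */ };

   // Fallback: the full, binary-searched rule table.
   const RuleSet* slow_lookup_in_huge_table(uint64_t pc);

   static const RuleSet* cache[509];

   const RuleSet* lookup(uint64_t pc) {
      const RuleSet*& e = cache[pc % 509];
      if (e && pc >= e->pc_lo && pc < e->pc_hi)   // hit: a modulus, a load
         return e;                                // or two, a compare
      e = slow_lookup_in_huge_table(pc);          // miss: consult the full
      return e;                                   // table and refill the slot
   }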

Breakpad has no such cache.  When bug 893542 lands it will have a std::map-based cache.  That’s a big improvement on no cache, but it’s still not anywhere near as good as a direct-mapped cache, since any lookup in the cache involves a std::map.find(), which means a red-black tree lookup, costing several hundred instructions.

Breakpad has another disadvantage: its architecture forces the cache test to be placed later than is optimal.  Valgrind loads the unwind rules for all shared objects at process startup.  Breakpad only reads debuginfo for an object at the first visit of the object.  Hence Breakpad’s unwind loop must first, inefficiently, check each PC value to find out whether it needs to read debuginfo for the containing shared object.  This alone adds a couple of hundred instructions per frame.

  • Representation of the rules

Breakpad is nothing if not flexible.  The set of registers it is prepared to unwind is not fixed.  Instead it depends on whatever registers the compiler-generated CFI provides unwind rules for. Hence Breakpad produces, at each unwind step, new values for all the integer registers that have a CFI unwind rule available at that location.

Valgrind, on the other hand, just unwinds a fixed set of registers which are known to be important in practice.  On x86_64-linux, the set is %rip, %rsp and %rbp.  On arm-linux: r15, r14, r13, r12, r11 and r7.  If unwinding needs the value of some other register it’s just too bad — unwinding stops and it’s Game Over.  But this never appears to be a problem in practice.

This disadvantages Breakpad in two ways.

Firstly, Breakpad unwinds more registers than Valgrind does, because the in-file CFI typically gives unwind rules for more registers.  On x86_64-linux, for example, the CFI rules often give unwind info for %r15, %r14, %r13, %r12 and %r11, which Valgrind simply ignores.

Secondly, Valgrind’s use of a fixed register set makes its data structures much more efficient.  Breakpad’s unwind loop has to maintain a std::map of register names to register values for all the currently tracked registers, and a similar map for register names to recovery rules.  Hence it spends much time iterating over, and looking up in, these mappings.  Despite efforts to make these lookups efficient, they can’t come close to Valgrind’s compiled-in (C-style) structs for register and rule sets, which require no lookups or iteration.
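
The difference is easy to see in code.  Here is a sketch of what a fixed register set buys you, reusing the Rule and evaluate_rule shapes from the sketch above (all names invented): one unwind step is three direct field accesses, with no map lookups or iteration anywhere.

   struct Regs       { uint64_t pc, sp, fp; };  // the fixed register set
   struct FixedRules { Rule pc, sp, fp; };      // one recovery rule per register

   Regs step(const Regs& r, const FixedRules& rs) {
      Regs next;
      next.pc = evaluate_rule(rs.pc, r.pc, r.sp, r.fp);
      next.sp = evaluate_rule(rs.sp, r.pc, r.sp, r.fp);
      next.fp = evaluate_rule(rs.fp, r.pc, r.sp, r.fp);
      return next;
   }

   // The fully general scheme, by contrast, needs something like
   //   std::map<std::string, uint64_t> regs;
   //   std::map<std::string, Rule>     rules;
   // and every step then pays for find() and iteration over both maps.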

  • Dynamic memory allocation

Valgrind unwinds with no dynamic memory allocation and dumps the resulting PC values in a known-to-be-large-enough caller-supplied array.

Breakpad constructs a std::vector of StackFrame objects.  Each frame therefore requires a heap allocation — and, eventually, deallocation, which just by themselves come to more than Valgrind’s total per-frame cost.  The 6,600 insn/frame cost is after I did some experimental restructuring to avoid these allocations.

  • Valgrind cheats

Valgrind takes shortcuts in two ways, which makes the comparison slightly unfair.

As described above, Breakpad unwinds all integer registers for which CFI rules are available.  Valgrind unwinds a fixed and generally smaller subset.  This doesn’t appear to make any difference in practice.

Breakpad tracks the validity state of each register.  This helps make unwinding more reliable and is needed to support transitioning between different frame recovery methods (CFI unwind, frame-pointer following, stack scanning) on a per-frame basis.  Really, Valgrind ought to do that too.  It will give some extra overhead, but not a lot — perhaps increasing the per-frame cost by 50%.

In short: more adaptation to real use cases, and more unwind speedups.

  • Benoit redid time-unit measurements so as to support profiling with sampling intervals of less than one millisecond.  This will help with profiling graphics and animation code (820048).
  • Gijs Kruitbosch made it possible to control the size of the profiler’s circular buffer using the MOZ_PROFILER_ENTRIES environment variable (901481).
  • :roc and :avih taught the compositor to notify the profiler of missed (graphics) frames, so as to help monitor and debug smoothness issues (900785).
  • glibc backtrace() unwinding on Linux was removed, as it risked deadlock, and the Breakpad unwinder made it redundant (880158).

There was work to improve the performance of CFI and EXIDX native unwinding, by improving performance of the data structures accessed by the CFI/EXIDX unwind algorithm:

  • speedups for GetModuleForAddress — replaced a linear search with a binary search (892774).
  • don’t generate useless frames via stack scanning, which turned out to be happening even though we thought stack scanning was disabled (894264).
  • the unwinder spends considerable effort repeatedly re-generating unwind rules for the same relatively small set of code addresses.  Caching them makes a big difference (893542).

With the above fixes in place, native unwinding performance for x86_64-linux costs about 8100 instructions per frame, or around 145k instructions/unwind for an average 18-frame unwind.

There are still opportunities to remove inefficiency, but they are becoming scarcer.  It might be possible to get to about 5000 insns/frame without too much extra effort, but going below that could be difficult.

 

There’s been quite some progress since the last report, focussed on Firefox for Android.

  • Anton Kovalyov landed changes to the GUI that show the event timelines for multiple threads in a synchronised way.  The multithreaded aspects of the profiler are stabilising nicely.
  • Multi-thread profiling support is ready.  Select ‘Multi-Thread’ when profiling to watch any registered thread.  For now a local build is recommended to tweak which threads you want to profile.
  • Android Java profiling support is ready.  Set ‘profiler.java’ to true on the host desktop machine before connecting to your phone. A new timeline will appear that represents the main Java thread’s execution.
  • Native unwind for Android nightlies is now available.  Select ‘breakpad’ in the GUI on the host desktop machine when connecting to your phone.  With this, full symbolicated profiles are available.
  • Native unwind on Android uses the ARM-specific EXIDX unwind format.  Addition of it to the nightly .apk increases size by about 2.5MB (10% ish), which is not a bad result for near-complete unwindability of libxul and the other libraries.
  • There was some additional work to improve the EXIDX unwind performance and stability to usable levels (883126, 882903).  Sampling beyond 200Hz is viable on a dual-core Cortex A9.  Investigations are in progress to improve unwind speed and reduce memory use, to levels that would make it useful for Firefox OS.

Since the last report, work has focused on making multithreaded profiling more usable, using ARM EXIDX unwind data for Android nightlies, and more analysis of unwind failures.  The latter two lines of investigation happened concurrently because there was some doubt as to how good EXIDX is, so I wanted to find out to what extent unwind failures were caused by EXIDX limitations as opposed to other problems.

Multithreaded profiling:

  • Benoit worked on facilities to select threads to be profiled by name (873914) and improved the names of web worker threads (865885).
  • Benoit also improved controllability of the profiler using environment variables (873915), which might be important for profiling B2G.

Unwind quality:

  • I fixed a problem (872496) caused by the way threads register their stacks for profiling.  They need to do it earlier.  This problem affects not only EXIDX unwinding, but CFI and frame-pointer unwinding too.  I also noticed that native unwinds for Linux nightlies had regressed recently, for the same reason.
  • The simple unwind-stats-gathering patch that landed a while back (863705) proved very useful in figuring this out.

EXIDX:

  • There was investigation into whether EXIDX is a suitable replacement for the Dwarf CFI format on Android.  EXIDX is an ARM-specific unwind data format, and has the advantage of being compact.  Including complete EXIDX unwind information in Android nightlies increases the .APK size from 28.8MB to only 30.7MB, which is an excellent result.

The concern with EXIDX is that it is designed for unwinding only at points where C++ exceptions could possibly be thrown.  It is not designed for the “anywhere, anytime” unwinding that Dwarf CFI provides, and so is not a drop-in replacement for CFI.  In particular, EXIDX does not provide correct unwind data in function prologues and epilogues, so if the thread being sampled happens to be in either of those places, we’re out of luck.

Despite that, EXIDX unwinding does appear to work pretty well.

It might be possible to get around the prologue/epilogue problem by fiddling with the topmost frame’s PC so as to move it away from such areas.  I haven’t investigated this yet.

  • Breakpad doesn’t actually handle EXIDX right now.  I picked up an EXIDX-reading patch originally from Linaro, with an initial port into Breakpad (863475) by TedM.  The patch required some rework but is now close to being landable.
  • It turns out NSPR was not being built with EXIDX generation, so all stack traces stopped at the first NSPR frame.  Bug 872649 fixes this.

There has been a lot of activity since the previous report, mostly along three lines of development: multithreaded profiling, getting profiling working for Android nightlies, and improving the quality of the stack traces.

We now have native unwinds working out-of-the-box for 32- and 64-bit Linux nightlies, and very nearly working for Android nightlies.

Multithreaded profiling:

  • Benoit landed a big change (734691) to the profiler backend, that makes it able to profile multiple threads.  Work is in progress (861863) to update the Cleopatra GUI to match, but it has not yet landed.  Early adopters are welcome to try out this new functionality.
  • Benoit also landed 788022, which adds support for profiling Java code running on the Dalvik VM on Android.

Getting profiling working for Android nightlies:

  • On Android, we landed plumbing code that allows Breakpad to read debug info direct out of APK files, by interfacing with faulty.lib (802240, 861141).
  • On Android, we landed a 1-liner that enabled profiling on nightlies (863264, 863375).  Then the fun started.  A few reftests started failing, so the patch was backed out.  After quite some digging, it turns out that this change had the effect of changing the compile flags from “-O2” to “-O2 -fno-omit-frame-pointer”, and gcc-4.6.2 wound up miscompiling gfxFont::RunMetrics::RunMetrics() in such a way as to copy 16 bytes of uninitialised stack-allocated garbage to the start of the object it is initialising.  Nathan Froyd suggested and landed a workaround.

Improving the quality of native stack traces:

  • I made a simple change that periodically shows counts of how many frames were recovered by CFI data, how many by following frame pointers, and how many by stack scanning (863705).  Using this, it is finally possible to get some idea of how well Breakpad is doing and why it sometimes produces results that are poorer than we expect.
  • Using the 863705 patch, I investigated unwinding behaviour on both 32- and 64-bit Linux, and concluded that we are doing well there.  In particular, Breakpad is able to unwind using frame pointers on 32-bit Linux, which had up to that point been somewhat in doubt. Any incomplete stack traces on that platform are due to system libraries which have been compiled without frame pointers and for which there is no CFI available.  There’s nothing we can do about them, except for resorting to stack scanning.  As discussed in the previous posting, stack scanning gives poor results and is disabled by default.
  • Also on the theme of diagnosing unwind problems, I modified SPS so as to produce CFI coverage statistics (859775).  This prints extra information at debug info load time, indicating how much CFI was read and how much address range it covers.  I had hoped to also be able to get the size of the relevant .text segment, so as to enable printing messages of the form “available CFI covers 85.7% of the text segment”.  This unfortunately didn’t work out due to the difficulties of getting the segment size.
  • The idea of enhancing Breakpad to use EXIDX unwind info on ARM came back to life and was generally well received.  TedM refreshed his patch (863475) and it is now waiting for further resync with our local Breakpad changes.

Misc other fixes:

  • A problem causing Firefox to livelock when starting any external program while the profiler is running (837390) was fixed.
  • As it stands, SPS/Breakpad will unwind up to 1024 stack frames before stopping.  That can happen if the stack is corrupted.  In the worst case this can potentially waste a lot of unwinder time, so we installed a 256 frame limit (859745).

 

Here is some news about work on Firefox’s built-in time profiler (SPS), and in particular the work being done to use native stack unwinding by using the in-tree copy of the Breakpad unwinding library.

There has been a lot of activity in the two and a half weeks since we got back from Paris.  Some of that has been fixing fallout from the initial patch (779291), which proved difficult to land.

The immediate goals are to have the new profiling backend available on nightly builds for desktop Linux, Android and B2G.  Currently 64- and 32-bit Linux work.  If you want to give it a try, grab one and follow the instructions shown here.

  • Benoit committed a patch (851748) to remove some of the worst code duplication resulting from the 779291 landing.  He also cleaned up some of the profiler headers (851611).
  • Currently the profiler backend is controlled by environment variables it reads at startup.  Benoit made a step towards making these configurable within the GUI (856331).
  • One current problem is that it’s difficult to establish why unwinding gives poor results.  Breakpad sometimes fails to find or read CFI information for a shared object.  But that’s difficult to diagnose because the profiler’s logging facilities are poor.  I improved Breakpad’s logging facilities somewhat (853851, 857242) so we can send diagnostic information to the Android log or Linux console.  An as-yet unactioned item is to extract CFI-coverage statistics from Breakpad (859775) so as to make it obvious where objects lack adequate CFI.
  • Another problem that quickly appeared is that, as a last-ditch measure, Breakpad will try to unwind using stack scanning.  This is inherently imprecise and often adds non-existent frames to the trace, which seriously confuses the profile results.  I added a patch (855977) that disables stack scanning by default but allows it to be selectively re-enabled if required.
  • There was some evaluation of unwinding ability on desktop Linux (855466).  64-bit works well; 32-bit works, but not so well, for reasons that are not yet entirely clear.
  • There is ongoing work to allow the profiler to work well on Android nightlies.  The problem is that nightlies use Mike Hommey’s alternative runtime linker (faulty.lib) and that requires some plumbing work to allow Breakpad to read unwind information direct out of the APK file.  Currently we have a proof of concept patch (802240) that appears to work, but requires some revision before landing.
  • There was investigation of a problem causing Firefox to livelock when starting any external program (837390).  This seems to be a bad interaction between the fork() syscall and the profiler signals.  An initial patch was posted.

Ongoing activities include: finishing up support for faulty.lib on Android, fixing the hang-at-fork problem, and trying to get a handle on why Breakpad gives us poorer unwind results than we expect, especially when doing frame-pointer based unwinding.

Following some sterling hackery by Alexander Potapenko and Philippe Waroquiers, Valgrind/Memcheck now has direct support for JEMalloc, tcmalloc, and any other allocation library offering a malloc-style interface.

This means you no longer need to build Firefox using --disable-jemalloc on Linux.  You do still need --enable-valgrind, to stop V freaking out when JS GC scans the stack, but that’s no big deal.

To use this, you need to give Valgrind the flag --soname-synonyms=somalloc=NONE.  This tells it that the bundle of functions it needs to intercept (malloc, etc) lives in the main executable instead of in libc.

You need a Valgrind trunk build of 11 May or later (svn r12559).

On MacOSX 10.7, all of this is irrelevant, since V appears to run a normal JEMalloc’d build just fine.

On Android, the same trickery gives support for JEMalloc, but requires some extra hacks which are not in the repo, so you can’t use it as-is.  I also have a bunch of code which checks for cross-DSO mismatches (eg, alloc by JEMalloc, free by libc), but that is also not committed yet.

Other recent developments, that you might want to upgrade to a trunk svn build for:

  • Translation chaining — improves performance noticeably in some circumstances, particularly on lower end hardware and for the lighter weight tools.
  • AVX support — trunk now contains partial AVX instruction set support, enough to run a Firefox built with “gcc-4.7.0 -mavx -O2”.  Work to give more complete support is ongoing.
  • GDB server — this has been around for a while, but doesn’t seem widely known.  Using it, you can control Valgrind with GDB, examine data when V reports errors, single step, all the usual stuff.  To use it, start Valgrind with --vgdb-error=0 and follow the directions it prints.
  • On MacOSX, reduced noise level when running Clang/LLVM generated code.

On Android, we have a couple of ways to periodically extract stack traces from a running Firefox: the built in profiler, and Valgrind.  In principle oprofile could also produce stacks at intervals, although it doesn’t.

Looking at zillions of stacks, or text-only output derived from them, isn’t much fun.  What I’d really like to do is take a large number of stacks sampled at equal cost intervals, and use Josef Weidendorfer’s excellent KCachegrind GUI to visualise the data.  KCachegrind, if you haven’t tried it, makes it easy to examine the dynamic call graph and to see how costs flow between callers and callees.

To do this, the call stacks need to be merged into an approximate dynamic call graph, cycles found and collapsed, and costs propagated back up from leaf nodes.  This produces the approximate inclusive and exclusive costs for each node (function) encountered.

So I wrote a program to do this.  It takes Valgrind-style call stacks and produces KCachegrind output files.  It could easily be generalised to other call stack generators — the call stack elements only need to be comparable for equality.
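
The core merging step is simple to sketch in C++.  Assuming each sampled stack arrives as a list of function names, outermost frame first, something like the following accumulates exclusive costs and caller-to-callee edge costs; cycle collapsing and the propagation of inclusive costs are omitted, and this is an illustration of the idea rather than the actual tool.

   #include <cstddef>
   #include <cstdint>
   #include <map>
   #include <string>
   #include <utility>
   #include <vector>

   struct Graph {
      std::map<std::string, uint64_t> self_cost;   // exclusive cost per function
      std::map<std::pair<std::string, std::string>, uint64_t> edge_cost;
   };

   // Merge one sampled stack (outermost frame first) into the graph:
   // one unit of cost lands at the leaf, and every caller->callee pair
   // in the stack gets one unit added to its edge.
   void add_stack(Graph& g, const std::vector<std::string>& frames) {
      if (frames.empty()) return;
      g.self_cost[frames.back()] += 1;
      for (size_t i = 0; i + 1 < frames.size(); i++)
         g.edge_cost[{frames[i], frames[i + 1]}] += 1;
   }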

I ran Firefox on Android, used the STR in bug 728846 as a test workload, and obtained about 32000 16-entry stacks, sampled at approximately million-instruction intervals.  Merging them and viewing the results in KCachegrind produces output which at least roughly tallies with the ad-hoc observations listed in the bug report.

There’s clearly room for improvement, particularly in the handling of cycles, and the fact that it’s limited by how often the stack unwinder produces an incomplete trace.  Nevertheless it’s an interesting bit of code to have around.  Here’s a snapshot of a bit of the graph leading up to fast_composite_over_8888_0565, which features prominently in the report.

 

And here’s a picture of KCachegrind showing the caller and callee relationships for moz_pixman_image_composite32.

Various sets of fixes have been committed for Valgrind on Lion.  It now works well enough to run 64 bit Firefox builds and get sane results.  32 bit builds run too, but appear to hang, for threading-related reasons I cannot figure out, despite quite some investigative effort.

There may be some false positives from Memcheck as a result of kludged-up syscall wrappers for some new syscalls that are 10.7-specific.  Let me know if you see errors which you think are obviously bogus.

Feedback is welcomed.  If you’re developing on Mac and have migrated to 10.7, I’d be interested to hear if it works for you. If you’re still on 10.6, I’d be interested to hear if I broke anything :-)  Btw, 10.5 support is pretty much dropped now — is anybody still using 10.5 for development?

The tracking bug report is valgrind bug 275168.

Quick reminder of how to check out and build:

  svn co svn://svn.valgrind.org/valgrind/trunk
  cd trunk
  ./autogen.sh
  ./configure --prefix=`pwd`/Inst
  make -j2
  make -j2 install
