How fast can CFI/EXIDX-based stack unwinding be?

This is a long post.  Here’s a summary.

Valgrind unwinds CFI typically 30 times faster than Breakpad.  Bad though that sounds, it’s really encouraging.  If we can get even half that speedup — that is, 15 times faster than at present — CFI/EXIDX unwinding in SPS will be a lot more useful than it currently is.

There are a number of reasons for the discrepancy, which the post discusses in detail.

Background

For some time now, we’ve been producing nightlies that have the ability to time-profile themselves using the built-in SPS profiler.  SPS is a statistical sample-based profiler that samples specified (C++) threads by unwinding their stacks using a variety of mechanisms.

In the past couple of months, Fennec and Linux nightlies have acquired the ability to take samples by unwinding the stacks using the same “native” unwind information that would be used for diagnosing a crash with (for example) GDB.  On Linux that information is Dwarf2 CFI, and on ARM/Fennec we use the ARM-specific EXIDX format.  The unwinding is done using the Google Breakpad library that lives in our tree.

Breakpad works well in the sense that it unwinds reliably through both libxul, system libraries and, apparently, JIT-produced code.  But it’s slow.  After considerable efforts at speeding it up, including some as-yet uncommitted changes (bug 893542 plus other hacks), Breakpad can unwind at a cost of about 6,600 instructions per frame for x86_64/Linux, and probably a bit worse for ARM/Android.

That’s expensive — for a typical 18 frame unwind that comes to around 120,000 instructions.  On a desktop processor that can sustain perhaps 2000 million instructions/second, that means we’d saturate one core at about 16,500 unwinds/second.  Supposing we’d like 90% of the CPU to be used for real work, not unwinding.  Then we have budget for only 1650 unwinds/second, which is pretty hopeless considering we want to sample multiple threads at a 1 KHz rate.  On a low end phone, the problem is an order of magnitude worse.

I spent quite some time profiling Breakpad’s core CFI unwind loop on x86_64/Linux.  One thing that becomes very clear from this is that Breakpad is designed for generality, flexibility and correctness, but it is not designed for fast in-process unwinding.

Valgrind also does CFI based unwinding, and has been highly tuned over the years, particularly to support Helgrind, which is very unwind-intensive.  I wondered how it compared, so I profiled it — a bit of a tricky exercise, running a big test app on an inner Valgrind (which does a lot of unwinding) on an outer Valgrind, running Callgrind, to get profile data.

The numbers really surprised me.  Valgrind’s unwinder achieves about 220 instructions/frame.  That’s 30 times faster than my best Breakpad results so far.  Why the huge difference?  Surely some mistake?  I started digging.  First, though, a look at the algorithm.

The core CFI unwinding algorithm

The core algorithm is simple to understand.  We start off with registers taken from the innermost frame — as a minimum, three: the program counter, the stack pointer and the frame pointer.  The CFI data provides, for each possible instruction, a set of rules which say how recover the register values in the calling frame.  So we apply those rules to our three registers, and repeat.  The stack trace we’re after is then the sequence of PC values resulting.

One thing to note is that, although we only want the program counter values, we need to compute values for multiple registers, including at the very least the stack pointer.  That’s because return addresses — hence, previous program counter values — are typically in memory at some SP offset, and so we need to know the SP values.

Diehard CFI-heads will recognise that the above description omits a lot of details.  Nonetheless it encapsulates what the unwinder needs to be fast at.  Viz:

   (PC, SP, FP) = <values taken from CPU registers at start of unwind>

   repeat
      output_array[index++] = PC
      current_ruleset = Lookup_in_huge_table(PC)
      new_PC = evaluate_rule(current_ruleset.rule_for_PC,  PC, SP, FP)
      new_SP = evaluate_rule(current_ruleset.rule_for_SP,  PC, SP, FP)
      new_FP = evaluate_rule(current_ruleset.rule_for_FP,  PC, SP, FP)
      PC = new_PC ; SP = new_SP; FP = new_FP;
   until
      end of stack reached, or we have enough frames

The rule for each register is usually simple, having one of the following two forms:

   new_REG =                 old_PC (or old_SP or old_FP) + constant
   new_REG = read-memory-at( old_PC (or old_SP or old_FP) + constant )

For example, it might well be the case that, at some point

   new_PC = read-memory-at( old_SP + 64 )

if we know that the return address is 64 bytes back up the stack.

Implementation differences

With that in mind, the differences between the Breakpad and Valgrind implementations are as follows:

  • Lookup_in_huge_table(PC)

Valgrind maintains a 509-entry direct mapped cache containing previously used rule-sets.  If we are doing a lot of unwinds, most of them will involve the same few hundred instructions, because the outer parts of the traces are identical or very similar.  It achieves a near 100% hit rate in practice.  Hence finding the entry comes down to computing “PC % 509”, a couple of loads and a conditional branch to check we’ve got the right entry.

Breakpad has no such cache.  When bug 893542 lands it will have at a std::map-based cache.  That’s a big improvement on no cache, but it’s still not anywhere near as good as a direct-mapped cache, since any lookup in the cache involves a std::map.find(), which means a red-black tree lookup,  costing several hundred instructions.

Breakpad has another disadvantage: its architecture forces the cache test to be placed later than is optimal.  Valgrind loads the unwind rules for all shared objects at process startup.  Breakpad only reads debuginfo for an object at the first visit of the object.  Hence Breakpad’s unwind loop must first, inefficiently, check each PC value to find out whether it needs to read debuginfo for the containing shared object.  This alone adds a couple of hundred instructions per frame.

  • Representation of the rules

Breakpad is nothing if not flexible.  The set of registers it is prepared to unwind is not fixed.  Instead it depends on whatever registers the compiler-generated CFI provides unwind rules for. Hence Breakpad produces, at each unwind step, new values for all the integer registers that have a CFI unwind rule available at that location.

Valgrind, on the other hand, just unwinds a fixed set of registers which are known to be important in practice.  On x86_64-linux, the set is %rip, %rsp and %rbp.  On arm-linux: r15, r14, r13, r12, r11 and r7.  If unwinding needs the value of some other register it’s just too bad — unwinding stops and it’s Game Over.  But this never appears to be a problem in practice.

This disadvantages Breakpad in two ways.

Firstly, Breakpad unwinds more registers than Valgrind does, because the in-file CFI typically gives unwind rules for more registers.  On x86_64-linux, for example, the CFI rules often give unwind info for %r15, %r14, %r13, %r12 and %r11, which Valgrind simply ignores.

Secondly, Valgrind’s use of a fixed register set makes its data structures much more efficient.  Breakpad’s unwind loop has to maintain a std::map of register names to register values for all the currently tracked registers, and a similar map for register names to recovery rules.  Hence it spends much time iterating over, and looking up in, these mappings.  Despite efforts to make these lookups efficient, they can’t come close to Valgrind’s compiled-in (C-style) structs for register and rule sets, that require no lookups or iteration.

  • Dynamic memory allocation

Valgrind unwinds with no dynamic memory allocation and dumps the resulting PC values in a known-to-be-large-enough caller-supplied array.

Breakpad constructs a std::vector of StackFrame objects.  Each frame therefore requires a heap allocation — and, eventually, deallocation, which just by themselves come to more than Valgrind’s total per-frame cost.  The 6,600 insn/frame cost is after I did some experimental restructuring to avoid these allocations.

  • Valgrind cheats

Valgrind shortcuts in two ways, which makes the comparison slightly unfair.

As described above, Breakpad unwinds all integer registers for which CFI rules are available.  Valgrind unwinds a fixed and generally smaller subset.  This doesn’t appear to make any difference in practice.

Breakpad tracks the validity state of each register.  This helps make unwinding more reliable and is needed to support transitioning between different frame recovery methods (CFI unwind, frame-pointer following, stack scanning) on a per-frame basis.  Really, Valgrind ought to do that too.  It will give some extra overhead, but not a lot — perhaps increasing the per-frame cost by 50%.

6 responses

  1. Steve Fink wrote on :

    When sampling at 1kHz, how many of the frames change from one sample to the next, on average? As in, if you could magically discover during one unwind that the current frame and everything below it is the same as it was on the previous sample, how much would that save? (Ok, “the same” wouldn’t be quite enough. It would have to be the same *and* not popped in between samples, which matters if you end up pushing the same frames over again.)

    1. jseward wrote on :

      I suspect the changes would be small, but I haven’t measured this.

  2. Robert O’Callahan wrote on :

    This all sounds very fixable, yay!

    If we add a direct-mapped cache to Breakpad, we can check the cache before doing the “needs to read debuginfo” check, thus mitigating the impact of that check, right?

    Can’t you optimistically unwind a small set of registers and if that fails, fall back to a slower full-register-set mode?

  3. Nicholas Nethercote wrote on :

    Nice description. How much control do we have over Breakpad?

    1. jseward wrote on :

      I don’t really know. The changes ideally need to go back upstream, though, so mashing it around it a big way is likely not an easy sell.

  4. avih wrote on :

    This has to be one of the best posts I’ve read in a long while. Extremely clear, informative, interesting and important. Thank you for writing it up (and of course for doing this investigations and work).

    I was wondering, however, what were the initial instructions/frame numbers before they got down to 6600?