{"id":243,"date":"2013-08-29T11:32:23","date_gmt":"2013-08-29T10:32:23","guid":{"rendered":"http:\/\/blog.mozilla.org\/jseward\/?p=243"},"modified":"2013-08-29T11:32:23","modified_gmt":"2013-08-29T10:32:23","slug":"how-fast-can-cfiexidx-based-stack-unwinding-be","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/jseward\/2013\/08\/29\/how-fast-can-cfiexidx-based-stack-unwinding-be\/","title":{"rendered":"How fast can CFI\/EXIDX-based stack unwinding be?"},"content":{"rendered":"<p>This is a long post.\u00a0 Here&#8217;s a summary.<\/p>\n<p>Valgrind unwinds CFI typically 30 times faster than Breakpad.\u00a0 Bad though that sounds, it&#8217;s really encouraging.\u00a0 If we can get even half that speedup &#8212; that is, 15 times faster than at present &#8212; CFI\/EXIDX unwinding in SPS will be a lot more useful than it currently is.<\/p>\n<p>There are a number of reasons for the discrepancy, which the post discusses in detail.<\/p>\n<h2>Background<\/h2>\n<p>For some time now, we&#8217;ve been producing nightlies that have the ability to time-profile themselves using the built-in SPS profiler.\u00a0 SPS is a statistical sample-based profiler that samples specified (C++) threads by unwinding their stacks using a variety of mechanisms.<\/p>\n<p>In the past couple of months, Fennec and Linux nightlies have acquired the ability to take samples by unwinding the stacks using the same &#8220;native&#8221; unwind information that would be used for diagnosing a crash with (for example) GDB.\u00a0 On Linux that information is Dwarf2 CFI, and on ARM\/Fennec we use the ARM-specific EXIDX format.\u00a0 The unwinding is done using the Google Breakpad library that lives in our tree.<\/p>\n<p>Breakpad works well in the sense that it unwinds reliably through both libxul, system libraries and, apparently, JIT-produced code.\u00a0 But it&#8217;s slow.\u00a0 After considerable efforts at speeding it up, including some as-yet uncommitted changes (<a href=\"https:\/\/bugzilla.mozilla.org\/show_bug.cgi?id=893542\">bug 893542<\/a> plus other hacks), Breakpad can unwind at a cost of about 6,600 instructions per frame for x86_64\/Linux, and probably a bit worse for ARM\/Android.<\/p>\n<p>That&#8217;s expensive &#8212; for a typical 18 frame unwind that comes to around 120,000 instructions.\u00a0 On a desktop processor that can sustain perhaps 2000 million instructions\/second, that means we&#8217;d saturate one core at about 16,500 unwinds\/second.\u00a0 Supposing we&#8217;d like 90% of the CPU to be used for real work, not unwinding.\u00a0 Then we have budget for only 1650 unwinds\/second, which is pretty hopeless considering we want to sample multiple threads at a 1 KHz rate.\u00a0 On a low end phone, the problem is an order of magnitude worse.<\/p>\n<p>I spent quite some time profiling Breakpad&#8217;s core CFI unwind loop on x86_64\/Linux.\u00a0 One thing that becomes very clear from this is that Breakpad is designed for generality, flexibility and correctness, but it is not designed for fast in-process unwinding.<\/p>\n<p>Valgrind also does CFI based unwinding, and has been highly tuned over the years, particularly to support Helgrind, which is very unwind-intensive.\u00a0 I wondered how it compared, so I profiled it &#8212; a bit of a tricky exercise, running a big test app on an inner Valgrind (which does a lot of unwinding) on an outer Valgrind, running Callgrind, to get profile data.<\/p>\n<p>The numbers really surprised me.\u00a0 Valgrind&#8217;s unwinder achieves about 220 instructions\/frame.\u00a0 That&#8217;s 30 times faster than my best Breakpad results so far.\u00a0 Why the huge difference?\u00a0 Surely some mistake?\u00a0 I started digging.\u00a0 First, though, a look at the algorithm.<\/p>\n<h2>The core CFI unwinding algorithm<\/h2>\n<p>The core algorithm is simple to understand.\u00a0 We start off with registers taken from the innermost frame &#8212; as a minimum, three: the program counter, the stack pointer and the frame pointer.\u00a0 The CFI data provides, for each possible instruction, a set of rules which say how recover the register values in the calling frame.\u00a0 So we apply those rules to our three registers, and repeat.\u00a0 The stack trace we&#8217;re after is then the sequence of PC values resulting.<\/p>\n<p>One thing to note is that, although we only want the program counter values, we need to compute values for multiple registers, including at the very least the stack pointer.\u00a0 That&#8217;s because return addresses &#8212; hence, previous program counter values &#8212; are typically in memory at some SP offset, and so we need to know the SP values.<\/p>\n<p>Diehard CFI-heads will recognise that the above description omits a lot of details.\u00a0 Nonetheless it encapsulates what the unwinder needs to be fast at.\u00a0 Viz:<\/p>\n<pre>   (PC, SP, FP) = &lt;values taken from CPU registers at start of unwind&gt;\r\n\r\n   repeat\r\n      output_array[index++] = PC\r\n      current_ruleset = Lookup_in_huge_table(PC)\r\n      new_PC = evaluate_rule(current_ruleset.rule_for_PC,\u00a0 PC, SP, FP)\r\n      new_SP = evaluate_rule(current_ruleset.rule_for_SP,\u00a0 PC, SP, FP)\r\n      new_FP = evaluate_rule(current_ruleset.rule_for_FP,\u00a0 PC, SP, FP)\r\n      PC = new_PC ; SP = new_SP; FP = new_FP;\r\n   until\r\n      end of stack reached, or we have enough frames<\/pre>\n<p>The rule for each register is usually simple, having one of the following two forms:<\/p>\n<pre>   new_REG =                 old_PC (or old_SP or old_FP) + constant\r\n   new_REG = read-memory-at( old_PC (or old_SP or old_FP) + constant )<\/pre>\n<p>For example, it might well be the case that, at some point<\/p>\n<pre>   new_PC = read-memory-at( old_SP + 64 )<\/pre>\n<p>if we know that the return address is 64 bytes back up the stack.<\/p>\n<h2>Implementation differences<\/h2>\n<p>With that in mind, the differences between the Breakpad and Valgrind implementations are as follows:<\/p>\n<ul>\n<li>\n<pre>Lookup_in_huge_table(PC)<\/pre>\n<\/li>\n<\/ul>\n<p style=\"padding-left: 30px;\">Valgrind maintains a 509-entry direct mapped cache containing previously used rule-sets.\u00a0 If we are doing a lot of unwinds, most of them will involve the same few hundred instructions, because the outer parts of the traces are identical or very similar.\u00a0 It achieves a near 100% hit rate in practice.\u00a0 Hence finding the entry comes down to computing &#8220;PC % 509&#8221;, a couple of loads and a conditional branch to check we&#8217;ve got the right entry.<\/p>\n<p style=\"padding-left: 30px;\">Breakpad has no such cache.\u00a0 When bug 893542 lands it will have at a std::map-based cache.\u00a0 That&#8217;s a big improvement on no cache, but it&#8217;s still not anywhere near as good as a direct-mapped cache, since any lookup in the cache involves a std::map.find(), which means a red-black tree lookup,\u00a0 costing several hundred instructions.<\/p>\n<p style=\"padding-left: 30px;\">Breakpad has another disadvantage: its architecture forces the cache test to be placed later than is optimal.\u00a0 Valgrind loads the unwind rules for all shared objects at process startup.\u00a0 Breakpad only reads debuginfo for an object at the first visit of the object.\u00a0 Hence Breakpad&#8217;s unwind loop must first, inefficiently, check each PC value to find out whether it needs to read debuginfo for the containing shared object.\u00a0 This alone adds a couple of hundred instructions per frame.<\/p>\n<ul>\n<li>Representation of the rules<\/li>\n<\/ul>\n<p style=\"padding-left: 30px;\">Breakpad is nothing if not flexible.\u00a0 The set of registers it is prepared to unwind is not fixed.\u00a0 Instead it depends on whatever registers the compiler-generated CFI provides unwind rules for. Hence Breakpad produces, at each unwind step, new values for all the integer registers that have a CFI unwind rule available at that location.<\/p>\n<p style=\"padding-left: 30px;\">Valgrind, on the other hand, just unwinds a fixed set of registers which are known to be important in practice.\u00a0 On x86_64-linux, the set is %rip, %rsp and %rbp.\u00a0 On arm-linux: r15, r14, r13, r12, r11 and r7.\u00a0 If unwinding needs the value of some other register it&#8217;s just too bad &#8212; unwinding stops and it&#8217;s Game Over.\u00a0 But this never appears to be a problem in practice.<\/p>\n<p style=\"padding-left: 30px;\">This disadvantages Breakpad in two ways.<\/p>\n<p style=\"padding-left: 30px;\">Firstly, Breakpad unwinds more registers than Valgrind does, because the in-file CFI typically gives unwind rules for more registers.\u00a0 On x86_64-linux, for example, the CFI rules often give unwind info for %r15, %r14, %r13, %r12 and %r11, which Valgrind simply ignores.<\/p>\n<p style=\"padding-left: 30px;\">Secondly, Valgrind&#8217;s use of a fixed register set makes its data structures much more efficient.\u00a0 Breakpad&#8217;s unwind loop has to maintain a std::map of register names to register values for all the currently tracked registers, and a similar map for register names to recovery rules.\u00a0 Hence it spends much time iterating over, and looking up in, these mappings.\u00a0 Despite efforts to make these lookups efficient, they can&#8217;t come close to Valgrind&#8217;s compiled-in (C-style) structs for register and rule sets, that require no lookups or iteration.<\/p>\n<ul>\n<li>Dynamic memory allocation<\/li>\n<\/ul>\n<p style=\"padding-left: 30px;\">Valgrind unwinds with no dynamic memory allocation and dumps the resulting PC values in a known-to-be-large-enough caller-supplied array.<\/p>\n<p style=\"padding-left: 30px;\">Breakpad constructs a std::vector of StackFrame objects.\u00a0 Each frame therefore requires a heap allocation &#8212; and, eventually, deallocation, which just by themselves come to more than Valgrind&#8217;s total per-frame cost.\u00a0 The 6,600 insn\/frame cost is <em>after<\/em> I did some experimental restructuring to avoid these allocations.<\/p>\n<ul>\n<li>Valgrind cheats<\/li>\n<\/ul>\n<p style=\"padding-left: 30px;\">Valgrind shortcuts in two ways, which makes the comparison slightly unfair.<\/p>\n<p style=\"padding-left: 30px;\">As described above, Breakpad unwinds all integer registers for which CFI rules are available.\u00a0 Valgrind unwinds a fixed and generally smaller subset.\u00a0 This doesn&#8217;t appear to make any difference in practice.<\/p>\n<p style=\"padding-left: 30px;\">Breakpad tracks the validity state of each register.\u00a0 This helps make unwinding more reliable and is needed to support transitioning between different frame recovery methods (CFI unwind, frame-pointer following, stack scanning) on a per-frame basis.\u00a0 Really, Valgrind ought to do that too.\u00a0 It will give some extra overhead, but not a lot &#8212; perhaps increasing the per-frame cost by 50%.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a long post.\u00a0 Here&#8217;s a summary. Valgrind unwinds CFI typically 30 times faster than Breakpad.\u00a0 Bad though that sounds, it&#8217;s really encouraging.\u00a0 If we can get even half that speedup &#8212; that is, 15 times faster than at &hellip; <a class=\"go\" href=\"https:\/\/blog.mozilla.org\/jseward\/2013\/08\/29\/how-fast-can-cfiexidx-based-stack-unwinding-be\/\">Continue reading<\/a><\/p>\n","protected":false},"author":240,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/posts\/243"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/users\/240"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/comments?post=243"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/posts\/243\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/media?parent=243"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/categories?post=243"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/jseward\/wp-json\/wp\/v2\/tags?post=243"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}