How compactly can CFI/EXIDX stack unwinding info be represented?

In last week’s episode of The Wonderful World of CFI Unwinding we looked at how fast it could be done.  This week we’re going to look at how compactly the unwind information can be stored.

But I know you want a summary up front, so here it is: I reckon it’s possible to store enough CFI to unwind all of libxul.so on x86_64-linux in just 5.4MB.  At least, if my numbers are not wrong.

Background

Representing CFI and EXIDX unwind information compactly in memory is critical for memory-constrained devices running FirefoxOS and Fennec.

The title of this posting is misleading, though.  The question is not “how compactly can it be stored?”, but rather “how compactly can it be stored and still give us the really fast access we need?”.

I did some digging.  Valgrind will, if asked nicely, print out the CFI unwinding rules as it stores them.  The previous posting described how it only stores unwind information for a limited register set — on x86_64-linux, which I’m experimenting with here — %rip, %rsp and %rbp.  It manages to summarise the unwind rule for each address range into just one line of text:

  [0x53eea71 .. 0x53eea84]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)

This says: for %rip values in the range 0x53eea71 to 0x53eea84, first compute

  cfa = %rbp + 16

then

  previous %rip = *(cfa-8)
  previous %rsp = cfa
  previous %rbp = *(cfa-16)
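
Applied mechanically, that is just an add and a couple of loads.  Here is a minimal C sketch of one unwind step under this rule (not Valgrind’s actual code; read_mem() is a hypothetical helper that safely reads one 64-bit word from the target’s address space):

  #include <stdint.h>

  /* Hypothetical helper: returns 1 and stores the word at 'addr' in
     '*val', or returns 0 if the address isn't safely readable. */
  extern int read_mem(uint64_t addr, uint64_t *val);

  typedef struct { uint64_t rip, rsp, rbp; } Regs;

  /* Apply "let cfa=oldBP+16 in RA=*(cfa-8) SP=cfa+0 BP=*(cfa-16)". */
  static int unwind_one_step(Regs *r)
  {
     uint64_t cfa = r->rbp + 16;
     uint64_t ra, bp;
     if (!read_mem(cfa - 8,  &ra)) return 0;
     if (!read_mem(cfa - 16, &bp)) return 0;
     r->rip = ra;
     r->rsp = cfa;
     r->rbp = bp;
     return 1;
  }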

For libxul.so from a recent m-c build, compiled with “-g -O”, there are 890,065 of these records — that is, the compiler gives unwind descriptions for 890,065 different code address ranges in libxul.so.

That sounds like a lot of data.  But those descriptions are immensely repetitive.  Just how repetitive we can see by generating the descriptions, cutting off the address range part and throwing the rest through sort -u:

  valgrind --smc-check=all-non-file --tool=none --trace-cfi=yes \
     "--trace-symtab-patt=*libxul.so*" \
     ./ff-opt-linux/dist/bin/firefox-bin &> logfile

  cat logfile | grep "let cfa=" | cut -c 27- | sort -u | wc

There are just 75 different ones in nearly 900k address ranges!

In hindsight, that shouldn’t be a big surprise.  A description of how to recover the return address, stack pointer and frame pointer must be pretty boring.  Either they’re parked in memory at a small handful of 8-aligned offsets from the stack pointer, or they are the value of one of the previous registers plus or minus a similarly constrained offset.

In other words, the set of descriptions is small because GCC generates only a small set of stack frame layouts, if we restrict ourselves to considering just the parts of the frame needed for unwinding.

I was surprised by these figures, so I also tested the CFI on the main Valgrind executable.  That has 35,430 address ranges containing 160 unique descriptions.  Not quite as striking as the libxul.so case, but not far off.

Storing the address ranges compactly

The obvious storage optimisation is to park the 75 descriptions in a dictionary, the size of which is insignificant, and represent the 890,065 address ranges and dictionary-entry number as compactly as possible.  Then, sort these entries by address range and put them in a flat array, for fast binary search.

How compactly can we represent an (address, length, dictionary-entry-number) triple?  Naively, the address is a 64-bit word, and the length and entry number could be 32 bits each, giving 16 bytes in total.

That’s way overkill.  Since we’ll have one table per mapped object, the base address can be replaced by the offset from the base of the object.  libxul.so has about 36MB of text, so unfortunately that’s 4 bytes.  The address range lengths are tiny, though, mostly less than 256, so we could store them in a byte, and duplicate the descriptor for the occasional longer run.  And the dictionary entry number in the two cases I tested would fit in 8 bits.

So that’s 6 bytes per description.  For 890,065 descriptions, 5,340,390 bytes in total.
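
In C, the per-object table could be sketched roughly as below (hypothetical names; the packed attribute stops the compiler padding each 6-byte entry out to 8 bytes).  Lookup is a binary search over the sorted, non-overlapping ranges:

  #include <stdint.h>
  #include <stddef.h>

  /* One 6-byte entry: object-relative code offset, range length, and
     an index into the dictionary of (at most 256) unwind rules. */
  typedef struct __attribute__((packed)) {
     uint32_t offset;    /* code address minus the object's base   */
     uint8_t  len;       /* range length; longer runs are split up */
     uint8_t  dict_ix;   /* which of the ~75 descriptions applies  */
  } Entry;

  /* Find the entry covering 'offset'; return its dictionary index,
     or -1 if no range covers it. */
  static int find_rule(const Entry *e, size_t n, uint32_t offset)
  {
     size_t lo = 0, hi = n;
     while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offset < e[mid].offset)
           hi = mid;
        else if (offset >= e[mid].offset + e[mid].len)
           lo = mid + 1;
        else
           return e[mid].dict_ix;
     }
     return -1;
  }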

It would be interesting to see if the same level of duplication occurs for CFI and EXIDX on ARM.  But given that it merely reflects the non-diversity of frame layouts, I find it hard to believe we’d see anything much different.

EXIDX contains less information than CFI, yet — as experiments on Fennec/ARM have shown — it gives good unwinding results.  So this analysis surely applies equally to our EXIDX-based targets, Fennec/ARM and FirefoxOS/ARM.

6 responses

  1. Noel Grandin wrote on :

    Nice work!

    Suggestion: use a 2-deep tree, where

    – the first level is a sorted list of (4-byte-address-offset, address of second level block) tuples
    – the second level is like your proposed encoding, but with a 2-byte-address-offset

    should get you some more gains without cranking up the complexity too much.

  2. Nathan wrote on :

Very nice. On ARM, does Breakpad do some sort of heuristic for unwinding from prologues/epilogues? GCC emits unwind information for that in CFI (at least it does in recentish versions); EXIDX doesn’t have that. I assume Valgrind doesn’t much care about such cases, but surely a profiler unwind library would have to care.

    1. jseward wrote on :

      No, Breakpad has no heuristics for that case. It can’t, since in fact we’re just translating EXIDX into the same data structures as CFI, so the unwinder proper doesn’t know whether it is using CFI or EXIDX. I had thought of adding some kludgery to move the program counter forwards a bit in the context frame, if it appears to point at a prologue, so as to hopefully get it into the post-push sequence that is properly described by EXIDX. But so far I haven’t done this.

  3. avih wrote on :

    Excellent!

  4. Steve Fink wrote on :

    Interesting. I ran it on my 64-bit linux box, and got only 7 unique CFI expressions in 1424173 address ranges. Then again, I think that’s a debug build.

    I also tried

    readelf --debug-dump=frames-interp ~/src/MI-GC/obj/dist/bin/libxul.so | perl -lne 'print $1 if /pc=(\w+\.\.\w+)/' > addrs.txt

    and got 355795 address ranges for libxul.so. Not only that, but eyeballing it, many adjacent address ranges have identical CFI info and so could be merged. I’m not sure why this is so different from the valgrind numbers. The address values are different between the two; one must incorporate a base address and the other doesn’t?

    Oh, wait. The --debug-dump one says it’s giving the contents of .eh_frame. I was expecting .debug_frame; maybe this is just the subset that is needed for exception handling?

    I also notice that the parts of the rules have even more redundancy. For my binary, there are exactly 5 different ways to compute the CFA:

    oldBP+16
    oldSP+16
    oldSP+24
    oldSP+8
    {((xSP)+(0x8))+(((((xIP)+(0x0))&(0xf))>=(0xb))<<(0x3))}

    and I'm guessing that last one isn't used much. Yep, it and oldSP+24 are each used exactly once. So we're mostly looking at 3 possibilities that show up with any frequency.

    The expressions using the cfa are then even more boring: there are exactly 2 used:

    RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
    RA=*(cfa+-8) SP=cfa+0 BP=Same

    I think that just means that in the prologue, you keep the old BP until you change it, and from then on everything is stored in the same place (relative to the cfa).

    This is where I start to get skeptical about using a debug (as in, unoptimized) build. Surely the optimizer does some more interesting stuff.

    So my current nightly has 833722 rules, 248 of them unique (compared to 7 for the unoptimized build). Ah, now there are 113 unique ways to compute the cfa, and 6 ways to use the cfa to get the addrs. They are:

    1537 RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-40)
    1614 RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-32)
    2292 RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-24)
    5160 RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-48)
    178213 RA=*(cfa+-8) SP=cfa+0 BP=Same
    644906 RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)

    (1st number is the number of occurrences.)

    Does this get anywhere? Well, if you take the top 15 of those 113 cfa computations, you cover 99.905% of the cases, so you can encode that in 4 bits + an escape value pointing to an auxiliary table. So 4 bits for computing the cfa + 3 bits for using it gets you down to a single byte for the rules. Which is useless; we only started with 248 possibilities to begin with. And the address ranges still dominate.

    Too late for me to think about that, but here's a visualization: this is one pixel per byte in libxul.so, where same color means same CFI rules: http://people.mozilla.org/~sfink/data/libxul-cfi.png (warning: 5090×5090 image).

    Wait… come to think of it, you don't need to look up the CFI info for prologues/epilogues very often. If you're running a sampling profiler, that'll probably only happen for the leaf frame, and even that only for a smallish percentage of the time. So you can afford for that to be a slower lookup. Then all the hot pink in that image, the dominant CFI rule, can be encoded in a 3.3MB bit vector (and I bet that vector would compress down pretty well for transfer.) In fact, you could store everything else in a linear array of one byte rules per address, where the index into the array is found via a binary search of an index vector giving the lookup table offsets of every 1k bits in the bit vector, and then counting bits to find the exact entry. (Sorry, this is a compressed description. Need to get to sleep.) That's like a 13KB offset table if you use a full 32 bits for the offsets.
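
    Concretely, with hypothetical names and taking a 1 bit to mean "the dominant rule", the lookup could be something like the following (with a fixed 1024-bit stride the offset table can be indexed directly rather than binary searched):

      #include <stdint.h>

      /* Hypothetical layout: bits[] has one bit per code byte (1 = the
         dominant rule), offsets[] gives the number of zero bits before
         each 1024-bit block, and rules[] has one byte per non-dominant
         address, in address order. */
      typedef struct {
         const uint64_t *bits;
         const uint32_t *offsets;     /* one entry per 1024 bits */
         const uint8_t  *rules;
         uint8_t         dominant_rule;
      } CfiMap;

      static uint8_t rule_at(const CfiMap *m, uint32_t addr)
      {
         uint64_t word = m->bits[addr / 64];
         if ((word >> (addr % 64)) & 1)
            return m->dominant_rule;         /* common case: one load */
         /* Rank query: count the zero bits before 'addr'. */
         uint32_t ix = m->offsets[addr / 1024];
         for (uint32_t i = (addr / 1024) * 16; i < addr / 64; i++)
            ix += 64 - (uint32_t)__builtin_popcountll(m->bits[i]);
         uint64_t below = (addr % 64) ? (word & ((1ULL << (addr % 64)) - 1)) : 0;
         ix += (addr % 64) - (uint32_t)__builtin_popcountll(below);
         return m->rules[ix];
      }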

    Note that the image I linked above shows that this scheme would work very poorly on the last part of libxul, which has a crazy mishmash of CFI rules. But if I ignore that nasty little detail, and I did the math right, that's about 3.5MB for everything, and a single lookup (no binary search) in the common case.

    Doh! No, I missed something. I left out the rules array, which is 3.1MB by itself. (There are 22803428 hot pink rules, as in the "normal" CFI rule, and 3104675 other rules.) Darn, I lose — that's like 6.6 MB total. Ugh. That rules array looks compressible, though — from the image, it seems like most prologues have the same structure, and only differ slightly in length. But this is getting too handwavy even for me.

  5. Jed Davis wrote on :

    I wrote a small Python script to inspect exidx/extab and count distinct unwind rules: https://gist.github.com/jld/6471447

    The results on a current b2g libxul are:
    92051 exidx entries
    775 unwinders stored in exidx + 3827 extab refs
    1890 total unwinders

    So it’s not 75, but it’s not too bad.

    With a little more work I could disassemble them and estimate how many regions a full unwind map that handles prologue/epilogue samples would need (and how many of them are covered by the 256 or 128 most common unwinders, even).