In last week’s episode of The Wonderful World of CFI Unwinding we looked at how fast it could be done. This week we’re going to look at how compactly the unwind information can be stored.
But I know you want a summary up front, so here it is: I reckon it’s possible to store enough CFI to unwind all of libxul.so on x86_64-linux in just 5.4MB. At least, if my numbers are not wrong.
Representing CFI and EXIDX unwind information compactly in memory is critical for memory constrained devices running FirefoxOS and Fennec.
The title of this posting is misleading, though. The question is not “how compactly can it be stored?”, but rather “how compactly can it be stored and still give us the really fast access we need?”.
I did some digging. Valgrind will, if asked nicely, print out the CFI unwinding rules as it stores them. The previous posting described how it only stores unwind information for a limited register set — on x86_64-linux, which I’m experimenting with here — %rip, %rsp and %rbp. It manages to summarise the unwind rule for each address range into just one line of text:
[0x53eea71 .. 0x53eea84]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
This says: for %rip values in the range 0x53eea71 to 0x53eea84, first compute
cfa = %rbp + 16
previous %rip = *(cfa-8) previous %rsp = cfa previous %rbp = *(cfa-16)
For libxul.so for a recent m-c, built with “-g -O”, there are 890,065 of these records — that is, the compiler gives unwind descriptions for 890,065 different code address ranges in libxul.so.
That sounds like a lot of data. But those descriptions are immensely repetitive. Just how repetitive we can see by generating the descriptions, cutting off the address range part and throwing the rest through sort -u:
valgrind --smc-check=all-non-file --tool=none --trace-cfi=yes " \ --trace-symtab-patt=*libxul.so*" \ ./ff-opt-linux/dist/bin/firefox-bin &> logfile cat logfile | grep "let cfa=" | cut -c 27- | sort -u | wc
The are just 75 different ones in nearly 900k address ranges!
In hindsight, that shouldn’t be a big surprise. A description of how to recover the return address, stack pointer and frame pointer must be pretty boring. Either they’re parked in memory at a small handful of 8-aligned offsets from the stack pointer, or they are the value of one of the previous registers plus or minus a similarly constrained offset.
In other words, the set of descriptions is small because GCC generates only a small set of stack frame layouts, if we restrict ourselves to considering just the parts of the frame needed for unwinding.
I was surprised by these figures, so I also tested the CFI on the main Valgrind executable. That has 35,430 address ranges containing 160 unique descriptions. Not quite as striking as the libxul.so case, but not far off.
Storing the address ranges compactly
The obvious storage optimisation is to park the 75 descriptions in a dictionary, the size of which is insignificant, and represent the 890,065 address ranges and dictionary-entry number as compactly as possible. Then, sort these entries by address range and put them in a flat array, for fast binary search.
How compactly can we represent an (address, length, dictionary-entry-number) triple? Naively, the address is a 64 bit word, and length and entry number could be 32 bits, giving 16 bytes in total.
That’s way overkill. Since we’ll have one table per mapped object, the base address can be replaced by the offset from the base of the object. libxul.so has about 36MB of text, so unfortunately that’s 4 bytes. The address range lengths are tiny, though, mostly less than 256, so we could store than in a byte, and duplicate the descriptor for the occasional longer run. And the dictionary entry number in the two cases I tested would fit in 8 bits.
So that’s 6 bytes per description. For 890,065 descriptions, 5,340,390 bytes in total.
It would be interesting to see if the same level of duplication occurs for CFI and EXIDX on ARM. But given that it merely reflects the non-diversity of frame layouts, I find it hard to believe we’d see anything much different.
EXIDX contains less information than CFI, yet — as experiments on Fennec/ARM have shown — it gives good unwinding results. So this analysis surely applies equally to our EXIDX-based targets, Fennec/ARM and FirefoxOS/ARM.