I’ve been wondering how much of the data we keep in memory is actually useful, as opposed to being bits which are mostly or entirely constant. Recently I hacked up a Valgrind tool to get some initial measurements.
Imagine a machine instruction that does a 32-bit memory read. If the read value is only ever observed to be one of a small set, say 1, 2 or 3, then we might suspect that most of the bits in the word are unused. Should we be concerned? It depends on whether the instruction accesses a wide range of addresses: if it does, that might be a clue that some inefficient representation is in play over a large area of memory.
MVP puts this into practice. It watches all 1-, 2-, 4- and 8-byte non-stack memory read instructions, observing both the spread of read values and the spread of accessed addresses. For each monitored instruction, it computes an aggregate wastefulness score: the number of always-constant bits in the values multiplied by (some approximation of) the number of different accessed addresses.
You can level various kinds of “Bogus!” accusations at it, but it does give some indication of structure and array fields that are not working very hard. For example, given this:
#define MILLION 1000000
int a[MILLION], i, sum = 0;

for (i = 0; i < MILLION; i++) a[i] = (i & 1) ? 1 : -1;
// ... later ...
for (i = 0; i < MILLION; i++) sum += a[i];
it will tell you that the “sum += a[i]” reads are mostly pulling in constant bits:
wasted 968,781 (31/32 cbs, 31,251 sects) 0x40057E: main (mvp1.c:25)
31 out of the 32 bits read are constant (it has some sophistication about ignoring constant sign bits), and the accesses are spread over 31,251 different “sectors” (128-byte chunks of address space). Hence this read instruction gets a relatively high wastefulness metric of 968,781.
MVP runs Firefox without difficulty. Results are interesting. A couple of high-roller tidbits:
wasted 1,143,304 (8/32 cbs, 142,913 sects) 0x6431A7B: fast_composite_src_x888_0565 (pixman-fast-path.c:904)
fast_composite_src_x888_0565 converts image data from 24-bit 8:8:8 format to 16-bit 5:6:5 format, and the 24-bit values are parked in 32-bit words. The data is spread widely (142,913 sectors) and the top 8 bits of the read data are observed to be constant, as we’d expect.
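For reference, that conversion looks roughly like this (an illustrative sketch, not pixman's actual implementation): each 32-bit source word carries an 8:8:8 pixel in its low 24 bits, so the top 8 bits of every load never change.

```c
#include <stdint.h>

/* Sketch of x888 -> 565 conversion: take the top 5/6/5 bits of each
   8-bit channel and pack them into a 16-bit word.  The top 8 bits of
   the 32-bit source word are unused -- exactly what MVP flags. */
static uint16_t x888_to_565(uint32_t p) {
    uint32_t r = (p >> 16) & 0xff;
    uint32_t g = (p >> 8)  & 0xff;
    uint32_t b = p & 0xff;
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```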
wasted 989,696 (16/16 cbs, 61,856 sects) 0x652B054: js::NameNode::create(JSAtom*, JSTreeContext*) (jsparse.cpp:798)
This has to do with
pn_dflags = (!tc->topStmt || tc->topStmt->type == STMT_BLOCK) ? PND_BLOCKCHILD : 0;
16 constant bits out of 16 implies that tc->topStmt->type (a uint16) has only ever been observed to be STMT_BLOCK here.
The most striking observation from the initial numbers, though, is somewhat obvious: that on a 64-bit platform, pointer-intensive structures are space-inefficient. There are large numbers of 64-bit load instructions that pull in values with 36 or so constant bits. That would correspond to fetching addresses of objects scattered across a few hundred megabytes of address space.
Perhaps we should work to minimise the number of heap objects. Maybe what we need is a way to detect object pairs with coupled lifetimes, so they can be merged into a single object, and the connecting pointers removed.
Anyway, enough speculation. The tool is very much WIP. If anybody is interested in trying it out, or has comments/suggestions, let me know.