A History of Insanity in the Age of x86
It’s been a long time since I’ve blogged; I’ve been pretty deep in coding mode. But bug 471822 has been fixed, and it’s time to celebrate with a post.
Bug 471822 is a TraceMonkey performance regression on SunSpider of about 70 ms that Andreas Gal noticed recently. And it was worse than a simple slowdown: it was an intermittent slowdown, which made benchmark scores hard to use to drive optimization efforts.
The first really interesting thing I noticed about the bug was that the performance regression depended on the length of the command line used by SunSpider to run the JS engine. If I renamed one of the files to a longer name and ran with the longer command line, the benchmark could be either slower or faster. That’s really weird. I spent some time reading the source code to bash, thinking that maybe there was a bug in bash that altered the way it started the program. But there was nothing to be found there, and eventually I figured out how to replicate the problem using my own little C program with fork and execve.
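A minimal sketch of the kind of C program I mean, for replicating the effect with fork and execve. (This is illustrative, not my actual test program; make_padding and run_with_padding are names I’m inventing here.) The idea is to re-run a target binary while varying nothing but the total length of the command line:

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

// Build a padding string of n 'x' characters (caller frees).
static char *make_padding(size_t n) {
    char *p = malloc(n + 1);
    if (p) { memset(p, 'x', n); p[n] = '\0'; }
    return p;
}

// Fork, then execve `path` with a single padding argument of the given
// length, so only the total command-line length varies between runs.
static int run_with_padding(const char *path, size_t pad_len) {
    char *pad = make_padding(pad_len);
    char *argv[] = { (char *)path, pad, NULL };
    char *envp[] = { NULL };
    pid_t pid = fork();
    if (pid == 0) {
        execve(path, argv, envp);
        _exit(127);  // only reached if execve failed
    }
    int status = 0;
    waitpid(pid, &status, 0);
    free(pad);
    return status;
}
```

Timing the child across a range of pad_len values is enough to see whether command-line length alone moves the numbers, with bash taken completely out of the picture.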
Somehow, I figured out that the real effect of command-line length is to alter the value of the stack pointer register (esp) on program startup, because the command-line arguments and environment variables are copied onto the base of the stack when the program starts (that’s how they get passed into main). That’s just as weird as the idea of command-line length influencing perf. Fast-forward past many hours of performance profiling, hardware performance counter profiling, manual clock timer instrumentation, blah, blah, blah. Eventually I had apparently localized the perf hit to one part of one particular trace compiled for the access-nbody benchmark, a part that did a bunch of floating-point math. But I had also reached the point where the other data I had collected on the problem basically made no sense.
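The kind of alignment check we kept doing by hand can be sketched in a few lines of C. (A sketch under my own naming, not code from the bug; is_aligned and stack_double_is_aligned are hypothetical helpers.) On a 32-bit build, whether the second function returns true can depend on how much argv/environ data the kernel copied above the initial stack pointer:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

// True if p is aligned to a bytes (a must be a power of two).
static bool is_aligned(const void *p, size_t a) {
    return ((uintptr_t)p & (a - 1)) == 0;
}

// Check the 8-byte alignment of a freshly allocated stack slot -- the
// check we kept doing by hand in a debugger. Note: modern ABIs keep the
// stack 16-byte aligned at function entry, so today this will usually
// report an aligned slot.
static bool stack_double_is_aligned(void) {
    double d = 0.0;
    return is_aligned(&d, sizeof d);
}
```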
The patron saint of microarchitectural performance analysis.
So I sat down on a bean bag and started swapping crazy ideas with Andreas. We suspected that there was a problem with floating-point math, and we knew that none of the sensible theories about the bug were consistent with the data. Andreas was talking about data alignment, in particular how he had proved that on-stack floating-point numbers were always stored at addresses offset by nice multiples of 8 from the frame pointer register, ebp. That’s when I remembered something I had discovered earlier that day: on compiled traces, ebp always held an address ending in c (e.g. 0x0176f00c).
(A double is an IEEE-754 double-precision floating-point number, which is 8 bytes long. For various mysterious reasons, if you are working with 2^k-byte values, it is usually better to keep them at addresses that are multiples of 2^k. Thus, doubles should be stored at addresses that are multiples of 8, or in other words, addresses that end with 0 or 8 in hex. Compilers like gcc do this automatically, but if you are writing your own compiler, it’s all up to you.)
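To make “misaligned” concrete: with ebp ending in c, a double stored at ebp−8 lands at an address ending in 4 and straddles an 8-byte boundary. A small sketch (write_double_at and read_double_at are hypothetical helpers, not anything from nanojit) of storing a double at an arbitrary byte offset — the point being that misalignment never affected correctness, only speed:

```c
#include <string.h>
#include <stddef.h>

// Store and load a double at an arbitrary byte offset inside a buffer.
// memcpy keeps this well-defined in C whatever the alignment; on the
// hardware in this story, the misaligned case was merely much slower.
static void write_double_at(char *buf, size_t offset, double d) {
    memcpy(buf + offset, &d, sizeof d);
}

static double read_double_at(const char *buf, size_t offset) {
    double d;
    memcpy(&d, buf + offset, sizeof d);
    return d;
}
```

That’s why nothing ever crashed or computed wrong answers: the CPU handles the misaligned loads and stores transparently, just slowly.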
We verified the misalignment. I also remembered that in functions compiled by gcc, ebp always ends in 8 (which makes sense: gcc keeps esp 16-byte aligned before a call, the pushed return address leaves it ending in c, and pushing the old ebp leaves it ending in 8). The easiest way to test the theory was to modify nanojit so that ebp is 8-byte-aligned, so Andreas did that, I tested it, and it worked.
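The fix boils down to the standard power-of-two rounding trick, sketched here with my own names rather than the actual nanojit patch. Since the stack grows toward lower addresses, aligning a frame base means rounding down:

```c
#include <stdint.h>

// Round n to a multiple of a, where a is a power of two. A downward-
// growing stack is aligned by rounding *down*, so the aligned base never
// overlaps the data above it.
static uintptr_t align_down(uintptr_t n, uintptr_t a) {
    return n & ~(a - 1);
}

static uintptr_t align_up(uintptr_t n, uintptr_t a) {
    return (n + a - 1) & ~(a - 1);
}
```

For example, align_down(0x0176f00c, 8) gives 0x0176f008, turning the problematic ebp values we were seeing into 8-byte-aligned ones.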
With the 8-byte alignment in place, the inconsistent performance went away. On my laptop in the JS shell, that corresponded to a 7x speedup in access-nbody. I also got a 5-15% improvement in the other math-heavy benchmarks. Andreas got only a 1% total speedup on his newer Penryn laptop, but he was happy to have consistent benchmarks again. We think Penryn must be smarter about unaligned accesses, or benefits from its 24-way set-associative L2 cache (vs. 16-way on mine), or probably just something 10x weirder.