A History of Insanity in the Age of x86

It’s been a long time since I’ve blogged; I’ve been pretty deep in coding mode. But bug 471822 has been fixed, and it’s time to celebrate with a post.

Bug 471822 is a TraceMonkey performance regression on SunSpider of about 70 ms that Andreas Gal noticed recently. Worse than a simple slowdown, it was an intermittent one, which made benchmark scores hard to use to drive optimization efforts.

The first really interesting thing I noticed about the bug was that the performance regression depended on the length of the command line used by SunSpider to run the JS engine. If I renamed one of the files to a longer name and ran with the longer command line, the benchmark could be either slower or faster. That’s really weird. I spent some time reading the source code to bash, thinking that maybe there was a bug in bash that altered the way it started the program. But there was nothing to be found there, and eventually I figured out how to replicate the problem using my own little C program with fork and execve.
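A minimal sketch of that kind of test harness, assuming a POSIX system. This is not the original program: the names make_padding and run_with_padding are hypothetical, and the single padding argument simply stands in for a longer filename on the command line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Build a heap-allocated string of `len` 'x' characters (hypothetical helper). */
char *make_padding(size_t len)
{
    char *s = malloc(len + 1);
    if (s == NULL)
        return NULL;
    memset(s, 'x', len);
    s[len] = '\0';
    return s;
}

/*
 * Run the program at `path` with one extra argument of `pad_len` bytes.
 * Returns the child's exit status, 127 if execve failed in the child,
 * or -1 if the child could not be started or did not exit normally.
 */
int run_with_padding(const char *path, size_t pad_len)
{
    assert(path != NULL);

    char *pad = make_padding(pad_len);
    if (pad == NULL)
        return -1;

    char *const argv[] = { (char *)path, pad, NULL };
    char *const envp[] = { NULL }; /* empty environment, so only argv varies */

    pid_t pid = fork();
    if (pid < 0) {
        free(pad);
        return -1;
    }
    if (pid == 0) {
        execve(path, argv, envp);
        _exit(127); /* only reached if execve failed */
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(pad);
    if (waited < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Timing a call such as run_with_padding("./js", n) for a range of n (the path is a placeholder) takes the shell out of the picture, so any remaining variation comes from the argument length alone.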

Somehow, I figured out that the real effect of command-line length is to alter the value of the stack pointer register (esp) on program startup, because command-line arguments and environment variables are copied onto the stack before main runs and are then passed into main as arguments. That’s just as weird as the idea of command-line length influencing perf. Fast-forward past many hours of performance profiling, hardware performance counter profiling, manual clock timer instrumentation, blah, blah, blah, and eventually I had apparently localized the perf hit to one part of one particular trace compiled for the access-nbody benchmark that did a bunch of floating-point math. I had also reached the point where the other data I had collected on the problem basically made no sense.
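To get a rough feel for the size of the shift, the sketch below adds up the bytes taken by the argument (or environment) strings. It assumes the usual Unix-like layout in which those strings sit above the initial stack pointer; the function name string_block_bytes is hypothetical, and the real startup value of esp also depends on padding and alignment chosen by the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Total bytes occupied by a NULL-terminated vector of C strings (argv or
 * envp), counting each terminating '\0'.
 */
size_t string_block_bytes(char *const vec[])
{
    assert(vec != NULL);

    size_t total = 0;
    for (size_t i = 0; vec[i] != NULL; i++)
        total += strlen(vec[i]) + 1;
    return total;
}
```

Renaming a file from a.js to abcd.js adds 3 bytes to this total, which is enough to change the low bits of every stack address the program uses afterwards.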

Azathoth: the patron saint of microarchitectural performance analysis.

So I sat down on a bean bag and started swapping crazy ideas with Andreas. We suspected that there was a problem with floating-point math, and we knew that none of the sensible theories about the bug were consistent with the data. Andreas was talking about data alignment, and in particular how he had proved that on-stack floating-point numbers were always stored in addresses offset by nice multiples of 8 from the frame pointer register, ebp, when I remembered that I had discovered earlier that day that on compiled traces, ebp always had an address ending in c (e.g. 0x0176f00c).

(Background: JavaScript numbers correspond to the C type double or IEEE-754 double-precision floating-point numbers, which are 8 bytes long. For various mysterious reasons, if you are working with 2^k-byte values, it is usually better to keep them in addresses that are multiples of 2^k. Thus, doubles should be stored in addresses that are multiples of 8, or in other words, addresses that end with 0 or 8 in hex. Compilers like gcc do this automatically, but if you are writing your own compiler, it’s all up to you.)
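The arithmetic can be checked with a few lines of C. This is only an illustration: is_aligned and slot_address are hypothetical names, and 0x0176f00c is the example ebp value quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if `addr` is a multiple of `align`, which must be a power of two. */
int is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Address of the k-th 8-byte slot below a frame pointer. */
uintptr_t slot_address(uintptr_t frame_pointer, unsigned k)
{
    return frame_pointer - 8u * (uintptr_t)k;
}
```

With a frame pointer of 0x0176f00c, every slot at a multiple of 8 below it ends in 4 or c, so none of them is 8-byte aligned. With a frame pointer ending in 8, all of them are.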

We verified the misalignment. I also remembered that in functions compiled by gcc, ebp always ends in 8. The easiest way to test the theory was to modify nanojit so that ebp is 8-byte-aligned, so Andreas did that, I tested it, and it worked.
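One common way to force an address onto an alignment boundary is to clear its low bits. The sketch below shows that technique only; it is not the actual nanojit patch, and align_down is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Round `addr` down to the nearest multiple of `align` (a power of two). */
uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}
```

Applied to the misaligned example value 0x0176f00c with an alignment of 8, this yields 0x0176f008. Rounding down is the natural direction here because the x86 stack grows toward lower addresses.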

With the 8-byte alignment in place, the inconsistent performance went away. On my laptop in the JS shell, that corresponded to a 7x speedup in access-nbody. I also got a 5-15% improvement in the other math-heavy benchmarks. Andreas got only a 1% total speedup on his newer Penryn laptop, but he was happy to have consistent benchmarks again. We think Penryn must be smarter about unaligned accesses, or benefits from its 24-way set-associative L2 cache (vs. 16 on mine), or probably just something 10x weirder.

Comments

Comment from Blaisorblade
Time: January 9, 2009, 9:13 pm

Uh, it’s just weird the first time you see that. And since reads from memory are 8 bytes wide, that’s not weird at all – misaligned reads are slower and non-atomic on x86, and forbidden on most other processors, since they require two bus operations if they happen from memory.
And well, a lot of stuff needs to be aligned, including branch targets (the Intel Optimization Guide recommends a 16-byte alignment for all of them, to help the branch predictor); IIRC the Intel Compiler uses 16-byte alignment also for the stack, but that’s useful just to support SSE 128-bit data.

I saw something similar once, in my Virtual Machine student project – adding a single addition before returning made a noticeable difference, like 30% (in an interpreter!), in a small benchmark just because my allocator used 2-byte alignment instead of 4-byte alignment (and the space for that addition misaligned most of the heap, since the bytecode was heap allocated).

Comment from Blaisorblade
Time: January 9, 2009, 10:02 pm

Oh, on the bug tracker something really weird is mentioned. I.e. that adding 32, or 64, to ESP makes a difference:

https://bugzilla.mozilla.org/show_bug.cgi?id=472791#c4
https://bugzilla.mozilla.org/show_bug.cgi?id=471822#c10

I rechecked, and Appendix D of the Intel Optimization Reference Manual (order number 248966) only mentions 16-byte stack alignment for XMM register spills, so that’s indeed weird. The cache line size is 64 bytes though, so that might matter.

Actually, I’d check if any code alignment issue is involved. Instead of adding
asm("addl $64, %esp")
compare performance with adding:
asm("addl $0, %esp")

Comment from Justin Dolske
Time: January 9, 2009, 10:48 pm

I’m curious what led you to notice that the command line length mattered? Because that sounds like a terribly non-obvious thing. :-) Random “what if”, or did something give you a clue?

Comment from AJ
Time: January 10, 2009, 10:00 am

Awesome post.

You’re slowing becoming the Mark Russinovich of browser development =).

Comment from AJ
Time: January 10, 2009, 10:01 am

Er… slowly, I mean.

Comment from dmandelin
Time: January 12, 2009, 2:34 pm

On the subject of alignments, we’re still getting help from Intel. I did compare adding 64 and adding 0 to esp as part of what led me to figure this out so far. I just posted results to the bug of a microbenchmark that uses various alignments. Most of the time, using a 4-byte alignment instead of 8 seems not to matter–it’s crossing 64- and 4096-byte boundaries that hurts. I wasn’t able to observe a 64-byte periodic effect in my tests, but I think that’s because there are many 8-byte values in the program, so a given shift will put about as many onto a 64-byte boundary as off of it.
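The boundary arithmetic behind that observation can be sketched as follows. This is not the microbenchmark posted to the bug, only the address math; crosses_boundary and count_crossings are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the `size` bytes starting at `addr` straddle a multiple of `boundary`. */
int crosses_boundary(uintptr_t addr, size_t size, uintptr_t boundary)
{
    assert(size > 0 && boundary != 0);
    return (addr / boundary) != ((addr + size - 1) / boundary);
}

/*
 * Count the offsets 0, step, 2*step, ... below `boundary` at which a value
 * of `size` bytes would straddle the next multiple of `boundary`.
 */
unsigned count_crossings(uintptr_t step, size_t size, uintptr_t boundary)
{
    assert(step != 0);

    unsigned crossings = 0;
    for (uintptr_t off = 0; off < boundary; off += step)
        crossings += crosses_boundary(off, size, boundary) ? 1u : 0u;
    return crossings;
}
```

For 8-byte values at 4-byte-aligned offsets, only 1 of the 16 positions within a 64-byte line (offset 60) straddles the line, and only 1 of 1024 within a 4096-byte page. At 8-byte-aligned offsets no position straddles either boundary.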

On how I realized command-line length mattered: Just to make the testing easier and isolate components I tried to run the SunSpider command line outside of SunSpider, but the effect went away. Digging into that, I figured out that Perl starts backtick commands (used in SunSpider) with “bash -c”, and that the effect happened then, but not if I just ran from the command line. Fiddling with bash for a while, I found out that “-c” without “-i” means the .bashrc is not run. Fiddling around with .bashrcs, I noticed that a bashrc that used ~ expansion meant no slowdown, but an empty .bashrc made things slow down even without “-c”. This made me think maybe there was a memory corruption bug in bash dependent on filename lengths. So I started playing with different filename lengths in the command line and found I was able to replicate the problem that way.

Comment from leo
Time: February 11, 2009, 4:30 pm

Go Dave!

Posts like this make me worried about any sort of tuned multicore future, esp. how the problem did not appear on the Penryn.

Comment from dustin
Time: March 1, 2010, 1:10 am

@leo, so what you fear is that a slowdown that happens due to architectural requirements goes away because that requirement is relaxed/removed? Seems like modern architectures are making life easier on software guys, not harder!

Comment from cam
Time: July 17, 2010, 2:14 pm

The 8-byte alignment strategy is real interesting, keep us updated!
