## SunSpider Statistics, Part II

In my last post, I laid out some questions I wanted to answer. I decided to tackle them empirically, by running SunSpider 100 times with --runs=10, for a total of 1000 test suite runs, and then checking whether SunSpider’s confidence intervals and significance tests come out as predicted. Empirical methods are nice because you can do interesting statistics without doing hard math.

The first one:

Question 1: Are SunSpider times normally (gaussian) distributed?

There are a bunch of normality tests out there but I didn’t want to do anything complicated. Instead, I just made a histogram of times and compared it to the best fit normal distribution. For my 1000 runs, the mean was 1232.2 and the standard deviation was 11.48. Here is a plot of the observed times in black and the normal distribution with mean 1232.2 and standard deviation 11.48 in red (ignore the blue line for now):

The shape is vaguely bell-curve like, but the scaling really isn’t normal. In particular, the peak of the empirical data is about twice as high as the normal distribution model, and the data is much more concentrated in a center band. Also, the tail of the real data goes much further to the right than the normal distribution. For the longest time I got, 1329ms, the normal distribution has a probability of 1.3×10^-14. Finally, the left tail of the data falls off a bit faster than the normal distribution.
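To get a feel for how far out 1329ms sits under the fitted normal model, here is a minimal sketch that computes its z-score and the upper-tail probability. The post doesn't say exactly which quantity its figure is (a tail probability, a density, or an expected histogram count), so treat this as a sanity check on "absurdly unlikely" and not as a reproduction of that number. The function name is made up for illustration.

```python
import math

MEAN_MS = 1232.2    # sample mean reported in the post
STDEV_MS = 11.48    # sample standard deviation reported in the post
SLOWEST_MS = 1329   # longest observed time reported in the post


def normal_upper_tail(x, mean, stdev):
    """P(X >= x) for X ~ Normal(mean, stdev), via the complementary error function."""
    z = (x - mean) / stdev
    return 0.5 * math.erfc(z / math.sqrt(2))


z_score = (SLOWEST_MS - MEAN_MS) / STDEV_MS
tail = normal_upper_tail(SLOWEST_MS, MEAN_MS, STDEV_MS)
print(f"z = {z_score:.2f}, upper-tail probability = {tail:.2e}")
```

The slowest run is more than eight standard deviations above the mean, which a genuinely normal process would essentially never produce in 1000 samples.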

It’s safe to say these times are not normally distributed. If they are, then my sample is a once-in-the-lifetime-of-the-earth rarity. Another reason not to expect a normal distribution in the first place is that a normal distribution can generate any value, positive or negative, but SunSpider times can only be positive. So I tried fitting a related positive-only distribution, the log-normal, shown in blue. To me, it looks like neither a better fit nor a worse one.

I think these times might come from a model with two different kinds of noise:

T = m + En + Ep

where m (“mean”) is the baseline time as before. En is noise centered at zero with a higher peak than a gaussian distribution. I’m not sure why the peak is so pointy; it might be because of the limited (1ms) timer resolution. Ep is another source of noise that is always positive (making the program run longer), perhaps with an exponential or log-normal distribution. I think Ep comes from rare events that cause relatively long delays, such as a background process waking up to check for updates on some program I never use.
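Here is a rough simulation of that two-noise model, mostly to show what kind of histogram it produces. Everything specific in it is an assumption made for illustration and not something measured: En is drawn from a Laplace distribution (symmetric, with a sharper peak than a gaussian), Ep is zero most of the time and occasionally an exponential delay, and the parameter values (`jitter_ms`, `delay_chance`, `delay_mean_ms`) are invented.

```python
import random
import statistics


def simulate_time(rng, baseline_ms=1232.0, jitter_ms=4.0,
                  delay_chance=0.05, delay_mean_ms=30.0):
    """Draw one hypothetical total time T = m + En + Ep, in whole milliseconds."""
    # En: the difference of two exponentials is Laplace-distributed,
    # i.e. centered at zero with a pointier peak than a gaussian.
    en = rng.expovariate(1.0 / jitter_ms) - rng.expovariate(1.0 / jitter_ms)
    # Ep: usually zero, occasionally a positive (exponential) delay.
    ep = rng.expovariate(1.0 / delay_mean_ms) if rng.random() < delay_chance else 0.0
    # Round to mimic a timer with 1 ms resolution.
    return round(baseline_ms + en + ep)


rng = random.Random(42)
times = [simulate_time(rng) for _ in range(1000)]
print("mean:", statistics.mean(times))
print("median:", statistics.median(times))
print("min/max:", min(times), max(times))
```

With parameters like these, the simulated data has a tight central band plus a sparse right tail, which is qualitatively the shape described above. Matching the real histogram would require fitting the parameters, which this sketch does not attempt.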

Although the times aren’t normally distributed, and all of SunSpider’s statistics are based on assuming a normal distribution, the statistics might work out all right in practice. We have to check them separately.

Question 2: Are SunSpider’s confidence intervals valid?

SunSpider gave me 100 95% confidence intervals for my 100 runs. If they are valid confidence intervals, then about 95 of them should contain the true mean.

Of course, I have no way of getting the true mean from any finite number of samples, but I’m going to arbitrarily assume that taking the mean or median of my 1000 samples is about right. The mean of 1000 samples is 1232.2. The median of the 100 averages is 1232.4, so it doesn’t really matter which one I use. I used the median because I figured it would reduce the effect of the outliers at the top.

In my trial, 86 of 100 confidence intervals contained the median. I think that’s pretty good, given the complex distribution of the underlying process.

For comparison, I did the same confidence interval test 20 times on sets of 100 fake SunSpider runs where the results are generated from a true normal process with the same mean and standard deviation. In those tests, 87 to 97 of the results contained the median, usually around 93 or so, about as predicted. (Because the median isn’t the true mean, the confidence intervals probably miss the median more often than they miss the true mean.)
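The fake-data version of this coverage check can be sketched as below. It assumes each run's interval is a Student-t interval on 10 samples (critical value 2.262 for 9 degrees of freedom). Whether that matches SunSpider's own calculation in every detail is an assumption, and the helper names are hypothetical.

```python
import random
import statistics

T_CRIT_95_DF9 = 2.262  # two-sided 95% Student-t critical value, 9 degrees of freedom


def confidence_interval(samples, t_crit=T_CRIT_95_DF9):
    """Return (low, high) for the mean of `samples` using a Student-t interval.

    The default critical value is only correct for 10 samples.
    """
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / len(samples) ** 0.5
    return mean - half_width, mean + half_width


def coverage_count(runs, reference):
    """Count how many runs have a confidence interval containing `reference`."""
    return sum(low <= reference <= high
               for low, high in map(confidence_interval, runs))


rng = random.Random(1)
# 100 fake runs of 10 samples each, drawn from a true normal process.
fake_runs = [[rng.gauss(1232.2, 11.48) for _ in range(10)] for _ in range(100)]
reference = statistics.median(statistics.mean(run) for run in fake_runs)
covered = coverage_count(fake_runs, reference)
print(f"{covered} of {len(fake_runs)} intervals contain the median of the averages")
```

Feeding real per-run samples into `coverage_count` in place of `fake_runs` gives the same kind of count as the 86 of 100 reported above.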

So, for the total time, the SunSpider confidence interval is probably pretty good if regarded as more of a 90% confidence interval than a 95% one.

I did the same experiment for each individual test as well. On those, the score ranged from 64 for 3bit-bits-in-byte to 100 for cordic, with most of them around 85-95. Broadly, I’d say the results are about as reliable as those for the total score. It does seem that the longer tests tended to do worse, probably because the rare-delay noise gets more of a chance to kick in on longer tests.

On medium-length tests (runtime 8-80ms or so), the confidence intervals seem very reliable. I figured that the normality assumption might hold better for these tests, but it turns out they have pointy peaks and long right tails too. The real reason the median is always in the confidence interval on cordic is that (a) the results are very tightly clustered (99% are either 22ms or 23ms) so most runs are close to the mean, and (b) there are some outliers, bumping up the standard deviations a bit.

Question 3: Are SunSpider’s significance tests valid?

I tested the significance indicators using a procedure similar to the one for confidence intervals. Using my 100 runs, I did 50 pairwise comparisons. Because they are all for the same JavaScript engine, all differences are random noise, so there should be a “significant” indicator on about 2 or 3 pairs (5%).

SunSpider reported “significant” in 4 cases (8%). That’s really pretty good, and I’d be willing to believe that the extra 1.5 “significant”s are just a feature of my sample, except for the fact that all my other tests so far show SunSpider’s statistics to be a little less precise than with a true normal distribution.
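For a baseline, the same false-positive count can be simulated on truly normal data. This sketch assumes a pooled two-sample t-test at the 5% level (critical value 2.101 for 18 degrees of freedom) and pairs run 0 with run 1, run 2 with run 3, and so on. Both the choice of test and the pairing scheme are assumptions here, since the post doesn't spell out either.

```python
import random
import statistics

T_CRIT_95_DF18 = 2.101  # two-sided 95% Student-t critical value, 18 degrees of freedom


def looks_significant(a, b, t_crit=T_CRIT_95_DF18):
    """Pooled two-sample t-test for equal-sized samples a and b.

    The default critical value is only correct for two samples of 10.
    """
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    std_err = (2 * pooled_var / n) ** 0.5
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return abs(t) > t_crit


rng = random.Random(7)
runs = [[rng.gauss(1232.2, 11.48) for _ in range(10)] for _ in range(100)]
# Pair run 0 with run 1, run 2 with run 3, and so on: 50 comparisons.
pairs = list(zip(runs[0::2], runs[1::2]))
false_positives = sum(looks_significant(a, b) for a, b in pairs)
print(f"{false_positives} of {len(pairs)} same-engine pairs flagged as significant")
```

Since both runs in every pair come from the same distribution, any flagged pair is a false positive by construction. Repeating this with different seeds shows how much the count bounces around 2 or 3 just by chance.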

On the individual tests, “significant” indicators ranged from 0% to 24% of trials. I couldn’t find any real pattern in the numbers. But I would say that the “significant” indicators aren’t really reliable at the 5% level for individual tests. Also, from experience with the data and actually debugging performance effects, I’ve found the significance indicator is fairly useful on the total, but not on the individual tests.

Whether this is good enough or not depends on how you intend to use the tests. For a one-off run, I’d say the significance indicator on the total is solid, and the others can be used with a grain of salt and in the light of other data.

But if you run SunSpider all the time, like we do, you would get a lot of false positive differences even on the total. I’d say a 99% significance level would be more useful in that case, which corresponds to maybe doubling the width of SunSpider’s confidence intervals. For a --runs=10 run, that means a difference of 15ms or more is significant.
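One way to arrive at a threshold near 15ms is the back-of-the-envelope calculation below. It assumes a pooled t-test on two 10-sample runs, and it uses the overall standard deviation of 11.48ms as a stand-in for each run's spread. The post may have derived its figure differently, so this is a plausibility check only.

```python
import math

STDEV_MS = 11.48        # standard deviation of the total time, from the post
RUNS = 10               # samples per SunSpider invocation (--runs=10)
T_CRIT_99_DF18 = 2.878  # two-sided 99% Student-t critical value, 18 degrees of freedom


def significant_difference_ms(stdev, n, t_crit):
    """Smallest difference of two n-sample means that a pooled t-test would flag."""
    std_err_of_difference = stdev * math.sqrt(2.0 / n)
    return t_crit * std_err_of_difference


threshold = significant_difference_ms(STDEV_MS, RUNS, T_CRIT_99_DF18)
print(f"99% level: differences above about {threshold:.1f} ms")
```

Under these assumptions the threshold comes out a little under 15ms, in line with the rule of thumb above.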

But given the right-tail outliers, if you do see a significant time increase, it’s probably always a good idea to rerun to make sure you weren’t looking at an outlier. Seeing two outliers in a row is pretty unlikely.

In part III, I’ll get to the questions of how to process the results in practice. Also, over the weekend I’m going to try to run a series of tests over our revision history, so I should also have some results to report from that next week.

Comment from Bill McCloskey
Time: February 8, 2009, 11:31 pm

Hey Dave, this was really interesting. I love it when people really look into stuff like this. It’d be really interesting to isolate the sources of noise in these tests. Have you tried running the SunSpider tests at realtime priority? I know this is possible in Linux, and I imagine MacOS has a similar feature (in Linux you use sched_setscheduler). The upside is that you won’t get interference from the scheduler or any other processes, so you just have to worry about the occasional interrupt bottom-half. The downside is that you’d better not have any infinite loops, or you’ll have to reboot :-). Normally using RT prio isn’t worth the aggravation, but since your tests are so short-running it might make sense.

Comment from leo
Time: February 11, 2009, 4:42 pm

+1 for Bill’s comment, and, on a more mundane level, just basic initialization costs.

E.g., I’m using Cilk++ for some algorithm prototypes, and just starting the thread package takes 200ms, which blew out my initial benchmarks, and I got another 20ms of variance per job just for launching a task from bash that requests a pthread.

Comment from PMP Course
Time: October 29, 2010, 8:39 am

Very interesting topic. However I am nosebleed.. Not really familiar with the topic but it was really interesting.