SunSpider Statistics, Part I: Questions
SunSpider’s confidence intervals. As I’m sure you know, a basic SunSpider result looks like this:
    ============================================
    RESULTS (means and 95% confidence intervals)
    --------------------------------------------
    Total:                 1226.2ms +/- 0.3%
    --------------------------------------------
      3d:                   176.6ms +/- 0.5%
        cube:                43.8ms +/- 1.0%
        morph:               34.5ms +/- 1.1%
        raytrace:            98.3ms +/- 0.7%
      access:               155.2ms +/- 0.5%
        binary-trees:        45.3ms +/- 0.8%
        fannkuch:            67.0ms +/- 0.7%
        nbody:               28.7ms +/- 1.2%
        nsieve:              14.2ms +/- 4.0%
      ... and 19 more individual tests in 7 groups
Doing the multiplication, the 95% confidence interval works out to 1222.5-1229.9ms. What this confidence interval really means is rather technical.
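As a sanity check, the arithmetic behind that interval can be reproduced directly from the figures in the output above:

```python
# Recomputing SunSpider's reported interval from the run shown above:
# a total of 1226.2 ms with a +/- 0.3% half-width.
total_ms = 1226.2
half_width_pct = 0.3

half_width_ms = total_ms * half_width_pct / 100  # about 3.7 ms
lo = total_ms - half_width_ms
hi = total_ms + half_width_ms
print(f"95% CI: {lo:.1f}-{hi:.1f} ms")  # 95% CI: 1222.5-1229.9 ms
```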
First, SunSpider’s confidence intervals are based on a gaussian model of run time. The model assumes that the time taken on any given run is a random variable
T = m + E
where m (“mean”) is a constant “baseline” run time and E (“error”) is a random variable representing noise. The model also assumes that E is gaussian with mean 0. If the model is perfectly true in reality, then there is a strong mathematical interpretation of the confidence interval. But to the extent that the model deviates from reality, the confidence interval is less meaningful.
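The model is easy to play with in a few lines of Python. Here is a sketch that draws fake run times from it and computes a t-based 95% confidence interval the way a SunSpider-style harness might; the baseline, noise level, and run count are made-up numbers, and 2.262 is the standard t critical value for 9 degrees of freedom:

```python
import random
import statistics

# Sketch of the model T = m + E: a fixed baseline m plus gaussian
# noise E with mean 0. The numbers (m = 1226 ms, sd = 8 ms, 10 runs)
# are invented for illustration.
random.seed(0)
m, sd, runs = 1226.0, 8.0, 10

times = [m + random.gauss(0.0, sd) for _ in range(runs)]
mean = statistics.mean(times)
stderr = statistics.stdev(times) / runs ** 0.5

# 95% CI using the t critical value for 9 degrees of freedom.
half = 2.262 * stderr
print(f"mean {mean:.1f} ms, 95% CI +/- {half:.1f} ms")
```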
This brings up Question 1: Do SunSpider times vary with a gaussian distribution?
If the model is true, then over a large set of SunSpider runs, the confidence interval calculated from each run will contain the baseline time in 95% of runs. For the run shown above, it's tempting to conclude that the baseline time is in the interval 1222.5-1229.9ms with probability 0.95. Under the standard statistical interpretation, that conclusion is wrong: 1222.5, 1229.9, and the baseline time are all fixed numbers, not random variables, so probability doesn't apply to them. But I don't think the intuitive interpretation is that bad.
Question 2: In a large set of SunSpider runs, is the baseline time in the calculated confidence interval 95% of the time?
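One way to attack Question 2, at least under the model's own gaussian assumption (real run data comes later), is a Monte Carlo sketch: simulate many experiments and count how often the computed interval covers the true baseline. All the parameters here are invented:

```python
import random
import statistics

# Monte Carlo check of CI coverage, assuming the gaussian model holds.
# If the noise really is gaussian, the t-based 95% CI should contain
# the true baseline in about 95% of experiments.
random.seed(1)
m, sd, runs_per_exp, experiments = 1226.0, 8.0, 10, 2000
t_crit = 2.262  # t critical value for a 95% CI, 9 degrees of freedom

hits = 0
for _ in range(experiments):
    times = [m + random.gauss(0.0, sd) for _ in range(runs_per_exp)]
    mean = statistics.mean(times)
    half = t_crit * statistics.stdev(times) / runs_per_exp ** 0.5
    if mean - half <= m <= mean + half:
        hits += 1

print(f"coverage: {hits / experiments:.3f}")  # close to 0.95
```

Coverage far from 0.95 on real data would indicate the gaussian model doesn't hold.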
SunSpider’s significance tests. SunSpider can compare two sets of results, giving output like this:
    TEST                COMPARISON         FROM               TO                 DETAILS
    ====================================================================================
    ** TOTAL **:        *1.009x as slow*   1226.2ms +/- 0.3%  1237.3ms +/- 0.8%  significant
    ====================================================================================
      3d:               *1.031x as slow*   176.6ms +/- 0.5%   182.1ms +/- 2.0%   significant
        cube:           *1.032x as slow*   43.8ms +/- 1.0%    45.2ms +/- 2.7%    significant
        morph:          ??                 34.5ms +/- 1.1%    34.8ms +/- 2.3%    not conclusive: might be *1.009x as slow*
        raytrace:       *1.039x as slow*   98.3ms +/- 0.7%    102.1ms +/- 3.4%   significant
      access:           ??                 155.2ms +/- 0.5%   156.3ms +/- 1.2%   not conclusive: might be *1.007x as slow*
        binary-trees:   *1.015x as slow*   45.3ms +/- 0.8%    46.0ms +/- 1.5%    significant
        fannkuch:       ??                 67.0ms +/- 0.7%    67.3ms +/- 1.8%    not conclusive: might be *1.004x as slow*
        nbody:          ??                 28.7ms +/- 1.2%    29.0ms +/- 1.2%    not conclusive: might be *1.010x as slow*
        nsieve:         -                  14.2ms +/- 4.0%    14.0ms +/- 2.4%
Differences are reported as “significant”, “not conclusive”, or with a dash, presumably meaning “not significant”. SunSpider’s “significant” means that the difference is significant at the 0.05 level according to the t test, which brings up more technical bits.
In general, significance tests are meant to separate observed differences that are real (the underlying baseline means are different) from random differences (the means are the same, and only the noise was different). In a basic significance test, we implicitly assume the means are the same, and then look to see if our observations tend to disprove that. If our observations would be very unlikely under that assumption, then we say the difference is significant. (Cf. if I think a certain roulette wheel is fair, and it comes up “13” 10 times in a row, I have good reason now to think it’s not fair.) So, to say that two runs are different at the 0.05 level is to say: if the means are really the same, then there was only a 5% chance of getting runs that different. In other words, if I go around doing experiments looking for differences when there aren’t any, I’ll get p=0.05-significant results 5% of the time.
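For concreteness, here is a sketch of a two-sample t test of the kind SunSpider's comparison presumably performs. I use Welch's form of the statistic, which may differ in detail from SunSpider's actual implementation, and the per-run totals are invented:

```python
import statistics

def t_statistic(a, b):
    """Welch's two-sample t statistic (no pooled-variance assumption)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / se

# Made-up per-run totals for two builds, 5 runs each.
before = [1224.9, 1226.8, 1225.3, 1227.1, 1226.9]
after_ = [1236.0, 1238.9, 1236.5, 1238.2, 1237.9]

t = t_statistic(before, after_)
# With roughly 8 degrees of freedom, |t| above about 2.3 is
# significant at the 0.05 level.
print(f"t = {t:.1f}, significant: {abs(t) > 2.3}")
```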
Since SunSpider has 26 tests, if you compare 2 runs of the same system, and the significance tests are applicable, you'd expect to see 1 or 2 false “significant” differences in the results. Hence Question 3: when comparing two runs of the same system, how many false “significant” differences does SunSpider actually report?
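The expected number of false positives follows from basic probability; the “at least one” figure additionally assumes the 26 tests are independent, which they may not be in practice:

```python
# 26 individual tests, each checked at the 0.05 significance level.
tests, alpha = 26, 0.05

# Expected false positives per comparison (no independence needed).
expected_false = tests * alpha  # 1.3

# Chance of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** tests

print(f"expected false positives: {expected_false:.1f}")
print(f"P(at least one): {p_at_least_one:.2f}")  # about 0.74
```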
SunSpider in practice. Our everyday use of SunSpider is through a little script called bench.sh that runs one trial and prints the total time. The idea is that a developer can run it a few times, make a change to TraceMonkey, then run it a few more times and see whether things got better or worse. It's supposed to be a quick test, taking 5 or 10 seconds to run and interpret, that can be run after any little change. So I want to know the answer to Question 4: What can we do with 3-5 SunSpider runs? How small a performance change can they reasonably detect? And what's the best way to summarize the numbers: average them, take the minimum, or something else?
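The “average or minimum” part of Question 4 can at least be explored under the gaussian model. This sketch compares the spread of the two summaries across repeated 5-run batches; all the numbers are made up, and note that real timing noise is often right-skewed, which can tilt the answer in the minimum's favor:

```python
import random
import statistics

# Compare two summaries of a 5-run batch, mean vs. minimum, under
# purely gaussian noise. Baseline and noise level are invented.
random.seed(3)
m, sd, runs, batches = 1226.0, 8.0, 5, 2000

means, mins = [], []
for _ in range(batches):
    times = [m + random.gauss(0.0, sd) for _ in range(runs)]
    means.append(statistics.mean(times))
    mins.append(min(times))

print(f"mean of 5: spread (sd) {statistics.stdev(means):.1f} ms")
print(f"min  of 5: spread (sd) {statistics.stdev(mins):.1f} ms")
```

Under this purely gaussian model the mean is the steadier summary; whether that holds for real, skewed timer noise is part of what the later measurements need to settle.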
Another use of SunSpider is to run over time, over our Mercurial history, to look for performance regressions that sneaked in with other changes. For that purpose, the runs will be automatic, so it's OK if they take a while: we can do a lot of trials. So here, I want to answer Question 5: How many runs are needed to detect the smallest performance change that counts? To make that question concrete, we also need to decide what kind of performance change counts. In theory, a 1ms slowdown is bad, because after 100 checkins like that we've slowed down by 100ms, which definitely matters. But many checkins shift the numbers by plus or minus 1-3ms “randomly”, just because gcc's optimizer does something a little different. So I figure 5ms or so is a real difference we'd like to detect.
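Question 5 can be framed as a power calculation. This simulation sketch estimates how often a t test flags a genuine 5ms slowdown for a few run counts; the noise level is invented, and 2.1 is only a rough significance threshold for these degrees of freedom:

```python
import random
import statistics

# Rough power estimate: with n runs per build and gaussian noise,
# how often does a Welch-style t test detect a real 5 ms slowdown?
# sd, the threshold 2.1, and the trial count are all assumptions.
random.seed(2)
sd, shift, trials = 4.0, 5.0, 1000

def detects(n):
    a = [random.gauss(0.0, sd) for _ in range(n)]
    b = [random.gauss(shift, sd) for _ in range(n)]
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    return abs(statistics.mean(b) - statistics.mean(a)) / se > 2.1

results = {}
for n in (5, 10, 20):
    results[n] = sum(detects(n) for _ in range(trials)) / trials
    print(f"n = {n:2d}: detected {results[n]:.0%} of the time")
```

Under these assumed numbers, a handful of runs misses a 5ms shift fairly often, while a few dozen runs catch it reliably; the real answer depends on measuring the actual noise level.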
Summary. I’ve laid out two broad areas I’d like to learn more about. First, I want to know if SunSpider’s statistical claims hold up in reality, and if not, where the deviation is. Second, I want to figure out the right number of runs to use, how to combine them, and how small of differences they can detect for a few important practical applications.
In Part II, I’ll start trying to answer these questions.