At a meeting a couple of weeks ago, I volunteered to look at the Talos data for the last release cycle to see how stable the tests were. I wasn’t going to do extensive statistical analysis on the numbers, since the Talos servers already do that when deciding whether a change is significant or not. What I initially came up with was:
- Detecting important changes: do multiple platforms change consistently? If the numbers for platform X go up by 5% while the numbers for platform Y go down by 3%, maybe the test isn’t very stable.
- Do the numbers wobble around a lot? If the numbers for platform X were up one day and down the next, and this happened several times, then the test isn’t particularly stable on that platform. However, if the test numbers keep going in one direction, then our confidence in that test increases somewhat.
(I realize that the specifics of the changesets involved may invalidate the generalizations made above. But as a good first cut, e.g. for somebody who’s just trying to see if there are significant Talos regressions, this seems like a good start.)
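To make those two checks concrete, here is a minimal sketch in Python. It is not code from any existing tool; the function names and input shapes are hypothetical. It assumes each change is a signed percentage (positive means the number went up) and treats a change of exactly zero as neutral.

```python
def platforms_agree(changes):
    """Return True if every nonzero change has the same sign.

    `changes` maps a platform name to its percent change for one
    range of changesets (hypothetical input shape).
    """
    signs = {change > 0 for change in changes.values() if change != 0}
    return len(signs) <= 1


def direction_flips(series):
    """Count how often consecutive nonzero changes reverse direction.

    `series` is one platform's percent changes in chronological order.
    Many flips suggest the test wobbles on that platform.
    """
    signs = [change > 0 for change in series if change != 0]
    return sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)


if __name__ == "__main__":
    # Platform X up 5%, platform Y down 3%: the platforms disagree.
    print(platforms_agree({"X": 5.0, "Y": -3.0}))
    # Up, down, up, down: the direction reverses three times.
    print(direction_flips([2.0, -2.0, 3.0, -1.0]))
```

A test whose platforms rarely agree, or whose per-platform series flips direction often, would be a candidate for "not very stable"; what counts as "often" is a judgment call this sketch does not try to make.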
I started out trying to catalog these numbers manually and quickly decided that was for the birds. I found myself mixing up numbers for different tests, different trees, and different platforms (to say nothing of PGO versus non-PGO builds on inbound!), due mostly to the large number of emails to look at and the similarity between the subjects of the emails. A better approach was called for.
So I wrote talos-summarize for parsing dev.tree-management archives for some date range and generating useful visualizations from them. Wanting some insight into the questions above, I chose a table whose rows represent ranges of changesets and whose columns contain data for how a particular platform’s numbers change. This makes it fairly easy to address point 1 above; point 2 is somewhat addressed by eyeballing the table and looking at the distribution of +/-s, but it should also be addressed by computing cumulative changes from the beginning of the chosen range. That last point hasn’t been implemented yet.
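Since the cumulative computation isn’t implemented, the following is only a sketch of what it might look like, not what talos-summarize does. It assumes each cell in a platform’s column is a percentage change relative to the previous value, so successive changes compound rather than add; the function name is hypothetical.

```python
def cumulative_changes(percent_changes):
    """Return the running cumulative percent change from the start of the range.

    Assumes each entry is a percent change relative to the value just
    before it, so changes are compounded multiplicatively.
    """
    ratio = 1.0
    totals = []
    for pct in percent_changes:
        ratio *= 1.0 + pct / 100.0
        totals.append((ratio - 1.0) * 100.0)
    return totals


if __name__ == "__main__":
    # A 10% rise followed by a 10% drop does not return to the baseline.
    for total in cumulative_changes([10.0, -10.0]):
        print(f"{total:+.2f}%")
```

A column whose cumulative change stays near zero despite many individual +/- entries is wobbling; one whose cumulative change keeps growing points at a real trend.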
Showing pictures would be more useful; I generated visualizations for all the Talos tests that ran on inbound during the start of the Firefox 15 release cycle. The correspondence between the filenames and the tests themselves should be pretty obvious.
I started by looking at the Ts, MED Dirty Profile table. This table made me happy, because you can see the consistent jump across all platforms as a result of bug 769960 landing and the corresponding decrease when the fix for the startup regression from that bug landed (bug 778855). That test also looks like it has fairly stable numbers; the numbers don’t jump around too much. The table also suggests that we significantly regressed startup time on multiple platforms with this pushlog; this regression didn’t get addressed during the initial cycle of 15, even though it was about as severe as the regression from bug 769960 (and touched our more critical Windows platforms, too).
For the most part (and this is just eyeballing), the numbers from all tests don’t jump around or provide contradictory information. The one exception would be Tp5 No Network Row Major MozAfterPaint (what a mouthful!), where the row in the middle looks quite odd. Unfortunately, stable numbers across all the tests also look like constantly increasing numbers across all the tests, which is not great.
I’m still looking at the numbers and tweaking the script; it doesn’t get everything right (I know it does strange things when summarizing the Number of Constructors test, for instance). What do you think could be improved about the visualization, and what extra information should the summary try to present?