Sep 12

looking at talos differently

At a meeting a couple of weeks ago, I volunteered to look at the Talos data for the last release cycle to see how stable the tests were.  I wasn’t going to do extensive statistical analysis on the numbers, since the Talos servers already do that when deciding whether a change is significant or not.  What I initially came up with was:

  1. Detecting important changes: do multiple platforms change consistently?  If the numbers for platform X go up by 5% while the numbers for platform Y go down by 3%, maybe the test isn’t very stable.
  2. Do the numbers wobble around a lot?  If the numbers for platform X were up one day and down the next, and this happened several times, then the test isn’t particularly stable on that platform.  However, if the test numbers keep going one direction, then our confidence in that test increases somewhat.

(I realize that the specifics of the changesets involved may invalidate the generalizations made above.  But as a good first cut, e.g. for somebody who’s just trying to see if there are significant Talos regressions, this seems like a good start.)
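As a toy illustration of point 1 (all numbers here are made up for the example), checking whether per-platform percentage changes agree in sign is a one-liner:

```shell
# Hypothetical per-platform deltas for a single change; mixed signs
# across platforms suggest the test may be noisy rather than regressed.
printf '%s\n' "+5.0" "-3.0" "+1.2" | awk '
  { if ($1 > 0) up++; else if ($1 < 0) down++ }
  END { if (up && down) print "mixed signs: suspect"; else print "consistent" }'
```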

I started out trying to catalog these numbers manually and quickly decided that was for the birds.  I found myself mixing up numbers for different tests, different trees, and different platforms (to say nothing of PGO versus non-PGO builds on inbound!), due mostly to the large number of emails to look at and the similarity between the subjects of the emails.  A better approach was called for.

So I wrote talos-summarize for parsing dev.tree-management archives for some date range and generating useful visualizations from them.  Wanting some insight into the questions above, I chose a table whose rows represent ranges of changesets and whose columns contain data for how a particular platform’s numbers change.  This makes it fairly easy to address point 1 above; point 2 is somewhat addressed by eyeballing the table and looking at the distribution of +/-s, but it should also be addressed by computing cumulative changes from the beginning of the chosen range.  That last point hasn’t been implemented yet.
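The cumulative-change computation amounts to compounding the per-push percentage deltas rather than summing them; a sketch with made-up numbers:

```shell
# Compound hypothetical per-push deltas (+5%, -3%, +2%) into one
# cumulative change from the beginning of the range.
printf '%s\n' "+5.0" "-3.0" "+2.0" | awk '
  BEGIN { cum = 1 }
  { cum *= 1 + $1 / 100 }
  END { printf "cumulative change: %+.2f%%\n", (cum - 1) * 100 }'
# → cumulative change: +3.89%
```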

Showing pictures would be more useful; I generated visualizations for all the Talos tests that ran on inbound during the start of the Firefox 15 release cycle.  The correspondence between the filenames and the tests themselves should be pretty obvious.

I initially started looking at the Ts, MED Dirty Profile table.  This table made me happy, because you can see the consistent jump across all platforms when bug 769960 landed and the corresponding decrease when the fix for that startup regression, bug 778855, landed.  That test also looks fairly stable; the numbers don’t jump around too much.  The table also suggests that we significantly regressed startup time on multiple platforms with this pushlog; that regression didn’t get addressed during the initial cycle of 15, even though it was about as severe as the regression from bug 769960 (and touched our more critical Windows platforms, too).

For the most part–and this is just eyeballing–the numbers from the tests don’t jump around or provide contradictory information.  The one exception is Tp5 No Network Row Major MozAfterPaint (what a mouthful!), where the row in the middle looks quite odd.  Unfortunately, numbers that are “stable” across all the tests are also numbers that steadily increase across all the tests, which is not great.

I’m still looking at the numbers and tweaking the script; it doesn’t get everything right (I know it does strange things when summarizing the Number of Constructors test, for instance).  What do you think could be improved about the visualization or what extra information should the summary try to present?


Sep 12

running talos locally

We had a meeting about not regressing Talos last week.  One of the concerns brought up at the meeting was that reproducing Talos regressions was hard: dissimilar machines, not knowing where Talos is, inability to run Talos locally, etc. etc.

Joel Maher noted that Talos has its very own wiki page, complete with checkout instructions and information about running the tests.  In the interest of looking like a functioning member of the performance team (“benchmarks?  what are those?”), I decided to verify those instructions.  The monospace commands below were run on a Linux/x86-64 box.

  1. Check out the repository: hg clone http://hg.mozilla.org/build/talos && cd talos
  2. Run the INSTALL.py script: ./INSTALL.py
  3. Run the virtualenv activation script: . bin/activate  This step modifies your shell prompt to include a (talos) string at the beginning; it also adds the location of the Talos scripts to your PATH.
  4. Create a test configuration: PerfConfigurator --develop --executablePath /path/to/firefox --activeTests ts --results_url file:///${HOME}/ts.txt --output ts_desktop.yaml  Note that providing the path to a Firefox you just compiled (~/src/build-mc/dist/bin/firefox in my case) works just fine.
  5. Run the tests with your configuration: talos -n ts_desktop.yaml  Unfortunately, you have to shut down any Firefox instances you have running or the test harness will complain.  I filed bug 787980 for fixing this.
  6. Sit back and relax while you watch the test harness print progress messages on your terminal.  You may see messages about logs being posted to http://datazilla.mozilla.org/talos; the --develop flag ought to prevent that from happening.  It’s not clear to me whether the bug is that the messages are wrong or that --develop doesn’t actually inhibit uploading.
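Put together, the whole session looks something like this (the executablePath is my own build; substitute yours, and the first command obviously needs network access):

```shell
# Recap of steps 1-5 above as one shell session; paths are illustrative.
hg clone http://hg.mozilla.org/build/talos && cd talos
./INSTALL.py
. bin/activate                       # adds "(talos)" to the prompt, puts scripts on PATH
PerfConfigurator --develop \
    --executablePath ~/src/build-mc/dist/bin/firefox \
    --activeTests ts \
    --results_url file:///${HOME}/ts.txt \
    --output ts_desktop.yaml
talos -n ts_desktop.yaml             # shut down any running Firefox first
```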

And that’s it!  You’ve now run the Talos startup tests.  I haven’t tried running the pageload tests (yet!) because doing them properly requires downloading quite a number of webpages and arranging things just so.