05
Oct 12

looking at talos differently, part 2

Over the past couple of weeks, I’ve been working on refining the Talos summary script I blogged about earlier.  The fruits of the labor can be seen in summary pages for the last three release cycles:

(It’s worth pointing out that the coloring for the JS-related tests is wrong; I think those tests are “bigger is better” tests.)

One meta-point before diving into suggestions for various tests: emails don’t always get triggered for the correct changesets.  To see what I mean, take a look at the following examples:

All this suggests that there’s a bug in how we’re benchmarking our trees and generating our results; I haven’t investigated any further.


19
Sep 12

looking at talos differently

At a meeting a couple of weeks ago, I volunteered to look at the Talos data for the last release cycle to see how stable the tests were.  I wasn’t going to do extensive statistical analysis on the numbers, since the Talos servers already do that when deciding whether a change is significant or not.  What I initially came up with was:

  1. Detecting important changes: do multiple platforms change consistently?  If the numbers for platform X go up by 5% while the numbers for platform Y go down by 3%, maybe the test isn’t very stable.
  2. Do the numbers wobble around a lot?  If the numbers for platform X were up one day and down the next, and this happened several times, then the test isn’t particularly stable on that platform.  However, if the test numbers keep going one direction, then our confidence in that test increases somewhat.

(I realize that the specifics of the changesets involved may invalidate the generalizations made above.  But as a good first cut, e.g. for somebody who’s just trying to see if there are significant Talos regressions, this seems like a good start.)

I started out trying to catalog these numbers manually and quickly decided that was for the birds.  I found myself mixing up numbers for different tests, different trees, and different platforms (to say nothing of PGO versus non-PGO builds on inbound!), due mostly to the large number of emails to look at and the similarity between the subjects of the emails.  A better approach was called for.

So I wrote talos-summarize for parsing dev.tree-management archives for some date range and generating useful visualizations from them.  Wanting some insight into the questions above, I chose a table whose rows represent ranges of changesets and whose columns contain data for how a particular platform’s numbers change.  This makes it fairly easy to address point 1 above; point 2 is somewhat addressed by eyeballing the table and looking at the distribution of +/-s, but it should also be addressed by computing cumulative changes from the beginning of the chosen range.  That last point hasn’t been implemented yet.

Showing pictures would be more useful; I generated visualizations for all the Talos tests that ran on inbound during the start of the Firefox 15 release cycle.  The correspondence between the filenames and the tests themselves should be pretty obvious.

I initially started looking at the Ts, MED Dirty Profile table.  This table made me happy, because you can see the consistent jump across all platforms as a result of bug 769960 landing and the corresponding decrease when the startup regression from that bug landed, bug 778855.  That test also looks like it has fairly stable numbers; the numbers don’t jump around too much.  The table also suggests that we significantly regressed startup time on multiple platforms with this pushlog; this regression didn’t get addressed during the initial cycle of 15, even though it was about as severe as the regression from bug 769960 (and touched our more critical Windows platforms, too).

For the most part–and this is just eyeballing–the numbers from all tests don’t jump around or provide contradictory information.  The one exception would be Tp5 No Network Row Major MozAfterPaint (what a mouthful!), where the row in the middle looks quite odd.  Unfortunately, stable numbers across all the tests also look like constantly increasing numbers across all the tests, which is not great.

I’m still looking at the numbers and tweaking the script; it doesn’t get everything right (I know it does strange things when summarizing the Number of Constructors test, for instance).  What do you think could be improved about the visualization or what extra information should the summary try to present?

 


03
Sep 12

running talos locally

We had a meeting about not regressing Talos last week.  One of the concerns brought up at the meeting was that reproducing Talos regressions was hard: dissimilar machines, not knowing where Talos is, inability to run Talos locally, etc. etc.

Joel Maher noted that Talos has its very own wiki page, complete with checkout instructions and information about running the tests.  In the interest of looking like a functioning member of the performance team (“benchmarks?  what are those?”), I decided to verify those instructions.  The monospace commands below were run on a Linux/x86-64 box.

  1. Checkout the repository: hg clone http://hg.mozilla.com/build/talos && cd talos
  2. Run the INSTALL.py script: ./INSTALL.py
  3. Run activation script for virtualenv: . bin/activate  This step will modify your shell prompt to include a (talos) string at the beginning; it also modifies PATH to include the location of talos scripts.
  4. Create a test configuration: PerfConfigurator --develop --executablePath /path/to/firefox --activeTests ts --results_url file:///${HOME}/ts.txt --output ts_desktop.yaml Note that providing a path to a firefox you just compiled (~/src/build-mc/dist/bin/firefox in my case) works just fine.
  5. Run tests with your configuration: talos -n ts_desktop.yaml Unfortunately, you have to shutdown any Firefox instances you have running or the test harness will complain. I filed bug 787980 for fixing this.
  6. Sit back and relax while you watch the test harness print progress messages on your terminal.  You may see messages about logs being posted to http://datazilla.mozilla.org/talos; the --develop flag ought to prevent that from happening,   It’s not clear to me whether the bug is in the wrong messages or that --develop doesn’t actually inhibit uploading.

And that’s it!  You’ve now run the Talos startup tests.  I haven’t tried running the pageload tests (yet!) because doing them properly requires downloading quite a number of webpages and arranging things just so.


22
Aug 12

humane practices for review requests

I dislike seeing “review not granted” or “review cancelled” in my bugmail.

Even if the reviewer provides helpful comments and points out things that you missed, those few words are the first thing you see about your patch.  Part of me understands that these headlines are informative messages, that they are not judgements on me as a person, merely some sort of judgement on the patch that I have written.  But the language doesn’t encourage that perspective.  Permissions are not granted, entrance is not granted, your favorite show on television is cancelled, and so forth.  These are not heartening words: they are words that shut you out, that indicate your contribution was not desirable.

I took a mathematics class in college where individual problems on homework sets were scored out of a maximum of three points.  Half points were virtually unheard of.  In the usual grade/percentage terms, the scores for a problem looked roughly like this: A, D, F, F.  Not encouraging.  One day, the professor announced that there had been some grumbling from students about this scale.  He then described a colleague’s grading scale for problems in a class of similar difficulty: check-plus, check, check-minus, and zero.  “Now it is trivial to show that these two scales are isomorphic to one another,” he said, smiling.  “But I don’t hear any [grumbling] from Dr. L’s students about his grading scale!”  The message was clear: the scale isn’t that bad, just get over it.[1][2]

However, it matters how you say things, not just what you say.  Some organizations use a plus/delta feedback framework instead of a plus/minus framework.  That is, instead of dividing things into good and bad, you divide things into good and “changes needed”.  Phrasing things as “deltas” nudges people to provide actionable items; it’s the difference between saying “it sucked that the meeting ran half an hour over” versus “the meeting should stay within the time allotted”.

With all this in mind, if I get r?‘d on a patch, I will provide one of three responses (within 24 hours, natch):

  • r+ the patch.  Minor changes may be requested; whether I ask to see the patch again depends on how confident I am in the author and/or the triviality of the changes.  For instance, syntax changes and so forth don’t usually require a re-review.  Documentation changes often can.
  • f+ the patch.  As I understand it, this shows up in bugmail as “feedback granted”, not “review cancelled” (props to the Bugzilla team for doing this!).  Doing this lets the patch author know that they are making progress and that their contribution is appreciated (you took the time to provide feedback), but there are still some changes to be made.
    If at all possible, once the author makes the changes you have requested, r+ the patch.  Your feedback can be thought of as a contract with the patch author: you do these things and the patch is good enough.  Don’t change the terms of the contract.  File a followup bug, if need be.
  • Cancel the review.  I don’t like doing this due to reasons given above, but sometimes you have to, e.g. the wrong patch is provided.  The following scenario is also, I think, reasonable grounds for cancellation, provided you make it clear:
    • A patch depends on prior patches in the same bug;
    • You have reviewed those prior patches and requested significant changes;
    • Those changes would require equally significant changes to the patch in question, thereby rendering any review that you would do moot.

[1] This class was Real Analysis.  Real Analysis is often graded on a very…generous curve, owing to the material and that the course is typically students’ first introduction to rigorous mathematical proof.  So your mental model of what a “good” grade is requires some adjustment anyway.  But still.

[2] I was taking Dr. L’s course simultaneously with Reals.  I can attest that his grading scale did not bother me; indeed, I don’t think I recognized the isomorphism until it was pointed out.


17
Aug 12

moving away from x-macros

In any large (and even many small) codebases, there comes a point where some knowledge about an entity X is needed at various points throughout the codebase, but the knowledge required at each point is ever-so-slightly different.  You could provide comments at each point, stating the list of other locations that need to be updated appropriately when changes occur at that point, but that gets tedious quickly; it’s also error-prone, since it’s easy to forget the updating.  One technique for handling this in Mozilla’s C++ codebase is known as X-macros: a header file filled with calls to some macro X:

X(name1, data1, ...)
X(name2, data2, ...)
X(name3, data3, ...)
...

This header is then included at various points.  At each point, X is defined to extract whatever data is necessary:

enum ID {
#define X(name, data, size) name
#include "Xmacro_header.h"
#undef X
};
...
static const uint32_t sizes[] = {
#define X(name, data, size) size,
#include "Xmacro_header.h"
#undef X
};
...
method(enum ID, ...)
{
  uint32_t size = sizes[ID];
  ...
}

A prime example of this is the header content/events/public/nsEventNameList.h, which gets included at various points for generating method declarations, method definitions, and name tables; we use a similar header toolkit/components/telemetry/TelemetryHistograms.h for defining Telemetry histograms and the associated IDs those histograms are identified with.

But we are moving away from this technique in Telemetry: we’re soon going to define our histograms using JSON and generate the necessary information from that JSON with Python scripts, rather than relying on the C preprocessor.

The main motivation for doing this is validation: we don’t have anything on the server-side of Telemetry which defines schemas for Telemetry data.  The client knows what are the bucket ranges for individual histograms, for instance, but the server doesn’t.  So the server has to accept all kinds of bogus data, whereas it could reject that data if some sort of schema for the data was provided.  And generating that information is (relatively) more easily done from a JSON definition than from a C preprocessor definition.

Some forms of client-side validation are eased by this process as well.  For example, there’s a quirk in how “linear” histograms are defined that makes it easy to lose data when trying to capture data than runs from 1 to N.  There are ways around this, but they don’t always get used properly.  With the histograms defined in JSON, we can make this sort of “enumerated” histogram a first class citizen and error on misdefinitions of “linear” histograms for capturing enumerated data.

Other future enhancements are also easier to do when you’re not limited by the C preprocessor, like eliminating relocations (and generally making the histogram information take up less space in the binary), automating field trials, providing labels for individual buckets, and so forth.  There are some downsides, notably that defining related histograms can’t be done with a little #define trickery, but the benefits more than make up for it.


19
Jun 12

performant concurrency

The most important conclusion from this foray into history is that concurrency has always been employed for one purpose: to improve the performance of the system. This seems almost too obvious to make explicit—why else would we want concurrency if not to improve performance?—yet for all its obviousness, concurrency’s raison d’être is increasingly forgotten, as if the proliferation of concurrent hardware has awakened an anxiety that all software must use all available physical resources. Just as no programmer felt a moral obligation to eliminate pipeline stalls on a superscalar microprocessor, no software engineer should feel responsible for using concurrency simply because the hardware supports it. Rather, concurrency should be thought about and used for one reason and one reason only: because it is needs [sic] to yield an acceptably performing system.

–from Real-World Concurrency by Bryan Cantrill and Jeff Bonwick


14
Jun 12

lessons learned while filing bugs

When I tell people that I and my entire team at Mozilla work remote, they often ask, “How do you keep track of everything people are doing?”  I tell them about Bugzilla and anything that needs fixed goes into it as a bug; bugs aren’t just for software problems, but hardware requests, account requests, office maintenance, the list goes on.  A bug is anything that prevents you from getting your work done.

Despite extolling the virtues of filing things, large and small, in Bugzilla, I (re?)learned two lessons about filing bugs this week:

  • Even if you think something is totally obvious, file a bug.  I needed to modify the Android mozconfigs this week because they weren’t turning on telemetry support, like our desktop mozconfigs.  In the process, I became disgusted with how much copy-and-paste there is between our mozconfigs and thought, “Surely, there is a bug out there already on this goop.”  I even asked one of my colleagues, Mike Hommey, whether there might be a bug on this already.  It turns out there wasn’t one, but there is now.  I shouldn’t have even asked Mike; I should have searched Bugzilla a bit and filed something if I couldn’t find a plausible match  Worst-case, somebody would mark my bug as a duplicate and I would be connected with people and/or possible attempts at fixing the bug already.
  • If something looks like a bug, file a bug.  Back in March, I made it harder to get certain kinds of Telemetry histograms wrong.  In the bug, I noted several histograms that were wrong, but didn’t bother to do anything about them.  “Too much work for too little gain,” I said.  Turns out that the networking team has been working with bad data on cache Telemetry because of one of the aforementioned histograms being wrong.  That problem could’ve been fixed several months ago!  My mistake; I should have filed each and every one of those histograms as a bug and let the relevant teams decide what to do about them.  Even if you think it’s harmless, file a bug.  Worst case is somebody tells you that it is harmless and you come away enlightened.  (And just in case you’re wondering, yes, there is a bug to make those histograms even harder to get wrong.)

And as always, please remember to follow Bugzilla’s rules of etiquette when filing bugs.


12
Jun 12

about:memory statistics improvements

The three major memory costs to rendering webpages in Gecko are layout, the DOM, and JavaScript.  Only the last of these has a detailed about:memory breakdown.  Layout has a few subcategories, but likely as not, layout/arenas is one big opaque number, and DOM just sits as a big blob-o-stuff.

Over the past two weeks, I’ve been working to improve this state of affairs.  DOM’s numbers have been made more transparent by splitting out the numbers by the type of DOM node.  Admittedly, this isn’t perfect, as now you have several semi-opaque blobs, but it gives you a slightly better idea where the memory is going.  Additionally, a few more things are getting counted in those DOM numbers, like some on-the-side datastructures Gecko’s DOM engine keeps around.

Layout has received the most work, however.  We now display stats for individual frame types and common objects allocated in layout’s arenas, which can be helpful in diagnosing page performance problems (bug 686795 comment 10 and comment 23).  Please note that in the stats for these frames and objects, we’re not measuring any substructures contained in them, some of which could be quite significant.  That’s a (potentially tricky) job for later.


30
May 12

slow sql in xpcshell testing

One of the best things about performance monitoring is that it sometimes shows up problems in very unexpected areas.  Yesterday while working on some improvements to telemetry, I noticed that the telemetry packets being sent had information for slowSQLStartup.  That is, there were some SQL queries (4 of them, to be exact; 3 unique queries) that were taking more than 150 ms to run just during xpcshell testing.

These queries were run by the addons manager and already have a bug on file.

I’m guessing that this slow SQL does not occur during all xpcshell tests, but only those that create a profile (about 5% of tests). which would translate into ~1% of xpcshell testing time.  Still, speedups are speedups, and fixing these queries would be a good speedup for Firefox users generally.


14
May 12

statement wrappers have been deprecated

It’s a lot of fun removing old code; it’s less fun when removing that old code breaks lots of things.  That’s what happened when bug 589032 was checked in to mozilla-central last week. We had deprecated mozIStorageStatementWrapper over a year ago; bug 589032 removed mozIStorageStatementWrapper in favor of using mozIStorageStatement directly.

As there have been a couple reports of breakage (which likely means there’s a lot more breakage pending), and I have claimed in the bug that fixing said breakage is a straightforward process, here’s the quick-n-dirty guide to doing so. (Warning: may contain bugs or be incomplete. Please note any oversights in the comments.)

Code that created wrappers:

var wrapper = new Components.Constructor("@mozilla.org/storage/statement-wrapper;1",
                                         Ci.mozIStorageStatementWrapper,
                                         "initialize");
function createStatement(db, aSQL) {
  return new wrapper(db.createStatement(aSQL));
}

or perhaps:

var wrapper = Cc["@mozilla.org/storage/statement-wrapper;1"]
                .createInstance(Ci.mozIStorageStatementWrapper);
wrapper.initialize(statement);

should be replaced with:

function createStatement(db, aSQL) {
  return db.createStatement(aSQL);
}

Of course, that’s almost not worth having a separate function for. :)

Nearly all of the methods that you were calling on the wrapper remain unchanged, with the exception of:

foo = wrapped_stmt.execute();

That should be replaced with:

foo = stmt.executeStep();

Statements have an execute() method, but it does something slightly different than the wrapper’s execute() method.

Finally, where you finalize your statements (you are finalizing your statements, right?):

wrapped_stmt.statement.finalize();

is now simply:

statement.finalize();

As I said earlier, please note any omissions or errors in the comments!