29
Sep 11

pgo startup times with syzygy

In an effort to confirm that we do want all this syzygy goodness in our release builds, I’ve been testing out syzygy on PGO builds (since we do PGO builds on Windows for releases). After removing the PEBKAC and getting a proper PGO build–which took depressingly long–I have mixed results.

First, the good news. On my laptop (Win7, Core i7, SSD), the about:startup numbers look like this:

Version              main  sessionRestored  firstPaint
Base PGO build        265             3152        3012
Optimized PGO build   234             2778        2653

These numbers are really encouraging; they’re actually even better than the initial numbers I posted earlier.  (Though I note that they are universally slower than the earlier numbers…hm….)

There is one curious thing about these builds, though. When you look at the page fault numbers, they suggest a much different story. The (cold) numbers are what you get from starting Firefox just after a reboot; the (warm) numbers are from a second startup after the first.

Version              Hard faults (cold)  Soft faults (cold)  Hard faults (warm)  Soft faults (warm)
Base PGO build                     2507               41219                   8              26100
Optimized PGO build                2264               41488                  14              23017

These numbers are totally contrary to what I saw with non-PGO builds. We’re not consistently lower in the optimized build on either measure. I honestly haven’t thought very hard about what this means yet.

Anyway, that’s the good news. The bad news is that on my desktop (Win XP, Core 2 Quad, mechanical drive), the about:startup numbers look like this:

Version              main  sessionRestored  firstPaint
Base PGO build        1516            8984        8813
Optimized PGO build   1437            9187        8828

(I don’t have the necessary profiling tools on my XP box for doing page fault analysis. I shouldn’t think they’d differ dramatically between the two systems, though.)

This is a little discouraging. The syzygy-optimized build is a little faster off the line, but gets edged out by the base build in getting to the points that actually matter. I haven’t thought terribly hard about these numbers, either. One possibility is that I did turn off the XPCOM glue preloading bits, which IIUC are helpful for encouraging XP to keep its hands off your binary’s startup time. Doing that was necessary for getting postlinking to work properly. If I made that runtime-configurable, then I could run tests with the preloading enabled and see if we win there.

Bottom line: We would win on leading-edge machines, but we wouldn’t see a lot of benefit on older machines.

Also, if Microsoft would add a drop_caches lookalike, that would be fantastic.


20
Sep 11

startup reduction times with syzygy, part 2

In my previous post, I presented some startup timings with syzygy-optimized Firefox binaries.  People asked whether those timings were on my work laptop (quad-core i7, Windows 7, SSD) and I confirmed they were.  Since those timings were encouraging, but not overwhelming, folks suggested that I might try timing things on a more conventional system.

Below are results from testing on my desktop (Core 2 Quad @ 2.6GHz, Windows XP, 7200ish-rpm hard drive):

Version          main  sessionRestored  firstPaint
Trunk build      1484            11515       11125
Optimized build  2562             8812        8703

That’s quite a difference: about a 25% win just from reordering functions.  Much more exciting!


19
Sep 11

startup reduction times with syzygy

People I’ve told about the syzygy work I’ve been doing have, almost universally, two reactions:

  1. That’s cool!  (Thanks; I did very little work for it!)
  2. How does that translate into startup time?

Assuming that a 40% reduction in page faults leads to a 40% reduction in startup time is not reasonable, but surely there should be some reduction in startup time, right?  I finally benchmarked this today using the about:startup extension; these numbers are from cold start, freshly rebooted both times:

Version          main  sessionRestored  firstPaint
Trunk build       125             1733        1671
Optimized build   125             1639        1577

So a 40% reduction in page faults translates into ~6% reduction in startup time; not stellar, but not too bad either.
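Working the arithmetic from the table above makes the ~6% figure concrete; a trivial helper (nothing Firefox-specific, just the percentage calculation) gives about 5.4% on sessionRestored and 5.6% on firstPaint:

```cpp
// Percent reduction between a before/after pair of timings. Applied to the
// table above: sessionRestored 1733 -> 1639 and firstPaint 1671 -> 1577
// both come out between 5% and 6%.
static double PercentReduction(double before, double after) {
  return 100.0 * (before - after) / before;
}
```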

The next step is making sure this all works with PGO builds on Windows. Then we get to have a discussion about whether to incorporate this into the regular builds and getting all the infrastructure on build machines.


17
Sep 11

notes from the all-hands profiling bof

After talking to a couple people at all-hands, it became clear that writing your own profiler was a popular activity.  (Jeff Muizelaar informed me that last year, the pet project was heap analyzers.  What’s next for 2012?)  A short, non-exhaustive list:

…and so forth.  So it seemed like getting interested parties together to talk about One Profiler to Rule Them All would be good.  And it worked; we had probably 20 people come to the BoF.  Below is my summary/recollection of what we discussed.  Any omissions and/or misrepresentations in the notes are my own; please leave a comment if you feel I left something out or need to describe something better.

What does the ideal profiler look like?

  • Cross-platform
  • Low overhead, both when profiling and not
  • Collects more-or-less complete call stacks (this is fairly easy everywhere except x86/Linux)
  • Understands compiled/interpreted JavaScript and C/C++ code
  • Built with the browser itself, not an external process or loaded via LD_PRELOAD or similar; this means we can ship it with the browser and diagnose problems in the field
  • Pretty pictures for viewing collected data.  Who doesn’t love pretty pictures, right?

Bas Schouten also pointed out that it might be much more efficient to just buy suitable profiling technology from a vendor; there was some skepticism that a profiler fulfilling the desiderata existed, though.

Since Benoit gave a demo of his profiler and his profiler seems likely to get into the tree, most of the discussion centered around that or variations thereof.  Benoit’s profiler works in the usual way via periodic signaling; however, instead of unwinding the call stack, Benoit chose to require annotations (via RAII objects) placed in various parts of the code to indicate what your “callstack” was when a sample was taken.  I believe Steve said his profiler does something similar for JavaScript.  I put “callstack” in quotes because you only get to know whether functions were on the call stack if annotations had been placed in them.
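To make the scheme concrete, here is a minimal sketch of how such RAII annotations might work; all the names are invented, and Benoit’s actual implementation surely differs. Each annotated scope pushes a label onto a thread-local stack in its constructor and pops it in its destructor, so a sampler firing inside the scope sees exactly the labels currently pushed:

```cpp
#include <string>
#include <vector>

// Thread-local pseudo-stack of annotation labels (names invented for this
// sketch). Only annotated scopes appear here, not real native frames.
static thread_local std::vector<const char*> gPseudoStack;

// RAII annotation: construction pushes a label, destruction pops it.
class ProfilerLabel {
 public:
  explicit ProfilerLabel(const char* aName) { gPseudoStack.push_back(aName); }
  ~ProfilerLabel() { gPseudoStack.pop_back(); }
};

// What a periodic sampler would record at the instant it fires: the
// "callstack" in quotes, i.e. whichever annotations are currently live.
std::vector<std::string> TakeSample() {
  return std::vector<std::string>(gPseudoStack.begin(), gPseudoStack.end());
}
```

A scope annotated with `ProfilerLabel label("layout");` then shows up in any sample taken while that scope is live, which is the cheap entry-point-level attribution discussed below.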

Sprinkling annotations all over the tree sounded like a tedious process.  Somebody pointed out that Chrome does this, though they only place annotations at “entry points” for modules, so you might have one entry point for layout, one for graphics, etc. etc.  That way, given a profile on some random performance bug, you can at least tell who should be exploring the bug further with minimal overhead, since you’re not unwinding and a handful of RAII objects isn’t going to cost much.  Granted, this doesn’t do much for the folks who need to dig deeper, but perhaps we can have other tools for that.

There was some discussion of unwinding instead of annotations.  Unwinding is reasonably cheap when using libunwind and caching decoding of the unwind information; it’s even cheaper when you can just walk frame pointers.  The only wrinkle is you aren’t guaranteed to have frame pointers or unwind information on x86/Linux, so unwinding is not generally doable there.  Sometimes assembly coders also forget they need to insert unwind annotations, though most if not all of the problematic code in glibc, at least, has been so annotated.  Taras Glek suggested that we could insert RAII objects before calling out to code that we don’t control and to make those objects record something about frame/stack pointers so that we could unwind around third-party code if necessary.  I don’t believe we came to a consensus on using unwinding instead of or in addition to annotations.
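As a point of reference for library-based unwinding, glibc itself ships a simple unwinder behind the real `backtrace()` API in `<execinfo.h>`; the wrapper name below is invented, but the call is genuine:

```cpp
#include <execinfo.h>  // glibc's backtrace()

// Capture up to aMax return addresses from the current call stack.
// Under the hood this walks frame pointers or unwind tables; on x86/Linux
// it can come up short when neither is present, which is exactly the
// wrinkle discussed above.
int CaptureStack(void** aFrames, int aMax) {
  return backtrace(aFrames, aMax);
}
```

A sampler built this way pays the unwind cost only when a sample fires, which is part of why cached unwinding was considered reasonably cheap.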

I can’t recall if we talked extensively about unwinding through JavaScript code.

We didn’t talk about displaying the collected data.  Drawing pretty, understandable pictures is hard.

Benoit and Steve agreed to talk further offline about modifying Benoit’s profiler to understand JavaScript; if you’d like to help with profilers, talk to either of them.  It’d be great to have something in the tree that we can all work with and improve.


08
Sep 11

fewer page faults with syzygy

In my last post, I explained what Syzygy was and discussed some preliminary results from using it with Firefox.  I finally have results of running the whole Syzygy toolchain (instrumentation/profiling/reordering) and some numbers to share.

First off, I mentioned that I couldn’t get the call tracing to work right.  That was because I hadn’t installed Syzygy’s counterpart, the Sawbuck log viewer.  That’s the bit that contains the trace provider for the call tracer; you can download Sawbuck or install the src/sawbuck/Debug/sawbuck.msi module from your own build.

Secondly, these numbers need to be collected with a change to browser/app/nsBrowserApp.cpp to turn preloading off; otherwise the preloading of xul.dll will defeat the tracing mechanism.  With an eye to eventually getting this into the Mozilla build process, I’m not sure how well that bit will work out; perhaps preloading could be made dependent on an environment variable.  We can then turn off preloading during the build + postlink and get the benefits of a properly laid-out xul.dll without hacks in the source.
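The environment-variable idea could look something like this; `MOZ_NO_PRELOAD` is an invented name, and the real change would live in nsBrowserApp.cpp rather than a free function:

```cpp
#include <cstdlib>

// Hypothetical sketch: gate the xul.dll preload on an environment variable
// (variable name invented) so an instrumented/tracing run can disable
// preloading at runtime instead of requiring a source patch.
static bool ShouldPreloadXul() {
  return std::getenv("MOZ_NO_PRELOAD") == nullptr;
}
```

The build + postlink step would then export the variable before launching the browser, and shipped builds would behave exactly as before since the variable is normally unset.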

With my laptop (Windows 7, Intel SSD), a cold startup of a trunk build of Firefox results in about 2300 hard page faults and 43000 soft page faults.  These terms are what Windows uses; other operating systems call them major and minor page faults, respectively.  In any event, hard page faults are what we care about, because those result in talking to the hard drive.

After running the optimize program that comes with Syzygy–which automates the running of the instrumenter, the collection of profile data from the application, and reordering the library(s) in question–cold startup results in about 1400 hard page faults and 27000 soft page faults.  Like the Chrome folks saw, this is about a 40% reduction in hard page faults; soft page faults in this scenario are reduced to about what you’d see with warm startup.