17
Sep 11

notes from the all-hands profiling bof

After talking to a couple people at all-hands, it became clear that writing your own profiler was a popular activity.  (Jeff Muizelaar informed me that last year, the pet project was heap analyzers.  What’s next for 2012?)  A short, non-exhaustive list:

…and so forth.  So it seemed like getting interested parties together to talk about One Profiler to Rule Them All would be good.  And it worked; we had probably 20 people come to the BoF.  Below is my summary/recollection of what we discussed.  All the omissions and/or misrepresentations in the notes are my own; please leave a comment if you felt I left something out or need to describe something better.

What does the ideal profiler look like?

  • Cross-platform
  • Low overhead, both when profiling and not
  • Collects more-or-less complete call stacks (this is fairly easy everywhere except x86/Linux)
  • Understands compiled/interpreted JavaScript and C/C++ code
  • Built with the browser itself, not an external process or loaded via LD_PRELOAD or similar; this means we can ship it with the browser and diagnose problems in the field
  • Pretty pictures for viewing collected data.  Who doesn’t love pretty pictures, right?

Bas Schouten also pointed out that it might be much more efficient to just buy suitable profiling technology from a vendor; there was some skepticism that a profiler fulfilling the desiderata existed, though.

Since Benoit gave a demo of his profiler and his profiler seems likely to get into the tree, most of the discussion centered around that or variations thereof.  Benoit’s profiler works in the usual way via periodic signaling; however, instead of unwinding the call stack, Benoit chose to require annotations (via RAII objects) placed in various parts of the code to indicate what your “callstack” was when a sample was taken.  I believe Steve said his profiler does something similar for JavaScript.  I put “callstack” in quotes because you only get to know whether functions were on the call stack if annotations had been placed in them.
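To make the annotation idea concrete, here is a minimal sketch in Python, not Benoit’s actual code.  A context manager stands in for the C++ RAII object: entering pushes a label, leaving pops it.  The names profiler_label and take_sample, and the labels themselves, are made up for illustration, and the sample is taken by an explicit call rather than from a periodic signal handler.

```python
import threading
from contextlib import contextmanager

# Per-thread pseudo-stack of labels; only annotated code ever appears here.
_state = threading.local()


def _labels():
    if not hasattr(_state, "labels"):
        _state.labels = []
    return _state.labels


@contextmanager
def profiler_label(name):
    """Push `name` on entry and pop it on exit, like an RAII ctor/dtor pair."""
    labels = _labels()
    labels.append(name)
    try:
        yield
    finally:
        labels.pop()


def take_sample():
    """What a sampler would record: a copy of the current pseudo-stack."""
    return tuple(_labels())


samples = []


def paint():
    # Not annotated, so "paint" itself never shows up in a sample.
    samples.append(take_sample())


def reflow():
    with profiler_label("layout::reflow"):
        paint()


with profiler_label("event::dispatch"):
    reflow()

# Outside every annotation the pseudo-stack is empty.
samples.append(take_sample())
print(samples)
```

The first sample contains the two annotated frames but not paint, which is exactly why “callstack” deserves its quotes.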

Sprinkling annotations all over the tree sounded like a tedious process.  Somebody pointed out that Chrome does this, though they only place annotations at “entry points” for modules, so you might have one entry point for layout, one for graphics, etc. etc.  That way, given a profile on some random performance bug, you can at least tell who should be exploring the bug further with minimal overhead, since you’re not unwinding and a handful of RAII objects isn’t going to cost much.  Granted, this doesn’t do much for the folks who need to dig deeper, but perhaps we can have other tools for that.

There was some discussion of unwinding instead of annotations.  Unwinding is reasonably cheap when using libunwind and caching decoding of the unwind information; it’s even cheaper when you can just walk frame pointers.  The only wrinkle is you aren’t guaranteed to have frame pointers or unwind information on x86/Linux, so unwinding is not generally doable there.  Sometimes assembly coders also forget they need to insert unwind annotations, though most if not all of the problematic code in glibc, at least, has been so annotated.  Taras Glek suggested that we could insert RAII objects before calling out to code that we don’t control and to make those objects record something about frame/stack pointers so that we could unwind around third-party code if necessary.  I don’t believe we came to a consensus on using unwinding instead of or in addition to annotations.
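For readers who haven’t seen it, walking frame pointers amounts to following a linked list through the stack.  The sketch below simulates that in Python over a dictionary standing in for stack memory.  The frame layout (saved frame pointer at fp, return address one word above it), the 8-byte word size, and all addresses are assumptions chosen for illustration; real layouts depend on the architecture and ABI.

```python
WORD = 8  # assumed word size in bytes


def walk_frame_pointers(memory, fp, pc, max_depth=64):
    """Return code addresses from the innermost frame outward.

    Assumed layout: memory[fp] holds the caller's saved frame pointer and
    memory[fp + WORD] holds the return address into the caller.
    """
    stack = [pc]
    while fp != 0 and len(stack) < max_depth:
        saved_fp = memory.get(fp)
        return_address = memory.get(fp + WORD)
        if saved_fp is None or return_address is None:
            break  # no usable frame pointer here: the walk cannot continue
        stack.append(return_address)
        if saved_fp != 0 and saved_fp <= fp:
            break  # the chain should move up the stack; stop on garbage
        fp = saved_fp
    return stack


# Three simulated frames; a saved frame pointer of 0 ends the chain.
memory = {
    0x1000: 0x1020, 0x1008: 0x400200,
    0x1020: 0x1040, 0x1028: 0x400100,
    0x1040: 0x0,    0x1048: 0x400000,
}
stack = walk_frame_pointers(memory, fp=0x1000, pc=0x400300)
print([hex(address) for address in stack])
```

When code is compiled without frame pointers, the chain is simply not there to follow, which is where unwind information (or annotations) has to take over.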

I can’t recall if we talked extensively about unwinding through JavaScript code.

We didn’t talk about displaying the collected data.  Drawing pretty, understandable pictures is hard.

Benoit and Steve agreed to talk further offline about modifying Benoit’s profiler to understand JavaScript; if you’d like to help with profilers, talk to either of them.  It’d be great to have something in the tree that we can all work with and improve.


08
Sep 11

fewer page faults with syzygy

In my last post, I explained what Syzygy was and discussed some preliminary results from using it with Firefox.  I finally have results of running the whole Syzygy toolchain (instrumentation/profiling/reordering) and some numbers to share.

First off, I mentioned that I couldn’t get the call tracing to work right.  That was because I hadn’t installed Syzygy’s counterpart, the Sawbuck log viewer.  That’s the bit that contains the trace provider for the call tracer; you can download Sawbuck or you can install the src/sawbuck/Debug/sawbuck.msi module from your own build.

Secondly, these numbers need to be collected with a change to browser/app/nsBrowserApp.cpp to turn preloading off; otherwise the preloading of xul.dll will defeat the tracing mechanism.  With an eye to eventually getting this into the Mozilla build process, I’m not sure how well that bit will work out; perhaps preloading could be made dependent on an environment variable.  We can then turn off preloading during the build + postlink and get the benefits of a properly laid-out xul.dll without hacks in the source.

With my laptop (Windows 7, Intel SSD), a cold startup of a trunk build of Firefox results in about 2300 hard page faults and 43000 soft page faults.  These terms are what Windows uses; other operating systems call them major and minor page faults, respectively.  In any event, hard page faults are what we care about, because those result in talking to the hard drive.

After running the optimize program that comes with Syzygy–which automates the running of the instrumenter, the collection of profile data from the application, and reordering the library(s) in question–cold startup results in about 1400 hard page faults and 27000 soft page faults.  Like the Chrome folks saw, this is about a 40% reduction in hard page faults; soft page faults in this scenario are reduced to about what you’d see with warm startup.
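The arithmetic behind that 40% figure, using only the rounded numbers quoted above (percent_reduction is just a helper name for this sketch):

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


hard = percent_reduction(2300, 1400)
soft = percent_reduction(43000, 27000)
print(f"hard page faults: {hard:.1f}% fewer")
print(f"soft page faults: {soft:.1f}% fewer")
```

That works out to roughly 39% fewer hard faults and 37% fewer soft faults.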


26
Aug 11

making firefox work with syzygy

The good folks at Google have written a clever tool called Syzygy, which is part of the Sawbuck project.  The best summary of Syzygy comes from its design document:

Syzygy is a suite of tools to perform profile-guided, function-level reordering of 32-bit Windows PE executables, to optimize their layout for improved performance, notably for improved paging patterns.

Google wrote Syzygy for use with Chrome, but the tool is equally applicable to any large application where you want to improve performance…like Firefox.  In this case, we’re concerned with improving the layout of libxul, as that’s where the bulk of the Firefox code lives.  Working with Syzygy involves four major steps:

  1. Instrumenting the application binary in question;
  2. Running the application to collect profile data (function addresses along with invocation time);
  3. Passing the profile data through an ordering generator, which comes up with the order in which functions should be laid out in the optimized binary; and finally
  4. Relinking the application binary using the ordering from step 3.
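As a rough illustration of step 3, the sketch below orders functions by the time of their first recorded call and puts never-called functions at the end, so that code used together at startup lands on neighboring pages.  This heuristic is an assumption for illustration and may well differ from what Syzygy’s ordering generator actually does; the function names and trace format are also made up.

```python
def order_by_first_call(all_functions, trace):
    """Return a layout order for `all_functions`.

    `trace` is an iterable of (timestamp, function_name) call events.
    """
    first_seen = {}
    for timestamp, name in trace:
        if name not in first_seen or timestamp < first_seen[name]:
            first_seen[name] = timestamp
    # Called functions first, earliest first; ties broken by name.
    called = sorted(first_seen, key=lambda name: (first_seen[name], name))
    # Everything the profile never touched keeps its original relative order.
    never_called = [name for name in all_functions if name not in first_seen]
    return called + never_called


functions = ["paint", "print_page", "init", "parse_prefs"]
trace = [(5, "paint"), (1, "init"), (3, "parse_prefs"), (9, "init")]
layout = order_by_first_call(functions, trace)
print(layout)
```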

Step 1 is pretty easy; Firefox just needs to be compiled with Visual Studio’s /PROFILE switch to ensure that the instrumenter has all the information it needs.  Steps 3 and 4 are likewise straightforward.

Step 2 appears to be the tricky part.  Being good–lazy–computer programmers, the Google folks wrote a number of scripts and programs to automate this process, as well as some benchmarking infrastructure.  However, the scripts are written with Chrome in mind; many places have Chrome-specific bits hardcoded.  This is, of course, totally understandable, but it makes using those scripts with other programs difficult.

Over the past couple of weeks, I’ve been working at making Syzygy cooperate with Firefox.  If you’re interested, you can see the modifications I’ve made in my sawbuck github project.  Things are working well enough that I can now run:

Debug/py/Scripts/benchmark --user-data-dir flobbity --no-preload --no-prefetch ~/ff-build/dist/bin/firefox.exe

(The --no-{preload,prefetch} options are required to work around Chrome-specific code that didn’t seem worth ripping out; the --user-data-dir specifies what profile to use when launching Firefox.)  After waiting for a minute or two, the benchmark script reports:

RESULT firefox.exe: SoftPageFaults= [23495, 32139, 23356, 23343, 23299, 23167, 23063, 23141, 23113, 23267]
RESULT firefox.exe: HardPageFaults= [1158, 10, 3, 3, 4, 2, 2, 2, 3, 2]

This is for an unoptimized binary, of course.  You can clearly see the OS’s page cache at work in runs after the first.
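Since the first run is the interesting one, a small sketch for summarizing output in this shape may be handy.  It treats the first run as the cold start and takes the median of the remaining runs as the warm figure; that split, and the regular expression, are assumptions based only on the two RESULT lines shown above, not on anything the benchmark script promises.

```python
import re
from statistics import median

# Matches lines like: RESULT firefox.exe: HardPageFaults= [1158, 10, 3]
RESULT_LINE = re.compile(r"RESULT (\S+): (\w+)= \[([\d, ]*)\]")


def parse_results(text):
    """Map each metric name to its list of per-run values."""
    results = {}
    for match in RESULT_LINE.finditer(text):
        _binary, metric, values = match.groups()
        results[metric] = [int(v) for v in values.split(",") if v.strip()]
    return results


def cold_and_warm(runs):
    """First run is taken as cold; the median of the rest as warm."""
    return runs[0], median(runs[1:])


output = """\
RESULT firefox.exe: SoftPageFaults= [23495, 32139, 23356, 23343, 23299, 23167, 23063, 23141, 23113, 23267]
RESULT firefox.exe: HardPageFaults= [1158, 10, 3, 3, 4, 2, 2, 2, 3, 2]
"""

summary = {
    metric: cold_and_warm(runs)
    for metric, runs in parse_results(output).items()
}
for metric, (cold, warm) in summary.items():
    print(f"{metric}: cold={cold} warm median={warm}")
```

Using the median rather than the mean keeps the one odd soft-fault run (32139) from skewing the warm number.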

The scripts are not quite perfect yet.  In particular, the call traces necessary to perform reordering don’t seem to be generated, for some peculiar reason that I haven’t ferreted out.  Also, the script will indiscriminately kill any Mozilla-related apps running along with the Firefox instances being benchmarked; I couldn’t find any good way to limit the killing to windows associated with a particular profile.  (If I understand the Chrome code correctly, it sets the window text of a hidden window to the full path to the profile directory in use.)  But a good bit seems to work; hopefully progress will come faster now that the groundwork has been laid.


24
Aug 11

Hello world!

My name is Nathan Froyd and I work in Taras Glek’s group on performance-related things.  I’m working remote; I live in the wonderful city of Indianapolis, Indiana.

I started working at Mozilla back in mid-June, but took quite some time off due to cardiac arrest from streptococcal myocarditis.  Fortunately I came through that OK, and folks at Mozilla have been very helpful and understanding during my hospitalization and recovery.  I’ve been back to work for a couple of weeks now and finally feel like a Mozilla employee. :)

In my previous work, I worked on the GNU toolchain at CodeSourcery: GCC optimizations, both general and PowerPC-related; maintaining the PowerPC ports internally; C++ frontend diagnostic improvements, like function overload resolution failure explanation and better missing semicolon diagnostics; GDB porting; and supporting the toolchain on popular embedded platforms (across the x86, ARM, PowerPC, MIPS, SPARC, and SH architectures).

I also worked on some of the software architecture bits of GCC.  One of the patches I’m happiest with was a patch series to slim down how GCC represents expressions and constants internally.  I’m looking forward to working on similar software architecture patches at Mozilla.

Outside of work, I enjoy spending time with my family; I have three daughters and they keep my wife and me busy!  (I like to say that I am thoroughly outnumbered in my house–even the cat is a girl.)  I love to read: sci-fi, fantasy, history, philosophy, and theology books all grace my bookshelves.  When I do feel like programming outside of work, I enjoy working with Common Lisp and SBCL.  And finally, I recently started playing World of Warcraft once again.