notes from the all-hands profiling bof

After talking to a couple people at all-hands, it became clear that writing your own profiler was a popular activity.  (Jeff Muizelaar informed me that last year, the pet project was heap analyzers.  What’s next for 2012?)  A short, non-exhaustive list:

…and so forth.  So it seemed like getting interested parties together to talk about One Profiler to Rule Them All would be good.  And it worked; we had probably 20 people come to the BoF.  Below is my summary/recollection of what we discussed.  All the omissions and/or misrepresentations in the notes are my own; please leave a comment if you feel I left something out or need to describe something better.

What does the ideal profiler look like?

  • Cross-platform
  • Low overhead, both when profiling and not
  • Collects more-or-less complete call stacks (this is fairly easy everywhere except x86/Linux)
  • Understands compiled/interpreted JavaScript and C/C++ code
  • Built with the browser itself, not an external process or loaded via LD_PRELOAD or similar; this means we can ship it with the browser and diagnose problems in the field
  • Pretty pictures for viewing collected data.  Who doesn’t love pretty pictures, right?

Bas Schouten also pointed out that it might be much more efficient to just buy suitable profiling technology from a vendor; there was some skepticism that a profiler fulfilling the desiderata existed, though.

Since Benoit gave a demo of his profiler and his profiler seems likely to get into the tree, most of the discussion centered around that or variations thereof.  Benoit’s profiler works in the usual way via periodic signaling; however, instead of unwinding the call stack, Benoit chose to require annotations (via RAII objects) placed in various parts of the code to indicate what your “callstack” was when a sample was taken.  I believe Steve said his profiler does something similar for JavaScript.  I put “callstack” in quotes because you only get to know whether functions were on the call stack if annotations had been placed in them.
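To make the annotation idea concrete, here is a minimal sketch of an RAII label that maintains a per-thread “pseudo-stack”.  This is not the code from Benoit’s profiler; every name in it (PseudoStack, AutoLabel, TakeSample, and the example functions) is made up for illustration, and the sample is taken with a plain function call rather than from a signal handler so the example stays self-contained.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; an illustration of the technique only.
constexpr std::size_t kMaxDepth = 64;

struct PseudoStack {
    std::array<const char*, kMaxDepth> labels{};
    std::size_t depth = 0;
};

// One pseudo-stack per thread, so annotations need no locking.
thread_local PseudoStack g_stack;

// RAII annotation: pushes a label on construction, pops on destruction.
class AutoLabel {
public:
    explicit AutoLabel(const char* label) {
        if (g_stack.depth < kMaxDepth) {
            g_stack.labels[g_stack.depth] = label;
        }
        ++g_stack.depth;  // still counted when full, so pops stay balanced
    }
    ~AutoLabel() { --g_stack.depth; }
    AutoLabel(const AutoLabel&) = delete;
    AutoLabel& operator=(const AutoLabel&) = delete;
};

// What a sampler would do on each tick: copy the labels currently live.
std::vector<std::string> TakeSample() {
    std::vector<std::string> out;
    const std::size_t n = g_stack.depth < kMaxDepth ? g_stack.depth : kMaxDepth;
    for (std::size_t i = 0; i < n; ++i) {
        out.emplace_back(g_stack.labels[i]);
    }
    return out;
}

std::vector<std::string> Paint() {
    AutoLabel label("gfx::Paint");
    return TakeSample();
}

// Unannotated, so it never shows up in a sample.
std::vector<std::string> Helper() { return Paint(); }

std::vector<std::string> Reflow() {
    AutoLabel label("layout::Reflow");
    return Helper();
}
```

Calling Reflow() yields a sample containing layout::Reflow and gfx::Paint but not Helper, which is exactly the “callstack in quotes” caveat.  A sampler driven by a real signal would have to copy the labels into a preallocated buffer instead, since allocating memory inside a signal handler is not safe.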

Sprinkling annotations all over the tree sounded like a tedious process.  Somebody pointed out that Chrome does this, though they only place annotations at “entry points” for modules, so you might have one entry point for layout, one for graphics, etc.  That way, given a profile on some random performance bug, you can at least tell who should be exploring the bug further.  The overhead is minimal, since you’re not unwinding and a handful of RAII objects isn’t going to cost much.  Granted, this doesn’t do much for the folks who need to dig deeper, but perhaps we can have other tools for that.
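As a rough illustration of how entry-point annotations answer the “who should look at this?” question, the sketch below counts samples by their outermost label.  The Sample type, the function name, and the “(unannotated)” bucket are assumptions made for this example; nothing here is taken from Chrome or from any profiler discussed at the BoF.

```cpp
#include <map>
#include <string>
#include <vector>

// A sample is the list of labels live when the timer fired, outermost first.
// (Hypothetical representation, chosen for this sketch.)
using Sample = std::vector<std::string>;

// Count samples per outermost ("entry point") label.
std::map<std::string, int> CountByEntryPoint(const std::vector<Sample>& samples) {
    std::map<std::string, int> counts;
    for (const Sample& s : samples) {
        // Samples taken outside any annotated region get their own bucket.
        const std::string key =
            s.empty() ? std::string("(unannotated)") : s.front();
        ++counts[key];
    }
    return counts;
}
```

The bucket with the most samples points at the module whose owners should dig further, presumably with a heavier-weight tool.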

There was some discussion of unwinding instead of annotations.  Unwinding is reasonably cheap when using libunwind and caching decoding of the unwind information; it’s even cheaper when you can just walk frame pointers.  The only wrinkle is you aren’t guaranteed to have frame pointers or unwind information on x86/Linux, so unwinding is not generally doable there.  Sometimes assembly coders also forget they need to insert unwind annotations, though most if not all of the problematic code in glibc, at least, has been so annotated.  Taras Glek suggested that we could insert RAII objects before calling out to code that we don’t control and make those objects record something about frame/stack pointers so that we could unwind around third-party code if necessary.  I don’t believe we came to a consensus on using unwinding instead of or in addition to annotations.
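For readers who haven’t seen it, walking frame pointers amounts to following a linked list: when code is compiled with frame pointers on x86, each frame stores the caller’s saved frame pointer with the return address right after it.  The sketch below walks such a chain over simulated frame records rather than a live stack, so it runs anywhere; FrameRecord and WalkFramePointers are hypothetical names, and a real unwinder would also need to check that each address is readable before dereferencing it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulated frame record: the saved caller frame pointer followed by the
// return address, as laid out on x86 when frame pointers are kept.
struct FrameRecord {
    const FrameRecord* caller;      // saved ebp/rbp of the caller
    std::uintptr_t return_address;  // where this frame returns to
};

// Collect return addresses by following the chain of saved frame pointers.
std::vector<std::uintptr_t> WalkFramePointers(const FrameRecord* fp,
                                              std::size_t max_frames) {
    std::vector<std::uintptr_t> pcs;
    while (fp != nullptr && pcs.size() < max_frames) {
        pcs.push_back(fp->return_address);
        // The stack grows down, so a caller's record must sit at a higher
        // address than its callee's; anything else is a corrupt chain.
        if (fp->caller != nullptr && fp->caller <= fp) {
            break;
        }
        fp = fp->caller;
    }
    return pcs;
}
```

With -fomit-frame-pointer the saved-pointer slot simply isn’t there, which is why this approach can’t be relied on without either frame pointers or unwind information.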

I can’t recall if we talked extensively about unwinding through JavaScript code.

We didn’t talk about displaying the collected data.  Drawing pretty, understandable pictures is hard.

Benoit and Steve agreed to talk further offline about modifying Benoit’s profiler to understand JavaScript; if you’d like to help with profilers, talk to either of them.  It’d be great to have something in the tree that we can all work with and improve.

5 comments

  1. > Collects more-or-less complete call stacks (this is fairly easy everywhere except x86/Linux)

    Shouldn’t that be reversed? Getting callstacks is easy on x86, but platforms like ARM and x86_64 are problematic. Or have I misunderstood this point?

  2. Two comments:

    1) A profiler in an external process is invaluable for debugging hangs or anything else that slows down the browser UI. In the e10s world, the “external process” could be the chrome process for content profiling. What about chrome profiling?

    2) Pretty pictures are nice, but the ability to have the profiler help analyze the data (the exclude, blame to parent, focus, collapse recursion functionality that Shark has) is something that’s missing from most profilers I’ve dealt with and that makes deep and branchy calltrees much easier to work with. I’d hope whatever we end up doing here would support those….

  3. Nathan Froyd

    Getting callstacks on x86/Linux, in particular, is hard because recent GCC versions default to -fomit-frame-pointer. So you can’t do ebp-based unwinding reliably. You also aren’t guaranteed to have unwinding information available, so that option’s closed to you as well.

    Unwind information is mandatory on x86-64/Linux and ARM/Linux, I believe, so that makes those platforms easy. Macs apparently always have -fno-omit-frame-pointer turned on (so I’ve been told; I don’t develop on Macs), so unwinding there is easy. Bas said Win64 always has frame pointers (and unwind info?) and IIRC, Win32 has unwind info hanging about somewhere as well.

  4. Thanks much for summarizing this!

  5. You’re pretty much screwed on x86 everywhere, since we build without a frame pointer on all x86 platforms (system libraries may be a different story). x86-64 and ARM do have unwind info specified in the ABI, so you can use that even in release builds.