Introducing Chrome Hang Reporting and the Symbolication Server

The Hang Reporter is a new Firefox feature that reports transient main thread hangs to Telemetry. It relies on the existing hang monitor thread to detect when the main thread has not processed any new events in the last X seconds. If it detects a hang, it will suspend the main thread to walk its call stack for PCs and to gather version and address information about the modules loaded in process memory. This data along with the hang’s duration is reported to Telemetry, assuming the user has opted in to submitting performance data. This Telemetry hang information will help us identify and reduce common causes of browser “stalls”. In the interest of saving Telemetry bandwdith, only the information about the modules in memory directly involved in chrome hangs is reported + the memory maps in each hang report are a diff against the memory map in the first hang report of the session.

Currently, the Hang Reporter is only active on Windows Nightly builds from the Profiling branch because it relies on having Firefox compiled with frame pointers to unwind the call stacks. The Hang Reporter functionality is #ifdef’d out on regular release builds. The default minimum interval for capturing a hang is 10 seconds, but the value can be changed via the “hangmonitor.timeout” config preference.

This next bit of functionality is not quite ready for prime time, but you can examine your own browser’s hang reports locally by installing this custom version of the about:telemetry extension that adds a “Main Thread Hangs” section to the about:telemetry page and a button for causing hangs at will for testing.  You have to use either a local debug build of Firefox with symbols or a Nightly Profiling build to see the hang reports.

The hang reports contain raw PCs from the hung stack and a list of the modules involved:

Main Thread Hangs:

Hang report #1 (1 seconds):

Stack: 0x6027b8df 0x6027b897 0x602ce2b7 0x602cf482 0x602d2a7a 0x602d24c1 0x1d6cef50

Memory map:

Hang report #2 (8 seconds):

Stack: 0x75b9f5be 0x10e9051d 0x10e90928 0x112441a6 0x111cb534 0x11018c6f 0x112a4f4e 0x112a4e72 0x112a4d7d 0x10e905b0 0x10e33907 0x10b59f4a 0xf884cac 0x12324ac 0x1231d26 0x1231319 0x1237cf8 0x1237b3f 0x74cf339a 0x775d9ef2 0x775d9ec5

Memory map:

Pending security & privacy review, a symbolication server will eventually be available publicly to symbolicate these hang reports for about:telemetry users. This symbolication server will also be used by the SPS profiler. If you are a developer comfortable with Firefox development and you feel like killing a bit of time :), you can already symbolicate the Firefox portions of these hang reports by running our new Snappy Symbolication Server locally (available on Github). You will need to run the server locally on port 8000 with the config file pointing to your local .sym symbol files. You can generate these .sym files by running “make buildsymbols” from the object directory of your local build. You can then use the “Symbolicate Stacks” functionality in the about:telemetry page.

I don’t expect many any people to go through the trouble of setting all this up :), but the Hang Reporter really does become informative to use after configuring the hang interval to its minimum (1 second) and observing how often and why the browser hangs during regular browsing.

EDIT: Further instructions on setting up the Snappy Symbolication Server can be found here.

5 Responses to Introducing Chrome Hang Reporting and the Symbolication Server

  1. Archaeopteryx

    Will the hang reports also contain information about a JavaScript being busy or something similar? This sounds useful (and in my humble opinion the busy script warnings for chrome files in the application should be turned off).

    By the way, Firefox 12 Beta is a lot snappier than Firefox 10 an its predecessors. I thank all the people who made this happen.

  2. I’m just unsure why we re-invent the wheel there and do stuff again that the crash&hang reporting system already does.

  3. Pingback: Snappy, March 22nd | Lawrence Mandel

  4. I’m confused.
    You said it “really does become informative to use” when setting the interval to 1 second, so why is the default 10 seconds?
    When I first read that it was 10 I was thinking that that’s only going to catch the very worst stuff – not a lot.

  5. Peter Bright

    Can’t you just ship PDBs so that I can walk the stack even on a release build with FPO enabled?