Socorro: Mozilla’s Crash Reporting System

Recently, we’ve been working on planning out the future of Socorro.  If you’re not familiar with it, Socorro is Mozilla’s crash reporting system.

You may have noticed that Firefox has become a lot less crashy recently – we’ve seen a 40% improvement over the last five months.  The data from crash reports enables our engineers to find, diagnose, and fix the most common crashes, so crash reporting is critical to these improvements.

On our peak day each week we receive 2.5 million crash reports and process 15% of them, for a total of 50 GB.  In total, we receive around 320 GB each day!  Right now we are handicapped by the limitations of our file system storage (NFS) and our database’s limited ability to handle really large tables.  However, we are in the process of moving to Hadoop, and all our crashes are already being written to HBase as well.  Soon HBase will become our main data storage, and we’ll be able to do a lot more interesting things with the data.  We’ll also be able to process 100% of crashes.  We want to do this because the long tail of crashes is increasingly interesting, and we may be able to get insights from the data that were not previously possible.
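
As a rough sanity check on the figures above, the arithmetic can be sketched in a few lines of Python.  This is only a back-of-the-envelope estimate: it assumes the 320 GB daily volume and the 2.5 million reports describe the same peak day, and that 1 GB means 10**9 bytes.  Neither assumption is stated in the numbers themselves.

```python
# Back-of-the-envelope check of the volumes quoted in the post.
# Assumptions (not from the post): the 320 GB figure and the 2.5 million
# reports refer to the same peak day, and 1 GB = 10**9 bytes.
REPORTS_PER_PEAK_DAY = 2_500_000
PROCESSED_FRACTION = 0.15
TOTAL_GB_PER_DAY = 320

# Number of reports that make it through the 15% throttle.
processed_reports = round(REPORTS_PER_PEAK_DAY * PROCESSED_FRACTION)

# Share of the daily volume that gets processed, in GB.
processed_gb = TOTAL_GB_PER_DAY * PROCESSED_FRACTION

# Average size of one submitted report, in KB.
avg_report_kb = TOTAL_GB_PER_DAY * 10**9 / REPORTS_PER_PEAK_DAY / 10**3

print(processed_reports, "reports processed")
print(processed_gb, "GB processed")      # close to the ~50 GB quoted
print(avg_report_kb, "KB per report on average")
```

Under those assumptions, 15% of the daily volume is just under 50 GB, which lines up with the processed total quoted above.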

I’ll start by taking a look at how things have worked to date.

History of Crash Reporting

Current Socorro Architecture

The data flows as follows:

  • When Firefox crashes, the crash is submitted to Mozilla by a part of the browser known as Breakpad.  At Mozilla’s end, this is where Socorro comes into play.
  • Crashes are submitted to the collector, which writes them to storage.
  • The monitor watches for crashes arriving, and queues some of them for processing.  Right now, we throttle the system to only process 15% of crashes due to capacity issues.  (We also pick up and transform other crashes on demand as users request them.)
  • Processors pick up crashes and process them.  A processor gets its next job from a queue in our database and invokes minidump_stackwalk (a part of Breakpad), which combines the crash with symbols where available.  The results are written back into the database.  Some further processing to generate reports (such as top crashes) is done nightly by a set of cron jobs.
  • Finally, the data is available to Firefox and Platform engineers (and anyone else who is interested) via the web UI at http://crash-stats.mozilla.com
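
The flow above can be sketched as a toy pipeline.  This is a minimal illustration under stated assumptions, not Socorro's actual code: the function names, the dictionary-based storage, the in-memory queue, and the random-sampling throttle are all hypothetical stand-ins, and the call to minidump_stackwalk is replaced by a placeholder.

```python
import random
from collections import deque

THROTTLE_RATE = 0.15  # fraction of crashes queued for processing (from the post)


def collect(crash, storage):
    """Collector: write every submitted crash to storage."""
    storage[crash["id"]] = crash


def monitor(storage, queue, rng, on_demand=()):
    """Monitor: queue a random sample of crashes, plus any requested on demand."""
    for crash_id in storage:
        if crash_id in on_demand or rng.random() < THROTTLE_RATE:
            queue.append(crash_id)


def process(queue, storage, results):
    """Processor: take jobs off the queue and record a processed result."""
    while queue:
        crash_id = queue.popleft()
        # Placeholder for running minidump_stackwalk on the raw dump.
        results[crash_id] = {"signature": storage[crash_id]["signature"]}


storage, queue, results = {}, deque(), {}
rng = random.Random(0)  # seeded so the run is repeatable

for i in range(10_000):
    collect({"id": i, "signature": f"sig{i % 50}"}, storage)

monitor(storage, queue, rng, on_demand={42})  # crash 42 was requested by a user
process(queue, storage, results)

print(len(results) / len(storage))  # close to 0.15
```

Every crash lands in storage, but only the sampled ones (and the one requested on demand) end up with a processed result, which is the capacity trade-off described above.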

Implementation Details

  • The collector, processor, monitor and cron jobs are all written in Python.
  • Crashes are currently stored in NFS, and processed crash information in a PostgreSQL database.
  • The web app is written in PHP (using the Kohana framework) and draws data both from Postgres and from a Pythonic web service.

Roadmap

Future Socorro releases are a joint project between Webdev, Metrics, and IT.  Some of our milestones focus on infrastructure improvements, others on code changes, and still others on UI improvements.  Features generally work their way through to users in this order.

  • 1.6 – 1.6.3 (in production)

    The current production version is 1.6.3, which was released last Wednesday.  We don’t usually do second dot point releases, but we did 1.6.1, 1.6.2, and 1.6.3 to get Out Of Process Plugin (OOPP) support out to engineers as it was implemented.

    When an OOPP becomes unresponsive, a pair of twin crashes is generated: one for the plugin process and one for the browser process.  For beta and pre-release products, both of these crashes are available for inspection via Socorro.  Unfortunately, Socorro throttles crash submissions from released products due to capacity constraints.  This means one or the other of the twins may not be available for inspection.  This limitation will vanish with the release of Socorro 1.8.

    You can now see whether a given crash signature is a hang or a crash, and whether it was plugin or browser related.  In the signature tables, if you see a stop sign symbol, that’s a hang.  A window means it is crash report information from the browser, and a small blue brick means it is crash report information from the plugin.

    If you are viewing one half of a hang pair for a pre-release or beta product, you’ll find a link to the other half at the top right of the report.

    You can also limit your searches (using the Advanced Search Filters) to look just at hangs or just at crashes, or to filter by whether a report is browser or plugin related.

  • 1.7 (Q2)

    We are in the process of baking 1.7.  The key feature of this release is that we will no longer be relying on NFS in production. All crash report submissions are already stored in HBase, but with Socorro 1.7, we will retrieve the data from HBase for processing and store the processed result back into HBase.

  • 1.8 (Q2)

    In 1.8, we will migrate the processors and minidump_stackwalk instances to run on our Hadoop nodes, further distributing our architecture.  This will give us the ability to scale up to the amount of data we have as it grows over time. You can see how this will simplify our architecture in the following diagram.

    [Diagram: New Socorro Architecture]

    With this release, the 15% throttling of Firefox release channel crashes goes away entirely.

  • 2.0 (Q3 2010)

    You may have noticed 1.9 is missing.  In 2.0 we will be making the power of HBase available to the end user, so expect some significant UI changes.

    Right now we are in the process of specifying the PRD for 2.0.  This involves interviewing a lot of people on the Firefox, Platform, and QA teams.  If we haven’t scheduled you for an interview and you think we ought to talk to you, please let us know.
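
To see why the throttling described under 1.6.3 matters so much for hang pairs on released products, here is a small worked calculation.  It rests on a simplifying assumption that is not confirmed anywhere in this post: that each twin of a pair is kept independently with probability 0.15.  The function name is hypothetical.

```python
def twin_availability(p):
    """Chance that both, exactly one, or neither twin of a hang pair is kept.

    Assumption: each twin passes the throttle independently with probability p.
    """
    both = p * p
    exactly_one = 2 * p * (1 - p)
    neither = (1 - p) ** 2
    return both, exactly_one, neither


# With the current 15% throttle.
both, exactly_one, neither = twin_availability(0.15)
print(f"both: {both:.1%}, exactly one: {exactly_one:.1%}, neither: {neither:.1%}")

# With no throttling (p = 1.0), as planned for 1.8, both twins are always kept.
print(twin_availability(1.0))
```

Under that assumption, only around 2% of hang pairs on a released product would have both halves available, while roughly a quarter would have exactly one.  Removing the throttle removes the problem entirely.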

Features under consideration

  • Full text search of crashes
  • Faceted search: start by finding crashes that match a particular signature, and then drill down into them by category.
    Which of these crashes involved a particular extension or plugin?  Which ones occurred within a short time after startup?
  • The ability to write and run your own Map/Reduce jobs (training will be provided)
  • Detection of “explosive crashes” that appear quickly
  • Viewing crashes by “build time” instead of clock time
  • Classification of crashes by component
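
To make the faceted search idea concrete, here is a minimal sketch of drilling down from a signature into categories.  The crash records, field names (`extensions`, `uptime_s`), and the 10-second startup window are hypothetical and chosen for illustration.  They are not Socorro's schema, and a real implementation would run against HBase rather than an in-memory list.

```python
from collections import Counter

# Hypothetical crash records; the field names are illustrative only.
crashes = [
    {"signature": "sigA", "extensions": ["ext1", "ext2"], "uptime_s": 3},
    {"signature": "sigA", "extensions": ["ext1"], "uptime_s": 400},
    {"signature": "sigB", "extensions": ["ext2"], "uptime_s": 5},
    {"signature": "sigA", "extensions": [], "uptime_s": 8},
]


def facet(crashes, signature, startup_window_s=10):
    """Find crashes matching a signature, then break them down by category."""
    matching = [c for c in crashes if c["signature"] == signature]
    # Facet 1: how many matching crashes involved each extension?
    by_extension = Counter(ext for c in matching for ext in c["extensions"])
    # Facet 2: how many happened shortly after startup?
    near_startup = sum(1 for c in matching if c["uptime_s"] <= startup_window_s)
    return {
        "total": len(matching),
        "by_extension": by_extension,
        "near_startup": near_startup,
    }


result = facet(crashes, "sigA")
print(result["total"])               # 3
print(dict(result["by_extension"]))  # {'ext1': 2, 'ext2': 1}
print(result["near_startup"])        # 2
```

The first step narrows the data to one signature.  Each facet is then just a count over that subset, which is why this kind of query benefits from processing 100% of crashes.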

This is a big list, obviously. We need your feedback – what should we work on first?

One thing that we’ve learned so far through the interviews is that people are not familiar with the existing features of Socorro, so expect further blog posts with more information on how best to use it!

How to get involved

As always, we welcome feedback and input on our plans.

You can contact the team at socorro-dev@mozilla.com, or me personally at laura@mozilla.com.

In addition, we always welcome contributions.  You can find our code repository at
http://code.google.com/p/socorro/

We hold project meetings on Wednesday afternoons.  Details and agendas are here:
https://wiki.mozilla.org/Breakpad/Status_Meetings

2 responses

  1. morgamic wrote:

    Great post, Laura. I’m excited about the changes coming up and how it will improve the stability of Firefox. We’ve come a long way but we still have a lot of room to grow — it’s a good place to be.

  2. David Tenser wrote:

    Looks great! Some questions:

    * It doesn’t look like it’s possible to compare 3.6.3 with 3.6.4 by showing both graphs at the same time. Any plans for that?
    * Why is 3.6.4 listed as a current release — shouldn’t it be 3.6.3?
    * If I compare the crashes per ADUs on 3.6.3 and 3.6.4, it looks like 3.6.4 crashes 10x more often. 3.6.3 is about 0.25 and 3.6.4 is around 3.0. I assume this is wrong?