The Crashing Edge

Mike Morgan


This is the first post in a weekly blog about the crash reporting system, much like the other *edge blogs out there.

In short, we’ve rewritten most of the system to accommodate throughput that is more than 10 times the projected traffic.  This is not because Firefox 3 is crashing more; we are seeing increased traffic because:

  • Client-side throttling has not been effective
  • Overall number of users has increased
  • Mac Intel builds now submit crashes properly

What have we been up to?

First, two bug lists:

Here are some issues that have been addressed in the last 4-6 weeks (not all of which had bugs):

  • server-side throttling now possible and configurable
  • improved separation between monitor and processor jobs
  • updated collector now saves unprocessed crashes grouped by hour
  • mysterious death of main monitor thread resolved
  • collector, monitor, processor now use a common config
  • SVN structure simplified; the /dist and /scripts/dist-* scripts are no longer required to copy things inside SVN
  • Python/Pylons reporter replaced with PHP/Kohana reporter
  • web application now clustered with memcache support
  • removal of SQLAlchemy from all tiers of the system
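
To illustrate what "possible and configurable" server-side throttling can look like, here is a minimal sketch. It is not the actual collector code: the rule format, the field names ("Comments", "Version") and the function name should_process are all assumptions made up for this example, and the percentages are placeholder values.

```python
import random

# Hypothetical rule format: (field name, predicate, percent of matches to keep).
# Rules are checked in order and the first matching rule decides.
THROTTLE_RULES = [
    ("Comments", lambda value: bool(value), 100),          # keep every crash with a user comment
    ("Version", lambda value: value.endswith("pre"), 50),  # keep half of pre-release crashes
    (None, lambda value: True, 10),                        # catch-all: keep 10 percent
]


def should_process(report, rules=THROTTLE_RULES, roll=None):
    """Return True to process the report now, False to defer it."""
    if roll is None:
        roll = random.uniform(0, 100)  # random draw in [0, 100]
    for field, predicate, percent in rules:
        # The catch-all rule has no field, so its predicate ignores the value.
        value = report.get(field, "") if field is not None else None
        if predicate(value):
            return roll < percent
    return True  # no rule matched: process by default
```

Passing roll explicitly makes the decision repeatable, which is handy for testing a new set of rules before deploying it. A deferred report is only skipped for now; it is not thrown away.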

Known Issues

Known issues are mostly listed in the 0.6 buglist. If you know of a problem that isn’t already filed, please file it to help us keep track of things.  Pressing issues with the reporter are:

  • database bottleneck
  • inefficient queries
  • corrupted summary tables for topcrashers

What’s Next?

Our goals as we move into Q4 2008 are:

  • work with PostgreSQL to update and optimize our hardware and software configurations
  • move expensive aggregate queries to cron jobs and summary tables
  • move focus from performance and maintenance to feature development (e.g. new reports, graphs)
  • fully document the system so that it is easy to contribute to it or learn how it works
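
The summary-table goal above can be sketched roughly as follows. This is an illustration under assumptions, not our schema: the table names (reports, top_crashers), the column names and the sample signatures are invented, and SQLite stands in for PostgreSQL so the example is self-contained. The idea is that a job run from cron does the expensive GROUP BY once, and the web application reads only the small precomputed table.

```python
import sqlite3


def rebuild_top_crashers(conn):
    """Recompute per-signature crash counts (meant to be run from cron)."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("DELETE FROM top_crashers")
        conn.execute(
            "INSERT INTO top_crashers (signature, crash_count) "
            "SELECT signature, COUNT(*) FROM reports GROUP BY signature"
        )


def top_crashers(conn, limit=10):
    """Cheap read path for the web application: no aggregation at request time."""
    return conn.execute(
        "SELECT signature, crash_count FROM top_crashers "
        "ORDER BY crash_count DESC, signature LIMIT ?",
        (limit,),
    ).fetchall()


# Demo with made-up data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, signature TEXT)")
conn.execute(
    "CREATE TABLE top_crashers (signature TEXT PRIMARY KEY, crash_count INTEGER)"
)
conn.executemany(
    "INSERT INTO reports (signature) VALUES (?)",
    [("signature_a",), ("signature_a",), ("signature_b",)],
)
rebuild_top_crashers(conn)
print(top_crashers(conn))
```

The trade-off is freshness: the list is only as current as the last cron run, which is usually acceptable for a report like topcrashers.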

In the next week, we will be focusing on documentation and making the reporter usable.  As we resolve our scaling issues, if anybody is interested in revamping some of the UI, you are welcome to jump in.  Find us in #breakpad on IRC.

4 responses

  1. Standard8 wrote:

    Is the throttling on a per-project basis? If so, will project drivers be kept informed of changes?

  2. morgamic wrote:

    @Standard8 – The server-side throttling is currently 10% and is global. However, the code for deferring jobs is configurable (see throttleConditions):
    http://code.google.com/p/socorro/wiki/SocorroCollector

    In all cases, reports that are directly requested will be bumped to the top of their queue.

    Long term we have different options for splitting traffic and customizing throttling and priorities for submitted crashes. One idea we have discussed has been using different domains for collection per app, or changing collector behavior based on channel.

  3. FP wrote:

    Do you know when the Top Crashers list will be fixed? It seems to have been broken for a long time.

  4. morgamic wrote:

    @FP – That’s the current focus, hoping to fix them this week. In the meantime, you can use the query form to generate a similar list.