The Crashing Edge

Sep 17 2008

This is the first post in a weekly blog series about the crash reporting system, much like the other *edge blogs out there.

In short, we’ve rewritten most of the system to accommodate throughput that is more than 10 times the projected traffic. This is not because Firefox 3 is crashing more; we are seeing increased traffic because:

Client-side throttling has not been effective
Overall number of users has increased
Mac Intel builds now submit crashes properly

What have we been up to?

First, two bug lists:

Bugs closed between the last blog post and now
Socorro 0.6 — our next milestone, targeted for early October

Here are some issues that have been addressed in the last 4-6 weeks (not all of which had bugs):

server-side throttling now possible and configurable
improved separation between monitor and processor jobs
updated collector now saves unprocessed crashes grouped by hour
mysterious death of main monitor thread resolved
collector, monitor, processor now use a common config
SVN structure simplified; the /dist and /scripts/dist-* scripts are no longer required to copy things inside SVN
Python/Pylons reporter replaced with PHP/Kohana reporter
web application now clustered with memcache support
removal of SQLAlchemy from all tiers of the system
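
To give a feel for what configurable server-side throttling can look like, here is a minimal sketch. It is not the actual collector code: the rule format, the `THROTTLE_RULES` name, the `should_process` function, and the "Comments" rule are all assumptions made up for illustration. The only number taken from our setup is the current global rate of 10%.

```python
import random

# Hypothetical rule format (an assumption, not Socorro's real config):
# (field name or None for "any crash", predicate on the field value,
#  percentage of matching crashes to process right away).
THROTTLE_RULES = [
    ("Comments", lambda value: bool(value), 100),  # illustrative rule only
    (None, lambda value: True, 10),                # global default: 10%
]


def should_process(crash, rules=THROTTLE_RULES, rng=random.random):
    """Return True to process the crash now, False to defer it.

    The first rule whose predicate matches decides the outcome.
    `rng` returns a float in [0, 1) and is injectable for testing.
    """
    for field, predicate, percent in rules:
        value = crash if field is None else crash.get(field)
        if predicate(value):
            return rng() * 100 < percent
    return False  # no rule matched: defer by default
```

Because the rules are plain data, the rate can be changed without touching the decision logic, which is the point of making throttling configurable.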

Known Issues

Known issues are mostly listed in the 0.6 buglist. If you know of a problem that isn’t already filed, please file it to help us keep track of things. The pressing issues with the reporter are:

database bottleneck
inefficient queries
corrupted summary tables for topcrashers

What’s Next?

Our goals as we move into Q4 2008 are:

work with PostgreSQL to update and optimize our hardware and software configurations
move expensive aggregate queries to cron jobs, summary tables
move focus from performance and maintenance to feature development (e.g. new reports, graphs)
fully document the system so that it is easy to contribute to and to learn how it works
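
The idea behind moving aggregate queries into cron jobs and summary tables can be sketched roughly as follows. This is an illustration under stated assumptions, not our schema: the `reports` and `topcrashers_summary` tables, their columns, and the function names are hypothetical, and SQLite stands in for PostgreSQL so the example runs on its own.

```python
import sqlite3


def rebuild_topcrashers(conn):
    """Recompute per-signature crash counts into the summary table.

    Meant to be run periodically (e.g. from cron) so that page loads
    read the small summary table instead of aggregating raw reports.
    """
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM topcrashers_summary")
        conn.execute(
            "INSERT INTO topcrashers_summary (signature, crash_count) "
            "SELECT signature, COUNT(*) FROM reports GROUP BY signature"
        )


def top_crashers(conn, limit=10):
    """Cheap read path: no aggregation at request time."""
    return conn.execute(
        "SELECT signature, crash_count FROM topcrashers_summary "
        "ORDER BY crash_count DESC, signature LIMIT ?",
        (limit,),
    ).fetchall()


def demo():
    """Build a tiny in-memory database and return its top crashers."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, signature TEXT)")
    conn.execute(
        "CREATE TABLE topcrashers_summary "
        "(signature TEXT PRIMARY KEY, crash_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO reports (signature) VALUES (?)",
        [("sig_a",), ("sig_b",), ("sig_a",)],
    )
    rebuild_topcrashers(conn)
    return top_crashers(conn)
```

Rebuilding inside a single transaction means readers see either the old summary or the new one, never a half-filled table.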

In the next week, we will be focusing on documentation and making the reporter usable. As we resolve our scaling issues, anybody interested in revamping some of the UI is welcome to jump in. Find us in #breakpad on IRC.

4 responses

Standard8 wrote on September 18, 2008 at 12:08 am:

Is the throttling on a per-project basis? If so, will project drivers be kept informed of changes?

morgamic wrote on September 18, 2008 at 12:26 am:

@Standard8 – The server-side throttling is currently 10% and is global. However, the code for deferring jobs is configurable (see throttleConditions):
http://code.google.com/p/socorro/wiki/SocorroCollector

In all cases, reports that are directly requested will be bumped to the top of their queue.

Long term we have different options for splitting traffic and customizing throttling and priorities for submitted crashes. One idea we have discussed has been using different domains for collection per app, or changing collector behavior based on channel.

FP wrote on September 18, 2008 at 5:17 am:

Do you know when the Top Crashers list will be fixed? It seems to have been broken for a long time.

morgamic wrote on September 18, 2008 at 8:37 am:

@FP – That’s the current focus, hoping to fix them this week. In the meantime, you can use the query form to generate a similar list.

Mozilla Web Development

For make benefit of glorious tubes
