Updates from Laura Thomson

  • Laura Thomson 10:17 am on December 5, 2012

    Planned Maintenance 5-8pm PT – PostgreSQL 9.2 upgrade 

    On Dec 5, 2012, from 5pm PT to 8pm PT, crash-stats.mozilla.com will be undergoing maintenance and will be temporarily unavailable.

    This evening’s outage will involve routine maintenance which includes upgrading our systems to use PostgreSQL version 9.2. We will also be deploying version 29 of Socorro, which primarily supports the database upgrade.

    We expect all systems to be recovered by 8pm PT.

     
  • Laura Thomson 2:03 pm on September 25, 2012

    Socorro upgrade: short downtime at 5.30pm PDT today 

    We will be doing a minor PostgreSQL update from version 9.0.9 to 9.0.10 tonight at 5.30pm PT. The Socorro webapp will be unavailable for 1-2 minutes at that time.

    This is being tracked in

     
  • Laura Thomson 12:57 pm on September 12, 2012

    Socorro maintenance window tonight, Sept 12, @ 5pm PDT 

    Tonight we are pushing an update to Socorro (http://crash-stats.mozilla.com) to support rapid betas. This involves complex database changes to support all the by-build-date views that are in the new release.

    We intend to do this without significant user-facing downtime, by breaking replication, failing over to secondary, pushing the changes to primary, verifying, and failing back. This is the first time we’ve tried a procedure like this for a code release. In the past we’ve done pushes like this one with downtime, although we have used this procedure for database upgrades. This is an attempt at making things disruption free for you, our users.
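
    The sequence described above can be sketched as a toy model. This is only an illustration of the ordering of the steps, not our actual tooling: the Cluster class, the push_without_downtime function, and the verify callback are hypothetical names, and the choice to stay on the secondary when verification fails is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a primary/secondary database pair (hypothetical)."""
    serving: str = "primary"          # node currently answering webapp queries
    replicating: bool = True          # whether primary -> secondary replication is on
    schema: dict = field(default_factory=lambda: {"primary": 1, "secondary": 1})


def push_without_downtime(cluster, new_version, verify):
    """Run the five steps in order and return the list of steps completed.

    `verify` is a caller-supplied check (an assumption for this sketch) that
    returns True when the changes on the primary look correct.
    """
    steps = []

    cluster.replicating = False               # 1. break replication
    steps.append("break replication")

    cluster.serving = "secondary"             # 2. fail over to secondary
    steps.append("fail over to secondary")

    cluster.schema["primary"] = new_version   # 3. push the changes to primary
    steps.append("push changes to primary")

    if not verify(cluster):                   # 4. verify
        # Assumption: on failure, keep serving from the untouched secondary.
        steps.append("verification failed")
        return steps
    steps.append("verify")

    cluster.serving = "primary"               # 5. fail back
    steps.append("fail back")
    return steps
```

    In this sketch the webapp keeps reading from the secondary while the primary is changed, which is why the push should not be visible to users unless verification fails.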

    If this doesn’t work for any reason, in the worst case the webapp will be offline for up to an hour, starting at around 5pm PDT.

    We will continue collecting crashes throughout, but processing will be paused during the maintenance window.

    You can track progress in this bug:
    https://bugzilla.mozilla.org/show_bug.cgi?id=790707

    The bugs that will be shipping in this release are documented here:
    https://bugzilla.mozilla.org/buglist.cgi?target_milestone=18&product=Socorro

     
  • Laura Thomson 5:39 am on July 11, 2012

    HBase outage – more leap second fallout 

    We are currently experiencing an HBase outage due to an NTP offset. This was caused by accidentally re-running the leap second correction script (via puppet).

    The NTP problem has been fixed – tracked in this bug (private infra only, sorry) https://bugzilla.mozilla.org/show_bug.cgi?id=772792.

    We are in the process of restoring HBase services and will update when it’s done.

    In the meantime you will be unable to load individual crash reports, and new crashes will not be processed, although they are being safely stored for later processing.

    Update: we’re back. Took two attempts to restore HBase as it was running low on file descriptors. We will solve that problem medium term with a purge according to our data retention policy. New purge process is in testing on staging at present.

     
  • Laura Thomson 5:36 pm on June 30, 2012

    HBase is down due to Java Leap Second bug 

    A problem with Java’s handling of the Leap Second today caused our HBase clusters to crash. Currently we are accepting but not processing crashes. You will also be unable to view individual crash reports until the problem is resolved.

    Tracking in

    https://bugzilla.mozilla.org/show_bug.cgi?id=769972

    I’ll update when the problem is fixed.

    Update: As of 11.25pm EDT, Socorro is functioning normally.

     
  • Laura Thomson 7:39 pm on April 15, 2012

    Socorro outage Sunday 4/15 

    The Socorro admin node was down for most of the day. This means that various pieces of batch processing did not happen. While the machine is now back online, we are still cleaning up.

    All data was collected, and processing is up to date. You may see some abnormalities in aggregate data and graphs until we complete backfilling and cleanup.

    Further detail can be found in bug 745539.

     
  • Laura Thomson 5:45 am on August 19, 2011

    Socorro instability Friday morning 

    HBase is currently offline, with some regions missing. It has been experiencing intermittent difficulties overnight as well. Metrics and Spec Ops engineers are working on the problem at present.

    User-facing symptoms are:

    • unable to load individual crashes in the webapp
    • crash processing is delayed until the problem is solved
    • aggregate numbers will be low until we can catch up on processing and regenerate aggregates

    This issue is being tracked in

    https://bugzilla.mozilla.org/show_bug.cgi?id=680348

    More updates as they come to hand.

     
    • Laura Thomson 5:48 am on August 19, 2011

      Back online, but with intermittent problems (10 regions still missing). Expect flakiness until we get those sorted out.

      • Laura Thomson 2:02 pm on August 19, 2011

        Flakiness has continued throughout the day. Crash processing is now progressing again, with intermittent errors.

        I anticipate that some of the aggregate reports we run overnight will have problems, especially if we are still processing a backlog. We can regenerate these after the backlog is processed.

        Cloudera is working on a fix. In summary (via tmary), the OOM killer killed HDFS, which shut down HBase unexpectedly. Some regions did not come back online after a clean restart. Cloudera is working with Spec Ops to restore these regions.

    • lars 10:29 pm on August 19, 2011

      As of approximately 10:30pm PDT, HBase was brought into a stable state and Socorro resumed operations. Regular crash processing will take several hours to catch up from such a long outage. Priority processing should continue at full speed, unaffected by the backlog. Some crash reports from Friday may be unavailable. There is a chance that the aggregate reports for Friday may be inaccurate. We will correct the data when we can get a full accounting of Friday’s crash traffic.

      The outage resulted in the corruption of some of Friday’s incoming crash data. The Metrics team, IT and external experts are working to restore the corrupted data. We’ll post more on that topic on Monday.

      Please report any lingering troubles that you may encounter from the Socorro Web App. Thank you for your patience during this trying day.

  • Laura Thomson 5:27 pm on August 14, 2011

    Socorro 2.2 now live, with individual beta reporting 

    Tonight we pushed out Socorro 2.2, which includes a change to the way we report data for betas. Each beta is now reported individually – as, for example, “6.0b4”. The final beta that goes on to become the release will be reported as 6.0(beta) on the beta channel, and 6.0 on the release channel.

    Kairo wrote about this change before it was released, in the second part of this blog post:
    Crash-stats Update, Planned Changes, And Crash Rates.

    Edited to add: We’re running some data updates this evening which may lead to slightly reduced performance. Everything should be 100% well before Monday morning PDT.

     
  • Laura Thomson 9:46 am on March 24, 2011

    INFO: Socorro processing 100% of Firefox 4 crash reports 

    Since Firefox 4 launched, we’ve been processing 100% of crash reports for it, instead of the usual 10% for GA versions. We’ll continue doing so until we start getting a backlog, or have a stability issue. I anticipate we’ll start building a backlog at some point in the next couple of days, as adoption increases.

     
    • Jonathan 8:32 am on April 3, 2011

      I am using Vista on an Asus 17 inch laptop and have been having FF crashes about a half dozen times a day since it was installed. I never had this problem with the beta versions or the RC but it seems that now it says 4 without the RC, whenever I come back to the computer after a period of non-use, there is a crash report waiting to be submitted. Occasionally it also crashes in use but there is no consistency on this so I can’t say when it is happening. I am now wondering whether I should downgrade to 3? Especially if only 10% of these continuous crash reports are now being looked at?

    • Dan 10:31 am on April 5, 2011

      Is it a goal of the Socorro team to one day process 100% of *all* crash reports?

    • Laura Thomson 7:35 am on April 6, 2011

      Jonathan: Worth going over to http://support.mozilla.com as they will help you diagnose the issue. A common cause of crashes on new versions, for example, is interference between your antivirus software and Firefox (we had a huge issue with this with 4; one well known AV vendor quarantined part of the browser, causing lots of crashes).

      Dan: Yes, it’s our goal to process 100%. One of the reasons for running at 100% was to effectively do an in-production perf test and understand where the bottlenecks are.

      Generally: We use 10% as statistical sampling at present. Processing 100% will give us more information about the long tail of crashes.

      Crashes are put into buckets (by stack trace, basically) and this is used to help prioritize bug fixes. The team tries to fix crashes that affect the most users first. Startup crashes are also given high priority as they put users in a bad situation.
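
      As a rough illustration of sampling plus bucketing, here is a minimal sketch. It is not Socorro’s actual code: the function names, the use of the top three stack frames as the bucket key, and the random-number-based sampling are all assumptions made for the example.

```python
import random
from collections import Counter


def should_process(rng, sample_rate=0.10):
    """Decide whether to process a crash; 0.10 mirrors the 10% sampling."""
    return rng.random() < sample_rate


def signature(stack_frames, depth=3):
    """Build a bucket key from the top frames of the stack (assumed scheme)."""
    return " | ".join(stack_frames[:depth])


def top_crashers(crashes, rng, sample_rate=0.10):
    """Bucket the sampled crashes by signature, most frequent bucket first.

    `crashes` is an iterable of stack traces, each a list of frame names.
    """
    buckets = Counter()
    for frames in crashes:
        if should_process(rng, sample_rate):
            buckets[signature(frames)] += 1
    return buckets.most_common()
```

      With sample_rate=1.0 every crash is counted, so rare signatures in the long tail show up in the buckets; at 0.10 the biggest buckets are still likely to rank near the top, but rare crashes may not appear at all.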

    • Jonathan 7:49 am on April 11, 2011

      Do you have ANY idea what has happened to FF recently which would be causing these crashes? I am not using AV on this computer and something serious must have changed since the beta or RCs which is causing these crashes. Are people downgrading to 3 while someone figures out what is causing 4 to crash? As I mentioned, not only is there no consistency to the crash reports, 4 now crashes in non-use! (whether the computer is in stand-by mode or just not being used). This indicates something serious with the program as opposed to an AV or open-tab related or graphics driver problem.

  • Laura Thomson 10:43 pm on March 7, 2011

    Topcrashers regeneration complete 

    We’re all done, and Socorro now has topcrash data for Firefox for Mobile and for Linux.

     