Updates from lars

  • lars 6:59 pm on September 1, 2011

    Socorro HBase connection issues 

    Socorro is currently experiencing an HBase connection issue of unknown origin. Processing is down. Web App functions that require fetching actual crashes are down. Web App functions that use the database appear to be working normally.

    When I learn more, I’ll say more…

    UPDATE: as of 9:30pm PDT, we’re back up and running. All systems should be back to normal. No crashes were lost during this outage.

     
  • lars 6:18 pm on August 31, 2011

    Socorro Disrupted by Data Center Issue

    The Socorro Web App is currently inaccessible due to issues in the Phoenix Data Center. Corey Shields broadcasted this message: “IT is working a network outage in our Phoenix data center right now. This is causing widespread site issues as well as netsplits on IRC. Will keep you posted.”

    The outage seems pretty absolute, as I can access none of the Socorro infrastructure to see how it is faring. I will keep you informed as I learn more.

     
    • lars 7:20 pm on August 31, 2011

      It appears that as of about 19:41 PDT, we’re back online. As far as I can tell, we’ve got full functionality.

      The bad news: according to the collector logs, most crashes submitted from 18:07 PDT to 19:41 have been lost. I’ll report more as I investigate further.

  • lars 7:14 am on August 1, 2011

    Socorro Trouble 

    Socorro is experiencing some as-yet-undiagnosed networking difficulty. Processing is down, including priority processing. Collection of new crashes is unaffected. I’ll post more once we figure out what is going wrong.

    UPDATE 8:40am PDT: we’ve restored most of the processing capacity of Socorro. We’re still experiencing some trouble from an unknown source and are still working on it.

    UPDATE 10:00am PDT: we seem to be back to normal.

     
  • lars 7:52 pm on July 19, 2011

    Socorro Unplanned Outage 

    Socorro is experiencing an unplanned outage due to network difficulty in the Phoenix data center. The user interface continues to function for aggregate queries. Calling up individual crashes, along with crash processing, is stalled. Crashes coming into the collectors are unaffected other than being delayed in processing. No priority jobs are being processed.

    I will report back as soon as I hear more news as to the potential duration of this outage.

    UPDATE: As of 9pm PDT, we appear to have resolved the network issues. Socorro is fully functional again. Crashes received during the trouble will be sent to processing over the next hour. Please report anything that doesn’t appear to be functioning normally.

     
  • lars 1:48 pm on July 8, 2011

    Socorro experiencing HBase trouble 

    Socorro is currently experiencing downtime due to technical difficulties with HBase. Standard processing, priority processing and any activities that require pulling individual crash reports are suspended until we’ve resolved the problem. Crash collection is unaffected.

    I’ll post again when I have more news…

     
    • lars 3:08 pm on July 8, 2011

      The HBase connections came back to life at about 3:55PM PDT. The backlog of processing will continue through the evening. If you encounter any further Socorro trouble, please report it via Bugzilla.

  • lars 2:30 pm on June 30, 2011

    Brief downtime for upgrade 

    We’ll be having a brief downtime for an upgrade on Thursday, June 30 around 3:30pm PDT. The downtime will be quite short. I’ll give notice when we’re done.

     
    • lars 3:24 pm on June 30, 2011

      While we’re still verifying, everything appears to be running normally. Welcome to the first release in the Socorro 2.0 line.

  • lars 1:24 pm on May 13, 2011

    Priority Crash Processing Delayed 

    At 7:00 PDT on Friday, May 13, Socorro will be reprocessing about 80,000 crashes for Fennec and Firefox. This will slow down or delay priority crash processing for ad hoc requests of unprocessed crashes. We expect the slowdown to last no longer than a couple of hours.

    Other Socorro services (collection, processing and the Web App) will not be affected.

     
    • lars 7:29 pm on May 13, 2011

      This process completed in under an hour. All systems are now running normally.

  • lars 5:12 pm on March 10, 2011

    Socorro Service Restored 

    IT announces that they’re done mucking about with the infrastructure for now. All Socorro services should be back to normal.

    We have been notified that this intermittent outage may repeat later this evening.

     
  • lars 4:26 pm on March 10, 2011

    Socorro Trouble Today 

    Socorro processing has been intermittently adversely affected by infrastructure issues since about 3:30 PST today. First our connection to PostgreSQL failed and, after that came back, our connection to HBase started to get finicky.

    IT is on top of the issue, and I’ll post again when we’ve got more of an idea as to when these intermittent problems will be resolved.

     