BrowserID outage last week

What happened?

Last week (9 May 2012 23:24 PDT until the following morning at 07:20), BrowserID had an outage affecting 50% of login requests.

Embarrassingly, we didn’t find out until a user filed bug 753728 with us.

Root Cause

To put it bluntly, human error.

Specifically, all nodes of a load balancer pool that handles verification of BrowserID logins in our Northern California data center were accidentally drained. A test run of a load balancer management script was executed against a production pool instead of the stage pool.

Some sites use a local verification system instead of calling back to our systems; they were not affected.

The Fix

The timing of this issue was painfully coincidental: it occurred the week before we planned to roll out cepmon, which includes robust monitoring of the load balancer pools. It will catch situations like this (among many others) and immediately alert both the SRE team and us.
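To make the idea of pool monitoring concrete, here is a minimal sketch of the kind of check that would have flagged this outage. It is not cepmon's actual implementation, which this post does not describe; the `PoolStatus` type, the `find_unhealthy_pools` function, the 50% threshold, and the pool names are all hypothetical, and a real check would read node counts from the load balancer rather than from a hand-built list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolStatus:
    """Snapshot of one load balancer pool (hypothetical structure)."""
    name: str
    active_nodes: int
    total_nodes: int


def find_unhealthy_pools(pools: List[PoolStatus],
                         min_active_fraction: float = 0.5) -> List[str]:
    """Return one alert message per pool that looks unhealthy.

    A pool is flagged when it has no nodes configured, when every node
    is drained or down, or when the share of active nodes falls below
    min_active_fraction (an assumed threshold).
    """
    alerts = []
    for pool in pools:
        if pool.total_nodes == 0:
            alerts.append(f"{pool.name}: pool has no nodes configured")
        elif pool.active_nodes == 0:
            alerts.append(
                f"{pool.name}: all {pool.total_nodes} nodes drained or down")
        elif pool.active_nodes / pool.total_nodes < min_active_fraction:
            alerts.append(
                f"{pool.name}: only {pool.active_nodes}/{pool.total_nodes} "
                "nodes active")
    return alerts


if __name__ == "__main__":
    # Hypothetical snapshot resembling the outage: the production
    # verifier pool is fully drained while the stage pool is healthy.
    snapshot = [
        PoolStatus("verifier-prod", active_nodes=0, total_nodes=4),
        PoolStatus("verifier-stage", active_nodes=4, total_nodes=4),
    ]
    for message in find_unhealthy_pools(snapshot):
        print("ALERT:", message)
```

Run on a schedule and wired to a pager, a check along these lines would report a fully drained pool within one polling interval instead of waiting for a user's bug report.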

Additionally, we have a new VPN config for internal use that doesn’t allow cross-talk from stage to prod, which should reduce the incidence of ‘oopses’ of this nature.
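The VPN change blocks this class of mistake at the network level. As an illustration only, and not something we have described doing above, the same idea can be sketched inside a management script: refuse to drain a pool marked as production unless the operator retypes its name. The `drain_pool` function, the `ProductionGuardError` exception, and the pool names are hypothetical, and the function only returns a description of the action rather than talking to a real load balancer.

```python
from typing import Optional, Set


class ProductionGuardError(Exception):
    """Raised when a production pool is targeted without confirmation."""


def drain_pool(pool_name: str,
               production_pools: Set[str],
               confirm: Optional[str] = None) -> str:
    """Describe a drain action, guarding production pools.

    For a pool listed in production_pools, the caller must pass the
    pool name again as `confirm`; otherwise the call is rejected.
    Non-production pools need no confirmation.
    """
    if pool_name in production_pools and confirm != pool_name:
        raise ProductionGuardError(
            f"refusing to drain production pool {pool_name!r}: "
            "pass confirm=<pool name> to proceed")
    # A real script would call the load balancer here; this sketch
    # only reports what it would do.
    return f"draining {pool_name}"


if __name__ == "__main__":
    prod = {"verifier-prod"}  # hypothetical pool name
    print(drain_pool("verifier-stage", prod))
    try:
        drain_pool("verifier-prod", prod)
    except ProductionGuardError as err:
        print("blocked:", err)
```

A guard like this does not replace network separation, but it turns a silent mistake into an explicit error at the moment the wrong pool is named.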

One Comment

  1. Thanks for the transparency. We had reports from some customers about this problem and we pointed out this post.

    Alex @ Unihost Brasil

    Posted on 17-May-12 at 16:11
