Last week, from 23:24 PDT on 9 May 2012 until 07:20 the following morning, BrowserID had an outage that affected 50% of login requests.
Embarrassingly, we didn’t find out until a user filed bug 753728 with us.
To put it bluntly, the cause was human error.
Specifically, all nodes of the load balancer pool that governs verification of BrowserID logins in our Northern California data center were accidentally drained. A test run of a load balancer management script was executed against a production pool instead of the stage pool.
Some sites use a local verification system instead of calling back to the browserid.org systems; they were not affected.
The timing of this issue was painfully coincidental: it occurred the week before our planned rollout of cepmon, which includes robust monitoring of the load balancer pools. It will catch situations like this (among many others) and immediately alert both the SRE team and us.
Additionally, we have a new VPN config for internal use that doesn’t allow cross-talk from stage to prod, which should reduce the incidence of ‘oopses’ of this nature.
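To make the pool-monitoring idea concrete, here is a minimal sketch of the kind of check that would have flagged a fully drained pool. It is an illustration only, not how cepmon works: the `Pool` type, the `check_pool` function, the pool name, and the 50% threshold are all hypothetical, and a real check would read node counts from the load balancer rather than take them as arguments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    """Hypothetical snapshot of a load balancer pool's state."""
    name: str
    active_nodes: int
    total_nodes: int


def check_pool(pool: Pool, min_active_fraction: float = 0.5) -> Optional[str]:
    """Return an alert message if the pool looks unhealthy, else None.

    The threshold is an assumed value chosen for illustration.
    """
    if pool.total_nodes == 0:
        return f"CRITICAL: pool {pool.name} has no nodes configured"
    if pool.active_nodes == 0:
        # The failure mode from this outage: every node drained.
        return (f"CRITICAL: pool {pool.name} has 0 of "
                f"{pool.total_nodes} nodes active (fully drained)")
    if pool.active_nodes / pool.total_nodes < min_active_fraction:
        return (f"WARNING: pool {pool.name} has only {pool.active_nodes} "
                f"of {pool.total_nodes} nodes active")
    return None


if __name__ == "__main__":
    # Hypothetical pool name and node counts.
    alert = check_pool(Pool("verifier-prod", active_nodes=0, total_nodes=4))
    if alert:
        print(alert)
```

Run on a schedule and wired to a pager, a check like this would raise an alert within one polling interval of the drain, rather than leaving it to a user's bug report.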