Happy Power Outage Day!

Derek Moore


Mozilla’s flagship datacenter, SCL3, underwent its first major stress test yesterday. Lucky for us, we had the advantage of knowing it was coming. The facility operators had scheduled significant electrical power maintenance which would result in the sequential shutdown of both our primary and secondary power feeds. We would need to absorb several power “failures” as we transferred back and forth between feeds over a 24-hour period.

[Photo: "fox2mike is a jerk"]

From day one, we designed our infrastructure for maximum redundancy. Every power branch is color-coded, from the PDU to the overhead bus and even right down to the equipment cords. This allows us, at a glance, to determine exactly which power supply on each server will be impacted by an electrical event anywhere in the facility. It also helps avoid operator error by providing visual confirmation when equipment is properly cabled. In the case of this maintenance, these visual cues dramatically simplified our pre-game audit and let us sleep peacefully the night before.

[Image: mozilla-power-colors]
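
To make that concrete, here's a toy Python sketch of the kind of lookup the color-coding enables. Every color, feed name, and host below is invented for illustration; it shows the shape of the idea, not our actual inventory or scheme.

    # Hypothetical example: none of these colors, feeds, or hosts are real.
    # The point is that when every cord color traces to exactly one feed,
    # "what breaks if feed A drops?" becomes a trivial lookup.

    FEED_BY_COLOR = {"red": "feed-A", "blue": "feed-B"}

    # (server, psu, cord color) -- in practice this comes from inventory.
    CABLING = [
        ("web1", "psu0", "red"),
        ("web1", "psu1", "blue"),
        ("db1",  "psu0", "red"),  # legacy box with a single PSU
    ]

    def impacted_psus(failed_feed):
        """Every (server, psu) pair that loses power when a feed drops."""
        return [(srv, psu) for srv, psu, color in CABLING
                if FEED_BY_COLOR[color] == failed_feed]

    print(impacted_psus("feed-A"))  # [('web1', 'psu0'), ('db1', 'psu0')]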

So, how did we do? Well, we learned a few lessons:

Cheap comes at a price

Although the vast majority of our hardware is datacenter-class, there are a few legacy servers which lack the redundant power supplies to ride out an event like this. For these servers, we rely on Automatic Transfer Switches to provide external failover on an as-needed basis. To be fully effective, this technology generally requires phase synchronization between the power branches… a complex capability not included in our design for this facility. As a result, our success rate during the transfer was about 90% (a good score on your CCIE exam, but not down in the trenches). Two devices were power-cycled when the failover was too slow, but at least the ATS enabled an immediate, automated recovery.
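
A quick back-of-the-envelope sketch (in Python, with assumed ballpark numbers rather than measurements from our hardware) shows why the transfer gap matters so much for single-supply boxes:

    # Back-of-the-envelope sketch of the failure mode above. The numbers
    # are generic assumptions (server PSUs are often rated for roughly
    # 10-20 ms of hold-up at load), not measurements from our gear.

    HOLDUP_MS = 16.0  # assumed PSU ride-through before output drops

    def survives(transfer_ms, holdup_ms=HOLDUP_MS):
        """A single-PSU server stays up only if the ATS completes the
        source transfer before the PSU's hold-up energy is exhausted."""
        return transfer_ms <= holdup_ms

    # With phase-synchronized feeds an ATS can switch almost instantly;
    # without synchronization it may have to wait for a safe window, and
    # the resulting gap can outlast the PSU's ride-through.
    for transfer_ms in (4.0, 12.0, 25.0):
        verdict = "rides it out" if survives(transfer_ms) else "power-cycles"
        print(f"{transfer_ms:>5.1f} ms transfer: server {verdict}")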

Quality control matters

My biggest fear during datacenter maintenance is the human factor. Our racks are dense, and it’s surprisingly easy to knock a power cord loose while you’re shoulder-deep in copper spaghetti. We take advantage of several innovations to physically protect our cabling against drive-by disconnections. For the power cords, we use retention sleeves to ensure a more resilient mechanical connection.

[Image: mozilla-retention-sleeves]

Our first batch, however, had loose manufacturing tolerances. With the sleeve applied, a cord can look fully inserted without actually making an electrical connection. Because our equipment has multiple PSUs, the silent absence of a single power feed can go unnoticed… until that feed is the only one left. This oversight cost us two additional devices on the floor, which took an outage when the one feed they were actually drawing from was cut.
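
The lesson we took away: verify electrically, don't eyeball. As a rough illustration, something like the following Python sketch would have caught it by polling each host's PSU sensors over IPMI. The hostnames and credentials are placeholders, and the exact sensor strings vary by vendor and firmware, so treat the matching as an assumption.

    # Hedged sketch of a "verify, don't eyeball" check: ask each host's BMC
    # for its power-supply sensor states and flag anything unhealthy. The
    # host list and credentials are placeholders, and the strings matched
    # below vary by vendor/firmware, so treat them as assumptions.

    import subprocess

    HOSTS = ["bmc-web1.example.com", "bmc-db1.example.com"]  # placeholders

    def psu_sensors(host, user="admin", password="secret"):
        """Return the PSU sensor lines reported over IPMI."""
        out = subprocess.run(
            ["ipmitool", "-H", host, "-U", user, "-P", password,
             "sdr", "type", "Power Supply"],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.splitlines()

    def suspect_feeds(lines):
        # Common IPMI event wordings for a dead or missing input; your
        # vendor's strings may differ.
        bad = ("Failure detected", "AC lost", "Input Lost")
        return [l for l in lines if any(b in l for b in bad)]

    if __name__ == "__main__":
        for host in HOSTS:
            for line in suspect_feeds(psu_sensors(host)):
                print(f"{host}: go reseat this one -> {line}")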


Despite these issues, we continued to serve full production load out of this facility for the duration of the maintenance. Our core, backbone, and high-density equipment functioned exactly as designed, and that instills a lot of confidence as we move forward. The incidents we did encounter highlighted real shortcomings, and fixing them will only improve our design and response in the future.


2 responses

  1. Cristian wrote:

    Hi Derek,

    Nice post. One note about not being able to see whether power works for one of the power supplies: your servers should have management software from the vendor, and that software can tell you whether the power supplies are working as expected. For example, Dell has Dell OpenManage, and other vendors must have some form of IPMI management software that can tell you whether you have power or not. You can also always check the stats on the power bar :)

    Cheers,

    1. Derek Moore wrote:

      Cristian, thanks for bringing this up. It’s a great point that I didn’t cover in the original post.

      We’re primarily an HP shop here, and the internal HP health monitors are crucial for monitoring the redundancy of both our power and network configurations. As you mentioned, many generic IPMI implementations have similar support. Our network vendor of choice, Juniper, offers even more detailed hardware health analysis.

      While the devices mentioned in my post were either too old or too specialized to offer these features, we definitely consider them mandatory for “production” equipment.