Mozilla’s flagship datacenter, SCL3, underwent its first major stress test yesterday. Lucky for us, we had the advantage of knowing it was coming. The facility operators had scheduled significant electrical power maintenance which would result in the sequential shutdown of both our primary and secondary power feeds. We would need to absorb several power “failures” as we transferred back and forth between feeds over a 24-hour period.
From day one, we designed our infrastructure for maximum redundancy. Every power branch is color-coded, from the PDU to the overhead bus and even right down to the equipment cords. This allows us, at a glance, to determine exactly which power supply on each server will be impacted by an electrical event anywhere in the facility. It also helps avoid operator error by providing visual confirmation when equipment is properly cabled. In the case of this maintenance, these visual cues dramatically simplified our pre-game audit and let us sleep peacefully the night before.
So, how did we do? Well, we learned a few lessons:
Cheap comes at a price
Although the vast majority of our hardware is datacenter-class, there are a few legacy servers which lack the redundant power supplies to ride out an event like this. For these servers, we rely on Automatic Transfer Switches to provide external failover on an as-needed basis. To be fully effective, this technology generally requires phase synchronization between the power branches… a complex capability not included in our design for this facility. As a result, our success rate during the transfer was about 90% (a good score on your CCIE exam, but not down in the trenches). Two devices were power-cycled due to slow failover, but at least the ATS enabled an immediate, automated recovery.
Quality control matters
My biggest fear during datacenter maintenance is the human factor. Our racks are dense, and it’s surprisingly easy to knock a power cord loose while you’re shoulder-deep in copper spaghetti. We take advantage of several innovations to physically protect our cabling against drive-by disconnections. For the power cords, we use retention sleeves to ensure a more resilient mechanical connection.
Our first batch, however, had some loose manufacturing tolerances. After application, a cord may visually appear to be fully inserted without having made an actual electrical connection. Because our equipment has multiple PSUs, the absence of a single power feed could potentially go unnoticed… until that feed is the only one left. As a result of this oversight, two additional devices on the floor took an outage when the redundant power was cut.
Despite these issues, we continued to serve full production load out of this facility for the duration of the maintenance. Our core, backbone, and high-density equipment functioned exactly as designed, and that instills a lot of confidence as we move forward. The incidents we did encounter highlighted possible shortcomings, and addressing them will only improve our design and response in the future.