RFO: SCL1 outage Oct 16, 2011

ravi

On October 13th at 13:24 PST, Nagios alerted on the start of a network event affecting reachability to the SCL1 data center. SCL1 is configured with redundant internet links, over which a VPN traverses redundant firewalls at both ends. There is also a point-to-point (p2p) link that connects directly to SJC1.

The running configuration had the VPN as the active path and the p2p disabled because of an ongoing issue (bug 680463).

Because the p2p path was disabled, the VPN failure caused a complete outage of SCL1 and all its services, primarily the release engineering and build infrastructure.

Initial investigation showed that each VPN endpoint, fw1.scl1 and vpn1.sjc1, reported the other sending an incorrect response while renegotiating the tunnel. Standard non-destructive troubleshooting failed to reestablish the tunnel.

During the normal course of troubleshooting, fw1.scl1 became unresponsive and on-site presence was required. Once fw1.scl1 was restored on site, traffic was shifted from the VPN to the p2p even though the p2p had not been confirmed fixed. Before traffic was moved, basic steps were taken to reseat optics and clean the fiber patches.

A review of the available logs did not identify why the VPN failed or why the recovery methods used were unsuccessful.

While traffic was being shifted to the p2p, the VPN recovered on its own. The decision was made to stay on the p2p and monitor it closely, being mindful of bug 680463, which has since been resolved.

Netops is investigating configurations to improve link fault detection and automatic failover to the standby path, and will implement them at a later date.
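
As an illustration only, the failover decision under investigation could be sketched as below. This is not the Netops configuration: the PathState type, the record_probe and choose_active helpers, and the threshold of three consecutive failed probes are all hypothetical, and a real deployment would most likely rely on the routers' and firewalls' own link-monitoring features rather than a script.

```python
from dataclasses import dataclass


@dataclass
class PathState:
    """Health bookkeeping for one path (hypothetical names: "vpn", "p2p")."""
    name: str
    consecutive_failures: int = 0


def record_probe(path: PathState, ok: bool) -> None:
    """Reset the failure counter on a good probe, otherwise increment it."""
    path.consecutive_failures = 0 if ok else path.consecutive_failures + 1


def choose_active(active: PathState, standby: PathState, threshold: int = 3) -> PathState:
    """Return the path that should carry traffic.

    Fail over only when the active path has missed `threshold` probes in a
    row AND the standby's latest probe succeeded. The threshold value is an
    assumption chosen for illustration.
    """
    active_down = active.consecutive_failures >= threshold
    standby_healthy = standby.consecutive_failures == 0
    if active_down and standby_healthy:
        return standby
    return active


if __name__ == "__main__":
    vpn, p2p = PathState("vpn"), PathState("p2p")
    for _ in range(3):
        record_probe(vpn, ok=False)   # VPN tunnel fails three probes in a row
    record_probe(p2p, ok=True)        # p2p link answers its probe
    print(choose_active(vpn, p2p).name)
```

Requiring a healthy standby before switching avoids moving traffic onto a path that is itself faulty, which was the concern with the p2p during this event.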

Complete timeline:

13:24 Initial Nagios alert
13:34 Netops is paged
13:56 Netops responds
14:05 Escalation to dmoore (page)
14:13 Escalation to dmoore (phone call)
14:15 Escalation to ravi (page)
14:16 Escalation to ravi (page)
14:16 Ravi responds
14:48 fw1.scl1 becomes unresponsive
16:17 Nagios alerts begin to clear
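
The intervals implied by the timeline can be worked out with a short script. This is a minimal sketch: the minutes_between helper is hypothetical, all times are assumed to fall on the same day, and 16:17 marks when alerts began to clear rather than a confirmed end of the outage.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("alert to page:      ", minutes_between("13:24", "13:34"), "min")
    print("page to response:   ", minutes_between("13:34", "13:56"), "min")
    print("alert to first clear:", minutes_between("13:24", "16:17"), "min")
```

By this arithmetic, 10 minutes passed between the first alert and the page, 22 minutes between the page and the first Netops response, and 173 minutes (2 hours 53 minutes) between the first alert and the first alerts clearing.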