Switch IOS Upgrade Post-Mortem

Last night Derek attempted to upgrade IOS on core1 & core2 to pick up software support for Cisco’s ACE module.

We ran into several issues and have postponed the upgrade on core2 until they are resolved.

  1. The switch’s Compact Flash cards weren’t in a format that rommon (the low-level boot loader) could read, so the switch failed to load any OS when rebooted (bug 473084).

    Unfortunately, IOS didn’t flag that as an error when reading from or writing to the card. It didn’t even flag an error when the boot variable was set to boot off of it, which is stupid because IOS clearly knew, as shown in the log when I had remote-hands re-seat the card:

    Jan 11 22:44:23 core2 3927937: Jan 11 22:44:21.978 PDT: %PCMCIAFS-SP-5-DIBERR: PCMCIA disk 0 is formatted from a different router or PC. A format in this router is required before an image can be booted from this device

  2. Two of the VMware ESX storage arrays are single-homed, connected to one switch (bug 473113).  This is fallout from previous NetApp performance issues that were forgotten and never addressed, and it caused a number of build VMs to go offline (bug 473112).
  3. A number of non-user-facing, multi-homed hosts went offline.  All of the RHEL servers have an active/standby network setup.  In several cases the standby interface didn’t work or wasn’t properly configured.  This was more of an annoyance to the IT Team than to anyone else, but it did cause outages for some backend services (most notably the VMware VC server and mradm01, one of the Nagios servers).
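
As a rough illustration of how the standby-interface problem in item 3 could be caught ahead of time, here is a minimal Python sketch. It assumes the active/standby setup uses the Linux bonding driver in active-backup mode and that its status text has the usual "Slave Interface" / "MII Status" layout; the post doesn't confirm either, and the function names are made up for this example.

```python
def parse_bond_status(text):
    """Return (active_slave, {slave_name: mii_status}) from bonding status text.

    Assumes the "Key: value" layout the Linux bonding driver prints.
    """
    active = None
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "Key: value"
        key, value = key.strip(), value.strip()
        if key == "Currently Active Slave":
            active = value
        elif key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level MII Status appears before any slave and is skipped.
            slaves[current] = value
    return active, slaves


def standby_problems(text):
    """List human-readable problems that would break a failover."""
    active, slaves = parse_bond_status(text)
    problems = []
    if len(slaves) < 2:
        problems.append("fewer than two slave interfaces configured")
    for name, status in slaves.items():
        if status != "up":
            role = "active" if name == active else "standby"
            problems.append(f"{role} interface {name} has MII status {status}")
    return problems
```

The status text would typically come from a file such as /proc/net/bonding/bond0 (the bond name is an assumption). A check like this only shows that the standby link is up; it doesn't prove that traffic actually fails over.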

We’ll be addressing those issues before scheduling the remaining upgrade to core2. We’ll also be looking at implementing a routine (perhaps quarterly) test of the infrastructure in a controlled environment to ensure its high “availability-ness”.
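
Since IOS did log the flash-format problem from item 1, one possible pre-upgrade check is to scan the collected syslog for that message before pointing the boot variable at the card. The Python sketch below is only an illustration: the function name is hypothetical, it assumes the standard "Mon DD HH:MM:SS host" syslog prefix seen in the excerpt above, and we only know the message appears when a card is re-seated, so a clean scan doesn't prove the card is bootable.

```python
import re

# Matches the mnemonic from the log excerpt, e.g. %PCMCIAFS-SP-5-DIBERR
DIBERR = re.compile(r"%PCMCIAFS-\S*?DIBERR:\s*(?P<detail>.*)")


def find_flash_format_warnings(lines):
    """Return (host, detail) pairs for every DIBERR message in the given lines."""
    hits = []
    for line in lines:
        match = DIBERR.search(line)
        if match is None:
            continue
        fields = line.split()
        # Assumed syslog prefix: month, day, time, then the host name.
        host = fields[3] if len(fields) > 3 else "unknown"
        hits.append((host, match.group("detail").strip()))
    return hits
```

Feeding it an open log file (any iterable of lines works) gives a list that can be checked per switch before the upgrade window.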
