Mozilla Network Outage Report (Phoenix) – 03/08/2011, 5:00am PST – 11:30am PST

mrz

For several hours this morning, Mozilla’s Phoenix data center suffered intermittent outages. This was fallout from yesterday’s Juniper SRX JunOS upgrade.

The following sites/services may have experienced degraded performance or partial/full outages:

  • Firefox Sync
  • Socorro (crash-stats.mozilla.com & crash-reports.mozilla.com)
  • input.mozilla.com
  • pulse.mozilla.org
  • firefoxlive.mozilla.org
  • demos.mozilla.org
  • www.mozillademos.org
  • www.drumbeat.org

Background:
We encountered two separate issues, both tracked in bug 639745.

  1. DHCP relay failures. This is a regression in the JunOS code.

    Just before 10:00pm Monday night, multiple hosts in Phoenix began to lose their DHCP leases and drop offline. For reasons not yet understood, the DHCP relay feature was no longer operational.

    This caused an 8-minute outage for support.mozilla.com.

  2. High CPU load. We began experiencing high (maximum) CPU usage on multiple FPCs after upgrading from 10.1 to 10.4R2. This did not have any immediate impact, and we opted to continue working overnight with JTAC on a resolution.

    This morning, as general load increased, this became a service-impacting issue. Netops downgraded to 10.3R2.11 and eventually to 10.2S7 to resolve these issues.

We apologize for any inconvenience this may have caused and will continue to work with Juniper to understand why this failed and to find a long-term remedy.