To Tree or Not to Tree

This week, our Phoenix datacenter fell prey to a series of brief rolling outages which visibly impacted many of Mozilla’s public services.

I blame fox2mike

Generally speaking, our datacenter architectures are intentionally simple and spanning tree has served us well. However, as we have grown to meet demand, some of our more… venerable datacenters have become convoluted as new applications are shoehorned into old infrastructure.

Two weeks ago, we brought a new expansion online in Phoenix. Little did we suspect this would be the straw that broke the camel’s back. Minor spanning tree events that had previously gone unnoticed quickly escalated into very noticeable spanning tree cascades. Frustratingly, outages would often resolve themselves before netops personnel could log in to diagnose them. Cell phones vibrated at odd hours of the night. Unkind words were spoken.

Ultimately, we traced the fragility to an oversight in our spanning tree design. Although Juniper is our vendor of choice, we do rely on Cisco’s 3120 blade switch for our HP c7000 chassis. This multi-vendor network creates interesting challenges. In this case, we discovered that Juniper’s VSTP mode is not entirely compatible with Cisco’s rapid-pvst mode: in JUNOS versions prior to 10.3, VSTP is unable to fully converge with rapid-pvst. For more information, see Juniper KB 18291 (Juniper support account required).

What did we learn?

  1. Be diligent about marking server trunk ports as spanning tree edge ports. Otherwise, these ports will generate topology changes when a server reboots.
  2. There’s no such thing as too much logging. Logging spanning tree events can alert you to unexpected topology changes (see #1).
  3. Not all spanning tree protocols are created equal. Don’t blindly trust that spanning tree is doing the right thing.
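
Lessons 1 and 2 lend themselves to a simple audit. What follows is only a rough sketch in Python, not our actual tooling: the `Port` record, the port names, and the idea of feeding in one port name per logged topology change are all hypothetical stand-ins for whatever your inventory and syslog pipeline really provide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Port:
    """Hypothetical inventory record for one switch port."""
    name: str
    server_facing: bool  # True if a server (not another switch) is attached
    edge: bool           # True if configured as a spanning tree edge port


def missing_edge_ports(ports):
    """Return names of server-facing ports that are not marked as edge ports."""
    return [p.name for p in ports if p.server_facing and not p.edge]


def noisy_ports(events, threshold):
    """Count topology-change events per port and keep those at or above threshold.

    `events` is an iterable of port names, one entry per logged topology
    change (an assumed, pre-parsed form of the switch logs).
    """
    counts = Counter(events)
    return {port: n for port, n in counts.items() if n >= threshold}


# Made-up sample data for illustration only.
ports = [
    Port("ge-0/0/1", server_facing=True, edge=True),
    Port("ge-0/0/2", server_facing=True, edge=False),
    Port("xe-1/0/0", server_facing=False, edge=False),
]
events = ["ge-0/0/2", "ge-0/0/2", "xe-1/0/0", "ge-0/0/2"]

print(missing_edge_ports(ports))
print(noisy_ports(events, threshold=3))
```

A port that shows up in both results is a likely suspect: it faces a server, is not marked as an edge port, and keeps generating topology changes.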

How do we avoid this, moving forward?

We’re taking great pains to eliminate spanning tree entirely from our newest datacenter, SCL3. While we’re not quite ready to make the leap to a unified fabric architecture (such as Juniper’s QFabric or Cisco’s Nexus), modern multi-chassis technologies can still offer significant improvements. In our case, we’ll be deploying Juniper’s XRE line to enable virtual chassis support on our core EX8200 platform.

Juniper's XRE200

With virtual chassis at every level (core, aggregation, access), we will no longer depend on spanning tree for layer 2 redundancy. Instead, we will be able to rely on a link aggregation protocol (such as LACP). This comes with several added benefits:

  • Improved utilization and load balancing of redundant links
  • Faster convergence
  • Capacity for growth
  • Not spanning tree
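
To make the load balancing point concrete, here is a minimal sketch of how an aggregated bundle can spread traffic: hash some fields of each flow and use the result to pick a member link. This is an illustration only. The link names and the choice of CRC32 are assumptions, and real switches use their own hardware hash over whichever header fields they are configured for.

```python
import zlib


def pick_link(flow, links):
    """Pick a member link for a flow by hashing its identifying fields.

    `flow` is a tuple such as (src_ip, dst_ip, protocol, src_port, dst_port).
    Every packet of the same flow hashes to the same link, which keeps
    packets in order while different flows spread across the bundle.
    """
    key = "|".join(str(field) for field in flow).encode()
    return links[zlib.crc32(key) % len(links)]


# Hypothetical two-member bundle and a handful of made-up flows.
links = ["member-0", "member-1"]
flows = [("10.0.0.5", "10.0.1.9", 6, 49152 + i, 443) for i in range(8)]

for flow in flows:
    print(flow[3], "->", pick_link(flow, links))

# If one member fails, the same flows are simply hashed over what remains.
surviving = ["member-1"]
assert all(pick_link(flow, surviving) == "member-1" for flow in flows)
```

Unlike spanning tree, which blocks redundant links until a failure occurs, every member of the bundle carries traffic, and losing one member only shrinks the set of links the hash chooses from.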

Once this architecture is vetted in SCL3, retrofitting PHX1 with the XRE devices will become a top priority.