Router upgrades, San Jose

A couple of months ago I mentioned how things have grown in the past two years at Mozilla.  Back then we barely pushed any traffic to the Internet and survived on fewer than a dozen app servers.

Things have changed.  I’ll highlight just a few of them:

  • Active Firefox users grew from roughly 20 million users to over 70 million
  • Mozilla’s outbound traffic has grown from ~150Mbps to well over 800Mbps (and over 1.5Gbps during release periods)
  • The number of BGP routes on the Internet has grown from around 200k to more than 250k

That last bullet point brings us to today.

The two BGP-speaking routers in San Jose both have a Sup32 (the “CPU” of the router), which limits the number of routes they can hold in their FIB TCAM (the “route lookup table”).  Routes that can’t fit in the FIB TCAM end up being forwarded in software at the cost of CPU.  The more traffic we push, the higher the CPU tends to run, and lately it’s been running uncomfortably high.
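To make the overflow idea concrete, here is a rough sketch of the arithmetic.  It is a deliberately simplified model, not how the Sup32 actually accounts for TCAM entries: the function names are made up for illustration, and the capacity figure is a placeholder assumption, not a number from the hardware data sheet.

```python
# Simplified model: every route beyond the hardware table's capacity
# is assumed to be forwarded in software. Real TCAM behaviour is more
# involved; this only illustrates the arithmetic.

def software_forwarded_routes(route_count: int, fib_capacity: int) -> int:
    """Number of routes that do not fit in the hardware lookup table."""
    return max(0, route_count - fib_capacity)


def fib_utilization(route_count: int, fib_capacity: int) -> float:
    """Fraction of the hardware table that the routing table would need."""
    return route_count / fib_capacity


# HYPOTHETICAL capacity, chosen only for illustration.
ASSUMED_FIB_CAPACITY = 240_000

for routes in (200_000, 250_000):  # route counts quoted in the post
    spill = software_forwarded_routes(routes, ASSUMED_FIB_CAPACITY)
    util = fib_utilization(routes, ASSUMED_FIB_CAPACITY)
    print(f"{routes} routes: {spill} in software, {util:.0%} of table")
```

Under that assumed capacity, the older table size fits entirely in hardware while the current one spills over, which is the situation described above.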

I’m routinely getting alert emails:

  five minute load average 62% exceeds 60%
  five minute load average 83% exceeds 60%
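If you wanted to pull the numbers out of alerts like these, something along the following lines would do.  This is only a sketch that assumes the alert text always matches the wording shown above; the function name and the regular expression are mine, not part of any monitoring tool.

```python
import re

# Matches the alert wording quoted in the post; other formats are not handled.
ALERT_RE = re.compile(r"five minute load average (\d+)% exceeds (\d+)%")


def parse_alerts(text: str) -> list[tuple[int, int]]:
    """Return (load_percent, threshold_percent) for each alert found in text."""
    return [(int(load), int(limit)) for load, limit in ALERT_RE.findall(text)]


alerts = parse_alerts(
    "five minute load average 62% exceeds 60%\n"
    "five minute load average 83% exceeds 60%\n"
)
for load, limit in alerts:
    # How far over the threshold each alert was, in percentage points.
    print(f"load {load}% is {load - limit} points over the {limit}% threshold")
```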

And from trend graphs, it’s quite obvious.

I will be upgrading the Sup32s this week to Sup720-3BXLs.  I plan on doing one Tuesday and the other Thursday.  For the most part, this should not affect users.  Most of the headache is going to be in the backend: moving router interfaces around, moving Cacti graphs around and updating aggregate graphs.


Categories: Mozilla, Networking