In computer systems (as in others) there are often bottlenecks, and removing one often reveals the next. Today’s an example of just that.
During a normal release we have tools we can use to adjust the rate at which we offer updates. We use them to reduce load on the back-end systems or on the download mirrors.
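For readers who haven't seen update throttling before, here is a minimal sketch of one way it could work. This is not how the actual update tools are built: the function name `should_offer_update`, the idea of hashing a client ID, and the percentage knob are all assumptions made up for illustration.

```python
import hashlib


def should_offer_update(client_id: str, throttle_percent: int) -> bool:
    """Decide whether a client is offered the update (hypothetical scheme).

    throttle_percent is the share of clients (0-100) that get the offer;
    100 means the release is completely unthrottled.
    """
    if not 0 <= throttle_percent <= 100:
        raise ValueError("throttle_percent must be between 0 and 100")
    # Hash the client ID into a stable bucket from 0 to 99, so the same
    # client gets the same answer each time it checks at a given setting.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < throttle_percent


if __name__ == "__main__":
    # Made-up client IDs, purely for demonstration.
    clients = [f"client-{i}" for i in range(10000)]
    offered = sum(should_offer_update(c, 25) for c in clients)
    print(f"offered to {offered} of {len(clients)} clients at 25%")
```

Raising the percentage only ever adds clients to the offered set, so the dial can be turned up gradually as the back end and mirrors prove they can take the load.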
Our preference is to do a release completely unthrottled so users get timely updates.
During the Firefox 3.0.6 release we had a number of system problems that prevented us from releasing updates unthrottled. These were all detailed in the Post Mortem.
To the Operations Team’s credit (and I’m serious here), most of those issues were removed prior to yesterday’s Firefox 3.0.7 release and by 9am this morning we were cranking along – no throttling.
Unfortunately, the Mirror Network started showing strain, and instead of throttling back the release, we opted to augment the Mirror Network with our own download servers in San Jose.
That pushed our aggregate bandwidth out of San Jose to nearly 3Gbps:
At around this time, offsite monitors started alerting about a sharp increase in page load times across various Mozilla web properties. It took a bit to track down, but the newly turned-up Level 3 peer was saturated:
Any outbound traffic whose best route was out through Level 3 was impacted. We fixed this temporarily by turning down Level 3.
(I should note that our design requirement for upstream transit is at least two connections per provider so we can push 2Gbps. Level 3 is no exception; however, the second connection has been offline because Derek was seeing a lot of packet loss across the optical connection, which coincidentally got resolved today.)
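The failure mode here is simple arithmetic: a provider running on one connection instead of two has half the headroom. A rough sketch of the kind of check that would flag it is below. The provider names, traffic figures, and the 90% threshold are hypothetical numbers chosen for illustration, not measurements from today.

```python
from typing import Dict, List, Tuple


def saturated_links(
    links: Dict[str, Tuple[float, float]], threshold: float = 0.9
) -> List[str]:
    """Return the names of links at or above the utilization threshold.

    links maps a link name to (current_mbps, capacity_mbps).
    """
    flagged = []
    for name, (current_mbps, capacity_mbps) in links.items():
        if capacity_mbps <= 0:
            raise ValueError(f"capacity for {name} must be positive")
        if current_mbps / capacity_mbps >= threshold:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical figures: provider_a is down to a single 1GE connection,
    # provider_b has both of its connections up.
    links = {
        "provider_a": (980.0, 1000.0),
        "provider_b": (1200.0, 2000.0),
    }
    print(saturated_links(links))  # prints ['provider_a']
```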
These problems are solvable, and we’ve had plans to put tools in place to balance load in situations like this. Unfortunately, today’s issues came up a lot sooner than we had planned for.
A few things we’ll be looking at before the next release:
- Evaluating Internap’s FCP to dynamically shift traffic based on cost and performance metrics. (And as luck would have it, this showed up this afternoon!)
- Looking to see how we can better balance outbound traffic outside of using FCP.
- Adding capacity to our Mirror Network (can you help?).
- Evaluating options around upgrading from several 1GE upstream connections to 10GE connections.
This is a great problem to have, to be sure, and a far cry from the panic three years ago of “OMG we’re about to push 100Mbps!”.
I’m really interested in how others have gone about solving problems like this. Leave me comments.