As you may have noticed, we’ve missed a few of our weekly updates. We’ve had a rather rough go of it lately, and it’s about time for the highlight reel. Two major events stick out in hindsight.
#1, No email for two days
First up on the plate was a major Zimbra outage. This was sparked by a RAID failure in one of our HP Storage Blades, but rapidly escalated into a data-loss situation due to what can only be described as poor backup planning.
The short version: backups were being made regularly, but were not being reliably shipped off the server. It took a lot of effort from IT and patience from our users (that is: all of Mozilla’s paid staff… thank you!) to get back on track.
Remarkably, this had relatively little effect on development or release cadence: it didn’t delay the Rapid Release train or close the tree, and newsgroups were unaffected. It may have caused some features to develop more slowly than they otherwise would have, but the community and the company pulled together to get through the situation with more ease than anyone could have expected.
Fortunately, things on that front are much better now: our Zimbra infrastructure has improved, and the backup strategy has changed so that the same failure mode is covered. We’re better off now than we were 6 months ago, and we have plans to be significantly better yet in another 6 months.
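The core lesson from that incident is simple: a backup doesn’t count until a copy exists somewhere else and has been verified. Here’s a minimal sketch of that idea (the paths and filenames are hypothetical, and this stands in for a remote mount rather than showing our actual tooling):

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def ship_and_verify(backup: Path, offsite_dir: Path) -> Path:
    """Copy a backup off-host (here: a directory standing in for a
    remote mount) and fail loudly unless the copy's checksum matches."""
    offsite_dir.mkdir(parents=True, exist_ok=True)
    dest = offsite_dir / backup.name
    shutil.copy2(backup, dest)
    if sha256(dest) != sha256(backup):
        raise RuntimeError(f"off-site copy of {backup.name} failed verification")
    return dest
```

The “fail loudly” part is the point: a backup job that quietly succeeds locally but never ships is exactly what bit us.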
#2, addons.mozilla.org & versioncheck.addons.mozilla.org
In mid-December we began having major performance issues with sites behind our Phoenix Zeus (load balancer) cluster. This cluster supports a number of production sites, including addons.mozilla.org and versioncheck.addons.mozilla.org.
The issue was caused by an unexpected and extended increase in traffic from Firefox 3.6 users upgrading to Firefox 8 (and the accompanying add-ons version checks).
This also highlighted two architectural design limitations that we were aware of but hadn’t expected to become a problem for some time.
This Zeus cluster sits behind a redundant pair of Juniper SRX firewalls. These protect the Zeus Linux hosts and provide a point to monitor for unwanted activity (IDS). Like any device, these firewalls are limited by the number of concurrent sessions and new sessions per second that they can handle. The additional traffic put us over the threshold, and they started to drop connections.
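To see why a session table fills up, Little’s law is all you need: concurrent sessions ≈ new sessions per second × average session lifetime. A back-of-envelope check (the numbers below are illustrative, not our firewalls’ actual specs):

```python
def concurrent_sessions(new_per_sec: float, avg_lifetime_sec: float) -> float:
    """Little's law: steady-state concurrent sessions in a firewall's
    state table, given arrival rate and average session lifetime."""
    return new_per_sec * avg_lifetime_sec


# Illustrative state-table capacity for a mid-range firewall pair.
TABLE_LIMIT = 1_000_000

# A traffic spike: 40k new sessions/s, each held open ~30 seconds.
load = concurrent_sessions(40_000, 30)  # 1,200,000 concurrent sessions
print(load > TABLE_LIMIT)               # True -- the table overflows
```

Once the table is full, the firewall has no choice but to drop new connections, which is exactly the behavior we saw.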
Moving Zeus out from behind the SRXes solved one problem but exposed another bottleneck: the traffic for versioncheck.addons.mozilla.org was overwhelming the 1GbE interfaces on the Zeus cluster, and we had to quickly spin up 10GbE Zeus nodes.
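The 1GbE ceiling is easy to reproduce on paper: a 1 Gb/s interface moves at most ~125 MB/s, so even small responses saturate it at a high enough request rate. These numbers are illustrative, not measured versioncheck traffic:

```python
def required_gbps(requests_per_sec: float, resp_bytes: float) -> float:
    """Bandwidth needed to serve a given request rate, in gigabits/s."""
    return requests_per_sec * resp_bytes * 8 / 1e9


# e.g. 50k version checks/s at ~4 KB per response:
print(round(required_gbps(50_000, 4_000), 2))  # 1.6 -- past a 1GbE NIC
```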
These are really just general scaling issues that we were going to have to deal with sooner or later. Unfortunately, we hit that level of scale sooner than we had planned for (many of these improvements were already slated for early 2012).
Among the fixes were:
- TCP stack tuning
- Upgrading to 10-gigabit Ethernet on the Zeus hosts
- Routing / firewall changes (including removing the hardware firewall and switching to iptables)
- Reducing our use of multicast VIPs, opting for multiple “normal” VIPs using DNS to send traffic to all load balancers
- Segregating backend traffic onto separate load balancers in a different LB cluster
- Sending traffic to other datacenters (notably: versioncheck.addons.mozilla.org, which is highly cache-able)
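For the TCP stack tuning, the usual knobs on a Linux load balancer live in /etc/sysctl.conf. The values below illustrate the kind of tuning involved; they are generic examples, not our production settings:

```
# /etc/sysctl.conf (illustrative values only)
net.ipv4.ip_local_port_range = 1024 65535   # more ephemeral ports for proxying
net.ipv4.tcp_fin_timeout = 15               # recycle FIN-WAIT sockets faster
net.ipv4.tcp_max_syn_backlog = 65536        # absorb bursts of new connections
net.core.somaxconn = 16384                  # larger accept() queue
net.core.netdev_max_backlog = 250000        # deeper NIC->kernel backlog for 10GbE
```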
There’s more work still to come on this. We are currently experimenting with ScaleArc iDB, a database/SQL-aware load balancer, which would theoretically give us database query caching and query distribution. We presently do this load balancing through our main Zeus clusters, but they don’t understand any database-specific protocols; it’s just simple TCP-based proxying. A protocol-aware load balancer should provide the same performance benefits for database queries that an HTTP-caching LB does for web content.
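The idea behind a SQL-aware load balancer reduces to a toy: inspect each statement, serve repeated reads from a cache, and always pass writes through (invalidating as you go). This is a sketch of the concept only, not how ScaleArc iDB actually works:

```python
class QueryCachingProxy:
    """Toy SQL-aware proxy: caches SELECT results, forwards writes.

    `backend` is any callable that executes a query and returns rows.
    A real implementation would also handle per-table invalidation,
    TTLs, prepared statements, and read/write splitting to replicas.
    """

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def execute(self, query: str):
        q = query.strip()
        if q.lower().startswith("select"):
            if q not in self.cache:        # cache miss: hit the database
                self.cache[q] = self.backend(q)
            return self.cache[q]
        self.cache.clear()                 # crude invalidation on any write
        return self.backend(q)
```

The point is that understanding the protocol is what makes caching possible at all; a plain TCP proxy sees only opaque bytes and can do neither caching nor query-level distribution.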
So, what’s the long term fix?
These events have drastically altered our approach to certain parts of our infrastructure. We are very excited to be working on what we are loosely dubbing the “Hyper-Critical Infrastructure” cluster: a completely standalone VMware / NetApp installation designed to be as autonomous as we can reasonably make it. For starters this will house Zimbra & LDAP. Sometime after that we will also likely migrate Mana (our internal documentation system, based on Confluence) and intranet.mozilla.org. Other extremely critical apps are also fair game.
Specific to Zimbra, we’re taking the time to make it as scalable and reliable as we need it to be. VMware High Availability and NetApp storage free us from most kinds of hardware failures, and Zimbra internally supports sharding across multiple servers for scale. We know Zimbra can do the job (it’s used by organizations much, much bigger than Mozilla), and this architecture will get us there.
Zeus Load Balancers
We’re continuing to replace the 1GbE Zeus cluster (HP BL460c) with 10GbE servers (HP DL360 G7) and are working on sharding our traffic into separate Zeus clusters where it makes sense to do so. We are also looking into separating database traffic away from Zeus altogether and onto a protocol-aware load balancer that can cache results.
On a higher level, we’re rolling out services like Cedexis to augment our geo- and performance-based global load balancing. This replaces another external service (3crowd), the discontinued “Zeus GLB” app, and the newer “Zeus Multi-Site Manager” app. It gives us a unified, convenient, and high-performing way to “front” Zeus and distribute traffic efficiently between multiple Zeus clusters.
It wasn’t all bad
Of course a number of good things happened in December as well:
- BrowserID went to production
- Ramped up Tegra capacity for Native UI & Android UI testing for Firefox Mobile
- Our Inventory system got a nice overhaul and a migration to a new cluster
- Lots of CDN work, including SSL CDN trials (Akamai, Highwinds) and Cedexis experimentation / implementation
- Turn-up of a new 9-cabinet module in our PHX1 datacenter (fortunate, since it helped significantly with the load balancer issues)
- A new ESX cluster in PHX1 (apart from the Hyper-Critical one above, which happened later)
- Dozens of web content pushes – these are so reliable now we don’t even announce them anymore
- BrowserID & LDAP integration work for the Mozilla Community Directory
It can be easy to forget about the wins, because in general you’re winning whenever something isn’t broken!
We’re hoping to get back on track with more frequent updates… look for more recent info very soon!