December 2011 in IT

jakem


As you may have noticed, we’ve missed a few of our weekly updates. We’ve had a rather rough go of it lately, and it’s about time for the highlight reel. Two major events stick out in hindsight.

#1, No email for two days

First up on the plate was a major Zimbra outage. This was sparked by a RAID failure in one of our HP Storage Blades, but rapidly escalated into a data-loss situation due to what can only be described as poor backup planning.

The short version: backups were being made regularly, but were not being reliably shipped off of the server. It took a lot of effort from IT and patience from our users (that is: all of Mozilla’s paid staff… thank you!) to get back on track.
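
For the curious, here is a rough sketch of the kind of off-host shipping and verification that was missing. The paths, remote host, and checksum scheme below are hypothetical, not our actual backup tooling:

    #!/usr/bin/env python
    """Ship the latest Zimbra backup off-host and verify it arrived intact.

    Illustrative sketch only: paths, hostnames, and scheme are made up.
    """
    import hashlib
    import subprocess
    import sys

    BACKUP = "/opt/zimbra/backup/latest.tar.gz"        # hypothetical local backup
    REMOTE = "backuphost.example.com"                   # hypothetical off-site host
    REMOTE_PATH = "/srv/backups/zimbra/latest.tar.gz"

    def sha256(path):
        """Checksum a local file in chunks so large backups don't eat RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def main():
        local_sum = sha256(BACKUP)

        # Copy the backup off the mail server itself.
        subprocess.check_call(["rsync", "-a", "--partial", BACKUP,
                               "%s:%s" % (REMOTE, REMOTE_PATH)])

        # Re-checksum on the remote end; a transfer that silently never happens
        # is exactly the failure mode that bit us.
        remote_out = subprocess.check_output(["ssh", REMOTE, "sha256sum", REMOTE_PATH])
        if remote_out.split()[0].decode() != local_sum:
            sys.exit("backup checksum mismatch -- do not trust this copy")
        print("backup shipped and verified: %s" % local_sum)

    if __name__ == "__main__":
        main()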

Remarkably, this had relatively minimal effect on development or release cadence: it didn’t delay the Rapid Release train or close the tree, and newsgroups were unaffected. It may have caused some features to develop more slowly than they otherwise would have, but the community and the company pulled together to get through the situation with more ease than anyone could have expected.

Fortunately, things on that front are much better now: our Zimbra infrastructure has improved, and the backup strategy has changed so that the same kind of failure is largely mitigated. We’re better off now than we were 6 months ago, and we have plans to be significantly better yet in another 6 months.

#2, addons.mozilla.org & versioncheck.addons.mozilla.org

In mid-December we began having major performance issues with sites behind our Phoenix Zeus (load balancer) cluster. This cluster supports a number of production sites, including addons.mozilla.org and versioncheck.addons.mozilla.org.

The issue was caused by an unexpected and extended increase in traffic from Firefox 3.6 users upgrading to Firefox 8 (and the add-on version checks that accompany an upgrade).

This also highlighted two architectural design limitations that we were aware of but had not expected to become a problem for some time.

This Zeus cluster sits behind a redundant pair of Juniper SRX firewalls. These protect the Zeus Linux hosts and provide a point to monitor for unwanted activity (IDS). Like any device, these firewalls are limited by the number of concurrent sessions and new sessions per second that they can handle. The additional traffic put us over the threshold, and they started to drop connections.
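
To see why a traffic bump can tip a stateful firewall over, it helps to run the numbers. The figures below are invented purely for illustration, but the arithmetic is the point: concurrent sessions are roughly new sessions per second multiplied by average session lifetime, and the session table has a hard ceiling.

    # Back-of-envelope session-table math, with made-up numbers.
    new_sessions_per_sec = 40000    # hypothetical post-release check-in rate
    avg_session_lifetime = 30.0     # seconds a flow lingers in the state table
    session_table_limit = 1000000   # hypothetical concurrent-session ceiling

    concurrent = new_sessions_per_sec * avg_session_lifetime
    print("estimated concurrent sessions: %d" % concurrent)    # 1,200,000
    print("over the ceiling" if concurrent > session_table_limit else "within limits")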

Moving Zeus out from behind the SRXs solved one problem while exposing another bottleneck. This time we learned that the traffic for versioncheck.addons.mozilla.org was overwhelming the 1GbE interfaces on the Zeus cluster, and we had to quickly spin up 10GbE Zeus nodes.
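
A similar napkin calculation shows how quickly even modest responses saturate a 1GbE interface once the request rate climbs. Again, the numbers are illustrative, not our real traffic figures:

    # Rough, illustrative math for why 1GbE stopped being enough.
    requests_per_sec = 30000          # hypothetical versioncheck request rate
    avg_response_bytes = 6 * 1024     # hypothetical average response size
    gbits_needed = requests_per_sec * avg_response_bytes * 8 / 1e9

    print("egress needed: %.2f Gbit/s" % gbits_needed)   # ~1.47 Gbit/s, over a 1GbE link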

These are really just general scaling issues that we were going to need to deal with sooner or later. Unfortunately, we hit that level of scale sooner than we had planned; many of these improvements were things we had intended to do in early 2012.

Among the fixes were:

  • TCP stack tuning
  • Upgrading to 10-gigabit Ethernet on the Zeus hosts
  • Routing / firewall changes (including removing the hardware firewall and switching to iptables)
  • Reducing our use of multicast VIPs, opting for multiple “normal” VIPs using DNS to send traffic to all load balancers (see the short sketch after this list)
  • Segregating backend traffic onto separate load balancers in a different LB cluster
  • Sending traffic to other datacenters (notably: versioncheck.addons.mozilla.org, which is highly cache-able)
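
As a quick illustration of the multiple-VIP item above: instead of a single multicast VIP, the service name simply resolves to several A records, one per load balancer, and DNS spreads clients across them. The snippet below just lists those records; the real VIP layout behind that hostname is of course subject to change.

    import socket

    # With multiple "normal" VIPs the service name resolves to several A records,
    # one per load balancer, and DNS hands different clients different answers.
    HOST = "versioncheck.addons.mozilla.org"

    infos = socket.getaddrinfo(HOST, 443, socket.AF_INET, socket.SOCK_STREAM)
    for addr in sorted({info[4][0] for info in infos}):
        print("%s -> %s" % (HOST, addr))    # each address is (roughly) one LB VIP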

There’s more work still to come on this. We are currently experimenting with ScaleArc iDB, a database/SQL-aware load balancer, which would theoretically give us database query caching and query distribution. We presently do such load balancing through our main Zeus clusters, but they don’t understand any database-specific protocols… it’s just simple TCP-based proxying. A protocol-aware load balancer should provide the same performance benefits for database queries that an HTTP-caching load balancer does for web content.
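
To make the caching idea concrete, here is a toy sketch of what a SQL-aware tier buys you. This is not ScaleArc’s actual behavior or API; the run_query function below is a hypothetical stand-in for the real database round trip.

    import hashlib
    import time

    # Toy query-result cache: a SQL-aware layer can answer repeated read queries
    # from memory instead of sending each one to the database.
    CACHE = {}     # normalized query -> (expiry time, rows)
    TTL = 30.0     # seconds to keep a cached result

    def run_query(sql):
        """Hypothetical stand-in for the real database call."""
        time.sleep(0.05)                     # pretend this round trip is expensive
        return [("row for", sql)]

    def cached_query(sql):
        key = hashlib.sha1(" ".join(sql.split()).lower().encode()).hexdigest()
        now = time.time()
        hit = CACHE.get(key)
        if hit and hit[0] > now:
            return hit[1]                    # served from cache, no database round trip
        rows = run_query(sql)
        if sql.lstrip().lower().startswith("select"):
            CACHE[key] = (now + TTL, rows)   # only cache reads; writes always go through
        return rows

    if __name__ == "__main__":
        cached_query("SELECT guid FROM addons WHERE id = 1")    # miss: hits the database
        cached_query("select guid  from addons where id = 1")   # hit: same normalized query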

So, what’s the long term fix?

These events have drastically altered our approach to certain parts of our infrastructure. We are very excited to be working on what we are loosely dubbing the “Hyper-Critical Infrastructure” cluster: a completely standalone VMware / NetApp installation designed to be as autonomous as we can reasonably make it. For starters this will house Zimbra & LDAP. Sometime after that we will also likely migrate Mana – our internal documentation system based on Confluence – and intranet.mozilla.org. Other extremely critical apps are also fair game.

Zimbra

Specific to Zimbra, we’re spending some time making sure it is as scalable and reliable as we need it to be. VMware High Availability and NetApp storage get us freedom from most kinds of hardware failures, and Zimbra internally supports sharding across multiple servers for scale. We know Zimbra can do the job (it’s used by organizations much, much bigger than Mozilla), and this architecture will get us there.

Zeus Load Balancers

We’re continuing to replace the 1GbE Zeus cluster (HP BL460c) with 10GbE servers (HP DL360 G7) and are working on sharding our traffic into separate Zeus clusters where it makes sense to do so. We are also looking into separating database traffic away from Zeus altogether and onto a protocol-aware load balancer that can cache results.

On a higher level, we’re rolling out services like Cedexis to augment our geo- and performance-based global load balancing. This replaces another external service (3crowd), the discontinued “Zeus GLB” app, and the newer “Zeus Multi-Site Manager” app. It gives us a unified, convenient, and high-performing way to “front” Zeus and distribute traffic efficiently between multiple Zeus clusters.
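
For the performance-based half of that, the underlying idea is simple: measure how responsive each cluster is from the client’s vantage point and send traffic to the best one. Cedexis does this with real-user measurements and far more intelligence; the hostnames below are placeholders for illustration.

    import socket
    import time

    # Crude performance-based global load balancing: time a TCP connect to each
    # candidate cluster and pick the fastest. Placeholders stand in for real VIPs.
    CLUSTERS = {
        "phx1": ("phx1.example.com", 443),
        "sjc1": ("sjc1.example.com", 443),
    }

    def connect_time(host, port, timeout=2.0):
        start = time.time()
        try:
            socket.create_connection((host, port), timeout).close()
        except OSError:
            return float("inf")              # unreachable clusters never win
        return time.time() - start

    print("send this client to:", min(CLUSTERS, key=lambda c: connect_time(*CLUSTERS[c])))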

It wasn’t all bad

Of course a number of good things happened in December as well:

  • BrowserID went to production
  • Ramped up Tegra capacity for Native UI & Android UI testing for Firefox Mobile
  • Our Inventory system got a nice overhaul and a migration to a new cluster
  • Lots of CDN work, including SSL CDN trials (Akamai, Highwinds) and Cedexis experimentation / implementation
  • Turn-up of a new 9-cabinet module in our PHX1 datacenter (fortunate, since it helped significantly with the load balancer issues)
  • A new ESX cluster in PHX1 (apart from the Hyper-Critical one above, which happened later)
  • Dozens of web content pushes – these are so reliable now we don’t even announce them anymore
  • BrowserID & LDAP integration work for the Mozilla Community Directory

It can be easy to forget about the wins, because in general you’re winning whenever something isn’t broken!

We’re hoping to get back on track with more frequent updates… look for more recent info very soon!

Jake

5 responses

  1. Jim Hopp wrote on :

    Thanks for the forthright explanation of the incidents.

    Could you expand (perhaps in a separate blog posting) about your experience with iptables on a web front-end? I’ve felt that tuned Linux boxes running iptables could serve as a web frontend without the need for proprietary hardware firewalls and load-balancers, but real data from a large site would be helpful.

    Thanks.

    1. jakem wrote on :

      Sure! I can’t speak for our NetEng or InfraSec folks, but I can tell you my thoughts as a sysadmin. Personally, I’m a pretty big fan of Linux+iptables as opposed to network appliance firewalls. There are lots of tuning options (including hardware upgrades), documentation is very good, finding people with iptables experience is comparatively easy, and it can generally do all sorts of crazy packet mangling jobs. There are lots of more advanced iptables modules that most folks never use: time-of-day matching, rate-limiting, quotas, rudimentary load balancing, string-payload matching… crazy.

      From a sysadmin perspective, it can be difficult to get a handle on how much load your appliance-firewall is really under. You see how much traffic goes in and how much comes out, but I don’t think you generally have the toolset for dealing with performance problems that any Linux system would have. With iptables, you get all the standard tools that come with any Linux system.

      For another thing, typically (at least in bigger companies) network firewalls are managed by another group of folks – netops or infrasec – and this can cause delays and finger-pointing when you need to get something done or when there’s a problem. If I can control all aspects of a problem, I can more easily determine where the problem lies, and what the “best” overall fix is. It also means I can be self-sufficient: I don’t have to submit a bug/ticket and wait a while for someone to handle it for me… I can just do it and get on with the ultimate goal. The flip side, of course, is that with sysadmin-managed firewall rules you may end up with things that those other departments have good reasons for not wanting… you lose some expertise in exchange for the convenience of a jack-of-all-trades approach.

      There is some benefit to a centralized firewall, but I’m not at all convinced that it needs to be a proprietary / commercial black-box appliance. I think Linux+iptables (or a similar *BSD setup) is something a lot more folks should consider before spending lots of cash on something proprietary. That doesn’t mean you should be afraid of spending money, but it should always circle back to “what do I get out of this that I don’t get with commodity hardware + Linux + iptables?”

      1. jakem wrote on :

        … in retrospect, you’re right that this is probably worth a separate blog post all on its own. :)

  2. Stefania Castelli wrote on :

    Hello,
    I came across your post while looking for info about a Web Proxy service I discovered on a machine of mine, which (I still have to check more carefully) seems to be active on my LAN and possibly my VPN-LAN.
    Before Nmapping everything like a crazy monkey (and injecting Valium), I tried to use Fiddler + Ammonite to get some more info about this strange presence.
    I noticed it because I found a local IP address in Process Hacker that I never personally configured, and telnetting to it gave me a positive response on the HTTP port.
    Analyzing the headers of the HTTP transaction, I found out that it was a Web Proxy 5.2, and Fiddler then told me that this address (192.168.201.120) was nothing but a CONNECT tunnel to versioncheck.addons.mozilla.org:443, and that its only physical folder was /favicon.

    I should investigate more, but I suppose this is part of the Nightly Release “reporting feedback” I subscribed to after installing the Nightly Alpha.

    Is this correct?

    If yes, since I didn’t find any doc/wiki etc. illustrating this, it would be important to document it, to keep sysadmins serene and peaceful and save them from tilting at windmills… ;-)

    Of course, if Mozilla is not involved, please drop a brief note so I can start my battle against malware….

    All the best

    Stefania

    1. jakem wrote on :

      Firefox does make connections to versioncheck.addons.mozilla.org, but does not set up any sort of separate proxy or program for this… the connections come from Firefox itself, like any normal web connections (except of course they happen in the background).

      It sounds like you’ve got something else going on, and a Firefox instance somewhere just happened to be sending traffic through it. Best of luck tracking it down!