Global Load Balancing at Mozilla

jakem

What is Global Load Balancing?

Global Load Balancing is the act of directing traffic to multiple locations over a wide area. It differs from normal load balancing in that the nodes you’re directing traffic to are not “local”… you can’t count on the link between the load balancer and the node being LAN-speed.

The main purpose of spreading your servers out is to avoid localized failures: datacenter outages, natural disasters, etc. Properly designed, there aren’t many external factors that can take down a site hosted in multiple datacenters. This is where global load balancing comes into play.

In order to make this work, your load balancer itself obviously ought to be available in multiple datacenters- you don’t have much redundancy if your global load balancer is in only one location! But this immediately creates another instance of the same problem… how do you route traffic to your global load balancers?

Luckily, the DNS protocol handles this type of thing for you- you specify multiple nameservers, and the client’s resolver will (or is supposed to) query one of them, and fail over to another if that one doesn’t respond. Consequently, most global load balancers are DNS services.
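To make the failover idea concrete, here is a minimal sketch of what a resolver is supposed to do when one nameserver is unreachable. It is a simulation only: the nameserver names, the record, and the `fake_query` helper are all made up for illustration, and no real DNS traffic is sent.

```python
from typing import Callable, Optional, Sequence


class NameserverDown(Exception):
    """Raised when a (simulated) nameserver does not answer."""


def resolve_with_failover(
    name: str,
    nameservers: Sequence[str],
    query: Callable[[str, str], str],
) -> Optional[str]:
    """Ask each nameserver in turn; return the first answer, or None."""
    for ns in nameservers:
        try:
            return query(ns, name)
        except NameserverDown:
            continue  # this nameserver is unreachable, try the next one
    return None


# Hypothetical state: ns1 sits in a datacenter that is currently offline.
DOWN = {"ns1.example.net"}
RECORDS = {"www.example.org": "192.0.2.20"}


def fake_query(ns: str, name: str) -> str:
    """Stand-in for a real DNS query, so the sketch runs without a network."""
    if ns in DOWN:
        raise NameserverDown(ns)
    return RECORDS[name]


answer = resolve_with_failover(
    "www.example.org", ["ns1.example.net", "ns2.example.net"], fake_query
)
print(answer)
```

Real resolvers add timeouts, retries, and caching on top of this, which is part of why the "is supposed to" caveat matters.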

This is very different from most “normal” load balancers, which are usually either proxies (layer 7) or NAT/IP-mangling devices (layer 4). In some scenarios either solution (LB or GLB) will work… but typically, you will have a global load balancer that directs traffic to multiple normal load balancers, ideally in different datacenters. The normal load balancers will in turn direct traffic to a local cluster of servers in the same datacenter.
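The two-tier layout described above can be sketched roughly as follows. This is an illustration under assumed names, not a description of any particular product: the datacenter names, the health flags, and the round-robin policy are all hypothetical.

```python
import itertools
from typing import Dict, List


class LocalLoadBalancer:
    """A "normal" load balancer: spreads requests over servers in one datacenter."""

    def __init__(self, servers: List[str]) -> None:
        self._cycle = itertools.cycle(servers)

    def pick_server(self) -> str:
        # Simple round-robin; real load balancers offer many other policies.
        return next(self._cycle)


def glb_pick_datacenter(health: Dict[str, bool], preferred_order: List[str]) -> str:
    """The global tier: return the first healthy datacenter in preference order."""
    for name in preferred_order:
        if health.get(name, False):
            return name
    raise RuntimeError("no healthy datacenter available")


# Hypothetical topology: one local load balancer per datacenter.
local_lbs = {
    "dc-west": LocalLoadBalancer(["west-1", "west-2"]),
    "dc-east": LocalLoadBalancer(["east-1", "east-2"]),
}
health = {"dc-west": False, "dc-east": True}  # pretend dc-west is down

datacenter = glb_pick_datacenter(health, ["dc-west", "dc-east"])
server = local_lbs[datacenter].pick_server()
print(datacenter, server)
```

The global tier only needs to know whether each location is usable; everything about individual servers stays inside that location's own load balancer.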

Mozilla’s Current GLB Solutions

At Mozilla we’ve experimented with many different solutions for global load balancing. Several solutions have “stuck” over the years… and once they stick, they tend to hang around for quite a while. Consequently, as of March 1 we actually have a total of 5 GLB solutions actively in use, with a 6th under consideration!

  • Netscaler GSLB
    • Built-in functionality of some old Citrix Netscaler appliances that we are phasing out.
    • Removed this week! This handled various web traffic… notably, blog.mozilla.org used this.
      • Most things became non-GLB properties, because only one node/location was actually functional anyway. :)
  • geodns
    • This manages releases.mozilla.org.
    • It’s somewhat strongly embedded because it handles which mirrors are actively in use, depending on which ones are up-to-date. Switching means rewriting this logic on top of another platform.
    • Discussions are ongoing as to how we can replace this, or at least move it out of SJC1 during our datacenter migration.
  • Zeus GLB
    • This manages a few websites… notably, bugzilla.mozilla.org uses this to determine which datacenter is “active” and which is passive.
    • This is actually an end-of-life product. Its direct replacement is Zeus Multi-Site-Manager, but an upgrade is non-trivial in our case, and we’ve ultimately decided to migrate away from this entirely.
      • Migrations are largely moving to Cedexis, or becoming non-GLB services.
  • 3crowd CrowdDirector
    • This manages 2 things currently: releases-rsync.mozilla.org and irc.mozilla.org.
    • This is a 3rd party service – we delegate certain names over to them, and then use their interface/software to set rules on when it should return which records.
  • Cedexis Openmix
    • This manages most of our multi-hosted websites now.
    • This is also a 3rd party service, similar to 3crowd. The main differences are:
      • 3crowd gets delegations, via NS records… Openmix gets CNAME records. This makes 3crowd a bit more complicated but also a bit more flexible.
      • Openmix is more complicated and more flexible in terms of how routing decisions are actually made- you define your own script. Consequently it can make decisions based on a wide variety of criteria.
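As a rough illustration of what "you define your own script" can mean, here is a sketch of a routing decision that picks a CNAME target based on availability and latency. To be clear, this is not the Openmix API: the function name, the data shape, the threshold, and the hostnames are all assumptions made up for this example.

```python
from typing import Dict, List

MIN_AVAILABILITY = 90.0  # percent; hypothetical cutoff for "usable"


def choose_cname(platforms: List[Dict], fallback: str) -> str:
    """Return the CNAME of the fastest platform that is available enough."""
    usable = [p for p in platforms if p["availability"] >= MIN_AVAILABILITY]
    if not usable:
        # Nothing looks healthy: hand back a fixed default rather than nothing.
        return fallback
    return min(usable, key=lambda p: p["rtt_ms"])["cname"]


# Hypothetical measurements for three hosting locations.
platforms = [
    {"cname": "dc-west.example.org", "availability": 99.5, "rtt_ms": 80.0},
    {"cname": "dc-east.example.org", "availability": 99.9, "rtt_ms": 35.0},
    {"cname": "dc-eu.example.org", "availability": 40.0, "rtt_ms": 10.0},
]

target = choose_cname(platforms, fallback="dc-west.example.org")
print(target)
```

Note that the lowest-latency location is skipped here because its availability is below the cutoff. Being able to combine criteria like this is the kind of flexibility a scripted decision gives you.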

Other Solutions Considered

Other solutions considered-and-rejected or actively under consideration:

  • Zeus Multi-Site-Manager
    • Replacement for Zeus GLB… integrated with normal Zeus Traffic Manager appliance.
    • This was rejected as too complicated- by tying together GLB and normal load balancing, it actually became rather confusing trying to maintain a large-ish installation. Ultimately it was easier to keep the two layers separate.
  • Dynect Managed DNS
    • This is another 3rd-party service. We actually used this for all of our DNS management in the past. As we grew it became financially infeasible, and we brought it all in-house. We are now considering sending certain things back to them, specifically for the GLB functionality.
    • The main benefit of Dynect vs Cedexis or 3crowd is that they are a full DNS management service. Neither Cedexis nor 3crowd is quite as feature-complete in cases where you want to host a full domain on them… they’re more aligned with handling certain individual records.

Moving Forward

As you can see, we’ve tried a lot of things on this front. Each system has its own benefits and drawbacks… however, there is a lot of overlap in functionality. We’re in the process of consolidating down to fewer systems. Specifically,

  • Netscaler GSLB is eliminated as of this week!
  • Zeus GLB will be eliminated, likely in Q2/Q3. Migrations will be to Cedexis, non-GLB services, and/or possibly 3crowd.
  • geodns may be eliminated, if we can effectively replace it with Dynect, 3crowd, or Cedexis. Time frame on this is undetermined.

This will leave us with 3 GLB services: Cedexis, 3crowd, and (presuming it passes trials) Dynect. Not quite ideal, but each has its unique strong points that we’re not quite willing to give up just yet. It’s certainly an improvement in any case. In time perhaps we can condense even further…