Mozilla has had a long and storied history with using Zimbra as an email server. As much as I personally hate Java (most of the guts of Zimbra are written in Java), Zimbra really does its job well, and it turns out to be really scalable, if you take the time and acquire the equipment to scale it, anyway.
I tend to ramble a bit in storytelling mode, so this post is a little long. I’ll try to keep it entertaining though. This post is split into three sections. The first section is background about where we came from. The second section is a high-level overview of what we’ve just implemented, and the third is about the stuff we’re about to do. The latter two are probably the more interesting parts if you don’t care about history.
Where we’ve been
Mozilla has always had a specific sysadmin whose job was to maintain our Zimbra installation (among other things he also did). Our problem at the time was that he was the only one who knew it. Then one day he left. At that point, things were going well. Zimbra mostly took care of itself, and rarely had any major problems. We had a master server in San Jose that did most of the work, and a local server in the Beijing office that hosted mailboxes for the users in that office to keep their response times low. This all worked pretty well at the time, but the company was growing. Rapidly. Eventually the number of users we had outgrew the hardware, and we started having performance issues frequently. Corey and Jabba made a heroic undertaking to swap the server onto new, more powerful hardware. That worked, for a while. And then the issues cropped up again. A single IMAP user deleting 40,000 messages at once was enough to make the server unusable. That wasn’t a good sign.
I was sent off to take Zimbra’s Sysadmin training class. We needed to learn how to do this better. After all, Comcast hosts more than 10,000 users on their Zimbra installation, so why were we having so many problems with 850 users? And learn I did. My number one takeaway from the class: Zimbra has many internal components. And they don’t all have to be hosted on the same box. Not the only thing I learned, of course, just the most important thing. This was our number one issue with the way our installation was set up. When you get to this many users (especially when most of them use IMAP), it’s better to separate the components out and run them on separate servers to make better use of resources. Some of the components can even be run in multiple instances to spread the load out across servers.
We began to make plans for scaling out Zimbra onto multiple servers. Three new blade servers were purchased for our Phoenix data center with the intention of splitting the users across them. Fate intervened. A few days after the servers arrived (we barely had time to rack them), our existing server in the San Jose datacenter had a failed hard drive on its disk array. Disk arrays are made to withstand this (that’s what RAID is for) so this wasn’t a big deal, yet. When the failed disk was replaced, three more drives failed simultaneously. Ouch. Now we had a problem. RAID6 (we know, bad choice) with one hot spare can only handle 3 drive failures in total, and then only if the first one has rebuilt onto the spare before the next ones fail. Three at once without the spare rebuilt yet meant the entire array had failed. It was then that we discovered some shortcuts that had been made in setting up the server. The backups were being made to the local disk. Yes, the same one that failed. Oops. They had been written off to tape (and the tapes were still safe) but only after three weeks. So our most-recent usable backup was actually almost a month old. Mozilla as a company spent three days with no email while we waited for a tape restore of 750 GB of data followed by importing that data into a freshly re-installed Zimbra on a new server. At this point everyone was back up and running, but with a 3-week gap in their email.
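To make the failure arithmetic concrete, here is a minimal Python sketch of the rule described above. It is a simplified model, not how any real RAID controller works: it assumes the array dies as soon as more than two member drives are missing at once (RAID6's two-drive tolerance) and that a finished rebuild onto the hot spare brings one missing drive back. The function name and the event strings are made up for illustration.

```python
def array_survives(events, parity=2, spares=1):
    """Replay a sequence of 'fail' / 'rebuilt' events against a toy RAID model.

    parity: how many drives may be missing at once (2 for RAID6).
    spares: how many hot spares are available to rebuild onto.
    Returns True if the array is still readable after all events.
    """
    missing = 0          # drives currently absent from the array
    spares_left = spares
    for event in events:
        if event == "fail":
            missing += 1
            if missing > parity:
                return False  # more drives gone than parity can cover
        elif event == "rebuilt":
            # A completed rebuild onto a hot spare restores one missing drive.
            if missing > 0 and spares_left > 0:
                missing -= 1
                spares_left -= 1
        else:
            raise ValueError(f"unknown event: {event!r}")
    return True


# Best case: the first failure finishes rebuilding before two more drives fail.
print(array_survives(["fail", "rebuilt", "fail", "fail"]))

# Roughly what happened to us: one failure, then three more before any rebuild finished.
print(array_survives(["fail", "fail", "fail", "fail"]))
```

The model ignores details such as the replacement disk itself becoming a new spare, but it shows why the timing of the rebuild mattered more than the raw number of failed drives.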
During the two months that followed that array failure (it took that long to get beyond the data recovery) I learned more about the inner workings of Zimbra than I think I ever wanted to know. We were fortunate to have experts in both OpenLDAP and MySQL on our IT staff, who were an invaluable help in salvaging data from the partially damaged databases on the failed array. One of our developer tools engineers (thanks Joel!) managed to write us a script that would crawl the damaged disk array looking for any of Zimbra’s mail storage “blobs” that belonged to a given user, and re-inject them into that user’s account (an intact email in a recovery folder is better than not having it at all in the original folder it should have been in). We discovered that some of the individual users (but not all of them) had intact backups in the most recent backup set on the partially damaged array. We restored backups for the users who had them intact, and used imapsync to move the messages back into their live accounts. All in all it was quite an adventure, and given the extent of the damage to the disk array and our failure to put the backups in a good home, we really came out lucky in how much we did manage to salvage.
The first thing we did after getting the data back online was to attach network storage for the backups to be saved to. Even the “local” daily backups would be on a redundant network array so that a disk failure would no longer eat the backups with it.
Here’s a look at where we got to by this point in the story:
The text in the image, in case you can’t read it:
How things were, As of 15 January 2012
Two servers. San Jose is the primary server, hosting all Zimbra components and almost all of the accounts. The Beijing server serves only as local mailbox storage and SMTP for users in the Beijing office.
Two weeks’ worth of backups are stored on a NetApp filer, and then written to tape from there for long term storage.
Where we are now
One of the good things that came out of that disaster was that many important people realized that working email was really important to keeping our company operating, and that keeping email up and running should really be a high priority. A new sub-department was formed within IT to manage Infrastructure issues, with email, phones, and high-priority background services like DNS and LDAP in mind; things that would bring company operations to a halt if they died. An entirely separate block of network space behind its own routers was set up to hold the master servers for these functions, to keep them isolated from any interference that an overload of other services might cause. We brought in a Zimbra trainer to give several additional people the Zimbra Sysadmin training that I had previously taken, so that I wouldn’t be a single point of failure the way the previous guy was.
Back to Zimbra… A few years back, VMware purchased Zimbra from Yahoo!. Since then, VMware has put a lot of effort into making sure Zimbra runs really well within virtual machines. Some say this has come at the expense of other features they should have been improving instead, but VMs are what VMware does best, so it had to be expected. It turns out that some of the internal components within Zimbra don’t multi-thread well. This means putting the installation on a 12-core server (which we had done) doesn’t really help much because it won’t use them all. However, spreading the users out over multiple 2-core servers works *really* well. And with a VMware vSphere installation, spinning up lots of little VMs to spread the load across was pretty easy and performed really well, too.
With that in mind, along with the new resources available to us as part of the new “Hyper-Critical Infrastructure Network” (henceforth referenced as “HCI”), we started over on our expansion plan using a vSphere cluster and lots of VMs. We have 5 ESX hosts in this cluster (intended for all of the HCI stuff, not just Zimbra). Each of them has 24 CPU cores and 192 GB of RAM.
From the image:
How we are currently, As of 26 March 2012
The new Phoenix installation consists of a VMware ESX cluster, with each of the Zimbra components separated into their own VMs to allow scaling, components which can be configured redundantly are placed behind the load balancer, and multiple mailbox servers handle the number of user accounts we have.
San Jose remains the master for now, and new accounts are currently created on the standalone physical server in Phoenix, which was added to pick up load in mid-January before the new ESX cluster was ready.
We are now ready to start moving mailboxes off of the standalone servers in San Jose and Phoenix into the new cluster in Phoenix.
Since this image was created, we’ve moved all of the users off the standalone server in Phoenix into the new HCI cluster, and already decommissioned that server. We’ve also moved all of the users off the server in San Jose, though that server is still currently the master LDAP server as I type this.
The main priority with the new infrastructure is to make everything as redundant as possible, to reduce the amount of downtime users end up seeing. Having pairs of MTAs, LDAP Slaves, and proxy servers behind a load balancer means we can pull one of them out of service for maintenance without breaking the entire system for the users in that datacenter.
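As a rough illustration of why a pair behind a load balancer helps, here is a toy round-robin pool in Python. It is only a sketch of the idea, not our actual load balancer configuration; the class, its methods, and the host names `mta1` and `mta2` are all hypothetical.

```python
class Pool:
    """Toy round-robin pool whose members can be pulled out for maintenance."""

    def __init__(self, members):
        self.members = list(members)
        self.in_service = set(self.members)
        self._next = 0

    def drain(self, member):
        """Take a member out of rotation, refusing to remove the last one."""
        if self.in_service == {member}:
            raise RuntimeError("refusing to drain the last member in service")
        self.in_service.discard(member)

    def restore(self, member):
        """Put a member back into rotation."""
        if member in self.members:
            self.in_service.add(member)

    def pick(self):
        """Return the next in-service member, skipping drained ones."""
        for _ in range(len(self.members)):
            candidate = self.members[self._next % len(self.members)]
            self._next += 1
            if candidate in self.in_service:
                return candidate
        raise RuntimeError("no members in service")


mtas = Pool(["mta1", "mta2"])   # hypothetical host names
mtas.drain("mta1")              # mta1 goes down for maintenance
print([mtas.pick() for _ in range(3)])  # every request lands on mta2
mtas.restore("mta1")            # maintenance done, back to sharing the load
```

With only one server in the pool, the `drain` call would have nothing to fall back on, which is exactly the situation the paired setup avoids.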
Where we’re going
As mentioned above, we’ve already moved all of the users from San Jose to Phoenix. There was collectively 645 GB of user data on that server, and the move process, which started Tuesday morning, lasted until late morning Friday. We discovered, much to our delight, that the process of moving a user between servers is almost completely transparent to the user. The way the moves work leaves the user locked out for no more than 30 to 60 seconds while the source and destination servers do a final sync-up before making the account live on the destination server. This is less time than the user gets locked out for during the weekly backups every weekend.
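For a sense of scale, the average transfer rate of that move can be estimated with a little arithmetic. The 645 GB figure is from above; the elapsed time is an assumption, since "Tuesday morning" to "late morning Friday" is taken here as roughly 74 hours, and decimal units (1 GB = 1000 MB) are assumed as well.

```python
def average_rate_mb_per_s(total_gb, elapsed_hours):
    """Average throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return total_gb * 1000 / (elapsed_hours * 3600)


DATA_GB = 645        # user data moved, from the post
ELAPSED_HOURS = 74   # assumed: three days plus a couple of hours

rate = average_rate_mb_per_s(DATA_GB, ELAPSED_HOURS)
print(f"roughly {rate:.1f} MB/s on average")
```

That works out to only a few MB/s, which suggests the pace was set by the per-mailbox move process rather than by raw network bandwidth, though that reading is a guess and not something we measured.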
We will need to take a downtime at some point in order to promote the primary LDAP slave in Phoenix to be the new master, change the mail.mozilla.com DNS pointer to point at the Phoenix proxy, and reconfigure all of the servers to send their logs to the Phoenix log server instead of the San Jose one. That may happen this weekend or early next week.
The Beijing data center is also going to get the VM treatment, like we described above, though on a much smaller scale. A lack of power in the data center there is preventing us from having as much redundancy as we’d like though.
Here’s what we’re intending for this all to look like once we’re done moving everything around:
From the image:
Future Plan, Target Date Late April 2012
The Phoenix and Beijing installations both consist of VMware ESX Clusters (single ESX server in Beijing), with each of the Zimbra components separated into their own VMs to allow scaling. In Phoenix, components which can be configured redundantly are placed behind the load balancer, and multiple mailbox servers are present to handle the number of user accounts we have.
The primary LDAP slave in Phoenix has been promoted to Master, and Central Logging now happens on the primary proxy server in Phoenix. The mail.mozilla.com domain name now points at the proxy server in Phoenix.
The VMs are being configured as I type for the new Beijing cluster, and all signs are showing we’ll probably have that live by early this coming week also.
WordPress is telling me I’ve passed the 2,000-word mark here, so this did end up being kind of wordy. I apologize, and hope it was entertaining anyway.