This weekend, I rolled out a new LDAP infrastructure. Here are the details:
At Mozilla we depend heavily on our OpenLDAP-based authentication system. As we’ve grown quite a bit over the past year or so, it became apparent that our LDAP ecosystem wasn’t scaling accordingly. Until now, we’ve relied mostly on a single master server with a few slaves, all behind an aging load balancer that has given us trouble in the past. Most of the slaves were no longer in the pool, as their configurations had drifted over the years, and a lack of documentation and consistency made it hard to add capacity, make changes, or even ensure high availability in the event of a hardware failure. All the LDAP servers were actually machines primarily dedicated to other services, so extra load on those services had the side effect of making the LDAP service unreliable. As we added a few satellite offices and an extra datacenter that all relied on having some sort of authentication system in sync with our primary LDAP server in our San Jose data center, we decided to redo this setup with something a little more scalable and with better configuration management.
Over the past 6 months or so, I’ve done quite a bit of research learning how OpenLDAP works, how it behaves, how it scales and, most of all, how it is set up and used at Mozilla. Understanding the current setup was key to designing a better architecture. The first thing to do was to gather all the information, the configurations of all the existing LDAP servers, and merge those into a centralized configuration management tool. As we’ve been using puppet, I wrote a module to manage our OpenLDAP infrastructure. This turned out to be a more difficult task than anticipated, as it had to support managing SSL certificates, a master server, slave servers, and intermediary servers (servers that act as a master to other slaves, but are slaves themselves, replicating from the main master), all of which need slightly different configuration directives but an overall similar configuration that stays in sync. Furthermore, we have some applications that act as frontend management tools for LDAP. Some are used by our team to provision new user accounts, maintain access control groups, etc., but there is also our phonebook directory app, which allows users to edit their own LDAP entries through a web interface. All these things needed to be taken into account when re-architecting the infrastructure. All of the apps were hosted on a single server, which was also the master LDAP server, among other things. If that single box failed, we’d be in a bad situation.
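As a rough illustration of the role-based approach, a puppet module along these lines can cover all three server types from one template. This is only a sketch; the class, parameter and host names here are hypothetical, not our actual manifests:

```puppet
# Hypothetical sketch of a role-based OpenLDAP puppet module.
# Class and parameter names are illustrative only.
class ldap::server (
  $role      = 'slave',    # 'master', 'intermediary', or 'slave'
  $sync_from = undef,      # provider this server replicates from
) {
  package { 'openldap-servers': ensure => installed }

  # One template covers all roles; it toggles the syncrepl consumer
  # stanza and the syncprov provider overlay based on $role.
  file { '/etc/openldap/slapd.conf':
    content => template('ldap/slapd.conf.erb'),
    notify  => Service['slapd'],
  }

  service { 'slapd': ensure => running, enable => true }
}

# An intermediary replicates from the master and serves other slaves:
node 'ldap-int1.sjc.example.com' {
  class { 'ldap::server':
    role      => 'intermediary',
    sync_from => 'ldap-master1.phx.example.com',
  }
}
```

With a structure like this, adding a slave is a matter of a new node declaration, which is what makes the “few lines in puppet” workflow possible.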
I designed a new infrastructure that splits the various components out a little. The phonebook app would live on a cluster of 6 machines (shared with other, similar apps – think: intranet). Our LDAP master would move to a dedicated server, not used as an authentication backend for other services, but rather have dedicated authentication slaves in each datacenter and office. As we have more capacity in our Phoenix data center, the master would live there along with the webservers serving the phonebook app. Starting out, there are two dedicated slaves replicating from the master in Phoenix, load balanced behind our Zeus load balancer cluster to be used as an authentication backend for any services in our Phoenix datacenter. This setup makes it easy to add capacity as needed, and provides high availability in the event of a failure. There are also two dedicated slaves in our San Jose data center, but rather than have them both replicate from the master in Phoenix, it was decided to add an intermediary server in San Jose. The intermediary server would replicate from the master in Phoenix, and the San Jose slaves would replicate from it. Aside from reducing the cross-datacenter traffic, this provides better data consistency within a datacenter. A similar intermediary server was set up in our Mountain View office to provide a single point for all the other office LDAP servers to replicate from. Each individual Mozilla office has two LDAP slaves and instead of a load balancer, those use a virtual floating IP address that can move from one server to the other using keepalived. We balance our other services such as DNS in this fashion and it reduces the need for extra load balancing equipment in the lower traffic offices, where LDAP is primarily used for wireless connectivity. 
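The intermediary role is the interesting one configuration-wise: it is a syncrepl consumer of the master and, at the same time, a replication provider for its local slaves. A hedged slapd.conf sketch of that dual role (hostnames, rid, suffix and credentials are all illustrative, not our real values):

```
# Hypothetical slapd.conf fragment for an intermediary server.
database    hdb
suffix      "dc=example"

# Consumer side: pull changes from the master in Phoenix
syncrepl    rid=001
            provider=ldaps://ldap-master1.phx.example.com
            type=refreshAndPersist
            searchbase="dc=example"
            bindmethod=simple
            binddn="cn=syncrepl,dc=example"
            credentials=secret
            retry="60 +"

# Provider side: let the local slaves replicate from this server
overlay     syncprov
syncprov-checkpoint 100 10

# Writes still go to the real master
updateref   ldaps://ldap-master1.phx.example.com
```

The `updateref` line matters: any client that tries to write to an intermediary or slave is referred back to the master, so the single-master model holds everywhere.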
All servers would be set up completely using puppet, so that in the event of a hardware failure, or the need to add more slaves, adding a few lines to our puppet configurations makes it happen in a matter of minutes without much thought or effort involved. The other piece of the puzzle is our internal addressbook. We use OpenLDAP to provide addressbook lookups for mail clients. In the past, this was done using a single machine that was an LDAP slave. As we scale, we needed better redundancy there too. The addressbook would now live on two machines, also behind a load balancer. As an extra security measure, since this service is directly on the internet, it was decided to change the configuration to make the addressbook “slaves” simple proxies that only allow the lookup of a few select attributes. A compromised addressbook server would not result in a compromised LDAP database. Win! Speaking of security, the entire LDAP infrastructure was moved to a more secure VLAN, completely inaccessible to or from the internet, with the addressbook being the only thing with any exposure. Also, all ACLs were audited and updated to provide the minimal access necessary, while using standardized puppet templates that make it easy to add and remove ACLs as needed, with proper version control and auditing in place.
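A proxy like that can be built with OpenLDAP’s back-ldap backend: the addressbook server holds no data of its own, forwards queries to an internal slave, and its ACLs expose only a handful of attributes. A sketch under assumed names (the hostnames and the exact attribute list are illustrative):

```
# Hypothetical slapd.conf fragment for an addressbook proxy.
# back-ldap forwards queries to an internal slave; no local database.
database    ldap
suffix      "dc=example"
uri         "ldaps://ldap-slave1.phx.example.com"

# Expose only what a mail client needs for addressbook lookups
access to attrs=cn,sn,givenName,mail,telephoneNumber
    by * read
access to *
    by * none
```

Even if this box were compromised, there is no database on disk to steal, and the ACLs keep anything beyond the whitelisted attributes out of reach.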
First, I set up all the new hardware and the full configuration using puppet. After drawing diagrams and identifying all the network flows needed for replication and authentication, I worked closely with our network operations team to make sure everything would work as expected. Then I set up our San Jose intermediary server to replicate from the old master in San Jose to ensure that the overall flow of replication would work as expected, and began testing various LDAP queries. Meanwhile, I set up the phonebook application in Phoenix on a new cluster of SeaMicro servers and began testing it against the new master. All of this was unknown territory, as we were moving from a local LDAP server to a remote one, from RHEL5 to RHEL6, and from single-hosted to multi-hosted. It was a whole new environment. I worked with the webdev team and our own tools developer to update our LDAP apps to work in the new environment. As we were already rolling out our new offices in San Francisco, Toronto and Paris, I set those up to replicate from the new intermediary server in Mountain View, so half of the infrastructure had already been in production for a while and was actually crucial to the testing phase. Once I was satisfied that everything would work properly, I identified what was left to do to get completely off the old infrastructure. At that point everything was set up, and it was essentially just a matter of moving the master database to the new master server in Phoenix, changing some DNS names to point at new IPs and making sure that all the clients still worked as expected. All of this needed to happen with minimal downtime, as we rely on LDAP being available 24/7 for so many things, including mail, wi-fi, svn, mercurial, our intranet wikis, shell servers, etc. I decided to tackle the move on a Saturday, when the fewest people would be affected.
I was pretty confident that the move could be done in under two hours, since I had spent months preparing for it and ironing out the details. That was mostly the case: there were only a few brief downtimes during the two-hour window where various services failed to authenticate, mostly while I re-synced the slaves to have them catch up to their new masters. I did run into a few problems though, and here is the postmortem:
Everything went as planned. I shut down the old master, copied the full database to the new server, set the intermediary slave in San Jose to replicate from the new master in Phoenix and reset synchronization on the slaves in Phoenix. At the same time I changed the DNS to point at the new load-balanced IPs. I started around 7am on Saturday, ensuring that the shell servers could talk to the new LDAP servers, fixing some clients that had hardcoded the old master as their LDAP server and adding nagios checks to all the new slaves and the new master. At 10am, the maintenance window started, so I shut down the master and did the move and DNS changes then. By 10:30, San Jose was completely using only the new servers. Then I changed the DNS in Phoenix to point at the new servers and made sure the replication was working properly, both locally and remotely all the way to the remote offices through two intermediary servers. Everything went pretty well. By the end of the maintenance window at noon, I was just double- and triple-checking that replication was working. Around 1pm, I remembered that although I had made sure our password reset app was working on our new cluster in Phoenix, I had never tested changing my password and ensuring that the change would replicate properly. I tested it and found that it didn’t work. I worked with Rob Tucker, who happened to be online at the time, to troubleshoot why. It turned out that one piece was missing from our new master: a password check module that we’ve had in use for a few years, but that was completely undocumented. Rob helped me get the 64-bit version of the module compiled and installed on the new master and we finally got a password change to go through… but still not with the user-facing webapp. We discovered that in one place the app had hardcoded “localhost” as its LDAP server, rather than honoring the configuration directive.
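Checking that a slave has caught up comes down to comparing `contextCSN` values between the master and the slave (each can be read with something like `ldapsearch -s base contextCSN` against the database suffix). A small sketch of the comparison logic only, with the fetching left out and function names of my own invention:

```python
# Hypothetical helper for comparing OpenLDAP contextCSN values fetched
# from a master and a slave. A slave is caught up when its CSN timestamp
# is not behind the master's.
from datetime import datetime

def csn_timestamp(csn: str) -> datetime:
    """Extract the generation time from a contextCSN value.

    contextCSN values look like: 20120114183012.123456Z#000000#000#000000
    """
    stamp = csn.split("#", 1)[0]  # "20120114183012.123456Z"
    return datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ")

def slave_caught_up(master_csn: str, slave_csn: str) -> bool:
    """True if the slave's CSN is at or past the master's."""
    return csn_timestamp(slave_csn) >= csn_timestamp(master_csn)
```

Running this periodically across all slaves (or wiring it into nagios) catches the exact failure mode I was worried about: replication silently stalling after a topology change.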
After patching the app, we finally had a successful password change. At approximately 4pm, I was wrapping up and ready to leave for the evening, when a user e-mailed me to say that the phonebook app wouldn’t allow him to change anything in his profile. I couldn’t reproduce the issue, but realized this was because, as an LDAP administrator, I have full admin rights to the LDAP database, whereas a regular user has a different set of ACLs that apply. This is when I realized that in all my testing, I had neglected to test with a normal user account. I hacked on the permissions a bit more on Saturday night and had the ACLs fixed by midnight (I took a break from 5pm to 11pm). On Sunday morning, I went through and verified that the little surprises I had discovered the day before were documented and added to the puppet manifests. I checked the temporary patch to the password reset app into svn and tested it again to make sure there were no more glitches. I also fixed the nagios alerts for the Mountain View slaves, which were misconfigured and wouldn’t have alerted us to a problem if there had been one. I’m glad I came back to double-check that. 🙂
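The root cause here is worth spelling out: the rootdn bypasses ACLs entirely, so self-service features like the phonebook only work if the ACLs grant `by self write` explicitly. A hedged sketch of the kind of rule involved (the attribute list is illustrative, not our actual ACLs):

```
# Hypothetical slapd.conf ACL: let users edit their own phonebook
# fields. The rootdn is unaffected by ACLs, which is why admin-only
# testing never exposed the gap.
access to attrs=telephoneNumber,mobile,jpegPhoto,description
    by self write
    by users read
    by * none
```

Testing as an unprivileged bind DN is the only way to exercise rules like these, which is exactly the lesson from that Saturday evening.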
What I learned:
Details are important. Now that we have the new infrastructure, my next priority is setting up a staging infrastructure that can be used for the phonebook app, the password reset app and other tools, and in general as a place to test and stage changes to our LDAP infrastructure.
Testing is important. I should have tested the password reset app. I should have tested the phonebook with a normal user account.
People can be single points of failure. Before I started working on this, there was only one person with the full set of knowledge of our LDAP infrastructure. He left earlier this year to pursue other things, and although he had documented the most common issues and troubleshooting steps for our infrastructure, there was a huge amount of information we didn’t have (the password check module, for instance), and I had to learn a lot from scratch. I feel like I’ve learned a tremendous amount about how OpenLDAP works over the past 6 months and I enjoy working on it, but it is now extremely important that I don’t become a single point of failure for our LDAP environment myself. I’ve documented all the bits and pieces I have about our infrastructure, given a tech talk to our team about it, and will continue involving other members of my team and documenting all changes. I also hope this blog post provides some insight into how it is all set up, gives an idea of what is involved, and explains why, when you change your password, it takes ten minutes before the wireless controller in San Francisco notices the change.
The next steps:
Overall, I think our LDAP infrastructure is now vastly better than it was. It is set up to scale, and it is easy to make documented, version-controlled changes. I’ve put in a lot of time over the past 6 months planning this out and learning LDAP; it seems I’ve been eating, sleeping and breathing OpenLDAP for a while now. However, I still don’t consider myself an expert on the subject and am continually learning. There are still a lot of improvements that can be made to the infrastructure: what we have now is essentially a better-scaled version of what we had, and the basic configuration directives are the same. There is likely more tuning that can be done for better performance, more reliable replication and stability. These are all things I want to pursue, but with production services this integral to our infrastructure, it is crucial to take things one step at a time.
Special thanks to Rob Tucker, Fred Wenzel, Corey Shields, Phong Tran, Dumitru Gherman, Pete Fritchman, Michael Coates, Guillaume Destuynder, Adam Newman and the rest of the IT team for helping make this happen.