IT goings-on

Hello all and welcome to this week’s IT update.  Instead of the usual wrap-up of interesting tidbits from across the team, this post is dedicated to the recent major maintenance event at one of our two primary data centres.  Let’s dive in!

Fact: Mozilla leverages a mind-boggling variety of technical infrastructure.  The sheer breadth of machines and configurations is difficult to fully grasp.  This infrastructure is situated in a number of physical locations, including data centres in the USA and China, as well as our offices around the world.  Over the past couple of years, one of the major long-term projects at Mozilla IT has been to consolidate and industrialise these physical locations and their contents – no small feat, and a project that will remain on-going for the foreseeable future.

Today we have two primary data centres on the North American continent: PHX1 and SCL3.  These data centres are treated a little bit differently than our other locations, as they are not only our largest installations, but are specifically designed to provide highly stable, highly available environments – in other words, no downtime.  One of the key elements in this architecture is called the core network stack, which refers to the networking equipment that is responsible for routing all of the traffic between a given data centre and the Internet at large.  The stack needs to be as reliable as humanly (or machinely) possible – without it, there is no communication with the outside world.

Earlier this year a problem was detected in the stack at SCL3.  This problem had a direct impact on the stability and reliability of the core network, and if left untreated, would have eventually resulted in a major unplanned outage.  In fact, small service interruptions and other events had already been tied to this issue, and while work-arounds were implemented, the fact remained that this was a ticking time bomb.  Ultimately the decision was made to simply remove the problematic hardware entirely from the stack.  While this was certain to solve the issue, it also meant incurring the one thing that the HA architecture was designed to avoid: downtime.

Many of the products and services that Mozilla provides rely on SCL3, including – but not limited to – such things as product delivery (i.e. Firefox downloads, updates, and the like), the build network (for building and testing those deliverables), the Mozilla Developer Network, and so forth.  We worked with key stakeholders from across the company to explain the situation and come up with plans for how to deal with the impending outage.  These plans ranged from the relatively simple (such as putting up a “hardhat“-style message explaining the situation), to the non-trivial (such as replicating the entire repository infrastructure at PHX1), to the heroic (implementing product delivery entirely in the cloud).

Furthermore, we weren’t content with simply addressing the problematic issue (and since we were going to be experiencing a service outage no matter what), we worked with our vendor to come up with a new architecture – one that would ensure that even if we have to perform major network manipulations again, we should now be able to avoid total blackouts in the future.  This helped to turn what was “merely” a problem-solving exercise into a real opportunity to extend and improve our service offering.

As part of this planning process, we set up a lab environment with hardware supplied by our vendor, which allowed us to practice with the mechanisms and manipulations ahead of time.  I can’t stress enough how critical this was: knowing  what to expect going into it in terms of pitfalls and processes was absolutely essential.  This helped us to form realistic expectations and set up a time-line for the maintenance event itself.

There were emails; there were meetings; there were flowcharts and diagrams to last a lifetime – but at the end of the day, how did the event actually turn out?  Corey Shields with the details:

All in all, the maintenance was a success.  The work was completed without any major problems and done in time.  Even in a successful event like this one, we have a postmortem meeting to cover what was done well (to continue those behaviors in the future), and what needs improving.  We identified a few things that could have been done better, mostly around communication for this window.  Some community stakeholders were not notified ahead of time, and the communication itself was a bit confusing as to the network impact within the 8 hour window.  We have taken this feedback and will improve in our future maintenance windows.

There are any number of interesting individual stories that can (and should) be told about this maintenance, so keep watching this blog for more updates!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on  See you next time!