IT goings-on

Hello everybody – it’s time once again for the weekly IT update.

First up, the Mozilla Operations Centre (or MOC) is up and running!  This new team currently comprises seven employees from all over the world, including the USA, India, and Europe.  They’ll be handling such critical functions as monitoring, tier 1 and 2 support, and issue tracking and escalation for our entire infrastructure.  What’s more, they’ll be acting as a sort of interface layer for the more technical aspects of Mozilla’s mobile partner relationships.  Expect more news about this highly important team in the weeks and months to come.

The RelOps team stopped testing Firefox on OS X 10.7 due to falling usage and the similarity of coverage between 10.8 and 10.6.  They re-purposed all 83 of the Mac Minis running 10.7 to run tests on 10.6 instead – effectively doubling the available 10.6 test capacity.  This had the net effect of reducing 10.6 wait times considerably, thus improving the overall level of service and – most importantly – increasing the satisfaction of developers testing against those targets.

They’re also making progress on the project to standardise all of their virtual machines on a single platform.  Already, all of the KVM “odd ducks” at SCL3 have been replaced, which is a big win in terms of paying off technical debt for all of IT.  Good work, RelOps!

On the topic of performance improvements, Solarce from the WebOps team cleaned out tonnes of old jobs and tasks from our in-house Jenkins system, which has dramatically reduced start-up and task run times, and improved stability overall.

Mozillians physically working in our offices were plagued by a small but highly irritating problem: the clocks on the tablets used to check and book conference rooms were skewed by up to eight minutes, which meant it was sometimes tricky to reserve rooms properly.  Thankfully, the Desktop and NetOps teams came to the rescue, and the problem is no more!

Finally, in case you missed it earlier this week, Sheeri from the Database team put up an interesting post about a recent run of MySQL upgrades – highly recommended!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to the #it channel.  See you next time!

A Tale of Two MySQL Upgrades

At the beginning of 2013, Mozilla’s MySQL databases were a mix of MySQL 5.0, Percona’s patched MySQL 5.1, Percona’s patched MySQL 5.5 and MariaDB 5.5. MySQL 5.1 was released in November 2008 – so at the beginning of the year, we still had databases with no new major features in 4 years. Currently we have almost all our databases at Oracle’s MySQL 5.6 – the only stragglers are our cluster running TokuDB and a few machines that are no longer in use. Here’s a graph showing the state of our machines – you can see that in the first half of the year we concentrated on upgrading our 5.0 and 5.1 servers to 5.5, and then in the second half of the year we upgraded everything to MySQL 5.6:

[Image: MySQL Versions in 2013]

After running some tests, we determined that MariaDB 5.5 was the best option for us and our particular workload. For most of our servers, it did not matter whether we used Percona, MariaDB or Oracle’s MySQL, but our Bugzilla servers really benefited from MariaDB’s better subquery optimization, so we went with that. We had set up some Percona 5.5 servers over the spring/summer of 2012, when we moved some of our infrastructure to a new data center.

We upgraded to MySQL 5.5 to be on a recent version of MySQL. In the middle of the year, we had a choice – should we stay where we were, or should we upgrade? We had no particular directive from developers to upgrade for the new MySQL 5.6 features. However, we have been doing more and more digging into our systems, and we really wanted the performance_schema features so we could dig even more. We want to be able to parse queries in real time, perhaps with Anemometer, without having to take an offline log file and run pt-query-digest on it.
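To make the performance_schema point a little more concrete, here is a minimal sketch (not our production tooling) of the kind of live digest report we have in mind. It assumes MySQL 5.6’s performance_schema.events_statements_summary_by_digest table, whose timer columns are reported in picoseconds. The summarise helper, the sample rows and the table names inside them are made up for illustration, and actually running DIGEST_QUERY would need a MySQL client library that is not shown here.

```python
# Statement digests are aggregated by the server in this performance_schema
# table (MySQL 5.6). Timer columns are in picoseconds.
DIGEST_QUERY = """
SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10
"""

PICOSECONDS_PER_SECOND = 10**12


def summarise(rows):
    """Turn (digest_text, count, sum_timer_wait) rows into dicts, slowest first."""
    report = []
    for digest_text, count, total_ps in rows:
        total_s = total_ps / PICOSECONDS_PER_SECOND
        report.append({
            "query": digest_text,
            "calls": count,
            "total_s": total_s,
            # Average latency per call, in milliseconds.
            "avg_ms": (total_s / count) * 1000 if count else 0.0,
        })
    return sorted(report, key=lambda r: r["total_s"], reverse=True)


# Made-up rows standing in for the result of running DIGEST_QUERY.
sample_rows = [
    ("UPDATE `profiles` SET `last_seen` = ? WHERE `userid` = ?", 500, 10**12),
    ("SELECT * FROM `bugs` WHERE `bug_id` = ?", 2000, 4 * 10**12),
]
report = summarise(sample_rows)
```

The appeal of this approach is that the server does the aggregation itself, so there is no slow query log to copy off the machine before analysis can start.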

So, we chose to upgrade to MySQL 5.6. Unfortunately, there were no other GA products to test against – by mid-2013, neither MariaDB nor Percona had a GA 5.6 product, so our bake-off was functional only, not performance-related. Oracle’s MySQL 5.6 passed with flying colors, and so we proceeded to upgrade.

Now we have a recent and consistent version of MySQL installed that we can work with to gain insights into our systems. A pretty great goal to have met for 2013!

IT goings-on

Hello all and welcome to this week’s IT update.  Instead of the usual wrap-up of interesting tidbits from across the team, this post is dedicated to the recent major maintenance event at one of our two primary data centres.  Let’s dive in!

Fact: Mozilla leverages a mind-boggling variety of technical infrastructure.  The sheer breadth of machines and configurations is difficult to fully grasp.  This infrastructure is situated in a number of physical locations, including data centres in the USA and China, as well as our offices around the world.  Over the past couple of years, one of the major long-term projects at Mozilla IT has been to consolidate and industrialise these physical locations and their contents – no small feat, and a project that will remain on-going for the foreseeable future.

Today we have two primary data centres on the North American continent: PHX1 and SCL3.  These data centres are treated a little bit differently than our other locations, as they are not only our largest installations, but are specifically designed to provide highly stable, highly available environments – in other words, no downtime.  One of the key elements in this architecture is called the core network stack, which refers to the networking equipment that is responsible for routing all of the traffic between a given data centre and the Internet at large.  The stack needs to be as reliable as humanly (or machinely) possible – without it, there is no communication with the outside world.

Earlier this year a problem was detected in the stack at SCL3.  This problem had a direct impact on the stability and reliability of the core network, and if left untreated, would have eventually resulted in a major unplanned outage.  In fact, small service interruptions and other events had already been tied to this issue, and while work-arounds were implemented, the fact remained that this was a ticking time bomb.  Ultimately the decision was made to simply remove the problematic hardware entirely from the stack.  While this was certain to solve the issue, it also meant incurring the one thing that the HA architecture was designed to avoid: downtime.

Many of the products and services that Mozilla provides rely on SCL3, including – but not limited to – such things as product delivery (i.e. Firefox downloads, updates, and the like), the build network (for building and testing those deliverables), the Mozilla Developer Network, and so forth.  We worked with key stakeholders from across the company to explain the situation and come up with plans for how to deal with the impending outage.  These plans ranged from the relatively simple (such as putting up a “hardhat”-style message explaining the situation), to the non-trivial (such as replicating the entire repository infrastructure at PHX1), to the heroic (implementing product delivery entirely in the cloud).

Furthermore, we weren’t content with simply addressing the problematic issue.  Since we were going to be experiencing a service outage no matter what, we worked with our vendor to come up with a new architecture – one that should let us avoid total blackouts even if we have to perform major network manipulations again.  This helped to turn what was “merely” a problem-solving exercise into a real opportunity to extend and improve our service offering.

As part of this planning process, we set up a lab environment with hardware supplied by our vendor, which allowed us to practice with the mechanisms and manipulations ahead of time.  I can’t stress enough how critical this was: knowing what to expect going in, in terms of pitfalls and processes, was absolutely essential.  This helped us to form realistic expectations and set up a time-line for the maintenance event itself.

There were emails; there were meetings; there were flowcharts and diagrams to last a lifetime – but at the end of the day, how did the event actually turn out?  Corey Shields with the details:

All in all, the maintenance was a success.  The work was completed without any major problems and done in time.  Even in a successful event like this one, we have a postmortem meeting to cover what was done well (to continue those behaviors in the future), and what needs improving.  We identified a few things that could have been done better, mostly around communication for this window.  Some community stakeholders were not notified ahead of time, and the communication itself was a bit confusing as to the network impact within the 8 hour window.  We have taken this feedback and will improve in our future maintenance windows.

There are any number of interesting individual stories that can (and should) be told about this maintenance, so keep watching this blog for more updates!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to the #it channel.  See you next time!


IT goings-on

Greetings people of Earth (and elsewhere, perhaps) and welcome to another weekly Mozilla IT update.

Some big news this week from Chris Turra and the WebOps team: our internal PaaS has, at long last, passed the security review phase and is ready for immediate production use!  Many of you have been using the PaaS in a development capacity for some time, so for those who are already familiar with the environment, you’ll be happy to learn that a number of high-availability back-end services have been deployed in order to ensure that the service is production-ready (phew, that’s a lot of hyphenated words).  Frankly, there’s way too much goodness here to cram into a single paragraph – keep watching this space for a post dedicated to the new PaaS.

Also on a WebOps tip, Jacques Uber continues to astound and amaze with new updates to “Inventory”, which is Mozilla IT’s fully open-sourced infrastructure management application.  The newest functionality includes a GUI process for assigning Static Registrations – check the repo for more details.

The Release Engineering team has been hard at work rolling out the new imaging techniques and metrics tools to the test and Windows infrastructure (as noted previously), but somehow Amy Rich found the time to participate in a panel discussion at LISA ’13 entitled “Women in Advanced Computing”.  This was part of a small series on the topic, the other session being a half-day workshop hosted in part by Sheeri Cabral of the Database team.  Both sessions were very well received by all accounts!

Finally, a preview into the next post, wherein we’re going to talk a little bit about the big data centre maintenance that occurred this past week-end – so stay tuned!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to the #it channel.  See you next time!

IT goings-on

Hello everybody, and welcome to this week’s Mozilla IT update – let’s dive right in, shall we?

First off, some good news for those of you that felt the 3-month rotation for LDAP passwords was too short: most accounts have now moved to a 6-month rotation period.  Of course, all that really means is that in a given year, one is merely 50% less likely to get locked out after having forgotten to change their password.  Fortunately, resets of locked accounts are now totally self-service, so getting locked out is much less irritating than it used to be.

Speaking of improving end-user experience, NetOps did some work on the wireless networks in both Paris and Taipei, moving them into larger configuration groups so as to standardise their configurations.  They are now easier to manage, and thus, easier to diagnose and optimise.

On the graphing and visualisation front, a big shout-out to Ali and Anurag from the Metrics team, who have been hard at work on implementing a new visualisation tool, an example of which can be seen here.  Their new self-service framework allows the rest of us to quickly generate graphs and dashboards for just about anything we might be interested in.  In addition, Ben Sullins has been hard at work on implementing Tableau to help people build tools for analysing and displaying data.  Jacques Uber from WebOps has already used it to set up a hardware warranty summary graph, which has helped us to better understand and plan for upcoming expiration dates across thousands of machines.

On the topic of machines, the Storage and Virtualisation team spent the better part of a week braving the noise, temperature, and glaring fluorescent lights of the data centre during their push to expand our back-end NetApp infrastructure.  Thanks to their efforts, we now have room for some 300 additional virtual machines, which is going to give us some much-needed breathing room in that area.  This is going to be a big help going forward as we decommission those aforementioned out-of-warranty machines and replace them with spiffy new VMs.

As always, if you have any questions or want to learn more, feel free to comment below or hop on to the #it channel.  See you next time!

Notes from Nagios World Conference 2013

Nagios World Conference 2013 was held from Sep 30th to Oct 3rd in St. Paul, MN. I represented Mozilla IT/SRE along with Sheeri Cabral, who spoke about MySQL plugins. I wanted to share some observations and my best takeaways from the conference. I attended about 10 talks in all and spent more time discussing setups and best practices.

The biggest draw at the conf this year was Nagios 4.0, which was announced at last year’s keynote. 4.0 brings some long-awaited and much-needed rocket power to Nagios. The changelog has detailed information about the big features, but the ones that interested me the most were:

  • Core Workers – I have been researching how to scale up service check execution on some of our bigger instances. Until now, mod-gearman has been the tool of choice, but with Core Workers, Nagios natively steps up to the task. The legacy forking-for-each-check model was unsurprisingly hitting limits in some places, and 4.x replaces it with worker processes that get check execution delegated to them. There is a massive performance gain and I’m looking to leverage that vs. integrating with mod-gearman.
  • Query Handlers – This feels like baked-in MK Livestatus. It’s made available via a socket. Unlike Livestatus, it doesn’t yet have many fancy features and is mostly basic at the moment. I’d expect it to get a lot of attention in future versions.
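
For anyone who hasn’t met the worker model before, here is a rough sketch of the general idea behind Core Workers: a fixed pool of long-lived workers that have checks handed to them, instead of a fresh fork for every check. This is only an illustration of the pattern, not how Nagios itself is written. Threads stand in for worker processes to keep it short, and run_checks and the two sample checks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(checks, workers=4):
    """Run (name, callable) checks on a fixed pool of workers.

    Returns a dict mapping each check name to whatever its callable returned.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Hand every check to the pool; the workers are reused across checks.
        futures = {name: pool.submit(fn) for name, fn in checks}
        # Collect results once each worker has finished its check.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical checks returning a plugin-style (status, message) pair,
# where 0 means OK and 1 means WARNING.
checks = [
    ("disk", lambda: (0, "OK - 42% used")),
    ("load", lambda: (1, "WARNING - load 5.2")),
]
results = run_checks(checks)
```

The point of the pattern is that the cost of starting a worker is paid once, rather than once per check, which is where the scaling headroom comes from.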

Among other things I’m looking forward to integrating Multisite in our infrastructure. We have close to a dozen Nagios instances here at Mozilla and our primary interface to each is via IRC bots. As one would imagine, it doesn’t scale well and isn’t ideal for dealing with mass changes. This is where Multisite comes in very handy. Along with Livestatus, Multisite provides for a supercharged way to deal with multiple instances and multiple service/hosts within each. Do try out the demo because it’s hard to put awesome in words 🙂

Some nice talks that stood out:

  • Effective monitoring by Rodrigue Chakode where he spoke about filtering false alerts and actionable alerts and using business processes to monitor the most effective elements of a system.
  • Nagios at Etsy by Avleen Vig, who had an eventful road trip to the conference and discussed some cool things Etsy has done, particularly measuring alert fatigue by correlating alerts with sleep data from Fitbits worn by on-call staff. He also spoke at length about “Monitoring hygiene” and how Etsy went from 300 alerts/day to 45 alerts over the course of two years.

In all, it was a great conference, like last year. Looking forward to a year of 4.x and trying to get the in-house Puppet module out on GitHub 😉

MySQL Workbench “Clean up SQL” Feature

I was playing around with MySQL Workbench earlier in the week, and ran across the “clean up SQL” feature, which I thought was neat. Here’s a picture-based demonstration.

Here is a typical complex query that looks pretty good formatted in the results from a performance schema query:
[Image: query from performance schema]

Simply click the “broom” icon and watch as your SQL is cleaned up, with one field in the SELECT per line and the JOINs indented and formatted prettily:
[Image: nicer, cleaned up SQL]

Pretty cool, for just the click of a button!

IT goings-on

For those curious what IT has been up to lately, wonder no longer – here’s a quick status update from the past week, highlighting just some of the great stuff the Mozilla IT team has been working on recently.

First up, our team is growing:

  • Welcome Chris Knowles, who will be officially helping us out as a Storage and Virtualisation admin (unofficially, he’s going to help us with an even greater challenge: safely landing our Kerbals on the Mun).
  • A big hello to the entire OpSec team, who are now part of the IT team proper.  As separate entities we already had a close and fruitful working relationship – now that we’re all together, things are only going to get better!  Expect more updates about all the interesting stuff we’ll be working on together – as long as it’s not classified top secret. 😉

Our fabulous SRE team had an epic bug squashing session during which they reduced their open bugs by over 40%.  Given how diverse their queue is at any given moment, this is quite the accomplishment – great work guys!

Speaking of SREs, our very own Dumitru Gherman, along with developer Emma Irwin, made the long trip from Mountain View out to London to host a session on “Hacking your online safety” at MozFest.

Local Windows machine deployments at our San Francisco office used to take hours, but thanks to the efforts of Mike Poessy at the SFO service desk, via a combination of templating and new imaging techniques, this has been reduced down to as little as 20 minutes.  This newly streamlined process will almost certainly find its way to our other offices as well.

And last, but most certainly not least, the Release Operations team has been very busy:

  • Jake Watkins wrote a module for Windows called “metric-collective” that polls system stats and forwards the results to graphite/statsd.  This is going to be rolled out to all the Windows build and test systems during the next release cycle.  Combined with a related initiative to roll out collectd across all of the OS X and Linux machines, we’re going to have a whole new level of graphing and trending available across the entire release infrastructure.
  • Mark Cornmesser and Q Fortier have been hard at work on a brand new imaging and management mechanism for our Windows 2008 build hosts.  The biggest change is a move away from our old, manually maintained monolithic image, to a modular image complete with proper change and configuration management.  This new platform is currently being tested on a number of project branches and is expected to be rolled out to mozilla-central, try, and inbound in the next few weeks.
  • Dustin Mitchell worked with the Auto Tools team to set up an independent Puppet instance, itself destined to automate the management of the entire QA Mozmill CI infrastructure.  He also set up local Python and NPM mirrors for use by Mozillians, providing a compelling model and implementation for other use-cases going forward.
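
For the curious, forwarding a stat to statsd is pleasantly simple. The sketch below is not taken from metric-collective; it only shows the general shape of the StatsD line protocol, where a gauge is sent as name:value|g in a UDP datagram (8125 is the usual default port). The metric names, host name and values are made up.

```python
import socket


def statsd_gauge(name, value):
    """Format one gauge sample in the StatsD line protocol: name:value|g"""
    return f"{name}:{value}|g"


def send_metrics(samples, host="127.0.0.1", port=8125):
    """Send each (name, value) sample to a StatsD daemon as its own UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for name, value in samples:
            sock.sendto(statsd_gauge(name, value).encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical samples a poller might have collected from one build machine.
samples = [
    ("build.win-slave-01.cpu_percent", 37.5),
    ("build.win-slave-01.mem_free_mb", 2048),
]
lines = [statsd_gauge(name, value) for name, value in samples]
```

Calling send_metrics(samples) would fire the same lines at a local StatsD daemon. Because UDP is fire-and-forget, a poller built this way does not stall the build machine if the metrics server is down.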

As always, if you have any questions or want to learn more, feel free to comment below or hop on to the #it channel.  See you next time!

Learn MySQL for Free with MySQL Marinate, Season 3!

The 3rd season of MySQL Marinate begins October 1st, 2013*. Join the meetup group and RSVP at season 3 to join! You can do the work on MySQL, or if you prefer, MariaDB or Percona.

If you do not have the book yet, you can still do the first week by using the online material from “Browse Contents” on the O’Reilly book page for Learning MySQL. There is homework for week 1, see the master list for all the information.

If you would like to learn MySQL from the ground up, consider joining us. This is for beginners – if you have no experience with MySQL, or if you are a developer who wants to learn how to administer MySQL, or an administrator who wants to learn how to query MySQL, this course is what you want.

If you are not a beginner, you are welcome to join too – maybe you need a refresher, or maybe you just want to test your knowledge or earn badges. That’s OK too!

The format of a virtual self-study group is as follows:

Each participant acquires the same textbook (Learning MySQL, the “butterfly O’Reilly book”, published 2007). You can acquire the textbook however you want (e.g. from the library or from a friend) but if you buy the book, we ask that you buy it from our Amazon Store, to help pay for meetup fees.

Each participant commits to read one chapter per week, complete the exercises and post a link to the completed work. Tweet using the hashtag #mysqlmarinate.

Each participant obtains assistance by posting questions to a discussion area set up on the Virtual Tech Self Study Message Board for each chapter.

Each participant receives a badge upon finishing each chapter and all assignments.

Note: There is no classroom or video instruction.

How do I get started?

Become a member of the Virtual Tech Self Study Meetup Group.

Register for MySQL Marinate. RSVP to this event: Yes

Acquire the book (the only item that may cost money). Get your hands on Learning MySQL – see if your local library has it, if someone is selling their copy, or buy it from our Amazon Store (this helps pay for meetup fees).

When your book arrives, start your virtual learning by reading one chapter per week. Complete the exercises; if you have any questions, comments or want to learn more in-depth, that’s what the forums are for!


Q: How long will the course last?

A: We will cover 12 lessons (chapters) in the book, so 12 (twelve) weeks starting October 1st, though we will have one week that is a break so that you can catch up if you need to or you have a week off if you need it. Refer to the MySQL Marinate Season 3 Master Discussion List for specific dates.

By January 1st, 2014, you will know MySQL!!

Q: Can I get ahead?

A: Sure! This is go-at-your-own-pace. To prevent spoilers, please put comments in the appropriate chapter threads.

Q: Does this cover the Percona patch set or MariaDB forks?

A: This covers the basics of MySQL, which are immediately transferable to Percona’s patched MySQL or MariaDB builds.

Q: What do I need in order to start the course?

A: All you need is the book and access to a computer, preferably one that you have control over. Installing MySQL is chapter 2, so really, all you need is the book and a computer to start, you don’t have to worry about any prerequisites. If you do not have the book yet, you can still do the first week by using the online material from “Browse Contents” at the
