Pay No Attention to the Virtualization Behind the Curtain

gcox

VMware migrations are often seamless. With vMotion, Storage vMotion, DRS and HA, you can go a long time without a hiccup. There are always tasks that are more difficult than you'd expect:

  • “Why can’t I vMotion a template?”
  • “Why can’t I stage a VM to reboot into having more RAM?”
  • “Why can’t I edit configuration settings on a live VM ahead of a reboot?”

…and "Why can't I move VMs to a new vCenter?" That was the one we were facing: moving from a Windows vCenter to the new Linux-based vCenter Server Appliance (VCSA). It's what we wanted, but the problem is, that's the wrong question.

At Mozilla we have two main datacenters with ESX clusters, which were running ESXi 5.0. The hardware and ESXi versions were getting a little long in the tooth, so we got a new batch of hardware and installed ESXi 5.5.

Problem there: how to move just under 1000 VMs from the old vCenters to the new ones.  While a lot of our users are flexible about reboots, the task of scheduling downtime, shutting down, unregistering from the old, registering with the new, tweaking the network, booting back, then updating the inventory system… it was rather daunting.

It took a lot of searching, but we found out we were basically asking the wrong question.  The question isn’t “how do I move a guest to a new vCenter?”, it’s “how do I move a host to a new vCenter (and, oh yeah, mind if he brings some guests along)?”

So, the setup:

[Diagram: vSphere clusters pre-move]
On both sides, we have the same datastores set up (not shown), and the same VLANs being trunked in, so really this became a question of how we land VMs/hosts going from one side to another.  We have vSphere Distributed Switches (vDS) on both clusters, which means the network configuration is tied to the individual vCenters.

There may be a way to transfer a VM directly between two disparate vDS'es, but either we weren't finding it, or it involved too much dark magic or risk of failure. We used a multiple-hop approach that was the right level of "quick", "makes sense", and, most importantly, "works".

On both clusters, we took one host out of service and split its redundant network links into two half-networks. NetOps unaggregated the ports, and we made port 1 the uplink for a standard switch, carrying all of our trunked-in VLANs (with names like "VLAN2", "VLAN3"). Port 2 was returned to service as the now-nonredundant uplink for the vDS, with its usual VLANs (names like "Private VLAN2", "DMZ VLAN3"). On the 5.0 side, we referred to this host as "the lifeboat", and on the 5.5 side it was "the dock".

[Image caption: "I think we're going to need a bigger boat"]

The process at this point became a whole lot of routine work.

  • On the old cluster, pick VMs that you want to move.
  • Turn DRS to manual so nobody moves where they shouldn’t.
  • vMotion the selected VMs into the lifeboat until it is very full.
  • Look at the lifeboat's configuration tab, under the vDS, and see which VLANs are in use on this host, and by how many VMs.
  • For each VLAN in use on the lifeboat:
    • Under Networking, on the trunked vDS, choose "Migrate Virtual Machine Networking".
    • Migrate "Private VLAN2" to "VLAN2". This will only work on the lifeboat, since it's the only host that can access both forms of VLAN2, so choosing "all" (and ignoring a warning) is perfectly safe here.
    • Watch the VMs cut over to the standard switch (dropping maybe one packet).
  • Checking the lifeboat’s configuration, nobody is on the vDS now; all VMs on the lifeboat are on the local-to-the-lifeboat standard switch.
  • Disconnect the lifeboat from the old vCenter.
  • Remove the lifeboat from the old vCenter.
  • On the new vCenter, add the lifeboat as a new host.  This takes a while, and even after multiple successful runs there was always the worry of “this time it’s going to get stuck,” but it just worked.

[Image caption: "wave dead chickens as you see fit"]

Once the lifeboat host is added to the new cluster, vMotion all the VMs from the lifeboat onto the dock.  Now work can split in 2 directions: one person sends the lifeboat back; another starts processing the newly-landed VMs that are sitting on the dock.

[Diagram: vSphere clusters, unloading the lifeboat]

Sending the lifeboat back is fairly trivial.  Disconnect/remove the lifeboat from the new vCenter, add the host back to the old vCenter, and add the links to the vDS.  At this point, this person can start loading up the next batch of evacuees.

On the receiving side, all the VMs are currently pinned to the dock, since it's now the only host with a standard switch. All of the VMs there need to have their networks moved to the new vCenter's vDS. The process is just the reverse of before ("Migrate Virtual Machine Networking" under the Networking tab, moving "VLAN2" to "Private VLAN2"). The rest is housekeeping: file the VMs into the right resource pools and folders, update the in-house inventory system to indicate the VMs moved to a new vCenter, and kick off VMware Tools upgrades. As a last step, we'd enable DRS and put the dock into maintenance mode, ejecting all the new VMs into the remainder of the new cluster to make room for the next boatload of arrivals.

We had very few problems, but I’ll list them:

  • Any snapshots on the VMs were invalid after the move.  We found this the hard way: someone in QA rolled a guest back to a baseline snapshot, only to find the networking lost, because it restored references to the old vDS.  Easily fixed once identified, and we were lucky that it wasn’t a bigger problem since we’d coordinated that move with the user.
  • Two VMs had vNICs with manually configured MAC addresses in the 00:50:56 space. These VMs refused to land on the new side, because the addresses could conflict with (and not be managed by) the new vCenter. We had to do a hot-swap of the vNIC to get onto an automatically assigned MAC, at which point the VMs moved happily.
  • And, of course, human error.  Misclassifying VMs into the wrong place because we were moving so much so fast.

One person would own one boatload, noting which pools/folders the VMs came from and putting them in the right place on the far side. All in all, with 2 people, we were able to move 200 VMs across in a full working day, and so we finished evacuating the old vCenter in 5 working days. We only had to coordinate with 2 customers, and took one 15-minute maintenance window (for the manual-MAC vNIC issue), and even then we didn't disrupt service.

Around 1000 VMs moved, and nobody noticed.  Just how we wanted it.

The power of indexes

Brandon Johnson

tl;dr: If you think there are slowness or performance issues in your application database, ask your friendly local DBAs. Often we can help (and we like to!); see the graphs below for evidence. :)
Or if you don’t have a DB team, check out this great guide on how to check for slow queries.

Longer version:

Last week, one of the release engineering staff approached me about some slow performance on our buildbot database systems.

Upon investigation, we realized there was a significant buildup of slow queries during weekdays. In this system, slow queries were defined as any query taking longer than 2 seconds.
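
For the curious, the "longer than 2 seconds" definition comes straight from the server's slow query log settings (in MySQL, the long_query_time threshold). A minimal sketch of how to confirm what a given server considers "slow", assuming you can connect with the mysql client:

    # Is the slow query log enabled, where is it written, and what threshold
    # (in seconds) marks a query as "slow"?
    mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log';
              SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';
              SHOW GLOBAL VARIABLES LIKE 'long_query_time';"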

After a short investigation, we determined that a set of queries was repeating about 1-2 times every second and taking about 2 seconds per run. This query was able to gain substantial benefits from the addition of only a few single-column indexes. Since the impact was low, we rolled the change out to dev/stage and then to production on Tuesday.
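
The buildbot schema details aren't the point here, so the table, column, and slow log path below are purely illustrative, but the shape of the fix was exactly this: rank the slow log by total time, then add a single-column index matching what the repeat offender filters on.

    # Summarise the slow log to find the dominant query pattern
    # (pt-query-digest ranks query fingerprints by total time).
    pt-query-digest /var/lib/mysql/slow-query.log | head -n 50

    # Hypothetical example of the kind of fix applied: a single-column index
    # on the column the slow query filters on (names made up for illustration).
    mysql buildbot -e "ALTER TABLE builds ADD INDEX idx_builds_starttime (starttime);"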

The graphs below show the impact of the change over three days (Sunday, Monday, Tuesday; the change went in on the morning of Tuesday the 28th), and again over the last 7 days.

3 days around the change:

[Graph: slow queries over the 3 days around the change]

Last 7 days:

[Graph: slow queries over the last 7 days]

7 days prior to the change (for perspective):

[Graph: slow queries over the 7 days prior to the change]

A Physical Control Surface for Video Production

Richard Milewski


At Mozilla we make heavy use of Telestream’s Wirecast to stream video events. While Wirecast has a nice GUI, using a mouse or trackpad to control a video production is far from ideal. In some of our larger venues we have Blackmagic ATEM production video switchers, but for remote events and streaming from smaller venues we’ve been stuck with the default GUI control interface. Until now…

At the Mozilla Festival in London earlier this year I saw some projects using MIDI control surfaces for things other than controlling music. It turns out that grafting a MIDI control surface onto any program with an AppleScript interface is quite easy, thanks to Nico Wald's MidiPipe program.

A few minutes of searching the web showed that none of this is a new idea. Mark over at ttfn.tv implemented a similar solution a couple of years ago.

Mark’s solution used an earlier version of the Korg control surface, and was specific to an earlier Wirecast 4 release. It also used more than a dozen different scripts which made it hard to both understand and maintain.

I’ve done a ground-up rewrite and bundled the scripts and configuration files into something I’m calling Video Director. There are versions for both Wirecast 4 and Wirecast 5, and you can get all the code over on GitHub. (Here’s hoping that someone with better Applescript Fu than mine will fork these projects).

Video Director is a script designed to enable control of Wirecast 4 from a Korg nanoKONTROL2 MIDI control surface on Mac OS X computers. The Korg nanoKONTROL2 is a low-cost way to provide a tactile interface which, while not as elegant as a real production video switcher, gives far better feedback than trying to control a video production with a mouse or touchpad.

In addition to simply providing real physical buttons for video switching operations, Video Director also simplifies the process of populating the various control layers of Wirecast with video and graphic content. It will load layer content from a pre-defined directory structure on the host machine, allowing rapid re-configuration of Wirecast for programs with differing content requirements.

The functionality of Video Director is limited by the very restricted subset of Wirecast operations for which Telestream has exposed a scriptable interface. The most obvious omission is that there appears to be no way to script the master audio level control, either through the Wirecast API or via System Events scripting. For use with a control surface such as the Korg nanoKONTROL2 with its many sliders and knobs, this is a galling omission. In addition, Wirecast 5 appears to be even less scriptable than Wirecast 4. Here’s hoping that changes in the next couple of releases.

Upgrading from MySQL 5.1 to MariaDB 5.5

Sheeri


After my last post, A Tale of Two MySQL Upgrades, a few folks asked if I would outline the process we used to upgrade, and what kind of downtime we had.

Well, the processes were different for each upgrade, so I will tackle them in separate blog posts. The first step was to upgrade all our MySQL 5.1 machines to MariaDB 5.5. As mentioned in the previous post, MariaDB’s superior performance for subqueries is why we switched – and we switched back to MySQL for 5.6 to take full advantage of the performance_schema.

It is not difficult to blog about our procedure, as we have documentation for each process; my first tip would be to do the same in your own environment. This also enables other folks to help, even if they are sysadmins and not normally DBAs. You may notice the steps contain items that might be "obvious" to someone who has done maintenance before – we try to write them in enough detail that if you were doing this at 3 am and a bit sleep-deprived, you could follow the checklist and not miss anything. This helps junior and aspiring DBAs avoid missing steps, too.

The major difference between MySQL 5.1 and MySQL 5.5 (and its forks, like MariaDB) is that FLOAT columns are handled differently. On MySQL 5.1, a float value could be in scientific notation (e.g. 9.58084e-05) and in 5.5, it’s not (e.g. 0.0000958084). This makes checksumming difficult, as all FLOAT values will show differences even when they are the same number. There is a workaround for this, devised by Shlomi Noach.
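For reference, the checklist below handles this by telling pt-table-checksum to ignore the affected columns. A minimal sketch of how the two pieces fit together (the host and user are placeholders for your environment; the authoritative steps are in the checklist):

    # Build the list of FLOAT/DOUBLE columns; these render differently on
    # 5.1 vs 5.5, so they would show up as false checksum differences.
    # (You may need to raise group_concat_max_len if you have many of them.)
    IGNORE_COLS=$(mysql -BN -e "SELECT GROUP_CONCAT(DISTINCT COLUMN_NAME)
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE DATA_TYPE IN ('float','double')
          AND TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema')")

    # Checksum from the master, skipping those columns.
    pt-table-checksum --ignore-columns="$IGNORE_COLS" h=master.db.example.com,u=checksum_user --ask-pass
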

We have an n+1 architecture for databases at Mozilla – this means that we have an extra server. If we need 1 master and 3 slaves, then n+1 is 1 master and 4 slaves. Because of this, there are 2 different ways we upgrade – the first slave we upgrade, and subsequent slaves/masters.

These steps are copied and pasted from our notes, with minor changes (for example, item #2 is “send out maintenance notices” but in our document we have the e-mail addresses to send to).

Assumptions: Throughout these notes we use ‘/var/lib/mysql’, as that is our standard place for MySQL. You may need to change this to suit your environment. We are also using Red Hat Enterprise Linux for our operating system, so this procedure is tailored to it (e.g. “yum install/yum remove”). We control packages using the freely available puppet mysql module we created.

For the first slave
The overall procedure is to perform a logical backup of the database, create a new empty installation of the new server version, and import the backup. Replication does work from MySQL 5.1 to MariaDB 5.5 and back (at least on the 25 or so clusters we have, replication worked in both directions; your mileage may vary).

  1. Make sure the slave has the same data as the master with checksums (the previous checksum is fine; they should be running every 12 hours).
  2. Send out maintenance notices.
  3. Take the machine out of any load balanced services, if appropriate.
  4. Set appropriate downtimes in Nagios.
  5. Start a screen session on the server.
  6. Do a SHOW PROCESSLIST to see if there are any slaves of the machine. If so, move them to another master if they are needed. [we have a different checklist for this]
  7. Do a SHOW SLAVE STATUS to see if this machine is a slave.
    1. If this machine is a slave, ensure that its master will not delete its binlogs while the upgrade is occurring.
    2. If this machine is a slave, do a SLAVE STOP; and copy the master.info file somewhere safe [or the slave_master_info table if using that]
  8. Stop access to the machine from anyone other than root (assuming you are connecting from root):
    UPDATE mysql.user SET password=REVERSE(password) WHERE user!='root'; FLUSH PRIVILEGES;
  9. See what the default character set is for the server and databases:
    SHOW VARIABLES LIKE 'character_set_server'; SHOW VARIABLES LIKE 'character_set_database';
    SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE DEFAULT_CHARACTER_SET_NAME!='utf8' AND SCHEMA_NAME NOT IN ('mysql');
    If applicable, change the server defaults to UTF8 and change databases to utf8 with ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;
  10. Check to see how big the data is:
    mysql> SELECT SUM(DATA_LENGTH)/1024/1024/1024 AS sizeGb FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA!='information_schema';
  11. Determine how you can export the data, given the size. You may be able to export without compression, or you may need to do a mysqldump | gzip -c > file.sql, then compress the old data files instead of just moving them aside.
  12. Do a du -sh * of the datadir and save it for later, if you want to compare the size of the database to see how much space is returned after defragmenting.
  13. Export the data from all databases, preserving character set, routines and triggers. Record the time for documentation's sake. I'm assuming the character set from step 9 is utf8 (if it's something like latin1, you'll need to put --default-character-set=latin1 in the command). If the machine has slaves, make sure to use --master-data=1. If you need to compress, change the shell command accordingly (a compressed variant, together with the import, is sketched after this list):
    time mysqldump --all-databases --routines --triggers --events > `date +%Y-%m-%d`_backup.sql
  14. Stop MySQL.
  15. Copy the config file (usually /etc/my.cnf) to a safe place (like /etc/my.cnf.51).
  16. Do a rpm -qa | egrep -i "percona|mysql". Do a yum remove for the mysql/percona packages. It's OK if it also removes related packages, like perl-DBD, but make a note of them, because you will want to reinstall them later. Sample:
    yum remove Percona-Server-client Percona-Server-shared-compat Percona-XtraDB-Cluster-devel Percona-Server-server
  17. Move the /var/lib/mysql directory to /var/lib/mysql-old. Compress any files that need compression (if you had to compress the export, this frees up room to decompress the sql file). If you absolutely cannot keep the files, see if you can copy them somewhere. We really want to preserve the old data directory just in case we need to revert.
  18. Decompress the sql file, if applicable.
  19. Install the proper packages by changing puppet to use "mariadb55" instead of "mysql51" or "percona51" [this may be different in your environment; we use the freely available puppet mysql module we created]. Verify with rpm -qa | egrep -i "percona|mysql|maria"
  20. Run mysql_install_db.
  21. Make any changes to /etc/my.cnf (e.g. run puppet). When going from MySQL 5.1 to 5.5, there are no particular global changes Mozilla made (when we went from MySQL 5.0 to MySQL 5.1, we did a global change to reflect the new slow query log options).
  22. chown -R mysql:mysql /var/lib/mysql/
  23. chmod 775 /var/lib/mysql
  24. Start MySQL and check the error logs for any warnings. Get rid of any warnings/errors, and make sure MySQL is running.
  25. Turn off binary logging. Import the export, timing how long it takes, for reference:
    time mysql < YYYY_MM_DD_backup.sql
  26. Restart MySQL and look for errors; you may need to run mysql_upgrade.
  27. Turn on binary logging, if applicable.
  28. Test.
  29. If this machine was a slave, re-slave it. Let it catch up, making sure there are no data integrity errors and no replication errors.
  30. Reinstate permissions on the users:
    UPDATE mysql.user SET password=REVERSE(password) WHERE user!='root'; FLUSH PRIVILEGES;
  31. Re-slave any slaves of this machine, if needed.
  32. Turn Nagios back on, making sure all the checks are green first.
  33. Run a checksum on the master to propagate to this slave, and double-check data integrity on the slave. Note that you will want to use --ignore-columns with the output of the query below, to avoid false positives from the scientific notation change (see https://blog.mozilla.org/it/2013/01/17/mysql-5-1-vs-mysql-5-5-floats-doubles-and-scientific-notation/). Find the FLOAT/DOUBLE fields to ignore in the checksum with:
    SELECT GROUP_CONCAT(DISTINCT COLUMN_NAME) FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('float','double') AND TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema');
  34. Put the machine back into the load balancer, if applicable.
  35. Inform folks the upgrade is over.
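
For reference, here is a minimal sketch of the compressed-export variant of step 13 together with the import of step 25. The file name is a placeholder, and it assumes SUPER privilege so binary logging can be disabled for just the import session:

    # Step 13, compressed variant: dump everything, preserving routines,
    # triggers and events, and record the binlog position for re-slaving
    # (drop --master-data=1 if this machine has no slaves).
    time mysqldump --all-databases --routines --triggers --events --master-data=1 \
        | gzip -c > "$(date +%Y-%m-%d)_backup.sql.gz"

    # Step 25: import on the upgraded server. SET SQL_LOG_BIN=0 keeps the
    # import out of the binary log for this session only (requires SUPER).
    time zcat YYYY-MM-DD_backup.sql.gz | mysql --init-command="SET SQL_LOG_BIN=0"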

On the first upgrade, we did what is usually recommended: a logical export with mysqldump, and then an import. With other upgrades in the same replication hierarchy, we can take advantage of xtrabackup to stream a copy of an already-upgraded server directly to the machine being upgraded.

The general procedure here is similar to the above, except that no logical export is taken. After the preparation steps, a new empty MariaDB 5.5 server is installed. Then we use xtrabackup to back up an existing MariaDB 5.5 server and restore it onto the machine we are upgrading.

For subsequent slaves, and the master

  1. Coordinate with affected parties ahead of time.
  2. Send out any notices for downtime.
  3. Take the machine out of any load balanced services, if appropriate. If the machine is a master, this means failing over the master first, so that this machine becomes a regular slave. [we have a different checklist for how to fail over]
  4. Set appropriate downtimes in Nagios, including for any slaves.
  5. Start a screen session on the server.
  6. Do a SHOW PROCESSLIST to see if there are any slaves of the machine. If so, move them to another master if they are needed.
  7. Do a SHOW SLAVE STATUS to see if this machine is a slave.
    1. If this machine is a slave, ensure that the master will not delete its binlogs while the upgrade is occurring.
    2. If this machine is a slave, do a SLAVE STOP; and copy the master.info file somewhere safe
  8. Save a list of grants from pt-show-grants, just in case there are users/permissions that need to be preserved [this is done because sometimes masters and slaves have different users, though we try to keep everything consistent]. A sample invocation is sketched after this list.
  9. Figure out how big the backup will be by doing a du -sh on the datadir of the already-upgraded machine to be backed up, and make sure the new machine has enough space to keep the old version and have the new version as well.
  10. Stop MySQL on the machine to be upgraded.
  11. Copy the config file (usually /etc/my.cnf) to a safe place (like /etc/my.cnf.51).
  12. Do a rpm -qa | egrep -i "mysql|percona". Do a yum remove for the mysql packages (at least mysql-server, mysql). It's OK if it also removes related packages, like perl-DBD, but make a note of them, because you will want to reinstall them later.
  13. Move the /var/lib/mysql directory to /var/lib/mysql-old. Compress any files that need compression. If you absolutely cannot keep the files, see if you can copy them somewhere. We really want to preserve the old data directory just in case we need to revert.
  14. Install the proper packages by changing puppet to use "mariadb55" instead of "mysql51" or "percona51", running puppet manually. Verify with rpm -qa | egrep -i "percona|mysql|maria"
  15. Run mysql_install_db.
  16. Make any changes to /etc/my.cnf (or run puppet). When going from MySQL 5.1 to 5.5, there are no particular changes.
  17. chown -R mysql:mysql /var/lib/mysql/
  18. chmod 775 /var/lib/mysql
  19. Start MySQL and check the error logs for any warnings. Get rid of any warnings/errors, and make sure MySQL is started.
  20. Stop MySQL, and move or delete the datadir that was created on upgrade.
  21. If you are directly streaming the backup to the machine to be upgraded, do this on the machine to be upgraded:
    cd $DATADIR
    nc -l 9999 | tar xfi -
  22. On the machine to be backed up (that is already upgraded), in a screen session, making sure you get any slave info:
    time innobackupex --slave-info --stream=tar $DATADIR | nc (IP/hostname) 9999
  23. Once xtrabackup is complete, fix permissions on the datadir:
    chown -R mysql:mysql /var/lib/mysql/
    chmod 775 /var/lib/mysql
  24. Prepare the backup:
    time innobackupex --apply-log --target-dir=/var/lib/mysql
  25. Fix permissions on the datadir again:
    chown -R mysql:mysql /var/lib/mysql/
    chmod 775 /var/lib/mysql
  26. Restart MySQL and look for errors.
  27. Test.
  28. If this machine was a slave, re-slave it. Let it catch up, making sure there are no data integrity errors and no replication errors.
  29. Re-slave any slaves of this machine, if needed.
  30. Turn Nagios back on, making sure all checks are green first.
  31. Put the machine back into the load balancer, if applicable.
  32. Inform folks the upgrade is over.
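
Step 8 mentions pt-show-grants without showing the invocation; a minimal sketch, where the host and output path are placeholders and the connection options are the standard Percona Toolkit ones:

    # Save all grants so users/permissions can be compared or restored
    # after the upgrade.
    pt-show-grants --host=db1.example.com --user=root --ask-pass \
        > /root/grants_db1_$(date +%F).sql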

It’s long and detailed, but not particularly difficult.

IT goings-on

phrawzty

Hello everybody – it’s time once again for the weekly IT update.

First up, the Mozilla Operations Centre (or MOC) is up and running!  This new team currently comprises seven employees from all over the world, including the USA, India, and Europe.  They’ll be handling such critical functions as monitoring, tier 1 and 2 support, and issue tracking and escalation for our entire infrastructure.  What’s more, they’ll be acting as a sort of interface layer for the more technical aspects of Mozilla’s mobile partner relationships.  Expect more news about this highly important team in the weeks and months to come.

The RelOps team stopped testing Firefox on OS X 10.7 due to falling usage and the similarity of coverage between 10.8 and 10.6. They repurposed all 83 of the Mac Minis running 10.7 to now run and test on 10.6 – effectively doubling the available 10.6 test capacity. This had the net effect of reducing 10.6 wait times considerably, thus improving the overall level of service and – most importantly – increasing the satisfaction of developers testing against those targets.

They’re also making progress on the project to standardise all of their virtual machines on a single platform.  Already, all of the KVM “odd ducks” at SCL3 have been replaced, which is a big win in terms of paying off technical debt for all of IT.  Good work, RelOps!

On the topic of performance improvements, Solarce from the WebOps team cleaned out tonnes of old jobs and tasks from our in-house Jenkins system, which has dramatically reduced start-up and task run times, and improved stability overall.

Mozillians physically working in our offices were plagued by a small but highly irritating problem: the tablets used to check and book conference rooms were skewed by up to eight minutes, which meant it was sometimes tricky to reserve rooms properly.  Thankfully, the Desktop and NetOps teams came to the rescue, and the problem is no longer!

Finally, in case you missed it earlier this week, Sheeri from the Database team put up an interesting post about a recent run of MySQL upgrades – highly recommended!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!

A Tale of Two MySQL Upgrades

Sheeri


At the beginning of 2013, Mozilla's MySQL databases were a mix of MySQL 5.0, Percona's patched MySQL 5.1, Percona's patched MySQL 5.5 and MariaDB 5.5. MySQL 5.1 was released in November 2008 – so at the beginning of the year, we still had databases with no new major features in 4 years. Currently we have almost all our databases at Oracle's MySQL 5.6 – the only stragglers are our cluster running TokuDB and a few machines that are no longer in use. Here's a graph showing the state of our machines – you can see that in the first half of the year we concentrated on upgrading our 5.0 and 5.1 servers to 5.5, and then in the second half of the year we upgraded everything to MySQL 5.6:

[Graph: MySQL versions at Mozilla over 2013]

After running some tests, we determined that MariaDB 5.5 was the best option for us and our particular workload. For most of our servers, it did not matter whether we used Percona, MariaDB or Oracle's MySQL, but our Bugzilla servers really benefited from MariaDB's better subquery optimization, so we went with that. We had set up some Percona 5.5 servers over the spring/summer of 2012, when we moved some of our infrastructure to a new data center.

We upgraded to 5.5 to be on a recent version of MySQL. In the middle of the year, we had a choice: should we stay where we were, or should we upgrade again? We had no particular directive from developers to upgrade for the new MySQL 5.6 features. However, we have been doing more and more digging into our systems, and we really wanted the performance_schema features so we could dig even deeper. We want to be able to parse queries in real time, perhaps with Anemometer, without having to take an offline log file and run pt-query-digest on it.
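
To make the performance_schema motivation concrete, this is the sort of question 5.6 can answer on a live server, with no slow log collection or offline pt-query-digest run (a sketch; the ordering and LIMIT are just examples):

    # Top statement digests by total execution time, straight from a running
    # MySQL 5.6 server. Timer columns are in picoseconds, hence the 1e12.
    mysql -e "
      SELECT DIGEST_TEXT,
             COUNT_STAR                    AS exec_count,
             ROUND(SUM_TIMER_WAIT/1e12, 2) AS total_secs,
             ROUND(AVG_TIMER_WAIT/1e12, 4) AS avg_secs
      FROM performance_schema.events_statements_summary_by_digest
      ORDER BY SUM_TIMER_WAIT DESC
      LIMIT 10;"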

So, we chose to upgrade to MySQL 5.6. Unfortunately, there were no other GA products to test against – by mid-2013, neither MariaDB nor Percona had a GA 5.6 product, so our bake-off was functional only, not performance-related. Oracle’s MySQL 5.6 passed with flying colors, and so we proceeded to upgrade.

Now we have a recent and consistent version of MySQL installed that we can work with to gain insights into our systems. A pretty great goal to have met for 2013!

IT goings-on

phrawzty

Hello all and welcome to this week’s IT update.  Instead of the usual wrap-up of interesting tidbits from across the team, this post is dedicated to the recent major maintenance event at one of our two primary data centres.  Let’s dive in!

Fact: Mozilla leverages a mind-boggling variety of technical infrastructure.  The sheer breadth of machines and configurations is difficult to fully grasp.  This infrastructure is situated in a number of physical locations, including data centres in the USA and China, as well as our offices around the world.  Over the past couple of years, one of the major long-term projects at Mozilla IT has been to consolidate and industrialise these physical locations and their contents – no small feat, and a project that will remain on-going for the foreseeable future.

Today we have two primary data centres on the North American continent: PHX1 and SCL3.  These data centres are treated a little bit differently than our other locations, as they are not only our largest installations, but are specifically designed to provide highly stable, highly available environments – in other words, no downtime.  One of the key elements in this architecture is called the core network stack, which refers to the networking equipment that is responsible for routing all of the traffic between a given data centre and the Internet at large.  The stack needs to be as reliable as humanly (or machinely) possible – without it, there is no communication with the outside world.

Earlier this year a problem was detected in the stack at SCL3.  This problem had a direct impact on the stability and reliability of the core network, and if left untreated, would have eventually resulted in a major unplanned outage.  In fact, small service interruptions and other events had already been tied to this issue, and while work-arounds were implemented, the fact remained that this was a ticking time bomb.  Ultimately the decision was made to simply remove the problematic hardware entirely from the stack.  While this was certain to solve the issue, it also meant incurring the one thing that the HA architecture was designed to avoid: downtime.

Many of the products and services that Mozilla provides rely on SCL3, including – but not limited to – such things as product delivery (i.e. Firefox downloads, updates, and the like), the build network (for building and testing those deliverables), the Mozilla Developer Network, and so forth.  We worked with key stakeholders from across the company to explain the situation and come up with plans for how to deal with the impending outage.  These plans ranged from the relatively simple (such as putting up a "hardhat"-style message explaining the situation), to the non-trivial (such as replicating the entire repository infrastructure at PHX1), to the heroic (implementing product delivery entirely in the cloud).

Furthermore, we weren't content with simply addressing the problem at hand: since we were going to incur a service outage no matter what, we worked with our vendor to come up with a new architecture, one that should let us avoid total blackouts in the future even if we have to perform major network manipulations again.  This helped to turn what was "merely" a problem-solving exercise into a real opportunity to extend and improve our service offering.

As part of this planning process, we set up a lab environment with hardware supplied by our vendor, which allowed us to practice the mechanisms and manipulations ahead of time.  I can't stress enough how critical this was: knowing what to expect going in, in terms of pitfalls and processes, was absolutely essential.  This helped us form realistic expectations and set up a timeline for the maintenance event itself.

There were emails; there were meetings; there were flowcharts and diagrams to last a lifetime – but at the end of the day, how did the event actually turn out?  Corey Shields with the details:

All in all, the maintenance was a success.  The work was completed without any major problems and done in time.  Even in a successful event like this one, we have a postmortem meeting to cover what was done well (to continue those behaviors in the future), and what needs improving.  We identified a few things that could have been done better, mostly around communication for this window.  Some community stakeholders were not notified ahead of time, and the communication itself was a bit confusing as to the network impact within the 8 hour window.  We have taken this feedback and will improve in our future maintenance windows.

There are any number of interesting individual stories that can (and should) be told about this maintenance, so keep watching this blog for more updates!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!


IT goings-on

phrawzty

Greetings people of Earth (and elsewhere, perhaps) and welcome to another weekly Mozilla IT update.

Some big news this week from Chris Turra and the WebOps team: our internal PaaS has, at long last, passed the security review phase and is ready for immediate production use!  Many of you have been using the PaaS in a development capacity for some time, so those of you who are already familiar with the environment will be happy to learn that a number of high-availability back-end services have been deployed in order to ensure that the service is production-ready (phew, that's a lot of hyphenated words).  Frankly, there's way too much goodness here to cram into a single paragraph – keep watching this space for a post dedicated to the new PaaS.

Also on a WebOps tip, Jacques Uber continues to astound and amaze with new updates to "Inventory", which is Mozilla IT's fully open-sourced infrastructure management application.  The newest functionality includes a GUI process for assigning Static Registrations – check the repo for more details.

The Release Engineering team has been hard at work rolling out the new imaging techniques and metrics tools to the test and Windows infrastructure (as noted previously), but somehow Amy Rich found the time to participate in a panel discussion at LISA ’13 entitled "Women in Advanced Computing".  This was part of a small series on the topic, the other session being a half-day workshop hosted in part by Sheeri Cabral of the Database team.  Both sessions were very well received by all accounts!

Finally, a preview into the next post, wherein we’re going to talk a little bit about the big data centre maintenance that occurred this past week-end – so stay tuned!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!

IT goings-on

phrawzty

Hello everybody, and welcome to this week’s Mozilla IT update – let’s dive right in, shall we?

First off, some good news for those of you who felt the 3-month rotation for LDAP passwords was too short: most accounts have now moved to a 6-month rotation period.  Of course, all that really means is that in a given year, one is merely 50% less likely to get locked out after having forgotten to change their password.  Fortunately, resets of locked accounts are now totally self-service, so getting locked out is much less irritating than it used to be.

Speaking of improving end-user experience, NetOps did some work on the wireless networks in both Paris and Taipei, moving them into larger configuration groups so as to standardise their configurations.  They are now easier to manage, and thus, easier to diagnose and optimise.

On the graphing and visualisation front, a big shout-out to Ali and Anurag from the Metrics team, who have been hard at work on implementing a new visualisation tool, an example of which can be seen here.  Their new self-service framework allows the rest of us to quickly generate graphs and dashboards for just about anything we might be interested in.  In addition, Ben Sullins has been hard at work implementing Tableau to help people build tools for analysing and displaying data.  Jacques Uber from WebOps has already used it to set up a hardware warranty summary graph, which has helped us better understand and plan for upcoming expiration dates across thousands of machines.

On the topic of machines, the Storage and Virtualisation team spent the better part of a week braving the noise, temperature, and glaring fluorescent lights of the data centre during their push to expand our back-end NetApp infrastructure.  Thanks to their efforts, we now have room for some 300 additional virtual machines, which is going to give us some much-needed breathing room in that area.  This is going to be a big help going forward as we decommission those aforementioned out-of-warranty machines and replace them with spiffy new VMs.

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!

Notes from Nagios World Conference 2013

Ashish Vijayaram

Nagios World Conference 2013 was held between Sep 30th and Oct 3rd in St. Paul, MN. I represented Mozilla IT/SRE along with Sheeri Cabral, who spoke about MySQL plugins. I wanted to share some observations and my best takeaways from the conference. I attended about 10 talks in all, and spent the rest of my time discussing setups and best practices.

The biggest draw at the conference this year was Nagios 4.0, which was announced at last year's keynote. 4.0 brings some long-awaited and much-needed rocket power to Nagios. The changelog has detailed information about the big features, but the ones that interested me the most were:

  • Core Workers – I have been researching how to scale up service check execution on some of our bigger instances. mod-gearman has until now been the tool of choice, but with Core Workers, Nagios natively steps up to the task. The legacy forking-for-each-check model was unsurprisingly hitting limits in some places, and 4.x replaces it with worker processes that have check execution delegated to them. There is a massive performance gain, and I'm looking to leverage that rather than integrating with mod-gearman.
  • Query Handlers – This feels like baked-in MK Livestatus. It's made available via a socket. Unlike Livestatus, it doesn't yet have a lot of fancy features and is mostly basic at the moment. I'd expect it to get a lot of attention in future versions.

Among other things, I'm looking forward to integrating Multisite into our infrastructure. We have close to a dozen Nagios instances here at Mozilla, and our primary interface to each is via IRC bots. As one would imagine, that doesn't scale well and isn't ideal for dealing with mass changes. This is where Multisite comes in very handy. Along with Livestatus, Multisite provides a supercharged way to deal with multiple instances and multiple services/hosts within each. Do try out the demo, because it's hard to put awesome into words :)
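
For readers who haven't used it, Livestatus exposes Nagios state over a socket that speaks a tiny text protocol, which is what makes Multisite (and ad-hoc scripting) so pleasant. A minimal sketch, assuming the unixcat helper that ships with MK Livestatus and a typical socket path:

    # Ask one Nagios instance for every service currently in CRITICAL state.
    # Adjust the socket path to wherever your broker module puts it.
    printf 'GET services\nColumns: host_name description state\nFilter: state = 2\n\n' \
        | unixcat /var/spool/nagios/rw/live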

Some nice talks that stood out:

  • Effective monitoring, by Rodrigue Chakode, who spoke about filtering false alerts from actionable ones and using business processes to monitor the most important elements of a system.
  • Nagios at Etsy, by Avleen Vig, who had an eventful road trip to the conference and discussed some cool things Etsy has done, particularly measuring alert fatigue by correlating alerts with sleep data from Fitbits worn by on-call staff. He also spoke at length about "Monitoring hygiene" and how Etsy went from 300 alerts/day to 45 over the course of two years.

In all, it was a great conference, like last year. Looking forward to a year of 4.x and trying to get the in-house puppet module out on github ;)