## Deploying Tor relays

On November 11, 2014, Mozilla announced the Polaris Privacy Initiative. One key part of the initiative is our support for the Tor network through the deployment of Tor middle relay nodes. On January 15, 2015, our first proof of concept (POC) went live.

TL;DR: here are our Tor relays: https://globe.torproject.org/#/search/query=mozilla

When we started this POC, the requirements we had were:

• the Tor nodes should run on dedicated hardware
• the nodes should be logically and physically separated from our production infrastructure
• use low cost and commoditized hardware
• nodes should be operational within 3 weeks

## Hardware and Infrastructure

• We chose to make use of our spare and decommissioned hardware. That included a pair of Juniper EX4200 switches and three HP SL170zG6 servers (48GB RAM, 2*Xeon L5640, 2*1Gbps NIC).
• We dedicated one of our existing IP Transit providers to the project (2 X 10Gbps).

The current design is fully redundant. This allows us to perform maintenance or survive a node failure without losing all of our traffic. The worst-case scenario is a 50% loss of capacity.

The design also allows us to easily add more servers in the event we need more capacity, with no anticipated impact.

## Building and Learning

There is a large body of knowledge available on building Tor nodes. I read mailing list archives, blog posts, and tutorials. I had exchanges with people already running large relays. There are still data points Mozilla needs to understand before our experiment is complete. This section is a “quick rundown” of some of those data points.

• A single organization shouldn’t be running more than 10Gbps of traffic for a middle relay (and 5Gbps for an exit node).

This seems to be more of a gut feeling from existing operators than a proven value (let me know if I’m wrong), but it makes sense. We do have available transit and capacity. Understanding throughput and resource utilization is a key criterion for us.

Important Note: An operator running relays must use the “MyFamily” option in torrc.  This ensures a user doesn’t bounce through several of your servers.

• Slow ramp up

A new Tor instance (identified by its private/public key pair) will take time (up to 2 months) to use all its available bandwidth. This is explained in this blog post: The lifecycle of a new relay. We will be updating our blog posts and are curious how closely our nodes mirror the lifecycle.

• A Tor process (instance) can only push about 400Mbps.

This is based on mailing list discussions, as we haven’t reached that bandwidth yet. We run several instances per physical server.

• A single public IP can only be shared by 2 Tor instances

This is a security feature that prevents a single person from running a large number of fake nodes, as explained in this research paper. This feature is documented in the Tor protocol specification.

• Listen on well known ports like 80 or 443

This helps people behind strict firewalls to access Tor. Don’t worry about running the process as root (needed to listen on ports < 1024): as long as you have the “User” option in torrc, Tor will drop its privileges after binding to the ports.
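To make the data points above concrete, here is a minimal Python sketch (not part of Mozilla’s Ansible role) that works out how many Tor instances and public IPs a bandwidth target implies and renders a basic torrc for one instance. The 400Mbps and 2-instances-per-IP figures come from the text above; the nickname, address, user name and fingerprints are placeholders chosen purely for illustration.

```python
import math

MAX_MBPS_PER_INSTANCE = 400  # per-process ceiling quoted above, not measured here
MAX_INSTANCES_PER_IP = 2     # Tor instances allowed to share one public IP


def plan_instances(target_mbps):
    """Return (instances, public_ips) needed to carry target_mbps."""
    instances = math.ceil(target_mbps / MAX_MBPS_PER_INSTANCE)
    public_ips = math.ceil(instances / MAX_INSTANCES_PER_IP)
    return instances, public_ips


def render_torrc(nickname, address, or_port, family, user="tor"):
    """Render a minimal middle-relay torrc for one Tor instance.

    family lists the fingerprints of every relay run by the same operator.
    The user name is a placeholder; use the account your packages create.
    """
    lines = [
        f"Nickname {nickname}",
        f"Address {address}",
        f"ORPort {or_port}",          # 80 or 443 helps users behind strict firewalls
        "ExitPolicy reject *:*",      # middle relay only, never an exit
        f"User {user}",               # Tor drops root privileges after binding
        f"MyFamily {','.join(family)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One server with 2 x 1Gbps NICs, i.e. 2000 Mbps of raw capacity.
    print(plan_instances(2000))
    family = ["$" + "A" * 40, "$" + "B" * 40]  # placeholder fingerprints
    print(render_torrc("examplerelay1", "192.0.2.10", 443, family))
```

With these assumptions, filling a single server’s two 1Gbps links would take five instances spread over three public IPs, which is why we run several instances per physical server.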

## Automation

We decided to use Ansible for configuration management.  A few things motivated us to make that choice.

• There was an existing ansible-tor role very close to what we needed to accomplish (and here is our pull request with our fixes and additions).
• Some of our teams are using Ansible in production and we (Network Engineering) are considering it.
• Ansible does not require a heavy client/server infrastructure which should make it more accessible to other operators.

And look! Mozilla’s Ansible configuration is available on GitHub!

### Security

The security team helped us a lot on this project. Together we drew up a list of requirements, such as:

• strict firewall filtering
• hardening the operating system (disable unneeded services, good SSH configuration, automatic updates)
• hardening the network devices management plane
• implementing edge filtering to make sure only authorized systems can connect to the “network management plane”

The only place for the infrastructure administration is the jumphost. Systems don’t accept management connection from anywhere else.

It is important to note that many of the security requirements align nicely with what’s considered good practice in general system and network administration. Take enabling NTP or centralized syslog, for example: both matter equally for services to run smoothly, for troubleshooting, and for incident response. The same goes for the principle “make sure the security of the network devices is at least as good as that of the systems”.

We’ve also implemented a periodic security check that runs on these systems. All of them are scanned from the inside for security updates and from the outside for open ports.
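As an illustration of the “outside” half of that check, the following Python sketch tries a TCP connection to each port in a list and reports any open port that is not on an allowlist. It is a simplified stand-in, not the tooling we actually run; the function names and the idea of an allowlist are assumptions made for the example.

```python
import socket


def open_ports(host, ports, timeout=1.0):
    """Return the ports on host that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


def unexpected_ports(host, ports, allowed, timeout=1.0):
    """Return open ports that are not in the allowed set, sorted."""
    return sorted(set(open_ports(host, ports, timeout)) - set(allowed))


if __name__ == "__main__":
    # A relay listening on 80 and 443 should expose nothing else.
    print(unexpected_ports("127.0.0.1", range(1, 1025), allowed={80, 443}))
```

Run from a machine outside the relay network, an empty list means the firewall is exposing only what it should.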

### Metrics

One of the things we’re wondering is how to figure out whether we’re running an efficient relay (in terms of cost, participation in the Tor network, hardware efficiency, etc.). Which metrics should we use, and how?

Looking around, it seems there is no “good answer”. We’re graphing everything we can about bandwidth and server utilization using Observium. The Tor network already has a project to collect relay statistics, called Tor Metrics. Thanks to it, tools like Globe and others can exist.

## Future

Note that we have only just started the relays, and they are far from running at their maximum bandwidth (for the reasons listed above). We will share more information about performance and scaling down the road.

Depending on the results of the POC, we may move the nodes to a managed part of our infrastructure. As long as their private keys stay the same, their reputation will follow them wherever they go, with no new ramp-up period.

On the technical side, there are a lot of possible things to do, like adding IPv6 connectivity. We’re reviewing opportunities to automate more parts of the deployment (like iptables, logs, etc.).


## Thanks

Of course, none of that would have been possible without the help of Van, Michal (who wrote the part about security) and Opsec, Javaun, James, Moritz and the people of #tor!

## Introducing the Mozilla Operations Center

Mozillians,

Some of you may already know of the Mozilla Operations Center (MOC). The purpose of the MOC is to provide support and response for Mozilla’s critical production services and underlying infrastructure. We are a 24×7 IT function operating primarily from the SF and London MozSpaces, staffed by Mozilla employees on rotating shifts. In the past, this responsibility fell on several legacy IT teams and processes which have been inherited by the MOC. These legacy methods of support were spread across various IRC channels, email lists, and Bugzilla components and are now ready to be consolidated.

On Wednesday, September 17th, 2014 these changes will take effect and provide a single identifiable approach to requesting support for production services and reaching the MOC across IRC, email, and Bugzilla.

The MOC can be found on Mozilla’s IRC network in channel: #moc

We understand this announcement may not reach everyone, so we’re prepared to receive the requests that will undoubtedly still arrive via the legacy channels, and to point requesters to the new ones.

# What is the MOC?

The Mozilla Operations Center is a 24×7 IT function whose purpose is to support Mozilla’s critical services. It does this through standard approaches: incident management and proactive monitoring. Effective mid-September 2014, the MOC will have a single identity and will be known only as “MOC” across IRC, email, and Bugzilla.

# Location

The MOC operates in 2 MozSpaces:  one in San Francisco and the other in London.  Through rotational shifts between these 2 hubs, the MOC achieves around the clock coverage every day of the year.  Future state may include an additional hub for a more traditional follow-the-sun model and/or hybrid considerations for better efficiencies.

# Vision

The MOC will be a proactive team that prevents incidents through proficient automation, thorough inspection, and data-driven methods that alert business/product owners to possible problem areas to mitigate. Additionally, the MOC will ensure that proper service levels are achieved and accurately measured, and that corrective actions are applied.

### Vision targets:

• Sustain IT infrastructure operations of Mozilla’s most critical services meeting defined SLA.
• Be transparent and visible to the Project and Mozilla’s key partners for support and communication of service updates.
• Establish and drive repeatable processes (Incident, problem, crisis, reporting) throughout the Project.
• Predictable reaction to service outages, un/planned changes, response, and intended outcomes.

More details about our services and offerings will be provided as they come to fruition.

## Pay No Attention to the Virtualization Behind the Curtain

VMware migrations are often seamless. With vMotion, Storage vMotion, DRS and HA, you can go a long time without a hiccup. Still, there are always tasks that are more difficult than you’d expect:

• “Why can’t I vMotion a template?”
• “Why can’t I stage a VM to reboot into having more RAM?”
• “Why can’t I edit configuration settings on a live VM ahead of a reboot?”

…“Why can’t I move VMs to a new vCenter?” That was the one we were facing: moving from a Windows vCenter to the new Linux-based vCenter Server Appliance (VCSA). It’s what we wanted, but the problem is, that’s the wrong question.

At Mozilla we have two main datacenters with ESX clusters, which were running ESXi5.0.  The hardware and ESXi versions were getting a little long in the tooth, so we got a new batch of hardware, and installed ESXi5.5.

Problem there: how to move just under 1000 VMs from the old vCenters to the new ones.  While a lot of our users are flexible about reboots, the task of scheduling downtime, shutting down, unregistering from the old, registering with the new, tweaking the network, booting back, then updating the inventory system… it was rather daunting.

It took a lot of searching, but we found out we were basically asking the wrong question.  The question isn’t “how do I move a guest to a new vCenter?”, it’s “how do I move a host to a new vCenter (and, oh yeah, mind if he brings some guests along)?”

So, the setup:

On both sides, we have the same datastores set up (not shown), and the same VLANs being trunked in, so really this became a question of how we land VMs/hosts going from one side to another.  We have vSphere Distributed Switches (vDS) on both clusters, which means the network configuration is tied to the individual vCenters.

There may be a way to transfer a VM directly between two disparate vDS’es, but either we weren’t finding it or it’s too much dark magic or risk of failure.  We used a multiple-hop approach that was the right level of “quick”, “makes sense”, and, most importantly, “works”.

On both clusters, we took one host out of service and split its redundant network links into two half-networks. Netops unaggregated the ports, and we made port 1 the uplink for a standard switch, with all of our trunked-in VLANs (names à la “VLAN2”, “VLAN3”). Port 2 was returned to service as the now-nonredundant uplink for the vDS, with its usual VLANs (names à la “Private VLAN2”, “DMZ VLAN3”). On the 5.0 side, we referred to this host as “the lifeboat”, and on the 5.5 side it was “the dock”.

The process at this point became a whole lot of routine work.

• On the old cluster, pick VMs that you want to move.
• Turn DRS to manual so nobody moves where they shouldn’t.
• vMotion the selected VMs into the lifeboat until it is very full.
• Look at the lifeboat’s configuration tab, under vDS, to see which VLANs are in use on this host, and by how many VMs.
• For each VLAN in use on the lifeboat:
    • Under Networking, the trunk vDS, choose Migrate Virtual Machine Networking.
    • Migrate “Private VLAN2” to “VLAN2”. This will only work on the lifeboat, since it’s the only host that can access both forms of VLAN2, so choosing “all” (and ignoring a warning) is perfectly safe here.
    • Watch the VMs cut over to the standard switch (dropping maybe 1 packet).
• Checking the lifeboat’s configuration, nobody is on the vDS now; all VMs on the lifeboat are on the local-to-the-lifeboat standard switch.
• Disconnect the lifeboat from the old vCenter.
• Remove the lifeboat from the old vCenter.
• On the new vCenter, add the lifeboat as a new host.  This takes a while, and even after multiple successful runs there was always the worry of “this time it’s going to get stuck,” but it just worked.
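The per-VLAN step is the one where bookkeeping matters most, so here is a small Python sketch of how the list of network migrations for one boatload could be worked out ahead of time. It does not talk to vCenter at all; it only groups VM names by vDS port group and derives the matching standard-switch name, assuming the naming convention from our examples (“Private VLAN2” maps to “VLAN2”). The function names and VM names are made up for illustration.

```python
import re
from collections import defaultdict


def standard_name(vds_portgroup):
    """Map a vDS port group such as 'Private VLAN2' to its standard-switch twin."""
    match = re.search(r"VLAN\d+$", vds_portgroup)
    if match is None:
        raise ValueError(f"no VLAN suffix in {vds_portgroup!r}")
    return match.group(0)


def migration_plan(vm_networks):
    """Group VMs by vDS port group and pair each group with its target network.

    vm_networks maps VM name -> vDS port group name.
    Returns a list of (source, target, [vm names]) tuples, sorted by source.
    """
    by_group = defaultdict(list)
    for vm, group in vm_networks.items():
        by_group[group].append(vm)
    return [(src, standard_name(src), sorted(vms))
            for src, vms in sorted(by_group.items())]


if __name__ == "__main__":
    lifeboat = {"web1": "DMZ VLAN3", "db1": "Private VLAN2", "db2": "Private VLAN2"}
    for source, target, vms in migration_plan(lifeboat):
        print(f"Migrate {source} -> {target}: {', '.join(vms)}")
```

The same plan, read right to left, describes the reverse network move that has to happen once the VMs reach the new vCenter.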

Once the lifeboat host is added to the new cluster, vMotion all the VMs from the lifeboat onto the dock.  Now work can split in 2 directions: one person sends the lifeboat back; another starts processing the newly-landed VMs that are sitting on the dock.

Sending the lifeboat back is fairly trivial.  Disconnect/remove the lifeboat from the new vCenter, add the host back to the old vCenter, and add the links to the vDS.  At this point, this person can start loading up the next batch of evacuees.

On the receiving side, all the VMs are currently pinned to the dock, since it’s now the only host with a standard switch. All of the VMs there need to have their networks moved to the new vCenter’s vDS. The process is just the reverse of before (“Migrate Virtual Machine Networking” under the networking tab, moving “VLAN2” to “Private VLAN2”). The rest is housekeeping: file the VMs into the right resource pools and folders, update the in-house inventory system to indicate the VMs were moved to a new vCenter, and start the vmware-tools upgrades. As the last step, we’d enable DRS and put the dock in maintenance mode, to eject all the new VMs into the remainder of the new cluster and make room for the next boatload of arrivals.

We had very few problems, but I’ll list them:

• Any snapshots on the VMs were invalid after the move.  We found this the hard way: someone in QA rolled a guest back to a baseline snapshot, only to find the networking lost, because it restored references to the old vDS.  Easily fixed once identified, and we were lucky that it wasn’t a bigger problem since we’d coordinated that move with the user.
• Two VMs had vNICs with manually configured MAC addresses in the 00:50:56 space. These VMs refused to land on the new side, because the address could conflict or not be managed. We had to hot-swap the vNIC to get onto an automatic MAC, at which point each VM moved happily.
• And, of course, human error.  Misclassifying VMs into the wrong place because we were moving so much so fast.

One person would own one boatload, noting which pools/folders they got VMs from, and taking responsibility for putting them in the right place on the far side. All in all, with 2 people, we were able to move 200 VMs across in a full working day, and so we finished evacuating the old vCenter in 5 working days. We only had to coordinate with 2 customers, and took one 15-minute maintenance window (for the manual-MAC-vNIC issue), and even then we didn’t disrupt service.

Around 1000 VMs moved, and nobody noticed.  Just how we wanted it.

## The power of indexes

tl;dr: If you think there’s slowness or performance issues in your application database, ask your local friendly DBAs. Oftentimes we can help, and we like to! (Graphs below for evidence.)
Or if you don’t have a DB team, check out this great guide on how to check for slow queries.

Longer version:

Last week, one of the release engineering staff approached me in regards to some slow performance on our buildbot database systems.

Upon investigation, we realized there was a significant buildup of slow queries during weekdays. In this system, slow queries were defined as any query taking longer than 2 seconds.

After a short investigation, it was easy to determine that a set of queries was repeating about 1-2 times every second, and that each one took 2 seconds to run. These queries gained substantial benefits from the addition of only a few indexes. Since the impact of the change was low, we decided to roll it out to dev/stage and then to production on Tuesday.
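To show the kind of effect an index has, here is a self-contained Python sketch using SQLite from the standard library. It is not the buildbot schema or MySQL: the table, column and index names are invented, and SQLite’s EXPLAIN QUERY PLAN stands in for MySQL’s EXPLAIN. The point is only that the same query goes from scanning the whole table to a lookup through the index.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for sql as a single readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builds (id INTEGER PRIMARY KEY, status TEXT, started INTEGER)")
conn.executemany(
    "INSERT INTO builds (status, started) VALUES (?, ?)",
    [("pending" if i % 50 == 0 else "done", i) for i in range(10000)],
)

sql = "SELECT id FROM builds WHERE status = ?"
before = query_plan(conn, sql, ("pending",))   # no index yet: full table scan

conn.execute("CREATE INDEX idx_builds_status ON builds (status)")
after = query_plan(conn, sql, ("pending",))    # now answered through the index

print("before:", before)
print("after: ", after)
```

On a query that repeats once or twice per second, removing that full scan is exactly the sort of change that makes a slow-query backlog disappear.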

The graphs below show the impact of the change over the course of 3 days (Sunday, Monday, Tuesday) where the change was implemented on the morning of Tuesday the 28th, and again for the last 7 days.

3 days around the change:

Last 7 days:

7 days prior to the change (for perspective):

## A Physical Control Surface for Video Production

At Mozilla we make heavy use of Telestream’s Wirecast to stream video events. While Wirecast has a nice GUI, using a mouse or trackpad to control a video production is far from ideal. In some of our larger venues we have Blackmagic ATEM production video switchers, but for remote events and streaming from smaller venues we’ve been stuck with the default GUI control interface. Until now…

At the Mozilla Festival in London earlier this year I saw some projects using MIDI control surfaces for things other than controlling music. It turns out that grafting a MIDI control surface onto any program with an AppleScript interface is quite easy thanks to Nico Wald’s MidiPipe program.

A few minutes of searching the web showed that none of this is a new idea. Mark over at ttfn.tv implemented a similar solution a couple of years ago.

Mark’s solution used an earlier version of the Korg control surface, and was specific to an earlier Wirecast 4 release. It also used more than a dozen different scripts which made it hard to both understand and maintain.

I’ve done a ground-up rewrite and bundled the scripts and configuration files into something I’m calling Video Director. There are versions for both Wirecast 4 and Wirecast 5, and you can get all the code over on GitHub. (Here’s hoping that someone with better AppleScript Fu than mine will fork these projects.)

Video Director is a script designed to enable control of Wirecast 4 from a Korg nanoKONTROL2 MIDI control surface on Mac OS X computers. The Korg nanoKONTROL2 is a low-cost way to provide a tactile interface which, while not as elegant as a real production video switcher, provides much more tactile feedback than trying to control a video production with a mouse or touchpad.

In addition to simply providing real physical buttons for video switching operations, Video Director also simplifies the process of populating the various control layers of Wirecast with video and graphic content. It will load layer content from a pre-defined directory structure on the host machine, allowing rapid re-configuration of Wirecast for programs with differing content requirements.
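For readers unfamiliar with what a control surface actually sends, here is a rough Python sketch of the idea. Video Director itself is AppleScript driven by MidiPipe, so this is not its code. The sketch decodes a three-byte MIDI Control Change message (status byte 0xB0 plus the channel, then a controller number and a value) and looks up an action; the controller numbers and action names in the table are hypothetical, not the real nanoKONTROL2 or Wirecast mapping.

```python
CONTROL_CHANGE = 0xB0  # upper nibble of the status byte for Control Change

# Hypothetical mapping from controller number to a switcher action.
ACTIONS = {
    32: "select_shot_1",
    33: "select_shot_2",
    41: "go",
}


def decode_control_change(message):
    """Return (channel, controller, value), or None if not a Control Change."""
    if len(message) != 3:
        return None
    status, controller, value = message
    if (status & 0xF0) != CONTROL_CHANGE:
        return None
    return status & 0x0F, controller, value


def action_for(message):
    """Translate a raw MIDI message into an action name, or None."""
    decoded = decode_control_change(message)
    if decoded is None:
        return None
    _channel, controller, value = decoded
    if value == 0:
        # Assumes a button sends a non-zero value on press and 0 on release.
        return None
    return ACTIONS.get(controller)


if __name__ == "__main__":
    print(action_for(bytes([0xB0, 41, 127])))
```

Sliders and knobs send the same message type with a value from 0 to 127, which is why the lack of a scriptable audio level control is so frustrating.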

The functionality of Video Director is limited by the very restricted subset of Wirecast operations for which Telestream has exposed a scriptable interface. The most obvious omission is that there appears to be no way to script the master audio level control, either through the Wirecast API or via System Events scripting. For use with a control surface such as the Korg nanoKONTROL2 with its many sliders and knobs, this is a galling omission. In addition, Wirecast 5 appears to be even less scriptable than Wirecast 4. Here’s hoping that changes in the next couple of releases.

In my last post, a tale of two MySQL upgrades, a few folks asked if I would outline the process we used to upgrade, and what kind of downtime we had.

Well, the processes were different for each upgrade, so I will tackle them in separate blog posts. The first step was to upgrade all our MySQL 5.1 machines to MariaDB 5.5. As mentioned in the previous post, MariaDB’s superior performance for subqueries is why we switched – and we switched back to MySQL for 5.6 to take full advantage of the performance_schema.

It is not difficult to blog about our procedure, as we have documentation on each process. My first tip would be to do that in your own environment. This also enables other folks to help, even if they are sysadmins and not normally DBAs. You may notice the steps contain items that might be “obvious” to someone who has done maintenance before – we try to write them detailed enough that if you were doing it at 3 am and a bit sleep-deprived, you could follow the checklist and not miss anything. This also helps junior and aspiring DBAs not miss any steps as well.

The major difference between MySQL 5.1 and MySQL 5.5 (and its forks, like MariaDB) is that FLOAT columns are handled differently. On MySQL 5.1, a float value could be in scientific notation (e.g. 9.58084e-05) and in 5.5, it’s not (e.g. 0.0000958084). This makes checksumming difficult, as all FLOAT values will show differences even when they are the same number. There is a workaround for this, devised by Shlomi Noach.
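The following Python sketch illustrates why the two notations break a checksum, and one way to see past the difference. It is not Shlomi Noach’s workaround, and it is not how the checksum tool works internally; it simply compares a checksum over the raw text with a checksum over a canonical decimal rendering of each value. The function names are made up for the example.

```python
import zlib
from decimal import Decimal


def naive_checksum(values):
    """Checksum the values exactly as the server printed them."""
    return zlib.crc32(",".join(values).encode())


def normalized_checksum(values):
    """Checksum a canonical, non-scientific rendering of each number."""
    canonical = [format(Decimal(v).normalize(), "f") for v in values]
    return zlib.crc32(",".join(canonical).encode())


row_51 = ["9.58084e-05"]    # how 5.1 prints the value
row_55 = ["0.0000958084"]   # how 5.5 prints the same value

print(naive_checksum(row_51) == naive_checksum(row_55))
print(normalized_checksum(row_51) == normalized_checksum(row_55))
```

The text differs, so the raw checksums differ, even though both strings describe the same number.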

We have an n+1 architecture for databases at Mozilla – this means that we have an extra server. If we need 1 master and 3 slaves, then n+1 is 1 master and 4 slaves. Because of this, there are 2 different ways we upgrade – the first slave we upgrade, and subsequent slaves/masters.

These steps are copied and pasted from our notes, with minor changes (for example, item #2 is “send out maintenance notices” but in our document we have the e-mail addresses to send to).

Assumptions: Throughout these notes we use ‘/var/lib/mysql’, as that is our standard place for MySQL. You may need to change this to suit your environment. We are also using Red Hat Enterprise Linux for our operating system, so this procedure is tailored to it (e.g. “yum install/yum remove”). We control packages using the freely available puppet mysql module we created.

For the first slave
The overall procedure is to perform a logical backup of the database, create a new empty installation of the new server version, and import the backup. Replication does work from MySQL 5.1 to MariaDB 5.5 and back (at least, on the 25 or so clusters we have, replication worked in both directions. Your mileage may vary).

1. Make sure the slave has the same data as the master with checksums (the previous checksum is fine, they should be running every 12 hours).

2. Send out maintenance notices.

3. Take the machine out of any load balanced services, if appropriate.

4. Set appropriate downtimes in Nagios.

5. Start a screen session on the server.

6. Do a SHOW PROCESSLIST to see if there are any slaves of the machine. If so, move them to another master if they are needed. [we have a different checklist for this]

7. Do a SHOW SLAVE STATUS to see if this machine is a slave.

    1. If this machine is a slave, ensure that its master will not delete its binlogs while the upgrade is occurring.

    2. If this machine is a slave, do a SLAVE STOP; and copy the master.info file somewhere safe [or the slave_master_info table if using that].

8. Stop access to the machine from anyone other than root (assuming you are connecting from root). Run this only once, since running the same statement again (as in step 30) restores access:
UPDATE mysql.user SET password=REVERSE(password) WHERE user!='root'; FLUSH PRIVILEGES;

9. See what the default character set is for the server and databases:
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'character_set_database';
SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE DEFAULT_CHARACTER_SET_NAME!='utf8' AND SCHEMA_NAME NOT IN ('mysql');
If applicable, change the server defaults to UTF8 and change databases to utf8 with ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;

10. Check to see how big the data is:
mysql> SELECT SUM(DATA_LENGTH)/1024/1024/1024 AS sizeGb FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA!='information_schema';

11. Determine how you can export the data, given the size. You may be able to export without compression, or you may need to do a mysqldump | gzip -c > file.sql.gz, then compress the old data files instead of just moving them aside.

12. Do a du -sh * of the datadir and save for later, if you want to compare the size of the database to see how much space is returned after defragmenting.

13. Export the data from all databases, preserving character set, routines and triggers. Record the time for documentation’s sake. I’m assuming the character set from step 9 is utf8 (if it’s something like latin1, you’ll need to put --default-character-set=latin1 in the command). If the machine has slaves, make sure to use --master-data=1. If you need to compress, change the shell command accordingly:
time mysqldump --all-databases --routines --triggers --events > $(date +%Y-%m-%d)_backup.sql

14. Stop MySQL.

15. Copy the config file (usually /etc/my.cnf) to a safe place (like /etc/my.cnf.51).

16. Do a rpm -qa | egrep -i "percona|mysql". Do a yum remove for the mysql/percona packages. It’s OK if it also removes related packages, like perl-DBD, but make a note of them, because you will want to reinstall them later. Sample:
yum remove Percona-Server-client Percona-Server-shared-compat Percona-XtraDB-Cluster-devel Percona-Server-server

17. Move the /var/lib/mysql directory to /var/lib/mysql-old. Compress any files that need compression (for example, if you need the space to decompress the sql file). If you absolutely cannot keep the files, see if you can copy them somewhere. We really want to preserve the old data directory just in case we need to revert.

18. Decompress the sql file, if applicable.

19. Install the proper packages by changing puppet to use "maridb55" instead of "mysql51" or "percona51" [this may be different in your environment; we use the freely available puppet mysql module we created]. Verify with rpm -qa | egrep -i "percona|mysql|maria"

20. Run mysql_install_db

21. Make any changes to /etc/my.cnf (e.g. run puppet). When going from MySQL 5.1 to 5.5, there are no particular global changes Mozilla made; when we went from MySQL 5.0 to MySQL 5.1, we did a global change to reflect the new slow query log options.

22. chown -R mysql:mysql /var/lib/mysql/

23. chmod 775 /var/lib/mysql

24. Start MySQL and check the error logs for any warnings. Get rid of any warnings/errors, and make sure MySQL is running.

25. Turn off binary logging. Import the export, timing how long it takes, for reference:
time mysql < YYYY-MM-DD_backup.sql

26. Restart MySQL and look for errors; you may need to run mysql_upgrade.

27. Turn on binary logging, if applicable.

28. Test.

29. If this machine was a slave, re-slave it. Let it catch up, making sure there are no data integrity errors, and no replication errors.

30. Reinstate permissions on the users:
UPDATE mysql.user SET password=REVERSE(password) WHERE user!='root'; FLUSH PRIVILEGES;

31. Re-slave any slaves of this machine, if needed.

32. Turn back on Nagios, making sure all the checks are green first.

33. Run a checksum on the master to propagate to this slave, and double-check data integrity on the slave. Note that you will want to use --ignore-columns in the checksum, with the output of the command below, to avoid false positives from the scientific notation change (see https://blog.mozilla.org/it/2013/01/17/mysql-5-1-vs-mysql-5-5-floats-doubles-and-scientific-notation/). Find FLOAT/DOUBLE fields to ignore in the checksum:
SELECT GROUP_CONCAT(DISTINCT COLUMN_NAME) FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('float','double') AND TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema');

34. Put the machine back into the load balancer, if applicable.

35. Inform folks the upgrade is over.
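Because the export command changes with the character set, the presence of slaves and the need for compression, it can help to generate it rather than edit it by hand at 3 am. The Python sketch below only builds the command string from the options mentioned in the checklist; it does not run anything. The function name and its parameters are illustrative and are not part of our tooling.

```python
import datetime
import shlex


def dump_command(today, charset="utf8", has_slaves=False, compress=False):
    """Build the mysqldump shell command for the logical export step."""
    args = ["mysqldump", "--all-databases", "--routines", "--triggers", "--events"]
    if charset != "utf8":
        # Only needed when the server default is something like latin1.
        args.append(f"--default-character-set={charset}")
    if has_slaves:
        # Records the binlog position so slaves can be re-pointed later.
        args.append("--master-data=1")
    outfile = f"{today:%Y-%m-%d}_backup.sql"
    command = "time " + " ".join(shlex.quote(arg) for arg in args)
    if compress:
        return f"{command} | gzip -c > {outfile}.gz"
    return f"{command} > {outfile}"


if __name__ == "__main__":
    print(dump_command(datetime.date.today()))
    print(dump_command(datetime.date.today(), charset="latin1",
                       has_slaves=True, compress=True))
```

Printing the command first, then pasting it into the screen session, keeps a human in the loop for the step that matters most.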

On the first upgrade, we did what is usually recommended - do a logical export with mysqldump, and then an import. With other upgrades in the same replication hierarchy, we can take advantage of Xtrabackup to stream the new version directly to the machine to be upgraded.

The general procedure here is similar to the above, except that a logical export is not taken. After preparation steps are taken, a new empty MariaDB 5.5 server is installed. Then we use xtrabackup to backup and restore the existing MariaDB 5.5 server to the machine we are upgrading.

For subsequent slaves, and the master

1. Coordinate with affected parties ahead of time.

2. Send out any notices for downtime.

3. Take the machine out of any load balanced services, if appropriate. If the machine is a master, this means failing over the master first, so that this machine becomes a regular slave. [we have a different checklist for how to failover]

4. Set appropriate downtimes in Nagios, including for any slaves.

5. Start a screen session on the server.

6. Do a SHOW PROCESSLIST to see if there are any slaves of the machine. If so, move them to another master if they are needed.

7. Do a SHOW SLAVE STATUS to see if this machine is a slave.

    1. If this machine is a slave, ensure that the master will not delete its binlogs while the upgrade is occurring.

    2. If this machine is a slave, do a SLAVE STOP; and copy the master.info file somewhere safe.

8. Save a list of grants from pt-show-grants, just in case there are users/permissions that need to be preserved. [this is done because sometimes masters and slaves have different users, though we try to keep everything consistent]

9. Figure out how big the backup will be by doing a du -sh on the datadir of the already-upgraded machine to be backed up, and make sure the new machine has enough space to keep the old version and have the new version as well.

10. Stop MySQL on the machine to be upgraded.

11. Copy the config file (usually /etc/my.cnf) to a safe place (like /etc/my.cnf.51).

12. Do a rpm -qa | egrep -i "mysql|percona". Do a yum remove for the mysql packages (at least mysql-server, mysql). It's OK if it also removes related packages, like perl-DBD, but make a note of them, because you will want to reinstall them later.

13. Move the /var/lib/mysql directory to /var/lib/mysql-old. Compress any files that need compression. If you absolutely cannot keep the files, see if you can copy them somewhere. We really want to preserve the old data directory just in case we need to revert.

14. Install the proper packages by changing puppet to use "maridb55" instead of "mysql51" or "percona51", running puppet manually. Verify with rpm -qa | egrep -i "percona|mysql|maria"

15. Run mysql_install_db

16. Make any changes to /etc/my.cnf (or run puppet). When going from MySQL 5.1 to 5.5, there are no particular changes.

17. chown -R mysql:mysql /var/lib/mysql/

18. chmod 775 /var/lib/mysql

19. Start MySQL and check the error logs for any warnings. Get rid of any warnings/errors, and make sure MySQL is started.

20. Stop MySQL, and move or delete the datadir that was created on upgrade.

21. If you are directly streaming the backup to the machine to be upgraded, do this on the machine to be upgraded:
cd $DATADIR
nc -l 9999 | tar xfi -

22. On the machine to be backed up (that is already upgraded), in a screen session, making sure you get any slave info:
time innobackupex --slave-info --stream=tar $DATADIR | nc (IP/hostname) 9999

23. Once xtrabackup is complete, fix permissions on the datadir:
chown -R mysql:mysql /var/lib/mysql/
chmod 775 /var/lib/mysql

24. Prepare the backup:
time innobackupex --apply-log /var/lib/mysql

25. Fix permissions on the datadir again:
chown -R mysql:mysql /var/lib/mysql/
chmod 775 /var/lib/mysql

26. Restart MySQL and look for errors.

27. Test.

28. If this machine was a slave, re-slave it. Let it catch up, making sure there are no data integrity errors, and no replication errors.

29. Re-slave any slaves of this machine, if needed.

30. Turn back on Nagios, making sure all checks are green first.

31. Put the machine back into the load balancer, if applicable.

32. Inform folks the upgrade is over.

It's long and detailed, but not particularly difficult.

## IT goings-on

Hello everybody – it’s time once again for the weekly IT update.

First up, the Mozilla Operations Centre (or MOC) is up and running!  This new team currently comprises seven employees from all over the world, including the USA, India, and Europe.  They’ll be handling such critical functions as monitoring, tier 1 and 2 support, and issue tracking and escalation for our entire infrastructure.  What’s more, they’ll be acting as a sort of interface layer for the more technical aspects of Mozilla’s mobile partner relationships.  Expect more news about this highly important team in the weeks and months to come.

The RelOps team stopped testing Firefox on OS X 10.7 due to falling usage and the similarity of coverage between 10.8 and 10.6.  They re-purposed all 83 of the Mac Minis running 10.7 to now run and test on 10.6 – effectively doubling the available 10.6 test capacity.  This had the net effect of reducing 10.6 wait times considerably, thus improving the overall level of service and – most importantly – increasing the satisfaction of developers testing against those targets.

They’re also making progress on the project to standardise all of their virtual machines on a single platform.  Already, all of the KVM “odd ducks” at SCL3 have been replaced, which is a big win in terms of paying off technical debt for all of IT.  Good work, RelOps!

On the topic of performance improvements, Solarce from the WebOps team cleaned out tonnes of old jobs and tasks from our in-house Jenkins system, which has dramatically reduced start-up and task run times, and improved stability overall.

Mozillians physically working in our offices were plagued by a small but highly irritating problem: the tablets used to check and book conference rooms were skewed by up to eight minutes, which meant it was sometimes tricky to reserve rooms properly.  Thankfully, the Desktop and NetOps teams came to the rescue, and the problem is no longer!

Finally, in case you missed it earlier this week, Sheeri from the Database team put up an interesting post about a recent run of MySQL upgrades – highly recommended!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!

## A Tale of Two MySQL Upgrades

At the beginning of 2013, Mozilla’s MySQL databases were a mix of MySQL 5.0, Percona’s patched MySQL 5.1, Percona’s patched MySQL 5.5 and MariaDB 5.5. MySQL 5.1 was released in November 2008 – so at the beginning of the year, we still had databases running software that had seen no new major features in 4 years. Currently we have almost all our databases at Oracle’s MySQL 5.6 – the only stragglers are our cluster running TokuDB and a few machines that are no longer in use. Here’s a graph showing the state of our machines – you can see that in the first half of the year we concentrated on upgrading our 5.0 and 5.1 servers to 5.5, and then in the second half of the year we upgraded everything to MySQL 5.6:

After running some tests, we determined that MariaDB 5.5 was the best option for us and our particular workload. For most of our servers, it did not matter whether we used Percona, MariaDB or Oracle’s MySQL, but our Bugzilla servers really benefited from MariaDB’s better subquery optimization, so we went with that. We had already set up some Percona 5.5 servers over the spring/summer of 2012, when we moved some of our infrastructure to a new data center.

We upgraded to MySQL 5.5 to be on a recent version of MySQL. In the middle of the year, we had a choice – should we stay where we were, or should we upgrade? We had no particular directive from developers to upgrade for the new MySQL 5.6 features. However, we have been doing more and more digging into our systems, and we really wanted the performance_schema features so we could dig even more. We want to be able to parse queries in real time, perhaps with Anemometer, without having to take an offline log file and run pt-query-digest on it.
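As a rough illustration of the kind of digging performance_schema allows, the sketch below pulls the most expensive statement digests from `performance_schema.events_statements_summary_by_digest`, a table available in MySQL 5.6. The variable and function names, the limit of 10 rows, and the reliance on a `mysql` client that picks up credentials from its default option files are all assumptions for illustration, not a record of what we run.

```shell
#!/bin/sh
# The query text lives in a variable so it can be inspected without a server.
# SUM_TIMER_WAIT is reported in picoseconds, hence the division by 1e12.
TOP_DIGESTS_SQL='SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT/1e12 AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10'

# Hypothetical helper: run the query with the stock mysql client.
top_digests() {
    mysql --batch -e "$TOP_DIGESTS_SQL"
}
```

Calling `top_digests` on a live 5.6 server returns normalized query text alongside execution counts and total time, with no offline log file needed.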

So, we chose to upgrade to MySQL 5.6. Unfortunately, there were no other GA products to test against – by mid-2013, neither MariaDB nor Percona had a GA 5.6 product, so our bake-off was functional only, not performance-related. Oracle’s MySQL 5.6 passed with flying colors, and so we proceeded to upgrade.

Now we have a recent and consistent version of MySQL installed that we can work with to gain insights into our systems. A pretty great goal to have met for 2013!

## IT goings-on

Hello all and welcome to this week’s IT update.  Instead of the usual wrap-up of interesting tidbits from across the team, this post is dedicated to the recent major maintenance event at one of our two primary data centres.  Let’s dive in!

Fact: Mozilla leverages a mind-boggling variety of technical infrastructure.  The sheer breadth of machines and configurations is difficult to fully grasp.  This infrastructure is situated in a number of physical locations, including data centres in the USA and China, as well as our offices around the world.  Over the past couple of years, one of the major long-term projects at Mozilla IT has been to consolidate and industrialise these physical locations and their contents – no small feat, and a project that will remain on-going for the foreseeable future.

Today we have two primary data centres on the North American continent: PHX1 and SCL3.  These data centres are treated a little bit differently than our other locations, as they are not only our largest installations, but are specifically designed to provide highly stable, highly available environments – in other words, no downtime.  One of the key elements in this architecture is called the core network stack, which refers to the networking equipment that is responsible for routing all of the traffic between a given data centre and the Internet at large.  The stack needs to be as reliable as humanly (or machinely) possible – without it, there is no communication with the outside world.

Earlier this year a problem was detected in the stack at SCL3.  This problem had a direct impact on the stability and reliability of the core network, and if left untreated, would have eventually resulted in a major unplanned outage.  In fact, small service interruptions and other events had already been tied to this issue, and while work-arounds were implemented, the fact remained that this was a ticking time bomb.  Ultimately the decision was made to simply remove the problematic hardware entirely from the stack.  While this was certain to solve the issue, it also meant incurring the one thing that the HA architecture was designed to avoid: downtime.

Many of the products and services that Mozilla provides rely on SCL3, including – but not limited to – such things as product delivery (i.e. Firefox downloads, updates, and the like), the build network (for building and testing those deliverables), the Mozilla Developer Network, and so forth.  We worked with key stakeholders from across the company to explain the situation and come up with plans for how to deal with the impending outage.  These plans ranged from the relatively simple (such as putting up a “hardhat”-style message explaining the situation), to the non-trivial (such as replicating the entire repository infrastructure at PHX1), to the heroic (implementing product delivery entirely in the cloud).

Furthermore, we weren’t content with simply addressing the problematic issue.  Since we were going to be experiencing a service outage no matter what, we worked with our vendor to come up with a new architecture – one designed so that even if we have to perform major network manipulations again, we should be able to avoid total blackouts in the future.  This helped to turn what was “merely” a problem-solving exercise into a real opportunity to extend and improve our service offering.

As part of this planning process, we set up a lab environment with hardware supplied by our vendor, which allowed us to practice with the mechanisms and manipulations ahead of time.  I can’t stress enough how critical this was: knowing what to expect going into it in terms of pitfalls and processes was absolutely essential.  This helped us to form realistic expectations and set up a time-line for the maintenance event itself.

There were emails; there were meetings; there were flowcharts and diagrams to last a lifetime – but at the end of the day, how did the event actually turn out?  Corey Shields with the details:

All in all, the maintenance was a success.  The work was completed without any major problems and done in time.  Even in a successful event like this one, we have a postmortem meeting to cover what was done well (to continue those behaviors in the future), and what needs improving.  We identified a few things that could have been done better, mostly around communication for this window.  Some community stakeholders were not notified ahead of time, and the communication itself was a bit confusing as to the network impact within the 8 hour window.  We have taken this feedback and will improve in our future maintenance windows.

There are any number of interesting individual stories that can (and should) be told about this maintenance, so keep watching this blog for more updates!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!

## IT goings-on

Greetings people of Earth (and elsewhere, perhaps) and welcome to another weekly Mozilla IT update.

Some big news this week from Chris Turra and the WebOps team: our internal PaaS has, at long last, passed the security review phase and is ready for immediate production use!  Many of you have been using the PaaS in a development capacity for some time, so for those who are already familiar with the environment, you’ll be happy to learn that a number of high-availability back-end services have been deployed in order to ensure that the service is production-ready (phew, that’s a lot of hyphenated words).  Frankly, there’s way too much goodness here to cram into a single paragraph – keep watching this space for a post dedicated to the new PaaS.

Also on a WebOps tip, Jacques Uber continues to astound and amaze with new updates to “Inventory”, which is Mozilla IT’s fully open-sourced infrastructure management application.  The newest functionality includes a GUI process for assigning Static Registrations – check the repo for more details.

The Release Engineering team has been hard at work rolling out the new imaging techniques and metrics tools to the test and Windows infrastructure (as noted previously), but somehow Amy Rich found the time to participate in a panel discussion at LISA ’13 entitled “Women in Advanced Computing”.  This was part of a small series on the topic, the other session being a half-day workshop hosted in part by Sheeri Cabral of the Database team.  Both sessions were very well received by all accounts!

Finally, a preview into the next post, wherein we’re going to talk a little bit about the big data centre maintenance that occurred this past week-end – so stay tuned!

As always, if you have any questions or want to learn more, feel free to comment below or hop on to #it on irc.mozilla.org.  See you next time!