## Desks of IT – part 1

After seeing a post from our webdev colleagues a while back, we in IT are jumping on the battlestation post bandwagon. IT is scattered around the globe, with a good mix of office employees and remoties.

With all of the responses I got for such a post, I will break this up into a couple of posts for the sake of those reading through a feed.

Melissa O’Conner
Location: New Hampshire, USA
Works on: EUS Management, IT Project Management

(she keeps a smartsheet on her wall)

Richard Weiss
Location: San Francisco Office
Works on: IT AWS architecture

Richard has opted for one of the new fancy adjustable standing desks.

Ludovic Hirlimann
Location: France
Works on: Mozilla Operations Center (MOC)

“I don’t drink coffee – my coffee used to be ‘Tales of the Arabian Nights’. It needed servicing first, hence it being open.”

Greg Cox
Location: Oregon
Works on: Storage & Virtualization

“one-of-a-kind Mk1 lap, built it myself” …  “work is anywhere the VPN isn’t blocked”

Ashish Vijayaram
Location: San Francisco
Works on: Mozilla Operations Center (MOC)

(Ashish had just returned from paternity leave)  “Thanks to *some* enterprising colleagues, my desk overnight turned into a hybrid working desk + changing table for infants. Designed for oncall ops and dadops – deal with multiple types of incidents – all from the same desk!”

Linda Ypulong
Location: San Francisco
Works on: Mozilla Operations Center (MOC)

“Going for the true minimalist approach – I tend to either be in meetings or sit in the center of the MOC …”

Ryan Watson
Location: UK
Works on: Web Operations

“Corey for scale”
(editor’s note: sigh…)

Julien Vehent
Location: Florida, USA
Works on: IT Security

“The Gopher”, in its natural state 🙂

Philippe Chiasson
Location: Quebec, CA
Works on: AWS / Web Operations

“Believe it or not, but I happen to like the colour green! What’s missing here is the 4 children, cat and dog roaming around, obviously!”

## CVS & BZR services decommissioned on mozilla.org

tl;dr: CVS & BZR (aka Bazaar) version control systems have been decommissioned at Mozilla. See https://archive.mozilla.org/pub/mozilla.org/vcs-archive/README for final archives.

As part of our ongoing efforts to ensure that the services operated by Mozilla continue to meet the current needs of Mozilla, the CVS and Bazaar (BZR) version control systems have been decommissioned.

This work took coordinated effort between IT, Developer Services, and the remaining active users of those systems. Thanks to all of them for their contributions!

Final archives of the public repositories are currently available at https://archive.mozilla.org/pub/mozilla.org/vcs-archive/. The README file has instructions for retrieval of non-public repositories.

NOTE: These URLs are subject to change; please refer back to this blog post for the up-to-date link.

## Product Delivery Migration: What is changing, when it’s changing and the impacts

As promised, the FTP Migration team is following up from the 7/20 Monday Project Meeting where Sean Rich talked about a project that is underway to make our Product Delivery System better.

As a part of this project, we are migrating content out of our data centers to AWS. In addition to storage locations changing, namespaces will change and the FTP protocol for this system will be deprecated. If, after reading this post, you have any further questions, please email the team.

Action: The ftp protocol on ftp.mozilla.org is being turned off.
Timing: Wednesday, 5th August 2015.
Impacts:

After 8/5/15, ftp protocol support for ftp.mozilla.org will be completely disabled and downloads can only be accessed through http/https.
Users will no longer be able to just enter “ftp.mozilla.org” into their browser, because this action defaults to the ftp protocol. Going forward, users should start using archive.mozilla.org. The old name will still work but needs to be entered in your browser as https://ftp.mozilla.org/

Action: The contents of ftp.mozilla.org are being migrated from the NetApp in SCL3 to AWS/S3 managed by Cloud Services.
Timing: Migrating ftp.mozilla.org contents will start in late August and conclude by end of October. Impacted teams will be notified of their migration date.
Impacts:

Those teams that currently manually upload to these locations have been contacted and will be provided with S3 API keys. They will be notified prior to their migration date and given a chance to validate their upload functionality post-migration.

## Introduction

If you’re working from home, a coffee shop or even an office and are experiencing “issues with the Internet”, this blog post might be useful to you.
We’re going to go through some basic troubleshooting and tips to help solve your problem. At worst, you’ll gather data that your support team can use to investigate.

Something to keep in mind while reading is that everything is packets. Whether you are downloading a webpage, an image or an email, or video-conferencing, you have to imagine your computer splitting everything into tiny numbered packets and sending them to their destination through pipes, where they will be reassembled in order.
Network issues are those little packets not getting properly from A to Z, because of full pipes, bugs, overloaded computers, external interference and a LOT more possible reasons.
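To make that mental model concrete, here is a toy sketch in Python. It has nothing to do with real TCP/IP internals; it only illustrates the numbering-and-reassembly idea:

```python
# Split a message into tiny numbered "packets", deliver them out of
# order, and reassemble them by sequence number at the other end.
def packetize(data: bytes, size: int = 4):
    return [(seq, data[i:i + size])
            for seq, i in enumerate(range(0, len(data), size))]

def reassemble(packets):
    return b"".join(chunk for _, chunk in sorted(packets))

packets = packetize(b"everything is packets")
packets.reverse()  # simulate out-of-order delivery
assert reassemble(packets) == b"everything is packets"
```

A network issue is then any of those tuples arriving late, damaged or not at all.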

## Easy tests

We can start by running some easy online tests. Even if they are not 100% accurate, you can run them in a few clicks from your browser (or even your phone) and they can point you in the right direction.

Speedtest is a good example. Hit the big button and wait; don’t run any downloads or bandwidth-heavy applications in the meantime. It is best to know what a “normal” value looks like for your connection, so you have something to compare against.

As some sneaky providers can prioritize connections to the famous Speedtest, you can also try a less-known website like ping.online.net. Start the download of a large file and check how fast it goes.
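The same check can be done from a terminal with curl. The test-file URL below is only an example; any large file will do:

```shell
# Download a large file, discard the contents, and print the average
# transfer speed (curl reports it in bytes per second).
curl -s -o /dev/null -w 'average speed: %{speed_download} bytes/sec\n' \
  http://ping.online.net/100Mo.dat
```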

Note that the connection is shared between all the connected users, so one user can easily monopolize the bandwidth and clog the pipe for everyone else. This is less likely to happen in our offices, where we try to have pipes large enough for everyone, but it can happen in public places. In that case there is not much you can do.

Next is a GREAT one: Netalyzr. You will need Java, but it’s worth it (first time I have ever said that). There is also an Android app, but it’s better to run the test from the device experiencing the issue (though it’s also interesting to see how good your mobile/data “Internet” really is). Netalyzr will run TONS of tests (DNS, blocked ports, latency, etc.) and give you a report that is quite easy to understand or share with a helpdesk. See Mozilla’s Paris office for an example.

Some ISPs are starting to roll out a new version of the Internet protocol, called IPv6 (finally!). As it’s “new” to them, there can be some issues. To check that, run the test on Test-ipv6.

## Basic connectivity

If the previous tests show that something is wrong, or you still have a doubt, the following tools test basic connectivity to the Internet, as well as the path and basic health of each network segment between you and a distant server.

“ping” sends a probe each second to a target and reports how long the round trip takes. As an indication, a typical time between Europe and the US west coast is ~160ms, US east coast to US west coast ~90ms, Taipei to Europe ~300ms. If yours is much higher than that, you might indeed be seeing some “laggish/slow Internet”.

“traceroute” is a bit different; it does the same but shows you the intermediate hops in between source and destination.

The best tool in this domain is called “mtr”, a great combination of the previous 2: for each hop, it runs a continuous ping. Add --report so you can easily copy and paste the result if you need to share it. Some higher latency or packet loss on just a hop or two is usually not an issue: network devices are not built to reply to these probes, and only do so when they are not too busy with their main job of forwarding packets. A good indicator of a real issue is when many hops (and especially your target) start showing packet loss or higher latency.
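For concreteness, this is how the three tools are typically invoked (example.com stands in for whichever server you are testing against, and mtr may need installing first):

```shell
# 5 probes, then round-trip statistics.
ping -c 5 example.com

# Every hop on the path, probed a few times each.
traceroute example.com

# Both at once; --report prints a one-shot table that is easy to
# paste into a support ticket.
mtr --report example.com
```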

If the numbers are high on the first hop, something is wrong between your computer and your home/office/bar’s Internet gateway.

If the issues appear “around the middle”, it’s a problem with your ISP or your ISP’s ISP, and the only thing you can do is call their support to complain, or change ISPs when possible. That is what happened with Netflix in the US.

If the numbers are high near the very end, it’s probably an issue with the service you’re testing; try contacting them.

## Wireless

An easy one, when possible: if you are on wireless, try switching to a wired network and re-running the tests mentioned above. It might be surprising, but wireless isn’t magic 🙂

Most home and small-shop wifi routers (labeled “BGN”, i.e. 2.4GHz) can only use 1 of 3 non-overlapping frequencies/channels to talk to their clients, and clients have to wait for their turn to talk.
So imagine your neighbor’s wifi is on the same frequency. Even if your wifi distinguishes your packets from your neighbor’s, they still have to share the channel. If you have more than 2 neighbors (wireless networks visible in the wifi list), you will have a non-optimal wireless experience, and it gets worse quickly as more networks pile on.

To solve/mitigate this you can use an app like “Wifi Analyzer” on Android. The X axis shows the frequencies (only channels 1, 6 and 11 are really usable, as they don’t overlap) and the Y axis shows how noisy each one is. Locate yours; then, if another channel is less busy, go into your wireless router’s settings and move to that one.

If they are all busy, the last option is to buy a router that supports a more recent standard, labeled “ABGN” (or now “AC”), which uses the 5GHz band. Most high-end phones and laptops support it as well. That range has around 20 usable channels instead of 3.

Another common issue in public places is when some people can’t connect at all while others have no issue. This is usually due to a mechanism called DHCP that allocates an address (you can think of it as a slot) to your device. Default settings are made for small networks and remember slots for a long time, even when they are not used anymore. It’s usually possible to reduce that “remember” time (the lease time) and/or enlarge the pool of slots in the router’s settings.
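Concretely, on routers whose DHCP server is dnsmasq (common on home gear and OpenWrt), both knobs are a single line. The address range and the 2-hour lease below are illustrative values, not recommendations:

```
# /etc/dnsmasq.conf: a larger pool of addresses with a shorter lease time
dhcp-range=192.168.1.50,192.168.1.250,2h
```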

## Wired

These are less complicated to troubleshoot. You can start by unplugging everything else connected to your router and seeing if things improve. If not, the issue might come from the router itself. Also be careful not to create physical loops in your network (like plugging 2 cables between your router and a switch).

## Deploying tor relays

On November 11, 2014 Mozilla announced the Polaris Privacy Initiative.  One key part of the initiative is us supporting the Tor network by deploying Tor middle relay nodes.  On January 15, 2015 our first proof of concept (POC) went live.

TL;DR: here are our Tor relays: https://globe.torproject.org/#/search/query=mozilla

When we started this POC, the requirements we had were:

• the Tor nodes should run on dedicated hardware
• the nodes should be logically and physically separated from our production infrastructure
• use low cost and commoditized hardware
• nodes should be operational within 3 weeks

## Hardware and Infrastructure

• We chose to make use of our spare and decommissioned hardware.  That included a pair of Juniper EX4200 switches and three HP SL170zG6 (48GB ram, 2*Xeon L5640, 2*1Gbps NIC)
• We dedicated one of our existing IP Transit providers to the project (2 X 10Gbps).

The current design is fully redundant.  This allows us to complete maintenance or suffer a node failure without impacting 100% of traffic.  The worst-case scenario is a 50% loss of capacity.

The design also allows us to easily add more servers in the event we need more capacity, with no anticipated impact.

## Building and Learning

There is a large body of knowledge available on building Tor nodes.  I read mailing lists archives, blog posts, and tutorials. I had exchanges with people already running large relays.  There are still data points Mozilla needs to understand before our experiment is complete.  This section is a “quick run down” on some of those data points.

• A single organization shouldn’t be running more than 10Gbps of traffic for a middle relay (and 5Gbps for an exit node).

This seems to be more of a gut feeling from existing operators than a proven value (let me know if I’m wrong), but it makes sense.  We do have available transit and capacity; understanding throughput and resource utilization is a key criterion for us.

Important Note: An operator running relays must use the “MyFamily” option in torrc.  This ensures a user doesn’t bounce through several of your servers.
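In torrc this is a single line listing the fingerprints of all the relays you operate (the fingerprints below are placeholders):

```
# torrc: declare every relay you run, so clients never build a
# circuit through more than one of them
MyFamily $0123456789ABCDEF0123456789ABCDEF01234567,$76543210FEDCBA9876543210FEDCBA9876543210
```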

• Slow ramp up

A new Tor instance (identified by its private/public key pair) will take time (up to 2 months) to use all its available bandwidth. This is explained in this blog post: The lifecycle of a new relay. We will be updating our blog posts and are curious how closely our nodes mirror the lifecycle.

• A Tor process (instance) can only push about 400Mbps.

This is based on mailing list discussions, as we haven’t reached that bandwidth yet. We run several instances per physical server.

• A single public IP can only be shared by 2 Tor instances

This is a security feature to prevent a single person from running a ton of fake nodes, as explained in this research paper. The feature is documented in the Tor protocol specification.

• Listen on well known ports like 80 or 443

This helps people behind strict firewalls access Tor. Don’t worry about running the process as root (needed to listen on ports < 1024): as long as you set the “User” option in torrc, Tor will drop its privileges after binding to the ports.
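A minimal torrc fragment for this might look as follows (the account name is whatever unprivileged user your distribution runs Tor as):

```
# torrc: accept relay connections on 443, then drop root privileges
# once the port is bound
ORPort 443
User debian-tor
```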

## Automation

We decided to use Ansible for configuration management.  A few things motivated us to make that choice.

• There was an existing ansible-tor role very close to what we needed to accomplish (and here is our pull request with our fixes and additions).
• Some of our teams are using Ansible in production and we (Network Engineering) are considering it.
• Ansible does not require a heavy client/server infrastructure which should make it more accessible to other operators.

And look! Mozilla’s Ansible configuration is available on GitHub!
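For the curious, applying such a role comes down to a very small playbook. The host group and the variable shown here are illustrative only, not our actual configuration:

```yaml
# site.yml: apply the ansible-tor role to every relay host
- hosts: tor_relays
  become: yes
  roles:
    - role: ansible-tor
      # middle relay only: refuse all exit traffic
      tor_ExitPolicy: "reject *:*"
```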

### Security

The security team helped us a lot throughout this project. Together we drew up a list of requirements, such as:

• strict firewall filtering
• hardening the operating system (disable unneeded services, good SSH configuration, automatic updates)
• hardening the network devices management plane
• implementing edge filtering to make sure only authorized systems can connect to the “network management plane”

The jumphost is the only place from which the infrastructure is administered; systems don’t accept management connections from anywhere else.

It is important to note that many of the security requirements align nicely with what is considered good practice in general system and network administration. Take enabling NTP or centralized syslog, for example: both are equally important for keeping services running smoothly, for troubleshooting, and for incident response. A similar concept applies to the principle “make sure the network devices’ security is at least as good as the systems’”.

We’ve also implemented periodic security checks on these systems. All of them are scanned from the inside for missing security updates and from the outside for open ports.

### Metrics

One of the points we’re still wondering about: how do we figure out whether we’re running an efficient relay (in terms of cost, participation in the Tor network, hardware efficiency, etc.)? Which metrics should we use, and how?

Looking around, it seems there is no “good answer”. We’re graphing everything we can about bandwidth and server utilization using Observium. The Tor network already has a project to collect relay statistics, called Tor metrics. Thanks to it, tools like Globe and others can exist.

## Future

Note that we have only just started our relays and they are far from running at their maximum bandwidth (for the reasons listed above). We will share more information down the road about performance and scaling.

Depending on the results of the POC, we may move the nodes to a managed part of our infrastructure. As long as their private keys stay the same, their reputation will follow them wherever they go, with no new ramp-up period.

On the technical side there are a lot of possible things to do, like adding IPv6 connectivity.  We’re also reviewing opportunities to automate more parts of the deployment (iptables, logs, etc.).

Here are a few links that you might find interesting:

## Thanks

Of course, none of that would have been possible without the help of Van, Michal (who wrote the part about security) and Opsec, Javaun, James, Moritz and the people of #tor!

## Introducing the Mozilla Operations Center

Mozillians,

Some of you may already know of the Mozilla Operations Center (MOC). The purpose of the MOC is to provide support and response for Mozilla’s critical production services and underlying infrastructure. We are a 24×7 IT function operating primarily from the SF and London MozSpaces, staffed by Mozilla employees on rotating shifts. In the past, this responsibility fell on several legacy IT teams and processes which have been inherited by the MOC. These legacy methods of support were spread across various IRC channels, email lists, and Bugzilla components and are now ready to be consolidated.

On Wednesday, September 17th, 2014 these changes will take effect and provide a single identifiable approach to requesting support for production services and reaching the MOC across IRC, email, and Bugzilla.

The MOC can be found on Mozilla’s IRC network in the channel #moc.

We understand this may not reach everyone, so we’re prepared to field requests that will undoubtedly still come in via existing legacy channels, and to point their senders at the new process.

# What is the MOC?

The Mozilla Operations Center is a 24×7 IT function whose purpose is to support Mozilla’s critical services.  It provides this through standard approaches: incident management and proactive monitoring.  Effective mid-September 2014, the MOC will have a single identity and be known only as “MOC” across IRC, email, and Bugzilla.

# Location

The MOC operates in 2 MozSpaces: one in San Francisco and the other in London.  Through rotating shifts between these 2 hubs, the MOC achieves around-the-clock coverage every day of the year.  A future state may include an additional hub for a more traditional follow-the-sun model and/or hybrid arrangements for better efficiency.

# Vision

The MOC will be a proactive team that prevents incidents through proficient automation, thorough inspection, and data-driven methods that alert business/product owners to possible problem areas so they can be mitigated.  Additionally, the MOC will ensure proper service levels are achieved and accurately measured, and that corrective actions are applied.

### Vision targets:

• Sustain IT infrastructure operations of Mozilla’s most critical services, meeting defined SLAs.
• Be transparent and visible to the Project and Mozilla’s key partners for support and communication of service updates.
• Establish and drive repeatable processes (incident, problem, crisis, reporting) throughout the Project.
• React predictably to service outages and planned or unplanned changes, with clear responses and intended outcomes.

More details on our services and offerings will be provided as they come to fruition.

## Pay No Attention to the Virtualization Behind the Curtain

VMware migrations are often seamless.  With vMotion, Storage vMotion, DRS and HA, you can go a long time without a hiccup.  Still, there are always tasks that are more difficult than you’d expect:

• “Why can’t I vMotion a template?”
• “Why can’t I stage a VM to reboot into having more RAM?”
• “Why can’t I edit configuration settings on a live VM ahead of a reboot?”

…”Why can’t I move VMs to a new vCenter?”  That was one that we were facing: moving from a Windows vCenter to the new Linux-based vCenter Server Appliance (VCSA).  It’s what we wanted, but the problem is, that’s the wrong question.

At Mozilla we have two main datacenters with ESX clusters, which were running ESXi5.0.  The hardware and ESXi versions were getting a little long in the tooth, so we got a new batch of hardware, and installed ESXi5.5.

Problem there: how to move just under 1000 VMs from the old vCenters to the new ones.  While a lot of our users are flexible about reboots, the task of scheduling downtime, shutting down, unregistering from the old, registering with the new, tweaking the network, booting back, then updating the inventory system… it was rather daunting.

It took a lot of searching, but we found out we were basically asking the wrong question.  The question isn’t “how do I move a guest to a new vCenter?”, it’s “how do I move a host to a new vCenter (and, oh yeah, mind if he brings some guests along)?”

So, the setup:

On both sides, we have the same datastores set up (not shown), and the same VLANs being trunked in, so really this became a question of how we land VMs/hosts going from one side to another.  We have vSphere Distributed Switches (vDS) on both clusters, which means the network configuration is tied to the individual vCenters.

There may be a way to transfer a VM directly between two disparate vDSes, but either we weren’t finding it, or it was too much dark magic and risk of failure.  We used a multiple-hop approach that hit the right level of “quick”, “makes sense”, and, most importantly, “works”.

On both clusters, we took one host out of service and split its redundant network links into two half-networks.  Netops unaggregated the ports, and we made port 1 the uplink for a standard switch, with all of our trunked-in VLANs (with names like “VLAN2”, “VLAN3”).  Port 2 was returned to service as the now-nonredundant uplink for the vDS, with its usual VLANs (names like “Private VLAN2”, “DMZ VLAN3”).  On the 5.0 side, we referred to this host as “the lifeboat”, and on the 5.5 side it was “the dock”.

The process at this point became a whole lot of routine work.

• On the old cluster, pick VMs that you want to move.
• Turn DRS to manual so nobody moves where they shouldn’t.
• vMotion the selected VMs into the lifeboat until it is very full.
• Look at the lifeboat’s configuration tab, under vDS, see what vlans are in use on this host, and by how many VMs.
• For each VLAN in use on the lifeboat:
• Under Networking, the trunk vDS, choose Migrate Virtual Machine Networking
• Migrate “Private VLAN2” to “VLAN2”.  This will only work on the lifeboat, since it’s the only host that can access both forms of VLAN2, so choosing “all” (and ignoring a warning) is perfectly safe here.
• Watch the VMs cut over to the standard switch (dropping maybe 1 packet).
• Checking the lifeboat’s configuration, nobody is on the vDS now; all VMs on the lifeboat are on the local-to-the-lifeboat standard switch.
• Disconnect the lifeboat from the old vCenter.
• Remove the lifeboat from the old vCenter.
• On the new vCenter, add the lifeboat as a new host.  This takes a while, and even after multiple successful runs there was always the worry of “this time it’s going to get stuck,” but it just worked.

Once the lifeboat host is added to the new cluster, vMotion all the VMs from the lifeboat onto the dock.  Now work can split in 2 directions: one person sends the lifeboat back; another starts processing the newly-landed VMs that are sitting on the dock.

Sending the lifeboat back is fairly trivial.  Disconnect/remove the lifeboat from the new vCenter, add the host back to the old vCenter, and add the links to the vDS.  At this point, this person can start loading up the next batch of evacuees.

On the receiving side, all the VMs are currently pinned to the dock, since it’s now the only host with a standard switch.  All of the VMs there need to have their networks moved to the new vCenter’s vDS.  The process is just the reverse of before (“Migrate Virtual Machine Networking” under the networking tab, moving “VLAN2” to “Private VLAN2”).  The rest is housekeeping: file the VMs into the right resource pools and folders, update the in-house inventory system to indicate the VMs were moved to a new vCenter, start vmware-tools upgrading.  Last step, we’d enable DRS and put the dock in maintenance mode, to eject all the new VMs into the remainder of the new cluster, to make room for the next boatload of arrivals.

We had very few problems, but I’ll list them:

• Any snapshots on the VMs were invalid after the move.  We found this out the hard way: someone in QA rolled a guest back to a baseline snapshot, only to find its networking lost, because the snapshot restored references to the old vDS.  Easily fixed once identified, and we were lucky it wasn’t a bigger problem, since we’d coordinated that move with the user.
• Two VMs had vNICs with manually configured MAC addresses in the 00:50:56 space.  Those VMs refused to land on the new side, because the MACs could conflict with (or not be managed by) the new vCenter.  We had to hot-swap the vNICs to automatically assigned MACs, at which point the VMs moved happily.
• And, of course, human error.  Misclassifying VMs into the wrong place because we were moving so much so fast.

One person would own one boatload, noting which pools/folders the VMs came from, and owning putting them in the right place on the far side.  All in all, with 2 people, we were able to move 200 VMs across in a full working day, and so we finished evacuating the old vCenter in 5 working days.  We only had to coordinate with 2 customers, and took one 15-minute maintenance window (for the manual-MAC vNIC issue), and even then we didn’t disrupt service.

Around 1000 VMs moved, and nobody noticed.  Just how we wanted it.

## The power of indexes

tl;dr: If you think there’s slowness or performance issues in your application database, ask your local friendly DBAs. Oftentimes we can help (and we like to!) — (graphs below for evidence) 🙂
Or if you don’t have a DB team, check out this great guide on how to check for slow queries.

Longer version:

Last week, one of the release engineering staff approached me in regards to some slow performance on our buildbot database systems.

Upon investigation, we realized there was a significant buildup of slow queries during weekdays. In this system, slow queries were defined as any query taking longer than 2 seconds.

After a short investigation, it was easily determined that a set of queries, repeating about 1-2 times every second, was taking 2 seconds per run. These queries were able to gain substantial benefit from the addition of only a few single-column indexes. Since the impact was low, we decided to roll this change out to dev/stage and then to production on Tuesday.
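The fix itself is a one-line DDL statement per table. The table and column names below are invented for illustration; they are not the actual buildbot schema:

```sql
-- Index the column the repeating slow query filters on, so the server
-- can seek straight to the matching rows instead of scanning the table.
ALTER TABLE buildrequests ADD INDEX idx_claimed_at (claimed_at);
```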

The graphs below show the impact of the change over the course of 3 days (Sunday, Monday, Tuesday) where the change was implemented on the morning of Tuesday the 28th, and again for the last 7 days.

3 days around the change:

Last 7 days:

7 days prior to the change (for perspective):

## A Physical Control Surface for Video Production

At Mozilla we make heavy use of Telestream’s Wirecast to stream video events. While Wirecast has a nice GUI, using a mouse or trackpad to control a video production is far from ideal. In some of our larger venues we have Blackmagic ATEM production video switchers, but for remote events and streaming from smaller venues we’ve been stuck with the default GUI control interface. Until now…

At the Mozilla Festival in London earlier this year I saw some projects using MIDI control surfaces for things other than controlling music. It turns out that grafting a midi control surface to any program with an Applescript interface is quite easy thanks to Nico Wald’s MidiPipe program.

A few minutes of searching the web showed that none of this is a new idea. Mark over at ttfn.tv implemented a similar solution a couple of years ago.

Mark’s solution used an earlier version of the Korg control surface, and was specific to an earlier Wirecast 4 release. It also used more than a dozen different scripts which made it hard to both understand and maintain.

I’ve done a ground-up rewrite and bundled the scripts and configuration files into something I’m calling Video Director. There are versions for both Wirecast 4 and Wirecast 5, and you can get all the code over on GitHub. (Here’s hoping that someone with better Applescript Fu than mine will fork these projects).

Video Director is a script designed to enable control of Wirecast 4 from a Korg nanoKONTROL2 MIDI control surface on Mac OS X computers. The Korg nanoKONTROL2 is a low-cost way of providing a tactile interface which, while not as elegant as a real production video switcher, provides much more tactile feedback than trying to control a video production with a mouse or touchpad.

In addition to simply providing real physical buttons for video switching operations, Video Director also simplifies the process of populating the various control layers of Wirecast with video and graphic content. It will load layer content from a pre-defined directory structure on the host machine, allowing rapid re-configuration of Wirecast for programs with differing content requirements.

The functionality of Video Director is limited by the very restricted subset of Wirecast operations for which Telestream has exposed a scriptable interface. The most obvious omission is that there appears to be no way to script the master audio level control, either through the Wirecast API or via System Events scripting. For use with a control surface such as the Korg nanoKONTROL2 with its many sliders and knobs, this is a galling omission. In addition, Wirecast 5 appears to be even less scriptable than Wirecast 4. Here’s hoping that changes in the next couple of releases.

## Upgrading from MySQL 5.1 to MariaDB 5.5

In my last post, a tale of two MySQL upgrades, a few folks asked if I would outline the process we used to upgrade, and what kind of downtime we had.

Well, the processes were different for each upgrade, so I will tackle them in separate blog posts. The first step was to upgrade all our MySQL 5.1 machines to MariaDB 5.5. As mentioned in the previous post, MariaDB’s superior performance for subqueries is why we switched – and we switched back to MySQL for 5.6 to take full advantage of the performance_schema.

It is not difficult to blog about our procedure, as we have documentation for each process. My first tip would be to do the same in your own environment. This also enables other folks to help, even if they are sysadmins and not normally DBAs. You may notice the steps contain items that might be “obvious” to someone who has done maintenance before; we try to write them detailed enough that if you were doing it at 3 am and a bit sleep-deprived, you could follow the checklist and not miss anything. This also helps junior and aspiring DBAs avoid missing steps.

The major difference between MySQL 5.1 and MySQL 5.5 (and its forks, like MariaDB) is that FLOAT columns are handled differently. On MySQL 5.1, a float value could be in scientific notation (e.g. 9.58084e-05) and in 5.5, it’s not (e.g. 0.0000958084). This makes checksumming difficult, as all FLOAT values will show differences even when they are the same number. There is a workaround for this, devised by Shlomi Noach.
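The stored value does not change between versions; only its textual rendering does, and that rendering is exactly what row checksums compare. A quick Python illustration of the two renderings:

```python
# The same number printed the 5.1 way (scientific notation) and the
# 5.5 way (plain decimal).  Byte-wise checksums of the two renderings
# differ even though the values are equal.
value = 9.58084e-05
old_style = "9.58084e-05"       # how MySQL 5.1 would print it
new_style = f"{value:.10f}"     # plain decimal, like MySQL 5.5

assert float(old_style) == float(new_style)  # same number...
assert old_style != new_style                # ...different text
assert new_style == "0.0000958084"
```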

We have an n+1 architecture for databases at Mozilla – this means that we have an extra server. If we need 1 master and 3 slaves, then n+1 is 1 master and 4 slaves. Because of this, there are 2 different ways we upgrade – the first slave we upgrade, and subsequent slaves/masters.

These steps are copied and pasted from our notes, with minor changes (for example, item #2 is “send out maintenance notices” but in our document we have the e-mail addresses to send to).

Assumptions: Throughout these notes we use ‘/var/lib/mysql’, as that is our standard place for MySQL. You may need to change this to suit your environment. We are also using Red Hat Enterprise Linux for our operating system, so this procedure is tailored to it (e.g. “yum install/yum remove”). We control packages using the freely available puppet mysql module we created.

For the first slave
The overall procedure is to perform a logical backup of the database, create a new empty installation of the new server version, and import the backup. Replication does work from MySQL 5.1 to MariaDB 5.5 and back (at least, on the 25 or so clusters we have, replication worked in both directions; your mileage may vary).

1. Make sure the slave has the same data as the master with checksums (the most recent checksum run is fine; they should be running every 12 hours).
2. Send out maintenance notices.

3. Take the machine out of any load balanced services, if appropriate

4. Set appropriate downtimes in Nagios

5. Start a screen session on the server

6. Do a SHOW PROCESSLIST to see if there are any slaves of the machine. If so, move them to another master if they are needed. [we have a different checklist for this]

7. Do a SHOW SLAVE STATUS to see if this machine is a slave.

1. If this machine is a slave, ensure that its master will not delete its binlogs while the upgrade is occurring.

2. If this machine is a slave, do a STOP SLAVE; and copy the master.info file somewhere safe [or the slave_master_info table if using that]

8. Stop access to the machine from anyone other than root (assuming you are connecting from root):

9. UPDATE mysql.user SET password=REVERSE(password) WHERE user!='root'; FLUSH PRIVILEGES;
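The REVERSE() trick is a reversible lockout: the reversed hash matches no password, and applying REVERSE() a second time restores the original exactly. The round-trip property, sketched with `rev` standing in for SQL’s REVERSE() (the hash below is made up):

```shell
hash='*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19'   # hypothetical password hash
locked=$(printf '%s' "$hash" | rev)                # no password matches this
restored=$(printf '%s' "$locked" | rev)            # REVERSE is its own inverse
[ "$restored" = "$hash" ] && echo "round-trip ok"
```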
10. See what the default character set is for the server and databases:
SHOW VARIABLES LIKE 'character_set_server'; SHOW VARIABLES LIKE 'character_set_database'; SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE DEFAULT_CHARACTER_SET_NAME!='utf8' AND SCHEMA_NAME NOT IN ('mysql');

11. If applicable, change the server default to utf8 and change databases to utf8 with ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;

12. Check to see how big the data is:
mysql> SELECT SUM(DATA_LENGTH)/1024/1024/1024 AS sizeGb FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA!='information_schema';

13. Determine how you can export the data, given the size. You may be able to export without compression, or you may need to do a mysqldump | gzip -c > file.sql.gz, then compress the old data files instead of just moving them aside.
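For the compressed variant, the dump pipes straight into gzip. A sketch using a dated filename; the mysqldump invocation is shown as a comment since it assumes a live server and valid credentials:

```shell
# Dated backup filename, matching the naming convention used in these notes:
backup="$(date +%Y-%m-%d)_backup.sql.gz"
echo "$backup"
# Against a live server you would then run:
#   time mysqldump --all-databases --routines --triggers --events | gzip -c > "$backup"
```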

14. Do a du -sh * of the datadir and save for later, if you want to compare the size of the database to see how much space is returned after defragmenting

15. Export the data from all databases, preserving character set, routines and triggers. Record the time for documentation’s sake. I’m assuming the character set from step 10 is utf8 (if it’s something like latin1, you’ll need to put in --default-character-set=latin1 in the command). If the machine has slaves, make sure to use --master-data=1. If you need to compress, change the shell command accordingly:
time mysqldump --all-databases --routines --triggers --events > $(date +%Y-%m-%d)_backup.sql
16. Stop MySQL

17. Copy the config file (usually /etc/my.cnf) to a safe place (like /etc/my.cnf.51)

18. Do a rpm -qa | egrep -i "percona|mysql". Do a yum remove for the mysql/percona packages. It’s OK if it also removes related packages, like perl-DBD, but make a note of them, because you will want to reinstall them later. Sample:
yum remove Percona-Server-client Percona-Server-shared-compat Percona-XtraDB-Cluster-devel Percona-Server-server
19. Move the /var/lib/mysql directory to /var/lib/mysql-old. Compress any files that need compression (you may need the freed space to decompress the sql file). If you absolutely cannot keep the files, see if you can copy them somewhere. We really want to preserve the old data directory just in case we need to revert.

20. Decompress the sql file, if applicable.

21. Install the proper packages by changing puppet to use "mariadb55" instead of "mysql51" or "percona51". Verify with rpm -qa | egrep -i "percona|mysql|maria"
22. [This may be different in your environment; we use the freely available puppet mysql module we created.]
23. Run mysql_install_db

24. Make any changes to /etc/my.cnf (e.g. run puppet). When going from MySQL 5.1 to 5.5, there are no particular global changes Mozilla made.
25. [When we went from MySQL 5.0 to MySQL 5.1, however, we did make a global change to reflect the new slow query log options.]
26. chown -R mysql:mysql /var/lib/mysql/

27. chmod 775 /var/lib/mysql

28. Start MySQL and check the error logs for any warnings. Get rid of any warnings/errors, and make sure MySQL is running.

29. Turn off binary logging. Import the export, timing how long it takes, for reference:

30. time mysql < YYYY-MM-DD_backup.sql
31. Restart MySQL and look for errors; you may need to run mysql_upgrade.

32. Turn on binary logging, if applicable.

33. Test.

34. If this machine was a slave, re-slave it. Let it catch up, making sure there are no data integrity errors, and no replication errors.

35. Reinstate permissions on the users:
UPDATE mysql.user SET password=REVERSE(password) WHERE user!='root'; FLUSH PRIVILEGES;
36. Re-slave any slaves of this machine, if needed.

37. Turn back on Nagios, making sure all the checks are green first.

38. Run a checksum on the master to propagate to this slave, and double-check data integrity on the slave. Note that you will want to use --ignore-columns with the output of the query in the next step, to avoid false positives from the scientific notation change (see https://blog.mozilla.org/it/2013/01/17/mysql-5-1-vs-mysql-5-5-floats-doubles-and-scientific-notation/)

39. Find FLOAT/DOUBLE fields to ignore in checksum: SELECT GROUP_CONCAT(DISTINCT COLUMN_NAME) FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('float','double') AND TABLE_SCHEMA NOT IN ('mysql','information_schema','performance_schema');
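Wiring the two checksum steps together: capture the query’s comma-separated column list and hand it to pt-table-checksum’s --ignore-columns option. A sketch with a simulated query result (the column names and hostname are made up; in production the list would come from running the query above with mysql -BN -e):

```shell
# In production: COLS=$(mysql -BN -e "<the GROUP_CONCAT query above>")
# Simulated result so this sketch stands alone (made-up column names):
COLS="avg_rating,px_width,score"
cmd="pt-table-checksum --ignore-columns=$COLS h=db1.example.com"
echo "$cmd"
```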
40. Put the machine back into the load balancer, if applicable.

41. Inform folks the upgrade is over

On the first upgrade, we did what is usually recommended: a logical export with mysqldump, followed by an import. With other upgrades in the same replication hierarchy, we can take advantage of Xtrabackup to stream the new version directly to the machine to be upgraded.

The general procedure here is similar to the above, except that a logical export is not taken. After preparation steps are taken, a new empty MariaDB 5.5 server is installed. Then we use xtrabackup to backup and restore the existing MariaDB 5.5 server to the machine we are upgrading.

For subsequent slaves, and the master

1. Coordinate with affected parties ahead of time

2. Send out any notices for downtime

3. Take the machine out of any load balanced services, if appropriate. If the machine is a master, this means failing over the master first, so that this machine becomes a regular slave. [we have a different checklist for how to failover]

4. Set appropriate downtimes in Nagios, including for any slaves

5. Start a screen session on the server

6. Do a SHOW PROCESSLIST to see if there are any slaves of the machine. If so, move them to another master if they are needed.

7. Do a SHOW SLAVE STATUS to see if this machine is a slave.
1. If this machine is a slave, ensure that the master will not delete its binlogs while the upgrade is occurring.

2. If this machine is a slave, do a STOP SLAVE; and copy the master.info file somewhere safe

8. Save a list of grants from pt-show-grants, just in case there are users/permissions that need to be preserved.
9. [this is done because sometimes masters and slaves have different users, though we try to keep everything consistent]
10. Figure out how big the backup will be by doing a du -sh on the datadir of the already-upgraded machine to be backed up, and make sure the new machine has enough space to keep the old version and have the new version as well.

11. Stop MySQL on the machine to be upgraded.

12. Copy the config file (usually /etc/my.cnf) to a safe place (like /etc/my.cnf.51)

13. Do a rpm -qa | egrep -i "mysql|percona". Do a yum remove for the mysql packages (at least mysql-server, mysql). It's OK if it also removes related packages, like perl-DBD, but make a note of them, because you will want to reinstall them later.

14. Move the /var/lib/mysql directory to /var/lib/mysql-old. Compress any files that need compression. If you absolutely cannot keep the files, see if you can copy them somewhere. We really want to preserve the old data directory just in case we need to revert.

15. Install the proper packages by changing puppet to use "mariadb55" instead of "mysql51" or "percona51", running puppet manually. Verify with rpm -qa | egrep -i "percona|mysql|maria"

16. Run mysql_install_db

17. Make any changes to /etc/my.cnf (or run puppet). When going from MySQL 5.1 to 5.5, there are no particular changes.

18. chown -R mysql:mysql /var/lib/mysql/

19. chmod 775 /var/lib/mysql

20. Start MySQL and check the error logs for any warnings. Get rid of any warnings/errors, and make sure MySQL is started.

21. Stop MySQL, and move or delete the datadir that was created on upgrade.

22. If you are directly streaming the backup to the machine to be upgraded, do this on the machine to be upgraded:
cd $DATADIR
nc -l 9999 | tar xfi -

23. On the machine to be backed up (that is already upgraded), in a screen session, making sure you get any slave info:
time innobackupex --slave-info --stream=tar $DATADIR | nc (IP/hostname) 9999
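The nc pair is just a pipe stretched across the network; the tar stream itself round-trips the same way locally, which is an easy way to sanity-check the pattern (the directories below are illustrative):

```shell
# Local round-trip of the tar-over-a-pipe pattern used by the nc pair above
mkdir -p /tmp/stream_src /tmp/stream_dst
echo "ibdata" > /tmp/stream_src/ibdata1
( cd /tmp/stream_src && tar cf - . ) | ( cd /tmp/stream_dst && tar xf - )
cat /tmp/stream_dst/ibdata1
```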

24. Once xtrabackup is complete, fix permissions on the datadir:
chown -R mysql:mysql /var/lib/mysql/
chmod 775 /var/lib/mysql

25. Prepare the backup:
time innobackupex --apply-log --target-dir=/var/lib/mysql

26. Fix permissions on the datadir again:
chown -R mysql:mysql /var/lib/mysql/
chmod 775 /var/lib/mysql

27. Restart MySQL and look for errors

28. Test.

29. If this machine was a slave, re-slave it. Let it catch up, making sure there are no data integrity errors, and no replication errors.

30. Re-slave any slaves of this machine, if needed.

31. Turn back on Nagios, making sure all checks are green first.

32. Put the machine back into the load balancer, if applicable.

33. Inform folks the upgrade is over

It's long and detailed, but not particularly difficult.