Pay No Attention to the Virtualization Behind the Curtain

gcox

VMware migrations are often seamless.  With vMotion, Storage vMotion, DRS, and HA, you can go a long time without a hiccup.  But there are always tasks that are more difficult than you’d expect:

  • “Why can’t I vMotion a template?”
  • “Why can’t I stage a VM to reboot into having more RAM?”
  • “Why can’t I edit configuration settings on a live VM ahead of a reboot?”

…”Why can’t I move VMs to a new vCenter?”  That was the one we were facing: moving from a Windows vCenter to the new Linux-based vCenter Server Appliance (VCSA).  The move was what we wanted, but as it turns out, that’s the wrong question.

At Mozilla we have two main datacenters with ESX clusters, which were running ESXi 5.0.  The hardware and ESXi versions were getting a little long in the tooth, so we got a new batch of hardware and installed ESXi 5.5.

Problem there: how to move just under 1000 VMs from the old vCenters to the new ones.  While a lot of our users are flexible about reboots, the task of scheduling downtime, shutting down, unregistering from the old vCenter, registering with the new one, tweaking the network, booting back up, and then updating the inventory system… it was rather daunting.

It took a lot of searching, but we found out we were basically asking the wrong question.  The question isn’t “how do I move a guest to a new vCenter?”, it’s “how do I move a host to a new vCenter (and, oh yeah, mind if he brings some guests along)?”

So, the setup:

[Diagram: vSphere clusters, pre-move]
On both sides, we have the same datastores set up (not shown), and the same VLANs being trunked in, so really this became a question of how we land VMs/hosts going from one side to another.  We have vSphere Distributed Switches (vDS) on both clusters, which means the network configuration is tied to the individual vCenters.

There may be a way to transfer a VM directly between two disparate vDSes, but either we weren’t finding it, or it involved too much dark magic and risk of failure.  We used a multiple-hop approach that hit the right level of “quick”, “makes sense”, and, most importantly, “works”.

On both clusters, we took one host out of service and split its redundant network links into two half-networks.  Netops unaggregated the ports, and we made port 1 the uplink for a standard switch carrying all of our trunked-in VLANs (with names like “VLAN2”, “VLAN3”).  Port 2 was returned to service as the now-nonredundant uplink for the vDS, with its usual VLANs (names like “Private VLAN2”, “DMZ VLAN3”).  On the 5.0 side we referred to this host as “the lifeboat”, and on the 5.5 side it was “the dock”.
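For anyone who’d rather script that lifeboat setup than click through the client, here’s roughly what it looks like in pyVmomi.  We did it by hand; the vCenter address, host name, uplink NIC, and VLAN numbers below are all made-up placeholders.

    # Rough sketch: carve out a standard vSwitch on the lifeboat host (pyVmomi).
    # All names and credentials here are hypothetical placeholders.
    import ssl
    from pyVim.connect import SmartConnect
    from pyVmomi import vim

    si = SmartConnect(host="old-vcenter.example.com", user="admin", pwd="secret",
                      sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    # Find the host we pulled out of service.
    hosts = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    lifeboat = next(h for h in hosts.view if h.name == "lifeboat.example.com")
    netsys = lifeboat.configManager.networkSystem

    # Standard vSwitch uplinked to the unaggregated port 1 (vmnic0 here).
    vss_spec = vim.host.VirtualSwitch.Specification(
        numPorts=256,
        bridge=vim.host.VirtualSwitch.BondBridge(nicDevice=["vmnic0"]))
    netsys.AddVirtualSwitch(vswitchName="vSwitch-lifeboat", spec=vss_spec)

    # One port group per trunked VLAN, named to mirror the vDS port groups.
    for vlan_id in (2, 3):
        netsys.AddPortGroup(portgrp=vim.host.PortGroup.Specification(
            name="VLAN%d" % vlan_id,
            vlanId=vlan_id,
            vswitchName="vSwitch-lifeboat",
            policy=vim.host.NetworkPolicy()))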

[Image: “I think we’re going to need a bigger boat”]

The process at this point became a whole lot of routine work.

  • On the old cluster, pick VMs that you want to move.
  • Turn DRS to manual so nobody moves where they shouldn’t.
  • vMotion the selected VMs into the lifeboat until it is very full.
  • Look at the lifeboat’s configuration tab, under the vDS, to see which VLANs are in use on this host, and by how many VMs.
  • For each VLAN in use on the lifeboat:
    • Under Networking, on the trunked vDS, choose “Migrate Virtual Machine Networking”.
    • Migrate “Private VLAN2” to “VLAN2”.  This will only work on the lifeboat, since it’s the only host that can access both forms of VLAN2, so choosing “all” (and ignoring a warning) is perfectly safe here.  (There’s a rough sketch of this step after the list.)
    • Watch the VMs cut over to the standard switch (dropping maybe one packet).
  • Check the lifeboat’s configuration: nothing is on the vDS now; all VMs on the lifeboat are on the local-to-the-lifeboat standard switch.
  • Disconnect the lifeboat from the old vCenter.
  • Remove the lifeboat from the old vCenter.
  • On the new vCenter, add the lifeboat as a new host.  This takes a while, and even after multiple successful runs there was always the worry of “this time it’s going to get stuck,” but it just worked.
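Under the hood, that “Migrate Virtual Machine Networking” step is just a per-VM reconfigure: edit each vNIC that’s backed by a vDS port group so it points at the same-named standard port group instead.  We drove it through the client, but here’s a pyVmomi sketch of the idea; the mapping of port group keys to names is a hypothetical input.

    # Sketch: repoint a VM's vDS-backed vNICs at standard-switch port groups (pyVmomi).
    from pyVmomi import vim

    def move_nics_to_standard_switch(vm, key_to_name):
        """key_to_name maps a vDS port group key to a standard port group name,
        e.g. {"dvportgroup-101": "VLAN2"} (hypothetical values)."""
        changes = []
        for dev in vm.config.hardware.device:
            if not isinstance(dev, vim.vm.device.VirtualEthernetCard):
                continue
            backing = dev.backing
            if not isinstance(backing,
                              vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo):
                continue  # already on a standard switch
            target = key_to_name.get(backing.port.portgroupKey)
            if target is None:
                continue
            # Swap the backing for a plain port-group-by-name backing.
            dev.backing = vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(deviceName=target)
            changes.append(vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit, device=dev))
        if changes:
            return vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=changes))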

[Image: wave dead chickens as you see fit]

Once the lifeboat host is added to the new cluster, vMotion all the VMs from the lifeboat onto the dock.  Now work can split in two directions: one person sends the lifeboat back; another starts processing the newly-landed VMs sitting on the dock.

[Diagram: vSphere clusters, unloading the lifeboat]
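Scripted, unloading the lifeboat is just one relocate task per VM.  A minimal pyVmomi sketch, assuming you already hold the lifeboat and dock HostSystem objects and a destination resource pool (we just used the client):

    # Sketch: vMotion everything off the lifeboat onto the dock (pyVmomi).
    from pyVmomi import vim
    from pyVim.task import WaitForTask

    def drain_lifeboat(lifeboat_host, dock_host, dock_pool):
        # dock_pool could simply be the new cluster's root resource pool.
        for vm in list(lifeboat_host.vm):      # every VM currently on the lifeboat
            spec = vim.vm.RelocateSpec(host=dock_host, pool=dock_pool)
            WaitForTask(vm.RelocateVM_Task(spec=spec))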

Sending the lifeboat back is fairly trivial.  Disconnect and remove the lifeboat from the new vCenter, add the host back to the old vCenter, and re-add its links to the vDS.  At this point, this person can start loading up the next batch of evacuees.
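If you wanted to script that round trip, it’s the same disconnect/remove/re-add dance the client does.  A sketch, assuming open pyVmomi sessions against both vCenters, plus the host’s root credentials and SSL thumbprint (all placeholders here; re-attaching the vDS uplinks is left out):

    # Sketch: send the (now empty) lifeboat back to the old vCenter (pyVmomi).
    from pyVmomi import vim
    from pyVim.task import WaitForTask

    def send_lifeboat_back(lifeboat_on_new, old_cluster, root_pwd, thumbprint):
        # lifeboat_on_new: the HostSystem object from the new vCenter's session.
        WaitForTask(lifeboat_on_new.DisconnectHost_Task())
        WaitForTask(lifeboat_on_new.Destroy_Task())    # remove from the new inventory
        # old_cluster: the ClusterComputeResource from the old vCenter's session.
        spec = vim.host.ConnectSpec(hostName="lifeboat.example.com",
                                    userName="root", password=root_pwd,
                                    sslThumbprint=thumbprint)
        return WaitForTask(old_cluster.AddHost_Task(spec=spec, asConnected=True))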

On the receiving side, all the VMs are currently pinned to the dock, since it’s now the only host with the standard switch.  All of those VMs need their networks moved onto the new vCenter’s vDS.  The process is just the reverse of before (“Migrate Virtual Machine Networking” under the Networking tab, moving “VLAN2” to “Private VLAN2”).  The rest is housekeeping: file the VMs into the right resource pools and folders, update the in-house inventory system to reflect the new vCenter, and kick off VMware Tools upgrades.  As a last step, we’d enable DRS and put the dock into maintenance mode, ejecting all the new VMs into the rest of the new cluster to make room for the next boatload of arrivals.
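That last step is also easy to express in API terms: flip DRS to fully automated and put the dock into maintenance mode, which pushes the remaining VMs out into the cluster.  A sketch with hypothetical cluster and host objects (we did this in the client):

    # Sketch: re-enable DRS and drain the dock via maintenance mode (pyVmomi).
    from pyVmomi import vim
    from pyVim.task import WaitForTask

    def evacuate_dock(cluster, dock_host):
        drs = vim.cluster.DrsConfigInfo(
            enabled=True,
            defaultVmBehavior=vim.cluster.DrsConfigInfo.DrsBehavior.fullyAutomated)
        WaitForTask(cluster.ReconfigureComputeResource_Task(
            spec=vim.cluster.ConfigSpecEx(drsConfig=drs), modify=True))
        # With DRS automated, entering maintenance mode migrates the dock's VMs off.
        WaitForTask(dock_host.EnterMaintenanceMode_Task(timeout=0))
        # Bring the dock back for the next boatload.
        WaitForTask(dock_host.ExitMaintenanceMode_Task(timeout=0))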

We had very few problems, but I’ll list them:

  • Any snapshots on the VMs were invalid after the move.  We found this out the hard way: someone in QA rolled a guest back to a baseline snapshot, only to find its networking gone, because the snapshot restored references to the old vDS.  It was easily fixed once identified, and we were lucky it wasn’t a bigger problem, since we’d coordinated that move with the user.
  • Two VMs had vNICs with manually configured MAC addresses in the VMware 00:50:56 range.  Those VMs refused to land on the new side, because a manual MAC in that range can conflict with addresses vCenter assigns.  We had to hot-swap each vNIC onto an automatically assigned MAC (see the sketch after this list), at which point the VMs moved happily.
  • And, of course, human error.  Misclassifying VMs into the wrong place because we were moving so much so fast.
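For completeness, the vNIC hot-swap from the MAC-address bullet above boils down to a remove-and-add reconfigure on a running VM.  A rough sketch, assuming a vmxnet3 adapter and a standard port group name (both hypothetical):

    # Sketch: replace a manually-MAC'd vNIC with one using an auto-assigned MAC (pyVmomi).
    from pyVmomi import vim

    def hot_swap_nic(vm, portgroup_name):
        changes = []
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualEthernetCard) and dev.addressType == "manual":
                changes.append(vim.vm.device.VirtualDeviceSpec(
                    operation=vim.vm.device.VirtualDeviceSpec.Operation.remove, device=dev))
        # Replacement adapter; addressType "generated" lets vCenter pick the MAC.
        new_nic = vim.vm.device.VirtualVmxnet3(
            addressType="generated",
            backing=vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(deviceName=portgroup_name),
            connectable=vim.vm.device.VirtualDevice.ConnectInfo(
                startConnected=True, connected=True, allowGuestControl=True))
        changes.append(vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.add, device=new_nic))
        return vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=changes))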

One person would own one boatload, noting which pools/folders the VMs came from, and was responsible for putting them in the right place on the far side.  All in all, with two people, we were able to move 200 VMs across in a full working day, and we finished evacuating the old vCenter in five working days.  We only had to coordinate with two customers, and took one 15-minute maintenance window (for the manual-MAC vNIC issue), and even then we didn’t disrupt service.

Around 1000 VMs moved, and nobody noticed.  Just how we wanted it.