Socorro Data Center Migration – Downtime

Laura Thomson

On Saturday, January 22, starting at 9am PST, we will proceed with the Socorro datacenter migration.

Background

As most people who touch Socorro know, we have been close to capacity on the current systems in San Jose for some months. Due to space and power considerations, we chose to move to a set of shiny new machines in the Phoenix datacenter. Each box is specced significantly higher, and we now have a lot more of them:

  • 70 HBase nodes as opposed to 15
  • 10 processors (with 24 threads each) up from 3 (with 4 threads each)
  • 3 Socorro API (middleware) servers up from 1
  • 2 PostgreSQL instances in (manual) failover
  • 5 dedicated webheads
  • Higher capacity and faster disks in collectors
  • Dedicated admin host
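To put the list above in perspective, here is a rough back-of-the-envelope calculation in Python. The figures come straight from the bullets; the variable names are only illustrative, and raw thread counts are a crude proxy for capacity, not a throughput measurement.

```python
# Capacity figures taken from the list above, as (San Jose, Phoenix) pairs.
hbase_nodes = (15, 70)
processor_boxes = (3, 10)
threads_per_processor = (4, 24)
api_servers = (1, 3)


def total_threads(boxes: int, threads_each: int) -> int:
    """Total processor threads available across all processor boxes."""
    return boxes * threads_each


old_threads = total_threads(processor_boxes[0], threads_per_processor[0])
new_threads = total_threads(processor_boxes[1], threads_per_processor[1])

# Growth factors; these compare counts only and say nothing about per-box speed.
thread_growth = new_threads / old_threads
hbase_growth = hbase_nodes[1] / hbase_nodes[0]

print(f"processor threads: {old_threads} -> {new_threads} ({thread_growth:.0f}x)")
print(f"HBase nodes: {hbase_nodes[0]} -> {hbase_nodes[1]} ({hbase_growth:.1f}x)")
```

On these numbers alone, the processor tier goes from 12 threads to 240, before accounting for the faster hardware in each box.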

We’ve done a lot of work to make Phoenix superior in other respects, which I’m really excited about too. All configurations and releases are done automatically through Puppet. We have added a great deal more monitoring and instrumentation to the system (with Nagios and Ganglia), which has allowed us to measure and monitor during our smoke tests and will continue to give us better insight into the system in production.

For the last few weeks we have been running various tests on our setup in PHX. All QA tests pass and we have run millions of crashes through the system in smoke, load, and component failure tests.

Procedure

The basic procedure we will follow on Saturday is:

  • Shut down processing in San Jose
  • Switch collectors in San Jose to collect to local disk
  • Perform final sync of HBase
  • Perform final sync of PostgreSQL
  • Run final tests before we throw the switch
  • Point crash-reports.mozilla.com and crash-stats.mozilla.com to PHX
  • Switch San Jose collectors to submit crashes to PHX, which will clear the backlog accumulated during downtime
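The order of these steps matters: each one assumes the previous one has finished cleanly. As a minimal sketch of that idea (not our actual tooling), the procedure can be modeled as an ordered list that stops at the first failure. The `run_migration` function and the `placeholder` action are hypothetical names invented for this illustration; in reality each step is a manual or scripted operation.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_migration(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # Later steps depend on this one, so do not continue.
        completed.append(name)
    return completed


def placeholder() -> bool:
    """Hypothetical stand-in for the real work behind a step."""
    return True


MIGRATION_STEPS: List[Step] = [
    ("shut down processing in San Jose", placeholder),
    ("switch San Jose collectors to local disk", placeholder),
    ("final sync of HBase", placeholder),
    ("final sync of PostgreSQL", placeholder),
    ("final tests", placeholder),
    ("point crash-reports and crash-stats to PHX", placeholder),
    ("switch San Jose collectors to submit to PHX", placeholder),
]

if __name__ == "__main__":
    for step_name in run_migration(MIGRATION_STEPS):
        print("done:", step_name)
```

Stopping at the first failed step reflects the fact that we would not repoint DNS, for example, if the final sync or the final tests had not succeeded.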

Downtime

During the migration we will continue to collect crashes, but processing and crash-stats will be unavailable. We anticipate a downtime of approximately 8 hours, but it may be longer. (In an absolute worst-case scenario, rolling back to San Jose would take about another 8 hours from the point at which we noticed an unsolvable problem.) We’ll keep San Jose static for a couple of weeks, and then it will become our test environment.
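For planning purposes, the estimates above translate into a rough window like the one sketched below. This is illustrative arithmetic only: the year (2011) is an assumption based on January 22 falling on a Saturday, PST is treated as a fixed UTC-8 offset, and the worst case assumes a problem is only noticed at the end of the expected 8 hours.

```python
from datetime import datetime, timedelta, timezone

# PST modeled as a fixed UTC-8 offset (an assumption; no DST applies in January).
PST = timezone(timedelta(hours=-8), name="PST")

# Start time from the announcement; the year is assumed, not stated in the post.
start = datetime(2011, 1, 22, 9, 0, tzinfo=PST)

expected_downtime = timedelta(hours=8)  # "approximately 8 hours"
rollback_time = timedelta(hours=8)      # roughly the same again for a rollback

expected_end = start + expected_downtime
# Worst case here: the problem is noticed only at the end of the expected window.
worst_case_end = expected_end + rollback_time

print("start:         ", start.strftime("%a %b %d %H:%M %Z"))
print("expected end:  ", expected_end.strftime("%a %b %d %H:%M %Z"))
print("worst-case end:", worst_case_end.strftime("%a %b %d %H:%M %Z"))
```

Under those assumptions, processing would be back around 5pm PST on Saturday, and a full rollback would finish in the early hours of Sunday.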

After the migration I’ll blog more about what we learned during this process. If you have any questions in the meantime, please send me an email.