Mark your calendar. Saturday, April 17 @ 2pm.
That’s the date we’ve scheduled to do some significant NetApp upgrades in San Jose (bug 502151).
Unlike most NetApp upgrades, these will require real downtime. The upgrade updates the firmware on several disk array shelves, and those shelves must be power-cycled for it to take effect.
This will affect a large swath of the infrastructure, including:
- All check-in trees
- addons.mozilla.org (and related AMO sites)
- Breakpad / Socorro
- and more…
Basically a lot.
Why now? Honestly, we held off as long as we could.
About a year ago the filers started reporting a fault on all the attached disk shelves:
Jan 1 23:00:00 netapp-a [meer-netapp-a: monitor.shelf.fault:CRITICAL]: Fault reported on disk storage shelf attached to channel 0a. Please check fans, power, and temperature.
For those who like to play along, NetApp bug #258741.
“A bug in the ESH2 I/O modules firmware can lead to a I2C buffer loss which can cause the two modules in a shelf to stop communicating with each other.”
NetApp goes on to say,
“This bug does not cause any issues with data availability and the system will continue to function in the presence of these errors. This bug will cause the shelf to incorrectly report status, statistics, and environmental information. The error condition caused by this bug can be cleared by power-cycling the shelf. This should not be attempted while the system is online.”
This was basically a cosmetic error and not a real fault. Since it wasn’t affecting service, we put it in the back of our minds and figured we’d address it after the Firefox 3.5 release in June.
June came and went, other things took priority, and before we knew it, it was autumn. We decided to push it off until the holidays, when demand on the infrastructure is much lower. In fact, we had originally planned to do this at the same time as an OS upgrade on another pair of NetApp filers.
I decided to delay it again until we had the Phoenix data center online and able to take production traffic.
And that’s where we are now. April 17 @ 2pm we’ll take a 3-hour window to do the OS & firmware upgrades.
During the week of April 12 we’ll manually fail over a number of production sites that service in-product features (support.mozilla.com, Personas) and support product downloads (addons.mozilla.org) to Phoenix as their primary data center.
I’m posting this well in advance of April 17 because of how much of the infrastructure this will impact. The date’s somewhat immovable and I realize this will impact many different parts of Mozilla. Feel free to comment here or in the bug with any concerns.