Recently we saw instability in one of our admin nodes. The hardware is a bit old, so it is easy to throw that out as a cause. This is an HP DL360, 4th Generation (current is g7, with g8 right around the corner). Yet blaming the hardware should be a last resort. Server problems are a cause and effect game. Often the "cause" is a change on the server, and that was the case here. We had updated to RHEL 5.6 (from 5.5) to pick up critical updates.
It is rare that an update like this will cause instability in a server. The reason RHEL is such a popular choice in production systems these days is because of the hardware testing and vetting that goes on before they roll out releases and updates. Yet it seemed that we fell victim to a bug.
Then, as we were troubleshooting this issue on our admin node, people.mozilla.com died (sorry about that). This is another DL360 g4 with recent RHEL updates. It was now obvious that this could be a problem with all of our servers running this combination.
To make a long story short, RHEL was in fact vetted and stable on this particular hardware platform. The difference in our case is that we were sitting on a firmware update that would have needed downtime to apply. The combination of kernel and firmware version was a reported Red Hat bug, and the fix is to make sure both are updated. Emergency downtime for many hosts was taken to perform these updates (bug 661420) before the bug was triggered elsewhere.
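The fix above amounts to matching a host's kernel/firmware pair against a known-bad combination. Here is a minimal sketch of such a check; the version strings are made-up placeholders (the real ones are in the Red Hat bug, not reproduced here), and in practice the inputs would come from `uname -r` and `dmidecode -s bios-version`.

```python
# Hypothetical sketch: flag hosts whose kernel/firmware combination is
# known to be bad, so they can be scheduled for updates proactively.
# Version strings below are placeholders, not the actual bad versions.

KNOWN_BAD_COMBOS = {
    # (kernel release prefix, firmware version)
    ("2.6.18-238", "P54-2006"),  # placeholder for the RHEL 5.6 / DL360 g4 combo
}

def needs_firmware_update(kernel_release: str, firmware_version: str) -> bool:
    """Return True if this kernel/firmware pair matches a known-bad combo."""
    return any(
        kernel_release.startswith(kernel_prefix) and firmware_version == fw
        for kernel_prefix, fw in KNOWN_BAD_COMBOS
    )

# Inputs here would normally be gathered from each host by your
# inventory or monitoring system.
print(needs_firmware_update("2.6.18-238.el5", "P54-2006"))  # True: schedule downtime
print(needs_firmware_update("2.6.18-238.el5", "P54-2011"))  # False: already patched
```

A check like this could run from an inventory system across the fleet, turning the "dormant bug" problem into a report you can act on before the crash happens.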
This brings us to a common dilemma in IT: when to apply updates, and which updates to apply? OS-level security updates are often non-impacting and require little to no downtime, so they are a no-brainer. Plus, they need to be done to maintain a secure system. On the other hand, firmware updates almost always require a reboot and some downtime to perform. They can invoke memories of bricked systems due to a failed update (which hardly ever happens nowadays). They rarely affect security, and as long as a system is running fine they fall under the argument of "if it ain't broke, don't fix it". In our case, it broke because we were not proactive in fixing a problem that sat dormant for us.
Moving forward, we will have to be more proactive about updates like these. Downtime will need to be scheduled, and on nodes with no service redundancy (like people) it will not be popular, but I hope this illustrates the necessity. It is better for us to take systems down on our own terms, with advance notice to the users, than to see them go down at random in the middle of the day.
That’s all for this week. Next week, we attend a conference and talk to some geeks.