Which build infrastructure problems do you see the most?
August 13th, 2010
I’m hoping to tackle bug 505512 (Make infrastructure related problems turn the tree a color other than red) in the next few weeks. Most of the ground work for it is laid, which means that most of what I’ll be doing is parsing logs for infrastructure errors.
So, what errors do you see most from our build infrastructure? Are there other things that you would classify as infrastructure issues? Please add any suggestions you have to this Etherpad: http://etherpad.mozilla.com:9000/build-infra-errors
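For anyone wondering what “parsing logs for infrastructure errors” might look like in practice, here is a minimal sketch. The pattern list and the function name `classify_failure` are purely illustrative assumptions, not the actual Tinderbox code; the real list would come from the suggestions collected on the Etherpad.

```python
import re

# Hypothetical examples of infrastructure-looking error strings.
# The real list would be built from the suggestions people send in.
INFRA_PATTERNS = [
    re.compile(r"No space left on device", re.IGNORECASE),
    re.compile(r"Connection (refused|timed out|reset by peer)", re.IGNORECASE),
]


def classify_failure(log_lines):
    """Return 'infrastructure' if any line matches a known pattern, else 'code'."""
    for line in log_lines:
        for pattern in INFRA_PATTERNS:
            if pattern.search(line):
                return "infrastructure"
    return "code"
```

A failed build whose log is classified as `infrastructure` could then be given a color other than red, which is what bug 505512 asks for.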
Update on recent Tinderbox issues
August 12th, 2010
My last post talked about the issues we’ve been having with load on the Tinderbox server and some ways we could fix it. I’m happy to report that two things were completed yesterday that should keep the load under control for the foreseeable future.
One of the things mentioned in my previous post, splitting incoming build processing from the rest of Tinderbox (bug 585691), was completed very late last night. Additionally, Nick Thomas discovered that we had lost the cronjob that takes care of cleaning out old builds from Tinderbox’s memory. That script was re-enabled and a one time clean up removed 64GB of old build data. Both of these were completed around 4am PDT this morning and load is looking much better.
Especially because we’re now running cleanup scripts on a regular basis again, I believe that this should get us back to a good state.
Everyone should feel free to send justdave their thanks for staying up reaaaaallllly late last night to get us back to a good state.
Recent Tinderbox issues
August 10th, 2010
As many of you know there have been numerous times lately that Tinderbox has become unresponsive, sometimes to the point of going down completely for a period of time. This post will attempt to summarize the issues and what’s being done about them.
The biggest issue is load (surprise!). In a period of a few years we’ve gone from a few active trees with tens of columns between them to tens of active trees with hundreds of columns between them. Unsurprisingly, this has made the Tinderbox server a lot busier. The biggest load items are:
- showlog.cgi – Shows a log file for a specific build
- showbuilds.cgi – Shows the main page for a tree (like this)
- processbuilds.pl – Processes incoming “build complete” mail
A bit of profiling has also been done in bug 585814 to try to find specific hotspots.
We’ve already done a few things to help with Tinderbox load:
Other ways we’re looking at improving the situation:
- bug 585691 – Split up Tinderbox data processing from display. This wouldn’t reduce overall load, but it should segregate it enough to keep the Tinderbox display up.
- bug 390341 – Pregenerate brief and full logs. This would eliminate the need for showlog.cgi to uncompress logs in most cases.
- bug 530318 – Put full logs on FTP server; stop serving them from Tinderbox.
GRUB, the MBR, BIOS bugs?
March 9th, 2010
Recently we ordered a set of server class Linux machines to supplement our pool of VMs. They are lightning fast, especially compared to VMs, but it’s been a bit of a bumpy ride getting them ready to go to production. Most notably we’ve had a mysterious problem where they would occasionally refuse to boot, halting at a “GRUB _” dialogue. It took a while, but we believe we have this fixed now.
This problem first occurred on 2 out of 25 slaves. Catlee quickly discovered that it could be fixed with a simple re-installation of GRUB, so that’s what we did, and moved on. The thought at the time was that the MBR somehow got partially overwritten or otherwise corrupted. A day later, 2 more slaves hit the same issue. Since it was the second time we hit the issue there was some more speculation and digging. We had made a few changes to the machines, including:
* Changing the hard disk controller from “IDE” mode to “AHCI”.
* Changing the kernel to a PAE version.
Both of those were pretty quickly dismissed as the causes. It seemed very unlikely that the kernel version could cause an issue with the bootloader, and the problem didn’t occur instantly after changing the disk controller mode, so that seemed unlikely too. With other important things happening we again moved on.
The next day I did some more googling, this time about GRUB in general, and came across a page detailing the GRUB boot process. In it, it talks about how to dump the contents of the MBR and view it as hex. Seeing that made me *very* eager to compare a working slave vs. a busted one. Unfortunately there was no longer a busted machine to look at.
After 5 or so days without issues, and after all other setup and configuration issues were taken care of, we decided to move them to production and deal with the GRUB problems if they arose. As luck would have it, 2 machines refused to boot as they were being moved to production. After booting from a rescue disk and dumping the MBR I found that bytes 0x40 through 0x49 differed from those of a working slave. I also noticed that the MBR of a busted slave was identical to one that had *never* broken, and thus, never had GRUB re-installed. This seemed to rule out MBR corruption.
With some more information in my hands I looked for some help or pointers from the GRUB developers, on Freenode. One of them pointed me to this section of the GRUB Manual which documents some key bytes of the MBR. Notably, byte 0x40 is described as “The boot drive. If it is 0xFF, use a drive passed by BIOS.”. On a working slave this was set to 0xFF. On a broken one, it was set to 0x80 (which I was told means “first hard drive”). That certainly sounds like something that could affect bootability!
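The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and it assumes the MBR has been read from a device or dumped to a file (for example from a rescue disk), with the boot drive byte at offset 0x40 as the manual section describes.

```python
MBR_SIZE = 512
BOOT_DRIVE_OFFSET = 0x40  # offset documented in the GRUB manual section above


def read_mbr(path):
    """Read the first 512 bytes (the MBR) from a device node or a dump file."""
    with open(path, "rb") as f:
        data = f.read(MBR_SIZE)
    if len(data) != MBR_SIZE:
        raise ValueError("short read: expected %d bytes" % MBR_SIZE)
    return data


def describe_boot_drive(mbr):
    """Explain the boot drive byte in human terms."""
    value = mbr[BOOT_DRIVE_OFFSET]
    if value == 0xFF:
        return "0xFF: use the drive passed by the BIOS"
    return "0x%02X: boot drive is hard-coded" % value


def diff_offsets(mbr_a, mbr_b):
    """Return the byte offsets at which two MBR dumps differ."""
    return [i for i, (a, b) in enumerate(zip(mbr_a, mbr_b)) if a != b]
```

Calling `diff_offsets(read_mbr(working), read_mbr(broken))` on dumps from a working and a broken slave would list the differing offsets, and `describe_boot_drive` shows how byte 0x40 is set on each. Reading a raw device such as `/dev/sda` normally requires root.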
After thinking it over a few times I came to the conclusion that *somehow* 0x80 must end up being the wrong device to boot from. I also realized that no slave which had had GRUB re-installed had failed again. With all of that I became confident that re-installing GRUB would fix the problem permanently. I ran all of this by Catlee who told me that GRUB developers had told him that the BIOS could be re-ordering drives semi-randomly. That piece of information seems to fill in the last bit of the puzzle and I’m more confident than ever that re-installing GRUB will permanently fix the problem.
It’s still a mystery to me why the BIOS would be re-ordering the drives at random. There’s a “BIOSBugs” page on the GRUB wiki which describes a problem where the BIOS sends the *wrong* boot device. Since relying on the BIOS to send the boot device has fixed our problem I don’t think it’s the same thing. I haven’t been able to find any information on this specific issue, or how to find out what boot device the BIOS is sending the Bootloader, which makes it difficult to truly confirm our fix. If anyone has hit this, or knows how to get at this kind of information I’d love to hear from you.
All about the RelEng sheriff
October 26th, 2009
Since February of this year we’ve had a rotating RelEng “sheriff” available. We started it to make a couple of things better:
- Improve response time on critical issues
- Avoid having the whole team distracted with infrastructure issues
By and large, this has been an improvement for us and, we think, for developers as well. Serious issues are dealt with more quickly; developers and the developer sheriff have someone specific to go to with acute issues that come up. Internally, this has helped us focus more, too. With the RelEng sheriff dealing with triage and other acute issues the rest of us are able to focus on our other work without distraction.
What is the RelEng sheriff responsible for?
- Managing the triage queue
- Monitoring #developers, mozilla.dev.tree-management, #build, and Nagios for issues
Who is the RelEng sheriff?
The RelEng sheriff is rotated weekly. You can find out who the current RelEng sheriff is by looking at the schedule.
How to get a hold of the RelEng sheriff
The best place to find them is on IRC, in #build or #developers. They should be wearing a ‘|buildduty’ tag at the end of their nick. You can also get our attention in other ways, if IRC doesn’t work for you:
- File a bug in mozilla.org : Release Engineering
- Post to the mozilla.dev.tree-management newsgroup
- E-mail release@mozilla.com
Bugs and IRC pokes are the preferred methods but any will work. Also note that the RelEng sheriff is only around during their normal working day, which can be PDT/PST, EDT/EST, or NZDT/NZST. If a RelEng sheriff isn’t around, someone can be reached in #build.
What can your sheriff do for you?
The on-duty RelEng sheriff would be more than happy to do any of the following for you:
- Trigger any sort of build or test run you need, including:
  - Extra unit test or Talos runs of any given build
  - Retriggering builds that fail for spurious reasons
- Deal with any nightly updates that fail
- Help debug possible build machine issues
- Help debug test issues that you cannot reproduce yourself
- Answer questions you may have about build or test infrastructure
The RelEng sheriff is also a good first-contact point for any other random things. They may be able to help you directly but if not, they can certainly point you to the person who can.
After reading this, I hope you have a better understanding of the who, what, and why of the RelEng sheriff. If anything is unclear or absent I’m happy to clarify.
Anatomy of an SDK update
October 2nd, 2009
Over the course of the past week or so I’ve been working on rolling out the Windows 7 SDK to our build machines. Doing so presented two challenges: Getting the SDK to deploy silently and properly, and updating the appropriate build configurations to use it. Neither of these may sound very challenging, and indeed, they didn’t to me either, but because of a combination of factors this ended up becoming a week-long ordeal. In this post I will attempt to untangle everything that happened.
Let’s start with the actual SDK installation. Unlike most other reasonable packages, the Windows 7 SDK is not distributed as an MSI package, but rather a collection of MSIs wrapped in an EXE. Unfortunately, this EXE doesn’t enable you to do a customized, silent install – the precise thing we need. Vainly, I thought I could figure out the proper order and magic options to install the enclosed MSIs properly. Needless to say, this failed. To work around this I fell back onto using an AutoIt script that would click through the interactive installer for me. It took some fuss, but not too much difficulty, to get that working.
Now, the fun part (of deployment). We use a piece of software called OPSI to schedule and perform software installations across our farm of 80 or so Windows VMs. OPSI runs very early in the Windows start-up process, and actually executes as the SYSTEM user. Well, it turns out that the Windows 7 SDK must be installed by a full user, not the SYSTEM account. This seems unnecessary, as we’ve deployed other SDKs through OPSI in the past without issue. After trying to fake it out by setting various environment variables I turned to the OPSI forums for some help. (As an aside, the OPSI developers have been fantastic in their support of our installation, many thanks to them.) It turns out that I’m not the first person to hit problems like this. They pointed me to a template for a script that works around such an issue. The solution ends up being:
- Copy installation files to the slave
- Create a new user in the Administrators group, set that user to automatically login at next boot
- Reboot, and run the package installation at login
- Restore the original automatic login, reboot
- Cleanup (delete installation files, remove the created user)
This is obviously quite hacky, but it gets the job done.
So! With that in hand (and in repo) we set the SDK to deploy over the course of Wednesday night and Thursday morning. Overall, this went smoothly. For reasons I haven’t yet figured out, some of the slaves needed some kicking to do the installation properly.
Remember how I said part 2 of this was updating the build configurations? I had planned to do this on Friday, and even posted a patch in preparation. Well, it turns out that MozillaBuild likes to be smart and find the most recent SDK and compiler for you. This completely slipped my mind while I was doing the deployment and as a result, all builds from Thursday (yesterday) morning to Friday (today) morning, including those on mozilla-1.9.1, were done with the Windows 7 SDK. This went unnoticed most of Thursday until I was doing a final test of my build configuration patch.
Here’s where the fun starts for this part. After discovering I’d accidentally changed the SDK for everything I went into a bit of a panic and rapidly started testing some fixes out in our staging environment. During the course of this I discovered that things were worse than I thought. Most builds were using the Windows 7 SDK, but not the “unit test” ones. So we weren’t even using the same SDK for all the builds for a given branch! Getting all of that sorted out was compounded by all of the iterations of path styles (c:/ vs. c:\ vs. /c/) I had to try before I found the magic combination. In the end, I discovered a few things:
- If you’re specifying LIB/INCLUDE/SDKDIR in a mozconfig, you must use Windows-style paths
- If you’re specifying PATH in a mozconfig, you CANNOT use Windows-style paths – you must use MSYS style
- You can’t test for these things properly without clobbering
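To make the difference between the path styles concrete, here is a small sketch that converts between drive-letter paths and MSYS-style paths. The helper names and the example paths are hypothetical and not part of MozillaBuild; it also emits backslashes for the Windows style, which is an assumption on my part.

```python
import re


def to_msys_path(win_path):
    # Accepts drive-letter paths with either slash style and returns /c/... form.
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", win_path)
    if not match:
        raise ValueError("not a drive-letter path: %r" % win_path)
    drive, rest = match.groups()
    rest = rest.replace("\\", "/")
    return "/" + drive.lower() + ("/" + rest if rest else "")


def to_windows_path(msys_path):
    # Accepts /c/... form and returns a backslash-separated drive-letter path.
    match = re.match(r"^/([A-Za-z])(?:/(.*))?$", msys_path)
    if not match:
        raise ValueError("not an MSYS-style path: %r" % msys_path)
    drive, rest = match.groups()
    rest = (rest or "").replace("/", "\\")
    return drive.lower() + ":\\" + rest
```

With helpers like these, a value meant for PATH would go through `to_msys_path`, while values meant for LIB, INCLUDE, or SDKDIR would go through `to_windows_path`.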
As I write this the first set of builds that all use the correct SDK are finishing up, and this deployment from hell appears to be nearly over. I want to express a special thanks to the OPSI developers, who were very helpful, and to Nick Thomas and Chris AtLee, for their patience with my countless iterations of build configuration patches. As a final note, let me state explicitly which SDK is being used where:
- Windows Vista SDK (6.0a): mozilla-1.9.1 builds
- Windows 7 SDK (7.0): mozilla-central, mozilla-1.9.2, TraceMonkey, Electrolysis, and Places builds
WinCE and WinMO builds are unaffected by this deployment.
Recent and upcoming Try Server changes
May 15th, 2009
This morning I landed bug 486567 – which cleaned up the try server code significantly. There’s still more to be done there, particularly running unittests on packaged builds once its production counterpart lands (bug 383136). Both of these things help us keep the Try Server in sync with the rest of the world – which has always been a problem.
Looking forward a little bit, I’m looking to land a patch on Tuesday that enables e-mail notification for try server builds and unit tests. With this patch, every try submission would result in 6 e-mails to the submitter (1 per platform/build type combination). Here’s what they’ll look like:
Build:
Your Try Server build (try-1c170baeac1) was successfully completed on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1
Visit http://hg.mozilla.org/MozillaTry to view the full logs.
Unit test:
Your Try Server unit test (try-1c170baeac1) completed with warnings on linux. It should be available for download at http://build.mozilla.org/tryserver-builds/bhearsum@mozilla.com-try-1c170baeac1
Summary of unittest results:
check: 2/0
Visit http://hg.mozilla.org/MozillaTry to view the full logs.
(The unittest e-mails will have the full results listed, of course).
E-mail notification has been an oft requested feature so I’m really excited that this will be landing soon.
New publicly available CentOS ref platform!
May 5th, 2009
I’m happy to announce that I’ve finally updated the publicly available version of our CentOS 5.0 build reference platform. There are many changes to it since the last released version, most notably a Scratchbox installation and Mercurial. For all the details you can have a look at the reference platform wiki page. Everything up to Version 17 is included on the released version.
You can get it here: ftp://ftp.mozilla.org/pub/mozilla/VMs/
Firefox 3.1/3.2 builds no longer reporting codesighs ‘mZ’ metric
January 6th, 2009
Bug 358845 pointed out that the ‘mZ’ metric we report for Codesighs tests is meaningless for Firefox. As such, we have stopped running it. This is just a quick note to let people know not to panic: it’s fine! The ‘Z’ number is still being reported and valid.
Mozilla Scheduled Maintenance – 10/20/2008, 4am – 7am PDT
October 17th, 2008
During this time we will be upgrading both the Firefox 3 and Firefox 3.1 Unit test Buildbots to a newer version (0.7.9). In order to avoid interrupting running builds we will be closing the tree at 4am PDT and stopping any new builds from being scheduled. Once all builds have finished we will perform the upgrade and open the tree again. Depending on the timing this could take anywhere from 20 minutes to a few hours. The tree should be open again no later than 7am PDT.
If there is any reason why we shouldn’t go ahead with this please e-mail release@mozilla.com