With Mozilla’s increasingly exciting ventures into the world of mobile computing, we need to expand the pool of hardware available for continuous integration testing. While some tests can run on emulators using x86-based servers, other tests need to be run on actual ARM hardware. For Firefox for Android, we have been using Nvidia Tegra boards for several years, but new tegras are no longer available. We are now beginning to use Panda Boards for both Firefox for Android and Firefox OS.
Development Boards ... In Production?
The problem we face is this: development boards are not designed for a data center, or for use in a production system. They are small, fragile, and sensitive to ESD. They have no redundant power, no out-of-band management, and no easy means of mounting in a standard rack. Historically we’ve mounted them in fabric shoe racks, or balanced them carefully on rack shelves, where a stray tug on a cable can topple devices like dominoes and short out solder traces.
Even without humans on-site, things go wrong all the time: SD cards fail or become corrupted, boards overheat, connectors come loose, and the things just plain die more often than any real phones, desktops, laptops, or servers. If we treated dev-board failures the way we treat server failures, that would add up to hundreds of man-hours spent handling device failures – not scalable. Even so, we’ve already deployed over 800 pandas and 400 tegras, with plans for more.
This led us to develop a system called Mozpool. Mozpool gives us the ability to manage this large collection of unstable devices in a highly automated fashion and only apply human hands and minds when absolutely necessary. The tool essentially provides hardware on demand — a bit like Amazon EC2, but without virtualization. The tool is designed to provide a higher level of reliability to its clients than that provided by the underlying devices, just as reliable TCP does for unreliable IP. Mozpool accomplishes this by weeding out the faulty hardware or “problem child” devices through the use of automated failure detection and remediation, and by treating devices as fungible: one is just as good as another, so Mozpool has the flexibility to substitute working devices for broken devices.
Mozpool primarily serves our build and test infrastructure, where commits are automatically built and devices supplied by Mozpool are used to test the builds. We also have a smaller pool for security assurance to support security fuzzing. Devices can also be loaned to developers for debugging or performance analysis.
The tool is made up of three software layers:
- Mozpool (a self-titled component)
At this level, requests are made through an API for a device meeting certain criteria: hardware type, environment (staging, production, etc.), or rack location. Requests are valid for a limited duration, after which the device returns to the pool (unless the lease is renewed). Requests can also specify the OS image that should be on the device when it is delivered. If necessary, Mozpool will flash that image onto a device.
When the build and test system has a fresh Firefox OS build it would like to test, it specifies the location of this build to Mozpool, which installs it before handing over the selected device.
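A device request can be sketched as a small JSON payload POSTed to the Mozpool server. The endpoint URL and field names below are assumptions for illustration, not the actual Mozpool wire protocol:

```python
import json

# Hypothetical Mozpool server URL; the endpoint path and the JSON field
# names below are illustrative, not the actual Mozpool wire protocol.
MOZPOOL_URL = "http://mozpool.example.com/api/device/any/request/"

def build_device_request(assignee, duration_secs, image_url,
                         environment="production"):
    """Build the JSON body for a "give me any matching device" request."""
    return json.dumps({
        "assignee": assignee,        # who is borrowing the device
        "duration": duration_secs,   # lease length; renew before expiry
        "image": image_url,          # build to install before handover
        "environment": environment,  # staging, production, etc.
    })

# The body would then be POSTed to MOZPOOL_URL; on success, Mozpool
# answers with the name of a device that has the requested image on it.
```

In the build-and-test case, the image URL points at the fresh build, and Mozpool flashes it onto the device before handing the device over.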
- Lifeguard
This level is an all-encompassing state machine that tracks state and statistics for every device managed by Mozpool. It is aware of all the workflows available for manipulating devices, and it detects failures when devices don’t do what they are told within a reasonable amount of time. For example, if a re-image is requested and the device fails while writing to the SD card, Lifeguard will detect the failure as a timeout and iterate through a certain number of retries before moving the device into a failed state, where it can be escalated to automated hardware testing or hands-on attention.
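The retry-then-escalate behavior can be sketched in a few lines of Python. The retry count and state names here are made up for illustration; they are not Lifeguard’s actual configuration:

```python
MAX_RETRIES = 3  # illustrative; not Lifeguard's real setting

def run_with_retries(operation, max_retries=MAX_RETRIES):
    """Drive one device operation the way the state machine does:
    attempt it, count timeouts/failures, and give up into a 'failed'
    state after too many tries so hardware self-tests or a human can
    take over."""
    for _attempt in range(max_retries):
        if operation():          # e.g. re-image the SD card
            return "ready"       # success: back into the pool
    return "failed"              # escalate for testing or repair
```

Because devices are fungible, a "failed" result doesn’t block the requester: Mozpool simply substitutes another working device.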
- Black Mobile Magic
The lowest level includes a set of tools that allow Mozpool to manipulate a device remotely: individual device power control through network-aware power relay boards, and an integrated PXE-boot system. Each panda’s SD card is pre-seeded with a boot image that performs a DHCP request and boots from the network if an appropriate configuration file is available via TFTP.
The combination allows Mozpool to automatically boot a device into a live Linux system, where shell scripts are loaded to perform various tasks: install an image to the SD card, run a set of hardware validation tests, or enter an SSH-accessible maintenance mode.
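The power-control half can be sketched as a hard power-cycle of one board. The `set_relay` callable is a stand-in for whatever protocol the relay boards actually speak (which isn’t specified here), and the bank/relay addressing is an assumption:

```python
import time

def power_cycle(set_relay, bank, relay, off_seconds=5):
    """Hard power-cycle one panda via its relay board.

    `set_relay(bank, relay, on)` is an injected callable standing in
    for the relay board's real control protocol; the board model and
    wire protocol are assumptions, not Mozpool specifics.
    """
    set_relay(bank, relay, False)   # cut power to the device
    time.sleep(off_seconds)         # let the board fully power down
    set_relay(bank, relay, True)    # power back on; PXE boot can follow
```

Injecting `set_relay` keeps the sequencing logic testable without real hardware on the bench.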
The pandas themselves are mounted by the dozen in custom-designed 4U chassis, each equipped with the relay boards mentioned above, as well as active cooling and network connections. Within the chassis, each panda is securely mounted in a field-serviceable, removable mounting bracket.
The Mozpool project has been a cross-team effort, with staff from release engineering, release operations, and the a-team participating. In particular, Jake Watkins, Mark Côté, Ted Mielczarek, John Hopkins, and Armen Zambrano Gasparnian deserve credit for most of the hard work creating the tool. Getting this much hardware from the loading dock into production this quickly isn’t easy, either, particularly since the pandas moved into brand-new space in one of our DCs. Credit for that goes to the DC Ops team: Derek Moore, Ashlee Chavez, Van Le, Sal Espinoza, and Vinh Hua. Melissa O’Connor held the whole project together, through a lot of complications and setbacks not described here.
Although Mozpool is quite young (the software was designed and written in about 6 weeks), it has already begun to prove its value, as we shift pandas between purposes and coordinate around working or failed pandas.