Axel Hecht Mozilla in Your Language

March 3, 2017

On updating the automation behind l10n.m.o

Filed under: L10n,Mozilla — Tags: , , , — Axel Hecht @ 1:01 pm

Or, how to change everything and nobody sees a difference.

Heads up: All I’m writing about here is running on non-web-facing VMs behind VPN.

tl;dr: I changed 5 VMs, landed 76 changesets in 7 repositories, resolving 12 bugs, got two issues in docker fixed, and took a couple of days of downtime. If automation is your cup of tea, I have some open questions at the end, too.

To set the stage: Behind the scenes of the elmo website, there’s a system that generates the data that it shows. That system consists of two additional VMs, which help with the automation.

One is nick-named a10n, and is responsible for polling all those mercurial repositories that we use for l10n, and for updating the elmo database with information about those repositories as it comes in. elmo basically keeps a copy of the mercurial metadata for quicker access.

The other is running buildbot to do the actual data collection jobs about the l10n status in our source repositories. This machine runs both a master and one slave (the actual workhorse, not my naming).

This latter machine is an old VM, on an old OS, with old Python (2.6); it never had real IT support, and is all around historic. And it needed to go.

With the help of IT, I had a new VM, with a new shiny python 2.7.x, and a new storage. Something that can actually run current versions of compare-locales, too. So I had to create an update for

  • Python 2.6 → Python 2.7.x
  • globally installed python modules → virtualenv
  • Django 1.4.18 → Django 1.8.x
  • Ubuntu → CentOS
  • Mercurial 3.7.3 → Mercurial 4.0.1 and hglib
  • individual local clones → unified local clones
  • no working stage → docker-compose up

At the same time, we also changed hg.m.o from http to https all over the place, which also required a handful of code changes.

One thing that I did not change is buildbot. I’m using a heavily customized version of buildbot 0.7.12, which is incompatible with later buildbot changes. So I’m tied to my branch of 0.7.12 for now, and with that to Twisted 8.2.0. That will change, but in a different blog post.

Unified Repositories

One thing I wanted and needed for a long time was to use unified clones of our mercurial repositories. Aside from the obvious win in terms of disk usage, it allows using mercurial directly to create a diff between a revision that’s only on aurora and a revision that’s only on beta. Sadly, I thought otherwise when I wrote the first parts of elmo and the automation behind it, often falling back to default instead of an actual hash revision if I didn’t know anything ad hoc. So that had to go, and required a surprising amount of changes. I also changed the way that comparisons are triggered, making them fully reproducible. They also got more robust. I used to run hg id -ir . to get the revision, which worked OK, unless you had extension errors in stdout/stderr. Meh. Good that that’s gone.

As I noted, the unified repositories also benefit doing diffs, which is one of the features of elmo for reviewing localizations. Now that we can just use plain mercurial to get those diffs, I could remove a bunch of code that created diffs between aurora and beta by creating diffs between each head and some ancestor, and then sticking those diffs back together. Good that that’s gone.
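For illustration, with a unified clone the cross-branch diff boils down to a single hg invocation; a sketch (repo path and revisions are made up):

```python
def cross_branch_diff_cmd(repo, rev_on_aurora, rev_on_beta):
    """Build the hg command for diffing two revisions that live on
    different heads of one unified clone. With individual per-branch
    clones, neither repository contains both revisions, so this single
    command isn't possible there. Pass the list to subprocess.run()."""
    return ["hg", "-R", repo, "diff", "-r", rev_on_aurora, "-r", rev_on_beta]
```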


Testing an automation with that many moving parts is hard. Some things can be tested via unit tests, but more often, you just need integration tests. I still have to find a way to write automated integration tests, but even manual integration tests require a ton of set-up:

  • elmo
  • MySQL
  • ElasticSearch
  • RabbitMQ
  • Mercurial upstream repositories
  • Mercurial web server
  • a10n get-pushes poller
  • a10n data ingestion worker
  • Buildbot master
  • Buildbot slave

Doing this manually is evil, and on Macs it’s not even possible, because Twisted 8.2.0 doesn’t build anymore. I used to have a script that did many of these things, but that’s … you guessed it. Good that that’s gone. Now I have a docker-compose test setup that has most things running with just a docker-compose up. I’m still running elmo and MySQL on my host machine; fixing that is for another day. Also, I haven’t found a good way to do initial project setup like database creation. Anyway, after finding a couple of bugs in docker, this system now fires up quickly and lets me make various changes and see how they pass through the system. One particularly nice artifact is that the output of docker-compose is actually all the logs together in one stream. So as you’re pushing things through the system, you just have one log to watch.

As part of this work, I also greatly simplified the code structure, and moved the buildbot integration from three repositories into one. Good that those are gone.


Sadly there were a few bits and pieces where my local testing didn’t help:

Changing the URL schemes for hg.m.o to https alongside this change triggered a couple of problems where Twisted 8.2 and modern Python/OpenSSL can’t get a connection up. Had to replace the requests to websites with synchronous urllib2.urlopen calls.

Installing mercurial in a virtualenv to be used via hglib is good, but WSGI doesn’t activate the virtualenv, and thus PATH isn’t set. My fix still needs some server-side changes to work.

I didn’t have enough local testing for the things that Thunderbird needs. That left that setup burning for longer than I anticipated. The fix wasn’t hard, just badly timed.

Every now and then, Django 1.8.x and MySQL decide that it’s a good idea to throw away the connection, and die badly. In the case of long-running automation jobs, that’s really hard to prevent, in particular because I still haven’t fully understood what change actually made that happen, and what the right fix is. I just plaster connection.close() into every other function, and see if it stops dying.
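One way to make the connection.close() plastering a bit less ad hoc is a decorator that drops the connection right before each long-running job, so Django reopens it lazily on the next query; a minimal sketch (with Django one would pass in django.db.connection.close):

```python
import functools

def with_fresh_connection(close_connection):
    """Close the (possibly stale) DB connection before the wrapped job
    runs; the framework then reconnects transparently on the next query.
    close_connection is any zero-argument callable, e.g.
    django.db.connection.close."""
    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            close_connection()  # discard whatever MySQL may have dropped
            return job(*args, **kwargs)
        return wrapper
    return decorator
```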

On Saturday morning I woke up, and the automation didn’t process Firefox for a locale on aurora. I freaked out, and added tons of logging. Good logging that is. Best logging. Found a different bug. Also found out that the locale was Belarusian, and that wasn’t part of the build on Saturday. Hit my head against a wall or two.

Said logging made uncaught exceptions in some parts of the code actually show up in logs, and I discovered that I hadn’t tested my work against bad configurations. And we have those: Thunderbird just builds everything on central, regardless of whether the repositories it should use for that exist or not. I’m not really happy yet with the way I fixed this.

Open Questions

  • Anyone got taskcluster running on something resembling docker-compose for local testing and development? You know, to get off of buildbot.
  • Initial setup steps for the docker-compose staging environment are best done … ?
  • Test https connections in docker-compose? Can I? Which error cases would that cover?

November 15, 2010

As sure as logs are logs

Filed under: Mozilla — Tags: , , — Axel Hecht @ 12:04 pm

… or not.

As promised, I’ll write a bit about build logs today. You’ll see what our logs are, and, to begin with, I’ll take you on a tour through buildMessage to explain how the logs we have end up being what you see served off of tinderbox.

First off, buildbot is basically the same thing as any regular gecko app: one main thread and loads of callbacks. So when reading on, all your spontaneous reactions are good.

The buildMessage code does:

  1. synchronous IO to load all logs of a build into memory, basically up to some 70M
  2. synchronous string handling to paste all that data together, with some extra padding
  3. synchronous compression of the resulting string
  4. synchronous base64 encoding of the compressed string
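In rough strokes, and leaving the buildbot objects aside, the blocking part looks like this sketch:

```python
import base64
import zlib

def build_message_blob(step_logs):
    """What buildMessage effectively does, all on the main thread:
    paste the logs of every step into one big string, compress it,
    and base64-encode the result for the tinderbox mail."""
    pasted = "\n".join(step_logs)                # step 2: string handling
    compressed = zlib.compress(pasted.encode())  # step 3: compression
    return base64.b64encode(compressed)          # step 4: base64
```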

All on the main thread, all in one go, blocking. All of that to give you a single lengthy unformatted blob of text. Why?

Because our build logs are actually not a single lengthy unformatted blob of text, which is what tinderbox wants.

Let’s have a peek into what our build logs are, really. In my previous posts, I introduced you to the concept of build steps. They’re really the basic entity of work to be done for a build. Now, the logs are stored in buildbot pretty much in how the data comes, that is, each log is associated with a step, and the storage is happening as the chunks arrive. Commonly, that’d be stdout and stderr data coming from shell commands run on the slave. The information about which stream the data is on is persisted, too, as is the order, so any log looks like this, basically:

Step reference:
    header  length  data
    stdout  length  data
    stdout  length  data
    stdout  length  data
    stderr  length  data
    stdout  length  data
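Since the channel of each chunk is persisted, pulling out just the build output of a step is a simple filter; a sketch (the channel constants match buildbot 0.7.x’s encoding, to the best of my knowledge):

```python
# Channel constants as used by buildbot 0.7.x for log chunks.
STDOUT, STDERR, HEADER = 0, 1, 2

def channel_text(chunks, wanted):
    """Join the data of all chunks on one channel, in arrival order.
    chunks is a sequence of (channel, data) pairs as stored per step."""
    return "".join(data for channel, data in chunks if channel == wanted)
```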

As most of you aren’t among the few privileged ones to actually look at the real logs, I’ve set up a fake log page for you to take a look. It’s an l10n repack, mostly because they’re somewhat small in both step count and log size, and because I’m used to them. Here’s the actual make step highlighted. You can see the introduction being shown in blue, which is the common color for header chunks. Buildbot just uses that channel to show setup and shutdown information on the step. Then there’s the actual make output in black. If there was something on stderr, it’d be styled in red. Sorry, I didn’t quickly come up with something that has stderr.

The first take-away is that you can get to just the build output of the step you’re interested in.

If you’re nostalgic, you can check the checkbox for tinderbox, the css style sheet changes to show you what you’d get from tinderbox. Try to find the information again?

One further detail, there can be more than one log per step. Buildsteps that set build properties quite commonly have two logs, one that keeps track of the command that got run, and another that keeps track of the actually changed build properties. You can look at an example in the builddir step. The boring last line is the second log.

Log files are really not all that complicated, and much more useful than what we get back from tinderbox. Let’s look at some of the pros:

Log files come in as the build goes. This enables buildbot to publish build logs in almost realtime. There’s little to no cost for that, too; a simple node.js proxy can ensure that only one log is read at any time. Another benefit is that one can archive logs incrementally, removing the current stress on the masters to publish more data than they want to chew in one go.

Log files are per task. As the logs are associated with a step, which has a name and a builder, there’s pretty rich information available on what the data in question is actually about. Think about hg-specific error parsers for one step, ftp-specific ones for the next, and mochitest-specific ones for the one after that. All in one build. If we archive the raw data, we can easily improve our parsers and stay compatible with old logs. Or add new steps to the build process without fear of breaking existing log parsers.
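A sketch of what such per-step dispatch could look like (the step names and error patterns are invented for illustration):

```python
def parse_step_log(step_name, log_text):
    """Route a step's log to a parser chosen by the step's name,
    falling back to no parsing for steps we don't know about."""
    parsers = {
        "hg_update": lambda t: [l for l in t.splitlines()
                                if l.startswith("abort:")],
        "mochitest": lambda t: [l for l in t.splitlines()
                                if "TEST-UNEXPECTED" in l],
    }
    parse = parsers.get(step_name, lambda t: [])
    return parse(log_text)
```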

Tinderbox can still be fed. Even if we’re not sending out tinderbox log mails from the masters, we can still do the processing out of band, in an external process or even on an external machine, offload the masters, and avoid having to change all the infrastructure in one go.

There is a hard piece, too: storage. Build logs are plenty, and they’re anywhere from a dozen bytes to 70M, within the same build, even. There are hundreds of thousands of small files, and thousands of really large ones. I hope that adding some information on what our build logs really are helps to spark a design discussion on this. Whether to compress, and at which level. Retention, per step type, even? Store as single files, in one dir, or in a hierarchy, or as tarballs? Or all of the above as part of retention? Is hbase a fit?
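To make the storage question a bit more concrete, here is one possible hierarchical layout, sharding by a hash of the builder name so no single directory explodes. This is purely illustrative, one option among the many raised above:

```python
import hashlib

def log_path(builder, buildnumber, step_name):
    """Place each per-step log in a shallow, hash-sharded hierarchy,
    compressed per file. Layout and naming are one option among many."""
    shard = hashlib.sha1(builder.encode()).hexdigest()[:2]
    return "%s/%s/%d/%s.log.gz" % (shard, builder, buildnumber, step_name)
```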

November 12, 2010

Counting sourcestamps, changes, and faking data

Filed under: Mozilla — Tags: , , — Axel Hecht @ 8:31 am

As a follow up to my previous post on my digging through our build status, I want to look with a bit more detail, pretend it’d all be simple and what it could be, and, well, add the promised chocolate to coconut. Bounty.

Let’s look at the actual data for two and half landings. First, I’ll start with a rather simple landing by roc, revision 1b43… on mozilla-central. Let me summarize the builds real quick:

1 changeset in 1 push.

27 change rows in status db.

16 different branch names.

106 sourcestamps in status db.

245 builds.

That’s a lot, because what we’re really interested in is

1 push, 245 builds.

Talk is cheap, but what’s really cheap is manipulating other people’s databases, so while Vettel was running in circles in Brazil, I was running circles in the db and manually stitched things back together. The result is still coconut, but with chocolate, so here is the same url in bounty.

1 source, 1 change, 253 builds.

Wait, what, not 245? No, 253, because, well, there are more disconnects in the status, so the query in the database doesn’t catch them all. That’s what you need manual stitching for. Also, finding the right sourcestamp to keep isn’t trivial.

Which is why I only did it for a few builds. Sorry, you’ll not find a lot of builds that are stitched together, so enjoy the guided tour through the few shiny places.

During my needlework I came across another set of changes which are worthwhile to include into today’s tour. It’s two pushes, by khuey and vlad. Let’s give Count Count a rest and look at them with chocolate right away, the builds for khuey’s revision and vlad’s. What you’ll notice is that some of the builds for khuey aren’t there, but lumped together with vlad’s. Why’s that?

Our build infrastructure really doesn’t know about pushes. As I’ve detailed in my previous post, there are sourcestamps and changes, but no further grouping. At this point it’s really a design decision on whether the buildbot changes are hg pushes or hg changesets. This decision is currently in favor of hg changesets, which may just be right. At that point, the scheduling logic that puts changes into builds needs to put extra care into creating sourcestamps for what we intend to get builds for, and to keep those sourcestamps apart. The current implementation puts anything coming in within three minutes into the same sourcestamp, which is somewhat of a load limiter.
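The three-minute collation described above amounts to something like this sketch (the window value mirrors the description; the rest is illustrative):

```python
def group_into_sourcestamps(changes, window=180):
    """Group changes, given as (arrival_time, change_id) pairs sorted by
    time, into sourcestamps: a change arriving within `window` seconds of
    the previous one joins the same stamp. This is how two pushes landing
    close together can get lumped into one set of builds."""
    stamps = []
    for arrived, change_id in changes:
        if stamps and arrived - stamps[-1][-1][0] <= window:
            stamps[-1].append((arrived, change_id))
        else:
            stamps.append([(arrived, change_id)])
    return stamps
```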

Anyway, back to chocolate. When you looked at the pages, did you realize that once you add it, you’re almost at a tinderboxpushlog page? Right, it could be that “easy”.

What’s between reality and chocolate? Well, sendchange. That’s a buildbot architecture component that allows shell commands to insert changes into the buildmaster. It’s rather limited in the data it can transport, which is why we lose data on the way. There’s an alternative feature called trigger, which doesn’t. There is an open ticket to make that span different buildmasters, but given how much Mozilla invested into schedulerdb, let’s pray it’s easy to fix. Filed bug 611670.

Update: Changed links to l10n community server.

November 10, 2010

Looking at the internals of our builds

Filed under: Mozilla — Tags: , , — Axel Hecht @ 9:34 am

Chris Atlee has put up database dumps of both the scheduler and the status databases. These databases are the most detailed and (almost, the status db aside) first-class information on what our builds are really doing. The current code on top of that is all pylons-based, and I am, like many other mozillians, part of the django shop. So for one I figured “let’s see if I can make django read this”.

I can. It’s a bit rough, though, as some of the tables for many-to-many relationships don’t contain a primary key column. Django really doesn’t like that, and thus there are some things that you cannot do without those columns. Most of it works just fine, though. This holds at least for the status db, which I looked at in more detail, but the scheduler db ain’t much different. Filed bug 611014.

The code for this is up at django_moz_status on github.

Of course, me being able to talk to the db with a python shell won’t help you much, right? So I’ve spent a few more hours to actually create a really rough website on top of it, which I want to share with you.

Coconut. Hard to open, and once you get there, you may not like it. I have a thing with project names.

Coconut is a bunch of django views on top of django_moz_status, held together by site_demo. You can see it in action (for now) on the l10n community server. This view is exposing three concepts of buildbot, and how they play together:

Sourcestamps (first column): Every build in buildbot has a sourcestamp, and a sourcestamp can have multiple builds. The sourcestamp knows a branch, a revision, and a list of changes going into the builds associated with it.

Changes (second column): Changes are the external “real life” events that may or may not trigger builds. In this view, you see a few lists of changes that look like a push to hg (and are just that), as well as a plethora of changes by Mr. sendchange and Mr. sendchange-unittest. If you remove some query params from the above URL, you can also see a bunch of sourcestamps without changes. Those are nightlies.

Builds (last column): Each sourcestamp can have multiple builds, I’m just showing the builder name (a symbolic short name), the buildnumber, and the result as color. The third column is actually a guess on the platform of the build, based on the platform build property. If that’s not set, unknown is used.
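The relationships between these three concepts can be sketched as plain data structures (field names here are illustrative, not the actual schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Change:
    """An external 'real life' event, e.g. an hg push or a sendchange."""
    who: str
    revision: str

@dataclass
class Sourcestamp:
    """What a build builds from: a branch, a revision, and the list of
    changes going into the builds associated with it."""
    branch: str
    revision: str
    changes: List[Change] = field(default_factory=list)

@dataclass
class Build:
    """One build of one sourcestamp; a sourcestamp can have many builds."""
    buildername: str
    buildnumber: int
    result: Optional[int] = None  # shown as color in the view
    sourcestamp: Optional[Sourcestamp] = None
```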

Which brings me nicely to another two pieces of our build infrastructure that have been hard to look at so far: build steps and build properties. Surf along? Let’s look at a build.

The first section of the build pages shows some general information: builder, buildnumber, status, the start and end time, and how long the build took. It also lists the buildbot master, and the category of the builder. Categories allow filtering builds; sadly, a builder can only be in one category.

Next up is build steps. Each step in a build is an item of work, and an entity of failure. Different steps can handle failures differently, too. You can see that the build starts with a flock of steps that do administrative tasks on the slave. You can see what fraction of the build’s time a step took by looking at the bar in the second column. You’ll see that the majority of that build went into the compile step. And that that passed. And after the compile, there’s a bunch of administrative stuff again.

There are two things that you do not see. One is, each of those steps has build logs attached to them. Commonly one, but possibly more. I’ll talk more about logs in a different post. The other thing is, steps can change the build properties. Which is to say, the build properties which are shown at the end of the build page are not static, but change during the build run.

Build properties? Right, within buildbot, several objects can have properties, among them builds. You’ll find things like the buildnumber, the slavename, the branch, the buildername (pretty). You’ll also find a host of things around the packaging of the build. Quite generally, our builds try to put things that are needed for the build itself, or for the logic around it, into build properties.

The end of the page is reiterating which changes are associated with the sourcestamp for this build.

Let me test your patience once more and invite you to visit a build with a failed step. In this build page you can see how the clobber step worked fine, and took quite some time, and that the actual status of the build is due to the test step failing with a warning.

Now this post is already pretty lengthy, so I’ll take a break here and invite you to go in and click back and forth a bit, and to do some url hacking. If you think this is rough and you’re having a hard time figuring out what’s why, I’ll do a follow up post on how to add chocolate to coconut.

PS: the database this instance is working on is a snapshot that ends in August, details may be different today. I shrunk the database, too, only the last 10k builds still have the buildsteps.

Update: Changed the links to the l10n community server.

July 12, 2009

oh my eyes, but good still

Filed under: Mozilla — Tags: , , — Axel Hecht @ 2:19 pm

I’ve uploaded a snapshot of a build in progress on my test display (“Builds for a change”), which gives a bit better insight into what’s possible to show about a build based on the information that buildbot has, or a build database would have. The important pieces here, compared to tinderbox, are:

  • Builds in progress are associated with the check-in that triggered the build.
  • Builds in progress show individual build steps.
  • Finished builds are displayed in compact form. In this case, all builds end with warnings, and thus come back in a shade of orange.
  • Builds not yet started that are already requested are displayed on top.

I didn’t go into the detail of mentioning which builders are having pending builds. This is mostly me waiting for django 1.1 and aggregation support, but in the end it’s as simple as a GROUP BY. Nor did I try to make that display visually pretty, hell no.

In the context of our regular builds, it’s worthwhile mentioning what unit test and talos runs would look like, i.e., “builds” that are scheduled after the binary builds are done. Those wouldn’t show up until they’re actually scheduled, which is fair enough, as that is what’s actually happening. If a windows build fails, there won’t be unit tests nor talos builds run. You wouldn’t end up in a situation where you think you’re done and you aren’t, though. (Talos not working this way aside, that needs thought due to different masters etc.) The actual builders don’t finish until they triggered the spin-off runs, so you either see the binary build as still running, or you see the triggered runs as pending. Here, as soon as you don’t have anything running or pending and your boxens are green, you’re off the hook.

I also added microsummaries and RSS feeds for this view, so you can use the web to learn about the fire you lit.

June 10, 2009

Running into builds, just testing

Filed under: L10n,Mozilla — Tags: , , , — Axel Hecht @ 10:29 am

I’ve blogged previously on how to set up a staging environment to test the l10n build system, but I didn’t go into any detail on how to actually do builds in that set up. That shall be fixed.

I’m picking you up at the point where

twistd get-pushes -n -t 1 -s l10n_site.settings

is running stable. You probably want to tail -f twistd.log to follow what it’s doing. This piece is going to watch the hg repositories in stage/repos and feed all the pushes to those into the db. Next is to make sure that you actually get builds.

The first thing you need to do is to configure the l10n-master to access the same database as the django-site. Just make sure that the DATABASE_* settings in l10n-master/ and django-site/l10n_site/ match. The other setting to sync is REPOSITORY_BASE, which needs to match in both settings.pys. I suggest setting that to an empty dir next to the l10n-master. I wouldn’t use the stage/repos dir, mostly because I didn’t test that. Now you set up the master for buildbot by running

buildbot create-master .

inside the l10n-master clone. Next thing is to create a slave dir, which is best put next to the l10n-master. Despite what buildbot usually likes, this slave should be on the same machine that the rest is running on.

mkdir slave
buildbot create-slave slave localhost:9021 cs pwd

So much for the general setup. With the twistd daemon running to get the pushes, you can now switch on the master and the slave, by doing a buildbot start . in both dirs. I suggest that you keep a tail -f twistd.log running on the master. When set up to track the upstream repositories, I start the master, then the slave, and once both are up fine, I start the twistd daemon for get-pushes.

Now let’s trigger some builds:

Open stage/workdir/mozilla/browser/locales/en-US/ in your favorite editor, and make a modification. I suggest just doing a whitespace edit, or changing the value of an existing entity, as that is going to keep your localizations green. Check the change in to the working clone, and then push. The get-pushes output should show that it got a new push, and then, on the next cycle, feed the push into the database. You’ll notice the output of an hg pull showing up in the log. On the next poll on the l10n-master, builds for all 4 locales should trigger. You should see an update of four builds on the waterfall, and 4 locales on the test tree on the local dashboard.

June 1, 2009

L10n ecosystem in a fishbowl

Filed under: L10n,Mozilla — Tags: , , , — Axel Hecht @ 7:02 am

Building the infrastructure for our l10n builds is hard, mostly because it consists of a ton of things that you don’t have control over. We’re building 3 and a half applications: Firefox, Thunderbird, Fennec, and Sunbird for calendar. Firefox is built on three versions, one of which is still coming from CVS. Thunderbird is one version on CVS, one on hg. We’re touching some 170 hg repos, and a single check-in can trigger anything between no build, one build, or up to almost 200 builds. Rinse, repeat, yeah, 200 builds for a single landing. Worse than that, you don’t have any control over who’s landing what, when, where. Bottom line, you really can’t test a change in the l10n build automation reliably in our production setup.

You can create a fake ecosystem, though, and I’ll explain a bit how that works. Of course it doesn’t end up being trivial, but it’s contained. It’s not trying to cover the CVS branches; those would require a setup of bonsai, which I chickened away from. Take this post with a grain of salt; I assume there are some errors here, as most of it is typed from memory.

As with any recipe, here are the list of ingredients:

  • A set of hg repositories, both for a fake application and some fake l10n repos.
  • An hg server serving those repositories (make that port 8001).
  • Some buildbot infrastructure working on top of these repositories that you’re trying to test.
  • Possibly an instance of the l10n dashboard presenting both the build and the l10n data.

The initial chunk is creating the repositories. I created a helper script, create-stage, that does that; it’s part of buildbotcustom/bin/l10n-stage. Its main purpose is to put in place the templates, hooks, etc. that are part of our server-side setup, create some upstream repos for en-US and l10n, and push some initial content from a set of working clones. You call it like

python l10n-stage -l stagedir

The -l keeps the l10n repositories from pushing their initial content, which yields a scenario that is closer to what we have upstream, i.e., a flock of en-US pushes before the l10n repos start. This command creates a bunch of main repositories in the repos subdir of stagedir, and a bunch of working clones in workdir. It also creates a webdir.conf, that you’ll use to run the local hg http server. Let’s run that now, in stagedir:

hg serve -p 8001 --webdir-conf=webdir.conf -t repos/hg_templates

Now you have a local setup of an application repository called mozilla, and 4 localizations in l10n: ab, de, ja-JP-mac, x-testing. They’re all equipped with the same hooks that we run on hg.m.o; in particular, they support pushlog.

Now on to the buildbot infrastructure. There’s a sibling script to create-stage, create-buildbot, which should create a master setup that is rather close to what we run in releng. It supports various degrees of parallelism for multiple slaves on three platforms, but only does dummy builds, IIRC. I want to go into more detail on how to set up the new dashboard master, though.

The dashboard master is merely running compare-locales on the actual source repositories. It does come with our bonsai replacement pushes, though. That’s basically a pushlog spanning repositories, including file and branch indexing. Here’s the basic software components you’ll need:

  • django 1.0.x (1.1pre might work, too)
  • buildbot 0.7.10p1, older versions won’t work

and from hg.m.o, you’ll need compare-locales, locale-inspector, l10n-master, buildbotcustom and django-site.

Firstly, you set up the db. sqlite and mysql should both work; mysql is actively tested. Edit the various files to reference your db (with an absolute path if sqlite), and create the schema. The main entry point to the django site is l10n_site; go in there to run python syncdb. Another edit you want to do is to point REPOSITORY_BASE to a dir where you can stage another set of clones of your repos. I suggest not sharing the hg master repo dir here.

Next, create a buildbot master and a slave. You do that by running the buildbot create-master command on your local clone of l10n-master. You’ll need to adapt l10nbuilds.ini to the test setup,

app = browser
type = hg
locales = all
mozilla = mozilla
l10n = l10n
repo = http://localhost:8001/
l10n.ini = browser/locales/l10n.ini
builders = compare

I should put that into the repo somewhere, I guess.
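For reference, the fragment above parses fine with the stdlib config parser once it sits under a section header (the [test] section name here is made up; this uses the Python 3 spelling of the module):

```python
import configparser
import io

# The ini fragment from the post, wrapped in an assumed section header.
SAMPLE = """\
[test]
app = browser
type = hg
locales = all
mozilla = mozilla
l10n = l10n
repo = http://localhost:8001/
l10n.ini = browser/locales/l10n.ini
builders = compare
"""

def read_tree_config(text):
    """Read one tree's settings from an l10nbuilds.ini-style config."""
    cfg = configparser.ConfigParser()
    cfg.read_file(io.StringIO(text))
    section = cfg.sections()[0]
    return dict(cfg.items(section))
```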

Setting up the slave is trivial, you need to make sure it’s on the same machine, though. It will run on the django clones directly.

Before starting master and slave, make sure that all the deps are in your PYTHONPATH.

Last but not least, you need to get all the pushes from your repo setup into your django setup. First, you need to tell the db which repositories it’s supposed to get from where. I created sample data for the test app, which you can load by

python loaddata localhost

The repositories I use for the production environment are in hg_mozilla_org, fwiw. You fill the database with the actual push data by running a twisted daemon. Inside django-site, run

twistd get-pushes -n -t 1 -s l10n_site.settings

That will ping one repo per second, not update, with data from l10n_site.settings. Now you have everything set up, and you can start to edit en-US and l10n files in your workingdir, and push, and see how that changes your builds and dashboard.

The buildbot waterfall will be on port 8364, and with python runserver, the dashboard will show up on port 8000. None of this is setup to be on a production server at this point, but it’s good for testing.

Update: Forgot to mention that you need to bootstrap the repo lists.

May 10, 2009

Thoughts on killing tinderbox, foundations

Filed under: Mozilla — Tags: , — Axel Hecht @ 9:45 pm

I figured it’d be a good idea to just dump my thinking on killing tinderbox (as in the web interface to mozilla’s builds). As just the background information grows to a lengthy post, I’ll cut it into pieces.

To point you where I’m heading, the executive summary of my thinking is:

Killing tinderbox is a webdev problem for the most part, with some chunks in IT and build/releng. The latter two should be fine to mostly do what webdev needs to get the job done.

I won’t give my complete rationale for killing tinderbox, for the most part because I’ve been thinking about that too long and have come up with too many reasons to write them down. But the most important fragments would be in …

The Rationale:

Tinderbox knows relatively little about our builds, and displays even less. The front end is hard to hack, and the back end is tied to a build model that doesn’t match that of buildbot. In particular in our move to hg, things have changed considerably in the back.

Listening in to previous discussions, there seems to be a gap between how people talk about our builds and what our builds really are. Thus I’ll bore the few people that actually hack on our build automation for a bit, and dive into …

What buildbot knows:

Buildbot, aka the software we’re using to control and run our builds, knows plenty about our builds. I’ll give a list of things that come to mind, with a focus on things that tinderbox wouldn’t.

  • Why a build is running.
    • Which changes went into this build, or if it was a periodic or forced build
  • The steps of each build, with separate description, results, logs
    • Logs have separate stdout and stderr chunks in order
  • A set of build properties, holding slave name, build number, revision
    • The set of build properties can be amended, to hold more data. The data can be basically anything that can be pickled in python, and could be constrained to json values, or just natives thereof.
  • Start and end times for both the individual steps and the complete build
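The per-build data in the list above can be sketched as a small data model. The class and field names here are illustrative, not buildbot’s actual classes:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Step:
    # One step of a build: its own description, result, and ordered log chunks.
    name: str
    result: str                                # e.g. "success", "warnings", "failure"
    logs: list = field(default_factory=list)   # ordered (stream, text) chunks
    started: float = 0.0
    finished: float = 0.0

@dataclass
class Build:
    builder: str
    number: int
    reason: str                                # why it ran: change, periodic, forced
    changes: list = field(default_factory=list)      # changesets that went in
    steps: list = field(default_factory=list)
    properties: dict = field(default_factory=dict)   # slave name, build number, revision, ...
    started: float = 0.0
    finished: float = 0.0

    def set_property(self, name, value):
        # Buildbot lets you amend properties with anything picklable;
        # constraining them to JSON-serializable values keeps them portable.
        json.dumps(value)  # raises TypeError for non-JSON values
        self.properties[name] = value

build = Build(builder="linux-l10n", number=42, reason="change")
build.set_property("revision", "abcdef012345")
build.steps.append(Step(name="compare-locales", result="warnings",
                        logs=[("stdio", "missing strings: 3")]))
```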

There are some shortcomings, in particular when it comes to our build setup, mostly …

What even buildbot doesn’t know:
  • Dependencies between builds. Buildbot has two builtin methods to run builds that depend on prior builds, but it doesn’t keep track of that relationship.

For those into schema, one possible version of that is depicted in this graph.

Then, there are things we keep …

On tinderbox, but not on buildbot:
  • Tree rules
    • open/closed (used to be on despot for cvs/bonsai)
    • sheriff
  • Build comments
  • log parsers

So much for the read-only side of life. On top of this, there are a few important things that buildbot enables us to do, which we don’t empower our community to use (at least not without a releng-sheriff around).

Buildbot can:
  • Trigger builds on arbitrary builders, possibly with particular properties set (the latter requires hackery).
  • Stop most builds while running.

Exposing these should provide a powerful tool to investigate and clear bustages.

You can get a slightly better idea of how things look on buildbot itself if you browse around on Chromium’s waterfall. IMHO, they share the problem of not being able to present the data they have, even though they have fewer platforms and trees to handle than we do. You can also see the problem of dependent test builds hanging somewhere in the air. The output per step is nicely visible with all its detail, but unconditionally so; most of the time, you likely don’t care.

Going forward, I’ll try to wrap my head around which problems our web frontend to our builds actually needs to solve, and which routes I see to getting there.

December 19, 2008

working demo, waterfall and more

Filed under: L10n,Mozilla — Tags: , , — Axel Hecht @ 4:50 pm

Thanks to Reed, I just put up my latest pet project up on the l10n server.

It’s offering a new web interface for buildbot builds. It does so by feeding the status data that buildbot has into a database on the one end, and on the other end is a django app displaying that data.

The nice thing about this is that writing features (or fixing bugs) is “just webdev”. Compared to whatever you want to call tinderbox hacking.

There are already a few concepts demoed on the site. All urls are in flux, note the “stage” in them. But the principle should be obvious.

Firstly, you can get to a regular waterfall on waterfall. Yes, there are some time-sorting issues there. But it’s quick, which is cool. Compare it to the regular buildbot waterfall (didn’t bother checking which timerange that shows). And it offers a nice compromise (IMHO) for displaying detail or not. For finished builds, it shows one box per build. For builds in progress, it shows a box per step (it doesn’t show a box for the build for those steps, which is confusing). It has a blame column, too. Whenever you see a change number, that links to a new page, which lists all builds for that particular change. This one shows an en-US check-in with all locales turning red, for example. Another one just shows how things go green for Arabic again, as that localizer checked in.

For the l10n builds, that’s peanuts, but if you’re landing on a tree with real compiles, being able to follow all builds for your landing, and no others sounds cool.

And django comes with helpers for generating feeds, so creating a meaningful live bookmark to follow your own landing doesn’t seem like an unsolvable RFE.

There’s more. You can restrict the waterfall to only show builds for a particular slave. You can restrict the shown builds to only show builds with a particular property, say, the Macedonian builds, compared to all builds.

There’s obviously lots of room for improvement, the code is in the tinder app in my hg repo. Volunteers welcome.

October 21, 2008

l10n-merged linux builds on the l10n server

Filed under: L10n,Mozilla — Tags: , , , — Axel Hecht @ 2:43 pm

I reached another milestone on the l10n builds on the l10n server – reliable l10n depend builds.

A short recap on why they could not be reliable. Details are in Armen’s and John’s presentation in Whistler. First and foremost, l10n builds with missing strings break. They might start, or not, maybe even crash. Or just display the yellow screen of xml parsing death. Now, l10n builds are not really builds, but repackages of an en-US build. Between the time that the en-US build started, or, in hg, the revision it used, and the tip at the time when the en-US binary is finished and available, there can be further l10n-impact landings. We are using the nightly builds for the repackages throughout the whole day even, so the chance that the current en-US source doesn’t correspond to the nightly increases. So even if you know that a localization is good tip vs tip, you can’t say if it’s breaking the previous nightly or not. Huh? Look at the graphs in Armen’s and John’s presentation for more arrows going back and forth in time. ;-)

Enter bugs 452426 and 458014. 452426 added the changeset id to application.ini (thanks Ted), and 458014 refactored browser/locales/ with additional logic to extract that info for the build system. I got that one landed yesterday, so we can now get the source stamp of mozilla-central for a firefox build.

Right, good catch, this doesn’t work for comm-central builds. I’ll leave it up to them to figure out how to reproduce the plethora of repos they have.

So far, so good. You download the nightly, unpack, ident (the rule to extract the changeset id). Now you go back to your source tree, hg update to that revision, and run compare-locales against that. We’d be able to reliably say “works” or “better don’t touch”.
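The ident step can be sketched with a few lines of Python, assuming the changeset id lands as a SourceStamp key in the [App] section of application.ini (the key names here are from memory, treat them as illustrative; the era’s actual code was Python 2, this sketch uses Python 3’s configparser):

```python
import configparser

# What an unpacked nightly's application.ini might contain (abridged).
APPLICATION_INI = """\
[App]
Name=Firefox
Version=3.1b2pre
SourceStamp=abcdef012345
"""

def ident(ini_text):
    """Extract the mozilla-central changeset the en-US build was made from."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser["App"]["SourceStamp"]

rev = ident(APPLICATION_INI)
# With the revision in hand, one would then, in the source clone:
#   hg update -r <rev>
#   ... and run compare-locales against the l10n tree.
```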

We promised more, and more pieces came together today.

With reliable compare-locales code, one can not only detect missing strings, but also add missing strings to files. Think of it as a CPP step: nothing permanent, nothing gets landed upstream. But just for the needs of this particular build, you’d have something that has all strings. Not all translated, some padded from en-US. That works. compare-locales has been able to do merges for a while now, storing the changed files in a separate location, mostly because I consider changing the source to be evil. So what about missing files? Nothing. Good files? Nothing. How does the build pick up files from merges, l10n, and en-US then?
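The l10n-merge idea, padding missing strings from en-US and touching nothing upstream, can be sketched over plain dicts (a toy model of key/value .properties entries, not compare-locales’ actual code):

```python
def merge_locale(en_us, l10n):
    """Return a complete mapping: translated where available, en-US otherwise.

    In the real setup, files with no missing strings are not written to the
    merge location at all; the build picks those up from the l10n tree.
    """
    merged = dict(en_us)   # start from the full en-US key set
    merged.update(l10n)    # overlay whatever is translated
    return merged

en_us = {"ok.label": "OK", "cancel.label": "Cancel", "new.label": "New"}
l10n = {"ok.label": "Okay", "cancel.label": "Abbrechen"}

merged = merge_locale(en_us, l10n)
# merged has all three keys; "new.label" is padded from en-US
```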

By a rewrite. Among overall readability improvements and the removal of XPFE hacks, the rewrite picks up l10n files from a list of top-level source dirs. It offers another cute feature: writing both chrome and extension manifests at once. Now, with bug 458014, we don’t have to run the libs phase for installers and langpack separately. (I never got why we do that until I did the rewrite, actually.) It was preceded by another rewrite, so that all of the jar generation can happen in a single python process.

Starting from today, all of this came together with my installation of buildbot on the l10n server.

This gives us

  • builds on push, i.e., feedback within 5-10 minutes (real stats pending)
  • comparisons of the l10n tip against both
    • the en-US tip (for the upcoming nightly)
    • the changeset of the previous nightly (for the existing nightly, with l10n-merge)
  • html-ified output for both of those
  • updates for the dashboard

and last, but not at all least, a

  • working build, even for partial translations.

Find 60 3.1b2pre linux builds on the l10n server.

Thanks to Armen: I used a few of his new makefile targets for download and upload, and he did a bunch of work for the sourcestamps-in-application.ini on cvs, too. Thanks to Ted as well; the poor fellow had to review all my rewrites and Makefile dependency foo, and did some patches, too. The hgpoller stuff is not to be forgotten.


Open ends remain:

  • silme will offer even more reliable merges
  • nightly scheduler for all locales (currently I only build on l10n and en-US l10n-impact changes) (*)
  • mar’s
  • comm-central
  • more Makefile foo to pick up more missing files from en-US in doubt… (*)
  • … or at least document the core set of required files (*)

I won’t take most of those, fwiw. Possibly only the (*) ones.

Sources are in my tooling repository, and there’s an updated version of compare-locales, 0.6, on pypi. No drastic changes here, just some path fixes, mostly for Windows.
