20 Dec 19

Running taskcluster tasks locally

Work right from your own home!

It can be difficult to debug failures in Taskcluster that don’t happen locally. Interactive tasks are very useful for this, but they broke during the last migration — the relevant bug is bug 1596632, which is duped to a just-fixed bug, so maybe it works now? I recently encountered a situation where I really needed to interactively debug something, so I decided to take the plunge and discover the answer to the question: how can I run tasks locally?

Local tasks provide not only the advantages of interactive tasks, but also allow running against your local checkout. That makes for a much faster edit-run-curse-debug cycle, and opens this up to a lot more situations than the last-ditch efforts that interactive try server tasks are usually reserved for. (Or at least, that’s how I use them. And mostly don’t use them.)

I’m going to walk through the process of setting up and running a taskcluster job in a local container. Note that I have no idea how generally applicable this is. I will give the steps necessary to run the SM(gdb) job, which builds the JS shell and runs some gdb prettyprinter tests against it. I have no idea how far it will get you to running something like mochitests.

Getting the image

Taskcluster normally runs Docker images. So the first step is to get your very own copy of the appropriate docker image. There’s a handy blog post by someone who actually knows what he’s talking about that I found well after the fact (of course). But I’m going to give the exact steps that I used:

  • Click on the task you’re trying to replicate in treeherder.
  • Open the full log file.
  • Search for a line that says something like “Downloading artifact “public/image.tar.zst” from task ID: VuFo68PeQjCH7k15tSN2Dg.” near the beginning of the file. Call that ID $IMAGEID.
  • Run ./mach taskcluster-load-image --task-id $IMAGEID from your Gecko checkout.

and then optionally,

  • Curse and flail around when something goes wrong with the docker import process, as it always seems to.
  • Maybe install docker in the first place. Whoops, forgot to mention that.
  • You probably want it to be running as well.
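The log-scraping in the third step is easy to script if you do this a lot. A minimal sketch in Python — the regex is my guess at the log line’s exact shape, based on the example above:

```python
import re

# Matches lines like:
#   Downloading artifact "public/image.tar.zst" from task ID: VuFo68PeQjCH7k15tSN2Dg.
IMAGE_LINE = re.compile(
    r'Downloading artifact "public/image\.tar\.zst" from task ID: (\S+?)\.?$'
)

def find_image_task_id(log_text):
    """Return the docker image's task ID from a raw task log, or None."""
    for line in log_text.splitlines():
        m = IMAGE_LINE.search(line)
        if m:
            return m.group(1)
    return None
```

Feed the result to ./mach taskcluster-load-image --task-id as above.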

Getting the image up and running

mach will helpfully give you a command to run a shell in the image, something like

/usr/bin/docker-current run -ti --rm debian7-amd64-build:e2e821aea119e4a264340c22b79324ac804955b605577dd225df5f4f8e98e0cc bash

Don’t do that. It’s a great command, but it’s a little overzealous about cleaning up after itself. But grab out that image name: IMAGE=debian7-amd64-build:e2e821aea119e4a264340c22b79324ac804955b605577dd225df5f4f8e98e0cc

Although for now, I guess it’s really not bad. Just remove the --rm option and give it a try.

If you get a shell to pop up, congratulations! Be happy! If not, try asking someone with a clue or, failing that, ask me. I’m sfink in the #developers channel on IRC, or if you’re reading this after we’ve spun up our new Matrix overlord, I’ll probably be moving there. Oh, and if you’re in the Mozilla secret club, I suppose I won’t ignore you if you hit me (@sfink) up on Slack either.

Anyway, we’re going to need to download some stuff into this image, which means we need a network. Mine didn’t start with a network. I don’t know much about Docker, but this got me a network:

  • ifconfig to figure out your local IP address, or do it some other way. My IP was 10.0.0.14.
  • docker network create -o "com.docker.network.bridge.host_binding_ipv4"="10.0.0.14" my-network

    Replace the “10.0.0.14” with your own IP and, if you wish, “my-network” with something cooler-sounding. That’ll spit out some monstrous ID like 1793d9caad6d5973922b7a78ae11a2bce6005781ca18c0e253d1c2c5317f5c93 that you have to read out in Pig Latin in under 5 seconds. Or you can just ignore it.

  • docker ps to get the ID of your running container. (Or add -a if you’re going to be running a container you’ve created already.) Call that $CONTAINER_ID.
  • docker network connect my-network $CONTAINER_ID

Come to think of it, I only did that once with an old container I’m no longer using, and all of the new containers I’ve created come up with a functioning network from the get-go. So you can probably ignore all of the above.

Grafting your source into your container

Now that you have a container with a network running and everything, it’s time to throw it out and start over. I did say “don’t do that”, remember?

The next goal is to start up a container with your local source tree bind-mounted. Let’s call the absolute path to your checkout $SRCDIR.

  • Let’s expand your container-creating command to something like:
    docker run -ti -v $SRCDIR:/builds/worker/source:z $IMAGE bash

    [Note 2]

  • But don’t run that either. Or at least, don’t run it if you actually are trying to run the gdb task, because it requires some extra privileges in order to do the right ptrace magic.
  • Here’s the actual command I use:
    docker run -ti -v $SRCDIR:/builds/worker/source:z --cap-add=SYS_PTRACE --security-opt seccomp=unconfined $IMAGE bash

Ignoring the gdb ptrace goop, that command bind-mounts $SRCDIR on your host so that it shows up at /builds/worker/source within your container, and additionally does the fixup necessary for selinux to let you access the data from within the container. If you’re worried about stuff running within the container messing up your source checkout, you could add ,ro to the volume portion of that command:

docker run -ti -v $SRCDIR:/builds/worker/source:z,ro --cap-add=SYS_PTRACE --security-opt seccomp=unconfined $IMAGE bash

But honestly, I’ve never tried doing that yet.
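If you end up rebuilding this command line a lot, it’s easy to generate. A sketch — the helper function and its defaults are mine, not anything mach provides:

```python
def docker_run_command(srcdir, image, ptrace=False, readonly=False):
    """Build the `docker run` argv for a shell in a task image,
    bind-mounting a local checkout at /builds/worker/source."""
    volume = f"{srcdir}:/builds/worker/source:z"
    if readonly:
        volume += ",ro"
    cmd = ["docker", "run", "-ti", "-v", volume]
    if ptrace:
        # Extra privileges needed to run gdb inside the container.
        cmd += ["--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined"]
    return cmd + [image, "bash"]
```

Pass the result to subprocess.run, or " ".join it for cut & paste.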

Snarfing taskcluster initialization

Hopefully, you now have a shell open in a container that is basically identical to what runs in taskcluster. You’re home free, right?

Not so fast. Taskcluster does some magic setup, I’m not entirely sure how, to provide an environment with a bunch of important settings that don’t come with a default shell. I figured out a bunch of stuff you could do manually to replicate this environment. Here’s a list of steps that I recommend you do not take:

  • Go back to your push on treeherder.
  • Click on the Task link in the bottom left pane.
  • Expand the “payload” section.
  • Somehow convert the whole “env” section to environment variable setting commands. I used to save the whole payload as a JSON file /tmp/task.json, then run
    perl -lne 'if (/"env"/ .. /^\s*\}/) { print "export $1='\''$2'\''" if /"(.*?)": "(.*)"/ }' /tmp/task.json
  • Cut & paste that into the shell running on your container.
  • Also cut & paste
    export TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com

    to prevent it from attempting to access stuff via internal URLs that won’t work from your desktop.

  • Now grab the “command” key from that payload and stitch it together into a shell command to paste…
      …but that’s way too much work, and is incomplete besides.
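If perl isn’t your thing, the same env-to-export conversion works in a few lines of Python against the saved payload:

```python
import json

def env_exports(payload):
    """Turn a task payload's "env" mapping into shell `export` lines."""
    lines = []
    for name, value in payload["env"].items():
        # Single-quote each value, escaping any embedded single quotes.
        quoted = "'" + str(value).replace("'", "'\\''") + "'"
        lines.append(f"export {name}={quoted}")
    return "\n".join(lines)

# print(env_exports(json.load(open("/tmp/task.json"))))
```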

Running the command

That last step, where you grabbed the command out of the payload? It’s not going to work. The main reason is that it attempts to do some hg fingerprinting thing that won’t work when you’re running outside of the data center. But the automation I created to avoid that also does the environment initialization piece, so I’ll whack the two little birdies with one rock.

  • Download https://raw.githubusercontent.com/hotsphink/sfink-tools/master/bin/mk-task-runner or checkout all of https://github.com/hotsphink/sfink-tools and find it in bin/.
  • Look at the bottom left pane for the Task field of the job you’re cloning. Copy the magic task ID next to it, something like fpcWJf1hTEinv8F49luc_w. Let’s call that $TASKID.
  • From within your source checkout, run mk-task-runner $TASKID. That’ll download the task descriptor and grab out the relevant pieces, and generate a simple run-task.sh.
  • Because this is in your source checkout, it should be visible from within the container. So run source/run-task.sh.

This should run the whole task, and after it’s done, drop you into another shell with the environment settings preserved in case you want to do some further poking around.

…except, once again, it probably won’t work.

run-task

Tasks run via a script taskcluster/scripts/run-task. Which is great, and it almost works perfectly for our purposes. Except it tries to do its own checkout of gecko, and it spends a bunch of time downloading stuff and then deletes it at the end. Both of those are not so helpful if you’re trying to run and rerun against your own checkout.

I have bug 1605232 open for patches that add options to avoid that, but (1) it hasn’t landed, (2) it hasn’t been reviewed, (3) it may not be the direction The Powers That Be want to go, and (4) they might really rather not be using run-task for both automation and manual running in the first place. All of which could lead to this solution changing. If I’m a good person[Note 1], I’ll come back to this post and update it with the updated information when things change.

In the meantime, you have two main options:

1. Edit run-task.sh to get rid of the --keep and --existing-gecko-checkout=... options and let it run against a fresh checkout and re-download stuff, or
2. Apply the patch in the above bug to your local checkout.

Aftermath

I was careful to post this during the holiday season to be sure you wouldn’t read it, but it looks like you somehow did anyway. If taskclustery people who actually work on this stuff would like to correct my undoubtedly numerous mistakes, I would be most appreciative, so please get in touch. Or if you can tell me where I’m making it all too hard.

If you use this and it works for you, I’d be curious to know what you’re using it for. If you try to use it and it doesn’t work, I’d kinda like to know that too (I haven’t tried any other tasks yet.) If you try to use it and get angry about it not working, or it eats your data, I’m perfectly okay with you not getting in touch with me.


Footnote 1: I’m not a good person.

Footnote 2: The documentation says you should really be using --mount in place of -v aka --volume. But my version of docker doesn’t have --mount.


17 Aug 18

Type examination in gdb

Sometimes, the exact layout of objects in memory becomes very important. Some situations you may encounter:

  • When overlaying different types as “views” of the same memory location, perhaps via reinterpret_cast, unions, or void*-casting. You want to know where the field in one view lands in another.
  • When examining a struct layout’s packing, to see if there is space being wasted.
  • When looking at a crash at some offset like 0xc8, it’s common for that to be a NULL pointer dereference of a field of some structure, as if you did MyStruct* foo = nullptr; return foo->field;.
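That last case is easy to demonstrate with Python’s ctypes; the structure here is invented purely for illustration:

```python
import ctypes

# A made-up structure: a crash dereferencing address 0xc8 suggests a
# NULL `Foo*` whose field at offset 0xc8 was accessed.
class Foo(ctypes.Structure):
    _fields_ = [
        ("header", ctypes.c_char * 0xc8),  # padding standing in for earlier fields
        ("field", ctypes.c_uint64),
    ]

# offsetof(Foo, field) == 0xc8, so `((Foo*)nullptr)->field` faults at 0xc8.
print(hex(Foo.field.offset))  # → 0xc8
```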

A very handy command for looking at the underlying storage of types is pahole. But I’m usually in gdb already when I want to examine types. Plus, I somehow have never gotten comfortable with the separate pahole command — possibly because it has a tendency to seg fault when I try running it. The latest gdb (8.1) also has it built-in if you run ptype/o typename.

Tom Tromey implemented a simple version of pahole inside gdb using the Python scripting APIs. I stole[1] his code, fixed some bugs, mucked with his output layout, added some more bugs, fixed some of those, and added it to my gdb startup script collection. I also abstracted out the type traversal and created an additional command to examine what is at a given offset.

offset

Let’s say we have a crash at the address 0x58. Suspecting a NULL pointer dereference while manipulating a js::RegExpShared object, let’s look at what is at that offset:

(gdb) offset 0x58 js::RegExpShared
Scanning byte offsets 88..95
overlap at byte 88..127 with this.tables : js::RegExpShared::JitCodeTables
overlap at byte 88..95 with this.tables.mBegin : mozilla::UniquePtr *

By default, when you examine an offset, offset will look at a native word’s worth of memory. Here, I’m running on 64-bit, so we’re looking at the 8 bytes starting at 0x58 (aka 88 decimal). (You could do offset/1 88 js::RegExpShared to only look at the single byte offset 88.)

From the above output, we can see that the field tables overlaps the offset range being inspected. tables itself is a js::RegExpShared::JitCodeTables structure, and its mBegin field occupies the relevant offsets.
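The overlap check underneath the offset command is simple. A sketch of the idea — the field layout here is hardcoded (and partly invented), where the real command walks the type via gdb’s Python API:

```python
def overlapping_fields(fields, start, size=8):
    """Return (name, offset, size) for each field overlapping bytes
    [start, start+size).  `fields` is a list of (name, offset, size)
    tuples for the flattened type."""
    end = start + size
    return [(name, off, sz) for name, off, sz in fields
            if off < end and start < off + sz]

# Stand-in layout: tables at bytes 88..127, its mBegin at 88..95,
# plus an invented field at offset 0 that should not match.
fields = [("tables", 88, 40), ("tables.mBegin", 88, 8), ("other", 0, 8)]
```

Note that a zero-size field can never overlap anything under this test, which is where the odd empty ranges like “byte 188..187” in the next example come from.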

Let’s look at another example:

(gdb) offset 184 JSContext
Scanning byte offsets 184..191
overlap at byte 184..199 with this.kind_ : js::WriteOnceData
overlap at byte 184..187 with this.kind_.value : js::ContextKind
overlap at byte 188..187 with this.kind_.check : js::CheckUnprotected
overlap at byte 188..191 with this.kind_ : <32-bit hole> in js::WriteOnceData before field 'nwrites'

This has some weirdnesses. First, notice the “byte 188..187” range of kind_.check. That’s an empty js::CheckUnprotected struct[2].

Next, we seem to have collided with a hole in the JSContext structure. We can use pahole to look at the overall structure. But first, let’s switch to a simpler structure (JSContext is huge and the output would be pages long).

(gdb) offset 0 js::jit::ABIArg
Scanning byte offsets 0..7
overlap at byte 0..3 with this.kind_ : js::jit::ABIArg::Kind
overlap at byte 4..7 with this : <32-bit hole> in js::jit::ABIArg before field 'u'

Now we can get a higher-level view with pahole.

pahole

(gdb) pahole js::jit::ABIArg
  offset size
       0   16 : struct js::jit::ABIArg {
       0    4 :   kind_ : js::jit::ABIArg::Kind
       4    4 : --> 32 bit hole in js::jit::ABIArg <--
       8    8 :   u : struct union {...} {
   8  +0    1 :     gpr_ : js::jit::Register::Code
   8  +0    8 :     fpu_ : js::jit::FloatRegister::Code
   8  +0    4 :     offset_ : uint32_t
                  } union {...}
                } js::jit::ABIArg

This displays the full structure, with each field (or hole) displayed beside its offset and size within the overall type. Offsets of fields directly inside of the given type are given directly. Offsets within sub-structures are given as the offset of that structure within the outermost struct, plus an offset within the inner struct.
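You can reproduce a hole like ABIArg’s with ctypes: a 4-byte field followed by an 8-byte-aligned union forces 4 bytes of padding. The member types below are stand-ins for the real ones:

```python
import ctypes

class U(ctypes.Union):
    _fields_ = [
        ("gpr_", ctypes.c_uint8),     # 1 byte
        ("fpu_", ctypes.c_uint64),    # 8 bytes -> union size/alignment is 8
        ("offset_", ctypes.c_uint32), # 4 bytes
    ]

class ABIArg(ctypes.Structure):
    _fields_ = [
        ("kind_", ctypes.c_uint32),  # bytes 0..3
        ("u", U),                    # 8-byte aligned, so bytes 8..15: 4-byte hole
    ]

# kind_ ends at byte 4 but u starts at byte 8, so sizeof is 16, not 12.
print(ABIArg.u.offset, ctypes.sizeof(ABIArg))  # → 8 16
```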

These outputs can get very noisy when they are deeply nested, possibly displaying a lot more data than you care about. You can clamp the depth of the tree with a /N suffix option if you wish:

(gdb) pahole/1 js::jit::ABIArg
  offset size
       0   16 : struct js::jit::ABIArg {
       0    4 :   kind_ : js::jit::ABIArg::Kind
       4    4 : --> 32 bit hole in struct js::jit::ABIArg <--
       8    8 :   u : union {...}
                } struct js::jit::ABIArg

Probably not useful with this example, since the interesting stuff is now hidden, but it makes large or deeply nested types much more readable.

Installing

If you would like to use this stuff, you can see the full directions for installing my random helper crap, or you can just grab the one file gdbinit.pahole.py and source it from your ~/.gdbinit:

source ~/path/to/gdbinit.pahole.py

Footnotes

1. "Stole" in the GPLv3 sense, that is -- tromey's original and my modified version are both released under the GPLv3 license.

2. Whether empty structs should even be displayed is questionable. What does that even mean?