automation | sfink @ Mozilla

Mozilla — No Comments
03
Jul 25

Effectful Logging

These recent blog posts are veering in the “here’s a horrible thing I just did!” direction. No apologies.

Recently, I was working on a weird problem where I wanted to snapshot /proc/$pid/maps before and after a couple of mmap and madvise calls. But I didn’t particularly want to write C++ code to do it. So:

JS_LOG_FMT(debug, Info, "About to mmap at {:x}", ptr);
JS_LOG_FMT(debug, Info, "SKYNET: mkdir /tmp/js-{0}/before /tmp/js-{0}/after", getpid());
JS_LOG_FMT(debug, Info, "SKYNET: cp /proc/{0}/maps /tmp/js-{0}/before", getpid());
sleep(3);
mmap(...);
JS_LOG_FMT(debug, Info, "SKYNET: cp /proc/{0}/maps /tmp/js-{0}/after", getpid());
sleep(3);

That produces “SKYNET: …” log messages, with a pause. If only someone were reading those log messages and quickly cutting & pasting the commands… let’s give them 3 seconds to do each.

Then I run in my terminal:
MOZ_LOG=debug:5 $JS blahblah.js |& perl -lpe 'system($1) if /SKYNET: (.*)/'

Whenever one of these messages is produced, the Perl script grabs it and runs it in a new shell. Victory!
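
The filter is small enough to sketch in Python as well. This is an illustrative equivalent of the Perl one-liner, not something taken from the original setup: the function name run_tagged_commands and its runner and echo parameters are invented for the example, and the default runner is assumed to hand the text to a shell the way Perl's system() does with a single string.

```python
import re
import subprocess

# Same pattern as the Perl one-liner: everything after "SKYNET: " is a command.
TAG = re.compile(r"SKYNET: (.*)")

def run_tagged_commands(lines, runner=None, echo=print):
    """Echo each log line, running whatever follows the SKYNET tag first."""
    if runner is None:
        # Assumed default: pass the text to a shell, like Perl's system().
        runner = lambda cmd: subprocess.run(cmd, shell=True)
    executed = []
    for line in lines:
        line = line.rstrip("\n")
        m = TAG.search(line)
        if m:
            # Like perl -p, run the side effect before printing the line.
            runner(m.group(1))
            executed.append(m.group(1))
        echo(line)
    return executed
```

Calling run_tagged_commands(sys.stdin) at the end of the same pipe would play the role of the Perl script. Passing a different runner makes it possible to log the commands instead of executing them, which is probably the saner default.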

Note: the Skynet reference is from way back when we all watched Terminator and thought that it would be possible to prevent the AIs from taking over the world just by keeping one company from giving it so much power, in the form of tools and permission to do arbitrary things. We didn’t predict that in only a few decades, thousands of people would be doing exactly that on a daily basis.

Mozilla — No Comments
07
May 25

Sinful Debugging

Recently, I was debugging my SpiderMonkey changes when running a JS test script, and got annoyed at the length of the feedback cycle: I’d make a change to the test script or the C++ code, rerun (under rr), go into the debugger, stop execution at a point where I knew what variable was what, set convenience variables to their pointer values, then run to where the interesting stuff was happening.

One part I had already solved: if I have some JS variables, say base and str, and I want to capture their pointer values in the debugger, I’ll call Math.sin(0, base, str) and set a breakpoint on math_sin. Why the leading 0? In the past, I’d run into problems when Math.sin converted its first argument to a number, which disrupted what I was trying to look at. So now I feed it a number first and put my “real” stuff after it where it’s ignored, even when it doesn’t matter (and it usually doesn’t).

But it’s painful to get set up again. It looks something like:

(rr) b math_sin
(rr) c # ontinue
(rr) p vp[3].toString()
$1 = (JSString *) 0x33315dc003c0
(rr) set $base=$
(rr) p vp[4].toString()
$2 = (JSString *) 0x33315dc01828
(rr) set $str=$

Yay, now I can do things like p $str->dump() and it will work!

But even worse, sometimes I’d want to add or remove variables. So I hacked around it:

First, instead of just passing in the variables, I pass in names along with them. An actual example:

Math.sin(0, "ND2", ND2, "TD3", TD3, "NB4", NB4, "NB5", NB5, "TD6", TD6);

(yes, those names mean something to me). Then the setup becomes more like:

(rr) b math_sin
(rr) c # ontinue
(rr) p vp[3]
$7 = $JS::Value("ND2")
(rr) set $ND2=vp[4].toString()
(rr) p vp[5]
$8 = $JS::Value("TD3")
(rr) set $TD3=vp[6].toString()

Ok, that’s longer, and requires cutting & pasting. I know, shut up.

The next step is to automate with a gdbinit script. Here’s a slightly modified version of mine:

define mlabel
  set $_VP=vp
  python
import re
argc = int(gdb.parse_and_eval("argc"))
for i in range(3, argc + 2, 2):
  namer = f"$_VP[{i}]"
  m = re.search(r'::Value\("(.*?)"',
                str(gdb.parse_and_eval(namer)))
  if not m:
    print(f"Failed to match: {namer}")
    continue
  name = m.group(1)
  setter = f"set ${name}=$_VP[{i+1}].toGCThing()"
  gdb.execute(setter)
end
end
document mlabel
Special-purpose tool for grabbing out things passed to Math.sin(0, "name1", val1, "name2", ...) and converting them to labels.
end

“mlabel” stands for “multi-label”, because it… well, it doesn’t label anything, but in my real version, it runs my own command

label name=value

that does some other magic besides setting a gdb convenience variable (yes, they’re actually called that).

I don’t remember why I went through the extra step of setting a $_VP variable rather than using vp directly. But it’s probably specific to my scenario, so you’ll have to adapt this anyway. This post is meant more to give you an idea.
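
If the index arithmetic in the loop looks arbitrary, here is the same pairing logic pulled out of gdb so it can be run on its own. It is a sketch under assumptions: label_commands and its printed argument are made up for illustration, with printed standing in for what gdb.parse_and_eval would return inside a real session.

```python
import re

# Same pattern the gdb script applies to the pretty-printed JS::Value.
LABEL = re.compile(r'::Value\("(.*?)"')

def label_commands(argc, printed):
    """Build the gdb 'set' commands that mlabel would execute.

    printed maps a vp index to the text gdb prints for that slot
    (a stand-in for gdb.parse_and_eval, which only exists inside gdb).
    """
    commands = []
    # vp[0] is the callee, vp[1] is |this|, and vp[2] is the dummy 0,
    # so name/value pairs start at vp[3] and the last slot is vp[argc + 1].
    for i in range(3, argc + 2, 2):
        m = LABEL.search(printed.get(i, ""))
        if not m:
            continue  # not a string label, skip this pair
        commands.append(f"set ${m.group(1)}=$_VP[{i + 1}].toGCThing()")
    return commands
```

For a call like Math.sin(0, "ND2", ND2, "TD3", TD3), argc is 5 and the loop visits vp[3] and vp[5], pairing each name with the slot right after it.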

The result is my JS test code talking to my gdb session and spilling its secrets. Now when I’m debugging the interesting stuff, I can do (rr) print $TD3->dump() and it will do something useful.

Here’s a log of an actual session:

--------------------------------------------------
 ---> Reached target process 3902272 at event 14.
--------------------------------------------------
(rr) Working directory /home/sfink/src/mozilla-ff/js/src/jit-test/tests/gc.
(rr) pretty
Loading JavaScript value pretty-printers; see js/src/gdb/README.
If they cause trouble, type: disable pretty-printer .* SpiderMonkey
SpiderMonkey unwinder is disabled by default, to enable it type:
	enable unwinder .* SpiderMonkey
(rr) b math_sin
Breakpoint 1 at 0x564d91b3942b: file /home/sfink/src/mozilla-ff/js/src/jsmath.cpp, line 649.
(rr) c
Continuing.

Thread 1 hit Breakpoint 1, math_sin (cx=cx@entry=0x7f20e5d3a200, argc=11, vp=0x7f20d5ba8168)
    at /home/sfink/src/mozilla-ff/js/src/jsmath.cpp:649
stopped at breakpoint 1: (N/A) -> (N/A)
(rr) mlabel
all occurrences of 0x33315dc003c0 will be replaced with $ND2 of type js::gc::Cell *
all occurrences of 0x38deaf678670 will be replaced with $TD3 of type js::gc::Cell *
all occurrences of 0x33315dc00340 will be replaced with $NB4 of type js::gc::Cell *
all occurrences of 0x33315dc00188 will be replaced with $NB5 of type js::gc::Cell *
all occurrences of 0x38deaf678658 will be replaced with $TD6 of type js::gc::Cell *
(rr) b promoteString
Breakpoint 2 at 0x564d92737698: file /home/sfink/src/mozilla-ff/js/src/gc/Tenuring.cpp, line 882.
(rr) c
Continuing.

Thread 1 hit Breakpoint 2, js::gc::TenuringTracer::promoteString (this=this@entry=0x7ffcf5f9cb40, 
    src="MY YOUNGEST MEMORY IS OF A TOE, A GIANT BLUE TOE, IT MADE FUN OF ME INCESSANTLY BUT THAT DID NOT BOTHER ME IN THE LEAST. MY MOTHER WOULD HAVE BEEN HORRIFIED, BUT SHE WAS A GOOSE AND HAD ALREADY LAID T"...)
    at /home/sfink/src/mozilla-ff/js/src/gc/Tenuring.cpp:882
stopped at breakpoint 2: (N/A) -> (N/A)
(rr) p (void*)src
$1 = (void *) $NB5
(rr)

Uncategorized — 2 Comments
21
Jan 12

bzexport --new: crash test dummies wanted

Scenario 1: you have a patch to some bug sitting in our mercurial queue. You want to attach it to a bug, but the bugzilla interface is painful and annoying. What do you do?

Use bzexport. It’s great! You can even request review at the same time.

What I really like about bzexport is that while writing and testing a patch, I’m in an editor and the command line. I may not even have a browser running, if I’m constantly re-starting it to test something out. Needing to go to the bugzilla web UI interrupts my flow. With bzexport, I can stay in the shell and move onto something else immediately.

Scenario 2: You have a patch, but haven’t filed a bug yet. Neither has anybody else. But your patch has a pretty good description of what the bug is. (This is common, especially for small things.) Do you really have to go through the obnoxious bug-filing procedure? It sure is tempting just to roll this fix up into some other vaguely related bug, isn’t it? Surely there’s a simple way to do things the right way without bouncing between interfaces?

Well, you’re screwed. Unless you’re willing to test something out for me. If not, please stop reading.

Uncategorized — 3 Comments
17
May 11

mozilla-central automated landing proposal

This was originally a post to the monster thread “Data and commit rules” on dev-planning, which descended from the even bigger thread “Proposing a tree rule change for mozilla-central”. But it’s really an independent proposal, implementable with or without the changes discussed in those threads. It is most like Ehsan’s automated landing proposal but takes a somewhat different approach.

Create a mozilla-pending tree. All pushes are queued up here. Each gets its own build, but no build starts until the preceding push’s build is complete and successful (the tests don’t need to succeed, nor even start.) Or maybe mostly complete, if we have some slow builds.
Pushers have to watch their own results, though anyone can star on their behalf.
Any failures are sent to the pusher, via firebot on IRC, email, instant messaging, registered mail, carrier pigeon, trained rat, and psychic medium (in extreme circumstances.)
When starring, you have to explicitly say whether the result is known-intermittent, questionable, or other. (Other means the push was bad.)
When any push “finishes” — all expected results have been seen — then it is eligible to proceed. Meaning, if all results are green or starred known-intermittent, its patches are automatically pushed to mozilla-central.
Any questionable result is automatically retried once, but no matter what the outcome of the new job is, all results still have to be starred as known-intermittent for the push to go to mozilla-central.
Any bad results (build failures or results starred as failing) cause the push to be automatically backed out and all jobs for later pushes canceled. The push is evicted from the queue, all later pushes are requeued, and the process restarts at the top.
When all results are in, a completion notification is sent to the pusher with the number of remaining unmarked failures.
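
The starring rules above boil down to a small decision function per push. What follows is a rough sketch, not a spec: the names Star, Result and push_verdict are invented here, and it assumes a bad result evicts the push immediately rather than waiting for the remaining jobs.

```python
from collections import namedtuple
from enum import Enum

class Star(Enum):
    NONE = "unstarred"
    KNOWN_INTERMITTENT = "known-intermittent"
    QUESTIONABLE = "questionable"
    OTHER = "other"  # the push was bad

# One job result: did it pass, how was it starred, and was it a build job?
Result = namedtuple("Result", ["passed", "star", "is_build"],
                    defaults=[Star.NONE, False])

def push_verdict(results, expected_jobs):
    """Return "backout", "wait", or "land" for one push on mozilla-pending."""
    # Build failures and results starred "other" get the push backed out.
    for r in results:
        if r.star is Star.OTHER or (r.is_build and not r.passed):
            return "backout"
    # The push only "finishes" once every expected result has been seen.
    if len(results) < expected_jobs:
        return "wait"
    # Every failure must be starred known-intermittent; a questionable
    # star (even after its retry) still leaves the push waiting.
    for r in results:
        if not r.passed and r.star is not Star.KNOWN_INTERMITTENT:
            return "wait"
    return "land"
```

Note that the automatic retry of questionable results does not appear anywhere in the verdict. It only produces more information for whoever does the starring.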

Silly 20-minute Gimped-up example:

Good1 and Good2 are queued up, followed by a bad push Bad1
The builds trickle in. Good1 and Good2 both have a pair of intermittent oranges.
The pusher, or someone, stars the intermittent oranges and Good1 and Good2 are pushed to mozilla-central
The oranges on Bad1 turn out to be real. They are starred as failures, and the push is rolled back.
All builds for Good3 and Good4 are discarded. (Notice how they have fewer results in the 3rd line?)
Good3 gets an unknown orange. The test is retriggered.
Bad1 gets fixed and pushed back onto the queue.
Good3’s orange turns out to be intermittent, so it is starred. That is the trigger for landing it on mozilla-central (assuming all jobs are done.)
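
The eviction and requeue part of that example can be simulated in a few lines. This is a deliberately simplified sketch: drain_queue and is_bad are invented names, it assumes every queued push already has a build started when the bad one is detected, and it ignores the fixed Bad1 being pushed back onto the queue.

```python
from collections import Counter

def drain_queue(pushes, is_bad):
    """Simulate mozilla-pending; return (landed, evicted, builds per push)."""
    queue = list(pushes)
    landed, evicted = [], []
    builds = Counter()
    while queue:
        # Each queued push gets a build on top of the pushes ahead of it.
        for push in queue:
            builds[push] += 1
        # Walk from the front until something turns out to be bad.
        for i, push in enumerate(queue):
            if is_bad(push):
                evicted.append(push)
                # Results for everything behind it are discarded, so the
                # later pushes are requeued and rebuilt on the next pass.
                queue = queue[i + 1:]
                break
            landed.append(push)
        else:
            queue = []  # nothing bad left, everything landed
    return landed, evicted, builds
```

The build counts show the cost of a bad push: everything queued behind it pays for a second build.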

To deal with needs-clobber, you can set that as a flag on a push when queueing it up. (Possibly on your second try, when you discover that it needs it.)

mozilla-central doesn’t actually need to do builds, since it only gets exact tree versions that have already passed through a full cycle.

On a perf regression, you have to queue up a backout through the same mechanism, and your life kinda sucks for a while and you’ll probably have to be very friendly with the Try server.

Project branch merges go through the same pipeline. I’d be tempted to allow them to jump the queue.

You would normally pull from mozilla-pending only to queue up landings. For development, you’d pull mozilla-central.

Alternatively, mozilla-central would pull directly from the relevant changeset on mozilla-pending, meaning it would get all of the backouts in its history. But then you could use mozilla-pending directly. (You’d be at the mercy of pending failures, which would cause you to rebase on top of the resulting backouts. But that’s not substantially different from the alternative, where you have perf regression-triggered backouts and other people’s changes to contend with.) Upon further reflection, I think I like this better than making mozilla-central’s history artificially clean.

The major danger I see here is that the queue can grow arbitrarily. But you have a collective incentive for everyone in the queue to scrutinize the failures up at the front of the queue, so the length should be self-limiting even if people aren’t watching their own pushes very well. (Which gets harder to do in this model, since you never know when your turn will come up, and you’re guaranteed to have to wait a whole build cycle.)

You’d probably also want a way to step out of the queue when you discover a problem yourself.

Did I just recreate Ehsan’s long-term proposal? No. For one, this one doesn’t depend on fixing the intermittent orange problem first, though it does gain from it. (More good pushes go through without waiting on human intervention.)

But Ehsan’s proposal is sort of like a separate channel into mozilla-central, using the try server and automated merges to detect bit-rotting. This proposal relies on being the only path to mozilla-central, so there’s no opportunity for bitrot.

What’s the justification for this? Well, if you play fast and loose with assumptions, it’s the optimal algorithm for landing a collection of unproven changes. If all changes are good, you trivially get almost the best pipelining of tests (the best would be spawning builds immediately). With a bad change, you have to assume that all results after that point are useless, so you have no new information to use to decide between the remaining changes. There are faster algorithms that would try appending pushes in parallel, but they get more complicated and burn way more infrastructural resources. (Having two mozilla-pendings that merge into one mozilla-mergedpending before feeding into mozilla-central might be vaguely reasonable, but that’s already more than my brain can encompass and would probably make perf regressions suck too hard…)

Side question: how many non-intermittent failures happen on Windows PGO builds that would not happen on (faster) Windows non-PGO builds?
