JS Probes
Have you ever had your browser mysteriously stall periodically and wondered “what the f#@$! is it doing?!!” Or perhaps you’re working on something, say the garbage collector, and you’d like to see what effect your changes are having. Or maybe even write a little analysis that postprocesses some sort of trace of what is going on, and figures out what the optimal pattern of actions would be. (“If I’d thrown this big chunk of data out of the cache here, then I would’ve had room for all of these little things that got evicted instead, and would have had way fewer misses…”)
The usual way to do things like this is to manually add some instrumentation code (probably just logging a bunch of events) and postprocess the results. This works fine, but it has a few drawbacks: (1) you have to figure out where to insert your instrumentation, often in unfamiliar code; (2) you’ll need to recompile, possibly several times; (3) the logs can get very large very quickly; and (4) you’ll probably end up writing a very special-purpose postprocessor that (5) dumps stuff to a text file that only you know how to interpret, and even you will only remember what it all means for a week or two. The next time you need to do something similar, you’ll find that all of your instrumentation code is severely bitrotted and misses some paths that have been added in the meantime, so you’ll start everything over from scratch.
Well, tough luck. Sometimes those are just facts of life and you’ll need to suck it up. Quit whining, dammit.
But many times, the events of interest (or more precisely, “probe points”) are of general interest. If you can manage to slip them into the code and so get other developers to maintain them for you as they make changes, then everyone can rely on those probes being in roughly the right place permanently. That’s #1 above, and depending on how they’re implemented there’s a good chance you won’t even need to recompile, so that’s #2.
I’ve done an implementation of these sorts of probes in the SpiderMonkey JavaScript engine. There are probe points like “a GC is starting (and it’s local to one compartment)”, “the heap has been resized”, and “JavaScript function F is being called/is returning.” Some of these are straightforward to place into the code; the start of a GC wasn’t hard to figure out, for example. Some weren’t so straightforward, such as JS function calls (they might seem simple, but what if you’re running JITted? Which JIT? Are you still running JITted by the time you return from the function?) I’ve delivered the probe information to various backends: anything from Windows’ ETW (blog post forthcoming whenever I manage to implement the start/stop functionality), to dtrace/systemtap (another blog post, probably coming sooner since I recently scraped together a demo), to a simple callback mechanism (see JS_SetFunctionCallback on MDN) and other special-purpose ones that only care about a small subset of probes.
#3 (log it all vs online handling) ventures into religious territory. It is easiest to mindlessly log everything of interest and postprocess it. But what if you want realtime updates? Or if you want to track different information depending on what you learn from other probe points? Or what if the volume of your log writing interferes with whatever you’re trying to measure (eg disk I/O)? Or maybe you need to track some sort of state in order to give the probes meaning. (GC when idle => good. Avoidable GC when the user is waiting => bad.)
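To make the state-tracking point concrete, here is a rough sketch in Python of handling probes as they fire instead of logging everything. Nothing here is real SpiderMonkey API: the event names and the GCClassifier class are made up for illustration. The only thing kept around is a small counter, plus the one bit of state needed to give a GC probe its meaning.

```python
from collections import Counter


class GCClassifier:
    """Toy online probe handler: counts GCs by whether the user was waiting.

    The event names ("user_wait_begin", "user_wait_end", "gc_start") are
    hypothetical, not real SpiderMonkey probe names.
    """

    def __init__(self):
        self.user_waiting = False  # state carried between probe firings
        self.counts = Counter()    # the only data retained; there is no log

    def on_probe(self, name):
        if name == "user_wait_begin":
            self.user_waiting = True
        elif name == "user_wait_end":
            self.user_waiting = False
        elif name == "gc_start":
            key = "gc_while_waiting" if self.user_waiting else "gc_while_idle"
            self.counts[key] += 1


classifier = GCClassifier()
for event in ["gc_start", "user_wait_begin", "gc_start", "gc_start",
              "user_wait_end", "gc_start"]:
    classifier.on_probe(event)

print(dict(classifier.counts))
```

The same stream written to a log would need a postprocessor to reconstruct the "was the user waiting?" state; here it is just a field.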
Those arguments are what led to the creation of tools like DTrace and Systemtap. Both give you a scripting environment that can aggregate information from probes as they fire, control exactly what information gets tracked as things are happening, and can be attached/detached at any time. They’re pretty cool, and invaluable once you get familiar with them. They’re also extremely system-dependent and generally require root access or special builds or kernel debuginfo or something, which ends up meaning that you often can’t just hand off analysis scripts to other people and have those people get some use out of them. And even you may not be able to take them to another environment.
Still, they deal pretty well with #4 (avoiding one-use, special-purpose processors), at least for environments matching the one they were written for. And if they can draw from statically-inserted probe points (the type I was talking about above), they can actually be pretty general. #5 is still a killer, though — at least the way I write systemtap scripts, they all end up with idiosyncratic ways of dumping out the results of some particular analysis, and nobody else is going to get much enlightenment without studying the script for a while first.
What if we could do better? What if we could insert these static probes, but rather than feeding the information to some niche tool that is usable by only a handful of people, we make the data available to a plain old Firefox addon? You could collect, aggregate, summarize, mutilate, fold, spindle, or crush the data directly in JS code. Then we could let addon authors go crazy with visualizations and analysis libraries. That’d be cool, right?
Graph GC behavior. Warn the user when slow or suspicious stuff is happening. Figure out what’s going on during long event handlers. Graph the percentage of time spent in different subsystems. Correlate performance/trace data with user-meaningful actions. Make a flight-recording of various metrics and let the user walk through history. Your ideas here.
Ok, so I tricked you. I’m not going to tell you how to do any of that. This blog post is a tease, an advertisement for the work that Brian Burg did this summer during his Mozilla internship. If you’re interested, he’ll be giving his internship final presentation tomorrow (today when you’re reading this, or perhaps yesterday or last month for those of you who have fallen behind on your Planet reading.) That’s 1:30PM PDT on Thursday, September 22 at the Mountain View Mozilla headquarters, and I’m 97.2% sure it will be broadcast over Air Mozilla as well. And taped, I think? (Sadly, I can’t find where those are archived. Somebody please tell me and I’ll update this post.) There will be a demo. With pretty pictures! And he’ll be writing it up on his own blog Real Soon Now. I’m not going to say any more for now — I’d get it wrong anyway.
Update: Argh! I got the date wrong! It’s not Wednesday, September 21 as I originally wrote. It’s today, Thursday, September 22. Sorry for the confusion!
Contexts and Compartments
A while ago (at the Platform offsite just after the last all-hands, actually) I wrote up what I understood about contexts and compartments. I’ve since sent it to a couple of people and put it up on the wiki, but haven’t distributed it more widely because I wasn’t sure it was all correct. I am far from an expert, but mrbkap (who *is* the expert) has now read through this and pointed out only one glaring mistake, which is now fixed. So other than the parts I’ve added since then, it should be more or less correct now and thus is ready for a wider audience.
See also http://www.christianwimmer.at/Publications/Wagner11a/Wagner11a.pdf for the fundamental idea of compartments.
Contexts=Control, Compartments=Data
JSContexts are control, JSCompartments are data.
A JSContext (from here on, just “context”) represents the execution of JS code. A context contains a JS stack and is associated with a thread. A thread may use multiple contexts, but a given context will only execute on a single thread at a time.
A JSCompartment (“compartment”) is a memory space that objects and other garbage-collected things (“GCthings”) are stored within.
A context is associated with a single compartment at all times (not necessarily always the same one, but only ever one at a time). The context is often said to be “running inside” that compartment. Any object created with that context will be physically stored within the context’s current compartment. Just about any GCthing read or touched by that context should also be within that same compartment.
To access data in another compartment, a context must first “enter” that other compartment. This is termed a “cross-compartment call” — remember, contexts are control, so changing a context’s compartment is only meaningful if you’re going to run code. The context will enter another compartment, do some stuff, then return, at which time it’ll exit back to the original compartment. (The APIs allow you to change to a different compartment and never change back, but using that is almost always a bug and will trigger an assertion in a debug build the first time you touch an object in a compartment that differs from your context’s compartment.)
When a context is not running code — as in, its JS stack is empty and it is not in a request — then it isn’t really associated with any compartment at all. In the future, starting a request and entering an initial compartment will become the same action. Also, a context is only ever running on one thread at a time. Update: or perhaps we’ll eliminate contexts altogether and just map from a thread to the relevant data.
In implementation terms, a context has a field (cx->compartment) that gives the current compartment. Contexts also maintain a default scope object (cx->globalObject) that is required to always be within the same compartment, and a “pending exception” object which, if set, will also be in the same compartment. Any object created using a context will be created inside the context’s current compartment, and the object’s scope chain will be initialized to a scope object within that same compartment. (That scope object might be cx->globalObject, but really that’s just the ultimate fallback. Usually the scope object will be found via the stack.)
To make a cross-compartment call, cx->compartment is updated to the new compartment. The scope object must also be updated, and for that reason you must pass in a target object in the destination compartment. The scope object will be set to the target object’s global object. (There’s a hacky special case when you’re using a JSScript for the target object, since they don’t have global objects, but ignore that.) If an exception is pending, it will be set to a wrapper (really, a proxy) inside the new compartment. The wrapper mediates access to the original exception object that lives in the origin compartment.
Finally, a dummy frame that represents the compartment transition is pushed onto the JS stack. This frame is used for setting the scope object of anything created while executing within the new compartment. Also, the security privileges of executing code are determined by the current stack — eg, if your chrome code in a chrome compartment calls a content script in a content compartment, that script will execute with content privileges until it returns, then will revert to chrome privileges.
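The enter/do stuff/exit shape of a cross-compartment call can be sketched as a toy model in Python. This is not the engine's code: the Context and Compartment classes and the cross_compartment_call name are invented for illustration, and the scope object and pending exception handling are left out entirely.

```python
from contextlib import contextmanager


class Compartment:
    def __init__(self, name):
        self.name = name


class Context:
    """Toy stand-in for a JSContext: a current compartment plus a stack."""

    def __init__(self, compartment):
        self.compartment = compartment
        self.stack = []  # frame labels, innermost last

    @contextmanager
    def cross_compartment_call(self, target):
        origin = self.compartment
        # Push a dummy frame marking the transition, then switch compartments.
        self.stack.append(f"dummy: {origin.name} -> {target.name}")
        self.compartment = target
        try:
            yield self
        finally:
            # Always exit back to the original compartment, even on error.
            self.stack.pop()
            self.compartment = origin


chrome = Compartment("chrome")
content = Compartment("content")
cx = Context(chrome)

with cx.cross_compartment_call(content):
    inside = cx.compartment.name   # running "inside" content here
    depth_inside = len(cx.stack)   # the dummy frame is on the stack

after = cx.compartment.name        # back in chrome
```

Tying the exit to the end of the block is the point of the sketch: changing compartments and never changing back is the case described above as almost always a bug.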
When debugging, it is helpful to know that a compartment is associated with a “JSPrincipals” object that represents the “security information” for the contents of that compartment. This is used to decide who can access what, and is mostly opaque to the JS engine. But for Gecko, it’ll typically contain a human-understandable URL, which makes it much easier to figure out what’s going on:
(gdb) p obj
$1 = (JSObject *) 0x7fffbeef
(gdb) p obj->compartment()
$2 = (JSCompartment *) 0xbf5450
(gdb) p obj->compartment()->principals()
$3 = (JSPrincipals *) 0xc29860
(gdb) p obj->compartment()->principals->codebase
$4 = 0x7fffd120 "[System Principal]"
...or perhaps...
$4 = 0x7fffd120 "http://angryhippos.com/accounts/"
Anything within a single compartment can freely and directly access anything else in that same compartment. No locking or wrappers are necessary (or possible). The overall model is thus a partitioning of all (garbage collectible) data into separate compartments, with controlled access from one compartment to another but lockless, direct access between objects within a compartment. Cross-compartment access is handled via “wrappers”, which is the subject of the next section.
Wrappers
GCthings may be wrapped in cross-compartment wrappers for a number of reasons. When a context is transitioning from one compartment to another (ie, it’s making a cross-compartment call), its scope object and pending exception (if any) are changed to wrappers pointing back to the objects in the old compartment. But any object can be wrapped in a cross-compartment wrapper if needed. You can clone an object from another compartment, and all of its properties will be wrappers pointing at the “real” properties in the origin compartment.
Cross-compartment wrappers do not compose. When you wrap an object, any existing wrappers will be ripped off first. (Slight oversimplification; there is one exception.) In fact, the type of wrapper used for an object is uniquely determined by the source and destination compartments.
The precise terminology is a little confusing. A cross-compartment wrapper is a JSObject whose class is one of the proxy classes. When you access such an object, it fetches its proxy handler (a subclass of JSProxyHandler) out of a slot to decide how to handle that access. Confusingly, in the code a JSCrossCompartmentWrapper is the subclass of JSProxyHandler that manages cross-compartment access, but usually when we refer to a “cross-compartment wrapper”, we’re really talking about the JSObject. (The JSObject of type js::SomethingProxyClass that has a private JSSLOT_PROXY_HANDLER field containing a JSProxyHandler subclass that knows how to mediate access to the proxied object stored in JSSLOT_PROXY_PRIVATE. Phew.)
A proxy handler mediates access to the proxied objects based on a set of rules embodied by some subclass of JSProxyHandler. A proxy handler might allow all accesses through, conceal certain properties, or check on each access whether the source compartment is allowed to see a particular property. Examples of proxy handler classes are the things listed on https://developer.mozilla.org/en/XPConnect_wrappers : cross-origin wrappers (XOWs), chrome object wrappers (COWs), etc.
Also, the same wrapper will always be used for a given object. This is necessary for equality testing between independently generated wrappings of the same object, and useful for performance and memory usage as well. Internally, every compartment has a wrapperCache that is keyed off of wrapped objects’ identity. You could think of the flavor of wrapper (i.e., the type of proxy handler) being determined by the tuple «destination compartment, source compartment, object», but the object is stored within the source compartment so those last two are redundant with each other.
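Two of the rules above (wrappers don't compose, and the same object always gets the same wrapper) are easy to show in a toy Python model. This is only a sketch under simplifying assumptions: the class and method names are invented, the one exception to the "ripped off first" rule is ignored, and I'm not claiming the real wrapperCache is keyed this way.

```python
class Wrapper:
    """Toy cross-compartment wrapper; `target` is the real object."""

    def __init__(self, target):
        self.target = target


class Compartment:
    def __init__(self, name):
        self.name = name
        self.wrapper_cache = {}  # id(real object) -> Wrapper

    def wrap(self, obj):
        # Wrappers do not compose: strip any existing wrapper first.
        while isinstance(obj, Wrapper):
            obj = obj.target
        # Reuse the cached wrapper so identity comparisons keep working.
        key = id(obj)
        if key not in self.wrapper_cache:
            self.wrapper_cache[key] = Wrapper(obj)
        return self.wrapper_cache[key]


content = Compartment("content")
chrome = Compartment("chrome")
real = object()

w1 = chrome.wrap(real)
w2 = chrome.wrap(real)                # independently generated wrapping
w3 = chrome.wrap(content.wrap(real))  # wrapping an already-wrapped object

print(w1 is w2, w3 is w1)
```

Because the cache hands back the same wrapper each time, an equality test between two wrappings of one object succeeds without ever looking through the wrapper.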
From the JS engine’s point of view, there are a bunch of objects, every object lives in some compartment, and whenever you call something or point to something in another compartment, the engine will interpose a cross-compartment wrapper for you. It’s up to the embedding (the user of the JS engine) to decide how to divide up data into different compartments, and what behavior is triggered when you cross between compartments. You could have a “home” compartment and a “bigger” compartment, and the cross-compartment wrapper could convert any string to Pig Latin when it is retrieved from “bigger” by “home”. More practically, you could conceal certain properties from view when accessing them from an “unprivileged” compartment (whatever that might mean in your embedding), or you could do locking or queuing when accessing one compartment from another compartment in a different thread. Or add a remoting layer.
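As a sketch of the "conceal certain properties" policy, here is a toy proxy in Python. It has nothing to do with the real JSProxyHandler interface; ConcealingWrapper and Account are hypothetical names, and a real embedding would decide what to hide based on the two compartments rather than a fixed set.

```python
class ConcealingWrapper:
    """Toy proxy that hides a fixed set of property names from the caller."""

    def __init__(self, target, hidden):
        self._target = target
        self._hidden = frozenset(hidden)

    def __getattr__(self, name):
        # Only called for names not found on the wrapper itself, so
        # _target and _hidden are served directly and never recurse.
        if name in self._hidden:
            raise AttributeError(f"{name!r} is not visible from this compartment")
        return getattr(self._target, name)


class Account:
    def __init__(self):
        self.owner = "alice"
        self.password = "hunter2"


view = ConcealingWrapper(Account(), hidden={"password"})
print(view.owner)
try:
    view.password
except AttributeError as err:
    print("blocked:", err)
```

Swapping the body of `__getattr__` is all it takes to get a different policy (Pig Latin, locking, remoting), which is the sense in which the behavior belongs to the embedding and not the engine.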
XPConnect (Gecko’s SpiderMonkey embedding code) uses cross-compartment wrappers to implement security policies and access rules. The ‘Introduction’ section at https://developer.mozilla.org/en/XPConnect_security_membranes gives a very good description of what XPConnect is using the wrappers for. Gecko uses (mostly) one compartment for chrome, and one compartment for each content domain. The wrapper is chosen based on whether the two compartments are the same origin, or whether one is privileged to see anything or a subset of the information in the other, etc. See js/src/xpconnect/wrappers/WrapperFactory.cpp for the gruesome details.
Future
(Or, “What Luke Wagner is plotting”.)
There are various plans that will probably change this picture substantially. Our threading story right now is a bit convoluted: compartments can only be touched by one thread at a time but can supposedly switch between threads, or something, and contexts need to be in a request before doing anything and beginning a request binds the context to a thread but requests can be suspended, and a context points to thread data but you need to rebind the thread data if you switch threads… it’s complicated, ok? I tried to document it once, but just kept confusing myself.
Luke plans to make JSRuntimes be single-thread only, eliminate JSContexts entirely, and make JSCompartments be per-global (right now you can have multiple global objects in a compartment). I don’t really understand all that (are JSRuntimes the new JSContexts?) but the point is that things are a’changin.
hg qedit
On his blog, Paul O’Shannessy came up with an ‘hg qedit’ alias that opens up an editor on your .hg/patches/series file for reordering your patch queue. It’s a nice simple solution to a common problem, so obviously I felt compelled to muck it up.
Here’s my version, for insertion into your ~/.hgrc:
[alias]
qedit = !S=$(hg root)/.hg/patches/series; cp $S $S.bak && perl -pale 'BEGIN { chomp(@a = qx(hg qapplied -q)); die if $?; @a{@a}=(); }; s/^/# (applied) / if exists $a{$F[0]}' $S > $S.new && ${EDITOR-vim} $S.new && sed -e 's/^# .applied. //' $S.new > $S
# Did you see this by scrolling over?
# I want better code snippet support
This fixes the main problem with zpao’s solution, which is that it’s too clean and simple.
No, wait, that’s not a problem.
The problem is that when I edit my series file, I often forget that I have some patches applied and end up reordering applied patches, which makes a complete mess. The above alias opens up an editor on your series file, only it also inserts comments showing which patches are already applied. (If you really, really want to mess yourself up, go ahead and reorder the commented lines. You’ll get what you deserve.)
Here’s what my queue looks like when editing the series file:
# (applied) better-dtrace-probes
# (applied) try-enable-dtrace
# (applied) bug-650078-no-remote
bug-677985-callouts
bug-677949-gc-roots
hack-stackiter
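If the one-liner is hard to read, this is roughly what it does to the series file, spelled out in Python. It is a sketch of the same annotate-then-strip idea, not a replacement for the alias; the function names are mine, and it takes the applied list as an argument instead of running hg.

```python
APPLIED_PREFIX = "# (applied) "


def annotate(series_lines, applied):
    """Prefix each series entry whose first field names an applied patch."""
    applied = set(applied)
    out = []
    for line in series_lines:
        fields = line.split()
        name = fields[0] if fields else None
        out.append(APPLIED_PREFIX + line if name in applied else line)
    return out


def strip_annotations(lines):
    """Undo annotate(): drop the marker prefix wherever it appears."""
    return [line[len(APPLIED_PREFIX):] if line.startswith(APPLIED_PREFIX) else line
            for line in lines]


series = ["better-dtrace-probes", "try-enable-dtrace", "bug-650078-no-remote",
          "bug-677985-callouts", "bug-677949-gc-roots", "hack-stackiter"]
applied = series[:3]

edited = annotate(series, applied)
print("\n".join(edited))
```

Running `strip_annotations(edited)` gives back the original series, which is the job the trailing sed does in the alias.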
Come to think of it, mq really shouldn’t let you mess up that way in the first place. It knows the original patch names for your applied patches (unless you are really determined to make your life difficult, and commit things on top without going through mq at all). It could detect when you reordered applied patches, and just undo what you did. And call you names. But maybe that would slow things down.
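The detection part, at least, would be cheap. Here is a minimal sketch in Python of the check I have in mind, assuming you have the output of hg qapplied -q (in the order the patches were applied) and the edited series as lists of names. The function is hypothetical, not anything in mq, and it ignores guards and other complications.

```python
def applied_order_preserved(applied, edited_series):
    """True if every applied patch is still present, in the same relative order.

    `applied` is the list of applied patch names, first-applied first.
    `edited_series` is the list of patch names from the edited series file.
    """
    applied_set = set(applied)
    seen = [name for name in edited_series if name in applied_set]
    return seen == list(applied)


applied = ["better-dtrace-probes", "try-enable-dtrace", "bug-650078-no-remote"]

ok_edit = applied + ["bug-677949-gc-roots", "bug-677985-callouts", "hack-stackiter"]
bad_edit = ["try-enable-dtrace", "better-dtrace-probes", "bug-650078-no-remote",
            "bug-677985-callouts"]

print(applied_order_preserved(applied, ok_edit))
print(applied_order_preserved(applied, bad_edit))
```

Reordering only the unapplied tail passes; swapping two applied patches, or deleting one, fails. The name-calling is left as an exercise.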
Update: it wasn’t working for jlebar, which turned out to be because he had added qapplied=-v to his [defaults] section. The above is now fixed for that scenario by adding a -q flag to hg qapplied.
Zombie Hunting
I’ve been looking at bug 669730 where enabling Firebug on a page (http://nytimes.com/ to be precise) results in the page’s compartment living forever. This is easy to see, now that we have the incredibly useful about:memory and its per-compartment breakdown. (What’s a compartment? It’s a memory space to keep related garbage collectable objects in. See the compartment paper, or for some more detail about how they are used in Firefox, try my contexts vs compartments writeup, though it’s more about contexts than compartments.)
I managed to find the object keeping the compartment alive, and I thought I should document what I did to either help other people hunt down zombie compartments, or beg for better tools, or both. (Note that I haven’t actually fixed the bug, so this is a little premature, but the hunting process is far more likely to be reusable and of general interest than the specifics of this leak.) Oh, and I didn’t actually figure everything out; I just kind of stumbled across the right answer.
I have a zombie compartment, so something in the compartment isn’t getting collected. That means there’s at least one GCthing alive in the compartment that shouldn’t be. The “inner” objects aren’t of much interest, so what really matters is that there’s at least one unwanted root. I want to figure out what that root is.
Things I know of that can be roots are pointers gathered by the conservative stack scanner, cross-compartment wrappers, and explicitly added GC roots.
The conservative roots are unlikely to matter here, because this leak survives returning to the event loop. The cross-compartment wrappers are what I initially suspected. I think that whenever XPCOM points to a JS object, or at least a JS object in a content compartment as this one is, it goes through a cross-compartment wrapper and the wrapped object is considered to be a GC root. So I want to see the objects rooted by cross-compartment wrappers.
I guess the wrappers I care about will actually live in a different compartment and point into the nytimes.com compartment. But it doesn’t seem to matter in this case, because the only function I could see to set a breakpoint in is JSCompartment::markCrossCompartmentWrappers() and in my test runs, it never seemed to hit. It looks like maybe it only gets called when doing a compartmental GC, and we probably don’t do many of those on a zombie compartment. Still, how do those roots get marked? I still don’t know, because while wandering around the code trying to figure out what was going on, I stumbled across the right place to watch for the third set of roots — explicitly added roots — and I took a detour to check those out. (Thinking about it, it’s not the cross-compartment wrappers that are the roots, it’s the objects they point to. Maybe those end up in the explicitly-added roots list? Dunno.)
Specifically, MarkRuntime() in jsgc.cpp iterates through a runtime’s gcRootsHash and calls gc_root_traversal on each one. That grabs out the pointer value and name (yay!) of each root and scans it. So all I needed to do was check each of these roots to see which compartment it’s in, and stop when the compartment is the one I care about. Fortunately, gc_root_traversal calls MarkIfGCThingWord and it already computes the compartment. (It’s just a bit of bit masking and pointer chasing to do manually, so it’s not a big deal anyway.)
Conditional breakpoints are great and everything, but from the name of the function it sounded like it might get called a lot, so I just crammed my own debug code into the routine:
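// Set from gdb once the zombie compartment's address is known.
static JSCompartment *interesting_compartment = NULL;
// Report only the roots that live in that compartment.
if (aheader->compartment == interesting_compartment)
    printf("root: %p kind %u\n", (void*)addr, thingKind);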
Then I reran under gdb. I still needed the address of the compartment (unfortunately, about:memory only shows compartment pointers for chrome compartments). So I looked for something looping over compartments, found several, and set a breakpoint in one of them. I’m not sure which one. They all loop over rt->compartments. For each compartment, I displayed the principals->codebase value:
(gdb) display (*c)->principals ? *(*c)->principals : 0
Then I ‘n’exted through until I found the nytimes.com one. With that pointer in hand, I set a breakpoint on my added code, above, and set ‘interesting_compartment’ to my magic pointer value. This printed out the address of the root in question, together with its ‘thingKind’, which was zero. A quick look at jsgc.h showed me that zero means FINALIZE_OBJECT0, and the code just after my printf showed that I can cast that to JSObject*. A call to js_DumpObject((JSObject*)0x7fffd0a99d78) told me that this was an ‘Error’ object. Even better, when I walked up the stack one level, I could see that the root was labeled “JSDValue”.
So JSD is hanging onto a content Error object, probably one that it grabbed from hooking exception throwing or catching. Is it JSD not discarding something when you turn it off, or Firebug holding onto the object itself? I don’t know yet.
Learnings:
- We need an easy way to get the pointer value of all compartments. Maybe in the ?verbose=1 output of about:memory?
- Enumerating roots is handy. We should expose a function to dump out the roots given a compartment, so we could do this whole analysis via the chrome-privileged Web Console.
- That means we need a way to refer to compartments from JS. Perhaps a weak map from principals’ codebases to JSCompartment* objects?
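To show what I mean by the last two items, here is a toy Python sketch of the lookup I wish existed: go from a codebase string to the matching compartments, then list the named roots that live in them. Everything in it is hypothetical, including the record types, the function name, and the addresses and root names in the sample data; no such API exists in the engine.

```python
from collections import namedtuple

# Hypothetical records; the real engine exposes nothing like this to JS.
Root = namedtuple("Root", "name address compartment")
Compartment = namedtuple("Compartment", "address codebase")


def roots_in_compartment(roots, compartments, codebase_substring):
    """Return the named roots whose compartment's codebase contains the substring."""
    wanted = {c.address for c in compartments if codebase_substring in c.codebase}
    return [r for r in roots if r.compartment in wanted]


# Made-up sample data for illustration only.
compartments = [
    Compartment(0x1000, "[System Principal]"),
    Compartment(0x2000, "http://nytimes.com/"),
]
roots = [
    Root("some chrome root", 0xA0, 0x1000),
    Root("JSDValue", 0xB0, 0x2000),
]

for root in roots_in_compartment(roots, compartments, "nytimes.com"):
    print(root.name, hex(root.address))
```

With something like this callable from a chrome-privileged console, the whole gdb session above collapses into one query.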
Related: see Jim Blandy’s bug 672736 for adding a findReferences JS call that gives all of the incoming edges to a JS object. I originally misinterpreted that to mean displaying the full path from a GC root to an object, and I started out trying to use findReferences by grabbing any object in the zombie compartment and calling findReferences on it. But I stopped when I realized that, knowing as little about the memory layout as I do, it was probably easier for me to find the roots themselves than figuring out how to look into the chunks/arenas/arena pools/whatever for the compartment to grab out random GCthings. And all I wanted was the root, so findReferences wouldn’t be of interest unless it crossed the XPConnect boundary and told me what was keeping the JS object alive via some sort of wrapper.
Now please, someone comment and tell me how I could have done this much more easily…
Heredity puzzle
I know this is wrong, but I’m going to use my privileged status as a source for Planet Mozilla (ok, it’s not that privileged) to point people to a non-Mozilla-related puzzle on my personal blog, because the few friends I have who read it are lame and haven’t come up with any answers yet. (Did I mean “the few friends I have, who read it” or “the few who read it”? None of your business.) And I really want somebody to come up with a simple, understandable proof.
Warning: it involves incest, intergenerational sexual relations, and time travel. But what doesn’t, these days?
Firefox at work
As with several other people, I’m going to put my opinions on the whole “Firefox hates enterprise users” kerfuffle here. Why here? Because this way I don’t have to pretend to have read everyone else’s thoughtful and incisive comments on the mailing list thread, and because the conversation isn’t going to move from the mailing list to here so I can safely jot out my unconsidered opinions and then back away, quickly. (I am in fact not reading every message in the thread. I scan through it once every 1-2 days and look for messages written by a handful of users who tend to say things I find worth hearing.)
The conversation seems to be generating far more heat than light. One thing that strikes me is that some questions are worth considering, and others aren’t. For example: “should Mozilla treat enterprises as a priority?” is not worth considering. It’s nearly meaningless. Consider these answers: “Yes, of course it should! Mozilla cares about everybody!” “No, it costs too much given the small slice of our user base that it represents.” Is anybody happy with either of them? Can anyone figure out what to do based on either conclusion? I can’t.
So I wanted to write out some questions that I think are worth considering. But first, a pet peeve: I’m not going to use the word “enterprise”. It barely means something as a noun, and means everything and nothing as an adjective (which is worse than meaning nothing, because then at least readers don’t imbue it with whatever meaning it holds in their own heads.) Anyway, here goes:
- If corporations abandoned Firefox, what impact would it have on Mozilla’s mission?
Forget for now why they’re abandoning FF. Maybe we decided to heighten security by rendering all HTTP pages in unselectable ROT-13 text, requiring our users to learn to decode them from memory or switch to HTTPS. Maybe we punish bad JS/server side coding practices by automatically detecting them and posting unencrypted passwords to a public server. Whatever it is, what impact would it have? How many people would not want to use a different browser at home and at work, and what would be the impact on our market share? What are the influence patterns that matter to us (eg new-to-the-Web users pick their browser based on what they or their friends know about from work)? How many add-on authors would support multiple platforms? Standardize on a single non-Firefox platform? Are corporate users already comfortable with using multiple browsers (eg IE6 for the intranet, something else for everything else)?
- What are we actually changing that is relevant to corporate users?
I think the answer is something like: we’re releasing features at a faster cadence. The average feature per unit time metric isn’t intentionally being changed, though MoCo is doing a lot of hiring so that’ll probably ramp up too for reasons other than our release policy. And we’re no longer separating features from fixes.
IM(naive)O, business users care about the first (frequency of features becoming available), but not enough to override other concerns. That’s why long-term support versions exist. Actually, that’s an oversimplification: business users are the same as anyone else, and would much prefer to have the most features the soonest, but those damn IT departments get mad at them when they install the latest version of FirewallBusterSupreme the day before it’s officially released. Forgoing shiny new features is the cost of ensuring stability and predictability, and you never want the scheduling and cost of upgrades to be any more at the whim of your vendor than absolutely necessary. You have a core business function to worry about, unless of course your core business happens to be advising other businesses about the impact of software upgrades. (Which is a sucky business, by the way; it’s another one of those where your customers are pretty much guaranteed to hate you. Your only function is to tell them “no”.) So the key bit really is the separation of features from fixes, or really, “stuff that is likely to break my users” from “stuff that is relatively safe and likely to keep me closer to the status quo than not having it would be”. An obvious example of the latter is security fixes: the risk from the software change is less than the risk from people starting to exploit some new vulnerability.
- What level of support would make the difference to “enough” corporate users?
- What are the relevant types of support?
If 60 months of long term support for selected versions isn’t enough for an IT department, then nothing will be, so forget about those users. And what does “long term support” mean, anyway? It isn’t a binary distinction. 60 months of “all security fixes we ever make for any version” is obviously unsustainable. I’m not sure “critical security fixes” is adequately complete as a description either. The severity of a security fix leaves out many relevant factors: backwards compatibility, divergence from trunk, maintainability, etc.
- What do various support options cost the Mozilla organization?
Cost, by the way, isn’t just measured in money coming out of MoCo’s pockets. Focus, maintainability, goodwill, brand, freshness, risk, etc.
- What could we do to make it easier for someone (perhaps us) to better support IT department-blessed use?
If we offered up a package (access to sensitive bugs + cash + agreement to support specific components + testing infrastructure + …) to attempt to lure 3rd parties into maintaining older versions “for us”, would anyone bite? (“For us” in quotes because we’re the Mozilla community, and they’d immediately become part of “us” if they accepted.)
- What is the connection between a rapid release cadence and long term support?
They’re not diametrically opposed, but rapid releases obviously introduce difficulties for long term support. You could do rapid releases without affecting long term support at all, if all you’re talking about is the frequency of releases. But we’re not; we want to release new features quickly. In the gray area are incompatible changes to existing functionality that aren’t critical for moving the Web forward.
- What other groups are affected by the same long term support issues as “enterprise” (sorry) users?
Add-on authors have already been brought up. We at least have a defensible story there (mainly, “use the add-on SDK”.) That community could be usefully subdivided, but who else is affected?
Them’s all the thoughts that’ve leaked out of my brain so far. I’ll try to do a better job of keeping them inside where they belong.
More stupid mercurial tricks
I think I’m missing something. How do people get those changeset URLs to paste into bugs? Ok, if I’m landing on mozilla-central or a project branch, I just get it from tbpl since I’ll be staring at it anyway. But what about some other repo? Like, say, ssh://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog?
As usual, I coded my way around the problem before asking the question, which is stupid and backwards. But just in case there really isn’t a good way, here’s my silly hackaround. Put this in the [alias] section of your ~/.hgrc. Then, after landing a change, do ‘hg urls -l 3’ or similar. (That’ll give you the latest 3 changesets):
urls = !$HG log --template='{node|short} {desc|firstline}\n' ${HG_ARGS/urls /} | perl -lpe 'BEGIN { ($url = shift) =~ s/^\w+/http/ }; s!^(?=\w+)!$url/rev/!' `hg path default`
Picking that apart, it removes the misfeature that $HG_ARGS contains the command you’re running, then passes the remaining command line to hg log with a template set to just print out the changeset shorthash and the first line of the commit message. It sends that and the URL of the default upstream repo through a perl command that rewrites the hg log output to “
A mess, but it works for me.
And yes, I should switch to a blog that isn’t hostile to code. Sorry about that line up there.
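For anyone who finds the perl hard to read, here is roughly the same rewrite as a minimal Python sketch. The function name `changeset_urls` and the sample hash are made up for illustration. Like the alias, it assumes that swapping the scheme of the default path for http gives a URL you can browse.

```python
import re


def changeset_urls(default_path, log_lines):
    """Prefix each 'shorthash description' line with <repo-url>/rev/."""
    # Like the perl BEGIN block: replace the leading scheme (e.g. ssh) with http.
    base = re.sub(r"^\w+", "http", default_path, count=1)
    prefix = base + "/rev/"
    # Like s!^(?=\w+)!$url/rev/!: only lines that start with a word
    # character get the prefix; anything else passes through unchanged.
    return [prefix + line if re.match(r"\w", line) else line
            for line in log_lines]


if __name__ == "__main__":
    # The repo path is the one mentioned above; the hash is invented.
    repo = "ssh://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog"
    for line in changeset_urls(repo, ["0123456789ab Example commit message"]):
        print(line)
```

In the alias itself, the input lines come from hg log with the `{node|short} {desc|firstline}` template and the repo path comes from `hg path default`.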
Record your freshness
I often like to split patches up into independent pieces, for ease of reviewing by both reviewers and myself. You can split off preparatory refactorings, low-level mechanism from high-level users, features from tests, etc., making it much easier to evaluate the sanity of each piece.
But it’s something of a pain to do. If I’ve been hacking along and accumulated a monster patch, with stock hg and mq I’d do:
hg qref -X '*' # get all the changes in the working directory; only
# needed if you've been qref'ing along the way
hg qref -I '...pattern...' # put in any touched files
hg qnew temp # stash away the rest so you can edit the patch
hg qpop
hg qpop # go back to unpatched version
emacs $(hg root --mq)/patchname # hack out the pieces you don't want,
# put them in /tmp/p or somewhere...
hg qpush # reapply just the parts you want
patch -p1 < /tmp/p
... # you get the point. There'll be a qfold somewhere in here...
and on and on. It’s a major pain. I even started working on a web-based patch munging tool because I was doing it so often.
Then I discovered qcrecord, part of the crecord extension. It is teh awesome with a capital T (and A, but this is a family blog). It gives you a mostly-spiffy-but-slightly-clunky curses (textual) interface to select which files to include, and within those files which patch chunks to include, and within those chunks which individual lines to include. That last part, especially, is way cool — it lets you do things that you’d have to be crazy to attempt working with the raw patches, and are a major nuisance with the raw files.
Assuming you are again starting with a huge patch that you’ve been qreffing, the workflow goes something like:
hg qref -X '*'
hg qcrecord my-patch-part1
hg qcrecord my-patch-part2
hg qcrecord my-patch-part3
hg qpop -a
hg qrm original-patchname
hg qpush -a
Way, way nicer. No more dangerous direct edits of patch files. But what’s that messy business about nuking the original patch? Hold that thought.
Now that you have a nicely split-up patch series, you’ll be wanting to edit various parts of it. As usual with mq, you qpop or qgoto to the patch you want to hack on, then edit it, and finally qref (qrefresh). But many times you’ll end up putting in some bits and pieces that really belong in the other patches. So if you were working on my-patch-part2 and made some changes that really belong in my-patch-part3, you do something like:
hg qcrecord piece-meant-for-part3 # only select the part intended for part3
hg qnew remaining-updates-for-part2 # make a patch with the rest of the updates, to go into part2
hg qgoto my-patch-part2
hg qpush --move remaining-updates-for-part2 # now we have part2 and its updates adjacent
hg qpop
hg qfold remaining-updates-for-part2 # fold them together, producing a final part2
hg qpush
hg qfold my-patch-part3 # fold in part3 with its updates from the beginning
hg qmv my-patch-part3 # and rename, mangling the comment
or at least, that’s what I generally do. If I were smarter, I would use qcrecord to pick out the remaining updates for part2, making it just:
hg qcrecord more-part2 # select everything intended for part2
hg qnew update-part3 # make a patch with the rest, intended for part3
hg qfold my-patch-part3 # fold to make a final part3
hg qmv my-patch-part3 # ...with the wrong name, so fix and mess up the comment
hg qgoto my-patch-part2
hg qfold more-part2 # and make a final part2
but that’s still a mess. The fundamental problem is that, as great as qcrecord is, it always wants to create a new patch. And you don’t.
Enter qcrefresh. It doesn’t exist, but you can get it by replacing your stock crecord with
hg clone https://sfink@bitbucket.org/sfink/crecord # Obsolete!
Update: it has been merged into the main crecord repo! Use
hg clone https://bitbucket.org/edgimar/crecord
It does the obvious thing — it does the equivalent of a qrefresh, except it uses the crecord interface to select what parts should end up in the current patch. So now the above is:
hg qcref # Keep everything you want for the current patch
hg qnew update-part3
hg qfold my-patch-part3
hg qmv my-patch-part3
Still a little bit of juggling (though you could alias the latter 3 commands in your ~/.hgrc, I guess.) It would be nice if qfold had a “reverse fold” option.
Finally, when splitting up a large patch you often want to keep the original patch’s name and comment, so you’d really do:
hg qcref # keep just the parts you want in the main patch
hg qcrec my-patch-part2 # make a final part2
hg qcrec my-patch-part3 # make a final part3
And life is good.
mozilla-central automated landing proposal
This was originally a post to the monster thread “Data and commit rules” on dev-planning, which descended from the even bigger thread “Proposing a tree rule change for mozilla-central”. But it’s really an independent proposal, implementable with or without the changes discussed in those threads. It is most like Ehsan’s automated landing proposal but takes a somewhat different approach.
- Create a mozilla-pending tree. All pushes are queued up here. Each gets its own build, but no build starts until the preceding push’s build is complete and successful (the tests don’t need to succeed, nor even start.) Or maybe mostly complete, if we have some slow builds.
- Pushers have to watch their own results, though anyone can star on their behalf.
- Any failures are sent to the pusher, via firebot on IRC, email, instant messaging, registered mail, carrier pigeon, trained rat, and psychic medium (in extreme circumstances.)
- When starring, you have to explicitly say whether the result is known-intermittent, questionable, or other. (Other means the push was bad.)
- When any push “finishes” — all expected results have been seen — then it is eligible to proceed. Meaning, if all results are green or starred known-intermittent, its patches are automatically pushed to mozilla-central.
- Any questionable result is automatically retried once, but no matter what the outcome of the new job is, all results still have to be starred as known-intermittent for the push to go to mozilla-central.
- Any bad results (build failures or results starred as failing) cause the push to be automatically backed out and all jobs for later pushes canceled. The push is evicted from the queue, all later pushes are requeued, and the process restarts at the top.
- When all results are in, a completion notification is sent to the pusher with the number of remaining unmarked failures.
Silly 20-minute Gimped-up example:
- Good1 and Good2 are queued up, followed by a bad push Bad1
- The builds trickle in. Good1 and Good2 both have a pair of intermittent oranges.
- The pusher, or someone, stars the intermittent oranges and Good1 and Good2 are pushed to mozilla-central
- The oranges on Bad1 turn out to be real. They are starred as failures, and the push is rolled back.
- All builds for Good3 and Good4 are discarded. (Notice how they have fewer results in the 3rd line?)
- Good3 gets an unknown orange. The test is retriggered.
- Bad1 gets fixed and pushed back onto the queue.
- Good3’s orange turns out to be intermittent, so it is starred. That is the trigger for landing it on mozilla-central (assuming all jobs are done.)
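The rules and the example above can be played out in a toy model. This is only a sketch under assumptions: the names `Result`, `Push`, `verdict` and `advance` are invented here, retries and notifications are left out, and it assumes pushes land strictly in queue order, which the proposal implies (each push builds on the previous one) but does not spell out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Result(Enum):
    GREEN = "green"
    KNOWN_INTERMITTENT = "known-intermittent"
    QUESTIONABLE = "questionable"
    BAD = "bad"  # a build failure, or a result starred "other"


# Results that do not block landing on mozilla-central.
LANDABLE = {Result.GREEN, Result.KNOWN_INTERMITTENT}


@dataclass
class Push:
    name: str
    results: list = field(default_factory=list)  # empty: nothing reported yet


def verdict(push):
    """Decide what happens to a push based on its current results."""
    if not push.results:
        return "wait"      # freshly queued or requeued, jobs not finished
    if Result.BAD in push.results:
        return "backout"
    if all(r in LANDABLE for r in push.results):
        return "land"
    return "wait"          # questionable results still need a star


def advance(pending, landed):
    """Process the queue from the front until something blocks.

    Mutates both lists and returns the names of pushes backed out.
    """
    backed_out = []
    while pending:
        action = verdict(pending[0])
        if action == "land":
            landed.append(pending.pop(0).name)
        elif action == "backout":
            backed_out.append(pending.pop(0).name)
            # Everything behind a bad push is requeued: its results are
            # discarded and its jobs have to run again.
            for later in pending:
                later.results = []
        else:
            break
    return backed_out
```

Fed the example's queue (Good1, Good2, Bad1, Good3, Good4), one call to `advance` lands Good1 and Good2, backs out Bad1, and leaves Good3 and Good4 waiting with their results wiped. Good3 then stays stuck while its orange is questionable, and lands once it is starred known-intermittent.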
To deal with needs-clobber, you can set that as a flag on a push when queueing it up. (Possibly on your second try, when you discover that it needs it.)
mozilla-central doesn’t actually need to do builds, since it only gets exact tree versions that have already passed through a full cycle.
On a perf regression, you have to queue up a backout through the same mechanism, and your life kinda sucks for a while and you’ll probably have to be very friendly with the Try server.
Project branch merges go through the same pipeline. I’d be tempted to allow them to jump the queue.
You would normally pull from mozilla-pending only to queue up landings. For development, you’d pull mozilla-central.
Alternatively, mozilla-central would pull directly from the relevant changeset on mozilla-pending, meaning it would get all of the backouts in its history. But then you could use mozilla-pending directly. (You’d be at the mercy of pending failures, which would cause you to rebase on top of the resulting backouts. But that’s not substantially different from the alternative, where you have perf regression-triggered backouts and other people’s changes to contend with.) Upon further reflection, I think I like this better than making mozilla-central’s history artificially clean.
The major danger I see here is that the queue can grow arbitrarily. But you have a collective incentive for everyone in the queue to scrutinize the failures up at the front of the queue, so the length should be self-limiting even if people aren’t watching their own pushes very well. (Which gets harder to do in this model, since you never know when your turn will come up, and you’re guaranteed to have to wait a whole build cycle.)
You’d probably also want a way to step out of the queue when you discover a problem yourself.
Did I just recreate Ehsan’s long-term proposal? No. For one, this one doesn’t depend on fixing the intermittent orange problem first, though it does gain from it. (More good pushes go through without waiting on human intervention.)
But Ehsan’s proposal is sort of like a separate channel into mozilla-central, using the try server and automated merges to detect bit-rotting. This proposal relies on being the only path to mozilla-central, so there’s no opportunity for bitrot.
What’s the justification for this? Well, if you play fast and loose with assumptions, it’s the optimal algorithm for landing a collection of unproven changes. If all changes are good, you trivially get almost the best pipelining of tests (the best would be spawning builds immediately). With a bad change, you have to assume that all results after that point are useless, so you have no new information to use to decide between the remaining changes. There are faster algorithms that would try appending pushes in parallel, but they get more complicated and burn way more infrastructural resources. (Having two mozilla-pendings that merge into one mozilla-mergedpending before feeding into mozilla-central might be vaguely reasonable, but that’s already more than my brain can encompass and would probably make perf regressions suck too hard…)
Side question: how many non-intermittent failures happen on Windows PGO builds that would not happen on (faster) Windows non-PGO builds?
Wading through history
Recently — well, actually, by now it wasn’t recently at all — I received a review request for a patch to JSD. It fixed an intermittent crash when using Firebug on a page that went into an endless stack-eating loop. A couple of people had worked on reproducing it, and the exact conditions were a little flaky, so I first tried it out myself. Kaboom! Yay!
So I imported the patch just to verify that it fixed the problem. Before compiling with it, I updated my tree to the latest version. Why? I don’t know. Just because it’s what I usually do. It seemed like a good idea at the time.
Only it wasn’t. It was a really, really dumb idea. I was changing two variables while trying to test one of them, and I got what I deserved: it stopped crashing after the patch, but when digging in to verify that it really was behaving as intended, I discovered that it no longer crashed without the patch either.
This was just before the All Hands, and although I poked at it every few days, I didn’t make any headway: the patch seemed good, but I really wanted to confirm that it fixed the crash. (There were reasons why I was a little skeptical, but it’s not really relevant here.)
Eventually, when I had some time to think about it properly, I realized the best thing to do would be to revert to the older version that crashed for me. But how to find it?
One way would be to binary search nightlies. But I happened to be on a poor network connection, and downloading nightlies was insanely slow.
Also, I thought I should be able to do better. I run with an mq extension (mq = Mercurial Queues) that commits my patch queue on any change. Get it at git://github.com/hotsphink/mqext.git (I really should switch to bitbucket, rather than pointlessly restricting my audience to people who are minimally comfortable with both git and hg.) So all I had to do was to go back to the point where I imported the patch from bugzilla.
Finding the right moment was easy: ‘hg log --mq’ showed me all the changes made to my patch queue, one of which was commented “IMPORT: bz://643360” (an autogenerated comment courtesy of mqext.) That was changeset 026ac43e9114. Yay!
But that changeset is for my patch queue, not my source repo. Fortunately, mq stores ‘parent’ fields in patch files that give the source repo changeset id that a patch was applied on top of. I’ll skip a number of failed attempts to track through this, and just give my final recipe:
- (already described) hg log --mq to find the appropriate changeset in the patch queue repo.
- cd to .hg/patches and run hg cat -r changeset series. This is because you need to know the names of the patch files in order to look at them — or specifically, the name of the first patch file, because it’s the only one whose parent will still be in the source repo. All other patches’ parents will be the source repo with mq patches applied to them, and will have been stripped out of the repo due to intervening actions. Because hg (or rather, mq) is not interested in preserving history.
- hg cat -r changeset firstpatchname and look for the “# Parent changeset” line.
- cd back to your source repo and fetch that revision however you want — update to it, or clone a repo with it, or whatever.
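The two lookups in the middle of that recipe are easy to script. Here is a minimal Python sketch that works on the text printed by the hg cat commands. The function names are invented, and the header format is an assumption based on the description above: a line near the top of the patch that starts with “# Parent” followed by a hex changeset id.

```python
import re


def first_patch_name(series_text):
    """Return the first patch listed in an mq series file, or None."""
    for line in series_text.splitlines():
        # Drop comments and trailing guards such as "name.patch #+guard".
        entry = line.split("#", 1)[0].strip()
        if entry:
            return entry
    return None


def parent_changeset(patch_text):
    """Return the changeset id from the patch header's '# Parent' line."""
    for line in patch_text.splitlines():
        if line.startswith("diff "):
            break  # past the header; stop before the diff body
        match = re.match(r"#\s*Parent\s+([0-9a-fA-F]+)", line)
        if match:
            return match.group(1)
    return None
```

Pass the output of hg cat -r changeset series to `first_patch_name`, then the output of hg cat for that patch to `parent_changeset`, and update your source repo to whatever it returns.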
I’m guessing this little recipe isn’t going to be useful to very many people, but I wanted to write it out for myself. So phbbbtt!!!
sfink

