Have you ever had your browser mysteriously stall periodically and wondered “what the f#@$! is it doing?!!” Or perhaps you’re working on something, say the garbage collector, and you’d like to see what effect your changes are having. Or maybe even write a little analysis that postprocesses some sort of trace of what is going on, and figures out what the optimal pattern of actions would be. (“If I’d thrown this big chunk of data out of the cache here, then I would’ve had room for all of these little things that got evicted instead, and would have had way fewer misses…”)
The usual way to do things like this is to manually add some instrumentation code (probably just logging a bunch of events) and postprocess the results. This works fine, but it has a few drawbacks: (1) you have to figure out where to insert your instrumentation, often in unfamiliar code; (2) you’ll need to recompile, possibly several times; (3) the logs can get very large very quickly; and (4) you’ll probably end up writing a very special-purpose postprocessor that (5) dumps stuff to a text file that only you know how to interpret, and even you will only remember what it all means for a week or two. The next time you need to do something similar, you’ll find that all of your instrumentation code is severely bitrotted and misses some paths that have been added in the meantime, so you’ll start everything over from scratch.
Well, tough luck. Sometimes those are just facts of life and you’ll need to suck it up. Quit whining, dammit.
But many times, the events of interest (or more precisely, “probe points”) are of general interest. If you can manage to slip them into the code and so get other developers to maintain them for you as they make changes, then everyone can rely on those probes being in roughly the right place permanently. That’s #1 above, and depending on how they’re implemented there’s a good chance you won’t even need to recompile, so that’s #2.
#3 (log it all vs online handling) ventures into religious territory. It is easiest to mindlessly log everything of interest and postprocess it. But what if you want realtime updates? Or if you want to track different information depending on what you learn from other probe points? Or what if the volume of your log writing interferes with whatever you’re trying to measure (eg disk I/O)? Or maybe you need to track some sort of state in order to give the probes meaning. (GC when idle => good. Avoidable GC when the user is waiting => bad.)
Those arguments are what led to the creation of tools like DTrace and Systemtap. Both give you a scripting environment that can aggregate information from probes as they fire, control exactly what information gets tracked as things are happening, and can be attached/detached at any time. They’re pretty cool, and invaluable once you get familiar with them. They’re also extremely system-dependent and generally require root access or special builds or kernel debuginfo or something, which ends up meaning that you often can’t just hand off analysis scripts to other people and have those people get some use out of them. And even you may not be able to take them to another environment.
Still, they deal pretty well with #4 (avoiding one-use, special-purpose processors), at least for environments matching the one they were written for. And if they can draw from statically-inserted probe points (the type I was talking about above), they can actually be pretty general. #5 is still a killer, though — at least the way I write systemtap scripts, they all end up with idiosyncratic ways of dumping out the results of some particular analysis, and nobody else is going to get much enlightenment without studying the script for a while first.
What if we could do better? What if we could insert these static probes, but rather than feeding the information to some niche tool that is usable by only a handful of people, we make the data available to a plain old Firefox addon? You could collect, aggregate, summarize, mutilate, fold, spindle, or crush the data directly in JS code. Then we could let addon authors go crazy with visualizations and analysis libraries. That’d be cool, right?
Graph GC behavior. Warn the user when slow or suspicious stuff is happening. Figure out what’s going on during long event handlers. Graph the percentage of time spent in different subsystems. Correlate performance/trace data with user-meaningful actions. Make a flight-recording of various metrics and let the user walk through history. Your ideas here.
Ok, so I tricked you. I’m not going to tell you how to do any of that. This blog post is a tease, an advertisement for the work that Brian Burg did this summer during his Mozilla internship. If you’re interested, he’ll be giving his internship final presentation tomorrow (today when you’re reading this, or perhaps yesterday or last month for those of you who have fallen behind on your Planet reading.) That’s 1:30PM PDT on Thursday, September 22 at the Mountain View Mozilla headquarters, and I’m 97.2% sure it will be broadcast over Air Mozilla as well. And taped, I think? (Sadly, I can’t find where those are archived. Somebody please tell me and I’ll update this post.) There will be a demo. With pretty pictures! And he’ll be writing it up on his own blog Real Soon Now. I’m not going to say any more for now — I’d get it wrong anyway.
Update: Argh! I got the date wrong! It’s not Wednesday, September 21 as I originally wrote. It’s today, Thursday, September 22. Sorry for the confusion!