ETW: Event Tracing for Windows, Part 1: Intro

(See all of my ETW posts)

A while back, Andreas Gal asked me to add some ETW (Event Tracing for Windows) functionality to SpiderMonkey, the Mozilla JavaScript engine. Though I’m anything but a Windows guy, it seemed straightforward enough. And actually, implementing it wasn’t too bad. Figuring out what the heck I had just implemented, how it worked, and how to use it is another story. Or stories. I will be writing a series of blog posts about the various aspects I have discovered.

Note: you may want to skip ahead to later parts in this series. This one is going to be high-level and whiny. Later installments should be more technical and HOWTOish.

First, some context. ETW was introduced in Windows 2000 as a single API for handling a grab bag of tasks that all sound about the same but turn out to require wildly different implementations. The “supported” tasks include:

  • a developer (of an application, the kernel, or a driver) inserting logging statements for personal use
  • alerts for administrators
  • low-overhead offline performance profiling (start up your scenario, generate massive log files, stop the logging, chew through them and produce reports and visualizations)
  • realtime performance monitoring and analysis (watch the pretty CPU usage graph go up and down, or “Augh! Nothing is responding! Wtf is my stupid computer doing right now?!”)
  • using a separate monitoring machine to suck down data from a different machine to do any of the above
  • tracking end-to-end performance of requests traversing multiple applications and servers
  • tracing the exact behavior (files opened, disk offset patterns, …) of some application
  • capturing problem reports from the field
  • doing everything from either the command line or a GUI

which is a lot to stuff into one architecture. The result is inconsistent and tangled, but more or less functional. (More for some uses, less for others.)

The GUIs are rather limited: some parts are just silly and don’t need to exist, other parts have really great stuff but they’re hard to find. They feel like early prototypes. The connection between what you’re capturing and what you end up seeing is cryptic. The GUIs I have discovered so far for viewing trace results are Xperfview, Event Viewer, Sawbuck (a Google project), and RPM (Reliability and Performance Monitor aka Perfmon). There are also GUIs for configuring trace capture.

The command line tools are a mess. Multiple generations are hanging around. You are advised to use only the latest stuff (xperf), but the latest stuff requires a separate install and so is much less useful for field report capturing. (Hint: if you can give people something to cut & paste, they’ll grumble but do it and get it right. If you have to walk them through a GUI, they’ll be happy but probably get something wrong. If you tell them to go install something and then do either of the above, you won’t hear back from any but the most motivated.) The different generations have largely overlapping functionality but completely different syntax and, most confusingly, terminology.

On the other hand, the underlying implementation of the trace recording seems nice. It can log straight to files, log into buffers, do circular logging, and trade off overhead/completeness/reliability (e.g., you can capture to per-CPU buffers to avoid locking overhead and thus miss fewer events, but things may get out of order). It has a bunch of tuning knobs that semantically make sense. The GUIs and command-line tools for controlling them, on the other hand, do not. You have great flexibility in dynamically enabling and disabling providers, updating parameters, starting/stopping/flushing logging sessions, associating providers with sessions, etc.
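To make the per-CPU and circular trade-offs concrete, here is a toy Python model. It is only a sketch of the general idea, not how ETW is actually implemented: the class name, the shared counter standing in for a timestamp source, and the tiny capacities are all made up for illustration.

```python
import heapq
from collections import deque
from itertools import count


class PerCpuTrace:
    """Toy model: one fixed-size circular buffer per CPU, merged on read."""

    def __init__(self, cpus, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which models the "circular logging" trade-off.
        self.buffers = [deque(maxlen=capacity) for _ in range(cpus)]
        # Hypothetical stand-in for a timestamp source (a real system
        # would read a hardware clock rather than share a counter).
        self.clock = count()

    def log(self, cpu, message):
        # Each CPU appends only to its own buffer, so the buffers
        # themselves never need a cross-CPU lock.
        self.buffers[cpu].append((next(self.clock), cpu, message))

    def raw(self):
        # Dumping buffer by buffer loses the global ordering.
        return [event for buf in self.buffers for event in buf]

    def merged(self):
        # Each buffer is already sorted by timestamp, so a k-way merge
        # restores the global order.
        return list(heapq.merge(*self.buffers))


trace = PerCpuTrace(cpus=2, capacity=2)
trace.log(0, "a")
trace.log(1, "b")
trace.log(0, "c")
trace.log(0, "d")  # CPU 0's buffer is full, so "a" is overwritten

raw_messages = [message for _, _, message in trace.raw()]
merged_messages = [message for _, _, message in trace.merged()]
```

In this model the raw dump comes out grouped by CPU rather than by time, the merge by timestamp puts it back in order, and the oldest event on the busy CPU is simply gone. That is the overhead/completeness/ordering trade-off in miniature.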

There is decent support for implementing your own providers, which is what I originally set out to do. It feels a lot more complex than it really ought to be, but all in all it really isn’t that bad. Well, OK, I say that but I still haven’t completely figured out how to use all of the bits that are generated from my ETW manifest. But that’s more of a tooling issue: the different tools use totally different mechanisms for making use of the metadata in the ETW manifest, and some tools just can’t do anything with it. (That unfortunately includes the latest, greatest visualization tool, xperfview.)

Navigating through the maze of tools, options, and documentation is insanely hard from my naive point of view. Is this how things are in the Windows ecosystem? I’ve read through dozens of whitepapers, official-ish manuals, blog posts, forum threads, etc., and I feel like each one I read brushes the dirt off of one more spot in the picture, but piles more dirt back on the rest. Nothing is complete or even gives a complete high-level overview. There are several that try to, some of them very well-written and polished, but they still seem to involve a lot of wishful thinking: “if you just switch over to this way of doing things, then it all works like this, so be happy.” What they don’t mention is that their chosen way of doing things is incomplete, and you’ll end up digging through the older tools and getting completely lost as you try to accomplish the one last little piece you need. There is little user-facing consistency.

Which is not to say that I’m going to do any better, but I’ll try to stick to some very constrained usage scenarios.
