03 Nov 10

ETW: Event Tracing for Windows, part 4: collection

(See all of my ETW posts)
Enough theory. Let’s walk through examples of all this.

Providers

You can enumerate the available event providers either the old-school way:

  logman query providers

or the new way:

  xperf -providers I

(the ‘I’ is for ‘installed/known’). There is also

  xperf -providers R

(‘R’ for ‘registered’), which for me dumps out the same set of providers with less information and different formatting. Welcome to the joys of the Windows command line. The old-school logman tends to have saner output, but is mysteriously much slower when constructing the list. And finally, some people prefer

  wevtutil ep

(or you can expand ‘ep’ to ‘enum-publishers’), though that contains a bunch of other weird stuff that logman and xperf do not.

Each provider has a globally unique id (guid). The logman output and the xperf -providers I output show these together with the readable names.
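
If you just want to find one provider and its guid without wading through the whole list, you can pipe the output through findstr (just an illustrative combination, not anything the tools require; adjust the pattern to whatever you're after):

  logman query providers | findstr /i "Kernel-Process"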

Now let’s look at the flags and flag groups defined for the kernel provider:

  xperf -providers K

Notice, for instance, the kernel group ‘DiagEasy’, which maps to ‘PROC_THREAD+LOADER+DISK_IO+HARD_FAULTS+DPC+INTERRUPT+CSWITCH+PERF_COUNTER’. If you look at the earlier part of the output, you can map each of those to a descriptive string. I don’t know of a direct way to map them to their hex values, but I’ll give two untrustworthy workarounds. First, we’ll do it by starting a session using the kernel provider, then using logman to dump out its info:

  xperf -on DiagEasy
  logman query "NT Kernel Logger" -ets
  xperf -stop

In the “Provider:” section of the output, I see

  KeywordsAny: 0x237f (process,thread,img,proccntr,cswitch,dpc,isr,disk,file,hf)

which is very similar to the DiagEasy kernel group expansion if you squint a little. The second way is to first figure out what the provider’s name is. From the previous output, that seems to be “Windows Kernel Trace” (not “Circular Kernel Session Provider”, though that would work too.) Now we can dump out all keywords and their corresponding hex values in a simple format:

  logman query providers "Windows Kernel Trace"

and attempt to match up these keywords with the PROC_THREAD etc. from the DiagEasy kernel group. Note that if we used “Circular Kernel Session Provider” in the above command, we’d get exactly the same set of keywords. Hm.

You can use the same logman command to investigate other providers. It will often give some additional information, such as levels (another way to tag events for selection; defaults are things like Error, Warning, and Informational).
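
For example (any provider name from the earlier listing will do; this one is just an illustration):

  logman query providers Microsoft-Windows-Kernel-Process

On my machine, that prints a table of keywords and a table of levels, each with hex values and descriptions.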

I can’t figure out any way to get the xperf tool to display similar information, nor do I know how it stores the equivalence between its keyword-like things (“PROC_THREAD”) and logman’s understandable ones (“proc”=0x1,”thread”=0x2, with descriptive strings for each.)

Event Trace Sessions

Let’s say you have some sessions running. (And you do; Windows starts several up for you in its default configuration.) You can get a very basic listing with

  logman -ets

where it will refer to them as “Data Collector Sets”. (logman and perfmon talk about Data Collector Sets, which seem to be about the same as xperf’s event tracing sessions except that Data Collector Sets aren’t necessarily running; they can be lying in wait with Status=Stopped.) xperf, instead, refers to them as “Loggers” or more verbosely as “active logging sessions”, which seems a little more familiar. Run

  xperf -Loggers

and it will dump out each one, together with a whole bunch of information about each one’s configuration and state. Of particular interest is the “Trace Flags” field, which will give the xperf version of the flags/keyword bitmask. To experiment with this a bit, try running

  xperf -on DiagEasy
  xperf -Loggers "NT Kernel Logger"
  xperf -stop

You’ll see the Trace Flags match the DiagEasy kernel group you looked at before. You can also look at the Trace Flags for the “DiagLog” logger. It pulls data from a wide range of providers, requesting specific flags/keywords from each. You can use

  xperf -LoggersWithNames

to have it decode those settings even more.

Remember, one event tracing session (aka logger) can pull events from multiple providers, and we should be able to modify the set of providers at runtime. Let’s give that a try. First, start up a logger:

  xperf -start blogtest1 -on Microsoft-Windows-Kernel-Process:0x10

And ask what’s going on in three different ways:

  xperf -LoggersWithNames blogtest1
  logman -ets
  logman query blogtest1 -ets

xperf should tell you that you’re looking for the “WINEVENT_KEYWORD_PROCESS” keyword/flag from the “Microsoft-Windows-Kernel-Process” provider (the trailing 0xff in its output means all levels: info+warning+error+whatever). The first logman invocation should just show you that ‘blogtest1’ is indeed running. The second tells you something similar to what xperf did, though in a format that splits things out into three pieces: the logger itself, something I don’t understand that seems to be the logging engine associated with the logger (it knows the buffer configuration, output file, etc.), and the provider that the logger (session) is sucking events from. Now let’s add in another provider:

  logman update blogtest1 -p "Windows Kernel Trace" -ets

I can’t figure out how to do the equivalent with xperf. I know it has an -update option, but it spits in my face whenever I try to use it. (This is a common feature in the command-line and even the GUI tools. There are invisible restrictions of various sorts, and when you do something not allowed, you get either silent failure or a seemingly unrelated error message back.)

When we run the querying commands again, we see both providers listed. Now let’s shut down:

  xperf -stop blogtest1

Or, if you prefer,

  logman stop blogtest1 -ets

Kernel vs non-Kernel Events

One note: the above probably only works because both providers are really within the kernel. If you tried to update with an application-based provider, you’d get back the cryptic “Property value conflict.” If you started with an application-based provider and used logman to update with a kernel provider, it would act completely happy, but only capture events from the application-based provider. And yet, if you stopped it with

  xperf -stop blogtest1 -d output.etl

it would merge relevant portions of the kernel log with the application logs to produce output.etl, or in other words, it would Do The Right Thing behind your back.

Also: at least in Windows 7, there are a number of providers named “Microsoft-Windows-Kernel-Something”. These seem to be manifest-based views onto the real kernel provider. Or something. On the web, I’ve found them mentioned as “Crimson” aka “Unified Windows Eventing 6.0” providers. At any rate, they seem to allow you to pull in events from the kernel as if it were a regular application provider. You can do things like eg logman query providers Microsoft-Windows-Kernel-Process and get a sense for which events each handles, and eventually reverse-engineer the keyword groups like ‘DiagEasy’ using a mixture of several providers. I like the idea of relays that make the kernel look exactly like any other provider, but I don’t know if that’s what’s actually happening. I’m still waiting for responses to my questions about this on ServerFault and er… the Windows forum seems to have eaten my post.
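
To get a feel for how many of these kernel-flavored providers exist on your machine, you can filter the provider list (dumb substring matching, so expect a stray hit or two):

  logman query providers | findstr /i "Microsoft-Windows-Kernel"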

Predefined Tracing Configuration

Something I’ve been ignoring up until now is predefined tracing configuration, largely because it still baffles me. But it is a major part of all the tools.

XPerf seems to only talk about something called profiles, which are for “High-level ETW trace control” via “reusable, persisted configuration files (.xpp)”. logman has an entirely different mechanism, which seems to center around “Data Collector Sets” (or collectors for short). Collectors seem to be able to exist independently of any loggers actively using them. To see any dormant collectors you might have lying around, run logman with no options (not even the -ets that I’ve been throwing in everywhere without explaining.)

Data Collector Sets

XPerf’s profiles sound like something I’d want a GUI to generate, so let’s start with logman’s collectors:

  logman create trace blogtest1 -p Microsoft-Windows-Kernel-Process 0xffff

That creates a Data Collector Set named “blogtest1” with a specific provider (and keyword bitmask). Note that it is not running; it is just lying around somewhere ready for use. You can see it with

  logman

or dig into its details with

  logman query blogtest1

To make it not quite so trivial, let’s add some providers to it:

  logman update blogtest1 -p Microsoft-Windows-Kernel-Disk 0xffff
  logman update blogtest1 -p Microsoft-Windows-HttpService 0xffff

You probably want to query it again to see what a mess you’ve made. Now fire it up with

  logman start blogtest1

It now has status ‘Running’. The query output will tell you where it’s logging to: %systemdrive%\PerfLogs\Admin or more specifically in my case, C:\PerfLogs\Admin\blogtest1_000001.etl (I guess it really, really wants to be able to roll over from one log file to the next). Stop this running collector with

  logman stop blogtest1

So what’s the deal with the magic -ets argument? The logman -h output is mostly useless: “Send commands to Event Trace Sessions directly without saving or scheduling.” A technet article is much more illuminating: “Creates and starts an event trace session with the options specified on the command line”. By experimentation, it seems like this event trace session has no relationship to a Data Collector Set created without the -ets flag, even one with the same name. If you ran logman start blogtest1 -ets above, it would have created a trace session named ‘blogtest1’ with no providers. If you run logman (or equivalently, logman query), it shows the non-ets collectors. logman -ets or logman query -ets shows the with-ets collectors. xperf -Loggers only sees the with-ets collectors. I don’t understand the difference, but it seems to be fundamental and major.
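
If you want to see the two namespaces for yourself, here’s an illustrative sequence with a throwaway name (based on the behavior described above, so treat it as a sketch):

  logman create trace etstest -p Microsoft-Windows-Kernel-Process 0xffff
  logman start etstest -ets
  rem despite the identical name, these describe two different things
  logman query etstest
  logman query etstest -ets
  logman stop etstest -ets
  logman delete etstest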

The non-ets collectors look to be useful for automatically scheduling tracing by time of day and duration: see logman’s -b, -e, -r, and -rf options. I confess that I haven’t gotten those to work for me, though.
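
For the record, this is the shape of the incantation I’ve been trying, pieced together from logman -h (untested and quite possibly wrong, so treat it as a sketch rather than a recipe):

  rem supposedly: collect from 3am to 4am, repeating daily
  logman create trace nightly -p Microsoft-Windows-Kernel-Process 0xffff -b 11/05/2010 03:00:00 -e 11/05/2010 04:00:00 -r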

Multiple Providers

logman normally only allows you to specify one provider at a time. If you want to merge together the output from multiple providers, you have a few options:

  1. Use logman update to add the providers one at a time, as we did above
  2. Use the -pf option to point to a file containing providers, one per line (put the flags right after them, as you would on the -p command line); see the sketch after this list
  3. Use xperf instead of logman. XPerf lets you do -on provider1:0xff+provider2:0xffff+.... In fact, it gives you several other per-provider configuration options as well, ones that make sense in the manifest-based world.
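
Here’s a sketch of options 2 and 3 side by side. I’m inferring the -pf file format from logman’s description, so double-check it against logman -h; the provider names and flags are just examples:

  rem providers.txt (one provider per line, flags after the name):
  rem   Microsoft-Windows-Kernel-Process 0xffff
  rem   Microsoft-Windows-HttpService 0xffff

  logman create trace blogtest2 -pf providers.txt

  rem or do it all on one xperf command line:
  xperf -start blogtest3 -on Microsoft-Windows-Kernel-Process:0xffff+Microsoft-Windows-HttpService:0xffff
  xperf -stop blogtest3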

Profiles

I haven’t used XPerf’s profiles at all, so I really can’t say anything beyond what xperf -help profiles tells you.

Data Sources

Early on, I made the claim that there are different types of providers: performance counters, event traces, configuration settings, and alerts. That may be bad terminology: those four can definitely be referred to as things you can put into Data Collector Sets, but “provider” might only refer to event trace providers. I’m not sure.

It looks to me like the configuration settings have been folded into a regular provider in Windows Vista and later, at least judging from the output of

  logman query providers Microsoft-Windows-Kernel-Registry

It has keywords for all the different registry key operations. Note that you can see the same data in a more verbose and yuckier format with

  wevtutil gp Microsoft-Windows-Kernel-Registry

I have mostly been ignoring the wevtutil tool, but that’s because I haven’t found it to be useful for much. (I use it to register my custom manifest-based provider, but even then, nothing makes very good use of that registered information.)

Alerts don’t seem to be a source of data so much as something that a logger can emit. If you look at the logman -h output, you’ll see an example of creating an alert to fire off when too much ‘Processor Time’ is used, whatever that means. Probably the best place to learn about alerts is the perfmon help pages. I’ll ignore them.

Performance counters sound much more interesting. logman can be used to automate their creation and management, but performance counters depend heavily on knowing what counters are available, so I would recommend using perfmon to deal with them. It provides a UI for selecting from the many, many counters available. It doesn’t seem to provide much assistance for the expression language involved, but once again, I don’t know enough on this topic to be talking. For now, I’ll just point you to perfmon. Perhaps when I start actively using all of this stuff, I’ll have more to say.
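
That said, logman can drive counter collection from the command line without perfmon’s help. A minimal sketch (the counter path and the 5-second sample interval are just examples):

  logman create counter cputest -c "\Processor(_Total)\% Processor Time" -si 5
  logman start cputest
  rem ...let it sample for a while...
  logman stop cputest
  logman delete cputest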

Other Resources

There are a lot of people talking about little bits and pieces of ETW on the web. Most weren’t useful for whatever sort of audience I happen to fall into, but this technet article is a better help message for logman than logman’s own -h output, and the MS forum FAQ is somewhat helpful. The Pigs Can Fly blog has some useful bits in it, from an experienced user, such as this one about using the Windows sample profiler with XPerf.

Stack Traces

You’ve probably noticed, but I tend to stick to using logman rather than xperf for most things. I just like its syntax better; it seems a little more rational. But there is one really big reason why you’ll probably end up using xperf in actual practice: stack traces. Unfortunately, I don’t have enough material on that to make up a blog post yet, so I’m going to wait until I get some more experience in analyzing traces and seeing what needs to get into them before writing that part.

Who Cares? or: How to Use Collected Traces

I must apologize for crawling through all of this boring detail on the trace collection side. The only thing interesting about any of this is how you use the collected traces, and I haven’t said anything about that yet.

Well, sorry. That’s for a later article. Partly because this one is already far too long, and partly because I haven’t figured out enough yet to be worth wasting your time reading.


02 Nov 10

ETW: Event Tracing for Windows, part 3: architecture

(See all of my ETW posts)

Architorture

It’s time for an ETW architecture discussion. There is a reasonably good official description of the architecture, but it’s a bit long and hard to follow. Here’s my short and hard to follow version:

The Windows kernel, many Windows components, and a scattering of 3rd party libraries and applications all have a mechanism for reporting various events of interest. Heck, they use the same mechanism to report a ton of events of no interest at all. They are all said to serve as event providers. There are a couple of different types of providers: providers for performance counters (eg number of page faults), event traces, alerts, and configuration settings. I will ignore everything but event traces until the end of this article. Controllers enable and disable providers, and also associate one or more providers with each event trace session. Each session is associated with a logging engine. Finally, event consumers hook up to one or more event trace sessions to receive the events. And even more finally, you may end up with event trace logs (*.etl files) that store an archive of a set of events from some trace session. Sometimes these will be produced directly by a logging engine, and sometimes by an event consumer that may just write out the currently buffered events from a trace session, or might even merge together multiple sources of information to produce the final file.
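
To pin the vocabulary down, here’s how the pieces line up when using the command-line tools from the other posts (a sketch; the session name is made up):

  rem controller: start an event trace session and hook one provider into it
  logman start archdemo -p Microsoft-Windows-Kernel-Process 0x10 -o archdemo.etl -ets
  rem ...the logging engine buffers events and writes them out to archdemo.etl...
  logman stop archdemo -ets
  rem consumer: read the resulting event trace log and dump it out
  tracerpt archdemo.etl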

Note that while this is a nice happy high-level data model, reality is more constrained. For example, while the data model says that the same provider can provide event streams to multiple trace sessions, this is only true when using newer “manifest-based” providers (which I may or may not blog about in the future.) “Classic” providers will only provide data to a single session, and the kernel provider is a totally special case: it can only be hooked up to two specific predefined sessions, and you can’t mix any other providers with it at all. If you try to associate a classic provider with more than one session, bad stuff will happen. My various sources are inconsistent about what exactly happens. One says that the newer session will win and the older session will stop receiving events, another says that the activation will fail. Also, no more than eight sessions are allowed to receive data from a single manifest-based provider.

Controllers can activate providers at any time, even while the providing application is running. Controllers can also dynamically change the set of providers feeding an event session (so theoretically you could wait for a particular event to happen or a threshold to be reached, then add in events from the kernel provider for a while, then turn it off again later). The ETW overhead for a provider application is low when no active loggers are requesting information (not as low as eg dtrace, which compiles down to a no-op when not in use, but still only a handful of instructions). The standard built-in controller includes scheduling facilities, so eg you can configure a session to start up at 3am and gather events for an hour every day. I believe the command-line tools logman and xperf are both controllers, as is the perfmon GUI tool (also known, I think, as RPM, for Reliability and Performance Monitor.) logman and perfmon both seem to be old-school tools. xperf is newer (meaning manifest-aware and part of the Crimson aka Unified Windows Eventing world).

Logging engines are configurable with things like the number and size of memory buffers used, whether to be “real-time” (deliver events immediately to consumers) or log to a file, what type of memory to use for buffering, and more esoteric options. According to the Ntdebugging blog, kernel tracing can only use two specific loggers: a circular logger for detailed info on the last few seconds, and a regular logger that is used for longer-term monitoring.

Providers can produce a variety of different events, so there are mechanisms for grouping and categorizing them. Each type of event produced is associated with zero or more flags (confusingly also referred to as keywords). When starting up or updating a trace session, you can specify which flags you are interested in, and only matching events will be grabbed from the provider. (The filtering seems to happen at both ends — the provider will suppress events that nobody wants, and the session will suppress events that it doesn’t want.) In addition, there are some predefined groups of flags that can be used for convenience.
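
As a tiny concrete example (the same syntax used in the next article; 0x10 is the keyword bit that selects the process events from that provider):

  xperf -start keywordtest -on Microsoft-Windows-Kernel-Process:0x10
  xperf -stop keywordtest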

Consumers eat trace data and do something with it. A couple of consumers are built into Windows, to provide standard features: eg, the Event Viewer and the Performance Monitor. tracerpt and tracefmt are command-line tools for generating raw dumps of events from an .etl file. XPerfView is a graphical tool for displaying performance information from a trace. This article is not going to discuss event consumers, because they’re the most interesting part and I’d rather make you suffer.

Ok, there. That’s it. That’s my architectural description, omitting lots and lots of important details.

Terminology

The terminology used in the actual tools is inconsistent between the old and new, and even within individual tools. “Logger” and “Data Collector Set” particularly seem to mean multiple things. Also, some things seem to not quite fit the data model. There is a provider named “Circular Kernel Session Provider”. Huh? “Circular” would seem to be a description of a logging engine or logging engine parameter, and “Session” makes the name sound like it’s the provider for a kernel session, which is redundant since all providers are providers for sessions. If it were just “Kernel”, it would make sense.

But maybe that name is just fallout from the kernel provider’s special (limited) status.

Huh?

If all this seems too abstract, I’ll walk through examples of playing with this stuff in my next article.


01 Nov 10

ETW: Event Tracing for Windows, part 2: field reporting

(See all of my ETW posts)

One use of event tracing is gathering field reports: the user notices something going horribly wrong, and is willing to help out by providing useful information. ETW is very valuable for this use, since 90% of Firefox users are on Windows and ETW logging comes standard with Windows. So here’s a guide to submitting useful trace info. If you are reporting a performance-related bug, it would probably be useful if you could offer up a .etl file with your bug report. (But you probably shouldn’t directly attach it to the bug if it’s big, and it probably will be.)

I’m going to give the “universal” option first, which will work on Windows XP on up with no additional installs needed:

  logman start "NT Kernel Logger" -o browser.etl -p "Windows Kernel Trace" 0x237f -bs 1024 -nb 10 100 -ets
  run your test scenario...
  logman stop "NT Kernel Logger" -ets

This will produce a file “browser.etl” containing all kinds of system-level tracing data. If you’re uploading it somewhere, you might want to compress it first. These files compress pretty well. I’ll explain what all the wacky options mean in a later post.

The other option is to use xperf to control the tracing. It is part of the Windows Performance Toolkit (WPT), which itself ships with the Windows 7 SDK. You do not need to install the entire SDK to get WPT! (The SDK is pretty huge.) So if you are running Vista or Windows 7 (or later, presumably):

  1. Download and run the Windows 7 SDK installer (click on the Download button).
  2. Select just the Windows Performance Toolkit components
  3. Install!

Here’s the important part where you select just the component you need, stolen shamelessly from a very good blog post giving precise instructions for doing exactly this.

If you are running the 64-bit version of Vista or Windows 7, you’ll need to do one more thing to get full functionality: run

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

This will allow you to collect stack traces.

If you are running Windows XP, it’s a little more work. You can do it, but you will need to copy things from a machine with WPT already installed. Make sure you are copying from the same architecture (64-bit binaries won’t work on 32-bit Windows XP). Copy over just the files xperf.exe and perfctrl.dll into somewhere in your $PATH. I would link to a copy of those files if I knew it was legal. On your Windows 7 machine with the WPT installed, it will probably be somewhere like C:\Program Files\Microsoft Windows Performance Toolkit\xperf.exe and you can copy it to C:\Windows on your XP machine.

Now you can run it:

  xperf -on DiagEasy -stackwalk profile
  run your scenario...
  xperf -stop -d browser.etl

This will gather more or less the same information as the logman command above, except it’ll also grab stacktraces during the profile events. I think it may even cram some additional xperf-specific details into the trace log, which make it easier to analyze the file on another machine according to the ntdebugging guy(s?).

Either way, link to this .etl file in a bug report and it will be much easier to make sense of what’s going wrong. The xperf-produced .etl files are preferred, since they supposedly have more goodies in them, but if installing xperf is too much trouble then the logman-generated ones are still quite handy.

If you want to see what you’ve collected, run

  xperf browser.etl

I’ll talk about how to interpret what you see later. This article is all about capturing traces, not analyzing them.


So what are those commands doing? To really answer this question, I’d need to go into the details of the ETW architecture, which will be a followup blog post. So I’ll just give a short version for now:

The logman command starts up a tracing session aka logger using the predefined name “NT Kernel Logger”, and connects it up to receive events from the provider named “Windows Kernel Trace”. The 0x237f option is a bitmask of what flags aka keywords to include in the trace.

  logman query providers "Windows Kernel Trace"

shows the list of available keyword bits for that provider and the descriptions of what they cover. The trace session is logged to the file browser.etl, and when the session is stopped all final events are flushed to it. The buffer settings are also configured to something reasonable.

The xperf command starts up the same logger with the DiagEasy group of keywords, which turns out to be the same as the 0x237f from the logman command. The log file is not specified, so it will default to \kernel.etl but we’re not going to directly use that file this time. Then the xperf -d command stops the logger and merges its output into browser.etl. According to the help message, it also adds “image identification information required for safe symbol decoding”, which sounds like a good thing.
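
If you’re curious what that kernel session looks like while your scenario is running, you can poke at it from another console window; these commands only query, they don’t change anything:

  xperf -Loggers "NT Kernel Logger"
  logman query "NT Kernel Logger" -ets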


01 Nov 10

ETW: Event Tracing for Windows, Part 1: Intro

(See all of my ETW posts)

A while back, Andreas Gal asked me to add some ETW (Event Tracing for Windows) functionality into SpiderMonkey, the Mozilla Javascript engine. Though I’m anything but a Windows guy, it seemed straightforward enough. And actually, implementing it wasn’t too bad. Figuring out what the heck I had just implemented, how it worked, and how to use it is another story. Or stories. I will be writing a series of blog posts about the various aspects I have discovered.

Note: you may want to skip ahead to later parts in this series. This one is going to be high-level and whiny. Later installments should be more technical and HOWTOish.

First, some context. ETW (Event Tracing for Windows) was introduced in Windows 2000 as a single API for handling a grab bag of tasks, which all sound about the same but turn out to require wildly different implementations. The “supported” tasks include:

  • a developer (of an application, the kernel, or a driver) inserting logging statements for personal use
  • alerts for administrators
  • low-overhead offline performance profiling (start up your scenario, generate massive log files, stop the logging, chew through them and produce reports and visualizations)
  • realtime performance monitoring and analysis (watch the pretty CPU usage graph go up and down, or “Augh! Nothing is responding! Wtf is my stupid computer doing right now?!”)
  • using a separate monitoring machine to suck down data from a different machine to do any of the above
  • tracking end-to-end performance of requests traversing multiple applications and servers
  • tracing the exact behavior (files opened, disk offset patterns, …) of some application
  • capturing problem reports from the field
  • doing everything from either command line or a GUI

which is a lot to stuff into one architecture. The result is inconsistent and tangled, but more or less functional. (More for some uses, less for others.)

The GUIs are rather limited: some parts are just silly and don’t need to exist, other parts have really great stuff that is hard to find. They feel like early prototypes. The connection between what you’re capturing and what you end up seeing is cryptic. The GUIs I have discovered so far for viewing trace results are Xperfview, Event Viewer, Sawbuck (a Google project), and RPM (Reliability and Performance Monitor aka Perfmon). There are also GUIs for configuring trace capture.

The command line tools are a mess. Multiple generations are hanging around. You are advised to use only the latest stuff (xperf), but the latest stuff requires a separate install and so is much less useful for field report capturing (hint: if you can give people something to cut & paste, they’ll grumble but do it and get it right. If you have to walk them through a GUI, they’ll be happy but probably get something wrong. If you tell them to go install something and then do either of the above, you won’t hear back from any but the most motivated.) The different generations have largely overlapping functionality, but completely different syntax and, most confusingly, completely different terminology.

On the other hand, the underlying implementation of the trace recording seems nice. It can log straight to files, log into buffers, do circular logging, and trade off overhead/completeness/reliability (eg you can capture to per-CPU buffers to avoid locking overhead and thus miss fewer events, but things may get out of order). It has a bunch of tuning knobs that semantically make sense. The GUIs and command-line tools for controlling them, on the other hand, do not. You have great flexibility in dynamically enabling and disabling providers, updating parameters, starting/stopping/flushing logging sessions, associating providers with sessions, etc.

There is decent support for implementing your own providers, which is what I originally set out to do. It feels a lot more complex than it really ought to be, but all in all it really isn’t that bad. Well, ok, I say that but I still haven’t completely figured out how to use all of the bits that are generated from my ETW manifest. But that’s more of a tooling issue — the different tools use totally different mechanisms for making use of the metadata in the ETW manifest, and some tools just can’t do anything with it. (That unfortunately includes the latest greatest visualization tool, xperfview.)

Navigating through the maze of tools, options, and documentation is insanely hard from my naive point of view. Is this how things are in the Windows ecosystem? I’ve read through dozens of whitepapers, official-ish manuals, blog posts, forum threads, etc., and I feel like each one I read brushes the dirt off of one more spot in the picture, but piles more dirt back on the rest. Nothing is complete or even gives a complete high-level overview. There are several that try to, some of them very well-written and polished, but they still seem to have a lot of wishful thinking: “if you just switch over to this way of doing things, then it all works like this, so be happy” not mentioning that their chosen way of doing things is incomplete and you’ll end up digging through the older tools and getting completely lost as you try to accomplish the one last little piece you need. There is little user-facing consistency.

Which is not to say that I’m going to do any better, but I’ll try to stick to some very constrained usage scenarios.