I have used a variety of profiling tools over the years, including several I wrote myself.
But there is one profiling tool I have used more than any other. It is capable of providing invaluable, domain-specific profiling data of a kind not obtainable by any general-purpose profiler.
It’s a simple text processor implemented in a few dozen lines of code. I use it in combination with logging print statements in the programs I am profiling. No joke.
The tool is called counts, and it tallies line frequencies within text files, like an improved version of the Unix command chain sort | uniq -c. For example, given the following input.
    a 1
    b 2
    b 2
    c 3
    c 3
    c 3
    d 4
    d 4
    d 4
    d 4
counts produces the following output.
    10 counts:
    (  1)        4 (40.0%, 40.0%): d 4
    (  2)        3 (30.0%, 70.0%): c 3
    (  3)        2 (20.0%, 90.0%): b 2
    (  4)        1 (10.0%,100.0%): a 1
It gives a total line count, and shows all the unique lines, ordered by frequency, with individual and cumulative percentages.
Alternatively, when invoked with the -w flag, it assigns each line a weight, determined by the last integer that appears on the line (or 1 if there is no such integer). On the same input, counts -w produces the following output.
    30 counts:
    (  1)       16 (53.3%, 53.3%): d 4
    (  2)        9 (30.0%, 83.3%): c 3
    (  3)        4 (13.3%, 96.7%): b 2
    (  4)        1 ( 3.3%,100.0%): a 1
The total and per-line counts are now weighted; the output incorporates both frequency and a measure of magnitude.
That’s it. That’s all counts does. I originally implemented it in 48 lines of Perl, then later rewrote it as 48 lines of Python, and then later again rewrote it as 71 lines of Rust.
In terms of benefit-to-effort ratio, it is by far the best code I have ever written.
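The behaviour described above is small enough to sketch in a few lines of Python. This is not the author's implementation, just a minimal illustration under stated assumptions: the function names weight_of and counts are made up here, a "weight" is taken to be the last run of decimal digits on the line, ties are broken alphabetically, and the column widths are a guess at the output layout.

```python
import re
from collections import Counter


def weight_of(line):
    """Return the last run of digits on the line as an int, or 1 if none."""
    numbers = re.findall(r"\d+", line)
    return int(numbers[-1]) if numbers else 1


def counts(lines, weighted=False):
    """Tally identical lines and return the report as a list of strings."""
    tally = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Unweighted: each occurrence adds 1. Weighted: it adds its weight.
        tally[line] += weight_of(line) if weighted else 1

    total = sum(tally.values())
    report = [f"{total} counts:"]
    cumulative = 0
    # Highest tally first; ties broken alphabetically (an assumption).
    ranked = sorted(tally.items(), key=lambda item: (-item[1], item[0]))
    for rank, (line, n) in enumerate(ranked, start=1):
        cumulative += n
        report.append(
            f"({rank:3d}) {n:8d} ({n * 100 / total:4.1f}%,"
            f"{cumulative * 100 / total:5.1f}%): {line}"
        )
    return report


# The ten-line sample input from earlier in the post.
sample = ["a 1"] + ["b 2"] * 2 + ["c 3"] * 3 + ["d 4"] * 4
print("\n".join(counts(sample)))
print("\n".join(counts(sample, weighted=True)))
```

On the sample input this gives totals of 10 and 30 respectively, with the same ordering and percentages as the outputs shown earlier.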
counts in action
As an example, I added print statements to Firefox’s heap allocator so it prints a line for every allocation that shows its category, requested size, and actual size. A short run of Firefox with this instrumentation produced a 77 MB file containing 5.27 million lines.
counts produced the following output for this file.
    5270459 counts:
    (  1)   576937 (10.9%, 10.9%): small 32 (32)
    (  2)   546618 (10.4%, 21.3%): small 24 (32)
    (  3)   492358 ( 9.3%, 30.7%): small 64 (64)
    (  4)   321517 ( 6.1%, 36.8%): small 16 (16)
    (  5)   288327 ( 5.5%, 42.2%): small 128 (128)
    (  6)   251023 ( 4.8%, 47.0%): small 512 (512)
    (  7)   191818 ( 3.6%, 50.6%): small 48 (48)
    (  8)   164846 ( 3.1%, 53.8%): small 256 (256)
    (  9)   162634 ( 3.1%, 56.8%): small 8 (8)
    ( 10)   146220 ( 2.8%, 59.6%): small 40 (48)
    ( 11)   111528 ( 2.1%, 61.7%): small 72 (80)
    ( 12)    94332 ( 1.8%, 63.5%): small 4 (8)
    ( 13)    91727 ( 1.7%, 65.3%): small 56 (64)
    ( 14)    78092 ( 1.5%, 66.7%): small 168 (176)
    ( 15)    64829 ( 1.2%, 68.0%): small 96 (96)
    ( 16)    60394 ( 1.1%, 69.1%): small 88 (96)
    ( 17)    58414 ( 1.1%, 70.2%): small 80 (80)
    ( 18)    53193 ( 1.0%, 71.2%): large 4096 (4096)
    ( 19)    51623 ( 1.0%, 72.2%): small 1024 (1024)
    ( 20)    45979 ( 0.9%, 73.1%): small 2048 (2048)
Unsurprisingly, small allocations dominate. But what happens if we weight each entry by its size?
counts -w produced the following output.
    2554515775 counts:
    (  1) 501481472 (19.6%, 19.6%): large 32768 (32768)
    (  2) 217878528 ( 8.5%, 28.2%): large 4096 (4096)
    (  3) 156762112 ( 6.1%, 34.3%): large 65536 (65536)
    (  4) 133554176 ( 5.2%, 39.5%): large 8192 (8192)
    (  5) 128523776 ( 5.0%, 44.6%): small 512 (512)
    (  6)  96550912 ( 3.8%, 48.3%): large 3072 (4096)
    (  7)  94164992 ( 3.7%, 52.0%): small 2048 (2048)
    (  8)  52861952 ( 2.1%, 54.1%): small 1024 (1024)
    (  9)  44564480 ( 1.7%, 55.8%): large 262144 (262144)
    ( 10)  42200576 ( 1.7%, 57.5%): small 256 (256)
    ( 11)  41926656 ( 1.6%, 59.1%): large 16384 (16384)
    ( 12)  39976960 ( 1.6%, 60.7%): large 131072 (131072)
    ( 13)  38928384 ( 1.5%, 62.2%): huge 4864000 (4866048)
    ( 14)  37748736 ( 1.5%, 63.7%): huge 2097152 (2097152)
    ( 15)  36905856 ( 1.4%, 65.1%): small 128 (128)
    ( 16)  31510912 ( 1.2%, 66.4%): small 64 (64)
    ( 17)  24805376 ( 1.0%, 67.3%): huge 3097600 (3100672)
    ( 18)  23068672 ( 0.9%, 68.2%): huge 1048576 (1048576)
    ( 19)  22020096 ( 0.9%, 69.1%): large 524288 (524288)
    ( 20)  18980864 ( 0.7%, 69.9%): large 5432 (8192)
This shows that the cumulative count of allocated bytes (2.55 GB) is dominated by a mixture of larger allocation sizes.
This example gives just a taste of what counts can do.
(An aside: in both cases it’s good to see there isn’t much slop, i.e. the difference between the requested sizes and actual sizes is mostly 0. That 5432 entry at the bottom of the second table is curious, though.)
This technique is often useful when you already know something — e.g. a general-purpose profiler showed that a particular function is hot — but you want to know more.
- Exactly how many times are paths X, Y and Z executed? For example, how often do lookups succeed or fail in data structure D? Print an identifying string each time a path is hit.
- How many times does loop L iterate? What does the loop count distribution look like? Is it executed frequently with a low loop count, or infrequently with a high loop count, or a mix? Print the iteration count before or after the loop.
- How many elements are typically in hash table H at this code location? Few? Many? A mixture? Print the element count.
- What are the contents of vector V at this code location? Print the contents.
- How many bytes of memory are used by data structure D at this code location? Print the byte size.
- Which call sites of function F are the hot ones? Print an identifying string at the call site.
Then use counts to aggregate the data. Often this domain-specific data is critical to fully optimize hot code.
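As an illustration of the first two questions in the list above, here is a small Python sketch of what the instrumentation might look like. Everything in it is hypothetical: log, lookup and drain are stand-ins for whatever code is being profiled, not functions from any real program.

```python
import sys


def log(message):
    """Write one profiling line to stderr, flushed immediately."""
    print(message, file=sys.stderr, flush=True)


def lookup(table, key):
    """Look up a key, logging which path was taken."""
    if key in table:
        log("lookup: hit")
        return table[key]
    log("lookup: miss")
    return None


def drain(queue):
    """Empty a work queue, logging how many times the loop iterated."""
    iterations = 0
    while queue:
        queue.pop()
        iterations += 1
    # The trailing integer lets a weighted tally sum total iterations.
    log(f"drain: iterations {iterations}")
    return iterations
```

An unweighted tally of the resulting log shows how often each path was hit and what the iteration-count distribution looks like. A weighted tally of the drain lines shows which iteration counts account for most of the total work.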
Worse is better
Print statements are an admittedly crude way to get this kind of information, profligate with I/O and disk space. In many cases you could do it in a way that uses machine resources much more efficiently, e.g. by creating a small table data structure in the code to track frequencies, and then printing that table at program termination.
But that would require:
- writing the custom table (collection and printing);
- deciding where to define the table;
- possibly exposing the table to multiple modules;
- deciding where to initialize the table; and
- deciding where to print the contents of the table.
That is a pain, especially in a large program you don’t fully understand.
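For comparison, the table-based alternative might look something like the following Python sketch. The names FREQUENCIES, record and dump_frequencies are invented for illustration, and a real program in C++ or Rust would need noticeably more plumbing than this.

```python
import atexit
import sys
from collections import Counter

# Hypothetical shared table; every instrumented module must be able to see it.
FREQUENCIES = Counter()


def record(event):
    """Bump the tally for one event instead of printing a line."""
    FREQUENCIES[event] += 1


def dump_frequencies():
    """Print the table to stderr, most frequent event first."""
    for event, n in FREQUENCIES.most_common():
        print(f"{n:8d} {event}", file=sys.stderr)


# atexit handlers run on normal interpreter exit only; if the process is
# killed by an unhandled signal, the table is never printed.
atexit.register(dump_frequencies)
```

Even in this compact form, each decision in the list above shows up: where the table lives, who can reach it, and when it gets printed.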
Alternatively, sometimes you want information that a general-purpose profiler could give you, but running that profiler on your program is a hassle because the program you want to profile is actually layered under something else, and setting things up properly takes effort.
In contrast, inserting print statements is trivial. Any measurement can be set up in no time at all. (Recompiling is often the slowest part of the process.) This encourages experimentation. You can also kill a running program at any point with no loss of profiling data.
Don’t feel guilty about wasting machine resources; this is temporary code. You might sometimes end up with output files that are gigabytes in size. But counts is fast because it’s so simple… and the Rust version is 3–4x faster than the Python version, which is nice. Let the machine do the work for you. (It does help if you have a machine with an SSD.)
Ad Hoc Profiling
For a long time I have, in my own mind, used the term ad hoc profiling to describe this combination of logging print statements and frequency-based post-processing. Wikipedia defines “ad hoc” as follows.
In English, it generally signifies a solution designed for a specific problem or task, non-generalizable, and not intended to be able to be adapted to other purposes
The process of writing custom code to collect this kind of profiling data — in the manner I disparaged in the previous section — truly matches this definition of “ad hoc”.
counts is valuable specifically because it makes this type of custom profiling less ad hoc and more repeatable. I should arguably call it “generalized ad hoc profiling” or “not so ad hoc profiling”… but those names don’t have quite the same ring to them.
Use unbuffered output for the print statements. In C and C++ code, use fprintf(stderr, ...). In Rust code use eprintln!. (Update: Rust 1.32 added the dbg! macro, which also works well.)
Redirect the stderr output to a file, e.g. firefox 2> log.
Sometimes programs print other lines of output to stderr that should be ignored by counts. (Especially if they include integer IDs that counts -w would interpret as weights!) Prepend all logging lines with a short identifier, and then use grep $ID log | counts to ignore the other lines. If you use more than one prefix, you can grep for each prefix individually or all together.
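The filtering step can also be sketched in Python, for cases where the tally is being done in a script rather than a shell pipeline. The prefix "PROF: " and the function name profiling_lines are assumptions made for this example. Unlike grep, which matches the identifier anywhere on the line, this version only accepts it at the start of the line and strips it off.

```python
PREFIX = "PROF: "  # hypothetical identifier; any short, unique string works


def profiling_lines(lines, prefix=PREFIX):
    """Yield only the lines that start with the prefix, with it removed."""
    for line in lines:
        if line.startswith(prefix):
            yield line[len(prefix):].rstrip("\n")


# A made-up log mixing profiling lines with unrelated stderr output.
log = [
    "PROF: small 24 (32)\n",
    "[pid 4312] warning: something unrelated\n",
    "PROF: small 24 (32)\n",
    "PROF: large 4096 (4096)\n",
]
kept = list(profiling_lines(log))
```

Here the unrelated warning line is dropped, so its integer 4312 can never be mistaken for a weight.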
Occasionally output lines get munged together when multiple print statements are present. Because there are typically many lines of output, having a few garbage ones almost never matters.
It’s often useful to use both counts and counts -w on the same log file; each one gives different insights into the data.
To find which call sites of a function are hot, you can instrument the call sites directly. But it’s easy to miss one, and the same print statements need to be repeated multiple times. An alternative is to add an extra string or integer argument to the function, pass in a unique value from each call site, and then print that value within the function.
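A minimal Python sketch of the extra-argument approach follows. The function intern_string stands in for "function F", and parse and resolve are two made-up callers; none of these names come from a real codebase.

```python
import sys


def intern_string(pool, text, site):
    """Hypothetical function F; `site` identifies the caller."""
    # One print statement here covers every call site.
    print(f"intern_string: site {site}", file=sys.stderr, flush=True)
    return pool.setdefault(text, text)


def parse(pool, tokens):
    """First made-up caller, tagged "parse"."""
    return [intern_string(pool, t, "parse") for t in tokens]


def resolve(pool, name):
    """Second made-up caller, tagged "resolve"."""
    return intern_string(pool, name, "resolve")
```

Because the extra parameter is required, any call site that was overlooked fails immediately rather than silently going uncounted.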
It’s occasionally useful to look at the raw logs as well as the output of counts, because the sequence of output lines can be informative. For example, I recently diagnosed an occurrence of quadratic behaviour in the Rust compiler by seeing that a loop iterated 1, 2, 3, …, 9000+ times.
counts is available here.
I use counts to do ad hoc profiling all the time. It’s the first tool I reach for any time I have a question about code execution patterns. I have used it extensively for every bout of major performance work I have done in the past few years, as well as in plenty of other circumstances. I even built direct support for it into rustc-perf, the Rust compiler’s benchmark suite, via the profile eprintln subcommand. Give it a try!