Now that I’m working for Mozilla, one of my goals is to make Valgrind easier to use on big programs like Firefox. One feature of such programs is that often they create multiple processes. For example, when I invoke ‘firefox’ on my Linux box, I’m really running three programs. /usr/bin/firefox is a start-up shell script. It uses /usr/bin/basename as part of its preprocessing, and then invokes /usr/lib/firefox-3.0.9/firefox, which is the real firefox, via ‘exec’. (And a program like Google Chrome would be much worse, having one process per tab.)
In this post I’m going to make several suggestions for improvements to Valgrind that would make it easier to use with multi-process programs. I’d love to hear feedback from Valgrind users about these suggestions.
Proposal 1: trace child processes by default
If you run “valgrind firefox”, you get this output on the terminal:
==9045== Memcheck, a memory error detector.
==9045== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==9045== Using LibVEX rev 1888, a library for dynamic binary translation.
==9045== Copyright (C) 2004-2009, and GNU GPL'd, by OpenWorks LLP.
==9045== Using valgrind-3.5.0.SVN, a dynamic binary instrumentation framework.
==9045== Copyright (C) 2000-2009, and GNU GPL'd, by Julian Seward et al.
==9045== For more details, rerun with: -v
==9045==
and then Firefox starts up suspiciously quickly. Where’s that slow-down due to Valgrind? And where are the error messages from Valgrind? As it happens, by default Valgrind doesn’t trace into any child processes spawned by the program it’s tracing. So Valgrind is tracing /usr/bin/firefox, but /usr/bin/basename and /usr/lib/firefox-3.0.9/firefox are run natively.
In order to trace into child processes, you have to use the --trace-children=yes option; then it’ll do what you want.
But I think that not tracing by default is a bad idea. First of all, it’s quite unclear what’s happening, especially if you don’t understand Valgrind’s behaviour. We even have an entry in the FAQ about this. (In contrast, if we traced by default and you didn’t want that behaviour, the fact that you’d get one Valgrind start-up message per process makes it clearer what’s happening.)
Furthermore, in my experience, --trace-children=no is almost never what you want. And it’s easy to forget --trace-children=yes; I do it all the time.
So I think tracing into children should be the default. Others may disagree, so it would be useful to know if Valgrind users have an opinion on this.
Proposal 2: show what command is being run
If I invoke “valgrind --trace-children=yes firefox”, let it load the default page, and then quit, I get this output (eliding some of the startup/shutdown messages for brevity):
==9658== Memcheck, a memory error detector.
==9658== ...
==9658==
==9659== Memcheck, a memory error detector.
==9659== ...
==9659==
==9659==
==9659== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 1)
==9659== ...
==9658== Memcheck, a memory error detector.
==9658== ...
==9658== Syscall param write(buf) points to uninitialised byte(s)
==9658== at 0x4E38E90: __write_nocancel (in /lib/libpthread-2.8.90.so)
==9658== by 0xE55DEFE: ??? (in /usr/lib/libICE.so.6.3.0)
==9658== ...
==9658== Address 0x5e91964 is 12 bytes inside a block of size 1,024 alloc'd
==9658== at 0x4C24724: calloc (vg_replace_malloc.c:368)
==9658== by 0xE55A373: IceOpenConnection (in /usr/lib/libICE.so.6.3.0)
==9658== ...
==9658==
==9658== Syscall param write(buf) points to uninitialised byte(s)
==9658== at 0x4E38ECB: ??? (in /lib/libpthread-2.8.90.so)
==9658== by 0x7E00876: ??? (in /usr/lib/libsqlite3.so.0.8.6)
==9658== ...
==9658== Address 0x15dcfefc is 36 bytes inside a block of size 4,104 alloc'd
==9658== at 0x4C2694E: malloc (vg_replace_malloc.c:178)
==9658== by 0x7DE9CF7: sqlite3_malloc (in /usr/lib/libsqlite3.so.0.8.6)
==9658== ...
==9658==
==9658== Syscall param write(buf) points to uninitialised byte(s)
==9658== at 0x4E38ECB: ??? (in /lib/libpthread-2.8.90.so)
==9658== by 0x7E00876: ??? (in /usr/lib/libsqlite3.so.0.8.6)
==9658== by ...
==9658== Address 0x15dcfefc is 36 bytes inside a block of size 4,104 alloc'd
==9658== at 0x4C2694E: malloc (vg_replace_malloc.c:178)
==9658== by 0x7DE9CF7: sqlite3_malloc (in /usr/lib/libsqlite3.so.0.8.6)
==9658== ...
==9658==
==9658== ERROR SUMMARY: 19 errors from 3 contexts (suppressed: 343 from 3)
==9658== ...
We have three Memcheck start-up messages, two Memcheck shut-down messages, and two PIDs. What’s going on? The first start-up message (PID 9658) is for /usr/bin/firefox. The second (PID 9659) is for /usr/bin/basename. The third start-up message is for /usr/lib/firefox-3.0.9/firefox; the PID 9658 is reused because /usr/lib/firefox-3.0.9/firefox is invoked with ‘exec’, which reuses the same process — this also explains why there are only two shut-down messages.
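The PID reuse across ‘exec’ is easy to check outside Valgrind. Here is a minimal sketch, assuming a POSIX system and a Python interpreter; the helper name is made up for illustration. A child process prints its PID, replaces itself via exec, and the new program prints its PID again.

```python
import subprocess
import sys

# Script run in a child process: print our PID, then replace the process
# image via exec; the new program prints its own PID.
SCRIPT = (
    "import os, sys\n"
    "print(os.getpid(), flush=True)\n"
    "os.execv(sys.executable,\n"
    "         [sys.executable, '-c', 'import os; print(os.getpid())'])\n"
)


def pids_before_and_after_exec():
    """Return the PID seen before exec and the PID seen after it."""
    out = subprocess.run(
        [sys.executable, "-c", SCRIPT],
        capture_output=True, text=True, check=True,
    ).stdout
    before, after = out.split()
    return int(before), int(after)


if __name__ == "__main__":
    before, after = pids_before_and_after_exec()
    print("same PID across exec:", before == after)
```

On Linux the two numbers should be equal, which is why the third start-up message above carries the same PID as the first.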
But working this out isn’t easy. In fact, I cheated, by also using the -v option. This makes Valgrind produce verbose output, and one of the things this includes is the command being executed. Without that I would have had a much harder time understanding what happened. But -v produces lots of extra stuff that is rarely interesting, so it’s not a good solution.
So my second proposal is to always print the invoked command as part of the Valgrind start-up message, like this:
==9045== Memcheck, a memory error detector.
==9045== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==9045== Using LibVEX rev 1888, a library for dynamic binary translation.
==9045== Copyright (C) 2004-2009, and GNU GPL'd, by OpenWorks LLP.
==9045== Using valgrind-3.5.0.SVN, a dynamic binary instrumentation framework.
==9045== Copyright (C) 2000-2009, and GNU GPL'd, by Julian Seward et al.
==9045== Running: /usr/bin/firefox
==9045== For more details, rerun with: -v
==9045==
(We could possibly move the “For more details” message to shut-down, where other, similar messages are shown.)
In this case the command has no arguments, but they would be shown if present.
This change would make the output of running Valgrind on multi-process programs much easier to understand.
Another possibility is to also show the parent process’s command, but that is probably overkill.
Proposal 3: control child tracing via black-listing
Currently you either trace all child processes, or none of them. This is crude. It would often be useful to be able to trace some of them.
An obvious way to do this is with a black-list. You would specify a list of patterns; any process whose command matches an entry on the list would not be traced, and everything else would be traced. Valgrind already has support for patterns containing shell-style ‘*’ and ‘?’ wildcards, so that would be an obvious choice to use.
Some examples:
# Matches nothing, ie. traces all children. (Single quotes are necessary to
# protect most patterns from shell interference.)
--trace-blacklist=''
# Matches everything, ie. traces no children.
--trace-blacklist='*'
# Skips all /usr/bin/python subprocesses.
--trace-blacklist='/usr/bin/python *'
# Skips all /usr/bin/python subprocesses invoked with -v.
--trace-blacklist='/usr/bin/python *-v*'
# Matches nothing, ie. traces all children. It looks like it might match
# all commands containing the substring "python", but it does not, because
# patterns must match the entire command, not just part of it.
--trace-blacklist='python'
# Skips all /usr/bin/python and /usr/bin/perl subprocesses; multiple
# blacklist options are combined, and any process matching any of the
# blacklist entries is blacklisted.
--trace-blacklist='/usr/bin/python *' --trace-blacklist='/usr/bin/perl *'
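To make the matching rule concrete, here is a minimal sketch of it. This is not Valgrind’s code: the function names are made up, and it assumes a command is the program path and its arguments joined by single spaces. A pattern must match the whole command, ‘*’ matches any run of characters, and ‘?’ matches exactly one.

```python
import re


def wildcard_match(pattern, command):
    """Whole-string match: '*' is any run of characters, '?' is any one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, command, flags=re.DOTALL) is not None


def is_blacklisted(command, blacklist):
    """A command is skipped if it matches any blacklist entry."""
    return any(wildcard_match(p, command) for p in blacklist)


if __name__ == "__main__":
    cmd = "/usr/bin/python foo.py"
    for patterns in ([""], ["*"], ["/usr/bin/python *"], ["python"]):
        print(patterns, is_blacklisted(cmd, patterns))
```

One thing the sketch makes visible: under these assumptions '/usr/bin/python *' does not match a bare "/usr/bin/python" with no arguments, because the pattern requires the space. Whether that is desirable is another design decision.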
One interesting question is this: what exactly does it mean to not trace a process? More specifically, if an untraced process spawns its own children, should we trace them? If we run the process natively (as --trace-children=no currently does for child processes) then any spawned children will not be traced — once Valgrind loses control, it cannot get it back. An alternative is to run the black-listed processes under Nulgrind, the Valgrind tool that adds no instrumentation. This incurs a slow-down of about 5x compared to native execution, but allows Valgrind to keep control.
So there are two possible kinds of black-list: the “skip you” black-list, and the “skip you and all your descendants” black-list. If I had to choose one, I’d probably pick the latter; the former seems less likely to be useful. If we added both kinds, I don’t know what name I’d give the options.
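The difference between the two kinds of black-list shows up once a black-listed process has children of its own. Here is a small sketch that walks a made-up process tree and labels each process as traced, run under Nulgrind, or run natively. The tree, the commands in it, and all the names are hypothetical, and for simplicity the sketch applies the list to every process, including the top-level one.

```python
import re


def wildcard_match(pattern, command):
    """Whole-string match: '*' is any run of characters, '?' is any one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, command, flags=re.DOTALL) is not None


def plan(tree, blacklist, skip_descendants, parent_native=False):
    """Label each process in a (command, children) tree.

    skip_descendants=True models "skip you and all your descendants":
    a match runs natively, so Valgrind loses control of its whole subtree.
    skip_descendants=False models "skip you": a match runs under Nulgrind,
    so its children can still be traced.
    """
    command, children = tree
    if parent_native:
        how = "native"  # control was already lost further up the tree
    elif any(wildcard_match(p, command) for p in blacklist):
        how = "native" if skip_descendants else "nulgrind"
    else:
        how = "traced"
    result = [(command, how)]
    for child in children:
        result.extend(plan(child, blacklist, skip_descendants, how == "native"))
    return result


if __name__ == "__main__":
    tree = ("/bin/sh run.sh",
            [("/usr/bin/python wrap.py", [("/usr/bin/app", [])])])
    for mode in (True, False):
        print(mode, plan(tree, ["/usr/bin/python *"], mode))
```

With the “skip you” list the hypothetical /usr/bin/app is still traced; with the “skip you and all your descendants” list it runs natively along with its Python parent.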
Specifying both a --trace-blacklist option and a --trace-children option would be disallowed, as it’s not clear how they would interact.
If proposal 2 is implemented, it would probably make sense to output a message like “Skipping due to black-list: <cmd>” for black-listed processes.
Proposal 4: control child tracing via white-listing
Another way to control which processes are traced is with a white-list. In this case, any process not on the whitelist would have to be run with Nulgrind, so that its children can be traced. (You could also have a “skip you and your descendants” whitelist in which non-matching processes don’t have their children traced, but that seems less useful.)
Some examples:
# Matches nothing, ie. traces no processes (even the one named on the
# command line). (Well, it traces them with Nulgrind.)
--trace-whitelist=''
# Matches everything, ie. traces all processes.
--trace-whitelist='*'
# Traces only /usr/lib/firefox-3.0.9/firefox processes.
--trace-whitelist='/usr/lib/firefox-3.0.9/firefox*'
For whitelists, it’s clear that you want the top-level process (ie. the one named on the command line) to be considered as part of the whitelist matching, not just the children of the initial process. This is different to black-lists. At least, it’s different to “skip you and all your descendants” black-lists, where black-listing the top-level process is not useful, as it is equivalent to not running Valgrind at all. If “skip you” black-lists were also implemented, then considering the top-level process for black-listing makes more sense. (Alternatively, maybe making black-list and white-list behaviour equivalent is better; I’m not sure.)
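Applied to the three Firefox processes from earlier, the whitelist rule could be sketched like this. Again, the names are hypothetical, commands are assumed to be the path plus space-separated arguments (the arguments to basename are omitted here), and the top-level process goes through the same check as its children.

```python
import re


def wildcard_match(pattern, command):
    """Whole-string match: '*' is any run of characters, '?' is any one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, command, flags=re.DOTALL) is not None


def tool_for(command, whitelist):
    """Matching processes get the real tool; the rest run under Nulgrind
    so that Valgrind keeps control and can still trace their children."""
    if any(wildcard_match(p, command) for p in whitelist):
        return "memcheck"
    return "nulgrind"


if __name__ == "__main__":
    whitelist = ["/usr/lib/firefox-3.0.9/firefox*"]
    for cmd in ("/usr/bin/firefox",                 # top-level shell script
                "/usr/bin/basename",                # helper
                "/usr/lib/firefox-3.0.9/firefox"):  # the real firefox
        print(cmd, "->", tool_for(cmd, whitelist))
```

Only the real firefox binary ends up under Memcheck; the start-up script and basename run under Nulgrind.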
You couldn’t use both --trace-blacklist and --trace-whitelist in the same invocation of Valgrind, as there is no clear meaning. (What if a command matches both lists? What if it matches neither?) Likewise with --trace-children and --trace-whitelist.
And again, if proposal 2 is implemented, it would make sense to output a message “Skipping due to white-list: <cmd>” for non-white-listed processes.
Proposal 5: remove --trace-children
With whitelists and blacklists present, --trace-children could be removed, because it is subsumed by them:
- --trace-children=yes is equivalent to either --trace-whitelist='*' or --trace-blacklist=''
- --trace-children=no is equivalent to --trace-blacklist='*'
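These equivalences can be checked against the matching rule with a few lines. This is only a sketch: the function names are invented, and it looks at child processes only, since the top-level process is always traced under --trace-children.

```python
import re


def wildcard_match(pattern, command):
    """Whole-string match: '*' is any run of characters, '?' is any one."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, command, flags=re.DOTALL) is not None


def traced_with_whitelist(command, whitelist):
    """Traced only if some whitelist entry matches."""
    return any(wildcard_match(p, command) for p in whitelist)


def traced_with_blacklist(command, blacklist):
    """Traced unless some blacklist entry matches."""
    return not any(wildcard_match(p, command) for p in blacklist)


if __name__ == "__main__":
    children = ["/usr/bin/basename", "/usr/lib/firefox-3.0.9/firefox"]
    # Behaves like --trace-children=yes: every child is traced.
    print(all(traced_with_whitelist(c, ["*"]) for c in children))
    print(all(traced_with_blacklist(c, [""]) for c in children))
    # Behaves like --trace-children=no: no child is traced.
    print(not any(traced_with_blacklist(c, ["*"]) for c in children))
```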
I think this is a good idea, because I don’t think it’s smart to have multiple options with overlapping functionality.
Conclusion
I think these changes would make Valgrind easier to use with multi-process programs, but there are some design decisions still to be made. Any feedback about them from Valgrind users would be very helpful. Thanks.