{"id":90,"date":"2009-04-30T09:20:45","date_gmt":"2009-04-29T22:20:45","guid":{"rendered":"http:\/\/blog.mozilla.org\/nnethercote\/?p=90"},"modified":"2009-04-30T14:20:13","modified_gmt":"2009-04-30T03:20:13","slug":"making-valgrind-easier-to-use-with-multi-process-programs","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/nnethercote\/2009\/04\/30\/making-valgrind-easier-to-use-with-multi-process-programs\/","title":{"rendered":"RFC: Making Valgrind easier to use with multi-process programs"},"content":{"rendered":"<p>Now that I&#8217;m working for Mozilla, one of my goals is to make Valgrind easier to use on big programs like Firefox.  One feature of such programs is that often they create multiple processes.  For example, when I invoke &#8216;firefox&#8217; on my Linux box, I&#8217;m really running three programs.  \/usr\/bin\/firefox is a start-up shell script.  It uses \/usr\/bin\/basename as part of its preprocessing, and then invokes \/usr\/lib\/firefox-3.0.9\/firefox, which is the real firefox, via &#8216;exec&#8217;.  (And a program like Google Chrome would be much worse, having one process per tab.)<\/p>\n<p>In this post I&#8217;m going to make several suggestions for improvements to Valgrind that would make it easier to use with multi-process programs.  
I&#8217;d love to hear feedback from Valgrind users about these suggestions.<\/p>\n<h3>Proposal 1: trace child processes by default<\/h3>\n<p>If you run &#8220;valgrind firefox&#8221;, you get this output on the terminal:<\/p>\n<pre>==9045== Memcheck, a memory error detector.\r\n==9045== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.\r\n==9045== Using LibVEX rev 1888, a library for dynamic binary translation.\r\n==9045== Copyright (C) 2004-2009, and GNU GPL'd, by OpenWorks LLP.\r\n==9045== Using valgrind-3.5.0.SVN, a dynamic binary instrumentation framework.\r\n==9045== Copyright (C) 2000-2009, and GNU GPL'd, by Julian Seward et al.\r\n==9045== For more details, rerun with: -v\r\n==9045==<\/pre>\n<p>and then Firefox starts up suspiciously quickly.\u00a0 Where&#8217;s that slow-down due to Valgrind?\u00a0 And where are the error messages from Valgrind?\u00a0 As it happens, by default Valgrind doesn&#8217;t trace into any child processes spawned by the program it&#8217;s tracing.\u00a0 So Valgrind is tracing \/usr\/bin\/firefox, but \/usr\/bin\/basename and \/usr\/lib\/firefox-3.0.9\/firefox are run natively.<\/p>\n<p>In order to trace into child processes, you have to use the --trace-children=yes option;\u00a0 then it&#8217;ll do what you want.<\/p>\n<p>But I think that not tracing by default is a bad idea.\u00a0 First of all, it&#8217;s quite unclear what&#8217;s happening, especially if you don&#8217;t understand Valgrind&#8217;s behaviour.\u00a0 We even have an entry in the FAQ about this.\u00a0 (In contrast, if we traced by default and you didn&#8217;t want that behaviour, the fact that you&#8217;d get one Valgrind start-up message per process would make it clearer what&#8217;s happening.)<\/p>\n<p>Furthermore, in my experience, --trace-children=no is almost never what you want.\u00a0 And it&#8217;s easy to forget --trace-children=yes;\u00a0 I do it all the time.<\/p>\n<p>So I think tracing into children should be the default.\u00a0 
Others may disagree, so it would be useful to know if Valgrind users have an opinion on this.<\/p>\n<h3>Proposal 2: show what command is being run<\/h3>\n<p>If I invoke &#8220;valgrind --trace-children=yes firefox&#8221;, let it load the default page, and then quit, I get this output (eliding some of the startup\/shutdown messages for brevity):<\/p>\n<pre>==9658== Memcheck, a memory error detector.\r\n==9658== ...\r\n==9658==\r\n==9659== Memcheck, a memory error detector.\r\n==9659== ...\r\n==9659==\r\n==9659==\r\n==9659== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 1)\r\n==9659== ...\r\n==9658== Memcheck, a memory error detector.\r\n==9658== ...\r\n==9658== Syscall param write(buf) points to uninitialised byte(s)\r\n==9658==\u00a0\u00a0\u00a0 at 0x4E38E90: __write_nocancel (in \/lib\/libpthread-2.8.90.so)\r\n==9658==\u00a0\u00a0\u00a0 by 0xE55DEFE: ??? (in \/usr\/lib\/libICE.so.6.3.0)\r\n==9658==\u00a0\u00a0\u00a0 ...\r\n==9658==\u00a0 Address 0x5e91964 is 12 bytes inside a block of size 1,024 alloc'd\r\n==9658==\u00a0\u00a0\u00a0 at 0x4C24724: calloc (vg_replace_malloc.c:368)\r\n==9658==\u00a0\u00a0\u00a0 by 0xE55A373: IceOpenConnection (in \/usr\/lib\/libICE.so.6.3.0)\r\n==9658==\u00a0\u00a0\u00a0 ...\r\n==9658==\r\n==9658== Syscall param write(buf) points to uninitialised byte(s)\r\n==9658==\u00a0\u00a0\u00a0 at 0x4E38ECB: ??? (in \/lib\/libpthread-2.8.90.so)\r\n==9658==\u00a0\u00a0\u00a0 by 0x7E00876: ??? (in \/usr\/lib\/libsqlite3.so.0.8.6)\r\n==9658==\u00a0\u00a0\u00a0 ...\r\n==9658==\u00a0 Address 0x15dcfefc is 36 bytes inside a block of size 4,104 alloc'd\r\n==9658==\u00a0\u00a0\u00a0 at 0x4C2694E: malloc (vg_replace_malloc.c:178)\r\n==9658==\u00a0\u00a0\u00a0 by 0x7DE9CF7: sqlite3_malloc (in \/usr\/lib\/libsqlite3.so.0.8.6)\r\n==9658==\u00a0\u00a0\u00a0 ...\r\n==9658==\r\n==9658== Syscall param write(buf) points to uninitialised byte(s)\r\n==9658==\u00a0\u00a0\u00a0 at 0x4E38ECB: ??? 
(in \/lib\/libpthread-2.8.90.so)\r\n==9658==\u00a0\u00a0\u00a0 by 0x7E00876: ??? (in \/usr\/lib\/libsqlite3.so.0.8.6)\r\n==9658==\u00a0\u00a0\u00a0 by ...\r\n==9658==\u00a0 Address 0x15dcfefc is 36 bytes inside a block of size 4,104 alloc'd\r\n==9658==\u00a0\u00a0\u00a0 at 0x4C2694E: malloc (vg_replace_malloc.c:178)\r\n==9658==\u00a0\u00a0\u00a0 by 0x7DE9CF7: sqlite3_malloc (in \/usr\/lib\/libsqlite3.so.0.8.6)\r\n==9658==\u00a0\u00a0\u00a0 ...\r\n==9658==\r\n==9658== ERROR SUMMARY: 19 errors from 3 contexts (suppressed: 343 from 3)\r\n==9658== ...<\/pre>\n<p>We have three Memcheck start-up messages, two Memcheck shut-down messages, and two PIDs.  What&#8217;s going on?  The first start-up message (PID 9658) is for \/usr\/bin\/firefox.  The second (PID 9659) is for \/usr\/bin\/basename.  The third start-up message is for \/usr\/lib\/firefox-3.0.9\/firefox;  the PID 9658 is reused because \/usr\/lib\/firefox-3.0.9\/firefox is invoked with &#8216;exec&#8217;, which reuses the same process.  This also explains why there are only two shut-down messages.<\/p>\n<p>But working this out isn&#8217;t easy.  In fact, I cheated by also using the -v option.  This makes Valgrind produce verbose output, and one of the things this includes is the command being executed.  Without that I would have had a much harder time understanding what happened.  
But -v produces lots of extra stuff that is rarely interesting, so it&#8217;s not a good solution.<\/p>\n<p>So my second proposal is to always print the invoked command as part of the Valgrind start-up message, like this:<\/p>\n<pre>==9045== Memcheck, a memory error detector.\r\n==9045== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.\r\n==9045== Using LibVEX rev 1888, a library for dynamic binary translation.\r\n==9045== Copyright (C) 2004-2009, and GNU GPL'd, by OpenWorks LLP.\r\n==9045== Using valgrind-3.5.0.SVN, a dynamic binary instrumentation framework.\r\n==9045== Copyright (C) 2000-2009, and GNU GPL'd, by Julian Seward et al.\r\n==9045== Running: \/usr\/bin\/firefox\r\n==9045== For more details, rerun with: -v\r\n==9045==<\/pre>\n<p>(We could possibly move the &#8220;For more details&#8221; message to shut-down, where other, similar messages are shown.)<\/p>\n<p>In this case the command has no arguments, but they would be shown if present.<\/p>\n<p>This change would make the output of running Valgrind on multi-process programs much easier to understand.<\/p>\n<p>Another possibility is to also show the parent process&#8217;s command, but that is probably overkill.<\/p>\n<h3>Proposal 3: control child tracing via black-listing<\/h3>\n<p>Currently you either trace all child processes, or none of them.\u00a0 This is crude.\u00a0 It would often be useful to be able to trace some of them.<\/p>\n<p>An obvious way to do this is with a black-list.\u00a0 You would specify a list of processes:\u00a0 anything not on the list would be traced, and anything on it would not be traced.\u00a0 Allowing patterns would also be useful.\u00a0 Valgrind already has support for patterns containing shell-style &#8216;*&#8217; and &#8216;?&#8217; wildcards, so that would be an obvious choice to use.<\/p>\n<p>Some examples:<\/p>\n<pre># Matches nothing, ie. 
traces all children.\u00a0 (Single quotes are necessary to\r\n# protect most patterns from shell interference.)\r\n--trace-blacklist=''\r\n\r\n# Matches everything, ie. traces no children.\r\n--trace-blacklist='*'\r\n\r\n# Skips all \/usr\/bin\/python subprocesses.\r\n--trace-blacklist='\/usr\/bin\/python *'\r\n\r\n# Skips all \/usr\/bin\/python subprocesses invoked with -v.\r\n--trace-blacklist='\/usr\/bin\/python *-v*'\r\n\r\n# Matches nothing, ie. traces all children.\u00a0 It looks like it might match\r\n# all commands containing the substring \"python\", but it does not, because\r\n# patterns must match the entire command, not just part of it.\r\n--trace-blacklist='python'\r\n\r\n# Skips all \/usr\/bin\/python and \/usr\/bin\/perl subprocesses;\u00a0 multiple\r\n# blacklist options are combined, and any process matching any of the\r\n# blacklist entries is blacklisted.\r\n--trace-blacklist='\/usr\/bin\/python *' --trace-blacklist='\/usr\/bin\/perl *'<\/pre>\n<p>One interesting question is this:\u00a0 what exactly does it mean to not trace a process?\u00a0 More specifically, if an untraced process spawns its own children, should we trace them?\u00a0 If we run the process natively (as --trace-children=no currently does for child processes) then any spawned children will not be traced, because once Valgrind loses control, it cannot get it back.\u00a0 An alternative is to run the black-listed processes under Nulgrind, the Valgrind tool that adds no instrumentation.\u00a0 This incurs a slow-down of about 5x compared to native execution, but allows Valgrind to keep control.<\/p>\n<p>So there are two possible kinds of black-list:\u00a0 the &#8220;skip you&#8221; black-list, and the &#8220;skip you and all your descendents&#8221; black-list.\u00a0 If I had to choose one, I&#8217;d probably pick the latter;\u00a0 the former seems less likely to be useful.\u00a0 If we added both kinds, I don&#8217;t know what name I&#8217;d give the options.<\/p>\n<p>Specifying both 
a --trace-blacklist option and a --trace-children option would be disallowed, as it&#8217;s not clear how they would interact.<\/p>\n<p>If proposal 2 is implemented, it would probably make sense to output a message like &#8220;Skipping due to black-list: &lt;cmd&gt;&#8221; for black-listed processes.<\/p>\n<h3>Proposal 4: control child tracing via white-listing<\/h3>\n<p>Another way to control which processes are traced is with a white-list.\u00a0 In this case, any process not on the whitelist would have to be run with Nulgrind, so that its children can be traced.\u00a0 (You could also have a &#8220;skip you and your descendents&#8221; whitelist in which non-matching processes don&#8217;t have their children traced, but that seems less useful.)<\/p>\n<p>Some examples:<\/p>\n<pre># Matches nothing, ie. traces no processes (even the one named on the\r\n# command line).\u00a0 (Well, it traces them with Nulgrind.)\r\n--trace-whitelist=''\r\n\r\n# Matches everything, ie. traces all processes.\r\n--trace-whitelist='*'\r\n\r\n# Traces only \/usr\/lib\/firefox-3.0.9\/firefox processes.\r\n--trace-whitelist='\/usr\/lib\/firefox-3.0.9\/firefox*'<\/pre>\n<p>For whitelists, it&#8217;s clear that you want the top-level process (ie. 
the one named on the command line) to be considered as part of the whitelist matching, not just the children of the initial process.\u00a0 This is different to black-lists.\u00a0 At least, it&#8217;s different to &#8220;skip you and all your descendents&#8221; black-lists, where black-listing the top-level process is not useful, as it is equivalent to not running Valgrind at all.\u00a0 If &#8220;skip you&#8221; black-lists were also implemented, then considering the top-level process for black-listing makes more sense.\u00a0 (Alternatively, maybe making black-list and white-list behaviour equivalent is better;\u00a0 I&#8217;m not sure.)<\/p>\n<p>You couldn&#8217;t use both --trace-blacklist and --trace-whitelist in the same invocation of Valgrind, as there is no clear meaning.\u00a0 (What if a command matches both lists?\u00a0 What if it matches neither?)\u00a0 Likewise with --trace-children and --trace-whitelist.<\/p>\n<p>And again, if proposal 2 is implemented, it would make sense to output a message &#8220;Skipping due to white-list: &lt;cmd&gt;&#8221; for non-white-listed processes.<\/p>\n<h3>Proposal 5: remove --trace-children<\/h3>\n<p>With whitelists and blacklists present, --trace-children could be removed, because it is subsumed by them:<\/p>\n<ul>\n<li>--trace-children=yes is equivalent to --trace-whitelist='*' or --trace-blacklist=''<\/li>\n<li>--trace-children=no\u00a0 is equivalent to --trace-blacklist='*'<\/li>\n<\/ul>\n<p>I think this is a good idea, because I don&#8217;t think it&#8217;s smart to have multiple options with overlapping functionality.<\/p>\n<h3>Conclusion<\/h3>\n<p>I think these changes would make Valgrind easier to use with multi-process programs, but there are some design decisions still to be made.\u00a0 Any feedback about them from Valgrind users would be very helpful.\u00a0 Thanks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Now that I&#8217;m 
working for Mozilla, one of my goals is to make Valgrind easier to use on big programs like Firefox. One feature of such programs is that often they create multiple processes. For example, when I invoke &#8216;firefox&#8217; on my Linux box, I&#8217;m really running three programs. \/usr\/bin\/firefox is a start-up shell script. [&hellip;]<\/p>\n","protected":false},"author":139,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[484],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/90"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/users\/139"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/comments?post=90"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/posts\/90\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/media?parent=90"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/categories?post=90"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/nnethercote\/wp-json\/wp\/v2\/tags?post=90"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}