Categories
Mac OS X Valgrind

Valgrind + Mac OS X update (June 17, 2009)

It’s time for the June update on the progress of the Mac OS X port of Valgrind.

Progress has been good: the DARWIN branch has been merged to the trunk.  With that having happened, we’re now in sight of an actual release (3.5.0) containing Mac OS X support.  There’s some polishing and bug-fixing — both for Mac OS X and in general — to be done before that happens, but hopefully we’ll release 3.5.0 in early August.  That will be before Snow Leopard comes out;  another release may be necessary afterwards, but we want to get this code released sooner rather than later.

One interesting problem we encountered was some users were having Valgrind abort with a SIGTRAP extremely early.  It was very mysterious, and none of the developers were able to reproduce it.  Turns out that a program called Instant Hijack by a company called Rogue Amoeba was the cause of the problem.  Both Valgrind and Instant Hijack do some stuff with dyld, and apparently Instant Hijack’s stuff is a bit dodgy.  Turns out there’s an easy workaround, which involves temporarily disabling Instant Hijack.  This was reported by a Rogue Amoeba developer, fortunately he tried Valgrind himself, had the same SIGTRAP abort, found the bug report, and realised what the problem was.  If it wasn’t for him, we’d still be scratching our heads!

In the meantime, keep reporting any problems you have, in particular any unimplemented syscall wrappers — a number have been added lately but there are still more to be done.  Please report problems via Bugzilla rather than in comments on this blog, as bugzilla reports are more likely to be acted upon.  Thanks!

Categories
Mac OS X Valgrind

Mac OS X now supported on the Valgrind trunk

This morning I merged the DARWIN branch, which had been holding Valgrind’s support for Mac OS X, onto the trunk. The branch is now defunct, and Valgrind-on-Mac users should check out the trunk like so:

svn co svn://svn.valgrind.org/valgrind/trunk <dirname>
cd <dirname>

and then build it according to the instructions in the README file.

This is a good thing, if only because it means I can spend less time maintaining a branch and more time actually fixing things.

Update: fixed the svn URL.

Categories
Mac OS X Valgrind

Valgrind + Mac OS X update (May 18, 2009)

It’s time for the May update on the progress of the Mac OS X port of Valgrind. In the last month, 133 commits have been made to the DARWIN branch by Julian Seward and myself.

Here are the current (as of r9898) values of the metrics I have been using as a means of tracking progress.

  • The number of regression test failures on Mac was 418/128/43/0. It’s now 421/102/15/0.  I.e. the number of failures went from 171 to 117.  If we ignore the tools Helgrind, DRD and exp-Ptrcheck (which are not widely used and still mostly broken on the branch) the number of failures dropped from 50 to 13.  That’s a similar number to what we get on some Linux systems, and we’re in real diminishing-returns territory — the failing tests are all testing very obscure things.  So we can basically declare victory on that front.
  • The number of “FIXME”-style marker comments that indicate something in the code that needs to be fixed was 274.  It’s now 260.  Furthermore, the method I used last month to count “FIXME”-style comments was flawed, so the number has actually gone down by more than 14;  the comparison next month will be reliable.  But a lot of these comments are for very obscure things that won’t need to be fixed even before a release, so you shouldn’t be worried by the high number!

Functionality improvements from the last month are as follows.

  • Some extra system calls are handled.
  • Some more signal-handling improvements.
  • Some debug info reading improvements.
  • File descriptor tracking (–track-fds) now works.
  • The –auto-run-dsymutil option was added.  When used, it makes Valgrind run dsymutil to generate debug info for any files that need it.
  • Helgrind sort of works;  some of its tests pass.  But it’s still probably not usable.

Things are going well enough that we should be ready to merge the branch to the trunk soon!  That will be a significant milestone, and will make life easier as I won’t have to maintain the branch in parallel with the trunk.  I’m currently going through the branch/trunk differences carefully in order to get ready for the merged, with luck it will happen by the end of this week.

Update, March 19: fixed some HTML tags.

Categories
Valgrind

RFC: Making Valgrind easier to use with multi-process programs

Now that I’m working for Mozilla, one of my goals is to make Valgrind easier to use on big programs like Firefox. One feature of such programs is that often they create multiple processes. For example, when I invoke ‘firefox’ on my Linux box, I’m really running three programs. /usr/bin/firefox is a start-up shell script. It uses /usr/bin/basename as part of its preprocessing, and then invokes /usr/lib/firefox-3.0.9/firefox, which is the real firefox, via ‘exec’. (And a program like Google Chrome would be much worse, having one process per tab.)

In this post I’m going to make several suggestions for improvements to Valgrind to it easier for users to use with multi-process programs. I’d love to hear feedback from Valgrind users about these suggestions.

Proposal 1: trace child processes by default

If you run “valgrind firefox”, you get this output on the terminal:

==9045== Memcheck, a memory error detector.
==9045== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==9045== Using LibVEX rev 1888, a library for dynamic binary translation.
==9045== Copyright (C) 2004-2009, and GNU GPL'd, by OpenWorks LLP.
==9045== Using valgrind-3.5.0.SVN, a dynamic binary instrumentation framework.
==9045== Copyright (C) 2000-2009, and GNU GPL'd, by Julian Seward et al.
==9045== For more details, rerun with: -v
==9045==

and then Firefox starts up suspiciously quickly.  Where’s that slow-down due to Valgrind?  And where are the error messages from Valgrind?  As it happens, by default Valgrind doesn’t trace into any child processes spawned by the program it’s tracing.  So Valgrind is tracing /usr/bin/firefox, but /usr/bin/basename and /usr/lib/firefox-3.0.9/firefox are run natively.

In order to trace into child processes, you have to use the –trace-children=yes option;  then it’ll do what you want.

But I think that not tracing by default is a bad idea.  First of all, it’s quite unclear what’s happening, especially if you don’t understand Valgrind’s behaviour.  We even have an entry in the FAQ about this.  (In contrast, if we traced by default and you didn’t want that behaviour, the fact that you’d get one Valgrind start-up message per process makes it clearer what’s happening.)

Furthermore, in my experience, –trace-children=no is almost never what you want.  And it’s easy to forget –trace-children=yes;  I do it all the time.

So I think tracing into children should be the default.  Others may disagree, so it would be useful to know if Valgrind users have an opinion on this.

Proposal 2: show what command is being run

If I invoke “valgrind –trace-children=yes firefox”, let it load the default page, and then quit, I get this output (eliding some of the startup/shutdown messages for brevity):

==9658== Memcheck, a memory error detector.
==9658== ...
==9658==
==9659== Memcheck, a memory error detector.
==9659== ...
==9659==
==9659==
==9659== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 1)
==9659== ...
==9658== Memcheck, a memory error detector.
==9658== ...
==9658== Syscall param write(buf) points to uninitialised byte(s)
==9658==    at 0x4E38E90: __write_nocancel (in /lib/libpthread-2.8.90.so)
==9658==    by 0xE55DEFE: ??? (in /usr/lib/libICE.so.6.3.0)
==9658==    ...
==9658==  Address 0x5e91964 is 12 bytes inside a block of size 1,024 alloc'd
==9658==    at 0x4C24724: calloc (vg_replace_malloc.c:368)
==9658==    by 0xE55A373: IceOpenConnection (in /usr/lib/libICE.so.6.3.0)
==9658==    ...
==9658==
==9658== Syscall param write(buf) points to uninitialised byte(s)
==9658==    at 0x4E38ECB: ??? (in /lib/libpthread-2.8.90.so)
==9658==    by 0x7E00876: ??? (in /usr/lib/libsqlite3.so.0.8.6)
==9658==    ...
==9658==  Address 0x15dcfefc is 36 bytes inside a block of size 4,104 alloc'd
==9658==    at 0x4C2694E: malloc (vg_replace_malloc.c:178)
==9658==    by 0x7DE9CF7: sqlite3_malloc (in /usr/lib/libsqlite3.so.0.8.6)
==9658==    ...
==9658==
==9658== Syscall param write(buf) points to uninitialised byte(s)
==9658==    at 0x4E38ECB: ??? (in /lib/libpthread-2.8.90.so)
==9658==    by 0x7E00876: ??? (in /usr/lib/libsqlite3.so.0.8.6)
==9658==    by ...
==9658==  Address 0x15dcfefc is 36 bytes inside a block of size 4,104 alloc'd
==9658==    at 0x4C2694E: malloc (vg_replace_malloc.c:178)
==9658==    by 0x7DE9CF7: sqlite3_malloc (in /usr/lib/libsqlite3.so.0.8.6)
==9658==    ...
==9658==
==9658== ERROR SUMMARY: 19 errors from 3 contexts (suppressed: 343 from 3)
==9658== ...

We have three Memcheck start-up messages, two Memcheck shut-down messages, and two PIDs. What’s going on? The first start-up message (PID 9658) is for /usr/bin/firefox. The second (PID 9659) is for /usr/bin/basename. The third start-up message is for /usr/lib/firefox-3.0.9/firefox; the PID 9658 is reused because /usr/lib/firefox-3.0.9/firefox is invoked with ‘exec’, which reuses the same process — this also explains why there are only two shut-down messages.

But working this out isn’t easy. In fact, I cheated, by also using the -v option. This make Valgrind produce verbose output, and one of the things this includes is the command being executed. Without that I would have had a much harder time understanding what happened. But -v produces lots of extra stuff that is rarely interesting, so it’s not a good solution.

So my second proposal is to always print the invoked command as part of the Valgrind start-up message, like this:

==9045== Memcheck, a memory error detector.
==9045== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==9045== Using LibVEX rev 1888, a library for dynamic binary translation.
==9045== Copyright (C) 2004-2009, and GNU GPL'd, by OpenWorks LLP.
==9045== Using valgrind-3.5.0.SVN, a dynamic binary instrumentation framework.
==9045== Copyright (C) 2000-2009, and GNU GPL'd, by Julian Seward et al.
==9045== Running: /usr/bin/firefox
==9045== For more details, rerun with: -v
==9045==

(We could possibly move the “For more details” message to shut-down, where other, similar messages are shown.)

In this case the command has no arguments, but they would be shown if present.

This change would make the output of running Valgrind on multi-process programs much easier to run.

Another possibility is to also show the parent process’s command, but that is probably overkill.

Proposal 3: control child tracing via black-listing

Currently you either trace all child processes, or none of them.  This is crude.  It would often be useful to be able to trace some of them.

An obvious way to do this is with a black-list.  You would specify a list of processes, anything not on that list would be traced, anything on the list would not be traced. And allowing patterns would be useful.  Valgrind already has support for patterns containing shell style ‘*’ and ‘?’ wildcards, so that would be an obvious choice to use.

Some examples:

# Matches nothing, ie. traces all children.  (Single quotes are necessary to
# protect most patterns from shell interference.)
--trace-blacklist=''

# Matches everything, ie. traces no children.
--trace-blacklist='*'

# Skips all /usr/bin/python subprocesses.
--trace-blacklist='/usr/bin/python *'

# Skips all /usr/bin/python subprocesses invoked with -v.
--trace-blacklist='/usr/bin/python *-v*'

# Matches nothing, ie. traces all children.  It looks like it might match
# all command containing the substring "python", but it does not, because
# patterns must match the entire command, not just part of it.
--trace-blacklist='python'

# Skips all /usr/bin/python and /usr/bin/perl subprocesses;  multiple
# blacklist options are combined, and any process matching any of the
# blacklist entries is blacklisted.
--trace-blacklist='/usr/bin/python *' --trace-blacklist='/usr/bin/perl *'

One interesting question is this:  what exactly does it mean to not trace a process?  More specifically, if an untraced process spawns its own children, should we trace them?  If we run the process natively (as –trace-children=no currently does for child processes) then any spawned children will not be traced — once Valgrind loses control, it cannot get it back.  An alternative is to run the black-listed processes under Nulgrind, the Valgrind tool that adds no instrumentation.  This incurs a slow-down of about 5x compared to native execution, but allows Valgrind to keep control.

So there are two possible kinds of black-list:  the “skip you” black-list, and the “skip you and all your descendents” black-list.  If I had too choose one, I’d probably pick the latter;  the former seems less likely to be useful.  If we added both kinds, I don’t know what name I’d give the options.

Specifying both a –trace-blacklist option and a –trace-children option would be disallowed, as it’s not clear how they would interact.

If proposal 2 is implemented, it would probably make sense to output a message like “Skipping due to black-list: <cmd>” for black-listed processes.

Proposal 4: control child tracing via white-listing

Another way to control which processes are traced is with a white-list.  In this case, any process not on the whitelist would have to be run with Nulgrind, so that its children can be traced.  (You could also have a “skip you and your descendents” whitelist in which non-matching processes don’t have their children traced, but that seems less useful.)

Some examples:

# Matches nothing, ie. traces no processes (even the one named on the
# command line).  (Well, it traces them with Nulgrind.)
--trace-whitelist=''

# Matches everything, ie. traces all processes.
--trace-whitelist='*'

# Traces only /usr/lib/firefox-3.0.9/firefox processes.
--trace-whitelist='/usr/lib/firefox-3.0.9/firefox*'

For whitelists, it’s clear that you want the top-level process (ie. the one named on the command line) to be considered as part of the whitelist matching, not just the children of the initial process.  This is different to black-lists.  At least, it’s different to “skip you and all your descendents” black-lists, where black-listing the top-level process is not useful, as it is equivalent to not running Valgrind at all.  If “skip you” black-lists were also implemented, then considering the top-level process for black-listing makes more sense.  (Alternatively, maybe making black-list and white-list behaviour equivalent is better, I’m not sure.)

You couldn’t use both –trace-blacklist and –trace-whitelist in the same invocation of Valgrind, as there is no clear meaning (what if a command matches both lists?  What if it matches neither?)  Likewise with –trace-children and –trace-whitelist.

And again, if proposal 2 is implemented, it would make sense to output a message “Skipping due to white-list: <cmd>” for non-white-listed processes.

Proposal 5: remove –trace-children

With whitelists and blacklists present, –trace-children could be removed, because it is subsumed by them:

  • –trace-children=yes is equivalent to –trace-whitelist=’*’ and –trace-blacklist=”
  • –trace-children=no  is equivalent to –trace-blacklist=’*’

I think this is a good idea, because I don’t think it’s smart to have multiple options with overlapping functionality.

Conclusion

I think these changes would make Valgrind easier to use with multi-process programs, but there are some design decisions still to be made.  Any feedback about them from Valgrind users would be very helpful.  Thanks.

Categories
Mac OS X Valgrind

Valgrind + Mac OS X update (April 17, 2009)

It’s time for the April update on the progress of the Mac OS X port of Valgrind.  It’s been a quieter month because I was on vacation for over 3 weeks, and Julian Seward hasn’t had a great deal of time to work on the port either.  Even still, in that time 77 commits have been made to the DARWIN branch.

Here are the current (as of r9567) values of the metrics I have been using as a means of tracking progress.

  • The number of regression test failures on Mac was 422/172/41/0. It’s now 418/128/43/0. I.e. the number of failures went from 213 to 171.  If we ignore the tools Helgrind, DRD and exp-Ptrcheck (which are not widely used and still completely broken on the branch) the number of failures dropped from 92 to 50.  So the functionality of the branch is progressing well.
  • The size of the diff between the trunk and the branch was 38,248 lines (1.3MB). It’s now 39,027 lines (1.3MB). However, 2,223 of these lines are code that was cut, but was put in a text file for reference.  So the more realistic number would be 36,804 lines (1.2MB).  This metric was intended to indicate how close the branch is to being ready to merge with the trunk, but it doesn’t do that very well, so I will stop using it in the future.
  • Instead, I’m going to use a new metric:  the number of “FIXME”-style marker comments that indicate something in the code that needs to be fixed.  A lot of these mark Darwin-specific code that works correctly, but hasn’t been abstracted cleanly.  When this approaches zero, it will mean that the branch should be very close to merge-ready.  (Actually, the branch may be merge-ready before it reaches zero.)  The current number of these 274.  (The task-tracking used within Valgrind is mostly pretty informal, you can get away with it when there’s only a handful of frequent contributors!) That number is quite high, but a lot of those will be easy to fix.

Functionality improvements are as follows.

  • The build system now works with older versions of automake (pre 1.10).  automake’s handling of assembly code files (specifically, whether AM_CPPFLAGS is used for them) changed in 1.10, and the build system wasn’t working with older versions.
  • Some extra system calls are handled, enough that iTunes apparently now runs (although I haven’t tried it myself).
  • -mdynamic-no-pic is now used for compilation of Valgrind.  This turns off position-independent code, which (strangely enough) is the default for GCC on Darwin.  This speeds up most programs at least a little, and in some cases up to 30%.
  • Some more signal-handling improvements.

So things are still moving along well.

Categories
Mac OS X Valgrind

Valgrind + Mac OS X update (March 17, 2009)

Another month has passed since I last wrote about my work on the Mac OS X port of Valgrind.  In that time 126 commits have been made to the DARWIN branch (and a similar number to the trunk).  I’ve done a lot of them, but Julian Seward has found some time to work on the DARWIN branch and so has been doing some as well.

Here are the current (as of r9455) values of the metrics I have been using as a means of tracking progress.

  • The number of regression test failures on Linux was: 484 tests, 4 stderr failures, 1 stdout failures, 0 post failures (which I’ll abbreviate as 484/4/1/0). It’s now 484/0/1/0.  I.e. the number of failures went from 5 to 1, and that one failure occurs on my machine even on the trunk (it’s a bad test).  In other words, the branch works on Linux as well as the trunk.  Now that this metric is the same on the branch as the trunk, I won’t bother tracking it in the future.
  • The number of regression test failures on Mac was 402/213/52/0.  It’s now 422/172/41/0. I.e. the number of failures went from 265 to 213.  Also, 20 extra tests are being run — a broken CPU feature-detection program meant that a number of tests that should have been running were not, and this has been fixed.  Once again, this is the most important metric, and it’s improving steadily, but there’s still a long way to go.  One encouraging thing here is that 121 of these failures (more than half) involve the tools Helgrind, DRD and exp-Ptrcheck, which are three of the less-used tools in the Valgrind distribution, and which are all completely broken on the branch, and which I haven’t really looked at yet precisely because they are less-used.  The other 92 failures involve Memcheck and Nulgrind (the “no-instrumentation” tool, failures for which indicate problems with the testing of Valgrind’s core).  A lot of these are problems with non-portable tests, rather than the Darwin port’s functionality.  Furthermore, the tools Cachegrind, Callgrind, and Massif pass all of their tests.
  • The size of the diff between the trunk and the branch was 41,895 lines (1.5MB).  It’s now 38,248 (1.3MB). But note, once again, that this is not a very useful metric.  I just scanned through the diff and there’s not a great deal of differences in the diff than can be merged before we reach the point of the big branch-to-trunk merge.

Functionality improvements are as follows.

  • Basic signals are now supported, thanks to Julian.  This accounted for a lot of the new test passes.  This also means that debug builds of Firefox run successfully!
  • Some extra system calls are handled.
  • 64-bit builds are working.  To configure Valgrind for them, pass to ./configure the option –build=amd64-darwin.  64-bit Valgrind is quite slow, it does some very large mmaps at startup which take several seconds.  This will need to be fixed.  This also hasn’t been tested as much as the 32-bit version, and passes fewer tests.

I’m taking three weeks of vacation starting on Thursday, so progress on Valgrind+Darwin will be minimal over the next month.  But I will be visiting Mountain View early next week (Monday, March 23 and Tuesday, March 24) so I’ll be able to actually meet some of the people I work with!  I may also give a talk about Valgrind, depending on whether it can be scheduled.  Any suggestions for things to talk about are welcome.

Categories
C Correctness Valgrind

atol() considered harmful

Here’s what sounds like a simple question: in C, what’s the best way to convert a string representing an integer to an integer?

The good way

One way is to use the standard library function called strtol(). There are a number of similar functions, such as strtoll() (convert to a long long) and strtod (convert to a double), but they all work much the same so I’ll focus on strtol().

Here’s the prototype for strtol() from the man page on my Linux box:

#include <stdlib.h>
long int strtol(const char *nptr, char **endptr, int base);

The first argument is the string to convert.  The third argument is the base;  if you give it zero the base used will be 10 unless the string begins with “0x” (hexadecimal) or ‘0’ (octal).  Any whitespace at the front is skipped over, and a ‘+’ or ‘-‘ just before the digits is allowed.  The return value is the long integer value that the string was converted into.

The second argument is a call-by-reference return value.  If it’s non-NULL, it gets filled in with a pointer to the first non-converted char in the string.  If the string was entirely numbers, such as “123”, endptr will point to the terminating NUL char.  If the string had no valid integral prefix, e.g. “one two three”, endptr will point to the start of the string.  If the string was “123xyz”, i.e. contains numbers followed by non-numbers, endptr will point to the ‘x’ char in the string.

There are two good things about endptr.  First, it lets you do error-detection.  For example, if you are expecting a number without any extra chars at the end, endptr must point to a NUL char:

char* endptr;
long int x = strtol(s, &endptr, 0);
if (*endptr) { /* error case */ }

Second, it allows more flexible parsing.  For example, if you need to parse a string consisting of three comma-separated integers (e.g. “1, 2, 3”) you can do this:

int i1 = strtol(s,        &endptr, 0);  if (*endptr != ',')  goto bad;
int i2 = strtol(endptr+1, &endptr, 0);  if (*endptr != ',')  goto bad;
int i3 = strtol(endptr+1, &endptr, 0);  if (*endptr != '\0') goto bad;
...
bad: /* error case */

(Nb: This example allows whitespace before each number but not before each comma.)

Finally, strtol() returns 0 and sets errno to EINVAL if the conversion could not be performed because the given base was invalid.  Also, it clamps the value and sets errno to ERANGE if an overflow or underflow occurrs (as could be the case in the above code examples).

The bad way

Another way is to use the standard library function called atol().  (Again, it’s one of a family of similar functions, such as atoi() and atof().)  atol() is equivalent to this call to strtol():

strtol(str, (char **)NULL, 10);

Seems like a nice simplification, especially if you know you want base 10 numbers, right?  The problem is that there is no scope for error-detection.  atol(“123xyz”) will return 123.  atol(“John Smith”) will return 0.  Even better, atol() doesn’t have to set errno in any case!  (And this means it’s actually not quite equivalent to the above strtol() call).

It’s quite amazing, really;  I’m not aware of any other C standard library functions that actually make error-detection impossible.  Furthermore, the documentation involving atol() doesn’t make this clear.  Compare this to the dangerous function gets():  on my Linux box, the man page has a BUGS section that says “Never use gets().”  The man page on my Mac is a little less forthright;  in the SECURITY CONSIDERATIONS section it says “The gets() function cannot be used securely.”

The POSIX specification is a little more forthright about atol():

The atol() function is subsumed by strtol() but is retained because it is used extensively in existing code. If the number is not known to be in range, strtol() should be used because atol() is not required to perform any error checking.

The only use I can think of for atol() is in the case where you’ve already scanned the string for some reason and know it contains only digits.

A few of the options in the current release of Valgrind (3.4.0) that take numbers currently accept non-numbers because they are implemented using atoll().  (Most of them use strtoll(), however, and the discrepancy has been removed in the trunk.)  For example, if you pass the option –log-fd=foo it will interpret the “foo” as 0.  Lovely.

Categories
Mac OS X Valgrind

Valgrind on iPhone

With Valgrind now working reasonably well on Darwin, it’s possible to run Valgrind on an iPhone. Well, not directly on an iPhone, because the Darwin port doesn’t work on ARM, but you can run Valgrind on x86 binaries built against the iPhone Simulator SDK.

However, it’s a little tricky, because you can’t run Valgrind from the command line in this environment.  Landon Fuller explains here the small hack that is required to get around this limitation.

Categories
Correctness Valgrind

Eliminating undefined values with Valgrind, the easy way

(For those who don’t want to read a long post, here’s the short version:  if you use Valgrind, make sure you learn about the –track-origins=yes option that was added to version 3.4.0, it will make your life easier.)

Earlier this week, Jesse Ruderman reported a problem found with the Valgrind tool Memcheck:

==56620== Conditional jump or move depends on uninitialised value(s)
==56620==    at 0x1870962: nsHTMLDocument::MaybeEditingStateChanged()
==56620==    <more stack trace lines omitted>

Mats Palmgren investigated, but wasn’t able to determine the cause. His conclusion was:

The only possible explanations I can think of is either a bug in valgrind itself, or we make bogus VALGRIND_* calls to misinform it about the status of this memory. AFAICT, jemalloc is only place we use VALGRIND_* macros… and I traced all of those and I don’t see any suspicious calls…

It turns out that the Memcheck was right, and the problem was real.  So why did Mats have trouble understanding what Valgrind was trying to tell him?

Tracking undefined values

Memcheck tracks every single value bit present in a running program, and tracks whether it is defined or undefined.  Undefined values arise from certain kinds of allocations;  for example, stack- and heap-allocated memory is undefined if it is not initialised.  Furthermore, undefinedness gets propagated through all operations.  For example, if you add an undefined value to a defined value, the result is undefined.

Actually, because the definedness tracking is bit-level, it’s much more complicated than that:  you can have partially defined values due to bit-field operations, and for every arithmetic operation Memcheck has to determine how undefined bits in the inputs can lead to undefined bits in the output.  This paper describes how it works in detail. It’s quite amazing that Memcheck is able to do this so accurately and efficiently, and Julian Seward deserves a great deal of credit for this minor miracle of code engineering.  (For those of you who have used Purify, this bit-precise definedness tracking is one of the main things that distinguishes Memcheck from Purify.)  Furthermore, the technique can be to track any binary property, and has been reused in tools tracking which bits are tainted and which bits are secret.

Anyway, the gory details don’t really matter.  What does matter is that when Memcheck complains that a “Conditional jump or move depends on uninitialised value(s)” — “undefined” would arguably be better in that message than “uninitialised”, but that message hasn’t changed for years — it’s determined that an undefined value is being used in a dangerous way.  And it’s almost always right.

But I still haven’t answered the original question of why Mats found the message hard to understand.

An infection chain

In my previous post I defined the terms defect, infection and failure, which are the three stages of a “bug”, and together consitute an infection chain.  If you haven’t read that post, or know of these terms from elsewhere, I suggest you read that post before continuing.

What’s the relevant defect here?  It’s almost always that the code fails to initialise a value (e.g. a variable or field).

What’s the infection?  When the defective code is executed, the value ends up undefined, which means it’s probably incorrect.  But it’s not guaranteed to be incorrect — you might be lucky (or arguably unlucky) and the garbage value ends up being the same as the one that would have been assigned.  (This is most common with heap allocations, which in practice often result in all the allocated memory being zeroed.)

How does the infection propagate?  Well, once you have a possibly incorrect value, there are many ways for things to go wrong:

  • If an incorrect value is used as an input to an operation that writes a result, such as an assignment, arithmetic operation or system call, the likely result another incorrect value (a further infection).  But this isn’t guaranteed, e.g. if you bitwise-AND an incorrect value with 0, the result is always 0.
  • If an incorrect value is used as an input to an operation with user-visible side-effects (e.g. a system call or an assertion), a possible result is incorrect output or an abort (both failures).
  • If an incorrect value is used as an address in an operation that accesses memory, it might lead to an incorrect value in the load/store destination (more infections).  Or it could cause inappropriate memory to be accessed, leading to a crash (a failure).
  • If an incorrect value is used in a conditional branch, or as a jump target, that can lead to an incorrect program counter (another infection), which can lead to failures such as incorrect behaviour, or crashes (if the jump destination is sufficiently bogus).

In other words, incorrect values can lead to all sorts of problems.  And that’s why failing to initialise values can lead to such insidious bugs — the range of possible outcomes is extremely wide.

Back to Memcheck

So where in this chain does Memcheck complain?  Ideally it would be as soon as the defect is executed.  But that could lead to false positives — it’s possible that the value will be initialised later, before being used in dangerous ways.

You might think that Memcheck should complain if an undefined value is used in any way.  But that too would lead to false positives — it turns that normal programs copy around lots of undefined values, because of padding within structures.  (Indeed, we’ve tried doing exactly this with Memcheck, and it results in hundreds of warnings for trivial programs like Hello World.)

You might think that Memcheck should complain if an undefined value is used in an arithmetic operation.  Again, it doesn’t, because such an operation is not necessarily dangerous;  it depends if the resulting value is undefined and how it is then used.

Really, the problem is that it’s hard to tell if an undefined value constitutes an infection — doing that requires knowing the programmer’s intention.

What Memcheck does is complain when the use of an undefined value is very likely to lead to defects:

  • In the condition of a conditional branch or test, or the target address of a branch, because it is likely to lead to incorrect control flow, which will lead to incorrect behaviour.  (The complaint above was an example of this.)
  • As an argument to a system call, because it is likely to cause an incorrect externally-visible side-effect.
  • As the address in a load or store, because it could cause a crash.

The end result of this is that the place where Memcheck complains can be quite some distance from the defect, and also quite some distance from any failures that manifest as a result.  This is why Mats had trouble with this warning.

People often ask “can’t you do more eager checking?”  It should be clear by now that there are trade-offs between how early definedness problems are reported and how many false positives you get.  It’s not clear that there’s a single best answer, but the Valgrind developers have given this matter lots of thought over many years and it’s probably not going to change any time soon!  But do not despair…

So how do I get better information?

When Memcheck complains about a dangerous use of an undefined value, you need to work backwards to find the defect — the missing initialisation.  In easy cases, you can do this in your head, just by looking at the code.

If you’re still stumped, you can insert the Memcheck client requests VALGRIND_CHECK_MEM_IS_DEFINED or VALGRIND_CHECK_VALUE_IS_DEFINED into your code to determine which variables are undefined at different points in your code.  These can be very helpful, but they’re a bit laborious because you have to recompile your code every time you want to insert a new one.  But for over seven years that was the best way to do it.

However, Valgrind 3.4.0 introduced a new option, –track-origins=yes.  What it does is, for every undefined value, it tracks where the undefinedness originated.  The origin is usually a stack or heap allocation.  With –track-origins=yes, each complaint about an undefined value will be augmented with origin information like this:

==29553==  Uninitialised value was created by a stack allocation
==29553==    at 0x400968: main (origin1-yes.c:23)

or this:

==29553==  Uninitialised value was created by a heap allocation
==29553==    at 0x4C2758C: malloc (vg_replace_malloc.c:192)
==29553==    by 0x400A6F: main (origin1-yes.c:61)

This is useful because the missing initialisation — the defect — is very often supposed to occur immediately after the allocation.  So origin-tracking usually takes you immediately to the actual defect.  This is one of those features that probably doesn’t seem that impressive unless you’ve had to deal with this problem before, in which case you’ll realise it can be a godsend.

By default, –track-origins is not turned on.  This is because it slows Memcheck down a lot, because of all the extra origin information that must be propagated.  But if you have undefined value warnings, you should turn it on immediately!

With this information, Mats was able to track down the defect:  a constructor failed to initialise two fields.  Hooray!  And just think, if C++ warned about fields uninitialised by constructors, all these heroics wouldn’t be necessary 🙂

Categories
Mac OS X Valgrind

Valgrind + Mac OS X update (Feb 17, 2009)

It’s been a month since I first wrote about my work on the Mac OS X port of Valgrind.  In that time I’ve made 85 commits to the DARWIN branch (and a similar number to the trunk).

Here are the current (as of r9192) values of the metrics I defined in the first post as a means of tracking progress.

  • The number of regression test failures on Linux was: 477 tests, 220 stderr failures, 53 stdout failures, 25 post failures (which I’ll abbreviate as 477/220/53/25). It’s now 484/4/1/0.  I.e. the number of failures went from 298 to 5.  A few new tests have been added.  Four of the failures are in Helgrind, the data race detector tool, which I haven’t tracked down yet.  The other failure is one that also occurs on the trunk.  So almost all the Linux functionality broken by the changes has been restored.
  • The number of regression test failures on Mac was 419/293/58/29.  It’s now 402/213/52/0.  I.e. the number of failures went from 380 to 265.  The total number of tests has gone down because some Linux-specific tests are no longer being (inappropriately) run on Mac.  This is the most important metric, and it’s improving steadily, but there’s still a long way to go.
  • The number of compiler warnings on Linux was 186.  It’s now 10, and all of these are from #warning declarations that mark places where improvement need to be made to the Darwin port, but aren’t actually a problem for Linux.  The number of compiler warnings on Mac was 461.  It’s now 44.  Of these, 33 are from #warning declarations, and 10 are from code generated by the Darwin ‘mig’ utility which I have no control over.  So compiler warnings aren’t an issue any more, and I won’t bother tracking them as a metric in the future.
  • The size of the diff between the trunk and the branch was 55,852 lines (1.9MB).  It’s now 41,895 lines (1.5MB).  But note that this is not a very useful metric;  progress will usually cause it to drop, but it will also increase as missing Darwin functionality is added.

Interestingly enough, although this number of Mac test failures has gone down significantly, if the branch didn’t handle your program a month ago it probably still won’t handle it now (although getsockopt() no longer causes an abort).  But Valgrind’s output may well be better (e.g. debugging information will be better utilized).  Much of my effort has been in making the tests pass — improving cases where the Darwin port was doing basically the right thing, but its output didn’t exactly match that expected.

One example is that stack traces were a little unclean, in various minor ways.  Another example is that I added a –ignore-fn option to Massif (the heap profiler) which allows it to ignore certain heap allocations.  This was required because Darwin’s libc always does a few heap allocations at start-up, but Linux’s libc doesn’t.  The new option allows the Darwin allocations to be ignored and therefore Massif’s output to be consistent on both platforms.

Few if any of these changes have made the branch closer to handling new programs, at least directly.  But there’s no point apologising about this, because the branch won’t reach a highly functional state without a working test suite to serve as a safety net against regressions.  And as I progress, getting more tests to pass will require genuine new program functionality to be supported, so improvements should start to occur on that front soon.  For example, signals currently aren’t supported at all, and this is why Firefox does not run under Valgrind on Mac yet — all calls to sigaction() currently return -1, which causes an assertion failure somewhere in NSPR.

Something else worth mentioning:  I bought a new MacBook Pro, as my old 32-bit only was was slow and noisy and getting annoying.  The new machine is 64-bit capable, but compiles to 32-bit by default and Valgrind’s configure script identifies it as a 32-bit only machine.  If anybody knows how to make configure recognise that it’s a 64-bit machine I’d love to hear about it.

Update, March 17: fixed a broken link to an earlier post.