Valgrind on Android — Current Status

This is a long post.  Here’s a summary.

  • the good news: Valgrind’s Memcheck tool now works well enough on Nexus S that it can run Firefox and find real bugs, for example bug 688733.  The false error rate from Memcheck is pretty low, certainly low enough to be usable.
  • as of this morning, all required Valgrind fixes are in the Valgrind SVN trunk repo.
  • the bad news: this requires building a custom ROM and kernel for the Nexus S, that is to say, mucho hoop jumping.  Not for the faint hearted.

The rest of this post explains what the constraints are and approximately what is necessary to get started.

Constraints

The main difficulty is the need to build a custom ROM and kernel.  There are three reasons for this:

  • To have a swap enabled kernel.  Starting a debuggable build of Firefox on Memcheck gives a process size of around 800MB.  Without swap, it gets OOM-killed at around 270MB.  But the default kernel isn’t configured for swap, hence a rebuild is required, as per Mike Hommey’s instructions.  Once in place, I gave the kernel a 1GB swap file parked in /sdcard, and that seemed to work ok.
  • Libraries with symbols.  Memcheck needs to have symbol information for /system/lib/libc.so and /system/bin/linker (libc and the dynamic linker).  Without these it generates huge numbers of false error reports and is unusable.  Symbols for the other libraries are nice (better stacktraces) but not essential.
  • To make it possible to start Firefox on Valgrind.  Valgrind can only insert itself into a process at an exec() transition, that is, when the process starts from new.  On normal Unixes this is no problem, since the shell from which you start Valgrind does the normal fork-exec thing.  But application start on Android is completely different, and doesn’t involve exec().

Instead, there is a master process called the Zygote.  To start an application (e.g. Firefox), a message is sent via a socket to the Zygote.  The Zygote then creates a child with fork(), and the child goes on to load the relevant bytecode and (presumably) native code and “becomes” Firefox.  So there’s no exec() boundary for Valgrind to enter at.

Fortunately the AOSP folks provided a solution a couple of months back.  They modified Zygote so that it can start selected processes under the control of a user-specified wrapper, which is precisely the hook we need.  The AOSP tree now has this fix.

Overview of getting started

Here’s an overview of the process.  It doesn’t contain enough details to simply copy and paste, but it does give some idea of the hoop jumping that is unfortunately still required.

Download sources and build Android images, as per directions at http://source.android.com/source/building.html.  This in itself is a major exercise.  The relevant “lunch” flavour is full_crespo-eng, I think.  At the end of this stage, you’ll have (amongst other things) libraries with symbols and a wrapper-enabled Zygote.  But not a swap-enabled kernel.

Build a swap enabled kernel as per Mike Hommey’s instructions, and incorporate it into the images built in the previous stage.  In fact, I skipped this step — Mike kindly did it for me.

Push the images onto the phone, reboot, check it’s still alive.

Check out a copy of the Valgrind trunk from svn://svn.valgrind.org/valgrind/trunk, and build as described in detail in README.android.  If you complete that successfully, you’ll have a working installation of Valgrind on the phone at /data/local/Inst/bin/valgrind.

Install busybox on the phone, to make life a little less austere in the shell.

On the Linux host, generate a 1GB swap file and transfer it to /sdcard on the phone (that’s the only place it will fit).  Then enable swapping:

  cat /proc/swaps
  /data/local/Bin/busybox swapon /sdcard/swapfile1G
  cat /proc/swaps

Note you’ll have to manually re-enable swapping every time the phone is rebooted.
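
For the swap file step, something along the following lines could be used.  This is a sketch rather than the exact commands used for this post: the helper names make_swapfile and swap_is_active are invented here, and it assumes dd, awk and (optionally) mkswap are available.

```shell
#!/bin/sh
# make_swapfile PATH SIZE_MB
# Create a zero-filled file of SIZE_MB megabytes and, if mkswap is on the
# PATH, write a swap signature into it.
make_swapfile() {
    path=$1
    size_mb=$2
    # Swap files must not be sparse, so really write the zeros.
    dd if=/dev/zero of="$path" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    if command -v mkswap >/dev/null 2>&1; then
        mkswap "$path" >/dev/null 2>&1 || return 1
    else
        echo "mkswap not found; $path has no swap signature yet" >&2
    fi
}

# swap_is_active PATH [SWAPS_FILE]
# Succeed if PATH appears in the first column of /proc/swaps
# (or of SWAPS_FILE, which is handy for testing).
swap_is_active() {
    swaps=${2:-/proc/swaps}
    # Line 1 of /proc/swaps is a header, so only look at later lines.
    awk -v f="$1" 'NR > 1 && $1 == f { found = 1 } END { exit !found }' "$swaps"
}

# Example use on the host (commented out so nothing runs by accident):
#   make_swapfile swapfile1G 1024 && adb push swapfile1G /sdcard/swapfile1G
```

The second helper only inspects the text of /proc/swaps, so in principle it could also be used on the phone before calling swapon after a reboot, assuming the busybox build there includes awk.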

Copy the contents of out/target/product/crespo/symbols/system from the host to /sdcard/symbols/system on the phone.  These are the debuginfo objects for the system libraries.  Valgrind expects them to be present, as per comments above, so it can read symbols for libc.so and /system/bin/linker.  This will copy far more than that, which is not essential but nice for debugging.
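
Before pushing, it may be worth checking that the two files Memcheck really depends on are actually present in the host-side tree.  A minimal sketch, with check_symbols being a made-up helper name:

```shell
#!/bin/sh
# check_symbols DIR
# DIR is the host-side symbols/system directory.  Succeed only if it holds
# the debuginfo objects for libc.so and the dynamic linker.
check_symbols() {
    dir=$1
    status=0
    for f in lib/libc.so bin/linker; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return $status
}

# Example use on the host (commented out so nothing runs by accident):
#   src=out/target/product/crespo/symbols/system
#   check_symbols "$src" && adb push "$src" /sdcard/symbols/system
```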

Build a Firefox you want to debug.  That of course means with line number info and with the flags --disable-jemalloc --enable-valgrind.  I strongly suggest you use “-O -g” for a good compromise between speed and debuggability.  When the build finishes, ask the build system to make an .apk file with the debug info in place, and install it.  The .apk will be huge, about 125MB:

  (cd $objdir && make package PKG_SKIP_STRIP=1)
  adb install -r $objdir/dist/fennec-9.0a1.en-US.eabi-arm.apk
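
How the configure flags mentioned above are passed depends on your build setup.  In a mozconfig file they might look like the fragment below.  This is only a sketch: the Android-specific options a real Fennec build needs are left out, and routing “-O -g” through --enable-optimize is an assumption rather than something stated above.

```shell
# mozconfig fragment (sketch, not a complete configuration).
# These two options are the ones named in the text.
ac_add_options --disable-jemalloc
ac_add_options --enable-valgrind
# Assumption: the optimisation flags are supplied via --enable-optimize.
ac_add_options --enable-optimize="-O -g"
```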

We’re nearly there.  We have a device which is all set up, and a debuggable Firefox on it.  But we need to tell Zygote that we want to start Firefox with a wrapper, namely Valgrind.  In the shell on the phone, do this:

  setprop wrap.org.mozilla.fennec_sewardj "logwrapper /data/local/start_valgrind_fennec"

This tells Zygote that any startup of “org.mozilla.fennec_sewardj” should be done via an exec() of /data/local/start_valgrind_fennec applied to Zygote-specified arguments.  So, now we can put any old thing in a shell script, and Zygote will run it.  Here’s what I have for /data/local/start_valgrind_fennec:

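  #!/system/bin/sh
  # Extra Valgrind options; add more here as needed.
  VGPARAMS='--error-limit=no'
  # Use the app's data directory for temporary files.
  export TMPDIR=/data/data/org.mozilla.fennec_sewardj
  # Replace this shell with Valgrind, forwarding the Zygote-supplied arguments.
  exec /data/local/Inst/bin/valgrind $VGPARAMS "$@"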

Obviously you can put any Valgrind params you want in VGPARAMS; you get the general idea.  Note that this is ARM, so you don’t need the --smc-check= flag that’s necessary on x86 targets.
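
When a wrapper forwards its arguments, the choice between $* and "$@" matters if any argument contains whitespace.  Whether Zygote ever passes such arguments is not something this post establishes, so treat the following as a general shell sketch, with made-up helper names, rather than a statement about Zygote:

```shell
#!/bin/sh
# count_args prints how many arguments it received.
count_args() { echo $#; }

# pass_star forwards with unquoted $*, which re-splits on whitespace.
pass_star() { count_args $*; }

# pass_at forwards with "$@", which keeps each argument intact.
pass_at() { count_args "$@"; }

pass_star "one arg" two   # prints 3
pass_at   "one arg" two   # prints 2
```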

Only two more hoops to jump through now.  One question is where the Valgrind output should go.  Initially I tried using Valgrind’s little-known but very useful --log-socket= parameter (see here for details), but it seemed to crash the phone on a regular basis.

So I abandoned that.  By default, Valgrind’s output winds up in the phone’s system log, mixed up with lots of other stuff.  In the end I wound up running the following on the host, which works pretty well:

  adb logcat -c ; adb logcat | grep --line-buffered start_valgrind \
    | sed -u sQ/data/local/start_valgrind_QQg | tee logfile.txt
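
The sed expression in that pipeline looks odd because it uses Q as the delimiter of the s command; it simply deletes every occurrence of /data/local/start_valgrind_ from each line.  Here is a small sketch of the same substitution written with | as the delimiter.  The helper name and the sample line are made up for illustration, and real logcat output will look different.

```shell
#!/bin/sh
# strip_wrapper_prefix removes the wrapper path prefix from each input line.
# Same substitution as sQ/data/local/start_valgrind_QQg, just easier to read.
strip_wrapper_prefix() {
    sed 's|/data/local/start_valgrind_||g'
}

# Hypothetical log line, invented for this example.
sample='/data/local/start_valgrind_fennec: ==1234== Invalid read of size 4'
printf '%s\n' "$sample" | strip_wrapper_prefix
# prints: fennec: ==1234== Invalid read of size 4
```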

And finally .. we need to start Firefox.  Now, due to recent changes in how the libraries are packaged for Android, you can’t start it by pressing on the Fennec icon (well, you can, but Valgrind won’t read the debuginfo.)  Instead, issue this command in a shell on the phone:

  am start -a org.mozilla.gecko.DEBUG -n org.mozilla.fennec_sewardj/.App

This requests a “debug intent” startup of Firefox, which sidesteps the fancy dynamic unpacking of libraries into ashmem, and instead does the old style thing of unpacking them into /data/data/org.mozilla.fennec_sewardj.  From there Valgrind can read debuginfo in the normal way.

One minor last hint: run “top -d 2 -s cpu -m 19”.  Because Valgrind runs slowly on the phone, I’m often in the situation of wondering am-I-waiting-for-it or is-it-waiting-for-me?  Running top pretty much answers that question.

And .. so .. it works!  It’s slow, but it appears to be stable, and, crucially, the false error rate from Memcheck is low enough to be usable.

So, what’s next?  Writing this reminded me what a hassle it is to get all the ducks lined up right.  We need to streamline it.  Suggestions welcome!

One thing I’ve been thinking about is to avoid the need to have debuginfo on the target, by allowing Valgrind to query the host somehow.  Another thing I plan to do is make the Callgrind tool work, so we can get profile information too.

 

7 responses

  1. njn wrote:

    I shudder to think how many hours of trial and error it took for you to get this working.

    1. jseward wrote:

      The most difficult part was getting Memcheck to behave reasonably. My initial run of Firefox on Memcheck yielded 21 million errors, almost all of which were false, and figuring out where they came from and how to get rid of them wasn’t simple.

      But yes, true, grappling with the difference between Linux and Android user-space and toolchains was indeed a fiddly time-sink.

  2. glandium wrote:

    We recently got pandaboards, which have more memory and more cpu horsepower. That could help significantly.

    1. jseward wrote:

      That would help a bit, and I have one on order, but it won’t make the difference between it being too complex to set up and use, vs being usable. What would really make a difference is if we had a pre-built ROM (and kernel) image for the Nexus S, containing all the pieces, and folks could just re-flash their phones and start using this with minimal difficulty.

  3. yohoro wrote:

    On Android, in JNI mode, some Java apps call C/C++ libraries. Can Valgrind check those C/C++ libraries for memory errors?

    1. jseward wrote:

      Yes.

  4. yohoro wrote:

    Hi Julian, I want to port Valgrind to my x86 Android phone (yes, it’s x86 Android).
    1. In README.android, there are these lines:
    # Currently the only supported value is: nexus_s
    #
    export HWKIND=nexus_s # Samsung Nexus S
    Is it possible to support my own hardware, and if so, how should I do it?

    2. Also, do you know what “--host” should be in the configure flags if the CPU is x86 rather than ARM?
    ./configure --prefix=/data/local/Inst \
    --host=armv7-unknown-linux --target=armv7-unknown-linux \

    Thanks a lot!