This is a long post. Here’s a summary.
- the good news: Valgrind’s Memcheck tool now works well enough on Nexus S that it can run Firefox and find real bugs, for example bug 688733. The false error rate from Memcheck is pretty low, certainly low enough to be usable.
- as of this morning, all required Valgrind fixes are in the Valgrind SVN trunk repo.
- the bad news: this requires building a custom ROM and kernel for the Nexus S, that is to say, mucho hoop jumping. Not for the faint hearted.
The rest of this post explains what the constraints are and approximately what is necessary to get started.
The main difficulty is the need to build a custom ROM and kernel. There are three reasons for this:
- To have a swap enabled kernel. Starting a debuggable build of Firefox on Memcheck gives a process size of around 800MB. Without swap, it gets OOM-killed at around 270MB. But the default kernel isn’t configured for swap, hence a rebuild is required, as per Mike Hommey’s instructions. Once in place, I gave the kernel a 1GB swap file parked in /sdcard, and that seemed to work ok.
- Libraries with symbols. Memcheck needs to have symbol information for /system/lib/libc.so and /system/bin/linker (libc and the dynamic linker). Without these it generates huge numbers of false error reports and is unusable. Symbols for the other libraries are nice (better stacktraces) but not essential.
- To make it possible to start Firefox on Valgrind. Valgrind can only insert itself into a process at an exec() transition, that is, when the process starts from new. On normal Unixes this is no problem, since the shell from which you start Valgrind does the normal fork-exec thing. But application start on Android is completely different, and doesn’t involve exec().
Instead, there is a master process called the Zygote. To start an application (eg Firefox), a message is sent via a socket to the Zygote. This creates a child with fork(), and the child then goes on to load the relevant bytecode and (presumably) native code and “becomes” Firefox. So there’s no exec() boundary for Valgrind to enter at.
Fortunately the AOSP folks provided a solution a couple of months back. They modified Zygote so that it can start selected processes under the control of a user-specified wrapper, which is precisely the hook we need. The AOSP tree now has this fix.
Overview of getting started
Here’s an overview of the process. It doesn’t contain enough details to simply copy and paste, but it does give some idea of the hoop jumping that is unfortunately still required.
Download sources and build Android images, as per directions at http://source.android.com/source/building.html. This in itself is a major exercise. The relevant “lunch” flavour is full_crespo-eng, I think. At the end of this stage, you’ll have (amongst other things) libraries with symbols and a wrapper-enabled Zygote. But not a swap enabled kernel.
Build a swap enabled kernel as per Mike Hommey’s instructions, and incorporate it into the images built in the previous stage. In fact, I skipped this step — Mike kindly did it for me.
Push the images onto the phone, reboot, check it’s still alive.
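For the “check it’s still alive” part, one thing worth confirming is that the new kernel really does have swap support. Here is a minimal sketch, assuming /proc/swaps is only present on kernels built with swap enabled. The function name kernel_has_swap is made up for illustration.

```shell
# Return success if the running kernel exposes a readable /proc/swaps.
# The optional argument only exists so the check can be pointed at a
# different path when trying the function out off-device.
kernel_has_swap() {
    swaps_file=${1:-/proc/swaps}
    [ -r "$swaps_file" ]
}

# Usage, in a shell on the phone:
#   kernel_has_swap && echo "swap support present" || echo "no swap support"
```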
Check out a copy of the Valgrind trunk from svn://svn.valgrind.org/valgrind/trunk, and build as described in detail in README.android. If you complete that successfully, you’ll have a working installation of Valgrind on the phone at /data/local/Inst/bin/valgrind.
Install busybox on the phone, to make life a little less austere in the shell.
On the Linux host, generate a 1GB swap file and transfer it to /sdcard on the phone (that’s the only place it will fit). Then enable swapping:
    cat /proc/swaps
    /data/local/Bin/busybox swapon /sdcard/swapfile1G
    cat /proc/swaps
Note you’ll have to manually re-enable swapping every time the phone is rebooted.
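The host-side half of this step, generating the swap file, is not spelled out above. Here is a minimal sketch, assuming a Linux host with dd and mkswap available. make_swapfile is a hypothetical helper name, and the adb push line in the usage comment is an assumption about how you transfer the file.

```shell
# Create a zero-filled file of the given size (in MiB) and, if mkswap is
# available, write a swap signature into it so that swapon will accept it.
make_swapfile() {
    swapfile=$1
    size_mb=$2
    # bs is given in bytes so this does not depend on GNU dd's "1M" suffix.
    dd if=/dev/zero of="$swapfile" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    chmod 600 "$swapfile"
    if command -v mkswap >/dev/null 2>&1; then
        mkswap "$swapfile" >/dev/null
    else
        # Without mkswap the file is left unformatted; warn rather than guess.
        echo "mkswap not found; format $swapfile before pushing it" >&2
    fi
}

# Usage, on the host:
#   make_swapfile swapfile1G 1024
#   adb push swapfile1G /sdcard/swapfile1G
```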
Copy the contents of out/target/product/crespo/symbols/system from the host to /sdcard/symbols/system on the phone. These are the debuginfo objects for the system libraries. Valgrind expects them to be present, as per comments above, so it can read symbols for libc.so and /system/bin/linker. This will copy far more than that, which is not essential but nice for debugging.
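Since Memcheck is unusable without symbols for libc.so and the linker, it may be worth checking that both are in the symbols tree before copying it over. A minimal sketch follows, assuming the symbols tree mirrors the /system layout (lib/libc.so and bin/linker). check_symbols is a hypothetical helper, and the adb push in the usage comment is just one way of doing the copy.

```shell
# Report any of the two essential debuginfo objects missing from a symbols
# tree. Returns 0 if both are present, 1 otherwise.
check_symbols() {
    symroot=$1
    missing=0
    for f in lib/libc.so bin/linker; do
        if [ ! -f "$symroot/$f" ]; then
            echo "missing: $symroot/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, on the host, from the top of the Android tree:
#   check_symbols out/target/product/crespo/symbols/system \
#       && adb push out/target/product/crespo/symbols/system /sdcard/symbols/system
```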
Build a Firefox you want to debug. That of course means with line number info and with the flags --disable-jemalloc --enable-valgrind. I strongly suggest you use “-O -g” for a good compromise between speed and debuggability. When the build finishes, ask the build system to make an .apk file with the debug info in place, and install it. The .apk will be huge, about 125MB:
    (cd $objdir && make package PKG_SKIP_STRIP=1)
    adb install -r $objdir/dist/fennec-9.0a1.en-US.eabi-arm.apk
We’re nearly there. We have a device which is all set up, and a debuggable Firefox on it. But we need to tell Zygote that we want to start Firefox with a wrapper, namely Valgrind. In the shell on the phone, do this:
setprop wrap.org.mozilla.fennec_sewardj "logwrapper /data/local/start_valgrind_fennec"
This tells Zygote that any startup of “org.mozilla.fennec_sewardj” should be done via an exec() of /data/local/start_valgrind_fennec applied to Zygote-specified arguments. So, now we can put any old thing in a shell script, and Zygote will run it. Here’s what I have for /data/local/start_valgrind_fennec:
    #!/system/bin/sh
    VGPARAMS='--error-limit=no'
    export TMPDIR=/data/data/org.mozilla.fennec_sewardj
    exec /data/local/Inst/bin/valgrind $VGPARAMS $*
Obviously you can put any Valgrind params you want in VGPARAMS; you get the general idea. Note that this is ARM, so you don’t need the --smc-check= flag that’s necessary on x86 targets.
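One detail of the wrapper script: it forwards Zygote’s arguments with an unquoted $*, which re-splits any argument containing whitespace. That is probably harmless for the arguments Zygote passes (an assumption, I have not checked them all), but the difference is easy to see in isolation. The function names below are made up for the demonstration.

```shell
# Print how many arguments were received.
count_args() { echo "$#"; }

# Forward arguments the way the wrapper does: unquoted $* is re-split on
# whitespace, so an argument containing a space becomes two.
forward_star() { count_args $*; }

# Forward arguments with "$@", which keeps each original argument intact.
forward_at() { count_args "$@"; }

forward_star "--flag" "two words"   # prints 3
forward_at   "--flag" "two words"   # prints 2
```

If that matters for your setup, the last line of the wrapper could instead end in "$@", leaving $VGPARAMS unquoted so that several flags still split into separate words.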
Only two more hoops to jump through now. One question is where the Valgrind output should go. Initially I tried using Valgrind’s little-known but very useful --log-socket= parameter (described in the Valgrind user manual), but it seemed to crash the phone on a regular basis.
So I abandoned that. By default, Valgrind’s output winds up in the phone’s system log, mixed up with lots of other stuff. In the end I wound up running the following on the host, which works pretty well:
    adb logcat -c ; adb logcat | grep --line-buffered start_valgrind \
      | sed -u sQ/data/local/start_valgrind_QQg | tee logfile.txt
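To see what that filter does without a phone attached, the grep and sed stages can be wrapped in a function and fed a couple of lines by hand. This is a sketch assuming GNU grep and sed (for --line-buffered and -u). The sample log lines are invented for illustration and are not real logcat output, and the sed expression uses | as the delimiter rather than Q, which is equivalent.

```shell
# Keep only lines mentioning the wrapper script, then strip the long
# "/data/local/start_valgrind_" prefix so the lines are easier to read.
filter_valgrind_log() {
    grep --line-buffered start_valgrind \
        | sed -u 's|/data/local/start_valgrind_||g'
}

# Two made-up input lines: one from the wrapper, one unrelated.
printf '%s\n' \
    'I//data/local/start_valgrind_fennec( 2013): ==2013== Invalid read of size 4' \
    'D/dalvikvm(  100): GC_CONCURRENT freed 1K' \
    | filter_valgrind_log
# prints: I/fennec( 2013): ==2013== Invalid read of size 4
```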
And finally .. we need to start Firefox. Now, due to recent changes in how the libraries are packaged for Android, you can’t start it by pressing on the Fennec icon (well, you can, but Valgrind won’t read the debuginfo.) Instead, issue this command in a shell on the phone:
am start -a org.mozilla.gecko.DEBUG -n org.mozilla.fennec_sewardj/.App
This requests a “debug intent” startup of Firefox, which sidesteps the fancy dynamic unpacking of libraries into ashmem, and instead does the old style thing of unpacking them into /data/data/org.mozilla.fennec_sewardj. From there Valgrind can read debuginfo in the normal way.
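The package name org.mozilla.fennec_sewardj is specific to my build, and it turns up in the wrap property, in the am start command, and in TMPDIR in the wrapper script. For a different build, a small helper can print the two device-side commands with the right name filled in. This is only a sketch: print_start_commands is a hypothetical name, and it assumes the activity is still called .App.

```shell
# Print the setprop and am start commands for a given package name and
# wrapper script path. Nothing is executed; the output is meant to be
# pasted into a shell on the phone.
print_start_commands() {
    pkg=$1
    wrapper=$2
    echo "setprop wrap.$pkg \"logwrapper $wrapper\""
    echo "am start -a org.mozilla.gecko.DEBUG -n $pkg/.App"
}

print_start_commands org.mozilla.fennec_sewardj /data/local/start_valgrind_fennec
```

Remember that TMPDIR in the wrapper script needs the same package name.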
One minor last hint: run “top -d 2 -s cpu -m 19”. Because Valgrind runs slowly on the phone, I’m often in the situation of wondering am-I-waiting-for-it or is-it-waiting-for-me? Running top pretty much answers that question.
And .. so .. it works! It’s slow, but it appears to be stable, and, crucially, the false error rate from Memcheck is low enough to be usable.
So, what’s next? Writing this reminded me what a hassle it is to get all the ducks lined up right. We need to streamline it. Suggestions welcome!
One thing I’ve been thinking about is to avoid the need to have debuginfo on the target, by allowing Valgrind to query the host somehow. Another thing I plan to do is make the Callgrind tool work, so we can get profile information too.