Dr Memory: a memory-checking tool for Windows

Valgrind’s Memcheck tool works on Linux and MacOS, but not on Windows. Interestingly, there is something like it for Windows: “Dr Memory”.  Similar in style to Memcheck, Dr Memory is an open source memory checking tool built on top of a JIT-based instrumentation framework called DynamoRIO. It provides essentially identical functionality: detection of invalid memory accesses, uninitialised value uses and memory leaks. Dr Memory claims to be considerably faster than Memcheck, so I was curious to see how it performed.

I recently tried Dr Memory 1.9.0-RC1 on Windows 7, running 32-bit Firefox builds, to see to what extent it can provide coverage for the Windows-specific parts of Gecko.

Installing and getting started isn’t difficult. There are command line flags to direct the output, control the level of instrumentation, specify files listing errors to hide, and so on. As you’d expect.

Despite considerable efforts with Dr Memory, I came away feeling it was a promising tool, but just a bit too hard to use. I encountered two kinds of problems.

Firstly, about half of my Firefox startups ended up spinning. Some of the time, Firefox would start (slowly, of course) and be usable after a couple of minutes. Other runs would spin for an hour or more and still not produce a usable browser. I never figured out why. This seems to be related to the instrumentation, because if I run Firefox uninstrumented on the DynamoRIO core, like Valgrind’s –tool=none, it works reliably.

A second problem was the considerable number of uninitialised memory read errors. I tried out both non-optimised (“/Zi /Od”) and optimised (“/Zi /O2 /Oy- /Ob0”) builds of Firefox.

For the non-optimised builds, Dr Memory reports no invalid accesses and a few uninitialised memory reads, which is what I’d expect. But it’s unusably slow, because the unoptimised build lacks reasonable register allocation, which easily doubles the number of memory accesses that have to be checked.

So my next step was to try an optimised build. This runs a great deal faster. There’s a down side, though: the number of uninitialised memory accesses goes way up. Most of these must be false positives, because they weren’t reported in the unoptimised runs.

I investigated further. It is likely that one source of false positives is Dr Memory’s incomplete description of the Windows system call interface. Valgrind’s description of the Linux syscall interface is itself complex, and it is said that the Windows interface makes the Linux interface look simple. Given that, I’m impressed that Dr Memory works as well as it does.

The other source of false positives appears to be bitfields. Dr Memory tracks the definedness state of each byte of memory using one bit for each byte. Consequently it has no way to accurately model partially initialised bytes, and so must unavoidably either report false positives, or miss real errors, depending on which of the two available shadow states partially initialised bytes are mapped to.

One way to detect probable false-positive bitfield errors in cross platform Gecko code is to check whether Memcheck reports errors at the same places. In many cases it doesn’t. I created a suppressions file, which tells Dr Memory to hide errors I identified as clearly false. A second line of defense is to add extra initialisation code for bitfields purely in order to keep Dr Memory happy. Neither of these are really what one wants to do, though.

The false positive problem seriously compromises Dr Memory’s usefulness on optimised Gecko code, compared to Memcheck. The effect is to create a lot more undefined value errors needing investigation. The situation is exacerbated because Dr Memory doesn’t have an equivalent to Memcheck’s origin-tracking feature, which makes it more difficult to analyse the errors and to determine where, if any, dummy initialisations should be placed.

Dr Memory does have a “light” mode, which restricts it to invalid-address and leak checking only. This increases usability at the expense of losing undefined value checking. If you’re looking for possible heap corruption on Windows, this would be worth a try.

Mochitests are now Valgrind-clean

One of the things I’ve been doing these past few months is running Mochitests on Valgrind on a semi-regular basis — roughly weekly — and trying to drive the resultant error count down to zero. I did this some years back, in the approach to Firefox 4, but the situation has drifted since then.

Anyway, I am pleased to say that Mochitests on 64-bit Linux is once again Valgrind/Memcheck clean. That is, free from invalid memory accesses and uses of undefined values. This involved fixing around 25 bugs, mostly to do with uninitialised values. They fall into three categories:

  • Forgetting to initialise a field in a constructor. Pretty simple once you figure out which constructor is involved.
  • Forgetting to write values to an out-parameter — an ever-popular failure mode. Typically a called function intends to return a value through a pointer to a caller-supplied buffer, and also returns an success/failure indication. But there’s some path through it which returns “success” without writing to the buffer.
  • More complex sequencing errors, in which an object is allocated, and only partly initialised, in the expectation that other fields will be set up later, but before use, using an Init() method or some such. But the sequencing doesn’t quite work out, and Init() is called only after those fields are read.

The majority of the bugs are not to do with memory accesses in invalid places. I suspect most invalid accesses have already been picked up by our ASan runs.

The severity of the bugs varies. Any bug picked up by Valgrind might have serious consequences, as a result of making the computation depend on unknown values, either undefined ones, or ones taken from unknown places in memory. Some of the bugs seem fairly harmless.  Others looked like they could cause user visible weirdness, for example decisions along the lines of “is the download rate sufficient to support this video resolution, or should we request a lower resolution?” At least one would have been an obvious crash had the uninitialised value pulled out of memory been suitably unfavourable.

So, is it truly clean? Well, not entirely, for various reasons.  For one thing this is dynamic testing, of course, so it only covers scenarios that the test set exercises.

Secondly, the high level of C++ churn in the tree means there’s a constant, if low, influx of new memory errors, so we can only really get to “approaching zero” detectable errors. Doing this on the Aurora or Beta branches might help, but that doesn’t fit with our “mozilla-central first” development model.

Thirdly, this tests core Gecko code and the Linux specifics. It unfortunately doesn’t cover Mac specifics, Windows specifics or FirefoxOS specifics, although the latter is in progress.

One noticeable thing, if you play the run-Valgrind-and-fix-what-you-see game long enough, is that we can drive the observable memory errors in our code down to zero. That is, in anything that we build from source. But we can’t do anything about the various proprietary binary blobs we have to live with. Eventually we arrive at a situation where — at least so it seems — our code is cleaner than some of the other libraries it relies on.

Valgrinding Mochitests is quicker than it used to be. Back in the Firefox 4 days, it took about 14 CPU hours. Now it completes in about 5 hours. This is due to a combination of faster hardware (Intel Haswell), Valgrind generally being faster and scaling better for large processes, and having advanced to the point where it’s feasible to test “gcc -O2” compiled code with a near-zero false error rate. It may also be that Gecko is simply faster, giving less work for Valgrind to do.

I’m beginning to run Valgrind tests of FirefoxOS startup on phones, in particular on the Flame. Given how resource-hungry Valgrind is, doing so is something of a challenge, requiring dodging timeouts, memory limits, sandboxes and proprietary driver blobs. But it is just about doable, and various bits of FirefoxOS-specific badness are in the process of getting cleaned up.

From a personal standpoint, this is very satisfying. Memcheck was originally conceived as a tool for finding bugs you didn’t yet know you had, rather than as a post-crash diagnostic tool. So it’s great to see it being used to remove a whole class of bugs from our tree.

We also have toolery (TSan, Helgrind) to find low level data races.  Nathan Froyd and Christian Holler have made good progress in using them to detect races and in getting folks to fix those races, as tracked by metabug 929478. I look forward to the day when we can say that Mochitests is also free from detectable data races.

Valgrind support for MacOS X 10.10 (Yosemite)

Valgrind support on Yosemite has improved significantly in the past few months. It is now feasible to run Firefox on Valgrind on Yosemite. Support for 10.9 (Mavericks) has also improved, but 64-bit Yosemite remains the primary focus of support work.

For various reasons, MacOS is a difficult target for Valgrind. So you’ll find it little slower, and possibly more flaky, than running on Linux. Occasionally, the MacOS kernel will panic, for unknown reasons.  But it does work.

Full details of running Firefox on Valgrind on Yosemite are at https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Valgrind.  For serious use, you need to read that page.  In the meantime, here is the minimal getting-started recipe.

First, build Valgrind. You’ll need to use the trunk, since Yosemite support didn’t make it in time for the current stable (3.10.x) line.

  svn co svn://svn.valgrind.org/valgrind/trunk valgrind-trunk
  cd valgrind-trunk
  ./autogen.sh && ./configure --prefix=`pwd`/Inst
  make -j8 && make -j8 install

Make the headers available for building Firefox:

  cd /usr/include
  sudo ln -s /path/to/valgrind-trunk/Inst/include/valgrind .

Now build your Firefox tree. I recommend a mozconfig containing
these:

  ac_add_options --disable-debug
  ac_add_options --enable-optimize="-g -O2"
  ac_add_options --disable-jemalloc
  ac_add_options --enable-valgrind

And run, being sure to run the real binary (firefox-bin):

  /path/to/valgrind-trunk/Inst/bin/valgrind --smc-check=all-non-file \
   --vex-iropt-register-updates=allregs-at-mem-access --dsymutil=yes \
   ./objdir/dist/Nightly.app/Contents/MacOS/firefox-bin

 

LUL: A Lightweight Unwinder Library for profiling Gecko

Last August I asked the question “How fast can CFI/EXIDX-based stack unwinding be?” At the time I was experimenting with native unwinding using our in-tree Breakpad copy, but getting dismal performance results. The posting observed that Breakpad’s CFI unwinder is around 30 times slower than Valgrind’s CFI unwinder, and looked in detail at the reasons for this slowness.

Based on that analysis, I wrote a new lightweight unwinder library.  LUL — as it became known — is aimed directly at doing unwinding for profiling. It is fast, robust, fairly accurate, and designed to allow a pool of worker threads to do unwinding, if that’s somewhere we want to go. It is also set up to facilitate the space-saving schemes discussed in “How compactly can CFI/EXIDX stack unwinding info be represented?” although those have not been implemented as yet. LUL stores unwind information in a simple, quick-to-use format, which could conceivably be generated by the Javascript JITs so as to facilitate transparent unwinding through Javascript as well as C++.

LUL has been integrated into the SPS profiler, and landed a couple of weeks back.

It currently provides unwinding on x86_64-linux, x86_32-linux and arm-android, using the Dwarf CFI and ARM EXIDX unwind formats.  Unwinding by stack scanning is also supported, although that should rarely be needed. Compared to the Breakpad unwinder, there is a very substantial performance increase, achieving a cost of about 40% of a 1.2 GHz Cortex A9 for 1000 unwinds/second from leaf frames all the way back to XRE_Main().

To use LUL, build with –enable-profiling –enable-optimize=”-g -O2″.  I then start the desktop builds with the following environment variable settings:

  MOZ_PROFILER_INTERVAL=1 MOZ_PROFILER_NEW=1
  MOZ_PROFILER_VERBOSE=1 MOZ_PROFILER_MODE=native

In particular, setting MOZ_PROFILER_MODE=help gives more details.

On Android, a suitable magic incantation is:

  adb logcat -c ; \
  adb shell sh /system/bin/am start -S -n \
    org.mozilla.fennec_sewardj/.App \
      --es env0 MOZ_PROFILER_INTERVAL=1 \
      --es env1 MOZ_PROFILER_MODE=native \
      --es env2 MOZ_PROFILER_NEW=1 \
      --es env3 MOZ_PROFILER_VERBOSE=1 \
      --es env4 MOZ_PROFILER_STARTUP=1 ; \
  adb logcat 2>&1 | tee logfile.txt

What next for LUL? I’d like to implement the space-saving schemes mentioned earlier. But more important, it would be nice to have developers using the SPS/LUL combination, so as to give real-use feedback. That will help to move it forward in the most immediately useful direction.

How compactly can CFI/EXIDX stack unwinding info be represented?

In last week’s episode of The Wonderful World of CFI Unwinding we looked at how fast it could be done.  This week we’re going to look at how compactly the unwind information can be stored.

But I know you want a summary up front, so here it is: I reckon it’s possible to store enough CFI to unwind all of libxul.so on x86_64-linux in just 5.4MB.  At least, if my numbers are not wrong.

Background

Representing CFI and EXIDX unwind information compactly in memory is critical for memory constrained devices running FirefoxOS and Fennec.

The title of this posting is misleading, though.  The question is not “how compactly can it be stored?”, but rather “how compactly can it be stored and still give us the really fast access we need?”.

I did some digging.  Valgrind will, if asked nicely, print out the CFI unwinding rules as it stores them.  The previous posting described how it only stores unwind information for a limited register set — on x86_64-linux, which I’m experimenting with here — %rip, %rsp and %rbp.  It manages to summarise the unwind rule for each address range into just one line of text:

  [0x53eea71 .. 0x53eea84]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)

This says: for %rip values in the range 0x53eea71 to 0x53eea84, first compute

  cfa = %rbp + 16

then

  previous %rip = *(cfa-8)
  previous %rsp = cfa
  previous %rbp = *(cfa-16)

For libxul.so for a recent m-c, built with “-g -O”, there are 890,065 of these records — that is, the compiler gives unwind descriptions for 890,065 different code address ranges in libxul.so.

That sounds like a lot of data.  But those descriptions are immensely repetitive.  Just how repetitive we can see by generating the descriptions, cutting off the address range part and throwing the rest through sort -u:

  valgrind --smc-check=all-non-file --tool=none --trace-cfi=yes " \
     --trace-symtab-patt=*libxul.so*" \
     ./ff-opt-linux/dist/bin/firefox-bin &> logfile

  cat logfile | grep "let cfa=" | cut -c 27- | sort -u | wc

The are just 75 different ones in nearly 900k address ranges!

In hindsight, that shouldn’t be a big surprise.  A description of how to recover the return address, stack pointer and frame pointer must be pretty boring.  Either they’re parked in memory at a small handful of 8-aligned offsets from the stack pointer, or they are the value of one of the previous registers plus or minus a similarly constrained offset.

In other words, the set of descriptions is small because GCC generates only a small set of stack frame layouts, if we restrict ourselves to considering just the parts of the frame needed for unwinding.

I was surprised by these figures, so I also tested the CFI on the main Valgrind executable.  That has 35,430 address ranges containing 160 unique descriptions.  Not quite as striking as the libxul.so case, but not far off.

Storing the address ranges compactly

The obvious storage optimisation is to park the 75 descriptions in a dictionary, the size of which is insignificant, and represent the 890,065 address ranges and dictionary-entry number as compactly as possible.  Then, sort these entries by address range and put them in a flat array, for fast binary search.

How compactly can we represent an (address, length, dictionary-entry-number) triple?  Naively, the address is a 64 bit word, and length and entry number could be 32 bits, giving 16 bytes in total.

That’s way overkill.  Since we’ll have one table per mapped object, the base address can be replaced by the offset from the base of the object.  libxul.so has about 36MB of text, so unfortunately that’s 4 bytes.  The address range lengths are tiny, though, mostly less than 256, so we could store than in a byte, and duplicate the descriptor for the occasional longer run.  And the dictionary entry number in the two cases I tested would fit in 8 bits.

So that’s 6 bytes per description.  For 890,065 descriptions, 5,340,390 bytes in total.

It would be interesting to see if the same level of duplication occurs for CFI and EXIDX on ARM.  But given that it merely reflects the non-diversity of frame layouts, I find it hard to believe we’d see anything much different.

EXIDX contains less information than CFI, yet — as experiments on Fennec/ARM have shown — it gives good unwinding results.  So this analysis surely applies equally to our EXIDX-based targets, Fennec/ARM and FirefoxOS/ARM.

How fast can CFI/EXIDX-based stack unwinding be?

This is a long post.  Here’s a summary.

Valgrind unwinds CFI typically 30 times faster than Breakpad.  Bad though that sounds, it’s really encouraging.  If we can get even half that speedup — that is, 15 times faster than at present — CFI/EXIDX unwinding in SPS will be a lot more useful than it currently is.

There are a number of reasons for the discrepancy, which the post discusses in detail.

Background

For some time now, we’ve been producing nightlies that have the ability to time-profile themselves using the built-in SPS profiler.  SPS is a statistical sample-based profiler that samples specified (C++) threads by unwinding their stacks using a variety of mechanisms.

In the past couple of months, Fennec and Linux nightlies have acquired the ability to take samples by unwinding the stacks using the same “native” unwind information that would be used for diagnosing a crash with (for example) GDB.  On Linux that information is Dwarf2 CFI, and on ARM/Fennec we use the ARM-specific EXIDX format.  The unwinding is done using the Google Breakpad library that lives in our tree.

Breakpad works well in the sense that it unwinds reliably through both libxul, system libraries and, apparently, JIT-produced code.  But it’s slow.  After considerable efforts at speeding it up, including some as-yet uncommitted changes (bug 893542 plus other hacks), Breakpad can unwind at a cost of about 6,600 instructions per frame for x86_64/Linux, and probably a bit worse for ARM/Android.

That’s expensive — for a typical 18 frame unwind that comes to around 120,000 instructions.  On a desktop processor that can sustain perhaps 2000 million instructions/second, that means we’d saturate one core at about 16,500 unwinds/second.  Supposing we’d like 90% of the CPU to be used for real work, not unwinding.  Then we have budget for only 1650 unwinds/second, which is pretty hopeless considering we want to sample multiple threads at a 1 KHz rate.  On a low end phone, the problem is an order of magnitude worse.

I spent quite some time profiling Breakpad’s core CFI unwind loop on x86_64/Linux.  One thing that becomes very clear from this is that Breakpad is designed for generality, flexibility and correctness, but it is not designed for fast in-process unwinding.

Valgrind also does CFI based unwinding, and has been highly tuned over the years, particularly to support Helgrind, which is very unwind-intensive.  I wondered how it compared, so I profiled it — a bit of a tricky exercise, running a big test app on an inner Valgrind (which does a lot of unwinding) on an outer Valgrind, running Callgrind, to get profile data.

The numbers really surprised me.  Valgrind’s unwinder achieves about 220 instructions/frame.  That’s 30 times faster than my best Breakpad results so far.  Why the huge difference?  Surely some mistake?  I started digging.  First, though, a look at the algorithm.

The core CFI unwinding algorithm

The core algorithm is simple to understand.  We start off with registers taken from the innermost frame — as a minimum, three: the program counter, the stack pointer and the frame pointer.  The CFI data provides, for each possible instruction, a set of rules which say how recover the register values in the calling frame.  So we apply those rules to our three registers, and repeat.  The stack trace we’re after is then the sequence of PC values resulting.

One thing to note is that, although we only want the program counter values, we need to compute values for multiple registers, including at the very least the stack pointer.  That’s because return addresses — hence, previous program counter values — are typically in memory at some SP offset, and so we need to know the SP values.

Diehard CFI-heads will recognise that the above description omits a lot of details.  Nonetheless it encapsulates what the unwinder needs to be fast at.  Viz:

   (PC, SP, FP) = <values taken from CPU registers at start of unwind>

   repeat
      output_array[index++] = PC
      current_ruleset = Lookup_in_huge_table(PC)
      new_PC = evaluate_rule(current_ruleset.rule_for_PC,  PC, SP, FP)
      new_SP = evaluate_rule(current_ruleset.rule_for_SP,  PC, SP, FP)
      new_FP = evaluate_rule(current_ruleset.rule_for_FP,  PC, SP, FP)
      PC = new_PC ; SP = new_SP; FP = new_FP;
   until
      end of stack reached, or we have enough frames

The rule for each register is usually simple, having one of the following two forms:

   new_REG =                 old_PC (or old_SP or old_FP) + constant
   new_REG = read-memory-at( old_PC (or old_SP or old_FP) + constant )

For example, it might well be the case that, at some point

   new_PC = read-memory-at( old_SP + 64 )

if we know that the return address is 64 bytes back up the stack.

Implementation differences

With that in mind, the differences between the Breakpad and Valgrind implementations are as follows:

  • Lookup_in_huge_table(PC)

Valgrind maintains a 509-entry direct mapped cache containing previously used rule-sets.  If we are doing a lot of unwinds, most of them will involve the same few hundred instructions, because the outer parts of the traces are identical or very similar.  It achieves a near 100% hit rate in practice.  Hence finding the entry comes down to computing “PC % 509”, a couple of loads and a conditional branch to check we’ve got the right entry.

Breakpad has no such cache.  When bug 893542 lands it will have at a std::map-based cache.  That’s a big improvement on no cache, but it’s still not anywhere near as good as a direct-mapped cache, since any lookup in the cache involves a std::map.find(), which means a red-black tree lookup,  costing several hundred instructions.

Breakpad has another disadvantage: its architecture forces the cache test to be placed later than is optimal.  Valgrind loads the unwind rules for all shared objects at process startup.  Breakpad only reads debuginfo for an object at the first visit of the object.  Hence Breakpad’s unwind loop must first, inefficiently, check each PC value to find out whether it needs to read debuginfo for the containing shared object.  This alone adds a couple of hundred instructions per frame.

  • Representation of the rules

Breakpad is nothing if not flexible.  The set of registers it is prepared to unwind is not fixed.  Instead it depends on whatever registers the compiler-generated CFI provides unwind rules for. Hence Breakpad produces, at each unwind step, new values for all the integer registers that have a CFI unwind rule available at that location.

Valgrind, on the other hand, just unwinds a fixed set of registers which are known to be important in practice.  On x86_64-linux, the set is %rip, %rsp and %rbp.  On arm-linux: r15, r14, r13, r12, r11 and r7.  If unwinding needs the value of some other register it’s just too bad — unwinding stops and it’s Game Over.  But this never appears to be a problem in practice.

This disadvantages Breakpad in two ways.

Firstly, Breakpad unwinds more registers than Valgrind does, because the in-file CFI typically gives unwind rules for more registers.  On x86_64-linux, for example, the CFI rules often give unwind info for %r15, %r14, %r13, %r12 and %r11, which Valgrind simply ignores.

Secondly, Valgrind’s use of a fixed register set makes its data structures much more efficient.  Breakpad’s unwind loop has to maintain a std::map of register names to register values for all the currently tracked registers, and a similar map for register names to recovery rules.  Hence it spends much time iterating over, and looking up in, these mappings.  Despite efforts to make these lookups efficient, they can’t come close to Valgrind’s compiled-in (C-style) structs for register and rule sets, that require no lookups or iteration.

  • Dynamic memory allocation

Valgrind unwinds with no dynamic memory allocation and dumps the resulting PC values in a known-to-be-large-enough caller-supplied array.

Breakpad constructs a std::vector of StackFrame objects.  Each frame therefore requires a heap allocation — and, eventually, deallocation, which just by themselves come to more than Valgrind’s total per-frame cost.  The 6,600 insn/frame cost is after I did some experimental restructuring to avoid these allocations.

  • Valgrind cheats

Valgrind shortcuts in two ways, which makes the comparison slightly unfair.

As described above, Breakpad unwinds all integer registers for which CFI rules are available.  Valgrind unwinds a fixed and generally smaller subset.  This doesn’t appear to make any difference in practice.

Breakpad tracks the validity state of each register.  This helps make unwinding more reliable and is needed to support transitioning between different frame recovery methods (CFI unwind, frame-pointer following, stack scanning) on a per-frame basis.  Really, Valgrind ought to do that too.  It will give some extra overhead, but not a lot — perhaps increasing the per-frame cost by 50%.

SPS profiler backend news, 12 August 2013

In short: more adaptation to real-use cases, and more unwind speedups.

  • Benoit redid time-unit measurements so as to support profiling with sampling intervals of less than one millisecond.  This will help profiling graphics and animation code (820048).
  • Gijs Kruitbosch made it possible to control the size of the profiler’s circular buffer using the MOZ_PROFILER_ENTRIES environment variable (901481).
  • :roc and :avih taught the compositor to notify the profiler of missed (graphics) frames, so as to help monitor and debug smoothness issues (900785).
  • glibc backtrace() unwinding on Linux was removed, as it risked deadlock, and the breakpad unwinder made it redundant (880158).

There was work to improve the performance of CFI and EXIDX native unwinding, by improving performance of the data structures accessed by the CFI/EXIDX unwind algorithm:

  • speedups for GetModuleForAddress — replaced a linear search with a binary search (892774).
  • don’t generate useless frames via stack scanning, which turned out to be happening even though we thought stack scanning was disabled (894264).
  • the unwinder spends considerable effort repeatedly re-generating unwind rules for the same relatively small set of code addresses.  Caching them makes a big difference (893542).

With the above fixes in place, native unwinding performance for x86_64-linux costs about 8100 instructions per frame, or around 145k instructions/unwind for an average 18-frame unwind.

There are still opportunities to remove inefficiency, but they are becoming scarcer.  It might be possible to get to about 5000 insns/frame without too much extra effort, but below that  could be difficult.

 

SPS Profiler backend news, 8 July 2013

There’s been quite some progress since the last report, focussed on Firefox for Android.

  • Anton Kovalyov landed changes to the GUI that show the event timelines for multiple threads in a synchronised way.  The multithreaded aspects of the profiler are stabilising nicely.
  • Multi-thread profiling support is ready.  Select ‘Multi-Thread’ when profiling to watch any registered thread.  For now a local build is recommended to tweak which threads you want to profile.
  • Android Java profiling support is ready.  Set ‘profiler.java’ to true on the host desktop machine before connecting to your phone. A new timeline will appear that will represent main java thread execution.
  • Native unwind for Android nightlies is now available.  Select ‘breakpad’ in the GUI on the host desktop machine when connecting to your phone.  With this, full symbolicated profiles are available.
  • Native unwind on Android uses the ARM-specific EXIDX unwind format.  Addition of it to the nightly .apk increases size by about 2.5MB (10% ish), which is not a bad result for near-complete unwindability of libxul and the other libraries.
  • There was some additional work to improve the EXIDX unwind performance and stability to usable levels (883126, 882903).  Sampling beyond 200Hz is viable on a dual-core Cortex A9.  Investigations are in progress to improve unwind speed and reduce memory use, to levels that would make it useful for Firefox OS.

Profiler backend news, 3 June 2013

Since the last report, work has focused on making multithreaded profiling more usable, using ARM EXIDX unwind data for Android nightlies, and more analysis of unwind failures.  The latter two lines of investigation happened concurrently because there was some doubt as to how good EXIDX is, so I wanted to find out to what extent unwind failures were caused by EXIDX limitations as opposed to other problems.

Multithreaded profiling:

  • Benoit worked on facilities to select threads to be profiled by name (873914) and improved the names of web worker threads (865885).
  • Benoit also improved controllability of the profiler using environment variables (873915), which might be important for profiling B2G.

Unwind quality:

  • I fixed a problem (872496) caused by the way threads register their stacks for profiling.  They need to do it earlier.  This problem affects not only EXIDX unwinding, but also CFI and frame-pointer too.  I also noticed that native unwinds for Linux nightlies had regressed recently, for the same reason.
  • The simple unwind-stats-gathering patch that landed a while back  (863705) proved very useful in figuring this out.

EXIDX:

  • There was investigation into whether EXIDX is a suitable replacement for the Dwarf CFI format on Android.  EXIDX is an ARM-specific unwind data format, and has the advantage of being compact.  Including complete EXIDX unwind information in Android nightlies  increases the .APK size from 28.8MB to only 30.7MB, which is an excellent result.

The concern with EXIDX is that it is designed for unwinding only at  points where C++ exceptions could possibly be thrown.  It is not designed for the “anywhere, anytime” unwinding that Dwarf CFI provides, and so is not a drop-in replacement for CFI.  In particular, EXIDX does not provide correct unwind data in function prologues and epilogues, so if the thread being sampled happens to be in either of those places, we’re out of luck.

Despite that, EXIDX unwinding does appear to work pretty well.

It might be possible to get around the prologue/epilogue problem by fiddling with the topmost frame’s PC so as to move it away from such areas.  I haven’t investigated this yet.

  • Breakpad doesn’t actually handle EXIDX right now.  I picked up an EDIDX-reading patch originally from Linaro, with an initial port into Breakpad (863475) by TedM.  The patch required some rework but is now close to being landable.
  • Turns out NSPR was not being built with EXIDX generation, so all stack traces stopped at the first NSPR frame.  872649 fixes this.

Profiler backend news, 30 April 2013

There has been a lot of activity since the previous report, mostly along three lines of development: multithreaded profiling, getting profiling working for Android nightlies, and improving the quality of the stack traces.

We now have native unwinds working out-of-the-box for 32- and 64-bit Linux nightlies, and very nearly working for Android nightlies.

Multithreaded profiling:

  • Benoit landed a big change (734691) to the profiler backend, that makes it able to profile multiple threads.  Work is in progress (861863) to update the Cleopatra GUI to match, but it has not yet landed.  Early adopters are welcome to try out this new functionality.
  • Benoit also landed 788022, which adds support for profiling Java code running on the Dalvik VM on Android.

Getting profiling working for Android nightlies:

  • On Android, we landed plumbing code that allows Breakpad to read debug info direct out of APK files, by interfacing with faulty.lib (802240, 861141).
  • On Android, we landed a 1-liner that enabled profiling on nightlies (863264, 863375).  Then the fun started.  A few reftests started failing, so the patch was backed out.  After quite some digging, it turns out that this change had the effect of changing the compile flags from “-O2” to “-O2 -fno-omit-frame-pointer”, and gcc-4.6.2 wound up miscompiling gfxFont::RunMetrics::RunMetrics() in such a way as to copy 16 bytes of uninitialised stack-allocated garbage to the start of the object it is initialising.  Nathan Froyd suggested and landed a workaround.

Improving the quality of native stack traces:

  • I made a simple change that periodically shows counts of how many frames were recovered by CFI data, how many by following frame pointers, and how many by stack scanning (863705).  Using this, it is finally possible to get some idea of how well Breakpad is doing and why it sometimes produces results that are poorer than we expect.
  • Using the 863705 patch, I investigated unwinding behaviour on both 32- and 64-bit Linux, and concluded that we are doing well there.  In particular, Breakpad is able to unwind using frame pointers on 32-bit Linux, which had up to that point been somewhat in doubt. Any incomplete stack traces on that platform are due to system libraries which have been compiled without frame pointers and for which there is no CFI available.  There’s nothing we can do about them, except for resorting to stack scanning.  As discussed in the previous posting, stack scanning gives poor results and is disabled by default.
  • Also on the theme of diagnosing unwind problenms, I modified SPS so as to produce CFI coverage statistics (859775).  This prints extra information at debug info load time, indicating how much CFI was read and how much address range it covers.  I had hoped to also be able to get the size of the relevant .text segment, so as to enable printing messages of the form “available CFI covers 85.7% of the text segment”.  This unfortunately didn’t work out due to the difficulties of getting the segment size.
  • The idea of enhancing Breakpad to use EXIDX unwind info on ARM came back to life and was generally well received.  TedM refreshed his patch (863475) and it is now waiting for further resync with our local Breakpad changes.

Misc other fixes:

  • A problem causing Firefox to livelock when starting any external program while the profiler is running (837390) was fixed.
  • As it stands, SPS/breakpad will unwind up to 1024 stack frames before stopping.  That can happen if the stack is corrupted.  In the worst case this can potentially waste a lot of unwinder time, so we installed a 256 frame limit (859745).