27
May 10

Startup: Backward Constructors

This post is a result of debugging bug 561842. Turns out one needs to go far beyond lumping libraries together to reap startup benefits.

I made a pdf to illustrate the cost centers of loading libxul.so (the essence of Firefox).

With Icegrind I demonstrated that better binary layout can significantly improve application startup. However I still didn’t have a breakdown of reasons of why loading binaries is so damn inefficient. That’s what the above pdf is about.

Loading libxul consists of 4 major phases:

  1. Runtime linker setup: mapping segments in, zeroing .bss, loading dependent libraries, etc
  2. Runtime linker relocations.
  3. Library intializers.
  4. main() and the rest of application code runs

I blogged about 1, 2, 4, this post is about #3.

Library Initializers?

Michael Meeks pointed me at the funny backwards IO pattern in his IO logs. I even made fun of how by default libxul.so is read mostly via backwards IO. Once I assigned userspace symbols to my pagefault log, it became clear that the backwards IO pattern was entirely due to library initializers. C++ compiler generates code that runs on library initialization to initialize globals and run relevant C++ constructors. In C one can assign a “constructor” GNU attribute to a function to participate in this mayhem.

Running Backwards?

Ian Lance Taylor clued me in on why these things run backwards.When one links the program, the object files are laid out sequentially. Static libraries are specified after the code that depends on them. Once an object is linked, the easiest way to make sure that libraries are initialized before their users is to invoke initializers backwards. The list of initializers is stored in the .ctors section and they loaded by libgcc.

In Mozilla (and likely other C++ codebases) these global initializers are more or less evenly scattered throughout the codebase. By the time main() is run, most of the program has been paged in an unfortunately inefficient manner.

Run Faster Please?

The most interesting part about all this that the compiling toolchain can make a rather precise guess at how a large part of the initial program execution is going to go. To test this theory I wrote my best Mozilla patch ever.

One can place a function near the beginning of the library file and another one at the end (with a “constructor” attribute). The function at the end runs first and it can figure out the approximate range of memory that will need to be paged in and madvise() it. This results in a 5x reduction in libxul pagefaults. Unfortunately since constructors execute backwards and readahead forwards, the constructor execution stalls to wait for readahead, so the speedup is rather hard to detect.

Run Forward Faster!

Depressed about my hack failing to make a dent in startup time I patched gcc to run initializers in a forward order (and reversed the function-placement logic in above patch). Now readahead happened in the same direction as library initialization and my Firefox started 30% faster! I wrapped this up into a standalone gcc patch (speed up any bloated C++ startup with a simple change to the compiler!). Note this hack reverses the library initialization order discussed above, this happens to not be a problem for Mozilla.

Conclusion: Order Matters!

The linker can reverse the per-library initializers such that initializers run forward, but cross-library dependencies are honoured. That in itself isn’t enough to boost startup without cleverer readahead on the kernel side (or application-side hacks).

It’s weird to have initializers page in most of the binary. An interesting optimization would to have the compiler transitively mark functions reached by library initialization and place those in a .text.initializers section. Then one could have the linker group the initializers together.

Plans

I haven’t made up my mind on how to proceed. This madvise() hack + a simple linker patch could be deployed more easily than icegrind. This hack also appears to be as performant as a static firefox build + icegrind (due to inadequate kernel readahead without madvise()). Icegrid + libxul.so isn’t quite as efficient. I have a feeling that we’ll end up with a combination of icegrind + some form the initializer madvise() hack.

Continue reading →


24
May 10

Teething Troubles: Assigning blame for pagefaults

I have been able to get precise filed-backed page fault logging(systemtap on Linux, xperf on Windows) for a while. It is incredibly useful to see exactly how Firefox is being loaded from disk. From there one I deduce what is causing the IO, try to make improvements and measure if I accomplished anything.

Unfortunately, a mere IO log requires a lot of pondering of why IO is happening. It would be so much easier if one could just get a report of every IO operation + an application backtrace to easily identify the cause. I was having trouble figuring out why some of my optimizations were not having the impact I expected, so i embarked on adding a backtrace to my log.

XPerf Fail
I fed xperf my Firefox symbols hoping this would plop stack traces next to my faults, but no such luck. It records backtraces in just about every probe, except for the “hard faults” probe I care about. I wonder if a custom perf probe could log what I want.

Perf Fail

Some prominent kernel hackers have long been complaining about OProfile/SystemTap/NIH performance monitoring tools. They finally produced a perf tool (It’s like they tried to make it hard to google for. It does not have a proper website; Real men read the source and skim LKML archives) to do profiling the Linux way(tm). I might be wrong, but so far it appears to be a functional equivalent of Microsoft’s xperf minus the nice UI.

Turned out that my Fedora 2.6.32 perf implementation is too buggy to even log pagefaults. Apparently this works in the current Linus kernels. I’m not completely sure, but looks like even if xperf pagefault logging worked, it’s pretty neutered. It does not appear that it can log file offsets next to pagefaults, nor stacks.

I think perf could be fixed to log io and accompanying userspace backtraces. There are some talented folks contributing to it. However I think that the pre-canned analysis model sucks. It is useful for building sophisticated versions of top, but when you really need to dig into what’s causing a particular issue, it really sucks to be restricted by what the developers foresaw as useful.

SystemTap

As awesome as the kernel side of SystemTap is, I keep running into userspace bugs and limitations. Getting userspace stacks for large collections of large libraries that Firefox relies on has been a systemtap-bug-finding affair. I can occasionally get useful userspace tracks for userspace probes, but apparently recording a userspace stack from a kernel probe is a hard problem that SystemTap devs haven’t fully addressed yet.

Luckily SystemTap provides a uaddr() function which appears to get correct addresses from my kernel probes(which is way more than the other tools offer). Unfortunately usymname() fails to resolve those addresses.

As a workaround, Jim Blandy suggested turning off address-space randomization so I can log uaddr() and resolve the values in gdb. I’ve been manually printing this with gdb’s “p/a <addr>” command until recently.

Success?

Then it dawned on me that I can use python gdb-scripting to automatically post-process the log. So now with a combination of a systemtap io logging script and a hacky gdb python script I can produce logs like this.

I still don’t have backtraces, but at least now I have the name of the function that’s causing trouble. This is surprisingly useful already. One can now easily tell how much of startup is being wasted on relocations(dlopen() in a prelinked binary!). Another obvious one is the harm of single-page COW faults to zero .bss (memset entries in the log). Turns out sprinkling initializers all over the binary is a bad idea. Looks like there are significant performance wins to be had with a bit of ‘easy’ compiler/linker hacking.

All of the above problems are really obvious and would’ve been fixed a long time ago if it was easier to get at this information. Unfortunately, there is still a lot of room for improvement in developer tools.

Update:

Sounds like I can use addr2line instead of gdb.