This post is a result of debugging bug 561842. Turns out one needs to go far beyond lumping libraries together to reap startup benefits.
I made a pdf to illustrate the cost centers of loading libxul.so (the essence of Firefox).
With Icegrind I demonstrated that better binary layout can significantly improve application startup. However, I still didn’t have a breakdown of why loading binaries is so damn inefficient. That’s what the above pdf is about.
Loading libxul consists of 4 major phases:
1. Runtime linker setup: mapping segments in, zeroing .bss, loading dependent libraries, etc.
2. Runtime linker relocations.
3. Library initializers.
4. main() and the rest of the application code runs.
I have already blogged about 1, 2 and 4; this post is about #3.
Michael Meeks pointed me at the funny backwards IO pattern in his IO logs. I even made fun of how libxul.so is, by default, read mostly via backwards IO. Once I assigned userspace symbols to my pagefault log, it became clear that the backwards IO pattern was entirely due to library initializers. The C++ compiler generates code that runs at library initialization to initialize globals and run the relevant C++ constructors. In C one can assign the GNU “constructor” attribute to a function to participate in this mayhem.
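To make the C side of this concrete, here is a minimal sketch of a function tagged with the GNU “constructor” attribute. The names `registry_ready`, `init_registry` and `registry_is_ready` are made up for illustration; GCC or Clang is assumed.

```c
/* Hypothetical flag that some later code depends on. */
static int registry_ready = 0;

/* The GNU "constructor" attribute asks the toolchain to call this
 * function while the program or library is being initialized,
 * i.e. before main() starts. */
__attribute__((constructor))
static void init_registry(void)
{
    registry_ready = 1;
}

/* By the time ordinary code runs, the initializer has already fired. */
int registry_is_ready(void)
{
    return registry_ready;
}
```

Any code reached from main() should see `registry_is_ready()` return 1 without ever calling `init_registry()` itself. A C++ global with a non-trivial constructor ends up with the same kind of hidden load-time call.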
Ian Lance Taylor clued me in on why these things run backwards. When one links a program, the object files are laid out sequentially. Static libraries are specified after the code that depends on them. Once an object is linked, the easiest way to make sure that libraries are initialized before their users is to invoke the initializers backwards. The list of initializers is stored in the .ctors section and they are loaded by libgcc.
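The backwards walk can be modelled in a few lines. This is only a toy model, not the actual libgcc code: `ctor_list` stands in for the .ctors section, and the three initializers and the `run_order` log are hypothetical.

```c
#include <stddef.h>

typedef void (*init_fn)(void);

/* Log of which initializer ran in which position. */
int run_order[3];
static size_t run_count = 0;

static void init_app(void)  { run_order[run_count++] = 0; }
static void init_liba(void) { run_order[run_count++] = 1; }
static void init_libb(void) { run_order[run_count++] = 2; }

/* Link order: the application first, then the static libraries it
 * depends on. This array plays the role of the .ctors list. */
static const init_fn ctor_list[] = { init_app, init_liba, init_libb };

/* Walk the list from the last entry to the first, so that the
 * libraries are initialized before the code that uses them. */
void run_ctors_backwards(void)
{
    size_t n = sizeof ctor_list / sizeof ctor_list[0];

    run_count = 0;
    for (size_t i = n; i > 0; i--)
        ctor_list[i - 1]();
}
```

After `run_ctors_backwards()`, `run_order` holds 2, 1, 0: the library linked last runs first and the application's own initializer runs last. If each initializer lives near its object's code, the pages are touched from the end of the file toward the beginning.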
In Mozilla (and likely other C++ codebases) these global initializers are more or less evenly scattered throughout the codebase. By the time main() runs, most of the program has been paged in, in an unfortunately inefficient manner.
Run Faster Please?
The most interesting part about all this is that the compiler toolchain can make a rather precise guess at how a large part of the initial program execution is going to go. To test this theory I wrote my best Mozilla patch ever.
One can place a function near the beginning of the library file and another one at the end (with a “constructor” attribute). The function at the end runs first, so it can figure out the approximate range of memory that will need to be paged in and madvise() it. This results in a 5x reduction in libxul pagefaults. Unfortunately, since constructors execute backwards and readahead runs forwards, constructor execution stalls waiting for readahead, so the speedup is rather hard to detect.
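A rough sketch of the idea, not the actual Mozilla patch: the names `page_floor`, `prefetch_range`, `lib_start_marker` and `prefetch_library` are hypothetical, and the choice of `MADV_WILLNEED` as the advice is my assumption. In a real library the two markers would have to be placed at opposite ends of the file by the build; in a single source file they merely sit next to each other.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose madvise() under strict -std= modes */
#endif
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round addr down to a page boundary; page_size must be a power of two. */
uintptr_t page_floor(uintptr_t addr, uintptr_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Hint that [lo, hi) will be needed soon.
 * Returns 0 on success (or for an empty/reversed range), -1 on error. */
int prefetch_range(const void *lo, const void *hi)
{
    uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = page_floor((uintptr_t)lo, page);
    uintptr_t end   = (uintptr_t)hi;

    if (end <= start)
        return 0;
    return madvise((void *)start, end - start, MADV_WILLNEED);
}

/* Hypothetical marker; the build would need to place it near the
 * beginning of the library. */
void lib_start_marker(void) {}

/* Hypothetical constructor; the build would need to place it near the
 * end of the library so that it runs before the other initializers. */
__attribute__((constructor))
static void prefetch_library(void)
{
    prefetch_range((const void *)(uintptr_t)lib_start_marker,
                   (const void *)(uintptr_t)prefetch_library);
}
```

madvise() requires a page-aligned start address, hence the rounding. If the toolchain happens to place the two markers in the opposite order, the constructor above quietly does nothing.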
Run Forward Faster!
Depressed about my hack failing to make a dent in startup time, I patched gcc to run initializers in forward order (and reversed the function-placement logic in the above patch). Now readahead happened in the same direction as library initialization and my Firefox started 30% faster! I wrapped this up into a standalone gcc patch (speed up any bloated C++ startup with a simple change to the compiler!). Note that this hack reverses the library initialization order discussed above; this happens to not be a problem for Mozilla.
Conclusion: Order Matters!
The linker can reverse the per-library initializers so that initializers within a library run forward while cross-library dependencies are still honoured. That in itself isn’t enough to boost startup without cleverer readahead on the kernel side (or application-side hacks).
It’s weird to have initializers page in most of the binary. An interesting optimization would be to have the compiler transitively mark functions reached by library initialization and place those in a .text.initializers section. Then one could have the linker group the initializers together.
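Until a compiler does this automatically, the marking can be approximated by hand with GCC's `section` attribute. A minimal sketch, assuming GCC or Clang on an ELF platform; the `INIT_TEXT` macro and the `squares` example are made up for illustration.

```c
/* Section name taken from the idea above; tagging is done by hand here. */
#define INIT_TEXT __attribute__((section(".text.initializers")))

static int squares[4];

/* Helper reached only from the initializer, so it is tagged as well. */
INIT_TEXT static void fill_squares(void)
{
    for (int i = 0; i < 4; i++)
        squares[i] = i * i;
}

/* Runs at load time and is emitted into the same section as its helper. */
INIT_TEXT __attribute__((constructor))
static void init_squares(void)
{
    fill_squares();
}

/* Ordinary code stays in the regular .text section. */
int sum_of_squares(void)
{
    int sum = 0;

    for (int i = 0; i < 4; i++)
        sum += squares[i];
    return sum;
}
```

This only labels the code. As far as I know, the default GNU ld script folds `.text.*` input sections into `.text`, so keeping all the tagged functions contiguous in the final binary would still need a custom linker script or a linker change.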
I haven’t made up my mind on how to proceed. This madvise() hack + a simple linker patch could be deployed more easily than icegrind. This hack also appears to be as performant as a static Firefox build + icegrind (due to inadequate kernel readahead without madvise()). Icegrind + libxul.so isn’t quite as efficient. I have a feeling that we’ll end up with a combination of icegrind + some form of the initializer madvise() hack.