Note: I am doing my measurements and experiments on Fedora 12, once I feel that I understand and can solve the problems on Linux, other operating systems will follow. The aim of this post is to document what I have learned about the mysterious process of loading programs from the filesystem perspective.
A binary is broken up into segments. There are about a half-dozen different segments in an executable, but only two matter here:
- A segment that mostly contains function-bodies and some const data. It’s mapped in read+execute mode
- A segment that contains variable initializers, GOT, PLT, lists of constructors/destructors, etc
The compile-time linker composes segments out of sections that contain variable data, function bodies, etc. The run-time linker maps the segments into memory via mmap(). It resolves references to data and code in dynamic libraries (i.e. performs relocations) by writing into the mapped pages, which are copy-on-write (COW).
IO happens when the program (or the run-time linker acting on its behalf) tries to access memory that hasn’t been read from disk yet. These interruptions are called page faults. They cause the kernel to interrupt the program to read in some multiple of pages (4096-byte chunks) from disk. On my system, page faults trigger 131072-byte (32-page) reads.
For a detailed explanation of how linkers work check out the guides written by experts. Out of the scant list of literature on the subject, my “favourite” is Ulrich Drepper’s “How to Write Shared Libraries”. It actually explains things in terms of file offsets, not just virtual addresses.
A common misconception is that mmap()-caused IO is free (because you don’t issue any explicit read() calls). IO via page faults is just another “API” for file access; there is no reason for it to be free. Another misconception is that one has no control over the IO patterns triggered by mmap(). On Linux-like OSes one can use madvise(). (Windows is more limited; AFAIK one can only set IO-pattern flags on file handles.)
Having the run-time linker fix up the binary causes a huge amount of IO even before any application code gets executed. These faults are visible in the second graph in my previous post. The linker’s faults (pun intended) are the small green rectangles above the big ones. The graph clearly shows the huge relative cost of inefficient run-time linker memory prodding.
Supposedly, this problem is largely solved by prelinking. I can’t confirm that, as prelink does not work on Firefox on the systems where I can measure IO with SystemTap. This non-determinism is frustrating; we should figure out a way to warn the user that the OS infrastructure has failed them.
The above IO patterns can be blamed on careless run-time linker behavior. IO after that can be attributed to a lack of organization in the output of the compiler and the compile-time linker. It turns out that the layout of the .text section (where all of the function bodies lie) and, to a smaller degree, the .data (and .bss, etc.) sections (i.e. variable initializers) is completely unhelpful. For a good laugh, look at how the official nightly libxul.so is read mostly through backwards IO (graph link).
<aside>The libxul.so graphs in the previous post did not exhibit this kind of insanity. I did my best to order functions based on the chronological order of access (courtesy of my valgrind hack). Unfortunately, the chronological log is not deterministic. Someone suggested that statistical fudging via Markov chains would help. From the IO patterns in the graph, I’m guessing that the IO patterns detected by valgrind and those that actually happen diverge due to threading differences. Linux users interested in fast startup should pray to the GCC Santas to reorder binaries as part of Profile-Guided Optimization.</aside>
Are Large Programs Screwed Out of Fast Startup?
Yes, but it doesn’t have to be this way. It turns out this utter disaster is caused by naive use of mmap() in the dynamic linker. The second graph (previous post) shows that even a late madvise() call (delightfully nasty patch) can significantly improve the situation.
Specifying madvise(MADV_WILLNEED) causes individual faults to read in 2097152 bytes (512 pages; 16x larger reads than the default), yields a 3x reduction in the total number of faults (10x if one counts only faults after the madvise() call), and saves about 1 second of startup.
The basic trick is outlined as “approach c” in this bug. My current thinking is to:
- use Jim Blandy’s executable-parsing library from breakpad (which is already in our build) to parse our binaries
- calculate what the dynamic linker will mmap() at runtime.
- have some firefox.sh-like native frontend mmap()/madvise() it with appropriate flags
In the longer term some fixes for madvise() should land in both runtime and compile-time linkers.
It took me a long time to produce the above story from my runtime linker observations. As recently as December I had no idea what a runtime linker was, or what linker segments, sections, etc. were. I’d like to thank the Valgrind authors, the kind folks on #gcc, and the numerous individuals who helped me connect the pieces. The above is written to the best of my still-incomplete understanding; I would appreciate any corrections.