I am going to start with numbers to give an idea of the magnitudes involved here. I’m still using my 1.8GHz Core 2 Duo laptop with a 7200rpm 200GB hard drive.
| Build | Time (ms) | # of libxul.so reads |
| Typical Firefox build | 2300 | 147 |
| Prelink-Ordered + Prelinked Firefox | 1636 | 66 |
Additionally, proper binary reordering results in a >2MB reduction in memory usage (out of the 14MB that’s mapped in for code), since less random code gets paged in during readahead. This should be interesting for mobile, where our binaries are RISC-bloated and less RAM is available.
A commonly-suggested Linux adage is to prelink if your binaries are loading slowly. All of the good Linux distributions are using it. Unfortunately, that alone gives pretty pathetic improvements. To get beyond the weak 8.5% speedup, one has to write one’s own tools.
As I mentioned before, application binaries are laid out in a basically random order. This seemed like an obvious optimization so I embarked on a non-obvious quest to capture every single memory access and to use that info to order our binaries sensibly.
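To make the idea concrete, here is a minimal sketch of turning an access trace into a section order. It is not the actual valgrind plugin: the "order sections by first touch" policy, the function name `first_touch_order`, and the addresses and section names in the sample data are all assumptions made for illustration.

```python
from bisect import bisect_right

def first_touch_order(sections, accesses):
    """Return section names in the order they were first accessed.

    sections: list of (start_address, size, name) tuples
    accesses: iterable of accessed addresses, in execution order
    """
    ordered = sorted(sections)
    starts = [start for start, _size, _name in ordered]
    seen = set()
    order = []
    for addr in accesses:
        # Find the last section that starts at or before this address.
        i = bisect_right(starts, addr) - 1
        if i < 0:
            continue  # address lies below the first section
        start, size, name = ordered[i]
        if addr < start + size and name not in seen:
            seen.add(name)
            order.append(name)
    return order

# Hypothetical layout and trace, for illustration only.
sections = [
    (0x1000, 0x100, ".text.rarely_used"),
    (0x1100, 0x080, ".text.main"),
    (0x1180, 0x200, ".text.init_prefs"),
]
trace = [0x1100, 0x1104, 0x1180, 0x1110, 0x1300, 0x1000]
print("\n".join(first_touch_order(sections, trace)))
```

Sections that are never touched do not appear in the output, so a linker consuming such a list would have to place them after the ordered ones.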
Due to disappointing results I had to change my strategies a few times. The happy numbers (an easy 30% speedup) in the above table were produced after the 3rd rewrite of my valgrind plugin. Before that I seemed perpetually stuck at 10%.
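For reference, the percentages behind the table can be checked with a few lines of arithmetic. The numbers are the ones from the table; the helper name `reduction` is just for this example.

```python
# Figures taken from the table above.
typical_ms, ordered_ms = 2300, 1636
typical_reads, ordered_reads = 147, 66

def reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before

print(f"startup time: {reduction(typical_ms, ordered_ms):.1f}% lower")
print(f"libxul.so reads: {reduction(typical_reads, ordered_reads):.1f}% fewer")
```

This works out to roughly 29% less startup time and about 55% fewer reads of libxul.so.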
In the current revision of my valgrind plugin I produce a section listing by inferring section names from symbol names via a libelf-based program (unfortunately I do not know of a way to get ld to retain function sections in the final binary). This turned out to be easier to get right than abusing Valgrind’s symbol-lookup APIs into figuring out what sections they came from.
In addition to reordering executable code in .text, the plugin now reorders the various .data sections. It turned out that even though data is a relatively small portion of the executable, it is located on the opposite end of the executable from code. This means that every page fault in the .data section kills continuous reading of the .text section.
I also switched to using gold with a section-ordering patch. It seems to produce binaries that are basically the same size as unordered ones (unlike the ones produced by my linker script hack).
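A rough sketch of what "inferring section names from symbol names" can look like, assuming the objects were compiled with -ffunction-sections and -fdata-sections so that each symbol originally lived in a section named after it. This is not the libelf program itself: the `kind` labels and the function `section_for_symbol` are hypothetical, and a real tool would have to derive the kind from the ELF symbol table.

```python
# Assumed per-symbol section prefixes for objects built with
# -ffunction-sections -fdata-sections. The "kind" keys are labels
# invented for this sketch, not ELF terminology.
SECTION_PREFIX = {
    "func": ".text.",
    "data": ".data.",
    "bss": ".bss.",
    "rodata": ".rodata.",
}

def section_for_symbol(name, kind):
    """Guess the input section a symbol came from, based on its name."""
    return SECTION_PREFIX[kind] + name

# Hypothetical symbols, for illustration only.
symbols = [
    ("_ZN3Foo3barEv", "func"),
    ("gSomeTable", "data"),
    ("gZeroedBuffer", "bss"),
]
for name, kind in symbols:
    print(section_for_symbol(name, kind))
```

The mapping is only a guess per symbol; anything the compiler places in a section not named after the symbol would need separate handling.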
What is a Prelink-Ordered binary?
In the end, it turned out prelink was the key to my problem. I realized that I was measuring memory accesses in valgrind on a non-prelinked binary, causing the linker-induced memory accesses to drive my binary layout. During symbol relocation, the dynamic linker rummages through the .text and .data sections (which I am trying to lay out correctly) in an order that does not correlate with later execution of the program. Unfortunately, I was using that data to order my binary even when the final result was meant to be prelinked.
Perhaps that explains why, in the above table, ordered non-prelinked firefox is actually slower than default non-prelinked firefox. Another explanation is that this could be due to additional disk fragmentation or other factors. Cold startup numbers depend on the hard drive’s luck at seeking and on filesystem fragmentation, so the only reliable comparator is the number of reads/page faults.
As of now my recipe for producing fast-starting binaries is:
- Build firefox
- Switch to root, set LD_LIBRARY_PATH to /dist/bin/ in the object directory, run:
prelink $LD_LIBRARY_PATH/firefox-bin $LD_LIBRARY_PATH/*.so
- Run my libelf utility:
elflog --contents dist/bin/libxul.so > dist/bin/libxul.so.sections
- As a normal user run Firefox under my valgrind plugin. It will output a list of section names to dist/bin/libxul.so.log
- Relink libxul.so with -Wl,--section-ordering-file,$HOME/builds/minefield.release/dist/bin/libxul.so.log
- make dist, copy resulting binaries somewhere, prelink em
- Enjoy faster startup
Using prelink incorrectly can cause massive performance variation.
My plugin does .data reordering now, but it would be very hard to do .data reordering as part of profile-guided optimization. Valgrind is the best tool for this job.
I will try to clean up the code and release my plugin this week. Pretty much every significant application can benefit from this, so I might as well let this loose. I need to decide on a name: ldgrind? startupgrind? binarymaidgrind?
We need to develop a built-in diagnostic for detecting when the user isn’t using prelink (or has other startup misconfiguration issues).
Measuring startup times is highly machine-specific, and the results vary even on an individual machine. A much better metric is the amount of I/O (i.e. the number and size of page faults and non-cached reads) serviced by the kernel, which is very consistent.
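One way to get at such numbers without special tooling is to ask the kernel for a child process’s resource usage. The sketch below uses Python’s standard `resource` module on a Unix-like system; the helper name `io_cost` and the stand-in command are assumptions, and this is not how the numbers in this post were collected.

```python
import resource
import subprocess
import sys

def io_cost(argv):
    """Run argv to completion and return (major_faults, block_reads) for it."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # RUSAGE_CHILDREN accumulates over all waited-for children,
    # so take the difference around this one run.
    return (after.ru_majflt - before.ru_majflt,
            after.ru_inblock - before.ru_inblock)

if __name__ == "__main__":
    # Stand-in command; substitute the program whose startup you want to measure.
    faults, reads = io_cost([sys.executable, "-c", "pass"])
    print(f"major page faults: {faults}, block input operations: {reads}")
```

The command has to exit on its own for this to work, and the counts are only meaningful for cold startup if the page cache was empty before the run.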
Prelink sure gets upset easily. The last (fastest) result in the table above causes:
prelink: /hd/startup.test/firefox//firefox-bin: section file offsets not monotonically increasing
What’s going on? I’m only modifying libxul in between prelink runs, so why is prelink complaining about firefox-bin, which stays constant?
prelink: /hd/startup.test/firefox.ordered.static/firefox-bin: DT_JMPREL tag not adjacent to DT_REL relocations
What does that mean?