Linux: How to Make Startup Suck Less (Also Reduce Memory Usage!)

As I explained before, loading binaries from disk sucks. Aside from switching glibc to use madvise/fadvise, what can application developers do to minimize this suckiness?
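To make that concrete, here’s roughly what such a readahead hint looks like. This is an illustrative sketch, not actual glibc or Firefox code:

    /* prefetch.c: minimal sketch of madvise/fadvise-style readahead hints.
     * Illustrative only; glibc's loader does not do this today. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s /path/to/lib.so\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;

        /* File-level hint: kick off readahead for the whole file,
         * instead of faulting it in one 4KB page at a time. */
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

        /* Mapping-level hint: the same idea for a mapped region. */
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            madvise(p, st.st_size, MADV_WILLNEED);
            munmap(p, st.st_size);
        }

        close(fd);
        return 0;
    }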

I am going to start with numbers to give an idea of the magnitudes involved here. I’m still using my 1.8GHz Core 2 Duo laptop with a 7200rpm 200GB hard drive.

Configuration                         Time (ms)   # of libxul.so reads
Typical Firefox build                 2300        147
Prelinked Firefox                     2130        139
Ordered Firefox                       2500        131
Ordered + Prelinked                   2065        124
Prelink-Ordered Firefox               1860         72
Prelink-Ordered + Prelinked Firefox   1636         66

Additionally, proper binary reordering results in a >2MB reduction in memory usage (out of the 14MB that’s mapped in for code), since less random code gets paged in during readahead. This should be especially interesting for mobile, where our binaries are RISC-bloated and less RAM is available.

Analysis

A common bit of Linux wisdom is to prelink if your binaries are loading slowly. All of the good Linux distributions are using it. Unfortunately, prelink alone gives pretty pathetic improvements. Beyond that weak 8.5% speedup, one has to build one’s own tools to speed things up.

As I mentioned before, application binaries are laid out in a basically random order. This seemed like an obvious optimization so I embarked on a non-obvious quest to capture every single memory access and to use that info to order our binaries sensibly.

Valgrind Adventures

Due to disappointing results I had to change strategies a few times. The happy numbers (an easy 30% speedup) in the above table were produced by the 3rd rewrite of my valgrind plugin. Before that I seemed perpetually stuck at 10%.

In the current revision of my valgrind plugin I produce a section listing by inferring section names from symbol names via a libelf-based program (unfortunately I do not know of a way to get ld to retain function sections in the final binary). This turned out to be easier to get right than abusing Valgrind’s symbol-lookup APIs to figure out which sections symbols came from.
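The libelf side is just a symbol-table walk. Here is an illustrative sketch (not the real elflog), assuming the build used -ffunction-sections so every function originally lived in its own .text.<name> section:

    /* elflog-like sketch: infer original .text.<function> section names
     * from the symbol table. Illustrative only, not the real elflog. */
    #include <fcntl.h>
    #include <gelf.h>
    #include <libelf.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2 || elf_version(EV_CURRENT) == EV_NONE)
            return 1;

        int fd = open(argv[1], O_RDONLY);
        Elf *e = elf_begin(fd, ELF_C_READ, NULL);
        if (!e)
            return 1;

        Elf_Scn *scn = NULL;
        while ((scn = elf_nextscn(e, scn)) != NULL) {
            GElf_Shdr shdr;
            if (!gelf_getshdr(scn, &shdr) || shdr.sh_type != SHT_SYMTAB
                || shdr.sh_entsize == 0)
                continue;

            Elf_Data *data = elf_getdata(scn, NULL);
            size_t n = shdr.sh_size / shdr.sh_entsize;
            for (size_t i = 0; i < n; i++) {
                GElf_Sym sym;
                if (!gelf_getsym(data, i, &sym))
                    continue;
                if (GELF_ST_TYPE(sym.st_info) != STT_FUNC || sym.st_size == 0)
                    continue;
                const char *name = elf_strptr(e, shdr.sh_link, sym.st_name);
                if (!name || !name[0])
                    continue;
                /* With -ffunction-sections, function foo started life
                 * in a section named .text.foo. */
                printf("%#llx %llu .text.%s\n",
                       (unsigned long long)sym.st_value,
                       (unsigned long long)sym.st_size, name);
            }
        }

        elf_end(e);
        close(fd);
        return 0;
    }

(Link with -lelf.)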

In addition to reordering executable code in .text, the plugin now reorders the various .data sections. It turned out that even though data is a relatively small portion of the executable, it is located at the opposite end of the executable from the code. This means that every page fault in the .data section interrupts contiguous reading of the .text section.

I also switched to using gold with a section-ordering patch; it seems to produce binaries that are basically the same size as unordered ones (unlike those produced by my linker-script hack).
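For reference, gold’s --section-ordering-file input is just a list of input section names, one per line, laid out in the given order. It looks something like this (these particular entries are made up for illustration):

    .text.NS_InitXPCOM2
    .text.XRE_main
    .data.rel.ro.local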

What is a Prelink-Ordered binary?

In the end, it turned out prelink was the key to my problem. I realized that I was measuring memory accesses in valgrind on a non-prelinked binary, which let linker-induced memory accesses drive my binary layout. During symbol relocation, the dynamic linker rummages through the .text and .data sections (the very sections I am trying to lay out correctly) in an order that does not correlate with the later execution of the program. Unfortunately, I was using that data to order my binary even though the final result was meant to be prelinked.

Perhaps that explains why, in the above table, ordered non-prelinked Firefox is actually slower than the default non-prelinked Firefox. Alternatively, it could be due to additional disk fragmentation or other factors. Cold startup times depend on the hard drive’s luck at seeking plus filesystem fragmentation, so the only reliable comparator is the number of reads/page faults.

As of now, my recipe for producing fast-starting binaries is:

  1. Build firefox
  2. Switch to root, set LD_LIBRARY_PATH to /dist/bin/ in the object directory, run:
    prelink $LD_LIBRARY_PATH/firefox-bin $LD_LIBRARY_PATH/*.so
  3. Run my libelf utility:
    elflog --contents dist/bin/libxul.so > dist/bin/libxul.so.sections
  4. As a normal user, run Firefox under my valgrind plugin. It will output a list of section names to dist/bin/libxul.so.log
  5. Relink libxul.so with -Wl,--section-ordering-file,$HOME/builds/minefield.release/dist/bin/libxul.so.log
  6. Run make dist, copy the resulting binaries somewhere, and prelink them
  7. Enjoy faster startup

Conclusion

Using prelink incorrectly can cause massive performance variation.

My plugin does .data reordering now, but that would be very hard to do as part of profile-guided optimization. Valgrind is the best tool for this job.

I will try to clean up the code and release my plugin this week. Pretty much every significant application can benefit from this, so I might as well let it loose. I need to decide on a name: ldgrind? startupgrind? binarymaidgrind?

We need to develop a built-in diagnostic for detecting when the user isn’t using prelink (or has other startup misconfiguration issues).

Measuring startup times is highly machine-specific, and the numbers vary even on a single machine. A much better metric is the amount of I/O (i.e. the number and size of page faults and non-cached reads) serviced by the kernel; that is very consistent.
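If you want a crude way to collect such numbers, a wrapper that reports the kernel’s rusage counters for a child process gets you most of the way. This is a sketch, not my actual measurement setup, and it must be run on a cold cache to mean anything:

    /* Wrapper sketch: count the I/O a command causes instead of timing it. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        pid_t pid = fork();
        if (pid == 0) {
            execvp(argv[1], &argv[1]);
            _exit(127);
        }

        int status;
        waitpid(pid, &status, 0);

        struct rusage ru;
        getrusage(RUSAGE_CHILDREN, &ru);

        /* Major faults had to hit the disk; block input ops count reads
         * the kernel actually serviced. Both are stable across runs. */
        printf("major page faults: %ld\n", ru.ru_majflt);
        printf("minor page faults: %ld\n", ru.ru_minflt);
        printf("block input ops:   %ld\n", ru.ru_inblock);
        return 0;
    }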

Misc

Prelink sure gets upset easily. The last (fastest) configuration in the table above causes:

prelink $LD_LIBRARY_PATH/firefox-bin
prelink: /hd/startup.test/firefox//firefox-bin: section file offsets not monotonically increasing

What’s going on? I’m only modifying libxul in between prelink runs, so why is prelink complaining about firefox-bin, which stays constant?

prelink: /hd/startup.test/firefox.ordered.static/firefox-bin: DT_JMPREL tag not adjacent to DT_REL relocations

What does that mean?

10 comments

  1. Robert O'Callahan

    Can you blog about why there are still 66 reads left and what, if anything, we can do about them?

  2. roc, the remaining 66 reads are due to the run-time linker not using madvise.
    See http://blog.mozilla.org/tglek/2010/03/24/linux-why-loading-binaries-from-disk-sucks/ “Are Large Programs Screwed Out of Fast Startup?” on getting rid of the remaining reads.

    Also http://blog.mozilla.org/tglek/2010/03/25/madvise-prelink-update/ for bugs that I filed on the dynamic linker.

    I’m pretty sure that we could read libxul.so in 5-6 reads with proper dynamic-link cooperation. This gets especially exciting once we combine mozjs/xul/firefox-bin into a single binary.

  3. Aren’t we supposed to read a dynamic library into memory only once and share it?
    Something is really wrong if you need to read it again and again, 66 times; one almost thinks of statically linking the library into the binary instead, despite all the costs of that.

    These are really ridiculous numbers, don’t you think?

  4. @Yaa101
    The library isn’t being read 66 times; it is read in 66 chunks. Furthermore, a library like libxul.so isn’t going to be shared much. Thunderbird usually lags a version behind, as do other apps.

  5. Thanks for the answer. Still, why isn’t it read in one burst instead of 66 chunks? Are parts only read when the function or object is needed? It seems such a waste of time to me, especially now that RAM is getting so cheap.

  6. Is there _any_ reasonable way to build this into our automated tooling we’re using for building our applications on our normal build machines?

  7. Robert, yes. It should be equivalent to building in profile-guided optimization.

  8. Would the cost of reading 66 (or 147) chunks go away if the whole libxul.so file was already in disk cache (eg. due to being read by readahead)?

  9. I think you forgot to proofread this article before publishing it

  10. I see no errors in this article. If it was formal, the word suck wouldn’t be used. Great blog.

    I’ve wondered why in the past I haven’t seen the results I’d like with prelink. Maybe it’s all my Gentoo rice burning flags :P?