So far I have only replaced the common IOService getters. Creating URL objects is now 30% faster. A contributor, Mitchell Field, has volunteered to switch over a huge number of other common services in bug 560095. That rocks.
Last week I learned how Windows handles page faults backed by files (specifically xul.dll). I already knew that Linux was suboptimal in this area, so I hoped the clever people in Redmond had done better.
Shaver pointed me at xperf, which is sort of like the new Linux perf tools. Xperf rocks: it can capture the relevant data and export it as .csv.
Even with profile-guided optimization, Windows causes 3x as much IO in xul.dll as Linux does on libxul.so. That’s especially interesting given that xul.dll is one-third smaller on Windows. Here is the graph. PGO isn’t helping on Windows as much as it helps on Linux because MSVC PGO doesn’t do the equivalent of GCC’s -freorder-blocks-and-partition (unless I missed something in the docs).
With the Windows prefetcher, there were 4x fewer xul.dll reads (graph here). Unfortunately, the prefetcher can’t figure out that the whole of xul.dll should be paged in, so we still end up with an excess of random IO.
When a page fault occurs, Windows goes to read the page from the file and reads a little extra (just like any other sane OS), assuming that there will be more IO nearby. Unfortunately, the gods of virtual memory at Microsoft decided that for every page fault, only 7 extra pages should be read. So reads occur in 32K chunks (vs 128K in Linux, which is still too small). To make matters worse, segments mapped in as data only read in 3 extra pages (i.e. 16K chunks).
This is funny in a sad way. Reading in 32K chunks is supposed to minimize RAM usage (which makes no bloody sense when Win7 officially requires 1GB of RAM). As a result of being so thrifty with easy-to-evict file cache, Windows ends up doing 4x as much file IO as Linux. The 16K reads are particularly funny: one can see the result of that misoptimization in the string of puny reads on the top right of the graphs.
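As a back-of-envelope illustration of what those cluster sizes cost, here is a small C sketch (my own arithmetic, assuming 4K pages; the 12MB code size below is just an example figure, not a measurement):

```c
/* Number of read operations needed to page in `bytes` of code when the
 * OS reads `pages_per_fault` 4K pages per page fault. */
static long reads_needed(long bytes, long pages_per_fault) {
    const long page = 4096;
    long cluster = pages_per_fault * page;
    return (bytes + cluster - 1) / cluster;  /* round up */
}
```

For 12MB of code this works out to 384 reads with Windows' 8-page (32K) clusters versus 96 reads with Linux's 128K readahead: the same 4x ratio visible in the graphs.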
Surely There is an API like madvise()
On POSIX systems, madvise() can be used to influence the caching behavior of the OS. fadvise() is another such call, for IO on file descriptors. For example, Firefox fastload files are now madvise()ed such that they are read in a single 2MB chunk on startup. Unfortunately, it appears that Windows has no such APIs, so one is stuck with pathetically small reads.
At first I thought that passing FILE_FLAG_SEQUENTIAL_SCAN when opening the file handle would work as a crappy fadvise() equivalent. It turns out that mmapping files completely bypasses the Windows Cache Manager, so that flag just gets ignored.
So as far as I can tell, the only way to convince Windows not to read stuff in stupidly small chunks is to mmap() everything we care about using large pages. Unfortunately, that comes with significant costs.
We are going to try to order the binaries better, which should halve the number of page faults.
Can Windows Do Anything Better?
Yes. The “Unix way” of breaking everything up into a billion libraries and configuration files results in 10x more files being read on startup on Linux than on Windows. Just because Linux can read individual libraries 4x faster doesn’t mean that IO becomes free.
Presently, in ideal conditions, Firefox starts up 30-50% faster on Windows. The Windows Prefetcher hack sweeps a lot of Windows suck under the carpet, but Microsoft has a lot of room for improvement.
People seem to prefer to comment on reddit. If you want me to not miss your comment, make sure you comment here.
Magic of GCC PGO
On Friday I finally got gold to produce a prelinkable static binary (bug). I also got around to trying out GCC profile-guided optimization with the debloatifying -freorder-blocks-and-partition option. This option breaks up every profiled function into cold and hot “functions”. It then lumps all of the hot functions together.
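To illustrate what the option does: GCC also exposes manual hot/cold hints, which mimic what -freorder-blocks-and-partition infers from the profile automatically (a toy example; the function names are mine):

```c
#include <stdio.h>

/* Rarely-executed error path: with hot/cold splitting, GCC moves code
 * like this into the .text.unlikely section, so it never gets paged in
 * on a normal run. The attribute is the manual equivalent of what PGO
 * infers from a profile. */
__attribute__((cold, noinline))
static void report_error(const char *msg) {
    fprintf(stderr, "error: %s\n", msg);
}

__attribute__((hot))
static long sum(const long *v, long n) {
    if (!v) {                       /* cold branch, split away from the loop */
        report_error("null input");
        return -1;
    }
    long total = 0;
    for (long i = 0; i < n; i++)    /* hot loop stays in the hot partition */
        total += v[i];
    return total;
}
```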
PGO performance is amazingly close to that of binaries produced by icegrind (within 10%, based on page counts).
| | Startup Time | RSS (KB) |
In the above table, “ordered” means application of icegrind, and “static” means a fat firefox-bin. To generate the PGO profile, I just started and closed Firefox, so it’s not too surprising that the results are similar to those of an explicitly ordered binary. RSS refers to how much memory is mmap()ed into the process (lower RSS usage means we are paging in less junk). I was not able to control the layout of the PGO builds; that will need some linker hackery to deal with split functions.
I think the numbers speak for themselves. Isn’t it scary how wasteful binaries are by default? It amazes me that Firefox can shrug off a significant amount of resource bloat without changing a single line of code. I think this is a good demonstration of why application developers should a) expect more out of their toolchain (Linux, GCC, Binutils) and b) contribute to their toolchain.
I think my next step is to tackle PGO Firefox builds (bug). From there I would like to teach icegrind to play together with PGO.
Pieces of The Startup Puzzle
It has been a long time since I first noticed the suckiness of library IO. After much digging, help from smart people on #gcc, and the LKML discussion, I think I finally have a pretty clear list of the remaining/in-progress things needed for Linux applications to start faster.
1. Wu Fengguang is making headway on smarter readahead that makes better use of available RAM and disk bandwidth. Firefox could be read in 512KB chunks instead of 128KB (a 4x page-fault reduction). Distributions should be aggressively testing this patch.
2. Better agreement on binary organization between the compile-time linker, the run-time linker, and the kernel (see the LKML discussion). A handful of unneeded page faults per file can be shaved off this way.
3. A linker flag to specify how much of a particular library should be read in via madvise(). For example, any XULRunner app knows ahead of time that it needs large parts of libxul.so – might as well let the OS know.
4. Transparent read-only per-file ext4 compression (like OS X). Ted Ts'o indicated that this would be easy to add to ext4, but as far as I know nobody has jumped on this.
I think with all of the above combined, we could load apps like Firefox at near-warm (0.5 second) speeds. Most of these are easy. #1 is hard, but it’s already being worked on. I’ll be happy to point someone at how to solve items 2-4 while I work on the Firefox side of things. The end result of all this should be more instant gratification for all Linux users.
Most program binaries are laid out with little to no regard for how programs get loaded from disk. This disconnect between the compile-time and runtime behaviour of binaries imposes a significant performance penalty on large applications such as browsers, office suites, etc.
It is incredibly difficult to observe both the cause of binary-induced IO (i.e. calling a random function) and its effect (the program getting suspended during startup while parts of it are loaded from disk), so this area doesn’t get as much optimization love as it deserves.
My estimate is that around 50% of Firefox startup time is wasted on suboptimal binary layout. My previous post demonstrated the kind of difference a better binary layout can make. Note that reordering executables isn’t the only solution; eliminating dead code should also speed things up (deleting dead code is hard).
Optimizing Binary Layout
Disclaimer: I just finished my 3rd rewrite of icegrind a few hours ago, so be gentle.
Step 1a: Produce a build
Since I am interested in reorganizing program binaries, I build Mozilla with “-ffunction-sections -fdata-sections” in CFLAGS/CXXFLAGS.
I also prelink the binaries in dist/bin such that my binaries better correspond to how they will be used:
prelink $LD_LIBRARY_PATH/firefox-bin $LD_LIBRARY_PATH/*.so
Step 1b: Produce a description of interesting files
I use my elflog utility to produce a .sections description of the files I’m interested in. Elflog looks at the symbol table and tries to infer section names (produced by -ffunction-sections -fdata-sections) from symbol names/locations (see also the --print-map option for ld).
elflog --contents libxul.so > libxul.so.sections
elflog currently emits non-existent .comment.* sections because it gets confused by 0-length sections such as .bss.
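For the curious, the general approach of reading ELF structures directly can be sketched with the structs from <elf.h> (my illustration, not elflog's actual code, which goes further and walks the symbol table; 64-bit only for brevity):

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Read just the ELF header of a binary and report how many section
 * header entries it carries. Returns 0 on success, -1 on error. */
static int count_sections(const char *path, int *out) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {  /* check \177ELF magic */
        fclose(f);
        return -1;
    }
    *out = eh.e_shnum;   /* number of section header entries */
    fclose(f);
    return 0;
}
```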
Note, one can also build tools to describe other kinds of files, such as jar or sqlite files. The only limitation is that icegrind currently only tracks mmap()-caused disk IO; it would be trivial to extend it to deal with open/seek/read-style disk IO.
Step 2: Produce a log with icegrind!
Apply my icegrind patch, then build and install valgrind.
valgrind --tool=icegrind firefox-bin -profile /tmp/ff -no-remote
This will produce a .log file for every mmap()ed file that has a .sections description. The log lists sections in the order they were accessed.
Step 3: Tell gold to link using the above log
Build/install binutils (I use a CVS checkout from a month ago) with the section-ordering patch, configured with --enable-gold.
To reorder the binary, I just add -Wl,--section-ordering-file,libxul.so.log to my linker commandline.
Note there are still some teething issues with this patch: it exhibits N^2 behavior (i.e. it takes 10 minutes to link libxul.so) and occasionally swaps the order of .rela.plt and .rela.dyn, which makes prelink upset. But unlike my earlier attempt with linker scripts, it does not affect the binary size.
Step 4: Enjoy!
Now strip, install, prelink your binaries and enjoy faster startup.
I would like to see the gold patch fixed up and landed. Once that is done, I’d like to turn this on for our Linux and mobile Linux builds.
I am hoping that some sort of sensible ordering of binaries will become commonplace in the future.
I am going to start with numbers to give an idea of the magnitudes involved here. I’m still using my 1.8GHz Core 2 Duo laptop with a 7200rpm 200GB hard drive.
| | Time (ms) | # of libxul.so reads |
|---|---|---|
| Typical Firefox build | 2300 | 147 |
| Prelink-Ordered + Prelinked Firefox | 1636 | 66 |
Additionally, proper binary reordering results in a >2MB reduction in memory usage (out of the 14MB that is mapped in for code), since less random code gets paged in during readahead. This should be interesting for mobile, where our binaries are RISC-bloated and less RAM is available.
A commonly-suggested Linux adage is to prelink your binaries if they load slowly. All of the good Linux distributions are using it. Unfortunately, that alone gives pretty pathetic improvements. Beyond the weak 8.5% speedup, one has to build one’s own tools to speed things up.
As I mentioned before, application binaries are laid out in a basically random order. This seemed like an obvious optimization so I embarked on a non-obvious quest to capture every single memory access and to use that info to order our binaries sensibly.
Due to disappointing results, I had to change my strategy a few times. The happy numbers (an easy 30% speedup) in the above table were produced after the 3rd rewrite of my valgrind plugin. Before that, I seemed perpetually stuck at 10%.
In the current revision of my valgrind plugin, I produce a section listing by inferring section names from symbol names via a libelf-based program (unfortunately, I do not know of a way to get ld to retain function sections in the final binary). This turned out to be easier to get right than abusing Valgrind’s symbol-lookup APIs to figure out which sections symbols came from.
Also, in addition to reordering executable code in .text, the plugin now reorders the various .data sections. It turned out that even though data is a relatively small portion of the executable, it is located on the opposite end of the executable from the code. This means that every page fault in the .data section kills continuous reading of the .text section.
I also switched to using gold with a section-ordering patch; it seems to produce binaries that are basically the same size as unordered ones (unlike those produced by my linker-script hack).
What is a Prelink-Ordered binary?
In the end, it turned out prelink was the key to my problem. I realized that I was measuring memory accesses in valgrind on a non-prelinked binary, which let linker-induced memory accesses drive my binary layout. During symbol relocation, the dynamic linker rummages through the .text and .data sections (which I am trying to lay out correctly) in an order that does not correlate with later execution of the program. Unfortunately, I was using that data to order my binary even though the final result was meant to be prelinked.
Perhaps that explains why, in the above table, ordered non-prelinked Firefox is actually slower than default non-prelinked Firefox. Another explanation is that this could be due to additional disk fragmentation or other factors. Cold startup numbers depend on the hard drive’s luck at seeking and on filesystem fragmentation, so the only reliable comparator is the number of reads/page faults.
As of now my recipe to producing fast-starting binaries is:
- Build firefox
- Switch to root, set LD_LIBRARY_PATH to /dist/bin/ in the object directory, run:
prelink $LD_LIBRARY_PATH/firefox-bin $LD_LIBRARY_PATH/*.so
- Run my libelf utility:
elflog --contents dist/bin/libxul.so > dist/bin/libxul.so.sections
- As a normal user run Firefox under my valgrind plugin. It will output a list of section names to dist/bin/libxul.so.log
- Relink libxul.so with -Wl,--section-ordering-file,$HOME/builds/minefield.release/dist/bin/libxul.so.log
- Run make dist, copy the resulting binaries somewhere, and prelink them
- Enjoy faster startup
Using prelink incorrectly can cause massive performance variation.
My plugin does .data reordering now, but it would be very hard to do .data reordering as part of profile-guided optimization. Valgrind is the best tool for this job.
I will try to cleanup the code and release my plugin this week. Pretty much every significant application can benefit from this, might as well let this loose. I need to decide on a name: ldgrind? startupgrind? binarymaidgrind?
We need to develop a built-in diagnostic for detecting when the user isn’t using prelink (or has other startup misconfiguration issues).
Measuring startup time is highly machine-specific and varies even on an individual machine. A much better metric is the amount of IO (i.e. the number and size of page faults and non-cached reads) serviced by the kernel; that is very consistent.
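On Linux, one cheap way to get at the fault counts is getrusage(); here is a minimal sketch (my illustration, not an actual Mozilla diagnostic):

```c
#include <stdio.h>
#include <sys/resource.h>

/* Major faults are the ones serviced from disk, and are therefore the
 * number to watch for cold startup; minor faults are serviced from the
 * page cache and are nearly free. */
static void print_fault_counts(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("major (disk) faults:  %ld\n", ru.ru_majflt);
    printf("minor (cache) faults: %ld\n", ru.ru_minflt);
}
```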
Prelink sure gets upset easily. The last (fastest) result in the table above causes:
prelink: /hd/startup.test/firefox//firefox-bin: section file offsets not monotonically increasing
What’s going on? I’m only modifying libxul in between prelink runs, so why is prelink complaining about firefox-bin, which stays constant?
prelink: /hd/startup.test/firefox.ordered.static/firefox-bin: DT_JMPREL tag not adjacent to DT_REL relocations
What does that mean?