19
Apr 10

Windows Sucks At Memory-Mapped IO During Startup

Last week I learned how Windows handles page faults backed by files (specifically xul.dll). I already knew that Linux was suboptimal in this area; perhaps the clever people at Redmond did better.

Shaver pointed me at xperf, which is sort of like the new Linux perf tools. Xperf rocks in that it can capture the relevant data and export it as .csv.

Even with profile-guided optimization, Windows causes 3x as much IO on xul.dll as Linux does on libxul.so. That’s especially interesting given that xul.dll is one-third smaller on Windows. Here is the graph. PGO isn’t helping on Windows as much as it can help on Linux because MSVC PGO doesn’t do the equivalent of GCC’s -freorder-blocks-and-partition (unless I missed something in the docs).

With the Windows prefetcher, there were 4x fewer xul.dll IOs (graph here). Unfortunately, the prefetcher can’t figure out that the whole of xul.dll should be paged in, so we still end up with an excess of random IO.

Why?

When a page fault occurs, Windows goes to read the page from the file and reads a little extra (just like any other sane OS), assuming that there will be more IO nearby. Unfortunately, the gods of virtual memory at Microsoft decided that for every page fault, only 7 extra pages should be read, so reads occur in 32K chunks (vs 128K on Linux, which is still too small). To make matters worse, segments mapped in as data read in only 3 extra pages (i.e. 16K chunks).

This is funny in a sad way. Reading in 32K chunks is supposed to minimize RAM usage (which makes no bloody sense when Win7 officially requires 1GB of RAM). As a result of being so thrifty with easy-to-evict file cache, Windows ends up doing 4x as much file IO as Linux. The 16K reads are particularly funny; one can see the result of that misoptimization in the string of puny reads at the top right of the graphs.

Surely There is an API like madvise()
On POSIX systems madvise() can be used to influence the caching behavior of the OS; fadvise() is another such call, for IO based on file handles. For example, Firefox fastload files are now madvise()ed such that they are read in a single 2MB chunk on startup. Unfortunately, it appears that Windows has no such APIs, so one is stuck with pathetically small reads.
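
For reference, here is the rough shape of those calls (a minimal sketch, not the actual Firefox code; the file name is whatever you want prefetched):

/* Minimal sketch of the POSIX hints: tell the kernel a file will be needed
 * soon so it comes in as a few large reads rather than being dribbled in
 * by individual page faults. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int prefetch_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    /* Hint for IO done through the file descriptor... */
    posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

    /* ...and for mmap()ed access, hint on the mapping itself. */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        madvise(p, st.st_size, MADV_WILLNEED);
        munmap(p, st.st_size);  /* the pages stay in the page cache */
    }

    close(fd);
    return 0;
}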

At first I thought that passing FILE_FLAG_SEQUENTIAL_SCAN when opening the file handle would work as a crappy fadvise() equivalent. It turns out that mmapping files completely bypasses the Windows Cache Manager, so that flag just gets ignored.
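
For the record, this is roughly what I tried (a sketch, not the actual patch):

/* The flag is honoured for ReadFile()-style access that goes through the
 * Cache Manager, but page faults on a mapped view bypass it, so the hint
 * below is effectively ignored. */
#include <windows.h>

void *map_sequential(const wchar_t *path)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;

    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) {
        CloseHandle(file);
        return NULL;
    }

    /* Faults on this view read in small fixed-size clusters regardless of
       the FILE_FLAG_SEQUENTIAL_SCAN hint above. */
    void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping);  /* the view keeps the mapping alive */
    CloseHandle(file);
    return view;
}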

So, as far as I can tell, the only way to convince Windows not to read stuff in stupidly small chunks is to mmap() everything we care about using large pages. Unfortunately, that comes with some significant costs.

We are going to try to order the binaries better, which should halve the number of page faults.

Can Windows Do Anything Better?

Yes. The “Unix way” of breaking everything up into a billion libraries and configuration files results in 10x more files being read on startup on Linux than on Windows. Just because Linux can read individual libraries 4x faster doesn’t mean that IO becomes free.

Presently, in ideal conditions, Firefox starts up 30-50% faster on Windows. The Windows Prefetcher hack sweeps a lot of Windows suck under the carpet, but Microsoft has a lot of room for improvement.

Update:
People seem to prefer to comment on reddit. If you want me to not miss your comment, make sure you comment here.


12
Apr 10

Squeezing Every Last Bit of Performance Out of The Linux Toolchain

Magic of GCC PGO

On Friday I finally got gold to produce a prelinkable static binary (bug). I also got around to trying out GCC profile-guided optimization with the debloatifying -freorder-blocks-and-partition option. This option breaks up every profiled function into cold and hot “functions”. It then lumps all of the hot functions together.
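
To get a feel for what the hot/cold split buys: GCC already exposes the function-level version of it via attributes (cold functions are emitted into a separate text subsection, away from the hot code); -freorder-blocks-and-partition does the same thing automatically at basic-block granularity, driven by the profile. A toy illustration, not Firefox code:

/* Toy illustration: the cold attribute tells GCC to emit report_error()
 * into the cold text subsection, so error-handling code never has to be
 * paged in during a normal startup. -freorder-blocks-and-partition performs
 * the same hot/cold split inside a single function, using the PGO profile. */
#include <stdio.h>

__attribute__((cold, noinline)) static void report_error(const char *msg)
{
    fprintf(stderr, "startup error: %s\n", msg);
}

__attribute__((hot)) int parse_pref_line(const char *line)
{
    if (!line) {                    /* rarely taken branch */
        report_error("null pref line");
        return -1;
    }
    return line[0] != '#';          /* hot path stays in hot text */
}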

PGO performance is amazingly close to that of binaries produced by icegrind (within 10% based on page-counts).

                        Startup Time   RSS (KB)
firefox.stock           2515ms         49452
firefox.ordered         1919ms         45344
firefox.static          2321ms         49616
firefox.static.ordered  1577ms         37072
firefox.static.pgo      1619ms         38436

In the above table, “ordered” means application of icegrind and “static” means a fat firefox-bin. To generate the PGO profile I just started and closed Firefox, so it’s not too surprising that the results are similar to those of an explicitly ordered binary. RSS refers to how much memory is mmap()ed into the process (lower RSS usage means we are paging in less junk). I was not able to control the layout of the PGO builds; that will need some linker hackery to deal with split functions.

I think the numbers speak for themselves. Isn’t it scary how wasteful binaries are by default? It amazes me that Firefox can shrug off a significant amount of resource bloat without changing a single line of code. I think this is a good demonstration of why application developers should a) expect more out of their toolchain (Linux, GCC, Binutils) and b) contribute to it.

I think my next step is to tackle PGO Firefox builds (bug). From there I would like to teach icegrind to play together with PGO.

Pieces of The Startup Puzzle

It has been a long time since I first noticed the suckiness of library IO. After much digging, help from smart people on #gcc, and an LKML discussion, I think I finally have a pretty clear list of the remaining/in-progress things needed for Linux applications to start faster.

  1. Wu Fengguang is making headway on smarter readahead that makes better use of available RAM + disk bandwidth. Firefox could be read in 512kb chunks instead of 128kb (a 4x page-fault reduction). Distributions should be aggressively testing this patch.
  2. Better agreement on binary organization between the compile-time linker, run-time linker and the kernel (see LKML discussion). Can shave off a handful of unneeded page-faults per file this way.
  3. A linker flag to specify how much of a particular library should be read in via madvise(). For example, any xulrunner app will know ahead of time that it needs large parts of libxul.so – might as well let the OS know (see the sketch after this list).
  4. Transparent read-only per-file ext4 compression (like OS X). Ted Ts’o indicated that this would be easy to add to ext4, but as far as I know nobody has jumped on this.
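
To make item 3 concrete: the hint itself is trivial with existing syscalls; the hard part is teaching the toolchain to emit the numbers. A toy sketch (the path and the 16MB figure are made up; a real linker-emitted hint would carry the size of the startup-critical portion of the library):

/* Toy sketch of the kind of hint a linker flag could drive. */
#include <fcntl.h>
#include <unistd.h>

void hint_libxul(void)
{
    int fd = open("/usr/lib/xulrunner/libxul.so", O_RDONLY);  /* made-up path */
    if (fd < 0)
        return;
    /* Ask the kernel to start reading the first 16MB now, ahead of the
       page faults; readahead(2) is a blocking alternative. */
    posix_fadvise(fd, 0, 16 * 1024 * 1024, POSIX_FADV_WILLNEED);
    close(fd);
}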

I think with all of the above combined, we could load apps like Firefox at near-warm (0.5 second) speeds. Most of these are easy. #1 is hard, but it’s already being worked on. I’ll be happy to point someone at how to solve items 2-4 while I work on the Firefox side of things. The end result of all this should be more instant gratification for all Linux users.


24
Mar 10

Linux: Why Loading Binaries From Disk Sucks

Note: I am doing my measurements and experiments on Fedora 12; once I feel that I understand and can solve the problems on Linux, other operating systems will follow. The aim of this post is to document what I have learned about the mysterious process of loading programs, from the filesystem perspective.

A binary is broken up into segments. There are about half a dozen different segments in an executable, but only two that matter here:

  1. A segment that mostly contains function-bodies and some const data. It’s mapped in read+execute mode
  2. A segment that contains variable initializers, GOT, PLT, lists of constructors/destructors, etc.

The compile-time linker composes segments out of sections that contain variable data, function bodies, etc. The run-time linker maps the segments into memory via mmap() and resolves references to data and code in dynamic libraries (relocation), dirtying pages via COW as it does so.
IO happens when the program (or the run-time linker acting on its behalf) tries to access memory that hasn’t been read from disk yet. These interruptions are called page faults. They cause the kernel to interrupt the program to read in some multiple of pages (4096-byte chunks) from disk. On my system, page faults trigger 131072-byte (32-page) reads.

For a detailed explanation of how linkers work check out the guides written by experts. Out of the scant list of literature on the subject, my “favourite” is Ulrich Drepper’s “How to Write Shared Libraries”. It actually explains things in terms of file offsets, not just virtual addresses.

A common misconception is that mmap()-caused IO is free (because you don’t issue any explicit read() calls). IO via page faults is just another “API” for file access; there is no reason for it to be free. Another misconception is that one has no control over the IO patterns triggered by mmap(): on Linux-like OSes one can use madvise() (Windows is more limited; AFAIK one can only set IO-pattern flags on file handles).
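
To convince yourself of the first point, here is a toy program (not part of Firefox; the default path is made up) that counts the major faults incurred by touching every page of a mapped file, with and without the madvise() hint:

/* Toy fault counter: touch every page of an mmap()ed file and report how
 * many major page faults that costs. Run it once cold, then again with any
 * second argument to enable the madvise() hint, and compare. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/usr/lib/firefox/libxul.so";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (argc > 2)  /* any extra argument turns the hint on */
        madvise(p, st.st_size, MADV_WILLNEED);

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);

    volatile char sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];               /* fault the pages in one by one */

    getrusage(RUSAGE_SELF, &after);
    printf("major faults: %ld, minor faults: %ld\n",
           after.ru_majflt - before.ru_majflt,
           after.ru_minflt - before.ru_minflt);
    return 0;
}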

Prelinking Fail

Having the run-time linker fix up the binary causes a huge amount of IO even before any application code gets executed. These faults are visible in the second graph in my previous post. The linker’s faults (pun intended) are the small green rectangles above the big ones. The graph clearly shows the huge relative cost of inefficient run-time linker memory prodding.

Supposedly, this problem is largely solved by prelinking. I can’t confirm that, as prelink does not work on Firefox on the systems where I can measure IO with SystemTap. This non-determinism is frustrating; we should figure out a way to warn the user when the OS infrastructure fails them.

Post-linker Fail

The above IO patterns can be blamed on careless run-time linker behavior. IO after that can be attributed to lack of organization in the output of the compiler and the compile-time linker. It turns out that the layout of the .text section (where all of the function bodies lie) and, to a smaller degree, the .data (and .bss, etc.) sections (i.e. variable initializers) is completely unhelpful. For a good laugh, look at how the official nightly libxul.so is read mostly through backwards IO (graph link).

Aside: the libxul.so graphs in the previous post did not exhibit this kind of insanity. I did my best to order functions based on the chronological order of access (courtesy of my valgrind hack). Unfortunately, the chronological log is not deterministic. Someone suggested that statistical fudging via Markov chains would help. From the IO patterns in the graph I’m guessing that the IO patterns detected by valgrind and those that actually happen diverge due to threading differences. Linux users who are interested in fast startup should pray to the GCC Santas to reorder binaries as part of profile-guided optimization.

Are Large Programs Screwed Out of Fast Startup?

Yes, but it doesn’t have to be this way. It turns out this utter disaster is caused by naive use of mmap() in the dynamic linker. The second graph (previous post) shows that even a late madvise() call (delightfully nasty patch) can significantly improve the situation.
Specifying madvise(MADV_WILLNEED) causes individual faults to read in 2097152 bytes (512 pages, 16x larger reads than the default), yields a 3x reduction in the total number of faults (10x if one counts only the faults after the madvise() call) and saves about 1 second of startup.
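
The gist of a late madvise() is easy to sketch: walk the segments the dynamic linker has already mapped and ask the kernel to read them in before the page faults arrive. A rough sketch (not the actual patch):

/* Rough sketch: madvise() the PT_LOAD segments of every library already
 * mapped into the process so the kernel reads them in with a few large IOs
 * instead of thousands of small fault-driven ones. */
#define _GNU_SOURCE
#include <link.h>
#include <stdint.h>
#include <sys/mman.h>

static int prefetch_segments(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    (void)data;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD)
            continue;
        /* Round the segment start down to a page boundary. */
        uintptr_t start = (uintptr_t)(info->dlpi_addr + ph->p_vaddr) & ~(uintptr_t)4095;
        uintptr_t end   = (uintptr_t)(info->dlpi_addr + ph->p_vaddr + ph->p_memsz);
        madvise((void *)start, end - start, MADV_WILLNEED);
    }
    return 0;  /* 0 = keep iterating over the remaining libraries */
}

void prefetch_mapped_libraries(void)
{
    dl_iterate_phdr(prefetch_segments, NULL);
}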

The basic trick is outlined as “approach c” in this bug. My current thinking is to:

  1. use Jim Blandy’s executable-parsing library from breakpad (which is already in our build) to parse our binaries
  2. calculate what the dynamic linker will mmap() at runtime.
  3. have some firefox.sh-like native frontend mmap()/madvise() it with appropriate flags (a rough sketch of this step follows)
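
Roughly what step 3 could look like (paths are made up, error handling minimal); the point is just that the pages end up in the page cache before the dynamic linker maps the file for real:

/* Sketch of a tiny launcher that maps libxul.so and hints the kernel before
 * exec()ing the real binary. The mapping goes away on exec(), but the pages
 * stay in the page cache, so the dynamic linker's own mmap() of the same
 * file is nearly free. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void prewarm(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;
    struct stat st;
    if (fstat(fd, &st) == 0) {
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED)
            madvise(p, st.st_size, MADV_WILLNEED);  /* few, large reads */
    }
    close(fd);
}

int main(int argc, char **argv)
{
    (void)argc;
    prewarm("/usr/lib/firefox/libxul.so");        /* made-up install path */
    execv("/usr/lib/firefox/firefox-bin", argv);  /* made-up install path */
    perror("execv");
    return 1;
}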

In the longer term, madvise() fixes should land in both the run-time and compile-time linkers.

Conclusion

It took me a long time to produce the above story from my run-time linker observations. As recently as December I had no idea what a run-time linker was, or what linker segments, sections, etc. were. I’d like to thank the Valgrind authors, the kind folks on #gcc and the numerous individuals who helped me connect the pieces. The above is written to the best of my still-incomplete understanding; I will appreciate any corrections.


23
Mar 10

When in Trouble, Draw a Picture

Graphs

Note: the following graphs broke on the nightlies this week. I would appreciate help with reducing this to a proper testcase. They work fine on older nightlies and released versions of Firefox. Non-SpiderMonkey JS engines won’t work, as they don’t support destructuring assignment and other goodies.

Once I graphed my file-access logs, most of my problems became clear. Here is the first graph (screenshot). The y-axis is time; once you click on a square, the x-axis is the file offset. One can clearly see that libxul (4th rectangle) is a big chunk of our IO. It’s also clear that initially the file is being accessed at the beginning and near the end. One can also see that there is some “backwards” IO. It’s ugly stuff.

I first saw this picture on Friday evening; I spent the last 2 days trying to make libxul.so load less stupidly. Here is a graph (screenshot) from a hack I tried this morning (approach b in the bug). The top is still stupid, but near the bottom, libxul is being read in a much more efficient manner (>50% less IO).

This graph is from a lunch-time hack (approach c). IO is being spread across two processes, so it’s evading my SystemTap script for now. Fortunately, “approach c” also shaved a second off my startup time, so I know the libxul graph would’ve been real pretty.

What follows is a story of how I ended up graphing stuff; more on lessons learned later…



11
Mar 10

Extensions & Startup

Dietrich blogged a “wake up and smell the startup” executive overview of startup issues caused by our extension practices. This post is a “numbers” followup. For this experiment I installed a brand-spankin’-new copy of Linux Firefox 3.6. Firefox is installed on a 7200rpm hard drive; the rest of my system lives on an SSD. The CPU is a Core 2 Duo; keep in mind these numbers will be significantly worse for people running on netbooks and other common hardware. The numbers vary +/- 150ms, but the general picture is pretty clear.

Results

                                     Startup Time
Firefox 3.6 with no extensions       2240ms
+Adblock Plus (no subscriptions)     2538ms
+Video Download Helper               2727ms
+Personas                            3220ms
+Greasemonkey                        3300ms
+EasyList subscription for Adblock   4044ms

I just doubled the cold startup time of Firefox by merely adding 4 extensions. It takes weeks or even months of developer time to shave every 100ms off Firefox startup, but mere seconds to undo any of those gains by installing extensions. These are just the top 4 extensions in the list (presumably they are higher quality too); I’m sure there are lots of other extensions with more drastic performance hits.

Dietrich’s post details some of the remedies that should reduce the startup cost of extensions. For the inquisitive minds: I used SystemTap to produce a report of files read by Firefox on startup ordered by their startup cost.

Update:
Dietrich asked me to summarize warm startup too:

  • Without extensions: 550ms
  • With above Extensions: 1800ms

Note that this is a developer blog, so by “remedies” I meant “things developers can do”. There is little normal users can do short of complaining to the extension authors.

This post isn’t meant to shame specific extension authors into speeding up their extensions. The aim is to show that a measurable percentage of startup is due to extensions and that we need to:

  1. Educate extension developers about it
  2. Provide better tools to measure slowdowns caused by extensions
  3. Make sure that the Firefox side of extension handling is sufficiently efficient

19
Feb 10

Teaching ld to optimize binaries for startup

I have been told that it should be possible to control the way the GNU linker lays out binaries. Unfortunately until recently I couldn’t figure out the right incantations to convince ld to do my bidding. Turns out what I needed was to be stranded on a beach in Fiji with nothing better to do than to reread the ld info page a few times.

Recipe:

  1. Produce 2 Mozilla builds:
    A tracing build with -finstrument-functions in CXXFLAGS/CFLAGS
    A release build with -ffunction-sections and -fdata-sections in CXXFLAGS/CFLAGS to allow the linker to move stuff at function or static-data (mostly variables) granularity
  2. Link my profile.cpp into libxul in the tracing build (without the -finstrument-functions flag); a sketch of its hooks follows the recipe
  3. Run the tracing build, capturing the spew from profile.cpp into a log file
  4. Feed the log file to my script to produce a linker script. This will produce library.so.script files for all of the Mozilla libraries.
  5. Rebuild the relevant libraries in the release build with the -T library.so.script linker flag
  6. Enjoy faster startup
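
For reference, profile.cpp boils down to the standard -finstrument-functions hooks. A minimal sketch (not the actual file) looks roughly like this:

/* GCC calls these hooks on entry/exit of every instrumented function, so
 * logging the entry addresses yields the chronological access order that
 * the linker-script generator consumes. */
#include <stdio.h>

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    /* Addresses get resolved back to symbol names later (e.g. addr2line). */
    fprintf(stderr, "enter %p\n", this_fn);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
}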

This results in 200ms faster startup on my 7200rpm laptop hard drive, which is about 10% of my startup time. I think that’s pretty good for a proof of concept. Unfortunately there isn’t a measurable win on the SSD (not surprising), nor a reduction in memory usage (I expected one due to not having to page in code that isn’t needed for Firefox startup).

I suspect the problem is that data sections need to be laid out adjacent to functions that refer to them. I started sketching out a treehydra script to extract that info.

I posted the relevant testcase and scripts. Do hg clone http://people.mozilla.com/~tglek/startup/ld to see the simple testcase and various WIP firefox scripts.

Long-term Expectations

The majority of Firefox startup overhead (prior to rendering of web pages) comes from frustrating areas such as inefficient libraries (e.g. fontconfig, GTK) and the mess caused by crappy layout of binaries and overuse of dynamic libraries. This post describes one small step towards fixing the crappy layout of our binaries.

I would like to end up in a world where our binaries are static and laid out such that they are read sequentially on startup (so we can use the massive sequential read speeds provided by modern storage media). Laying out code/data properly should also result in memory usage reductions, which should be especially welcome on Fennec (especially on Windows Mobile).

I am hoping to see 30-50% startup time improvements from this work if everything goes according to plan.


19
Jan 10

Chromium vs Minefield: Cold startup performance comparison

Hunting Down Mythical “Slowness”

I recently met a developer who used Chromium instead of Firefox. Chromium’s superior startup speed was his reason for using it. This got me excited because said developer was running Linux, so it was relatively easy to measure cold startup and get a complete IO breakdown.

It turned out Firefox took roughly 23 seconds to start. After much cursing about how I’d never seen Firefox start up this slowly, I eventually gave up on figuring out what was slowing his startup, and instead we measured Chromium startup. It turned out to also be roughly 23 seconds: the super-slow hard drive made everything slow. Chromium’s superior startup was a myth in this case.

Measuring Startup

As a result of investigating the startup myth above, my kiwi coworkers encouraged me to post a comparison of Chrome/Firefox startup. I am at linuxconf at the moment, so I did the comparison on my laptop.

Laptop configuration:

  • Intel(R) Core(TM)2 Duo CPU L9400 running at 800MHz to amplify any performance differences.
  • HITACHI HTS722020K9SA00 harddrive for the user profile and browser binaries
  • OCZ Vertex 30GB SSD for system libraries/configuration.
  • Fedora 12, Minefield 20100119 tarball, chromium-4.0.285.0-0.1.20091230svn35370.fc12.i686
  • sudo sync && sudo sysctl -w vm.drop_caches=3 && sudo sysctl -w vm.drop_caches=0 to clear the disk cache in between runs

What am I testing? I am measuring the time from invoking the browser until a JavaScript snippet embedded within a basic webpage is executed (i.e. Vlad’s approach, with a slightly modified startup.html). The above sysctl command clears the disk caches, creating a situation similar to when one has just turned on the computer and it hasn’t yet loaded any of the browser libraries from disk into memory. This is a blackbox approach to measuring how long it takes from clicking on the browser icon to getting an interactive browser.

Firefox commandline: firefox -profile /mnt/startup/profile/firefox -no-remote file://`pwd`/startup.html#`python -c 'import time; print int(time.time() * 1000);'`

Chromium commandline: chromium-browser --user-data-dir=/mnt/startup/profile/chrome file://`pwd`/startup.html#`python -c 'import time; print int(time.time() * 1000);'`

Both of these tests are done with an empty profile that was populated and has settled after running the browser a few times.

Results

The following numbers are milliseconds reported by the startup.html above.

Running Chromium five times: 4685, 4168, 4222, 4197, 4232

Running Minefield five times: 3155, 3273, 3352, 3311, 3322

I picked Minefield because that’s the browser that I run and the codebase that I focus on. The Linux Chromium channel seems to be the closest parallel to Minefield. I did not test on Windows because it is a bit of a nightmare to measure cold startup there.

Conclusion

On my system Minefield is around 30% faster at starting up with an empty profile than Chromium (the difference is amplified by running the CPU at 800MHz). For a comparison of Minefield against older Firefox versions, see Dietrich’s post.

I suspect that there is a relatively small difference between the two browsers because we are running into the fundamental limitations of loading large applications into memory (my rant).


04
Jan 10

Windows 7 Startup Exploration

I did some digging to figure out whether one can set up cold-startup testing on Windows 7 without nasty hacks. My conclusion is: sorta-kinda.

The Good – Most of the Ingredients Are Present

I haven’t actively used Windows since pre-XP days. It looks like it has come a long way since then: there is now a decent interactive shell, all kinds of settings/services can be controlled from the commandline and there is even sudo-like functionality.

PowerShell takes inspiration from the Korn shell and throws in .NET, which allows for much nicer “shell programming” than the dominant bash shell.

mountvol is a terrible equivalent to mount on Linux – but it exists, so I’m happy.

NTFS junctions are frustrating equivalents to links in a Unix filesystem.

The Bad

The essential ability to completely flush the filesystem caches isn’t there. This isn’t quite as embarrassing as it seems, since Mac OS X’s purge command does not flush the page cache (so mmapped files are not purged from the cache); technically OS X has the same limitation, and only Linux gets it right.

The Ugly Workaround

After much brainstorming we figured out that we can clear all relevant caches on Mac OS X by putting files that we care about on a separate partition and mounting/unmounting it for every measurement.

Ridiculously, Windows is “smarter” than that and appears to cache stuff per drive, such that mounting/unmounting a partition has no effect on the cache. The best workaround I could come up with involves putting said partition onto a USB disk and unplugging it in between unmount/mount testing cycles.

Windows 7 Startup Recipe

1) Set up junctions for the 2 profile directories to point to the USB partition, and unzip Firefox onto that partition.

2)
$old = (get-location)
$mountpoint = $env:userprofile + "\cold"
# magic name given by running mountvol
$drive = "\\?\Volume{885d5bc3-e918-11de-a4e5-002268e3077c}\"
# Based on http://poshcode.org/696 + fiddling with UAC settings to avoid prompts
sudo mountvol $mountpoint $drive
# Mountvol doesn't seem to block until drive is mounted
sleep 1
#mountvol
cd $mountpoint\firefox
echo (pwd)
# The following command shows PowerShell awesomeness
# based on Vlad's approach
./firefox.exe -no-remote "file://$(pwd)\startup.html#$([Int64](([DateTime]::utcnow - (new-object DateTime 1970,1,1)).ticks/10000))"
cd $old
# I haven't yet figured out how to wait on firefox.exe to finish
sleep 10
sudo mountvol $mountpoint /d

3) Unplug USB drive