Windows Sucks At Memory-Mapped IO During Startup

Last week I learned about how Windows handles page faults backed by files (specifically xul.dll). I already knew that Linux was suboptimal in this area; perhaps the clever people in Redmond did better.

Shaver pointed me at xperf, which is sort of like the new Linux perf tools. Xperf rocks in that it can capture the relevant data and it can export it as .csv.

Even with profile-guided optimization, Windows causes 3x as much IO on xul.dll as Linux does on libxul.so. That's especially interesting given that xul.dll is one-third smaller on Windows. Here is the graph. PGO isn't helping on Windows as much as it can help on Linux because MSVC PGO doesn't do the equivalent of GCC's -freorder-blocks-and-partition (unless I missed something in the docs).

With the Windows prefetcher, there were 4x fewer xul.dll IOs (graph here). Unfortunately, the prefetcher can't figure out that the whole of xul.dll should be paged in, so we still end up with an excess of random IO.

Why?

When a page fault occurs, Windows goes to read the page from the file and reads a little extra (just like any other sane OS), assuming that there will be more IO nearby. Unfortunately the gods of virtual memory at Microsoft decided that for every page fault, only 7 extra pages should be read. So reads occur in 32K chunks (the faulting 4K page plus 7 more), vs 128K on Linux, which is still too small. To make matters worse, segments mapped in as data only read in 3 extra pages (i.e. 16K chunks).

This is funny in a sad way. Reading in 32K chunks is supposed to minimize RAM usage (which makes no bloody sense when Win7 officially requires 1GB of RAM). As a result of being so thrifty with easy-to-evict file cache, Windows ends up doing 4x as much file IO as Linux. The 16K reads are particularly funny: one can see the result of that misoptimization in the string of puny reads at the top right of the graphs.

Surely There is an API like madvise()
On POSIX systems madvise() can be used to influence the caching behavior of the OS; fadvise() is a similar call for IO on file handles. For example, Firefox fastload files are now madvise()ed such that they are read in a single 2MB chunk on startup. Unfortunately, it appears that Windows has no such APIs, so one is stuck with pathetically small reads.
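
For reference, here is roughly what those POSIX hints look like (a minimal sketch, not Firefox's actual fastload code; the file name is made up and error handling is omitted):

    /* Hint the kernel to read a whole mapped file ahead of time
     * instead of waiting for page faults. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "startupCache.bin";   /* hypothetical file */
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* File-handle hint: start readahead on the whole file. */
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

        /* Mapping hint: ask for the whole range to be paged in. */
        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        madvise(p, st.st_size, MADV_WILLNEED);

        /* ... use the mapping ... */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }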

At first I thought that passing FILE_FLAG_SEQUENTIAL_SCAN when opening the file handle would work like a crappy fadvise() equivalent. It turns out that mmapping files completely bypasses the Windows Cache Manager, so that flag just gets ignored.
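
Here is roughly what that attempt looked like (a minimal sketch, error handling omitted; the DLL name is just for illustration):

    /* FILE_FLAG_SEQUENTIAL_SCAN is a Cache Manager hint, so it only
     * affects ReadFile()-style IO; page faults on the mapped view
     * below still come in tiny chunks. */
    #include <windows.h>

    int main(void)
    {
        HANDLE file = CreateFileW(L"xul.dll", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING,
                                  FILE_FLAG_SEQUENTIAL_SCAN, NULL);
        HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY,
                                            0, 0, NULL);
        void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

        /* Touching pages of 'view' faults them in 32K at a time;
         * the sequential-scan hint above is effectively ignored. */

        UnmapViewOfFile(view);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }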

So as far as I can tell, the only way to convince Windows not to read stuff in stupidly small chunks is to mmap() everything we care about using large pages. Unfortunately that comes with some significant costs.

We are going to try to order the binaries better, which should halve the number of page faults.

Can Windows Do Anything Better?

Yes. The “Unix way” of breaking everything up into a billion libraries and configuration files results in 10x more files being read on startup on Linux than on Windows. Just because Linux can read individual libraries 4x faster doesn't mean that IO becomes free.

Presently, in ideal conditions, Firefox starts up 30-50% faster on Windows. The Windows Prefetcher hack sweeps a lot of Windows suck under the carpet, but Microsoft has a lot of room for improvement.

Update:
People seem to prefer to comment on reddit. If you want me to not miss your comment, make sure you comment here.

17 comments

  1. > Just because Linux can read individual libraries 4x faster, doesn’t mean that IO becomes free.

    To be truthful, these libraries are most likely already in memory because they were already loaded by the desktop environment. That doesn’t come syscall free, but certainly prevents a lot of I/O from happening.

    What is interesting, though, is that Mozilla's inclusion of some libraries can somehow also make things (slightly) worse: I once saw in an strace that the system libogg was being loaded indirectly by some other library (I think it was libcanberra), even though libxul.so has its own copy embedded.

  2. I know it's heresy to the “share every possible piece of code” crowd, but IMO the right answer is simply to statically link as much as possible on Windows. Then you have one big binary that you can order as correctly as possible, and the Windows prefetcher has the maximum chance to do something right.

  3. Taras, your graph links are swapped: you have the prefetch link for the no-prefetching case and the no-prefetch link for the prefetching case. That confused me at first, since the numbers then contradict the point you are making.

  4. Taras, this might be a total hack, but we already hook the DLL loader — we could also hook NtOpenFile, which is called by the loader to create the file handle for the DLL (see http://msdn.microsoft.com/en-us/magazine/cc301727.aspx). There’s a FILE_SEQUENTIAL_ONLY flag that can be passed here, but that seems to imply that /only/ sequential access will be used. We may be able to do something smarter, like pre-reading the file ourselves.
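
    Roughly, assuming we already have a way to interpose on ntdll's NtOpenFile (the hooking machinery itself is elided here, and this isn't shipping code), the replacement would just OR the flag into the open options:

        /* Sketch only: sOriginalNtOpenFile would be filled in by whatever
         * hook mechanism is already in place for the DLL loader. */
        #include <windows.h>
        #include <winternl.h>

        #ifndef FILE_SEQUENTIAL_ONLY
        #define FILE_SEQUENTIAL_ONLY 0x00000004  /* from the DDK headers */
        #endif

        typedef NTSTATUS (NTAPI *NtOpenFileFn)(PHANDLE, ACCESS_MASK,
                                               POBJECT_ATTRIBUTES,
                                               PIO_STATUS_BLOCK, ULONG, ULONG);
        static NtOpenFileFn sOriginalNtOpenFile;

        static NTSTATUS NTAPI
        HookedNtOpenFile(PHANDLE FileHandle, ACCESS_MASK DesiredAccess,
                         POBJECT_ATTRIBUTES ObjectAttributes,
                         PIO_STATUS_BLOCK IoStatusBlock,
                         ULONG ShareAccess, ULONG OpenOptions)
        {
            /* Filtering for DLL opens (e.g. by inspecting ObjectAttributes)
             * is left out of this sketch. */
            return sOriginalNtOpenFile(FileHandle, DesiredAccess,
                                       ObjectAttributes, IoStatusBlock,
                                       ShareAccess,
                                       OpenOptions | FILE_SEQUENTIAL_ONLY);
        }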

  5. That’s a little surprising — I was under the impression that Windows Vista removed the read-ahead limit: http://geekswithblogs.net/sdorman/archive/2006/06/18/82251.aspx

  6. This is silly. If you want to read a large file sequentially… Well, just read it! Allocate a big buffer, do a single disk read for the whole file, deallocate, load library. Isn’t that possible?

  7. @glandium

    I think we both agree that extra IO is bad, warm or cold. Furthermore, Firefox is likely to be one of the first apps started for many users (it is for me). Some app ends up paying the price for poor filesystem access patterns.
    I actually agree with you on shared libraries.

    @Noel
    Yes. There is no reason to have a crapload of libraries on Windows. https://bugzilla.mozilla.org/show_bug.cgi?id=525013 is moving us in that direction.

    @Wladimir thanks for the heads-up, fixed it.

    @Sid this is why xperf is nice: it lets one confirm any such theories. As I referred to in the post, mmap bypasses the Cache Manager, which is in charge of readahead (way to do it in the wrong part of the OS, Microsoft!). Apparently the Cache Manager does IO via mmap() too, so it's not clear whether readahead in Microsoft terminology is just a bunch of sequential 32K reads.

    @Vlad That's awesome; it would be interesting to try to enable large pages for mapping the binary. As I mentioned in the post, FILE_SEQUENTIAL_ONLY is ignored for mmap IO.

    @Rem not for loading libraries.

  8. @Rem

    Even then, open+read+close would be one copy instead of zero. Theoretically (i.e. if it weren't for the cost of page faults), mmap-ed read-only files would be faster because the pages would be backed by the OS caches, with absolutely no memory-to-memory traffic.

  9. Maybe the answer is to stop trying to load 11mb of pointless bloatware rubbish?

    I remember when firefox was a lean, mean and *fast* browser. I miss those days.

  10. Rather than Windows' reads being too small, have you considered that perhaps your application is too big? Perhaps Firefox startup is slow due to wanting to load an XML-based JavaScript GUI framework when it starts up rather than being a slim, fast native application?

    There are a lot of applications that load way faster than Firefox, so that should serve as an existence proof regarding where the fault lies.

  11. Wow, yet Thunderbird loads WAY faster on my Fedora netbook than on my XP quad-core (Intel i-5) workstation at work with a “real” hard disk, not a slow netbook one.

    I/O on Windows sucks SO much, it’s incredible.

  12. Actually, Mac OS X has what you’re describing wrt Mm and Cm using the same cache called Unified Buffer…something (it’s been awhile), and it ends up being really bad for certain scenarios, because you’ll end up paging stuff to disk in order to free up disk cache space.

    So imagine a VM booting up – it's doing tons of reads, so lots of memory gets used for disk cache. Well, where do those reads go? To memory, so all of the memory you've allocated that got paged out to make room for the disk cache gets paged back in. The two systems end up fighting with each other, unlike on Windows, where disk cache memory is free to be converted to allocated memory.

    Also, don't use large pages; they require the kernel to find large contiguous sections of memory to back the page, which on some systems will just fail.

    If you want to fool Superfetch, it's simple – start up a thread in the background that simply reads over the module in memory (rough sketch at the end of this comment). That'll cause the prefetcher to record that the entire module will be paged in.

    However, are you *sure* you really need the entire thing? Imagine if every app considered itself important enough that it never wanted to get page faults, you’d end up with a far *slower* system overall. Superfetch is smart, it knows when your disk isn’t in use so it can page in stuff in the background.
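
    Rough sketch of the "read over the module" idea (untested, assumes psapi's GetModuleInformation and a 4K page size; link with psapi.lib):

        #include <windows.h>
        #include <psapi.h>

        /* Touch every page of an already-loaded module from a background
         * thread, so the faults happen early and the prefetcher sees the
         * whole module being used. */
        static DWORD WINAPI WarmModule(LPVOID unused)
        {
            (void)unused;
            MODULEINFO mi;
            HMODULE mod = GetModuleHandleW(L"xul.dll");
            if (!mod || !GetModuleInformation(GetCurrentProcess(), mod,
                                              &mi, sizeof(mi)))
                return 1;

            volatile const char *p = (const char *)mi.lpBaseOfDll;
            for (DWORD off = 0; off < mi.SizeOfImage; off += 4096)
                (void)p[off];                   /* fault the page in */
            return 0;
        }

        /* Early in startup: CreateThread(NULL, 0, WarmModule, NULL, 0, NULL); */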

  13. @RobM and @Says If you have nothing helpful to say, then don’t say it.

    Firefox has never been leaner (in terms of performance and memory footprint) than it is now.

    Unfortunately, in order to implement a web browser you have to write an application bigger than calc.exe. Large apps are a pain (to different degrees) to load on most OSes.

    All I’m doing is teaching it to play nice with specific OSes that it runs on so it can continue doing what it does, more efficiently.

    As far as removing bloat goes, I believe Mozilla is on the cutting edge of software development for that. See http://ehren.wordpress.com/ for how we are getting rid of unneeded bloat.

    @Paul thanks for your help.

  14. Oh, I see re the cache manager vs mmap. Wow, that’s just plain idiotic.

  15. Taras, Visual C++’s PGO certainly does do block separation, see:
    http://msdn.microsoft.com/en-us/library/e7k32f4k%28VS.80%29.aspx
    “* Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.
    * Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.
    * Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.”

    I know for sure that our binaries have at least some of these optimizations applied; I've seen the results when looking at the debug symbols we produce.

  16. Although not helpful in itself, RobM’s comment is just the common sense reaction of most readers. “Sounds like there’s too much in there, maybe many things can be optimized away.”

    Thanks for providing an interesting reply with good links so we all can gain more insight into all the things the mozilla developers have to take into account.

  17. @RobM this isn't even getting into loading the XUL and all that other stuff; this is purely loading native compiled code. If you toss out XUL and JS, the native code is only going to be BIGGER, and so slower to load than it is now, since you still need pretty much all of xul.dll: it's got the layout engine in it, and it does HTML too, so without it Firefox wouldn't be very useful. Using XUL and JS for app code is just re-using stuff they have to build anyway to support web pages for the browser UI, and that means less native code to load, which means less of the kind of IO being profiled here.