Squeezing Every Last Bit of Performance Out of The Linux Toolchain

Magic of GCC PGO

On Friday I finally got gold to produce a prelinkable static binary (bug). I also got around to trying out GCC profile-guided optimization with the debloatifying -freorder-blocks-and-partition option. This option breaks up every profiled function into cold and hot “functions”, then lumps all of the hot functions together.
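
As a rough illustration of what the hot/cold split means (PGO derives it automatically from profile data; this hand-written analogue just uses GCC's hot/cold attributes, and every name in it is made up):

    /* GCC places functions marked cold into .text.unlikely, so rarely-run
     * error paths stop polluting the pages that hold the hot startup code. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Rarely taken error path: ends up in .text.unlikely. */
    __attribute__((cold, noinline))
    static void report_fatal(const char *msg)
    {
        fprintf(stderr, "fatal: %s\n", msg);
        exit(1);
    }

    /* Hot path: GCC groups hot functions together (e.g. in .text.hot). */
    __attribute__((hot))
    static int parse_flag(const char *arg)
    {
        if (__builtin_expect(arg == NULL, 0))   /* hint: almost never true */
            report_fatal("missing argument");
        return arg[0] == '-';
    }

    int main(int argc, char **argv)
    {
        int flags = 0;
        for (int i = 1; i < argc; i++)
            flags += parse_flag(argv[i]);
        printf("%d flag-like arguments\n", flags);
        return 0;
    }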

PGO performance is amazingly close to that of binaries produced by icegrind (within 10% based on page-counts).

Build                   Startup Time  RSS (KB)
firefox.stock           2515 ms       49452
firefox.ordered         1919 ms       45344
firefox.static          2321 ms       49616
firefox.static.ordered  1577 ms       37072
firefox.static.pgo      1619 ms       38436

In the table above, ordered means application of icegrind; static means a fat firefox-bin. To generate the PGO profile I just started and closed Firefox, so it’s not too surprising that the results are similar to those of an explicitly ordered binary. RSS refers to how much memory is mmap()ed into the process (lower RSS usage means we are paging in less junk). I was not able to control the layout of PGO builds; that will need some linker hackery to deal with split functions.
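
For reference, one way to sample a process’s RSS on Linux is to read the “resident” field of /proc/self/statm (reported in pages); this is only a sketch of how such a number can be obtained, not necessarily how the table above was measured:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long size_pages, resident_pages;
        FILE *f = fopen("/proc/self/statm", "r");
        if (!f || fscanf(f, "%ld %ld", &size_pages, &resident_pages) != 2) {
            perror("statm");
            return 1;
        }
        fclose(f);

        /* statm counts in pages; convert to KB like the table above. */
        long page_kb = sysconf(_SC_PAGESIZE) / 1024;
        printf("VSZ: %ld KB  RSS: %ld KB\n",
               size_pages * page_kb, resident_pages * page_kb);
        return 0;
    }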

I think the numbers speak for themselves. Isn’t it scary how wasteful binaries are by default? It amazes me that Firefox can shrug off a significant amount of resource bloat without changing a single line of code. I think this is a good demonstration of why application developers should a) expect more out of their toolchain (Linux, GCC, Binutils) and b) contribute back to it.

I think my next step is to tackle PGO Firefox builds (bug). From there I would like to teach icegrind to play well together with PGO.

Pieces of The Startup Puzzle

It has been a long time since I first noticed the suckiness of library I/O. After much digging, help from smart people on #gcc, and the LKML discussion, I think I finally have a pretty clear list of the remaining/in-progress things needed for Linux applications to start faster.

  1. Wu Fengguang is making headway on smarter readahead that makes better use of available RAM and disk bandwidth. Firefox could be read in 512KB chunks instead of 128KB (a 4x reduction in page faults). Distributions should be aggressively testing this patch.
  2. Better agreement on binary organization between the compile-time linker, the run-time linker and the kernel (see the LKML discussion). This can shave off a handful of unneeded page faults per file.
  3. A linker flag to specify how much of a particular library should be read in via madvise(). For example, any XULRunner app knows ahead of time that it will need large parts of libxul.so – might as well let the OS know (see the sketch after this list).
  4. Transparent read-only per-file ext4 compression (like OS X). Ted Ts’o indicated that this would be easy to add to ext4, but as far as I know nobody has jumped on it yet.
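
To make item 3 concrete, here is a rough sketch of what such a hint boils down to at the application level today: map the library and call madvise(MADV_WILLNEED) so the kernel starts paging it in before the faults arrive. The library path is only an example, and the linker flag itself does not exist yet:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *lib = "/usr/lib/xulrunner/libxul.so";  /* example path */
        int fd = open(lib, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ask the kernel to read the whole mapping in now, rather than one
         * small readahead window per page fault. */
        if (madvise(p, st.st_size, MADV_WILLNEED) < 0)
            perror("madvise");

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }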

I think with all of the above combined, we could load apps like Firefox at near-warm (0.5 second) speeds. Most of these are easy. #1 is hard, but it’s already being worked on. I’ll be happy to point someone at how to solve items 2-4 while I work on the Firefox side of things. The end result of all this should be more instant gratification for all Linux users.

13 comments

  1. Do you think this work will lead to reduced boot times for the whole system?

  2. @Darrin
    Yes. I believe this would improve distribution startup in addition to individual application startup.

  3. This is great news. Thanks for all your hard work on this. It looks like Linux will be better for it :-)

  4. Per-file compression might also be useful for .mo files which are large but contain easily-compressed ASCII data.

  5. Very impressive results. Just under a second saved – it all adds up.

  6. “Transparent read-only per-file ext4 compression (like OSX).”

    Just put ‘em all in a jar/ZIP file ;-)

    Presumably file compression is a bad idea for any file sections you can mmap() from disk. Are there any such files in an app like Firefox?

    Such great work!

  7. Well, the kernel has LZMA support – why not use this for the fs?

  8. @bgm, because someone actually has to do the work of hooking that up to a filesystem. Talk is cheap.

  9. @steve: Aren’t all libraries and executables mmap’ed in? But I assume that is hidden from the higher levels, so it will still work; it will just be a little slower (assuming you can still efficiently find the right page to decompress and map in).

  10. On a side note: PGO is the reason why I’m still not using FF 3.6 – that’s because there’s no PGO-based, CPU-centered build of Firefox 3.6 available yet – aka Swiftweasel. FF 3.5 and 3.6 in the default vanilla binary flavor are huge resource hogs, which nearly brought me to converting back to FF 3.0 – till I found out about PGO-based builds like Swiftweasel. After that, I was happy again... works like a charm ;)

    And seeing this data here gives me hope that this is indeed going to be a huge improvement for FF, esp. for Linux / *nix users like me ;)

    cu, w0lf.

  11. @Ben Bennet, skierpage: the disadvantage of jar/zip files is exactly that mmap’ing doesn’t work, and this wastes memory (the file ends up both in the cache and in the process’s memory). And the app has no control over how libraries are stored – it’s glibc’s ld.so that does. And since this idea is broken, glibc won’t implement it.
    When you do it at the filesystem level, I guess the simplest way to make mmap() work is to have the uncompressed file content in the page cache. At that point, you don’t even need a perfect index of the file, just a good approximate one, so that what you uncompress will contain the right area. Actually, I guess that compression is done in a more local, per-chunk way (since LZ77-based algorithms work on a sliding window anyway, you probably don’t lose a lot). Not sure about LZMA, though.

  13. So is icegrind not needed then or what?