31
Mar 10

How to get reviews fast: Delete Code!

The fastest review I ever got was when I deleted a bunch of code from libjar. It was so riddled crud it was next to impossible to modify the code. That was a mere 24KB patch produced manually.

Ehren just put my personal records to shame with a 94KB patch that got r+ed at breakneck speed. Best part is that he generates these by plucking off dead parts of the Mozilla callgraph via static analysis. I bet if he tried he could land 90KB patches every day. Way to go Ehren!

I expect more good code deletions to make themselves known via this meta bug.


29
Mar 10

Linux Startup Inefficiency

There is a little bit of a disconnect between the kernel, the dynamic linker and compile-time linker. As I mentioned in my main “startup sucks” post: not-notifying the kernel of your patterns can kill performance, so can reading files backwards.

Turns out that if the compile-time linker lays out files without considering *exactly* how they are read by the runtime linker, binaries will load slower. I filed a bug on that.

Currently, SuSE appears to be leading in startup performance by shipping their glibc with the fadvise() patch (to load binaries from disk in bigger chunks). A bug is filed for this to get fixed in glibc, but in the meantime distributions should consider including SuSE’s patch (glibc-2.3.90-ld.so-madvise.diff) in their libc.


25
Mar 10

Madvise, Prelink Update

Got some helpful comments on my previous post.

madvise(WILLNEED) in ld.so

Frank Ch. Eigler pointed out that other people have noticed the madvise/mmap deficiency in the dynamic linker. Unfortunately those unfortunate people did not have the ability to flush caches or to measure page faults exactly. Linux has gotten a lot nicer for diagnosing this. Frank also filed a glibc bug. SuSE already ships with a madvise()ed dynamic linker.

Prelink saga

Both Bradley Baetz and Frank pointed me at the FIPS-caused bug which causes prelink to get angry at Firefox. This isn’t the first time that practically-useless FIPS has gotten in the way of a higher performing Firefox.

The most common response to people who complain about run-time linker issue is “are you using prelink?”. So I did my best to assume that prelink would erase a bunch of the overhead I saw in the previous bug.

I got rid of the offending nss rule in the prelink configuration as the bug instructed; prelink started working on Firefox (according to LD_DEBUG=statistics). Unfortunately the IO pattern was identical to what I was seeing before.

Frank suggested that I try prelink -u to undo any prelinking. This caused a bunch more runtime-linker IO from which one can conclude that the overhead I am seeing is not solved by prelink. The is still a lot of IO that happens before any application code is executed, so the runtime linker is still my primary suspect.

Going to spend some quality time with SystemTap to try to figure out what the linker is doing.


24
Mar 10

Linux: Why Loading Binaries From Disk Sucks

Note: I am doing my measurements and experiments on Fedora 12, once I feel that I understand and can solve the problems on Linux, other operating systems will follow. The aim of this post is to document what I have learned about the mysterious process of loading programs from the filesystem perspective.

A binary is broken up into segments. There are about half dozen different segments in an executable, but only two that matter here:

  1. A segment that mostly contains function-bodies and some const data. It’s mapped in read+execute mode
  2. A segment that contains variable initializers, GOT, PLT, lists of constructors/destructors, etc

The compile-time linker composes segments out of sections that contain variable data, function bodies, etc. The run-time linker maps the segments into memory via mmap. It resolves references to data and code in dynamic libraries (eg relocation) via COW.
IO happens when the program (on behalf of the run-time linker) is trying to access memory that hasn’t been read from disk yet. These interruptions are called page faults. They cause the kernel to interrupt the program to read in some multiple of pages (4096byte chunks) from disk. On my system, page faults trigger 131072byte (32 pages) reads.

For a detailed explanation of how linkers work check out the guides written by experts. Out of the scant list of literature on the subject, my “favourite” is Ulrich Drepper’s “How to Write Shared Libraries”. It actually explains things in terms of file offsets, not just virtual addresses.

A common misconception is that mmap()-caused IO is free (because you don’t issue any explicit read() statements). IO via page faults is just another “API” for file-access, there is no reason for it to be free. Another misconception is that one has no control over IO-patterns triggered by mmap(). On Linux-like OSes one can use madvise(). (Windows is more limited, afaik one can only set io-pattern-flags on the filehandles).

Prelinking Fail

Having the run-time linker fix up the binary causes a huge amount of IO even before any application code gets executed. These faults are visible in the second graph in my previous post. The linker’s faults (pun intended) are the small green rectangles above the big ones. The graph clearly shows the huge relative cost of inefficient run-time linker memory prodding.

Supposedly, this problem is largely solved by prelinking. I can’t confirm that as prelink does not work on Firefox on the systems that I can measure IO with SystemTap. This non-determinism is frustrating; we should figure out a way to warn to the user that the OS infrastructure failed them.

Post-linker Fail

Above IO patterns can be blamed on careless run-time linker behavior. IO after that can be attributed to lack of organization in the output of the compiler and the compile-time linker. Turns out that the layout of the .text section (where all of the function bodies lie) and to a smaller degree .data[, .bss, etc] sections(ie variable initializers) is completely unhelpful. For a good laugh look at how the official nightly libxul.so is read mostly through backwards io (graph link).

<aside>The libxul.so graphs in the previous post did not exhibit this kind of insanity. I did my best to order functions based on chronological order of access (courtesy of my valgrind hack). Unfortunately, the chronological log is not deterministic. Someone suggested that statistical fudging via markov chains will help. From the io patterns in the graph I’m guessing that io patterns detected by valgrind and those that actually happen diverge due to threading differences. Linux users that are interested in fast startup should pray to the GCC Santas to reorder binaries as part of Profile-Guided-Optimization.</aside>

Are Large Programs Screwed Out of Fast Startup?

Yes, but it doesn’t have to be this way. Turns out this utter disaster is caused by naive use of mmap() in the dynamic-linker. The second graph (previous post) shows, that even a late madvise call (delightfully nasty patch) can significantly improve the situation.
Specifying madvise(MADV_WILLNEED) causes individual faults to read in 2097152bytes (512 pages,  16x larger reads than default),  3x(10x if one counts only ones after madvise()) reduction in the number of total faults, saves about 1 second of startup.

The basic trick is outlined as “approach c” in this bug. My current thinking is to:

  1. use Jim Blandy’s executable-parsing library from breakpad(which is already in our build) to parse our binaries
  2. calculate what the dynamic linker will mmap() at runtime.
  3. have some firefox.sh-like native frontend mmap()/madvise() it with appropriate flags

In the longer term some fixes for madvise() should land in both runtime and compile-time linkers.

Conclusion

It took me a long time to produce the above story from my runtime linker observations. As recently as December I had no idea what a runtime linker was or what linker segments, sections, etc were. I’d like to thank Valgrind authors, kind folks on #gcc and numerous individuals who helped me connect the pieces. The above is written to the best of my still-incomplete understanding; I will appreciate any corrections.


23
Mar 10

When in Trouble, Draw a Picture

Graphs

Note: the following graphs broke on the nightlies this week. I would appreciate help with reducing this to a proper testcase. They work fine on older nightlies and released versions of Firefox. Non-spidermonkey JS engines wont work as they don’t support destructuring assignment and other goodies.

Once I graphed my file-access logs, most of my problems became clear. Here is the first graph(screenshot). The y-axis is time; once you click on a square, x-axis is the file offset. One can clearly see that libxul (4th rectangle) is a big chunk of our io. It’s also clear that initially the file is being accessed in the begining and near the end. One can also see that there is a some “backwards” io. It’s ugly stuff.

I first saw this picture on Friday evening, I spent the last 2 days trying to make libxul.so load less stupidly. Here is a graph(screenshot) from a hack I tried this morning (approach b in the bug) . The top is still stupid, but near the bottom, libxul is being read in a much more efficient manner (>50% less io).

This graph is from a lunch-time hack(approach c). IO is being spread across two processes so it’s evading my systemtap script for  now. Fortunately, “approach c” also shaved off a second of my startup time, so I know the libxul graph would’ve been real pretty.

What follows is a story of how I ended up graphing stuff; more on lessons learned later…

Continue reading →


11
Mar 10

Extensions & Startup

Dietrich blogged a “wake up and smell the startup” executive overview of startup issues caused by our extension practices. This post is a “numbers” followup. For this experiment I installed a brand-spankin-new copy of Linux Firefox 3.6. Firefox is installed on a 7200 hard drive, the rest of my system lives on an SSD. The CPU is core2duo, keep in mind these numbers will be significantly worse for people running on netbooks and other common hardware. The numbers vary +/- 150ms, but the general picture is pretty clear.

Results

Startup Time
Firefox 3.6 with no extensions: 2240ms
+Adblock Plus (no subscriptions) 2538ms
+Video Download Helper 2727ms
+Personas 3220ms
+Greasemonkey 3300ms
+EasyList subscription for adblock 4044ms

I just doubled cold startup time for Firefox by merely adding 4 extensions. It takes weeks or even months of developer time to shave off every 100ms off Firefox startup, but mere seconds to undo any of those gains by installing extensions. These are just the top-4 extensions in the list (presumably they are higher quality too), I’m sure there are lots of other extensions with more drastic performance hits.

Dietrich’s post details some of the remedies that should reduce the startup cost of extensions. For the inquisitive minds: I used SystemTap to produce a report of files read by Firefox on startup ordered by their startup cost.

Update:
Dietrich asked me to summarize warm startup too:

  • Without extensions: 550ms
  • With above Extensions: 1800ms

Note that this is a developer blog, so by “remedies” I meant “things developers can do to”. There is little normal users can do short of complaining to the extension authors.

This post isn’t meant to shame specific extension authors into speeding up their extensions. The aim is to show that a measurable percentage of startup is due to extensions and that we need to:

  1. Educate extension developers about it
  2. Provide better tools to measure slowdowns caused by extensions
  3. Make sure that the Firefox side of extension handling is sufficiently efficient

05
Mar 10

Mirror, Mirror on the Wall, Why is my Binary Slow?

In an earlier post I described my Fiji hack: how to use some nasty instrumentation to spit out ld scripts to speed up cold startup. This week I tried to extract more data out of the binary to lay it out even better. Trouble is that even if one lays out functions perfectly, they load data for things like variable initializers which will cause more IO.

A very clever friend suggested that I can write a valgrind plugin that can detect data accesses and function access and write the linker input files in one step. So with much hand-holding I hacked a sample valgrind plugin to do what I want. Unfortunately, my binaries ended up not being significantly faster(if at all) than the Fiji ones. They also ended up 20% bigger.

Fortunately, the GCC devs were able to point out my linker mistakes and pointed me at a linker patch that does what I want without linker scripts(and has less binary-bloating side-effects). Unfortunately, that just confirmed that the speedup I was looking for wasn’t hiding behind data symbols. So I am going to have to sit down with my io tracing script and study what the heck is going on.

Cool Things I Learned

In the process of helping me, GCC people namedropped some compiler flags that may prove very helpful:

  • -freorder-blocks-and-partition: Apparently this breaks up functions into hot/cold parts and gives them different section names so they can be moved around at link time.
  • -fno-common, -fno-zero-initialized-in-bss should go well with my favourites: -ffunction-sections -fdata-sections

Additionally, it may be possible to benefit from linking with large page support. I have some doubts about that.

I did learn about some cool GNU Gold flags:

  • –compress-debug-sections=zlib: Most of the overhead of linking a development libxul.so is writing out a near gig of debug data
  • –icf: Identical code folding, I think that matches the deduplication feature found in the ms linker. Saves 5% on my libxul.so