Oct 09

Studying Library IO – SystemTap Style

In my last blog post I expressed frustration with slowness induced by library IO. Then I went on a mission to measure it. I have been wanting to do this for a while, but I figured that only DTrace could get this info without recompiling my kernel. So I tried to build Mozilla under Slowlaris (but the linker got up to 3GB and then sat there swapping, ensuring that the nickname is justified). Then I fired up DTrace on the mini, but ran screaming because it seemed like the fbt DTrace provider refused to let me dereference structs (later Joel told me that I’m supposed to copy data explicitly like here).

But while googling for an fbt workaround, I stumbled upon a DTrace/SystemTap comparison wiki. SystemTap? The DTrace knockoff I have been hearing about? It works? This was a lightbulb moment where I realized that Linux was about to provide me with more information than I thought was possible.

So here is the data I got out of it:

Continue reading →

Oct 09

Rant on Library IO

So I’ve been trying to figure out how to optimize disk IO at startup. I looked into IO caused by libraries, and it turns out that apps with big libraries are screwed. Here is how I came to this conclusion:

Gnomer’s research on startup pointed out that dumb readahead leads to wins in terms of file IO. So I wrote some code and sure enough, reading in libxul at the top of our main() function does indeed result in a significant, measurable speed-up on both Linux and OSX.
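For illustration, here is a minimal sketch of what such a dumb readahead could look like. It is not the code I actually landed: preload_file is a hypothetical helper name and the 64KB buffer size is an arbitrary assumption. It simply reads the file front to back and throws the data away, so the pages are already in the OS page cache when the dynamic linker starts touching them.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not the actual Mozilla code): read a whole file
 * sequentially and discard the data, so that its pages land in the OS
 * page cache. Returns the number of bytes read, or -1 on error. */
long preload_file(const char *path)
{
    char buf[64 * 1024];   /* arbitrary chunk size, chosen for the sketch */
    long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* One sequential pass over the file; the contents are ignored. */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;

    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling something like preload_file("libxul.so") (the path is an assumption) as the first thing in main() trades one big sequential read for the many small random reads that page faults would otherwise cause.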

From the gnome page I found a link to some diskstat stuff. There lay a presentation with graphs that appear to show that OpenOffice has a much better cold IO pattern than Firefox. Given that there are some strong similarities between our application layouts, I went digging to see if OpenOffice does something funny. And oh boy, it does do funny page reordering on Windows and “slightly-smarter-than-dumb-readahead-style library prefetch” on Linux…

So here is an innocent question: Why is page-reordering not done as a PGO step? I mean shouldn’t you fire up your app, feed some info back to the linker and be done with it? Another question: Why can’t we mark certain files as “keep this whole file in ram if someone asks for part of it to be paged in”?

So is the only way to fast application startup via static linking? It sure is easy to

posix_fadvise(open(argv[0], O_RDONLY), 0, 0, POSIX_FADV_WILLNEED);
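Spelled out with error handling, the same idea might look like the sketch below. Note that posix_fadvise takes an offset and a length between the file descriptor and the advice, and a length of 0 means "through the end of the file". advise_willneed is a hypothetical helper name, and this assumes a platform that actually provides posix_fadvise (Linux with glibc does).

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* ask the C library to declare posix_fadvise */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: hint to the kernel that the whole file will be
 * needed soon. Returns 0 on success, -1 if the file cannot be opened,
 * or the error number reported by posix_fadvise. */
int advise_willneed(const char *path)
{
    int rc;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* offset 0, len 0: the advice covers the entire file */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    close(fd);
    return rc;
}
```

Unlike reading the file by hand, this is only a hint: the kernel is free to start the readahead asynchronously, or to ignore the advice.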

Are these hacks still the state of the art in making apps with large libraries start up fast?

Update: Found some mentions of GNU Rope unfinishedware and a relatively recent blog post

Oct 09

Restless Bug Fixing

I spent the past couple weeks analyzing and improving fastload performance. I’ve long been suspicious of fastload, but only finally got around to investigating it in detail. I think there is some fundamentally ironic rule in software that if you put the word “fast” in the name of a component, it is bound to eventually become a performance bottleneck.

Almost a decade has passed since the conception of this code, so it was time to update the code’s assumptions to reflect the capabilities of modern OSes. I landed the fix today. It results in startup performance gains of 1-20% on the various platforms I tested, making this the most exciting perf bug I’ve worked on.


Now that I’ve had my fill of almost a year’s worth of startup performance analysis, for the remainder of the year I plan to refocus on static analysis. My main goal is decent C support on Dehydra (not to mention the ever-elusive GCC 4.5 compatibility) and to facilitate a production-quality DXR.

I’m hoping that we’ll end up with cool ways of dealing with the painful/slow boilerplate (bugs 520626, 516085 and 517370).