20
Sep 11

Firefox 7: Cheating the Operating System to Start Faster

Firefox 7 features improved startup speed. Our research has shown that most OSes are not capable of starting large programs efficiently (see my older blog entries). As a result, Firefox 7 will explicitly tell the OS to aggressively preload our xul.dll/libxul.so/XUL library before passing it on to the runtime linker. This marks the productization of the approach explored in “20-line patch that doubles Firefox startup” that got people so excited. See bug 552864 and dependent bugs for exciting technical details.
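
To make the preloading idea concrete, here is a minimal sketch in Python rather than the C++ that lives in the bug. It assumes that reading a file front to back in large chunks is enough to pull it into the OS page cache before the runtime linker touches it. The function name preload_file and the 1 MiB chunk size are made up for illustration and are not taken from the Firefox patch.

```python
CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def preload_file(path, chunk_size=CHUNK_SIZE):
    """Read `path` front to back so the OS page cache ends up holding it.

    Returns the number of bytes read.
    """
    total = 0
    # buffering=0 keeps Python from adding its own buffer layer, so each
    # read() call becomes one large sequential read request.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Calling preload_file on the path of the XUL library before loading it would be the rough equivalent. How much it helps depends on the disk and on what the OS has already cached.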

Do Not Try this at Home: Ugly Windows

Note: Windows Prefetch does everything possible to thwart exciting startup optimizations. The above optimization only works when prefetch is disabled or broken (ie Firefox should be faster with Windows Prefetch off). See this comment on disabling prefetch for Firefox. Alternatively, one may get Windows Prefetch to not slow down Firefox startup with the following magic incantation:

  1. Install Firefox (or delete the windows prefetch entry for existing Firefox 7)
  2. Reboot (Do not start Firefox after installation!)
  3. Start Firefox.

The above helps populate Windows Prefetch in a less counter-productive way. Explanation: on warm startup, Windows Prefetch records irrelevant IO operations and blocks Firefox startup to preload files that Firefox accesses after startup.

Note: the heuristic that we use to detect Windows Prefetch will also prevent this optimization from kicking in on some exceptionally slow hard drives when prefetch is off. This is unfortunate as this optimization is most dramatic on exceptionally slow machines.

According to our preliminary telemetry data, less than 25% of our Windows users have Windows Prefetch off and can benefit from vastly improved startup. We expect to improve this in future Firefox releases by scheduling a prefetch cleanup operation similar to the script in above bugzilla comment.

Operating systems without counter-productive startup heuristics (ie Mac/Linux) will simply allow Firefox 7 to start faster.


26
Apr 11

Measuring Startup Speed Correctly

Until recently our state-of-the-art method for measuring startup was to subtract a timestamp passed via the commandline from a new Date() timestamp within a <script> tag. Vlad pioneered this approach; others and I adopted it.

Turns out there are two problems with this approach:

  1. It is cumbersome, especially on Windows where there is no easy way to pass a timestamp via the commandline.
  2. It is wrong. Turns out that Firefox starts loading web pages before the UI is shown. One can’t be sure that the page being loaded is within a visible browser.

Our oldest startup benchmark, ts, has been gathering wrong numbers all along. This resulted in a class of perverse optimizations that decreased the ts number but increased the time taken for the UI to appear (ie bug 641691). The new tpaint benchmark (bug 612190) should address this. On my machine, measuring pageload vs paint-time results in a 50-100ms difference. See the graph server for more data.

This is why AMO’s complicated method of measuring startup is wrong. Please use our shiny new about:startup extension or, if you absolutely want to avoid adding any overhead, use the getStartupInfo API directly.


09
Feb 11

Magic patch that halves Windows startup

The Internet has lately been obsessing over magically short patches that improve performance _ times (probably as a result of the LKML cgroups patch from a few weeks ago). So my work in bug 627591 got picked up by all kinds of news sources (mostly due to @limi’s manlove). Apparently all that internet fame is good for is getting script-kiddies to upload viruses as bugzilla attachments. Dear Internet, please do not interrupt me in the middle of an investigation.

The crux of the optimization lies in trading slow random IO for fast sequential IO. It turned out that my patch worked great if Windows Prefetch wasn’t trying to help (ie Firefox ran faster without prefetch on my test systems). With prefetch on, the patch was either a smaller win or a downright loss. When I dug deeper, it turned out that Windows Prefetch helpfully spends 3-6 seconds doing IO before any Firefox code gets to run. It also doesn’t read in a very clever pattern, resulting in a very small speedup for Firefox while preventing my exciting optimization.

So I curled up into my defeated fetal position and pondered how I would prevent Windows Prefetch from being so “helpful”. One way would be to install some crapware to cripple prefetch (kidding!); another way is to do the sequential IO in a separate executable (à la run-mozilla.sh on Unix). This way Windows doesn’t try to do insane amounts of IO before my preloading logic gets to run. This seems to work (see the wrapper.exe talk in the bug) and has the potential to halve Firefox startup times. It’s also ugly as sin, but if that’s what it takes…

So now I need more reports to make sure the executable wrapper approach reliably and significantly speeds up cold (post-reboot) startup. Then we can make a decision on how to integrate this into Firefox. But until we have all the data, please don’t jump to conclusions on what will and won’t make Firefox 2x faster.


29
Dec 10

Faster Plugin Enumeration + Help Wanted

In addition to slow font enumeration, we were suffering from a similar problem: slow plugin enumeration. Just as with fonts, the plugin enumeration code is different on every platform. Unlike the font situation, plugin enumeration is done completely within our code (ie easy to fix).

Plugin enumeration is often triggered by JavaScript code (for example by checking if a Java handler is present). This means that enumeration is a blocking operation that must happen quickly. XPerf made me wonder why so many plugin-like .dll files were being read. This led me to a fun set of perf fixes.

The Algorithm

  1. Files in plugin directories are listed.
  2. A platform-specific IsPluginFile function determines which files look like plugins (ie np*.dll on Windows).
  3. Code then checks whether the files and their timestamps are known to pluginreg.dat. If so, cached info is used and the following steps are skipped.
  4. For each library file that isn’t found in pluginreg.dat, we use the platform-specific GetPluginInfo to load the library file and see whether it is indeed a valid plugin (and to see what mimetypes it handles, etc).
  5. Valid plugins are recorded in pluginreg.dat.

This process took up to 3 seconds on a user’s computer. WTF? There were gotchas in almost every step of the way.

  1. Windows directory listing code would request metadata for every bloody file in the directory. This resulted in the easiest optimization ever: pure code deletion.
  2. IsPluginFile on Windows/Mac sneakily did more than just check the filename. It also checked whether the file was loadable, which on Windows loaded the dll and all of its dependencies. Mac code was satisfied with merely doing a little extra IO.
  3. This part was right.
  4. #2 was easily fixed by moving the file IO here.
  5. Files that failed the check in #4 were doomed to cause extra IO for all of eternity. Scott Greenlay fixed that by recording invalid plugin-like files too.
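
The fixed flow can be sketched roughly as follows. This is an illustration in Python, not the Firefox code: a plain dict stands in for pluginreg.dat, probe is a hypothetical stand-in for GetPluginInfo, and the np*.dll pattern is the Windows example from the algorithm above.

```python
import fnmatch
import os


def scan_plugins(directory, cache, probe, pattern="np*.dll"):
    """Return {path: info} for the valid plugins found in `directory`.

    `cache` maps path -> (mtime, info); info is None for files that looked
    like plugins but failed the probe. `probe(path)` is the only expensive
    step and returns plugin info, or None for an invalid file.
    """
    valid = {}
    # Step 1: list names only, without asking for extra per-file metadata.
    for name in sorted(os.listdir(directory)):
        # Step 2: the "looks like a plugin" check is a pure name match.
        if not fnmatch.fnmatch(name.lower(), pattern):
            continue
        path = os.path.join(directory, name)
        mtime = os.stat(path).st_mtime_ns
        # Step 3: a cache hit with an unchanged timestamp skips the probe.
        cached = cache.get(path)
        if cached is not None and cached[0] == mtime:
            info = cached[1]
        else:
            # Step 4: open and inspect the file only on a cache miss.
            info = probe(path)
            # Step 5: record the result even when the file is invalid,
            # so it is not probed again on the next scan.
            cache[path] = (mtime, info)
        if info is not None:
            valid[path] = info
    return valid
```

With a warm cache, a second scan of an unchanged directory calls probe zero times, including for the plugin-like files that turned out to be invalid.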

This was a rare fix that resulted in seconds saved on crapware-loaded computers. Usually I have to count my progress in milliseconds :(

Help Wanted

I have plans for vastly improving Firefox startup, but I need help to get there. If you enjoy beating under-performing code into submission and want to work for Mozilla, please send me your resume(taras at mozilla dot com). Example projects: a better performance testsuite (ie tracking IO, cpu instructions, etc), better infrastructure for profiling addons, optimizing away various CSS/XUL markup, etc. A low-level approach to solving problems is helpful, compiler/linker/kernel hackers are well-suited (but not required) for this.


21
Dec 10

Rude Surprise: Startup Overhead of Windows Font APIs

Imagine a typical Firefox user who starts their Windows computer in order to surf the web. The first app they launch is Firefox 4. It turned out that on systems that support hardware acceleration for 2D graphics, Firefox 4 takes minutes to start up. WTF? An XPerf-aided investigation showed that the Windows font enumeration code causes us to do 30x more disk IO (~300MB) than the rest of the Firefox code.

In order to hardware accelerate Firefox, we switched from GDI to DirectWrite for font stuffs. Apparently, DirectWrite is a wonderful API, but the implementation has some teething issues. DirectWrite opens a connection to the Font Service (and starts it if it isn’t already running); however, if the service fails to respond, DirectWrite proceeds to enumerate all of the system fonts on the client side. This isn’t cool for multiple reasons: a) it is slow as hell, b) it causes Firefox to run out of memory sooner (installing IE9 helps!). This means that currently Firefox 4 starts up a lot slower than 3.6. John Daggett is busy working on a workaround that uses older GDI APIs to enumerate fonts. Firefox is one of the first popular Windows applications to switch to DirectWrite, so we get to suffer the consequences.

Unfortunately it turns out that using Microsoft GDI APIs to enumerate fonts still causes a significant amount of disk IO (~30-60MB). John plans to fix that next.

How Did We Miss This?

This bug came from a fundamental difference in how developers and users start Firefox. A developer will restart Firefox a dozen times an hour. This means we rarely get to observe true cold startup. Our tests only measure warm startup (because most operating systems make it difficult to test cold startup). Windows is also incredibly slow to develop on, so a lot of us test in a virtual machine to speed things up and avoid rebooting the computer all the time. This also makes observing cold startup hard. Fortunately xperf makes IO much easier to observe. We should deploy xperf on our test infrastructure as soon as possible.


23
Nov 10

Of linkers and avoiding suck

There is a common fallacy that since linkers and compilers are written by really smart people, there aren’t any huge performance wins left in the toolchain. My theory is that the efficiency of any given codebase varies inversely with the number of people who tried to optimize it.

I have long complained of suboptimal binaries generated from our code. Modern profiling tools such as systemtap and icegrind made this painfully obvious. Mike Hommey opted for actually doing something about it. What started as a simple ld.so hack grew into a badass binary-rewriting tool (and the most interesting blog post I’ve read this year).


04
Oct 10

Diagnosing Slow Startup

I spent last week trying to reproduce slow startup on Windows. Some users were reporting >30 second startup, and supernova_00 has been feeding me xperf traces on IRC that reproduce the slow startups.

Startup Bugs

Turns out that if a website uses non-standard font names, this can trigger Firefox to start parsing every single font on the system, freezing the browser in the meantime. Facebook does this :(. This is now blocker bug 600713. This bug has the unfortunate effect of overshadowing any startup improvements in Firefox 4.

We have some code to keep our databases in good shape by VACUUMing them. This is getting revamped in Firefox in bug 541373. In the meantime, for many current users performance suffers due to missing vacuums. If you are suffering from slow Firefox startup, and/or slow Awesomebar try this manual vacuum. This helps in older Firefox releases, but in Firefox 4 this has the effect of supercharging the SQLite database performance by switching to 32K pages.
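
For the curious, a manual vacuum boils down to something like the sketch below (the linked instructions are not reproduced here). It uses Python's sqlite3 module. The function name vacuum_database is hypothetical, and the page_size argument is an assumption about how one would request 32K pages by hand. It should only be pointed at a copy of a database, or at a profile that Firefox is not currently using.

```python
import sqlite3


def vacuum_database(path, page_size=None):
    """VACUUM the SQLite database at `path` and return its page size afterwards."""
    # isolation_level=None puts the connection in autocommit mode; VACUUM
    # cannot run inside an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        if page_size is not None:
            # A new page size takes effect on the next VACUUM
            # (this does not apply while the database is in WAL mode).
            conn.execute("PRAGMA page_size = %d" % int(page_size))
        conn.execute("VACUUM")
        return conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()
```

For example, vacuum_database("places-copy.sqlite", 32768) would compact a copy of the history database and rebuild it with 32K pages; the file name here is only an example.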

Scareware

Another fun discovery was the effect of anti-virus software (AVG in this case). Like an annoying pet, AVG has to have a sniff and fondle every file that Firefox opens on startup. Apparently this is a feature called on-demand scanning, yuck.

But the fun doesn’t stop there. Windows has a wonderful prefetch mechanism that speeds up app startup. Unfortunately for supernova_00, \Windows\Prefetch just wouldn’t get populated with Firefox info, meaning that Windows wasn’t optimizing Firefox startup. Once I installed AVG, I ran into the same problem. Uninstalling AVG didn’t help. For whatever reason, deleting every file in \Windows\Prefetch fixes that problem. For both of us prefetch got repopulated after being cleaned.

XPerf

Microsoft XPerf makes it trivial to optimize cold startup. None of the other OSes have precanned analyses showing how much each individual file access contributes to slow startup.

If you have a startup problem, I’m much more likely to be able to reproduce it if the report comes with an xperf trace. To get xperf, run the Microsoft Platform SDK installer and select “Windows Performance Toolkit”.

To record an IO trace:

  1. Reboot
  2. run cmd.exe as Administrator
  3. xperf -on latency+FILE_IO+FILE_IO_INIT+DISK_IO
  4. run Firefox, reproduce the bug
  5. xperf -d report.etl
  6. Run xperf report.etl to view the report.

Click on “IO Counts” or “Hard faults” graph, select “Summary Table”. “IO Time (ms)” is the interesting column there. To get an idea of the sequence of IO operations, export the summary table to .csv and load it in a spreadsheet/grep/whatever. Every Firefox developer should give xperf a try, addon authors are encouraged too.
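
Once the summary table is exported, ranking files by IO time takes only a few lines. The sketch below is in Python and makes assumptions about the export: only the “IO Time (ms)” column name comes from the text above, while the “File Name” column name and the function name io_time_by_file are guesses that should be adjusted to match the header of the actual .csv file.

```python
import csv
from collections import defaultdict


def io_time_by_file(csv_path, path_column="File Name", time_column="IO Time (ms)"):
    """Sum the IO time per file and return (file, total_ms) pairs, slowest first.

    `path_column` is a hypothetical header name; check the exported file.
    """
    totals = defaultdict(float)
    # utf-8-sig tolerates a byte-order mark at the start of the export.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            # Drop thousands separators such as "1,200.0" before parsing.
            raw = (row.get(time_column) or "").replace(",", "").strip()
            if not raw:
                continue
            totals[row[path_column]] += float(raw)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```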


30
Jul 10

MSVC Static Initializers – Decent Stuff

I was digging through an MSVC++ map file for xul.dll. Turns out MSVC++ isn’t as naive about static initializers as the GNU toolchain. Initializers are all laid out next to each other. Same goes for what looks like finalizers and exception unwinding stuff. Initializers have an __E prefix and look like this:

0001:0089b470       ??__E?config@AvmCore@avmplus@@2UConfig@nanojit@@A@@YAXXZ 1089c470 f    CIL library: CIL module
0001:0089b475       ??__EkStaticModules@@YAXXZ 1089c475 f   nsStaticXULComponents.obj
0001:0089b638       ??__E?sSnifferEntries@nsUnknownDecoder@@1PAUnsSnifferEntry@1@A@@YAXXZ 1089c638 f   necko:nsUnknownDecoder.obj
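
A quick way to check how tightly the initializers are packed is to pull them out of the map file with a script. The Python sketch below assumes the whitespace-separated layout of the three lines above (section:offset, then the symbol) and treats the ??__E prefix as the marker; the function names are hypothetical, and initializer_span assumes all matches sit in the same section.

```python
def find_static_initializers(map_lines):
    """Return (section, offset, symbol) tuples for symbols starting with ??__E."""
    found = []
    for line in map_lines:
        parts = line.split()
        # Expect at least: section:offset, symbol, address, object file.
        if len(parts) < 4 or ":" not in parts[0]:
            continue
        symbol = parts[1]
        if not symbol.startswith("??__E"):
            continue
        section, offset = parts[0].split(":", 1)
        found.append((int(section, 16), int(offset, 16), symbol))
    return found


def initializer_span(initializers):
    """Distance in bytes between the lowest and highest initializer offsets."""
    offsets = [offset for _section, offset, _symbol in initializers]
    return max(offsets) - min(offsets) if offsets else 0
```

For the three lines quoted above, the offsets 0089b470 through 0089b638 give a span of 0x1c8 bytes.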

Now if only Microsoft fixed their kernel to do memory-mapped IO efficiently, it’d be a superior OS for starting Firefox.


22
Jul 10

File Fragmentation

Files are considered fragmented when they aren’t laid out in a contiguous chunk on disk. This causes extra seeks even if the file is being read sequentially.

While I was discussing startup over dinner, someone asked how much of an issue fragmentation is in Firefox.

Early on I decided to pretend that fragmentation does not exist as we had bigger fish to fry. We were opening too many files on startup, effectively causing our own high-level fragmentation. Luckily, that problem should be mostly solved in Firefox 4 once omnijar and fat xul bugs land (unfortunately, extensions can cause similar issues until we stick em into a single file).

To measure fragmentation I used my SystemTap script to get a list of files opened (one could also use strace or any similar tool) and piped the results to filefrag. Filefrag is a Linux fragmentation-measuring utility. On Windows one can use contig and Mac OS X features hfsdebug. I’m using ext4 on Linux.
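
The filefrag half of that pipeline is easy to script. This Python sketch assumes filefrag is installed and prints one "path: N extents found" line per file; the function names are made up, and the list of paths is whatever your own tracing tool reports.

```python
import re
import subprocess

# Matches lines such as "places.sqlite: 34 extents found".
EXTENT_LINE = re.compile(r"^(?P<path>.+): (?P<count>\d+) extents? found\s*$")


def parse_filefrag(output):
    """Turn filefrag's text output into a {path: extent_count} dict."""
    counts = {}
    for line in output.splitlines():
        match = EXTENT_LINE.match(line)
        if match:
            counts[match.group("path")] = int(match.group("count"))
    return counts


def rank_by_fragmentation(paths):
    """Run filefrag on `paths` and return (path, extents), worst first."""
    result = subprocess.run(
        ["filefrag", *paths], capture_output=True, text=True, check=False
    )
    counts = parse_filefrag(result.stdout)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)
```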

My top offenders were:
places.sqlite: 34 extents
cookies.sqlite: 18 extents
XPC.mfasl: 11 extents
Cache/_CACHE_003_: 11 extents
urlclassifier3.sqlite: 6 extents
Cache/_CACHE_002_: 6 extents
Cache/_CACHE_001_: 6 extents
XUL.mfasl: 5 extents
formhistory.sqlite: 5 extents
content-prefs.sqlite: 4 extents
libxul.so: 4 extents
signons.sqlite: 3 extents
icon-theme.cache: 2 extents
libatk-1.0.so.0.2809.1: 2 extents
libflashplayer.so: 2 extents
Cache/_CACHE_MAP_: 2 extents

I did an informal poll of my friends and it seems that the order of fragmentation is similar among them, only the magnitude differs. For example, XFS tends to be 10-20 times more fragmented than ext4 :). I don’t have any numbers for HFS+, but I suspect XFS takes the crown as most fragmentation-prone filesystem to run Firefox.

Interestingly, my friend running NTFS reported similar fragmentation to ext4. That was disappointing, as Windows Prefetch supposedly defragments files used within the first 10 seconds of startup. Clearly, it isn’t keeping up in this case.

Preliminary Conclusions

places.sqlite is the largest and most performance-critical file in Firefox. It contains browser history and bookmarks. It is the brains behind the AwesomeBar. The fact that it is severely affected by fragmentation significantly impacts Firefox responsiveness. There are no easy fixes for fragmentation there. mak suggested moving history to a separate file to mitigate this, but that isn’t an easy change.

In contrast, cookies.sqlite is tiny (<1MB for me) and is probably so fragmented due to cookie expiration. I am guessing that the easiest workaround here is to write a new sqlite file every time there is a mass update to the file.

urlclassifier3.sqlite is a large file whose fragmentation may be mitigated similarly to cookies.

SQLite 3.7.0 came out today. It features WAL (write-ahead logging), which may reduce fragmentation (or make battling it easier). In general, sqlite’s VACUUM command (used to clean and compact the database) does not help with fragmentation; we really need to be doing something like a hot backup, which would create a new database file on every VACUUM.
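
As an illustration of the hot-backup idea, here is a sketch using Python's sqlite3 module, whose Connection.backup() method wraps SQLite's online backup API. The function name rewrite_database and the ".rewrite" suffix are invented for this example, and it assumes no other connection has the database open when the new file is swapped into place. Whether the fresh file ends up less fragmented is up to the filesystem.

```python
import os
import sqlite3


def rewrite_database(path):
    """Copy the database at `path` into a brand-new file, then swap it in."""
    tmp_path = path + ".rewrite"  # hypothetical temporary name
    if os.path.exists(tmp_path):
        os.remove(tmp_path)
    src = sqlite3.connect(path)
    dst = sqlite3.connect(tmp_path)
    try:
        # Copies every page of the source database into the new file.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    # Replace the original with the freshly written copy.
    os.replace(tmp_path, path)
```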

Our cache code is ancient and sucks. The cache files get fragmented immediately and severely. They are accessed in insane patterns and they get laid out insanely on disk. There are some efforts to improve the code, but I suspect that’s equivalent to putting lipstick on a pig.

*.mfasl files are due to be obsoleted by a startup cache jar. It may get less fragmented. It should be a straightforward fix if it does get fragmented.

I’m disappointed to see the .so files get fragmented. This might be an ext4 bug or has something to do with how the updater works (both ours and yum on Fedora).

Further Work

I would like to see more data on fragmentation on Windows/OSX. Feel free to leave a comment with fragmentation numbers for cache, mfasl, sqlite and .dll files in your Firefox. We should look into online defragmentation APIs in modern OSes.

Workarounds?

The easiest way to fix a fragmented file is to make a copy of the original file, delete the original and then rename the copy. This works on sane filesystems; apparently it doesn’t work too well on OS X.
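
The copy, delete and rename steps translate directly into a few lines of Python. The function name rewrite_file and the ".defrag" suffix are made up for this sketch, and it assumes nothing else has the file open. Whether the copy is actually laid out more contiguously depends on the filesystem's allocator and on free space.

```python
import os
import shutil


def rewrite_file(path):
    """Replace `path` with a freshly written copy of itself."""
    copy_path = path + ".defrag"    # hypothetical temporary name
    shutil.copy2(path, copy_path)   # 1. make a copy (keeps timestamps)
    os.remove(path)                 # 2. delete the original
    os.rename(copy_path, path)      # 3. rename the copy into place
```

Running filefrag on the file before and after is the obvious way to see whether it made a difference.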


27
May 10

Startup: Backward Constructors

This post is a result of debugging bug 561842. Turns out one needs to go far beyond lumping libraries together to reap startup benefits.

I made a pdf to illustrate the cost centers of loading libxul.so (the essence of Firefox).

With Icegrind I demonstrated that better binary layout can significantly improve application startup. However I still didn’t have a breakdown of reasons of why loading binaries is so damn inefficient. That’s what the above pdf is about.

Loading libxul consists of 4 major phases:

  1. Runtime linker setup: mapping segments in, zeroing .bss, loading dependent libraries, etc
  2. Runtime linker relocations.
  3. Library initializers.
  4. main() and the rest of application code runs

I have already blogged about #1, #2 and #4; this post is about #3.

Library Initializers?

Michael Meeks pointed me at the funny backwards IO pattern in his IO logs. I even made fun of how by default libxul.so is read mostly via backwards IO. Once I assigned userspace symbols to my pagefault log, it became clear that the backwards IO pattern was entirely due to library initializers. The C++ compiler generates code that runs on library initialization to initialize globals and run relevant C++ constructors. In C one can assign a “constructor” GNU attribute to a function to participate in this mayhem.

Running Backwards?

Ian Lance Taylor clued me in on why these things run backwards. When one links the program, the object files are laid out sequentially. Static libraries are specified after the code that depends on them. Once an object is linked, the easiest way to make sure that libraries are initialized before their users is to invoke initializers backwards. The list of initializers is stored in the .ctors section and they are loaded by libgcc.

In Mozilla (and likely other C++ codebases) these global initializers are more or less evenly scattered throughout the codebase. By the time main() is run, most of the program has been paged in, in an unfortunately inefficient manner.

Run Faster Please?

The most interesting part about all this is that the compiling toolchain can make a rather precise guess at how a large part of the initial program execution is going to go. To test this theory I wrote my best Mozilla patch ever.

One can place a function near the beginning of the library file and another one at the end (with a “constructor” attribute). The function at the end runs first and it can figure out the approximate range of memory that will need to be paged in and madvise() it. This results in a 5x reduction in libxul pagefaults. Unfortunately since constructors execute backwards and readahead forwards, the constructor execution stalls to wait for readahead, so the speedup is rather hard to detect.
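
The madvise() call at the heart of that trick can be tried out from Python, which exposes it on mmap objects (Python 3.8+, on systems that have madvise). This is only a sketch of the hint itself: it maps a whole file rather than computing the range between two marker functions as the patch does, and the function name advise_willneed is made up.

```python
import mmap


def advise_willneed(path):
    """Map `path` read-only and hint that the whole mapping is needed soon.

    Returns True if the hint was issued and False where madvise is not
    available. The file must not be empty, since empty files cannot be mapped.
    """
    if not (hasattr(mmap.mmap, "madvise") and hasattr(mmap, "MADV_WILLNEED")):
        return False
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            # Asks the kernel to start reading the mapped range ahead of use.
            mapping.madvise(mmap.MADV_WILLNEED)
    return True
```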

Run Forward Faster!

Depressed about my hack failing to make a dent in startup time, I patched gcc to run initializers in forward order (and reversed the function-placement logic in the above patch). Now readahead happened in the same direction as library initialization and my Firefox started 30% faster! I wrapped this up into a standalone gcc patch (speed up any bloated C++ startup with a simple change to the compiler!). Note that this hack reverses the library initialization order discussed above; this happens to not be a problem for Mozilla.
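
Why direction matters can be shown with a toy model. The Python sketch below assumes a deliberately simplistic readahead: every page fault also loads a fixed number of pages after the faulting one. Real kernel readahead is more elaborate, and the window size and function name are arbitrary choices for illustration.

```python
def count_faults(access_order, readahead_pages):
    """Count page faults when each fault also loads the pages that follow it."""
    resident = set()
    faults = 0
    for page in access_order:
        if page in resident:
            continue
        faults += 1
        # Forward-only readahead: the faulting page plus the pages after it.
        resident.update(range(page, page + readahead_pages))
    return faults
```

In this model, touching 100 pages in forward order with an 8-page window costs 13 faults, while touching the same pages in reverse order costs 100, because readahead never covers the page that is needed next.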

Conclusion: Order Matters!

The linker can reverse the per-library initializers such that initializers run forward, but cross-library dependencies are honoured. That in itself isn’t enough to boost startup without cleverer readahead on the kernel side (or application-side hacks).

It’s weird to have initializers page in most of the binary. An interesting optimization would be to have the compiler transitively mark functions reached by library initialization and place those in a .text.initializers section. Then one could have the linker group the initializers together.
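
The "transitively mark" step is a plain graph traversal. The Python sketch below assumes the compiler could hand us a call graph as a dict from function name to callees. No such interface is described in this post, so the data structure and the function name are purely hypothetical.

```python
from collections import deque


def reachable_from_initializers(call_graph, initializers):
    """Return every function reachable from the given initializers."""
    seen = set(initializers)
    queue = deque(initializers)
    while queue:
        function = queue.popleft()
        for callee in call_graph.get(function, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen
```

Everything in the returned set would be a candidate for the .text.initializers section; everything else can stay where it is.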

Plans

I haven’t made up my mind on how to proceed. This madvise() hack + a simple linker patch could be deployed more easily than icegrind. This hack also appears to be as performant as a static Firefox build + icegrind (due to inadequate kernel readahead without madvise()). Icegrind + libxul.so isn’t quite as efficient. I have a feeling that we’ll end up with a combination of icegrind + some form of the initializer madvise() hack.
