29
Dec 10

Faster Plugin Enumeration + Help Wanted

In addition to slow font enumeration, we were suffering from a similar problem: slow plugin enumeration. Just as with fonts, the plugin enumeration code is different on every platform. Unlike the font situation, plugin enumeration is done completely within our code (i.e. easy to fix).

Plugin enumeration is often triggered by JavaScript code (for example by checking if a Java handler is present). This means that enumeration is a blocking operation that must happen quickly. XPerf made me wonder why so many plugin-like .dll files were being read. This led me to a fun set of perf fixes.

The Algorithm

  1. Files in plugin directories are listed
  2. A platform-specific IsPluginFile function determines which files look like plugins (i.e. np*.dll on Windows).
  3. Code then checks if the files + their timestamps are known by pluginreg.dat. If so, cached info is used and the following steps are skipped.
  4. For each library-file that isn’t found in pluginreg.dat, we use platform-specific GetPluginInfo to load the library-file to see if it is indeed a valid plugin (and to see what mimetypes it handles/etc).
  5. Valid plugins are recorded in pluginreg.dat.

This process took up to 3 seconds on a user’s computer. WTF? There were gotchas in almost every step of the way.

  1. Windows directory listing code would request metadata for every bloody file in the directory, which resulted in the easiest optimization ever: pure code deletion.
  2. IsPluginFile on Windows/Mac sneakily did more than just check the filename. It also checked if the file was loadable, which on Windows loaded the dll and all of the dependencies. Mac code was satisfied with merely doing a little extra IO.
  3. This part was right
  4. #2 was easily fixed by moving file IO here.
  5. Files that failed the check in #4 were doomed to cause extra IO for all of eternity. Scott Greenlay fixed that by recording invalid plugin-like files too.
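
The fixed flow can be sketched roughly as follows. This is a minimal illustration, not the actual Firefox code: `looks_like_plugin`, `enumerate_plugins`, the `cache` dict (standing in for pluginreg.dat) and the `probe` callback (standing in for GetPluginInfo) are all hypothetical names. The point is that the filename check does no IO, and that invalid plugin-like files are cached as well so they are never probed twice.

```python
import fnmatch
import os
import tempfile


def looks_like_plugin(name):
    # Filename-only check (the role of IsPluginFile after the fix): no file IO.
    return fnmatch.fnmatch(name.lower(), "np*.dll")


def enumerate_plugins(directory, cache, probe):
    """Return the sorted names of valid plugins found in `directory`.

    `cache` maps path -> (mtime, is_valid) and stands in for pluginreg.dat.
    `probe` is the expensive validity check and stands in for GetPluginInfo.
    """
    valid = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not looks_like_plugin(entry.name):
                continue  # non-plugin files cost nothing beyond the listing
            mtime = entry.stat().st_mtime
            cached = cache.get(entry.path)
            if cached is not None and cached[0] == mtime:
                is_valid = cached[1]  # known file, unchanged timestamp
            else:
                is_valid = probe(entry.path)
                # Record the result even when the file is NOT a valid plugin,
                # so it does not cause extra IO on every later enumeration.
                cache[entry.path] = (mtime, is_valid)
            if is_valid:
                valid.append(entry.name)
    return sorted(valid)


def demo():
    """Enumerate a throwaway directory twice and count expensive probes."""
    calls = []

    def probe(path):
        calls.append(path)
        return "good" in os.path.basename(path)

    with tempfile.TemporaryDirectory() as tmp:
        for name in ("npgood.dll", "npbad.dll", "readme.txt"):
            with open(os.path.join(tmp, name), "w") as f:
                f.write("x")
        cache = {}
        first = enumerate_plugins(tmp, cache, probe)
        second = enumerate_plugins(tmp, cache, probe)
    return first, second, len(calls)
```

In `demo()` the second enumeration should not call `probe` at all, because both the valid and the invalid np*.dll file are already in the cache.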

This was a rare fix that resulted in seconds saved on crapware-loaded computers. Usually I have to count my progress in milliseconds :(

Help Wanted

I have plans for vastly improving Firefox startup, but I need help to get there. If you enjoy beating under-performing code into submission and want to work for Mozilla, please send me your resume (taras at mozilla dot com). Example projects: a better performance testsuite (i.e. tracking IO, CPU instructions, etc), better infrastructure for profiling addons, optimizing away various CSS/XUL markup, etc. A low-level approach to solving problems is helpful; compiler/linker/kernel hackers are well-suited (but not required) for this.


21
Dec 10

Rude Surprise: Startup Overhead of Windows Font APIs

Imagine a typical Firefox user who starts their Windows computer in order to surf the web. The first app they launch is Firefox 4. It turned out that on systems that support hardware acceleration for 2D graphics, Firefox 4 takes minutes to start up. WTF? An XPerf-aided investigation showed that the Windows font enumeration code causes us to do 30x more disk IO (~300MB) than the rest of the Firefox code.

In order to hardware accelerate Firefox, we switched from GDI to using DirectWrite for font stuffs. Apparently, DirectWrite is a wonderful API, but the implementation has some teething issues. DirectWrite opens a connection to the Font Service (and starts it if it isn’t already running). However, if the service fails to respond, DirectWrite proceeds to enumerate all of the system fonts on the client side. This isn’t cool for multiple reasons: a) it is slow as hell b) it causes Firefox to run out of memory sooner (installing IE9 helps!). This means that currently Firefox 4 starts up a lot slower than 3.6. John Daggett is busy working on a workaround by using older GDI APIs to enumerate fonts. Firefox is one of the first popular Windows applications to switch to DirectWrite, so we get to suffer the consequences.

Unfortunately it turns out that using Microsoft GDI APIs to enumerate fonts still causes a significant amount of disk IO (~30-60MB); John plans to fix that next.

How Did We Miss This?

This bug came from a fundamental difference in how developers and users start Firefox. A developer will restart Firefox a dozen times an hour. This means we rarely get to observe true cold startup. Our tests only measure warm startup (because most operating systems make it difficult to test cold startup). Windows is also incredibly slow to develop on, so a lot of us test in a virtual machine to speed things up and avoid rebooting the computer all the time. This also makes observing cold startup hard. Fortunately xperf makes IO much easier to observe. We should deploy xperf on our test infrastructure as soon as possible.


30
Nov 10

Crapware and Firefox

I completely agree with Asa that having unwanted crap forced upon the user is morally wrong. We should do a better job of undoing this kind of braindamage. In the meantime here is a brief rant on the parasitic underpinnings of crapware.

Until recently, I have been testing Firefox on my own installs of Windows. I had no idea how aggressive bundleware could be. Then I got this piece-of-crap i7 Acer laptop with Windows 7 (and relatively little crapware preinstalled) and tried to use it as my primary machine. Suddenly, I could reproduce a lot more “slow” scenarios. I even went further and tried installing common crapware known as AVG to reproduce more bugs.

Turns out almost every vendor tries to mix crap into Firefox. Acer, Microsoft Office/Silverlight, Adobe Flash/Acrobat, Google, AVG, etc all added unwanted functionality to my Firefox. I marveled at all kinds of “helpful” functionality such as the wonderful ability to click on a link in a webpage and have Google Chrome install without any warning that the webpage is about to execute a Windows program. AVG adds a couple of extensions that make Firefox start up 0.5-4x slower.

So far I noticed 2 vectors of attack: plugins and extensions. Plugins are fun because those get added by registering bonus plugin directories. Plugin directories are usually just application directories that contain plugins. This means Firefox gets to slowly rummage through bonus application directories looking for what might be a plugin. Extensions are fun because unlike plugins (which affect most browsers on the computer), extensions are very browser-specific. Most extension crapware doesn’t yet support Chrome. Installing things like AVG retards Firefox performance while Chrome escapes unmolested.

Benjamin, I don’t think these software vendors are “doing exactly as we ask of them.”

Personally, I would like us to be a lot more aggressive about blacklisting ill-performing software. I.e. we need to go above and beyond warning users about crapware. I would like us to actively check the performance of popular plugins/addons and ban them if they are substandard.


23
Nov 10

Of linkers and avoiding suck

There is a common fallacy that since linkers and compilers are written by really smart people, there aren’t any huge performance wins left in the toolchain. My theory is that the efficiency of any given codebase varies inversely with the number of people who tried to optimize it.

I have long complained of suboptimal binaries generated from our code. Modern profiling tools such as systemtap and icegrind made this painfully obvious. Mike Hommey opted for actually doing something about it. What started as a simple ld.so hack grew into a badass binary-rewriting tool (and the most interesting blog post I’ve read this year).


16
Nov 10

Performance Update. Fragmentation: Mostly fixed. GCC: work-in-progress.

Fragmentation: SQLite & Friends

I am happy to report that the SQLite fragmentation problem is now solved. I copied my profile a month ago, and my places.sqlite is still in a single fragment! There was a similar fix done to Firefox disk cache. Thanks to helpful comments on my OSX preallocation cry for help, we now preallocate efficiently on OSX too.

Startup cache is the last remaining bastion of fragmentation, but that’s already 10x better than it was a month ago. I have two complementary solutions for that: omnijar startup cache generation for core code and/or writing the cache more efficiently.

Firefox 4 will be a lot more gentle on those spinning platters.

GCC

I helped Jan Hubicka on a GCC summit paper. Those nasty static initializers will not be a hassle in GCC 4.6!

I keep wanting to blog about how we switched to GCC 4.5 and how life is wonderful, but life didn’t work out this way. So far we have tried switching away from the 4.3 compiler three times. The first time GCC completely failed in terms of -Os performance. C++ -Os is more bloated in 4.5 (because that option is benchmarked on C apps). Then it turned out that libffi was being miscompiled (we also found a related bug in libffi). Last time, we tried switching to GCC 4.5 + -O3 since that performs much better than -Os, but that broke sunspider. Hopefully we can fix the sunspider issue and try again next week. I would really like to utilize GCC PGO to produce the fastest possible Linux Firefox builds.

Nonetheless I am happy with recent GCC progress. With Jan’s help, GCC will eventually be very good at compiling Mozilla. In my spare cycles I’ve been working on setting up GCC benchmarks using Mozilla to help avoid future surprises like the ones we discovered in GCC 4.5. More on this later.


04
Oct 10

Diagnosing Slow Startup

I spent last week trying to reproduce slow startup on Windows. Some users were reporting >30 second startup, and supernova_00 has been feeding me xperf traces on IRC reproducing slow startups.

Startup Bugs

Turns out that if a website uses non-standard font names this can trigger Firefox to start parsing every single font on the system, freezing the browser in the meantime. Turns out Facebook does this :(. This is now blocker bug 600713. This bug has the unfortunate effect of overshadowing any startup improvements in Firefox 4.

We have some code to keep our databases in good shape by VACUUMing them. This is getting revamped in Firefox in bug 541373. In the meantime, for many current users performance suffers due to missing vacuums. If you are suffering from slow Firefox startup, and/or slow Awesomebar try this manual vacuum. This helps in older Firefox releases, but in Firefox 4 this has the effect of supercharging the SQLite database performance by switching to 32K pages.
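
For readers who want to see what a vacuum with a page-size switch looks like mechanically, here is a minimal sketch using Python's built-in sqlite3 module. It is not the manual vacuum procedure referred to above, only an illustration of the underlying SQLite operations; `vacuum_with_page_size` and the throwaway `example.sqlite` file are hypothetical names, and running this against a real profile database would require Firefox to be closed. As far as I know, SQLite will not change the page size of a database that is in WAL mode, so the sketch assumes the default journal mode.

```python
import os
import sqlite3
import tempfile


def vacuum_with_page_size(db_path, page_size=32768):
    """Rebuild the database at db_path and return the resulting page size."""
    # isolation_level=None means autocommit: VACUUM cannot run inside an
    # open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # The requested page size only takes effect when VACUUM rewrites the file.
        conn.execute(f"PRAGMA page_size = {int(page_size)}")
        conn.execute("VACUUM")
        return conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()


def demo():
    """Create a small throwaway database, vacuum it, report its page size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.sqlite")
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE visits (url TEXT)")
        conn.executemany(
            "INSERT INTO visits VALUES (?)",
            [("http://example.com/%d" % i,) for i in range(1000)],
        )
        conn.commit()
        conn.close()
        return vacuum_with_page_size(path)
```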

Scareware

Another fun discovery was the effect of anti-virus software (AVG in this case). Like an annoying pet, AVG has to have a sniff and fondle every file that Firefox opens on startup. Apparently this is a feature called on-demand scanning, yuck.

But the fun doesn’t stop there. Windows has a wonderful prefetch mechanism that speeds up app startup. Unfortunately for supernova_00, \Windows\Prefetch just wouldn’t get populated with Firefox info, meaning that Windows wasn’t optimizing Firefox startup. Once I installed AVG, I ran into the same problem. Uninstalling AVG didn’t help. For whatever reason deleting every file in \Windows\Prefetch fixes that problem. For both of us prefetch got repopulated after being cleaned.

XPerf

Microsoft XPerf makes it trivial to optimize cold startup. None of the other OSes have precanned analyses showing how much each individual file access is contributing to slow startup.

If you have a startup problem, I’m much more likely to be able to reproduce it if the report comes with an xperf trace. To get xperf, run the Microsoft Platform SDK installer and select “Windows Performance Toolkit”.

To record an IO trace:

  1. Reboot
  2. run cmd.exe as Administrator
  3. xperf -on latency+FILE_IO+FILE_IO_INIT+DISK_IO
  4. run Firefox, reproduce the bug
  5. xperf -d report.etl
  6. Run xperf report.etl to view the report.

Click on “IO Counts” or “Hard faults” graph, select “Summary Table”. “IO Time (ms)” is the interesting column there. To get an idea of the sequence of IO operations, export the summary table to .csv and load it in a spreadsheet/grep/whatever. Every Firefox developer should give xperf a try, addon authors are encouraged too.
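
Once the summary table is exported, a few lines of script can total up the IO time per file. The sketch below is only an illustration: “IO Time (ms)” is the column named above, but the path column name (“Path Name” here), the sample rows and the `io_time_by_file` helper are assumptions, so adjust them to whatever your export actually contains.

```python
import csv
import io
from collections import defaultdict


def io_time_by_file(csv_text, path_column="Path Name", time_column="IO Time (ms)"):
    """Sum IO time per file and return (path, total_ms) pairs, slowest first.

    The column names are assumptions about the exported summary table.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[path_column]] += float(row[time_column])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data in the assumed layout, for illustration only.
SAMPLE = """Path Name,IO Time (ms)
a.ttf,120.5
xul.dll,300.0
a.ttf,30.5
"""

if __name__ == "__main__":
    for path, total_ms in io_time_by_file(SAMPLE):
        print(f"{total_ms:10.1f} ms  {path}")
```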


14
Sep 10

Firefox 4: jar jar jar

Opening files is relatively expensive. There is a small syscall overhead and a higher overhead of fetching data from disk. Depending on physical data layout and disk type, this can leave modern CPUs twiddling their thumbs for a long time while the disk skips around fetching all of the different file pieces.

Optimization #1: Fewer naked files

About two years ago I started gathering naked files on disk and shoving them into jars (eg bug 508421). We made jar reading as efficient as possible by cleaning up code and switching to mmap. Eventually all application data files read from disk during “normal” startup ended up in jars. Unfortunately we ended up with four jars (toolkit, chrome + 2 locale jars), which felt silly. Due to limitations in XPCOM, a lot of naked files were still read from disk on version upgrades and extension installation.

Optimization #2: One jar to rule them all

Recently Michael Wu unleashed a can of omnijar whoopass. This was a massive effort driven by Android packaging requirements. Now application startup data is always being read from one file. This implies better data locality, less seeking, less waiting. One benefit of packing files tightly is that the OS speculatively reads data from disk in chunks that are usually larger than what the application requests. This makes reading nearby files free. Unfortunately there was no good way to predict the order that files will be accessed in without actually running Firefox, so there was more room for improvement.

Optimization #3: Optimized jar layout

So now that all of our data was in one file, the next logical step was to pack it intelligently. The only way to do this is to profile Firefox startup and then order the jar according to that. Unfortunately, even once all of the jar entries are laid out sequentially, we were still doing our IO suboptimally. This was due to the fact that the zip index (jars are zip files) is traditionally located at the end of the file. The Wikipedia entry has pictures to illustrate this.
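
To see where the index normally lives, here is a small sketch that builds an ordinary zip in memory with Python's zipfile module and then reads the end of central directory record by hand. The helper names and the example entry names are made up for illustration; the record layout (a 22-byte fixed part starting with the signature `PK\x05\x06`) follows the zip format.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
# Fixed-size part of the end of central directory record (22 bytes):
# signature, disk number, disk with central directory, entries on this disk,
# total entries, central directory size, central directory offset,
# comment length.
EOCD_FORMAT = "<4sHHHHIIH"


def central_directory_location(data):
    """Return (offset, size) of the central directory in raw zip bytes."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end of central directory record found")
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    cd_size, cd_offset = fields[5], fields[6]
    return cd_offset, cd_size


def build_example_zip():
    """Build a conventional zip with two uncompressed entries in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("chrome/a.js", "a" * 1000)
        zf.writestr("chrome/b.js", "b" * 1000)
    return buf.getvalue()


if __name__ == "__main__":
    data = build_example_zip()
    offset, size = central_directory_location(data)
    print(f"file size {len(data)}, central directory at {offset} ({size} bytes)")
```

For a conventional zip like this one, the central directory starts after all of the entry data, so a reader has to seek to the end of the file before it can find anything.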

In order to maximize readahead benefits and minimize disk seeks it would be nice to have the file index in the front of the file. So I changed our zip layout from

<entry1><entry2>…<entryN><central directory><end of central directory>

to

<offset of the last entry read on startup><central directory><end of central directory><entry1><entry2>…<entryN><end of central directory>

So all I did was change the offset in <end of central directory> to always be 4 (it can’t be 0 because anal zip programs balk at “NULL” central directory offsets). Then I added a second identical <end of central directory> entry to keep the rule that the central directory is always followed by one. I also used the extra space forced upon me by overly vigilant zip programs to store a number indicating how much data we can preread on startup.

This yielded a 2-3x reduction in disk io over an unoptimized omnijar. This is on top of a >30-100x reduction achieved by going from naked files to omnijar.

The downside of my interpretation of the zip spec is that some zip programs expect zip files to be more rigid than the spec allows. Older versions of Firefox, Microsoft zip support in Windows, WinRAR, unix zip programs, etc accept my optimized jars. 7zip and broken antivirus software (it’s a security risk to be overly picky) fail.

Trivia: this isn’t the first time we got tripped up by picky zip reading code. For example, the Android apk reader irritatingly insists on having a zip entry at byte zero of an Android package. This means that one can’t use apks to do the Android equivalent of self-extracting .exe files on Windows. Michael Wu is writing a custom library loader to deal with that :)

Optimization #4: More Omnijar

Feeling that omnijar wasn’t awesome enough, Michael Wu went ahead and omnijared extensions. Most extensions will no longer need to be unpacked from xpi files. This also means that extension authors can opt to use the optimized jar format above to further speed up Firefox startup.

Other jar optimizations

Switching to jars via startup cache will allow us to further optimize our first startup. There is also the option of halving our jar IO further by actually making use of that readahead integer I added to optimized jars.


09
Sep 10

Help Wanted: Does fcntl(F_PREALLOCATE) Work as Advertised on OSX?

To fight fragmentation it is best to tell the OS to allocate a contiguous chunk of space for your file. With specialized APIs, the OS can do this without performing any IO (not counting metadata). I am adding support for this as part of bug 592520. Linux features posix_fallocate for preallocating files. On Windows, SetEndOfFile achieves the same result. Supposedly OSX can do this via fcntl(F_PREALLOCATE), but does it?

I’ve experimented with posix_fallocate/SetEndOfFile and determined that they both change the file size and do their best to avoid fragmentation. Unfortunately I do not see any effect of fcntl(F_PREALLOCATE) on OS X 10.6 (the return code is successful). The file size does not change and if I then write to the file, it seems to fragment just as much as before. Can a Mac expert demonstrate that fcntl(F_PREALLOCATE) makes any difference at all?

Update: Thanks a lot for the useful feedback, it was extremely helpful in producing this patch. It appears that the posix_fallocate equivalent is the fcntl call followed by a truncate() call (which actually forces data to be written to the file).
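
As a rough illustration of the preallocation idea (not the patch itself), the sketch below reserves space for a file from Python. It uses `os.posix_fallocate` where the platform provides it and otherwise falls back to `os.ftruncate`, which only extends the file size and makes no promise about contiguous allocation. It does not attempt the OS X fcntl(F_PREALLOCATE) path; `preallocate` and `demo` are hypothetical helper names.

```python
import os
import tempfile


def preallocate(fd, size):
    """Try to reserve `size` bytes for the open file `fd`.

    Returns the name of the method that was used.
    """
    if hasattr(os, "posix_fallocate"):
        try:
            # Asks the OS to allocate the blocks up front and grows the file.
            os.posix_fallocate(fd, 0, size)
            return "posix_fallocate"
        except OSError:
            pass  # e.g. a filesystem that does not support it
    # Fallback: only changes the file size. The file may be sparse and can
    # still fragment when it is written later.
    os.ftruncate(fd, size)
    return "ftruncate"


def demo(size=1 << 20):
    """Preallocate a temporary file and report the method and resulting size."""
    with tempfile.TemporaryFile() as f:
        method = preallocate(f.fileno(), size)
        return method, os.fstat(f.fileno()).st_size
```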


07
Sep 10

Fighting fragmentation: SQLite

Thanks to all of those who commented on the previous post on fragmentation. My first fragmentation fix has landed. In current nightlies and future releases the main Firefox databases will grow more aggressively to avoid fragmentation. This should translate into better history/awesomebar/cookie performance for our most dedicated users.

Unfortunately fixing existing profiles is hard from within Firefox. In the meantime advanced users on non-Windows platforms who are suffering from fragmentation can manually copy *.sqlite files to another directory and back.

Windows: Ahead of the pack

Evidence suggests that the Windows fragmentation situation is slightly better than on other platforms. Firefox fragmentation behavior on Windows is similar to other OSes, but Windows periodically defragments Firefox files opened on startup. So one ends up with a cycle of deteriorating performance, followed by better performance (i.e. right after defrag), followed by deteriorating performance, etc.

I haven’t observed Windows defragmenting files for me, but it seems to do this for most users. Would love to learn more on how/when it decides to defragment files.

Horror Stories

I found a few other places that are horridly affected by fragmentation, will be blogging about those as I fix them. Fragmentation is an interesting problem to optimize because it affects dedicated users most, yet it is very tricky to replicate in a developer environment. Furthermore, there are a lot of misconceptions floating around:

  1. Fragmentation is a Windows problem that Linux is immune to due to having awesomer filesystems.
  2. Mac OSX automatically defragments files, so fragmentation isn’t a problem there.
  3. Fragmentation isn’t a problem on SSDs

To which I say:

  1. Linux might be good at avoiding fragmentation for server workloads. It sucks for desktop users.
  2. OSX will defragment small files, but big ones hurt most.
  3. Cheap SSDs suck at tiny reads caused by fragmentation resulting in spectacularly bad IO. More on this in a future post.

To summarize: there are a lot of misleading stories floating around. I am always happy to hear more measurements/docs/bugs/etc on this subject, but I have zero patience for folk stories and speculation.

I should also mention that the fragmentation problem isn’t limited to Firefox. Other browsers suffer from it too.


30
Jul 10

MSVC Static Initializers – Decent Stuff

I was digging through an MSVC++ map file for xul.dll. Turns out MSVC++ isn’t as naive about static initializers as the GNU toolchain. Initializers are all laid out next to each other. Same goes for what looks like finalizers and exception unwinding stuff. Initializers have an __E prefix and look like this:

0001:0089b470       ??__E?config@AvmCore@avmplus@@2UConfig@nanojit@@A@@YAXXZ 1089c470 f    CIL library: CIL module
0001:0089b475       ??__EkStaticModules@@YAXXZ 1089c475 f   nsStaticXULComponents.obj
0001:0089b638       ??__E?sSnifferEntries@nsUnknownDecoder@@1PAUnsSnifferEntry@1@A@@YAXXZ 1089c638 f   necko:nsUnknownDecoder.obj
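
A quick way to check how tightly the initializers are packed is to pull the `??__E` lines out of the map file and look at their offsets. The sketch below is a minimal illustration: the regular expression and the field names are my reading of the three sample lines above rather than a full map file parser, and the fourth sample line is a made-up non-initializer added to show filtering.

```python
import re

# Field names are guesses based on the sample lines, not an official grammar.
MAP_LINE = re.compile(
    r"^\s*(?P<section>[0-9a-f]{4}):(?P<offset>[0-9a-f]{8})\s+"
    r"(?P<symbol>\?\?__E\S+)\s+(?P<address>[0-9a-f]{8})\s+f\s+(?P<object>.+?)\s*$"
)


def static_initializers(map_text):
    """Return (offset, symbol, object) for each ??__E line, in file order."""
    found = []
    for line in map_text.splitlines():
        match = MAP_LINE.match(line)
        if match:
            found.append(
                (int(match.group("offset"), 16),
                 match.group("symbol"),
                 match.group("object"))
            )
    return found


SAMPLE = r"""
0001:0089b470       ??__E?config@AvmCore@avmplus@@2UConfig@nanojit@@A@@YAXXZ 1089c470 f    CIL library: CIL module
0001:0089b475       ??__EkStaticModules@@YAXXZ 1089c475 f   nsStaticXULComponents.obj
0001:0089b638       ??__E?sSnifferEntries@nsUnknownDecoder@@1PAUnsSnifferEntry@1@A@@YAXXZ 1089c638 f   necko:nsUnknownDecoder.obj
0001:0089c000       ?NotAnInitializer@@YAXXZ 1089d000 f   hypothetical.obj
"""

if __name__ == "__main__":
    inits = static_initializers(SAMPLE)
    span = inits[-1][0] - inits[0][0]
    print(f"{len(inits)} initializers within {span} bytes")
```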

Now if only Microsoft fixed their kernel to do memory-mapped IO efficiently, it’d be a superior OS for starting Firefox.