Category Archives: Planet Mozilla

New policy: 24-hour backouts for major Talos regressions

Now that I’ve caught your attention with a sufficiently provocative title, please check out this new Talos regression policy that we* will be trying out starting next week :)

tl;dr: Perf sheriffs will back out any Talos regression of 10% or more if it affects a reliable test on Windows. We’ll give the patch author 24 hours to explain why the regression is acceptable and shouldn’t be backed out. Perf sheriffs will aim to have such regressions backed out within 48 hours of landing.
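The rule in the summary can be read as a simple predicate. The sketch below only illustrates the tl;dr; it is not the sheriffs' actual tooling, the class and field names are hypothetical, it assumes a test where a lower score is better, and the full policy has more nuance than this.

```python
from dataclasses import dataclass

# Thresholds taken from the tl;dr above.
REGRESSION_THRESHOLD = 0.10   # 10% or more
AUTHOR_RESPONSE_HOURS = 24    # time the author has to justify the regression
BACKOUT_TARGET_HOURS = 48     # sheriffs aim to back out within this window


@dataclass
class TalosRegression:
    # All fields are hypothetical names used only for this illustration.
    platform: str           # e.g. "windows"
    test_is_reliable: bool
    before: float           # score before the patch (assumed: lower is better)
    after: float            # score after the patch


def qualifies_for_backout(r: TalosRegression) -> bool:
    """Return True if the regression meets the criteria in the summary."""
    relative_change = (r.after - r.before) / r.before
    return (
        r.platform.lower().startswith("windows")
        and r.test_is_reliable
        and relative_change >= REGRESSION_THRESHOLD
    )
```

A regression that qualifies would still get the 24-hour window for the author to explain why it should stay in.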

I promise this policy is much more nuanced and thought-through than the title or summary might suggest :) But I really want to hear developers’ opinions.

* I’m taking point on publicizing this new policy and answering any questions, but Joel Maher, William Lachance and Vaibhav Agarwal of the A-Team did all the heavy lifting. They built the tools for detecting & investigating Talos regressions and they’re the perf sheriffs.

Avi Halachmi from my team is helping to check the tools for correctness. I just participate in Talos policy decisions and occasionally act as an (unintentional) spokesperson :)

Announcing the Content Performance program


Aaron Klotz, Avi Halachmi and I have been studying Firefox’s performance on Android & Windows over the last few weeks as part of an effort to evaluate Firefox “content performance” and find actionable issues. We’re analyzing and measuring how well Firefox scrolls pages, loads sites, and navigates between pages. At first, we’re focusing on 3 reference sites: Twitter, Facebook, and Yahoo Search.

We’re trying to find reproducible, meaningful, and common use cases on popular sites which result in noticeable performance problems or where Firefox performs significantly worse than competitors. These use cases will be broken down into tests or profiles, and shared with platform teams for optimization. This “Content Performance” project is part of a larger organizational effort to improve Firefox quality.

I’ll be posting regular progress updates here, but you can also track our efforts on our mailing list and IRC channel:

Mailing list:
IRC channel: #contentperf
Project wiki page: Content_Performance_Program

Summary of Current Findings (June 18)

Generally speaking, desktop and mobile Firefox scroll as well as other browsers on reference sites when there is only a single tab loaded in a single window.

  • We compared Firefox vs Chrome and IE:
    • Desktop Firefox scrolling can badly deteriorate when the machine is in power-saver mode [1] (Firefox performance relative to other browsers depends on the site)
    • Heavy activity in background tabs badly affects desktop Firefox’s scrolling performance [1] (much worse than other browsers; we need E10S)
    • Scrolling on infinitely-scrolling pages only appears janky when the page is waiting on additional data to be fetched
  • Inter-page navigation in Firefox can exhibit flicker, similar to other browsers
  • The Firefox UI locks up during page loading, unlike other browsers (need E10S)
  • Scrolling in desktop E10S (with heavy background tab activity) is only as good as the other browsers [1] when Firefox is in the process-per-tab configuration (dom.ipc.processCount >> 1)

[1] You can see Aaron’s scrolling measurements here:

Potential scenarios to test next:

  • Check impact of different Firefox configurations on scrolling smoothness:
    • Hardware acceleration disabled
    • Accessibility enabled & disabled
    • Maybe: Multiple monitors with different refresh rate (test separately on Win 8 and Win 10)
    • Maybe: OMTC, D2D, DWrite, display & font scaling enabled vs disabled
      • If we had a Telemetry measurement of scroll performance, it would be easier to determine relevant characteristics
  • Compare Firefox scrolling & page performance on Windows 8 vs Windows 10
    • Compare Firefox vs Edge on Win 10
  • Test other sites in Alexa top 20 and during random browsing
  • Test the various scroll methods on reference sites (Avi has done some of this already): mouse wheel, mouse drag, arrow key, page down, touch screen swipe and drag, touchpad drag, touchpad two finger swipe, trackpoints (special casing for ThinkPads should be re-evaluated).
    • Check impact of pointing device drivers
  • Check performance inside Google web apps (Search, Maps, Docs, Sheets)
    • Examine benefits of Chrome’s network pre-fetcher on Google properties (e.g. Google search)
    • Browse and scroll simple pages when top Google apps are loaded in pinned tabs
  • Compare Firefox page-load & page navigation performance on HTTP/2 sites (Facebook & Twitter, others?)
  • Check whether our cache and pre-connector benefit perceived performance, compare vs competition

Issues to report to Platform teams

  • Worse Firefox scrolling performance with laptop in power-save mode
  • Scrolling Twitter feed with YouTube HTML5 videos is jankier in Firefox
  • bug 1174899: Scrolling on Facebook profile with many HTML5 videos eventually causes 100% CPU usage on a Necko thread + heavy CPU usage on main thread + the page stops loading additional posts (videos)

Tooling questions:

  • Find a way to measure when the page is “settled down” after loading, i.e. time until last page-loading event. This could be measured by the page itself (similar to Octane), which would allow us to compare different browsers
  • How to reproduce dynamic websites offline?
  • Easiest way to record demos of bad Firefox & Fennec performance vs other browsers?

Decisions made so far:

  • Exclusively focus on Android 5.0+ and Windows 7, 8.1 & 10
  • Devote the most attention to single-process Nightly on desktop, but do some checks of E10S performance as well
  • Desktop APZC and network pre-fetcher are a long time away, don’t wait

How to evaluate the performance of your new Firefox feature

There are a lot of good tools available now for studying Firefox performance, and I think a lot of them are not well known, so I put together a list of steps to follow when evaluating the performance of your next Firefox feature.

1. Make sure to test your feature on a low-end or mid-range Windows computer

  • Our dev machines are uncommonly powerful, so test on something closer to what our users have: think machines with spinning hard drives, not SSDs. Testing on Windows is a must, as it is used by the vast majority of our users.
  • The perf team, fx-team, and gfx team have Windows Asus T100 tablets available in multiple offices just for this purpose. Contact me, Gavin, or Milan Sreckovic if you need one.

2. Ensure your feature does not touch storage on the main thread, either directly or indirectly

  • If there’s any chance it might cause main-thread IO, test it with the Gecko profiler. The profiler now has an option to show you all the IO done on the main thread, no matter how brief it is.
  • Also be careful about using SQLite

3. Make sure to add Telemetry probes that measure how well your feature performs on real user machines.

  • Check the Telemetry numbers again after your feature reaches the release channel. The release channel has a diversity of configurations that simply don’t exist on any of the pre-release channels.
    • You can check for regressions in the Telemetry dash, or you can ask the perf-team to show you how to do a custom analysis (e.g. performance on a particular gfx card type) using MapReduce or Spark.
    • The learning curve can be a bit steep, so the perf team can do one-off analyses for you.
    • We have additional performance dashboards; they are listed in the “More Dashboards” sidebar on
  • Always set the “alert_emails” field for your histogram in Histograms.json so you get automatic e-mail notifications of performance regressions and improvements.
    • Ideally, this email address should point to an alias for your team.
    • Note that the Telemetry regression detector has an extremely low false-positive rate so you won’t be getting any emails unless performance has changed significantly.
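As a rough illustration of the alert-address advice above, the sketch below scans histogram definitions and lists the ones with no alert address. It assumes the field is spelled `alert_emails`; the probe names, the address, and the other fields in the sample are hypothetical and only shaped like Histograms.json entries.

```python
import json

ALERT_FIELD = "alert_emails"  # assumed spelling of the field


def histograms_without_alerts(histograms: dict) -> list:
    """Return the names of histogram definitions with no alert address set."""
    missing = []
    for name, definition in histograms.items():
        if not definition.get(ALERT_FIELD):
            missing.append(name)
    return sorted(missing)


# Hypothetical definitions, shaped like entries in Histograms.json.
example = json.loads("""
{
  "MY_FEATURE_LOAD_TIME_MS": {
    "alert_emails": ["my-team@example.com"],
    "kind": "exponential",
    "high": 10000,
    "n_buckets": 20,
    "description": "Time to load my feature (ms)"
  },
  "MY_FEATURE_SAVE_TIME_MS": {
    "kind": "exponential",
    "high": 10000,
    "n_buckets": 20,
    "description": "Time to save my feature state (ms)"
  }
}
""")

print(histograms_without_alerts(example))  # ['MY_FEATURE_SAVE_TIME_MS']
```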

4. Keep an eye on the Talos scores

  • The Talos tests are much less noisy now than they used to be, and more sensitive as well. This is thanks to Avi Halachmi’s, Joel Maher’s, and others’ efforts.
    Partly as a result of this, we now have a stricter Talos sheriffing policy. The patch author has 3 business days to respond to a Talos regression bug (before getting backed out), and two weeks to decide what to do with the regression.
  • Joel Maher will file a regression bug against you if you regress a Talos test.
  • The list of unresolved regressions in each release is tracked in the meta bugs: Firefox 36, Firefox 37, Firefox 38, etc
  • Joel tracks all the improvements together with all the regressions in a dashboard
  • If you cause a regression that you can’t reproduce on your own machine, you can capture a profile directly inside the Talos environment:

  • Some Talos tests can be run locally as extensions, others may require you to set up a Talos harness. Instructions for doing this will be provided in the Talos regression bugs from now on.
  • The graph server can show you a history of test scores and test noise to help you determine if the reported regression is real.
    • William Lachance is working on a new & improved graphing UI for treeherder.

5. Consider writing a new Talos test

  • Add a new Talos test if the performance of your feature is important and it is not covered by existing tests. The Perf team would be happy to help you design a meaningful and reliable test.
  • Make sure your test measures the right things, isn’t noisy, and is able to detect real regressions


I initially posted this message for discussion on the firefox-dev and newsgroups. This is now also a wiki page in the Performance wiki.

Diagnosing a Talos ts_paint regression on Windows XP

Recently, I worked on diagnosing a 3% startup time regression in the Talos ts_paint benchmark and I thought I’d share my experience with others who might be dealing with a similar regression.


My first step was to reproduce the regression on a Windows XP machine under my control. To prep the machine, I turned off unneeded background services to prevent them from interfering with measurements, created new Firefox profiles, configured Firefox to stop checking for updates and to use a blank page as its homepage, disabled extensions (that had been installed system-wide), and launched Firefox several times so that any data needed from disk during startup would be cached (ts_paint measures “hot” startups). I also wrote a batch script to automate the launching and shutting down of Firefox so that I could easily collect data from dozens of startups.

Finally, I turned on Telemetry data gathering since Telemetry records timestamps for each startup phase. During Firefox exit, Telemetry writes the collected data as a JSON file in <profile directory>\saved-telemetry-pings\, so every time I launched & shut down Firefox, it would spit out a new JSON file containing the startup timings. You can see a list of the startup times collected by Telemetry in the “Simple Measurements” section of your Firefox’s about:telemetry page.

I then wrote a simple Python script to extract and format the startup data from the Telemetry JSON file. This is the script and these are the results in a Google Docs spreadsheet. The regression is highlighted in red in the spreadsheet. The times show surprisingly little variation between runs and suggest that the regression is entirely contained in the startup phase directly preceding the first paint of a Firefox window. Luckily, the Gecko profiler has already been initialized at this point of startup so it was possible to capture profiles of the browser’s activity. Matt Noorenberghe and Mike Conley were able to show that at least part of this regression is caused by initialization of the new “Customize UI” functionality and painting of tabs inside the titlebar (bug 910283).
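The extraction step is simple enough to sketch. The following is not the script linked above. It assumes each saved ping is a JSON file with a `simpleMeasurements` object, possibly nested under a `payload` key, and the measurement names (`main`, `firstPaint`, `sessionRestored`) are assumptions to be checked against the “Simple Measurements” section of about:telemetry.

```python
import json
from pathlib import Path

# Assumed measurement names; check about:telemetry for the real list.
PHASES = ["main", "firstPaint", "sessionRestored"]


def startup_timings(ping: dict) -> dict:
    """Pull the startup phase timestamps out of one parsed ping."""
    # Assumption: measurements live at the top level or under "payload".
    body = ping.get("payload", ping)
    if isinstance(body, str):  # the payload may itself be a JSON string
        body = json.loads(body)
    measurements = body.get("simpleMeasurements", {})
    return {phase: measurements.get(phase) for phase in PHASES}


def collect(ping_dir: str) -> list:
    """Read every file in the saved-pings directory, oldest first."""
    files = sorted(Path(ping_dir).iterdir(), key=lambda p: p.stat().st_mtime)
    return [startup_timings(json.loads(f.read_text()))
            for f in files if f.is_file()]
```

Calling `collect()` on the saved-telemetry-pings directory would give one dictionary per startup, ready to be written out as rows for a spreadsheet.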

Problems profiling on Windows XP

I first tried profiling the startup regression with XPerf and I soon discovered that setting up XPerf on Windows XP is a surprisingly cumbersome task. I had to install XPerf from the Microsoft Windows SDK for Windows Server 2008 on a different computer running a more modern, 32-bit version of Windows (XP wouldn’t work, I used Vista), and then I manually copied xperf.exe onto the Windows XP machine. Unfortunately, I found that it’s not possible to profile with stackwalking enabled on XP and also that the captured profiles can’t be examined on an XP machine, requiring yet more copying to a newer Windows machine.

I also found that the pseudo-stack and native stack frames were not being interleaved properly in the Gecko Profiler on Windows XP (bug 900524). This was particularly irksome since it meant the JS and C++ stacks would not be merged correctly. It turned out that the version of StackWalk64 that shipped with dbghelp.dll in Windows XP was not walking the data stack properly and that replacing it with a newer version from Debugging Tools for Windows resolved the problem.

A quick Firefox startup update

Recently I’ve been working on a project to improve desktop Firefox’s startup time during “cold starts” where none of the Firefox binaries or data are cached in memory (e.g. the first launch of the browser after a reboot). I’ve been paying special attention to the time required to reach the “first paint” startup milestone: the point in time when the first Firefox window becomes visible.

The analysis has mostly consisted of profiling the latest Firefox Nightlies using XPerf on a reference dual-core Windows 7 laptop with a magnetic HDD. I’ve been working on several bugs arising from the investigation (bug 881575, bug 881578, bug 827976, bug 879957, bug 873640) and I have many more coming. This is an overview of a few challenges I’ve run into over the last month.

Making Startup Times Reproducible

I wanted to evaluate the impact of my experimental code changes by comparing startups, but I quickly discovered that there is a tremendous amount of variation in startup times in my test environment. I then turned off Windows Prefetching & Windows SuperFetch, two performance features responsible for pre-fetching files from disk based on the user’s usage patterns, but I still recorded excessive variation in start times.

I then turned off a plethora of 3rd-party and Windows services that were running in the background and accessing the disk: Windows Update & Indexing Service, OEM “boot optimizer” software, Flash & Chrome automatic updaters, graphics card configuration & monitoring software bundled with drivers, etc. After rebooting the laptop several times and disabling any remaining programs causing disk activity, I was finally able to achieve reproducible startup times. I expected that cold starts would be dominated by disk I/O, but I was surprised by just how heavily I/O operations dominated startup time in a vanilla Firefox install.
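One simple way to put a number on “reproducible” is to look at the spread across runs. The sketch below uses the coefficient of variation; the sample values and the 5% cutoff are made up for illustration and are not measurements from the test laptop.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(samples) / mean(samples)


def is_reproducible(samples, max_cv=0.05):
    # max_cv is an arbitrary cutoff chosen for this example.
    return coefficient_of_variation(samples) <= max_cv


# Hypothetical time-to-first-paint values in milliseconds.
noisy = [4200, 6900, 5100, 8300, 4700]   # background services still running
quiet = [4580, 4630, 4610, 4590, 4640]   # after disabling them

print(is_reproducible(noisy))  # False
print(is_reproducible(quiet))  # True
```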

Startup time has improved almost 30% over the last year

In an attempt to reproduce the startup regression reported in bug 818257, I compared time to first paint for Firefox 13.0.1 and Firefox 21.0 using my test setup. To my surprise, I found Firefox 21.0 (current release channel) requires roughly 4.6 seconds to reach first paint during cold starts, while Firefox 13.0.1 (release channel from a year ago) required ~6.4 seconds! This is almost a 30% reduction in startup time.
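For reference, the arithmetic behind “almost 30%”, using the rounded times quoted above (the function name is just for this example):

```python
def percent_reduction(old_seconds, new_seconds):
    """Relative improvement of new over old, as a percentage."""
    return (old_seconds - new_seconds) / old_seconds * 100


# Firefox 13.0.1 (~6.4 s) vs Firefox 21.0 (~4.6 s) time to first paint.
print(round(percent_reduction(6.4, 4.6), 1))  # 28.1
```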

See the raw results from 5 runs in a Google Docs spreadsheet here

I was surprised by this result because I expected increases in code size and the overhead from initializing new components added over the course of a year to cause regressions in startup. On the other hand, many people have landed patches to improve startup by postponing component initialization and generally reducing the amount of work done before the first-paint milestone. I haven’t tried to identify the patches responsible yet, but from a quick look at the XPerf profiles for each version, it looks like there were gains from fixing bug 756313 (“Don’t load homepage before first paint”) and from changing the list of Mozilla libraries pre-loaded at startup (see dependentlibs.list).

We are still FSYNC-ing too much at startup

Apparently, the FlushFileBuffers function on Windows causes the OS to flush everything in the write-back cache as it “does not know what part of the cache belongs to your file”. As you can imagine, calling FlushFileBuffers is bad news for Firefox startups even if it’s done off the main thread, because other I/O requests will be delayed while the disk is busy writing data. Unfortunately, we are currently calling this method on browser startup to write out the webapps.json file, the sessionstore.js file, and several SafeBrowsing files. The flush method isn’t being called directly; rather, it’s the SafeFileOutputStream and OS.File.writeAtomic() implementations that force flushes for maximum reliability. In general, we should avoid calling methods that fsync/FlushFileBuffers unless such reliability is explicitly required, and I’ve asked Yoric to change OS.File.writeAtomic() behavior to forgo flushing by default.
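To make the trade-off concrete, here is a sketch of an atomic write where the flush is opt-in rather than the default. It is written in Python for brevity and does not show how OS.File.writeAtomic() or SafeFileOutputStream are implemented; the function name and the `flush` parameter are made up for the example.

```python
import os
import tempfile


def write_atomic(path, data: bytes, flush: bool = False) -> None:
    """Write data to a temp file, then rename it over the destination."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            if flush:
                # Only pay for the flush when the caller needs durability.
                tmp.flush()
                os.fsync(tmp.fileno())
        # Replaces the destination in one step (same directory, same volume).
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Without the flush, readers still never see a half-written file because of the rename, but a crash or power loss shortly after the write may lose the new contents. That is usually an acceptable price for files that can be regenerated.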

Next steps

I’m continuing to work on reducing the number of DLL loads triggered at startup and I’ll soon be filing bugs for fixing some of the smaller sources of startup I/O.

Performance Update, Issue #3

This is a post summarizing the activities of the Performance team over the past week. You can see the full weekly Engineering progress report for all teams here:

  • Mark Reid joined the Perf team. He will be working on the Telemetry backend reboot
  • Dhaval Giani joined the Perf team as a summer intern. Dhaval is a Master’s student at the University of Toronto where he works on detecting bugs in applications of RCU locking. Dhaval’s first internship project is storing Firefox caches in volatile (purgeable) memory on Android & B2G (bug 748598?)
  • bug 867757, bug 867762: Aaron Klotz is extending the Gecko Profiler to support arbitrary annotations
  • bug 881578, bug 881575, bug 879957: I wrote a few small improvements to reduce startup I/O
    • bug 880296: We need to load fewer DLLs on startup
  • Nathan Froyd is looking into improving Firefox startup on Android
  • bug 813742: Nathan is also working on parallelizing the reftest and crashtest suites
  • bug 872421, bug 880664: Yoric landed a module loader for chrome workers
  • bug 853388: Irving & Felipe continue to work on converting Addon Manager storage from SQLite to JSON, and moving its I/O off the main thread

The team blogged about their work:

Performance Update, Issue #2

This is my regularly scheduled post summarizing performance work from the past two weeks. Alternate title: Vlad’s Big Bowl of Performance Chilli

The Performance team had its first monthly status meeting. We decided on projects and set goals & timelines: wiki. The next meeting is on Thursday, June 6th @ 11am PDT (Vidyo room “Performance”), people from other teams who are working on related projects will be invited.

Main thread I/O continues to be a major source of Firefox jank. To illustrate this point, I ran Nightly 23 with its profile stored on an SD card and captured a screen recording. The results were not pretty, as Firefox hung repeatedly during common actions (see blog post). Patrick McManus posted a band-aid patch (bug 868441) that will allow Firefox to by-pass the network cache when locking in the network cache is taking too long. The long-term solution is a network cache re-design. Aaron Klotz and Joel Maher are working on detecting when new sources of main-thread I/O are added to the code in our test environment (bug 644744).

Drew Willcoxon wrote a patch to capture page thumbnails (for about:home) in the background (bug 841495). Once it’s hooked up, this will move the thumbnailing operation off the main thread and will allow Firefox to take snapshots of sites loaded without cookies to avoid capturing sensitive data.

Other fixes:

  • bug 852467: nsDisableOldMaxSmartSizePrefEvent runs on the gecko main thread, blocks for long periods of time
  • bug 649216: Remove unnecessary delay when clicking tab close buttons sequentially
  • bug 699331: Reduce impact of font name enumeration at startup

The team blogged about progress on their longer-term projects:

Running Nightly 23 from an SD card

I’ve noticed that most Mozilla developers are using recent laptops with fast SSDs for their work. This observation shouldn’t be surprising, as developers tend to be tech enthusiasts with higher requirements, but I wonder if these fast machines could also be masking some of the performance problems in the code we write?

We’ve known for a while that I/O operations on the main thread are a major source of Firefox jank, but I think we sometimes under-estimate the urgency of refactoring the remaining sources of main-thread I/O. While Firefox might feel fast on powerful hardware, I believe a significant share of our users are still using relatively old hardware. For example, Firefox Health Report data shows that 73.5% of our more-technically-inclined Beta users (Release channel data not yet available) are on a computer with 1 or 2 cores, including hyper-threaded cores.

Even when users are on modern machines, their storage systems might be slow because of hardware problems, I/O contention, data fragmentation, Firefox profiles/binaries being accessed over a network share, power-saving settings, slow laptop hard-drives, and so on. I think it would be beneficial to test certain types of code changes against slow storage by running Firefox from a network share, an SD card, or some other consistently slow I/O device.

The Experiment

As an experiment to see what it’s like to run Firefox when I/O is consistently slow, I decided to create a Firefox profile on an SD card and surfed the web for a couple of hours, capturing profiles of jank along the way. I used an SD card which advertised maximum transfer rates of 20MB/s for reads and 15MB/s for writes, although in practice, large transfers to and from the card peaked around 10MB/s. For reference, my mechanical hard drive performs the same operations an order of magnitude faster. I left the Firefox binaries on my hard drive because I was more interested in the impact of I/O operations that could be re-factored than the impact of Firefox code size.

As I visited pages to set up the new Firefox profile with my usual customizations (extensions, pinned tabs, etc), I observed regular severe jank at every step. After entering the URL of a new site and hitting Enter, Firefox would become unresponsive and Windows would mark it as “Not Responding”. After about 5 seconds, it would start handling events again. Profiles showed that the network cache (currently being re-designed) was the most common source of these hangs. I also hit noticeable jank from other I/O bottlenecks when I performed common actions such as entering data into forms, logging into sites, downloading files, bookmarking pages, etc. Most of these issues are being worked on already, and I’m hopeful that this experiment will produce very different results in 6 months.

This is a screen recording of a simple 5-minute browsing session. The janks shown in the video are usually absent when the profile is stored on my hard drive instead. You’ll need to listen to the audio to get all the details.


The following is a selection of some of the I/O bottlenecks I encountered during my brief & unrepresentative browsing sessions with this profile.

1) Initializing a new profile takes a very long time. After first launch, even with the application binaries in the disk cache, Firefox did not paint anything for 20 seconds. It took an additional 5 seconds to show the “Welcome to Firefox” page. The “Slow SQL” section of about:telemetry showed ~40 SQL statements creating tables, indexes and views, with many taking multiple seconds each. To the best of my knowledge, there hasn’t been much research into improving profile-creation time recently. We discussed packing pre-generated databases into omni.jar a year ago (bug 699615).

2) Installing extensions is janky. The new extension’s data is inserted into the addons.sqlite and extensions.sqlite DBs on the main thread. An equal amount of time is spent storing the downloaded XPI in the HTTP cache. See profile here. Surprisingly, the URL classifier also calls into the HTTP cache, triggering jank.

3) The AwesomeBar is pretty janky at first. It seems the Places DB gets cloned several times on the main thread.

4) Unsurprisingly, startup & shutdown are slower, especially if the Add-on Manager has to update extension DBs or if extensions need to do extra initialization or cleanup. For example, AdBlock triggers main-thread I/O shortly after startup by loading the AdBlock icon from its XPI into the browser toolbar.

5) New bugs/bugs needing attention:

  • The URL classifier opens/writes/closes “urlclassifierkey3.txt” on the main thread. Filed bug 867776
  • The cookie DB is closed on the main-thread after being read into memory. Filed bug 867798
  • The startupCache file might be another good candidate for read-ahead. Filed bug 867804
  • bug 818725: localStore.rdf is fsync’ed on the main thread during GC. Not completely new, but this bug could use some attention
  • bug 789945: Preferences are flushed & fsynced on the main thread several times during a session and they can cause noticeable jank. They also show up frequently in chrome hangs.
  • bug 833545: Telemetry eagerly loads saved pings on the main thread

6) Known issues observed during browsing:

  • Network cache creates a ton of jank when storage is slow/contended
  • Password Manager, Form History, Places and Download Manager do main-thread I/O
  • SQLite DBs cause jank when opened on the main-thread
  • Font loading causes jank

A Quick Update on Snappy Progress

The Performance team has retired the Snappy name, and the individual project leads are now blogging their projects’ progress instead of Taras doing regular Snappy blog posts.

However, I think there might still be some interest in seeing the performance improvements summarized in one place, and since I’m already doing Performance team updates at the Platform meeting, I thought I would try my hand at regularly blogging about ongoing performance work.

So without further ado, these are a few highlights from the past 2 weeks:

  • bug 859558: John Daggett is working on eliminating font jank. The font bugs currently tracked in this meta-bug are top offenders according to chrome-hang reports.
  • Honza Bambas wrote a draft proposal for a new network cache design. The locking in the current network cache is a common source of Firefox jank.
  • bug 830492: Gregory Szorc changed SQLite behavior in FHR to require fewer fsyncs
  • Kyle Lahnakoski developed a tool for comparing Telemetry simple measures. This tool is in the prototype stage and is currently only being used to look for correlations between slow startups and other Telemetry variables (more on that in another blog post). Since Kyle & I are currently the only users of this tool, the page is only accessible from Mozilla’s Toronto network. You will also have to disable mixed-content protection on the page.

A recent bug affected Telemetry submission rates on Firefox 21, 22, and 23 for several weeks. It has since been resolved (bug 844331 and bug 862599), but you’ll need to exercise caution when interpreting dashboard results from the affected time period. Specifically, you may want to exclude data from time periods with relatively few Telemetry submission counts.

Finally, there were several blog posts from the individual project leads:

Current state of Firefox chrome hangs

This post summarizes the top “chrome hangs” reported to Telemetry by Nightly 22 on Windows during the first half of March.

A “chrome hang” is a period of time during which the main thread is stuck processing a single event. It is not a permanent hang. By default, Nightly on Windows reports any chrome hangs lasting at least 5 seconds to Telemetry. You can see your own chrome hangs from your current browsing session in your Nightly’s about:telemetry page.

Data set

There are roughly 270,000 Firefox sessions with chrome hangs in this data set, reporting a total of ~570,000 chrome hang stacks. There are 84,000 “unique” stack signatures, but the heuristic I use for stack signature generation is far from perfect.

The top hangs:

  1. Top 100 stack signatures
  2. Top 50 signatures, excluding plugin hangs*
  3. Top 50 signatures, excluding plugin, GC, CC, HTTP cache, font, or content hangs*

* I removed JS-only stacks that did not contain any useful identifying frames


The raw chrome hang data is challenging to categorize perfectly since there is a very long tail of stack signatures.

Instead, I used a simple heuristic to categorize the stacks, and I found that hang stacks involving plugins are by far the most common (36% of hangs), stacks with font operations are second (12%), GC and CC are third (10%), and the HTTP cache is fourth (4%). The long-tail of “other” operations makes up the remaining 38% of the data.
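A heuristic of this kind can be as simple as substring matching on frame names. The sketch below is an illustration only: the patterns are hypothetical, not the ones used for this analysis, and the first matching category wins.

```python
from collections import Counter

# Hypothetical patterns; order matters because the first match wins.
CATEGORIES = [
    ("plugin", ("NPAPI", "PluginInstance", "plugin-container")),
    ("font", ("gfxFont", "FontList", "DWrite")),
    ("GC/CC", ("GarbageCollect", "CycleCollect")),
    ("HTTP cache", ("nsCacheService", "nsCacheEntry")),
]


def categorize(stack):
    """Return the first category whose pattern appears in any frame."""
    for category, patterns in CATEGORIES:
        if any(p in frame for frame in stack for p in patterns):
            return category
    return "other"


def category_shares(stacks):
    """Percentage of hang stacks per category, rounded to whole percent."""
    counts = Counter(categorize(s) for s in stacks)
    total = sum(counts.values())
    return {c: round(100 * n / total) for c, n in counts.items()}
```

Anything that matches no pattern lands in “other”, which is why the long tail ends up being such a large share of the data.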

After filtering the data set to only the top 100 most common hang signatures, I found the distribution of hangs mostly unchanged. Most of the hangs are again caused by plugins: 53 out of the top 100 signatures are plugins (67% by number of hangs). Font-loading operations take second place (11% by number of hangs), followed by GC and CC (9%) and locking in the HTTP cache (4%). The remaining 9% of hangs is mostly made up of long-running JavaScript scripts. Unfortunately, the chrome hang reporting code currently does not walk the JavaScript stack, so it’s impossible to obtain useful information out of pure JavaScript hangs.

Data: Top 100 chrome hang stack signatures

The plugin problem

We’ve known for a while that plugins are a major source of browser unresponsiveness. In particular, the initial synchronous load of the plugin’s library and the creation of a new plugin instance are the most common hang stacks. The top 100 list also shows plugins taking a long time to handle events and destroy plugin instances. There is also a large number of stacks where Flash hangs cause Firefox hangs on account of Windows input queue synchronization (bug 818059).

Since we’re stuck with the synchronous NPAPI, I suspect we won’t be able to minimize the impact of plugin hangs until we separate content and chrome into separate processes (i.e. the Electrolysis project). In the short term, we’re mitigating some of the plugin pains by adding read-ahead to improve plugin library loading speed (bug 836488) and, starting with Firefox 20, we’re showing users a UI that allows them to terminate an unresponsive plugin after 11 seconds (bug 805591). Both of these patches were written by Aaron Klotz.

Benoit Girard, Georg Fritzsche and Benjamin Smedberg are working on uncovering the causes of some of these hangs by adding profiling support to the plugin-container.exe process (bug 853358 and bug 734691) and exposing IPC message information to the profiler (bug 853864 and bug 853363). You can find Benoit’s write-up here.

Finally, it might also be possible to move some of the heavy plugin operations off the main thread (e.g. bug 856743).


After plugins, font loading operations are the most frequent single category of browser hangs. To see an example of a fonts hang, restart your computer (or clear your OS disk cache by other means) and load the Wikipedia homepage (bug 832546). You should see a hang lasting at least 100ms (depending on your storage device) while the browser loads dozens of fonts from disk to render all the different languages displayed on the page.

There are several existing bugs for font hangs (bug 734308 and bug 699331), but I’d like to ask someone with some knowledge of the fonts code to file individual bugs for the rest. I filed bug 859558 as a catch-all bug for all the font stacks that I did not recognize.

Data: Top 50 chrome hang stack signatures, excluding plugin hangs

After plugins and fonts, the locking done in the HTTP network cache causes most of the chrome hangs. It seems the main thread often gets blocked waiting on a lock that is being held by a background cache thread. I assume the background thread is doing disk I/O while holding the contended lock. If you move your Firefox profile to a slow storage device (e.g. an SD card), you can reliably reproduce these hangs by visiting new sites. The Necko team is currently working on plans for a new network cache design.

Garbage Collection & Cycle Collection are the third most common category of hangs. Surprisingly, incremental GC stacks also show up in the top 50.


In addition to the hangs described above, there is a significant number of stacks with JIT-ed JavaScript code, page reflows and other content operations. Since chrome hangs don’t report the names of JavaScript functions on the stack and Telemetry doesn’t collect page URLs, these stacks are not useful. I filtered out these types of stacks and came up with a list of the top 50 “other” hangs.

Data: Top 50 chrome hang stack signatures, excluding common hang types (plugins, GC, CC, HTTP cache, font, content hangs)

This list shows:

  • The JavaScript debugger frequently causes chrome hangs
  • JavaScript calls an nsIDNSService interface function that synchronously resolves DNS names. Could we deprecate use of this function on the main thread?
  • Switching graphics to hardware-accelerated mode after startup (filed bug 859652) and Direct3D device initialization (filed bug 859664) causes hangs
  • Printing causes hangs (filed bug 859655)
  • JavaScript functions called by nsContentPolicy::ShouldLoad take a long time to return
  • DOM workers can take a long time to return the amount of memory in use (filed bug 859657)
  • Extension and chrome JavaScript uses nsLocalFile for main thread I/O. Some of the main-thread calls to nsLocalFile::Exists might be TestPilot main-thread I/O (filed bug 856867)
  • Proxy resolution jank doesn’t seem to have been completely fixed (bug 781732)
  • Destroying CSS style sheets takes a long time (bug 819489)
  • nsSafeFileOutputStream is used on the main thread, blocking the main thread with fsyncs and other file I/O
  • JSON stringify is sometimes called on very large objects
  • Main-thread SQL is a common source of hangs, e.g. nsNavBookmarks::QueryFolderChildren, nsAnnotationService::GetPageAnnotationString, nsDownload::UpdateDB
  • JAR files are opened and closed from the main thread causing hangs

I have not identified all of the hangs in this top list:

  • nsCryptoHash::UpdateFromStream is called from the main thread with a file stream input. I haven’t found the source of this
  • A timer callback evals a string and calls js::SourceCompressorThread::waitOnCompression which causes hangs
  • nsIncrementalDownload::OnDataAvailable writes downloaded data to disk on the main thread
  • nsExternalAppHandler::SaveToDisk moves files on the main thread (known issue?)


Plugins are the foremost cause of Firefox janks lasting multiple seconds. Fonts, GC/CC, and the HTTP network cache are also common sources. The long tail of hang signatures contains new bugs, some of which could be fixed quickly.