04
Sep 12

Snappy #38: Responsiveness Fixes Galore

End of summer is a tough time to make progress because a lot of people are on vacation. Surprisingly, Firefox got some good fixes in since the last update.

Less Slow Startups

Bug 726125 should get rid of a lot of super-slow startups. Due to an abstraction accident we ended up validating jars more eagerly than expected. Firefox would go on the net (on the main thread) to check the certificate every time a signed jar was opened. There are over 500 signed extensions on AMO with over 14 million active users. See the following for background on the (now dead) feature that caused our jar code to go nuts: signed scripts and note on removal of signed script support. Thanks to Nicholas Chaim and Vladan Djeric for fixing this.

Less Proxy Lag (WIP)

Bug 769764. We have received a lot of strange complaints about Firefox network performance that we could never reproduce. Turned out this was because none of us used proxies. Patrick McManus discovered a lot of synchronous proxy and DNS code in our network stack.

The fix for this should also improve performance for people without proxies, since the proxy-autodetection code was also doing main-thread IO. Because we are replacing sync APIs with async ones, all of the existing proxy-related addons will have to be updated. Patrick is reaching out to addon authors to make sure addons are updated in time for the next release.
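To make the sync-vs-async difference concrete, here is a minimal sketch in Python (not the actual Necko code). The function names and the 0.2 second delay are made up for illustration; the event loop stands in for the Firefox main thread, and the tick counter shows how often it gets to run while a lookup is pending.

```python
import asyncio
import time


def resolve_proxy_blocking(url: str) -> str:
    """Hypothetical stand-in for a synchronous proxy/DNS lookup."""
    time.sleep(0.2)  # simulate slow network or disk IO
    return "DIRECT"


async def resolve_proxy_async(url: str) -> str:
    """Same lookup, run on a worker thread so the event loop stays free."""
    return await asyncio.to_thread(resolve_proxy_blocking, url)


async def count_ticks(stop: asyncio.Event) -> int:
    """Counts how often the 'main thread' gets to run until told to stop."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def measure(use_async: bool) -> tuple[str, int]:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # let the ticker start
    if use_async:
        result = await resolve_proxy_async("http://example.com/")
    else:
        # Blocks the event loop: the ticker cannot run until this returns.
        result = resolve_proxy_blocking("http://example.com/")
    stop.set()
    return result, await ticker
```

Running `asyncio.run(measure(False))` and `asyncio.run(measure(True))` returns the same lookup result, but the ticker only makes progress in the async case. That is roughly what a user experiences as the UI staying responsive during a slow lookup.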

Less UI Repaint Lag

Bug 786421: Nightlies got unbearably slow for me recently. It turned out we ended up continuously resizing + applying theme + redrawing invisible tooltips on every paint. Thanks to Timothy Nikkel for fixing this. This bug never affected anyone outside of the Nightly/Aurora testers, but it serves as yet another example of how the Gecko Profiler makes it easier than ever to diagnose weird performance problems. The single biggest contribution anyone can make at the moment is to provide instructions on how to reproduce lag, with accompanying profiler traces.

Less Gradient Lag

Bug 761393: Paul Adenot implemented a gradient cache. This landed as a Telemetry experiment so we can determine what the optimal cache retention strategy is. We’ll be watching the relationship between GRADIENT_DURATION and GRADIENT_RETENTION_TIME in the coming weeks.

Currently, rendering gradients causes stalls in the GPU pipeline. In previous experiments we found out that most of the tab-switch rendering time in hardware-accelerated Firefox is spent rendering gradients :(. Gradients are hard to notice for casual users, but they are heavily used in our tab strip and on Google web properties.
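For readers wondering what a "cache retention strategy" means here, below is a minimal sketch of a cache whose entries expire after a fixed retention time. It is an illustration only, not the code from bug 761393: the class name, the key format and the injectable clock are all assumptions.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class RetentionCache:
    """Keeps built values until they go unused for `retention_seconds`."""

    def __init__(self, retention_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._retention = retention_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def get_or_create(self, key: Hashable, build: Callable[[], object]) -> object:
        now = self._clock()
        self._evict_older_than(now - self._retention)
        entry = self._entries.get(key)
        # Only pay the (expensive) build cost on a miss.
        value = build() if entry is None else entry[1]
        self._entries[key] = (now, value)  # refresh the last-used time
        return value

    def _evict_older_than(self, cutoff: float) -> None:
        stale = [k for k, (last_used, _) in self._entries.items()
                 if last_used < cutoff]
        for k in stale:
            del self._entries[k]

    def __len__(self) -> int:
        return len(self._entries)
```

A key could be something like a tuple of colour stops, e.g. `(("#fff", 0.0), ("#000", 1.0))`. The retention time is the tunable part: too short and gradients get rebuilt constantly, too long and memory is wasted on gradients nobody draws anymore.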

MozCamp

I may not have a chance to post the next snappy update as I’ll be hopping on the plane to Warsaw right after our meeting. If you are attending MozCamp come to our ‘All About Performance’ session. Our goal for the talk is to significantly expand the pool of people who can diagnose Firefox (and web) performance problems.


29
Aug 12

DXR now does live regexp search, thanks Google Code

Google code was once the best code-search tool in the business. Then it got shut down, except for a few special instances like chromium.

Our intern, Jonas Finnemann Jensen, took the re2 code that used to power google code search and integrated it into DXR (among other cleanups). See his blog post for more details.

Regexps combined with the new instant search feature changed how I search Mozilla code. Instant search means that I’m constantly refining my search terms to narrow down my results to a minimum before I leave the search page. Digging through Mozilla code is pretty fun now. I believe our development instance* of DXR is the most pleasant/efficient (even if a bit rough) code indexing solution atm. I no longer miss google code for searching Mozilla.

Now that Mozilla can be searched in a pleasant way, something needs to fill the use case of searching masses of open source code. Perhaps github could plug the “google code”-sized gap in developer hearts?

 

* we also need to stabilize the development version and move it to dxr.mozilla.org.


23
Aug 12

Snappy #37

Highlights from meeting notes for today:

  • Tim Taubert worked on Firefox UI speedups
  • Lots of improvements to the profiler from Benoit Girard
  • More incremental GC work from Jon Coppeard
  • Vladan Djeric got all of the security reviews and should be able to land Nicholas Chaim’s fix for networked certificate validation: bug 726125

We spent most of the meeting discussing bug 784512. According to several data sources, Firefox 15 Beta loads pages slower than 14. Occasionally problems squeeze past our performance testing + telemetry infrastructure, and this looks like one of those times. Unfortunately, it’s quite hard to reduce a few noisy signals to a concrete performance problem. If you can reproduce a performance regression to do with loading webpages/games/etc in FF15 vs FF14, please leave a comment.

Thanks!

Thanks for the great comments on my previous snappy updates. Bug 783755 should take care of the new cache size pref not sticking. Bug 718910 on hiding Cache directory from Spotlight is making progress too.

Commenter kumalos reported a tab switching regression and posted a profile recorded with our profiler as evidence. This proved to be an example of bug 783748, and led us to identify a previously unknown issue in bug 784756. Constructive feedback like this is one of the main reasons I blog.

I highly encourage users interested in improving Firefox performance to use Nightly builds and report bugs with profiler traces attached.

Shutdown Times

I’ll end with our latest Telemetry data point. This one took a while to get right, but we finally track our shutdown speed.


16
Aug 12

TriLite, Fast String Matching in Sqlite

One of the limitations of the SQLite FTS is that it can’t do regular expression or substring searches. Jonas is addressing this with TriLite. Be sure to subscribe to Jonas’ blog for exciting DXR developments coming up within the final two weeks of his internship.
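As a rough idea of how substring search can be made fast, here is a minimal sketch of a trigram index. This assumes, as the name suggests, that TriLite is built around trigrams; the class and method names below are invented for illustration and are not TriLite's API.

```python
from collections import defaultdict


def trigrams(text: str) -> set:
    """All 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self._docs = {}                    # doc_id -> full text
        self._postings = defaultdict(set)  # trigram -> ids of docs containing it

    def add(self, doc_id, text: str) -> None:
        self._docs[doc_id] = text
        for gram in trigrams(text):
            self._postings[gram].add(doc_id)

    def search(self, needle: str) -> list:
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than 3 characters: nothing to filter on, scan all.
            candidates = set(self._docs)
        else:
            # A match must contain every trigram of the needle.
            candidates = set.intersection(
                *(self._postings.get(g, set()) for g in grams))
        # Trigrams only narrow the candidates; verify with a real substring test.
        return sorted(d for d in candidates if needle in self._docs[d])
```

The index never gives a wrong answer because every candidate is re-checked; it just avoids scanning documents that cannot possibly match.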

 

Coming to our test DXR instance soon…


16
Aug 12

Snappy #36

Misc

I have been using dates to mark passage of time in the Snappy project. I think I’ll switch to a simple counter instead. We are ~36 updates into this project.

Blogging

Ludovic Hirlimann blogged about how Spotlight spends a lot of time indexing the Firefox network ‘Cache’ directory (known problem, bug 718910). If you experience this problem and would like to see it fixed, please comment in the bug on whether the suggested remedy helps.

Tim Taubert wrote about reducing new-tab jank. I mention Tim a lot in these updates. He takes on a lot of interesting bugs in Firefox frontend. Hopefully he’ll make a habit out of blogging about his work.

Networking

Nick Hurley landed a change to reduce our maximum cache size to 350 megabytes. To avoid excessive disk IO traffic, the old cache size of 1 gigabyte remains in effect until the cache is reset. See bug 709297 and Nick’s blog post for more details. Progress is also being made on bug 777328 so we can move towards not blowing away our cache 10-20% of the time.

Michal Novotny is proceeding with incrementally reducing cache-caused jank that’s due to holding a lock on the main thread while doing IO on a background thread. He is also removing a multitude of synchronous necko APIs, see bug 695399.
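The lock problem is easy to show in miniature. The sketch below is my reading of the general pattern, not the actual cache code: a lock shared with the main thread is held for the full duration of slow IO, so the main thread stalls whenever it needs that lock. All names are hypothetical.

```python
import threading


class CacheEntryStore:
    def __init__(self, write_to_disk):
        self._lock = threading.Lock()
        self._pending = []
        self._write_to_disk = write_to_disk  # hypothetical slow IO callback

    def put(self, item) -> None:
        # Called from the "main thread"; it should only ever wait briefly here.
        with self._lock:
            self._pending.append(item)

    def flush_holding_lock(self) -> None:
        # Problematic: put() on the main thread blocks until all IO is done.
        with self._lock:
            for item in self._pending:
                self._write_to_disk(item)
            self._pending.clear()

    def flush_without_lock(self) -> None:
        # Better: grab the batch under the lock, then do the IO outside it.
        with self._lock:
            batch, self._pending = self._pending, []
        for item in batch:
            self._write_to_disk(item)

    def lock_is_held(self) -> bool:
        return self._lock.locked()
```

In the second variant the lock protects only the list swap, so the main thread waits for microseconds rather than for the disk.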

Patrick McManus is removing synchronous proxy-related code, see bug 766973 for the DNS-related bit. Turns out our proxy code also does all kinds of synchronous operations when detecting proxy configuration, etc. This is being worked on, but hasn’t been filed yet.

I usually try to highlight work that has already landed, but in this case it is important to point out that the Necko team is working hard on addressing significant problems in the networking code. These problems are tricky and will take a while to fix. The team is relatively new and is still discovering hidden surprises in their codebase.

Profiler

Benoit Girard posted a preview of view-source in the profiler. This will be handy for figuring out where performance problems lie, especially in JS files that have been preprocessed (our JS preprocessor does not try to keep the line numbers sane).


15
Aug 12

DXR is back at dxr.mozilla.org

For a while DXR lived on a lanedo.com server. Lanedo folks are now done with DXR development. DXR is once again deployed at dxr.mozilla.org. Our Perf intern, Jonas, is driving DXR development this summer. See his progress report. Be sure to subscribe to his blog. He may change how you use DXR as a result of his current work.

PS. Mozilla interns rule


14
Aug 12

Snappy for Aug 9

Lawrence took excellent Snappy meeting notes last week.

 


06
Aug 12

Snappy, Aug 2

Landed This Week

Neil Deakin joined the Snappy effort. He is working on eliminating pointless reflows in the tab strip. His fix for bug 752486 landed; bugs 752376 and 752496 are next.

Brian Bondy landed removal of our prefetch-nuking code in bug 770911. xul.dll preload is now always on based on our telemetry startup study in bug 765850.

Bill McCloskey landed the following improvements to reduce garbage collection pauses:

  • bug 777919 - Free LifoAlloc chunks on background thread, instead of as part of the final IGC slice. This isn’t a problem for most people, but for some people on OSX it can take anywhere from 50ms to 250ms or more.
  • bug 778993 - Separate runtime’s gcMallocBytes from compartment’s gcMallocBytes, so we trigger fewer non-incremental GCs with many tabs open
  • bug 767209 - Make GC slices longer when not painting to avoid non-incremental GCs.

See Bill’s comments in bug 767209 for some insight into the complex heuristics that go into minimizing GC interruptions: comment 1, comment 2.
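To give a feel for the slice-length tradeoff, here is a toy model of incremental collection. It is a sketch under made-up assumptions (the budgets and the "one unit of work per object" model are invented), not how the SpiderMonkey heuristics actually work.

```python
from collections import deque


def run_slice(worklist: deque, budget: int) -> bool:
    """Process up to `budget` items; return True once all work is finished."""
    for _ in range(min(budget, len(worklist))):
        worklist.popleft()  # stand-in for marking one object
    return not worklist


def collect(worklist: deque, is_painting, short_budget: int = 10,
            long_budget: int = 40) -> int:
    """Run slices until the collection is done; return how many were needed."""
    slices = 0
    while True:
        # Short slices keep painting smooth; longer ones finish the GC sooner.
        budget = short_budget if is_painting() else long_budget
        slices += 1
        if run_slice(worklist, budget):
            return slices
```

With 100 units of work this model needs 10 slices while painting and only 3 when idle. Finishing in fewer slices matters because an incremental GC that falls too far behind has to give up and run as one long non-incremental pause.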

Coming Soon

In the coming week I expect to see some good optimizations land for page rendering, tab-switching behavior, more robust cache, etc.

Some Snappy people will be attending MozCamp.eu 2012 in Warsaw, Poland on September 8 and 9. Expect to see lots of talk on profiling and other performance tools.

I hope to have about 15-20 Performance/Snappy people in Warsaw for the following week. This is not yet finalized. At the moment we are looking to see if there is a coworking space or a company in Warsaw that could host us.


26
Jul 12

Snappy, July 26: Go Try The Gecko Profiler!

See raw notes for details on mid-flight snappy work.

Check out the SPS Profiler!

The SPS Gecko Profiler has gotten a lot of praise this week on #perf. If you ever wonder what the hell Firefox is doing with your CPU, give the profiler a try. For the past couple of weeks it has been able to label stacks with JS, URLs and even favicons. Mozilla may well have shipped the world’s first profiler to feature favicons.

Having JS support is nice: it led to the first 2 snappy addon bugs, 777266 and 777397. I documented how to act on addon responsiveness issues in the snappy wiki.

Whether you develop web pages, addons or are a core gecko hacker, the profiler may make the performance-analysis part of your life much more pleasant. Update: Benoit Girard wrote about the new profiler features.

Things To Not Do On Startup

Blair McBride did some digging: there may be 15 million users with signed extensions, which can cause Firefox to do network IO (i.e. stall for a long time) on startup.

Brian Bondy landed a fix to lower the IO priority of nuking our cache: bug 773518. According to telemetry, 10-20% of startups feature cache nuking. It takes a while to blow away 1GB of files on startup. Brian used telemetry to investigate causes for cache purges in bug 774146. Based on this data, Brian will begin tackling what may be the oldest snappy bug so far: bug 105843. For more details on our cache see Nick Hurley’s blog post (also see his link to a similar blog post from a Chrome person).

More Responsive Tabs

Tim Taubert made our new tab animation more pleasant in bug 716108. Tim also landed a fix to halve jank caused by thumbnail capture in bug 774811; this should result in a better tab-switching experience. Stay tuned for more developer attention in this area.

GC

Jon Coppeard enabled incremental sweeping: bug 729760. This should result in slightly smaller GC pauses.

 

 

 


25
Jul 12

Telemetry and What It Is Good for: Part 2: Telemetry Achievements

An inquisitive mind sent me an email with a pointed question:

“Is there an example of someone who’s not you that had a burning question that would drive some sort of research or development activity and got it answered by telemetry?”

I forwarded his email and got a pretty fun survey. See below for a slightly edited version of the emails I got. In addition to the positive experiences below, people had a lot of complaints about the telemetry experience.

Justin Lebar
Justin is probably the most vocal telemetry user who isn’t me. He blogged about one of his more successful telemetry experiences.
Andrew McCreight
One telemetry stat I added is CYCLE_COLLECTOR_NEED_GC.  Sometimes a read barrier fails and we need to do a GC synchronously at the start of a CC, which is terrible for pause times.  Using telemetry, I confirmed my suspicion that this is very rare, and thus not worth trying to improve.
Another stat Olli added is FORGET_SKIPPABLE_MAX, which tracks the length of the CC cleanup phases we run.  As we made the cleanup more and more thorough, the times of these got longer and longer.  I think eventually this led Olli to try to fix the worst case cleanup phases, in bug 747675.  He had this comment in there: “Based on the initial telemetry data, the patch doesn’t affect too much to the already low median times, but helps significantly with the worst 5%, so mean time decreases quite nicely.”
Also, back around Firefox 13, Olli was using telemetry to observe the results of various CC optimizations, to assess their effectiveness, which he then used to decide whether or not to nominate various patches for landing on Aurora 12.  Telemetry let us see the reward part of the risk vs. reward tradeoff, and get some pretty big improvements into 12.
Locally, I use about:telemetry to get a sense of what the behavior on my local machine has been, but I suppose that doesn’t really fall under “telemetry” per se.  But it was quite useful during the Cycle Collector Crisis to see what CC behavior people had been seeing on their machines.

Olli Pettay
Thanks Andrew, very accurate summary of what I’ve been doing
I tend to look at CC telemetry data daily. I very rarely use the histogram, since evolution is more interesting to me. Especially median time and also how P75 and P95 evolve. (The focus in this Q is to get lower bad times, so we should manage to drop P75 and P95)
I use also about:telemetry locally since I tend to run builds with some patch, and I want to see if they affect badly to CC or GC times.

Taras
My blog is basically a collection of telemetry trivia :)
I do not have testimonials from other people. However I heard that the silent-update team proved something about silent updates with telemetry, Necko team discovered that some optimizations were not, etc.
I encourage other people who solved a problem with telemetry to either write a blog post or leave a comment.

In part 3 I will cover flaws in the current Telemetry experience.