26
Jul 12

Snappy, July 26: Go Try The Gecko Profiler!

See raw notes for details on mid-flight snappy work.

Check out the SPS Profiler!

SPS Gecko Profiler has gotten a lot of praise this week on #perf. If you ever wonder what the hell Firefox is doing with your CPU, give the profiler a try. For the past couple of weeks it has been able to label stacks with JS, URLs and even favicons. It’s likely that Mozilla has shipped the world’s first profiler to feature favicons.

Having JS support is nice: it led to the first 2 snappy addon bugs, 777266 and 777397. I documented how to act on addon responsiveness issues in the snappy wiki.

Whether you develop web pages, addons or are a core gecko hacker, the profiler may make the performance-analysis part of your life much more pleasant. Update: Benoit Girard wrote about the new profiler features.

Things To Not Do On Startup

Blair McBride did some digging: there may be 15 million users with signed extensions, which can cause Firefox to do network IO (i.e. stall for a long time) on startup.

Brian Bondy landed a fix to lower the IO priority of nuking our cache: 773518. According to telemetry, 10-20% of startups feature cache nuking. It takes a while to blow away 1GB of files on startup. Brian used telemetry to investigate causes for cache purges in bug 774146. Based on this data, Brian will begin tackling what may be the oldest snappy bug so far: bug 105843. For more details on our cache see Nick Hurley’s blog post (also see his link to a similar blog post from a Chrome person).

More Responsive Tabs

Tim Taubert made our new tab animation more pleasant in bug 716108. Tim also landed a fix to halve jank caused by thumbnail capture in bug 774811; this should result in a better tab-switching experience. Stay tuned for more developer attention in this area.

GC

Jon Coppeard enabled incremental sweeping: bug 729760. This should result in slightly smaller GC pauses.


25
Jul 12

Telemetry and What It Is Good for: Part 2: Telemetry Achievements

An inquisitive mind sent me an email with a pointed question:

“Is there an example of someone who’s not you that had a burning question that would drive some sort of research or development activity and got it answered by telemetry?”

I forwarded his email and got a pretty fun survey. See below for a slightly edited version of the emails I got. In addition to the positive experiences below, people had a lot of complaints about the telemetry experience.

Justin Lebar
Justin is probably the most vocal telemetry user who isn’t me. He blogged about one of his more successful telemetry experiences.
Andrew McCreight
One telemetry stat I added is CYCLE_COLLECTOR_NEED_GC.  Sometimes a read barrier fails and we need to do a GC synchronously at the start of a CC, which is terrible for pause times.  Using telemetry, I confirmed my suspicion that this is very rare, and thus not worth trying to improve.
Another stat Olli added is FORGET_SKIPPABLE_MAX, which tracks the length of the CC cleanup phases we run.  As we made the cleanup more and more thorough, the times of these got longer and longer.  I think eventually this led Olli to try to fix the worst case cleanup phases, in bug 747675.  He had this comment in there: “Based on the initial telemetry data, the patch doesn’t affect too much to the already low median times, but helps significantly with the worst 5%, so mean time decreases quite nicely.”
Also, back around Firefox 13, Olli was using telemetry to observe the results of various CC optimizations, to assess their effectiveness, which he then used to decide whether or not to nominate various patches for landing on Aurora 12.  Telemetry let us see the reward part of the risk vs. reward tradeoff, and get some pretty big improvements into 12.
Locally, I use about:telemetry to get a sense of what the behavior on my local machine has been, but I suppose that doesn’t really fall under “telemetry” per se.  But it was quite useful during the Cycle Collector Crisis to see what CC behavior people had been seeing on their machines.

Olli Pettay
Thanks Andrew, very accurate summary of what I’ve been doing.
I tend to look at CC telemetry data daily. I very rarely use the histogram, since evolution is more interesting to me. Especially median time and also how P75 and P95 evolve. (The focus in this Q is to get lower bad times, so we should manage to drop P75 and P95)
I use also about:telemetry locally since I tend to run builds with some patch, and I want to see if they affect badly to CC or GC times.

Taras
My blog is basically a collection of telemetry trivia :)
I do not have testimonials from other people. However I heard that the silent-update team proved something about silent updates with telemetry, Necko team discovered that some optimizations were not, etc.
I encourage other people who solved a problem with telemetry to either write a blog post or leave a comment.

In part 3 I will cover flaws in the current Telemetry experience.


25
Jul 12

Telemetry and What It Is Good for: Part 1: Nuts and Bolts

Telemetry has been in production for about a year. However, it turns out that many Mozillians do not know what it is good for. I presented about Telemetry at FOSDEM 2012, but have not had a chance to reach out to the core Mozilla developers because we haven’t had a Mozilla All-Hands since Telemetry got useful.

Why Would One Use Telemetry?

Telemetry exists for a single purpose: matching developer expectations with real Firefox behavior. My experience working on startup led me to believe that it is unreasonably complex to try to model real-world behavior in a lab setting and that it is actually easier to just measure real-world behavior.

Anything that varies with IO, system configuration, user input, user workloads is easier to measure with Telemetry than to develop a useful finite benchmark for.

Nuts & Bolts

Telemetry consists of two parts: client-side collection code + serverside frontend.

Client-side Telemetry currently records:

  • simple measures: discrete numbers such as amount of ram, various startup times, flash version, etc
  • histograms: efficient one-dimensional means of gathering a range of values such as memory usage, cycle collection times, types of events occurring, etc. These are all specified in TelemetryHistograms.h. You can view your local histograms by enabling telemetry and installing about:telemetry.
  • slow sql statements: We record SQL statements that take over 100ms and whether they occur on main thread to prioritize Snappy SQL work.
  • chromehangs: Nightly builds ship with frame-pointers so we can detect when Firefox pauses for over 5 seconds. Every time Firefox pauses, we record the backtrace. We started sending those a month ago; processing them on the serverside is a work-in-progress. These should be very handy for prioritizing work on making Firefox more responsive.

One current limitation is that histograms are one-dimensional: there is no way to relate cycle collection times to uptime, memory usage, etc. We also go to great lengths to avoid collecting any personal identifiers. As a result we have no user IDs and no ability to track how a user’s performance changes over time.
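To make the histogram idea concrete, here is a minimal Python sketch of a fixed-bucket, one-dimensional histogram. It is not Firefox's implementation (the real histograms are declared in TelemetryHistograms.h); the class name, bucket boundaries and sample values are all invented for illustration. It shows why the approach is cheap (only one counter per bucket is kept) and why it cannot relate one measurement to another (individual samples are thrown away).

```python
from bisect import bisect_right


class Histogram:
    """Fixed-bucket, one-dimensional histogram (illustrative only)."""

    def __init__(self, bucket_lower_bounds):
        # The bounds must be sorted. Bucket i covers
        # [bounds[i], bounds[i + 1]); the last bucket is open-ended.
        self.bounds = list(bucket_lower_bounds)
        self.counts = [0] * len(self.bounds)

    def accumulate(self, value):
        # Find the last bucket whose lower bound is <= value.
        # Values below the first bound are clamped into bucket 0.
        index = max(bisect_right(self.bounds, value) - 1, 0)
        self.counts[index] += 1


# Hypothetical bucket layout for cycle collection times in milliseconds.
cc_times_ms = Histogram([0, 10, 50, 100, 500, 1000])
for sample in [3, 12, 48, 75, 640, 7]:
    cc_times_ms.accumulate(sample)

print(cc_times_ms.counts)
```

Once a sample lands in a bucket, nothing records when it happened or how much memory was in use at the time, which is the limitation described above.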

Telemetry Frontend is a public dashboard that can be seen at arewesnappyyet.com. Anyone can get a BrowserID login and look at our browser stats. Telemetry dashboard consists of two views:

  • Telemetry Histograms: this is basically the same data as displayed in about:telemetry, but aggregated from our userbase. This was our original view and is likely to get folded into evolution in the future.
  • Telemetry Evolution: This view tracks how medians/percentiles gathered by histograms change over time. This is the view that most developers use.
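As a rough illustration of what the evolution view has to compute, the Python sketch below estimates a percentile from bucket counts alone. This is an assumption about the general technique, not the dashboard's actual code; the function name, bucket bounds and counts are made up.

```python
def estimate_percentile(bounds, counts, fraction):
    """Return the lower bound of the bucket holding the requested percentile.

    bounds[i] is the lower bound of bucket i and counts[i] is the number
    of samples that fell into it. fraction is between 0 and 1.
    """
    total = sum(counts)
    if total == 0:
        return None
    target = fraction * total
    running = 0
    for lower_bound, count in zip(bounds, counts):
        running += count
        if running >= target:
            return lower_bound
    return bounds[-1]


# Invented aggregate: 100 samples spread over six buckets (milliseconds).
bounds = [0, 10, 50, 100, 500, 1000]
counts = [40, 30, 20, 6, 3, 1]

median = estimate_percentile(bounds, counts, 0.50)
p75 = estimate_percentile(bounds, counts, 0.75)
p95 = estimate_percentile(bounds, counts, 0.95)
```

The answer is only as precise as the bucket widths, so the tail percentiles (P75, P95) are coarse when the top buckets are wide.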

Telemetry is not a technology unique to Firefox. I borrowed a lot of code from the Chromium implementation to get caught up. Microsoft also collects similar metrics.

There are two differences between us and other browser vendors:

  1. We do not assign a unique id to every user. This sucks from a developer perspective as it makes it a lot harder to track performance over time, but we believe the privacy benefits are worth it.
  2. We made our dashboards public because we would like to have our community actively involved in helping us track Firefox performance.

In part 2 I’ll discuss how various people at Mozilla use Telemetry.


23
Jul 12

Snappy, July 19: Telemetry Experiments

For the in-progress work and minor changes that landed see non-meeting notes for this week.

Jeff Muizelaar wrote an interesting blog post on the work involved in a tab switch on Mac.

Windows Prefetch: Experimental Data vs Reality

I once discovered that Windows Prefetch can adversely affect application startup times, bug 627591. Certain machines were showing performance to be much better with Windows Prefetch disabled and using my “manual” dll preload code to warm up the cache. Manual dll preload is a win for loading large applications because it causes xul.dll to be read in sequentially rather than randomly via page-in (see my blog posts from 2010 for details of startup IO ugliness). Unfortunately Windows Prefetch + my preload code measured as a net regression. I found a weird API that seemed to return 0 when prefetch was broken and guarded preload on that.
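The sequential-read idea behind manual preload can be sketched in a few lines of Python. This is only an illustration of the technique, not the actual Firefox preload code; the function name and chunk size are assumptions.

```python
def preload_file(path, chunk_size=1024 * 1024):
    """Read a file front to back so the OS file cache ends up holding it.

    The data is discarded. The point is that one large sequential pass is
    much friendlier to a disk than many small random page-ins later.
    """
    bytes_read = 0
    # buffering=0 avoids an extra Python-level buffer; we only want the
    # side effect of the OS reading the file.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            bytes_read += len(chunk)
    return bytes_read


# Hypothetical usage before loading a large library:
# preload_file("xul.dll")
```

Whether this helps depends on what else is already warming the cache, which is exactly what made the lab measurements misleading.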

We have recently backed out the above heuristic based on a telemetry study in bug 757215. Perhaps this is why our startup numbers have started getting better in Firefox 16?

Brian Bondy set up a telemetry startup trial to randomly delete prefetch and turn on dll preload. Last week Saptarshi Guha crunched some telemetry numbers, see this bugzilla comment. It turned out that Windows Prefetch is a huge win and dll preload is a tiny incremental improvement on top of that (rather than being a regression).

Moral of the story is: do not rely on manual performance testing for workloads involving large amounts of IO. Simulating a “typical” Windows machine is extremely hard without getting noisy numbers. Effort is better spent on analyzing noisy real-world numbers and running experiments in the wild.
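For a sense of what analyzing such a trial involves, here is a minimal Python sketch: clients are randomly placed in an arm, and startup times are then compared per arm using the median, which shrugs off the occasional wild outlier. The arm names and every number are invented for illustration and are not data from the actual study.

```python
import random
from statistics import median

# Hypothetical arm names, not the ones used in the real trial.
ARMS = ["prefetch", "no_prefetch"]


def assign_arm(rng):
    """Randomly place a client in one of the trial arms."""
    return rng.choice(ARMS)


def median_by_arm(samples):
    """samples: iterable of (arm, startup_ms) pairs reported from the field."""
    by_arm = {}
    for arm, startup_ms in samples:
        by_arm.setdefault(arm, []).append(startup_ms)
    return {arm: median(times) for arm, times in by_arm.items()}


# Made-up reports, including one noisy outlier (25000 ms).
reports = [
    ("prefetch", 4100), ("prefetch", 3900), ("prefetch", 25000),
    ("no_prefetch", 9000), ("no_prefetch", 8800), ("no_prefetch", 9400),
]
medians = median_by_arm(reports)
```

A mean over the same made-up reports would be dominated by the single 25000 ms startup, which is why medians and percentiles are the more useful summary for noisy field data.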


12
Jul 12

Snappy. July 12, 2012

Testing

We discussed setting up Eideticker for desktop Firefox responsiveness testing.

Andrew Halberstadt is making progress on a revised version of peptest. We are looking at loading the talos pageset into individual tabs and tracking tab-switching.

We also discussed how QA can help us confirm + narrow down regressions found by telemetry.

Necko

Necko guys are continuing to remove main thread DNS resolution and are integrating a custom DNS resolver. Last week they landed a bunch of telemetry to help them play cache-lock-whack-a-mole: bug 763342, 767275.

Profiler

Our profiler should grok JavaScript now. See tomorrow’s nightly.

GC

Jon Coppeard put up a patch to do incremental sweeping. The cleanup phase of the GC is a major remaining continuous GC operation. This should help reduce remaining significant GC pauses.

Perf Team

Nicholas Chaim is almost done with setting up a way to track main thread IO with XPerf in bug 770317. We would like to track main thread network IO via xperf, but it’s not clear if xperf can report what thread IO operations happen on.

Slow Startup

Turns out Firefox validates some signed extensions on startup: bug 726125. I think we finally have a good explanation for some of the ridiculously slow startups we’ve been looking at. Yuck.


09
Jul 12

Snappy update for week of July 5th

Frontend

Jared Wein discovered that our about:home was surprisingly expensive to load. He sped up the page by an estimated 30% in bug 765411. Similarly, Tim Taubert is fixing our new tab page performance in bug 753448.

Tim is also bravely attacking (via bug 769634) horrid performance issues with Firefox themes tracked by bug 650968.

Profiling

Alex Crichton added the ability to profile JS in bug 761261. Benoit Girard is adding labels to the profiler to expose JS profiling info in bug 707308. The same functionality will also allow us to add URLs to the stacks. This means that in addition to seeing what Firefox is busy with, the profiler will now provide context on what caused the processing (screenshot). This is huge. Benoit also improved profiler timing data in 769989.

Slow Startup Research

As I mentioned before, Nicholas Chaim wrote an addon to track system IO usage while starting Firefox. He has since updated his addon to be hosted on AMO and to submit that data for analysis. If you suffer from slow Firefox startups, please help us identify common IO hogs by installing his addon. Please encourage friends with slow startups to do the same.

This addon lists the names of processes and the amount of IO they did. This is somewhat private information, so we can’t gather this data via telemetry.


02
Jul 12

Snappy June 21, 28

Necko team is busy squashing main-thread waits, see bug 765665, bug 766973.

Ehsan Akhgari turned on frame pointers on nightly channel: bug 764216. This means that one can now use the built-in profiler on nightly builds. The main purpose behind the change was to collect more chromehang data (long Firefox UI stalls). Vlad Djeric lowered the chromehang reporting threshold to 5 seconds: bug 763124. We are waiting on metrics to separate out chromehang reporting from telemetry pings: bug 763116.

Nathan Froyd is making heroic progress on teaching our events to queue so they can be prioritized: bug 715376.

Tim Taubert is working to reproduce a tab animation regression in bug 752837. He is also taking over making Firefox themes less of a performance pig in bug 650968.

We had great success with eliciting data on slow startups in Nicholas Chaim’s blog post. We confirmed that external processes can affect Firefox startup (we had evidence for this) and that we can detect those situations (great work Nicholas!). It will be a hard slog before we can bolt a pretty UI to the extension + integrate system diagnostics into Firefox. In the meantime I recommend that SUMO people start suggesting this extension to diagnose slow Firefox installs. Nicholas is working on a revision of the extension that records slow-IO-caused-startup situations on a server so we can prioritize + turn these into detectable fingerprints.

William McCloskey fixed a nasty GC regression which caused GCs to run too often: bug 768282. Andrew McCreight sped up cycle collection when closing tabs: bug 754495. Olli Pettay enabled freeing DOM nodes directly (bypassing cycle-collection) when setting .innerHTML of non-empty DOM elements: see bug 730639.

PS

“The Performance of Open Source Applications” book is looking for contributors. Would be cool if someone snuck some Mozilla wisdom in there.

Sorry for skipping the snappy update last week. These posts take a lot more effort than is reasonable and I needed to direct it at my talk last week. You can see my Velocity slides here.

At least one person objected to the strong language used in the presentation (ie “dom local storage sucks”). I chose this language to emphasize the fact this isn’t a feature where one gets to weigh upsides vs downsides because the downsides are so severe. Most of the positive data on this is coming from what I believe to be unrepresentative benchmarks. I have not seen any other data points similar in quality to those reported by our telemetry.

Btw, Jan Varga is close to removing our IndexedDB prompt (bug 758357), opening up IndexedDB as an alternative to DOM Local Storage (which sucks).