25
Jul 12

Telemetry and What It Is Good for: Part 1: Nuts and Bolts

Telemetry has been in production for about year. However, it turns out that many Mozillians do not know what it is good for. I presented about Telemetry at FOSDEM 2012, but have not had a chance to reach out to the core Mozilla developers because we haven’t had a Mozilla All-Hands since Telemetry got useful.

Why Would One Use Telemetry?

Telemetry exists for a single purpose: matching developer expectations with real Firefox behavior. My experience working on startup lead me to believe that is unreasonably complex to try to model real-world behavior in a lab setting and that it was actually easier to just measure real world behavior.

Anything that varies with IO, system configuration, user input, user workloads is easier to measure with Telemetry than to develop a useful finite benchmark for.

Nuts & Bolts

Telemetry consists of two parts: client-side collection code + serverside frontend.

Client-side Telemetry currently records:

  • simple measures: discrete numbers such as amount of ram, various startup times, flash version, etc
  • histograms: efficient one-dimensional means of gathering a range of values such as memory usage, cycle collection times, types of events occurring, etc. These are all specified in TelemetryHistograms.h. You can view your local histograms by enabling telemetry and installing about:telemetry.
  • slow sql statements: We record SQL statements that take over 100ms and whether they occur on main thread to prioritize Snappy SQL work.
  • chromehangs: Nightly builds ship with frame-pointers so we can detect when Firefox pauses for over 5 seconds. Every time Firefox pauses, we record the backtrace. We started sending those a month ago, processing them on the serverside is a work-in-progress. These should be very handy for prioritizing work on making Firefox more responsive

One current limitation is that histograms are on-dimensional, there is no way to relate cycle collection times to uptime, memory usage, etc. We also go to great lengths to avoid collecting any personal identifiers. As a result we have no user UIDs and no ability to track how a user’s performance changes over time.

Telemetry Frontend is a public dashboard that can be seen at arewesnappyyet.com. Anyone can get a BrowserID login and look at our browser stats. Telemetry dashboard consists of two views:

  • Telemetry Histograms: this is basically the same data as displayed in about:telemetry, but aggregated from our userbase. This was our original view and is likely to get folded into evolution in the future.
  • Telemetry Evolution: This view tracks how medians/percentiles gathered by histograms change over time. This is the view that most developers use.

Telemetry is not a technology unique to Firefox. I borrowed a lot of code from the Chromium implementation to get caught up. Microsoft also collects similar metrics.

There are two differences between us and other browser vendors:

  1. We do not assign a unique id to every user. This sucks from a developer perspective as it makes it a lot harder to track performance over time, but we believe the privacy benefits are worth it.
  2. We made our dashboards public because we would like to have our community actively involved in helping us track Firefox performance.

In part 2 I’ll discuss how various people at Mozilla use Telemetry.


23
Jul 12

Snappy, July 19: Telemetry Experiments

For the in-progress work and minor changes that landed see non-meeting notes for this week.

Jeff Muizelaar wrote an interesting blog post work involved in a tab switch on Mac.

Windows Prefetch: Experimental Data vs Reality

I once discovered that Windows Prefetch can adversely affect application startup times, bug 627591. Certain machines were showing performance to be much better with Windows prefetch disabled and using my “manual” dll preload code to warm up the cache. Manual dll preload is a win for loading large applications because it causes xul.dll to be read in sequentially rather than randomly via page-in (see my blog posts from 2010 for details of startup IO uglyness). Unfortunately Windows Prefetch + my preload code measured as a net regression. I found a weird API that seemed to return 0 when prefetch was broken and guarded preload on that.

We have recently backed out above heuristic based on a telemetry study in bug 757215. Perhaps this is why our startup numbers have started getting better in Firefox 16?

Brian Bondy setup a telemetry startup trial to randomly delete prefetch, turn on dll preload. Last week Saptarshi Guha crunched some telemetry numbers, see this bugzilla comment. Turned out Windows Prefetch is a huge win and dll preload is a tiny incremental improvement on top of that (rather than being a regression).

Moral of the story is: do not rely on manual performance testing for workloads involving large amounts of IO. Simulating a “typical” Windows machine is extremely hard without getting noisy numbers. Effort is better spent on analyzing noisy real-world numbers and running experiments in the wild.


12
Jul 12

Snappy. July 12, 2012

Testing

We discussed setting up Eideticker, for desktop Firefox responsiveness testing.

Andrew Halberstadt is making progress on a revised version of peptest. We are looking at loading talos pageset into individual tabs and tracking tab-switching

We also discussed how QA can help in helping us confirm + narrow down regressions found by telemetry.

Necko

Necko guys are continuing to remove main thread DNS resolution, are integrating a custom DNS resolver. Last week they landed a bunch of telemetry to help them play cache-lock-whack-a-mole: bug 763342, 767275.

Profiler

Our profiler should grok JavaScript now. See tomorrow’s nightly.

GC

Jon Coppeard put up a patch to do incremental sweeping. The cleanup phase of the GC is a major remaining continuous GC operation. This should help reduce remaining significant GC pauses.

Perf Team

Nicholas Chaim is almost done with setting a way to track main thread IO with XPerf in bug 770317. We would like to track main thread network IO via xperf, but it’s not clear if xperf can report what thread IO operations happen on.

Slow Startup

Turns out Firefox validates some signed extensions on startup: bug 726125. I think we finally have a good explanation for some of the ridiculously slow startups we’ve been looking at. Yuck.


09
Jul 12

Snappy update for week of July 5th

Frontend

Jared Wein discovered that our about:home was surprisingly expensive to load. He sped up the page by an estimated 30% in bug 765411. Similarly, Tim Taubert is fixing our new tab page performance in bug 753448.

Tim is also bravely attacking (via bug 769634) horrid performance issues with Firefox themes tracked by bug 650968.

Profiling

Alex Crichton added ability to profile JS in bug 761261. Benoit Girard is adding labels to the profiler to expose JS profiling info in bug 707308.  Same functionality will also allow us to add URLs to the stacks. This means that in addition to seeing what Firefox is busy with, the profiler will now provide context on what caused the processing (screenshot). This is huge. Benoit also improved profiler timing data in 769989.

Slow Startup Research

As I mentioned before, Nicholas Chaim wrote an addon to track system IO usage while starting Firefox. He has since updated his addon to be hosted on AMO and to submit that data for analysis. If you suffer from slow Firefox startups, please help us identify common IO hogs by installing his addon. Please encourage friends with slow startups to do the same.

This addon lists names of processes and amount of IO they did. This is somewhat private information, we can’t gather this data via telemetry.


02
Jul 12

Snappy June 21, 28

Necko team is busy squashing main-thread waits, see bug 765665, bug 766973.

Ehsan Akhgari turned on frame pointers on nightly channel: bug 764216. This means that one can now use the built-in profiler on nightly builds. The main purpose behind the change was to collect more chromehang data(long Firefox UI stalls). Vlad Djeric lowered the chromehang reporting threshold to 5 seconds: bug 763124. We are waiting on metrics to separate out chromehang reporting from telemetry pings: bug 763116.

Nathan Froyd is making heroic progress on teaching our events to queue so they can be prioritized: bug 715376.

Tim Taubert is working to reproduce a tab animation regression in bug 752837. He also taking over making Firefox themes less of a performance pig in bug 650968 .

We had great success with eliciting data on slow startups in Nicholas Chaim’s blog post.  We confirmed that external processes can affect Firefox startup (we had evidence for this) and that we can detect those situations (great work Nicholas!). It will be a hard slog before we can bolt a pretty UI to the extension + integrate system diagnostics into Firefox. In the meantime I recommend that SUMO people start suggesting this extension to diagnose slow Firefox installs. Nicholas is working on a revision of the extension that records slow-IO-caused-startup situations on a server so we prioritize + turn these into detectable fingerprints.

William McCloskey fixed a nasty GC regression which caused GCs to run too often: bug 768282. Andrew McCreight sped up cycle collection when closing tabs: bug 754495. Olli Pettay enabled freeing DOM nodes directly (bypassing cycle-collection) when setting .innerHtml of non-empty DOM elements see bug 730639.

PS

“The Performance of Open Source Applications” book is looking for contributors. Would be cool if someone snuck some Mozilla wisdom in there.

Sorry for skipping the snappy update last week. These posts take a lot more effort than is reasonable and I needed to direct it at my talk last week. You can see my Velocity slides here.

At least one person objected to the strong language used in the presentation (ie “dom local storage sucks”). I chose this language to emphasize the fact this isn’t a feature where one gets to weigh upsides vs downsides because the downsides are so severe. Most of the positive data on this is coming from what I believe to be unrepresentative benchmarks. I have not seen any other data points similar in quality to those reported by our telemetry.

Btw, Jan Varga is close to removing our IndexedDB prompt(bug 758357), opening up IndexedDB as an alternative to DOM Local Storage(which sucks).


25
Jun 12

Slow startup logging: help wanted

We have an intern working on an analysis tool to analyze how other Windows applications/services affect Firefox startup.

If you run Windows and you experience slow startups, please see Nicolas’ blog post and submit data his addon gathers.

ps. I will not have time to post a Snappy update for last week. The next update will cover 2 weeks.


18
Jun 12

40-60% of startups are warm?

Note: click on the images if they get clipped by other content. Cold startups are those where data has to be read in from disk, warm ones are subsequent startups where the OS already has Firefox files in memory.

I’m really surprised by the amount of warm startups done by Firefox users. Somewhere between 40% to 60% of startups are warm. On Linux you can see that by watching whether pagefaults occur while loading the firefox binary via EARLY_GLUE_STARTUP_HARD_FAULTS histogram.

On Windows we do not have a good metric  for distinguishing cold startups from warm ones. However can look at the distribution of firstpaint histogram and see that faster startups are more common than slower ones. Only a small minority of machines should be able to cold start a browser in <3 seconds.  We have a lot of startups of various degrees of warmness.

I have no explanation on why people restart Firefox so much. We know < 10% of our shutdowns are unclean (most of those appear to be due to OS shutdown not waiting on Firefox, ie us shutting down too slowly) so users aren’t crashing their browser and starting again. They are voluntarily closing the browser and then starting it soon after (ie OS doesn’t get a chance to flush Firefox out of filesystem cache).

These patterns are pretty consistent across all of the Firefox release channels I checked, so I can’t blame warm startups on nightly users getting barraged with upgrade prompts. Can someone come up with a good theory(preferably with some evidence) for this?

Note telemetry only collects data once a day and requires the browser to be open for a few minutes before submitting data, data could be skewed here.


14
Jun 12

Snappy, June 14th: Telemetry Investigations

There are no news from the Firefox frontend team this week.

Adventures in Measuring Changes

Necko team spent this week investigating why the recent big cache fix was not showing as a win in telemetry.

We were on a verge of a big backout when Saptashi Guha’s analysis in bug 762576 suggested that we might actually be winning. It’s frustrating to have data point us in different directions. However, it is better to try to make sense of data than have no data at all as was the case only a year ago. I’ll have more on this next week.

William McCloskey landed fix to turn on incremental GC for real (bug 761739). This might fix the mysterious recent user-responsiveness regression spotted by telemetry (bug 761722). He  also landed another GC speed up in 743396.

Mark Cote met with metrics analysts to discuss reporting peptest results robustly. The goal is avoid noise in reporting, so responsiveness regressions are acted upon

Interactivity Profiler

Benoit Girrard added added badges to mark known stacks in the profiler, see his blog post. A few weeks ago Vladan taught the symbolication server to serve data from local .pdb files, allowing developers to use Benoit’s profiler in own builds. Mike Conley added incomplete Thunderbird support to the profiler.

 


11
Jun 12

Snappy, June 7

Notes.

Justin’s FUEL fix will help add-ons avoid leaks and shutdown hangs: bug 750454.

Jared plans to start landing Australis tab strip (738491) on UX branch this week. Australis is our new, faster UI theme.

We landed a cache locking fix recently (722034), but telemetry is now showing a regression (761736), so this will likely be backed out and reworked.

Vladan blogged about first results from our non-destructive chromehang. Last year we briefly caused our nightly to crash if it hung for over 30seconds, which got us a lot of useful data (and some of the initial snappy bugs). This piggybacked on our crash-handling infrastructure so it was a very effective experiment (a bit brutal though). Vladan spent time this year working on plumbing to get the same sort of data non-destructively. As a result we are looking to turn on frame pointers in nightly builds and dial down hang detection to 5 seconds (bug 763124).


04
Jun 12

Snappy, May 31st – Less lag

On Friday, the necko team finally landed a fix that makes cache less likely to freeze the UI thread during reads: bug 722034. Cache writes, other less common cache use-cases remain problematic (tracked by bug 717761). Poor cache/main-thread interactions are one of the main causes of UI lag tracked by the Snappy project, so this is very exciting. Barring the need to backout, this fix will appear in Firefox 15.

Help Wanted: The necko team is looking for some help to determine the optimal disk cache size, please see Nick’s post. We need users to install an extension and submit detailed stats on our cache lifecycle.

There are various Firefox frontend fixes in progress: improving session restore (working towards 669603, 669034), FUEL (bug 750454),  search service (bug 722332) and the new theme (bug 732583). I will blog about these in more detail as they land.

Bill landed turned on incremental GC again. Hopefully it will stay on in Firefox 15.

Andrew is making progress on reducing CC pauses while closing tabs: bug 754495.

Brian has instrumented our event loop to measure the extent of Firefox lag when responding to user events, bug 759449. This is different than measuring general event-loop lag in that it focuses on lag that the user would actually notice. Look for  the EVENTLOOP_UI_LAG_EXP_MS histogram in our telemetry dashboard (yes, we are the only browser vendor to make this sort of data public). This should help us track progress as we tweak heuristics to delay background processing during user interaction (eg bug 712478).

Brian also landed a way to bypass the windows prefetch service via our privileged silent update service, see bug 692255. In my testing prefetch is likely to prefetch too many files, slowing down startup for complex apps like Firefox. Hopefully we can do better with our own prefetch.