I abandoned wordpress for a less crappy solution. My new blog is located at http://taras.glek.net/.
Google Code Search was once the best code-search tool in the business. Then it got shut down, except for a few special instances like Chromium.
Our intern, Jonas Finnemann Jensen, took the re2 code that used to power google code search and integrated it into DXR (among other cleanups). See his blog post for more details.
Regexps combined with the new instant search feature changed how I search Mozilla code. Instant search means that I’m constantly refining my search terms to narrow down my results to a minimum before I leave the search page. Digging through Mozilla code is pretty fun now. I believe our development instance* of DXR is the most pleasant/efficient (even if a bit rough) code indexing solution atm. I no longer miss google code for searching Mozilla.
Now that Mozilla can be searched in a pleasant way, something needs to fill the remaining use case: searching masses of open source code. Perhaps github could plug the “google code”-sized gap in developer hearts?
* we also need to stabilize the development version and move it to dxr.mozilla.org.
One of the limitations of the SQLite FTS is that it can’t do regular expression or substring searches. Jonas is addressing this with TriLite. Be sure to subscribe to Jonas’ blog for exciting DXR developments coming up within the final two weeks of his internship.
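To make the substring-search problem concrete, here is a minimal sketch of a trigram index, the general technique that TriLite's name hints at. This is an assumption for illustration, not TriLite's actual code: the function names and the sample file contents are hypothetical. The idea is to index every 3-character window of each document, use the index to find candidate documents that contain all trigrams of the query, and then verify the candidates with a real substring check.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character windows in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(documents):
    """Map each trigram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def substring_search(index, documents, needle):
    """Find documents containing needle, using the index to prune candidates."""
    grams = trigrams(needle)
    if not grams:
        # Query shorter than 3 characters: the index cannot help, scan everything.
        candidates = set(documents)
    else:
        # A document can only match if it contains every trigram of the query.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Trigram presence is necessary but not sufficient, so verify each candidate.
    return sorted(d for d in candidates if needle in documents[d])


# Hypothetical file contents, for illustration only.
documents = {
    "a.cpp": "nsCycleCollector::Collect",
    "b.cpp": "CollectGarbage",
    "c.cpp": "Telemetry::Accumulate",
}
index = build_index(documents)
print(substring_search(index, documents, "Collect"))
print(substring_search(index, documents, "ycleC"))
```

A word-based full-text index would miss the second query because "ycleC" is not a token, while the trigram index handles it like any other substring.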
Coming to our test DXR instance soon…
For a while DXR lived on a lanedo.com server. Lanedo folks are now done with DXR development. DXR is once again deployed at dxr.mozilla.org. Our Perf intern, Jonas, is driving DXR development this summer. See his progress report. Be sure to subscribe to his blog. He may change how you use DXR as a result of his current work.
PS. Mozilla interns rule
An inquisitive mind sent me an email with a pointed question:
“Is there an example of someone who’s not you that had a burning question that would drive some sort of research or development activity and got it answered by telemetry?”
I forwarded his email and got a pretty fun survey. See below for a slightly edited version of the emails I got. In addition to the positive experiences below, people had a lot of complaints about the telemetry experience.
Justin is probably the most vocal telemetry user who isn’t me. He blogged about one of his more successful telemetry experiences.
One telemetry stat I added is CYCLE_COLLECTOR_NEED_GC. Sometimes a read barrier fails and we need to do a GC synchronously at the start of a CC, which is terrible for pause times. Using telemetry, I confirmed my suspicion that this is very rare, and thus not worth trying to improve.
Another stat Olli added is FORGET_SKIPPABLE_MAX, which tracks the length of the CC cleanup phases we run. As we made the cleanup more and more thorough, the times of these got longer and longer. I think eventually this led Olli to try to fix the worst case cleanup phases, in bug 747675. He had this comment in there: “Based on the initial telemetry data, the patch doesn’t affect too much to the already low median times, but helps significantly with the worst 5%, so mean time decreases quite nicely.”
Also, back around Firefox 13, Olli was using telemetry to observe the results of various CC optimizations, to assess their effectiveness, which he then used to decide whether or not to nominate various patches for landing on Aurora 12. Telemetry let us see the reward part of the risk vs. reward tradeoff, and get some pretty big improvements into 12.
Locally, I use about:telemetry to get a sense of what the behavior on my local machine has been, but I suppose that doesn’t really fall under “telemetry” per se. But it was quite useful during the Cycle Collector Crisis to see what CC behavior people had been seeing on their machines.
Thanks Andrew, very accurate summary of what I’ve been doing
I tend to look at CC telemetry data daily. I very rarely use the histogram, since evolution is more interesting to me. Especially median time and also how P75 and P95 evolve. (The focus in this Q is to get lower bad times, so we should manage to drop P75 and P95)
I also use about:telemetry locally since I tend to run builds with some patch, and I want to see if it badly affects CC or GC times.
My blog is basically a collection of telemetry trivia
I do not have testimonials from other people. However I heard that the silent-update team proved something about silent updates with telemetry, Necko team discovered that some optimizations were not, etc.
I encourage other people who solved a problem with telemetry to either write a blog post or leave a comment.
In part 3 I will cover flaws in the current Telemetry experience.
Telemetry has been in production for about a year. However, it turns out that many Mozillians do not know what it is good for. I presented about Telemetry at FOSDEM 2012, but have not had a chance to reach out to the core Mozilla developers because we haven’t had a Mozilla All-Hands since Telemetry got useful.
Why Would One Use Telemetry?
Telemetry exists for a single purpose: matching developer expectations with real Firefox behavior. My experience working on startup led me to believe that it is unreasonably complex to try to model real-world behavior in a lab setting and that it is actually easier to just measure real-world behavior.
Anything that varies with IO, system configuration, user input, user workloads is easier to measure with Telemetry than to develop a useful finite benchmark for.
Nuts & Bolts
Telemetry consists of two parts: client-side collection code + serverside frontend.
Client-side Telemetry currently records:
- simple measures: discrete numbers such as amount of ram, various startup times, flash version, etc
- histograms: efficient one-dimensional means of gathering a range of values such as memory usage, cycle collection times, types of events occurring, etc. These are all specified in TelemetryHistograms.h. You can view your local histograms by enabling telemetry and installing about:telemetry.
- slow sql statements: We record SQL statements that take over 100ms and whether they occur on main thread to prioritize Snappy SQL work.
- chromehangs: Nightly builds ship with frame-pointers so we can detect when Firefox pauses for over 5 seconds. Every time Firefox pauses, we record the backtrace. We started sending those a month ago; processing them on the serverside is a work-in-progress. These should be very handy for prioritizing work on making Firefox more responsive.
One current limitation is that histograms are one-dimensional: there is no way to relate cycle collection times to uptime, memory usage, etc. We also go to great lengths to avoid collecting any personal identifiers. As a result we have no user IDs and no ability to track how a user’s performance changes over time.
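A minimal sketch of what a one-dimensional bucketed histogram looks like may help here. This is an illustration under assumptions, not the Firefox implementation: the class, the bucket boundaries, and the sample values are all hypothetical. Each recorded value only bumps a counter in one bucket, which is why the data is cheap to collect and also why nothing ties a given sample to uptime or memory usage.

```python
from bisect import bisect_right


class Histogram:
    """One-dimensional bucketed histogram (illustrative only)."""

    def __init__(self, bucket_starts):
        # bucket_starts holds sorted lower bounds; bucket i covers
        # [bucket_starts[i], bucket_starts[i + 1]), the last one is open-ended.
        self.bucket_starts = list(bucket_starts)
        self.counts = [0] * len(self.bucket_starts)

    def add(self, value):
        # Pick the last bucket whose lower bound is <= value.
        index = bisect_right(self.bucket_starts, value) - 1
        # Values below the first bound are clamped into the first bucket.
        index = max(index, 0)
        self.counts[index] += 1


# Hypothetical bucket layout for cycle collection times in milliseconds.
cc_times_ms = Histogram([0, 10, 50, 100, 500])
for sample in (3, 12, 49, 75, 800, 120):
    cc_times_ms.add(sample)

print(cc_times_ms.counts)
```

Only the per-bucket counts survive; the individual samples, their order, and whatever else was going on when they were recorded are discarded.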
Telemetry Frontend is a public dashboard that can be seen at arewesnappyyet.com. Anyone can get a BrowserID login and look at our browser stats. Telemetry dashboard consists of two views:
- Telemetry Histograms: this is basically the same data as displayed in about:telemetry, but aggregated from our userbase. This was our original view and is likely to get folded into evolution in the future.
- Telemetry Evolution: This view tracks how medians/percentiles gathered by histograms change over time. This is the view that most developers use.
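As a rough sketch of how an evolution view can be derived from histogram data, the snippet below estimates the median and 95th percentile from bucket counts for each day. The function name, bucket boundaries, and daily counts are hypothetical, and the real dashboard may compute its numbers differently.

```python
def percentile_bucket(bucket_starts, counts, percent):
    """Return the lower bound of the bucket that holds the given percentile."""
    total = sum(counts)
    if total == 0:
        return None
    target = percent * total / 100
    running = 0
    for start, count in zip(bucket_starts, counts):
        running += count
        if running >= target:
            return start
    return bucket_starts[-1]


# Hypothetical aggregated bucket counts for two days of submissions.
bucket_starts = [0, 10, 50, 100, 500]
daily_counts = {
    "day 1": [10, 40, 30, 15, 5],
    "day 2": [20, 50, 25, 4, 1],
}

evolution = {
    day: {
        "median": percentile_bucket(bucket_starts, counts, 50),
        "p95": percentile_bucket(bucket_starts, counts, 95),
    }
    for day, counts in daily_counts.items()
}

for day, stats in evolution.items():
    print(day, stats)
```

In this made-up data the median stays in the same bucket while the 95th percentile drops, which is the kind of change that is easy to miss when looking at a single histogram and easy to see when the percentiles are plotted over time.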
Telemetry is not a technology unique to Firefox. I borrowed a lot of code from the Chromium implementation to get caught up. Microsoft also collects similar metrics.
There are two differences between us and other browser vendors:
- We do not assign a unique id to every user. This sucks from a developer perspective as it makes it a lot harder to track performance over time, but we believe the privacy benefits are worth it.
- We made our dashboards public because we would like to have our community actively involved in helping us track Firefox performance.
In part 2 I’ll discuss how various people at Mozilla use Telemetry.
Note: click on the images if they get clipped by other content. Cold startups are those where data has to be read in from disk, warm ones are subsequent startups where the OS already has Firefox files in memory.
I’m really surprised by the number of warm startups done by Firefox users. Somewhere between 40% and 60% of startups are warm. On Linux you can see that by watching whether pagefaults occur while loading the firefox binary via the EARLY_GLUE_STARTUP_HARD_FAULTS histogram.
On Windows we do not have a good metric for distinguishing cold startups from warm ones. However, we can look at the distribution of the firstpaint histogram and see that faster startups are more common than slower ones. Only a small minority of machines should be able to cold start a browser in <3 seconds. We have a lot of startups of various degrees of warmness.
I have no explanation for why people restart Firefox so much. We know < 10% of our shutdowns are unclean (most of those appear to be due to OS shutdown not waiting on Firefox, ie us shutting down too slowly) so users aren’t crashing their browser and starting again. They are voluntarily closing the browser and then starting it soon after (ie OS doesn’t get a chance to flush Firefox out of filesystem cache).
These patterns are pretty consistent across all of the Firefox release channels I checked, so I can’t blame warm startups on nightly users getting barraged with upgrade prompts. Can someone come up with a good theory(preferably with some evidence) for this?
Note that telemetry only collects data once a day and requires the browser to be open for a few minutes before submitting data, so the data could be skewed here.
I cleaned up about:telemetry to be a bit less of a hack. It mostly generates charts using proper DOM manipulations now, so should be easier to contribute to.
Now that we have a few hundred histograms in telemetry, it’s become a chore to figure out if something changed. I added a ‘Diff’ button which re-polls telemetry since last telemetry-page-load/diff-press and highlights buckets with new activity in red. This is useful for cases when you see lag in the browser, but have no idea what’s causing it (sort of like about:jank). This turns histogram bars red when stuff changed. Unfortunately one still needs a pretty intimate knowledge of the browser to figure out if any of the histograms are related to observed lag.
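The diff logic boils down to comparing two snapshots of bucket counts. Here is a minimal sketch of that idea; it is not the actual about:telemetry code, the function name and bucket counts are made up, and HYPOTHETICAL_NEW_HISTOGRAM is an invented name standing in for a histogram that first appears after the initial snapshot.

```python
def diff_histograms(before, after):
    """Return {histogram name: [indexes of buckets whose counts grew]}."""
    changed = {}
    for name, new_counts in after.items():
        # A histogram missing from the first snapshot is treated as all zeros.
        old_counts = before.get(name, [0] * len(new_counts))
        grown = [
            i
            for i, (old, new) in enumerate(zip(old_counts, new_counts))
            if new > old
        ]
        if grown:
            changed[name] = grown
    return changed


# Made-up snapshots taken at page load and at the 'Diff' button press.
snapshot_at_page_load = {
    "CYCLE_COLLECTOR_NEED_GC": [7, 1],
    "FORGET_SKIPPABLE_MAX": [5, 2, 0],
}
snapshot_at_diff_press = {
    "CYCLE_COLLECTOR_NEED_GC": [7, 1],
    "FORGET_SKIPPABLE_MAX": [5, 3, 1],
    "HYPOTHETICAL_NEW_HISTOGRAM": [0, 1],
}

print(diff_histograms(snapshot_at_page_load, snapshot_at_diff_press))
```

Histograms with no new samples drop out of the result entirely, which is what makes a few hundred histograms manageable: only the buckets listed in the result would need to be painted red.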
I added a little input box to make finding relevant histograms quicker.
I ask every new Mozilla person to follow planet Mozilla. It’s the easiest way to keep an eye on what is going on in Mozilla. Because it is the single most useful source of mozilla news, it irritates me to no end to see funny pictures of cats mixed in with useful posts at a 9:1 ratio. It’s like the funny-pictures, personal-agenda people are taking the rest of us hostage. I understand that the official planet policy encourages personal noise, but policies can evolve.
There are multiple proposals on how to fix this situation. IMHO the most correct one is to have planet refocus on relevant content and add noise.mozilla.org for everything else. People who like noise can subscribe to that. Polite people with personal blogs already syndicate moz-tech-only parts of the blogs to planet, others have personal blogs. Arguments on marriage, definition of hate speech, far leftyness, funny cats, other memes really drain the will to live when I want to get work done.
I do think we could use a better code of conduct for people deeply involved in the community. Open source is a bit too good with letting assholes get away with being assholes.
I mainly review code written by team perf. You’d think a 6 person team would imply a fairly chill review queue. The team must’ve conspired to burn me out: for a few weeks I was spending half of my day reviewing patches. A few patches took >3 days to review. I think I even let a patch lapse for almost a week (I don’t always have the mental capacity to review linker code). Recently, my review inbox is often empty and recent patches have been <10KB, so I’m not falling behind anymore.
My point is that 24hour reviews are not bounded by time. It’s an aspirational target. If you value time spent writing the patch, please review accordingly. Same goes for people requesting reviews, I’m a lot more likely to give you a quick review if your patches are broken up into reasonably small logical pieces.
The 24hour-review-turn-around dream is still alive. I haven’t been making noise about it because I’m waiting on #metrics to provide us with hard review-latency data. I’m hoping my review turn-around is 24hours, but there is no easy way to find out atm.
Thanks to everyone who adopted 24-hour review goal. I have noticed reviews flying by at a faster pace. One reviewer even let a team-perf patch jump to the front of his review in return for pushing on 24hour reviews.