Categories
Gecko Profiler Performance

Improving the Gecko Profiler

Over the last three months I landed 167 patches, via 41 Bugzilla bugs, for the Gecko Profiler. These included crash fixes, assertion failure fixes, data race fixes, optimization fixes, and a great many refactorings.

Background

The Gecko Profiler is a profiler built into Firefox. It can be used with the Gecko Profiler Addon to profile Firefox. It also provides the core profiling mechanism that is used by Firefox’s devtools to profile JavaScript content code.

It’s a crucial component.  It was originally written 5 years ago in something of a hurry, because we desperately needed a built-in profiler, one that could give more detailed, custom information than is possible with an external profiler. As I understand it, part of it was imported from V8, so it was a mix of previously existing code and new code. And in the years since it has been extended by multiple people, in a variety of ways that didn’t necessarily mesh well.

As a result, at the start of Q1 it was in pretty bad shape. Crashes and assertion failures were frequent, and the code itself was hard to read and maintain. Markus Stange had recently taken over ownership, and had made some great improvements to the Addon, but the core code needed work as well. So I started digging in to see what I could find and improve. There was a lot!

Fixes

Bug 1331571. The profiler had code for incorporating power consumption estimates from the Intel Power Gadget. Unfortunately, this integration had major flaws: Intel Power Gadget only gives very coarse power consumption estimates; the profiler samples at 1000Hz and is CPU intensive and so is likely to skew the power consumption estimates significantly; and nobody had ever used it meaningfully. So I removed it.

Bug 1317771. The profiler had a “standalone” configuration that allowed it to be used in programs other than Firefox. But it was complex (lots of #ifdef statements) and broken and unlikely to be of use. So I removed it.

Bug 1328369, Bug 1328373: Dead code removal.

Bug 1332577. The public API for the profiler was a mess. It was split across several header files, and most API functions had an “outer” version with a profiler_ prefix that immediately called into an inner version with a mozilla_sampler_ prefix (though there were some inconsistencies). So I combined the various header files into a single file, GeckoProfiler.h, and simplified the API functions into a single level, all consistently named with a profiler_ prefix.

Bug 1333296. Even the name of the profiler was a source of confusion. It was originally known as “SPS”, which I believe is short for “simple profiling system”. At some point that changed to “the Gecko Profiler”, although was also occasionally referred to as “the built-in profiler”! Because of this history, the code was littered with references to SPS. In this bug I updated them all to refer to the Gecko Profiler. (I updated the MDN docs, too. The page name still uses “Built-in Profiler” because I don’t know how to change MDN page names.)

Bug 1329684. I removed some mutex wrapper classes that I think were necessary at one point for the “standalone” configuration.

Bug 1328365. Thread-local storage was being used to store a pointer to an object that was only accessed on the main thread, so I changed it to be a global variable. I also renamed some variables whose names referred to a type that had been renamed a long time ago.

Bug 1333655. The profiler had a cross-platform thread abstraction that was clumsy and over-engineered, so I streamlined it.

Bug 1334466. The profiler had a class called Sampler, which I think was imported from V8, and a subclass called GeckoSampler. Both classes were fairly complex, and we only ever instantiated the subclass. The separation merely obscured things, so I merged the subclass into Sampler. Having all that code in a single class and a single module made it much easier to see exactly what it was doing.

Bug 1335595. Two classes, ThreadInfo and ThreadProfile, were used for per-thread data. They were hopelessly intertwined: each one had a pointer to the other, and multiple fields were present (i.e. duplicated) in both of them. So I just merged them.

Bug 1336326. Three minor clean-ups.

Bug 816598. I implemented a memory reporter for the profiler. This was first requested in 2012, and five duplicate bugs had been filed in the interim!

Bug 1126576. I removed some very grotty manual refcounting from the PseudoStack class, which simplified things. A little too much, in fact… I misunderstood how things worked, causing a crash, which I subsequently fixed in bug 1340161.

Bug 1337189. The aforementioned Sampler class was still over-engineered. It only ever had 0 or 1 instantiations, and basically was an unnecessary level of abstraction. In this bug I basically got rid of it by merging it into another file. Which took 27 patches! (One of these patches introduced a regression, which I later fixed in bug 1340327.) At this point a lot of the core code that had previously been spread across multiple files and classes was now in a single file, tools/profiler/core/platform.cpp, and it was becoming increasingly obvious that there was a lot of global state being accessed from multiple threads with insufficient thread synchronization.

Bug 1338957. The profiler tracks which threads are “sleeping” (i.e. blocked on some operation), to support an optimization where it samples sleeping threads in a cheaper-than-normal fashion. It was using two counters and a boolean to track the sleep state of each thread. These three values were accessed from multiple threads; two of them were atomic, and one wasn’t, so the whole setup was very racy. I managed to condense the three values into a single atomic tri-state value, which made things simpler and thread-safe.

Bug 1339327. I landed eight refactoring patches with no particular common theme, mostly involving renaming things and removing unnecessary stuff.

Bug 1339435. I removed two erroneous assertions that I had added in an earlier patch — two functions that I thought only ran on the main thread turned out to run off the main thread as well.

Bug 1339695. The profiler has a lot of code that is specific to a particular architecture (e.g. x86), OS (e.g. Windows), or platform (e.g. x86/Windows). The #ifdef statements used to select these were massively inconsistent — turns out there are many ways to detect this stuff — so I fixed this up. Among other things, this involved using the nice constants in tools/profiler/core/PlatformMacros.h consistently throughout the profiler’s code. (I fixed a regression — caused by mistyping one of the #ifdef conditions, alas! — from this change in bug 1350211. And another one involving --disable-profiling in bug 1348776.) I also renamed some files that had .cc extensions instead of the usual .cpp because they had (I think) been imported from V8.

Bug 1340928. At this point I had started working on a major change to the handling of the profiler’s core global state. It would inevitably be a big patch, but I wanted it to be as small as possible, so I started aggressively carving off small changes that could be landed separately. This bug featured 16 of them.

Bug 1328378. The profiler has two kinds of samples: periodic, which are taken by a separate thread in response to a timer firing, and synchronous, which a thread takes itself in response to a request via the profiler API. There are a lot of similarities between the two, but also some important differences. This bug took some steps to simplify the messy handling of synchronous samples.

Bug 1344118. I mentioned earlier that the profiler tracks which threads are “sleeping” to support an optimization: when a thread is asleep, we can mostly duplicate its last sample without unwinding its stack. But the optimization was buggy and would become a catastrophic pessimization in certain circumstances, due to what should have been a short O(1)-ish buffer search becoming O(n²)-ish, which would quickly peg one CPU at 100% usage. As far as I can tell, this bug was present in the optimization ever since it was implemented three years ago. (It’s possible it wasn’t noticed because its effect increase as more threads are profiled, but the profiler defaults to only profiling the main thread and the compositor thread.) The fix was straightforward once the diagnosis was made, and Julian Seward did a follow-up that made the optimization even more effective.

Bug 1342306. In this bug I put almost all of the profiler’s global state into a single class and protected accesses to it with a mutex. Unlike the old code, the new code is simple and obviously thread-safe. The final patch in this bug was much bigger than I would have liked, at 142 KiB, even after I carved off as many precursor patches as I could. Unsurprisingly, there were some follow-up fixes required: bug 1346356 (a leak and a deadlock), bug 1347044 (another deadlock), bug 1348374 (yet another deadlock), and bug 1350967 (surprise! another deadlock).

Bug 1345262. I fixed an assertion failure caused by the profiler and the JS engine having differing views about what functions should be called on what threads.

Bug 1347348. Five more assorted clean-ups.

Bug 1349856. I fixed a minor error involving a call to the profiler from Gecko.

Bug 1346132. I removed the profiler’s bespoke logging system, replacing it with the standard Mozilla one. I also made the logging output more concise and informative.

Bug 1350212. I cleaned up a class and its uses a bit.

Bug 1351523. I reordered one function’s arguments to match the order used in two related functions.

Bug 1351528. I removed some unused values from an enum.

Bug 1348024. I simplified some environment variables used by the profiler.

Bug 1351946. I removed some gnarly code for starting the profiler on B2G.

Bug 1351136. The profiler’s testing coverage is not very good, as can be seen from the numerous regressions I introduced and fixed. So I added a gtest that improves coverage. There’s still room for more test coverage improvement.

Bug 1351963. I further clarified the handling of synchronous vs. periodic samples, and simplified the ownership of some per-thread data structures.

Discussion

I learned some interesting things while doing this work.

Learning a component

Three months ago I knew almost nothing about the profiler’s code. Today I’m a module peer.

At the start of January I had been told that the profiler needed work, and I had assigned myself a Q1 deliverable to “land three improvements to the Gecko Profiler”. I started just by looking for easy, obvious code clean-ups, such as dead code removal, fixing inconsistent naming of things, and removing unnecessary indirections. (These are the kinds of clean-ups you can make with only shallow understanding.) The profiler proved to be a target-rich environment for such changes!

After I made a few dozen such changes I started understanding more deeply how the pieces fit together. (Partly because I’d been staring at the code a lot, and partly because my changes were making the code easier to understand. Refactorings add up.) I started interleaving my easy clean-up patches with ones that required more insight into how the profiler worked. I made numerous mistakes along the way, as the various regression fixes above show. But that’s ok.

I also kept a text file in which I had a list of ideas for things to fix. Every time I saw something that looked like it could be improved, I added it to the file, and I repeatedly checked the file when deciding what to work on next. As my understanding of the code improved, multiple times I realized that items I had written down were wrong, or incomplete, or that seemingly separate things were related. (In fact, I’m still using the file, because I still have numerous things I want to improve.)

Multi-threaded programming basics

Although I first learned C and C++ about 20 years ago, and I have worked at Mozilla for more than 8 years, this was the first time I’ve ever done serious multi-threaded programming, i.e. at a level where I needed a reasonably deep understanding of how various threads can interact. I got the following two great tips from Julian Seward, which helped a lot.

  • Write down pseudocode for each thread.
  • Write down potential worst-case thread operation interleavings.

I also found it helpful to add comments (or assertions, where appropriate) to the top of functions that indicate which thread or threads they run on. For example:

void profiler_gathered_OOP_profile()
{
  MOZ_RELEASE_ASSERT(NS_IsMainThread());
  ...
}

and:

void profiler_thread_sleep()
{ 
  // This function runs both on and off the main thread.
  ...
}

A useful idiom: proof-of-lock tokens

I also employed a programming idiom that turned out to be extremely helpful.  Most of the global profiler state is in single class called ProfilerState. There is a single instance of this class, gPS, and a single mutex that protects it, gPSMutex. To guarantee that no code is able to access gPS‘s contents without first locking the mutex, for every field in ProfilerState there is a getter and a setter, both of which require a “proof-of-lock” token, which takes the form of a const PS::AutoLock&, where PS::AutoLock is an RAII type that locks and unlocks a mutex.

For example, consider this function, which checks if the profiler is paused.

bool profiler_is_paused()
{
  PS::AutoLock lock(gPSMutex);
  if (!gPS->IsActive(lock)) {
    return false;
  }
  return gPS->IsPaused(lock);
}

The PS::AutoLock locks the mutex. IsActive() and IsPaused() both access fields within gPS, and so they are passed lock, which serves as the proof-of-lock value. IsPaused() and SetIsPaused() are implemented as follow.

bool IsPaused(const PS::AutoLock&) const { return mIsPaused; }
void SetIsPaused(const PS::AutoLock&, bool aIsPaused) { mIsPaused = aIsPaused; }

Neither function actually uses the proof-of-lock token. Nonetheless, any function that calls a ProfilerState getter or setter must either lock gPSMutex, or have an ancestor that does. This idiom has two very nice benefits.

  • You can’t access gPS‘s contents without having first locked gPSMutex. (Well, it is possible to subvert the protection, but not by accident.)
  • It’s obvious that all functions that have a proof-of-lock argument are called only while gPSMutex is locked.

Functions that are called from multiple places sometimes must be split in two: an outer function in which the mutex is initially unlocked, and an inner function that takes a proof-of-lock token. This isn’t hard, though.

Deadlocks vs. data races

After my big change to the profiler’s global state, I had to fix numerous deadlocks. This wasn’t too hard. Deadlocks (caused by too much thread synchronization) are obvious, easy to diagnose, and these ones weren’t hard to fix. It’s useful to contrast them with data races (caused by too little thread synchronization) which typically have subtle effects and are difficult to diagnose.

Patch discipline

For this work I wrote a lot of small patches. This is my preferred way to work, for two reasons. First, small patches make life easier for reviewers, which in turn results in faster reviews. Second, small patches make regression hunting easy. It’s always nice when you bisect a regression to a small patch.

Future work and Thanks

The profiler still has plenty of room for improvement, and I’m planning to do more work on it in Q2. In the meantime, if you’ve tried the profiler in the past and had problems it might be worth trying again. It’s in much better shape now.

Finally, many thanks to Markus Stange for reviewing the majority of the patches and answering lots of questions, and Julian Seward for reviewing most of the remainder and for numerous helpful discussions about threaded programming.

4 replies on “Improving the Gecko Profiler”

Re renaming the MDN page, filing a bug should get you there, or ask on #mdn.

To rename a page, you need to have “mentor” status on MDN, either way should find you someone that has that.

Sadly, I don’t, so I couldn’t just do that. IIRC, I did ages ago :-/

Nice work. It’s amazing how much more effective devs can be at fixing architecture issues, bugs, performance issues, etc. when the meaning and interactions of the different pieces is clearer. Not only does it save everyone time when they’re trying to understand and reason about a bit of code, but it reduces regression risk (and associated time) when making changes too. I’d love to see us develop more of a culture of cleaning up and [usefully] documenting code.

It does work with e10s-multi.

I don’t have answers for your other questions, sorry.

Comments are closed.