MemShrink progress, final

I was due to write a MemShrink progress report today, but I’ve decided that after almost 2.5 years, my reserves of enthusiasm for these regular reports have been exhausted.  Sorry!

I do still plan to write posts when significant fixes relating to memory consumption are made.  (For example, when generational GC lands, you’ll hear about it here.)  I will also continue to periodically update the MemShrink “big ticket items” list.  And MemShrink meetings will continue, so MemShrink-tagged bugs will still be triaged.  And for those of you who read the weekly Platform meeting notes, I will continue to write MemShrink updates there.  So don’t despair — good things will continue to happen, but they’ll just be marginally less visible.

Premature Optimisation

I loved this sentence from Olin Shivers’ description of some Scheme history:

I fashionably decried premature optimisation in college without really understanding it until I once committed an act of premature opt so horrific that I can now tell when it is going to rain by the twinges I get in the residual scar tissue. Now I understand premature optimisation.

I’d love to know exactly what the premature optimisation was.

I also read Olin’s Dissertation Advice about fifty times in 2004.  Great stuff.

Libraries should permit custom allocators

Some C and C++ libraries permit the use of custom allocators, which are registered through some kind of external API.  For example, the following libraries used by Firefox provide this facility.

  • FreeType provides this via the FT_MemoryRec_ argument of the FT_New_Library() function.
  • ICU provides this via the u_setMemoryFunctions() function.
  • SQLite provides this via the sqlite3_config() function.

This gives the users of these libraries additional flexibility that can be very helpful.  For example, in Firefox we provide custom allocators that measure the size of all the live allocations done by the library;  these measurements are shown in about:memory.
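
To make this concrete, here’s a minimal sketch of what the client side can look like with SQLite.  It is not Firefox’s actual code:  it assumes a glibc-style malloc_usable_size(), the gLiveBytes counter is purely illustrative, and a real version would need to be thread-safe.

/* A measuring allocator for SQLite.  Illustrative only; thread safety omitted. */
#include <malloc.h>     /* for malloc_usable_size() (glibc) */
#include <stdlib.h>
#include <sqlite3.h>

static size_t gLiveBytes = 0;   /* bytes currently allocated by SQLite */

static void *xMalloc(int n) {
  void *p = malloc(n);
  if (p) gLiveBytes += malloc_usable_size(p);
  return p;
}

static void xFree(void *p) {
  gLiveBytes -= malloc_usable_size(p);
  free(p);
}

static void *xRealloc(void *p, int n) {
  size_t old = malloc_usable_size(p);
  void *q = realloc(p, n);
  if (q) gLiveBytes += malloc_usable_size(q) - old;
  return q;
}

static int xSize(void *p) { return (int)malloc_usable_size(p); }
static int xRoundup(int n) { return (n + 7) & ~7; }  /* 8-byte granularity, like SQLite's default */
static int xInit(void *unused) { (void)unused; return SQLITE_OK; }
static void xShutdown(void *unused) { (void)unused; }

static const sqlite3_mem_methods gMeasuringAlloc = {
  xMalloc, xFree, xRealloc, xSize, xRoundup, xInit, xShutdown, NULL
};

/* Must be called before SQLite is initialized. */
int RegisterMeasuringAllocator(void) {
  return sqlite3_config(SQLITE_CONFIG_MALLOC, &gMeasuringAlloc);
}

With something like this registered, gLiveBytes can be read at any time, and that is the kind of number that ends up in about:memory.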

In contrast, libraries that don’t allow custom allocators are very hard to account for in about:memory.  Such libraries are major contributors to the dreaded “heap-unclassified” value in about:memory.  These include Cairo and the WebRTC libraries.

Now, supporting custom allocators in a library takes some effort.  You have to be careful to always allocate in a fashion that will use the custom allocators if they have been registered.  Direct calls to vanilla allocation/free functions like malloc(), realloc(), and free() must be avoided.  For example, SpiderMonkey allows custom allocators (although Firefox doesn’t need to use that functionality), and I just fixed a handful of cases where it was accidentally using vanilla allocation/free functions.
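
For library authors, the usual shape of this is to funnel every allocation through internal wrappers that consult the registered functions, falling back to the vanilla ones by default.  A generic sketch (the lib_* names here are hypothetical, not from any particular library):

#include <stdlib.h>

typedef struct {
  void *(*malloc_fn)(size_t);
  void *(*realloc_fn)(void *, size_t);
  void  (*free_fn)(void *);
} lib_allocator;

/* Default to the vanilla functions until a client registers replacements. */
static lib_allocator gAllocator = { malloc, realloc, free };

void lib_set_allocator(const lib_allocator *a) { gAllocator = *a; }

/* All allocation inside the library must go through these wrappers;  a
 * direct malloc()/realloc()/free() call would bypass a registered
 * custom allocator. */
void *lib_malloc(size_t n)           { return gAllocator.malloc_fn(n); }
void *lib_realloc(void *p, size_t n) { return gAllocator.realloc_fn(p, n); }
void  lib_free(void *p)              { gAllocator.free_fn(p); }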

But it’s a very useful facility to provide, and I encourage all library writers to consider it.

MemShrink progress, week 121–124

It’s been a quiet but steady four weeks for MemShrink with 19 bugs fixed, including several leaks.

The only fix that I feel is worth highlighting is bug 918207, in which I added support for fast, coarse-grained measurement of a tab’s memory consumption.  The implemented machinery isn’t currently exposed through the UI, though there are two bugs open that will use it:  a simple one that will implement a command for the developer toolbar, and a more complex one that will implement a constantly-updating memory monitor widget for the devtools pane.

See you next time!

Warning for Firefox devs planning to upgrade to Ubuntu 13.10

I just upgraded from Ubuntu 13.04 to Ubuntu 13.10, and Firefox wouldn’t build with either clang or GCC.

clang was initially failing during configure, complaining about not being able to find joystick.h, though the underlying failure was an inability to find stddef.h.  This Ubuntu bug describes a workaround, which is to do the following.

# Symlink clang's built-in headers (stddef.h and friends) into the directory clang searches.
cd /usr/lib/clang/3.2/
sudo ln -s /usr/lib/llvm-3.2/lib/clang/3.2/include

With that in place, I clobbered and rebuilt;  clang then complained about a problem in allocator.h relating to a name __allocator_base, and GCC complained about insufficient C++11 support.

Both failures had the same underlying cause, which is that both compilers are hardwired to look for some GCC-4.7 headers (which they shouldn’t) as well as GCC-4.8 headers.  I filed a bug with Ubuntu about this.

I worked around the problem just by renaming /usr/include/c++/4.7/ and /usr/include/x86_64-linux-gnu/c++/4.7/.  There may be more elegant workarounds, but that was good enough for me.

How to trigger a child process in desktop Firefox

Firefox is now multi-process, and not just for the plugin-container process.  For example, there is now a separate process that is used to update the thumbnails shown on the new tab page (it’s present but disabled in Firefox 25, and likely to be enabled in Firefox 27).

As a result, sometimes you might want to test something in the presence of multiple processes.  Here’s how I’ve been doing it.

  • Delete the images in the thumbnails/ directory within the profile’s temporary directory.
    • On Linux it’s ~/.cache/mozilla/firefox/<profile>/thumbnails/.
    • On Mac it’s ~/Library/Caches/Firefox/Profiles/<profile>/thumbnails/.
    • On Windows it’s C:\Users\<username>\AppData\Local\Mozilla\Firefox\Profiles\<profile>\thumbnails\.
    • I’m not sure about Android.
  • Open about:newtab.  This triggers a thumbnails process.  It’ll live for about 60 seconds.  (If you’ve configured about:newtab to be blank rather than showing thumbnails, this might not work, though I’m not sure.)

Please let me know if there’s a better way!

(And if anyone can give me extra info on the things I’m not sure about, I’ll update the text above accordingly.  Thanks!)

MemShrink progress, week 117–120

Lots of important MemShrink stuff has happened in the last 27 days:  22 bugs were fixed, and some of them were very important indeed.

Images

Timothy Nikkel fixed bug 847223, which greatly reduces peak memory consumption when loading image-heavy pages.  The combination of this fix and the fix from bug 689623 — which Timothy finished earlier this year and which shipped in Firefox 24 — have completely solved our longstanding memory consumption problems with image-heavy pages!  This was the #1 item on the MemShrink big ticket items list.

To give you an idea of the effect of these two fixes, I did some rough measurements on a page containing thousands of images, which are summarized in the graph below.

[Graph: Improvements in Firefox's Memory Consumption on One Image-heavy Page]

First consider Firefox 23, which had neither fix, and which is represented by the purple line in the graph.  When loading the page, physical memory consumption would jump to about 3 GB, because every image in the page was decoded (a.k.a. decompressed).  That decoded data was retained so long as the page was in the foreground.

Next, consider Firefox 24 (and 25), which had the first fix, and which is represented by the green line on the graph.  When loading the page, physical memory consumption would still jump to almost 3 GB, because the images are still decoded.  But it would soon drop down to a few hundred MB, as the decoded data for non-visible images was discarded, and stay there (with some minor variations) while scrolling around the page. So the scrolling behaviour was much improved, but the memory consumption spike still occurred, which could still cause paging, out-of-memory problems, and the like.

Finally consider Firefox 26 (currently in the Aurora channel), which has both fixes, and which is represented by the red line on the graph.  When loading the page, physical memory jumps to a few hundred MB and stays there.  Furthermore, the loading time for the page dropped from ~5 seconds to ~1 second, because the unnecessary decoding of most of the images is skipped.

These measurements were quite rough, and there was quite a bit of variation, but the magnitude of the improvement is obvious.  And all these memory consumption improvements have occurred without hurting scrolling performance.  This is fantastic work by Timothy, and great news for all Firefox users who visit image-heavy pages.

[Update: Timothy emailed me this:  "Only minor thing is that we still need to turn it on for b2g. We flipped the pref for fennec on central (it's not on aurora though). I've been delayed in testing b2g though, hopefully we can flip the pref on b2g soon. That's the last major thing before declaring it totally solved."]

[Update 2: This has hit Hacker News.]

Nuwa

Cervantes Yu landed Nuwa, which is a low-level optimization of B2G.  Quoting from the big ticket items list (where this was item #3):

Nuwa… aims to give B2G a pre-initialized template process from which every subsequent process will be forked… it greatly increases the ability for B2G processes to share unchanging data.  In one test run, this increased the number of apps that could be run simultaneously from five to nine

Nuwa is currently disabled by default, so that Cervantes can fine-tune it, but I believe it’s intended to ship with B2G version 1.3.  Fingers crossed it makes it!

Memory Reporting

I made some major simplifications to our memory reporting infrastructure, paving the way for future improvements.

First, we used to have two kinds of memory reporters:  uni-reporters (which report a single measurement) and multi-reporters (which report multiple measurements).  Multi-reporters, unsurprisingly, subsume uni-reporters, and so I got rid of uni-reporters, which simplified quite a bit of code.

Second, I removed about:compartments and folded its functionality into about:memory.  I originally created about:compartments at the height of our zombie compartment problem.  But ever since Kyle Huey made it more or less impossible for add-ons to cause zombie compartments, about:compartments has hardly been used.   I was able to fold about:compartments’ data into about:memory, so there’s no functionality loss, and this change simplified quite a bit more code.  If you visit about:compartments now you’ll get a message telling you to visit about:memory.

Third, I removed the smaps (size/rss/pss/swap) memory reporters.  These were only present on Linux, they were of questionable utility, and they complicated about:memory significantly.

Finally, I fixed a leak in about:memory.  Yeah, it was my fault.  Sorry!

Summit

The Mozilla summit is coming up!  In fact, I’m writing this report a day earlier than normal because I will be travelling to Toronto tomorrow.  Please forgive any delayed responses to comments, because I will be travelling for almost 24 hours to get there.

Internet Banking Fail

My bank’s online banking service is generally very good.  Having said that, I got this today.

"Sorry we're unable to retrieve your Interest Statement details right now. Please try again between 7AM-9PM Mon-Fri (AEST/AEDT), excludes public holidays."

Sigh.

Bleg for a new machine: outcome

Recently I blegged (here and here) for help in designing a new machine.  My goals:  fast browser and JS shell builds, quietness, and a setup that wasn’t too complicated.  I now have the new machine and have done some comparisons to the old machine.

New vs old

The most important components of the new machine are:  an Intel i7-4770 CPU (I’m using the integrated graphics), 32 GiB of RAM, a 512 GB Samsung 840 Pro SSD, and a Fractal Design Define R4 case.

In comparison, the equivalent components in the old machine were: an Intel i7-2600 CPU, 16 GiB of RAM, a magnetic hard disk, and an Antec Sonata III 500 case.

A basic comparison

The new machine is definitely faster.  Compile times are about 1.5x faster;  I can do a debug browser build with clang in under 13 minutes, and one with GCC in under 17 minutes.  (I hadn’t realized that clang was so much faster than GCC.)

Furthermore, disk-intensive operations are massively faster.  Just as importantly, disk-intensive operations vary in speed much less.  With a magnetic disk, if you’re doing something where the data is already in the disk cache, it’ll be pretty fast;  otherwise it’ll be horribly slow.  The SSD doesn’t suffer that dichotomy.

Finally, the new case, while not silent, is certainly quieter… maybe half as loud as the old one.  It’s also bigger than I expected — it’s 1–2 inches bigger in every dimension than the old one. There must be a lot of empty space inside.  And although it has a pleasingly minimalist aesthetic — it’s about as plain a black box as you could imagine — it does have an obnoxiously bright, blue power indicator light at the top of the front panel, which I quickly covered with a small strip of black electrical tape.

A detailed performance comparison

Building and testing

All builds are 64-bit debug builds.  I used clang 3.2-1~exp9ubuntu1 and gcc-4.7.real (Ubuntu/Linaro 4.7.3-1ubuntu1) for compilation. I measured each operation only once, and the old machine in particular would vary in its times due to the magnetic disk.  So don’t treat individual measurements as gospel.  In all cases I give the old machine’s time first.

  • Browser clobber build (clang): 19.7 minutes vs 12.7 minutes (1.56x faster).  I didn’t measure a GCC browser build on the old machine, but on the new machine it was 16.8 minutes (1.32x slower than clang).
  • Browser no-change build (clang): 48 seconds vs 31 seconds (1.55x faster).
  • Browser clobber build, with ccache, with an empty cache (clang): 23.3 minutes vs 14.8 minutes (1.57x faster).  These are 1.18x slower and 1.17x slower than the corresponding non-ccache builds.
  • Browser clobber build, with ccache, with a full cache (clang): 6.2 minutes vs 2.6 minutes (2.4x faster).  These are 3.18x faster and 4.89x faster than the corresponding non-ccache builds.  Here the effect of the SSD becomes clear — the new machine gets a much bigger benefit from ccache.
  • Two concurrent browser builds (clang): 45.9 & 45.4 minutes vs 22.5 & 22.5 minutes (2.03x faster).  Interestingly, the amortized per-build time on the old machine (22.9 minutes) was 1.16x slower than a single build, but the amortized per-build time on the new machine (11.3 minutes) was 1.12x faster than a single build.  The new machine, despite having the same number of cores, clearly provides more parallelism, and a single browser build doesn’t take full advantage of that parallelism.
  • JS shell everything-but-ICU build (clang): 59 seconds vs 42 seconds (1.4x faster).  It’s worth noting that JS shell builds spend a higher proportion of their time doing C++ compilation than browser builds.
  • JS shell everything-but-ICU build (GCC): 130 seconds vs 81 seconds (1.60x faster).  These are 2.20x slower and 1.93x slower than the corresponding clang builds!
  • JS jit tests (compiled with clang): 179 seconds vs 137 seconds (1.31x faster).  These tests are much more CPU-bound and less disk-bound than compilation, so the smaller speed up isn’t surprising.
  • SunSpider: 156 ms vs 127 ms (1.23x faster).  Again, CPU is the main factor.

Next, here are the times for some disk-intensive operations.  The results here, especially for the old machine, could be highly variable.

  • Delete a build directory: 10.5 seconds vs 1.4 seconds (7.5x faster).
  • Do a local clone of mozilla-inbound: 7.7 minutes vs 10 seconds (46x faster).
  • Recursive grep of .cpp/.h/.idl files in a repository, first time: 53.2 seconds vs 0.8 seconds (67x faster).
  • The same operation, immediately again: 0.2 seconds vs 0.2 seconds (same speed).

Those last two comparisons really drive home the impact of the SSD, and the reduction in variability it provides. It’s hard to describe how pleasing this is.  On the old machine I always knew when libxul.so was linking, because my whole machine would grind to a halt and trivial operations like saving a file in vim would take multiple seconds.  I don’t have that any more!

And this is relevant to ccache, too.  I tried ccache again recently on my old machine, and while it did speed up compilations somewhat, the extra load on the disk noticeably affected everything else — I had even more of those unpredictable pauses when doing anything other than building.  This was annoying enough that I disabled it.  But ccache should be much more attractive on the new machine.  I will try it again soon, once I’ve had the new machine long enough that I will be well-attuned to its performance.

Conclusion

The CPU is a decent improvement over the old one.  It accounts for roughly half the improvement in build times.

The SSD is a fantastic improvement over the old one.  It too accounts for roughly half the improvement in build times, and it makes disk-intensive operations much faster.  Its performance is also much less variable and thus more predictable.

clang is up to 2x faster than GCC!  This surprised me greatly.  I’d be interested to hear if others have seen such a large difference.

Thanks again to everybody who helped me design the new machine.  It’s been well worth the effort!

MemShrink progress, week 113–116

It’s been a relatively quiet four weeks for MemShrink, with 17 bugs fixed.  (Relatedly, in today’s MemShrink meeting we only had to triage 10 bugs, which is the lowest we’ve had for ages.)  Among the fixed bugs were lots for B2G leaks and leak-like things, many of which are hard to explain, but are important for the phone’s stability.

Fabrice Desré made a couple of notable B2G non-leak fixes.

On desktop, Firefox users who view about:memory may notice that it now sometimes mentions more than one process.  This is due to the thumbnails child process, which generates the thumbnails seen on the new tab page, and which occasionally is spawned and runs briefly in the background.  about:memory copes with this child process ok, but the mechanism it uses is sub-optimal, and I’m planning to rewrite it to be nicer and scale better in the presence of multiple child processes, because that’s a direction we’re heading in.

Finally, some sad news:  Justin Lebar, whose name should be familiar to any regular reader of these MemShrink reports, has left Mozilla.  Justin was a core MemShrink-er from the very beginning, and contributed greatly to the success of the project.  Thanks, Justin, and best of luck in the future!