Categories
Memory consumption MemShrink

MemShrink progress, week 2

Here are some potted highlights from the second week of MemShrink.

  • The beauty of Gregor Wagner’s time-based GC trigger, which I mentioned last week, became even more evident.  It’s been confirmed that nine other bug reports are fixed by this one change.  Great stuff!  See this comment and subsequent comments for an analysis.  Unfortunately, the release drivers decided it wasn’t suitable for back-porting to Firefox 6.  Users who leave their browser open overnight should find Firefox 7 more to their liking, once it’s available.
  • I landed lazy initialization of TraceMonitors and JaegerCompartments, something I wrote about previously.  This gave a 2.89% reduction in the Trace Malloc MaxHeap measurement on CentOS (x86_64) release 5, and probably similar reductions on other platforms.
  • Justin Lebar added hard and soft page fault counts to about:memory.  This makes it clear if Firefox is paging badly.
  • Dão Gottwald fixed a leak in the iQ library used by Panorama.  This bug could cause entire tabs to be leaked in some circumstances.
  • Blake Kaplan avoided the creation of some sandboxes.  This seems to have reduced the number of JavaScript compartments required for techcrunch.com.

There’s less than a week until the development period for Firefox 7 ends.  Hopefully we’ll see some more good MemShrink fixes before that happens.

Categories
Firefox

Firefox 6 will be released in seven weeks

Firefox’s new rapid release cycle has received a lot of attention since Firefox 5 was released.  I won’t add to the discussion of its pros and cons.

But I will note that we’re now on a 6-week release cycle.  That means Firefox 6 should be released in early August, and Firefox 9 should be released before the end of the year.  I point this out because an awful lot of people think we’re on a 3-month release cycle.  I can think of two reasons for this misconception: (a) early discussion of the rapid release cycle said it would be 3 months long, if I remember correctly, and (b) the gap between Firefox 4 and Firefox 5 was 3 months.

This 6-week cycle is something that we need to publicize well, to avoid further charges of poor communication.  I don’t want to read 1000 comments saying “3 months between FF4 and FF5, 6 weeks between FF5 and FF6, is FF7 coming in 3 weeks!?!! ZOMG hasn’t Mozilla heard of Zeno’s Paradox? LOLWUT”.

Update: Mike Hommey has kindly corrected me;  there will be 8 weeks between FF5 and FF6.  There will then be 6 weeks between all subsequent releases.  The 12-week gap between FF4 and FF5 and the 8-week gap between FF5 and FF6 were artifacts of the timeline used to get the rapid release process up to speed.

Categories
MemShrink

MemShrink progress, week 1

We had our first MemShrink meeting a week ago.  Here’s some progress that’s been made since then.  Please note that this is a list of things that caught my eye, and isn’t meant to be exhaustive in any way.

Two big improvements were made to garbage collection trigger heuristics.

  • Gregor Wagner added a time-based trigger to the garbage collector.  Most notably, this should prevent slow build-ups of garbage when the browser is left idle but an open website is doing a small amount of continual work (e.g. updates on a news site).  We’ve had a lot of reports about this problem, so this will hopefully be a big win. Thanks, Gregor!  Note that if the browser is truly idle it won’t trigger, so it shouldn’t cause power problems with Firefox mobile.  See this comment for full details of the new trigger, and the rough sketch just after this list for the general idea.
  • Luke Wagner (no relation to Gregor!) fixed a gap in the garbage collector’s allocation-based triggers — certain string allocations weren’t being counted by the garbage collector, which meant it allowed them to build up excessively.  This fix particularly helps with complex regular expression operations.
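
For readers who want a feel for how a trigger like this can work, here’s a minimal sketch.  Everything in it (the names, the 20-second threshold, the use of an allocation count to detect “truly idle”) is my own illustration, not the actual SpiderMonkey heuristic or code:

 #include <cstdint>

 // Hypothetical sketch of a time-based GC trigger; not SpiderMonkey code.
 class GCScheduler {
   uint64_t mLastGCTimeMs;      // when the last GC finished
   uint64_t mBytesSinceLastGC;  // allocation since the last GC
   static const uint64_t kMaxIntervalMs = 20 * 1000;  // made-up threshold

  public:
   GCScheduler() : mLastGCTimeMs(0), mBytesSinceLastGC(0) {}

   void NoteAllocation(uint64_t aBytes) { mBytesSinceLastGC += aBytes; }

   // Called periodically, e.g. from a timer.
   bool ShouldGC(uint64_t aNowMs) const {
     // If the browser is truly idle (nothing has been allocated), never
     // trigger, so we don't wake up and waste power (e.g. on mobile).
     if (mBytesSinceLastGC == 0)
       return false;
     // A page doing a small amount of continual work may never hit the
     // allocation-based thresholds, so collect once enough time has passed.
     return aNowMs - mLastGCTimeMs > kMaxIntervalMs;
   }

   void NoteGCFinished(uint64_t aNowMs) {
     mLastGCTimeMs = aNowMs;
     mBytesSinceLastGC = 0;
   }
 };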

Two improvements were made to the coverage of about:memory, thus reducing the amount of memory falling into the “heap-unclassified” bucket.

  • I added three new memory reporters: explicit/js/scripts, explicit/js/object-slots, explicit/js/string-chars.  When I open Gmail, these three together account for 11% of the explicit allocations.
  • Kyle Huey added a new memory reporter called xpti-working-set, which measures memory used by the XPCOM typelib system.  This is usually a bit over 1MB of memory.  (There’s a rough sketch of what a reporter provides just after this list.)
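
For readers unfamiliar with how about:memory gets its numbers: each reporter pairs a path (like explicit/js/string-chars) with a function that measures the relevant bytes, and whatever the “explicit” reporters don’t cover shows up as “heap-unclassified”.  The following is purely illustrative; the real interface is nsIMemoryReporter and its details differ:

 #include <cstdint>
 #include <string>

 // Purely illustrative; not the real nsIMemoryReporter interface.
 struct MemoryReporter {
   std::string path;         // e.g. "explicit/js/string-chars"
   std::string description;  // shown in about:memory
   uint64_t (*amount)();     // returns the bytes currently in use
 };

 // about:memory sums the "explicit" reporters and subtracts them from the
 // measured heap size; the remainder is what it calls "heap-unclassified".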

Justin Lebar made progress on measuring page fault counts and using them to respond when memory pressure is high.

Finally, I made progress on reducing unnecessary compartment overhead, which I described previously.

Categories
about:memory JägerMonkey Memory consumption Tracemonkey

You make what you measure

Inspired by Shaver’s patch, I implemented some per-compartment stats in about:memory.  I then visited TechCrunch because I know it stresses the JS engine.  Wow, there are over 70 compartments!  There are 20 stories on the front page.  Every story has a Facebook “like” button, a Facebook “send” button, and a Google “+1” button, and every button gets its own compartment.

That sounds like a bug, but it’s probably not.  Nonetheless, every one of those buttons had an entry in about:memory that looked like this:

[Image: old compartment measurements in about:memory]

(The ‘object-slots’ and ‘scripts’ and ‘string-chars’ measurements are also new, courtesy of bug 571249.)

Ugh, 255,099 bytes for a compartment that has only 971 bytes (i.e. not much) of scripts?  Even worse, this is actually an underestimate because there is another 68KB of tjit-data memory that isn’t being measured for each compartment.  That gives a total of about 323KB per compartment.  And it turns out that no JIT compilation is happening for these compartments, so all that tjit-data and mjit-code space is allocated but unused.

Fortunately, it’s not hard to avoid this wasted space, by lazily initializing each compartment’s TraceMonitor and JaegerCompartment.  With that done, the entry in about:memory looks like this:

[Image: new compartment measurements in about:memory]

That’s an easy memory saving of over 20MB for a single webpage.  The per-compartment memory reporters haven’t landed yet, and may not even land in their current form, but they’ve already demonstrated their worth.  You make what you measure.
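
For the curious, the fix is essentially the standard lazy-initialization pattern.  Here’s a minimal sketch of the idea; the class layout and names are placeholders, and the real TraceMonitor and JaegerCompartment set-up is of course much more involved:

 // Hypothetical sketch of lazy initialization; not the actual SpiderMonkey code.
 struct TraceMonitor { /* ~68KB of tjit-data in the real thing */ };
 struct JaegerCompartment { /* mjit-code arenas in the real thing */ };

 class Compartment {
   TraceMonitor* mTraceMonitor;
   JaegerCompartment* mJaegerCompartment;

  public:
   Compartment() : mTraceMonitor(0), mJaegerCompartment(0) {}
   ~Compartment() { delete mTraceMonitor; delete mJaegerCompartment; }

   // Instead of paying ~323KB per compartment up front, only allocate the
   // JIT data structures when the JITs are actually about to be used.
   TraceMonitor* traceMonitor() {
     if (!mTraceMonitor)
       mTraceMonitor = new TraceMonitor();
     return mTraceMonitor;
   }

   JaegerCompartment* jaegerCompartment() {
     if (!mJaegerCompartment)
       mJaegerCompartment = new JaegerCompartment();
     return mJaegerCompartment;
   }
 };

The 20MB+ saving on TechCrunch then falls out naturally: none of those 70+ button compartments ever invoke the JITs, so they never pay for the JIT data structures at all.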

Categories
Bugzilla Gmail

Gmail and Bugzilla: an update

I recently wrote about Gmail’s threading not handling Bugzilla emails nicely.  Thanks to a patch from Byron Jones, bugzilla.mozilla.org now lets you set an option so that Gmail’s threading will work nicely with Bugzilla.  Just go to “User Preferences”, and in the “General Preferences” tab, enable “Make email subjects compatible with Gmail’s threading.”  Thanks, Byron!

Categories
MemShrink

MemShrink meetings are go: Tuesday 1pm (Pacific time)

Exactly three months ago I wrote about a new project called MemShrink, the aim of which was to reduce Firefox’s memory usage.  Thanks to Johnny Stenback we will now have weekly MemShrink meetings in which bug reports will be triaged and assigned, and anything else relevant (see this wiki page) will be discussed.  These meetings will take place every Tuesday at 1pm, Pacific time;  the first one will be on June 14.

This is great news!  MemShrink was never going to be more than a niche effort without weekly meetings.  To give you an idea of how important I think they are, I’ll be attending them even though it’ll be 6am Wednesday in my timezone.  So come along or dial in.  Johnny will post dial-in instructions to the dev-platform list/newsgroup some time before the meeting.

Categories
about:memory Correctness Firefox SQLite

Asynchronous database connections must be closed with asyncClose()

TL;DR:  if you’re familiar with any Mozilla (C++ or JS) code that opens an async database connection, please go and check that it calls asyncClose() to close the connection;  not doing so can lead to crashes and/or memory leaks.

Specifically, in Firefox 6, when such a connection is destroyed (because it’s explicitly deallocated or its refcount drops to zero in C++, or it’s garbage collected in JS) it’ll currently cause sporadic crashes in about:memory or telemetry code or Test Pilot code.  This is because memory reporters end up pointing to connections that have been freed, and so when they are used they end up (innocently) accessing memory they shouldn’t.

I’ve opened bug 662989 to avoid the crashes, but even once it’s implemented, I think that if you fail to call asyncClose() it will still cause a memory leak, because sqlite3_close() is never called on the connection and so SQLite won’t free up memory used by the connection.  Also, connections that needlessly remain open can have active locks that can starve out other connections or cause extreme growth of the write-ahead log.

As I write this, thanks to Marco Bonardo, I’m aware of three places in JS code that fail to call asyncClose() when they should:

  • browser/components/preferences/aboutPermissions.js.  Bug 654573 covers this, I’m about to land a patch to fix it.
  • toolkit/components/contentprefs/nsContentPrefService.js.  Bug 662986 covers this.
  • browser/app/profile/extensions/testpilot@labs.mozilla.com/modules/experiment_data_store.js.  Bug 662985 covers this.  Note that Test Pilot seems to fail to use asyncClose() when it should, and it also uses memory reporters.  So it’s unclear which of these two things is responsible for the Test Pilot crashes of this sort seen so far;  in other words, Test Pilot may be a culprit or an innocent bystander, or both.

These were found via a crude MXR search.  I haven’t looked for C++ code that makes this mistake.

If you only do synchronous transactions over your database connection, the connection will be cleaned up properly, even if you don’t call close(), when it is deallocated or garbage collected.  However, it’s better to close the connection as soon as you can so that the resources (memory, locks) are freed immediately.
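
For C++ callers, the shape of the correct pattern is roughly the following.  This is a sketch only: error handling is trimmed, the helper name is made up, and I’m assuming the usual storage service contract ID and headers:

 #include "mozIStorageService.h"
 #include "mozIStorageConnection.h"
 #include "nsCOMPtr.h"
 #include "nsDebug.h"
 #include "nsIFile.h"
 #include "nsServiceManagerUtils.h"

 // Hypothetical helper, for illustration only.
 nsresult UseAsyncConnection(nsIFile *aDBFile)
 {
   nsCOMPtr<mozIStorageService> storage =
     do_GetService("@mozilla.org/storage/service;1");
   NS_ENSURE_TRUE(storage, NS_ERROR_FAILURE);

   nsCOMPtr<mozIStorageConnection> conn;
   nsresult rv = storage->OpenDatabase(aDBFile, getter_AddRefs(conn));
   NS_ENSURE_SUCCESS(rv, rv);

   // ... create statements and run them with executeAsync() ...

   // The crucial bit: an asynchronously-used connection must be shut down
   // with asyncClose(), not just released (or close()d).
   return conn->AsyncClose(nsnull);  // the completion callback is optional
 }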

See this MDC page for more details about database connections and the relevant APIs.  It appears that these APIs are somewhat error-prone.  As it happens, an NS_ENSURE_TRUE macro will fail when an async connection is destroyed without having been asyncClose’d first, and this causes a warning message to be printed to stderr.  Unfortunately, stderr is spammed with all sorts of warnings (something that was discussed in the comments of a previous post of mine), and so this warning gets lost among the noise.  Addressing this problem is something I’ll write about shortly.

Thanks to Andrew Sutherland and Marco Bonardo for helping me greatly with the bugs mentioned above.  Any errors in this post are mine!

UPDATE: I did some investigation and found that about:permissions leaked about 180KB of memory on my machine every time it failed to asyncClose a DB connection.   This showed up under the explicit/storage/sqlite/other reporter in about:memory.

Categories
Bugzilla Gmail

Gmail and Bugzilla

I use Gmail.  I like the way it manages conversations, with one exception.  New bugs reported in Bugzilla cause an email with a title like “New: Fix the bug” to be sent.  All subsequent changes to that bug cause an email with a title like “Fix the bug” to be sent.  So that first email doesn’t get included in the same conversation as the rest of the emails because the subject line has the “New: ” at the start.

This is annoying.  Does anyone know how to fix it, either on the Gmail side or the Bugzilla side?

Categories
Correctness Firefox

What is the point of non-fatal assertions?

When you run a debug build of Firefox you get a lot of stuff printed to stderr.  Some of it looks like this:

 ++DOCSHELL 0x7f53d836b800 == 3
 ++DOMWINDOW == 5 (0x7f53d836c078) [serial = 5] [outer = (nil)]

and I imagine it’s occasionally useful.

You also see a lot of warnings like these, which I saw (with duplicates removed) when I started up Firefox, loaded www.mozilla.com, and then exited:

 WARNING: 1 sort operation has occurred for the SQL statement '0x7f53e3d32208'.  See https://developer.mozilla.org/En/Storage/Warnings details.: file /home/njn/moz/mc2/storage/src/mozStoragePrivateHelpers.cpp, line 144
 WARNING: GetDefaultCharsetForLocale: need to add multi locale support: file /home/njn/moz/mc2/intl/locale/src/unix/nsUNIXCharset.cpp, line 146
 WARNING: Ignoring duplicate observer.: file /home/njn/moz/mc2/modules/libpref/src/nsPrefBranch.cpp, line 594
 WARNING: not an nsIRDFRemoteDataSource: 'remote != nsnull', file /home/njn/moz/mc2/rdf/datasource/src/nsLocalStore.cpp, line 312
 WARNING: NS_ENSURE_SUCCESS(rv, 0) failed with result 0x8000FFFF: file /home/njn/moz/mc2/content/base/src/nsContentUtils.cpp, line 2527
 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /home/njn/moz/mc2/content/base/src/nsFrameLoader.cpp, line 415
 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8053000C: file /home/njn/moz/mc2/content/base/src/nsGenericElement.cpp, line 5484
 WARNING: NS_ENSURE_TRUE(aURI) failed: file /home/njn/moz/mc2/content/base/src/nsContentUtils.cpp, line 4706
 WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mc2/storage/src/mozStorageConnection.cpp, line 784
 WARNING: NS_ENSURE_TRUE(pusher.Push(aBoundElement)) failed: file /home/njn/moz/mc2/content/xbl/src/nsXBLProtoImplMethod.cpp, line 327
 WARNING: nsExceptionService ignoring thread destruction after shutdown: file /home/njn/moz/mc2/xpcom/base/nsExceptionService.cpp, line 197
 WARNING: SQLite returned error code 1 , Storage will convert it to NS_ERROR_FAILURE: file /home/njn/moz/mc2/storage/src/mozStoragePrivateHelpers.cpp, line 113
 WARNING: Subdocument container has no content: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2398
 WARNING: Subdocument container has no frame: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2418
 WARNING: Subdocument container has no frame: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2418

The NS_ENSURE_SUCCESS and NS_ENSURE_TRUE ones in particular look like assertion failures.  I’ve heard before that Firefox assertions (outside the JS engine) are non-fatal.  I was wondering about this last week and had planned to ask someone about it in more detail.

But before I did that, I spent several hours yesterday debugging the crash in bug 654573.  Turns out the problem is that this assertion in mozStorageConnection.cpp fails:

 NS_ENSURE_FALSE(mAsyncExecutionThread, NS_ERROR_UNEXPECTED);

This causes this warning to be printed out at run-time:

 WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp, line 799

Oh, and just to make things really complicated, here’s the definition of NS_ENSURE_FALSE, from nsDebug.h:

 #define NS_ENSURE_TRUE(x, ret)                                \
   PR_BEGIN_MACRO                                              \
     if (NS_UNLIKELY(!(x))) {                                  \
        NS_WARNING("NS_ENSURE_TRUE(" #x ") failed");           \
        return ret;                                            \
     }                                                         \
   PR_END_MACRO
 #define NS_ENSURE_FALSE(x, ret)                               \
   NS_ENSURE_TRUE(!(x), ret)

Oh look;  if the condition fails it not only prints out a warning message, it also does an early return.  And that’s exactly what was causing the crash, because some code that followed the assertion wasn’t run, which meant that a database connection wasn’t closed properly, which caused a use-after-free error.

If this assertion had been fatal, I wouldn’t have spent several hours yesterday debugging this crash, because there wouldn’t have been a crash;  there would have been an assertion failure, and the problem would have been immediately obvious.  (As it happens, I eventually worked out the problem when I noticed that warning message among the pages of output that Firefox produces.)  If the assertion had been non-fatal but did not have the early return, there also probably wouldn’t have been a crash, but it’s hard to know, because once an assertion fails you’re pretty much in no-man’s land.
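
For contrast, here’s roughly what a fatal flavour of the same check could look like.  This is a hypothetical macro I’ve made up for illustration; nsDebug.h doesn’t define it:

 #include <cstdio>
 #include <cstdlib>

 // Hypothetical fatal variant, for illustration only; not in nsDebug.h.
 // On failure it aborts at the point of the defect instead of printing a
 // warning and silently returning early.
 #define MY_FATAL_ENSURE_TRUE(x)                                    \
   do {                                                             \
     if (!(x)) {                                                    \
       fprintf(stderr, "Fatal assertion failed: %s, at %s:%d\n",    \
               #x, __FILE__, __LINE__);                             \
       abort();                                                     \
     }                                                              \
   } while (0)

With something like that in place, the bug would have shown up immediately as an assertion failure at the NS_ENSURE_FALSE site, rather than as a use-after-free some time later.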

So, in summary, here are my thoughts about assertions:

  • Non-fatal assertions are close to useless.  They don’t get fixed.  The warnings just get lost among the noise of all the messages that go to stderr.
  • Non-fatal assertions that return early are worse than useless.  Especially since the early-return behaviour is entirely non-obvious from their names.

One suggestion I heard on IRC yesterday was that there just aren’t enough developers in general to fix all the assertion failures if we were to make them fatal.  But in contrast, the JS engine is able to use fatal assertions because it has a higher developer-to-code ratio than the rest of Firefox.  (This makes me also think of compiler warnings, where there’s a similar disparity between the JS engine and the rest of Firefox.)

Because this is Mozilla, I’m sure there is a ton of history behind this current situation that I’m unaware of.  I’d love to know what that is, and why we let obvious defects — because an assertion failure unambiguously indicates there’s a defect, be it in the code or the assertion itself — remain.  Thanks!


Categories
Firefox Memory consumption Performance Tracemonkey

You lose more when slow than you gain when fast

SpiderMonkey is Firefox’s JavaScript engine.  In Firefox 3.0 and earlier versions, it was just an interpreter.  In Firefox 3.5, a tracing JIT called TraceMonkey was added.  TraceMonkey was able to massively speed up certain parts of programs, such as tight loops;  parts of programs it couldn’t speed up continued to be interpreted.  TraceMonkey provided large speed improvements, but Firefox’s overall JavaScript performance still compared poorly with that of Safari and Chrome, both of which used method JITs that had worse best-case performance than TraceMonkey, but much better worst-case performance.

If you look at the numbers, this isn’t so surprising.  If you’re 10x faster than the competition on half the cases, and 10x slower on half the cases, you end up being 5.05x slower overall.  Firefox 4.0 added a method JIT, JaegerMonkey, which avoided those really slow cases, and Firefox’s JavaScript performance is now very competitive with other browsers.
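
To spell out where the 5.05x comes from: take the competition’s time per case as 1, so you take 0.1 on the fast half of the cases and 10 on the slow half.

 your average time  = 0.5 × 0.1  +  0.5 × 10  =  0.05 + 5  =  5.05
 their average time = 1
 so you are 5.05x slower overall, despite being 10x faster half the time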

The take-away message:  you lose more when slow than you gain when fast. Your performance is determined primarily by your slowest operations.  This is true for two reasons.  First, in software you can easily get such large differences in performance: 10x, 100x, 1000x and more aren’t uncommon.  Second, in complex programs like a web browser, overall performance (i.e. what a user feels when browsing day-to-day) is determined by a huge range of different operations, some of which will be relatively fast and some of which will be relatively slow.

Once you realize this, you start to look for the really slow cases. You know, the ones where the browser slows to a crawl and the user starts cursing and clicking wildly and holy crap if this happens again I’m switching to another browser.  Those are the performance effects that most users care about, not whether their browser is 2x slower on some benchmark.  When they say “it’s really fast”, they probably actually mean “it’s never really slow”.

That’s why memory leaks are so bad — because they lead to paging, which utterly destroys performance, probably more than anything else.

It also makes me think that the single most important metric when considering browser performance is page fault counts.  Hmm, I think it’s time to look again at Julian Seward’s VM profiler and the Linux perf tools.