Categories: Firefox, Memory consumption, MemShrink

MemShrink progress, week 3

This was the final week of development for Firefox 7, and lots of exciting MemShrink stuff happened.

Per-compartment memory reporters

I landed per-compartment memory reporters, using the new memory multi-reporter interface.  I’ve written previously about per-compartment reporters;  the number of things measured has increased since then, and each compartment now looks like this:

Output of a per-compartment memory reporter

One nice thing about this feature is that it gives technically-oriented users a way to tell which web sites are causing high memory usage.  This may help with perception, too;  people might think “geez, Facebook is using a lot of memory” instead of “geez, Firefox is using a lot of memory”.

Another nice thing is that it can help find leaks.  If you close all your tabs and some compartments are still around, that’s a leak, right?  Actually, it’s more complicated than that:  compartments are destroyed by the garbage collector, so there can be a delay.  But there are buttons at the bottom of about:memory which force GC to occur, so if you press them and some compartments are still around, that’s a leak, right?  Well, it appears that compartments for sites that use timers can stick around while the timers are still live.  For example, TBPL appears to have a refresh timer that is 2 or 3 minutes long, so its compartment stays alive for that long after the page is closed.  (Actually, it’s not entirely clear if or why that should happen.)  However, once you’ve accounted for these complications, any remaining compartments are likely to have leaked.

Another thing to note is that the JS heap space used by each compartment is now reported, as is the fraction of the JS heap that isn’t currently used by any particular compartment:

Output of non-compartment GC memory reporters

Once this change landed, it didn’t take long to see that the “gc-heap-chunk-unused” number could get very large.  In particular, if you browsed for a while, then closed all tabs except about:memory, then triggered heap minimization (via the buttons at the bottom of about:memory), you’d often see very large “gc-heap-chunk-unused” numbers.  It turns out this is caused by fragmentation.  The JS heap is broken into 1MB chunks, and often you’d end up with a lot of almost-empty chunks;  each had a small number of long-lived objects in it keeping it alive, and thus preventing it from being deallocated.  For example, a chunk containing a single reachable object cannot be freed, even though the rest of its 1MB is empty.

Reduced JavaScript heap fragmentation

Fortunately, Gregor Wagner had already anticipated this, and he tightened his grasp on the 2011 MemShrink MVP award by greatly reducing this fragmentation in the JavaScript heap.  Parts of Firefox itself are implemented in JavaScript, and many objects belonging to Firefox itself (more specifically, objects in the “system principal” and “atoms” compartments) are long-lived.  So Gregor added simple chunk segregation;  objects belonging to Firefox get put in one group of chunks, and all other objects (i.e. those from websites) get put in a second group of chunks.  This made a huge difference.  See this comment and subsequent comments for some detailed measurements of a short browsing session;  in short, the size of the heap was over 5x smaller (21MB vs. 108MB) after closing a number of tabs and forcing garbage collection.  Even if you don’t force garbage collection, it still helps greatly, because garbage collection happens periodically anyway, and longer browsing sessions will benefit more than shorter sessions.
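
The idea, in sketch form (this is a simplified illustration with made-up names, not SpiderMonkey’s actual allocator):

 // Simplified sketch of chunk segregation;  all names are illustrative.
 #include <vector>

 struct Chunk { /* 1MB of GC memory;  details omitted */ };

 struct ChunkLists {
   std::vector<Chunk*> systemChunks;  // "system principal" and "atoms" objects
   std::vector<Chunk*> userChunks;    // objects from web content

   // Long-lived browser objects never share a chunk with short-lived
   // web-content objects, so closing tabs can empty (and free) whole chunks.
   std::vector<Chunk*>& chunksFor(bool isSystemCompartment) {
     return isSystemCompartment ? systemChunks : userChunks;
   }
 };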

This change will help everyday browsing a lot.  It will also help with the perception of Firefox’s memory usage — once you learn about about:memory, an obvious thing to try is to browse for a while, close all tabs, and see what the memory usage looks like.  Prior to this patch, the memory usage was often much higher than it is on start-up.  With this patch, the usage is much closer to the usage seen at start-up.  Ideally, it would be the same, and Justin Lebar opened a bug to track progress on exactly this issue.

There’s still room for reducing fragmentation in the JavaScript heap further.  Kyle Huey has some good ideas in this bug.

Other bits and pieces

  • Chris Pearce finished a major refactoring of media code.  Previously, every media (<audio> and <video>) element in a page required three threads (four on Linux).  And each thread has a stack:  8MB on Linux, 1MB on Windows, and 64KB on Mac.  (Edit: Robert O’Callahan pointed out in the comments that this space is not necessarily committed, which means the stacks aren’t really costing anything other than virtual memory space.)  Webpages can have dozens, even hundreds of media elements, so these stacks can really add up.  Chris changed things so that each media element has a single thread.  This really helps on complicated web sites/apps like The Wilderness Downtown, ROME, and WebGL Quake.  (There is also a bug open to reduce the thread stack size on Linux;  8MB is excessive.)  Chris only just landed these patches on mozilla-inbound;  he deliberately waited until after the FF7 deadline to maximize testing time.  Assuming there are no problems, this improvement will be in FF8.
  • Jesse Ruderman tweaked some of his fuzzing tools to look for leaks and found four new ones.
  • Mounir Lamouri landed some memory reporters for the DOM.  They’re currently disabled while he extends them and makes sure that they’re reporting sensible numbers.
  • I created mlk-fx8, a bug for tracking leaks and quasi-leaks reported against Firefox 8.

Quantifying progress

I’m going to attempt to quantify MemShrink’s progress in these weekly reports.  It’s not obvious what the best way to do this is.  The MemShrink wiki page discusses this a bit.  I’m going to do something crude:  just count the number of bugs annotated with “MemShrink”, and track this each week.  Here are the current counts:

  • P1: 18
  • P2: 44
  • P3: 25
  • Unprioritized: 2

Obviously, this is a flawed way to measure progress;  for example, if we improve our leak-detection efforts, which is a good thing, we’ll get more bug reports.  But hopefully these numbers will trend downwards in the long term.

I’m also wondering if my choice to have three priority levels was a mistake, if only for the reason that people like to pick the middle item, so P2 will always dominate.  If we had four priority levels, we’d be forced to prioritize all those middling bugs up or down.

The minutes of today’s meeting are available here.  Thanks to Jesse for taking them.

Categories: Firefox, Performance, Uncategorized

Faster JavaScript parsing

Over the past year or so I’ve almost doubled the speed of SpiderMonkey’s JavaScript parser.  I did this by speeding up both the scanner (bug 564369, bug 584595, bug 588648, bug 639420, bug 645598) and the parser (bug 637549).  I used patch stacks in several of those bugs, and so in those six bugs I actually landed 28 changesets.

Notable things about scanning JavaScript code

Before I explain the changes I made, it will help to explain a few notable things about scanning JavaScript code.

  • JavaScript is encoded using UCS-2.  This means that each character is 16 bits.
  • There are several character sequences that indicate the end of a line (EOL): ‘\n’, ‘\r’, ‘\r\n’, \u2028 (a.k.a. LINE_SEPARATOR), and \u2029 (a.k.a. PARA_SEPARATOR).  Note that ‘\r\n’ is treated as a single character.
  • JavaScript code is often minified, and the characteristics of minified and non-minified code are quite different.  The most important difference is that minified code has much less whitespace.

Scanning improvements

Before I made any changes, there were two different modes in which the scanner could operate.  In the first mode, the entire character stream to be scanned was already in memory.  In the second, the scanner read the characters from a file in chunks a few thousand chars long.  Firefox always uses the first mode (except in the rare case where the platform doesn’t support mmap or an equivalent function), but the JavaScript shell used the second.  Supporting the second mode made things complicated in two ways.

  • It was possible for an ‘\r\n’ EOL sequence to be split across two chunks, which required some extra checking code.
  • The scanner often needs to unget chars (up to six chars, due to the lookahead required for \uXXXX sequences), and it couldn’t unget chars across a chunk boundary.  This meant that it used a six-char unget buffer.  Every time a char was ungotten, it would be copied into this buffer.  As a consequence, every time it had to get a char, it first had to look in the unget buffer to see if there was one or more chars that had been previously ungotten.  This was an extra check (and a data-dependent and thus unpredictable check).

The first complication was easy to avoid by only reading N-1 chars into the chunk buffer, and only reading the Nth char in the ‘\r\n’ case.  But the second complication was harder to avoid with that design.  Instead, I just got rid of the second mode of operation;  if the JavaScript engine needs to read from a file, it now reads the whole file into memory and then scans it via the first mode.  This can result in more memory being used, but it only affects the shell, not the browser, so it was an acceptable change.  This allowed the unget buffer to be removed completely;  when a character is ungotten now, the scanner just moves back one char in the char buffer being scanned.

Another improvement concerned EOL normalization, which used to be an explicit separate step.  As each char was gotten from the memory buffer, the scanner would check if it was an EOL sequence;  if so, it would change it to ‘\n’, and if not, it would leave it unchanged.  Then it would copy this normalized char into another buffer, and scanning would proceed from this buffer.  (To make things worse, the way this copying worked was strange and confusing.)  I changed it so that getChar() does the normalization without requiring the copy into the second buffer.
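
Here’s the resulting shape in sketch form, assuming the whole source is in memory (illustrative names;  the real getChar() also updates the line number, and ungetting across an EOL needs matching care):

 // Sketch only:  illustrative names, line-number bookkeeping omitted.
 #include <stdint.h>

 typedef uint16_t jschar;   // a 16-bit UCS-2 code unit

 struct TokenBuf {
   const jschar* ptr;       // next char to return
   const jschar* limit;     // one past the end of the in-memory source

   // Normalize EOLs to '\n' on the fly;  no second, normalized buffer.
   int32_t getChar() {
     if (ptr == limit)
       return -1;                       // EOF
     jschar c = *ptr++;
     if (c == '\r') {
       if (ptr < limit && *ptr == '\n')
         ptr++;                         // '\r\n' is a single EOL
       return '\n';
     }
     if (c == 0x2028 || c == 0x2029)    // LINE_SEPARATOR, PARA_SEPARATOR
       return '\n';
     return c;
   }

   // No unget buffer:  ungetting is just stepping back.  (Real code must
   // step back two units for '\r\n' and undo the line-number update.)
   void ungetChar() { ptr--; }
 };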

The scanner has to detect EOL sequences in order to know which line it is on.  At first glance, this requires checking every char to see if it’s an EOL, and the scanner uses a small look-up table to make this fast.  However, it turns out that you don’t have to check every char.  For example, once you know that you’re scanning an identifier, you know that if you hit an EOL sequence you’ll immediately unget it, because that marks the end of the identifier.  And when you unget that char you’ll undo the line update that you did when you hit the EOL.  This same logic applies in other situations (eg. parsing a number).  So I added a function getCharIgnoreEOL() that doesn’t do the EOL check.  It has to always be paired with ungetCharIgnoreEOL() and requires some care as to where it’s used, but it avoids the EOL check on more than half the scanned chars.
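
Extending the TokenBuf sketch above, the IgnoreEOL pair might look like this:

 // Additions to the TokenBuf sketch above.  These skip the EOL check, so
 // they must only be used where a hit EOL is immediately ungotten.
 int32_t getCharIgnoreEOL() {
   return (ptr == limit) ? -1 : *ptr++;
 }
 void ungetCharIgnoreEOL() { ptr--; }

 // Typical use, e.g. scanning the tail of an identifier:
 //   int32_t c;
 //   while (IsIdentPart(c = getCharIgnoreEOL())) { /* accumulate */ }
 //   ungetCharIgnoreEOL();   // push back the terminator, EOL or not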

As well as detecting where each token starts and ends, for a lot of token kinds the scanner has to compute a value.  For example, after scanning the character sequence “123” it has to convert that to the number 123.  The old scanner would copy the chars into a temporary buffer before calling the function that did the conversion.  This was unnecessary — the conversion function didn’t even require NULL-terminated strings, because it gets passed the length of the string being converted!  Also, the old scanner was using js_strtod() to do the number conversion.  js_strtod() can convert both integers and fractional numbers, but it’s quite slow and overkill for integers.  And when scanning, even before converting the string to a number, we know if the number we just scanned was an integer or not (by remembering if we saw a ‘.’ or exponent).  So now the scanner instead calls GetPrefixInteger(), which is much faster.  Several of the tests in Kraken involve huge arrays of integers, and this made a big difference to them.
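
The integer fast path is essentially a simple accumulate loop over the digits, something like this sketch of the idea (GetPrefixInteger’s real signature differs):

 // Sketch of the integer fast path:  no NULL terminator is needed because
 // the scanner already knows where the digit sequence starts and ends.
 #include <stdint.h>

 static double DecimalDigitsToNumber(const uint16_t* start, const uint16_t* end)
 {
   double d = 0;
   for (const uint16_t* p = start; p < end; p++)
     d = d * 10 + (*p - '0');
   return d;
 }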

There’s a similar story with identifiers, but with an added complication.  Identifiers can contain \uXXXX chars, and these need to be normalized before we do more with the string inside SpiderMonkey.  So the scanner now remembers whether a \uXXXX char has occurred in an identifier.  If not, it can work directly (temporarily) with the copy of the string inside the char buffer.  Otherwise, the scanner will rescan the identifier, normalizing and copying it into a new buffer.  I.e. the scanner de-optimizes the (very) rare case in order to avoid the copying in the common case.
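
In sketch form (hypothetical helper names):

 // Sketch:  only the rare escaped case pays for rescanning and copying.
 #include <stddef.h>
 #include <stdint.h>
 typedef uint16_t jschar;

 void AtomizeChars(const jschar* chars, size_t length);     // interning;  omitted
 const jschar* RescanAndNormalize(const jschar* start, size_t length,
                                  size_t* newLength);       // decodes \uXXXX

 void FinishIdentifier(const jschar* start, size_t length, bool sawEscape)
 {
   if (!sawEscape) {
     AtomizeChars(start, length);           // common case:  chars used in place
   } else {
     size_t n;
     const jschar* buf = RescanAndNormalize(start, length, &n);
     AtomizeChars(buf, n);                  // rare case:  normalized copy
   }
 }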

JavaScript supports decimal, hexadecimal and octal numbers.  The number-scanning code handled all three kinds in the same loop, which meant that it checked the radix every time it scanned another number char.  So I split this into three parts, which made the code both faster and easier to read.
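
For example, the hexadecimal loop can then be as tight as this (a sketch;  the decimal and octal loops look the same with different digit tests and multipliers):

 // Sketch:  the radix is implicit in the loop, not tested per char.
 #include <stdint.h>

 static double ScanHexDigits(const uint16_t*& p, const uint16_t* end)
 {
   double d = 0;
   for (; p < end; p++) {
     uint16_t c = *p;
     int digit;
     if (c >= '0' && c <= '9')      digit = c - '0';
     else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
     else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
     else break;                    // not a hex digit:  end of the number
     d = d * 16 + digit;
   }
   return d;
 }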

Although JavaScript chars are 16 bits, the vast majority of chars you see are in the first 128 values.  This is true even for code written in non-Latin scripts, because of all the keywords (e.g. ‘var’) and operators (e.g. ‘+’) and punctuation (e.g. ‘;’).  So it’s worth optimizing for those.  The main scanning loop (in getTokenInternal()) now first checks every char to see if its value is 128 or greater.  If so, it handles it in a side-path (the only legitimate such chars are whitespace, EOL or identifier chars, so that side-path is quite small).  The rest of getTokenInternal() can then assume that it’s a sub-128 char.  This meant I could be quite aggressive with look-up tables, because having lots of 128-entry look-up tables is fine, but having lots of 65,536-entry look-up tables would not be.  One particularly important look-up table is called firstCharKinds;  it tells you what kind of token you will have based on the first non-whitespace char in it.  For example, if the first char is a letter, it will be an identifier or keyword;  if the first char is a ‘0’ it will be a number;  and so on.

Another important look-up table is called oneCharTokens.  There are a handful of tokens that are one-char long, cannot form a valid prefix of another token, and don’t require any additional special handling:  ;,?[]{}().  These account for 35–45% of all tokens seen in real code!  The scanner can detect them immediately and use another look-up table to convert the token char to the internal token kind without any further tests.  After that, the rough order of frequency for different token kinds is as follows:  identifiers/keywords, ‘.’, ‘=’, strings, decimal numbers, ‘:’, ‘+’, hex/octal numbers, and then everything else.  The scanner now looks for these token kinds in that order.
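
Put together, the front of the scanning loop looks roughly like this (a sketch with made-up token names, not the real getTokenInternal()):

 // Sketch:  128-entry tables suffice because chars >= 128 take a side path.
 #include <stdint.h>

 enum TokenKind { TOK_NONE = 0, TOK_SEMI, TOK_COMMA, TOK_HOOK, TOK_LB, TOK_RB,
                  TOK_LC, TOK_RC, TOK_LP, TOK_RP /* ... */ };

 TokenKind handleNonAsciiChar(uint16_t c);  // whitespace, EOL or identifier char

 static TokenKind oneCharTokens[128];  // filled at startup:  ';' -> TOK_SEMI, ...

 TokenKind getTokenSketch(uint16_t c)
 {
   if (c >= 128)
     return handleNonAsciiChar(c);     // small side path
   TokenKind k = oneCharTokens[c];
   if (k != TOK_NONE)
     return k;                         // 35-45% of tokens:  done, no more tests
   // ... then, in rough frequency order:  identifiers/keywords, '.', '=',
   // strings, decimal numbers, ':', '+', hex/octal numbers, everything else.
   return TOK_NONE;
 }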

That’s just a few of the improvements;  there were lots of other little clean-ups.  While writing this post I looked at the old scanning code, as it was before I started changing it.  It was awful;  it’s really hard to see what was happening.  getChar() was 150 lines long, because it included code for reading the next chunk from file (if necessary) and also normalizing EOLs.

In comparison, as well as being much faster, the new code is much easier to read, and much more DFA-like.  It’s worth taking a look at getTokenInternal() in jsscan.cpp.

Parsing improvements

The parsing improvements were all related to the parsing of expressions.  When the parser parses an expression like “3” it needs to look for any following operators, such as “+”.  And there are roughly a dozen levels of operator precedence.  The way the parser did this was to get the next token, check if it matched any of the operators of a particular precedence, and then unget the token if it didn’t match.  It would then repeat these steps for the next precedence level, and so on.  So if there was no operator after the “3”, the parser would have gotten and ungotten the next token a dozen times!  Ungetting and regetting tokens is fast, because there’s a buffer used (i.e. you don’t rescan the token char by char), but it was still a bottleneck.  I changed it so that the sub-expression parsers were expected to parse one token past the end of the expression, instead of zero tokens past the end.  This meant that the repeated getting/ungetting could be avoided.
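
Under the new convention, each sub-expression parser returns with the following token already scanned for its caller, roughly like this (an illustrative sketch, not the actual parser code):

 // Sketch:  `tok` always holds the first token *after* the parsed expression,
 // so each precedence level tests it directly instead of getting/ungetting.
 struct Node;
 struct TokenStream;
 enum TokKind { TOK_PLUS, TOK_MINUS, TOK_OTHER /* ... */ };
 struct Token { TokKind kind; };

 Node* MulExpr(TokenStream& ts, Token& tok);   // same one-token-ahead convention
 Node* NewBinaryNode(TokKind op, Node* left, Node* right);

 Node* AddExpr(TokenStream& ts, Token& tok)
 {
   Node* left = MulExpr(ts, tok);              // returns with `tok` one past it
   while (tok.kind == TOK_PLUS || tok.kind == TOK_MINUS) {
     TokKind op = tok.kind;                    // the operator is already consumed
     Node* right = MulExpr(ts, tok);
     left = NewBinaryNode(op, left, right);
   }
   return left;                                // no ungets anywhere
 }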

These operator parsers are also very small.  I inlined them more aggressively, which also helped quite a bit.

Results

I had some timing results but now I can’t find them.  But I know that the overall speed-up from my changes was about 1.8x on Chris Leary’s parsemark suite, which takes code from lots of real sites, and the speed-up doesn’t vary that much between different codebases.

Many real websites, e.g. gmail, have megabytes of JS code, and this speed-up will probably save one or two tenths of a second when they load.  Not something you’d notice, but certainly something that’ll add up over time and help make the browser feel snappier.

Tools

I used Cachegrind to drive most of these changes.  It has two features that were crucial.

First, it does event-based profiling, i.e. it counts instructions, memory accesses, etc., rather than measuring time.  When making a lot of very small improvements, noise often swamps the effects of the improvements, so being able to see that instruction counts are going down by 0.2% here, 0.3% there, is very helpful.

Second, it gives counts of these events for individual lines of source code.  This was particularly important for getTokenInternal(), which is the main scanning function and has around 700 lines;  function-level stats wouldn’t have been enough.
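
For reference, a typical invocation looks something like this (the shell binary and file names here are placeholders, not the exact commands I ran):

 valgrind --tool=cachegrind ./js parsemark.js    # counts events;  writes cachegrind.out.<pid>
 cg_annotate cachegrind.out.<pid> jsscan.cpp     # per-line event counts (build with -g)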

Categories: Firefox

Firefox 6 will be released in seven weeks

Firefox’s new rapid release cycle has received a lot of attention since Firefox 5 was released.  I won’t add to the discussion of its pros and cons.

But I will note that we’re now on a 6-week release cycle.  That means Firefox 6 should be released in early August, and Firefox 9 should be released before the end of the year.  I point this out because an awful lot of people think we’re on a 3-month release cycle.  I can think of two reasons for this misconception: (a) early discussion of the rapid release cycle said it would be 3 months long, if I remember correctly, and (b) the gap between Firefox 4 and Firefox 5 was 3 months.

This 6-week cycle is something that we need to publicize well, to avoid further charges of poor communication.  I don’t want to read 1000 comments saying “3 months between FF4 and FF5, 6 weeks between FF5 and FF6, is FF7 coming in 3 weeks!?!! ZOMG hasn’t Mozilla heard of Zeno’s Paradox? LOLWUT”.

Update: Mike Hommey has kindly corrected me;  there will be 8 weeks between FF5 and FF6.  There will then be 6 weeks between all subsequent releases.  The 12 week gap between FF4 and FF5 and the 8-week gap between FF5 and FF6 were artifacts of the timeline used to get the rapid release process up to speed.

Categories: about:memory, Correctness, Firefox, SQLite

Asynchronous database connections must be closed with asyncClose()

TL;DR:  if you’re familiar with any Mozilla (C++ or JS) code that opens an async database connection, please go and check that it calls asyncClose() to close the connection;  not doing so can lead to crashes and/or memory leaks.

Specifically, in Firefox 6, when such a connection is destroyed (because it’s explicitly deallocated or its refcount drops to zero in C++, or it’s garbage collected in JS) it’ll currently cause sporadic crashes in about:memory or telemetry code or Test Pilot code.  This is because memory reporters end up pointing to connections that have been freed, and so when they are used they end up (innocently) accessing memory they shouldn’t.

I’ve opened bug 662989 to avoid the crashes, but even once it’s implemented, I think that if you fail to call asyncClose() it will still cause a memory leak, because sqlite3_close() is never called on the connection and so SQLite won’t free up memory used by the connection.  Also, connections that needlessly remain open can have active locks that can starve out other connections or cause extreme growth of the write-ahead-log.

As I write this, thanks to Marco Bonardo, I’m aware of three places in JS code that fail to call asyncClose() when they should:

  • browser/components/preferences/aboutPermissions.js.  Bug 654573 covers this;  I’m about to land a patch to fix it.
  • toolkit/components/contentprefs/nsContentPrefService.js.  Bug 662986 covers this.
  • browser/app/profile/extensions/testpilot@labs.mozilla.com/modules/experiment_data_store.js.  Bug 662985 covers this.  Note that Test Pilot seems to fail to use asyncClose() when it should, and it also uses memory reporters.  So it’s unclear which of these two things is responsible for the Test Pilot crashes of this sort seen so far;  in other words, Test Pilot may be a culprit or an innocent bystander, or both.

These were found via a crude MXR search.  I haven’t looked for C++ code that makes this mistake.

If you only do synchronous transactions over your database connection, the connection will be cleaned up properly when it is deallocated or garbage collected, even if you don’t call close().  However, it’s better to close the connection as soon as you can, so that its resources (memory, locks) are freed immediately.
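
In C++, the pattern looks roughly like this (a minimal sketch:  `dbFile` is assumed to be an nsIFile pointing at the database, and error handling and the actual statement work are omitted):

 // Sketch of the correct shutdown pattern for an async connection.
 nsCOMPtr<mozIStorageService> storage =
   do_GetService("@mozilla.org/storage/service;1");
 nsCOMPtr<mozIStorageConnection> conn;
 nsresult rv = storage->OpenDatabase(dbFile, getter_AddRefs(conn));
 NS_ENSURE_SUCCESS(rv, rv);

 // ... create statements with conn->CreateAsyncStatement(...) and run
 // them with ExecuteAsync(...) ...

 // The crucial part:  close asynchronously when done.  Letting `conn`
 // simply be destroyed skips sqlite3_close(), so SQLite never frees the
 // memory used by the connection.
 conn->AsyncClose(nsnull);  // the completion callback is optional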

See this MDC page for more details about database connections and the relevant APIs.  It appears that these APIs are somewhat error-prone.  As it happens, an NS_ENSURE_TRUE macro will fail when an async connection is destroyed without having been asyncClose’d first, and this causes a warning message to be printed to stderr.  Unfortunately, stderr is spammed with all sorts of warnings (something that was discussed in the comments of a previous post of mine), and so this warning gets lost among the noise.  Addressing this problem is something I’ll write about shortly.

Thanks to Andrew Sutherland and Marco Bonardo for helping me greatly with the bugs mentioned above.  Any errors in this post are mine!

UPDATE: I did some investigation and found that about:permissions leaked about 180KB of memory on my machine every time it failed to asyncClose a DB connection.   This showed up under the explicit/storage/sqlite/other reporter in about:memory.

Categories: Correctness, Firefox

What is the point of non-fatal assertions?

When you run a debug build of Firefox you get a lot of stuff printed to stderr.  Some of it looks like this:

 ++DOCSHELL 0x7f53d836b800 == 3
 ++DOMWINDOW == 5 (0x7f53d836c078) [serial = 5] [outer = (nil)]

and I imagine it’s occasionally useful.

You also see a lot of warnings like these, which I saw (with duplicates removed) when I started up Firefox, loaded www.mozilla.com, and then exited:

 WARNING: 1 sort operation has occurred for the SQL statement '0x7f53e3d32208'.  See https://developer.mozilla.org/En/Storage/Warnings details.: file /home/njn/moz/mc2/storage/src/mozStoragePrivateHelpers.cpp, line 144
 WARNING: GetDefaultCharsetForLocale: need to add multi locale support: file /home/njn/moz/mc2/intl/locale/src/unix/nsUNIXCharset.cpp, line 146
 WARNING: Ignoring duplicate observer.: file /home/njn/moz/mc2/modules/libpref/src/nsPrefBranch.cpp, line 594
 WARNING: not an nsIRDFRemoteDataSource: 'remote != nsnull', file /home/njn/moz/mc2/rdf/datasource/src/nsLocalStore.cpp, line 312
 WARNING: NS_ENSURE_SUCCESS(rv, 0) failed with result 0x8000FFFF: file /home/njn/moz/mc2/content/base/src/nsContentUtils.cpp, line 2527
 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /home/njn/moz/mc2/content/base/src/nsFrameLoader.cpp, line 415
 WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8053000C: file /home/njn/moz/mc2/content/base/src/nsGenericElement.cpp, line 5484
 WARNING: NS_ENSURE_TRUE(aURI) failed: file /home/njn/moz/mc2/content/base/src/nsContentUtils.cpp, line 4706
 WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mc2/storage/src/mozStorageConnection.cpp, line 784
 WARNING: NS_ENSURE_TRUE(pusher.Push(aBoundElement)) failed: file /home/njn/moz/mc2/content/xbl/src/nsXBLProtoImplMethod.cpp, line 327
 WARNING: nsExceptionService ignoring thread destruction after shutdown: file /home/njn/moz/mc2/xpcom/base/nsExceptionService.cpp, line 197
 WARNING: SQLite returned error code 1 , Storage will convert it to NS_ERROR_FAILURE: file /home/njn/moz/mc2/storage/src/mozStoragePrivateHelpers.cpp, line 113
 WARNING: Subdocument container has no content: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2398
 WARNING: Subdocument container has no frame: file /home/njn/moz/mc2/layout/base/nsDocumentViewer.cpp, line 2418

The NS_ENSURE_SUCCESS and NS_ENSURE_TRUE ones in particular look like assertion failures.  I’ve heard before that Firefox assertions (outside the JS engine) are non-fatal.  I was wondering about this last week and had planned to ask someone about it in more detail.

But before I did that, I spent several hours yesterday debugging the crash in bug 654573.  Turns out the problem is that this assertion in mozStorageConnection.cpp fails:

 NS_ENSURE_FALSE(mAsyncExecutionThread, NS_ERROR_UNEXPECTED);

This causes this warning to be printed out at run-time:

 WARNING: NS_ENSURE_TRUE(!(mAsyncExecutionThread)) failed: file /home/njn/moz/mc1/storage/src/mozStorageConnection.cpp, line 799

Oh, and just to make things really complicated, here’s the definition of NS_ENSURE_FALSE, from nsDebug.h:

 #define NS_ENSURE_TRUE(x, ret)                                \
   PR_BEGIN_MACRO                                              \
     if (NS_UNLIKELY(!(x))) {                                  \
        NS_WARNING("NS_ENSURE_TRUE(" #x ") failed");           \
        return ret;                                            \
     }                                                         \
   PR_END_MACRO
 #define NS_ENSURE_FALSE(x, ret)                               \
   NS_ENSURE_TRUE(!(x), ret)

Oh look;  if the condition fails it not only prints out a warning message, it also does an early return.  And that’s exactly what was causing the crash, because some code that followed the assertion wasn’t run, which meant that a database connection wasn’t closed properly, which caused a use-after-free error.

If this assertion had been fatal, I wouldn’t have spent several hours yesterday debugging this crash, because there wouldn’t have been a crash;  there would have been an assertion failure, and the problem would have been immediately obvious.  (As it happens, I eventually worked out the problem when I noticed that warning message among the pages of output that Firefox produces.)  If the assertion had been non-fatal but did not have the early return, there also probably wouldn’t have been a crash, but it’s hard to know, because once an assertion fails you’re pretty much in no-man’s land.

So, in summary, here are my thoughts about assertions:

  • Non-fatal assertions are close to useless.  They don’t get fixed.  The warnings just get lost among the noise of all the messages that go to stderr.
  • Non-fatal assertions that return early are worse than useless.  Especially since the early-return behaviour is entirely non-obvious from their names.

One suggestion I heard on IRC yesterday was that there just aren’t enough developers in general to fix all the assertion failures if we were to make them fatal.  But in contrast, the JS engine is able to use fatal assertions because it has a higher developer-to-code ratio than the rest of Firefox.  (This makes me also think of compiler warnings, where there’s a similar disparity between the JS engine and the rest of Firefox.)

Because this is Mozilla, I’m sure there is a ton of history behind this current situation that I’m unaware of.  I’d love to know what that is, and why we let obvious defects — because an assertion failure unambiguously indicates there’s a defect, be it in the code or the assertion itself — remain.  Thanks!


Categories: Firefox, Memory consumption, Performance, Tracemonkey

You lose more when slow than you gain when fast

SpiderMonkey is Firefox’s JavaScript engine.  In Firefox 3.0 and earlier versions, it was just an interpreter.  In Firefox 3.5, a tracing JIT called TraceMonkey was added.  TraceMonkey was able to massively speed up certain parts of programs, such as tight loops;  parts of programs it couldn’t speed up continued to be interpreted.  TraceMonkey provided large speed improvements, but JavaScript performance overall still didn’t compare that well against that of Safari and Chrome, both of which used method JITs that had worse best-case performance than TraceMonkey, but much better worst-case performance.

If you look at the numbers, this isn’t so surprising.  If you’re 10x faster than the competition on half the cases, and 10x slower on the other half, you end up being 5.05x slower overall:  taking the competition’s per-case time as 1, your average time per case is (0.1 + 10) / 2 = 5.05.  Firefox 4.0 added a method JIT, JaegerMonkey, which avoided those really slow cases, and Firefox’s JavaScript performance is now very competitive with other browsers.

The take-away message:  you lose more when slow than you gain when fast. Your performance is determined primarily by your slowest operations.  This is true for two reasons.  First, in software you can easily get such large differences in performance: 10x, 100x, 1000x and more aren’t uncommon.  Second, in complex programs like a web browser, overall performance (i.e. what a user feels when browsing day-to-day) is determined by a huge range of different operations, some of which will be relatively fast and some of which will be relatively slow.

Once you realize this, you start to look for the really slow cases.  You know, the ones where the browser slows to a crawl and the user starts cursing and clicking wildly and holy crap if this happens again I’m switching to another browser.  Those are the performance effects that most users care about, not whether their browser is 2x slower on some benchmark.  When they say “it’s really fast”, they probably actually mean “it’s never really slow”.

That’s why memory leaks are so bad — because they lead to paging, which utterly destroys performance, probably more than anything else.

It also makes me think that the single most important metric when considering browser performance is page fault counts.  Hmm, I think it’s time to look again at Julian Seward’s VM profiler and the Linux perf tools.


Categories: Firefox, Memory consumption

Leak reports mini-triage, May 30, 2011

I just created bug 660577 which consolidated five bug reports, all of which were complaining about Firefox 4 having high memory usage and/or OOM aborts on image-heavy pages.  This is a clear regression from Firefox 3.6, and appears to have two likely causes:

  • The introduction of infallible new/new[] means Firefox 4 sometimes aborts where Firefox 3.6 would try to recover.  (Kyle Huey already opened bug 660580 to fix this.  Thanks, Kyle!)
  • image.mem.min_discard_timeout_ms was increased from 10,000 (10 seconds) to 120,000 (120 seconds).  This means that Firefox holds on to some image data (I don’t understand the exact details) for longer.

Input from people who know the details of this stuff would be most welcome!  Thanks.

Categories: Bugzilla, Firefox, Memory consumption, MemShrink

The new leak tracking bugs are live

Yesterday I proposed a new way of tracking leak reports.  It’s now up and running.  Two old tracking bugs have been decommissioned: bug 632234 (which was already resolved) and bug 640452.  Five new bugs have been created:

  • Bug 659855 – (mlk-fx4-beta) [meta] Leaks and quasi-leaks reported against Firefox 4 betas
  • Bug 659856 – (mlk-fx4) [meta] Leaks and quasi-leaks reported against Firefox 4
  • Bug 659857 – (mlk-fx5) [meta] Leaks and quasi-leaks reported against Firefox 5
  • Bug 659858 – (mlk-fx6) [meta] Leaks and quasi-leaks reported against Firefox 6
  • Bug 659860 – (mlk-fx7) [meta] Leaks and quasi-leaks reported against Firefox 7

Please CC yourself if you are interested.  Apologies for any bugspam you received as a result of these changes.  Hopefully this new tracking system will work well.


Categories: Bugzilla, Firefox, Memory consumption, MemShrink

A new way of tracking leak reports

We get lots of leak reports from users.  There is a spectrum of quality.

  • Some are hopelessly vague and will never lead to anything useful. (“After browsing for several hours, Firefox is using 100s of MBs of memory.  This is unacceptable;  please fix.”)  Bug 643177 is an example.
  • Some are very precise.  This makes them easy to reproduce, likely to be fixed quickly, and easy to re-confirm if other leaks are fixed in the interim.  (“I managed to reduce the problem down to the attached 10 line HTML file, it causes my machine to run out of memory within 10 seconds of loading.”) Bug 654106 is a good example.
  • Most are somewhere between these two extremes.

Because many of the reports aren’t great, it can be hard to tell if the problem is still present some time later. A single leak may be reported N times, then fixed, and N-1 reports stay open.  In short, leak reports get stale.  (This is true of many bug reports, but I think leak reports are more prone to staleness than most.)

How bugs are currently tracked

There is a keyword, ‘mlk’, which is added to almost all leak reports.  There are over 600 open bugs with that keyword, going back over 10 years.  So it’s not much use.

In the lead-up to Firefox 4, I used bug 632234 (which I’ll henceforth call “mlk-fx4-old”) to track potentially blocking leaks.  It worked well.

After that, I created bug 640452 (which I’ll henceforth call “mlk-fx5+”), with which I’ve been tracking leaks in the lead-up to Firefox 5 and later versions.  I carried over unresolved bugs from mlk-fx4-old.  mlk-fx5+ is starting to feel stale.  Basically, I can see it suffering the same problems as the ‘mlk’ keyword before too long.

So I’m thinking about changing how these are tracked.  The basic idea is to keep using the ‘mlk’ keyword for all leak reports, and then have one leak-tracking bug for each version of Firefox, so it’s clear which version each report applies to.

Steps needed to start this

Add the ‘mlk’ keyword to all mlk-fx4-old and mlk-fx5+ bugs that lack it.

Open new tracking bugs: mlk-fx4-beta, mlk-fx4, mlk-fx5, mlk-fx6.  (The Firefox 4 beta period was long enough, and there were enough leak reports filed against beta versions, that separating mlk-fx4-beta and mlk-fx4 seems worthwhile.)  Make each mlk-fxN depend on mlk-fx(N-1).

For all the existing bugs tracked by mlk-fx4-old and mlk-fx5+, add them to the appropriate new tracking bug.  With one exception: for hopelessly vague ones, just mark them as duplicates of mlk-fxN, with an explanatory message (“we’re not ignoring leaks, look at all these ones we’re tracking!  but your report doesn’t tell us anything we don’t already know, sorry”).

Close mlk-fx5+.

Steps needed to maintain this in the future

When Firefox version N’s cycle starts, open mlk-fxN, and mark it as depending on mlk-fx(N-1).

For each new leak report, mark it as blocking mlk-fxN, for the appropriate N.  Also add the ‘mlk’ keyword.

If someone confirms in a comment that a problem reported in version N is still present in version N+1, mark that bug as also blocking mlk-fx(N+1).

Properties of this system

You can still search for all leak reports, based on the ‘mlk’ keyword.

You can immediately tell roughly how stale a report is likely to be, based on which mlk-fxN tracking bug it blocks.  This is more reliable than the bug number or file date;  for example, we are still getting reports against Firefox 4 even though Firefox 5 (which has fixed a number of leaks) is in beta and Firefox 6 just went to Aurora.  This immediately gives a starting priority for all leak reports:  more recent ones have higher priority because they’re more likely to still be unfixed.

Hopelessly vague reports are resolved immediately by duplicating, so they don’t clog things up.

Tracking bugs shouldn’t get too big and unwieldy, because each Firefox version has a limited lifespan.

Reports against version N still block mlk-fx(N+1), but via one level of indirection.  They also still block mlk-fx(N+2), but via two levels of indirection, etc.  So the full chain of dependencies is maintained.

We could periodically go through older bugs (eg. 3 releases ago) and ask people to re-confirm, and close out ones that get no response.  But we wouldn’t have to do that.

Am I crazy?

Is this bureaucratic overkill?  I don’t think so.  It’ll take some work, but I’m happy to do that.  It’ll only take an hour or two to set up, and then it won’t be much harder to maintain than what I’m currently doing with the mlk-fx5+ bug.  (I also have plans for writing instructions to help users file better leak reports.)  And it’ll allow us to proceed much more usefully with the lists of leak reports that we have.

But I’m interested to hear if you disagree, or have any ideas for improving it.  Thanks!

Categories: Firefox, Memory consumption

Firefox 5 has fewer leaks than Firefox 4

There’s been some confirmation from users that Firefox 5 has fewer leaks than Firefox 4.

From bug 640923:

Well, FF5b2 is really doing way better in my environment. This is great! I’ll abandon FF4 now for me, it never was “stable” in my world.

From bug 657232:

I’ve just upgraded from Firefox 3.6.17 to Beta 5 and it at first appeared to be a lot more stable and responsive than Firefox 4.

Nice to hear!

“Fewer memory leaks” (or “reduced memory usage” if we want to be less blunt) should definitely be on the Firefox 5 feature list when it comes out.