Category Archives: Performance

How to Compare the Memory Efficiency of Web Browsers

TL;DR: Cross-browser comparisons of memory consumption should be avoided.  If you want to evaluate how efficiently browsers use memory, you should do cross-browser comparisons of performance across several machines featuring a range of memory configurations.

Cross-browser Memory Comparisons are Bad

Various tech sites periodically compare the performance of browsers.  These often involve some cross-browser comparisons of memory efficiency.  A typical one would be this:  open a bunch of pages in tabs, measure memory consumption, then close all of them except one and wait two minutes, and then measure memory consumption again.  Users sometimes do similar testing.

I think comparisons of memory consumption like these are (a) very difficult to make correctly, and (b) very difficult to interpret meaningfully.  I have suggestions below for alternative ways to measure memory efficiency of browsers, but first I’ll explain why I think these comparisons are a bad idea.

Cross-browser Memory Comparisons are Difficult to Make

Getting apples-to-apples comparisons is really difficult.

  1. Browser memory measurements aren’t easy.  In particular, all browsers use multiple processes, and accounting for shared memory is difficult.
  2. Browsers are non-deterministic programs, and this can cause wide variation in memory consumption results.  In particular, whether or not the JavaScript garbage collector has run recently can greatly affect the measured memory consumption.  If you get unlucky and the garbage collector runs just after you measure, you’ll get an unfairly high number.
  3. Browsers can exhibit adaptive memory behaviour.  If running on a machine with lots of free RAM, a browser may choose to take advantage of it;  if running on a machine with little free RAM, a browser may choose to discard regenerable data more aggressively.

If you are comparing two versions of the same browser, problems (1) and (3) are avoided, and so if you are careful with problem (2) you can get reasonable results.  But comparing different browsers hits all three problems.

Indeed, Tom’s Hardware de-emphasized memory consumption measurements in their latest Web Browser Grand Prix due to problem (3).  Kudos to them!

Cross-browser Memory Comparisons are Difficult to Interpret

Even if you could get the measurements right, memory consumption is still not a good thing to compare.  Before I can explain why, I’ll introduce a couple of terms.

  • A primary metric is one a user can directly perceive.  Metrics that measure performance and crash rate are good examples.
  • A secondary metric is one that a user can only indirectly perceive via some kind of tool.  Memory consumption is one example.  The L2 cache miss rate is another example.

(I made up these terms;  I don’t know if there are existing terms for these concepts.)

Primary metrics are obviously important, precisely because users can detect them.  They measure things that users notice:  “this browser is fast/slow”, “this browser crashes all the time”, etc.

Secondary metrics are important because they can affect primary metrics:  memory consumption can affect performance and crash rate;  the L2 cache miss rate can affect performance.

Secondary metrics are also difficult to interpret.  They can certainly be suggestive, but there are lots of secondary metrics that affect each primary metric of interest, so focusing too strongly on any single secondary metric is not a good idea.  For example, if browser A has a higher L2 cache miss rate than browser B, that’s suggestive, but you’d be unwise to draw any strong conclusions from it.

Furthermore, memory consumption is harder to interpret than many other secondary metrics.  If all else is equal, a higher L2 cache miss rate is worse than a lower one.  But that’s not true for memory consumption.  There are all sorts of time/space trade-offs that can be made, and there are many cases where using more memory can make browsers faster;  JavaScript JITs are a great example.

And I haven’t even discussed which memory consumption metric you should use.  Physical memory consumption is an obvious choice, but I’ll discuss this more below.

A Better Methodology

So, I’ve explained why I think you shouldn’t do cross-browser memory comparisons.  That doesn’t mean that efficient usage of memory isn’t important! However, instead of directly measuring memory consumption — a secondary metric — it’s far better to measure the effect of memory consumption on primary metrics such as performance.

In particular, I think people often use memory consumption measurements as a proxy for performance on machines that don’t have much RAM.  If you care about performance on machines that don’t have much RAM, you should measure performance on a machine that doesn’t have much RAM instead of trying to infer it from another measurement.

Experimental Setup

I did exactly this by doing something I call memory sensitivity testing, which involves measuring browser performance across a range of memory configurations.  My test machine had the following characteristics.

  • CPU: Intel i7-2600 3.4GHz (quad core with hyperthreading)
  • RAM: 16GB DDR3
  • OS: Ubuntu 11.10, Linux kernel version 3.0.0.

I used a Linux machine because Linux has a feature called cgroups that allows you to restrict the machine resources available to one or more processes.  I followed Justin Lebar’s instructions to create the following configurations that limited the amount of physical memory available: 1024MiB, 768MiB, 512MiB, 448MiB, 384MiB, 320MiB, 256MiB, 192MiB, 160MiB, 128MiB, 96MiB, 64MiB, 48MiB, 32MiB.

(The more obvious way to do this is to use ulimit, but as far as I can tell it doesn’t work on recent versions of Linux or on Mac.  And I don’t know of any way to do this on Windows.  So my experiments had to be on Linux.)
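For concreteness, here’s a minimal sketch of the kind of cgroup setup involved.  It’s written in Python rather than shell, and it assumes the cgroup-v1 memory controller is mounted at /sys/fs/cgroup/memory (as it was on Ubuntu 11.10);  the group name and script are hypothetical, it must be run as root, and it doesn’t bother limiting swap, so treat it as an illustration of the mechanism rather than a faithful copy of Justin’s instructions.

# Hypothetical sketch: cap the physical memory available to a browser via the
# cgroup-v1 memory controller.  Assumes the controller is mounted at
# /sys/fs/cgroup/memory and that we're running as root.
import os
import subprocess
import sys

CGROUP_ROOT = "/sys/fs/cgroup/memory"          # assumption: v1 mount point

def run_with_memory_limit(limit_bytes, argv, group="memtest"):
    group_dir = os.path.join(CGROUP_ROOT, group)
    if not os.path.isdir(group_dir):
        os.mkdir(group_dir)                    # creating the dir creates the group
    with open(os.path.join(group_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))              # cap on physical memory for the group
    def enter_group():
        # Runs in the child after fork(), before exec(): join the group so the
        # browser and anything it spawns inherit the limit.
        with open(os.path.join(group_dir, "tasks"), "w") as f:
            f.write(str(os.getpid()))
    return subprocess.call(argv, preexec_fn=enter_group)

if __name__ == "__main__":
    mebibytes = int(sys.argv[1])               # e.g. 256
    sys.exit(run_with_memory_limit(mebibytes * 1024 * 1024, sys.argv[2:]))

Something like “sudo python cap_mem.py 256 firefox” would then start Firefox with roughly 256MiB of physical memory to play with.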

I used the following browsers.

  • Firefox 12 Nightly, from 2012-01-10 (64-bit)
  • Firefox 9.0.1 (64-bit)
  • Chrome 16.0.912.75 (64-bit)
  • Opera 11.60 (64-bit)

IE and Safari aren’t represented because they don’t run on Linux.  Firefox is over-represented because that’s the browser I work on and care about the most :)  The versions are a bit old because I did this testing about six months ago.

I used the following benchmark suites:  Sunspider v0.9.1, V8 v6, Kraken v1.1.  These are all JavaScript benchmarks and are all awful for gauging a browser’s memory efficiency;  but they have the key advantage that they run quite quickly.  I thought about using Dromaeo and Peacekeeper to benchmark other aspects of browser performance, but they take several minutes each to run and I didn’t have the patience to run them a large number of times.  This isn’t ideal, but I did this exercise to test-drive a benchmarking methodology, not make a definitive statement about each browser’s memory efficiency, so please forgive me.

Experimental Results

The following graph shows the Sunspider results.  (Click on it to get a larger version.)

sunspider results graph

As the lines move from right to left, the amount of physical memory available drops.  Firefox was clearly the fastest in most configurations, with only minor differences between Firefox 9 and Firefox 12pre, but it slowed down drastically below 160MiB;  this is exactly the kind of curve I was expecting.  Opera was next fastest in most configurations, and then Chrome;  neither of them showed any noticeable degradation at any memory size, which was surprising and impressive.

All the browsers crashed/aborted if memory was reduced enough.  The point at which each line stops on the left-hand side indicates the smallest memory size that browser successfully handled.  None of the browsers ran Sunspider with 48MiB available, and FF12pre failed to run it with 64MiB available.

The next graph shows the V8 results.

v8 results graph

The curves go the opposite way because V8 produces a score rather than a time, and bigger is better.  Chrome easily got the best scores.  Both Firefox versions degraded significantly.  Chrome and Opera degraded somewhat, and only at lower sizes.  Oddly enough, FF9 was the only browser that managed to run V8 with 128MiB available;  the other three only ran it with 160MiB or more available.

I don’t particularly like V8 as a benchmark.  I’ve always found that it doesn’t give consistent results if you run it multiple times, and these results concur with that observation.  Furthermore, I don’t like that it gives a score rather than a time or inverse-time (such as runs per second), because it’s unclear how different scores relate.

The final graph shows the Kraken results.

kraken results graph

As with Sunspider, Chrome barely degraded and both Firefoxes degraded significantly.  Opera was easily the slowest to begin with and degraded massively;  nonetheless, it managed to run with 128MiB available (as did Chrome), which neither Firefox managed.

Experimental Conclusions

Overall, Chrome did well, and Opera and the two Firefoxes had mixed results. But I did this experiment to test a methodology, not to crown a winner.  (And don’t forget that these experiments were done with browser versions that are now over six months old.)  My main conclusion is that Sunspider, V8 and Kraken are not good benchmarks when it comes to gauging how efficiently browsers use memory.  For example, none of the browsers slowed down on Sunspider until memory was restricted to 128MiB, which is a ridiculously small amount of memory for a desktop or laptop machine;  it’s small even for a smartphone.  V8 clearly stresses memory consumption more, but it’s still not great.

What would a better benchmark look like?  I’m not completely sure, but it would certainly involve opening multiple tabs and simulating real-world browsing. Something like Membench (see here and here) might be a reasonable starting point.  To test the impact of memory consumption on performance, a clear performance measure would be required, because Membench currently lacks one.  To test the impact of memory consumption on crash rate, Membench could be modified to just keep opening pages until the browser crashes.  (The trouble with that is that you’d lose your count when the browser crashed!  You’d need to log the current count to a file or something like that.)

BTW, if you are thinking “you’ve just measured the working set size”, you’re exactly right! I think working set size is probably the best metric to use when evaluating memory consumption of a browser.  Unfortunately it’s hard to measure (as we’ve seen) and it is best measured via a curve rather than a single number.

A Simpler Methodology

I think memory sensitivity testing is an excellent way to gauge the memory efficiency of different browsers.  (In fact, the same methodology can be used for any kind of program, not just browsers.)

But the above experiment wasn’t easy:  it required a Linux machine, some non-trivial configuration of that machine that took me a while to get working, and at least 13 runs of each benchmark suite for each browser.  I understand that tech sites would be reluctant to do this kind of testing, especially when longer-running benchmark suites such as Dromaeo and Peacekeeper are involved.

A simpler alternative that would still be quite good would be to perform all the performance tests on several machines with different memory configurations.  For example, a good experimental setup might involve the following machines.

  • A fast desktop with 8GB or 16GB of RAM.
  • A mid-range laptop with 4GB of RAM.
  • A low-end netbook with 1GB or even 512MB of RAM.

This wouldn’t require nearly as many runs as full-scale memory sensitivity testing would.  It would avoid all the problems of cross-browser memory consumption comparisons:  difficult measurements, non-determinism, and adaptive behaviour.  It would avoid secondary metrics in favour of primary metrics.  And it would give results that are easy for anyone to understand.

(In some ways it’s even better than memory sensitivity testing because it involves real machines — a machine with a 3.4GHz i7-2600 CPU and only 128MiB of RAM isn’t a realistic configuration!)

I’d love it if tech sites started doing this.

MemShrink progress, week 51-52

Memory Reporting

Nathan Froyd added much more detail to the DOM and layout memory reporters, as the following example shows.

├──3,268,944 B (03.76%) -- window(http://www.reddit.com/)
│  ├──1,419,280 B (01.63%) -- layout
│  │  ├────403,904 B (00.46%) ── style-sets
│  │  ├────285,856 B (00.33%) -- frames
│  │  │    ├───90,000 B (00.10%) ── nsInlineFrame
│  │  │    ├───72,656 B (00.08%) ── nsBlockFrame
│  │  │    ├───62,496 B (00.07%) ── nsTextFrame
│  │  │    ├───42,672 B (00.05%) ── nsHTMLScrollFrame
│  │  │    └───18,032 B (00.02%) ── sundries
│  │  ├────259,392 B (00.30%) ── pres-shell
│  │  ├────168,480 B (00.19%) ── style-contexts
│  │  ├────160,176 B (00.18%) ── pres-contexts
│  │  ├─────55,424 B (00.06%) ── text-runs
│  │  ├─────45,792 B (00.05%) ── rule-nodes
│  │  └─────40,256 B (00.05%) ── line-boxes
│  ├────986,240 B (01.13%) -- dom
│  │    ├──780,056 B (00.90%) ── element-nodes
│  │    ├──156,184 B (00.18%) ── text-nodes
│  │    ├───39,936 B (00.05%) ── other [2]
│  │    └───10,064 B (00.01%) ── comment-nodes
│  └────863,424 B (00.99%) ── style-sheets

Although these changes slightly improved the coverage of the reporters (i.e. they reduce “heap-unclassified” a bit), the more important effect is that they give greater insight into DOM and layout memory consumption.  [Update:  Nathan blogged about these changes.]

I modified the memory reporter infrastructure and about:memory to allow trees of measurements to be shown in the “Other Measurements” section of about:memory.  The following excerpt shows two sets of measurements that were previously shown in a flat list.

108 (100.0%) -- js-compartments
├──102 (94.44%) ── system
└────6 (05.56%) ── user

3,900,832 B (100.0%) -- window-objects
├──2,047,712 B (52.49%) -- layout
│  ├──1,436,464 B (36.82%) ── style-sets
│  ├────402,056 B (10.31%) ── pres-shell
│  ├────100,440 B (02.57%) ── rule-nodes
│  ├─────62,400 B (01.60%) ── style-contexts
│  ├─────30,464 B (00.78%) ── pres-contexts
│  ├─────10,400 B (00.27%) ── frames
│  ├──────2,816 B (00.07%) ── line-boxes
│  └──────2,672 B (00.07%) ── text-runs
├──1,399,904 B (35.89%) ── style-sheets
└────453,216 B (11.62%) -- dom
     ├──319,648 B (08.19%) ── element-nodes
     ├──123,088 B (03.16%) ── other
     ├────8,880 B (00.23%) ── text-nodes
     ├────1,600 B (00.04%) ── comment-nodes
     └────────0 B (00.00%) ── cdata-nodes

This change improves the presentation of measurements that cross-cut those in the “explicit” tree.  I intend to modify a number of the JS engine memory reports to take advantage of this change.

One interesting bug was this one, where various people were seeing multiple compartments for the same site in about:compartments, as the following excerpt shows.

https://bugzilla.mozilla.org/
https://bugzilla.mozilla.org/, about:blank

It turns out this is not a bug.  The sites in question contain an empty iframe that doesn’t specify a URL, and so the memory reporters are doing exactly the right thing.  Still, a useful case to know about when reading about:compartments.

Add-ons

Justin Lebar fixed the FUEL API, which is used by numerous add-ons including Test Pilot and PDF.js, so that it doesn’t leak most of the objects it creates.  This leak caused 10+ minute shut-down times(!) for one user, so it’s a good one to have fixed.

Kyle Huey briefly broke his own Hueyfix, and then fixed it.  You had us worried for a bit there, Kyle!

asgerklasker fixed a leak in the HttpFox add-on that caused zombie compartments.

Miscellaneous

I restricted the amount of context that is shown in JavaScript error messages.  This fixed a longstanding bug where if you had enabled the javascript.options.strict option in about:config, and you had the error console open or Firebug installed, memory consumption would spike dramatically when viewing pages that contain JavaScript code that triggered many strict warnings.  This bug manifested rarely, but would bring Firefox to its knees when it did.

Asaf Romano fixed a leak that caused a zombie compartment if you used the “Highlight All” checkbox when doing a text search in a page.

Mike Hommey finished importing the new version of jemalloc into the Mozilla codebase, and then wrote about it.  The new version is currently disabled, but we hope to turn it on soon.  Preliminary experiments indicate that the new version is unlikely to affect performance or memory consumption much.  However, it will be good to be on a version that is in sync with the upstream version, instead of having a highly hacked Mozilla-specific version.

Justin Lebar wrote a nice blog post explaining what “ghost windows” are, and how we’ve used them to gauge some recent leak fixes.

Till Schneidereit fixed a garbage collection defect that could cause memory usage to spike in certain cases involving Firefox Sync.

LifeHacker published a new browser performance comparison which looked at Firefox 13, Chrome 19, IE9, and Opera 11.64.  Firefox did the best overall and also won both the memory consumption tests.  I personally take these kinds of comparisons with a huge bucket of salt.  Indeed, quoting from the article:

Our tests aren’t the most scientific on the planet, but they do reflect a relatively accurate view of the kind of experience you’d get from each browser, speed-wise.

If you ask me, the text before the “but” contradicts the rest of the sentence.  What I found more interesting was the distinct lack of “lolwut everyone knows Chrome is faster than Firefox” comments, which I’m used to seeing when Firefox does well in these kinds of articles.

Bug Counts

Here are the current bug counts.

  • P1: 25 (-1/+3)
  • P2: 88 (-3/+6)
  • P3: 106 (-0/+4)
  • Unprioritized: 2 (-2/+2)

Nothing too exciting there.

Notes on Reducing Firefox’s Memory Consumption

I gave a talk yesterday at the linux.conf.au Browser MiniConf, held in Ballarat, Australia.  Its title was “Notes On Reducing Firefox’s Memory Consumption”.

Below are the slides and notes in a SlideShare embedding. If you find that embedding problematic (some people do) you may prefer to download the PDF version directly.

Speeding up Mercurial

Just a reminder:  Mercurial repository clones get slower over time.  It’s worth deleting and recloning every so often.

For example, I had one old repository that I’d done a lot of work in and updated many times.  It had a patch queue containing a single patch. I tried doing hg qdiff four times in a row. The first one took 65 seconds, and the next three each took 16 seconds.

I deleted the repository, re-cloned, re-applied the patch, and hg qdiff now takes 1 second. Much better.

Building a page fault benchmark

I wrote a while ago about the importance of avoiding page faults for browser performance.  Despite this, I’ve been focusing specifically on reducing Firefox’s memory usage.  This is not a terrible thing;  page fault rates and memory usage are obviously strongly linked.  But they don’t have perfect correlation.  Not all memory reductions will have equal effect on page faults, and you can easily imagine changes that reduce page fault rates — by changing memory layout and access patterns — without reducing memory consumption.

A couple of days ago, Luke Wagner initiated an interesting email conversation with me about his desire for a page fault benchmark, and I want to write about some of the things we discussed.

It’s not obvious how to design a page fault benchmark, and to understand why I need to first talk about more typical time-based benchmarks like SunSpider.  SunSpider does the same amount of work every time it runs, and you want it to run as fast as possible.  It might take 200ms to run on your beefy desktop machine, 900ms on your netbook, and 2000ms to run on your smartphone.  In all cases, you have a useful baseline against which you can measure optimizations.  Also, any optimization that reduces the time on one device has a good chance of reducing time on the other devices.  The performance curve across devices is fairly flat.

In contrast, if you’re measuring page faults, these things probably won’t be true on a benchmark that does a constant amount of work.  If my desktop machine has 16GB of RAM, I’ll probably get close to zero page faults no matter what happens.  But on a smartphone with 512MB of RAM, the same benchmark may lead to a page fault death spiral;  the number will be enormous, assuming you even bother waiting for it to finish (and the OS doesn’t kill it first).  And the netbook will probably lie unhelpfully on one side or the other of the cliff in the performance curve.  Such a benchmark will be of limited use.

However, maybe we can instead concoct a benchmark that repeats a sequence of interesting operations until a certain number of page faults have occurred.  The desktop machine might get 1000 operations, the netbook 400, the smartphone 100.  The performance curve is fairly flat again.

The operations should be representative of realistic browsing behaviour.  Obviously, the memory consumption has to increase each time you finish a sequence, but you don’t want to just open new pages.  A better sequence might look like “open foo.com in a new tab, follow some links, do some interaction, open three child pages, close two of them”.
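To make that concrete, here’s a rough sketch of the driver loop I have in mind.  It’s Python;  the do_browsing_sequence() hook is hypothetical (it would drive the browser through one open/browse/interact/close cycle, e.g. via some automation framework), and the major-fault count is read from /proc, so this sketch is Linux-only.

# Rough sketch of a "repeat until N page faults" benchmark driver.
FAULT_THRESHOLD = 10000        # arbitrary: stop after this many major page faults

def major_faults(pid):
    # Read the cumulative major page fault count for `pid` from /proc/<pid>/stat.
    with open("/proc/%d/stat" % pid) as f:
        stat = f.read()
    # The command name (field 2) can contain spaces, so split after the closing
    # parenthesis; majflt is then the 10th of the remaining fields.
    return int(stat.rsplit(")", 1)[1].split()[9])

def run_benchmark(browser_pid, do_browsing_sequence):
    sequences = 0
    baseline = major_faults(browser_pid)
    while major_faults(browser_pid) - baseline < FAULT_THRESHOLD:
        do_browsing_sequence()     # hypothetical: one open/browse/close cycle
        sequences += 1
    return sequences               # the benchmark's score: higher is better

The desktop machine would complete many sequences before hitting the threshold, the smartphone only a few, and the score stays meaningful across the whole range of machines.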

And it would be interesting to run this test on a range of physical memory sizes, to emulate different machines such as smartphones, netbooks, desktops.  Fortunately, you can do this on Linux;  I’m not sure about other OSes.

I think a benchmark (or several benchmarks) like this would be challenging but not impossible to create.  It would be very valuable, because it measures a metric that directly affects users (page faults) rather than one that indirectly affects them (memory consumption).  It would be great to use as the workload under Julian Seward’s VM simulator, in order to find out which parts of the browser are causing page faults.  It might make areweslimyet.com catch managers’ eyes as much as arewefastyet.com does.  And finally, it would provide an interesting way to compare the memory “usage” (i.e. the stress put on the memory system) of different browsers, in contrast to comparisons of memory consumption which are difficult to interpret meaningfully.

 

Faster JavaScript parsing

Over the past year or so I’ve almost doubled the speed of SpiderMonkey’s JavaScript parser.  I did this by speeding up both the scanner (bug 564369, bug 584595, bug 588648, bug 639420, bug 645598) and the parser (bug 637549).  I used patch stacks in several of those bugs, and so in those six bugs I actually landed 28 changesets.

Notable things about scanning JavaScript code

Before I explain the changes I made, it will help to explain a few notable things about scanning JavaScript code.

  • JavaScript is encoded using UCS-2.  This means that each character is 16 bits.
  • There are several character sequences that indicate the end of a line (EOL): ‘\n’, ‘\r’, ‘\r\n’, \u2028 (a.k.a. LINE_SEPARATOR), and \u2029 (a.k.a. PARA_SEPARATOR).  Note that ‘\r\n’ is treated as a single character.
  • JavaScript code is often minified, and the characteristics of minified and non-minified code are quite different.  The most important difference is that minified code has much less whitespace.

Scanning improvements

Before I made any changes, there were two different modes in which the scanner could operate.  In the first mode, the entire character stream to be scanned was already in memory.  In the second, the scanner read the characters from a file in chunks a few thousand chars long.  Firefox always uses the first mode (except in the rare case where the platform doesn’t support mmap or an equivalent function), but the JavaScript shell used the second.  Supporting the second mode made things complicated in two ways.

  • It was possible for an ‘\r\n’ EOL sequence to be split across two chunks, which required some extra checking code.
  • The scanner often needs to unget chars (up to six chars, due to the lookahead required for \uXXXX sequences), and it couldn’t unget chars across a chunk boundary.  This meant that it used a six-char unget buffer.  Every time a char was ungotten, it would be copied into this buffer.  As a consequence, every time it had to get a char, it first had to look in the unget buffer to see if there was one or more chars that had been previously ungotten.  This was an extra check (and a data-dependent and thus unpredictable check).

The first complication was easy to avoid by only reading N-1 chars into the chunk buffer, and only reading the Nth char in the ‘\r\n’ case.  But the second complication was harder to avoid with that design.  Instead, I just got rid of the second mode of operation;  if the JavaScript engine needs to read from file, it now reads the whole file into memory and then scans it via the first mode.  This can result in more memory being used but it only affects the shell, not the browser, so it was an acceptable change.  This allowed the unget buffer to be completely removed;  when a character is ungotten now the scanner just moves back one char in the char buffer being scanned.

Another improvement was that in the old code, there was an explicit EOL normalization step.  As each char was gotten from the memory buffer, the scanner would check if it was an EOL sequence;  if so it would change it to ‘\n’, if not, it would leave it unchanged.  Then it would copy this normalized char into another buffer, and scanning would proceed from this buffer.  (The way this copying worked was strange and confusing, to make things worse.)  I changed it so that getChar() would do the normalization without requiring the copy into the second buffer.

The scanner has to detect EOL sequences in order to know which line it is on.  At first glance, this requires checking every char to see if it’s an EOL, and the scanner uses a small look-up table to make this fast.  However, it turns out that you don’t have to check every char.  For example, once you know that you’re scanning an identifier, you know that if you hit an EOL sequence you’ll immediately unget it, because that marks the end of the identifier.  And when you unget that char you’ll undo the line update that you did when you hit the EOL.  This same logic applies in other situations (eg. parsing a number).  So I added a function getCharIgnoreEOL() that doesn’t do the EOL check.  It has to always be paired with ungetCharIgnoreEOL() and requires some care as to where it’s used, but it avoids the EOL check on more than half the scanned chars.

As well as detecting where each token starts and ends, for a lot of token kinds the scanner has to compute a value.  For example, after scanning the character sequence “123” it has to convert that to the number 123.  The old scanner would copy the chars into a temporary buffer before calling the function that did the conversion.  This was unnecessary — the conversion function didn’t even require NULL-terminated strings because it gets passed the length of the string being converted!  Also, the old scanner was using js_strtod() to do the number conversion.  js_strtod() can convert both integers and fractional numbers, but it’s quite slow and overkill for integers.  And when scanning, even before converting the string to a number, we know if the number we just scanned was an integer or not (by remembering if we saw a ‘.’ or exponent).  So now the scanner instead calls GetPrefixInteger() which is much faster.  Several of the tests in Kraken involve huge arrays of integers, and this made a big difference to them.

There’s a similar story with identifiers, but with an added complication.  Identifiers can contain \uXXXX chars, and these need to be normalized before we do more with the string inside SpiderMonkey.  So the scanner now remembers whether a \uXXXX char has occurred in an identifier.  If not, it can work directly (temporarily) with the copy of the string inside the char buffer.  Otherwise, the scanner will rescan the identifier, normalizing and copying it into a new buffer.  I.e. the scanner de-optimizes the (very) rare case in order to avoid the copying in the common case.

JavaScript supports decimal, hexadecimal and octal numbers.  The number-scanning code handled all three kinds in the same loop, which meant that it checked the radix every time it scanned another number char.  So I split this into three parts, which made the code both faster and easier to read.

Although JavaScript chars are 16 bits, the vast majority of chars you see are in the first 128 chars.  This is true even for code written in non-Latin scripts, because of all the keywords (e.g. ‘var’) and operators (e.g. ‘+’) and punctuation (e.g. ‘;’).  So it’s worth optimizing for those.  The main scanning loop (in getTokenInternal()) now first checks every char to see if its value is 128 or greater.  If so, it handles it in a side-path (the only legitimate such chars are whitespace, EOL or identifier chars, so that side-path is quite small).  The rest of getTokenInternal() can then assume that it’s a sub-128 char.  This meant I could be quite aggressive with look-up tables, because having lots of 128-entry look-up tables is fine, but having lots of 65,536-entry look-up tables would not be.  One particularly important look-up table is called firstCharKinds;  it tells you what kind of token you will have based on the first non-whitespace char in it.  For example, if the first char is a letter, it will be an identifier or keyword;  if the first char is a ‘0’ it will be a number;  and so on.

Another important look-up table is called oneCharTokens.  There are a handful of tokens that are one-char long, cannot form a valid prefix of another token, and don’t require any additional special handling:  ;,?[]{}().  These account for 35–45% of all tokens seen in real code!  The scanner can detect them immediately and use another look-up table to convert the token char to the internal token kind without any further tests.  After that, the rough order of frequency for different token kinds is as follows:  identifiers/keywords, ‘.’, ‘=’, strings, decimal numbers, ‘:’, ‘+’, hex/octal numbers, and then everything else.  The scanner now looks for these token kinds in that order.
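The real code is C++ in jsscan.cpp, but the shape of the technique is easy to show in a few lines of Python.  The sketch below is illustrative only:  it builds a 128-entry firstCharKinds-style table and a one-char-token set, and classifies the next token from its first character;  the real scanner obviously then goes on to consume the whole token.

# Illustrative sketch (not the SpiderMonkey code) of table-driven token
# classification: small look-up tables for sub-128 chars, a side path for
# everything else, and an immediate fast path for one-char tokens.
ONE_CHAR_TOKENS = set(";,?[]{}()")        # complete tokens on their own

def build_first_char_kinds():
    kinds = ["other"] * 128
    for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_$":
        kinds[ord(c)] = "ident"            # identifier or keyword
    for c in "0123456789":
        kinds[ord(c)] = "number"
    for c in "'\"":
        kinds[ord(c)] = "string"
    for c in " \t\r\n":
        kinds[ord(c)] = "space"
    return kinds

FIRST_CHAR_KINDS = build_first_char_kinds()

def classify_next_token(src, pos):
    # Skip ASCII whitespace using the table.
    while (pos < len(src) and ord(src[pos]) < 128
           and FIRST_CHAR_KINDS[ord(src[pos])] == "space"):
        pos += 1
    if pos == len(src):
        return "eof", pos
    c = ord(src[pos])
    if c >= 128:
        return "non-ascii", pos            # rare side path in the real scanner
    if src[pos] in ONE_CHAR_TOKENS:
        return src[pos], pos               # fast path: kind known, no further tests
    return FIRST_CHAR_KINDS[c], pos        # dispatch on the 128-entry table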

Those are just a few of the improvements;  there were lots of other little clean-ups.  While writing this post I looked at the old scanning code, as it was before I started changing it.  It was awful:  it was really hard to see what was happening;  getChar() was 150 lines long because it included code for reading the next chunk from file (if necessary) and also normalizing EOLs.

In comparison, as well as being much faster, the new code is much easier to read, and much more DFA-like.  It’s worth taking a look at getTokenInternal() in jsscan.cpp.

Parsing improvements

The parsing improvements were all related to the parsing of expressions.  When the parser parses an expression like “3” it needs to look for any following operators, such as “+”.  And there are roughly a dozen levels of operator precedence.  The way the parser did this was to get the next token, check if it matched any of the operators of a particular precedence, and then unget the token if it didn’t match.  It would then repeat these steps for the next precedence level, and so on.  So if there was no operator after the “3”, the parser would have gotten and ungotten the next token a dozen times!  Ungetting and regetting tokens is fast, because there’s a buffer used (i.e. you don’t rescan the token char by char), but it was still a bottleneck.  I changed it so that the sub-expression parsers were expected to parse one token past the end of the expression, instead of zero tokens past the end.  This meant that the repeated getting/ungetting could be avoided.

These operator parsers are also very small.  I inlined them more aggressively, which also helped quite a bit.
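Here’s an illustrative sketch of the new shape, again in Python with a made-up token stream rather than SpiderMonkey’s;  the key point is that each sub-expression parser returns the one token past what it consumed, so the enclosing precedence level can inspect that token directly instead of getting and ungetting it.

# Illustrative sketch: each level of expression parsing returns the first
# unconsumed token along with its result, so no get/unget churn is needed.
BINARY_PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}   # a tiny subset of levels

def parse_primary(tokens):
    # Parse an operand (here, just a literal) and also fetch the token after it.
    return next(tokens), next(tokens, None)

def parse_expression(tokens, min_prec=1):
    lhs, lookahead = parse_primary(tokens)
    # `lookahead` is already one token past the sub-expression just parsed.
    while lookahead in BINARY_PRECEDENCE and BINARY_PRECEDENCE[lookahead] >= min_prec:
        op = lookahead
        rhs, lookahead = parse_expression(tokens, BINARY_PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, lookahead              # hand the unconsumed token back to the caller

# Example:
#   parse_expression(iter("3 + 4 * 5 ;".split()))
#   => (('+', '3', ('*', '4', '5')), ';')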

Results

I had some timing results but now I can’t find them.  But I know that the overall speed-up from my changes was about 1.8x on Chris Leary’s parsemark suite, which takes code from lots of real sites, and the speed-up doesn’t tend to vary that much from one codebase to another.

Many real websites, e.g. gmail, have megabytes of JS code, and this speed-up will probably save one or two tenths of a second when they load.  Not something you’d notice, but certainly something that’ll add up over time and help make the browser feel snappier.

Tools

I used Cachegrind to drive most of these changes.  It has two features that were crucial.

First, it does event-based profiling, i.e. it counts instructions, memory accesses, etc., rather than time.  When making a lot of very small improvements, noise often swamps the effects of the improvements, so being able to see that instruction counts are going down by 0.2% here, 0.3% there, is very helpful.

Second, it gives counts of these events for individual lines of source code.  This was particularly important for getTokenInternal(), which is the main scanning function and has around 700 lines;  function-level stats wouldn’t have been enough.

You lose more when slow than you gain when fast

SpiderMonkey is Firefox’s JavaScript engine.  In Firefox 3.0 and earlier versions, it was just an interpreter.  In Firefox 3.5, a tracing JIT called TraceMonkey was added.  TraceMonkey was able to massively speed up certain parts of programs, such as tight loops;  parts of programs it couldn’t speed up continued to be interpreted.  TraceMonkey provided large speed improvements, but JavaScript performance overall still didn’t compare that well against that of Safari and Chrome, both of which used method JITs that had worse best-case performance than TraceMonkey, but much better worst-case performance.

If you look at the numbers, this isn’t so surprising.  If you’re 10x faster than the competition on half the cases, and 10x slower on half the cases, you end up being 5.05x slower overall.  Firefox 4.0 added a method JIT, JaegerMonkey, which avoided those really slow cases, and Firefox’s JavaScript performance is now very competitive with other browsers.
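Spelling out that arithmetic (assuming the two halves of the workload take the competition equal time):

# Competition takes 1 time unit per case; we're 10x faster on half the cases
# and 10x slower on the other half.
ours   = 0.5 * (1 / 10.0) + 0.5 * 10.0    # = 5.05 time units
theirs = 0.5 * 1.0        + 0.5 * 1.0     # = 1.00 time unit
print(ours / theirs)                      # -> 5.05, i.e. 5.05x slower overall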

The take-away message:  you lose more when slow than you gain when fast. Your performance is determined primarily by your slowest operations.  This is true for two reasons.  First, in software you can easily get such large differences in performance: 10x, 100x, 1000x and more aren’t uncommon.  Second, in complex programs like a web browser, overall performance (i.e. what a user feels when browsing day-to-day) is determined by a huge range of different operations, some of which will be relatively fast and some of which will be relatively slow.

Once you realize this, you start to look for the really slow cases. You know, the ones where the browser slows to a crawl and the user starts cursing and clicking wildly and holy crap if this happens again I’m switching to another browser.  Those are the performance effects that most users care about, not whether their browser is 2x slower on some benchmark.  When they say “it’s really fast”, they probably actually mean “it’s never really slow”.

That’s why memory leaks are so bad — because they lead to paging, which utterly destroys performance, probably more than anything else.

It also makes me think that the single most important metric when considering browser performance is page fault counts.  Hmm, I think it’s time to look again at Julian Seward’s VM profiler and the Linux perf tools.

 

Multi-faceted JavaScript speed improvements

Firefox 4.0 beta 7’s release announcement was accompanied by the following graphs that show great improvements in JavaScript speed:

Fx4b7 JavaScript speed-ups

Impressive!  The graphs claim speed-ups of 3x, 3x and 5x;  by my calculations the more precise numbers are 3.49x, 2.94x and 5.24x.

The Sunspider and V8bench results are no surprise to anyone who knows about JägerMonkey and has been following AWFY, but the excellent Kraken results really surprised me.  Why?

  • Sunspider and V8bench have been around for ages.  They are the benchmarks most commonly used (for better or worse) to gauge JavaScript performance and so they have been the major drivers of performance improvements.  To put it more bluntly, like all the other browser vendors, we tune for these benchmarks a lot. In contrast, Kraken was only released on September 14th, and so we’ve done very little tuning for it yet.
  • Unlike Sunspider and V8bench, Kraken contains a lot of computationally intensive code such as image and audio processing. These benchmarks are dominated by tight loops containing numerous array accesses.  As a result, they trace really well, and so even 4b7 spends most of its Kraken time (I’d estimate 90%+) in code generated by TraceMonkey, the trace JIT.

We can draw two happy conclusions from Kraken’s improvement.

  • Our speed-ups apply widely, not just to Sunspider and V8bench.
  • Our future performance eggs are not all in one basket: the JavaScript team has made and will continue to make great improvements to the non-JägerMonkey parts of the JavaScript engine.

Firefox 4.0 is going to be a great release!

cg_diff: a differential profiling tool

I frequently use the SunSpider and V8 benchmark suites to compare the speed of different versions of TraceMonkey.  The best metric for speed comparisons is always execution time.  However, measuring execution time on modern machines is unreliable — you get a lot of variation between runs.  This is a particular problem in this case because the run-times of these benchmarks are very small — SunSpider takes less than 700 ms on my laptop, and V8 takes about 6.5 seconds.  Run-to-run variations can be larger than the difference I’m trying to measure.  This is annoying:  the best speed metric cannot be measured exactly.

So I frequently use Cachegrind to measure the number of executed instructions.  This is a worse metric than execution time — the number of instructions doesn’t directly relate to the execution time, although it’s usually a good indicator — but it has the advantage that it can be measured exactly.  Most of the SunSpider and V8 tests are deterministic, and if I measure them twice in a row I’ll get the same result.  Cachegrind also gives instruction counts on a per-function and per-line basis, which is very useful.

So I often run Cachegrind on two different versions of TraceMonkey:  an unchanged copy of the current repository tip, and a copy of the current repository tip with a patch applied.  I can then compare the results and get a very precise idea of how the patch affects performance.

However, comparing the output of two Cachegrind runs manually is a pain.  For example, here is part of Cachegrind’s output (lightly edited for clarity) for crypto-md5.js with an unchanged repository tip (as of a day or two ago):

--------------------------------------------------------------------------------
 Ir
--------------------------------------------------------------------------------
48,923,280  PROGRAM TOTALS
--------------------------------------------------------------------------------
 Ir  file:function
--------------------------------------------------------------------------------
5,638,362  ???:???
4,746,990  /build/buildd/eglibc-2.10.1/string/../sysdeps/i386/i686/strcmp.S:strcmp
2,032,069  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
1,899,298  jstracer.cpp:bool js::VisitFrameSlots<js::CountSlotsVisitor>(...)
1,759,932  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
1,232,425  jsops.cpp:js_Interpret
 885,168  jstracer.cpp:bool js::VisitFrameSlots<js::DetermineTypesVisitor>(...)
 871,197  jstracer.cpp:js::TraceRecorder::set(int*, nanojit::LIns*, bool)
 812,419  /build/buildd/eglibc-2.10.1/iconv/gconv_conf.c:insert_module
 758,034  jstracer.cpp:js::TraceRecorder::monitorRecording(JSOp)

At the top we have the total instruction count, and then we have the instruction counts for the top 10 functions.  The ???:??? entry represents code generated by TraceMonkey’s JIT compiler, for which there is no debug information.  “Ir” is short for “I-cache reads”, which is equivalent to “instructions executed”.

Cachegrind tracks a lot more than just instruction counts, but I’m only showing them here to keep things simple. It also gives per-line counts, but I’ve omitted them as well.

And here is the corresponding output when a patch from bug 575529 is applied:

--------------------------------------------------------------------------------
        Ir
--------------------------------------------------------------------------------
42,332,998  PROGRAM TOTALS
--------------------------------------------------------------------------------
       Ir  file:function
--------------------------------------------------------------------------------
4,746,990  /build/buildd/eglibc-2.10.1/string/../sysdeps/i386/i686/strcmp.S:strcmp
4,100,366  ???:???
1,687,434  jstracer.cpp:bool js::VisitFrameSlots(js::CountSlotsVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
1,343,085  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
1,229,853  jsops.cpp:js_Interpret
1,137,981  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
  868,855  jstracer.cpp:js::TraceRecorder::set(int*, nanojit::LIns*, bool)
  812,419  /build/buildd/eglibc-2.10.1/iconv/gconv_conf.c:insert_module
  755,753  jstracer.cpp:js::TraceRecorder::monitorRecording(JSOp)
  575,200  jsscan.cpp:js::TokenStream::getTokenInternal()

It’s easy to see that the total instruction count has dropped from 48.9M to 42.3M, but seeing the changes at a per-function level is more difficult. For a long time I would make this comparison manually by opening the two files side-by-side and reading carefully.  Sometimes I’d also do some cutting-and-pasting to reorder entries. The whole process was tedious, but the information revealed is so useful that I did it anyway.

Then three months ago David Baron asked on Mozilla’s dev-platform mailing list if anybody knew of any good differential profiling tools. This prompted me to realise that I wanted exactly such a tool for Cachegrind. Furthermore, as Cachegrind’s author, I was in a good place to understand exactly what was necessary :)

The end result is a new script, cg_diff, that can be used to compute the difference between two Cachegrind output files. Here’s part of the difference between the above two versions:

--------------------------------------------------------------------------------
        Ir
--------------------------------------------------------------------------------
-6,590,282  PROGRAM TOTALS
--------------------------------------------------------------------------------
        Ir  file:function
--------------------------------------------------------------------------------
-1,537,996  ???:???
  -894,088  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
  -416,847  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
  -405,271  jstracer.cpp:bool js::VisitFrameSlots(js::DetermineTypesVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
  -246,047  nanojit/Containers.h:nanojit::StackFilter::read()
  -238,121  nanojit/Assembler.cpp:nanojit::Assembler::registerAlloc(nanojit::LIns*, int, int)
   230,419  nanojit/LIR.cpp:nanojit::interval::of(nanojit::LIns*)
  -226,070  nanojit/Assembler.cpp:nanojit::Assembler::asm_leave_trace(nanojit::LIns*)
  -211,864  jstracer.cpp:bool js::VisitFrameSlots(js::CountSlotsVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
  -200,742  nanojit/Assembler.cpp:nanojit::Assembler::findRegFor(nanojit::LIns*, int)

This makes it really easy to see what’s changed. Negative values mean that the instruction count dropped, positive numbers mean that the instruction count increased.

I’ve been using this script for a while now, and it’s really helped me analyse the performance effects of my patches. Indeed, I have some scripts set up so that, with a single command, I can run all of SunSpider through two different versions of TraceMonkey and produce both normal profiles and difference profiles. I can also get high-level instruction comparisons such as the one in this Bugzilla comment.

And now everybody else can use cg_diff too, because I just landed it on the Valgrind trunk. If you want to try it, follow these instructions to set up a copy of the trunk.  And note that if you want to compare two versions of a program that sit in different directories (as opposed to profiling a program, modifying it, then reprofiling it) you’ll need to use cg_diff’s --mod-filename option to get useful results.  Feel free to ask me questions (via email, IRC or in the comments below) if you have troubles.
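If you’re curious what such a diff involves, here’s a toy sketch of the core idea (emphatically not cg_diff itself, which handles much more).  It assumes the plain Cachegrind output format with fl=/fn= records followed by lines of per-line counts, sums the first event column (Ir by default) per file:function in each file, and prints the differences, biggest first.

# Toy sketch of differential profiling over two Cachegrind output files.
# Assumes the simple "fl=", "fn=", "<line> <Ir> ..." record format.
import sys
from collections import defaultdict

def per_function_ir(path):
    counts = defaultdict(int)
    fl = fn = "???"
    with open(path) as f:
        for line in f:
            if line.startswith("fl="):
                fl = line[3:].strip()
            elif line.startswith("fn="):
                fn = line[3:].strip()
            elif line[:1].isdigit():
                counts[(fl, fn)] += int(line.split()[1])   # first event column (Ir)
    return counts

def print_diff(old_path, new_path):
    old, new = per_function_ir(old_path), per_function_ir(new_path)
    deltas = {k: new.get(k, 0) - old.get(k, 0) for k in set(old) | set(new)}
    for (fl, fn), d in sorted(deltas.items(), key=lambda kv: -abs(kv[1])):
        if d:
            print("%12d  %s:%s" % (d, fl, fn))

if __name__ == "__main__":
    print_diff(sys.argv[1], sys.argv[2])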

Happy differencing!