
MemShrink progress report, week 23

The only significant MemShrink-related change that landed this week was that David Anderson removed TraceMonkey, the tracing JIT compiler.  In fact, TraceMonkey was disabled a while ago, so the effects on code size and memory consumption of its removal have been felt since then.  But it feels more real now that the source code is gone (all 67,000 lines of it!), so I figure it’s worth mentioning.  (BTW, many thanks to Ryan VanderMeulen who has been going through Bugzilla, closing many old TraceMonkey-related bugs that are no longer relevant.)

People have asked why TraceMonkey isn’t needed any more.  In my opinion, tracing compilation can be a good strategy for certain kinds of code, such as very tight, non-branchy loops.  But it tends to do badly on other kinds of code.  Before JaegerMonkey, JS code in Firefox ran in one of two modes: interpreted (super slow), or trace-compiled (usually fast).  This kind of bimodal performance is bad, because you lose more when slow than you gain when fast.  Also, because tracing was the only way to make code fast, huge amounts of effort were put into tracing code that shouldn’t really be traced, which made TraceMonkey really complicated.

Once JaegerMonkey was added, the performance was still bimodal, but in a better way:  method-compiled (fairly fast) or trace-compiled (usually fast).  But the heuristics to switch between the two modes were quite hairy.  Then type inference was added to JaegerMonkey, which made it faster on average than JaegerMonkey+TraceMonkey.  Combine that with the fact that TraceMonkey was actively getting in the way of various additional JaegerMonkey and type inference improvements, and it was clear it was time for TraceMonkey to go.

It might sound like there’s been a lot of wasted effort with all these different JITs.  There’s some truth to that.  But JavaScript is a difficult language to compile well, and people have only been writing JITs for it for a few years, which isn’t long when it comes to compilers.  Each new JIT has taught the JS team about ideas that do and don’t work, and those lessons have been incorporated into the next, better JIT.  That’s why IonMonkey is now being developed — because JaegerMonkey with type inference still has a number of shortcomings that can’t be remedied incrementally.

In fact, it’s possible that IonMonkey might end up one day with a trace compiler for really hot, tight loops.  If it does, this trace compiler would be much simpler than TraceMonkey because it would only target code that trace-compiles easily;  trace compilation would be the icing on the cake, not the whole cake.

Enough about JITs.  Time for this week’s MemShrink bug counts.

  • P1: 31 (-0/+2)
  • P2: 132 (-3/+8)
  • P3: 60 (-0/+2)
  • Unprioritized: 4 (-0/+4)

Not a great deal of movement there.  The quietness is at least partly explained by the fact that Thanksgiving is happening in the US this week.  Next week will probably be quieter than usual for the same reason.


MemShrink progress, week 20

Surprise of the week

[Update: This analysis about livemarks may be wrong.  Talos results from early September show that the MaxHeaps number increased, and the reduction when the “Latest Headlines” livemark was removed has undone that increase.  So livemarks may not be responsible at all, it could just be a coincidence.  More investigation is needed!]

Jeff Muizelaar removed the “Latest Headlines” live bookmark from new profiles.  It lived in the Bookmarks Toolbar, which has been hidden by default since Firefox 4.0, and most people don’t use it.  Yet it still uses a non-zero amount of memory and CPU.  Just how non-zero was unclear until Marco Bonardo noticed some big performance improvements.  First, in the Talos “MaxHeap” results on WinNT5.2:

Talos MaxHeap graph

And second, in the Talos “Allocs” results on WinNT5.2 and Mac10.5.2:

Talos Allocs graph

In the WinNT5.2 cases, it looks like we previously had a bimodal distribution, and this patch changed things so that the higher of the two cases never occurred.  In the Mac10.5.2 case we just had a simple reduction in the number of allocations.  On Linux the results were less conclusive, but there may have been a similar, if smaller, effect.

This surprised me greatly.  I’ve done a lot of memory profiling of Firefox and never seen anything that pointed to the feed reader as using a lot of memory.  This may be because the feed reader’s usage is falling into a larger, innocuous bucket, such as JS or generic strings.  Or maybe I just missed the signs altogether.

Some conclusions and questions:

  • If you have live bookmarks in your bookmarks toolbar, remove them! [Update: I meant to say “unused live bookmarks”.]
  • We need to work out what is going on with the feed reader, and optimize its memory usage.
  • Can we disable unused live bookmarks for existing users?

Apparently nobody really owns the feed reader, because previous contributors to it have all moved on.  So I’m planning to investigate, but I don’t know the first thing about it.  Any help would be welcome!

Other stuff

There was a huge memory leak in the Video DownloadHelper add-on v4.9.5, and possibly earlier versions.  This has been fixed in v4.9.6a3 and the fix will make it into the final version v4.9.6 when it is released.  That’s one more add-on leak down;  I wonder how many more there are to go.

TraceMonkey, the trace JIT, is no longer built by default.  This means it’s no longer used, and this saves both code and data space.  The old combination of TraceMonkey and JaegerMonkey is slower than the new combination of JaegerMonkey with type inference, and TraceMonkey is also preventing various clean-ups and simplifications, so it’ll be removed entirely soon.

I refined the JS memory reporters in about:memory to give more information about objects, shapes and property tables.

I avoided creating small property tables, removed KidHashes when possible, and reduced the size of KidHashes with few entries.

I wrote about various upcoming memory optimizations in the JS engine.

Justin Lebar enabled jemalloc on MacOS 10.5 builds.  This was expected to be a space improvement, but it also reduced times on the “Tp5 MozAfterPaint” page-loading benchmark by 8%.

Robert O’Callahan avoided excessive memory usage in certain DOM animations on Windows.

Drew Willcoxon avoided excessive memory usage in context menu items created by the add-on SDK.

Bug Counts

  • P1: 35 (-1, +1)
  • P2: 116 (-2, +5)
  • P3: 55 (-2, +3)
  • Unprioritized: 5 (-4, +5)

At this week’s MemShrink meeting we only had 9 bugs to triage, which is the lowest we’ve had in a long time.  It feels like the MemShrink bug list is growing slower than in the past.


SpiderMonkey is on a diet

One thing I’ve learnt while working for Mozilla is that a web browser can be characterized as a JavaScript execution environment that happens to have some multimedia capabilities.  In particular, if you look at Firefox’s about:memory page, the JS engine is very often the component responsible for consuming the most memory.

Consider the following snapshot from about:memory of the memory used by a single JavaScript compartment.

about:memory snapshot

(For those of you who have looked at about:memory before, some of those entries may look unfamiliar, because I landed a patch to refine the JS memory reporters late last week.)

There is work underway to reduce many of the entries in that snapshot.  SpiderMonkey is on a diet.

Objects

Objects are the primary data structure used in JS programs;  after all, it is an object-oriented language.  Inside SpiderMonkey, each object is represented by a JSObject, which holds basic information, and possibly a slots array, which holds the object’s properties. The memory consumption for all JSObjects is measured by the “gc-heap/objects/non-function” and “gc-heap/objects/function” entries in about:memory, and the slots arrays are measured by the “object-slots” entries.
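
To make that split concrete, here’s a minimal sketch of the layout just described (the field names and types are illustrative, not SpiderMonkey’s actual definitions):

    #include <cstdint>

    struct Shape;   // describes the object's property layout (see "Shapes" below)
    struct Value;   // the 64-bit "fatval" representation (see below)

    // An object header holding basic information, with property values in a
    // separately allocated slots array (the "object-slots" measurement).
    struct ObjectSketch {
        Shape*   shape;     // layout information, often shared between objects
        Value*   slots;     // property values
        uint32_t capacity;  // number of allocated slots
    };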

The size of a non-function JSObject is currently 40 bytes on 32-bit platforms and 72 bytes on 64-bit platforms.  Brian Hackett is working to reduce that to 16 bytes and 32 bytes respectively. Function JSObjects are a little larger, being (internally) a sub-class of JSObject called JSFunction.  JSFunctions will therefore benefit from the shrinking of JSObject, and Brian is slimming down the function-specific parts as well.  In fact, these changes are complete in the JaegerMonkey repository, and will likely be merged into mozilla-central early in the Firefox 11 development period.

As for the slots arrays, they are currently arrays of “fatvals”.  A fatval is a 64-bit internal representation that can hold any JS value — number, object, string, whatever.  (See here for details, scroll down to “Mozilla’s New JavaScript Value Representation”;  the original blog entry is apparently no longer available.)  64 bits per entry is overkill if you know, for example, that you have an array consisting entirely of integers that could fit into 32 bits.  Luke Wagner and Brian Hackett have been discussing a specialized representation to take advantage of such cases.  Variations on this idea have been tried twice before and failed, but perhaps SpiderMonkey’s new type inference support will provide the right infrastructure for it to happen.
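
To illustrate the fatval idea, here’s a sketch with made-up tag values (not SpiderMonkey’s actual encoding).  A 64-bit value can store a double directly, and can smuggle every other type into NaN bit patterns that no real computation produces:

    #include <cstdint>

    using FatVal = uint64_t;

    // Hypothetical tags: any pattern with these top bits set is a NaN, so
    // the low 48 bits are free to carry a payload such as an int32.
    const FatVal TAG_MASK   = 0xFFFF000000000000ULL;
    const FatVal TAG_INT32  = 0xFFF1000000000000ULL;
    const FatVal TAG_OBJECT = 0xFFF2000000000000ULL;

    FatVal  fromInt32(int32_t i) { return TAG_INT32 | uint32_t(i); }
    bool    isInt32(FatVal v)    { return (v & TAG_MASK) == TAG_INT32; }
    int32_t toInt32(FatVal v)    { return int32_t(uint32_t(v)); }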

Shapes

There are a number of data structures within SpiderMonkey dedicated to making object property accesses fast.  The most important of these are Shapes.  Each Shape corresponds to a particular property that is present in one or more JS objects.  Furthermore, Shapes are linked into linear sequences called “shape lineages”, which describe object layouts.  Some shape lineages are shared and live in “property trees”.  Other shape lineages are unshared and belong to a single JS object;  these are “in dictionary mode”.
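
Here’s a rough sketch of a shape lineage, with illustrative types rather than SpiderMonkey’s actual Shape:

    #include <cstdint>

    // Each Shape adds one property on top of its parent, so following the
    // parent links from an object's most recent Shape enumerates the
    // object's entire layout.
    struct ShapeSketch {
        const char*  name;    // the property this Shape introduces
        uint32_t     slot;    // index into the owning object's slots array
        ShapeSketch* parent;  // previous property in the lineage
    };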

The “shapes/tree” and “shapes/dict” entries in about:memory measure the memory consumption for all Shapes.  Shapes of both kinds are the same size;  currently they are 40 bytes on 32-bit platforms and 64 bytes on 64-bit platforms.  But Brian Hackett has also been taking a hatchet to Shape, reducing them to 24 bytes and 40 bytes respectively.  This has required the creation of a new auxiliary BaseShape type, but there should be many fewer BaseShapes than there are Shapes.  This change will also increase the number of Shapes, but should result in a space saving overall.

SpiderMonkey often has to search shape lineages, and for lineages that are hot it creates an auxiliary hash table, called a “property table”, that makes lookups faster.  The “shapes-extra/tree-tables” and “shapes-extra/dict-tables” entries in about:memory measure these tables.  Last Friday I landed a patch that avoids building these tables if they only have a few items in them;  in that case a linear search is just as good.  This reduced the amount of memory consumed by property tables by about 20%.

I mentioned that many Shapes are in property trees.  These are N-ary trees, but most Shapes in them have zero or one child;  only a small fraction have more than that, but the maximum N can be hundreds or even thousands.  So there’s a long-standing space optimization where each shape contains (via a union) a single Shape pointer which is used if it has zero or one child.  But if the number of children increases to 2 or more, this is changed into a pointer to a hash table, which contains pointers to the N children.  Until recently, if a Shape had a child deleted and that reduced the number of children from 2 to 1, it wouldn’t be converted from the hash form back to the single-pointer.  I changed this last Friday.  I also reduced the minimum size of these hash tables from 16 to 4, which saves a lot of space because most of them only have 2 or 3 entries.  These two changes together reduced the size of the “shapes-extra/tree-shape-kids” entry in about:memory by roughly 30–50%.
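
The two-form child representation might look roughly like this (an illustrative sketch, not the actual code):

    struct ShapeSketch;    // as in the sketch above
    struct KidHashSketch;  // hash table of child Shapes (a "KidHash")

    // A property-tree Shape stores its children in one of two forms: a
    // single pointer while it has zero or one child, or a pointer to a
    // hash table once it acquires two or more.
    union KidsSketch {
        ShapeSketch*   single;  // 0 or 1 children
        KidHashSketch* hash;    // 2+ children; minimum capacity now 4, not 16
    };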

Scripts

Internally, a JSScript represents (more or less) the code of a JS function, including things like the internal bytecode that SpiderMonkey generates for it.  The memory used by JSScripts is measured by the “gc-heap/scripts” and “script-data” entries in about:memory.

Luke Wagner did some measurements recently that showed that most (70–80%) JSScripts created in the browser are never run.  In hindsight, this isn’t so surprising — many websites load libraries like jQuery but only use a fraction of the functions in those libraries.  It wouldn’t be easy, but if SpiderMonkey could be changed to generate bytecode for scripts lazily, it could reduce “script-data” memory usage by 60–70%, as well as shaving non-trivial amounts of time when rendering pages.

Trace JIT

TraceMonkey is SpiderMonkey’s original JIT compiler, which was introduced in Firefox 3.5.  Its memory consumption is measured by the “tjit-*” entries in about:memory.

With the improvements that type inference made to JaegerMonkey, TraceMonkey simply isn’t needed any more.  Furthermore, it’s a big hairball that few if any JS team members will be sad to say goodbye to.  (js/src/jstracer.cpp alone is over 17,000 lines and over half a megabyte of code!)

TraceMonkey was turned off for web content JS code when type inference landed.  And then it was turned off for chrome code.  And now it is not even built by default.  (The about:memory snapshot above was from a build just before it was turned off.)  And it will be removed entirely early in the Firefox 11 development period.

As well as saving memory for trace JIT code and data (including the wasteful ballast hack required to avoid OOM crashes in Nanojit, ugh), removing all that code will significantly shrink the size of Firefox’s code.  David Anderson told me the binary of the standalone JS shell is about 0.5MB smaller with the trace JIT removed.

Method JIT

JaegerMonkey is SpiderMonkey’s second JIT compiler, which was introduced in Firefox 4.0.  Its memory consumption is measured by the “mjit-code/*” and “mjit-data” entries in about:memory.

JaegerMonkey generates a lot of code.  This situation will hopefully improve with the introduction of IonMonkey, which is SpiderMonkey’s third JIT compiler.  IonMonkey is still in early development and won’t be integrated for some time, but it should generate code that is not only much faster, but much smaller.

GC heap

There is a great deal of work being done on the JS garbage collector, by Bill McCloskey, Chris Leary, Terrence Cole, and others.  I’ll just point out two long-term goals that should reduce memory consumption significantly.

First, the JS heap currently has a great deal of wasted space due to fragmentation, i.e. intermingling of used and unused memory.  Once moving GC — i.e. the ability to move things on the heap — is implemented, it will pave the way for a compacting GC, which is one that can move live things that are intermingled with unused memory into contiguous chunks of memory.  This is a challenging goal, especially given Firefox’s high level of interaction between JS and C++ code (because moving C++ objects is not feasible), but one that could result in very large savings, greatly reducing the “gc-heap/arena/unused” and “gc-heap-chunk-*-unused” measurements in about:memory.

Second, a moving GC is a prerequisite for a generational GC, which allocates new things in a small chunk of memory called a “nursery”.  The nursery is garbage-collected frequently (this is cheap because it’s small), and objects in the nursery that survive a collection are promoted to a “tenured generation”.  Generational GC is a win because in practice the majority of things allocated die quickly and are not promoted to the tenured generation.  This means the heap will grow more slowly.
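
Here’s a minimal sketch of the nursery’s “bump” allocation, which is what makes minor collections so cheap (illustrative, not SpiderMonkey’s eventual design):

    #include <cstddef>

    struct NurserySketch {
        char* cursor;  // next free byte
        char* end;     // end of the nursery chunk

        // Allocation is a pointer bump.  When the nursery fills up, a cheap
        // minor GC copies the few survivors into the tenured generation and
        // resets the cursor.
        void* allocate(size_t bytes) {
            if (cursor + bytes > end)
                return nullptr;  // full: caller triggers a minor GC
            void* p = cursor;
            cursor += bytes;
            return p;
        }
    };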

Is that all?

It’s all I can think of right now.  If I’ve missed anything, please add details in the comments.

There’s an incredible amount of work being done on SpiderMonkey at the moment, and a lot of it will help reduce Firefox’s memory consumption.  I can’t wait to see what SpiderMonkey looks like in 6 months!


MemShrink progress, week 9

Firefox 8 graduated to the Aurora channel this week, and the development period for what will become Firefox 9 began.  Lots of MemShrink activity happened this week, and I think all the changes listed below will make it into Firefox 8.

Avoiding Wasted Memory

I have blogged previously about memory wasted by “clownshoes” bugs.   Ed Morley found a webpage that resulted in 700MB of memory being wasted by the PLArena clownshoes bug.  Basically, on platforms where jemalloc is used (Windows, Linux), half the memory allocated by nsPresArena (which is built on top of PLArena) was wasted.  (On Mac the waste was 11%, because the Mac allocator rounds up less aggressively than jemalloc).
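
To see the arithmetic behind the waste, consider an arena that requests a power-of-two payload plus a small bookkeeping header, on an allocator that rounds requests up to the next power of two.  The sizes below are illustrative, but this is the shape of the bug:

    #include <cstddef>
    #include <cstdio>

    // Round a request up to the next power of two, as jemalloc does for
    // allocations in this size range.
    static size_t roundUpPow2(size_t n) {
        size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    int main() {
        size_t request = 4096 + 32;             // arena payload + header
        size_t granted = roundUpPow2(request);  // 8192
        printf("wasted %zu of %zu bytes (%.0f%%)\n", granted - request,
               granted, 100.0 * (granted - request) / granted);
        return 0;
    }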

Fixing this problem properly for all PLArenas takes time because it requires changes to NSPR, so I made a spot-fix for the nsPresArena case.  This is a particularly big win on very large pages, but it saves around 3MB even on Gmail. This spot-fix has been granted beta approval and so will, barring disaster, make it into Firefox 7.

A Firefox Nightly user did some measurements with different browsers on the problematic page:

  • Firefox 8.0a1 before patch: 2.0 GB
  • Firefox 8.0a1 after patch: 1.3 GB
  • Latest Chrome canary build and dev (15.0.849.0): 1.1 GB
  • Webkit2Process of Safari 5.1: 1.05 GB
  • Internet Explorer 9.0.2: 838 MB
  • Latest Opera Next 12.00: 727 MB

So this fix gets Firefox within spitting distance of other browsers, which is good!

In other developments related to avoiding wasted memory:

  • Luke Wagner discovered that, on typical websites, most JSScripts are byte-compiled but never run.  A JSScript roughly corresponds to a JavaScript function.  In hindsight, it’s not such a surprising result — Firefox byte-compiles all loaded JavaScript code, and you can imagine lots of websites use libraries like jQuery but only use a small fraction of the functions in the library.  Making byte-compilation lazy could potentially save MBs of memory per compartment.  But that will require some non-trivial reworking of the JS engine, and so is unlikely to happen in the short-term.
  • Kyle Huey avoided a small amount (~100KB per browser process) of waste due to rounding up in XPT arenas.

Improving about:memory

I made some progress on a Valgrind tool to help identify the memory that is currently reported only by the “heap-unclassified” bucket in about:memory.  It’s called “DMD”, short for “Dark Matter Detector”.  It’s in early stages and I still need to teach it about most of Firefox’s memory reporters, but it’s already spitting out useful data, which led to me and Ehsan Akhgari landing memory reporters for the JS atom table and the Hunspell spell checker.  We also now have some insight (here and here) about memory usage for very large pages.

Mounir Lamouri turned on the memory reporter for the DOM that he’s been working on for some time.  This shows up under “dom” in about:memory.  There are still some cases that require handling;  you can follow the progress of these here.

Andrew McCreight replaced about:memory’s buttons so you can force a cycle collection without also forcing a garbage collection, which may be useful in hunting down certain problems.

Finally, Sander van Veen added the existing “js-compartments-user” and “js-compartments-system” to the statistics collected by telemetry (his first landed patch!), and I did likewise for the “storage/sqlite” reporter.  I also added a new “tjit-data/trace-monitor” memory reporter that accounts for some of the memory used by TraceMonkey.

Miscellaneous

Igor Bukanov tweaked the handling of empty chunks by the JavaScript garbage collector.  That sounds boring until you see the results on Gregor Wagner’s 150-tab stress test: resident memory usage dropped 9.5% with all 150 tabs open, and dropped by 27% after all those tabs were closed.

Brian Hackett fixed a memory leak in type inference, which gets it one step closer to being ready to land.

Christian Höltje fixed a leak in his “It’s All Text” add-on that was causing zombie compartments.  This fix will be in version 1.6.0, which is currently awaiting AMO approval, but can be obtained here in the meantime.  This fix and last week’s fix of a memory leak in LastPass are very encouraging — per-compartment reporters in about:memory have, for the first time, given add-on developers a reasonable tool for identifying memory leaks.  I hope we can continue to improve the situation here.  Several people have asked me for documentation on how to avoid memory leaks in add-ons.  I’m not the person to write that guide (I’m not a Gecko expert and I know almost nothing about add-ons) but hopefully someone else can step up to the plate.

Bug counts

Here’s the change in MemShrink bug counts.

  • P1: 30 (-0, +1)
  • P2: 64 (-4, +6)
  • P3: 36 (-5, +0)
  • Unprioritized: 1 (-2, +1)

Good progress on P3 bugs, but they’re the least important ones.  Other than that, new bugs are still being reported faster than they’re being fixed.  If you want to help but don’t know where to start, feel free to email me or ping me on IRC and I’ll do my best to help get you involved.


You make what you measure

Inspired by Shaver’s patch, I implemented some per-compartment stats in about:memory.  I then visited TechCrunch because I know it stresses the JS engine.  Wow, there are over 70 compartments!  There are 20 stories on the front page.  Every story has a Facebook “like” button, a Facebook “send” button, and a Google “+1” button, and every button gets its own compartment.

That sounds like a bug, but it’s probably not.  Nonetheless, every one of those buttons had an entry in about:memory that looked like this:

Old compartment measurements

(The ‘object-slots’ and ‘scripts’ and ‘string-chars’ measurements are also new, courtesy of bug 571249.)

Ugh, 255,099 bytes for a compartment that has only 971 bytes (i.e. not much) of scripts?  Even worse, this is actually an underestimate because there is another 68KB of tjit-data memory that isn’t being measured for each compartment.  That gives a total of about 323KB per compartment.  And it turns out that no JIT compilation is happening for these compartments, so all that tjit-data and mjit-code space is allocated but unused.

Fortunately, it’s not hard to avoid this wasted space, by lazily initializing each compartment’s TraceMonitor and JaegerCompartment.  With that done, the entry in about:memory looks like this:

New compartment measurements

That’s an easy memory saving of over 20MB for a single webpage.  The per-compartment memory reporters haven’t landed yet, and may not even land in their current form, but they’ve already demonstrated their worth.  You make what you measure.
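
For the curious, the lazy initialization is just the standard pattern.  Here’s a sketch with illustrative names, not SpiderMonkey’s actual types:

    struct TraceMonitorSketch { /* per-compartment trace JIT state */ };

    struct CompartmentSketch {
        TraceMonitorSketch* traceMonitor = nullptr;

        // Allocate the JIT state the first time it is needed, so that
        // compartments which never run JIT-compiled code never pay for it.
        TraceMonitorSketch* getTraceMonitor() {
            if (!traceMonitor)
                traceMonitor = new TraceMonitorSketch();
            return traceMonitor;
        }
    };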


You lose more when slow than you gain when fast

SpiderMonkey is Firefox’s JavaScript engine.  In Firefox 3.0 and earlier versions, it was just an interpreter.  In Firefox 3.5, a tracing JIT called TraceMonkey was added.  TraceMonkey was able to massively speed up certain parts of programs, such as tight loops;  parts of programs it couldn’t speed up continued to be interpreted.  TraceMonkey provided large speed improvements, but JavaScript performance overall still didn’t compare well with that of Safari and Chrome, both of which used method JITs that had worse best-case performance than TraceMonkey, but much better worst-case performance.

If you look at the numbers, this isn’t so surprising.  If you’re 10x faster than the competition on half the cases, and 10x slower on half the cases, you end up being 5.05x slower overall.  Firefox 4.0 added a method JIT, JaegerMonkey, which avoided those really slow cases, and Firefox’s JavaScript performance is now very competitive with other browsers.
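
To spell out that arithmetic, take the competition’s time per case as 1, and suppose half the cases run 10x faster and half run 10x slower:

    \[
    \text{relative time} = \frac{1}{2} \cdot \frac{1}{10} + \frac{1}{2} \cdot 10 = 0.05 + 5 = 5.05
    \]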

The take-away message:  you lose more when slow than you gain when fast. Your performance is determined primarily by your slowest operations.  This is true for two reasons.  First, in software you can easily get such large differences in performance: 10x, 100x, 1000x and more aren’t uncommon.  Second, in complex programs like a web browser, overall performance (i.e. what a user feels when browsing day-to-day) is determined by a huge range of different operations, some of which will be relatively fast and some of which will be relatively slow.

Once you realize this, you start to look for the really slow cases.  You know, the ones where the browser slows to a crawl and the user starts cursing and clicking wildly and holy crap if this happens again I’m switching to another browser.  Those are the performance effects that most users care about, not whether their browser is 2x slower on some benchmark.  When they say “it’s really fast”, they probably actually mean “it’s never really slow”.

That’s why memory leaks are so bad — because they lead to paging, which utterly destroys performance, probably more than anything else.

It also makes me think that the single most important metric when considering browser performance is page fault counts.  Hmm, I think it’s time to look again at Julian Seward’s VM profiler and the Linux perf tools.


Multi-faceted JavaScript speed improvements

Firefox 4.0 beta 7’s release announcement was accompanied by the following graphs that show great improvements in JavaScript speed:

Fx4b7 JavaScript speed-ups

Impressive!  The graphs claim speed-ups of 3x, 3x and 5x;  by my calculations the more precise numbers are 3.49x, 2.94x and 5.24x.

The SunSpider and V8bench results are no surprise to anyone who knows about JägerMonkey and has been following AWFY, but the excellent Kraken results really surprised me.  Why?

  • SunSpider and V8bench have been around for ages.  They are the benchmarks most commonly used (for better or worse) to gauge JavaScript performance and so they have been the major drivers of performance improvements.  To put it more bluntly, like all the other browser vendors, we tune for these benchmarks a lot.  In contrast, Kraken was only released on September 14th, and so we’ve done very little tuning for it yet.
  • Unlike SunSpider and V8bench, Kraken contains a lot of computationally intensive code such as image and audio processing.  These benchmarks are dominated by tight loops containing numerous array accesses.  As a result, they trace really well, and so even 4b7 spends most of its Kraken time (I’d estimate 90%+) in code generated by TraceMonkey, the trace JIT.

We can draw two happy conclusions from Kraken’s improvement.

  • Our speed-ups apply widely, not just to SunSpider and V8bench.
  • Our future performance eggs are not all in one basket: the JavaScript team has made and will continue to make great improvements to the non-JägerMonkey parts of the JavaScript engine.

Firefox 4.0 is going to be a great release!


cg_diff: a differential profiling tool

I frequently use the SunSpider and V8 benchmark suites to compare the speed of different versions of TraceMonkey.  The best metric for speed comparisons is always execution time.  However, measuring execution time on modern machines is unreliable — you get lots of variation between runs.  This is a particular problem in this case because the run-times of these benchmarks are very small — SunSpider takes less than 700 ms on my laptop, and V8 takes about 6.5 seconds.  Run-to-run variations can be larger than the difference I’m trying to measure.  This is annoying:  the best speed metric cannot be measured exactly.

So I frequently use Cachegrind to measure the number of executed instructions.  This is a worse metric than execution time — the number of instructions doesn’t directly relate to the execution time, although it’s usually a good indicator — but it has the advantage that it can be measured exactly.  Most of the SunSpider and V8 tests are deterministic, and if I measure them twice in a row I’ll get the same result.  Cachegrind also gives instruction counts on a per-function and per-line basis, which is very useful.

So I often run Cachegrind on two different versions of TraceMonkey:  an unchanged copy of the current repository tip, and a copy of the current repository tip with a patch applied.  I can then compare the results and get a very precise idea of how the patch affects performance.

However, comparing the output of two Cachegrind runs manually is a pain.  For example, here is part of Cachegrind’s output (lightly edited for clarity) for crypto-md5.js with an unchanged repository tip (as of a day or two ago):

--------------------------------------------------------------------------------
        Ir
--------------------------------------------------------------------------------
48,923,280  PROGRAM TOTALS
--------------------------------------------------------------------------------
        Ir  file:function
--------------------------------------------------------------------------------
 5,638,362  ???:???
 4,746,990  /build/buildd/eglibc-2.10.1/string/../sysdeps/i386/i686/strcmp.S:strcmp
 2,032,069  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
 1,899,298  jstracer.cpp:bool js::VisitFrameSlots<js::CountSlotsVisitor>(...)
 1,759,932  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
 1,232,425  jsops.cpp:js_Interpret
   885,168  jstracer.cpp:bool js::VisitFrameSlots<js::DetermineTypesVisitor>(...)
   871,197  jstracer.cpp:js::TraceRecorder::set(int*, nanojit::LIns*, bool)
   812,419  /build/buildd/eglibc-2.10.1/iconv/gconv_conf.c:insert_module
   758,034  jstracer.cpp:js::TraceRecorder::monitorRecording(JSOp)

At the top we have the total instruction count, and then we have the instruction counts for the top 10 functions.  The ???:??? entry represents code generated by TraceMonkey’s JIT compiler, for which there is no debug information.  “Ir” is short for “I-cache reads”, which is equivalent to “instructions executed”.

Cachegrind tracks a lot more than just instruction counts, but I’m only showing them here to keep things simple. It also gives per-line counts, but I’ve omitted them as well.

And here is the corresponding output when a patch from bug 575529 is applied:

--------------------------------------------------------------------------------
        Ir
--------------------------------------------------------------------------------
42,332,998  PROGRAM TOTALS
--------------------------------------------------------------------------------
       Ir  file:function
--------------------------------------------------------------------------------
4,746,990  /build/buildd/eglibc-2.10.1/string/../sysdeps/i386/i686/strcmp.S:strcmp
4,100,366  ???:???
1,687,434  jstracer.cpp:bool js::VisitFrameSlots(js::CountSlotsVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
1,343,085  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
1,229,853  jsops.cpp:js_Interpret
1,137,981  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
  868,855  jstracer.cpp:js::TraceRecorder::set(int*, nanojit::LIns*, bool)
  812,419  /build/buildd/eglibc-2.10.1/iconv/gconv_conf.c:insert_module
  755,753  jstracer.cpp:js::TraceRecorder::monitorRecording(JSOp)
  575,200  jsscan.cpp:js::TokenStream::getTokenInternal()

It’s easy to see that the total instruction count has dropped from 48.9M to 42.3M, but seeing the changes at a per-function level is more difficult. For a long time I would make this comparison manually by opening the two files side-by-side and reading carefully.  Sometimes I’d also do some cutting-and-pasting to reorder entries. The whole process was tedious, but the information revealed is so useful that I did it anyway.

Then three months ago David Baron asked on Mozilla’s dev-platform mailing list if anybody knew of any good differential profiling tools. This prompted me to realise that I wanted exactly such a tool for Cachegrind. Furthermore, as Cachegrind’s author, I was in a good place to understand exactly what was necessary 🙂

The end result is a new script, cg_diff, that can be used to compute the difference between two Cachegrind output files. Here’s part of the difference between the above two versions:

--------------------------------------------------------------------------------
        Ir
--------------------------------------------------------------------------------
-6,590,282  PROGRAM TOTALS
--------------------------------------------------------------------------------
        Ir  file:function
--------------------------------------------------------------------------------
-1,537,996  ???:???
  -894,088  jstracer.cpp:js::TraceRecorder::determineSlotType(int*)
  -416,847  jstracer.cpp:js::TraceRecorder::checkForGlobalObjectReallocation()
  -405,271  jstracer.cpp:bool js::VisitFrameSlots(js::DetermineTypesVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
  -246,047  nanojit/Containers.h:nanojit::StackFilter::read()
  -238,121  nanojit/Assembler.cpp:nanojit::Assembler::registerAlloc(nanojit::LIns*, int, int)
   230,419  nanojit/LIR.cpp:nanojit::interval::of(nanojit::LIns*)
  -226,070  nanojit/Assembler.cpp:nanojit::Assembler::asm_leave_trace(nanojit::LIns*)
  -211,864  jstracer.cpp:bool js::VisitFrameSlots(js::CountSlotsVisitor&, JSContext*, unsigned int, js::FrameRegsIter&, JSStackFrame*)
  -200,742  nanojit/Assembler.cpp:nanojit::Assembler::findRegFor(nanojit::LIns*, int)

This makes it really easy to see what’s changed. Negative values mean that the instruction count dropped, positive numbers mean that the instruction count increased.

I’ve been using this script for a while now, and it’s really helped me analyse the performance effects of my patches. Indeed, I have some scripts set up so that, with a single command, I can run all of SunSpider through two different versions of TraceMonkey and produce both normal profiles and difference profiles. I can also get high-level instruction comparisons such as the one in this Bugzilla comment.

And now everybody else can use cg_diff too, because I just landed it on the Valgrind trunk.  If you want to try it, follow these instructions to set up a copy of the trunk.  And note that if you want to compare two versions of a program that sit in different directories (as opposed to profiling a program, modifying it, then reprofiling it) you’ll need to use cg_diff’s --mod-filename option to get useful results.  Feel free to ask me questions (via email, IRC or in the comments below) if you have troubles.
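
For the record, an invocation looks something like this (the file names and the --mod-filename expression are illustrative):

    # Each Cachegrind run writes a cachegrind.out.<pid> file.
    valgrind --tool=cachegrind ./js-before run-sunspider.js
    valgrind --tool=cachegrind ./js-after  run-sunspider.js

    # Compute the difference, then annotate it as usual.  --mod-filename
    # takes a Perl search-and-replace expression for canonicalizing paths
    # that differ between the two build directories.
    cg_diff --mod-filename='s/js-(before|after)/js/' \
        cachegrind.out.1234 cachegrind.out.5678 > diff.out
    cg_annotate diff.out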

Happy differencing!


Don’t be stupid

Good compilers are complicated, and generating good code is hard.  Clever optimisations play their part, but it can be equally important to not do things that are stupid.

Back in March I noticed this awful code being generated by TraceMonkey for access-fannkuch.js, one of the SunSpider benchmarks:

 ld16 = ld sp[-152]
 sti sp[-152] = ld16
 ld17 = ld sp[-128]
 sti sp[-128] = ld17
 ld18 = ld sp[-88]
 sti sp[-88] = ld18
 ld19 = ld sp[-80]
 sti sp[-80] = ld19
 sti sp[-72] = addxov1
 ld20 = ld sp[-64]
 sti sp[-64] = ld20
 ld21 = ld sp[-48]
 sti sp[-48] = ld21
 sti sp[-32] = ld15
 ld22 = ld sp[-8]
 sti sp[-8] = ld22
 j -> label1

A number of the trace fragments compiled for that benchmark ended similarly.  Each load is immediately stored straight back to the same stack slot, so these instruction pairs achieve nothing.  I filed a bug and investigated a little at the time but didn’t understand the responsible code well enough to fix it.

This week I took another look, and was glad that I did.  Thanks to a better understanding of the responsible code, I was able to fix it with only a 29-line patch (10 lines of which were comments).  Even better, it turns out this fix not only affected the small number of egregiously stupid cases such as the one above;  it also removed a lot of redundant stores that were less obvious.  The net result was about a 2% speedup for the SunSpider benchmark suite and 3% for the V8 benchmark suite.  That’s a satisfying patch!


A win for code hygiene

I’ve written recently about clean-ups I’ve been making in Nanojit — simplifying code, refactoring, adding more internal sanity checking.  I’m a firm believer that these kinds of changes are worth doing, that “if it ain’t broke, don’t fix it” is not always true.  But there’s always a voice in the back of my head wondering “am I just polishing this code for little gain?  Should I be working on something else instead?”  Yesterday I received confirmation that this work has been worthwhile.  But before I can explain why, I need to give some background.

Nanojit serves as the back-end of TraceMonkey.  It converts LIR (a low-level intermediate representation) generated by TraceMonkey’s front-end into native code.  So Nanojit is a critical part of TraceMonkey.  And TraceMonkey is a critical part of Firefox.

However, code generators (and register allocators) are tricky beasts — easy to get wrong and hard to debug.  (A long time ago I heard someone, I can’t remember who, say “code generators and garbage collectors should crash as early and noisily as possible”.)  So with a piece of code this important and inherently difficult, it’s a good idea to arm yourself with some weapons that give you confidence that it is correct.

Weapon 1: clean IR semantics

One way to do this is to give your IR clean semantics.  LIR’s semantics are mostly clean, but there are a few nasty cases.  One of the worst is the ‘ov’ instruction.  It performs an overflow check on an arithmetic (add, mul, sub, neg) operation.

JavaScript numbers are doubles by default, but TraceMonkey sometimes demotes them to integers for speed.  TraceMonkey has to check for integer overflow in these cases;  if an integer does overflow TraceMonkey has to step back and use doubles for the computation.

Here’s some example LIR that uses ‘ov’:

 ld6 = ld sp[-8]
 add1 = add ld6, 1
 ov1 = ov add1
 xt4: xt ov1 -> pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)

The first instruction loads a value from the stack.  The second instruction adds 1 to that value.  The third instruction performs an overflow check and puts the result in ‘ov1’.  The fourth instruction is a guard;  if ‘ov1’ is true it exits the code block.

So why is ‘ov’ semantically nasty?  First consider ‘add’, which is a semantically cleaner instruction.  Its result (in this case ‘add1’) is a function of its inputs (‘ld6’ and 1).  In comparison, ‘ov’ does not have this property.  The ‘ov’ doesn’t really take ‘add1’ as an input —  you can’t tell by looking at ‘add1’ whether the addition overflowed.  In fact you can’t really understand what is happening here at the LIR level;  you have to know what happens in the native code that is generated from this LIR.  It turns out that the native code for the ‘add’ also sets some condition codes, and the native code generated for the ‘ov’/’xt’ pair inspects those condition codes.  For this to work, the ‘add’ has to immediately precede the ‘ov’;  if it does not, whatever intermediate native code that is generated could overwrite the condition codes, in which case the behaviour of the guard becomes completely unpredictable.  This is obviously bad.

The real problem is that the addition has two outputs:  the result, and the overflow status indicator (which maps to condition codes in the hardware).  But LIR has no explicit way to represent that second output.  So the second output is implicit, and an extra constraint must be imposed on the LIR instead, i.e. ‘ov’ must immediately follow an arithmetic operation.  Because this constraint is so arbitrary it’s easy to forget and easy to break.

In May 2009 Julian Seward opened a bug to improve LIR’s semantics, giving five suggestions.  It generated a lot of discussion and Julian drafted a patch to replace ‘ov’ with some cleaner opcodes, but there was enough disagreement that it never went anywhere.  But I didn’t forget about it, and ‘ov’ has annoyed me enough times recently that I decided to resurrect Julian’s idea, this time in a new bug to escape the baggage of the old one.  The idea is simple: if ‘ov’ always has to follow an arithmetic operation, then why not create new opcodes that fuse the two parts together?  These new opcodes are ‘addxov’, ‘subxov’ and ‘mulxov’.  With my patch the code from above now looks like this:

 ld6 = ld sp[-8]
 add1 = addxov ld6, 1 -> pc=0x883ed8 imacpc=0x0 sp+0 rp+0 (GuardID=218)

The ‘addxov’ adds ‘ld6’ and 1, puts the result in ‘add1’, and exits if there was an overflow.  The generated native code is unchanged, but the implicit output of the addition is now hidden within a single LIR instruction, and it’s impossible for overflow checks in the native code to become separated from the potentially-overflowing arithmetic operation.

Weapon 2: strong IR sanity checking

Compilers are usually built in a pipeline fashion, where you have multiple passes over the code representation, and each pass does a task such as optimisation, register allocation or code generation.  It’s also a really good idea to have a pass that does a thorough sanity check of the code representation.

In November 2008 Jim Blandy filed a bug suggesting such a pass for Nanojit, one that performs type-checking of the LIR.  The bug sat untouched for over six months until some extra discussion occurred and then (once again) Julian wrote a patch.  It found at least one case where TraceMonkey was generating bad LIR code, but again the bug stalled, this time because we spent three months merging Mozilla’s and Adobe’s copies of Nanojit.  I resurrected the bug again recently and added some “structure checking” for things like the ‘ov’ constraint, and it landed in the TraceMonkey repository on January 21st.  Happily, my version of the checker was simpler than Julian’s;  this was possible because LIR had had some other semantic clean-ups (e.g. properly distinguishing 64-bit integers from 64-bit floats).  My patch replaced a very basic type-checker (called “SanityFilter”) that had been in TraceMonkey, and immediately found two bugs, one of which involved ‘ov’.
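
To give a flavour of what such a structure check looks like, here’s an illustrative sketch, not the actual Nanojit code:

    #include <cassert>
    #include <cstddef>
    #include <vector>

    // A toy LIR instruction; Nanojit's real representation differs.
    struct Ins {
        enum Op { Add, Sub, Mul, Ov, Other } op;
        const Ins* arg;  // first operand, if any
        bool isArith() const { return op == Add || op == Sub || op == Mul; }
    };

    // Every 'ov' must take the immediately preceding instruction as its
    // argument, and that instruction must be arithmetic;  otherwise the
    // condition codes it inspects may have been clobbered.
    void checkOvConstraint(const std::vector<Ins>& code) {
        for (size_t i = 0; i < code.size(); i++) {
            if (code[i].op == Ins::Ov)
                assert(i > 0 && code[i].arg == &code[i - 1] &&
                       code[i - 1].isArith());
        }
    }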

Synergy

It’s not often that a bug report involving your code makes you smile.  Yesterday Gary Kwong hit an assertion failure in Nanojit when fuzz-testing:

    LIR structure error (end of writer pipeline):
    in instruction with opcode: ov
    argument 1 has opcode: add
    it should be: located immediately prior, but isn't
  One way to debug this:  change the failing NanoAssertMsgf(0, ...) call to a
  printf(...) call and rerun with verbose output.  If you're lucky, this error
  message will appear before the block containing the erroneous instruction.

In other words, there’s an ‘ov’ following an ‘add’, but there are one or more other LIR instructions between them.  This means that the code generated will be bogus, as I explained above.  This bug may have been in TraceMonkey for a long time, but it wasn’t detected until strong internal sanity checking was added.  The good news is that the ‘ov’-removal patch (which hasn’t landed as it’s still awaiting review) will fix this bug.

Lessons learned

This story tied together a number of related ideas for me.  In particular:

  • Clean IR semantics are worth the effort.  Hacks may save time initially but they will come back to bite you.
  • Strong IR sanity checking is worth the effort.  (And cleaner semantics allow for stronger checking.)
  • Listen to Julian, especially when he talks about correctness.  (And read his code!  VEX/pub/libvex_ir.h in the Valgrind source code describes Valgrind’s IR, which is a wonderfully clean IR, particularly in the way it separates statements (which have side-effects) from expressions (which don’t).)
  • Don’t put too many ideas into one bug report.  Too much discussion can kill a bug, or at least put it in a coma.
  • Follow-through is important.  A patch can’t do much good unless it lands.
  • Fuzz testing is awesome (but we already knew that).