TraceVis: performance visualization for TraceMonkey

I’ve been working on a visualization of TraceMonkey performance, with the goal of revealing what the JS VM is doing, and why it runs certain programs fast or slow, so we can figure out how to make the slow ones fast too. In this post, I want to show off the results and explain how to read them, hopefully explaining a bit about how TraceMonkey works in general while I’m at it.

Background on TraceMonkey. First, I need to explain how TraceMonkey works at a high level. Readers who already know the basic ideas behind TraceMonkey can skip to the next section.

Before TraceMonkey (TM), we had an interpreter. The fundamental idea of TM is to select hot (frequently run) regions of JS code, compile them to fast native code (x86, ARM), and then use the native code for those regions.

In all compilers, generating fast native code requires a lot of information about the program’s run-time behavior. Specifically, the compiler needs to know what values are constant in what regions, what branches are or are not always taken, and, crucially for dynamic languages like JS, what types variables have in what regions. For example, in compiling JS a+b, if we don’t know the types of a and b, the compiler needs to generate native code to handle every type combination, along with type tests and branches on the run-time types. And then the compiler has to carry along extra bits to record the type, and previous operations that set a and b need to generate those extra bits. And so on. That’s a lot of native code, and it’s only slightly more efficient than the interpreter. Conversely, if the compiler knows the types are both, say, double, then the compiler can simply generate a native add instruction, which is as fast as possible.
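
To make that concrete, here is a rough JavaScript sketch (not TraceMonkey internals or generated code) contrasting what a generic a + b has to handle with what the type-specialized version can do; the object cases are simplified:

// Generic path: nothing is known about the operand types, so the code must
// test them at run time and branch to the right behavior.
function genericAdd(a, b) {
  if (typeof a === 'number' && typeof b === 'number') {
    return a + b;                      // numeric add
  }
  if (typeof a === 'string' || typeof b === 'string') {
    return String(a) + String(b);      // string concatenation
  }
  // Booleans, null, undefined (objects are more involved and omitted here).
  return Number(a) + Number(b);
}

// Specialized path: the compiler has proven both operands are doubles, so
// this boils down to a single native add instruction.
function doubleAdd(a, b) {
  return a + b;
}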

It’s hard for the compiler to figure out types and other such information for dynamic languages like JS. One reason it’s hard is the lack of type declarations, but there are others, such as the ability of eval to create new variables with values of any type at any time.

TraceMonkey solves this problem by collecting information for the compiler dynamically. That is, when TM wants to compile a certain region of code, it actually runs that region in the interpreter. As the interpreter runs, TM records the path taken through the code and all the types and constant values seen. The result is a linear trace through the code with type, value, and branch annotations. The compiler (nanojit) can then relatively easily compile the trace to fast native code.
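
As a conceptual illustration (real traces are nanojit LIR, not JavaScript, and the guard and state names here are invented), a recorded trace for a simple loop like for (var i = 0; i < n; ++i) sum += i; corresponds roughly to a straight-line body with the observed types and branches baked in as guards:

// On trace, a failed guard means "take a side exit back to the interpreter".
function guard(condition) {
  if (!condition) throw new Error('side exit');
}

// One compiled trace: the loop body specialized to the types seen while
// recording, with guards where the recorded assumptions could break.
function traceLoopBody(state) {
  guard(typeof state.sum === 'number');   // types observed during recording
  guard(typeof state.i === 'number');
  state.sum += state.i;                   // specialized numeric add
  state.i += 1;
  guard(state.i < state.n);               // exit the trace when the loop ends
}

// The VM keeps re-entering the trace until a guard fails.
function runTrace(state) {
  try {
    for (;;) traceLoopBody(state);
  } catch (e) {
    // Guard failed: resume in the interpreter from the current state.
  }
}

var state = { sum: 0, i: 0, n: 100000 };
runTrace(state);
print('sum: ' + state.sum);               // 4999950000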

Now, for many programs this works wonderfully and the program runs 2-20x faster in TraceMonkey vs. the pure interpreter, but some programs don’t speed up or run more slowly. We need to understand why in order to improve TM.

TraceMonkey VM Activities. To better understand TM performance, I broke down what TM does into 6 major activities.

  • When TM starts running a program, it always starts by interpreting the program, exactly as in the non-tracing JS engine.
  • When execution reaches a point where TM might want to start a compiled trace, TM spends a bit of time monitoring the execution: checking to see if it already has a compiled region, counting the number of times the point has been passed, and deciding whether to start a trace. Monitoring is a kind of overhead: while monitoring, TM isn’t running the user’s program, but it is a necessary cost of finding and optimizing traces. (A sketch of this decision logic follows the list.)
  • If TM does decide to create a new compiled trace, it runs in the interpreter while recording the trace, including operations and types of values. During this time, it is running user code a little slower than the basic interpreter.
  • When the trace is finished, TM compiles the trace to native code. This is another form of overhead.
  • As I mentioned above, as part of monitoring, TM checks to see if it already has a compiled native trace starting at the current point. If so, TM selects the right trace and prepares to run it, which I call executing the trace. This is a third form of overhead.
  • Finally, TM can be running native code compiled previously. Compiled native traces run 2-20x faster than the interpreter, with a typical speedup factor of about 2.5.
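
Here is a simplified sketch of the monitoring decision mentioned above. The threshold, names, and bookkeeping are invented for illustration; the real logic lives in the TraceMonkey sources:

var HOT_LOOP_THRESHOLD = 2;    // assumed value, purely for illustration
var hitCounts = {};            // loop header id -> times reached
var compiledTraces = {};       // loop header id -> compiled native trace

function monitorLoopHeader(headerId) {
  if (compiledTraces[headerId]) {
    return 'execute';          // prepare and run the existing native trace
  }
  hitCounts[headerId] = (hitCounts[headerId] || 0) + 1;
  if (hitCounts[headerId] >= HOT_LOOP_THRESHOLD) {
    return 'record';           // interpret while recording a new trace
  }
  return 'interpret';          // not hot yet: keep interpreting normally
}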

Visualizing TraceMonkey Activities. Now, let’s see how that looks in a picture. I’m going to use 2loops.js, a sample program that computes the mean and variance of the numbers 0-99,999 using two separate loops:

var n = 100000;
var sum = 0;
for (var i = 0; i < n; ++i) {
  sum += i;
}
var sum_squares = 0;
for (var i = 0; i < n; ++i) {
  sum_squares += i * i;
}
var mean = sum / n;
var variance = sum_squares / n - mean * mean;
print('mean:     ' + mean);
print('variance: ' + variance);

Here is the TraceVis output. Click for a version large enough to actually read. The numbered boxes going clockwise around the chart show how to read each element of the chart and what the chart tells us about what TM is doing.


[Figure: annotated TraceVis chart for 2loops.js]

TraceVis on SunSpider. Click here for TraceVis output for all the SunSpider benchmarks. I give the speedup vs. pure interpretation on the front page so you can get a feel for what pictures go with slow and fast execution. Here are a few examples interpreted in detail:

crypto-sha1 traces very well, with a 6x speedup vs. the interpreter. The picture looks a lot like the picture for 2loops.js, but with more traces:


[Figure: TraceVis chart for crypto-sha1]

The middle purple and blue stripes are interesting: TM had to create 6 native traces before it was able to really switch to native code. Figuring out why this happens requires additional tools. In this case, because there aren't too many traces, debug spew (environment variable TRACEMONKEY=verbose) is readable. To find where all the traces start recording, I took a debug shell build and ran TRACEMONKEY=verbose dist/bin/js -j ../t/crypto-sha1.js | grep starting. I got:

recording starting from ../t/crypto-sha1.js:221@119
recording starting from ../t/crypto-sha1.js:152@29
recording starting from ../t/crypto-sha1.js:152@39
recording starting from ../t/crypto-sha1.js:63@154
recording starting from ../t/crypto-sha1.js:63@159
recording starting from ../t/crypto-sha1.js:90@5
recording starting from ../t/crypto-sha1.js:91@31
recording starting from ../t/crypto-sha1.js:92@52
recording starting from ../t/crypto-sha1.js:55@111
recording starting from ../t/crypto-sha1.js:177@34

There are 10 places where recording started, corresponding to the 10 purple bands in TraceVis. Numbering from 1, traces 4-9 are the ones for the middle band. The first two of these (traces 4 and 5) are for the loop starting at line 61, which is an inner loop. (Inner loops get hot before outer loops, so they tend to be traced first.) Those two traces cover two different paths through the if. Then the tracer discovers 3 different hot paths through the function sha1_ft around line 90, so they are traced as well (traces 6-8). Finally, an outer loop at line 53 gets hot and gets traced too (trace 9). At this point, there are enough traces to cover all the cases, so we get to stay on native code until the containing function returns.

Thus, the purple/blue banding pattern followed by a long stretch of green indicates the buildup of several traces through a loop to account for different paths, different types, or inner and outer loops. Once enough traces have been compiled to cover all the hot code of a loop (or set of nested loops), the chart goes green until the loop is done.
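
As a small, made-up example of the kind of control flow that produces this banding, an if inside an inner loop gives two hot paths (and thus two traces), and the enclosing loop gets its own trace once it becomes hot in turn:

var total = 0;
for (var i = 0; i < 1000; ++i) {        // outer loop: traced last
  for (var j = 0; j < 1000; ++j) {      // inner loop: gets hot first
    if ((i + j) % 2 === 0) {            // two hot paths -> two inner traces
      total += i * j;
    } else {
      total -= j;
    }
  }
}
print('total: ' + total);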

If you zoom in, you can actually see short bands of yellow and green in between the wide bands of purple and blue. This is because after compiling one of these traces, TM starts trying to execute the native code. Often TM gets several iterations in right away before it exits and needs to record more traces. Computers being fast and all, "several iterations" is usually a few microseconds or less.

3d-cube traces moderately well, achieving a 2.5x speedup over the interpreter. Let's see why it doesn't trace as well as sha1:


[Figure: TraceVis chart for 3d-cube]

Here we have the familiar purple/blue buildup of compiled traces, but once we finish compiling, we don't go solid green. Instead, we get a pattern that looks like blades of grass with red tips on a white background. Zooming in and looking at the vertical dimension, we can see that we get a bit of red (monitoring), then a bit of yellow (preparing to execute), then a strip of green (native code), then a bit of yellow (cleaning up after executing), then a bit of red (finish monitoring), then a strip of white (interpreter), and the pattern repeats.

This means we are repeatedly starting to execute native code, but then we are forced to leave the native code for the interpreter within a few microseconds. Evidently, the native code runs pretty fast, because we are getting a good total speedup even though we spend about 1/3 of our time in the base interpreter.

The next question is why we are leaving native code and returning to the interpreter, which this visualization doesn't answer. There are about a dozen different reasons why we can exit from native code. I have another prototype instrumentation patch that counts the different exit reasons and how much time we spend in the interpreter for each reason.

For 3d-cube, the vast majority (3808 of 3907) of these exits are for a reason I apparently called loop2 when I wrote the patch. As you may expect, I didn't know what loop2 actually meant, but after conferring with Andreas, I learned that loop2 means we are exiting an outer loop. When we exit an inner loop, we know we are still inside a loop, so we are probably in some hot code and should continue trying to record traces. But if we exit an outer loop, we are presumably back to one-shot code, which does not benefit from tracing.

But we exit this way 3808 times, which is impossible unless we are inside a loop. So something must be wrong. I used debug spew again, this time looking for "leaving trace" messages. Most of them are at lines 57 and 100 of 3d-cube.js. Line 57 is a loop inside a function DrawLine that appears to iterate over the pixels in a line. DrawLine is called only from a function DrawQube. DrawQube is called by Loop, which implements a loop by recursion, a pattern TraceMonkey currently doesn't support.

So, we can probably speed up 3d-cube even more by either (a) tracing recursion, (b) tracing tail recursion as loops (Loop is tail recursive), or (c) extending traces from outer loops if the outer loop exit becomes hot. And we didn't actually know this until now.
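
To make the structure concrete, here is a stripped-down, hypothetical sketch of the pattern (not the actual 3d-cube code): an inner loop that traces fine, driven by an outer "loop" written as tail recursion, so every exit from the inner trace lands back in untraced code:

function drawLine(length) {
  var pixels = 0;
  for (var p = 0; p < length; ++p) {    // inner loop: traces fine
    pixels += 1;
  }
  return pixels;
}

function loop(n, acc) {                 // outer "loop" by tail recursion:
  if (n === 0) return acc;              // not traceable, so each inner-loop
  return loop(n - 1, acc + drawLine(100));  // exit returns to the interpreter
}

print(loop(1000, 0));                   // 100000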

string-tagcloud traces badly: it runs 5% slower with the JIT turned on.


[Figure: TraceVis chart for string-tagcloud]

The pattern has 3 phases.

In phase 1, which accounts for about half of total time, we are running in the pure interpreter. The first real code run by this test is: var tagInfo = tagInfoJSON.parseJSON(...) over 175k of JSON. parseJSON is recursive. TM doesn't see any loop headers in that part, so it never even goes into monitoring mode.
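
As a minimal illustration (this is not the actual parseJSON code), code that iterates purely by recursion contains no loop headers, so the monitor never even gets a chance to count anything:

function depth(node) {
  if (node === null) return 0;
  // Pure recursion: no for or while loop anywhere, so the interpreter never
  // reaches a loop header and TM stays out of monitoring mode entirely.
  return 1 + Math.max(depth(node.left), depth(node.right));
}

var tree = { left: { left: null, right: null },
             right: { left: { left: null, right: null }, right: null } };
print('depth: ' + depth(tree));         // 3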

In phase 2, we have a white background with purple streaks and lots of red dots. The purple streaks are recording traces, but note that there is no blue to represent compilation. So the traces must be aborting for some reason; i.e., they encounter some feature TM doesn't know how to trace. The red dots indicate that TM is seeing loop headers and monitoring them, but not doing anything with them, probably because no traces can be successfully compiled for them.

To understand phase 2 in detail, we need to know why recording is failing. Grepping the spew for 'Abort' gives the answer. In this case, all but 1 of the aborts are: Abort recording (line 184, pc 64): callname. This means tracing is stopping on a JSOP_CALLNAME bytecode in the interpreter. Line 184 contains a recursive call to a function walk defined inside another function. Poking around a bit, I noticed that a more detailed message would be printed with lower-case 'abort', so I tried that and got abort: 4815: fp->scopeChain is not global or active call object. I found that error message in the code, and after a little debugging I figured out why the abort fires.

The problem is that during lookup of the variable walk in the function call expression, walk is found in an enclosing function call scope that is not part of the trace. Whenever that happens, the tracer aborts. I think this is just an implementation limitation: in order to call the function, the tracer needs to have the function in its "tracker", so that it can refer to the LIR opcode that loads the function. (That's just how it works right now.) But if the function is defined "above" the trace, then the tracer's tracker will not have seen it.
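
Here is a minimal, hypothetical reduction of that pattern (the names are invented; the real code is string-tagcloud's JSON walker). walk is defined inside another function, so the call to walk from the traced loop resolves to a function living on an enclosing call scope that the trace never saw:

function buildTree(root) {
  function walk(node) {                 // walk lives in buildTree's call scope
    var count = 1;
    for (var i = 0; i < node.children.length; ++i) {  // the loop TM records
      count += walk(node.children[i]);  // calling walk by name aborts the trace
    }
    return count;
  }
  return walk(root);
}

var root = { children: [ { children: [] },
                         { children: [ { children: [] } ] } ] };
print(buildTree(root));                 // 4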

I think this wouldn't matter if we supported recursion, because then the outer scope would be part of the trace anyway.

Phase 3 consists of a little bit of recording and compiling, followed by a big patch of native code execution. That's the loop inside the function makeTagCloud, which doesn't do anything scary and traces well.

Conclusion. I've explained how to read TraceVis charts and shown how to combine them with other debug info to diagnose performance problems (and performance wins) in detail.

I want to thank my girlfriend Natalie for inspiring TraceVis by sending me an article on visualizations in computer forensics.

The code is available in my personal hg repository. You can check out a local copy by running

hg clone http://hg.mozilla.org/users/dmandelin_mozilla.com/tracevis/ 

The repo includes the patch that instruments TM with activity counters and a bunch of Python scripts for processing the outputs. The image generation scripts require PIL (Python Imaging Library).

Looking at SunSpider charts is fun, but I really want to apply this tool to performance problems in real applications like bugs 463487, 463478, 465773, and 468840. I'd also like to look at the V8 benchmarks, where TM currently doesn't do as well as it does on SunSpider.

Comments

Comment from Yusuf
Time: February 26, 2009, 11:41 pm

David, great post. What’s the bug number that tracks tracing recursion in TraceMonkey? Is this supposed to go into 3.1, or is it a post-3.1 issue?

Comment from Albert
Time: February 27, 2009, 2:30 am

Great work! I like the simple explanation of how TraceMonkey works, too.

Pingback from Ajaxian » TraceVis: Getting some eyes into TraceMonkey
Time: February 27, 2009, 3:55 am

[…] Mandelin has announced TraceVis, “a visualization of TraceMonkey performance, with the goal of revealing what the JS VM is […]

Comment from TNO
Time: February 27, 2009, 5:56 am

This is some excellent insight, can’t wait to see some of the V8 benchmark graphs

Comment from theorz
Time: February 27, 2009, 9:50 am

Is there anywhere that lists all the things that can cause TM to leave native code? It would be nice to help write JavaScript that gets the most out of TraceMonkey.

From this post I can gather that avoiding recursion is a good idea for now. But things like http://www.jsballs.com/benchmark.html are slower in firefox 3.1 than 3.0 and do not use much recursion.

Comment from Ted Mielczarek
Time: February 27, 2009, 10:10 am

This is very cool! I understood the basics of how TM worked, but this makes it a lot clearer. Thanks for the pretty graphs and the detailed explanations.

Comment from RyanVM
Time: February 27, 2009, 3:28 pm

Yusuf – I’m pretty sure you want bug 459301

Comment from bcrowder
Time: February 27, 2009, 3:37 pm

This is very similar to the sort of diagram I had wanted to generate to look at the browser’s memory fragmentation, except with colors assigned to the various allocating locations, so that we could see precisely where allocations from various modules ended up in memory. It would be a nice way to visualize locality and fragmentation at the same time.

Also, I think this diagramming approach would be a fantastic way to visualize activity in the browser with respect to when we’re running C/C++ and when we’re running JS, or even with finer granularity on a module-by-module basis.

Comment from kyber
Time: February 27, 2009, 4:04 pm

Thanks for this awesome writeup David – I’ve been following TraceMonkey from a distance, but never dared to look into what it is actually doing, the writeup was clear, easy to understand and I love the pretty pictures.

Oh, and thanks for continued study of jsMSX. :)

Comment from kyber
Time: February 27, 2009, 4:21 pm

Regarding theorz’s comment – I was curious as to that benchmark, and tried a few runs of FF3.2a and FF3.
61.061 FF3.2a jit
60.225 FF3.2a jit
60.226 FF3.2a no jit
60.147 FF3.2a no jit

60.245 FF3

From these runs on my machine it seems to me that JIT was largely irrelevant – my overall guess, without running any fancy visualisations, is maybe that the overhead is in things that aren’t impacted by the JIT yet like graphics rendering and DOM manipulation, and that perhaps the page could benefit from tricks to optimise those.
Maybe even use a canvas, as fun as pushing around divs is.

Comment from kyber
Time: February 27, 2009, 4:34 pm

I always forget to turn off firebug. Reran with firebug disabled in FF3.2a and FF3.
59.896 FF3.2a jit
59.794 FF3.2a jit
59.442 FF3.2a no jit
59.541 FF3.2a no jit

60.209 FF3

Not a huge difference. I wonder what else besides it just being in things TM doesn’t trace. Maybe using same vars for both strings and ints? :)

Comment from Boris
Time: February 27, 2009, 8:06 pm

The jsballs benchmark is mostly painting+DOM+css-bound. The time spent in the JS interpreter on it is just a few percent of the total. It can be reduced once we start tracing across DOM method calls (right now those abort trace recording), but that won’t help much in terms of benchmark time.

This is also why the fact that we run the JS slower on that test with jit on (because we keep starting and aborting recordings) doesn’t affect the test results much: it’s just not spending that much time in the JS engine to start with.

Pingback from TraceVis: Getting some eyes into TraceMonkey | Guilda Blog
Time: February 28, 2009, 4:29 am

[…] Mandelin has announced TraceVis, “a visualization of TraceMonkey performance, with the goal of revealing what the JS VM is […]

Comment from Travis
Time: February 28, 2009, 7:41 pm

This reminds me so much of profiling code on the commodore 64. The processor and video chip ran in lockstep. The easy way to profile was to change the screen background color as different parts of the code were running. For programs synchronized with the screen refresh (as most games were, and what else was there to program besides games :^), you would get a nearly stable image of exactly the sort tracevis gives, just rotated 90 degrees.

Comment from Jeroen
Time: March 1, 2009, 11:53 pm

Hi David,

Thanks for your excellent story!
I have a question.
You say in your trace picture (step 9) that “native code detects that it doesn’t fit in an integer”. So this means the generated code is also doing extra checks for types and/or overflows?
So it’s not quite as fast as, for example, code generated by a C compiler? Is this checking because a JS int can hold any large value?
Thanks again for your nice story, looking forward for the next one ;-)

Thx,
Jeroen

Comment from Dan
Time: March 3, 2009, 10:39 am

A very interesting read and the first post by anyone that’s actually made me understand how TraceMonkey makes things faster!

Pingback from Self-improvement « Bill McCloskey’s Blog
Time: March 20, 2009, 11:03 am

[…] blog post by my friend Dave perfectly illustrates another topic: the importance of logging, tracing, and […]

Pingback from Was bringt der Firefox 3.5? • Peter Kröner, Webdesigner & Frontendentwickler
Time: April 28, 2009, 6:24 am

[…] can be executed. More details on how TraceMonkey works can be found in this and in this blog post – put simply, TraceMonkey looks for recurring patterns in the script and […]

Pingback from David Mandelin’s blog » PLDI 2009
Time: June 12, 2009, 1:35 pm

[…] Nick Nethercote has been fixing up my TraceVis instrumentation so it can be checked into the main Mozilla tree. Soon it should be possible to […]

Pingback from Unleash the power of JavaScript with Mozilla TraceMonkey
Time: July 29, 2009, 5:15 pm

[…] TraceMonkey brings to JavaScript execution in Firefox. David Mandelin has a blog post with a good explanation and visualization of the performance improvement Firefox gets with TraceMonkey. While the excitement around XUL – the Mozilla RIA language […]

Pingback from 谋智社区 » Blog Archives » 颠覆网络35天 ─ TraceMonkey
Time: August 19, 2009, 9:30 pm

[…] Here is an article discussing TraceVis that is probably the most thorough explanation of it so far, covering what the various markers in the charts mean and how to analyze performance problems. The article also includes a detailed analysis that helps in understanding how TraceMonkey works. […]