WebAssembly

I’m happy to report that we at Mozilla have started working with Chromium, Edge and WebKit engineers on creating a new standard, WebAssembly, that defines a portable, size- and load-time-efficient format and execution model specifically designed to serve as a compilation target for the Web. As reflected in the high-level goals, a central requirement for WebAssembly is that it integrate well with the rest of the Web platform and that the initial version run efficiently on current browsers using a client-side polyfill. As demonstrated, the polyfill can leverage asm.js to get great performance. For existing Emscripten/asm.js users, targeting WebAssembly will be as easy as flipping a flag. Thus, it is natural to view WebAssembly as the next evolutionary step of asm.js (a step many have requested and anticipated).

We’re pretty early into the overall process—there is no draft spec or even final formal standards body chosen, just a W3C Community Group, some initial prototyping and early cross-browser consensus on the high-level design documents. Going forward, there will be a lot more iteration and experimentation under the WebAssembly GitHub organization. For questions, check out the still-emerging FAQ. Brendan Eich also has a vibrant blog post with more context, history and JS perspective.

asm.js optimizations previewing in Windows 10 / Edge

There’s a great new post by the Chakra team on the implementation details of their previously-announced asm.js optimizations. These optimizations are reaching a preview state in Windows 10 / Edge which is an important milestone for asm.js and the compile-to-web story. Congratulations to the team!

UPDATE: Now shipping by default in Edge!

Microsoft announces asm.js optimizations

The Microsoft Chakra team has announced on the IE blog that asm.js optimizations are In Development. We at Mozilla are very excited for IE to join Firefox in providing predictable, top-tier performance on asm.js code and from my discussions with the Chakra team, I expect this will be the case.

What does “asm.js optimizations” mean?

Given this announcement, it’s natural to ask what exactly “asm.js optimizations” means these days and how such optimizations differ from the normal JS optimizations (which all browsers are continually adding) that happen to benefit asm.js code. In particular, the latter sort of optimization is often motivated by asm.js workloads, as we can see from the addition of asm.js workloads to both Google’s and Apple’s respective benchmark suites.

Initially, there was a simple approximate answer to this question: a distinguishing characteristic of asm.js is the no-op "use asm" directive in the asm.js module, so if a JS engine tested for "use asm", it was performing asm.js optimizations. However, Chrome has recently started observing "use asm" as a form of heuristic signaling to the otherwise-normal JS compiler pipeline, and both teams still consider there to be something categorically different in the Firefox approach to asm.js optimization. So, we need a more nuanced answer.

Alternatively, since asm.js code allows fully ahead-of-time (AOT) compilation, we might consider this the defining characteristic. Indeed, AOT is mentioned in the abstract of the asm.js spec, my previous blog post and v8 issue tracker comments by project members. However, as we analyze real-world asm.js workloads and plan how to further improve load times, it is increasingly clear that hybrid compilation strategies have a lot to offer. Thus, defining “asm.js optimizations” to mean “full AOT compilation” would be overly specific.

Instead, I think the right definition is that a JS engine has “asm.js optimizations” when it implements the validation predicate defined by the asm.js spec and optimizes based on successful validation. Examples of such optimizations include those described in my previous post on asm.js load-time as well as the throughput optimizations like bounds-check elimination on 64-bit platforms and the use of native calling conventions (particularly for indirect calls) on all platforms. These optimizations all benefit from the global, static type structure guaranteed by asm.js validation which is why performing validation is central.

Looking forward

This is a strong vote of confidence by Microsoft in asm.js and the overall compile-to-web story. With all the excitement and momentum we’ve seen behind Emscripten and asm.js before this announcement, I can’t wait to see what happens next. I look forward to collaborating with Microsoft and other browser vendors on taking asm.js to new levels of predictable, near-native performance.

asm.js on status.modern.ie

I was excited to see that asm.js has been added to status.modern.ie as “Under Consideration”. Since asm.js isn’t a JS language extension like, say, Generators, what this means is that Microsoft is currently considering adding optimizations to Chakra for the asm.js subset of JS. (As explained in my previous post, explicitly recognizing asm.js allows an engine to do a lot of exciting things.)

Going forward, we are hopeful that, after consideration, Microsoft will switch to “In Development” and we are quite happy to collaborate with them and any other JS engine vendors on the future evolution of asm.js.

On a more general note, it’s exciting to see that there have been across-the-board improvements on asm.js workloads in the last 6 months. You can see this on arewefastyet.com or by loading up the Dead Trigger 2 demo in Firefox, Chrome or (beta) Safari. Furthermore, with the recent release of iOS 8, WebGL is now shipping in all modern browsers. The future of gaming and high-performance applications on the web is looking good!

asm.js AOT compilation and startup performance

With the recent announcement of a commercial game shipping using Emscripten and asm.js, I thought it’d be a good time to explain how asm.js is executed in Firefox and some of the load-time optimizations we’ve made since the initial landing of OdinMonkey in March. (OdinMonkey is an optimization module inside Mozilla’s JavaScript engine.) There have also been significant throughput optimizations as well, but I’ll stick to load time in this post.

Measuring the Epic Citadel demo (based on the same Unreal Engine 3 inside Monster Madness), I see a 2x improvement:

Times were measured with a simple stopwatch up to the first animation frame on a 16-core 2.4GHz Linux machine. (An IndexedDB bug in either the demo or Chrome causes level data not to be cached, so time in “Downloading data” is explicitly subtracted from Chrome’s time.)

Cold load time improvements on the Citadel demo are harder to see since network latency plays a much larger part and adds considerable variance. Measuring the Nebula3 demos instead, which have a smaller initial download size and are compiled with both Emscripten and PNaCl, we can also see significantly better load times:

Times were again measured with a simple stopwatch up to first animation frame.

In this blog post I’ll explain the compilation strategy we use for asm.js, why we decided to try this strategy, how it’s been working, and 3 optimizations that have had a significant impact on load time.

The post is a bit long, so here’s the TL;DR:

  • Ahead-of-time (AOT) compilation is used for asm.js to provide predictable performance.
  • With a few cores, parallel compilation hides most of the cost of using the top-tier compiler for all code.
  • Async compilation allows the webapp to stay responsive during AOT compilation.
  • Caching compiled machine code greatly improves warm start time.

JIT Compilation

Before getting into how we compile asm.js, let’s look at a diagram of the path taken by normal JavaScript in SpiderMonkey (Mozilla’s JavaScript engine). In this diagram, boxes are data structures and arrows represent algorithms which consume and/or generate these data structures:
[Diagram: the JIT compilation pipeline]
In short, units of code (like functions, eval scripts, and global scripts) start as a bunch of characters in memory and gradually get compiled into forms that are able to execute more efficiently. While each unit of code starts the same way, different units of code will move along the arrows of this diagram at different times as they are run and judged hot enough. This compilation strategy is generally called Just-In-Time (JIT) compilation.

Going into a little more detail on the labels in the diagram:

  • AST: Abstract Syntax Tree
  • Baseline: a JIT compiler that balances compilation speed and the performance of generated code (see Kannan’s post for details)
  • Ion: short for IonMonkey, a JIT compiler that produces highly-optimized code at the expense of compilation speed (see David’s post for details)
  • MIR: an SSA-based representation of code used throughout Ion
  • Profile: collect metadata describing the runtime behavior of the code
  • Ion-build: generate MIR from bytecode and profiling metadata
  • Ion-compile: optimize and generate machine code from MIR
  • Bail: stop executing Ion-compiled code in order to Ion-compile a new version or spend more time collecting profiling metadata in Baseline-compiled code

Given this whole process, it’s reasonable to ask: why do we need all these tiers of execution? For comparison, V8 has two tiers, and Apple’s JSC has three and is experimenting with a fourth, so this strategy is common (although people are always looking for something simpler). There are two main reasons we’ve found in SpiderMonkey for this tiered structure.

One reason is that SpiderMonkey has to run many different types of code and most code doesn’t run long enough to amortize the cost of compilation. In fact, most code doesn’t even run once which is why SpiderMonkey and other JS engines wait for a function to be run before even fully parsing it. Of code that is run, most doesn’t get warm enough to Baseline-compile and, similarly, most warm code doesn’t get hot enough to Ion-compile. Thus, each tier of execution services a distinct type of workload.

The other reason is that the Ion-build step actually depends on code having warmed up in Baseline so that the profiling metadata is likely representative of future execution. Ion compilation uses this metadata to specialize the types of values, objects, operations, etc which it could not do based on static analysis of the code alone.

What’s great about this design is that it has allowed continual progress by modern JavaScript engines on all kinds of JavaScript code. This progress continues today in all the major JS engines without signs of letting up.

JIT Problems

As it became clear that Emscripten was a big deal (remember the H.264 decoder?), we started to try it out on bigger codes and talk with potential users. As we did this, one thing became clear: if the web was going to be a serious porting target for large, computationally-intensive apps, we needed performance to be predictable. Now, even with native code, performance is never truly predictable due to things like dynamic scheduling and cache hierarchies. However, with Emscripten output, we were seeing some pretty violent fluctuations in startup and throughput on different codes and on different browsers.

Analyzing these fluctuations, we saw several causes:

  1. JIT compilation time;
  2. as code warms up, it runs in lower-tier execution modes where it executes more slowly;
  3. compiler heuristics (such as: which scripts should be compiled with the top-tier JIT, when and where to inline, whether to compile a loop side-entry and what machine types to use to represent numbers) can make suboptimal choices that permanently reduce throughput; and
  4. some optimizations were just missing from the underlying backend compiler because they were rather difficult to implement in the general case (e.g., a better calling convention for direct and indirect function calls).

Each of these problems can potentially be mitigated by adding new JIT compilation techniques and heuristics. Indeed, we’ve seen a lot of improvement along these lines in the V8 and SpiderMonkey JIT compilers in the last year and I expect to see more in the future. For example, in both JIT compilers, a few heuristic tweaks provided large throughput improvements on the asmjs-apps benchmarks on arewefastyet.com and background JIT compilation has helped to significantly reduce JIT compilation pauses.

However, the question is: to what extent can these problems be mitigated? Unfortunately, that’s hard to know a priori: you only really know when you’re done. Furthermore, as with any heuristic tuning problem, it’s easy to measure on workloads A, B and C only to find afterwards that the fixes don’t generalize to workloads D-Z.

In broader terms: with the proliferation of walled gardens and the consequent frustration of developers, the Web has a great opportunity to provide an open, portable alternative. But to really be an alternative for many types of applications, the web needs predictable, near-native performance. The time is ripe, so we don’t want to miss the opportunity by blocking on a Sufficiently Smart Compiler.

AOT Compilation

To attempt to solve the above problems, we started the OdinMonkey experiment. The basic idea behind the experiment was: Emscripten-generated code has enough type information preserved from the original statically-typed source language that we can avoid all the dynamic-language compilation infrastructure and use a simple Ahead-of-Time (AOT) compiler.

For example, given the following C code:

int f(int i) {
  return i + 1;
}

Emscripten would output the following JS code:

function f(i) {
  i = i|0;
  return (i + 1)|0;
}

The statement “i = i|0” effectively performs the JS spec ToInt32 on the input, ensuring that + always operates on an integer. If we can prove that all callers pass ints, then this coercion is a no-op. The expression “(i + 1)|0” exactly simulates two’s-complement addition, meaning that this JavaScript expression compiles to a single machine instruction: no type tests, no overflow checks.
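
This wraparound behavior is easy to observe in any JS engine with ordinary JavaScript:

```javascript
// ToInt32 via |0: 32-bit two's-complement wraparound, no overflow checks
var x = 0x7fffffff;            // INT32_MAX
console.log((x + 1) | 0);      // -2147483648: addition wraps around
console.log(2.9 | 0);          // 2: doubles are truncated toward zero
console.log(4294967295 | 0);   // -1: ToInt32 reinterprets the bit pattern
```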

If you squint your eyes at the above code, you can view “i = i|0” as a parameter type declaration, “return (...)|0” as a return type declaration and binary + as taking two int types and returning a special type which requires coercion via ToInt32 or ToUint32 before use. This basic idea of viewing runtime coercions as types can be extended to all statements and expressions in Emscripten-generated code and the resulting type system is asm.js.

Given the asm.js type system, OdinMonkey is easily able to generate MIR from the AST. As an example, check out the CheckNot function in OdinMonkey (which checks the ! operator): as input it receives a ParseNode (an AST node) and, as output, it returns an MNot MIR node and the expression’s result type (which according to the spec is int). If any of the types fail to match, a type error message (like you’d expect from a C compiler) is output to the Web Console and OdinMonkey transparently falls back to normal JS compilation.
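
In JS-flavored form, a validation rule in this style might look like the following sketch. The names and helper shape here are illustrative only, not OdinMonkey’s actual (C++) API:

```javascript
// Hypothetical sketch of a CheckNot-style rule: validate the operand of
// `!expr` and return the expression's result type, per the asm.js spec.
var Type = { Int: "int", Double: "double" };

function checkNot(node, checkExpr) {
  var operandType = checkExpr(node.operand);
  if (operandType !== Type.Int)
    throw new TypeError("! operand must be int, got " + operandType);
  return Type.Int; // the spec gives ! the result type int
}
```

A real validator would, on such a TypeError, print the message to the Web Console and fall back to normal JS compilation rather than aborting.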

In terms of the previous JIT compilation diagram, OdinMonkey adds a single new arrow between AST and MIR:
[Diagram: the JIT pipeline with OdinMonkey’s added AST→MIR arrow]
Furthermore, after asm.js type checking succeeds (as well as the link-time check), it is not possible for the generated code to take the Bail edge: there are no dynamically-checked assumptions that can fail.

In addition to simplifying the compilation process, the asm.js type system also provides three broader benefits:

  • asm.js acts as a testable interface between Emscripten and the browser. (This found bugs in Emscripten.)
  • asm.js specifies a target for non-Emscripten code generators so that they don’t have to re-discover the same sweet spot as Emscripten. (Speaking of, check out the experimental asm.js PyPy backend and LLJS-asm.js fork.)
  • asm.js establishes a clear optimization target for all browsers so that this style of code can become portably fast.

AOT Potential Problems

Despite all these advantages, AOT has a significant potential downside: it compiles everything using the most expensive compiler without knowing if the code being compiled is hot or cold. This would obviously be a problem if an app contained a lot of cold or dead asm.js code. Similarly, AOT would be a net loss for an app with a lot of code that runs in short bursts so that low tiers of execution and compilation stalls aren’t noticeable. Thus, the load-time performance of AOT relative to JIT depends on the kind of code being executed.

Another potential pitfall for AOT is pathologically-large functions, since these can take a really long time to compile in the top-tier compiler. With JIT compilation, the usual heuristics ensure that the top-tier compiler is never used on such functions. With some work, OdinMonkey could be extended with heuristics to do the same. In the meantime, Alon added an “outlining” option to Emscripten that automatically breaks up large functions and has been quite effective. By making functions smaller, outlining also improves the performance of asm.js on non-OdinMonkey engines, since it encourages the JIT to use the top-tier compiler.

One theoretical response to these load-time concerns is that the "use asm" directive required at the beginning of any asm.js module has no semantics and can simply be removed if AOT compilation is not beneficial. As such, "use asm" gives the developer more control over the compilation scheme used for their application. In theory (it’s difficult in practice at the moment due to lack of automated tooling), developers can exercise even finer-grain control by choosing which functions are inside the asm.js module (and thus receive AOT compilation) and which are outside (and thus receive JIT compilation). One can even imagine an Emscripten PGO pass that does this automatically for cold code.

In the end, though, it’s hard to predict what will happen in practice, so we had to just try. (OdinMonkey was started as an experiment, after all.)

The results so far have been good. In addition to those reported at the beginning of the post, cold load times are also measured by the asmjs-apps-*-loadtime synthetic workloads on awfy:

In this graph, Firefox (JIT) refers to Firefox’s performance with OdinMonkey disabled (by passing --no-asmjs to the JS shell or setting javascript.options.asmjs to false in about:config in the browser).

Another data point is the BananaBread benchmark which conveniently measures its own load time:

This graph reports the sum of the “preload” and “startup” times when the benchmark is run in headless mode with a cold cache.

Now let’s look at the major optimizations that AOT compilation allows.

Parallel Compilation

With the intermediate JIT compilation steps avoided, the majority of AOT compilation time is in the Ion-compile step. For example, measuring the Citadel demo we can see the following breakdown of time:

  1. Parse AST: 1.5s
  2. Odin-build AST into MIR: 1.5s
  3. Ion-compile MIR: 8s

Fortunately, the Ion-compile step is also the most parallelizable: each function in the asm.js module results in an independent Ion compilation and there are tens of thousands of functions in large apps. Even better, SpiderMonkey had already supported background Ion-compilation for a year before OdinMonkey, so we were able to add parallel compilation to OdinMonkey without much trouble.

After basic parallel compilation worked, we made an additional refinement to extract further parallelism. Originally, the entire asm.js module would be parsed into one big AST before being handed over to OdinMonkey. OdinMonkey would then simply recurse over the AST, firing off parallel Ion compilations as it went. This was suboptimal for two reasons:

  • While parsing the AST, only one core was working.
  • Since the AST is an order of magnitude bigger than the source and asm.js source can be 35MB (don’t worry, that compresses down to 5MB over the wire with HTTP gzip content encoding), we were seeing out-of-memory errors on mobile devices and even 32-bit desktop processes with many tabs open.

The solution to both of these problems was to allow Odin-building and Ion-compiling to overlap parsing, as illustrated in the following pseudocode:

while (not at end of asm.js module) {
  ast = ParseFunction();
  mir = CheckAndEmit(ast);
  StartBackgroundIonCompilation(mir);
  ReleaseMemory(ast);
}

Since the time to Ion-compile a function is on average longer than the time to parse, the process looks something like this:

[Diagram: parsing on the main thread overlapped with background Ion compilations]

To measure the effect, first disable caching (set javascript.options.parallel_parsing to false) and then compare compile times with and without javascript.options.ion.parallel_compilation enabled. To get a more precise measure of compile time, look at the “total compilation time ___ms” part of the “Successfully compiled asm.js code” Web Console message.

On my machine, parallel compilation reduces compile time from 11s to 5s on the Citadel demo, but this improvement is obviously contingent on the number of cores. Measuring with 2 cores, the compile time is 9s; with 3 cores, 6s; and with 4 cores, 5s. Adding further cores doesn’t appear to help. The remaining gap between this and the theoretical minimum of 3s suggested above is largely due to a fixable implementation detail.

Asynchronous Compilation

As described above, AOT compilation occurs when "use asm" is first encountered while parsing. This can be while parsing an inline <script> tag or the string passed to eval or the Function constructor. All of these happen on the browser’s main thread and thus a large asm.js compilation will hold up event handling and the page will appear frozen (as well as the whole Firefox browser on desktop since it’s not multiprocess (yet!)).

Since HTML allows events to be delivered to pages that still have pending <script> elements to evaluate, any script may technically be parsed off the main thread. Unfortunately, the script must still execute synchronously with respect to parsing the rest of the document and constructing the DOM so script parsing traditionally happens on the main thread right before execution.

Fortunately, HTML5 added a new async property to script elements that defaults to true for script-created external script elements and can be set explicitly for external scripts (<script async src="...">). When async is true, the browser is allowed to evaluate the script whenever it wants. This makes async scripts a perfect candidate for parsing off the main thread and Brian Hackett recently made it happen.

OdinMonkey, by nature of running at parse-time, got to ride along for free(-ish). Even better, most Emscripten’d apps are already async since Emscripten’s default harness uses an async script to load the main asm.js module. See this short MDN article for more details, gotchas and workarounds concerning async scripts.

Caching

When someone starts contributing to SpiderMonkey, there are a few ideas they will almost inevitably have. One is: “Why don’t we cache JIT code?”. The response to this question is usually some combination of:

  • JIT compilation is pretty fast and usually a small percentage of total run time, so it probably wouldn’t help most sites.
  • The implementation would be really complicated because JIT code is highly dependent on the current browser state in memory and JIT code forms a complex graph data structure.
  • It’s hard to know when and what to cache; compilation is happening all the time in tiny units.

None of these problems are insurmountable, but together they make JIT-code caching a fairly daunting task. (Incidentally, the other inevitable question is “Why don’t we use LLVM as a compiler backend?”, so it’s great to see Apple actually trying this.)

In contrast, for asm.js code the cost/benefit analysis is much simpler:

  • compilation time is significant
  • the asm.js module has limited and explicit dependencies (viz., the arguments to the asm.js module function)
  • the representation of the generated asm.js module is relatively simple and easily serialized and deserialized

making caching a clear win. So that’s what we did.

There is one unfortunate limitation in the current implementation, though: caching only kicks in for async scripts and WebWorker code (due to some hopefully temporary main-thread limitations arising from browser storage integration). Thus, large applications have two big reasons to use async scripts. Other than that, the cache should behave predictably according to the following rules:

  • The cache is only used for medium-to-large modules (the current cutoff is modules longer than 10,000 characters).
  • The cache entry is keyed on: the origin of the script, the source characters of the asm.js module, the type of CPU and its features, the Firefox build-id (which changes on every major or minor release).
  • The asm.js cache participates in browser-wide quota management such that, when total temporary storage grows past a certain threshold, storage is evicted on an LRU basis.
  • There is a fixed cap (currently 16) on the number of cached asm.js modules per origin; eviction is LRU.

To get confirmation of caching, open the Web Console: the asm.js success messages now include a “loaded from cache” / “stored in cache” / “not stored in cache” clause.

To see caching in action, try out the demos mentioned in the introduction with and without javascript.options.parallel_parsing enabled in about:config (true enables caching). Using this to measure cached vs. uncached warm load time of the Epic Citadel demo shows a 2x improvement:

As before, times were measured with a simple stopwatch up to the first animation frame on a 16-core 2.4GHz Linux machine.

Note: to clear out the asm.js cache for a given origin, click the Site Identity Button → More Information → Permissions → (scroll down) → Clear Storage. (I hear better quota management UI is coming.)

Conclusions

OdinMonkey started as an experiment in achieving predictable, high performance through AOT compilation. With the rapid pace of innovation in JS JIT compilation techniques, it’s definitely too early to draw any final conclusions. However, almost a year after the initial release, OdinMonkey is still showing marked leads in predictability, throughput and load time.

In recent news, it’s interesting to see that Google just shipped ART, an AOT compiler for Java on Android (the current Java VM on Android, Dalvik, is a JIT compiler). OdinMonkey and ART aren’t really comparable for several reasons, but some of the arguments made in the article about startup time definitely sound familiar ☺.

On a final note, I’d like to emphasize that the majority of Mozilla’s JavaScript performance engineers are still focused on improving general JavaScript performance 1,2,3,4,5,6,7,8,9. Moreover, I think the future we’re moving toward has web apps composed of both compiled modules and high-level handwritten code. PlayCanvas provides us an early example of this, embedding ammo.js (an Emscripten port of the Bullet physics engine) into an existing hand-written game engine. I hope to see this trend continue with more reusable compiled components in more application domains and with tighter integration between compiled and handwritten JS (e.g. LLJS, embind).

asm.js in Firefox Nightly

I’m happy to announce that OdinMonkey, an asm.js optimization module for Firefox’s JavaScript engine, is now in Nightly builds and will ship with Firefox 22 in June.

What is asm.js? Why are we doing it, and how are we getting to within 2x of native performance? This post won’t be able to go into too much detail since we’re hard at work preparing for Mozilla’s upcoming GDC session, which you should definitely come see (Wednesday 11am, Room 3024, West Hall). After GDC, expect full coverage of these topics by Alon, Dave, myself and surely others. For now, allow me to point you at the asm.js FAQ, Alon’s mloc.js slides, a nice Badass JavaScript post and a more in-depth post by Axel Rauschmayer.

Want to see it in action? Download a new Firefox Nightly build and try out BananaBench. (Note: BananaBench runs with a fixed time step to make JS execution deterministic, so game speed will run fast/slow, depending on your hardware/browser.) Or, check out a demo of the Emscripten-compiled Bullet engine simulating a ton of falling boxes.

At the moment, we have x86/x64 support on desktop Windows/Mac/Linux, and support for mobile Firefox on ARM is almost done. Since we intend to continue to iterate on the asm.js spec in cooperation with other JS engines, we’ve put OdinMonkey behind the flag javascript.options.asmjs in about:config. This flag is currently enabled by default on Nightly and Aurora and, if nothing changes over the next 12 weeks, will be automatically disabled in Beta and Release. By then, we hope to be happy with a stable “asm.js v.1”, at which point we’ll enable it everywhere and ship with it enabled in our final builds. [Update: OdinMonkey has been enabled by default for all releases starting with Firefox 22.]

If you want to start experimenting with asm.js right now, you can:

  • Get Emscripten and start compiling C/C++ code. (Don’t forget the -O2 -s ASM_JS=1.)
  • Check out the draft spec and start writing asm.js by hand.
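
To get a feel for hand-written asm.js, here is a minimal module in the shape the spec describes. This is just a sketch: globalThis stands in for the page’s global object, and the heap is unused here but shown for completeness.

```javascript
function MiniModule(stdlib, foreign, heap) {
  "use asm";
  function add1(i) {
    i = i|0;             // parameter type annotation: int
    return (i + 1)|0;    // return type annotation: int
  }
  return { add1: add1 };
}

// Linking: stdlib is the global object; the heap must be a
// suitably-sized ArrayBuffer.
var mini = MiniModule(globalThis, {}, new ArrayBuffer(0x10000));
mini.add1(41); // 42
```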

In the future, we’d like to see a rich third option of generating asm.js using a more ergonomic front-end language (e.g., a derivative of LLJS). [Update: LLJS work is already underway!]

How do you know if you are generating valid asm.js and taking full advantage of OdinMonkey? In the old days, this was a frustrating question for developers. Maybe you were doing something wrong, maybe the code, as written, was just slow. One cool thing about asm.js is that the "use asm" directive makes the programmer’s intention quite clear: they want to compile asm.js. Thus, if there is an asm.js validation error, OdinMonkey will print a warning on the JavaScript console. (OdinMonkey emits a warning, instead of throwing an error, since asm.js is just JavaScript and thus cannot change JavaScript semantics.) In fact, since silence is ambiguous, OdinMonkey also prints a message on successful compilation of asm.js. (There is currently a bug preventing asm.js optimization and warnings in Scratchpad and the Web Console, so for now experiment in regular content.)
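
For example, the following sketch of a module fails validation by mixing int and double without a coercion, so OdinMonkey would print a type warning to the console and transparently fall back to normal JS compilation. Note that the code still runs, just without asm.js optimization:

```javascript
function BrokenModule(stdlib) {
  "use asm";
  function f(x) {
    x = x|0;
    return x + 0.5; // asm.js type error: int + double needs an explicit coercion
  }
  return f;
}

// Still executes with ordinary JS semantics after the fallback:
var broken = BrokenModule(globalThis);
broken(2); // 2.5
```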

For those who are itching to do some performance experiments: go for it. We’ve been pretty happy with the results so far when asm.js is applied to new codes, but we’ve also seen plenty of cases where the C++ compiler is doing important backend optimizations that we haven’t yet taught our IonMonkey backend. We expect continuous incremental improvement as we measure and implement new optimizations to close this gap. A second caveat: one performance fault that we already know trips up people trying to benchmark asm.js is that calling from non-asm.js into asm.js, and vice versa, is much slower than a normal call due to general-purpose enter/exit routines. We plan to fix this in the next few months but, in the meantime, for benchmarking purposes, try to keep the whole computation happening inside a single asm.js module, not calling in and out.
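
To illustrate the benchmarking advice: structure the test so that a single exported call does all the work inside the module, rather than crossing the asm.js boundary once per iteration. A sketch (names are illustrative):

```javascript
function BenchModule(stdlib) {
  "use asm";
  // One call in, one result out: the enter/exit overhead is paid once,
  // not once per loop iteration.
  function sumTo(n) {
    n = n|0;
    var i = 0;
    var acc = 0;
    for (; (i|0) < (n|0); i = (i + 1)|0) {
      acc = (acc + i)|0;
    }
    return acc|0;
  }
  return sumTo;
}

var sumTo = BenchModule(globalThis);
sumTo(1000000); // the entire loop runs inside the module
```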

In closing, I leave you with the musical inspiration for OdinMonkey:

Happy hacking!

Optimizing JavaScript variable access

I recently finished a project to improve how SpiderMonkey implements variable access so I thought this would be a good time to explain how it all works now. Taking a note from mraleph’s post (and SICP), I’ll illustrate the implementation using JavaScript as the implementation language. That is, I’ll translate JavaScript using full-featured variable access into JavaScript that doesn’t, rather like how the original C++ compiler translated C++ into C.

Before starting, let me set up the problem space. By variable I’m referring not just to the variables introduced by var, but also those introduced by let, const, catch, function statements, and function argument lists. By variable access, I mean a read or a write. Variable access can take many forms:

  • Local access (i.e., access to a variable in the same function):
    function add(x,y) { return x+y }
  • Non-local access (i.e., access to a variable in an enclosing function):
    function add(x,y) { return (function() { return x+y })() }
  • Access from dynamically-generated code:
    function add(x,y) { return eval("x+y") }
  • Access after dynamic scope modification via non-strict direct eval:
    function add(a,b) { eval("var x="+a+", y="+b); return x+y }
  • Dynamic function argument access via the arguments object:
    function add(x,y) { return arguments[0]+arguments[1] }
  • Unexpected debugger snooping (via Firebug, the new builtin Firefox debugger, or directly from privileged JS using the new Debugger API):
    dbg.onDebuggerStatement = function(f) { return f.eval("x+y") }

To keep the post small(-ish), I’ll pretend there is only (non-strict, direct) eval and ignore strict and indirect eval as well as with (which we generally deoptimize as if it were an eval). I’ll also ignore let, global access optimizations, the bizarre things SpiderMonkey does for block-level function statements, and the debugger.

The worst case

To rise above, we must first see how low we need to go in the worst case. Consider the following function:

function strange() {
  eval("var x = 42");
  return function xPlus1() { var z = x + 1; return z }
}

Here, eval is dynamically adding x to the scope of strange where it will be read by xPlus1. Since eval can be called with a dynamically-constructed string we must, in general, treat function scopes as dynamic maps from names to values. (Fun fact: names added by eval can be removed using the delete keyword, so the map can both grow and shrink at runtime!)

To make this more concrete, we’ll implement scopes in JS using ES6 Map objects. We’ll give every function its own Map that will be stored in a local variable named scope and hold all the function’s variables. (Yes, we’re using a variable to implement variables; but since we’ll only use a small finite number of them, we can think of them as registers.)

function strange() {
  // the scope of 'strange' is initially empty
  var scope = new Map;

  // eval("var x = 42") effectively executes:
  scope.set('x', 42);

  return function xPlus1() {
    // vars are hoisted so scope initially contains 'z'
    var scope = new Map([['z', undefined]]);

    // var z = x + 1
    scope.set('z', scope.get('x') + 1);  // oops!

    // return z
    return scope.get('z');
  }
}

As the comment indicates, there is a bug in xPlus1: x isn’t in the scope of xPlus1, it’s in the scope of strange! To fix this we need to do two things:

  1. Add an enclosing field to all scope objects indicating the enclosing function’s scope (or the global object if the function is top-level).
  2. Replace uses of scope.get with a lookup algorithm that walks the chain of scopes.
function strange() {
  // the scope of 'strange' is initially empty
  var scope = new Map;
  scope.enclosing = window;

  // eval("var x = 42") effectively executes
  scope.set('x', 42);

  var tmp = function xPlus1() {
    // vars are hoisted so scope initially contains 'z'
    var scope = new Map([['z', undefined]]);
    scope.enclosing = xPlus1.enclosing;

    // var z = x + 1
    scope.set('z', lookup(scope, 'x') + 1);

    // return z
    return lookup(scope, 'z');
  }
  tmp.enclosing = scope;
  return tmp;
}

function lookup(scope, name) {
  while (scope instanceof Map && !scope.has(name))
    scope = scope.enclosing;
  // if we fell off the chain of Maps, 'scope' is the global object,
  // which is accessed by property lookup rather than Map.get
  return scope instanceof Map ? scope.get(name) : scope[name];
}
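For instance, lookup walks the chain one hop at a time (the helper is repeated here so the snippet stands alone):

```javascript
// Exercising 'lookup' on a hand-built scope chain:
function lookup(scope, name) {
  while (scope instanceof Map && !scope.has(name))
    scope = scope.enclosing;
  return scope instanceof Map ? scope.get(name) : scope[name];
}

var outerScope = new Map([['x', 42]]);
outerScope.enclosing = {};               // stand-in for the global object
var innerScope = new Map([['z', 1]]);
innerScope.enclosing = outerScope;

lookup(innerScope, 'z');  // 1: found in the local scope
lookup(innerScope, 'x');  // 42: found one hop up the chain
```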

Note that, without being able to use non-local variable access (since that is what we are implementing), we must attach the scope of strange to the xPlus1 function object. This isn’t just some hack; it is a fundamental part of the implementation of languages with lexically-scoped first-class functions. More generally, we can establish the following relationship (pardon my ASCII-art):

Function-scope
  | *        ^ 0 or 1
  |          |
  | call of  | enclosing
  |          |
  V 1        | 1
Function-object
  | *
  |
  | evaluation of
  |
  V 1
Function-literal

Each function literal can be evaluated any number of times, with each evaluation producing a function object that is associated with its enclosing scope. Each of those function objects can be called any number of times, each of those calls producing a scope. When using the language, it is easy to see just a single concept function, but hopefully this illustrates that there are really three “function” concepts at play here: scope, object, and literal.

With these changes, we have successfully dealt with the ravages of eval, but at what cost? Each variable access involves a call to an algorithm that iteratively performs hash-table lookups! Fortunately, this problem isn’t that different from object-property lookup and the same type of optimizations apply: hidden classes and caches. I won’t go into these techniques, as there are already two great explanations available. (Caching has been used to speed up name access since Firefox 3.) Even with these optimizations, however, name lookup isn’t as fast as we’d like it to be and we are still creating a Map object on every call.
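To give a flavor of how caching helps, here is a toy monomorphic cache of my own devising (SpiderMonkey’s real name caches are guarded by hidden classes/Shapes, not raw hop counts): remember how deep the last successful lookup of a name was and try that depth first.

```javascript
// Toy cache for scope lookups: remember the hop count of the last
// successful lookup of 'name' and retry that fast path first.
function makeCachedLookup(name) {
  var cachedHops = -1;
  return function lookup(scope) {
    if (cachedHops >= 0) {
      var s = scope;
      for (var i = 0; i < cachedHops; i++)
        s = s.enclosing;
      if (s instanceof Map && s.has(name))
        return s.get(name);            // cache hit: no chain walk
    }
    var hops = 0;                      // slow path: walk and record
    while (scope instanceof Map && !scope.has(name)) {
      scope = scope.enclosing;
      hops++;
    }
    cachedHops = hops;
    return scope instanceof Map ? scope.get(name) : undefined;
  };
}

var outer = new Map([['x', 42]]);
var inner = new Map([['z', 1]]);
inner.enclosing = outer;
var lookupX = makeCachedLookup('x');
lookupX(inner);  // slow path: walks the chain, caches hops=1
lookupX(inner);  // fast path: jumps straight to the right scope
```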

In summary, we’ve handled the worst case, but we’d like to do better in code that doesn’t exercise the worst case.

Fast local name access

Now let’s optimize local variable access when all accesses are local. With this constraint, JavaScript starts to look like C and we can use some of the same techniques as a C compiler: store all variables in a stack and access variables by their offset in the stack.

As a first (highly garbalicious) iteration, we create an array for each set of arguments and vars, thereby turning

foo(13, 42);

function foo(x,y) {
  var a = x + y;
  return bar(a);
}

into:

foo([13, 42]);

function foo(args) {
  var vars = [undefined];
  vars[0] = args[0] + args[1];
  return bar([vars[0]]);
}

The second step is to avoid creating all those temporary arrays by using one big array, shared by all active functions. There are many ways to do this (corresponding to different calling conventions); we’ll just do something simple here:

// executed some time before the first function call:
var stack = [];

stack.push(13);
stack.push(42);
foo(/* number of arguments pushed = */ 2);

function foo(numArgs) {
  // push missing arguments, pop extra arguments
  for (var i = numArgs; i < 2; i++)
    stack.push(undefined);
  for (var i = numArgs; i > 2; i--)
    stack.pop();

  // analogous to the frame pointer register
  var firstLocal = stack.length;

  // push local 'a'
  stack.push(undefined);

  // var a = x + y:
  stack[firstLocal] = stack[firstLocal - 2] + stack[firstLocal - 1];

  // prepare stack for call to 'bar(a)':
  stack.push(stack[firstLocal]);
  var returnValue = bar(/* number of arguments pushed = */ 1);

  // in this calling convention, the callee pops the arguments pushed for
  // it (so 'bar' has already popped its argument); now pop our own frame
  stack.pop(); // pop 'a'
  stack.pop(); // pop 'y'
  stack.pop(); // pop 'x'
  return returnValue;
}

With this strategy, a JIT compiler can do some pretty great optimization. To start with, each read from or write to stack in the above JS can be compiled down to a single CPU load or store. This is achieved by caching the address of stack[firstLocal] in a register and rolling the remaining “+ INDEX” into the load instruction as an offset. Even better, modern JavaScript JIT compilers do register allocation which can avoid the loads/stores altogether. (Register allocation has been in Firefox since version 3.5.)
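Putting the convention together end-to-end, with a hypothetical callee `double` standing in for `bar` (which the post leaves undefined):

```javascript
// The stack-based calling convention, complete and runnable:
var stack = [];

function double(numArgs) {
  var firstLocal = stack.length;
  var returnValue = stack[firstLocal - 1] * 2;
  stack.pop();  // the callee pops its argument
  return returnValue;
}

function foo(numArgs) {
  var firstLocal = stack.length;
  stack.push(undefined);  // push local 'a'
  // var a = x + y:
  stack[firstLocal] = stack[firstLocal - 2] + stack[firstLocal - 1];
  // call double(a):
  stack.push(stack[firstLocal]);
  var returnValue = double(1);
  stack.pop();  // pop 'a'
  stack.pop();  // pop 'y'
  stack.pop();  // pop 'x'
  return returnValue;
}

stack.push(13);
stack.push(42);
foo(2);  // (13 + 42) * 2 = 110, and the stack is empty again
```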

In summary, we can do pretty efficient things for local variable access, but only with some stringent restrictions.

Fast non-local access

While we shouldn’t expect great performance from functions that use eval or the arguments object, the requirement made in the previous section that we only access local variables is pretty harsh and conflicts with both the functional and module patterns of JavaScript programming. In this section, we’ll optimize non-local access.

We start with the observation that, in the absence of eval and other weirdos, there is no need for a fully dynamic scope lookup: we can know exactly where on the scope chain to find the variable being accessed. The first step is to view each top-level function as a tree of nested functions, giving each node (function) in the tree an array of the variables defined in its scope. For example, given this function:

function add3(arg1, arg2, arg3) {
  function addInner(innerArg1) {
    function innermost() { return innerArg1 + arg2 + getArg3() };
    return innermost();
  }
  function getArg3() {
    return arg3;
  }
  return addInner(arg1);
}

we can distill the following tree:

function add3: [arg1, arg2, arg3, addInner, getArg3]
 |\_ function addInner: [innerArg1, innermost]
 |    \_ function innermost: []
  \_ function getArg3: []

The next step is to include uses as leaves of the tree that are linked to the innermost enclosing definition with the same name. Rather than drawing terrible ASCII-art arrows, let’s represent a use-to-definition arrow with a two-number coordinate:

  • hops = the number of nodes in the tree to skip to get to the function node whose array contains the definition.
  • index = the index of the definition in the function node’s array.

Linking uses to definitions in the above tree produces:

function add3: [arg1, arg2, arg3, addInner, getArg3]
 |\_ function addInner: [innerArg1, innermost]
 |    |\_ function innermost: []
 |    |    |\_ "innerArg1"   {hops=1, index=0}
 |    |    |\_ "arg2"        {hops=2, index=1}
 |    |     \_ "getArg3"     {hops=2, index=4}
 |     \_ "innermost":       {hops=0, index=1}
 |\_ function getArg3: []
 |     \_ "arg3"             {hops=1, index=2}
 |\_ "addInner"              {hops=0, index=3}
 |\_ "getArg3"               {hops=0, index=4}
  \_ "arg1"                  {hops=0, index=0}

As a last step, we’ll erase all variables that only have local uses. We can also remove entire scopes if they are empty; we just need to be mindful not to include these removed scopes in any hops count. Applying this last transformation produces the following, final tree:

function add3: [arg2, arg3, getArg3]
 |\_ function addInner: [innerArg1]
 |     \_ function innermost: 
 |         |\_ "innerArg1"   {hops=0, index=0}
 |         |\_ "arg2"        {hops=1, index=0}
 |          \_ "getArg3"     {hops=1, index=2}
 |\_ function getArg3: 
 |     \_ "arg3"             {hops=0, index=1}
  \_ "getArg3"               {hops=0, index=2}

With this analysis, we have all the information we need to efficiently compile the program. For the local-only variables that we removed in the last step, we can use the stack directly (as in the second section). For variables with non-local access, we can represent the scope chain as a linked list of scopes (as in the first section), except this time we represent scopes as arrays instead of maps. To compile an access, we use {hops,index} coordinate: hops tells us how many .enclosing links to follow, index tells us the index in the array.
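The compiled access itself can be sketched in a few lines (my own illustration, mirroring scopes from the final tree above):

```javascript
// Reading a variable by its {hops, index} coordinate: follow 'hops'
// .enclosing links, then index into the scope array.
function readAliasedVar(scope, hops, index) {
  while (hops--)
    scope = scope.enclosing;
  return scope[index];
}

// A chain mirroring the final tree: add3's scope encloses addInner's.
var add3Scope = ['arg2val', 'arg3val', 'getArg3val'];
var addInnerScope = ['innerArg1val'];
addInnerScope.enclosing = add3Scope;

readAliasedVar(addInnerScope, 0, 0);  // 'innerArg1val' {hops=0, index=0}
readAliasedVar(addInnerScope, 1, 1);  // 'arg3val'      {hops=1, index=1}
```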

Applying this scheme to the original example (and eliding the missing/extra arguments boilerplate) produces the following translated JS:

function add3() {
  var firstLocal = stack.length;

  // the optimized scope of add3 is: [arg2, arg3, getArg3]
  var scope = [stack[firstLocal-2], stack[firstLocal-1], undefined];
  scope.enclosing = window;

  // initialize 'addInner':
  stack.push(function addInner() {
    var firstLocal = stack.length;

    // the optimized scope of addInner is: [innerArg1]
    var scope = [stack[firstLocal - 1]];
    scope.enclosing = addInner.enclosing;

    // push local 'innermost'
    stack.push(function innermost() {
      // the scope of innermost is completely optimized away
      var scope = innermost.enclosing;

      // return innerArg1 {hops=0, index=0} +
      //        arg2      {hops=1, index=0} +
      //        getArg3() {hops=1, index=2}
      return scope[0] +
             scope.enclosing[0] +
             (scope.enclosing[2])();
    });
    stack[firstLocal].enclosing = scope;

    // return innermost()
    var returnValue = (stack[firstLocal])();
    stack.pop();  // pop 'innermost'
    stack.pop();  // pop 'innerArg1'
    return returnValue;
  });
  stack[firstLocal].enclosing = scope;

  // initialize 'getArg3' {hops=0, index=2}:
  scope[2] = function getArg3() {
    // the scope of getArg3 is completely optimized away
    var scope = getArg3.enclosing;

    // return arg3 {hops=0, index=1}
    return scope[1];
  }
  scope[2].enclosing = scope;

  // return addInner(arg1)
  stack.push(stack[firstLocal - 3]);
  var returnValue = (stack[firstLocal])();
  stack.pop();  // pop 'addInner'
  stack.pop();  // pop 'arg3'
  stack.pop();  // pop 'arg2'
  stack.pop();  // pop 'arg1'
  return returnValue;
}

This strategy is good for JIT compilation in several ways:

  • If a variable is only accessed locally, it can still live on the stack and receive full JIT optimization.
  • Each .enclosing expression compiles to a single load instruction. Furthermore, when there are multiple accesses to variables in the same scope, the compiler can factor out the common scope walking.
  • Since a non-local name access in this scheme is much simpler than the name cache mentioned earlier, IonMonkey is more able to apply the optimizations it uses for local names such as LICM, GVN, and DCE.

In summary, we’ve now optimized non-local access while keeping local access fast. There are several other optimizations related to scopes that soften the blow when eval or arguments is used, but I think this is a good stopping point.

Next steps

The recent scope project basically catches us up to the level of other JS VMs. I should also note that functional languages have been doing similar optimizations forever. Looking forward, there are some straightforward optimizations I think we could do to avoid creating scope objects as well as more advanced optimizations we can lift from the functional crowd.

In SpiderMonkey

If you are interested in seeing the code for all this in SpiderMonkey, you can use the following links to get started:

  • The {hops,index} coordinate is called ScopeCoordinate.
  • The various scope objects are described in this ASCII-art tree. (Note, for mostly historical reasons, we use the same underlying representation for objects and scopes. Due to the Shape mechanism (which is pre-generated for scopes at compile-time), scopes are still, effectively, arrays.)
  • Optimized non-local access is performed with ALIASEDVAR opcodes. See the implementation of these ops in the interpreter and IonMonkey jit.
  • The frontend name analysis is a bit old and messy (and will hopefully be rewritten sometime in the near future). However, the important part of the analysis is at the very end, when we emit the ALIASEDVAR ops in EmitAliasedVarOp.

Nicolas Pierron has provided a French translation of this post. Thanks!

JSRuntime is now officially single-threaded

Given this title, a reasonable reaction would be:

Wait, wait, single threaded?!  But isn’t that, like, the wrong direction for the multicore present and manycore future?

so let me start by clearing this up:

A single SpiderMonkey runtime (that is, instance of JSRuntime) — and all the objects, strings and contexts associated with it — may only be accessed by a single thread at any given time. However, a SpiderMonkey embedding may create multiple runtimes in the same process (each of which may be accessed by a different thread).

That means it is up to the embedding to provide communication (if any) between the runtimes via JSNative or other SpiderMonkey hooks. One working example is the new implementation of web workers in Firefox which uses a runtime per worker. Niko Matsakis is experimenting with a different architecture in his new parallel JS project.

So that’s the quick summary. Now, for the interested, I’ll back up and explain the situation, how we got here, and where we are going in more detail.

Ghosts of SpiderMonkey past

In the beginning, as Brendan explains, Java-style big-shared-mutable-heap concurrency was all the rage and so, as Java’s kid brother, SpiderMonkey also had big-shared-mutable-heap concurrency. Now, locks weren’t ever (afaik) exposed to JS as part of SpiderMonkey, but an embedding could add them easily with a JSNative. However, SpiderMonkey did support concurrent atomic operations on objects with a clever (patented, even) locking scheme that avoided synchronization overhead for most operations.

This initial threading design stayed in place until about a year before Firefox 4.0 when the compartments project picked up steam. The key new concept introduced by this project was, well, the compartment. A runtime contains a set of compartments and each compartment contains a set of objects. Every object is in exactly one compartment and any reference between objects in different compartments must go through a wrapper. With compartments and wrappers, you can implement a sort of membrane that is useful for all kinds of things: GC, security boundaries, JIT compilation invariants, memory accounting, and JS proxies. Overall, I would say that compartments are one honking great idea.

The important thing about compartments for this story, though, is that the implementation effort really wanted single-threaded access to everything in a compartment. To be honest, I don’t know the particular technical issue raised at the time, but it isn’t hard to see how single-threaded-ness was a necessary simplification for such a challenging plan (viz., shipping compartments with Firefox 4). Anyway, the decision was made and compartments became single-threaded.

After Firefox 4 another great choice was made to rewrite Firefox’s implementation of web workers to not use XPConnect and to instead create a new runtime per worker. The choice was made because, even though a runtime allowed multi-threaded execution, there were still some global bottlenecks such as GC and allocation that were killing workers’ parallelism.

I’m Talking About Drawing a Line in the Sand

With web workers in separate runtimes, there were no significant multi-threaded runtime uses remaining. Furthermore, to achieve single-threaded compartments, the platform features that allowed JS to easily ship a closure off to another thread had been removed since closures fundamentally carry with them a reference to their original enclosing scope. Even non-Mozilla SpiderMonkey embeddings had reportedly experienced problems that pushed them toward a similar shared-nothing design. Thus, there was little reason to maintain the non-trivial complexity caused by multi-threading support.

There are a lot of things that “would be nice” but what pushed us over the edge is that a single-threaded runtime allows us to hoist a lot of data currently stored per-compartment into the runtime. This provides immediate memory savings and also enables another big change we want to make that would create a lot more compartments (and thus needs compartments to be lighter-weight).

Thus, the decision was made to try to make SpiderMonkey single-threaded as an API requirement. A bug was filed in April 2011 and an announcement was made on dev.tech.js-engine a month after.

Across this line you do not…

April 2011 to… January 2012… what took so long?

Well, to begin with, there were quite a few minor uses of JSRuntime off the main thread that had to be chased down. Also, each of these cases required understanding new parts of the codebase and, in several cases, waiting a few months for other kind-hearted, but very busy, Mozillians to fix things for me. The biggest problem was xpcom proxies (not to be confused with JS proxies, which are awesome). Fortunately, Benjamin Smedberg already had a beef with xpcom/proxy and (just recently) nuked the whole directory from orbit.

After getting try server to pass without hitting any of the 10,000 places where single-thread-ness gets verified in debug builds, we couldn’t exactly just rip out the multi-threading support. The worry we had was that some popular add-ons would break the whole browser and we’d be faced with an uncomfortable backout situation. Thus, we landed a simple patch that asserts single-threaded-ness in a few pinch points in release builds and waited for the assert to make its way to a bigger audience. (I think this is a great example of how the rapid-release process enables developers.)

As of right now, the assert is in Firefox 10 Beta and slated to be released on January 31st. There are four known offending extensions:

  • The IcedTea Java plugin on Linux seems to hit the assert for some applets. [Update: this was reported fixed in version 1.2pre]
  • BExternal.dll and gemgecko.dll are touching the main-thread only pref service off the main thread (already a bug) which ends up calling a JS observer. [Update: Both Gemius and Babylon seem to have shipped fixes]
  • [UPDATE] The DivX plugin.

Based on these results, we are concluding that the invariant “stuck” and thus we can actually make changes that assume single-threaded-ness. Indeed, the first bomb has been dropped (taking 2,200 lines of code with it, and this is just the beginning).

The single-threaded invariant in detail

Each runtime now has an “owner thread” (JSRuntime::ownerThread). The “owner thread” is the id of the only OS thread allowed to touch the runtime. This owner thread is set to the current thread (PR_GetCurrentThread()) from JS_NewRuntime and may only be changed — when no JS is executing in the runtime — via JS_ClearRuntimeThread/JS_SetRuntimeThread. Virtually every JSAPI function that takes a JSContext parameter will assert that PR_GetCurrentThread == cx->runtime->ownerThread().

It should be mentioned that there are still a few remaining sources of concurrency:

  • The background-sweeping thread cleans up garbage objects which don’t have finalizers. Its interaction with the VM is pretty well isolated to the GC.
  • Pretty much the only JSAPI function that can be called off the main thread is JS_TriggerOperationCallback. This is how the watchdog thread stops runaway JS. Fortunately, the interaction with the VM is through a single field: JSRuntime::interrupt.

One last thing to point out is that SpiderMonkey’s architecture will likely continue evolving to meet new concurrency needs. Indeed, concurrent runtimes may one day return in a more restricted and structured form. But maybe not; we’ll see.