JavaScript Startup Bytecode Cache

(firefox 58)

We want to make Firefox load pages as fast as possible, to make sure that you can get all the goods from the loaded pages available as soon as possible on your screen.

The JavaScript Startup Bytecode Cache (JSBC) is an optimization made to improve the time to fully load a page. As with many optimizations, this is a trade-off between speed and memory. In exchange for faster page-load, we store extra data in a cache.

The Context

When Firefox loads a web page, it is likely that this web page will need to run some JavaScript code. This code is transferred to the browser from the network, from the network cache, or from a service worker.

JavaScript is a general purpose programming language that is used to improve web pages. It makes them more interactive, it can request dynamically loaded content, and it can improve the way web pages are programmed with libraries.

JavaScript libraries are collections of code that are quite wide in terms of scope and usage. Most of the code of these libraries is not used (~70%) while the web page is starting up. The web page’s startup lasts beyond the first paint, it goes beyond the point when all resources are loaded, and can even last a few seconds longer after the page feels ready to be used.

When all the bytes of one JavaScript source are received we run a syntax parser. The goal of the syntax parser is to raise JavaScript syntax errors as soon as possible. If the source is large enough, it is syntax parsed on a different thread to avoid blocking the rest of the web page.

As soon as we know that there are no syntax errors, we can start the execution by doing a full parse of the executed functions to generate bytecode. The bytecode is a format that simplifies the execution of the JavaScript code by an interpreter, and then by the Just-In-Time compilers (JITs). The bytecode is much larger than the source code, so Firefox only generates the bytecode of executed functions.

Sequence diagram describing a script request, with the overhead of the Syntax parser and the Full parser in the content process. The content process request a script through IPC, the cache reply with the source, the source is parsed, then full parsed, and executed.

The Design

The JSBC aims at improving the startup of web pages by saving the bytecode of used functions in the network cache.

Saving the bytecode in the cache removes the need for the syntax-parser and replaces the full parser with a decoder. A decoder has the advantages of being smaller and faster than a parser. Therefore, when the cache is present and valid, we can run less code and use less memory to get the result of the full parser.

Having a bytecode cache however causes two problems. The first problem concerns the cache. As JavaScript can be updated on the server, we have to ensure that the bytecode is up to date with the current source code. The second problem concerns the serialization and deserialization of JavaScript. As we have to render a page at the same time, we have to ensure that we never block the main loop used to render web pages.

Alternate Data Type

While designing the JSBC, it became clear that we should not re-implement a cache.

At first sight a cache sounds like something that maps a URL to a set of bytes. In reality, due to invalidation rules, disk space, the mirroring of the disk in RAM, and user actions, handling a cache can become a full time job.

Another way to implement a cache, as we currently do for Asm.js and WebAssembly, is to map the source content to the decoded / compiled version of the source. This is impractical for the JSBC for two reasons: invalidation and user actions would have to be propagated from the first cache to this other one; and we need the full source before we can get the bytecode, so this would race between parsing and a disk load, which due to Firefox’s sandbox rules will need to deal with interprocess communication (IPC).

The approach chosen for the JSBC wasn’t to implement any new caching mechanism, but to reuse the one already available in the network cache. The network cache is usually used to handle URLs except those handle by a service worker or those using some Firefox internal privileged protocols.

The bytecode is stored in the network cache alongside the source code as “alternate content”; the user of the cache can request either one.

To request a resource, a page that is sandboxed in a content process creates a channel. This channel is then mirrored through IPC in the parent process, which resolves the protocol and dispatches it to the network. If the resource is already available in the cache, the cached version is used after verifying the validity of the resource using the ETag provided by the server. The cached content is transferred through IPC back to the content process.

To request bytecode, the content process annotates the channel with a preferred data type. When this annotation is present, the parent process, which has access to the cache, will look for an alternate content with the same data type. If there is no alternate content or if the data type differs, then the original content (the JavaScript source) is transferred. Otherwise, the alternate content (the bytecode) is transferred back to the content process with an extra flag repeating the data type.

Sequence diagram describing a script request, with the overhead of the Syntax parser and the Full parser in the content process. The content process request a script through IPC, the cache reply with the source, the source is parsed, then full parsed, and executed.

To save the bytecode, the content process has to keep the channel alive after having requested an alternate data type. When the bytecode is ready to be encoded, it opens a stream to the parent process. The parent process will save the given stream as the alternate data for the resource that was initially requested.

This API was implemented in Firefox’s network stack by Valentin Gosu and Michal Novotny, which was necessary to make this work possible. The first advantage of this interface is that it can also be implemented by Service Workers, which is currently being added in Firefox 59 by Eden Chuang and the service worker team. The second advantage of this interface is that it is not specific to JavaScript at all, and we could also save other forms of cached content, such as decoded images or precompiled WebAssembly scripts.

Serialization & Deserialization

SpiderMonkey already had a serialization and deserialization mechanism named XDR. This part of the code was used in the past to encode and decode Firefox’s internal JavaScript files, to improve Firefox startup. Unfortunately, XDR serialization and deserialization cannot handle lazily-parsed JavaScript functions and block the main thread of execution.

Saving Lazy Functions

Since 2012, Firefox parses functions lazily. XDR was meant to encode fully parsed files. With the JSBC, we need to avoid parsing unused functions. Most of the functions that are shipped to users are either never used, or not used during page startup. Thus we added support for encoding functions the way they are represented when they are unused, which is with start and end offsets in the source. Thus unused functions will only consume a minimal amount of space in addition to that taken by the source code.

Once a web page has started up, or in case of a different execution path, the source might be needed to delazify functions that were not cached. As such, the source must be available without blocking. The solution is to embed the source within the bytecode cache content. Instead of storing the raw source, the same way it is served by the network cache, it is stored in UCS2 encoding as compressed chunks¹, the same way we represent it in memory.

Non-Blocking Transcoding

XDR is a main-thread only blocking process that can serialize and deserialize bytecode. Blocking the main thread is problematic on the web, as it hangs the browser, making it unresponsive until the operation finishes. Without rewriting XDR, we managed to make this work such that it does not block the event loop. Unfortunately, deserialization and serialization could not both be handled the same way.

Deserialization was the easier of the two. As we already support parsing JavaScript sources off the main thread, decoding is just a special case of parsing, which produces the same output with a different input. So if the decoded content is large enough, we will transfer it to another thread in order to be decoded without blocking the main thread.

Serialization was more difficult. As it uses resources handled by the garbage collector, it must remain on the main thread. Thus, we cannot use another thread as with deserialization. In addition, the garbage collector might reclaim the bytecode of unused functions, and some objects attached to the bytecode might be mutated after execution, such as the object held by the JSOP_OBJECT opcode. To work around these issues, we are incrementally encoding the bytecode as soon as it is created, before its first execution.

To incrementally encode with XDR without rewriting it, we encode each function separately, along with location markers where the encoded function should be stored. Before the first execution we encode the JavaScript file with all functions encoded as lazy functions. When the function is requested, we generate the bytecode and replace the encoded lazy functions with the version that contains the bytecode. Before saving the serialization in the cache, we replace all location markers with the actual content, thus linearizing the content as if we had produced it in a single pass.

The Outcome

The Threshold

The JSBC is a trade-off between encoding/decoding time and memory usage, where the right balance is determined by the number of times the page is visited. As this is a trade-off, we have the ability to choose where to set the cursor, based on heuristics.

To find the threshold, we measured the time needed to encode, the time gained by decoding, and the distribution of cache hits. The best threshold is the value that minimizes the cost function over all page loads. Thus we are comparing the cost of loading a page without any optimization (x1), the cost of loading a page and encoding the bytecode (x1.02 — x1.1), and the cost of decoding the bytecode (x0.52 — x0.7). In the best/worst case, the cache hit threshold would be 1 or 2, if we only considered the time aspect of the equation.

As a human, it seems that we should not penalize the first visit of a website and save content in the disk. You might read a one time article on a page that you will never visit again, and saving the cached bytecode to disk for future visits sounds like a net loss. For this reason, the current threshold is set to encode the bytecode only on the 4th visit, thus making it available on the 5th and subsequent visits.

The Results

The JSBC is surprisingly effective, and instead of going deeper into explanations, let’s see how it behaves on real websites that frequently load JavaScript, such as Wikipedia, Google Search results, Twitter, Amazon and the Facebook Timeline.

This graph represents the average time between the start of navigation and when the onload event for each website is fired, with and without the JavaScript Startup Bytecode Cache (JSBC). The error bars are the first quartile, median and third quartile values, over the set of approximately 30 page loads for each configuration. These results were collected on a new profile with the apptelemetry addon and with the tracking protection enabled.

While this graph shows the improvement for all pages (wikipedia: 7.8%, google: 4.9%, twitter: 5.4%, amazon: 4.9% and facebook: 12%), this does not account for the fact that these pages continue to load even after the onload event. The JSBC is configured to capture the execution of scripts until the page becomes idle.

Telemetry results gathered during an experiment on Firefox Nightly’s users reported that when the JSBC is enabled, page loads are on average 43ms faster, while being effective on only half of the page loads.

The JSBC neither improves benchmarks nor regresses them. Benchmarks’ behaviour does not represent what users actually do when visiting a website — they would not reload the pages 15 times to check the number of CPU cycles. The JSBC is tuned to capture everything until the pages becomes idle. Benchmarks are tuned to avoid having an impatient developer watching a blank screen for ages, and thus they do not wait for the bytecode to be saved before starting over.

 
Thanks to Benjamin Bouvier and Valentin Gosu for proofreading this blog post and suggesting improvements, and a special thank you to Steve Fink and Jérémie Patonnier for improving this blog post.

¹ compressed chunk: Due to a recent regression this is no longer the case, and it might be interesting for a new contributor to fix.

HolyJit: A New Hope

tl;dr: We believe there is a safer and easier way of writing a Jit.

Current State

Today, all browsers’ Jits share a similar design. This design makes extending the language or improving its performance time-consuming and complex, especially while avoiding security issues.

For instance, at the time of this writing, our Jit relies upon ~15000 lines of carefully crafted, hand-written assembly code (~36000 in Chromium’s v8). The Jit directory represents 5% of all the C++ code of Firefox, and contains 3 of the top 20 largest files of Firefox, all written by hand.

Interestingly, these files all contain code that is derived by hand from the Interpreter and a limited set of built-in functions of the JavaScript engine. But why do it by hand, when we could automatize the process, saving time and risk? HolyJit is exploring this possibility.

Introducing HolyJit (prototype)

This week, during the JS Team meetup, we have demonstrated the first prototype of a Rust meta-Jit compiler, named HolyJit. If our experiment proves successful, we believe that employing a strategy based on HolyJit will let us avoid many potential bugs and let us concentrate on strategic issues. This means more time to implement JavaScript features quickly and further improve the speed of our Jit.

HolyJit library instrumenting the Rust compiler to add a meta-Jit for Rust code.

For instance, in a recent change, we extended the support of optimizations to Array.prototype.push. What should have been a trivial modification required diving into safety-critical code and adding 135 lines of code, and reading even more code to check that we were not accidentally breaking invariants.

With HolyJit, what should have been a trivial change would effectively have been a trivial change. The following change to a hypothetical JS Jit built with HolyJit does exactly the same thing as the previous patch, i.e. allowing the Jit to inline the Array.prototype.push function when it is being called with more than one argument.

 fn array_push(args: &CallArgs) -> Result<JSValue, Error> {
-    jit_inline_if!(args.len() == 1);
+    jit_inline_if!(args.len() >= 1);
     …
 }

By making changes self-contained and simple, we hope that HolyJit will improve the safety of our Jit engine, and let us focus on optimizations.

HolyJit Repository: https://github.com/nbp/holyjit

Thanks to David Teller, Jason Orendorff, Sean Stangl, Jon Coppeard for proof reading this blog post.

ECMAScript 2016+ in Firefox

After the release of ECMAScript 2015, a.k.a. ES6, the ECMAScript Language Specification is evolving rapidly: it’s getting many new features that will help developing web applications, with a new release planned every year.

Last week, Firefox Nightly 54 reached 100% on the Kangax ECMAScript 2016+ compatibility table that currently covers ECMAScript 2016 and the ECMAScript 2017 draft.

Here are some highlights for those spec updates.

Kangax ECMAScript 2016+ compatibility table

ECMAScript 2016

ECMAScript 2016 is the latest stable edition of the ECMAScript specification. ECMAScript 2016 introduces two new features, the Exponentiation Operator and Array.prototype.includes, and also contains various minor changes.

New Features

Exponentiation Operator

Status: Available from Firefox 52 (now Beta, will ship in March 2017).

The exponentiation operator (**) allows infix notation of exponentiation.
It’s a shorter and simpler replacement for Math.pow. The operator is right-associative.

// without Exponentiation Operator
console.log(Math.pow(2, 3));             // 8
console.log(Math.pow(2, Math.pow(3, 2)); // 512

// with Exponentiation Operator
console.log(2 ** 3);                     // 8
console.log(2 ** 3 ** 2);                // 512

Array.prototype.includes

Status: Available from Firefox 43.

Array.prototype.includes is an intuitive way to check the existence of an element in an array, replacing the array.indexOf(element) !== -1 idiom.

let supportedTypes = [
  "text/plain",
  "text/html",
  "text/javascript",
];

// without Array.prototype.includes.
console.log(supportedTypes.indexOf("text/html") !== -1); // true
console.log(supportedTypes.indexOf("image/png") !== -1); // false

// with Array.prototype.includes.
console.log(supportedTypes.includes("text/html")); // true
console.log(supportedTypes.includes("image/png")); // false

Miscellaneous Changes

Generators can’t be constructed

Status: Available from Firefox 43.

Calling generator with new now throws.

function* g() {
}

new g(); // throws

Iterator for yield* can catch throw()

Status: Available from Firefox 27.

When Generator.prototype.throw is called on a generator while it’s executing yield*, the operand of the yield* can catch the exception and return to normal completion.

function* inner() {
  try {
    yield 10;
    yield 11;
  } catch (e) {
    yield 20;
  }
}

function* outer() {
  yield* inner();
}

let g = outer();
console.log(g.next().value);  // 10
console.log(g.throw().value); // 20, instead of throwing

Function with non-simple parameters can’t have “use strict”

Status: Available from Firefox 52 (now Beta, will ship in March 2017).

When a function has non-simple parameters (destructuring parameters, default parameters, or rest parameters), the function can’t have the "use strict" directive in its body.

function assertEq(a, b, message="") {
  "use strict"; // error

  // ...
}

However, functions with non-simple parameters can appear in code that’s already strict mode.

"use strict";
function assertEq(a, b, message="") {
  // ...
}

Nested RestElement in Destructuring

Status: Available from Firefox 47.

Now a rest pattern in destructuring can be an arbitrary pattern, and also be nested.

let [a, ...[b, ...c]] = [1, 2, 3, 4, 5];

console.log(a); // 1
console.log(b); // 2
console.log(c); // [3, 4, 5]

let [x, y, ...{length}] = "abcdef";

console.log(x);      // "a"
console.log(y);      // "b"
console.log(length); // 4

Remove [[Enumerate]]

Status: Removed in Firefox 47.

The enumerate trap of the Proxy handler has been removed.

ECMAScript 2017 draft

ECMAScript 2017 will be the next edition of ECMAScript specification, currently in draft. The ECMAScript 2017 draft introduces several new features, Object.values / Object.entries, Object.getOwnPropertyDescriptors, String padding, trailing commas in function parameter lists and calls, Async Functions, Shared memory and atomics, and also some minor changes.

New Features

Async Functions

Status: Available from Firefox 52 (now Beta, will ship in March 2017).

Async functions help with long promise chains broken up to separate scopes, letting you write the chains just like a synchronous function.

When an async function is called, it returns a Promise that gets resolved when the async function returns. When the async function throws, the promise gets rejected with the thrown value.

An async function can contain an await expression. That receives a promise and returns a resolved value. If the promise gets rejected, it throws the reject reason.

async function makeDinner(candidates) {
  try {
    let itemsInFridge = await fridge.peek();

    let itemsInStore = await store.peek();

    let recipe = await chooseDinner(
      candidates,
      itemsInFridge,
      itemsInStore,
    );

    let [availableItems, missingItems]
      = await fridge.get(recipe.ingredients);

    let boughtItems = await store.buy(missingItems);

    pot.put(availableItems, boughtItems);

    await pot.startBoiling(recipe.temperature);

    do {
      await timer(5 * 60);
    } while(taste(pot).isNotGood());

    await pot.stopBoiling();

    return pot;
  } catch (e) {
    return document.cookie;
  }
}

async function eatDinner() {
  eat(await makeDinner(["Curry", "Fried Rice", "Pizza"]));
}

Shared memory and atomics

Status: Available behind a flag from Firefox 52 (now Beta, will ship in March 2017).

SharedArrayBuffer is an array buffer pointing to data that can be shared between Web Workers. Views on a shared memory can be created with the TypedArray and DataView constructors.

When transferring SharedArrayBuffer between the main thread and a worker, the underlying data is not transferred but instead the information about the data in memory is sent. As a result it reduces the cost of using a worker to process data retrieved on main thread, and also makes it possible to process data in parallel on multiple workers without creating separate data for each.

// main.js
let worker = new Worker("worker.js");

let sab = new SharedArrayBuffer(IMAGE_SIZE);
worker.onmessage = functions(event) {
  let image = createImageFromBitmap(event.data.buffer);
  document.body.appendChild(image);
};
captureImage(sab);
worker.postMessage({buffer: sab})

// worker.js
onmessage = function(event) {
  let sab = event.data.buffer;
  processImage(sab);
  postMessage({buffer: sab});
};

Moreover, a new API called Atomics provides low-level atomic access and synchronization primitives for use with shared memory. Lars T Hansen has already written about this in the Mozilla Hacks post “A Taste of JavaScript’s New Parallel Primitives“.

Object.values / Object.entries

Status: Available from Firefox 47.

Object.values returns an array of a given object’s own enumerable property values, like Object.keys does for property keys, and Object.entries returns an array of [key, value] pairs.

Object.entries is useful to iterate over objects.

createElement("img", {
  width: 320,
  height: 240,
  src: "http://localhost/a.png"
});

// without Object.entries
function createElement(name, attributes) {
  let element = document.createElement(name);

  for (let name in attributes) {
    let value = attributes[name];
    element.setAttribute(name, value);
  }

  return element;
}

// with Object.entries
function createElement(name, attributes) {
  let element = document.createElement(name);

  for (let [name, value] of Object.entries(attributes)) {
    element.setAttribute(name, value);
  }

  return element;
}

When property keys are not used in the loop, Object.values can be used.

Object.getOwnPropertyDescriptors

Status: Available from Firefox 50.

Object.getOwnPropertyDescriptors returns all own property descriptors of a given object.

The return value of Object.getOwnPropertyDescriptors can be passed to Object.create, to create a shallow copy of the object.

function shallowCopy(obj) {
  return Object.create(Object.getPrototypeOf(obj),
                       Object.getOwnPropertyDescriptors(obj));
}

String padding

Status: Available from Firefox 48.

String.prototype.padStart and String.prototype.padEnd add padding to a string, if necessary, to extend it to the given maximum length. The padding characters can be specified by argument.

They can be used to output data in a tabular format, by adding leading spaces (align to end) or trailing spaces (align to start), or to add leading "0" to numbers.

let stock = {
  apple: 105,
  pear: 52,
  orange: 78,
};
for (let [name, count] of Object.entries(stock)) {
  console.log(name.padEnd(10) + ": " + String(count).padStart(5, 0));
  // "apple     : 00105"
  // "pear      : 00052"
  // "orange    : 00078"
}

Trailing commas in function parameter lists and calls

Status: Available from Firefox 52 (now Beta, will ship in March 2017).

Just like array elements and object properties, function parameter list and function call arguments can now have trailing commas, except for the rest parameter.

function addItem(
  name,
  price,
  count = 1,
) {
}

addItem(
  "apple",
  30,
  2,
);

This simplifies generating JavaScript code programatically, i.e. transpiling from other language. Code generator doesn’t have to worry about whether to emit comma or not, while emitting function parameters or function call arguments.

Also this makes it easier to rearrange parameters by copy/paste.

Miscellaneous Changes

Remove proxy OwnPropertyKeys error with duplicate keys

Status: Available from Firefox 51.

The ownKeys trap of a user-defined Proxy handler is now permitted to return duplicate keys for non-extensible object.

let nonExtensibleObject = Object.preventExtensions({ a: 10 });

let x = new Proxy(nonExtensibleObject, {
  ownKeys() {
    return ["a", "a", "a"];
  }
});

Object.getOwnPropertyNames(x); // ["a", "a", "a"]

Case folding for \w, \W, \b, and \B in unicode RegExp

Status: Available from Firefox 54 (now Nightly, will ship in June 2017).

\w, \W, \b, and \B in RegExp with unicode+ignoreCase flags now treat U+017F (LATIN SMALL LETTER LONG S) and U+212A (KELVIN SIGN) as word characters.

console.log(/\w/iu.test("\u017F")); // true
console.log(/\w/iu.test("\u212A")); // true

Remove arguments.caller

Status: Removed in Firefox 53 (now Developer Edition, will ship in April 2017).

The caller property on arguments objects, that threw a TypeError when gotten or set, has been removed.

function f() {
  "use strict";
  arguments.caller; // doesn't throw.
}
f();

What’s Next?

We’re also working on implementing ECMAScript proposals.

New Feature

Function.prototype.toString revision (proposal Stage 3)

Status: Work in progress.

This proposal standardizes the string representation of functions to make it interoperable between browsers.

Lifting Template Literal Restriction (proposal Stage 3)

Status: Available from Firefox 53 (now Developer Edition, will ship in April 2017).

This proposal removes a restriction on escape sequences in Tagged Template Literals.

If an invalid escape sequence is found in a tagged template literal, the template value becomes undefined but the template raw value becomes the raw string.

function f(callSite) {
  console.log(callSite);     // [undefined]
  console.log(callSite.raw); // ["\\f. (\\x. f (x x)) (\\x. f (x x))"]
}

f`\f. (\x. f (x x)) (\x. f (x x))`;

Async Iteration (proposal Stage 3)

Status: Work in progress.

The Async Iteration proposal comes with two new features: async generator and for-await-of syntax.

The async generator is a mixture of a generator function and an async function. It can contain yield, yield*, and await. It returns a generator object that behaves asynchronously by returning promises from next/throw/return methods.

The for-await-of syntax can be used inside an async function and an async generator. It behaves like for-of, but interacts with the async generator and awaits internally on the returned promise.

async function* loadImagesSequentially(urls) {
  for (let url of urls) {
    yield await loadImage(url);
  }
}

async function processImages(urls) {
  let processedImages = [];

  for await (let image of loadImagesSequentially(urls)) {
    let processedImage = await processImageInWorker(image);
    processedImages.push(processedImage);
  }

  return processedImage;
}

Conclusion

100% on the ES2016+ compatibility table is an important milestone to achieve, but the ECMAScript language will continue evolving. We’ll keep working on implementing new features and fixing existing ones to make them standards-compliant and improve their performance. If you find any bug, performance issue, or compatibility fault, please let us know by filing a bug in Bugzilla. Firefox’s JavaScript engine engineers can also be found in #jsapi on irc.mozilla.org. 🙂

Interesting reads in January

Most of us have our own blog. As a result this channel has become a little bit quiet, which is sad for the people using this blog to get news about the SpiderMonkey engine. As a result I will try to keep up with the different blogs and aggregate the links here.

CacheIR: A new approach to Inline Caching in Firefox
by Jan de Mooij (jandem)

The past months we have been working on CacheIR, an overhaul of the IC code in SpiderMonkey’s JITs. CacheIR has allowed us to remove thousands of lines of boilerplate and code duplication. It has also made it much easier to add new optimizations. This post describes how CacheIR works, why it’s much better than what we had before, and some plans we have for the future… read further >>

Spidermonkey JIT improvements in FF52
by Hannes Verschore (h4writer)

Last week we signed off our hard work on FF52 and we will start working on FF53. The expected release date of this version is the 6th of March. In the meantime we still have time to stabilize the code and fix issues that we encounter … read further >>

TypedArray or DataView: Understanding byte order
by Martin Splitt

Depending on how you access an ArrayBuffer you get different byte order on the same machine. So long story short: it makes a difference if you use a TypedArray or the setters from a DataView. ArrayBuffer is there to give efficient and fast access to binary data … read further >>

IonMonkey: Evil on your behalf

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.”
This quote is not as well known as the end of it, often shortened to “premature optimization is the root of all evil”. It highlights why you should prefer code which is more maintainable, as long as the performance of it does not impact the user.
 

Maintainable code is a matter of taste. For example, we often teach students to dislike “goto” in C, but the best error handling code I have seen (in C) comes from the Linux kernel, and is using “goto”’s. Maintainable code in JavaScript is also a matter of taste. Some might prefer using ES6 features, while others might prefer functional programing approach with lambdas or even using some framework.

In most cases, we add more abstractions, we add more memory, and more code, thus giving even more reasons for the code to be slower. Today, I will introduce two optimizations that made it into IonMonkey, which are known as Scalar Replacement and Branch Pruning.
 

Scalar Replacement

 
The goal of Scalar Replacement is to reduce the amount of memory needed to represent objects in memory, by replacing object properties with local variables.  Then, when all objects properties are replaced, we can remove the object and avoid its memory overhead (i-e allocation, accesses, and GCs).
 
For example, in the following code, we have a few functions for manipulating complex numbers as a tuple of a real and an imaginary part.  When compiling the norm_multiply function, the Inlining phase will bake the code of complex and norm functions inside the norm_multiply function.

function complex(r, i) {
    return { r: r, i: i };
}
function norm(c) {
    return Math.sqrt(c.r * c.r + c.i * c.i);
}
function norm_multiply(a, b) {
    var mul = complex(a.r * b.r - a.i * b.i, a.r * b.i + a.i * b.r);
    return norm(mul);
}
Thus, Scalar Replacement handles the object coming from the complex function, and replaces it by two local variables.  The object is no longer needed and IonMonkey effectively runs the following code:

function norm_multiply(a, b) {
    var mul_r = a.r * b.r - a.i * b.i;
    var mul_i = a.r * b.i + a.i * b.r;
    return Math.sqrt(mul_r * mul_r + mul_i * mul_i);
}
This optimization works by looking at one object, then determining if the object never escapes (detailed below), that all properties are known, and that it is not mixed with other objects.
 
Once all these predicates are validated, this optimization emulates the content of the memory of the object while traversing the control flow graph of the program, thereby replacing all property accesses with reads of local variables.
 
Once all property accesses are removed, the allocated object is used only by a few recover instructions that are capturing the object state at across the control flow graph. Each object state is an instruction that serves no purpose during the execution, but uses the Recover Instructions mechanism to set the content of the properties on the deoptimization path (bailout) to Baseline compiled code. At the end, the Sink / Dead Code Elimination phases convert the object allocation into a recovered instruction, as it is only used by the object state instructions.
 

Escape Analysis

 
One of the biggest limitation of Scalar Replacement is that it is limited to objects that do not escape.
 
We have multiple ways for an object to escape:
  • a function call.
    
    escape({ bar: 1, baz: 2 });
  •  a returned value.
    
    function next() {
        return { done: false, value: 0};
    }
  • an escaped object property.
    
    escaped_obj = { property: obj };
    escape(escaped_obj);
    
The problem of an escaped object is that we have no idea how the object would be manipulated, thus we cannot safely replace the properties of the object with local variables.
 
The good news is that we already have an optimization phase, called Inlining, that already takes care of inserting the code of smaller functions into the body of their calling functions, as long as they are frequently used.
 
Inlining has some limitations, such as the size and hotness of the function. For example, it might be possible for scalar replacement to never trigger if the object escapes in an unused branch, such as in exception handling or logging code.
 
function doSomething(obj) {
    if (theHighlyUnlikelyHappened())
        throw new Error("aaaaaaahhh!!", obj);
}
Fortunately, this kind of problem is addressed by the Branch Pruning optimization described in the next section.
 

Mixing Objects from Multiple Allocation Sites

 
One other limitation of Scalar Replacement is that we have to be able to identify a single allocation that dominates (i-e in the same block or in any enclosing block) the rest of its uses.  For example, the following code causes problem because at the end of the then-block, we do not know which branch the allocated object comes from.
 
function dummy() {
    var obj = { done: false }; // (a)
    if (len == max)
        obj = { done: true }; // (b)
    if (obj.done) // obj comes from either (a) or (b)
        console.log("We are done! \o/");
}
This issue also appears in the case of a returned object.  When a function is inlined, all the return statements are funneled into a return block.  Thus, Scalar Replacement is not capable of mixing multiple objects allocations.  This problem occurs in the next function of iterators:
 
function next() {
    if (this.idx < this.len)
        return { value: this.getValue(idx), done: false };
    return { done: true };
}
As long as all properties are known, these can be transformed quite easily, by creating a single object allocation ahead of the condition and mutating the object, while returning the same object from all paths.
 
function next() {
    var obj = { value: undefined, done: false };
    if (this.idx < this.len) {
        obj.value = this.getValue(idx);
        return obj;
    }
    obj.done = true;
    return obj;
}
This problem actually occurred in the self-hosted code of the Array iterator‘s next() function, used by for-of.  At first we rewrote it as shown in the example above, but to properly handle the security model of SpiderMonkey, this trick was not enough. The security model requires an extra branch which adds a new return statement with a different object, which cannot be merged above. Fortunately, this issue goes away with Branch Pruning, as we will see below.
 

Known Properties

 
Another limitation of Scalar Replacement is the requirement to identify property accesses at compile time. This implies that one cannot expect to have a working Scalar Replacement for a loop iterating over the properties of an object, or the indexes of an array.
 
function norm1(vec) {
    var sum = 0;
    for (var i = 0; i < vec.length; i++)
        sum += vec[i];
    return sum;
}
This is one case where a Loop Unrolling optimization might be useful in the future.
 

Lambdas

 
Lambdas are one of the cases where Scalar Replacement makes a lot of sense, especially in SpiderMonkey where the scope chain uses objects as the underlying representation.
 
Each time you execute code that gives a lambda literal as an argument, a new function is created.  This new function holds a pointer to its function environment, which itself holds a pointer to the scope chain of the enclosing function.
 
In cases where none of the lambdas within a function escape, we can use scalar replacement to optimize scope chain accesses within the inlined code of the lambdas.
 
For example, in the following code, the Inlining will add the code for the forEach function, and the code of the lambda given as argument to the forEach function, into the caller. Then Scalar Replacement will detect that the lambda does not escape, since forEach and the lambda are inlined, and it will replace the scope chain holding the captured sum variable with a local variable.
 
function norm1(vec) {
    var sum = 0;
    vec.forEach((x) => { sum += x; });
    return sum;
}
At the same time, the scope chain allocation as well as the new function allocation holding it will be moved to the bailout path.  Thus, we will no longer do any allocation to execute this function.  In this case, Scalar Replacement makes this forEach call as fast as the C-like for loop equivalent.
 
Scalar Replacement has a lot of pitfalls, but when all the conditions are met, it can remove tons of allocations while allowing JavaScript developers to use higher levels of abstraction.
 
As soon as we make generic functions, it becomes quite hard to avoid these pitfalls. Just the fact that the code is present, even if it is never executed, causes these pitfalls to appear.
 

Branch Pruning

 
The goal of Branch Pruning is to remove unused branches. This is similar to the badly named Profiler Guided Optimization (PGO) phase that we have in static compilers, except that instead of only moving infrequently used branches out of the main code path, we remove them entirely from IonMonkey’s generated code.
 
To do so, we instrument SpiderMonkey to count the number of time each block of code gets executed. Then, when compiling the code, we check based on hand-crafted heuristics whether a block of code should be removed or not. The heuristics select branches that have never been executed, are too complex (i-e store values in memory, make calls, have a large number of instructions) and have predecessors with a large numbers of executions. If a block should be removed, then we replace the branch with a bailout that will fall back to the Baseline compiled code to resume the execution.
 
This optimization alone does not bring much performance improvement. At best, it can speed up the compiler by removing a chunk of the workload, and help the instruction cache of the CPU by removing instructions from the pipeline.
 
However, Branch Pruning helps the other phases of IonMonkey. By removing unused branches, we improve other optimizations such as Scalar Replacement, Loop Invariant Code Motion, etc.
 
Thus code that has to handle unlikely error cases, such as in the following, can be modified by Branch Pruning to remove the block that has never executed yet.
 
function doSomething(obj) {
    if (theHighlyUnlikelyHappened())
        throw new Error("aaaaaaahhh!!", obj);
}
In this example, IonMonkey will convert the then-block into a path which no longer merges back in the control flow graph, and instead does a bailout to Baseline. In JavaScript terms, this would be similar to doing a yield while we are in IonMonkey’s compiled code and resuming the execution in Baseline’s compiled code. Baseline’s compiled code still has the instruction for running the exception handling code.
 
// IonMonkey's compiled code
function doSomething(obj) {
    if (theHighlyUnlikelyHappened())
        bailout; // yield and resume in Baseline compiled code.
}

Frameworks

 
In today’s usage of JavaScript, a single page uses multiple frameworks. These frameworks are likely made to handle more than one use case. As a user of these frameworks, you are probably only using a few, either by convention or by habit.
 
The promise of Branch Pruning is that all the branches of code that you are not using at all will be removed. Thus unused branches would not prevent optimizations.
 
For example, as mentioned in the Scalar Replacement optimization, the Array iterator’s next() function has an extra branch to support the security model of SpiderMonkey.  This adds an extra branch that is unlikely to be used by most websites, thus Branch Pruning is able to remove this branch and replace it with a bailout.
 
As this branch is replaced by a bailout, this waives the limitation preventing Scalar Replacement.  Thus, the fact that Branch Pruning removes code enables Scalar Replacement to reduce the memory allocations, and also optimize the execution of for-of loops.
 
The following histogram represents the relative speed of the for-of micro-benchmark compared against the same for loop written in C-style.  In addition, we compare with the improvements provided by Scalar Replacement (SR), Branch Pruning (BP), and with both enabled (SR+BP).  This highlights that these two optimizations are better than the sum of their individual contributions.  These optimizations are dividing the time taken by for-of loops by a factor of 2.5x.
 
 

Future Work

 
Scalar Replacement and Branch Pruning, when combined, are able to remove a lot of code and allocations, and are able to improve the speed of frameworks such as ES6 for-of loops (only a factor of 2.8x behind a C-like for loop). To make for-of loops as fast as C-like for loops, we would need to remove the bounds checks on the array element indexes, as well as add support for already allocated objects to Scalar Replacement.
 
Branch Pruning heuristics are too conservative today, meaning that we will not attempt to remove branches unless we are confident that the branch is not going to be used. This problem comes from the fact that bailouts are costly. Hopefully, this will be addressed in the future in a project with the code name ThreeHeadedMonkey.
 
Thanks to the blog post reviewers, with special thanks to Steve Fink, and to all the SpiderMonkey team for helping with the Scalar Replacement and Branch Pruning implementations.

The state of SIMD.js performance in Firefox

SIMD.js is a new API for performing SIMD (Single Instruction Multiple Data) computations in JavaScript which is being developed by Google, Intel, Mozilla, and others. For example, it defines a new Float32x4 type which represents 4 float32 values packed up together. The API contains functions which operate in parallel on each value, including all basic arithmetic operations, and operations to rearrange, load, and store such values. For a more complete introduction to SIMD.js, see Dan Gohman’s recent blog post about SIMD.js.

SIMD operations map closely to processor instructions, so they can be made very efficient on modern processors. In theory this means that a SIMD program using float32x4 instead of float32 may be up to four times faster than a non-SIMD program. In practice there is some overhead when using SIMD instructions — this is due to inserting and extracting values into / from a vector, shuffles, and programs sometimes need to be reorganized in order to have multiple values to operate on at the same time. Also note that only the parts of a program which will use float32x4 instead of float32 will benefit from the speedup, and other computations not using float32x4 obviously won’t get the speedup.

SIMD in OdinMonkey (asm.js)

Firefox Nightly began shipping SIMD.js support by adding it for asm.js code. Initially this covered the two SIMD types int32x4 and float32x4 as these are commonly used in SIMD compute kernels and are well supported across different architectures.

Implementing the SIMD operations for these two types in asm.js on a single platform (x86) allowed us to see that the performance of applications cross-compiled from C++ by Emscripten and using these intrinsics was close to native performance. Moreover, it gave us a platform to quickly iterate on the API, and allowed us to find numerous corner cases and gaps in the API, such as particular value inputs to some operations and architecture specific issues.

This work also paved the way for optimization in the general case (that is, not specific to asm.js) as the asm.js compiler and IonMonkey share a lot of the backend code which takes care of generating the assembly sequences which map to the SIMD instrinsics. It also made unit testing way simpler, as each operation can be tested separately in a consistent fashion.

Demos of SIMD code complying to the asm.js type system rules can be found on Intel’s demo site. You can even test it at home if you have a Nightly build of Firefox!

We’ve extracted the SIMD kernel code from the Mandelbrot demo, taken above, to be able to benchmark it. Here is a graphic that shows performance difference between the official SIMD.js polyfill (which mostly uses Typed Arrays) and the performance in OdinMonkey (asm.js). Note that the Y axis is in milliseconds, so the lower, the better.

SIMD Mandelbrot Kernel

SIMD in IonMonkey

As most of the work on implementing SIMD in asm.js was complete the next step was obviously to make it efficient in “regular” JavaScript, that is, non-asm.js.

The main difference between OdinMonkey (asm.js) and IonMonkey (“regular” JavaScript) is that IonMonkey has to cohabit with the Baseline compiler and the Interpreter. Thus it has to avoid object allocations while executing SIMD instructions and while boxing the SIMD values when calling back or returning to Baseline or the Interpreter, which expect objects.

The Base

Before running the code produced by IonMonkey, we first have to execute code produced by the baseline compiler. Baseline is mostly composed of inline caches, and at the moment it does not attempt any optimizations on SIMD objects. On the other hand, the inline caches do record which functions are being executed. The functions recorded by Baseline are then inlined by IonMonkey to generate SIMD instructions.

One of the challenges we had to face when implementing SIMD in Ion was that our register allocator, which is the algorithm choosing where values should live (e.g. in a register or on the stack) in JIT code, didn’t know about registers with widths larger than 64 bits — the size of a double floating-point value. Also, SIMD instructions have expectations on stack alignment, so we needed to adjust most trampolines code sections of the engine to have them respect the stack requirements.

With these changes completed, Ion could successfully execute SIMD code. However, compared to Odin, the performance was still quite slow. Odin uses the same backend code as Ion, so it serves as a rough theoretical lower-bound for Ion. Here is a graph comparing performance between the polyfill, Ionmonkey at this point, and OdinMonkey.

SIMD Mandelbrot - polyfill vs ion vs asmjs

Recover instructions & Global Value Numbering

The reason why SIMD instructions were not extremely fast is ultimately because JavaScript has dynamic types. We use Baseline whenever we don’t know anything about what the types will be, or when we think we know something but then find out that we were wrong. When the latter case happens, we have to bail out of the code that Ion produced under the now-wrong assumption, and start back over again in Baseline. This means we need to have support for capturing values from Ion execution and packaging them up to work in Baseline code. Any fallible instruction contains an implicit path back to resume the execution in Baseline. Baseline primarily operates on boxed objects, so that it can be independent of the actual dynamic types. Consequently, when Ion bails out into Baseline, it must box up any live SIMD values into objects.

Fortunately for us, this was addressed by the addition of Recover Instructions last June. Recover Instructions are used to move any instruction unused by IonMonkey to the implicit paths to Baseline. In this case, it means that we can box the SIMD values only on the fallible (extremely cold) paths, preventing lot of allocations in other code paths.

Combined with Global Value Numbering — an algorithm which can find redundant code and simplify operations sequences — we can remove any sequences where one SIMD instruction result is boxed and the next instruction unboxes it. Once again, this optimization prevents SIMD object allocations by removing uses of the boxed values.

Here is the same Mandelbrot benchmark (without the slow polyfill version, so that we can better compare Ion and Odin). The new Ion bar represents the state of performance in Ion after the two previously mentioned optimizations.

SIMD Mandelbrot kernel - Ion optimized vs AsmJS

Eager Unboxing

In fact the above solution applies to any known objects/array, with the help of Escape Analysis and Scalar Replacement but it is not perfect, and it might still be slower compared to the asm.js version of the same benchmark.

The problem with any object is that the allocation-site of any object is observable. Thus we can detect whether an object is the same that the one we’ve used before, or if it was duplicated. The strict equality / difference operators (=== and !==) can distinguish between two objects pointers. Thus if you call a black-box function named “id” which returns a similar object, you can determine if the object is the same as the input of the function or a different one.


// Returns the initial object if it has some properties
// or a new one otherwise.
function id (x) {
  if (Object.getOwnPropertyNames(x).length)
    return x;
  return {};
}
var o1 = {};
console.log(o1 === id(o1)); // false, different objects
var o2 = {js_is_awesome: true};
console.log(o2 === id(o2)); // true, same object

Fortunately, the specification intends that SIMD values are implemented as first class-citizen values, not as objects. This means that an object is defined by its content and not by its container. Thus, an optimizing compiler such as IonMonkey can eagerly unbox the content of the value and box the content before it exits. As a matter of fact, the last object allocations caused by boxing are removed by eagerly unboxing any potential SIMD values, and boxing them as late as possible.

Here is the current state of performance. Ion (final) contains all optimizations, including the one described in this paragraph. Ion’s execution time of this benchmark is in the ballpark of twice the time spent in asm.js execution, which is really nice.

SIMD Mandelbrot kernel - Ion (final) vs AsmJS

Future work

All the SIMD optimization work, be it in asm.js or in “regular” JS, has been done only for the x86 and x64 platforms, as a proof-of-concept. It is already available in Nightly builds of Firefox. That work needs to be extended to other platforms, in particular ARM, so that the phones running under Firefox OS benefit from fast SIMD.

If you look closely at the specification, you’ll see that there are more types than the two mentioned earlier: float64x2 (two double floating-point values packed), int16x8 (eight integers of 16 bits) and int8x16 (sixteen integers of 8 bits). Future work will include mapping these new types and their operations to assembly instructions as well, in Odin and Ion.

Currently, our SIMD values are represented by objects in the interpreter, even though the specification expects them to be instances of value types. In particular, they should have value identity, which means that two SIMD values are the same if, and only if, their components are the same, pairwise. This isn’t true as of today in our implementation, as SIMD values are represented by objects and thus have object identity in the interpreter. This depends on the value types specification moving forward in TC39 and this specification being implemented in Spidermonkey.

As compiler writers, we’re naturally thinking about auto-vectorization too, where the compiler converts regular scalar code into SIMD code, either by restructuring loops, or by finding concurrency among groups of statements. Implementing SIMD.js support today will actually make it easier for us to start experimenting with auto-vectorization in the future, since all the backend support for SIMD will already be in place.

And lastly, the SIMD.js spec itself is still in development, and we are continuing to work with others on the SIMD.js project to propose SIMD.js to TC-39 for standardization.

This blog post has been co-written by Nicolas B. Pierron (:nbp) and Benjamin Bouvier (:bbouvier). Thanks to the blog post reviewers, to all the SpiderMonkey team who has helped reviewing the SIMD patches and to the contributors who helped implementing features!

The Path to Parallel JavaScript

Between the coming release of ES6 and unrelenting competition for JIT performance, these are exciting times for JavaScript. But an area where JS still lags is parallelism—exploiting hardware acceleration by running multiple computations simultaneously. I’d like to present some experiments we’ve been doing in SpiderMonkey with a low-level, evolutionary approach to extending JavaScript with more flexible and powerful primitives for parallelism.

I should be clear that I’m not talking about concurrency, which is about writing programs that respond to simultaneous events. JavaScript’s asynchronous concurrency model is popular and successful, and with promises, ES6 generators, and the upcoming async/await syntax, it’s getting better all the time.

State of the Parallel Union

What I am talking about is unlocking the power lurking inside our devices: GPUs, SIMD instructions, and multiple processor cores. With the emerging WebGL 2.0 and SIMD standards, the Web is making significant progress on the first two. And Web Workers go some part of the way towards enabling multicore parallelism.

But workers are, by design, strongly isolated: they can only communicate via postMessage. And for good reason! JavaScript’s “run-to-completion” programming model is a central part of the programming experience: when your code runs in an event handler, the functions and methods that you call are the only code you have to worry about changing your app state. Nevertheless, this comes at a cost: when multiple threads want to coordinate, they repeatedly have to copy any data they need to communicate between each other. The ability to transfer binary buffers helps cut down on some of these copying costs, but for many apps this still just can’t compete with the ability for multiple threads to write simultaneously into different parts of shared state. Even setting aside the costs of data transfer, message-passing itself has nontrivial latency. It’s hard to compete with dedicated hardware instructions that allow threads to communicate directly through shared state.

So where should we go from here? A radical option would be to bite the bullet and do what Nashorn has done: turn JavaScript into a fully multi-threaded data model and call it a day. In Nashorn, nothing stops you from running multiple Java threads on a shared JavaScript environment. Unless your host Java program is careful to synchronize your scripts, your JavaScript apps lose all the guarantees of run-to-completion. Frankly, I can’t imagine considering such a step right now. Even setting aside the massive standardization and implementation work required, it’s a huge ecosystem risk: every app, every library, every data structure ever written to date threatens to be subverted by imperfect (or malicious) uses of threads.

On the other end of the spectrum, Mozilla Research and Intel Labs have done some experiments over the years with deterministic parallelism APIs (sometimes referred to as River Trail or PJS). The goal of these experiments was to find high-level abstractions that could enable parallel speedups without any of the pitfalls of threads. This is a difficult approach, because it’s hard to find high-level models that are general enough to suit a wide variety of parallel programs. And at least for the moment, PJS faces a difficult adoption challenge: JS engine implementors are reluctant to commit to a large implementation effort without more developer feedback, but developers can’t really put PJS through the paces without a good polyfill to try it out in real production apps.

An Extensible Web Approach to Parallel JS

In 2012, I co-signed the Extensible Web Manifesto, which urged browser vendors and standards bodies to prioritize basic, low-level, orthogonal primitives over high-level APIs. A key insight of the Extensible Web is that growing the platform incrementally actually enables faster progress because it allows Web developers to iterate quickly—faster than browser vendors and standards bodies can—on building better abstractions and APIs on top of the standardized primitives.

Turning back to parallelism, just such a low-level API has been in the air for a while. A couple years ago, Filip Pizlo and Ryosuke Niwa of Apple’s WebKit team discussed the possibility of a variation on ArrayBuffer that could be shared between workers. Around the same time Thibault Imbert floated the same idea on his blog (perhaps inspired by similar functionality in Flash). At last year’s JSConf, Nick Bray of Google’s PNaCl team demo’ed a working prototype of shared buffers in Chrome.

Now, there’s no question such an API is low-level. Unlike PJS, a SharedArrayBuffer type with built-ins for locking would introduce new forms of blocking to workers, as well as the possibility that some objects could be subject to data races. But unlike Nashorn, this is only true for objects that opt in to using shared memory as a backing store—if you create an object without using a shared buffer, you know for sure that it can never race. And workers do not automatically share memory; they have to coordinate up front to share an array buffer. As long as your top level worker code never accepts and uses a shared buffer, you are assured of the same amount of isolation between workers as ever.

Another sensible restriction, at least at this point, is to limit access to shared buffers to workers. Eventually, sharing buffers with the main thread, ideally in controlled ways, would be a logical extension. Exposing shared buffers to the main thread would increase power and allow us to connect parallel computations directly to Web APIs like <canvas>. At the same time, the main thread has implementation challenges and could carry risks for the JS programming experience. It’s an important area to explore but it needs careful investigation.

So this approach is more conservative than full threading, and yet it should be more than enough to satisfy a large number of use cases—from number-crunching to graphics processing to video decoding—and with a much smaller implementation cost on engines than more ambitious solutions like PJS or threads. This would significantly move the needle on what JavaScript applications can do with workers, as well as open new opportunities for compiling threaded languages to the Web.

And crucially, developers would be able to start building higher-level abstractions. As one example, I’ve sketched out API ideas for region-slicing, data-race-free sharing of portions of a single binary buffer, and this could easily be polyfilled with SharedArrayBuffer. Similarly, multi-dimensional parallel array traversals, similar to PJS, could be polyfilled in plain JavaScript, instead of being blocked on standardization. Each of these APIs has pros and cons, including different use cases and performance trade-offs. And the Extensible Web approach lets us experiment with and settle on these and other high-level abstractions faster than trying to standardize them directly.

Moreover, by providing high-performance primitives, different domain-specific abstractions can determine for themselves how to enforce their guarantees. Consider region-slicing, for example: the design represents regions as objects and shares them with workers via message-passing. For some cases, the hits of creating wrapper objects and passing messages would be negligible; others—say, a column-major multidimensional array—might require allocating and communicating so many region slices as to dominate any parallelism gains. Providing the low-level primitives empowers library authors to determine for themselves how to achieve their desired guarantees and what use cases to enable.

Next Steps

We’ve begun experimenting with a SharedArrayBuffer API in SpiderMonkey. Lars Hansen is drafting a spec of the API we’re experimenting with, and we’ve provided a prototype implementation in Firefox Nightly builds. Our hope is that this will allow people to play with the API and give us feedback.

While there seems to be a good amount of interest in this direction, it will require more discussion with Web developers and browser implementers alike. With this post we’re hoping to encourage a wider conversation. We’ll be reaching out to solicit more discussion in standards forums, and we’d love to hear from anyone who’s interested in this space.

Slimmer and faster JavaScript strings in Firefox

Since Brendan Eich wrote the first SpiderMonkey version in 1995, there have been many, many changes to its internal string representation. New string types like inline strings (more on this below) and ropes were added to save memory and improve performance, but the character representation never changed: string characters were always stored as a sequence of UTF-16 code units. The past two months I’ve been working on changing this. The JS engine can now store Latin1 strings (strings that only use the first 256 Unicode code points) more efficiently: it will use 1 byte per character instead of 2 bytes. This is purely an internal optimization, JS code will behave exactly the same. In this post I will use the term Latin1 for the new 8-bit representation and TwoByte for the 16-bit format.

To measure how much memory this saves, I opened Gmail, waited 15 seconds then opened about:memory and looked at the zones/strings subtree under “Other measurements”:

JS String Memory

For every JS string we allocate a small, fixed-size structure (JSString) on the gc-heap. Short strings can store their characters inline (see the Inline strings section below), longer strings contain a pointer to characters stored on the malloc-heap.

Note that the malloc-heap size was more than halved and the total string memory was reduced by about 4 MB (40%). The difference between 32-bit and 64-bit is pretty small, JSString is 16 bytes on 32-bit and 24 bytes on 64-bit, but on 64-bit it can store more characters inline.

The chart below shows how much of our string memory is used for Latin1 strings vs TwoByte strings on 64-bit:

JS String Encoding

Almost all strings are stored as Latin1. As we will see below, this is also the case for non-Western languages. The graph suggests some longer strings (that use malloc storage) are still stored as TwoByte. Some of these strings are really TwoByte and there’s not much we can do there, but a lot of them could be stored as Latin1. There are follow-up bugs on file to use Latin1 strings in more cases.

Why not use UTF8?

At this point you may ask: wait, it’s 2014, why use Latin1 and not UTF8? It’s a good question and I did consider UTF8, but it has a number of disadvantages for us, most importantly:

  • Gecko is huge and it uses TwoByte strings in most places. Converting all of Gecko to use UTF8 strings is a much bigger project and has its own risks. As described below, we currently inflate Latin1 strings to TwoByte Gecko strings and that was also a potential performance risk, but inflating Latin1 is much faster than inflating UTF8.
  • Linear-time indexing: operations like charAt require character indexing to be fast. We discussed solving this by adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case. This scheme will only work for ASCII strings, though, so it’s a potential performance risk. An alternative is to have such operations inflate the string from UTF8 to TwoByte, but that’s also not ideal.
  • Converting SpiderMonkey’s own string algorithms to work on UTF8 would require a lot more work. This includes changing the irregexp regular expression engine we imported from V8 a few months ago (it already had code to handle Latin1 strings).

So although UTF8 has advantages, with Latin1 I was able to get significant wins in a much shorter time, without the potential performance risks. Also, with all the refactoring I did we’re now in a much better place to add UTF8 in the future, if Gecko ever decides to switch.

Non-Western languages

So Latin1 strings save memory, but what about non-Western languages with non-Latin1 characters? It turns out that such websites still use a lot of Latin1 strings for property names, DOM strings and other identifiers. Also, Firefox and Firefox OS have a lot of internal JS code that’s exactly the same for each locale.

To verify this, I opened the top 10 most popular Chinese websites, then looked at about:memory. 28% of all strings were TwoByte, this is definitely more than I saw on English language websites, but the majority of strings were still Latin1.

String changes

Each JSString used to have a single word that stored both the length (28 bits) and the flags (4 bits). We really needed these 4 flag bits to encode all our string types, but we also needed a new Latin1 flag, to indicate the characters are stored as Latin1 instead of TwoByte. I fixed this by eliminating the character pointer for inline strings, so that we could store the string length and flags in two separate 32-bit fields. Having 32 flags available meant I could clean up the type encoding and make some type tests a lot faster. This change also allowed us to shrink JSString from 4 words to 3 words on 64-bit (JSString is still 4 words on 32-bit).

After this, I had to convert all places where SpiderMonkey and Gecko work with string characters. In SpiderMonkey itself, I used C++ templates to make most functions work on both character types without code duplication. The deprecated HTML string extensions were rewritten in self-hosted JS, so that they automatically worked with Latin1 strings.

Some operations like eval currently inflate Latin1 to a temporary TwoByte buffer, because the parser still works on TwoByte strings and making it work on Latin1 would be a pretty big change. Fortunately, as far as I know this did not regress performance on any benchmarks or apps/games. The JSON parser, regex parser and almost all string functions can work on Latin1 strings without inflating.

When I started working on this, Terrence mentioned that if we are going to refactor our string code, it’d be good to make all string code work with unstable string characters as well: once we start allocating strings in the GGC nursery (or have compacting GC for strings), we can no longer hold a pointer to string characters across a GC, because the GC can move the characters in memory, leaving us with a dangling pointer. I added new APIs and functions to safely access string characters and pave the way for more string improvements in the future.

After converting all of SpiderMonkey’s string code, I had to make Gecko work with Latin1 JS strings and unstable string characters. Gecko has its own TwoByte string types and in many cases it used to avoid copying the JS characters by using a nsDependentString. This is not compatible with both Latin1 strings and nursery allocated strings, so I ended up copying (and inflating) JS strings when passing them to Gecko to solve both problems. For short strings we can use inline storage to avoid malloc or nsStringBuffer and overall this turned out to be fast enough: although it was about 10% slower on (pretty worst-case) getAttribute micro-benchmarks, none of our DOM benchmarks noticeably regressed as far as I know. For longer strings, the copy is potentially more expensive, but because I used a (refcounted) nsStringBuffer for those, it often avoids copying later on.

Inline strings

SpiderMonkey has two string types that can store their characters inline, instead of on the malloc heap: inline strings (size of a normal JSString) and fat inline strings (a few words bigger than a JSString). The table below shows the number of characters they can store inline (excluding the null terminator):

32-bit Latin1 32-bit TwoByte 64-bit Latin1 64-bit TwoByte
Inline 7 3 15 7
Fat inline 23 11 23 11

So a Latin1 string of length 15 can be stored in an inline string on 64-bit and a fat inline string on 32-bit. Latin1 strings can store more characters inline so, as expected, there are a lot more inline strings now (measured after opening Gmail):

JS String Types

87% of strings can store their characters inline, this used to be 65%. Inline strings are nice for cache locality, save memory and improve performance (malloc and free are relatively slow). Especially on 64-bit, non-fat inline strings are now more common than fat inline strings, because most strings are very short.

These numbers include atoms, a string subtype the engine uses for property names, identifiers and some other strings. Minimized JS code is likely responsible for many short strings/atoms.

Note that SpiderMonkey has other string types (ropes, dependent strings, external strings, extensible strings), but more than 95% of all strings are inline, fat inline or malloc’ed so I decided to focus on those to avoid making the chart more complicated.

Performance

The main goal was saving memory, but Latin1 strings also improved performance on several benchmarks. There was about a 36% win on Sunspider regexp-dna on x86 and x64 on AWFY (the regular expression engine can generate more efficient code for Latin1 strings) and a 48% win on Android.

There were also smaller wins on several other benchmarks, for instance the Kraken JSON tests improved by 11% on x86 and x64. On ARM this was closer to 20%.

Conclusion

Latin1 strings are in Firefox Nightly, will be in Aurora this week and should ship in Firefox 33. This work also paves the way for generational and compacting GC for strings, I’ll hopefully have time to work on that in the coming weeks. Last but not least, I want to thank Luke Wagner, Terrence Cole, Nicholas Nethercote, Boris Zbarsky, Brian Hackett, Jeff Walden and Bobby Holley for many excellent and fast reviews.

IonMonkey: Optimizing Away

Firefox 32

Firefox uses two just-in-time compilers: Baseline and IonMonkey. When a web page contains some JavaScript code, this code is first executed in the interpreter. Then after running the same code multiple times, we compile the code with Baseline. After hundreds of iterations on the same code, we trigger a compilation with IonMonkey.

Baseline is a method compiler which focuses on optimizing each instruction individually. Each instruction checks for optimized cases based on what was previously observed, and it modifies itself every time a new case is detected.

IonMonkey is a method compiler too, but its goal is to optimize instructions across the entire method, and it can remove instructions which are not needed. IonMonkey is also a greedy optimizer, as it takes the currently observed cases as the only possible options and assumes that nothing will change. For example, if a variable has only ever held numbers, IonMonkey assumes that variable will always be a number. If one of the assumptions fails, we cannot continue execution in IonMonkey. Otherwise we might compute incorrect results, or worse, execute random pieces of code. In fact we must discard the IonMonkey code and resume execution in Baseline.

Bailouts

Resuming execution in Baseline implies that we need to stop IonMonkey execution, list all local and temporary variables, reconstruct a stack frame with the layout expected by Baseline, and jump to Baseline code. This process is called a bailout.

To list all variables, IonMonkey emulates the Baseline stack frame when we generate the control flow graph. On the graph, we annotate every block entry and instructions which are doing side effects, such as calls and stores, with resume points. These resume points are used in a similar way as checkpoints in video games — they’re the location where you will start over if something goes wrong. Each resume point captures the layout of the Baseline stack frame, and is used by the generated code to resume execution of the program if IonMonkey’s code is no longer valid.

Representation of a Snapshot

Each resume point contains a list of MIR (Middle-level Intermediate Representation) Instructions which are listed in the same order as the Baseline’s frame is stored on the stack. These MIR Instructions are listed so that we can track the register/stack allocations of the return value for each captured instruction.

Resume points are used by instructions which might fail. For example, when a call returns an object where it used to always return a Number, or when a math operation/function returns a floating point number where it used to only return integers.

When an instruction that might fail is lowered (converted to a lower intermediate representation), we attach a snapshot to the instruction. The snapshot captures the allocation of every MIR Instruction which is listed in the resume point, at the location of the lowered instruction. We do this for each instruction because a value contained in a register might be evicted to the stack if we no longer need it. Thus, for bailing out we need to track all register allocation changes.

An allocation is either a register, a floating point register, a stack offset in IonMonkey’s frame or a constant which is stored next to the code. All these snapshots and allocations are serialized in order to be used during bailouts.

When an instruction fails, the bailout decodes the snapshot and each allocation. The snapshot is used to reconstruct the Baseline frame that we were supposed to have at the last resume point. Allocations are used to read the value out of IonMonkey’s frame and code, to fill the reconstructed frame. And finally, we replace the IonMonkey frame with the Baseline frame and resume Baseline execution.

What Baseline wants, IonMonkey executes

IonMonkey has multiple optimizations, such as Unreachable Code Elimination (UCE) and Dead Code Elimination (DCE). UCE removes then/else blocks which are never taken as the condition can be inferred. DCE removes instructions which are never used. These optimizations help IonMonkey to generate faster code, as there are fewer instructions to execute.

Baseline makes no assumptions about which part of the code needs to run. Its stack frames are always complete and correct. Thus, IonMonkey cannot remove a MIR Instruction if there is a resume point which captures it. This also implies that a MIR Instruction has to be executed before all of the resume points that use it. As a consequence, any unused MIR Instruction, which is captured by a resume point, has to be executed by IonMonkey.

IonMonkey’s optimizations are limited by the expectations of Baseline. Any optimization done by IonMonkey should not be observable within Baseline. The following pieces of code illustrates the limitations added by Baseline:

function getCond() {
  console.log("getCond returns false");
  return false;
}

function maybeMul2(arg) {
  var x = 2 * arg;
  // Baseline expects "x" to be on the stack.

  var cond = getCond();
  // IonMonkey cannot resume before a call.

  if (cond) {
    return x;
  }
  return arg;
}

Once the call of “getCond()” has executed in IonMonkey, we cannot resume in Baseline before that call. If we did, Baseline would execute the call to “console.log” a second time. This implies that the resume point has to capture the MIR Instruction which computes “x”. Thus we have to compute the value of “x” before the call, even in IonMonkey.

IonMonkey can inline the function “getCond()”, determine that “cond” is always false, and eliminate the body of the if-statement as dead code. The optimized IonMonkey code therefore looks like this:

function maybeMul2(arg) {
  var x = 2 * arg; // expected by Baseline.

  // inlined code of getCond
  console.log("getCond returns false");

  return arg;
}

Still, we must compute “x” even though the optimized code clearly doesn’t use it — just in case we are forced to bail out after the call to console.log.

We want IonMonkey to remove that instruction computing “x”.

Recover Instructions

To work around Baseline expectations, we need to remove the instruction from IonMonkey code and move it to the bailout path. The problem is that we need to carry this information to the bailout function.

Currently, allocations are used to carry information about where the values are located. What we need is similar, we need a way to express an allocation which refers to the result of an instruction. The difference being that the instruction is not executed by IonMonkey but that it is executed before reconstructing the Baseline stack frame.

Resume points are listing all the MIR Instructions which are needed to reconstruct a stack frame. Even if we do not execute these instructions, we still want to have them listed in the resume points.

The solution implemented in IonMonkey adds a flag to every MIR instruction. This flag, named RecoveredOnBailout, is used to indicate that for the current instruction, which is still in the control flow graph, we won’t emit any JIT code. Instead, when we attach a snapshot to an instruction, we order and serialize all the instructions which are flagged as RecoveredOnBailout.

Now that instructions are ordered, we can index their results in a similar way as we do for stack offsets. This way, when we bailout, we reserve a vector which is large enough to contain the result of all instructions which are used during the bailout. The allocations are modified so that, in addition to the register, stack offset, and constants, we can refer to an index in this vector of instruction results.

Representation of a Snapshot with Recover Instructions

When an instruction fails, in addition to decoding the snapshot, the bailout process iterates over the list of instructions. Each instruction reads the values of the allocations that it needs, and stores its result in the reserved vector. Then, another allocation will refer to this result to fill the reconstructed frame.

Optimizing Away

The goal of recover instructions is to open a new class of optimizations where resume points are almost nonexistent.

Due to the resume points, we had to restrict our optimizations to optimizations which hoist instructions in the control flow graph, such as Global Value Numbering (GVN) and Loop Invariant Code Motion (LICM), but we could barely remove instructions or even sink them in the control flow graph.

With recover instructions, we should be able to remove instructions or move them closer to their uses. In the previous example, if we are unable to prove that the branch is dead, we can still move the computation of “x” to the only branch which uses it.

function maybeMul2(arg) {
  // no more overhead.

  var cond = getCond();
  // Baseline can recover "x = 2 * arg".

  if (cond) {
    var x = 2 * arg;
    return x;
  }
  return arg;
}

In the meantime, recover instructions are a good abstraction for improving our Dead Code Elimination (DCE), and implementing optimizations such as Escape Analysis. All of these optimizations allow IonMonkey to generate faster code by transforming the code to be even more different from what Baseline expects. Whenever a bailout occurs, recover instructions stand ready to put the world back in a Baseline-friendly state.

Contributors inside

In addition to improving our scope of optimizations, recover instructions are simple to implement. This project has provided a good trampoline for multiple contributors to join the project. Please thank and welcome the 12 new contributors who contributed their first patches to the JavaScript engine:

The monkeys in 2013

A monkey. That’s the default name a part in the JavaScript Engine of Mozilla Firefox gets. Even the full engine has its own monkey name, called Spidermonkey. 2013 has been a transformative year for the monkeys. New species have been born and others have gone extinct. I’ll give a small but incomplete overview into the developments that happened.

Before 2013 JaegerMonkey had established itself as the leader of the pack (i.e. the superior engine in Spidermonkey) and was the default JIT compiler in the engine. It was successfully introduced in Firefox 4.0 on March 22nd, 2011. Its original purpose was to augment the first JIT Monkey, TraceMonkey. Two years later it had kicked TraceMonkey out of the door and was absolute ruler in monkey land. Along the ride it had totally changed. A lot of optimizations had been added, the most important being Type Inference. Though there were also drawbacks. JaegerMonkey wasn’t really designed to host all those optimizations and it was becoming harder and harder to add new flashy things and easier and easier to add faults. JaegerMonkey had always been a worthy monkey but was starting to cripple under age.

Improvement of Firefox on the octane benchmark

Improvement of Firefox on the octane benchmark

The year 2013 was only eight days old and together with the release of Firefox 18, a new bad boy was in town, IonMonkey. It had received education from the elder monkeys, as well as from its competitors and inherited the good ideas, while trying to avoid the old mistakes. IonMonkey became a textbook compiler with regular optimization passes only adjusted to work in a JIT environment. I would recommend reading the blogpost of the release for more information about it. Simultaneously, JaegerMonkey was downgraded to a startup JIT to warm up scripts before IonMonkey took over responsibility.

But that wasn’t the only big change. After the release of IonMonkey in Firefox 18 the year 2013 saw the release of Firefox 19, 20, all the way to number 26. Also Firefox 27, 28 and (partly) 29 were developed in 2013. All those releases brought their own set of performance improvements:

Firefox 19 was the second release with IonMonkey enabled. Most work went into improving the new infrastructure of IonMonkey. Another notable improvement was updating Yarr (the engine that executes regular expressions imported from JSC) to the newest release.

Firefox 20 saw range analysis, one of the optimization passes of IonMonkey, refactored. It was improved and augmented with symbolic range analysis. Also this was the first release containing JavaScript self-hosting infrastructure that allows standard, builtin functions to be implemented in JavaScript instead of C++. These functions get the same treatment as normal functions, including JIT compilation. This helps a lot with removing the overhead from calling between C++ and JavaScript and even allows builtin JS functions to be inlined in the caller.

Firefox 21 is the first release where off-thread compilation for IonMonkey was enabled. This moves most of the compilation to a background thread, so that the main thread can happily continue executing JavaScript code.

Firefox 22 saw a big refactoring of how we inline and made it possible to inline a subset of callees at a polymorphic callsite, instead of everything or nothing. A new monkey was also welcomed, called OdinMonkey. OdinMonkey acts as an Ahead of Time compiler optimization pass that reuses most of IonMonkey, kicking in for specific scripts that have been declared to conform to the asm.js subset of JavaScript. OdinMonkey showed immediate progress on the Epic Citadel demo. More recently, Google added an asm.js workload to Octane 2.0 where OdinMonkey provides a nice boost.

Firefox 23 brought another first. The first compiler without a monkey name was released: the Baseline Compiler. It was designed from scratch to take over the role of JaegerMonkey. It is the proper startup JIT JaegerMonkey was forced to be when IonMonkey was released. No recompilations or invalidations in the Baseline Compiler: only saving type information and make it easy for IonMonkey to kick in. With this release IonMonkey was also allowed to kick in 10 times earlier. At this point, Type Inference was now only needed for IonMonkey. Consequently, major parts of Type Inference were moved and integrated directly into IonMonkey improving memory usage.

Firefox 24 added lazy bytecode generation. One of the first steps in JS execution is parsing the functions in a script and creating bytecode for them. (The whole engine consumes bytecodes instead of a raw JavaScript string.) With the use of big libraries, a lot of functions aren’t used and therefore creating bytecode for all these functions adds unnecessary overhead. Lazy bytecode generation allow us to wait until the first execution before parsing a function and avoids parsing functions that are never executed.

Firefox 25 to Firefox 28: No real big performance improvements that stand out. A lot of smaller changes under the hood have landed. Goal: polishing existing features or implementing small improvements. A lot of preparation work went into Exact Rooting. This is needed for more advanced garbage collection algorithms, like Generational GC. Also a lot of DOM improvements were added.

Firefox 29. Just before 2014 Off-thread MIR Construction landed. Now the whole compilation process in IonMonkey can be run off the main thread. No delays during execution due to compiling if you have two or more processors anymore.

Improvement of Firefox on the dromaeo benchmark

Improvement of Firefox on the dromaeo benchmark

All these things resulted in improved JavaScript speed. Our score on Octane v1.0 has increased considerably compared to the start of the year. We now are again competitive on the benchmark. Towards the end of the year, Octane v2.0 was released and we took a small hit, but we were very efficient in finding the opportunities to improve our engine and our score on Octane v2.0 has almost surpassed our Octane v1.0 score. Another example on how the speed of Spidermonkey has increased a lot is the score on the Web Browser Grand Prix on Tom’s Hardware. In those reports, Chrome, Firefox, Internet Explorer and Opera are tested on multiple benchmarks, including Browsermark, Peacekeeper, Dromaeo and a dozen others. During 2013, Firefox was in a steady second place behind Chrome. Unexpectedly, the hard work brought us to the first place on the Web Browser Grand Prix of June 30th.  Firefox 22 was crowned above Chrome and Opera Next. More importantly than all these benchmarks are the reports we get about overall improved JavaScript performance, which is very encouraging.

A new year starts and improving performance is never finished. In 2014 we will try to improve the JavaScript engine even more. The usual fixes and adding of fast paths continues, but also the higher-level work continues. One of the biggest changes we will welcome this year is the landing of Generational GC. This should bring big benefits in reducing the long GC pauses and improving heap usage. This has been an enormous task, but we are close to landing. Other expected boosts include improving DOM access even more, possibly a lightweight way to do chunked compilation in the form of loop compilation, different optimization levels for scripts with different hotness, adding a new optimization pass called escape analysis … and possibly much more.

A happy new year from the JavaScript team!