Mozilla Research Grants: Call for Applications

Mozilla seeks applications for research funding to support our mission: to ensure the Internet is a global public resource, open and accessible to all. In particular, for the 2017H1 funding cycle, we are looking for research projects that prototype, explore and characterize the future: we’re looking to support research that would benefit a free and open internet. These grants may include topics both inside and outside of Mozilla’s core focus on Firefox, as well as topics that fit more broadly with our vision for improving the web.

Our key objective is identifying research projects in line with our mission. Areas we have funded in the past include networking, security, compilers, software verification, software power management, and developer tools. Other areas we are interested in supporting include virtual, mixed and augmented reality, machine learning, home networking and router use, studying internet health, developing open data resources, decentralized technologies, improving web anonymity, and exploring approaches to diversity in open source. These lists are by no means exhaustive: we are open to other proposals as long as they support our mission.

To learn more and to submit your application, see our application form here:

The Emterpreter: Run code before it can be parsed

I’m excited to announce a new Mozilla Research experiment: the Emterpreter, a pure-JavaScript interpreter that can start running large Emscripten-compiled apps faster than JavaScript engines can, giving developers control over the latency/throughput trade-off.

An app’s startup time is a precious resource. For small apps, minification and image compression are good enough to provide a smooth user onboarding experience. But when a codebase gets large enough, the JavaScript engine startup costs—in particular, parsing—can add up to noticeable startup delays.

What can we do to improve JavaScript parse time? The obvious steps are removing unneeded code and minifying, but those only get you so far. We wanted to try a more extreme experiment: what if we compressed asm.js into a bytecode format and shipped it along with a small interpreter? Read on for some interesting results!

First let’s see how we can measure the problem:


Here we ran the Bullet physics engine in Firefox with and without ahead-of-time (AOT) compilation. The numbers show a classic latency-vs-throughput tradeoff: AOT compilation of asm.js maximizes our sustained speed, but at some cost in startup time.

The Emterpreter is an experiment to address the latency side of this equation. We’ve added an experimental command-line flag to the Emscripten compiler to convert the generated asm.js code to bytecode format and emit a JavaScript interpreter for that bytecode. Let’s see the effect this has on startup times:


We ran the Bullet and Cube 2 benchmarks in asm.js and Emterpreter modes, running each with and without AOT. The Emterpreter starts up significantly faster. That is unsurprising: loading unprocessed binary data (which is what the Emterpreter bytecode is) is faster than having the JavaScript engine parse and process the equivalent JavaScript code.

Naturally, while the startup times improved, the costs of running the code through an interpreter are substantial. Running the benchmarks from the Emscripten benchmark suite we can see anywhere from 6x to 22x slowdowns compared to normal asm.js execution:


That’d certainly be disappointing if it were all we could do. But we designed the Emterpreter to allow mixed execution. Some functions are “emterpreted,” and others run normally as asm.js. This lets us run most code in bytecode format, but leave the performance-sensitive parts running at full asm.js speed – outside of the Emterpreter.
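A toy model may help make mixed execution concrete. The Rust sketch below is not Emscripten’s bytecode or interpreter; the opcodes, the `Func` type, the `build_module` helper, and the two sample functions are all hypothetical. Each function is either stored as bytecode and run by a small stack-machine loop or kept as native code, and a blacklist decides which.

```rust
// Toy model of mixed execution (hypothetical; not Emscripten's format).

/// A tiny stack-machine instruction set.
#[derive(Clone, Copy, Debug)]
enum Op {
    Push(i64),
    Add,
    Mul,
    Ret,
}

/// A function is either interpreted bytecode or a native fast path.
enum Func {
    Bytecode(Vec<Op>),
    Native(fn(i64) -> i64),
}

/// Run bytecode with the call argument pre-pushed onto the stack.
fn interpret(code: &[Op], arg: i64) -> i64 {
    let mut stack = vec![arg];
    for op in code {
        match *op {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a + b);
            }
            Op::Mul => {
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a * b);
            }
            Op::Ret => break,
        }
    }
    stack.pop().expect("missing return value")
}

/// Dispatch a call: bytecode goes through the interpreter, native code runs directly.
fn call(f: &Func, arg: i64) -> i64 {
    match f {
        Func::Bytecode(code) => interpret(code, arg),
        Func::Native(native) => (*native)(arg),
    }
}

fn scale_native(x: i64) -> i64 {
    x * 3
}

fn inc_native(x: i64) -> i64 {
    x + 1
}

/// Every function is emitted as bytecode unless its name is on the
/// blacklist, in which case the native version is kept.
fn build_module(blacklist: &[&str]) -> Vec<(&'static str, Func)> {
    let defs: Vec<(&'static str, Vec<Op>, fn(i64) -> i64)> = vec![
        ("scale", vec![Op::Push(3), Op::Mul, Op::Ret], scale_native as fn(i64) -> i64),
        ("inc", vec![Op::Push(1), Op::Add, Op::Ret], inc_native as fn(i64) -> i64),
    ];
    defs.into_iter()
        .map(|(name, code, native)| {
            let f = if blacklist.contains(&name) {
                Func::Native(native)
            } else {
                Func::Bytecode(code)
            };
            (name, f)
        })
        .collect()
}

fn main() {
    // Keep the hot "scale" function native; everything else is interpreted.
    let module = build_module(&["scale"]);
    for (name, f) in &module {
        let kind = if matches!(f, Func::Native(_)) { "native" } else { "bytecode" };
        println!("{name} ({kind}) -> {}", call(f, 10));
    }
}
```

Both forms compute the same results, so the blacklist only changes how fast a function runs, not what it returns.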

As an example, we can run the Box2D physics engine mostly in the Emterpreter but with a “blacklist” of 6 performance-sensitive functions that remain in asm.js:


On the left we can see the blacklist slows down startup only slightly, and on the right we can see that execution time takes a much smaller hit than running purely in the Emterpreter. This shows that there is promise to this approach to balancing startup time and speed: developers can selectively improve startup time without losing all the performance benefits of asm.js.

Finally, there’s one more optimization trick we can use. So far it has seemed we couldn’t significantly improve startup times without taking a hit in our peak performance. How can we avoid this compromise? The answer: start up quickly in emterpreted mode, but load the asm.js version in the background. Browsers can parse <script async> tags in a background thread (as Firefox does). Since AOT compilation of asm.js is done at parse time, developers have the ability to force compilation into a background thread. Apps can even get notified with a callback when the code is ready. While this can delay the point at which we reach peak performance, it also gives apps control over their user experience, such as supplying a splash screen so the app remains responsive while code is still being compiled. Once the callback is called, the Emterpreted code can be “hot swapped” with the optimized asm.js code (this is practical to do because asm.js code is in a very modular form), and the app will run at full speed.
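The hot-swap step amounts to routing every call through a table and replacing the table’s entries once the compiled code is ready. Here is a minimal Rust sketch under that assumption: `FunctionTable`, `app_loop`, and the two `square_*` functions are hypothetical, a background thread plays the role of the browser’s off-thread compilation, and receiving on a channel stands in for the ready callback.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

/// One entry in the function table.
type Impl = fn(u64) -> u64;

// Hypothetical stand-ins: the first represents interpreted bytecode, the
// second the compiled asm.js version. They must compute the same thing.
fn square_interpreted(x: u64) -> u64 {
    x * x
}

fn square_compiled(x: u64) -> u64 {
    x.pow(2)
}

/// All calls go through this table, so swapping a slot swaps the implementation.
struct FunctionTable {
    slots: Vec<Impl>,
    optimized: bool,
}

impl FunctionTable {
    fn new(slots: Vec<Impl>) -> Self {
        FunctionTable { slots, optimized: false }
    }

    fn call(&self, idx: usize, arg: u64) -> u64 {
        (self.slots[idx])(arg)
    }

    /// Replace every slot at once with its compiled counterpart.
    fn hot_swap(&mut self, compiled: Vec<Impl>) {
        assert_eq!(compiled.len(), self.slots.len(), "tables must line up");
        self.slots = compiled;
        self.optimized = true;
    }
}

/// Run `frames` frames. Before each one, poll (without blocking) for the
/// compiled table and swap it in once it arrives. Returns, per frame,
/// whether the optimized code ran.
fn app_loop(frames: usize, ready: &Receiver<Vec<Impl>>, frame_time: Duration) -> Vec<bool> {
    let mut table = FunctionTable::new(vec![square_interpreted as Impl]);
    let mut tiers = Vec::with_capacity(frames);
    for frame in 0..frames {
        if !table.optimized {
            if let Ok(compiled) = ready.try_recv() {
                table.hot_swap(compiled);
            }
        }
        let _result = table.call(0, frame as u64);
        tiers.push(table.optimized);
        thread::sleep(frame_time);
    }
    tiers
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Simulated background compilation; sending plays the "ready" callback.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(20));
        let _ = tx.send(vec![square_compiled as Impl]);
    });
    let tiers = app_loop(10, &rx, Duration::from_millis(5));
    let fast = tiers.iter().filter(|&&t| t).count();
    println!("{} interpreted frames, then {} optimized frames", tiers.len() - fast, fast);
}
```

Polling with `try_recv` keeps the app loop responsive: it never waits for compilation, it just starts using the faster code from the first frame after it shows up.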

With all these tools in place, we can revisit our first graph. We ran the same benchmark on two new cases: where we hot-swap the emterpreter with asm.js (yellow line), and where we both hot-swap and use a blacklist of performance-sensitive functions for the emterpreter (green line).


Hot-swapping allows us to reach full peak performance while getting much better startup times. And the blacklist allows us to selectively improve the speed in the interim before the optimized code finishes compiling. Note also how the green line is strictly the best until around 600ms: all the others either have not started to execute yet, or are executing much more slowly. That shows the Emterpreter is capable of startup performance even better than the browser can achieve, whether the browser does AOT compilation or not. This is possible because asm.js code is very simple and low-level, and as a result easy to interpret efficiently. And it takes the browser much less time to parse and optimize an interpreter than to parse and optimize all of an application’s code.

These are preliminary results, and we intend to keep experimenting with the Emterpreter to see what we can do with it on real codebases. We encourage you to give it a try and tell us what you learn!

Daala Progress in 2014

The Daala team has been hard at work making improvements to our royalty-free video codec. This year we’ve spent a large amount of effort improving still image coding and building tools to evaluate our performance against other codecs. Video performance has also been improving greatly, but we’ll cover that in a later update.

We focused on still images because the work is a subset of the work needed for a video codec, and it is an area where research has stagnated since H.264. Additionally, Daala’s unique design makes still image coding one of the hard problems to solve. It’s an obvious early goal to outperform other codecs in this area.

Monty has posted a progress update along with interactive demos that show our progress over the year as well as comparisons between Daala and other well known codecs like JPEG and HEVC.

Preview of the comparison tool included in the update


JavaScript: Servo’s only garbage collector

by Josh Matthews and Keegan McAllister

A web browser’s purpose in life is to mediate interaction between a user and an application (which we somewhat anachronistically call a "document"). Users expect a browser to be fast and responsive, so the core layout and rendering algorithms are typically implemented in low-level native code. At the same time, JavaScript code in the document can perform complex modifications through the Document Object Model. This means the browser’s representation of a document in memory is a cross-language data structure, bridging the gap between low-level native code and the high-level, garbage-collected world of JavaScript.

We’re taking this as another opportunity in the Servo project to advance the state of the art. We have a new approach for DOM memory management, and we get to use some of the Rust language’s exciting features, like auto-generated trait implementations, lifetime checking, and custom static analysis plugins.

Memory management for the DOM

It’s essential that we never destroy a DOM object while it’s still reachable from either JavaScript or native code — such use-after-free bugs often produce exploitable security holes. To solve this problem, most existing browsers use reference counting to track the pointers between underlying low-level DOM objects. When JavaScript retrieves a DOM object (through getElementById for example), the browser builds a "reflector" object in the JavaScript VM that holds a reference to the underlying low-level object. If the JavaScript garbage collector determines that a reflector is no longer reachable, it destroys the reflector and decrements the reference count on the underlying object.

This solves the use-after-free issue. But to keep users happy, we also need to keep the browser’s memory footprint small. This means destroying objects as soon as they are no longer needed. Unfortunately, the cross-language "reflector" scheme introduces a major complication.

Consider a C++ Element object which holds a reference-counted pointer to an Event:

struct Element {
    RefPtr<Event> mEvent;
};

Now suppose we add an event handler to the element from JavaScript:

elem.addEventListener('load', function (event) {
    event.originalTarget = elem;
});

When the event fires, the handler adds a property on the Event which points back to the Element. We now have a cross-language reference cycle, with an Element pointing to an Event within C++, and an Event reflector pointing to the Element reflector in JavaScript. The C++ refcounting will never destroy a cycle, and the JavaScript garbage collector can’t trace through the C++ pointers, so these objects will never be freed.
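Rust’s own reference-counted pointer, `Rc`, behaves the same way, which makes the leak easy to observe. In this sketch `Element` and `Event` are hypothetical stand-ins for the C++ classes, and the `break_cycle` flag models manually nulling out `mEvent`.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Hypothetical stand-ins for the C++ Element and Event.
struct Element {
    event: RefCell<Option<Rc<Event>>>, // like `RefPtr<Event> mEvent`
}

struct Event {
    original_target: RefCell<Option<Rc<Element>>>, // like `event.originalTarget = elem`
}

/// Build the Element <-> Event cycle, optionally break it by hand, and return
/// weak handles so the caller can check whether the objects were freed.
fn build(break_cycle: bool) -> (Weak<Element>, Weak<Event>) {
    let elem = Rc::new(Element { event: RefCell::new(None) });
    let event = Rc::new(Event { original_target: RefCell::new(None) });
    *elem.event.borrow_mut() = Some(Rc::clone(&event));
    *event.original_target.borrow_mut() = Some(Rc::clone(&elem));
    if break_cycle {
        // The "manually null out mEvent" workaround.
        *elem.event.borrow_mut() = None;
    }
    (Rc::downgrade(&elem), Rc::downgrade(&event))
    // `elem` and `event` go out of scope here. With the cycle intact, each
    // is still held by the other's strong reference, so neither is freed.
}

fn main() {
    let (elem, event) = build(false);
    println!("cycle kept:   alive = {}", elem.upgrade().is_some() && event.upgrade().is_some());
    let (elem, event) = build(true);
    println!("cycle broken: alive = {}", elem.upgrade().is_some() || event.upgrade().is_some());
}
```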

Existing browsers resolve this problem in several ways. Some do nothing, and leak memory. Some try to manually break possible cycles, by nulling out mEvent for example. And some implement a cycle collection algorithm on top of reference counting.

None of these solutions are particularly satisfying, so we’re trying something new in Servo by choosing not to reference count DOM objects at all. Instead, we give the JavaScript garbage collector full responsibility for managing those native-code DOM objects. This requires a fairly complex interaction between Servo’s Rust code and the SpiderMonkey garbage collector, which is written in C++. Fortunately, Rust provides some cool features that let us build this in a way that’s fast, secure, and maintainable.

Auto-generating field traversals

How will the garbage collector find all the references between DOM objects? In Gecko’s cycle collector this is done with a lot of hand-written annotations, e.g.:

NS_IMPL_CYCLE_COLLECTION(nsFrameLoader, mDocShell, mMessageManager)

This macro describes which members of a C++ class should be added to a graph of potential cycles. Forgetting an entry can produce a memory leak. In Servo the consequences would be even worse: if the garbage collector can’t see all references, it might free a node that is still in use. It’s essential for both security and programmer convenience that we get rid of this manual listing of fields.

Rust has a notion of traits, which are similar to type classes in Haskell or interfaces in many OO languages. A simple example is the Collection trait:

pub trait Collection {
    fn len(&self) -> uint;
}

Any type implementing the Collection trait will provide a method named len that takes a value of the type (by reference, hence &self) and returns an unsigned integer. In other words, the Collection trait describes any type which is a collection of elements, and the trait provides a way to get the collection’s length.
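For instance, a hypothetical `Playlist` type could implement the trait as follows. One caveat: `uint` is pre-1.0 Rust, so this sketch uses `usize`, its modern replacement, in order to compile today.

```rust
/// The Collection trait from the text, written with `usize`.
pub trait Collection {
    fn len(&self) -> usize;
}

/// A hypothetical collection type used only for illustration.
pub struct Playlist {
    songs: Vec<String>,
}

impl Collection for Playlist {
    fn len(&self) -> usize {
        self.songs.len()
    }
}

/// Generic code can rely on `len` for any type implementing Collection.
pub fn describe<C: Collection>(c: &C) -> String {
    format!("{} items", c.len())
}

fn main() {
    let p = Playlist { songs: vec!["a".to_string(), "b".to_string()] };
    println!("{}", describe(&p));
}
```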

Now let’s look at the Encodable trait, used for serialization. Here’s a simplified version:

pub trait Encodable {
    fn encode<T: Encoder>(&self, encoder: &mut T);
}

Any type which can be serialized will provide an encode method. The encode method itself is generic; it takes as an argument any type T implementing the trait Encoder. The encode method visits the data type’s fields by calling Encoder methods such as emit_u32, emit_tuple, etc. The details of the particular serialization format (e.g. JSON) are handled by the Encoder implementation.
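To illustrate the visitor shape of `encode`, here is a simplified, self-contained Rust sketch. Its `Encoder` trait has only three methods and is not the real serialization library’s trait; `Point`, `TextEncoder`, and `ValueCounter` are hypothetical. Both encoders are driven by the same `encode` method, which shows how the format lives entirely in the `Encoder`.

```rust
use std::fmt::Write;

/// Simplified, hypothetical serializer interface.
pub trait Encoder {
    fn emit_field(&mut self, name: &str);
    fn emit_u32(&mut self, v: u32);
    fn emit_bool(&mut self, v: bool);
}

/// A type that can be serialized by visiting its fields.
pub trait Encodable {
    fn encode<E: Encoder>(&self, encoder: &mut E);
}

impl Encodable for u32 {
    fn encode<E: Encoder>(&self, encoder: &mut E) {
        encoder.emit_u32(*self);
    }
}

impl Encodable for bool {
    fn encode<E: Encoder>(&self, encoder: &mut E) {
        encoder.emit_bool(*self);
    }
}

/// Hypothetical struct whose `encode` visits each field in turn.
pub struct Point {
    pub x: u32,
    pub visible: bool,
}

impl Encodable for Point {
    fn encode<E: Encoder>(&self, encoder: &mut E) {
        encoder.emit_field("x");
        self.x.encode(encoder);
        encoder.emit_field("visible");
        self.visible.encode(encoder);
    }
}

/// One concrete format: flat `name=value;` text.
pub struct TextEncoder {
    pub out: String,
}

impl Encoder for TextEncoder {
    fn emit_field(&mut self, name: &str) {
        write!(self.out, "{name}=").unwrap();
    }
    fn emit_u32(&mut self, v: u32) {
        write!(self.out, "{v};").unwrap();
    }
    fn emit_bool(&mut self, v: bool) {
        write!(self.out, "{v};").unwrap();
    }
}

/// A different "format" that only counts emitted values.
pub struct ValueCounter {
    pub count: usize,
}

impl Encoder for ValueCounter {
    fn emit_field(&mut self, _name: &str) {}
    fn emit_u32(&mut self, _v: u32) {
        self.count += 1;
    }
    fn emit_bool(&mut self, _v: bool) {
        self.count += 1;
    }
}

fn main() {
    let p = Point { x: 7, visible: true };
    let mut text = TextEncoder { out: String::new() };
    p.encode(&mut text);
    println!("{}", text.out);
}
```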

The Encodable trait is special, because the compiler can implement it for us! Although this mechanism was intended for painless serialization, it’s exactly what we need to implement garbage collector trace hooks without manually listing data fields.

Let’s look at Servo’s implementation of the DOM’s Document interface:

#[deriving(Encodable)]
pub struct Document {
    pub node: Node,
    pub window: JS<Window>,
    pub is_html_document: bool,
    // ... (remaining fields omitted)
}

The deriving attribute asks the compiler to write an implementation of encode that recursively calls encode on node, window, etc. The compiler will complain if we add a field to Document that doesn’t implement Encodable, so we have compile-time assurance that we’re tracing all the fields of our objects.

Note the difference between the node and window fields above. In the object hierarchy of the DOM spec, every Document is also a Node. Rust doesn’t have inheritance for data types, so we implement this by storing a Node struct within a Document struct. As in C++, the fields of Node are included in-line with the fields of Document, without any pointer indirection, and the auto-generated encode method will visit them as well.

A Document also has an associated Window, but this is not a containing or "is-a" relationship. The Document just has a pointer to a Window, one of many pointers to that object, which can live in native DOM data structures or in JavaScript reflectors. These are precisely the pointers we need to tell the garbage collector about. We do this with a custom pointer type JS<T> (for example, the JS<Window> above). The implementation of encode for JS<T> is not auto-generated; this is where we actually call the SpiderMonkey trace hooks.
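A sketch of how this fits together, with assumptions stated up front: it uses a dedicated `Trace` trait rather than `Encodable`, a `Tracer` that merely records object ids instead of calling SpiderMonkey, and an integer `id` in place of a real object pointer. The field-by-field impls are written by hand because modern Rust would need a separate proc-macro crate to derive them; only the `JS<T>` impl would stay hand-written in the scheme described above.

```rust
use std::marker::PhantomData;

/// Stand-in for the collector's tracer: records which objects were reported.
pub struct Tracer {
    pub reported: Vec<usize>,
}

/// Same shape as `encode`: visit every field, passing the tracer along.
pub trait Trace {
    fn trace(&self, tracer: &mut Tracer);
}

/// Hypothetical GC-managed pointer; `id` stands in for the object pointer.
pub struct JS<T> {
    id: usize,
    _marker: PhantomData<T>,
}

impl<T> JS<T> {
    pub fn new(id: usize) -> Self {
        JS { id, _marker: PhantomData }
    }
}

impl<T> Trace for JS<T> {
    // The one hand-written impl: this is where the collector gets told.
    fn trace(&self, tracer: &mut Tracer) {
        tracer.reported.push(self.id);
    }
}

// Plain data holds no GC pointers, so there is nothing to report.
impl Trace for bool {
    fn trace(&self, _tracer: &mut Tracer) {}
}

impl<T: Trace> Trace for Option<T> {
    fn trace(&self, tracer: &mut Tracer) {
        if let Some(inner) = self {
            inner.trace(tracer);
        }
    }
}

pub struct Window;

pub struct Node {
    pub parent: Option<JS<Node>>,
}

pub struct Document {
    pub node: Node,         // inline "is-a" part: traced field by field
    pub window: JS<Window>, // pointer: reported to the collector
    pub is_html_document: bool,
}

// Servo has the compiler generate the equivalent of these two impls.
impl Trace for Node {
    fn trace(&self, tracer: &mut Tracer) {
        self.parent.trace(tracer);
    }
}

impl Trace for Document {
    fn trace(&self, tracer: &mut Tracer) {
        self.node.trace(tracer);
        self.window.trace(tracer);
        self.is_html_document.trace(tracer);
    }
}

fn main() {
    let doc = Document {
        node: Node { parent: Some(JS::new(1)) },
        window: JS::new(2),
        is_html_document: true,
    };
    let mut tracer = Tracer { reported: Vec::new() };
    doc.trace(&mut tracer);
    println!("reachable: {:?}", tracer.reported);
}
```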

Lifetime checking for safe rooting

The Rust code in Servo needs to pass DOM object pointers as function arguments, store DOM object pointers in local variables, and so forth. We need to register these additional temporary references as roots in the garbage collector’s reachability analysis. If we touch an object from Rust when it’s not rooted, that could introduce a use-after-free vulnerability.

To make this happen, we need to expand our repertoire of GC-managed pointer types. We already talked about JS<T>, which represents a reference between two GC-managed DOM objects. These are not rooted; the garbage collector only knows about them when encode reaches one as part of the tracing process.

When we want to use a DOM object from Rust code, we call the root method on JS<T>. For example:

fn load_anchor_href(&self, href: DOMString) {
    let window = self.window.root();
    // ... use `window` to load `href` (body omitted) ...
}

The root method returns a Root<T>, which is stored in a stack-allocated local variable. When the Root<T> is destroyed at the end of the function, its destructor will un-root the DOM object. This is an example of the RAII idiom, which Rust inherits from C++.
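The RAII rooting pattern can be sketched in a few lines of Rust. In this hypothetical model a thread-local vector stands in for the collector’s root set, and `Root`’s destructor removes the entry that `JS::root` added, so the root disappears exactly when the guard goes out of scope.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;

thread_local! {
    // Stand-in for the collector's root set: ids of currently rooted objects.
    static ROOTS: RefCell<Vec<usize>> = RefCell::new(Vec::new());
}

fn root_count() -> usize {
    ROOTS.with(|roots| roots.borrow().len())
}

/// Unrooted pointer stored inside DOM structs (hypothetical sketch).
struct JS<T> {
    id: usize,
    _marker: PhantomData<T>,
}

/// Stack-allocated guard that keeps the object rooted while it is alive.
struct Root<T> {
    id: usize,
    _marker: PhantomData<T>,
}

impl<T> JS<T> {
    fn new(id: usize) -> Self {
        JS { id, _marker: PhantomData }
    }

    /// Register the object as a root and hand back an RAII guard.
    fn root(&self) -> Root<T> {
        ROOTS.with(|roots| roots.borrow_mut().push(self.id));
        Root { id: self.id, _marker: PhantomData }
    }
}

impl<T> Drop for Root<T> {
    // Un-root when the guard goes out of scope.
    fn drop(&mut self) {
        ROOTS.with(|roots| {
            let mut roots = roots.borrow_mut();
            if let Some(pos) = roots.iter().rposition(|&id| id == self.id) {
                roots.remove(pos);
            }
        });
    }
}

struct Window;

struct Document {
    window: JS<Window>,
}

impl Document {
    /// Hypothetical method shaped like `load_anchor_href`; it returns the
    /// number of live roots observed while `_window` is in scope.
    fn load_anchor_href(&self) -> usize {
        let _window = self.window.root();
        root_count()
    } // `_window` is dropped here, un-rooting the Window
}

fn main() {
    let doc = Document { window: JS::new(7) };
    println!("roots during call: {}", doc.load_anchor_href());
    println!("roots after call: {}", root_count());
}
```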

Of course, a DOM object might make its way through many function calls and local variables before we’re done with it. We want to avoid the cost of telling SpiderMonkey about each and every step. Instead, we have another type JSRef<T>, which represents a pointer to a GC-managed object which is already rooted elsewhere. Unlike Root<T>, JSRef<T> can be copied at negligible cost.

We shouldn’t un-root an object if it’s still reachable through JSRef<T>, so it’s important that a JSRef<T> can’t outlive its originating Root<T>. Situations like this are common in C++ as well. No matter how smart your smart pointer is, you can take a bare reference to the contents and then erroneously use that reference past the lifetime of the smart pointer.

Rust solves this problem with a compile-time lifetime checker. The type of a reference includes the region of code over which it is valid. In most cases, lifetimes are inferred and don’t need to be written out in the source code. Inferred or not, the presence of lifetime information allows the compiler to reject use-after-free and other dangerous bugs.

Not only do lifetimes protect Rust’s built-in reference type, we can use them in our own data structures as well. JSRef is actually defined as:

pub struct JSRef<'a, T> {
    // fields omitted: a pointer to the T, plus a marker tied to 'a
}

T is the familiar type variable, representing the type of DOM structure we’re pointing to, e.g. Window. The somewhat odd syntax 'a is a lifetime variable, representing the region of code in which that object is rooted. Crucially, this lets us write a method on Root with the following signature:

pub fn root_ref<'a>(&'a self) -> JSRef<'a, T> {
    // body omitted
}

What this syntax means is:

  • <'a>: "for any lifetime 'a",
  • (&'a self): "take a reference to a Root which is valid over lifetime 'a",
  • -> JSRef<'a, T>: "return a JSRef whose lifetime parameter is set to 'a".

The final piece of the puzzle is that we put a marker in the JSRef type saying that it’s only valid for the lifetime corresponding to that parameter 'a. This is how we extend the lifetime system to enforce our application-specific property about garbage collector rooting. If we try to compile something like this:

fn bogus_get_window<'a>(&self) -> JSRef<'a, Window> {
    let window = self.window.root();
    window.root_ref()  // return the JSRef
}

we get an error:

199:15 error: `window` does not live long enough
    window.root_ref()
    ^~~~~~
200:6 note: reference must be valid for the lifetime 'a as defined on the block at 197:56...
fn bogus_get_window<'a>(&self) -> JSRef<'a, Window> {
    let window = self.window.root();
    window.root_ref()
}
200:6 note: ...but borrowed value is only valid for the block at 197:56
fn bogus_get_window<'a>(&self) -> JSRef<'a, Window> {
    let window = self.window.root();
    window.root_ref()
}
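The lifetime marker can be reproduced in a small standalone sketch. This is not Servo’s definition: here a `Box` plays the role of the GC heap, `JSRef` holds a raw pointer plus a `PhantomData<&'a T>` marker, and `Window`, `get`, and `window_name` are hypothetical. Uncommenting the `bogus_get_window` variant makes the compiler reject it, for the same reason as above.

```rust
use std::marker::PhantomData;

/// Stand-in for a rooted GC object; the Box plays the role of the GC heap.
struct Root<T> {
    obj: Box<T>,
}

/// Copyable pointer to an object rooted elsewhere. The raw pointer carries
/// no lifetime of its own; the PhantomData marker ties every JSRef to 'a.
struct JSRef<'a, T> {
    ptr: *const T,
    _marker: PhantomData<&'a T>,
}

// Written by hand so that T itself need not be Copy.
impl<'a, T> Clone for JSRef<'a, T> {
    fn clone(&self) -> Self {
        *self
    }
}
impl<'a, T> Copy for JSRef<'a, T> {}

impl<T> Root<T> {
    fn new(value: T) -> Self {
        Root { obj: Box::new(value) }
    }

    /// For any lifetime 'a: borrow this Root for 'a, get a JSRef valid for 'a.
    fn root_ref<'a>(&'a self) -> JSRef<'a, T> {
        let ptr: *const T = &*self.obj;
        JSRef { ptr, _marker: PhantomData }
    }
}

impl<'a, T> JSRef<'a, T> {
    fn get(self) -> &'a T {
        // SAFETY: the JSRef borrows its Root for 'a, so the Root and the
        // boxed object it owns cannot be dropped while 'a is live.
        unsafe { &*self.ptr }
    }
}

struct Window {
    name: String,
}

fn window_name(w: JSRef<'_, Window>) -> &str {
    &w.get().name
}

// Rejected at compile time: the returned JSRef would outlive `root`.
//
// fn bogus_get_window<'a>() -> JSRef<'a, Window> {
//     let root = Root::new(Window { name: String::new() });
//     root.root_ref()
// }

fn main() {
    let root = Root::new(Window { name: String::from("main") });
    let a = root.root_ref();
    let b = a; // cheap copy; `a` remains usable
    println!("{} {}", window_name(a), window_name(b));
}
```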

We also implement the Deref trait for both Root<T> and JSRef<T>. This allows us to access fields of the underlying type T through a Root<T> or JSRef<T>. Because JS<T> does not implement Deref, we have to root an object before using it.

The DOM methods of Window (for example) are defined in a trait which is implemented for JSRef<Window>. This ensures that the self pointer is rooted for the duration of the method call, which would not be guaranteed if we implemented the methods on Window directly.
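A further simplified sketch of the `Deref` and trait-on-`JSRef` pattern follows. To keep it short, `JSRef` here wraps an ordinary `&'a T` instead of a raw pointer plus marker, `JS<T>` is omitted, and the `WindowMethods` trait with its single `title` method is hypothetical.

```rust
use std::ops::Deref;

/// Simplified stand-ins: Root owns the object (playing the GC heap), and
/// JSRef wraps a plain reference.
struct Root<T> {
    obj: Box<T>,
}

struct JSRef<'a, T> {
    obj: &'a T,
}

impl<T> Root<T> {
    fn new(value: T) -> Self {
        Root { obj: Box::new(value) }
    }

    fn root_ref(&self) -> JSRef<'_, T> {
        JSRef { obj: &*self.obj }
    }
}

// Deref lets code read fields of T through either handle.
impl<T> Deref for Root<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &*self.obj
    }
}

impl<'a, T> Deref for JSRef<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.obj
    }
}

struct Window {
    title: String,
}

/// DOM methods live in a trait implemented for JSRef<Window>, so `self`
/// is necessarily a rooted reference for the whole call.
trait WindowMethods {
    fn title(&self) -> String;
}

impl<'a> WindowMethods for JSRef<'a, Window> {
    fn title(&self) -> String {
        self.title.clone() // field access goes through Deref
    }
}

fn main() {
    let root = Root::new(Window { title: String::from("Servo") });
    let window = root.root_ref();
    println!("{} / {}", root.title, window.title());
}
```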

You can check out the Servo project wiki for more of the details that didn’t make it into this article.

Custom static analysis

To recap, the safety of our system depends on two major parts:

  • The auto-generated encode methods ensure that SpiderMonkey’s garbage collector can see all of the references between DOM objects.
  • The implementation of Root<T> and JSRef<T> guarantees that we can’t use a DOM object from Rust without telling SpiderMonkey about our temporary reference.

But there’s a hole in this scheme. We could copy an unrooted pointer — a JS<T> — to a local variable on the stack, and then at some later point, root it and use the DOM object. In the meantime, SpiderMonkey’s garbage collector won’t know about that JS<T> on the stack, so it might free the DOM object. To really be safe, we need to make sure that JS<T> only appears in traceable DOM structs, and never in local variables, function arguments, and so forth.

This rule doesn’t correspond to anything that already exists in Rust’s type system. Fortunately, the Rust compiler can load "lint plugins" providing custom static analysis. These basically take the form of new compiler warnings, although in this case we set the default severity to "error".

We have already implemented a plugin which simply forbids JS<T> from appearing at all. Because lint plugins are part of the usual warnings infrastructure, we can use the allow attribute in places where it’s okay to use JS<T>, like DOM struct definitions and the implementation of JS<T> itself.

Our plugin looks at every place where the code mentions a type. Remarkably, this adds only a fraction of a second to the compile time for Servo’s largest subcomponent, as Rust compile times are dominated by LLVM’s back-end optimizations and code generation. The current version of the plugin is very simple and will miss some mistakes, like storing a struct containing JS<T> on the stack. However, lint plugins run at a late stage of compilation and have access to full compiler internals, including the results of type inference. So we can make the plugin incrementally more sophisticated in the future.

In the end, the plugin won’t necessarily catch every mistake. It’s hard to achieve full soundness with ad-hoc extensions to a type system. As the name "lint plugin" suggests, the idea is to catch common mistakes at a low cost to programmer productivity. By combining this with the lifetime checking built in to Rust’s type system, we hope to achieve a degree of security and reliability far beyond what’s feasible in C++. Additionally, since the checking is all done at compile time, there’s no penalty in the generated machine code.

It’s an open question how our garbage-collected DOM will perform compared to a traditional reference-counted DOM. The Blink team has performed similar experiments, but they don’t have Servo’s luxury of starting from a clean slate and using a cutting-edge language. We expect the biggest gains will come when we move to allocating DOM objects within the JavaScript reflectors themselves. Since the reflectors need to be traced no matter what, this will reduce the cost of managing native DOM structures to almost nothing.

If you find this stuff interesting, we’d love to have your help on Rust and Servo! Both are open-source projects with a large number of community contributors. Here are some resources for getting started:

WebGL in Web Workers, Today – and Faster than Expected!

Web Workers are additional threads that a website can create. Using workers, a website can utilize multiple CPU cores to speed itself up, or move heavy single-core processing to a background thread to keep the main (UI) thread as responsive as possible.

A problem, however, is that many APIs exist only on the main thread, for example WebGL. WebGL is a natural candidate for running in a worker, as it is often used by things like 3D games, simulations, etc., which do heavy amounts of JavaScript that can stall the main thread. Therefore there have been discussions about supporting WebGL in workers (for example, work is ongoing in Firefox), and hopefully this will be widely supported eventually. But that will take some time – perhaps we can polyfill this meanwhile? That is what this blogpost is about: WebGLWorker is a new open source project that makes the WebGL API available in workers. It does so by transparently proxying necessary commands to the main thread, where WebGL is then rendered.

Quick overview of the rest of this post:

  • We’ll describe WebGLWorker’s design, and how it allows running WebGL-using code in workers, without any modifications to that code.
  • While the proxying approach has some inherent limitations which prevent us from implementing 100% of the WebGL API, the part that we can implement turns out to be sufficient for several real-world projects.
  • We’ll see performance numbers on the proxying approach used in WebGLWorker, showing that it is quite efficient, which is perhaps surprising since it sounds like it might be slow.


Before we get into technical details, here are two demos that show the project in action:

  • PlayCanvas: PlayCanvas is an open source 3D game engine written in JavaScript. Here is the normal version of one of their examples, and here is the worker version.
  • BananaBread: BananaBread is a port of the open source Cube 2/Sauerbraten first person shooter using Emscripten. Here is one of the levels running normally, and here it is in a worker. (Note that startup on the worker version may be sluggish; see the notes on “nested workers”, below, for why.)

In both cases the rendered output should look identical whether running on the main thread or in a worker. Note that neither of these two demos was modified to run in a worker: the exact same code, in both cases, runs either on the main thread or in a worker (see the WebGLWorker repo for details and examples).

How it works

WebGLWorker has two parts: the worker and the client (that is, the main thread). On the worker, we construct what looks like a normal WebGL context, so that a project using WebGL can use the WebGL API normally. When you call it, the WebGLWorker worker code queues rendering commands into buffers, then sends them over at the end of a frame using postMessage. The client code on the main thread receives such command buffers, and then executes them, command after command, in order. Here is what this architecture looks like:


Note that the WebGL-using codebase interacts with WebGLWorker’s worker code synchronously – it calls into it, and receives responses back immediately, for example, createShader returns an object representing a shader. To do that, the worker code needs to parse shader source files and so forth. Basically, we end up implementing some of the “frontend” of WebGL ourselves, in JS. And, of course, on the client side the WebGLWorker client code interacts synchronously with the actual browser WebGL context, executing commands and saving the responses where relevant. However, the important thing to note is that the WebGLWorker worker code sends messages to the client code, using postMessage, but it cannot receive responses – the response would be asynchronous, but WebGL application code is written synchronously. So that arrow goes only in one direction.
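The worker/client split can be modeled with two threads and a channel. The following Rust sketch is not WebGLWorker’s code: `ProxyContext`, `Client`, `ShaderId`, and the three-command protocol are hypothetical. It shows two properties described above: a call like `create_shader` is answered locally on the worker (the worker picks the handle itself), and commands flow in one direction only, batched once per frame. The numeric constants are WebGL’s `VERTEX_SHADER` and `COLOR_BUFFER_BIT` values.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread;

const VERTEX_SHADER: u32 = 0x8B31;
const COLOR_BUFFER_BIT: u32 = 0x4000;

/// Worker-side handle; the real object only exists on the client.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ShaderId(u32);

enum Command {
    CreateShader { id: ShaderId, kind: u32 },
    ShaderSource { id: ShaderId, source: String },
    Clear(u32),
}

/// Worker side: looks like a GL context but only records commands.
struct ProxyContext {
    next_id: u32,
    queue: Vec<Command>,
    tx: Sender<Vec<Command>>,
}

impl ProxyContext {
    fn create_shader(&mut self, kind: u32) -> ShaderId {
        // Answered synchronously, with no round trip to the client.
        let id = ShaderId(self.next_id);
        self.next_id += 1;
        self.queue.push(Command::CreateShader { id, kind });
        id
    }

    fn shader_source(&mut self, id: ShaderId, source: &str) {
        self.queue.push(Command::ShaderSource { id, source: source.to_string() });
    }

    fn clear(&mut self, mask: u32) {
        self.queue.push(Command::Clear(mask));
    }

    /// End of frame: send the whole buffer as one message.
    fn flush(&mut self) {
        let batch = std::mem::take(&mut self.queue);
        self.tx.send(batch).expect("client thread hung up");
    }
}

/// Main-thread side: replays commands in order. The map from worker handles
/// to "real" objects stands in for actual WebGL calls.
#[derive(Default)]
struct Client {
    shaders: HashMap<ShaderId, (u32, String)>,
    clears: usize,
}

impl Client {
    fn execute(&mut self, batch: Vec<Command>) {
        for cmd in batch {
            match cmd {
                Command::CreateShader { id, kind } => {
                    self.shaders.insert(id, (kind, String::new()));
                }
                Command::ShaderSource { id, source } => {
                    if let Some(entry) = self.shaders.get_mut(&id) {
                        entry.1 = source;
                    }
                }
                Command::Clear(_mask) => self.clears += 1,
            }
        }
    }
}

/// Render `frames` frames on a worker thread and replay them on this one.
fn run(frames: usize) -> (Client, usize) {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let mut gl = ProxyContext { next_id: 1, queue: Vec::new(), tx };
        let vs = gl.create_shader(VERTEX_SHADER);
        gl.shader_source(vs, "void main() {}");
        for _ in 0..frames {
            gl.clear(COLOR_BUFFER_BIT);
            gl.flush();
        }
    });
    let mut client = Client::default();
    let mut messages = 0;
    for batch in rx {
        // The loop ends when the worker drops its Sender.
        messages += 1;
        client.execute(batch);
    }
    worker.join().unwrap();
    (client, messages)
}

fn main() {
    let (client, messages) = run(3);
    println!("{messages} messages, {} clears, {} shader(s)", client.clears, client.shaders.len());
}
```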

For that reason, a major limitation of this approach is synchronous operations that we cannot implement ourselves, like readPixels and getError: in both of those, we don’t know the answer ourselves, we need to get a response from the actual WebGL context on the client. But we can’t access it synchronously from a worker. As a consequence we do not support readPixels, and for getError, we return “no error” optimistically in the worker where getError is called, proxy the call, and if the client gets an error message back, we abort since it’s too late to tell the worker at that point. We are therefore limited in what we can accomplish with this approach. However, as the demos above show, two real-world projects work out of the box without issues.
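The optimistic getError strategy can be sketched the same way. In this hypothetical model, the client’s toy context accepts only two capabilities for `enable`, the constant values are WebGL’s, and “abort” is represented by `execute` returning an `Err`.

```rust
// WebGL numeric constants.
const NO_ERROR: u32 = 0;
const INVALID_ENUM: u32 = 0x0500;
const DEPTH_TEST: u32 = 0x0B71;
const BLEND: u32 = 0x0BE2;

enum Command {
    Enable(u32),
    GetError,
}

/// Worker side: records calls; it cannot ask the client and wait.
struct WorkerGl {
    queue: Vec<Command>,
}

impl WorkerGl {
    fn enable(&mut self, cap: u32) {
        self.queue.push(Command::Enable(cap));
    }

    /// Optimistically report success, but forward the call so the client
    /// can perform the real check later.
    fn get_error(&mut self) -> u32 {
        self.queue.push(Command::GetError);
        NO_ERROR
    }
}

/// Client side: a toy model of the real context.
struct ClientGl {
    pending_error: u32,
}

impl ClientGl {
    /// Replay a batch. If a forwarded getError finds a real error, abort,
    /// since the worker has already been told NO_ERROR.
    fn execute(&mut self, batch: Vec<Command>) -> Result<(), String> {
        for cmd in batch {
            match cmd {
                Command::Enable(cap) => {
                    if cap != DEPTH_TEST && cap != BLEND {
                        self.pending_error = INVALID_ENUM;
                    }
                }
                Command::GetError => {
                    let err = std::mem::replace(&mut self.pending_error, NO_ERROR);
                    if err != NO_ERROR {
                        return Err(format!("GL error 0x{err:04X} after the worker assumed NO_ERROR"));
                    }
                }
            }
        }
        Ok(())
    }
}

fn main() {
    let mut worker = WorkerGl { queue: Vec::new() };
    let mut client = ClientGl { pending_error: NO_ERROR };
    worker.enable(0x1234); // not a valid capability
    println!("worker sees error = {}", worker.get_error());
    println!("client result: {:?}", client.execute(std::mem::take(&mut worker.queue)));
}
```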


I think most people’s intuition would be that this approach has to slow things down. After all, we are doing more work – queue commands in a buffer, serialize it, transfer it, deserialize it, and then execute the commands. All of that instead of just running each command as it is invoked! Now, on a single-core machine that would be correct, but on a multi-core machine there are reasons to suspect otherwise, because the worker thread often does other heavy JavaScript operations, so freeing it up quickly to do whatever else it needs can be beneficial. There are a few reasons why that might be possible:

  • Some commands are proxied literally by just appending a few numbers to an array. That can be faster than calling into a DOM API which crosses into native C++ code.
  • Some commands are handled locally in JavaScript, without proxying at all. Again, this can be faster than crossing the native code boundary.
  • Calling from the browser’s native code into the graphics driver can cause delays, which proxying avoids.

So on the one hand it seems obvious that this must be slower, but there are also some reasons to think it might not be. Let’s see some measurements!


The chart shows frames per second (higher numbers are better) on the BananaBread demo, on Firefox and Chrome (latest developer versions, Firefox 33 and Chrome 37, on Linux; results on other OSes are overall similar), on a normal version running on the main thread, and a worker version using WebGLWorker. Frame rates are shown for different numbers of bots, with 2 being a “typical” workload for this game engine, 0 being a light workload and 10 being unrealistically large (note that “0 bots” does not mean “no work”: even with no bots, the game engine renders the world, the player’s model and weapon, HUD information, etc.). As expected, as we go from 0 bots to 2 and then 10, frame rates decrease a little, because the game does more work, both on the CPU (AI, physics, etc.) and on the GPU (render more characters, weapon effects, etc.). However, on the normal versions (not running in a worker), the decrease is fairly small, just a few frames per second under the optimal 60. On Firefox, we see similar results when running in a worker as well: even with 10 bots the worker version is about as fast as the normal version. On Chrome, we do see a slowdown on the worker version, of just a few frames per second for reasonable workloads, but a larger one for 10 bots.

Overall, then, WebGLWorker and the proxying approach do fairly well: While our intuition might be that this must be slow, in practice on reasonable workloads the results are reasonably fast, comparable to running on the main thread. And it can perform well even on large workloads, as can be seen by 10 bots on the worker version on Firefox (Chrome’s slowdown there appears to be due to the proxying overhead being more expensive for it – it shows up high on profiles).

Note, by the way, that WebGLWorker could probably be optimized a lot more – it doesn’t take advantage of typed array transfer yet (which could avoid much of the copying, but would make the protocol a little more complex), nor does it try to consolidate typed arrays in any way (hundreds of separate ones can be sent per frame), and it uses a normal JS array for the commands themselves.
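Consolidating typed arrays could look like the following sketch, which packs many small arrays into one buffer plus an (offset, length) table. It is a hypothetical illustration, not WebGLWorker code. In Rust, sending the packed `Vec` through a channel moves the heap allocation rather than copying its elements, which is loosely analogous to transferring a typed array’s buffer.

```rust
use std::sync::mpsc;

/// Pack many small arrays into one contiguous buffer plus an (offset, len)
/// table, so a frame needs one buffer instead of hundreds.
fn consolidate(arrays: &[Vec<f32>]) -> (Vec<f32>, Vec<(usize, usize)>) {
    let total: usize = arrays.iter().map(|a| a.len()).sum();
    let mut data: Vec<f32> = Vec::with_capacity(total);
    let mut ranges = Vec::with_capacity(arrays.len());
    for a in arrays {
        ranges.push((data.len(), a.len()));
        data.extend_from_slice(a);
    }
    (data, ranges)
}

/// Recover one original array as a view into the packed buffer.
fn view(data: &[f32], (offset, len): (usize, usize)) -> &[f32] {
    &data[offset..offset + len]
}

fn main() {
    let arrays: Vec<Vec<f32>> = vec![vec![1.0, 2.0], vec![3.0], vec![4.0, 5.0, 6.0]];
    let (data, ranges) = consolidate(&arrays);
    let (tx, rx) = mpsc::channel();
    // Sending moves ownership of the heap buffer; elements are not copied.
    tx.send((data, ranges)).unwrap();
    let (data, ranges) = rx.recv().unwrap();
    for r in &ranges {
        println!("{:?}", view(&data, *r));
    }
}
```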

A final note on performance: All the measurements from before are for throughput, not latency. Running in a worker inherently adds some amount of latency, as we send user input to the worker and receive rendering back, so a frame or so of lag might occur. It’s encouraging that BananaBread, a first person shooter, feels responsive even with the extra latency – that type of game is typically very sensitive to lag.

“Real” WebGL in workers – that is, directly implemented at the browser level – is obviously still very important, even with the proxying polyfill. Aside from reducing latency, as just mentioned, real WebGL in workers also does not rely on the main thread to be free to execute GL commands, which the proxying approach does.

Other APIs

Proxying just WebGL isn’t enough for a typical WebGL-using application. There are some simple things like proxying keyboard and mouse events, but we also run into more serious issues, like the lack of HTML Image elements in workers. Both PlayCanvas and BananaBread use Image elements to load image assets and convert them to WebGL textures. WebGLWorker therefore includes code to proxy Image elements as well, using a similar approach: When you create an Image and set its src URL, we proxy that info and create an Image on the main thread. When we get a response, we fire the onload event, and so forth, after creating JS objects that look like what we have on the main thread.

Another missing API turns out to be Workers themselves! While the spec supports workers created in workers (“nested workers” or “subworkers”), and while an html5rocks article from 2010 mentions subworkers as being supported, they seem to only work in Firefox and Internet Explorer so far (Chrome bug, Safari bug). BananaBread uses workers for 2 things during startup: to decompress a gzipped asset file, and to decompress crunched textures, which greatly improves startup speed. To get the BananaBread demo to work in browsers without nested workers, it includes a partial polyfill for Workers in Workers (which is far from complete – just enough to run the 2 workers actually needed). (This incidentally is the reason why Chrome startup on the BananaBread worker demo is slower than on the non-worker demo – the polyfill uses just one core, instead of 3, and it has no choice but to use eval, which is generally not optimized very well.)

Finally, there are a variety of missing APIs, like requestAnimationFrame, that are straightforward to fill in directly without proxying (see proxyClient.js/proxyWorker.js in the WebGLWorker repo). These are typically not hard to implement, but a website may use a great many APIs that are not available in workers, so running into one of them is the most likely problem when porting an app to run in a worker.
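A common way to fill in requestAnimationFrame where it is missing is to schedule the callback on a timer so that frames land roughly every 16 ms (about 60 Hz), anchoring each frame to the previous one so the schedule does not drift. The Rust sketch below shows only that delay arithmetic; the 16 ms budget and the function name are assumptions, and this is not necessarily what proxyWorker.js does.

```rust
/// Target frame interval in milliseconds (assumed, roughly 60 Hz).
const FRAME_MS: f64 = 16.0;

/// Given the time the previous frame was scheduled for and the current time
/// (both in ms), return (delay until the next callback, next frame time).
/// If we are already late, fire immediately (delay 0) instead of waiting.
fn next_frame_delay(last_frame_ms: f64, now_ms: f64) -> (f64, f64) {
    let elapsed = now_ms - last_frame_ms;
    let delay = (FRAME_MS - elapsed).max(0.0);
    (delay, now_ms + delay)
}
```

The returned next-frame time becomes `last_frame_ms` for the following call, and the delay would be handed to whatever timer facility the worker has (for example setTimeout in JavaScript).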


Based on what we’ve seen, it looks surprisingly practical to polyfill APIs via proxying in order to make them available in workers. And WebGL is one of the larger and higher-traffic APIs, so it is reasonable to expect that applying this approach to something like WebSockets or IndexedDB, for example, would be much more straightforward. (In fact, perhaps the web platform community could proxy an API first and use that to prioritize speccing and implementing it in workers?) Overall, it looks like proxying could enable more applications to run code in web workers, thus doing less on the main thread and maximizing responsiveness.

Mozilla Advances JPEG Encoding with mozjpeg 2.0

We’re pleased to announce the release of mozjpeg 2.0. Early this year, we explained that we started this project to provide a production-quality JPEG encoder that improves compression while maintaining compatibility with the vast majority of deployed decoders. The end goal is to reduce page load times and ultimately create an enhanced user experience for sites hosting images.

With today’s release, mozjpeg 2.0 can reduce file sizes for both baseline and progressive JPEGs by 5% on average compared to those produced by libjpeg-turbo, the standard JPEG library upon which mozjpeg is based [1]. Many images will see further reductions.

Facebook announced today that it is testing mozjpeg 2.0 to improve the compression of images on Facebook. The company has also donated $60,000 to contribute to the ongoing development of the technology, including the next iteration, mozjpeg 3.0.

“Facebook supports the work Mozilla has done in building a JPEG encoder that can create smaller JPEGs without compromising the visual quality of photos,” said Stacy Kerkela, software engineering manager at Facebook. “We look forward to seeing the potential benefits mozjpeg 2.0 might bring in optimizing images and creating an improved experience for people to share and connect on Facebook.”

The major feature in this release is trellis quantization, which improves compression for both baseline and progressive JPEGs without sacrificing anything in terms of compatibility. Previous versions of mozjpeg only improved compression for progressive JPEGs.

Other improvements include:

  • The cjpeg utility now supports JPEG input in order to simplify re-compression workflows.
  • We’ve added options to specifically tune for PSNR, PSNR-HVS-M, SSIM, and MS-SSIM metrics.
  • We now generate a single DC scan by default in order to be compatible with decoders that can’t handle arbitrary DC scans.

New Lossy Compressed Image Research

Last October, we published research that found HEVC-MSP performed significantly better than JPEG, while WebP and JPEG XR performed better than JPEG according to some quality scoring algorithms, but similarly or worse according to others. We have since updated the study to offer a more complete picture of performance for mozjpeg and potential JPEG alternatives.

The study compared compression performance for four formats: JPEG, WebP, JPEG XR, and HEVC-MSP. The following is a list of significant changes since the last study:

  • We use newer versions of the WebP, JPEG, JPEG XR, and HEVC-MSP encoders.
  • We include data for mozjpeg.
  • We changed our graphing to bits per pixel vs. dB (quality) on a log/log scale. This is a more typical presentation format, and it doesn’t require interpolation.
  • We removed an RGB conversion step from quality comparison. We now compare the Y’CbCr input and output directly. This should increase accuracy of the metrics.
  • We include results for more quality values.
  • We added sections discussing encoders tuning for metrics and measurement with luma-only metrics.
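For readers who want to reproduce the two axes of the new graphs, here is a Rust sketch. It assumes 8-bit samples, computes bits per pixel from the compressed file size, and computes PSNR in dB over a single plane such as Y' (luma). It only illustrates the quantities; it is not the study's actual C metric code.

```rust
/// Bits per pixel of a compressed image: total bits divided by pixel count.
fn bits_per_pixel(file_bytes: usize, width: usize, height: usize) -> f64 {
    (file_bytes as f64 * 8.0) / (width as f64 * height as f64)
}

/// PSNR in dB between two equally sized 8-bit planes (e.g. the original
/// and decoded Y' planes). Returns infinity for identical planes (MSE = 0).
fn psnr_db(original: &[u8], decoded: &[u8]) -> f64 {
    assert_eq!(original.len(), decoded.len(), "planes must be the same size");
    let mse = original
        .iter()
        .zip(decoded)
        .map(|(&a, &b)| {
            let d = a as f64 - b as f64;
            d * d
        })
        .sum::<f64>()
        / original.len() as f64;
    if mse == 0.0 {
        f64::INFINITY
    } else {
        10.0 * (255.0 * 255.0 / mse).log10()
    }
}
```

Each encoder run at a given quality setting then yields one (bits per pixel, dB) point to plot on the log/log axes.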

We’ve also made changes to our test suite to make it easier to reproduce our results. All metric code is now written in C, which means it runs faster and MATLAB/Octave is no longer required. We’ve also added a script to automatically generate graphs from the test data files.

We consider this study to be inconclusive when it comes to the question of whether WebP and/or JPEG XR outperform JPEG by any significant margin. We are not rejecting the possibility of including support for any format in this study on the basis of the study’s results. We will continue to evaluate the formats by other means and will take any feedback we receive from these results into account.

In addition to compression ratios, we are considering run-time performance (e.g. decoding time), feature set (e.g. alpha, EXIF), time to market, and licensing. However, we’re primarily interested in the impact that smaller file sizes would have on page load times, which means we need to be confident about significant improvement by that metric, first and foremost.

Feedback Welcome

We’d like to hear any constructive feedback you might have. In particular, please let us know if you have questions or comments about our code, our methodology, or further testing we might conduct.

Also, the four image quality scoring algorithms used in this study (Y-SSIM, RGB-SSIM, MS-SSIM, and PSNR-HVS-M) should probably not be given equal weight, as each has its own pros and cons. For example, some have received more thorough peer review than others, while only one takes color into account. If you have input on which to give more weight, please let us know.
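For readers less familiar with the SSIM family, the Rust sketch below evaluates the SSIM formula over a single window covering a whole 8-bit plane, using the commonly cited constants C1 = (0.01·255)² and C2 = (0.03·255)². Practical implementations, including the variants named above, apply the formula over many small (often Gaussian-weighted) windows and average the results, and MS-SSIM also combines several scales. This simplified version only shows the formula.

```rust
/// Single-window SSIM between two equally sized 8-bit planes.
/// Real metrics apply this over local windows and average the results.
fn ssim_global(x: &[u8], y: &[u8]) -> f64 {
    assert_eq!(x.len(), y.len(), "planes must be the same size");
    let n = x.len() as f64;
    let c1 = (0.01 * 255.0_f64).powi(2);
    let c2 = (0.03 * 255.0_f64).powi(2);

    let mean = |v: &[u8]| v.iter().map(|&p| p as f64).sum::<f64>() / n;
    let (mx, my) = (mean(x), mean(y));

    // Population variances and covariance.
    let (mut vx, mut vy, mut cov) = (0.0, 0.0, 0.0);
    for (&a, &b) in x.iter().zip(y) {
        let (da, db) = (a as f64 - mx, b as f64 - my);
        vx += da * da;
        vy += db * db;
        cov += da * db;
    }
    vx /= n;
    vy /= n;
    cov /= n;

    ((2.0 * mx * my + c1) * (2.0 * cov + c2))
        / ((mx * mx + my * my + c1) * (vx + vy + c2))
}
```

Identical planes score 1.0, and anti-correlated structure can push the score below zero.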

We’ve set up a thread on Google Groups in order to discuss.

1. We’re fans of libjpeg-turbo – it powers JPEG decoding in Firefox because its focus is on being fast, and that isn’t going to change any time soon. The mozjpeg project focuses solely on encoding, and we trade some CPU cycles for smaller file sizes. We recommend using libjpeg-turbo for a standard JPEG library and any decoding tasks. Use mozjpeg when creating JPEGs for the Web.

Static checking of units in Servo

Web browsers do a lot of calculations on geometric coordinates, in many different coordinate systems and units of measurement. For example, a browser may need to translate a position expressed in hardware pixels relative to the screen origin into CSS px units relative to the document origin. Tricky bugs can occur if the code doesn’t convert correctly between different units or coordinate systems (though fortunately these bugs are unlikely to compete with the world’s most infamous unit conversion bug).

I recently added some new features to help prevent such bugs in rust-geom, a 2D geometry library written in Rust as part of the Servo project. These new features include Length, a type that holds a single numeric value tagged with a unit of distance, and ScaleFactor, a type for converting between different units. Here’s a simple example:

use geom::{Length, ScaleFactor};

// Define some empty types to use as units.
enum Mm {};
enum Inch {};

let one_foot: Length<Inch, f32> = Length(12.0);
let two_feet = one_foot + one_foot;

let mm_per_inch: ScaleFactor<Inch, Mm> = ScaleFactor(25.4);
let one_foot_in_mm: Length<Mm, f32> = one_foot * mm_per_inch;

Units are checked statically. If you try to use a Length value with one unit in an expression where a different unit is expected, your code will fail to compile unless you add an explicit conversion:

let d1: Length<Inch, f32> = Length(2.0);
let d2: Length<Mm, f32> = Length(0.1);
let d3 = d1 + d2; // Type error: Expected Inch but found Mm

Furthermore, the units are used only at compile time. At run time, a Length<Inch, f32> value is stored in memory as just a single f32 floating point value, with no additional data. The units are “phantom types” that are never instantiated and have no run-time behavior.
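The same idea can be reproduced in a few lines of present-day Rust, where an unused type parameter has to be recorded with std::marker::PhantomData. The sketch below is a hypothetical, self-contained re-implementation for illustration only. It is not rust-geom's code, and its constructors, trait impls, and the `get` method are assumptions. The size check in the test confirms that the unit adds no run-time data.

```rust
use std::marker::PhantomData;
use std::ops::{Add, Mul};

// Unit markers: empty enums can never be instantiated.
enum Mm {}
enum Inch {}

/// A value of type T tagged with a unit U that exists only at compile time.
struct Length<U, T>(T, PhantomData<U>);

impl<U, T: Copy> Clone for Length<U, T> {
    fn clone(&self) -> Self {
        *self
    }
}
impl<U, T: Copy> Copy for Length<U, T> {}

impl<U, T: Copy> Length<U, T> {
    fn new(v: T) -> Self {
        Length(v, PhantomData)
    }
    fn get(self) -> T {
        self.0
    }
}

/// Conversion factor from unit Src to unit Dst.
struct ScaleFactor<Src, Dst>(f32, PhantomData<(Src, Dst)>);

impl<Src, Dst> ScaleFactor<Src, Dst> {
    fn new(v: f32) -> Self {
        ScaleFactor(v, PhantomData)
    }
}

// Adding two lengths only compiles when their units match.
impl<U, T: Add<Output = T>> Add for Length<U, T> {
    type Output = Length<U, T>;
    fn add(self, other: Self) -> Self::Output {
        Length(self.0 + other.0, PhantomData)
    }
}

// Multiplying by a Src-to-Dst factor turns a Length<Src> into a Length<Dst>.
impl<Src, Dst> Mul<ScaleFactor<Src, Dst>> for Length<Src, f32> {
    type Output = Length<Dst, f32>;
    fn mul(self, scale: ScaleFactor<Src, Dst>) -> Self::Output {
        Length(self.0 * scale.0, PhantomData)
    }
}
```

With these definitions, adding a `Length<Inch, f32>` to a `Length<Mm, f32>` is rejected at compile time, because `Add` is only implemented for matching units.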

Length values can also be used in combination with the other rust-geom types like Rect, Size, and Point2D to keep track of units in any of the library’s supported geometric operations. There are also convenient type aliases and constructor functions to make these combined types a little easier to work with. For example:

// Using Point2D with typed units:
let p: Point2D<Length<Mm, i32>> = Point2D(Length(30), Length(40));

// Shorthand for the above:
let p: TypedPoint2D<Mm, i32> = TypedPoint2D(30, 40);

We use this in Servo to ensure that values are scaled correctly between different coordinate systems and units of measure such as device pixels, screen coordinates, and CSS px. You can find some of Servo’s units documented in These types are used in the compositor and windowing modules, and we’re gradually converting more Servo code to use them.

This has already helped us find some bugs. For example, the function below was originally missing a conversion from device pixels to page pixels, which would cause incorrect mouse move events in any window whose resolution is not 1 hardware pixel per CSS px:

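fn on_mouse_window_move_event_class(&self, cursor: Point2D<f32>) {
    for layer in self.compositor_layer.iter() {
        // Bug: cursor is in device pixels but is passed on unconverted.
        layer.send_mouse_move_event(cursor);
    }
}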

Once we added units to the types in this module, this code would not build until it was fixed:

fn on_mouse_window_move_event_class(&self, cursor: TypedPoint2D<DevicePixel, f32>) {
    let scale = self.device_pixels_per_px();
    for layer in self.compositor_layer.iter() {
        // Dividing by the scale factor converts device pixels to page pixels.
        layer.send_mouse_move_event(cursor / scale);
    }
}
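The fix compiles because dividing a point in device pixels by a "device pixels per CSS px" factor is the only way to obtain a point in CSS px. Here is a hypothetical, self-contained Rust sketch of that typing. `DevicePixel`, `CssPx`, `TypedPoint2D`, `ScaleFactor`, and the two functions are illustrative stand-ins, not Servo's or rust-geom's actual definitions.

```rust
use std::marker::PhantomData;
use std::ops::Div;

// Hypothetical unit markers.
enum DevicePixel {}
enum CssPx {}

/// A 2D point whose coordinates are in unit U.
struct TypedPoint2D<U> {
    x: f32,
    y: f32,
    _unit: PhantomData<U>,
}

impl<U> TypedPoint2D<U> {
    fn new(x: f32, y: f32) -> Self {
        TypedPoint2D { x, y, _unit: PhantomData }
    }
}

/// Number of Dst units per Src unit, e.g. device pixels per CSS px.
struct ScaleFactor<Src, Dst>(f32, PhantomData<(Src, Dst)>);

// Dividing a point in Dst units by a Src-to-Dst factor yields Src units.
impl<Src, Dst> Div<ScaleFactor<Src, Dst>> for TypedPoint2D<Dst> {
    type Output = TypedPoint2D<Src>;
    fn div(self, scale: ScaleFactor<Src, Dst>) -> TypedPoint2D<Src> {
        TypedPoint2D::new(self.x / scale.0, self.y / scale.0)
    }
}

/// Accepts only CSS px, so an unconverted device-pixel point is a type error.
fn hit_test_css(p: &TypedPoint2D<CssPx>) -> (f32, f32) {
    (p.x, p.y)
}

fn on_mouse_move(cursor: TypedPoint2D<DevicePixel>, device_pixels_per_px: f32) -> (f32, f32) {
    let scale: ScaleFactor<CssPx, DevicePixel> = ScaleFactor(device_pixels_per_px, PhantomData);
    // `hit_test_css(&cursor)` would not compile: expected CssPx, found DevicePixel.
    hit_test_css(&(cursor / scale))
}
```

On a display with 2 device pixels per CSS px, a cursor at device position (200, 100) reaches `hit_test_css` as (100, 50).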

My Rust code is based directly on Kartikaya Gupta’s C++ code implementing statically-checked units in Gecko. There are some minor differences (for example, Gecko does not include a type for one-dimensional “length” values), but the basic design is easily recognizable despite being translated from C++ to Rust.

Several other projects have also tackled this problem. Notably, the F# language supports units of measure as a built-in language feature, including static analysis of operations on arbitrary combinations of units (not currently implemented in rust-geom). Their research cites other related work. Languages like Rust and C++ do not include special language-level features for units, but through the use of generics and phantom types they allow library code to implement similar zero-overhead static checks. Our work in Gecko and Servo demonstrates how useful this approach can be in practice.

Another Big Milestone for Servo—Acid2

Servo, the next-generation browser engine being developed by Mozilla Research, has reached an important milestone by passing the Acid2 test. While Servo is not yet fully web compatible, passing Acid2 demonstrates how far it has already come.

Servo’s Acid2 Test Result

Acid2 tests common HTML and CSS features such as tables, fixed and absolute positioning, generated content, paint order, data URIs, and backgrounds. Just as an acid test is used to judge whether some metal is gold, the web compatibility acid tests were created to expose flaws in browser rendering caused by non-conformance to web standards. Servo passed the Acid1 test in August of 2013 and has rapidly progressed to pass Acid2 as of March 2014.

Servo’s goal is to create a new browser engine for modern computer architectures and security threat models. It is written in Rust, a new programming language also developed by Mozilla Research, which is designed to be safe and fast. Rust programs should be free from buffer overflows, use-after-free errors, and similar problems common in C and C++ code. On top of this added safety, Servo is designed to exploit the parallelism of modern computers, making use of all available processor cores, GPUs, and vector units.

The early results are encouraging. Many kinds of browser security bugs, such as vulnerabilities similar to Heartbleed, are prevented automatically by the Rust compiler. On many of the portions of the Web Platform that we have implemented, Servo is already substantially faster than traditional browsers in single-threaded mode, and faster still in multi-threaded mode.
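Heartbleed was an over-read: the server trusted a length field in a heartbeat request and copied that many bytes back, even when the payload actually received was shorter, leaking adjacent memory. The Rust sketch below uses a hypothetical echo function, not code from Servo or any TLS library, to show how safe Rust treats the same mistake. Slicing is always bounds-checked, and `get` with a range turns an out-of-range request into `None`. The check itself runs at run time; what safe Rust guarantees is that it cannot be skipped without an explicit `unsafe` block.

```rust
/// Echo back `claimed_len` bytes of a heartbeat payload (hypothetical API).
/// Returns None if the request claims more bytes than it actually carries.
fn heartbeat_echo(payload: &[u8], claimed_len: usize) -> Option<Vec<u8>> {
    // `get(..n)` is bounds-checked and yields None when n > payload.len(),
    // so the reply can never include bytes beyond the received payload.
    // (Writing `&payload[..claimed_len]` instead would panic, not over-read.)
    payload.get(..claimed_len).map(|bytes| bytes.to_vec())
}
```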

Servo has a growing community of developers and is a great project for anyone looking to play with browsers and programming languages. Please visit us at the Servo project page to learn more.


edited for clarity around Heartbleed

Introducing the ‘mozjpeg’ Project

Today I’d like to announce a new Mozilla project called ‘mozjpeg’. The goal is to provide a production-quality JPEG encoder that improves compression while maintaining compatibility with the vast majority of deployed decoders.

Why are we doing this?

JPEG has been in use since around 1992. It’s the most popular lossy compressed image format on the Web, and has been for a long time. Nearly every photograph on the Web is served up as a JPEG. It’s the only lossy compressed image format that has achieved nearly universal compatibility, not just with Web browsers but with all software that can display images.

The number of photos displayed by the average Web site has grown over the years, as has the size of those photos. HTML, JS, and CSS files are relatively small in comparison, which means photos can easily make up the bulk of the network traffic for a page load. Reducing the size of these files is an obvious goal for optimization.

Production JPEG encoders have largely been stagnant in terms of compression efficiency, so replacing JPEG with something better has been a frequent topic of discussion. The major downside to moving away from JPEG is that it would require going through a multi-year period of relatively poor compatibility with the world’s deployed software. We (at Mozilla) don’t doubt that algorithmic improvements will make this worthwhile at some point, possibly soon. Even after a transition begins in earnest, though, JPEG will continue to be used widely.

Given this situation, we wondered if JPEG encoders have really reached their full compression potential after 20+ years. We talked to a number of engineers, and concluded that the answer is “no,” even within the constraints of strong compatibility requirements. With feedback on promising avenues for exploration in hand, we started the ‘mozjpeg’ project.

What we’re releasing today, as version 1.0, is a fork of libjpeg-turbo with ‘jpgcrush’ functionality added. We noticed that people have been reducing JPEG file sizes using a perl script written by Loren Merritt called ‘jpgcrush’, references to which can be found on various forums around the Web. It losslessly reduces file sizes, typically by 2-6% for PNGs encoded to JPEG by IJG libjpeg, and 10% on average for a sample of 1500 JPEG files from Wikimedia. It does this by figuring out which progressive coding configuration uses the fewest bits. So far as we know, no production encoder has this functionality built in, so we added it as the first feature in ‘mozjpeg’.
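At its core, "figuring out which progressive coding configuration uses the fewest bits" is an exhaustive search: encode with each candidate and keep the smallest. Below is a Rust sketch of only that selection step. `ScanConfig` (reduced here to a single hypothetical split point between AC coefficient bands) and the `encoded_bits` callback are placeholders; a real implementation would losslessly re-encode the JPEG's coefficients with each scan script and measure the output, which is not shown here and is not taken from jpgcrush or mozjpeg.

```rust
/// A hypothetical stand-in for a progressive scan configuration: here just
/// the position at which AC coefficients 1..=63 are split into two bands.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ScanConfig {
    ac_split: u8,
}

/// Try every candidate and return the one whose encoding is smallest,
/// together with its size in bits. Ties keep the earliest candidate;
/// an empty candidate list yields None.
fn best_config<F>(candidates: &[ScanConfig], mut encoded_bits: F) -> Option<(ScanConfig, u64)>
where
    F: FnMut(ScanConfig) -> u64,
{
    let mut best: Option<(ScanConfig, u64)> = None;
    for &cfg in candidates {
        let bits = encoded_bits(cfg);
        if best.map_or(true, |(_, b)| bits < b) {
            best = Some((cfg, bits));
        }
    }
    best
}
```

Because only the entropy-coded arrangement of the same coefficients changes between candidates, any choice decodes to the same image, which is why this kind of search is lossless.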

Our next goal is to improve encoding by making use of trellis quantization. If you want to help out or just learn more about our plans, the following resources are available:

* github
* mailing list

Studying Lossy Image Compression Efficiency

JPEG has been the only widely supported lossy compressed image format on the Web for many years. It was introduced in 1992, and since then a number of proposals have aimed to improve on it. A primary goal for many proposals is to reduce file sizes at equivalent qualities.

We’d like to share a study that compares JPEG with three frequently discussed alternatives (HEVC-MSP, WebP, and JPEG XR) in terms of compression efficiency.

The data shows HEVC-MSP performing significantly better than JPEG and the other formats we tested. WebP and JPEG XR perform better than JPEG according to some quality scoring algorithms, but similarly or worse according to others.

We consider this study to be inconclusive when it comes to the question of whether WebP and/or JPEG XR outperform JPEG by any significant margin. We are not rejecting the possibility of including support for any format in this study on the basis of the study’s results. We will continue to evaluate the formats by other means and will take any feedback we receive from these results into account.

In addition to compression ratios, we are considering run-time performance (e.g. decoding time), feature set (e.g. alpha, EXIF), time to market, and licensing. However, we’re primarily interested in the impact that smaller file sizes would have on page load times, which means we need to be confident about significant improvement by that metric, first and foremost.

We’d like to hear any constructive feedback you might have. In particular, please let us know if you have questions or comments about our code, our methodology, or further testing we might conduct.

Also, the four image quality scoring algorithms used in this study (Y-SSIM, RGB-SSIM, IW-SSIM, and PSNR-HVS-M) should probably not be given equal weight, as each has its own pros and cons. For example, some have received more thorough peer review than others, while only one takes color into account. If you have input on which to give more weight, please let us know.

We’ve set up a thread on Google Groups in order to discuss.