WebGL in Web Workers, Today – and Faster than Expected!

Alon Zakai


Web Workers are additional threads that a website can create. Using workers, a website can utilize multiple CPU cores to speed itself up, or move heavy single-core processing to a background thread to keep the main (UI) thread as responsive as possible.

A problem, however, is that many APIs exist only on the main thread – WebGL, for example. WebGL is a natural candidate for running in a worker, as it is often used by 3D games, simulations, and the like, which run large amounts of JavaScript that can stall the main thread. Therefore there have been discussions about supporting WebGL in workers (for example, work is ongoing in Firefox), and hopefully this will be widely supported eventually. But that will take some time – perhaps we can polyfill it in the meantime? That is what this blog post is about: WebGLWorker is a new open source project that makes the WebGL API available in workers. It does so by transparently proxying the necessary commands to the main thread, where WebGL is then rendered.

Quick overview of the rest of this post:

  • We’ll describe WebGLWorker’s design, and how it allows running WebGL-using code in workers, without any modifications to that code.
  • While the proxying approach has some inherent limitations which prevent us from implementing 100% of the WebGL API, the part that we can implement turns out to be sufficient for several real-world projects.
  • We’ll see performance numbers on the proxying approach used in WebGLWorker, showing that it is quite efficient, which is perhaps surprising since it sounds like it might be slow.

Examples

Before we get into technical details, here are two demos that show the project in action:

  • PlayCanvas: PlayCanvas is an open source 3D game engine written in JavaScript. Here is the normal version of one of their examples, and here is the worker version.
  • BananaBread: BananaBread is a port of the open source Cube 2/Sauerbraten first person shooter using Emscripten. Here is one of the levels running normally, and here it is in a worker. (Note that startup on the worker version may be sluggish; see the notes on “nested workers”, below, for why.)

In both cases the rendered output should look identical whether running on the main thread or in a worker. Note that neither of these two demos was modified to run in a worker: the exact same code, in both cases, runs either on the main thread or in a worker (see the WebGLWorker repo for details and examples).

How it works

WebGLWorker has two parts: the worker and the client (= main thread). On the worker, we construct what looks like a normal WebGL context, so that a project using WebGL can use the WebGL API normally. When the application calls into that context, the WebGLWorker worker code queues rendering commands into buffers, then sends them to the main thread at the end of each frame using postMessage. The client code on the main thread receives such command buffers and executes them, command after command, in order. Here is what this architecture looks like:

[Diagram: WebGLWorker architecture – the worker queues WebGL commands and posts them to the client on the main thread]

Note that the WebGL-using codebase interacts with WebGLWorker’s worker code synchronously – it calls in and immediately receives responses back; for example, createShader returns an object representing a shader. To do that, the worker code needs to parse shader source and so forth. Basically, we end up implementing some of the “frontend” of WebGL ourselves, in JS. And, of course, on the client side the WebGLWorker client code interacts synchronously with the actual browser WebGL context, executing commands and saving the responses where relevant. The important thing to note, however, is that the WebGLWorker worker code sends messages to the client code using postMessage, but it cannot receive responses – a response would be asynchronous, while WebGL application code is written synchronously. So that arrow goes in only one direction.
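To make this concrete, here is a rough sketch of the worker-side idea. The names, opcodes and message format below are invented for illustration – the real implementation lives in the WebGLWorker repo – but the shape is the same: calls return locally created objects immediately, while the corresponding commands pile up in a buffer that is flushed to the main thread once per frame.

// Illustrative sketch only – opcodes and message layout are hypothetical.
var CMD_CREATE_SHADER = 1, CMD_SHADER_SOURCE = 2;
var commandBuffer = [];  // commands queued during the current frame
var nextId = 1;          // worker-assigned ids stand in for real GL objects

function createShader(type) {
  var id = nextId++;
  commandBuffer.push(CMD_CREATE_SHADER, id, type);
  return { id: id, type: type };  // answered synchronously, like real WebGL
}

function shaderSource(shader, source) {
  commandBuffer.push(CMD_SHADER_SOURCE, shader.id, source);
}

// At the end of each frame the whole buffer is shipped to the main thread,
// which replays the commands on a real WebGL context.
function flushFrame() {
  postMessage({ target: 'gl', commands: commandBuffer });
  commandBuffer = [];
}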

For that reason, a major limitation of this approach is synchronous operations whose results we cannot compute ourselves, like readPixels and getError – in both of those, we don’t know the answer; we need a response from the actual WebGL context on the client, and we can’t get one synchronously from a worker. As a consequence we do not support readPixels at all, and for getError we optimistically return “no error” in the worker, proxy the call, and if the client does get an error back we abort, since it is too late to tell the worker at that point. We are therefore limited in what we can accomplish with this approach. However, as the demos above show, two real-world projects work out of the box without issues.
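Continuing the hypothetical sketch above, the optimistic getError handling might look something like this:

// Worker side: answer immediately and optimistically, but still proxy the
// query so the client can check the real context.
var CMD_GET_ERROR = 3;  // hypothetical opcode
var NO_ERROR = 0;       // the value of gl.NO_ERROR

function getError() {
  commandBuffer.push(CMD_GET_ERROR);  // the client runs the real gl.getError()
  return NO_ERROR;                    // we have already told the caller "no error"
}

// Client side: by the time the real answer arrives it is too late to return
// it to the caller in the worker, so all we can do is give up loudly.
function onClientGetError(gl) {
  var error = gl.getError();
  if (error !== gl.NO_ERROR) throw new Error('WebGL error ' + error + ', aborting');
}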

Performance

I think most people’s intuition would be that this approach has to slow things down. After all, we are doing more work – queueing commands in a buffer, serializing it, transferring it, deserializing it, and then executing the commands – instead of just running each command as it is invoked! On a single-core machine that would indeed be the case, but on a multi-core machine there are reasons to suspect otherwise: the worker thread often has other heavy JavaScript work to do, so freeing it up quickly to get on with that work can be beneficial. And there are a few more reasons why proxying might not end up slower:

  • Some commands are proxied literally by just appending a few numbers to an array. That can be faster than calling into a DOM API which crosses into native C++ code.
  • Some commands are handled locally in JavaScript, without proxying at all. Again, this can be faster than crossing the native code boundary.
  • Calling from the browser’s native code into the graphics driver can cause delays, which proxying avoids.

So on the one hand it seems obvious that this must be slower, but there are also some reasons to think it might not be. Let’s see some measurements!

[Chart: frames per second on the BananaBread demo, Firefox and Chrome, normal vs. worker versions, for 0, 2, and 10 bots]

The chart shows frames per second (higher numbers are better) on the BananaBread demo, on Firefox and Chrome (latest developer versions, Firefox 33 and Chrome 37, on Linux; results on other OSes are overall similar), for a normal version running on the main thread and a worker version using WebGLWorker. Frame rates are shown for different numbers of bots, with 2 being a “typical” workload for this game engine, 0 being a light workload and 10 being unrealistically large (note that “0 bots” does not mean “no work” – even with no bots, the game engine renders the world, the player’s model and weapon, HUD information, etc.). As expected, as we go from 0 bots to 2 and then 10, frame rates decrease a little, because the game does more work, both on the CPU (AI, physics, etc.) and on the GPU (render more characters, weapon effects, etc.). However, on the normal versions (not running in a worker), the decrease is fairly small, just a few frames per second under the optimal 60. On Firefox, we see similar results when running in a worker as well: even with 10 bots, the worker version is about as fast as the normal version. On Chrome, we do see a slowdown on the worker version – just a few frames per second for reasonable workloads, but larger for 10 bots.

Overall, then, WebGLWorker and the proxying approach do fairly well: While our intuition might be that this must be slow, in practice on reasonable workloads the results are reasonably fast, comparable to running on the main thread. And it can perform well even on large workloads, as can be seen by 10 bots on the worker version on Firefox (Chrome’s slowdown there appears to be due to the proxying overhead being more expensive for it – it shows up high on profiles).

Note, by the way, that WebGLWorker could probably be optimized a lot more – it doesn’t take advantage of typed array transfer yet (which could avoid much of the copying, but would make the protocol a little more complex), nor does it try to consolidate typed arrays in any way (hundreds of separate ones can be sent per frame), and it uses a normal JS array for the commands themselves.
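As an aside, the typed array transfer mentioned above relies on the second argument of postMessage, which lists ArrayBuffers whose ownership moves to the receiver instead of being copied. A minimal sketch of the general mechanism (not WebGLWorker’s actual protocol):

// Transferring instead of copying: the buffer is detached on the sender side
// after the call, so a protocol using this must account for that.
var vertexData = new Float32Array(3 * 1024);
// ... fill vertexData ...
postMessage({ target: 'gl', vertices: vertexData }, [vertexData.buffer]);
// From here on, vertexData.byteLength is 0 in this worker – the main thread
// now owns the underlying buffer.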

A final note on performance: All the measurements from before are for throughput, not latency. Running in a worker inherently adds some amount of latency, as we send user input to the worker and receive rendering back, so a frame or so of lag might occur. It’s encouraging that BananaBread, a first person shooter, feels responsive even with the extra latency – that type of game is typically very sensitive to lag.

“Real” WebGL in workers – that is, directly implemented at the browser level – is obviously still very important, even with the proxying polyfill. Aside from reducing latency, as just mentioned, real WebGL in workers also does not rely on the main thread to be free to execute GL commands, which the proxying approach does.

Other APIs

Proxying just WebGL isn’t enough for a typical WebGL-using application. There are some simple things like proxying keyboard and mouse events, but we also run into more serious issues, like the lack of HTML Image elements in workers. Both PlayCanvas and BananaBread use Image elements to load image assets and convert them to WebGL textures. WebGLWorker therefore includes code to proxy Image elements as well, using a similar approach: when you create an Image and set its src URL, we proxy that information and create an Image on the main thread. When we get a response back, we create JS objects that look like the ones on the main thread, fire the onload event, and so forth.
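A minimal sketch of that Image proxying on the worker side might look like the following; the names are invented for illustration, not taken from the WebGLWorker code.

var nextImageId = 1;
var pendingImages = {};  // fake Images waiting for a response from the main thread

function ProxiedImage() {
  this.id = nextImageId++;
  pendingImages[this.id] = this;
}

// Setting src forwards the URL to the main thread, which creates a real Image.
Object.defineProperty(ProxiedImage.prototype, 'src', {
  set: function(url) {
    postMessage({ target: 'Image', id: this.id, src: url });
  }
});

// When the main thread has loaded and decoded the image, it posts back the
// size and pixel data, and we fire onload on the fake object.
function onImageResponse(msg) {
  var img = pendingImages[msg.id];
  delete pendingImages[msg.id];
  img.width = msg.width;
  img.height = msg.height;
  img.data = msg.data;  // e.g. pixels read back via a canvas on the main thread
  if (img.onload) img.onload();
}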

Another missing API turns out to be Workers themselves! While the spec supports workers created in workers (“nested workers” or “subworkers”), and while an html5rocks article from 2010 mentions subworkers as being supported, they seem to work only in Firefox and Internet Explorer so far (Chrome bug, Safari bug). BananaBread uses workers for two things during startup: to decompress a gzipped asset file and to decompress crunched textures, which greatly improves startup speed. To get the BananaBread demo to work in browsers without nested workers, it includes a partial polyfill for Workers in Workers (far from complete – just enough to run the two workers actually needed). This, incidentally, is why Chrome startup on the BananaBread worker demo is slower than on the non-worker demo: the polyfill uses just one core instead of three, and it has no choice but to use eval, which is generally not optimized very well.
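For the curious, here is one way such a nested-worker shim could be structured. This is a rough sketch, not the actual polyfill: it assumes the script assigns a plain onmessage handler synchronously at load time, and it ignores importScripts, transferables, and everything else.

// Fetch the worker script synchronously and run it on the current thread,
// remapping postMessage/onmessage – which is why such a shim uses only one
// core and has to rely on eval-style execution (new Function here).
function FakeWorker(url) {
  var self = this;
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, false);  // synchronous XHR is allowed inside workers
  xhr.send();
  // Wrap the script so that its postMessage/onmessage are local variables
  // rather than the real worker globals.
  var run = new Function('postMessage',
      'var onmessage = null;\n' + xhr.responseText + '\n;return onmessage;');
  this._onmessage = run(function(data) {
    if (self.onmessage) self.onmessage({ data: data });
  });
}

FakeWorker.prototype.postMessage = function(data) {
  if (this._onmessage) this._onmessage({ data: data });
};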

Finally, there are a variety of missing APIs, like requestAnimationFrame, that are straightforward to fill in directly without proxying (see proxyClient.js/proxyWorker.js in the WebGLWorker repo). These are typically not hard to implement, but the sheer number of APIs a website might use that are not available in workers means that hitting a missing one is the most likely problem when porting an app to run in a worker.
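As an example of such a non-proxied shim, requestAnimationFrame can be approximated in a worker with a setTimeout-based stand-in. This is just a sketch of the idea; the shims actually used are in the WebGLWorker repo.

// Roughly 60 frames per second – good enough for a render loop in a worker.
if (typeof requestAnimationFrame === 'undefined') {
  var lastFrameTime = 0;
  self.requestAnimationFrame = function(callback) {
    var now = Date.now();
    var delay = Math.max(0, 16 - (now - lastFrameTime));
    lastFrameTime = now + delay;
    return setTimeout(function() { callback(Date.now()); }, delay);
  };
  self.cancelAnimationFrame = function(id) { clearTimeout(id); };
}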

Summary

Based on what we’ve seen, it looks surprisingly practical to polyfill APIs via proxying in order to make them available in workers. And WebGL is one of the larger and higher-traffic APIs, so it is reasonable to expect that applying this approach to something like WebSockets or IndexedDB, for example, would be much more straightforward. (In fact, perhaps the web platform community could proxy an API first and use that to prioritize speccing and implementing it in workers?) Overall, it looks like proxying could enable more applications to run code in web workers, thus doing less on the main thread and maximizing responsiveness.

Mozilla Advances JPEG Encoding with mozjpeg 2.0

Josh Aas


We’re pleased to announce the release of mozjpeg 2.0. Early this year, we explained that we started this project to provide a production-quality JPEG encoder that improves compression while maintaining compatibility with the vast majority of deployed decoders. The end goal is to reduce page load times and ultimately create an enhanced user experience for sites hosting images.

With today’s release, mozjpeg 2.0 can reduce file sizes for both baseline and progressive JPEGs by 5% on average compared to those produced by libjpeg-turbo, the standard JPEG library upon which mozjpeg is based [1]. Many images will see further reductions.

Facebook announced today that it is testing mozjpeg 2.0 to improve the compression of images on facebook.com. Facebook has also donated $60,000 to contribute to the ongoing development of the technology, including the next iteration, mozjpeg 3.0.

“Facebook supports the work Mozilla has done in building a JPEG encoder that can create smaller JPEGs without compromising the visual quality of photos,” said Stacy Kerkela, software engineering manager at Facebook. “We look forward to seeing the potential benefits mozjpeg 2.0 might bring in optimizing images and creating an improved experience for people to share and connect on Facebook.”

The major feature in this release is trellis quantization, which improves compression for both baseline and progressive JPEGs without sacrificing anything in terms of compatibility. Previous versions of mozjpeg only improved compression for progressive JPEGs.

Other improvements include:

  • The cjpeg utility now supports JPEG input in order to simplify re-compression workflows.
  • We’ve added options to specifically tune for PSNR, PSNR-HVS-M, SSIM, and MS-SSIM metrics.
  • We now generate a single DC scan by default in order to be compatible with decoders that can’t handle arbitrary DC scans.

New Lossy Compressed Image Research

Last October, we published research that found HEVC-MSP performed significantly better than JPEG, while WebP and JPEG XR performed better than JPEG according to some quality scoring algorithms, but similarly or worse according to others. We have since updated the study to offer a more complete picture of performance for mozjpeg and potential JPEG alternatives.

The study compared compression performance for four formats: JPEG, WebP, JPEG XR, and HEVC-MSP. The following is a list of significant changes since the last study:

  • We use newer versions of the WebP, JPEG, JPEG XR, and HEVC-MSP encoders.
  • We include data for mozjpeg.
  • We changed our graphing to bits per pixel vs. dB (quality) on a log/log scale. This is a more typical presentation format, and it doesn’t require interpolation.
  • We removed an RGB conversion step from quality comparison. We now compare the Y’CbCr input and output directly. This should increase accuracy of the metrics.
  • We include results for more quality values.
  • We added sections discussing encoder tuning for metrics and measurement with luma-only metrics.

We’ve also made changes to our test suite to make it easier to reproduce our results. All metric code is now written in C, which means it runs faster and MATLAB/octave is no longer required. We’ve also added a script to automatically generate graphs from the test data files.

We consider this study to be inconclusive when it comes to the question of whether WebP and/or JPEG XR outperform JPEG by any significant margin. We are not rejecting the possibility of including support for any format in this study on the basis of the study’s results. We will continue to evaluate the formats by other means and will take any feedback we receive from these results into account.

In addition to compression ratios, we are considering run-time performance (e.g. decoding time), feature set (e.g. alpha, EXIF), time to market, and licensing. However, we’re primarily interested in the impact that smaller file sizes would have on page load times, which means we need to be confident about significant improvement by that metric, first and foremost.

Feedback Welcome

We’d like to hear any constructive feedback you might have. In particular, please let us know if you have questions or comments about our code, our methodology, or further testing we might conduct.

Also, the four image quality scoring algorithms used in this study (Y-SSIM, RGB-SSIM, MS-SSIM, and PSNR-HVS-M) should probably not be given equal weight as each has a number of pros and cons. For example: some have received more thorough peer review than others, while only one takes color into account. If you have input on which to give more weight please let us know.

We’ve set up a thread on Google Groups in order to discuss.

1. We’re fans of libjpeg-turbo – it powers JPEG decoding in Firefox because its focus is on being fast, and that isn’t going to change any time soon. The mozjpeg project focuses solely on encoding, and we trade some CPU cycles for smaller file sizes. We recommend libjpeg-turbo as a standard JPEG library and for any decoding tasks; use mozjpeg when creating JPEGs for the Web.

Static checking of units in Servo

Matt Brubeck


Web browsers do a lot of calculations on geometric coordinates, in many different coordinate systems and units of measurement. For example, a browser may need to translate a position expressed in hardware pixels relative to the screen origin into CSS px units relative to the document origin. Tricky bugs can occur if the code doesn’t convert correctly between different units or coordinate systems (though fortunately these bugs are unlikely to compete with the world’s most infamous unit conversion bug).

I recently added some new features to help prevent such bugs in rust-geom, a 2D geometry library written in Rust as part of the Servo project. These new features include Length, a type that holds a single numeric value tagged with a unit of distance, and ScaleFactor, a type for converting between different units. Here’s a simple example:

use geom::{Length, ScaleFactor};

// Define some empty types to use as units.
enum Mm {}
enum Inch {}

let one_foot: Length<Inch, f32> = Length(12.0);
let two_feet = one_foot + one_foot;

let mm_per_inch: ScaleFactor<Inch, Mm> = ScaleFactor(25.4);
let one_foot_in_mm: Length<Mm, f32> = one_foot * mm_per_inch;

Units are checked statically. If you try to use a Length value with one unit in an expression where a different unit is expected, your code will fail to compile unless you add an explicit conversion:

let d1: Length<Inch, f32> = Length(2.0);
let d2: Length<Mm, f32> = Length(0.1);
let d3 = d1 + d2; // Type error: Expected Inch but found Mm

Furthermore, the units are used only at compile time. At run time, a Length<Inch, f32> value is stored in memory as just a single f32 floating point value, with no additional data. The units are “phantom types” that are never instantiated and have no run-time behavior.

Length values can also be used in combination with the other rust-geom types like Rect, Size, and Point2D to keep track of units in any of the library’s supported geometric operations. There are also convenient type aliases and constructor functions to make these combined types a little easier to work with. For example:

// Using Point2D with typed units:
let p: Point2D<Length<Mm, i32>> = Point2D(Length(30), Length(40));

// Shorthand for the above:
let p: TypedPoint2D<Mm, i32> = TypedPoint2D(30, 40);

We use this in Servo to ensure that values are scaled correctly between different coordinate systems and units of measure such as device pixels, screen coordinates, and CSS px. You can find some of Servo’s units documented in geometry.rs. These types are used in the compositor and windowing modules, and we’re gradually converting more Servo code to use them.

This has already helped us find some bugs. For example, the function below was originally missing a conversion from device pixels to page pixels, which would cause incorrect mouse move events in any window whose resolution is not 1 hardware pixel per CSS px:

fn on_mouse_window_move_event_class(&self, cursor: Point2D<f32>) {
    for layer in self.compositor_layer.iter() {
        layer.send_mouse_move_event(cursor);
    }
}

Once we added units to the types in this module, this code would not build until it was fixed:

fn on_mouse_window_move_event_class(&self, cursor: TypedPoint2D<DevicePixel, f32>) {
    let scale = self.device_pixels_per_px();
    for layer in self.compositor_layer.iter() {
        layer.send_mouse_move_event(cursor / scale);
    }
}

My Rust code is based directly on Kartikaya Gupta’s C++ code implementing statically-checked units in Gecko. There are some minor differences (for example, Gecko does not include a type for one-dimensional “length” values), but the basic design is easily recognizable despite being translated from C++ to Rust.

Several other projects have also tackled this problem. Notably, the F# language has built-in support for units of measure as a language feature, including static analysis of operations on arbitrary combinations of units (not currently implemented in rust-geom). Their research cites other related work. Languages like Rust and C++ do not include special language-level features for units, but through use of generics and phantom types they allow library code to implement similar zero-overhead static checks. Our work in Gecko and Servo demonstrates how useful this approach can be in practice.

Another Big Milestone for Servo—Acid2

Jack Moffitt


Servo, the next-generation browser engine being developed by Mozilla Research, has reached an important milestone by passing the Acid2 test. While Servo is not yet fully web compatible, passing Acid2 demonstrates how far it has already come.

[Image: Servo’s Acid2 test result]

Acid2 tests common HTML and CSS features such as tables, fixed and absolute positioning, generated content, paint order, data URIs, and backgrounds. Just as an acid test is used to judge whether some metal is gold, the web compatibility acid tests were created to expose flaws in browser rendering caused by non-conformance to web standards. Servo passed the Acid1 test in August of 2013 and has rapidly progressed to pass Acid2 as of March 2014.

Servo’s goal is to create a new browser engine for modern computer architectures and security threat models. It is written in a new programming language, Rust, also developed by Mozilla Research, which is designed to be safe and fast. Rust programs should be free from buffer overflows, use-after-free errors, and similar problems common in C and C++ code. On top of this added safety, Servo is designed to exploit the parallelism of modern computers, making use of all available processor cores, GPUs, and vector units.

The early results are encouraging. Many kinds of browser security bugs, such as vulnerabilities similar to Heartbleed, are prevented automatically by the Rust compiler. Performance is promising as well: many portions of the Web platform that we have implemented run substantially faster than in traditional browsers even in single-threaded mode, and multi-threaded performance is faster still.

Servo has a growing community of developers and is a great project for anyone looking to play with browsers and programming languages. Please visit us at the Servo project page to learn more.


Introducing the ‘mozjpeg’ Project

Josh Aas


Today I’d like to announce a new Mozilla project called ‘mozjpeg’. The goal is to provide a production-quality JPEG encoder that improves compression while maintaining compatibility with the vast majority of deployed decoders.

Why are we doing this?

JPEG has been in use since around 1992. It’s the most popular lossy compressed image format on the Web, and has been for a long time. Nearly every photograph on the Web is served up as a JPEG. It’s the only lossy compressed image format which has achieved nearly universal compatibility, not just with Web browsers but all software that can display images.

The number of photos displayed by the average Web site has grown over the years, as has the size of those photos. HTML, JS, and CSS files are relatively small in comparison, which means photos can easily make up the bulk of the network traffic for a page load. Reducing the size of these files is an obvious goal for optimization.

Production JPEG encoders have largely been stagnant in terms of compression efficiency, so replacing JPEG with something better has been a frequent topic of discussion. The major downside to moving away from JPEG is that it would require going through a multi-year period of relatively poor compatibility with the world’s deployed software. We (at Mozilla) don’t doubt that algorithmic improvements will make this worthwhile at some point, possibly soon. Even after a transition begins in earnest though, JPEG will continue to be used widely.

Given this situation, we wondered if JPEG encoders have really reached their full compression potential after 20+ years. We talked to a number of engineers, and concluded that the answer is “no,” even within the constraints of strong compatibility requirements. With feedback on promising avenues for exploration in hand, we started the ‘mozjpeg’ project.

What we’re releasing today, as version 1.0, is a fork of libjpeg-turbo with ‘jpgcrush’ functionality added. We noticed that people have been reducing JPEG file sizes using a perl script written by Loren Merritt called ‘jpgcrush’, references to which can be found on various forums around the Web. It losslessly reduces file sizes, typically by 2-6% for PNGs encoded to JPEG by IJG libjpeg, and 10% on average for a sample of 1500 JPEG files from Wikimedia. It does this by figuring out which progressive coding configuration uses the fewest bits. So far as we know, no production encoder has this functionality built in, so we added it as the first feature in ‘mozjpeg’.

Our next goal is to improve encoding by making use of trellis quantization. If you want to help out or just learn more about our plans, the following resources are available:

* github
* mailing list

Studying Lossy Image Compression Efficiency

Josh Aas


JPEG has been the only widely supported lossy compressed image format on the Web for many years. It was introduced in 1992, and since then a number of proposals have aimed to improve on it. A primary goal for many proposals is to reduce file sizes at equivalent qualities.

We’d like to share a study that compares three frequently-discussed alternatives, HEVC-MSP, WebP, and JPEG XR, to JPEG, in terms of compression efficiency.

The data shows HEVC-MSP performing significantly better than JPEG and the other formats we tested. WebP and JPEG XR perform better than JPEG according to some quality scoring algorithms, but similarly or worse according to others.

We consider this study to be inconclusive when it comes to the question of whether WebP and/or JPEG XR outperform JPEG by any significant margin. We are not rejecting the possibility of including support for any format in this study on the basis of the study’s results. We will continue to evaluate the formats by other means and will take any feedback we receive from these results into account.

In addition to compression ratios, we are considering run-time performance (e.g. decoding time), feature set (e.g. alpha, EXIF), time to market, and licensing. However, we’re primarily interested in the impact that smaller file sizes would have on page load times, which means we need to be confident about significant improvement by that metric, first and foremost.

We’d like to hear any constructive feedback you might have. In particular, please let us know if you have questions or comments about our code, our methodology, or further testing we might conduct.

Also, the four image quality scoring algorithms used in this study (Y-SSIM, RGB-SSIM, IW-SSIM, and PSNR-HVS-M) should probably not be given equal weight as each has a number of pros and cons. For example: some have received more thorough peer review than others, while only one takes color into account. If you have input on which to give more weight please let us know.

We’ve set up a thread on Google Groups in order to discuss.

Mozilla Research Party—April 2, 6pm

Dave Herman


Mozilla Research is starting an informal gathering of academics and practitioners excited about expanding the foundations of the Open Web. The first event will be held on Tuesday, April 2nd at the beautiful Mozilla office at 2 Harrison St, San Francisco, CA, overlooking the magnificent Bay Bridge, starting at 6:00pm.

We will kick off the event with a keynote speech by Andreas Gal, VP of Mobile Engineering and Research, followed by lightning talks on several of the projects we’ve been working on: Parallel JavaScript, Rust, Servo, asm.js, Emscripten and Shumway.

The real goal is to socialize and exchange ideas. The talk section of this event will be streamed live on Air Mozilla.

Please RSVP through Eventbrite to help us plan refreshments. We hope to see you there!


The Shumway Open SWF Runtime Project

Jet Villegas


Shumway is an experimental web-native runtime implementation of the SWF file format. It is developed as a free and open source project sponsored by Mozilla Research. The project has two main goals:

1. Advance the open web platform to securely process rich media formats that were previously only available in closed and proprietary implementations.
2. Offer a runtime processor for SWF and other rich media formats on platforms for which runtime implementations are not available.

You can view live demo examples using Shumway. More adventurous users can download a Firefox beta build and install the test extension to preview SWF content on the web using Shumway. Please be aware that Shumway is very experimental, is missing features, contains many defects, and is evolving rapidly.

Mozilla’s mission is to advance the Open Web. We believe that we can offer a positive experience if we provide support for the SWF format that is still used on many web sites, especially on mobile devices where the Adobe Flash Player is not available.

The Open Web can be further advanced by making rich media capabilities, previously only available in Flash, also available in the native web browser stack. Shumway is an exciting opportunity to do that for SWF, and we welcome support from external contributors as we advance the technology. We are reaching out to technical users who are interested in contributing to the Shumway implementation in these five areas:

1. Core. This module includes the main file format parser, the rasterizer, and the event system.
2. AVM1. JavaScript interpreter for ActionScript version 1 and version 2 bytecode.
3. AVM2. JavaScript interpreter and JIT compiler for ActionScript version 3 bytecode.
4. Browser Integration. The glue between the web browser and the Shumway runtime.
5. Testing/Demos. Add good demo and test files/links to Shumway.

More information can be found at the following GitHub links:
* https://github.com/mozilla/shumway/wiki
* https://github.com/mozilla/shumway/wiki/Running-the-Examples
* https://github.com/mozilla/shumway/wiki/Building-Firefox-Extension

The Shumway team is active on the #shumway IRC channel for real-time discussion. A technical mailing list is available here. The source code is available on github.