Enhancing webcams with canvas.captureStream()
A really enhanced webcam stream!
Recently, HTMLCanvasElement.captureStream() was implemented in browsers. This allows you to expose the contents of an HTML5 canvas as a MediaStream to be consumed by applications. This is the same base MediaStream type that getUserMedia returns, which is what websites use to get access to your webcam.
The first question that comes to mind is, of course: “Is it possible to intercept calls to getUserMedia, get a hold of the webcam MediaStream, enhance it by rendering it into a canvas and doing some post-processing, then transparently return the canvas’ MediaStream?”
As it turns out, the answer is yes.
We built a cross-platform WebExtension called Zombocam that does exactly this. Zombocam injects itself into every webpage and monkey-patches getUserMedia. If a webpage then calls getUserMedia, we transparently enhance the camera and spawn a floating UI in the DOM that lets you control your different filters and settings. This means that any website that uses your webcam will now get your enhanced webcam instead!
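To make the idea concrete, here is a minimal sketch of the canvas round-trip, not Zombocam's actual code. The function name createCanvasStream and the env parameter are assumptions made for illustration; env simply carries the page's document and requestAnimationFrame so the function can also be exercised outside a browser.

```javascript
// Sketch: route a webcam MediaStream through a canvas and hand back the
// canvas' own MediaStream. `createCanvasStream` and `env` are hypothetical.
function createCanvasStream(webcamStream, env) {
  const { document: doc, requestAnimationFrame: raf } = env;

  // A hidden <video> element decodes the webcam frames for us.
  const video = doc.createElement('video');
  video.srcObject = webcamStream;
  video.muted = true; // no audio playback from the helper element
  video.play();

  const canvas = doc.createElement('canvas');
  const ctx = canvas.getContext('2d');

  function drawFrame() {
    // videoWidth stays 0 until the first frame has been decoded.
    if (video.videoWidth > 0) {
      if (canvas.width !== video.videoWidth || canvas.height !== video.videoHeight) {
        canvas.width = video.videoWidth;
        canvas.height = video.videoHeight;
      }
      ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
      // Any post-processing of the frame would happen here.
    }
    raf(drawFrame);
  }
  drawFrame();

  // This is the stream the calling website ends up with.
  return canvas.captureStream();
}
```

In a page this would be called as createCanvasStream(stream, { document, requestAnimationFrame }). The sketch uses a 2D context for brevity; the effects described later in this post run in WebGL instead.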
This blog post is a technical walk-through of the different challenges we ran into while developing Zombocam.
Monkey-patching 101
Monkey-patching getUserMedia essentially means replacing the browser’s implementation with our own. We supply our own getUserMedia function that wraps the browser’s implementation and adds an intermediary canvas processing step (and fires up a UI). Of course, since getUserMedia is a web JS API, there are one million different versions that need to be supported. There’s Navigator.getUserMedia and MediaDevices.getUserMedia, and then vendor prefixes on top of that (e.g. Navigator.webkitGetUserMedia and Navigator.mozGetUserMedia), and then there are different signatures (e.g. callbacks vs promises), and then on top of that again they historically support different syntaxes for specifying constraints. Oh, and they have different errors too. To be fair, MediaDevices.getUserMedia, the one true getUserMedia, solves all of these problems, but the web needs to wait for everyone to stop using the old versions first.
One million twisty little getUserMedia functions, all different.
All of this boils down to having to type a lot of code to iron out the inconsistencies between different implementations, but in the happy case we end up with something like:
Monkey-patching one of the many heads of the Hydra that is getUserMedia.
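A minimal sketch of such a patch for the promise-based MediaDevices.getUserMedia only, assuming a hypothetical enhance(stream, constraints) function that returns the processed stream. The legacy callback-based and vendor-prefixed variants would each need their own wrapper and are left out here.

```javascript
// Sketch: wrap getUserMedia on a MediaDevices-like object.
// `patchGetUserMedia` and `enhance` are hypothetical names for illustration.
function patchGetUserMedia(mediaDevices, enhance) {
  // Keep a bound reference so the browser's implementation still sees the
  // right `this` when we call it later.
  const original = mediaDevices.getUserMedia.bind(mediaDevices);

  mediaDevices.getUserMedia = function patchedGetUserMedia(constraints) {
    return original(constraints).then((stream) => {
      // Audio-only requests have no video to process; pass them through.
      const wantsVideo = Boolean(constraints && constraints.video);
      return wantsVideo ? enhance(stream, constraints) : stream;
    });
  };

  // Handed back in case the caller needs the unpatched function.
  return original;
}

// In a page: patchGetUserMedia(navigator.mediaDevices, enhance);
```

Errors from the browser's implementation propagate unchanged, because the rejection of the original promise is never caught by the wrapper.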
The rendering pipeline
Most of the effects and filters in Zombocam are implemented as WebGL fullscreen quad shader passes. This is a WebGL rendering technique that essentially lets us generate images on the fly on a per-pixel basis by using a fragment shader. This is elaborated upon in thorough detail in this excellent article by Alexander Oldemeier. Using this technique means that the image processing can be done on the GPU, which is essential to achieve smooth real-time performance. For each video frame, the frame is uploaded to the GPU and made available to an effect’s fragment shader, which is responsible for implementing the specific transformation for that effect.
The GLSL source for an invert effect fragment shader. This fragment shader inverts the value of each pixel.
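For illustration, an invert shader for WebGL 1 might look like the string below. The uniform and varying names (u_frame, v_texCoord) are assumptions, not Zombocam's actual identifiers. The small CPU function next to it performs the same per-pixel operation on RGBA bytes, which makes the transformation easy to check without a GPU.

```javascript
// GLSL ES 1.00 fragment shader, kept as a string so it can be passed to
// gl.shaderSource(). The names u_frame and v_texCoord are hypothetical.
const INVERT_FRAGMENT_SHADER = `
precision mediump float;
uniform sampler2D u_frame;   // the current video frame as a texture
varying vec2 v_texCoord;     // texture coordinate from the vertex shader

void main() {
  vec4 color = texture2D(u_frame, v_texCoord);
  // Invert the color channels, leave alpha untouched.
  gl_FragColor = vec4(1.0 - color.rgb, color.a);
}
`;

// CPU reference of the same operation on RGBA bytes (0..255 per channel).
function invertPixels(rgba) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    out[i] = 255 - rgba[i];         // red
    out[i + 1] = 255 - rgba[i + 1]; // green
    out[i + 2] = 255 - rgba[i + 2]; // blue
    out[i + 3] = rgba[i + 3];       // alpha is preserved
  }
  return out;
}
```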
Effects in Zombocam are split into three main categories: color filters, distortion effects and overlays. Filters in the first category are implemented as non-linear per-channel functions with hard-coded mappings of input to output values. The idea is that a color grading expert creates a nice-looking preset using their favorite color grading tool. Then that color grading is applied to three 0–255 gradients, one for each color channel. The color graded gradients then serve as lookup tables for the pixel values in each frame in order to create a color graded output. This is a simplified version of the technique elaborated upon in this excellent article by Slick Entertainment.
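The lookup step can be sketched on the CPU as follows. The helper names buildLut and applyColorLut are hypothetical, and the mapping functions stand in for gradients exported from a real grading tool.

```javascript
// Build a 256-entry lookup table from a mapping function. In practice the
// table would be read from a color graded 0..255 gradient image instead.
function buildLut(mapping) {
  return Uint8ClampedArray.from({ length: 256 }, (_, value) => mapping(value));
}

// Replace every channel value with the entry found in that channel's table.
// `lut` is { r, g, b }, each a 256-entry table; alpha is left unchanged.
function applyColorLut(rgba, lut) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    out[i] = lut.r[rgba[i]];
    out[i + 1] = lut.g[rgba[i + 1]];
    out[i + 2] = lut.b[rgba[i + 2]];
    out[i + 3] = rgba[i + 3];
  }
  return out;
}

// Example preset: inverted red, untouched green, darkened blue.
const examplePreset = {
  r: buildLut((v) => 255 - v),
  g: buildLut((v) => v),
  b: buildLut((v) => Math.floor(v / 2)),
};
```

Because the tables are precomputed, the per-pixel cost is three array reads no matter how complex the original grading was.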
Distortion effects are implemented as non-linear pixel coordinate transformation functions on the input image. That is, the pixel at coordinate (x, y) in the transformed image is copied from the pixel at coordinate f(x, y) in the original image. As long as you define f correctly, you can implement swirls, pinches, magnifications, hazes and all sorts of other distortions.
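As a CPU sketch of this inverse mapping, the hypothetical distort function below fills every output pixel by sampling the source at f(x, y). Nearest-neighbour sampling and clamping at the image edges are simplifying assumptions; a shader would typically get filtered sampling from the texture unit.

```javascript
// For every output pixel (x, y), copy the source pixel found at f(x, y).
// `src` is RGBA bytes in row-major order; `f` returns [sourceX, sourceY].
function distort(src, width, height, f) {
  const out = new Uint8ClampedArray(src.length);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const [sx, sy] = f(x, y);
      // Nearest-neighbour sampling, clamped to the image edges.
      const cx = Math.min(width - 1, Math.max(0, Math.round(sx)));
      const cy = Math.min(height - 1, Math.max(0, Math.round(sy)));
      const from = (cy * width + cx) * 4;
      const to = (y * width + x) * 4;
      for (let c = 0; c < 4; c++) {
        out[to + c] = src[from + c];
      }
    }
  }
  return out;
}

// The simplest possible f: a horizontal mirror.
function mirror(width) {
  return (x, y) => [width - 1 - x, y];
}
```

A swirl or pinch only differs in f: it would convert (x, y) to polar coordinates around a center, adjust the angle or radius, and convert back.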
Finally, overlay effects simply overlay new pixels on parts or all of the frame. These new pixels can be sourced from anywhere, including other video sources. This effectively lets us overlay Giphy videos directly in the camera stream! Productivity will never be the same.
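A minimal sketch of overlaying, assuming an opaque camera frame and ordinary alpha blending. The overlay function and its parameters are hypothetical; the overlay pixels could come from any source, for example a frame of another video.

```javascript
// Blend `sprite` (RGBA bytes, `spriteWidth` pixels wide) onto `frame`
// (RGBA bytes, `frameWidth` pixels wide) with its top-left corner at
// (left, top). The frame is assumed opaque, so its alpha is kept as-is.
function overlay(frame, frameWidth, sprite, spriteWidth, left, top) {
  const out = Uint8ClampedArray.from(frame);
  const frameHeight = frame.length / 4 / frameWidth;
  const spriteHeight = sprite.length / 4 / spriteWidth;

  for (let sy = 0; sy < spriteHeight; sy++) {
    for (let sx = 0; sx < spriteWidth; sx++) {
      const x = left + sx;
      const y = top + sy;
      // Skip overlay pixels that fall outside the frame.
      if (x < 0 || y < 0 || x >= frameWidth || y >= frameHeight) continue;

      const s = (sy * spriteWidth + sx) * 4;
      const d = (y * frameWidth + x) * 4;
      const alpha = sprite[s + 3] / 255;
      for (let c = 0; c < 3; c++) {
        out[d + c] = sprite[s + c] * alpha + frame[d + c] * (1 - alpha);
      }
    }
  }
  return out;
}
```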
Since effects can be chained in Zombocam, the output from one effect’s rendering pass is fed directly as input to the next effect’s rendering pass. This opens up a wide array of possible effect combinations.
Zombocam can turn you into a cyclops if you’re not careful when chaining effects!
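Conceptually, chaining is function composition over frames. The sketch below assumes each effect is a plain function from frame to frame; chainEffects is a hypothetical name. On the GPU, the same idea is commonly realised by rendering each pass into a texture that the next pass reads from, though the source does not say how Zombocam manages its render targets.

```javascript
// Compose a list of effects into one: the output of each effect becomes
// the input of the next, in list order.
function chainEffects(effects) {
  return (frame) => effects.reduce((current, effect) => effect(current), frame);
}

// Two toy effects on RGBA bytes to show that order matters.
const brighten = (rgba) => rgba.map((v, i) => (i % 4 === 3 ? v : Math.min(255, v + 100)));
const invert = (rgba) => rgba.map((v, i) => (i % 4 === 3 ? v : 255 - v));

const brightenThenInvert = chainEffects([brighten, invert]);
const invertThenBrighten = chainEffects([invert, brighten]);
```

An empty effect list returns the frame untouched, which is a convenient default when the user has switched every filter off.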
Works everywhere! (*)
In theory, this approach works everywhere out of the box, so you can use it when you’re snapping a profile picture on Facebook or hanging out in video meetings on Appear.in or Google Hangouts. In practice, however, the story is a little more nuanced. Reliably monkey-patching getUserMedia in time in a cross-browser fashion via injection from a WebExtension without going overboard with permissions turns out to be hard in some cases. This means that if an application is really adamant about calling getUserMedia reeeally early in the page’s lifetime, getUserMedia might not be monkey-patched yet. In that case, Zombocam will simply never trigger, and it will be as if it weren’t ever even installed.
When attempting to transparently monkey-patch APIs one has to take extreme care to make sure that the monkey-patching actually is transparent. That means properly forwarding all sorts of properties on the Streams and Tracks returned from getUserMedia that applications might expect and depend on.
One specific example of this that we ran into was with Appear.in’s new premium offering, where you can screen-share and show your webcam stream in your meeting room at the same time. The application relied on the name of one of the Tracks being “Screen”, which we didn’t properly forward to the Tracks that we got from our canvas. Because of this, Appear.in didn’t know which of the tracks was the screen-sharing track, and things stopped working. Properly forwarding the name property solved the issue, and we learned an important lesson in the virtues of actually being transparent when trying to transparently intercept APIs.
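One way to forward such read-only properties is to shadow them on the canvas track with getters that defer to the original track. This is a sketch under assumptions: forwardTrackProperties is a hypothetical helper, and the default property list uses label, the standard MediaStreamTrack attribute holding a track's human-readable name, which may or may not be the exact property the application inspected.

```javascript
// Make `targetTrack` report the same values as `sourceTrack` for the given
// properties. Defining own accessors shadows the read-only getters that
// live on the track's prototype.
function forwardTrackProperties(sourceTrack, targetTrack, props = ['label']) {
  for (const prop of props) {
    Object.defineProperty(targetTrack, prop, {
      configurable: true,
      enumerable: true,
      // Read through on every access so later changes stay in sync.
      get: () => sourceTrack[prop],
    });
  }
  return targetTrack;
}
```

With the original track as source and the canvas track as target, an application reading the label of the enhanced track sees what the original track reports.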
What’s next: audio filters
With the new release of Zombocam coming up this week, we’ve taken it one step further and enhanced getUserMedia audio tracks as well, using the Web Audio API. More on that in a later blog post!