The evolution of WebRTC 1.0.

June 5, 2017

Contributed by Jan-Ivar Bruaroey,

March 2018 update: Firefox 59 now implements transceivers (stage 3) as described in this article. Remote tracks are now muted and temporarily removed from their stream(s), rather than ended, in response to direction changes (e.g. from removeTrack). Listen to stream.onremovetrack instead of track.onended.

The WebRTC spec is nearing completion, going to Candidate Recommendation real soon. With Google planning to finish 1.0 in Chrome by the end of the year, it seems a good time to recap what this means, as far as changes to what folks are doing today.

Under the hood, the biggest remaining obstacle to advanced wire interop, is that, unlike Firefox, Chrome hasn’t implement the spec’s “Unified Plan” for multiple audio and video tracks yet. Be sad or happy, but this blog isn’t about bridging that gap through SDP mangling. At this point, it’s probably better to wait for Google to address this.

But web developers need to prepare, because the JavaScript API will be different. This may be a surprise to those who haven’t followed the spec. A good start is to look at what Firefox is already doing. But there’s more. I don’t mean superficial things like promises, which all browsers support now. Instead, expect a change in the media model.

The API gap: How the RTCPeerConnection API pivoted twice

The remaining API gap between browsers is how media is managed over a peer connection. If you’re reading a WebRTC book on this, throw it out.

The RTCPeerConnection API has endured three design iterations on this topic over the years. As a result, each browser today implements a snapshot from a different point in the timeline of an evolving spec.

The three main stages in the design were:

addStream and removeStream (Chrome today)
addTrack, removeTrack, and sender.replaceTrack (Firefox before 59)
addTransceiver and early media (Firefox today)

Basically, early stages were leaky abstractions on-top of later stages, on-top of the SDP protocol. Stages 2 and 3 remain in the spec. It’s best to explain the early stages first.

Stage 1: addStream and removeStream

This is the oldest model. It is no longer in the spec. It starts to break down when we want to manipulate individual tracks in the stream. E.g. let’s say we want to add/remove a video track to an already established audio-only call:

Activate the “Result” tab and you can add and remove video using the checkbox. This works in Chrome and Firefox (where we must polyfill the deprecated removeStream), though wait 5 seconds on remove in Firefox due to a bug.

Looking at the code, we start by adding a stream with only an audio track in it. On the remote end, we add listeners to the stream to detect arrival of new tracks, and we add listeners to the tracks to learn when they end.

When the user checks the checkbox, we add the video track to the stream using stream.addTrack(videoTrack). We then have to remove and re-add the stream to kick things into gear, or nothing happens.

You might have expected the last step to be unnecessary, and for the peer connection to start sending videoTrack automatically as soon as it was added to the stream we added earlier. The problem is, you might not expect this as well. People might have other reasons for adding tracks locally, and be surprised by it being sent to the other side as a side-effect. Side-effect APIs are bad.

Stage 2: addTrack, removeTrack, and sender.replaceTrack

This is the second model. It is still in the spec. A pivot to tracks solved the previous problems. Our example now becomes simpler (make sure you stopped the previous example before running):

This works in Firefox, and when run, it should behave the same. It half-works in Chrome today thanks to adapter.js which recently polyfilled addTrack in Chrome! You just won’t be able to remove the video in Chrome, since there’s no removeTrack there yet.

Looking at the code, what’s different is the local stream configuration is now irrelevant. The remote projection of streams and tracks is constructed entirely from the inputs to addTrack, taking id‘s from optionally passed-in streams and creating streams with the same id's remotely. No more side-effects.

On the remote end, we learn of new tracks through ontrack and we add listeners to the tracks to learn when they end.

With dedicated sender and receiver objects, we now have control surfaces for each media transport, separate from the media itself. E.g. we can downscale video to different levels for simulcast using sender.setParameters(), or get relevant stats about this particular transmission using sender.getStats() and receiver.getStats().

This solves another problem as well: In stage 1 there was no way to replace a track we’re currently sending with a different one, without informing the other side first through expensive renegotiation. Now we can, using sender.replaceTrack().

This model adds quite a bit of control. But this abstraction still doesn’t cover the following realities well:

ICE allocates bi-directional media ports, yet there’s no way to know which sender and receiver go together.
There’s no clear re-use model for these ports that wouldn’t risk sending the wrong media to a receiver at times.
The {offerToReceiveVideo: 3} offer-options stuff is a kludgy way to allocate additional m-lines (ports).
Using stream ids to correlate local and remote streams can collide with ids from other participants.

So while this model remains, the API had to change again.

Stage 3: addTransceiver and early media

This is the third model. No browsers implement it (at the time of writing this blog. Firefox 59 now does). It solves the remaining 4 problems, plus one more: With call-setup time being critical more often than not, some users want to send media early, before negotiation has even completed.

First, the main difference with this model is it naturally groups a sender and a receiver together into a transceiver. The second, less obvious, difference is that this triumvirate is always created together.

That’s right: the instant you call addTrack, you not only have a sender, you have a transceiver and a receiver as well, and that receiver has a track, which if you add it to a stream and set that stream on a video element, will play instantly! – It will play silent black until the other side actually sends something, but this may happen before the two-way handshake with the peer has completed.

This differs quite substantially from all implementations today where remote tracks are created by setRemoteDescription. That said, this model tries to preserve the stage 2 abstraction as best it can, including firing the track event at the regular time, so our stage 2 example will still work.

Update: The WebRTC working group has decided remote tracks are now muted rather than ended in response to direction changes (e.g. from removeTrack), since they correspond to sender objects on the other side, which may resume. This is a breaking spec change. However, as a mitigation, remote tracks which are muted in this fashion are temporarily removed from their stream(s). This means listening to stream.onremovetrack instead of track.onended will work. Or listen to the new track.onmuted for this and SSRC timeouts.

Again, the stage 2 example will continue to work, but requires updating to listen to the removetrack event . That’s because, again addTrack implicitly creates a transceiver, just like addTranceiver does (unless there’s an unused transceiver to re-use). This transceiver survices removeTrack.

Our example therefore may be rewritten like this (please stop earlier examples before running in Firefox):

The benefit is it now re-uses the same m-line over and over, whereas the stage 2 addTrack/removeTrack version created a new m-line each time you checked the checkbox (because of the re-use problem). This scales better, since m-lines are never removed, unless you call transceiver.stop().

It also has some symmetry advantages: For instance, we could rely solely on replaceTrack here, create the video transceiver as "sendonly" initially, and skip renegotiation and setDirection entirely, just by removing some lines.

It also solves the 4th problem, which is the subject of the next chapter.

Correlate by transceiver.mid instead of stream/track.id.

If you’re sending track ids out-of-band today hoping to correlate local and remote tracks, WebRTC 1.0 will break you.

The most obvious reason perhaps is replaceTrack. But even if you avoid it, things will likely break, on both ends. That’s because the receiver and its track will be created ahead of setRemoteDescription next year, at least on the offerer side, to support early media. Once a track has been created, its id is immutable, which means it cannot be changed later to match the id sent from the other side.

Sending stream ids out-of-band to correlate local and remote streams, should continue to work as well as it does today (because you don’t learn of remote streams until the track event fires). Here’s an example of that that sends two videos, and swaps them after 3 seconds using replaceTrack (requires Firefox). We want the camera video to start on the left:

Without correlating streams, there would be no telling which video would end up on the left vs. right. Thus the fiddle correlates streams like this today:

  let videos = {[camStream.id]: videoA, [blankStream.id]: videoB};

  pc2.ontrack = ({streams: [stream]}) => {
    let video = videos[stream.id];
    if (!video.srcObject) video.srcObject = stream;
  }

However, this approach has never been sound, because stream ids are only unique to one connection. This is fine for this demo, but in general, a client receiving streams from multiple simultaneous connections may experience intermittent id collisions. Those might be rare, but who wants intermittent mix-up of video?

With the new spec there’s a better way, using transceiver.mid:

  let videos = {[camTransceiver.mid]: videoA, [blankTransceiver.mid]: videoB};

  pc2.ontrack = ({transceiver, streams: [stream]}) => {
    let video = videos[transceiver.mid];
    if (!video.srcObject) video.srcObject = stream;
  }

The mid isn’t globally unique either, but unlike streams id it doesn’t need to be. The app should already know the different peer connections it has, and using the mid be able to locate any remote stream or track, and thus correlate it to the other side.

Conclusion.

This blog covers only a piece of the API. There are lots of other changes not covered here, in the API and on the wire. A lot of effort has gone into this spec, and hopefully knowing the evolution of at least one part of it might help with appreciating all it does. Looking forward to seeing it all implemented soon!