Firefox now implements the RTCRtpTransceiver API, a.k.a. stage 3 in my blog post “The evolution of WebRTC 1.0.” from last year. Transceivers more accurately reflect the SDP-rooted network behaviors of an RTCPeerConnection. E.g.
addTrack() now creates a receiver at the same time, which correlates better with the bi-directional m-line that begins life in the SDP offer, rather than being something that arrives later once negotiation completes. This lets media arrive early on renegotiation.
But most WebRTC sites today are still written the old way, and there are subtle differences to trip over. Web developers are stuck in a bit of limbo, as other browsers haven’t caught up yet. In light of that, we should probably dive right into two immediate breaking spec changes you might have run into in Firefox 59:
- Remote tracks are now muted and temporarily removed from their stream(s), rather than ended, in response to direction changes (e.g. from the poorly named
- track.ids at the two end-points rarely correlate anymore.
Importantly, this affects you whether you use transceivers or not.
addStream) is just a tweaked version of
addTransceiver these days. This blog will look at how to handle this, tying up loose ends from last year’s blog post, and demonstrate how Firefox works.
The short answer to “Why no ended?”, is that incoming tracks now correspond to sender objects on the other side, which may resume in response to
The short answer to “Why no matching track IDs?”, is that incoming (receiver) tracks now usually exist ahead of connecting, their immutable
ids unable to match up with the other side (and were never guaranteed to be unique anyway).
As you might notice, this boils down to changes to the lifetime of remote tracks. Gone is the misleading symmetry between local and remote tracks, or the streams they belong to for that matter. That symmetry looked pretty, but got in the way of fully controlling tracks remotely.
A more useful analogy is that of
transceiver.sender as a remote-control of the
transceiver.receiver.track held by the other peer. Their lifetimes match that of the transceiver itself. Here’s how this remote-control works:
|When we do this…||…the other side sees this|
|…then once media flows||
|…then once media flows||
||Media seamlessly goes black/silent|
||Media seamlessly resumes|
||Media seamlessly changes|
||Media seamlessly halts|
|Network (RTP SSRC) timeout||
* = after renegotiation through
(Note that many of these transitions are state-based and only fire events if the state ends up changing.)
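To watch these transitions from the receiving side, something like the following sketch can help. It only logs the standard `mute`, `unmute` and `ended` track events plus the stream's `addtrack` and `removetrack` events; the helper name `watchRemote` and its `log` parameter are made up for illustration and are not part of any API.

```javascript
// Hypothetical helper: log state changes on a remote track and its stream.
// Works on any objects that accept on<event> handlers, so it can be tried
// outside a browser with plain stand-in objects.
function watchRemote(track, stream, log = console.log) {
  track.onunmute = () => log(`${track.kind} unmuted`);
  track.onmute = () => log(`${track.kind} muted`);
  track.onended = () => log(`${track.kind} ended`);
  if (stream) {
    // These fire when the browser adds or removes a track on the remote
    // stream, e.g. in response to a direction change on the other side.
    stream.onaddtrack = e => log(`${e.track.kind} added to stream`);
    stream.onremovetrack = e => log(`${e.track.kind} removed from stream`);
  }
}

// Browser usage, assuming pc is an RTCPeerConnection:
// pc.ontrack = ({track, streams}) => watchRemote(track, streams[0]);
```

Since the handlers are installed from `ontrack`, they are in place before the first `unmute` arrives for that track.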
You can try out this remote-control below in Firefox (59):
In the “Result” tab, grant permission, then click the buttons in sequence from top to bottom, to see the video update.
Then try the sequence again from the top. The sequence is repeatable, because we use the newest
transceiver returned from
addTransceiver each go-around (they accumulate).
Note that we get
addTransceiver only the second time around, since we had no opportunity to add listeners to the remote stream the first time.
Interestingly, the video element, here representing the remote side, will play along with this, showing the latest video, provided we always either:
- set its
Stopping works because video elements ignore ended tracks. Changing direction works because the temporarily-muted tracks get removed from the stream the video element is playing. Try clicking the buttons in different order to prove this out.
Now this may look like a lot of ways to accomplish the same thing. The differences may not be appreciable in a small demo, but each control has trade-offs:
- stop() terminates both directions (in this example we were only sending one way). Also, stopped transceivers stick around in pc.getTransceivers(), at least locally, and litter the remote stream with ended tracks (however, the Offer/Answer m-line may get repurposed in subsequent negotiations apparently).
- direction-changes reuse the same transceiver and track without waste, but still require re-negotiation.
- replaceTrack(null) is instant, requiring no negotiation at all, but stops sending without informing the other party. This may be indistinguishable from a network issue if the other side is looking at stats.
- track.enabled = false never completely halts network traffic, instead sending one black frame per second, or silence for audio. This is the only control that lets browsers know the camera/microphone is no longer in use.
For the above reasons, the spec encourages implementing “hold” functionality using both
replaceTrack(null) in combination.
Don’t forget the camera!
In addition to the spec’s recommended “hold” solution, consider setting
track.enabled = false at the same time. If you do, Firefox 60 will turn off the user’s camera and hardware indicator light, for less paranoid face-muting. This is a spec-supported feature Chrome does not have yet, and is the subject of my next blog.
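As a rough sketch of how the three controls might be combined, consider the helpers below. The names `hold` and `resume`, the `heldDirection` parameter and its `"inactive"` default are assumptions made for illustration; pick whichever direction suits your application.

```javascript
// Hypothetical "hold": disable the local track (so the browser knows the
// camera/microphone is unused), change direction, and stop sending at once.
function hold(transceiver, heldDirection = "inactive") {
  const saved = {
    track: transceiver.sender.track,
    direction: transceiver.direction
  };
  if (saved.track) saved.track.enabled = false;
  // The direction change only takes effect after renegotiation.
  transceiver.direction = heldDirection;
  // replaceTrack(null) needs no negotiation; it returns a promise.
  const done = transceiver.sender.replaceTrack(null);
  return {saved, done};
}

// Hypothetical "resume": undo everything hold() did.
function resume(transceiver, saved) {
  if (saved.track) saved.track.enabled = true;
  transceiver.direction = saved.direction;
  return transceiver.sender.replaceTrack(saved.track);
}
```

Keeping the original track and direction in `saved` matters, because `sender.track` is `null` once `replaceTrack(null)` has run.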
Correlate tracks by transceiver.mid or order
Last year’s blog explained how using track ids out-of-band to correlate remote tracks would no longer work. It recommended using
transceiver.mid instead for this, but, sans implementation, left out a working example.
Here’s an example that correlates tracks regardless of arrival order to always put the camera track on the left:
In the “Result” tab, check the boxes in any order; the camera always appears on the left, the other one on the right.
The trick here in
ontrack is using
camTransceiver.mid to pick between the left or right video element. This is the
mid from the other side. In the real world, we’d send this ID over a data-channel or something, but you get the idea. Since we connect the transceivers ahead of time, we could do that.
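The gist of that `ontrack` logic might look like the sketch below. The helper name `videoForTransceiver`, the `cameraMid` parameter and the `videos` object are made up for illustration; `cameraMid` stands for the camera transceiver's `mid`, however it reached us.

```javascript
// Hypothetical helper: choose a video element based on the transceiver's mid.
// cameraMid is the mid of the sending side's camera transceiver, assumed to
// have been delivered to us out-of-band.
function videoForTransceiver(transceiver, cameraMid, videos) {
  return transceiver.mid === cameraMid ? videos.left : videos.right;
}

// Browser usage, assuming pc2 is the receiving RTCPeerConnection and
// left/right are <video> elements:
// pc2.ontrack = ({transceiver, track}) => {
//   const video = videoForTransceiver(transceiver, cameraMid, {left, right});
//   video.srcObject = new MediaStream([track]);
// };
```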
But what if we needed to correlate on initial connection? How would the ID get there in time in the real world? The IDs are in the SDP, but which one is which?
Well-defined transceiver order.
Something I overlooked last year is that
setLocalDescription. We avoided that problem above by establishing the connection ahead of sending anything, but this makes
mid useless for correlating in the initial negotiation.
The good news here since last year is that
pc.getTransceivers() order is now well-defined! Transceivers always appear in creation-order, whether introduced by you with
addTransceiver, or in m-line order from
setRemoteDescription. That m-line order matches the other side’s transceiver order, at least initially.
With some care, this means we can correlate tracks using transceiver order from the initial offer itself. Here’s an example—without check-boxes this time—that does that. We’re also introducing a microphone into the picture:
In the “Result” tab, you’ll see the camera on the left again. You can mute the audio in that video element as well.
This time in
pc2.ontrack, we don’t cheat by looking at the other side’s
transceivers. We only look at our own
pc2.getTransceivers() which is guaranteed to be in the same order here.
Specifically, we check whether this is the third transceiver in
pc2.getTransceivers(), and if so, put it on the right, otherwise on the left. We also use the
streams argument to intuitively group the camera and microphone tracks together. Since the third track didn’t have a stream in this case, we could have keyed off of that difference instead. There may be several ways to correlate at times: by transceiver order, by stream, or by out-of-band
If you’re wondering how we can access the third transceiver already in the first
ontrack, the API guarantees that
setRemoteDescription is effectively done by the time the
track events all fire. All transceivers are there; all tracks are in their streams.
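A minimal sketch of that order-based check follows. The helper name `sideForTransceiver` and its string return values are assumptions for illustration only.

```javascript
// Hypothetical helper: the third transceiver (index 2) goes on the right,
// everything else on the left. Relies on getTransceivers() returning
// transceivers in creation order.
function sideForTransceiver(pc, transceiver) {
  return pc.getTransceivers().indexOf(transceiver) === 2 ? "right" : "left";
}

// Browser usage, assuming pc2 is the answering RTCPeerConnection:
// pc2.ontrack = ({transceiver}) => {
//   const side = sideForTransceiver(pc2, transceiver);
//   // ...attach the track to the video element on that side
// };
```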
A couple of things to watch out for if you’re going to rely on transceiver order:
- Unlike getTransceivers(), the MediaCapture spec’s stream.getTracks() does not guarantee order across browsers! Therefore, avoid for-looping over it when adding tracks to a peer connection if you want deterministic order.
- Once you stop() a transceiver, it remains in getTransceivers() locally, but m-line reuse may cause the other side to get out of lockstep with indexes once more transceivers are added from this point on.
- Be careful about accidentally adding transceivers on the answering side during negotiation. Unlike addTrack(), addTransceiver() always creates a new transceiver, never reusing existing ones with available m-lines.
The second point pretty much limits the usefulness of this correlation-technique to initial offers. The third point is the final topic of this post: using transceivers on the answering-side.
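One way to sidestep the first pitfall is to pick tracks by kind rather than iterating over `stream.getTracks()`. The sketch below assumes a stream with at most one audio and one video track, and the helper name `addTracksInFixedOrder` is made up for illustration.

```javascript
// Hypothetical helper: always add audio before video, so transceiver order
// does not depend on the order the browser returns tracks in.
// Assumes at most one track of each kind in the stream.
function addTracksInFixedOrder(pc, stream) {
  const ordered = [...stream.getAudioTracks(), ...stream.getVideoTracks()];
  return ordered.map(track => pc.addTransceiver(track, {streams: [stream]}));
}

// Browser usage:
// const stream = await navigator.mediaDevices.getUserMedia({video: true, audio: true});
// const [micTransceiver, camTransceiver] = addTracksInFixedOrder(pc, stream);
```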
Why use transceivers at all?
In case you feel “It’s too complicated! Bring back
addStream()!”, it may be worth addressing its shortcomings.
Negotiation in WebRTC is inherently asymmetric. The now-deprecated 2014
addStream() API was a largely symmetric abstraction. It worked well for one video track and one audio track. Mapping to SDP was trivial: One bi-directional m-line for video, another for audio, and we were good.
But add a third track, and we’re at an impasse: We either surrender control over how things get paired to go over the wire, or we need an API that reflects how things go over the wire. Luckily, we don’t have to choose: Make browsers build the missing API, and shim
addStream() on top of that if you want, or use
addTrack() with abandon.
In other words, feel free to ignore transceivers if you don’t care how your media gets from A to B. On the other hand, if you dislike leaky abstractions, or you’re curious how to send 3 tracks in both directions using only 3 m-lines total, then read on.
How to answer with transceivers.
So far, we’ve only been sending in one direction. Let’s send the 3 tracks from earlier in both directions this time. The classic way to do this on the answering side is with
addTrack(). Perhaps surprisingly, this is still the best option,
and currently the only option unless you’re OK with tracks being stream-less. More on this later. Update: This has been fixed in the spec, but not Firefox yet.
addTrack() only uses 3 transceivers, because
addTrack() automatically attaches to any existing unused (
"recvonly") transceivers—in transceiver order—before creating new ones. This is a bit magic.
On the other hand, calling
addTransceiver() 3 times is straightforward, but would give us 6 m-lines total.
To make do with only 3 m-lines, the answerer must effectively modify the 3 transceivers created by
setRemoteDescription from the offer, instead of adding 3 of its own. Think of it as the offerer setting up the transceivers, and the answerer plugging into them.
Make sure you’ve stopped the previous example before running this one.
In the “Result” tab, you’ll see 6 remote tracks, 3 each way, over 3 m-lines total. Mute audio in both video elements.
We use addTransceiver() on one end, and
addTrack() on the other to re-use m-lines, relying on their order.
How to answer ONLY with transceivers.
addTrack() looks for unused transceivers to usurp, based on order and kind. This reliance on order may not always be practical. E.g. when using out-of-band
mid, it’s more natural to want to modify the transceiver directly. Here the specification comes up a bit short unfortunately.
Let’s see how we’d answer the 3 transceivers without relying on their order, and then discuss what works and what doesn’t (again, make sure you’ve stopped the previous example first):
Rather than resort to
addTrack(), we explicitly modify each transceiver on the answering end:
- We change its
- we add our track using
Unfortunately, this API offers no way to associate streams with these tracks, so our tracks end up being stream-less. Our
ontrack code becomes more complicated as a result, since the camera and microphone tracks no longer come grouped into a stream. But at least it works.
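A sketch of this approach, under some assumptions: we set each transceiver's `direction` to `"sendrecv"` and attach our track with `sender.replaceTrack()`, looking tracks up by `mid`. The helper name `answerWithTransceivers` and the `tracksByMid` map are hypothetical.

```javascript
// Hypothetical helper for the answering side. Call it after
// setRemoteDescription(offer) and before createAnswer().
// tracksByMid is a Map from mid to the local track we want to send there.
function answerWithTransceivers(pc, tracksByMid) {
  const pending = [];
  for (const transceiver of pc.getTransceivers()) {
    const track = tracksByMid.get(transceiver.mid);
    if (!track) continue; // nothing to send on this m-line
    transceiver.direction = "sendrecv";
    pending.push(transceiver.sender.replaceTrack(track));
  }
  // Resolves once every replaceTrack() call has completed.
  return Promise.all(pending);
}
```

Transceivers without a matching entry are left untouched, so they stay receive-only from the answerer's point of view.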
Extending the API to provide a way to add stream associations in this situation seems reasonable. I’ve filed an issue on the spec about this. Update: This has been fixed with the sender.setStreams() API.
We’ve found people generally don’t care how media is organized over the wire, until they do. The tipping point is usually some combination of needing to do something more complicated, trying to correlate media to some underlying network metric, or explain some anomaly not gleaned from simpler API abstractions. Hopefully this API gives some insight into how WebRTC actually works, giving you options should you need them.