April 4, 2018

Exploring RTCRtpTransceiver.

Contributed by Jan-Ivar Bruaroey.

Firefox now implements the RTCRtpTransceiver API, a.k.a. stage 3 in my blog post “The evolution of WebRTC 1.0.” from last year. Transceivers more accurately reflect the SDP-rooted network behaviors of an RTCPeerConnection. E.g. addTransceiver() (or addTrack) now creates a receiver at the same time, which correlates better with the bi-directional m-line that begins life in the SDP offer, rather than being something that arrives later once negotiation completes. This lets media arrive early on renegotiation.

But most WebRTC sites today are still written the old way, and there are subtle differences to trip over. Web developers are stuck in a bit of limbo, as other browsers haven’t caught up yet. In light of that, we should probably dive right into two immediate breaking spec changes you might have run into in Firefox 59:

  • Remote tracks are now muted and temporarily removed from their stream(s), rather than ended, in response to direction changes (e.g. from the poorly named removeTrack).
  • track.ids at the two end-points rarely correlate anymore.

Importantly, this affects you whether you use transceivers or not. addTrack (even addStream) is just a tweaked version of addTransceiver these days. This blog will look at how to handle this—tying up loose ends from last year’s blog post—and demonstrate how Firefox works.

The short answer to “Why no ended?” is that incoming tracks now correspond to sender objects on the other side, which may resume in response to transceiver.direction changes.

The short answer to “Why no matching track IDs?” is that incoming (receiver) tracks now usually exist ahead of connecting, so their immutable ids cannot match up with the other side’s (and were never guaranteed to be unique anyway).

As you might notice, this boils down to changes to the lifetime of remote tracks. Gone is misleading symmetry between local and remote tracks, or the streams they belong to for that matter. That symmetry looked pretty, but got in the way of fully controlling tracks remotely.

Remote-controlled tracks.

A more useful analogy is that of transceiver.sender as a remote-control of the transceiver.receiver.track held by the other peer. Their lifetimes match that of the transceiver itself. Here’s how this remote-control works:

When we do this…                            …the other side sees this
pc.addTransceiver(trackA, {streams})*       (stream.onaddtrack only if stream already exists there)
                                            pc.ontrack with new transceiver and streams
  …then once media flows                    transceiver.receiver.track.onunmute
transceiver.direction = "recvonly"*         transceiver.receiver.track.onmute
transceiver.direction = "sendrecv"*         stream.onaddtrack
                                            pc.ontrack with existing transceiver and streams
  …then once media flows                    transceiver.receiver.track.onunmute
transceiver.sender.track.enabled = false    Media seamlessly goes black/silent
transceiver.sender.track.enabled = true     Media seamlessly resumes
transceiver.sender.replaceTrack(trackB)     Media seamlessly changes
transceiver.sender.replaceTrack(null)       Media seamlessly halts
Network (RTP SSRC) timeout                  transceiver.receiver.track.onmute
transceiver.stop()*                         transceiver.receiver.track.onended

* = after renegotiation through pc.onnegotiationneeded completes.
(Note that many of these transitions are state-based and only fire events if the state ends up changing.)

You can try out this remote-control below in Firefox (59):
In the “Result” tab, grant permission, then click the buttons in sequence from top to bottom, to see the video update.

Then try the sequence again from the top. The sequence is repeatable, because we use the newest transceiver returned from addTransceiver each go-around (they accumulate).

Note that we get stream.onaddtrack from addTransceiver only the second time around, since we had no opportunity to add listeners to the remote stream the first time.
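The demo itself isn’t reproduced here, but the button sequence boils down to something like the following sketch. The names `pc1` and `pc2` are assumptions: two RTCPeerConnections wired back-to-back on the same page, with renegotiation handled elsewhere via `pc.onnegotiationneeded`.

```javascript
// A sketch of the demo's button sequence (pc1/pc2 and the renegotiation
// plumbing are assumed, not shown in this post).
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [track] = stream.getVideoTracks();

// Button 1: send the track. pc2.ontrack fires with a new transceiver and
// stream right away; track.onunmute fires once media actually flows.
let transceiver = pc1.addTransceiver(track, {streams: [stream]});

// Button 2: pause sending. After renegotiation, the remote track mutes and
// is temporarily removed from its stream (it does NOT end anymore).
transceiver.direction = "recvonly";

// Button 3: resume. After renegotiation, stream.onaddtrack and a second
// pc2.ontrack fire with the SAME transceiver and track, then onunmute.
transceiver.direction = "sendrecv";

// Button 4: stop for good. The remote track fires onended.
transceiver.stop();

// Repeating the sequence creates a fresh transceiver each go-around;
// the stopped ones accumulate in pc1.getTransceivers().
transceiver = pc1.addTransceiver(track, {streams: [stream]});
```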

Interestingly, the video element, here representing the remote side, will play along with this, showing the latest video, provided we always either:

  1. stop() the previous transceiver, or
  2. set its direction to "recvonly".

Stopping works because video elements ignore ended tracks. Changing direction works because the temporarily-muted tracks get removed from the stream the video element is playing. Try clicking the buttons in different order to prove this out.

Now this may look like a lot of ways to accomplish the same thing. The differences may not be appreciable in a small demo, but each control has trade-offs:

  • stop() terminates both directions (in this example we were only sending one way). Also, stopped transceivers stick around in pc.getTransceivers(), at least locally, and litter the remote stream with ended tracks (though the Offer/Answer m-line may get repurposed in subsequent negotiations).
  • direction-changes reuse the same transceiver and track without waste, but still require re-negotiation.
  • replaceTrack(null) is instant, requiring no negotiation at all, but stops sending without informing the other party. This may be indistinguishable from a network issue if the other side is looking at stats.
  • track.enabled = false never completely halts network traffic, instead sending one black frame per second, or silence for audio. This is the only control that lets browsers know the camera/microphone is no longer in use.

For the above reasons, the spec encourages implementing “hold” functionality using both direction and replaceTrack(null) in combination.

Don’t forget the camera!

In addition to the spec’s recommended “hold” solution, consider setting track.enabled = false at the same time. If you do, Firefox 60 will turn off the user’s camera and hardware indicator light, for less paranoid face-muting. This is a spec-supported feature Chrome does not have yet, and is the subject of my next blog.
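Putting the last few paragraphs together, a “hold” toggle might look like the sketch below. This is my own composition of the three controls discussed above, not code from the spec; `transceiver` is assumed to be sending a camera track, with renegotiation handled elsewhere via `pc.onnegotiationneeded`.

```javascript
// Sketch of spec-recommended "hold" (direction + replaceTrack), plus
// track.enabled = false so Firefox 60+ can turn off the camera light.
let heldTrack = null;

async function setHold(transceiver, on) {
  if (on) {
    heldTrack = transceiver.sender.track;
    transceiver.direction = "recvonly";           // inform the peer (renegotiates)
    await transceiver.sender.replaceTrack(null);  // halt media instantly
    heldTrack.enabled = false;                    // let the browser release the camera
  } else {
    heldTrack.enabled = true;
    await transceiver.sender.replaceTrack(heldTrack);
    transceiver.direction = "sendrecv";           // renegotiates again
    heldTrack = null;
  }
}
```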

Correlate tracks by transceiver.mid or order

Last year’s blog explained how using track ids out-of-band to correlate remote tracks would no longer work. It recommended using transceiver.mid instead for this, but, sans implementation, left out a working example.

Here’s an example that correlates tracks regardless of arrival order to always put the camera track on the left:
In the “Result” tab, check the boxes in any order; the camera always appears on the left, the other one on the right.

The trick here in ontrack is using camTransceiver.mid to pick between the left or right video element. This is the mid from the other side. In the real world, we’d send this ID over a data channel or something, but you get the idea. Since we establish the connection ahead of sending anything here, the mid is available in time to do that.
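In sketch form (the variable names `camTransceiver`, `leftVideo`, and `rightVideo` are assumptions standing in for the demo’s), the trick is roughly:

```javascript
// Sketch: correlate remote tracks by mid, regardless of arrival order.
// Both ends live on the same page in the demo, so we can read the
// offerer's mid directly; in the real world it would arrive out-of-band.
const camStream = await navigator.mediaDevices.getUserMedia({video: true});
const camTransceiver = pc1.addTransceiver(camStream.getVideoTracks()[0]);
const otherTransceiver = pc1.addTransceiver(otherTrack); // some second track

pc2.ontrack = ({transceiver, track}) => {
  // mid is set by now, since the connection was negotiated beforehand.
  const video = transceiver.mid == camTransceiver.mid ? leftVideo : rightVideo;
  video.srcObject = new MediaStream([track]);
};
```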

But what if we needed to correlate on initial connection? How would the ID get there in time in the real world? The IDs are in the SDP, but which one is which?

Well-defined transceiver order.

Something I overlooked last year is that transceiver.mid is null before setLocalDescription. We avoided that problem above by establishing the connection ahead of sending anything, but this makes mid useless for correlating in the initial negotiation.

The good news here since last year is that pc.getTransceivers() order is now well-defined! Transceivers always appear in creation-order, whether introduced by you with addTransceiver, or in m-line order from setRemoteDescription. That m-line order matches the other side’s transceiver order, at least initially.

With some care, this means we can correlate tracks using transceiver order from the initial offer itself. Here’s an example—without check-boxes this time—that does that. We’re also introducing a microphone into the picture:
In the “Result” tab, you’ll see the camera on the left again. You can mute the audio in that video element as well.

This time in pc2.ontrack, we don’t cheat by looking at the other side’s transceivers. We only look at our own pc2.getTransceivers() which is guaranteed to be in the same order here.

Specifically, we look if this is the third transceiver (pc2.getTransceivers()[2]), and if so, put it on the right, otherwise left. We also use the streams argument to intuitively group the camera and microphone tracks together. Since the third track didn’t have a stream in this case, we could have keyed off of that difference instead. There may be several ways to correlate at times: by transceiver order, by stream, or by out-of-band mid.

If you’re wondering how we can access the third transceiver already in the first ontrack, the API guarantees that setRemoteDescription is effectively done by the time the track events all fire. All transceivers are there; all tracks are in their streams.
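A sketch of that ontrack logic (names like `leftVideo`/`rightVideo` are assumptions; the offerer’s creation order is what pc2 relies on):

```javascript
// Offerer: creation order defines the correlation.
const stream = await navigator.mediaDevices.getUserMedia({video: true, audio: true});
pc1.addTransceiver(stream.getVideoTracks()[0], {streams: [stream]}); // [0] camera
pc1.addTransceiver(stream.getAudioTracks()[0], {streams: [stream]}); // [1] microphone
pc1.addTransceiver(otherVideoTrack);                                 // [2] stream-less

// Answerer: setRemoteDescription has effectively completed before any
// track event fires, so all three transceivers are already present.
pc2.ontrack = ({transceiver, track, streams: [stream]}) => {
  const isThird = transceiver == pc2.getTransceivers()[2];
  const video = isThird ? rightVideo : leftVideo;
  // The camera and microphone share a stream; the third track has none.
  video.srcObject = stream || new MediaStream([track]);
};
```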

A couple of things to watch out for if you’re going to rely on transceiver order:

  1. Unlike getTransceivers(), the MediaCapture spec’s stream.getTracks() does not guarantee order across browsers! Therefore, avoid for-looping over it when adding tracks to a peer connection if you want deterministic order.
  2. Once you stop() a transceiver, it remains in getTransceivers() locally, but m-line reuse may cause the other side to get out of lockstep with indexes once more transceivers are added from this point on.
  3. Be careful about accidentally adding transceivers on the answering side during negotiation. Unlike addTrack(), addTransceiver() always creates a new transceiver, never reusing existing ones with available m-lines.

The second point pretty much limits the usefulness of this correlation-technique to initial offers. The third point is the final topic of this post: using transceivers on the answering-side.

Why use transceivers at all?

In case you feel “It’s too complicated! Bring back addStream()!”, it may be worth reviewing that API’s shortcomings.

Negotiation in WebRTC is inherently asymmetric. The now-deprecated 2014 addStream() API was a largely symmetric abstraction. It worked well for one video track and one audio track. Mapping to SDP was trivial: One bi-directional m-line for video, another for audio, and we were good.

But add a fifth track, and we’re at an impasse: We either surrender control over how things get paired to go over the wire, or we need an API that reflects how things go over the wire. Luckily, we don’t have to choose: Make browsers build the missing API, and shim addStream() on top of that if you want, or use addTrack() with abandon.

In other words, feel free to ignore transceivers if you don’t care how your media gets from A to B. On the other hand, if you dislike leaky abstractions, or you’re curious how to send 3 tracks in both directions using only 3 m-lines total, then read on.

How to answer with transceivers.

So far, we’ve only been sending in one direction. Let’s send the 3 tracks from earlier in both directions this time. The classic way to do this on the answering side is with addTrack(). Perhaps surprisingly, this is still the best option, and currently the only option unless you’re OK with tracks being stream-less. More on this later. Update: This has been fixed in the spec, but not Firefox yet.

Using addTrack() only uses 3 transceivers, because addTrack() automatically attaches to any existing unused ("recvonly") transceivers—in transceiver order—before creating new ones. This is a bit magic.

On the other hand, calling addTransceiver() 3 times is straightforward, but would give us 6 m-lines total.

To make do with only 3 m-lines, the answerer must effectively modify the 3 transceivers created by setRemoteDescription from the offer, instead of adding 3 of its own. Think of it as the offerer setting up the transceivers, and the answerer plugging into them.

Make sure you’ve stopped the previous example before running this one.
In the “Result” tab, you’ll see 6 remote tracks, 3 each way, over 3 m-lines total. Mute audio in both video elements.

We use addTransceiver() on one end, and addTrack() on the other to re-use m-lines, relying on their order.
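In sketch form (track and stream names are assumptions), the asymmetric pattern looks like this:

```javascript
// Offerer: create the 3 m-lines explicitly.
for (const track of [camTrack1, micTrack1, extraTrack1]) {
  pc1.addTransceiver(track, {streams: [stream1]});
}

// Answerer, after pc2.setRemoteDescription(offer): addTrack() attaches to
// the 3 "recvonly" transceivers setRemoteDescription created—matched by
// kind, in transceiver order—rather than creating new m-lines.
for (const track of [camTrack2, micTrack2, extraTrack2]) {
  pc2.addTrack(track, stream2);
}
// Result: pc2.getTransceivers().length is still 3, each now sending too.
```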

How to answer ONLY with transceivers.

addTrack() looks for unused transceivers to usurp, based on order and kind. This reliance on order may not always be practical. E.g. when using out-of-band mid, it’s more natural to want to modify the transceiver directly. Here the specification comes up a bit short unfortunately.

Let’s see how we’d answer the 3 transceivers without relying on their order, and then discuss what works and what doesn’t (again, make sure you’ve stopped the previous example first):

Rather than resort to addTrack(), we explicitly modify each transceiver on the answering end:

  1. We change its transceiver.direction from "recvonly" to "sendrecv", and
  2. we add our track using transceiver.sender.replaceTrack().

Unfortunately, this API offers no way to associate streams with these tracks, so our tracks end up being stream-less. Our ontrack code becomes more complicated as a result, since the camera and microphone tracks no longer come grouped into a stream. But at least it works.
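The two steps above can be sketched as follows (track names are assumptions; run this on the answering end after setRemoteDescription, before creating the answer):

```javascript
// Answer by modifying the offerer's transceivers directly—no addTrack(),
// no reliance on order beyond pairing each transceiver with a track here.
const [camT, micT, extraT] = pc2.getTransceivers(); // or look up by mid

for (const [transceiver, track] of [[camT, camTrack2],
                                    [micT, micTrack2],
                                    [extraT, extraTrack2]]) {
  transceiver.direction = "sendrecv";           // answer "recvonly" with send
  await transceiver.sender.replaceTrack(track); // note: no stream association
}
```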

Extending the API to provide a way to add stream associations in this situation, seems reasonable. I’ve filed an issue on the spec about this. Update: This has been fixed with the sender.setStreams() API.

In closing

We’ve found people generally don’t care how media is organized over the wire, until they do. The tipping point is usually some combination of needing to do something more complicated, trying to correlate media to some underlying network metric, or explain some anomaly not gleaned from simpler API abstractions. Hopefully this API gives some insight into how WebRTC actually works, giving you options should you need it.