Simulcast-Aware Forwarding

A Selective Forwarding Unit earns its name by never re-encoding media: it receives one set of RTP streams from each publisher and forwards a subset to each subscriber untouched. Simulcast turns that forwarding decision into the core of the product. When a publisher sends three independently encoded resolutions of the same camera (a high, a medium, and a low spatial layer, each its own RTP stream with its own SSRC), the server must choose, per subscriber, which one of those streams to relay, and switch between them as that subscriber’s downlink changes. This page belongs to the Media Server Architecture: SFU & MCU guide and covers the exact mechanics of that forwarding path: reading the RID/MID that identifies each layer, mapping a subscriber’s estimated bitrate to a layer, switching cleanly on keyframe boundaries, and rewriting the RTP headers so the receiver never notices the source changed.

The goal is concrete: build a forwarder that can promote a subscriber from the low layer to the high layer when their bandwidth recovers, and demote them again when it drops, with no decoder corruption, no frozen frame, and a switch latency bounded by one keyframe interval. Everything below assumes you have already terminated DTLS-SRTP, demultiplexed RTP, and parsed RTCP feedback — the forwarding logic sits one layer above that.

The publisher emits three independent simulcast streams; the SFU selector forwards exactly one to each subscriber and rewrites RTP headers so each switch is invisible to the decoder.

Step 1 — Read RID and MID to identify the incoming layers

Simulcast layers arrive as separate RTP streams that all belong to the same logical track. The browser tags each one with an RTP Stream Identifier (RID) carried in the urn:ietf:params:rtp-hdrext:sdes:rtp-stream-id header extension, and associates them with a media section through the MID extension (urn:ietf:params:rtp-hdrext:sdes:mid). The publisher’s SDP declares which RIDs exist and in what order via a=simulcast:send and a=rid: lines; your job is to parse those and then match each inbound packet to a layer by reading its RID extension.

Chrome sends the RID only on the first few packets of each stream and on every keyframe, not on every packet — so cache the SSRC-to-RID binding the moment you first observe it, and never assume a later packet will re-advertise the RID. Once you know ssrc → rid, and rid → layer index from the SDP, you have the mapping the selector needs.

// Parse the publisher's simulcast layers from the offer SDP, then bind SSRCs at runtime.
// rid order in `a=simulcast:send h;m;l` is high→low by convention but NOT guaranteed:
// resolve the real spatial size from each a=rid line's max-width/max-height if present.
function parseSimulcastLayers(sdpMediaSection) {
  const rids = [];
  for (const line of sdpMediaSection.split(/\r?\n/)) {
    const m = line.match(/^a=rid:(\S+) send/); // a=rid:h send max-width=1280;max-height=720
    if (m) rids.push(m[1]);
  }
  return rids; // e.g. ['h', 'm', 'l']; index 0 is the top layer only if declared high→low
}

const ssrcToRid = new Map();   // learned from the RID header extension on early/keyframe packets
const ridToIndex = new Map();  // 'h'->0, 'm'->1, 'l'->2 from parseSimulcastLayers order

// Fill ridToIndex once the publisher's offer has been parsed.
// (registerLayers is an illustrative helper name, not an existing API.)
function registerLayers(sdpMediaSection) {
  parseSimulcastLayers(sdpMediaSection).forEach((rid, index) => ridToIndex.set(rid, index));
}

function onRtpPacket(pkt) {
  const rid = pkt.getHeaderExtension('rtp-stream-id'); // present early + on keyframes only
  if (rid && !ssrcToRid.has(pkt.ssrc)) {
    ssrcToRid.set(pkt.ssrc, rid); // bind once; later packets omit the RID
  }
  const layerIndex = ridToIndex.get(ssrcToRid.get(pkt.ssrc));
  if (layerIndex === undefined) return; // SSRC not bound yet, or RID not declared in the SDP
  routePacket(pkt, layerIndex); // hand to the per-subscriber selector
}
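
The comment above warns that declared RID order is only a convention. A minimal sketch of resolving the real order from the a=rid restrictions follows, assuming the publisher actually includes max-width and max-height (many offers omit them, in which case the sketch falls back to the declared order). The name orderLayersBySize is hypothetical.

```javascript
// Sketch: order simulcast RIDs by declared pixel area, largest first.
// Assumes a=rid lines of the form "a=rid:<id> send max-width=W;max-height=H".
// orderLayersBySize is a hypothetical helper, not part of any library.
function orderLayersBySize(sdpMediaSection) {
  const layers = [];
  for (const line of sdpMediaSection.split(/\r?\n/)) {
    const m = line.match(/^a=rid:(\S+) send(?: (.*))?$/);
    if (!m) continue;
    const params = {};
    for (const kv of (m[2] || '').split(';')) {
      const [key, value] = kv.split('=');
      if (key && value !== undefined) params[key.trim()] = Number(value);
    }
    const w = params['max-width'];
    const h = params['max-height'];
    const area = Number.isFinite(w) && Number.isFinite(h) ? w * h : null;
    layers.push({ rid: m[1], area, declaredIndex: layers.length });
  }
  // If any layer lacks a size, trust the declared order rather than guess.
  if (layers.some((l) => l.area === null)) return layers.map((l) => l.rid);
  return layers
    .sort((a, b) => b.area - a.area || a.declaredIndex - b.declaredIndex)
    .map((l) => l.rid);
}
```

The returned array can feed ridToIndex in place of the raw declaration order, so index 0 is the largest layer whenever sizes are declared.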

Step 2 — Identify spatial and temporal layers within each stream

Simulcast gives you spatial layers — distinct resolutions, each its own RTP stream. Inside each of those streams the encoder usually also produces temporal layers: a base layer at, say, 7.5 fps and additional frames that lift it to 15 and 30 fps, all in one RTP stream and distinguishable only by reading the codec payload. Temporal scalability is what lets you drop a subscriber’s frame rate without a full layer switch, shedding 30–50% of a stream’s bitrate by forwarding only the lower temporal IDs.

How you read the temporal ID depends entirely on the codec. VP8 exposes a TID field plus a picture-id in its payload descriptor; AV1 carries a Dependency Descriptor header extension that encodes the full spatial/temporal dependency graph, which is the same structure the Configuring AV1 SVC Layers in WebRTC workflow relies on. Mishandling this distinction is the single most common source of decoder corruption, so the forwarder must know, per codec, exactly which bytes carry the temporal ID and which frames are safe to drop.

// Extract the temporal ID per codec. Dropping all frames above a chosen TID is safe;
// dropping a base-layer (TID 0) frame breaks every frame that depends on it.
function temporalId(pkt, codec) {
  if (codec === 'VP8') {
    // VP8 payload descriptor: the T bit in the extension byte signals that a TID follows
    const d = pkt.payloadDescriptor;
    return d.hasTID ? d.tid : 0;        // 0 = base layer, must always be forwarded
  }
  if (codec === 'AV1') {
    // AV1: the temporal ID comes from the Dependency Descriptor extension
    // (assumed here to be parsed already and exposed as temporalId)
    return pkt.dependencyDescriptor.temporalId;
  }
  return 0; // H.264 simulcast here is treated as spatial-only (no temporal shaping)
}
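
The payloadDescriptor object used above has to come from somewhere. As a rough sketch of which bytes carry the picture-id and TID for VP8, the parser below follows the descriptor layout in RFC 7741. The function name and the shape of the returned object are assumptions for illustration, and only the overall length is checked.

```javascript
// Sketch of a VP8 payload descriptor parser following the RFC 7741 layout.
// `payload` is the RTP payload as a Uint8Array (or Buffer).
// parseVp8Descriptor and its return shape are hypothetical.
function parseVp8Descriptor(payload) {
  let i = 0;
  const first = payload[i++];
  const d = {
    startOfPartition: (first & 0x10) !== 0, // S bit
    partitionId: first & 0x07,              // PID
    pictureId: null,
    pictureIdBits: 0,
    tl0PicIdx: null,
    hasTID: false,
    tid: 0,
    layerSync: false,
    isKeyframe: false,
    headerLength: 0,
  };
  if (first & 0x80) {                        // X bit: extension byte follows
    const ext = payload[i++];
    if (ext & 0x80) {                        // I bit: PictureID present
      const b = payload[i++];
      if (b & 0x80) {                        // M bit: 15-bit PictureID
        d.pictureId = ((b & 0x7f) << 8) | payload[i++];
        d.pictureIdBits = 15;
      } else {
        d.pictureId = b & 0x7f;
        d.pictureIdBits = 7;
      }
    }
    if (ext & 0x40) d.tl0PicIdx = payload[i++]; // L bit
    if (ext & 0x30) {                        // T or K bit: TID/Y/KEYIDX byte present
      const b = payload[i++];
      if (ext & 0x20) {                      // T bit: the TID field is meaningful
        d.hasTID = true;
        d.tid = b >> 6;
        d.layerSync = (b & 0x20) !== 0;      // Y bit
      }
    }
  }
  if (i > payload.length) return null;       // truncated packet
  d.headerLength = i;
  // The first packet of a frame (S=1, PID=0) starts with the VP8 payload header;
  // its least significant bit (P) is 0 for a keyframe.
  d.isKeyframe = d.startOfPartition && d.partitionId === 0 &&
    i < payload.length && (payload[i] & 0x01) === 0;
  return d;
}
```

Recording pictureIdBits lets a rewriter mask with 0x7f or 0x7fff as appropriate instead of assuming one width.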

Step 3 — Map subscriber bitrate to a layer and switch on a keyframe

Each subscriber has an estimated downlink bitrate from REMB or transport-wide congestion control feedback — the same estimate produced by the bandwidth estimation pipeline. The selector maps that estimate onto a target layer using each layer’s measured send bitrate plus a safety margin, then commits to the switch only when a usable decoder-refresh point arrives. The full threshold table, debounce timing, and keyframe-request logic are worked out in Forwarding Simulcast Layers by Subscriber Bandwidth; the broader policy that balances every subscriber against the publisher’s available layers lives in Bandwidth-Aware Layer Selection in an SFU.
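
As a rough illustration of that mapping, the sketch below picks the highest layer whose measured bitrate, padded by a safety margin, fits the estimate. The function name, the 1.2 margin, and the sample bitrates in the usage note are assumptions; debounce and hysteresis are deliberately left out.

```javascript
// Sketch: choose the highest layer whose measured bitrate, padded by a safety
// margin, fits the subscriber's estimate. Index 0 is the top layer.
// pickLayer and the default margin of 1.2 are assumptions for illustration.
function pickLayer(estimateBps, measuredLayerBps, margin = 1.2) {
  let fallback = -1;
  for (let i = 0; i < measuredLayerBps.length; i++) {
    const bps = measuredLayerBps[i];
    if (!(bps > 0)) continue;              // layer not arriving: never select it
    fallback = i;                          // ends up as the lowest live layer
    if (bps * margin <= estimateBps) return i;
  }
  return fallback;                         // nothing fits: lowest live layer, or -1 if none
}
```

With measured bitrates of [1700000, 500000, 150000], an estimate of 1 Mbps selects index 1 (the medium layer), because the high layer needs about 2.04 Mbps once the margin is applied.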

The non-negotiable rule: you may only begin forwarding a new spatial layer starting at a keyframe of that layer. Inter-coded frames reference earlier frames of the same stream; splice a P-frame from the high layer onto a decoder that was watching the low layer and you get a green-block smear or a hard freeze until the next keyframe. When you decide to upswitch, send an RTCP Picture Loss Indication (PLI) or Full Intra Request (FIR) to the publisher for the target layer’s SSRC, keep forwarding the old layer until the requested keyframe arrives, and only then cut over. Downswitching to a lower layer that is already flowing can often happen on its next existing keyframe without a request, since lower layers are cheaper for the publisher to refresh frequently.

// Per-subscriber forwarder. Switches are pending until a keyframe of the target layer lands.
class SubscriberForwarder {
  constructor(publisher, sendPli, outSsrc) {
    this.publisher = publisher;
    this.sendPli = sendPli;        // (ssrc) => emit RTCP PLI/FIR upstream
    this.currentLayer = 2;         // start conservative on the low layer
    this.pendingLayer = null;
    // Header-translation state used by rewriteAndSend and rebaseOnSwitch.
    // outSsrc is an assumed extra argument: the single SSRC this subscriber sees.
    this.translation = { outSsrc, seqOffset: 0, lastSeq: 0, picIdOffset: 0, lastPicId: 0 };
  }

  requestLayer(target) {
    if (target === this.currentLayer) {
      this.pendingLayer = null;    // cancel a pending switch that is no longer wanted
      return;
    }
    if (target === this.pendingLayer) return;
    this.pendingLayer = target;
    // ask the publisher for a fresh keyframe on the layer we want to switch into
    this.sendPli(this.publisher.ssrcForLayer(target));
  }

  forward(pkt, layerIndex, isKeyframe) {
    if (this.pendingLayer !== null && layerIndex === this.pendingLayer && isKeyframe) {
      this.currentLayer = this.pendingLayer; // commit exactly on the keyframe boundary
      this.pendingLayer = null;
      this.rebaseOnSwitch(pkt);              // continue seq/picture-id from the old layer
    }
    if (layerIndex !== this.currentLayer) return; // drop every other layer's packets
    this.rewriteAndSend(pkt);
  }
}
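
The forwarder above sends a single PLI per requested switch. If that request or the resulting keyframe is lost, the switch stalls, so some retry is needed, but it must be throttled to avoid a PLI storm. A minimal sketch of that throttle follows; the class name, the 500 ms retry interval, and the injected clock are assumptions chosen for illustration.

```javascript
// Sketch: send one PLI per pending switch, then retry only after a timeout.
// KeyframeRequester, retryMs and the injected clock are hypothetical.
class KeyframeRequester {
  constructor(sendPli, retryMs = 500, now = () => Date.now()) {
    this.sendPli = sendPli;
    this.retryMs = retryMs;
    this.now = now;
    this.lastSent = new Map(); // ssrc -> time the last PLI was sent
  }

  // Safe to call on every packet while a switch is pending:
  // it emits at most one PLI per retryMs for a given SSRC.
  request(ssrc) {
    const t = this.now();
    const last = this.lastSent.get(ssrc);
    if (last !== undefined && t - last < this.retryMs) return false;
    this.lastSent.set(ssrc, t);
    this.sendPli(ssrc);
    return true;
  }

  // Call when the keyframe arrives so the next switch can request immediately.
  onKeyframe(ssrc) {
    this.lastSent.delete(ssrc);
  }
}
```

Passing the clock in as a function keeps the throttle testable without real timers.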

Step 4 — Rewrite RTP SSRC, sequence number, and picture-id, then verify

From the point of view of the subscriber’s decoder there is a single continuous RTP stream. But behind the selector you are splicing packets from streams that each have their own SSRC, their own monotonically increasing sequence numbers, and their own timestamp and picture-id baselines. Forward them raw and the receiver sees the SSRC change (it tears down and rebuilds the stream), sees a sequence-number discontinuity (it reports massive packet loss), and sees the picture-id jump (VP8 reference picture selection breaks). The forwarder must therefore present one outgoing SSRC and rewrite every header field to a continuous, gap-free sequence.

Maintain per-subscriber offsets: a fixed output SSRC, a running sequence-number translation that closes the gap left by every dropped packet, and a picture-id translation for VP8. At each switch you snapshot the last values you emitted and rebase the new layer onto them. Verify the result by pulling the subscriber’s inbound-rtp stats: framesDecoded should keep climbing across a switch and freezeCount should not increment. On the publisher side, pliCount should rise by exactly one per upswitch, not a storm.
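
The verification step can be sketched as a comparison of two stats snapshots taken before and after a switch. The field names framesDecoded and freezeCount come from the WebRTC statistics API; checkSwitch and inboundVideoStats are hypothetical helpers, and the second one only runs in a browser with a live RTCPeerConnection.

```javascript
// Sketch: compare two snapshots of a subscriber's inbound-rtp video stats taken
// before and after a layer switch. checkSwitch is a hypothetical helper.
function checkSwitch(before, after) {
  const problems = [];
  if (!(after.framesDecoded > before.framesDecoded)) {
    problems.push('framesDecoded did not advance');
  }
  if ((after.freezeCount || 0) > (before.freezeCount || 0)) {
    problems.push('freezeCount incremented');
  }
  return problems; // empty array means the switch looked clean
}

// In a browser, a snapshot could be gathered like this (pc is an RTCPeerConnection).
async function inboundVideoStats(pc) {
  const report = await pc.getStats();
  for (const stat of report.values()) {
    if (stat.type === 'inbound-rtp' && stat.kind === 'video') return stat;
  }
  return null;
}
```

Take one snapshot just before calling requestLayer and another a second or two after the cutover, then treat a non-empty result as a failed switch.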

// Rewrite headers so the splice is invisible. One output SSRC per subscriber.
// Both functions are SubscriberForwarder methods, attached here so the excerpt parses on its own.
SubscriberForwarder.prototype.rewriteAndSend = function (pkt) {
  const t = this.translation;                 // { outSsrc, seqOffset, lastSeq, picIdOffset, lastPicId }
  pkt.ssrc = t.outSsrc;                        // collapse N source SSRCs into one
  pkt.sequenceNumber = (pkt.sequenceNumber + t.seqOffset) & 0xffff; // 16-bit wrap
  t.lastSeq = pkt.sequenceNumber;

  if (pkt.codec === 'VP8') {
    // rebase VP8 picture-id so reference selection stays monotonic across a layer switch
    const d = pkt.payloadDescriptor;
    d.pictureId = (d.pictureId + t.picIdOffset) & 0x7fff; // assumes the 15-bit form (M bit set)
    t.lastPicId = d.pictureId;                 // remembered for rebaseOnSwitch
  }
  this.send(pkt);
};

// On commit, recompute offsets so the NEW layer continues from the last emitted values.
SubscriberForwarder.prototype.rebaseOnSwitch = function (firstPktOfNewLayer) {
  const t = this.translation;
  t.seqOffset = (t.lastSeq + 1 - firstPktOfNewLayer.sequenceNumber) & 0xffff;
  if (firstPktOfNewLayer.codec === 'VP8') {   // only VP8 carries a picture-id to rebase
    t.picIdOffset = (t.lastPicId + 1 - firstPktOfNewLayer.payloadDescriptor.pictureId) & 0x7fff;
  }
};
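
The offset above stays fixed between switches, so it does not yet close the gap left when the forwarder deliberately drops a packet, such as a high temporal layer frame. A minimal sketch of a translation that handles both drops and switches is shown below. SeqRewriter is a hypothetical name, and the sketch assumes packets of a layer arrive in order; reordering and retransmission need extra bookkeeping that is left out.

```javascript
// Sketch: 16-bit sequence-number translation that stays gap-free across
// intentional drops and layer switches. SeqRewriter is a hypothetical name.
// Assumes in-order arrival within a layer.
class SeqRewriter {
  constructor() {
    this.offset = 0;      // added to the source sequence number, mod 2^16
    this.lastOut = null;  // last sequence number emitted to the subscriber
  }

  // Packet is forwarded: translate it and remember the output value.
  forward(srcSeq) {
    this.lastOut = (srcSeq + this.offset) & 0xffff;
    return this.lastOut;
  }

  // Packet is deliberately dropped: shift later packets down by one.
  drop() {
    this.offset = (this.offset - 1) & 0xffff;
  }

  // First packet of a new layer: continue right after the last emitted value.
  rebase(firstSrcSeq) {
    if (this.lastOut === null) return;
    this.offset = (this.lastOut + 1 - firstSrcSeq) & 0xffff;
  }
}
```

Packets lost upstream are not passed to drop(), so their gap survives in the output and the subscriber can still NACK them.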

Edge Cases & Browser Quirks

VP8 picture-id width. VP8’s picture-id is either 7-bit or 15-bit depending on the M bit in the payload descriptor; Chrome emits the 15-bit form, but a naive parser that assumes 7 bits will mis-rebase on every switch and corrupt reference selection. Always read the M bit before masking.
AV1 dependency descriptor vs VP8 picture-id. AV1 does not use a picture-id at all — its frame relationships live in the Dependency Descriptor header extension, a template-based structure. A forwarder that reuses VP8 rewriting logic for AV1 will drop frames the decoder needed; AV1 requires rewriting the frame_number inside the descriptor instead, and respecting its declared decode targets.
Chrome RID advertisement timing. Chrome (since roughly M90) sends the RID extension only on the initial packets and on keyframes. If your SSRC binding logic waits for a steady-state packet, it will never bind the low layer on a quiet camera. Bind on the very first packet that carries the extension.
Safari simulcast support. Safari only gained reliable send-side simulcast for VP8/H.264 in recent versions and historically advertised RIDs it would not actually encode under thermal pressure, silently collapsing to one layer. Detect a missing layer at runtime (no packets on its SSRC for >1 s) and fall the selector back to the layers actually arriving rather than forwarding a dead SSRC.
Firefox temporal layers. Firefox’s VP8 simulcast historically produced fewer temporal layers than Chrome for the same encoding request, so a temporal-ID drop policy tuned on Chrome can over-shed frames on Firefox. Read the actual TIDs present rather than assuming three temporal layers exist.
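
The runtime detection suggested for the Safari case can be sketched as a per-layer liveness tracker. The class name, the 1000 ms timeout (taken from the ">1 s" rule of thumb above), and the injected clock are assumptions for illustration.

```javascript
// Sketch: track when each layer last delivered a packet and find a live fallback.
// LayerLiveness, the default timeout and the injected clock are hypothetical.
class LayerLiveness {
  constructor(layerCount, timeoutMs = 1000, now = () => Date.now()) {
    this.lastSeen = new Array(layerCount).fill(null);
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  onPacket(layerIndex) {
    this.lastSeen[layerIndex] = this.now();
  }

  isAlive(layerIndex) {
    const t = this.lastSeen[layerIndex];
    return t !== null && this.now() - t <= this.timeoutMs;
  }

  // Nearest live layer to the target, preferring lower quality (higher index) on ties.
  nearestAlive(target) {
    for (let step = 0; step < this.lastSeen.length; step++) {
      if (target + step < this.lastSeen.length && this.isAlive(target + step)) return target + step;
      if (target - step >= 0 && this.isAlive(target - step)) return target - step;
    }
    return -1; // no layer is currently arriving
  }
}
```

Running the selector's chosen layer through nearestAlive before requesting a switch keeps it from waiting on a keyframe that a collapsed layer will never send.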

Common Implementation Mistakes

Switching mid-GOP. Cutting to a new spatial layer on a P-frame instead of waiting for its keyframe is the classic green-smear bug. Always hold the pending switch until the target layer’s keyframe arrives.
Forwarding the RID-bearing packets but not caching the binding. If you only read RID per-packet you lose the layer identity the instant Chrome stops sending it. Cache ssrc → rid on first sight.
Leaving SSRC unrewritten. Forwarding the source SSRC straight through makes the subscriber tear down and rebuild its receiver on every switch, adding a visible freeze. Collapse to one output SSRC.
PLI storms on upswitch. Re-requesting a keyframe every packet while waiting for one inflates upstream bitrate and can knock the publisher’s own encoder into a degraded state. Send one PLI, then wait with a timeout before retrying.
Dropping temporal base-layer frames. Shedding TID 0 frames to save bandwidth breaks every dependent frame. Drop the highest temporal IDs first, and never drop TID 0.
Ignoring the publisher’s actual send bitrate. Mapping a subscriber to the high layer because the SDP declared 1700 kbps, when the publisher is thermally throttled to 600 kbps, forwards a starved stream. Drive the mapping off measured per-layer bitrate.

FAQ

Does simulcast-aware forwarding re-encode video? No. The defining property of this design is that media is forwarded byte-for-byte at the codec level; only RTP header fields and codec-specific descriptors are rewritten. There is no transcode, no pixel work, and therefore no per-stream CPU cost beyond packet rewriting — which is exactly why an SFU scales where an MCU does not. The topology trade-off is detailed in SFU vs MCU Topologies.

How fast can a subscriber switch from the low layer to the high layer? Switch latency is bounded by how quickly the publisher produces a keyframe for the target layer after your PLI. With a sane keyframe-on-request path that is typically one round trip plus encoder latency — tens to low hundreds of milliseconds. You keep forwarding the old layer the entire time, so the subscriber sees continuous video, just at the old quality until the cutover.

Why not just forward all layers and let the client pick? That defeats the purpose: forwarding every layer to every subscriber sends the full aggregate bitrate down each link, which is precisely the congestion simulcast exists to avoid. The server-side selection is what keeps each subscriber’s downlink matched to one layer.

How does this differ from SVC forwarding? With SVC the layers are encoded with dependencies inside a single stream rather than as independent simulcast streams, so the forwarder drops the unwanted enhancement layers of one stream instead of choosing between separate streams. The decision between the two encodings is covered in Choosing Simulcast vs SVC for Large Conferences.

Related: build the per-subscriber threshold logic in Forwarding Simulcast Layers by Subscriber Bandwidth, place it inside the broader Selective Forwarding Unit Design and its Bandwidth-Aware Layer Selection in an SFU policy, and revisit the client-side encoding setup in Simulcast & SVC Implementation.
