Simulcast & SVC Implementation in WebRTC

Multi-stream encoding is how a single publisher satisfies a room full of subscribers on wildly different downlinks — one sends 1.5 Mbps of 720p, another a 150 kbps thumbnail, from the same camera and the same RTCPeerConnection. This guide is part of the Media Handling, Codecs & Bandwidth Estimation guide, and its job is to make simulcast and SVC work end-to-end: defining RID-based encodings, mapping scaleResolutionDownBy across three spatial tiers, switching to a single SVC stream with scalabilityMode, and forwarding the right layer per subscriber from the server. Get the encoder configuration wrong and you ship three identical-resolution streams that melt the CPU; get the keyframe handling wrong and subscribers see green-block corruption every time the server upgrades them.

Simulcast and SVC solve the same problem — one encode, many downstream qualities — with opposite trade-offs. Simulcast runs N parallel encoders and emits N independent RTP streams; the Selective Forwarding Unit just forwards whichever stream matches a subscriber, never touching codec internals. SVC runs one encoder that emits one stream layered so the server can drop frames to downscale. The sections below build both, call out where Chrome, Firefox, and Safari diverge, and link to the server-side forwarding logic that consumes them.

One camera emits three RID-tagged RTP streams; the SFU forwards a different spatial layer to each subscriber based on its measured downlink, never re-encoding.

Step 1 — Declare RID-based simulcast encodings

Simulcast is configured entirely on the sender, before the first offer. You attach the track, read the parameters, replace the encodings array with one entry per quality tier, and write it back. Each entry carries a rid (the RTP stream identifier the SFU keys on), an active flag, a maxBitrate ceiling, and a scaleResolutionDownBy factor. The browser then emits a=simulcast and a=rid lines into the SDP automatically — you never hand-edit them.

The timing is strict: setParameters() must complete before createOffer(). You cannot add, remove, or rename a rid after negotiation; only the per-encoding active, maxBitrate, and scaleResolutionDownBy are mutable at runtime. Pin a codec that supports independent layers first — in Chrome that means VP8, VP9, or AV1, because Chromium maps H.264 onto SVC and caps it at two simulcast layers. The Chrome-specific recipe, including the exact chrome://webrtc-internals SSRC check, is in Simulcast with Three Quality Layers in Chrome.

The rid value is not cosmetic — it is the join key between this sender and the SFU. Whatever string you choose (high/mid/low, or f/h/q as some stacks use) appears verbatim in the a=rid SDP lines and in every outbound-rtp stats report, and the server’s forwarding table is built from exactly those identifiers. Keep them stable across your client and server code; a mismatch means the SFU has streams it cannot route. Note also the encoding order: list the highest-resolution layer first so the encoder treats it as the base and derives the scaled-down layers from it. Reversing the order makes some encoder builds scale up, producing blurry top layers.

const transceiver = pc.addTransceiver(videoTrack, {
  direction: 'sendonly',
  // Order matters: 'high' first so the encoder treats it as the base resolution
  sendEncodings: [
    { rid: 'high', active: true, maxBitrate: 1_500_000, scaleResolutionDownBy: 1.0, maxFramerate: 30 },
    { rid: 'mid',  active: true, maxBitrate:   500_000, scaleResolutionDownBy: 2.0, maxFramerate: 30 },
    { rid: 'low',  active: true, maxBitrate:   150_000, scaleResolutionDownBy: 4.0, maxFramerate: 15 }
  ]
});

// Prefer VP8/VP9/AV1 before the offer — H.264 silently collapses to 2-layer SVC in Chrome
const caps = RTCRtpSender.getCapabilities('video');
const vp8 = caps.codecs.filter(c => /vp8/i.test(c.mimeType));
transceiver.setCodecPreferences([...vp8, ...caps.codecs]);

const offer = await pc.createOffer(); // a=simulcast + a=rid:high/mid/low now emitted automatically
await pc.setLocalDescription(offer);

Step 2 — Map scaleResolutionDownBy across 1/2/4

scaleResolutionDownBy is the single most important field for keeping CPU sane. It divides the capture resolution before encoding, so a 1280×720 capture with factors 1.0 / 2.0 / 4.0 produces 720p, 360p, and 180p streams. Omit it and you get three full-resolution encodes — roughly 3× the encoder load with no quality benefit, the fastest way to exhaust a laptop CPU mid-call. Keep the factors as clean powers of two; fractional ratios like 1.5 force the scaler onto non-aligned dimensions that some hardware encoders reject.

Pair each resolution with a maxBitrate that leaves clear headroom between tiers — a useful rule is that each layer’s ceiling should be at least 2× the layer below it, or the bandwidth estimator treats two layers as one and drops the higher of the pair. Don’t hardcode these ceilings as the actual send rate; they are caps, and WebRTC’s Google Congestion Control allocates the real bitrate underneath them. The interaction between simulcast ceilings and the estimator is covered in Bandwidth Estimation & Congestion Control.

RID	scaleResolutionDownBy	Resolution (from 720p)	maxBitrate	maxFramerate
high	1.0	1280×720	1.5 Mbps	30
mid	2.0	640×360	500 kbps	30
low	4.0	320×180	150 kbps	15

At runtime you adapt by flipping active per layer rather than renegotiating. Dropping the high layer under sustained loss frees its entire bitrate budget for the survivors without an SDP round trip:

// Disable the top layer without renegotiation — no createOffer needed
function setLayerActive(sender, rid, active) {
  const params = sender.getParameters();
  const enc = params.encodings.find(e => e.rid === rid);
  if (enc) enc.active = active;     // mutable at runtime; rid itself is frozen
  return sender.setParameters(params);
}

Step 3 — Switch to SVC with scalabilityMode L3T3

SVC replaces N parallel encoders with one encoder that structures its single output into decodable sub-layers. Instead of a rid array you configure one encoding with a scalabilityMode string. L3T3 means 3 spatial layers and 3 temporal layers — nine forwardable operating points from one RTP stream — and L3T3_KEY adds keyframe-synchronised spatial layers so the server can upgrade a subscriber’s resolution at a shared keyframe boundary. VP9 and AV1 expose full spatial SVC; the AV1-specific layer planning and CPU budget live in Configuring AV1 SVC Layers in WebRTC.

SVC’s win is a single encode pass and one SSRC, so CPU and bandwidth overhead are lower than simulcast’s parallel encoders. The cost moves to the server: the SFU must parse the dependency descriptor to know which packets belong to which layer. The decision of which mechanism to deploy at scale — and where each one wins past 50 participants — is worked through in Choosing Simulcast vs SVC for Large Conferences.

The temporal and spatial axes serve different adaptation goals. Temporal layers (the T digit) let the SFU halve frame rate per subscriber — drop the top temporal layer and a 30 fps stream becomes 15 fps at a fraction of the bitrate, with no resolution change. Spatial layers (the S/L digit) let it halve resolution. A conference that mostly needs to absorb brief congestion spikes benefits more from temporal layers, because frame-rate drops are visually gentler than resolution drops and recover instantly. Rooms with a wide spread of screen sizes — phones next to large displays — need the spatial range. L3T3 gives both, which is why it is the common default once a codec supports full spatial SVC.

const sender = pc.addTrack(videoTrack, stream);

const params = sender.getParameters();
params.encodings = [{
  active: true,
  maxBitrate: 2_000_000,
  scalabilityMode: 'L3T3_KEY'   // 3 spatial + 3 temporal, keyframe-synced spatial upgrades
}];
await sender.setParameters(params);
// No rid array: a single SSRC carries all layers, distinguished by the dependency descriptor

Step 4 — SFU layer selection and keyframes

Whichever mechanism you ship, the server makes the actual quality decision per subscriber. For simulcast the SFU matches each subscriber’s estimated downlink against the available rid streams and forwards the highest one that fits, dropping the others’ RTP packets without decoding. For SVC it reads the dependency descriptor and forwards only the spatial/temporal layers the subscriber can afford. This receiver-driven selection — not sender-driven — is the correct model for any SFU topology; sender-driven switching belongs only to P2P mesh. The full algorithm, including hysteresis to stop layer flapping, is in Bandwidth-Aware Layer Selection in an SFU and the forwarding mechanics in Simulcast-Aware Forwarding.

The hard part is keyframes. A spatial or rid upgrade is only decodable from a keyframe — forward the first packets of a higher layer mid-GOP and the subscriber renders corruption until the next keyframe arrives on its own. So when the SFU promotes a subscriber, it sends a PLI (Picture Loss Indication) upstream to the publisher and holds the upgrade until the resulting keyframe boundary. Verify the whole pipeline by polling getStats() and confirming each rid shows independent, growing bytesSent:

// Verification: confirm every layer is independently active before trusting the SFU
const stats = await sender.getStats();
for (const r of stats.values()) {
  if (r.type === 'outbound-rtp' && r.rid) {
    console.log(`rid=${r.rid} bytesSent=${r.bytesSent} keyframes=${r.keyFramesEncoded} fps=${r.framesPerSecond}`);
    // A layer with frozen bytesSent while others grow = collapsed or CPU-starved encoder
  }
}

Edge Cases & Browser Quirks

Chrome simulcast vs Safari. Chrome (since ~M90) supports three-layer VP8/VP9 simulcast and AV1 SVC reliably. Safari negotiates a=rid but is stricter about ordering and historically caps practical simulcast at two layers; if a third layer never produces an SSRC, check that Safari accepted all a=rid lines in the answer rather than pruning one.
VP8 has no spatial SVC. VP8 SVC is temporal-only (L1T2, L1T3). Requesting L3T3 on VP8 silently degrades to a single spatial layer. Use VP9 or AV1 when you need spatial scalability.
H.264 in Chrome maps to SVC. Chromium routes H.264 through its SVC path and will not emit three independent simulcast SSRCs. For three-layer simulcast on H.264 you need an external encoder or a VP-family codec — see VP8 vs H.264 vs AV1 Codec Selection.
AV1 SVC support is uneven. Chrome ships AV1 L1T3/L3T3; Safari’s AV1 SVC remains partial. Always probe RTCRtpSender.getCapabilities('video') and fall back to VP9 SVC or VP8 simulcast when a scalabilityMode is unsupported, because a rejected mode silently collapses to single-layer encoding.
Firefox simulcast. Firefox supports VP8 simulcast but leans on scaleResolutionDownBy and has historically lagged on VP9/AV1 SVC; treat its SVC support as opportunistic and test the negotiated SDP, not the requested config.

Common Implementation Mistakes

Omitting scaleResolutionDownBy. Three identical-resolution encodes triple encoder load for zero benefit and exhaust the CPU within seconds.
Overlapping bitrate tiers. Ceilings packed too closely (400/500/700 kbps) let the estimator merge layers and drop the top one. Keep each layer at least 2× the one below.
Calling setParameters() too late. After the first frame encodes, changing rid throws InvalidModificationError. Configure encodings before createOffer().
Fighting GCC with manual toggling. Flipping active faster than the estimator’s probing cycle causes layer oscillation and packet bursts; only toggle on sustained (>5 s) degradation.
Forwarding an upgrade without a keyframe. Promoting a subscriber mid-GOP renders corruption. Always request a PLI and wait for the keyframe boundary.
Assuming SVC is universal. Requesting an unsupported scalabilityMode silently falls back to one layer. Probe capabilities and have a simulcast fallback.

FAQ

Should I use simulcast or SVC? Simulcast is the safe default for heterogeneous, cross-browser rooms because every engine supports VP8 simulcast and the SFU logic is trivially simple — forward the matching stream. SVC wins on encoder CPU and total uplink bitrate when your clients reliably run VP9 or AV1, at the cost of an SFU that understands the dependency descriptor. The full trade-off at conference scale is in Choosing Simulcast vs SVC for Large Conferences.

Why does my third simulcast layer never appear? Almost always the negotiated codec is H.264 (which collapses to SVC in Chrome) or setParameters() ran after encoding began. Confirm the codec in the SDP and that three distinct SSRCs appear under VideoSender in chrome://webrtc-internals.

How does the SFU pick a layer without decoding the video? For simulcast it keys on the rid/SSRC mapping from the SDP; for SVC it reads the RTP dependency descriptor header extension. In both cases it forwards or drops whole RTP packets and never enters the codec, which is exactly what keeps an SFU cheap relative to an MCU.

What triggers the keyframe before a layer upgrade? The SFU sends an RTCP PLI to the publisher when it decides to promote a subscriber, and holds the higher layer until the keyframe that PLI produces arrives — forwarding earlier shows corruption.

Related Guides