Simulcast & SVC Implementation in WebRTC

Multi-stream encoding is how a single publisher satisfies a room full of subscribers on wildly different downlinks — one sends 1.5 Mbps of 720p, another a 150 kbps thumbnail, from the same camera and the same RTCPeerConnection. This guide is part of the Media Handling, Codecs & Bandwidth Estimation guide, and its job is to make simulcast and SVC work end-to-end: defining RID-based encodings, mapping scaleResolutionDownBy across three spatial tiers, switching to a single SVC stream with scalabilityMode, and forwarding the right layer per subscriber from the server. Get the encoder configuration wrong and you ship three identical-resolution streams that melt the CPU; get the keyframe handling wrong and subscribers see green-block corruption every time the server upgrades them.

Simulcast and SVC solve the same problem — one encode, many downstream qualities — with opposite trade-offs. Simulcast runs N parallel encoders and emits N independent RTP streams; the Selective Forwarding Unit just forwards whichever stream matches a subscriber, never touching codec internals. SVC runs one encoder that emits one stream layered so the server can drop frames to downscale. The sections below build both, call out where Chrome, Firefox, and Safari diverge, and link to the server-side forwarding logic that consumes them.

Simulcast layers feeding an SFU with per-subscriber forwarding A single camera encodes three RID layers — low at 150 kbps, mid at 500 kbps, high at 1.5 Mbps — sent as independent RTP streams to an SFU, which forwards the low layer to a mobile subscriber, the mid layer to a tablet, and the high layer to a desktop based on each subscriber's bandwidth. Camera one encode request rid=high 720p 1.5 Mbps rid=mid 360p 500 kbps rid=low 180p 150 kbps SFU per-subscriber layer select Desktop gets high Tablet gets mid Mobile gets low
One camera emits three RID-tagged RTP streams; the SFU forwards a different spatial layer to each subscriber based on its measured downlink, never re-encoding.

Step 1 — Declare RID-based simulcast encodings

Simulcast is configured entirely on the sender, before the first offer. You attach the track, read the parameters, replace the encodings array with one entry per quality tier, and write it back. Each entry carries a rid (the RTP stream identifier the SFU keys on), an active flag, a maxBitrate ceiling, and a scaleResolutionDownBy factor. The browser then emits a=simulcast and a=rid lines into the SDP automatically — you never hand-edit them.

The timing is strict: setParameters() must complete before createOffer(). You cannot add, remove, or rename a rid after negotiation; only the per-encoding active, maxBitrate, and scaleResolutionDownBy are mutable at runtime. Pin a codec that supports independent layers first — in Chrome that means VP8, VP9, or AV1, because Chromium maps H.264 onto SVC and caps it at two simulcast layers. The Chrome-specific recipe, including the exact chrome://webrtc-internals SSRC check, is in Simulcast with Three Quality Layers in Chrome.

The rid value is not cosmetic — it is the join key between this sender and the SFU. Whatever string you choose (high/mid/low, or f/h/q as some stacks use) appears verbatim in the a=rid SDP lines and in every outbound-rtp stats report, and the server’s forwarding table is built from exactly those identifiers. Keep them stable across your client and server code; a mismatch means the SFU has streams it cannot route. Note also the encoding order: list the highest-resolution layer first so the encoder treats it as the base and derives the scaled-down layers from it. Reversing the order makes some encoder builds scale up, producing blurry top layers.

const transceiver = pc.addTransceiver(videoTrack, {
  direction: 'sendonly',
  // Order matters: 'high' first so the encoder treats it as the base resolution
  sendEncodings: [
    { rid: 'high', active: true, maxBitrate: 1_500_000, scaleResolutionDownBy: 1.0, maxFramerate: 30 },
    { rid: 'mid',  active: true, maxBitrate:   500_000, scaleResolutionDownBy: 2.0, maxFramerate: 30 },
    { rid: 'low',  active: true, maxBitrate:   150_000, scaleResolutionDownBy: 4.0, maxFramerate: 15 }
  ]
});

// Prefer VP8/VP9/AV1 before the offer — H.264 silently collapses to 2-layer SVC in Chrome
const caps = RTCRtpSender.getCapabilities('video');
const vp8 = caps.codecs.filter(c => /vp8/i.test(c.mimeType));
transceiver.setCodecPreferences([...vp8, ...caps.codecs]);

const offer = await pc.createOffer(); // a=simulcast + a=rid:high/mid/low now emitted automatically
await pc.setLocalDescription(offer);

Step 2 — Map scaleResolutionDownBy across 1/2/4

scaleResolutionDownBy is the single most important field for keeping CPU sane. It divides the capture resolution before encoding, so a 1280×720 capture with factors 1.0 / 2.0 / 4.0 produces 720p, 360p, and 180p streams. Omit it and you get three full-resolution encodes — roughly 3× the encoder load with no quality benefit, the fastest way to exhaust a laptop CPU mid-call. Keep the factors as clean powers of two; fractional ratios like 1.5 force the scaler onto non-aligned dimensions that some hardware encoders reject.

Pair each resolution with a maxBitrate that leaves clear headroom between tiers — a useful rule is that each layer’s ceiling should be at least 2× the layer below it, or the bandwidth estimator treats two layers as one and drops the higher of the pair. Don’t hardcode these ceilings as the actual send rate; they are caps, and WebRTC’s Google Congestion Control allocates the real bitrate underneath them. The interaction between simulcast ceilings and the estimator is covered in Bandwidth Estimation & Congestion Control.

RID scaleResolutionDownBy Resolution (from 720p) maxBitrate maxFramerate
high 1.0 1280Ă—720 1.5 Mbps 30
mid 2.0 640Ă—360 500 kbps 30
low 4.0 320Ă—180 150 kbps 15

At runtime you adapt by flipping active per layer rather than renegotiating. Dropping the high layer under sustained loss frees its entire bitrate budget for the survivors without an SDP round trip:

// Disable the top layer without renegotiation — no createOffer needed
function setLayerActive(sender, rid, active) {
  const params = sender.getParameters();
  const enc = params.encodings.find(e => e.rid === rid);
  if (enc) enc.active = active;     // mutable at runtime; rid itself is frozen
  return sender.setParameters(params);
}

Step 3 — Switch to SVC with scalabilityMode L3T3

SVC replaces N parallel encoders with one encoder that structures its single output into decodable sub-layers. Instead of a rid array you configure one encoding with a scalabilityMode string. L3T3 means 3 spatial layers and 3 temporal layers — nine forwardable operating points from one RTP stream — and L3T3_KEY adds keyframe-synchronised spatial layers so the server can upgrade a subscriber’s resolution at a shared keyframe boundary. VP9 and AV1 expose full spatial SVC; the AV1-specific layer planning and CPU budget live in Configuring AV1 SVC Layers in WebRTC.

SVC’s win is a single encode pass and one SSRC, so CPU and bandwidth overhead are lower than simulcast’s parallel encoders. The cost moves to the server: the SFU must parse the dependency descriptor to know which packets belong to which layer. The decision of which mechanism to deploy at scale — and where each one wins past 50 participants — is worked through in Choosing Simulcast vs SVC for Large Conferences.

The temporal and spatial axes serve different adaptation goals. Temporal layers (the T digit) let the SFU halve frame rate per subscriber — drop the top temporal layer and a 30 fps stream becomes 15 fps at a fraction of the bitrate, with no resolution change. Spatial layers (the S/L digit) let it halve resolution. A conference that mostly needs to absorb brief congestion spikes benefits more from temporal layers, because frame-rate drops are visually gentler than resolution drops and recover instantly. Rooms with a wide spread of screen sizes — phones next to large displays — need the spatial range. L3T3 gives both, which is why it is the common default once a codec supports full spatial SVC.

const sender = pc.addTrack(videoTrack, stream);

const params = sender.getParameters();
params.encodings = [{
  active: true,
  maxBitrate: 2_000_000,
  scalabilityMode: 'L3T3_KEY'   // 3 spatial + 3 temporal, keyframe-synced spatial upgrades
}];
await sender.setParameters(params);
// No rid array: a single SSRC carries all layers, distinguished by the dependency descriptor

Step 4 — SFU layer selection and keyframes

Whichever mechanism you ship, the server makes the actual quality decision per subscriber. For simulcast the SFU matches each subscriber’s estimated downlink against the available rid streams and forwards the highest one that fits, dropping the others’ RTP packets without decoding. For SVC it reads the dependency descriptor and forwards only the spatial/temporal layers the subscriber can afford. This receiver-driven selection — not sender-driven — is the correct model for any SFU topology; sender-driven switching belongs only to P2P mesh. The full algorithm, including hysteresis to stop layer flapping, is in Bandwidth-Aware Layer Selection in an SFU and the forwarding mechanics in Simulcast-Aware Forwarding.

The hard part is keyframes. A spatial or rid upgrade is only decodable from a keyframe — forward the first packets of a higher layer mid-GOP and the subscriber renders corruption until the next keyframe arrives on its own. So when the SFU promotes a subscriber, it sends a PLI (Picture Loss Indication) upstream to the publisher and holds the upgrade until the resulting keyframe boundary. Verify the whole pipeline by polling getStats() and confirming each rid shows independent, growing bytesSent:

// Verification: confirm every layer is independently active before trusting the SFU
const stats = await sender.getStats();
for (const r of stats.values()) {
  if (r.type === 'outbound-rtp' && r.rid) {
    console.log(`rid=${r.rid} bytesSent=${r.bytesSent} keyframes=${r.keyFramesEncoded} fps=${r.framesPerSecond}`);
    // A layer with frozen bytesSent while others grow = collapsed or CPU-starved encoder
  }
}

Edge Cases & Browser Quirks

Common Implementation Mistakes

FAQ

Should I use simulcast or SVC? Simulcast is the safe default for heterogeneous, cross-browser rooms because every engine supports VP8 simulcast and the SFU logic is trivially simple — forward the matching stream. SVC wins on encoder CPU and total uplink bitrate when your clients reliably run VP9 or AV1, at the cost of an SFU that understands the dependency descriptor. The full trade-off at conference scale is in Choosing Simulcast vs SVC for Large Conferences.

Why does my third simulcast layer never appear? Almost always the negotiated codec is H.264 (which collapses to SVC in Chrome) or setParameters() ran after encoding began. Confirm the codec in the SDP and that three distinct SSRCs appear under VideoSender in chrome://webrtc-internals.

How does the SFU pick a layer without decoding the video? For simulcast it keys on the rid/SSRC mapping from the SDP; for SVC it reads the RTP dependency descriptor header extension. In both cases it forwards or drops whole RTP packets and never enters the codec, which is exactly what keeps an SFU cheap relative to an MCU.

What triggers the keyframe before a layer upgrade? The SFU sends an RTCP PLI to the publisher when it decides to promote a subscriber, and holds the higher layer until the keyframe that PLI produces arrives — forwarding earlier shows corruption.

Related: return to Media Handling, Codecs & Bandwidth Estimation, drill into Simulcast with Three Quality Layers in Chrome, Choosing Simulcast vs SVC for Large Conferences, and Configuring AV1 SVC Layers in WebRTC, then cross over to Simulcast-Aware Forwarding and Selective Forwarding Unit Design to build the server side.