Configuring AV1 SVC Layers in WebRTC

This guide is part of the Simulcast & SVC Implementation guide, and it solves one task: configuring AV1’s scalable video coding (SVC) layers in a browser RTCPeerConnection so a Selective Forwarding Unit (SFU) can forward exactly the spatial and temporal layers each subscriber can afford. AV1 gives the best compression of any WebRTC codec and clean native SVC, but it asks for the most encode CPU of the three codecs compared in VP8 vs H.264 vs AV1 Codec Selection, so layer count is a CPU-budget decision, not just a quality one.

Context & Trade-offs

AV1 SVC is selected entirely through one field, scalabilityMode on a single encoding, and the encoder structures its one output stream accordingly. The two modes that matter in practice are L1T3 (one spatial layer, three temporal layers; frame-rate adaptation only) and L3T3 (three spatial, three temporal; nine forwardable operating points). L3T3_KEY keeps the same nine operating points but restricts inter-layer prediction to keyframes, so the SFU can upgrade a subscriber’s resolution at a keyframe boundary without corruption.
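A minimal sketch of how a scalabilityMode string breaks down into layer counts, assuming the L<spatial>T<temporal> naming with an optional _KEY suffix described above. The helper name parseScalabilityMode is hypothetical and is not part of any browser API.

```javascript
// Hypothetical helper (not a browser API): split a scalabilityMode string
// such as 'L1T3' or 'L3T3_KEY' into its layer counts.
function parseScalabilityMode(mode) {
  // L<spatial>T<temporal>, optional 'h' (1.5:1 spatial ratio), optional _KEY or _KEY_SHIFT
  const m = /^L(\d)T(\d)(h)?(_KEY(?:_SHIFT)?)?$/.exec(mode);
  if (!m) return null; // unknown or malformed mode string
  const spatial = Number(m[1]);
  const temporal = Number(m[2]);
  return {
    spatial,
    temporal,
    keyframeOnlyInterLayer: m[4] !== undefined, // true for the _KEY variants
    operatingPoints: spatial * temporal,        // layer combinations an SFU can forward
  };
}

console.log(parseScalabilityMode('L1T3'));     // 1 spatial x 3 temporal layers
console.log(parseScalabilityMode('L3T3_KEY')); // 3 spatial x 3 temporal layers
```

The operatingPoints value is simply the product of the two layer counts, which is where the figure of nine for L3T3 comes from.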

Unlike simulcast, AV1 SVC uses a single encoding object with no rid array — the one SSRC carries every layer, and the layer structure lives in the bitstream rather than in separate RTP streams. That is the source of both its advantage and its constraint. The advantage: one encode pass and one SSRC means the publisher’s CPU and uplink track a single stream rather than the sum of three. The constraint: every consumer of those layers — the SFU above all — must understand the layered structure to do anything useful with it, where a simulcast SFU could stay entirely codec-blind. So the field that makes AV1 SVC trivial to request on the sender is exactly the field that pushes complexity onto the server, which is why the dependency descriptor (below) is not optional plumbing but the load-bearing piece of an AV1 SVC deployment.

The cost is CPU. AV1 software encoding runs roughly 3–5× the cost of VP8 for the same resolution, and each added spatial layer compounds it: L3T3 on a 720p source can saturate a mid-range laptop core. That makes L1T3 the conservative default: full temporal scalability (the SFU can halve or quarter the frame rate per subscriber) at near-single-layer encode cost, reserving full L3T3 spatial scalability for hardware-accelerated clients or low resolutions. AV1’s superior compression means an L1T3 stream at 1.2 Mbps often matches VP9 quality at 1.5 Mbps, recovering some of the uplink budget discussed in Bandwidth Estimation & Congestion Control.

The SFU forwards AV1 layers by reading the dependency descriptor, the RTP header extension (a=extmap ... dependency-descriptor) that marks each packet’s spatial and temporal layer and its decode dependencies. If it is not negotiated, the SFU cannot drop frames safely and AV1 SVC degrades to forwarding the whole stream. The forwarding mechanics are in Simulcast-Aware Forwarding, and the codec-selection context is in VP8 vs H.264 vs AV1 Codec Selection.

The dependency descriptor is what makes AV1 SVC genuinely codec-agnostic at the server: it is a generic frame-structure description carried outside the codec payload, so the SFU drops frames by reading the extension alone, never parsing AV1 bitstream internals. That is the same property that keeps an SFU cheap relative to a transcoding MCU — it forwards or drops whole RTP packets and stays out of the codec. For a keyframe-synced mode like L3T3_KEY, the descriptor also tells the server exactly which frames are decode targets for a spatial upgrade, so the SFU can promote a subscriber at the right boundary after requesting a keyframe with a PLI. Get the extension negotiation right and the rest of the AV1 forwarding path is identical to VP9 SVC.
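One way to check the extension negotiation is to scan both the offer and the answer for an a=extmap line whose URI mentions the dependency descriptor. The sketch below is illustrative: the function names are hypothetical, and the URI in the usage example is assumed to be the one from the AV1 RTP payload specification.

```javascript
// Hypothetical helper: find the dependency descriptor a=extmap entry in an SDP blob.
function findDependencyDescriptor(sdp) {
  for (const line of sdp.split(/\r?\n/)) {
    // a=extmap:<id>[/<direction>] <uri> [extension attributes]
    const m = /^a=extmap:(\d+)(?:\/\S+)?\s+(\S+)/.exec(line);
    if (m && m[2].includes('dependency-descriptor')) {
      return { id: Number(m[1]), uri: m[2] };
    }
  }
  return null; // extension not present in this description
}

// The extension only counts as negotiated when both sides list it.
function isDependencyDescriptorNegotiated(offerSdp, answerSdp) {
  return findDependencyDescriptor(offerSdp) !== null &&
         findDependencyDescriptor(answerSdp) !== null;
}

// Usage with an illustrative SDP fragment (URI assumed from the AV1 RTP spec):
const sampleSdp = [
  'm=video 9 UDP/TLS/RTP/SAVPF 45',
  'a=extmap:12 https://aomediacodec.github.io/av1-rtp-spec/#dependency-descriptor-rtp-header-extension',
  'a=rtpmap:45 AV1/90000',
].join('\r\n');
console.log(findDependencyDescriptor(sampleSdp));
```

Checking the answer as well as the offer matters because an SFU that does not support the extension will simply leave it out of its answer.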

scalabilityMode	Layers	Encode cost (vs VP8)	Best fit
L1T3	1 spatial, 3 temporal	~3×	default; CPU-limited publishers
L3T3	3 spatial, 3 temporal	~4–5×	HW-accelerated or low-res sources
L3T3_KEY	3 spatial (keyframe-synced), 3 temporal	~4–5×	conferences needing clean resolution upgrades

Minimal Runnable Implementation

// Prefer AV1 before negotiation, then request an SVC mode
const transceiver = pc.addTransceiver(videoTrack, { direction: 'sendonly' });

const caps = RTCRtpSender.getCapabilities('video');
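// WebRTC reports AV1 as 'video/AV1'; 'av01' is the prefix of AV1 codec strings, not the RTP mimeType
const isAv1 = c => c.mimeType.toLowerCase() === 'video/av1';
const av1 = caps.codecs.filter(isAv1);
if (av1.length) transceiver.setCodecPreferences([...av1, ...caps.codecs.filter(c => !isAv1(c))]);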

const sender = transceiver.sender;
const params = sender.getParameters();
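// Mutate the existing encoding in place: setParameters rejects a changed encoding count
Object.assign(params.encodings[0], {
  active: true,
  maxBitrate: 1_200_000,
  // Start conservative: temporal-only. Switch to 'L3T3_KEY' only on HW-accelerated clients.
  scalabilityMode: 'L1T3'
});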
await sender.setParameters(params);

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
// Verify the dependency descriptor extension is in the SDP — required for SFU layer dropping
console.assert(/dependency-descriptor/.test(offer.sdp), 'AV1 SVC needs the dependency descriptor');

Reproduction Steps & Debugging Log Patterns

Probe RTCRtpSender.getCapabilities('video') and confirm video/AV1 is present before forcing AV1.
Apply scalabilityMode: 'L1T3' and negotiate; grep the local SDP for a=extmap carrying dependency-descriptor.
Poll getStats() and read outbound-rtp: confirm that framesEncoded and totalEncodeTime are rising and that scalabilityMode echoes back the requested mode.
Watch the encode time per frame (the change in totalEncodeTime divided by the change in framesEncoded); if it approaches the frame interval (about 33 ms at 30 fps), the encoder is CPU-bound, so drop to a lower resolution or fewer spatial layers.
On the SFU, confirm it parses the dependency descriptor and forwards distinct temporal layers to subscribers on different downlinks.

Expected healthy AV1 L1T3 stats:

// outbound-rtp (video, AV1)
// scalabilityMode=L1T3  framesEncoded rising  totalEncodeTime delta ~28ms/s  bitrate~1.18Mbps
// If scalabilityMode comes back undefined or 'L1T1' → mode rejected, collapsed to single layer

A scalabilityMode echoing back as L1T1, or a missing dependency-descriptor extension, means the requested SVC mode was not honored — fall back to VP9 SVC or VP8 simulcast.
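The stats checks above can be sketched as a comparison of two outbound-rtp snapshots taken some interval apart. This is a minimal illustration that assumes the snapshots are plain objects copied out of getStats(); the function name summarizeOutbound is hypothetical.

```javascript
// Hypothetical helper: compare two outbound-rtp snapshots (plain objects copied
// from getStats()) and report whether the requested mode was honored.
function summarizeOutbound(prev, curr, requestedMode) {
  const frames = curr.framesEncoded - prev.framesEncoded;
  const encodeSeconds = curr.totalEncodeTime - prev.totalEncodeTime; // stat is in seconds
  return {
    // undefined or 'L1T1' here means the mode collapsed to a single layer
    modeHonored: curr.scalabilityMode === requestedMode,
    framesEncoded: frames,
    encodeMsPerFrame: frames > 0 ? (encodeSeconds * 1000) / frames : null,
  };
}

// Usage with made-up snapshot values:
const before = { framesEncoded: 100, totalEncodeTime: 1.0, scalabilityMode: 'L1T3' };
const after = { framesEncoded: 150, totalEncodeTime: 1.5, scalabilityMode: 'L1T3' };
console.log(summarizeOutbound(before, after, 'L1T3'));
```

Working from deltas rather than the raw counters matters because framesEncoded and totalEncodeTime are cumulative over the life of the stream.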

When the encoder is the bottleneck, the symptom is specific: qualityLimitationReason on the outbound-rtp report reads cpu, framesPerSecond sags below the requested maxFramerate, and totalEncodeTime climbs toward the frame budget. AV1 reaches this point at lower resolutions than VP8 or VP9 because of its higher per-frame cost, so the right response is to step down the spatial layer count (L3T3 to L1T3) or the resolution before touching bitrate. Dropping the bitrate ceiling on a CPU-bound AV1 encoder does nothing, since the limit is compute, not bandwidth. Check representative low-end subscribers too, comparing totalDecodeTime against framesDecoded on the receiver’s inbound-rtp; an AV1 decode that overruns the frame budget there shows up as rising framesDropped, which no publisher-side change can fix.
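The three symptoms can be folded into one check. The sketch below is illustrative only: the function name and the 0.9 and 0.8 thresholds are assumptions chosen for the example, not values from any specification, and encodeMsPerFrame is expected to be derived by the caller from totalEncodeTime and framesEncoded deltas.

```javascript
// Hypothetical classifier for a single outbound-rtp report.
// Thresholds are assumptions for illustration; tune them against real traffic.
function detectCpuBound(report, targetFps, encodeMsPerFrame) {
  const frameBudgetMs = 1000 / targetFps;
  const limitedByCpu = report.qualityLimitationReason === 'cpu';
  const fpsSagging = report.framesPerSecond < targetFps * 0.9;    // assumed 10% tolerance
  const encodeNearBudget = encodeMsPerFrame > frameBudgetMs * 0.8; // assumed 80% of budget
  return {
    limitedByCpu,
    fpsSagging,
    encodeNearBudget,
    // Trust the browser's own verdict, or two corroborating symptoms together.
    cpuBound: limitedByCpu || (fpsSagging && encodeNearBudget),
  };
}

// Usage with made-up values: 30 fps target, 31 ms spent encoding each frame.
console.log(detectCpuBound({ qualityLimitationReason: 'cpu', framesPerSecond: 22 }, 30, 31));
```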

Tuning the CPU Budget

AV1’s encode cost is governed less by SVC structure than by the encoder’s speed/quality trade-off, which in WebRTC you influence indirectly through resolution, frame rate, and layer count rather than a direct cpu-used knob. The practical levers, in order of impact: drop spatial layers first (L3T3 to L1T3 removes the two scaled encodes that dominate cost), then cap resolution (a 360p source encodes AV1 comfortably where 720p saturates), then reduce frame rate via maxFramerate. Bitrate is the lever that does not help a CPU-bound encoder — lowering maxBitrate reduces output size but not the compute to produce each frame.
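The lever ordering can be written down as a small step-down function. This is a sketch under stated assumptions: the state shape and function name are hypothetical, the 360p cap follows the figure mentioned above, and the 15 fps floor is an arbitrary value chosen for the example.

```javascript
// Hypothetical step-down policy for a CPU-bound AV1 publisher.
// state: { scalabilityMode, height, maxFramerate }. Returns the next state to
// try, or null when no lever is left. Bitrate is deliberately not a lever here.
function nextStepDown(state) {
  const spatialLayers = Number(/^L(\d)/.exec(state.scalabilityMode)?.[1] ?? 1);
  // 1. Drop spatial layers first: removes the extra scaled encodes.
  if (spatialLayers > 1) return { ...state, scalabilityMode: 'L1T3' };
  // 2. Then cap resolution (360p taken from the text above).
  if (state.height > 360) return { ...state, height: 360 };
  // 3. Finally reduce frame rate (15 fps floor is an assumed value).
  if (state.maxFramerate > 15) return { ...state, maxFramerate: 15 };
  return null;
}

// Usage: walk the ladder from the most expensive configuration.
let state = { scalabilityMode: 'L3T3_KEY', height: 720, maxFramerate: 30 };
while (state) {
  console.log(state);
  state = nextStepDown(state);
}
```

Applying the returned state is left to the caller, for example through setParameters for the mode and frame rate and through track constraints or scaleResolutionDownBy for the resolution.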

Hardware acceleration changes the math entirely. On devices with an AV1 hardware encoder, L3T3 becomes viable at full resolution because the spatial layers run on dedicated silicon instead of the CPU. Probe for it indirectly: apply L3T3_KEY, then read qualityLimitationReason and totalEncodeTime under load — a hardware path holds frame rate with low encode time where software collapses. Because that capability varies by device, the robust pattern is to ship L1T3 as the default and promote to L3T3 only after observing healthy stats, rather than assuming the higher mode and recovering from a stalled encoder. That keeps the conservative path safe for the wide range of software-only AV1 clients while still exploiting hardware where it exists.
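The promote-after-observing pattern can be sketched as a decision over a window of samples collected while the trial mode is applied. Everything here is an assumption for illustration: the function name, the sample shape, the minimum window of five samples, and the 0.9 and 0.5 thresholds.

```javascript
// Hypothetical decision helper for a trial promotion from L1T3 to L3T3_KEY.
// samples: one object per stats poll taken while L3T3_KEY is applied, each with
// qualityLimitationReason, framesPerSecond and encodeMsPerFrame (caller-derived).
function decideAfterTrial(samples, targetFps, minSamples = 5) {
  if (samples.length < minSamples) return 'keep-observing';
  const budgetMs = 1000 / targetFps;
  const healthy = samples.every(s =>
    s.qualityLimitationReason !== 'cpu' &&
    s.framesPerSecond >= targetFps * 0.9 &&  // assumed tolerance on frame rate
    s.encodeMsPerFrame <= budgetMs * 0.5);   // assumed headroom on encode time
  return healthy ? 'keep-L3T3_KEY' : 'revert-to-L1T3';
}

// Usage with made-up samples resembling a hardware encode path:
const window5 = Array.from({ length: 5 }, () => (
  { qualityLimitationReason: 'none', framesPerSecond: 30, encodeMsPerFrame: 6 }));
console.log(decideAfterTrial(window5, 30));
```

Requiring every sample in the window to be healthy, rather than an average, keeps a single stall from being masked by otherwise good readings.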

Common Implementation Mistakes

Requesting L3T3 on a CPU-limited publisher. AV1’s encode cost compounds per spatial layer; three spatial layers at 720p can saturate a core. Default to L1T3 and escalate only with hardware acceleration.
Not negotiating the dependency descriptor. Without that RTP extension the SFU cannot drop layers and forwards the whole stream, defeating SVC.
Assuming universal AV1 support. Chrome ships AV1 SVC; Safari’s is partial and Firefox lags. Probe capabilities and fall back, since a rejected mode silently collapses to one layer.
Ignoring decode cost on weak receivers. AV1 decode is also heavier than VP8/VP9; low-end mobile subscribers may struggle even when the publisher is fine.
Treating maxBitrate as the send rate. It is a ceiling; Google Congestion Control allocates the actual bitrate underneath, and AV1’s efficiency means it often sends well below the cap.
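The probe-and-fall-back advice can be sketched as a function over the codec list returned by RTCRtpSender.getCapabilities('video').codecs. The function name, the returned plan shape, and the rid labels q/h/f are assumptions for illustration; the fallback order (AV1 SVC, then VP9 SVC, then VP8 simulcast) follows the order given earlier in this guide.

```javascript
// Hypothetical planner: choose a send configuration from a codec capability list.
// codecs: array of objects with a mimeType field, as found in getCapabilities().codecs.
function pickSendPlan(codecs) {
  const has = mime => codecs.some(c => c.mimeType.toLowerCase() === mime);
  if (has('video/av1')) {
    return { mimeType: 'video/AV1', sendEncodings: [{ scalabilityMode: 'L1T3' }] };
  }
  if (has('video/vp9')) {
    return { mimeType: 'video/VP9', sendEncodings: [{ scalabilityMode: 'L1T3' }] };
  }
  // VP8 simulcast: three separate encodings, lowest resolution first.
  return {
    mimeType: 'video/VP8',
    sendEncodings: [
      { rid: 'q', scaleResolutionDownBy: 4 },
      { rid: 'h', scaleResolutionDownBy: 2 },
      { rid: 'f', scaleResolutionDownBy: 1 },
    ],
  };
}

// Usage with a made-up capability list lacking AV1:
console.log(pickSendPlan([{ mimeType: 'video/VP8' }, { mimeType: 'video/VP9' }]));
```

A plan chosen this way still needs the stats check after negotiation, since a codec being listed does not guarantee the requested scalabilityMode is honored.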

FAQ

Should I start with L1T3 or L3T3 for AV1? Start with L1T3. It gives full temporal scalability at near-single-layer encode cost, which is the safe default for software-encoded publishers. Move to L3T3/L3T3_KEY only when clients have hardware AV1 encoding or the source resolution is low enough to absorb the extra spatial-layer cost.

How do I confirm the SFU can actually drop AV1 layers? Verify the dependency-descriptor a=extmap is present in the negotiated SDP and that the SFU parses it. Without it, layer dropping is impossible and the SFU forwards the full stream regardless of subscriber bandwidth.

Is AV1 SVC worth the CPU over VP9 SVC? When clients support it and CPU permits, yes — AV1 delivers comparable quality at roughly 20–30% lower bitrate than VP9, which matters most on constrained uplinks. On CPU-limited publishers, VP9 SVC is the pragmatic choice, and the broader simulcast-versus-SVC trade is in Choosing Simulcast vs SVC for Large Conferences.

What is the difference between L1T3 and L3T3 in one sentence? L1T3 gives the SFU three frame-rate operating points at a single resolution (cheap to encode), while L3T3 adds two more resolutions for nine total operating points at substantially higher encode cost — so L1T3 is the default and L3T3 is the upgrade for clients that can afford it.

Why does my AV1 SVC stream forward as a single layer even though the mode applied? The dependency-descriptor header extension was not negotiated, so the SFU has no per-frame layer information and cannot drop frames. Confirm the a=extmap line for dependency-descriptor is in the answer, not just the offer, and that the SFU advertises support for it.

Related: return to Simulcast & SVC Implementation, compare against Choosing Simulcast vs SVC for Large Conferences and Simulcast with Three Quality Layers in Chrome, and cross to Simulcast-Aware Forwarding for the server side.