Media Server Architecture: SFU & MCU Design, Scaling & Production Patterns

A full-mesh WebRTC topology is elegant until it isn’t: every peer encodes and uploads a separate stream to every other peer, so a five-person call already asks each client to run four encoders and push four streams up a single uplink. Per-client upload and CPU grow linearly with room size, the room-wide total grows as O(n²), and consumer uplinks collapse somewhere around four to six participants. Past that threshold the only viable answer is to put a server in the media path (the actual RTP, not just the signaling path) and let it route. This guide is the engineering map for that server tier: it covers the two canonical topologies (SFU and MCU), the publisher/subscriber session model, the production configuration that drives simulcast-aware routing, the scaling patterns that keep a single room from pinning one node, and the observability you need when a subscriber complains that “video is frozen” but every dashboard says green.

If you operate group video, broadcast, or interactive-audio products, you have already hit the wall where peer-to-peer stops being a strategy and becomes a liability. The sections below move from topology selection through the transport lifecycle, into an annotated publisher configuration, then route into focused deep-dives for forwarding logic, recording, and horizontal scaling — every layer you will eventually have to operate under real load.

Mesh connects every peer directly; an MCU decodes and mixes all inputs into one composite stream per receiver; an SFU forwards each publisher's selected layer to each subscriber without transcoding.

Core System Architecture

A media server sits in the data plane and decides, per subscriber, which packets to send. The two canonical designs answer that question very differently. A Multipoint Control Unit (MCU) decodes every incoming stream, composites them into a single mixed picture (and a single mixed audio track), re-encodes that composite, and sends one stream to each participant. A Selective Forwarding Unit (SFU) never decodes media at all: it terminates the transport, inspects RTP headers, and forwards the packets it chooses straight through. Mesh remains the baseline for the smallest calls because it needs no server, but it does not survive growth.

The cost profile is the whole decision. An MCU pays a full decode-plus-encode cycle for every output, which pins CPU and adds a transcode latency budget, but each client only ever receives one stream — ideal for low-power endpoints and for compositing into a single recording. An SFU spends almost no CPU per stream (header rewrite, not pixels) and stays close to wire latency, but pushes N-1 streams down to each subscriber, so its scaling pressure is egress bandwidth, not cores.

Topology	CPU cost (server)	Egress bandwidth	Added latency	Practical max participants
Mesh (no server)	none	no server egress; O(n) per client uplink, O(n²) across the room	lowest (direct path)	~4–6 before uplinks saturate
MCU (mix)	very high — decode + encode per output	O(n): one stream per participant	+100–300 ms transcode	hundreds (CPU-bound)
SFU (forward)	low — header rewrite, no transcode	O(n²) server egress	+5–30 ms (no decode)	thousands per room (bandwidth-bound)

The dominant pattern in modern deployments is the SFU, because keeping packets opaque preserves end-to-end characteristics, sidesteps a transcode latency penalty, and lets the server forward different quality layers to different subscribers when publishers send simulcast or SVC. That last capability is what makes an SFU viable for large, heterogeneous rooms, and it ties directly back to how clients encode — covered under Simulcast & SVC Implementation — and to the per-path estimates explored in Bandwidth Estimation & Congestion Control.

Read the bandwidth column carefully, because it is where the two server topologies invert. An MCU’s egress is linear in participant count: one composite stream out per receiver, regardless of how many people are in the room, which is exactly why it remains the right answer for a thousand-viewer broadcast where every viewer wants the same mixed picture. An SFU’s egress is quadratic: each of N subscribers may receive up to N-1 forwarded streams, so a busy 50-person room forwarding the 1.7 Mbps top layer to everyone can ask a single node to push about 4 Gbps (50 × 49 × 1.7 Mbps). The trade is deliberate: you spend egress bandwidth (cheap to add by scaling out nodes) to avoid CPU transcode cost (which pins cores and adds latency you cannot scale away). When a room is large but only a few publishers are ever visible at once, as in a webinar or a classroom, the SFU’s quadratic term collapses back toward linear because the server forwards only the active speakers, which is the lever most production deployments pull before they reach for mixing at all.
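The egress comparison above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes every forwarded stream runs at the 1.7 Mbps top-layer ceiling used in this guide's publisher configuration, which is a worst case rather than a measurement.

```javascript
// Rough egress model per topology. All names are illustrative, not from any SFU API.

// Mesh: each client uploads one copy of its stream to every other peer.
function meshUplinkPerClientBps(n, streamBps) {
  return (n - 1) * streamBps;
}

// MCU: one composite stream out per receiver.
function mcuEgressBps(n, compositeBps) {
  return n * compositeBps;
}

// SFU: each of n subscribers receives up to `visible` forwarded streams,
// capped at n - 1 because nobody receives their own stream.
function sfuEgressBps(n, streamBps, visible = n - 1) {
  return n * Math.min(visible, n - 1) * streamBps;
}

const TOP_LAYER_BPS = 1_700_000;
console.log(meshUplinkPerClientBps(5, TOP_LAYER_BPS)); // 6800000
console.log(mcuEgressBps(50, TOP_LAYER_BPS));          // 85000000
console.log(sfuEgressBps(50, TOP_LAYER_BPS));          // 4165000000
console.log(sfuEgressBps(50, TOP_LAYER_BPS, 4));       // 340000000
```

Under these assumptions, limiting the same 50-person room to four visible publishers cuts worst-case egress from roughly 4.2 Gbps to 340 Mbps, which is the active-speaker lever described above.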

Transport & Session Lifecycle

A media server splits each participant’s involvement into two logically separate roles, commonly implemented as one RTCPeerConnection per direction: a publisher connection that uploads the participant’s own tracks to the server, and one or more subscriber connections over which the server delivers everyone else’s tracks. This is the key mental shift away from mesh: the client no longer negotiates with other clients at all. It negotiates only with the server, and the server fans out.

The lifecycle is deterministic. The client opens a publisher peer connection, adds its audio and video senders (with simulcast encodings declared up front via sendEncodings), and exchanges an SDP offer/answer with the server over the signaling channel. ICE runs exactly as it does peer-to-peer — the server is just the remote ICE agent — so the connectivity tier you already operate still applies: candidate filtering from ICE Candidate Gathering & Filtering and relay fallback from TURN Server Configuration & Auth both sit underneath the publisher and subscriber connections. Once DTLS-SRTP is up, the publisher streams its three simulcast spatial layers as distinct RTP streams (distinct SSRCs, tagged by rid), and the server demultiplexes them.

On the subscribe side, the server creates a subscriber transport per participant, and as new publishers join it adds the corresponding remote tracks and renegotiates. For each (subscriber, publisher) pair the SFU picks exactly one of the available simulcast layers to forward, based on that subscriber’s downlink estimate and its requested render size. When the subscriber’s bandwidth drops, the server switches it down a layer — ideally on a keyframe boundary so decoding never breaks — and forwards the lower-resolution stream that was already arriving from the publisher. No re-encode happens; the server only changes which packets it copies onto the subscriber’s socket. The full mechanics of that switch live in Simulcast-Aware Forwarding.
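The per-pair choice described above can be sketched as a pure function. This is a minimal illustration, not how any particular SFU implements it: the layer table mirrors the three-layer configuration in this guide, while `selectLayer`, the `headroom` factor, and the render-size rule are assumptions made for the example.

```javascript
// Simulcast layers as declared by the publisher, ordered low to high.
const LAYERS = [
  { rid: 'q', bitrateBps: 150_000, height: 180 },
  { rid: 'h', bitrateBps: 500_000, height: 360 },
  { rid: 'f', bitrateBps: 1_700_000, height: 720 },
];

// Pick the highest layer that fits the downlink estimate (with headroom),
// stopping once the chosen layer already covers the requested render height.
// `headroom` is an assumed safety factor, not a standard value.
function selectLayer(layers, estimateBps, renderHeight = Infinity, headroom = 0.8) {
  let chosen = layers[0]; // never go below the lowest layer
  for (const layer of layers) {
    const fitsBandwidth = layer.bitrateBps <= estimateBps * headroom;
    const alreadyCoversRender = chosen.height >= renderHeight;
    if (fitsBandwidth && !alreadyCoversRender) chosen = layer;
  }
  return chosen.rid;
}

console.log(selectLayer(LAYERS, 700_000));        // h
console.log(selectLayer(LAYERS, 3_000_000));      // f
console.log(selectLayer(LAYERS, 3_000_000, 300)); // h
```

The function only names the target layer. The actual switch still has to wait for a keyframe on that layer before the server starts copying its packets.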

Audio takes a simpler path than video and deserves its own note. Audio never simulcasts — each publisher sends one Opus stream — so the server’s job there is not layer selection but speaker selection: forwarding only the loudest few streams to each subscriber rather than mixing or blindly fanning out every microphone. Most SFUs read RTP audio-level header extensions to rank speakers and forward the top three or four, which keeps a 50-person room from delivering 49 simultaneous audio tracks to every client. This is why audio and video are negotiated as separate transceivers even though they ride the same BUNDLE’d transport: their forwarding policies are unrelated.
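Speaker selection reduces to ranking publishers by their most recent audio level. The sketch below assumes levels in the RFC 6464 header-extension format (0 to 127 in -dBov, where 0 is loudest and 127 is silence); `selectSpeakers` and the shape of its input are hypothetical, and a production server would typically smooth levels over time rather than rank single packets.

```javascript
// RFC 6464 audio levels are 0..127 in -dBov: 0 is loudest, 127 is silence.
// levelsByPublisher maps a publisher id to its latest reported level.
function selectSpeakers(levelsByPublisher, subscriberId, maxSpeakers = 3) {
  return Object.entries(levelsByPublisher)
    .filter(([id, level]) => id !== subscriberId && level < 127) // skip self and silence
    .sort(([, a], [, b]) => a - b)                               // lower value = louder
    .slice(0, maxSpeakers)
    .map(([id]) => id);
}

const levels = { alice: 20, bob: 127, carol: 45, dave: 10, erin: 60 };
console.log(selectSpeakers(levels, 'dave'));   // [ 'alice', 'carol', 'erin' ]
console.log(selectSpeakers(levels, 'bob', 2)); // [ 'dave', 'alice' ]
```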

Three signaling details separate a working SFU integration from a broken one. First, the publisher must declare every layer it intends to send before the offer, because setParameters() cannot add encodings to an existing sender; introducing a layer later means a new transceiver and a renegotiation. Second, a=rtcp-mux and a=group:BUNDLE should be required so each transport uses a single 5-tuple, the same multiplexing discipline the WebRTC Protocol Stack & Signaling Servers guide enforces for peer-to-peer connections. Third, the server drives subscriber renegotiation, so the client’s signaling state machine must accept server-initiated offers without treating them as glare.
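The second requirement is easy to verify on any offer or answer before applying it. The helper below is a minimal sketch with a hypothetical name (`checkMultiplexing`); it looks only for a session-level BUNDLE group and an a=rtcp-mux line in every media section, and does not validate the rest of the SDP.

```javascript
// Checks the two multiplexing requirements on an SDP blob:
// a session-level BUNDLE group and a=rtcp-mux in every media section.
function checkMultiplexing(sdp) {
  const lines = sdp.split(/\r?\n/).filter(Boolean);
  const firstMedia = lines.findIndex((l) => l.startsWith('m='));
  const sessionLines = firstMedia === -1 ? lines : lines.slice(0, firstMedia);
  const hasBundle = sessionLines.some((l) => l.startsWith('a=group:BUNDLE'));

  // Split the remainder into one array of lines per m= section.
  const sections = [];
  for (const line of firstMedia === -1 ? [] : lines.slice(firstMedia)) {
    if (line.startsWith('m=')) sections.push([]);
    sections[sections.length - 1].push(line);
  }
  const allRtcpMux = sections.length > 0 &&
    sections.every((s) => s.includes('a=rtcp-mux'));

  return { hasBundle, allRtcpMux, ok: hasBundle && allRtcpMux };
}

const sampleSdp = [
  'v=0',
  'a=group:BUNDLE 0 1',
  'm=audio 9 UDP/TLS/RTP/SAVPF 111',
  'a=mid:0',
  'a=rtcp-mux',
  'm=video 9 UDP/TLS/RTP/SAVPF 96',
  'a=mid:1',
  'a=rtcp-mux',
].join('\r\n');
console.log(checkMultiplexing(sampleSdp).ok); // true
```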

Production Configuration

The publisher side is where simulcast is born. The client declares three spatial layers on the video sender — typically a full-resolution layer, a half-scale layer, and a quarter-scale layer — each with its own maxBitrate and scaleResolutionDownBy. The server then has three quality tiers to choose from per subscriber without ever asking the publisher to change anything. The configuration below sets up a publisher connection to an SFU with three-layer simulcast and the multiplexing policies a server tier expects.

// Publisher RTCPeerConnection to an SFU with three-layer simulcast
const pcConfig = {
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' },             // reflexive discovery
    {
      urls: 'turns:turn.example.com:5349?transport=tcp',   // TLS relay for strict networks
      username: ephemeralUser,                             // HMAC time-limited credential
      credential: ephemeralPass                            // never a static secret
    }
  ],
  bundlePolicy: 'max-bundle',     // one 5-tuple for all media to the server
  rtcpMuxPolicy: 'require'        // RTP and RTCP share the port
};

const publisher = new RTCPeerConnection(pcConfig);

// Declare three simulcast layers up front — the SFU forwards whichever fits each subscriber
publisher.addTransceiver(localVideoTrack, {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },   // quarter — ~180p floor
    { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },   // half — ~360p mid
    { rid: 'f', scaleResolutionDownBy: 1, maxBitrate: 1_700_000 }  // full — ~720p top
  ]
});
publisher.addTrack(localAudioTrack);  // audio never simulcasts — one Opus stream

// The SFU answers; it will forward the 'q', 'h', or 'f' layer per subscriber on demand
const offer = await publisher.createOffer();
await publisher.setLocalDescription(offer);
signaling.send({ type: 'publish', sdp: offer.sdp });

Order the encodings low-to-high, since some engines tie the array order to the layer-allocation logic, and keep the bitrate ratio between adjacent layers at roughly 3:1 to 4:1 (the configuration above steps by about 3.3×) so each tier is a meaningful step. The maxBitrate ceilings are not just throttles: they bound how much egress each forwarded layer costs the server, which directly drives the per-node capacity math in the scaling section. How the server reacts when a subscriber’s estimate falls between two layers is the subject of Selective Forwarding Unit Design.
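A small sanity check on the sendEncodings array can catch ordering mistakes before the offer is created. The helper below is a hypothetical sketch: it confirms that maxBitrate strictly increases through the array and reports the ratio between adjacent layers, leaving the acceptable range for you to decide.

```javascript
// Reports whether maxBitrate rises through the array and the step between layers.
function checkEncodings(encodings) {
  const ratios = [];
  let ascending = true;
  for (let i = 1; i < encodings.length; i++) {
    const ratio = encodings[i].maxBitrate / encodings[i - 1].maxBitrate;
    ratios.push(Number(ratio.toFixed(2)));
    if (ratio <= 1) ascending = false; // a layer is not larger than the one before it
  }
  return { ascending, ratios };
}

const sendEncodings = [
  { rid: 'q', scaleResolutionDownBy: 4, maxBitrate: 150_000 },
  { rid: 'h', scaleResolutionDownBy: 2, maxBitrate: 500_000 },
  { rid: 'f', scaleResolutionDownBy: 1, maxBitrate: 1_700_000 },
];
console.log(checkEncodings(sendEncodings)); // { ascending: true, ratios: [ 3.33, 3.4 ] }
```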

Section Deep-Dives

This guide is the hub; each subsection routes to a focused reference on one part of the media-server tier.

SFU vs MCU Topologies

Choosing between forwarding and mixing is the first architectural decision, and it is driven by client capability, recording needs, and cost. The SFU vs MCU Topologies guide compares the two end to end — CPU versus bandwidth profiles, latency budgets, endpoint constraints, and the hybrid designs that mix only for recording — and works the cost-and-quality trade-off math for realistic room sizes.

Selective Forwarding Unit Design

An SFU is mostly a packet router with opinions. The Selective Forwarding Unit Design guide covers the internals: RTP demultiplexing by SSRC and rid, per-subscriber sender state, keyframe (PLI/FIR) request handling on layer switches, NACK and RTX retransmission, and the bandwidth-aware logic that selects which layer each subscriber should receive.

Simulcast-Aware Forwarding

Forwarding the right layer to the right subscriber is what makes an SFU scale across heterogeneous networks. The Simulcast-Aware Forwarding guide details how the server demuxes the three arriving spatial layers, tracks each subscriber’s downlink estimate, switches layers on keyframe boundaries to avoid decode corruption, and rewrites sequence numbers and timestamps so the switch is seamless to the receiver.

Server-Side Recording & Composition

A forwarded set of independent streams is not a file you can hand to a user. The Server-Side Recording & Composition guide covers recording every track, then compositing the room into a single layout server-side — decode, mix to a grid or active-speaker view, re-encode, and mux to MP4 — including the synchronization and timestamp-alignment problems that make multi-party recording hard.

Load Balancing & Scaling SFUs

One node cannot hold a 500-person room, and egress is the ceiling. The Load Balancing & Scaling SFUs guide covers horizontal scaling: routing whole rooms to nodes by capacity, sharding a single large room across multiple SFUs with server-to-server forwarding, cascading nodes across regions, and the health-check and drain logic that lets you deploy without dropping calls.
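Routing whole rooms by capacity can be as simple as picking the node with the most spare egress. The sketch below is an assumption-laden illustration, not the algorithm from the linked guide: the node shape ({ id, capacityBps, usedBps, draining }) and the `pickNode` name are hypothetical, and a real balancer would also weigh region and CPU.

```javascript
// Pick the node with the most spare egress that can still fit the room.
function pickNode(nodes, roomEgressBps) {
  let best = null;
  for (const node of nodes) {
    if (node.draining) continue;                 // draining nodes take no new rooms
    const spare = node.capacityBps - node.usedBps;
    if (spare < roomEgressBps) continue;         // would exceed the egress ceiling
    if (best === null || spare > best.spare) best = { id: node.id, spare };
  }
  return best ? best.id : null;                  // null: shard the room or add capacity
}

const nodes = [
  { id: 'sfu-a', capacityBps: 10e9, usedBps: 9e9, draining: false },
  { id: 'sfu-b', capacityBps: 10e9, usedBps: 4e9, draining: false },
  { id: 'sfu-c', capacityBps: 10e9, usedBps: 1e9, draining: true },
];
console.log(pickNode(nodes, 2e9)); // sfu-b
console.log(pickNode(nodes, 8e9)); // null
```

A null result is the signal to shard the room across nodes instead of placing it on a single one.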

Failure Modes & Anti-Patterns

Reaching for an SFU when mesh suffices — or mesh when it doesn’t. Below four peers, mesh has lower latency and zero server cost; above six it collapses uplinks. Pick the topology from the room-size distribution, not from a default.
Forwarding the top simulcast layer to every subscriber. This defeats the entire point: a subscriber on a 600 kbps downlink that receives the 1.7 Mbps layer will stall and drop. Always select the layer per subscriber from its estimate.
Switching layers off a keyframe boundary. Forwarding the new layer mid-GOP hands the decoder frames that reference data it never received, producing green frames or a frozen image. Request a keyframe (PLI) and switch on its arrival.
Not rewriting sequence numbers and timestamps on a layer switch. Each simulcast layer has its own RTP sequence space; forwarding them onto one subscriber stream without continuous rewriting breaks the receiver’s jitter buffer. The SFU must maintain a per-subscriber rewrite offset.
Treating an MCU’s transcode latency as free. Mixing adds a decode-plus-encode budget of 100–300 ms; on interactive calls that is the difference between conversational and walkie-talkie. Use mixing for broadcast and recording, not low-latency interaction.
Pinning a whole room to one node with no shard plan. A single SFU’s egress is finite; a large room needs server-to-server forwarding across nodes before it hits the egress ceiling, not after the call has already degraded.
Forgetting the relay tier under the server. The publisher and subscriber connections still need STUN/TURN; an SFU on a public IP can still be unreachable from a UDP-blocking enterprise network without a TLS relay on 443.
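The sequence-number item in the list above is the one most often implemented incorrectly, so a minimal sketch of the per-subscriber rewrite offset may help. The class below is illustrative (`SequenceRewriter` is a hypothetical name): it assumes packets arrive in order and handles only the 16-bit sequence number. RTP timestamps need the same kind of offset, and a real SFU would track the highest sequence sent rather than the last one.

```javascript
// Keeps one subscriber's outgoing sequence numbers continuous across layer switches.
class SequenceRewriter {
  constructor() {
    this.currentRid = null; // layer currently being forwarded
    this.offset = 0;        // added to incoming sequence numbers, modulo 2^16
    this.lastOut = null;    // last sequence number sent to the subscriber
  }

  rewrite(rid, seq) {
    if (rid !== this.currentRid) {
      // First packet of a new layer: continue right after the last packet sent.
      this.offset = this.lastOut === null ? 0 : (this.lastOut + 1 - seq) & 0xffff;
      this.currentRid = rid;
    }
    const out = (seq + this.offset) & 0xffff; // wrap at 16 bits like RTP does
    this.lastOut = out;
    return out;
  }
}

const rewriter = new SequenceRewriter();
console.log(rewriter.rewrite('f', 1000));  // 1000
console.log(rewriter.rewrite('f', 1001));  // 1001
console.log(rewriter.rewrite('h', 50000)); // 1002
console.log(rewriter.rewrite('h', 50001)); // 1003
```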

Debugging & Observability

A media server fails in ways a single browser never shows you, because the symptom is on one subscriber while the cause is in the forwarding decision for that pair. Server-side observability is built from the same primitives as client getStats(), just aggregated per peer connection the server holds.

Per-subscriber inbound/outbound stats. Every modern SFU exposes a getStats()-equivalent for each transport it terminates. Poll at 1 s intervals, since finer polling adds overhead without finer signal, and correlate, per subscriber: the forwarded layer currently selected, the subscriber’s availableOutgoingBitrate (the server’s send-side estimate to that peer, read from the nominated candidate-pair), the NACK count on the server’s outbound-rtp, and the packet loss the subscriber reports back in remote-inbound-rtp. A subscriber stuck on the quarter layer while its estimate sits comfortably above the half-layer bitrate points to a stuck layer-selection state machine, not a network problem.

// Per-subscriber forwarding health from the server's send-side stats
async function subscriberHealth(subscriberPc, selectedRid) {
  const stats = await subscriberPc.getStats();
  let estimate, lost = 0, sent = 0, nack = 0;

  for (const report of stats.values()) {
    if (report.type === 'candidate-pair' && report.nominated && report.state === 'succeeded') {
      estimate = report.availableOutgoingBitrate;   // server -> this subscriber, bps
    }
    if (report.type === 'outbound-rtp' && report.kind === 'video') {
      sent += report.packetsSent ?? 0;               // sum across every forwarded video stream
      nack += report.nackCount ?? 0;                 // subscriber asking for resends
    }
    if (report.type === 'remote-inbound-rtp' && report.kind === 'video') {
      lost += Math.max(0, report.packetsLost ?? 0);  // loss reported back; can go negative on duplicates
    }
  }
  // packetsSent already includes packets that were later lost, so divide by sent alone
  const lossPct = sent > 0 ? (lost / sent * 100).toFixed(2) : '0.00';
  // forwarding 'f' (1.7 Mbps) on a sub-1 Mbps estimate is the classic misforward
  console.log(`rid=${selectedRid} estimate=${estimate} loss=${lossPct}% nack=${nack}`);
  return { rid: selectedRid, estimate, lossPct: Number(lossPct), nack };
}

Forwarded-layer metrics over time. Log every layer switch with its trigger (estimate drop, keyframe arrival, subscriber resize) keyed on (roomId, publisherId, subscriberId). A healthy room shows occasional switches that track real bandwidth changes; a flapping subscriber — switching up and down every few seconds — means the selection hysteresis is too tight and is itself causing keyframe storms, since every upward switch costs a PLI and a fresh keyframe from the publisher.
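Flapping can be detected directly from the switch log. The function below is a minimal sketch: `isFlapping` is a hypothetical name, and the 30-second window and four-switch threshold are assumptions to tune against your own traffic, not recommended values.

```javascript
// Returns true if more than `maxSwitches` layer switches fall inside any
// sliding window of `windowMs`. Input is a list of switch timestamps in ms.
function isFlapping(switchTimesMs, windowMs = 30_000, maxSwitches = 4) {
  const times = [...switchTimesMs].sort((a, b) => a - b);
  let start = 0;
  for (let end = 0; end < times.length; end++) {
    // Shrink the window from the left until it spans at most windowMs.
    while (times[end] - times[start] > windowMs) start++;
    if (end - start + 1 > maxSwitches) return true;
  }
  return false;
}

console.log(isFlapping([0, 3000, 6000, 9000, 12000]));          // true
console.log(isFlapping([0, 60000, 120000, 180000, 240000]));    // false
```

Running this per (roomId, publisherId, subscriberId) key turns the log into an alert that points at the hysteresis setting rather than at the network.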

Publisher-side layer arrival. Confirm the publisher is actually sending all three layers. Under CPU pressure or its own bandwidth estimate, a browser will drop the top simulcast layer silently, so the SFU has nothing to forward even to high-bandwidth subscribers. Read the publisher’s outbound-rtp per rid and alert when an expected layer’s bytesSent stays flat.
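The flat-counter check can be sketched as a comparison of two snapshots. The helper below is illustrative: `stalledLayers` is a hypothetical name, and each snapshot is assumed to be a plain object mapping rid to bytesSent, built from the publisher's outbound-rtp reports (which carry both the `rid` and `bytesSent` fields).

```javascript
// Compare two snapshots of per-rid bytesSent taken some seconds apart.
// Returns the expected rids whose counter did not advance or never appeared.
function stalledLayers(prev, curr, expectedRids = ['q', 'h', 'f']) {
  return expectedRids.filter((rid) => {
    const before = prev[rid] ?? 0;
    const after = curr[rid];
    return after === undefined || after <= before;
  });
}

const earlier = { q: 1_000, h: 5_000, f: 20_000 };
const later = { q: 2_000, h: 9_000, f: 20_000 };
console.log(stalledLayers(earlier, later)); // [ 'f' ]
```

A stalled top layer on the publisher explains why high-bandwidth subscribers sit on the half layer even though the server's selection logic is working correctly.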

For deeper single-session diagnosis, the client side still has chrome://webrtc-internals and Firefox about:webrtc, which show the publisher’s three outbound streams and the subscriber’s inbound stream exactly as the server sees them — the reading technique carries over from the WebRTC Protocol Stack & Signaling Servers guide. Correlate the client dump’s per-rid outbound stats with the server’s per-subscriber forwarding log on a shared session ID, and a “frozen video for one user” report resolves to a specific layer-selection or keyframe event in a single query.

FAQ

At what room size do I actually need a media server instead of mesh? Mesh works well up to roughly four to six participants, where each client still has spare uplink and CPU to encode and send a stream to every peer. Past that, each client's upload grows linearly with room size (and the room-wide total as O(n²)), consumer uplinks saturate, and you move to an SFU. The crossover depends on the resolution and the worst client’s uplink, not a fixed number, so measure your own room-size distribution.

Why are SFUs more common than MCUs in modern deployments? An SFU never decodes media, so it spends almost no CPU per stream, adds only 5–30 ms of forwarding latency, and preserves the publisher’s encoding end to end. An MCU’s per-output decode-plus-encode cost is far higher and adds 100–300 ms of transcode latency. MCUs still win when endpoints are too weak to receive many streams, or when you need a single composited output for recording or broadcast.

How does an SFU send different quality to different subscribers without transcoding? The publisher sends multiple simulcast layers as separate RTP streams. The SFU demultiplexes them by rid/SSRC and, for each subscriber, forwards exactly one layer chosen from that subscriber’s downlink estimate. Because it only selects among streams that already exist, no re-encode is needed — the work is sequence-number and timestamp rewriting, detailed in Simulcast-Aware Forwarding.

What is the scaling bottleneck for an SFU, and how do I get past it? Egress bandwidth. An SFU forwards up to N-1 streams to each of N subscribers, so a single node’s network interface caps the room. You get past it by routing whole rooms to nodes by available capacity and, for very large rooms, sharding one room across multiple SFUs that forward between each other server-to-server — the patterns in Load Balancing & Scaling SFUs.

Does putting a server in the path remove the need for STUN and TURN? No. The publisher and subscriber connections are still ordinary WebRTC peer connections whose remote agent happens to be your server, so ICE still runs. Clients behind UDP-blocking firewalls still need a TLS relay on 443, and the server still needs reachable candidates — the connectivity tier from ICE Candidate Gathering & Filtering and TURN Server Configuration & Auth sits underneath the media server unchanged.

Related: start with SFU vs MCU Topologies to pick a topology, then move into Simulcast-Aware Forwarding and Load Balancing & Scaling SFUs as your rooms grow.
