SFU vs MCU Topologies

Once a call grows past a handful of participants, the peer-to-peer mesh that powers two-party WebRTC collapses under its own uplink: every participant must encode and send a separate stream to every other participant, so an N-party mesh forces each browser to run N-1 encoders and N-1 uploads. A central media server fixes this, and the two classic server topologies are the Selective Forwarding Unit (SFU) and the Multipoint Control Unit (MCU). This page is part of the Media Server Architecture: SFU & MCU guide. Its goal is to give you a precise, decision-ready model of how each topology ingests, processes, and emits media, so you can pick the right one before you write a line of server code.
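To make the mesh arithmetic concrete, here is a minimal sketch of per-client stream counts for each topology. The function name `perClientStreams` is hypothetical, and the counts assume full subscription, one video stream per sender, and no simulcast.

```javascript
// Per-client video stream counts for an N-party call.
// Assumptions: everyone subscribes to everyone, no simulcast layers.
function perClientStreams(n, topology) {
  if (!Number.isInteger(n) || n < 2) {
    throw new RangeError('n must be an integer >= 2');
  }
  switch (topology) {
    case 'mesh': // one encode and one upload per remote peer
      return { encoders: n - 1, uploads: n - 1, downloads: n - 1 };
    case 'sfu':  // one upload to the server, N-1 forwarded streams back
      return { encoders: 1, uploads: 1, downloads: n - 1 };
    case 'mcu':  // one upload, one composited stream back
      return { encoders: 1, uploads: 1, downloads: 1 };
    default:
      throw new Error(`unknown topology: ${topology}`);
  }
}
```

For a six-party call, `perClientStreams(6, 'mesh')` reports five encoders and five uploads per browser, while both server topologies reduce that to one of each.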

The distinction is not academic. It determines your per-participant server CPU cost, your downlink bandwidth bill, the decode load you place on mobile clients, and how much layout flexibility you can offer. The four steps below trace the same media packet through each topology — ingest, processing, emit, and verification — and the comparison table and edge-case notes that follow turn those mechanics into operational guidance.

An MCU decodes all inputs and emits one composited stream per participant; an SFU forwards the original encoded streams untouched, so each participant receives up to N-1 separate streams and runs a decoder for each.

Step 1 — How each topology ingests media

Ingest looks identical on the wire and diverges immediately afterward. In both topologies every participant establishes one RTCPeerConnection to the server, completes ICE and the DTLS-SRTP handshake, and pushes encoded RTP exactly as it would to a peer. The server is, from the browser’s perspective, just another ICE agent — which is why the same ICE Candidate Gathering & Filtering rules and TURN fallback apply, and why you still budget 8–20% of sessions onto relays.

The divergence is what the server does with the inbound SRTP. An SFU terminates SRTP, decrypts to plaintext RTP, inspects headers, and stops there: it never decodes the encoded payload. It reads the RTP header, the transport-wide-cc sequence-number header extension, and (for layered media) the simulcast/SVC markers, but the VP8 or H.264 bitstream itself is opaque cargo. An MCU terminates SRTP and then fully decodes every inbound stream to raw YUV frames and PCM samples. That decode is the expensive, defining act of an MCU: a single 720p30 VP8 decode costs real CPU, and the MCU pays it once per inbound stream per room.

// SFU ingest: terminate SRTP, read headers, keep payload encoded
function onInboundRtp(packet, participant) {
  const header = parseRtpHeader(packet);          // ssrc, seq, ts, marker
  const layer  = readSimulcastLayer(packet);      // rid: 'q' | 'h' | 'f' — no decode
  participant.tracks.set(header.ssrc, { header, layer, payload: packet.payload });
  // payload stays VP8/H.264-encoded; the SFU never owns a decoder
  routeToSubscribers(participant, header.ssrc, layer);
}
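The header read that an SFU performs is cheap because the RTP fixed header is only 12 bytes. As a rough sketch of what a helper like the one named above might do, the function below parses that fixed header following the RFC 3550 layout. The name `parseRtpFixedHeader` and the returned field names are assumptions for illustration, and the function takes raw packet bytes rather than the packet object used in the snippet above. It does not parse header extensions or the payload.

```javascript
// Minimal RTP fixed-header parser (layout per RFC 3550, section 5.1).
// `bytes` is a Uint8Array (or Node Buffer) holding one decrypted RTP packet.
function parseRtpFixedHeader(bytes) {
  if (bytes.length < 12) {
    throw new RangeError('RTP packet shorter than the 12-byte fixed header');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const b0 = view.getUint8(0);
  const b1 = view.getUint8(1);
  const version = b0 >> 6;
  if (version !== 2) throw new Error(`unsupported RTP version ${version}`);
  const csrcCount = b0 & 0x0f;
  return {
    version,
    padding: (b0 & 0x20) !== 0,
    extension: (b0 & 0x10) !== 0,   // true if a header extension block follows
    csrcCount,
    marker: (b1 & 0x80) !== 0,
    payloadType: b1 & 0x7f,
    seq: view.getUint16(2),          // DataView reads big-endian by default
    ts: view.getUint32(4),
    ssrc: view.getUint32(8),
    headerLength: 12 + 4 * csrcCount // excludes any header extension block
  };
}
```

Nothing here touches the codec bitstream, which is the point: the routing decision needs only these fields plus the negotiated header extensions.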

Step 2 — How each topology processes media

This is where cost is decided. The SFU’s “processing” is a routing decision, not a media transform: for each subscriber it picks which of a sender’s encoded streams or layers to forward, rewrites RTP sequence numbers and timestamps so the subscriber sees one continuous stream, and translates RTCP feedback (PLI, NACK, REMB/transport-wide-cc) between sender and receiver. It does zero pixel work. When senders publish multiple resolutions, the SFU’s job becomes choosing the right layer per subscriber — the mechanics of which are covered in Simulcast-Aware Forwarding and the per-subscriber bandwidth logic in Bandwidth-Aware Layer Selection in an SFU.
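The sequence-number rewrite mentioned above can be sketched as a small per-subscriber state machine. This is an illustrative sketch, not any particular SFU's implementation: the class name `SeqRewriter` is hypothetical, packets are assumed to arrive in order, and timestamps (which need the same treatment plus clock alignment between layers) are left out.

```javascript
// Hypothetical per-subscriber rewriter: keeps outgoing RTP sequence numbers
// continuous when the forwarded source (SSRC) changes, e.g. on a layer switch.
class SeqRewriter {
  constructor() {
    this.offset = 0;         // added to each incoming seq, modulo 2^16
    this.lastOut = null;     // last sequence number sent to the subscriber
    this.currentSsrc = null; // source currently being forwarded
  }

  rewrite(ssrc, seq) {
    if (this.currentSsrc !== ssrc) {
      // Source switched: realign so the next outgoing number is lastOut + 1.
      if (this.lastOut !== null) {
        this.offset = (this.lastOut + 1 - seq) & 0xffff;
      }
      this.currentSsrc = ssrc;
    }
    const out = (seq + this.offset) & 0xffff; // 16-bit wraparound
    this.lastOut = out;
    return out;
  }
}
```

A subscriber that was receiving sequence numbers 100 and 101 from one layer and is then switched to a layer whose next packet is numbered 5000 sees 102, so its jitter buffer does not interpret the switch as massive loss.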

The MCU runs a full media pipeline per room: decode every inbound video to raw frames, composite them onto a canvas according to a layout (grid, active-speaker, presentation), mix every inbound audio track into a single PCM bus with gain control and echo suppression, then re-encode the composite to one output stream. Because the output is a freshly encoded single stream, the MCU can transcode codecs and resolutions freely — it can accept VP8 from one client and emit H.264 to another — which is also why it can serve a SIP endpoint or a dial-in phone bridge that an SFU cannot. The price is that this pipeline is unavoidable and scales with participant count: more inputs means more decodes and a heavier composite per output frame.

// MCU processing: decode all, composite, mix, re-encode once per output layout
function renderMixedFrame(room) {
  const frames = room.participants.map(p => p.decoder.pull());   // N raw YUV decodes
  const canvas = layoutEngine.composite(frames, room.layout);    // grid / active-speaker
  const audio  = audioMixer.sum(room.participants.map(p => p.pcm)); // single PCM bus
  const encoded = room.encoder.encode(canvas);                   // 1 re-encode per layout
  return { video: encoded, audio: audioEncoder.encode(audio) };  // identical for all viewers
}
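The audio step in the snippet above hides the one piece of arithmetic that is easy to get wrong: summing samples can overflow the 16-bit range. A minimal sketch of a saturating mix follows, assuming every input is an equal-length `Int16Array` frame at the same sample rate. The name `mixPcm16` is hypothetical, and gain control is omitted.

```javascript
// Sum equal-length 16-bit PCM frames into one frame, clamping each sample
// to the Int16 range so loud overlapping speech saturates instead of wrapping.
function mixPcm16(frames) {
  if (frames.length === 0) return new Int16Array(0);
  const length = frames[0].length;
  for (const frame of frames) {
    if (frame.length !== length) {
      throw new RangeError('all PCM frames must have the same length');
    }
  }
  const out = new Int16Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (const frame of frames) sum += frame[i];
    out[i] = Math.max(-32768, Math.min(32767, sum));
  }
  return out;
}
```

Without the clamp, writing an out-of-range sum into an `Int16Array` wraps around and produces loud clicks; the cost of clamping is one comparison pair per sample, and it grows with the number of inputs like the rest of the MCU pipeline.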

Step 3 — How each topology emits media

The SFU emits N-1 distinct encoded streams to each participant (or fewer, if a client subscribes to a subset). Server-side egress is therefore O(N²) in the worst case: a 10-party call with every participant subscribed to every other is up to 90 forwarded streams. The server bandwidth cost is real, but the CPU cost stays near zero because nothing is re-encoded. Each receiving client runs up to N-1 decoders, which is exactly where mobile decode limits bite — most phones cap concurrent hardware video decoders at 1–3.

The MCU emits one stream per participant, encoded once per distinct layout. Every viewer who wants the same grid receives the same encoded output, so a passive audience of thousands can share a single encode. The receiving client runs exactly one decoder regardless of room size, which is the MCU’s headline advantage for low-power and embedded endpoints. The cost moves server-side: every distinct layout is a separate encode, so per-participant custom views (each user seeing a different active-speaker arrangement) erase the single-encode savings. In that case the MCU runs up to N encodes on top of its N decodes, so its CPU cost scales per viewer the way SFU egress does, without the SFU’s zero-encode benefit.
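The server-side counts for both topologies can be put side by side with a small calculation. This is a sketch under stated assumptions: video only, every participant publishes and subscribes to everyone, and `distinctLayouts` is a hypothetical parameter for how many different composites the MCU must encode.

```javascript
// Worst-case server-side video work for a room of n publishing participants.
function serverLoad(n, distinctLayouts = 1) {
  if (!Number.isInteger(n) || n < 2) {
    throw new RangeError('n must be an integer >= 2');
  }
  return {
    // SFU: each of n senders is forwarded to the other n-1; no codec work.
    sfu: { streamsOut: n * (n - 1), decodes: 0, encodes: 0 },
    // MCU: one stream out per participant, one decode per input,
    // one encode per distinct layout.
    mcu: { streamsOut: n, decodes: n, encodes: distinctLayouts },
  };
}
```

`serverLoad(10).sfu.streamsOut` is 90, matching the 10-party figure above, while `serverLoad(10, 10).mcu` shows 10 decodes and 10 encodes once every participant wants a custom view.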

A hybrid worth naming: many production systems run an SFU as the live topology and bolt an MCU-style compositor on only for server-side recording, so the live call pays SFU economics while the archive gets a single mixed file. That pattern is detailed in Server-Side Recording & Composition.

Step 4 — Verification

Verify the topology behaves as designed by reading getStats() on both the client and the server, polled at 1 s intervals to match the rest of your observability.

// Confirm topology from the subscriber side
const stats = await pc.getStats();
let inboundVideoTracks = 0;
for (const r of stats.values()) {
  if (r.type === 'inbound-rtp' && r.kind === 'video') inboundVideoTracks++;
}
// SFU room of N participants → inboundVideoTracks ≈ N-1 (or your subscribed subset)
// MCU room of any size      → inboundVideoTracks === 1 (the mixed stream)
console.log(`inbound video streams = ${inboundVideoTracks}`);

On the server, confirm an SFU shows near-zero encode time and framesEncoded flat on egress (it is forwarding, not encoding), while an MCU shows totalEncodeTime and decoder utilization climbing with participant count. If an “SFU” reports rising encode time per added participant, something is transcoding when it should be forwarding — a misconfiguration that quietly converts your SFU economics into MCU economics. Cross-check the active path and RTT through the candidate-pair report as described in Bandwidth Estimation & Congestion Control.
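One way to automate the server-side check is to diff two snapshots of the same egress stats entry. The sketch below assumes your server exposes W3C-style `outbound-rtp` fields (`framesEncoded`, and `totalEncodeTime` in seconds); many SFUs expose their own metrics instead, in which case the field names here are placeholders. The function name `encodeActivity` is hypothetical.

```javascript
// Compare two snapshots of one outbound-rtp stats entry taken some time apart.
// A forwarding-only SFU should show no encoded frames between snapshots.
function encodeActivity(prev, curr) {
  const frames = (curr.framesEncoded ?? 0) - (prev.framesEncoded ?? 0);
  const seconds = (curr.totalEncodeTime ?? 0) - (prev.totalEncodeTime ?? 0);
  return {
    framesEncodedDelta: frames,
    avgEncodeMsPerFrame: frames > 0 ? (seconds / frames) * 1000 : 0,
    isEncoding: frames > 0, // true means something on this path is transcoding
  };
}
```

Run it per egress stream at your polling interval and alert when `isEncoding` is true on a path that is supposed to be forward-only.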

Comparison Table

Dimension	SFU	MCU
Server media work	Route + rewrite RTP headers	Decode all + composite + re-encode
Server CPU per participant	Very low (no transcode)	High (decode per input, encode per output layout)
Server egress (client downlink)	High: up to N-1 streams/participant	Low: 1 stream/participant
Client decoders needed	Up to N-1	Exactly 1
Mobile / embedded friendliness	Limited by decode count	Excellent (single decode)
Layout flexibility	Client-side, fully flexible	Server-fixed per layout
End-to-end latency	Lower (no decode/encode hop)	Higher (+30–150 ms pipeline)
End-to-end encryption	Preserved (payload untouched)	Broken (server decrypts + re-encodes)
Codec transcoding / SIP bridge	No	Yes
Cost driver	Bandwidth	CPU

The trade-offs in this table are quantified per participant-hour, with concrete CPU and dollar figures, in SFU vs MCU Cost & Quality Trade-offs.

Edge Cases & Browser Quirks

Safari concurrent decode ceiling. Safari on older iPhones and iPads aggressively limits simultaneous hardware H.264 decoders — frequently to 1–2. An SFU room that forwards 4+ streams to such a device can silently drop frames or fall back to a software decoder that spikes CPU and battery. Detect the device class and either subscribe to fewer streams or route those clients to an MCU output.
Chrome simulcast on the SFU path. Chrome publishes simulcast layers (rid q/h/f) that an SFU must read to forward the right resolution. If the server ignores rid, Chrome may still send all layers, wasting uplink — confirm the SFU honors the simulcast envelope negotiated in the SDP.
Firefox lacks simulcast for some codecs. Firefox historically did not offer simulcast for certain codec/profile combinations, so an SFU that assumes three layers gets one. Treat layer availability as per-browser and per-codec, negotiated through the SDP Offer/Answer Lifecycle, not guaranteed.
MCU and end-to-end encryption. Any topology that decodes media (every MCU) necessarily terminates encryption at the server. If your threat model requires the server never see plaintext, an MCU is off the table and you must stay on an SFU with insertable-stream E2EE.
Single mixed stream and active-speaker switching. An MCU’s active-speaker layout switches inside one stream, so clients see no track changes; an SFU switches by changing subscriptions, which can trigger renegotiation unless you reuse transceivers — see Replacing Video Tracks Without Renegotiation.
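For the decode-ceiling case, capping subscriptions can be as simple as ranking publishers by recent speech and keeping the top few. The sketch below is one possible policy, not a prescribed one: `pickSubscriptions`, the `lastSpokeAt` field (a timestamp in milliseconds), and the `maxDecoders` budget are all hypothetical names, and how you detect the device class is left to your application.

```javascript
// Choose which publishers a constrained client should subscribe to.
// publishers: array of { id, lastSpokeAt }; maxDecoders: decoder budget.
function pickSubscriptions(publishers, maxDecoders) {
  return [...publishers]                           // do not mutate the input
    .sort((a, b) => b.lastSpokeAt - a.lastSpokeAt) // most recent speaker first
    .slice(0, Math.max(0, maxDecoders))
    .map(p => p.id);
}
```

With a budget of two decoders and publishers `a`, `b`, `c` who last spoke at 10, 30, and 20 respectively, the client subscribes to `b` and `c` and leaves `a` unsubscribed until they speak again.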

Common Implementation Mistakes

Choosing an MCU for scale, then drowning in CPU. MCUs feel “scalable” because clients receive one stream, but server CPU grows with every input. A room of 50 active publishers is 50 decodes plus composites per output — far more expensive than an SFU forwarding the same media untouched.
Choosing an SFU for low-power clients without capping subscriptions. Forwarding N-1 streams to a phone that can decode only three fails as soon as the room grows past four participants. Cap subscriptions, use active-speaker culling, or terminate those clients on an MCU output.
Assuming the SFU re-encodes. Teams sometimes try to “fix quality” by transcoding inside the SFU, unknowingly converting it into a partial MCU and destroying its cost model. If you need transcoding, choose that deliberately.
Ignoring the encryption consequence. Shipping an MCU into a privacy-sensitive product and only later discovering the server holds plaintext is a costly architecture reversal. Decide E2EE requirements before picking the topology.
One topology for every room size. A 2–4 person call may not need a server at all; a 10–50 person call wants an SFU; a 10,000-viewer broadcast may want an MCU or hybrid edge. Size the topology to the room, and plan for Load Balancing & Scaling SFUs once one node is no longer enough.
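The sizing advice in the last item, combined with the encryption constraint, can be expressed as a small decision helper. Treat this as a starting sketch only: the function name `suggestTopology`, its parameters, and the thresholds (4 participants, 1,000 passive viewers) are illustrative assumptions, not figures from a benchmark.

```javascript
// Rough topology suggestion by room shape. Thresholds are assumptions.
function suggestTopology({ participants, passiveViewers = 0, requiresE2ee = false }) {
  if (passiveViewers >= 1000) {
    // A large audience on one layout favors a shared encode, unless the
    // server must never see plaintext, which rules out any MCU stage.
    return requiresE2ee ? 'sfu' : 'mcu-or-hybrid';
  }
  if (participants <= 4) return 'mesh'; // small calls may not need a server
  return 'sfu';                         // interactive mid-size rooms
}
```

For example, `suggestTopology({ participants: 20 })` returns `'sfu'`, and adding `passiveViewers: 10000` changes the answer to `'mcu-or-hybrid'` unless `requiresE2ee` is set.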

FAQ

Is an SFU always cheaper than an MCU?

It depends which resource you are paying for. An SFU is dramatically cheaper on CPU because it never transcodes, but it is more expensive on egress bandwidth because it emits up to N-1 streams per participant. An MCU inverts that: low egress, high CPU. For most interactive conferencing the SFU’s bandwidth bill is cheaper than the MCU’s compute bill, but a large passive audience watching one fixed layout flips the math toward the MCU.

Can I run both in the same product?

Yes, and many production systems do. Run an SFU for the live, interactive call and invoke MCU-style composition only where a single stream is required — server-side recording, a dial-in phone bridge, or a low-power broadcast endpoint. The live path keeps SFU latency and CPU economics while the composite path gets a single mixed output.

Why does an MCU add latency?

Every MCU output passes through decode → composite/mix → re-encode, a pipeline that adds roughly 30–150 ms depending on codec, resolution, and frame buffering. An SFU only rewrites RTP headers and forwards, so it adds little beyond network transit. For tight conversational latency the SFU wins; for a passive audience the extra delay is usually acceptable.

Does an SFU break end-to-end encryption?

No — that is one of its defining advantages. Because the SFU forwards the encoded payload untouched, it can route media it cannot read, which makes insertable-stream E2EE possible. An MCU must decrypt and decode to composite, so it always sees plaintext and cannot offer true end-to-end encryption.

Related: start from the Media Server Architecture: SFU & MCU guide, then quantify the decision with SFU vs MCU Cost & Quality Trade-offs, and continue into Selective Forwarding Unit Design, Simulcast-Aware Forwarding, and Load Balancing & Scaling SFUs.
