SFU vs MCU Topologies

Once a call grows past a handful of participants, the peer-to-peer mesh that powers two-party WebRTC collapses under its own uplink: every participant must encode and send a separate stream to every other participant, so an N-party mesh forces each browser to run N-1 encoders and N-1 uploads. A central media server fixes this, and the two classic server topologies are the Selective Forwarding Unit (SFU) and the Multipoint Control Unit (MCU). This page is part of the Media Server Architecture: SFU & MCU guide, and its goal is to give you a precise, decision-ready model of how each topology ingests, processes, and emits media so you can pick the right one before you write a line of server code.
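The mesh arithmetic above can be sketched in a few lines. This is a minimal illustration, not a capacity planner: `meshLoad`, `uplinkKbps`, and the `streamKbps` parameter are hypothetical names, and the server case assumes each participant uploads a single layer (simulcast would raise that figure).

```javascript
// Per-participant and total stream counts for a full mesh of n participants.
function meshLoad(n) {
  if (!Number.isInteger(n) || n < 2) {
    throw new RangeError('n must be an integer >= 2');
  }
  const perPeer = n - 1; // encoders and uploads each browser runs
  return {
    encodersPerPeer: perPeer,
    uploadsPerPeer: perPeer,
    totalStreams: n * perPeer, // every peer sends to every other peer
  };
}

// Uplink one participant needs in a mesh versus with a media server.
// Assumption: one single-layer upload of streamKbps to the server.
function uplinkKbps(n, streamKbps) {
  return { mesh: (n - 1) * streamKbps, server: streamKbps };
}
```

For a six-party call at an assumed 1500 kbps per stream, `uplinkKbps(6, 1500)` gives 7500 kbps of uplink per participant in a mesh against 1500 kbps with a server in the middle.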

The distinction is not academic. It determines your per-participant server CPU cost, your downlink bandwidth bill, the decode load you place on mobile clients, and how much layout flexibility you can offer. The four steps below trace the same media packet through each topology (ingest, processing, emit, and verification), and the comparison table and edge-case notes that follow turn those mechanics into operational guidance.

Figure: MCU single mixed stream versus SFU N forwarded streams. On the left, an MCU decodes three inbound streams, composites them, and re-encodes them into a single mixed stream sent to each participant. On the right, an SFU forwards the three original encoded streams unchanged, so each participant receives N-1 separate streams.
An MCU decodes all inputs and emits one composited stream per participant; an SFU forwards the original encoded streams untouched, so each participant receives and decodes N-1 separate streams.

Step 1: How each topology ingests media

Ingest looks identical on the wire and diverges immediately afterward. In both topologies every participant establishes one RTCPeerConnection to the server, completes ICE and the DTLS-SRTP handshake, and pushes encoded RTP exactly as it would to a peer. The server is, from the browser's perspective, just another ICE agent, which is why the same ICE Candidate Gathering & Filtering rules and TURN fallback apply, and why you still budget 8–20% of sessions onto relays.

The divergence is what the server does with the inbound SRTP. An SFU terminates SRTP, decrypts to plaintext RTP, inspects headers, and stops there: it never touches the encoded payload. It reads the RTP header, the transport-wide-cc feedback, and (for layered media) the simulcast/SVC markers, but the VP8 or H.264 bitstream itself is opaque cargo. An MCU terminates SRTP and then fully decodes every inbound stream to raw YUV frames and PCM samples. That decode is the expensive, defining act of an MCU: a single 720p30 VP8 decode costs real CPU, and the MCU pays it once per inbound stream per room.

// SFU ingest: terminate SRTP, read headers, keep payload encoded
function onInboundRtp(packet, participant) {
  const header = parseRtpHeader(packet);          // ssrc, seq, ts, marker
  const layer  = readSimulcastLayer(packet);      // rid: 'q' | 'h' | 'f' (no decode needed)
  participant.tracks.set(header.ssrc, { header, layer, payload: packet.payload });
  // payload stays VP8/H.264-encoded; the SFU never owns a decoder
  routeToSubscribers(participant, header.ssrc, layer);
}
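The snippet above calls `parseRtpHeader` without showing it. A minimal sketch of what such a helper could look like follows, assuming the decrypted packet arrives as a `Uint8Array` and following the fixed header layout of RFC 3550. The return shape and the `payloadOffset` field are illustrative choices, not a particular server's API, and header-extension contents are skipped rather than interpreted.

```javascript
// Parse the RFC 3550 fixed RTP header without touching the encoded payload.
function parseRtpHeader(bytes) {
  if (bytes.length < 12) {
    throw new RangeError('RTP packet shorter than the 12-byte fixed header');
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const version = bytes[0] >> 6;
  if (version !== 2) throw new Error(`unsupported RTP version ${version}`);

  const hasPadding = (bytes[0] & 0x20) !== 0;
  const hasExtension = (bytes[0] & 0x10) !== 0;
  const csrcCount = bytes[0] & 0x0f;

  // The payload starts after the fixed header, the CSRC list, and any extension.
  let payloadOffset = 12 + 4 * csrcCount;
  if (hasExtension) {
    if (bytes.length < payloadOffset + 4) {
      throw new RangeError('truncated RTP header extension');
    }
    const extWords = view.getUint16(payloadOffset + 2); // length in 32-bit words
    payloadOffset += 4 + 4 * extWords;
  }
  if (payloadOffset > bytes.length) {
    throw new RangeError('RTP header longer than packet');
  }

  return {
    marker: (bytes[1] & 0x80) !== 0,
    payloadType: bytes[1] & 0x7f,
    seq: view.getUint16(2),        // network byte order (big-endian)
    timestamp: view.getUint32(4),
    ssrc: view.getUint32(8),
    hasPadding,
    payloadOffset,                 // bytes from here on stay opaque to an SFU
  };
}
```

Everything the sketch reads sits in front of `payloadOffset`, which is the point of the SFU model: routing needs only the header fields.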

Step 2: How each topology processes media

This is where cost is decided. The SFU's "processing" is a routing decision, not a media transform: for each subscriber it picks which of a sender's encoded streams or layers to forward, rewrites RTP sequence numbers and timestamps so the subscriber sees one continuous stream, and translates RTCP feedback (PLI, NACK, REMB/transport-wide-cc) between sender and receiver. It does zero pixel work. When senders publish multiple resolutions, the SFU's job becomes choosing the right layer per subscriber; the mechanics are covered in Simulcast-Aware Forwarding and the per-subscriber bandwidth logic in Bandwidth-Aware Layer Selection in an SFU.
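The sequence-number and timestamp rewriting described above can be sketched as a small stateful function. This is an illustration under stated assumptions, not a production rewriter: `createRewriter` and `SWITCH_TS_GAP` are hypothetical names, packets are assumed to arrive in order, and the timestamp gap inserted at a switch is a nominal one frame at 30 fps on a 90 kHz clock rather than a value derived from arrival times.

```javascript
// Assumption: nominal gap at a source switch, one frame at 30 fps on a 90 kHz clock.
const SWITCH_TS_GAP = 3000;

// Returns a function that maps packets from whichever source SSRC is currently
// selected onto one continuous outbound stream with a stable SSRC.
function createRewriter(outSsrc) {
  let currentSsrc = null;
  let seqOffset = 0;
  let tsOffset = 0;
  let lastOutSeq = 0;
  let lastOutTs = 0;
  let started = false;

  return function rewrite(pkt) {
    if (pkt.ssrc !== currentSsrc) {
      if (started) {
        // Re-anchor so the first packet of the new source continues the old numbering.
        seqOffset = (lastOutSeq + 1 - pkt.seq) & 0xffff;              // 16-bit wrap
        tsOffset = (lastOutTs + SWITCH_TS_GAP - pkt.timestamp) >>> 0;  // 32-bit wrap
      }
      currentSsrc = pkt.ssrc;
      started = true;
    }
    const seq = (pkt.seq + seqOffset) & 0xffff;
    const timestamp = (pkt.timestamp + tsOffset) >>> 0;
    lastOutSeq = seq;   // simplification: assumes in-order delivery
    lastOutTs = timestamp;
    return { ...pkt, ssrc: outSsrc, seq, timestamp };
  };
}
```

Only header fields change; the payload reference is copied through untouched, which is why this step costs almost no CPU.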

The MCU runs a full media pipeline per room: decode every inbound video to raw frames, composite them onto a canvas according to a layout (grid, active-speaker, presentation), mix every inbound audio track into a single PCM bus with gain control and echo suppression, then re-encode the composite to one output stream. Because the output is a freshly encoded single stream, the MCU can transcode codecs and resolutions freely (it can accept VP8 from one client and emit H.264 to another), which is also why it can serve a SIP endpoint or a dial-in phone bridge that an SFU cannot. The price is that this pipeline is unavoidable and scales with participant count: more inputs means more decodes and a heavier composite per output frame.

// MCU processing: decode all, composite, mix, re-encode once per output layout
function renderMixedFrame(room) {
  const frames = room.participants.map(p => p.decoder.pull());   // N raw YUV decodes
  const canvas = layoutEngine.composite(frames, room.layout);    // grid / active-speaker
  const audio  = audioMixer.sum(room.participants.map(p => p.pcm)); // single PCM bus
  const encoded = room.encoder.encode(canvas);                   // 1 re-encode per layout
  return { video: encoded, audio: audioEncoder.encode(audio) };  // shared by every viewer of this layout
}
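The single PCM bus in the snippet above suits passive viewers. A participant who is also speaking usually should not hear their own voice in the mix, so conference bridges commonly produce a "mix-minus" per speaker: the full sum minus that speaker's own samples. The following is a minimal sketch of that idea, assuming equal-length 16-bit PCM frames keyed by a hypothetical participant id; `mixMinus` and `clamp16` are illustrative names, and gain control is reduced to hard clipping.

```javascript
// Clamp a summed sample back into the signed 16-bit range.
function clamp16(x) {
  return Math.max(-32768, Math.min(32767, x));
}

// inputs: Map<participantId, Int16Array> of equal-length PCM frames.
// Returns Map<participantId, Int16Array>: everyone else's audio, summed.
function mixMinus(inputs) {
  const ids = [...inputs.keys()];
  if (ids.length === 0) return new Map();

  const len = inputs.get(ids[0]).length;
  const total = new Int32Array(len); // wide accumulator avoids overflow while summing
  for (const pcm of inputs.values()) {
    if (pcm.length !== len) {
      throw new RangeError('all PCM frames must have the same length');
    }
    for (let i = 0; i < len; i++) total[i] += pcm[i];
  }

  const out = new Map();
  for (const id of ids) {
    const own = inputs.get(id);
    const mixed = new Int16Array(len);
    // Subtract the listener's own contribution, then clip to 16 bits.
    for (let i = 0; i < len; i++) mixed[i] = clamp16(total[i] - own[i]);
    out.set(id, mixed);
  }
  return out;
}
```

Summing once and subtracting per listener keeps the work linear in the number of participants instead of quadratic, at the cost of one extra audio encode per active speaker.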

Step 3: How each topology emits media

The SFU emits N-1 distinct encoded streams to each participant (or fewer, if a client subscribes to a subset). Server-side egress is therefore O(N²) in the worst case: a 10-party call with every participant subscribed to every other is up to 90 forwarded streams. The server bandwidth cost is real, but the CPU cost stays near zero because nothing is re-encoded. Each receiving client runs up to N-1 decoders, which is exactly where mobile decode limits bite: most phones cap concurrent hardware video decoders at 1–3.
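One way to stay inside a client's decoder budget is to subscribe to a subset of publishers. The sketch below uses a deliberately simple, hypothetical policy: keep the loudest publishers up to the cap. The `audioLevel` field, the `maxDecoders` parameter, and `pickSubscriptions` itself are assumptions for illustration; the real decoder limit has to be measured per device, and production systems often add hysteresis so tiles do not flicker.

```javascript
// Choose which publishers a client should receive video from, given a decoder cap.
// publishers: [{ id, audioLevel }], where a higher audioLevel means more recently active.
function pickSubscriptions(publishers, maxDecoders, selfId) {
  return publishers
    .filter(p => p.id !== selfId)                 // never subscribe to yourself
    .sort((a, b) => b.audioLevel - a.audioLevel)  // loudest first (sorts the filtered copy)
    .slice(0, Math.max(0, maxDecoders))
    .map(p => p.id);
}
```

A client that can run two hardware decoders in a five-party room would then receive two forwarded streams instead of four, which also trims the server's egress for that subscriber.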

The MCU emits one stream per participant, or one per distinct layout. Every viewer who wants the same grid receives the same encoded output, so a passive audience of thousands can share a single encode. The receiving client runs exactly one decoder regardless of room size, which is the MCU's headline advantage for low-power and embedded endpoints. The cost moved server-side: every distinct layout is a separate encode, so per-participant custom views (each user seeing a different active-speaker arrangement) erase the single-encode savings. The MCU then runs one encode per viewer, which matches the SFU's per-participant fan-out without the SFU's zero-encode benefit.
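The emit-side counts from the two paragraphs above can be put side by side. This is a rough sketch that counts video streams only and assumes every participant both publishes and subscribes to everyone else; `egressStreams` and its field names are hypothetical.

```javascript
// Worst-case video stream and codec-operation counts for one room.
function egressStreams(participants, distinctLayouts = 1) {
  return {
    sfuForwarded: participants * (participants - 1), // every peer subscribed to every other
    sfuEncodes: 0,                                   // forwarding only, nothing re-encoded
    mcuSent: participants,                           // one mixed stream per participant
    mcuDecodes: participants,                        // every inbound stream is decoded
    // One encode per distinct layout, capped at one per viewer.
    mcuEncodes: Math.min(distinctLayouts, participants),
  };
}
```

For the 10-party example, `egressStreams(10)` reports 90 forwarded streams for the SFU against 10 sent streams and a single encode for the MCU, while `egressStreams(10, 10)` shows the per-participant-layout case where the MCU runs 10 encodes.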

A hybrid worth naming: many production systems run an SFU as the live topology and bolt an MCU-style compositor on only for server-side recording, so the live call pays SFU economics while the archive gets a single mixed file. That pattern is detailed in Server-Side Recording & Composition.

Step 4: Verification

Verify the topology behaves as designed by reading getStats() on the client and the equivalent per-stream metrics on the server, polled at 1 s intervals to match the rest of your observability.

// Confirm topology from the subscriber side
const stats = await pc.getStats();
let inboundVideoTracks = 0;
for (const r of stats.values()) {
  if (r.type === 'inbound-rtp' && r.kind === 'video') inboundVideoTracks++;
}
// SFU room of N participants → inboundVideoTracks ≈ N-1 (or your subscribed subset)
// MCU room of any size      → inboundVideoTracks === 1 (the mixed stream)
console.log(`inbound video streams = ${inboundVideoTracks}`);

On the server, confirm an SFU shows near-zero encode time and framesEncoded flat on egress (it is forwarding, not encoding), while an MCU shows totalEncodeTime and decoder utilization climbing with participant count. If an "SFU" reports rising encode time per added participant, something is transcoding when it should be forwarding: a misconfiguration that quietly converts your SFU economics into MCU economics. Cross-check the active path and RTT through the candidate-pair report as described in Bandwidth Estimation & Congestion Control.
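The server-side check above can be expressed as a delta between two stats snapshots. The sketch assumes the server exposes per-stream counters with the same meaning as the WebRTC outbound-rtp fields `framesEncoded` and `totalEncodeTime` (the latter in seconds); many media servers expose their own metric names instead, so treat the field names as an assumption. The `thresholdMs` default is an arbitrary illustrative value.

```javascript
// Average encode cost per frame, in milliseconds, between two snapshots.
function encodeMsPerFrame(prev, curr) {
  const frames = curr.framesEncoded - prev.framesEncoded;
  if (frames <= 0) return 0; // nothing encoded in the interval: pure forwarding
  return ((curr.totalEncodeTime - prev.totalEncodeTime) / frames) * 1000;
}

// Flags an egress stream that is spending measurable time encoding.
// thresholdMs is a hypothetical tuning knob, not a standard value.
function looksLikeTranscoding(prev, curr, thresholdMs = 0.5) {
  return encodeMsPerFrame(prev, curr) > thresholdMs;
}
```

Polled once per second on an SFU's egress streams, `looksLikeTranscoding` should stay false; a true result on a stream that is supposed to be forwarded is the misconfiguration signal described above.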

Comparison Table

| Dimension | SFU | MCU |
| --- | --- | --- |
| Server media work | Route + rewrite RTP headers | Decode all + composite + re-encode |
| Server CPU per participant | Very low (no transcode) | High (decode + encode per input) |
| Server egress (client downlink) | High: up to N-1 streams/participant | Low: 1 stream/participant |
| Client decoders needed | Up to N-1 | Exactly 1 |
| Mobile / embedded friendliness | Limited by decode count | Excellent (single decode) |
| Layout flexibility | Client-side, fully flexible | Server-fixed per layout |
| End-to-end latency | Lower (no decode/encode hop) | Higher (+30–150 ms pipeline) |
| End-to-end encryption | Possible (payload forwarded untouched) | Broken (server decodes + re-encodes) |
| Codec transcoding / SIP bridge | No | Yes |
| Cost driver | Bandwidth | CPU |

The trade-offs in this table are quantified per participant-hour, with concrete CPU and dollar figures, in SFU vs MCU Cost & Quality Trade-offs.

Edge Cases & Browser Quirks

Common Implementation Mistakes

FAQ

Is an SFU always cheaper than an MCU?

It depends which resource you are paying for. An SFU is dramatically cheaper on CPU because it never transcodes, but it is more expensive on egress bandwidth because it emits up to N-1 streams per participant. An MCU inverts that: low egress, high CPU. For most interactive conferencing the SFU's bandwidth bill is cheaper than the MCU's compute bill, but a large passive audience watching one fixed layout flips the math toward the MCU.

Can I run both in the same product?

Yes, and many production systems do. Run an SFU for the live, interactive call and invoke MCU-style composition only where a single stream is required: server-side recording, a dial-in phone bridge, or a low-power broadcast endpoint. The live path keeps SFU latency and CPU economics while the composite path gets a single mixed output.

Why does an MCU add latency?

Every MCU output passes through decode → composite/mix → re-encode, a pipeline that adds roughly 30–150 ms depending on codec, resolution, and frame buffering. An SFU only rewrites RTP headers and forwards, so it adds little beyond network transit. For tight conversational latency the SFU wins; for a passive audience the extra delay is usually acceptable.

Does an SFU break end-to-end encryption?

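Not inherently, and that is one of its defining advantages. DTLS-SRTP on its own is hop-by-hop: the SFU terminates SRTP to read the RTP headers, so transport encryption alone does not hide media from the server. But because the SFU forwards the encoded payload without decoding it, clients can encrypt each frame with a key the server never holds (insertable-stream E2EE), and the SFU can still route media it cannot read. An MCU must decode to composite, so it always sees plaintext media and cannot offer true end-to-end encryption.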

Related: start from the Media Server Architecture: SFU & MCU guide, then quantify the decision with SFU vs MCU Cost & Quality Trade-offs, and continue into Selective Forwarding Unit Design, Simulcast-Aware Forwarding, and Load Balancing & Scaling SFUs.