SFU vs MCU Topologies
Once a call grows past a handful of participants, the peer-to-peer mesh that powers two-party WebRTC collapses under its own uplink: every participant must encode and send a separate stream to every other participant, so an N-party mesh forces each browser to run N-1 encoders and N-1 uploads. A central media server fixes this, and the two classic server topologies are the Selective Forwarding Unit (SFU) and the Multipoint Control Unit (MCU). This guide is part of the Media Server Architecture: SFU & MCU guide, and its goal is to give you a precise, decision-ready model of how each topology ingests, processes, and emits media so you can pick the right one before you write a line of server code.
The distinction is not academic. It determines your per-participant server CPU cost, your downlink bandwidth bill, the decode load you place on mobile clients, and how much layout flexibility you can offer. The four steps below trace the same media packet through each topology (ingest, processing, emit, and verification), and the comparison table and edge-case notes that follow turn those mechanics into operational guidance.
Step 1: How each topology ingests media
Ingest looks identical on the wire and diverges immediately afterward. In both topologies every participant establishes one RTCPeerConnection to the server, completes ICE and the DTLS-SRTP handshake, and pushes encoded RTP exactly as it would to a peer. The server is, from the browser's perspective, just another ICE agent, which is why the same ICE Candidate Gathering & Filtering rules and TURN fallback apply, and why you still budget 8–20% of sessions onto relays.
The divergence is what the server does with the inbound SRTP. An SFU terminates SRTP, decrypts to plaintext RTP, inspects headers, and stops there: it never touches the encoded payload. It reads the RTP header, the transport-wide-cc feedback, and (for layered media) the simulcast/SVC markers, but the VP8 or H.264 bitstream itself is opaque cargo. An MCU terminates SRTP and then fully decodes every inbound stream to raw YUV frames and PCM samples. That decode is the expensive, defining act of an MCU: a single 720p30 VP8 decode costs real CPU, and the MCU pays it once per inbound stream per room.
// SFU ingest: terminate SRTP, read headers, keep payload encoded
function onInboundRtp(packet, participant) {
  const header = parseRtpHeader(packet); // ssrc, seq, ts, marker
  const layer = readSimulcastLayer(packet); // rid: 'q' | 'h' | 'f'; no decode
  participant.tracks.set(header.ssrc, { header, layer, payload: packet.payload });
  // payload stays VP8/H.264-encoded; the SFU never owns a decoder
  routeToSubscribers(participant, header.ssrc, layer);
}
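The header read above can be sketched concretely. The following is a minimal parser for the fixed 12-byte RTP header laid out in RFC 3550, assuming the input is a Node.js Buffer of already-decrypted RTP. The function name mirrors the helper used above, but its exact signature and the returned field names are assumptions for illustration; CSRC entries and header extensions are skipped rather than interpreted, and padding is only reported.

```javascript
// Minimal sketch: parse the fixed RTP header (RFC 3550, section 5.1).
// Assumption: `buf` is a Node.js Buffer holding one decrypted RTP packet.
function parseRtpHeader(buf) {
  if (buf.length < 12) throw new RangeError('RTP packet shorter than 12 bytes');
  const version = buf[0] >> 6;                 // top 2 bits
  if (version !== 2) throw new Error(`unsupported RTP version ${version}`);
  const hasPadding = (buf[0] & 0x20) !== 0;    // P bit
  const hasExtension = (buf[0] & 0x10) !== 0;  // X bit
  const csrcCount = buf[0] & 0x0f;             // CC field
  const marker = (buf[1] & 0x80) !== 0;        // M bit
  const payloadType = buf[1] & 0x7f;
  const seq = buf.readUInt16BE(2);
  const ts = buf.readUInt32BE(4);
  const ssrc = buf.readUInt32BE(8);

  // Skip CSRC identifiers (4 bytes each) and, if present, the header extension.
  let payloadOffset = 12 + 4 * csrcCount;
  if (hasExtension) {
    if (buf.length < payloadOffset + 4) throw new RangeError('truncated header extension');
    const extWords = buf.readUInt16BE(payloadOffset + 2); // length in 32-bit words
    payloadOffset += 4 + 4 * extWords;
  }
  if (payloadOffset > buf.length) throw new RangeError('header exceeds packet length');

  // The payload itself is never inspected: an SFU only needs the offset.
  return { version, hasPadding, hasExtension, marker, payloadType, seq, ts, ssrc, payloadOffset };
}

// Usage: a hand-built packet with marker set, payload type 96, seq 42.
const sample = Buffer.from([
  0x80, 0xe0, 0x00, 0x2a,   // V=2, M=1, PT=96, seq=42
  0x00, 0x01, 0x5f, 0x90,   // timestamp = 90000
  0xde, 0xad, 0xbe, 0xef,   // SSRC
  0x01, 0x02,               // opaque payload bytes
]);
const h = parseRtpHeader(sample);
console.log(h.seq, h.ts, h.ssrc.toString(16), h.marker, h.payloadType);
```

Nothing in this sketch needs a codec library, which is the point of the SFU ingest path: every field the router acts on sits in front of the payload.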
Step 2: How each topology processes media
This is where cost is decided. The SFU's "processing" is a routing decision, not a media transform: for each subscriber it picks which of a sender's encoded streams or layers to forward, rewrites RTP sequence numbers and timestamps so the subscriber sees one continuous stream, and translates RTCP feedback (PLI, NACK, REMB/transport-wide-cc) between sender and receiver. It does zero pixel work. When senders publish multiple resolutions, the SFU's job becomes choosing the right layer per subscriber, the mechanics of which are covered in Simulcast-Aware Forwarding and the per-subscriber bandwidth logic in Bandwidth-Aware Layer Selection in an SFU.
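The sequence-number and timestamp rewrite can be illustrated with a small per-subscriber state machine. This is a sketch under stated assumptions, not a production rewriter: the class name RtpRewriter, the tsGap parameter (3000 ticks, one frame at 30 fps on a 90 kHz clock), and the in-order-arrival simplification are all illustrative choices, and real SFUs also handle reordering, retransmissions, and codec-specific picture IDs.

```javascript
// Sketch: keep one continuous outbound sequence/timestamp space per subscriber
// while the forwarded source (simulcast layer or SSRC) changes underneath it.
// Assumption: packets from the current source arrive in order.
class RtpRewriter {
  constructor() {
    this.currentSsrc = null;
    this.seqOffset = 0;     // added to inbound seq, modulo 2^16
    this.tsOffset = 0;      // added to inbound timestamp, modulo 2^32
    this.lastOutSeq = null;
    this.lastOutTs = null;
  }

  // tsGap: ticks to advance across a switch (assumed default: 3000).
  rewrite({ ssrc, seq, ts }, tsGap = 3000) {
    if (ssrc !== this.currentSsrc) {
      if (this.lastOutSeq !== null) {
        // Re-anchor so the new source continues right after the last packet sent.
        this.seqOffset = (this.lastOutSeq + 1 - seq) & 0xffff;
        this.tsOffset = (this.lastOutTs + tsGap - ts) >>> 0;
      }
      this.currentSsrc = ssrc;
    }
    const outSeq = (seq + this.seqOffset) & 0xffff; // 16-bit wraparound
    const outTs = (ts + this.tsOffset) >>> 0;       // 32-bit wraparound
    this.lastOutSeq = outSeq;
    this.lastOutTs = outTs;
    return { seq: outSeq, ts: outTs };
  }
}

// Usage: forward two packets from one layer, then switch to another layer.
const rw = new RtpRewriter();
console.log(rw.rewrite({ ssrc: 1, seq: 65534, ts: 1000 }));
console.log(rw.rewrite({ ssrc: 1, seq: 65535, ts: 4000 }));
console.log(rw.rewrite({ ssrc: 2, seq: 100, ts: 500000 })); // continues at seq 0
```

The subscriber sees a single stream whose counters never jump, even though the server swapped the underlying source, and no payload byte was touched to achieve it.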
The MCU runs a full media pipeline per room: decode every inbound video to raw frames, composite them onto a canvas according to a layout (grid, active-speaker, presentation), mix every inbound audio track into a single PCM bus with gain control and echo suppression, then re-encode the composite to one output stream. Because the output is a freshly encoded single stream, the MCU can transcode codecs and resolutions freely (it can accept VP8 from one client and emit H.264 to another), which is also why it can serve a SIP endpoint or a dial-in phone bridge that an SFU cannot. The price is that this pipeline is unavoidable and scales with participant count: more inputs means more decodes and a heavier composite per output frame.
// MCU processing: decode all, composite, mix, re-encode once per output layout
function renderMixedFrame(room) {
  const frames = room.participants.map(p => p.decoder.pull()); // N raw YUV decodes
  const canvas = layoutEngine.composite(frames, room.layout); // grid / active-speaker
  const audio = audioMixer.sum(room.participants.map(p => p.pcm)); // single PCM bus
  const encoded = room.encoder.encode(canvas); // 1 re-encode per layout
  return { video: encoded, audio: audioEncoder.encode(audio) }; // identical for all viewers
}
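To make the audio side of that pipeline concrete, here is one way the summing step could look. It is a sketch only: mixPcm16 is a hypothetical stand-in for the audioMixer.sum call above, it assumes every input is an equal-length Int16Array of 16-bit PCM at the same sample rate, and it uses a single fixed gain with a hard clamp in place of real gain control or echo suppression.

```javascript
// Sketch: sum equal-length 16-bit PCM frames into one bus.
// Assumptions: same sample rate and frame length for every input; a fixed
// gain and a hard clamp stand in for real gain control.
function mixPcm16(frames, gain = 1.0) {
  if (frames.length === 0) return new Int16Array(0);
  const length = frames[0].length;
  for (const frame of frames) {
    if (frame.length !== length) throw new RangeError('frames must have equal length');
  }
  const out = new Int16Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (const frame of frames) sum += frame[i]; // per-sample work grows with input count
    const scaled = Math.round(sum * gain);
    out[i] = Math.max(-32768, Math.min(32767, scaled)); // clamp to the int16 range
  }
  return out;
}

// Usage: two participants, three samples each.
const alice = Int16Array.from([1000, 30000, -20000]);
const bob = Int16Array.from([2000, 10000, -20000]);
console.log(Array.from(mixPcm16([alice, bob])));      // second and third samples clamp
console.log(Array.from(mixPcm16([alice, bob], 0.5))); // halved gain avoids the clamp
```

The inner loop runs once per input per sample, which is a small-scale view of why MCU cost tracks the number of publishers rather than the number of viewers.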
Step 3: How each topology emits media
The SFU emits N-1 distinct encoded streams to each participant (or fewer, if a client subscribes to a subset). Server-side egress is therefore O(N²) in the worst case: a 10-party call with every participant subscribed to every other is up to 90 forwarded streams. The server bandwidth cost is real, but the CPU cost stays near zero because nothing is re-encoded. Each receiving client runs up to N-1 decoders, which is exactly where mobile decode limits bite: most phones cap concurrent hardware video decoders at 1–3.
The MCU emits one stream per participant, or one per distinct layout. Every viewer who wants the same grid receives the same encoded output, so a passive audience of thousands can share a single encode. The receiving client runs exactly one decoder regardless of room size, which is the MCU's headline advantage for low-power and embedded endpoints. The cost moves server-side: every distinct layout is a separate encode, so per-participant custom views (each user seeing a different active-speaker arrangement) erase the single-encode savings and push the MCU toward SFU-like per-viewer stream counts without the SFU's zero-encode benefit.
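The emit-side arithmetic for both topologies fits in a few lines. The sketch below assumes the worst case described above (every participant publishes one video stream and subscribes to everyone else); the function and field names are illustrative, and the counts are stream and codec-instance totals, not bandwidth or CPU figures.

```javascript
// Sketch: worst-case stream and codec-instance counts for a room of n
// participants, each publishing one video stream and subscribing to all others.
function sfuCounts(n) {
  const others = Math.max(0, n - 1);
  return {
    serverEgressStreams: n * others, // O(N^2): each participant receives N-1 streams
    serverEncodes: 0,                // forwarding only
    decodersPerClient: others,
  };
}

// distinctLayouts: how many different composites the MCU must render.
function mcuCounts(n, distinctLayouts = 1) {
  return {
    serverEgressStreams: n,          // one mixed stream per participant
    serverDecodes: n,                // every inbound stream is decoded
    serverEncodes: distinctLayouts,  // one encode per distinct layout
    decodersPerClient: 1,
  };
}

console.log(sfuCounts(10));     // 90 forwarded streams, 9 decoders per client
console.log(mcuCounts(10));     // shared layout: 10 decodes, 1 encode
console.log(mcuCounts(10, 10)); // per-participant layouts: 10 decodes, 10 encodes
```

Comparing the last two calls shows how custom per-viewer layouts multiply the MCU's encode count while leaving its decode count unchanged.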
A hybrid worth naming: many production systems run an SFU as the live topology and bolt an MCU-style compositor on only for server-side recording, so the live call pays SFU economics while the archive gets a single mixed file. That pattern is detailed in Server-Side Recording & Composition.
Step 4: Verification
Verify the topology behaves as designed by reading getStats() on both the client and the server, polled at 1 s intervals to match the rest of your observability.
// Confirm topology from the subscriber side
const stats = await pc.getStats();
let inboundVideoTracks = 0;
for (const r of stats.values()) {
  if (r.type === 'inbound-rtp' && r.kind === 'video') inboundVideoTracks++;
}
// SFU room of N participants → inboundVideoTracks ≈ N-1 (or your subscribed subset)
// MCU room of any size → inboundVideoTracks === 1 (the mixed stream)
console.log(`inbound video streams = ${inboundVideoTracks}`);
On the server, confirm an SFU shows near-zero encode time and framesEncoded flat on egress (it is forwarding, not encoding), while an MCU shows totalEncodeTime and decoder utilization climbing with participant count. If an "SFU" reports rising encode time per added participant, something is transcoding when it should be forwarding, a misconfiguration that quietly converts your SFU economics into MCU economics. Cross-check the active path and RTT through the candidate-pair report as described in Bandwidth Estimation & Congestion Control.
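The server-side check can be reduced to comparing two consecutive samples of the egress counters. How a media server exposes these counters is implementation-specific, so the sample shape below (an object with totalEncodeTime in seconds and framesEncoded as a cumulative count) and the function name are assumptions made for illustration only.

```javascript
// Sketch: flag an SFU egress path whose encode counters moved between two
// polls. Assumption: each sample is { totalEncodeTime, framesEncoded } with
// cumulative values, as collected by whatever stats interface the server has.
function looksLikeTranscoding(prev, curr, toleranceSec = 0.001) {
  const encodeDelta = curr.totalEncodeTime - prev.totalEncodeTime;
  const framesDelta = curr.framesEncoded - prev.framesEncoded;
  // A pure forwarder should show both deltas at (or extremely near) zero.
  return encodeDelta > toleranceSec || framesDelta > 0;
}

// Usage: two polls taken 1 s apart on a forwarding-only path, then on a
// path that has started encoding.
const idle = { totalEncodeTime: 0, framesEncoded: 0 };
console.log(looksLikeTranscoding(idle, { totalEncodeTime: 0, framesEncoded: 0 }));
console.log(looksLikeTranscoding(idle, { totalEncodeTime: 0.5, framesEncoded: 15 }));
```

Wiring a check like this into an alert makes the "SFU quietly became an MCU" failure visible as soon as the first participant triggers it.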
Comparison Table
| Dimension | SFU | MCU |
|---|---|---|
| Server media work | Route + rewrite RTP headers | Decode all + composite + re-encode |
| Server CPU per participant | Very low (no transcode) | High (decode + encode per input) |
| Server egress (client downlink) | High: up to N-1 streams/participant | Low: 1 stream/participant |
| Client decoders needed | Up to N-1 | Exactly 1 |
| Mobile / embedded friendliness | Limited by decode count | Excellent (single decode) |
| Layout flexibility | Client-side, fully flexible | Server-fixed per layout |
| End-to-end latency | Lower (no decode/encode hop) | Higher (+30–150 ms pipeline) |
| End-to-end encryption | Preserved (payload untouched) | Broken (server decrypts + re-encodes) |
| Codec transcoding / SIP bridge | No | Yes |
| Cost driver | Bandwidth | CPU |
The trade-offs in this table are quantified per participant-hour, with concrete CPU and dollar figures, in SFU vs MCU Cost & Quality Trade-offs.
Edge Cases & Browser Quirks
- Safari concurrent decode ceiling. Safari on older iPhones and iPads aggressively limits simultaneous hardware H.264 decoders, frequently to 1–2. An SFU room that forwards 4+ streams to such a device can silently drop frames or fall back to a software decoder that spikes CPU and battery. Detect the device class and either subscribe to fewer streams or route those clients to an MCU output.
- Chrome simulcast on the SFU path. Chrome publishes simulcast layers (rid q/h/f) that an SFU must read to forward the right resolution. If the server ignores rid, Chrome may still send all layers, wasting uplink; confirm the SFU honors the simulcast envelope negotiated in the SDP.
- Firefox lacks simulcast for some codecs. Firefox historically did not offer simulcast for certain codec/profile combinations, so an SFU that assumes three layers gets one. Treat layer availability as per-browser and per-codec, negotiated through the SDP Offer/Answer Lifecycle, not guaranteed.
- MCU and end-to-end encryption. Any topology that decodes media (every MCU) necessarily terminates encryption at the server. If your threat model requires the server never see plaintext, an MCU is off the table and you must stay on an SFU with insertable-stream E2EE.
- Single mixed stream and active-speaker switching. An MCU's active-speaker layout switches inside one stream, so clients see no track changes; an SFU switches by changing subscriptions, which can trigger renegotiation unless you reuse transceivers. See Replacing Video Tracks Without Renegotiation.
Common Implementation Mistakes
- Choosing an MCU for scale, then drowning in CPU. MCUs feel "scalable" because clients receive one stream, but server CPU grows with every input. A room of 50 active publishers is 50 decodes plus composites per output, far more expensive than an SFU forwarding the same media untouched.
- Choosing an SFU for low-power clients without capping subscriptions. Forwarding N-1 streams to a phone that can decode three is a guaranteed failure. Cap subscriptions, use active-speaker culling, or terminate those clients on an MCU output.
- Assuming the SFU re-encodes. Teams sometimes try to "fix quality" by transcoding inside the SFU, unknowingly converting it into a partial MCU and destroying its cost model. If you need transcoding, choose that deliberately.
- Ignoring the encryption consequence. Shipping an MCU into a privacy-sensitive product and only later discovering the server holds plaintext is a costly architecture reversal. Decide E2EE requirements before picking the topology.
- One topology for every room size. A 2–4 person call may not need a server at all; a 10–50 person call wants an SFU; a 10,000-viewer broadcast may want an MCU or hybrid edge. Size the topology to the room, and plan for Load Balancing & Scaling SFUs once one node is no longer enough.
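The subscription cap with active-speaker culling mentioned in the list above might look like the following. This is a sketch with hypothetical names: selectSubscriptions, the publisher shape ({ id, audioLevel }), and the maxDecoders input are assumptions, and how a client reports its decoder limit or how the server measures audio level is left to the surrounding system.

```javascript
// Sketch: pick which publishers a decode-limited client should subscribe to,
// preferring the loudest speakers. Assumptions: each publisher is
// { id: string, audioLevel: number } and maxDecoders comes from device detection.
function selectSubscriptions(publishers, selfId, maxDecoders) {
  return publishers
    .filter(p => p.id !== selfId)                  // never subscribe to yourself
    .sort((a, b) => b.audioLevel - a.audioLevel    // loudest first
      || a.id.localeCompare(b.id))                 // stable tie-break by id
    .slice(0, Math.max(0, maxDecoders))            // cap at the decoder budget
    .map(p => p.id);
}

// Usage: a five-person room viewed from a phone that can decode three streams.
const room = [
  { id: 'a', audioLevel: 0.1 },
  { id: 'b', audioLevel: 0.9 },
  { id: 'c', audioLevel: 0.4 },
  { id: 'd', audioLevel: 0.0 },
  { id: 'me', audioLevel: 0.8 },
];
console.log(selectSubscriptions(room, 'me', 3)); // [ 'b', 'c', 'a' ]
```

Re-running the selection as audio levels change gives active-speaker culling; in practice some hysteresis is usually added so subscriptions do not flap on every syllable.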
FAQ
Is an SFU always cheaper than an MCU?
It depends on which resource you are paying for. An SFU is dramatically cheaper on CPU because it never transcodes, but it is more expensive on egress bandwidth because it emits up to N-1 streams per participant. An MCU inverts that: low egress, high CPU. For most interactive conferencing the SFU's bandwidth bill is cheaper than the MCU's compute bill, but a large passive audience watching one fixed layout flips the math toward the MCU.
Can I run both in the same product?
Yes, and many production systems do. Run an SFU for the live, interactive call and invoke MCU-style composition only where a single stream is required: server-side recording, a dial-in phone bridge, or a low-power broadcast endpoint. The live path keeps SFU latency and CPU economics while the composite path gets a single mixed output.
Why does an MCU add latency?
Every MCU output passes through decode → composite/mix → re-encode, a pipeline that adds roughly 30–150 ms depending on codec, resolution, and frame buffering. An SFU only rewrites RTP headers and forwards, so it adds little beyond network transit. For tight conversational latency the SFU wins; for a passive audience the extra delay is usually acceptable.
Does an SFU break end-to-end encryption?
No, and that is one of its defining advantages. Because the SFU forwards the encoded payload untouched, it can route media it cannot read, which makes insertable-stream E2EE possible. An MCU must decrypt and decode to composite, so it always sees plaintext and cannot offer true end-to-end encryption.
Related: start from the Media Server Architecture: SFU & MCU guide, then quantify the decision with SFU vs MCU Cost & Quality Trade-offs, and continue into Selective Forwarding Unit Design, Simulcast-Aware Forwarding, and Load Balancing & Scaling SFUs.