Load Balancing & Scaling SFUs

A single Selective Forwarding Unit process saturates long before your user count does — not on CPU first, but on the outbound network interface, because an SFU’s job is to fan one published stream out to every subscriber in a room. This guide is part of the Media Server Architecture: SFU & MCU guide, and it covers the operational problem that follows once a single node works: how to plan per-node capacity, place rooms onto a pool of nodes, span regions with cascaded relays, drive autoscaling from the right signals, and drain nodes without dropping calls.

The implementation goal here is a horizontally scaled SFU tier that keeps every participant in a room reachable at predictable latency while you add and remove nodes underneath live traffic. The control plane — a load balancer or signaling layer — owns room-to-node assignment; the data plane stays exactly the same per-node forwarding logic described in Selective Forwarding Unit Design. Throughout, assume the per-node forwarding cost model from SFU vs MCU Cost & Quality Trade-offs: an SFU does not transcode, so its cost is bandwidth and packet-forwarding, not CPU-bound composition.

The signaling layer assigns each room to an SFU node by remaining capacity; large or multi-region rooms cascade between nodes over a single inter-node relay link rather than splitting participants.

Step 1 — Capacity planning per node by outbound bitrate

Size an SFU node on egress bandwidth first, CPU second. An SFU forwards packets without decoding them, so the dominant constraint is how many bits leave the box. The arithmetic is direct: in a room of N participants where every participant publishes one stream and subscribes to all others, each published stream is forwarded to N − 1 subscribers, so the room’s egress is N × (N − 1) streams and grows roughly with the square of room size. Many small rooms and a few large rooms therefore produce wildly different load on the same hardware, even at the same total user count.

Work the numbers against a concrete per-stream bitrate. A 720p simulcast stream costs roughly 1.2–1.7 Mbps at its top layer; a node with a 10 Gbps NIC has a hard ceiling near 10,000 Mbps of egress, and you should plan to use no more than 70–80% of it to leave headroom for retransmits and bursts. At 75% utilization and 1.5 Mbps per stream, that is 7,500 Mbps ÷ 1.5 Mbps = 5,000 forwarded 720p streams per node (about 4,100–6,700 across the full range of bitrates and headroom above), a limit the NIC reaches well before a modern multi-core CPU saturates.

// Estimate whether a node can admit a new room, on outbound bitrate alone.
// Treat egress headroom — not CPU — as the binding constraint for an SFU.
const NIC_MBPS = 10_000;          // 10 Gbps interface
const SAFE_FRACTION = 0.75;       // leave 25% for retransmits, RTX, bursts
const MAX_EGRESS = NIC_MBPS * SAFE_FRACTION;

// Per-room egress: every publisher is forwarded to every other subscriber.
function roomEgressMbps(participants, perStreamMbps = 1.5) {
  return participants * (participants - 1) * perStreamMbps; // N*(N-1) fan-out
}

function canAdmit(node, participants, perStreamMbps = 1.5) {
  const projected = node.currentEgressMbps + roomEgressMbps(participants, perStreamMbps);
  return projected <= MAX_EGRESS; // reject if it would breach the safe ceiling
}

// A 50-person all-to-all room costs 50*49*1.5 = 3,675 Mbps: over a third of the
// NIC and about half of the 7,500 Mbps safe ceiling (MAX_EGRESS) for one node.

Track currentEgressMbps from each node’s own forwarding stats (sum of outbound-rtp bytesSent deltas over a 1 s window, the same getStats cadence used everywhere else in this stack) rather than from a static per-participant estimate, because simulcast layer selection means actual egress varies with what subscribers can receive — the mechanism detailed in Bandwidth-Aware Layer Selection in an SFU.
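The 1 s delta measurement can be sketched as a small meter. This is a minimal sketch rather than any particular SFU’s API: `createEgressMeter` is a hypothetical helper, and it assumes the node exposes a cumulative count of bytes sent that you sample about once per second.

```javascript
// Hypothetical helper: turns a cumulative bytes-sent counter into Mbps.
// Assumes the caller samples the node's total outbound byte count ~once per second.
function createEgressMeter() {
  let last = null; // previous { bytes, timeMs } sample

  return function sample(totalBytesSent, nowMs) {
    const prev = last;
    last = { bytes: totalBytesSent, timeMs: nowMs };
    if (prev === null || nowMs <= prev.timeMs) return 0; // need two ordered samples
    const deltaBytes = totalBytesSent - prev.bytes;
    if (deltaBytes < 0) return 0; // counter went backwards (e.g. process restart)
    const seconds = (nowMs - prev.timeMs) / 1000;
    return (deltaBytes * 8) / 1_000_000 / seconds; // bytes -> megabits per second
  };
}
```

Feeding the result into `node.currentEgressMbps` on each tick keeps admission decisions tied to measured traffic instead of the static estimate.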

Step 2 — Room-to-node assignment

The signaling layer, not the SFU, decides which node a room lives on. The default and correct policy for most workloads is room affinity: every participant in one room connects to the same node, so the SFU can forward locally without any cross-node hop. The load balancer’s only job at join time is to pick a node for the first participant of a room and then pin every subsequent joiner to that same node.

Pick the node by least projected egress, not round-robin and not least-connections. Round-robin ignores that one 50-person room outweighs fifty 2-person rooms; least-connections has the same blindness. A capacity-aware “least loaded by egress headroom” assignment keeps nodes evenly filled in the dimension that actually saturates.
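To see why connection count misleads, the comparison can be worked through in a few lines. This is an illustrative sketch: `loadSummary` is a hypothetical helper, and the 1.5 Mbps per-stream figure is the same planning assumption used in Step 1.

```javascript
// Compare two loads by connection count and by projected egress.
// perStreamMbps = 1.5 is an illustrative planning figure, not a measurement.
function loadSummary(roomSizes, perStreamMbps = 1.5) {
  let connections = 0;
  let egressMbps = 0;
  for (const n of roomSizes) {
    connections += n;                          // one connection per participant
    egressMbps += n * (n - 1) * perStreamMbps; // all-to-all fan-out per room
  }
  return { connections, egressMbps };
}

const oneLargeRoom = loadSummary([50]);
const fiftySmallRooms = loadSummary(new Array(50).fill(2));
console.log(oneLargeRoom);    // { connections: 50, egressMbps: 3675 }
console.log(fiftySmallRooms); // { connections: 100, egressMbps: 150 }
```

A least-connections balancer would treat the fifty small rooms as twice the load of the large room, even though the large room produces roughly 25 times the egress.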

// Room-affinity assignment: pin a room to one node, chosen by egress headroom.
// Backed by a shared store so every signaling instance agrees on placement.
async function assignRoomToNode(roomId, expectedParticipants, store, nodes) {
  const existing = await store.get(`room:${roomId}:node`);
  if (existing) return existing; // affinity — never split a room across nodes

  const projected = roomEgressMbps(expectedParticipants);
  const candidates = nodes
    .filter(n => n.healthy && !n.draining)                  // skip draining nodes
    .filter(n => n.currentEgressMbps + projected <= MAX_EGRESS)
    .sort((a, b) => a.currentEgressMbps - b.currentEgressMbps); // least loaded first

  if (candidates.length === 0) throw new Error('NO_CAPACITY'); // trigger scale-up
  const chosen = candidates[0].id;

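  // Claim atomically so concurrent first-joins don't race onto two nodes.
  // Note: the 3600 s TTL must be refreshed while the room is live, or a room
  // that outlives it will be re-placed and split across nodes.
  await store.setNX(`room:${roomId}:node`, chosen, { ttl: 3600 });
  return await store.get(`room:${roomId}:node`); // read back whichever claim won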
}

The shared store that holds room → node mappings is the same Redis instance most teams already run for Scaling WebSocket Signaling with Redis Pub/Sub, so signaling fan-out and room placement stay consistent across every signaling node. The consistent-hashing and failover details of this mapping are the subject of Sharding Rooms Across SFU Nodes.
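For local tests, the `get`/`setNX` shape that the assignment function relies on can be imitated in memory. The sketch below is a stand-in only: `MemoryStore` is a hypothetical class, not a Redis client, and it gives no cross-process guarantees. In Redis itself, the set-if-absent-with-expiry step corresponds to `SET key value NX EX seconds`.

```javascript
// In-memory stand-in for the shared store, for local tests only.
// Mirrors the get/setNX shape used by the assignment code; not a Redis client.
class MemoryStore {
  constructor(now = () => Date.now()) {
    this.now = now;           // injectable clock so TTL expiry is testable
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  // Synchronous read that lazily drops expired keys.
  _read(key) {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return null;
    }
    return entry.value;
  }

  async get(key) {
    return this._read(key);
  }

  // Set only if the key is absent or expired. Returns true if this call won.
  // The check and the write run in one synchronous step (no await in between),
  // so two concurrent callers in the same process cannot both win.
  async setNX(key, value, { ttl } = {}) {
    if (this._read(key) !== null) return false;
    const expiresAt = ttl ? this.now() + ttl * 1000 : Infinity;
    this.entries.set(key, { value, expiresAt });
    return true;
  }
}
```

Passing a `MemoryStore` instance where the guide passes `store` lets you exercise the first-join race without a running Redis.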

Step 3 — Cascaded SFUs for cross-region and oversized rooms

Affinity breaks in two cases: a room larger than one node can hold, and a room whose participants are spread across regions. Both are solved the same way — cascading, where two SFU nodes subscribe to each other and relay a room’s streams over a single inter-node link instead of forwarding to every remote participant directly.

For cross-region rooms, pin participants to the SFU in their own region and cascade only the aggregate room media between the regional nodes. A participant in Frankfurt subscribes to the EU node; that EU node holds one relayed copy of each US publisher pulled across the Atlantic once, rather than each US publisher’s stream crossing the ocean per EU subscriber. This collapses inter-region egress from publishers × remote_subscribers to publishers × regions, and keeps each participant’s first hop on a low-latency regional path — the same multi-region latency win (40–60% on connect) that regional STUN placement buys at the ICE layer.
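The reduction in inter-region traffic can be checked with a little arithmetic. This sketch assumes every participant both publishes one stream and subscribes to everyone else; `interRegionStreams` and the `{ id, region }` shape are hypothetical names for illustration, and the cascaded figure counts only the other regions each publisher is relayed to.

```javascript
// Count streams crossing region boundaries, with and without cascading.
// `participants` is a list of { id, region }; everyone publishes and subscribes to all.
function interRegionStreams(participants) {
  const perRegion = new Map(); // region -> participant count
  for (const p of participants) {
    perRegion.set(p.region, (perRegion.get(p.region) || 0) + 1);
  }
  const total = participants.length;
  const regions = perRegion.size;

  let direct = 0;   // each publisher sent separately to every remote subscriber
  let cascaded = 0; // each publisher relayed once to every other region
  for (const count of perRegion.values()) {
    direct += count * (total - count);
    cascaded += count * (regions - 1);
  }
  return { direct, cascaded };
}
```

For a room with 10 participants in the EU and 10 in the US, this gives 200 inter-region streams without cascading and 20 with it.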

// Cascade: a local node pulls each remote publisher exactly once from the peer node.
// One relayed copy per publisher per region — not per remote subscriber.
async function ensureCascade(localNode, remoteNode, roomId, publishers) {
  for (const pub of publishers) {
    const key = `cascade:${roomId}:${pub.id}:${localNode.id}`;
    if (localNode.hasRelay(key)) continue;     // already pulling this publisher

    // localNode acts as a subscriber to remoteNode for this publisher's stream,
    // then re-forwards it to its own local subscribers as if locally published.
    const relay = await localNode.subscribeRemote(remoteNode, pub.id);
    localNode.registerRelay(key, relay);
  }
  // Forward your local publishers to remoteNode symmetrically (it pulls from you).
}

Cascading is not free: it adds one relay hop of latency (typically 80–150 ms inter-region) and doubles the bookkeeping. Reserve it for rooms that genuinely span regions or exceed single-node capacity; keep ordinary rooms node-local. The same simulcast-aware forwarding rules apply on the relayed copy — the cascade should pull only the layers some remote subscriber actually needs, as covered in Simulcast-Aware Forwarding.
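Deciding which layers to pull over the relay can be sketched as a union over local subscriptions. The names here are hypothetical: `layersToPull`, the subscription record shape, and the layer labels are illustrative and do not come from a specific SFU.

```javascript
// For each remote publisher, collect the distinct simulcast layers that at
// least one local subscriber currently wants. Layer names are illustrative.
function layersToPull(subscriptions) {
  // subscriptions: [{ subscriberId, publisherId, layer }]
  const needed = new Map(); // publisherId -> Set of layers
  for (const { publisherId, layer } of subscriptions) {
    if (!needed.has(publisherId)) needed.set(publisherId, new Set());
    needed.get(publisherId).add(layer);
  }
  return needed;
}
```

A publisher that does not appear in the result has no local subscriber, so the relay for it can be skipped or torn down.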

Step 4 — Verification: autoscaling signals and draining nodes

Verify the tier the way it will behave under real load: scale up before saturation, scale down without dropping calls, and confirm both with the node-level egress signal you already collect.

Autoscaling signals. Scale on aggregate egress utilization and projected admission failures, not on CPU. CPU on an SFU stays low until egress headroom is already exhausted, so a CPU-based autoscaler reacts far too late. Trigger scale-up when the pool’s mean egress stays above ~65% of the safe ceiling for 60 s, or when any assignRoomToNode call throws NO_CAPACITY. Because new rooms can pin to a fresh node immediately but existing rooms cannot migrate without a renegotiation, scale up earlier than you would a stateless web tier.
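The scale-up rule can be sketched as a pure function over recent samples. This is one possible reading of the rule, under stated assumptions: `shouldScaleUp` is a hypothetical name, samples are taken at a regular interval, and each sample carries the pool’s mean egress as a fraction of the safe ceiling.

```javascript
// Decide whether to scale up from egress samples and admission failures.
// samples: [{ timeMs, meanUtilization }] where meanUtilization is the pool's
// mean egress as a fraction of the safe ceiling (0..1).
function shouldScaleUp(samples, noCapacityEvents, nowMs, opts = {}) {
  const { threshold = 0.65, sustainMs = 60_000 } = opts;
  if (noCapacityEvents > 0) return true; // an admission has already failed

  const windowStart = nowMs - sustainMs;
  // Require history reaching back to the start of the window.
  const coversWindow = samples.some(s => s.timeMs <= windowStart);
  if (!coversWindow) return false;

  // Every sample inside the window must be above the threshold.
  const recent = samples.filter(s => s.timeMs >= windowStart);
  return recent.length > 0 && recent.every(s => s.meanUtilization > threshold);
}
```

Keeping the decision free of I/O makes it easy to replay recorded utilization traces against different thresholds before changing the production policy.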

Draining nodes. A node never hard-stops while it holds live rooms. Mark it draining so the load balancer stops assigning new rooms to it, let existing rooms finish naturally, and only terminate once participant count reaches zero — or, for long-lived rooms, migrate them by signaling affected clients to reconnect, which re-runs assignment onto a healthy node.

// Drain a node: stop new assignments, wait for rooms to empty, then terminate.
// Returns true if the node was terminated, false if rooms remained at the deadline.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function drainNode(nodeId, store, opts = { maxWaitMs: 1_800_000 }) {
  await store.set(`node:${nodeId}:draining`, '1');   // load balancer skips it now
  const start = Date.now();
  const signaled = new Set();                        // rooms already asked to reconnect

  while (Date.now() - start < opts.maxWaitMs) {
    const rooms = await store.smembers(`node:${nodeId}:rooms`);
    if (rooms.length === 0) {
      await terminateNode(nodeId);                   // empty: safe to scale in
      return true;
    }

    // For rooms that won't drain on their own, ask clients once to reconnect so
    // the load balancer reassigns them to a healthy node.
    if (Date.now() - start > opts.maxWaitMs / 2) {
      for (const roomId of rooms) {
        if (signaled.has(roomId)) continue;          // don't re-signal on every poll
        signaled.add(roomId);
        await signalReconnect(roomId);
      }
    }
    await sleep(5_000);                              // re-poll, don't spin
  }
  return false; // deadline hit with live rooms: leave the node up and alert
}

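From the client’s view, the reconnect-driven migration is a renegotiation against the new node. Where the new node can take over the existing session, that is an ICE restart, createOffer({ iceRestart: true }); where it cannot (a different node normally presents a different DTLS fingerprint), the client needs a fresh RTCPeerConnection instead. Either way, cap it at 3 attempts with a 3–5 s fallback, exactly as the Signaling State Machine Patterns guide bounds every reconnection path. Confirm a clean drain by watching the node’s egress fall to zero before scale-in fires.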
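The attempt cap can be sketched independently of how the reconnect itself is negotiated. In this sketch, `reconnectWithCap` and the caller-supplied `connect` function are hypothetical, and reading the 3–5 s figure as a per-attempt time limit (4 s by default here) is an assumption.

```javascript
// Try a reconnect at most maxAttempts times, giving each attempt a time limit.
// `connect` is a caller-supplied async function (hypothetical) that resolves
// once the client is attached to its new node.
async function reconnectWithCap(connect, opts = {}) {
  const { maxAttempts = 3, attemptTimeoutMs = 4_000 } = opts;
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('ATTEMPT_TIMEOUT')), attemptTimeoutMs);
    });
    try {
      return await Promise.race([connect(attempt), timeout]);
    } catch (err) {
      lastError = err; // remember why this attempt failed, then try again
    } finally {
      clearTimeout(timer); // never leave a stray timer behind
    }
  }
  throw lastError; // cap reached: the caller decides the fallback (e.g. full rejoin)
}
```

Throwing after the last attempt keeps the fallback decision with the caller, which usually knows whether a full rejoin or a user-visible error is appropriate.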

Edge Cases & Browser Quirks

Reconnect storms on scale-in. Draining a node by mass-reconnect can stampede the signaling layer if every client retries at once. Jitter reconnect delays across a 1–5 s window per client; Chrome and Firefox both honor an immediate iceRestart offer, so without jitter you get a synchronized thundering herd onto the replacement node.
Safari renegotiation on migration. Safari (WebKit) is stricter about m-line ordering on the post-migration offer than Chrome. A room migrated to a new node must reproduce the original transceiver order or Safari rejects the answer — see Debugging SDP m-line Mismatches. Pin transceiver order server-side rather than letting the new node re-derive it.
Cascaded RTCP feedback. PLI and NACK feedback must traverse the cascade hop, adding 80–150 ms to keyframe recovery for remote subscribers. Firefox is more aggressive about requesting full keyframes on packet loss than Chrome, so a cross-region cascade shows more inter-region keyframe traffic when EU subscribers run Firefox.
CGNAT participants behind a relayed first hop. When a participant is already on a TURN relay (symmetric NAT / CGNAT), their first hop to the regional SFU is itself relayed; binding refreshes under 30 s still apply, and a node drain must not outlive the TURN allocation lifetime or the migrated session dies silently.
Sticky load-balancer hashing vs WebSocket upgrade. An L4 load balancer that re-hashes on reconnect can land a returning client on a signaling node that doesn’t hold its room mapping. Always resolve placement from the shared store, never from local in-memory state on the signaling node.
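The reconnect jitter from the first item above can be sketched in one function. `jitteredDelayMs` is a hypothetical helper; the 1–5 s window is the one quoted in that item, and the random source is injectable only to make the function testable.

```javascript
// Pick a reconnect delay uniformly in [minMs, maxMs) so clients spread out.
// `random` is injectable for tests; it defaults to Math.random.
function jitteredDelayMs(minMs = 1_000, maxMs = 5_000, random = Math.random) {
  return minMs + random() * (maxMs - minMs);
}
```

A client would typically wrap its reconnect call as `setTimeout(reconnect, jitteredDelayMs())`, where `reconnect` stands for whatever function starts the renegotiation.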

Common Implementation Mistakes

Splitting one room across nodes by default. Spreading a room’s participants over multiple nodes for “balance” forces a cascade hop on every internal subscription and multiplies inter-node bandwidth. Keep rooms node-local; cascade only when a room is too big for one node or genuinely multi-region.
Balancing on connection count or CPU. Both ignore egress, the dimension that actually saturates an SFU. A node with few but huge rooms looks idle by connection count and by CPU right up until its NIC is full. Assign and autoscale on projected egress.
Cascading per-subscriber instead of per-publisher. Pulling a remote publisher once per remote subscriber defeats the entire purpose of cascading. Pull each remote publisher exactly once per node and re-forward locally.
Hard-killing nodes on scale-in. Terminating a node that still holds rooms drops every call on it. Always drain — stop new assignments, wait, then migrate stragglers via client reconnect.
Placing room state only in node memory. If the room → node map lives only on the assigning signaling instance, any other instance and any failover loses it. Persist it in Redis so placement survives node failure and rebalance, as detailed in the sharding guide.
Forwarding all simulcast layers across the cascade. Relaying every layer between regions wastes the inter-region link. Pull only layers a remote subscriber currently needs.

FAQ

How many participants can one SFU node handle? The limit is set by egress, not CPU. With a 10 Gbps NIC at 75% safe utilization and ~1.5 Mbps per forwarded 720p stream, a node tops out near 5,000 simultaneous forwarded streams (7,500 Mbps ÷ 1.5 Mbps), which is a few large rooms or many small ones. Compute the ceiling for your own per-stream bitrate with N × (N − 1) egress per room rather than a flat per-user number.

When should I cascade rooms across nodes instead of keeping them on one? Only when a room exceeds a single node’s egress ceiling or its participants are split across regions. Cascading adds an 80–150 ms relay hop and roughly doubles bookkeeping, so node-local affinity is the default; reach for a cascade as the exception, and pull each remote publisher exactly once per node.

What signal should drive SFU autoscaling? Aggregate egress utilization across the pool plus admission-failure (NO_CAPACITY) events. Never use CPU, which stays low until egress is already exhausted. Scale up when mean egress holds above ~65% of the safe ceiling for 60 s, and do so earlier than you would for a stateless tier, because existing rooms can’t migrate to a new node without a client renegotiation.

How do I remove a node without dropping calls? Drain it: flag it so the load balancer stops assigning new rooms, let existing rooms empty naturally, and for long-lived rooms migrate participants by signaling a reconnect that re-runs assignment onto a healthy node. Only terminate once the node’s egress reaches zero.

Related: start from the Media Server Architecture: SFU & MCU guide, then pair this with Sharding Rooms Across SFU Nodes for the room-affinity map, SFU vs MCU Cost & Quality Trade-offs for the per-node cost model, and Selective Forwarding Unit Design for the forwarding logic each node runs.
