Signaling State Machine Patterns for WebRTC

Real-time applications demand deterministic connection management. Event-driven code wired directly onto RTCPeerConnection callbacks degrades quickly into race conditions, SDP glare, and unrecoverable media states the moment a network flap, a tab suspension, or a simultaneous renegotiation arrives. A formal finite state machine (FSM) gives you a single source of truth for the connection lifecycle and a place to make recovery decisions explicit instead of emergent. Part of the WebRTC Protocol Stack & Signaling Servers guide, this article walks through building a production-grade signaling FSM step by step: modelling the states, gating SDP transitions against the browser’s own state, abstracting transport, and verifying behaviour under failure.

The goal is a machine whose transitions you can reason about, log, replay, and alert on: one that drives ICE restarts automatically, tears down cleanly, and never leaves a peer connection stranded in an intermediate state. The two scenarios that break naive implementations most often are transport loss during negotiation and two peers offering at the same time; both are handled here as first-class transitions rather than afterthoughts.

[Figure: WebRTC signaling state machine with ICE-restart recovery paths. new transitions to connecting on offer or answer, connecting to connected on ICE connected, connected to disconnected on transport flap, disconnected back to connected on recovery or to failed after timeout, and failed to connecting via an ICE restart.]
Signaling FSM: new and connected paths plus ICE-restart recovery from failed.

Step 1: Model Deterministic State Transitions

Start by mapping the native RTCPeerConnection signals onto a small, explicit application state set: new, connecting, connected, disconnected, and failed. The native API exposes two related state attributes: signalingState (stable, have-local-offer, have-remote-offer, plus the provisional-answer states and closed) and connectionState (new, connecting, connected, disconnected, failed, closed). Your FSM should not try to replace these; it should sit above them and treat their events as inputs. Decouple UI rendering from FSM state so a brief disconnected blip never flashes an error screen during a Wi-Fi-to-cellular handover.

The cleanest implementation is a pure reducer: given the current state and an action, return the next state, rejecting any action that is invalid for the current state. This makes out-of-order signaling messages (a late ANSWER arriving after teardown, a duplicate OFFER) harmless no-ops instead of exceptions.

// Pure reducer: invalid (state, action) pairs return the current state unchanged
const TRANSITIONS = {
  new:          { OFFER_SENT: 'connecting', OFFER_RECEIVED: 'connecting' },
  connecting:   { ICE_CONNECTED: 'connected', ERROR: 'failed', CLOSE: 'new' },
  // ERROR and CLOSE are accepted in every live state so a native 'failed' or
  // 'closed' reported by the browser can never be silently ignored
  connected:    { TRANSPORT_FLAP: 'disconnected', ERROR: 'failed', CLOSE: 'new' },
  disconnected: { ICE_CONNECTED: 'connected', RESTART_TIMEOUT: 'failed', ERROR: 'failed', CLOSE: 'new' },
  failed:       { ICE_RESTART: 'connecting', CLOSE: 'new' }
};

function reduce(state, action) {
  const next = TRANSITIONS[state]?.[action.type];
  if (!next) {
    // Ignore quietly: out-of-order signaling messages must not throw
    console.debug(`[FSM] ignored ${action.type} in ${state}`);
    return state;
  }
  return next;
}

Wire native events into the reducer rather than acting on them directly. The connectionstatechange listener becomes a thin translator that dispatches FSM actions, keeping all decision logic in one auditable place. This indirection is what lets you unit-test the entire lifecycle without a browser: feed the reducer a scripted sequence of actions and assert the resulting states, since the reducer is pure and has no dependency on RTCPeerConnection at all.

// Thin translator: native events become FSM actions, nothing more
let state = 'new';
function dispatch(action) {
  const prev = state;
  state = reduce(state, action);
  if (state !== prev) onTransition(prev, state, action); // single audit hook
}

pc.addEventListener('connectionstatechange', () => {
  // Normalise the native enum into the small action vocabulary the FSM understands
  switch (pc.connectionState) {
    case 'connected':    dispatch({ type: 'ICE_CONNECTED' }); break;
    case 'disconnected': dispatch({ type: 'TRANSPORT_FLAP' }); break;
    case 'failed':       dispatch({ type: 'ERROR' }); break;
    case 'closed':       dispatch({ type: 'CLOSE' }); break;
  }
});
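
The browser-free testing described above can be sketched as a small replay helper. This is a minimal sketch under stated assumptions: it carries its own copy of this step's core edges so it runs standalone under Node, and `replay` is a hypothetical helper name, not part of any library.

```javascript
// Standalone copy of the core edges so this runs in plain Node (no browser needed)
const TRANSITIONS = {
  new:          { OFFER_SENT: 'connecting', OFFER_RECEIVED: 'connecting' },
  connecting:   { ICE_CONNECTED: 'connected', ERROR: 'failed' },
  connected:    { TRANSPORT_FLAP: 'disconnected', CLOSE: 'new' },
  disconnected: { ICE_CONNECTED: 'connected', RESTART_TIMEOUT: 'failed' },
  failed:       { ICE_RESTART: 'connecting', CLOSE: 'new' }
};

// Same contract as the reducer above: unknown pairs leave the state unchanged
const reduce = (state, action) => TRANSITIONS[state]?.[action.type] ?? state;

// Hypothetical helper: fold a scripted action list and record every state visited
function replay(actionTypes, initial = 'new') {
  const trace = [initial];
  for (const type of actionTypes) {
    trace.push(reduce(trace[trace.length - 1], { type }));
  }
  return trace;
}

// Scripted session: connect, flap, time out, ICE restart, reconnect
const trace = replay([
  'OFFER_SENT', 'ICE_CONNECTED', 'TRANSPORT_FLAP',
  'RESTART_TIMEOUT', 'ICE_RESTART', 'ICE_CONNECTED'
]);
console.log(trace.join(' -> '));

// An action the table does not know in this state is a harmless no-op
const ignored = replay(['ICE_RESTART']);
console.log(ignored.join(' -> '));
```

Because the trace is plain data, the same helper can replay a transition log captured in production to reproduce a session offline.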

Keep the action vocabulary deliberately small. Every action you add is a new edge in the transition table you must reason about, so resist the urge to mirror every native event one-to-one. A handful of high-level actions (offer sent, ICE connected, transport flap, restart timeout) covers the entire lifecycle, and anything finer-grained (individual candidate events, gathering-state changes) belongs in instrumentation, not in the state graph itself.

Step 2: Gate SDP Sequencing and Rollback

SDP exchange is asynchronous and the browser strictly enforces its own signalingState ladder (stable → have-local-offer → stable). A call that violates that ladder rejects with InvalidStateError and can leave the connection wedged. Your FSM must serialise negotiation so only one offer/answer cycle is ever in flight, and it must roll back to the last stable baseline whenever a setLocalDescription or setRemoteDescription call rejects.

Process negotiation through a promise-based mutex so concurrent negotiationneeded events queue instead of racing. On any rejection, call pc.setLocalDescription({ type: 'rollback' }) to return to stable, then re-dispatch from a known-good state. These transitions are the practical application of the SDP Offer/Answer Lifecycle, which defines exactly which states accept which descriptions.

let negotiationChain = Promise.resolve(); // serialises all SDP work

function enqueue(task) {
  // Chain every negotiation step so two offers never overlap
  negotiationChain = negotiationChain.then(task).catch(async (err) => {
    console.error('[FSM] negotiation failed, rolling back:', err.message);
    if (pc.signalingState !== 'stable') {
      // Return to the last stable baseline before retrying
      await pc.setLocalDescription({ type: 'rollback' }).catch(() => {});
    }
  });
  return negotiationChain;
}

The single most common failure here is the offer collision: both peers fire negotiationneeded and offer simultaneously, leaving each in have-local-offer with no path to stable. The full perfect-negotiation recovery, including polite/impolite peer roles, lives in Recovering from Glare in Offer Collisions; the FSM’s job is simply to route a detected collision into a rollback rather than an exception.
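
Detecting the collision is itself a small predicate. The sketch below follows the commonly documented perfect-negotiation pattern; `classifyIncoming`, `makingOffer`, and `polite` are hypothetical names for this illustration, and the polite/impolite role is assumed to be assigned elsewhere (for example by the signaling server).

```javascript
// Classify an incoming session description. All names here are illustrative.
//   ctx.makingOffer    : true while our own createOffer/setLocalDescription is in flight
//   ctx.signalingState : a snapshot of pc.signalingState
//   ctx.polite         : this peer's role when two offers collide
function classifyIncoming(description, ctx) {
  const collision =
    description.type === 'offer' &&
    (ctx.makingOffer || ctx.signalingState !== 'stable');

  if (!collision) return 'apply';                 // normal offer/answer path
  return ctx.polite
    ? 'rollback-then-apply'                       // polite peer abandons its own offer
    : 'ignore';                                   // impolite peer keeps its offer
}

const inFlight = { makingOffer: false, signalingState: 'have-local-offer' };
console.log(classifyIncoming({ type: 'offer' }, { ...inFlight, polite: true }));
console.log(classifyIncoming({ type: 'offer' }, { ...inFlight, polite: false }));
console.log(classifyIncoming({ type: 'answer' }, { ...inFlight, polite: true }));
```

Returning a decision instead of acting on the peer connection keeps the predicate pure, so the FSM can log the outcome and route a `rollback-then-apply` result through the same serialised queue as every other SDP operation.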

Step 3: Abstract Transport and Buffer ICE

The signaling transport must guarantee ordering, deliver acknowledgements, and degrade gracefully without coupling the FSM to a specific protocol. Hide WebSocket, HTTP long-poll, or RPC behind a narrow interface of send(msg), onMessage(cb), and onClose(cb), so the FSM is transport-agnostic and you can swap implementations at runtime.
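
To make the interface concrete, here is a minimal in-memory sketch of two connected endpoints that expose only those three methods. `createLoopbackPair` is a hypothetical test double, not a real library API; a production adapter would wrap a WebSocket or long-poll client behind the same shape.

```javascript
// Hypothetical loopback transport: two ends, each exposing
// send(msg), onMessage(cb), onClose(cb), plus close() for teardown.
function createLoopbackPair() {
  const makeEnd = () => ({
    peer: null, messageCb: null, closeCb: null, open: true,
    send(msg) {
      if (!this.open) throw new Error('transport closed');
      // Round-trip through JSON to mimic serialisation over a real wire
      const copy = JSON.parse(JSON.stringify(msg));
      this.peer.messageCb?.(copy);
    },
    onMessage(cb) { this.messageCb = cb; },
    onClose(cb) { this.closeCb = cb; },
    close() {
      if (!this.open) return;      // idempotent: also stops the mutual close loop
      this.open = false;
      this.closeCb?.();
      this.peer.close();
    }
  });
  const a = makeEnd();
  const b = makeEnd();
  a.peer = b;
  b.peer = a;
  return [a, b];
}

const [alice, bob] = createLoopbackPair();
const received = [];
bob.onMessage((msg) => received.push(msg));
alice.send({ type: 'offer', seq: 1 });
console.log(received);
```

Delivery here is synchronous, which a real network is not, so this double is suited to exercising FSM logic rather than timing behaviour.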

Attach monotonic sequence IDs to every payload for deduplication and reordering. Buffer ICE candidates that arrive before setRemoteDescription resolves and flush them in order once it does; the browser does not queue them for you, and addIceCandidate is specified to reject with InvalidStateError while no remote description is set, so an unbuffered early candidate is simply lost. When the transport itself drops, apply exponential backoff with jitter and cap reconnection at 5–7 attempts before escalating to a TURN relay or teardown; never block media threads waiting on signaling recovery.

const pendingCandidates = [];

async function onRemoteCandidate(init) {
  // Hold candidates until the remote description exists, then apply in arrival order
  if (pc.remoteDescription && pc.remoteDescription.type) {
    await pc.addIceCandidate(init);
  } else {
    pendingCandidates.push(init);
  }
}

async function flushCandidates() {
  while (pendingCandidates.length) {
    await pc.addIceCandidate(pendingCandidates.shift());
  }
}
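
The sequence-ID handling mentioned at the start of this step can be sketched as a small reorder buffer placed between the transport and the FSM. This is an illustration only: `createSequencer` and the `seq` field name are assumptions, and the sender is assumed to number messages from 1 without gaps.

```javascript
// Hypothetical reorder buffer: delivers messages strictly in seq order,
// drops duplicates, and parks anything that arrives early.
function createSequencer(deliver, firstSeq = 1) {
  let nextSeq = firstSeq;
  const parked = new Map(); // seq -> message waiting for its turn

  return function accept(msg) {
    // Already delivered, or already parked: treat as a duplicate
    if (msg.seq < nextSeq || parked.has(msg.seq)) return;
    parked.set(msg.seq, msg);
    // Drain every consecutive message that is now available
    while (parked.has(nextSeq)) {
      const ready = parked.get(nextSeq);
      parked.delete(nextSeq);
      nextSeq += 1;
      deliver(ready);
    }
  };
}

const delivered = [];
const accept = createSequencer((msg) => delivered.push(msg.seq));
// Arrival order includes reordering (2 before 1) and duplicates (1 and 2 twice)
[2, 1, 1, 4, 3, 2].forEach((seq) => accept({ seq }));
console.log(delivered); // [ 1, 2, 3, 4 ]
```

A gap that is never filled would park later messages indefinitely, so a real implementation would pair this with a timeout or a resend request.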

For heartbeat framing, reconnection algorithms, and the sub-10 ms delivery characteristics WebSocket gives you, the WebSocket Signaling Implementation guide covers the transport details this abstraction sits on top of. Candidate generation upstream of all this is shaped by ICE Candidate Gathering & Filtering, which determines what the FSM will be buffering in the first place.
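
As a rough illustration of the capped backoff policy described in this step, the helper below computes a "full jitter" delay per attempt. `backoffDelay` and its defaults (`baseMs`, `maxMs`, `maxAttempts`) are hypothetical values chosen for the sketch, with the cap picked from within the 5–7 attempt range mentioned above.

```javascript
// Hypothetical helper: exponential backoff with full jitter and an attempt cap.
// Returns a delay in milliseconds, or null once the caller should stop retrying.
function backoffDelay(attempt, options = {}) {
  const {
    baseMs = 500,          // illustrative default
    maxMs = 15000,         // illustrative ceiling per attempt
    maxAttempts = 6,       // within the 5–7 range discussed in the text
    random = Math.random   // injectable so tests can be deterministic
  } = options;

  if (attempt >= maxAttempts) return null;
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);   // uniform in [0, ceiling)
}

// Walk the schedule until the cap says stop
for (let attempt = 0; ; attempt += 1) {
  const delay = backoffDelay(attempt);
  if (delay === null) {
    console.log(`attempt ${attempt}: give up and escalate`);
    break;
  }
  console.log(`attempt ${attempt}: retry in ${delay} ms`);
}
```

Randomising over the whole interval, rather than adding a small jitter to a fixed delay, spreads reconnects from many clients that lost the same server at the same moment.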

The transport interface should expose connection health to the FSM, not hide it. Losing the signaling socket is a different event from losing the media transport: the first means you cannot send SDP and should buffer, the second means the media path itself flapped and may need an ICE restart. Conflating the two leads to spurious renegotiation, such as restarting ICE because the WebSocket briefly dropped even though the media path was healthy the whole time. Model them as separate inputs so the reducer can react correctly to each. In practice this means the transport adapter emits signalingDown/signalingUp events that drive buffering, while connectionstatechange drives the media-path transitions in the graph above; only the latter ever triggers an ICE restart.
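
One way to keep the two inputs apart is an outbox that reacts only to signaling health and never touches the media-path state. The sketch below is a minimal illustration; `createOutbox` and `rawSend` are hypothetical names, and the `signalingDown`/`signalingUp` methods mirror the event names used above.

```javascript
// Hypothetical outbox: buffers outbound signaling while the socket is down
// and flushes in order when it returns. It has no knowledge of ICE or media.
function createOutbox(rawSend) {
  let up = true;
  const queue = [];
  return {
    send(msg) {
      if (up) rawSend(msg);
      else queue.push(msg);
    },
    signalingDown() { up = false; },
    signalingUp() {
      up = true;
      while (queue.length) rawSend(queue.shift());
    },
    get pending() { return queue.length; }
  };
}

const wire = [];
const outbox = createOutbox((msg) => wire.push(msg.type));

outbox.send({ type: 'offer' });          // socket healthy: sent immediately
outbox.signalingDown();                  // socket drops; media may still be fine
outbox.send({ type: 'candidate' });      // buffered
outbox.send({ type: 'candidate-end' });  // buffered
const bufferedWhileDown = outbox.pending;
outbox.signalingUp();                    // flushed in original order

console.log(bufferedWhileDown, wire);
```

Nothing in this path dispatches an FSM action, which is the point: a signaling outage changes what can be sent, not the state of the media connection.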

Step 4: Verification

Verify the FSM by driving it through each transition deliberately and confirming both the application state and the native connectionState agree at every step. State drift, where your FSM believes it is connected while the browser reports failed, is a common cause of silent WebRTC breakage, so make the comparison an explicit assertion rather than a hope.

function assertNoDrift(pc, fsmState) {
  // FSM state and native connectionState must stay consistent
  const native = pc.connectionState;
  const expected = { connecting: ['connecting', 'new'], connected: ['connected'],
    disconnected: ['disconnected'], failed: ['failed', 'closed'] };
  if (expected[fsmState] && !expected[fsmState].includes(native)) {
    console.warn(`[FSM] drift: fsm=${fsmState} native=${native}`); // alert in prod
    return false;
  }
  return true;
}

Run this checklist before shipping:

Instrument the FSM with structured transition events (correlation ID, from-state, to-state, monotonic timestamp) emitted to your APM. Set alert thresholds for states exceeding expected durations (connecting over 5 s, disconnected over 10 s) and keep a debug mode that logs every transition so you can reconstruct a session from logs. Poll getStats() at 1 s intervals during connecting and disconnected to correlate state with packet loss and RTT.
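
A structured transition event of this kind might be produced as follows. This is a sketch under assumptions: `createTransitionLogger` is a hypothetical name, `emit` stands in for whatever APM client you use, and the budgets are the two thresholds quoted above.

```javascript
// Duration budgets per state, taken from the thresholds in the text
const BUDGET_MS = { connecting: 5000, disconnected: 10000 };

// Hypothetical factory: returns an onTransition(from, to, action) hook that
// emits one structured event per transition, including time spent in `from`.
function createTransitionLogger({ correlationId, emit, now = () => performance.now() }) {
  let enteredAt = now();
  return function onTransition(from, to, action) {
    const ts = now();                       // monotonic clock, not wall-clock time
    const dwellMs = ts - enteredAt;
    enteredAt = ts;
    emit({
      correlationId, from, to,
      action: action.type,
      ts,
      dwellMs,
      overBudget: BUDGET_MS[from] !== undefined && dwellMs > BUDGET_MS[from]
    });
  };
}

// Exercise it with a fake clock so the example is deterministic
let fakeClock = 0;
const events = [];
const onTransition = createTransitionLogger({
  correlationId: 'session-1',
  emit: (event) => events.push(event),
  now: () => fakeClock
});

fakeClock = 100;
onTransition('new', 'connecting', { type: 'OFFER_SENT' });
fakeClock = 6100;
onTransition('connecting', 'connected', { type: 'ICE_CONNECTED' });

console.log(events[1].dwellMs, events[1].overBudget); // 6000 true
```

This flags a slow state only when the state is finally left; alerting on a state that never exits would additionally need a timer armed on entry.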

Edge Cases & Browser Quirks

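Chromium silent candidate drops. A candidate passed to addIceCandidate before a valid remote description exists is not queued by the browser: the specification has the call reject with InvalidStateError, and if that rejection goes unobserved (an un-awaited call, an empty catch) the candidate just disappears. A missing buffer therefore manifests as connectivity that “sometimes” fails rather than a clean error. Always buffer, as in Step 3.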

Firefox rollback strictness. Firefox honours setLocalDescription({ type: 'rollback' }) but rejects rollback from stable with InvalidStateError; guard the call with a signalingState !== 'stable' check (as shown) or Firefox will turn your recovery path into a new failure.

Safari background suspension. Safari on iOS aggressively suspends RTCPeerConnection ICE activity when the tab backgrounds, surfacing as a disconnected that never recovers on its own. Add a visibilitychange listener that forces an ICE restart on resume rather than waiting for the stack.
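
A resume handler along these lines is one possible shape for that listener. It is a sketch, not a tested Safari workaround: `restartIceOnResume` is a hypothetical name, and the document and peer connection are passed in so the logic can be exercised without a browser.

```javascript
// Hypothetical wiring: when the tab becomes visible again, restart ICE
// if the media path is not healthy. Returns a function that removes the listener.
function restartIceOnResume(doc, pc) {
  const handler = () => {
    if (doc.visibilityState !== 'visible') return;   // only act on resume
    if (pc.connectionState === 'disconnected' || pc.connectionState === 'failed') {
      // restartIce() flags the connection so the next offer gathers fresh candidates
      pc.restartIce();
    }
  };
  doc.addEventListener('visibilitychange', handler);
  return () => doc.removeEventListener('visibilitychange', handler);
}
```

In a page this would be called as `restartIceOnResume(document, pc)`, with the resulting negotiationneeded event flowing through the same serialised negotiation queue as any other offer.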

connectionState vs iceConnectionState. Older Safari builds (pre-15) lag or omit connectionState; fall back to iceConnectionState and normalise both into your FSM inputs so behaviour is uniform across engines.
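
The normalisation can be as small as a lookup table. The sketch below assumes that a missing `connectionState` shows up as a non-string value; `normalisedState` and `ICE_TO_CONNECTION` are hypothetical names, while the state values themselves are the standard ones for each attribute.

```javascript
// Map iceConnectionState values onto the connectionState vocabulary.
// 'checking' and 'completed' have no direct counterpart, so they are folded in.
const ICE_TO_CONNECTION = {
  new: 'new',
  checking: 'connecting',
  connected: 'connected',
  completed: 'connected',
  disconnected: 'disconnected',
  failed: 'failed',
  closed: 'closed'
};

// Hypothetical normaliser: prefer connectionState, fall back to iceConnectionState
function normalisedState(pc) {
  if (typeof pc.connectionState === 'string') return pc.connectionState;
  return ICE_TO_CONNECTION[pc.iceConnectionState] ?? 'new';
}

// Plain objects stand in for peer connections here
console.log(normalisedState({ connectionState: 'failed', iceConnectionState: 'disconnected' }));
console.log(normalisedState({ iceConnectionState: 'completed' }));
```

On engines that need the fallback, the translator would listen for iceconnectionstatechange as well and pass the normalised value into the same switch that produces FSM actions.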

Firefox disconnected transience. Firefox surfaces disconnected more eagerly than Chrome during brief packet loss and frequently self-recovers within a second or two. Debounce the connected → disconnected transition by waiting a short grace window (2–3 s) before treating it as actionable, or you will trigger ICE restarts on losses the stack would have healed on its own.
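
The grace window can be sketched as a small debouncer in front of the FSM. `createFlapDebouncer` is a hypothetical helper, the 2500 ms default is simply a value inside the 2–3 s range above, and the timer functions are injectable so the example runs instantly.

```javascript
// Hypothetical debouncer: reports a flap only if the path is still down
// once graceMs has elapsed. Recovery inside the window cancels the report.
function createFlapDebouncer(onActionableFlap, options = {}) {
  const { graceMs = 2500, setTimer = setTimeout, clearTimer = clearTimeout } = options;
  let timer = null;
  return {
    disconnected() {
      if (timer !== null) return;              // a grace window is already running
      timer = setTimer(() => {
        timer = null;
        onActionableFlap();
      }, graceMs);
    },
    recovered() {
      if (timer === null) return;
      clearTimer(timer);                       // healed on its own: nothing to do
      timer = null;
    }
  };
}

// Fake timer: captures the callback instead of waiting in real time
let pendingCallback = null;
const fakeTimers = {
  setTimer: (fn) => { pendingCallback = fn; return 1; },
  clearTimer: () => { pendingCallback = null; }
};

let flaps = 0;
const debouncer = createFlapDebouncer(() => { flaps += 1; }, fakeTimers);

debouncer.disconnected();
debouncer.recovered();        // recovered inside the window, so no flap is reported
const flapsAfterBlip = flaps;

debouncer.disconnected();
pendingCallback();            // window elapses while still disconnected
console.log(flapsAfterBlip, flaps); // 0 1
```

The callback is where a TRANSPORT_FLAP action would be dispatched, so the reducer only ever sees disconnections that outlived the grace window.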

Chrome failed is terminal. Once Chrome reports connectionState === 'failed', the only recovery is an ICE restart or a fresh connection; the state will not return to connected on its own. Make failed → connecting (via ICE restart) the sole non-teardown edge out of failed in your graph, and do not wait for a spontaneous recovery that will never come.

Common Implementation Mistakes

FAQ

Why use a state machine instead of RTCPeerConnection event listeners directly?

Direct listeners create implicit, non-deterministic flows that hide race conditions and scale poorly. A formal FSM enforces valid transitions, centralises error handling and rollback, and gives you a single auditable place to trigger ICE restarts and clean teardown, which is what makes recovery predictable.

How does the FSM avoid full renegotiation after a network flap?

It treats a flap as a connected → disconnected transition and waits a bounded window for native recovery before issuing a single iceRestart: true offer. The existing RTP/RTCP session and media tracks stay intact while only the transport path is renegotiated, so there is no full teardown.

Can this pattern scale to thousands of concurrent sessions?

Yes, because the reducer is stateless and transport-agnostic. With connection pooling or schema-driven streams behind the transport interface, the FSM logic stays lightweight and runs identically across distributed edge nodes; the per-session cost is just a small state object.

Should custom FSM state replace pc.signalingState?

No. The native state is authoritative for what SDP operations are legal. Your FSM augments it for application-level decisions; always gate negotiation against pc.signalingState and assert against pc.connectionState to catch drift.

Related: continue with the WebRTC Protocol Stack & Signaling Servers guide, move the transport to typed streams with Custom Signaling Protocols with gRPC-Web, harden the offer/answer race in Recovering from Glare in Offer Collisions, and ground the SDP rules in the SDP Offer/Answer Lifecycle.