WebRTC over CGNAT (Carrier-Grade NAT)

Carrier-grade NAT (CGNAT; its shared address space, 100.64.0.0/10, is defined in RFC 6598) sits between millions of mobile and broadband subscribers and the public internet, sharing a small pool of public IPv4 addresses across thousands of customers. This page belongs to the ICE Candidate Gathering & Filtering guide and addresses one decision: how to keep WebRTC connections alive over CGNAT, where bindings expire fast, ports run out, and direct paths frequently never form.

Context & Trade-offs

CGNAT magnifies every NAT problem. Two properties dominate. First, binding lifetime: to conserve table space across thousands of subscribers, carriers age out idle UDP mappings aggressively — often in under 30 seconds, sometimes as low as 20 s. A srflx candidate discovered at call setup can be dead before connectivity checks finish, and a connection that goes briefly idle can lose its mapping mid-call. Second, port exhaustion: with thousands of subscribers behind one public IP, the carrier may run a per-subscriber port budget. Under load, new mappings get refused, so additional candidate gathering or a fresh allocation simply fails.

Most CGNAT deployments are also symmetric, which means STUN srflx candidates rarely produce a usable direct path — the same failure mode covered in Traversing Symmetric NAT with TURN. The practical consequence: assume a relay will be needed, keep mappings warm with frequent keepalives, and design for fast re-establishment rather than fighting for a direct path.
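One way to check the symmetric-mapping assumption on a given network is to gather against two different STUN servers and compare the srflx results for the same local base. The sketch below is illustrative only: `hasSymmetricMapping` is a hypothetical helper, and it assumes each candidate exposes `type`, `protocol`, `address`, `port`, `relatedAddress`, and `relatedPort` the way `RTCIceCandidate` does. Browsers that obscure the related address (for example with mDNS) may make the grouping unreliable, so treat the result as a hint rather than proof.

```javascript
// Hypothetical helper: group srflx candidates by local base and flag any base
// that was mapped to more than one public address:port pair. A different
// mapping per STUN destination is the signature of a symmetric NAT.
function hasSymmetricMapping(candidates) {
  const mappingsByBase = new Map();
  for (const c of candidates) {
    if (!c || c.type !== 'srflx') continue;
    const base = `${c.protocol}|${c.relatedAddress}:${c.relatedPort}`;
    if (!mappingsByBase.has(base)) mappingsByBase.set(base, new Set());
    mappingsByBase.get(base).add(`${c.address}:${c.port}`);
  }
  for (const mappings of mappingsByBase.values()) {
    if (mappings.size > 1) return true;
  }
  return false;
}
```

If the function returns true, srflx candidates are unlikely to yield a direct path and the relay should be treated as the primary route.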

The trade-offs are concrete. Sending consent/keepalive traffic every 5–15 s holds the binding open at the cost of a trickle of background bandwidth and battery on mobile. Routing through a TURN relay (see TURN Server Configuration & Auth) adds 20–40 ms of one-way latency but converts a near-certain failure into a reliable call. Skipping keepalives saves battery but invites a silent drop the moment the user stops talking.

It is worth separating the two mechanisms that keep a CGNAT call alive, because they fail differently. WebRTC's built-in ICE consent freshness (RFC 7675) sends a STUN binding request on the nominated pair roughly every 5 s and, per the RFC, revokes consent and stops sending media if no valid response arrives within 30 s; that protects the active media path. But consent only runs on the pair carrying media, and a paused or muted call can still let the underlying UDP mapping age out faster than consent notices, especially when the OS suspends the radio. An application-level keepalive on a data channel forces actual packets through the mapping on a schedule you control, independent of whether audio/video is flowing. The cleanest design uses both: rely on consent freshness for liveness detection, and add a short-interval data-channel heartbeat to keep the NAT binding warm during silence. When the mapping is lost anyway (port exhaustion, a radio handoff, a genuinely long idle), an ICE restart re-gathers and re-nominates without dropping the session, which is far cheaper than rebuilding the peer connection or a user-visible reconnect.
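An ICE restart still needs a fresh offer/answer exchange: calling `restartIce()` marks the next offer as a restart and fires `negotiationneeded`. The sketch below shows one way to forward that offer. The `signaling` object and its `send(message)` method are assumptions standing in for whatever signaling channel the application already uses; they are not part of WebRTC.

```javascript
// Sketch: forward the ICE-restart offer over an app-provided signaling channel.
// `signaling` is hypothetical: any object exposing send(message).
function wireIceRestartSignaling(pc, signaling) {
  pc.onnegotiationneeded = async () => {
    try {
      // With no argument, setLocalDescription() creates the appropriate
      // description itself; after restartIce() that is an ICE-restart offer.
      await pc.setLocalDescription();
      signaling.send({ description: pc.localDescription });
    } catch (err) {
      console.error('ICE restart offer failed:', err);
    }
  };
}
```

The remote peer applies the offer and answers as it would for any renegotiation; existing tracks and data channels stay in place while the new candidate pairs are checked.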

Minimal Runnable Implementation

const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:stun.l.google.com:19302' },
    {
      urls: [
        'turn:turn.example.com:3478?transport=udp',
        'turns:turn.example.com:5349?transport=tcp' // survives UDP-hostile carriers
      ],
      username: 'time-limited-user',
      credential: 'base64-hmac-token'
    }
  ],
  iceTransportPolicy: 'all',  // try direct first; ICE falls back to relay on CGNAT
  bundlePolicy: 'max-bundle',
  rtcpMuxPolicy: 'require'
});

// WebRTC sends STUN consent checks every ~5 s automatically; a data-channel
// heartbeat additionally keeps the binding warm under the carrier's <30 s aging timer.
function startKeepalive(pc) {
  const dc = pc.createDataChannel('keepalive', { negotiated: true, id: 0 });
  dc.onopen = () => {
    const timer = setInterval(() => {
      // 1-byte heartbeat well under the 30 s binding lifetime
      if (dc.readyState === 'open') dc.send(new Uint8Array([0]));
      else clearInterval(timer);
    }, 10000); // 10 s interval: safe margin under a 20–30 s CGNAT timeout
  };
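}
startKeepalive(pc); // run on both peers: a negotiated channel must be created on each side with the same id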

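// On a dropped binding, re-gather rather than tearing the call down
let iceRestarts = 0;
pc.oniceconnectionstatechange = () => {
  const state = pc.iceConnectionState;
  if (state === 'connected' || state === 'completed') {
    iceRestarts = 0; // path recovered; reset the retry budget
  } else if ((state === 'disconnected' || state === 'failed') && iceRestarts < 3) {
    iceRestarts += 1; // cap retries at 3
    pc.restartIce(); // fires negotiationneeded; the new offer refreshes mappings
  }
};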

Set the keepalive interval to roughly half the observed binding lifetime — 10 s is a safe default against a 20–30 s timeout. Always offer a TLS TURN endpoint on 5349 (or 443) because some carriers throttle or block raw UDP.
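The half-lifetime rule can be written down as a small helper. This is a sketch under stated assumptions: `keepaliveIntervalMs` is a hypothetical name, and the 5 s and 15 s clamps are taken from the 5–15 s range discussed above rather than from any standard.

```javascript
// Hypothetical helper: pick a keepalive interval of roughly half the observed
// binding lifetime, clamped to the 5-15 s range used in this guide.
function keepaliveIntervalMs(observedLifetimeMs, { minMs = 5000, maxMs = 15000 } = {}) {
  if (!Number.isFinite(observedLifetimeMs) || observedLifetimeMs <= 0) {
    throw new RangeError('observedLifetimeMs must be a positive number');
  }
  const half = Math.floor(observedLifetimeMs / 2);
  return Math.min(maxMs, Math.max(minMs, half));
}

console.log(keepaliveIntervalMs(20000));  // 10000
console.log(keepaliveIntervalMs(120000)); // 15000 (clamped to the upper bound)
```

Note that the lower clamp means a measured lifetime under 5 s would still produce a 5 s interval; on a path that aggressive, a relay over TCP/TLS is likely the more realistic answer than faster heartbeats.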

Reproduction Steps & Debugging Log Patterns

  1. Place a client on a mobile carrier known to use CGNAT and establish a call, then stop all media/data for 35 s.
  2. Poll pc.getStats() at 1 s intervals and watch the nominated candidate-pair for consentRequestsSent rising and responsesReceived stalling.
  3. Observe iceConnectionState flip to disconnected shortly after the binding ages out, then watch whether restartIce() recovers it.
  4. Re-run with a 10 s keepalive enabled and confirm the binding survives the idle window.
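The polling in step 2 can be sketched as follows. The helper names are hypothetical, and the code assumes the browser populates `consentRequestsSent` and `responsesReceived` on `candidate-pair` stats; where a field is missing it is treated as 0, in which case no stall is ever reported.

```javascript
// Return the nominated candidate pair from an RTCStatsReport (or any object
// with a Map-like values() iterator), or null if nothing is nominated yet.
function findNominatedPair(report) {
  for (const stat of report.values()) {
    if (stat.type === 'candidate-pair' && stat.nominated) return stat;
  }
  return null;
}

// True when at least `minUnanswered` consent requests went out between the
// two samples while the response counter did not move at all.
function consentStalled(baseline, current, minUnanswered = 2) {
  if (!baseline || !current) return false;
  const sent = (current.consentRequestsSent ?? 0) - (baseline.consentRequestsSent ?? 0);
  const received = (current.responsesReceived ?? 0) - (baseline.responsesReceived ?? 0);
  return sent >= minUnanswered && received === 0;
}

// Browser wiring: sample once per second and call onStall(pair) on a stall.
function watchConsent(pc, onStall, intervalMs = 1000) {
  let baseline = null;
  return setInterval(async () => {
    const current = findNominatedPair(await pc.getStats());
    if (!current) return;
    const advanced = baseline && current.responsesReceived > baseline.responsesReceived;
    if (!baseline || current.id !== baseline.id || advanced) {
      baseline = current; // responses still arriving: move the baseline forward
    } else if (consentStalled(baseline, current)) {
      onStall(current);
    }
  }, intervalMs);
}
```

Requiring two unanswered requests avoids flagging the brief window between a single consent request and its response.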

Expected log on binding expiry without keepalive:

// t+0s   candidate-pair (relay/srflx) state: succeeded  nominated: true
// t+28s  consentRequestsSent: 6  responsesReceived: 4   <- mapping aging out
// t+31s  iceConnectionState: disconnected
// t+31s  restartIce() -> iceConnectionState: checking -> connected

If restartIce() cannot recover and you see no new relay candidate, suspect port exhaustion on the carrier — the allocation request is being refused. Fall back to the already-established TLS relay rather than gathering fresh candidates.
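A rough way to confirm the "no new relay candidate" symptom is to tally what the restart actually gathered. The helpers below are hypothetical names for illustration. A missing relay candidate is consistent with a refused allocation, but it can equally mean expired TURN credentials or an unreachable server, so check those first.

```javascript
// Hypothetical helper: count gathered candidates by ICE candidate type.
function countCandidateTypes(candidates) {
  const counts = { host: 0, srflx: 0, prflx: 0, relay: 0 };
  for (const c of candidates) {
    if (c && Object.prototype.hasOwnProperty.call(counts, c.type)) {
      counts[c.type] += 1;
    }
  }
  return counts;
}

// True when a gathering round produced no relay candidate at all.
function relayMissing(candidates) {
  return countCandidateTypes(candidates).relay === 0;
}

// Browser wiring: collect candidates until gathering completes, which
// the icecandidate event signals with a null candidate.
function collectUntilComplete(pc) {
  return new Promise((resolve) => {
    const seen = [];
    const onCandidate = (event) => {
      if (event.candidate) {
        seen.push(event.candidate);
        return;
      }
      pc.removeEventListener('icecandidate', onCandidate);
      resolve(seen);
    };
    pc.addEventListener('icecandidate', onCandidate);
  });
}
```

In the browser, start `collectUntilComplete(pc)` before calling `pc.restartIce()`, then pass the resolved array to `relayMissing`.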

Common Implementation Mistakes

FAQ

How short can a CGNAT binding lifetime really be?

Commonly 30–120 s for UDP, but aggressive carriers age idle mappings in under 30 s — some near 20 s. Always assume the worst and keep mappings warm.

Does a keepalive drain mobile battery noticeably?

A 1-byte heartbeat every 10 s is negligible compared to active media. The radio is already awake during a call; the cost only matters for long-idle background connections, where you can stretch the interval slightly.

Why does my call work on Wi-Fi but fail on cellular?

Home Wi-Fi usually sits behind a single layer of cone-type NAT with generous timeouts; cellular paths are typically symmetric CGNAT with short binding lifetimes. Provision TURN and keepalives specifically for the cellular path.

Related: return to ICE Candidate Gathering & Filtering, and see Traversing Symmetric NAT with TURN and ICE Candidate Trickle vs Bulk Gathering.