Compositing Multi-Party Recordings Server-Side

Compositing turns a set of independent, separately-timed participant streams into a single synchronized video file with a chosen layout. This page is part of the Server-Side Recording & Composition guide, and it resolves one concrete problem: how to build the compositor and mixer stage so a four-to-twelve-person call produces a watchable, in-sync recording even as people join and leave mid-call. The decisions that matter are the layout policy (fixed grid vs active-speaker), the synchronization source (RTP timestamps, never arrival time), and how the pipeline survives roster changes without breaking the encoder’s timeline.

Context & Trade-offs

Two layout policies dominate. A grid assigns every active participant an equal tile — predictable, fair, and trivial to lay out, but tiles shrink as the room grows and a 12-person grid wastes most pixels on idle faces. An active-speaker layout promotes whoever is talking to a large tile with the rest as thumbnails or hidden — far better use of resolution, but it requires audio-energy detection and debouncing, and it flickers badly if you switch on every transient sound.

Dimension	Grid	Active-speaker
Layout cost	trivial	audio-energy + debounce
Pixel efficiency	poor at scale (12+)	high
Flicker risk	none	high without 1–2 s debounce
Best fit	small meetings, fairness	webinars, large rooms
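
The implementation later in this guide calls a computeLayout helper that it never defines. As a rough sketch of the grid case only, assuming a near-square grid filled in roster order and a 1280×720 canvas (the function name and defaults here are hypothetical, not part of any library):

```javascript
// Hypothetical helper: equal tiles on a near-square grid, assigned in roster order.
function computeGridLayout(ids, width = 1280, height = 720) {
  const layout = {};
  if (ids.length === 0) return layout;
  const cols = Math.ceil(Math.sqrt(ids.length)); // 12 participants -> 4 columns
  const rows = Math.ceil(ids.length / cols);     // 12 participants -> 3 rows
  const w = Math.floor(width / cols);
  const h = Math.floor(height / rows);
  ids.forEach((id, i) => {
    layout[id] = { x: (i % cols) * w, y: Math.floor(i / cols) * h, w, h };
  });
  return layout;
}
```

The returned object maps each participant id to a cell rectangle. Fitting a 16:9 frame inside a cell of a different aspect ratio (letterbox or crop) is left to the drawing step.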

Synchronization is the harder constraint. Every stream carries its own RTP timestamp in its own codec clock — 90 kHz for video, 48 kHz for Opus — and packets arrive reordered and jittered. If you align media by arrival time, network jitter leaks straight into lip-sync, and a participant on a worse connection drifts visibly behind their own audio. The fix is to normalize every input to one monotonic output clock by its RTP timestamp (using RTCP sender reports to map each stream’s RTP clock to a common wall-clock reference), then sample all streams against that single timeline. The sync budget is roughly 250 ms of A/V drift end-to-end; beyond that, lip-sync is perceptibly broken.
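
The mapping described above can be sketched in a few lines. This assumes each stream keeps its most recent RTCP sender report as a pair of wall-clock milliseconds and the RTP timestamp captured at the same instant; the object shape and function names are illustrative, not taken from a particular RTP stack.

```javascript
const NTP_UNIX_OFFSET_S = 2208988800; // seconds between the NTP epoch (1900) and the Unix epoch (1970)

// Convert the 64-bit NTP timestamp carried in a sender report to Unix milliseconds.
function ntpToUnixMs(ntpSeconds, ntpFraction) {
  return (ntpSeconds - NTP_UNIX_OFFSET_S) * 1000 + (ntpFraction / 2 ** 32) * 1000;
}

// Signed distance between two 32-bit RTP timestamps, tolerant of wraparound.
function rtpDelta(ts, ref) {
  return (ts - ref) | 0; // ToInt32 folds the difference into [-2^31, 2^31)
}

// Map an RTP timestamp to the common wall clock using that stream's own sender report.
// senderReport: { wallMs, rtpTs } (hypothetical shape); clockRate: 90000 for video, 48000 for Opus.
function rtpToWallMs(rtpTs, senderReport, clockRate) {
  const ticks = rtpDelta(rtpTs, senderReport.rtpTs);
  return senderReport.wallMs + (ticks / clockRate) * 1000;
}
```

Subtracting the recording's start wall-clock time from the result gives a position on the output timeline, so a video frame and an audio packet captured at the same moment land on the same output PTS even though their RTP clocks are unrelated.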

Roster churn is the third constraint. Participants join and leave at arbitrary times, and the compositor must recompute layout on each change while keeping the encoder’s presentation timestamps strictly monotonic — a discontinuity in PTS produces a file that seeks wrong or refuses to play. A join mid-call also means a new stream whose RTP timestamp base is unrelated to everyone else’s; you cannot assume two senders started their clocks together, which is exactly why each stream must be mapped to the common reference via its own RTCP sender report before it ever reaches a tile. A leave is simpler but still requires care: drop the participant from the layout and stop sampling their queues, but never rewind or pause the shared output clock, or the encoder will emit a duplicate-PTS frame the muxer rejects.
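
One inexpensive way to catch a timeline regression at the moment it happens is to put a guard in front of the encoder. The sketch below is an assumption about how such a guard could look (the class name and fields are hypothetical); it only checks strict monotonicity and records offenders for later inspection.

```javascript
// Hypothetical guard: accepts a PTS only if it moves the timeline strictly forward.
class PtsGuard {
  constructor() {
    this.lastPts = -Infinity;
    this.violations = []; // { pts, lastPts, context } for every rejected timestamp
  }

  // Returns true when pts is strictly greater than the last accepted PTS.
  accept(pts, context = '') {
    if (pts > this.lastPts) {
      this.lastPts = pts;
      return true;
    }
    this.violations.push({ pts, lastPts: this.lastPts, context });
    return false;
  }
}
```

Calling accept(pts, 'after setRoster') before each encoder push makes a duplicate or rewound PTS show up next to the roster change that caused it, rather than as an unseekable file at the end.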

There is also a resolution-versus-cost trade-off baked into the layout choice. A grid at 1280×720 splits a fixed pixel budget across all tiles: a 12-person call in a 4×3 grid gets 320×240 cells, so each 16:9 face lands at roughly 320×180 regardless of source quality, and decoding 12 high-resolution streams only to scale them down to thumbnails wastes most of the decode work. Active-speaker layouts spend the budget where attention is, decoding the promoted tile at full resolution and the thumbnails cheaply, which is why large-room recordings almost always favor speaker layouts. When senders use simulcast, subscribe the compositor to the layer that matches each tile’s on-screen size: full resolution for the active speaker, a low spatial layer for thumbnails. That coupling to Simulcast-Aware Forwarding keeps the decoder from wasting cycles shrinking a 720p stream into a 180-pixel-high thumbnail.
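
Matching a simulcast layer to a tile can be as simple as picking the smallest layer that still covers the tile's height. The ladder below is a made-up example; real rid names and resolutions depend on what each sender actually publishes.

```javascript
// Hypothetical simulcast ladder; actual rids and heights come from the sender.
const LAYERS = [
  { rid: 'q', height: 180 },
  { rid: 'h', height: 360 },
  { rid: 'f', height: 720 },
];

// Smallest layer that still covers the tile; fall back to the largest one available.
function pickLayer(tileHeight, layers = LAYERS) {
  const sorted = [...layers].sort((a, b) => a.height - b.height);
  return sorted.find((l) => l.height >= tileHeight) ?? sorted[sorted.length - 1];
}
```

With this ladder a 180-pixel thumbnail subscribes to 'q', a 240-pixel grid cell to 'h', and the promoted speaker tile to 'f'. Re-running the selection after every layout change keeps subscriptions in step with the tiles.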

Minimal Runnable Implementation

// ffmpeg/GStreamer-style compositor + mixer stage. Inputs are decoded YUV frames and
// PCM buffers, each tagged with an RTP timestamp; output is one synced A/V timeline.
const OUTPUT_FPS = 30;
const FRAME_MS = 1000 / OUTPUT_FPS;          // 33.3 ms per composite frame
const SAMPLE_RATE = 48000;                   // audio mix clock
const SPEAKER_DEBOUNCE_MS = 1500;            // hold active speaker 1.5 s before switching

class Compositor {
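  constructor(encoder, canvas, mode = 'grid') {
    this.encoder = encoder;
    this.canvas = canvas;                    // drawing surface exposing clear/drawScaled/snapshot
    this.mode = mode;                        // 'grid' | 'speaker'
    this.layout = {};                        // id -> { x, y, w, h }, rebuilt on roster change
    this.participants = new Map();           // id -> { videoQueue, audioQueue, lastFrame, rmsEwma }
    this.outputPtsMs = 0;                    // single monotonic output clock, never rewinds
    this.activeSpeakerId = null;
    this.lastSwitchMs = 0;
  }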

  // Roster change: recompute layout, but DO NOT touch outputPtsMs. New joiners get a
  // tile from the next frame; leavers are dropped. The timeline stays continuous.
  setRoster(ids) {
    const keep = new Set(ids);
    for (const id of keep) if (!this.participants.has(id)) this.participants.set(id, newPeer());
    for (const id of [...this.participants.keys()]) if (!keep.has(id)) this.participants.delete(id);
    if (!keep.has(this.activeSpeakerId)) this.activeSpeakerId = null; // speaker left: free the big tile
    this.layout = computeLayout([...this.participants.keys()], this.mode, this.activeSpeakerId); // 'grid' | 'speaker'
  }

  // Called once per output frame on a fixed 33.3 ms timer.
  tick() {
    const pts = this.outputPtsMs;
    this.pickActiveSpeaker(pts);
    this.canvas.clear();

    for (const [id, p] of this.participants) {
      // latest decoded frame at-or-before this PTS; hold last frame across sender gaps
      const frame = p.videoQueue.latestAtOrBefore(pts) ?? p.lastFrame;
      if (!frame) continue;
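      p.lastFrame = frame;                   // remember it even if this tile is hidden
      const rect = this.layout[id];          // tile for grid, or large/thumb for speaker
      if (!rect) continue;                   // no tile in this layout (e.g. hidden in speaker mode)
      this.canvas.drawScaled(frame, rect.x, rect.y, rect.w, rect.h);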
    }
    this.encoder.pushVideo(this.canvas.snapshot(), pts);
    this.mixAudio(pts);
    this.outputPtsMs += FRAME_MS;            // advance the one shared clock
  }

  pickActiveSpeaker(nowMs) {
    let loudest = null, max = -Infinity;
    for (const [id, p] of this.participants) {
      p.rmsEwma = 0.8 * p.rmsEwma + 0.2 * p.audioQueue.rmsAround(nowMs); // smoothed energy
      if (p.rmsEwma > max) { max = p.rmsEwma; loudest = id; }
    }
    // debounce: keep the current speaker for at least SPEAKER_DEBOUNCE_MS before switching
    if (loudest && loudest !== this.activeSpeakerId && nowMs - this.lastSwitchMs > SPEAKER_DEBOUNCE_MS) {
      this.activeSpeakerId = loudest;
      this.lastSwitchMs = nowMs;
      // only the speaker layout depends on who is talking; a grid stays as it is
      if (this.mode === 'speaker') {
        this.layout = computeLayout([...this.participants.keys()], 'speaker', this.activeSpeakerId);
      }
    }
  }

  mixAudio(pts) {
    const out = new Float32Array(Math.round(SAMPLE_RATE / OUTPUT_FPS)); // 1600 samples per frame; avoids a fractional length
    for (const [, p] of this.participants) {
      const pcm = p.audioQueue.samplesAt(pts, out.length); // resampled to 48k, aligned by RTP ts
      if (pcm) for (let i = 0; i < out.length; i++) out[i] += pcm[i]; // sum sources
      // DTX gap → samplesAt returns null → contributes silence, timeline still advances
    }
    softLimit(out);                          // prevent clipping on simultaneous speech
    this.encoder.pushAudio(out, pts);
  }
}
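
The class above calls softLimit without defining it. One possible shape, offered as a sketch rather than the intended implementation, is a knee-based soft clipper: samples below the knee pass through unchanged, and louder samples are compressed smoothly so they stay within ±1. The knee value of 0.8 is an arbitrary assumption.

```javascript
// Hypothetical soft limiter: transparent below the knee, tanh-compressed above it.
// Operates in place on a Float32Array of mixed PCM and returns the same buffer.
function softLimit(buf, knee = 0.8) {
  const headroom = 1 - knee;
  for (let i = 0; i < buf.length; i++) {
    const mag = Math.abs(buf[i]);
    if (mag <= knee) continue; // quiet samples are left untouched
    // Map (knee, infinity) into (knee, 1) with slope 1 at the knee, so there is no kink.
    const squashed = knee + headroom * Math.tanh((mag - knee) / headroom);
    buf[i] = Math.sign(buf[i]) * squashed;
  }
  return buf;
}
```

A per-sample curve like this adds some distortion on sustained overlap; per-source attenuation before summing is the gentler alternative when several people talk at once for long stretches.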

Reproduction Steps & Debugging Log Patterns

Start a 3-participant recording with a grid layout; confirm three equal tiles render and outputPtsMs advances by one frame interval (1000/30 ≈ 33.3 ms) per tick.
Have a fourth participant join mid-call; verify setRoster recomputes a 2×2 layout and the encoder PTS does not jump or reset.
Switch to active-speaker mode and have two people alternate talking; watch the debounce hold each speaker for ~1.5 s before promoting the next.
Drop a participant abruptly (kill their connection); confirm the layout collapses to the remaining set and audio keeps mixing without a stall.
Run ffprobe on the finalized file and compare audio vs video duration at start, middle, and end.
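
For the ffprobe step, the duration comparison can be scripted. The sketch below assumes ffprobe was run with JSON output and that the container reports a per-stream duration (some containers do not, in which case the result is NaN); the file name in the comment is a placeholder.

```javascript
// Parses the JSON printed by:
//   ffprobe -v error -show_entries stream=codec_type,duration -of json recording.mp4
// and returns both durations in seconds plus their difference in milliseconds.
function avDriftMs(ffprobeJsonText) {
  const { streams = [] } = JSON.parse(ffprobeJsonText);
  const duration = (type) => {
    const s = streams.find((st) => st.codec_type === type);
    return s && s.duration !== undefined ? parseFloat(s.duration) : NaN;
  };
  const video = duration('video');
  const audio = duration('audio');
  // Positive drift means the video stream runs longer than the audio stream.
  return { video, audio, driftMs: Math.round((video - audio) * 1000) };
}
```

A single whole-file figure only shows the final drift. To see whether it grows, run the same comparison on segments cut from the start, middle, and end of the recording.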

Expected healthy log:

// roster=[a,b,c] layout=grid-3 pts=0ms
// roster=[a,b,c,d] layout=grid-4 pts=12440ms   // joined mid-call, PTS continuous
// activeSpeaker: a -> b  (held 1520ms)          // debounce respected
// roster=[a,c,d] layout=grid-3 pts=48900ms      // b left, no PTS jump
// ffprobe: video=600.10s audio=600.02s drift=80ms  // under the 250ms budget

A broken sync run shows the drift line climbing over time — drift=90ms at the start growing to drift=420ms by the end — which means an input was aligned on arrival time, or the audio timeline advanced on packet arrival and compressed during DTX silence. A layout that flickers between speakers every few frames means the debounce window is too short or RMS is not being smoothed. If the file refuses to seek or a player reports a negative timestamp after a mid-call join, the roster handler reset outputPtsMs instead of leaving it untouched — log the PTS immediately before and after every setRoster call and confirm it only ever increases.

To isolate which input drifted, log per-participant the offset between each stream’s normalized RTP timestamp and the current outputPtsMs once per second. A healthy stream holds a small, stable offset; the one that desyncs shows its offset walking away monotonically, which points straight at a sender whose RTCP sender reports were missing or stale when its clock was mapped. For audio specifically, count the silence samples the mixer padded for DTX gaps versus the real samples decoded — if padding dominates for a participant who was clearly speaking, their Opus DTX gaps are being mishandled and the timeline is compressing.
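
The per-participant offset check can be made numeric by fitting a line through recent offset samples and looking at the slope. This is a sketch under the assumption that one sample is recorded per second; the class name, window size, and field names are hypothetical.

```javascript
// Hypothetical tracker: keeps recent (time, offset) samples for one participant and
// reports how fast the offset is walking, in milliseconds of offset per second.
class OffsetTracker {
  constructor(windowSize = 30) {
    this.windowSize = windowSize;
    this.samples = []; // { t: seconds on the output clock, offset: ms }
  }

  add(outputPtsMs, normalizedRtpMs) {
    this.samples.push({ t: outputPtsMs / 1000, offset: normalizedRtpMs - outputPtsMs });
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  // Least-squares slope of offset over time; 0 when there is too little data.
  slopeMsPerSec() {
    const n = this.samples.length;
    if (n < 2) return 0;
    let st = 0, so = 0, stt = 0, sto = 0;
    for (const { t, offset } of this.samples) {
      st += t; so += offset; stt += t * t; sto += t * offset;
    }
    const denom = n * stt - st * st;
    return denom === 0 ? 0 : (n * sto - st * so) / denom;
  }
}
```

A healthy stream reports a slope near zero whatever its absolute offset. A stream whose slope stays at, say, -2 ms/s for a full window is losing two milliseconds every second and will exceed the 250 ms budget in about two minutes.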

Common Implementation Mistakes

Aligning on arrival time, not RTP timestamps. Jitter then becomes drift; the participant on the worst network desyncs first. Map each stream’s RTP clock to a common reference via RTCP sender reports and sample against one output clock.
Resetting PTS on roster change. A join or leave must not touch the output timeline. Recompute layout only; keep outputPtsMs monotonic or the file becomes unseekable.
No active-speaker debounce. Switching the main tile on raw per-frame energy makes the layout strobe. Smooth RMS and hold the floor 1–2 s.
Stalling audio on DTX silence. When Opus discontinuous transmission sends nothing, fill the output with silence on the output clock instead of waiting for the next packet; otherwise the audio timeline compresses and audio creeps ahead of video.
Dropping a participant’s last frame. Without a held last frame, any sender frame-rate dip punches a black hole into that tile. Always redraw the last decoded frame for an active participant.
Summing audio without limiting. Adding several speakers’ PCM clips hard when they overlap. Apply a soft limiter or per-source attenuation before encoding.
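
The DTX point can be illustrated with a reader that always returns a full window of samples, zero-filled wherever no decoded audio exists, and reports how much of it was padding. This is a sketch assuming decoded chunks are non-overlapping and indexed by sample position on the output clock; the function and field names are hypothetical.

```javascript
// Hypothetical window reader. chunks: [{ start: sampleIndex, pcm: Float32Array }],
// assumed non-overlapping. Gaps (for example DTX silence) come back as zeros, so the
// caller always receives exactly `length` samples and the timeline never compresses.
function readWindow(chunks, startSample, length) {
  const out = new Float32Array(length); // zero-initialized: gaps are already silence
  let real = 0;
  for (const c of chunks) {
    const from = Math.max(startSample, c.start);
    const to = Math.min(startSample + length, c.start + c.pcm.length);
    for (let i = from; i < to; i++) {
      out[i - startSample] = c.pcm[i - c.start];
      real++;
    }
  }
  return { out, real, padded: length - real };
}
```

The real and padded counts are the same figures the debugging section suggests logging: a participant who was clearly speaking but shows mostly padded samples points at mishandled gaps rather than genuine silence.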

FAQ

How do I keep lip-sync across participants with different network quality?

Normalize every stream to one output clock using its RTP timestamps and RTCP sender reports, never arrival time. A slow participant’s jitter is then absorbed by their own jitter buffer instead of leaking into the shared timeline, and all tiles stay within the ~250 ms A/V budget regardless of individual connection quality.

Should the compositor run on the SFU or a separate node?

Separate. Live decode-plus-encode is CPU-heavy and must not steal cycles from packet forwarding on the SFU (see Selective Forwarding Unit Design); an overloaded SFU drops media for live participants. The compositor consumes the same forwarded streams a subscriber would, so it can run anywhere the streams can be delivered.

Can I do this with ffmpeg or GStreamer instead of hand-writing the loop?

Yes — both express this as a filter graph (xstack/overlay for the grid, amix for audio in ffmpeg; compositor and audiomixer in GStreamer). The same rules apply: feed inputs tagged with normalized timestamps, keep one output clock, and rebuild the graph on roster changes while holding PTS continuous. Device-side capture hygiene from Audio/Video Track Management still governs the quality of what enters the mix.

Related: return to Server-Side Recording & Composition, and cross-reference Selective Forwarding Unit Design and Simulcast-Aware Forwarding for how the source streams are forwarded to the recorder.
