render: fix progressive A/V (lip-sync) drift across multi-segment concats #62

Open
wdeynes wants to merge 1 commit into browser-use:main from wdeynes:fix/concat-av-drift
@wdeynes wdeynes commented Jun 10, 2026

Summary

Long EDLs render with progressively worsening lip-sync: the audio runs ahead of the video, and the offset grows toward the end of the timeline. On a real 37-segment / 103 s edit, audio was about 570 ms early (lag −570 ms) by the last segment, which is plainly visible.

Root cause

extract_segment() writes each segment with -t <duration>, at a constant frame rate (-r 24), with AAC audio:

  • the video stream rounds up to a whole frame (41.7 ms steps at 24 fps)
  • the audio stream keeps the raw -t length (quantized only by AAC's 21.3 ms frame size)

so every segment's audio ends up ~17–40 ms shorter than its video. concat_segments() then concatenates with -c copy, and the concat demuxer packs each stream back-to-back independently, so the mismatch accumulates: segment N's audio plays roughly N × 17 ms before its video (the measurement below, 0.622 s over 37 segments, averages about 17 ms per segment).
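The accumulation can be illustrated with a minimal sketch. It uses a simplified model that is an assumption, not the code in helpers/render.py: video rounds up to whole frames, audio keeps the requested length exactly (AAC frame quantization is ignored), and the 2.8 s segment length is hypothetical.

```python
from fractions import Fraction
import math

OUTPUT_FPS = 24  # matches the -r 24 mentioned above


def video_duration(requested: Fraction, fps: int = OUTPUT_FPS) -> Fraction:
    """Duration after rounding up to a whole number of frames."""
    return Fraction(math.ceil(requested * fps), fps)


def cumulative_offsets(durations, fps: int = OUTPUT_FPS):
    """Audio-minus-video start offset of each segment after a back-to-back concat.

    Simplified model (assumption): audio keeps the requested length exactly.
    A negative offset means the audio starts before its video.
    """
    offsets = []
    video_t = Fraction(0)
    audio_t = Fraction(0)
    for d in durations:
        offsets.append(audio_t - video_t)
        video_t += video_duration(d, fps)  # video stream grows by whole frames
        audio_t += d                       # audio stream grows by the raw length
    return offsets


# Hypothetical EDL: five segments of 2.8 s each.
durations = [Fraction(28, 10)] * 5
for i, off in enumerate(cumulative_offsets(durations)):
    print(f"segment {i}: audio offset {float(off) * 1000:+.1f} ms")
```

In this model each 2.8 s segment contributes 1/30 s of mismatch (68 frames of video against 67.2 frames' worth of audio), so the offset grows linearly with the segment index.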

Measured on a 37-segment EDL (per-segment ffprobe stream durations):

| | video sum | audio sum | cumulative |
|---|---|---|---|
| before | 103.792 s | 103.170 s | −0.622 s |

Cross-correlating the output audio against the source audio at each segment's video-timeline position (numpy, 16 kHz mono, ±0.8 s search window) confirms the drift is progressive and audible:

| segment | output pos | lag before | lag after |
|---|---|---|---|
| 0 | 0.0 s | −21 ms | 0.0 ms |
| 8 | 27 s | −116 ms | −0.1 ms |
| 19 | 57 s | −318 ms | −0.1 ms |
| 29 | 84 s | −423 ms | 0.0 ms |
| 35 | 101 s | −569 ms | 0.0 ms |

(correlation confidence 0.90–1.00 at every checkpoint)
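For reference, a lag measurement of this kind can be sketched with numpy roughly as follows. This is an illustrative reconstruction, not the script used for the numbers above: the function name `estimate_lag_ms` is hypothetical, and the sign convention (negative means the output audio is early) is chosen to match the table.

```python
import numpy as np


def estimate_lag_ms(out, ref, sample_rate, max_lag_s=0.8):
    """Estimate how far `out` lags `ref`, in milliseconds.

    Negative values mean `out` is early relative to `ref`.
    Only lags within +/- max_lag_s are considered.
    """
    out = np.asarray(out, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    # Full cross-correlation; index k corresponds to lag k - (len(ref) - 1).
    corr = np.correlate(out, ref, mode="full")
    lags = np.arange(-(len(ref) - 1), len(out))
    window = np.abs(lags) <= int(max_lag_s * sample_rate)
    best = lags[window][np.argmax(corr[window])]
    return 1000.0 * best / sample_rate


# Synthetic check: noise at 1 kHz, with the "output" running 50 samples early.
rng = np.random.default_rng(0)
ref = rng.standard_normal(2000)
out = np.concatenate([ref[50:], np.zeros(50)])
print(estimate_lag_ms(out, ref, sample_rate=1000))
```

On real material the two inputs would be mono excerpts resampled to a common rate (16 kHz in the measurement above), taken at each segment's video-timeline position.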

Fix

  1. Quantize each segment to whole output frames: n_frames = round(duration × OUTPUT_FPS), vdur = n_frames / OUTPUT_FPS; cap video with -frames:v (the -t now overshoots by 0.5 s purely to give the audio filters enough input).
  2. Force audio to exactly vdur with atrim=end=vdur,apad=whole_dur=vdur (the 30 ms fades are unchanged, now timed against vdur).
  3. Use sample-exact PCM (pcm_s16le) in .mov intermediates instead of AAC mp4 segments — PCM stream durations are sample-accurate, with no encoder priming or frame rounding to survive the concat demuxer.
  4. Encode AAC once at the final composite: build_final_composite()'s early-return and filter paths now use -c:a aac -b:a 192k instead of -c copy/-c:a copy. Final deliverables are unchanged (.mp4, h264 + AAC, +faststart).
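Steps 1–3 can be sketched as below. This is a hedged illustration rather than the code in extract_segment(): the helper names `quantize` and `segment_args` are hypothetical, the 30 ms fades and the input/seek arguments are omitted, and only the output-side ffmpeg options are shown.

```python
OUTPUT_FPS = 24


def quantize(duration: float, fps: int = OUTPUT_FPS) -> tuple[int, float]:
    """Round a requested duration to a whole number of output frames."""
    n_frames = round(duration * fps)
    return n_frames, n_frames / fps


def segment_args(duration: float, fps: int = OUTPUT_FPS) -> list[str]:
    """Output-side ffmpeg options for one segment (illustrative only)."""
    n_frames, vdur = quantize(duration, fps)
    # Trim audio to vdur, then pad with silence if the input ran short.
    audio_filter = f"atrim=end={vdur:.6f},apad=whole_dur={vdur:.6f}"
    return [
        "-t", f"{duration + 0.5:.6f}",  # overshoot so the audio filters have input
        "-r", str(fps),
        "-frames:v", str(n_frames),     # cap video at exactly n_frames
        "-af", audio_filter,
        "-c:a", "pcm_s16le",            # sample-exact audio in the .mov intermediate
    ]


print(quantize(2.8))
print(segment_args(2.8))
```

If the audio is 48 kHz (an assumption based on the -ar 48000 in the final encode), one 24 fps frame is exactly 2000 samples, so vdur always lands on a sample boundary.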

After

  • all 37 segments: |audio − video| = 0.0 ms, cumulative diff 0.0000 s
  • cross-correlation lag 0.0 ms (±0.1 ms) at every checkpoint across the 103 s timeline
  • container duration now matches the EDL sum exactly (was +0.6 s)

Notes

  • Intermediates are renamed to clips_*/seg_NN_<src>.mov and base*.mov (PCM-in-mp4 is poorly supported; final outputs are still mp4). PCM audio costs ~11.5 MB/min of intermediate disk space, which is negligible next to the video data.
  • Behavior when a range overruns the source EOF is unchanged: apad fills audio to vdur; video may still come up short, as before.
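The ~11.5 MB/min figure can be checked with a quick calculation, assuming 48 kHz stereo pcm_s16le (the channel count is an assumption; the sample rate matches the -ar 48000 used for the final encode):

```python
def pcm_bytes_per_minute(sample_rate: int = 48_000,
                         channels: int = 2,
                         bytes_per_sample: int = 2) -> int:
    """Raw PCM data rate in bytes per minute (container overhead ignored)."""
    return sample_rate * channels * bytes_per_sample * 60


print(pcm_bytes_per_minute() / 1e6)  # 11.52 (MB per minute)
```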

🤖 Generated with Claude Code


Summary by cubic

Fixes progressive lip-sync drift across multi-segment renders. Audio now stays aligned with video across the entire timeline (measured −570 ms -> 0 ms on a 103 s/37-segment edit).

  • Bug Fixes

    • Quantize each segment to whole frames at OUTPUT_FPS=24 and cap video with -frames:v.
    • Force audio to match vdur exactly using atrim + apad (30 ms fades now timed to vdur).
    • Switch intermediates to sample-exact PCM .mov for safe -c copy concat; encode AAC only once in the final composite.
    • Outcome: 0.0 ms drift across the full edit; container duration matches the EDL sum.
  • Migration

    • Intermediates are now .mov: clips_*/seg_*.mov and base*.mov. Update any scripts that referenced .mp4.
    • Final deliverables remain .mp4 (H.264 + AAC, +faststart).

Written for commit f7206d8. Summary will update on new commits.


Per-segment video rounds up to whole 24fps frames while AAC audio keeps
the raw -t duration (~17-40ms shorter per segment). The -c copy concat
packs each stream back-to-back independently, so the mismatch
accumulates into progressive audio-early drift — measured -570ms over a
37-segment, 103s timeline via cross-correlation of output vs source
audio.

Quantize each segment to whole output frames (-frames:v, vdur=n/fps),
force the audio to exactly vdur (atrim + apad), and write sample-exact
PCM .mov intermediates, encoding AAC once at the final composite. After
the fix every segment has |a-v| = 0ms and output-vs-source
cross-correlation shows 0.0ms lag at every checkpoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/render.py">

<violation number="1" location="helpers/render.py:531">
P2: Double AAC encoding in loudnorm path: `build_final_composite()` encodes PCM → AAC for the prenorm intermediate, then `apply_loudnorm_two_pass()` re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.</violation>
</file>

Comment thread helpers/render.py
-        run(["ffmpeg", "-y", "-i", str(base_path), "-c", "copy", str(out_path)], quiet=True)
+        # No filters — copy video, encode the PCM intermediate audio to AAC for mp4
+        run(["ffmpeg", "-y", "-i", str(base_path), "-c:v", "copy",
+             "-c:a", "aac", "-b:a", "192k", "-ar", "48000",

@cubic-dev-ai cubic-dev-ai Bot Jun 10, 2026

P2: Double AAC encoding in loudnorm path: build_final_composite() encodes PCM → AAC for the prenorm intermediate, then apply_loudnorm_two_pass() re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.
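One possible way to address this, sketched under assumptions: keep the prenorm intermediate in PCM whenever a loudnorm pass follows, so that the loudnorm pass performs the only AAC encode. The helper name `composite_audio_args` and its `loudnorm_follows` flag are hypothetical and do not exist in helpers/render.py.

```python
def composite_audio_args(loudnorm_follows: bool) -> list[str]:
    """Pick audio codec options for the composite step (illustrative only)."""
    if loudnorm_follows:
        # Stay lossless; the later loudnorm pass does the single AAC encode.
        return ["-c:a", "pcm_s16le"]
    # No further audio processing: encode to AAC for the final mp4 here.
    return ["-c:a", "aac", "-b:a", "192k", "-ar", "48000"]


print(composite_audio_args(loudnorm_follows=True))
print(composite_audio_args(loudnorm_follows=False))
```

With this approach the prenorm intermediate would also need a container that carries PCM well, such as .mov, for the same reason the segment intermediates moved to .mov.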

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/render.py, line 531:

<comment>Double AAC encoding in loudnorm path: `build_final_composite()` encodes PCM → AAC for the prenorm intermediate, then `apply_loudnorm_two_pass()` re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.</comment>

<file context>
@@ -508,8 +526,10 @@ def build_final_composite(
-        run(["ffmpeg", "-y", "-i", str(base_path), "-c", "copy", str(out_path)], quiet=True)
+        # No filters — copy video, encode the PCM intermediate audio to AAC for mp4
+        run(["ffmpeg", "-y", "-i", str(base_path), "-c:v", "copy",
+             "-c:a", "aac", "-b:a", "192k", "-ar", "48000",
+             "-movflags", "+faststart", str(out_path)], quiet=True)
         return
</file context>
