render: fix progressive A/V (lip-sync) drift across multi-segment concats by wdeynes · Pull Request #62 · browser-use/video-use

wdeynes · 2026-06-10T01:15:05Z

Summary

Long EDLs render with progressively worsening lip-sync: the audio runs ahead of the video, and the offset grows toward the end of the timeline. On a real 37-segment / 103 s edit, the audio was about 570 ms ahead of the video (lag −570 ms) by the last segment, which is plainly visible.

Root cause

extract_segment() writes each segment with -t <duration> at CFR -r 24 + AAC audio:

- the video stream rounds up to a whole frame (41.7 ms steps at 24 fps)
- the audio stream keeps the raw -t length (quantized only by AAC's 21.3 ms frame size)

so every segment's audio ends up ~17–40 ms shorter than its video. concat_segments() then concatenates with -c copy, and the concat demuxer packs each stream back-to-back independently — so the mismatch accumulates: segment N's audio plays roughly N × 17 ms before its video.
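The accumulation described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the code in helpers/render.py: it assumes the video is padded up to the next whole frame and the audio keeps exactly the requested length (AAC frame quantization and encoder priming are ignored). The function names and the segment durations are made up for illustration.

```python
import math

OUTPUT_FPS = 24  # frame rate named in the PR


def video_duration(requested: float, fps: int = OUTPUT_FPS) -> float:
    """Length of the video stream when it is rounded up to a whole frame."""
    # The small epsilon keeps exact multiples (e.g. 0.75 s) from gaining a
    # frame because of floating-point noise.
    return math.ceil(requested * fps - 1e-9) / fps


def start_offsets(durations: list[float]) -> list[float]:
    """Audio-minus-video start offset (seconds) of each segment after a
    back-to-back concat. Negative means the audio starts early."""
    offsets = []
    video_end = audio_end = 0.0
    for d in durations:
        offsets.append(audio_end - video_end)
        video_end += video_duration(d)
        audio_end += d  # assumption: audio keeps the raw -t length
    return offsets


# Hypothetical segment lengths, not taken from the real EDL.
offsets = start_offsets([1.01, 2.02, 3.03, 0.75])
for i, off in enumerate(offsets):
    print(f"segment {i}: audio starts {off * 1000:+.1f} ms relative to video")
```

Each segment can only add to the gap, never shrink it, which is why the offset grows monotonically along the timeline.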

Measured on a 37-segment EDL (per-segment ffprobe stream durations):

	video sum	audio sum	cumulative diff (audio − video)
before	103.792 s	103.170 s	−0.622 s
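A per-segment measurement like this one can be scripted around ffprobe's JSON output. The sketch below is an assumption about how such a check might look; the helper names are hypothetical and are not part of the PR. It reads each stream's `duration` field and sums audio and video separately.

```python
import json
import subprocess


def parse_durations(probe: dict) -> dict:
    """Pick the first video and first audio stream duration (seconds)
    out of parsed `ffprobe -of json` output."""
    durations = {}
    for stream in probe.get("streams", []):
        kind = stream.get("codec_type")
        if kind in ("video", "audio") and kind not in durations and "duration" in stream:
            durations[kind] = float(stream["duration"])  # ffprobe reports strings
    return durations


def probe_stream_durations(path: str) -> dict:
    """Run ffprobe on one segment file and return its stream durations."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "stream=codec_type,duration",
         "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    )
    return parse_durations(json.loads(result.stdout))


def cumulative_av_diff(per_segment: list[dict]) -> tuple[float, float, float]:
    """Return (video sum, audio sum, audio minus video) over all segments."""
    video_sum = sum(d["video"] for d in per_segment)
    audio_sum = sum(d["audio"] for d in per_segment)
    return video_sum, audio_sum, audio_sum - video_sum
```

Calling `cumulative_av_diff([probe_stream_durations(p) for p in segment_paths])`, with `segment_paths` being the list of intermediate files, would give the three numbers in the table.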

Cross-correlating the output audio against the source audio at each segment's video-timeline position (numpy, 16 kHz mono, ±0.8 s search window) confirms the drift is progressive and audible:

segment	output pos	lag before	lag after
0	0.0 s	−21 ms	0.0 ms
8	27 s	−116 ms	−0.1 ms
19	57 s	−318 ms	−0.1 ms
29	84 s	−423 ms	0.0 ms
35	101 s	−569 ms	0.0 ms

(correlation confidence 0.90–1.00 at every checkpoint)
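The lag measurement can be sketched with plain numpy. This is an illustrative version under stated assumptions, not the script used for the table: both inputs are taken to be mono float arrays already resampled to 16 kHz, `estimate_lag_ms` is a hypothetical name, and "confidence" is assumed to mean the normalized correlation peak.

```python
import numpy as np


def estimate_lag_ms(source, output, sr=16000, max_lag_s=0.8):
    """Return (lag_ms, confidence) of `output` relative to `source`.

    Negative lag means the output audio is early. Confidence is the
    normalized correlation peak (close to 1.0 for a clean match).
    """
    src = np.asarray(source, dtype=np.float64)
    out = np.asarray(output, dtype=np.float64)
    src = src - src.mean()
    out = out - out.mean()

    # corr[i] is sum(out[n + k] * src[n]) for lag k = i - (len(src) - 1).
    corr = np.correlate(out, src, mode="full")
    lags = np.arange(-(len(src) - 1), len(out))

    # Only search within the allowed window (±0.8 s by default).
    window = np.abs(lags) <= int(round(max_lag_s * sr))
    best = int(np.argmax(corr[window]))
    lag = int(lags[window][best])

    norm = np.linalg.norm(src) * np.linalg.norm(out)
    confidence = float(corr[window][best] / norm) if norm > 0 else 0.0
    return 1000.0 * lag / sr, confidence


# Synthetic check: noise whose copy starts 160 samples (10 ms) early.
rng = np.random.default_rng(0)
reference = rng.standard_normal(4000)
early = np.concatenate([reference[160:], np.zeros(160)])
lag_ms, conf = estimate_lag_ms(reference, early, sr=16000, max_lag_s=0.05)
print(lag_ms, round(conf, 2))
```

`np.correlate` is quadratic in the clip length, so for multi-second excerpts an FFT-based correlation would be the more practical choice.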

Fix

- Quantize each segment to whole output frames: n_frames = round(duration × OUTPUT_FPS), vdur = n_frames / OUTPUT_FPS; cap video with -frames:v (the -t now overshoots by 0.5 s purely to give the audio filters enough input).
- Force audio to exactly vdur with atrim=end=vdur,apad=whole_dur=vdur (the 30 ms fades are unchanged, now timed against vdur).
- Use sample-exact PCM (pcm_s16le) in .mov intermediates instead of AAC mp4 segments: PCM stream durations are sample-accurate, with no encoder priming or frame rounding to survive the concat demuxer.
- Encode AAC once at the final composite: build_final_composite()'s early-return and filter paths now use -c:a aac -b:a 192k instead of -c copy/-c:a copy. Final deliverables are unchanged (.mp4, h264 + AAC, +faststart).
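The quantization and the audio trim/pad can be sketched as a small argument builder. This is a hedged illustration, not the actual extract_segment(): the function names, the placement of -ss before -i, the libx264 video codec, and the choice to overshoot -t relative to vdur are all assumptions, and the 30 ms fades are left out for brevity.

```python
OUTPUT_FPS = 24  # frame rate named in the PR


def quantize(duration: float, fps: int = OUTPUT_FPS) -> tuple[int, float]:
    """Snap a requested duration to a whole number of output frames."""
    # Note: Python's round() rounds exact .5 cases to the nearest even integer.
    n_frames = round(duration * fps)
    return n_frames, n_frames / fps


def segment_args(src: str, start: float, duration: float, out: str) -> list[str]:
    """Build a hypothetical ffmpeg command line for one segment."""
    n_frames, vdur = quantize(duration)
    # Trim the audio to vdur, then pad with silence if it came up short.
    audio_filter = f"atrim=end={vdur:.6f},apad=whole_dur={vdur:.6f}"
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.6f}", "-i", src,   # assumption: input seeking
        "-t", f"{vdur + 0.5:.6f}",          # overshoot so audio filters get enough input
        "-r", str(OUTPUT_FPS),
        "-frames:v", str(n_frames),         # hard cap on video length
        "-af", audio_filter,
        "-c:v", "libx264",                  # assumption: codec not stated in the PR text
        "-c:a", "pcm_s16le",                # sample-exact audio in a .mov container
        out,
    ]


args = segment_args("source.mp4", 12.0, 1.01, "seg_00_source.mov")
print(" ".join(args))
```

With both streams pinned to the same vdur, the concat demuxer has nothing left to accumulate, whatever the segment count.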

After

- all 37 segments: |audio − video| = 0.0 ms, cumulative diff 0.0000 s
- cross-correlation lag 0.0 ms (±0.1 ms) at every checkpoint across the 103 s timeline
- container duration now matches the EDL sum exactly (was +0.6 s)

Notes

- Intermediates are renamed clips_*/seg_NN_<src>.mov and base*.mov (PCM-in-mp4 is poorly supported; final outputs are still mp4). PCM audio costs ~11.5 MB/min of intermediate disk, which is negligible next to the video data.
- Behavior when a range overruns the source EOF is unchanged: apad fills audio to vdur; video may still come up short, as before.
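The ~11.5 MB/min figure for PCM intermediates follows from simple arithmetic. The sketch below assumes 48 kHz stereo audio at 16 bits per sample; the 16-bit depth comes from pcm_s16le, while the sample rate and channel count are assumptions about the intermediates.

```python
def pcm_bytes_per_minute(sample_rate: int = 48000,
                         channels: int = 2,
                         bytes_per_sample: int = 2) -> int:
    """Uncompressed PCM size for one minute of audio, in bytes."""
    return sample_rate * channels * bytes_per_sample * 60


megabytes = pcm_bytes_per_minute() / 1e6
print(f"{megabytes:.2f} MB per minute")  # 11.52 MB per minute
```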

🤖 Generated with Claude Code

Summary by cubic

Fixes progressive lip-sync drift across multi-segment renders. Audio now stays aligned with video across the entire timeline (measured −570 ms -> 0 ms on a 103 s/37-segment edit).

Bug Fixes
- Quantize each segment to whole frames at OUTPUT_FPS=24 and cap video with -frames:v.
- Force audio to match vdur exactly using atrim + apad (30 ms fades now timed to vdur).
- Switch intermediates to sample-exact PCM .mov for safe -c copy concat; encode AAC only once in the final composite.
- Outcome: 0.0 ms drift across the full edit; container duration matches the EDL sum.

Migration
- Intermediates are now .mov: clips_*/seg_*.mov and base*.mov. Update any scripts that referenced .mp4.
- Final deliverables remain .mp4 (H.264 + AAC, +faststart).

Written for commit f7206d8. Summary will update on new commits.

Per-segment video rounds up to whole 24fps frames while AAC audio keeps the raw -t duration (~17-40ms shorter per segment). The -c copy concat packs each stream back-to-back independently, so the mismatch accumulates into progressive audio-early drift, measured at -570ms over a 37-segment, 103s timeline via cross-correlation of output vs source audio.

Quantize each segment to whole output frames (-frames:v, vdur=n/fps), force the audio to exactly vdur (atrim + apad), and write sample-exact PCM .mov intermediates, encoding AAC once at the final composite.

After the fix every segment has |a-v| = 0ms and output-vs-source cross-correlation shows 0.0ms lag at every checkpoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 1 file

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/render.py">

<violation number="1" location="helpers/render.py:531">
P2: Double AAC encoding in loudnorm path: `build_final_composite()` encodes PCM → AAC for the prenorm intermediate, then `apply_loudnorm_two_pass()` re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.</violation>
</file>


cubic-dev-ai · 2026-06-10T01:17:50Z

-        run(["ffmpeg", "-y", "-i", str(base_path), "-c", "copy", str(out_path)], quiet=True)
+        # No filters — copy video, encode the PCM intermediate audio to AAC for mp4
+        run(["ffmpeg", "-y", "-i", str(base_path), "-c:v", "copy",
+             "-c:a", "aac", "-b:a", "192k", "-ar", "48000",
+             "-movflags", "+faststart", str(out_path)], quiet=True)
         return


P2: Double AAC encoding in loudnorm path: build_final_composite() encodes PCM → AAC for the prenorm intermediate, then apply_loudnorm_two_pass() re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.
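One way to read this finding is that the audio codec chosen for the composite depends on whether another audio pass follows. The sketch below is purely illustrative and is not a fix taken from the PR: `composite_audio_args`, `composite_suffix`, and the `loudnorm_follows` flag are hypothetical names, and it assumes the pre-normalization intermediate could be written as a PCM .mov so that apply_loudnorm_two_pass() performs the only AAC encode.

```python
def composite_audio_args(loudnorm_follows: bool) -> list[str]:
    """Pick ffmpeg audio codec arguments for the composite step."""
    if loudnorm_follows:
        # Keep the audio lossless; the later loudnorm pass would then be
        # the single place where AAC is encoded.
        return ["-c:a", "pcm_s16le"]
    # No further audio pass: this is the one and only AAC encode.
    return ["-c:a", "aac", "-b:a", "192k", "-ar", "48000"]


def composite_suffix(loudnorm_follows: bool) -> str:
    """PCM needs a .mov intermediate; the final deliverable stays .mp4."""
    return ".mov" if loudnorm_follows else ".mp4"


print(composite_audio_args(True), composite_suffix(True))
print(composite_audio_args(False), composite_suffix(False))
```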

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At helpers/render.py, line 531:

<comment>Double AAC encoding in loudnorm path: `build_final_composite()` encodes PCM → AAC for the prenorm intermediate, then `apply_loudnorm_two_pass()` re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.</comment>

<file context>
@@ -508,8 +526,10 @@ def build_final_composite(
-        run(["ffmpeg", "-y", "-i", str(base_path), "-c", "copy", str(out_path)], quiet=True)
+        # No filters — copy video, encode the PCM intermediate audio to AAC for mp4
+        run(["ffmpeg", "-y", "-i", str(base_path), "-c:v", "copy",
+             "-c:a", "aac", "-b:a", "192k", "-ar", "48000",
+             "-movflags", "+faststart", str(out_path)], quiet=True)
         return
</file context>

cubic-dev-ai Bot reviewed Jun 10, 2026

wdeynes wants to merge 1 commit into browser-use:main from wdeynes:fix/concat-av-drift
