Idea
Use ElevenLabs Text-to-Speech to add new words to the transcript, not just remove them. When the user types new words into the transcript, generate audio for those words during export.
Video sync challenge
Added words have no matching video footage. Two approaches:
- Freeze frame: hold the last frame of the preceding clip while the added words play. Doable with FFmpeg's tpad or loop filters. Looks like a brief pause in the video.
- Audio-only insert: only works for podcast/audio-first content.
Freeze frame is the more practical approach.
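As a rough sketch of the freeze-frame approach, the helper below builds an FFmpeg command that uses the tpad filter to repeat a clip's final frame for a given duration. The function name, file paths, and the choice to drop audio at this step are assumptions for illustration, not part of any existing export code.

```python
def freeze_frame_command(clip_path: str, out_path: str, hold_seconds: float) -> list[str]:
    """Build an FFmpeg command that holds the clip's last frame for hold_seconds.

    Hypothetical helper: only constructs the argument list, it does not run FFmpeg.
    """
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    # tpad with stop_mode=clone repeats the final frame; stop_duration is in seconds.
    video_filter = f"tpad=stop_mode=clone:stop_duration={hold_seconds:.3f}"
    return [
        "ffmpeg", "-y",
        "-i", clip_path,
        "-vf", video_filter,
        "-an",  # assumption: video only here, generated speech is muxed in later
        out_path,
    ]


if __name__ == "__main__":
    print(" ".join(freeze_frame_command("clip_003.mp4", "clip_003_held.mp4", 1.2)))
```

The resulting list can be passed to subprocess.run. The hold duration would normally be the length of the generated audio for that insertion.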
Flow
- User adds words in the transcript editor (new editing capability needed)
- During export, detect which words are "added" (no original audio)
- Generate audio for added words via ElevenLabs TTS API
- At each insertion point, freeze the last frame of the preceding clip
- Stitch everything together with FFmpeg
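The detection step in the flow above could look something like this. It assumes a hypothetical transcript model in which original words carry source timestamps and added words have none; consecutive added words are grouped into one phrase so that each insertion point needs a single TTS call and a single freeze frame.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    start: Optional[float] = None  # seconds in the source; None for added words
    end: Optional[float] = None


@dataclass
class Insertion:
    text: str  # phrase to generate with TTS
    at: float  # source timestamp where the freeze frame would begin


def find_insertions(words: list[Word]) -> list[Insertion]:
    """Group runs of added words and anchor each run to the preceding original word."""
    insertions: list[Insertion] = []
    pending: list[str] = []
    last_end = 0.0  # words added before any original word anchor at 0.0
    for word in words:
        if word.end is None:
            pending.append(word.text)
            continue
        if pending:
            insertions.append(Insertion(" ".join(pending), last_end))
            pending = []
        last_end = word.end
    if pending:  # added words at the very end of the transcript
        insertions.append(Insertion(" ".join(pending), last_end))
    return insertions
```

Words added before the first original word have no preceding clip to freeze, so that case would need separate handling (for example, holding the first frame instead).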
Complexity
This is a significant expansion. It requires:
- Transcript editing (currently only deletion is supported)
- Per-segment TTS generation with the selected voice
- Timeline manipulation to insert freeze frames at precise timestamps
- Matching the voice/tone of the surrounding audio
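To illustrate the timeline manipulation, here is a minimal sketch that turns a source duration and a list of insertions into an ordered list of output segments. The Segment type and build_timeline function are hypothetical, and the sketch ignores deletions, which the real export would also have to account for.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str            # "source" (play original footage) or "freeze" (hold a frame)
    source_start: float  # position in the source video, in seconds
    duration: float      # how long this segment lasts in the output
    output_start: float  # where the segment begins in the exported video


def build_timeline(source_duration: float,
                   insertions: list[tuple[float, float]]) -> list[Segment]:
    """insertions holds (source timestamp, generated audio duration) pairs."""
    segments: list[Segment] = []
    cursor = 0.0  # how far into the source we have consumed
    out = 0.0     # running position in the output
    for at, audio_duration in sorted(insertions):
        if at > cursor:
            segments.append(Segment("source", cursor, at - cursor, out))
            out += at - cursor
            cursor = at
        # The freeze lasts exactly as long as the generated speech.
        segments.append(Segment("freeze", at, audio_duration, out))
        out += audio_duration
    if cursor < source_duration:
        segments.append(Segment("source", cursor, source_duration - cursor, out))
    return segments
```

Each "freeze" segment maps to one held frame plus one generated audio clip; the "source" segments are cut from the original and everything is concatenated in order.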
Notes
- Came up during ElevenLabs voice recreation implementation
- The current STS (speech-to-speech) flow replaces the entire audio track; this would need per-word generation
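For per-phrase generation, a request to the ElevenLabs TTS endpoint might be built roughly as follows. The endpoint path, the xi-api-key header, and the model id reflect my understanding of the public API and should be checked against the current reference; the function name is hypothetical and the sketch only constructs the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"  # assumption: public API base URL


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build (but do not send) a TTS request for one inserted phrase."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with urllib.request.urlopen would be expected to return the audio bytes for that phrase, one call per insertion rather than one call for the whole track.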