Add words to transcript via ElevenLabs TTS + freeze frame #1

@madebysan

Idea

Use ElevenLabs Text-to-Speech to add new words to the transcript, not just remove them. When the user types new words into the transcript, generate audio for those words during export.

Video sync challenge

Added words have no matching video footage. Two approaches:

  1. Freeze frame: hold the last frame of the preceding clip while the added words play. Doable with FFmpeg's tpad filter (stop_mode=clone) or the loop filter. Looks like a brief pause in the video.
  2. Audio-only insert: add the generated audio with no change to the video. Only works for podcast/audio-first content.

Freeze frame is the more practical approach.
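
A minimal sketch of the freeze-frame idea, assuming the preceding clip has already been cut to its own file and that the hold time equals the duration of the generated audio. The function name and file names are hypothetical; the only FFmpeg feature relied on is the tpad filter with stop_mode=clone and stop_duration. The function only builds the argument list, so it can be inspected before anything is run.

```python
def freeze_frame_cmd(src: str, dst: str, hold_seconds: float) -> list[str]:
    """Build an FFmpeg command that extends `src` by holding its last frame.

    tpad with stop_mode=clone repeats the final frame for stop_duration
    seconds. Audio is dropped (-an) because the generated speech would be
    muxed in as a separate step (an assumption of this sketch).
    """
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    video_filter = f"tpad=stop_mode=clone:stop_duration={hold_seconds:.3f}"
    return ["ffmpeg", "-y", "-i", src, "-vf", video_filter, "-an", dst]


cmd = freeze_frame_cmd("clip_before.mp4", "clip_held.mp4", 1.5)
```

The list could be passed to subprocess.run(cmd, check=True) on a machine with FFmpeg installed. Because the video is filtered, it gets re-encoded, so the later stitching step would need matching codec settings across clips.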

Flow

  1. User adds words in the transcript editor (new editing capability needed)
  2. During export, detect which words are "added" (no original audio)
  3. Generate audio for added words via ElevenLabs TTS API
  4. At each insertion point, freeze the last frame of the preceding clip
  5. Stitch everything together with FFmpeg
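
Steps 2 to 4 could be sketched roughly as below. This assumes each transcript word carries source timestamps and that added words are the ones with no timestamps; the Word type and plan_segments function are hypothetical, not existing code. Deleted words are ignored here to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    text: str
    start: Optional[float] = None  # seconds in the source; None for added words
    end: Optional[float] = None

    @property
    def added(self) -> bool:
        # Assumption: a word with no source timestamp was typed by the user.
        return self.start is None


def plan_segments(words: list[Word]) -> list[tuple]:
    """Group words into ("original", start, end) and ("tts", text, freeze_at).

    Consecutive added words are merged into one TTS phrase, and freeze_at is
    the end time of the preceding original word (0.0 if there is none).
    """
    segments: list[tuple] = []
    last_end = 0.0
    for word in words:
        if word.added:
            if segments and segments[-1][0] == "tts":
                _, text, freeze_at = segments[-1]
                segments[-1] = ("tts", f"{text} {word.text}", freeze_at)
            else:
                segments.append(("tts", word.text, last_end))
        else:
            if segments and segments[-1][0] == "original":
                _, start, _ = segments[-1]
                segments[-1] = ("original", start, word.end)
            else:
                segments.append(("original", word.start, word.end))
            last_end = word.end
    return segments


plan = plan_segments(
    [Word("hello", 0.0, 0.4), Word("big"), Word("wide"), Word("world", 0.5, 0.9)]
)
```

Each "tts" segment would then get one generated audio clip and one freeze frame at freeze_at, and the "original" segments would be cut from the source before stitching.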

Complexity

This is a significant expansion. It requires:

  • Transcript editing (currently only deletion is supported)
  • Per-segment TTS generation with the selected voice
  • Timeline manipulation to insert freeze frames at precise timestamps
  • Matching the voice/tone of the surrounding audio
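
To illustrate the timeline part: every inserted freeze frame pushes all later content back by the length of its generated audio. A rough sketch of that bookkeeping, assuming the insertion points are known as (source time, TTS duration) pairs; the function name is hypothetical.

```python
def output_positions(inserts: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Map inserts from source time to output time.

    `inserts` holds (source_time, tts_duration) pairs in seconds. Each result
    is the (start, end) of that freeze frame on the exported timeline, after
    accounting for the time added by all earlier inserts.
    """
    positions = []
    shift = 0.0  # total freeze time inserted so far
    for source_time, duration in sorted(inserts):
        start = source_time + shift
        positions.append((start, start + duration))
        shift += duration
    return positions


positions = output_positions([(0.5, 1.0), (2.0, 0.25)])
# The second insert sits at 2.0 s in the source but 3.0 s in the output,
# because the first freeze frame added 1.0 s before it.
```

The same running shift would apply to any original-word timestamps that need to be shown against the exported video, for example in captions.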

Notes

  • Came up during ElevenLabs voice recreation implementation
  • The current STS (speech-to-speech) flow replaces the entire audio track; this feature would need per-word generation instead
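
For per-word (or per-phrase) generation, one request per added phrase might look like the sketch below. The endpoint path and the xi-api-key header reflect my understanding of the public ElevenLabs REST API and should be checked against the current docs; build_tts_request is a hypothetical helper, and the voice ID and key are placeholders. It only builds the request and does not send it.

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io"  # assumed base URL, verify against the docs


def build_tts_request(text: str, voice_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request for one added phrase."""
    if not text.strip():
        raise ValueError("text must not be empty")
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_tts_request("big wide", "VOICE_ID_PLACEHOLDER", "API_KEY_PLACEHOLDER")
```

Sending it with urllib.request.urlopen(request) should return the audio bytes if the assumptions above hold. Using the same voice ID as the existing voice recreation flow is probably the simplest way to keep the inserted words close to the surrounding audio.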

Labels: enhancement (New feature or request)
