Karaoke Maker is a local, source-first pipeline for turning a song plus lyrics into a karaoke video.
It prepares stems, drafts lyric timings, lets you review and correct those timings in a browser editor, and renders the final video with the bundled `video-gen/` Remotion app.
- Current release target: open-source v1
- Supported platform: macOS only
- Distribution model: clone the repo and run it locally
- Primary interface: `uv run karaoke-maker ...`

Prerequisites:

- Python 3.10
- uv
- Node.js 24.14.0
- npm 11.9.0
- FFmpeg / ffprobe on PATH
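Before a first run, it can save time to confirm the external tools the pipeline shells out to are actually discoverable. A minimal sketch (not part of the CLI itself):

```python
# Sanity-check that the external executables Karaoke Maker relies on
# are discoverable on PATH before running the pipeline.
import shutil

REQUIRED_TOOLS = ["ffmpeg", "ffprobe", "node", "npm", "uv"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of required executables not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```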
- Clone the repository:

  ```
  git clone https://github.com/nitsh/karaoke-maker.git
  cd karaoke-maker
  ```

- Install dependencies:

  ```
  uv sync
  npm --prefix video-gen install
  ```
- Run the demo pipeline:

  ```
  uv run karaoke-maker generate \
    --audio examples/demo/song.wav \
    --lyrics examples/demo/lyrics.txt \
    --alignment-provider detect-cues \
    --skip-render
  ```

This stages a complete demo run without invoking the final Remotion render. Generated artifacts will appear under:

- `output/alignment/`
- `video-gen/public/`

If you want the final video too, rerun the same command without `--skip-render`.
If you want a recognizable song instead of the tiny synthetic smoke test, the repository also ships two public-domain demo packs under examples/demo/real-songs/:
- `twinkle-twinkle-little-star`
- `mary-had-a-little-lamb`
Each pack includes:
- `song.wav`
- `vocals.wav`
- `instruments.wav`
- `lyrics.txt`
- `manual_aligned_lyrics.txt`
Example:
```
uv run karaoke-maker generate \
  --audio examples/demo/real-songs/twinkle-twinkle-little-star/song.wav \
  --lyrics examples/demo/real-songs/twinkle-twinkle-little-star/lyrics.txt \
  --alignment-provider detect-cues \
  --skip-render
```

These recordings were synthesized locally from public-domain melodies and lyrics, so the demo stays source-first and rights-clean.
You can either place files at the default runtime locations:
- `input/song.mp3`
- `input/lyrics.txt`
or pass explicit paths:
```
uv run karaoke-maker generate --audio /path/to/song.wav --lyrics /path/to/lyrics.txt
```

To run only the preparation stage:

```
uv run karaoke-maker prepare \
  --audio examples/demo/song.wav \
  --lyrics examples/demo/lyrics.txt \
  --alignment-provider detect-cues
```

This command:
- runs Demucs to split vocals and instrumentals
- creates an alignment draft
- stages review files into `video-gen/public/`
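The stem-splitting step can be pictured as a subprocess call to Demucs. The helper below is a sketch only, not the project's actual code; it assumes the standard `demucs` CLI with its two-stem separation option:

```python
# Sketch of how a prepare-style stage might invoke Demucs for a
# two-stem (vocals / everything-else) split. Hypothetical helper;
# the real pipeline's invocation may differ.
from pathlib import Path

def demucs_command(audio: Path, out_dir: Path) -> list[str]:
    """Build the argv for a vocals/instrumental split via the demucs CLI."""
    return [
        "demucs",
        "--two-stems", "vocals",  # produce vocals + accompaniment stems
        "-o", str(out_dir),
        str(audio),
    ]
```

Such a command list would typically be handed to `subprocess.run(...)` with error checking.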
To launch the browser review editor:

```
uv run karaoke-maker review
```

Then open:

http://127.0.0.1:8766
Saving from the editor writes:
- `input/manual_aligned_lyrics.txt`
- `video-gen/public/manual_aligned_lyrics.json`
- `output/alignment/corrections/<timestamp>.json`
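The timestamped corrections snapshot can be sketched as follows. This is an illustrative helper, not the editor's actual code, and the real timestamp format may differ:

```python
# Illustrative sketch of writing a corrections snapshot to
# output/alignment/corrections/<timestamp>.json. Hypothetical helper.
import json
from datetime import datetime, timezone
from pathlib import Path

def write_corrections(corrections: dict, root: Path) -> Path:
    """Persist a corrections dict under a UTC-timestamped filename."""
    out_dir = root / "output" / "alignment" / "corrections"
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"{stamp}.json"
    path.write_text(json.dumps(corrections, indent=2))
    return path
```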
If you edit input/manual_aligned_lyrics.txt directly:
```
uv run karaoke-maker publish
uv run karaoke-maker generate
```

Timing source behavior:

- `--timings-source auto` prefers reviewed manual timings when `input/manual_aligned_lyrics.txt` exists, otherwise falls back to the latest draft
- `--timings-source manual` requires reviewed manual timings
- `--timings-source draft` forces the latest draft timing JSON
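The selection rules above can be sketched in a few lines. This mirrors the documented behavior only; it is not the project's actual implementation:

```python
# Sketch of the documented --timings-source resolution rules.
# Hypothetical helper; paths follow the runtime layout described above.
from pathlib import Path

def resolve_timings_source(mode: str, root: Path) -> str:
    """Return 'manual' or 'draft' according to the --timings-source rules."""
    manual = root / "input" / "manual_aligned_lyrics.txt"
    if mode == "manual":
        if not manual.exists():
            raise FileNotFoundError("reviewed manual timings required")
        return "manual"
    if mode == "draft":
        return "draft"
    if mode == "auto":
        # prefer reviewed manual timings, else fall back to the latest draft
        return "manual" if manual.exists() else "draft"
    raise ValueError(f"unknown timings source: {mode}")
```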
`uv run karaoke-maker prepare`

Useful flags:

- `--audio`
- `--lyrics`
- `--skip-demucs`
- `--skip-aligner`
- `--alignment-provider {auto,elevenlabs,stable-ts,detect-cues,whisperx}`
- `--language`

`uv run karaoke-maker review`

`uv run karaoke-maker publish`

`uv run karaoke-maker generate`

Useful flags:

- `--timings-source {auto,manual,draft}`
- `--skip-render`
| Provider | Mode | When to use it | Notes |
|---|---|---|---|
| `auto` | fallback chain | Default | Tries providers in the configured order and uses the first one available. |
| `elevenlabs` | forced alignment | Best remote quality when you have API access | Requires `ELEVENLABS_API_KEY`. |
| `stable-ts` | local forced alignment | Useful for local alignment with reference lyrics | Requires the local `stable-ts` / `stable_whisper` dependency in the environment. |
| `detect-cues` | local heuristic timing | Fast fallback for predictable line-level timings | Uses silence detection on vocals-only audio. |
| `whisperx` | local transcription + timestamps | Fallback when reference lyrics are unavailable or forced alignment is unavailable | Requires the local `whisperx` dependency in the environment. |
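The `auto` fallback chain can be sketched as "try providers in order, pick the first whose requirements are met". The availability checks below are illustrative assumptions based on the table, not the project's actual logic:

```python
# Sketch of an `auto` provider fallback chain. Provider names mirror
# the table above; the availability checks are illustrative only.
import importlib.util
import os

def provider_available(name: str) -> bool:
    """Rough availability heuristics for each alignment provider."""
    if name == "elevenlabs":
        return bool(os.environ.get("ELEVENLABS_API_KEY"))
    if name == "stable-ts":
        return importlib.util.find_spec("stable_whisper") is not None
    if name == "whisperx":
        return importlib.util.find_spec("whisperx") is not None
    # the heuristic detect-cues provider needs no extra dependencies
    return name == "detect-cues"

def pick_provider(order=("elevenlabs", "stable-ts", "detect-cues", "whisperx")):
    """Return the first available provider in the configured order."""
    for name in order:
        if provider_available(name):
            return name
    raise RuntimeError("no alignment provider available")
```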
The pipeline writes runtime artifacts to:
- `output/alignment/alignment_result.json`
- `output/alignment/manual_draft.txt`
- `output/alignment/manual_draft.json`
- `video-gen/public/aligned_lyrics.json`
- `video-gen/public/alignment_review.json`
- `video-gen/public/manual_aligned_lyrics.json`
- `video-gen/public/render_meta.json`
- `video-gen/public/instruments.wav`
- `video-gen/out/video.mov`
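After a run, a quick existence check over these paths can confirm the pipeline completed. A hypothetical convenience helper, not part of the CLI:

```python
# Report which of the expected runtime artifacts are missing under a
# repo root. Hypothetical helper built from the artifact list above.
from pathlib import Path

EXPECTED_ARTIFACTS = [
    "output/alignment/alignment_result.json",
    "output/alignment/manual_draft.txt",
    "output/alignment/manual_draft.json",
    "video-gen/public/aligned_lyrics.json",
    "video-gen/public/alignment_review.json",
    "video-gen/public/manual_aligned_lyrics.json",
    "video-gen/public/render_meta.json",
    "video-gen/public/instruments.wav",
    "video-gen/out/video.mov",
]

def missing_artifacts(root: Path) -> list[str]:
    """Return the relative paths that do not yet exist under root."""
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).exists()]
```

Note that `video-gen/out/video.mov` is only produced when the final render is not skipped.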
`input/` and `output/` are runtime directories. Bundled demo assets live under `examples/demo/`, including the tiny synthetic smoke clip and the public-domain song packs in `examples/demo/real-songs/`.

Canonical Python source lives in `src/karaoke_maker/`. Root files such as `main.py`, `alignment_editor.py`, `detect_cues.py`, and `manual_aligned_lyrics_parser.py` are compatibility shims only.
Validation commands:
```
uv run pytest
uv run ruff check .
npm --prefix video-gen test
```

Additional implementation notes live in `docs/architecture.md`.
- macOS is the only validated platform for this release.
- The pipeline depends on local multimedia tooling and can be heavy on CPU.
- Automatic timings are still best treated as a draft. Manual review is expected for polished output.
- Provider availability depends on your local environment and credentials.