Trainscribe is a command-line tool that transcribes the audio files in a specified folder using OpenAI's Whisper and generates a metadata.csv file. The metadata file is intended for training or fine-tuning text-to-speech (TTS) models and uses one of the following formats:
- file_id|transcribed_text, or
- file_id|transcribed_text|speaker, if a speaker label is provided.
This is similar to the LJ Speech format, but lacks the additional field that holds normalized transcribed text for pronunciation. In particular, file_id|transcribed_text may be used in projects like piper-train, and file_id|transcribed_text|speaker in xtts-finetune.
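To make the two layouts concrete, here is a minimal Python sketch that parses a single metadata line in either format. The names `MetadataEntry` and `parse_metadata_line` are hypothetical helpers for illustration and are not part of Trainscribe. The sketch also assumes the transcribed text itself contains no `|` character.

```python
from typing import NamedTuple, Optional


class MetadataEntry(NamedTuple):
    """One metadata line; speaker is None for the two-field format."""
    file_id: str
    text: str
    speaker: Optional[str] = None


def parse_metadata_line(line: str) -> MetadataEntry:
    """Parse one pipe-delimited line in either supported format.

    Assumption: the transcribed text does not contain a '|' character.
    """
    fields = line.rstrip("\r\n").split("|")
    if len(fields) == 2:
        # file_id|transcribed_text
        return MetadataEntry(fields[0], fields[1])
    if len(fields) == 3:
        # file_id|transcribed_text|speaker
        return MetadataEntry(fields[0], fields[1], fields[2])
    raise ValueError(f"expected 2 or 3 fields, got {len(fields)}: {line!r}")
```

A line with any other number of fields raises `ValueError`, so a malformed line is reported early instead of being passed silently to a training script.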
- Python >=3.10, <3.14
- uv
- ffmpeg (install with sudo apt install ffmpeg)
Run the tool with:
uvx trainscribe --folder /path/to/audio/folder [options]

Transcribe a folder of audio files to metadata.csv using Whisper.
options:
  -h, --help            show this help message and exit
  --folder, -f FOLDER   Folder with audio files
  --lang, -l LANG       Language code for transcription (e.g. 'en')
  --model, -m MODEL     Whisper model name (tiny, base, small, medium, large, turbo)
  --speaker, -s SPEAKER
                        Speaker label to add to metadata lines
  --device, -d DEVICE   Device for whisper model (cuda/cpu)
  --output, -o OUTPUT

Transcribe English audio in dataset/wavs using the medium model:
uvx trainscribe --folder dataset/wavs --lang en --model medium

This generates dataset/wavs/metadata.csv.
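Before training on the generated file, it can be worth checking it for blank transcripts, which speech recognizers may produce for silent or very short clips. The following is a rough sketch of such a check using Python's standard csv module. The function names are hypothetical, and the sketch assumes the file is UTF-8 encoded with unquoted, pipe-delimited fields.

```python
import csv
from pathlib import Path


def load_metadata(path):
    """Read a pipe-delimited metadata file into a list of field lists.

    Assumptions: UTF-8 encoding, and quote characters are ordinary text.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        # Skip completely empty lines, e.g. a trailing blank line.
        return [row for row in reader if row]


def find_empty_transcripts(rows):
    """Return the file IDs whose transcribed text is missing or blank."""
    return [row[0] for row in rows if len(row) < 2 or not row[1].strip()]
```

For the example above, calling `load_metadata("dataset/wavs/metadata.csv")` and passing the result to `find_empty_transcripts` would list the clips to review or drop before training.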