This project implements a custom text-to-speech (TTS) system designed for voice cloning using the Kyutai Mimi audio codec. Inspired by architectures like SesameAI's CSM, it takes text and a short reference audio clip as input and generates speech in the reference speaker's voice.
The core idea is to train a custom Transformer-based model (NovaTTS) to generate intermediate Mimi audio codes conditioned on both text content and speaker characteristics. These codes are then decoded into an audio waveform using the pre-trained Mimi vocoder provided by the moshi library.
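Concretely, Mimi compresses 24 kHz audio into a small grid of discrete codebook indices, and NovaTTS learns to predict such grids from text and a speaker embedding. The round trip below is a minimal sketch following the upstream `moshi` usage examples (the `loaders` constants and model methods are taken from those examples; verify against your installed `moshi` version):

```python
# Minimal sketch of the Mimi codec round trip, adapted from upstream moshi
# usage examples; verify names against your installed moshi version.
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")

wav = torch.randn(1, 1, 24000)        # one second of dummy 24 kHz audio
with torch.no_grad():
    codes = mimi.encode(wav)          # discrete codes: [B, n_codebooks, T_frames]
    recon = mimi.decode(codes)        # waveform back: [B, 1, samples]
```

NovaTTS replaces the `encode` step at inference time: it predicts the code grid directly from text and a speaker embedding, so only `mimi.decode` is needed to obtain audio.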
- Voice Cloning: Generates speech in the voice of a speaker from a short reference audio clip.
- Text-to-Speech: Converts input text into audible speech.
- Custom Model: Trains a dedicated `TransformerDecoder` (`MimiCodeGenerator`) to predict Mimi codes.
- Pre-trained Components: Leverages pre-trained models for text representation (`Qwen/Qwen3-0.6B`), speaker encoding (`speechbrain/spkrec-ecapa-voxceleb`), and audio decoding (`moshi`).
- Streaming Architecture: The code generation component (`model.generate_codes`) is designed to yield code chunks, enabling potential low-latency inference (though the current `infer.py` script decodes the full sequence due to library constraints); see the sketch after this list.
- Experiment Tracking: Integrated with Weights & Biases (`wandb`) for monitoring training progress.
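A hypothetical end-to-end sketch of how these components combine (the `generate_codes` keyword arguments and the model-loading step are assumptions, not the exact API in `src/model.py`; the SpeechBrain import path varies by version):

```python
# Hypothetical sketch of voice-cloned generation; generate_codes arguments
# and model loading are assumptions, not the exact src/model.py API.
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier  # speechbrain >= 1.0

# Speaker embedding from a short reference clip (ECAPA-TDNN).
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
ref_wav, sr = torchaudio.load("path/to/reference_audio.wav")  # mono [1, T] expected
spk_emb = spk_encoder.encode_batch(ref_wav)  # (resampling omitted for brevity)

model = ...  # a trained NovaTTS restored from models/novatts_best.pth

# generate_codes yields Mimi code chunks; infer.py currently buffers them
# all and decodes the full sequence in a single call (see Known Issues).
chunks = list(model.generate_codes(text="Hello there.", speaker_embedding=spk_emb))
codes = torch.cat(chunks, dim=-1)    # [B, n_codebooks, T_frames]
wav = mimi.decode(codes)             # `mimi` loaded as in the sketch above
```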
- Clone the repository:

  ```bash
  git clone <repository_url>
  cd mimi-tts
  ```
- Create Environment & Install Dependencies: This project uses `uv` for environment management.

  ```bash
  # Create a virtual environment (e.g., using Python's venv)
  python -m venv .venv

  # Activate the environment
  source .venv/bin/activate

  # Install dependencies using uv
  uv pip install -r requirements.txt
  # Or using pip:
  # pip install -r requirements.txt
  ```
- Install Espeak-NG: The `phonemizer` library relies on `espeak-ng` as its backend.
  - macOS: `brew install espeak-ng`
  - Ubuntu/Debian: `sudo apt update && sudo apt install espeak-ng`
  - Other: Follow the instructions for your OS.

  Note: The code attempts to find the library path automatically on macOS; adjust `src/text_processing.py` if needed for other systems. A manual override is sketched below.
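  If automatic discovery fails on your system, `phonemizer` honors the `PHONEMIZER_ESPEAK_LIBRARY` environment variable; a minimal sketch (the Homebrew path shown is just an example location):

  ```python
  # Point phonemizer at the espeak-ng shared library when automatic discovery
  # fails; the path below is an example (Apple Silicon Homebrew), use the
  # location of libespeak-ng on your own system. Must be set before the
  # espeak backend is first used.
  import os
  os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = "/opt/homebrew/lib/libespeak-ng.dylib"

  from phonemizer import phonemize
  print(phonemize("hello world", language="en-us", backend="espeak"))
  ```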
- (Optional) Weights & Biases Setup:
  - Sign up for a free account at wandb.ai.
  - Log in via your terminal:

    ```bash
    wandb login
    ```

    Follow the prompts (you may need to paste an API key).
- Configuration: Review and adjust parameters in `src/config.py` (e.g., `DATASET_NAME`, `EPOCHS`, `BATCH_SIZE`, `LEARNING_RATE`).
- Prepare Data: Ensure the dataset specified in `config.py` is accessible (e.g., on Hugging Face Hub).
- Delete Old Checkpoints (Important on Retraining): If restarting training, remove previous checkpoints:

  ```bash
  rm models/*.pth
  ```

- Run Training: Activate the environment and run the training script:

  ```bash
  source .venv/bin/activate
  python src/train.py
  ```

  - Training progress (loss) will be printed to the console.
  - If `wandb` is enabled, metrics will be logged to your WandB dashboard (default project: `mimi-tts-nova`).
  - The script saves the best model checkpoint based on validation loss (if available) to `models/novatts_best.pth` (see the sketch after this list).
  - If no validation set is found (or created from the train set), checkpoints are saved after every epoch (`models/novatts_epoch_N.pth`).
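The checkpointing behaviour described above reduces to logic like this sketch (`train_one_epoch`, `evaluate`, and the loop variables are placeholders, not the actual code in `src/train.py`):

```python
# Illustrative checkpoint logic matching the behaviour described above;
# train_one_epoch/evaluate are placeholders, not the actual src/train.py code.
import torch
import wandb

best_val_loss = float("inf")
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)   # loss printed to console
    wandb.log({"train_loss": train_loss, "epoch": epoch})

    if val_loader is not None:
        val_loss = evaluate(model, val_loader)
        wandb.log({"val_loss": val_loss, "epoch": epoch})
        if val_loss < best_val_loss:                    # keep only the best model
            best_val_loss = val_loss
            torch.save(model.state_dict(), "models/novatts_best.pth")
    else:
        # No validation set: save a checkpoint after every epoch instead.
        torch.save(model.state_dict(), f"models/novatts_epoch_{epoch}.pth")
```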
- Ensure Checkpoint Exists: You need a trained model checkpoint (e.g., `models/novatts_best.pth`).
- Run Inference: Activate the environment and run the inference script:

  ```bash
  source .venv/bin/activate
  python src/infer.py --text "Your desired text to synthesize." \
      --ref_audio_path path/to/reference_audio.wav \
      --checkpoint_path models/novatts_best.pth \
      --output_path output/generated_speech.wav
  # Optional generation parameters:
  # --temperature 0.8 --top_k 50 --max_gen_len 1500
  ```

  - `--text`: The input text.
  - `--ref_audio_path`: Path to a short audio file (`.wav` recommended) of the target speaker.
  - `--checkpoint_path`: Path to the trained model checkpoint.
  - `--output_path`: Path where the generated `.wav` file will be saved.
  - `--temperature`, `--top_k`, `--max_gen_len`: Control the sampling process during code generation.
  - `--chunk_size`: Currently affects internal generation but not end-to-end latency due to decode limitations.
  Note: For models in early training stages, the script clamps generated codes into the valid codebook range to prevent decoding errors (a sketch follows below). This may result in glitchy audio until the model is better trained.
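Conceptually, the clamp is just a range restriction into valid codebook indices before decoding (`codes` and `mimi` are as in the earlier sketches; the constant name is illustrative):

```python
# Sketch of the safety clamp: an undertrained model can emit out-of-range
# token ids, so codes are forced into the valid codebook index range before
# the Mimi decoder sees them. CODEBOOK_SIZE is illustrative here; Mimi's
# codebooks have 2048 entries each.
import torch

CODEBOOK_SIZE = 2048
codes = torch.clamp(codes, min=0, max=CODEBOOK_SIZE - 1)
wav = mimi.decode(codes)
```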
- `src/config.py`: Central configuration for hyperparameters, paths, etc.
- `src/model.py`: Defines the `NovaTTS` model, including `MimiCodeGenerator`.
- `src/dataset.py`: Handles dataset loading, preprocessing, collation, and dataloader creation.
- `src/text_processing.py`: Contains `TextProcessor` for normalization, phonemization, and tokenization.
- `src/train.py`: Script for training the `NovaTTS` model.
- `src/infer.py`: Script for running inference with a trained model.
- `memory-bank/`: Documentation tracking project context, progress, and decisions.
- `models/`: Default directory for saving trained model checkpoints.
- `output/`: Default directory for saving generated audio files.
- `scripts/`: Utility scripts (e.g., reference audio samples).
- `requirements.txt`: Python dependencies.
- `README.md`: This file.
- PyTorch
- Transformers (Hugging Face)
- Datasets (Hugging Face)
- SpeechBrain
- Moshi (Kyutai)
- Librosa
- Phonemizer (with espeak-ng backend)
- WandB (Weights & Biases)
- uv (for environment/package management)
- Inference Clamping: Generated codes must be clamped during inference until the model is sufficiently trained.
- Phonemizer Warning: A benign "words count mismatch" warning from `phonemizer` occurs but is filtered from logs.
- True Streaming Inference: The `moshi` decoder currently seems to require the full code sequence, preventing true low-latency streaming in `infer.py`. Further investigation into `moshi`'s API or alternative decoding might be needed; one candidate approach is sketched after this list.
- Expressive Speech: Generating non-speech sounds (giggles, sighs) based on text tags requires appropriately tagged training data, which is currently unavailable.
- Model Evaluation: More rigorous evaluation metrics (e.g., WER, MOS, FAD) could be added.
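On the streaming item: upstream `moshi` examples show Mimi running inside a streaming context manager, so chunk-by-chunk decoding may be possible; the sketch below is untested in this project, which is exactly the open question above (`mimi`, `model`, and `spk_emb` are as in the earlier sketches):

```python
# Untested possibility for the streaming item above, based on the streaming
# context manager shown in upstream moshi examples; verify against your
# installed moshi version before relying on it.
import torch

pcm_chunks = []
with torch.no_grad(), mimi.streaming(batch_size=1):
    for code_chunk in model.generate_codes(text="Hello.", speaker_embedding=spk_emb):
        pcm_chunks.append(mimi.decode(code_chunk))   # decode as codes arrive
wav = torch.cat(pcm_chunks, dim=-1)
```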