Local-first voice conversion and voice cloning toolkit with a Gradio WebUI, pluggable model backends, and model download management.
Voice Morpher is built for local demos on Apple Silicon first. It does not bundle model weights or lock the app to one model. Instead, it provides a small application layer around open-source speech models such as Seed-VC, CosyVoice3, Qwen3 TTS / MLX, Chatterbox, F5-TTS, OpenVoice, IndexTTS, and XTTS.
- Convert source audio into a target speaker's timbre while preserving source rhythm and pauses as much as possible.
- Generate cloned speech from text and target speaker reference audio.
- Download supported Hugging Face or ModelScope model snapshots from the WebUI.
- Clone reusable voice profiles from reference audio in the WebUI.
- Use built-in WebUI flows first, with command-template backends kept for advanced setups.
- Keep runtime data, external model repos, and downloaded weights outside the source tree.
This is an MVP. The application flow, audio preprocessing, Gradio interface, voice profile library, model catalog, model download management, and backend adapters are implemented.
The default passthrough backend is intentionally model-free. It copies the preprocessed source audio to the output so the UI and pipeline can be tested before installing any model.
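A minimal sketch of what a model-free passthrough amounts to, assuming a backend receives a source path, a reference path, and an output path. The function name and signature are hypothetical and are not taken from this repository:

```python
import shutil
from pathlib import Path


def passthrough_convert(source: Path, reference: Path, output: Path) -> Path:
    """Copy the preprocessed source audio to the output path unchanged.

    `reference` is accepted only to mirror the shape of a real backend call
    (an assumption for illustration); a passthrough ignores it.
    """
    output.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, output)
    return output
```

Because no model is involved, the output should be byte-identical to the preprocessed source, which makes it a convenient smoke test for the UI and pipeline.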
| Workflow | Input | Output | Recommended backend |
|---|---|---|---|
| Voice conversion | Source audio + target reference audio | Converted audio | seed_vc_cli |
| TTS voice cloning | Target reference audio + text | Generated speech | cosyvoice3_builtin, cosyvoice3_cli, qwen3_tts_cli, chatterbox_cli, f5_tts_cli, openvoice_cli, indextts_cli, xtts_cli |
| Pipeline test | Source audio + target reference audio | Copied source audio | passthrough |
| Model family | Integration status | Notes |
|---|---|---|
| Qwen3 TTS / MLX | Download catalog + CLI backend | Strong Apple Silicon candidate. |
| CosyVoice3 | Download catalog + built-in backend + CLI backend | Built-in backend expects the official CosyVoice Python package locally. |
| Chatterbox | Download catalog + CLI backend | Open-source TTS voice cloning with expressive controls. |
| F5-TTS | Download catalog + CLI backend | Mature zero-shot TTS; check model license for commercial use. |
| OpenVoice V2 | Download catalog + CLI backend | Lightweight MIT-licensed voice cloning candidate. |
| IndexTTS-2 | Download catalog + CLI backend | Current public weights. IndexTTS-2.5 has a technical report, but no official downloadable weights are configured yet. |
| XTTS v2 | Download catalog + CLI backend | Classic multilingual voice cloning model; check license before production use. |
| Seed-VC | CLI backend | Audio-to-audio voice conversion, not TTS. |
CLI backend means the WebUI can call the model through a configured command template. It does not mean the third-party model runtime is bundled in this repository.
- macOS, Linux, or Windows
- Python 3.12+
- uv
- Optional model-specific runtimes for Seed-VC, CosyVoice3, Qwen3 TTS / MLX, or Chatterbox
Apple Silicon is the primary local target, but the app itself is model-agnostic.
uv sync
uv run python main.py
Open:
http://127.0.0.1:8000
If port 8000 is already in use:
VOICE_MORPHER_PORT=8001 uv run python main.py
The Gradio interface includes:
- Audio Voice Conversion: upload source audio and target reference audio.
- Voice Clone Library: save target speaker reference audio as a reusable voice profile.
- Text Voice Cloning: upload target reference audio and enter text.
- Model Download: download configured Hugging Face or ModelScope snapshots into models/.
- Model Configuration: inspect configured and missing CLI backends.
Downloadable model metadata is stored in:
config/models.toml
Add or edit entries there instead of editing Python code. Each entry defines the display label, Hugging Face repo, local directory name, task, and description.
Downloaded weights are stored under:
models/
models/ is ignored by git.
For example, the CosyVoice3 entry downloads to:
models/Fun-CosyVoice3-0.5B-2512/
The download source can be selected in the WebUI. Hugging Face is available for every configured model. ModelScope is available only when the model entry defines modelscope_id.
The built-in CosyVoice3 backend still expects the official CosyVoice repository to exist locally:
external/CosyVoice/
models/Fun-CosyVoice3-0.5B-2512/
The repository is not cloned by this app. Install it manually if you want to use the built-in CosyVoice3 backend.
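Before selecting the built-in backend, it can help to confirm that both directories exist. The check below is a hypothetical helper and is not part of the app; only the two directory names come from the text above:

```python
from pathlib import Path

# Directory names taken from the text above; the helper itself is an
# illustrative assumption, not code from this repository.
REQUIRED_DIRS = ("external/CosyVoice", "models/Fun-CosyVoice3-0.5B-2512")


def missing_cosyvoice3_paths(project_root: Path) -> list[str]:
    """Return the required directories that do not exist under project_root."""
    return [rel for rel in REQUIRED_DIRS if not (project_root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_cosyvoice3_paths(Path.cwd())
    if missing:
        print("Missing for CosyVoice3 built-in backend:", ", ".join(missing))
    else:
        print("CosyVoice3 built-in backend directories found.")
```

This only checks that the directories are present. It does not verify that the CosyVoice package is importable or that the weights are complete.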
Copy the example env file if you prefer file-based configuration:
cp .env.example .env
Command templates support these placeholders:
- {source}: preprocessed source audio path.
- {reference}: preprocessed target speaker reference audio path.
- {text}: input text for TTS voice cloning.
- {output}: output wav path that the backend command must create.
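One way to expand such a template safely is sketched below. This is an assumption about how substitution could work, not a description of the app's actual implementation, and `render_command` is a hypothetical name:

```python
import shlex


def render_command(template: str, **values: str) -> list[str]:
    """Split a command template into argv, then substitute placeholders.

    Splitting before substitution keeps a path or text that contains spaces
    as a single argument instead of letting it be re-split afterwards.
    An unknown placeholder raises KeyError.
    """
    return [token.format(**values) for token in shlex.split(template)]


argv = render_command(
    "python inference.py --source {source} --target {reference} --output {output}",
    source="/tmp/my source.wav",
    reference="/tmp/ref.wav",
    output="/tmp/out.wav",
)
print(argv)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` without `shell=True`, so user-supplied text in `{text}` is never interpreted by a shell.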
Install and test Seed-VC separately first, then configure:
VOICE_MORPHER_SEED_VC_COMMAND='python inference.py --source {source} --target {reference} --output {output}' \
uv run python main.py
Seed-VC may require a wrapper if its CLI writes to an output directory instead of a single wav file.
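The core of such a wrapper could look like the sketch below: run the real CLI into a scratch directory, then move the newest wav to the path given by {output}. The helper name is hypothetical, and nothing here assumes any particular Seed-VC flag or output naming scheme:

```python
import shutil
from pathlib import Path


def collect_newest_wav(output_dir: Path, output: Path) -> Path:
    """Move the most recently modified .wav in output_dir to output."""
    candidates = sorted(output_dir.glob("*.wav"), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no .wav file found in {output_dir}")
    output.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(candidates[-1]), str(output))
    return output
```

A wrapper script would call the model CLI with a temporary directory as its output location, call `collect_newest_wav` afterwards, and be referenced from the command template in place of the model CLI.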
The recommended app workflow is:
- Save a voice in Voice Clone Library.
- Download the CosyVoice3 model in Model Download.
- Install the official CosyVoice repo manually under external/CosyVoice.
- Open Text Voice Cloning, select the saved voice, and select CosyVoice3 Built-in.
The command-template backend is still available for custom setups:
VOICE_MORPHER_COSYVOICE3_COMMAND='python cosyvoice3_infer.py --prompt-audio {reference} --text {text} --output {output}' \
uv run python main.py
VOICE_MORPHER_QWEN3_TTS_COMMAND='python qwen3_tts.py --reference {reference} --text {text} --output {output}' \
uv run python main.py
VOICE_MORPHER_CHATTERBOX_COMMAND='python chatterbox_tts.py --reference {reference} --text {text} --output {output}' \
uv run python main.py
VOICE_MORPHER_F5_TTS_COMMAND='f5-tts_infer-cli --ref_audio {reference} --ref_text {prompt_text} --gen_text {text} --output_file {output}' \
uv run python main.py
VOICE_MORPHER_OPENVOICE_COMMAND='python openvoice_infer.py --reference {reference} --text {text} --output {output}' \
uv run python main.py
VOICE_MORPHER_INDEXTTS_COMMAND='python indextts_infer.py --reference {reference} --text {text} --output {output}' \
uv run python main.py
VOICE_MORPHER_XTTS_COMMAND='python xtts_infer.py --speaker_wav {reference} --text {text} --output {output}' \
uv run python main.py
.
├── config/
│ └── models.toml
├── docs/
│ └── TECHNICAL_DESIGN.md
├── src/
│ ├── backends/
│ ├── core/
│ ├── services/
│ └── ui/
├── tests/
├── .env.example
├── LICENSE
├── README.md
└── README.zh.md
uv run ruff check
uv run pytest
- Add wrappers for common Seed-VC output layouts.
- Add install helpers for external model repositories.
- Add background jobs and progress reporting for long downloads and inference.
- Add model-specific validation for required local files.
- Add long-audio slicing, silence handling, and vocal separation.
- Add ASR + TTS workflow for video translation.
Only use voices you own or are authorized to process. This project does not include consent verification, watermarking, or misuse detection. Add those controls before any public or commercial deployment.
This project is released under the MIT License.
Third-party model weights and model repositories have their own licenses. Check each model license before commercial use.
Maintainer: SK Studio
Email: developer@skstudio.cn
For model selection details, see docs/OPEN_SOURCE_VOICE_CLONING.md.