LinaCodec Voice Conversion


AI-powered voice conversion API deployed on RunPod Serverless using the LinaCodec model.

LinaCodec Voice Conversion is a serverless API that transforms speech from one voice to another. Simply provide a source audio file (content) and a reference audio file (style/timbre), and receive a converted audio file in MP3 or WAV format. Built for scalability on RunPod's GPU infrastructure with optional S3 integration for persistent storage.

Features

  • Zero-Cold-Start Optimization: Lazy model loading keeps container startup fast; the model is loaded on the first request and cached for subsequent ones
  • Flexible Audio Formats: Output as MP3 (192k bitrate) or WAV (PCM_16)
  • Dual Output Modes: Return audio as base64-encoded data or via S3 presigned URL
  • Session Isolation: UUID-based temporary file handling for concurrent processing
  • Persistent Caching: Network volume stores model cache across pod restarts
  • Optional S3 Integration: Upload outputs directly to S3 with presigned URLs (1-hour expiry)
  • High-Quality Output: 48kHz sample rate audio conversion
  • Graceful Error Handling: Comprehensive logging and fallback mechanisms

Architecture

Architecture Diagram

The system is built on RunPod Serverless with the following components:

  • Handler (handler.py): Main RunPod serverless entry point with a lazy-loaded model (see the sketch after this list)
  • Bootstrap (bootstrap.sh): Idempotent container initialization script
  • Configuration (config.py): Environment-based configuration management
  • Persistent Volume: Stores model cache, outputs, and Python environment across restarts
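
A minimal sketch of the lazy-loading pattern used by the handler is shown below. It is illustrative only: load_linacodec is a hypothetical placeholder, not the actual loader in handler.py.

# Lazy-loading sketch (hypothetical loader name; the real handler.py may differ)
import runpod

MODEL = None  # loaded on the first request, then reused by warm workers


def load_linacodec():
    # Placeholder: the real loader comes from the LinaCodec package,
    # whose exact import and constructor are not shown in this README.
    raise NotImplementedError


def get_model():
    """Load the LinaCodec model once and cache it in the worker process."""
    global MODEL
    if MODEL is None:
        MODEL = load_linacodec()
    return MODEL


def handler(job):
    model = get_model()  # first call pays the load cost; later calls reuse the cache
    # ... download inputs, run voice conversion, encode and return output ...
    return {"status": "success"}


runpod.serverless.start({"handler": handler})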

External Integrations

  • Hugging Face Hub: Model repository (requires HF_TOKEN)
  • AWS S3: Optional output storage (requires S3_* environment variables)
  • GitHub: LinaCodec source code repository

Data Flow

Data Flow Diagram

  1. Input: Client sends source audio URL, reference audio URL, and format preference
  2. Download: Audio files are downloaded to isolated temporary directories
  3. Model Load: LinaCodec model is loaded (cached after first request)
  4. Voice Conversion: model.convert_voice() processes the audio at 48kHz
  5. Encoding: Output is encoded to MP3 (FFmpeg) or WAV (Soundfile); see the sketch after this list
  6. Storage: Saved to persistent volume and optionally uploaded to S3
  7. Response: Returns S3 URL or base64-encoded audio
  8. Cleanup: Temporary files are removed
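
As a rough illustration of the encoding step (5), the converted audio can be written as WAV with soundfile and re-encoded to MP3 with FFmpeg. This is a sketch under assumed variable and path names, not the exact handler code:

# Encoding sketch (hypothetical helper; the real handler.py may differ)
import subprocess
import soundfile as sf

def encode_output(audio, sample_rate=48000, fmt="mp3",
                  out_path="/runpod-volume/LinaCodecVC/output/converted"):
    """Write converted audio as WAV (PCM_16) or MP3 (FFmpeg, 192k)."""
    wav_path = out_path + ".wav"
    sf.write(wav_path, audio, sample_rate, subtype="PCM_16")
    if fmt == "wav":
        return wav_path
    mp3_path = out_path + ".mp3"
    # Re-encode the intermediate WAV to MP3 at 192k using FFmpeg
    subprocess.run(["ffmpeg", "-y", "-i", wav_path, "-b:a", "192k", mp3_path],
                   check=True)
    return mp3_path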

Quick Start

Prerequisites

  • Docker installed
  • RunPod account with GPU support
  • Hugging Face API token (create one at https://huggingface.co/settings/tokens)
  • (Optional) S3 credentials for cloud storage

Local Development with Docker

# Clone the repository
git clone https://github.com/your-username/LinaCodec-Serverless.git
cd LinaCodec-Serverless

# Build the Docker image
docker build -t linacodec .

# Run with GPU support
docker run -d --gpus all \
  -e HF_TOKEN=your_hf_token_here \
  linacodec

Deployment to RunPod

# Build and push to RunPod
docker build -t your-registry/linacodec:latest .
docker push your-registry/linacodec:latest

# Deploy via RunPod Console or CLI
# Set environment variables in RunPod template:
# - HF_TOKEN (required)
# - S3_ENDPOINT_URL, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_BUCKET_NAME, S3_REGION (optional)

Usage

API Request Format

Send a POST request to your RunPod serverless endpoint:

{
  "input": {
    "audio_url_1": "https://example.com/source.mp3",
    "audio_url_2": "https://example.com/reference.mp3",
    "format": "mp3"
  }
}
Parameter    Type    Required  Description
audio_url_1  string  Yes       URL of source audio (content voice)
audio_url_2  string  Yes       URL of reference audio (style/timbre voice)
format       string  No        Output format: "mp3" (default) or "wav"
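
For example, the endpoint can be called synchronously through the RunPod API. The endpoint ID and API key below are placeholders:

# Example client call (placeholder endpoint ID and API key)
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

payload = {
    "input": {
        "audio_url_1": "https://example.com/source.mp3",
        "audio_url_2": "https://example.com/reference.mp3",
        "format": "mp3",
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,
)
print(resp.json())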

API Response Format

Success with S3 enabled:

{
  "status": "success",
  "format": "mp3",
  "audio_url": "https://s3-bucket.s3.region.amazonaws.com/output.mp3"
}

Success without S3 (base64):

{
  "status": "success",
  "format": "wav",
  "audio_base64": "UklGRiQAAABXQVZFZm10IBAAAAABAAEARKwAAIhYAQACABAAZGF0YQAAAAA..."
}

Error:

{
  "error": "Failed to download source audio from https://example.com/source.mp3"
}
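
When S3 is not configured, the base64 payload can be decoded back into a local audio file. A short example, assuming resp from the request snippet above (note that /runsync wraps the handler result in an "output" field):

# Decode a base64 response into a local audio file
import base64

data = resp.json()
result = data.get("output", data)  # unwrap the handler result if present

if "audio_base64" in result:
    out_name = "converted." + result.get("format", "wav")
    with open(out_name, "wb") as f:
        f.write(base64.b64decode(result["audio_base64"]))
elif "audio_url" in result:
    print("Download from:", result["audio_url"])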

Configuration

Environment Variables

Variable              Required  Default    Description
HF_TOKEN              Yes       -          Hugging Face API token for model access
S3_ENDPOINT_URL       No        -          Custom S3 endpoint URL
S3_ACCESS_KEY_ID      No        -          S3 access key ID
S3_SECRET_ACCESS_KEY  No        -          S3 secret access key
S3_BUCKET_NAME        No        -          S3 bucket name for output storage
S3_REGION             No        us-east-1  S3 region
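
A minimal sketch of how config.py might read these variables (illustrative; the actual module may differ):

# Environment-based configuration sketch (names match the table above)
import os

HF_TOKEN = os.environ["HF_TOKEN"]                    # required
S3_ENDPOINT_URL = os.environ.get("S3_ENDPOINT_URL")  # optional
S3_ACCESS_KEY_ID = os.environ.get("S3_ACCESS_KEY_ID")
S3_SECRET_ACCESS_KEY = os.environ.get("S3_SECRET_ACCESS_KEY")
S3_BUCKET_NAME = os.environ.get("S3_BUCKET_NAME")
S3_REGION = os.environ.get("S3_REGION", "us-east-1")

# S3 output upload is enabled only when all required S3 settings are present
S3_ENABLED = all([S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY, S3_BUCKET_NAME])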

RunPod Volume Structure

The persistent volume at /runpod-volume/LinaCodecVC/ contains:

/runpod-volume/LinaCodecVC/
├── output/    # Generated audio files
├── cache/     # HuggingFace model cache (HF_HOME)
├── src/       # LinaCodec source code
└── venv/      # Python virtual environment

Development

Bootstrap Process

The bootstrap.sh script runs on container start:

  1. Creates directory structure on network volume
  2. Checks for first-run flag file
  3. First run only:
    • Creates Python virtual environment
    • Installs PyTorch 2.9.1 with CUDA 12.8 support
    • Installs Flash Attention v2.8.3
    • Clones LinaCodec from GitHub
    • Installs LinaCodec package in editable mode
    • Installs Python dependencies
    • Creates first-run flag
  4. Subsequent runs: Activates existing venv
  5. Starts handler

Dependencies

# Core ML framework (installed in bootstrap)
torch==2.9.1
torchvision==0.24.1
torchaudio==2.9.1

# Flash Attention (installed in bootstrap)
flash_attn==2.8.3

# RunPod serverless
runpod>=1.6.0

# Audio processing
librosa
soundfile
numpy>=1.26.0

# HTTP/storage
boto3>=1.26.0
requests

# Model acceleration
hf_transfer

Audio Processing Details

  • Sample Rate: 48kHz output (LinaCodec native rate)
  • MP3 Encoding: FFmpeg with 192k bitrate, constant quality
  • WAV Encoding: Soundfile with PCM_16 subtype
  • Input Handling: Automatic float32 to int16 conversion with clamping
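
The float32 to int16 conversion mentioned above amounts to clamping samples to [-1.0, 1.0] and scaling to the 16-bit PCM range, roughly:

# Float32 -> int16 conversion with clamping (illustrative)
import numpy as np

def float32_to_int16(audio: np.ndarray) -> np.ndarray:
    """Clamp float audio to [-1, 1] and scale to 16-bit PCM range."""
    clipped = np.clip(audio, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)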

Troubleshooting

Common Issues

Issue: "Failed to load model at startup"

  • Solution: Ensure HF_TOKEN is set and valid. Check network connectivity to Hugging Face.

Issue: "FFmpeg encoding failed"

  • Solution: FFmpeg is installed in bootstrap.sh. Verify installation or check audio input format.

Issue: Slow first request

  • Solution: Expected behavior due to model download (~2GB). Subsequent requests are faster due to caching.

Issue: S3 upload fails, no base64 fallback

  • Solution: Check S3 credentials and endpoint. Ensure bucket exists and credentials have write permissions.
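
To isolate credential or bucket problems outside the handler, a quick sanity check along these lines (using the same S3_* environment variables from the Configuration section) can help; the test file name is arbitrary:

# Quick S3 sanity check (assumes a local test.mp3 exists)
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ.get("S3_ENDPOINT_URL"),
    aws_access_key_id=os.environ["S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["S3_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("S3_REGION", "us-east-1"),
)
bucket = os.environ["S3_BUCKET_NAME"]

s3.upload_file("test.mp3", bucket, "test.mp3")  # fails fast on bad credentials or permissions
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "test.mp3"},
    ExpiresIn=3600,  # 1-hour expiry, matching the handler's presigned URLs
)
print(url)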

Logs

View logs for debugging:

# RunPod logs via the runpodctl CLI
runpodctl logs <pod_id>

# Or check container logs
docker logs <container_id>

Technology Stack

  • Runtime: Python 3.10+
  • ML Framework: PyTorch 2.9.1 with CUDA 12.8
  • Optimization: Flash Attention v2.8.3
  • Model: LinaCodec by ysharma3501
  • Platform: RunPod Serverless
  • Container Base: runpod/base:1.0.3-cuda1281-ubuntu2404
  • Audio Tools: FFmpeg, SoX, Librosa, Soundfile
  • Storage: AWS S3 (optional)

License

This project is licensed under the MIT License - see LICENSE for details.

Acknowledgments

  • LinaCodec by ysharma3501
  • RunPod for the serverless GPU platform
