Multitalker LiveKit Agent

GPU-enabled LiveKit Agent for streaming multi-speaker transcription using NVIDIA NeMo's multitalker ASR and diarization models.

Features

  • Real-time streaming multi-speaker transcription
  • Speaker diarization (up to 4 speakers)
  • Support for multiple audio tracks per room
  • Multiple concurrent rooms via LiveKit AgentServer scaling
  • Model prewarming for efficient resource usage

Requirements

  • Docker with NVIDIA GPU support
  • NVIDIA GPU with CUDA 12.1+ support
  • LiveKit server (cloud or self-hosted)

Quick Start

Build the Docker Image

docker build -t multitalker-livekit-agent:latest .

Run the GPU Smoke Test

Verify the container can access the GPU and load the models:

docker run --gpus all --env-file .env multitalker-livekit-agent:latest \
  python -m agent.gpu_smoke_test

Run the Agent

docker run --gpus all --env-file .env multitalker-livekit-agent:latest

Configuration

Copy .env.example to .env and fill in your LiveKit credentials:

cp .env.example .env

Environment variables:

Variable             Description                                            Default
LIVEKIT_URL          LiveKit server URL                                     (required)
LIVEKIT_API_KEY      LiveKit API key                                        (required)
LIVEKIT_API_SECRET   LiveKit API secret                                     (required)
INPUT_TRACK_NAME     Name of audio track to transcribe                      mix
TRANSCRIPT_TOPIC     Data topic for transcript messages                     multitalker_transcript
ROOM_PREFIX          Only join rooms with this prefix (empty = all rooms)   (empty)
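
As illustration only, the table above maps onto a small Python sketch that reads the variables with their documented defaults (the repo's actual configuration code may differ):

# Illustrative sketch: read the documented variables with their defaults.
import os

LIVEKIT_URL = os.environ["LIVEKIT_URL"]                  # required
LIVEKIT_API_KEY = os.environ["LIVEKIT_API_KEY"]          # required
LIVEKIT_API_SECRET = os.environ["LIVEKIT_API_SECRET"]    # required
INPUT_TRACK_NAME = os.getenv("INPUT_TRACK_NAME", "mix")
TRANSCRIPT_TOPIC = os.getenv("TRANSCRIPT_TOPIC", "multitalker_transcript")
ROOM_PREFIX = os.getenv("ROOM_PREFIX", "")               # empty = join all rooms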

Testing

Offline Pipeline Test

Test the NeMo pipeline with a local audio file:

docker run --gpus all -v $(pwd)/tests/data:/audio \
  multitalker-livekit-agent:latest \
  python -m agent.multitalker_pipeline --audio /audio/sample.wav

Run Unit Tests

docker run --gpus all multitalker-livekit-agent:latest \
  pytest -q

Multi-Room Integration Test

With the agent running and connected to LiveKit:

python scripts/multi_room_test.py
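
The harness itself isn't reproduced here, but a minimal version might create a few prefixed rooms through the LiveKit server API so that several jobs get dispatched (a sketch assuming the livekit-api package; room names are hypothetical):

# Sketch: create several rooms so multiple jobs are dispatched.
import asyncio

from livekit import api


async def main() -> None:
    lkapi = api.LiveKitAPI()  # reads LIVEKIT_URL / _API_KEY / _API_SECRET from env
    for i in range(3):
        await lkapi.room.create_room(
            api.CreateRoomRequest(name=f"transcribe-test-{i}")
        )
    await lkapi.aclose()


asyncio.run(main())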

Architecture

Scaling Model

The agent uses LiveKit's standard AgentServer scaling model:

  1. Worker Processes: The AgentServer spawns worker processes
  2. Model Prewarming: Each worker loads NeMo models once via prewarm()
  3. Multiple Jobs: Each worker can handle multiple concurrent jobs (rooms)
  4. Horizontal Scaling: Deploy multiple agent containers for higher capacity

State Management

  • Shared across jobs in proc.userdata (see the sketch after this list):

    • MultitalkerTranscriptionConfig
    • SortformerEncLabelModel (diarization)
    • ASRModel (multitalker ASR)
  • Per-job state (local to each room):

    • Sessions per audio track
    • Asyncio tasks per track
    • Frame count metrics
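
A minimal sketch of this worker/job split, assuming the livekit-agents Python SDK; the NeMo loader call and checkpoint name are assumptions, not necessarily what this repo uses:

import nemo.collections.asr as nemo_asr
from livekit import agents


def prewarm(proc: agents.JobProcess):
    # Runs once per worker process: park the heavy models in
    # proc.userdata so every job in this process reuses them.
    proc.userdata["diar_model"] = nemo_asr.models.SortformerEncLabelModel.from_pretrained(
        "nvidia/diar_sortformer_4spk-v1"  # assumed checkpoint name
    )


async def entrypoint(ctx: agents.JobContext):
    # Runs once per job (room): shared models come from the process;
    # sessions, tasks, and metrics stay local to this job.
    diar_model = ctx.proc.userdata["diar_model"]
    await ctx.connect()


if __name__ == "__main__":
    agents.cli.run_app(
        agents.WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm)
    )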

Multi-Track Support

  • One job (room) can handle multiple audio tracks concurrently
  • Each track gets its own MultitalkerStreamingSession (sketched after this list)
  • Transcripts include session/track identification
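
A sketch of that per-track fan-out; MultitalkerStreamingSession's interface is assumed here, and make_session/push_audio are hypothetical helpers:

import asyncio

from livekit import rtc


async def pump_frames(stream: rtc.AudioStream, session) -> None:
    # Forward decoded audio frames from LiveKit into the NeMo session.
    async for event in stream:
        session.push_audio(event.frame.data)  # hypothetical session method


def handle_audio_track(room: rtc.Room, track: rtc.Track,
                       sessions: dict, tasks: dict) -> None:
    # One streaming session plus one asyncio task per subscribed track.
    sessions[track.sid] = make_session(f"{room.name}_{track.sid}")  # hypothetical factory
    tasks[track.sid] = asyncio.create_task(
        pump_frames(rtc.AudioStream(track), sessions[track.sid])
    )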

Transcript Format

Transcripts are published as JSON via LiveKit data packets:

{
  "type": "multitalker_transcript",
  "segments": [
    {
      "session_id": "room-name_track-sid",
      "speaker": "spk_0",
      "start_time": 1.5,
      "end_time": 3.2,
      "text": "Hello, how are you?",
      "is_final": true
    }
  ]
}
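
On the consuming side, the schema above maps onto a small typed parser; this helper is illustrative, not part of the repo:

import json
from dataclasses import dataclass


@dataclass
class Segment:
    session_id: str
    speaker: str
    start_time: float
    end_time: float
    text: str
    is_final: bool


def parse_transcript(payload: bytes) -> list[Segment]:
    message = json.loads(payload.decode("utf-8"))
    if message.get("type") != "multitalker_transcript":
        return []
    return [Segment(**segment) for segment in message["segments"]]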

Development

Local Development Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install NeMo (requires CUDA); quote the URL so the brackets aren't shell-globbed
pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"
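
After installing, a quick sanity check that PyTorch sees the GPU and NeMo imports cleanly:

import torch
import nemo.collections.asr as nemo_asr  # noqa: F401  (import should succeed)

print("CUDA available:", torch.cuda.is_available())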

Project Structure

.
├── agent/
│   ├── __init__.py
│   ├── agent.py                      # LiveKit agent entrypoint
│   ├── gpu_smoke_test.py             # GPU/model verification script
│   ├── multitalker_pipeline.py       # NeMo streaming wrapper
│   └── multitalker_transcript_config.py  # Configuration dataclass
├── scripts/
│   └── multi_room_test.py            # Integration test harness
├── tests/
│   ├── __init__.py
│   └── test_multitalker_pipeline.py  # Unit tests
├── Dockerfile
├── requirements.txt
├── .env.example
└── README.md

Kubernetes Deployment

Push Image to Registry

docker tag multitalker-livekit-agent:latest your-registry.com/multitalker-livekit-agent:latest
docker push your-registry.com/multitalker-livekit-agent:latest

Create Kubernetes Secret

kubectl create secret generic livekit-agent-secrets \
  --from-literal=LIVEKIT_URL=wss://your-livekit-server.com \
  --from-literal=LIVEKIT_API_KEY=your-api-key \
  --from-literal=LIVEKIT_API_SECRET=your-api-secret

Deployment Manifest

# multitalker-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multitalker-livekit-agent
  labels:
    app: multitalker-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: multitalker-agent
  template:
    metadata:
      labels:
        app: multitalker-agent
    spec:
      containers:
      - name: agent
        image: your-registry.com/multitalker-livekit-agent:latest
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "2"
        envFrom:
        - secretRef:
            name: livekit-agent-secrets
        env:
        - name: INPUT_TRACK_NAME
          value: "mix"
        - name: TRANSCRIPT_TOPIC
          value: "multitalker_transcript"
        - name: ROOM_PREFIX
          value: "transcribe-"  # Only join rooms starting with "transcribe-"
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-gpu

Apply with:

kubectl apply -f multitalker-agent-deployment.yaml

How It Works

The agent registers with LiveKit using the Agents SDK. When a room is created, LiveKit automatically dispatches an available agent worker to join. The agent:

  1. Subscribes to audio tracks matching INPUT_TRACK_NAME
  2. Runs real-time ASR with speaker diarization
  3. Publishes transcripts via the LiveKit data channel on TRANSCRIPT_TOPIC (sketched below)
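
A simplified sketch of that flow with the livekit-agents SDK; run_asr and the exact wiring are hypothetical (see agent/agent.py for the real code):

import asyncio
import json

from livekit import agents, rtc


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    async def transcribe(track: rtc.Track):
        # Run streaming ASR over decoded frames, then publish each batch.
        async for event in rtc.AudioStream(track):
            segments = run_asr(event.frame)  # hypothetical NeMo call
            payload = json.dumps(
                {"type": "multitalker_transcript", "segments": segments}
            ).encode("utf-8")
            await ctx.room.local_participant.publish_data(
                payload, topic="multitalker_transcript"
            )

    @ctx.room.on("track_subscribed")
    def on_track(track: rtc.Track, pub: rtc.TrackPublication,
                 participant: rtc.RemoteParticipant):
        if track.kind == rtc.TrackKind.KIND_AUDIO and pub.name == "mix":
            asyncio.create_task(transcribe(track))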

Receiving Transcripts in Your Application

JavaScript/TypeScript:

import { Room, RoomEvent } from 'livekit-client';

const room = new Room();

room.on(RoomEvent.DataReceived, (payload, participant, kind, topic) => {
  if (topic === 'multitalker_transcript') {
    const transcript = JSON.parse(new TextDecoder().decode(payload));
    transcript.segments.forEach((segment) => {
      console.log(`[${segment.speaker}]: ${segment.text}`);
    });
  }
});

await room.connect(LIVEKIT_URL, accessToken);

Python:

import json

from livekit import rtc

room = rtc.Room()


def on_data_received(packet: rtc.DataPacket):
    # Event callbacks are synchronous in the livekit Python SDK.
    if packet.topic == "multitalker_transcript":
        transcript = json.loads(packet.data.decode("utf-8"))
        for segment in transcript.get("segments", []):
            print(f"[{segment['speaker']}]: {segment['text']}")


room.on("data_received", on_data_received)
await room.connect(livekit_url, token)

Publishing Audio for Transcription

Ensure your audio track name matches INPUT_TRACK_NAME:

import { createLocalAudioTrack, Track } from 'livekit-client';

const track = await createLocalAudioTrack();
await room.localParticipant.publishTrack(track, {
  name: 'mix',  // Must match INPUT_TRACK_NAME
  source: Track.Source.Microphone,
});
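
A Python equivalent (a sketch with the livekit rtc API; the 48 kHz mono source is an assumption):

from livekit import rtc

source = rtc.AudioSource(sample_rate=48000, num_channels=1)
track = rtc.LocalAudioTrack.create_audio_track("mix", source)  # must match INPUT_TRACK_NAME
options = rtc.TrackPublishOptions(source=rtc.TrackSource.SOURCE_MICROPHONE)
await room.local_participant.publish_track(track, options)
# Then push rtc.AudioFrame objects into `source` via await source.capture_frame(frame).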

Scaling

For horizontal scaling, increase replicas (limited by GPU availability). Note that the example below scales on CPU utilization; GPU-aware autoscaling needs a custom metrics pipeline (e.g., the DCGM exporter with a Prometheus adapter).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: multitalker-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: multitalker-livekit-agent
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Troubleshooting

# Check agent logs
kubectl logs -l app=multitalker-agent -f

# Verify GPU access
kubectl exec -it deployment/multitalker-livekit-agent -- nvidia-smi

# Test agent connectivity
kubectl exec -it deployment/multitalker-livekit-agent -- python -m agent.gpu_smoke_test

Issue                     Solution
Agent not joining rooms   Check that LIVEKIT_URL uses the wss:// protocol
No transcripts received   Verify the track name matches INPUT_TRACK_NAME
GPU out of memory         Reduce concurrent rooms or use a larger GPU
Slow startup              Normal - NeMo models are large (~2-3 GB)

License

See LICENSE file for details.
