
# Voice Infrastructure

## Architecture

```mermaid
graph LR
    A[User Browser] -->|WebSocket| B[LiveKit Server]
    B -->|WebSocket| C[LiveKit Agent]
    C -->|STT| D[OpenAI Whisper]
    C -->|TTS| E[OpenAI TTS]
    C -->|VAD| F[Silero VAD]
    C -->|LangGraph| G[Orchestrator]
    G -->|Response| C
    C -->|Audio Stream| B
    B -->|Audio Stream| A
```

The agent acts as a bridge between LiveKit's WebRTC streams and the orchestrator. Audio flows bidirectionally: user speech is buffered, VAD detects speech boundaries, then STT transcribes. The orchestrator processes the text and returns `next_message`, which TTS converts to audio and streams back through LiveKit. Because the agent uses non-streaming STT (OpenAI's default), VAD is required to decide when to send buffered audio for transcription.
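
As a rough illustration, the wiring above might look like the following with the SDK's `VoicePipelineAgent`. This is a minimal sketch: `OrchestratorLLM` and `orchestrator` are this project's own names (see Agent Lifecycle below), and exact class paths vary across livekit-agents versions.

```python
from livekit.agents import JobContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect()  # join the LiveKit room

    agent = VoicePipelineAgent(
        vad=await silero.VAD.load(),        # detects speech boundaries
        stt=openai.STT(model="whisper-1"),  # non-streaming, so VAD gates it
        llm=OrchestratorLLM(orchestrator),  # project adapter around the LangGraph orchestrator
        tts=openai.TTS(model="tts-1-hd"),
    )
    agent.start(ctx.room)  # begin handling the room's audio
```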

## Components

| Component | Technology | Purpose | Latency |
|-----------|------------|---------|---------|
| LiveKit Server | LiveKit Cloud / self-hosted | WebRTC signaling, media routing | <100ms |
| Agent | LiveKit Agents Python SDK | Voice agent orchestration | <50ms |
| STT | OpenAI Whisper API | Speech-to-text | 200-500ms |
| TTS | OpenAI TTS (tts-1-hd) | Text-to-speech | 300-800ms |
| VAD | Silero VAD | Voice activity detection | <50ms |

## Agent Lifecycle

```mermaid
sequenceDiagram
    participant LK as LiveKit Server
    participant AG as Agent
    participant O as Orchestrator
    participant STT as OpenAI STT
    participant TTS as OpenAI TTS

    LK->>AG: JobContext (room metadata)
    AG->>AG: Bootstrap (DB, TTS, STT, VAD)
    AG->>O: Initialize orchestrator
    AG->>LK: Connect (handshake)

    loop Interview Loop
        LK->>AG: Audio stream (user speaks)
        AG->>STT: Transcribe audio
        STT->>AG: Text response
        AG->>O: execute_step(user_response)
        O->>AG: Response message
        AG->>TTS: Generate speech
        TTS->>AG: Audio bytes
        AG->>LK: Audio stream
        LK->>User: Play audio
    end

    LK->>AG: Room closed
    AG->>O: Cleanup
```

Bootstrap happens before the room connection to meet LiveKit's <100ms handshake requirement. Heavy imports (database, orchestrator) are deferred until after metadata extraction. The `OrchestratorLLM` adapter wraps the orchestrator, translating agent callbacks into `execute_step` calls with proper state loading and checkpointing. TTS/STT failures are handled gracefully: the agent continues with degraded functionality rather than crashing.
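
A minimal sketch of what that adapter could look like. The `execute_step` call, state loading, and checkpointing come straight from the description above; the method names `load_state`, `save_checkpoint`, and `respond` are assumptions for illustration.

```python
class OrchestratorLLM:
    """Bridges agent callbacks to the LangGraph orchestrator (hypothetical shape)."""

    def __init__(self, orchestrator, session_id: str):
        self._orchestrator = orchestrator
        self._session_id = session_id

    async def respond(self, user_text: str) -> str:
        # Load the persisted interview state for this session.
        state = await self._orchestrator.load_state(self._session_id)
        # Advance the interview graph with the transcribed user speech.
        result = await self._orchestrator.execute_step(state, user_response=user_text)
        # Checkpoint so a reconnect can resume mid-interview.
        await self._orchestrator.save_checkpoint(self._session_id, result)
        return result["next_message"]  # handed off to TTS
```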

## Setup

### LiveKit Server

**Cloud (Recommended):**

- Sign up at livekit.io
- Get your API key, secret, and server URL
- Set environment variables:

```bash
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
```

**Self-Hosted:**

```bash
docker run -d \
  -p 7880:7880 \
  -p 7881:7881 \
  -p 7882:7882/udp \
  -e LIVEKIT_KEYS="api-key: api-secret" \
  livekit/livekit-server
```

### Agent Configuration

```python
# src/core/config.py
LIVEKIT_URL = "wss://your-project.livekit.cloud"
LIVEKIT_API_KEY = "your-api-key"
LIVEKIT_API_SECRET = "your-api-secret"
OPENAI_API_KEY = "your-openai-key"
```

### Running the Agent

```bash
# Development
python -m src.agents.interview_agent dev

# Production
python -m src.agents.interview_agent start
```

## Voice Pipeline

### Speech-to-Text (STT)

**Flow:**

  1. User speaks → Browser captures audio
  2. LiveKit routes audio to agent
  3. Agent buffers audio chunks
  4. VAD detects speech end
  5. Agent sends to OpenAI Whisper
  6. Text returned to orchestrator

**Configuration:**

```python
from livekit.plugins import openai

stt = openai.STT(
    model="whisper-1",
    language="en",  # optional; omit to auto-detect
)
```
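
Because the Whisper plugin is non-streaming, livekit-agents ships a `StreamAdapter` that uses VAD to segment the stream and submit only completed utterances for transcription. A hedged sketch (verify the adapter name against your SDK version):

```python
from livekit.agents import stt as agent_stt
from livekit.plugins import openai, silero

vad = await silero.VAD.load()
whisper = openai.STT(model="whisper-1", language="en")

# VAD decides when a speech segment ends; only then is the buffered
# audio sent to Whisper for transcription.
stt = agent_stt.StreamAdapter(stt=whisper, vad=vad)
```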

### Text-to-Speech (TTS)

**Flow:**

  1. Orchestrator generates response
  2. Agent receives next_message
  3. Text prepared for TTS (punctuation, pauses)
  4. OpenAI TTS generates audio
  5. Audio streamed to LiveKit
  6. Browser plays audio

**Configuration:**

```python
from livekit.plugins import openai

tts = openai.TTS(
    voice="alloy",     # alloy, echo, fable, onyx, nova, shimmer
    model="tts-1-hd",  # tts-1 (faster) or tts-1-hd (higher quality)
)
```
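
For step 3 of the flow above, the preparation pass might look like the sketch below, using the `tts` instance configured here. `prepare_for_tts` is a hypothetical helper, not part of the SDK; `synthesize()` is the plugin's standard entry point.

```python
def prepare_for_tts(text: str) -> str:
    """Hypothetical cleanup so the spoken output sounds natural."""
    text = text.strip()
    if text and text[-1] not in ".!?":
        text += "."  # terminal punctuation improves prosody
    return text.replace(" - ", ", ")  # render dashes as brief pauses

# synthesize() returns an audio stream the agent forwards to LiveKit.
audio_stream = tts.synthesize(prepare_for_tts("Tell me about your last project"))
```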

**Voice Options:**

| Voice | Characteristics |
|-------|-----------------|
| alloy | Neutral, professional (default) |
| echo | Warm, friendly |
| fable | Clear, articulate |
| onyx | Deep, authoritative |
| nova | Bright, energetic |
| shimmer | Soft, gentle |

### Voice Activity Detection (VAD)

**Purpose:** Detect when the user stops speaking, which signals the agent to send buffered audio for transcription.

**Implementation:**

- Silero VAD model (lightweight, fast)
- Loaded asynchronously to avoid blocking
- Graceful degradation if loading fails (see the sketch below)

```python
from livekit.plugins import silero

vad = await silero.VAD.load()  # async loading
```
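
One way the caching and fallback could be implemented; this is a sketch of the pattern, not the repo's actual code.

```python
import asyncio

from livekit.plugins import silero

_vad = None
_vad_lock = asyncio.Lock()

async def get_vad():
    """Load Silero VAD once per process; return None if loading fails."""
    global _vad
    async with _vad_lock:  # serialize concurrent load attempts
        if _vad is None:
            try:
                _vad = await silero.VAD.load()
            except Exception:
                _vad = None  # degrade gracefully: run without VAD
    return _vad
```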

## Performance Optimization

### Latency Reduction

| Technique | Impact | Implementation |
|-----------|--------|----------------|
| Streaming TTS | -200ms | Stream audio chunks as they are generated |
| VAD optimization | -100ms | Lower threshold for detecting speech end |
| Connection pooling | -50ms | Reuse OpenAI clients |
| Parallel processing | -150ms | Run STT and intent detection in parallel |

### Resource Management

- **VAD caching:** per-process singleton (thread-safe)
- **Client reuse:** OpenAI clients cached in NodeHandler (see the sketch below)
- **Connection limits:** max 50 concurrent interviews per agent instance
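
Client reuse can be as simple as a cached factory. `NodeHandler` is this project's class, so the sketch below only shows the general pattern with the official `openai` package.

```python
from functools import lru_cache

from openai import AsyncOpenAI

@lru_cache(maxsize=1)
def get_openai_client() -> AsyncOpenAI:
    # One client per process reuses the underlying HTTP connection pool
    # instead of paying connection setup on every STT/TTS call.
    return AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
```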

## Troubleshooting

| Issue | Symptom | Solution |
|-------|---------|----------|
| No audio | Agent connects but no sound | Check the TTS API key; verify audio output is enabled |
| STT not working | User speech not transcribed | Verify VAD loaded; check the STT API key |
| High latency | >1s delay | Check the network; switch from tts-1-hd to the faster tts-1 |
| Agent crashes | Connection drops | Check memory usage; verify the database connection |

## Monitoring

**Key Metrics:**

- STT latency: <500ms (p95)
- TTS latency: <800ms (p95)
- End-to-end latency: <1.5s (p95)
- Agent uptime: >99.9%
- Concurrent interviews: track active rooms
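
A sketch of how per-call latency could be recorded against these targets. `record_metric` is a hypothetical hook for whatever metrics backend is in use; `recognize()` follows the livekit-agents STT interface.

```python
import time

async def timed_transcribe(stt, audio_buffer):
    """Measure one STT round-trip and report it in milliseconds."""
    start = time.perf_counter()
    result = await stt.recognize(buffer=audio_buffer)
    # record_metric is hypothetical: wire it to Prometheus, StatsD, etc.
    record_metric("stt_latency_ms", (time.perf_counter() - start) * 1000)
    return result
```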