## Summary
We'd like to propose an abstraction that enables EVA to benchmark hosted voice agent APIs (native voice agents beyond Pipecat) — directly addressing the Extended Leaderboard roadmap item:
> More cascade and speech-to-speech systems evaluated and added to the findings including native voice agent APIs beyond Pipecat (ElevenAgents, Gemini Live, OpenAI Realtime, Deepgram Voice Agent, etc.)
We've built a working implementation in our fork and wanted to discuss the approach before submitting a PR.
## The Problem
EVA currently evaluates voice agents that run inside a Pipecat pipeline (STT → LLM → TTS or S2S). But many production voice agents are hosted black boxes — you call them via a phone number or WebSocket and interact over audio. There's no Pipecat pipeline to instrument.
Examples: Telnyx AI Assistants, ElevenLabs Conversational AI, Deepgram Voice Agent, OpenAI Realtime API, Google Gemini Live.
## Proposed Architecture
An **External Agent Provider** plugin interface that lets EVA evaluate any hosted voice agent:
```
┌───────────────────────────────────────────────────────────────────────┐
│                              EVA Runner                               │
│                                                                       │
│  ┌──────────────┐       ┌──────────────────────┐      ┌────────────┐  │
│  │  ElevenLabs  │  WS   │    External Agent    │      │    Tool    │  │
│  │   User Sim   │◄─────►│   Bridge (generic)   │      │  Webhook   │  │
│  └──────────────┘       └──────────┬───────────┘      └──────┬─────┘  │
│                           Provider │                         │        │
│                          Transport │                         │        │
└────────────────────────────────────┼─────────────────────────┼────────┘
                                     │                         │
                            ┌────────▼─────────┐      ┌────────▼───────┐
                            │     Provider     │      │    External    │
                            │     Platform     │─────►│  Voice Agent   │
                            └──────────────────┘      └────────────────┘
```
### Core abstraction: `ExternalAgentProvider`
```python
class ExternalAgentProvider(ABC):
    @abstractmethod
    def create_transport(self, conversation_id, webhook_base_url) -> BaseTelephonyTransport:
        """Create a transport connection to the external agent."""

    @abstractmethod
    async def setup(self) -> None:
        """One-time setup before a benchmark run (e.g., configure assistant via API)."""

    @abstractmethod
    async def teardown(self) -> None:
        """Clean up after a benchmark run."""

    @abstractmethod
    async def fetch_intended_speech(self, transport) -> list[dict]:
        """Post-call enrichment: fetch what the agent intended to say (for metrics)."""

    def register_webhook_routes(self, app: FastAPI) -> None:
        """Optional: register provider-specific webhook endpoints."""
```
Adding a new provider requires only a new `providers/xxx/` directory with config, transport, and setup. No changes to the bridge, worker, runner, or metrics pipeline.
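To make that plugin surface concrete, here is a minimal, self-contained sketch of what a provider directory's entry point could look like. The `AcmeProvider`/`AcmeTransport` names and all method bodies are hypothetical illustrations (as is the stubbed `BaseTelephonyTransport`), not part of EVA or any real provider:

```python
from abc import ABC, abstractmethod

# Stand-in for EVA's real transport base class, so this sketch runs alone.
class BaseTelephonyTransport:
    pass

class ExternalAgentProvider(ABC):
    @abstractmethod
    def create_transport(self, conversation_id, webhook_base_url) -> BaseTelephonyTransport: ...

    @abstractmethod
    async def setup(self) -> None: ...

    @abstractmethod
    async def teardown(self) -> None: ...

    @abstractmethod
    async def fetch_intended_speech(self, transport) -> list[dict]: ...

    def register_webhook_routes(self, app) -> None:
        """Optional hook; default is a no-op."""

# Hypothetical provider: everything below is illustrative.
class AcmeTransport(BaseTelephonyTransport):
    def __init__(self, conversation_id: str, webhook_base_url: str):
        self.conversation_id = conversation_id
        self.webhook_base_url = webhook_base_url

class AcmeProvider(ExternalAgentProvider):
    def create_transport(self, conversation_id, webhook_base_url):
        # Dial the hosted agent and return the audio transport.
        return AcmeTransport(conversation_id, webhook_base_url)

    async def setup(self):
        pass  # e.g., create/configure the assistant via the provider's REST API

    async def teardown(self):
        pass  # e.g., delete the temporary assistant

    async def fetch_intended_speech(self, transport):
        return []  # e.g., pull the provider-side transcript after the call
```

The bridge, worker, and runner only ever see the `ExternalAgentProvider` interface, which is what keeps provider additions purely additive.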
### Why a client-side bridge?
The bridge sits between the user simulator and the external agent. This is intentional:
- **Metrics compatibility** — the bridge reconstructs `pipecat_logs.jsonl`-format data (VAD events, turn boundaries, timestamps) from raw audio, since the external agent's API doesn't expose it
- **Platform-independent latency measurement** — latency is measured by a third-party observer rather than self-reported by the provider, making cross-provider comparisons fair

**Important caveat:** Real production calls don't have this intermediary, so benchmark latency will be higher than real-world latency. EVA measures relative performance across providers under identical conditions — not absolute production latency.
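As an illustration of the third-party-observer idea, the core measurement reduces to pairing VAD events the bridge timestamps on its own clock. The event names and tuple shape below are assumptions for the sketch, not EVA's actual log schema:

```python
# Sketch: the bridge detects VAD events in both audio directions and
# pairs each end of user speech with the next start of agent speech.
# Event kinds here ("user_speech_end", "agent_speech_start") are
# illustrative names.
def turn_latencies(events: list[tuple[float, str]]) -> list[float]:
    """events: (timestamp_seconds, kind) pairs observed by the bridge."""
    latencies = []
    pending_user_end = None
    for ts, kind in sorted(events):
        if kind == "user_speech_end":
            pending_user_end = ts
        elif kind == "agent_speech_start" and pending_user_end is not None:
            latencies.append(ts - pending_user_end)
            pending_user_end = None
    return latencies
```

Because both timestamps come from the same clock on the bridge, the numbers are comparable across providers even though each includes the bridge's own overhead.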
## Changes to Core EVA
We've kept upstream changes minimal. Here's every file we touch and why:
| Area | Files | Lines | Why |
|---|---|---|---|
| User simulator | `client.py`, `audio_interface.py` | +63 | `audio_codec` param (pcm/mulaw) — external agents may use different formats |
| Config | `models/config.py` | +131 | `ExternalAgentConfig` base + discriminator (same pattern as existing config hierarchy) |
| Orchestrator | `runner.py`, `worker.py` | +171 | Provider lifecycle + bridge instantiation (parallel to existing `AssistantServer` logic) |
| Metrics base | `metrics/base.py` | +20 | `message_trace` field + helper methods for turn counting |
| Metrics processor | `metrics/processor.py` | +168 | Control event filtering, sort priority, `message_trace` loading (general correctness improvements) |
| Individual metrics | 5 files | +30 | Use base class helpers instead of direct trace access |
| Tools | `airline_tools.py` | +5 | `end_call` stub (webhook intercepts it; stub passes tool validation) |
| Audit log | `audit_log.py` | +8 | `replace_transcript()` for post-call enrichment |
Everything else (~2000 lines) is purely additive in a new `assistant/external/` package.
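As a sketch of the discriminator idea from the config row above (assuming a simple stdlib registry rather than EVA's actual Pydantic hierarchy; all names here are illustrative), provider dispatch could look like:

```python
# Stdlib-only sketch of discriminator-style dispatch. EVA's real config
# uses a Pydantic model hierarchy; this only illustrates the pattern.
PROVIDERS: dict[str, type] = {}

def register_provider(name):
    def wrap(cls):
        PROVIDERS[name] = cls
        return cls
    return wrap

@register_provider("telnyx")
class TelnyxProvider:
    def __init__(self, config: dict):
        self.config = config

def provider_from_config(config: dict):
    # The "provider" key acts as the discriminator field: it selects
    # which registered class receives the rest of the config.
    try:
        cls = PROVIDERS[config["provider"]]
    except KeyError:
        raise ValueError(f"unknown provider: {config.get('provider')}")
    return cls(config)
```

The point of the pattern is that the orchestrator never names a concrete provider; each `providers/xxx/` directory registers itself.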
## What We've Built
Our fork includes a working Telnyx provider as a reference implementation that benchmarks Telnyx AI Assistants via Call Control. We've run 3×3 evaluations (3 records, 3 trials, concurrency 3) successfully.
The full diff against upstream and documentation are available for review.
## Relationship to PR #14
We noticed the concern in PR #14 about Pipecat log coupling. Our approach addresses this by having the bridge write normalized output files (`message_trace.jsonl`, `pipecat_logs.jsonl`) before the metrics processor sees them — the processor doesn't need to know about bridge mode.
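To illustrate that boundary (the helper names and event fields below are hypothetical, not EVA's actual schema), the normalized output is just one JSON object per line, so any reader gets the same shape regardless of which provider produced the call:

```python
import json

# Sketch: the bridge serializes one normalized event per line into a
# .jsonl file; the metrics processor only ever reads this format.
# Field names in the events are up to the schema, not fixed here.
def write_message_trace(path: str, events: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

def read_message_trace(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```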
## Next Steps
We'd love feedback on:
- Does this abstraction align with your vision for the Extended Leaderboard?
- Are the core changes acceptable? We've tried to minimize them, but we're open to alternative approaches (e.g., making the metrics processor fully pluggable).
- Would you prefer a single PR or a series of smaller ones?
Happy to iterate on the design before submitting any code. Thanks for building EVA — it's been great to work with.