
Feature Proposal: External Agent Provider abstraction for native voice agent APIs #17

@jamestwhedbee

Description

Summary

We'd like to propose an abstraction that enables EVA to benchmark hosted voice agent APIs (native voice agents beyond Pipecat) — directly addressing the Extended Leaderboard roadmap item:

> More cascade and speech-to-speech systems evaluated and added to the findings including native voice agent APIs beyond Pipecat (ElevenAgents, Gemini Live, OpenAI Realtime, Deepgram Voice Agent, etc.)

We've built a working implementation in our fork and wanted to discuss the approach before submitting a PR.

The Problem

EVA currently evaluates voice agents that run inside a Pipecat pipeline (STT → LLM → TTS or S2S). But many production voice agents are hosted black boxes — you call them via a phone number or WebSocket and interact over audio. There's no Pipecat pipeline to instrument.

Examples: Telnyx AI Assistants, ElevenLabs Conversational AI, Deepgram Voice Agent, OpenAI Realtime API, Google Gemini Live.

Proposed Architecture

An External Agent Provider plugin interface that lets EVA evaluate any hosted voice agent:

┌──────────────────────────────────────────────────────────────────┐
│  EVA Runner                                                      │
│                                                                  │
│  ┌──────────────┐      ┌─────────────────────┐   ┌────────────┐  │
│  │ ElevenLabs   │  WS  │  External Agent     │   │ Tool       │  │
│  │ User Sim     │◄────►│  Bridge (generic)   │   │ Webhook    │  │
│  └──────────────┘      └──────────┬──────────┘   └──────┬─────┘  │
│                          Provider │                     │        │
│                          Transport│                     │        │
└───────────────────────────────────┼─────────────────────┼────────┘
                                    │                     │
                          ┌─────────▼────────┐   ┌────────▼────────┐
                          │  Provider        │   │  External       │
                          │  Platform        │──►│  Voice Agent    │
                          └──────────────────┘   └─────────────────┘

Core abstraction: ExternalAgentProvider

from abc import ABC, abstractmethod

from fastapi import FastAPI


class ExternalAgentProvider(ABC):
    @abstractmethod
    def create_transport(
        self, conversation_id: str, webhook_base_url: str
    ) -> "BaseTelephonyTransport":
        """Create a transport connection to the external agent."""

    @abstractmethod
    async def setup(self) -> None:
        """One-time setup before a benchmark run (e.g., configure assistant via API)."""

    @abstractmethod
    async def teardown(self) -> None:
        """Clean up after a benchmark run."""

    @abstractmethod
    async def fetch_intended_speech(self, transport) -> list[dict]:
        """Post-call enrichment: fetch what the agent intended to say (for metrics)."""

    def register_webhook_routes(self, app: FastAPI) -> None:
        """Optional: register provider-specific webhook endpoints."""

Adding a new provider requires only a new providers/xxx/ directory with config, transport, and setup. No changes to the bridge, worker, runner, or metrics pipeline.
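For concreteness, here is a minimal sketch of what such a provider package might implement against this interface. Everything in it is illustrative: the `AcmeProvider` name, its constructor arguments, and the reduced stand-in base class (included only so the snippet runs standalone) are assumptions, not code from our fork.

```python
import asyncio
from abc import ABC, abstractmethod


class ExternalAgentProvider(ABC):
    """Stand-in for EVA's base class so this sketch runs standalone."""

    @abstractmethod
    def create_transport(self, conversation_id, webhook_base_url): ...

    @abstractmethod
    async def setup(self) -> None: ...

    @abstractmethod
    async def teardown(self) -> None: ...

    @abstractmethod
    async def fetch_intended_speech(self, transport) -> list[dict]: ...


class AcmeProvider(ExternalAgentProvider):
    """Hypothetical provider for an imaginary 'Acme' voice agent platform."""

    def __init__(self, api_key: str, assistant_id: str):
        self.api_key = api_key
        self.assistant_id = assistant_id

    def create_transport(self, conversation_id, webhook_base_url):
        # A real provider would open a WebSocket or Call Control leg and
        # return a BaseTelephonyTransport; a plain dict stands in here.
        return {
            "conversation_id": conversation_id,
            "webhook": f"{webhook_base_url}/acme/{conversation_id}",
        }

    async def setup(self) -> None:
        # e.g. configure the hosted assistant (prompt, tools) via Acme's API
        pass

    async def teardown(self) -> None:
        # e.g. delete the temporary assistant created in setup()
        pass

    async def fetch_intended_speech(self, transport) -> list[dict]:
        # Post-call: pull the provider-side transcript for enrichment
        return [{"role": "assistant", "text": "(fetched from Acme API)"}]
```

The runner would only need the provider's config to pick this class; nothing in the bridge or metrics pipeline refers to it by name.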

Why a client-side bridge?

The bridge sits between the user simulator and the external agent. This is intentional:

  1. Metrics compatibility — the bridge reconstructs EVA's pipecat_logs.jsonl format (VAD events, turn boundaries, timestamps) from the raw audio, since the external agent's API doesn't expose those internals
  2. Platform-independent latency measurement — latency is observed by a third party rather than self-reported by the provider, which keeps cross-provider comparisons fair

Important caveat: Real production calls don't have this intermediary. Benchmark latency will be higher than real-world latency. EVA measures relative performance across providers under identical conditions — not absolute production latency.
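To illustrate the observer-side measurement, response latency can be derived purely from the VAD event timestamps the bridge records: the gap between the user's speech ending and the agent's speech starting. This is a sketch under an assumed event schema (the `speaker`/`type`/`t` fields are hypothetical, not EVA's actual log format).

```python
def response_latencies(events: list[dict]) -> list[float]:
    """Compute agent response latencies (seconds) from observed VAD events.

    Each event is {"speaker": "user"|"agent", "type": "start"|"stop", "t": float}.
    A latency sample is: agent speech start minus the preceding user speech stop.
    """
    latencies = []
    last_user_stop = None
    for ev in sorted(events, key=lambda e: e["t"]):
        if ev["speaker"] == "user" and ev["type"] == "stop":
            last_user_stop = ev["t"]
        elif ev["speaker"] == "agent" and ev["type"] == "start":
            if last_user_stop is not None:
                latencies.append(ev["t"] - last_user_stop)
                last_user_stop = None  # count each user turn at most once
    return latencies
```

Because the same observer clock timestamps both sides of every gap, any constant bridge overhead shifts all providers equally, which is what makes the relative comparison fair.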

Changes to Core EVA

We've kept upstream changes minimal. Here's every file we touch and why:

| Area | Files | Lines | Why |
|---|---|---|---|
| User simulator | `client.py`, `audio_interface.py` | +63 | `audio_codec` param (pcm/mulaw) — external agents may use different formats |
| Config | `models/config.py` | +131 | `ExternalAgentConfig` base + discriminator (same pattern as existing config hierarchy) |
| Orchestrator | `runner.py`, `worker.py` | +171 | Provider lifecycle + bridge instantiation (parallel to existing AssistantServer logic) |
| Metrics base | `metrics/base.py` | +20 | `message_trace` field + helper methods for turn counting |
| Metrics processor | `metrics/processor.py` | +168 | Control event filtering, sort priority, `message_trace` loading (general correctness improvements) |
| Individual metrics | 5 files | +30 | Use base class helpers instead of direct trace access |
| Tools | `airline_tools.py` | +5 | `end_call` stub (webhook intercepts it; stub passes tool validation) |
| Audit log | `audit_log.py` | +8 | `replace_transcript()` for post-call enrichment |
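As background for the `audio_codec` row: telephony-oriented providers commonly stream 8 kHz G.711 mu-law rather than linear PCM, so the user simulator needs a decode step somewhere. The standard G.711 mu-law expansion is shown below as a reference sketch; in practice EVA would more likely lean on an existing codec library than hand-rolled decoding.

```python
def mulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a 16-bit linear PCM sample.

    Standard expansion: invert the byte, split into sign / exponent /
    mantissa, rebuild the biased magnitude, then remove the bias (0x84).
    """
    u = ~byte & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample


def decode_mulaw_frame(frame: bytes) -> list[int]:
    """Decode a mu-law audio frame to a list of linear PCM samples."""
    return [mulaw_to_linear(b) for b in frame]
```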

Everything else (~2000 lines) is purely additive in a new assistant/external/ package.

What We've Built

Our fork includes a working Telnyx provider as a reference implementation that benchmarks Telnyx AI Assistants via Call Control. We've run 3×3 evaluations (3 records, 3 trials, concurrency 3) successfully.

The full diff against upstream and documentation are available for review.

Relationship to PR #14

We noticed the concern in PR #14 about Pipecat log coupling. Our approach addresses this by having the bridge write normalized output files (message_trace.jsonl, pipecat_logs.jsonl) before the metrics processor sees them — the processor doesn't need to know about bridge mode.
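The normalization step is simple in principle: the bridge serializes whatever it observed into the same JSON-lines files a Pipecat run would produce, so the processor stays oblivious to bridge mode. A sketch of the writer side (record fields are illustrative, not EVA's actual schema):

```python
import json
from pathlib import Path


def write_message_trace(out_dir: Path, messages: list[dict]) -> Path:
    """Write a normalized message_trace.jsonl: one JSON object per line.

    The metrics processor reads this file the same way regardless of
    whether a Pipecat pipeline or the external-agent bridge produced it.
    """
    path = out_dir / "message_trace.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for msg in messages:
            f.write(json.dumps(msg, ensure_ascii=False) + "\n")
    return path
```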

Next Steps

We'd love feedback on:

  1. Does this abstraction align with your vision for the Extended Leaderboard?
  2. Are the core changes acceptable? We've tried to minimize them, but we're open to alternative approaches (e.g., making the metrics processor fully pluggable).
  3. Would you prefer a single PR or a series of smaller ones?

Happy to iterate on the design before submitting any code. Thanks for building EVA — it's been great to work with.
