## Summary
We'd like to propose an abstraction that enables EVA to benchmark hosted voice agent APIs (native voice agents beyond Pipecat) — directly addressing the Extended Leaderboard roadmap item:
> More cascade and speech-to-speech systems evaluated and added to the findings including native voice agent APIs beyond Pipecat (ElevenAgents, Gemini Live, OpenAI Realtime, Deepgram Voice Agent, etc.)
We've built a working implementation in our fork and wanted to discuss the approach before submitting a PR.
## The Problem
EVA currently evaluates voice agents that run inside a Pipecat pipeline (STT → LLM → TTS or S2S). But many production voice agents are hosted black boxes — you call them via a phone number or WebSocket and interact over audio. There's no Pipecat pipeline to instrument.
Examples: Telnyx AI Assistants, ElevenLabs Conversational AI, Deepgram Voice Agent, OpenAI Realtime API, Google Gemini Live.
## Proposed Architecture
An **External Agent Provider** plugin interface that lets EVA evaluate any hosted voice agent:
```
┌───────────────────────────────────────────────────────────────────────┐
│                              EVA Runner                               │
│                                                                       │
│  ┌──────────────┐       ┌──────────────────────┐      ┌────────────┐  │
│  │  ElevenLabs  │  WS   │    External Agent    │      │    Tool    │  │
│  │   User Sim   │◄─────►│   Bridge (generic)   │      │  Webhook   │  │
│  └──────────────┘       └──────────┬───────────┘      └──────┬─────┘  │
│                           Provider │                         │        │
│                          Transport │                         │        │
└────────────────────────────────────┼─────────────────────────┼────────┘
                                     │                         │
                            ┌────────▼─────────┐      ┌────────▼───────┐
                            │     Provider     │      │    External    │
                            │     Platform     │─────►│  Voice Agent   │
                            └──────────────────┘      └────────────────┘
```
### Core abstraction: `ExternalAgentProvider`
```python
class ExternalAgentProvider(ABC):
    @abstractmethod
    def create_transport(self, conversation_id, webhook_base_url) -> BaseTelephonyTransport:
        """Create a transport connection to the external agent."""

    @abstractmethod
    async def setup(self) -> None:
        """One-time setup before a benchmark run (e.g., configure assistant via API)."""

    @abstractmethod
    async def teardown(self) -> None:
        """Clean up after a benchmark run."""

    @abstractmethod
    async def fetch_intended_speech(self, transport) -> list[dict]:
        """Post-call enrichment: fetch what the agent intended to say (for metrics)."""

    def register_webhook_routes(self, app: FastAPI) -> None:
        """Optional: register provider-specific webhook endpoints."""
```
Adding a new provider requires only a new `providers/xxx/` directory with config, transport, and setup. No changes to the bridge, worker, runner, or metrics pipeline.
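To make that plugin surface concrete, here is a minimal, self-contained sketch of what a provider directory's entry point could look like. The `AcmeProvider`/`AcmeTransport` names and all method bodies are hypothetical illustrations (as is the stubbed `BaseTelephonyTransport`), not part of EVA or any real provider:

```python
from abc import ABC, abstractmethod

# Stand-in for EVA's real transport base class, so this sketch runs alone.
class BaseTelephonyTransport:
    pass

class ExternalAgentProvider(ABC):
    @abstractmethod
    def create_transport(self, conversation_id, webhook_base_url) -> BaseTelephonyTransport: ...

    @abstractmethod
    async def setup(self) -> None: ...

    @abstractmethod
    async def teardown(self) -> None: ...

    @abstractmethod
    async def fetch_intended_speech(self, transport) -> list[dict]: ...

    def register_webhook_routes(self, app) -> None:
        """Optional hook; default is a no-op."""

# Hypothetical provider: everything below is illustrative.
class AcmeTransport(BaseTelephonyTransport):
    def __init__(self, conversation_id: str, webhook_base_url: str):
        self.conversation_id = conversation_id
        self.webhook_base_url = webhook_base_url

class AcmeProvider(ExternalAgentProvider):
    def create_transport(self, conversation_id, webhook_base_url):
        # Dial the hosted agent and return the audio transport.
        return AcmeTransport(conversation_id, webhook_base_url)

    async def setup(self):
        pass  # e.g., create/configure the assistant via the provider's REST API

    async def teardown(self):
        pass  # e.g., delete the temporary assistant

    async def fetch_intended_speech(self, transport):
        return []  # e.g., pull the provider-side transcript after the call
```

The bridge, worker, and runner only ever see the `ExternalAgentProvider` interface, which is what keeps provider additions purely additive.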
### Why a client-side bridge?
The bridge sits between the user simulator and the external agent. This is intentional:
- **Metrics compatibility** — the bridge reconstructs `pipecat_logs.jsonl`-format data (VAD events, turn boundaries, timestamps) from raw audio, since the external agent's API doesn't expose it
- **Platform-independent latency measurement** — latency is measured by a third-party observer rather than self-reported by the provider, making cross-provider comparisons fair

**Important caveat:** Real production calls don't have this intermediary, so benchmark latency will be higher than real-world latency. EVA measures relative performance across providers under identical conditions — not absolute production latency.
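As an illustration of the third-party-observer idea, the core measurement reduces to pairing VAD events the bridge timestamps on its own clock. The event names and tuple shape below are assumptions for the sketch, not EVA's actual log schema:

```python
# Sketch: the bridge detects VAD events in both audio directions and
# pairs each end of user speech with the next start of agent speech.
# Event kinds here ("user_speech_end", "agent_speech_start") are
# illustrative names.
def turn_latencies(events: list[tuple[float, str]]) -> list[float]:
    """events: (timestamp_seconds, kind) pairs observed by the bridge."""
    latencies = []
    pending_user_end = None
    for ts, kind in sorted(events):
        if kind == "user_speech_end":
            pending_user_end = ts
        elif kind == "agent_speech_start" and pending_user_end is not None:
            latencies.append(ts - pending_user_end)
            pending_user_end = None
    return latencies
```

Because both timestamps come from the same clock on the bridge, the numbers are comparable across providers even though each includes the bridge's own overhead.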
## Changes to Core EVA
We've kept upstream changes minimal. Here's every file we touch and why:
| Area | Files | Lines | Why |
|---|---|---|---|
| User simulator | `client.py`, `audio_interface.py` | +63 | `audio_codec` param (pcm/mulaw) — external agents may use different formats |
| Config | `models/config.py` | +131 | `ExternalAgentConfig` base + discriminator (same pattern as existing config hierarchy) |
| Orchestrator | `runner.py`, `worker.py` | +171 | Provider lifecycle + bridge instantiation (parallel to existing `AssistantServer` logic) |
| Metrics base | `metrics/base.py` | +20 | `message_trace` field + helper methods for turn counting |
| Metrics processor | `metrics/processor.py` | +168 | Control event filtering, sort priority, `message_trace` loading (general correctness improvements) |
| Individual metrics | 5 files | +30 | Use base class helpers instead of direct trace access |
| Tools | `airline_tools.py` | +5 | `end_call` stub (webhook intercepts it; stub passes tool validation) |
| Audit log | `audit_log.py` | +8 | `replace_transcript()` for post-call enrichment |
Everything else (~2000 lines) is purely additive in a new `assistant/external/` package.
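As a sketch of the discriminator idea from the config row above (assuming a simple stdlib registry rather than EVA's actual Pydantic hierarchy; all names here are illustrative), provider dispatch could look like:

```python
# Stdlib-only sketch of discriminator-style dispatch. EVA's real config
# uses a Pydantic model hierarchy; this only illustrates the pattern.
PROVIDERS: dict[str, type] = {}

def register_provider(name):
    def wrap(cls):
        PROVIDERS[name] = cls
        return cls
    return wrap

@register_provider("telnyx")
class TelnyxProvider:
    def __init__(self, config: dict):
        self.config = config

def provider_from_config(config: dict):
    # The "provider" key acts as the discriminator field: it selects
    # which registered class receives the rest of the config.
    try:
        cls = PROVIDERS[config["provider"]]
    except KeyError:
        raise ValueError(f"unknown provider: {config.get('provider')}")
    return cls(config)
```

The point of the pattern is that the orchestrator never names a concrete provider; each `providers/xxx/` directory registers itself.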
## What We've Built
Our fork includes a working Telnyx provider as a reference implementation that benchmarks Telnyx AI Assistants via Call Control. We've run 3×3 evaluations (3 records, 3 trials, concurrency 3) successfully.
The full diff against upstream and documentation are available for review.
## Relationship to PR #14
We noticed the concern in PR #14 about Pipecat log coupling. Our approach addresses this by having the bridge write normalized output files (`message_trace.jsonl`, `pipecat_logs.jsonl`) before the metrics processor sees them — the processor doesn't need to know about bridge mode.
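To illustrate that boundary (the helper names and event fields below are hypothetical, not EVA's actual schema), the normalized output is just one JSON object per line, so any reader gets the same shape regardless of which provider produced the call:

```python
import json

# Sketch: the bridge serializes one normalized event per line into a
# .jsonl file; the metrics processor only ever reads this format.
# Field names in the events are up to the schema, not fixed here.
def write_message_trace(path: str, events: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

def read_message_trace(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```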
## Next Steps
We'd love feedback on:
- Does this abstraction align with your vision for the Extended Leaderboard?
- Are the core changes acceptable? We've tried to minimize them, but we're open to alternative approaches (e.g., making the metrics processor fully pluggable).
- Would you prefer a single PR or a series of smaller ones?
Happy to iterate on the design before submitting any code. Thanks for building EVA — it's been great to work with.