Merged
11 changes: 11 additions & 0 deletions README.md
@@ -139,6 +139,16 @@ python main.py \
--metrics task_completion,faithfulness,conciseness
```

### Exploring Results

EVA includes a Streamlit analysis app for visualizing and comparing results:

```bash
streamlit run apps/analysis.py
```

The app reads from the `output/` directory by default and provides three views: cross-run comparison, run overview, and per-record detail (transcripts, audio, metrics, conversation traces). See [`apps/README.md`](apps/README.md) for full documentation.

### Using Docker

```bash
@@ -256,6 +266,7 @@ Flight rebooking is a strong initial domain: it is high-stakes, time-pressured,
eva/
├── main.py # Main entry point
├── pyproject.toml # Python project configuration
├── apps/ # Streamlit apps
├── Dockerfile # Docker configuration
├── compose.yaml # Docker Compose configuration
├── src/eva/
47 changes: 47 additions & 0 deletions apps/README.md
@@ -0,0 +1,47 @@
# EVA Apps

Streamlit applications for exploring EVA results.

## Analysis App

Interactive dashboard for visualizing and comparing results.

### Usage

```bash
streamlit run apps/analysis.py
```

By default, the app looks for runs in the `output/` directory. You can change this in the sidebar or by setting the `EVA_OUTPUT_DIR` environment variable:

```bash
EVA_OUTPUT_DIR=path/to/results streamlit run apps/analysis.py
```
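
As a rough illustration of the default-plus-override behavior described above, the lookup could be sketched as follows. This is not taken from `apps/analysis.py`: the helper name `resolve_output_dir` and the empty-value fallback are assumptions made for the example; only the `EVA_OUTPUT_DIR` variable and the `output/` default come from this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_OUTPUT_DIR = "output"  # default directory named in this README


def resolve_output_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Hypothetical helper: choose the results directory from the environment."""
    env = os.environ if env is None else env
    # Assumption: an unset or empty variable falls back to the default.
    return Path(env.get("EVA_OUTPUT_DIR") or DEFAULT_OUTPUT_DIR)
```

Passing the environment as an argument keeps the helper easy to test; a value entered in the sidebar would presumably take precedence over whatever this returns.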

### Views

**Cross-Run Comparison** — Compare aggregate metrics across multiple runs. Filter by model, provider, and pipeline type. Includes an EVA scatter plot (accuracy vs. experience) and per-metric bar charts.

![Cross-Run Comparison view](images/cross_run_comparison.png)

**Run Overview** — Drill into a single run: per-category metric breakdowns, score distributions, and a full records table with per-metric scores.

![Run Overview view](images/run_overview.png)

**Record Detail** — Deep-dive into individual conversation records:
- Audio playback (mixed recording)
- Transcript with color-coded speaker turns
- Metric scores with explanations
- Conversation trace: tool calls, LLM calls, and audit log entries with a timeline view
- Database state diff (expected vs. actual)
- User goal, persona, and ground truth from the evaluation record

![Record Detail view](images/analysis_record_detail.png)

### Sidebar Navigation

1. **Output Directory** — Path to the directory containing run folders
2. **View** — Switch between the three views above
3. **Run Selection** — Pick a run (with metadata summary)
4. **Record Selection** — Pick a record within the selected run
5. **Trial Selection** — If a record has multiple trials, pick one
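
The first and third sidebar steps imply that runs are discovered as folders inside the output directory. A minimal sketch of that discovery step, assuming each run is an immediate subdirectory (the helper name `list_run_dirs` is hypothetical, and the actual app may filter or order runs differently):

```python
from pathlib import Path
from typing import List


def list_run_dirs(output_dir: Path) -> List[str]:
    """Hypothetical helper: sorted names of the immediate subdirectories."""
    # A missing or non-directory path yields no runs instead of raising.
    if not output_dir.is_dir():
        return []
    return sorted(p.name for p in output_dir.iterdir() if p.is_dir())
```

The returned names could feed a selection widget for the Run Selection step; how records and trials are laid out inside each run folder is not specified here, so the sketch stops at the run level.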
1 change: 1 addition & 0 deletions apps/__init__.py
@@ -0,0 +1 @@
"""EVA analysis app."""