ServiceNow · gabegma · Apr 9, 2026 · Mar 31, 2026
diff --git a/README.md b/README.md
@@ -139,6 +139,16 @@ python main.py \
     --metrics task_completion,faithfulness,conciseness
 ```
 
+### Exploring Results
+
+EVA includes a Streamlit analysis app for visualizing and comparing results:
+
+```bash
+streamlit run apps/analysis.py
+```
+
+The app reads from the `output/` directory by default and provides three views: cross-run comparison, run overview, and per-record detail (transcripts, audio, metrics, conversation traces). See [`apps/README.md`](apps/README.md) for full documentation.
+
 ### Using Docker
 
 ```bash
@@ -256,6 +266,7 @@ Flight rebooking is a strong initial domain: it is high-stakes, time-pressured,
 eva/
 ├── main.py                    # Main entry point
 ├── pyproject.toml             # Python project configuration
+├── apps/                      # Streamlit apps
 ├── Dockerfile                 # Docker configuration
 ├── compose.yaml               # Docker Compose configuration
 ├── src/eva/

diff --git a/apps/README.md b/apps/README.md
@@ -0,0 +1,47 @@
+# EVA Apps
+
+Streamlit applications for exploring EVA results.
+
+## Analysis App
+
+Interactive dashboard for visualizing and comparing results.
+
+### Usage
+
+```bash
+streamlit run apps/analysis.py
+```
+
+By default, the app looks for runs in the `output/` directory. You can change this in the sidebar or by setting the `EVA_OUTPUT_DIR` environment variable:
+
+```bash
+EVA_OUTPUT_DIR=path/to/results streamlit run apps/analysis.py
+```
+
+### Views
+
+**Cross-Run Comparison** — Compare aggregate metrics across multiple runs. Filter by model, provider, and pipeline type. Includes an EVA scatter plot (accuracy vs. experience) and per-metric bar charts.
+
+![Cross-Run Comparison view](images/cross_run_comparison.png)
+
+**Run Overview** — Drill into a single run: per-category metric breakdowns, score distributions, and a full records table with per-metric scores.
+
+![Run Overview view](images/run_overview.png)
+
+**Record Detail** — Deep-dive into individual conversation records:
+- Audio playback (mixed recording)
+- Transcript with color-coded speaker turns
+- Metric scores with explanations
+- Conversation trace: tool calls, LLM calls, and audit log entries with a timeline view
+- Database state diff (expected vs. actual)
+- User goal, persona, and ground truth from the evaluation record
+
+![Record Detail view](images/analysis_record_detail.png)
+
+### Sidebar Navigation
+
+1. **Output Directory** — Path to the directory containing run folders
+2. **View** — Switch between the three views above
+3. **Run Selection** — Pick a run (with metadata summary)
+4. **Record Selection** — Pick a record within the selected run
+5. **Trial Selection** — If a record has multiple trials, pick one
diff --git a/apps/__init__.py b/apps/__init__.py
@@ -0,0 +1 @@
+"""EVA Streamlit apps."""
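
Note on the output-directory behavior documented in `apps/README.md`: the app defaults to `output/` and can be pointed elsewhere with `EVA_OUTPUT_DIR`. The sketch below is one way that lookup could work. It is not taken from `apps/analysis.py`: the helper names `resolve_output_dir` and `list_run_dirs` are hypothetical, and the assumption that each run is an immediate subdirectory of the output directory is inferred from the "directory containing run folders" wording in the sidebar description.

```python
import os
from pathlib import Path


def resolve_output_dir(env=None, default="output"):
    """Return the results directory (hypothetical helper).

    Uses EVA_OUTPUT_DIR when it is set and non-empty, otherwise the default.
    """
    env = os.environ if env is None else env
    return Path(env.get("EVA_OUTPUT_DIR") or default)


def list_run_dirs(output_dir):
    """Return sorted names of immediate subdirectories (hypothetical helper).

    Assumes one folder per run; plain files are ignored, and a missing
    directory yields an empty list instead of raising.
    """
    output_dir = Path(output_dir)
    if not output_dir.is_dir():
        return []
    return sorted(p.name for p in output_dir.iterdir() if p.is_dir())
```

Passing the environment as a parameter keeps the lookup testable without mutating `os.environ`. A sidebar text input could then use `resolve_output_dir()` as its initial value.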