Observability dashboard: surface agents failing in prod + run agents against datasets

## Summary

Add an **observability layer** to Seer with two capabilities:

1. **A dashboard that surfaces client agents failing in production**, so that an agent owner learns an agent is broken *before* a user complains, including when it breaks for *other* users and not just for the creator.
2. **Running agents against datasets on a schedule**, to catch quality and behavior regressions when Glean models, actions, or connectors change underneath an already-tuned agent.

## Motivation

- The CarGurus CIO reported that agents **fail silently**: the creator stays unaware an agent is broken for others unless a user surfaces it directly. ([Slack thread](https://askscio.slack.com/archives/C08E2R57TRR/p1780323571760059))
- Broader customer concern: changes in Glean (models, actions, connectors) can **degrade agents that were tuned and working**.
- Apollo raised the same proactive-monitoring gap. Glean lacks native **evals** and **observability** today, so this is additive value Seer is well positioned to deliver (it already has the agent runner, trace parsing, judge engine, and token ledger).

## Capability 1 — Surface agents failing in prod

**Goal:** a per-agent health view that answers "is this agent failing, for whom, and why?"

- Ingest production run signals, keyed on `trace_id` as the canonical join key, and display per agent:
  - Failure rate over time, error types (tool error, timeout, empty/blocked response, groundedness drop)
  - Latency trend, last-failed timestamp, number/identity of affected users
- **Cross-user detection** — surface failures occurring under other users' permissions/data, not just the creator's runs (this is the core of the CarGurus ask).
- **Failure-mode "signals"** — cluster failures (hallucination, tool error, regression) across runs and feed candidate cases back into eval datasets.
- **Alerting** — Slack notification when an agent crosses a failure threshold.
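
The per-agent health view above could be computed roughly as follows. This is a minimal sketch, not Seer's implementation: the `RunRecord` fields, the `agent_health` function, the error-type strings, and the 20% alert threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """One production run. All field names here are hypothetical."""
    trace_id: str
    agent_id: str
    user_id: str
    error_type: Optional[str]  # None means the run succeeded


def agent_health(runs, agent_id, creator_id, alert_threshold=0.2):
    """Summarize failures for one agent, separating out non-creator failures."""
    agent_runs = [r for r in runs if r.agent_id == agent_id]
    failures = [r for r in agent_runs if r.error_type is not None]
    failure_rate = len(failures) / len(agent_runs) if agent_runs else 0.0
    return {
        "agent_id": agent_id,
        "total_runs": len(agent_runs),
        "failure_rate": failure_rate,
        "error_types": dict(Counter(r.error_type for r in failures)),
        "affected_users": sorted({r.user_id for r in failures}),
        # Failures the creator would never see from their own runs.
        "cross_user_failures": sum(r.user_id != creator_id for r in failures),
        # Assumed alert rule: fire once the failure rate reaches the threshold.
        "should_alert": failure_rate >= alert_threshold,
    }


runs = [
    RunRecord("t1", "agent-a", "creator", None),
    RunRecord("t2", "agent-a", "user-1", "tool_error"),
    RunRecord("t3", "agent-a", "user-2", "timeout"),
    RunRecord("t4", "agent-a", "user-1", "tool_error"),
]
health = agent_health(runs, "agent-a", creator_id="creator")
print(health)
```

In this example the creator's own run succeeds while three runs by other users fail, which is the silent-failure case the dashboard is meant to expose.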

## Capability 2 — Run agents against datasets

**Goal:** scheduled, automated regression detection on top of the existing eval engine.

- Reuse the current eval-set and 7-call LLM-as-judge architecture, along with golden-set mode.
- **Scheduler** — run a given eval set against an agent on a fixed cadence (cron).
- **Baseline + diff** — store a baseline and diff `overallScore` and per-dimension scores per run; flag regressions.
- **Canary mode** — small fixed case set per agent, snapshot scores, alert on drop (mirrors the canary-eval mechanic in the harness plan, pointed at the agent).
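
The baseline-and-diff step could look something like the sketch below. It assumes scores are floats on a 0 to 1 scale and that a drop larger than a fixed tolerance counts as a regression; the function name, the dimension names other than `overallScore`, and the tolerance value are hypothetical.

```python
def diff_against_baseline(baseline, current, tolerance=0.05):
    """Return per-metric deltas and the metrics that dropped by more than `tolerance`."""
    deltas = {}
    for name, base_value in baseline.items():
        if name in current:  # skip metrics missing from the current run
            deltas[name] = round(current[name] - base_value, 4)
    regressions = sorted(name for name, delta in deltas.items() if delta < -tolerance)
    return {"deltas": deltas, "regressions": regressions}


# Hypothetical scores; only `overallScore` is a name taken from this proposal.
baseline = {"overallScore": 0.90, "groundedness": 0.95, "completeness": 0.85}
current = {"overallScore": 0.80, "groundedness": 0.94, "completeness": 0.70}

report = diff_against_baseline(baseline, current)
print(report["regressions"])
```

Canary mode would reuse the same comparison with a small fixed case set, so the only extra state needed per agent is the stored baseline snapshot.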

## Data sourcing (which endpoints / logs)

| Surface | Endpoint / Log | Notes |
|---|---|---|
| Synthetic / dataset runs | `POST /rest/api/v1/runworkflow` (`enableTrace: true`) | Returns response, `traceId`, tool calls, reasoning chain. Internal/undocumented; no token counts. Already used by Seer. |
| Prod cross-user signal | `WORKFLOW_RUN` event in Glean Customer Event Logs / GCP Cloud Logging (`scrubbed-agentspan`, `glean-workflows`); `fetch_agent_logs(project_id, run_id\|trace_id)` | The signal for "failing for other users." Lives in the customer's GCP project or S3/GCS bucket. |
| Full trace (timing + tokens) | `POST /api/v1/getworkflowtrace` | Browser-session gated (Cloudflare TLS fingerprint) — not callable from scripts today. |
| Fast-follow | External Observability OTLP export (`otelExport.enabled`) | In-progress productized OTLP stream of `chat` / `agent_run` / `mcp` spans to customer endpoints. |

**Key constraint:** customer deployments typically expose logs only in their own S3/GCS bucket, **not** internal Tempo — the prod-ingestion design must bridge that gap.

## Proposed phasing

- **Phase 1 (synthetic-first):** scheduled health checks + regression evals via `runworkflow`, baseline/diff, Slack alerting. Ships without depending on customer log access. Architect `trace_id` as the join key from day one.
- **Phase 2 (prod observability):** ingest production traces/logs (`WORKFLOW_RUN` events / Cloud Logging first, then OTLP export), failure-mode signals dashboard, cross-user failure detection.
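
To illustrate why `trace_id` should be the join key from Phase 1 onward, here is a minimal sketch of merging records from different sources into one row per trace. It assumes every source has already been normalized to dicts with a `trace_id` key (for example, by renaming the `traceId` field of a `runworkflow` response); the other field names and the `merge_by_trace_id` helper are hypothetical.

```python
def merge_by_trace_id(*sources):
    """Merge records from several sources into one dict per trace_id.

    Later sources fill in fields that earlier ones lack; they never
    overwrite a value that is already present.
    """
    merged = {}
    for source in sources:
        for record in source:
            row = merged.setdefault(record["trace_id"], {})
            for key, value in record.items():
                row.setdefault(key, value)
    return merged


# Hypothetical, already-normalized records from two sources.
synthetic_runs = [{"trace_id": "t1", "agent_id": "agent-a", "response": "ok"}]
prod_logs = [
    {"trace_id": "t1", "status": "ERROR", "user_id": "user-1"},
    {"trace_id": "t2", "status": "OK", "user_id": "user-2"},
]

rows = merge_by_trace_id(synthetic_runs, prod_logs)
print(rows["t1"])
```

With this shape, Phase 2 sources can be added as extra arguments without changing how Phase 1 records are stored.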

## Schema / architecture impact

- New tables for: monitored agents (per-deployment), scheduled runs, run health/failure records, alert config.
- Today Seer is single-tenant and uses a single `GLEAN_API_KEY`; multi-deployment monitoring will need a tenant/deployment dimension if AIOM-hosted.
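
One possible shape for these tables, sketched with Python's built-in `sqlite3` module so it can be run as-is. Every table and column name below is an assumption made for illustration, and SQLite stands in for whatever store Seer actually uses. The point is that `deployment_id` appears on every table, so the tenant dimension is present from the start.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE monitored_agents (
    deployment_id TEXT NOT NULL,
    agent_id      TEXT NOT NULL,
    PRIMARY KEY (deployment_id, agent_id)
);
CREATE TABLE scheduled_runs (
    id            INTEGER PRIMARY KEY,
    deployment_id TEXT NOT NULL,
    agent_id      TEXT NOT NULL,
    eval_set_id   TEXT NOT NULL,
    cron          TEXT NOT NULL
);
CREATE TABLE run_health (
    trace_id      TEXT PRIMARY KEY,
    deployment_id TEXT NOT NULL,
    agent_id      TEXT NOT NULL,
    error_type    TEXT,            -- NULL when the run succeeded
    recorded_at   TEXT NOT NULL
);
CREATE TABLE alert_config (
    deployment_id     TEXT NOT NULL,
    agent_id          TEXT NOT NULL,
    failure_threshold REAL NOT NULL,
    slack_channel     TEXT NOT NULL,
    PRIMARY KEY (deployment_id, agent_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```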

## Open questions

- AIOM-hosted multi-tenant vs. customer self-serve hosting?
- Source of record for prod failures: the `WORKFLOW_RUN` event in Glean Customer Event Logs vs. the OTLP export. In what sequence do we adopt them, and what is the access path per customer?
- Alert routing/thresholds and who owns triage.

## References

- [Slack thread (CarGurus ask)](https://askscio.slack.com/archives/C08E2R57TRR/p1780323571760059)
- `docs/TRACE_API_LIMITATIONS.md`, `docs/glean-api-needs.md`, `docs/harness-engineering-plan.md`
- External Observability V0 and Agents Observability (Glean internal docs)
