Summary
Add an observability layer to Seer with two capabilities:
- A dashboard that surfaces client agents failing in production, so an agent owner learns an agent is broken before a user complains, including when it fails for other users and not just for the creator.
- Running agents against datasets on a schedule — to catch quality/behavior regressions when Glean models, actions, or connectors change underneath an already-tuned agent.
Motivation
- The CarGurus CIO reported that agents fail silently: the creator stays unaware an agent is broken for others unless a user surfaces it directly. (Slack thread)
- Broader customer concern: changes in Glean (models, actions, connectors) can degrade agents that were tuned and working.
- Apollo raised the same proactive-monitoring gap. Glean lacks native evals and observability today, so this is additive value Seer is well positioned to deliver (it already has the agent runner, trace parsing, judge engine, and token ledger).
Capability 1 — Surface agents failing in prod
Goal: a per-agent health view that answers "is this agent failing, for whom, and why?"
- Ingest production run signals (keyed on trace_id as the canonical join key) and display per agent:
  - Failure rate over time, error types (tool error, timeout, empty/blocked response, groundedness drop)
  - Latency trend, last-failed timestamp, number/identity of affected users
- Cross-user detection — surface failures occurring under other users' permissions/data, not just the creator's runs (this is the core of the CarGurus ask).
- Failure-mode "signals" — cluster failures (hallucination, tool error, regression) across runs and feed candidate cases back into eval datasets.
- Alerting — Slack notification when an agent crosses a failure threshold.
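The per-agent health view above could be computed roughly as follows. This is a minimal sketch, not Seer's actual implementation: the `RunRecord` fields, the `summarize_agent_health` function, and the default `failure_threshold` of 0.2 are all assumptions introduced for illustration. Only the use of trace_id as the join key comes from this document.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """One production run. All field names here are hypothetical."""
    trace_id: str
    agent_id: str
    user_id: str
    error_type: Optional[str]  # None means the run succeeded
    timestamp: float           # seconds since epoch


def summarize_agent_health(runs, agent_id, creator_id, failure_threshold=0.2):
    """Aggregate health for one agent, deduplicating runs by trace_id."""
    by_trace = {r.trace_id: r for r in runs if r.agent_id == agent_id}
    records = list(by_trace.values())
    failures = [r for r in records if r.error_type is not None]
    failure_rate = len(failures) / len(records) if records else 0.0
    affected = {r.user_id for r in failures}
    return {
        "agent_id": agent_id,
        "total_runs": len(records),
        "failure_rate": failure_rate,
        "error_types": dict(Counter(r.error_type for r in failures)),
        "affected_users": sorted(affected),
        # Failures seen by anyone other than the creator (the cross-user case).
        "cross_user_failures": sorted(affected - {creator_id}),
        "last_failed_at": max((r.timestamp for r in failures), default=None),
        # An alert (e.g. a Slack notification) would fire when this is True.
        "should_alert": failure_rate >= failure_threshold,
    }
```

Keeping `cross_user_failures` separate from `affected_users` makes the "broken for others but not for the creator" case directly queryable instead of leaving it to be inferred from the raw list.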
Capability 2 — Run agents against datasets
Goal: scheduled, automated regression detection on top of the existing eval engine.
- Reuse current eval-set + 7-call LLM-as-judge architecture and golden-set mode.
- Scheduler — run a given eval set against an agent on a fixed cadence (cron).
- Baseline + diff: store a baseline and diff overallScore and per-dimension scores per run; flag regressions.
- Canary mode — small fixed case set per agent, snapshot scores, alert on drop (mirrors the canary-eval mechanic in the harness plan, pointed at the agent).
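The baseline + diff step might look something like the sketch below. The score shape (an `overallScore` plus a `dimensions` mapping), the `diff_scores` name, the `tolerance` default, and the dimension names in the usage note are assumptions for illustration. The real judge output may be structured differently.

```python
from typing import Any, Dict


def diff_scores(baseline: Dict[str, Any], current: Dict[str, Any],
                tolerance: float = 0.05) -> Dict[str, Any]:
    """Compare a run's scores to a stored baseline and flag regressions.

    A score counts as regressed when it drops by more than `tolerance`.
    Dimensions missing from the current run are skipped, not flagged.
    """
    deltas = {"overallScore": current["overallScore"] - baseline["overallScore"]}
    for name, base_value in baseline["dimensions"].items():
        if name in current["dimensions"]:
            deltas[name] = current["dimensions"][name] - base_value
    regressions = sorted(k for k, delta in deltas.items() if delta < -tolerance)
    return {"deltas": deltas, "regressions": regressions}
```

For example, with a baseline of `{"overallScore": 0.82, "dimensions": {"groundedness": 0.90, "completeness": 0.80}}` and a current run whose groundedness falls to 0.70 while the overall score falls only to 0.80, only `groundedness` is flagged. A tolerance band like this is one way to keep judge noise from triggering alerts on every scheduled run.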
Data sourcing (which endpoints / logs)
| Surface | Endpoint / Log | Notes |
| --- | --- | --- |
| Synthetic / dataset runs | POST /rest/api/v1/runworkflow (enableTrace: true) | Returns response, traceId, tool calls, reasoning chain. Internal/undocumented; no token counts. Already used by Seer. |
| Prod cross-user signal | WORKFLOW_RUN event in Glean Customer Event Logs / GCP Cloud Logging (scrubbed-agentspan, glean-workflows); fetch_agent_logs(project_id, run_id or trace_id) | The signal for "failing for other users." Lives in the customer's GCP project / S3-GCS. |
| Full trace (timing + tokens) | POST /api/v1/getworkflowtrace | Browser-session gated (Cloudflare TLS fingerprint); not callable from scripts today. |
| Fast-follow | External Observability OTLP export (otelExport.enabled) | In-progress productized OTLP stream of chat / agent_run / mcp spans to customer endpoints. |
Key constraint: customer deployments typically expose logs only in their own S3/GCS bucket, not internal Tempo — the prod-ingestion design must bridge that gap.
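One way to bridge sources with different shapes is to normalize each into a common record and index by trace_id. The sketch below is illustrative only. The `traceId` field is taken from the runworkflow description above, but the `error` key, the log event fields (`trace_id`, `user_id`, `status`) and the `"SUCCESS"` value are hypothetical, since the actual payload schemas are not specified here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class NormalizedRun:
    trace_id: str
    source: str             # "synthetic" or "prod_log"
    user_id: Optional[str]  # unknown for synthetic runs
    ok: bool


def from_runworkflow(response: Dict[str, Any]) -> NormalizedRun:
    """Normalize a runworkflow response; the `error` key is an assumption."""
    return NormalizedRun(
        trace_id=response["traceId"],
        source="synthetic",
        user_id=None,
        ok="error" not in response,
    )


def from_log_event(event: Dict[str, Any]) -> NormalizedRun:
    """Normalize a WORKFLOW_RUN log event; all field names are hypothetical."""
    return NormalizedRun(
        trace_id=event["trace_id"],
        source="prod_log",
        user_id=event.get("user_id"),
        ok=event.get("status") == "SUCCESS",
    )


def index_by_trace_id(runs: Iterable[NormalizedRun]) -> Dict[str, List[NormalizedRun]]:
    """Group normalized runs so every source joins on the same key."""
    index: Dict[str, List[NormalizedRun]] = {}
    for run in runs:
        index.setdefault(run.trace_id, []).append(run)
    return index
```

With adapters like these, a later OTLP source would only need its own `from_...` function, and downstream health and regression logic would not have to change.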
Proposed phasing
- Phase 1 (synthetic-first): scheduled health checks + regression evals via runworkflow, baseline/diff, Slack alerting. Ships without depending on customer log access. Architect trace_id as the join key from day one.
- Phase 2 (prod observability): ingest production traces/logs (WorkflowRun / Cloud Logging, then OTLP export), failure-mode signals dashboard, cross-user failure detection.
Schema / architecture impact
- New tables for: monitored agents (per-deployment), scheduled runs, run health/failure records, alert config.
- Today Seer is single-tenant with a single GLEAN_API_KEY; multi-deployment monitoring will need a tenant/deployment dimension if AIOM-hosted.
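As a rough illustration of the table set above with a deployment dimension, here is a sketch using SQLite from the Python standard library. Every table and column name is an assumption made for this example, and the choice of SQLite is only for self-containment. It says nothing about Seer's actual datastore.

```python
import sqlite3

# Hypothetical schema: each table carries deployment_id so that one
# Seer instance could monitor agents across several deployments.
SCHEMA = """
CREATE TABLE deployments (
    deployment_id TEXT PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE monitored_agents (
    deployment_id TEXT NOT NULL REFERENCES deployments(deployment_id),
    agent_id      TEXT NOT NULL,
    PRIMARY KEY (deployment_id, agent_id)
);
CREATE TABLE scheduled_runs (
    run_id        INTEGER PRIMARY KEY,
    deployment_id TEXT NOT NULL,
    agent_id      TEXT NOT NULL,
    eval_set_id   TEXT NOT NULL,
    cron          TEXT NOT NULL,
    FOREIGN KEY (deployment_id, agent_id)
        REFERENCES monitored_agents(deployment_id, agent_id)
);
CREATE TABLE run_health (
    trace_id      TEXT PRIMARY KEY,  -- canonical join key
    deployment_id TEXT NOT NULL,
    agent_id      TEXT NOT NULL,
    error_type    TEXT,              -- NULL when the run succeeded
    occurred_at   TEXT NOT NULL,
    FOREIGN KEY (deployment_id, agent_id)
        REFERENCES monitored_agents(deployment_id, agent_id)
);
CREATE TABLE alert_config (
    deployment_id     TEXT NOT NULL,
    agent_id          TEXT NOT NULL,
    failure_threshold REAL NOT NULL,
    slack_channel     TEXT NOT NULL,
    PRIMARY KEY (deployment_id, agent_id),
    FOREIGN KEY (deployment_id, agent_id)
        REFERENCES monitored_agents(deployment_id, agent_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketch tables on the given connection."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

Keying agents on (deployment_id, agent_id) instead of agent_id alone is the part that matters for the single-tenant to multi-deployment move. The remaining columns are placeholders.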
Open questions
- AIOM-hosted multi-tenant vs. customer self-serve hosting?
- Source of record for prod failures: WORKFLOW_RUN event in GCP Cloud Logging vs. OTLP export. Which comes first, and what is the access path per customer?
- Alert routing/thresholds and who owns triage.
References
- Slack thread (CarGurus ask)
- docs/TRACE_API_LIMITATIONS.md, docs/glean-api-needs.md, docs/harness-engineering-plan.md
- External Observability V0 and Agents Observability (Glean internal docs)