Observability dashboard: surface agents failing in prod + run agents against datasets #11

@fazil-raja-glean

Summary

Add an observability layer to Seer with two capabilities:

  1. A dashboard that surfaces client agents failing in production, so an agent owner learns an agent is broken before a user complains, including when it fails for other users and not just for the creator.
  2. Running agents against datasets on a schedule — to catch quality/behavior regressions when Glean models, actions, or connectors change underneath an already-tuned agent.

Motivation

  • The CarGurus CIO reported that agents fail silently: the creator stays unaware an agent is broken for others unless a user surfaces it directly. (Slack thread)
  • Broader customer concern: changes in Glean (models, actions, connectors) can degrade agents that were tuned and working.
  • Apollo raised the same proactive-monitoring gap. Glean lacks native evals and observability today, so this is additive value Seer is well positioned to deliver (it already has the agent runner, trace parsing, judge engine, and token ledger).

Capability 1 — Surface agents failing in prod

Goal: a per-agent health view that answers "is this agent failing, for whom, and why?"

  • Ingest production run signals (keyed on trace_id as the canonical join key) and display per agent:
    • Failure rate over time, error types (tool error, timeout, empty/blocked response, groundedness drop)
    • Latency trend, last-failed timestamp, number/identity of affected users
  • Cross-user detection — surface failures occurring under other users' permissions/data, not just the creator's runs (this is the core of the CarGurus ask).
  • Failure-mode "signals" — cluster failures (hallucination, tool error, regression) across runs and feed candidate cases back into eval datasets.
  • Alerting — Slack notification when an agent crosses a failure threshold.
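
The per-agent health view above could be computed roughly as follows. This is a minimal sketch, not Seer's actual code: `RunRecord`, `AgentHealth`, `summarize`, `agents_over_threshold`, and the `creators` mapping are hypothetical names, and the error-type strings are placeholders. It assumes each production run signal has already been normalized into a record carrying a `trace_id`, which is used to de-duplicate signals that arrive from more than one source.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class RunRecord:
    """One normalized production run signal (hypothetical shape)."""
    trace_id: str
    agent_id: str
    user_id: str
    failed: bool
    error_type: Optional[str] = None  # e.g. "tool_error", "timeout"


@dataclass
class AgentHealth:
    total: int = 0
    failures: int = 0
    # Users other than the creator who hit a failure (cross-user detection).
    other_users_affected: Set[str] = field(default_factory=set)
    error_types: Dict[str, int] = field(default_factory=dict)

    @property
    def failure_rate(self) -> float:
        return self.failures / self.total if self.total else 0.0


def summarize(runs: Iterable[RunRecord],
              creators: Dict[str, str]) -> Dict[str, AgentHealth]:
    """Aggregate runs per agent, counting each trace_id only once."""
    seen: Set[str] = set()
    health: Dict[str, AgentHealth] = defaultdict(AgentHealth)
    for run in runs:
        if run.trace_id in seen:
            continue  # same run reported by another source
        seen.add(run.trace_id)
        h = health[run.agent_id]
        h.total += 1
        if run.failed:
            h.failures += 1
            key = run.error_type or "unknown"
            h.error_types[key] = h.error_types.get(key, 0) + 1
            if run.user_id != creators.get(run.agent_id):
                h.other_users_affected.add(run.user_id)
    return dict(health)


def agents_over_threshold(health: Dict[str, AgentHealth],
                          threshold: float,
                          min_runs: int = 1) -> List[str]:
    """Agent ids whose failure rate meets the alert threshold."""
    return sorted(agent for agent, h in health.items()
                  if h.total >= min_runs and h.failure_rate >= threshold)
```

The list returned by `agents_over_threshold` is what a Slack notifier would consume. The `min_runs` guard is one possible way to avoid alerting on an agent with only a handful of runs.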

Capability 2 — Run agents against datasets

Goal: scheduled, automated regression detection on top of the existing eval engine.

  • Reuse current eval-set + 7-call LLM-as-judge architecture and golden-set mode.
  • Scheduler — run a given eval set against an agent on a fixed cadence (cron).
  • Baseline + diff — store a baseline and diff overallScore and per-dimension scores per run; flag regressions.
  • Canary mode — small fixed case set per agent, snapshot scores, alert on drop (mirrors the canary-eval mechanic in the harness plan, pointed at the agent).
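
The baseline + diff step could look something like the sketch below. It assumes each eval run yields an `overallScore` plus a map of per-dimension scores on a 0 to 1 scale. `EvalScores`, `diff_against_baseline`, the `tolerance` default, and the dimension names in the usage note are illustrative assumptions, not the existing eval engine's API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EvalScores:
    """Scores from one eval run (hypothetical shape)."""
    overall_score: float          # corresponds to overallScore
    dimensions: Dict[str, float]  # per-dimension judge scores


def diff_against_baseline(baseline: EvalScores,
                          current: EvalScores,
                          tolerance: float = 0.05) -> Dict[str, float]:
    """Return {score name: delta} for every score that dropped by more
    than `tolerance` relative to the stored baseline."""
    pairs = {"overallScore": (baseline.overall_score, current.overall_score)}
    for name, base_value in baseline.dimensions.items():
        # Only compare dimensions present in both runs.
        if name in current.dimensions:
            pairs[name] = (base_value, current.dimensions[name])

    regressions: Dict[str, float] = {}
    for name, (base_value, current_value) in pairs.items():
        delta = current_value - base_value
        if delta < -tolerance:
            regressions[name] = round(delta, 4)
    return regressions
```

For example, with a baseline of `EvalScores(0.9, {"groundedness": 0.9, "completeness": 0.8})` and a current run of `EvalScores(0.8, {"groundedness": 0.7, "completeness": 0.79})`, the overall score and `groundedness` are flagged while `completeness` stays within tolerance. Canary mode could reuse the same function on a small fixed case set.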

Data sourcing (which endpoints / logs)

| Surface | Endpoint / Log | Notes |
| --- | --- | --- |
| Synthetic / dataset runs | POST /rest/api/v1/runworkflow (enableTrace: true) | Returns response, traceId, tool calls, reasoning chain. Internal/undocumented; no token counts. Already used by Seer. |
| Prod cross-user signal | WORKFLOW_RUN event in Glean Customer Event Logs / GCP Cloud Logging (scrubbed-agentspan, glean-workflows); fetch_agent_logs(project_id, run_id\|trace_id) | The signal for "failing for other users." Lives in the customer's GCP project / S3-GCS. |
| Full trace (timing + tokens) | POST /api/v1/getworkflowtrace | Browser-session gated (Cloudflare TLS fingerprint); not callable from scripts today. |
| Fast-follow | External Observability OTLP export (otelExport.enabled) | In-progress productized OTLP stream of chat / agent_run / mcp spans to customer endpoints. |

Key constraint: customer deployments typically expose logs only in their own S3/GCS bucket, not internal Tempo — the prod-ingestion design must bridge that gap.

Proposed phasing

  • Phase 1 (synthetic-first): scheduled health checks + regression evals via runworkflow, baseline/diff, Slack alerting. Ships without depending on customer log access. Architect trace_id as the join key from day one.
  • Phase 2 (prod observability): ingest production traces/logs (WorkflowRun / Cloud Logging, then OTLP export), failure-mode signals dashboard, cross-user failure detection.

Schema / architecture impact

  • New tables for: monitored agents (per-deployment), scheduled runs, run health/failure records, alert config.
  • Today Seer is single-tenant and configured with a single GLEAN_API_KEY; multi-deployment monitoring will need a tenant/deployment dimension if AIOM-hosted.
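
One possible shape for these tables is sketched below. SQLite is used only to keep the example self-contained. All table and column names are assumptions for discussion, not an agreed schema; `deployment_id` stands in for the tenant/deployment dimension, and `trace_id` is the key on health records so that synthetic and prod signals can be joined later.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative only.
SCHEMA = """
CREATE TABLE monitored_agents (
    id            INTEGER PRIMARY KEY,
    deployment_id TEXT NOT NULL,   -- tenant/deployment dimension
    agent_id      TEXT NOT NULL,
    UNIQUE (deployment_id, agent_id)
);
CREATE TABLE scheduled_runs (
    id                 INTEGER PRIMARY KEY,
    monitored_agent_id INTEGER NOT NULL REFERENCES monitored_agents(id),
    eval_set_id        TEXT NOT NULL,
    cron               TEXT NOT NULL  -- cadence as a cron expression
);
CREATE TABLE run_health (
    trace_id           TEXT PRIMARY KEY,  -- canonical join key
    monitored_agent_id INTEGER NOT NULL REFERENCES monitored_agents(id),
    source             TEXT NOT NULL CHECK (source IN ('synthetic', 'prod')),
    failed             INTEGER NOT NULL,
    error_type         TEXT,
    recorded_at        TEXT NOT NULL
);
CREATE TABLE alert_config (
    monitored_agent_id INTEGER PRIMARY KEY REFERENCES monitored_agents(id),
    failure_threshold  REAL NOT NULL,
    slack_channel      TEXT NOT NULL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the four sketch tables on an open connection."""
    conn.executescript(SCHEMA)
```

Calling `create_schema(sqlite3.connect(":memory:"))` is enough to experiment with queries against this layout.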

Open questions

  • AIOM-hosted multi-tenant vs. customer self-serve hosting?
  • Source of record for prod failures: WorkflowRun GCP log vs. OTLP export. What is the sequence and access path per customer?
  • Alert routing/thresholds and who owns triage.

References

  • Slack thread (CarGurus ask)
  • docs/TRACE_API_LIMITATIONS.md, docs/glean-api-needs.md, docs/harness-engineering-plan.md
  • External Observability V0 and Agents Observability (Glean internal docs)
