Skip to content

feat(observability): add structured digest.* event family to discussion-digest, the unobserved hot-path LLM call #228

@chrisleekr

Description

@chrisleekr

Finding

src/workflows/discussion-digest.ts is the only LLM-touching subsystem in the bot that ships without a structured event: "..." log family, and it runs on the hot path of every comment-aware workflow (triage, plan, implement, review, resolve, remember, see src/workflows/handlers/triage.ts:115, plan.ts:41, implement.ts:54, review.ts:62, resolve.ts:100, remember.ts:47). Triage, dispatcher, retry, secret-guard, MCP, fleet, and scheduler each get their own table in docs/operate/observability.md (see ## Dispatcher log fields at line 135 and ## Retry log fields at line 83); digest does not. The only digest emissions in the entire file are five free-form strings (src/workflows/discussion-digest.ts:329 LLM failure, :334 structured-output rejection, :360 directives dropped, :411 map-reduce start, :567 fetch failure). No completion event exists, so a successful digest call is invisible to operators.

This silence hides four operationally relevant signals. (1) Skip rate. buildDiscussionDigest returns { ok: false, reason: "no-comments" } at src/workflows/discussion-digest.ts:380 when every comment was authored by a bot, with no log line. An operator who sees the workflow run without a "Maintainer guidance" section in the prompt cannot distinguish "no human commented" from "LLM failed" from "all owner directives got dropped by enforceOwnerDirectives at :351". (2) Token usage. runDigestCall at :313 discards response.usage even though LLMUsage { inputTokens, outputTokens } is on the response object (src/ai/llm-client.ts:158). The digest is wired to config.digestModel (default Sonnet 4.6) and runs on every structured workflow event, so it materially contributes to spend — yet a fleet operator cannot answer "how much is the digest costing me per workflow" without instrumenting it ad-hoc. (3) Latency. No latencyMs field is captured for any of the 1-to-N LLM calls in single-pass or map-reduce; under load this is the prime suspect when a workflow's perceived start-to-first-comment delay drifts. (4) Trust-boundary effectiveness. enforceOwnerDirectives (:351) only logs when dropped > 0. The dropped: 0 case — the steady-state, attestation-worthy event — is unobservable, so an attacker successfully extracting a directive is the same log shape as no attack at all.

The fix is a small, contained structured-event family (digest.skipped, digest.call.completed, digest.completed, digest.failed) emitted from the existing chokepoints, mirroring the pattern documented at docs/operate/observability.md:135 (dispatcher) and :83 (retry). No new dependencies; response.usage already carries the token counts, parseStructuredResponse(...).strategy (src/ai/structured-output.ts:57) already carries strict-vs-tolerant for free.

Diagram

flowchart TD
    Caller["Workflow handler<br/>triage / plan / implement<br/>review / resolve / remember"]
    Fetch["fetchAndBuildDigest<br/>discussion-digest.ts:541"]
    Build["buildDiscussionDigest<br/>discussion-digest.ts:370"]
    Classify["classifyComments<br/>partitions owner/other/bot"]
    Skip["humanCount === 0<br/>return no-comments<br/>discussion-digest.ts:378"]
    Pack["estimate tokens<br/>chunk if over budget"]
    Single["Single-pass LLM call<br/>runDigestCall :313"]
    Map["Map: N LLM calls"]
    Reduce["Reduce: 1 LLM call"]
    Enforce["enforceOwnerDirectives<br/>drop forged authors :351"]
    Done["return ok=true"]

    Caller --> Fetch
    Fetch -.catch warn.-> FetchFail["log.warn comment fetch failed<br/>:567"]
    Fetch --> Build
    Build --> Classify
    Classify --> Skip
    Classify --> Pack
    Pack -- one chunk --> Single
    Pack -- many chunks --> Map
    Map --> Reduce

    Single -.catch warn.-> CallFail["log.warn LLM call failed :329<br/>log.warn parse rejected :334"]
    Map -.catch warn.-> CallFail
    Reduce -.catch warn.-> CallFail

    Single --> Enforce
    Reduce --> Enforce
    Map -.info: chunkCount only.-> MapLog["log.info map-reduce :411"]
    Enforce -.only when dropped greater than 0.-> Drop["log.warn dropped directives :360"]
    Enforce --> Done

    classDef gap fill:#922b21,color:#ffffff,stroke:#641e16
    classDef ok fill:#196f3d,color:#ffffff,stroke:#0e4d2a
    classDef path fill:#1f4e79,color:#ffffff,stroke:#11314d

    class Skip,Done,Single,Map,Reduce gap
    class FetchFail,CallFail,Drop,MapLog ok
    class Caller,Fetch,Build,Classify,Pack,Enforce path
Loading

Red nodes = chokepoints emitting no structured event today. Green nodes = the five existing free-form warn/info call sites. Blue = unobserved code path.

Rationale

The digest is on the forced prefix of six structured workflows. Each workflow event therefore costs ≥1 digest LLM call (single-pass) or N+1 calls (map-reduce on large threads). With config.digestModel defaulting to Sonnet 4.6, the digest can comfortably account for 5-30% of a workflow's per-event spend, and it is the dominant contributor to the wall-clock between webhook receipt and first agent token on a comment-heavy thread (the synchronous await fetchAndBuildDigest(...) blocks postStartingComment, see e.g. src/workflows/handlers/triage.ts:115-127). Yet none of these costs are pin-pointable from logs today.

This matters for three operational scenarios the project already invests in. (a) The existing token-usage instrumentation that feat(observability): log + persist SDK token usage on executions (commit 5407dcd) added persists input_tokens/output_tokens for the agent SDK call only; the digest's tokens roll up under nothing. An operator running SELECT sum(input_tokens) FROM executions WHERE workflow = 'triage' undercounts true triage spend by the digest portion. (b) Issue feat(observability): add scheduler.scan.* lifecycle events (open, area: observability) explicitly models the pattern this proposal would reuse — a tick-and-summary event family — so adding digest.* keeps the codebase's observability surface uniform rather than divergent. (c) enforceOwnerDirectives is a load-bearing security boundary (CLAUDE.md § Comment-aware workflows — "Directives are re-checked post-parse against the classified owner authors, so the boundary does not depend on the model"). The current dropped > 0-only log means an SRE has no positive-control evidence the boundary fires when it should, and no baseline to alert on a sudden spike in drops (which is the signature of a prompt-injection or model-regression event the design exists to catch).

The implementation cost is small: ~15 lines of new logging in discussion-digest.ts, one event-key constants block, one Zod discriminated union pinning the field shape (mirror src/orchestrator/log-fields.ts:55#DispatcherOfferLogSchema), and a new ## Discussion digest log fields table in docs/operate/observability.md. No schema, no new dep, no behaviour change. Compare: the dispatcher digest.* analogue (feat(observability): structured dispatcher.offer.{sent,accepted,rejected,timed_out} events, closed) was a similarly contained PR.

References

Internal:

  • src/workflows/discussion-digest.ts:313runDigestCall, the LLM-call chokepoint that drops response.usage and emits no completion event.
  • src/workflows/discussion-digest.ts:351enforceOwnerDirectives, the trust boundary that only logs the dropped > 0 minority case.
  • src/workflows/discussion-digest.ts:370buildDiscussionDigest, where the no-comments return at :380 is silent.
  • src/ai/llm-client.ts:158LLMResponse.usage: LLMUsage carrying inputTokens + outputTokens, already populated, currently discarded by the digest call site.
  • src/ai/structured-output.ts:57StructuredResult.strategy: "strict" | "tolerant", the same JSON-quality signal the closed triage response recovered via tolerant parser line at src/orchestrator/triage.ts:356 already surfaces; digest discards it.
  • docs/operate/observability.md:135## Dispatcher log fields, the pattern to mirror (event-key constants + discriminated union + test).
  • src/orchestrator/log-fields.ts:55#DispatcherOfferLogSchema — the strict-Zod-per-event-shape template.
  • src/workflows/handlers/triage.ts:115, plan.ts:41, implement.ts:54, review.ts:62, resolve.ts:100, remember.ts:47 — the six consumers that synchronously block on the un-observed await fetchAndBuildDigest(...).

External:

Suggested Next Steps

  1. Add src/workflows/digest-log-fields.ts with a DIGEST_LOG_EVENTS constant ("digest.skipped" | "digest.call.completed" | "digest.completed" | "digest.failed") and a z.discriminatedUnion("event", ...) schema mirroring src/orchestrator/log-fields.ts:55#DispatcherOfferLogSchema. Co-locate a test that fails CI on event-key drift.
  2. In runDigestCall (src/workflows/discussion-digest.ts:313), capture performance.now() deltas and emit digest.call.completed with { phase: "extract" | "reduce", input_tokens, output_tokens, latency_ms, strategy } on the success path, where strategy is parsed.strategy. Reuse response.usage.inputTokens / response.usage.outputTokens from the existing client.create result.
  3. In buildDiscussionDigest (:370), emit digest.skipped at the humanCount === 0 early-return (:378) with { comment_counts: { owner, other, bot } }, and digest.completed at the final success return with { chunks, total_latency_ms, directives_kept, directives_dropped, has_prior_bot_output, untrusted_context_count, conversation_summary_chars }.
  4. In fetchAndBuildDigest (:541), emit digest.failed at the fetch catch (:567) and on every ok: false propagated return, with the existing DigestResult.reason enum (no-comments | llm-error | parse-error) carried as a field.
  5. Add a ## Discussion digest log fields section to docs/operate/observability.md documenting the four event keys, fields, and at least one suggested alert (e.g. count(event = "digest.failed" AND reason = "parse-error") by repo over 15-minute windows as a model-quality regression alarm).
  6. (Optional follow-up issue) Persist digest token counts to executions.input_tokens / output_tokens by summing the agent SDK and digest counters at workflow-end so the existing SELECT sum(input_tokens) FROM executions queries account for digest spend without per-handler arithmetic.

Areas Evaluated

  • Read src/workflows/discussion-digest.ts end-to-end (629 lines) and inventoried all five log call sites.
  • Verified every consumer handler (triage, plan, implement, review, resolve, remember) calls fetchAndBuildDigest synchronously before any user-facing comment is posted, via grep -rn "fetchAndBuildDigest" against src/.
  • Cross-checked the existing observability surface in docs/operate/observability.md (sections at lines 83, 110, 122, 135, 151, 163, 220), src/orchestrator/log-fields.ts (dispatcher), src/utils/retry-log-fields.ts (retry), src/orchestrator/triage.ts (triage) — none cover discussion-digest.
  • Confirmed LLMResponse.usage is already populated by src/ai/llm-client.ts:158, and parseStructuredResponse(...).strategy is already in scope (src/ai/structured-output.ts:57), so adding the proposed fields requires no upstream-API changes.
  • Reviewed every open + closed research-labelled issue (29 total): no overlap with this proposal. The closest neighbours (feat(observability): structured dispatcher.offer.*, closed; feat(observability): periodic fleet-state gauge snapshot, closed; feat(observability): add scheduler.scan.* lifecycle events, open) cover different chokepoints.
  • Verified git log --oneline -20: the most recent observability commit (6713cbf feat(observability): add structured retry.* events) follows the exact event-family pattern this proposal would extend to the digest, confirming the codebase is actively converging on structured noun.verb events.

Generated by the scheduled research action on 2026-06-13

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions