Skip to content

Plan: durable Auto Review quality pipeline #324

@shiny-code-bot

Description

@shiny-code-bot

Summary

Auto Review should become a durable, repo-scoped quality pipeline, not a TUI-local timeout/status side effect.

The direction we want to love is:

  • the harness owns run identity, freshness, dedupe, cancellation, restart recovery, result classification, and cost policy
  • the assistant receives a compact per-turn Auto Review ledger so it can make better decisions without carrying raw review output
  • details are lazy and inspectable by stable run/finding ids
  • completed review work is always classified as current, superseded, obsolete, failed, cancelled, or archived; useful findings are not silently dropped
  • runtime is not staleness; staleness comes from snapshot freshness, activity/heartbeat, and repo state

This is the successor planning track to #76. Keep #76 as the evidence archive and record of earlier snapshot-aware/currentness slices.

Current Evidence

Existing #76 evidence showed the core problem at product scale:

Recent Auto Review feedback also caught a conceptual bug: showing stale based only on elapsed runtime is misleading. A long-running review can be healthy and valuable. Stale should mean the reviewed snapshot is no longer relevant, or the agent/process is inactive/lost according to stronger evidence.

Desired Architecture

Add a durable Auto Review run store under the existing repo-scoped review state area, conceptually:

CODE_HOME/state/review/repo-<repo-key>/auto-review/
  runs.ndjson or runs.json
  outputs/<run_id>.json
  events/<run_id>.jsonl

Initial implementation should prefer simple atomic JSON/NDJSON files over SQLite unless locking/atomicity becomes a proven problem. The store belongs in code-core, near review_coord, so TUI, exec, auto-drive, and future surfaces share the same contract.

Core record shape:

AutoReviewRun
  schema_version
  repo_key, repo_root, run_id
  source: tui | exec | auto_drive
  owner_session_id, trigger_turn_id, trigger_kind
  status: pending | snapshotting | reviewing | resolving | completed | failed | cancelled | superseded | skipped
  phase
  created_at, updated_at, started_at, completed_at
  last_activity_at
  base_commit, snapshot_commit, snapshot_epoch
  head_at_launch, diff_fingerprint, scope_hash, prompt_policy_version
  changed_path_count, listed_paths, omitted_path_count
  agent_id, batch_id, branch, worktree_path
  model, reasoning_effort, prompt_token_estimate, token_count
  finding_count, finding_digests, summary_digest
  freshness: current | long_running | inactive | superseded | obsolete | lost
  superseded_by, cancel_reason, error_class, error_summary
  output_path, detail_path

Finding digests should be compact and stable:

FindingDigest
  finding_id
  priority
  confidence
  title
  repo_relative_path
  line_start, line_end
  status: open | addressed | obsolete | dismissed | unresolved
  detail_ref

Harness-Owned Invariants

The assistant may inspect and reason over review state, but the harness owns the hard rules:

  • never start a duplicate review for the same useful scope unless the prompt/model/policy version intentionally changes
  • never cancel solely because elapsed runtime is high
  • classify freshness from snapshot/head/epoch plus activity/heartbeat, not from wall-clock runtime alone
  • adopt or reconnect to a matching active run before retrying
  • cancel only for explicit user stop, superseded/obsolete scope, dead/lost process, hard budget exhaustion, or a proven duplicate
  • terminal findings are immutable evidence and are always classified/surfaced/archived, not silently discarded
  • restart recovery reconciles durable runs against AgentManager, worktrees, review locks, PIDs, and snapshot epochs before exposing state
  • fallback worktrees are for preserving surfaced fixes or safe follow-up work, not for casually duplicating near-identical live reviews

Per-Turn Auto Review Ledger

Expose a compact hidden ledger each turn, integrated with the broader context ledger work in #92.

Default target: under roughly 600-1200 tokens, and zero tokens when there are no recent/actionable runs.

Example shape:

<auto_review_ledger version="1" repo="code">
  <active run="ar_123" phase="reviewing" snapshot="abc1234"
          runtime="9m" last_activity="38s" freshness="current"
          scope="4 files" />
  <latest run="ar_122" status="superseded" snapshot="def5678"
          findings="1" surfaced="false"
          summary="Potential regression in background review completion." />
  <finding id="f1" p="2" path="code-rs/tui/src/chatwidget.rs" line="1488"
           title="Stale label uses elapsed runtime rather than inactivity" />
  <details available="auto_review.details(run_id, finding_id)" />
</auto_review_ledger>

Ledger rules:

  • include at most one active run by default
  • include latest current/superseded terminal result with actionable findings
  • include only unresolved high-priority digests, not full finding bodies
  • include stable ids and detail refs
  • include runtime for cost awareness and last_activity for liveness, but do not equate runtime with stale
  • no raw diff, full JSON, or repeated historical runs by default

Lazy Detail Surface

Add a read-only detail path, either as a harness tool or internal operation exposed through existing agent tooling:

auto_review.details(run_id, finding_id?)

It should return bounded detail from the durable output sidecar, with truncation and clear errors when the requested run/finding is unavailable.

Do not give the LLM unrestricted cancellation authority. If future tools allow LLM-suggested actions, they should be requests to the harness, and the harness should enforce the policy.

Token-Efficiency Strategy

We should prove this is cheaper because it prevents wasted review runs, not because the ledger is free.

Expected savings:

  • dedupe avoids entire duplicate review runs
  • adoption/reconnect avoids relaunching after restart
  • superseded/obsolete cancellation stops spending on low-value reviews
  • compact ledger prevents dumping raw review results into every turn
  • lazy details let the assistant inspect only when findings matter

Track ledger overhead explicitly:

auto_review.ledger_tokens
auto_review.detail_fetch_count
auto_review.detail_tokens

And compare against avoided spend:

auto_review.duplicate_reused
auto_review.duplicate_skipped
auto_review.tokens_avoided_estimate
auto_review.cancelled_obsolete_token_estimate
auto_review.completed_but_never_surfaced

Proof Metrics

Add structured counters/events so dogfooding can prove whether the concept works:

auto_review.run_started
auto_review.run_completed
auto_review.run_failed
auto_review.run_cancelled
auto_review.run_superseded
auto_review.restart_recovered
auto_review.inactivity_detected
auto_review.stale_snapshot_detected
auto_review.findings_current
auto_review.findings_superseded
auto_review.findings_inspected_by_assistant
auto_review.findings_applied_or_dismissed
auto_review.time_to_surface_findings
auto_review.review_tokens_estimated
auto_review.review_tokens_actual_when_available
auto_review.ledger_tokens

Success signals:

  • duplicate review rate decreases
  • completed findings that never surface goes to zero
  • obsolete-token burn decreases
  • useful finding rate does not regress
  • assistant naturally waits/adopts/inspects based on the ledger
  • restarts preserve visible review state
  • users see fewer confusing late-review interruptions

Staged Implementation

  1. Durable run store foundation

    • add AutoReviewRun schema and store in code-core
    • persist run creation, status, activity, terminal result, and sidecar output
    • add compaction/GC and restart reconciliation tests
  2. Wire existing TUI/exec paths into the store

    • TUI run_background_review records run lifecycle
    • AgentManager/status updates refresh activity
    • exec AutoReviewTracker writes completions into the same store
    • replace local-only last_seen recovery with store-backed activity/currentness
  3. Compact per-turn ledger

  4. Lazy detail retrieval

    • expose bounded review detail lookup by run/finding id
    • ensure detail retrieval is read-only and does not mutate run disposition
    • add tests for truncation and missing-detail errors
  5. Dedupe, supersede, and cancellation policy

    • compute stable scope/diff/prompt-policy keys
    • adopt/reuse active matching runs before launch
    • classify older runs as superseded/obsolete when head/epoch moves
    • cancel only through harness-owned policy
    • never cancel on runtime alone
  6. Metrics and dogfood proof

    • emit counters/events
    • add /review-stats or similar diagnostic surface when useful
    • compare duplicate rate, unsurfaced completions, ledger overhead, avoided review tokens, and finding usefulness before/after
  7. Docs and issue cleanup

Suggested Sub-Issues

Create sub-issues once the parent is accepted:

  1. Add durable AutoReviewRun store and restart reconciliation
  2. Wire TUI/exec Auto Review lifecycle events into the store
  3. Add compact per-turn Auto Review ledger
  4. Add lazy review detail retrieval
  5. Enforce dedupe, freshness, supersede, and cancellation policy
  6. Add proof metrics and dogfood diagnostics
  7. Update Auto Review docs and reconcile Make Auto Review snapshot-aware and preserve superseded results #76/Add context source ledger and prompt observability #92 relationships

Open Decisions

  • Store format: atomic JSON/NDJSON first, or SQLite now? Default recommendation: JSON/NDJSON first with careful locking and compaction.
  • Scope key: exactly when and where do we compute the diff fingerprint so TUI, exec, and auto-drive cannot disagree?
  • Ledger horizon: last active/current run plus latest actionable terminal result by default, with older history only in diagnostics.
  • Cancellation thresholds: should inactivity be configurable, and what evidence besides no heartbeat is required before cancellation?
  • Detail tool shape: new explicit tool vs. existing agent/status operation surface.
  • How much of Make Auto Review snapshot-aware and preserve superseded results #76 should be reopened vs. kept closed as evidence? Default recommendation: keep Make Auto Review snapshot-aware and preserve superseded results #76 closed as archive, link this successor.

Acceptance Criteria

  • Auto Review runs are durably recorded by repo and survive TUI/harness restart.
  • Running, terminal, superseded, failed, skipped, and cancelled states are represented in durable state and visible through a compact projection.
  • Freshness/staleness is computed from snapshot/head/epoch and activity, not elapsed runtime alone.
  • The assistant receives a compact per-turn Auto Review ledger when there is active/recent actionable review state.
  • Ledger injection is bounded and observable; idle Auto Review adds no prompt overhead.
  • Full review output/details are preserved and retrievable lazily by stable ids.
  • Duplicate review launches for the same useful scope are adopted/reused/skipped by harness policy.
  • Superseded or obsolete review runs are classified and, when appropriate, cancelled through explicit harness policy.
  • Completed findings are never silently discarded; they are current, superseded, obsolete, dismissed, or archived.
  • Restart recovery reconciles durable state with AgentManager/worktrees/locks before launching new review work.
  • Metrics prove duplicate review reduction, unsurfaced finding elimination, ledger token overhead, avoided token spend, and finding usefulness across dogfood cycles.
  • code-rs/core/docs/auto-review.md describes the new durable lifecycle and fallback/concurrency semantics.

Relationships

Evidence archive: #76
Related prompt/context ledger work: #92
Related token/budget guardrails: #50
Related token efficiency roadmap: #43

Finish Line

Every Code Auto Review reaches the dogfood "love gate": it feels safe to leave on during real coding because review runs survive restarts, duplicate token spend is avoided, stale or superseded work is visible and handled, useful terminal results are preserved, freshness is based on snapshot/activity evidence, each assistant turn receives only compact actionable review state, and we can prove latency/token/finding usefulness with dogfood metrics instead of vibes.

Current Status

State: Active. Foundation and compact ledger slices are complete; the next bar is the dogfood love gate, not just shipping another small review PR.

Completed:

  1. Add durable AutoReviewRun store and restart reconciliation #325 Add durable AutoReviewRun store and restart reconciliation - landed in PR Add durable Auto Review run store #332.
  2. Wire TUI and exec Auto Review lifecycle events into durable state #326 Wire TUI and exec Auto Review lifecycle events into durable state - landed in PR Wire Auto Review lifecycle state into durable store #333.
  3. Expose compact per-turn Auto Review ledger #327 Expose compact per-turn Auto Review ledger - landed in PR Expose compact Auto Review ledger #334, with follow-up fix PR Avoid creating review dirs for ledger reads #335 to keep idle ledger reads non-mutating.

What "love" means for this workstream:

  • Auto Review can be left enabled while coding without repeatedly burning tokens on equivalent diffs.
  • The assistant can see active/recent review state and decide whether to wait, merge, ignore, cancel, or launch new review work.
  • Runtime alone never marks a healthy active review stale; freshness comes from snapshot/head/epoch plus liveness/activity evidence.
  • The system records enough per-run evidence to explain why a review was slow: model, reasoning effort, phase timing, token/prompt estimate, follow-up count, duplicate/skipped/superseded reason, and whether findings were useful.
  • Dogfooding can compare before/after behavior across sessions and show avoided duplicate review spend, reduced stale review confusion, bounded ledger overhead, and no silently lost findings.

Remaining implementation sequence:

  1. Enforce Auto Review dedupe freshness supersede and cancellation policy #329 Enforce Auto Review dedupe freshness supersede and cancellation policy, including diff-fingerprint based duplicate detection and explicit adopt/reuse/skip/supersede decisions.
  2. Add lazy Auto Review detail retrieval #328 Add lazy Auto Review detail retrieval, after the ledger has stable ids and policy has clearer run classification.
  3. Add Auto Review proof metrics and dogfood diagnostics #330 Add Auto Review proof metrics and dogfood diagnostics, including prompt/token estimates and phase timing so review slowness can be proven and tuned.
  4. Update Auto Review lifecycle docs for durable runs #331 Update Auto Review lifecycle docs for durable runs, after behavior settles.

Recommended next action: dogfood the merged ledger after rebuilding the PATH binary, then start #329. The assistant now receives bounded review state when useful; the harness should next prevent duplicate/stale review work, classify superseded/obsolete runs, and cancel or adopt existing work through explicit policy.

Key design constraints from recent Auto Review feedback:

  • Runtime is not staleness.
  • A lower follow-up limit reduces repeated review loops but does not make the first review pass fast.
  • Background Auto Review has its own model, reasoning, resolve model, and follow-up settings; diagnostics must expose which ones a run actually used.
  • Metrics and ledgers must help future turns make decisions without injecting bulky telemetry into ordinary context.

Blocked by: None for #329.

Last verified: 2026-06-02 after read-only agent review of Auto Review latency/settings paths.

Metadata

Metadata

Assignees

No one assigned

    Labels

    planDurable planning issueplan:activeCurrent active plan

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions