Plan: durable Auto Review quality pipeline

## Summary

Auto Review should become a durable, repo-scoped quality pipeline, not a TUI-local timeout/status side effect.

The direction we want to love is:

- the harness owns run identity, freshness, dedupe, cancellation, restart recovery, result classification, and cost policy
- the assistant receives a compact per-turn Auto Review ledger so it can make better decisions without carrying raw review output
- details are lazy and inspectable by stable run/finding ids
- completed review work is always classified as current, superseded, obsolete, failed, cancelled, or archived; useful findings are not silently dropped
- runtime is not staleness; staleness comes from snapshot freshness, activity/heartbeat, and repo state

This is the successor planning track to #76. Keep #76 as the evidence archive and record of earlier snapshot-aware/currentness slices.

## Current Evidence

Existing #76 evidence showed the core problem at product scale:

- hundreds of Auto Review sessions, with median runtime around 5.3 minutes and long tails over 20, 60, and 160 minutes
- late review processes continuing after work had merged/deployed
- high token counts on reviews of already-superseded snapshots
- repeated background completion notices arriving into active sessions
- prior slices (#305, #307, #308, #310, #311, #312) improved run identity/currentness and superseded-result handling, but did not finish durable run state, ledger exposure, budget/cancellation policy, or proof metrics

Recent Auto Review feedback also caught a conceptual bug: showing `stale` based only on elapsed runtime is misleading. A long-running review can be healthy and valuable. Stale should mean the reviewed snapshot is no longer relevant, or the agent/process is inactive/lost according to stronger evidence.

## Desired Architecture

Add a durable Auto Review run store under the existing repo-scoped review state area, conceptually:

```text
CODE_HOME/state/review/repo-<repo-key>/auto-review/
  runs.ndjson or runs.json
  outputs/<run_id>.json
  events/<run_id>.jsonl
```

Initial implementation should prefer simple atomic JSON/NDJSON files over SQLite unless locking/atomicity becomes a proven problem. The store belongs in `code-core`, near `review_coord`, so TUI, exec, auto-drive, and future surfaces share the same contract.

Core record shape:

```text
AutoReviewRun
  schema_version
  repo_key, repo_root, run_id
  source: tui | exec | auto_drive
  owner_session_id, trigger_turn_id, trigger_kind
  status: pending | snapshotting | reviewing | resolving | completed | failed | cancelled | superseded | skipped
  phase
  created_at, updated_at, started_at, completed_at
  last_activity_at
  base_commit, snapshot_commit, snapshot_epoch
  head_at_launch, diff_fingerprint, scope_hash, prompt_policy_version
  changed_path_count, listed_paths, omitted_path_count
  agent_id, batch_id, branch, worktree_path
  model, reasoning_effort, prompt_token_estimate, token_count
  finding_count, finding_digests, summary_digest
  freshness: current | long_running | inactive | superseded | obsolete | lost
  superseded_by, cancel_reason, error_class, error_summary
  output_path, detail_path
```

Finding digests should be compact and stable:

```text
FindingDigest
  finding_id
  priority
  confidence
  title
  repo_relative_path
  line_start, line_end
  status: open | addressed | obsolete | dismissed | unresolved
  detail_ref
```

## Harness-Owned Invariants

The assistant may inspect and reason over review state, but the harness owns the hard rules:

- never start a duplicate review for the same useful scope unless the prompt/model/policy version intentionally changes
- never cancel solely because elapsed runtime is high
- classify freshness from snapshot/head/epoch plus activity/heartbeat, not from wall-clock runtime alone
- adopt or reconnect to a matching active run before retrying
- cancel only for explicit user stop, superseded/obsolete scope, dead/lost process, hard budget exhaustion, or a proven duplicate
- terminal findings are immutable evidence and are always classified/surfaced/archived, not silently discarded
- restart recovery reconciles durable runs against AgentManager, worktrees, review locks, PIDs, and snapshot epochs before exposing state
- fallback worktrees are for preserving surfaced fixes or safe follow-up work, not for casually duplicating near-identical live reviews

## Per-Turn Auto Review Ledger

Expose a compact hidden ledger each turn, integrated with the broader context ledger work in #92.

Default target: under roughly 600-1200 tokens, and zero tokens when there are no recent/actionable runs.

Example shape:

```xml
<auto_review_ledger version="1" repo="code">
  <active run="ar_123" phase="reviewing" snapshot="abc1234"
          runtime="9m" last_activity="38s" freshness="current"
          scope="4 files" />
  <latest run="ar_122" status="superseded" snapshot="def5678"
          findings="1" surfaced="false"
          summary="Potential regression in background review completion." />
  <finding id="f1" p="2" path="code-rs/tui/src/chatwidget.rs" line="1488"
           title="Stale label uses elapsed runtime rather than inactivity" />
  <details available="auto_review.details(run_id, finding_id)" />
</auto_review_ledger>
```

Ledger rules:

- include at most one active run by default
- include latest current/superseded terminal result with actionable findings
- include only unresolved high-priority digests, not full finding bodies
- include stable ids and detail refs
- include `runtime` for cost awareness and `last_activity` for liveness, but do not equate runtime with stale
- no raw diff, full JSON, or repeated historical runs by default

## Lazy Detail Surface

Add a read-only detail path, either as a harness tool or internal operation exposed through existing agent tooling:

```text
auto_review.details(run_id, finding_id?)
```

It should return bounded detail from the durable output sidecar, with truncation and clear errors when the requested run/finding is unavailable.

Do not give the LLM unrestricted cancellation authority. If future tools allow LLM-suggested actions, they should be requests to the harness, and the harness should enforce the policy.

## Token-Efficiency Strategy

We should prove this is cheaper because it prevents wasted review runs, not because the ledger is free.

Expected savings:

- dedupe avoids entire duplicate review runs
- adoption/reconnect avoids relaunching after restart
- superseded/obsolete cancellation stops spending on low-value reviews
- compact ledger prevents dumping raw review results into every turn
- lazy details let the assistant inspect only when findings matter

Track ledger overhead explicitly:

```text
auto_review.ledger_tokens
auto_review.detail_fetch_count
auto_review.detail_tokens
```

And compare against avoided spend:

```text
auto_review.duplicate_reused
auto_review.duplicate_skipped
auto_review.tokens_avoided_estimate
auto_review.cancelled_obsolete_token_estimate
auto_review.completed_but_never_surfaced
```

## Proof Metrics

Add structured counters/events so dogfooding can prove whether the concept works:

```text
auto_review.run_started
auto_review.run_completed
auto_review.run_failed
auto_review.run_cancelled
auto_review.run_superseded
auto_review.restart_recovered
auto_review.inactivity_detected
auto_review.stale_snapshot_detected
auto_review.findings_current
auto_review.findings_superseded
auto_review.findings_inspected_by_assistant
auto_review.findings_applied_or_dismissed
auto_review.time_to_surface_findings
auto_review.review_tokens_estimated
auto_review.review_tokens_actual_when_available
auto_review.ledger_tokens
```

Success signals:

- duplicate review rate decreases
- completed findings that never surface goes to zero
- obsolete-token burn decreases
- useful finding rate does not regress
- assistant naturally waits/adopts/inspects based on the ledger
- restarts preserve visible review state
- users see fewer confusing late-review interruptions

## Staged Implementation

1. Durable run store foundation
   - add `AutoReviewRun` schema and store in `code-core`
   - persist run creation, status, activity, terminal result, and sidecar output
   - add compaction/GC and restart reconciliation tests

2. Wire existing TUI/exec paths into the store
   - TUI `run_background_review` records run lifecycle
   - `AgentManager`/status updates refresh activity
   - exec `AutoReviewTracker` writes completions into the same store
   - replace local-only `last_seen` recovery with store-backed activity/currentness

3. Compact per-turn ledger
   - implement `AutoReviewLedger` projection with a strict byte/token cap
   - inject it through the same context-source observability family as #92
   - skip injection when idle
   - add tests for ledger cap and freshness labels

4. Lazy detail retrieval
   - expose bounded review detail lookup by run/finding id
   - ensure detail retrieval is read-only and does not mutate run disposition
   - add tests for truncation and missing-detail errors

5. Dedupe, supersede, and cancellation policy
   - compute stable scope/diff/prompt-policy keys
   - adopt/reuse active matching runs before launch
   - classify older runs as superseded/obsolete when head/epoch moves
   - cancel only through harness-owned policy
   - never cancel on runtime alone

6. Metrics and dogfood proof
   - emit counters/events
   - add `/review-stats` or similar diagnostic surface when useful
   - compare duplicate rate, unsurfaced completions, ledger overhead, avoided review tokens, and finding usefulness before/after

7. Docs and issue cleanup
   - update `code-rs/core/docs/auto-review.md` to describe durable runs, ledgers, fallback semantics, and proof metrics
   - update #76 or closeout comments to point here as the successor architecture
   - link related prompt/context observability work in #92

## Suggested Sub-Issues

Create sub-issues once the parent is accepted:

1. Add durable AutoReviewRun store and restart reconciliation
2. Wire TUI/exec Auto Review lifecycle events into the store
3. Add compact per-turn Auto Review ledger
4. Add lazy review detail retrieval
5. Enforce dedupe, freshness, supersede, and cancellation policy
6. Add proof metrics and dogfood diagnostics
7. Update Auto Review docs and reconcile #76/#92 relationships

## Open Decisions

- Store format: atomic JSON/NDJSON first, or SQLite now? Default recommendation: JSON/NDJSON first with careful locking and compaction.
- Scope key: exactly when and where do we compute the diff fingerprint so TUI, exec, and auto-drive cannot disagree?
- Ledger horizon: last active/current run plus latest actionable terminal result by default, with older history only in diagnostics.
- Cancellation thresholds: should inactivity be configurable, and what evidence besides no heartbeat is required before cancellation?
- Detail tool shape: new explicit tool vs. existing agent/status operation surface.
- How much of #76 should be reopened vs. kept closed as evidence? Default recommendation: keep #76 closed as archive, link this successor.

## Acceptance Criteria

- [ ] Auto Review runs are durably recorded by repo and survive TUI/harness restart.
- [ ] Running, terminal, superseded, failed, skipped, and cancelled states are represented in durable state and visible through a compact projection.
- [ ] Freshness/staleness is computed from snapshot/head/epoch and activity, not elapsed runtime alone.
- [ ] The assistant receives a compact per-turn Auto Review ledger when there is active/recent actionable review state.
- [ ] Ledger injection is bounded and observable; idle Auto Review adds no prompt overhead.
- [ ] Full review output/details are preserved and retrievable lazily by stable ids.
- [ ] Duplicate review launches for the same useful scope are adopted/reused/skipped by harness policy.
- [ ] Superseded or obsolete review runs are classified and, when appropriate, cancelled through explicit harness policy.
- [ ] Completed findings are never silently discarded; they are current, superseded, obsolete, dismissed, or archived.
- [ ] Restart recovery reconciles durable state with AgentManager/worktrees/locks before launching new review work.
- [ ] Metrics prove duplicate review reduction, unsurfaced finding elimination, ledger token overhead, avoided token spend, and finding usefulness across dogfood cycles.
- [ ] `code-rs/core/docs/auto-review.md` describes the new durable lifecycle and fallback/concurrency semantics.

## Relationships

Evidence archive: #76
Related prompt/context ledger work: #92
Related token/budget guardrails: #50
Related token efficiency roadmap: #43
- related: cbusillo/code#76 - https://github.com/cbusillo/code/issues/76
- related: cbusillo/code#92 - https://github.com/cbusillo/code/issues/92
- related: cbusillo/code#50 - https://github.com/cbusillo/code/issues/50
- related: cbusillo/code#43 - https://github.com/cbusillo/code/issues/43

## Finish Line

Every Code Auto Review reaches the dogfood "love gate": it feels safe to leave on during real coding because review runs survive restarts, duplicate token spend is avoided, stale or superseded work is visible and handled, useful terminal results are preserved, freshness is based on snapshot/activity evidence, each assistant turn receives only compact actionable review state, and we can prove latency/token/finding usefulness with dogfood metrics instead of vibes.

## Current Status

State: Active. Foundation and compact ledger slices are complete; the next bar is the dogfood love gate, not just shipping another small review PR.

Completed:
1. #325 Add durable AutoReviewRun store and restart reconciliation - landed in PR #332.
2. #326 Wire TUI and exec Auto Review lifecycle events into durable state - landed in PR #333.
3. #327 Expose compact per-turn Auto Review ledger - landed in PR #334, with follow-up fix PR #335 to keep idle ledger reads non-mutating.

What "love" means for this workstream:
- Auto Review can be left enabled while coding without repeatedly burning tokens on equivalent diffs.
- The assistant can see active/recent review state and decide whether to wait, merge, ignore, cancel, or launch new review work.
- Runtime alone never marks a healthy active review stale; freshness comes from snapshot/head/epoch plus liveness/activity evidence.
- The system records enough per-run evidence to explain why a review was slow: model, reasoning effort, phase timing, token/prompt estimate, follow-up count, duplicate/skipped/superseded reason, and whether findings were useful.
- Dogfooding can compare before/after behavior across sessions and show avoided duplicate review spend, reduced stale review confusion, bounded ledger overhead, and no silently lost findings.

Remaining implementation sequence:
1. #329 Enforce Auto Review dedupe freshness supersede and cancellation policy, including diff-fingerprint based duplicate detection and explicit adopt/reuse/skip/supersede decisions.
2. #328 Add lazy Auto Review detail retrieval, after the ledger has stable ids and policy has clearer run classification.
3. #330 Add Auto Review proof metrics and dogfood diagnostics, including prompt/token estimates and phase timing so review slowness can be proven and tuned.
4. #331 Update Auto Review lifecycle docs for durable runs, after behavior settles.

Recommended next action: dogfood the merged ledger after rebuilding the PATH binary, then start #329. The assistant now receives bounded review state when useful; the harness should next prevent duplicate/stale review work, classify superseded/obsolete runs, and cancel or adopt existing work through explicit policy.

Key design constraints from recent Auto Review feedback:
- Runtime is not staleness.
- A lower follow-up limit reduces repeated review loops but does not make the first review pass fast.
- Background Auto Review has its own model, reasoning, resolve model, and follow-up settings; diagnostics must expose which ones a run actually used.
- Metrics and ledgers must help future turns make decisions without injecting bulky telemetry into ordinary context.

Blocked by: None for #329.

Last verified: 2026-06-02 after read-only agent review of Auto Review latency/settings paths.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Plan: durable Auto Review quality pipeline #324

Summary

Current Evidence

Desired Architecture

Harness-Owned Invariants

Per-Turn Auto Review Ledger

Lazy Detail Surface

Token-Efficiency Strategy

Proof Metrics

Staged Implementation

Suggested Sub-Issues

Open Decisions

Acceptance Criteria

Relationships

Finish Line

Current Status

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Plan: durable Auto Review quality pipeline #324

Description

Summary

Current Evidence

Desired Architecture

Harness-Owned Invariants

Per-Turn Auto Review Ledger

Lazy Detail Surface

Token-Efficiency Strategy

Proof Metrics

Staged Implementation

Suggested Sub-Issues

Open Decisions

Acceptance Criteria

Relationships

Finish Line

Current Status

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions