Recent Auto Review feedback also caught a conceptual bug: showing stale based only on elapsed runtime is misleading. A long-running review can be healthy and valuable. Stale should mean the reviewed snapshot is no longer relevant, or the agent/process is inactive/lost according to stronger evidence.
Desired Architecture
Add a durable Auto Review run store under the existing repo-scoped review state area, conceptually:
CODE_HOME/state/review/repo-<repo-key>/auto-review/
  runs.ndjson or runs.json
  outputs/<run_id>.json
  events/<run_id>.jsonl
Initial implementation should prefer simple atomic JSON/NDJSON files over SQLite unless locking/atomicity becomes a proven problem. The store belongs in code-core, near review_coord, so TUI, exec, auto-drive, and future surfaces share the same contract.
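A minimal sketch of what "simple atomic JSON/NDJSON files" could look like, using only the Rust standard library. The function names `write_json_atomic` and `append_event` are hypothetical and not taken from code-core. The sketch assumes the temp file and the target live on the same filesystem, so the rename replaces the target in one step.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Replace `path` in one step: write a sibling temp file, flush it to disk,
/// then rename it over the target. Readers see the old or the new content,
/// never a half-written file (assumes both paths are on one filesystem).
pub fn write_json_atomic(path: &Path, json: &str) -> io::Result<()> {
    let tmp = path.with_extension("tmp"); // hypothetical temp-file naming
    {
        let mut file = File::create(&tmp)?;
        file.write_all(json.as_bytes())?;
        file.sync_all()?;
    }
    fs::rename(&tmp, path)
}

/// Append one event as a single NDJSON line. Embedded newlines are rejected
/// so that every line in the file stays one complete record.
pub fn append_event(path: &Path, json_line: &str) -> io::Result<()> {
    if json_line.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "event must be a single line",
        ));
    }
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(format!("{json_line}\n").as_bytes())
}
```

Under this sketch, outputs/<run_id>.json would go through `write_json_atomic` and events/<run_id>.jsonl through `append_event`. Concurrent writers from several processes would still need the locking that the paragraph above defers until it becomes a proven problem.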
FindingDigest
finding_id
priority
confidence
title
repo_relative_path
line_start, line_end
status: open | addressed | obsolete | dismissed | unresolved
detail_ref
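The digest fields above could map onto a Rust type roughly as follows. This is a sketch only: the field names come from the list, but the concrete types (`u8`, `f32`, `u32`, `String`) and the `is_actionable` helper are assumptions, not the schema in code-core.

```rust
/// Status values exactly as listed in the digest shape.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FindingStatus {
    Open,
    Addressed,
    Obsolete,
    Dismissed,
    Unresolved,
}

impl FindingStatus {
    pub fn as_str(self) -> &'static str {
        match self {
            FindingStatus::Open => "open",
            FindingStatus::Addressed => "addressed",
            FindingStatus::Obsolete => "obsolete",
            FindingStatus::Dismissed => "dismissed",
            FindingStatus::Unresolved => "unresolved",
        }
    }

    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "open" => Some(FindingStatus::Open),
            "addressed" => Some(FindingStatus::Addressed),
            "obsolete" => Some(FindingStatus::Obsolete),
            "dismissed" => Some(FindingStatus::Dismissed),
            "unresolved" => Some(FindingStatus::Unresolved),
            _ => None,
        }
    }

    /// Assumption: only open and unresolved findings still need attention.
    pub fn is_actionable(self) -> bool {
        matches!(self, FindingStatus::Open | FindingStatus::Unresolved)
    }
}

/// Compact, stable digest; the full finding body stays behind `detail_ref`.
#[derive(Debug, Clone, PartialEq)]
pub struct FindingDigest {
    pub finding_id: String,
    pub priority: u8,    // assumed numeric priority
    pub confidence: f32, // assumed 0.0..=1.0
    pub title: String,
    pub repo_relative_path: String,
    pub line_start: u32,
    pub line_end: u32,
    pub status: FindingStatus,
    pub detail_ref: String,
}
```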
Harness-Owned Invariants
The assistant may inspect and reason over review state, but the harness owns the hard rules:
never start a duplicate review for the same useful scope unless the prompt/model/policy version intentionally changes
never cancel solely because elapsed runtime is high
classify freshness from snapshot/head/epoch plus activity/heartbeat, not from wall-clock runtime alone
adopt or reconnect to a matching active run before retrying
cancel only for explicit user stop, superseded/obsolete scope, dead/lost process, hard budget exhaustion, or a proven duplicate
terminal findings are immutable evidence and are always classified/surfaced/archived, not silently discarded
restart recovery reconciles durable runs against AgentManager, worktrees, review locks, PIDs, and snapshot epochs before exposing state
fallback worktrees are for preserving surfaced fixes or safe follow-up work, not for casually duplicating near-identical live reviews
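Two of the invariants above, freshness classification and the closed set of cancellation reasons, can be sketched as pure functions. All type and function names here are hypothetical. The sketch also simplifies "no longer relevant" to "reviewed snapshot differs from head", which a real implementation would likely refine with epochs and scope. The `runtime` field is carried but deliberately never read by either function.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Freshness {
    Current,
    Superseded,
    Lost,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CancelReason {
    UserStop,
    SupersededScope,
    DeadOrLostProcess,
    BudgetExhausted,
    ProvenDuplicate,
}

#[derive(Debug, Clone, Copy)]
pub struct RunEvidence<'a> {
    pub reviewed_snapshot: &'a str,
    pub head_snapshot: &'a str,
    pub since_last_activity: Duration,
    pub process_alive: bool,
    /// Recorded for cost awareness only; never consulted below.
    pub runtime: Duration,
}

/// Freshness comes from snapshot plus liveness evidence, not wall-clock runtime.
pub fn classify_freshness(ev: &RunEvidence<'_>, heartbeat_timeout: Duration) -> Freshness {
    if ev.reviewed_snapshot != ev.head_snapshot {
        return Freshness::Superseded;
    }
    if !ev.process_alive || ev.since_last_activity > heartbeat_timeout {
        return Freshness::Lost;
    }
    Freshness::Current
}

/// Returns a reason only for the cases the harness allows; there is no
/// parameter through which elapsed runtime could trigger a cancel.
pub fn cancel_reason(
    freshness: Freshness,
    user_stop: bool,
    budget_exhausted: bool,
    proven_duplicate: bool,
) -> Option<CancelReason> {
    if user_stop {
        return Some(CancelReason::UserStop);
    }
    match freshness {
        Freshness::Superseded => return Some(CancelReason::SupersededScope),
        Freshness::Lost => return Some(CancelReason::DeadOrLostProcess),
        Freshness::Current => {}
    }
    if budget_exhausted {
        return Some(CancelReason::BudgetExhausted);
    }
    if proven_duplicate {
        return Some(CancelReason::ProvenDuplicate);
    }
    None
}
```

With this shape, a review that has run for hours but reported activity seconds ago on the current snapshot classifies as `Current` and yields no cancel reason.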
Per-Turn Auto Review Ledger
Expose a compact hidden ledger each turn, integrated with the broader context ledger work in #92.
Default target: under roughly 600-1200 tokens, and zero tokens when there are no recent/actionable runs.
Example shape:
<auto_review_ledger version="1" repo="code">
  <active run="ar_123" phase="reviewing" snapshot="abc1234" runtime="9m" last_activity="38s" freshness="current" scope="4 files" />
  <latest run="ar_122" status="superseded" snapshot="def5678" findings="1" surfaced="false" summary="Potential regression in background review completion." />
  <finding id="f1" p="2" path="code-rs/tui/src/chatwidget.rs" line="1488" title="Stale label uses elapsed runtime rather than inactivity" />
  <details available="auto_review.details(run_id, finding_id)" />
</auto_review_ledger>
Ledger rules:
include at most one active run by default
include latest current/superseded terminal result with actionable findings
include only unresolved high-priority digests, not full finding bodies
include stable ids and detail refs
include runtime for cost awareness and last_activity for liveness, but do not equate runtime with stale
no raw diff, full JSON, or repeated historical runs by default
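The ledger rules could be enforced by a small projection function like the sketch below. It is illustrative only: `ActiveRun`, `DigestLine`, and `render_ledger` are hypothetical names, the cap is measured in bytes as a rough stand-in for tokens, and a lower `priority` number is assumed to mean more urgent. The cap limits how many finding digests are included; the header and the single active run are always emitted when there is anything to report.

```rust
pub struct ActiveRun<'a> {
    pub run_id: &'a str,
    pub phase: &'a str,
    pub snapshot: &'a str,
    pub runtime: &'a str,
    pub last_activity: &'a str,
    pub freshness: &'a str,
}

pub struct DigestLine<'a> {
    pub id: &'a str,
    pub priority: u8,
    pub path: &'a str,
    pub line: u32,
    pub title: &'a str,
}

/// Minimal XML attribute escaping.
fn esc(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Returns an empty string (zero tokens) when there is nothing to report.
pub fn render_ledger(
    repo: &str,
    active: Option<&ActiveRun<'_>>,
    findings: &[DigestLine<'_>],
    max_bytes: usize,
) -> String {
    if active.is_none() && findings.is_empty() {
        return String::new();
    }
    let header = format!("<auto_review_ledger version=\"1\" repo=\"{}\">\n", esc(repo));
    let footer = "</auto_review_ledger>";
    let mut body = String::new();
    if let Some(a) = active {
        body.push_str(&format!(
            "  <active run=\"{}\" phase=\"{}\" snapshot=\"{}\" runtime=\"{}\" last_activity=\"{}\" freshness=\"{}\" />\n",
            esc(a.run_id), esc(a.phase), esc(a.snapshot),
            esc(a.runtime), esc(a.last_activity), esc(a.freshness)
        ));
    }
    // Most urgent digests first; stop adding once the cap would be exceeded.
    let mut sorted: Vec<&DigestLine<'_>> = findings.iter().collect();
    sorted.sort_by_key(|f| f.priority);
    for f in sorted {
        let line = format!(
            "  <finding id=\"{}\" p=\"{}\" path=\"{}\" line=\"{}\" title=\"{}\" />\n",
            esc(f.id), f.priority, esc(f.path), f.line, esc(f.title)
        );
        if header.len() + body.len() + line.len() + footer.len() > max_bytes {
            break;
        }
        body.push_str(&line);
    }
    format!("{header}{body}{footer}")
}
```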
Lazy Detail Surface
Add a read-only detail path, either as a harness tool or internal operation exposed through existing agent tooling:
auto_review.details(run_id, finding_id?)
It should return bounded detail from the durable output sidecar, with truncation and clear errors when the requested run/finding is unavailable.
Do not give the LLM unrestricted cancellation authority. If future tools allow LLM-suggested actions, they should be requests to the harness, and the harness should enforce the policy.
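A read-only lookup with bounded output and explicit errors might look like the following sketch. The in-memory `HashMap` stands in for the durable output sidecar, and `RunOutput`, `DetailError`, and the ` [truncated]` marker are assumptions made for illustration. Nothing in this surface can mutate or cancel a run.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum DetailError {
    RunNotFound(String),
    FindingNotFound { run_id: String, finding_id: String },
}

/// Stand-in for the content of outputs/<run_id>.json.
pub struct RunOutput {
    pub summary: String,
    pub findings: HashMap<String, String>, // finding_id -> full detail body
}

/// Cut on a character boundary and mark the cut so the reader knows
/// the detail was bounded.
fn truncate(text: &str, max_chars: usize) -> String {
    if text.chars().count() <= max_chars {
        return text.to_string();
    }
    let mut out: String = text.chars().take(max_chars).collect();
    out.push_str(" [truncated]");
    out
}

/// Mirrors auto_review.details(run_id, finding_id?): the run summary when no
/// finding is named, otherwise that finding's body, always bounded.
pub fn details(
    store: &HashMap<String, RunOutput>,
    run_id: &str,
    finding_id: Option<&str>,
    max_chars: usize,
) -> Result<String, DetailError> {
    let run = store
        .get(run_id)
        .ok_or_else(|| DetailError::RunNotFound(run_id.to_string()))?;
    let text = match finding_id {
        None => &run.summary,
        Some(fid) => run.findings.get(fid).ok_or_else(|| DetailError::FindingNotFound {
            run_id: run_id.to_string(),
            finding_id: fid.to_string(),
        })?,
    };
    Ok(truncate(text, max_chars))
}
```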
Token-Efficiency Strategy
We should prove this is cheaper because it prevents wasted review runs, not because the ledger is free.
Expected savings:
dedupe avoids entire duplicate review runs
adoption/reconnect avoids relaunching after restart
superseded/obsolete cancellation stops spending on low-value reviews
compact ledger prevents dumping raw review results into every turn
lazy details let the assistant inspect only when findings matter
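The claim that savings come from avoided runs rather than a free ledger reduces to simple arithmetic: net savings equal avoided spend minus ledger and detail overhead. A hedged sketch follows, with hypothetical counter names; the "avoided" figures would necessarily be estimates, for example the token cost of the run that was not relaunched.

```rust
/// Per-session token counters (names are illustrative, not an existing schema).
#[derive(Debug, Default, Clone, Copy)]
pub struct ReviewTokenCounters {
    pub ledger_tokens_injected: u64,
    pub detail_tokens_read: u64,
    pub tokens_avoided_by_dedupe: u64,
    pub tokens_avoided_by_adoption: u64,
    pub tokens_avoided_by_cancellation: u64,
}

impl ReviewTokenCounters {
    /// What the ledger and lazy details cost across all turns.
    pub fn overhead(&self) -> u64 {
        self.ledger_tokens_injected + self.detail_tokens_read
    }

    /// Estimated spend that never happened because of dedupe, adoption,
    /// or cancellation of superseded work.
    pub fn avoided(&self) -> u64 {
        self.tokens_avoided_by_dedupe
            + self.tokens_avoided_by_adoption
            + self.tokens_avoided_by_cancellation
    }

    /// Positive means the design paid for itself; negative means the ledger
    /// cost more than it saved in this session.
    pub fn net_savings(&self) -> i64 {
        self.avoided() as i64 - self.overhead() as i64
    }
}
```

A session where the ledger was injected but no run was avoided shows up as a negative number, which is exactly the case dogfooding needs to be able to detect.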
Every Code Auto Review reaches the dogfood "love gate": it feels safe to leave on during real coding because review runs survive restarts, duplicate token spend is avoided, stale or superseded work is visible and handled, useful terminal results are preserved, freshness is based on snapshot/activity evidence, each assistant turn receives only compact actionable review state, and we can prove latency/token/finding usefulness with dogfood metrics instead of vibes.
Current Status
State: Active. Foundation and compact ledger slices are complete; the next bar is the dogfood love gate, not just shipping another small review PR.
Auto Review can be left enabled while coding without repeatedly burning tokens on equivalent diffs.
The assistant can see active/recent review state and decide whether to wait, merge, ignore, cancel, or launch new review work.
Runtime alone never marks a healthy active review stale; freshness comes from snapshot/head/epoch plus liveness/activity evidence.
The system records enough per-run evidence to explain why a review was slow: model, reasoning effort, phase timing, token/prompt estimate, follow-up count, duplicate/skipped/superseded reason, and whether findings were useful.
Dogfooding can compare before/after behavior across sessions and show avoided duplicate review spend, reduced stale review confusion, bounded ledger overhead, and no silently lost findings.
Recommended next action: dogfood the merged ledger after rebuilding the PATH binary, then start #329. The assistant now receives bounded review state when useful; the harness should next prevent duplicate/stale review work, classify superseded/obsolete runs, and cancel or adopt existing work through explicit policy.
Key design constraints from recent Auto Review feedback:
Runtime is not staleness.
A lower follow-up limit reduces repeated review loops but does not make the first review pass fast.
Background Auto Review has its own model, reasoning, resolve model, and follow-up settings; diagnostics must expose which ones a run actually used.
Metrics and ledgers must help future turns make decisions without injecting bulky telemetry into ordinary context.
Summary
Auto Review should become a durable, repo-scoped quality pipeline, not a TUI-local timeout/status side effect.
The direction we want to love is:
This is the successor planning track to #76. Keep #76 as the evidence archive and record of earlier snapshot-aware/currentness slices.
Current Evidence
Existing #76 evidence showed the core problem at product scale:
Recent Auto Review feedback also caught a conceptual bug: showing stale based only on elapsed runtime is misleading. A long-running review can be healthy and valuable. Stale should mean the reviewed snapshot is no longer relevant, or the agent/process is inactive/lost according to stronger evidence.
Desired Architecture
Add a durable Auto Review run store under the existing repo-scoped review state area, conceptually:
Initial implementation should prefer simple atomic JSON/NDJSON files over SQLite unless locking/atomicity becomes a proven problem. The store belongs in code-core, near review_coord, so TUI, exec, auto-drive, and future surfaces share the same contract.
Core record shape:
Finding digests should be compact and stable:
Harness-Owned Invariants
The assistant may inspect and reason over review state, but the harness owns the hard rules:
Per-Turn Auto Review Ledger
Expose a compact hidden ledger each turn, integrated with the broader context ledger work in #92.
Default target: under roughly 600-1200 tokens, and zero tokens when there are no recent/actionable runs.
Example shape:
Ledger rules:
include runtime for cost awareness and last_activity for liveness, but do not equate runtime with stale
Lazy Detail Surface
Add a read-only detail path, either as a harness tool or internal operation exposed through existing agent tooling:
It should return bounded detail from the durable output sidecar, with truncation and clear errors when the requested run/finding is unavailable.
Do not give the LLM unrestricted cancellation authority. If future tools allow LLM-suggested actions, they should be requests to the harness, and the harness should enforce the policy.
Token-Efficiency Strategy
We should prove this is cheaper because it prevents wasted review runs, not because the ledger is free.
Expected savings:
Track ledger overhead explicitly:
And compare against avoided spend:
Proof Metrics
Add structured counters/events so dogfooding can prove whether the concept works:
Success signals:
Staged Implementation
Durable run store foundation
AutoReviewRun schema and store in code-core
Wire existing TUI/exec paths into the store
run_background_review records run lifecycle
AgentManager/status updates refresh activity
AutoReviewTracker writes completions into the same store
last_seen recovery with store-backed activity/currentness
Compact per-turn ledger
AutoReviewLedger projection with a strict byte/token cap
Lazy detail retrieval
Dedupe, supersede, and cancellation policy
Metrics and dogfood proof
/review-stats or similar diagnostic surface when useful
Docs and issue cleanup
code-rs/core/docs/auto-review.md to describe durable runs, ledgers, fallback semantics, and proof metrics
Suggested Sub-Issues
Create sub-issues once the parent is accepted:
Open Decisions
Acceptance Criteria
code-rs/core/docs/auto-review.md describes the new durable lifecycle and fallback/concurrency semantics.
Relationships
Evidence archive: #76
Related prompt/context ledger work: #92
Related token/budget guardrails: #50
Related token efficiency roadmap: #43
Finish Line
Every Code Auto Review reaches the dogfood "love gate": it feels safe to leave on during real coding because review runs survive restarts, duplicate token spend is avoided, stale or superseded work is visible and handled, useful terminal results are preserved, freshness is based on snapshot/activity evidence, each assistant turn receives only compact actionable review state, and we can prove latency/token/finding usefulness with dogfood metrics instead of vibes.
Current Status
State: Active. Foundation and compact ledger slices are complete; the next bar is the dogfood love gate, not just shipping another small review PR.
Completed:
What "love" means for this workstream:
Remaining implementation sequence:
Recommended next action: dogfood the merged ledger after rebuilding the PATH binary, then start #329. The assistant now receives bounded review state when useful; the harness should next prevent duplicate/stale review work, classify superseded/obsolete runs, and cancel or adopt existing work through explicit policy.
Key design constraints from recent Auto Review feedback:
Blocked by: None for #329.
Last verified: 2026-06-02 after read-only agent review of Auto Review latency/settings paths.