Skip to content

fix: 7 bugs from discussion #149 smoke walk (envelope, spawn, CDC, observer)#153

Merged
harshitsinghbhandari merged 5 commits into
mainfrom
fix/issue-152-smoke-walk-bugs
Jun 7, 2026
Merged

fix: 7 bugs from discussion #149 smoke walk (envelope, spawn, CDC, observer)#153
harshitsinghbhandari merged 5 commits into
mainfrom
fix/issue-152-smoke-walk-bugs

Conversation

@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator

Summary

Fixes the 7 bugs surfaced by the discussion #149 smoke walk. RCA + suggested fixes are written up in #152; this PR implements them.

Three commits, one per investigation lane (file-disjoint, dispatched in parallel):

Commit Bugs Subsystem
fix(api,spawn): typed errors + project/branch/binary preflight 1, 2, 3, 4, 6 service/session, session_manager, gitworktree, all 21 agent adapters, /rollback endpoint
fix(cdc): emit pr_review_thread_resolved on replace polls 5 sqlite pr_store set-diff
fix(observe): emit scm-disabled log on startup with no subjects 7 observer.go pre-Poll credential check

Per-bug breakdown

Bug 1 — orphan terminated row + opaque 500 on unknown projectId. Service.Spawn/SpawnOrchestrator now call store.GetProject first and return apierr.NotFound("PROJECT_NOT_FOUND", ...) before manager.Spawn. No row is created when the project doesn't exist.

Bug 2 — Restore opaque 500 on half-spawned/terminated session. Manager.Restore gained the ErrIncompleteHandle guard that Manager.Kill already had. Both endpoints now return the same SESSION_INCOMPLETE_HANDLE 409.

Bug 3 — --branch unfetched or checked-out-elsewhere → opaque 500. Two new port sentinels: ErrWorkspaceBranchCheckedOutElsewhere (mapped to BRANCH_CHECKED_OUT_ELSEWHERE 409) and ErrWorkspaceBranchNotFetched (mapped to BRANCH_NOT_FETCHED 400). gitworktree pre-checks listRecords for branch-in-other-worktree and falls back to refs/tags on missing local/remote head.

Bug 4 — orphan terminated row on --claim-pr rollback. New Store.DeleteSession gated to seed-state rows only (preserves the no-resurrection guarantee for live sessions), transactional change_log cleanup, Manager.RollbackSpawn (delete-then-fallback-to-kill), new POST /sessions/{id}/rollback endpoint, CLI rewired to use it. The exit-0 sub-symptom from the smoke walk was unreproducible from current source (every traced path returns a non-nil error → os.Exit(1)) and is intentionally not addressed here — see #152 for the trace.

Bug 5 — pr_review_thread_resolved CDC event never lands. writePRRows no longer does DELETE-then-UPSERT (which made every poll hit the INSERT branch and bypass the AFTER UPDATE WHEN OLD.resolved <> NEW.resolved trigger). It now upserts observed threads (real updates fire the trigger) and set-diff-deletes orphans, all inside the existing transaction.

Bug 6 — agent binary not on PATH succeeds silently. Dropped the return "<name>", nil anti-pattern from all 21 agent adapters; on exec.LookPath miss they now return ports.ErrAgentBinaryNotFound. Manager.Spawn gained a validateAgentBinary pre-flight (with injectable LookPath for tests) that aborts before runtime.Create. Mapped to AGENT_BINARY_NOT_FOUND 400. The longer-term zellij-pane-liveness fix is a separate concern.

Bug 7 — SCM observer disabled log never emitted. checkCredentials is now called once in Observer.loop before the first Poll, independent of subject discovery. The existing credentialsChecked guard keeps it once-per-process. Daemon wiring still uses SkipTokenPreflight: true so readiness doesn't block on gh.

Cross-cutting cleanup

  • toAPIError whitelist extended (no refactor into a registry — kept tight per the RCA recommendation).
  • New port sentinels live in ports/ so service doesn't import adapters; adapter packages keep aliases.
  • Store grew DeleteSession (seed-state gated) and DeleteSeedSession / DeleteChangeLogForSession queries.
  • OpenAPI regenerated for /rollback.

Test plan

  • go build ./... clean
  • go vet ./... clean
  • go test -race ./... — 1097 passed, 0 failed across 61 packages
  • Each bug fix has at least one test that fails without the fix
  • Integration tests in internal/integration/ updated to stub LookPath for the new pre-flight
  • Smoke-walk repro recipes from discussion Smoke test plan: what's testable on main today #149 re-run against this branch (recommended before merge)

Notes for reviewer

  1. Lane B's pr_review_threads.sql.go was hand-edited because sqlc isn't on this machine's PATH. The file matches what sqlc generate from backend/ would produce; please regenerate before merge if your environment has it, just to confirm.
  2. Three commits, not squashed — each maps cleanly to one lane / one investigation and the per-bug change sets are easy to bisect.
  3. Bug 6 audit touched 21 adapter files (one-line return-error change each); they're mechanical and bundled into the Lane A commit.

Closes #152
EOF

harshitsinghbhandari and others added 3 commits June 7, 2026 07:01
writePRRows was DELETE-then-UPSERT on the Replace path, so every poll's
upserts hit the INSERT branch and the AFTER UPDATE trigger that emits
pr_review_thread_resolved never fired in production. Replaces the
blanket delete with a set-diff: upsert observed threads first (so
unchanged thread_ids go through ON CONFLICT DO UPDATE and fire the
UPDATE trigger when resolved flips), then delete orphans whose
thread_id is not in the observed set, all inside the existing tx.

Adds DeletePRReviewThread query (sqlc-generated form hand-edited; no
sqlc binary available locally — sqlc generate from backend/ produces an
identical file).

Tests: TestPRReviewThreadsCDC_EmitsResolvedOnReplacePoll (regression —
fails without fix) and TestPRReviewThreadsReplace_PrunesOrphansWithoutReinserting.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bug 7)

checkCredentials lived only inside Poll, which short-circuits when
discoverSubjects is empty. On a fresh daemon with no tracked PRs the
documented "scm observer disabled: provider credentials unavailable"
warn was unreachable, leaving users with no signal that the SCM
observer was a no-op.

Calls checkCredentials once in Observer.loop before the first Poll.
The existing credentialsChecked guard preserves once-per-process
semantics; provider construction still uses SkipTokenPreflight so
daemon readiness doesn't block on gh.

Test: TestStart_LogsDisabledWarningWhenNoTokenAndNoSubjects with a
race-safe syncBuffer for capturing slog from the observer goroutine.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bugs 1-4,6)

Closes the long tail of opaque-500-and-orphan-row failures that
discussion #149's smoke walk surfaced. The common shape: spawn created
the session row before validating preconditions, and the underlying
errors weren't typed, so toAPIError defaulted to INTERNAL_ERROR.

Bug 1 (orphan row + opaque 500 on unknown projectId):
Service.Spawn / SpawnOrchestrator now call store.GetProject first and
return apierr.NotFound("PROJECT_NOT_FOUND", ...) before manager.Spawn,
eliminating the create-row-then-fail-workspace ordering.

Bug 2 (Restore opaque 500 on half-spawned/terminated session):
Manager.Restore gained the ErrIncompleteHandle guard that Kill has at
manager.go:189-193. toAPIError now maps both restore and kill to the
same SESSION_INCOMPLETE_HANDLE 409 envelope.

Bug 3 (--branch unfetched / checked-out-elsewhere → opaque 500):
gitworktree pre-checks listRecords for branch-in-other-worktree, falls
back to refs/tags on missing local/remote head, and emits two new port
sentinels (ErrWorkspaceBranchCheckedOutElsewhere,
ErrWorkspaceBranchNotFetched) mapped to BRANCH_CHECKED_OUT_ELSEWHERE
(409) and BRANCH_NOT_FETCHED (400).

Bug 4 (orphan terminated row on claim-pr rollback):
Adds Store.DeleteSession gated to seed-state rows only (preserves the
no-resurrection guarantee for live sessions), transactional change_log
cleanup, Manager.RollbackSpawn (delete-then-fallback-to-kill), a new
POST /sessions/{id}/rollback endpoint, and rewires
cli/spawn.rollbackSpawnedSession to use it. The exit-0 sub-symptom was
unreproducible from current source and is left unaddressed.

Bug 6 (agent binary not on PATH → silent idle session):
Drops the "return name, nil" anti-pattern from all 21 agent adapters
and returns the new ports.ErrAgentBinaryNotFound on exec.LookPath miss.
Manager.Spawn gained a validateAgentBinary pre-flight (with injectable
LookPath so tests don't need real binaries on PATH) that aborts before
runtime.Create. Mapped to AGENT_BINARY_NOT_FOUND (400). Integration
tests in internal/integration/ stub LookPath to /usr/bin/true.

Tests cover each bug end-to-end. OpenAPI regenerated for /rollback.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@harshitsinghbhandari harshitsinghbhandari added this to the rewrite milestone Jun 7, 2026
CI fixes for #153:
- gofmt/goimports on kilocode and kiro adapters that the bug 6 audit
  left mis-grouped.
- openapi-typescript regen against the new /rollback endpoint added in
  the Lane A commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 7, 2026

Greptile Summary

This PR fixes 7 bugs from the #149 smoke walk across four subsystems: session spawn/lifecycle, CDC PR review threads, gitworktree branch handling, and the SCM observer credential gate. The changes span 57 files across three parallel investigation lanes.

  • Session lifecycle (Bugs 1–4, 6): requireProject preflight in Service.Spawn/SpawnOrchestrator, ErrIncompleteHandle guard in Manager.Restore, typed branch/binary sentinels in ports/, validateAgentBinary pre-flight before runtime.Create, new Store.DeleteSession (seed-state gated) + Manager.RollbackSpawn + POST /sessions/{id}/rollback endpoint, and a one-liner fix to all 21 agent adapters that previously returned a bare binary name on LookPath miss.
  • CDC fix (Bug 5): writePRRows Replace mode now upserts observed threads first then set-diff-deletes orphans, so unchanged threads hit ON CONFLICT DO UPDATE and the AFTER UPDATE trigger that emits pr_review_thread_resolved can fire.
  • Observer fix (Bug 7): Observer.loop calls checkCredentials once before the first Poll so the "scm observer disabled" warning is emitted on a fresh daemon with no tracked sessions.

Confidence Score: 3/5

The CDC, observer, branch, binary-preflight, and project-preflight fixes are all correct and well-tested; one transaction ordering defect in DeleteSession will silently destroy CDC history for any fully-spawned session that goes through the rollback-fallback-to-kill path.

The store's DeleteSession always runs DeleteChangeLogForSession before checking whether the row is in seed state. For the primary caller (RollbackSpawn on a fully-spawned session), this unconditionally wipes the session's CDC history — session_created and session_updated events — even though the session itself is not deleted and is subsequently killed via the fallback path. This defect lives in the exact flow Bug 4 was designed to fix, and the new test does not catch it because it only checks the row-deletion gate, not whether change_log entries survive.

backend/internal/storage/sqlite/store/session_store.go — the DeleteSession transaction ordering needs to be reversed; backend/internal/storage/sqlite/gen/pr_review_threads.sql.go — hand-edited generated file should be regenerated via sqlc before merge.

Important Files Changed

Filename Overview
backend/internal/storage/sqlite/store/session_store.go Adds DeleteSession with a transaction that unconditionally purges CDC events before checking seed state — destroys change_log rows for live sessions when the seed-state gate is a no-op.
backend/internal/storage/sqlite/store/pr_store.go Replaces DELETE-all+upsert with upsert-first then set-diff deletion for Replace mode, correctly letting UPDATE trigger fire for resolved thread transitions.
backend/internal/session_manager/manager.go Adds RollbackSpawn (delete-then-kill fallback), ErrIncompleteHandle guard for Restore, and validateAgentBinary pre-flight before runtime.Create; logic is correct.
backend/internal/observe/scm/observer.go Pre-poll checkCredentials call ensures disabled warning is emitted even when discoverSubjects returns empty; credentialsChecked guard prevents double-check on the subsequent Poll.
backend/internal/service/session/service.go Adds requireProject pre-validation for Spawn/SpawnOrchestrator and RollbackSpawn delegation; toAPIError extended with branch and binary sentinels.
backend/internal/adapters/workspace/gitworktree/workspace.go Adds branch-in-other-worktree preflight check and ErrBranchNotFetched sentinel with tag-ref fallback in resolveBaseRef; error types are correctly wired to ports sentinels.
backend/internal/storage/sqlite/gen/pr_review_threads.sql.go Hand-edited generated file adds DeletePRReviewThread per-row delete; looks structurally correct but should be regenerated via sqlc before merge to eliminate drift risk.
backend/internal/httpd/controllers/sessions.go Wires new rollback endpoint, delegates to service, returns typed RollbackSessionResponse; controller pattern matches existing kill/restore handlers.

Sequence Diagram

sequenceDiagram
    participant CLI
    participant HTTP as POST /rollback
    participant Svc as Service.RollbackSpawn
    participant Mgr as Manager.RollbackSpawn
    participant Store
    participant Kill as Manager.Kill

    CLI->>HTTP: "POST /sessions/{id}/rollback"
    HTTP->>Svc: RollbackSpawn(id)
    Svc->>Mgr: RollbackSpawn(id)
    Mgr->>Store: DeleteSession(id)
    Note over Store: tx: DeleteChangeLogForSession(id) runs unconditionally<br/>then DeleteSeedSession(id) is no-op for live sessions
    Store-->>Mgr: "deleted=false"
    Mgr->>Kill: Kill(id)
    Kill-->>Mgr: "freed=true"
    Mgr-->>Svc: "deleted=false, killed=true"
    Svc-->>HTTP: RollbackOutcome
    HTTP-->>CLI: "200 killed=true"
Loading

Reviews (1): Last reviewed commit: "chore: gofmt + regen frontend schema.ts ..." | Re-trigger Greptile

Comment thread backend/internal/storage/sqlite/store/session_store.go Outdated
Comment thread backend/internal/storage/sqlite/gen/pr_review_threads.sql.go Outdated
, PR #153 review)

Addresses @greptile-apps P1 and P2 review feedback on PR #153.

P1 (CDC events deleted for live sessions in rollback fallback):
DeleteChangeLogForSession ran unconditionally inside the transaction
before DeleteSeedSession's seed-state predicates filtered the session
delete to a no-op. For a live session reaching DeleteSession (the
delete-then-kill fallback path inside RollbackSpawn), the seed delete
returned 0 rows but the session_created/session_updated CDC events
had already been purged. Now probes via a new SessionIsSeed query
first and short-circuits the whole tx — including the change_log
cleanup — when the row is not in seed state.

P2 (regen sqlc): installed sqlc 1.31.1 and ran `sqlc generate` from
backend/, replacing the hand-edited pr_review_threads.sql.go (and
producing minor format-only churn in models.go, pr.sql.go,
sessions.sql.go, changelog.sql.go).

The regen surfaced two issues:

1. GetPR / ListPRsBySession had their return types hand-changed to
   gen.PR by the previous PR; sqlc actually emits GetPRRow /
   ListPRsBySessionRow when queries enumerate columns. Fixed by
   collapsing those two queries to `SELECT * FROM pr` so sqlc returns
   gen.PR (which is what the store's prRowFromGen converter expects),
   and pr.last_nudge_signature now lands in the result alongside the
   existing 37 columns.

2. sqlc 1.31.1's SQLite parser silently strips trailing `?`
   placeholders and string literals from DELETE statements (reproduced
   with sqlc.arg, IFNULL, rowid subquery, and second-predicate
   workarounds — all eaten). DeleteSeedSession and
   DeleteChangeLogForSession both tripped it. They are now run as
   plain tx.ExecContext calls inside Store.DeleteSession, inside the
   same write transaction as SessionIsSeed; both queries are removed
   from the queries/ directory and the workaround context is
   documented inline in queries/sessions.sql and queries/changelog.sql
   to keep future contributors from re-adding them.

Verified: go build ./... clean, go test -race ./... 1097/1097 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

harshitsinghbhandari has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

@harshitsinghbhandari harshitsinghbhandari merged commit c343c55 into main Jun 7, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RCA: 7 bugs from discussion #149 smoke walk (opaque 500s, orphan rows, silent failures)

1 participant