feat: harness hook resilience (self-heal hooks, start short-circuit, doctor drift detection)#22
Merged
Merged
Conversation
- Extend the hidden hook stdin extraction helper to accept and validate multiple top-level fields, printing each scalar value on its own line.
- Add coverage for hook stdin extraction with multiple fields, including order, missing-field, non-scalar, and empty-args cases.
- Promote known adapter metadata from single config paths to per-path specs with marker variants, and expand Codex discovery entries to include hooks.json plus legacy TOML locations.
- Selected the only offered capture because it refactors `internal/adapter/detect.go` to centralize harness path metadata into `pathSpec` entries, update `ConfigPath`/`allPaths` to use that structure, and make marker detection path-specific.
- Selected because it extends `internal/adapter/known.go` with install-detection marker checks that fit the recent harness-path and marker-detection refactor, specifically supporting `acd doctor` shadowed-install warnings.
- Add a helper to read config files and detect known marker strings for adapter identification.
- Small portability/standard-library cleanup in Codex install detection: use os.UserHomeDir and filepath.Join instead of local wrappers.
- Add coverage for Codex installation detection paths and marker handling.
- Add Codex hook configuration to start, wake, and touch sessions around tool use and prompt events.
- Update the Codex setup harness snippet reference to point at the hooks JSON file and its comment style.
- Delete the Codex config snippet file from templates/codex, matching the move away from the old hook-based snippet.
- Align Codex setup tests with the new hooks.json output and hook schema, replacing the obsolete config snippet expectations.
- Adjust the Codex template README to document the new hooks.json-based setup, legacy TOML cleanup, and current hook wiring/fallback behavior.
- Add a helper to check whether the primary harness path contains its marker, so doctor can avoid false positives across formats.
- Adjust doctor harness reporting to use primary-path marker matching and refine Codex legacy config guidance.
- Expand doctor tests around Codex hooks.json detection: treat managed hooks.json as installed, add shadow-warning coverage when legacy config.toml coexists, and update human/json expectations.
- Add legacy Codex hooks.json detection in doctor output and align the tests with the new behavior, including the updated install report expectations and warning text.
- Update doctor’s Codex harness detection to treat hooks.json as the primary install marker and adjust tests for the new hooks.json behavior plus legacy config.toml shadow warning.
- Switch the adapter E2E codex snippet parser from TOML-style handling to hooks.json JSON unmarshalling, and remove the obsolete TOML helper.
- Aligns the Codex E2E test with the new hooks.json format and stdin cwd handling for Codex v2, while expanding the same test coverage for SessionStart, UserPromptSubmit, PreToolUse, and Stop behavior.
- Add the new legacy TOML auto-detection subtest to the Codex E2E suite in `test/integration/adapter_e2e_test.go` alongside the existing Codex E2E checks.
- Adds an integration E2E case covering Codex auto-detection when only the legacy `~/.codex/config.toml` contains the `acd` marker and `hooks.json` is absent, matching the recent Codex hooks discovery changes.
- Docs update for Codex harness guidance: switch to hooks.json v2 discovery, note legacy TOML removal, describe event wiring/markers, and update helper behavior notes.
- README.md updates the Codex setup example to direct users to save output to `~/.codex/hooks.json` instead of the old TOML config path, matching the newer hooks.json v2 setup.
- Document the Codex ACD hook’s new hooks.json v2 wiring, including the added lifecycle events and the Stop->acd touch behavior.
- Record the new Codex hook installation format, event wiring, migration path, and hook-stdin-extract behavior in the changelog.
- The change is a focused update to the Codex uninstall guide: it adds the current hooks.json uninstall path, keeps the legacy TOML removal note, and clarifies daemon/log cleanup steps.
- This commit appears to restore Codex setup around the TOML config snippet: it adds the new `templates/codex/config.snippet.toml`, updates README and CLAUDE guidance to point at the TOML setup, simplifies the Codex hook docs, and removes the JSON/legacy-shadow detection and tests that supported the previous `hooks.json` migration path. The CHANGELOG edit matches the same Codex hooks reorientation.
- Implement and document the Codex hooks v2 migration end-to-end: switch setup output to ~/.codex/hooks.json, update guide/docs/changelog, and add path-specific adapter detection plus tests to avoid JSON/TOML marker cross-matches and detect shadowed legacy TOML installs.
- The captures all point to the same Codex rollback/restoration: docs and changelog remove hooks.json v2 migration language, setup switches Codex back to a TOML snippet, adapter detection/doctor logic is adjusted around the legacy TOML path and warning, and the hook helper reverts to single-field extraction. These changes belong together as one coherent Codex restore commit.
- Update the codex helper failure logging regression test to account for Stop having only one failure branch that must capture rc, while the other events still require two.
- The change switches start-cache writes to a unique temporary file per write, then preserves atomic rename and 0600 permissions, to prevent concurrent `acd start` hooks from corrupting the shared JSON cache.
- Adjust stop-command cleanup to remove in-flight start-cache temp files created by os.CreateTemp, alongside final .json cache files.
- Select the start-cache concurrency regression test because it directly follows the temp-file cleanup work in the same start cache area. Defer the Codex hook E2E change as unrelated to start-cache file handling and a separate concern.
- Forced-aging window: this is the overdue capture, so it should be committed alone.
- Prepare the start short-circuit tests for concurrent access by adding the missing sync import used by the new regression coverage.
- Add regression coverage that all opencode and Pi hook snippets gate `mkdir -p` behind a log-dir existence check, matching the codex template invariant.
- Use a generated fallback Pi session ID so hook logs remain attributable when PI_SESSION_ID is unset.
- Add a regression test that verifies TouchClient only updates last_seen_ts, preserves existing client metadata, returns false for unknown sessions, and errors on empty session IDs.
- Extend Pi setup regression coverage so hook snippets require the per-process SID fallback (`pi-$$-$(date +%s)`) instead of the shared `unknown` placeholder, and add a shell-level check that two processes produce distinct fallback IDs.
- Add a non-fatal hot-path update to refresh daemon client last-seen timestamps on cache hits so the sweeper does not evict active sessions during sustained hot-path activity.
- Add a helper that updates daemon client last_seen_ts on the start hot path via a brief per-repo SQLite touch, with test stubbing support.
- The change adds commentary explaining the hot-path refcount refresh that immediately follows a successful start short-circuit decision, matching the surrounding timestamp-touching logic.
- Add a regression test that verifies repeated `acd start` hot-path calls under the same session_id still advance `daemon_clients.last_seen_ts`, guarding the short-circuit touch behavior.
- The diff only adds `strconv` to `test/integration/regressions_test.go`, so this capture stands alone as a small test-file update.
- Update the repeated-active-hooks short-circuit regression test so it proves the hot-path helper runs and no daemon respawn occurs.
- Replace the SQLite file-removal tripwire in the multi-session start short-circuit test with a stubbed hot-path touch recorder keyed by session ID, and assert each session records the expected number of hot-path touches while the spawn count stays at one.
- Selected the lone stop.go capture: it expands the stop-path cache invalidation logic to also clear all per-session start caches when a forced stop does not fully stop the daemon, keeping the cache state conservative in unknown outcomes.
- Add stop CLI tests covering P1 #7 cache cleanup for both the deferred path (remove only the caller’s start-cache file while preserving peers) and the failed force-escalation path (wipe all per-session caches after an unsuccessful kill sequence).
- Select the stop-test change because the diff only shows a test file import addition that fits the existing stop CLI test work. Defer the CLAUDE.md hook-table wording edit as a separate documentation change unrelated to the stop-test adjustment.
- Document Codex hook matcher scope and marker constant formats in CLAUDE.md.
- Add a regression test that exercises `acd stop --all` and verifies all per-session start-cache files under `.git/acd` are removed after full daemon shutdown.
- Select the single capture because it is one coherent modification to the existing start-latency integration test: it rewrites the test to measure a cold start, compare it against repeated hot starts, and tighten the budgets and assertions accordingly.
- Document the latest ACD start caching, doctor diagnostics, Codex hook logging, setup JSON validation, and Pi SID fallback changes in the changelog.
- Add a test-only hook that fires after the events watch cursor is captured and before followEvents begins, so tests can synchronize appends with the watcher becoming live.
- Stabilize the events watch stream test by waiting on the ready hook instead of sleeping before appending decisions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This branch makes the harness integration self-healing and adds first-class
diagnostics for it. Active hooks across Claude Code, Codex, OpenCode, and Pi
now run idempotent
acd start && acd wake, so the daemon recovers after amanual
acd stopwithout waiting for a brand-new harness session. A newper-session start cache short-circuits the hot path to roughly 50 ms.
acd doctorflags installed snippet drift and tails the Codex hook log.It also ships Codex hooks v2 (
~/.codex/hooks.jsoninstead of the legacy TOMLsnippet) and resolves the P1/P2 findings from the cr-expert review of the
branch (tracked under Trekoon epic
fb35fb76-1487-4bfb-9da2-44eb09a3f90a).Closes #7.
What changed
Self-healing active hooks
body
{ acd start --... && acd wake --... ; } 2>>"$LOG" || { printf '[ts] active hook failed exit=%d cmd=acd-start-wake\n' ... ; exit 1; }.acd startfailure is no longer masked by a successfulwake; the active hookexits non-zero and writes a stable line to the per-harness log fallback at
${XDG_STATE_HOME:-$HOME/.local/state}/acd/<harness>-hook.log.paths on every harness.
Start short-circuit fast path
runStartnow reads a per-session cache file at<gitDir>/acd/start-cache-<sha256(session_id)[:16]>.json(schema v2).Healthy hot path returns in ~50 ms vs. the 1 s cold-path budget.
DaemonStartTS+DaemonArgvHash).Short-circuit requires a live PID and a fingerprint match before reusing the
cache, so PID reuse on long-running boxes cannot serve a stale fast path.
daemon_clients.last_seen_tsso the refcountsweeper does not evict a session that lives entirely on the fast path.
os.CreateTempso concurrent active hooks no longercollide on a shared
start-cache.json.tmp.acd stopinvalidates the cache on theDeferredpath and after a failedforce-stop escalation.
test/integration/start_latency_budget_test.goenforces the speedup claim:10 hot calls each under 200 ms and under
cold/1.5.acd doctorupgrades~/.claude/settings.json,~/.codex/hooks.json,~/.config/opencode/hooks.yaml, or~/.pi/hook/hooks.yamllack bothacd startandacd wakeor the log fallback. Per-harness remediationhints point at the merge-first install path.
error count plus the first failing line.
new
config_read_errorJSON field.actionsitems astop-level hook entries (drift never fired for any real OpenCode/Pi config).
parseLogTimestampso the bracketed wrapper-printf shape parsescorrectly. The five-minute recency window was effectively disabled before.
looksLikeHookErrorso JSONL info lines containing the substringfailed(for examplefailed_blocking_pending=0) are no longer flagged.Codex hooks v2 (breaking setup change)
acd setup codexnow writes~/.codex/hooks.jsoninstead of the legacyTOML snippet. Codex hooks read
cwdfrom stdin viaacd hook-stdin-extract session_id cwd?;CODEX_PROJECT_DIRis gone.PostToolUse, Stop) capture helper exit explicitly. A failing
acd hook-stdin-extractnow writes[ts] active hook failed exit=N cmd=acd-hook-stdin-extractto the hook loginstead of being swallowed by a trailing
|| exit 0. This also fixes alatent bug where
$?was consumed by$(date +%FT%T%z)before beingprinted, so the logged exit code was always 0.
acd setup --rawvalidates JSON for.jsonsnippet targets and refuses toemit malformed templates, so users cannot redirect garbage into
~/.codex/hooks.json.templates/codex/uninstall.mdnow documents surgical removal of the fiveacd hook entries plus the
_acd_managedflag instead of deleting thewhole file (which destroyed merged custom hooks).
pi) carry overwrite warnings that mirror the existing codex template note.
Marker tightening and Pi SID
yamlAcdManagedMarkersnow requires the#comment prefix. A hand-editedconfig with bare
acd-managed: trueis no longer detected as installed,and the marker no longer collides with the TOML form as a substring.
pi-$$-$(date +%s)) instead ofa shared
unknown. Multiple Pi sessions withoutPI_SESSION_IDno longercollapse onto the same client and tear down the daemon when one session
ends.
Files
internal/cli/doctor.go(+507/-18),internal/cli/doctor_test.go(+668/-8)internal/cli/start_shortcircuit.go(+344/-0),internal/cli/start_shortcircuit_test.go(+798/-0),internal/cli/start.go(+143/-1),internal/cli/stop.go,internal/cli/stop_test.gointernal/cli/setup.go,internal/cli/setup_test.go(+664/-6),templates/codex/hooks.json,templates/{claude-code,codex,opencode,pi}/...,internal/adapter/{detect,known}.go,internal/adapter/detect_test.gointernal/cli/hookhelper.go,internal/cli/hookhelper_test.go(+269/-0)test/integration/adapter_e2e_test.go(+420/-71),test/integration/regressions_test.go,test/integration/start_latency_budget_test.go(+172/-0)README.md,CHANGELOG.md,CLAUDE.md,docs/multi-tool.md, per-harness READMEsTotal: 32 files changed, 5457 insertions(+), 388 deletions(-).
Migration
Existing users must re-run
acd setup <harness>after pulling and merge theupdated hook block into their installed config so the self-heal pattern takes
effect. Codex users additionally need
/hooksre-approval after everyhooks.jsoncontent change. Per-harness READMEs and CHANGELOG v2026-05-08document the migration in detail.
acd doctorflags drift on stale installs.Test plan
Pre-PR gate (CLAUDE.md), all green via
cleanenv:cleanenv make lintcleanenv make testcleanenv go test ./test/integration/... -tags=integration -race -count=1 -timeout 5mcleanenv go test ./internal/daemon/... ./internal/git/... ./internal/state/... ./internal/pause/... ./internal/cli/... -race -count=3 -timeout 10mPer-task verifies (cr-expert P1/P2 fixes) all pass at
-race -count=3. Newregression coverage: YAML drift table-driven snippets, parseLogTimestamp with
TZ=Asia/Tokyo, EACCES/EIO disambig, fingerprint mismatch, multi-sessionhot-path, concurrent writeStartCache, deferred/failed cache invalidation,
Codex helper-missing E2E, Pi SID uniqueness, mkdir gate parity for OpenCode
and Pi.
Known follow-ups (out of scope)
$?-after-$(date)bug pattern still exists in claude-code,opencode, and pi hook bodies; only Codex was in scope for this branch.
templates/pi/README.mdSID prose update was deferred; the YAML changestands alone.
force-failedescalation wipes all caches (intentional butworth flagging).
Notes for reviewers
The diff is large because it lands the cr-expert P1/P2 fixes alongside the
feature work. Review pointers:
internal/cli/start_shortcircuit.gois the new fast path. Pair withstart_shortcircuit_test.go(matrix coverage for fingerprint, multisession,TouchClient, deferred/failed cache invalidation).
internal/cli/doctor.gocarries the YAML drift parser,parseLogTimestamp,looksLikeHookError, andconfig_read_errorwork.templates/codex/hooks.jsonis the new canonical Codex install; all fivehook bodies share the new helper-failure handling.
internal/adapter/known.gofor the# acd-managed: truemarker tightening.