Skip to content

feat: harness hook resilience (self-heal hooks, start short-circuit, doctor drift detection)#22

Merged
KristjanPikhof merged 161 commits into
mainfrom
feat/harness-hook-resilience
May 8, 2026
Merged

feat: harness hook resilience (self-heal hooks, start short-circuit, doctor drift detection)#22
KristjanPikhof merged 161 commits into
mainfrom
feat/harness-hook-resilience

Conversation

@KristjanPikhof
Copy link
Copy Markdown
Owner

Summary

This branch makes the harness integration self-healing and adds first-class
diagnostics for it. Active hooks across Claude Code, Codex, OpenCode, and Pi
now run idempotent acd start && acd wake, so the daemon recovers after a
manual acd stop without waiting for a brand-new harness session. A new
per-session start cache short-circuits the hot path to roughly 50 ms.
acd doctor flags installed snippet drift and tails the Codex hook log.

It also ships Codex hooks v2 (~/.codex/hooks.json instead of the legacy TOML
snippet) and resolves the P1/P2 findings from the cr-expert review of the
branch (tracked under Trekoon epic fb35fb76-1487-4bfb-9da2-44eb09a3f90a).

Closes #7.

What changed

Self-healing active hooks

  • All four harnesses (Claude Code, Codex, OpenCode, Pi) now use the canonical
    body { acd start --... && acd wake --... ; } 2>>"$LOG" || { printf '[ts] active hook failed exit=%d cmd=acd-start-wake\n' ... ; exit 1; }. acd start failure is no longer masked by a successful wake; the active hook
    exits non-zero and writes a stable line to the per-harness log fallback at
    ${XDG_STATE_HOME:-$HOME/.local/state}/acd/<harness>-hook.log.
  • Adapter E2E coverage added for stop-all self-heal and corrupt-DB negative
    paths on every harness.

Start short-circuit fast path

  • runStart now reads a per-session cache file at
    <gitDir>/acd/start-cache-<sha256(session_id)[:16]>.json (schema v2).
    Healthy hot path returns in ~50 ms vs. the 1 s cold-path budget.
  • The cache stores a daemon fingerprint (DaemonStartTS + DaemonArgvHash).
    Short-circuit requires a live PID and a fingerprint match before reusing the
    cache, so PID reuse on long-running boxes cannot serve a stale fast path.
  • Each hot-path call refreshes daemon_clients.last_seen_ts so the refcount
    sweeper does not evict a session that lives entirely on the fast path.
  • Cache writes use os.CreateTemp so concurrent active hooks no longer
    collide on a shared start-cache.json.tmp.
  • acd stop invalidates the cache on the Deferred path and after a failed
    force-stop escalation.
  • test/integration/start_latency_budget_test.go enforces the speedup claim:
    10 hot calls each under 200 ms and under cold/1.5.

acd doctor upgrades

  • Detects installed snippet drift: warns when active-hook bodies in
    ~/.claude/settings.json, ~/.codex/hooks.json,
    ~/.config/opencode/hooks.yaml, or ~/.pi/hook/hooks.yaml lack both
    acd start and acd wake or the log fallback. Per-harness remediation
    hints point at the merge-first install path.
  • Tails the Codex hook log (last 50 lines or 5 minutes) and surfaces the
    error count plus the first failing line.
  • Distinguishes "config unreadable" from "marker missing" on EACCES/EIO via a
    new config_read_error JSON field.
  • Fixes a YAML drift parser bug that misread nested actions items as
    top-level hook entries (drift never fired for any real OpenCode/Pi config).
  • Fixes parseLogTimestamp so the bracketed wrapper-printf shape parses
    correctly. The five-minute recency window was effectively disabled before.
  • Tightens looksLikeHookError so JSONL info lines containing the substring
    failed (for example failed_blocking_pending=0) are no longer flagged.

Codex hooks v2 (breaking setup change)

  • acd setup codex now writes ~/.codex/hooks.json instead of the legacy
    TOML snippet. Codex hooks read cwd from stdin via
    acd hook-stdin-extract session_id cwd?; CODEX_PROJECT_DIR is gone.
  • All five Codex hook bodies (SessionStart, UserPromptSubmit, PreToolUse,
    PostToolUse, Stop) capture helper exit explicitly. A failing
    acd hook-stdin-extract now writes
    [ts] active hook failed exit=N cmd=acd-hook-stdin-extract to the hook log
    instead of being swallowed by a trailing || exit 0. This also fixes a
    latent bug where $? was consumed by $(date +%FT%T%z) before being
    printed, so the logged exit code was always 0.
  • acd setup --raw validates JSON for .json snippet targets and refuses to
    emit malformed templates, so users cannot redirect garbage into
    ~/.codex/hooks.json.
  • templates/codex/uninstall.md now documents surgical removal of the five
    acd hook entries plus the _acd_managed flag instead of deleting the
    whole file (which destroyed merged custom hooks).
  • README, doctor remediation, and per-harness READMEs (claude-code, opencode,
    pi) carry overwrite warnings that mirror the existing codex template note.

Marker tightening and Pi SID

  • yamlAcdManagedMarkers now requires the # comment prefix. A hand-edited
    config with bare acd-managed: true is no longer detected as installed,
    and the marker no longer collides with the TOML form as a substring.
  • Pi SID fallback is now per-process unique (pi-$$-$(date +%s)) instead of
    a shared unknown. Multiple Pi sessions without PI_SESSION_ID no longer
    collapse onto the same client and tear down the daemon when one session
    ends.

Files

Area Files
Doctor internal/cli/doctor.go (+507/-18), internal/cli/doctor_test.go (+668/-8)
Start short-circuit internal/cli/start_shortcircuit.go (+344/-0), internal/cli/start_shortcircuit_test.go (+798/-0), internal/cli/start.go (+143/-1), internal/cli/stop.go, internal/cli/stop_test.go
Setup / templates internal/cli/setup.go, internal/cli/setup_test.go (+664/-6), templates/codex/hooks.json, templates/{claude-code,codex,opencode,pi}/..., internal/adapter/{detect,known}.go, internal/adapter/detect_test.go
Hook helper internal/cli/hookhelper.go, internal/cli/hookhelper_test.go (+269/-0)
Integration test/integration/adapter_e2e_test.go (+420/-71), test/integration/regressions_test.go, test/integration/start_latency_budget_test.go (+172/-0)
Docs README.md, CHANGELOG.md, CLAUDE.md, docs/multi-tool.md, per-harness READMEs

Total: 32 files changed, 5457 insertions(+), 388 deletions(-).

Migration

Existing users must re-run acd setup <harness> after pulling and merge the
updated hook block into their installed config so the self-heal pattern takes
effect. Codex users additionally need /hooks re-approval after every
hooks.json content change. Per-harness READMEs and CHANGELOG v2026-05-08
document the migration in detail. acd doctor flags drift on stale installs.

Test plan

Pre-PR gate (CLAUDE.md), all green via cleanenv:

  • cleanenv make lint
  • cleanenv make test
  • cleanenv go test ./test/integration/... -tags=integration -race -count=1 -timeout 5m
  • cleanenv go test ./internal/daemon/... ./internal/git/... ./internal/state/... ./internal/pause/... ./internal/cli/... -race -count=3 -timeout 10m

Per-task verifies (cr-expert P1/P2 fixes) all pass at -race -count=3. New
regression coverage: YAML drift table-driven snippets, parseLogTimestamp with
TZ=Asia/Tokyo, EACCES/EIO disambig, fingerprint mismatch, multi-session
hot-path, concurrent writeStartCache, deferred/failed cache invalidation,
Codex helper-missing E2E, Pi SID uniqueness, mkdir gate parity for OpenCode
and Pi.

Known follow-ups (out of scope)

  • Same $?-after-$(date) bug pattern still exists in claude-code,
    opencode, and pi hook bodies; only Codex was in scope for this branch.
  • templates/pi/README.md SID prose update was deferred; the YAML change
    stands alone.
  • start-cache force-failed escalation wipes all caches (intentional but
    worth flagging).

Notes for reviewers

The diff is large because it lands the cr-expert P1/P2 fixes alongside the
feature work. Review pointers:

  • internal/cli/start_shortcircuit.go is the new fast path. Pair with
    start_shortcircuit_test.go (matrix coverage for fingerprint, multisession,
    TouchClient, deferred/failed cache invalidation).
  • internal/cli/doctor.go carries the YAML drift parser, parseLogTimestamp,
    looksLikeHookError, and config_read_error work.
  • templates/codex/hooks.json is the new canonical Codex install; all five
    hook bodies share the new helper-failure handling.
  • internal/adapter/known.go for the # acd-managed: true marker tightening.

- Extend the hidden hook stdin extraction helper to accept and validate
  multiple top-level fields, printing each scalar value on its own line.
- Add coverage for hook stdin extraction with multiple fields, including
  order, missing-field, non-scalar, and empty-args cases.
- Promote known adapter metadata from single config paths to per-path
  specs with marker variants, and expand Codex discovery entries to
  include hooks.json plus legacy TOML locations.
- Selected the only offered capture because it refactors
  `internal/adapter/detect.go` to centralize harness path metadata into
  `pathSpec` entries, update `ConfigPath`/`allPaths` to use that
  structure, and make marker detection path-specific.
- Selected because it extends `internal/adapter/known.go` with
  install-detection marker checks that fit the recent harness-path and
  marker-detection refactor, specifically supporting `acd doctor`
  shadowed-install warnings.
- Add a helper to read config files and detect known marker strings for
  adapter identification.
- Small portability/standard-library cleanup in Codex install detection:
  use os.UserHomeDir and filepath.Join instead of local wrappers.
- Add coverage for Codex installation detection paths and marker
  handling.
- Add Codex hook configuration to start, wake, and touch sessions around
  tool use and prompt events.
- Update the Codex setup harness snippet reference to point at the hooks
  JSON file and its comment style.
- Delete the Codex config snippet file from templates/codex, matching
  the move away from the old hook-based snippet.
- Align Codex setup tests with the new hooks.json output and hook
  schema, replacing the obsolete config snippet expectations.
- Adjust the Codex template README to document the new hooks.json-based
  setup, legacy TOML cleanup, and current hook wiring/fallback behavior.
- Add a helper to check whether the primary harness path contains its
  marker, so doctor can avoid false positives across formats.
- Adjust doctor harness reporting to use primary-path marker matching
  and refine Codex legacy config guidance.
- Expand doctor tests around Codex hooks.json detection: treat managed
  hooks.json as installed, add shadow-warning coverage when legacy
  config.toml coexists, and update human/json expectations.
- Add legacy Codex hooks.json detection in doctor output and align the
  tests with the new behavior, including the updated install report
  expectations and warning text.
- Update doctor’s Codex harness detection to treat hooks.json as the
  primary install marker and adjust tests for the new hooks.json
  behavior plus legacy config.toml shadow warning.
- Switch the adapter E2E codex snippet parser from TOML-style handling
  to hooks.json JSON unmarshalling, and remove the obsolete TOML helper.
- Aligns the Codex E2E test with the new hooks.json format and stdin cwd
  handling for Codex v2, while expanding the same test coverage for
  SessionStart, UserPromptSubmit, PreToolUse, and Stop behavior.
- Add the new legacy TOML auto-detection subtest to the Codex E2E suite
  in `test/integration/adapter_e2e_test.go` alongside the existing Codex
  E2E checks.
- Adds an integration E2E case covering Codex auto-detection when only
  the legacy `~/.codex/config.toml` contains the `acd` marker and
  `hooks.json` is absent, matching the recent Codex hooks discovery
  changes.
- Docs update for Codex harness guidance: switch to hooks.json v2
  discovery, note legacy TOML removal, describe event wiring/markers,
  and update helper behavior notes.
- README.md updates the Codex setup example to direct users to save
  output to `~/.codex/hooks.json` instead of the old TOML config path,
  matching the newer hooks.json v2 setup.
- Document the Codex ACD hook’s new hooks.json v2 wiring, including
  the added lifecycle events and the Stop->acd touch behavior.
- Record the new Codex hook installation format, event wiring, migration
  path, and hook-stdin-extract behavior in the changelog.
- The change is a focused update to the Codex uninstall guide: it adds
  the current hooks.json uninstall path, keeps the legacy TOML removal
  note, and clarifies daemon/log cleanup steps.
- This commit appears to restore Codex setup around the TOML config
  snippet: it adds the new `templates/codex/config.snippet.toml`,
  updates README and CLAUDE guidance to point at the TOML setup,
  simplifies the Codex hook docs, and removes the JSON/legacy-shadow
  detection and tests that supported the previous `hooks.json` migration
  path. The CHANGELOG edit matches the same Codex hooks reorientation.
- Implement and document the Codex hooks v2 migration end-to-end: switch
  setup output to ~/.codex/hooks.json, update guide/docs/changelog, and
  add path-specific adapter detection plus tests to avoid JSON/TOML
  marker cross-matches and detect shadowed legacy TOML installs.
- The captures all point to the same Codex rollback/restoration: docs
  and changelog remove hooks.json v2 migration language, setup switches
  Codex back to a TOML snippet, adapter detection/doctor logic is
  adjusted around the legacy TOML path and warning, and the hook helper
  reverts to single-field extraction. These changes belong together as
  one coherent Codex restore commit.
- Update the codex helper failure logging regression test to account for
  Stop having only one failure branch that must capture rc, while the
  other events still require two.
- The change switches start-cache writes to a unique temporary file per
  write, then preserves atomic rename and 0600 permissions, to prevent
  concurrent `acd start` hooks from corrupting the shared JSON cache.
- Adjust stop-command cleanup to remove in-flight start-cache temp files
  created by os.CreateTemp, alongside final .json cache files.
- Select the start-cache concurrency regression test because it directly
  follows the temp-file cleanup work in the same start cache area. Defer
  the Codex hook E2E change as unrelated to start-cache file handling
  and a separate concern.
- Forced-aging window: this is the overdue capture, so it should be
  committed alone.
- Prepare the start short-circuit tests for concurrent access by adding
  the missing sync import used by the new regression coverage.
- Add regression coverage that all opencode and Pi hook snippets gate
  `mkdir -p` behind a log-dir existence check, matching the codex
  template invariant.
- Use a generated fallback Pi session ID so hook logs remain
  attributable when PI_SESSION_ID is unset.
- Add a regression test that verifies TouchClient only updates
  last_seen_ts, preserves existing client metadata, returns false for
  unknown sessions, and errors on empty session IDs.
- Extend Pi setup regression coverage so hook snippets require the
  per-process SID fallback (`pi-$$-$(date +%s)`) instead of the shared
  `unknown` placeholder, and add a shell-level check that two processes
  produce distinct fallback IDs.
- Add a non-fatal hot-path update to refresh daemon client last-seen
  timestamps on cache hits so the sweeper does not evict active sessions
  during sustained hot-path activity.
- Add a helper that updates daemon client last_seen_ts on the start hot
  path via a brief per-repo SQLite touch, with test stubbing support.
- The change adds commentary explaining the hot-path refcount refresh
  that immediately follows a successful start short-circuit decision,
  matching the surrounding timestamp-touching logic.
- Add a regression test that verifies repeated `acd start` hot-path
  calls under the same session_id still advance
  `daemon_clients.last_seen_ts`, guarding the short-circuit touch
  behavior.
- The diff only adds `strconv` to
  `test/integration/regressions_test.go`, so this capture stands alone
  as a small test-file update.
- Update the repeated-active-hooks short-circuit regression test so it
  proves the hot-path helper runs and no daemon respawn occurs.
- Replace the SQLite file-removal tripwire in the multi-session start
  short-circuit test with a stubbed hot-path touch recorder keyed by
  session ID, and assert each session records the expected number of
  hot-path touches while the spawn count stays at one.
- Selected the lone stop.go capture: it expands the stop-path cache
  invalidation logic to also clear all per-session start caches when a
  forced stop does not fully stop the daemon, keeping the cache state
  conservative in unknown outcomes.
- Add stop CLI tests covering P1 #7 cache cleanup for both the deferred
  path (remove only the caller’s start-cache file while preserving
  peers) and the failed force-escalation path (wipe all per-session
  caches after an unsuccessful kill sequence).
- Select the stop-test change because the diff only shows a test file
  import addition that fits the existing stop CLI test work. Defer the
  CLAUDE.md hook-table wording edit as a separate documentation change
  unrelated to the stop-test adjustment.
- Document Codex hook matcher scope and marker constant formats in
  CLAUDE.md.
- Add a regression test that exercises `acd stop --all` and verifies all
  per-session start-cache files under `.git/acd` are removed after full
  daemon shutdown.
- Select the single capture because it is one coherent modification to
  the existing start-latency integration test: it rewrites the test to
  measure a cold start, compare it against repeated hot starts, and
  tighten the budgets and assertions accordingly.
- Document the latest ACD start caching, doctor diagnostics, Codex hook
  logging, setup JSON validation, and Pi SID fallback changes in the
  changelog.
@KristjanPikhof KristjanPikhof self-assigned this May 8, 2026
- Add a test-only hook that fires after the events watch cursor is
  captured and before followEvents begins, so tests can synchronize
  appends with the watcher becoming live.
- Stabilize the events watch stream test by waiting on the ready hook
  instead of sleeping before appending decisions.
@KristjanPikhof KristjanPikhof marked this pull request as ready for review May 8, 2026 19:03
@KristjanPikhof KristjanPikhof merged commit 953be15 into main May 8, 2026
7 of 8 checks passed
@KristjanPikhof KristjanPikhof deleted the feat/harness-hook-resilience branch May 8, 2026 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant