ci: gate the full test suite (Phase 1+2 — measure + make deterministic)#465
Merged
Conversation
Data-Wise
pushed a commit
that referenced
this pull request
Jun 14, 2026
v7.10.0 shipped; CI-gate Phase 1 measured (PR #465: 51/14/0, all 14 = tool-absent skew). Phase 2 (clean-skip the 14 suites) is the WIP, resumes in the ci-full-suite-gate worktree session. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Working artifact for feature/ci-full-suite-gate. Implement in a fresh session from this worktree. See docs/specs/SPEC-ci-full-suite-gate-2026-06-13.md. Phased: measure (non-blocking) → determinism/skip → promote to required. Delete this file during dev merge cleanup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 1 of ci-full-suite-gate: run the full 65-suite run-all.sh on the hosted runner to capture ground-truth pass/skip/fail before gating. - Separate job (parallel to smoke), continue-on-error: true => non-blocking - Captures run-all.sh real exit via PIPESTATUS (tee masks it); re-exits so job color reflects reality (0=clean,1=FAIL,2=TIMEOUT) but never blocks PR - Emits full output + exit code to $GITHUB_STEP_SUMMARY for measurement - NOT added to required checks (that's Phase 3) Ref: docs/specs/SPEC-ci-full-suite-gate-2026-06-13.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spec prediction inverted: e2e-core-commands + test-atlas-contract PASS on runner; 14 OTHER suites FAIL (tool-absent: brew/atlas/himalaya/R/quarto). 3 pure-zsh suites fail unexpectedly -> triage as possible real bugs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Temporary Phase 2 triage. Runs help-compliance, help-compliance-dogfood, automated-plugin-dogfood with output visible + locale fingerprint to find why they fail on the runner but pass locally. To be deleted post-diagnosis. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The binary-precedence guard's default keep-list was (r mcp cc), omitting tm. On systems with a `tm` binary (some Linux distros, GitHub ubuntu runners) the documented tm terminal-manager dispatcher was SILENTLY unfunctioned at load (the skip notice only prints under FLOW_DEBUG) — invisible on macOS dev boxes with no tm binary. Surfaced by Phase 1 of the CI full-suite gate: 3 suites (help-compliance, help-compliance-dogfood, automated-plugin-dogfood) failed on the runner but passed locally; all traced to `tm` not being a function. - flow.plugin.zsh: default FLOW_INTENTIONAL_SHADOWS now (r mcp cc tm) - test-dispatcher-binary-precedence.zsh: regression test simulating a tm binary collision against the real tm-dispatcher.zsh; extend the "intentional shadow survives" loop to cover tm - CLAUDE.md + CHANGELOG x2: document the new default (historical CHANGELOG entries left as-is per project convention) - remove the temporary diagnose-pure-zsh CI job (triage complete) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnostic on the runner disproved the binary-precedence hypothesis: there is NO tm binary on ubuntu-latest (commands[tm] empty). tm resolves to an ALIAS and _tm_help is undefined because tm-dispatcher.zsh requires the `ait` (aiterm) CLI and early-returns to a graceful "not installed" alias when it's absent (tm-dispatcher.zsh:44-55). The runner has no ait. So the 3 "pure-zsh" failures are NOT pure-zsh — they are the SAME tool-absent skew class as the other 11 (atlas/himalaya/R/quarto). The correct fix belongs in the tests (gate tm assertions on `ait`), not in the loader. Reverting: - flow.plugin.zsh: default keep-list back to (r mcp cc) - test-dispatcher-binary-precedence.zsh: drop the bogus tm collision test - CLAUDE.md + CHANGELOG x2: remove the false "tm binary" claim; keep the accurate "CI measures full suite" note - remove the temporary diag-tm CI job Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t class Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 2, batch 1 of the tool-absent-skew fixes. The `tm` dispatcher only loads fully when the `ait` (aiterm) CLI is present; on hosted runners it degrades to an alias, so suites asserting tm-is-a-full-dispatcher failed. Foundation: - run-all.sh: exit code 77 now counted as SKIP (not FAIL); shown in the results line + an explanatory note. Whole-suite tool guards will use it. tm/aiterm determinism (mixed suites — keep full coverage when ait present, skip only the tm cases when absent): - automated-plugin-dogfood.zsh: include tm in the dispatcher / help-fn checks only when `ait` exists. - lib/help-compliance.zsh: _FLOW_HELP_DISPATCHERS includes tm only when `ait` exists — also stops `flow doctor --help-check` from false-flagging tm as non-compliant on machines without aiterm (real fix, not just tests). Fixes test-help-compliance.zsh (no edit needed) via the shared list. - test-help-compliance-dogfood.zsh: skip tm in all subject loops when ait absent; expected dispatcher count is dynamic (14 with ait, 13 without). Verified locally both ways (tool present AND hidden via PATH sandbox): with ait: 16/16, 379/379, 60/60 without ait: 15/15, 351/351, 58/58 (all exit 0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run-all.sh in CI exposed a real Linux-only runtime bug: lib/doctor-cache.zsh
and lib/analysis-cache.zsh used bash-only high-fd redirection (`exec 201>`,
`exec 200>`). In zsh a literal fd >= 10 is parsed as a COMMAND, so
`exec 201>file` errors with "command not found: 201". The flock branch only
runs when `flock` exists — true on Linux, false on macOS (which falls back to
mkdir locking) — so this only ever broke on hosted CI runners, never locally.
Fix: use zsh's dynamic `exec {var}>file` allocation and reference $var on
acquire, lock, and release. Verified the syntax in zsh; the old literal form
is proven to fail. This is exactly the regression-class the full-suite gate
was built to catch (smoke-only CI never ran test-doctor).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ach-doctor) Phase 2, batch 2. Make tool-dependent cases skip cleanly when the tool is absent (CI runner), preserving full coverage when present: - test-cc-dispatcher: gate 2 cases that exec `claude` (HERE path). - e2e-em-dispatcher: bound the IMAP/himalaya check with `timeout` and exit 77 when unreachable (real cause was a HANG, not a missing binary). - dogfood-teach-doctor-v2: gate the renv.lock case on `R`. - teach-deploy-v2 (unit/integration/dogfood/e2e): gate on `yq` (the deploy history helpers parse YAML via yq), exit 77 when absent. Verified both ways (tool present AND hidden via PATH sandbox): identical pass counts with the tool, clean skip/exit-0 without. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These two suites pass on CI (atlas absent) but FAILED locally (atlas installed) — the inverse skew the spec flagged. Acceptance criterion: the suite must be green locally whether or not atlas is installed. - e2e-core-commands: export FLOW_ATLAS_ENABLED=no before sourcing so `status` and `catch` exercise flow-cli's standalone fallback (with atlas installed they delegate to the binary, flipping [1]/[7]). - test-atlas-contract: add skip_without_warm_atlas() — the `atlas` on PATH may be a different/older binary whose stats/parked/trail/-v return 127; route the 4 warm-path/exit-code contract tests through it so they skip unless a flow-compatible atlas actually implements them. Verified both ways: e2e-core-commands 22/22 with atlas and without; test-atlas-contract 14/18 (4 skip) with atlas, 11/18 without — all exit 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…idge tm gate Two more tool-absent-skew / cross-platform fixes surfaced by the CI gate: - lib/em-cache.zsh: replaced macOS-only `stat -f %m`/`stat -f '%m %N'` with a portable `_em_cache_mtime` helper (BSD `stat -f` then GNU `stat -c %Y`). On Linux the bare `stat -f` failed → mtime read as 0 → every entry looked expired → cache get/prune/cap returned empty. The email cache never worked on Linux. (Real product bug, caught by test-em/dogfood-em cache round-trip.) - tests/dogfood-atlas-bridge.zsh: the "at() coexists with all 14 dispatchers" case failed because `tm` isn't a function without aiterm (ait). Gate tm on `command -v ait` (same pattern as the other dispatcher-enumeration suites) — this was a tm/ait issue, not atlas. Verified locally: dogfood-atlas-bridge 29/29 with AND without ait; em suites 108/108 + 159/159 (macOS/BSD path); mtime helper returns a real epoch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
teach-deploy suites run 'git commit' (direct deploy, history, back-merge) which fails with 'empty ident' on a fresh runner (git user.name/email unset). Configure a CI git identity — environment provisioning, not a test change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
teach-deploy [42] "teaching_week from start_date" failed on Linux: the date math used `date -j -f` (BSD/macOS only), which fails on GNU date → empty epoch → week calc returns 0. Added _teach_date_to_epoch/_teach_epoch_to_date helpers (BSD `date -j -f` then GNU `date -d`) and routed all 6 date conversions through them. macOS behavior unchanged (BSD form wins first); Linux now works. Also re-point the temporary diagnostic at the em-cache suites + a stat/md5 probe to find why em-cache still fails on Linux after the stat -f fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The em-cache stat fix was still broken on Linux. `stat -f %m FILE` there treats -f as --file-system and prints a filesystem block for FILE to stdout while erroring on the `%m` operand, so a BSD-first `stat -f %m || stat -c %Y` captures BOTH outputs → garbage mtime → cache always looks expired → email cache get/prune/cap return empty (test-em/dogfood-em cache round-trip FAIL). Fix: try GNU `stat -c %Y` FIRST (works cleanly on Linux), fall back to BSD `stat -f %m` (which fails cleanly on macOS with an illegal-option error and empty stdout). Verified the order is correct on both platforms via a CI stat probe and locally. Applied the same swap to the 5 other BSD-first stat sites (teach-doctor-impl x4, teach-dispatcher x1) to prevent the same latent bug. Local (macOS, GNU-first): test-em 108/108, dogfood-em 159/159, dogfood-teach-doctor-v2 43/43. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…c job dogfood-teach-deploy-v2 [42] "Status update calculates teaching_week" used a bare macOS `date -j -f` in the deploy status-update path (teach-deploy-enhanced.zsh) — added the GNU `date -d` fallback (the deploy path is separate from teaching-utils.zsh, fixed earlier). Also removes the temporary diagnostic job from test.yml now that all CI-only failures are diagnosed and fixed. Workflow is back to zsh-tests + full-suite. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Two CI jobs (smoke zsh-tests + full-suite run-all.sh); phasing note (non-blocking measurement -> required after soak) - Skip semantics: exit 77 = clean skip when a tool/service is absent; whole-suite vs mixed-suite gating; tool list; FLOW_ATLAS_ENABLED=no determinism note - Refresh stats: 65 suites, 64 passed / 1 skipped / 0 failed; 213 files Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7b77560 to
8190c79
Compare
- analysis-cache/doctor-cache: declare the flock fd `typeset -g` explicitly
instead of relying on zsh's implicit-global-on-assignment, so the
cross-function acquire→release reference is unambiguous.
- em-cache LRU: null-delimited find/read + tab-separated mtime + `cut -f2-`
so cache paths with spaces survive the sort (prior `awk '{print $2}'`
truncated them). Defensive — cache files are hash-named.
- test.yml: refresh the full-suite job comment (Phase 1 measure → Phase 1+2;
still non-blocking pending the Phase 3 dev soak).
No behavior change on the green path. Verified: run-all.sh 64 passed / 0
failed / 0 timeout / 1 skipped locally; plugin sources clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per CLAUDE.md merge-cleanup convention — ORCHESTRATE-*.md are feature-branch working artifacts and should not land on dev. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Data-Wise
pushed a commit
that referenced
this pull request
Jun 14, 2026
Phase 1+2 merged to dev (b57c0d8) — full suite runs in CI (non-blocking), 64 pass/0 fail/1 skip via rc-77; 4 cross-platform bugs fixed. Worktree + branch removed. Remaining: Phase 3 (promote to required after dev soak). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Makes CI run the full 65-suite
./tests/run-all.shon every PR and makes it deterministically green on a hosted Ubuntu runner — without creating a perpetually-red gate. Spec:docs/specs/SPEC-ci-full-suite-gate-2026-06-13.md.Previously the only required check ran smoke tests only (~3 of 65 suites); regressions could land green.
Result
run-all.shon the runner: 64 passed, 0 failed, 0 timeout, 1 skipped (exit 0). Determinism proven: green locally with and withoutatlas, and on the runner (no atlas/ait/himalaya/R/quarto). Went from 14 runner failures → 0.Real bugs the gate caught (Linux-only; would have shipped)
exec 201>/exec 200>) indoctor-cache.zsh+analysis-cache.zsh— bash-only; errored on Linux'sflockpath. → dynamicexec {var}>.stat -f %m(macOS-only) inem-cache.zsh— email cache silently never worked on Linux. → portable_em_cache_mtime(GNU-first; BSD-first corrupts mtime becausestat -fpartially prints before erroring).date -j -f(macOS-only) inteaching-utils.zsh+teach-deploy-enhanced.zsh—teaching_weekalways 0 on Linux. → portable date helpers.doctor --help-checkfalse-flaggedtmwithout aiterm (help-compliance.zsh) — now conditional onait.(Plus 5 other latent
stat -f/fd sites pre-emptively fixed.)Test determinism (skip when tool absent; full coverage when present)
run-all.shnow treats exit 77 = SKIP. Suites skip/degrade cleanly when their tool is absent: tm/ait,claude,himalaya(IMAP, also bounded so it can't hang),yq,R. Standalone-behavior suites pinFLOW_ATLAS_ENABLED=no(e2e-core-commands);test-atlas-contractskips warm-path unless a flow-compatibleatlasis functional. CI provisions a git identity for deploy suites.CI jobs
zsh-tests(smoke) — unchanged required check.full-suite(new) — runsrun-all.sh, parallel. Non-blocking (continue-on-error) for now; promoted to required after a dev soak (Phase 3, separate).Docs
docs/guides/TESTING.mddocuments the gate + the rc-77 skip convention;CHANGELOG.md/docs/CHANGELOG.mdupdated.Not in this PR (Phase 3 — needs sign-off)
Flipping
full-suiteto a required status check ondevthenmain(branch protection) — outward-facing, after a soak.🤖 Generated with Claude Code