
ci: gate the full test suite (Phase 1+2 — measure + make deterministic)#465

Merged
Data-Wise merged 24 commits into dev from feature/ci-full-suite-gate
Jun 14, 2026


@Data-Wise Data-Wise commented Jun 14, 2026

Owner

What

Makes CI run the full 65-suite ./tests/run-all.sh on every PR and makes it deterministically green on a hosted Ubuntu runner — without creating a perpetually-red gate. Spec: docs/specs/SPEC-ci-full-suite-gate-2026-06-13.md.

Previously the only required check ran smoke tests only (~3 of 65 suites); regressions could land green.

Result

run-all.sh on the runner: 64 passed, 0 failed, 0 timeout, 1 skipped (exit 0). Determinism proven: green locally with and without atlas, and on the runner (no atlas/ait/himalaya/R/quarto). Went from 14 runner failures → 0.

Real bugs the gate caught (Linux-only; would have shipped)

  • Literal high-fd redirection (exec 201> / exec 200>) in doctor-cache.zsh + analysis-cache.zsh: bash-only syntax that zsh parses as a command, so it errored on Linux's flock path. → dynamic exec {var}>.
  • stat -f %m (macOS-only) in em-cache.zsh — email cache silently never worked on Linux. → portable _em_cache_mtime (GNU-first; BSD-first corrupts mtime because stat -f partially prints before erroring).
  • date -j -f (macOS-only) in teaching-utils.zsh + teach-deploy-enhanced.zsh: teaching_week always 0 on Linux. → portable date helpers.
  • doctor --help-check false-flagged tm without aiterm (help-compliance.zsh) — now conditional on ait.

(Plus 5 other latent stat -f/fd sites pre-emptively fixed.)

Test determinism (skip when tool absent; full coverage when present)

run-all.sh now treats exit 77 = SKIP. Suites skip/degrade cleanly when their tool is absent: tm/ait, claude, himalaya (IMAP, also bounded so it can't hang), yq, R. Standalone-behavior suites pin FLOW_ATLAS_ENABLED=no (e2e-core-commands); test-atlas-contract skips warm-path unless a flow-compatible atlas is functional. CI provisions a git identity for deploy suites.
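
The skip convention can be sketched roughly as below. This is an illustration only, not the actual run-all.sh: `classify_rc`, the tally variables, and the stand-in "suites" are hypothetical, and timeout handling is left out. Only the meaning of exit 77 (SKIP) comes from this PR.

```shell
# Hypothetical sketch of a runner loop that counts exit 77 as SKIP.
# Names are illustrative; the real run-all.sh may be structured differently.
classify_rc() {
  case "$1" in
    0)  echo PASS ;;
    77) echo SKIP ;;
    *)  echo FAIL ;;
  esac
}

passed=0
skipped=0
failed=0

# Stand-in "suites": each subshell simply exits with the given code.
for rc_in in 0 77 1 0; do
  ( exit "$rc_in" )
  rc=$?
  case "$(classify_rc "$rc")" in
    PASS) passed=$((passed + 1)) ;;
    SKIP) skipped=$((skipped + 1)) ;;
    FAIL) failed=$((failed + 1)) ;;
  esac
done

echo "$passed passed, $failed failed, $skipped skipped"
```

With the four stand-in exit codes above, the tally is 2 passed, 1 failed, 1 skipped, so a skipped suite no longer turns the run red.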

CI jobs

  • zsh-tests (smoke) — unchanged required check.
  • full-suite (new) — runs run-all.sh, parallel. Non-blocking (continue-on-error) for now; promoted to required after a dev soak (Phase 3, separate).

Docs

docs/guides/TESTING.md documents the gate + the rc-77 skip convention; CHANGELOG.md / docs/CHANGELOG.md updated.

Not in this PR (Phase 3 — needs sign-off)

Flipping full-suite to a required status check on dev then main (branch protection) — outward-facing, after a soak.

🤖 Generated with Claude Code

Data-Wise pushed a commit that referenced this pull request Jun 14, 2026
v7.10.0 shipped; CI-gate Phase 1 measured (PR #465: 51/14/0, all 14 =
tool-absent skew). Phase 2 (clean-skip the 14 suites) is the WIP, resumes
in the ci-full-suite-gate worktree session.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Test User and others added 21 commits June 13, 2026 21:02
Working artifact for feature/ci-full-suite-gate. Implement in a fresh
session from this worktree. See docs/specs/SPEC-ci-full-suite-gate-2026-06-13.md.
Phased: measure (non-blocking) → determinism/skip → promote to required.
Delete this file during dev merge cleanup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 1 of ci-full-suite-gate: run the full 65-suite run-all.sh on the
hosted runner to capture ground-truth pass/skip/fail before gating.

- Separate job (parallel to smoke), continue-on-error: true => non-blocking
- Captures run-all.sh real exit via PIPESTATUS (tee masks it); re-exits so
  job color reflects reality (0=clean,1=FAIL,2=TIMEOUT) but never blocks PR
- Emits full output + exit code to $GITHUB_STEP_SUMMARY for measurement
- NOT added to required checks (that's Phase 3)

Ref: docs/specs/SPEC-ci-full-suite-gate-2026-06-13.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
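
The exit-code capture described in this commit might look something like the sketch below. It assumes the step runs under bash (`PIPESTATUS` is a bash array), and `run_suite` is a hypothetical stand-in for `./tests/run-all.sh`; the real workflow step is not shown in this PR page.

```shell
# Hypothetical sketch; assumes bash. run_suite stands in for ./tests/run-all.sh.
run_suite() {
  echo "suite output"
  return 2   # pretend the run ended in TIMEOUT
}

log=$(mktemp)

# tee exits 0, so $? after the pipeline would hide the real result.
run_suite 2>&1 | tee "$log" >/dev/null
rc=${PIPESTATUS[0]}   # exit status of run_suite, not of tee

# Append the captured output to the job summary when running in GitHub Actions.
cat "$log" >> "${GITHUB_STEP_SUMMARY:-/dev/null}"
echo "real exit: $rc"
rm -f "$log"

# A CI step would end with: exit "$rc"
# With continue-on-error, the job shows red but does not block the PR.
```

`PIPESTATUS` has to be read immediately after the pipeline, because any later command overwrites it.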
Spec prediction inverted: e2e-core-commands + test-atlas-contract PASS on
runner; 14 OTHER suites FAIL (tool-absent: brew/atlas/himalaya/R/quarto).
3 pure-zsh suites fail unexpectedly -> triage as possible real bugs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Temporary Phase 2 triage. Runs help-compliance, help-compliance-dogfood,
automated-plugin-dogfood with output visible + locale fingerprint to find
why they fail on the runner but pass locally. To be deleted post-diagnosis.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The binary-precedence guard's default keep-list was (r mcp cc), omitting
tm. On systems with a `tm` binary (some Linux distros, GitHub ubuntu
runners) the documented tm terminal-manager dispatcher was SILENTLY
unfunctioned at load (the skip notice only prints under FLOW_DEBUG) —
invisible on macOS dev boxes with no tm binary.

Surfaced by Phase 1 of the CI full-suite gate: 3 suites (help-compliance,
help-compliance-dogfood, automated-plugin-dogfood) failed on the runner
but passed locally; all traced to `tm` not being a function.

- flow.plugin.zsh: default FLOW_INTENTIONAL_SHADOWS now (r mcp cc tm)
- test-dispatcher-binary-precedence.zsh: regression test simulating a tm
  binary collision against the real tm-dispatcher.zsh; extend the
  "intentional shadow survives" loop to cover tm
- CLAUDE.md + CHANGELOG x2: document the new default (historical CHANGELOG
  entries left as-is per project convention)
- remove the temporary diagnose-pure-zsh CI job (triage complete)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnostic on the runner disproved the binary-precedence hypothesis:
there is NO tm binary on ubuntu-latest (commands[tm] empty). tm resolves
to an ALIAS and _tm_help is undefined because tm-dispatcher.zsh requires
the `ait` (aiterm) CLI and early-returns to a graceful "not installed"
alias when it's absent (tm-dispatcher.zsh:44-55). The runner has no ait.

So the 3 "pure-zsh" failures are NOT pure-zsh — they are the SAME
tool-absent skew class as the other 11 (atlas/himalaya/R/quarto). The
correct fix belongs in the tests (gate tm assertions on `ait`), not in
the loader. Reverting:
- flow.plugin.zsh: default keep-list back to (r mcp cc)
- test-dispatcher-binary-precedence.zsh: drop the bogus tm collision test
- CLAUDE.md + CHANGELOG x2: remove the false "tm binary" claim; keep the
  accurate "CI measures full suite" note
- remove the temporary diag-tm CI job

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t class

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Phase 2, batch 1 of the tool-absent-skew fixes. The `tm` dispatcher only
loads fully when the `ait` (aiterm) CLI is present; on hosted runners it
degrades to an alias, so suites asserting tm-is-a-full-dispatcher failed.

Foundation:
- run-all.sh: exit code 77 now counted as SKIP (not FAIL); shown in the
  results line + an explanatory note. Whole-suite tool guards will use it.

tm/aiterm determinism (mixed suites — keep full coverage when ait present,
skip only the tm cases when absent):
- automated-plugin-dogfood.zsh: include tm in the dispatcher / help-fn
  checks only when `ait` exists.
- lib/help-compliance.zsh: _FLOW_HELP_DISPATCHERS includes tm only when
  `ait` exists — also stops `flow doctor --help-check` from false-flagging
  tm as non-compliant on machines without aiterm (real fix, not just tests).
  Fixes test-help-compliance.zsh (no edit needed) via the shared list.
- test-help-compliance-dogfood.zsh: skip tm in all subject loops when ait
  absent; expected dispatcher count is dynamic (14 with ait, 13 without).

Verified locally both ways (tool present AND hidden via PATH sandbox):
  with ait:    16/16, 379/379, 60/60
  without ait: 15/15, 351/351, 58/58  (all exit 0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
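
The mixed-suite gating above (keep full coverage when `ait` is present, drop only the tm cases when it is absent) can be sketched as follows. The list here is a shortened, illustrative one, and `have_tool` is a hypothetical helper; the real `_FLOW_HELP_DISPATCHERS` holds 13 or 14 entries.

```shell
# Hypothetical sketch: add tm to the checked dispatchers only when ait exists.
# The array syntax below works in both bash and zsh.
have_tool() { command -v "$1" >/dev/null 2>&1; }

dispatchers=(r mcp cc)          # illustrative subset, not the real list
if have_tool ait; then
  dispatchers+=(tm)             # tm is a full dispatcher only with aiterm
fi

# The expected count is derived from the list instead of being hard-coded.
expected_count=${#dispatchers[@]}

for d in "${dispatchers[@]}"; do
  echo "would check help compliance for: $d"
done
echo "expected dispatcher count: $expected_count"
```

Deriving the expected count from the same list the loader uses is what lets one test file pass both with and without the tool.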
run-all.sh in CI exposed a real Linux-only runtime bug: lib/doctor-cache.zsh
and lib/analysis-cache.zsh used bash-only high-fd redirection (`exec 201>`,
`exec 200>`). In zsh a literal fd >= 10 is parsed as a COMMAND, so
`exec 201>file` errors with "command not found: 201". The flock branch only
runs when `flock` exists — true on Linux, false on macOS (which falls back to
mkdir locking) — so this only ever broke on hosted CI runners, never locally.

Fix: use zsh's dynamic `exec {var}>file` allocation and reference $var on
acquire, lock, and release. Verified the syntax in zsh; the old literal form
is proven to fail. This is exactly the regression-class the full-suite gate
was built to catch (smoke-only CI never ran test-doctor).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
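
A minimal sketch of the dynamic-fd pattern, assuming a throwaway lock file. The variable names are hypothetical and the real cache code is not reproduced here. The `{var}>` form is accepted by zsh and by bash 4.1 or later.

```shell
# Hypothetical sketch of dynamic fd allocation plus flock.
lockfile=$(mktemp)

# The shell picks a free descriptor (10 or higher) and stores it in lock_fd.
exec {lock_fd}>"$lockfile"

if command -v flock >/dev/null 2>&1; then
  # Non-blocking exclusive lock on the allocated descriptor.
  if flock -n "$lock_fd"; then
    lock_state=acquired
    flock -u "$lock_fd"       # release the lock
  else
    lock_state=busy
  fi
else
  # No flock(1), as on macOS; the commit above notes a mkdir fallback there.
  lock_state=unavailable
fi

# Close the descriptor by name rather than by a literal number.
exec {lock_fd}>&-
rm -f "$lockfile"
echo "fd=$lock_fd state=$lock_state"
```

Because the descriptor number lives in a variable, acquire, lock, and release all refer to the same fd without any literal two- or three-digit number appearing in a redirection.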
…ach-doctor)

Phase 2, batch 2. Make tool-dependent cases skip cleanly when the tool is
absent (CI runner), preserving full coverage when present:
- test-cc-dispatcher: gate 2 cases that exec `claude` (HERE path).
- e2e-em-dispatcher: bound the IMAP/himalaya check with `timeout` and exit 77
  when unreachable (real cause was a HANG, not a missing binary).
- dogfood-teach-doctor-v2: gate the renv.lock case on `R`.
- teach-deploy-v2 (unit/integration/dogfood/e2e): gate on `yq` (the deploy
  history helpers parse YAML via yq), exit 77 when absent.
Verified both ways (tool present AND hidden via PATH sandbox): identical pass
counts with the tool, clean skip/exit-0 without.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
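
A whole-suite guard along the lines described above could be sketched like this. `skip_unless_tool` and `probe_or_skip` are hypothetical names, and the bounded probe assumes GNU coreutils `timeout` is installed; the real suites may gate differently.

```shell
# Hypothetical whole-suite guards that end the suite with exit 77 (SKIP).
SKIP_RC=77

# Skip when a required tool is not on PATH.
skip_unless_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "SKIP: $1 not installed"
    return "$SKIP_RC"
  fi
}

# Skip when a probe command fails or does not finish within N seconds.
probe_or_skip() {
  timeout "$1" "${@:2}" >/dev/null 2>&1 || return "$SKIP_RC"
}

# A real suite would start with: skip_unless_tool yq || exit $?
# Subshells are used here so the example can show the resulting exit codes.
( skip_unless_tool definitely-not-a-real-tool-xyz || exit $?; echo "tests run" )
missing_rc=$?

if command -v timeout >/dev/null 2>&1; then
  ( probe_or_skip 1 sleep 5 || exit $?; echo "tests run" )
  hang_rc=$?
fi

echo "missing tool -> $missing_rc"
```

The bounded probe matters for the IMAP case: a missing binary fails fast, but an unreachable server can hang, so the check needs a time limit before it can decide to skip.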
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These two suites pass on CI (atlas absent) but FAILED locally (atlas
installed) — the inverse skew the spec flagged. Acceptance criterion: the
suite must be green locally whether or not atlas is installed.

- e2e-core-commands: export FLOW_ATLAS_ENABLED=no before sourcing so `status`
  and `catch` exercise flow-cli's standalone fallback (with atlas installed
  they delegate to the binary, flipping [1]/[7]).
- test-atlas-contract: add skip_without_warm_atlas() — the `atlas` on PATH may
  be a different/older binary whose stats/parked/trail/-v return 127; route the
  4 warm-path/exit-code contract tests through it so they skip unless a
  flow-compatible atlas actually implements them.

Verified both ways: e2e-core-commands 22/22 with atlas and without;
test-atlas-contract 14/18 (4 skip) with atlas, 11/18 without — all exit 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…idge tm gate

Two more tool-absent-skew / cross-platform fixes surfaced by the CI gate:

- lib/em-cache.zsh: replaced macOS-only `stat -f %m`/`stat -f '%m %N'` with a
  portable `_em_cache_mtime` helper (BSD `stat -f` then GNU `stat -c %Y`). On
  Linux the bare `stat -f` failed → mtime read as 0 → every entry looked
  expired → cache get/prune/cap returned empty. The email cache never worked
  on Linux. (Real product bug, caught by test-em/dogfood-em cache round-trip.)
- tests/dogfood-atlas-bridge.zsh: the "at() coexists with all 14 dispatchers"
  case failed because `tm` isn't a function without aiterm (ait). Gate tm on
  `command -v ait` (same pattern as the other dispatcher-enumeration suites) —
  this was a tm/ait issue, not atlas.

Verified locally: dogfood-atlas-bridge 29/29 with AND without ait; em suites
108/108 + 159/159 (macOS/BSD path); mtime helper returns a real epoch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
teach-deploy suites run 'git commit' (direct deploy, history, back-merge)
which fails with 'empty ident' on a fresh runner (git user.name/email unset).
Configure a CI git identity — environment provisioning, not a test change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
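
For illustration, provisioning an identity so that `git commit` succeeds might look like the sketch below. The name and email are placeholders, and the identity is set per repository here to keep the example self-contained; the workflow in this PR may set it differently (for example globally).

```shell
# Hypothetical sketch: give a throwaway repo a commit identity.
repo=$(mktemp -d)
git -C "$repo" init -q

# Placeholder identity; without one, a fresh runner fails with "empty ident".
git -C "$repo" config user.name  "CI Runner"
git -C "$repo" config user.email "ci@example.invalid"

git -C "$repo" commit -q --allow-empty -m "identity check"
commit_rc=$?

author=$(git -C "$repo" log -1 --format='%an <%ae>')
echo "$author"
rm -rf "$repo"
```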
teach-deploy [42] "teaching_week from start_date" failed on Linux: the date
math used `date -j -f` (BSD/macOS only), which fails on GNU date → empty
epoch → week calc returns 0. Added _teach_date_to_epoch/_teach_epoch_to_date
helpers (BSD `date -j -f` then GNU `date -d`) and routed all 6 date conversions
through them. macOS behavior unchanged (BSD form wins first); Linux now works.
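
A portable date-to-epoch helper in the spirit of the one described above could be sketched as follows. `date_to_epoch` and the week formula are assumptions made for illustration; the PR's `_teach_date_to_epoch` and the real teaching_week calculation may differ.

```shell
# Hypothetical sketch: BSD `date -j -f` first, then GNU `date -d`.
# TZ=UTC and an explicit midnight keep the arithmetic free of DST shifts.
date_to_epoch() {
  TZ=UTC date -j -f "%Y-%m-%d %H:%M:%S" "$1 00:00:00" +%s 2>/dev/null ||
    TZ=UTC date -d "$1 00:00:00" +%s 2>/dev/null
}

start=$(date_to_epoch 2026-01-05)
today=$(date_to_epoch 2026-01-19)

# Illustrative week number: whole weeks elapsed since start, counted from 1.
week=$(( (today - start) / 604800 + 1 ))
echo "week $week"
```

On GNU date the `-j` option is rejected with nothing written to stdout, so the fallback result is the only thing captured. That is why BSD-first is safe for `date`, unlike the `stat` case.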

Also re-point the temporary diagnostic at the em-cache suites + a stat/md5
probe to find why em-cache still fails on Linux after the stat -f fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The em-cache stat fix was still broken on Linux. `stat -f %m FILE` there
treats -f as --file-system and prints a filesystem block for FILE to stdout
while erroring on the `%m` operand, so a BSD-first `stat -f %m || stat -c %Y`
captures BOTH outputs → garbage mtime → cache always looks expired → email
cache get/prune/cap return empty (test-em/dogfood-em cache round-trip FAIL).

Fix: try GNU `stat -c %Y` FIRST (works cleanly on Linux), fall back to BSD
`stat -f %m` (which fails cleanly on macOS with an illegal-option error and
empty stdout). Verified the order is correct on both platforms via a CI stat
probe and locally. Applied the same swap to the 5 other BSD-first stat sites
(teach-doctor-impl x4, teach-dispatcher x1) to prevent the same latent bug.

Local (macOS, GNU-first): test-em 108/108, dogfood-em 159/159,
dogfood-teach-doctor-v2 43/43.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
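
The GNU-first ordering can be sketched as below. `file_mtime` is a hypothetical name; the PR's `_em_cache_mtime` is not shown on this page and may differ in detail.

```shell
# Hypothetical sketch of a portable mtime helper.
# GNU `stat -c %Y` is tried first. Per the commit above, `stat -f %m` on Linux
# prints filesystem info to stdout while also failing, which would pollute
# the captured value if it ran first.
file_mtime() {
  stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null
}

f=$(mktemp)
mtime=$(file_mtime "$f")
now=$(date +%s)
age=$(( now - mtime ))
echo "mtime=$mtime age=${age}s"
rm -f "$f"
```

The ordering only works because each wrong-platform form fails without writing to stdout when it comes second: the command substitution captures everything both attempts print, not just the output of the one that succeeds.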
…c job

dogfood-teach-deploy-v2 [42] "Status update calculates teaching_week" used a
bare macOS `date -j -f` in the deploy status-update path
(teach-deploy-enhanced.zsh) — added the GNU `date -d` fallback (the deploy
path is separate from teaching-utils.zsh, fixed earlier).

Also removes the temporary diagnostic job from test.yml now that all CI-only
failures are diagnosed and fixed. Workflow is back to zsh-tests + full-suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Two CI jobs (smoke zsh-tests + full-suite run-all.sh); phasing note
  (non-blocking measurement -> required after soak)
- Skip semantics: exit 77 = clean skip when a tool/service is absent;
  whole-suite vs mixed-suite gating; tool list; FLOW_ATLAS_ENABLED=no
  determinism note
- Refresh stats: 65 suites, 64 passed / 1 skipped / 0 failed; 213 files

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Data-Wise Data-Wise force-pushed the feature/ci-full-suite-gate branch from 7b77560 to 8190c79 Compare June 14, 2026 03:02
@Data-Wise Data-Wise changed the title ci: gate full test suite (Phase 1 — measure, non-blocking) ci: gate the full test suite (Phase 1+2 — measure + make deterministic) Jun 14, 2026
@Data-Wise Data-Wise marked this pull request as ready for review June 14, 2026 03:05
Test User and others added 2 commits June 13, 2026 21:28
- analysis-cache/doctor-cache: declare the flock fd `typeset -g` explicitly
  instead of relying on zsh's implicit-global-on-assignment, so the
  cross-function acquire→release reference is unambiguous.
- em-cache LRU: null-delimited find/read + tab-separated mtime + `cut -f2-`
  so cache paths with spaces survive the sort (prior `awk '{print $2}'`
  truncated them). Defensive — cache files are hash-named.
- test.yml: refresh the full-suite job comment (Phase 1 measure → Phase 1+2;
  still non-blocking pending the Phase 3 dev soak).

No behavior change on the green path. Verified: run-all.sh 64 passed / 0
failed / 0 timeout / 1 skipped locally; plugin sources clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
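
The space-safe LRU listing from the second bullet might be sketched like this. File names, `file_mtime`, and the directory layout are hypothetical; the variable is named `entry` rather than `path` because zsh ties `path` to `PATH`.

```shell
# Hypothetical sketch: list cache files oldest-first, keeping spaces intact.
file_mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null; }

cache=$(mktemp -d)
touch -t 202601010000 "$cache/old entry.json"
touch -t 202602010000 "$cache/new entry.json"

oldest_first=$(
  find "$cache" -type f -print0 |
    while IFS= read -r -d '' entry; do
      # Tab-separated "mtime<TAB>path" so the path may contain spaces.
      printf '%s\t%s\n' "$(file_mtime "$entry")" "$entry"
    done |
    sort -n |
    cut -f2-      # drop the mtime column, keep the whole path
)

printf '%s\n' "$oldest_first"
rm -rf "$cache"
```

With `awk '{print $2}'` the first path would have been cut at the space after "old"; `cut -f2-` splits on the tab only.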
Per CLAUDE.md merge-cleanup convention — ORCHESTRATE-*.md are feature-branch
working artifacts and should not land on dev.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Data-Wise Data-Wise merged commit b57c0d8 into dev Jun 14, 2026
2 checks passed
@Data-Wise Data-Wise deleted the feature/ci-full-suite-gate branch June 14, 2026 03:31
Data-Wise pushed a commit that referenced this pull request Jun 14, 2026
Phase 1+2 merged to dev (b57c0d8) — full suite runs in CI (non-blocking),
64 pass/0 fail/1 skip via rc-77; 4 cross-platform bugs fixed. Worktree +
branch removed. Remaining: Phase 3 (promote to required after dev soak).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>