Skip to content

feat(f3): bin/pos-selftest.sh — end-to-end selftest of pos plugin (D1/D3/D4/D5/D6)#27

Merged
javiAI merged 17 commits into
mainfrom
feat/f3-selftest-end-to-end
Apr 26, 2026
Merged

feat(f3): bin/pos-selftest.sh — end-to-end selftest of pos plugin (D1/D3/D4/D5/D6)#27
javiAI merged 17 commits into
mainfrom
feat/f3-selftest-end-to-end

Conversation

@javiAI
Copy link
Copy Markdown
Owner

@javiAI javiAI commented Apr 26, 2026

Summary

  • bin/pos-selftest.sh (wrapper bash mínimo, 9 líneas) + bin/_selftest.py (orquestador stdlib Python, ~297 líneas) ejercitan los gates funcionales-críticos del plugin contra un proyecto sintético generado real-time por npx tsx generator/run.ts --profile cli-tool.yaml.
  • 5 escenarios funcionales-críticos: D1 pre-branch-gate (deny sin marker → allow tras touch <marker>), D3 pre-write-guard (deny Write hooks/foo.py sin test pair → allow tras crear test), D4 pre-pr-gate (deny gh pr create sin docs-sync → allow tras commit ROADMAP+HANDOFF), D5 post-action (git merge confirmado → advisory /pos:compound), D6 stop-policy-check (Stop con session_id rogue deny / clean allow).
  • CI: nuevo job selftest en .github/workflows/ci.yml (ubuntu × Python 3.11, single matrix). Comando único: pytest bin/tests -q.
  • Pytest harness: bin/tests/test_selftest_smoke.py (4 tests, contrato del wrapper) + bin/tests/test_selftest_scenarios.py (5 tests, fixture module-scoped).
  • Sin Claude Code runtime, sin invocaciones reales de skills/agents — cobertura estática queda en agents/tests/test_agent_frontmatter.py y .claude/skills/tests/test_skill_frontmatter.py.

Out of scope (ratificado en Fase -1): D2 session-start + D6 pre-compact (informative-only, sin contrato deny/allow), Claude Code runtime, D5b loader (cubierto indirectamente — los hooks D3/D4/D5 lo consumen y los escenarios sobre-escriben sólo la sección relevante de synthetic/policy.yaml para desacoplar la cobertura de la migración del template).

Suite global post-F3: 829 passed + 1 skipped (vs baseline F2 819 + 1 skip; +10 netos). Sin regresión D1..D6 + E1a..E3b + F1 + F2. Selftest end-to-end local ~1.2s.

Decisiones Fase -1 ratificadas: A1.b (wrapper bash + orquestador Python + smoke pytest), A2 (subset funcional-crítico), A3 (tmpdir + cli-tool + generator real, no fixture committeado), A4 (exit code + tokens, no golden diff), A5 (job en ci.yml, no workflow separado, single matrix), A6 (no Claude runtime).

Ajustes durante implementación: fnmatch no recursa en **/ (corregido generator/**/*.tsgenerator/*.ts); docs_sync_rules() requiere ambas docs_sync_required AND docs_sync_conditional (override añade [] para satisfacer); ci-cd.md H3 placement movido a tras item 3 (release.yml) por MD029/MD032 lint.

Drift abierto post-F3: templates/policy.yaml.hbs y generator/renderers/policy.ts siguen emitiendo el shape pre-D5b (cada escenario sobre-escribe la sección que necesita en synthetic/policy.yaml). Reabrir en rama propia post-F3.

Pre-PR simplify pass: extraídos check_deny / check_allow helpers en bin/_selftest.py (5 instancias del patrón, regla #7 CLAUDE.md cumplida). 344 → 297 líneas; sin regresión.

Docs-sync:

  • ROADMAP.md (F-row 3/4 ramas + § F3 progress block).
  • HANDOFF.md (§1 snapshot + §9 próxima rama → F4 + §21 estado F3).
  • MASTER_PLAN.md (§ Rama F3 expandida con decisiones realizadas + ajustes + drift).
  • docs/ARCHITECTURE.md (§ 10 nueva subsección "Selftest end-to-end (entregado en F3)").
  • .claude/rules/ci-cd.md (bullet "integración end-to-end" promovido de "Diferidos" a "Aterrizado" + H3 ### Job selftest (entregado en F3)).

Test plan

  • pytest bin/tests -q — 10 passed (4 smoke + 5 scenarios + 1 GREEN smoke).
  • pytest hooks/tests .claude/skills/tests agents/tests bin/tests -q — 829 passed + 1 skipped (no regresión).
  • npm run typecheck — limpio.
  • CI job selftest espejado del comando local (mismo pytest bin/tests -q).
  • Verificar verde el job nuevo selftest en GitHub Actions (post-push).

🤖 Generated with Claude Code

Javier and others added 15 commits April 26, 2026 16:52
Locks down the contract for `bin/pos-selftest.sh` before implementing
the wrapper or its Python orchestrator. Five failing tests verify:

- `bin/pos-selftest.sh` exists and is executable
- `bin/_selftest.py` exists (Python orchestrator)
- Wrapper uses `set -euo pipefail` and delegates to `python3 _selftest.py`
- Running the wrapper from the repo root exits 0

Scope ratified in Fase -1 (see `.claude/branch-approvals/feat_f3-selftest-end-to-end.approved`):
gates funcionales criticos D1 / D3 / D4 / D5 / D6 stop-policy-check;
informativos D2 + D6 pre-compact diferidos; sin runtime Claude Code;
solo checks estaticos baratos para skills/agents.

Following commits:
- GREEN minimo wrapper + orchestrator (smoke exit 0)
- RED/GREEN incrementales por escenario (D1, D3, D4, D5, D6)
- CI job selftest
- Docs-sync

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Satisfies the contract locked by RED in the previous commit:

- bin/pos-selftest.sh: thin bash wrapper, set -euo pipefail, execs
  python3 _selftest.py. Both files are executable (mode 0755).
- bin/_selftest.py: stdlib orchestrator with empty scenario set
  (smoke print + return 0). Scenarios D1 / D3 / D4 / D5 / D6 stop
  are added in subsequent RED/GREEN commits.

All 5 smoke contract tests pass. Hooks suite intact (587 passed + 1
skipped baseline preserved). Skills + agents + bin tests: 237 passed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Locks down the orchestrator contract: each registered scenario must emit
`[ok] D{N} {name}` on its line. Module-scoped fixture runs the wrapper
once and shares stdout across scenario tests. Fails until _selftest.py
registers + implements D1 against the synthetic project.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Orchestrator generates synthetic project per scenario via real
`npx tsx generator/run.ts --profile cli-tool.yaml --out <tmp>`,
invokes meta-repo hook against synthetic cwd, asserts deny-without-marker
+ allow-after-touch contract. Selftest runs in ~1.2s end-to-end.

Stdlib only (subprocess + tempfile + shutil + json + pathlib). Each
scenario gets its own tmpdir to avoid cross-scenario contamination.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extracts shared diag helper. D3 scenario test fails until the
orchestrator registers + implements the pre-write-guard contract
(deny Write to enforced path without test pair, allow once test pair
exists) against the synthetic project.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Synthetic project's rendered policy.yaml lacks pre_write (template
drift documented post-D5b), so the scenario writes a minimal policy
override into synthetic/policy.yaml before invoking. Then asserts
deny on `Write hooks/foo.py` without test pair, allow once
hooks/tests/test_foo.py exists.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fails until the orchestrator registers + implements the pre-pr-gate
contract: deny `gh pr create` when docs-sync (ROADMAP.md + HANDOFF.md)
is missing from the diff, allow once docs-sync is satisfied.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Synthetic project's policy.yaml is overridden with a minimal pre_pr
section (baseline + empty conditional — loader requires both keys).
Init git on main, commit baseline, branch feat/example with a code-only
commit, assert deny on `gh pr create`. Add ROADMAP/HANDOFF changes,
re-invoke, assert allow.

Factors `git_in()` + `init_baseline_repo()` helpers — D5 will reuse
them for reflog-based detection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fails until the orchestrator registers + implements the post-action
contract: a confirmed `git merge` whose diff matches a configured
trigger emits the `Consider running /pos:compound` advisory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Override synthetic policy with minimal post_merge trigger (fnmatch-style
non-recursive globs since `**` is literal in fnmatch). Init git on main,
branch feat/example, add `generator/feature.ts`, merge --no-ff, then
invoke post-action with a `git merge` payload. Asserts exit 0 and
`/pos:compound` advisory in stdout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fails until the orchestrator registers + implements the stop-policy-check
contract: with `skills_allowed` declared in policy.yaml + a rogue
invocation in `.claude/logs/skills.jsonl` for the active session, the
Stop hook denies exit 2; an unrelated session_id with no invocations
allows exit 0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Override synthetic policy.yaml with `skills_allowed: ["pos:simplify"]`,
seed `.claude/logs/skills.jsonl` with a rogue invocation under
session_id `sess-rogue`. Deny phase: Stop payload `{session_id:
"sess-rogue"}` triggers exit 2 deny. Allow phase: a different
session_id with no recorded invocations passes through with exit 0.
Locks down the session-scoping contract end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`selftest` job runs `pytest bin/tests -q` on ubuntu × Python 3.11 with
Node setup (for `npx tsx generator/run.ts`). Covers smoke wrapper +
5 functional-critical scenarios end-to-end.

Move integration bullet from "Diferidos" to "Aterrizado" per the
invariant in `.claude/rules/ci-cd.md`. Add a dedicated H3 documenting
the job's scope, what it covers, what it explicitly does not (D2
informative, Claude Code runtime), and the synthetic-policy drift.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…chestrator

Each of the 5 scenarios in bin/_selftest.py repeated the same deny phase
(exit 2 + permissionDecision deny check) and allow phase (exit 0 check)
boilerplate. Extracting two small helpers (check_deny, check_allow)
removes ~30 lines of duplication and makes scenario intent more readable
without hiding what each scenario asserts.

Pre-PR simplify pass (CLAUDE.md regla #7 satisfied: 5 instances).
829 passed + 1 skipped — no regression.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Close F3 in the standard docs-sync surfaces:
- ROADMAP.md: F-row 3/4 ramas, F3 row → ✅, full progress block under § F.
- HANDOFF.md: §1 snapshot updated (F3 closed, F4 next), §9 next-branch
  pointer flipped to F4, new §21 with F3 state (entregables, escenarios,
  out-of-scope, ajustes).
- MASTER_PLAN.md § Rama F3: stub → realized decisions (A1.b shape,
  A2 functional-critical subset, A3 tmpdir + cli-tool, A4 exit + tokens,
  A5 single-matrix CI job, A6 no Claude runtime), implementation
  adjustments documented (fnmatch literal vs recursive, docs_sync_rules
  double-key contract, ci-cd.md H3 placement), drift open post-F3.
- docs/ARCHITECTURE.md § 10: new "Selftest end-to-end (entregado en F3)"
  subsection inside Testing — tres niveles. Documents wrapper +
  orchestrator + scenarios + CI + drift.

Pre-PR gate (D4 dogfooding) satisfied: ROADMAP + HANDOFF in diff;
no conditional triggers apply (bin/** + .github/** outside the rules).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 26, 2026 17:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an end-to-end “selftest” harness that generates a synthetic project with the real generator and runs several functional-critical hooks against it, wiring this into CI and documenting the new testing layer.

Changes:

  • Introduces bin/pos-selftest.sh + bin/_selftest.py to generate a synthetic repo and run D1/D3/D4/D5/D6 gate scenarios.
  • Adds a pytest harness under bin/tests/ (smoke contract + scenario assertions).
  • Wires the selftest into GitHub Actions (selftest job) and updates docs (ARCHITECTURE/ROADMAP/MASTER_PLAN/HANDOFF + CI/CD rules).

Reviewed changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
docs/ARCHITECTURE.md Documents the new selftest testing layer and its scope.
bin/pos-selftest.sh Adds a stable bash entrypoint delegating to the Python orchestrator.
bin/_selftest.py Implements scenario orchestration: generate synthetic repo, override policy sections, invoke real hooks, assert outcomes.
bin/tests/test_selftest_smoke.py Smoke tests for wrapper presence/shape + wrapper execution contract.
bin/tests/test_selftest_scenarios.py Scenario-level contract assertions against orchestrator stdout.
ROADMAP.md Updates phase tracking and records F3 deliverables.
MASTER_PLAN.md Expands F3 scope/results documentation.
HANDOFF.md Updates snapshot/next-branch status and F3 summary.
.github/workflows/ci.yml Adds a dedicated selftest job running pytest bin/tests -q.
.claude/rules/ci-cd.md Updates CI/CD rules to reflect the new selftest job and scope.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ROADMAP.md Outdated
Entregables:

- `bin/pos-selftest.sh` (9 líneas) — wrapper bash mínimo (`#!/usr/bin/env bash` + `set -euo pipefail` + delega a `python3 bin/_selftest.py`). No contiene lógica; es entrypoint estable que tests + CI consumen sin dependencia de path absoluto.
- `bin/_selftest.py` (~344 líneas) — orquestador stdlib (sin dependencias externas). Por cada escenario: crea un tmpdir, ejecuta `npx tsx generator/run.ts --profile questionnaire/profiles/cli-tool.yaml --out <tmpdir>` para generar un proyecto sintético, sobre-escribe `synthetic/policy.yaml` con la sección mínima que el escenario necesita, monta el repo sintético como git repo (`git init -b main` + commit baseline), e invoca el hook real (`hooks/<name>.py`) vía subprocess con payload JSON. Asserta exit code + presencia de tokens en stdout/stderr/files. Imprime `[ok] D{N} {name}` o `[fail] D{N} {name}: <diag>`. Exit 0/1 según pass/fail.
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs pin an approximate line count for bin/_selftest.py ("~344 líneas"), but the file currently differs and will naturally drift over time. To avoid docs becoming stale, consider removing the line count or describing it more generically (e.g., "~300 líneas" / "~320 líneas" / "~few hundred lines").

Suggested change
- `bin/_selftest.py` (~344 líneas) — orquestador stdlib (sin dependencias externas). Por cada escenario: crea un tmpdir, ejecuta `npx tsx generator/run.ts --profile questionnaire/profiles/cli-tool.yaml --out <tmpdir>` para generar un proyecto sintético, sobre-escribe `synthetic/policy.yaml` con la sección mínima que el escenario necesita, monta el repo sintético como git repo (`git init -b main` + commit baseline), e invoca el hook real (`hooks/<name>.py`) vía subprocess con payload JSON. Asserta exit code + presencia de tokens en stdout/stderr/files. Imprime `[ok] D{N} {name}` o `[fail] D{N} {name}: <diag>`. Exit 0/1 según pass/fail.
- `bin/_selftest.py` (unos pocos cientos de líneas) — orquestador stdlib (sin dependencias externas). Por cada escenario: crea un tmpdir, ejecuta `npx tsx generator/run.ts --profile questionnaire/profiles/cli-tool.yaml --out <tmpdir>` para generar un proyecto sintético, sobre-escribe `synthetic/policy.yaml` con la sección mínima que el escenario necesita, monta el repo sintético como git repo (`git init -b main` + commit baseline), e invoca el hook real (`hooks/<name>.py`) vía subprocess con payload JSON. Asserta exit code + presencia de tokens en stdout/stderr/files. Imprime `[ok] D{N} {name}` o `[fail] D{N} {name}: <diag>`. Exit 0/1 según pass/fail.

Copilot uses AI. Check for mistakes.
Comment thread MASTER_PLAN.md Outdated
**Archivos entregados**:

- `bin/pos-selftest.sh` — wrapper bash (`#!/usr/bin/env bash` + `set -euo pipefail` + delega a `python3 bin/_selftest.py`). 9 líneas. Sin lógica.
- `bin/_selftest.py` — orquestador Python stdlib (~344 líneas). Por escenario: tmpdir + generator real + sobre-escribe sección mínima de `synthetic/policy.yaml` + monta git repo (`git init -b main` + commit baseline) + invoca hook real vía subprocess + asserta exit + tokens.
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section hardcodes an approximate line count for bin/_selftest.py ("~344 líneas"). Since the file length changes as scenarios evolve, this is likely to go stale; consider removing the line count or making it less specific.

Suggested change
- `bin/_selftest.py` — orquestador Python stdlib (~344 líneas). Por escenario: tmpdir + generator real + sobre-escribe sección mínima de `synthetic/policy.yaml` + monta git repo (`git init -b main` + commit baseline) + invoca hook real vía subprocess + asserta exit + tokens.
- `bin/_selftest.py` — orquestador Python stdlib. Por escenario: tmpdir + generator real + sobre-escribe sección mínima de `synthetic/policy.yaml` + monta git repo (`git init -b main` + commit baseline) + invoca hook real vía subprocess + asserta exit + tokens.

Copilot uses AI. Check for mistakes.
Comment thread HANDOFF.md Outdated
**Entregables**:

- `bin/pos-selftest.sh` (9 líneas) — wrapper bash mínimo (`#!/usr/bin/env bash` + `set -euo pipefail` + delega a `python3 bin/_selftest.py`). Sin lógica; entrypoint estable.
- `bin/_selftest.py` (~344 líneas) — orquestador stdlib Python. Por escenario: tmpdir + `npx tsx generator/run.ts --profile cli-tool.yaml --out <tmpdir>` para generar proyecto sintético, sobre-escribe la sección mínima de `synthetic/policy.yaml` que el escenario necesita, monta el sintético como git repo (`git init -b main` + commit baseline), invoca el hook real (`hooks/<name>.py`) vía subprocess con payload JSON, asserta exit + tokens en stdout/stderr/files. Imprime `[ok] D{N} {name}` o `[fail] D{N} {name}: <diag>`.
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot notes bin/_selftest.py as "~344 líneas", but this value will drift as the orchestrator changes. Consider dropping the line count (or making it intentionally coarse) to keep the docs accurate long-term.

Suggested change
- `bin/_selftest.py` (~344 líneas) — orquestador stdlib Python. Por escenario: tmpdir + `npx tsx generator/run.ts --profile cli-tool.yaml --out <tmpdir>` para generar proyecto sintético, sobre-escribe la sección mínima de `synthetic/policy.yaml` que el escenario necesita, monta el sintético como git repo (`git init -b main` + commit baseline), invoca el hook real (`hooks/<name>.py`) vía subprocess con payload JSON, asserta exit + tokens en stdout/stderr/files. Imprime `[ok] D{N} {name}` o `[fail] D{N} {name}: <diag>`.
- `bin/_selftest.py` (orquestador stdlib Python) — por escenario: tmpdir + `npx tsx generator/run.ts --profile cli-tool.yaml --out <tmpdir>` para generar proyecto sintético, sobre-escribe la sección mínima de `synthetic/policy.yaml` que el escenario necesita, monta el sintético como git repo (`git init -b main` + commit baseline), invoca el hook real (`hooks/<name>.py`) vía subprocess con payload JSON, asserta exit + tokens en stdout/stderr/files. Imprime `[ok] D{N} {name}` o `[fail] D{N} {name}: <diag>`.

Copilot uses AI. Check for mistakes.
Javier added 2 commits April 26, 2026 20:00
…3 drift

F3 documented an open drift: templates/policy.yaml.hbs + generator/renderers/policy.ts
still emit pre-D5b shape, evaded in F3 via per-scenario overlays in bin/_selftest.py.

Open the stub branch slot now so the carry-over has a concrete home:

- MASTER_PLAN.md § Rama F3b — full stub (scope, contexto a leer, decisiones a
  cerrar en Fase -1, criterio de salida, rationale de no entregarlo en F3).
  Position F3b mirrors precedent of refactor/d5-policy-loader as Rama D5b.
- ROADMAP.md — new row between F3 and F4. F3 row updated with PR #27.
- HANDOFF.md § 7 Gotchas — drift bullet now points to the stub instead of
  saying "diferida a rama propia post-D6".

No code changes. Stub is informational; Fase -1 happens when F4 closes (or
when a consumer requires post-D5b shape in generator output).
Copilot flagged "~344 líneas" hardcoded for bin/_selftest.py in 3 docs
(ROADMAP § F3, HANDOFF §21, MASTER_PLAN § Rama F3). Already stale: the
file is 321 lines after the simplify pass (bb146bc), and will keep
drifting as scenarios evolve. Drop the precise count entirely — the
description carries the magnitude signal without the maintenance debt.

Wrapper count ("9 líneas" for bin/pos-selftest.sh) kept: minimal-by-design,
stable as long as the wrapper just exec's the orchestrator.

Triage: 3 inline comments, all FIX (low value / trivial effort).
0 issue/conversation comments. Review-body was the PR overview summary.
@javiAI
Copy link
Copy Markdown
Owner Author

javiAI commented Apr 26, 2026

Address review (1/1 reviewer = Copilot, 3 inline comments, all same suggestion).

FIX (3) — commit 4fbffe1:

  • Dropped hardcoded ~344 lineas for bin/_selftest.py in ROADMAP, HANDOFF, MASTER_PLAN. Already stale (file is 321 lines after the simplify pass bb146bc), naturally drifts with scenarios. Description carries magnitude signal without the maintenance debt.

KEPT — wrapper bin/pos-selftest.sh at 9 lines: stable-by-design (no logic, just exec).

SKIP / DISCUSS: none. Review-body was the PR overview summary. 0 issue comments.

@javiAI javiAI merged commit 2595bf9 into main Apr 26, 2026
7 checks passed
@javiAI javiAI deleted the feat/f3-selftest-end-to-end branch April 26, 2026 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants