Project context
gregor_zwanzig is a headless Python service that normalises weather data and emits compact reports (SMS ≤160 chars, HTML email) for long-distance hikers. Stack: Python + uv + pytest, Go API binary, Svelte frontend, NiceGUI admin UI, deployed via systemd to a Hetzner VPS with a separate staging environment. The project has been running the OpenSpec 8-phase workflow with adversary verification since Epic #77 (commit 7e86270, 2026-04-18). 170 commits have landed on main since April; almost every commit is tied to a GitHub Issue ID.
What works (with evidence)
- Hook-enforced phase gates actually block bypass attempts. .claude/settings.json chains 14 PreToolUse hooks on Edit|Write (workflow_gate, spec_enforcement, tdd_enforcement, red_test_gate, post_implementation_gate, scope_guard, …). The system caught at least two real bypass attempts that ended up in long-term memory: editing legacy_entities.txt to whitelist a missing spec (memory feedback_no_workflow_bypass.md) and trying to "hot-fix" a worktree-routing bug directly in main (memory feedback_workflow_strict.md, 2026-05-02). In both cases the hooks won.
- External validator as an isolated claude --print session caught real bugs that internal tests missed. .claude/validate-external.sh spawns a fresh Claude with no conversation history, only spec + running app. Issue #111 result: first run was AMBIGUOUS because Python-loader output wasn't HTTP-observable. After adding an internal endpoint (Issue #115, commit 23be83c), a second validator run found 3 real bugs, one of them a CRITICAL HTTP 500 on aggregation.profile=null (commit 4d57330). Documented in memory feedback_validator_observability_first.md.
- Product Owner Pattern with worktree-isolated developer agent. Main context delegates implementation to a developer agent (Opus, isolated git worktree). 44 entries in .claude/worktrees/ show heavy use. Role split is enforced by the Phase 6 command (5-implement.md lines 230–236: "Du TUST … Developer Agent spawnen … Du tust NICHT … Code schreiben oder editieren", roughly "You DO … spawn the developer agent … You do NOT … write or edit code").
- Adversary verification with tri-state verdict. Phase 6b (implementation-validator agent on Sonnet) returns VERIFIED / BROKEN / AMBIGUOUS. AMBIGUOUS verdicts are not nodded through: memory captures the rule that AMBIGUOUS triggers an observability-first response, not dismissal.
- Real integration tests, no mocks. CLAUDE.md prohibits Mock()/patch()/MagicMock. Email tests round-trip through Gmail SMTP + IMAP. API tests hit the real provider. Result: zero "tests passed but prod broke" incidents in the last 50 commits.
- Memory system distils repeated mistakes into actionable rules. 13 feedback memories in ~/.claude/projects/-home-hem-gregor-zwanzig/memory/, each with a concrete originating session and dated incident. Examples: feedback_post_push_workflow.md (5-step Push → staging-wait → staging-validate → prod-deploy → validator-vs-prod, anchored in Issues #113 and #114), feedback_validator_after_push.md (validator order matters; pre-push runs hit old code).
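The tri-state verdict and the observability-first rule could be sketched roughly as follows. This is an illustration only, not the project's implementation: `Verdict`, `parse_verdict`, `route_verdict`, the action strings, and the assumption that the validator report contains the verdict as a plain word are all hypothetical. Only the three verdict values and the rule that AMBIGUOUS is never dismissed come from the notes above.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED = "VERIFIED"
    BROKEN = "BROKEN"
    AMBIGUOUS = "AMBIGUOUS"


def parse_verdict(report: str) -> Verdict:
    """Extract the verdict from validator output (assumed: verdict appears as a word).

    A report that does not contain exactly one known verdict is treated as
    AMBIGUOUS, so an unreadable report can never pass as VERIFIED.
    """
    found = [v for v in Verdict if v.value in report]
    return found[0] if len(found) == 1 else Verdict.AMBIGUOUS


def route_verdict(verdict: Verdict) -> str:
    """Map a verdict to the next workflow action (hypothetical action names)."""
    if verdict is Verdict.VERIFIED:
        return "proceed"
    if verdict is Verdict.BROKEN:
        return "return-to-developer"
    # AMBIGUOUS: the validator could not observe the behaviour, so the gap is
    # closed by adding an observable surface, not by waving the result through.
    return "add-observability-and-rerun"
```

The design choice worth noting is the default: anything the parser cannot classify falls into the AMBIGUOUS branch rather than into either definite outcome.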
What does not work (with evidence)
- Worktree state-routing was repeatedly broken. workflow_state.json is a single 137KB central file shared across worktrees. Issue #112 (commit 510717b) fixed the initial routing. Then tdd_enforcement still resolved artifact paths against the worktree root instead of main repo root (commit 28d5b22). Then active_workflow drift between parallel worktrees pushed artifacts into the wrong workflow twice in one session (memory feedback_workflow_state_explicit_name.md, 2026-05-04). The shared-state design is the root cause; per-fix patches haven't fully eliminated drift.
- Validator order vs. push direction is a footgun. Default validator URL is production. Running validator pre-push hits old prod code → false-negative AMBIGUOUS. Documented in memory feedback_validator_after_push.md and Issue #113. Fix is procedural, not enforced: nothing in the workflow stops you from running it in the wrong order.
- systemctl restart ≠ deploy. Three Python-only commits restarted services without rebuilding the Go binary or frontend → 23 h of code drift before BetterStack alerted (memory feedback_post_push_workflow.md). Required adding deploy-gregor-prod.sh and a drift monitor (check-gregor20.sh). The workflow had no concept of "what counts as deployed".
- Schema rework caused silent data loss. Issue #102 (commit b0a3576): a refactor lost 3 of 4 stages of the GR221 trip. Recovery only worked because GPX files happened to survive in a stash. Triggered a new data_schema_backup.py hook and a "Daten-Schema-Reworks" section in CLAUDE.md. The original spec/TDD-RED loop did not catch the regression because acceptance criteria didn't cover persistence-survives-edit semantics.
- Spec-Approval gating is grep-fragile. Approval is encoded as a "- [ ] Approved" / "- [x] Approved" checkbox line in the spec markdown. There is no schema validation that acceptance criteria are present, traceable to tests, or distinct from boilerplate. Several specs in docs/specs/modules/ use the template structure inconsistently.
- Phase 6 fix-loop has no telemetry. 5-implement.md says "max 3 iterations" but the state file does not count them. There's no audit trail of how often a workflow bounced between developer and validator, so chronic problem features can't be detected.
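The missing fix-loop telemetry could be closed with a persisted counter. The sketch below is one possible shape under stated assumptions: `record_iteration`, `FixLoopExhausted`, and the `fix_loop_iterations` key are hypothetical names, and the JSON layout is invented for illustration. Only the limit of 3 is taken from 5-implement.md as quoted above.

```python
import json
from pathlib import Path

MAX_ITERATIONS = 3  # the "max 3 iterations" limit quoted from 5-implement.md


class FixLoopExhausted(RuntimeError):
    """Raised when a phase has used up its fix-loop budget."""


def record_iteration(state_file: Path, workflow: str, phase: str) -> int:
    """Increment and persist the fix-loop counter; fail closed at the limit."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    # Hypothetical layout: {"fix_loop_iterations": {workflow: {phase: count}}}
    counters = state.setdefault("fix_loop_iterations", {}).setdefault(workflow, {})
    count = counters.get(phase, 0) + 1
    if count > MAX_ITERATIONS:
        # Refuse to start another round instead of silently continuing.
        raise FixLoopExhausted(
            f"{workflow}/{phase}: more than {MAX_ITERATIONS} iterations"
        )
    counters[phase] = count
    state_file.write_text(json.dumps(state, indent=2))
    return count
```

Because the counter lives in the state file, the same numbers that enforce the limit also provide the audit trail for spotting features that bounce repeatedly between developer and validator.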
Gaps / blind spots
- No machine-readable spec format. Specs are free-form markdown. Approval is grep-detected; acceptance criteria are not parsed; no spec → test traceability map. RED-test artifacts are registered with a free-text description, not linked to specific criteria.
- Validator observability is bespoke per project. The Issue #111 → #115 lesson ("AMBIGUOUS means observability gap, not 'tests cover it'") is encoded as a memory, not a workflow contract. A standard spec should declare its required black-box surface.
- Single-file workflow state. 137KB JSON, mixes per-workflow phase data, artifacts, approvals, and active-workflow pointer. Concurrent worktrees cause drift. Per-workflow files would be more robust and would not require the "always pass workflow name explicitly" workaround documented in memory.
- No standard deploy contract. "What counts as deployed" had to be discovered the hard way (Issue #113). The workflow does not check that a binary build, a frontend build, and a smoke test all happened before declaring "done".
- Memory rules can rot. Feedback memories reference file paths and module names that may move; nothing audits them against the current repo. feedback_workflow_state_explicit_name.md is already a workaround, not a permanent fix.
- Cross-repo dependency contract is informal. This project notifies sibling Claude instances (claude-mq) via free-text messages. Specs that span multiple repos (e.g. nginx config in henemm-infra for a new endpoint) have no formal link.
- GitHub Issue ↔ Spec linkage is by convention only. Commits cite issue numbers, but specs in docs/specs/modules/ don't always reference issues, and issues don't list their spec path. Bidirectional traceability would help auditing.
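The single-file state gap lends itself to a small illustration. The sketch below assumes a layout of one `docs/artifacts/<workflow>/state.json` per workflow; the helper names `state_path`, `save_state`, and `load_state` are hypothetical, and the atomic temp-file-then-rename write is a design suggestion rather than something the project is known to do.

```python
import json
import os
import tempfile
from pathlib import Path


def state_path(repo_root: Path, workflow: str) -> Path:
    """Location of one workflow's state (assumed layout).

    repo_root is meant to be the main repository root, not a worktree root,
    so every worktree resolves to the same file for a given workflow.
    """
    return repo_root / "docs" / "artifacts" / workflow / "state.json"


def save_state(repo_root: Path, workflow: str, state: dict) -> None:
    """Write one workflow's state atomically: temp file first, then rename."""
    target = state_path(repo_root, workflow)
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, target)  # readers never see a half-written file


def load_state(repo_root: Path, workflow: str) -> dict:
    """Read one workflow's state; a workflow without a file has empty state."""
    path = state_path(repo_root, workflow)
    return json.loads(path.read_text()) if path.exists() else {}
```

Every call names its workflow explicitly, so two parallel worktrees touching different workflows never write to the same file, and there is no shared active pointer to drift.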
Concrete recommendations to agent-os-openspec
- Adopt a structured spec frontmatter with parsable fields: acceptance_criteria (list of IDs), observability_requirements (HTTP endpoints / log signatures the validator can check), tests (mapping criterion → test file/name). Replace grep-based approval with explicit approved_at timestamp + signer.
- Define a standard validator contract. Every spec must declare which black-box surfaces the external validator can rely on. AMBIGUOUS verdict on a non-declared surface should auto-block; on a declared surface it falls back to user review. This codifies the Issue #111 lesson.
- Move workflow state to per-workflow files (docs/artifacts/<workflow>/state.json) plus a tiny active.json pointer. Eliminates the central-file drift class entirely.
- Add a deploy contract phase with explicit gates: artifact-built, smoke-tested, drift-monitor-clear. systemctl restart alone must not satisfy "done".
- Track fix-loop iterations as first-class state. Persist a counter per workflow per phase, surface it in /status, fail closed at the configured max.
- Schema-rework template: mandatory pre-snapshot, post-restore-test, and acceptance criterion "no field of any persisted record is unreadable after migration". Generalise from the GR221 incident.
- Bi-directional GitHub Issue ↔ Spec link as a workflow primitive: the spec frontmatter has an issue: field, and the GitHub issue is annotated with a spec: path. CI fails on a mismatch.
- Memory audit hook: when memories reference file paths, periodically check the paths still exist; flag stale memories. Prevents the "rule about a deleted file" failure mode.
- Adopt the "tech-free user-facing language" convention as an explicit role contract in agent definitions, with examples of bad vs. good phrasing. Currently it lives in memory only and is re-learned each session.
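To make the structured-frontmatter recommendation concrete, here is a minimal validation sketch. It assumes the frontmatter has already been parsed into a dictionary; the field names acceptance_criteria, observability_requirements, tests, approved_at, and issue follow the recommendations above, while `approved_by` (standing in for the signer), `validate_spec`, and the message wording are hypothetical.

```python
from datetime import datetime

REQUIRED_FIELDS = ("issue", "acceptance_criteria", "observability_requirements", "tests")


def validate_spec(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not meta.get(field):
            problems.append(f"missing or empty field: {field}")

    criteria = meta.get("acceptance_criteria") or []
    tests = meta.get("tests") or {}
    # Traceability in both directions: every criterion has a test,
    # and every test entry points at a declared criterion.
    for criterion in criteria:
        if not tests.get(criterion):
            problems.append(f"criterion {criterion} has no test")
    for criterion in tests:
        if criterion not in criteria:
            problems.append(f"tests reference unknown criterion {criterion}")

    # Explicit approval replaces the grep-detected checkbox.
    approved_at = meta.get("approved_at")
    if approved_at is None or not meta.get("approved_by"):
        problems.append("not approved: approved_at and approved_by are required")
    else:
        try:
            datetime.fromisoformat(approved_at)
        except (TypeError, ValueError):
            problems.append(f"approved_at is not an ISO 8601 timestamp: {approved_at!r}")
    return problems
```

A hook or CI step could call this on every spec and block the phase transition whenever the returned list is non-empty, which would also give the issue link and the criterion-to-test map a single enforcement point.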