[Feedback] gregor_zwanzig — Workflow analysis #5

@henemm

Project context

gregor_zwanzig is a headless Python service that normalises weather data and emits compact reports (SMS ≤160 chars, HTML email) for long-distance hikers. Stack: Python + uv + pytest, Go API binary, Svelte frontend, NiceGUI admin UI, deployed via systemd to a Hetzner VPS with a separate staging environment. The project has been running the OpenSpec 8-phase workflow with adversary verification since Epic #77 (commit 7e86270, 2026-04-18). 170 commits have landed on main since April; almost every commit is tied to a GitHub Issue ID.

What works (with evidence)

  • Hook-enforced phase gates actually block bypass attempts. .claude/settings.json chains 14 PreToolUse hooks on Edit|Write (workflow_gate, spec_enforcement, tdd_enforcement, red_test_gate, post_implementation_gate, scope_guard, …). The system caught at least two real bypass attempts that ended up in long-term memory: editing legacy_entities.txt to whitelist a missing spec (memory feedback_no_workflow_bypass.md) and trying to "hot-fix" a worktree-routing bug directly in main (memory feedback_workflow_strict.md, 2026-05-02). In both cases the hooks won.
  • External validator as isolated claude --print session caught real bugs that internal tests missed. .claude/validate-external.sh spawns a fresh Claude with no conversation history, only spec + running app. Issue #111 result: first run was AMBIGUOUS because Python-loader output wasn't HTTP-observable. After adding an internal endpoint (Issue #115, commit 23be83c), a second validator run found 3 real bugs, one CRITICAL HTTP 500 on aggregation.profile=null (commit 4d57330). Documented in memory feedback_validator_observability_first.md.
  • Product Owner Pattern with worktree-isolated developer agent. The main context delegates implementation to a developer agent (Opus, isolated git worktree). 44 entries in .claude/worktrees/ show heavy use. The role split is enforced by the Phase 6 command (5-implement.md lines 230–236, translated from the German original: "You DO … spawn the Developer Agent … You do NOT … write or edit code").
  • Adversary verification with tri-state verdict. Phase 6b (implementation-validator agent on Sonnet) returns VERIFIED / BROKEN / AMBIGUOUS. AMBIGUOUS verdicts are not waved through: memory captures the rule that AMBIGUOUS triggers an observability-first response, not dismissal.
  • Real integration tests, no mocks. CLAUDE.md prohibits Mock()/patch()/MagicMock. Email tests round-trip through Gmail SMTP + IMAP. API tests hit the real provider. Result: zero "tests passed but prod broke" incidents in the last 50 commits.
  • Memory system distils repeated mistakes into actionable rules. 13 feedback memories in ~/.claude/projects/-home-hem-gregor-zwanzig/memory/, each with a concrete originating session and dated incident. Examples: feedback_post_push_workflow.md (5-step Push → staging-wait → staging-validate → prod-deploy → validator-vs-prod, anchored in Issues #113 and #114), feedback_validator_after_push.md (validator order matters; pre-push runs hit old code).
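
The tri-state verdict handling described above can be sketched as a small routing table. This is a minimal illustration, not the project's actual code: the action names (`proceed`, `return_to_developer`, `add_observability_then_revalidate`) are hypothetical, and only the three verdict values come from the report.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED = "VERIFIED"
    BROKEN = "BROKEN"
    AMBIGUOUS = "AMBIGUOUS"


# Hypothetical action names; the real workflow commands may differ.
NEXT_ACTION = {
    Verdict.VERIFIED: "proceed",
    Verdict.BROKEN: "return_to_developer",
    # AMBIGUOUS is treated as an observability gap, never as a soft pass.
    Verdict.AMBIGUOUS: "add_observability_then_revalidate",
}


def next_action(raw_verdict: str) -> str:
    """Map the validator's raw verdict string to a follow-up action.

    Unknown strings raise ValueError, so a malformed verdict can never be
    mistaken for a pass.
    """
    verdict = Verdict(raw_verdict.strip().upper())
    return NEXT_ACTION[verdict]
```

Routing through an `Enum` means a typo or a new, unhandled verdict fails loudly instead of falling through to a default branch.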

What does not work (with evidence)

  • Worktree state-routing was repeatedly broken. workflow_state.json is a single 137KB central file shared across worktrees. Issue #112 (commit 510717b) fixed the initial routing. Then tdd_enforcement still resolved artifact paths against the worktree root instead of main repo root (commit 28d5b22). Then active_workflow drift between parallel worktrees pushed artifacts into the wrong workflow twice in one session (memory feedback_workflow_state_explicit_name.md, 2026-05-04). The shared-state design is the root cause; per-fix patches haven't fully eliminated drift.
  • Validator order vs. push direction is a footgun. The default validator URL is production. Running the validator before pushing exercises the old production code, so it returns a misleading AMBIGUOUS. Documented in memory feedback_validator_after_push.md and Issue #113. The fix is procedural, not enforced: nothing in the workflow stops you from running it in the wrong order.
  • systemctl restart ≠ deploy. Three Python-only commits restarted services without rebuilding the Go binary or frontend → 23 h of code drift before BetterStack alerted (memory feedback_post_push_workflow.md). Required adding deploy-gregor-prod.sh and a drift monitor (check-gregor20.sh). The workflow had no concept of "what counts as deployed".
  • Schema rework caused silent data loss. Issue #102 (commit b0a3576): a refactor lost 3 of 4 stages of the GR221 trip. Recovery only worked because GPX files happened to survive in a stash. This triggered a new data_schema_backup.py hook and a "Daten-Schema-Reworks" (data schema reworks) section in CLAUDE.md. The original spec/TDD-RED loop did not catch the regression because the acceptance criteria didn't cover persistence-survives-edit semantics.
  • Spec-Approval gating is grep-fragile. Approval is encoded as - [ ] Approved / - [x] Approved in the spec markdown. There is no schema validation that acceptance criteria are present, traceable to tests, or distinct from boilerplate. Several specs in docs/specs/modules/ use the template structure inconsistently.
  • Phase 6 fix-loop has no telemetry. 5-implement.md says "max 3 iterations", but the state file does not count them. There is no audit trail of how often a workflow bounced between developer and validator, so chronically problematic features can't be detected.
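
To illustrate why checkbox-based approval is fragile, here is a sketch of a stricter reader for the `- [ ] Approved` / `- [x] Approved` convention. It is an assumption about how such a check could look, not the existing hook: the function name `approval_status` and its three return values are hypothetical. It skips fenced code blocks (so a spec that quotes the template is not counted) and refuses to guess when there are zero or several approval lines.

```python
import re

# Matches a Markdown task-list line whose label is exactly "Approved".
APPROVAL_LINE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+Approved\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


def approval_status(spec_markdown: str) -> str:
    """Return "approved", "pending", or "malformed" for a spec document."""
    marks = []
    in_fence = False
    for line in spec_markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence  # ignore checkboxes quoted inside code
            continue
        if in_fence:
            continue
        match = APPROVAL_LINE.match(line)
        if match:
            marks.append(match.group(1))
    if len(marks) != 1:
        # Zero or several approval lines: refuse to guess.
        return "malformed"
    return "approved" if marks[0].lower() == "x" else "pending"
```

Even this stricter reader only inspects the checkbox; it says nothing about whether acceptance criteria exist or are traceable to tests.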

Gaps / blind spots

  • No machine-readable spec format. Specs are free-form markdown. Approval is grep-detected; acceptance criteria are not parsed; no spec → test traceability map. RED-test artifacts are registered with a free-text description, not linked to specific criteria.
  • Validator observability is bespoke per project. The Issue #111 → #115 lesson ("AMBIGUOUS means observability gap, not 'tests cover it'") is encoded as a memory, not a workflow contract. A standard spec should declare its required black-box surface.
  • Single-file workflow state. 137KB JSON, mixes per-workflow phase data, artifacts, approvals, and active-workflow pointer. Concurrent worktrees cause drift. Per-workflow files would be more robust and would not require the "always pass workflow name explicitly" workaround documented in memory.
  • No standard deploy contract. "What counts as deployed" had to be discovered the hard way (Issue #113). The workflow does not check that a binary build, a frontend build, and a smoke test all happened before declaring "done".
  • Memory rules can rot. Feedback memories reference file paths and module names that may move; nothing audits them against the current repo. feedback_workflow_state_explicit_name.md is already a workaround, not a permanent fix.
  • Cross-repo dependency contract is informal. This project notifies sibling Claude instances (claude-mq) via free-text messages. Specs that span multiple repos (e.g. nginx config in henemm-infra for a new endpoint) have no formal link.
  • GitHub Issue ↔ Spec linkage is by convention only. Commits cite issue numbers, but specs in docs/specs/modules/ don't always reference issues, and issues don't list their spec path. Bidirectional traceability would help auditing.
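
The traceability gaps listed above could be checked mechanically once specs carry structured data. The sketch below assumes the frontmatter has already been parsed into a dictionary; the field names `acceptance_criteria`, `tests`, and `issue` follow the fields proposed in this report, while the exact shape (a mapping from criterion ID to a list of test identifiers) and the function name `traceability_gaps` are assumptions made for illustration.

```python
from __future__ import annotations


def traceability_gaps(frontmatter: dict) -> list[str]:
    """Return human-readable problems; an empty list means the spec passes."""
    problems = []
    criteria = frontmatter.get("acceptance_criteria") or []
    tests = frontmatter.get("tests") or {}

    if not criteria:
        problems.append("no acceptance criteria declared")
    # Every declared criterion needs at least one test.
    for criterion_id in criteria:
        if not tests.get(criterion_id):
            problems.append(f"criterion {criterion_id} has no test")
    # Tests must not point at criteria the spec does not declare.
    for criterion_id in tests:
        if criterion_id not in criteria:
            problems.append(f"tests reference unknown criterion {criterion_id}")
    if not frontmatter.get("issue"):
        problems.append("no linked issue")
    return problems


# Usage with made-up example values (IDs and test paths are hypothetical):
example = {
    "acceptance_criteria": ["AC-1", "AC-2"],
    "tests": {"AC-1": ["tests/test_example.py::test_one"]},
    "issue": None,
}
for problem in traceability_gaps(example):
    print(problem)
```

A check like this could run in a hook or in CI; which of the two is appropriate is left open here.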

Concrete recommendations to agent-os-openspec

  1. Adopt a structured spec frontmatter with parsable fields: acceptance_criteria (list of IDs), observability_requirements (HTTP endpoints / log signatures the validator can check), tests (mapping criterion → test file/name). Replace grep-based approval with an explicit approved_at timestamp + signer.
  2. Define a standard validator contract. Every spec must declare which black-box surfaces the external validator can rely on. AMBIGUOUS verdict on a non-declared surface should auto-block; on a declared surface it falls back to user review. This codifies the Issue #111 lesson.
  3. Move workflow state to per-workflow files (docs/artifacts/<workflow>/state.json) plus a tiny active.json pointer. Eliminates the central-file drift class entirely.
  4. Add a deploy contract phase with explicit gates: artifact-built, smoke-tested, drift-monitor-clear. systemctl restart alone must not satisfy "done".
  5. Track fix-loop iterations as first-class state. Persist a counter per workflow per phase, surface it in /status, fail closed at the configured max.
  6. Schema-rework template: mandatory pre-snapshot, post-restore-test, and acceptance criterion "no field of any persisted record is unreadable after migration". Generalise from the GR221 incident.
  7. Bi-directional GitHub Issue ↔ Spec link as a workflow primitive: the spec frontmatter has an issue: field, and the GitHub issue is annotated with a spec: path. CI fails on a mismatch.
  8. Memory audit hook: when memories reference file paths, periodically check the paths still exist; flag stale memories. Prevents the "rule about a deleted file" failure mode.
  9. Adopt the "tech-free user-facing language" convention as an explicit role contract in agent definitions, with examples of bad vs. good phrasing. Currently it lives in memory only and is re-learned each session.
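
Recommendations 3 and 5 combine naturally: one state file per workflow, holding a fix-loop counter that fails closed. The following is a minimal sketch under stated assumptions. The `<workflow>/state.json` layout follows recommendation 3, and the limit of 3 mirrors the "max 3 iterations" rule quoted from 5-implement.md; the `fix_iterations` key, the function names, and the exception type are hypothetical.

```python
import json
from pathlib import Path

MAX_FIX_ITERATIONS = 3  # mirrors the "max 3 iterations" rule in 5-implement.md


class FixLoopExhausted(RuntimeError):
    """Raised when a workflow phase has used up its fix-loop budget."""


def state_path(artifacts_root: Path, workflow: str) -> Path:
    # One state file per workflow, so parallel worktrees never share a file.
    return artifacts_root / workflow / "state.json"


def record_fix_iteration(artifacts_root: Path, workflow: str, phase: str) -> int:
    """Increment and persist the fix-loop counter for one workflow phase.

    Fails closed: raises before writing once the budget is spent.
    """
    path = state_path(artifacts_root, workflow)
    state = json.loads(path.read_text()) if path.exists() else {}
    counters = state.setdefault("fix_iterations", {})
    count = counters.get(phase, 0) + 1
    if count > MAX_FIX_ITERATIONS:
        raise FixLoopExhausted(
            f"{workflow}/{phase}: more than {MAX_FIX_ITERATIONS} fix iterations"
        )
    counters[phase] = count
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2))
    return count
```

Because the workflow name is part of the path, the caller has to state which workflow it means, which removes the reliance on a shared active-workflow pointer for writes. The sketch does not lock the file, so two writers on the same workflow would still need coordination.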

Labels: workflow-feedback (Feedback from a project implementing the workflow spec)
