Spec Enhancement: Workflow Observability & Measurement #6

@henemm

Summary

Three independent projects running the OpenSpec 8-phase workflow (FocusBlox: 711 commits, 37+ cycles; Meditationstimer: iOS/watchOS, Workflow v6; gregor_zwanzig: Python service, 170 commits since April) independently identified the same structural gap: there is no machine-readable record of workflow execution. Without execution logs, it is impossible to measure which phases add value, calibrate the Adversary, detect chronic problem features, or learn across projects. The projects also identified two related spec gaps: (1) specs have no standardised acceptance-criteria format, so tests cannot be traced to criteria (3/3 projects), and (2) scope enforcement checks file count but not LoC delta (2/3 projects).

This proposal defines the minimal additions to the OpenSpec spec that would close these gaps.


Evidence (from 3 projects)

Gap 1 — No Workflow Execution Log (3/3 projects)

FocusBlox: "No systematic log exists. When asked 'what works and what doesn't?', the only available data is anecdotal memory entries. There is no structured record of: how often phases were skipped, adversary true/false positive rate, scope compliance, TDD discipline rate."

Meditationstimer: "No system measures: how often each phase catches issues, average time per phase, how often the Adversary is BROKEN vs VERIFIED, how many findings are deferred vs fixed. Without data, it's impossible to know which phases add the most value."

gregor_zwanzig: "Phase 6 fix-loop has no telemetry. 5-implement.md says 'max 3 iterations' but state file does not count them. There's no audit trail of how often a workflow bounced between developer and validator, so chronic problem features can't be detected."

Gap 2 — No Testable Acceptance Criteria Format (3/3 projects)

FocusBlox: "No Reproducibility Spec — no machine-readable spec that describes acceptance criteria in a way another project could implement independently."

Meditationstimer: "No Spec Quality Gate — no check that the spec is testable — no acceptance criteria format, no 'must include at least N acceptance tests' requirement. Vague specs produce vague tests."

gregor_zwanzig: "No machine-readable spec format. Approval is grep-detected; acceptance criteria are not parsed; no spec → test traceability map." The approval mechanism is a - [ ] Approved / - [x] Approved checkbox detected by grep; nothing validates that acceptance criteria are present, distinct from boilerplate, or traceable to tests.

Gap 3 — Scope Enforcement Incomplete: File Count Only, No LoC (2/3 projects)

FocusBlox: "Scope Guard Not Enforced — CLAUDE.md documents max 5 files / ±250 LoC. No hook enforces the LoC limit. Violations go undetected."

Meditationstimer: "Scope Enforcement Is File-Count-Only, Not LoC — edit_gate.py checks len(code_affected) > 5 but does not count lines changed."

Gap 4 — Adversary Code-First Verification Not Enforced (2/3 projects)

FocusBlox: "Adversary Produces False Positives — 3 of 6 findings in one session were false positives. The Adversary read the spec to generate findings but didn't verify against actual code." Root cause: Adversary prompt does not require reading current implementation before reporting.

Meditationstimer: "Adversary AMBIGUOUS verdict has no enforcement path — workflow still proceeded to checkpoint3_approved and commit. An AMBIGUOUS verdict with all findings resolved is indistinguishable from VERIFIED at the gate level."

Gap 5 — Phase Transition Audit Trail Missing (2/3 projects)

Meditationstimer: "Phase transitions themselves are not enforced by hooks — the orchestrator calls workflow.py phase phase5_implement directly. Claude could skip from phase2_analyse to phase5_implement without checkpoints."

gregor_zwanzig: "Track fix-loop iterations as first-class state. Persist a counter per workflow per phase, surface it in /status, fail closed at the configured max."

Cross-cutting confirmation — "Hooks are Law" (3/3 projects)

FocusBlox (strongest evidence): "CLAUDE.md rules are followed with ~60–70% probability. Hooks with 100%. Documentation is a suggestion. Hooks are law."

All three projects independently confirm: any spec requirement without a corresponding hook will eventually be skipped.


Proposed Spec Additions

S1 — Workflow Execution Log Schema (closes Gap 1)

Add a standard section to the workflow spec defining a mandatory execution log entry per completed workflow. The log is written to .claude/workflows/_log/YYYY-MM-DD_<workflow-id>.yaml and committed alongside the work.

Minimum required fields:

workflow_id: FEAT_001
project: <project-name>
completed_at: 2026-05-09T14:22:00Z
phases_completed: [phase1_context, phase2_analyse, phase3_spec, phase4_approved, phase5_tdd_red, phase6_implement, phase7_validate]
phases_skipped: []
override_used: false
tdd_red_confirmed: true
adversary_verdict: VERIFIED        # VERIFIED | BROKEN | AMBIGUOUS
adversary_findings_total: 2
adversary_fix_loop_iterations: 1
scope_files_changed: 3
scope_loc_delta: +142
outcome: success                   # success | partial | reverted

workflow.py complete must refuse to archive without a valid log entry. bash_gate.py git-commit check must verify the log file is staged in phase8_complete.
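As a rough illustration of what "a valid log entry" could mean in practice, the following sketch checks an already-loaded entry against the minimum fields listed above. The function names `validate_log_entry` and `log_path` are hypothetical, and the sketch assumes the YAML file has already been parsed into a plain dict; it is not the actual workflow.py implementation.

```python
from datetime import datetime

# Field names and allowed values mirror the minimum-fields example above.
# The expected Python types are an assumption about how the YAML is loaded.
REQUIRED_FIELDS = {
    "workflow_id": str,
    "project": str,
    "completed_at": (str, datetime),  # some YAML loaders return datetime objects
    "phases_completed": list,
    "phases_skipped": list,
    "override_used": bool,
    "tdd_red_confirmed": bool,
    "adversary_verdict": str,
    "adversary_findings_total": int,
    "adversary_fix_loop_iterations": int,
    "scope_files_changed": int,
    "scope_loc_delta": int,
    "outcome": str,
}
VERDICTS = {"VERIFIED", "BROKEN", "AMBIGUOUS"}
OUTCOMES = {"success", "partial", "reverted"}


def validate_log_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry looks valid."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in entry:
            errors.append(f"missing field: {name}")
            continue
        value = entry[name]
        # bool is a subclass of int in Python, so reject it for integer fields.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"wrong type for {name}")

    verdict = entry.get("adversary_verdict")
    if isinstance(verdict, str) and verdict not in VERDICTS:
        errors.append(f"unknown adversary_verdict: {verdict}")
    outcome = entry.get("outcome")
    if isinstance(outcome, str) and outcome not in OUTCOMES:
        errors.append(f"unknown outcome: {outcome}")

    completed_at = entry.get("completed_at")
    if isinstance(completed_at, str):
        try:
            # Replace the trailing Z so older Python versions can parse it.
            datetime.fromisoformat(completed_at.replace("Z", "+00:00"))
        except ValueError:
            errors.append("completed_at is not an ISO 8601 timestamp")
    return errors


def log_path(date: str, workflow_id: str) -> str:
    """Build the log location described above: _log/YYYY-MM-DD_<workflow-id>.yaml."""
    return f".claude/workflows/_log/{date}_{workflow_id}.yaml"
```

A `workflow.py complete` command could refuse to archive whenever the returned list is non-empty, and a commit gate could compare `log_path(...)` against the staged file list.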

S2 — Acceptance Criteria Format Standard (closes Gap 2)

Mandatory ## Acceptance Criteria section in every spec. Each criterion must:

  1. Have a unique ID: AC-<N>
  2. Use testable format: Given <precondition> / When <action> / Then <observable outcome>
  3. Reference at least one test (populated after TDD RED phase)

Example:

## Acceptance Criteria

- **AC-1:** Given the app is in background / When the session timer fires / Then the session continues and elapsed time is correct on foreground.
  - Test: `BackgroundMeditationUITests.test_backgroundForeground_sessionStillRunning`

edit_gate.py must parse the spec file for ## Acceptance Criteria before allowing phase6 edits. Block if section is missing or contains zero AC-N: entries. qa_gate.py must link adversary findings to AC IDs — a finding without an AC reference is flagged unverifiable.
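The parsing step could look roughly like the sketch below. It assumes the bullet layout of the example above (a bold `AC-N:` label followed by an indented `Test:` sub-bullet); the function names `parse_acceptance_criteria` and `gate_problems` and the `require_tests` flag are illustrative, not part of the existing edit_gate.py.

```python
import re

# Patterns follow the example layout above; other spec layouts would need other patterns.
AC_LINE = re.compile(r"^- \*\*AC-(\d+):\*\*\s*(.+)$")
GIVEN_WHEN_THEN = re.compile(r"Given .+ / When .+ / Then .+")
TEST_LINE = re.compile(r"^\s+- Test:\s*`?([^`]+)`?\s*$")


def parse_acceptance_criteria(spec_text: str) -> list:
    """Collect AC entries (id, text, tests) from the '## Acceptance Criteria' section."""
    criteria = []
    in_section = False
    for line in spec_text.splitlines():
        if line.startswith("## "):
            # Any other level-2 heading ends the section.
            in_section = line.strip() == "## Acceptance Criteria"
            continue
        if not in_section:
            continue
        ac = AC_LINE.match(line)
        if ac:
            criteria.append({"id": f"AC-{ac.group(1)}", "text": ac.group(2), "tests": []})
            continue
        test = TEST_LINE.match(line)
        if test and criteria:
            criteria[-1]["tests"].append(test.group(1).strip())
    return criteria


def gate_problems(spec_text: str, require_tests: bool = False) -> list:
    """Return reasons to block phase6 edits; an empty list means the gate passes."""
    criteria = parse_acceptance_criteria(spec_text)
    if not criteria:
        return ["missing '## Acceptance Criteria' section or zero AC-N entries"]
    problems = []
    seen = set()
    for criterion in criteria:
        if criterion["id"] in seen:
            problems.append(f"{criterion['id']}: duplicate ID")
        seen.add(criterion["id"])
        if not GIVEN_WHEN_THEN.search(criterion["text"]):
            problems.append(f"{criterion['id']}: not in Given / When / Then format")
        # Test references are only populated after TDD RED, so this check is optional.
        if require_tests and not criterion["tests"]:
            problems.append(f"{criterion['id']}: no test referenced")
    return problems
```

Keeping `require_tests` off before the TDD RED phase and on afterwards would match the rule that test references are filled in later.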

S3 — LoC Delta Enforcement (closes Gap 3)

Extend edit_gate.py to track cumulative LoC delta per workflow session via git diff --shortstat on modified files. Block when max_loc_delta threshold is exceeded (default: 250). Surface in workflow.py status:

Scope: 3/5 files, +142 LoC (limit: 250)

Per-workflow override via workflow.py set-field loc_limit_override 500 for legitimate exceptions (e.g. bulk translation commits). Configurable loc_exclude_patterns in openspec.yaml for generated files.
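A minimal sketch of the LoC check, assuming "LoC delta" means net change (insertions minus deletions) compared by absolute value against the limit; the proposal does not pin this down, and total churn would be an equally plausible reading. The helper names are hypothetical, and the sketch ignores loc_exclude_patterns.

```python
import re
import subprocess


def parse_shortstat(output: str) -> tuple:
    """Extract (files, insertions, deletions) from `git diff --shortstat` output.

    Git omits the insertions or deletions part when the count is zero,
    so each number defaults to 0.
    """
    def grab(pattern: str) -> int:
        match = re.search(pattern, output)
        return int(match.group(1)) if match else 0

    return (
        grab(r"(\d+) files? changed"),
        grab(r"(\d+) insertions?\(\+\)"),
        grab(r"(\d+) deletions?\(-\)"),
    )


def scope_status(output: str, max_files: int = 5, max_loc_delta: int = 250) -> tuple:
    """Return (status_line, blocked) in the status format shown above."""
    files, insertions, deletions = parse_shortstat(output)
    delta = insertions - deletions  # assumption: net delta, not total churn
    line = f"Scope: {files}/{max_files} files, {delta:+d} LoC (limit: {max_loc_delta})"
    blocked = files > max_files or abs(delta) > max_loc_delta
    return line, blocked


def current_shortstat(base: str = "HEAD") -> str:
    """Run git in the current repository; `base` is whatever the workflow started from."""
    result = subprocess.run(
        ["git", "diff", "--shortstat", base],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A per-workflow loc_limit_override would simply be passed in as `max_loc_delta`.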

S4 — Adversary Code-First Requirement (closes Gap 4)

Add to implementation-validator.md: every finding must include a file:line reference obtained by reading the actual current implementation, not only the spec. A finding without a code reference is rejected as malformed.

Required finding format:

Finding #N — <title>
Severity: CRITICAL | HIGH | MEDIUM | LOW
Code reference: path/to/file.py:42
Evidence: <what the code actually does>
Spec requirement: AC-N says <X>
Conflict: <why this is a violation>

Confirmations (AC satisfied) must be listed explicitly alongside findings to prove coverage.
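To make "rejected as malformed" concrete, a validator for the finding format above might look like this sketch. It assumes each field sits on its own line as `Name: value`; the function names are hypothetical, and the check only confirms that a file:line reference is syntactically present, not that the Adversary actually read the file.

```python
import re

FIELDS = ("Severity", "Code reference", "Evidence", "Spec requirement", "Conflict")
SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}
CODE_REF = re.compile(r"^\S+:\d+$")   # e.g. path/to/file.py:42
AC_REF = re.compile(r"\bAC-\d+\b")


def parse_finding(text: str) -> dict:
    """Split a finding block into its title line and named fields."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    finding = {"title": lines[0] if lines else ""}
    for line in lines[1:]:
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            finding[key] = value.strip()
    return finding


def finding_problems(text: str) -> list:
    """Return reasons to reject the finding; an empty list means well-formed."""
    finding = parse_finding(text)
    problems = []
    if not re.match(r"^Finding #\d+\b", finding["title"]):
        problems.append("title does not start with 'Finding #N'")
    for name in FIELDS:
        if not finding.get(name):
            problems.append(f"missing {name}")
    severity = finding.get("Severity")
    if severity and severity not in SEVERITIES:
        problems.append(f"unknown severity: {severity}")
    code_ref = finding.get("Code reference")
    if code_ref and not CODE_REF.match(code_ref):
        problems.append("malformed: code reference is not file:line")
    spec_req = finding.get("Spec requirement")
    if spec_req and not AC_REF.search(spec_req):
        problems.append("unverifiable: no AC reference")
    return problems
```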

AMBIGUOUS verdict: bash_gate.py must treat AMBIGUOUS identically to BROKEN, even when zero findings remain open. Only workflow.py override-ambiguous "<reason>" (requires user keyword via phase_listener.py) may unlock the commit.
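The gate rule itself is small enough to state as code. In this sketch, `commit_allowed` is a hypothetical helper and `override_reason` stands in for whatever workflow.py override-ambiguous would record in the workflow state.

```python
from typing import Optional


def commit_allowed(verdict: str, override_reason: Optional[str] = None) -> bool:
    """Decide whether the commit gate opens for a given Adversary verdict."""
    if verdict == "VERIFIED":
        return True
    if verdict == "AMBIGUOUS":
        # Blocked like BROKEN unless an explicit, non-empty override reason exists.
        return bool(override_reason and override_reason.strip())
    # BROKEN and any unknown verdict fail closed.
    return False
```

Failing closed on unknown verdicts keeps a typo in the state file from opening the gate.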

S5 — Phase Transition Audit Trail (closes Gap 5)

workflow.py phase <new_phase> must append to phase_transitions in workflow state:

"phase_transitions": [
  {"from": "phase2_analyse", "to": "phase3_spec", "at": "2026-05-09T10:00:00Z", "trigger": "command"},
  {"from": "phase3_spec", "to": "phase4_approved", "at": "2026-05-09T10:31:00Z", "trigger": "user_keyword"}
]

trigger values: user_keyword | command | manual. Transitions with trigger: manual that skip phases emit a warning (not a block). The fix-loop iteration counter is incremented each time phase6_implement is re-entered after phase6b_adversary; it is surfaced in workflow.py status and included in the execution log.
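A sketch of how the transition append and the fix-loop counter could fit together. The state keys `current_phase` and `fix_loop_iterations`, the function name `record_transition`, and the position of phase6b_adversary and phase8_complete in `PHASE_ORDER` are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Phase names come from this proposal; the exact ordering is an assumption.
PHASE_ORDER = [
    "phase1_context", "phase2_analyse", "phase3_spec", "phase4_approved",
    "phase5_tdd_red", "phase6_implement", "phase6b_adversary",
    "phase7_validate", "phase8_complete",
]
TRIGGERS = {"user_keyword", "command", "manual"}


def record_transition(state: dict, new_phase: str, trigger: str, now=None) -> list:
    """Append a transition to the workflow state and return any warnings."""
    if trigger not in TRIGGERS:
        raise ValueError(f"unknown trigger: {trigger}")
    old_phase = state.get("current_phase")
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    state.setdefault("phase_transitions", []).append({
        "from": old_phase,
        "to": new_phase,
        "at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "trigger": trigger,
    })

    # Re-entering implementation straight after the Adversary is one fix-loop iteration.
    if old_phase == "phase6b_adversary" and new_phase == "phase6_implement":
        state["fix_loop_iterations"] = state.get("fix_loop_iterations", 0) + 1

    warnings = []
    if trigger == "manual" and old_phase in PHASE_ORDER and new_phase in PHASE_ORDER:
        start = PHASE_ORDER.index(old_phase) + 1
        skipped = PHASE_ORDER[start:PHASE_ORDER.index(new_phase)]
        if skipped:
            # Warn only; the proposal explicitly does not block here.
            warnings.append("manual transition skipped: " + ", ".join(skipped))

    state["current_phase"] = new_phase
    return warnings
```

A configured maximum for `fix_loop_iterations` could be enforced at the same point if the workflow should fail closed.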


Open Questions

  1. Log storage: .claude/workflows/_log/ (project-local, committed) vs. optional central aggregation endpoint. Recommendation: local-first, add optional log_export_url config field later.

  2. False-positive annotation: recording a false-positive count (e.g. an adversary_findings_false_positives field, which is not among the S1 minimum fields) requires human judgement post-session. Options: (a) manual annotation in PR comment; (b) workflow.py annotate-finding <id> false-positive command with bash_gate requirement. Decision needed before S1 is implemented.

  3. LoC exclusions for generated files: Translation files (*.xcstrings, *.po) and codegen output should be excludable. Define loc_exclude_patterns as project-specific or part of standard config?

  4. AC-N enforcement on existing specs: Should edit_gate.py hard-block existing specs that lack the AC-N format, or only warn and reserve the hard block for new specs? Recommendation: warn for existing, block for new.

  5. AMBIGUOUS verdict resolution rule: If 4/5 findings are FIXED/ACCEPT and 1 is DEFER, is the verdict AMBIGUOUS or BROKEN? Suggest: AMBIGUOUS only if all findings are FIXED or ACCEPT; BROKEN if any finding remains DEFER with no explicit user override.
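For question 5, the suggested rule can be written down as a small function to make its edge cases easier to discuss. This is only a sketch of the suggestion, not a decided behaviour: `resolve_verdict` and `deferred_override` are hypothetical names, any status other than FIXED, ACCEPT, or DEFER is treated as BROKEN, and sessions with no findings are out of scope because the rule does not cover them.

```python
RESOLVED = {"FIXED", "ACCEPT"}


def resolve_verdict(statuses, deferred_override: bool = False) -> str:
    """Apply the suggested rule to the statuses of all findings in a session."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("the suggested rule only covers sessions with findings")
    if all(status in RESOLVED for status in statuses):
        return "AMBIGUOUS"
    # Deferred findings are tolerated only with an explicit user override.
    only_deferred_left = all(status in RESOLVED or status == "DEFER" for status in statuses)
    if only_deferred_left and deferred_override:
        return "AMBIGUOUS"
    return "BROKEN"
```

Under this reading, the 4/5 example in question 5 resolves to BROKEN unless the user explicitly overrides the deferred finding.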


Source Issues
