feat: swamp-native multi-model eval extensions#1162

Draft
stack72 wants to merge 1 commit into main from
feat/swamp-ci-eval-extensions

Conversation

Contributor

@stack72 stack72 commented Apr 11, 2026

Summary

Ports the multi-model-eval GitHub Actions workflow (.github/workflows/multi-model-eval.yml) to a native swamp workflow driven by extension models, method-scope reports, and a workflow-scope cross-model analysis report.

This PR adds the extensions (models and reports). The workflow definition and model instances that use them are not committed (both /workflows/ and /models/ are gitignored), but manual setup instructions are included below.

What's included

  • extensions/models/ci_git.ts — @swamp/ci/git: clone / checkout / fetch / diff / clean
  • extensions/models/ci_promptfoo_eval.ts — @swamp/ci/promptfoo-eval: setupNpm + run methods
  • extensions/reports/ci_eval_result.ts — @swamp/ci/eval-result: per-model method-scope summary
  • extensions/reports/ci_eval_analysis.ts — @swamp/ci/eval-analysis: workflow-scope cross-model comparison

Design choices

1. Extension scope

Each responsibility got its own extension rather than a monolithic "CI" model:

  • @swamp/ci/git is a genuinely reusable primitive. Any swamp workflow that wants to operate on a checked-out git repository can use it — not just this eval workflow. It intentionally exposes the full set of read operations we'd expect (clone, checkout, fetch, diff) plus clean for teardown.
  • @swamp/ci/promptfoo-eval is specific to this use case (running a single model's skill-trigger eval and capturing structured results). It's narrow enough to actually be useful — the shape of its result resource lets downstream reports read it without parsing raw promptfoo JSON.
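The "shape of its result resource" mentioned above might look roughly like this sketch. The field names are my guesses inferred from the per-model summary columns (pass rate, tokens, cost, failed tests) and are not the extension's actual interface:

```typescript
// Hypothetical sketch of the structured result resource emitted by
// @swamp/ci/promptfoo-eval's run method. Field names are illustrative,
// inferred from the per-model summary described in this PR.
interface EvalResult {
  model: string;
  skipped: boolean;      // true when selected_model filtered this model out
  passed: number;
  failed: number;
  totalTokens: number;
  costUsd: number;
  failedTests: string[]; // test names, so reports never reparse raw promptfoo JSON
}

// A downstream report can derive the pass rate without touching promptfoo output.
function passRate(r: EvalResult): number {
  const total = r.passed + r.failed;
  return total === 0 ? 0 : r.passed / total;
}
```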

2. Report split: method-scope vs workflow-scope

The GitHub Actions version produces two kinds of summaries:

  • A per-job summary for each model (pass rate, tokens, cost, failed tests)
  • A final cross-model comparison job

I mirrored this exactly with two separate reports:

  • @swamp/ci/eval-result (method scope) runs automatically after each eval-runner.run call and renders the per-model table. Registered via the reports: [...] field on the model type, so it fires without any YAML ceremony in the workflow.
  • @swamp/ci/eval-analysis (workflow scope) runs once at the end via reports.require on the workflow. It deliberately skips the comparison when fewer than 2 models ran, returning a short explanation instead — a single-model "cross-model" table would be misleading.
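The "fewer than 2 models" guard reduces to a small pure check. This is a sketch of the logic, not the report's real swamp API; skipped models are excluded first, mirroring the skipped: true handling described below:

```typescript
// Sketch of the workflow-scope report's guard: only render the cross-model
// comparison when at least two models actually ran. Skipped models are
// filtered out before counting.
interface ModelRun { model: string; skipped: boolean; }

function crossModelVerdict(runs: ModelRun[]): { compare: boolean; reason?: string } {
  const ran = runs.filter(r => !r.skipped);
  if (ran.length < 2) {
    return {
      compare: false,
      reason: `only ${ran.length} model(s) ran; a cross-model table would be misleading`,
    };
  }
  return { compare: true };
}
```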

3. Per-model temp-dir isolation in run

The eval script (scripts/eval_skill_triggers_promptfoo.ts) writes to hardcoded paths: evals/promptfoo/promptfooconfig.yaml and evals/promptfoo/results.json. Running four models in parallel against one checkout would clobber both files.

Rather than bypass the script and duplicate its logic, the extension:

  1. Calls generate_config.ts directly (it prints YAML to stdout) and captures the output into a per-model temp path
  2. Invokes npx promptfoo eval -c <tempDir>/promptfooconfig.yaml -o <tempDir>/results.json with cwd set to the shared evals/promptfoo/ directory so node_modules is found
  3. Parses results.json from the temp path and cleans up in a finally block
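The three steps above can be sketched as path plumbing plus a finally-guarded temp dir. This is a sketch under stated assumptions: the actual child-process calls (generate_config.ts, npx promptfoo) are elided, and only the isolation logic is shown:

```typescript
import { join } from "node:path";
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";

// Per-model isolation sketch: the generated config and results.json live in
// a per-model temp dir, while cwd stays in the shared evals/promptfoo/
// checkout so node_modules resolves.
function promptfooArgs(tempDir: string): string[] {
  return [
    "promptfoo", "eval",
    "-c", join(tempDir, "promptfooconfig.yaml"),
    "-o", join(tempDir, "results.json"),
  ];
}

// Run fn with a fresh per-model temp dir, cleaning up even on failure.
function withModelTempDir<T>(model: string, fn: (tempDir: string) => T): T {
  const tempDir = mkdtempSync(join(tmpdir(), `eval-${model}-`));
  try {
    return fn(tempDir);
  } finally {
    rmSync(tempDir, { recursive: true, force: true });
  }
}
```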

The tradeoff: we're skipping the script's API-key preflight and threshold check. The API key check is replaced by detecting "no results.json produced" in the extension. The threshold check isn't needed because the cross-model analysis report applies its own threshold.

4. Shared node_modules via a setup job

A naive implementation would npm install inside each parallel run call, racing four concurrent installs on the same node_modules tree. Instead there's an explicit setupNpm method that runs once (as a single-step setup-npm job that run-evals depends on), populating evals/promptfoo/node_modules before the parallel evals launch. Each eval then uses that shared directory.

5. Raw driver, not docker

The workflow runs in raw mode (the default). The docker driver currently isolates each step in its own container with no shared filesystem, which doesn't fit this workflow's shared-state pattern (checkout + node_modules + parallel evals reading the same directory tree). Making docker work would have required:

  • Explicit volume mounts
  • Identical host/container paths so the same path string is valid in both modes
  • A workspace primitive that swamp doesn't have yet
  • Passing host env vars through driverConfig.env (which needs has() CEL guards to avoid errors on unset keys)

Rather than ship those workarounds, I opened a feature request (#83 below) describing what swamp would need to make CI-shaped workflows a first-class docker use case. Until then, raw mode is the right default for this workflow.

6. Model filtering via workflow input, handled in-extension

The workflow accepts a selected_model input (default all) to let you run just one model. Rather than dynamically filtering the model_configs array for forEach (which requires a CEL-generated array, awkward to implement cleanly with command/shell), the extension itself checks the selectedModel arg against its model arg and returns an early "skipped" result if they don't match. The per-model and cross-model reports both handle skipped: true correctly — skipped models don't count toward the pass/fail verdict.
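The in-extension filter is a short-circuit at the top of run. A sketch of the check (arg names follow the description above; the real extension presumably returns a full result resource rather than this minimal object):

```typescript
// Sketch of the in-extension model filter: compare the workflow's
// selectedModel input against this step's model arg and return a skipped
// result instead of filtering the forEach array in CEL.
function maybeSkip(selectedModel: string, model: string):
    { model: string; skipped: true } | null {
  if (selectedModel !== "all" && selectedModel !== model) {
    return { model, skipped: true }; // reports exclude this from the verdict
  }
  return null; // proceed with the real eval
}
```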

7. dataOutputOverrides.vary for forEach isolation

Each eval-runner.run call writes a result resource. Without a vary dimension, all four parallel calls would overwrite the same data artifact. With vary: [model], each model's result is stored under its own versioned namespace so the analysis report can find all of them via stepExecutions[].dataHandles.
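Conceptually, vary folds the model into the artifact's storage key, so parallel writes land in distinct namespaces. This is illustrative only — swamp's real key scheme isn't shown in this PR:

```typescript
// Illustrative sketch of why vary: [model] prevents collisions: the vary
// dimensions become part of the data artifact's key. Without them, all four
// parallel run calls would write to the same "result" key.
function artifactKey(specName: string, vary: Record<string, string>): string {
  const dims = Object.entries(vary)
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([k, v]) => `${k}=${v}`)
    .join(",");
  return dims ? `${specName}[${dims}]` : specName;
}
```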

Related issues

  • swamp.club/lab/81 — Fixed during development. executeReports() was calling registry.getAll() which didn't include lazy-loaded user extension reports, so @swamp/ci/eval-analysis silently didn't run. The fix ensures required reports are promoted from lazy to fully-loaded before filtering.
  • swamp.club/lab/83 — Feature request. Describes the shared-state gap in the docker driver that blocked us from running this workflow in docker mode. References this workflow as the concrete use case.

Manual setup for reviewers

The model instances and workflow YAML are gitignored, so to try this locally:

  1. Create models/@swamp/ci-swamp-repo/<uuid>.yaml:

    type: '@swamp/ci/git'
    typeVersion: 2026.04.10.1
    id: <uuid>
    name: swamp-repo
    version: 1
    tags: { category: ci, purpose: eval }
    globalArguments:
      url: https://github.com/systeminit/swamp
    methods: {}
  2. Create models/@swamp/ci-eval-runner/<uuid>.yaml:

    type: '@swamp/ci/promptfoo-eval'
    typeVersion: 2026.04.10.1
    id: <uuid>
    name: eval-runner
    version: 1
    tags: { category: ci, purpose: eval }
    globalArguments: {}
    methods: {}
  3. Create workflows/workflow-<uuid>.yaml with jobs: checkout → setup-npm → run-evals (forEach over model configs, allowFailure: true, dataOutputOverrides.vary: [model]) → cleanup. Workflow-level reports.require: ["@swamp/ci/eval-analysis"].

  4. Export ANTHROPIC_API_KEY (and optionally the others) in your shell.

  5. Run:

    swamp workflow run multi-model-eval --input selected_model=sonnet
    swamp workflow run multi-model-eval   # all models
    

Test plan

  • Extensions compile and register (verified via swamp report describe @swamp/ci/eval-analysis and swamp report describe @swamp/ci/eval-result)
  • Single-model run (sonnet) produces per-model result data and per-model report
  • Single-model run correctly skips the cross-model comparison section
  • Multi-model run produces separate result resources per model via vary
  • Cross-model analysis report reads all results and renders the comparison table
  • Skipped models (unselected) appear as skipped: true and don't affect the verdict
  • cleanup job runs even when run-evals fails (dependency condition completed)
  • Re-run after npm cache is warm to confirm setup-npm is fast (depends on reviewer)

🤖 Generated with Claude Code

Adds extension models and reports to reproduce the multi-model-eval
GitHub Actions workflow as a native swamp workflow. Rather than
orchestrating evals through GHA matrix jobs and upload/download
artifact steps, the workflow uses swamp's forEach parallelism, data
artifacts for cross-step communication, and the report system for the
cross-model comparison summary.

New extensions:
- @swamp/ci/git: clone, checkout, fetch, diff, clean
- @swamp/ci/promptfoo-eval: setupNpm (one-time install) and run (per-model
  eval with isolated temp config/results paths so parallel invocations
  don't collide)
- @swamp/ci/eval-result (method scope): per-model summary that runs
  after each eval step
- @swamp/ci/eval-analysis (workflow scope): cross-model comparison that
  only renders when 2+ models ran

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor Author

stack72 commented Apr 11, 2026

Associated workflow:

id: 8a88a569-4620-431c-9028-643df0118c72
name: multi-model-eval
description: >-
  Run skill trigger evals across multiple LLM models in parallel and produce
  a cross-model comparison report. Replaces the GitHub Actions workflow
  .github/workflows/multi-model-eval.yml with native swamp orchestration.
version: 1
tags:
  category: ci
  purpose: eval

trigger:
  schedule: "0 8 * * 6"

# This workflow runs in raw mode. The docker driver currently isolates each
# step in its own container, which doesn't fit this workflow's shared-state
# pattern (checkout + shared node_modules + parallel evals reading the same
# filesystem). See swamp.club/lab/83 for the feature request tracking a
# workspace primitive / session-mode docker driver that would make this work.

inputs:
  selected_model:
    type: string
    description: >-
      Model alias to evaluate, or 'all' for every model.
      Valid: sonnet, opus, gpt-5.4, gemini-2.5-pro, all
    default: all
  model_configs:
    type: array
    description: >-
      Full model configuration list. Each step checks the 'selected_model'
      input and skips if it doesn't match.
    default:
      - model: sonnet
        concurrency: 20
      - model: opus
        concurrency: 20
      - model: gpt-5.4
        concurrency: 1
      - model: gemini-2.5-pro
        concurrency: 20

reports:
  require:
    - "@swamp/ci/eval-analysis"

jobs:
  - name: checkout
    description: Clone the swamp repository into the shared workspace
    steps:
      - name: clone-repo
        description: Shallow clone of the swamp repository
        task:
          type: model_method
          modelIdOrName: swamp-repo
          methodName: clone
          inputs:
            ref: main
            depth: 1
        dependsOn: []
        weight: 0
    dependsOn: []
    weight: 0

  - name: setup-npm
    description: Install promptfoo dependencies once before parallel evals
    dependsOn:
      - job: checkout
        condition:
          type: succeeded
    steps:
      - name: npm-install
        description: Run npm install in evals/promptfoo once
        task:
          type: model_method
          modelIdOrName: eval-runner
          methodName: setupNpm
          inputs:
            workDir: ${{ data.latest('swamp-repo', 'repository').attributes.path }}
        dependsOn: []
        weight: 0
    weight: 0

  - name: run-evals
    description: Run promptfoo skill trigger evals in parallel for each model
    dependsOn:
      - job: setup-npm
        condition:
          type: succeeded
    steps:
      - name: eval-${{ self.model_config.model }}
        description: Run eval for model ${{ self.model_config.model }}
        forEach:
          item: model_config
          in: ${{ inputs.model_configs }}
        task:
          type: model_method
          modelIdOrName: eval-runner
          methodName: run
          inputs:
            workDir: ${{ data.latest('swamp-repo', 'repository').attributes.path }}
            model: ${{ self.model_config.model }}
            concurrency: ${{ self.model_config.concurrency }}
            selectedModel: ${{ inputs.selected_model }}
        dataOutputOverrides:
          - specName: result
            vary:
              - model
        allowFailure: true
        dependsOn: []
        weight: 0
    weight: 0

  - name: cleanup
    description: Remove the cloned repository from the shared workspace
    dependsOn:
      - job: run-evals
        condition:
          type: completed
    steps:
      - name: remove-checkout
        description: Clean up the workspace directory
        task:
          type: model_method
          modelIdOrName: swamp-repo
          methodName: clean
          inputs:
            path: ${{ data.latest('swamp-repo', 'repository').attributes.path }}
        dependsOn: []
        weight: 0
    weight: 0

model for git:

type: '@swamp/ci/git'
typeVersion: 2026.04.10.1
id: 341a3712-04d2-4335-8dd0-23cbe8e24250
name: swamp-repo
version: 1
tags:
  category: ci
  purpose: eval
globalArguments:
  url: https://github.com/systeminit/swamp
methods: {}

model for eval-runner:

type: '@swamp/ci/promptfoo-eval'
typeVersion: 2026.04.10.1
id: 6e4b06bd-0fff-4c31-9fbe-754bb067cd05
name: eval-runner
version: 1
tags:
  category: ci
  purpose: eval
globalArguments: {}
methods: {}
