feat: swamp-native multi-model eval extensions #1162
Adds extension models and reports to reproduce the multi-model-eval GitHub Actions workflow as a native swamp workflow. Rather than orchestrating evals through GHA matrix jobs and upload/download-artifact steps, the workflow uses swamp's forEach parallelism, data artifacts for cross-step communication, and the report system for the cross-model comparison summary.

New extensions:

- `@swamp/ci/git`: clone, checkout, fetch, diff, clean
- `@swamp/ci/promptfoo-eval`: `setupNpm` (one-time install) and `run` (per-model eval with isolated temp config/results paths so parallel invocations don't collide)
- `@swamp/ci/eval-result` (method scope): per-model summary that runs after each eval step
- `@swamp/ci/eval-analysis` (workflow scope): cross-model comparison that only renders when 2+ models ran

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

Ports the `multi-model-eval` GitHub Actions workflow (`.github/workflows/multi-model-eval.yml`) to a native swamp workflow driven by extension models, method-scope reports, and a workflow-scope cross-model analysis report.

This PR adds the extensions (models and reports). The workflow definition and model instances that use them are not committed (both `/workflows/` and `/models/` are gitignored), but manual setup instructions are included below.

## What's included

- `extensions/models/ci_git.ts` (`@swamp/ci/git`): clone / checkout / fetch / diff / clean
- `extensions/models/ci_promptfoo_eval.ts` (`@swamp/ci/promptfoo-eval`): `setupNpm` + `run` methods
- `extensions/reports/ci_eval_result.ts` (`@swamp/ci/eval-result`): per-model method-scope summary
- `extensions/reports/ci_eval_analysis.ts` (`@swamp/ci/eval-analysis`): workflow-scope cross-model comparison

## Design choices
### 1. Extension scope

Each responsibility got its own extension rather than a monolithic "CI" model:

- `@swamp/ci/git` is a genuinely reusable primitive. Any swamp workflow that wants to operate on a checked-out git repository can use it, not just this eval workflow. It intentionally exposes the full set of read operations we'd expect (clone, checkout, fetch, diff) plus `clean` for teardown.
- `@swamp/ci/promptfoo-eval` is specific to this use case (running a single model's skill-trigger eval and capturing structured results). It's narrow enough to actually be useful: the shape of its `result` resource lets downstream reports read it without parsing raw promptfoo JSON.

### 2. Report split: method-scope vs workflow-scope
The GitHub Actions version produces two kinds of summaries: a per-model table in each matrix job, and a cross-model comparison at the end. I mirrored this with two separate reports:

- `@swamp/ci/eval-result` (method scope) runs automatically after each `eval-runner.run` call and renders the per-model table. It is registered via the `reports: [...]` field on the model type, so it fires without any YAML ceremony in the workflow.
- `@swamp/ci/eval-analysis` (workflow scope) runs once at the end via `reports.require` on the workflow. It deliberately skips the comparison when fewer than 2 models ran, returning a short explanation instead; a single-model "cross-model" table would be misleading.

### 3. Per-model temp-dir isolation in `run`

The eval script (`scripts/eval_skill_triggers_promptfoo.ts`) writes to hardcoded paths: `evals/promptfoo/promptfooconfig.yaml` and `evals/promptfoo/results.json`. Running four models in parallel against one checkout would clobber both files.

Rather than bypass the script and duplicate its logic, the extension:

- runs `generate_config.ts` directly (it prints YAML to stdout) and captures the output into a per-model temp path
- runs `npx promptfoo eval -c <tempDir>/promptfooconfig.yaml -o <tempDir>/results.json` with `cwd` set to the shared `evals/promptfoo/` directory so `node_modules` is found
- reads `results.json` from the temp path and cleans up in a `finally` block

The tradeoff: we're skipping the script's API-key preflight and threshold check. The API-key check is replaced by detecting "no results.json produced" in the extension. The threshold check isn't needed because the cross-model analysis report applies its own threshold.
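The isolation flow above can be sketched roughly as follows. This is a simplified illustration, not the committed extension: the `runEval`/`buildEvalCommand` names, the `npx tsx` invocation of `generate_config.ts`, and its argument list are all assumptions.

```typescript
// Hypothetical sketch of the per-model isolation in `run`; the real
// implementation lives in extensions/models/ci_promptfoo_eval.ts.
import { mkdtempSync, rmSync, writeFileSync, readFileSync, existsSync } from "node:fs";
import { execFileSync } from "node:child_process";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Pure helper: promptfoo invocation pointed at per-model temp paths.
export function buildEvalCommand(tempDir: string): string[] {
  return [
    "promptfoo", "eval",
    "-c", join(tempDir, "promptfooconfig.yaml"),
    "-o", join(tempDir, "results.json"),
  ];
}

export function runEval(model: string, sharedDir: string): unknown {
  const tempDir = mkdtempSync(join(tmpdir(), `eval-${model}-`));
  try {
    // 1. Capture generate_config.ts stdout into a per-model config file
    //    (runner and script path/args are assumptions).
    const yaml = execFileSync("npx", ["tsx", "generate_config.ts", model], { encoding: "utf8" });
    writeFileSync(join(tempDir, "promptfooconfig.yaml"), yaml);
    // 2. Run promptfoo with cwd = shared evals/promptfoo/ so node_modules resolves.
    execFileSync("npx", buildEvalCommand(tempDir), { cwd: sharedDir });
    // 3. "No results.json produced" stands in for the script's API-key preflight.
    const resultsPath = join(tempDir, "results.json");
    if (!existsSync(resultsPath)) throw new Error(`eval produced no results for ${model}`);
    return JSON.parse(readFileSync(resultsPath, "utf8"));
  } finally {
    rmSync(tempDir, { recursive: true, force: true }); // always clean up the temp dir
  }
}
```

Because each parallel invocation gets its own `mkdtemp` directory, the shared checkout's hardcoded paths are never touched.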
### 4. Shared `node_modules` via a setup job

A naive implementation would `npm install` inside each parallel `run` call, racing four concurrent installs on the same `node_modules` tree. Instead there's an explicit `setupNpm` method that runs once (as a single-step `setup-npm` job that `run-evals` depends on), populating `evals/promptfoo/node_modules` before the parallel evals launch. Each eval then uses that shared directory.

### 5. Raw driver, not docker
The workflow runs in raw mode (the default). The docker driver currently isolates each step in its own container with no shared filesystem, which doesn't fit this workflow's shared-state pattern (checkout + node_modules + parallel evals reading the same directory tree). Making docker work required, among other things:

- passing environment through `driverConfig.env` (which needs `has()` CEL guards to avoid errors on unset keys)

Rather than ship those workarounds, I opened a feature request (#83 below) describing what swamp would need to make CI-shaped workflows a first-class docker use case. Until then, raw mode is the right default for this workflow.
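For context, the `has()` guard is plain CEL; the pattern looks roughly like this. Everything here beyond `driverConfig.env` and `has()` is an assumption about the schema, and this workaround was deliberately not shipped:

```yaml
# Hypothetical sketch only; key names other than driverConfig.env are assumptions.
driverConfig:
  env:
    # CEL: has() avoids an evaluation error when the variable is unset
    ANTHROPIC_API_KEY: 'has(env.ANTHROPIC_API_KEY) ? env.ANTHROPIC_API_KEY : ""'
```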
### 6. Model filtering via workflow input, handled in-extension

The workflow accepts a `selected_model` input (default `all`) to let you run just one model. Rather than dynamically filtering the `model_configs` array for `forEach` (which requires a CEL-generated array, awkward to implement cleanly with `command`/`shell`), the extension itself checks the `selectedModel` arg against its `model` arg and returns an early "skipped" result if they don't match. The per-model and cross-model reports both handle `skipped: true` correctly: skipped models don't count toward the pass/fail verdict.

### 7. `dataOutputOverrides.vary` for forEach isolation

Each `eval-runner.run` call writes a `result` resource. Without a vary dimension, all four parallel calls would overwrite the same data artifact. With `vary: [model]`, each model's result is stored under its own versioned namespace, so the analysis report can find all of them via `stepExecutions[].dataHandles`.

## Related issues

`executeReports()` was calling `registry.getAll()`, which didn't include lazy-loaded user extension reports, so `@swamp/ci/eval-analysis` silently didn't run. The fix ensures required reports are promoted from lazy to fully-loaded before filtering.

## Manual setup for reviewers
The model instances and workflow YAML are gitignored, so to try this locally:

1. Create `models/@swamp/ci-swamp-repo/<uuid>.yaml`.
2. Create `models/@swamp/ci-eval-runner/<uuid>.yaml`.
3. Create `workflows/workflow-<uuid>.yaml` with jobs: `checkout` → `setup-npm` → `run-evals` (forEach over model configs, `allowFailure: true`, `dataOutputOverrides.vary: [model]`) → `cleanup`. Workflow-level `reports.require: ["@swamp/ci/eval-analysis"]`.
4. Export `ANTHROPIC_API_KEY` (and optionally the others) in your shell.
5. Run the workflow.
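Since the workflow file itself is gitignored, here is a hypothetical sketch of its shape, reconstructed from the job graph described above. Every key name beyond those mentioned in this PR (`forEach`, `allowFailure`, `dataOutputOverrides.vary`, `reports.require`) is an assumption; consult the swamp schema for the real field names:

```yaml
# workflows/workflow-<uuid>.yaml — hypothetical sketch, not the committed file
jobs:
  checkout: {}
  setup-npm:
    needs: [checkout]
  run-evals:
    needs: [setup-npm]
    forEach: model_configs
    allowFailure: true
    dataOutputOverrides:
      vary: [model]
  cleanup:
    needs: [run-evals]
    condition: completed   # runs even when run-evals fails
reports:
  require: ["@swamp/ci/eval-analysis"]
```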
## Test plan

- Both reports are registered and discoverable (`swamp report describe @swamp/ci/eval-analysis` and `swamp report describe @swamp/ci/eval-result`)
- Parallel eval results are isolated via `vary`
- Filtered-out models return `skipped: true` and don't affect the verdict
- `cleanup` job runs even when `run-evals` fails (dependency condition `completed`)

🤖 Generated with Claude Code