feat: add cross-model analysis step to eval workflow #1159
Conversation
Add a final analysis job that runs after all model evals complete. It downloads each model's `results.json`, produces a cross-model comparison, identifies shared vs model-specific failures, and writes a summary to the GitHub Actions step summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI Security Review
Medium
- `denoland/setup-deno@v2` tag-only pin (.github/workflows/multi-model-eval.yml:88): Third-party action pinned to a mutable tag rather than a full commit SHA. A compromised tag could deliver malicious code. However, this is a pre-existing pattern (the same action at line 49 is not changed in this PR) and `denoland` is an established trusted publisher per repo conventions. No action required for this PR, but consider SHA-pinning in a future cleanup pass.
Low
- Unscoped `--allow-write` Deno permission (.github/workflows/multi-model-eval.yml:98): The analysis step uses `--allow-write` without restricting the path. It could be tightened to `--allow-write=$GITHUB_STEP_SUMMARY` to follow least-privilege. Practical risk is negligible since the job has no secrets and runs on an ephemeral runner with `contents: read` only.
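A minimal sketch of what the tightened step could look like (the step name and `--allow-read` flag are assumptions for illustration, not the actual workflow content):

```yaml
# Hypothetical excerpt: write access scoped to the step-summary file only.
- name: Analyze results
  run: >-
    deno run --allow-read
    --allow-write="$GITHUB_STEP_SUMMARY"
    scripts/analyze_eval_results.ts
```

`$GITHUB_STEP_SUMMARY` expands to a per-step file path on the runner, so scoping `--allow-write` to it prevents the script from writing anywhere else.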
Verdict
PASS — The changes are security-clean. The new analysis job has properly scoped job-level permissions (contents: read), uses no secrets, processes only workflow-generated artifacts, and introduces no injection vectors. All new actions are GitHub-owned and appropriately pinned.
Code Review
Blocking Issues
None.
Suggestions
- Resilience to malformed results: In scripts/analyze_eval_results.ts:117, the destructuring `const { stats, results } = data.results` is outside the try/catch that handles missing files. If a model's eval job writes a partial or malformed `results.json` (e.g., `data.results` is undefined), this would crash the entire analysis rather than skipping that model. Consider wrapping the per-model processing block (lines 117–143) in its own try/catch with a `console.warn` + `continue`, matching the existing pattern for missing files.
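The suggested guard could look roughly like this. The interface and field names below are assumptions inferred from the review, not the actual script's types:

```typescript
// Sketch: wrap each model's processing in its own try/catch so one
// malformed results.json skips that model instead of crashing the run.
interface EvalPayload {
  results?: {
    stats?: { pass: number };
    results?: unknown[];
  };
}

function summarize(perModel: Map<string, EvalPayload>): string[] {
  const lines: string[] = [];
  for (const [model, data] of perModel) {
    try {
      const payload = data.results;
      if (payload === undefined || payload.stats === undefined || payload.results === undefined) {
        throw new Error("missing stats/results in results.json");
      }
      lines.push(`${model}: ${payload.results.length} cases, ${payload.stats.pass} passed`);
    } catch (err) {
      // Matches the existing missing-file pattern: warn and move on.
      console.warn(`skipping ${model}: ${(err as Error).message}`);
      continue;
    }
  }
  return lines;
}
```

A model with a well-formed payload contributes one summary line; a model with a missing or partial payload only produces a warning.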
Overall this is a clean, well-structured addition. The script follows project conventions (license header, no `any` types, proper `unknown` usage, named interfaces), the workflow permissions are appropriately scoped (`contents: read`), and the `always()` + artifact upload pattern correctly captures partial results from failed eval jobs. The `scripts/` exclusion in `deno.json` is consistent with existing scripts not having tests or type-check requirements.
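The `always()` + artifact-upload pattern praised here can be sketched as follows. Job, step, script, and artifact names are placeholders, not the actual workflow content:

```yaml
# Illustrative shape only: each eval job uploads its results.json even on
# failure, and the analysis job still runs afterwards with read-only perms.
jobs:
  eval:
    strategy:
      matrix:
        model: [model-a, model-b]   # placeholder model list
    runs-on: ubuntu-latest
    steps:
      - name: Run eval
        run: deno run --allow-read scripts/run_eval.ts   # hypothetical script
      - name: Upload results
        if: always()                # capture partial results from failed runs
        uses: actions/upload-artifact@v4
        with:
          name: results-${{ matrix.model }}
          path: results.json
  analysis:
    needs: eval
    if: always()                    # run even when some eval jobs failed
    permissions:
      contents: read                # least-privilege; no secrets needed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
```

Because `needs: eval` normally skips the downstream job when any upstream job fails, the `if: always()` on the analysis job is what lets it aggregate whatever results did get uploaded.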
Summary
- Adds an `analysis` job to the multi-model eval workflow that runs after all model evals complete
- Downloads each model's `results.json` as an artifact

Test plan
🤖 Generated with Claude Code