feat: add cross-model analysis step to eval workflow #1159
Conversation
Add a final analysis job that runs after all model evals complete. It downloads each model's `results.json`, produces a cross-model comparison, identifies shared vs model-specific failures, and writes a summary to the GitHub Actions step summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI Security Review
Medium
- `denoland/setup-deno@v2` tag-only pin (.github/workflows/multi-model-eval.yml:88): Third-party action pinned to a mutable tag rather than a full commit SHA. A compromised tag could deliver malicious code. However, this is a pre-existing pattern (the same action at line 49 is not changed in this PR) and `denoland` is an established trusted publisher per repo conventions. No action required for this PR, but consider SHA-pinning in a future cleanup pass.
Low
- Unscoped `--allow-write` Deno permission (.github/workflows/multi-model-eval.yml:98): The analysis step uses `--allow-write` without restricting the path. It could be tightened to `--allow-write=$GITHUB_STEP_SUMMARY` to follow least-privilege. Practical risk is negligible since the job has no secrets and runs on an ephemeral runner with `contents: read` only.
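A minimal sketch of what the tightened step could look like (the step name and `--allow-read` flag are assumptions for illustration, not the actual workflow content):

```yaml
# Hypothetical excerpt: write access scoped to the step-summary file only.
- name: Analyze results
  run: >-
    deno run --allow-read
    --allow-write="$GITHUB_STEP_SUMMARY"
    scripts/analyze_eval_results.ts
```

`$GITHUB_STEP_SUMMARY` expands to a per-step file path on the runner, so scoping `--allow-write` to it prevents the script from writing anywhere else.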
Verdict
PASS — The changes are security-clean. The new analysis job has properly scoped job-level permissions (contents: read), uses no secrets, processes only workflow-generated artifacts, and introduces no injection vectors. All new actions are GitHub-owned and appropriately pinned.
Code Review
Blocking Issues
None.
Suggestions
- Resilience to malformed results: In scripts/analyze_eval_results.ts:117, the destructuring `const { stats, results } = data.results` is outside the try/catch that handles missing files. If a model's eval job writes a partial or malformed `results.json` (e.g., `data.results` is undefined), this would crash the entire analysis rather than skipping that model. Consider wrapping the per-model processing block (lines 117–143) in its own try/catch with a `console.warn` + `continue`, matching the existing pattern for missing files.
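The suggested guard could look roughly like this. The interface and field names below are assumptions inferred from the review, not the actual script's types:

```typescript
// Sketch: wrap each model's processing in its own try/catch so one
// malformed results.json skips that model instead of crashing the run.
interface EvalPayload {
  results?: {
    stats?: { pass: number };
    results?: unknown[];
  };
}

function summarize(perModel: Map<string, EvalPayload>): string[] {
  const lines: string[] = [];
  for (const [model, data] of perModel) {
    try {
      const payload = data.results;
      if (payload === undefined || payload.stats === undefined || payload.results === undefined) {
        throw new Error("missing stats/results in results.json");
      }
      lines.push(`${model}: ${payload.results.length} cases, ${payload.stats.pass} passed`);
    } catch (err) {
      // Matches the existing missing-file pattern: warn and move on.
      console.warn(`skipping ${model}: ${(err as Error).message}`);
      continue;
    }
  }
  return lines;
}
```

A model with a well-formed payload contributes one summary line; a model with a missing or partial payload only produces a warning.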
Overall this is a clean, well-structured addition. The script follows project conventions (license header, no `any` types, proper `unknown` usage, named interfaces), the workflow permissions are appropriately scoped (`contents: read`), and the `always()` + artifact upload pattern correctly captures partial results from failed eval jobs. The `scripts/` exclusion in `deno.json` is consistent with existing scripts not having tests or type-check requirements.
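The `always()` + artifact-upload pattern praised here can be sketched as follows. Job, step, script, and artifact names are placeholders, not the actual workflow content:

```yaml
# Illustrative shape only: each eval job uploads its results.json even on
# failure, and the analysis job still runs afterwards with read-only perms.
jobs:
  eval:
    strategy:
      matrix:
        model: [model-a, model-b]   # placeholder model list
    runs-on: ubuntu-latest
    steps:
      - name: Run eval
        run: deno run --allow-read scripts/run_eval.ts   # hypothetical script
      - name: Upload results
        if: always()                # capture partial results from failed runs
        uses: actions/upload-artifact@v4
        with:
          name: results-${{ matrix.model }}
          path: results.json
  analysis:
    needs: eval
    if: always()                    # run even when some eval jobs failed
    permissions:
      contents: read                # least-privilege; no secrets needed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
```

Because `needs: eval` normally skips the downstream job when any upstream job fails, the `if: always()` on the analysis job is what lets it aggregate whatever results did get uploaded.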
Summary
- Adds an `analysis` job to the multi-model eval workflow that runs after all model evals complete
- Downloads each model's `results.json` as an artifact

Test plan
🤖 Generated with Claude Code