CLI for running LLM behavioral evaluation packs against any OpenAI-compatible endpoint, with deterministic verifier-backed scoring (no LLM-as-judge). Three pack families today:
- BenchLocal ports — tool-call · instruction-follow · structured output · numeric reasoning · data extraction · debug · multi-tool agent · CLI exec (8 packs from stevibe/BenchLocal, MIT-licensed).
- Eval-expansion track — additional packs vendored from upstream open-source benches. v0.9 ships aider-polyglot-30 (multi-language code editing via Aider-AI/aider's benchmark.py).
- Reasoning suite — opt-in --reasoning packs for code reasoning, symbolic math, and gated science QA: HumanEval+, LiveCodeBench v6, GSM-Symbolic, and GPQA-Diamond metadata.
Companion to club-3090 — primarily intended for measuring quality on quantized models served by club-3090's compose stack, but works against any OpenAI-compatible API.
We needed a headless, scriptable quality gate for compose-release validation on an inference rig. BenchLocal is a great Electron desktop app for human-in-the-loop quality A/B; this repo turns the same pack semantics into a CLI that:
- Hits any OpenAI-compatible HTTP endpoint
- Runs BenchLocal's 8 deterministic-verifier packs + agentic eval packs (currently 1: aider-polyglot-30) + the opt-in reasoning suite
- Supports --quick / --medium / --full budget modes for the BenchLocal packs (~30-45 min for --full) plus a separate --reasoning mode; agentic packs can also run independently via --pack <name>
- Outputs paste-ready markdown for benchmark tables + JSON for machine consumption
- Keeps quantized-model quality measurement light enough to run as a CI gate
🟢 Beta — full BenchLocal prompt fidelity, reasoning-model aware, sandbox-capable, plus eval-expansion track. JSONL packs are generated from vendored upstream TypeScript mirrors; deterministic packs use upstream system prompts and scenario prompts verbatim. Requests use pack-level default_thinking metadata so reasoning-rewarding packs can think while execution/format packs stay answer-only. BugFind-15, HermesAgent-20, CLI-40, AiderPolyglot-30, HumanEval+-30, and LiveCodeBench-v6-30 run through Docker-hosted HTTP verifier sandboxes when --enable-sandboxed-packs is set.
v0.9.0 added the eval-expansion track — aider-polyglot-30 ships as the first non-BenchLocal sandboxed pack: 30-exercise multi-language code-editing bench across cpp/go/java/javascript/python/rust, vendored upstream from Aider-AI/aider's benchmark.py. Run with --pack aider-polyglot-30 --enable-sandboxed-packs. See docs/AIDER_POLYGLOT_30.md.
| Mode | Packs | Budget | Use case |
|---|---|---|---|
| --quick | ToolCall-15 + InstructFollow-15 | ~10-15 min | Per-commit gate; pre-push smoke |
| --medium (default) | + StructOutput-15 + DataExtract-15 | ~25-30 min | Pre-release; pin bumps; new compose authoring |
| --full | + ReasonMath-15 + (BugFind / HermesAgent / CLI when sandboxed) | ~45-60 min | Cross-rig comparison; quality A/B vs another quant |
| --reasoning | HumanEval+-30 + LiveCodeBench-v6-30 + GPQA-Diamond (gated) + GSM-Symbolic-30 | ~30-90+ min; code packs need Docker | Dedicated reasoning/code suite; structured-CoT / no-think / thinking A/B |
| --pack aider-polyglot-30 | aider-polyglot-30 (independent — not bundled in --quick/--medium/--full/--reasoning) | ~15-25 min | Agentic code-editing signal; cross-model quality A/B for IDE-agent / coding workloads |
Pack selection in each mode follows Codex design-review feedback (2026-05-09) — ToolCall + InstructFollow are the primary signals for IDE-agent regressions; StructOutput catches grammar/JSON drift; ReasonMath defers to --full because it leans toward generic benchmark behavior rather than agent-stack-specific. --reasoning stays separate from --full because it changes the question from general behavior to code/math/science reasoning under larger thinking budgets. AiderPolyglot-30 is run independently because its harness is a batch runner with multi-turn edit/test loops — different shape from the per-scenario BenchLocal packs.
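The cumulative mode layout in the table above can be sketched as a small lookup. This is only an illustration: the pack IDs follow the JSONL filenames under benchlocal_cli/packs/, while the constant names and `packs_for_mode` are hypothetical and not taken from the CLI's source.

```python
# Pack IDs follow the JSONL filenames under benchlocal_cli/packs/.
# Constant and function names are hypothetical, for illustration only.
QUICK = ["toolcall-15", "instructfollow-15"]
MEDIUM_EXTRA = ["structoutput-15", "dataextract-15"]
FULL_EXTRA = ["reasonmath-15"]
FULL_SANDBOXED = ["bugfind-15", "hermesagent-20", "cli-40"]
REASONING = ["humaneval-plus-30", "lcb-v6-30", "gpqa-diamond", "gsm-symbolic-30"]


def packs_for_mode(mode: str = "medium", sandboxed: bool = False) -> list[str]:
    """Return the pack IDs a budget mode would run, per the mode table."""
    if mode == "quick":
        return list(QUICK)
    if mode == "medium":
        return QUICK + MEDIUM_EXTRA
    if mode == "full":
        packs = QUICK + MEDIUM_EXTRA + FULL_EXTRA
        # Execution-backed packs only join --full when sandboxing is enabled.
        return packs + FULL_SANDBOXED if sandboxed else packs
    if mode == "reasoning":
        # Separate suite; the two code packs additionally need the Docker sandbox.
        return list(REASONING)
    raise ValueError(f"unknown mode: {mode!r}")


print(packs_for_mode("full", sandboxed=True))
```

aider-polyglot-30 is deliberately absent from every list, matching its run-independently status.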
By default, packs sample at their declared per-pack temperature — the deterministic packs use temperature 0 (greedy) for reproducible, cross-rig-comparable scoring. This is the canonical baseline.
Two opt-in flags evaluate a model at a non-default temperature. Both tag the run ⚠ NON-CANONICAL (markdown header + JSON) and block --exit-on-regression (non-canonical runs shouldn't gate CI):
| Flag | Effect |
|---|---|
| --temperature N (+ --top-p / --top-k / --min-p / --repeat-penalty) | Override sampling with values you specify. |
| --sampling-from-server | Omit all sampling params from requests so the server applies its own configured defaults (e.g. a compose's --temp / --override-generation-config). Reads the actual values back via GET /props (llama.cpp) and records them as sampling_source: "server" + server_defaults. Mutually exclusive with --temperature et al. |
Use the canonical temp-0 default for regression tracking and cross-model ranking (fixed bar, reproducible). Use the override flags to evaluate a model as it's served / at its recommended temperature — e.g. reasoning or exploratory fine-tunes that recommend temp 0.75–1, where greedy decoding under-represents what the model was tuned for.
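A minimal sketch of the three sampling modes described above, assuming they resolve to a dict of request parameters plus a canonical flag. The function name and the "pack" / "override" source labels are hypothetical; only sampling_source: "server" appears in the documented output.

```python
from typing import Optional


def resolve_sampling(pack_temperature: float = 0.0,
                     override: Optional[dict] = None,
                     from_server: bool = False) -> dict:
    """Pick the sampling params to send and say whether the run is canonical."""
    if override and from_server:
        # Mirrors the documented mutual exclusion of the two opt-in flags.
        raise ValueError("--sampling-from-server cannot be combined with --temperature et al.")
    if from_server:
        # Send no sampler keys so the server applies its own defaults.
        return {"params": {}, "sampling_source": "server", "canonical": False}
    if override:
        return {"params": dict(override), "sampling_source": "override", "canonical": False}
    # Default path: the pack's declared temperature (0 for deterministic packs).
    return {"params": {"temperature": pack_temperature},
            "sampling_source": "pack", "canonical": True}


run = resolve_sampling(override={"temperature": 0.8, "top_p": 0.95})
# Non-canonical runs should not be allowed to gate CI.
may_gate_ci = run["canonical"]
print(run["sampling_source"], may_gate_ci)
```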
| Pack | Verifier type | Status |
|---|---|---|
| ToolCall-15 | Deterministic — per-scenario asserts on JSON tool-calls | ✅ vendor-generated |
| InstructFollow-15 | Deterministic — constraint validators | ✅ vendor-generated |
| StructOutput-15 | Deterministic — JSON / CSV / markdown / YAML-lite validate | ✅ vendor-generated |
| ReasonMath-15 | Deterministic — numeric/string/regex compare | ✅ vendor-generated |
| DataExtract-15 | Deterministic — JSON field-match | ✅ vendor-generated |
| BugFind-15 | Execution-backed — candidate-fix verifier sandbox | ✅ sandboxed v0.4 verifier |
| HermesAgent-20 | Multi-tool harness — browser/cron/memory/artifact mocks | ✅ sandboxed v0.4 verifier |
| CLI-40 | Linux exec sandbox — command verifier sandbox | ✅ sandboxed v0.4 verifier |
| AiderPolyglot-30 | Multi-language edit/test harness — wraps upstream Aider-AI/aider benchmark.py over 30 curated exercises (cpp / go / java / js / python / rust, 5 each) | ✅ sandboxed v0.9 (single-scoreboard) |
| HumanEval+-30 | Execution-backed code reasoning — HumanEval+ functional tests via the code-reasoning sandbox | ✅ sandboxed reasoning subset |
| LiveCodeBench-v6-30 | Execution-backed code reasoning — public LCB functional tests via the code-reasoning sandbox | ✅ sandboxed reasoning subset |
| GSM-Symbolic-30 | Deterministic — answer_match exact numeric final-answer scoring | ✅ reasoning subset |
| GPQA-Diamond | Deterministic — answer_match exact letter final-answer scoring | ⚠ gated metadata-only; no restricted data committed |
The sandboxed packs in the table above (BugFind-15, HermesAgent-20, CLI-40, AiderPolyglot-30, HumanEval+-30, LiveCodeBench-v6-30) run their verifier inside a Docker container that calls back out to your model endpoint. The networking gotcha: localhost inside the container is the container's own loopback, not the host. The CLI handles this automatically in most cases:
- Loopback endpoints (localhost, 127.x, [::1], [::]) — auto-resolved to host.docker.internal, and --add-host=host.docker.internal:host-gateway is injected into the sandbox container. No env var needed. Works out of the box for --endpoint http://localhost:PORT.
- Non-loopback endpoints (LAN IPs, k8s service names, docker-compose service DNS) — passed through verbatim. Assumes you've set up networking so the sandbox container can resolve and reach the host. The container's own DNS is used.
- Force the host-gateway rewrite for non-loopback — if you have a custom hostname that actually needs the rewrite (e.g., the name resolves on the host but not inside containers), set BENCHLOCAL_HERMES_RESOLVE_LOCALHOST=1. This forces the same rewrite + --add-host for hermes-style packs regardless of host type. Aider always uses the rewrite; this flag controls the hermes/cli-class behavior for non-loopback hosts.
If a sandboxed pack scores 0/N with uniform short latencies, networking is the first thing to check. See docs/SANDBOX_PROTOCOL.md for per-pack protocol details and docs/PACK_FORMAT.md for metadata schema.
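The loopback rewrite described above can be sketched with the standard library. This is an assumption about the logic, not the CLI's actual code: `is_loopback` and `rewrite_for_sandbox` are hypothetical names, the `force` argument stands in for BENCHLOCAL_HERMES_RESOLVE_LOCALHOST=1, and URL userinfo is ignored for brevity.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit

GATEWAY_HOST = "host.docker.internal"
ADD_HOST_FLAG = "--add-host=host.docker.internal:host-gateway"


def is_loopback(host: str) -> bool:
    """True for localhost, 127.x, ::1 and :: (the hosts listed above)."""
    if host.lower() == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name, not an IP literal
    return ip.is_loopback or ip == ipaddress.ip_address("::")


def rewrite_for_sandbox(endpoint: str, force: bool = False) -> tuple[str, list[str]]:
    """Return (endpoint as seen from the container, extra docker run flags)."""
    parts = urlsplit(endpoint)
    host = parts.hostname or ""  # brackets around IPv6 literals are stripped
    if not (force or is_loopback(host)):
        return endpoint, []  # non-loopback: pass through verbatim
    netloc = GATEWAY_HOST if parts.port is None else f"{GATEWAY_HOST}:{parts.port}"
    rewritten = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return rewritten, [ADD_HOST_FLAG]


print(rewrite_for_sandbox("http://localhost:8020"))
print(rewrite_for_sandbox("http://192.168.1.50:8020"))
```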
Each scenario's timeout is sized by precedence (highest wins):
- --timeout-per-case N (env TIMEOUT_PER_CASE) — explicit override, used verbatim.
- Auto-scaling (default) — timeout = base × max(1, reference_tps / measured_tps) × thinking_multiplier:
  - base = the pack's default_max_seconds metadata.
  - reference_tps = the pack's timeout_reference_tps (the decode rate base assumes; override with --reference-tps).
  - measured_tps = a one-shot startup decode-TPS probe of the endpoint (sent with enable_thinking=false; skip it by passing --measured-tps N). The probe runs a reachability preflight (GET /v1/models, 5s, no retry) and fails fast — it never hangs a run against a dead or blackholed endpoint.
  - thinking_multiplier = thinking_max_tokens / nominal_max_tokens, applied only when thinking is enabled and the budget exceeds the nominal output. Prevents thinking-on runs from spuriously timing out (#54).
  - max(1, …) means a faster rig never shrinks the budget below base. The result deliberately over-budgets — a timeout is a ceiling, not a target.
- Static default — the pack's default_max_seconds, when no reference_tps is set or the probe is unavailable.
A timeout is not retried as transient (a timeout means the budget was genuinely hit); connection errors and HTTP 5xx still retry. --retry-on-timeout (default off) restores the old retry behavior.
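As a worked sketch of the precedence and formula above: the function names, the error-kind labels, and the example numbers (120 s base, 40 vs 20 tokens/s, a 4096-token nominal output) are illustrative assumptions, not values read from any pack.

```python
from typing import Optional


def scenario_timeout(base: float,
                     reference_tps: Optional[float] = None,
                     measured_tps: Optional[float] = None,
                     thinking_enabled: bool = False,
                     thinking_max_tokens: int = 16384,
                     nominal_max_tokens: int = 4096,
                     override: Optional[float] = None) -> float:
    """Size one scenario's timeout in seconds, highest-precedence rule first."""
    if override is not None:
        return float(override)            # --timeout-per-case: used verbatim
    if not reference_tps or not measured_tps:
        return float(base)                # static default: no scaling inputs
    speed_factor = max(1.0, reference_tps / measured_tps)  # never below base
    thinking_multiplier = 1.0
    if thinking_enabled and thinking_max_tokens > nominal_max_tokens:
        thinking_multiplier = thinking_max_tokens / nominal_max_tokens
    return base * speed_factor * thinking_multiplier


def should_retry(kind: str, retry_on_timeout: bool = False) -> bool:
    """Timeouts are final unless --retry-on-timeout; transient errors retry."""
    if kind == "timeout":
        return retry_on_timeout
    return kind in {"connection_error", "http_5xx"}  # hypothetical labels


# Half-speed rig with thinking on: 120 * (40 / 20) * (16384 / 4096)
print(scenario_timeout(120, reference_tps=40, measured_tps=20, thinking_enabled=True))  # 960.0
# Faster rig: max(1, 40 / 80) keeps the budget at base
print(scenario_timeout(120, reference_tps=40, measured_tps=80))  # 120.0
```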
benchlocal_cli/ # Python package (CLI entry point + runner)
├── __init__.py
├── cli.py # `benchlocal-cli run …` entry point
├── runner.py # core: dispatch packs, score, aggregate, output
├── sandbox.py # SandboxClient — Docker lifecycle for sandboxed packs
├── types.py # ScenarioRun / ScenarioResult / PackRun shapes
├── scoring/ # verifier modules (one per deterministic pack)
│ ├── tool_call.py
│ ├── instruct_follow.py
│ ├── struct_output.py
│ ├── reason_math.py
│ ├── data_extract.py
│ └── _stub.py # dispatch to Docker verifier when --enable-sandboxed-packs
└── packs/ # vendored JSONL pack data
├── toolcall-15.jsonl
├── instructfollow-15.jsonl
├── structoutput-15.jsonl
├── reasonmath-15.jsonl
├── dataextract-15.jsonl
├── bugfind-15.jsonl
├── hermesagent-20.jsonl
├── cli-40.jsonl
├── aider-polyglot-30.jsonl
├── humaneval-plus-30.jsonl
├── lcb-v6-30.jsonl
├── gsm-symbolic-30.jsonl
└── gpqa-diamond.jsonl
sandboxes/ # Docker images for execution-backed verifier packs
├── bugfind/ # Python pytest harness for BugFind-15 candidate fixes
├── cli/ # Linux exec sandbox for CLI-40 commands
├── hermes/ # Hermes-agent runtime + Node grader for HermesAgent-20
├── aider-polyglot/ # Aider + polyglot-benchmark for AiderPolyglot-30
└── code-reasoning/ # Python execution sandbox for HumanEval+ and LCB
vendor/ # vendored upstream sources for pack generation
├── ToolCall-15/ … # one dir per BenchLocal pack (TypeScript mirror)
└── AiderPolyglot-30/ # exercise manifest + sync metadata
tools/
├── build-packs.js # generates JSONL packs from vendor/ TypeScript mirrors
├── build-sandboxes.sh # builds the Docker images under sandboxes/
└── sync-vendor.sh # bumps vendored upstream pin
tests/ # pytest unit tests (33+ tests; runs against the JSONL packs)
docs/
├── AIDER_POLYGLOT_30.md # aider-polyglot-30 pack details + cross-rig run guide
├── DESIGN.md # design rationale (why these choices)
├── EXTRACTOR_NOTES.md # how vendor/ → JSONL extraction works per pack
├── HERMES_V073_AB.md # forensic notes from the Hermes A/B run
├── INTEGRATION.md # how club-3090 (or other repos) consume this CLI
├── PACK_FORMAT.md # JSONL schema each pack file follows
├── SANDBOX_PROTOCOL.md # HTTP protocol the sandboxed packs implement
└── VENDOR_SYNC.md # how to bump vendored upstream pins
# install
pip install -e .
# install with sandbox dependencies and build verifier images
pip install -e '.[sandbox]'
bash tools/build-sandboxes.sh
# list available packs
benchlocal-cli list
# run quick mode against a local club-3090 endpoint; pack metadata picks thinking on/off
benchlocal-cli run --quick --endpoint http://localhost:8020 --model qwen3.6-27b-autoround
# force reasoning/thinking enabled for every pack with a larger token budget
benchlocal-cli run --quick --endpoint http://localhost:8020 --model qwen3.6-27b-autoround --enable-thinking
# force answer-only mode for every pack, ignoring pack defaults
benchlocal-cli run --quick --endpoint http://localhost:8020 --model qwen3.6-27b-autoround --no-thinking
# pass vendor-specific request body fields
benchlocal-cli run --quick --endpoint http://localhost:8020 --model qwen3.6-27b-autoround --extra-body '{"foo":"bar"}'
# run full mode with custom timeout per scenario
benchlocal-cli run --full --endpoint http://localhost:8010 --model qwen3.6-27b-autoround --timeout-per-case 60
# run full mode including Docker-backed verifier packs
benchlocal-cli run --full --enable-sandboxed-packs --endpoint http://localhost:8010 --model qwen3.6-27b-autoround
# run the dedicated reasoning suite; HumanEval+ and LCB need the Docker code sandbox
benchlocal-cli run --reasoning --enable-sandboxed-packs \
--endpoint http://localhost:8020 --model qwen3.6-27b-autoround \
--thinking-max-tokens 16384
# run a single deterministic pack with detailed per-scenario output
benchlocal-cli run --pack toolcall-15 --endpoint http://localhost:8020 --model qwen3.6-27b-autoround
# run aider-polyglot-30 (multi-language code editing — not bundled in --quick/--medium/--full)
benchlocal-cli run --pack aider-polyglot-30 --enable-sandboxed-packs \
--endpoint http://localhost:8010 --model qwen3.6-27b-autoround \
--timeout-per-case 2700
# emit machine-readable JSON instead of markdown
benchlocal-cli run --quick --endpoint http://localhost:8020 --model qwen3.6-27b-autoround --output json > results.json

benchlocal-cli uses each pack's default_thinking metadata by default. Reasoning-rewarding packs such as reasonmath-15, bugfind-15, instructfollow-15, hermesagent-20, and every --reasoning pack run with chat_template_kwargs.enable_thinking=true; execution/format packs such as toolcall-15, structoutput-15, dataextract-15, and cli-40 run answer-only. Use --enable-thinking to force thinking on for every pack, or --no-thinking to force it off for every pack.

Whenever thinking is enabled for a pack, request max_tokens is raised to --thinking-max-tokens (default 16384) and the request uses the recommended thinking sampler (temperature=1.0, top_p=0.95, top_k=20, min_p=0.0) instead of the deterministic pack's greedy sampler. Override it with --thinking-sampler '{"temperature":0.7,"top_p":0.9}', override individual sampling keys with --temperature/--top-p/--top-k/--min-p, or use --sampling-from-server to omit sampler params entirely.

HumanEval+ and LiveCodeBench also carry 16K scenario budgets so thinking-on code runs do not measure a 4K truncation failure; hardest LCB items may still exceed 16K, so compare against --no-thinking for budget-runaway diagnostics.

Use --extra-body to pass any other OpenAI-compatible server extension fields. Saved JSON records thinking_enabled per pack plus the run-level thinking_mode.
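A rough sketch of how the per-pack thinking decision might shape a request body, using only the fields named above. `build_request_fields` is a hypothetical helper. Sending enable_thinking=false explicitly in answer-only mode and the 4096-token nominal budget are assumptions, and --thinking-sampler overrides are not modeled.

```python
from typing import Optional

# Recommended thinking sampler, as listed above.
THINKING_SAMPLER = {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0}


def build_request_fields(default_thinking: bool,
                         nominal_max_tokens: int,
                         force: Optional[bool] = None,
                         thinking_max_tokens: int = 16384) -> dict:
    """Resolve thinking on/off for one pack and derive the dependent fields.

    force=None follows the pack's default_thinking metadata, True mirrors
    --enable-thinking, and False mirrors --no-thinking.
    """
    thinking = default_thinking if force is None else force
    fields = {"chat_template_kwargs": {"enable_thinking": thinking}}
    if thinking:
        # Raise the output budget and switch to the thinking sampler.
        fields["max_tokens"] = max(nominal_max_tokens, thinking_max_tokens)
        fields.update(THINKING_SAMPLER)
    else:
        # Answer-only: nominal budget and greedy decoding.
        fields["max_tokens"] = nominal_max_tokens
        fields["temperature"] = 0.0
    return fields


# reasonmath-15 defaults to thinking on; toolcall-15 defaults to answer-only.
print(build_request_fields(default_thinking=True, nominal_max_tokens=4096))
print(build_request_fields(default_thinking=False, nominal_max_tokens=4096))
```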
=== benchlocal-cli --medium (endpoint: http://localhost:8020, model: qwen3.6-27b-autoround, 2026-05-09T10:30) ===
Pack | Pass / Total | Score | p50 latency | p95 latency | Status
ToolCall-15 (v1.0.1) | 14 / 15 | 93% | 8.2s | 12.1s | ✅
InstructFollow-15 (v1.0.0)| 13 / 15 | 87% | 11.4s | 17.8s | ✅
StructOutput-15 (v1.0.0) | 15 / 15 | 100% | 6.9s | 9.2s | ✅
DataExtract-15 (v1.0.0) | 12 / 15 | 80% | 7.3s | 10.5s | ✅
─────────────────────────|──────────────|───────|─────────────|─────────────|──────
TOTAL | 54 / 60 | 90% | | |
Failure breakdown:
- toolcall-15 TC-07: verifier_fail (wrong arg value for "filename": expected report.pdf, got output.pdf)
- instructfollow-15 IF-03: verifier_fail (word count 247, target 250 ±5)
- dataextract-15 DE-05: verifier_fail (7/14 atomic fields correct (50%). product_name: mismatch)
For agentic packs (e.g. aider-polyglot-30), the headline number is pass_rate over 30 exercises rather than per-scenario pass/fail; per-exercise breakdown is surfaced in the JSON verifier_trace.upstream_per_exercise. See docs/AIDER_POLYGLOT_30.md for the full output shape.
When --repeat N is greater than 1, the markdown table adds per-pack Std and CV columns derived from repeat-arm pass rates. The saved JSON includes the same data under each pack result as variance: {"repeat", "mean", "std", "cv"} so cross-rig runs can distinguish real deltas from run-to-run noise.
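The variance record can be reproduced from a list of per-repeat pass rates. This is a sketch under one assumption the text does not settle: it uses the sample standard deviation (n - 1 denominator), and the CLI may use the population form instead. `repeat_variance` is a hypothetical name.

```python
import statistics


def repeat_variance(pass_rates: list[float]) -> dict:
    """Summarize repeat-arm pass rates in the documented variance shape."""
    mean = statistics.fmean(pass_rates)
    # Sample standard deviation; undefined for one run, so report 0.0 there.
    std = statistics.stdev(pass_rates) if len(pass_rates) > 1 else 0.0
    # Coefficient of variation: spread relative to the mean pass rate.
    cv = std / mean if mean else 0.0
    return {"repeat": len(pass_rates), "mean": mean, "std": std, "cv": cv}


# Three repeats of a 15-scenario pack scoring 12, 13 and 14 passes.
print(repeat_variance([12 / 15, 13 / 15, 14 / 15]))
```

A delta between two rigs that is smaller than the reported std is hard to distinguish from run-to-run noise.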
The Failure breakdown: block above is the quickest read — failure_mode + full detail per failed scenario, printed at the end of every run. For deeper forensics, any run with --save-json (which quality-test.sh sets) records per-scenario tokens, latency, and the full verifier trace; the inspect subcommand reads it back:
benchlocal-cli inspect results.json --failed # every failure + reason + tokens + latency
benchlocal-cli inspect results.json --scenario IF-10 --full # full prompt/response/verifier trace + conversation
benchlocal-cli inspect results.json --mode timeout # only this failure_mode
benchlocal-cli inspect results.json --diff previous.json # side-by-side vs a prior run (regressions + latency delta)
benchlocal-cli inspect results.json --logs ./sandbox-logs # pull sandboxed-pack stdout/stderr
failure_mode is one of verifier_fail, timeout, agent_runner_timeout, agent_runner_crashed, server_error, http_error, model_endpoint_unreachable, result_json_malformed, wrong_answer, verifier_not_implemented.
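For scripted triage outside the inspect subcommand, failures can be tallied by failure_mode. The sketch below assumes each scenario record is a dict with "id", "passed", and "failure_mode" keys; that layout is hypothetical, so check a real --save-json file before relying on it.

```python
from collections import Counter

# The ten failure_mode values listed above.
FAILURE_MODES = {
    "verifier_fail", "timeout", "agent_runner_timeout", "agent_runner_crashed",
    "server_error", "http_error", "model_endpoint_unreachable",
    "result_json_malformed", "wrong_answer", "verifier_not_implemented",
}


def tally_failures(scenarios: list[dict]) -> Counter:
    """Count failed scenarios per failure_mode, rejecting unknown modes."""
    counts: Counter = Counter()
    for record in scenarios:
        if record.get("passed"):
            continue
        mode = record.get("failure_mode")
        if mode not in FAILURE_MODES:
            raise ValueError(f"unexpected failure_mode: {mode!r}")
        counts[mode] += 1
    return counts


# Hypothetical records shaped after the sample failure breakdown.
sample = [
    {"id": "TC-07", "passed": False, "failure_mode": "verifier_fail"},
    {"id": "IF-03", "passed": False, "failure_mode": "verifier_fail"},
    {"id": "TC-01", "passed": True, "failure_mode": None},
    {"id": "DE-09", "passed": False, "failure_mode": "timeout"},
]
print(tally_failures(sample).most_common())
```

A tally dominated by timeout or model_endpoint_unreachable points at the rig or the network rather than at model quality.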
This repo ports MIT-licensed bench pack scenarios from stevibe/BenchLocal and the individual pack repos:
- stevibe/ToolCall-15 (v1.0.1)
- stevibe/InstructFollow-15 (v1.0.0)
- stevibe/StructOutput-15 (v1.0.0)
- stevibe/ReasonMath-15 (v1.0.0)
- stevibe/DataExtract-15 (v1.0.0)
- stevibe/BugFind-15 (v1.0.1)
- stevibe/HermesAgent-20 (v1.0.0)
- stevibe/CLI-40 (v1.0.2)
Eval-expansion track:
- Aider-AI/aider (Apache-2.0) — benchmark.py harness for AiderPolyglot-30
- Aider-AI/polyglot-benchmark (CC-BY-SA-3.0 / various Exercism licenses) — exercise tree
See ATTRIBUTION.md for full attribution + license preservation per pack.
MIT — same as upstream BenchLocal. See LICENSE.
Beta. Pack updates should go through tools/sync-vendor.sh and tools/build-packs.js; see CONTRIBUTING.md.