Evaluate a locally hosted large language model against commercial APIs using a repeatable, data-driven workflow. The goal is to help you iterate on self-hosted models, quantify gaps with closed providers, and track improvements over time.
- Create and activate a Python 3.10+ environment, e.g.:

  ```bash
  uv venv && source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -e ".[openai,anthropic]"
  ```
- Prepare the providers used by the default comparison config:
  - Ollama: start `ollama serve` and pull `qwen3-coder:30b`:

    ```bash
    ollama pull qwen3-coder:30b
    ```

  - OpenAI: export the API key expected by `configs/local-vs-commercial.yaml`:

    ```bash
    export ME_OPENAI_API_KEY=your_key_here
    ```

  - Transformers: if you want to benchmark a local Hugging Face model instead, update `configs/local-example.yaml` or use `llm-eval add-local-model`.
- Populate `data/prompts.jsonl` with representative evaluation prompts.
- Run the default local-vs-commercial evaluation:

  ```bash
  llm-eval run configs/local-vs-commercial.yaml
  ```

  Override the prompt file for one-off runs without editing YAML:

  ```bash
  llm-eval run configs/local-vs-commercial.yaml --dataset-path data/prompts_extended.jsonl
  ```

  For a full Ollama-only shootout (qwen/llama/gpt-oss variants), run:

  ```bash
  llm-eval run configs/local-ollama-stack.yaml
  ```

  For the separate 100-prompt reasoning benchmark comparing `command-r` and `gpt-4o`, run:

  ```bash
  llm-eval run configs/local-vs-commercial-reasoning.yaml
  ```
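Prompt files use the JSONL schema described later in this document (one object per line with `id`, `prompt`, and an optional `reference`). A minimal sketch for bootstrapping `data/prompts.jsonl`; the example prompts themselves are placeholders, not part of the shipped dataset:

```python
import json
from pathlib import Path

# Placeholder prompts; replace with tasks representative of your domain.
PROMPTS = [
    {"id": "arith-001", "prompt": "What is 12 * 12?", "reference": "144"},
    {"id": "code-001", "prompt": "Write a Python one-liner that reverses a string s.",
     "reference": "s[::-1]"},
]

def write_jsonl(records, path: Path) -> None:
    """Write one JSON object per line (the JSONL format llm-eval expects)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_jsonl(PROMPTS, Path("data/prompts.jsonl"))
```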
```text
configs/       YAML configs describing providers, prompts, metrics
data/          Prompt sets and optional reference answers
scripts/       Helper scripts (ad-hoc data prep, viz, etc.)
src/llm_eval/  Python package with CLI, providers, metrics, orchestration
```
Install API dependencies:

```bash
pip install -e ".[api]"
```

Start the API server:

```bash
llm-eval serve-api --host 0.0.0.0 --port 8000
```

Endpoints:

- `GET /healthz` - service health check.
- `GET /v1/system-info` - host metadata snapshot (CPU/RAM/GPU/Ollama/Python).
- `POST /v1/evaluations` - synchronous evaluation request (returns summary directly).
- `POST /v1/runs` - async run creation (returns `run_id` and `queued`).
- `GET /v1/runs` - list runs and lifecycle metadata.
- `GET /v1/runs/{run_id}` - get run status and summary/error.
Run lifecycle metadata is persisted in SQLite at `artifacts/runs.db`, so run history survives API restarts.
Example request:

```bash
curl -X POST http://127.0.0.1:8000/v1/evaluations \
  -H "Content-Type: application/json" \
  -d '{"config_path":"configs/local-ollama-stack-extended.yaml","dataset_path":"data/prompts_extended.jsonl"}'
```

Async lifecycle example:

```bash
curl -X POST http://127.0.0.1:8000/v1/runs \
  -H "Content-Type: application/json" \
  -d '{"config_path":"configs/local-ollama-stack-extended.yaml"}'
# then poll:
curl http://127.0.0.1:8000/v1/runs/<run_id>
```

- Data: prompt files and artifact JSON outputs (`data/`, `artifacts/`).
- Processing: provider execution, metrics, and performance aggregation (`llm_eval` core modules).
- Display: web frontend should consume API responses/artifacts and avoid re-implementing scoring logic.
- Default runner targets Hugging Face `transformers` models.
- Ollama is supported out of the box (`type: ollama`), letting you compare any containerized model exposed via the local `ollama serve` REST API.
- You can swap in other backends (e.g., llama.cpp, vLLM) by extending `llm_eval/providers/local.py`.
- GPU acceleration is recommended; see `accelerate launch` docs for multi-GPU setups (or lean on Ollama's CUDA/Metal builds).
Use the CLI to append new local entries to any config without manual YAML edits:

```bash
llm-eval add-local-model configs/local-vs-commercial.yaml \
  --name mistral-7b-cpu \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --device-map cpu \
  --torch-dtype float32 \
  -o revision="main"
```

Add `--overwrite` to replace an existing provider with the same name.
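The appended config entry might look roughly like the following sketch. The key names are an assumption inferred from the CLI flags above (and the documented `kwargs` convention), not a dump of the actual generated YAML:

```yaml
providers:
  - name: mistral-7b-cpu
    type: transformers            # assumed type name for local HF models
    model_id: mistralai/Mistral-7B-Instruct-v0.2
    device_map: cpu
    torch_dtype: float32
    kwargs:
      revision: "main"            # passed through via -o
```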
The `configs/local-ollama-stack.yaml` config includes the following models (via `ollama serve`):

- `qwen3-coder:30b`
- `llama3:latest`
- `qwen2.5-coder:14b`
- `qwen2.5-coder:7b`
- `gpt-oss:20b`
- `llama3.1:latest`

Pull each model with `ollama pull <model>` first, then run:

```bash
llm-eval run configs/local-ollama-stack.yaml
```

For more stable ranking, use the extended 20-prompt benchmark:

```bash
llm-eval run configs/local-ollama-stack-extended.yaml
```

The scaffolding ships with OpenAI and Anthropic clients. Add new providers by subclassing `llm_eval.providers.base.BaseProvider`.
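The exact `BaseProvider` interface isn't reproduced here, so the sketch below uses a local stand-in base class with a single assumed `generate(prompt) -> str` hook; adapt the constructor and method names to the real `llm_eval.providers.base.BaseProvider` when subclassing it:

```python
from abc import ABC, abstractmethod

# Stand-in for llm_eval.providers.base.BaseProvider so this sketch runs
# standalone; the real base class may differ (signature, hooks, streaming).
class BaseProvider(ABC):
    def __init__(self, name: str, **kwargs):
        self.name = name
        self.kwargs = kwargs

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's completion for a single prompt."""

class EchoProvider(BaseProvider):
    """Toy provider that parrots the prompt; useful as a pipeline smoke test."""
    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"

provider = EchoProvider(name="echo-test")
print(provider.generate("hello"))  # echo: hello
```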
The current `configs/local-vs-commercial.yaml` example uses `gpt-4o` and reads its key from `ME_OPENAI_API_KEY`.
You can optionally add `cost_per_1k_tokens` under a provider's `kwargs` to include price metadata in the performance summary.
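A sketch of the price-metadata option; the provider `type` name and the price itself are illustrative assumptions, not real quotes:

```yaml
providers:
  - name: gpt-4o
    type: openai                  # assumed type name for the OpenAI client
    kwargs:
      cost_per_1k_tokens: 0.005   # illustrative price, copied into the summary
```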
- Load prompts from JSONL (each line: `{ "id": "...", "prompt": "...", "reference": "..." }`).
- Execute each provider for the prompt set (optionally in parallel).
- Compute metrics (string distance, embeddings, task-specific scoring).
- Compute performance stats per provider (`latency_mean_s`, `latency_p50_s`, `latency_p95_s`, `tokens_per_second`, and optional `cost_per_1k_tokens`).
- Save per-sample and aggregate results to `artifacts/`.
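The aggregate results can then be consumed downstream. A minimal sketch that ranks providers by a quality metric, assuming the summary layout with metrics keyed by provider shown later in this document (the inline data here is illustrative):

```python
def rank_providers(summary: dict, metric: str = "token-f1") -> list[tuple[str, float]]:
    """Sort providers by a quality metric, best first."""
    return sorted(
        ((name, scores[metric]) for name, scores in summary["metrics"].items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Illustrative data mirroring the summary.json layout.
summary = {
    "metrics": {
        "qwen3-coder-30b": {"token-f1": 0.412},
        "llama3-latest": {"token-f1": 0.371},
    }
}
print(rank_providers(summary))
```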
For free-form generations, prefer combining `exact_match` with `token_f1` and `semantic_cosine`. `exact_match` is strict, while `token_f1` captures partial lexical overlap.
To print only reference machine metadata (without running prompts), use:

```bash
llm-eval run configs/local-ollama-stack-extended.yaml --system-info-only
```

- `exact_match`: 1.0 only when prediction exactly matches reference after whitespace normalization; else 0.0.
- `token_f1`: token-overlap F1 score (0 to 1), balancing precision and recall of predicted tokens vs reference tokens.
- `semantic_cosine`: TF-IDF cosine similarity between prediction and reference (0 to 1), useful for loose paraphrase similarity.
- `latency_mean_s`: average per-prompt response latency in seconds.
- `latency_p50_s`: median latency (50th percentile) in seconds.
- `latency_p95_s`: tail latency (95th percentile) in seconds.
- `output_tokens_per_second`: generated output tokens divided by total elapsed latency (requires provider token usage).
- `tokens_per_second`: input+output tokens divided by total elapsed latency (requires provider token usage).
- `cost_per_1k_tokens`: optional provider price metadata copied from config into the summary output.
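The quality metrics above follow standard formulations. A self-contained sketch of the two lexical ones; whitespace-split tokenization is an assumption, and the package's actual tokenizer may differ:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 iff the strings are identical after whitespace normalization."""
    norm = lambda s: " ".join(s.split())
    return 1.0 if norm(prediction) == norm(reference) else 0.0

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between predicted and reference tokens."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("  the answer is 42 ", "the answer is 42"))  # 1.0
print(token_f1("the answer is 42", "answer: 42"))              # ≈ 0.333
```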
Example console output:

```text
Quality Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Provider         ┃ exact-match ┃ token-f1 ┃ semantic ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ qwen3-coder-30b  │ 0.050       │ 0.412    │ 0.503    │
│ llama3-latest    │ 0.040       │ 0.371    │ 0.446    │
│ qwen25-coder-14b │ 0.030       │ 0.355    │ 0.430    │
└──────────────────┴─────────────┴──────────┴──────────┘

Performance Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Provider         ┃ latency_mean_s ┃ latency_p50_s ┃ latency_p95_s ┃ output_tok_s ┃ tokens_per_second ┃ cost_per_1k_tokens ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ qwen3-coder-30b  │ 1.284          │ 1.201         │ 1.690         │ 46.220       │ 138.550           │ n/a                │
│ llama3-latest    │ 0.942          │ 0.881         │ 1.210         │ 61.432       │ 184.009           │ n/a                │
│ qwen25-coder-14b │ 0.811          │ 0.774         │ 1.052         │ 58.301       │ 172.445           │ n/a                │
└──────────────────┴────────────────┴───────────────┴───────────────┴──────────────┴───────────────────┴────────────────────┘
```
Example `artifacts/.../summary.json` structure:

```json
{
  "run_name": "ollama-local-stack-extended",
  "system_info": {
    "model_eval_version": "0.1.0",
    "os": "Darwin",
    "python_version": "3.14.0",
    "cpu_model": "Apple M3 Max",
    "ram_gb": 128.0
  },
  "metrics": {
    "qwen3-coder-30b": {
      "exact-match": 0.05,
      "token-f1": 0.412,
      "semantic": 0.503
    }
  },
  "performance": {
    "qwen3-coder-30b": {
      "latency_mean_s": 1.284,
      "latency_p50_s": 1.201,
      "latency_p95_s": 1.690,
      "output_tokens_per_second": 46.22,
      "tokens_per_second": 138.55,
      "cost_per_1k_tokens": null
    }
  }
}
```

Run the unit suite with Python's built-in discovery rooted at the test folder:

```bash
python3 -m unittest discover -s src/llm_eval/tests -p "test_*.py" -v
```

This keeps discovery scoped to the package tests and prints per-test names (drop `-v` for terse output). Some CLI tests will auto-skip if the optional Typer dependency isn't installed.
- Extend `metrics.py` with domain-specific scorers.
- Add automation (cron or GitHub Actions) to rerun evaluations nightly.
- Integrate visualization/notebook dashboards for deeper analysis.
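A sketch of the GitHub Actions idea; the workflow name, schedule, and secret wiring are assumptions, and only the commercial providers would run in CI since `ollama serve` isn't available on a stock runner:

```yaml
# .github/workflows/nightly-eval.yml (illustrative)
name: nightly-eval
on:
  schedule:
    - cron: "0 3 * * *"   # every night at 03:00 UTC
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[openai,anthropic]"
      - run: llm-eval run configs/local-vs-commercial.yaml
        env:
          ME_OPENAI_API_KEY: ${{ secrets.ME_OPENAI_API_KEY }}
      - uses: actions/upload-artifact@v4
        with:
          name: eval-artifacts
          path: artifacts/
```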