Model Evaluation Harness

Evaluate a locally hosted large language model against commercial APIs using a repeatable, data-driven workflow. The goal is to help you iterate on self-hosted models, quantify the gap against closed providers, and track improvements over time.

Quick Start

  1. Create and activate a Python 3.10+ environment, e.g.:
    uv venv && source .venv/bin/activate
  2. Install the dependencies (quoted so the extras bracket survives shells like zsh):
    pip install -e ".[openai,anthropic]"
  3. Prepare the providers used by the default comparison config
    • Ollama: start ollama serve and pull qwen3-coder:30b:
    ollama pull qwen3-coder:30b
    • OpenAI: export the API key expected by configs/local-vs-commercial.yaml:
    export ME_OPENAI_API_KEY=your_key_here
    • Transformers: if you want to benchmark a local Hugging Face model instead, update configs/local-example.yaml or use llm-eval add-local-model.
  4. Populate data/prompts.jsonl with representative evaluation prompts.
  5. Run the default local-vs-commercial evaluation
    llm-eval run configs/local-vs-commercial.yaml
    Override the prompt file for one-off runs without editing YAML:
    llm-eval run configs/local-vs-commercial.yaml --dataset-path data/prompts_extended.jsonl
    For a full Ollama-only shootout (qwen/llama/gpt-oss variants), run:
    llm-eval run configs/local-ollama-stack.yaml
    For the separate 100-prompt reasoning benchmark comparing command-r and gpt-4o, run:
    llm-eval run configs/local-vs-commercial-reasoning.yaml
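Step 4 above can be seeded programmatically. The snippet below writes a minimal data/prompts.jsonl in the JSONL shape described under Evaluation Flow; the ids and prompts are placeholders, so replace them with prompts representative of your workload:

```python
import json
from pathlib import Path

# Hypothetical starter prompts -- swap in your own evaluation set.
prompts = [
    {"id": "py-001",
     "prompt": "Write a Python function that reverses a string.",
     "reference": "def reverse(s):\n    return s[::-1]"},
    {"id": "fact-001",
     "prompt": "What is the capital of France?",
     "reference": "Paris"},
]

path = Path("data/prompts.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as f:
    for row in prompts:
        f.write(json.dumps(row) + "\n")  # one JSON object per line
```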

Repository Layout

configs/            YAML configs describing providers, prompts, metrics
data/               Prompt sets and optional reference answers
scripts/            Helper scripts (ad-hoc data prep, viz, etc.)
src/llm_eval/       Python package with CLI, providers, metrics, orchestration

API (Web Product Integration)

Install API dependencies:

pip install -e ".[api]"

Start the API server:

llm-eval serve-api --host 0.0.0.0 --port 8000

Endpoints:

  • GET /healthz - service health check.
  • GET /v1/system-info - host metadata snapshot (CPU/RAM/GPU/Ollama/Python).
  • POST /v1/evaluations - synchronous evaluation request (returns summary directly).
  • POST /v1/runs - async run creation (returns a run_id with initial status queued).
  • GET /v1/runs - list runs and lifecycle metadata.
  • GET /v1/runs/{run_id} - get run status and summary/error.

Run lifecycle metadata is persisted in SQLite at artifacts/runs.db, so run history survives API restarts.

Example request:

curl -X POST http://127.0.0.1:8000/v1/evaluations \
  -H "Content-Type: application/json" \
  -d '{"config_path":"configs/local-ollama-stack-extended.yaml","dataset_path":"data/prompts_extended.jsonl"}'

Async lifecycle example:

curl -X POST http://127.0.0.1:8000/v1/runs \
  -H "Content-Type: application/json" \
  -d '{"config_path":"configs/local-ollama-stack-extended.yaml"}'
# then poll:
curl http://127.0.0.1:8000/v1/runs/<run_id>
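The same lifecycle can be driven from Python with only the standard library. This is a sketch against the endpoints listed above; the exact response fields and the status values "queued"/"running" are assumptions, so check them against your API's actual responses:

```python
import json
import time
import urllib.request

BASE = "http://127.0.0.1:8000"  # default serve-api address

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the API and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_run(run_id: str, poll_s: float = 2.0) -> dict:
    """Poll GET /v1/runs/{run_id} until the run leaves the queued/running states."""
    while True:
        with urllib.request.urlopen(f"{BASE}/v1/runs/{run_id}") as resp:
            run = json.load(resp)
        if run.get("status") not in ("queued", "running"):
            return run
        time.sleep(poll_s)

# Example (requires `llm-eval serve-api` to be running):
# run = post_json("/v1/runs", {"config_path": "configs/local-ollama-stack-extended.yaml"})
# final = wait_for_run(run["run_id"])
```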

Separation of Concerns

  • Data: prompt files and artifact JSON outputs (data/, artifacts/).
  • Processing: provider execution, metrics, and performance aggregation (llm_eval core modules).
  • Display: web frontend should consume API responses/artifacts and avoid re-implementing scoring logic.

Local Model Notes

  • Default runner targets Hugging Face transformers models.
  • Ollama is supported out of the box (type: ollama), letting you compare any containerized model exposed via the local ollama serve REST API.
  • You can swap in other backends (e.g., llama.cpp, vLLM) by extending llm_eval/providers/local.py.
  • GPU acceleration is recommended; see accelerate launch docs for multi-GPU setups (or lean on Ollama's CUDA/Metal builds).

Adding Local Models Quickly

Use the CLI to append new local entries to any config without manual YAML edits:

llm-eval add-local-model configs/local-vs-commercial.yaml \
  --name mistral-7b-cpu \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --device-map cpu \
  --torch-dtype float32 \
  -o revision="main"

Add --overwrite to replace an existing provider with the same name.
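For orientation, the command above would append a provider entry along these lines; the field names here are illustrative, so verify them against the YAML the CLI actually generates:

```yaml
providers:
  - name: mistral-7b-cpu
    type: local
    model_id: mistralai/Mistral-7B-Instruct-v0.2
    device_map: cpu
    torch_dtype: float32
    revision: main
```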

Prebuilt Ollama Stack

The configs/local-ollama-stack.yaml config includes the following models (via ollama serve):

  • qwen3-coder:30b
  • llama3:latest
  • qwen2.5-coder:14b
  • qwen2.5-coder:7b
  • gpt-oss:20b
  • llama3.1:latest

Pull each model with ollama pull <model> first, then run:

llm-eval run configs/local-ollama-stack.yaml

For more stable ranking, use the extended 20-prompt benchmark:

llm-eval run configs/local-ollama-stack-extended.yaml

Commercial Providers

The scaffolding ships with OpenAI and Anthropic clients. Add new providers by subclassing llm_eval.providers.base.BaseProvider.
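A provider subclass might look like the sketch below. The real interface lives in llm_eval/providers/base.py; the stand-in base class and the generate method name are assumptions, so match them to the actual abstract methods before use:

```python
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Stand-in for llm_eval.providers.base.BaseProvider (interface assumed)."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoProvider(BaseProvider):
    """Trivial provider that echoes the prompt -- handy for pipeline smoke tests."""
    def __init__(self, name: str = "echo"):
        self.name = name

    def generate(self, prompt: str) -> str:
        # A real provider would call a model API here and return its completion.
        return f"[{self.name}] {prompt}"
```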

The current configs/local-vs-commercial.yaml example uses gpt-4o and reads its key from ME_OPENAI_API_KEY. You can optionally add cost_per_1k_tokens under a provider's kwargs to include price metadata in the performance summary.

Evaluation Flow

  1. Load prompts from JSONL (each line: { "id": "...", "prompt": "...", "reference": "..." }).
  2. Execute each provider for the prompt set (optionally in parallel).
  3. Compute metrics (string distance, embeddings, task-specific scoring).
  4. Compute performance stats per provider (latency_mean_s, latency_p50_s, latency_p95_s, tokens_per_second, and optional cost_per_1k_tokens).
  5. Save per-sample and aggregate results to artifacts/.

For free-form generations, prefer combining exact_match with token_f1 and semantic_cosine. exact_match is strict, while token_f1 captures partial lexical overlap.
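The token_f1 idea can be sketched roughly as follows; the package's actual tokenization and normalization may differ, so treat this as an independent illustration rather than the shipped implementation:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens, with multiset-overlap counting."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty -> 1.0, one empty -> 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact_match, a partially correct answer such as "the cat" against "the cat sat" still earns partial credit (F1 = 0.8) instead of scoring zero.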

To print only reference machine metadata (without running prompts), use:

llm-eval run configs/local-ollama-stack-extended.yaml --system-info-only

Metric Definitions

  • exact_match: 1.0 only when prediction exactly matches reference after whitespace normalization; else 0.0.
  • token_f1: token-overlap F1 score (0 to 1), balancing precision and recall of predicted tokens vs reference tokens.
  • semantic_cosine: TF-IDF cosine similarity between prediction and reference (0 to 1), useful for loose paraphrase similarity.
  • latency_mean_s: average per-prompt response latency in seconds.
  • latency_p50_s: median latency (50th percentile) in seconds.
  • latency_p95_s: tail latency (95th percentile) in seconds.
  • output_tokens_per_second: generated output tokens divided by total elapsed latency (requires provider token usage).
  • tokens_per_second: input+output tokens divided by total elapsed latency (requires provider token usage).
  • cost_per_1k_tokens: optional provider price metadata copied from config into the summary output.
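As arithmetic orientation for the performance metrics, here is an independent sketch over hypothetical per-prompt numbers (not the package's code):

```python
import statistics

# Hypothetical per-prompt measurements for one provider.
latencies_s = [1.10, 1.25, 0.98, 1.40, 1.05]
output_tokens = [52, 61, 47, 70, 50]
input_tokens = [120, 130, 115, 140, 118]

latency_mean_s = statistics.mean(latencies_s)
latency_p50_s = statistics.median(latencies_s)
# p95 via linear interpolation over the sorted sample (index 94 of 99 cut points).
latency_p95_s = statistics.quantiles(latencies_s, n=100, method="inclusive")[94]

elapsed_s = sum(latencies_s)
output_tokens_per_second = sum(output_tokens) / elapsed_s
tokens_per_second = (sum(input_tokens) + sum(output_tokens)) / elapsed_s
```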

Sample Report

Example console output:

Quality Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Provider         ┃ exact-match ┃ token-f1 ┃ semantic ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ qwen3-coder-30b  │ 0.050       │ 0.412    │ 0.503    │
│ llama3-latest    │ 0.040       │ 0.371    │ 0.446    │
│ qwen25-coder-14b │ 0.030       │ 0.355    │ 0.430    │
└──────────────────┴─────────────┴──────────┴──────────┘

Performance Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Provider         ┃ latency_mean_s ┃ latency_p50_s ┃ latency_p95_s ┃ output_tok_s ┃ tokens_per_second ┃ cost_per_1k_tokens ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ qwen3-coder-30b  │ 1.284          │ 1.201         │ 1.690         │ 46.220       │ 138.550           │ n/a                │
│ llama3-latest    │ 0.942          │ 0.881         │ 1.210         │ 61.432       │ 184.009           │ n/a                │
│ qwen25-coder-14b │ 0.811          │ 0.774         │ 1.052         │ 58.301       │ 172.445           │ n/a                │
└──────────────────┴────────────────┴───────────────┴───────────────┴──────────────┴───────────────────┴────────────────────┘

Example artifacts/.../summary.json structure:

{
  "run_name": "ollama-local-stack-extended",
  "system_info": {
    "model_eval_version": "0.1.0",
    "os": "Darwin",
    "python_version": "3.14.0",
    "cpu_model": "Apple M3 Max",
    "ram_gb": 128.0
  },
  "metrics": {
    "qwen3-coder-30b": {
      "exact-match": 0.05,
      "token-f1": 0.412,
      "semantic": 0.503
    }
  },
  "performance": {
    "qwen3-coder-30b": {
      "latency_mean_s": 1.284,
      "latency_p50_s": 1.201,
      "latency_p95_s": 1.690,
      "output_tokens_per_second": 46.22,
      "tokens_per_second": 138.55,
      "cost_per_1k_tokens": null
    }
  }
}
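Downstream tooling can rank providers straight from that summary structure. A minimal sketch (the metrics dict is inlined here; in practice you would json.load the summary file from artifacts/):

```python
import json

# Inline stand-in for a loaded summary.json, using the structure shown above.
summary = {
    "metrics": {
        "qwen3-coder-30b": {"token-f1": 0.412},
        "llama3-latest": {"token-f1": 0.371},
        "qwen25-coder-14b": {"token-f1": 0.355},
    }
}

# Sort providers by token-f1, best first.
ranked = sorted(summary["metrics"].items(),
                key=lambda kv: kv[1]["token-f1"], reverse=True)
for provider, scores in ranked:
    print(f"{provider}: token-f1 {scores['token-f1']:.3f}")
```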

Testing

Run the unit suite with Python’s built-in discovery rooted at the test folder:

python3 -m unittest discover -s src/llm_eval/tests -p "test_*.py" -v

This keeps discovery scoped to the package tests and prints per-test names (drop -v for terse output). Some CLI tests will auto-skip if the optional Typer dependency isn’t installed.

Next Steps

  • Extend metrics.py with domain-specific scorers.
  • Add automation (cron or GitHub Actions) to rerun evaluations nightly.
  • Integrate visualization/notebook dashboards for deeper analysis.
