Model Evaluation Harness

Evaluate a locally hosted large language model against commercial APIs using a repeatable, data-driven workflow. The goal is to help you iterate on self-hosted models, quantify the gap against closed providers, and track improvements over time.

Quick Start

  1. Create and activate a Python 3.10+ environment, e.g.:
    uv venv && source .venv/bin/activate
  2. Install the dependencies (quoted so the extras bracket survives shells like zsh):
    pip install -e ".[openai,anthropic]"
  3. Prepare the providers used by the default comparison config
    • Ollama: start ollama serve and pull qwen3-coder:30b:
    ollama pull qwen3-coder:30b
    • OpenAI: export the API key expected by configs/local-vs-commercial.yaml:
    export ME_OPENAI_API_KEY=your_key_here
    • Transformers: if you want to benchmark a local Hugging Face model instead, update configs/local-example.yaml or use llm-eval add-local-model.
  4. Populate data/prompts.jsonl with representative evaluation prompts.
  5. Run the default local-vs-commercial evaluation
    llm-eval run configs/local-vs-commercial.yaml
    Override the prompt file for one-off runs without editing YAML:
    llm-eval run configs/local-vs-commercial.yaml --dataset-path data/prompts_extended.jsonl
    For a full Ollama-only shootout (qwen/llama/gpt-oss variants), run:
    llm-eval run configs/local-ollama-stack.yaml
    For the separate 100-prompt reasoning benchmark comparing command-r and gpt-4o, run:
    llm-eval run configs/local-vs-commercial-reasoning.yaml
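Step 4 above can be seeded programmatically. The snippet below writes a minimal data/prompts.jsonl in the JSONL shape described under Evaluation Flow; the ids and prompts are placeholders, so replace them with prompts representative of your workload:

```python
import json
from pathlib import Path

# Hypothetical starter prompts -- swap in your own evaluation set.
prompts = [
    {"id": "py-001",
     "prompt": "Write a Python function that reverses a string.",
     "reference": "def reverse(s):\n    return s[::-1]"},
    {"id": "fact-001",
     "prompt": "What is the capital of France?",
     "reference": "Paris"},
]

path = Path("data/prompts.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as f:
    for row in prompts:
        f.write(json.dumps(row) + "\n")  # one JSON object per line
```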

Repository Layout

configs/            YAML configs describing providers, prompts, metrics
data/               Prompt sets and optional reference answers
scripts/            Helper scripts (ad-hoc data prep, viz, etc.)
src/llm_eval/       Python package with CLI, providers, metrics, orchestration

API (Web Product Integration)

Install API dependencies:

pip install -e ".[api]"

Start the API server:

llm-eval serve-api --host 0.0.0.0 --port 8000

Endpoints:

  • GET /healthz - service health check.
  • GET /v1/system-info - host metadata snapshot (CPU/RAM/GPU/Ollama/Python).
  • POST /v1/evaluations - synchronous evaluation request (returns summary directly).
  • POST /v1/runs - async run creation (returns a run_id with initial status queued).
  • GET /v1/runs - list runs and lifecycle metadata.
  • GET /v1/runs/{run_id} - get run status and summary/error.

Run lifecycle metadata is persisted in SQLite at artifacts/runs.db, so run history survives API restarts.

Example request:

curl -X POST http://127.0.0.1:8000/v1/evaluations \
  -H "Content-Type: application/json" \
  -d '{"config_path":"configs/local-ollama-stack-extended.yaml","dataset_path":"data/prompts_extended.jsonl"}'

Async lifecycle example:

curl -X POST http://127.0.0.1:8000/v1/runs \
  -H "Content-Type: application/json" \
  -d '{"config_path":"configs/local-ollama-stack-extended.yaml"}'
# then poll:
curl http://127.0.0.1:8000/v1/runs/<run_id>
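The same lifecycle can be driven from Python with only the standard library. This is a sketch against the endpoints listed above; the exact response fields and the status values "queued"/"running" are assumptions, so check them against your API's actual responses:

```python
import json
import time
import urllib.request

BASE = "http://127.0.0.1:8000"  # default serve-api address

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the API and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def wait_for_run(run_id: str, poll_s: float = 2.0) -> dict:
    """Poll GET /v1/runs/{run_id} until the run leaves the queued/running states."""
    while True:
        with urllib.request.urlopen(f"{BASE}/v1/runs/{run_id}") as resp:
            run = json.load(resp)
        if run.get("status") not in ("queued", "running"):
            return run
        time.sleep(poll_s)

# Example (requires `llm-eval serve-api` to be running):
# run = post_json("/v1/runs", {"config_path": "configs/local-ollama-stack-extended.yaml"})
# final = wait_for_run(run["run_id"])
```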

Separation of Concerns

  • Data: prompt files and artifact JSON outputs (data/, artifacts/).
  • Processing: provider execution, metrics, and performance aggregation (llm_eval core modules).
  • Display: web frontend should consume API responses/artifacts and avoid re-implementing scoring logic.

Local Model Notes

  • Default runner targets Hugging Face transformers models.
  • Ollama is supported out of the box (type: ollama), letting you compare any containerized model exposed via the local ollama serve REST API.
  • You can swap in other backends (e.g., llama.cpp, vLLM) by extending llm_eval/providers/local.py.
  • GPU acceleration is recommended; see accelerate launch docs for multi-GPU setups (or lean on Ollama's CUDA/Metal builds).

Adding Local Models Quickly

Use the CLI to append new local entries to any config without manual YAML edits:

llm-eval add-local-model configs/local-vs-commercial.yaml \
  --name mistral-7b-cpu \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --device-map cpu \
  --torch-dtype float32 \
  -o revision="main"

Add --overwrite to replace an existing provider with the same name.
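For orientation, the command above would append a provider entry along these lines; the field names here are illustrative, so verify them against the YAML the CLI actually generates:

```yaml
providers:
  - name: mistral-7b-cpu
    type: local
    model_id: mistralai/Mistral-7B-Instruct-v0.2
    device_map: cpu
    torch_dtype: float32
    revision: main
```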

Prebuilt Ollama Stack

The configs/local-ollama-stack.yaml config includes the following models (via ollama serve):

  • qwen3-coder:30b
  • llama3:latest
  • qwen2.5-coder:14b
  • qwen2.5-coder:7b
  • gpt-oss:20b
  • llama3.1:latest

Pull each model with ollama pull <model> first, then run:

llm-eval run configs/local-ollama-stack.yaml

For more stable ranking, use the extended 20-prompt benchmark:

llm-eval run configs/local-ollama-stack-extended.yaml

Commercial Providers

The scaffolding ships with OpenAI and Anthropic clients. Add new providers by subclassing llm_eval.providers.base.BaseProvider.
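A provider subclass might look like the sketch below. The real interface lives in llm_eval/providers/base.py; the stand-in base class and the generate method name are assumptions, so match them to the actual abstract methods before use:

```python
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Stand-in for llm_eval.providers.base.BaseProvider (interface assumed)."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoProvider(BaseProvider):
    """Trivial provider that echoes the prompt -- handy for pipeline smoke tests."""
    def __init__(self, name: str = "echo"):
        self.name = name

    def generate(self, prompt: str) -> str:
        # A real provider would call a model API here and return its completion.
        return f"[{self.name}] {prompt}"
```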

The current configs/local-vs-commercial.yaml example uses gpt-4o and reads its key from ME_OPENAI_API_KEY. You can optionally add cost_per_1k_tokens under a provider's kwargs to include price metadata in the performance summary.

Evaluation Flow

  1. Load prompts from JSONL (each line: { "id": "...", "prompt": "...", "reference": "..." }).
  2. Execute each provider for the prompt set (optionally in parallel).
  3. Compute metrics (string distance, embeddings, task-specific scoring).
  4. Compute performance stats per provider (latency_mean_s, latency_p50_s, latency_p95_s, tokens_per_second, and optional cost_per_1k_tokens).
  5. Save per-sample and aggregate results to artifacts/.

For free-form generations, prefer combining exact_match with token_f1 and semantic_cosine. exact_match is strict, while token_f1 captures partial lexical overlap.
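The token_f1 idea can be sketched roughly as follows; the package's actual tokenization and normalization may differ, so treat this as an independent illustration rather than the shipped implementation:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens, with multiset-overlap counting."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty -> 1.0, one empty -> 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact_match, a partially correct answer such as "the cat" against "the cat sat" still earns partial credit (F1 = 0.8) instead of scoring zero.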

To print only reference machine metadata (without running prompts), use:

llm-eval run configs/local-ollama-stack-extended.yaml --system-info-only

Metric Definitions

  • exact_match: 1.0 only when prediction exactly matches reference after whitespace normalization; else 0.0.
  • token_f1: token-overlap F1 score (0 to 1), balancing precision and recall of predicted tokens vs reference tokens.
  • semantic_cosine: TF-IDF cosine similarity between prediction and reference (0 to 1), useful for loose paraphrase similarity.
  • latency_mean_s: average per-prompt response latency in seconds.
  • latency_p50_s: median latency (50th percentile) in seconds.
  • latency_p95_s: tail latency (95th percentile) in seconds.
  • output_tokens_per_second: generated output tokens divided by total elapsed latency (requires provider token usage).
  • tokens_per_second: input+output tokens divided by total elapsed latency (requires provider token usage).
  • cost_per_1k_tokens: optional provider price metadata copied from config into the summary output.
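As arithmetic orientation for the performance metrics, here is an independent sketch over hypothetical per-prompt numbers (not the package's code):

```python
import statistics

# Hypothetical per-prompt measurements for one provider.
latencies_s = [1.10, 1.25, 0.98, 1.40, 1.05]
output_tokens = [52, 61, 47, 70, 50]
input_tokens = [120, 130, 115, 140, 118]

latency_mean_s = statistics.mean(latencies_s)
latency_p50_s = statistics.median(latencies_s)
# p95 via linear interpolation over the sorted sample (index 94 of 99 cut points).
latency_p95_s = statistics.quantiles(latencies_s, n=100, method="inclusive")[94]

elapsed_s = sum(latencies_s)
output_tokens_per_second = sum(output_tokens) / elapsed_s
tokens_per_second = (sum(input_tokens) + sum(output_tokens)) / elapsed_s
```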

Sample Report

Example console output:

Quality Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Provider         ┃ exact-match ┃ token-f1 ┃ semantic ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ qwen3-coder-30b  │ 0.050       │ 0.412    │ 0.503    │
│ llama3-latest    │ 0.040       │ 0.371    │ 0.446    │
│ qwen25-coder-14b │ 0.030       │ 0.355    │ 0.430    │
└──────────────────┴─────────────┴──────────┴──────────┘

Performance Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Provider         ┃ latency_mean_s ┃ latency_p50_s ┃ latency_p95_s ┃ output_tok_s ┃ tokens_per_second ┃ cost_per_1k_tokens ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ qwen3-coder-30b  │ 1.284          │ 1.201         │ 1.690         │ 46.220       │ 138.550           │ n/a                │
│ llama3-latest    │ 0.942          │ 0.881         │ 1.210         │ 61.432       │ 184.009           │ n/a                │
│ qwen25-coder-14b │ 0.811          │ 0.774         │ 1.052         │ 58.301       │ 172.445           │ n/a                │
└──────────────────┴────────────────┴───────────────┴───────────────┴──────────────┴───────────────────┴────────────────────┘

Example artifacts/.../summary.json structure:

{
  "run_name": "ollama-local-stack-extended",
  "system_info": {
    "model_eval_version": "0.1.0",
    "os": "Darwin",
    "python_version": "3.14.0",
    "cpu_model": "Apple M3 Max",
    "ram_gb": 128.0
  },
  "metrics": {
    "qwen3-coder-30b": {
      "exact-match": 0.05,
      "token-f1": 0.412,
      "semantic": 0.503
    }
  },
  "performance": {
    "qwen3-coder-30b": {
      "latency_mean_s": 1.284,
      "latency_p50_s": 1.201,
      "latency_p95_s": 1.690,
      "output_tokens_per_second": 46.22,
      "tokens_per_second": 138.55,
      "cost_per_1k_tokens": null
    }
  }
}
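Downstream tooling can rank providers straight from that summary structure. A minimal sketch (the metrics dict is inlined here; in practice you would json.load the summary file from artifacts/):

```python
import json

# Inline stand-in for a loaded summary.json, using the structure shown above.
summary = {
    "metrics": {
        "qwen3-coder-30b": {"token-f1": 0.412},
        "llama3-latest": {"token-f1": 0.371},
        "qwen25-coder-14b": {"token-f1": 0.355},
    }
}

# Sort providers by token-f1, best first.
ranked = sorted(summary["metrics"].items(),
                key=lambda kv: kv[1]["token-f1"], reverse=True)
for provider, scores in ranked:
    print(f"{provider}: token-f1 {scores['token-f1']:.3f}")
```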

Testing

Run the unit suite with Python’s built-in discovery rooted at the test folder:

python3 -m unittest discover -s src/llm_eval/tests -p "test_*.py" -v

This keeps discovery scoped to the package tests and prints per-test names (drop -v for terse output). Some CLI tests will auto-skip if the optional Typer dependency isn’t installed.

Next Steps

  • Extend metrics.py with domain-specific scorers.
  • Add automation (cron or GitHub Actions) to rerun evaluations nightly.
  • Integrate visualization/notebook dashboards for deeper analysis.
