Benchmarking framework for llama.cpp's WebGPU backend. Compiles llama.cpp to WebAssembly, runs GGUF models in real browsers via WebGPU, and measures inference performance and numerical correctness across Chrome and Safari.
Currently tracks 10 models with 194 quantization variants from HuggingFace.
Two ways to run:
- One-click Run page — open `/site/run.html` in any WebGPU-capable browser. Tick the variants you want, click Download → Run. Great for measuring your own laptop or for the hosted leaderboard URL. See One-click benchmark.
- Automated CLI (`runner.js`) — Playwright + WebDriverIO orchestrator that runs cross-browser matrices headlessly. Used for CI and cloud runs. See Running Benchmarks.
The Run page is a standalone entry at site/run.html, linked from the header of the dashboard. Start the dev server and open it in whichever browser you want to test.
```bash
# Prerequisites: built WASM (`npm run build`) and `npm install`.
node server.js
# http://localhost:3000/ → dashboard. Click "Run" in the header, or open
# http://localhost:3000/site/run.html directly.
```

Run-page flow:
- Three device cards show browser + platform + GPU, deviceMemory + WebGPU support, and the estimated safe model budget.
- Models panel lists all 194 variants grouped by family. Every variant is checked by default; variants that exceed the budget are dimmed and unchecked. Uncheck whatever you don't want to run.
- [Download selected] streams GGUFs straight from HuggingFace into the browser's OPFS cache. Per-row byte progress.
- [Run benchmarks] runs each cached variant through `bench-worker.js` sequentially (one Worker per run, OPFS-backed model load — `use_mmap=0`, no WASM-heap size cap). A crash in one variant doesn't halt the queue.
- Output — copy the markdown block or download JSON. When served from `localhost:3000`, a checkbox appends each record to `results/results.json` as `runner.js` does.
The canonical deployment is the HF Space at https://abhijitramesh-webgpu-bench.static.hf.space/. The Run page auto-detects its surface and adapts:
| Surface | URL | Models | Cache | Submit |
|---|---|---|---|---|
| Localhost | `/site/run.html` | `/api/models` (Express) | OPFS in the browser | POST `/api/results` → `npm run submit` |
| HF Space | `/run.html` | `./models.json` | OPFS in the browser | HF OAuth → direct commit to the leaderboard dataset |
| Other hosted | `/run.html` | `./models.json` | OPFS in the browser | Hidden (read-only — a banner points at the Space) |
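For orientation, here is a minimal sketch of the kind of hostname-based surface detection described above. It is illustrative only: the authoritative logic lives under `site/js/run/`, and the real checks may differ.

```js
// Illustrative sketch: decide which surface the Run page is served from.
const host = location.hostname;
const surface =
  host === "localhost" || host === "127.0.0.1" ? "localhost" :
  host.endsWith(".hf.space")                   ? "hf-space"  :
                                                 "other";

// Per the table above, localhost fetches the catalogue from Express,
// hosted surfaces fall back to the static models.json.
const modelsUrl = surface === "localhost" ? "/api/models" : "./models.json";
```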
The sync-to-hf-space workflow flattens site/ onto your Space root on every push to main (set the HF_SPACE_REPO repo variable + HF_TOKEN secret first). Dataset repo + OAuth scopes live in site/js/run/config.js.
| Dependency | Version | Notes |
|---|---|---|
| Emscripten SDK | Latest | Set EMSDK_DIR env var or place at ../emsdk/ |
| Ninja | Any | brew install ninja / apt install ninja-build |
| Node.js | 18+ | |
| CMake | 3.14+ | |
| Playwright browsers | Installed via npx | For Chrome |
| Safari Remote Automation | macOS only | Safari > Settings > Advanced > "Allow Remote Automation" |
```bash
# Clone with submodules (llama.cpp)
git clone --recurse-submodules <repo-url>
cd webgpu-bench

# Build WASM (downloads emdawnwebgpu automatically)
npm run build

# Install dependencies + Playwright browsers
npm install
npx playwright install chromium

# Run a quick benchmark (3 quants, Chromium only)
node runner.js --quick --browsers=chromium

# View results
node report.js
```

For the interactive one-click page instead, see One-click benchmark.
Compiles llama.cpp (git submodule) to two WASM variants with WebGPU support:
| Variant | Browser | Mechanism |
|---|---|---|
| JSPI | Chrome | JavaScript Promise Integration (native async) |
| Asyncify | Safari | Emscripten Asyncify (transform-based async) |
```bash
npm run build
# or: bash build.sh
```

The browser automatically detects JSPI support at runtime and loads the correct variant. emdawnwebgpu (WebGPU bindings for Emscripten) is downloaded on first build.
Output:

```
build/jspi/bin/bench.js + bench.wasm
build/asyncify/bin/bench.js + bench.wasm
```
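As a rough sketch of the runtime JSPI check (an assumption about the mechanism; the actual probe in the harness may differ), JSPI-capable engines expose `WebAssembly.Suspending`:

```js
// Hypothetical detection sketch: pick the build variant based on JSPI support.
const hasJSPI =
  typeof WebAssembly !== "undefined" &&
  typeof WebAssembly.Suspending === "function";

// Paths correspond to the build outputs listed above.
const benchScript = hasJSPI
  ? "build/jspi/bin/bench.js"
  : "build/asyncify/bin/bench.js";
```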
```bash
node runner.js [options]
```

| Flag | Description | Example |
|---|---|---|
| (none) | All 230 variants on default browsers (chromium, plus webkit on macOS) | `node runner.js` |
| `--quick` | Only Q2_K, Q4_K_M, Q8_0 | `node runner.js --quick` |
| `--study` | Curated leaderboard sweep — same selection as the interactive Run-page "Run study" button (focus model at four quants + every other model at the standard quant). Defined in `models.json` → `studySelection`. | `node runner.js --study` |
| `--browsers=` | Comma-separated browser list | `--browsers=chromium,webkit` |
| `--variants=` | Specific quantization types | `--variants=Q4_K_M,Q8_0` |
| `--models=` | Filter by model name (substring match) | `--models=Llama-3.2-1B` |
| `--no-webgpu` | CPU-only mode (disable GPU offload) | `--no-webgpu` |
| `--consistency` | Measure WebGPU vs CPU numerical correctness | `--consistency` |
| `--resume` | Skip browser+variant+GPU-layer combos that already succeeded | `--resume` |
```bash
# Quick smoke test on Chrome
node runner.js --quick --browsers=chromium

# All quants for a specific model
node runner.js --models=Qwen3-0.6B --browsers=chromium

# Full suite (expect 5-6 hours)
node runner.js

# Consistency check: how faithfully does WebGPU reproduce CPU results?
node runner.js --quick --consistency

# Resume a partial run (skips completed combos)
node runner.js --resume

# CPU-only baseline
node runner.js --quick --no-webgpu
```

| Script | Description |
|---|---|
| `npm run build` | Build WASM (both variants) |
| `npm run bench` | Run all benchmarks |
| `npm run bench:quick` | Quick benchmark (3 quants) |
| `npm run bench:chromium` | All quants, Chromium only |
| `npm run report` | Generate CSV from results |
| `npm run submit` | Push results to the HF leaderboard dataset (needs `HF_TOKEN` + `HF_DATASET_REPO`) |
| `npm run build:site` | Build dashboard data |
Models are fetched directly from HuggingFace into the browser's OPFS by bench-worker.js (the same loader the interactive Run page uses). Each Playwright context has its own OPFS, so every variant is downloaded once per runner.js invocation and discarded when the context closes.
If you want a persistent on-disk cache for the CLI, point Playwright at a persistent context — see chromium.launchPersistentContext(). There is no longer an Express-side disk cache.
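A minimal sketch of what that could look like (illustrative only; `./pw-profile` is a hypothetical directory, and you would still need to drive the harness page the way `runner.js` does):

```js
import { chromium } from "playwright";

// A persistent profile keeps OPFS (and thus downloaded GGUFs) across runs.
const context = await chromium.launchPersistentContext("./pw-profile", {
  headless: true,
});
const page = await context.newPage();
await page.goto("http://localhost:3000/harness.html");
// ... run benchmarks as usual, then:
await context.close();
```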
Results are saved to results/:
| File | Description |
|---|---|
| `results.json` | Full benchmark data with all metrics |
| `summary.json` | Grouped by browser with pass/fail status |
| `results.csv` | Flat CSV for spreadsheets |
| `cpu_baselines.json` | CPU reference token sequences (from `--consistency`) |
Generate reports:
```bash
npm run report
```

| Field | Description |
|---|---|
| `prefill_tok_s` | Prompt processing speed (tokens/sec) |
| `decode_tok_s` | Token generation speed (tokens/sec) |
| `t_p_eval_ms` | Prefill time in milliseconds |
| `t_eval_ms` | Decode time in milliseconds |
| `n_p_eval` | Number of prompt tokens processed |
| `n_eval` | Number of tokens generated |
| `buildType` | `jspi` or `asyncify` |
| `webgpuAvailable` | Whether WebGPU was available |
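The throughput fields are tied to the timing fields; a sketch of the assumed relation (consistent with how llama.cpp reports tokens per second, though the benchmark computes these internally):

```js
// Illustrative: tokens/sec = tokens processed / elapsed seconds.
function throughput(record) {
  return {
    prefill_tok_s: record.n_p_eval / (record.t_p_eval_ms / 1000),
    decode_tok_s: record.n_eval / (record.t_eval_ms / 1000),
  };
}
```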
| Field | Description |
|---|---|
| `consistency.agreement_rate` | Fraction of positions where GPU and CPU agree on top-1 token (0.0-1.0) |
| `consistency.n_agree` | Number of agreeing positions |
| `consistency.n_tokens` | Total positions evaluated |
| `consistency.first_disagreement` | Position of first divergence (-1 if perfect) |
The --consistency flag measures how faithfully the WebGPU backend reproduces CPU computation for each quantization type.
For each variant, two runs happen in the same browser (so the WebGPU backend is the only variable):
- CPU baseline (`n_gpu_layers=0`): greedy-decodes 128 tokens and records the token ID sequence. Cached to `results/cpu_baselines.json` so subsequent runs skip this step.
- WebGPU run (`n_gpu_layers=999`): runs the normal benchmark, then performs a forced-decoding pass -- feeds the CPU's token sequence one token at a time and checks whether the GPU backend independently predicts the same top-1 token at each position.
When benchmarking across multiple browsers, the CPU baseline is shared (collected once from the first browser) since CPU computation is browser-independent.
Naively comparing generated text suffers from cascading divergence: a single different token changes the KV cache for all subsequent tokens, making the rest statistically unrelated. A text match ratio of 24% might mean only one token actually diverged.
Forced decoding evaluates each position independently against the same reference context, giving a clean per-token accuracy signal.
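A sketch of the agreement metric this produces (illustrative JavaScript; the actual `bench_eval_tokens` implementation runs in C++ inside the WASM module):

```js
// Compare the GPU's top-1 prediction at each position against the CPU reference.
function agreement(referenceIds, gpuTop1Ids) {
  let nAgree = 0;
  let firstDisagreement = -1;
  for (let i = 0; i < referenceIds.length; i++) {
    if (gpuTop1Ids[i] === referenceIds[i]) {
      nAgree++;
    } else if (firstDisagreement === -1) {
      firstDisagreement = i;
    }
  }
  return {
    agreement_rate: nAgree / referenceIds.length,
    n_agree: nAgree,
    n_tokens: referenceIds.length,
    first_disagreement: firstDisagreement,
  };
}
```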
| `agreement_rate` | Interpretation |
|---|---|
| 1.00 | Numerically identical to CPU -- no precision issues |
| 0.95-0.99 | A few tokens differ due to near-equal logits -- expected for lower-precision quants |
| < 0.90 | Systematic precision issues -- the GPU kernel may need investigation |
| 0.00 | First token wrong -- the quantization kernel is likely broken |
A static dashboard visualizes benchmark results across machines and browsers. Deployed to the HF Space on every push to main via .github/workflows/sync-to-hf-space.yml.
```bash
npm run build:site
npx serve site
```

- Support Matrix -- pass/fail for each model/quant/browser combination
- Performance Charts -- decode and prefill throughput, throughput vs model size
- Machine Comparison -- side-by-side results when multiple machines have data
- Error Analysis -- failures grouped by category (OOM, WASM Abort, Timeout)
- Filtering -- filter by machine, browser, model, status, quantization type
The default path pushes to a shared Hugging Face dataset. The HF Space sync workflow pulls from the dataset on every push, so your results surface publicly without any manual PR (re-trigger manually via workflow_dispatch if you want to refresh between pushes).
```bash
# 1. Run benchmarks
node runner.js --browsers=chromium
# …or use the Run page: http://localhost:3000/site/run.html

# 2. Push to the leaderboard dataset
export HF_TOKEN=hf_your_write_token   # create at https://huggingface.co/settings/tokens
export HF_DATASET_REPO=owner/webgpu-bench-leaderboard
npm run submit
```

Each machine/browser pair becomes one commit at `runs/{YYYY-MM-DD}/{slug}-{browser}-{epoch}.json`.
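For reference, a sketch of how that commit path can be built. This is illustrative: `scripts/push-to-dataset.mjs` is authoritative, and the machine slug and epoch granularity shown here are assumptions.

```js
// Build runs/{YYYY-MM-DD}/{slug}-{browser}-{epoch}.json for one machine/browser pair.
function runPath(slug, browser, date = new Date()) {
  const day = date.toISOString().slice(0, 10);      // YYYY-MM-DD
  const epoch = Math.floor(date.getTime() / 1000);  // assumed: seconds since epoch
  return `runs/${day}/${slug}-${browser}-${epoch}.json`;
}

runPath("m2-macbook-air", "chromium"); // "m2-macbook-air" is a hypothetical slug
```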
- Benchmarks (CLI or the Run page at `/site/run.html`) write to `results/results.json`.
- `npm run submit` pushes stripped records to the HF dataset via `scripts/push-to-dataset.mjs`.
- `sync-to-hf-space.yml` runs `scripts/sync-from-dataset.mjs` to regroup `runs/**/*.json` into `data/machines/{slug}.json`, then `scripts/build-site.js` merges into `data/combined.json`, then flattens `site/` onto the HF Space root.
- The static Space renders `combined.json` client-side.

First-time bootstrap: `HF_TOKEN=… HF_DATASET_REPO=… node scripts/bootstrap-dataset.mjs` seeds the dataset from any existing `data/machines/*.json`.
Edit models.json to add new models, repos, or quantization types.
Add an entry to the model's `variants` array:

```json
{
  "quant": "Q5_0",
  "filename": "Llama-3.2-1B-Instruct-Q5_0.gguf",
  "sizeMB": 870
}
```

Add a new entry to the top-level `models` array:

```json
{
  "repo": "bartowski/Qwen2.5-1.5B-Instruct-GGUF",
  "name": "Qwen2.5-1.5B-Instruct",
  "variants": [
    { "quant": "Q4_K_M", "filename": "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf", "sizeMB": 1050 },
    { "quant": "Q8_0", "filename": "Qwen2.5-1.5B-Instruct-Q8_0.gguf", "sizeMB": 1680 }
  ]
}
```

Edit the `quickVariants` array at the bottom of `models.json`:

```json
"quickVariants": ["Q2_K", "Q4_K_M", "Q8_0"]
```

To list the GGUF files (and their sizes) available in a HuggingFace repo:

```bash
curl -s "https://huggingface.co/api/models/<owner>/<repo>/tree/main" \
  | python3 -c "
import sys, json
for f in json.load(sys.stdin):
    if f['path'].endswith('.gguf'):
        print(f'{f[\"path\"]:60s} {f[\"size\"]/(1024**2):8.1f} MB')
"
```

| Model | Repo | Variants |
|---|---|---|
| Llama-3.2-1B-Instruct | unsloth | 27 |
| gemma-3-270m-it | unsloth | 24 |
| Qwen3-0.6B | unsloth | 26 |
| LFM2.5-350M | LiquidAI | 7 |
| SmolLM3-3B | unsloth | 24 |
| Ministral-3-3B-Instruct-2512 | unsloth | 26 |
| Qwen3.5-2B | unsloth | 22 |
| gemma-4-E2B-it | unsloth | 21 |
| granite-4.0-h-1b | ibm-granite | 15 |
| Bonsai-1.7B | prism-ml | 2 |
Run `node scripts/fill-sizes.mjs` after editing the list to HEAD each file on HF and populate `sizeMB`.
```
webgpu-bench/
  llama.cpp/            # Git submodule
  bench.cpp             # C++ wrapper exporting 5 WASM functions
  CMakeLists.txt        # CMake config (JSPI/Asyncify toggle)
  build.sh              # Builds both WASM variants
  harness.html/js       # Browser-side: downloads model, runs inference
  server.js             # Express server (static site + /api/models, /api/results) + CORS
  runner.js             # Playwright/WebDriverIO orchestrator
  config.js             # Reads models.json, parses CLI args
  models.json           # Model definitions (10 models, 230 variants)
  report.js             # Results aggregation (JSON/CSV)
  scripts/
    submit-results.js   # Prepare results for PR submission
    build-site.js       # Merge machine data into combined.json
  data/machines/        # Committed benchmark results (one file per machine)
  site/                 # Static dashboard (HF Space)
  .github/workflows/    # CI: deploy dashboard on merge
```
- `build.sh` compiles llama.cpp to WASM with WebGPU support (Emscripten + emdawnwebgpu)
- `runner.js` starts a local Express server that serves the harness page
- Playwright launches Chrome; WebDriverIO launches real Safari (for actual WebGPU support)
- Each browser navigates to `harness.html`, which detects JSPI support and loads the correct WASM variant
- The model is downloaded from HuggingFace into OPFS inside the browser; the worker reads it via a `FileSystemSyncAccessHandle` (no WASM-heap copy)
- Inference runs via WebGPU (or CPU fallback) using llama.cpp's C API
- Performance metrics from `llama_perf_context()` are exposed to the test runner via `window.__BENCH`
- Results are aggregated into JSON/CSV files
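As an illustration of the last two steps, the runner can poll for the metrics object with Playwright. This is only a sketch: the `done` flag and the timeout are assumptions, and the real polling lives in `runner.js`.

```js
// Hypothetical helper: wait for the harness to publish results on window.__BENCH.
async function readBenchResult(page) {
  await page.waitForFunction(() => window.__BENCH && window.__BENCH.done, null, {
    timeout: 20 * 60 * 1000, // generous: large models take a while to download and run
  });
  return page.evaluate(() => window.__BENCH);
}
```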
| Function | Description |
|---|---|
| `bench_init()` | Load all GGML backends |
| `bench_load(path, n_ctx, n_gpu_layers)` | Load a GGUF model |
| `bench_run(prompt, n_predict)` | Greedy-decode tokens, return metrics + token IDs as JSON |
| `bench_eval_tokens(prompt, ref_ids_csv)` | Forced-decoding consistency check against CPU reference |
| `bench_exit()` | Free model and context |
All functions use greedy sampling for deterministic output.
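To make the call surface concrete, here is a hypothetical driver using Emscripten's `ccall`. Return types, the model path, the `createBench` factory name, and the async plumbing (JSPI/Asyncify) are all assumptions; the real glue lives in the harness/worker code.

```js
// Illustrative only: drive the exported functions for one benchmark run.
const Module = await createBench(); // stands in for whatever factory bench.js exports

Module.ccall("bench_init", null, [], []);
Module.ccall("bench_load", "number",
  ["string", "number", "number"],
  ["/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf", 2048, 999]); // hypothetical FS path

// Greedy-decode 128 tokens; metrics + token IDs come back as a JSON string.
const out = Module.ccall("bench_run", "string", ["string", "number"], ["Hello", 128]);
console.log(JSON.parse(out));

Module.ccall("bench_exit", null, [], []);
```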
| Browser | Automation | WASM Variant | WebGPU |
|---|---|---|---|
| Chrome | Playwright | JSPI | Yes (via Dawn) |
| Safari | WebDriverIO | Asyncify | Yes (macOS native) |
Safari uses WebDriverIO instead of Playwright to access real Safari with native WebGPU support. Playwright's WebKit engine doesn't support WebGPU.
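A minimal sketch of the WebDriverIO side (illustrative; the actual session setup in `runner.js` may differ, and Remote Automation must be enabled first):

```js
import { remote } from "webdriverio";

// Depending on the WebDriverIO version, safaridriver may need to be started
// manually or is handled by WebDriverIO's driver management.
const browser = await remote({
  capabilities: { browserName: "safari" },
});
await browser.url("http://localhost:3000/harness.html");
// ... wait for window.__BENCH as with Chrome, then:
await browser.deleteSession();
```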
The framework detects the platform and sets Chromium GPU flags:
- macOS: `--use-angle=metal`
- Linux: `--use-angle=vulkan`
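Roughly what this looks like with Playwright (a sketch; the exact flag set in `runner.js` may include more):

```js
import { chromium } from "playwright";

// Pick the ANGLE backend per platform: Metal on macOS, Vulkan on Linux.
const angle = process.platform === "darwin" ? "metal" : "vulkan";
const browser = await chromium.launch({
  args: [`--use-angle=${angle}`],
});
```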
On Linux with NVIDIA GPU, use xvfb-run for headless:
```bash
xvfb-run node runner.js --quick
```

To update the llama.cpp submodule and rebuild:

```bash
cd llama.cpp
git pull origin master
cd ..
git add llama.cpp
git commit -m "Update llama.cpp submodule"
npm run build
```