bench(inference): track agentic workload (1000-ctx/100-resp) lattice vs ollama vs MLX #186

@ohdearquant

What

Promote this session's ad-hoc agentic comparison into a tracked benchmark. Long-context + short-response is the workload that exposes the prefill gap (#185); our existing benches measure decode-only at short context (context_scaling.tsv tops out at 256 tokens) and miss it entirely.

Deliverables

  • Commit scripts/bench_compare_1k.py (lattice via bench_decode_ab, ollama via /api/generate, MLX via mlx_lm) — reports TTFT / decode / total + prefill & decode tok/s
  • Keep the BENCH_PROMPT_TOKENS prompt-padding knob in bench_decode_ab (added this session) so lattice can be measured at arbitrary context depth
  • Sweep contexts {1000, 2000, 4000} × response 100, save to docs/bench_results/agentic_*.json (MLX rows already exist in agentic_workload.json)
  • Normalize ollama's context to the target (the current heuristic overshot to 1450) for a clean apples-to-apples row
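
A minimal sketch of how the ollama row could be built, assuming a non-streaming /api/generate call. That response reports prompt_eval_count, prompt_eval_duration, eval_count and eval_duration, with durations in nanoseconds. The helper name ollama_row, the row keys, and the choice to treat prompt_eval_duration as TTFT are assumptions for illustration, not necessarily what bench_compare_1k.py does. The sample values are made up, not measured.

```python
def ollama_row(resp: dict) -> dict:
    """Convert an ollama /api/generate response into one benchmark row.

    Hypothetical helper: treats prompt-eval time as TTFT (an assumption).
    """
    ns_per_ms = 1_000_000
    prefill_ms = resp["prompt_eval_duration"] / ns_per_ms
    decode_ms = resp["eval_duration"] / ns_per_ms
    return {
        "engine": "ollama",
        "ctx": resp["prompt_eval_count"],  # actual prompt tokens evaluated
        "ttft_ms": round(prefill_ms),
        "decode_ms": round(decode_ms),
        "total_ms": round(prefill_ms + decode_ms),
        "prefill_tps": round(resp["prompt_eval_count"] / (prefill_ms / 1000)),
        "decode_tps": round(resp["eval_count"] / (decode_ms / 1000)),
    }


# Illustrative response fields only; these are not benchmark results.
sample = {
    "prompt_eval_count": 1000,
    "prompt_eval_duration": 500_000_000,  # 500 ms in nanoseconds
    "eval_count": 100,
    "eval_duration": 1_250_000_000,  # 1250 ms in nanoseconds
}
row = ollama_row(sample)
print(row)
```

Comparing row["ctx"] against the target context is one way to catch the overshoot described in the last bullet before a row is written to the results file.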

Why separate from #185

#185 is the fix (batched prefill). This issue is the measurement that proves the fix worked — without a tracked agentic bench, a prefill improvement won't show up in CI (the regression gate only watches short-context decode slope).

Current numbers (baseline to beat)

Engine       Ctx  TTFT(ms)  Decode(ms)  Total(ms)  Prefill t/s  Decode t/s
lattice     1009    10207        1513      11720           99          66
ollama      1450      480        1194       1674         3022          84
mlx         1000      384         406        789         2608         246
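
The throughput columns can be cross-checked from the timing columns. The sketch below assumes prefill t/s = Ctx / TTFT and decode t/s = 100 / Decode time, with 100 response tokens taken from the issue title. The function and variable names are illustrative. Recomputing from the rounded millisecond values reproduces the lattice row and the decode column for all three engines. The ollama and mlx prefill figures land within a few tok/s of the table, presumably because the published timings are rounded.

```python
def tokens_per_second(tokens: int, millis: float) -> float:
    """Throughput in tokens per second for a phase that took `millis` ms."""
    return tokens / (millis / 1000.0)


# (engine, ctx, ttft_ms, decode_ms) copied from the table above.
ROWS = [
    ("lattice", 1009, 10207, 1513),
    ("ollama", 1450, 480, 1194),
    ("mlx", 1000, 384, 406),
]
RESPONSE_TOKENS = 100  # from the issue title (100-resp)

# Map each engine to (prefill tok/s, decode tok/s), rounded like the table.
derived = {
    engine: (
        round(tokens_per_second(ctx, ttft_ms)),
        round(tokens_per_second(RESPONSE_TOKENS, decode_ms)),
    )
    for engine, ctx, ttft_ms, decode_ms in ROWS
}

for engine, (prefill, decode) in derived.items():
    print(f"{engine:8s} prefill={prefill:5d} tok/s  decode={decode:4d} tok/s")
```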

Labels: enhancement (new feature or request); lattice-inference (affects the lattice-inference crate, transformer inference)
