bench(inference): track agentic workload (1000-ctx/100-resp) lattice vs ollama vs MLX #186

## What

Promote this session's ad-hoc agentic comparison into a tracked benchmark. Long-context + short-response is the workload that exposes the prefill gap (#185); our existing benches measure decode-only at short context (`context_scaling.tsv` tops out at 256 tokens) and miss it entirely.

## Deliverables

- [ ] Commit `scripts/bench_compare_1k.py` (lattice via `bench_decode_ab`, ollama via `/api/generate`, MLX via `mlx_lm`) — reports TTFT / decode / total + prefill & decode tok/s
- [ ] Keep the `BENCH_PROMPT_TOKENS` prompt-padding knob in `bench_decode_ab` (added this session) so lattice can be measured at arbitrary context depth
- [ ] Sweep contexts {1000, 2000, 4000} × response 100, save to `docs/bench_results/agentic_*.json` (MLX rows already exist in `agentic_workload.json`)
- [ ] Normalize ollama's context to the target length (the current heuristic overshot to 1450 tokens) for a clean apples-to-apples row
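
For the ollama row, one way to get the TTFT/decode split is from the timing fields in the non-streaming `/api/generate` response (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`; durations are in nanoseconds). This is a minimal sketch, not the contents of `scripts/bench_compare_1k.py`: the helper name `ollama_metrics` is hypothetical, and treating prompt-eval time as TTFT is an assumption that ignores `load_duration`, so it only holds for a warm model.

```python
def ollama_metrics(resp: dict) -> dict:
    """Map an ollama /api/generate response (stream=false) to bench columns.

    Hypothetical helper: assumes prompt-eval time approximates TTFT.
    """
    ns_per_ms = 1_000_000
    prefill_ms = resp["prompt_eval_duration"] / ns_per_ms
    decode_ms = resp["eval_duration"] / ns_per_ms
    ctx = resp["prompt_eval_count"]   # prompt tokens actually evaluated
    out = resp["eval_count"]          # generated tokens
    return {
        "ctx": ctx,
        "ttft_ms": prefill_ms,
        "decode_ms": decode_ms,
        "total_ms": prefill_ms + decode_ms,
        "prefill_tps": ctx / (prefill_ms / 1000),
        "decode_tps": out / (decode_ms / 1000),
    }


# Illustrative response built from the rounded ollama row in this issue,
# not a captured API payload.
sample = {
    "prompt_eval_count": 1450,
    "prompt_eval_duration": 480_000_000,
    "eval_count": 100,
    "eval_duration": 1_194_000_000,
}
row = ollama_metrics(sample)
print(row)
```

Reading `ctx` back from `prompt_eval_count` rather than from the intended prompt length is what makes an overshoot like 1450 visible in the results.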

## Why separate from #185

#185 is the fix (batched prefill). This issue is the **measurement** that proves the fix worked — without a tracked agentic bench, a prefill improvement won't show up in CI (the regression gate only watches short-context decode slope).

## Current numbers (baseline to beat)

```
Engine       Ctx  TTFT(ms)  Decode(ms)  Total(ms)  Prefill t/s  Decode t/s
lattice     1009    10207        1513      11720           99          66
ollama      1450      480        1194       1674         3022          84
mlx         1000      384         406        789         2608         246
```
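
The throughput columns follow from the timing columns: prefill t/s is context tokens divided by TTFT, and decode t/s is response tokens divided by decode time. The sketch below recomputes them from the table, assuming the 100-token response named in the title. The function name `throughput` is illustrative, and recomputed values can differ from the reported ones by a few t/s because the table's timings are rounded to whole milliseconds.

```python
RESPONSE_TOKENS = 100  # assumption: the "100-resp" from the issue title

# (ctx tokens, TTFT ms, decode ms) copied from the table above
rows = {
    "lattice": (1009, 10207, 1513),
    "ollama": (1450, 480, 1194),
    "mlx": (1000, 384, 406),
}


def throughput(ctx, ttft_ms, decode_ms, resp_tokens=RESPONSE_TOKENS):
    """Return (prefill tok/s, decode tok/s) from raw timings in milliseconds."""
    prefill_tps = ctx / (ttft_ms / 1000)
    decode_tps = resp_tokens / (decode_ms / 1000)
    return prefill_tps, decode_tps


derived = {name: throughput(*vals) for name, vals in rows.items()}
for name, (prefill_tps, decode_tps) in derived.items():
    print(f"{name:8s} prefill={prefill_tps:8.1f} t/s  decode={decode_tps:6.1f} t/s")

# How much longer lattice takes to reach the first token than mlx
ttft_gap = rows["lattice"][1] / rows["mlx"][1]
print(f"lattice/mlx TTFT ratio: {ttft_gap:.1f}x")
```

On these numbers lattice's TTFT is roughly 26.6x mlx's at a near-identical context, while its decode throughput is within about 4x, which is why the baseline to beat is dominated by prefill.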
