feat(server): add scoped disk prefix cache policy by weicj · Pull Request #364 · Luce-Org/lucebox-hub

weicj · 2026-06-10T12:47:23Z

Summary

This PR adds a configurable prefix-selection policy for disk prefix cache to improve hit rate on real agent-style workloads.

The existing disk prefix restore behavior, introduced by PR #227 and extended for target split / mixed-backend restore by PR #325 and PR #352, was oriented around saving/restoring the full prompt. That works for repeated identical requests, but agent workloads usually have a large stable context followed by a dynamic tail: tool results, recent turns, and task state can change between requests. When the full prompt is used as the cache scope, small tail changes can prevent reuse of an otherwise valuable cached prefix.

This PR adds a controllable prefix scope for disk cache. By caching a slightly shorter but more stable prefix, the server can trade a small number of cached tokens for a higher cross-request hit rate. The scope can be a fixed token count provided by the user, or auto can infer a stable boundary from recent similar requests.

CLI:

--disk-prefix-cache off
--disk-prefix-cache full
--disk-prefix-cache auto
--disk-prefix-cache auto:30
--disk-prefix-cache 1000

Requests can also override the policy through prefix_cache.scope:

{
  "prefix_cache": {
    "scope": "auto:30"
  }
}
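As a rough illustration of the override order, the sketch below resolves an effective scope string from a server default and a request body. The function name `effective_scope` is hypothetical, and the rule that `prefix_cache.window` replaces any window already embedded in an `auto:N` scope is an assumption; the PR only states that the window field can override the auto window.

```python
from typing import Any, Mapping


def effective_scope(server_default: str, request_body: Mapping[str, Any]) -> str:
    """Pick the scope for one request: request value first, CLI default otherwise."""
    prefix_cache = request_body.get("prefix_cache") or {}
    scope = str(prefix_cache.get("scope", server_default))
    window = prefix_cache.get("window")
    # Assumption: an explicit window only matters for auto scopes and wins
    # over a window embedded in the scope string.
    if window is not None and scope.startswith("auto"):
        scope = f"auto:{int(window)}"
    return scope
```

A request without a `prefix_cache` object simply inherits whatever `--disk-prefix-cache` was set to.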

Semantics:

off: disables disk prefix cache restore/save.
full: keeps the existing full-prompt disk cache behavior, including cold-prefix and continued checkpoints.
auto: defaults to auto:30; searches for the most similar token prefix among the 30 most recent requests, then aligns down to a safe chat boundary.
auto:N: sets the lookback window to N requests. N is the candidate window size; it does not require all N requests to share the same prefix.
N: caches/restores the first N prompt tokens, for example 1000.
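The accepted values can be read as a small grammar. The following is a minimal parsing sketch, not the PR's `DiskPrefixCachePolicy` code: `ScopePolicy` and `parse_scope` are hypothetical names, and rejecting zero or negative numbers is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_AUTO_WINDOW = 30  # default lookback window stated in the PR


@dataclass(frozen=True)
class ScopePolicy:
    mode: str                     # "off", "full", "auto", or "fixed"
    window: Optional[int] = None  # lookback window, auto mode only
    tokens: Optional[int] = None  # prefix length, fixed mode only


def _positive_int(text: str) -> Optional[int]:
    """Return text as a positive int, or None if it is not one."""
    if text.isascii() and text.isdigit() and int(text) > 0:
        return int(text)
    return None


def parse_scope(value: str) -> ScopePolicy:
    """Parse off | full | auto | auto:N | N into a ScopePolicy."""
    text = value.strip().lower()
    if text in ("off", "full"):
        return ScopePolicy(mode=text)
    if text == "auto":
        return ScopePolicy(mode="auto", window=DEFAULT_AUTO_WINDOW)
    if text.startswith("auto:"):
        window = _positive_int(text[len("auto:"):])
        if window is not None:
            return ScopePolicy(mode="auto", window=window)
    else:
        tokens = _positive_int(text)
        if tokens is not None:
            return ScopePolicy(mode="fixed", tokens=tokens)
    raise ValueError(f"invalid disk prefix cache scope: {value!r}")
```

Under these assumptions `parse_scope("auto")` and `parse_scope("auto:30")` yield the same policy, and anything outside the grammar raises `ValueError`.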

The server only compares token prefixes. It does not make semantic decisions about system prompts, tools, AGENTS.md, RAG, or dynamic tails. Disk keys remain token-prefix hashes, so semantically similar but token-different prompts do not collide.
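To make the keying idea concrete, here is a small sketch of a token-prefix hash. The hash function (SHA-256) and the little-endian 32-bit token encoding are assumptions chosen for illustration; the PR does not say how the server derives its disk keys, and `prefix_key` is a hypothetical name.

```python
import hashlib
import struct
from typing import Sequence


def prefix_key(tokens: Sequence[int], scope_len: int) -> str:
    """Hash only the first scope_len token ids, so the tail cannot change the key."""
    prefix = list(tokens[:scope_len])
    # Assumption: token ids fit in an unsigned 32-bit integer.
    payload = struct.pack(f"<{len(prefix)}I", *prefix)
    return hashlib.sha256(payload).hexdigest()


stable = [11, 12, 13, 14]
request_a = stable + [900, 901]  # same stable context, one dynamic tail
request_b = stable + [777]       # same stable context, a different tail

same_scoped_key = prefix_key(request_a, 4) == prefix_key(request_b, 4)
same_full_key = (prefix_key(request_a, len(request_a))
                 == prefix_key(request_b, len(request_b)))
```

Here `same_scoped_key` is true while `same_full_key` is false: a key built from the scoped prefix is shared across both requests, whereas a full-prompt key changes with the tail.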

Window / Hit-rate Tradeoff

auto:N uses N as the recent-request lookback window. Smaller windows usually cache more recent-context tokens, but the key is more likely to drift with the dynamic tail; larger windows usually cache fewer tokens but produce a more stable hit rate.

Tokenizer-level simulation results:

| workload | `auto:2` | `auto:8` | `auto:30` |
| --- | --- | --- | --- |
| independent agent tasks | avg `1432` tokens, `28/30` hits (`93%`) | avg `1432` tokens, `28/30` hits (`93%`) | avg `1432` tokens, `28/30` hits (`93%`) |
| synthetic rolling chat | avg `1988` tokens, `1/30` hits (`3%`) | avg `1783` tokens, `7/30` hits (`23%`) | avg `1452` tokens, `28/30` hits (`93%`) |
| real rolling trace A | avg `1776` tokens, `8/20` hits (`40%`) | avg `1389` tokens, `10/20` hits (`50%`) | avg `1049` tokens, `18/20` hits (`90%`) |
| real rolling trace B | avg `2204` tokens, `7/20` hits (`35%`) | avg `1489` tokens, `10/20` hits (`50%`) | avg `869` tokens, `18/20` hits (`90%`) |

The default auto:30 is therefore intentionally conservative: it caches fewer high-variance tail tokens to get a higher disk prefix hit rate. Users that want a longer recent-context cache can still use auto:2, auto:8, or a fixed token count.

Changes

Adds DiskPrefixCachePolicy with off, full, auto[:window], and fixed-token modes.
Adds --disk-prefix-cache off|full|auto|auto:N|N.
Adds request-level support for the same values through prefix_cache.scope; prefix_cache.window can override the auto window.
Makes auto use the best-match longest common token prefix from the recent request window, aligned down to a safe chat boundary.
Sets the default auto window to 30.
Makes fixed / auto prefill exactly to the selected token boundary before saving the KV snapshot, keeping the disk key and snapshot position aligned.
Keeps full on the existing full-prompt, cold-prefix, and continued-checkpoint behavior.
Keeps --kv-cache-min-tokens as the global minimum save threshold.
Limits --kv-cache-cold-max to full mode cold-prefix selection; it does not cap auto or fixed-number boundaries.
Disables auto / fixed when PFlash rewrites the effective prompt; full-prompt restore keeps the existing behavior.
Exposes the active disk policy in /props.full_cache.disk_policy.
Adds [disk-cache] auto scope: ... selected=... diagnostics for boundary selection.
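The boundary selection for auto can be pictured roughly as follows. This is a simplified sketch, not the PR's implementation: it scores candidates by plain longest common prefix and takes the maximum, whereas the simulation table shows larger windows selecting shorter prefixes, so the real heuristic evidently differs. The names `common_prefix_len` and `select_auto_scope`, and the `boundaries` list of safe chat-boundary offsets, are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_auto_scope(prompt: Sequence[int],
                      recent: Sequence[Sequence[int]],
                      boundaries: Sequence[int],
                      window: int = 30) -> int:
    """Return a scoped prefix length, or 0 when nothing reusable is found.

    recent holds earlier prompts, oldest first; boundaries is a sorted list
    of token offsets in prompt that are assumed safe to cut at.
    """
    candidates = recent[-window:] if window > 0 else []
    best = max((common_prefix_len(prompt, r) for r in candidates), default=0)
    # Align down: pick the largest boundary that does not exceed the match.
    idx = bisect_right(boundaries, best)
    return boundaries[idx - 1] if idx > 0 else 0
```

With a common prefix of 3804 tokens and a boundary at offset 3801, the sketch returns 3801, which mirrors the "LCP 3804 floored to 3801" case reported in the validation results.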

End-to-end Restore Checks

End-to-end restore closure was validated on both a 27B dense backend and an OpenClaw-style MoE request. The dense backend check showed that a scoped prefix selected by auto:30 can be persisted, found again after a server restart as a cold-start disk hit, and used to cut prefill from 77.6s on the cold run to 1.6s after restore. The MoE check showed that auto boundary selection maps to a stable scoped prefix on a real agent prompt shape. qwen35moe hybrid restore itself is covered by the companion bugfix PR #362. This PR only covers disk prefix policy and boundary selection; the hit-rate behavior is covered by the window tradeoff simulation above.

cubic-dev-ai

No issues found across 6 files


The scoped disk prefix cache (auto/fixed policy) prefills exactly to the selected boundary and requests a snapshot at snap_pos == prompt end. The qwen35 chunked prefill only snapshots when snap_pos falls strictly inside a chunk, so end-of-prefill snapshots never fired: scoped saves silently no-opped and every auto-mode request paid the scoped prefill plus a full re-prefill (~2x prefill cost, zero cache entries).

Take the snapshot after the chunk loop when snap_pos == committed. This does not touch the prefill computation (no chunk reshaping, cur_pos is already at committed), so cache-hit bit-exactness is preserved.

Verified on RTX 3090 (sm_86), Qwen3.6-27B Q4_K_M, auto:8 policy:
- scoped save fires at the chat boundary ([snap] end-of-prefill)
- double prefill gone: save request 8.35s vs 13.4s before
- cold-start disk hits after server restart: 0.36-0.50s vs 9.2s miss
- with DFlash draft: greedy output bit-identical across miss/save/hit/restart, accept rate unchanged
- 1905 server unit assertions pass

Also fixes qwen35moe in the fully GPU-resident case, which delegates to this prefill path.

Co-Authored-By: WOZCODE <contact@withwoz.com>

davide221 · 2026-06-11T11:50:57Z

Validated end-to-end on lucebox2 (RTX 3090 sm_86, branch merged with current main), and pushed one fix commit (70cd8bf) to this branch.

Validation results

qwen35moe hybrid (Qwen3.6-35B-A3B, forced hot/cold split), auto:8 — works as designed:

scoped save lands on a chat-turn boundary (LCP 3804 floored to 3801)
never-seen-tail requests: 1.06-2.9s vs 19-21s miss
cold-start fallback hit after server restart confirmed
note: on this backend main persists nothing for <10240-token prompts (every inline snap logs [snap] hybrid skip unsafe boundary), so the scoped save is the first working disk persistence there

qwen35 dense (Qwen3.6-27B) — was broken, fixed by 70cd8bf: the chunked prefill only snapshotted when snap_pos fell strictly inside a chunk, and scoped requests always ask for snap_pos == prompt end, so auto/fixed silently never saved and every request paid the scoped prefill plus a full re-prefill (13.4s vs 6.9s per request). The fix takes the snapshot after the chunk loop when snap_pos == committed — no chunk reshaping, so hit bit-exactness is preserved. After the fix: cold-start disk hits 0.36-0.50s vs 9.2s miss; with a DFlash draft attached, greedy output is byte-identical across miss/save/hit/restart with unchanged accept rate. This also covers qwen35moe fully-GPU-resident, which delegates to the same prefill path.
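The control-flow bug and its fix can be modelled with a toy loop. This is only a sketch of the logic described above, not the qwen35 prefill code: `chunked_prefill` and `take_snapshot` are hypothetical names, and the per-chunk compute is omitted.

```python
from typing import Callable


def chunked_prefill(n_tokens: int, chunk: int, snap_pos: int,
                    take_snapshot: Callable[[int], None]) -> int:
    """Walk the prompt in chunks and snapshot once at snap_pos."""
    committed = 0
    snapped = False
    while committed < n_tokens:
        end = min(committed + chunk, n_tokens)
        # Original behavior: snapshot only if snap_pos is strictly inside
        # the current chunk, which is never true for snap_pos == n_tokens.
        if committed < snap_pos < end:
            take_snapshot(snap_pos)
            snapped = True
        committed = end
    # The fix: after the loop, snapshot when the requested position is
    # exactly where prefill stopped. The chunk layout is left unchanged.
    if not snapped and snap_pos == committed:
        take_snapshot(committed)
    return committed


snapshots = []
chunked_prefill(n_tokens=1000, chunk=256, snap_pos=1000,
                take_snapshot=snapshots.append)
```

In this model a scoped request, where `snap_pos` equals the prompt end, records exactly one snapshot at position 1000; without the post-loop check the list would stay empty.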

laguna / gemma4 single-device — scoped snapshot fires and the staged restore works in-run (laguna 1.68s vs 3.2s, gemma4 0.26s vs 2.45s), but disk_cache_.save() returns false because neither backend implements snapshot_ref — pre-existing gap, identical on main, and the PR degrades gracefully ("scoped prefix staged", no extra prefill cost). Worth a follow-up to lift the serialization from laguna_layer_split_adapter into the single-device backends.

Other checks: 1905 unit assertions pass (incl. the 4 new policy tests); /props.full_cache.disk_policy exposed; request-level prefix_cache.scope override works; invalid CLI value exits 2; invalid request scope returns a clean 400; default full policy behavior matches main.

LGTM with 70cd8bf included.

…rg#364 scoped cache

- Port 354e7b6 message-count freeze (aged[1..n-hot) compressed once, cached)
- Remove mutual-exclusion: FlowKV active → disk clamps to system_end (verbatim system anchor, stable cross-session key); Luce-Org#364 unchanged when compress=false
- WS1: non-continuation turns skip compression (cold-poison fix preserved)
- Inert-guard: aged band < 512 tokens → FlowKV-OFF
- Config: DiskPrefixCachePolicy::compress + --disk-prefix-cache-compress CLI
- Tests T1-T7: 1908 assertions, 0 failures

… vs Luce-Org#364

FlowKV ran whenever disk_cache_policy.compress was set, with no size gate, so every multi-turn agentic turn paid the full pFlash drafter-forward (~400s/session at 59K) and re-expanded the prompt, making COMPOSE ~1.9x slower than the plain Luce-Org#364 scoped disk cache it should improve on.

- Gate FlowKV on the original prompt size (same threshold as the pFlash gate), and skip it once pFlash has already compressed.
- Below threshold COMPOSE is byte-identical to Luce-Org#364 (full prefix-cache hits, no drafter tax); compression fires only when the conversation can't fit the KV.
- Keep the scoped-disk-re-prefill skip under compression (avoids turn-2 hang).

Validated on abc_cache_harness COMPOSE arm (auto, threshold=65000): goldgate_fix total wall 846s -> 480s (~Luce-Org#364's 443s), zero compression on sub-threshold turns. Activate via --prefill-compression auto --prefill-threshold ~max_ctx.

…g#364 scoped save

47081e67 demoted FlowKV to a downstream else-if after whole-prompt pFlash, gated on the same threshold, making FlowKV structurally unreachable (any threshold that let it run made pFlash fire first; PFLASH_FREEZE_HISTORY went dead).

Replace with the unified gate (compute should_compress once; route continuations to FlowKV-freeze with should_compress=false; whole-prompt pFlash only for cold non-continuations), mirroring the working flowkv-standalone structure. Re-enable Luce-Org#364's scoped disk save under compression (drop the band-aid guard; the disk-clamp already pins the save to the stable system_end prefix).

Paired A/B, same binary (cb458145), full 7-turn goldgate_fix, single-session: COMPOSE_FLOWKV 615.9s vs pure-Luce-Org#364 713.7s (1.16x), decode 13.6 vs 6.7 tps, tool-valid 85.7% vs 71.4%. FlowKV engages on continuations; ee7 keeps the drafter forward cheap. Turn-4 transition cost (park/unpark + uncached compressed-prefill) is the remaining lever, not the gate.

feat(server): add scoped disk prefix cache policy

fdd6e88

cubic-dev-ai Bot reviewed Jun 10, 2026


easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 10, 2026

merge: integrate PR Luce-Org#364 scoped disk prefix cache policy

bd6e952

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 10, 2026

docs: record PR Luce-Org#364 integration in stack manifest

efb3fc6

davide221 merged commit 4cde812 into Luce-Org:main Jun 11, 2026
3 checks passed

dusterbloom mentioned this pull request Jun 11, 2026

compose: FlowKV aged-history compression + drafter residency fix — 1.72x vs disk-cache baseline at <=64K #372

Open

davide221 merged 2 commits into Luce-Org:main from weicj:feat-scoped-disk-prefix-cache-policy
