server: fall back to fresh prefill when a cached snapshot is longer than the prompt by Rhonstin · Pull Request #370 · Luce-Org/lucebox-hub

Rhonstin · 2026-06-11T13:01:23Z

Problem

The prefix-cache restore path treats prompt_len < snap_pos as impossible and fails the request:

} else if (prompt_len > 0 && prompt_len < snap_pos) {
    // Cached more than the request — should never happen in practice.
    result.error = "snapshot_longer_than_prompt";
    out_io.emit(-1);
    return result;
}

With agent clients (Letta, Hermes, OpenClaw, ...) this happens routinely: they summarize/edit their message history between turns, so a follow-up prompt is often shorter than the KV snapshot cached on the slot. In production this surfaces as agents silently hanging mid-conversation — the server returns an empty completion:

chat DONE ... ok=false in=42491 effective_in=42491 out=0 restore=true slot=14 prefix_len=38663 error=snapshot_longer_than_prompt
chat DONE ... ok=false in=37831 effective_in=37831 out=0 restore=true slot=25 prefix_len=37530 error=snapshot_longer_than_prompt

Note that prefix_len (the matched boundary) is smaller than the prompt in these logs. The error still fires because the slot's actual snapshot cur_pos was larger than the prompt, i.e. the entry table and the slot contents had diverged after history edits.

Fix (two layers)

- Routing (http_server.cpp): when the selected slot's snapshot covers more KV than the incoming prompt, treat the lookup as a cache miss instead of restoring a state that cannot be diff-prefilled. Applies to the inline, full-compress and disk hit paths.
- Backends (qwen35, gemma4, layer_split): if such a request still reaches restore_and_generate, fall back to a fresh full prefill (mirroring the generate() path) instead of emitting -1.
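The two layers can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SlotSnapshot`, `Plan`, `usable_snapshot` and `plan_generation` are hypothetical names, and the real structures in http_server.cpp and the backends are assumed to carry more state than shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified view of a cached slot (assumption for illustration).
struct SlotSnapshot {
    int slot_id;
    int64_t cur_pos;  // number of KV positions the snapshot covers
};

enum class Plan { FreshPrefill, RestoreAndDiffPrefill };

// Layer 1 (routing): a snapshot that covers more KV than the prompt
// cannot be diff-prefilled, so the lookup is reported as a cache miss.
std::optional<SlotSnapshot> usable_snapshot(const std::optional<SlotSnapshot>& hit,
                                            int64_t prompt_len) {
    assert(prompt_len >= 0);
    if (!hit) return std::nullopt;                        // ordinary miss
    if (prompt_len < hit->cur_pos) return std::nullopt;   // snapshot longer than prompt
    return hit;
}

// Layer 2 (backend): if a too-long snapshot still arrives, choose a fresh
// full prefill instead of failing the request. The condition mirrors the
// one quoted in the Problem section.
Plan plan_generation(int64_t prompt_len, int64_t snap_pos) {
    if (prompt_len > 0 && prompt_len < snap_pos) return Plan::FreshPrefill;
    return Plan::RestoreAndDiffPrefill;
}
```

Keeping both checks is deliberately redundant: the routing check avoids restoring a snapshot that will be thrown away, while the backend check is a safety net for any hit path that bypasses it.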

Validation

On an RTX 3090 (Qwen3.6-27B Q4_K_M + DFlash + PFlash): cold request, cache-hit continuation (diff prefill 1.0s vs 8.0s cold), and a shortened-history follow-up all return tokens; greedy outputs on normal cache hits are unchanged. Running in production since 2026-06-11 with the failures gone.

@Luce-Org/maintainers

…han the prompt

The prefix-cache restore path treated prompt_len < snap_pos as "should never happen" and failed the request with error=snapshot_longer_than_prompt and zero output tokens.

With agent clients (Letta, Hermes, OpenClaw, ...) this happens routinely: they summarize/edit their message history between turns, so a follow-up prompt is often shorter than the KV snapshot cached for the slot.

In production this surfaced as agents silently hanging, with the server returning an empty completion mid-conversation:

chat DONE ... ok=false in=42491 effective_in=42491 out=0 restore=true slot=14 prefix_len=38663 error=snapshot_longer_than_prompt

Fix, two layers:
- routing (http_server): when the selected slot's snapshot covers more KV than the incoming prompt, treat the lookup as a cache miss instead of restoring a state that cannot be diff-prefilled
- backends (qwen35, gemma4, layer_split): if such a request still reaches restore_and_generate, fall back to a fresh full prefill instead of emitting -1

Verified on an RTX 3090: cold request, cache-hit continuation, and a shortened-history follow-up all return tokens; greedy outputs on normal cache hits are unchanged.

cubic-dev-ai

No issues found across 4 files

davide221 · 2026-06-11T14:10:44Z

@Rhonstin very good finding, thanks for the contribution

The inline and full-compress prefix caches assign snapshot slots round-robin via next_slot_, which advances in prepare_*_snap even when the snapshot later aborts (degenerate boundary, failed generation, client disconnect). A burned step makes a later confirm wrap onto a slot that a live entry still references. From then on the entry table and the slot contents disagree: the entry's hash describes one token stream, the slot holds a snapshot of another.

Consequences of such a stale entry:
- follow-up prompt shorter than the slot snapshot: failed request (snapshot_longer_than_prompt) before PR Luce-Org#370, conservative cache miss after it;
- follow-up prompt longer than the slot snapshot: the restore path attaches KV from the wrong token stream with no validation, causing silent context corruption.

Fix the root cause: when confirm_inline_snap / confirm_full_snap commit a snapshot into a slot, erase every other entry still pointing at that slot. A slot holds exactly one snapshot, so at most one entry may describe it.

Verified on RTX 3090 (Qwen3.6-27B Q4_K_M, --prefix-cache-slots 2) with the deterministic PR Luce-Org#370 repro (short conv -> aborted snap -> big conv wrapping onto slot 0 -> shortened follow-up): the wrap now logs '[pc] dropping stale entry for reused slot=0', the follow-up is a clean miss with correct output, and a longer same-conversation follow-up restores a valid snapshot. Greedy outputs across the sequence match the no-cache baseline; 1905 server unit assertions pass.

Co-Authored-By: WOZCODE <contact@withwoz.com>
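The stale-entry scenario and the fix can be modelled with a small table. This is only a sketch under stated assumptions: `PrefixCacheTable`, `prepare_snap`, `confirm_snap` and `lookup` are hypothetical stand-ins for the real prepare_*_snap / confirm_*_snap functions, and the entry table is reduced to a map from prefix hash to slot index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical model of the entry table: prefix hash -> slot index.
class PrefixCacheTable {
public:
    explicit PrefixCacheTable(int n_slots) : n_slots_(n_slots) {
        assert(n_slots > 0);
    }

    // Reserve the next slot round-robin. The cursor advances here, so an
    // aborted snapshot still "burns" a step, as described above.
    int prepare_snap() {
        int slot = next_slot_;
        next_slot_ = (next_slot_ + 1) % n_slots_;
        return slot;
    }

    // Commit a snapshot into `slot`: erase every other entry that still
    // points at it, then record the new owner. Returns the number of
    // stale entries dropped.
    std::size_t confirm_snap(uint64_t prefix_hash, int slot) {
        std::size_t dropped = 0;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (it->second == slot && it->first != prefix_hash) {
                it = entries_.erase(it);  // erase returns the next valid iterator
                ++dropped;
            } else {
                ++it;
            }
        }
        entries_[prefix_hash] = slot;
        return dropped;
    }

    // Returns the slot for a prefix hash, or -1 on a miss.
    int lookup(uint64_t prefix_hash) const {
        auto it = entries_.find(prefix_hash);
        return it == entries_.end() ? -1 : it->second;
    }

private:
    int n_slots_;
    int next_slot_ = 0;
    std::unordered_map<uint64_t, int> entries_;
};
```

With two slots, confirming one conversation into slot 0, aborting a snapshot prepared for slot 1, and then confirming a different conversation makes the cursor wrap back onto slot 0. In this model the second confirm drops the first conversation's entry, so a later lookup for it is a plain miss rather than a hit on foreign KV.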

cubic-dev-ai Bot reviewed Jun 11, 2026

davide221 merged commit c9a3e09 into Luce-Org:main Jun 11, 2026
5 checks passed

davide221 mentioned this pull request Jun 11, 2026

fix(server): drop stale prefix-cache entries when a snapshot slot is reused #371

Merged

davide221 merged 1 commit into Luce-Org:main from Rhonstin:fix/prefix-cache-shorter-prompt
