server: fall back to fresh prefill when a cached snapshot is longer than the prompt #370
Merged
davide221 merged 1 commit on Jun 11, 2026
…han the prompt
The prefix-cache restore path treated prompt_len < snap_pos as
"should never happen" and failed the request with
error=snapshot_longer_than_prompt and zero output tokens.
With agent clients (Letta, Hermes, OpenClaw, ...) this happens
routinely: they summarize/edit their message history between turns, so
a follow-up prompt is often shorter than the KV snapshot cached for the
slot. In production this surfaced as agents silently hanging — the
server returned an empty completion mid-conversation:
chat DONE ... ok=false in=42491 effective_in=42491 out=0
restore=true slot=14 prefix_len=38663 error=snapshot_longer_than_prompt
Fix, two layers:
- routing (http_server): when the selected slot's snapshot covers more
KV than the incoming prompt, treat the lookup as a cache miss instead
of restoring a state that cannot be diff-prefilled
- backends (qwen35, gemma4, layer_split): if such a request still
reaches restore_and_generate, fall back to a fresh full prefill
instead of emitting -1
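The routing layer of the fix can be sketched as a guard on the cache lookup result. This is a minimal illustration, not the code from http_server: the types `SlotSnapshot` and `CacheHit` and the function `usable_hit` are hypothetical names, and the only rule taken from the description above is that a snapshot covering more KV than the prompt is treated as a miss.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical, simplified view of a cached slot (not the real server types).
struct SlotSnapshot {
    int slot;              // slot index holding the snapshot
    std::size_t snap_pos;  // number of KV positions the snapshot covers
};

// Hypothetical result handed to the restore path on a usable hit.
struct CacheHit {
    int slot;
    std::size_t prefix_len;
};

// Return a hit only when the snapshot can be extended by a diff prefill,
// i.e. when it does not cover more KV than the incoming prompt.
inline std::optional<CacheHit> usable_hit(const std::optional<SlotSnapshot>& candidate,
                                          std::size_t prompt_len) {
    if (!candidate) {
        return std::nullopt;  // ordinary cache miss
    }
    if (prompt_len < candidate->snap_pos) {
        return std::nullopt;  // snapshot longer than prompt: treat as a miss
    }
    return CacheHit{candidate->slot, candidate->snap_pos};
}
```

Returning an empty optional lets the caller reuse its existing cold-request path, so the shortened-history case needs no separate error handling in this sketch.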
Verified on an RTX 3090: cold request, cache-hit continuation, and a
shortened-history follow-up all return tokens; greedy outputs on normal
cache hits are unchanged.
Contributor
@Rhonstin very good finding, thanks for the contribution.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 11, 2026
The inline and full-compress prefix caches assign snapshot slots round-robin via next_slot_, which advances in prepare_*_snap even when the snapshot later aborts (degenerate boundary, failed generation, client disconnect). A burned step makes a later confirm wrap onto a slot that a live entry still references. From then on the entry table and the slot contents disagree: the entry's hash describes one token stream, the slot holds a snapshot of another.
Consequences of such a stale entry:
- follow-up prompt shorter than the slot snapshot: failed request (snapshot_longer_than_prompt) before PR Luce-Org#370, conservative cache miss after it;
- follow-up prompt longer than the slot snapshot: the restore path attaches KV from the wrong token stream with no validation - silent context corruption.
Fix the root cause: when confirm_inline_snap / confirm_full_snap commit a snapshot into a slot, erase every other entry still pointing at that slot. A slot holds exactly one snapshot, so at most one entry may describe it.
Verified on RTX 3090 (Qwen3.6-27B Q4_K_M, --prefix-cache-slots 2) with the deterministic PR Luce-Org#370 repro (short conv -> aborted snap -> big conv wrapping onto slot 0 -> shortened follow-up): the wrap now logs '[pc] dropping stale entry for reused slot=0', the follow-up is a clean miss with correct output, and a longer same-conversation follow-up restores a valid snapshot. Greedy outputs across the sequence match the no-cache baseline; 1905 server unit assertions pass.
Co-Authored-By: WOZCODE <contact@withwoz.com>
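The slot-reuse rule in the commit above ("at most one entry may describe a slot") can be sketched with a toy entry table. Everything here is an assumption made for illustration: the class `PrefixCacheTable`, its `prepare` and `confirm` methods, and the hash-to-slot map stand in for the real prepare_*_snap / confirm_*_snap code, which is not shown in this text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical, simplified entry table: prefix hash -> slot index.
class PrefixCacheTable {
public:
    explicit PrefixCacheTable(int n_slots) : n_slots_(n_slots) {}

    // Reserve the next slot round-robin. The cursor advances even if the
    // snapshot is later aborted and confirm() is never called for it.
    int prepare() {
        int slot = next_slot_;
        next_slot_ = (next_slot_ + 1) % n_slots_;
        return slot;
    }

    // Commit a snapshot into `slot`: erase every other entry that still
    // points at that slot, then record the new entry.
    // Returns the number of stale entries dropped.
    int confirm(std::uint64_t hash, int slot) {
        int dropped = 0;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (it->second == slot && it->first != hash) {
                it = entries_.erase(it);  // stale: slot contents are being replaced
                ++dropped;
            } else {
                ++it;
            }
        }
        entries_[hash] = slot;
        return dropped;
    }

    bool contains(std::uint64_t hash) const { return entries_.count(hash) != 0; }
    std::size_t size() const { return entries_.size(); }

private:
    int n_slots_;
    int next_slot_ = 0;
    std::unordered_map<std::uint64_t, int> entries_;
};
```

With two slots, the sequence prepare/confirm, prepare without confirm, prepare/confirm reproduces the wrap onto slot 0 from the repro; in this sketch the second confirm drops the first entry, so no lookup can match a hash whose snapshot has been overwritten.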
Problem
The prefix-cache restore path treats prompt_len < snap_pos as impossible and fails the request with error=snapshot_longer_than_prompt and zero output tokens.
With agent clients (Letta, Hermes, OpenClaw, ...) this happens routinely: they summarize or edit their message history between turns, so a follow-up prompt is often shorter than the KV snapshot cached on the slot. In production this surfaces as agents silently hanging mid-conversation, because the server returns an empty completion.
Note
prefix_len (the matched boundary) is smaller than the prompt in the failing requests' logs; the slot's actual snapshot cur_pos was larger, i.e. the entry table and the slot contents had diverged after history edits.
Fix (two layers)
- Routing (http_server.cpp): when the selected slot's snapshot covers more KV than the incoming prompt, treat the lookup as a cache miss instead of restoring a state that cannot be diff-prefilled. Applies to the inline, full-compress and disk hit paths.
- Backends (qwen35, gemma4, layer_split): if such a request still reaches restore_and_generate, fall back to a fresh full prefill (mirroring the generate() path) instead of emitting -1.
Validation
On an RTX 3090 (Qwen3.6-27B Q4_K_M + DFlash + PFlash): cold request, cache-hit continuation (diff prefill 1.0s vs 8.0s cold), and a shortened-history follow-up all return tokens; greedy outputs on normal cache hits are unchanged. Running in production since 2026-06-11 with the failures gone.
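The backend layer of the fix amounts to choosing between two prefill plans inside the restore path. The sketch below is an assumption-heavy illustration: `PrefillPlan`, `Plan` and `plan_restore` are hypothetical names, and the real restore_and_generate in the qwen35, gemma4 and layer_split backends is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of what the backend should prefill.
enum class PrefillPlan { FreshFull, DiffFromSnapshot };

struct Plan {
    PrefillPlan kind;
    std::size_t start;  // first prompt position that still needs prefill
};

// Decide how to serve a request that arrived with a snapshot attached.
// If the snapshot is longer than the prompt, it cannot be diff-prefilled,
// so fall back to a fresh full prefill instead of failing the request.
inline Plan plan_restore(std::size_t prompt_len, std::size_t snap_pos) {
    if (prompt_len < snap_pos) {
        return {PrefillPlan::FreshFull, 0};
    }
    return {PrefillPlan::DiffFromSnapshot, snap_pos};
}
```

Keeping this check in the backend as well as in routing means a request that slips past the routing guard still produces tokens, at the cost of one cold prefill.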
@Luce-Org/maintainers