server: fall back to fresh prefill when a cached snapshot is longer than the prompt #370

Merged
davide221 merged 1 commit into Luce-Org:main from Rhonstin:fix/prefix-cache-shorter-prompt
Jun 11, 2026

@Rhonstin

Problem

The prefix-cache restore path treats prompt_len < snap_pos as impossible and fails the request:

} else if (prompt_len > 0 && prompt_len < snap_pos) {
    // Cached more than the request — should never happen in practice.
    result.error = "snapshot_longer_than_prompt";
    out_io.emit(-1);
    return result;
}

With agent clients (Letta, Hermes, OpenClaw, ...) this happens routinely: they summarize/edit their message history between turns, so a follow-up prompt is often shorter than the KV snapshot cached on the slot. In production this surfaces as agents silently hanging mid-conversation — the server returns an empty completion:

chat DONE ... ok=false in=42491 effective_in=42491 out=0 restore=true slot=14 prefix_len=38663 error=snapshot_longer_than_prompt
chat DONE ... ok=false in=37831 effective_in=37831 out=0 restore=true slot=25 prefix_len=37530 error=snapshot_longer_than_prompt

Note that prefix_len (the matched boundary) is smaller than the prompt length in these logs. The error fired anyway because the slot's actual snapshot cur_pos was larger than the prompt, i.e. the entry table and the slot contents had diverged after history edits.

Fix (two layers)

  1. Routing (http_server.cpp): when the selected slot's snapshot covers more KV than the incoming prompt, treat the lookup as a cache miss instead of restoring a state that cannot be diff-prefilled. Applies to the inline, full-compress and disk hit paths.
  2. Backends (qwen35, gemma4, layer_split): if such a request still reaches restore_and_generate, fall back to a fresh full prefill (mirroring the generate() path) instead of emitting -1.
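
The two layers can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: SlotSnapshot, PrefillPlan, usable_hit and plan_prefill are hypothetical names chosen for illustration, and only prompt_len, snap_pos and cur_pos come from the description above. The sketch also assumes that a snapshot exactly as long as the prompt stays a valid hit, which the description does not state.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical view of a cache hit: which slot matched and how many KV
// positions that slot's snapshot actually holds.
struct SlotSnapshot {
    int slot;
    std::size_t cur_pos;
};

// Hypothetical result of the backend decision.
struct PrefillPlan {
    bool restore_snapshot;     // true: restore the slot, then diff-prefill
    std::size_t prefill_from;  // first prompt position that must be prefilled
};

// Layer 1 (routing): a hit whose snapshot covers more KV than the prompt
// cannot be diff-prefilled, so it is downgraded to a cache miss.
std::optional<SlotSnapshot> usable_hit(std::optional<SlotSnapshot> hit,
                                       std::size_t prompt_len) {
    if (hit && hit->cur_pos > prompt_len) {
        return std::nullopt;
    }
    return hit;
}

// Layer 2 (backend): if a too-long snapshot still arrives, prefill the whole
// prompt from position 0 instead of failing the request. The branch quoted
// in the Problem section additionally checks prompt_len > 0; that detail is
// omitted here.
PrefillPlan plan_prefill(std::size_t prompt_len, std::size_t snap_pos) {
    if (prompt_len < snap_pos) {
        return {false, 0};
    }
    return {true, snap_pos};
}
```

With this shape, the routing check handles the common case cheaply, and the backend check acts as a safety net for any path that bypasses routing.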

Validation

On an RTX 3090 (Qwen3.6-27B Q4_K_M + DFlash + PFlash): cold request, cache-hit continuation (diff prefill 1.0s vs 8.0s cold), and a shortened-history follow-up all return tokens; greedy outputs on normal cache hits are unchanged. Running in production since 2026-06-11 with the failures gone.

@Luce-Org/maintainers

…han the prompt

The prefix-cache restore path treated prompt_len < snap_pos as
"should never happen" and failed the request with
error=snapshot_longer_than_prompt and zero output tokens.

With agent clients (Letta, Hermes, OpenClaw, ...) this happens
routinely: they summarize/edit their message history between turns, so
a follow-up prompt is often shorter than the KV snapshot cached for the
slot. In production this surfaced as agents silently hanging — the
server returned an empty completion mid-conversation:

  chat DONE ... ok=false in=42491 effective_in=42491 out=0
    restore=true slot=14 prefix_len=38663 error=snapshot_longer_than_prompt

Fix, two layers:
- routing (http_server): when the selected slot's snapshot covers more
  KV than the incoming prompt, treat the lookup as a cache miss instead
  of restoring a state that cannot be diff-prefilled
- backends (qwen35, gemma4, layer_split): if such a request still
  reaches restore_and_generate, fall back to a fresh full prefill
  instead of emitting -1

Verified on an RTX 3090: cold request, cache-hit continuation, and a
shortened-history follow-up all return tokens; greedy outputs on normal
cache hits are unchanged.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files

@davide221

@Rhonstin very good finding, thanks for the contribution

@davide221 davide221 merged commit c9a3e09 into Luce-Org:main Jun 11, 2026
5 checks passed
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 11, 2026
The inline and full-compress prefix caches assign snapshot slots
round-robin via next_slot_, which advances in prepare_*_snap even when
the snapshot later aborts (degenerate boundary, failed generation,
client disconnect). A burned step makes a later confirm wrap onto a
slot that a live entry still references. From then on the entry table
and the slot contents disagree: the entry's hash describes one token
stream, the slot holds a snapshot of another.

Consequences of such a stale entry:
- follow-up prompt shorter than the slot snapshot: failed request
  (snapshot_longer_than_prompt) before PR Luce-Org#370, conservative cache
  miss after it;
- follow-up prompt longer than the slot snapshot: the restore path
  attaches KV from the wrong token stream with no validation - silent
  context corruption.

Fix the root cause: when confirm_inline_snap / confirm_full_snap
commit a snapshot into a slot, erase every other entry still pointing
at that slot. A slot holds exactly one snapshot, so at most one entry
may describe it.
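
A minimal sketch of that invariant, assuming a simple hash-to-slot entry table. PrefixCacheTable, prepare, confirm and contains are hypothetical names for illustration; the real confirm_inline_snap / confirm_full_snap and next_slot_ handling may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical entry table: prompt-prefix hash -> snapshot slot.
class PrefixCacheTable {
public:
    explicit PrefixCacheTable(int slots) : slots_(slots) {}

    // Reserve the next slot round-robin. The cursor advances even if the
    // caller later aborts the snapshot, which is the "burned step".
    int prepare() {
        int slot = next_slot_;
        next_slot_ = (next_slot_ + 1) % slots_;
        return slot;
    }

    // Commit a snapshot into `slot`. Every other entry still pointing at
    // that slot is erased first, so at most one entry describes the slot.
    // Returns the number of stale entries dropped.
    std::size_t confirm(std::uint64_t hash, int slot) {
        std::size_t dropped = 0;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (it->second == slot && it->first != hash) {
                it = entries_.erase(it);
                ++dropped;
            } else {
                ++it;
            }
        }
        entries_[hash] = slot;
        return dropped;
    }

    bool contains(std::uint64_t hash) const {
        return entries_.count(hash) != 0;
    }

    std::size_t size() const { return entries_.size(); }

private:
    int slots_;
    int next_slot_ = 0;
    std::unordered_map<std::uint64_t, int> entries_;
};
```

Replaying the repro order on a two-slot table (confirm into slot 0, an aborted prepare that burns slot 1, then a confirm that wraps back onto slot 0) leaves a single entry for slot 0, so a later lookup for the first conversation misses instead of restoring another stream's KV.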

Verified on RTX 3090 (Qwen3.6-27B Q4_K_M, --prefix-cache-slots 2)
with the deterministic PR Luce-Org#370 repro (short conv -> aborted snap ->
big conv wrapping onto slot 0 -> shortened follow-up): the wrap now
logs '[pc] dropping stale entry for reused slot=0', the follow-up is
a clean miss with correct output, and a longer same-conversation
follow-up restores a valid snapshot. Greedy outputs across the
sequence match the no-cache baseline; 1905 server unit assertions
pass.

Co-Authored-By: WOZCODE <contact@withwoz.com>

2 participants