fix(server): drop stale prefix-cache entries when a snapshot slot is reused #371
Merged
The inline and full-compress prefix caches assign snapshot slots round-robin via next_slot_, which advances in prepare_*_snap even when the snapshot later aborts (degenerate boundary, failed generation, client disconnect). A burned step makes a later confirm wrap onto a slot that a live entry still references. From then on the entry table and the slot contents disagree: the entry's hash describes one token stream, the slot holds a snapshot of another.

Consequences of such a stale entry:
- follow-up prompt shorter than the slot snapshot: failed request (snapshot_longer_than_prompt) before PR #370, conservative cache miss after it;
- follow-up prompt longer than the slot snapshot: the restore path attaches KV from the wrong token stream with no validation - silent context corruption.

Fix the root cause: when confirm_inline_snap / confirm_full_snap commit a snapshot into a slot, erase every other entry still pointing at that slot. A slot holds exactly one snapshot, so at most one entry may describe it.

Verified on RTX 3090 (Qwen3.6-27B Q4_K_M, --prefix-cache-slots 2) with the deterministic PR #370 repro (short conv -> aborted snap -> big conv wrapping onto slot 0 -> shortened follow-up): the wrap now logs '[pc] dropping stale entry for reused slot=0', the follow-up is a clean miss with correct output, and a longer same-conversation follow-up restores a valid snapshot. Greedy outputs across the sequence match the no-cache baseline; 1905 server unit assertions pass.

Co-Authored-By: WOZCODE <contact@withwoz.com>
Problem
Follow-up to #370, fixing the root cause behind its symptom.
The inline and full-compress prefix caches hand out snapshot slots round-robin (next_slot_), and the counter advances in prepare_*_snap even when the snapshot later aborts (degenerate boundary < 512 tokens, failed generation, client disconnect). One burned step is enough: a later confirm wraps onto a slot that a live entry still references. From then on the entry table and the slot contents disagree: the entry's hash describes one token stream, the slot holds a snapshot of another.
A stale entry then misbehaves in two ways:
- follow-up prompt shorter than the slot snapshot: snapshot_longer_than_prompt, a failed request with an empty completion (the silent agent hang #370 reported; #370 downgrades it to a conservative cache miss);
- follow-up prompt longer than the slot snapshot: the restore path attaches KV from the wrong token stream with no validation, which silently corrupts the context.

Fix
When confirm_inline_snap / confirm_full_snap commit a snapshot into a slot, erase every other entry still pointing at that slot. A slot holds exactly one snapshot, so at most one entry may describe it. 24 lines, no API change.

Validation (RTX 3090, Qwen3.6-27B Q4_K_M, --prefix-cache-slots 2)
Deterministic repro: short conv (snap@3801 → slot 0) → tiny conv whose snap aborts (burns a slot step) → big distinct conv (8.2K tokens, wraps onto slot 0) → shortened follow-up of the first conv.
- Before: the follow-up fails with ok=false out=0 error=snapshot_longer_than_prompt;
- After: the wrap logs '[pc] dropping stale entry for reused slot=0', the follow-up is a clean miss with correct output, and a longer same-conversation follow-up restores a valid snapshot (correct KV, correct answer). The corruption window is closed and cache effectiveness is preserved.
Greedy outputs across the whole sequence match the no-cache baseline. All 1905 server unit assertions pass.
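
For readers who want to see the invariant in isolation, the following is a minimal sketch of the failure and the fix, not the server code. Only next_slot_ is a name from this PR; PrefixCacheModel, prepare_snap, confirm_snap, entries_for_slot, replay, and the drop_stale toggle are hypothetical, and the model ignores any eviction the real cache performs on a normal wrap. It replays the repro sequence with two slots: one confirmed snapshot, one aborted prepare that burns a step, and a second confirmed snapshot that wraps onto slot 0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Hypothetical, simplified model of a round-robin prefix cache.
// Everything except the name next_slot_ is illustrative.
struct PrefixCacheModel {
    int n_slots;
    int next_slot_ = 0;
    bool drop_stale;                                 // true = apply the fix
    std::unordered_map<std::uint64_t, int> entries;  // prefix hash -> slot

    PrefixCacheModel(int slots, bool fix) : n_slots(slots), drop_stale(fix) {}

    // Reserve a slot. The counter advances here, so an aborted snapshot
    // still consumes a round-robin step.
    int prepare_snap() {
        int slot = next_slot_;
        next_slot_ = (next_slot_ + 1) % n_slots;
        return slot;
    }

    // Commit a snapshot for `hash` into `slot`. With the fix enabled, every
    // other entry that still points at this slot is erased first.
    void confirm_snap(std::uint64_t hash, int slot) {
        if (drop_stale) {
            for (auto it = entries.begin(); it != entries.end();) {
                if (it->second == slot && it->first != hash) {
                    std::printf("[model] dropping stale entry for reused slot=%d\n", slot);
                    it = entries.erase(it);  // erase returns the next iterator
                } else {
                    ++it;
                }
            }
        }
        entries[hash] = slot;
    }

    // Number of entries that claim to describe `slot`.
    int entries_for_slot(int slot) const {
        int n = 0;
        for (const auto& kv : entries) {
            if (kv.second == slot) ++n;
        }
        return n;
    }
};

// Replays the sequence: conv A confirms, a tiny conv aborts, conv B confirms.
// Returns how many entries describe slot 0 afterwards.
int replay(bool fix) {
    PrefixCacheModel pc(2, fix);
    pc.confirm_snap(0xA, pc.prepare_snap());  // conv A lands in slot 0
    pc.prepare_snap();                        // aborted snapshot burns slot 1
    pc.confirm_snap(0xB, pc.prepare_snap());  // conv B wraps onto slot 0
    return pc.entries_for_slot(0);
}
```

In this model, replay(false) returns 2: both hashes claim slot 0, although the slot can only hold conv B's snapshot. replay(true) returns 1, which is the "at most one entry per slot" invariant the fix restores.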
🧙 Built with WOZCODE