compose: FlowKV aged-history compression + drafter residency fix — 1.72x vs disk-cache baseline at <=64K#372
…rg#364 scoped cache
- Port 354e7b6 message-count freeze (aged[1..n-hot) compressed once, cached)
- Remove mutual-exclusion: FlowKV active → disk clamps to system_end (verbatim system anchor, stable cross-session key); Luce-Org#364 unchanged when compress=false
- WS1: non-continuation turns skip compression (cold-poison fix preserved)
- Inert-guard: aged band < 512 tokens → FlowKV-OFF
- Config: DiskPrefixCachePolicy::compress + --disk-prefix-cache-compress CLI
- Tests T1-T7: 1908 assertions, 0 failures
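The message-count freeze and the inert guard from this commit can be illustrated with a minimal sketch. It assumes message 0 is the system prompt, that the last `hot` messages stay verbatim, and that per-message token counts are already known. `AgedBand` and `select_aged_band` are hypothetical names, not the PR's code.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: half-open message range [begin, end) to compress.
struct AgedBand {
    std::size_t begin = 0;
    std::size_t end = 0;
    bool active = false;  // false means FlowKV stays off for this turn
};

// msg_tokens[i] is the token count of message i; message 0 is the system prompt.
// hot is the number of trailing messages kept verbatim. min_aged_tokens mirrors
// the 512-token inert guard named in the commit message.
inline AgedBand select_aged_band(const std::vector<int>& msg_tokens,
                                 std::size_t hot,
                                 int min_aged_tokens = 512) {
    AgedBand band;
    const std::size_t n = msg_tokens.size();
    if (n <= 1 + hot) return band;  // nothing between system prompt and hot window
    band.begin = 1;
    band.end = n - hot;
    const int aged_tokens = std::accumulate(msg_tokens.begin() + band.begin,
                                            msg_tokens.begin() + band.end, 0);
    band.active = aged_tokens >= min_aged_tokens;  // inert guard: tiny bands stay off
    return band;
}
```

Because the band is defined by message count rather than token offsets, the same `[1, n-hot)` range can be compressed once and reused until the hot window advances.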
… vs Luce-Org#364
FlowKV ran whenever disk_cache_policy.compress was set, with no size gate, so every multi-turn agentic turn paid the full pFlash drafter-forward (~400s/session at 59K) and re-expanded the prompt, making COMPOSE ~1.9x slower than the plain Luce-Org#364 scoped disk cache it should improve on.
- Gate FlowKV on the original prompt size (same threshold as the pFlash gate), and skip it once pFlash has already compressed.
- Below threshold COMPOSE is byte-identical to Luce-Org#364 (full prefix-cache hits, no drafter tax); compression fires only when the conversation can't fit the KV.
- Keep the scoped-disk-re-prefill skip under compression (avoids turn-2 hang).
Validated on abc_cache_harness COMPOSE arm (auto, threshold=65000): goldgate_fix total wall 846s -> 480s (~Luce-Org#364's 443s), zero compression on sub-threshold turns. Activate via --prefill-compression auto --prefill-threshold ~max_ctx.
…g-42 tail-capture guard ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.
…g#364 scoped save
47081e67 demoted FlowKV to a downstream else-if after whole-prompt pFlash, gated on the same threshold, making FlowKV structurally unreachable (any threshold that let it run made pFlash fire first; PFLASH_FREEZE_HISTORY went dead). Replace with the unified gate (compute should_compress once; route continuations to FlowKV-freeze with should_compress=false; whole-prompt pFlash only for cold non-continuations), mirroring the working flowkv-standalone structure. Re-enable Luce-Org#364's scoped disk save under compression (drop the band-aid guard; the disk-clamp already pins the save to the stable system_end prefix).
Paired A/B, same binary (cb458145), full 7-turn goldgate_fix, single-session: COMPOSE_FLOWKV 615.9s vs pure-Luce-Org#364 713.7s (1.16x), decode 13.6 vs 6.7 tps, tool-valid 85.7% vs 71.4%. FlowKV engages on continuations; ee7 keeps the drafter forward cheap. Turn-4 transition cost (park/unpark + uncached compressed-prefill) is the remaining lever, not the gate.
Resident drafter (~2GB) starves the target's large prefill on 24GB cards (370 -> 121 tok/s on the freeze transition turn). Release after scoring, lazy reload next turn (~2s). N=3 interleaved: 527.5s -> 306.7s (1.72x), turn-4 prefill 217-269s -> 66-73s, quality held. persistent remains the big-card opt-out.
…them Ingress gate rejected prompt+max_tokens > max_ctx before compression ran, making >max_ctx sessions unreachable even when FlowKV/pFlash could shrink them. Extract pure should_reject_oversized() (admission.h): pass oversized requests through when compression will run; enforce the hard limit on the post-compress effective size in worker_loop. Oversized requests now get compressed first and reject cleanly only if still over budget.
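A rough sketch of the two-stage admission described in this commit follows. The commit names `should_reject_oversized()`, but the signature below is an assumption, and `overflows_after_compress` is a hypothetical name for the post-compress check in `worker_loop`.

```cpp
// Ingress check. Assumed signature; the real helper in admission.h may differ.
// Oversized requests pass through when compression is going to run.
inline bool should_reject_oversized(int prompt_tokens, int max_tokens, int max_ctx,
                                    bool compression_will_run) {
    const bool oversized =
        static_cast<long long>(prompt_tokens) + max_tokens > max_ctx;
    return oversized && !compression_will_run;
}

// Post-compress check (hypothetical name): the hard limit is enforced on the
// effective prompt size after FlowKV/pFlash has had a chance to shrink it.
inline bool overflows_after_compress(int effective_prompt_tokens, int max_tokens,
                                     int max_ctx) {
    return static_cast<long long>(effective_prompt_tokens) + max_tokens > max_ctx;
}
```

Keeping both checks free of server state is what lets them be unit-tested in isolation, which matches the "pure" wording in the commit message.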
-133 net LOC, comments only — zero logic/string/assertion changes. All suites re-verified green (1926 asserts + 4 standalone tests).
Dual-resident target+draft fragments VMM virtual address space; at max_ctx=131072 the compute pool's cuMemSetAccess fails (device not ready). Safe cell (<=65536, 10+ clean runs) keeps the fast no-park path; dangerous cell parks. Note: GGML_CUDA_NO_VMM=1 env is compile-time-only in this fork and never mitigated this.
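The safe/dangerous cell split can be sketched as a pure predicate. This is an illustration only: `skip_park_allowed` is a hypothetical name, the 65536 boundary comes from the commit message, and reading "<32GB" as 32 GiB of reported VRAM is an assumption (real 32GB cards may report slightly less).

```cpp
#include <cstdint>

// Returns true when skip-park (target and drafter co-resident) may stay enabled.
inline bool skip_park_allowed(bool skip_park_requested,
                              std::uint64_t total_vram_bytes,
                              int max_ctx) {
    constexpr std::uint64_t kBigCardBytes = 32ull << 30;  // assumed "32GB" cutoff
    constexpr int kSafeMaxCtx = 65536;                    // safe cell upper bound
    if (!skip_park_requested) return false;
    // Dangerous cell: small card and a context above the validated bound.
    const bool dangerous = total_vram_bytes < kBigCardBytes && max_ctx > kSafeMaxCtx;
    return !dangerous;  // dangerous cell falls back to parking
}
```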
8 issues found across 24 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/server/freeze_history.h">
<violation number="1" location="server/src/server/freeze_history.h:10">
P3: Unused include: `<vector>` is not used by any declaration in this header. Remove it to keep dependencies minimal.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:541">
P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</violation>
</file>
<file name="server/src/server/http_server.cpp">
<violation number="1" location="server/src/server/http_server.cpp:1904">
P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</violation>
</file>
<file name="server/src/qwen3/anchor_scan.cpp">
<violation number="1" location="server/src/qwen3/anchor_scan.cpp:27">
P1: `search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</violation>
<violation number="2" location="server/src/qwen3/anchor_scan.cpp:103">
P1: Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</violation>
</file>
<file name="server/src/qwen3/anchor_scan.h">
<violation number="1" location="server/src/qwen3/anchor_scan.h:18">
P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</violation>
</file>
P1: search_end clamping to 0 causes one invalid n-gram comparison when body_end < ngram, risking out-of-bounds reads and boundary violations.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 27:
<comment>`search_end` clamping to 0 causes one invalid n-gram comparison when `body_end < ngram`, risking out-of-bounds reads and boundary violations.</comment>
<file context>
@@ -0,0 +1,164 @@
+{
+ const int n_chunks = (int)forced.size();
+ const int ngram = cfg.ngram;
+ const int search_end = std::max(0, body_end - ngram);
+
+ for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
</file context>
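One way to close the `body_end < ngram` hole is an explicit early return instead of clamping. The sketch below is a standalone stand-in, not the code in `anchor_scan.cpp`: `has_ngram_match` is a hypothetical helper, and the body is assumed to be a plain token vector.

```cpp
#include <vector>

// Hypothetical helper: does any ngram-sized window of query occur fully inside
// body[0, body_end)?
inline bool has_ngram_match(const std::vector<int>& body, int body_end,
                            const std::vector<int>& query, int ngram) {
    if (ngram <= 0) return false;
    if (body_end > static_cast<int>(body.size()))
        body_end = static_cast<int>(body.size());
    // Guard: with fewer than ngram tokens there is no valid window. Clamping
    // body_end - ngram to 0 would still admit the window that starts at 0.
    if (body_end < ngram) return false;
    const int last_start = body_end - ngram;  // inclusive last valid window start
    for (int qi = 0; qi + ngram <= static_cast<int>(query.size()); ++qi) {
        for (int bi = 0; bi <= last_start; ++bi) {
            bool same = true;
            for (int k = 0; k < ngram && same; ++k)
                same = body[bi + k] == query[qi + k];
            if (same) return true;
        }
    }
    return false;
}
```

With a clamp to 0, a three-token body and a four-token n-gram would compare one window that runs past `body_end`. The early return rejects that case before any indexing happens.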
P1: Transitive cascade loop exits early due to comparing forced against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 103:
<comment>Transitive cascade loop exits early due to comparing `forced` against an immediately copied snapshot, so subsequent expansion/rescan iterations are skipped.</comment>
<file context>
@@ -0,0 +1,164 @@
+ // Cascade loop: expand pool with tokens from newly-forced chunks and re-scan.
+ std::vector<uint8_t> prev_forced;
+ for (int it = 0; it < max_iters; ++it) {
+ prev_forced = forced;
+
+ // Rare-token worklist: catches multi-hop cascades within a single outer iteration.
</file context>
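The fixpoint pattern the reviewer is asking for can be sketched as follows: take the snapshot, run the pass, and only then compare. `run_cascade` and the `rescan` callback are hypothetical stand-ins for the pool-expansion and re-scan step, which this excerpt does not show.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Runs rescan(forced) until it stops forcing new chunks or max_iters is reached.
// Returns the number of passes executed.
inline int run_cascade(std::vector<std::uint8_t>& forced, int max_iters,
                       const std::function<void(std::vector<std::uint8_t>&)>& rescan) {
    int iters = 0;
    for (int it = 0; it < max_iters; ++it) {
        const std::vector<std::uint8_t> prev_forced = forced;  // snapshot before the pass
        rescan(forced);
        ++iters;
        if (forced == prev_forced) break;  // compare only after the pass has run
    }
    return iters;
}
```

The final pass that changes nothing is what proves convergence, so a chain of three hops takes four passes under this scheme.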
P1: FlowKV message-part extraction reads unvalidated JSON as strings; non-string text values can throw uncaught exceptions in the worker loop.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/http_server.cpp, line 1904:
<comment>FlowKV message-part extraction reads unvalidated JSON as strings; non-string `text` values can throw uncaught exceptions in the worker loop.</comment>
<file context>
@@ -1798,6 +1808,233 @@ void HttpServer::worker_loop() {
+ const std::string ptype = part.value("type", "");
+ if (ptype == "text" || ptype == "input_text" ||
+ ptype == "output_text")
+ msg_content += part.value("text", "");
+ }
+ }
</file context>
Suggested change:
- msg_content += part.value("text", "");
+ if (part.contains("text") && part["text"].is_string()) msg_content += part["text"].get<std::string>();
P2: Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 541:
<comment>Skip-park safety guard reads VRAM from current CUDA device instead of the configured placement GPU, making guard decisions incorrect on multi-GPU setups.</comment>
<file context>
@@ -534,7 +535,22 @@ bool Qwen35Backend::handle_compress(const std::string & line, const DaemonIO & i
+ {
+ size_t total_vram = 0;
+ int dev = 0;
+ cudaGetDevice(&dev);
+ cudaDeviceProp prop{};
+ if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
</file context>
P2: max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.h, line 18:
<comment>max_forced_count hard cap is checked only inside the cascade loop, but not after pass-1. If pass-1 alone already pushes forced chunks above max_forced_count, the cap is never enforced — the result can exceed the limit.</comment>
<file context>
@@ -0,0 +1,42 @@
+ int ngram = 4;
+ int rare_token_max_freq = 8; // tokens appearing <= this many times in body count as rare
+ int cascade_min_anchor_count = 0; // skip cascade if pass-1 forced >= this many chunks (0 = always cascade)
+ int max_forced_count = INT_MAX; // hard cap on total forced chunks
+};
+
</file context>
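A minimal sketch of enforcing the cap right after pass-1, before any cascade iteration runs. Which chunks to drop when the cap is exceeded is a policy decision this excerpt does not specify. Keeping the earliest forced chunks is purely an assumption here, and `enforce_forced_cap` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Clears forced flags beyond the first max_forced_count set entries, in chunk
// order, and returns how many chunks remain forced.
inline int enforce_forced_cap(std::vector<std::uint8_t>& forced, int max_forced_count) {
    int kept = 0;
    for (std::size_t i = 0; i < forced.size(); ++i) {
        if (!forced[i]) continue;
        if (kept < max_forced_count) {
            ++kept;          // still under the cap: keep this chunk forced
        } else {
            forced[i] = 0;   // over the cap: release this chunk
        }
    }
    return kept;
}
```

Calling the same helper after pass-1 and after each cascade iteration would make the cap hold regardless of which stage pushed the count over the limit.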
…ze in post-compress gate
Two confirmed PR-review findings:
- request-level prefix_cache.scope override replaced the whole policy, silently dropping the server-level compress flag (FlowKV disabled for any client sending an explicit scope)
- post-compress context gate used the raw prompt size on pflash full-cache hits, falsely 400ing oversized repeats served from cached compressed state
Both extracted to pure helpers (apply_request_scope_override, effective_prompt_overflows) with failing-test-first coverage.
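The first finding comes down to overriding one field instead of replacing the whole policy. The sketch below uses a simplified stand-in type. Only the `compress` member is named in this PR; the `scope` field, its string type, the scope values, and the function name `apply_scope_override` are assumptions for illustration.

```cpp
#include <optional>
#include <string>

// Simplified stand-in for DiskPrefixCachePolicy.
struct PolicySketch {
    std::string scope = "system";  // hypothetical scope value
    bool compress = false;         // server-level FlowKV flag
};

// Applies a request-level scope override without touching other server flags.
inline PolicySketch apply_scope_override(PolicySketch server_policy,
                                         const std::optional<std::string>& request_scope) {
    if (request_scope) server_policy.scope = *request_scope;  // override one field only
    return server_policy;  // compress is carried over unchanged
}
```

Taking the server policy by value and returning the modified copy keeps the helper pure, which fits the failing-test-first coverage mentioned above.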
Review disposition (all 8 cubic findings verified against code + tests before fixing):
Suite after fixes: 1939 assertions green; admission standalone 12/12.
1 issue found across 6 files (changes from recent commits).
0 conflated 'no hit' with a zero-length hit; sentinel is now -1 and the gate treats any >=0 value as served-from-cache.
cubic P2 (http_server.cpp:1798 sentinel conflation): a zero-length full-cache entry is not constructible today (admission floor + kept anchors guarantee non-empty compressed prompts), but the conflation class is now removed outright: the sentinel is now -1.
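The sentinel change can be illustrated with a small sketch in which -1 means "no full-cache hit" and any value >= 0, including 0, counts as served from cache. All names below are hypothetical, and the real gate in `http_server.cpp` may be structured differently.

```cpp
// -1 marks a cache miss; a length of 0 is a (theoretical) zero-length hit.
constexpr int kNoCacheHit = -1;

// Size the context gate should judge: the cached compressed length on a hit,
// the raw prompt length on a miss.
inline int effective_prompt_len(int raw_prompt_len, int cached_len) {
    return cached_len >= 0 ? cached_len : raw_prompt_len;
}

// Post-compress context gate applied to the effective size.
inline bool prompt_overflows(int raw_prompt_len, int cached_len,
                             int max_tokens, int max_ctx) {
    return effective_prompt_len(raw_prompt_len, cached_len) + max_tokens > max_ctx;
}
```

An oversized repeat that hits the cache with a 20000-token compressed state is then judged on 20000 tokens, not on its raw size.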
…ter residency fix Keep the current stack's qwen3 helper/test implementations where the PR overlapped, while taking the PR's server-side admission, skip-park, HTTP/server wiring, and test additions.
Record the PR Luce-Org#372 integration, current head, and updated open-PR accounting.
TL;DR
On current main (which includes the PR #364 scoped disk prefix cache), measured on an RTX 3090 24GB, enabling FlowKV aged-history compression on top of the disk cache now delivers:
Benchmark: `goldgate_fix` trace (real multi-turn agentic session, 34K-64K prompt tokens per turn), N=3 interleaved A/B on the same binary, same thermal window.
Summary
PR #364 made warm agentic turns cheap by restoring a stable token prefix from disk. The remaining cost on long sessions is the aged conversation history that still has to be prefilled fresh whenever the prefix diverges, and the per-turn growth beyond the cached boundary. This PR composes FlowKV aged-history compression with that cache: messages older than a hot window are compressed (drafter-scored, anchor-preserving) while the system prompt stays verbatim as the cache anchor, so the disk-cache key remains stable across turns. A unified gate keeps the three paths exclusive — turn-1 verbatim, FlowKV on continuations, whole-prompt pFlash otherwise — and with compression disabled the request path is byte-identical to main.
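One possible reading of the unified gate, sketched as a pure routing function. The exact conditions in the PR are not reproduced in this description, so the inputs, the threshold comparison, and all names (`PrefillPath`, `GateInput`, `choose_path`) are assumptions.

```cpp
enum class PrefillPath { Verbatim, FlowKVFreeze, WholePromptPFlash };

// Hypothetical gate inputs.
struct GateInput {
    bool compress_enabled;  // server-level compress flag
    bool is_continuation;   // turn extends a conversation already seen
    int prompt_tokens;      // original (uncompressed) prompt size
    int threshold;          // prefill compression threshold
};

// Exactly one path is chosen per request, so the three paths stay exclusive.
inline PrefillPath choose_path(const GateInput& in) {
    if (!in.compress_enabled) return PrefillPath::Verbatim;  // same path as main
    if (in.is_continuation) return PrefillPath::FlowKVFreeze;
    if (in.prompt_tokens >= in.threshold) return PrefillPath::WholePromptPFlash;
    return PrefillPath::Verbatim;  // cold, small first turn
}
```

Deciding once and returning a single enum value is the point of the sketch. The earlier else-if chain described in the commit history could shadow FlowKV behind pFlash, which a single decision function makes harder to do by accident.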
Two fixes found during benchmarking turned the compose from a wash into the 1.72x above:
`auto` residency default; `--draft-residency persistent` keeps the old behavior for >=32GB cards.
Changes
- http_server.cpp); `compress` off keeps main's behavior byte-identical.
- `auto` draft residency releases the pflash drafter after compress scoring (placement/draft_residency.h).
- `should_reject_oversized()` + post-compress effective-size gate (server/admission.h).
- `--prefill-skip-park` downgraded on <32GB GPUs at max_ctx>65536 (VMM VA-fragmentation crash class) (placement/skip_park_guard.h).
Limitations
- `GGML_CUDA_NO_VMM=1` as an environment variable is a no-op (compile-time option in this fork); scripts relying on it were never protected.
History
- `731561d1` compose FlowKV with #364 (feat(server): add scoped disk prefix cache policy) scoped cache; `0efdc33c` gate compression as fallback so compose can't regress main; `6a848058` unified gate (FlowKV reachable + scoped save preserved).
- `cefa3caf` ee7 early-exit drafter + anchor-transitive cascade + tail-capture guard.
- `3fc6882f` drafter auto-release after compress scoring (the 1.72x).
- `2ae98c0f` compress-aware admission.
- `637fbdaf` comment trim (-133 LOC, no logic changes).
- `1c562eb4` skip-park footprint guard.