
feat: KV cache compaction, GraphRAG pipeline, Strix Halo optimizations#3

Open
fabiantax wants to merge 143 commits into master from claude/kv-cache-compaction-636FM

Conversation

@fabiantax (Owner)

Summary

  • KV Cache Compaction POC: Implementation of "KV Cache Compaction via Attention Matching" (arXiv paper; objective sketched below) with math utilities, unit tests, and comprehensive documentation including an algorithms reference, user stories, and a cross-pollination map relating the method to adjacent concepts (Nyström, coresets, Frank-Wolfe, etc.)
  • GraphRAG Pipeline: Rust-based NER+RE extraction pipeline with FalkorDB graph database integration, plus a ModernBERT fine-tuning pipeline (Python) for domain-specific entity/relation extraction targeting 44x speedup over GLiNER (72s → <2s via ONNX INT8)
  • Strix Halo Optimization Report: Documentation of all optimizations achieving +15.5% Vulkan (67 tok/s) and +29% HIP (56 tok/s) on Qwen3.5-35B-A3B, including SSM shared memory tiling, batched elementwise mega-kernel, wave64 config, fused SSM recurrence, and APEX scheduling framework
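
For orientation, here is a rough statement of the attention-matching objective as it can be inferred from the commit messages further down (token selection, then an NNLS beta solve, then an LS value refit); the paper's exact formulation may differ. Given reference queries $Q$, a selected key subset $K_S$, per-head biases $\beta$ (nonnegative, solved by NNLS), and refit values $C_v$ (solved by least squares), the compacted cache should reproduce the full-cache attention output:

$$
\min_{\beta \ge 0,\; C_v} \left\| \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \;-\; \operatorname{softmax}\!\left(\frac{QK_S^\top}{\sqrt{d_k}} + \beta\right) C_v \right\|_F^2
$$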

Key files

KV Cache Compaction

  • tools/kv-compact/kv-compact.cpp — POC tool (891 lines)
  • tools/kv-compact/kv-compact-math.h — Math utilities (378 lines)
  • tests/test-kv-compact-math.cpp — Unit tests (582 lines)
  • docs/kv-cache-compaction-*.md — Research docs (5 files)

GraphRAG Pipeline

  • graphrag-pipeline/src/main.rs — Rust NER+RE pipeline (1017 lines)
  • graphrag-pipeline/training/ — ModernBERT fine-tuning (NER + RE, two-stage training)
  • graphrag-pipeline/extract.mjs — Node.js extraction tool

Documentation

  • docs/development/STRIX_HALO_OPTIMIZATION_REPORT.md — Full optimization report
  • docs/development/STRIX_HALO_USER_STORIES.md — 17 user stories (11 completed)

Test plan

  • KV compaction math tests: ctest -R test-kv-compact-math
  • GraphRAG Rust pipeline: cd graphrag-pipeline && cargo build
  • ModernBERT training dry run: cd graphrag-pipeline/training && python train_ner.py --dry-run
  • Backend-ops tests still pass: 214/214

🤖 Generated with Claude Code

yomaytk and others added 30 commits March 4, 2026 11:19
* Enable tmate debugging for investigating thread safety issue

* Refactor wait and submit to operate on vector<wgpu::FutureWaitInfo>, and fix wait to delete only the future that is completed.

* Cleanup

* Remove clear change and run clang-format

* Cleanup
…atMul updates (ggml-org#20118)

* ggml-hexagon: enhance hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx4 and enhance hvx_vec_reduce_sum_f32x4 for improved performance and reduced complexity

* ggml-hexagon: add hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* optimize hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 by removing unused scale parameter and improving vector accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: refactor hvx_dot_f16_f16_aa_rx4 for improved readability and return HVX_Vector for better integration

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: initialize sums variable in hvx_dot_f16_f16_aa_rx32 for clarity

* ggml-hexagon: fix compiling error

* fix hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking

* refactor hvx_dot_f16_f16_aa_rx4 to accept vector and leftover element counts as parameters for improved clarity and flexibility

* wip

* fa: instrumentation and dma reordering

* hex-fa: use block-size 64 to improve DMA pipelining

* hex-fa: optimize vec-dot for v79 and above

* hex-fa: use block size 64

* hex-fa: avoid scalar fp32->fp16 conversions

* hex-fa: simplify dot_f16 functions using optimized vec_mpyacc

* hex-fa: rewrite mad_f32_f16 using hvx_vec_mpyacc

* hex-mm: use mpyacc in matmul dot functions

---------

Co-authored-by: chraac <chraac@gmail.com>
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c3"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : fix Qwen3.5 model type detection

* Update src/llama-model.cpp

whoops, my bad

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments [no ci]

* add kimi linear to delta-net-base

* removed unnecessary ggml_cont from g_exp_t

* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp

* removed unnecessary diag mask

* cont : simplify

* cont : avoid graph splits

* scale q after mul instead of beginning

* scale q after mul instead of beginning

* identical ppl

* cont : fix scale and decay mask

* minor : remove TODO

* block implementation for kda

* remove space at the end of line 101

* concat+pad

* pad+binary row concat

* chunk size 16 for kda

* removed minor differences to master

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…0139)

* hexagon: add fp16 support for binary ops: add,sub,mul,div

* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)

* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad

* snapdragon: fix readme link

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* opencl: add `neg`

* opencl: add `exp`

* opencl: add `diag`
* Enhance /clear command to include system prompt

Add system prompt to messages when clearing chat history.

* Use lambda
* CUDA: use shared mem for ssm_conv

* fuse silu + ssm_conv

* fuse unary + mul

* enable for fp16

* formatting

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This patch addresses an Internal Compiler Error (segmentation fault)
observed with GCC 15 by replacing the combined intrinsic + cast with a
cast on the data first, followed by the intrinsic call. This bypasses the
buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
    `plxv 40, 2(14)`

This ensures zero performance regression while unblocking builds on
newer toolchains.

Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9  + GCC 15.1.1 (gcc-toolset-15)

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* ggml-cuda: add mem check for fusion

* Replace NaNs with -FLT_MAX

* fix typo

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
…0120)

* server : preserve anthropic thinking blocks in conversion (ggml-org#20090)

* server : add tests for anthropic thinking block conversion

---------

Co-authored-by: root <root@llamacpp.home>
* hexagon: add ssm_conv op

* hexagon: hvx kernel is functional

* hexagon: improvements to ssm-conv hvx kernel

* hexagon: added dma to ssm-conv hvx kernel

* hexagon: ssm-conv dynamically compute gather scratchpad

* hex-ssm-conv: add local context and fix various issues (spad indexing, etc)

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
)

* Autoparser - full single commit squish

* Final pre-merge changes: minor fixes, Kimi 2.5 model parser
* Add memsets and other fixes for IQ quants

* Make memset unconditional, change Laux back to L

* Move another memset
claude and others added 15 commits March 10, 2026 08:03
Add production infrastructure for KV cache compaction based on the
"Fast KV Compaction via Attention Matching" paper (Zweiger et al., 2026).

Core changes:
- Beta bias injection into attention mask: per-KV-head compaction biases
  are stored CPU-side and folded into the attention mask before softmax.
  The mask shape expands from [n_kv, n_tps, 1, n_stream] to
  [n_kv, n_tps, n_head, n_stream] when biases are active, enabling
  per-head correction via GQA broadcasting. Zero overhead when inactive.

- C_v writeback: optimized value vectors are written directly back into
  KV cache tensors (F32/F16, handles V-transpose layout).

- Cell metadata management: compact_cells() evicts non-kept positions
  from the cell ring buffer after compaction.

- Public API: llama_kv_cache_compact() and llama_compact_params in llama.h

- Full compaction module (llama-kv-compact.cpp): orchestrates the 3-step
  AM pipeline (token selection, NNLS beta solve, LS value refit) across
  all model layers and KV heads, with global key importance aggregation.

- Fix double-scaling bug in kv-compact-math.h where scores were multiplied
  by inv_sqrt_dk twice in the X matrix computation.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
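A minimal sketch of the mask folding described in this commit, assuming the biases are stored host-side as one row per (layer, KV head); all names here are hypothetical, the real code lives in the KV-cache mask construction:

```cpp
#include <cstdint>

// hypothetical helper: fold per-KV-head compaction biases into the expanded mask
// mask layout: [n_kv, n_tps, n_head, n_stream], contiguous in that order
// beta_kv_head[h_kv][k] = bias for KV head h_kv at cache cell k
static void fold_compact_bias(
        float * mask, const float * const * beta_kv_head,
        int n_kv, int n_tps, int n_head, int n_head_kv, int n_stream) {
    const int gqa = n_head / n_head_kv; // query heads per KV head (GQA broadcasting)
    for (int s = 0; s < n_stream; ++s) {
        for (int h = 0; h < n_head; ++h) {
            const float * beta_h = beta_kv_head[h / gqa];
            for (int t = 0; t < n_tps; ++t) {
                float * row = mask + (((int64_t) s*n_head + h)*n_tps + t)*n_kv;
                for (int k = 0; k < n_kv; ++k) {
                    row[k] += beta_h[k]; // added pre-softmax, like the causal mask values
                }
            }
        }
    }
}
```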
Move n_head lookup from external hparams parameter into mctx->get_n_head(),
reverting all 16 callsite changes in llama-graph.cpp back to upstream
signatures. Now only the 2 function definitions differ from upstream,
making rebases against llama.cpp updates essentially conflict-free.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
Capture real Q vectors during llama_decode() via eval callback and use
them as reference queries for KV cache compaction instead of K-vector
proxies. This is the paper's key quality insight — using actual query
vectors produces significantly better beta/C_v fits.

API: llama_kv_cache_capture_q(ctx, true) enables capture, then
llama_kv_cache_compact() with use_repeat_prefill=true uses them.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
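A hypothetical usage sketch of the capture-then-compact flow; the function names appear in the commit text, but the llama_compact_params field layout is an assumption:

```cpp
#include "llama.h"

// hypothetical flow, sketched from the commit message above
void compact_with_real_queries(llama_context * ctx, llama_batch batch) {
    llama_kv_cache_capture_q(ctx, true);     // record real Q vectors during decode
    llama_decode(ctx, batch);                // prefill; reference queries captured

    llama_compact_params params = {};
    params.target_ratio       = 0.5f;        // keep ~50% of cached tokens (assumed field)
    params.use_repeat_prefill = true;        // fit beta/C_v against the captured Qs
    llama_kv_cache_compact(ctx, params);

    llama_kv_cache_capture_q(ctx, false);    // stop capturing
}
```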
Implement sensitivity-based budget allocation so that attention heads
more sensitive to compression retain more tokens while less sensitive
heads are compressed more aggressively.

Changes:
- kv-compact-math.h: add compute_head_reconstruction_error(),
  compute_head_sensitivity(), allocate_head_budgets(), and
  compute_sensitivity_weights() primitives
- llama-kv-compact.cpp: when use_nonuniform_budgets is enabled,
  compute per-head importance score variance as sensitivity proxy,
  then weight global key scores by sqrt-sensitivity so keys critical
  to sensitive heads are preferentially retained
- kv-compact-profile.cpp: new profiling tool that measures per-(layer,
  head) reconstruction error at multiple compression ratios and outputs
  a JSON sensitivity profile for offline analysis
- test-kv-compact-math.cpp: 11 new tests covering reconstruction error,
  sensitivity profiling, budget allocation, and sensitivity weights

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
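A sketch of the sqrt-sensitivity weighting this commit describes, with hypothetical names; sensitivity is proxied by the per-head variance of importance scores, as in the commit text:

```cpp
#include <cmath>
#include <vector>

// weight each head's key scores by sqrt(sensitivity) before global aggregation,
// so keys that matter to compression-sensitive heads are preferentially kept
std::vector<float> aggregate_key_scores(
        const std::vector<std::vector<float>> & head_scores, int n_kv) {
    std::vector<float> global_score(n_kv, 0.0f);
    for (const auto & scores : head_scores) {
        float mean = 0.0f, var = 0.0f;
        for (float s : scores) mean += s;
        mean /= n_kv;
        for (float s : scores) var += (s - mean)*(s - mean);
        var /= n_kv;                          // score variance = sensitivity proxy

        const float w = std::sqrt(var);       // sensitive heads get more say
        for (int k = 0; k < n_kv; ++k) {
            global_score[k] += w * scores[k];
        }
    }
    return global_score; // top-n_keep indices become the kept token set
}
```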
Implements automatic KV cache compaction during generation and
bias persistence across state save/load:

- Auto-compact trigger: when KV cache fills beyond configurable
  threshold during llama_decode(), compaction runs automatically
  before returning a slot-not-found error. Hooked into the existing
  FAILED_PREPARE retry path with a one-shot guard to prevent loops.

- Public API: llama_kv_cache_set_auto_compact(ctx, threshold, params)
  configures the threshold (e.g. 0.9 = 90% full) and compaction
  parameters (target_ratio, repeat-prefill, non-uniform budgets).

- Bias serialization: compaction bias values are now included in
  state_write/state_read, enabling save/restore of compacted cache
  state. Format: [has_bias flag] + per-layer per-head bias floats
  for active cells. Backwards-compatible (old states have no bias
  section; new states include the flag).

- Integration tests: consecutive compression (6 rounds, 128→2
  tokens), bias serialization round-trip, threshold logic, and
  API defaults.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
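A hypothetical configuration sketch; the signature follows the commit text, while the params field names are assumptions:

```cpp
#include "llama.h"

// hypothetical setup for online compaction, per the commit message above
void enable_auto_compact(llama_context * ctx) {
    llama_compact_params params = {};
    params.target_ratio           = 0.5f;  // compress to ~50% on each trigger (assumed field)
    params.use_repeat_prefill     = true;
    params.use_nonuniform_budgets = true;

    // compact automatically once the cache is >= 90% full; hooked into the
    // FAILED_PREPARE retry path in llama_decode(), with a one-shot loop guard
    llama_kv_cache_set_auto_compact(ctx, 0.90f, &params);
}
```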
Three critical bugs fixed:

1. Per-head AM indices vs global selection mismatch: compact_head_highest_attn
   was doing its own key selection per-head, but results were being written to
   globally-selected cell positions. Now uses fit_head_for_selection() which
   takes pre-selected indices and only computes beta + C_v fitting.

2. V tensor reading used sequential indices (0..n_active-1) instead of
   actual active_cells positions. Inlined V reading with correct active_cells
   indexing for both transposed and non-transposed layouts.

3. C_v least-squares solver was numerically unstable when underdetermined
   (n_ref_queries < n_kept_tokens). Added Tikhonov regularization toward
   original V values: min ||X*C - Y||^2 + lambda*||C - V_sel||^2.
   This prevents wild extrapolation while allowing meaningful corrections.
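
   For reference, setting the gradient of the objective above to zero gives the
   standard ridge closed form

   $$C = \left(X^\top X + \lambda I\right)^{-1}\left(X^\top Y + \lambda V_{\text{sel}}\right),$$

   so $\lambda \to \infty$ leaves the original values $V_{\text{sel}}$ untouched
   (no correction) and $\lambda \to 0$ recovers the plain least-squares refit.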

Benchmark results, measured as perplexity (TinyLlama 1.1B, 1024-token prefill, 256 eval tokens):

| Compression | Baseline | Eviction | AM select | AM full |
|-------------|----------|----------|-----------|---------|
| 1x          | 6.38     | -        | -         | -       |
| 2x          | -        | 6.88     | 7.36      | 7.95    |
| 5x          | -        | 7.35     | 7.95      | 16.1    |
| 10x         | -        | 8.26     | 8.49      | 58.5    |

Also adds kv-compact-bench tool for perplexity measurement with compaction,
supporting --compact-ratio, --evict-only, --no-compact ablation modes.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
Per-layer beta injection:
- Replace layer-averaged beta in shared KQ mask with per-layer bias tensors
  injected directly into the compute graph before softmax
- Each layer gets its own compaction bias tensor [n_kv, 1, n_head_kv, 1]
  that broadcasts across tokens and streams via ggml_add
- Simplifies mask back to [n_kv, n_tps, 1, n_stream] (no per-head expansion)
- Disables flash attention when per-layer bias is active (falls back to
  standard attention with explicit KQ computation)
- Resolves TODO at llama-kv-cache.cpp:1525

KV cache defragmentation:
- After compaction, move kept cells from scattered positions to contiguous
  [0, n_kept) positions, reducing n_kv from full cache size to kept count
- Moves K/V tensor data row-by-row (K) and element-by-element (transposed V)
- Remaps compaction bias indices to match new cell positions
- Uses llama_kv_cells::cp/set API for safe metadata migration
- Dramatically reduces attention compute: e.g. 50x compression reduces
  n_kv from 4096 to GGML_PAD(82, 256)=256, a 16x attention speedup

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
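A rough ggml-level sketch of the per-layer injection described above (variable names hypothetical; ggml_add broadcasts size-1 dims and GQA-repeats n_head_kv across n_head). On the defragmentation arithmetic: GGML_PAD(82, 256) rounds 82 up to the next multiple of 256, hence n_kv = 256 after a 50x compaction of a 4096-cell cache, and 4096/256 gives the quoted 16x.

```cpp
// per-layer compaction bias, one tensor per layer, filled at set_input() time
struct ggml_tensor * compact_bias =
    ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, n_kv, 1, n_head_kv, 1);
ggml_set_input(compact_bias);

// kq: [n_kv, n_tokens, n_head, n_stream]; the add broadcasts the size-1 dims
// across tokens/streams and GQA-repeats n_head_kv across n_head,
// injecting the bias before softmax (flash attention is disabled on this path)
kq = ggml_add(ctx0, kq, compact_bias);
kq = ggml_soft_max_ext(ctx0, kq, kq_mask, kq_scale, 0.0f);
```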
Downloaded dataset for KV cache compaction PPL benchmarking
should not be tracked.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
Mermaid timeline covering all 5 phases: Foundation (docs, math, tests),
Core Integration (beta injection, C_v writeback, full model compaction),
Quality (Q capture, diversity selection, error metrics), Optimization
(iterative refinement, GCV ridge, non-uniform budgets), and Production
(library API, online compaction, FA+bias kernel, multi-stream batching).

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
The uma_profiler_cb_eval callback was auto-registered on all UMA systems,
firing on every op of every token. This caused llama-server to run at
18.5 tok/s instead of 92.6 tok/s (5x regression). llama-bench was
unaffected because it doesn't use common_init.

Fix: disable auto-enable (condition set to false). The profiler should
only activate when explicitly requested via a CLI flag.

Also fix duplicate __avx_f32cx8_load when building with
GGML_CPU_ALL_VARIANTS=ON (conflicts with simd-mappings.h).

Benchmark (SmolLM3 3B, 8 slots, Vulkan):
  Before: 46-87 agg tok/s
  After:  80-144 agg tok/s (matches stock b8334)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add infrastructure for cache-aware MoE routing (arxiv 2412.00099):
- llama_set_expert_cache_bias() API in llama.h
- Per-layer expert bias stored in llama_model
- Bias injection point in build_moe_ffn() (llama-graph.cpp)
- expert-cache-test harness with baseline/biased comparison + bonus sweep

Status: Baseline generation works (60 tok/s on Qwen3.5-35B-A3B).
Biased path crashes because bias tensors aren't in backend-managed buffers.
Next: register bias as llm_graph_input (like attention masks) so the
scheduler allocates it in the correct backend buffer.

Also includes UMA profiler fix (common.cpp) from previous commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix the bias tensor integration by using llm_graph_input pattern:
- llm_graph_input_expert_cache_bias creates tensor in ctx0 with ggml_set_input()
- set_input() copies bias data via ggml_backend_tensor_set() (proper backend transfer)
- Scheduler now correctly manages the bias tensor buffers

Test results (Qwen3.5-35B-A3B, 256 experts, top-8):
  Baseline: 59.3 tok/s
  Biased (bonus=0.5): 48.5 tok/s, IDENTICAL output
  All bonus levels (0.1-2.0): ZERO quality impact, SAME output

The per-token overhead (~18%) is from 40 extra ggml_add ops (one per MoE layer).
This is amortized in multi-slot serving where expert overlap reduces weight reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
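A minimal sketch of the pattern this commit adopts; member names are assumed, and the real base interface is llm_graph_input_i in llama-graph.h:

```cpp
#include <vector>

// hypothetical sketch of the input-tensor pattern described above
struct llm_graph_input_expert_cache_bias : public llm_graph_input_i {
    ggml_tensor * bias = nullptr;            // created in ctx0, marked with ggml_set_input()
    const std::vector<float> * host_bias;    // CPU-side per-layer expert bias values

    void set_input(const llama_ubatch * /*ubatch*/) override {
        // host -> backend transfer; the scheduler has already placed `bias`
        // in the backend buffer the graph expects, so this copy lands correctly
        ggml_backend_tensor_set(bias, host_bias->data(), 0, ggml_nbytes(bias));
    }
};
```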
Add CLI flag to enable static cache-aware expert routing in llama-server.
Biases MoE router toward a fixed expert subset so concurrent tokens share
experts, reducing unique expert weight reads by ~60%.

A/B test on Qwen3-Coder-Next 80B.A3B (10 slots):
  No bonus:   72.0 agg tok/s (7.2 per-slot)
  Bonus=0.5:  92.4 agg tok/s (9.2 per-slot) — +28% improvement

Single-slot has ~9% overhead from extra ggml_add per layer.
Multi-slot gains dominate at 5+ concurrent agents.

Usage: llama-server -m model.gguf -np 10 --expert-cache-bonus 0.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
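A conceptual sketch of what the bonus does inside build_moe_ffn (names hypothetical); in the real graph this is a ggml add on the router logits, not a host-side loop:

```cpp
// bias the router toward the fixed "cached" expert subset before top-k
for (int32_t e : cached_expert_set) {
    router_logits[e] += expert_cache_bonus;  // e.g. 0.5 via --expert-cache-bonus
}
// top-k selection now prefers the shared subset, so concurrent slots reuse
// expert weights and unique weight reads drop (~60% per the commit message)
```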
…ction-636FM

# Conflicts:
#	.gitignore
#	CLAUDE.md
#	include/llama.h
#	src/llama-kv-cache.h
#	tools/CMakeLists.txt