
feat: KV cache compaction, GraphRAG pipeline, Strix Halo optimizations#3

Open
fabiantax wants to merge 143 commits into master from claude/kv-cache-compaction-636FM

Conversation

@fabiantax (Owner)

Summary

  • KV Cache Compaction POC: Implementation of "KV Cache Compaction via Attention Matching" (arXiv paper; objective sketched below) with math utilities, unit tests, and comprehensive documentation including an algorithms reference, user stories, and a cross-pollination map relating the method to adjacent concepts (Nyström, coresets, Frank-Wolfe, etc.)
  • GraphRAG Pipeline: Rust-based NER+RE extraction pipeline with FalkorDB graph database integration, plus a ModernBERT fine-tuning pipeline (Python) for domain-specific entity/relation extraction targeting 44x speedup over GLiNER (72s → <2s via ONNX INT8)
  • Strix Halo Optimization Report: Documentation of all optimizations achieving +15.5% Vulkan (67 tok/s) and +29% HIP (56 tok/s) on Qwen3.5-35B-A3B, including SSM shared memory tiling, batched elementwise mega-kernel, wave64 config, fused SSM recurrence, and APEX scheduling framework
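
For orientation, here is a rough statement of the attention-matching objective as it can be inferred from the commit messages further down (token selection, then an NNLS beta solve, then an LS value refit); the paper's exact formulation may differ. Given reference queries $Q$, a selected key subset $K_S$, per-head biases $\beta$ (nonnegative, solved by NNLS), and refit values $C_v$ (solved by least squares), the compacted cache should reproduce the full-cache attention output:

$$
\min_{\beta \ge 0,\; C_v} \left\| \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \;-\; \operatorname{softmax}\!\left(\frac{QK_S^\top}{\sqrt{d_k}} + \beta\right) C_v \right\|_F^2
$$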

Key files

KV Cache Compaction

  • tools/kv-compact/kv-compact.cpp — POC tool (891 lines)
  • tools/kv-compact/kv-compact-math.h — Math utilities (378 lines)
  • tests/test-kv-compact-math.cpp — Unit tests (582 lines)
  • docs/kv-cache-compaction-*.md — Research docs (5 files)

GraphRAG Pipeline

  • graphrag-pipeline/src/main.rs — Rust NER+RE pipeline (1017 lines)
  • graphrag-pipeline/training/ — ModernBERT fine-tuning (NER + RE, two-stage training)
  • graphrag-pipeline/extract.mjs — Node.js extraction tool

Documentation

  • docs/development/STRIX_HALO_OPTIMIZATION_REPORT.md — Full optimization report
  • docs/development/STRIX_HALO_USER_STORIES.md — 17 user stories (11 completed)

Test plan

  • KV compaction math tests: ctest -R test-kv-compact-math
  • GraphRAG Rust pipeline: cd graphrag-pipeline && cargo build
  • ModernBERT training dry run: cd graphrag-pipeline/training && python train_ner.py --dry-run
  • Backend-ops tests still pass: 214/214

🤖 Generated with Claude Code

yomaytk and others added 30 commits March 4, 2026 11:19
* Enable tmate debugging for investigating thread safety issue

* Refactor wait and submit to operate on vector<wgpu::FutureWaitInfo>, and fix wait to delete only the future that is completed.

* Cleanup

* Remove clear change and run clang-format

* Cleanup
…atMul updates (ggml-org#20118)

* ggml-hexagon: enhance hvx_dot_f16_f16_aa_rx4 for improved performance by expanding vector handling and optimizing accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx4 and enhance hvx_vec_reduce_sum_f32x4 for improved performance and reduced complexity

* ggml-hexagon: add hvx_dot_f16_f16_aa_rx32 for enhanced vector processing in flash attention

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* optimize hvx_dot_f16_f16_aa_rx4 and hvx_dot_f16_f16_aa_rx32 by removing unused scale parameter and improving vector accumulation

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: refactor hvx_dot_f16_f16_aa_rx4 for improved readability and return HVX_Vector for better integration

# Conflicts:
#	ggml/src/ggml-hexagon/htp/flash-attn-ops.c

* ggml-hexagon: initialize sums variable in hvx_dot_f16_f16_aa_rx32 for clarity

* ggml-hexagon: fix compiling error

* fix hvx_dot_f16_f16_aa_rx4 to handle leftover elements correctly using masking

* refactor hvx_dot_f16_f16_aa_rx4 to accept vector and leftover element counts as parameters for improved clarity and flexibility

* wip

* fa: instrumentation and dma reordering

* hex-fa: use block-size 64 to improve DMA pipelining

* hex-fa: optimize vec-dot for v79 and above

* hex-fa: use block size 64

* hex-fa: avoid scalar fp32->fp16 conversions

* hex-fa: simplify dot_f16 functions using optimized vec_mpyacc

* hex-fa: rewrite mad_f32_f16 using hvx_vec_mpyacc

* hex-mm: use mpyacc in matmul dot functions

---------

Co-authored-by: chraac <chraac@gmail.com>
* fix(docs): correct typos found during code review

Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>

* Update docs/backend/CANN.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Revert "Auxiliary commit to revert individual files from 846d1c3"

This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* model : fix Qwen3.5 model type detection

* Update src/llama-model.cpp

whoops, my bad

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
…ml-org#17795)

* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()

* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)

* Exchanges synchronous copy with async copy function.

* Adds macro guards to allow compilation in non-CUDA builds

* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts

* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues

* Minor cleanup

* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.

* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.

* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization

* Simplifies synchronizations to adhere to `saaasg` pattern.

* Apply suggestion from @ggerganov (src->buffer to buf_src)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Apply suggestion from @ggerganov (src->buffer to buf_src) v2

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments [no ci]

* add kimi linear to delta-net-base

* removed unnecessary ggml_cont from g_exp_t

* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp

* removed unnecessary diag mask

* cont : simplify

* cont : avoid graph splits

* scale q after mul instead of beginning

* scale q after mul instead of beginning

* identical ppl

* cont : fix scale and decay mask

* minor : remove TODO

* block implementation for kda

* remove space at the end of line 101

* concat+pad

* pad+binary row concat

* chunk size 16 for kda

* removed minor differences to master

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…0139)

* hexagon: add fp16 support for binary ops: add,sub,mul,div

* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)

* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad

* snapdragon: fix readme link

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* opencl: add `neg`

* opencl: add `exp`

* opencl: add `diag`
* Enhance /clear command to include system prompt

Add system prompt to messages when clearing chat history.

* Use lambda
* CUDA: use shared mem for ssm_conv

* fuse silu + ssm_conv

* fuse unary + mul

* enable for fp16

* formatting

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This patch addresses an Internal Compiler Error (segmentation fault)
observed with GCC 15 by replacing the combined intrinsic + cast with a
cast on the data first, followed by the intrinsic call. This bypasses the
buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
    `plxv 40, 2(14)`

This ensures zero performance regression while unblocking builds on
newer toolchains.

Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9  + GCC 15.1.1 (gcc-toolset-15)

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* ggml-cuda: add mem check for fusion

* Replace NaNs with -FLT_MAX

* fix typo

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
…0120)

* server : preserve anthropic thinking blocks in conversion (ggml-org#20090)

* server : add tests for anthropic thinking block conversion

---------

Co-authored-by: root <root@llamacpp.home>
* hexagon: add ssm_conv op

* hexagon: hvx kernel is functional

* hexagon: improvements to ssm-conv hvx kernel

* hexagon: added dma to ssm-conv hvx kernel

* hexagon: ssm-conv dynamically compute gather scratchpad

* hex-ssm-conv: add local context and fix various issues (spad indexing, etc)

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
)

* Autoparser - full single commit squish

* Final pre-merge changes: minor fixes, Kimi 2.5 model parser
* Add memsets and other fixes for IQ quants

* Make memset unconditional, change Laux back to L

* Move another memset
claude and others added 15 commits March 10, 2026 08:03
Add production infrastructure for KV cache compaction based on the
"Fast KV Compaction via Attention Matching" paper (Zweiger et al., 2026).

Core changes:
- Beta bias injection into attention mask: per-KV-head compaction biases
  are stored CPU-side and folded into the attention mask before softmax.
  The mask shape expands from [n_kv, n_tps, 1, n_stream] to
  [n_kv, n_tps, n_head, n_stream] when biases are active, enabling
  per-head correction via GQA broadcasting. Zero overhead when inactive.

- C_v writeback: optimized value vectors are written directly back into
  KV cache tensors (F32/F16, handles V-transpose layout).

- Cell metadata management: compact_cells() evicts non-kept positions
  from the cell ring buffer after compaction.

- Public API: llama_kv_cache_compact() and llama_compact_params in llama.h

- Full compaction module (llama-kv-compact.cpp): orchestrates the 3-step
  AM pipeline (token selection, NNLS beta solve, LS value refit) across
  all model layers and KV heads, with global key importance aggregation.

- Fix double-scaling bug in kv-compact-math.h where scores were multiplied
  by inv_sqrt_dk twice in the X matrix computation.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
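A minimal sketch of the mask folding described in this commit, assuming the biases are stored host-side as one row per (layer, KV head); all names here are hypothetical, the real code lives in the KV-cache mask construction:

```cpp
#include <cstdint>

// hypothetical helper: fold per-KV-head compaction biases into the expanded mask
// mask layout: [n_kv, n_tps, n_head, n_stream], contiguous in that order
// beta_kv_head[h_kv][k] = bias for KV head h_kv at cache cell k
static void fold_compact_bias(
        float * mask, const float * const * beta_kv_head,
        int n_kv, int n_tps, int n_head, int n_head_kv, int n_stream) {
    const int gqa = n_head / n_head_kv; // query heads per KV head (GQA broadcasting)
    for (int s = 0; s < n_stream; ++s) {
        for (int h = 0; h < n_head; ++h) {
            const float * beta_h = beta_kv_head[h / gqa];
            for (int t = 0; t < n_tps; ++t) {
                float * row = mask + (((int64_t) s*n_head + h)*n_tps + t)*n_kv;
                for (int k = 0; k < n_kv; ++k) {
                    row[k] += beta_h[k]; // added pre-softmax, like the causal mask values
                }
            }
        }
    }
}
```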
Move n_head lookup from external hparams parameter into mctx->get_n_head(),
reverting all 16 callsite changes in llama-graph.cpp back to upstream
signatures. Now only the 2 function definitions differ from upstream,
making rebases against llama.cpp updates essentially conflict-free.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
Capture real Q vectors during llama_decode() via eval callback and use
them as reference queries for KV cache compaction instead of K-vector
proxies. This is the paper's key quality insight — using actual query
vectors produces significantly better beta/C_v fits.

API: llama_kv_cache_capture_q(ctx, true) enables capture, then
llama_kv_cache_compact() with use_repeat_prefill=true uses them.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
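A hypothetical usage sketch of the capture-then-compact flow; the function names appear in the commit text, but the llama_compact_params field layout is an assumption:

```cpp
#include "llama.h"

// hypothetical flow, sketched from the commit message above
void compact_with_real_queries(llama_context * ctx, llama_batch batch) {
    llama_kv_cache_capture_q(ctx, true);     // record real Q vectors during decode
    llama_decode(ctx, batch);                // prefill; reference queries captured

    llama_compact_params params = {};
    params.target_ratio       = 0.5f;        // keep ~50% of cached tokens (assumed field)
    params.use_repeat_prefill = true;        // fit beta/C_v against the captured Qs
    llama_kv_cache_compact(ctx, params);

    llama_kv_cache_capture_q(ctx, false);    // stop capturing
}
```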
Implement sensitivity-based budget allocation so that attention heads
more sensitive to compression retain more tokens while less sensitive
heads are compressed more aggressively.

Changes:
- kv-compact-math.h: add compute_head_reconstruction_error(),
  compute_head_sensitivity(), allocate_head_budgets(), and
  compute_sensitivity_weights() primitives
- llama-kv-compact.cpp: when use_nonuniform_budgets is enabled,
  compute per-head importance score variance as sensitivity proxy,
  then weight global key scores by sqrt-sensitivity so keys critical
  to sensitive heads are preferentially retained
- kv-compact-profile.cpp: new profiling tool that measures per-(layer,
  head) reconstruction error at multiple compression ratios and outputs
  a JSON sensitivity profile for offline analysis
- test-kv-compact-math.cpp: 11 new tests covering reconstruction error,
  sensitivity profiling, budget allocation, and sensitivity weights

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
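A sketch of the sqrt-sensitivity weighting this commit describes, with hypothetical names; sensitivity is proxied by the per-head variance of importance scores, as in the commit text:

```cpp
#include <cmath>
#include <vector>

// weight each head's key scores by sqrt(sensitivity) before global aggregation,
// so keys that matter to compression-sensitive heads are preferentially kept
std::vector<float> aggregate_key_scores(
        const std::vector<std::vector<float>> & head_scores, int n_kv) {
    std::vector<float> global_score(n_kv, 0.0f);
    for (const auto & scores : head_scores) {
        float mean = 0.0f, var = 0.0f;
        for (float s : scores) mean += s;
        mean /= n_kv;
        for (float s : scores) var += (s - mean)*(s - mean);
        var /= n_kv;                          // score variance = sensitivity proxy

        const float w = std::sqrt(var);       // sensitive heads get more say
        for (int k = 0; k < n_kv; ++k) {
            global_score[k] += w * scores[k];
        }
    }
    return global_score; // top-n_keep indices become the kept token set
}
```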
Implements automatic KV cache compaction during generation and
bias persistence across state save/load:

- Auto-compact trigger: when KV cache fills beyond configurable
  threshold during llama_decode(), compaction runs automatically
  before returning a slot-not-found error. Hooked into the existing
  FAILED_PREPARE retry path with a one-shot guard to prevent loops.

- Public API: llama_kv_cache_set_auto_compact(ctx, threshold, params)
  configures the threshold (e.g. 0.9 = 90% full) and compaction
  parameters (target_ratio, repeat-prefill, non-uniform budgets).

- Bias serialization: compaction bias values are now included in
  state_write/state_read, enabling save/restore of compacted cache
  state. Format: [has_bias flag] + per-layer per-head bias floats
  for active cells. Backwards-compatible (old states have no bias
  section; new states include the flag).

- Integration tests: consecutive compression (6 rounds, 128→2
  tokens), bias serialization round-trip, threshold logic, and
  API defaults.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
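A hypothetical configuration sketch; the signature follows the commit text, while the params field names are assumptions:

```cpp
#include "llama.h"

// hypothetical setup for online compaction, per the commit message above
void enable_auto_compact(llama_context * ctx) {
    llama_compact_params params = {};
    params.target_ratio           = 0.5f;  // compress to ~50% on each trigger (assumed field)
    params.use_repeat_prefill     = true;
    params.use_nonuniform_budgets = true;

    // compact automatically once the cache is >= 90% full; hooked into the
    // FAILED_PREPARE retry path in llama_decode(), with a one-shot loop guard
    llama_kv_cache_set_auto_compact(ctx, 0.90f, &params);
}
```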
Three critical bugs fixed:

1. Per-head AM indices vs global selection mismatch: compact_head_highest_attn
   was doing its own key selection per-head, but results were being written to
   globally-selected cell positions. Now uses fit_head_for_selection() which
   takes pre-selected indices and only computes beta + C_v fitting.

2. V tensor reading used sequential indices (0..n_active-1) instead of
   actual active_cells positions. Inlined V reading with correct active_cells
   indexing for both transposed and non-transposed layouts.

3. C_v least-squares solver was numerically unstable when underdetermined
   (n_ref_queries < n_kept_tokens). Added Tikhonov regularization toward
   original V values: min ||X*C - Y||^2 + lambda*||C - V_sel||^2.
   This prevents wild extrapolation while allowing meaningful corrections.
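
   For reference, setting the gradient of the objective above to zero gives the
   standard ridge closed form

   $$C = \left(X^\top X + \lambda I\right)^{-1}\left(X^\top Y + \lambda V_{\text{sel}}\right),$$

   so $\lambda \to \infty$ leaves the original values $V_{\text{sel}}$ untouched
   (no correction) and $\lambda \to 0$ recovers the plain least-squares refit.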

Benchmark results, measured as perplexity (TinyLlama 1.1B, 1024-token prefill, 256 eval tokens):

| Compression | Baseline | Eviction | AM select | AM full |
|-------------|----------|----------|-----------|---------|
| 1x          | 6.38     | -        | -         | -       |
| 2x          | -        | 6.88     | 7.36      | 7.95    |
| 5x          | -        | 7.35     | 7.95      | 16.1    |
| 10x         | -        | 8.26     | 8.49      | 58.5    |

Also adds kv-compact-bench tool for perplexity measurement with compaction,
supporting --compact-ratio, --evict-only, --no-compact ablation modes.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
Per-layer beta injection:
- Replace layer-averaged beta in shared KQ mask with per-layer bias tensors
  injected directly into the compute graph before softmax
- Each layer gets its own compaction bias tensor [n_kv, 1, n_head_kv, 1]
  that broadcasts across tokens and streams via ggml_add
- Simplifies mask back to [n_kv, n_tps, 1, n_stream] (no per-head expansion)
- Disables flash attention when per-layer bias is active (falls back to
  standard attention with explicit KQ computation)
- Resolves TODO at llama-kv-cache.cpp:1525

KV cache defragmentation:
- After compaction, move kept cells from scattered positions to contiguous
  [0, n_kept) positions, reducing n_kv from full cache size to kept count
- Moves K/V tensor data row-by-row (K) and element-by-element (transposed V)
- Remaps compaction bias indices to match new cell positions
- Uses llama_kv_cells::cp/set API for safe metadata migration
- Dramatically reduces attention compute: e.g. 50x compression reduces
  n_kv from 4096 to GGML_PAD(82, 256)=256, a 16x attention speedup

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
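A rough ggml-level sketch of the per-layer injection described above (variable names hypothetical; ggml_add broadcasts size-1 dims and GQA-repeats n_head_kv across n_head). On the defragmentation arithmetic: GGML_PAD(82, 256) rounds 82 up to the next multiple of 256, hence n_kv = 256 after a 50x compaction of a 4096-cell cache, and 4096/256 gives the quoted 16x.

```cpp
// per-layer compaction bias, one tensor per layer, filled at set_input() time
struct ggml_tensor * compact_bias =
    ggml_new_tensor_4d(ctx0, GGML_TYPE_F32, n_kv, 1, n_head_kv, 1);
ggml_set_input(compact_bias);

// kq: [n_kv, n_tokens, n_head, n_stream]; the add broadcasts the size-1 dims
// across tokens/streams and GQA-repeats n_head_kv across n_head,
// injecting the bias before softmax (flash attention is disabled on this path)
kq = ggml_add(ctx0, kq, compact_bias);
kq = ggml_soft_max_ext(ctx0, kq, kq_mask, kq_scale, 0.0f);
```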
Downloaded dataset for KV cache compaction PPL benchmarking
should not be tracked.

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
Mermaid timeline covering all 5 phases: Foundation (docs, math, tests),
Core Integration (beta injection, C_v writeback, full model compaction),
Quality (Q capture, diversity selection, error metrics), Optimization
(iterative refinement, GCV ridge, non-uniform budgets), and Production
(library API, online compaction, FA+bias kernel, multi-stream batching).

https://claude.ai/code/session_01QGc86jDa66eGpKaRCDtz6e
The uma_profiler_cb_eval callback was auto-registered on all UMA systems,
firing on every op of every token. This caused llama-server to run at
18.5 tok/s instead of 92.6 tok/s (5x regression). llama-bench was
unaffected because it doesn't use common_init.

Fix: disable auto-enable (condition set to false). The profiler should
only activate when explicitly requested via a CLI flag.

Also fix duplicate __avx_f32cx8_load when building with
GGML_CPU_ALL_VARIANTS=ON (conflicts with simd-mappings.h).

Benchmark (SmolLM3 3B, 8 slots, Vulkan):
  Before: 46-87 agg tok/s
  After:  80-144 agg tok/s (matches stock b8334)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add infrastructure for cache-aware MoE routing (arxiv 2412.00099):
- llama_set_expert_cache_bias() API in llama.h
- Per-layer expert bias stored in llama_model
- Bias injection point in build_moe_ffn() (llama-graph.cpp)
- expert-cache-test harness with baseline/biased comparison + bonus sweep

Status: Baseline generation works (60 tok/s on Qwen3.5-35B-A3B).
Biased path crashes because bias tensors aren't in backend-managed buffers.
Next: register bias as llm_graph_input (like attention masks) so the
scheduler allocates it in the correct backend buffer.

Also includes UMA profiler fix (common.cpp) from previous commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix the bias tensor integration by using llm_graph_input pattern:
- llm_graph_input_expert_cache_bias creates tensor in ctx0 with ggml_set_input()
- set_input() copies bias data via ggml_backend_tensor_set() (proper backend transfer)
- Scheduler now correctly manages the bias tensor buffers

Test results (Qwen3.5-35B-A3B, 256 experts, top-8):
  Baseline: 59.3 tok/s
  Biased (bonus=0.5): 48.5 tok/s, IDENTICAL output
  All bonus levels (0.1-2.0): ZERO quality impact, SAME output

The per-token overhead (~18%) is from 40 extra ggml_add ops (one per MoE layer).
This is amortized in multi-slot serving where expert overlap reduces weight reads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
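A minimal sketch of the pattern this commit adopts; member names are assumed, and the real base interface is llm_graph_input_i in llama-graph.h:

```cpp
#include <vector>

// hypothetical sketch of the input-tensor pattern described above
struct llm_graph_input_expert_cache_bias : public llm_graph_input_i {
    ggml_tensor * bias = nullptr;            // created in ctx0, marked with ggml_set_input()
    const std::vector<float> * host_bias;    // CPU-side per-layer expert bias values

    void set_input(const llama_ubatch * /*ubatch*/) override {
        // host -> backend transfer; the scheduler has already placed `bias`
        // in the backend buffer the graph expects, so this copy lands correctly
        ggml_backend_tensor_set(bias, host_bias->data(), 0, ggml_nbytes(bias));
    }
};
```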
Add CLI flag to enable static cache-aware expert routing in llama-server.
Biases MoE router toward a fixed expert subset so concurrent tokens share
experts, reducing unique expert weight reads by ~60%.

A/B test on Qwen3-Coder-Next 80B.A3B (10 slots):
  No bonus:   72.0 agg tok/s (7.2 per-slot)
  Bonus=0.5:  92.4 agg tok/s (9.2 per-slot) — +28% improvement

Single-slot has ~9% overhead from extra ggml_add per layer.
Multi-slot gains dominate at 5+ concurrent agents.

Usage: llama-server -m model.gguf -np 10 --expert-cache-bonus 0.5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
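A conceptual sketch of what the bonus does inside build_moe_ffn (names hypothetical); in the real graph this is a ggml add on the router logits, not a host-side loop:

```cpp
// bias the router toward the fixed "cached" expert subset before top-k
for (int32_t e : cached_expert_set) {
    router_logits[e] += expert_cache_bonus;  // e.g. 0.5 via --expert-cache-bonus
}
// top-k selection now prefers the shared subset, so concurrent slots reuse
// expert weights and unique weight reads drop (~60% per the commit message)
```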
…ction-636FM

# Conflicts:
#	.gitignore
#	CLAUDE.md
#	include/llama.h
#	src/llama-kv-cache.h
#	tools/CMakeLists.txt