Peterc3-dev/kv-compressor

kv-compressor

A short algorithmic study of sub-fp16 KV cache compression for transformer attention on shared-memory mobile SoCs. Documented negative result — see FINDINGS.md for the full write-up.

What this is

A Rust crate that implements and benchmarks several scalar sign+magnitude schemes for compressing the key/value cache of transformer attention:

  • quantize.rs — v1: 1-bit sign + a single scalar magnitude per row (mean(|row|))
  • outlier.rs — v2: adds top-K outlier-channel extraction
  • grouped.rs — v3: adds per-group inlier magnitudes
  • twobit.rs — v4: 2-bit signs + levels, plus outliers
  • attention.rs — shared attention / softmax path used by the variants
  • src/bin/ — benchmark binaries that sweep the variants (synthetic and on real dumped Qwen3 KV tensors)
  • scripts/dump_qkv.py — dumps KV tensors from a HuggingFace model via hooks
  • scripts/ppl_eval.py — measures WikiText-2 perplexity with monkey-patched k/v projections
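
As a rough illustration of the v1 scheme listed above (1-bit sign plus one mean(|row|) magnitude per row), a minimal sketch might look like the following. The type and function names (`SignRow`, `quantize_row`, `dequantize_row`) and the bit-packing order are assumptions made for this example; they are not taken from quantize.rs.

```rust
/// One compressed row: packed sign bits plus a single f32 magnitude.
/// Hypothetical type for illustration only, not the crate's actual API.
pub struct SignRow {
    /// 1 bit per element, least-significant bit first within each byte.
    /// A set bit means the element was >= 0.
    pub bits: Vec<u8>,
    /// mean(|x|) over the row.
    pub scale: f32,
    /// Number of elements in the original row.
    pub len: usize,
}

/// Compress a row to its signs and a single scalar magnitude.
pub fn quantize_row(row: &[f32]) -> SignRow {
    let len = row.len();
    let scale = if len == 0 {
        0.0
    } else {
        row.iter().map(|x| x.abs()).sum::<f32>() / len as f32
    };
    let mut bits = vec![0u8; (len + 7) / 8];
    for (i, &x) in row.iter().enumerate() {
        if x >= 0.0 {
            bits[i / 8] |= 1 << (i % 8);
        }
    }
    SignRow { bits, scale, len }
}

/// Reconstruct every element as +scale or -scale according to its sign bit.
pub fn dequantize_row(q: &SignRow) -> Vec<f32> {
    (0..q.len)
        .map(|i| {
            let positive = (q.bits[i / 8] >> (i % 8)) & 1 == 1;
            if positive { q.scale } else { -q.scale }
        })
        .collect()
}
```

Under these assumptions a row of head_dim elements costs head_dim bits plus one 32-bit scalar, and every element is reconstructed with the same magnitude.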

The KV layout assumed is [n_tokens × n_kv_heads × head_dim], flattened row-major, so each row is the head_dim-length vector for one (token, head) pair. The crate does not own the layout transform; the caller is responsible for reshaping tensors into that form.
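
To make the row-major indexing concrete, here is a small sketch of how a caller could address one head-vector in a flattened buffer. The helper names (`head_vector`, `rows`) are hypothetical and used only for this example.

```rust
/// Borrow the head_dim-length vector for one (token, head) pair from a
/// buffer laid out as [n_tokens x n_kv_heads x head_dim], row-major.
/// Hypothetical helper for illustration; not part of the crate.
pub fn head_vector<'a>(
    kv: &'a [f32],
    n_kv_heads: usize,
    head_dim: usize,
    token: usize,
    head: usize,
) -> &'a [f32] {
    assert!(head < n_kv_heads, "head index out of range");
    // Rows are ordered token-major, then head.
    let start = (token * n_kv_heads + head) * head_dim;
    &kv[start..start + head_dim]
}

/// Iterate over all rows (one per (token, head) pair) in storage order.
pub fn rows<'a>(kv: &'a [f32], head_dim: usize) -> impl Iterator<Item = &'a [f32]> + 'a {
    kv.chunks_exact(head_dim)
}
```

For example, with n_kv_heads = 3 and head_dim = 4, the row for token 1, head 2 starts at offset (1 * 3 + 2) * 4 = 20.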

Status

Parked as an instructive negative result. The conclusion of the study is that scalar sign+magnitude KV compression — even with outlier extraction and 2-bit reconstruction — cannot match production methods (KIVI, llama.cpp q4_0 KV) on real perplexity. The code runs and its unit tests pass, but it is research code documenting why the approach falls short, not a production compressor.

Two findings from the write-up worth highlighting:

  1. Cosine similarity of attention outputs is an unreliable proxy for perplexity on KV quantization — validate with end-to-end PPL on a real corpus instead.
  2. Outlier-channel extraction is necessary but not sufficient; the inlier residual still carries enough information that 1-bit-per-element is too lossy.
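
The second finding can be pictured with a sketch of top-K outlier-channel extraction. This is an illustration under stated assumptions, not the code in outlier.rs: it assumes outlier channels are chosen by summed |x| across all rows, that outliers are stored exactly, and that the remaining inliers share one mean(|x|) magnitude per row. The function names are hypothetical.

```rust
/// Indices (sorted ascending) of the `k` channels with the largest summed |x|
/// across all rows. `data` is row-major with `dim` elements per row.
/// The selection criterion is an assumption made for this sketch.
pub fn top_k_outlier_channels(data: &[f32], dim: usize, k: usize) -> Vec<usize> {
    assert!(dim > 0 && data.len() % dim == 0, "data must be whole rows");
    let mut energy = vec![0.0f32; dim];
    for row in data.chunks_exact(dim) {
        for (e, &x) in energy.iter_mut().zip(row) {
            *e += x.abs();
        }
    }
    let mut idx: Vec<usize> = (0..dim).collect();
    // Order channels by descending accumulated magnitude.
    idx.sort_by(|&a, &b| energy[b].total_cmp(&energy[a]));
    idx.truncate(k.min(dim));
    idx.sort_unstable();
    idx
}

/// Reconstruct one row: outlier channels are kept exactly, inliers become
/// sign(x) * mean(|inliers|). `outliers` must be sorted ascending.
pub fn reconstruct_with_outliers(row: &[f32], outliers: &[usize]) -> Vec<f32> {
    let is_outlier = |i: usize| outliers.binary_search(&i).is_ok();
    let inliers: Vec<f32> = row
        .iter()
        .enumerate()
        .filter(|(i, _)| !is_outlier(*i))
        .map(|(_, &x)| x.abs())
        .collect();
    let scale = if inliers.is_empty() {
        0.0
    } else {
        inliers.iter().sum::<f32>() / inliers.len() as f32
    };
    row.iter()
        .enumerate()
        .map(|(i, &x)| {
            if is_outlier(i) {
                x
            } else if x >= 0.0 {
                scale
            } else {
                -scale
            }
        })
        .collect()
}
```

In this sketch the large channel survives untouched, but every inlier in a row still collapses to the same magnitude, which is the residual loss the finding refers to.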

A separate bandwidth-contention measurement (iGPU vs concurrent iGPU+NPU) from the same project is preserved in FINDINGS.md and stands on its own.

Build and test

Requires a Rust toolchain (edition 2021).

cargo build --release
cargo test            # 19 unit tests across the four variants

Run the benchmarks

# benchmark binary declared in Cargo.toml
cargo run --release --bin kv-bench

# real-data sweep (expects dumped Qwen3 KV tensors; see scripts/dump_qkv.py)
cargo run --release --bin realdata-bench

The Python scripts under scripts/ need a separate environment with HuggingFace transformers and a model to dump from; they are how the real KV tensors and the perplexity numbers in FINDINGS.md were produced.

License

MIT — see LICENSE.
