A short algorithmic study of sub-fp16 KV cache compression for transformer attention on shared-memory mobile SoCs. Documented negative result — see FINDINGS.md for the full write-up.
A Rust crate that implements and benchmarks several scalar sign+magnitude schemes for compressing the key/value cache of transformer attention:
quantize.rs— v1: 1-bit sign + a single scalar magnitude per row (mean(|row|))outlier.rs— v2: adds top-K outlier-channel extractiongrouped.rs— v3: adds per-group inlier magnitudestwobit.rs— v4: 2-bit signs + levels, plus outliersattention.rs— shared attention / softmax path used by the variantssrc/bin/— benchmark binaries that sweep the variants (synthetic and on real dumped Qwen3 KV tensors)scripts/dump_qkv.py— dumps KV tensors from a HuggingFace model via hooksscripts/ppl_eval.py— measures WikiText-2 perplexity with monkey-patched k/v projections
The KV layout assumed is [n_tokens × n_kv_heads × head_dim], flattened
row-major (one row = one (token, head) head-vector). The crate does not own the
layout transform; the caller reshapes into that form.
Parked as an instructive negative result. The conclusion of the study is that scalar sign+magnitude KV compression — even with outlier extraction and 2-bit reconstruction — cannot match production methods (KIVI, llama.cpp q4_0 KV) on real perplexity. The code runs and its unit tests pass, but it is research code documenting why the approach falls short, not a production compressor.
Two findings from the write-up worth highlighting:
- Cosine similarity of attention outputs is an unreliable proxy for perplexity on KV quantization — validate with end-to-end PPL on a real corpus instead.
- Outlier-channel extraction is necessary but not sufficient; the inlier residual still carries enough information that 1-bit-per-element is too lossy.
A separate bandwidth-contention measurement (iGPU vs concurrent iGPU+NPU) from the same project is preserved in FINDINGS.md and stands on its own.
Requires a Rust toolchain (edition 2021).
cargo build --release
cargo test # 19 unit tests across the four variants# benchmark binary declared in Cargo.toml
cargo run --release --bin kv-bench
# real-data sweep (expects dumped Qwen3 KV tensors; see scripts/dump_qkv.py)
cargo run --release --bin realdata-benchThe Python scripts under scripts/ need a separate environment with HuggingFace
transformers and a model to dump from; they are how the real KV tensors and
the perplexity numbers in FINDINGS.md were produced.
MIT — see LICENSE.