kv-compressor

A short algorithmic study of sub-fp16 KV cache compression for transformer attention on shared-memory mobile SoCs. Documented negative result — see FINDINGS.md for the full write-up.

What this is

A Rust crate that implements and benchmarks several scalar sign+magnitude schemes for compressing the key/value cache of transformer attention:

quantize.rs — v1: 1-bit sign + a single scalar magnitude per row (mean(|row|))
outlier.rs — v2: adds top-K outlier-channel extraction
grouped.rs — v3: adds per-group inlier magnitudes
twobit.rs — v4: 2-bit signs + levels, plus outliers
attention.rs — shared attention / softmax path used by the variants
src/bin/ — benchmark binaries that sweep the variants (synthetic and on real dumped Qwen3 KV tensors)
scripts/dump_qkv.py — dumps KV tensors from a HuggingFace model via hooks
scripts/ppl_eval.py — measures WikiText-2 perplexity with monkey-patched k/v projections

The KV layout assumed is [n_tokens × n_kv_heads × head_dim], flattened row-major (one row = one (token, head) head-vector). The crate does not own the layout transform; the caller reshapes into that form.

Status

Parked as an instructive negative result. The conclusion of the study is that scalar sign+magnitude KV compression — even with outlier extraction and 2-bit reconstruction — cannot match production methods (KIVI, llama.cpp q4_0 KV) on real perplexity. The code runs and its unit tests pass, but it is research code documenting why the approach falls short, not a production compressor.

Two findings from the write-up worth highlighting:

Cosine similarity of attention outputs is an unreliable proxy for perplexity on KV quantization — validate with end-to-end PPL on a real corpus instead.
Outlier-channel extraction is necessary but not sufficient; the inlier residual still carries enough information that 1-bit-per-element is too lossy.

A separate bandwidth-contention measurement (iGPU vs concurrent iGPU+NPU) from the same project is preserved in FINDINGS.md and stands on its own.

Build and test

Requires a Rust toolchain (edition 2021).

cargo build --release
cargo test            # 19 unit tests across the four variants

Run the benchmarks

# benchmark binary declared in Cargo.toml
cargo run --release --bin kv-bench

# real-data sweep (expects dumped Qwen3 KV tensors; see scripts/dump_qkv.py)
cargo run --release --bin realdata-bench

The Python scripts under scripts/ need a separate environment with HuggingFace transformers and a model to dump from; they are how the real KV tensors and the perplexity numbers in FINDINGS.md were produced.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
scripts		scripts
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
FINDINGS.md		FINDINGS.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kv-compressor

What this is

Status

Build and test

Run the benchmarks

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

kv-compressor

What this is

Status

Build and test

Run the benchmarks

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages