Detailed implementation notes, benchmarks, and roadmap for libpz.
For day-to-day development instructions, see CLAUDE.md.
- Algorithms: LZ77 (brute, hashchain, lazy, parallel), LzSeq (code+extra-bits, repeat offsets, 128KB window), Huffman, BWT (SA-IS), MTF, RLE, FSE, rANS
- Pipelines: Bw (BWT+MTF+RLE+FSE), Lzf (LzSeq+FSE), LzSeqR (LzSeq+rANS), LzSeqH (LzSeq+Huffman), SortLz (sort-LZ77+FSE) — Lzf and LzSeqR/LzSeqH use multi-stream entropy coding for ~16-18% better compression; LzSeqR/LzSeqH use zstd-style code+extra-bits encoding with 6-stream demux; SortLz uses sort-based match finding (GPU-accelerated)
- Auto-selection: Heuristic (`select_pipeline`) and trial-based (`select_pipeline_trial`) pipeline selection using data analysis (entropy, match density, run ratio, autocorrelation); LzSeqR included in trial candidates
- Data analysis: `src/analysis.rs` — statistical profiling (Shannon entropy, autocorrelation, run ratio, match density, distribution shape) with sampling support
- Optimal parsing: GPU top-K match table → CPU backward DP (4-6% better compression)
- Multi-threading: Block-parallel and pipeline-parallel via V2 container format; within-block parallel LZ77 match finding (`compress_lazy_parallel`)
- SortLZ: Sort-based match finder — standalone pipeline (ID 10) and pluggable `MatchFinder::SortLz` for Lzf/LzSeqR/LzSeqH; GPU radix sort batched (single submit); adaptive `select_match_finder()` heuristic; u64-optimized `extend_match`; 39.6% ratio
- GPU kernels: LZ77 hash-table (fast), LZ77 batch/per-position (legacy), LZ77 top-K, BWT radix sort + parallel rank assignment, SortLZ radix sort + match verification, Huffman encode (two-pass with Blelloch prefix sum)
- Tooling: CLI (`pz` with `-a`/`--auto` and `--trial` flags), C FFI, Criterion benchmarks, CI (3 OS)
- Fuzz testing (M5.3): `cargo-fuzz` infrastructure with 12 targets covering all algorithms and pipelines (roundtrip + crash resistance)
- rANS SIMD decode paths — N-way interleaved rANS decode in `src/simd.rs` (SSE2 4-way, AVX2 8-way). The scalar interleaved encoder/decoder is implemented; SIMD intrinsics for the hot decode loop are not yet wired.
- rANS reciprocal multiplication — Replace division in the encode loop with a precomputed reciprocal multiply-shift for GPU/SIMD (avoids data-dependent division). Documented as a future optimization due to u32 overflow edge cases with small frequencies.
- CPU: Uses SA-IS (Suffix Array by Induced Sorting) — O(n) linear time via doubled-text-with-sentinel strategy.
- GPU: Uses an LSB-first 8-bit radix sort with prefix-doubling for suffix array construction, replacing the earlier bitonic sort (PR #21). Features adaptive key width (zero-digit radix passes are skipped) and event-chain batching (one host sync per doubling step). Rank assignment runs on the GPU via Blelloch prefix sum + scatter. Still slower than CPU SA-IS at all sizes, but dramatically improved over bitonic sort (7-14x faster). The GPU uses circular comparison (`(sa[i]+k) % n`) vs CPU SA-IS's doubled-text approach — both produce valid BWTs that round-trip correctly.
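The Blelloch scan used for GPU rank assignment can be modeled on the CPU as the classic up-sweep/down-sweep pair. This is an illustrative sequential sketch, not the actual shader code — on the GPU, each inner loop body would be one parallel dispatch over the array:

```rust
/// Sequential model of a Blelloch exclusive prefix sum (power-of-two length).
fn blelloch_exclusive_scan(a: &mut [u32]) {
    let n = a.len();
    assert!(n.is_power_of_two());
    // Up-sweep: build partial sums in a reduction tree.
    let mut d = 1;
    while d < n {
        let mut i = 2 * d - 1;
        while i < n {
            a[i] += a[i - d];
            i += 2 * d;
        }
        d *= 2;
    }
    // Down-sweep: clear the root, then push prefix sums back down the tree.
    a[n - 1] = 0;
    d = n / 2;
    while d >= 1 {
        let mut i = 2 * d - 1;
        while i < n {
            let left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
            i += 2 * d;
        }
        d /= 2;
    }
}

fn main() {
    let mut v = [3u32, 1, 7, 0, 4, 1, 6, 3];
    blelloch_exclusive_scan(&mut v);
    assert_eq!(v, [0, 3, 4, 11, 11, 15, 16, 22]);
}
```

Both sweeps touch each tree level once, which is why the GPU version needs only O(log n) dispatches.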
All LZ-based pipelines use multi-stream entropy coding: the match finder
produces a universal LzToken stream (literals + matches), and a pluggable
TokenEncoder splits it into independent byte streams with tighter symbol
distributions. Each stream gets its own entropy coder, yielding lower
per-stream entropy than a single combined stream.
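A minimal model of this split, using hypothetical simplified types (the real `LzToken` and `TokenEncoder` live in the pipeline module and carry more fields and streams):

```rust
/// Hypothetical, simplified token type — illustration only.
enum LzToken {
    Literal(u8),
    Match { offset: u32, length: u32 },
}

/// Split one token stream into per-kind byte streams (here just 3 of the
/// 6 streams the LzSeq encoder produces). Each stream then gets its own
/// entropy coder, since each has a much tighter symbol distribution.
fn split_streams(tokens: &[LzToken]) -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    let (mut flags, mut literals, mut offsets) = (Vec::new(), Vec::new(), Vec::new());
    for t in tokens {
        match t {
            LzToken::Literal(b) => {
                flags.push(0);
                literals.push(*b);
            }
            LzToken::Match { offset, .. } => {
                flags.push(1);
                offsets.extend_from_slice(&offset.to_le_bytes());
            }
        }
    }
    (flags, literals, offsets)
}

fn main() {
    let tokens = vec![
        LzToken::Literal(b'a'),
        LzToken::Match { offset: 7, length: 4 },
        LzToken::Literal(b'b'),
    ];
    let (flags, literals, offsets) = split_streams(&tokens);
    assert_eq!(flags, [0, 1, 0]);
    assert_eq!(literals, b"ab");
    assert_eq!(offsets, 7u32.to_le_bytes());
}
```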
| Encoder | Streams | Used by | Contents |
|---|---|---|---|
| `LzSeqEncoder` | 6 | Lzf, LzSeqR, LzSeqH, SortLz | flags, literals, offset_codes, offset_extra, length_codes, length_extra |
| `LzssEncoder` | 4 | Lzfi, LzssR | flags (1-bit per token), literals, offsets (u16 LE), lengths (u16 LE) |
`LzSeqEncoder` uses log2-coded offsets and lengths with repeat-offset tracking, achieving the best ratio (~32% on Canterbury+large). `LzssEncoder` uses raw u16 values with flag bits, trading ratio for simplicity.
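The code+extra-bits scheme can be sketched as follows (illustrative only — the real encoder also handles repeat offsets and separate length codes):

```rust
/// Split a value into a log2 code and its extra bits, zstd-style:
/// code = floor(log2(v)), extra = v - 2^code, stored in `code` raw bits.
fn log2_code(v: u32) -> (u8, u32) {
    assert!(v >= 1);
    let code = 31 - v.leading_zeros();
    (code as u8, v - (1 << code))
}

fn main() {
    assert_eq!(log2_code(1), (0, 0));    // code 0, no extra bits
    assert_eq!(log2_code(300), (8, 44)); // 300 = 256 + 44, 8 extra bits
    // Only the small `code` stream is entropy-coded; the extra bits are
    // near-uniform, so storing them raw costs almost nothing in ratio.
}
```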
Input → tokenize() → Vec<LzToken> → TokenEncoder::encode() → EncodedStreams → entropy coding
Match finding and parsing (`tokenize()` in `src/pipeline/mod.rs`) are decoupled from wire encoding. The `tokenize()` entry point handles GPU/CPU dispatch, SortLz vs HashChain match finding, and parse strategy (greedy/lazy/optimal).
`src/rans.rs` implements range ANS (rANS), a streaming entropy coder that uses multiply-shift arithmetic instead of table lookups (FSE/tANS) or bit-level tree walks (Huffman). rANS approaches Shannon entropy like arithmetic/range coding but with a simpler, more parallelizable decode hot path.
| Property | Huffman | FSE (tANS) | rANS |
|---|---|---|---|
| Decode operation | Bit-level tree walk | Table lookup | Multiply + lookup |
| I/O granularity | Bits | Bits | 16-bit words |
| Branch predict | Poor | Good | Good |
| State independence | N/A | Awkward | Interleave N states |
| GPU shared memory | Large trees | Large tables | Small freq tables |
- Single-stream (`encode`/`decode`): Reference scalar implementation. 32-bit state in [2^16, 2^32), 16-bit word I/O.
- Interleaved N-way (`encode_interleaved`/`decode_interleaved`): N independent rANS states with round-robin symbol assignment. Default N=4 (maps to SSE2 lanes). All N decode chains run in parallel with zero data dependencies between them.
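The core rANS transform can be illustrated with a toy two-symbol roundtrip. This sketch omits renormalization and word I/O entirely (a u64 state works only for short inputs); the real coder keeps the state in [2^16, 2^32) and streams 16-bit words:

```rust
/// Toy rANS encode: fold symbols into the state in reverse order,
/// so that decoding pops them back out in forward order.
fn encode(msg: &[usize], freqs: &[u64; 2], cum: &[u64; 2], scale_bits: u32) -> u64 {
    let mut x: u64 = 1;
    for &s in msg.iter().rev() {
        x = ((x / freqs[s]) << scale_bits) + (x % freqs[s]) + cum[s];
    }
    x
}

/// Toy rANS decode: read the slot, look up the symbol, pop it off the state.
fn decode(mut x: u64, n: usize, freqs: &[u64; 2], cum: &[u64; 2], scale_bits: u32) -> Vec<usize> {
    let mask = (1u64 << scale_bits) - 1;
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let slot = x & mask;
        let s = if slot < cum[1] { 0 } else { 1 };
        x = freqs[s] * (x >> scale_bits) + slot - cum[s];
        out.push(s);
    }
    out
}

fn main() {
    // Symbol 0 has frequency 3, symbol 1 has frequency 1; sum = 1 << 2.
    let (freqs, cum) = ([3u64, 1], [0u64, 3]);
    let msg = [0usize, 1, 0, 0, 1, 0];
    let x = encode(&msg, &freqs, &cum, 2);
    assert_eq!(decode(x, msg.len(), &freqs, &cum, 2), msg);
}
```

Note the decode hot path is a mask, a compare, and a multiply-add — this is the loop the interleaved N-way variant replicates across independent states.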
Single-stream:
[scale_bits: u8] [freq_table: 256 × u16 LE] [final_state: u32 LE]
[num_words: u32 LE] [words: num_words × u16 LE]
Interleaved N-way:
[scale_bits: u8] [freq_table: 256 × u16 LE] [num_states: u8]
[final_states: N × u32 LE] [num_words: N × u32 LE]
[stream_0_words] [stream_1_words] ... [stream_N-1_words]
Header overhead is 521 bytes (1 + 512 + 4 + 4) for single-stream, making rANS most effective for inputs larger than ~1 KB.
Frequencies are normalized to sum to `1 << scale_bits` (default 12 bits = 4096). Every symbol present in the input gets at least frequency 1. Excess is trimmed from the most-frequent symbol; deficit is added to it. The normalization code is conceptually shared with `src/fse.rs` (both operate on power-of-2 tables).
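The normalization step can be sketched like this (a simplified model — the real code handles pathological distributions more carefully):

```rust
/// Normalize raw counts so they sum to exactly 1 << scale_bits, with every
/// present symbol keeping frequency >= 1. Simplified sketch: any excess or
/// deficit left after scaling is absorbed by the most frequent symbol.
fn normalize(counts: &[u32; 256], scale_bits: u32) -> [u32; 256] {
    let total: u64 = counts.iter().map(|&c| c as u64).sum();
    let target = 1u64 << scale_bits;
    let mut freqs = [0u32; 256];
    let mut sum = 0u64;
    for i in 0..256 {
        if counts[i] > 0 {
            // Scale down proportionally, but never below 1.
            let f = ((counts[i] as u64 * target) / total).max(1);
            freqs[i] = f as u32;
            sum += f;
        }
    }
    // Push the rounding difference onto the most frequent symbol.
    let top = (0..256).max_by_key(|&i| freqs[i]).unwrap();
    freqs[top] = (freqs[top] as i64 + (target as i64 - sum as i64)) as u32;
    freqs
}

fn main() {
    let mut counts = [0u32; 256];
    counts[b'a' as usize] = 7000;
    counts[b'b' as usize] = 999;
    counts[b'c' as usize] = 1; // rare symbol still gets frequency >= 1
    let freqs = normalize(&counts, 12);
    assert_eq!(freqs.iter().sum::<u32>(), 4096);
    assert!(freqs[b'c' as usize] >= 1);
}
```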
See docs/exec-plans/tech-debt-tracker.md for rANS SIMD decode and reciprocal multiplication work items.
`src/simd.rs` provides runtime-dispatched SIMD for CPU hot paths:
- Byte frequency counting — 4-bank histogramming with AVX2 merge, integrated into `FrequencyTable::count()`
- LZ77 match comparison — SSE2 (16 bytes/cycle) or AVX2 (32 bytes/cycle) `compare_bytes`, integrated into `HashChainFinder::find_match()` and `find_top_k()`
- u32 array summation — widened u64 accumulator lanes for overflow-safe SIMD sum
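The 4-bank histogramming trick can be shown in scalar form (a sketch of the idea; the real path vectorizes the merge with AVX2):

```rust
/// Count byte frequencies into 4 separate banks so that consecutive bytes
/// hitting the same counter don't serialize on one store-to-load chain,
/// then merge the banks at the end.
fn byte_frequencies(data: &[u8]) -> [u32; 256] {
    let mut banks = [[0u32; 256]; 4];
    let mut chunks = data.chunks_exact(4);
    for c in &mut chunks {
        banks[0][c[0] as usize] += 1;
        banks[1][c[1] as usize] += 1;
        banks[2][c[2] as usize] += 1;
        banks[3][c[3] as usize] += 1;
    }
    for &b in chunks.remainder() {
        banks[0][b as usize] += 1;
    }
    // Merge the banks (this loop is what the AVX2 path accelerates).
    let mut out = [0u32; 256];
    for i in 0..256 {
        out[i] = banks[0][i] + banks[1][i] + banks[2][i] + banks[3][i];
    }
    out
}

fn main() {
    let freqs = byte_frequencies(b"abracadabra");
    assert_eq!(freqs[b'a' as usize], 5);
    assert_eq!(freqs[b'b' as usize], 2);
    assert_eq!(freqs[b'r' as usize], 2);
}
```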
| Architecture | Baseline | Extended | Status |
|---|---|---|---|
| x86_64 | SSE2 | AVX2 | Implemented + integrated |
| aarch64 | NEON | SVE | Stubs (dispatch to scalar) |
Runtime detection via `Dispatcher::new()` caches the best ISA level at first call. All SIMD implementations are verified against a scalar reference in tests.
SortLZ is a deterministic, GPU-friendly LZ77 match finder. It replaces hash-chain match finding with radix sort of (hash, position) pairs followed by adjacent-pair match verification. Zero atomics, fully deterministic — ideal for GPU execution.
| Mode | Description | Wire format |
|---|---|---|
| `Pipeline::SortLz` (ID 10) | Standalone pipeline with its own wire format | SortLZ-specific (see below) |
| `MatchFinder::SortLz` | Pluggable match finder for other pipelines | Host pipeline's format |
When used as a `MatchFinder`, SortLZ is transparent to the wire format — the output is 100% compatible with the host pipeline (Lzf, LzSeqR, LzSeqH); the consumer and decompressor see no difference.
[meta_len: u16 LE] LzSeq metadata length
[meta: meta_len bytes] num_tokens + num_matches (u32 each)
[num_streams: u8] number of streams (6 for LzSeq)
per stream:
[orig_len: u32 LE] uncompressed stream length
[fse_len: u32 LE] FSE-compressed length
[fse_data: fse_len bytes]
SortLz uses `LzSeqEncoder` for wire encoding + FSE for entropy. The 6 streams are: flags, literals, offset_codes, offset_extra, length_codes, length_extra.
- Hash: Compute 4-byte window hashes (u32 from LE bytes, no collisions for 4-byte matches)
- Radix sort: 4-pass 8-bit LSB radix sort on (hash, position) pairs
- Verify: Adjacent same-hash entries → extend match byte-by-byte (u64 chunk comparison)
- Select: Best match per position (longest wins, max_candidates=8 per sorted entry)
- Parse: Greedy or lazy token emission
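The hash/sort/verify steps can be condensed into a small scalar sketch. This is a deliberate simplification: it uses the standard library sort in place of the 4-pass radix sort, checks only one adjacent candidate per entry (the real finder considers up to max_candidates=8), and extends byte-by-byte rather than in u64 chunks:

```rust
/// Find, for each position, a match against an earlier same-hash neighbor
/// after sorting (hash, position) pairs. Returns (position, offset, length).
fn sortlz_matches(data: &[u8]) -> Vec<(usize, usize, usize)> {
    if data.len() < 4 {
        return Vec::new();
    }
    // 1. Hash: the 4-byte window itself, read as a LE u32 (collision-free).
    let mut pairs: Vec<(u32, usize)> = (0..data.len() - 3)
        .map(|i| (u32::from_le_bytes([data[i], data[i + 1], data[i + 2], data[i + 3]]), i))
        .collect();
    // 2. Sort (the real finder runs a 4-pass 8-bit LSB radix sort here).
    pairs.sort_unstable();
    // 3-4. Verify adjacent same-hash entries and extend the match.
    let mut matches = Vec::new();
    for w in pairs.windows(2) {
        let ((h0, p0), (h1, p1)) = (w[0], w[1]);
        if h0 == h1 {
            // Equal hashes are ordered by position, so p0 < p1.
            let (src, dst) = (p0, p1);
            let mut len = 4;
            while dst + len < data.len() && data[src + len] == data[dst + len] {
                len += 1;
            }
            matches.push((dst, dst - src, len));
        }
    }
    matches
}

fn main() {
    let m = sortlz_matches(b"abcdefabcdef");
    // "abcdef" at position 6 matches position 0 at offset 6, length 6.
    assert!(m.contains(&(6, 6, 6)));
}
```

Note there are no hash chains and no atomics — every step is a bulk data-parallel pass, which is exactly what makes the scheme GPU-friendly.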
Uses GPU radix sort (same kernels as BWT) + GPU match verification:
- 4-pass radix sort batched into a single command encoder (1 submit, not 16+)
- `encoder.clear_buffer()` for histogram zeroing (no CPU↔GPU sync)
- Separate submit for match verification (it needs the sort results)
- 10.6x faster than CPU SortLZ at 4MB (89 vs 8.4 MB/s)
`select_match_finder()` in `src/pipeline/mod.rs` chooses SortLz when:
- GPU available and input ≥ MIN_GPU_INPUT_SIZE
- High match density (>0.3) and moderate entropy (<6.5 bits/byte)
- Large input (≥64KB) with match density >0.2 and low entropy (<5.5)
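A sketch of that decision logic, using the thresholds listed above. The signature, the `MIN_GPU_INPUT_SIZE` value, and the way the three conditions combine are assumptions for illustration — the real function takes an analysis profile:

```rust
#[derive(Debug, PartialEq)]
enum MatchFinderKind {
    HashChain,
    SortLz,
}

/// Hypothetical sketch of the adaptive heuristic. Assumed: GPU availability
/// gates SortLz, and either density condition then selects it.
fn select_match_finder(
    gpu_available: bool,
    input_len: usize,
    match_density: f64,
    entropy_bits: f64,
) -> MatchFinderKind {
    const MIN_GPU_INPUT_SIZE: usize = 64 * 1024; // assumed value
    let gpu_ok = gpu_available && input_len >= MIN_GPU_INPUT_SIZE;
    // High match density with moderate entropy.
    let dense = match_density > 0.3 && entropy_bits < 6.5;
    // Large, fairly compressible input.
    let large = input_len >= 64 * 1024 && match_density > 0.2 && entropy_bits < 5.5;
    if gpu_ok && (dense || large) {
        MatchFinderKind::SortLz
    } else {
        MatchFinderKind::HashChain
    }
}

fn main() {
    // Repetitive 256KB input with a GPU present → SortLz.
    assert_eq!(select_match_finder(true, 256 * 1024, 0.5, 4.0), MatchFinderKind::SortLz);
    // Small input → hash chain regardless of density.
    assert_eq!(select_match_finder(true, 8 * 1024, 0.5, 4.0), MatchFinderKind::HashChain);
}
```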
| Size | CPU hashchain | CPU SortLZ | GPU SortLZ | GPU vs CPU SortLZ |
|---|---|---|---|---|
| 8KB | 244 MB/s | 85 MB/s | 4 MB/s | GPU overhead |
| 64KB | 140 MB/s | 44 MB/s | 31 MB/s | 0.7x |
| 256KB | 131 MB/s | 31 MB/s | 53 MB/s | 1.7x faster |
| 4MB | 142 MB/s | 8 MB/s | 89 MB/s | 10.6x faster |
SortLZ compression ratio: 39.6% (vs BWT 32.7%).
`compress_lazy_parallel(input, num_threads)` pre-computes matches in parallel (each thread builds its own hash chain), then serializes sequentially with lazy evaluation. Thresholds:
- `MIN_PARALLEL_SIZE = 256KB` — below this, single-threaded is faster
- `MIN_SEGMENT_SIZE = 128KB` — caps the thread count to amortize hash-chain warmup
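The segment-size cap translates into a simple thread-count computation. This is an illustrative helper built from the two documented thresholds, not the actual function:

```rust
const MIN_PARALLEL_SIZE: usize = 256 * 1024;
const MIN_SEGMENT_SIZE: usize = 128 * 1024;

/// Hypothetical sketch: run single-threaded below the parallel threshold,
/// and never spawn so many threads that segments fall under 128KB.
fn effective_threads(input_len: usize, requested: usize) -> usize {
    if input_len < MIN_PARALLEL_SIZE {
        return 1;
    }
    requested.min(input_len / MIN_SEGMENT_SIZE).max(1)
}

fn main() {
    assert_eq!(effective_threads(64 * 1024, 8), 1);  // too small to parallelize
    assert_eq!(effective_threads(256 * 1024, 8), 2); // capped: 2 × 128KB segments
    assert_eq!(effective_threads(4 << 20, 8), 8);    // plenty of work per thread
}
```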
| Size | CPU hashchain | CPU lazy | GPU hash | GPU vs CPU hashchain |
|---|---|---|---|---|
| 1KB | 14µs (71 MiB/s) | 6µs (164 MiB/s) | 1.1ms (1 MiB/s) | 65x slower |
| 10KB | 57µs (171 MiB/s) | 42µs (231 MiB/s) | 1.4ms (7 MiB/s) | 20x slower |
| 64KB | 1.3ms (48 MiB/s) | 611µs (102 MiB/s) | 1.7ms (36 MiB/s) | 1.3x slower |
| 256KB | 6.2ms (40 MiB/s) | 2.6ms (97 MiB/s) | 3.4ms (74 MiB/s) | 2x faster |
| 1MB | 20ms (50 MiB/s) | 16.7ms (60 MiB/s) | 9.0ms (111 MiB/s) | 2x faster |
| Size | CPU | GPU + CPU scan | GPU + GPU scan | Best GPU vs CPU |
|---|---|---|---|---|
| 10KB | 23µs (418 MiB/s) | 312µs (31 MiB/s) | 518µs (19 MiB/s) | CPU 13x faster |
| 64KB | 432µs (145 MiB/s) | 926µs (68 MiB/s) | 1.45ms (43 MiB/s) | CPU 2x faster |
| 256KB | 1.85ms (135 MiB/s) | 999µs (250 MiB/s) | 543µs (460 MiB/s) | GPU 3.4x faster |
GPU Huffman with Blelloch prefix sum crosses over ~128KB. At 256KB the GPU scan path is 3.4x faster than CPU.
| Size | GPU radix | Throughput | Old bitonic | Speedup vs bitonic |
|---|---|---|---|---|
| 1KB | 3.4ms | 295 KiB/s | 23ms | 6.8x |
| 10KB | 5.9ms | 1.6 MiB/s | 42ms | 7.1x |
| 64KB | 4.1ms | 15.3 MiB/s | 56ms | 13.7x |
| 256KB | 11.6ms | 21.6 MiB/s | — | — |
| 4MB | 333ms | 12.0 MiB/s | — | — |
| 16MB | 1.73s | 9.2 MiB/s | — | — |
GPU BWT radix sort is 7-14x faster than the old bitonic sort. Still slower than CPU SA-IS at small sizes but becoming competitive at 64KB+ (CPU SA-IS ~1ms at 64KB vs GPU 4.1ms). The gap narrows at larger sizes where GPU parallelism helps more.
The optimal split for libpz is GPU for LZ77 match-finding, CPU for entropy coding, overlapped via the unified scheduler with ring-buffered streaming.
Why GPU wins on LZ77: Match-finding is embarrassingly parallel — each position's search is independent. The cooperative-stitch kernel does 1,788 probes/position and is 2x faster than CPU at 256KB+. Ring-buffered batching (`find_matches_batched`) adds +7-17% throughput by amortizing buffer allocation and overlapping GPU compute with CPU readback.
Why CPU wins on entropy: rANS/FSE/Huffman are serial state machines — each symbol depends on the previous state. GPU entropy has been tried extensively (500+ iterations: single-stream, independent blocks, Recoil checkpoints, batched cross-block) and is 0.77x CPU on encode, 0.54x on decode. The serial dependency limits GPU to ~300 useful threads when saturation needs ~8K-16K. PCIe transfer overhead dominates at typical block sizes (128KB-256KB).
Architecture: The unified scheduler dispatches LZ77 to GPU and entropy to CPU workers in parallel. While CPU thread N entropy-encodes block K, the GPU is already match-finding block K+1. The `GPU_ENTROPY_THRESHOLD` (256KB) is deliberately set above `DEFAULT_GPU_BLOCK_SIZE` (128KB) to prevent routing entropy to the slower GPU path.
See docs/design-docs/gpu-strategy.md for full analysis and CLAUDE.md "Known
dead ends" for the complete list of GPU optimization attempts that failed.
- LZ77 match finding — GPU hash-table kernel for Lzf/LzSeq pipelines (2x faster at 256KB+)
- SortLZ match finding — GPU radix sort + match verification (10.6x faster at 4MB)
- BWT suffix array — GPU radix sort with prefix-doubling
- Interleaved FSE — GPU-accelerated encode/decode for Lzfi pipeline
- rANS Recoil decode — GPU-accelerated parallel rANS decode using split-point metadata
- All entropy coding in the streaming CLI — `streaming::compress_stream` uses CPU rANS/FSE
- rANS/FSE/Huffman encode/decode — serial state machines; GPU is 0.54-0.77x CPU speed
- LZ77 lazy evaluation — sequential dependency (the next match depends on the current one)
- LzSeq `encode_with_config` — repeat offset tracking, adaptive chain depth, hash4 prefix
- `GPU_ENTROPY_THRESHOLD` (256KB) > `DEFAULT_GPU_BLOCK_SIZE` (128KB) — prevents routing entropy to the GPU
- `MIN_GPU_INPUT_SIZE` — minimum block size for GPU dispatch (avoids setup overhead)
- GPU BWT still slower than CPU SA-IS — radix sort improved 7-14x over bitonic sort, but CPU SA-IS (O(n)) remains faster at small/medium sizes. The GPU catches up at 64KB+, but prefix-doubling's O(n log n) work is inherently more than SA-IS's O(n).
- Hash bucket overflow — a fixed `BUCKET_CAP = 64` means highly repetitive data may miss good matches. Adaptive bucket sizing could help.
- LZ77 match array still downloaded for dedupe — GPU match dedup is sequential and runs on the CPU. Keeping serialized LZ77 bytes on the GPU for histogram + Huffman is already done (ByteHistogram optimization), but the match download is unavoidable.
LzSeqR is our best pipeline at 35.1% vs gzip's 28.6% (6.5pp gap). The gap is
encoding efficiency, not match quality (see CLAUDE.md "Known dead ends"). The
format is pre-release so all changes are free. Key opportunities:
- Zstd-style literal-run sequences — replace per-token flags with `(literal_run_length, offset, match_length)` tuples, eliminating the flags stream entirely. Highest ceiling.
- Larger repeat offset cache (4→8) — each additional repeat hit saves all the extra bits for that match.
- Entropy-code the extra bits — `offset_extra` and `length_extra` currently bypass rANS; if the values are skewed, 5-15% savings.
- Sparse frequency tables — 512 bytes per rANS stream → ~61 bytes for narrow-alphabet streams. Saves ~1.3KB/block.
The optimal parser currently uses LZ77's match_cost approximation. A dedicated
LzSeq optimal parser that tracks repeat offset state through the backward DP would
find more repeat matches, directly improving ratio.
- Replace scalar stubs in `src/simd.rs` with actual NEON intrinsics:
  - `compare_bytes`: `vceqq_u8` + `vmovn_u16` for 16-byte comparison
  - `byte_frequencies`: 4-bank unrolled (NEON lacks efficient gather/scatter)
  - `sum_u32`: `vld1q_u32` + `vaddq_u32` + `vaddvq_u32`
- SVE for ARMv8.2+ (variable-length vectors, predicated operations)
- Requires aarch64 hardware for benchmarking