feat: LzSeq2R pipeline with sparse rANS, DP refinement, and hash4 default #124
Merged
ChrisLundquist merged 3 commits into master on Mar 14, 2026
Conversation
Three independent compression ratio improvements, composable in a new
Pipeline::LzSeq2R (ID 12):
1. Sparse rANS frequency tables (src/rans/mod.rs)
- Only serialize non-zero entries: [num_symbols:u8][symbols:N×u8][freqs:N×u16]
- offset_codes stream: ~20 distinct symbols → 41 bytes vs 512 bytes (92% reduction)
- Across 6 streams, saves ~2800 bytes/block of freq table overhead
- Biggest impact on small files (grammar.lsp: 116% → 63%)
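The sparse table layout above can be sketched as follows. This is an illustrative reimplementation of the [num_symbols:u8][symbols:N×u8][freqs:N×u16] wire format, not the actual src/rans/mod.rs code; the function name, little-endian byte order, and the assumption that at most 255 symbols have non-zero frequency are all mine.

```rust
/// Hypothetical sketch of the sparse frequency-table serialization:
/// [num_symbols:u8][symbols:N×u8][freqs:N×u16]. Assumes N <= 255 and
/// little-endian u16 frequencies.
fn serialize_sparse(freqs: &[u16; 256]) -> Vec<u8> {
    // Collect only the symbols with a non-zero frequency.
    let nonzero: Vec<(u8, u16)> = freqs
        .iter()
        .enumerate()
        .filter(|&(_, &f)| f != 0)
        .map(|(sym, &f)| (sym as u8, f))
        .collect();
    let mut out = Vec::with_capacity(1 + nonzero.len() * 3);
    out.push(nonzero.len() as u8); // [num_symbols:u8]
    for &(sym, _) in &nonzero {
        out.push(sym); // [symbols:N×u8]
    }
    for &(_, f) in &nonzero {
        out.extend_from_slice(&f.to_le_bytes()); // [freqs:N×u16]
    }
    out
}

fn main() {
    // A stream using only 3 of 256 symbols serializes to 1 + 3 + 3*2 = 10
    // bytes, versus the 512 bytes a dense [u16; 256] table would take.
    let mut freqs = [0u16; 256];
    freqs[b'a' as usize] = 100;
    freqs[b'b' as usize] = 50;
    freqs[b'z' as usize] = 7;
    let encoded = serialize_sparse(&freqs);
    assert_eq!(encoded.len(), 10);
    assert_eq!(encoded[0], 3);
    println!("{} bytes", encoded.len());
}
```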
2. DP forward refinement pass (src/optimal.rs)
- Backward DP uses greedy repeat offset estimates (can diverge from optimal)
- New: after backward DP, walk forward with actual RepeatOffsetState
- Re-evaluate each position's match candidates with real repeat state
- Iterate up to 3 passes until stable (typically converges in 1-2)
- Benefits ALL LzSeq pipelines: LzSeqR improved 35.1% → 31.7% on Canterbury
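The pass structure described above (re-run a forward re-evaluation until the parse stops changing, capped at 3 passes) can be sketched generically. `refine_parse` and the toy closure below are illustrative stand-ins; the real pass in src/optimal.rs re-scores match candidates with the actual RepeatOffsetState.

```rust
/// Illustrative convergence loop for the forward refinement pass: apply a
/// forward re-evaluation until the token plan stops changing, capped at
/// MAX_REFINE_PASSES. `forward_pass` is an opaque closure standing in for
/// the real candidate re-scoring.
const MAX_REFINE_PASSES: usize = 3;

fn refine_parse<T, F>(mut plan: Vec<T>, mut forward_pass: F) -> (Vec<T>, usize)
where
    T: PartialEq,
    F: FnMut(&[T]) -> Vec<T>,
{
    let mut passes = 0;
    for _ in 0..MAX_REFINE_PASSES {
        let next = forward_pass(&plan);
        passes += 1;
        if next == plan {
            break; // stable: the forward pass changed nothing
        }
        plan = next;
    }
    (plan, passes)
}

fn main() {
    // Toy "refinement": clamp every value to at most 4. Converges in 2
    // passes: one that changes the plan, one that confirms stability.
    let plan = vec![1, 9, 3, 7];
    let (refined, passes) =
        refine_parse(plan, |p: &[i32]| p.iter().map(|&v| v.min(4)).collect());
    assert_eq!(refined, vec![1, 4, 3, 4]);
    assert_eq!(passes, 2);
}
```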
3. LzSeq2 wire format (src/lzseq/seq2.rs) — experimental, NOT used by default
- Literal-run-length sequences (zstd-style): 5 streams instead of 6
- Eliminates flags stream, combines extra bits into single raw stream
- Result: worse than LzSeq on text (lit_run_codes has more entropy than
flags), better on binary (E.coli). Kept as code but LzSeq2R uses
LzSeq format + sparse rANS instead.
Pipeline wiring: LzSeq2R = LzSeq demuxer (6 streams) + sparse rANS entropy.
This gives the best combination: LzSeq's efficient text encoding + sparse
freq table savings. Benchmarks on Canterbury corpus:
LzSeqR (before): 35.1% (+6.5pp vs gzip)
LzSeqR (after): 31.7% (+3.1pp vs gzip) — DP refinement helps all pipelines
LzSeq2R: 31.4% (+2.8pp vs gzip) — sparse rANS saves 0.3pp more
gzip -6: 28.6%
Closes ~57% of the gap to gzip (6.5pp → 2.8pp).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two encoding efficiency improvements:
1. Cost model recalibration (src/optimal.rs)
- literal_overhead: 4*COST_SCALE → COST_SCALE (was modeled for old
5-byte LZ77 format; LzSeq literals cost only a flag byte ≈ 1 bit)
- match_cost: add literal_overhead for trailing literal's flag cost
(every LZ77 Match produces a trailing literal token in LzSeq)
- repeat match: length_code_cost 4→3 bits (skewed distribution)
- match_overhead: 16→8 bits (flag + 2 code bytes + trailing flag)
Impact: with -O (optimal parsing), LzSeq2R improves 26.7% → 26.0%
on small Canterbury. The old model overvalued matches by ~3 bits per
literal avoided, causing the DP to take marginal short matches.
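As a rough sketch of the recalibrated constants: COST_SCALE's value and every name below are assumptions, not the real src/optimal.rs model; the point is only to show how charging the trailing literal's flag to the match, and shrinking the overheads, tilts the DP away from marginal short matches.

```rust
/// Hypothetical fixed-point cost model: COST_SCALE cost units per bit.
/// Constants mirror the recalibration described above, but the scale value
/// and function names are illustrative.
const COST_SCALE: u64 = 16;
const LITERAL_OVERHEAD: u64 = COST_SCALE; // was 4 * COST_SCALE; flag ≈ 1 bit
const MATCH_OVERHEAD: u64 = 8 * COST_SCALE; // was 16 bits: flag + 2 code bytes + trailing flag
const REPEAT_LENGTH_CODE: u64 = 3 * COST_SCALE; // was 4 bits: skewed distribution

/// Cost of one literal: its entropy-coded bits plus the flag overhead.
fn literal_cost(symbol_bits: u64) -> u64 {
    symbol_bits + LITERAL_OVERHEAD
}

/// Cost of a match. Every LZ77 match produces a trailing literal token in
/// LzSeq, so the trailing literal's flag cost is charged to the match.
fn match_cost(length_extra_bits: u64, offset_bits: u64) -> u64 {
    MATCH_OVERHEAD + length_extra_bits + offset_bits + LITERAL_OVERHEAD
}

fn main() {
    // A short match replacing 3 cheap (5-bit) literals: under this model the
    // literals win, whereas a 4-bit-per-literal overhead would have made the
    // match look profitable.
    let three_literals: u64 = (0..3).map(|_| literal_cost(5 * COST_SCALE)).sum();
    let short_match = match_cost(2 * COST_SCALE, 10 * COST_SCALE);
    assert!(three_literals < short_match);
    let _ = REPEAT_LENGTH_CODE; // repeat matches would use this instead
    println!("literals: {three_literals}, match: {short_match}");
}
```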
2. Auto-tuned rANS scale_bits (src/pipeline/stages.rs)
- Count distinct symbols per stream before encoding
- ≤8 symbols: scale_bits=10 (e.g., flags stream with 2 values)
- ≥64 symbols: scale_bits=13 (e.g., literals with 100+ values)
- Otherwise: DEFAULT_SCALE_BITS (12)
Better precision for high-cardinality streams, less waste for
low-cardinality ones.
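The heuristic above can be sketched as follows; the thresholds follow the commit message, while the function names and exact boundary handling are assumptions about src/pipeline/stages.rs.

```rust
/// Sketch of the cardinality-based scale_bits selection described above.
const DEFAULT_SCALE_BITS: u32 = 12;

fn auto_scale_bits(distinct: usize) -> u32 {
    if distinct <= 8 {
        10 // e.g. a flags stream with 2 distinct values
    } else if distinct >= 64 {
        13 // e.g. a literals stream with 100+ distinct values
    } else {
        DEFAULT_SCALE_BITS
    }
}

/// Count distinct byte values in a stream before encoding it.
fn distinct_symbols(stream: &[u8]) -> usize {
    let mut seen = [false; 256];
    let mut count = 0;
    for &b in stream {
        if !seen[b as usize] {
            seen[b as usize] = true;
            count += 1;
        }
    }
    count
}

fn main() {
    let flags = [0u8, 1, 0, 0, 1];
    assert_eq!(auto_scale_bits(distinct_symbols(&flags)), 10);
    let literals: Vec<u8> = (0..200u8).collect();
    assert_eq!(auto_scale_bits(distinct_symbols(&literals)), 13);
    assert_eq!(auto_scale_bits(20), DEFAULT_SCALE_BITS);
}
```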
Note: default CLI uses lazy matching (ParseStrategy::Auto), not optimal
parsing. The cost model only affects -O/--optimal/--quality modes.
Lazy matching on LzSeq2R achieves 25.7% on small Canterbury (vs gzip
26.1%), suggesting the LzSeq direct encoder + sparse rANS combination
is already very competitive without optimal parsing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LzSeq2 literal-run-length encoding proved worse than LzSeq on text (higher entropy in lit_run_codes vs simple flags). Remove all LzSeq2 code: seq2.rs, LzSeq2Encoder, LzDemuxer::LzSeq2. The LzSeq2R pipeline now uses the LzSeq wire format + sparse rANS (already wired in the previous commit).
Switch SeqConfig defaults from hash3 to hash4 and enable adaptive chain depth. Hash4 gives a 1.3pp better compression ratio at equal or slightly better throughput (fewer wasted chain traversals due to better hash distribution). Adaptive chain depth saves time on incompressible regions.
Canterbury corpus results (single-threaded, lazy):
Before: 33.0% ratio, 13.6 MB/s (hash3, no adaptive)
After: 31.7% ratio, 14.0 MB/s (hash4, adaptive)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
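The hash3 → hash4 change amounts to hashing a 4-byte window instead of 3, so positions that share only a 3-byte prefix no longer collide into the same chain. A minimal sketch of such a hash; the multiplicative constant (a common golden-ratio / Fibonacci-hashing choice) and the table-size shift are assumptions, not necessarily this repo's values.

```rust
/// Illustrative 4-byte match-finder hash of the kind "hash4" refers to.
/// Multiplies the 4-byte window by a golden-ratio constant and keeps the
/// top `table_bits` bits as the chain-table index.
fn hash4(window: &[u8], table_bits: u32) -> u32 {
    debug_assert!(window.len() >= 4 && table_bits <= 32);
    let v = u32::from_le_bytes([window[0], window[1], window[2], window[3]]);
    v.wrapping_mul(0x9E37_79B1) >> (32 - table_bits)
}

fn main() {
    let data = b"abcdabcd";
    let table_bits = 15;
    // Equal 4-byte windows hash to the same bucket...
    assert_eq!(hash4(&data[0..], table_bits), hash4(&data[4..], table_bits));
    // ...and every hash fits in a table of 2^table_bits entries.
    assert!(hash4(&data[1..], table_bits) < (1 << table_bits));
}
```

Hashing one more byte spreads positions over more buckets, which is consistent with the "fewer wasted chain traversals" observation: each chain then contains only candidates that already agree on 4 bytes.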
Summary
Compression ratio on Canterbury corpus (single-threaded, lazy)
Gap to gzip closed from 6.5pp to 2.5pp with no throughput regression.
Throughput (bible.txt, 4MB, single-threaded)
Test plan
./scripts/test.sh --quick

🤖 Generated with Claude Code