feat: LzSeq2R pipeline with sparse rANS, DP refinement, and hash4 default #124
Merged
ChrisLundquist merged 3 commits into master on Mar 14, 2026
Conversation
Three independent compression ratio improvements, composable in a new
Pipeline::LzSeq2R (ID 12):
1. Sparse rANS frequency tables (src/rans/mod.rs)
- Only serialize non-zero entries: [num_symbols:u8][symbols:N×u8][freqs:N×u16]
- offset_codes stream: ~20 distinct symbols → 41 bytes vs 512 bytes (92% reduction)
- Across 6 streams, saves ~2800 bytes/block of freq table overhead
- Biggest impact on small files (grammar.lsp: 116% → 63%)
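The sparse table layout above can be sketched as follows. This is an illustrative reimplementation of the [num_symbols:u8][symbols:N×u8][freqs:N×u16] wire format, not the actual src/rans/mod.rs code; the function name, little-endian byte order, and the assumption that at most 255 symbols have non-zero frequency are all mine.

```rust
/// Hypothetical sketch of the sparse frequency-table serialization:
/// [num_symbols:u8][symbols:N×u8][freqs:N×u16]. Assumes N <= 255 and
/// little-endian u16 frequencies.
fn serialize_sparse(freqs: &[u16; 256]) -> Vec<u8> {
    // Collect only the symbols with a non-zero frequency.
    let nonzero: Vec<(u8, u16)> = freqs
        .iter()
        .enumerate()
        .filter(|&(_, &f)| f != 0)
        .map(|(sym, &f)| (sym as u8, f))
        .collect();
    let mut out = Vec::with_capacity(1 + nonzero.len() * 3);
    out.push(nonzero.len() as u8); // [num_symbols:u8]
    for &(sym, _) in &nonzero {
        out.push(sym); // [symbols:N×u8]
    }
    for &(_, f) in &nonzero {
        out.extend_from_slice(&f.to_le_bytes()); // [freqs:N×u16]
    }
    out
}

fn main() {
    // A stream using only 3 of 256 symbols serializes to 1 + 3 + 3*2 = 10
    // bytes, versus the 512 bytes a dense [u16; 256] table would take.
    let mut freqs = [0u16; 256];
    freqs[b'a' as usize] = 100;
    freqs[b'b' as usize] = 50;
    freqs[b'z' as usize] = 7;
    let encoded = serialize_sparse(&freqs);
    assert_eq!(encoded.len(), 10);
    assert_eq!(encoded[0], 3);
    println!("{} bytes", encoded.len());
}
```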
2. DP forward refinement pass (src/optimal.rs)
- Backward DP uses greedy repeat offset estimates (can diverge from optimal)
- New: after backward DP, walk forward with actual RepeatOffsetState
- Re-evaluate each position's match candidates with real repeat state
- Iterate up to 3 passes until stable (typically converges in 1-2)
- Benefits ALL LzSeq pipelines: LzSeqR improved 35.1% → 31.7% on Canterbury
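The pass structure described above (re-run a forward re-evaluation until the parse stops changing, capped at 3 passes) can be sketched generically. `refine_parse` and the toy closure below are illustrative stand-ins; the real pass in src/optimal.rs re-scores match candidates with the actual RepeatOffsetState.

```rust
/// Illustrative convergence loop for the forward refinement pass: apply a
/// forward re-evaluation until the token plan stops changing, capped at
/// MAX_REFINE_PASSES. `forward_pass` is an opaque closure standing in for
/// the real candidate re-scoring.
const MAX_REFINE_PASSES: usize = 3;

fn refine_parse<T, F>(mut plan: Vec<T>, mut forward_pass: F) -> (Vec<T>, usize)
where
    T: PartialEq,
    F: FnMut(&[T]) -> Vec<T>,
{
    let mut passes = 0;
    for _ in 0..MAX_REFINE_PASSES {
        let next = forward_pass(&plan);
        passes += 1;
        if next == plan {
            break; // stable: the forward pass changed nothing
        }
        plan = next;
    }
    (plan, passes)
}

fn main() {
    // Toy "refinement": clamp every value to at most 4. Converges in 2
    // passes: one that changes the plan, one that confirms stability.
    let plan = vec![1, 9, 3, 7];
    let (refined, passes) =
        refine_parse(plan, |p: &[i32]| p.iter().map(|&v| v.min(4)).collect());
    assert_eq!(refined, vec![1, 4, 3, 4]);
    assert_eq!(passes, 2);
}
```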
3. LzSeq2 wire format (src/lzseq/seq2.rs) — experimental, NOT used by default
- Literal-run-length sequences (zstd-style): 5 streams instead of 6
- Eliminates flags stream, combines extra bits into single raw stream
- Result: worse than LzSeq on text (lit_run_codes has more entropy than
flags), better on binary (E.coli). Kept as code but LzSeq2R uses
LzSeq format + sparse rANS instead.
Pipeline wiring: LzSeq2R = LzSeq demuxer (6 streams) + sparse rANS entropy.
This gives the best combination: LzSeq's efficient text encoding + sparse
freq table savings. Benchmarks on Canterbury corpus:
LzSeqR (before): 35.1% (+6.5pp vs gzip)
LzSeqR (after): 31.7% (+3.1pp vs gzip) — DP refinement helps all pipelines
LzSeq2R: 31.4% (+2.8pp vs gzip) — sparse rANS saves 0.3pp more
gzip -6: 28.6%
Closes ~57% of the gap to gzip (6.5pp → 2.8pp).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two encoding efficiency improvements:
1. Cost model recalibration (src/optimal.rs)
- literal_overhead: 4*COST_SCALE → COST_SCALE (was modeled for old
5-byte LZ77 format; LzSeq literals cost only a flag byte ≈ 1 bit)
- match_cost: add literal_overhead for trailing literal's flag cost
(every LZ77 Match produces a trailing literal token in LzSeq)
- repeat match: length_code_cost 4→3 bits (skewed distribution)
- match_overhead: 16→8 bits (flag + 2 code bytes + trailing flag)
Impact: with -O (optimal parsing), LzSeq2R improves 26.7% → 26.0%
on small Canterbury. The old model overvalued matches by ~3 bits per
literal avoided, causing the DP to take marginal short matches.
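As a rough sketch of the recalibrated constants: COST_SCALE's value and every name below are assumptions, not the real src/optimal.rs model; the point is only to show how charging the trailing literal's flag to the match, and shrinking the overheads, tilts the DP away from marginal short matches.

```rust
/// Hypothetical fixed-point cost model: COST_SCALE cost units per bit.
/// Constants mirror the recalibration described above, but the scale value
/// and function names are illustrative.
const COST_SCALE: u64 = 16;
const LITERAL_OVERHEAD: u64 = COST_SCALE; // was 4 * COST_SCALE; flag ≈ 1 bit
const MATCH_OVERHEAD: u64 = 8 * COST_SCALE; // was 16 bits: flag + 2 code bytes + trailing flag
const REPEAT_LENGTH_CODE: u64 = 3 * COST_SCALE; // was 4 bits: skewed distribution

/// Cost of one literal: its entropy-coded bits plus the flag overhead.
fn literal_cost(symbol_bits: u64) -> u64 {
    symbol_bits + LITERAL_OVERHEAD
}

/// Cost of a match. Every LZ77 match produces a trailing literal token in
/// LzSeq, so the trailing literal's flag cost is charged to the match.
fn match_cost(length_extra_bits: u64, offset_bits: u64) -> u64 {
    MATCH_OVERHEAD + length_extra_bits + offset_bits + LITERAL_OVERHEAD
}

fn main() {
    // A short match replacing 3 cheap (5-bit) literals: under this model the
    // literals win, whereas a 4-bit-per-literal overhead would have made the
    // match look profitable.
    let three_literals: u64 = (0..3).map(|_| literal_cost(5 * COST_SCALE)).sum();
    let short_match = match_cost(2 * COST_SCALE, 10 * COST_SCALE);
    assert!(three_literals < short_match);
    let _ = REPEAT_LENGTH_CODE; // repeat matches would use this instead
    println!("literals: {three_literals}, match: {short_match}");
}
```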
2. Auto-tuned rANS scale_bits (src/pipeline/stages.rs)
- Count distinct symbols per stream before encoding
- ≤8 symbols: scale_bits=10 (e.g., flags stream with 2 values)
- ≥64 symbols: scale_bits=13 (e.g., literals with 100+ values)
- Otherwise: DEFAULT_SCALE_BITS (12)
Better precision for high-cardinality streams, less waste for
low-cardinality ones.
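The heuristic above can be sketched as follows; the thresholds follow the commit message, while the function names and exact boundary handling are assumptions about src/pipeline/stages.rs.

```rust
/// Sketch of the cardinality-based scale_bits selection described above.
const DEFAULT_SCALE_BITS: u32 = 12;

fn auto_scale_bits(distinct: usize) -> u32 {
    if distinct <= 8 {
        10 // e.g. a flags stream with 2 distinct values
    } else if distinct >= 64 {
        13 // e.g. a literals stream with 100+ distinct values
    } else {
        DEFAULT_SCALE_BITS
    }
}

/// Count distinct byte values in a stream before encoding it.
fn distinct_symbols(stream: &[u8]) -> usize {
    let mut seen = [false; 256];
    let mut count = 0;
    for &b in stream {
        if !seen[b as usize] {
            seen[b as usize] = true;
            count += 1;
        }
    }
    count
}

fn main() {
    let flags = [0u8, 1, 0, 0, 1];
    assert_eq!(auto_scale_bits(distinct_symbols(&flags)), 10);
    let literals: Vec<u8> = (0..200u8).collect();
    assert_eq!(auto_scale_bits(distinct_symbols(&literals)), 13);
    assert_eq!(auto_scale_bits(20), DEFAULT_SCALE_BITS);
}
```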
Note: default CLI uses lazy matching (ParseStrategy::Auto), not optimal
parsing. The cost model only affects -O/--optimal/--quality modes.
Lazy matching on LzSeq2R achieves 25.7% on small Canterbury (vs gzip
26.1%), suggesting the LzSeq direct encoder + sparse rANS combination
is already very competitive without optimal parsing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LzSeq2 literal-run-length encoding proved worse than LzSeq on text (higher entropy in lit_run_codes vs simple flags). Remove all LzSeq2 code: seq2.rs, LzSeq2Encoder, LzDemuxer::LzSeq2. The LzSeq2R pipeline now uses the LzSeq wire format + sparse rANS (already wired in the previous commit).
Switch SeqConfig defaults from hash3 to hash4 and enable adaptive chain depth. Hash4 gives a 1.3pp better compression ratio at equal or slightly better throughput (fewer wasted chain traversals due to better hash distribution). Adaptive chain depth saves time on incompressible regions.
Canterbury corpus results (single-threaded, lazy):
Before: 33.0% ratio, 13.6 MB/s (hash3, no adaptive)
After: 31.7% ratio, 14.0 MB/s (hash4, adaptive)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
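The hash3 → hash4 change amounts to hashing a 4-byte window instead of 3, so positions that share only a 3-byte prefix no longer collide into the same chain. A minimal sketch of such a hash; the multiplicative constant (a common golden-ratio / Fibonacci-hashing choice) and the table-size shift are assumptions, not necessarily this repo's values.

```rust
/// Illustrative 4-byte match-finder hash of the kind "hash4" refers to.
/// Multiplies the 4-byte window by a golden-ratio constant and keeps the
/// top `table_bits` bits as the chain-table index.
fn hash4(window: &[u8], table_bits: u32) -> u32 {
    debug_assert!(window.len() >= 4 && table_bits <= 32);
    let v = u32::from_le_bytes([window[0], window[1], window[2], window[3]]);
    v.wrapping_mul(0x9E37_79B1) >> (32 - table_bits)
}

fn main() {
    let data = b"abcdabcd";
    let table_bits = 15;
    // Equal 4-byte windows hash to the same bucket...
    assert_eq!(hash4(&data[0..], table_bits), hash4(&data[4..], table_bits));
    // ...and every hash fits in a table of 2^table_bits entries.
    assert!(hash4(&data[1..], table_bits) < (1 << table_bits));
}
```

Hashing one more byte spreads positions over more buckets, which is consistent with the "fewer wasted chain traversals" observation: each chain then contains only candidates that already agree on 4 bytes.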
Summary
Compression ratio on Canterbury corpus (single-threaded, lazy)
Gap to gzip closed from 6.5pp to 2.5pp with no throughput regression.
Throughput (bible.txt, 4MB, single-threaded)
Test plan
./scripts/test.sh --quick

🤖 Generated with Claude Code