Skip to content

feat: LzSeq2R pipeline with sparse rANS, DP refinement, and hash4 default#124

Merged
ChrisLundquist merged 3 commits intomasterfrom
claude/blissful-poincare
Mar 14, 2026
Merged

feat: LzSeq2R pipeline with sparse rANS, DP refinement, and hash4 default#124
ChrisLundquist merged 3 commits intomasterfrom
claude/blissful-poincare

Conversation

@ChrisLundquist
Copy link
Copy Markdown
Owner

Summary

  • New LzSeq2R pipeline (ID 12): LzSeq wire format + sparse rANS frequency tables that only serialize non-zero entries, saving ~400-500 bytes per stream for low-cardinality streams like offset_codes
  • DP forward refinement: after backward DP optimal parse, walk forward with actual RepeatOffsetState and re-evaluate candidates using real repeat state (up to 3 passes). Biggest single ratio win (~3.4pp on LzSeqR)
  • Cost model recalibration: literal_overhead reduced from 4x to 1x COST_SCALE to match LzSeq encoding reality; auto-tune rANS scale_bits based on stream cardinality
  • Hash4 default: switch LzSeq from hash3 to hash4 — better hash distribution means fewer wasted chain traversals, giving 1.3pp better ratio at equal or better throughput
  • Cleanup: removed abandoned LzSeq2 literal-run wire format (proved worse than LzSeq on text)

Compression ratio on Canterbury corpus (single-threaded, lazy)

Pipeline Before After Delta
LzSeq2R 31.7% new
LzSeqR 35.1% 32.0% -3.1pp
gzip -6 29.2% 29.2% baseline

Gap to gzip closed from 6.5pp to 2.5pp with no throughput regression.

Throughput (bible.txt, 4MB, single-threaded)

Mode Throughput Ratio
Lazy (default) ~25 MB/s 31.6%
Optimal (-O) ~2.3 MB/s 32.1%
gzip -6 ~11 MB/s 29.2%

Test plan

  • All 698 tests pass (./scripts/test.sh --quick)
  • Pre-commit hooks pass (fmt, clippy, test)
  • Round-trip correctness verified for LzSeq2R in pipeline tests
  • Sparse rANS unit tests (roundtrip, all-symbols, encode/decode, size comparison)
  • Benchmarked on Canterbury + large corpus files

🤖 Generated with Claude Code

ChrisLundquist and others added 3 commits March 12, 2026 00:20
Three independent compression ratio improvements, composable in a new
Pipeline::LzSeq2R (ID 12):

1. Sparse rANS frequency tables (src/rans/mod.rs)
   - Only serialize non-zero entries: [num_symbols:u8][symbols:N×u8][freqs:N×u16]
   - offset_codes stream: ~20 distinct symbols → 41 bytes vs 512 bytes (92% reduction)
   - Across 6 streams, saves ~2800 bytes/block of freq table overhead
   - Biggest impact on small files (grammar.lsp: 116% → 63%)

2. DP forward refinement pass (src/optimal.rs)
   - Backward DP uses greedy repeat offset estimates (can diverge from optimal)
   - New: after backward DP, walk forward with actual RepeatOffsetState
   - Re-evaluate each position's match candidates with real repeat state
   - Iterate up to 3 passes until stable (typically converges in 1-2)
   - Benefits ALL LzSeq pipelines: LzSeqR improved 35.1% → 31.7% on Canterbury

3. LzSeq2 wire format (src/lzseq/seq2.rs) — experimental, NOT used by default
   - Literal-run-length sequences (zstd-style): 5 streams instead of 6
   - Eliminates flags stream, combines extra bits into single raw stream
   - Result: worse than LzSeq on text (lit_run_codes has more entropy than
     flags), better on binary (E.coli). Kept as code but LzSeq2R uses
     LzSeq format + sparse rANS instead.

Pipeline wiring: LzSeq2R = LzSeq demuxer (6 streams) + sparse rANS entropy.
This gives the best combination: LzSeq's efficient text encoding + sparse
freq table savings. Benchmarks on Canterbury corpus:

  LzSeqR (before):  35.1%  (+6.5pp vs gzip)
  LzSeqR (after):   31.7%  (+3.1pp vs gzip)  — DP refinement helps all pipelines
  LzSeq2R:          31.4%  (+2.8pp vs gzip)  — sparse rANS saves 0.3pp more
  gzip -6:          28.6%

Closes ~57% of the gap to gzip (6.5pp → 2.8pp).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two encoding efficiency improvements:

1. Cost model recalibration (src/optimal.rs)
   - literal_overhead: 4*COST_SCALE → COST_SCALE (was modeled for old
     5-byte LZ77 format; LzSeq literals cost only a flag byte ≈ 1 bit)
   - match_cost: add literal_overhead for trailing literal's flag cost
     (every LZ77 Match produces a trailing literal token in LzSeq)
   - repeat match: length_code_cost 4→3 bits (skewed distribution)
   - match_overhead: 16→8 bits (flag + 2 code bytes + trailing flag)

   Impact: with -O (optimal parsing), LzSeq2R improves 26.7% → 26.0%
   on small Canterbury. The old model overvalued matches by ~3 bits per
   literal avoided, causing the DP to take marginal short matches.

2. Auto-tuned rANS scale_bits (src/pipeline/stages.rs)
   - Count distinct symbols per stream before encoding
   - ≤8 symbols: scale_bits=10 (e.g., flags stream with 2 values)
   - ≥64 symbols: scale_bits=13 (e.g., literals with 100+ values)
   - Otherwise: DEFAULT_SCALE_BITS (12)

   Better precision for high-cardinality streams, less waste for
   low-cardinality ones.

Note: default CLI uses lazy matching (ParseStrategy::Auto), not optimal
parsing. The cost model only affects -O/--optimal/--quality modes.
Lazy matching on LzSeq2R achieves 25.7% on small Canterbury (vs gzip
26.1%), suggesting the LzSeq direct encoder + sparse rANS combination
is already very competitive without optimal parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LzSeq2 literal-run-length encoding proved worse than LzSeq on text
(higher entropy in lit_run_codes vs simple flags). Remove all LzSeq2
code: seq2.rs, LzSeq2Encoder, LzDemuxer::LzSeq2. LzSeq2R pipeline now
uses LzSeq wire format + sparse rANS (already wired in previous commit).

Switch SeqConfig defaults from hash3 to hash4 and enable adaptive chain
depth. Hash4 gives 1.3pp better compression ratio at equal or slightly
better throughput (fewer wasted chain traversals due to better hash
distribution). Adaptive chain saves time on incompressible regions.

Canterbury corpus results (single-threaded, lazy):
  Before: 33.0% ratio, 13.6 MB/s (hash3, no adaptive)
  After:  31.7% ratio, 14.0 MB/s (hash4, adaptive)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit 8be1afa into master Mar 14, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant