103 changes: 47 additions & 56 deletions ARCHITECTURE.md
For day-to-day development instructions, see `CLAUDE.md`.

## Multi-stream entropy coding

All LZ-based pipelines use **multi-stream entropy coding**: the match finder
produces a universal `LzToken` stream (literals + matches), and a pluggable
`TokenEncoder` splits it into independent byte streams with tighter symbol
distributions. Each stream gets its own entropy coder, yielding lower
per-stream entropy than a single combined stream.
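
The gain is easy to demonstrate. Here is a standalone sketch (plain Rust, no crate types) comparing the Shannon entropy of a mixed stream against the average entropy of its split halves, using synthetic length-like and offset-like bytes:

```rust
// Shannon entropy in bits per symbol: H = -sum(p * log2(p)).
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Synthetic data: lengths cluster near small values, offsets in a narrow band.
    let lengths: Vec<u8> = (0..256).map(|i| 3 + (i % 4) as u8).collect();
    let offsets: Vec<u8> = (0..256).map(|i| 16 + (i % 8) as u8).collect();

    let mut combined = lengths.clone();
    combined.extend_from_slice(&offsets);

    let h_combined = shannon_entropy(&combined);
    let h_split = (shannon_entropy(&lengths) + shannon_entropy(&offsets)) / 2.0;
    // The split streams need fewer bits per symbol than the mixed stream.
    assert!(h_split < h_combined);
    println!("combined: {h_combined:.2} bits/sym, split avg: {h_split:.2} bits/sym");
}
```

The same effect drives the real pipelines: each narrow per-stream distribution gets its own entropy table instead of sharing one wide table.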

### Wire encoders (`src/lz_token.rs`)

| Encoder | Streams | Used by | Contents |
|---------|---------|---------|----------|
| `LzSeqEncoder` | 6 | Lzf, LzSeqR, LzSeqH, SortLz | flags, literals, offset_codes, offset_extra, length_codes, length_extra |
| `LzssEncoder` | 4 | Lzfi, LzssR | flags (1-bit per token), literals, offsets (u16 LE), lengths (u16 LE) |

`LzSeqEncoder` uses log2-coded offsets and lengths with repeat offset tracking,
achieving the best ratio (~32% on Canterbury+large). `LzssEncoder` uses raw u16
values with flag bits, trading ratio for simplicity.

### Architecture

```
Input → tokenize() → Vec<LzToken> → TokenEncoder::encode() → EncodedStreams → entropy coding
```

Match finding and parsing (`tokenize()` in `src/pipeline/mod.rs`) are decoupled
from wire encoding. The `tokenize()` entry point handles GPU/CPU dispatch,
SortLz vs HashChain match finding, and parse strategy (greedy/lazy/optimal).
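
A minimal sketch of this split follows, with assumed shapes for `LzToken`, `TokenEncoder`, and `EncodedStreams`; the real definitions live in `src/lz_token.rs` and may differ:

```rust
#[derive(Debug, Clone, Copy)]
enum LzToken {
    Literal(u8),
    Match { offset: u16, len: u16 },
}

/// Named byte streams handed to the entropy stage.
#[derive(Default)]
struct EncodedStreams {
    flags: Vec<u8>,
    literals: Vec<u8>,
    offsets: Vec<u8>,
    lengths: Vec<u8>,
}

/// Pluggable wire encoder: tokens in, independent byte streams out.
trait TokenEncoder {
    fn encode(&self, tokens: &[LzToken]) -> EncodedStreams;
}

/// LZSS-style demuxer: raw u16 LE offsets/lengths plus a 1-bit-per-token flag stream.
struct LzssStyleEncoder;

impl TokenEncoder for LzssStyleEncoder {
    fn encode(&self, tokens: &[LzToken]) -> EncodedStreams {
        let mut out = EncodedStreams::default();
        for (i, tok) in tokens.iter().enumerate() {
            if i % 8 == 0 {
                out.flags.push(0);
            }
            match tok {
                LzToken::Literal(b) => {
                    // MSB-first flag bit: 1 = literal.
                    *out.flags.last_mut().unwrap() |= 1 << (7 - (i % 8));
                    out.literals.push(*b);
                }
                LzToken::Match { offset, len } => {
                    out.offsets.extend_from_slice(&offset.to_le_bytes());
                    out.lengths.extend_from_slice(&len.to_le_bytes());
                }
            }
        }
        out
    }
}

fn main() {
    let tokens = [
        LzToken::Literal(b'a'),
        LzToken::Match { offset: 42, len: 5 },
        LzToken::Literal(b'b'),
    ];
    let s = LzssStyleEncoder.encode(&tokens);
    assert_eq!(s.literals, b"ab");
    assert_eq!(s.offsets, vec![42, 0]); // u16 LE
    assert_eq!(s.flags, vec![0b1010_0000]); // literal, match, literal
}
```

Because the trait only sees tokens, swapping match finders (HashChain vs SortLz) or wire encoders requires no changes on the other side.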

## rANS entropy coder

When used as a `MatchFinder`, SortLZ is transparent to the wire format — the
output is 100% compatible with the host pipeline (Lzf, LzSeqR, LzSeqH). The consumer and decompressor see no difference.

### Pipeline::SortLz wire format v2 (per block)

```
[meta_len: u16 LE]          LzSeq metadata length
[meta: meta_len bytes]      num_tokens + num_matches (u32 each)
[num_streams: u8]           number of streams (6 for LzSeq)
per stream:
  [orig_len: u32 LE]        uncompressed stream length
  [fse_len: u32 LE]         FSE-compressed length
  [fse_data: fse_len bytes]
```
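
A hedged sketch of a reader for a header of this shape; the helper and function names are invented, and only the field layout comes from the listing above:

```rust
/// Read `n` bytes at `*pos`, advancing the cursor; None on underrun.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let s = buf.get(*pos..pos.checked_add(n)?)?;
    *pos += n;
    Some(s)
}

/// Returns (meta bytes, per-stream (orig_len, fse_data) pairs).
fn parse_block(buf: &[u8]) -> Option<(Vec<u8>, Vec<(u32, Vec<u8>)>)> {
    let mut pos = 0;
    let meta_len = u16::from_le_bytes(take(buf, &mut pos, 2)?.try_into().ok()?) as usize;
    let meta = take(buf, &mut pos, meta_len)?.to_vec();
    let num_streams = take(buf, &mut pos, 1)?[0];
    let mut streams = Vec::with_capacity(num_streams as usize);
    for _ in 0..num_streams {
        let orig_len = u32::from_le_bytes(take(buf, &mut pos, 4)?.try_into().ok()?);
        let fse_len = u32::from_le_bytes(take(buf, &mut pos, 4)?.try_into().ok()?) as usize;
        streams.push((orig_len, take(buf, &mut pos, fse_len)?.to_vec()));
    }
    Some((meta, streams))
}

fn main() {
    // Synthetic block: 8-byte meta (num_tokens=3, num_matches=1), one stream.
    let mut buf = vec![];
    buf.extend_from_slice(&8u16.to_le_bytes()); // meta_len
    buf.extend_from_slice(&3u32.to_le_bytes()); // meta: num_tokens
    buf.extend_from_slice(&1u32.to_le_bytes()); // meta: num_matches
    buf.push(1); // num_streams
    buf.extend_from_slice(&4u32.to_le_bytes()); // orig_len
    buf.extend_from_slice(&2u32.to_le_bytes()); // fse_len
    buf.extend_from_slice(&[0xAA, 0xBB]); // fse_data
    let (meta, streams) = parse_block(&buf).unwrap();
    assert_eq!(meta.len(), 8);
    assert_eq!(streams, vec![(4, vec![0xAA, 0xBB])]);
}
```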

SortLz uses `LzSeqEncoder` for wire encoding and FSE for entropy coding. The
six streams are flags, literals, offset_codes, offset_extra, length_codes, and
length_extra. This format is not wire-compatible with any other pipeline.
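
The offset_codes/offset_extra split can be illustrated with a generic log2 coding scheme: the code is the position of the offset's highest set bit, and the extra stream stores the remaining low bits verbatim. The exact bit layout used by `LzSeqEncoder` is an assumption here, not confirmed from the source:

```rust
/// Split an offset into (code, extra, extra_bits).
/// code = floor(log2(offset)); extra = the `code` low bits stored verbatim.
fn offset_code(offset: u32) -> (u8, u32, u8) {
    debug_assert!(offset >= 1);
    let code = 31 - offset.leading_zeros() as u8;
    let extra = offset & ((1u32 << code) - 1);
    (code, extra, code)
}

/// Rebuild the offset: implicit leading 1 bit, then the extra bits.
fn offset_decode(code: u8, extra: u32) -> u32 {
    (1u32 << code) | extra
}

fn main() {
    // Round-trip every small offset.
    for offset in 1..10_000u32 {
        let (code, extra, _bits) = offset_code(offset);
        assert_eq!(offset_decode(code, extra), offset);
    }
    // offset 1000: highest bit is 2^9 = 512, so code 9 and extra 488 (9 bits).
    assert_eq!(offset_code(1000), (9, 488, 9));
}
```

The win is that the code stream has only ~32 distinct symbols with a skewed distribution (cheap to entropy-code), while the extra bits are near-incompressible and can be stored raw.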

### Algorithm

See `docs/design-docs/gpu-strategy.md` for full analysis and `CLAUDE.md` "Known
dead ends" for the complete list of GPU optimization attempts that failed.

## Active architecture: what GPU does / doesn't do

### GPU-accelerated paths (shipping)
- **LZ77 match finding** — GPU hash-table kernel for Lzf/LzSeq pipelines (2x faster at 256KB+)
- **SortLZ match finding** — GPU radix sort + match verification (10.6x faster at 4MB)
- **BWT suffix array** — GPU radix sort with prefix-doubling
- **Interleaved FSE** — GPU-accelerated encode/decode for Lzfi pipeline
- **rANS Recoil decode** — GPU-accelerated parallel rANS decode using split-point metadata

### CPU-only paths (by design)
- **All entropy coding in the streaming CLI** — `streaming::compress_stream` uses CPU rANS/FSE
- **rANS/FSE/Huffman encode/decode** — serial state machines, GPU is 0.54-0.77x CPU speed
- **LZ77 lazy evaluation** — sequential dependency (next match depends on current)
- **LzSeq `encode_with_config`** — repeat offset tracking, adaptive chain depth, hash4 prefix
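
The lazy-evaluation dependency can be made concrete: the decision at position i requires running the match finder at i + 1 first, so iterations cannot be dispatched in parallel. A toy sketch, where the naive `find_match` is a stand-in for the real hash-chain finder:

```rust
/// Toy match finder: longest naive match against any earlier position.
fn find_match(data: &[u8], pos: usize) -> usize {
    let mut best = 0;
    for start in 0..pos {
        let len = data[start..]
            .iter()
            .zip(&data[pos..])
            .take_while(|(a, b)| a == b)
            .count();
        best = best.max(len);
    }
    best
}

/// Returns (position, match_len) pairs; match_len == 0 means a literal.
fn lazy_parse(data: &[u8]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < data.len() {
        let cur = find_match(data, i);
        if cur >= 3 && i + 1 < data.len() {
            // Sequential dependency: must evaluate position i + 1 before
            // deciding what to emit at position i.
            if find_match(data, i + 1) > cur {
                out.push((i, 0)); // defer: emit literal, reconsider at i + 1
                i += 1;
                continue;
            }
        }
        if cur >= 3 {
            out.push((i, cur));
            i += cur;
        } else {
            out.push((i, 0));
            i += 1;
        }
    }
    out
}

fn main() {
    assert_eq!(lazy_parse(b"abcabcabc"), vec![(0, 0), (1, 0), (2, 0), (3, 6)]);
}
```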

### Threshold gates
- `GPU_ENTROPY_THRESHOLD` (256KB) > `DEFAULT_GPU_BLOCK_SIZE` (128KB) — blocks never reach the threshold, so entropy coding never routes to the GPU
- `MIN_GPU_INPUT_SIZE` — minimum block size for GPU dispatch (avoids setup overhead)
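
Under these constants the entropy gate can never fire, which a small sketch makes explicit. The value of `MIN_GPU_INPUT_SIZE` is assumed for illustration; only the two threshold relationships come from the text:

```rust
const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;
const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;
const MIN_GPU_INPUT_SIZE: usize = 64 * 1024; // assumed value

#[derive(Debug, PartialEq)]
enum Backend {
    Cpu,
    Gpu,
}

fn entropy_backend(block_len: usize) -> Backend {
    // Blocks are at most DEFAULT_GPU_BLOCK_SIZE and the threshold is strictly
    // larger, so this branch can never pick the GPU for entropy coding.
    if block_len >= GPU_ENTROPY_THRESHOLD {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}

fn match_finder_backend(input_len: usize) -> Backend {
    // Match finding does go to the GPU once setup overhead is amortized.
    if input_len >= MIN_GPU_INPUT_SIZE {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}

fn main() {
    assert_eq!(entropy_backend(DEFAULT_GPU_BLOCK_SIZE), Backend::Cpu);
    assert_eq!(match_finder_backend(4 * 1024), Backend::Cpu);
    assert_eq!(match_finder_backend(1024 * 1024), Backend::Gpu);
}
```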

## Remaining GPU bottlenecks

1. **GPU BWT still slower than CPU SA-IS** — Radix sort improved 7-14x over bitonic
39 changes: 39 additions & 0 deletions docs/exec-plans/active/TODO-benchmark-lzfi-vs-lzssr.md
# TODO: Benchmark Lzfi vs LzssR — consolidation candidate

## Question

Are both Lzfi and LzssR worth keeping? They use the same demuxer (LZSS, 4
streams) and differ only in entropy coder (interleaved FSE vs rANS).

## Current state

| Property | Lzfi | LzssR |
|----------|------|-------|
| Demuxer | LzssEncoder (4 streams) | LzssEncoder (4 streams) |
| Entropy | Interleaved FSE | rANS |
| Auto-selected | Yes (high entropy + matches) | Never |
| Pipeline ID | 5 | 6 |
| GPU entropy | Yes (interleaved FSE) | Yes (rANS Recoil) |

## Known data

- FSE decode is ~2.2x faster than rANS decode (596 vs 266 MB/s, Criterion)
- FSE encode is comparable to rANS encode (~357 vs 359 MB/s)
- Lzfi is auto-selected when match_density > 0.4 and byte_entropy > 6.0,
  or match_density > 0.2 and byte_entropy > 5.0
- LzssR is only exercised via trial compression or explicit user selection
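
Spelled out as a predicate (function name invented; thresholds from the bullet above):

```rust
/// Lzfi auto-selection rule: high match density with high byte entropy,
/// or moderate match density with moderately high entropy.
fn lzfi_auto_selected(match_density: f64, byte_entropy: f64) -> bool {
    (match_density > 0.4 && byte_entropy > 6.0)
        || (match_density > 0.2 && byte_entropy > 5.0)
}

fn main() {
    assert!(lzfi_auto_selected(0.5, 6.5)); // dense matches, high entropy
    assert!(lzfi_auto_selected(0.25, 5.5)); // moderate on both axes
    assert!(!lzfi_auto_selected(0.1, 7.0)); // too few matches
    assert!(!lzfi_auto_selected(0.5, 4.0)); // entropy too low
}
```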

## Action items

1. Run `./scripts/bench.sh` comparing Lzfi vs LzssR on Canterbury+Silesia corpus
2. Run Criterion benchmarks: `cargo bench -- lzfi lzssr` for per-stage timing
3. If LzssR shows no ratio or throughput advantage over Lzfi, consider removing
it to reduce pipeline surface area (similar to Lzr removal)
4. If rANS interleaved or Recoil decode gives LzssR better GPU decode throughput,
document the use case and keep it

## Files

- `src/pipeline/stages.rs` — stage dispatch for both pipelines
- `src/pipeline/mod.rs` — `auto_select_pipeline`, `select_pipeline_trial`
- `src/pipeline/blocks.rs` — entropy encode/decode dispatch
38 changes: 38 additions & 0 deletions docs/exec-plans/active/TODO-gpu-rans-6stream-bug.md
# TODO: GPU rANS interleaved decode fails with 6-stream LzSeqR

## Problem

GPU rANS interleaved decode works correctly for 4-stream pipelines (LzssR)
but fails for 6-stream pipelines (LzSeqR). CPU rANS interleaved decode
works correctly for both 4 and 6 streams.

## Evidence

- `test_gpu_rans_interleaved_decode_round_trip` originally used `Pipeline::Lzr`
(3 streams). After Lzr removal, switching to `Pipeline::LzSeqR` (6 streams)
caused the test to fail with `InvalidInput`.
- Switching to `Pipeline::LzssR` (4 streams) passes.
- CPU rANS interleaved encode/decode with LzSeqR works fine.
- The rANS encode/decode code in `src/pipeline/stages.rs` is stream-count
agnostic — each stream is encoded/decoded independently.

## Workaround

Test uses `Pipeline::LzssR` (4-stream) instead of `Pipeline::LzSeqR` (6-stream).
See `src/pipeline/tests.rs:test_gpu_rans_interleaved_decode_round_trip`.

## Investigation directions

- LzSeq's `offset_extra` and `length_extra` streams can be very small or empty.
GPU buffer sizing or dispatch dimensions may misbehave with near-zero streams.
- Check if the GPU rANS decode path (`stage_rans_decode_webgpu`) has alignment
assumptions that break with 6 streams.
- Compare the per-stream byte sizes between LzssR (4 streams, all non-trivial)
and LzSeqR (6 streams, some potentially empty) to find the divergence point.
- Test with synthetic 6-stream data where all streams are non-trivially sized.
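
A sketch of the synthetic-stream probe from the last bullet: generate six streams with exactly one empty, then feed each shape to the GPU round trip to isolate the failing index. `synthetic_streams` is a hypothetical helper, not existing code:

```rust
/// Build 6 streams where exactly `empty_idx` is zero-length, mimicking the
/// LzSeq case where offset_extra / length_extra can come out empty.
fn synthetic_streams(empty_idx: usize) -> Vec<Vec<u8>> {
    (0..6)
        .map(|i| {
            if i == empty_idx {
                Vec::new() // the suspect shape: a zero-length stream
            } else {
                (0..4096u32).map(|j| (j % 251) as u8).collect()
            }
        })
        .collect()
}

fn main() {
    for empty_idx in 0..6 {
        let streams = synthetic_streams(empty_idx);
        assert_eq!(streams.len(), 6);
        assert!(streams[empty_idx].is_empty());
        // Feed `streams` to the GPU encode/decode pair and compare against
        // the CPU result to pinpoint which empty-stream index diverges.
    }
}
```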

## Files

- `src/pipeline/stages.rs` — `stage_rans_decode_webgpu`, `stage_rans_encode_with_options`
- `src/pipeline/tests.rs` — `test_gpu_rans_interleaved_decode_round_trip`
- `src/webgpu/rans.rs` — GPU rANS implementation
24 changes: 10 additions & 14 deletions docs/exec-plans/active/index.md
# Active Execution Plans

**Last Updated:** 2026-03-10

## Active Plans

### [PLAN-unified-scheduler-perf-validation.md](PLAN-unified-scheduler-perf-validation.md)
**Status:** In Progress (Phases 0-1 landed; local baseline captured; Phase 2 optimization started) | **Priority:** P0

## Investigation TODOs

### [TODO-gpu-rans-6stream-bug.md](TODO-gpu-rans-6stream-bug.md)
**Status:** Open — GPU rANS interleaved decode fails with 6-stream LzSeqR; works with 4-stream LzssR | **Priority:** P1

### [TODO-benchmark-lzfi-vs-lzssr.md](TODO-benchmark-lzfi-vs-lzssr.md)
**Status:** Open — Benchmark whether LzssR is worth keeping vs Lzfi consolidation | **Priority:** P2

### [TODO-huffman-sync-decode.md](TODO-huffman-sync-decode.md)
**Status:** PARKED — valid approach, zero implementation progress, awaiting LzSeq encoding work | **Priority:** P2

## Completed Plans (in ../completed/)

- `PLAN-p0a-gpu-rans-vertical-slice.md` — GPU chunked rANS vertical slice (CLOSED: structural dead end)
- `PLAN-unified-scheduler-north-star.md` — Unified scheduler north star (PARKED: GPU entropy blocked)
- `PLAN-interleaved-rans.md` — Interleaved rANS (PARKED: Phase A merged, Phase D cancelled)
- `agent-harness-implementation.md` — Agent harness (PARKED: Phase 1 complete, rest deferred)
- `PLAN-gpu-backpressure-impl.md` — GPU ring buffer batching
- `lz77_merge.md` — Cooperative-stitch kernel consolidation
- `upgrade-wgpu-to-27.md` — wgpu 24→27 upgrade