103 changes: 47 additions & 56 deletions ARCHITECTURE.md
For day-to-day development instructions, see `CLAUDE.md`.

## Multi-stream entropy coding

All LZ-based pipelines use **multi-stream entropy coding**: the match finder
produces a universal `LzToken` stream (literals + matches), and a pluggable
`TokenEncoder` splits it into independent byte streams with tighter symbol
distributions. Each stream gets its own entropy coder, yielding lower
per-stream entropy than a single combined stream.
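
The gain is easy to demonstrate. Here is a standalone sketch (plain Rust, no crate types) comparing the Shannon entropy of a mixed stream against the average entropy of its split halves, using synthetic length-like and offset-like bytes:

```rust
// Shannon entropy in bits per symbol: H = -sum(p * log2(p)).
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Synthetic data: lengths cluster near small values, offsets in a narrow band.
    let lengths: Vec<u8> = (0..256).map(|i| 3 + (i % 4) as u8).collect();
    let offsets: Vec<u8> = (0..256).map(|i| 16 + (i % 8) as u8).collect();

    let mut combined = lengths.clone();
    combined.extend_from_slice(&offsets);

    let h_combined = shannon_entropy(&combined);
    let h_split = (shannon_entropy(&lengths) + shannon_entropy(&offsets)) / 2.0;
    // The split streams need fewer bits per symbol than the mixed stream.
    assert!(h_split < h_combined);
    println!("combined: {h_combined:.2} bits/sym, split avg: {h_split:.2} bits/sym");
}
```

The same effect drives the real pipelines: each narrow per-stream distribution gets its own entropy table instead of sharing one wide table.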

### Wire encoders (`src/lz_token.rs`)

| Encoder | Streams | Used by | Contents |
|---------|---------|---------|----------|
| `LzSeqEncoder` | 6 | Lzf, LzSeqR, LzSeqH, SortLz | flags, literals, offset_codes, offset_extra, length_codes, length_extra |
| `LzssEncoder` | 4 | Lzfi, LzssR | flags (1-bit per token), literals, offsets (u16 LE), lengths (u16 LE) |

`LzSeqEncoder` uses log2-coded offsets and lengths with repeat offset tracking,
achieving the best ratio (~32% on Canterbury+large). `LzssEncoder` uses raw u16
values with flag bits, trading ratio for simplicity.

### Architecture

```
Input → tokenize() → Vec<LzToken> → TokenEncoder::encode() → EncodedStreams → entropy coding
```

Match finding and parsing (`tokenize()` in `src/pipeline/mod.rs`) are decoupled
from wire encoding. The `tokenize()` entry point handles GPU/CPU dispatch,
SortLz vs HashChain match finding, and parse strategy (greedy/lazy/optimal).
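
A minimal sketch of this split follows, with assumed shapes for `LzToken`, `TokenEncoder`, and `EncodedStreams`; the real definitions live in `src/lz_token.rs` and may differ:

```rust
#[derive(Debug, Clone, Copy)]
enum LzToken {
    Literal(u8),
    Match { offset: u16, len: u16 },
}

/// Named byte streams handed to the entropy stage.
#[derive(Default)]
struct EncodedStreams {
    flags: Vec<u8>,
    literals: Vec<u8>,
    offsets: Vec<u8>,
    lengths: Vec<u8>,
}

/// Pluggable wire encoder: tokens in, independent byte streams out.
trait TokenEncoder {
    fn encode(&self, tokens: &[LzToken]) -> EncodedStreams;
}

/// LZSS-style demuxer: raw u16 LE offsets/lengths plus a 1-bit-per-token flag stream.
struct LzssStyleEncoder;

impl TokenEncoder for LzssStyleEncoder {
    fn encode(&self, tokens: &[LzToken]) -> EncodedStreams {
        let mut out = EncodedStreams::default();
        for (i, tok) in tokens.iter().enumerate() {
            if i % 8 == 0 {
                out.flags.push(0);
            }
            match tok {
                LzToken::Literal(b) => {
                    // MSB-first flag bit: 1 = literal.
                    *out.flags.last_mut().unwrap() |= 1 << (7 - (i % 8));
                    out.literals.push(*b);
                }
                LzToken::Match { offset, len } => {
                    out.offsets.extend_from_slice(&offset.to_le_bytes());
                    out.lengths.extend_from_slice(&len.to_le_bytes());
                }
            }
        }
        out
    }
}

fn main() {
    let tokens = [
        LzToken::Literal(b'a'),
        LzToken::Match { offset: 42, len: 5 },
        LzToken::Literal(b'b'),
    ];
    let s = LzssStyleEncoder.encode(&tokens);
    assert_eq!(s.literals, b"ab");
    assert_eq!(s.offsets, vec![42, 0]); // u16 LE
    assert_eq!(s.flags, vec![0b1010_0000]); // literal, match, literal
}
```

Because the trait only sees tokens, swapping match finders (HashChain vs SortLz) or wire encoders requires no changes on the other side.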

## rANS entropy coder

When used as a `MatchFinder`, SortLZ is transparent to the wire format — the
output is 100% compatible with the host pipeline (Lzf, LzSeqR, LzSeqH). The consumer and decompressor see no difference.

### Pipeline::SortLz wire format v2 (per block)

```
[meta_len: u16 LE]          LzSeq metadata length
[meta: meta_len bytes]      num_tokens + num_matches (u32 each)
[num_streams: u8]           number of streams (6 for LzSeq)
per stream:
  [orig_len: u32 LE]        uncompressed stream length
  [fse_len: u32 LE]         FSE-compressed length
  [fse_data: fse_len bytes]
```
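
A hedged sketch of a reader for a header of this shape; the helper and function names are invented, and only the field layout comes from the listing above:

```rust
/// Read `n` bytes at `*pos`, advancing the cursor; None on underrun.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let s = buf.get(*pos..pos.checked_add(n)?)?;
    *pos += n;
    Some(s)
}

/// Returns (meta bytes, per-stream (orig_len, fse_data) pairs).
fn parse_block(buf: &[u8]) -> Option<(Vec<u8>, Vec<(u32, Vec<u8>)>)> {
    let mut pos = 0;
    let meta_len = u16::from_le_bytes(take(buf, &mut pos, 2)?.try_into().ok()?) as usize;
    let meta = take(buf, &mut pos, meta_len)?.to_vec();
    let num_streams = take(buf, &mut pos, 1)?[0];
    let mut streams = Vec::with_capacity(num_streams as usize);
    for _ in 0..num_streams {
        let orig_len = u32::from_le_bytes(take(buf, &mut pos, 4)?.try_into().ok()?);
        let fse_len = u32::from_le_bytes(take(buf, &mut pos, 4)?.try_into().ok()?) as usize;
        streams.push((orig_len, take(buf, &mut pos, fse_len)?.to_vec()));
    }
    Some((meta, streams))
}

fn main() {
    // Synthetic block: 8-byte meta (num_tokens=3, num_matches=1), one stream.
    let mut buf = vec![];
    buf.extend_from_slice(&8u16.to_le_bytes()); // meta_len
    buf.extend_from_slice(&3u32.to_le_bytes()); // meta: num_tokens
    buf.extend_from_slice(&1u32.to_le_bytes()); // meta: num_matches
    buf.push(1); // num_streams
    buf.extend_from_slice(&4u32.to_le_bytes()); // orig_len
    buf.extend_from_slice(&2u32.to_le_bytes()); // fse_len
    buf.extend_from_slice(&[0xAA, 0xBB]); // fse_data
    let (meta, streams) = parse_block(&buf).unwrap();
    assert_eq!(meta.len(), 8);
    assert_eq!(streams, vec![(4, vec![0xAA, 0xBB])]);
}
```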

SortLz uses `LzSeqEncoder` for wire encoding and FSE for entropy coding. The
six streams are flags, literals, offset_codes, offset_extra, length_codes, and
length_extra. This format is not wire-compatible with any other pipeline.
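
The offset_codes/offset_extra split can be illustrated with a generic log2 coding scheme: the code is the position of the offset's highest set bit, and the extra stream stores the remaining low bits verbatim. The exact bit layout used by `LzSeqEncoder` is an assumption here, not confirmed from the source:

```rust
/// Split an offset into (code, extra, extra_bits).
/// code = floor(log2(offset)); extra = the `code` low bits stored verbatim.
fn offset_code(offset: u32) -> (u8, u32, u8) {
    debug_assert!(offset >= 1);
    let code = 31 - offset.leading_zeros() as u8;
    let extra = offset & ((1u32 << code) - 1);
    (code, extra, code)
}

/// Rebuild the offset: implicit leading 1 bit, then the extra bits.
fn offset_decode(code: u8, extra: u32) -> u32 {
    (1u32 << code) | extra
}

fn main() {
    // Round-trip every small offset.
    for offset in 1..10_000u32 {
        let (code, extra, _bits) = offset_code(offset);
        assert_eq!(offset_decode(code, extra), offset);
    }
    // offset 1000: highest bit is 2^9 = 512, so code 9 and extra 488 (9 bits).
    assert_eq!(offset_code(1000), (9, 488, 9));
}
```

The win is that the code stream has only ~32 distinct symbols with a skewed distribution (cheap to entropy-code), while the extra bits are near-incompressible and can be stored raw.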

### Algorithm

See `docs/design-docs/gpu-strategy.md` for full analysis and `CLAUDE.md` "Known
dead ends" for the complete list of GPU optimization attempts that failed.

## Active architecture: what GPU does / doesn't do

### GPU-accelerated paths (shipping)
- **LZ77 match finding** — GPU hash-table kernel for Lzf/LzSeq pipelines (2x faster at 256KB+)
- **SortLZ match finding** — GPU radix sort + match verification (10.6x faster at 4MB)
- **BWT suffix array** — GPU radix sort with prefix-doubling
- **Interleaved FSE** — GPU-accelerated encode/decode for Lzfi pipeline
- **rANS Recoil decode** — GPU-accelerated parallel rANS decode using split-point metadata

### CPU-only paths (by design)
- **All entropy coding in the streaming CLI** — `streaming::compress_stream` uses CPU rANS/FSE
- **rANS/FSE/Huffman encode/decode** — serial state machines, GPU is 0.54-0.77x CPU speed
- **LZ77 lazy evaluation** — sequential dependency (next match depends on current)
- **LzSeq `encode_with_config`** — repeat offset tracking, adaptive chain depth, hash4 prefix
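
The lazy-evaluation dependency can be made concrete: the decision at position i requires running the match finder at i + 1 first, so iterations cannot be dispatched in parallel. A toy sketch, where the naive `find_match` is a stand-in for the real hash-chain finder:

```rust
/// Toy match finder: longest naive match against any earlier position.
fn find_match(data: &[u8], pos: usize) -> usize {
    let mut best = 0;
    for start in 0..pos {
        let len = data[start..]
            .iter()
            .zip(&data[pos..])
            .take_while(|(a, b)| a == b)
            .count();
        best = best.max(len);
    }
    best
}

/// Returns (position, match_len) pairs; match_len == 0 means a literal.
fn lazy_parse(data: &[u8]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < data.len() {
        let cur = find_match(data, i);
        if cur >= 3 && i + 1 < data.len() {
            // Sequential dependency: must evaluate position i + 1 before
            // deciding what to emit at position i.
            if find_match(data, i + 1) > cur {
                out.push((i, 0)); // defer: emit literal, reconsider at i + 1
                i += 1;
                continue;
            }
        }
        if cur >= 3 {
            out.push((i, cur));
            i += cur;
        } else {
            out.push((i, 0));
            i += 1;
        }
    }
    out
}

fn main() {
    assert_eq!(lazy_parse(b"abcabcabc"), vec![(0, 0), (1, 0), (2, 0), (3, 6)]);
}
```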

### Threshold gates
- `GPU_ENTROPY_THRESHOLD` (256KB) > `DEFAULT_GPU_BLOCK_SIZE` (128KB) — blocks never reach the threshold, so entropy coding never routes to the GPU
- `MIN_GPU_INPUT_SIZE` — minimum block size for GPU dispatch (avoids setup overhead)
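
Under these constants the entropy gate can never fire, which a small sketch makes explicit. The value of `MIN_GPU_INPUT_SIZE` is assumed for illustration; only the two threshold relationships come from the text:

```rust
const DEFAULT_GPU_BLOCK_SIZE: usize = 128 * 1024;
const GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;
const MIN_GPU_INPUT_SIZE: usize = 64 * 1024; // assumed value

#[derive(Debug, PartialEq)]
enum Backend {
    Cpu,
    Gpu,
}

fn entropy_backend(block_len: usize) -> Backend {
    // Blocks are at most DEFAULT_GPU_BLOCK_SIZE and the threshold is strictly
    // larger, so this branch can never pick the GPU for entropy coding.
    if block_len >= GPU_ENTROPY_THRESHOLD {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}

fn match_finder_backend(input_len: usize) -> Backend {
    // Match finding does go to the GPU once setup overhead is amortized.
    if input_len >= MIN_GPU_INPUT_SIZE {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}

fn main() {
    assert_eq!(entropy_backend(DEFAULT_GPU_BLOCK_SIZE), Backend::Cpu);
    assert_eq!(match_finder_backend(4 * 1024), Backend::Cpu);
    assert_eq!(match_finder_backend(1024 * 1024), Backend::Gpu);
}
```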

## Remaining GPU bottlenecks

1. **GPU BWT still slower than CPU SA-IS** — Radix sort improved 7-14x over bitonic
39 changes: 39 additions & 0 deletions docs/exec-plans/active/TODO-benchmark-lzfi-vs-lzssr.md
# TODO: Benchmark Lzfi vs LzssR — consolidation candidate

## Question

Are both Lzfi and LzssR worth keeping? They use the same demuxer (LZSS, 4
streams) and differ only in entropy coder (interleaved FSE vs rANS).

## Current state

| Property | Lzfi | LzssR |
|----------|------|-------|
| Demuxer | LzssEncoder (4 streams) | LzssEncoder (4 streams) |
| Entropy | Interleaved FSE | rANS |
| Auto-selected | Yes (high entropy + matches) | Never |
| Pipeline ID | 5 | 6 |
| GPU entropy | Yes (interleaved FSE) | Yes (rANS Recoil) |

## Known data

- FSE decode is ~2.2x faster than rANS decode (596 vs 266 MB/s, Criterion)
- FSE encode is comparable to rANS encode (~357 vs 359 MB/s)
- Lzfi is auto-selected when match_density > 0.4 and byte_entropy > 6.0,
  or match_density > 0.2 and byte_entropy > 5.0
- LzssR is only exercised via trial compression or explicit user selection
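
Spelled out as a predicate (function name invented; thresholds from the bullet above):

```rust
/// Lzfi auto-selection rule: high match density with high byte entropy,
/// or moderate match density with moderately high entropy.
fn lzfi_auto_selected(match_density: f64, byte_entropy: f64) -> bool {
    (match_density > 0.4 && byte_entropy > 6.0)
        || (match_density > 0.2 && byte_entropy > 5.0)
}

fn main() {
    assert!(lzfi_auto_selected(0.5, 6.5)); // dense matches, high entropy
    assert!(lzfi_auto_selected(0.25, 5.5)); // moderate on both axes
    assert!(!lzfi_auto_selected(0.1, 7.0)); // too few matches
    assert!(!lzfi_auto_selected(0.5, 4.0)); // entropy too low
}
```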

## Action items

1. Run `./scripts/bench.sh` comparing Lzfi vs LzssR on Canterbury+Silesia corpus
2. Run Criterion benchmarks: `cargo bench -- lzfi lzssr` for per-stage timing
3. If LzssR shows no ratio or throughput advantage over Lzfi, consider removing
it to reduce pipeline surface area (similar to Lzr removal)
4. If rANS interleaved or Recoil decode gives LzssR better GPU decode throughput,
document the use case and keep it

## Files

- `src/pipeline/stages.rs` — stage dispatch for both pipelines
- `src/pipeline/mod.rs` — `auto_select_pipeline`, `select_pipeline_trial`
- `src/pipeline/blocks.rs` — entropy encode/decode dispatch
38 changes: 38 additions & 0 deletions docs/exec-plans/active/TODO-gpu-rans-6stream-bug.md
# TODO: GPU rANS interleaved decode fails with 6-stream LzSeqR

## Problem

GPU rANS interleaved decode works correctly for 4-stream pipelines (LzssR)
but fails for 6-stream pipelines (LzSeqR). CPU rANS interleaved decode
works correctly for both 4 and 6 streams.

## Evidence

- `test_gpu_rans_interleaved_decode_round_trip` originally used `Pipeline::Lzr`
(3 streams). After Lzr removal, switching to `Pipeline::LzSeqR` (6 streams)
caused the test to fail with `InvalidInput`.
- Switching to `Pipeline::LzssR` (4 streams) passes.
- CPU rANS interleaved encode/decode with LzSeqR works fine.
- The rANS encode/decode code in `src/pipeline/stages.rs` is stream-count
agnostic — each stream is encoded/decoded independently.

## Workaround

Test uses `Pipeline::LzssR` (4-stream) instead of `Pipeline::LzSeqR` (6-stream).
See `src/pipeline/tests.rs:test_gpu_rans_interleaved_decode_round_trip`.

## Investigation directions

- LzSeq's `offset_extra` and `length_extra` streams can be very small or empty.
GPU buffer sizing or dispatch dimensions may misbehave with near-zero streams.
- Check if the GPU rANS decode path (`stage_rans_decode_webgpu`) has alignment
assumptions that break with 6 streams.
- Compare the per-stream byte sizes between LzssR (4 streams, all non-trivial)
and LzSeqR (6 streams, some potentially empty) to find the divergence point.
- Test with synthetic 6-stream data where all streams are non-trivially sized.
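
A sketch of the synthetic-stream probe from the last bullet: generate six streams with exactly one empty, then feed each shape to the GPU round trip to isolate the failing index. `synthetic_streams` is a hypothetical helper, not existing code:

```rust
/// Build 6 streams where exactly `empty_idx` is zero-length, mimicking the
/// LzSeq case where offset_extra / length_extra can come out empty.
fn synthetic_streams(empty_idx: usize) -> Vec<Vec<u8>> {
    (0..6)
        .map(|i| {
            if i == empty_idx {
                Vec::new() // the suspect shape: a zero-length stream
            } else {
                (0..4096u32).map(|j| (j % 251) as u8).collect()
            }
        })
        .collect()
}

fn main() {
    for empty_idx in 0..6 {
        let streams = synthetic_streams(empty_idx);
        assert_eq!(streams.len(), 6);
        assert!(streams[empty_idx].is_empty());
        // Feed `streams` to the GPU encode/decode pair and compare against
        // the CPU result to pinpoint which empty-stream index diverges.
    }
}
```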

## Files

- `src/pipeline/stages.rs` — `stage_rans_decode_webgpu`, `stage_rans_encode_with_options`
- `src/pipeline/tests.rs` — `test_gpu_rans_interleaved_decode_round_trip`
- `src/webgpu/rans.rs` — GPU rANS implementation
24 changes: 10 additions & 14 deletions docs/exec-plans/active/index.md
# Active Execution Plans

**Last Updated:** 2026-03-10

## Active Plans

### [PLAN-unified-scheduler-perf-validation.md](PLAN-unified-scheduler-perf-validation.md)
**Status:** In Progress (Phases 0-1 landed; local baseline captured; Phase 2 optimization started) | **Priority:** P0

## Investigation TODOs

### [TODO-gpu-rans-6stream-bug.md](TODO-gpu-rans-6stream-bug.md)
**Status:** Open — GPU rANS interleaved decode fails with 6-stream LzSeqR; works with 4-stream LzssR | **Priority:** P1

### [TODO-benchmark-lzfi-vs-lzssr.md](TODO-benchmark-lzfi-vs-lzssr.md)
**Status:** Open — Benchmark whether LzssR is worth keeping vs Lzfi consolidation | **Priority:** P2

### [TODO-huffman-sync-decode.md](TODO-huffman-sync-decode.md)
**Status:** PARKED — valid approach, zero implementation progress, awaiting LzSeq encoding work | **Priority:** P2

## Completed Plans (in ../completed/)

- `PLAN-p0a-gpu-rans-vertical-slice.md` — GPU chunked rANS vertical slice (CLOSED: structural dead end)
- `PLAN-unified-scheduler-north-star.md` — Unified scheduler north star (PARKED: GPU entropy blocked)
- `PLAN-interleaved-rans.md` — Interleaved rANS (PARKED: Phase A merged, Phase D cancelled)
- `agent-harness-implementation.md` — Agent harness (PARKED: Phase 1 complete, rest deferred)
- `PLAN-gpu-backpressure-impl.md` — GPU ring buffer batching
- `lz77_merge.md` — Cooperative-stitch kernel consolidation
- `upgrade-wgpu-to-27.md` — wgpu 24→27 upgrade