Merged
59 changes: 43 additions & 16 deletions docs/exec-plans/active/TODO-benchmark-lzfi-vs-lzssr.md
@@ -15,22 +15,49 @@ streams) and differ only in entropy coder (interleaved FSE vs rANS).
| Pipeline ID | 5 | 6 |
| GPU entropy | Yes (interleaved FSE) | Yes (rANS Recoil) |

## Known data

- FSE decode is ~2.2x faster than rANS decode (596 vs 266 MB/s, Criterion)
- FSE encode is comparable to rANS encode (~357 vs 359 MB/s)
- Lzfi auto-selected when: match_density > 0.4 and byte_entropy > 6.0,
  or match_density > 0.2 and byte_entropy > 5.0
- LzssR is only exercised via trial compression or explicit user selection
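The auto-selection heuristic above can be expressed as a simple predicate. A minimal sketch, assuming a standalone function — the function name and integration point are hypothetical; the real thresholds live in the pipeline selection code:

```rust
/// Hypothetical sketch of the Lzfi auto-selection heuristic described
/// above. Function and parameter names are assumptions, not the actual API.
fn auto_selects_lzfi(match_density: f64, byte_entropy: f64) -> bool {
    // High match density with high entropy, or moderate density with
    // moderately high entropy, both favor the FSE-interleaved pipeline.
    (match_density > 0.4 && byte_entropy > 6.0)
        || (match_density > 0.2 && byte_entropy > 5.0)
}
```

Because LzssR never appears in a predicate like this, it is only reachable through trial compression or explicit user selection.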

## Action items

1. Run `./scripts/bench.sh` comparing Lzfi vs LzssR on Canterbury+Silesia corpus
2. Run Criterion benchmarks: `cargo bench -- lzfi lzssr` for per-stage timing
3. If LzssR shows no ratio or throughput advantage over Lzfi, consider removing
it to reduce pipeline surface area (similar to Lzr removal)
4. If rANS interleaved or Recoil decode gives LzssR better GPU decode throughput,
document the use case and keep it
## Criterion benchmark data (2026-03-10)

### Entropy throughput (CPU-only, Canterbury 64KB)

| Coder | Encode | Decode |
|-------|--------|--------|
| FSE | 238-302 MB/s | 412-533 MB/s |
| rANS (basic) | 326-446 MB/s | 262-316 MB/s |
| rANS (chunked) | 279-486 MB/s | 453 MB/s – 1.06 GB/s |

FSE decode is ~2x faster than rANS basic decode. rANS chunked decode is
competitive but requires the chunked wire format.

### Full pipeline throughput (Canterbury corpus, 25 MB)

| Pipeline | Compress | Decompress |
|----------|----------|------------|
| Lzfi | **543 MB/s** | **1.23 GB/s** |
| LzSeqR | 333 MB/s | 1.22 GB/s |
| Lzf | 295 MB/s | 1.02 GB/s |

Lzfi dominates compress speed (63% faster than LzSeqR). Decompress is
effectively tied across all LZ pipelines.

### Recommendation

LzssR has no measurable advantage over Lzfi:
- Same demuxer (LZSS, 4 streams)
- rANS decode is slower than FSE decode (262 vs 533 MB/s)
- rANS encode is faster but Lzfi pipeline throughput is still higher
- LzssR is never auto-selected
- GPU Recoil decode is interesting but GPU entropy is known to be slower
than CPU (0.54-0.77x), so the GPU rANS advantage doesn't materialize

**Action:** LzssR is a candidate for removal, similar to the Lzr removal.
Keep only if a concrete use case for GPU Recoil decode emerges.

## Remaining action items

1. ~~Run Criterion benchmarks~~ Done (see above)
2. Run `./scripts/bench.sh` for compression ratio comparison on Canterbury+Silesia
3. Decide: remove LzssR or document its niche use case
4. If removing: retire Pipeline ID 6, update wire-formats.md, add to retired IDs

## Files

69 changes: 44 additions & 25 deletions docs/exec-plans/active/TODO-gpu-rans-6stream-bug.md
@@ -1,38 +1,57 @@
# TODO: GPU rANS interleaved decode fails with 6-stream LzSeqR
# RESOLVED: GPU rANS encode routing bug for LzSeqR

**Status:** Fixed (2026-03-10)

## Problem

GPU rANS interleaved decode works correctly for 4-stream pipelines (LzssR)
but fails for 6-stream pipelines (LzSeqR). CPU rANS interleaved decode
works correctly for both 4 and 6 streams.
LzSeqR (6-stream) round-trip failed with `InvalidInput` when compressed with
multi-threaded GPU backend. The bug was mischaracterized as "GPU rANS
interleaved decode fails with 6-stream LzSeqR" — the actual root cause was
on the encode side.

## Root Cause

**Routing inconsistency between single-block and parallel compress paths:**

- `entropy_encode` (blocks.rs:134) — used by single-block/single-thread paths:
LzSeqR always uses `stage_rans_encode_with_options` (CPU rANS)

- `run_compress_stage` (stages.rs:911) — used by the parallel scheduler:
LzSeqR routed to `stage_rans_encode_webgpu` when WebGpu backend was active

`stage_rans_encode_webgpu` uses `rans_encode_chunked_payload_gpu_batched`
which produces a **chunked payload format** incompatible with the standard
`rans::decode_interleaved` decoder. The data would encode successfully but
no decoder (CPU or GPU) could decode it.

LzssR didn't have this bug because `run_compress_stage` for LzssR (line 909)
always used `stage_rans_encode_with_options`.

## Evidence

- `test_gpu_rans_interleaved_decode_round_trip` originally used `Pipeline::Lzr`
(3 streams). After Lzr removal, switching to `Pipeline::LzSeqR` (6 streams)
caused the test to fail with `InvalidInput`.
- Switching to `Pipeline::LzssR` (4 streams) passes.
- CPU rANS interleaved encode/decode with LzSeqR works fine.
- The rANS encode/decode code in `src/pipeline/stages.rs` is stream-count
agnostic — each stream is encoded/decoded independently.
Diagnostic results (192KB input, 64KB block size):
```
size=65536: OK (1 block, single-block fast path)
size=131072: FAIL (2 blocks, parallel path → stage_rans_encode_webgpu)
192KB threads=1: OK (sequential path → stage_rans_encode_with_options)
192KB threads=2: FAIL (parallel path → stage_rans_encode_webgpu)
```
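The pass/fail pattern in those diagnostics follows directly from which encode path the scheduler selects: only multi-block, multi-thread runs reach the parallel scheduler that routed LzSeqR entropy to `stage_rans_encode_webgpu`. A minimal sketch of that routing decision (hypothetical function — the real logic lives in the block scheduler):

```rust
/// Hypothetical sketch of the path selection implied by the diagnostics:
/// the parallel path (which routed LzSeqR entropy to the GPU chunked
/// encoder) is only taken with more than one block AND more than one thread.
fn takes_parallel_gpu_path(input_len: usize, block_size: usize, threads: usize) -> bool {
    // Ceiling division: number of blocks the input splits into.
    let blocks = (input_len + block_size - 1) / block_size;
    blocks > 1 && threads > 1
}
```

With a 64KB block size this reproduces the table: 64KB input or `threads=1` stays on the working sequential path; 128KB+ with `threads=2` hits the broken GPU routing.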

## Workaround
## Fix

Test uses `Pipeline::LzssR` (4-stream) instead of `Pipeline::LzSeqR` (6-stream).
See `src/pipeline/tests.rs:test_gpu_rans_interleaved_decode_round_trip`.
Changed `run_compress_stage` for `(Pipeline::LzSeqR, 1)` to always use
`stage_rans_encode_with_options`, matching the `entropy_encode` path.

## Investigation directions
## Remaining TODO

- LzSeq's `offset_extra` and `length_extra` streams can be very small or empty.
GPU buffer sizing or dispatch dimensions may misbehave with near-zero streams.
- Check if the GPU rANS decode path (`stage_rans_decode_webgpu`) has alignment
assumptions that break with 6 streams.
- Compare the per-stream byte sizes between LzssR (4 streams, all non-trivial)
and LzSeqR (6 streams, some potentially empty) to find the divergence point.
- Test with synthetic 6-stream data where all streams are non-trivially sized.
The GPU rANS encode path (`stage_rans_encode_webgpu`) produces a chunked
payload wire format that the standard rANS decoder doesn't understand.
If GPU rANS encode is ever re-enabled for LzSeqR, the chunked decode path
must be wired into `stage_rans_decode_webgpu`. However, since GPU rANS
entropy is known to be slower than CPU (0.54-0.77x), this is low priority.

## Files

- `src/pipeline/stages.rs` — `stage_rans_decode_webgpu`, `stage_rans_encode_with_options`
- `src/pipeline/tests.rs` — `test_gpu_rans_interleaved_decode_round_trip`
- `src/webgpu/rans.rs` — GPU rANS implementation
- `src/pipeline/stages.rs:911` — fix: removed GPU routing for LzSeqR stage 1
- `src/pipeline/blocks.rs:134` — reference: entropy_encode always uses CPU
- `src/pipeline/tests.rs` — new test: `test_gpu_rans_interleaved_decode_lzseqr_6stream`
4 changes: 2 additions & 2 deletions docs/exec-plans/active/index.md
@@ -13,10 +13,10 @@
## Investigation TODOs

### [TODO-gpu-rans-6stream-bug.md](TODO-gpu-rans-6stream-bug.md)
**Status:** Open — GPU rANS interleaved decode fails with 6-stream LzSeqR; works with 4-stream LzssR | **Priority:** P1
**Status:** RESOLVED — Was routing bug: parallel path sent LzSeqR entropy to GPU chunked encoder incompatible with standard decoder. Fixed by routing to CPU rANS.

### [TODO-benchmark-lzfi-vs-lzssr.md](TODO-benchmark-lzfi-vs-lzssr.md)
**Status:** Open — Benchmark whether LzssR is worth keeping vs Lzfi consolidation | **Priority:** P2
**Status:** In Progress — Criterion benchmarks done; Lzfi dominates (543 vs 333 MB/s compress). LzssR removal candidate. | **Priority:** P2

### [TODO-huffman-sync-decode.md](TODO-huffman-sync-decode.md)
**Status:** PARKED — valid approach, zero implementation progress, awaiting LzSeq encoding work | **Priority:** P2
101 changes: 5 additions & 96 deletions src/pipeline/stages.rs
@@ -478,91 +478,6 @@ pub(crate) fn stage_rans_decode_webgpu(
Ok(block)
}

/// batched GPU dispatch with ring-buffered submit/readback overlap.
///
/// Per-stream framing: [orig_len: u32] [compressed_len: u32 | flags] [rans_data]
///
/// Streams below `rans_interleaved_min_bytes` fall back to CPU basic rANS.
/// The output wire format is identical to [`stage_rans_encode_with_options()`],
/// so the same decoder works for both CPU and GPU encoded data.
#[cfg(feature = "webgpu")]
pub(crate) fn stage_rans_encode_webgpu(
mut block: StageBlock,
engine: &crate::webgpu::WebGpuEngine,
options: &CompressOptions,
) -> PzResult<StageBlock> {
let streams = block.streams.take().ok_or(PzError::InvalidInput)?;
let pre_entropy_len = block
.metadata
.pre_entropy_len
.ok_or(PzError::InvalidInput)?;
let meta = &block.metadata.demux_meta;

// Phase 1: batch-encode all GPU-eligible streams in one call.
// The batched API uses a ring buffer internally to overlap GPU
// compute with readback across streams.
let min_bytes = options.rans_interleaved_min_bytes;
let mut gpu_inputs: Vec<&[u8]> = Vec::new();
let mut gpu_indices: Vec<usize> = Vec::new();
for (i, stream) in streams.iter().enumerate() {
if stream.len() >= min_bytes {
gpu_inputs.push(stream);
gpu_indices.push(i);
}
}

let batch_results = if !gpu_inputs.is_empty() {
engine.rans_encode_chunked_payload_gpu_batched(
&gpu_inputs,
options.rans_interleaved_states,
rans::DEFAULT_SCALE_BITS,
256,
)?
} else {
Vec::new()
};

// Index batch results by original stream position.
let mut gpu_results: Vec<Option<Vec<u8>>> = vec![None; streams.len()];
for ((data, _used_chunked), &stream_idx) in batch_results.into_iter().zip(&gpu_indices) {
gpu_results[stream_idx] = Some(data);
}

// Phase 2: assemble the multi-stream container.
let mut output = Vec::new();
output.push(streams.len() as u8);
output.extend_from_slice(&(pre_entropy_len as u32).to_le_bytes());
output.extend_from_slice(&(meta.len() as u16).to_le_bytes());
output.extend_from_slice(meta);

for (i, stream) in streams.iter().enumerate() {
// GPU path: always interleaved (engine uses encode_interleaved_n
// even when chunked encoding is not possible).
// CPU path: basic rANS for small streams.
let (rans_data, is_interleaved) = if let Some(data) = gpu_results[i].take() {
(data, true)
} else {
(rans::encode(stream), false)
};

if rans_data.len() >= (1usize << 31) {
return Err(PzError::InvalidInput);
}

let flagged_len = if is_interleaved {
(rans_data.len() as u32) | RANS_INTERLEAVED_FLAG
} else {
rans_data.len() as u32
};
output.extend_from_slice(&(stream.len() as u32).to_le_bytes());
output.extend_from_slice(&flagged_len.to_le_bytes());
output.extend_from_slice(&rans_data);
}

block.data = output;
Ok(block)
}

// ---------------------------------------------------------------------------
// Entropy stage functions — FSE (multi-stream, LZ-based pipelines)
// ---------------------------------------------------------------------------
@@ -908,17 +823,11 @@ pub(crate) fn run_compress_stage(
(Pipeline::LzssR, 0) => stage_demux_compress(block, &LzDemuxer::Lzss, options),
(Pipeline::LzssR, 1) => stage_rans_encode_with_options(block, options),
(Pipeline::LzSeqR, 0) => stage_demux_compress(block, &LzDemuxer::LzSeq, options),
(Pipeline::LzSeqR, 1) => {
#[cfg(feature = "webgpu")]
{
if let super::Backend::WebGpu = options.backend {
if let Some(ref engine) = options.webgpu_engine {
return stage_rans_encode_webgpu(block, engine, options);
}
}
}
stage_rans_encode_with_options(block, options)
}
// GPU rANS encode (stage_rans_encode_webgpu) uses chunked-payload format
// which is incompatible with the standard decode_interleaved decoder.
// Use CPU rANS encode to match the entropy_encode path in blocks.rs.
// TODO: fix GPU rANS encode wire format compatibility, then re-enable.
(Pipeline::LzSeqR, 1) => stage_rans_encode_with_options(block, options),
(Pipeline::LzSeqH, 0) => stage_demux_compress(block, &LzDemuxer::LzSeq, options),
(Pipeline::LzSeqH, 1) => stage_huffman_encode(block),
(Pipeline::SortLz, 0) => stage_sortlz_compress(block),
28 changes: 28 additions & 0 deletions src/pipeline/tests.rs
@@ -832,6 +832,34 @@ mod gpu_batched_tests {
assert_eq!(decompressed, input);
}

#[test]
fn test_gpu_rans_interleaved_decode_lzseqr_6stream() {
let mut opts = match make_webgpu_options() {
Some(o) => o,
None => return,
};
opts.rans_interleaved = true;
opts.rans_interleaved_min_bytes = 0;
opts.rans_interleaved_states = 4;

let input: Vec<u8> = (0..192 * 1024)
.map(|i| ((i * 13 + 97) % 251) as u8)
.collect();
// LzSeqR (6 streams) with GPU compress + CPU decode should round-trip.
// Previously failed because the parallel path routed LzSeqR entropy
// to stage_rans_encode_webgpu (chunked format) which was incompatible
// with the standard rANS decoder.
let compressed = compress_with_options(&input, Pipeline::LzSeqR, &opts).unwrap();

let dec_opts = DecompressOptions {
backend: Backend::WebGpu,
webgpu_engine: opts.webgpu_engine.clone(),
threads: 0,
};
let decompressed = decompress_with_options(&compressed, &dec_opts).unwrap();
assert_eq!(decompressed, input);
}

#[test]
fn test_gpu_batched_lzfi_round_trip() {
let opts = match make_webgpu_options() {