Skip to content

feat: wire GPU LZ77 match-finding into streaming compressor#121

Merged
ChrisLundquist merged 1 commit intomasterfrom
claude/determined-mcnulty
Mar 11, 2026
Merged

feat: wire GPU LZ77 match-finding into streaming compressor#121
ChrisLundquist merged 1 commit intomasterfrom
claude/determined-mcnulty

Conversation

@ChrisLundquist
Copy link
Copy Markdown
Owner

Summary

  • Wires the GPU LZ77 match-finding coordinator into the streaming compressor (compress_stream_parallel), matching the architecture from parallel.rs
  • Adds adaptive backpressure (shared AtomicUsize: +2 on channel Full, -1 on Ok) so workers stop trying GPU after a few Full signals — prevents slow GPU hardware from becoming a bottleneck
  • Fixes critical bug where "CPU fallback" workers still routed through GPU via compress_and_demuxlzseq_encode_gpu(), causing all 196 blocks to contend for GPU (25s → 0.9s)
  • Gates GPU coordinator to LZ-demux pipelines only (Pipeline::uses_lz_demux()) — BWT/SortLz handle their own GPU paths
  • Adds compress_block_from_demux for entropy-only encoding of pre-computed GPU match results

Key results (LzSeqR, silesia/mozilla, 4 threads)

Path Wall time
Streaming GPU (before fix) 25.4s
Streaming GPU (after fix) 0.94s
Streaming CPU 0.98s

Test plan

  • cargo clippy --all-targets -- -D warnings — clean
  • cargo test — all tests pass
  • pz -c -p lzseqr -g -t 4 samples/silesia/mozilla > /dev/null — should be ~1s
  • Round-trip: pz -c -p lzseqr -g -t 4 mozilla | pz -d > /tmp/rt && diff mozilla /tmp/rt

🤖 Generated with Claude Code

Merge the GPU coordinator into compress_stream_parallel using
try_send + CPU fallback with adaptive backpressure, matching the
in-memory scheduler's pattern from parallel.rs.

Key changes:
- GPU coordinator thread batches blocks for find_matches_batched,
  demuxes matches, and entropy-encodes via compress_block_from_demux
- Workers use CPU-only options (backend: Cpu, webgpu_engine: None)
  to prevent accidental GPU routing through compress_and_demux
- Adaptive backpressure (AtomicUsize: +2 on Full, -1 on Ok) limits
  GPU blocks to an initial burst, then routes everything to CPU
- GPU coordinator only spawns for LZ-demux pipelines (Pipeline::uses_lz_demux);
  BWT/SortLz pass through their own GPU paths in compress_block
- Mark two slow optimal-parse tests as #[ignore] (>60s in debug)

Before: pz -c -p lzseqr -g -t4 mozilla took 25.9s (workers
accidentally routed all blocks through GPU via compress_and_demux).
After: 0.94s — on par with CPU-only and in-memory GPU paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit d33a953 into master Mar 11, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant