docs: document GPU dead ends to prevent agent rework by ChrisLundquist · Pull Request #114 · ChrisLundquist/libpz

ChrisLundquist · 2026-03-09T06:39:10Z

Summary

- Add "Known dead ends" section to CLAUDE.md documenting 12 pitfalls that agents have repeatedly wasted full sessions rediscovering (GPU entropy, hash tables, SSE2 rANS, parallel parsing, etc.)
- Add warning comments to gpu_fused_span() in parallel.rs and fix misleading GPU_ENTROPY_THRESHOLD comment in mod.rs
- Add GPU pipeline multi-block correctness and pipelining regression tests
- Fix CRLF line endings in scripts/fetch-silesia.sh

Context

Multiple agents have spent full sessions attempting to optimize GPU entropy encoding before discovering it's fundamentally slower than CPU. The historian found 10+ major dead ends across 420 commits. This PR places warnings at all the discovery points so future agents hit the documentation before burning context.

Test plan

- ./scripts/test.sh --quick passes (fmt, clippy, all tests)
- Pre-commit hooks pass on all 4 commits
- New tests exercise GPU pipeline multi-block correctness and batched LZ77 pipelining

🤖 Generated with Claude Code

Multiple agents have spent full sessions attempting to optimize GPU entropy (rANS/FSE) encoding before discovering it's fundamentally slower than CPU (0.77x encode, 0.54x decode). Add warnings at all three discovery points:

- CLAUDE.md: new "Known dead ends" section (read first by all agents)
- pipeline/mod.rs: GPU_ENTROPY_THRESHOLD comment corrected from "win" to "gate that prevents routing" with regression warning
- pipeline/parallel.rs: gpu_fused_span() warning explaining why the fused path exists but must not be activated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
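As an illustration of what a warning at such a discovery point could look like, here is a minimal sketch. It is not the code in pipeline/mod.rs: the constant's value and unit, the `EntropyBackend` enum, and `select_entropy_backend` are all hypothetical and exist only to show a threshold documented as a gate rather than as a crossover where the GPU wins.

```rust
/// HYPOTHETICAL sketch; the real constant's value and unit are not shown in this PR.
///
/// WARNING (known dead end): GPU entropy coding (rANS/FSE) was measured at
/// 0.77x encode and 0.54x decode relative to CPU. This threshold is a gate
/// that prevents routing entropy work to the GPU. It is NOT a size above
/// which the GPU wins. Do not lower it without new benchmark evidence.
pub const GPU_ENTROPY_THRESHOLD: usize = usize::MAX; // placeholder value (assumption)

/// Hypothetical backend choice used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntropyBackend {
    Cpu,
    Gpu,
}

/// Route entropy coding for a block of `block_len` bytes.
/// With the placeholder threshold above, every block smaller than
/// `usize::MAX` bytes stays on the CPU.
pub fn select_entropy_backend(block_len: usize) -> EntropyBackend {
    if block_len >= GPU_ENTROPY_THRESHOLD {
        EntropyBackend::Gpu
    } else {
        EntropyBackend::Cpu
    }
}
```

Keeping the warning in the doc comment on the constant means anyone who finds the threshold while looking for a tuning knob reads the benchmark result first.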

- test_gpu_pipeline_multiblock_correctness: validates full GPU coordinator pipeline with 8 distinct-pattern blocks across Deflate, Lzr, and LzSeqR pipelines (guards against ring slot cross-contamination)
- test_lz77_batched_pipeline_correctness: validates batched GPU LZ77 match-finding across 8 blocks with cross-validation against serial
- test_lz77_batched_pipeline_not_slower_than_serial: timing guard ensuring batched path doesn't regress vs serial execution
- fix CRLF line endings in scripts/fetch-silesia.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
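The cross-validation idea behind these tests can be sketched without libpz's API. In the hypothetical example below, run-length coding stands in for the real codec, scoped std threads stand in for the batched GPU path, and every function name is invented for illustration. The point is the shape of the check: give each block a distinct pattern, compare batched output against serial output, and confirm each block decodes to its own input.

```rust
use std::thread;

/// Toy stand-in codec (run-length encoding). NOT libpz's API.
fn rle_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        // Extend the run while the byte repeats, capped at 255 per pair.
        while i + run < input.len() && input[i + run] == byte && run < 255 {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

fn rle_decode(encoded: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in encoded.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    out
}

/// Build blocks whose contents differ per block, so a result that lands
/// in the wrong slot is detectable. Distinct for up to 8 blocks here.
fn distinct_blocks(count: usize, len: usize) -> Vec<Vec<u8>> {
    (0..count)
        .map(|b| {
            let offset = (b as u8).wrapping_mul(31);
            (0..len)
                .map(|i| ((i / (b + 1)) as u8).wrapping_add(offset))
                .collect()
        })
        .collect()
}

/// Reference path: one block at a time.
fn encode_serial(blocks: &[Vec<u8>]) -> Vec<Vec<u8>> {
    blocks.iter().map(|b| rle_encode(b)).collect()
}

/// Stand-in for a batched path: all blocks in flight at once,
/// results collected in submission order.
fn encode_batched(blocks: &[Vec<u8>]) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = blocks
            .iter()
            .map(|b| s.spawn(move || rle_encode(b)))
            .collect();
        let results: Vec<Vec<u8>> = handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect();
        results
    })
}

/// True when batched output matches serial output and every block
/// round-trips to its own original contents.
fn check_multiblock(count: usize, len: usize) -> bool {
    let blocks = distinct_blocks(count, len);
    let serial = encode_serial(&blocks);
    let batched = encode_batched(&blocks);
    serial == batched
        && batched
            .iter()
            .zip(&blocks)
            .all(|(enc, orig)| rle_decode(enc) == *orig)
}
```

The timing guard is left out of the sketch on purpose: wall-clock assertions depend on the hardware they run on, so any threshold chosen here would be a guess.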

Expand the "Known dead ends" section with three more common pitfalls:

- GPU device init time skews bench.sh vs Criterion throughput numbers
- Compression ratio bottleneck is 5-byte match encoding, not matcher
- GPU Huffman is architecturally incompatible (needs byte alignment)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Historian research uncovered additional costly dead ends that agents have repeatedly fallen into. Adding the highest-impact ones:

- GPU hash tables for LZ matching (atomics lose insertion order)
- SSE2 rANS decode (32% slower than scalar due to serializing extracts)
- Fully parallel GPU LZ parsing (37.6% compression gap)
- Iterative GPU algorithms (quadratic host overhead from per-round sync)
- Window-capped suffix sorts (break BWT invertibility, 433% expansion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
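The BWT item in the list above can be made concrete with a toy model. This sketch is an assumption-laden illustration, not libpz's implementation: it uses a naive suffix sort with a 0-byte sentinel (assumed absent from the input), and models a window cap as comparing only the first `w` bytes of each suffix. How libpz actually capped its sort is not stated in this PR, and the 433% expansion figure is not reproduced here.

```rust
/// Sentinel appended to the input; assumed not to occur in the data.
const SENTINEL: u8 = 0;

/// Forward BWT over the suffixes of `input + SENTINEL`.
/// `window = None` compares full suffixes; `Some(w)` compares only the
/// first `w` bytes, leaving longer ties in index order (stable sort).
fn bwt(input: &[u8], window: Option<usize>) -> Vec<u8> {
    let mut text = input.to_vec();
    text.push(SENTINEL);
    let n = text.len();
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| {
        let (sa, sb) = (&text[a..], &text[b..]);
        match window {
            Some(w) => sa[..w.min(sa.len())].cmp(&sb[..w.min(sb.len())]),
            None => sa.cmp(sb),
        }
    });
    // Last column: the byte preceding each sorted suffix (cyclically).
    order.iter().map(|&i| text[(i + n - 1) % n]).collect()
}

/// Standard LF-mapping inverse. Correct only if `last` came from a
/// fully sorted suffix order.
fn inverse_bwt(last: &[u8]) -> Vec<u8> {
    let n = last.len();
    // starts[c] = first row of byte c in the sorted first column.
    let mut starts = [0usize; 256];
    for &b in last {
        starts[b as usize] += 1;
    }
    let mut sum = 0usize;
    for s in starts.iter_mut() {
        let count = *s;
        *s = sum;
        sum += count;
    }
    // lf[i] = row whose first-column byte is the same occurrence as last[i].
    let mut lf = vec![0usize; n];
    let mut next = starts;
    for (i, &b) in last.iter().enumerate() {
        lf[i] = next[b as usize];
        next[b as usize] += 1;
    }
    // Row 0 begins with the sentinel, so last[0] is the final input byte.
    let mut out = Vec::with_capacity(n.saturating_sub(1));
    let mut row = 0;
    for _ in 1..n {
        out.push(last[row]);
        row = lf[row];
    }
    out.reverse();
    out
}
```

With a full sort, `inverse_bwt(&bwt(b"banana", None))` recovers `banana`. With `Some(1)`, the three suffixes starting with `a` stay in index order instead of sorted order, the last column changes, and the same inverse no longer reproduces the input.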

- Add "GPU/CPU strategy (settled)" section documenting the proven GPU-for-LZ77 + CPU-for-entropy architecture with rationale
- Refresh "Next steps" to reflect current priorities: closing the 6.5pp gzip ratio gap via LzSeq encoding improvements (P0), rANS SIMD decode wiring (P1), LzSeq optimal parser (P2), NEON SIMD (P3)
- Remove stale priorities (fuzz testing already done as M5.3, GPU Huffman chunk packing and shared memory are low-value given settled GPU/CPU strategy, auto-selection tuning is done)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove rANS SIMD decode wiring from priorities: SSE2 rANS decode was proven 32% slower than scalar (documented dead end), and the speculative "proper implementation" with merged tables is unproven
- Remove GPU Huffman atomic contention bottleneck: GPU Huffman is a documented dead end (bit-level alignment incompatible with GPU)
- Remove LZ77 shared memory bottleneck: low value given settled GPU/CPU strategy
- Renumber remaining priorities

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Chris Lundquist and others added 6 commits March 8, 2026 23:28

ChrisLundquist merged commit 6c7f4b7 into master Mar 10, 2026
4 checks passed

ChrisLundquist deleted the claude/festive-bardeen branch March 10, 2026 05:15

ChrisLundquist merged 6 commits into master from claude/festive-bardeen
