docs: document GPU dead ends to prevent agent rework #114

Merged
ChrisLundquist merged 6 commits into master from claude/festive-bardeen
Mar 10, 2026

@ChrisLundquist
Owner

Summary

  • Add "Known dead ends" section to CLAUDE.md documenting 12 pitfalls that agents have repeatedly wasted full sessions rediscovering (GPU entropy, hash tables, SSE2 rANS, parallel parsing, etc.)
  • Add warning comments to gpu_fused_span() in parallel.rs and fix misleading GPU_ENTROPY_THRESHOLD comment in mod.rs
  • Add GPU pipeline multi-block correctness and pipelining regression tests
  • Fix CRLF line endings in scripts/fetch-silesia.sh

Context

Multiple agents have spent full sessions attempting to optimize GPU entropy encoding before discovering it's fundamentally slower than CPU. The historian found 10+ major dead ends across 420 commits. This PR places warnings at all the discovery points so future agents hit the documentation before burning context.

Test plan

  • ./scripts/test.sh --quick passes (fmt, clippy, all tests)
  • Pre-commit hooks pass on all 4 commits
  • New tests exercise GPU pipeline multi-block correctness and batched LZ77 pipelining

🤖 Generated with Claude Code

Chris Lundquist and others added 6 commits March 8, 2026 23:28
Multiple agents have spent full sessions attempting to optimize GPU
entropy (rANS/FSE) encoding before discovering it's fundamentally slower
than CPU (0.77x encode, 0.54x decode). Add warnings at all three
discovery points:

- CLAUDE.md: new "Known dead ends" section (read first by all agents)
- pipeline/mod.rs: GPU_ENTROPY_THRESHOLD comment corrected from "win"
  to "gate that prevents routing" with regression warning
- pipeline/parallel.rs: gpu_fused_span() warning explaining why the
  fused path exists but must not be activated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
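
To make the "gate, not a win" wording concrete, here is a minimal sketch of a size threshold used purely as a routing guard. Everything in it is an assumption for illustration: the constant name, its value, the `EntropyBackend` enum, and `choose_entropy_backend` are hypothetical and are not taken from `pipeline/mod.rs`, whose actual contents are not shown in this PR.

```rust
/// Hypothetical backend selector; not the project's real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntropyBackend {
    Cpu,
    Gpu,
}

/// Illustrative value only; the real GPU_ENTROPY_THRESHOLD is not shown here.
///
/// This is a GATE that keeps inputs away from GPU entropy coding, not a size
/// at which the GPU starts to win. The commit message reports GPU entropy at
/// 0.77x encode and 0.54x decode relative to CPU, so lowering this value
/// would route more work onto the slower path.
pub const ILLUSTRATIVE_GPU_ENTROPY_THRESHOLD: usize = 256 * 1024;

/// Route entropy coding by input length (assumed policy for this sketch).
pub fn choose_entropy_backend(input_len: usize) -> EntropyBackend {
    if input_len < ILLUSTRATIVE_GPU_ENTROPY_THRESHOLD {
        EntropyBackend::Cpu
    } else {
        EntropyBackend::Gpu
    }
}
```

Under these assumptions, `choose_entropy_backend(1024)` yields `EntropyBackend::Cpu`. The doc comment, rather than the logic, is what keeps a later reader from treating the threshold as a tuning knob.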
- test_gpu_pipeline_multiblock_correctness: validates full GPU
  coordinator pipeline with 8 distinct-pattern blocks across Deflate,
  Lzr, and LzSeqR pipelines (guards against ring slot cross-contamination)
- test_lz77_batched_pipeline_correctness: validates batched GPU LZ77
  match-finding across 8 blocks with cross-validation against serial
- test_lz77_batched_pipeline_not_slower_than_serial: timing guard
  ensuring batched path doesn't regress vs serial execution
- fix CRLF line endings in scripts/fetch-silesia.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
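
The shape of a multi-block test that guards against ring slot cross-contamination can be sketched on the CPU alone. This is a toy under stated assumptions: the delta codec stands in for the real pipelines, the ring is a plain vector of reusable buffers, and every function name below is hypothetical. The tests added in this commit drive the real GPU coordinator instead.

```rust
/// Build `count` blocks with visibly different contents, so that reading the
/// wrong ring slot would produce a mismatch rather than a silent pass.
pub fn distinct_blocks(count: usize, len: usize) -> Vec<Vec<u8>> {
    (0..count)
        .map(|b| {
            (0..len)
                .map(|i| ((i * (b + 1) + b * 31) % 251) as u8)
                .collect()
        })
        .collect()
}

/// Toy encoder standing in for a compression stage: byte-wise delta.
/// Writes into a caller-owned buffer, the way a reusable ring slot would.
pub fn delta_encode(input: &[u8], out: &mut Vec<u8>) {
    out.clear();
    let mut prev = 0u8;
    for &b in input {
        out.push(b.wrapping_sub(prev));
        prev = b;
    }
}

/// Inverse of `delta_encode`.
pub fn delta_decode(encoded: &[u8]) -> Vec<u8> {
    let mut prev = 0u8;
    encoded
        .iter()
        .map(|&d| {
            prev = prev.wrapping_add(d);
            prev
        })
        .collect()
}

/// Push every block through a ring of `slot_count` reused buffers and check
/// that each one round-trips to its own original. `slot_count` must be > 0.
pub fn ring_roundtrip_ok(blocks: &[Vec<u8>], slot_count: usize) -> bool {
    let mut ring: Vec<Vec<u8>> = vec![Vec::new(); slot_count];
    blocks.iter().enumerate().all(|(i, block)| {
        let slot = &mut ring[i % slot_count];
        delta_encode(block, slot);
        delta_decode(slot.as_slice()) == *block
    })
}
```

Calling `ring_roundtrip_ok(&distinct_blocks(8, 4096), 3)` mirrors the eight-block layout described above, with more blocks than slots so that every slot is reused at least once.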
Expand the "Known dead ends" section with three more common pitfalls:
- GPU device init time skews bench.sh vs Criterion throughput numbers
- Compression ratio bottleneck is 5-byte match encoding, not matcher
- GPU Huffman is architecturally incompatible (needs byte alignment)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Historian research uncovered additional costly dead ends that agents
have repeatedly fallen into. Adding the highest-impact ones:

- GPU hash tables for LZ matching (atomics lose insertion order)
- SSE2 rANS decode (32% slower than scalar due to serializing extracts)
- Fully parallel GPU LZ parsing (37.6% compression gap)
- Iterative GPU algorithms (quadratic host overhead from per-round sync)
- Window-capped suffix sorts (break BWT invertibility, 433% expansion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
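
The invertibility problem with window-capped sorts can be reproduced with a small rotation-based BWT. The sketch below is an illustration under assumptions, not the project's implementation: `bwt`, `inverse_bwt`, and the `cap` parameter are hypothetical names, and capping is modeled simply as a comparator that inspects only the first `cap` bytes of each rotation.

```rust
use std::cmp::Ordering;

/// Forward BWT over rotations of a non-empty `input`.
/// `cap` limits how many bytes the comparator inspects; `None` compares
/// whole rotations. Returns the last column and the row of the original.
pub fn bwt(input: &[u8], cap: Option<usize>) -> (Vec<u8>, usize) {
    let n = input.len();
    let limit = cap.unwrap_or(n).min(n);
    let mut rows: Vec<usize> = (0..n).collect();
    // Stable sort: rotations that tie under a capped comparison keep index order.
    rows.sort_by(|&a, &b| {
        (0..limit)
            .map(|k| input[(a + k) % n].cmp(&input[(b + k) % n]))
            .find(|o| *o != Ordering::Equal)
            .unwrap_or(Ordering::Equal)
    });
    let last: Vec<u8> = rows.iter().map(|&r| input[(r + n - 1) % n]).collect();
    let primary = rows
        .iter()
        .position(|&r| r == 0)
        .expect("input must be non-empty");
    (last, primary)
}

/// Standard inverse BWT via the last-to-first mapping.
pub fn inverse_bwt(last: &[u8], primary: usize) -> Vec<u8> {
    // starts[b] = first row of byte b in the sorted first column.
    let mut starts = [0usize; 256];
    for &b in last {
        starts[b as usize] += 1;
    }
    let mut sum = 0;
    for s in starts.iter_mut() {
        let count = *s;
        *s = sum;
        sum += count;
    }
    // lf[i] = first-column row holding the same occurrence as last[i].
    let mut lf = vec![0usize; last.len()];
    for (i, &b) in last.iter().enumerate() {
        lf[i] = starts[b as usize];
        starts[b as usize] += 1;
    }
    // Walk backwards from the original's row, filling the output right to left.
    let mut out = vec![0u8; last.len()];
    let mut row = primary;
    for slot in out.iter_mut().rev() {
        *slot = last[row];
        row = lf[row];
    }
    out
}
```

With `b"banana"`, `bwt(input, None)` followed by `inverse_bwt` returns the input, while `bwt(input, Some(1))` produces a last column that the same inverse no longer maps back to `banana`. The inverse relies on the rows being in true sorted order, which a capped comparator does not guarantee.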
- Add "GPU/CPU strategy (settled)" section documenting the proven
  GPU-for-LZ77 + CPU-for-entropy architecture with rationale
- Refresh "Next steps" to reflect current priorities: closing the
  6.5pp gzip ratio gap via LzSeq encoding improvements (P0), rANS
  SIMD decode wiring (P1), LzSeq optimal parser (P2), NEON SIMD (P3)
- Remove stale priorities (fuzz testing already done as M5.3,
  GPU Huffman chunk packing and shared memory are low-value given
  settled GPU/CPU strategy, auto-selection tuning is done)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove rANS SIMD decode wiring from priorities — SSE2 rANS decode
  was proven 32% slower than scalar (documented dead end), and the
  speculative "proper implementation" with merged tables is unproven
- Remove GPU Huffman atomic contention bottleneck — GPU Huffman is a
  documented dead end (bit-level alignment incompatible with GPU)
- Remove LZ77 shared memory bottleneck — low value given settled
  GPU/CPU strategy
- Renumber remaining priorities

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit 6c7f4b7 into master Mar 10, 2026
4 checks passed
@ChrisLundquist ChrisLundquist deleted the claude/festive-bardeen branch March 10, 2026 05:15