docs: document GPU dead ends to prevent agent rework #114
Merged
ChrisLundquist merged 6 commits into master on Mar 10, 2026
Multiple agents have spent full sessions attempting to optimize GPU entropy (rANS/FSE) encoding before discovering it's fundamentally slower than CPU (0.77x encode, 0.54x decode). Add warnings at all three discovery points:

- CLAUDE.md: new "Known dead ends" section (read first by all agents)
- pipeline/mod.rs: GPU_ENTROPY_THRESHOLD comment corrected from "win" to "gate that prevents routing" with regression warning
- pipeline/parallel.rs: gpu_fused_span() warning explaining why the fused path exists but must not be activated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
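A minimal sketch of what a "gate that prevents routing" can look like. Only the name GPU_ENTROPY_THRESHOLD comes from the commit message; its value here, the EntropyBackend enum, and select_entropy_backend() are assumptions made for illustration and are not taken from pipeline/mod.rs.

```rust
/// Hypothetical backend selector; not the project's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntropyBackend {
    Cpu,
    Gpu,
}

/// Assumed value. The threshold is set so high that no real input reaches it,
/// which makes it a gate rather than a tuning knob.
/// REGRESSION WARNING: lowering this routes entropy coding to the GPU, which
/// the commit message reports as slower than CPU (0.77x encode, 0.54x decode).
pub const GPU_ENTROPY_THRESHOLD: usize = usize::MAX;

/// Hypothetical routing function: picks the entropy backend for one input.
pub fn select_entropy_backend(input_len: usize, gpu_available: bool) -> EntropyBackend {
    if gpu_available && input_len >= GPU_ENTROPY_THRESHOLD {
        EntropyBackend::Gpu
    } else {
        EntropyBackend::Cpu
    }
}
```

With a gate like this, the GPU path stays compiled and testable, but ordinary inputs always take the CPU path even when a GPU is present.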
- test_gpu_pipeline_multiblock_correctness: validates full GPU coordinator pipeline with 8 distinct-pattern blocks across Deflate, Lzr, and LzSeqR pipelines (guards against ring slot cross-contamination)
- test_lz77_batched_pipeline_correctness: validates batched GPU LZ77 match-finding across 8 blocks with cross-validation against serial
- test_lz77_batched_pipeline_not_slower_than_serial: timing guard ensuring batched path doesn't regress vs serial execution
- fix CRLF line endings in scripts/fetch-silesia.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
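The idea behind the multi-block correctness tests can be sketched as follows: give every block a different pattern, encode the whole batch, then check that each decoded block matches its own original, so output landing in the wrong slot is detected. This is only an illustration under stated assumptions. The run-length codec is a stand-in for the real Deflate, Lzr, and LzSeqR pipelines, and every function name below is hypothetical.

```rust
/// Build `count` blocks whose contents differ per block (byte 0 equals the
/// block index), so output from the wrong block cannot pass for the right one.
pub fn distinct_pattern_blocks(count: usize, block_len: usize) -> Vec<Vec<u8>> {
    (0..count)
        .map(|b| (0..block_len).map(|i| ((i * (b + 1) + b) % 251) as u8).collect())
        .collect()
}

/// Stand-in codec: run-length encoding as (run length, byte) pairs.
pub fn rle_encode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        // Cap runs at 255 so the length fits in one byte.
        while i + run < input.len() && input[i + run] == byte && run < 255 {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

/// Inverse of `rle_encode`.
pub fn rle_decode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in input.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    out
}

/// Encode every block first (the "batched" step), then decode each result and
/// compare it with the block at the same index.
pub fn multiblock_round_trip_ok(blocks: &[Vec<u8>]) -> bool {
    let encoded: Vec<Vec<u8>> = blocks.iter().map(|b| rle_encode(b)).collect();
    blocks
        .iter()
        .zip(&encoded)
        .all(|(original, enc)| rle_decode(enc) == *original)
}
```

A test built on this would call multiblock_round_trip_ok(&distinct_pattern_blocks(8, 4096)) and assert that it returns true. Identical blocks would hide a slot mix-up, which is why the patterns are distinct.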
Expand the "Known dead ends" section with three more common pitfalls:

- GPU device init time skews bench.sh vs Criterion throughput numbers
- Compression ratio bottleneck is 5-byte match encoding, not matcher
- GPU Huffman is architecturally incompatible (needs byte alignment)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Historian research uncovered additional costly dead ends that agents have repeatedly fallen into. Adding the highest-impact ones:

- GPU hash tables for LZ matching (atomics lose insertion order)
- SSE2 rANS decode (32% slower than scalar due to serializing extracts)
- Fully parallel GPU LZ parsing (37.6% compression gap)
- Iterative GPU algorithms (quadratic host overhead from per-round sync)
- Window-capped suffix sorts (break BWT invertibility, 433% expansion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
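The last item in the list above can be illustrated with a toy BWT. This sketch assumes "window-capped" means that rotations are compared on only their first `window` bytes, which may differ from what the project actually tried. All names are hypothetical, and the naive sort is for demonstration only. The standard inverse relies on rows being in exact sorted order, so a capped sort can produce output that no longer decodes to the input.

```rust
/// Sort rotation start indices, comparing at most `window` bytes per rotation.
/// `window >= input.len()` gives the exact order; ties keep index order.
pub fn rotation_order(input: &[u8], window: usize) -> Vec<usize> {
    let n = input.len();
    let depth = window.min(n);
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| {
        let ra = (0..depth).map(|k| input[(a + k) % n]);
        let rb = (0..depth).map(|k| input[(b + k) % n]);
        ra.cmp(rb)
    });
    order
}

/// Forward BWT: returns the last column and the row holding the original string.
pub fn bwt(input: &[u8], window: usize) -> (Vec<u8>, usize) {
    let n = input.len();
    let order = rotation_order(input, window);
    let last = order.iter().map(|&s| input[(s + n - 1) % n]).collect();
    let primary = order.iter().position(|&s| s == 0).unwrap_or(0);
    (last, primary)
}

/// Standard inverse BWT via LF-mapping. Correct only if the rows were in
/// exact sorted order when the last column was produced.
pub fn inverse_bwt(last: &[u8], primary: usize) -> Vec<u8> {
    let n = last.len();
    // Count each byte, then turn counts into first-column start positions.
    let mut starts = [0usize; 256];
    for &b in last {
        starts[b as usize] += 1;
    }
    let mut sum = 0;
    for slot in starts.iter_mut() {
        let count = *slot;
        *slot = sum;
        sum += count;
    }
    // lf[i] is the first-column row that corresponds to last[i].
    let mut lf = vec![0usize; n];
    for (i, &b) in last.iter().enumerate() {
        lf[i] = starts[b as usize];
        starts[b as usize] += 1;
    }
    // Walk backwards from the primary row, filling the output from the end.
    let mut out = vec![0u8; n];
    let mut row = primary;
    for slot in out.iter_mut().rev() {
        *slot = last[row];
        row = lf[row];
    }
    out
}
```

For the input b"aYaX", bwt(input, 4) round-trips through inverse_bwt(), while bwt(input, 1) does not: the two rotations starting with 'a' tie under the cap and end up in the wrong relative order.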
- Add "GPU/CPU strategy (settled)" section documenting the proven GPU-for-LZ77 + CPU-for-entropy architecture with rationale
- Refresh "Next steps" to reflect current priorities: closing the 6.5pp gzip ratio gap via LzSeq encoding improvements (P0), rANS SIMD decode wiring (P1), LzSeq optimal parser (P2), NEON SIMD (P3)
- Remove stale priorities (fuzz testing already done as M5.3, GPU Huffman chunk packing and shared memory are low-value given settled GPU/CPU strategy, auto-selection tuning is done)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove rANS SIMD decode wiring from priorities: SSE2 rANS decode was proven 32% slower than scalar (documented dead end), and the speculative "proper implementation" with merged tables is unproven
- Remove GPU Huffman atomic contention bottleneck: GPU Huffman is a documented dead end (bit-level alignment incompatible with GPU)
- Remove LZ77 shared memory bottleneck: low value given settled GPU/CPU strategy
- Renumber remaining priorities

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Add a warning to gpu_fused_span() in parallel.rs and fix the misleading GPU_ENTROPY_THRESHOLD comment in mod.rs
- Fix CRLF line endings in scripts/fetch-silesia.sh

Context
Multiple agents have spent full sessions attempting to optimize GPU entropy encoding before discovering it's fundamentally slower than CPU. The historian found 10+ major dead ends across 420 commits. This PR places warnings at all the discovery points so future agents hit the documentation before burning context.
Test plan
./scripts/test.sh --quick passes (fmt, clippy, all tests)

🤖 Generated with Claude Code