
Add challenge 86: Paged KV-Cache Attention (Medium)#225

Open
claude[bot] wants to merge 2 commits into main from add-challenge-86-paged-attention

Conversation

Contributor

claude[bot] commented Mar 24, 2026

Summary

  • Adds challenge 86: Paged KV-Cache Attention (Medium difficulty)
  • Models the decode-phase attention kernel used in vLLM and other LLM serving systems, where KV cache is stored in non-contiguous memory pages
  • Solvers must implement block-table indirection to gather K/V tokens from scattered physical blocks, then compute scaled dot-product attention with online softmax
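To make the block-table indirection concrete, here is a minimal pure-Python sketch. The names (`K_cache`, `block_table`, `block_size`, `head_dim`) follow the `solve` signature described in this PR, but the flat row-major layout is an assumption for illustration, not necessarily the challenge's actual cache layout:

```python
# Hypothetical sketch: gather the key vector for logical token `t` of one
# sequence from a paged KV cache stored as a flat array of physical blocks.
def gather_key(K_cache, block_table, t, block_size, head_dim):
    logical_block = t // block_size              # which logical page holds token t
    offset = t % block_size                      # position of token t inside that page
    physical_block = block_table[logical_block]  # indirection: logical -> physical
    base = (physical_block * block_size + offset) * head_dim
    return K_cache[base : base + head_dim]       # one head_dim-sized key vector

# Toy layout: 2 physical blocks, block_size=2, head_dim=1.
K_cache = [10.0, 11.0, 20.0, 21.0]   # physical block 0 holds keys 10, 11
block_table = [1, 0]                 # logical block 0 lives in physical block 1
print(gather_key(K_cache, block_table, 0, 2, 1))   # -> [20.0]
```

In a CUDA/Triton kernel the same arithmetic becomes pointer offsets, and keeping `offset` contiguous across threads is what yields the coalesced access the performance test rewards.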

What makes this interesting for GPU programmers

  • Non-contiguous memory access: tokens are fetched via a block_table that maps logical block indices to physical block IDs in a shared pool — requires careful pointer arithmetic and strided access patterns
  • Online softmax: to avoid materializing all scores, the numerically stable running-max trick must be applied as blocks are processed one at a time
  • Memory bandwidth bound: decode-phase attention is memory-bandwidth limited, rewarding coalesced access and shared memory reuse
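The online-softmax bullet above can be sketched in a few lines. This is a generic illustration of the running-max trick over scalar scores/values, not the challenge's reference implementation; in the real kernel `acc` is a head_dim-sized vector accumulated per query head:

```python
import math

# Streaming softmax: process scores one at a time, keeping a running max `m`,
# running denominator `l`, and running weighted sum `acc`, rescaling the
# accumulators whenever the max grows. math.exp(-inf) == 0.0 handles the
# first iteration cleanly.
def online_softmax_attn(scores, values):
    m = -math.inf     # running maximum of scores seen so far
    l = 0.0           # running softmax denominator
    acc = 0.0         # running numerator: sum of exp(score - m) * value
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)          # rescale old state to the new max
        l = l * scale + math.exp(s - m_new)
        acc = acc * scale + math.exp(s - m_new) * v
        m = m_new
    return acc / l

# Agrees with the ordinary two-pass softmax computed over all scores at once:
scores = [0.5, 2.0, -1.0]
values = [1.0, 2.0, 3.0]
denom = sum(math.exp(s - max(scores)) for s in scores)
ref = sum(math.exp(s - max(scores)) * v for s, v in zip(scores, values)) / denom
print(abs(online_softmax_attn(scores, values) - ref) < 1e-12)   # -> True
```

Because each KV block is visited exactly once, this is what lets the kernel stay memory-bandwidth bound instead of materializing a full score row per query.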

Files

  • challenge.py: reference implementation, 10 functional test cases (edge cases, power-of-2, non-power-of-2, variable-length batch, realistic sizes), performance test at LLaMA-3 scale (batch=8, heads=32, head_dim=128, block_size=16, ctx_len=2,048)
  • challenge.html: full problem description with SVG block-table visualization, worked example, and constraints
  • 6 starter files: CUDA, PyTorch, Triton, JAX, CuTe, Mojo

Test plan

  • Reference implementation verified against manual calculation for example test
  • Validation run (--action run) passed on NVIDIA TESLA T4
  • pre-commit run --all-files passes (black, isort, flake8, clang-format)
  • Challenge number 86 does not conflict with any merged challenge or open PR
  • All checklist items in CLAUDE.md verified

🤖 Generated with Claude Code

Implements decode-phase attention over a non-contiguous paged KV cache,
modeled on the vLLM paged attention architecture. Teaches block-table
indirection, online softmax across scattered memory pages, and the
memory access patterns central to LLM serving workloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Redesign SVG: block_table as a proper table with column headers,
cache pool as horizontal memory strip with color-coded blocks and
sequence labels. Convert example and computation steps from HTML
entities to LaTeX math notation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@shxjames
Contributor

[Three screenshots of the rendered challenge page, taken 2026-03-26]


<h2>Implementation Requirements</h2>
<p>
Implement the function <code>solve(Q, K_cache, V_cache, block_table, context_lens, output, batch_size, num_heads, head_dim, block_size, max_blocks_per_seq)</code>

We don't really have to say this in the implementation requirements. @claude change this to match the format of other challenges' implementation requirements.

