
docs: GB10 (Blackwell sm_121) optimization guide#283

Open

CyberBrown wants to merge 1 commit into `huggingface:main` from `CyberBrown:docs/gb10-optimization-guide`

Conversation

@CyberBrown

Summary

  • Adds a hardware-specific optimization guide for the NVIDIA GB10 (DGX Spark), the first Blackwell desktop GPU
  • Covers the unified memory architecture (128 GB LPDDR5X shared with Grace CPU), which is fundamentally different from HBM-based data-center GPUs
  • Includes measured benchmark numbers from real hardware, not theoretical estimates

What's in the guide

| Section | Key content |
| --- | --- |
| Architecture overview | 48 SMs, 1536 threads/SM, 100 KB shared memory/SM, 24 MB L2 |
| Comparison table | Side-by-side with H100 and A100 |
| Measured performance | 91 TFLOPS BF16 tensor, 218 GB/s bandwidth, RMSNorm benchmarks |
| Memory hierarchy | Unified memory behavior, vectorization patterns, L2 sizing |
| Occupancy tuning | 48 warps/SM math, block size recommendations |
| Compilation | sm_121 flags, `TORCH_CUDA_ARCH_LIST`, flash-attn warning |
| Best practices | 8 actionable rules specific to GB10's bandwidth profile |
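For the compilation section, the build setup for an sm_121 target might look like the sketch below. This is illustrative only: the `"12.1"` arch string and the exact `nvcc` invocation are assumptions based on the sm_121 target named in the table, not the guide's actual snippets — verify against your CUDA 13 toolkit.

```shell
# Illustrative sketch; flags assumed from the sm_121 target, verify locally.
# PyTorch extension builds pick the target from this environment variable:
export TORCH_CUDA_ARCH_LIST="12.1"

# Direct nvcc compilation targeting sm_121:
nvcc -gencode arch=compute_121,code=sm_121 -O3 kernel.cu -o kernel
```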

Benchmark highlights (GB10, CUDA 13, PyTorch 2.10, BF16)

  • BF16 MatMul: 92.3 TFLOPS (tensor core)
  • Memory bandwidth: 218 GB/s measured (≈40% of the ~546 GB/s theoretical peak)
  • Vectorized RMSNorm: 2.59× average speedup over PyTorch baseline
  • Key insight: 256-thread blocks give 100% occupancy; 1024-thread blocks drop to 67%
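The occupancy and bandwidth figures above follow from simple arithmetic, assuming the guide's stated limits (1536 resident threads/SM). A quick sanity check:

```python
# Back-of-envelope math for GB10 (sm_121), using the figures stated above:
# 1536 resident threads per SM (48 warps of 32 threads).
MAX_THREADS_PER_SM = 1536

def occupancy(block_size: int) -> float:
    """Fraction of the SM's thread slots filled by whole resident blocks."""
    resident_blocks = MAX_THREADS_PER_SM // block_size
    return resident_blocks * block_size / MAX_THREADS_PER_SM

print(f"256-thread blocks:  {occupancy(256):.0%}")   # 6 blocks fit exactly
print(f"1024-thread blocks: {occupancy(1024):.0%}")  # only 1 block fits

# Measured bandwidth as a fraction of the theoretical peak
print(f"bandwidth efficiency: {218 / 546:.0%}")
```

This ignores register and shared-memory pressure, which can lower occupancy further; it only shows the thread-slot limit that makes 1024-thread blocks a poor fit for a 1536-thread SM.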

Motivation

The GB10/DGX Spark is shipping now, and kernel authors need architecture-specific guidance. The unified memory model means many assumptions from HBM-GPU guides (pinned memory, PCIe transfers, multi-TB/s bandwidth budgets) don't apply. This guide fills that gap with tested, practical advice.

A companion kernel (logos-flux/gb10-rmsnorm) is published on the Hub as the first sm_121 kernel.

Test plan

  • Guide renders correctly in markdown
  • Added to _toctree.yml under "Building kernels" section
  • All benchmark numbers measured on real GB10 hardware
  • Code snippets tested and verified
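For reference, the `_toctree.yml` entry might look like the fragment below. The `local` path and exact nesting are assumptions for illustration, not the PR's actual diff:

```yaml
# Hypothetical entry; actual path and nesting may differ in the PR
- title: Building kernels
  sections:
    - local: gb10-optimization-guide
      title: GB10 (Blackwell sm_121) optimization guide
```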

🤖 Generated with Claude Code

Hardware-specific guide for writing CUDA kernels targeting the NVIDIA GB10
(DGX Spark). Covers the unified memory architecture, 48 SM / 100 KB shared
mem specifics, occupancy tuning for 1536 threads/SM, and includes measured
benchmark numbers (2.59x RMSNorm speedup, 91 TFLOPS BF16 tensor core,
218 GB/s memory bandwidth).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@danieldk
Member

cc @burtenshaw @sayakpaul

