
docs: GB10 (Blackwell sm_121) optimization guide#283

Open

CyberBrown wants to merge 1 commit into `huggingface:main` from `CyberBrown:docs/gb10-optimization-guide`

Conversation

@CyberBrown

Summary

  • Adds a hardware-specific optimization guide for the NVIDIA GB10 (DGX Spark), the first Blackwell desktop GPU
  • Covers the unified memory architecture (128 GB LPDDR5X shared with Grace CPU), which is fundamentally different from HBM-based data-center GPUs
  • Includes measured benchmark numbers from real hardware, not theoretical estimates

What's in the guide

| Section | Key content |
| --- | --- |
| Architecture overview | 48 SMs, 1536 threads/SM, 100 KB shared memory/SM, 24 MB L2 |
| Comparison table | Side-by-side with H100 and A100 |
| Measured performance | 91 TFLOPS BF16 tensor, 218 GB/s bandwidth, RMSNorm benchmarks |
| Memory hierarchy | Unified memory behavior, vectorization patterns, L2 sizing |
| Occupancy tuning | 48 warps/SM math, block size recommendations |
| Compilation | sm_121 flags, `TORCH_CUDA_ARCH_LIST`, flash-attn warning |
| Best practices | 8 actionable rules specific to GB10's bandwidth profile |
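For the compilation section, the build setup for an sm_121 target might look like the sketch below. This is illustrative only: the `"12.1"` arch string and the exact `nvcc` invocation are assumptions based on the sm_121 target named in the table, not the guide's actual snippets — verify against your CUDA 13 toolkit.

```shell
# Illustrative sketch; flags assumed from the sm_121 target, verify locally.
# PyTorch extension builds pick the target from this environment variable:
export TORCH_CUDA_ARCH_LIST="12.1"

# Direct nvcc compilation targeting sm_121:
nvcc -gencode arch=compute_121,code=sm_121 -O3 kernel.cu -o kernel
```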

Benchmark highlights (GB10, CUDA 13, PyTorch 2.10, BF16)

  • BF16 MatMul: 92.3 TFLOPS (tensor core)
  • Memory bandwidth: 218 GB/s measured (≈40% of the ~546 GB/s theoretical peak)
  • Vectorized RMSNorm: 2.59× average speedup over PyTorch baseline
  • Key insight: 256-thread blocks give 100% occupancy; 1024-thread blocks drop to 67%
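The occupancy and bandwidth figures above follow from simple arithmetic, assuming the guide's stated limits (1536 resident threads/SM). A quick sanity check:

```python
# Back-of-envelope math for GB10 (sm_121), using the figures stated above:
# 1536 resident threads per SM (48 warps of 32 threads).
MAX_THREADS_PER_SM = 1536

def occupancy(block_size: int) -> float:
    """Fraction of the SM's thread slots filled by whole resident blocks."""
    resident_blocks = MAX_THREADS_PER_SM // block_size
    return resident_blocks * block_size / MAX_THREADS_PER_SM

print(f"256-thread blocks:  {occupancy(256):.0%}")   # 6 blocks fit exactly
print(f"1024-thread blocks: {occupancy(1024):.0%}")  # only 1 block fits

# Measured bandwidth as a fraction of the theoretical peak
print(f"bandwidth efficiency: {218 / 546:.0%}")
```

This ignores register and shared-memory pressure, which can lower occupancy further; it only shows the thread-slot limit that makes 1024-thread blocks a poor fit for a 1536-thread SM.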

Motivation

The GB10/DGX Spark is shipping now, and kernel authors need architecture-specific guidance. The unified memory model means many assumptions from HBM-GPU guides (pinned memory, PCIe transfers, multi-TB/s bandwidth budgets) don't apply. This guide fills that gap with tested, practical advice.

A companion kernel (logos-flux/gb10-rmsnorm) is published on the Hub as the first sm_121 kernel.

Test plan

  • Guide renders correctly in markdown
  • Added to _toctree.yml under "Building kernels" section
  • All benchmark numbers measured on real GB10 hardware
  • Code snippets tested and verified
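For reference, the `_toctree.yml` entry might look like the fragment below. The `local` path and exact nesting are assumptions for illustration, not the PR's actual diff:

```yaml
# Hypothetical entry; actual path and nesting may differ in the PR
- title: Building kernels
  sections:
    - local: gb10-optimization-guide
      title: GB10 (Blackwell sm_121) optimization guide
```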

🤖 Generated with Claude Code

Hardware-specific guide for writing CUDA kernels targeting the NVIDIA GB10
(DGX Spark). Covers the unified memory architecture, 48 SM / 100 KB shared
mem specifics, occupancy tuning for 1536 threads/SM, and includes measured
benchmark numbers (2.59x RMSNorm speedup, 91 TFLOPS BF16 tensor core,
218 GB/s memory bandwidth).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@danieldk
Member

cc @burtenshaw @sayakpaul

