fix(ggml-cuda): skip sm_120→sm_120a for consumer Blackwell (no FP4 MMA) #3

Merged
davide221 merged 1 commit into Luce-Org:luce-dflash from easel:fix/consumer-blackwell-sm120
May 4, 2026

easel commented Apr 27, 2026

Problem

Consumer Blackwell GPUs (RTX 5090, SM 12.0) lack FP4 tensor core hardware.
ggml/src/ggml-cuda/CMakeLists.txt unconditionally replaces sm_12X with
sm_12Xa and compiles mmq-instance-mxfp4/nvfp4 with BLACKWELL_MMA_AVAILABLE.
The resulting .block_scale / mxf4 PTX instructions fault at runtime on
consumer hardware with CUDA_ERROR_ILLEGAL_INSTRUCTION.

Changes

ggml/src/ggml-cuda/CMakeLists.txt

  • Add GGML_CUDA_BLACKWELL_CONSUMER CMake option (default OFF)
  • When ON: skip the 12X → 12Xa arch replacement and exclude
    mmq-instance-mxfp4.cu / mmq-instance-nvfp4.cu from the build
  • Add GGML_CUDA_BLACKWELL_CONSUMER compile definition so mmq.cu can gate dispatch
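
The arch handling described above lives in CMake, but its effect can be modelled in a few lines of C++ for illustration. This is only a sketch: `resolve_archs` is a hypothetical helper, and it assumes plain three-digit entries such as `120` (the real CMake logic may also handle suffixed entries, which this model ignores).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical model of the CMake behaviour: by default every "12X" entry
// becomes "12Xa"; with the consumer flag set, entries pass through unchanged.
std::vector<std::string> resolve_archs(const std::vector<std::string>& archs,
                                       bool blackwell_consumer) {
    std::vector<std::string> out;
    out.reserve(archs.size());
    for (const std::string& arch : archs) {
        // Assumption: an SM 12.x entry is exactly "12" followed by one digit.
        const bool is_sm12x = arch.size() == 3 &&
                              arch.compare(0, 2, "12") == 0 &&
                              std::isdigit(static_cast<unsigned char>(arch[2])) != 0;
        if (is_sm12x && !blackwell_consumer) {
            out.push_back(arch + "a");  // default path: 12X -> 12Xa
        } else {
            out.push_back(arch);        // consumer path, or a non-12X arch
        }
    }
    return out;
}
```

With the flag off, `{"89", "120"}` maps to `{"89", "120a"}`; with it on, the list is returned as given, so ggml-cuda is compiled for plain sm_120.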

ggml/src/ggml-cuda/mmq.cu

  • Guard MXFP4/NVFP4 switch cases with #ifndef GGML_CUDA_BLACKWELL_CONSUMER
    to prevent linker errors when those instance files are excluded
  • Return mmq_supported = false for those types when the flag is set
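
A minimal sketch of the guarding pattern, not the actual mmq.cu code: `quant_type`, the `launch_*` functions, and `dispatch_mmq` are hypothetical stand-ins, and the launchers return an integer tag only so the chosen path can be observed. The point is that the FP4 cases and the symbols they reference disappear together under the same macro, so nothing is left for the linker to resolve.

```cpp
#include <cstdio>
#include <cstdlib>

// Normally injected by CMake as a compile definition; defined here so the
// sketch builds standalone in the consumer configuration.
#define GGML_CUDA_BLACKWELL_CONSUMER

// Hypothetical stand-in for the quantization type enum (not ggml's ggml_type).
enum class quant_type { Q4_0, Q8_0, MXFP4, NVFP4 };

// Hypothetical kernel launchers; each returns a tag identifying the path taken.
static int launch_q4_0() { return 0; }
static int launch_q8_0() { return 1; }
#ifndef GGML_CUDA_BLACKWELL_CONSUMER
// Stand-ins for the instances that the excluded .cu files would provide.
static int launch_mxfp4() { return 2; }
static int launch_nvfp4() { return 3; }
#endif

// Reports whether the MMQ path may be used for a given type in this build.
bool mmq_supported(quant_type type) {
    switch (type) {
        case quant_type::Q4_0:
        case quant_type::Q8_0:
            return true;
        case quant_type::MXFP4:
        case quant_type::NVFP4:
#ifdef GGML_CUDA_BLACKWELL_CONSUMER
            return false;  // FP4 instances are not compiled in
#else
            return true;
#endif
    }
    return false;
}

// Dispatches to a launcher; FP4 cases exist only when their instances do.
int dispatch_mmq(quant_type type) {
    switch (type) {
        case quant_type::Q4_0: return launch_q4_0();
        case quant_type::Q8_0: return launch_q8_0();
#ifndef GGML_CUDA_BLACKWELL_CONSUMER
        case quant_type::MXFP4: return launch_mxfp4();
        case quant_type::NVFP4: return launch_nvfp4();
#endif
        default:
            std::fprintf(stderr, "MMQ: type not supported in this build\n");
            std::abort();
    }
}
```

Callers are expected to consult `mmq_supported` first and fall back to another path when it returns false; the `default` branch only fires if that check is skipped. Removing the `#define` restores the FP4 cases.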

Usage

Set GGML_CUDA_BLACKWELL_CONSUMER=ON at cmake configure time for builds
targeting consumer Blackwell (RTX 5080/5090). The parent repo
Luce-Org/lucebox-hub sets this automatically via nvidia-smi detection
(see companion PR Luce-Org/lucebox-hub#48).

Test plan

  • cmake -B build -DGGML_CUDA_BLACKWELL_CONSUMER=ON -DCMAKE_CUDA_ARCHITECTURES=120 -S . completes without error
  • No ptxas: Feature '.block_scale' not supported on .target 'sm_120' errors
  • Runtime kernels execute without CUDA_ERROR_ILLEGAL_INSTRUCTION

🤖 Generated with Claude Code

Consumer Blackwell GPUs (RTX 5090, SM 12.0) do not have FP4 tensor core
instructions. The existing code unconditionally replaces sm_120 with sm_120a
and compiles mmq-instance-mxfp4/nvfp4 with BLACKWELL_MMA_AVAILABLE, which
emits .block_scale / mxf4 PTX that faults on sm_120 hardware.

Add GGML_CUDA_BLACKWELL_CONSUMER option (set by parent build when nvidia-smi
reports SM 12.x without an explicit 'a' variant):
- Skip the 12X→12Xa arch replacement so ggml-cuda compiles for plain sm_120
- Exclude mmq-instance-mxfp4.cu and mmq-instance-nvfp4.cu from the build
- Guard their dispatch cases in mmq.cu to prevent linker errors and
  surface a clear abort if FP4 types are somehow requested at runtime

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@davide221 davide221 merged commit 5776d4d into Luce-Org:luce-dflash May 4, 2026
@easel easel deleted the fix/consumer-blackwell-sm120 branch May 10, 2026 17:20