Add maxnreg autotuning to Mamba-3 Triton kernels #905

Merged
caitWW merged 1 commit into main from add-maxnreg-autotuning
Apr 9, 2026
caitWW (Collaborator) commented Apr 7, 2026

Add maxnreg (maximum register count per thread) as an autotuning dimension ([None, 128, 256]) to all Mamba-3 SISO forward and backward Triton kernels. On SM100 (Blackwell, B200/B300), the Triton compiler may default to a low register count (e.g. 32 or 168 out of 255 available) when maxnreg is unspecified, causing excessive register spilling.
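For readers unfamiliar with how an extra autotuning dimension is added, here is a minimal sketch. It is not the code from this PR: `config_grid`, `to_triton_configs`, the `BLOCK_SIZE` meta-parameter, and the block size and warp values are hypothetical and chosen only for illustration. Only the `[None, 128, 256]` sweep comes from the description above. It assumes a Triton version whose `triton.Config` accepts a `maxnreg` keyword.

```python
from itertools import product

# Values swept in this PR; None leaves the choice to the compiler.
MAXNREG_CHOICES = [None, 128, 256]


def config_grid(block_sizes, num_warps_choices, maxnreg_choices=MAXNREG_CHOICES):
    """Return one plain dict per autotune candidate (hypothetical helper)."""
    return [
        {"kwargs": {"BLOCK_SIZE": block}, "num_warps": warps, "maxnreg": maxnreg}
        for block, warps, maxnreg in product(
            block_sizes, num_warps_choices, maxnreg_choices
        )
    ]


def to_triton_configs(grid):
    """Turn the plain dicts into triton.Config objects (hypothetical helper)."""
    # Imported lazily so the grid can be inspected on machines without Triton.
    import triton

    return [
        triton.Config(c["kwargs"], num_warps=c["num_warps"], maxnreg=c["maxnreg"])
        for c in grid
    ]
```

The resulting list would be passed as `configs=` to `@triton.autotune`. Each added `maxnreg` value multiplies the number of candidates, so a three-value sweep triples the autotuning time for the first call of each kernel.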

caitWW merged commit bd6dc62 into main Apr 9, 2026
caitWW deleted the add-maxnreg-autotuning branch April 10, 2026 20:26
ChrisLundquist added a commit to ChrisLundquist/mamba that referenced this pull request Apr 12, 2026
Triton's HIP backend does not recognize the maxnreg keyword argument
(added for NVIDIA in PR state-spaces#905), raising:
  KeyError: 'Keyword argument maxnreg was specified but unrecognised'

Add _maxnreg() helper that returns {} on HIP and {maxnreg: value} on
CUDA, and use **_maxnreg(r) in all autotune Config constructors.

Verified on AMD RX 9070 XT (gfx1201):
  - All math utils: PASS (cos/sin err=0, tanh err=1.8e-7)
  - mamba3_siso_fwd kernel: PASS
  - angle_dt_fwd kernel: PASS
  - mamba3_siso_combined pipeline: PASS
  - Forward throughput: 3.9M tok/s at seqlen=1024

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>