Add initial NVIDIA GPU backend bring-up #32

Open

zhoubot wants to merge 7 commits into main from feat_gpu

Conversation

zhoubot commented Mar 31, 2026

Summary

This PR brings up an initial NVIDIA GPU backend for PTO Tile Lib, centered on DGX Spark / GB10 (sm121), and adds a real GPU test lane.

Included in this PR

  • initial CUDA GPU backend scaffolding and dispatch wiring
  • sm121 matmul fast-path groundwork
  • tensor-core WMMA matmul fast paths for half / bf16 on sm121
  • extended TMATMUL, TMATMUL_ACC, TMATMUL_BIAS, TMATMUL_MX, and TGEMV_MX GPU coverage
  • standalone CUDA correctness test lane under tests/gpu/st
  • lightweight GB10 matmul microbenchmark
  • GPU-specific swizzle tile layout (SLayout::GpuSwizzle128B) that is intentionally separate from NPU boxed layouts
  • larger 64x64x64 GEMM correctness tests for half / bf16

Notes

  • the float matmul still uses an inline-PTX FMA fallback path
  • the MX wrappers accept the scale tiles but reuse the existing GPU matmul path; scale semantics are not fully modeled yet
  • the GPU swizzle layout is groundwork for future shared-memory / tensor-core-friendly paths and is not yet consumed by the sm121 matmul fast path

Validation

Executed on the target GB10 / DGX Spark environment:

  • cmake --build build/tests/gpu-st -j4
  • ctest --output-on-failure
  • ./build/tests/gpu-st/testcase/pto_gpu_perf/pto_gpu_perf

Representative benchmark signal from GB10 (64x64x64, 1 block):

  • float: ~2.2036 ms, ~0.24 GFLOPS
  • half: ~0.0082 ms, ~63.86 GFLOPS
  • bf16: ~0.0082 ms, ~64.03 GFLOPS
