Description
Hi, I am testing the performance of cuBLASLt GEMM with NVFP4 on an RTX 5080, basically following the code in sample_cublasLt_LtNvfp4Matmul.cu. At first I changed nothing and it worked perfectly.
Then I tried to get a bfloat16 tensor D, and a core dump with error code 7 happened.
It would be very nice of you to take a look.
ENV:
nvcc version: cuda_12.8.r12.8/compiler.35583870_0
driver version: 570.172.08
Here is the problem when I try to get a bfloat16 tensor D:
Situation 1: I commented out lines 86 and 92 (i.e. only disabled the D_out quantization); the error occurred.
Situation 2: I commented out lines 85, 86, 91 and 92 (disabled both the per-tensor-wide dequantization and the D_out quantization); the error was gone and it worked perfectly again (see code_1.cu in the attachment).
My questions:
- I am not sure whether I am using the cublasLtMatmul() API correctly, or whether there is a correct way to get a full NVFP4 GEMM (FP4 block-scaled GEMM with the inner FP8 block scales plus a per-tensor FP32 dequantization scale, as in Situation 1). I sketch the reference math I have in mind after this list.
- Besides, the "1D Block Scaling Factors Layout" section of the docs confuses me a little. Could you show a simple example in PyTorch-like style of how to produce the right scale layout? (For example, the contiguous tensor A has shape (m, k), and we already have the contiguous scale_a with shape (m, k // 16).) My current attempt is the one-liner below, followed by a more explicit helper version.
```python
scale_a = scale_a.view(m // 128, 4, 32, k // 64, 4).permute(0, 3, 2, 1, 4).contiguous()  # here is my understanding, but it seems wrong?
```
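Here is the same attempt written out as a standalone helper. The function name swizzle_scale and the divisibility assumptions are my own; the 128x4 tiling follows recipes I have seen in open-source block-scaled GEMM code, and I am not sure it matches exactly what cuBLASLt expects:

```python
import torch

# A more explicit version of the one-liner above, assuming m % 128 == 0 and k % 64 == 0.
# swizzle_scale is my own helper name, not a cuBLASLt API.
def swizzle_scale(scale: torch.Tensor) -> torch.Tensor:
    rows, cols = scale.shape                      # (m, k // 16) for A
    assert rows % 128 == 0 and cols % 4 == 0
    # Cut the scale matrix into 128x4 tiles, ordered row-tile-major.
    tiles = scale.view(rows // 128, 128, cols // 4, 4).permute(0, 2, 1, 3)
    # Inside each tile, the 128 rows are split into 4 groups of 32 and interleaved,
    # so each 16-wide output row holds the scales of rows {r, r+32, r+64, r+96}, 4 columns each.
    out = tiles.reshape(-1, 4, 32, 4).transpose(1, 2).reshape(-1, 32, 16)
    return out.contiguous().flatten()
```

As far as I can tell, for shapes that meet those divisibility assumptions this produces the same element order as my one-liner, so if the one-liner is wrong I would like to understand which part of the tiling I am misreading.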
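And for my first question, this is the reference (non-fused) computation I expect the full NVFP4 GEMM of Situation 1 to perform. The helper name nvfp4_gemm_reference and all argument names are my own assumptions, and the FP4 payloads / FP8 block scales are stood in by ordinary float tensors purely for emulation:

```python
import torch

# Emulation of what I expect Situation 1 to compute (names are mine, not the cuBLASLt API):
#   a_fp4, b_fp4            : FP4 payloads, here already decoded to float for emulation
#   a_block_scale, b_block_scale : per-16-element FP8 (e4m3) block scales, shapes (m, k // 16) and (n, k // 16)
#   a_tensor_scale, b_tensor_scale : per-tensor FP32 dequantization scales
def nvfp4_gemm_reference(a_fp4, a_block_scale, a_tensor_scale,
                         b_fp4, b_block_scale, b_tensor_scale):
    # Expand each block scale over its 16 contiguous K elements, then dequantize.
    a = a_fp4.float() * a_block_scale.float().repeat_interleave(16, dim=1) * a_tensor_scale
    b = b_fp4.float() * b_block_scale.float().repeat_interleave(16, dim=1) * b_tensor_scale
    # Accumulate in FP32 and cast D to bfloat16, with no D_out quantization step.
    return (a @ b.t()).to(torch.bfloat16)
```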