
cublasLt NVFP4 GEMM sample code core dump error #292

@Malfurionzz

Description


Hi, I am testing the performance of cublasLt GEMM with NVFP4 on an RTX 5080. Basically, I followed the code in sample_cublasLt_LtNvfp4Matmul.cu. At first I changed nothing and it worked perfectly.
Then I tried to get a bfloat16 tensor D, and a core dump with error code 7 happened.
I would really appreciate your help~

ENV:
nvcc version: cuda_12.8.r12.8/compiler.35583870_0
driver version: 570.172.08

Here is the problem when I try to get a bfloat16 tensor D:

Situation 1: I commented out lines 86 and 92 of the sample (just disabling the D_{out} quantize; what I conceptually expect from this case is sketched after the attachment), and the error occurred.

Situation 2: I commented out lines 85, 86, 91 and 92 (disabling both the per-tensor-wide dequantize and the D_{out} quantize), and the error was gone; it works perfectly again (see code_1.cu in the attachment).

nvfp4.zip
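Conceptually, what I am hoping Situation 1 would compute (a BF16 D with the FP8 block scales and the per-tensor FP32 scale applied, and no output quantization) is something like the emulation below. This is only my mental model written in PyTorch; the function and all argument names are made up, and it is not a claim about what cublasLtMatmul actually does:

    import torch

    def emulated_nvfp4_gemm(A_vals, A_block_scales, a_tensor_scale,
                            B_vals, B_block_scales, b_tensor_scale):
        # A_vals: (m, k) FP4 values already decoded to float32
        # A_block_scales: (m, k // 16) FP8 (ue4m3) block scales, decoded to float32
        # a_tensor_scale: per-tensor FP32 dequant scale (python float)
        # B is given as (n, k) so that D = A @ B^T has shape (m, n)
        m, k = A_vals.shape
        n, _ = B_vals.shape
        # dequantize: each 1x16 block gets its own FP8 scale,
        # then the tensor-wide FP32 scale on top
        A = (A_vals.view(m, k // 16, 16) * A_block_scales.unsqueeze(-1)).view(m, k) * a_tensor_scale
        B = (B_vals.view(n, k // 16, 16) * B_block_scales.unsqueeze(-1)).view(n, k) * b_tensor_scale
        # accumulate in FP32, store D directly in BF16 -- no output quantization
        return (A @ B.t()).to(torch.bfloat16)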

My questions:

  1. I don't know whether I am using the cublasLtMatmul() API the right way, or whether there is a correct way to get a full NVFP4 GEMM (an FP4 block-scaled GEMM with the inner FP8 scales plus a per-tensor FP32 scale dequant, like Situation 1).

Besides, the docs section "1D Block Scaling Factors Layout" (#1d-block-scaling-factors-layout) makes me a little confused.

Could you show me a simple example in a PyTorch-like style of how to get the right scale layout? (For example, suppose a contiguous tensor A has shape (m, k), and we already have the contiguous scale_a with shape (m, k // 16).)

    scale_a = scale_a.view(m // 128, 4, 32, k // 64, 4).permute(0, 3, 2, 1, 4).contiguous()  # here is my understanding, but it seems wrong?
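Spelled out step by step, here is my current reading of the 128-row x 4-column tile layout, which is what the one-liner above is trying to do. Dimensions are assumed divisible by 128 and 64, padding is not handled, and the function name and everything inside it is just my guess, so please correct me if the real layout is different:

    import torch

    def to_blocked_scale(scale_a: torch.Tensor) -> torch.Tensor:
        # scale_a: contiguous (m, k // 16) tensor of per-16-element block scales.
        # Assumes m % 128 == 0 and (k // 16) % 4 == 0 (no padding handled here).
        m, k16 = scale_a.shape
        # split rows into (row_tile, row_group, row_in_group) = (m // 128, 4, 32)
        # and scale columns into (col_tile, col_in_tile) = (k16 // 4, 4)
        t = scale_a.view(m // 128, 4, 32, k16 // 4, 4)
        # each 128x4 tile is stored contiguously (512 scales); inside a tile the
        # order is: 32 rows outermost, then the 4 row-groups, then the 4 columns
        t = t.permute(0, 3, 2, 1, 4).contiguous()
        # cuBLAS would then see this as a flat buffer of FP8 scale factors
        return t.view(-1)

If I read it right, that would put the scale of block (i, j) at flat offset ((i // 128) * (k16 // 4) + j // 4) * 512 + (i % 32) * 16 + ((i % 128) // 32) * 4 + (j % 4), but I am not sure this is what cublasLt expects.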
