CublasDx With Variable Global Pitch Per Matrix In Batched Gemm #302


Description

@tugrul512bit

I couldn't find any example of a batched GEMM with cuBLASDx where the global memory pitch or global memory stride (lda, ldb, ldc) differs across the batch.

Is it possible to dispatch a different layout per CUDA block using an array of pitch values? My work involves computing many GEMMs on sub-matrices of varying-sized large matrices, which causes the pitch (lda/ldb/ldc) to differ between every GEMM. I want to compute all GEMMs at once in a single kernel, probably with a one-matrix-per-block approach, using an array of matrix pointers, an array of matrix pitch values, and an array of matrix sizes (which are not always equal to the pitch), roughly like the hypothetical kernel signature below.
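
Conceptually, the launch I have in mind would look something like this (all names here are placeholders I made up, not an existing API):

```cpp
// Hypothetical launch shape: one GEMM per block, and everything that can
// differ between GEMMs passed as per-batch arrays.
__global__ void many_gemms(const float* const* a_ptrs,  // sub-matrix base pointers
                           const float* const* b_ptrs,
                           float* const* c_ptrs,
                           const int* lda,               // runtime pitch per GEMM
                           const int* ldb,
                           const int* ldc,
                           const int3* sizes);           // m, n, k per GEMM (not always equal to pitch)
```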

Pitch values are not known at compile time. In some cases there can be 15 different pitch values across a batch of 100 GEMMs. I still need to compute them all in a single kernel call, without a separate call per unique pitch.

Before cuBLASDx, I tried the batched cuBLAS API, which required 15 kernel calls for the 15 different pitch combinations of A, B, and C. Some cases have nearly as many unique pitch combinations as the batch count, so this becomes slower than a normal batched GEMM and is still not fast enough compared to computing each GEMM sequentially (the matrix sizes are not large enough to fully utilize the GPU with a normal cuBLAS call). Running multiple batches in different streams also adds event-based synchronization overhead and loses the benefits of a single kernel, such as load balancing between the resident blocks of one kernel.

There is also the possibility of more unique pitch values for each of A, B, and C, perhaps 15 per parameter, so more than 3000 combinations (15^3 = 3375) can occur at runtime; selecting a global layout in the kernel from 3000 compile-time-generated possibilities would be too slow.

I need something like a functor that is called during index computation for global memory accesses, such as

```cpp
auto index = x + y * pitchFunctor(); // row-major
```
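
For example, the functor could be as simple as a lookup of this block's pitch (PitchFunctor is just a name I made up):

```cpp
// Sketch of the functor I mean: every thread of a block resolves the
// same runtime pitch, selected by the GEMM index that the block works on.
struct PitchFunctor {
    const int* pitches;  // one pitch per GEMM in the batch, filled on the host
    __device__ int operator()() const { return pitches[blockIdx.x]; }
};
// An instance (e.g. pitchFunctor) would be passed as a kernel argument and
// used inside the kernel as: auto index = x + y * pitchFunctor();
```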

Another place I would need it is during the generation of a TMA descriptor, such as

```cpp
getCuTensorMapEncodeTiled(..., pitchFunctor() * sizeof(T), ...)
```
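
On the host side, I imagine building one descriptor per matrix with its runtime pitch through the CUDA driver API cuTensorMapEncodeTiled (CUDA 12+); a sketch under that assumption, with error handling omitted:

```cpp
#include <cuda.h>

// Build a TMA descriptor for one column-major float matrix whose pitch
// (leading dimension) is only known at runtime. Note that TMA requires
// the stride in bytes to be a multiple of 16.
CUtensorMap make_tensor_map(float* ptr, cuuint64_t rows, cuuint64_t cols,
                            cuuint64_t pitch_elems,
                            cuuint32_t box_rows, cuuint32_t box_cols)
{
    CUtensorMap map{};
    cuuint64_t global_dim[2]     = {rows, cols};                  // innermost dimension first
    cuuint64_t global_strides[1] = {pitch_elems * sizeof(float)}; // stride of dim 1, in bytes
    cuuint32_t box_dim[2]        = {box_rows, box_cols};          // tile moved per TMA operation
    cuuint32_t elem_strides[2]   = {1, 1};
    cuTensorMapEncodeTiled(&map, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
                           2, ptr, global_dim, global_strides,
                           box_dim, elem_strides,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_NONE,
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return map;
}
```

The resulting array of descriptors (one per matrix, or per unique pitch) would then be copied to the device and indexed per block.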

Or, failing that, can I do the load/store operations manually without cuBLASDx and let cuBLASDx do only the multiplications? Is this also possible? A minimal sketch of what I have in mind follows.
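
Something like this, assuming the shared-memory pointer execute overload from the cuBLASDx examples and the default column-major arrangement (names like batched_gemm are my own):

```cpp
#include <cublasdx.hpp>

// Compile-time GEMM description: only the tile shape, precision, etc. are
// fixed at compile time; the global memory layout stays entirely in my hands.
using BLAS = decltype(cublasdx::Size<32, 32, 32>()
                    + cublasdx::Precision<float>()
                    + cublasdx::Type<cublasdx::type::real>()
                    + cublasdx::Function<cublasdx::function::MM>()
                    + cublasdx::SM<800>()
                    + cublasdx::Block());

__global__ void batched_gemm(const float* const* a_ptrs,
                             const float* const* b_ptrs,
                             float* const* c_ptrs,
                             const int* lda, const int* ldb, const int* ldc)
{
    constexpr int M = cublasdx::size_of<BLAS>::m;
    constexpr int N = cublasdx::size_of<BLAS>::n;
    constexpr int K = cublasdx::size_of<BLAS>::k;

    extern __shared__ __align__(16) char smem[];
    float* sa = reinterpret_cast<float*>(smem);  // M x K, column-major
    float* sb = sa + M * K;                      // K x N, column-major
    float* sc = sb + K * N;                      // M x N, column-major

    const int b = blockIdx.x;  // one GEMM per block
    const unsigned tid  = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    const unsigned step = blockDim.x * blockDim.y * blockDim.z;

    // Manual global->shared loads using this matrix's runtime pitch.
    for (unsigned i = tid; i < M * K; i += step)
        sa[i] = a_ptrs[b][(i / M) * lda[b] + (i % M)];
    for (unsigned i = tid; i < K * N; i += step)
        sb[i] = b_ptrs[b][(i / K) * ldb[b] + (i % K)];
    __syncthreads();

    BLAS().execute(1.0f, sa, sb, 0.0f, sc);  // cuBLASDx does only the multiplication
    __syncthreads();

    // Manual shared->global store using the runtime pitch of C.
    for (unsigned i = tid; i < M * N; i += step)
        c_ptrs[b][(i / M) * ldc[b] + (i % M)] = sc[i];
}

// Launch sketch: one block per GEMM, with (M*K + K*N + M*N) * sizeof(float)
// bytes of dynamic shared memory:
//   batched_gemm<<<batch_count, BLAS::block_dim, smem_bytes>>>(...);
```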
