Hi all,
- It would be great to have some Python bindings for cuSparseLt, because it'll be a while until PyTorch supports this for all dtypes, especially low-precision ones such as int8.
- Using the C++ API, I'm only getting about a 30% speedup for sparse int8 x int8 compared to dense torch._int_mm. Is that expected? I would have expected more, given the hardware claim of roughly twice the throughput with 2:4 structured sparsity.
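For concreteness, here is a minimal sketch of the kind of dense-baseline timing I mean; the 4096-cubed shapes, iteration counts, and event-based timing are just an illustration, not my exact benchmark:

```python
# Minimal sketch: timing the dense int8 x int8 baseline with CUDA events.
# Shapes and iteration counts are illustrative only.
import torch

M = N = K = 4096
a = torch.randint(-128, 127, (M, K), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (K, N), dtype=torch.int8, device="cuda")

# Warm up so kernel selection and caches don't skew the first timing.
for _ in range(10):
    torch._int_mm(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    torch._int_mm(a, b)  # dense int8 x int8 -> int32
end.record()
torch.cuda.synchronize()

ms = start.elapsed_time(end) / iters
tops = 2 * M * N * K / (ms * 1e-3) / 1e12
print(f"dense _int_mm: {ms:.3f} ms/iter, ~{tops:.1f} TOPS")
```

The sparse side goes through the cuSparseLt C++ calls on the same shapes, which is exactly why Python bindings would make this comparison so much easier to share.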
I'm pretty sure int8 x int8 matmul isn't bandwidth-limited on any modern GPU, is it?
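A back-of-envelope arithmetic-intensity check seems to confirm that; the 624 TOPS and 2 TB/s peaks below are rough A100-class numbers I'm assuming for illustration, not measured values:

```python
# Back-of-envelope arithmetic intensity for an int8 GEMM (int8 inputs, int32 output).
# Peak figures are rough A100-class assumptions, just for illustration.
M = N = K = 4096
ops = 2 * M * N * K                      # multiply-adds counted as 2 ops
bytes_moved = M * K + K * N + 4 * M * N  # 1 byte per int8 input, 4 bytes per int32 output
intensity = ops / bytes_moved            # ~1365 ops/byte at 4096^3

peak_int8_ops = 624e12   # assumed dense int8 tensor-core peak (ops/s)
peak_bw = 2.0e12         # assumed HBM bandwidth (bytes/s)
balance = peak_int8_ops / peak_bw        # ~312 ops/byte needed to be compute-bound

print(f"arithmetic intensity: {intensity:.0f} ops/byte, machine balance: {balance:.0f} ops/byte")
# intensity >> balance, so a GEMM at this size should be compute-bound, not bandwidth-bound.
```

So if the kernel really is compute-bound, I'd expect the 2:4 sparse path to get closer to the advertised 2x, which is why the ~30% I'm seeing surprises me.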