Hello! We are considering support for f16 (and bf16) via the half crate in ndarray (rust-ndarray/ndarray#1551), but we are seeing rather dismal performance on matrix multiplication for the new types: f16 appears to be roughly three orders of magnitude slower than f32. After some debugging, I believe this gap is a testament to matrixmultiply's performance: the f32 path on my Apple M2 chip is hitting vectorized assembly instructions, so I think most of the performance difference is thanks to matrixmultiply's very fast sgemm.
In light of this, I was wondering what the appetite would be for supporting f16 here in the matrixmultiply crate.
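For context on the gap, here is a minimal sketch (hypothetical, not matrixmultiply's or ndarray's actual code) of the kind of naive triple-loop fallback a type without a tuned kernel effectively gets; an f16 path would look similar, with `half::f16` converted to f32 per element. matrixmultiply's real sgemm instead packs panels and uses vectorized microkernels, which is where the ~1000x difference comes from:

```rust
// Hypothetical naive GEMM (row-major, C = A * B) — the shape of a generic
// fallback for a type with no dedicated kernel. A tuned sgemm packs A and B
// into cache-friendly panels and runs a SIMD microkernel instead.
fn naive_gemm(m: usize, k: usize, n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32; // an f16 path would typically accumulate in f32 too
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    // 2x2 example: [1 2; 3 4] * [5 6; 7 8] = [19 22; 43 50]
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0f32; 4];
    naive_gemm(2, 2, 2, &a, &b, &mut c);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
    println!("{:?}", c);
}
```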
cc: @swfsql, who has been the champion for f16 in ndarray.