Releases: RIKEN-RCCS/GEMMul8
v2.0.3
v2.0.2
Fix: value for scaling
v2.0.1
Perf: Improved the moduli list for FP8-based emulation
v2.0.0
Opt: Add FP8 backend and streamline API
- Breaking change: Removed the UseExtraWorkspace option.
- Added stream support.
- Enabled the internal GEMM implementation to use cuBLASLt.
- Added an FP8-based implementation and an FP8 backend option.
- Introduced gemmul8::Backend to select the emulation backend:
  - gemmul8::Backend::INT8 (default)
  - gemmul8::Backend::FP8
- Added a sample/ directory with several example programs.
- Updated test programs.
- Revised the hook (hijack) mode strategy.
v1.1.1
Merge pull request #2 from elbriggs/main
Fix hijack mode on AMD platforms
v1.1.0
Add: UseExtraWorkspace option

This commit introduces UseExtraWorkspace as a public template parameter for workSize() and gemm(), allowing users to explicitly control whether an extra internal workspace is used. Numerical accuracy and computed results are unchanged compared to previous releases.

Extra workspace usage is now enabled by default (UseExtraWorkspace = true), providing improved performance for typical GEMM workloads at the cost of increased GPU memory usage.

This change modifies the public template interfaces and the default execution behavior, and therefore is not backward compatible with previous releases. Users who want to preserve the previous memory behavior must explicitly set UseExtraWorkspace = false.
v1.0.2
Fix: correct handling of subnormals and remove undefined shifts in T2…
v1.0.1
Improved constant table values in table.hpp
v1.0.0
Perf: Improved the performance of INT8 GEMMs
E.g., on GH200, the performance has improved as follows:
ZGEMM
- fast-14: 111 TFLOPS -> 127 TFLOPS
- accu-14: 103 TFLOPS -> 117 TFLOPS
DGEMM
- fast-14: 82 TFLOPS -> 92 TFLOPS
- accu-14: 75 TFLOPS -> 85 TFLOPS
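
For readers who want the relative gains, the GH200 figures quoted for v1.0.0 work out to roughly 12 to 14 percent per configuration. The following is a minimal sketch of that arithmetic only; the names GH200_TFLOPS and speedup_percent are illustrative and are not part of GEMMul8.

```python
# TFLOPS figures quoted in the v1.0.0 notes for GH200.
# Keys are (routine, mode); values are (before, after).
# The dictionary and function names are illustrative, not GEMMul8 API.
GH200_TFLOPS = {
    ("ZGEMM", "fast-14"): (111, 127),
    ("ZGEMM", "accu-14"): (103, 117),
    ("DGEMM", "fast-14"): (82, 92),
    ("DGEMM", "accu-14"): (75, 85),
}


def speedup_percent(before, after):
    """Relative throughput gain in percent."""
    return (after - before) / before * 100.0


for (routine, mode), (before, after) in GH200_TFLOPS.items():
    gain = speedup_percent(before, after)
    print(f"{routine} {mode}: {before} -> {after} TFLOPS (+{gain:.1f}%)")
```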