Releases: RIKEN-RCCS/GEMMul8

v2.0.3

17 Feb 20:01

Fix: Memory usage calculation

v2.0.2

17 Feb 19:32

Fix: value for scaling

v2.0.1

17 Feb 15:20

Perf: Improved the moduli list for FP8-based emulation

v2.0.0

17 Feb 00:48

Opt: Add FP8 backend and streamline API

- Breaking change: Removed the UseExtraWorkspace option.
- Added stream support.
- Enabled the internal GEMM implementation to use cuBLASLt.
- Added an FP8-based implementation and an FP8 backend option.
- Introduced gemmul8::Backend to select the emulation backend:
  - gemmul8::Backend::INT8 (default)
  - gemmul8::Backend::FP8
- Added a sample/ directory with several example programs.
- Updated test programs.
- Revised the hook (hijack) mode strategy.
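
As a rough illustration of the backend selector described above, the sketch below mirrors the two enumerator names from these notes in a hypothetical `sketch` namespace. The `describe` function and its defaulted parameter are assumptions made for illustration; they are not taken from the GEMMul8 headers, so check the library's own declarations for the real signatures.

```cpp
#include <cassert>
#include <string>

namespace sketch {  // hypothetical stand-in, NOT the gemmul8 namespace

// Mirrors the two backend names listed in the release notes.
enum class Backend { INT8, FP8 };

// Hypothetical helper: the backend parameter defaults to INT8, so callers
// that do not pass a backend stay on the INT8 path.
inline std::string describe(Backend backend = Backend::INT8) {
    switch (backend) {
        case Backend::INT8: return "INT8 emulation";
        case Backend::FP8:  return "FP8 emulation";
    }
    return "unknown";
}

// Checks that omitting the argument is the same as asking for INT8.
inline void checkDefaultBackend() {
    assert(describe() == describe(Backend::INT8));
}

}  // namespace sketch
```

With a defaulted parameter of this kind, only callers that want the FP8 path need to change their call sites, for example `sketch::describe(sketch::Backend::FP8)`.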

v1.1.1

20 Jan 04:28
cc32d55

Merge pull request #2 from elbriggs/main

Fix hijack mode on AMD platforms

v1.1.0

24 Dec 13:57

Add: UseExtraWorkspace option

This commit introduces UseExtraWorkspace as a public template parameter for workSize() and gemm(), allowing users to explicitly control whether an extra internal workspace is used.
Numerical accuracy and computed results are unchanged compared to previous releases.

Extra workspace usage is now enabled by default (UseExtraWorkspace = true), providing improved performance for typical GEMM workloads at the cost of increased GPU memory usage.

This change modifies the public template interfaces and the default execution behavior, and therefore is not backward compatible with previous releases.

Users who want to preserve the previous memory behavior must explicitly set UseExtraWorkspace = false.
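
To make the effect of the new default concrete, here is a minimal sketch of a function template with a boolean template parameter that defaults to `true`. The `sketch` namespace, the three-argument signature, and the byte formula are all made up for illustration; the real `workSize()` and `gemm()` signatures and workspace sizes are not reproduced here. Note also that v2.0.0 removed this option.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {  // hypothetical stand-in, NOT the real GEMMul8 interface

// Made-up base footprint in bytes, for illustration only.
constexpr std::size_t baseBytes(std::size_t m, std::size_t n, std::size_t k) {
    return (m * k + k * n) * sizeof(signed char);
}

// The template parameter defaults to true, so a call without an explicit
// argument reserves the (hypothetical) extra buffer as well.
template <bool UseExtraWorkspace = true>
constexpr std::size_t workSize(std::size_t m, std::size_t n, std::size_t k) {
    return UseExtraWorkspace ? 2 * baseBytes(m, n, k) : baseBytes(m, n, k);
}

// Runtime check: opting out never needs more memory than the default.
inline void checkOptOutIsSmaller(std::size_t m, std::size_t n, std::size_t k) {
    assert(workSize<false>(m, n, k) <= workSize(m, n, k));
}

}  // namespace sketch

// Omitting the template argument selects the extra workspace.
static_assert(sketch::workSize(4, 4, 4) == sketch::workSize<true>(4, 4, 4),
              "default must match UseExtraWorkspace = true");
```

In this sketch, existing code that calls `workSize(m, n, k)` silently gets the larger figure after the default changes, which is why the release is flagged as not backward compatible; writing `workSize<false>(m, n, k)` keeps the smaller footprint.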

v1.0.2

21 Dec 21:30

Fix: correct handling of subnormals and remove undefined shifts in T2…

v1.0.1

21 Dec 15:17

Improved constant table values in table.hpp

v1.0.0

05 Dec 11:14

Perf: Improved the performance of INT8 GEMMs

For example, on GH200, throughput improved as follows:

ZGEMM
fast-14: 111 TFLOPS -> 127 TFLOPS
accu-14: 103 TFLOPS -> 117 TFLOPS

DGEMM
fast-14: 82 TFLOPS -> 92 TFLOPS
accu-14: 75 TFLOPS -> 85 TFLOPS
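
The figures above correspond to relative speedups of roughly 12% to 14%. The short sketch below computes the ratio for each configuration; the helper names `speedup` and `printSpeedups` are illustrative only, and the only data used are the TFLOPS values quoted in these notes.

```cpp
#include <cassert>
#include <cstdio>

// Relative speedup of a new throughput figure over an old one.
constexpr double speedup(double before_tflops, double after_tflops) {
    return after_tflops / before_tflops;
}

// Prints the speedup for each configuration quoted in the release notes.
inline void printSpeedups() {
    struct Row { const char* name; double before; double after; };
    const Row rows[] = {
        {"ZGEMM fast-14", 111.0, 127.0},
        {"ZGEMM accu-14", 103.0, 117.0},
        {"DGEMM fast-14",  82.0,  92.0},
        {"DGEMM accu-14",  75.0,  85.0},
    };
    for (const Row& r : rows) {
        assert(r.before > 0.0);  // guard against division by zero
        std::printf("%-14s %.3fx\n", r.name, speedup(r.before, r.after));
    }
}
```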