Releases · RIKEN-RCCS/GEMMul8

17 Feb 20:01

v2.0.3

950f9e3

v2.0.3 Latest

Fix: Memory usage calculation

17 Feb 19:32

UCHINO-Yuki

v2.0.2

5877f67

v2.0.2

Fix: value for scaling

17 Feb 15:20

UCHINO-Yuki

v2.0.1

58d57d4

v2.0.1

Perf: Improved the moduli list for FP8-based emulation

17 Feb 00:48

UCHINO-Yuki

v2.0.0

c2a83a9

v2.0.0

Opt: Add FP8 backend and streamline API

- Breaking change: Removed the UseExtraWorkspace option.
- Added stream support.
- Enabled the internal GEMM implementation to use cuBLASLt.
- Added an FP8-based implementation and an FP8 backend option.
- Introduced gemmul8::Backend to select the emulation backend:
  - gemmul8::Backend::INT8 (default)
  - gemmul8::Backend::FP8
- Added a sample/ directory with several example programs.
- Updated test programs.
- Revised the hook (hijack) mode strategy.
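
To illustrate how a backend enum with a default value behaves, here is a minimal, self-contained sketch. It does not include the GEMMul8 header: the `sketch` namespace and the `backend_name` helper are hypothetical stand-ins, and only the enumerator names `INT8` and `FP8` are taken from the notes above. Whether the real library selects the backend via a function argument or a template parameter is not stated here, so the default-argument form below is an assumption.

```cpp
#include <cassert>
#include <string>

// Stand-in namespace: NOT the GEMMul8 API, only a mock that mirrors
// the two enumerator names listed in the release notes.
namespace sketch {

enum class Backend { INT8, FP8 };

// Hypothetical helper: the backend argument defaults to INT8,
// matching the "(default)" remark in the release notes.
inline const char* backend_name(Backend b = Backend::INT8) {
    return b == Backend::FP8 ? "FP8" : "INT8";
}

}  // namespace sketch
```

Calling `sketch::backend_name()` with no argument picks the INT8 path, while `sketch::backend_name(sketch::Backend::FP8)` opts into FP8 explicitly.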

20 Jan 04:28

UCHINO-Yuki

v1.1.1

cc32d55

v1.1.1

Merge pull request #2 from elbriggs/main

Fix hijack mode on AMD platforms

24 Dec 13:57

UCHINO-Yuki

v1.1.0

dd20e53

v1.1.0

Add: UseExtraWorkspace option

This commit introduces UseExtraWorkspace as a public template parameter for workSize() and gemm(), allowing users to explicitly control whether an extra internal workspace is used.
Numerical accuracy and computed results are unchanged compared to previous releases.

Extra workspace usage is now enabled by default (UseExtraWorkspace = true), providing improved performance for typical GEMM workloads at the cost of increased GPU memory usage.

This change modifies the public template interfaces and the default execution behavior, and is therefore not backward compatible with previous releases.

Users who want to preserve the previous memory behavior must explicitly set UseExtraWorkspace = false.
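
The pattern described above, a boolean template parameter that defaults to `true`, can be sketched as follows. This is not the library's `workSize()`: the `sketch::work_size` name, its single `base_bytes` argument, and the factor of two are placeholders chosen purely for illustration, since the actual workspace formula is not given in these notes.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in namespace: NOT the GEMMul8 API.
namespace sketch {

// Hypothetical size query. The doubling is an arbitrary placeholder,
// not the real memory cost of the extra workspace.
template <bool UseExtraWorkspace = true>
constexpr std::size_t work_size(std::size_t base_bytes) {
    return UseExtraWorkspace ? base_bytes * 2 : base_bytes;
}

}  // namespace sketch
```

With this shape, `sketch::work_size(n)` silently picks the larger workspace, while `sketch::work_size<false>(n)` keeps the smaller one. That is why a changed default alters behavior for callers who do not spell out the parameter.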

21 Dec 21:30

UCHINO-Yuki

v1.0.2

a943585

v1.0.2

Fix: correct handling of subnormals and remove undefined shifts in T2…

21 Dec 15:17

UCHINO-Yuki

v1.0.1

063c08f

v1.0.1

Improved constant table values in table.hpp

05 Dec 11:14

UCHINO-Yuki

v1.0.0

3713c25

v1.0.0

Perf: Improved the performance of INT8 GEMMs

For example, on GH200, performance improved as follows:

ZGEMM
fast-14: 111 TFLOPS -> 127 TFLOPS
accu-14: 103 TFLOPS -> 117 TFLOPS

DGEMM
fast-14: 82 TFLOPS -> 92 TFLOPS
accu-14: 75 TFLOPS -> 85 TFLOPS
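
To put the figures above in relative terms, the small helper below converts a before/after throughput pair into a percentage gain. The function name `gain_percent` is made up for this example; the inputs are the TFLOPS values quoted above.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain in percent: (after / before - 1) * 100.
inline double gain_percent(double before_tflops, double after_tflops) {
    return (after_tflops / before_tflops - 1.0) * 100.0;
}
```

Applied to the listed numbers, this gives gains of roughly 14.4% (ZGEMM fast-14), 13.6% (ZGEMM accu-14), 12.2% (DGEMM fast-14), and 13.3% (DGEMM accu-14).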
