# GEMMul8 (GEMMulate): GEMM emulation using INT8/FP8 matrix engines based on the Ozaki Scheme II
GEMMul8 is a library for emulating high-precision matrix multiplication (SGEMM, DGEMM, CGEMM, and ZGEMM) using low-precision matrix engines, including INT8 and FP8. The library is based on the Ozaki Scheme II and supports selectable INT8- or FP8-based emulation backends within each GEMM routine. This design enables bit-wise reproducible results while achieving superior performance and power efficiency compared to native floating-point implementations.
- Technical Overview
- Build
- Running Test Codes
- Usage
- Numerical results
- Acknowledgment
- Contact (Responsible Developer)
- References
- Citations
- License
## Technical Overview

GEMMul8 implements GEMM emulation based on the Ozaki Scheme II, which utilizes the Chinese Remainder Theorem (CRT).
A larger number of moduli (`num_moduli`) for the CRT results in higher precision at the cost of increased computation time.
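As a rough sketch of the idea (see the references below for the precise formulation), let $m_1, \dots, m_t$ be pairwise coprime moduli, where $t$ corresponds to `num_moduli`. The scheme computes the integer-encoded product modulo each modulus using low-precision GEMMs and recombines the residues via the CRT:

```math
C \equiv A B \ (\mathrm{mod}\ m_i), \quad i = 1, \dots, t
\;\;\Longrightarrow\;\;
C \ \mathrm{mod}\ (m_1 m_2 \cdots m_t) \ \text{is uniquely determined.}
```

The exact product is recovered whenever its entries fit within the range spanned by $m_1 \cdots m_t$, so adding moduli widens the representable range (higher precision) at the cost of more low-precision GEMMs.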
This project supports both CUDA and HIP backends for GPU computation.
The GEMM emulation ensures bit-wise numerical reproducibility. This is achieved through a combination of exact, error-free computations and a fixed order of operations.
However, please note that this reproducibility guarantee may not extend to environments using different toolkits or compilers. Discrepancies can arise from varying propagation rules for Inf/NaN values or differences in the precision of underlying math library functions.
GEMMul8 supports two emulation backends:

- INT8 backend: uses a standard BLAS handle (cuBLAS/hipBLAS handle) or an Lt handle (cuBLASLt/hipBLASLt handle).
- FP8 backend: uses an Lt handle (cuBLASLt/hipBLASLt handle).
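For instance, a minimal sketch of creating an Lt handle for the FP8 backend on CUDA (illustrative only; the Lt-handle overload of `gemmul8::gemm` is shown in the usage section below):

```cpp
#include <cublasLt.h>

// Minimal sketch: the FP8 backend requires an Lt handle, while the INT8
// backend accepts either handle type (see the backend list above).
void with_lt_handle() {
    cublasLtHandle_t lt_handle;
    cublasLtCreate(&lt_handle);
    // ... pass lt_handle to the Lt-handle overload of gemmul8::gemm ...
    cublasLtDestroy(lt_handle);
}
```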
## Build

Run `make` in the `GEMMul8/GEMMul8` directory to compile all source files:

```bash
make -j8
```

To rebuild from scratch, run `make clean && make -j8`.

The build accepts the following options:
| Option | Default | Description |
|---|---|---|
| `CUDA_PATH` | `/usr/local/cuda` | Path to your CUDA toolkit installation. Used for CUDA backends. |
| `HIP_PATH` | `/opt/rocm` | Path to your HIP (ROCm) toolkit installation. Used for HIP backends. |
| `BACKEND` | `auto` | Selects the GPU backend: `cuda`, `hip`, or `auto` (auto-detect). |
| `GPU_ARCH` | `auto` | Target GPU architecture. Examples: `80` (A100), `90` (H100), `gfx90a` (MI250X), `gfx942` (MI300X). |
> [!NOTE]
> - `BACKEND=auto` will attempt to detect your GPU vendor automatically.
> - `GPU_ARCH=auto` will automatically detect and use the appropriate compute capability or architecture for your GPU.
> - The target GPU architecture can be found in, e.g., the CUDA GPU Compute Capability list or the AMD GPU hardware specifications.
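If you need the numeric value to pass as `GPU_ARCH` on an NVIDIA system, a minimal query program (illustrative; not part of the build system) is:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability of device 0, e.g., "90" for an H100.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("GPU_ARCH=%d%d\n", prop.major, prop.minor);
    return 0;
}
```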
Build for an NVIDIA H100/H200 GPU (Compute Capability 9.0):

```bash
make -j8 BACKEND=cuda CUDA_PATH=/usr/local/cuda GPU_ARCH=90
```

Build for an AMD MI300X GPU (gfx942 architecture):

```bash
make -j8 BACKEND=hip HIP_PATH=/opt/rocm GPU_ARCH=gfx942
```

## Running Test Codes

After building the project, you can run the test codes from the `testing` directory.
Navigate to the `testing` directory and use `make run MODE="test_modes"` to run tests for different precisions.
| Value | Description |
|---|---|
| `SGEMM` | Runs SGEMM tests. |
| `DGEMM` | Runs DGEMM tests. |
| `CGEMM` | Runs CGEMM tests. |
| `ZGEMM` | Runs ZGEMM tests. |
| `accuracy` | Tests numerical accuracy (maximum element-wise relative error). |
| `flops` | Measures TFLOPS with square matrices. |
| `watt` | Measures power consumption (watts) and GFLOPS/watt with square matrices. |
| `all` | Runs `accuracy`, `flops`, and `watt`. |
```bash
# Run accuracy test (DGEMM)
make run MODE="DGEMM accuracy"

# Run accuracy, flops, & watt tests (SGEMM)
make run MODE="SGEMM all"

# Run accuracy & flops tests (SGEMM & DGEMM)
make run MODE="SGEMM DGEMM accuracy flops"
```

## Usage

This library provides two ways to use the GEMM emulation.
### 1. Explicit function calls

Call GEMMul8 functions explicitly from your source code. This gives you fine-grained control over the emulation parameters.
See the sample code in `sample/`.
The arguments for `gemmul8::gemm` closely match the standard `cublas<t>gemm` and `hipblasXgemm` interfaces, with additional parameters that control internal preprocessing and workspace reuse.
GEMMul8 provides both a workspace query function (`workSize`) and an extended GEMM computation interface (`gemm`).
See `include/gemmul8.hpp` for the full function signatures:

- `gemmul8::workSize<is_Complex, Backend>(...)`
- `gemmul8::gemm<T, Backend>(...)` (BLAS handle / Lt handle overloads)
- `gemmul8::gemm` internally preprocesses the input matrices `A` and `B` into a backend-specific low-precision representation (INT8 or FP8) and performs modular multiplications across multiple moduli.
- This preprocessing step can be skipped in consecutive GEMM calls if the same matrices are reused, allowing for substantial performance gains.
- If `enable_skip_scal{A|B} = true`, additional workspace is reserved so that the preprocessed representation of `A`/`B` can be retained between calls.
- If both `enable_skip_scal{A|B}` and `skip_scal{A|B}` are `true`, the preprocessing step for `A`/`B` is actually skipped, and the previously prepared data are reused for faster execution. This mode assumes that the contents of `A`/`B` in device memory remain unchanged.
- This mechanism is particularly effective when repeatedly multiplying the same `A` (or `B`) with different `B` (or `A`) matrices.
> [!NOTE]
> When using `skip_scalA` / `skip_scalB`, the preprocessing step that converts `A`/`B` into an internal backend-specific representation (INT8/FP8) is skipped.
> For correctness, the following conditions must all hold between consecutive GEMM calls:
>
> - The dimensions (`M`/`K` for `A`, `K`/`N` for `B`) must be identical to those in the previous call.
> - The operation type (`CUBLAS_OP_N`/`CUBLAS_OP_T`) for `A`/`B` must be the same as before.
> - The value of `num_moduli` must remain unchanged.
> - The `fastmode` setting must be identical to that of the previous call.
> - The contents of `A`/`B` in device memory must not be modified between calls.
> - The selected emulation backend must be identical in both calls (INT8 or FP8).
> [!NOTE]
> - GEMMul8 does not verify the contents of `A`/`B` when skipping; correctness requires the user to ensure immutability.
> - `alpha`/`beta` follow the BLAS handle pointer mode (host or device). GEMMul8 does not override the pointer mode.
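As a small illustration of the pointer-mode note (standard cuBLAS API, not specific to GEMMul8; `use_host_scalars` is a hypothetical helper name):

```cpp
#include <cublas_v2.h>

// GEMMul8 follows the handle's pointer mode, so set it explicitly and keep
// alpha/beta on the matching side (host pointers in this sketch).
void use_host_scalars(cublasHandle_t handle) {
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    // A subsequent gemmul8::gemm / cublasDgemm_v2 call on this handle
    // would then receive host pointers for alpha and beta.
}
```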
> [!CAUTION]
> If any of these conditions differ, the cached scaled data become invalid, and skipping must not be used.
> In such cases, set `skip_scalA = false` / `skip_scalB = false`.
> [!CAUTION]
> This skip mechanism is designed for repeated GEMM calls with identical `A` or `B`. Use it only when you are certain that the input matrices and configuration have not changed. When in doubt, disable skipping to ensure correctness.
#include "gemmul8.hpp"
// 1. Create a handle to the cuBLAS library context
cublasHandle_t cublas_handle;
cublasCreate(&cublas_handle);
// 2. Settings
const unsigned num_moduli = 14u;
const bool fastmode = false;
// 3. Matrix shapes
const size_t m1 = 64, n1 = 10, k1 = 10; // 1st GEMM: 64×10 × 10×10
const size_t m2 = 20, n2 = 10, k2 = 10; // 2nd GEMM: 20×10 × 10×10
// 4. Allocate workspace
size_t worksizeA, worksizeB;
const size_t worksize = gemmul8::workSize(
std::max(m1, m2), std::max(n1, n2), std::max(k1, k2),
num_moduli, true, true, &worksizeA, &worksizeB);
void *work;
cudaMalloc(&work, worksize);
// NOTE: worksizeA/worksizeB are byte sizes for the dedicated A/B work areas.
const size_t offsetA = worksizeA;
const size_t offsetB = worksizeB;
int8_t *workA = reinterpret_cast<int8_t *>(work); // dedicated workspace for A
int8_t *workB = workA + offsetA; // dedicated workspace for B
int8_t *work_rem = workB + offsetB; // remaining workspace
// 5. Run GEMM (first call: preprocessing performed)
gemmul8::gemm(cublas_handle,
CUBLAS_OP_N, CUBLAS_OP_N,
m1, n1, k1,
&alpha, devA1, lda,
devB, ldb,
&beta, devC, ldc,
num_moduli, fastmode, (void*)work_rem, (void*)workA, (void*)workB,
true, true, false, false);
// 6. Reuse preprocessed B (second call: skip_scalB = true)
gemmul8::gemm(cublas_handle,
CUBLAS_OP_N, CUBLAS_OP_N,
m1, n1, k1,
&alpha, devA1, lda,
devB, ldb,
&beta, devC, ldc,
num_moduli, fastmode, (void*)work_rem, (void*)workA, (void*)workB,
true, true, false, true);
// 7. Free workspace
cudaFree(work);
// 8. Destroy a handle
cublasDestroy(cublas_handle);Intercept standard GEMM calls automatically without modifying the application source code.
- `cublasSgemm`, `cublasDgemm`, `cublasCgemm`, `cublasZgemm`
- `cublasSgemm_v2`, `cublasDgemm_v2`, `cublasCgemm_v2`, `cublasZgemm_v2`
- `cublasGemmEx`
- `hipblasSgemm`, `hipblasDgemm`, `hipblasCgemm`, `hipblasZgemm`
- `hipblasGemmEx`, `hipblasGemmEx_v2`
- Build the library.
- Set the `LD_PRELOAD` environment variable:

  ```bash
  export LD_PRELOAD=/path-to-GEMMul8/lib/libgemmul8.so
  ```

- Run your application.
The hook's behavior is controlled by environment variables, for example:

```bash
export GEMMUL8_BACKEND=INT8
export GEMMUL8_NUM_MOD_D=15
export GEMMUL8_NUM_MOD_S=7
export GEMMUL8_NUM_MOD_Z=15
export GEMMUL8_NUM_MOD_C=7
export GEMMUL8_FASTMODE_D=0
export GEMMUL8_FASTMODE_S=1
export GEMMUL8_FASTMODE_Z=0
export GEMMUL8_FASTMODE_C=1
export GEMMUL8_MAXWS_BACKEND=INT8
export GEMMUL8_MAX_M=4096
export GEMMUL8_MAX_N=4096
export GEMMUL8_MAX_K=4096
export GEMMUL8_MAX_NUM_MOD=18
export GEMMUL8_SKIP_SCALE_A=1
export GEMMUL8_SKIP_SCALE_B=1
```

| Variable | Default | Applies to | Description |
|---|---|---|---|
| `GEMMUL8_BACKEND` | `0` | all | Selects the emulation backend (`0` or `INT8` = INT8-based emulation, `1` or `FP8` = FP8-based emulation). |
| `GEMMUL8_NUM_MOD_D` | `0` | DGEMM | Number of moduli used in DGEMM emulation. When `num_moduli < 2` or `20 < num_moduli`, native DGEMM is used. |
| `GEMMUL8_NUM_MOD_S` | `0` | SGEMM | Number of moduli used in SGEMM emulation. When `num_moduli < 2` or `14 < num_moduli`, native SGEMM is used. |
| `GEMMUL8_NUM_MOD_Z` | `0` | ZGEMM | Number of moduli used in ZGEMM emulation. When `num_moduli < 2` or `20 < num_moduli`, native ZGEMM is used. |
| `GEMMUL8_NUM_MOD_C` | `0` | CGEMM | Number of moduli used in CGEMM emulation. When `num_moduli < 2` or `14 < num_moduli`, native CGEMM is used. |
| `GEMMUL8_FASTMODE_D` | `0` | DGEMM | Enables fast mode (`1` = fast mode, `0` = accurate mode). |
| `GEMMUL8_FASTMODE_S` | `0` | SGEMM | Enables fast mode (`1` = fast mode, `0` = accurate mode). |
| `GEMMUL8_FASTMODE_Z` | `0` | ZGEMM | Enables fast mode (`1` = fast mode, `0` = accurate mode). |
| `GEMMUL8_FASTMODE_C` | `0` | CGEMM | Enables fast mode (`1` = fast mode, `0` = accurate mode). |
| `GEMMUL8_MAXWS_BACKEND` | `0` | all | Backend assumed when computing the maximum workspace size (`0` or `INT8`, `1` or `FP8`, `2` or `BOTH`). The default corresponds to INT8. |
| `GEMMUL8_MAX_M` | `0` | all | Maximum value of `M` used to preallocate workspace memory. |
| `GEMMUL8_MAX_N` | `0` | all | Maximum value of `N` used to preallocate workspace memory. |
| `GEMMUL8_MAX_K` | `0` | all | Maximum value of `K` used to preallocate workspace memory. |
| `GEMMUL8_MAX_NUM_MOD` | `2` | all | Maximum number of moduli used when computing the size of the preallocated workspace. |
| `GEMMUL8_SKIP_SCALE_A` | `0` | all | Enables skipping redundant preprocessing for `A` (`1` = enable, `0` = disable). |
| `GEMMUL8_SKIP_SCALE_B` | `0` | all | Enables skipping redundant preprocessing for `B` (`1` = enable, `0` = disable). |
- This hook mode maintains an independent workspace per BLAS handle (`cublasHandle_t`/`hipblasHandle_t`).
- For each handle, the hook allocates three independent GPU work buffers:
  - `workA`: cache area for preprocessed `A`
  - `workB`: cache area for preprocessed `B`
  - `workC`: remaining workspace used by the emulation routine
- Each buffer follows a grow-only policy: it is resized upward on demand and is never shrunk automatically.
- Allocation and deallocation use stream-ordered APIs (`cudaMallocAsync`/`cudaFreeAsync` or the HIP equivalents) on the current stream.
- When the same handle is used with different CUDA/HIP streams across calls, the hook enforces ordering by inserting an event dependency (`eventRecord` on the previous stream -> `streamWaitEvent` on the current stream). This prevents freeing or reallocating a workspace buffer before the previous stream has finished using it.
- Workspace sizing logic (per buffer; see the sketch after this list):
  - Let `(needA, needB, needC)` be computed from `gemmul8::workSize(...)` for the current GEMM shape and parameters.
  - Let `(prevA, prevB, prevC)` be the previously allocated sizes for the same handle.
  - If `GEMMUL8_SKIP_SCALE_A=1` and/or `GEMMUL8_SKIP_SCALE_B=1`, the hook may also enforce a per-process maximum size `(maxA, maxB, maxC)` computed from `GEMMUL8_MAX_{M,N,K,NUM_MOD}` and `GEMMUL8_MAXWS_BACKEND`.
  - The requested sizes become:
    - `reqA = max(prevA, needA, maxA)` if `GEMMUL8_SKIP_SCALE_A=1`, else `max(prevA, needA)`
    - `reqB = max(prevB, needB, maxB)` if `GEMMUL8_SKIP_SCALE_B=1`, else `max(prevB, needB)`
    - `reqC = max(prevC, needC, maxC)` if either skip flag is enabled, else `max(prevC, needC)`
- The workspaces are released when the corresponding handle is destroyed (intercepted via `cublasDestroy_v2` on CUDA or the mapped `hipblasDestroy` on HIP).
- When `GEMMUL8_SKIP_SCALE_{A|B}=1`, redundant preprocessing for `A`/`B` is skipped (see below).
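The per-buffer sizing rule above can be restated as a small sketch (illustrative names only; this is not the hook's actual code):

```cpp
#include <algorithm>
#include <cstddef>

// Per-buffer sizing rule: grow to at least the current need, never shrink
// below the previous size, and, when the relevant skip flag is enabled,
// also respect the per-process maximum derived from GEMMUL8_MAX_*.
size_t requested_size(size_t prev, size_t need, size_t cap, bool skip_enabled) {
    const size_t req = std::max(prev, need);
    return skip_enabled ? std::max(req, cap) : req;
}
```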
> [!IMPORTANT]
> - `GEMMUL8_SKIP_SCALE_{A|B}=1` enables automatic reuse of already-preprocessed intermediate data for `A`/`B` within the hook, when it is safe according to the cache conditions below.
> - The decision is based on pointer identity and cached metadata only. The hook does not verify the contents of `A`/`B`.
Automatic skipping for `A` or `B` is enabled only when all of the following hold between consecutive calls:

- `GEMMUL8_SKIP_SCALE_A=1` (for skipping `A`) and/or `GEMMUL8_SKIP_SCALE_B=1` (for skipping `B`).
- The emulation path is taken in both calls:
  - SGEMM / CGEMM: `2 <= GEMMUL8_NUM_MOD_{S|C} <= 14`
  - DGEMM / ZGEMM: `2 <= GEMMUL8_NUM_MOD_{D|Z} <= 20`
- The same BLAS handle is used (the cache is per-handle).
- The same emulation backend is used (`GEMMUL8_BACKEND`: INT8 or FP8).
- The same `fastmode` setting is used.
- The same `num_moduli` is used.
- The inner dimension `K` is identical.
- For skipping `A`, the following are identical:
  - the `A` device pointer,
  - `M`, `lda`, and `transa`,
  - the internal cached workspace pointer for `A` (`workA`).
- For skipping `B`, the following are identical:
  - the `B` device pointer,
  - `N`, `ldb`, and `transb`,
  - the internal cached workspace pointer for `B` (`workB`).

If any condition differs, the hook performs preprocessing again for that operand.
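For illustration, a hedged sketch of a call sequence in which the hook can skip preprocessing for `B` (pointer names are placeholders; assumes the library is loaded via `LD_PRELOAD` and `GEMMUL8_SKIP_SCALE_B=1`):

```cpp
#include <cublas_v2.h>

// Both calls share the handle, op types, K/N/ldb, and the same devB pointer,
// so the second intercepted call may reuse the cached preprocessed B.
void two_gemms(cublasHandle_t handle, int m, int n, int k,
               const double *devA1, const double *devA2, const double *devB,
               double *devC1, double *devC2, int lda, int ldb, int ldc) {
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm_v2(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                   &alpha, devA1, lda, devB, ldb, &beta, devC1, ldc); // preprocesses A1 and B
    cublasDgemm_v2(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                   &alpha, devA2, lda, devB, ldb, &beta, devC2, ldc); // B preprocessing skipped
}
```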
> [!TIP]
> To keep the internal workspace pointers (`workA`/`workB`) stable across calls, avoid workspace reallocation.
> Setting `GEMMUL8_MAX_M`, `GEMMUL8_MAX_N`, `GEMMUL8_MAX_K`, and `GEMMUL8_MAX_NUM_MOD` appropriately helps the hook allocate sufficiently large buffers early, reducing pointer changes.
> If you may switch `GEMMUL8_BACKEND` at runtime, set `GEMMUL8_MAXWS_BACKEND=BOTH`.
> [!CAUTION]
> Skip scaling assumes that the contents of `A` or `B` remain unchanged in GPU memory.
> If the `A`/`B` data are modified between GEMM calls, do not rely on skipping (set `GEMMUL8_SKIP_SCALE_{A|B}=0`).
> [!NOTE]
> `GEMMUL8_MAX_*` and `GEMMUL8_MAXWS_BACKEND` are read only once (on first use) to compute `wsmax`.
> The other variables (e.g., `GEMMUL8_NUM_MOD_*`, `GEMMUL8_FASTMODE_*`, `GEMMUL8_BACKEND`, `GEMMUL8_SKIP_SCALE_*`) are read at each GEMM call.
You can also set these environment variables programmatically from within your code using `setenv`:

```cpp
#include <cstdio>  // snprintf
#include <cstdlib> // setenv

// Run emulation with num_moduli = 15 & fastmode = true
char num_moduli[12];
snprintf(num_moduli, sizeof(num_moduli), "%u", 15u);
setenv("GEMMUL8_NUM_MOD_D", num_moduli, 1);
setenv("GEMMUL8_FASTMODE_D", "1", 1);

cublasDgemm_v2(...); // subsequent intercepted calls use the settings above
```

## Numerical results

See the numerical results in the separate repository: GEMMul8_numerical_results.
## Acknowledgment

> [!CAUTION]
> Please do not contact the individuals listed below regarding this code.

- Patrick Gutsche (École Normale Supérieure de Lyon, France)
- Prajval Kumar (Indian Institute of Science Education and Research, India)
- Dr. William Dawson (RIKEN Center for Computational Science, Japan)
- Dr. Toshiyuki Imamura (RIKEN Center for Computational Science, Japan)

(Affiliations as of 2025)

- Dr. Qianxiang Ma (RIKEN Center for Computational Science, Japan)
- Prof. Rio Yokota (Institute of Science Tokyo, Japan)

(Affiliations as of 2025)
## Contact (Responsible Developer)

- Yuki Uchino (RIKEN Center for Computational Science, Japan)
- yuki.uchino.fe (at) riken.jp
- Ootomo, H., & Yokota, R. (2022). Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance. The International Journal of High Performance Computing Applications, 36(4), 475-491, doi.org/10.1177/10943420221090256.
- Ootomo, H., Manabe, H., Harada, K., & Yokota, R. (2023). Quantum Circuit Simulation by SGEMM Emulation on Tensor Cores and Automatic Precision Selection. In High Performance Computing (pp. 259-276). Springer, doi.org/10.1007/978-3-031-32041-5_14.
- Ootomo, H., Ozaki, K., & Yokota, R. (2024). DGEMM on integer matrix multiplication unit. The International Journal of High Performance Computing Applications, 38(4), 297-313, https://doi.org/10.1177/10943420241239588.
- Uchino, Y., Ozaki, K., & Imamura, T. (2025). Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit. The International Journal of High Performance Computing Applications, 39(3), 462-476, doi.org/10.1177/10943420241313064.
## Citations

```bibtex
@inproceedings{10.1145/3731599.3767539,
  author    = {Uchino, Yuki and Ozaki, Katsuhisa and Imamura, Toshiyuki},
  title     = {High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines},
  year      = {2025},
  isbn      = {9798400718717},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3731599.3767539},
  doi       = {10.1145/3731599.3767539},
  booktitle = {Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages     = {1824-1831},
  numpages  = {8},
  series    = {SC Workshops '25}
}
```

```bibtex
@misc{ozaki2025ozakischemeiigemmoriented,
  title         = {Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique},
  author        = {Katsuhisa Ozaki and Yuki Uchino and Toshiyuki Imamura},
  year          = {2025},
  eprint        = {2504.08009},
  archivePrefix = {arXiv},
  primaryClass  = {cs.MS},
  url           = {https://arxiv.org/abs/2504.08009}
}
```

```bibtex
@misc{uchino2025emulationcomplexmatrixmultiplication,
  title         = {Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem},
  author        = {Yuki Uchino and Qianxiang Ma and Toshiyuki Imamura and Katsuhisa Ozaki and Patrick Lars Gutsche},
  year          = {2025},
  eprint        = {2512.08321},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC},
  url           = {https://arxiv.org/abs/2512.08321}
}
```

## License

This project is licensed under the MIT License. See the LICENSE file for details.