[ROCm] Fix the HIP GPU build for ROCm 7 and a wave64 LSTM bug by jeffdaily · Pull Request #4986 · kaldi-asr/kaldi

jeffdaily · 2026-06-12T18:33:36Z

Kaldi already ships a ROCm/HIP GPU backend (configure --use-rocm, the src/makefiles/hip_64bit.mk toolchain block, and the src/hip/hipify.h compat header). It was written against ROCm 5.0-5.2 and had bit-rotted: it no longer builds on ROCm 7, and once built it computed a wrong result on 64-wide wavefronts. This change brings the existing backend back to green on a current ROCm and makes it correct on both 32- and 64-lane GPUs.

Review it in three parts:

Build break on ROCm 7. cu-kernels.cu and feature-online-cmvn-cuda.cu defined __CUDA_ARCH__=800 before including <hipcub/hipcub.hpp>. On hipcc that suppressed the __gfx90a__ device macro, so rocPRIM fell through to a generic path that fails to compile against the ROCm 7 headers. The CUDA-version and warp definitions are reordered/guarded so hipcub sees the real device macros; the CUDA build is unchanged.
wave64 correctness. cu-math-test's UnitTestBackpropLstmNonlinearity was the one cudamatrix test that failed before the fix. The LSTM nonlinearity kernel partitioned a 256-thread block into 256/warpSize groups, assuming that value is 8. It is 8 on a 32-lane warp (NVIDIA, RDNA) but 4 on a 64-lane wavefront (CDNA, gfx90a), so rows past the fourth were left unprocessed. The kernel now derives the host warp size rather than assuming 32.
Multi-arch builds. The host warp size is resolved at runtime (via hipDeviceGetAttribute) so a single build serves both wave32 and wave64 targets; --rocm-targets can list several GPUs (for example gfx90a,gfx1100) in one configure, and each kernel launch selects its tiling from the running device's wavefront width.

Documentation and attribution: configure's header gains a --use-rocm worked example next to the existing --use-cuda one, and the cudamatrix files extended here carry an AMD copyright line.

Scope: Linux. The configure ROCm path is POSIX-only by design (it gates on a Linux host); the Windows .sln build documents no GPU support upstream, so the GPU backend is built and tested on Linux.

Test Plan:

cd tools && make -j"$(nproc)" openfst
cd ../src && ./configure --use-rocm --rocm-dir=/opt/rocm \
    --rocm-targets=gfx90a --use-cuda=no
make -j"$(nproc)" depend && make -j"$(nproc)"
cd cudamatrix && make test

All 12 cudamatrix correctness unit tests pass on an AMD Instinct MI250X (gfx90a), ROCm 7.2.1, including cu-math-test's UnitTestBackpropLstmNonlinearity, which fails without the wave64 fix. The same branch builds and passes the cudamatrix tests on an RDNA3 GPU (gfx1100) via --rocm-targets=gfx1100, and a combined gfx90a,gfx1100 build passes as well. The CUDA path was compile-checked unchanged with nvcc 12.8 (sm_80).

This work was authored with assistance from Claude (Anthropic).

Kaldi already ships a ROCm/HIP GPU backend (configure --use-rocm, the src/makefiles/hip_64bit.mk toolchain block, and the src/hip/hipify.h compat header). It was written against ROCm 5.0-5.2 and had bit-rotted: it no longer builds on ROCm 7, and once built it computed a wrong result on 64-wide wavefronts. This change brings the existing backend back to green on a current ROCm and makes it correct on both 32- and 64-lane GPUs. Review it in three parts: 1. Build break on ROCm 7. cu-kernels.cu and feature-online-cmvn-cuda.cu defined __CUDA_ARCH__=800 before including <hipcub/hipcub.hpp>. On hipcc that suppressed the __gfx90a__ device macro, so rocPRIM fell through to a generic path that fails to compile against the ROCm 7 headers. The CUDA-version and warp definitions are reordered/guarded so hipcub sees the real device macros; the CUDA build is unchanged. 2. wave64 correctness. cu-math-test's UnitTestBackpropLstmNonlinearity was the one cudamatrix test that failed before the fix. The LSTM nonlinearity kernel partitioned a 256-thread block into 256/warpSize groups, assuming that value is 8. It is 8 on a 32-lane warp (NVIDIA, RDNA) but 4 on a 64-lane wavefront (CDNA, gfx90a), so rows past the fourth were left unprocessed. The kernel now derives the host warp size rather than assuming 32. 3. Multi-arch builds. The host warp size is resolved at runtime so a single build serves both wave32 and wave64 targets; --rocm-targets can list several GPUs (for example gfx90a,gfx1100) in one configure. Documentation and attribution: configure's header gains a --use-rocm worked example next to the existing --use-cuda one, and the cudamatrix files extended here carry an AMD copyright line. Scope: Linux. The configure ROCm path is POSIX-only by design (it gates on a Linux host); the Windows .sln build documents no GPU support upstream, so the GPU backend is built and tested on Linux. Test Plan: cd tools && make -j"$(nproc)" openfst cd ../src && ./configure --use-rocm --rocm-dir=/opt/rocm \ --rocm-targets=gfx90a --use-cuda=no make -j"$(nproc)" depend && make -j"$(nproc)" cd cudamatrix && make test All 12 cudamatrix correctness unit tests pass on an AMD Instinct MI250X (gfx90a), ROCm 7.2.1 -- including cu-math-test's UnitTestBackpropLstmNonlinearity, which fails without the wave64 fix. The same branch builds and passes the cudamatrix tests on an RDNA3 GPU (gfx1100) via --rocm-targets=gfx1100. This work was authored with assistance from Claude (Anthropic).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] Fix the HIP GPU build for ROCm 7 and a wave64 LSTM bug#4986

[ROCm] Fix the HIP GPU build for ROCm 7 and a wave64 LSTM bug#4986
jeffdaily wants to merge 1 commit into
kaldi-asr:masterfrom
jeffdaily:moat-port

jeffdaily commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeffdaily commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant