Skip to content

[ROCm] Fix the HIP GPU build for ROCm 7 and a wave64 LSTM bug#4986

Open
jeffdaily wants to merge 1 commit into
kaldi-asr:masterfrom
jeffdaily:moat-port
Open

[ROCm] Fix the HIP GPU build for ROCm 7 and a wave64 LSTM bug#4986
jeffdaily wants to merge 1 commit into
kaldi-asr:masterfrom
jeffdaily:moat-port

Conversation

@jeffdaily

Copy link
Copy Markdown

Kaldi already ships a ROCm/HIP GPU backend (configure --use-rocm, the src/makefiles/hip_64bit.mk toolchain block, and the src/hip/hipify.h compat header). It was written against ROCm 5.0-5.2 and had bit-rotted: it no longer builds on ROCm 7, and once built it computed a wrong result on 64-wide wavefronts. This change brings the existing backend back to green on a current ROCm and makes it correct on both 32- and 64-lane GPUs.

Review it in three parts:

  1. Build break on ROCm 7. cu-kernels.cu and feature-online-cmvn-cuda.cu defined __CUDA_ARCH__=800 before including <hipcub/hipcub.hpp>. On hipcc that suppressed the __gfx90a__ device macro, so rocPRIM fell through to a generic path that fails to compile against the ROCm 7 headers. The CUDA-version and warp definitions are reordered/guarded so hipcub sees the real device macros; the CUDA build is unchanged.

  2. wave64 correctness. cu-math-test's UnitTestBackpropLstmNonlinearity was the one cudamatrix test that failed before the fix. The LSTM nonlinearity kernel partitioned a 256-thread block into 256/warpSize groups, assuming that value is 8. It is 8 on a 32-lane warp (NVIDIA, RDNA) but 4 on a 64-lane wavefront (CDNA, gfx90a), so rows past the fourth were left unprocessed. The kernel now derives the host warp size rather than assuming 32.

  3. Multi-arch builds. The host warp size is resolved at runtime (via hipDeviceGetAttribute) so a single build serves both wave32 and wave64 targets; --rocm-targets can list several GPUs (for example gfx90a,gfx1100) in one configure, and each kernel launch selects its tiling from the running device's wavefront width.

Documentation and attribution: configure's header gains a --use-rocm worked example next to the existing --use-cuda one, and the cudamatrix files extended here carry an AMD copyright line.

Scope: Linux. The configure ROCm path is POSIX-only by design (it gates on a Linux host); the Windows .sln build documents no GPU support upstream, so the GPU backend is built and tested on Linux.

Test Plan:

cd tools && make -j"$(nproc)" openfst
cd ../src && ./configure --use-rocm --rocm-dir=/opt/rocm \
    --rocm-targets=gfx90a --use-cuda=no
make -j"$(nproc)" depend && make -j"$(nproc)"
cd cudamatrix && make test

All 12 cudamatrix correctness unit tests pass on an AMD Instinct MI250X (gfx90a), ROCm 7.2.1, including cu-math-test's UnitTestBackpropLstmNonlinearity, which fails without the wave64 fix. The same branch builds and passes the cudamatrix tests on an RDNA3 GPU (gfx1100) via --rocm-targets=gfx1100, and a combined gfx90a,gfx1100 build passes as well. The CUDA path was compile-checked unchanged with nvcc 12.8 (sm_80).

This work was authored with assistance from Claude (Anthropic).

Kaldi already ships a ROCm/HIP GPU backend (configure --use-rocm, the
src/makefiles/hip_64bit.mk toolchain block, and the src/hip/hipify.h compat
header). It was written against ROCm 5.0-5.2 and had bit-rotted: it no longer
builds on ROCm 7, and once built it computed a wrong result on 64-wide
wavefronts. This change brings the existing backend back to green on a current
ROCm and makes it correct on both 32- and 64-lane GPUs.

Review it in three parts:

1. Build break on ROCm 7. cu-kernels.cu and feature-online-cmvn-cuda.cu defined
   __CUDA_ARCH__=800 before including <hipcub/hipcub.hpp>. On hipcc that
   suppressed the __gfx90a__ device macro, so rocPRIM fell through to a generic
   path that fails to compile against the ROCm 7 headers. The CUDA-version and
   warp definitions are reordered/guarded so hipcub sees the real device macros;
   the CUDA build is unchanged.

2. wave64 correctness. cu-math-test's UnitTestBackpropLstmNonlinearity was the
   one cudamatrix test that failed before the fix. The LSTM nonlinearity kernel
   partitioned a 256-thread block into 256/warpSize groups, assuming that value
   is 8. It is 8 on a 32-lane warp (NVIDIA, RDNA) but 4 on a 64-lane wavefront
   (CDNA, gfx90a), so rows past the fourth were left unprocessed. The kernel now
   derives the host warp size rather than assuming 32.

3. Multi-arch builds. The host warp size is resolved at runtime so a single
   build serves both wave32 and wave64 targets; --rocm-targets can list several
   GPUs (for example gfx90a,gfx1100) in one configure.

Documentation and attribution: configure's header gains a --use-rocm worked
example next to the existing --use-cuda one, and the cudamatrix files extended
here carry an AMD copyright line.

Scope: Linux. The configure ROCm path is POSIX-only by design (it gates on a
Linux host); the Windows .sln build documents no GPU support upstream, so the
GPU backend is built and tested on Linux.

Test Plan:

  cd tools && make -j"$(nproc)" openfst
  cd ../src && ./configure --use-rocm --rocm-dir=/opt/rocm \
      --rocm-targets=gfx90a --use-cuda=no
  make -j"$(nproc)" depend && make -j"$(nproc)"
  cd cudamatrix && make test

All 12 cudamatrix correctness unit tests pass on an AMD Instinct MI250X
(gfx90a), ROCm 7.2.1 -- including cu-math-test's UnitTestBackpropLstmNonlinearity,
which fails without the wave64 fix. The same branch builds and passes the
cudamatrix tests on an RDNA3 GPU (gfx1100) via --rocm-targets=gfx1100.

This work was authored with assistance from Claude (Anthropic).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant