
[HIP/ROCm] Enable BF16: register layer_norm bfloat16 kernel + guard conv2d fuse passes #49

Open
oldzhu wants to merge 1 commit into ROCm:paddle_hackthon from oldzhu:hip-bf16-layer-norm-and-conv2d-fix

Conversation


@oldzhu oldzhu commented Apr 23, 2026

Summary

Fixes two HIP/ROCm BF16 regressions that block PaddleOCR-VL-1.5 from running in BF16 on AMD GPUs.

Related upstream issue: PaddlePaddle#78759

Changes

1. paddle/phi/kernels/gpu/layer_norm_kernel.cu

Add phi::bfloat16 to the HIP PD_REGISTER_KERNEL:

// Before
PD_REGISTER_KERNEL(layer_norm, GPU, ALL_LAYOUT, phi::LayerNormKernel,
                   float, phi::float16) { ... }
// After
PD_REGISTER_KERNEL(layer_norm, GPU, ALL_LAYOUT, phi::LayerNormKernel,
                   float, phi::float16, phi::bfloat16) { ... }

LayerNormKernel uses templated CUDA-compatible intrinsics that compile and execute correctly on ROCm. The omission of bfloat16 from the HIP registration was the sole blocker.
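For reference, a minimal verification sketch (not the PR's test file; the tensor shapes and normalized_shape value are illustrative, and it assumes a ROCm build of Paddle that includes this registration):

```python
# Check that layer_norm dispatches for bfloat16 on a HIP build.
import paddle
import paddle.nn.functional as F

x = paddle.randn([4, 16, 64]).astype("bfloat16")
weight = paddle.ones([64]).astype("bfloat16")
bias = paddle.zeros([64]).astype("bfloat16")

# Before this patch, this call raised "kernel not registered" for bfloat16 on HIP.
y = F.layer_norm(x, normalized_shape=[64], weight=weight, bias=bias, epsilon=1e-5)
assert y.dtype == x.dtype  # output stays bfloat16
```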

2. paddle/fluid/pir/transforms/gpu/conv2d_add_act_fuse_pass.cc and conv2d_add_fuse_pass.cc

Add an #ifdef PADDLE_WITH_HIP early-return guard in InitializePatterns():

#ifdef PADDLE_WITH_HIP
  // fused_conv2d_add_act kernel is not implemented for ROCm/HIP.
  return ps;
#endif

The fused op (FusedConv2dAddActOp) is only compiled under PADDLE_WITH_CUDA. On ROCm the pass generates un-dispatchable nodes causing a runtime crash.
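For context, the inference-side workaround this guard supersedes looked roughly like the sketch below. The model paths are placeholders and the pass names are inferred from the files changed here; treat it as an illustration, not the exact PaddleX code.

```python
# Pre-patch workaround sketch: strip the conv2d fuse passes from every
# inference config on ROCm so FusedConv2dAddActOp is never generated.
# With the #ifdef PADDLE_WITH_HIP guard in place this is no longer needed.
import paddle
from paddle.inference import Config, create_predictor

config = Config("inference.pdmodel", "inference.pdiparams")  # placeholder paths
config.enable_use_gpu(256, 0)

if paddle.is_compiled_with_rocm():
    config.delete_pass("conv2d_add_act_fuse_pass")
    config.delete_pass("conv2d_add_fuse_pass")

predictor = create_predictor(config)
```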

3. test/legacy_test/test_layer_norm_bf16_hip.py (new)

Unit tests covering:

  • LayerNorm BF16 output matches FP32 reference (rtol/atol 1e-2) for 2D/3D/4D inputs
  • Output dtype preserved as bfloat16
  • No 'kernel not registered' exception for BF16 input on HIP
  • ROCm-specific: SNR >= 30 dB vs FP32 reference (a rough sketch of this check follows the list)
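The sketch below shows the kind of SNR comparison the last bullet describes; it is not the test file itself, the shapes are arbitrary, and only the 30 dB threshold comes from the list above.

```python
# Compare a BF16 layer_norm result against an FP32 reference and compute SNR in dB.
import numpy as np
import paddle
import paddle.nn.functional as F

x_fp32 = paddle.randn([8, 32, 128])
w = paddle.ones([128])
b = paddle.zeros([128])

ref = F.layer_norm(x_fp32, [128], w, b).numpy()
out = F.layer_norm(x_fp32.astype("bfloat16"),
                   [128],
                   w.astype("bfloat16"),
                   b.astype("bfloat16")).astype("float32").numpy()

noise = out - ref
snr_db = 10.0 * np.log10(np.sum(ref ** 2) / max(np.sum(noise ** 2), 1e-12))
assert snr_db >= 30.0, f"SNR too low: {snr_db:.1f} dB"
```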

Validation

Tested on AMD Radeon RX 7900 GRE (gfx1100) + ROCm 7.2.0 + Python 3.12, Paddle 3.4.0.dev20260408:

| Test | Result |
| --- | --- |
| is_compiled_with_rocm() | ✅ True |
| is_bfloat16_available('dcu:0') | ✅ True |
| BF16 conv2d SNR vs FP32 | ✅ 44 dB (8/8 PASS) |
| PaddleOCR-VL-1.5 full BF16 pipeline | ✅ PASS, 202.8s, EXIT:0 |
| OCR output correctness | ✅ Verified |


…/ROCm

## Problem

Two categories of failures when running PaddleOCR-VL-1.5 in BF16 mode on
AMD ROCm (gfx1100 / ROCm 7.2.0):

1. **conv2d_add_act_fuse_pass / conv2d_add_fuse_pass**: The fusion passes
   generate FusedConv2dAddActOp nodes, but this op is only compiled under
   PADDLE_WITH_CUDA (not HIP). At runtime Paddle fails with
   'Cannot find the kernel for FusedConv2dAddAct on GPU with float32'.

2. **layer_norm with bfloat16**: The HIP PD_REGISTER_KERNEL block for
   layer_norm only registered float and float16. Calling layer_norm with a
   bfloat16 tensor raises 'kernel not registered for GPU / bfloat16'.

## Fix

### conv2d_add_act_fuse_pass.cc / conv2d_add_fuse_pass.cc
Add an `#ifdef PADDLE_WITH_HIP` guard in InitializePatterns() that returns
an empty pattern set. This prevents the pass from generating the fused op on
ROCm without disabling it on CUDA. PaddleX previously worked around this by
calling config.delete_pass() for every inference session; this C++ guard
makes that unnecessary.

### paddle/phi/kernels/gpu/layer_norm_kernel.cu
Add `phi::bfloat16` to the HIP PD_REGISTER_KERNEL for layer_norm. The
LayerNormKernel implementation uses templated CUDA-compatible intrinsics
that compile and run correctly under ROCm — the bfloat16 dtype was simply
never registered.

## Validation

Tested on AMD Radeon RX 7900 GRE (gfx1100) + ROCm 7.2.0 + Python 3.12:
- Operator-level: BF16 conv2d SNR 44 dB vs FP32 reference (all 5 tests PASS)
- Integration: PaddleOCR-VL-1.5 full BF16 pipeline, 202.8s inference, EXIT:0

Related: PaddleX workaround branch vivienfanghuagood:PaddleX:dev_rocm70
