
[HIP/ROCm] Enable BF16: register layer_norm bfloat16 kernel + guard conv2d fuse passes #49

Open
oldzhu wants to merge 1 commit into ROCm:paddle_hackthon from oldzhu:hip-bf16-layer-norm-and-conv2d-fix

Conversation


@oldzhu oldzhu commented Apr 23, 2026

Summary

Fixes two HIP/ROCm BF16 regressions that block PaddleOCR-VL-1.5 from running in BF16 on AMD GPUs.

Related upstream issue: PaddlePaddle#78759

Changes

1. paddle/phi/kernels/gpu/layer_norm_kernel.cu

Add phi::bfloat16 to the HIP PD_REGISTER_KERNEL:

// Before
PD_REGISTER_KERNEL(layer_norm, GPU, ALL_LAYOUT, phi::LayerNormKernel,
                   float, phi::float16) { ... }
// After
PD_REGISTER_KERNEL(layer_norm, GPU, ALL_LAYOUT, phi::LayerNormKernel,
                   float, phi::float16, phi::bfloat16) { ... }

LayerNormKernel uses templated CUDA-compatible intrinsics that compile and execute correctly on ROCm. The omission of bfloat16 from the HIP registration was the sole blocker.
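For reference, a minimal verification sketch (not the PR's test file; the tensor shapes and normalized_shape value are illustrative, and it assumes a ROCm build of Paddle that includes this registration):

```python
# Check that layer_norm dispatches for bfloat16 on a HIP build.
import paddle
import paddle.nn.functional as F

x = paddle.randn([4, 16, 64]).astype("bfloat16")
weight = paddle.ones([64]).astype("bfloat16")
bias = paddle.zeros([64]).astype("bfloat16")

# Before this patch, this call raised "kernel not registered" for bfloat16 on HIP.
y = F.layer_norm(x, normalized_shape=[64], weight=weight, bias=bias, epsilon=1e-5)
assert y.dtype == x.dtype  # output stays bfloat16
```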

2. paddle/fluid/pir/transforms/gpu/conv2d_add_act_fuse_pass.cc and conv2d_add_fuse_pass.cc

Add an #ifdef PADDLE_WITH_HIP early-return guard in InitializePatterns():

#ifdef PADDLE_WITH_HIP
  // fused_conv2d_add_act kernel is not implemented for ROCm/HIP.
  return ps;
#endif

The fused op (FusedConv2dAddActOp) is only compiled under PADDLE_WITH_CUDA. On ROCm the pass generates un-dispatchable nodes causing a runtime crash.
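For context, the inference-side workaround this guard supersedes looked roughly like the sketch below. The model paths are placeholders and the pass names are inferred from the files changed here; treat it as an illustration, not the exact PaddleX code.

```python
# Pre-patch workaround sketch: strip the conv2d fuse passes from every
# inference config on ROCm so FusedConv2dAddActOp is never generated.
# With the #ifdef PADDLE_WITH_HIP guard in place this is no longer needed.
import paddle
from paddle.inference import Config, create_predictor

config = Config("inference.pdmodel", "inference.pdiparams")  # placeholder paths
config.enable_use_gpu(256, 0)

if paddle.is_compiled_with_rocm():
    config.delete_pass("conv2d_add_act_fuse_pass")
    config.delete_pass("conv2d_add_fuse_pass")

predictor = create_predictor(config)
```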

3. test/legacy_test/test_layer_norm_bf16_hip.py (new)

Unit tests covering:

  • LayerNorm BF16 output matches FP32 reference (rtol/atol 1e-2) for 2D/3D/4D inputs
  • Output dtype preserved as bfloat16
  • No 'kernel not registered' exception for BF16 input on HIP
  • ROCm-specific: SNR >= 30 dB vs FP32 reference (a rough sketch of this check follows the list)
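The sketch below shows the kind of SNR comparison the last bullet describes; it is not the test file itself, the shapes are arbitrary, and only the 30 dB threshold comes from the list above.

```python
# Compare a BF16 layer_norm result against an FP32 reference and compute SNR in dB.
import numpy as np
import paddle
import paddle.nn.functional as F

x_fp32 = paddle.randn([8, 32, 128])
w = paddle.ones([128])
b = paddle.zeros([128])

ref = F.layer_norm(x_fp32, [128], w, b).numpy()
out = F.layer_norm(x_fp32.astype("bfloat16"),
                   [128],
                   w.astype("bfloat16"),
                   b.astype("bfloat16")).astype("float32").numpy()

noise = out - ref
snr_db = 10.0 * np.log10(np.sum(ref ** 2) / max(np.sum(noise ** 2), 1e-12))
assert snr_db >= 30.0, f"SNR too low: {snr_db:.1f} dB"
```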

Validation

Tested on AMD Radeon RX 7900 GRE (gfx1100) + ROCm 7.2.0 + Python 3.12, Paddle 3.4.0.dev20260408:

| Test | Result |
| --- | --- |
| is_compiled_with_rocm() | ✅ True |
| is_bfloat16_available('dcu:0') | ✅ True |
| BF16 conv2d SNR vs FP32 | ✅ 44 dB (8/8 PASS) |
| PaddleOCR-VL-1.5 full BF16 pipeline | ✅ PASS, 202.8s, EXIT:0 |
| OCR output correctness | ✅ Verified |


…/ROCm

## Problem

Two categories of failures when running PaddleOCR-VL-1.5 in BF16 mode on
AMD ROCm (gfx1100 / ROCm 7.2.0):

1. **conv2d_add_act_fuse_pass / conv2d_add_fuse_pass**: The fusion passes
   generate FusedConv2dAddActOp nodes, but this op is only compiled under
   PADDLE_WITH_CUDA (not HIP). At runtime Paddle fails with
   'Cannot find the kernel for FusedConv2dAddAct on GPU with float32'.

2. **layer_norm with bfloat16**: The HIP PD_REGISTER_KERNEL block for
   layer_norm only registered float and float16. Calling layer_norm with a
   bfloat16 tensor raises 'kernel not registered for GPU / bfloat16'.

## Fix

### conv2d_add_act_fuse_pass.cc / conv2d_add_fuse_pass.cc
Add an `#ifdef PADDLE_WITH_HIP` guard in InitializePatterns() that returns
an empty pattern set. This prevents the pass from generating the fused op on
ROCm without disabling it on CUDA. PaddleX previously worked around this by
calling config.delete_pass() for every inference session; this C++ guard
makes that unnecessary.

### paddle/phi/kernels/gpu/layer_norm_kernel.cu
Add `phi::bfloat16` to the HIP PD_REGISTER_KERNEL for layer_norm. The
LayerNormKernel implementation uses templated CUDA-compatible intrinsics
that compile and run correctly under ROCm — the bfloat16 dtype was simply
never registered.

## Validation

Tested on AMD Radeon RX 7900 GRE (gfx1100) + ROCm 7.2.0 + Python 3.12:
- Operator-level: BF16 conv2d SNR 44 dB vs FP32 reference (all 5 tests PASS)
- Integration: PaddleOCR-VL-1.5 full BF16 pipeline, 202.8s inference, EXIT:0

Related: PaddleX workaround branch vivienfanghuagood:PaddleX:dev_rocm70
