[HIP/ROCm] Enable BF16: register layer_norm bfloat16 kernel + guard conv2d fuse passes #49
Open
oldzhu wants to merge 1 commit into
## Problem

Two categories of failures when running PaddleOCR-VL-1.5 in BF16 mode on AMD ROCm (gfx1100 / ROCm 7.2.0):

1. **conv2d_add_act_fuse_pass / conv2d_add_fuse_pass**: the fusion passes generate FusedConv2dAddActOp nodes, but this op is only compiled under PADDLE_WITH_CUDA (not HIP). At runtime Paddle fails with "Cannot find the kernel for FusedConv2dAddAct on GPU with float32".
2. **layer_norm with bfloat16**: the HIP PD_REGISTER_KERNEL block for layer_norm registered only float and float16. Calling layer_norm with a bfloat16 tensor raises "kernel not registered for GPU / bfloat16".

## Fix

### conv2d_add_act_fuse_pass.cc / conv2d_add_fuse_pass.cc

Add an `#ifdef PADDLE_WITH_HIP` guard in InitializePatterns() that returns an empty pattern set (see the sketch below). This prevents the pass from generating the fused op on ROCm without disabling it on CUDA. PaddleX previously worked around this by calling config.delete_pass() for every inference session; this C++ guard makes that unnecessary.

### paddle/phi/kernels/gpu/layer_norm_kernel.cu

Add `phi::bfloat16` to the HIP PD_REGISTER_KERNEL for layer_norm. The LayerNormKernel implementation uses templated CUDA-compatible intrinsics that compile and run correctly under ROCm; the bfloat16 dtype was simply never registered.

## Validation

Tested on AMD Radeon RX 7900 GRE (gfx1100) + ROCm 7.2.0 + Python 3.12:

- Operator-level: BF16 conv2d SNR 44 dB vs FP32 reference (all 5 tests PASS)
- Integration: PaddleOCR-VL-1.5 full BF16 pipeline, 202.8 s inference, EXIT:0

Related: PaddleX workaround branch vivienfanghuagood:PaddleX:dev_rocm70
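A minimal sketch of the guard described above, assuming the pass derives from pir::PatternRewritePass and that InitializePatterns() returns a pir::RewritePatternSet; the class name and the elided CUDA-path pattern registrations are placeholders following the Paddle PIR transform convention, not the verbatim patch:

```cpp
// conv2d_add_act_fuse_pass.cc (sketch, not the verbatim patch)
pir::RewritePatternSet Conv2dAddActFusePass::InitializePatterns(
    pir::IrContext *context) {
  pir::RewritePatternSet ps(context);
#ifdef PADDLE_WITH_HIP
  // FusedConv2dAddActOp is only compiled under PADDLE_WITH_CUDA. Returning
  // an empty pattern set on ROCm means the pass matches nothing, so the
  // unfused conv2d + add (+ act) ops keep their dispatchable kernels.
  return ps;
#endif
  // CUDA path unchanged: the existing conv2d+add(+act) fusion patterns
  // are added to `ps` here, exactly as before this patch.
  return ps;
}
```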
Summary
Fixes two HIP/ROCm BF16 regressions that block PaddleOCR-VL-1.5 from running in BF16 on AMD GPUs.
Related upstream issue: PaddlePaddle#78759
Changes
1. paddle/phi/kernels/gpu/layer_norm_kernel.cu
Add phi::bfloat16 to the HIP PD_REGISTER_KERNEL: LayerNormKernel uses templated CUDA-compatible intrinsics that compile and execute correctly on ROCm. The omission of bfloat16 from the HIP registration was the sole blocker (a sketch of the amended registration follows this list).
2. paddle/fluid/pir/transforms/gpu/conv2d_add_act_fuse_pass.cc and conv2d_add_fuse_pass.cc
Add an #ifdef PADDLE_WITH_HIP early-return guard in InitializePatterns(): the fused op (FusedConv2dAddActOp) is only compiled under PADDLE_WITH_CUDA, so on ROCm the pass generates un-dispatchable nodes, causing a runtime crash.
3. test/legacy_test/test_layer_norm_bf16_hip.py (new)
Unit tests covering the bfloat16 layer_norm path on ROCm.
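For reference, a sketch of the amended HIP registration block; the phi::dtype::float16 / phi::dtype::bfloat16 spellings follow the usual phi registration style and are an assumption here (the PR text abbreviates the latter as phi::bfloat16):

```cpp
// layer_norm_kernel.cu, HIP branch (sketch)
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(layer_norm,
                   GPU,
                   ALL_LAYOUT,
                   phi::LayerNormKernel,
                   float,
                   phi::dtype::float16,
                   phi::dtype::bfloat16) {}  // bfloat16 newly registered
#endif
```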
Validation
Tested on AMD Radeon RX 7900 GRE (gfx1100) + ROCm 7.2.0 + Python 3.12, Paddle 3.4.0.dev20260408:
- is_compiled_with_rocm()
- is_bfloat16_available('dcu:0')

Evidence:
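Separately, the PR does not define the operator-level SNR metric quoted in the validation above; assuming the standard signal-to-noise definition over the FP32 reference output $y_{\text{fp32}}$ and the BF16 output $y_{\text{bf16}}$, 44 dB corresponds to a relative error norm of roughly $10^{-44/20} \approx 0.6\%$:

$$\mathrm{SNR_{dB}} = 20 \log_{10} \frac{\lVert y_{\text{fp32}} \rVert_2}{\lVert y_{\text{fp32}} - y_{\text{bf16}} \rVert_2}$$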
Related