[ROCm] Enable BF16 softmax + gate cuDNN-only conv2d_add fuse passes on HIP #48
BF16 profiling:
MIOpen (as of ROCm 7.x) returns MIOPEN_STATUS_NOT_IMPLEMENTED for miopenSoftmaxForward_V2 with miopenBFloat16, so the gpudnn softmax path cannot be used for BF16 on HIP. When the input dim exceeds the warp softmax cap, route BF16 through the existing matrix softmax kernel instead of letting the call fall into the MIOpen branch. Also gate the CUDNN_VERSION < 8100 BF16 fallback specialization on !defined(PADDLE_WITH_HIP) -- that branch dispatched into MIOpen too and would trip the same NOT_IMPLEMENTED failure on ROCm.
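For orientation, here is a minimal sketch of the dispatch shape this commit describes. The constant kWarpSoftmaxCap and the Launch* helpers are hypothetical placeholders, not the real identifiers in softmax_gpudnn.h; the point is only where the BF16 branch diverges on HIP:

```cpp
#include <type_traits>

// Sketch only: HIP-side BF16 softmax routing as described above.
template <typename T>
void SoftmaxForwardDispatch(const phi::GPUContext& ctx,
                            const phi::DenseTensor& x,
                            int axis,
                            phi::DenseTensor* out) {
  const int dim = static_cast<int>(x.dims()[axis]);
  if (dim <= kWarpSoftmaxCap) {
    LaunchWarpSoftmax<T>(ctx, x, axis, out);   // small dims: unchanged path
    return;
  }
#if defined(PADDLE_WITH_HIP)
  if (std::is_same<T, phi::dtype::bfloat16>::value) {
    // MIOpen has no BF16 miopenSoftmaxForward_V2 (NOT_IMPLEMENTED), so
    // large-dim BF16 goes to the existing matrix softmax kernel instead.
    LaunchMatrixSoftmax<T>(ctx, x, axis, out);
    return;
  }
#endif
  LaunchGpudnnSoftmax<T>(ctx, x, axis, out);   // cuDNN (CUDA) / MIOpen (HIP)
}
```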
conv2d_add_fuse_pass and conv2d_add_act_fuse_pass rewrite conv2d+add[+act]
into the fused_conv2d_add_act op, which has only a cuDNN GPUDNN kernel.
On ROCm the rewrite succeeds but kernel dispatch later fails because no
HIP kernel is registered, so PaddleX currently works around this by
calling config.delete_pass("conv2d_add_act_fuse_pass") and
config.delete_pass("conv2d_add_fuse_pass") under paddle.is_compiled_with_rocm()
in paddlex/inference/models/runners/paddle_static/runner.py.
Gate both the pass registration (REGISTER_IR_PASS / USE_PIR_PASS) and the
pass-builder inclusion on PADDLE_WITH_CUDA so the rewrite never runs on
HIP builds, making the PaddleX delete_pass calls unnecessary.
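The gating pattern itself is small; a hedged sketch follows (the pass class name Conv2dAddFusePass and the exact shape of the kPirGpuPasses initialization are illustrative, and the real edits are spread over the four files named in this commit):

```cpp
// conv2d_add_fuse_pass.cc (sketch): only register the PIR pass for CUDA
// builds, since fused_conv2d_add_act has no HIP GPUDNN kernel to dispatch to.
#if defined(PADDLE_WITH_CUDA)
REGISTER_IR_PASS(conv2d_add_fuse_pass, Conv2dAddFusePass);
#endif

// passes.h (sketch): the matching USE_PIR_PASS declaration gets the same guard.
#if defined(PADDLE_WITH_CUDA)
USE_PIR_PASS(conv2d_add_fuse_pass);
#endif

// paddle_pass_builder.cc (sketch): include the two fuse passes in the PIR GPU
// pass list only under CUDA, so HIP builds never schedule the rewrite.
const std::vector<std::string> kPirGpuPasses{
    // ... other passes ...
#if defined(PADDLE_WITH_CUDA)
    "conv2d_add_fuse_pass",
    "conv2d_add_act_fuse_pass",
#endif
};
```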
Restore the BF16 registrations for conv2d / conv3d / depthwise conv kernels and the DataType::BFLOAT16 -> miopenBFloat16 mapping originally added by ROCm#47 and reverted on paddle_hackthon ahead of RDNA4 enablement. The change is gated at compile time by the existing #ifdef PADDLE_WITH_HIP block. Deployment to archs that lack native BF16 support should be handled via PADDLE_ROCM_OFFLOAD_ARCHS (paddle_hackthon's default list already covers the BF16-capable set: CDNA3/gfx942, CDNA4/gfx950, RDNA3/gfx1100-1102, RDNA4/gfx1200-1201); if a downstream target needs to strip BF16 from the build, it can narrow the offload-arch list accordingly. No runtime arch queries are introduced.
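A hedged sketch of the two restored pieces (the mapping helper's name and the exact registration signature are simplified here; see miopen_desc.h and conv_kernel.cu in the diff for the real code):

```cpp
// miopen_desc.h (sketch): restore the BF16 mapping so descriptor creation
// stops rejecting DataType::BFLOAT16 on HIP.
inline miopenDataType_t ToMIOpenDataType(phi::DataType t) {
  switch (t) {
    case phi::DataType::FLOAT32:  return miopenFloat;
    case phi::DataType::FLOAT16:  return miopenHalf;
    case phi::DataType::BFLOAT16: return miopenBFloat16;  // restored mapping
    default:
      PADDLE_THROW(phi::errors::Unimplemented("Unsupported MIOpen data type."));
  }
}

// conv_kernel.cu (sketch): re-add phi::dtype::bfloat16 to the HIP GPUDNN
// registration; the grad / double_grad / conv3d / depthwise variants listed
// above get the same treatment.
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel,
                   float, phi::dtype::float16, phi::dtype::bfloat16) {}
#endif
```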
Updated BF16 profiling (base switched to
PR Category
Execute Infrastructure
PR Types
Bug fixes
Description
Enables PaddleOCR-VL-1.5 to run end-to-end natively in BF16 on AMD MI300X (gfx942) under ROCm 7.x against the paddle_hackthon branch. Three independent HIP-only patches, 3 commits / 8 files / +58−12:

1. [ROCm] Re-enable BF16 conv kernels on HIP (paddle/phi/backends/gpu/rocm/miopen_desc.h + paddle/phi/kernels/gpudnn/conv_kernel.cu + paddle/phi/kernels/gpudnn/conv_grad_kernel.cu): restores the DataType::BFLOAT16 → miopenBFloat16 mapping and the phi::bfloat16 registrations on conv2d / conv2d_grad / conv2d_double_grad / conv3d / conv3d_grad / conv3d_double_grad / depthwise_conv2d / depthwise_conv2d_double_grad that 7d14616cee reverted from feat(ROCm): Add BF16 support for conv kernels on HIP/ROCm #47. Without this, PaddleOCR-VL-1.5's vision patchify Conv2D cannot dispatch a BF16 kernel and the pipeline falls back to FP32 for the entire vision encoder. Deployment to archs that don't have BF16 MFMA/WMMA (pre-CDNA3 / pre-RDNA3) is handled via PADDLE_ROCM_OFFLOAD_ARCHS at configure time; paddle_hackthon's default already covers the BF16-capable set (gfx942, gfx950, gfx1100, gfx1101, gfx1102, gfx1200, gfx1201).

2. [ROCm] Route BF16 softmax through matrix kernel (MIOpen NOT_IMPLEMENTED) (paddle/phi/kernels/gpudnn/softmax_gpudnn.h): MIOpen (as of ROCm 7.x) returns MIOPEN_STATUS_NOT_IMPLEMENTED for miopenSoftmaxForward_V2 with miopenBFloat16, so whenever dim ≥ MATRIX_SOFTMAX_THRESHOLD the existing gpudnn path dispatched into MIOpen and crashed. Route BF16 softmax to the existing matrix-softmax kernel on HIP, and gate the CUDNN_VERSION < 8100 BF16 fallback specialization on !defined(PADDLE_WITH_HIP), since that branch dispatched into MIOpen too and would trip the same failure.

3. [ROCm] Skip cuDNN-only conv2d fusion passes on HIP (paddle/fluid/pir/transforms/gpu/conv2d_add_fuse_pass.cc + conv2d_add_act_fuse_pass.cc + paddle/fluid/pir/transforms/passes.h + paddle/fluid/inference/api/paddle_pass_builder.cc): both PIR passes rewrite conv2d + add [+ act] into the fused fused_conv2d_add_act op, whose only kernel is cuDNN-only GPUDNN. On ROCm the rewrite succeeds but dispatch later fails for lack of a HIP kernel. PaddleX currently works around this by calling config.delete_pass("conv2d_add_act_fuse_pass") and config.delete_pass("conv2d_add_fuse_pass") under paddle.is_compiled_with_rocm() in paddlex/inference/models/runners/paddle_static/runner.py. Gate both REGISTER_IR_PASS / USE_PIR_PASS and the kPirGpuPasses list entries on PADDLE_WITH_CUDA so the rewrite never runs on HIP builds; the PaddleX delete_pass workaround becomes unnecessary.

Upstream relation
- PaddlePaddle/Paddle:develop.
- PaddlePaddle/Paddle:develop, also ported to ROCm/Paddle:develop via ROCm/Paddle#47.
- 7d14616cee on paddle_hackthon reverted ROCm/Paddle#47's conv BF16 ahead of RDNA4 enablement. Commit 1 in this PR restores that registration; RDNA4 deployers who hit a regression can narrow PADDLE_ROCM_OFFLOAD_ARCHS to exclude gfx1200/1201 from their build until MIOpen BF16 conv is stable on RDNA4.
- This PR does not reintroduce the unit test that 7d14616cee also removed (test/legacy_test/test_hip_bf16_conv_kernel.py): upstream CI on ROCm/Paddle does not run that test anyway, and the e2e BF16 verification attached below exercises the kernel more thoroughly.
- _keep_in_fp32_modules = ["visual", "mlp_AR"] and the 4 delete_pass("conv2d_add_*_fuse_pass") blocks that are currently shipped as workarounds.

Verification
Full rebuild from ROCm/Paddle:paddle_hackthon @ 4df29c5818 + these 3 commits on MI300X (gfx942) / ROCm 7.2.0 / Python 3.12 with PADDLE_ROCM_OFFLOAD_ARCHS=gfx942, then:

- paddle.matmul / F.softmax / F.gelu; all 219 leaf-sublayer outputs bfloat16, no BF16→FP32 leak asserted, 27 GELU + 27 softmax + 54 matmul all BF16.
- test_ocr.png. OCR text output semantically identical BF16 vs FP32 fallback. GPU kernel-dispatch time drops from 4,369.3 ms (FP32 fallback) → 3,897.8 ms (native BF16), a 1.12× speedup. GEMM alone saves 339 ms; the FP32 fallback path's Cijk_Ailk_Bljk_SB_MT…_MI16x16x4x1 vision GEMMs (473 ms at ranks 2 and 3 of the top-10) disappear entirely and are replaced by Cijk_Ailk_Bljk_BBS_BH_MT…_MI16x16x16x1 BF16 MFMA GEMMs.
- Kernel-class breakdown (rocprofv3 kernel_stats.csv):

The full benchmark report (methodology, env, reproduce instructions, top-10 kernel tables for both modes) is kept alongside at BF16_BENCHMARK_ROCM_FORK.md in the workspace root and is reproducible via bench_paddleocr_vl.py + rocprofv3 --kernel-trace --stats --output-format csv.

Does this cause precision changes?
No
Only the dispatch paths of ROCm/HIP builds are affected: conv BF16 uses exactly the dispatch from PR #47 (miopenBFloat16 precision), matching the behavior of the un-reverted ROCm/Paddle:develop and PaddlePaddle/Paddle:develop; softmax now routes to the matrix kernel, whose numerical implementation matches the existing FP16/FP32 path; conv2d_add_*_fuse_pass has no dispatchable kernel on HIP anyway, so after the gating the behavior is equivalent to PaddleX's current explicit delete_pass removal. CUDA builds are completely unaffected (all new #ifdefs are PADDLE_WITH_CUDA / PADDLE_WITH_HIP).