[ROCm] Enable BF16 softmax + gate cuDNN-only conv2d_add fuse passes on HIP #48
BF16 profiling:
MIOpen (as of ROCm 7.x) returns MIOPEN_STATUS_NOT_IMPLEMENTED for miopenSoftmaxForward_V2 with miopenBFloat16, so the gpudnn softmax path cannot be used for BF16 on HIP. When the input dim exceeds the warp softmax cap, route BF16 through the existing matrix softmax kernel instead of letting the call fall into the MIOpen branch. Also gate the CUDNN_VERSION < 8100 BF16 fallback specialization on !defined(PADDLE_WITH_HIP) -- that branch dispatched into MIOpen too and would trip the same NOT_IMPLEMENTED failure on ROCm.
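For orientation, here is a minimal sketch of the dispatch shape this commit describes. The constant kWarpSoftmaxCap and the Launch* helpers are hypothetical placeholders, not the real identifiers in softmax_gpudnn.h; the point is only where the BF16 branch diverges on HIP:

```cpp
#include <type_traits>

// Sketch only: HIP-side BF16 softmax routing as described above.
template <typename T>
void SoftmaxForwardDispatch(const phi::GPUContext& ctx,
                            const phi::DenseTensor& x,
                            int axis,
                            phi::DenseTensor* out) {
  const int dim = static_cast<int>(x.dims()[axis]);
  if (dim <= kWarpSoftmaxCap) {
    LaunchWarpSoftmax<T>(ctx, x, axis, out);   // small dims: unchanged path
    return;
  }
#if defined(PADDLE_WITH_HIP)
  if (std::is_same<T, phi::dtype::bfloat16>::value) {
    // MIOpen has no BF16 miopenSoftmaxForward_V2 (NOT_IMPLEMENTED), so
    // large-dim BF16 goes to the existing matrix softmax kernel instead.
    LaunchMatrixSoftmax<T>(ctx, x, axis, out);
    return;
  }
#endif
  LaunchGpudnnSoftmax<T>(ctx, x, axis, out);   // cuDNN (CUDA) / MIOpen (HIP)
}
```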
conv2d_add_fuse_pass and conv2d_add_act_fuse_pass rewrite conv2d+add[+act]
into the fused_conv2d_add_act op, which has only a cuDNN GPUDNN kernel.
On ROCm the rewrite succeeds but kernel dispatch later fails because no
HIP kernel is registered, so PaddleX currently works around this by
calling config.delete_pass("conv2d_add_act_fuse_pass") and
config.delete_pass("conv2d_add_fuse_pass") under paddle.is_compiled_with_rocm()
in paddlex/inference/models/runners/paddle_static/runner.py.
Gate both the pass registration (REGISTER_IR_PASS / USE_PIR_PASS) and the
pass-builder inclusion on PADDLE_WITH_CUDA so the rewrite never runs on
HIP builds, making the PaddleX delete_pass calls unnecessary.
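The gating pattern itself is small; a hedged sketch follows (the pass class name Conv2dAddFusePass and the exact shape of the kPirGpuPasses initialization are illustrative, and the real edits are spread over the four files named in this commit):

```cpp
// conv2d_add_fuse_pass.cc (sketch): only register the PIR pass for CUDA
// builds, since fused_conv2d_add_act has no HIP GPUDNN kernel to dispatch to.
#if defined(PADDLE_WITH_CUDA)
REGISTER_IR_PASS(conv2d_add_fuse_pass, Conv2dAddFusePass);
#endif

// passes.h (sketch): the matching USE_PIR_PASS declaration gets the same guard.
#if defined(PADDLE_WITH_CUDA)
USE_PIR_PASS(conv2d_add_fuse_pass);
#endif

// paddle_pass_builder.cc (sketch): include the two fuse passes in the PIR GPU
// pass list only under CUDA, so HIP builds never schedule the rewrite.
const std::vector<std::string> kPirGpuPasses{
    // ... other passes ...
#if defined(PADDLE_WITH_CUDA)
    "conv2d_add_fuse_pass",
    "conv2d_add_act_fuse_pass",
#endif
};
```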
Restore the BF16 registrations for conv2d / conv3d / depthwise conv kernels and the DataType::BFLOAT16 -> miopenBFloat16 mapping originally added by ROCm#47 and reverted on paddle_hackthon ahead of RDNA4 enablement. The change is gated at compile time by the existing #ifdef PADDLE_WITH_HIP block. Deployment to archs that lack native BF16 support should be handled via PADDLE_ROCM_OFFLOAD_ARCHS (paddle_hackthon's default list already covers the BF16-capable set: CDNA3/gfx942, CDNA4/gfx950, RDNA3/gfx1100-1102, RDNA4/gfx1200-1201); if a downstream target needs to strip BF16 from the build, it can narrow the offload-arch list accordingly. No runtime arch queries are introduced.
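A hedged sketch of the two restored pieces (the mapping helper's name and the exact registration signature are simplified here; see miopen_desc.h and conv_kernel.cu in the diff for the real code):

```cpp
// miopen_desc.h (sketch): restore the BF16 mapping so descriptor creation
// stops rejecting DataType::BFLOAT16 on HIP.
inline miopenDataType_t ToMIOpenDataType(phi::DataType t) {
  switch (t) {
    case phi::DataType::FLOAT32:  return miopenFloat;
    case phi::DataType::FLOAT16:  return miopenHalf;
    case phi::DataType::BFLOAT16: return miopenBFloat16;  // restored mapping
    default:
      PADDLE_THROW(phi::errors::Unimplemented("Unsupported MIOpen data type."));
  }
}

// conv_kernel.cu (sketch): re-add phi::dtype::bfloat16 to the HIP GPUDNN
// registration; the grad / double_grad / conv3d / depthwise variants listed
// above get the same treatment.
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel,
                   float, phi::dtype::float16, phi::dtype::bfloat16) {}
#endif
```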
Updated BF16 profiling (base switched to
PR Category
Execute Infrastructure
PR Types
Bug fixes
Description
Enables PaddleOCR-VL-1.5 to run end-to-end natively in BF16 on AMD MI300X (gfx942) under ROCm 7.x against the paddle_hackthon branch. Three independent HIP-only patches, 3 commits / 8 files / +58−12:

1. [ROCm] Re-enable BF16 conv kernels on HIP (paddle/phi/backends/gpu/rocm/miopen_desc.h + paddle/phi/kernels/gpudnn/conv_kernel.cu + paddle/phi/kernels/gpudnn/conv_grad_kernel.cu): restores the DataType::BFLOAT16 → miopenBFloat16 mapping and the phi::bfloat16 registrations on conv2d / conv2d_grad / conv2d_double_grad / conv3d / conv3d_grad / conv3d_double_grad / depthwise_conv2d / depthwise_conv2d_double_grad that 7d14616cee reverted from feat(ROCm): Add BF16 support for conv kernels on HIP/ROCm #47. Without this, PaddleOCR-VL-1.5's vision patchify Conv2D cannot dispatch a BF16 kernel and the pipeline falls back to FP32 for the entire vision encoder. Deployment to archs that don't have BF16 MFMA/WMMA (pre-CDNA3 / pre-RDNA3) is handled via PADDLE_ROCM_OFFLOAD_ARCHS at configure time; paddle_hackthon's default already covers the BF16-capable set (gfx942, gfx950, gfx1100, gfx1101, gfx1102, gfx1200, gfx1201).

2. [ROCm] Route BF16 softmax through matrix kernel (MIOpen NOT_IMPLEMENTED) (paddle/phi/kernels/gpudnn/softmax_gpudnn.h): MIOpen (as of ROCm 7.x) returns MIOPEN_STATUS_NOT_IMPLEMENTED for miopenSoftmaxForward_V2 with miopenBFloat16, so whenever dim ≥ MATRIX_SOFTMAX_THRESHOLD the existing gpudnn path dispatched into MIOpen and crashed. Route BF16 softmax to the existing matrix-softmax kernel on HIP, and gate the CUDNN_VERSION < 8100 BF16 fallback specialization on !defined(PADDLE_WITH_HIP), since that branch dispatched into MIOpen too and would trip the same failure.

3. [ROCm] Skip cuDNN-only conv2d fusion passes on HIP (paddle/fluid/pir/transforms/gpu/conv2d_add_fuse_pass.cc + conv2d_add_act_fuse_pass.cc + paddle/fluid/pir/transforms/passes.h + paddle/fluid/inference/api/paddle_pass_builder.cc): both PIR passes rewrite conv2d + add [+ act] into the fused fused_conv2d_add_act op, whose only kernel is cuDNN-only GPUDNN. On ROCm the rewrite succeeds but dispatch later fails for lack of a HIP kernel. PaddleX currently works around this by calling config.delete_pass("conv2d_add_act_fuse_pass") and config.delete_pass("conv2d_add_fuse_pass") under paddle.is_compiled_with_rocm() in paddlex/inference/models/runners/paddle_static/runner.py. Gate both REGISTER_IR_PASS / USE_PIR_PASS and the kPirGpuPasses list entries on PADDLE_WITH_CUDA so the rewrite never runs on HIP builds; the PaddleX delete_pass workaround becomes unnecessary.

Upstream relation
- PaddlePaddle/Paddle:develop.
- PaddlePaddle/Paddle:develop, also ported to ROCm/Paddle:develop via ROCm/Paddle#47.
- 7d14616cee on paddle_hackthon reverted ROCm/Paddle#47's conv BF16 ahead of RDNA4 enablement. Commit 1 in this PR restores that registration; RDNA4 deployers who hit a regression can narrow PADDLE_ROCM_OFFLOAD_ARCHS to exclude gfx1200/1201 from their build until MIOpen BF16 conv is stable on RDNA4.
- This PR does not reintroduce the unit test that 7d14616cee also removed (test/legacy_test/test_hip_bf16_conv_kernel.py): upstream CI on ROCm/Paddle does not run that test anyway, and the e2e BF16 verification attached below exercises the kernel more thoroughly.
- _keep_in_fp32_modules = ["visual", "mlp_AR"] and the 4 delete_pass("conv2d_add_*_fuse_pass") blocks that are currently shipped as workarounds.

Verification
Full rebuild from ROCm/Paddle:paddle_hackthon @ 4df29c5818 + these 3 commits on MI300X (gfx942) / ROCm 7.2.0 / Python 3.12 with PADDLE_ROCM_OFFLOAD_ARCHS=gfx942, then:

- paddle.matmul / F.softmax / F.gelu; all 219 leaf-sublayer outputs bfloat16, no BF16→FP32 leak asserted, 27 GELU + 27 softmax + 54 matmul all BF16.
- test_ocr.png. OCR text output semantically identical BF16 vs FP32 fallback. GPU kernel-dispatch time drops from 4,369.3 ms (FP32 fallback) → 3,897.8 ms (native BF16), a 1.12× speedup. GEMM alone saves 339 ms; the FP32 fallback path's Cijk_Ailk_Bljk_SB_MT…_MI16x16x4x1 vision GEMMs (473 ms at ranks 2 and 3 of the top-10) disappear entirely and are replaced by Cijk_Ailk_Bljk_BBS_BH_MT…_MI16x16x16x1 BF16 MFMA GEMMs.
- Kernel-class breakdown (rocprofv3 kernel_stats.csv):

The full benchmark report (methodology, env, reproduce instructions, top-10 kernel tables for both modes) is kept alongside at BF16_BENCHMARK_ROCM_FORK.md in the workspace root and is reproducible via bench_paddleocr_vl.py + rocprofv3 --kernel-trace --stats --output-format csv.

Does this cause precision changes?
No
Only the dispatch paths of ROCm/HIP builds are affected: conv BF16 uses exactly the dispatch from PR #47 (miopenBFloat16 precision), matching the behavior of the un-reverted ROCm/Paddle:develop and PaddlePaddle/Paddle:develop; softmax now routes to the matrix kernel, whose numerical implementation matches the existing FP16/FP32 path; conv2d_add_*_fuse_pass has no dispatchable kernel on HIP anyway, so after the gating the behavior is equivalent to PaddleX's current explicit delete_pass removal. CUDA builds are completely unaffected (all new #ifdefs are PADDLE_WITH_CUDA / PADDLE_WITH_HIP).