[ROCm] Enable BF16 softmax + gate cuDNN-only conv2d_add fuse passes on HIP #48

Open
austin1997 wants to merge 3 commits into ROCm:paddle_hackthon from austin1997:bf16-rocm-fork

Conversation

austin1997 commented Apr 22, 2026

PR Category

Execute Infrastructure

PR Types

Bug fixes

Description

Enables PaddleOCR-VL-1.5 to run end-to-end natively in BF16 on AMD MI300X (gfx942) under ROCm 7.x against the paddle_hackthon branch. Three independent HIP-only patches, 3 commits / 8 files / +58−12:

  1. [ROCm] Re-enable BF16 conv kernels on HIP (paddle/phi/backends/gpu/rocm/miopen_desc.h + paddle/phi/kernels/gpudnn/conv_kernel.cu + paddle/phi/kernels/gpudnn/conv_grad_kernel.cu): restores the DataType::BFLOAT16 → miopenBFloat16 mapping and the phi::bfloat16 registrations on conv2d / conv2d_grad / conv2d_double_grad / conv3d / conv3d_grad / conv3d_double_grad / depthwise_conv2d / depthwise_conv2d_double_grad, which #47 (feat(ROCm): Add BF16 support for conv kernels on HIP/ROCm) introduced and 7d14616cee reverted. Without this, PaddleOCR-VL-1.5's vision patchify Conv2D cannot dispatch a BF16 kernel and the pipeline falls back to FP32 for the entire vision encoder. Deployment to architectures without BF16 MFMA/WMMA (pre-CDNA3 / pre-RDNA3) is handled via PADDLE_ROCM_OFFLOAD_ARCHS at configure time; paddle_hackthon's default already covers the BF16-capable set (gfx942, gfx950, gfx1100, gfx1101, gfx1102, gfx1200, gfx1201).

  2. [ROCm] Route BF16 softmax through matrix kernel (MIOpen NOT_IMPLEMENTED) (paddle/phi/kernels/gpudnn/softmax_gpudnn.h): MIOpen (as of ROCm 7.x) returns MIOPEN_STATUS_NOT_IMPLEMENTED for miopenSoftmaxForward_V2 with miopenBFloat16, so whenever dim ≥ MATRIX_SOFTMAX_THRESHOLD the existing gpudnn path dispatched into MIOpen and crashed. This patch routes BF16 softmax to the existing matrix-softmax kernel on HIP and gates the CUDNN_VERSION < 8100 BF16 fallback specialization on !defined(PADDLE_WITH_HIP), because that branch also dispatched into MIOpen and would trip the same failure.

  3. [ROCm] Skip cuDNN-only conv2d fusion passes on HIP (paddle/fluid/pir/transforms/gpu/conv2d_add_fuse_pass.cc + conv2d_add_act_fuse_pass.cc + paddle/fluid/pir/transforms/passes.h + paddle/fluid/inference/api/paddle_pass_builder.cc): both PIR passes rewrite conv2d + add[+ act] into the fused_conv2d_add_act op, whose only kernel is a cuDNN-only GPUDNN kernel. On ROCm the rewrite succeeds, but dispatch later fails because no HIP kernel exists. PaddleX currently works around this by calling config.delete_pass("conv2d_add_act_fuse_pass") and config.delete_pass("conv2d_add_fuse_pass") under paddle.is_compiled_with_rocm() in paddlex/inference/models/runners/paddle_static/runner.py. This patch gates both the REGISTER_IR_PASS / USE_PIR_PASS registrations and the kPirGpuPasses list entries on PADDLE_WITH_CUDA, so the rewrite never runs on HIP builds and the PaddleX delete_pass workaround becomes unnecessary.
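
For readers unfamiliar with the PaddleX workaround mentioned in patch 3, here is a minimal sketch of what such a guard might look like. It is not the PaddleX source. The helper name `drop_cudnn_only_fuse_passes` and the `RecordingConfig` stand-in are hypothetical; the only things taken from the description are the two pass names and the `delete_pass(name)` call.

```python
# The two PIR pass names quoted in the PR description.
CUDNN_ONLY_FUSE_PASSES = ("conv2d_add_act_fuse_pass", "conv2d_add_fuse_pass")


def drop_cudnn_only_fuse_passes(config, is_rocm):
    """Delete the cuDNN-only fusion passes from `config` on ROCm builds.

    `config` only needs a `delete_pass(name)` method (hypothetical helper,
    not part of Paddle or PaddleX). Returns the names that were deleted.
    """
    if not is_rocm:
        return []
    for name in CUDNN_ONLY_FUSE_PASSES:
        config.delete_pass(name)
    return list(CUDNN_ONLY_FUSE_PASSES)


class RecordingConfig:
    """Hypothetical stand-in for an inference config, used only for the demo."""

    def __init__(self):
        self.deleted = []

    def delete_pass(self, name):
        # Record the request instead of touching a real pass pipeline.
        self.deleted.append(name)
```

In a real deployment the flag would presumably come from `paddle.is_compiled_with_rocm()` and `config` would be the inference config object; with the gating in this PR, the call would no longer be needed on HIP builds.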

Upstream relation

  • Sibling on mainline: PaddlePaddle/Paddle#78711 (same author) — the identical softmax + fuse-pass patch, opened earlier against PaddlePaddle/Paddle:develop.
  • Conv BF16 already merged upstream: PaddlePaddle/Paddle#78587 (fchange, Hackathon) — the mainline version of the conv BF16 + miopen_desc.h changes this PR restores. Already in PaddlePaddle/Paddle:develop, also ported to ROCm/Paddle:develop via ROCm/Paddle#47.
  • Revert being undone: 7d14616cee on paddle_hackthon reverted ROCm/Paddle#47's conv BF16 ahead of RDNA4 enablement. Commit 1 in this PR restores that registration; RDNA4 deployers who hit a regression can narrow PADDLE_ROCM_OFFLOAD_ARCHS to exclude gfx1200/1201 from their build until MIOpen BF16 conv is stable on RDNA4. This PR does not reintroduce the unit test that 7d14616cee also removed (test/legacy_test/test_hip_bf16_conv_kernel.py) — upstream CI on ROCm/Paddle does not run that test anyway, and the e2e BF16 verification attached below exercises the kernel more thoroughly.
  • PaddleX companion (once this lands): PaddleX#5096 drops _keep_in_fp32_modules = ["visual", "mlp_AR"] and the 4 delete_pass("conv2d_add_*_fuse_pass") blocks that are currently shipped as workarounds.
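
As an illustration of the RDNA4 escape hatch described above, the following sketch builds a narrowed value for PADDLE_ROCM_OFFLOAD_ARCHS from the default set quoted in the description. The function name is hypothetical, and the semicolon separator is an assumption (CMake list style); the description does not state the expected format.

```python
# BF16-capable default set quoted in the PR description.
DEFAULT_BF16_ARCHS = [
    "gfx942", "gfx950", "gfx1100", "gfx1101", "gfx1102", "gfx1200", "gfx1201",
]
# RDNA4 targets the description suggests excluding if a regression appears.
RDNA4_ARCHS = {"gfx1200", "gfx1201"}


def offload_archs_without(archs, excluded, sep=";"):
    """Join `archs` minus `excluded`, preserving the original order.

    The ";" separator is an assumption, not confirmed by the PR.
    """
    return sep.join(a for a in archs if a not in excluded)
```

Calling `offload_archs_without(DEFAULT_BF16_ARCHS, RDNA4_ARCHS)` yields a string that could then be passed at configure time.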

Verification

Full rebuild from ROCm/Paddle:paddle_hackthon @ 4df29c5818 + these 3 commits on MI300X (gfx942) / ROCm 7.2.0 / Python 3.12 with PADDLE_ROCM_OFFLOAD_ARCHS=gfx942, then:

  • Per-op probe — 15/15 BF16 ops pass (Conv2D, LayerNorm, Softmax, GELU, RMSNorm, SDPA-GQA, fused-bias-residual-layernorm, etc.).
  • Vision-encoder end-to-end — loads real PaddleOCR-VL-1.5 weights, monkey-patches paddle.matmul / F.softmax / F.gelu; all 219 leaf-sublayer outputs bfloat16, no BF16→FP32 leak asserted, 27 GELU + 27 softmax + 54 matmul all BF16.
  • Full pipeline benchmark + rocprofv3 — 3 timed runs per mode on test_ocr.png. OCR text output semantically identical BF16 vs FP32 fallback. GPU kernel-dispatch time drops from 4,369.3 ms (FP32 fallback) → 3,897.8 ms (native BF16), a 1.12× speedup. GEMM alone saves 339 ms; the FP32 fallback path's Cijk_Ailk_Bljk_SB_MT…_MI16x16x4x1 vision-GEMMs (473 ms at ranks 2 and 3 of the top-10) disappear entirely and are replaced by Cijk_Ailk_Bljk_BBS_BH_MT…_MI16x16x16x1 BF16 MFMA GEMMs.
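
The monkey-patching check in the vision-encoder bullet can be sketched roughly as follows. This is not the script used for the PR; `record_output_dtype` and `non_bf16_calls` are hypothetical helpers that only assume the wrapped function returns an object with a `dtype` attribute.

```python
import functools


def record_output_dtype(fn, log, name):
    """Wrap `fn` so each call appends (name, str(output.dtype)) to `log`."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        log.append((name, str(out.dtype)))
        return out

    return wrapper


def non_bf16_calls(log, expected="bfloat16"):
    """Return logged calls whose output dtype string lacks `expected`."""
    return [(name, dtype) for name, dtype in log if expected not in dtype]


# Hypothetical wiring, assuming paddle is importable:
#   log = []
#   paddle.matmul = record_output_dtype(paddle.matmul, log, "matmul")
#   ... run the vision encoder ...
#   assert non_bf16_calls(log) == []
```

An empty result from `non_bf16_calls` after a forward pass would correspond to the "no BF16→FP32 leak" condition, under the assumption that every patched op returns a single tensor.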

Kernel-class breakdown (rocprofv3 kernel_stats.csv):

| Op class | FP32 calls | FP32 ms | BF16 calls | BF16 ms | Δ ms (FP32 − BF16) | Speedup |
|---|---:|---:|---:|---:|---:|---:|
| GEMM (rocBLAS/hipBLASLt) | 87,752 | 1,786.45 | 87,752 | 1,447.27 | +339.18 | 1.23× |
| Cast / copy / memcpy | 549,564 | 1,657.47 | 585,702 | 1,827.41 | −169.95 | 0.91× |
| Elementwise add / mul / bias | 122,840 | 357.59 | 107,720 | 210.71 | +146.88 | 1.70× |
| Other (incl. bf16-specialized) | 40,959 | 241.43 | 27,499 | 94.93 | +146.50 | 2.54× |
| Reduction / sum / mean | 36,004 | 144.68 | 36,004 | 142.01 | +2.67 | 1.02× |
| Softmax | 13,588 | 98.33 | 13,588 | 95.00 | +3.33 | 1.04× |
| Layer norm / RMS norm | 4,572 | 24.70 | 4,572 | 22.61 | +2.09 | 1.09× |
| Conv / MIOpen | 236 | 12.46 | 236 | 12.55 | −0.09 | 0.99× |
| GELU / SiLU / activation | 11,832 | 33.22 | 11,832 | 33.63 | −0.41 | 0.99× |
| Top-k / argmax / sort | 8 | 10.00 | 8 | 9.94 | +0.07 | 1.01× |
| Transpose / reshape | 400 | 2.19 | 240 | 1.38 | +0.82 | 1.59× |
| Embedding / gather / index | 96 | 0.25 | 96 | 0.28 | −0.03 | 0.90× |
| Interpolate / resize | 96 | 0.49 | 16 | 0.09 | +0.40 | 5.45× |
| Fill / set_value / memset | 1 | 0.01 | 1 | 0.01 | −0.00 | 0.98× |
| Total | 867,948 | 4,369.28 | 875,266 | 3,897.82 | +471.46 | 1.12× |
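
The last two columns of the breakdown follow from the two duration columns: Δ is FP32 time minus BF16 time (positive means BF16 is faster), and speedup is their ratio. A small sketch of that arithmetic, using a hypothetical helper name:

```python
def delta_and_speedup(fp32_ms, bf16_ms):
    """Return (time saved in ms, FP32/BF16 ratio), both rounded to 2 places."""
    delta = fp32_ms - bf16_ms      # positive when BF16 is faster
    speedup = fp32_ms / bf16_ms    # > 1 when BF16 is faster
    return round(delta, 2), round(speedup, 2)


# GEMM row: 1,786.45 ms FP32 vs 1,447.27 ms BF16
gemm = delta_and_speedup(1786.45, 1447.27)
# Total row: 4,369.28 ms FP32 vs 3,897.82 ms BF16
total = delta_and_speedup(4369.28, 3897.82)
```

Recomputing from the rounded values shown can differ from the reported figure in the last digit (for example the Cast / copy row), presumably because the reported deltas were derived from unrounded durations.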

The full benchmark report (methodology, env, reproduce instructions, top-10 kernel tables for both modes) is kept alongside at BF16_BENCHMARK_ROCM_FORK.md in the workspace root and is reproducible via bench_paddleocr_vl.py + rocprofv3 --kernel-trace --stats --output-format csv.

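Does this change numerical precision?

Only the dispatch path of ROCm/HIP builds is affected. The conv BF16 path is the same dispatch introduced by PR #47 (miopenBFloat16 precision), and its behavior matches the un-reverted ROCm/Paddle:develop and PaddlePaddle/Paddle:develop. Softmax now goes through the matrix kernel, whose numerical implementation is the same as the existing FP16/FP32 path. conv2d_add_*_fuse_pass never had a dispatchable kernel on HIP, so after gating the behavior is equivalent to the explicit delete_pass removal that PaddleX ships today. CUDA builds are completely unaffected (every new #ifdef is on PADDLE_WITH_CUDA / PADDLE_WITH_HIP).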

@austin1997
Author

BF16 profiling: rocprofv3 --kernel-trace --stats output

Raw kernel_stats.csv from the BF16 run of bench_paddleocr_vl.py on this PR branch (ROCm/Paddle:develop @ 29d1c6f0ca + 2 BF16 commits), MI300X / ROCm 7.2.0 / Python 3.12. Row count: 220 (incl. header). Columns: Name, Calls, TotalDurationNs, AverageNs, Percentage, MinNs, MaxNs, StdDev. The Cijk_Ailk_Bljk_BBS_BH_*_MI16x16x16x1_* kernels are hipBLASLt BF16 GEMMs (BBS = bf16/bf16 inputs, BH = bf16 accumulator kind).
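
To turn a kernel_stats.csv like the one below into a per-class summary, one could bucket rows by kernel name and sum the Calls and TotalDurationNs columns. The sketch below is only an illustration: the substring rules in `RULES` are hypothetical, since the PR does not state how kernels were assigned to classes, and they cover only a few classes.

```python
import csv
import io
from collections import defaultdict

# Hypothetical substring rules; the PR does not document its bucketing.
RULES = [
    ("GEMM", ("Cijk_",)),
    ("Cast / copy", ("CastFunctor", "copyBuffer")),
    ("Softmax", ("Softmax", "SoftMax")),
]


def classify(kernel_name):
    """Return the first class whose substrings match, else 'Other'."""
    for label, needles in RULES:
        if any(n in kernel_name for n in needles):
            return label
    return "Other"


def summarize(csv_text):
    """Map class -> (total calls, total ms) from kernel_stats.csv text."""
    totals = defaultdict(lambda: [0, 0])
    # csv handles the quoted kernel names, which contain commas.
    for row in csv.DictReader(io.StringIO(csv_text)):
        bucket = totals[classify(row["Name"])]
        bucket[0] += int(row["Calls"])
        bucket[1] += int(row["TotalDurationNs"])
    return {k: (calls, ns / 1e6) for k, (calls, ns) in totals.items()}
```

Reading the file with `open(path, newline="").read()` and passing the text to `summarize` would give one (calls, ms) pair per class; reproducing the exact table in the description would need the author's own classification rules.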

kernel_stats.csv
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"Cijk_Ailk_Bljk_BBS_BH_MT64x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM16",22752,452959812,19908.571203,11.61,15236,30030,3447.533088
"Cijk_Ailk_Bljk_BBS_BH_MT64x64x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA1024_LBSPPB128_LPA16_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR5_PKA0_SIA3_SLW1_SS0_SU16_SUM0_SUS512_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG64_4_1_WGMn16",13536,299983287,22161.885860,7.69,16077,85118,11716.922064
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CastFunctor<phi::dtype::bfloat16, float>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::CastFunctor<phi::dtype::bfloat16, float>)",76632,189068997,2467.232971,4.85,681,18523,1662.264516
"Cijk_Alik_Bljk_BBS_BH_MT32x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA4_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA1_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG32_8_1_WGM1",13536,151993442,11228.829935,3.90,2646,64391,18123.829511
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>)",48464,148192009,3057.775029,3.80,882,9542,355.901133
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<phi::dtype::bfloat16>)",52616,146850533,2790.986259,3.76,681,13229,1610.691688
"Cijk_Ailk_Bljk_BBS_BH_MT64x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA32_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM16",8640,145431134,16832.307176,3.73,14394,23054,793.679000
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",34272,139475857,4069.673699,3.57,2325,23815,3212.673338
"void phi::UnaryElementwiseKernel<phi::ScaleFunctor<phi::dtype::bfloat16, float>, phi::dtype::bfloat16, unsigned int, 1, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, int, phi::ScaleFunctor<phi::dtype::bfloat16, float>, phi::funcs::OffsetCalculator<(1)+(1), unsigned int, false>)",36288,122145253,3365.995729,3.13,681,12309,806.043801
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<float>, float, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<float>)",32166,116469634,3620.892682,2.99,1162,28026,1270.872721
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::CastFunctor<float, phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::CastFunctor<float, phi::dtype::bfloat16>)",42585,114347908,2685.168674,2.93,1523,15115,1832.671964
"Cijk_Ailk_Bljk_BBS_BH_MT32x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_ELFLR4_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA16_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC2_NTD2_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_8_1_WGMn16",11376,111906819,9837.097310,2.87,7898,16959,852.732366
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::MeanOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::MeanOps<float, float, float>, float, 4, 4>)",23384,102251854,4372.727249,2.62,2606,11386,484.961907
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>)",34592,86680433,2505.794201,2.22,682,13031,1643.969796
"Cijk_Ailk_Bljk_BBS_BH_MT128x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLRn30_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS1_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB2_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB2_VFLRP1_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM8",2240,79588668,35530.655357,2.04,30592,67678,1522.834254
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<phi::dtype::bfloat16>)",13280,78303630,5896.357681,2.01,2967,14995,1816.094217
"Cijk_Ailk_Bljk_BBS_BH_MT128x128x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR12_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB1024_LPA0_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB2_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU16_SUM0_SUS256_SCIUI1_SPO1_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_128_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB4_VFLRP1_WSGRA0_WSGRB0_WS64_WG64_4_1_WGMn2",632,66978985,105979.406646,1.72,90611,205840,36145.186322
"void phi::RepeatInterleaveVecKernel<phi::dtype::bfloat16, 8>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, long, long, long, long, int)",22752,66164874,2908.090454,1.70,1844,8580,390.888878
"void phi::funcs::SplitTensorWithDifferentShape<phi::dtype::bfloat16, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8> >(phi::dtype::bfloat16 const*, int, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8>)",22752,64057566,2815.469673,1.64,882,10866,305.474114
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 8> >(phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 8>, int, int, int, void*)",22752,63094270,2773.130714,1.62,1163,11908,389.876783
"SoftMaxCommon",2164,59287367,27397.119686,1.52,22412,32917,1961.622672
"void phi::funcs::SplitTensorWithDifferentShape<phi::dtype::bfloat16, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4> >(phi::dtype::bfloat16 const*, int, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4>)",22806,58465028,2563.580987,1.50,762,12469,557.898035
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 4> >(phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 4>, int, int, int, void*)",19890,57991347,2915.603167,1.49,2125,10825,310.682675
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::ScaleFunctor<phi::dtype::bfloat16, float>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::ScaleFunctor<phi::dtype::bfloat16, float>)",8640,54670751,6327.633218,1.40,2847,16118,1282.403601
"void phi::funcs::ConcatTensorWithSameShape<int, 16, phi::funcs::AlignedPointerWrapper<phi::dtype::bfloat16, 4> >(phi::funcs::AlignedPointerWrapper<phi::dtype::bfloat16, 4>, int, int, int, void*)",22779,47828955,2099.695114,1.23,762,16077,495.657251
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<float>, float, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<float>)",12224,47692693,3901.561927,1.22,2486,56171,1959.136766
"void phi::funcs::VectorizedElementwiseKernel<float, phi::ScaleFunctor<float, float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::ScaleFunctor<float, float>)",25446,46050852,1809.748173,1.18,641,15877,371.344802
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSquareFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSquareFunctor<float>)",23384,45575904,1949.020869,1.17,682,4249,192.285457
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaRsqrtFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaRsqrtFunctor<float>)",23384,43188777,1846.937094,1.11,681,6855,220.938574
"Cijk_Ailk_Bljk_BBS_BH_MT16x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM16",9936,42368404,4264.130837,1.09,3127,13590,441.469141
"void phi::ContiguousCaseOneFunc<phi::dtype::bfloat16, 4ul>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",12816,38623160,3013.667291,0.9899,1363,9742,238.871330
"void phi::funcs::VectorizedElementwiseKernel<long, phi::ScaleFunctor<long, long>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, long, long, int, phi::ScaleFunctor<long, long>)",22744,38592279,1696.811423,0.9891,681,8460,212.390187
"void phi::ArgCUDAKernel<phi::dtype::bfloat16, long, hipcub::HIPCUB_400200_NS::ArgMax, 1024ul, int>(long, long, long, hipcub::HIPCUB_400200_NS::ArgMax, phi::dtype::bfloat16, phi::dtype::bfloat16 const*, long*)",632,38212799,60463.289557,0.9794,58055,63869,927.092514
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<long, phi::kps::MaxOps<long, long, long>, long, 4, 4> >(phi::ReduceExecutor<long, phi::kps::MaxOps<long, long, long>, long, 4, 4>)",11616,37372984,3217.371212,0.9579,2045,11587,645.041122
"void phi::ContiguousCaseZeroFunc<phi::dtype::bfloat16, 4ul>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, common::Array<long, 10ul>)",19872,35128031,1767.714926,0.9003,681,7578,244.781016
"Cijk_Ailk_Bljk_BBS_BH_MT16x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO4_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM32",1404,33643242,23962.423077,0.8623,21209,27785,1140.684487
"__amd_rocclr_copyBuffer",9956,32688374,3283.283849,0.8378,641,92014,2279.954877
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::funcs::CudaSiluFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::funcs::CudaSiluFunctor<phi::dtype::bfloat16>)",11376,30397756,2672.095288,0.7791,1764,5974,475.291018
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<float>)",5182,27999197,5403.164222,0.7176,1643,57935,3085.720073
"Cijk_Ailk_Bljk_BBS_BH_MT32x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR8_EMLL0_FSSC10_FL0_GLVWA2_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS1_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA16_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR5_PKA0_SIA3_SLW1_SS0_SU64_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM16",756,23481691,31060.437831,0.6018,26942,36044,1686.590193
"void phi::funcs::LayerNormForward<phi::dtype::bfloat16, float, 512, true, phi::dtype::bfloat16, phi::dtype::bfloat16>(phi::dtype::bfloat16 const*, std::conditional<true, phi::dtype::bfloat16, float>::type const*, std::conditional<true, phi::dtype::bfloat16, float>::type const*, phi::dtype::bfloat16*, float*, float*, float, long, float const*, int, float, int, float, float)",4480,22563671,5036.533705,0.5783,4049,10745,363.160493
"void phi::ContiguousCaseOneFunc<float, 4ul>(float const*, float*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",5584,20317108,3638.450573,0.5207,1363,12389,524.043820
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::funcs::CudaTanhFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::funcs::CudaTanhFunctor<phi::dtype::bfloat16>)",2160,20031675,9273.923611,0.5134,6735,13631,699.706645
"void phi::WarpSoftmaxForward<float, float, float, int, 8, false>(float*, float const*, int, int, int)",5904,19542085,3309.973747,0.5009,2606,7497,721.684704
"void phi::funcs::ConcatTensorWithSameShape<int, 16, phi::funcs::AlignedPointerWrapper<float, 4> >(phi::funcs::AlignedPointerWrapper<float, 4>, int, int, int, void*)",4972,18977088,3816.791633,0.4864,1563,25219,1185.659655
"void phi::UnaryElementwiseKernel<phi::ScaleFunctor<float, float>, float, unsigned int, 1, 1, 1, 2>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, int, phi::ScaleFunctor<float, float>, phi::funcs::OffsetCalculator<(1)+(1), unsigned int, false>)",4320,15734150,3642.164352,0.4033,3127,5493,170.098614
"Cijk_Ailk_Bljk_BBS_BH_MT16x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM32",1440,14012390,9730.826389,0.3591,8579,14312,672.901376
"void phi::Strided2ContiguousCaseOneFunc<phi::dtype::bfloat16, 2ul>(phi::dtype::bfloat16 const*, common::Array<long, 10ul>, phi::dtype::bfloat16*, common::Array<long, 6ul>, long)",411,12537435,30504.708029,0.3213,8780,2501311,173467.743740
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::funcs::CudaCubeFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::funcs::CudaCubeFunctor<phi::dtype::bfloat16>)",2160,10939180,5064.435185,0.2804,2887,6455,144.386400
"void phi::funcs::KeMatrixTopK<float, 20, 64>(float*, int, long*, float const*, long, long, int, int, long, bool)",8,10002233,1250279.125000,0.2564,1044234,1464774,207354.412882
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 2u>, float, int, 8, false>(float*, float const*, int, int, int)",2736,7719975,2821.628289,0.1979,2326,4490,268.480673
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 4u>, float, int, 8, false>(float*, float const*, int, int, int)",2736,7712914,2819.047515,0.1977,2246,4451,319.850095
"miopenSp3AsmConv_v30_3_1_gfx9_fp32_f2x3_stride1",104,7144250,68694.711538,0.1831,21931,198744,50453.211098
"void phi::funcs::VectorizedElementwiseKernel<long, phi::FullFunctor<long, long>, 0, 1, 1>(common::Array<char const* restrict, 0>, common::Array<long*, 1>, long, long, int, phi::FullFunctor<long, long>)",3103,6917501,2229.294554,0.1773,681,8580,1218.030319
"naive_conv_ab_nonpacked_fwd_nchw_float_double_float",108,4475986,41444.314815,0.1147,12549,94460,16868.798825
"void phi::Range<long, long>(long, long, long, long*)",2391,4150732,1735.981598,0.1064,1162,10144,835.062527
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::FullFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>, 0, 1, 8>(common::Array<char const* restrict, 0>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::FullFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>)",2218,4080934,1839.916141,0.1046,1243,8340,350.611940
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<float, float>, float, 1, 1, 4, 1>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<float, float>)",716,3647979,5094.942737,0.0935,2406,14594,1273.587701
"Cijk_Ailk_Bljk_BBS_BH_MT64x64x64_MI32x32x8x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR3_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB2_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_2_4_WGM1",80,3630697,45383.712500,0.0931,44023,46990,526.850306
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<long, long>, long, 1, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<long, long>)",964,3611809,3746.689834,0.0926,3007,5452,276.884200
"void phi::funcs::ConcatTensorWithDifferentShape<int, 8, phi::funcs::PointerAndColWrapper<long, int, 4> >(phi::funcs::PointerAndColWrapper<long, int, 4>, int, int, int, void*)",1264,3590920,2840.917722,0.0920,2205,4851,283.850751
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<bool, phi::kps::LogicalOrOps<bool, bool, bool>, bool, 4, 4> >(phi::ReduceExecutor<bool, phi::kps::LogicalOrOps<bool, bool, bool>, bool, 4, 4>)",636,2997105,4712.429245,0.0768,3929,13472,705.964341
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaCosFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaCosFunctor<float>)",712,2974689,4177.933989,0.0762,2927,5212,384.378838
"void phi::IndexSampleForward<phi::dtype::bfloat16, long, unsigned int>(long const*, phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, unsigned long, unsigned long, unsigned long)",632,2903849,4594.697785,0.0744,3368,6295,133.240671
"void phi::funcs::VectorizedElementwiseKernel<float, phi::FullFunctor<float, float>, 0, 1, 4>(common::Array<char const* restrict, 0>, common::Array<float*, 1>, long, long, int, phi::FullFunctor<float, float>)",1368,2861869,2092.009503,0.0734,641,33037,2117.204446
"void phi::GridSampleCudaKernel<float, int>(int, int, int, int, int, float const*, float const*, float*, phi::Mode, phi::PaddingMode, bool)",72,2613294,36295.750000,0.0670,26822,43060,4677.142148
"void phi::funcs::VectorizedElementwiseKernel<long, phi::CondFunctor<long>, 3, 1, 1>(common::Array<char const* restrict, 3>, common::Array<long*, 1>, long, long, int, phi::CondFunctor<long>)",632,2562249,4054.191456,0.0657,2125,6014,389.528232
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSinFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSinFunctor<float>)",712,2429296,3411.932584,0.0623,2767,4089,237.979939
"void phi::BinaryElementwiseKernel<phi::funcs::NotEqualFunctor<long, bool>, bool, unsigned int, 2, 1, 1, 4>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, long, int, phi::funcs::NotEqualFunctor<long, bool>, phi::funcs::OffsetCalculator<(2)+(1), unsigned int, false>)",632,2291964,3626.525316,0.0587,3167,4811,201.171331
"void phi::EmbeddingFW<phi::dtype::bfloat16, long, false>(phi::dtype::bfloat16*, phi::dtype::bfloat16 const*, long const*, long, long, long, long)",632,2253037,3564.931962,0.0577,1163,7498,482.351740
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::EqualFunctor<phi::dtype::bfloat16, bool>, bool, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::EqualFunctor<phi::dtype::bfloat16, bool>)",632,2092442,3310.825949,0.0536,2606,4009,219.789162
"void phi::funcs::DistributionKernel<phi::dtype::bfloat16, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float> >(unsigned long, unsigned long, unsigned long, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float>, phi::dtype::bfloat16*, unsigned long)",246,2082939,8467.231707,0.0534,4209,119879,10643.156999
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CondFunctor<float>, 3, 1, 8>(common::Array<char const* restrict, 3>, common::Array<float*, 1>, long, long, int, phi::CondFunctor<float>)",644,2001074,3107.257764,0.0513,722,73291,6219.015400
"void phi::funcs::VectorizedElementwiseKernel<long, phi::CastFunctor<bool, long>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, long, long, int, phi::CastFunctor<bool, long>)",1032,1904646,1845.587209,0.0488,1243,2406,183.620497
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<float>)",714,1776289,2487.799720,0.0455,1483,4771,707.199076
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaReluFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaReluFunctor<float>)",324,1769602,5461.734568,0.0454,1924,26542,4598.425476
"void phi::funcs::TilingSwapDim1And2<phi::dtype::bfloat16, 256, 32, 32, int>(phi::dtype::bfloat16 const*, phi::funcs::Dim3<int>, phi::dtype::bfloat16*)",250,1676350,6705.400000,0.0430,2446,13992,2336.078915
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>, phi::dtype::bfloat16, 1, 1, 4, 3>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>)",712,1668773,2343.782303,0.0428,1764,3448,277.999179
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::DivideFunctor<float, void>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::DivideFunctor<float, void>)",744,1652531,2221.143817,0.0424,642,4531,486.152134
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<long, bool>, bool, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<long, bool>)",552,1628957,2951.009058,0.0418,2486,3408,178.246780
"void phi::ContiguousCaseOneFunc<long, 2ul>(long const*, long*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",552,1585287,2871.896739,0.0406,2045,12108,1208.802393
"void phi::RepeatInterleaveVecKernel<long, 1>(long const*, long*, long, long, long, long, int)",552,1558788,2823.891304,0.0400,2325,3849,218.415817
"igemm_fwd_gtcx3_nhwc_bf16_bx0_ex1_bt128x128x16_wt32x32x4_ws1x1_wr2x2_ta1x1x8x1_1x16x1x16_tb1x1x8x1_1x16x1x16_me",40,1536624,38415.600000,0.0394,37688,38770,315.532094
"Cijk_Ailk_Bljk_SB_MT16x16x4_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT017_165_35_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO2_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_1_WGM16",552,1520057,2753.726449,0.0390,2365,3608,168.904593
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::CastFunctor<long, phi::dtype::bfloat16>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::CastFunctor<long, phi::dtype::bfloat16>)",632,1491909,2360.615506,0.0382,1844,2887,196.107613
"void phi::GPUMaskedFillOneValueKernel<phi::dtype::bfloat16, 1>(phi::dtype::bfloat16 const*, bool const*, phi::dtype::bfloat16 const*, long, long, phi::dtype::bfloat16*)",632,1486747,2352.447785,0.0381,1764,4010,212.965025
"void phi::funcs::VectorizedElementwiseKernel<bool, phi::CastFunctor<long, bool>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<bool*, 1>, long, long, int, phi::CastFunctor<long, bool>)",632,1483533,2347.362342,0.0380,1363,3609,194.479426
"Cijk_Ailk_Bljk_SB_MT64x32x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM2_AF1EM2_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM2_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM6",244,1476481,6051.151639,0.0378,2806,7698,1074.918175
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::CastFunctor<bool, phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::CastFunctor<bool, phi::dtype::bfloat16>)",632,1439990,2278.465190,0.0369,1724,3007,202.446946
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::LogicalAndFunctor<bool>, bool, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::LogicalAndFunctor<bool>)",632,1329297,2103.318038,0.0341,1644,3247,188.437974
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CastFunctor<long, float>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::CastFunctor<long, float>)",722,1301590,1802.756233,0.0334,681,4491,384.323469
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<long>, long, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<long>)",632,1290935,2042.618671,0.0331,681,3328,211.035877
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<long, long>, long, 1, 1, 1, 3>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<long, long>)",560,1194152,2132.414286,0.0306,1524,3368,194.908583
"Cijk_Ailk_Bljk_SB_MT256x256x16_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT8_128_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM18",8,1187328,148416.000000,0.0304,93538,202111,56680.349243
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::EqualFunctor<long, bool>, bool, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::EqualFunctor<long, bool>)",560,1122120,2003.785714,0.0288,642,3930,459.743676
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB8_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM1",12,1098921,91576.750000,0.0282,77220,99431,10288.687204
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD0_NEPBS16_NLCA1_NLCB1_ONLL2_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG32_8_1_WGM1",24,1000294,41678.916667,0.0256,23976,57374,9950.349222
"void phi::funcs::CumsumOneBlock<long, long, phi::kps::AddFunctor<long>, 2>(long const*, long*, long, long, phi::kps::AddFunctor<long>)",240,961168,4004.866667,0.0246,2846,4731,305.320873
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<long, phi::kps::SumOps<long, long, long>, long, 4, 4> >(phi::ReduceExecutor<long, phi::kps::SumOps<long, long, long>, long, 4, 4>)",240,918193,3825.804167,0.0235,2606,5293,771.407001
"igemm_fwd_gtcx3_nhwc_bf16_bx0_ex1_bt64x64x16_wt16x16x4_ws1x1_wr2x2_ta1x1x4x1_1x16x1x16_tb1x1x4x1_1x16x1x16_me",36,888188,24671.888889,0.0228,23655,25099,347.315154
"void phi::ContiguousCaseOneFunc<phi::dtype::bfloat16, 2ul>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",160,833625,5210.156250,0.0214,4170,6174,466.731139
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",88,808046,9182.340909,0.0207,6134,17882,4190.962924
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<bool, bool>, bool, 1, 1, 4, 1>(common::Array<char const* restrict, 1>, common::Array<bool*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<bool, bool>)",80,796294,9953.675000,0.0204,7617,12429,1313.660176
"batched_transpose_128x4_half",160,780819,4880.118750,0.0200,2285,7137,1377.483241
"void phi::funcs::SelectKernel<bool, bool, long, long, phi::IndexFunctor<bool, long, long>, 1, 0>(long*, bool const*, bool const*, long*, phi::IndexFunctor<bool, long, long>, long, long, long)",160,742212,4638.825000,0.0190,4049,11347,629.729827
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",208,633031,3043.418269,0.0162,2405,4771,476.033240
"void phi::funcs::GetBlockCountKernel<bool, long, 1>(bool const*, long*, long, long)",240,626456,2610.233333,0.0161,2124,3930,218.867061
"void rocprim::ROCPRIM_400200_NS::detail::trampoline_kernel<rocprim::ROCPRIM_400200_NS::detail::wrapped_scan_config<rocprim::ROCPRIM_400200_NS::default_config, long>, (rocprim::ROCPRIM_400200_NS::detail::target_arch)942, rocprim::ROCPRIM_400200_NS::detail::scan_impl<(rocprim::ROCPRIM_400200_NS::detail::lookback_scan_determinism)0, true, true, rocprim::ROCPRIM_400200_NS::default_config, long*, long*, long, hipcub::HIPCUB_400200_NS::Sum, long>(void*, unsigned long&, long*, long*, long, unsigned long, hipcub::HIPCUB_400200_NS::Sum, ihipStream_t*, bool)::{lambda(auto:1, auto:2)#1}::operator()<std::integral_constant<bool, false>, std::integral_constant<bool, true> >(std::integral_constant<bool, false>, std::integral_constant<bool, true>) const::{lambda(auto:1)#1}, rocprim::ROCPRIM_400200_NS::detail::default_config_selector>(rocprim::ROCPRIM_400200_NS::detail::scan_impl<(rocprim::ROCPRIM_400200_NS::detail::lookback_scan_determinism)0, true, true, rocprim::ROCPRIM_400200_NS::default_config, long*, long*, long, hipcub::HIPCUB_400200_NS::Sum, long>(void*, unsigned long&, long*, long*, long, unsigned long, hipcub::HIPCUB_400200_NS::Sum, ihipStream_t*, bool)::{lambda(auto:1, auto:2)#1}::operator()<std::integral_constant<bool, false>, std::integral_constant<bool, true> >(std::integral_constant<bool, false>, std::integral_constant<bool, true>) const::{lambda(auto:1)#1})",80,607982,7599.775000,0.0156,6736,10184,828.286427
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::RemainderFunctor<long, void>, long, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::RemainderFunctor<long, void>)",240,566969,2362.370833,0.0145,1844,7136,371.559903
"Cijk_Ailk_Bljk_SB_MT32x32x32_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",16,553008,34563.000000,0.0142,20408,68640,20012.806073
"void phi::funcs::TilingSwapDim1And2<float, 256, 32, 32, int>(float const*, phi::funcs::Dim3<int>, float*)",100,531804,5318.040000,0.0136,2967,12990,2173.862916
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CastFunctor<bool, float>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::CastFunctor<bool, float>)",34,510712,15020.941176,0.0131,1524,28306,6093.132025
"void phi::funcs::index_elementwise_with_tensor_kernel<128, 4, phi::GPUIndexElementwisePutWithTensorKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1}>(long, phi::GPUIndexElementwisePutWithTensorKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1})",80,505180,6314.750000,0.0129,5693,6937,296.633879
"void phi::funcs::index_elementwise_with_tensor_kernel<128, 4, phi::GPUIndexElementwiseGetKernel<float, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1}>(long, phi::GPUIndexElementwiseGetKernel<float, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1})",80,469543,5869.287500,0.0120,5373,7176,271.298434
"Cijk_Ailk_Bljk_SB_MT64x64x16_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB4_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM1",12,467007,38917.250000,0.0120,37768,40374,986.632300
"miopenSp3AsmConv_v30_3_1_gfx9_fp32_f3x2_stride1",16,464245,29015.312500,0.0119,21651,52964,7227.554558
"void phi::funcs::StackCudaKernel<long, int, phi::funcs::ConstPointerArray<long, (phi::funcs::SegmentedArraySize)4> >(phi::funcs::ConstPointerArray<long, (phi::funcs::SegmentedArraySize)4>, phi::funcs::FastDivMod<int>, int, int, int, long*)",168,453646,2700.273810,0.0116,2125,4931,497.336796
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<long, phi::kps::ProdOps<long, long, long>, long, 4, 4> >(phi::ReduceExecutor<long, phi::kps::ProdOps<long, long, long>, long, 4, 4>)",80,448686,5608.575000,0.0115,4691,8941,689.975248
"SubTensorOpWithScalar1d",108,431322,3993.722222,0.0111,2646,9662,1280.310979
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSwishFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSwishFunctor<float>)",76,431287,5674.828947,0.0111,2807,11788,2855.916490
"Cijk_Ailk_Bljk_SB_MT128x128x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR4_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS1_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG32_8_1_WGM1",12,430643,35886.916667,0.0110,26862,41337,6292.715723
"void phi::BNForwardInference<float, (common::DataLayout)2>(float const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, int, int, long, double, float*)",108,430521,3986.305556,0.0110,2646,7738,850.272419
"void phi::fusion::FusedLayernormResidualDropoutBias<float, unsigned char, 4, float, false, false>(long, long, unsigned long, float, bool, bool, unsigned long, float, float const*, float const*, float const*, std::conditional<false, float, float>::type const*, std::conditional<false, float, float>::type const*, unsigned char*, float*, float*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType*, float)",80,425511,5318.887500,0.0109,3809,7617,1281.625609
"miopenSp3AsmConv_v30_3_1_gfx9_fp32_f3x2_stride2",8,392396,49049.500000,0.0101,28025,70364,22090.260562
"phi::BoolToInt64Kernel(bool const*, long*, long)",80,392275,4903.437500,0.0101,2325,6014,790.589235
"void phi::KeBilinearInterpNCHWFw<phi::dtype::bfloat16, float>(phi::dtype::bfloat16 const*, unsigned long, unsigned long, phi::dtype::bfloat16*, unsigned long, unsigned long, unsigned long, float, float, bool, int)",80,389180,4864.750000,9.975e-03,4089,5813,288.192249
"void phi::funcs::VectorizedElementwiseKernel<bool, phi::FullFunctor<bool, bool>, 0, 1, 8>(common::Array<char const* restrict, 0>, common::Array<bool*, 1>, long, long, int, phi::FullFunctor<bool, bool>)",89,370513,4163.067416,9.496e-03,2887,13552,2549.549536
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 4ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 4ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",80,365849,4573.112500,9.377e-03,4129,5733,241.464361
"void phi::funcs::SelectKernel<bool, long, long, long, phi::MaskedSelectFunctor<bool, long, long>, 1, 1>(long*, bool const*, long const*, long*, phi::MaskedSelectFunctor<bool, long, long>, long, long, long)",80,361727,4521.587500,9.271e-03,4170,6856,315.683333
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex1_bt128x128x16_wt32x32x2_ws1x1_wr2x2_ta1x4x2x1_1x4x1x64_tb1x4x2x1_1x4x1x64_gkgs",16,357955,22372.187500,9.174e-03,20407,22934,582.442926
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::GeluWithoutApproximateFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::GeluWithoutApproximateFunctor<phi::dtype::bfloat16>)",80,344239,4302.987500,8.823e-03,3247,5894,303.887117
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSiluFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSiluFunctor<float>)",48,318302,6631.291667,8.158e-03,2726,12469,2928.314132
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<float>, float, 2, 1, 4, 2>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<float>)",80,292918,3661.475000,7.508e-03,3327,4210,172.616570
"void phi::funcs::StackCudaKernel<float, int, phi::funcs::ConstPointerArray<float, (phi::funcs::SegmentedArraySize)4> >(phi::funcs::ConstPointerArray<float, (phi::funcs::SegmentedArraySize)4>, phi::funcs::FastDivMod<int>, int, int, int, float*)",44,291520,6625.454545,7.472e-03,2526,33839,8012.746623
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::SubtractFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::SubtractFunctor<float>)",94,277730,2954.574468,7.118e-03,1884,4531,457.045455
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::FloorDivideFunctor<long, void>, long, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::FloorDivideFunctor<long, void>)",84,272715,3246.607143,6.990e-03,2927,4530,308.502486
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSigmoidFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSigmoidFunctor<float>)",40,263294,6582.350000,6.748e-03,2526,43221,11302.503069
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 6, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 6> const, Eigen::DSizes<long, 6> const, Eigen::TensorMap<Eigen::Tensor<float const, 6, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 6, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 6> const, Eigen::DSizes<long, 6> const, Eigen::TensorMap<Eigen::Tensor<float const, 6, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",72,263049,3653.458333,6.742e-03,3207,4290,242.365859
"batched_transpose_16x32_dword",32,254835,7963.593750,6.531e-03,2807,21811,5220.735635
"Cijk_Ailk_Bljk_SB_MT16x16x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT081_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM16",60,252992,4216.533333,6.484e-03,3408,6776,723.317275
"void phi::MaskedScatterCUDAKernel<phi::dtype::bfloat16>(phi::dtype::bfloat16 const*, bool const*, phi::dtype::bfloat16 const*, long const*, long, phi::dtype::bfloat16*)",80,249704,3121.300000,6.400e-03,2285,8540,807.519901
"void phi::funcs::SplitTensorWithDifferentShape<float, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4> >(float const*, int, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4>)",28,248465,8873.750000,6.368e-03,2406,10384,2666.002231
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB8_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD3_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM48",4,247376,61844.000000,6.340e-03,61463,62425,441.030611
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG32_4_2_WGMn16",28,245613,8771.892857,6.295e-03,7859,12028,1359.209365
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",24,242127,10088.625000,6.206e-03,9663,10705,236.417062
"Cijk_Ailk_Bjlk_SB_MT64x64x16_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM2_AF1EM2_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM2_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM12",24,237072,9878.000000,6.076e-03,9622,10344,159.036228
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<long>, long, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<long>)",80,234667,2933.337500,6.015e-03,2486,3408,165.898089
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<long, bool>, bool, 2, 1, 1, 2>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<long, bool>)",80,233671,2920.887500,5.989e-03,2766,3127,86.798617
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSqrtFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSqrtFunctor<float>)",78,230858,2959.717949,5.917e-03,2045,3849,285.294739
"void phi::funcs::index_elementwise_with_tensor_kernel<128, 4, phi::GPUIndexElementwiseGetKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1}>(long, phi::GPUIndexElementwiseGetKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1})",80,226008,2825.100000,5.793e-03,2445,3970,253.048482
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG16_8_2_WGMn16",36,225043,6251.194444,5.768e-03,5213,7377,524.453992
"igemm_fwd_gtcx3_nhwc_bf16_bx0_ex1_bt256x64x8_wt64x16x4_ws1x1_wr2x2_ta1x1x8x1_1x8x1x32_tb1x1x2x1_1x8x1x32_me",4,216263,54065.750000,5.543e-03,53846,54246,177.501878
"batched_transpose_32x32_dword",40,211373,5284.325000,5.418e-03,3969,7898,1273.798305
"void phi::funcs::VectorizedElementwiseKernel<float, phi::ClipFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::ClipFunctor<float>)",84,206125,2453.869048,5.283e-03,2005,3889,241.941135
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<float, int, 8> >(phi::funcs::PointerAndColWrapper<float, int, 8>, int, int, int, void*)",20,197861,9893.050000,5.071e-03,5773,13512,2374.226533
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 4> const, Eigen::DSizes<long, 4> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 4> const, Eigen::DSizes<long, 4> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",68,196101,2883.838235,5.026e-03,1444,9462,940.124247
"void phi::funcs::SplitTensorWithDifferentShape<float, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8> >(float const*, int, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8>)",8,196096,24512.000000,5.026e-03,2606,46067,21754.002417
"Cijk_Ailk_Bljk_SB_MT64x32x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB2_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM6",44,193816,4404.909091,4.968e-03,4130,4691,123.987775
"void phi::Range<float, float>(float, float, long, float*)",82,182266,2222.756098,4.672e-03,1964,3448,216.431928
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB8_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD3_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM32",4,177735,44433.750000,4.555e-03,43862,44785,430.266100
"phi::MaskedScatterSizeCheck(long const*, bool const*, long)",80,177495,2218.687500,4.549e-03,641,5733,1675.604743
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<long, bool>, bool, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<long, bool>)",80,171400,2142.500000,4.393e-03,1844,2525,149.466308
"Cijk_Ailk_Bljk_SB_MT64x64x16_MI32x32x2x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM12",36,168553,4682.027778,4.320e-03,4410,5774,217.187146
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::MaxOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::MaxOps<float, float, float>, float, 4, 4>)",12,165308,13775.666667,4.237e-03,5293,19445,6224.479903
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<float, bool>, bool, 2, 1, 8, 1>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<float, bool>)",8,156726,19590.750000,4.017e-03,18042,21851,1298.216111
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",24,148983,6207.625000,3.818e-03,5814,7738,431.741349
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorPaddingOp<std::array<std::pair<long, long>, 4ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorPaddingOp<std::array<std::pair<long, long>, 4ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",8,148666,18583.250000,3.810e-03,13591,23615,5041.388889
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 4u>, float, int, 9, false>(float*, float const*, int, int, int)",24,147339,6139.125000,3.776e-03,5572,7497,407.672110
"void phi::funcs::GatherNdCUDAKernel<float, long, 4>(float const*, common::Dim<9>, long const*, float*, unsigned long, unsigned long, unsigned long)",12,144574,12047.833333,3.705e-03,4891,24216,8733.093140
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex0_bt128x64x16_wt32x32x2_ws1x1_wr1x2_ta1x8x1x1_1x2x4x32_tb1x4x1x1_1x4x1x64_pta",4,143656,35914.000000,3.682e-03,35723,36285,252.403117
"void rocprim::ROCPRIM_400200_NS::detail::init_lookback_scan_state_kernel<rocprim::ROCPRIM_400200_NS::detail::lookback_scan_state<long, false, true>, rocprim::ROCPRIM_400200_NS::detail::block_id_wrapper<unsigned int, true> >(rocprim::ROCPRIM_400200_NS::detail::lookback_scan_state<long, false, true>, unsigned int, rocprim::ROCPRIM_400200_NS::detail::block_id_wrapper<unsigned int, true>, unsigned int, rocprim::ROCPRIM_400200_NS::detail::lookback_scan_state<long, false, true>::value_type*)",80,143011,1787.637500,3.665e-03,1323,3889,302.021721
"void phi::IsfiniteCUDAKernel<long, unsigned int>(long const*, unsigned int, bool*, std::enable_if<std::is_integral<long>::value, void>::type*)",80,140128,1751.600000,3.592e-03,1363,3288,246.051842
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::MinOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::MinOps<float, float, float>, float, 4, 4>)",8,133471,16683.875000,3.421e-03,15517,18763,1196.338275
"void phi::funcs::VectorizedElementwiseKernel<bool, phi::CastFunctor<float, bool>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<bool*, 1>, long, long, int, phi::CastFunctor<float, bool>)",12,129862,10821.833333,3.328e-03,5453,17080,4216.552595
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4>)",24,124250,5177.083333,3.185e-03,4851,5733,241.548612
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex1_bt16x64x32_wt16x16x4_ws1x1_wr1x2_ta1x4x1x1_1x8x1x16_tb1x4x4x1_1x8x1x16_gkgs",4,115911,28977.750000,2.971e-03,28908,29108,88.808314
"batched_transpose_32x16_dword",8,114064,14258.000000,2.923e-03,4009,24497,10849.911613
"Cijk_Alik_Bljk_SB_MT32x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM2_AF1EM2_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM2_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA2_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM6",24,107410,4475.416667,2.753e-03,4089,5694,389.517418
"mloPoolingG",4,104522,26130.500000,2.679e-03,25178,26902,770.171193
"void phi::funcs::LayerNormForward<float, float, 64, true, float, float>(float const*, std::conditional<true, float, float>::type const*, std::conditional<true, float, float>::type const*, float*, float*, float*, float, long, float const*, int, float, int, float, float)",12,99873,8322.750000,2.560e-03,4731,15195,4679.663568
"void phi::funcs::ConcatTensorWithSameShape<int, 16, phi::funcs::AlignedPointerWrapper<float, 8> >(phi::funcs::AlignedPointerWrapper<float, 8>, int, int, int, void*)",5,99311,19862.200000,2.545e-03,5894,23815,7819.173658
"void phi::KeBilinearInterpNCHWFw<float, float>(float const*, unsigned long, unsigned long, float*, unsigned long, unsigned long, unsigned long, float, float, bool, int)",16,85559,5347.437500,2.193e-03,3609,7858,1507.864670
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC2_NTD3_NEPBS4_NLCA1_NLCB1_ONLL2_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG16_8_2_WGM16",16,84397,5274.812500,2.163e-03,4891,5533,225.258435
"Cijk_Ailk_Bljk_SB_MT192x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT0193_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL4_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA3_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR9_PKA0_SIA3_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT3_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM16",4,83435,20858.750000,2.138e-03,20488,21089,260.474407
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaLogFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaLogFunctor<float>)",28,74410,2657.500000,1.907e-03,2245,3528,317.779157
"void phi::VecReduceKernel<128, 4, phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4>)",4,74053,18513.250000,1.898e-03,18202,18764,237.278142
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 2> const, Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 2> const, Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",21,73092,3480.571429,1.873e-03,3287,4491,258.330906
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM8",4,72890,18222.500000,1.868e-03,17962,18563,262.806012
"void phi::funcs::VectorizedElementwiseKernel<int, phi::CastFunctor<float, int>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<int*, 1>, long, long, int, phi::CastFunctor<float, int>)",4,71647,17911.750000,1.836e-03,17320,18443,574.080932
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 4u>, float, int, 4, false>(float*, float const*, int, int, int)",24,71128,2963.666667,1.823e-03,2686,3128,132.635056
"Im2d2Col_v2",4,67157,16789.250000,1.721e-03,16639,16960,132.590535
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 1> const, Eigen::DSizes<long, 1> const, Eigen::TensorMap<Eigen::Tensor<float const, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 1> const, Eigen::DSizes<long, 1> const, Eigen::TensorMap<Eigen::Tensor<float const, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",21,56291,2680.523810,1.443e-03,1844,3087,288.702722
"void phi::KeNearestNeighborInterpNCHWFw<float, float>(float const*, unsigned long, unsigned long, float*, unsigned long, unsigned long, unsigned long, float, float, bool)",8,52080,6510.000000,1.335e-03,5252,7497,964.809085
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR4_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG16_8_2_WGMn16",4,48472,12118.000000,1.242e-03,12028,12268,114.891253
"Cijk_Ailk_Bljk_SB_MT16x16x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT081_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM32",4,47991,11997.750000,1.230e-03,11226,13832,1243.677979
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex1_bt128x64x16_wt32x32x2_ws1x1_wr1x2_ta1x8x1x1_1x2x4x32_tb1x4x1x1_1x4x1x64_pta_gkgs",4,43902,10975.500000,1.125e-03,9783,11667,827.552415
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD2_NEPBS4_NLCA1_NLCB1_ONLL2_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG32_4_2_WGM16",4,43381,10845.250000,1.112e-03,10624,11346,337.659962
"Cijk_Alik_Bljk_SB_MT64x64x16_MI32x32x2x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA1_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM6",4,37488,9372.000000,9.608e-04,9182,9622,194.250697
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",8,35964,4495.500000,9.218e-04,2887,4811,656.032882
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<float, int, 4> >(phi::funcs::PointerAndColWrapper<float, int, 4>, int, int, int, void*)",4,31393,7848.250000,8.046e-04,7577,8139,230.320610
"void phi::funcs::GatherNdCUDAKernel<long, long, 1>(long const*, common::Dim<9>, long const*, long*, unsigned long, unsigned long, unsigned long)",4,19727,4931.750000,5.056e-04,4451,5172,331.944147
"void phi::funcs::VectorizedElementwiseKernel<float, phi::GeluWithoutApproximateFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::GeluWithoutApproximateFunctor<float>)",4,18844,4711.000000,4.830e-04,4250,5693,664.631728
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::DivideFunctor<float, void>, float, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::DivideFunctor<float, void>)",4,18483,4620.750000,4.737e-04,4450,4812,148.259626
"void phi::funcs::ConcatTensorWithDifferentShape<int, 4, phi::funcs::PointerAndColWrapper<float, int, 8> >(phi::funcs::PointerAndColWrapper<float, int, 8>, int, int, int, void*)",4,18123,4530.750000,4.645e-04,4490,4571,33.069372
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::RemainderFunctor<long, void>, long, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::RemainderFunctor<long, void>)",4,17684,4421.000000,4.532e-04,4371,4491,60.000000
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<int, int>, int, 1, 1, 4, 3>(common::Array<char const* restrict, 1>, common::Array<int*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<int, int>)",4,15556,3889.000000,3.987e-04,3849,3929,32.659863
"void phi::IndexPutCudaKernel<long>(long const*, long const*, long**, common::Array<long, 9ul>, common::Array<long, 9ul>, int, long, long, bool, long*)",4,15155,3788.750000,3.884e-04,3569,4250,311.449702
"void phi::FlipCudaKernel<float>(float const*, float*, common::Array<long, 9ul>, common::Array<long, 9ul>, common::Array<int, 9ul>, int, long, int)",4,12950,3237.500000,3.319e-04,3087,3408,140.524019
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaFloorFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaFloorFunctor<float>)",4,12228,3057.000000,3.134e-04,2806,3168,171.582827
"void phi::funcs::ConcatTensorWithSameShape<int, 8, phi::funcs::AlignedPointerWrapper<float, 4> >(phi::funcs::AlignedPointerWrapper<float, 4>, int, int, int, void*)",4,11750,2937.500000,3.012e-04,2767,3168,202.880096
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 3> const, Eigen::DSizes<long, 3> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 3> const, Eigen::DSizes<long, 3> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",4,11627,2906.750000,2.980e-04,2525,3689,528.737096
"void phi::funcs::ConcatTensorWithSameShape<int, 4, phi::funcs::AlignedPointerWrapper<float, 4> >(phi::funcs::AlignedPointerWrapper<float, 4>, int, int, int, void*)",4,11386,2846.500000,2.918e-04,2085,4249,957.832449
"__amd_rocclr_fillBufferAligned",1,10825,10825.000000,2.774e-04,10825,10825,0.00000000e+00
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::ElementwisePowFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::ElementwisePowFunctor<float>)",2,9984,4992.000000,2.559e-04,4531,5453,651.952452
"void phi::funcs::DistributionKernel<phi::dtype::bfloat16, phi::funcs::normal_distribution<float>, phi::funcs::normal_transform<float> >(unsigned long, unsigned long, unsigned long, phi::funcs::normal_distribution<float>, phi::funcs::normal_transform<float>, phi::dtype::bfloat16*, unsigned long)",2,9703,4851.500000,2.487e-04,3488,6215,1928.280192
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",1,3489,3489.000000,8.942e-05,3489,3489,0.00000000e+00
"void phi::funcs::ForRangeElemwiseOpGridIsOne<phi::EyeFunctor<float> >(phi::EyeFunctor<float>)",1,2967,2967.000000,7.605e-05,2967,2967,0.00000000e+00
"void phi::funcs::DistributionKernel<float, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float> >(unsigned long, unsigned long, unsigned long, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float>, float*, unsigned long)",1,2766,2766.000000,7.089e-05,2766,2766,0.00000000e+00
"void phi::funcs::VectorizedElementwiseKernel<int, phi::CastFunctor<long, int>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<int*, 1>, long, long, int, phi::CastFunctor<long, int>)",1,2687,2687.000000,6.887e-05,2687,2687,0.00000000e+00

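The per-kernel rows above are easier to compare once grouped by kernel family. Below is a minimal sketch, assuming the rows use the kernel_stats.csv column layout (Name, Calls, TotalDurationNs, AverageNs, Percentage, MinNs, MaxNs, StdDev). The prefix-to-family mapping and the helper names are hypothetical, chosen for illustration only, and the sample reuses four rows from the dump above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical name-prefix buckets (an assumption, not part of the PR).
FAMILIES = (
    ("Cijk_", "gemm"),
    ("igemm_", "conv"),
    ("batched_transpose", "transpose"),
)


def family_of(kernel_name):
    """Map a kernel name to a coarse family by its prefix."""
    for prefix, family in FAMILIES:
        if kernel_name.startswith(prefix):
            return family
    return "other"


def total_ns_by_family(csv_text):
    """Sum TotalDurationNs per kernel family for a kernel_stats-style CSV."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[family_of(row["Name"])] += int(row["TotalDurationNs"])
    return dict(totals)


sample = '''"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"batched_transpose_32x32_dword",40,211373,5284.325000,5.418e-03,3969,7898,1273.798305
"mloPoolingG",4,104522,26130.500000,2.679e-03,25178,26902,770.171193
"Im2d2Col_v2",4,67157,16789.250000,1.721e-03,16639,16960,132.590535
"batched_transpose_32x16_dword",8,114064,14258.000000,2.923e-03,4009,24497,10849.911613
'''

print(total_ns_by_family(sample))
```

Running the same function over the full FP32 and BF16 dumps would give a per-family comparison, provided the prefixes are adjusted to the kernels of interest.
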
MIOpen (as of ROCm 7.x) returns MIOPEN_STATUS_NOT_IMPLEMENTED for
miopenSoftmaxForward_V2 with miopenBFloat16, so the gpudnn softmax path
cannot be used for BF16 on HIP. When the input dim exceeds the warp
softmax cap, route BF16 through the existing matrix softmax kernel
instead of letting the call fall into the MIOpen branch.

Also gate the CUDNN_VERSION < 8100 BF16 fallback specialization on
!defined(PADDLE_WITH_HIP) -- that branch dispatched into MIOpen too and
would trip the same NOT_IMPLEMENTED failure on ROCm.
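
The routing rule described above can be modelled in a few lines. This is a simplified Python model of the decision only, not the C++ in softmax_gpudnn.h: the function name, the string labels, and the `warp_cap` value used in the example are assumptions, and the real dispatch has further conditions that are omitted here.

```python
def choose_softmax_path(dtype, dim, is_hip, warp_cap):
    """Return which softmax implementation this simplified model picks."""
    if dim <= warp_cap:
        # Small rows stay on the warp-level softmax kernel.
        return "warp"
    if is_hip and dtype == "bfloat16":
        # MIOpen reports NOT_IMPLEMENTED for BF16 softmax, so the model
        # keeps BF16 on the in-tree matrix softmax kernel for HIP builds.
        return "matrix"
    # Everything else goes to the vendor library
    # (cuDNN on CUDA builds, MIOpen on HIP builds).
    return "dnn"


# warp_cap=1024 is a placeholder value for the example, not Paddle's constant.
print(choose_softmax_path("bfloat16", 4096, is_hip=True, warp_cap=1024))
print(choose_softmax_path("float32", 4096, is_hip=True, warp_cap=1024))
```

The only branch that differs between CUDA and HIP in this model is the BF16 one, which mirrors the intent of keeping the change HIP-only.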

conv2d_add_fuse_pass and conv2d_add_act_fuse_pass rewrite conv2d+add[+act]
into the fused_conv2d_add_act op, which has only a cuDNN GPUDNN kernel.
On ROCm the rewrite succeeds but kernel dispatch later fails because no
HIP kernel is registered, so PaddleX currently works around this by
calling config.delete_pass("conv2d_add_act_fuse_pass") and
config.delete_pass("conv2d_add_fuse_pass") under paddle.is_compiled_with_rocm()
in paddlex/inference/models/runners/paddle_static/runner.py.

Gate both the pass registration (REGISTER_IR_PASS / USE_PIR_PASS) and the
pass-builder inclusion on PADDLE_WITH_CUDA so the rewrite never runs on
HIP builds, making the PaddleX delete_pass calls unnecessary.
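
For reference, the PaddleX-side workaround that this gating makes unnecessary has roughly the following shape. It is a sketch, not the code in runner.py: the helper name and the `RecordingConfig` stand-in are hypothetical and exist only so the snippet runs without Paddle installed. The two pass names and the `delete_pass` call are the ones quoted above.

```python
CUDNN_ONLY_FUSE_PASSES = ("conv2d_add_act_fuse_pass", "conv2d_add_fuse_pass")


def drop_cudnn_only_passes(config, compiled_with_rocm):
    """Delete the cuDNN-only fuse passes from `config` on ROCm builds."""
    if compiled_with_rocm:
        for name in CUDNN_ONLY_FUSE_PASSES:
            config.delete_pass(name)
    return config


class RecordingConfig:
    """Hypothetical stand-in for an inference config; records deletions."""

    def __init__(self):
        self.deleted = []

    def delete_pass(self, name):
        self.deleted.append(name)


cfg = drop_cudnn_only_passes(RecordingConfig(), compiled_with_rocm=True)
print(cfg.deleted)
```

In a real deployment the flag would come from `paddle.is_compiled_with_rocm()` and `config` would be the inference config object built by the runner.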

Restore the BF16 registrations for conv2d / conv3d / depthwise conv kernels
and the DataType::BFLOAT16 -> miopenBFloat16 mapping originally added by
ROCm#47 and reverted on paddle_hackthon ahead of RDNA4 enablement.

The change is gated at compile time by the existing #ifdef PADDLE_WITH_HIP
block. Deployment to archs that lack native BF16 support should be handled
via PADDLE_ROCM_OFFLOAD_ARCHS (paddle_hackthon's default list already
covers the BF16-capable set: CDNA3/gfx942, CDNA4/gfx950,
RDNA3/gfx1100-1102, RDNA4/gfx1200-1201); if a downstream target needs to strip
BF16 from the build, it can narrow the offload-arch list accordingly. No runtime
arch queries are introduced.
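
Since the arch gating happens at configure time, a build script could sanity-check the requested list before invoking CMake. The following is a sketch under stated assumptions: the arch set is copied from the message above, while the helper name and the accepted separators (`;` or `,`) are hypothetical and may not match how PADDLE_ROCM_OFFLOAD_ARCHS is actually parsed.

```python
# Arch list taken from the commit message above.
BF16_CAPABLE_ARCHS = frozenset({
    "gfx942", "gfx950",
    "gfx1100", "gfx1101", "gfx1102",
    "gfx1200", "gfx1201",
})


def archs_outside_bf16_list(offload_archs):
    """Return requested archs that are not in the BF16-capable list."""
    # Assumption: entries are separated by ';' or ','.
    requested = [a.strip() for a in offload_archs.replace(",", ";").split(";")]
    return [a for a in requested if a and a not in BF16_CAPABLE_ARCHS]


print(archs_outside_bf16_list("gfx942;gfx906"))
```

An empty result means every requested arch is in the list; a non-empty one flags targets where the BF16 conv registrations may not be appropriate.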
@austin1997 austin1997 changed the base branch from develop to paddle_hackthon April 22, 2026 13:18
@austin1997
Author

Updated BF16 profiling (base switched to paddle_hackthon)

Refreshed kernel_stats.csv for the BF16 run on the new base ROCm/Paddle:paddle_hackthon @ 4df29c5818 + this PR's 3 commits, MI300X / ROCm 7.2.0 / Python 3.12, PADDLE_ROCM_OFFLOAD_ARCHS=gfx942. Row count: 220 (incl. header). Supersedes the earlier comment for the former develop-base iteration.

Headline numbers from this run (domain_stats.csv): BF16 kernel-dispatch total 3,897.8 ms (875,266 kernels) vs FP32 fallback 4,369.3 ms (867,948 kernels) — 1.12× GPU-kernel speedup, 471 ms saved. GEMM line: 1,447.27 ms BF16 (Cijk_*_BBS_BH_*_MI16x16x16x1_*) vs 1,786.45 ms FP32 (includes two Cijk_*_SB_*_MI16x16x4x1_* vision-path GEMMs totaling 473 ms that disappear entirely).
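
The derived figures can be re-checked from the totals quoted above. This short calculation uses only the numbers in this comment; the variable names are illustrative.

```python
# Totals quoted from domain_stats.csv in the comment above (milliseconds).
bf16_ms, fp32_ms = 3897.8, 4369.3
gemm_bf16_ms, gemm_fp32_ms = 1447.27, 1786.45

speedup = fp32_ms / bf16_ms                  # overall GPU-kernel speedup
saved_ms = fp32_ms - bf16_ms                 # kernel time saved overall
gemm_saved_ms = gemm_fp32_ms - gemm_bf16_ms  # saving on the GEMM line alone

print(f"speedup {speedup:.2f}x, saved {saved_ms:.1f} ms, "
      f"GEMM saved {gemm_saved_ms:.2f} ms")
```

The overall difference comes to 471.5 ms, which the comment reports as 471 ms, and the GEMM line alone accounts for about 339 ms of it.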

kernel_stats.csv (BF16 run)
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"Cijk_Ailk_Bljk_BBS_BH_MT64x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM16",22752,454476624,19975.238397,11.66,15716,25819,3437.292923
"Cijk_Ailk_Bljk_BBS_BH_MT64x64x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA1024_LBSPPB128_LPA16_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR5_PKA0_SIA3_SLW1_SS0_SU16_SUM0_SUS512_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG64_4_1_WGMn16",13536,299746724,22144.409279,7.69,16037,53121,11771.909869
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CastFunctor<phi::dtype::bfloat16, float>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::CastFunctor<phi::dtype::bfloat16, float>)",76632,188738030,2462.914057,4.84,681,18723,1691.645108
"Cijk_Alik_Bljk_BBS_BH_MT32x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA4_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA1_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG32_8_1_WGM1",13536,152340317,11254.456043,3.91,2605,91730,18150.643817
"Cijk_Ailk_Bljk_BBS_BH_MT64x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA32_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM16",8640,150723523,17444.852199,3.87,15675,23774,716.630168
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<phi::dtype::bfloat16>)",52616,146923378,2792.370724,3.77,681,11386,1629.007916
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>)",48464,146535083,3023.586229,3.76,1082,5051,370.848612
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",34272,139531092,4071.285364,3.58,2365,21209,3228.665631
"void phi::UnaryElementwiseKernel<phi::ScaleFunctor<phi::dtype::bfloat16, float>, phi::dtype::bfloat16, unsigned int, 1, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, int, phi::ScaleFunctor<phi::dtype::bfloat16, float>, phi::funcs::OffsetCalculator<(1)+(1), unsigned int, false>)",36288,120655319,3324.937142,3.10,681,12709,817.478855
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<float>, float, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<float>)",32166,115954017,3604.862805,2.97,1363,24617,1281.198817
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::CastFunctor<float, phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::CastFunctor<float, phi::dtype::bfloat16>)",42585,114005536,2677.128942,2.92,1523,15194,1826.407889
"Cijk_Ailk_Bljk_BBS_BH_MT32x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_ELFLR4_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA16_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC2_NTD2_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_8_1_WGMn16",11376,111690912,9818.118143,2.87,8860,16518,901.885203
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::MeanOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::MeanOps<float, float, float>, float, 4, 4>)",23384,100108722,4281.077745,2.57,2005,6896,504.173399
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<phi::dtype::bfloat16>)",34592,84794952,2451.287928,2.18,682,12148,1591.584417
"Cijk_Ailk_Bljk_BBS_BH_MT128x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLRn30_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS1_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB2_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB2_VFLRP1_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM8",2240,79835753,35640.961161,2.05,31111,39290,1500.331706
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<phi::dtype::bfloat16>, phi::dtype::bfloat16, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<phi::dtype::bfloat16>)",13280,78743679,5929.493901,2.02,2926,14512,1777.199531
"Cijk_Ailk_Bljk_BBS_BH_MT128x128x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR12_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB1024_LPA0_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB2_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU16_SUM0_SUS256_SCIUI1_SPO1_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_128_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB4_VFLRP1_WSGRA0_WSGRB0_WS64_WG64_4_1_WGMn2",632,66539159,105283.479430,1.71,90406,204106,35585.577280
"void phi::RepeatInterleaveVecKernel<phi::dtype::bfloat16, 8>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, long, long, long, long, int)",22752,66396804,2918.284283,1.70,2165,8540,393.318715
"void phi::funcs::SplitTensorWithDifferentShape<phi::dtype::bfloat16, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8> >(phi::dtype::bfloat16 const*, int, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8>)",22752,62955318,2767.023470,1.62,802,10103,285.501471
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 8> >(phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 8>, int, int, int, void*)",22752,60939332,2678.416491,1.56,1283,13190,368.713882
"void phi::funcs::SplitTensorWithDifferentShape<phi::dtype::bfloat16, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4> >(phi::dtype::bfloat16 const*, int, int, phi::funcs::PointerArray<phi::dtype::bfloat16, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4>)",22806,60454962,2650.835833,1.55,962,15635,538.994670
"SoftMaxCommon",2164,60040513,27745.153882,1.54,22011,32835,1972.917888
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 4> >(phi::funcs::PointerAndColWrapper<phi::dtype::bfloat16, int, 4>, int, int, int, void*)",19890,59750034,3004.023831,1.53,2005,11707,308.706954
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::ScaleFunctor<phi::dtype::bfloat16, float>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::ScaleFunctor<phi::dtype::bfloat16, float>)",8640,55134671,6381.327662,1.41,2726,14232,1302.907985
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<float>, float, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<float>)",12224,47355268,3873.958442,1.21,2686,41294,1859.849483
"void phi::funcs::ConcatTensorWithSameShape<int, 16, phi::funcs::AlignedPointerWrapper<phi::dtype::bfloat16, 4> >(phi::funcs::AlignedPointerWrapper<phi::dtype::bfloat16, 4>, int, int, int, void*)",22779,46618641,2046.562228,1.20,1002,16037,470.593244
"void phi::funcs::VectorizedElementwiseKernel<float, phi::ScaleFunctor<float, float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::ScaleFunctor<float, float>)",25446,45723679,1796.890631,1.17,682,14673,355.295104
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSquareFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSquareFunctor<float>)",23384,44989302,1923.935255,1.15,722,10785,204.014106
"Cijk_Ailk_Bljk_BBS_BH_MT16x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM16",9936,42239836,4251.191224,1.08,3087,12188,409.014449
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaRsqrtFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaRsqrtFunctor<float>)",23384,42182554,1803.906688,1.08,681,3368,205.094953
"void phi::ContiguousCaseOneFunc<phi::dtype::bfloat16, 4ul>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",12816,40137096,3131.795880,1.03,1924,9502,280.484878
"void phi::ArgCUDAKernel<phi::dtype::bfloat16, long, hipcub::HIPCUB_400200_NS::ArgMax, 1024ul, int>(long, long, long, hipcub::HIPCUB_400200_NS::ArgMax, phi::dtype::bfloat16, phi::dtype::bfloat16 const*, long*)",632,38109794,60300.306962,0.9777,58053,63305,900.076243
"void phi::funcs::VectorizedElementwiseKernel<long, phi::ScaleFunctor<long, long>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, long, long, int, phi::ScaleFunctor<long, long>)",22744,37537831,1650.449833,0.9630,681,6775,201.855527
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<long, phi::kps::MaxOps<long, long, long>, long, 4, 4> >(phi::ReduceExecutor<long, phi::kps::MaxOps<long, long, long>, long, 4, 4>)",11616,37211598,3203.477789,0.9547,2044,5894,623.326343
"void phi::ContiguousCaseZeroFunc<phi::dtype::bfloat16, 4ul>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, common::Array<long, 10ul>)",19872,34637648,1743.037842,0.8886,681,4932,244.020546
"Cijk_Ailk_Bljk_BBS_BH_MT16x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO4_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM32",1404,34104371,24290.862536,0.8750,21890,28184,1182.141787
"__amd_rocclr_copyBuffer",9956,33713352,3386.234632,0.8649,641,105681,2399.598376
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::funcs::CudaSiluFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::funcs::CudaSiluFunctor<phi::dtype::bfloat16>)",11376,31146866,2737.945323,0.7991,1804,5452,451.773984
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<float>)",5182,27423752,5292.117329,0.7036,1564,45303,2820.867273
"Cijk_Ailk_Bljk_BBS_BH_MT32x32x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR8_EMLL0_FSSC10_FL0_GLVWA2_GLVWB8_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS1_ISA942_IU1_K1_KLA_LBSPPA512_LBSPPB128_LPA16_LPB16_LDL1_LRVW8_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR5_PKA0_SIA3_SLW1_SS0_SU64_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM16",756,23412035,30968.300265,0.6006,26901,36003,1672.062474
"void phi::funcs::LayerNormForward<phi::dtype::bfloat16, float, 512, true, phi::dtype::bfloat16, phi::dtype::bfloat16>(phi::dtype::bfloat16 const*, std::conditional<true, phi::dtype::bfloat16, float>::type const*, std::conditional<true, phi::dtype::bfloat16, float>::type const*, phi::dtype::bfloat16*, float*, float*, float, long, float const*, int, float, int, float, float)",4480,22074471,4927.337277,0.5663,4089,13471,389.927368
"void phi::ContiguousCaseOneFunc<float, 4ul>(float const*, float*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",5584,20183537,3614.530265,0.5178,1684,38889,696.387770
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::funcs::CudaTanhFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::funcs::CudaTanhFunctor<phi::dtype::bfloat16>)",2160,19940325,9231.631944,0.5116,7257,14233,671.361681
"void phi::WarpSoftmaxForward<float, float, float, int, 8, false>(float*, float const*, int, int, int)",5904,19332800,3274.525745,0.4960,2486,9501,748.003645
"void phi::funcs::ConcatTensorWithSameShape<int, 16, phi::funcs::AlignedPointerWrapper<float, 4> >(phi::funcs::AlignedPointerWrapper<float, 4>, int, int, int, void*)",4972,19126770,3846.896621,0.4907,1483,26260,1213.449063
"void phi::UnaryElementwiseKernel<phi::ScaleFunctor<float, float>, float, unsigned int, 1, 1, 1, 2>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, int, phi::ScaleFunctor<float, float>, phi::funcs::OffsetCalculator<(1)+(1), unsigned int, false>)",4320,15775691,3651.780324,0.4047,3207,5733,163.753848
"Cijk_Ailk_Bljk_BBS_BH_MT16x16x64_MI16x16x16x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA128_LBSPPB128_LPA16_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS128_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP1_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM32",1440,14067442,9769.056944,0.3609,8699,12148,534.959254
"void phi::Strided2ContiguousCaseOneFunc<phi::dtype::bfloat16, 2ul>(phi::dtype::bfloat16 const*, common::Array<long, 10ul>, phi::dtype::bfloat16*, common::Array<long, 6ul>, long)",411,12543476,30519.406326,0.3218,9061,2509651,173731.259149
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::funcs::CudaCubeFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::funcs::CudaCubeFunctor<phi::dtype::bfloat16>)",2160,10957356,5072.850000,0.2811,2806,6655,177.364980
"void phi::funcs::KeMatrixTopK<float, 20, 64>(float*, int, long*, float const*, long, long, int, int, long, bool)",8,9937656,1242207.000000,0.2550,1031516,1472242,215128.992914
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 4u>, float, int, 8, false>(float*, float const*, int, int, int)",2736,7737466,2828.021199,0.1985,2325,4490,295.344373
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 2u>, float, int, 8, false>(float*, float const*, int, int, int)",2736,7668874,2802.951023,0.1967,2325,4370,268.875586
"miopenSp3AsmConv_v30_3_1_gfx9_fp32_f2x3_stride1",104,7205816,69286.692308,0.1849,22170,197651,49872.195519
"void phi::funcs::VectorizedElementwiseKernel<long, phi::FullFunctor<long, long>, 0, 1, 1>(common::Array<char const* restrict, 0>, common::Array<long*, 1>, long, long, int, phi::FullFunctor<long, long>)",3103,6943968,2237.824041,0.1782,681,8380,1213.408268
"naive_conv_ab_nonpacked_fwd_nchw_float_double_float",108,4477189,41455.453704,0.1149,12749,88322,16748.046060
"void phi::Range<long, long>(long, long, long, long*)",2391,4075055,1704.330824,0.1045,1082,9302,804.974882
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::FullFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>, 0, 1, 8>(common::Array<char const* restrict, 0>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::FullFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>)",2218,4005133,1805.740757,0.1028,1203,8861,435.228998
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<float, float>, float, 1, 1, 4, 1>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<float, float>)",716,3580466,5000.650838,0.0919,3929,10063,1183.355156
"Cijk_Ailk_Bljk_BBS_BH_MT64x64x64_MI32x32x8x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR3_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB2_VFLRP1_WSGRA0_WSGRB0_WS64_WG32_2_4_WGM1",80,3568948,44611.850000,0.0916,43459,45705,543.216513
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<long, long>, long, 1, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<long, long>)",964,3502922,3633.736515,0.0899,3007,5612,272.826700
"void phi::funcs::ConcatTensorWithDifferentShape<int, 8, phi::funcs::PointerAndColWrapper<long, int, 4> >(phi::funcs::PointerAndColWrapper<long, int, 4>, int, int, int, void*)",1264,3341742,2643.783228,0.0857,2045,4370,246.205076
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaCosFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaCosFunctor<float>)",712,2935981,4123.568820,0.0753,2806,4811,368.139234
"void phi::IndexSampleForward<phi::dtype::bfloat16, long, unsigned int>(long const*, phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, unsigned long, unsigned long, unsigned long)",632,2870608,4542.101266,0.0736,3288,5974,110.470235
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<bool, phi::kps::LogicalOrOps<bool, bool, bool>, bool, 4, 4> >(phi::ReduceExecutor<bool, phi::kps::LogicalOrOps<bool, bool, bool>, bool, 4, 4>)",636,2812954,4422.883648,0.0722,3568,12027,764.361694
"void phi::funcs::VectorizedElementwiseKernel<float, phi::FullFunctor<float, float>, 0, 1, 4>(common::Array<char const* restrict, 0>, common::Array<float*, 1>, long, long, int, phi::FullFunctor<float, float>)",1368,2799511,2046.426170,0.0718,641,24857,1876.742671
"void phi::GridSampleCudaKernel<float, int>(int, int, int, int, int, float const*, float const*, float*, phi::Mode, phi::PaddingMode, bool)",72,2614212,36308.500000,0.0671,26781,43619,4756.424089
"void phi::funcs::VectorizedElementwiseKernel<long, phi::CondFunctor<long>, 3, 1, 1>(common::Array<char const* restrict, 3>, common::Array<long*, 1>, long, long, int, phi::CondFunctor<long>)",632,2516543,3981.871835,0.0646,1964,5853,349.511429
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSinFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSinFunctor<float>)",712,2391289,3358.551966,0.0613,2806,3969,218.469069
"void phi::BinaryElementwiseKernel<phi::funcs::NotEqualFunctor<long, bool>, bool, unsigned int, 2, 1, 1, 4>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, long, int, phi::funcs::NotEqualFunctor<long, bool>, phi::funcs::OffsetCalculator<(2)+(1), unsigned int, false>)",632,2256309,3570.109177,0.0579,3087,4531,214.975086
"void phi::EmbeddingFW<phi::dtype::bfloat16, long, false>(phi::dtype::bfloat16*, phi::dtype::bfloat16 const*, long const*, long, long, long, long)",632,2186512,3459.670886,0.0561,1243,7377,480.257448
"void phi::funcs::DistributionKernel<phi::dtype::bfloat16, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float> >(unsigned long, unsigned long, unsigned long, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float>, phi::dtype::bfloat16*, unsigned long)",246,2134474,8676.723577,0.0548,4210,111534,9962.998369
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::EqualFunctor<phi::dtype::bfloat16, bool>, bool, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::EqualFunctor<phi::dtype::bfloat16, bool>)",632,2086254,3301.034810,0.0535,2806,3849,198.764852
"void phi::funcs::VectorizedElementwiseKernel<long, phi::CastFunctor<bool, long>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, long, long, int, phi::CastFunctor<bool, long>)",1032,1898403,1839.537791,0.0487,1243,2446,201.677837
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CondFunctor<float>, 3, 1, 8>(common::Array<char const* restrict, 3>, common::Array<float*, 1>, long, long, int, phi::CondFunctor<float>)",644,1887240,2930.496894,0.0484,721,64267,5814.371913
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaReluFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaReluFunctor<float>)",324,1817595,5609.861111,0.0466,1964,27624,4690.417073
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<float>)",714,1795653,2514.920168,0.0461,1524,5172,718.813168
"void phi::funcs::TilingSwapDim1And2<phi::dtype::bfloat16, 256, 32, 32, int>(phi::dtype::bfloat16 const*, phi::funcs::Dim3<int>, phi::dtype::bfloat16*)",250,1758928,7035.712000,0.0451,2485,14874,2579.283619
"void phi::ContiguousCaseOneFunc<long, 2ul>(long const*, long*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",552,1731237,3136.298913,0.0444,1804,12669,1252.650982
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::DivideFunctor<float, void>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::DivideFunctor<float, void>)",744,1659739,2230.831989,0.0426,1644,5894,429.966514
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>, phi::dtype::bfloat16, 1, 1, 4, 3>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<phi::dtype::bfloat16, phi::dtype::bfloat16>)",712,1640244,2303.713483,0.0421,1764,3408,315.689199
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<long, bool>, bool, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<long, bool>)",552,1624897,2943.653986,0.0417,2525,3408,189.731805
"igemm_fwd_gtcx3_nhwc_bf16_bx0_ex1_bt128x128x16_wt32x32x4_ws1x1_wr2x2_ta1x1x8x1_1x16x1x16_tb1x1x8x1_1x16x1x16_me",40,1537914,38447.850000,0.0395,37285,38849,324.722726
"Cijk_Ailk_Bljk_SB_MT16x16x4_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT017_165_35_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO2_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_1_WGM16",552,1503713,2724.117754,0.0386,2366,3688,157.977665
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::CastFunctor<long, phi::dtype::bfloat16>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::CastFunctor<long, phi::dtype::bfloat16>)",632,1497422,2369.338608,0.0384,1844,2927,221.996722
"void phi::RepeatInterleaveVecKernel<long, 1>(long const*, long*, long, long, long, long, int)",552,1497005,2711.965580,0.0384,2245,3648,207.978614
"Cijk_Ailk_Bljk_SB_MT64x32x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM2_AF1EM2_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM2_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM6",244,1475332,6046.442623,0.0379,2766,8018,1104.378607
"void phi::funcs::VectorizedElementwiseKernel<bool, phi::CastFunctor<long, bool>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<bool*, 1>, long, long, int, phi::CastFunctor<long, bool>)",632,1463055,2314.960443,0.0375,1283,2886,209.243858
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::CastFunctor<bool, phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::CastFunctor<bool, phi::dtype::bfloat16>)",632,1419881,2246.647152,0.0364,1844,3088,220.509417
"void phi::GPUMaskedFillOneValueKernel<phi::dtype::bfloat16, 1>(phi::dtype::bfloat16 const*, bool const*, phi::dtype::bfloat16 const*, long, long, phi::dtype::bfloat16*)",632,1413457,2236.482595,0.0363,1724,3929,228.765410
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::LogicalAndFunctor<bool>, bool, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::LogicalAndFunctor<bool>)",632,1310014,2072.806962,0.0336,1644,3248,180.614396
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CastFunctor<long, float>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::CastFunctor<long, float>)",722,1269501,1758.311634,0.0326,681,4650,400.369193
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<long>, long, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<long>)",632,1258005,1990.514241,0.0323,681,2526,215.882818
"Cijk_Ailk_Bljk_SB_MT256x256x16_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT8_128_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM18",8,1200862,150107.750000,0.0308,94336,204426,57552.089384
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<long, long>, long, 1, 1, 1, 3>(common::Array<char const* restrict, 1>, common::Array<long*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<long, long>)",560,1153224,2059.328571,0.0296,1643,3447,194.936637
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB8_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM1",12,1096546,91378.833333,0.0281,76335,99507,10591.275885
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::EqualFunctor<long, bool>, bool, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::EqualFunctor<long, bool>)",560,1081763,1931.719643,0.0278,641,7697,539.647210
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD0_NEPBS16_NLCA1_NLCB1_ONLL2_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG32_8_1_WGM1",24,994709,41446.208333,0.0255,24375,55045,9199.269990
"void phi::funcs::CumsumOneBlock<long, long, phi::kps::AddFunctor<long>, 2>(long const*, long*, long, long, phi::kps::AddFunctor<long>)",240,931082,3879.508333,0.0239,2766,4931,333.288973
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<long, phi::kps::SumOps<long, long, long>, long, 4, 4> >(phi::ReduceExecutor<long, phi::kps::SumOps<long, long, long>, long, 4, 4>)",240,921412,3839.216667,0.0236,2886,5092,511.425593
"igemm_fwd_gtcx3_nhwc_bf16_bx0_ex1_bt64x64x16_wt16x16x4_ws1x1_wr2x2_ta1x1x4x1_1x16x1x16_tb1x1x4x1_1x16x1x16_me",36,880251,24451.416667,0.0226,23814,25017,408.036159
"void phi::ContiguousCaseOneFunc<phi::dtype::bfloat16, 2ul>(phi::dtype::bfloat16 const*, phi::dtype::bfloat16*, common::Array<long, 10ul>, common::Array<long, 6ul>, long)",160,835875,5224.218750,0.0214,4289,6255,449.888857
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",88,817427,9288.943182,0.0210,6214,17600,4156.242554
"batched_transpose_128x4_half",160,798227,4988.918750,0.0205,2405,7296,1261.749499
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<bool, bool>, bool, 1, 1, 4, 1>(common::Array<char const* restrict, 1>, common::Array<bool*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<bool, bool>)",80,781625,9770.312500,0.0201,8058,12067,1257.652501
"void phi::funcs::SelectKernel<bool, bool, long, long, phi::IndexFunctor<bool, long, long>, 1, 0>(long*, bool const*, bool const*, long*, phi::IndexFunctor<bool, long, long>, long, long, long)",160,753435,4708.968750,0.0193,3929,7377,546.928248
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 3ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",208,626956,3014.211538,0.0161,2446,4530,419.479622
"void phi::funcs::GetBlockCountKernel<bool, long, 1>(bool const*, long*, long, long)",240,602890,2512.041667,0.0155,2044,3648,221.396156
"void rocprim::ROCPRIM_400200_NS::detail::trampoline_kernel<rocprim::ROCPRIM_400200_NS::detail::wrapped_scan_config<rocprim::ROCPRIM_400200_NS::default_config, long>, (rocprim::ROCPRIM_400200_NS::detail::target_arch)942, rocprim::ROCPRIM_400200_NS::detail::scan_impl<(rocprim::ROCPRIM_400200_NS::detail::lookback_scan_determinism)0, true, true, rocprim::ROCPRIM_400200_NS::default_config, long*, long*, long, hipcub::HIPCUB_400200_NS::Sum, long>(void*, unsigned long&, long*, long*, long, unsigned long, hipcub::HIPCUB_400200_NS::Sum, ihipStream_t*, bool)::{lambda(auto:1, auto:2)#1}::operator()<std::integral_constant<bool, false>, std::integral_constant<bool, true> >(std::integral_constant<bool, false>, std::integral_constant<bool, true>) const::{lambda(auto:1)#1}, rocprim::ROCPRIM_400200_NS::detail::default_config_selector>(rocprim::ROCPRIM_400200_NS::detail::scan_impl<(rocprim::ROCPRIM_400200_NS::detail::lookback_scan_determinism)0, true, true, rocprim::ROCPRIM_400200_NS::default_config, long*, long*, long, hipcub::HIPCUB_400200_NS::Sum, long>(void*, unsigned long&, long*, long*, long, unsigned long, hipcub::HIPCUB_400200_NS::Sum, ihipStream_t*, bool)::{lambda(auto:1, auto:2)#1}::operator()<std::integral_constant<bool, false>, std::integral_constant<bool, true> >(std::integral_constant<bool, false>, std::integral_constant<bool, true>) const::{lambda(auto:1)#1})",80,584574,7307.175000,0.0150,6535,9301,648.170058
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::RemainderFunctor<long, void>, long, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::RemainderFunctor<long, void>)",240,562640,2344.333333,0.0144,1844,2887,210.826917
"Cijk_Ailk_Bljk_SB_MT32x32x32_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",16,549696,34356.000000,0.0141,19966,67875,19983.257469
"void phi::funcs::TilingSwapDim1And2<float, 256, 32, 32, int>(float const*, phi::funcs::Dim3<int>, float*)",100,539226,5392.260000,0.0138,2766,12830,2239.025819
"void phi::funcs::VectorizedElementwiseKernel<float, phi::CastFunctor<bool, float>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::CastFunctor<bool, float>)",34,490515,14426.911765,0.0126,2606,24295,5657.949272
"Cijk_Ailk_Bljk_SB_MT64x64x16_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB4_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM1",12,478333,39861.083333,0.0123,38247,41495,1109.180570
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<long, phi::kps::ProdOps<long, long, long>, long, 4, 4> >(phi::ReduceExecutor<long, phi::kps::ProdOps<long, long, long>, long, 4, 4>)",80,471443,5893.037500,0.0121,4611,9703,846.510805
"SubTensorOpWithScalar1d",108,466710,4321.388889,0.0120,2525,10384,1359.883041
"void phi::funcs::index_elementwise_with_tensor_kernel<128, 4, phi::GPUIndexElementwiseGetKernel<float, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1}>(long, phi::GPUIndexElementwiseGetKernel<float, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1})",80,460327,5754.087500,0.0118,5372,7016,236.999610
"void phi::funcs::index_elementwise_with_tensor_kernel<128, 4, phi::GPUIndexElementwisePutWithTensorKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1}>(long, phi::GPUIndexElementwisePutWithTensorKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1})",80,450744,5634.300000,0.0116,4851,6655,332.019650
"void phi::funcs::StackCudaKernel<long, int, phi::funcs::ConstPointerArray<long, (phi::funcs::SegmentedArraySize)4> >(phi::funcs::ConstPointerArray<long, (phi::funcs::SegmentedArraySize)4>, phi::funcs::FastDivMod<int>, int, int, int, long*)",168,444169,2643.863095,0.0114,2125,5051,544.854581
"miopenSp3AsmConv_v30_3_1_gfx9_fp32_f3x2_stride1",16,441528,27595.500000,0.0113,21569,32594,3817.139627
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSwishFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSwishFunctor<float>)",76,440563,5796.881579,0.0113,2525,11947,3061.724507
"Cijk_Ailk_Bljk_SB_MT128x128x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR4_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS1_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS0_SU8_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT4_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG32_8_1_WGM1",12,439045,36587.083333,0.0113,27423,41856,5349.070904
"void phi::fusion::FusedLayernormResidualDropoutBias<float, unsigned char, 4, float, false, false>(long, long, unsigned long, float, bool, bool, unsigned long, float, float const*, float const*, float const*, std::conditional<false, float, float>::type const*, std::conditional<false, float, float>::type const*, unsigned char*, float*, float*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType*, float)",80,433708,5421.350000,0.0111,3849,7697,1345.225350
"miopenSp3AsmConv_v30_3_1_gfx9_fp32_f3x2_stride2",8,429018,53627.250000,0.0110,28224,103918,28901.651217
"void phi::BNForwardInference<float, (common::DataLayout)2>(float const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, phi::backends::gpu::CudnnDataType<float>::BatchNormParamType const*, int, int, long, double, float*)",108,419525,3884.490741,0.0108,2565,7818,862.121286
"void phi::KeBilinearInterpNCHWFw<phi::dtype::bfloat16, float>(phi::dtype::bfloat16 const*, unsigned long, unsigned long, phi::dtype::bfloat16*, unsigned long, unsigned long, unsigned long, float, float, bool, int)",80,389925,4874.062500,0.0100,4290,5693,202.737115
"void phi::funcs::VectorizedElementwiseKernel<bool, phi::FullFunctor<bool, bool>, 0, 1, 8>(common::Array<char const* restrict, 0>, common::Array<bool*, 1>, long, long, int, phi::FullFunctor<bool, bool>)",89,373574,4197.460674,9.584e-03,2806,13311,2374.493855
"phi::BoolToInt64Kernel(bool const*, long*, long)",80,372815,4660.187500,9.565e-03,2005,5894,806.357195
"void phi::funcs::SelectKernel<bool, long, long, long, phi::MaskedSelectFunctor<bool, long, long>, 1, 1>(long*, bool const*, long const*, long*, phi::MaskedSelectFunctor<bool, long, long>, long, long, long)",80,366319,4578.987500,9.398e-03,4329,6254,263.004957
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 4ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorShufflingOp<Eigen::array<int, 4ul> const, Eigen::TensorMap<Eigen::Tensor<phi::dtype::bfloat16 const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",80,364792,4559.900000,9.359e-03,4170,5733,234.976088
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex1_bt128x128x16_wt32x32x2_ws1x1_wr2x2_ta1x4x2x1_1x4x1x64_tb1x4x2x1_1x4x1x64_gkgs",16,362909,22681.812500,9.311e-03,22010,23494,401.991578
"void phi::funcs::VectorizedElementwiseKernel<phi::dtype::bfloat16, phi::GeluWithoutApproximateFunctor<phi::dtype::bfloat16>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<phi::dtype::bfloat16*, 1>, long, long, int, phi::GeluWithoutApproximateFunctor<phi::dtype::bfloat16>)",80,341705,4271.312500,8.767e-03,4009,5372,215.672372
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSiluFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSiluFunctor<float>)",48,306945,6394.687500,7.875e-03,2767,12589,3310.156287
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::MultiplyFunctor<float>, float, 2, 1, 4, 2>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::MultiplyFunctor<float>)",80,298554,3731.925000,7.660e-03,3288,4290,204.419820
"void phi::funcs::StackCudaKernel<float, int, phi::funcs::ConstPointerArray<float, (phi::funcs::SegmentedArraySize)4> >(phi::funcs::ConstPointerArray<float, (phi::funcs::SegmentedArraySize)4>, phi::funcs::FastDivMod<int>, int, int, int, float*)",44,287536,6534.909091,7.377e-03,2205,33196,7946.487075
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSigmoidFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSigmoidFunctor<float>)",40,279598,6989.950000,7.173e-03,2325,46306,12595.272180
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::FloorDivideFunctor<long, void>, long, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::FloorDivideFunctor<long, void>)",84,278432,3314.666667,7.143e-03,2887,4691,329.367899
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::SubtractFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::SubtractFunctor<float>)",94,271736,2890.808511,6.971e-03,2165,3568,357.358477
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 6, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 6> const, Eigen::DSizes<long, 6> const, Eigen::TensorMap<Eigen::Tensor<float const, 6, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 6, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 6> const, Eigen::DSizes<long, 6> const, Eigen::TensorMap<Eigen::Tensor<float const, 6, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",72,265081,3681.680556,6.801e-03,3247,4531,276.313347
"void phi::funcs::SplitTensorWithDifferentShape<float, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4> >(float const*, int, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)4>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)4, 4>)",28,255424,9122.285714,6.553e-03,2406,10585,2740.424120
"Cijk_Ailk_Bljk_SB_MT16x16x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT081_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM16",60,251338,4188.966667,6.448e-03,3288,7216,786.753617
"batched_transpose_16x32_dword",32,251251,7851.593750,6.446e-03,4009,21128,4772.882769
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB8_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD3_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM48",4,245520,61380.000000,6.299e-03,61139,61581,188.398514
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG32_4_2_WGMn16",28,244716,8739.857143,6.278e-03,7858,12388,1408.161626
"void phi::MaskedScatterCUDAKernel<phi::dtype::bfloat16>(phi::dtype::bfloat16 const*, bool const*, phi::dtype::bfloat16 const*, long const*, long, phi::dtype::bfloat16*)",80,242193,3027.412500,6.214e-03,2245,4490,395.885602
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaSqrtFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaSqrtFunctor<float>)",78,241874,3100.948718,6.205e-03,2526,4090,291.057896
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",24,238468,9936.166667,6.118e-03,9542,10263,165.682427
"Cijk_Ailk_Bjlk_SB_MT64x64x16_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM2_AF1EM2_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM2_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM12",24,236500,9854.166667,6.067e-03,9662,10183,136.894014
"void phi::funcs::index_elementwise_with_tensor_kernel<128, 4, phi::GPUIndexElementwiseGetKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1}>(long, phi::GPUIndexElementwiseGetKernel<long, unsigned int>(phi::GPUContext const&, phi::DenseTensor const&, std::vector<phi::DenseTensor const*, std::allocator<phi::DenseTensor const*> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, long, phi::DenseTensor*)::{lambda(long)#1})",80,228763,2859.537500,5.869e-03,2405,3769,225.923699
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<long, bool>, bool, 2, 1, 1, 2>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<long, bool>)",80,227952,2849.400000,5.848e-03,2566,3087,118.961072
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG16_8_2_WGMn16",36,227361,6315.583333,5.833e-03,5452,7417,517.844233
"igemm_fwd_gtcx3_nhwc_bf16_bx0_ex1_bt256x64x8_wt64x16x4_ws1x1_wr2x2_ta1x1x8x1_1x8x1x32_tb1x1x2x1_1x8x1x32_me",4,216655,54163.750000,5.558e-03,53923,54444,242.542608
"batched_transpose_32x32_dword",40,213607,5340.175000,5.480e-03,3889,7538,1283.639755
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::AddFunctor<long>, long, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::AddFunctor<long>)",80,211439,2642.987500,5.425e-03,2325,3047,168.448730
"void phi::funcs::VectorizedElementwiseKernel<float, phi::ClipFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::ClipFunctor<float>)",84,208438,2481.404762,5.348e-03,1925,3448,201.034141
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<float, int, 8> >(phi::funcs::PointerAndColWrapper<float, int, 8>, int, int, int, void*)",20,198130,9906.500000,5.083e-03,5853,14313,2552.281611
"Cijk_Ailk_Bljk_SB_MT64x32x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB2_GRCGA1_GRCGB1_GRPM1_GRVW2_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB2_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM6",44,194006,4409.227273,4.977e-03,4129,4610,121.416229
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 4> const, Eigen::DSizes<long, 4> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 4> const, Eigen::DSizes<long, 4> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",68,193683,2848.279412,4.969e-03,1283,10424,1036.622340
"Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA16_LPB8_LDL1_LRVW4_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD3_NEPBS8_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS0_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG32_8_1_WGM32",4,181254,45313.500000,4.650e-03,44581,45624,493.061524
"void phi::funcs::SplitTensorWithDifferentShape<float, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8> >(float const*, int, int, phi::funcs::PointerArray<float, (phi::funcs::SegmentedArraySize)8>, phi::funcs::ValueArray<int, (phi::funcs::SegmentedArraySize)8, 8>)",8,175801,21975.125000,4.510e-03,3047,40893,19503.904616
"void phi::Range<float, float>(float, float, long, float*)",82,173034,2110.170732,4.439e-03,1804,2886,187.353288
"Cijk_Ailk_Bljk_SB_MT64x64x16_MI32x32x2x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA0_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM12",36,167098,4641.611111,4.287e-03,4450,5452,186.563398
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::MaxOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::MaxOps<float, float, float>, float, 4, 4>)",12,161890,13490.833333,4.153e-03,4931,20527,6180.767499
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<float, bool>, bool, 2, 1, 8, 1>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<float, bool>)",8,159365,19920.625000,4.089e-03,18282,20768,915.301658
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::GreaterThanFunctor<long, bool>, bool, 2, 1, 1, 3>(common::Array<char const* restrict, 2>, common::Array<bool*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::GreaterThanFunctor<long, bool>)",80,151266,1890.825000,3.881e-03,1563,2245,143.489406
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorPaddingOp<std::array<std::pair<long, long>, 4ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 4, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorPaddingOp<std::array<std::pair<long, long>, 4ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 4, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",8,148781,18597.625000,3.817e-03,13511,23854,5291.743999
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 4u>, float, int, 9, false>(float*, float const*, int, int, int)",24,147656,6152.333333,3.788e-03,5693,6855,284.401991
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM1",24,146535,6105.625000,3.759e-03,5773,6655,268.856324
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex0_bt128x64x16_wt32x32x2_ws1x1_wr1x2_ta1x8x1x1_1x2x4x32_tb1x4x1x1_1x4x1x64_pta",4,142766,35691.500000,3.663e-03,35481,36002,222.741255
"void phi::funcs::GatherNdCUDAKernel<float, long, 4>(float const*, common::Dim<9>, long const*, float*, unsigned long, unsigned long, unsigned long)",12,142725,11893.750000,3.662e-03,4811,25177,8647.873907
"void phi::IsfiniteCUDAKernel<long, unsigned int>(long const*, unsigned int, bool*, std::enable_if<std::is_integral<long>::value, void>::type*)",80,136428,1705.350000,3.500e-03,1363,2686,204.026748
"void rocprim::ROCPRIM_400200_NS::detail::init_lookback_scan_state_kernel<rocprim::ROCPRIM_400200_NS::detail::lookback_scan_state<long, false, true>, rocprim::ROCPRIM_400200_NS::detail::block_id_wrapper<unsigned int, true> >(rocprim::ROCPRIM_400200_NS::detail::lookback_scan_state<long, false, true>, unsigned int, rocprim::ROCPRIM_400200_NS::detail::block_id_wrapper<unsigned int, true>, unsigned int, rocprim::ROCPRIM_400200_NS::detail::lookback_scan_state<long, false, true>::value_type*)",80,133667,1670.837500,3.429e-03,1404,2285,184.017228
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::MinOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::MinOps<float, float, float>, float, 4, 4>)",8,128252,16031.500000,3.290e-03,15715,16478,289.671242
"void phi::funcs::VectorizedElementwiseKernel<bool, phi::CastFunctor<float, bool>, 1, 1, 8>(common::Array<char const* restrict, 1>, common::Array<bool*, 1>, long, long, int, phi::CastFunctor<float, bool>)",12,121438,10119.833333,3.116e-03,4851,13591,3617.831566
"void phi::VecReduceKernel<512, 1, phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4>)",24,117748,4906.166667,3.021e-03,4530,5693,344.462867
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex1_bt16x64x32_wt16x16x4_ws1x1_wr1x2_ta1x4x1x1_1x8x1x16_tb1x4x4x1_1x8x1x16_gkgs",4,114501,28625.250000,2.938e-03,28465,28946,217.133408
"phi::MaskedScatterSizeCheck(long const*, bool const*, long)",80,114415,1430.187500,2.935e-03,641,5653,1368.226905
"batched_transpose_32x16_dword",8,112900,14112.500000,2.896e-03,4089,24336,10640.860210
"mloPoolingG",4,104878,26219.500000,2.691e-03,25939,26901,457.064182
"void phi::funcs::LayerNormForward<float, float, 64, true, float, float>(float const*, std::conditional<true, float, float>::type const*, std::conditional<true, float, float>::type const*, float*, float*, float*, float, long, float const*, int, float, int, float, float)",12,104679,8723.250000,2.686e-03,4891,15716,4850.266199
"Cijk_Alik_Bljk_SB_MT32x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM2_AF1EM2_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM2_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA2_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS4_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM6",24,104202,4341.750000,2.673e-03,3969,5092,327.046234
"void phi::funcs::ConcatTensorWithSameShape<int, 16, phi::funcs::AlignedPointerWrapper<float, 8> >(phi::funcs::AlignedPointerWrapper<float, 8>, int, int, int, void*)",5,101191,20238.200000,2.596e-03,5493,24175,8245.662417
"Cijk_Ailk_Bljk_SB_MT192x64x32_MI16x16x4x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT0193_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL4_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB128_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA3_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR9_PKA0_SIA3_SLW1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT3_64_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM16",4,90126,22531.500000,2.312e-03,20206,28024,3679.241362
"void phi::KeBilinearInterpNCHWFw<float, float>(float const*, unsigned long, unsigned long, float*, unsigned long, unsigned long, unsigned long, float, float, bool, int)",16,89405,5587.812500,2.294e-03,3929,8339,1725.040530
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC2_NTD3_NEPBS4_NLCA1_NLCB1_ONLL2_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB0_WS64_WG16_8_2_WGM16",16,84272,5267.000000,2.162e-03,4851,6494,422.034122
"void phi::VecReduceKernel<128, 4, phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4> >(phi::ReduceExecutor<float, phi::kps::SumOps<float, float, float>, float, 4, 4>)",4,75933,18983.250000,1.948e-03,18803,19204,177.802090
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS3_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA2_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB4_LDL1_LRVW2_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR5_PKA0_SIA3_SLW1_SS1_SU4_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW2_SNLL0_TSGRA0_TSGRB0_TT2_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW2_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM8",4,74128,18532.000000,1.902e-03,18321,18763,183.740759
"void phi::WarpSoftmaxForward<float, HIP_vector_type<int, 4u>, float, int, 4, false>(float*, float const*, int, int, int)",24,72968,3040.333333,1.872e-03,2766,3248,118.743299
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaLogFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaLogFunctor<float>)",28,71844,2565.857143,1.843e-03,2165,3288,284.816863
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 2> const, Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 2> const, Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",21,70764,3369.714286,1.815e-03,3128,4130,227.370874
"void phi::funcs::VectorizedElementwiseKernel<int, phi::CastFunctor<float, int>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<int*, 1>, long, long, int, phi::CastFunctor<float, int>)",4,69679,17419.750000,1.788e-03,16919,18522,741.576418
"Im2d2Col_v2",4,67915,16978.750000,1.742e-03,16558,17360,419.762929
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 1> const, Eigen::DSizes<long, 1> const, Eigen::TensorMap<Eigen::Tensor<float const, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 1> const, Eigen::DSizes<long, 1> const, Eigen::TensorMap<Eigen::Tensor<float const, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",21,56689,2699.476190,1.454e-03,2366,3007,180.471222
"void phi::KeNearestNeighborInterpNCHWFw<float, float>(float const*, unsigned long, unsigned long, float*, unsigned long, unsigned long, unsigned long, float, float, bool)",8,52359,6544.875000,1.343e-03,5572,7578,929.797126
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR4_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS16_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT2_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG16_8_2_WGMn16",4,51237,12809.250000,1.315e-03,12388,13150,362.419071
"Cijk_Ailk_Bljk_SB_MT16x16x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT081_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA1_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL8_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVWn1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA0_LPB2_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR1_PKA0_SIA3_SLW1_SS1_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_16_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG16_4_4_WGM32",4,47829,11957.250000,1.227e-03,8861,13190,2079.983554
"igemm_fwd_gtcx3_nhwc_fp32_bx0_ex1_bt128x64x16_wt32x32x2_ws1x1_wr1x2_ta1x8x1x1_1x2x4x32_tb1x4x1x1_1x4x1x64_pta_gkgs",4,45706,11426.500000,1.173e-03,11066,11747,283.314313
"Cijk_Ailk_Bljk_SB_MT32x32x64_MI16x16x4x1_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT3128_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA4_GLVWB4_GRCGA1_GRCGB1_GRPM1_GRVW4_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB256_LPA16_LPB4_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD2_NEPBS4_NLCA1_NLCB1_ONLL2_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO0_SRVW0_SSO4_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS1_UMLDSA0_UMLDSB1_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA1_WSGRB1_WS64_WG32_4_2_WGM16",4,43019,10754.750000,1.104e-03,10384,10945,252.151244
"Cijk_Alik_Bljk_SB_MT64x64x16_MI32x32x2x1_SN_1LDSB1_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASM_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_CLR0_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS1_ELFLR0_EMLL0_FSSC10_FL0_GLVWA1_GLVWB1_GRCGA1_GRCGB1_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA942_IU1_K1_KLA_LBSPPA0_LBSPPB0_LPA1_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV1_MDA2_MO40_MMFSC_MKFGSU256_NTA0_NTB0_NTC3_NTD3_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR2_PLR9_PKA0_SIA3_SLW1_SS1_SU0_SUM0_SUS0_SCIUI1_SPO1_SRVW0_SSO6_SVW1_SNLL0_TSGRA0_TSGRB0_TT1_32_TLDS0_UMLDSA0_UMLDSB0_U64SL1_USFGROn1_VAW1_VSn1_VW1_VWB1_VFLRP0_WSGRA0_WSGRB0_WS64_WG64_4_1_WGM6",4,38808,9702.000000,9.956e-04,9262,10784,724.248116
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 2, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<long, 2> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",8,35482,4435.250000,9.103e-04,3007,4932,639.361009
"void phi::funcs::ConcatTensorWithDifferentShape<int, 16, phi::funcs::PointerAndColWrapper<float, int, 4> >(phi::funcs::PointerAndColWrapper<float, int, 4>, int, int, int, void*)",4,30748,7687.000000,7.889e-04,7537,7817,114.891253
"void phi::funcs::GatherNdCUDAKernel<long, long, 1>(long const*, common::Dim<9>, long const*, long*, unsigned long, unsigned long, unsigned long)",4,20086,5021.500000,5.153e-04,4851,5211,150.953635
"void phi::funcs::ConcatTensorWithDifferentShape<int, 4, phi::funcs::PointerAndColWrapper<float, int, 8> >(phi::funcs::PointerAndColWrapper<float, int, 8>, int, int, int, void*)",4,18001,4500.250000,4.618e-04,4450,4531,38.560558
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::DivideFunctor<float, void>, float, 2, 1, 4, 1>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::DivideFunctor<float, void>)",4,17800,4450.000000,4.567e-04,4330,4570,97.979590
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::RemainderFunctor<long, void>, long, 2, 1, 1, 1>(common::Array<char const* restrict, 2>, common::Array<long*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::RemainderFunctor<long, void>)",4,17800,4450.000000,4.567e-04,4370,4490,56.568542
"void phi::funcs::VectorizedElementwiseKernel<float, phi::GeluWithoutApproximateFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::GeluWithoutApproximateFunctor<float>)",4,17079,4269.750000,4.382e-04,3969,4972,473.077425
"void phi::funcs::VectorizedBroadcastKernel<phi::kps::IdentityFunctor<int, int>, int, 1, 1, 4, 3>(common::Array<char const* restrict, 1>, common::Array<int*, 1>, common::Array<bool, 1>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 1>, unsigned int, unsigned int, int, phi::kps::IdentityFunctor<int, int>)",4,15676,3919.000000,4.022e-04,3889,3969,38.297084
"void phi::IndexPutCudaKernel<long>(long const*, long const*, long**, common::Array<long, 9ul>, common::Array<long, 9ul>, int, long, long, bool, long*)",4,15355,3838.750000,3.939e-04,3568,4450,410.586065
"void phi::FlipCudaKernel<float>(float const*, float*, common::Array<long, 9ul>, common::Array<long, 9ul>, common::Array<int, 9ul>, int, long, int)",4,14272,3568.000000,3.662e-04,3327,4290,481.333564
"void phi::funcs::VectorizedElementwiseKernel<float, phi::funcs::CudaFloorFunctor<float>, 1, 1, 4>(common::Array<char const* restrict, 1>, common::Array<float*, 1>, long, long, int, phi::funcs::CudaFloorFunctor<float>)",4,13151,3287.750000,3.374e-04,3128,3488,156.802158
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 3> const, Eigen::DSizes<long, 3> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 3, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, 3> const, Eigen::DSizes<long, 3> const, Eigen::TensorMap<Eigen::Tensor<float const, 3, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",4,12308,3077.000000,3.158e-04,2446,3608,579.288069
"void phi::funcs::ConcatTensorWithSameShape<int, 8, phi::funcs::AlignedPointerWrapper<float, 4> >(phi::funcs::AlignedPointerWrapper<float, 4>, int, int, int, void*)",4,11706,2926.500000,3.003e-04,2766,3128,157.483332
"void phi::funcs::ConcatTensorWithSameShape<int, 4, phi::funcs::AlignedPointerWrapper<float, 4> >(phi::funcs::AlignedPointerWrapper<float, 4>, int, int, int, void*)",4,10143,2535.750000,2.602e-04,2165,3007,354.750029
"void phi::funcs::VectorizedBroadcastKernel<phi::funcs::ElementwisePowFunctor<float>, float, 2, 1, 4, 3>(common::Array<char const* restrict, 2>, common::Array<float*, 1>, common::Array<bool, 2>, unsigned int, common::Array<phi::kps::details::BroadcastConfig, 2>, unsigned int, unsigned int, int, phi::funcs::ElementwisePowFunctor<float>)",2,9863,4931.500000,2.530e-04,4450,5413,680.943830
"__amd_rocclr_fillBufferAligned",1,9782,9782.000000,2.510e-04,9782,9782,0.00000000e+00
"void phi::funcs::DistributionKernel<phi::dtype::bfloat16, phi::funcs::normal_distribution<float>, phi::funcs::normal_transform<float> >(unsigned long, unsigned long, unsigned long, phi::funcs::normal_distribution<float>, phi::funcs::normal_transform<float>, phi::dtype::bfloat16*, unsigned long)",2,9342,4671.000000,2.397e-04,3248,6094,2012.425899
"void phi::funcs::DistributionKernel<float, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float> >(unsigned long, unsigned long, unsigned long, phi::funcs::uniform_distribution<float>, phi::funcs::uniform_real_transform<float>, float*, unsigned long)",1,3207,3207.000000,8.228e-05,3207,3207,0.00000000e+00
"void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long>(Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer>, Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<float>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 0, Eigen::MakePointer> const> const> const, Eigen::GpuDevice>, long)",1,3127,3127.000000,8.022e-05,3127,3127,0.00000000e+00
"void phi::funcs::VectorizedElementwiseKernel<int, phi::CastFunctor<long, int>, 1, 1, 1>(common::Array<char const* restrict, 1>, common::Array<int*, 1>, long, long, int, phi::CastFunctor<long, int>)",1,2526,2526.000000,6.481e-05,2526,2526,0.00000000e+00
"void phi::funcs::ForRangeElemwiseOpGridIsOne<phi::EyeFunctor<float> >(phi::EyeFunctor<float>)",1,2525,2525.000000,6.478e-05,2525,2525,0.00000000e+00
