Update workitem.cl Not To Use __builtin_amdgcn_workgroup_size Functions Due To Their Changed Behaviors by linuxrocks123 · Pull Request #2429 · ROCm/llvm-project

linuxrocks123 · 2026-05-06T18:01:50Z

Fixes LCOMPILER-1744

  Upstream commit 765d09dd1b04 ("clang/AMDGPU: Fix workgroup size builtins for nonuniform work group sizes") changed the semantics of __builtin_amdgcn_workgroup_size_{x,y,z}():

  - Before: returned the uniform/enqueued work‑group size (a direct load from dispatch->workgroup_size_x for COV4, or implicitarg.group_size for COV5).
  - After: when OffloadUniformBlock is false (the default for OpenCL ≥ 2.0 and for non‑uniform launches), the builtin emits umin(grid_size − workgroup_id*group_size, group_size) — the actual
   size of the current work‑group, which equals the uniform size for full groups but is smaller for the trailing partial group.

  The change broke every consumer that was relying on the historical "uniform size" contract. In particular, in rocm-device-libs/ockl/src/workitem.cl (legacy ABI / COV ≤ 4 path):

  ┌─────────────────────────────────┬───────────────────────────────────────────────────┐
  │            Function             │                   Why it broke                    │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_global_id_{x,y,z}           │ computes g*s + l + offset; needs s = uniform size │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_global_linear_id_{x,y,z}    │ same g*s + l formula                              │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_enqueued_local_size_{x,y,z} │ by definition is the uniform size                 │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_num_groups_{x,y,z}          │ n/d + (n > q*d); needs d = uniform size           │
  └─────────────────────────────────┴───────────────────────────────────────────────────┘

  get_local_size_{x,y,z} already manually computes the remainder using grid_size, so it stays correct (just redundant) when s is uniform; with the post‑patch builtin it coincidentally still
  returns the right answer because min(remainder, actual) collapses to actual.

  clang/lib/Headers/amdgpuintrin.h (__gpu_num_blocks_*, __gpu_num_threads_*) has the same bug for OpenMP/HIP offload code.

  The OpenCL CTS test fails because the HSA runtime's blit kernel (rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/imageblit_kernels.cl, function copy_buffer_to_image)
  is built with -cl-std=CL2.0 -mcode-object-version=4, calls get_global_id, and links against device‑libs. After the patch its get_global_id_x returns g*actual_size + l, so threads in the
  partial trailing work‑group write the source buffer to the wrong destination pixels. When the test then reads coord 321 of a 329‑wide image it sees the never‑written zero‑initialized pixel
   — exactly the (0,0,0,1) that we observed.

  One file modified: compiler/amd-llvm/amd/device-libs/ockl/src/workitem.c
  
  ln every OLD_ABI branch of the four affected function families, replace __builtin_amdgcn_workgroup_size_{x,y,z}() with a direct
  read of p->workgroup_size_{x,y,z} from the HSA dispatch packet (the value the builtin used to return). 12 call sites updated; added a header comment explaining why the builtin can't be
  used here.

  Verification

  - Built artifact-group-opencl-runtime artifact-base amd-comgr+dist ROCR-Runtime+dist artifact-sysdeps.
  - Disassembled the rebuilt copy_buffer_to_image blit kernel: it now does workgroup_id * uniform_size + workitem_id directly (no s_sub_i32 / s_min_u32 remainder dance), matching the
  workaround output.
  - test_image_streams CL_RGB CL_UNORM_INT_101010 CL_FILTER_NEAREST CL_ADDRESS_CLAMP_TO_EDGE UNNORMALIZED read float 1D now passes 2/2 sub‑tests, three runs in a row.

Co-authored-by: Claude <noreply@anthropic.com>

…tins for nonuniform work group sizes") changed the semantics of __builtin_amdgcn_workgroup_size_{x,y,z}(): - Before: returned the uniform/enqueued work‑group size (a direct load from dispatch->workgroup_size_x for COV4, or implicitarg.group_size for COV5). - After: when OffloadUniformBlock is false (the default for OpenCL ≥ 2.0 and for non‑uniform launches), the builtin emits umin(grid_size − workgroup_id*group_size, group_size) — the actual size of the current work‑group, which equals the uniform size for full groups but is smaller for the trailing partial group. The change broke every consumer that was relying on the historical "uniform size" contract. In particular, in rocm-device-libs/ockl/src/workitem.cl (legacy ABI / COV ≤ 4 path): ┌─────────────────────────────────┬───────────────────────────────────────────────────┐ │ Function │ Why it broke │ ├─────────────────────────────────┼───────────────────────────────────────────────────┤ │ get_global_id_{x,y,z} │ computes g*s + l + offset; needs s = uniform size │ ├─────────────────────────────────┼───────────────────────────────────────────────────┤ │ get_global_linear_id_{x,y,z} │ same g*s + l formula │ ├─────────────────────────────────┼───────────────────────────────────────────────────┤ │ get_enqueued_local_size_{x,y,z} │ by definition is the uniform size │ ├─────────────────────────────────┼───────────────────────────────────────────────────┤ │ get_num_groups_{x,y,z} │ n/d + (n > q*d); needs d = uniform size │ └─────────────────────────────────┴───────────────────────────────────────────────────┘ get_local_size_{x,y,z} already manually computes the remainder using grid_size, so it stays correct (just redundant) when s is uniform; with the post‑patch builtin it coincidentally still returns the right answer because min(remainder, actual) collapses to actual. clang/lib/Headers/amdgpuintrin.h (__gpu_num_blocks_*, __gpu_num_threads_*) has the same bug for OpenMP/HIP offload code. The OpenCL CTS test fails because the HSA runtime's blit kernel (rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/imageblit_kernels.cl, function copy_buffer_to_image) is built with -cl-std=CL2.0 -mcode-object-version=4, calls get_global_id, and links against device‑libs. After the patch its get_global_id_x returns g*actual_size + l, so threads in the partial trailing work‑group write the source buffer to the wrong destination pixels. When the test then reads coord 321 of a 329‑wide image it sees the never‑written zero‑initialized pixel — exactly the (0,0,0,1) that we observed. Two files modified: 1. compiler/amd-llvm/amd/device-libs/ockl/src/workitem.cl — in every OLD_ABI branch of the four affected function families, replace __builtin_amdgcn_workgroup_size_{x,y,z}() with a direct read of p->workgroup_size_{x,y,z} from the HSA dispatch packet (the value the builtin used to return). 12 call sites updated; added a header comment explaining why the builtin can't be used here. 2. compiler/amd-llvm/clang/lib/Headers/amdgpuintrin.h — added a private __GPU_AMDGCN_UNIFORM_WG_SIZE(idx) macro that reads the uniform work‑group size from the dispatch packet (uint16_t at byte offset 4 + 2*idx), and routed __gpu_num_blocks_{x,y,z} and __gpu_num_threads_{x,y,z} through it. Verification - Built artifact-group-opencl-runtime artifact-base amd-comgr+dist ROCR-Runtime+dist artifact-sysdeps. - Disassembled the rebuilt copy_buffer_to_image blit kernel: it now does workgroup_id * uniform_size + workitem_id directly (no s_sub_i32 / s_min_u32 remainder dance), matching the workaround output. - test_image_streams CL_RGB CL_UNORM_INT_101010 CL_FILTER_NEAREST CL_ADDRESS_CLAMP_TO_EDGE UNNORMALIZED read float 1D now passes 2/2 sub‑tests, three runs in a row. Co-authored-by: Claude <noreply@anthropic.com>

linuxrocks123 · 2026-05-06T18:03:53Z

@arsenm since this is your change, please review.

linuxrocks123 · 2026-05-06T18:05:26Z

Claude primarily developed the fix for this. I asked it to change downstream consumers instead of reverting part of @arsenm's change. It originally wanted to do that revert instead to preserve the API. It feels kind of right, but it also feels like this would have shown up other places than gfx1100 if this were the right fix.

z1-cciauto · 2026-05-06T18:13:32Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5749

arsenm

I wanted to change the device libs implementations to stop using the clang builtins, so that part LGTM. However this fix still leaves many questions.

Why is the runtime building with COv4? It should just use the default
Why are the blit kernels executing with nonuniform groups?
Why are the actual workitem ID and nonuniform group conformance tests not failing? I definitely tested these with v4 and v5

linuxrocks123 · 2026-05-06T20:01:07Z

I wanted to change the device libs implementations to stop using the clang builtins, so that part LGTM. However this fix still leaves many questions.
1. Why is the runtime building with COv4? It should just use the default

The blit kernels are pinned to v4: rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/CMakeLists.txt:75

(Thanks again, Claude.)

2. Why are the blit kernels executing with nonuniform groups?

They're executing with OpenCL 2.0 and non-uniform groups are the default for OpenCL 2.0 and later. Claude noticed the width of the image is 329 which isn't divisible by the workgroup size.

Why it worked before, according to Claude:

  The promise [that the workgroups are uniform] is technically a lie — the runtime really does dispatch non-uniformly — but it doesn't matter, because:                                                                         
  1. The thing the kernel needs to compute is global_id = g * uniform_size + actual_workitem_id, and that formula is correct regardless of whether the trailing group is uniform or not.
  2. The optimizer's folding doesn't depend on the runtime dispatch shape; it only changes which value the builtin produces.                                                                  
  3. No threads at the non-existent positions l = 9..63 ever run, so no spurious writes happen.

3. Why are the actual workitem ID and nonuniform group conformance tests not failing? I definitely tested these with v4 and v5

Claude said this but it's probably wrong because you said you tested with both v4 and v5:

Because the bug is gated by code object version, not by non-uniform-ness, and user OpenCL kernels and the blit kernel sit on opposite sides of that gate.

My biggest question was "why did this only show up on gfx1100." Claude doesn't see any reason it should be target-specific. Perhaps it's not?

linuxrocks123 · 2026-05-06T21:43:16Z

@arsenm Claude ran the non_uniform_work_group conformance test on my machine and found that it fails when COV4 is forced. Perhaps you hadn't rebuilt the blit kernels during your testing, or perhaps it only fails on gfx1100.

z1-cciauto · 2026-05-06T22:11:56Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5753

arsenm · 2026-05-07T19:22:52Z

@arsenm Claude ran the non_uniform_work_group conformance test on my machine and found that it fails when COV4 is forced. Perhaps you hadn't rebuilt the blit kernels during your testing, or perhaps it only fails on gfx1100.

The blit kernels shouldn't factor into this test. This also wouldn't be target specific (and I'm using gfx1101 which is the effectively the same thing as 1100) AMD_OCL_BUILD_OPTIONS_APPEND="-Wf,-mcode-object-version=4 " non_uniform_work_group/test_non_uniform_work_group should force the version but this still passes for me

arsenm

Need to drop the amdgpuintrin.h changes, that isn't even used on this path

linuxrocks123 · 2026-05-07T20:13:23Z

@arsenm Claude ran the non_uniform_work_group conformance test on my machine and found that it fails when COV4 is forced. Perhaps you hadn't rebuilt the blit kernels during your testing, or perhaps it only fails on gfx1100.

The blit kernels shouldn't factor into this test. This also wouldn't be target specific (and I'm using gfx1101 which is the effectively the same thing as 1100) AMD_OCL_BUILD_OPTIONS_APPEND="-Wf,-mcode-object-version=4 " non_uniform_work_group/test_non_uniform_work_group should force the version but this still passes for me

This is what I manually ran (no AI) to verify Claude's claim:

AMD_OCL_BUILD_OPTIONS_APPEND="-code-object-version=4" ./test_non_uniform_work_group

For me, this passes with this PR and fails without it. Note that my AMD_OCL_BUILD_OPTIONS_APPEND is different from what you used. (It's what Claude had used.)

lamb-j · 2026-05-07T20:24:52Z

AMD_OCL_BUILD_OPTIONS_APPEND may expect a -Wf (frontend) or -Wb (backend) to know which part of the compilation to forward options to. I don't remember exactly how the runtime handles this, and maybe they have an exception for code-object-version

arsenm · 2026-05-07T20:27:58Z

OK, I see the failure with -code-object-version=4. That means it's hitting the comgr intercept of the flag; I thought those were all fixed. Ideally all arguments would forward directly to clang and there wouldn't be any comgr handled options

z1-cciauto · 2026-05-08T22:37:13Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5830

linuxrocks123 · 2026-05-11T16:42:45Z

@arsenm here's what Claude says about what happens if we revert the gpuintrin.h changes:

"If anyone runs OpenMP omp target teams distribute with a non-uniform mapping, those helpers return wrong values. That's a real but probably narrow exposure — and not something your CTS run would catch."

We care about OpenMP, and the changes look right to me, to the extent I can evaluate them. What do you think?

arsenm · 2026-05-11T17:54:06Z

"If anyone runs OpenMP omp target teams distribute with a non-uniform mapping, those helpers return wrong values. That's a real but probably narrow exposure — and not something your CTS run would catch."

Don't think OpenMP supports non-uniform work groups

We care about OpenMP, and the changes look right to me, to the extent I can evaluate them. What do you think?

It does not look right, the builtin should be correct and understands the language. We also shouldn't make AI suggested changes just in case, it needs a backing test where this is problematic (and the fix is fix the builtin, not the wrapper header)

linuxrocks123 · 2026-05-11T19:53:30Z

@arsenm okay, I removed the change.

arsenm · 2026-05-11T20:02:16Z

Title should be what the patch is doing, the ticket should just be in the description

linuxrocks123 · 2026-05-11T20:50:50Z

@arsenm k done.

arsenm

I would't say it's updating to respect a new builtin behavior, it's reimplementing the V4 ABI without them

linuxrocks123 requested review from b-sumner and lamb-j as code owners May 6, 2026 18:01

arsenm reviewed May 6, 2026

View reviewed changes

b-sumner reviewed May 6, 2026

View reviewed changes

Comment thread amd/device-libs/ockl/src/workitem.cl Outdated

Remove comment

acf9180

arsenm requested changes May 7, 2026

View reviewed changes

Revert gpuintrin.h changes.

3f8e4a6

linuxrocks123 force-pushed the lcompiler-1744 branch from 2c9a95f to 3f8e4a6 Compare May 11, 2026 19:52

linuxrocks123 changed the title ~~Fix Issue LCOMPILER-1744~~ Update workitem.cl To Respect New Behavior of __builtin_amdgcn_workgroup_size Functions May 11, 2026

arsenm reviewed May 11, 2026

View reviewed changes

linuxrocks123 changed the title ~~Update workitem.cl To Respect New Behavior of __builtin_amdgcn_workgroup_size Functions~~ Update workitem.cl Not To Use __builtin_amdgcn_workgroup_size Functions Due To Their Changed Behaviors May 11, 2026

Conversation

linuxrocks123 commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

linuxrocks123 commented May 6, 2026

Uh oh!

linuxrocks123 commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

z1-cciauto commented May 6, 2026

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

linuxrocks123 commented May 6, 2026

Uh oh!

linuxrocks123 commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

z1-cciauto commented May 6, 2026

Uh oh!

arsenm commented May 7, 2026

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

linuxrocks123 commented May 7, 2026

Uh oh!

lamb-j commented May 7, 2026

Uh oh!

arsenm commented May 7, 2026

Uh oh!

z1-cciauto commented May 8, 2026

Uh oh!

linuxrocks123 commented May 11, 2026

Uh oh!

arsenm commented May 11, 2026

Uh oh!

linuxrocks123 commented May 11, 2026

Uh oh!

arsenm commented May 11, 2026

Uh oh!

linuxrocks123 commented May 11, 2026

Uh oh!

arsenm left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

linuxrocks123 commented May 6, 2026 •

edited

Loading

linuxrocks123 commented May 6, 2026 •

edited

Loading

linuxrocks123 commented May 6, 2026 •

edited

Loading