Skip to content

Update workitem.cl Not To Use __builtin_amdgcn_workgroup_size Functions Due To Their Changed Behaviors#2429

Open
linuxrocks123 wants to merge 3 commits into
ROCm:amd-stagingfrom
linuxrocks123:lcompiler-1744
Open

Update workitem.cl Not To Use __builtin_amdgcn_workgroup_size Functions Due To Their Changed Behaviors#2429
linuxrocks123 wants to merge 3 commits into
ROCm:amd-stagingfrom
linuxrocks123:lcompiler-1744

Conversation

@linuxrocks123
Copy link
Copy Markdown

@linuxrocks123 linuxrocks123 commented May 6, 2026

Fixes LCOMPILER-1744

  Upstream commit 765d09dd1b04 ("clang/AMDGPU: Fix workgroup size builtins for nonuniform work group sizes") changed the semantics of __builtin_amdgcn_workgroup_size_{x,y,z}():

  - Before: returned the uniform/enqueued work‑group size (a direct load from dispatch->workgroup_size_x for COV4, or implicitarg.group_size for COV5).
  - After: when OffloadUniformBlock is false (the default for OpenCL ≥ 2.0 and for non‑uniform launches), the builtin emits umin(grid_size − workgroup_id*group_size, group_size) — the actual
   size of the current work‑group, which equals the uniform size for full groups but is smaller for the trailing partial group.

  The change broke every consumer that was relying on the historical "uniform size" contract. In particular, in rocm-device-libs/ockl/src/workitem.cl (legacy ABI / COV ≤ 4 path):

  ┌─────────────────────────────────┬───────────────────────────────────────────────────┐
  │            Function             │                   Why it broke                    │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_global_id_{x,y,z}           │ computes g*s + l + offset; needs s = uniform size │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_global_linear_id_{x,y,z}    │ same g*s + l formula                              │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_enqueued_local_size_{x,y,z} │ by definition is the uniform size                 │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_num_groups_{x,y,z}          │ n/d + (n > q*d); needs d = uniform size           │
  └─────────────────────────────────┴───────────────────────────────────────────────────┘

  get_local_size_{x,y,z} already manually computes the remainder using grid_size, so it stays correct (just redundant) when s is uniform; with the post‑patch builtin it coincidentally still
  returns the right answer because min(remainder, actual) collapses to actual.

  clang/lib/Headers/amdgpuintrin.h (__gpu_num_blocks_*, __gpu_num_threads_*) has the same bug for OpenMP/HIP offload code.

  The OpenCL CTS test fails because the HSA runtime's blit kernel (rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/imageblit_kernels.cl, function copy_buffer_to_image)
  is built with -cl-std=CL2.0 -mcode-object-version=4, calls get_global_id, and links against device‑libs. After the patch its get_global_id_x returns g*actual_size + l, so threads in the
  partial trailing work‑group write the source buffer to the wrong destination pixels. When the test then reads coord 321 of a 329‑wide image it sees the never‑written zero‑initialized pixel
   — exactly the (0,0,0,1) that we observed.

  One file modified: compiler/amd-llvm/amd/device-libs/ockl/src/workitem.c
  
  ln every OLD_ABI branch of the four affected function families, replace __builtin_amdgcn_workgroup_size_{x,y,z}() with a direct
  read of p->workgroup_size_{x,y,z} from the HSA dispatch packet (the value the builtin used to return). 12 call sites updated; added a header comment explaining why the builtin can't be
  used here.

  Verification

  - Built artifact-group-opencl-runtime artifact-base amd-comgr+dist ROCR-Runtime+dist artifact-sysdeps.
  - Disassembled the rebuilt copy_buffer_to_image blit kernel: it now does workgroup_id * uniform_size + workitem_id directly (no s_sub_i32 / s_min_u32 remainder dance), matching the
  workaround output.
  - test_image_streams CL_RGB CL_UNORM_INT_101010 CL_FILTER_NEAREST CL_ADDRESS_CLAMP_TO_EDGE UNNORMALIZED read float 1D now passes 2/2 sub‑tests, three runs in a row.

Co-authored-by: Claude <noreply@anthropic.com>

…tins for nonuniform work group sizes") changed the semantics of __builtin_amdgcn_workgroup_size_{x,y,z}():

  - Before: returned the uniform/enqueued work‑group size (a direct load from dispatch->workgroup_size_x for COV4, or implicitarg.group_size for COV5).
  - After: when OffloadUniformBlock is false (the default for OpenCL ≥ 2.0 and for non‑uniform launches), the builtin emits umin(grid_size − workgroup_id*group_size, group_size) — the actual
   size of the current work‑group, which equals the uniform size for full groups but is smaller for the trailing partial group.

  The change broke every consumer that was relying on the historical "uniform size" contract. In particular, in rocm-device-libs/ockl/src/workitem.cl (legacy ABI / COV ≤ 4 path):

  ┌─────────────────────────────────┬───────────────────────────────────────────────────┐
  │            Function             │                   Why it broke                    │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_global_id_{x,y,z}           │ computes g*s + l + offset; needs s = uniform size │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_global_linear_id_{x,y,z}    │ same g*s + l formula                              │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_enqueued_local_size_{x,y,z} │ by definition is the uniform size                 │
  ├─────────────────────────────────┼───────────────────────────────────────────────────┤
  │ get_num_groups_{x,y,z}          │ n/d + (n > q*d); needs d = uniform size           │
  └─────────────────────────────────┴───────────────────────────────────────────────────┘

  get_local_size_{x,y,z} already manually computes the remainder using grid_size, so it stays correct (just redundant) when s is uniform; with the post‑patch builtin it coincidentally still
  returns the right answer because min(remainder, actual) collapses to actual.

  clang/lib/Headers/amdgpuintrin.h (__gpu_num_blocks_*, __gpu_num_threads_*) has the same bug for OpenMP/HIP offload code.

  The OpenCL CTS test fails because the HSA runtime's blit kernel (rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/imageblit_kernels.cl, function copy_buffer_to_image)
  is built with -cl-std=CL2.0 -mcode-object-version=4, calls get_global_id, and links against device‑libs. After the patch its get_global_id_x returns g*actual_size + l, so threads in the
  partial trailing work‑group write the source buffer to the wrong destination pixels. When the test then reads coord 321 of a 329‑wide image it sees the never‑written zero‑initialized pixel
   — exactly the (0,0,0,1) that we observed.

  Two files modified:

  1. compiler/amd-llvm/amd/device-libs/ockl/src/workitem.cl — in every OLD_ABI branch of the four affected function families, replace __builtin_amdgcn_workgroup_size_{x,y,z}() with a direct
  read of p->workgroup_size_{x,y,z} from the HSA dispatch packet (the value the builtin used to return). 12 call sites updated; added a header comment explaining why the builtin can't be
  used here.
  2. compiler/amd-llvm/clang/lib/Headers/amdgpuintrin.h — added a private __GPU_AMDGCN_UNIFORM_WG_SIZE(idx) macro that reads the uniform work‑group size from the dispatch packet (uint16_t at
   byte offset 4 + 2*idx), and routed __gpu_num_blocks_{x,y,z} and __gpu_num_threads_{x,y,z} through it.

  Verification

  - Built artifact-group-opencl-runtime artifact-base amd-comgr+dist ROCR-Runtime+dist artifact-sysdeps.
  - Disassembled the rebuilt copy_buffer_to_image blit kernel: it now does workgroup_id * uniform_size + workitem_id directly (no s_sub_i32 / s_min_u32 remainder dance), matching the
  workaround output.
  - test_image_streams CL_RGB CL_UNORM_INT_101010 CL_FILTER_NEAREST CL_ADDRESS_CLAMP_TO_EDGE UNNORMALIZED read float 1D now passes 2/2 sub‑tests, three runs in a row.

Co-authored-by: Claude <noreply@anthropic.com>
@linuxrocks123
Copy link
Copy Markdown
Author

@arsenm since this is your change, please review.

@linuxrocks123
Copy link
Copy Markdown
Author

linuxrocks123 commented May 6, 2026

Claude primarily developed the fix for this. I asked it to change downstream consumers instead of reverting part of @arsenm's change. It originally wanted to do that revert instead to preserve the API. It feels kind of right, but it also feels like this would have shown up other places than gfx1100 if this were the right fix.

@z1-cciauto
Copy link
Copy Markdown
Collaborator

Copy link
Copy Markdown

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to change the device libs implementations to stop using the clang builtins, so that part LGTM. However this fix still leaves many questions.

  1. Why is the runtime building with COv4? It should just use the default
  2. Why are the blit kernels executing with nonuniform groups?
  3. Why are the actual workitem ID and nonuniform group conformance tests not failing? I definitely tested these with v4 and v5

Comment thread amd/device-libs/ockl/src/workitem.cl Outdated
@linuxrocks123
Copy link
Copy Markdown
Author

I wanted to change the device libs implementations to stop using the clang builtins, so that part LGTM. However this fix still leaves many questions.

1. Why is the runtime building with COv4? It should just use the default

The blit kernels are pinned to v4: rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/CMakeLists.txt:75

(Thanks again, Claude.)

2. Why are the blit kernels executing with nonuniform groups?

They're executing with OpenCL 2.0 and non-uniform groups are the default for OpenCL 2.0 and later. Claude noticed the width of the image is 329 which isn't divisible by the workgroup size.

Why it worked before, according to Claude:

  The promise [that the workgroups are uniform] is technically a lie — the runtime really does dispatch non-uniformly — but it doesn't matter, because:                                                                         
  1. The thing the kernel needs to compute is global_id = g * uniform_size + actual_workitem_id, and that formula is correct regardless of whether the trailing group is uniform or not.
  2. The optimizer's folding doesn't depend on the runtime dispatch shape; it only changes which value the builtin produces.                                                                  
  3. No threads at the non-existent positions l = 9..63 ever run, so no spurious writes happen.
3. Why are the actual workitem ID and nonuniform group conformance tests not failing? I definitely tested these with v4 and v5

Claude said this but it's probably wrong because you said you tested with both v4 and v5:

Because the bug is gated by code object version, not by non-uniform-ness, and user OpenCL kernels and the blit kernel sit on opposite sides of that gate.

My biggest question was "why did this only show up on gfx1100." Claude doesn't see any reason it should be target-specific. Perhaps it's not?

@linuxrocks123
Copy link
Copy Markdown
Author

linuxrocks123 commented May 6, 2026

@arsenm Claude ran the non_uniform_work_group conformance test on my machine and found that it fails when COV4 is forced. Perhaps you hadn't rebuilt the blit kernels during your testing, or perhaps it only fails on gfx1100.

@z1-cciauto
Copy link
Copy Markdown
Collaborator

@arsenm
Copy link
Copy Markdown

arsenm commented May 7, 2026

@arsenm Claude ran the non_uniform_work_group conformance test on my machine and found that it fails when COV4 is forced. Perhaps you hadn't rebuilt the blit kernels during your testing, or perhaps it only fails on gfx1100.

The blit kernels shouldn't factor into this test. This also wouldn't be target specific (and I'm using gfx1101 which is the effectively the same thing as 1100) AMD_OCL_BUILD_OPTIONS_APPEND="-Wf,-mcode-object-version=4 " non_uniform_work_group/test_non_uniform_work_group should force the version but this still passes for me

Copy link
Copy Markdown

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to drop the amdgpuintrin.h changes, that isn't even used on this path

@linuxrocks123
Copy link
Copy Markdown
Author

@arsenm Claude ran the non_uniform_work_group conformance test on my machine and found that it fails when COV4 is forced. Perhaps you hadn't rebuilt the blit kernels during your testing, or perhaps it only fails on gfx1100.

The blit kernels shouldn't factor into this test. This also wouldn't be target specific (and I'm using gfx1101 which is the effectively the same thing as 1100) AMD_OCL_BUILD_OPTIONS_APPEND="-Wf,-mcode-object-version=4 " non_uniform_work_group/test_non_uniform_work_group should force the version but this still passes for me

This is what I manually ran (no AI) to verify Claude's claim:

AMD_OCL_BUILD_OPTIONS_APPEND="-code-object-version=4" ./test_non_uniform_work_group

For me, this passes with this PR and fails without it. Note that my AMD_OCL_BUILD_OPTIONS_APPEND is different from what you used. (It's what Claude had used.)

@lamb-j
Copy link
Copy Markdown
Collaborator

lamb-j commented May 7, 2026

AMD_OCL_BUILD_OPTIONS_APPEND may expect a -Wf (frontend) or -Wb (backend) to know which part of the compilation to forward options to. I don't remember exactly how the runtime handles this, and maybe they have an exception for code-object-version

@arsenm
Copy link
Copy Markdown

arsenm commented May 7, 2026

OK, I see the failure with -code-object-version=4. That means it's hitting the comgr intercept of the flag; I thought those were all fixed. Ideally all arguments would forward directly to clang and there wouldn't be any comgr handled options

@z1-cciauto
Copy link
Copy Markdown
Collaborator

@linuxrocks123
Copy link
Copy Markdown
Author

@arsenm here's what Claude says about what happens if we revert the gpuintrin.h changes:

"If anyone runs OpenMP omp target teams distribute with a non-uniform mapping, those helpers return wrong values. That's a real but probably narrow exposure — and not something your CTS run would catch."

We care about OpenMP, and the changes look right to me, to the extent I can evaluate them. What do you think?

@arsenm
Copy link
Copy Markdown

arsenm commented May 11, 2026

"If anyone runs OpenMP omp target teams distribute with a non-uniform mapping, those helpers return wrong values. That's a real but probably narrow exposure — and not something your CTS run would catch."

Don't think OpenMP supports non-uniform work groups

We care about OpenMP, and the changes look right to me, to the extent I can evaluate them. What do you think?

It does not look right, the builtin should be correct and understands the language. We also shouldn't make AI suggested changes just in case, it needs a backing test where this is problematic (and the fix is fix the builtin, not the wrapper header)

@linuxrocks123
Copy link
Copy Markdown
Author

@arsenm okay, I removed the change.

@arsenm
Copy link
Copy Markdown

arsenm commented May 11, 2026

Title should be what the patch is doing, the ticket should just be in the description

@linuxrocks123 linuxrocks123 changed the title Fix Issue LCOMPILER-1744 Update workitem.cl To Respect New Behavior of __builtin_amdgcn_workgroup_size Functions May 11, 2026
@linuxrocks123
Copy link
Copy Markdown
Author

@arsenm k done.

Copy link
Copy Markdown

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would't say it's updating to respect a new builtin behavior, it's reimplementing the V4 ABI without them

@linuxrocks123 linuxrocks123 changed the title Update workitem.cl To Respect New Behavior of __builtin_amdgcn_workgroup_size Functions Update workitem.cl Not To Use __builtin_amdgcn_workgroup_size Functions Due To Their Changed Behaviors May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants