Update workitem.cl Not To Use __builtin_amdgcn_workgroup_size Functions Due To Their Changed Behaviors#2429
Update workitem.cl Not To Use __builtin_amdgcn_workgroup_size Functions Due To Their Changed Behaviors#2429linuxrocks123 wants to merge 3 commits into
Conversation
…tins for nonuniform work group sizes") changed the semantics of __builtin_amdgcn_workgroup_size_{x,y,z}():
- Before: returned the uniform/enqueued work‑group size (a direct load from dispatch->workgroup_size_x for COV4, or implicitarg.group_size for COV5).
- After: when OffloadUniformBlock is false (the default for OpenCL ≥ 2.0 and for non‑uniform launches), the builtin emits umin(grid_size − workgroup_id*group_size, group_size) — the actual
size of the current work‑group, which equals the uniform size for full groups but is smaller for the trailing partial group.
The change broke every consumer that was relying on the historical "uniform size" contract. In particular, in rocm-device-libs/ockl/src/workitem.cl (legacy ABI / COV ≤ 4 path):
┌─────────────────────────────────┬───────────────────────────────────────────────────┐
│ Function │ Why it broke │
├─────────────────────────────────┼───────────────────────────────────────────────────┤
│ get_global_id_{x,y,z} │ computes g*s + l + offset; needs s = uniform size │
├─────────────────────────────────┼───────────────────────────────────────────────────┤
│ get_global_linear_id_{x,y,z} │ same g*s + l formula │
├─────────────────────────────────┼───────────────────────────────────────────────────┤
│ get_enqueued_local_size_{x,y,z} │ by definition is the uniform size │
├─────────────────────────────────┼───────────────────────────────────────────────────┤
│ get_num_groups_{x,y,z} │ n/d + (n > q*d); needs d = uniform size │
└─────────────────────────────────┴───────────────────────────────────────────────────┘
get_local_size_{x,y,z} already manually computes the remainder using grid_size, so it stays correct (just redundant) when s is uniform; with the post‑patch builtin it coincidentally still
returns the right answer because min(remainder, actual) collapses to actual.
clang/lib/Headers/amdgpuintrin.h (__gpu_num_blocks_*, __gpu_num_threads_*) has the same bug for OpenMP/HIP offload code.
The OpenCL CTS test fails because the HSA runtime's blit kernel (rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/imageblit_kernels.cl, function copy_buffer_to_image)
is built with -cl-std=CL2.0 -mcode-object-version=4, calls get_global_id, and links against device‑libs. After the patch its get_global_id_x returns g*actual_size + l, so threads in the
partial trailing work‑group write the source buffer to the wrong destination pixels. When the test then reads coord 321 of a 329‑wide image it sees the never‑written zero‑initialized pixel
— exactly the (0,0,0,1) that we observed.
Two files modified:
1. compiler/amd-llvm/amd/device-libs/ockl/src/workitem.cl — in every OLD_ABI branch of the four affected function families, replace __builtin_amdgcn_workgroup_size_{x,y,z}() with a direct
read of p->workgroup_size_{x,y,z} from the HSA dispatch packet (the value the builtin used to return). 12 call sites updated; added a header comment explaining why the builtin can't be
used here.
2. compiler/amd-llvm/clang/lib/Headers/amdgpuintrin.h — added a private __GPU_AMDGCN_UNIFORM_WG_SIZE(idx) macro that reads the uniform work‑group size from the dispatch packet (uint16_t at
byte offset 4 + 2*idx), and routed __gpu_num_blocks_{x,y,z} and __gpu_num_threads_{x,y,z} through it.
Verification
- Built artifact-group-opencl-runtime artifact-base amd-comgr+dist ROCR-Runtime+dist artifact-sysdeps.
- Disassembled the rebuilt copy_buffer_to_image blit kernel: it now does workgroup_id * uniform_size + workitem_id directly (no s_sub_i32 / s_min_u32 remainder dance), matching the
workaround output.
- test_image_streams CL_RGB CL_UNORM_INT_101010 CL_FILTER_NEAREST CL_ADDRESS_CLAMP_TO_EDGE UNNORMALIZED read float 1D now passes 2/2 sub‑tests, three runs in a row.
Co-authored-by: Claude <noreply@anthropic.com>
|
@arsenm since this is your change, please review. |
|
Claude primarily developed the fix for this. I asked it to change downstream consumers instead of reverting part of @arsenm's change. It originally wanted to do that revert instead to preserve the API. It feels kind of right, but it also feels like this would have shown up other places than gfx1100 if this were the right fix. |
arsenm
left a comment
There was a problem hiding this comment.
I wanted to change the device libs implementations to stop using the clang builtins, so that part LGTM. However this fix still leaves many questions.
- Why is the runtime building with COv4? It should just use the default
- Why are the blit kernels executing with nonuniform groups?
- Why are the actual workitem ID and nonuniform group conformance tests not failing? I definitely tested these with v4 and v5
The blit kernels are pinned to v4: rocm-systems/projects/rocr-runtime/runtime/hsa-runtime/image/blit_src/CMakeLists.txt:75 (Thanks again, Claude.)
They're executing with OpenCL 2.0 and non-uniform groups are the default for OpenCL 2.0 and later. Claude noticed the width of the image is 329 which isn't divisible by the workgroup size. Why it worked before, according to Claude:
Claude said this but it's probably wrong because you said you tested with both v4 and v5: My biggest question was "why did this only show up on gfx1100." Claude doesn't see any reason it should be target-specific. Perhaps it's not? |
|
@arsenm Claude ran the |
The blit kernels shouldn't factor into this test. This also wouldn't be target specific (and I'm using gfx1101 which is the effectively the same thing as 1100) |
arsenm
left a comment
There was a problem hiding this comment.
Need to drop the amdgpuintrin.h changes, that isn't even used on this path
This is what I manually ran (no AI) to verify Claude's claim: For me, this passes with this PR and fails without it. Note that my |
|
AMD_OCL_BUILD_OPTIONS_APPEND may expect a -Wf (frontend) or -Wb (backend) to know which part of the compilation to forward options to. I don't remember exactly how the runtime handles this, and maybe they have an exception for code-object-version |
|
OK, I see the failure with -code-object-version=4. That means it's hitting the comgr intercept of the flag; I thought those were all fixed. Ideally all arguments would forward directly to clang and there wouldn't be any comgr handled options |
|
@arsenm here's what Claude says about what happens if we revert the "If anyone runs OpenMP We care about OpenMP, and the changes look right to me, to the extent I can evaluate them. What do you think? |
Don't think OpenMP supports non-uniform work groups
It does not look right, the builtin should be correct and understands the language. We also shouldn't make AI suggested changes just in case, it needs a backing test where this is problematic (and the fix is fix the builtin, not the wrapper header) |
2c9a95f to
3f8e4a6
Compare
|
@arsenm okay, I removed the change. |
|
Title should be what the patch is doing, the ticket should just be in the description |
|
@arsenm k done. |
arsenm
left a comment
There was a problem hiding this comment.
I would't say it's updating to respect a new builtin behavior, it's reimplementing the V4 ABI without them
Fixes LCOMPILER-1744