Skip to content

Support for shuffled layouts of kernel arguments. #17

Open
newling wants to merge 12 commits into
b-shi:subtile_mxfrom
newling:subtile_mx_preshuffle
Open

Support for shuffled layouts of kernel arguments. #17
newling wants to merge 12 commits into
b-shi:subtile_mxfrom
newling:subtile_mx_preshuffle

Conversation

@newling

@newling newling commented Apr 1, 2026

Copy link
Copy Markdown

Summary

The main goal of this PR was to support shuffled B (weights). But it includes support for shuffling A, and for shuffling A and B, as these were very easy additions.

With pre-shuffling, the address that a lane loads from with its buffer_load_dwordx4 has a very simple offset, laneId * 16, relative to its wave's base address.

With pre-shuffling, the address that a lane loads from with its ds_read_b128 is also very simple, with an offset that is also laneId * 16. This is BC-free (see my "AITER pre-shuffle is BC-free" analysis in https://docs.google.com/spreadsheets/d/1tMj406bpUNOxFDjiBlO0nT5NCoZsnGpd7Fv6nQvs8y8/edit?usp=sharing)

Code changes

This PR adapts five aspects of the subtile addressing to handle the pre-shuffled layout:

  1. Within-wave ds_read offsets: Pre-shuffled data is already in the target layout for LDS, so lane offsets simplify to a linear mapping (laneId * 16) rather than the strided calculation used for standard layout.
  2. Within-wave buffer_load offsets (vaddr): The HBM row stride becomes depthUBytes (compile-time constant) instead of the runtime tensor stride, reflecting the tiled layout in HBM.
  3. Workgroup-level buffer_load base (SRD advance): After each unroll iteration, the SRD (shared by all waves in the workgroup) advances by 16 pre-shuffled rows rather than the standard stride-based increment.
  4. Per-wave soffset correction: Each wave loads a different set of 16 rows. When K > DepthU, the HBM distance between consecutive tiles varies per wave. An SGPR correction term bridges the gap between the SRD base and each wave's actual data location.
  5. Column swizzle skip: The column ID swizzle pass is skipped for pre-shuffled tiles since the data is already arranged in the target layout, simplifying both the LDS write pattern (buffer_load via DTL) and the LDS read pattern (ds_read).

@newling newling force-pushed the subtile_mx_preshuffle branch from 7c4e253 to 758e98a Compare April 1, 2026 21:19
@newling newling marked this pull request as ready for review April 1, 2026 21:22
@newling newling changed the base branch from subtile_mx to users/brianshi/custom_sched_2 April 1, 2026 21:38
@newling newling changed the base branch from users/brianshi/custom_sched_2 to subtile_mx_f4_schedule_2 April 1, 2026 21:38
##################################################
# Subroutine to update ptrs
#
def disableSrdBoundsForPreShuffled(writer, kernel, tc):

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a check in KernelWriter.py, to check if preshuffleB is enabled and if so, not emit the logic to compute Srd bounds for A/B and replace with bufferLimit instead of computing the bounds then overwriting it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Claude and) I have addressed this: 197b958.

Is it ok to just use bufferLimit? Why doesn't the non-pre-shuffle do this, is it to correctly handle the case where M, N are perfectly tiled by the macro-tile?

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, its to handle edge cases. With pre-shuffle B we can avoid this case

@newling newling changed the base branch from subtile_mx_f4_schedule_2 to subtile_mx April 2, 2026 14:36
@newling newling force-pushed the subtile_mx_preshuffle branch from 758e98a to b506d4e Compare April 2, 2026 17:48
@newling newling force-pushed the subtile_mx_preshuffle branch from 197b958 to a6f5ae8 Compare April 3, 2026 19:53
archana-ramalingam pushed a commit to archana-ramalingam/rocm-libraries that referenced this pull request Apr 7, 2026
…being used after being freed. (ROCm#5220)

## Motivation

A `heap-use-after-free` error was triggered by AddressSanitizer on test
`CPU_Dump_NAN_FP32.testDump`.

## Technical Details

Root Cause Analysis:
The AddressSanitizer error occurred because the HIPOCProgramImpl
constructor was not storing the binary data passed to it. When
LoadProgram called LoadBinary and created a HIPOCProgram with the
returned vector, the temporary vector would go out of scope, but COMGR
still needed to access the binary data later, causing a use-after-free.

- The fix ensures that the HIPOCProgramImpl object owns the binary data
for its entire lifetime
- Both constructors now consistently store the binary data in the
`binary` member variable (std::vector)
- The uint8_t constructor converts the data to char format using
iterator range construction
- This prevents the use-after-free that occurred when COMGR tried to
access freed memory


## Test Plan

Test output before change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
=================================================================
==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8
READ of size 26088 at 0x7e0f08c50200 thread T0
    #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5
    b-shi#1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9
    b-shi#2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17
    b-shi#3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24
    b-shi#4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12
    b-shi#5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14
    b-shi#6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22
    b-shi#7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22
    b-shi#8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22
    b-shi#9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47
    b-shi#10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10
    b-shi#11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25
    b-shi#12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18
    b-shi#13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    b-shi#14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    b-shi#15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    b-shi#16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    b-shi#17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    b-shi#18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    b-shi#19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50
    b-shi#20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6
    b-shi#21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14
    b-shi#22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33
    b-shi#23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6
    b-shi#24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47
    b-shi#25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    b-shi#26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52
    b-shi#27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55
    b-shi#28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73
    b-shi#29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12
    b-shi#30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986)
    b-shi#31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d)

0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8)
freed by thread T0 here:
    #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3
    b-shi#1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2
    b-shi#2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25
    b-shi#3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13
    b-shi#4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4
    b-shi#5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2
    b-shi#6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7
    b-shi#7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5
    b-shi#8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    b-shi#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    b-shi#10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    b-shi#11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    b-shi#12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    b-shi#13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    b-shi#14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    b-shi#15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

previously allocated by thread T0 here:
    #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35
    b-shi#1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27
    b-shi#2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32
    b-shi#3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20
    b-shi#4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20
    b-shi#5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33
    b-shi#6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9
    b-shi#7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9
    b-shi#8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27
    b-shi#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30
    b-shi#10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26
    b-shi#11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34
    b-shi#12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12
    b-shi#13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8
    b-shi#14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37
    b-shi#15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27
    b-shi#16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52

SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*)
Shadow bytes around the buggy address:
  0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3639==ABORTING
```

## Test Result

Test output after change:
```
HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*"
PRNG seed: 12345678
Note: Google Test filter = *CPU_Dump_NAN_FP32*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CPU_Dump_NAN_FP32
[ RUN      ] CPU_Dump_NAN_FP32.testDump
[       OK ] CPU_Dump_NAN_FP32.testDump (51 ms)
[----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (52 ms total)
[  PASSED  ] 1 test.
```

## Cline Analysis

### Test Coverage Analysis:

__1. LoadProgram Code Path (std::vector constructor):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/db_sync.cpp`
- __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file,
program_args, "")`
- __Coverage__: This test extensively exercises the LoadProgram →
LoadBinary → HIPOCProgramImpl constructor path
- __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942,
gfx1030) with different CU counts
- __Frequency__: Runs on thousands of kernel configurations in the
database sync tests

__2. Solution Binary Serialization (std::vector usage):__

- __Primary Test__:
`rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp`
- __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with
`std::vector<char> solution_binary`
- __Coverage__: Tests the save/load cycle of solution binaries
- __Scope__: Tests all convolution directions (Forward, BackwardData,
BackwardWeights)

__3. Additional Coverage:__

- __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp`
tests compression/decompression with `std::vector<char>`
- __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space
(related pattern)

__Test Quality Assessment:__

✅ __Both constructors are well-tested__:

- The `std::vector<char>` constructor is heavily exercised through
database sync tests
- The `std::vector<uint8_t>` constructor would be tested through any
code paths that use uint8_t binary data

✅ __Real-world scenarios covered__:

- Database synchronization (production kernel loading)
- Solution serialization (runtime binary handling)
- Multi-threaded execution (db_sync uses up to 32 threads)

✅ __Comprehensive architecture coverage__:

- Tests run on multiple GPU architectures
- Different compute unit configurations tested

__Confidence Level__: Very High


### Performance Analysis:

Regarding the performance impact of this fix, it's actually quite
minimal and represents good engineering practice:

__Memory Impact:__

- __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a
copy of the binary data in its `binary` member variable
- __Typical Size__: GPU code objects are usually relatively small
(typically a few KB to a few MB depending on kernel complexity)
- __Lifetime__: The memory is only held for the lifetime of the
HIPOCProgram object, which is typically short-lived during kernel
loading

__Performance Characteristics:__

- __One-time Copy Cost__: There's a single memory copy operation during
construction (std::vector copy or iterator range construction)
- __No Runtime Overhead__: Once constructed, there's no additional
performance cost during kernel execution
- __Memory Safety Benefit__: Eliminates potential crashes and undefined
behavior, which far outweighs the small memory cost

__Context in MIOpen:__

- This occurs during the kernel loading phase, not during actual ML
inference/training
- Kernel loading is already an expensive operation involving
compilation, module creation, etc.
- The additional memory copy is negligible compared to the overall
kernel loading time

__Trade-off Analysis:__

- __Cost__: Small increase in memory usage during kernel loading
- __Benefit__: Eliminates memory safety bugs that could cause crashes or
data corruption
- __Net Result__: Significantly positive - reliability and correctness
are much more valuable than the minimal memory overhead

In practice, this fix follows the RAII (Resource Acquisition Is
Initialization) principle and ensures proper ownership semantics, which
is standard best practice in modern C++. The performance impact should
be unnoticeable in real-world usage.
@newling newling force-pushed the subtile_mx_preshuffle branch from 538cd03 to cc36baf Compare April 9, 2026 00:42
@newling newling force-pushed the subtile_mx_preshuffle branch from cc36baf to 8ed4abb Compare April 9, 2026 18:38
# skip ShadowLimit and Srd+2 calculation here in useFixedSrd2 case
if not useFixedSrd2:
if isPreShuffled:
module.add(SMovB32(dst=sgpr("Srd%s+2"%tc), src="BufferLimit",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this line.
We should not do this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should properly set SrdB+2.
(same logic as current non preshuffleA/B, preshuffle MXSA/B)

if soffset == 0:
soffset = correction
else:
module.add(SAddU32(dst=soffset, src0=soffset, src1=correction,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this out of main loop?
We do not need to do add inside main loop.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, we should not need any correction.
We should use the same logic as swizzleMXSA/B (as well as non swizzle A/B).
My expectation is to use same calculation logic as swizzle MXSA/B (just use different swizzle block size for A/B).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can pull it out of the loop I think but I don't see how it's possible to manage with soffset

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current non swizzleA/B code already handle that.
We can assume non swizzleA/B has swizzle block size 1x1.
We just need to generalize the calculation for current swizzle block.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need any extra correction for swizzled B.
What I expect is to use the same logic as swizzled MXSA/B.
-> means use SwizzleSize0/1 for swizzled A/B (SwizzleSize0=16, SwizzleSize1=32?)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially, we can unify non swizzleA/B code and swizzled A/B code here.
non swizzled case, it is just SwizzleSize0=SwizzleSize1=1.
Then, all of remaining calculation should be common with using SwizzleSize0/1.

# lane 1: sharedVgprLROffset[0] = 16, sharedVgprLROffset[1] = 1040
# lane 63: sharedVgprLROffset[0] = 1008, sharedVgprLROffset[1] = 2032
mmaTileBytes = tileInfo.mmaTileSize
for vgprId in range(len(tileInfo.sharedVgprLROffset)):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need multiple vgpr offset here, if the second one has just constant offset.
Currently,
offset 0 = laneID * 16
offset 1 = offset 0 + const offset

In that case, we can just use const offset in ds_read.
Reducing 1 vgpr usage is important.

@newling newling force-pushed the subtile_mx_preshuffle branch from 0a64888 to 440d1a8 Compare April 15, 2026 00:34
isPreShuffledAB = tc in ("A", "B") and kernel["ProblemType"].get("SwizzleTensor%s" % tc, False)
useFixedSrd2 = useSubtile or isPreShuffledAB

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not enable this for non subtile case.
So far, I would like to limit this change only for subtile.

module.add(VAndB32(dst=vgpr(correctionVgpr), src0=hex(~(nTileRows - 1) & 0xFFFFFFFF),
src1=vgpr(rowOffset),
comment="alignedRow = rowOffset & ~%d (tile align)" % (nTileRows - 1)))
module.add(SSubU32(dst=sgpr(correctionSgpr), src0=sgpr("SizeL"), src1=hex(kernel["_DepthU%s" % tc]),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not use sizeL here.
This calculation should be a part of vgpr B offset and we should use stride instead.
Then, we should not need to do add and sub at each GR B.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, this part is basically calculating LDS dest address for DTL load.
However, the code you added here is for global memory address correction.
We should not mix up LDS address calculation and GR address calculation.

Please integrate this correction code into GR offset calculation (vgpr offset).

newling and others added 10 commits April 15, 2026 17:03
The swizzle cache keyed on (bitWidth, unrolledSize, tiledSize), which
collides when A and B have the same shape (e.g. M==N). When B hit the
cache, it received A's permuted data instead of its own. Add tensor
index to the key to prevent cross-tensor cache reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codegen (Python): calSwizzlePackK used int(numBytes()) which truncates
FP4's 0.5 to 0, causing division by zero. Switch to bit-width arithmetic:
128 // MIInputPerThread // numBits.

Client (C++): copySwizzledToGPUBuffer did not handle FP4:
- toBitWidth: missing Float4 case (threw "unsupported datatype")
- calculateKforSwizzling: missing Float4 case (MiK=16, MiKv=16)
- Tensor constructor: desc.elementBytes() asserts for FP4. Use
  dtInfo.elementSize (sizeof the packed storage type, always >= 1)
- unrolledSize: divide by dtInfo.packing to convert logical elements
  to storage units (no-op for packing=1, halves for FP4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…P4 GEMM

Support pre-shuffled tensor layout for A and/or B on the subtile path.
Pre-shuffle rearranges data on the host as:
  X.reshape(D//16, 16, K//32, 32).transpose(0, 2, 1, 3)
grouping each 16x32 FP4 block (256 bytes) contiguously in HBM.

Key codegen changes (SubtileBasedKernel.py):
- TileInfo.isPreShuffled field, set from SwizzleTensor{A,B}
- Linear GR/LR addressing for pre-shuffled tiles (no swizzle/rotation)
- Per-lane GR offset uses depthUBytes directly (compile-time shift)
- Subtile group offsets use Stride{A0I,B1J} (= K, unchanged)
- Per-wave soffset correction for K > DepthU at tile boundaries
- SRD pointer advance scaled by mmaTileShape[0] for interleaved layout
- SRD bounds check disabled for pre-shuffled tensors

Solution.py: relax DirectToVgpr checks for UseSubtileImpl.
KernelWriter.py: call disableSrdBoundsForPreShuffled after graAddresses.

178 tests: B-only (56 MIWT configs), A-only, A+B, StreamK,
and non-regression (no pre-shuffle).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The grTracker dedup in emitGR is no longer needed — the upstream
dependency-based instruction scheduling (ROCm#19) deduplicates subtile
loads at the scheduling level and fixes the WG 4x1/1x4 PGR=2 issue.

Re-enable the previously commented-out PGR=2 WG 4x1/1x4 configs in
subtile_mxfp4_preshuffle.yaml (now passing for both K=DU and K>DU).
@newling newling force-pushed the subtile_mx_preshuffle branch from 440d1a8 to 407c73c Compare April 15, 2026 17:13
elif isPreShuffledAB:
# Pre-shuffled layout: each row-block spans swizzleSize0 rows,
# so physical stride per block = stride * swizzleSize0 * bpe.
module.add(SMulI32(dst=sgpr(stmp+0), src0=sgpr(stmp+0), src1=swizzleSize0, comment="numLine * stride * %u (row-block stride)"%swizzleSize0))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please use the same logic as MXSA/B for stride?
MXSA,B case, we adjust the stride at very beginning.
// Scale StridesMXSA by 32
s_lshl_b32 s[sgprStridesMXSA], s[sgprStridesMXSA], 5
s_lshl_b32 s[sgprStridesMXSB], s[sgprStridesMXSB], 5

We can do same thing for shuffleB.
(please apply this logic only for SubIter code)

@nakajee

nakajee commented Apr 15, 2026

Copy link
Copy Markdown
Collaborator

PGR0+DU512 (MT128x128x512) fails with swizzleB.
It passes without swizzleB.
Would you please take a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants