Support for shuffled layouts of kernel arguments. by newling · Pull Request #17 · b-shi/rocm-libraries

newling · 2026-04-01T21:18:16Z

Summary

The main goal of this PR was to support shuffled B (weights). But it includes support for shuffling A, and for shuffling A and B, as these were very easy additions.

With pre-shuffling, the address that a lane loads from with its buffer_load_dwordx4 has a very simple offset, laneId * 16, relative to its wave's base address.

With pre-shuffling, the address that a lane loads from with its ds_read_b128 is also very simple, with an offset that is also laneId * 16. This is BC-free (see my "AITER pre-shuffle is BC-free" analysis in https://docs.google.com/spreadsheets/d/1tMj406bpUNOxFDjiBlO0nT5NCoZsnGpd7Fv6nQvs8y8/edit?usp=sharing)

Code changes

This PR adapts five aspects of the subtile addressing to handle the pre-shuffled layout:

Within-wave ds_read offsets: Pre-shuffled data is already in the target layout for LDS, so lane offsets simplify to a linear mapping (laneId * 16) rather than the strided calculation used for standard layout.
Within-wave buffer_load offsets (vaddr): The HBM row stride becomes depthUBytes (compile-time constant) instead of the runtime tensor stride, reflecting the tiled layout in HBM.
Workgroup-level buffer_load base (SRD advance): After each unroll iteration, the SRD (shared by all waves in the workgroup) advances by 16 pre-shuffled rows rather than the standard stride-based increment.
Per-wave soffset correction: Each wave loads a different set of 16 rows. When K > DepthU, the HBM distance between consecutive tiles varies per wave. An SGPR correction term bridges the gap between the SRD base and each wave's actual data location.
Column swizzle skip: The column ID swizzle pass is skipped for pre-shuffled tiles since the data is already arranged in the target layout, simplifying both the LDS write pattern (buffer_load via DTL) and the LDS read pattern (ds_read).

b-shi · 2026-04-02T13:23:09Z

-##################################################
-# Subroutine to update ptrs
-#
+def disableSrdBoundsForPreShuffled(writer, kernel, tc):


Can we add a check in KernelWriter.py, to check if preshuffleB is enabled and if so, not emit the logic to compute Srd bounds for A/B and replace with bufferLimit instead of computing the bounds then overwriting it.

(Claude and) I have addressed this: 197b958.

Is it ok to just use bufferLimit? Why doesn't the non-pre-shuffle do this, is it to correctly handle the case where M, N are perfectly tiled by the macro-tile?

Yeah, its to handle edge cases. With pre-shuffle B we can avoid this case

…being used after being freed. (ROCm#5220) ## Motivation A `heap-use-after-free` error was triggered by AddressSanitizer on test `CPU_Dump_NAN_FP32.testDump`. ## Technical Details Root Cause Analysis: The AddressSanitizer error occurred because the HIPOCProgramImpl constructor was not storing the binary data passed to it. When LoadProgram called LoadBinary and created a HIPOCProgram with the returned vector, the temporary vector would go out of scope, but COMGR still needed to access the binary data later, causing a use-after-free. - The fix ensures that the HIPOCProgramImpl object owns the binary data for its entire lifetime - Both constructors now consistently store the binary data in the `binary` member variable (std::vector) - The uint8_t constructor converts the data to char format using iterator range construction - This prevents the use-after-free that occurred when COMGR tried to access freed memory ## Test Plan Test output before change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump ================================================================= ==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8 READ of size 26088 at 0x7e0f08c50200 thread T0 #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5 b-shi#1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 b-shi#2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17 b-shi#3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24 b-shi#4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12 b-shi#5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14 b-shi#6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22 b-shi#7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22 b-shi#8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22 b-shi#9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47 b-shi#10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10 b-shi#11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25 b-shi#12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18 b-shi#13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 b-shi#14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 b-shi#15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 b-shi#16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 b-shi#17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 b-shi#19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50 b-shi#20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6 b-shi#21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14 b-shi#22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33 b-shi#23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6 b-shi#24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47 b-shi#25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 b-shi#27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55 b-shi#28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73 b-shi#29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12 b-shi#30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986) b-shi#31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d) 0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8) freed by thread T0 here: #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3 b-shi#1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2 b-shi#2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25 b-shi#3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13 b-shi#4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4 b-shi#5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2 b-shi#6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7 b-shi#7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5 b-shi#8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 b-shi#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 b-shi#10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 b-shi#11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 b-shi#12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 b-shi#13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 b-shi#14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 previously allocated by thread T0 here: #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35 b-shi#1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27 b-shi#2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32 b-shi#3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20 b-shi#4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20 b-shi#5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33 b-shi#6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9 b-shi#7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9 b-shi#8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27 b-shi#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 b-shi#10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 b-shi#11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 b-shi#12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 b-shi#13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 b-shi#14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 b-shi#15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) Shadow bytes around the buggy address: 0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa =>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==3639==ABORTING ``` ## Test Result Test output after change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump [ OK ] CPU_Dump_NAN_FP32.testDump (51 ms) [----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test suite ran. (52 ms total) [ PASSED ] 1 test. ``` ## Cline Analysis ### Test Coverage Analysis: __1. LoadProgram Code Path (std::vector constructor):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/db_sync.cpp` - __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file, program_args, "")` - __Coverage__: This test extensively exercises the LoadProgram → LoadBinary → HIPOCProgramImpl constructor path - __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942, gfx1030) with different CU counts - __Frequency__: Runs on thousands of kernel configurations in the database sync tests __2. Solution Binary Serialization (std::vector usage):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp` - __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with `std::vector<char> solution_binary` - __Coverage__: Tests the save/load cycle of solution binaries - __Scope__: Tests all convolution directions (Forward, BackwardData, BackwardWeights) __3. Additional Coverage:__ - __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp` tests compression/decompression with `std::vector<char>` - __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space (related pattern) __Test Quality Assessment:__ ✅ __Both constructors are well-tested__: - The `std::vector<char>` constructor is heavily exercised through database sync tests - The `std::vector<uint8_t>` constructor would be tested through any code paths that use uint8_t binary data ✅ __Real-world scenarios covered__: - Database synchronization (production kernel loading) - Solution serialization (runtime binary handling) - Multi-threaded execution (db_sync uses up to 32 threads) ✅ __Comprehensive architecture coverage__: - Tests run on multiple GPU architectures - Different compute unit configurations tested __Confidence Level__: Very High ### Performance Analysis: Regarding the performance impact of this fix, it's actually quite minimal and represents good engineering practice: __Memory Impact:__ - __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a copy of the binary data in its `binary` member variable - __Typical Size__: GPU code objects are usually relatively small (typically a few KB to a few MB depending on kernel complexity) - __Lifetime__: The memory is only held for the lifetime of the HIPOCProgram object, which is typically short-lived during kernel loading __Performance Characteristics:__ - __One-time Copy Cost__: There's a single memory copy operation during construction (std::vector copy or iterator range construction) - __No Runtime Overhead__: Once constructed, there's no additional performance cost during kernel execution - __Memory Safety Benefit__: Eliminates potential crashes and undefined behavior, which far outweighs the small memory cost __Context in MIOpen:__ - This occurs during the kernel loading phase, not during actual ML inference/training - Kernel loading is already an expensive operation involving compilation, module creation, etc. - The additional memory copy is negligible compared to the overall kernel loading time __Trade-off Analysis:__ - __Cost__: Small increase in memory usage during kernel loading - __Benefit__: Eliminates memory safety bugs that could cause crashes or data corruption - __Net Result__: Significantly positive - reliability and correctness are much more valuable than the minimal memory overhead In practice, this fix follows the RAII (Resource Acquisition Is Initialization) principle and ensures proper ownership semantics, which is standard best practice in modern C++. The performance impact should be unnoticeable in real-world usage.

nakajee · 2026-04-10T19:55:17Z

      # skip ShadowLimit and Srd+2 calculation here in useFixedSrd2 case
-      if not useFixedSrd2:
+      if isPreShuffled:
+        module.add(SMovB32(dst=sgpr("Srd%s+2"%tc), src="BufferLimit",


Please remove this line.
We should not do this.

We should properly set SrdB+2.
(same logic as current non preshuffleA/B, preshuffle MXSA/B)

nakajee · 2026-04-10T20:00:49Z

+      if soffset == 0:
+        soffset = correction
+      else:
+        module.add(SAddU32(dst=soffset, src0=soffset, src1=correction,


Can we move this out of main loop?
We do not need to do add inside main loop.

First of all, we should not need any correction.
We should use the same logic as swizzleMXSA/B (as well as non swizzle A/B).
My expectation is to use same calculation logic as swizzle MXSA/B (just use different swizzle block size for A/B).

I can pull it out of the loop I think but I don't see how it's possible to manage with soffset

Current non swizzleA/B code already handle that.
We can assume non swizzleA/B has swizzle block size 1x1.
We just need to generalize the calculation for current swizzle block.

nakajee · 2026-04-10T20:12:38Z

We should not need any extra correction for swizzled B.
What I expect is to use the same logic as swizzled MXSA/B.
-> means use SwizzleSize0/1 for swizzled A/B (SwizzleSize0=16, SwizzleSize1=32?)

Essentially, we can unify non swizzleA/B code and swizzled A/B code here.
non swizzled case, it is just SwizzleSize0=SwizzleSize1=1.
Then, all of remaining calculation should be common with using SwizzleSize0/1.

nakajee · 2026-04-10T20:37:03Z

+  #   lane 1:  sharedVgprLROffset[0] = 16,   sharedVgprLROffset[1] = 1040
+  #   lane 63: sharedVgprLROffset[0] = 1008, sharedVgprLROffset[1] = 2032
+  mmaTileBytes = tileInfo.mmaTileSize
+  for vgprId in range(len(tileInfo.sharedVgprLROffset)):


We do not need multiple vgpr offset here, if the second one has just constant offset.
Currently,
offset 0 = laneID * 16
offset 1 = offset 0 + const offset

In that case, we can just use const offset in ds_read.
Reducing 1 vgpr usage is important.

nakajee · 2026-04-15T01:18:15Z

    isPreShuffledAB = tc in ("A", "B") and kernel["ProblemType"].get("SwizzleTensor%s" % tc, False)
+    useFixedSrd2 = useSubtile or isPreShuffledAB


Please do not enable this for non subtile case.
So far, I would like to limit this change only for subtile.

nakajee · 2026-04-15T06:29:24Z

+  module.add(VAndB32(dst=vgpr(correctionVgpr), src0=hex(~(nTileRows - 1) & 0xFFFFFFFF),
+                     src1=vgpr(rowOffset),
+                     comment="alignedRow = rowOffset & ~%d (tile align)" % (nTileRows - 1)))
+  module.add(SSubU32(dst=sgpr(correctionSgpr), src0=sgpr("SizeL"), src1=hex(kernel["_DepthU%s" % tc]),


We should not use sizeL here.
This calculation should be a part of vgpr B offset and we should use stride instead.
Then, we should not need to do add and sub at each GR B.

First of all, this part is basically calculating LDS dest address for DTL load.
However, the code you added here is for global memory address correction.
We should not mix up LDS address calculation and GR address calculation.

Please integrate this correction code into GR offset calculation (vgpr offset).

The swizzle cache keyed on (bitWidth, unrolledSize, tiledSize), which collides when A and B have the same shape (e.g. M==N). When B hit the cache, it received A's permuted data instead of its own. Add tensor index to the key to prevent cross-tensor cache reuse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Codegen (Python): calSwizzlePackK used int(numBytes()) which truncates FP4's 0.5 to 0, causing division by zero. Switch to bit-width arithmetic: 128 // MIInputPerThread // numBits. Client (C++): copySwizzledToGPUBuffer did not handle FP4: - toBitWidth: missing Float4 case (threw "unsupported datatype") - calculateKforSwizzling: missing Float4 case (MiK=16, MiKv=16) - Tensor constructor: desc.elementBytes() asserts for FP4. Use dtInfo.elementSize (sizeof the packed storage type, always >= 1) - unrolledSize: divide by dtInfo.packing to convert logical elements to storage units (no-op for packing=1, halves for FP4) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…P4 GEMM Support pre-shuffled tensor layout for A and/or B on the subtile path. Pre-shuffle rearranges data on the host as: X.reshape(D//16, 16, K//32, 32).transpose(0, 2, 1, 3) grouping each 16x32 FP4 block (256 bytes) contiguously in HBM. Key codegen changes (SubtileBasedKernel.py): - TileInfo.isPreShuffled field, set from SwizzleTensor{A,B} - Linear GR/LR addressing for pre-shuffled tiles (no swizzle/rotation) - Per-lane GR offset uses depthUBytes directly (compile-time shift) - Subtile group offsets use Stride{A0I,B1J} (= K, unchanged) - Per-wave soffset correction for K > DepthU at tile boundaries - SRD pointer advance scaled by mmaTileShape[0] for interleaved layout - SRD bounds check disabled for pre-shuffled tensors Solution.py: relax DirectToVgpr checks for UseSubtileImpl. KernelWriter.py: call disableSrdBoundsForPreShuffled after graAddresses. 178 tests: B-only (56 MIWT configs), A-only, A+B, StreamK, and non-regression (no pre-shuffle). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The grTracker dedup in emitGR is no longer needed — the upstream dependency-based instruction scheduling (ROCm#19) deduplicates subtile loads at the scheduling level and fixes the WG 4x1/1x4 PGR=2 issue. Re-enable the previously commented-out PGR=2 WG 4x1/1x4 configs in subtile_mxfp4_preshuffle.yaml (now passing for both K=DU and K>DU).

nakajee · 2026-04-15T18:23:38Z

+                elif isPreShuffledAB:
+                  # Pre-shuffled layout: each row-block spans swizzleSize0 rows,
+                  # so physical stride per block = stride * swizzleSize0 * bpe.
+                  module.add(SMulI32(dst=sgpr(stmp+0), src0=sgpr(stmp+0), src1=swizzleSize0, comment="numLine * stride * %u (row-block stride)"%swizzleSize0))


Would you please use the same logic as MXSA/B for stride?
MXSA,B case, we adjust the stride at very beginning.
// Scale StridesMXSA by 32
s_lshl_b32 s[sgprStridesMXSA], s[sgprStridesMXSA], 5
s_lshl_b32 s[sgprStridesMXSB], s[sgprStridesMXSB], 5

We can do same thing for shuffleB.
(please apply this logic only for SubIter code)

nakajee · 2026-04-15T20:19:35Z

PGR0+DU512 (MT128x128x512) fails with swizzleB.
It passes without swizzleB.
Would you please take a look?

newling force-pushed the subtile_mx_preshuffle branch from 7c4e253 to 758e98a Compare April 1, 2026 21:19

newling marked this pull request as ready for review April 1, 2026 21:22

newling changed the base branch from subtile_mx to users/brianshi/custom_sched_2 April 1, 2026 21:38

newling changed the base branch from users/brianshi/custom_sched_2 to subtile_mx_f4_schedule_2 April 1, 2026 21:38

b-shi reviewed Apr 2, 2026

View reviewed changes

newling changed the base branch from subtile_mx_f4_schedule_2 to subtile_mx April 2, 2026 14:36

newling force-pushed the subtile_mx_preshuffle branch from 758e98a to b506d4e Compare April 2, 2026 17:48

b-shi force-pushed the subtile_mx branch from 74cb1fc to 916cf02 Compare April 2, 2026 18:07

newling force-pushed the subtile_mx_preshuffle branch from 197b958 to a6f5ae8 Compare April 3, 2026 19:53

newling force-pushed the subtile_mx_preshuffle branch from 538cd03 to cc36baf Compare April 9, 2026 00:42

b-shi force-pushed the subtile_mx branch from 8dd71f0 to ba6effb Compare April 9, 2026 02:02

newling force-pushed the subtile_mx_preshuffle branch from cc36baf to 8ed4abb Compare April 9, 2026 18:38

nakajee reviewed Apr 10, 2026

View reviewed changes

newling force-pushed the subtile_mx_preshuffle branch from 0a64888 to 440d1a8 Compare April 15, 2026 00:34

nakajee reviewed Apr 15, 2026

View reviewed changes

newling and others added 10 commits April 15, 2026 17:03

disable skew tests and fix for useFixedSrd2

5a7cacc

some sort of fix from claude

a51f832

additional testing

b3c2be6

update test blacklist

79bdb1e

claude's fix for rotating window with pre-shuffling

b97b66a

remove prints

06fb7c4

newling added 2 commits April 15, 2026 17:03

cosmetic changes

3d16ee4

attempt at addressing most recent review comments

407c73c

newling force-pushed the subtile_mx_preshuffle branch from 440d1a8 to 407c73c Compare April 15, 2026 17:13

nakajee reviewed Apr 15, 2026

View reviewed changes

		isPreShuffledAB = tc in ("A", "B") and kernel["ProblemType"].get("SwizzleTensor%s" % tc, False)
		useFixedSrd2 = useSubtile or isPreShuffledAB

Conversation

newling commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Code changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nakajee commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

newling commented Apr 1, 2026 •

edited

Loading