Support for shuffled layouts of kernel arguments. #17
Conversation
7c4e253 to
758e98a
Compare
| ################################################## | ||
| # Subroutine to update ptrs | ||
| # | ||
| def disableSrdBoundsForPreShuffled(writer, kernel, tc): |
There was a problem hiding this comment.
Can we add a check in KernelWriter.py, to check if preshuffleB is enabled and if so, not emit the logic to compute Srd bounds for A/B and replace with bufferLimit instead of computing the bounds then overwriting it.
There was a problem hiding this comment.
(Claude and) I have addressed this: 197b958.
Is it ok to just use bufferLimit? Why doesn't the non-pre-shuffle do this, is it to correctly handle the case where M, N are perfectly tiled by the macro-tile?
There was a problem hiding this comment.
Yeah, its to handle edge cases. With pre-shuffle B we can avoid this case
758e98a to
b506d4e
Compare
197b958 to
a6f5ae8
Compare
…being used after being freed. (ROCm#5220) ## Motivation A `heap-use-after-free` error was triggered by AddressSanitizer on test `CPU_Dump_NAN_FP32.testDump`. ## Technical Details Root Cause Analysis: The AddressSanitizer error occurred because the HIPOCProgramImpl constructor was not storing the binary data passed to it. When LoadProgram called LoadBinary and created a HIPOCProgram with the returned vector, the temporary vector would go out of scope, but COMGR still needed to access the binary data later, causing a use-after-free. - The fix ensures that the HIPOCProgramImpl object owns the binary data for its entire lifetime - Both constructors now consistently store the binary data in the `binary` member variable (std::vector) - The uint8_t constructor converts the data to char format using iterator range construction - This prevents the use-after-free that occurred when COMGR tried to access freed memory ## Test Plan Test output before change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump ================================================================= ==3639==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e0f08c50200 at pc 0x7f5f8d7a6554 bp 0x7ffcb7c4a730 sp 0x7ffcb7c49ee8 READ of size 26088 at 0x7e0f08c50200 thread T0 #0 0x7f5f8d7a6553 in memcpy /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors_memintrinsics.inc:117:5 b-shi#1 0x7f5f23d61d78 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 b-shi#2 0x7f5f23d61d78 in COMGR::DataObject::setData(llvm::StringRef) /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:334:17 b-shi#3 0x7f5f23d61d78 in amd_comgr_set_data /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:606:24 b-shi#4 0x7f5f221dc1d3 in amd::Comgr::set_data(amd_comgr_data_s, unsigned long, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/comgrctx.hpp:252:12 b-shi#5 0x7f5f221dc1d3 in amd::device::Program::getSymbolsFromCodeObj(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, amd_comgr_symbol_type_s) const /data/nhanna/repos/TheRock/rocm-systems/projects/clr/rocclr/device/devprogram.cpp:2061:14 b-shi#6 0x7f5f219e6f7c in hip::DynCO::populateDynGlobalVars() /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:216:22 b-shi#7 0x7f5f219e8e6a in hip::DynCO::getDynFunc(ihipModuleSymbol_t**, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_code_object.cpp:125:22 b-shi#8 0x7f5f21f842ba in hip::PlatformState::GetDynFunc(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_platform.cpp:884:22 b-shi#9 0x7f5f21ec2d71 in hip::hipModuleGetFunction(ihipModuleSymbol_t**, ihipModule_t*, char const*) /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_module.cpp:89:47 b-shi#10 0x7f5f2212c588 in hipModuleGetFunction /data/nhanna/repos/TheRock/rocm-systems/projects/clr/hipamd/src/hip_table_interface.cpp:1926:10 b-shi#11 0x7f5f7478d806 in miopen::HIPOCKernel::HIPOCKernel(miopen::HIPOCProgram, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::vector<unsigned long, std::allocator<unsigned long>>, std::vector<unsigned long, std::allocator<unsigned long>>) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/include/miopen/hipoc_kernel.hpp:225:25 b-shi#12 0x7f5f766febb7 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:161:18 b-shi#13 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 b-shi#14 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 b-shi#15 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 b-shi#16 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 b-shi#17 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#18 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 b-shi#19 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2728:50 b-shi#20 0x55e87ef04cdd in testing::Test::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2718:6 b-shi#21 0x55e87ef04e64 in testing::TestInfo::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2874:14 b-shi#22 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3052:33 b-shi#23 0x55e87ef0500e in testing::TestSuite::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:3006:6 b-shi#24 0x55e87ef0d20b in testing::internal::UnitTestImpl::RunAllTests() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:6004:47 b-shi#25 0x55e87ef1a1de in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#26 0x55e87ef1a1de in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 b-shi#27 0x55e87ef051b5 in testing::UnitTest::Run() /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:5583:55 b-shi#28 0x55e87eee3f9b in RUN_ALL_TESTS() /data/nhanna/repos/TheRock/build/third-party/googletest/dist/include/gtest/gtest.h:2334:73 b-shi#29 0x55e87eee3f9b in main /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/main_hip.cpp:34:12 b-shi#30 0x7f5f20a587e4 in __libc_start_main (/lib64/libc.so.6+0x3a7e4) (BuildId: 889235a2805b8308b2d0274921bbe1890e9a1986) b-shi#31 0x55e87b0bcf2d in _start (/data/nhanna/repos/TheRock/build/ml-libs/MIOpen/build/bin/miopen_gtest+0x126bf2d) 0x7e0f08c50200 is located 0 bytes inside of 26088-byte region [0x7e0f08c50200,0x7e0f08c567e8) freed by thread T0 here: #0 0x7f5f8d7b8ba2 in operator delete(void*, unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:190:3 b-shi#1 0x7f5f76b7317d in std::__new_allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:172:2 b-shi#2 0x7f5f76b7317d in std::allocator<char>::deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:210:25 b-shi#3 0x7f5f76b7317d in std::allocator_traits<std::allocator<char>>::deallocate(std::allocator<char>&, char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:517:13 b-shi#4 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::_M_deallocate(char*, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:390:4 b-shi#5 0x7f5f76b7317d in std::_Vector_base<char, std::allocator<char>>::~_Vector_base() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:369:2 b-shi#6 0x7f5f76b7317d in std::vector<char, std::allocator<char>>::~vector() /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:738:7 b-shi#7 0x7f5f76b7317d in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:633:5 b-shi#8 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 b-shi#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 b-shi#10 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 b-shi#11 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 b-shi#12 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 b-shi#13 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 b-shi#14 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#15 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 previously allocated by thread T0 here: #0 0x7f5f8d7b7f9d in operator new(unsigned long) /data/nhanna/repos/TheRock/compiler/amd-llvm/compiler-rt/lib/asan/asan_new_delete.cpp:109:35 b-shi#1 0x7f5f76b720a5 in std::__new_allocator<char>::allocate(unsigned long, void const*) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/new_allocator.h:151:27 b-shi#2 0x7f5f76b720a5 in std::allocator<char>::allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/allocator.h:198:32 b-shi#3 0x7f5f76b720a5 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/alloc_traits.h:482:20 b-shi#4 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:381:20 b-shi#5 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:398:33 b-shi#6 0x7f5f76b720a5 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:335:9 b-shi#7 0x7f5f76b720a5 in std::vector<char, std::allocator<char>>::vector(std::vector<char, std::allocator<char>> const&) /opt/rh/gcc-toolset-13/root/usr/lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/stl_vector.h:602:9 b-shi#8 0x7f5f76b720a5 in miopen::Handle::LoadProgram(std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, bool) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:623:27 b-shi#9 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*)::'lambda'()::operator()() const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:143:30 b-shi#10 0x7f5f766fda82 in miopen::KernelCache::AddKernel(miopen::Handle const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, miopen::HIPOCProgram*) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/kernel_cache.cpp:124:26 b-shi#11 0x7f5f76b6f0e4 in miopen::Handle::AddKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::filesystem::__cxx11::path const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::vector<unsigned long, std::allocator<unsigned long>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) const /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/hip/handlehip.cpp:450:34 b-shi#12 0x7f5f7411b52f in miopen::checkNumericsImpl(miopen::Handle const&, int, miopen::TensorDescriptor const&, void const*, bool) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/src/check_numerics.cpp:107:12 b-shi#13 0x55e87c72ebee in void testDumpWithNan<float>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:130:8 b-shi#14 0x55e87c72d4e8 in CPU_Dump_NAN_FP32_testDump_Test::TestBody() /data/nhanna/repos/TheRock/rocm-libraries/projects/miopen/test/gtest/dumpTensorTest.cpp:157:37 b-shi#15 0x55e87ef19d5e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2653:27 b-shi#16 0x55e87ef19d5e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /data/nhanna/repos/TheRock/build/third-party/googletest/source/googletest/src/gtest.cc:2689:52 SUMMARY: AddressSanitizer: heap-use-after-free /data/nhanna/repos/TheRock/compiler/amd-llvm/amd/comgr/src/comgr.cpp:216:9 in COMGR::setCStr(char*&, llvm::StringRef, unsigned long*) Shadow bytes around the buggy address: 0x7e0f08c4ff80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x7e0f08c50180: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa =>0x7e0f08c50200:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50280: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50300: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50380: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50400: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd 0x7e0f08c50480: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==3639==ABORTING ``` ## Test Result Test output after change: ``` HSA_XNACK=1 ASAN_OPTIONS=symbolize=1 ./build/ml-libs/MIOpen/build/bin/miopen_gtest --gtest_filter="*CPU_Dump_NAN_FP32*" PRNG seed: 12345678 Note: Google Test filter = *CPU_Dump_NAN_FP32* [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from CPU_Dump_NAN_FP32 [ RUN ] CPU_Dump_NAN_FP32.testDump [ OK ] CPU_Dump_NAN_FP32.testDump (51 ms) [----------] 1 test from CPU_Dump_NAN_FP32 (51 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test suite ran. (52 ms total) [ PASSED ] 1 test. ``` ## Cline Analysis ### Test Coverage Analysis: __1. LoadProgram Code Path (std::vector constructor):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/db_sync.cpp` - __Function__: `BuildKernel()` calls `handle.LoadProgram(program_file, program_args, "")` - __Coverage__: This test extensively exercises the LoadProgram → LoadBinary → HIPOCProgramImpl constructor path - __Scope__: Tests multiple GPU architectures (gfx908, gfx90a, gfx942, gfx1030) with different CU counts - __Frequency__: Runs on thousands of kernel configurations in the database sync tests __2. Solution Binary Serialization (std::vector usage):__ - __Primary Test__: `rocm-libraries/projects/miopen/test/gtest/find_2_conv.cpp` - __Function__: `miopenSaveSolution()` and `miopenLoadSolution()` with `std::vector<char> solution_binary` - __Coverage__: Tests the save/load cycle of solution binaries - __Scope__: Tests all convolution directions (Forward, BackwardData, BackwardWeights) __3. Additional Coverage:__ - __Cache Tests__: `rocm-libraries/projects/miopen/test/gtest/cache.cpp` tests compression/decompression with `std::vector<char>` - __Dropout Tests__: Uses `std::vector<unsigned char>` for reserve space (related pattern) __Test Quality Assessment:__ ✅ __Both constructors are well-tested__: - The `std::vector<char>` constructor is heavily exercised through database sync tests - The `std::vector<uint8_t>` constructor would be tested through any code paths that use uint8_t binary data ✅ __Real-world scenarios covered__: - Database synchronization (production kernel loading) - Solution serialization (runtime binary handling) - Multi-threaded execution (db_sync uses up to 32 threads) ✅ __Comprehensive architecture coverage__: - Tests run on multiple GPU architectures - Different compute unit configurations tested __Confidence Level__: Very High ### Performance Analysis: Regarding the performance impact of this fix, it's actually quite minimal and represents good engineering practice: __Memory Impact:__ - __Additional Memory Usage__: Each HIPOCProgramImpl object now stores a copy of the binary data in its `binary` member variable - __Typical Size__: GPU code objects are usually relatively small (typically a few KB to a few MB depending on kernel complexity) - __Lifetime__: The memory is only held for the lifetime of the HIPOCProgram object, which is typically short-lived during kernel loading __Performance Characteristics:__ - __One-time Copy Cost__: There's a single memory copy operation during construction (std::vector copy or iterator range construction) - __No Runtime Overhead__: Once constructed, there's no additional performance cost during kernel execution - __Memory Safety Benefit__: Eliminates potential crashes and undefined behavior, which far outweighs the small memory cost __Context in MIOpen:__ - This occurs during the kernel loading phase, not during actual ML inference/training - Kernel loading is already an expensive operation involving compilation, module creation, etc. - The additional memory copy is negligible compared to the overall kernel loading time __Trade-off Analysis:__ - __Cost__: Small increase in memory usage during kernel loading - __Benefit__: Eliminates memory safety bugs that could cause crashes or data corruption - __Net Result__: Significantly positive - reliability and correctness are much more valuable than the minimal memory overhead In practice, this fix follows the RAII (Resource Acquisition Is Initialization) principle and ensures proper ownership semantics, which is standard best practice in modern C++. The performance impact should be unnoticeable in real-world usage.
538cd03 to
cc36baf
Compare
cc36baf to
8ed4abb
Compare
| # skip ShadowLimit and Srd+2 calculation here in useFixedSrd2 case | ||
| if not useFixedSrd2: | ||
| if isPreShuffled: | ||
| module.add(SMovB32(dst=sgpr("Srd%s+2"%tc), src="BufferLimit", |
There was a problem hiding this comment.
Please remove this line.
We should not do this.
There was a problem hiding this comment.
We should properly set SrdB+2.
(same logic as current non preshuffleA/B, preshuffle MXSA/B)
| if soffset == 0: | ||
| soffset = correction | ||
| else: | ||
| module.add(SAddU32(dst=soffset, src0=soffset, src1=correction, |
There was a problem hiding this comment.
Can we move this out of main loop?
We do not need to do add inside main loop.
There was a problem hiding this comment.
First of all, we should not need any correction.
We should use the same logic as swizzleMXSA/B (as well as non swizzle A/B).
My expectation is to use same calculation logic as swizzle MXSA/B (just use different swizzle block size for A/B).
There was a problem hiding this comment.
I can pull it out of the loop I think but I don't see how it's possible to manage with soffset
There was a problem hiding this comment.
Current non swizzleA/B code already handle that.
We can assume non swizzleA/B has swizzle block size 1x1.
We just need to generalize the calculation for current swizzle block.
There was a problem hiding this comment.
We should not need any extra correction for swizzled B.
What I expect is to use the same logic as swizzled MXSA/B.
-> means use SwizzleSize0/1 for swizzled A/B (SwizzleSize0=16, SwizzleSize1=32?)
There was a problem hiding this comment.
Essentially, we can unify non swizzleA/B code and swizzled A/B code here.
non swizzled case, it is just SwizzleSize0=SwizzleSize1=1.
Then, all of remaining calculation should be common with using SwizzleSize0/1.
| # lane 1: sharedVgprLROffset[0] = 16, sharedVgprLROffset[1] = 1040 | ||
| # lane 63: sharedVgprLROffset[0] = 1008, sharedVgprLROffset[1] = 2032 | ||
| mmaTileBytes = tileInfo.mmaTileSize | ||
| for vgprId in range(len(tileInfo.sharedVgprLROffset)): |
There was a problem hiding this comment.
We do not need multiple vgpr offset here, if the second one has just constant offset.
Currently,
offset 0 = laneID * 16
offset 1 = offset 0 + const offset
In that case, we can just use const offset in ds_read.
Reducing 1 vgpr usage is important.
0a64888 to
440d1a8
Compare
| isPreShuffledAB = tc in ("A", "B") and kernel["ProblemType"].get("SwizzleTensor%s" % tc, False) | ||
| useFixedSrd2 = useSubtile or isPreShuffledAB |
There was a problem hiding this comment.
Please do not enable this for non subtile case.
So far, I would like to limit this change only for subtile.
| module.add(VAndB32(dst=vgpr(correctionVgpr), src0=hex(~(nTileRows - 1) & 0xFFFFFFFF), | ||
| src1=vgpr(rowOffset), | ||
| comment="alignedRow = rowOffset & ~%d (tile align)" % (nTileRows - 1))) | ||
| module.add(SSubU32(dst=sgpr(correctionSgpr), src0=sgpr("SizeL"), src1=hex(kernel["_DepthU%s" % tc]), |
There was a problem hiding this comment.
We should not use sizeL here.
This calculation should be a part of vgpr B offset and we should use stride instead.
Then, we should not need to do add and sub at each GR B.
There was a problem hiding this comment.
First of all, this part is basically calculating LDS dest address for DTL load.
However, the code you added here is for global memory address correction.
We should not mix up LDS address calculation and GR address calculation.
Please integrate this correction code into GR offset calculation (vgpr offset).
The swizzle cache keyed on (bitWidth, unrolledSize, tiledSize), which collides when A and B have the same shape (e.g. M==N). When B hit the cache, it received A's permuted data instead of its own. Add tensor index to the key to prevent cross-tensor cache reuse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codegen (Python): calSwizzlePackK used int(numBytes()) which truncates FP4's 0.5 to 0, causing division by zero. Switch to bit-width arithmetic: 128 // MIInputPerThread // numBits. Client (C++): copySwizzledToGPUBuffer did not handle FP4: - toBitWidth: missing Float4 case (threw "unsupported datatype") - calculateKforSwizzling: missing Float4 case (MiK=16, MiKv=16) - Tensor constructor: desc.elementBytes() asserts for FP4. Use dtInfo.elementSize (sizeof the packed storage type, always >= 1) - unrolledSize: divide by dtInfo.packing to convert logical elements to storage units (no-op for packing=1, halves for FP4) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…P4 GEMM
Support pre-shuffled tensor layout for A and/or B on the subtile path.
Pre-shuffle rearranges data on the host as:
X.reshape(D//16, 16, K//32, 32).transpose(0, 2, 1, 3)
grouping each 16x32 FP4 block (256 bytes) contiguously in HBM.
Key codegen changes (SubtileBasedKernel.py):
- TileInfo.isPreShuffled field, set from SwizzleTensor{A,B}
- Linear GR/LR addressing for pre-shuffled tiles (no swizzle/rotation)
- Per-lane GR offset uses depthUBytes directly (compile-time shift)
- Subtile group offsets use Stride{A0I,B1J} (= K, unchanged)
- Per-wave soffset correction for K > DepthU at tile boundaries
- SRD pointer advance scaled by mmaTileShape[0] for interleaved layout
- SRD bounds check disabled for pre-shuffled tensors
Solution.py: relax DirectToVgpr checks for UseSubtileImpl.
KernelWriter.py: call disableSrdBoundsForPreShuffled after graAddresses.
178 tests: B-only (56 MIWT configs), A-only, A+B, StreamK,
and non-regression (no pre-shuffle).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The grTracker dedup in emitGR is no longer needed — the upstream dependency-based instruction scheduling (ROCm#19) deduplicates subtile loads at the scheduling level and fixes the WG 4x1/1x4 PGR=2 issue. Re-enable the previously commented-out PGR=2 WG 4x1/1x4 configs in subtile_mxfp4_preshuffle.yaml (now passing for both K=DU and K>DU).
440d1a8 to
407c73c
Compare
| elif isPreShuffledAB: | ||
| # Pre-shuffled layout: each row-block spans swizzleSize0 rows, | ||
| # so physical stride per block = stride * swizzleSize0 * bpe. | ||
| module.add(SMulI32(dst=sgpr(stmp+0), src0=sgpr(stmp+0), src1=swizzleSize0, comment="numLine * stride * %u (row-block stride)"%swizzleSize0)) |
There was a problem hiding this comment.
Would you please use the same logic as MXSA/B for stride?
MXSA,B case, we adjust the stride at very beginning.
// Scale StridesMXSA by 32
s_lshl_b32 s[sgprStridesMXSA], s[sgprStridesMXSA], 5
s_lshl_b32 s[sgprStridesMXSB], s[sgprStridesMXSB], 5
We can do same thing for shuffleB.
(please apply this logic only for SubIter code)
|
PGR0+DU512 (MT128x128x512) fails with swizzleB. |
Summary
The main goal of this PR was to support shuffled B (weights). But it includes support for shuffling A, and for shuffling A and B, as these were very easy additions.
With pre-shuffling, the address that a lane loads from with its buffer_load_dwordx4 has a very simple offset,
laneId * 16, relative to its wave's base address.With pre-shuffling, the address that a lane loads from with its ds_read_b128 is also very simple, with an offset that is also
laneId * 16. This is BC-free (see my "AITER pre-shuffle is BC-free" analysis in https://docs.google.com/spreadsheets/d/1tMj406bpUNOxFDjiBlO0nT5NCoZsnGpd7Fv6nQvs8y8/edit?usp=sharing)Code changes
This PR adapts five aspects of the subtile addressing to handle the pre-shuffled layout: