[codex] restore CPU_SIM fallback storage for split TPOP tiles #59

Open
zhoubot wants to merge 1 commit into main from codex/fix-cpu-sim-tile-default-storage

@zhoubot (Collaborator) commented Apr 8, 2026

Summary

  • restore lazy fallback backing storage for CPU_SIM Tile objects even when __PTO_AUTO__ is not defined
  • keep TASSIGN redirection intact while adding const/non-const accessors that allocate only when no backing pointer has been assigned
  • add a tpushpop CPU regression test that exercises split TPOP into unassigned destination tiles across dual subblocks
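The accessor behavior described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in include/pto/common/pto_tile.hpp: `SketchTile`, `assign`, `usesFallback`, and `sketchTileDemo` are hypothetical names, and `assign` only stands in for TASSIGN-style redirection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CPU_SIM tile; all names here are illustrative.
template <typename T, std::size_t Rows, std::size_t Cols>
class SketchTile {
public:
    SketchTile() = default;
    // Copying would alias the other tile's fallback buffer, so forbid it here.
    SketchTile(const SketchTile&) = delete;
    SketchTile& operator=(const SketchTile&) = delete;

    // Stand-in for TASSIGN-style redirection to caller-owned memory.
    void assign(T* external) { data_ = external; }

    // Both accessors allocate only when no backing pointer has been assigned.
    T* data() { ensureStorage(); return data_; }
    const T* data() const { ensureStorage(); return data_; }

    // True when the tile is currently backed by its own fallback buffer.
    bool usesFallback() const {
        return !fallback_.empty() && data_ == fallback_.data();
    }

private:
    void ensureStorage() const {
        if (!data_) {
            fallback_.resize(Rows * Cols);
            data_ = fallback_.data();
        }
    }

    // mutable so the const accessor can allocate lazily.
    mutable std::vector<T> fallback_;
    mutable T* data_ = nullptr;
};

// Exercises both paths: an unassigned tile and an explicitly assigned one.
inline bool sketchTileDemo() {
    SketchTile<float, 4, 8> lazy;
    lazy.data()[31] = 1.0f;  // last element; fallback storage exists by now

    float external[32] = {};
    SketchTile<float, 4, 8> assigned;
    assigned.assign(external);
    assigned.data()[0] = 2.0f;  // writes land in the caller-owned buffer

    return lazy.usesFallback() && !assigned.usesFallback() &&
           external[0] == 2.0f && lazy.data()[31] == 1.0f;
}
```

In this sketch an assigned tile never touches the fallback vector, which mirrors the stated goal of keeping TASSIGN redirection intact.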

Root cause

The regression in issue #50 came from include/pto/common/pto_tile.hpp after d940c05b: CPU_SIM tiles only kept internal lazy storage when both __CPU_SIM and __PTO_AUTO__ were defined. In normal CPU_SIM builds used by a5sim, split TPOP consumers could therefore keep a null backing pointer unless the kernel explicitly called TASSIGN.
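The effect of the guard can be illustrated with a small compile-time sketch. The macros `SKETCH_CPU_SIM` and `SKETCH_PTO_AUTO` are hypothetical stand-ins for __CPU_SIM and __PTO_AUTO__, and the exact conditions in pto_tile.hpp are assumed rather than quoted.

```cpp
#include <cassert>

// Simulate a normal CPU_SIM build: the simulator macro is defined,
// the auto macro is intentionally left undefined.
#define SKETCH_CPU_SIM 1

// Assumed shape of the guard after d940c05b: both macros required.
#if defined(SKETCH_CPU_SIM) && defined(SKETCH_PTO_AUTO)
constexpr bool kFallbackBeforeFix = true;
#else
constexpr bool kFallbackBeforeFix = false;
#endif

// Assumed shape of the guard after this patch: CPU_SIM alone is enough.
#if defined(SKETCH_CPU_SIM)
constexpr bool kFallbackAfterFix = true;
#else
constexpr bool kFallbackAfterFix = false;
#endif

static_assert(!kFallbackBeforeFix, "fallback storage is compiled out");
static_assert(kFallbackAfterFix, "fallback storage is available again");
```

Under these assumptions, a build that defines only the simulator macro loses the fallback path with the stricter guard and regains it with the relaxed one.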

The failing BGEMM path in the simpler repository declares a split TPOP destination tile without calling TASSIGN. The CPU path reaches TPop.hpp -> CopyLinearToTile, which writes through dst.data()[...], so the null tile backing pointer causes a segmentation fault during execution.

This patch restores the earlier CPU_SIM fallback-storage behavior at the Tile layer, which is the correct upstream fix. Downstream kernels no longer need to add TASSIGN just to avoid the regression.

Validation

  • python3 tests/run_cpu.py --testcase tpushpop --gtest_filter 'TPushPopTest.a5_style_c2v_dual_subblock_split_push_pop_without_tassign_dst' --generator Ninja --build-dir build/cpu_tpushpop_regression --clean
  • python3 tests/run_cpu.py --testcase tpushpop --generator Ninja --build-dir build/cpu_tpushpop_full --clean
  • CC=/opt/homebrew/bin/gcc-15 CXX=/opt/homebrew/bin/g++-15 PTO_ISA_ROOT=/Users/zhoubot/github/pto-isa /Users/zhoubot/github/pto-orgs/simpler/.venv313/bin/python examples/scripts/run_example.py --build -k examples/a5/tensormap_and_ringbuffer/bgemm/kernels -g examples/a5/tensormap_and_ringbuffer/bgemm/golden.py -p a5sim --log-level warn

Fixes #50

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Tile class in include/pto/common/pto_tile.hpp by introducing a private ensureStorage method to handle lazy initialization of the internal buffer, which is now used by both non-const and newly added const data() accessors. The internal storage members are marked mutable to support this. Additionally, the preprocessor conditions for these features were simplified, and a new test case was added to tests/cpu/st/testcase/tpushpop/main.cpp. A review comment points out that ensureStorage lacks thread safety, which could lead to race conditions if Tile instances are shared between threads during CPU simulation.

Comment on lines +1660 to +1666
    AICORE void ensureStorage() const
    {
        if (!data_) {
            internalBuffer.resize(Rows * Cols);
            data_ = internalBuffer.data();
        }
    }

Severity: medium

The ensureStorage method is not thread-safe. In a multi-threaded CPU simulation environment, if multiple threads access the same Tile instance and call data() for the first time simultaneously, a race condition could occur during the internalBuffer.resize and data_ assignment. While Tile objects are often local to a subblock, if they are ever shared across threads in __CPU_SIM, this could lead to undefined behavior or crashes.
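One way to address this concern, sketched under the assumption that a per-tile mutex is acceptable in CPU simulation, is to serialize the lazy allocation. `LockedSketchTile` and `lockedTileDemo` are hypothetical names; nothing here is taken from the patch, and the locking cost on every `data()` call may not be acceptable in practice.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical tile whose lazy allocation is guarded by a mutex.
template <typename T, std::size_t Rows, std::size_t Cols>
class LockedSketchTile {
public:
    LockedSketchTile() = default;
    LockedSketchTile(const LockedSketchTile&) = delete;
    LockedSketchTile& operator=(const LockedSketchTile&) = delete;

    T* data() { return ensureStorage(); }
    const T* data() const { return ensureStorage(); }

private:
    // The check, the resize, and the pointer publication all happen under
    // the lock, so concurrent first calls cannot both resize the buffer.
    T* ensureStorage() const {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!data_) {
            fallback_.resize(Rows * Cols);
            data_ = fallback_.data();
        }
        return data_;
    }

    mutable std::mutex mutex_;
    mutable std::vector<T> fallback_;
    mutable T* data_ = nullptr;
};

// Single-threaded check that repeated calls see the same storage.
inline bool lockedTileDemo() {
    LockedSketchTile<int, 2, 2> tile;
    int* first = tile.data();
    first[3] = 7;
    const LockedSketchTile<int, 2, 2>& view = tile;
    return view.data() == first && view.data()[3] == 7;
}
```

The lock only protects the allocation itself; element writes through the returned pointer from several threads would still need their own synchronization.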

Development

Successfully merging this pull request may close these issues.

[BUG] Regression: a5sim BGEMM segmentation fault after CPU_SIM memory manager refactor in d940c05b
