Closed
…iling
Add a standalone simulation unit test framework (tests/aicpu_ut/) that
runs the PTO2 orchestrator and scheduler logic on a standard Linux CPU
without Ascend hardware, with integrated perf profiling support.
- Extend register read/write stubs to support simulation mode: a zero
register address is treated as a no-op dispatch, which enables
zero-core perf testing
- Add per-core register address mapping used by the executor in sim mode
- Add AICPU-side profiling buffer management: per-core dispatch timestamp
arrays, double-buffer switch, and perf_aicpu_switch_buffer()
- Expose PLATFORM_PROF_BUFFER_SIZE for compile-time sizing
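
The double-buffer switch mentioned above can be pictured with a minimal single-threaded sketch. Everything here is an assumption made for illustration: the type names, the `record()` helper, and the tiny capacity `kProfBufferSize` (standing in for PLATFORM_PROF_BUFFER_SIZE). It is not the actual perf_aicpu_switch_buffer() implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical capacity; the real code would size this from PLATFORM_PROF_BUFFER_SIZE.
constexpr std::size_t kProfBufferSize = 4;

// One buffer of dispatch timestamps (hypothetical layout).
struct ProfBuffer {
    std::array<std::uint64_t, kProfBufferSize> timestamps{};
    std::size_t count = 0;
};

// Two buffers: one is being filled while the other can be drained.
class DoubleBufferedProf {
public:
    // Append a timestamp to the active buffer; returns false when it is full.
    bool record(std::uint64_t ts) {
        ProfBuffer& buf = buffers_[active_];
        if (buf.count == kProfBufferSize) return false;
        buf.timestamps[buf.count++] = ts;
        return true;
    }

    // Make the other buffer active (reset to empty) and hand back the
    // previously active one so the caller can drain it.
    const ProfBuffer& switch_buffer() {
        const std::size_t filled = active_;
        active_ ^= 1;
        buffers_[active_].count = 0;
        return buffers_[filled];
    }

private:
    std::array<ProfBuffer, 2> buffers_{};
    std::size_t active_ = 0;
};
```

A caller would typically switch when `record()` reports a full buffer, drain the returned buffer, and keep recording into the fresh one. A multi-threaded version would need synchronization that this sketch omits.
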
- Add PTO2_ORCH_PROFILING and PTO2_SCHED_PROFILING as independent
sub-switches under the existing PTO2_PROFILING master flag
- Add PTO2_SIM_AICORE_UT branch: when cores_total_num_ == 0 (sim mode),
skip hardware register polling and run a drain loop instead
- Integrate PTO2_SCHED_PROFILING instrumentation: track per-phase cycle
counts (get_ready, resolve_deps, dispatch_setup) and accumulate into
scheduler phase breakdown output
- Add local_dispatch_count / local_overflow_count profiling counters
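
As a rough illustration of per-phase accumulation, the sketch below uses an RAII timer that adds elapsed time to a per-phase counter. It assumes `std::chrono::steady_clock` as a stand-in for whatever cycle counter the scheduler actually reads; `Phase`, `PhaseStats`, `ScopedPhase`, and `print_phase_breakdown()` are hypothetical names, not the PTO2 ones.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Phases named after the breakdown in the commit message.
enum Phase { kGetReady, kResolveDeps, kDispatchSetup, kIdle, kPhaseCount };

// Per-thread accumulators (hypothetical layout).
struct PhaseStats {
    std::uint64_t ns[kPhaseCount] = {};
    std::uint64_t calls[kPhaseCount] = {};
};

// Adds the lifetime of this object to the chosen phase on destruction.
class ScopedPhase {
    using Clock = std::chrono::steady_clock;

public:
    ScopedPhase(PhaseStats& stats, Phase phase)
        : stats_(stats), phase_(phase), start_(Clock::now()) {}

    ~ScopedPhase() {
        const auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
            Clock::now() - start_);
        stats_.ns[phase_] += static_cast<std::uint64_t>(elapsed.count());
        stats_.calls[phase_] += 1;
    }

private:
    PhaseStats& stats_;
    Phase phase_;
    Clock::time_point start_;
};

// Print one row per phase with its share of the total accumulated time.
void print_phase_breakdown(const PhaseStats& stats) {
    static const char* kNames[kPhaseCount] = {
        "get_ready", "resolve_deps", "dispatch_setup", "idle"};
    std::uint64_t total = 0;
    for (int p = 0; p < kPhaseCount; ++p) total += stats.ns[p];
    for (int p = 0; p < kPhaseCount; ++p) {
        const double pct = total ? 100.0 * stats.ns[p] / total : 0.0;
        std::printf("%-16s %12llu ns %8llu calls %6.2f%%\n", kNames[p],
                    static_cast<unsigned long long>(stats.ns[p]),
                    static_cast<unsigned long long>(stats.calls[p]), pct);
    }
}
```

Wrapping each phase of the scheduler loop in a `ScopedPhase` keeps the instrumentation local to the block it measures, which also makes it easy to compile out behind a flag such as PTO2_SCHED_PROFILING.
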
- Simulate AICore execution in-process: aicpu_sim_run_pto2() launches
scheduler threads, accumulates dispatch counts per worker type, and
provides aicpu_sim_get_actual_sched_cpu() for affinity reporting
- Add print_sched_profiling(rt): print per-thread phase breakdown
(get_ready / resolve_deps / dispatch_setup / idle) in table form
- Add ut_dispatch_without_fanin_satisfied flag: in sim mode, treats any
task with fanin_refcount >= 1 as immediately READY (bypasses fanin
wait), allowing all tasks to be dispatched for scheduler-only perf
measurement without actual AICore execution completing tasks
- Set init_task_on_submit = true when scheduler is attached so that
init_task() is called at submit time, pre-populating fanin_refcount
- Add pto2_runtime_create_custom() for tests: takes explicit
task_window_size and gm_heap_size parameters
- Add get_sim_aicore_mode() accessor
- Fully static link (no .so dependencies): discovers libstdc++.a,
libm.a, libc.a, libpthread.a, libdl.a, libgcc.a at configure time
- One binary per PERF_CASE_IDX via target_compile_definitions
- PTO2_PROFILING / PTO2_SCHED_PROFILING / PTO2_ORCH_PROFILING toggles
- PTO2_SIM_AICORE_UT option (default ON) for zero-core sim paths
- cpu_affinity.cpp / cpu_affinity.h: bind_to_cpu(), current_cpu() via
sched_setaffinity / sched_getcpu; ORCH_CPU / SCHED_CPU{0..7} from
compile-time defines
- test_common.cpp / test_common.h: make_runtime() (calls
pto2_runtime_create_custom with task_window=16384, heap=4GB),
sim_run_with_resolve_and_dispatch() (runs scheduler threads and
idle-loops until MAX_IDLE_ITERATIONS quiet cycles), and
print_orch_profiling() / print_sched_profiling() wrappers
- json_cases.h: PerfTestCase struct for compile-time test case selection
- test_log_stubs.cpp: stub out DEV_DEBUG / DEV_INFO / DEV_ERROR etc. for
host-side compilation
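
For reference, helpers like the bind_to_cpu() / current_cpu() pair listed above can be built directly on `sched_setaffinity` and `sched_getcpu`. The following is a minimal Linux-only sketch; the signatures and the `first_allowed_cpu()` helper are assumptions for illustration and may differ from cpu_affinity.h.

```cpp
#include <cassert>
#include <sched.h>

// Pin the calling thread to a single CPU. Returns true on success.
bool bind_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// CPU the calling thread is currently running on, or -1 on error.
int current_cpu() {
    return sched_getcpu();
}

// Hypothetical helper: first CPU in the current affinity mask, or -1.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

Binding to a CPU outside the allowed mask (for example inside a restricted cgroup or container) fails, so a test would normally pick a target from the current mask rather than hard-coding a core number.
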
- test_cpu_affinity.cpp: verify bind_to_cpu() and current_cpu() return
the expected core
- test_platform_config.cpp: verify PLATFORM_MAX_BLOCKDIM,
PLATFORM_AIC_CORES_PER_BLOCKDIM, PLATFORM_AIV_CORES_PER_BLOCKDIM,
PLATFORM_MAX_AICPU_THREADS compile-time values
- test_paged_attention.cpp: single-head paged attention orch+sched perf
- test_batch_paged_attention.cpp: batch paged attention full pipeline
(orchestrator and scheduler run concurrently on separate threads);
3 cases: batch=64/ctx=8193, batch=2/varseq, batch=4/varseq
- test_batch_paged_attention_orch_only.cpp: orchestration only, no
scheduler threads; used to profile build_batch_paged_attention_graph
in isolation
- test_batch_paged_attention_sched_prof_only.cpp: run orchestration
first (single-threaded, completes fully), then launch scheduler threads
separately; PERF_WAIT_AFTER_INIT / SIGSTOP mechanism pauses after orch
so perf record window covers only the scheduler phase
- CMake configure → parallel build → test execution → pass/fail summary
- Test registry (TEST_TYPE / TEST_INDICES associative arrays) for
--test / --idx filtering
- --sched-threads N: pass AICPU_UT_NUM_SCHED_THREADS to test binaries
- --no-profiling / --no-sched-profiling / --no-orch-profiling toggles
- Writes sim output to outputs/aicpu_ut_sim_run.log; phase breakdown to
outputs/aicpu_ut_phase_breakdown.log
- Wrapper around perf record for a single named binary (--bin required)
- test_batch_paged_attention* binaries: SIGSTOP/SIGCONT protocol —
detect process state T via /proc/<pid>/stat, attach perf record -p,
send SIGCONT; sampling window covers only the work phase
- Other binaries: full-program perf record -- <bin>
- --build triggers run_tests.sh --build-only before sampling
- Default: --no-build; --call-graph dwarf (default) / fp / lbr
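
The SIGSTOP/SIGCONT handshake described above can be sketched in a single C++ program, with a forked child standing in for the test binary and the parent standing in for perf_sched.sh. This is an illustration under assumptions: `proc_state()` and `run_stop_resume_demo()` are hypothetical names, and the point where `perf record -p` would attach is only marked by a comment.

```cpp
#include <cassert>
#include <csignal>
#include <fstream>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Read the state letter from /proc/<pid>/stat ('T' means stopped); '?' on failure.
char proc_state(pid_t pid) {
    std::ifstream file("/proc/" + std::to_string(pid) + "/stat");
    std::string line;
    if (!std::getline(file, line)) return '?';
    // Format is "pid (comm) state ..."; comm may contain ')', so use the last one.
    const std::size_t pos = line.rfind(')');
    if (pos == std::string::npos || pos + 2 >= line.size()) return '?';
    return line[pos + 2];
}

// Returns the child's exit code, or -1 if the handshake did not complete.
int run_stop_resume_demo() {
    const pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: the setup phase (e.g. orchestration) would run here.
        raise(SIGSTOP);  // pause until the profiler side sends SIGCONT
        // The work phase (e.g. scheduler threads) would run here.
        _exit(7);
    }
    // Parent: poll until the child reports state 'T' (up to about 5 seconds).
    bool stopped = false;
    for (int i = 0; i < 5000 && !stopped; ++i) {
        stopped = (proc_state(pid) == 'T');
        if (!stopped) usleep(1000);
    }
    int status = 0;
    if (!stopped) {
        // Never saw the stop; clean up instead of risking a hang.
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }
    // A profiler would attach to the stopped process here.
    kill(pid, SIGCONT);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

Waiting for state T before attaching is what keeps the setup phase out of the sampling window: the target cannot make progress until the observer resumes it.
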
- Document run_tests.sh and perf_sched.sh usage, parameters, available
tests, environment variables, and execution flow
- Parse aicpu_ut_sim_run.log and print per-task-type dispatch statistics
- Add Part 2 JSON phase data source: try parse_scheduler_from_json_phases
first (perf JSON version >= 2), fall back to device log parsing
- Extend Phase Breakdown table with get_ready / dispatch_setup columns
- Development notes and simulation architecture overview
Update: remove ut_dispatch_without_fanin_satisfied; add build/profiling opts
Remove ut_dispatch_without_fanin_satisfied from PTO2Scheduler:
- Field bypassed fanin dependency check (fanin_rc>=1 instead of
fanin_rc==fanin_count) in sim tests; no longer needed as the
dependency chain is evaluated correctly without the escape hatch
- Simplify release_fanin_and_check_ready() to unconditional
bool ready = (new_refcount == task->fanin_count)
- Remove initialization in pto_scheduler.cpp and the
#if PTO2_SIM_AICORE_UT blocks in test_batch_paged_attention.cpp
and test_batch_paged_attention_sched_prof_only.cpp
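
The simplified readiness check can be illustrated as follows. The `Task` layout, the atomic increment, and the function signature are assumptions; only the comparison `new_refcount == task->fanin_count` comes from the commit message.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical task record: fanin_count is the number of producers the task
// waits on, fanin_refcount counts how many of them have completed so far.
struct Task {
    std::int32_t fanin_count = 0;
    std::atomic<std::int32_t> fanin_refcount{0};
};

// Called once per completed producer. Returns true only for the call that
// satisfies the last outstanding dependency.
bool release_fanin_and_check_ready(Task* task) {
    const std::int32_t new_refcount =
        task->fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
    // Unconditional check, with no simulation-only bypass.
    bool ready = (new_refcount == task->fanin_count);
    return ready;
}
```

Because the comparison is an equality against the value returned by the atomic increment, exactly one caller observes `ready == true` even when producers complete concurrently. The removed flag's `>= 1` test would instead have reported readiness on every release.
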
run_tests.sh:
- Default profiling to OFF (silent run); add --profiling flag to
enable all profiling output; combine --profiling with
--no-sched-profiling or --no-orch-profiling for selective control
- Suppress SIM_LOG/AICPU_UT_PHASE_LOG writes and summary output
when profiling is off
- Add --opt-level <N> parameter (default 3); passed to CMake as
OPT_LEVEL; also settable via OPT_LEVEL env variable
CMakeLists.txt:
- Add OPT_LEVEL cache variable (default 3); compile options now use
-O${OPT_LEVEL} so optimization level is configurable at build time
HARDWARE_SIMULATION.md:
- Remove outdated ut_dispatch_without_fanin_satisfied section
Made-with: Cursor
… into hardware_test Made-with: Cursor