Summary
Implement dual-slot (pipeline) kernel dispatch for Simpler AICPU runtime to eliminate scheduling gaps and improve core utilization. Allow two tasks (running + pending) to coexist on a single core, with the pending task ready to execute immediately after the running task completes.
Motivation / Use Case
Current single-slot dispatch creates gaps between kernel execution on AICore:
- AICPU dispatches task → AICore executes → completion → AICPU detects completion → dispatches next task
- AICore idle waiting for next payload during AICPU's dispatch latency
Dual-slot dispatch enables:
- Zero-gap pipelining: pending task can start execution immediately after running task FIN, without waiting for AICPU's next dispatch cycle
- Improved core utilization: reduce wasted core cycles due to dispatch overhead
- Better throughput: enable tighter kernel-to-kernel scheduling on dependent task chains
Proposed Design
The design spans five main areas:
1. Register Protocol (no changes needed)
- AICPU↔AICore communication via DATA_MAIN_BASE (down) and COND (up) registers unchanged
- Extended interpretation: COND task_id can match either running_reg_task_id or pending_reg_task_id
- ACK signal semantics expanded: signals payload cache-line lock (dcci complete), enabling safe payload reuse
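The extended COND interpretation can be sketched as a small classifier. A minimal sketch: the `SlotIds` struct shape and the `classify_cond` helper are assumptions for illustration, only the field names come from the design.

```cpp
#include <cstdint>

// Hypothetical per-core view of the two in-flight reg_task_ids
// (field names from the design; the struct itself is an assumption).
struct SlotIds {
    uint32_t running_reg_task_id;
    uint32_t pending_reg_task_id;
};

enum class CondMatch { Running, Pending, None };

// A COND upstream signal carries a reg_task_id; under the extended
// interpretation it may refer to either the running or the pending slot.
inline CondMatch classify_cond(const SlotIds& s, uint32_t cond_task_id) {
    if (cond_task_id == s.running_reg_task_id) return CondMatch::Running;
    if (cond_task_id == s.pending_reg_task_id) return CondMatch::Pending;
    return CondMatch::None;
}
```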
2. AICPU Data Structure Changes
CoreExecState expansion:
- Split executing_reg_task_id → running_reg_task_id + pending_reg_task_id
- Add pending_slot_state and pending_subslot pointers
- Extend CoreTracker with pending_states_ bitmap to track pending slot occupancy (one bit per core)
- Layout remains 64-byte cache-line aligned, no space waste
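The data-structure changes above can be sketched as follows. Field names and the bitmap helpers follow the design's terminology; the exact types, field order, and padding are assumptions, with the 64-byte cache-line invariant checked at compile time.

```cpp
#include <cstdint>

struct PendingSubslot;  // opaque here; real definition lives elsewhere

// Sketch of the expanded per-core state (layout is an assumption;
// the invariant is that it still fits one 64-byte cache line).
struct alignas(64) CoreExecState {
    uint32_t running_reg_task_id;     // was executing_reg_task_id
    uint32_t pending_reg_task_id;     // new: second in-flight task
    uint32_t pending_slot_state;      // new: free / reserved / occupied
    uint32_t dispatch_seq;            // monotonic per-core dispatch counter
    PendingSubslot* pending_subslot;  // new: pointer into subslot pool
    // remaining bytes: pre-existing fields / padding up to 64B
    uint8_t reserved[64 - 4 * sizeof(uint32_t) - sizeof(PendingSubslot*)];
};
static_assert(sizeof(CoreExecState) == 64, "one cache line per core");

// CoreTracker gains a one-bit-per-core occupancy bitmap for pending slots.
struct CoreTracker {
    uint64_t pending_states_ = 0;  // bit i set => core i's pending slot occupied
    void set_pending(unsigned core)       { pending_states_ |=  (1ULL << core); }
    void clear_pending(unsigned core)     { pending_states_ &= ~(1ULL << core); }
    bool has_pending(unsigned core) const { return (pending_states_ >> core) & 1; }
};
```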
3. Completion Detection State Machine (4 cases)
- Case A: pending FIN — both tasks complete in one poll cycle (task executed very fast, AICPU missed intermediate ACK/FIN)
- Case B: pending ACK — running completes, pending promoted to running (normal pipeline flow)
- Case C: running ACK — release pending slot reservation (enables second dispatch after ACK)
- Case D: running FIN — task complete, clear state (core reaches idle state)
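The four cases reduce to a decision on two axes: which slot the observed COND signal matches, and whether it is an ACK or a FIN. A minimal sketch with illustrative names; the real handler also updates payload and subslot state:

```cpp
// Which slot the COND reg_task_id matched, and which signal arrived.
enum class Slot { Running, Pending };
enum class Signal { Ack, Fin };

enum class Action {
    BothComplete,       // Case A: pending FIN — running + pending finished in one poll
    PromotePending,     // Case B: pending ACK — running done, pending becomes running
    ReleaseReservation, // Case C: running ACK — pending slot may now be filled
    ClearState,         // Case D: running FIN — core goes idle
};

inline Action on_completion(Slot slot, Signal sig) {
    if (slot == Slot::Pending)
        return sig == Signal::Fin ? Action::BothComplete : Action::PromotePending;
    return sig == Signal::Ack ? Action::ReleaseReservation : Action::ClearState;
}
```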
4. Dispatch Strategy
- First Dispatch (core idle): task enters running slot, pending slot reserved (prevents early second dispatch before ACK)
- Second Dispatch (pending slot free): task enters pending slot after observing first dispatch's ACK (payload reuse safe)
- Payload Reuse: single per-core payload buffer, safety guaranteed by ACK barrier (AICore must complete dcci before AICPU overwrites)
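The dispatch gating above can be modeled as a small state machine. A sketch under stated assumptions: `CoreSlots`, `try_dispatch`, and `on_running_ack` are hypothetical names, and `acked` models whether the first dispatch's ACK (payload dcci complete) has been observed.

```cpp
// Illustrative per-core dispatch gate (not the runtime's real API).
struct CoreSlots {
    bool running_busy = false;  // running slot occupied
    bool pending_busy = false;  // pending slot occupied (or merely reserved)
    bool acked = false;         // first dispatch's ACK seen => payload reusable
};

enum class Dispatch { First, Second, Refused };

inline Dispatch try_dispatch(CoreSlots& c) {
    if (!c.running_busy) {
        // First dispatch: take running slot and reserve the pending slot so
        // a second dispatch cannot overwrite the payload before ACK.
        c.running_busy = true;
        c.pending_busy = true;  // reservation
        c.acked = false;
        return Dispatch::First;
    }
    if (c.acked && !c.pending_busy) {
        c.pending_busy = true;  // a real second task now occupies the slot
        return Dispatch::Second;
    }
    return Dispatch::Refused;   // payload not yet safe to overwrite
}

// Running ACK (Case C) lifts the reservation: AICore has finished dcci,
// so the single per-core payload buffer may be rewritten.
inline void on_running_ack(CoreSlots& c) {
    c.acked = true;
    c.pending_busy = false;
}
```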
5. AICore Side
- No functionality changes required — main loop naturally supports dual-slot:
- Each poll cycle: read DATA_MAIN_BASE → dcci(payload) → ACK → execute → FIN
- If new task written to DATA_MAIN_BASE during execution, next poll detects it immediately (zero gap)
- dispatch_seq counter ensures monotonic reg_task_id per core (differentiates from task_id)
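The unchanged AICore poll cycle can be simulated in software. Everything here models hardware behavior and is illustrative: `Mailbox` stands in for the DATA_MAIN_BASE down register, `CondReg` for the COND up channel, and dcci/execute are no-ops.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Mailbox { uint32_t data_main_base = 0; };  // down register (0 = empty)
struct CondReg {                                   // up channel: ('A'ck / 'F'in)
    std::vector<std::pair<uint32_t, char>> signals;
};

// One poll cycle of the (unchanged) AICore main loop. Because the loop
// re-polls immediately after FIN, a task written to DATA_MAIN_BASE during
// execution is picked up on the very next cycle — zero gap.
inline void aicore_poll_once(Mailbox& mb, CondReg& cond) {
    uint32_t reg_task_id = mb.data_main_base;    // read DATA_MAIN_BASE
    if (reg_task_id == 0) return;                // nothing dispatched
    mb.data_main_base = 0;                       // consume the slot
    // dcci(payload): pull the payload cache line (modeled as a no-op)
    cond.signals.push_back({reg_task_id, 'A'});  // ACK: payload locked in
    // execute kernel (modeled as a no-op)
    cond.signals.push_back({reg_task_id, 'F'});  // FIN: task complete
}
```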
Alternatives Considered
- Double-buffered payload: allocate two payload slots per core
  - ✗ Wastes GM space (payload is 64B-aligned)
  - ✗ AICore must select which payload to read based on task_id parity
  - Single-buffer with ACK barrier is simpler and safer
- Implicit payload locking: rely on kernel execution timing
  - ✗ Unsafe without explicit signal (dcci timing is not guaranteed)
  - Explicit ACK-based handshake is more robust
The implementation is a natural extension of the existing single-slot dispatch: the AICore main loop requires no changes, and the complexity is concentrated in the AICPU's completion detection and dispatch logic.
Related prior art: CANN PyPTO scheduler (Ascend training framework) implements a similar dual-slot scheduler; Simpler design is an architectural simplification optimizing for clarity and correctness.