[Feature] Support dual-slot kernel dispatch in Simpler #492

@zhusy54

Description

Summary

Implement dual-slot (pipeline) kernel dispatch for Simpler AICPU runtime to eliminate scheduling gaps and improve core utilization. Allow two tasks (running + pending) to coexist on a single core, with the pending task ready to execute immediately after the running task completes.

Motivation / Use Case

Current single-slot dispatch creates gaps between kernel execution on AICore:

  • AICPU dispatch task → AICore executes → completion → AICPU detects completion → dispatch next task
  • AICore sits idle waiting for the next payload during the AICPU's dispatch latency

Dual-slot dispatch enables:

  1. Zero-gap pipelining: the pending task can start executing immediately after the running task's FIN, without waiting for the AICPU's next dispatch cycle
  2. Improved core utilization: reduce wasted core cycles due to dispatch overhead
  3. Better throughput: enable tighter kernel-to-kernel scheduling on dependent task chains

Proposed Design

The design spans three main areas:

1. Register Protocol (no changes needed)

  • AICPU↔AICore communication via DATA_MAIN_BASE (down) and COND (up) registers unchanged
  • Extended interpretation: COND task_id can match either running_reg_task_id or pending_reg_task_id
  • ACK signal semantics expanded: ACK now also signals that AICore has locked the payload cache line (dcci complete), enabling safe payload reuse
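A minimal sketch of the extended COND interpretation (all struct and field names here are hypothetical, since the actual register layout is not shown in this issue): the AICPU completion poll must accept a task_id matching either slot.

```cpp
#include <cstdint>

// Hypothetical per-core slot state; names are illustrative only.
struct CoreSlots {
    uint32_t running_reg_task_id;
    uint32_t pending_reg_task_id;
    bool     pending_valid;
};

enum class CondMatch { Running, Pending, None };

// Extended interpretation: a COND task_id may refer to either slot.
inline CondMatch match_cond(const CoreSlots& s, uint32_t cond_task_id) {
    if (cond_task_id == s.running_reg_task_id) return CondMatch::Running;
    if (s.pending_valid && cond_task_id == s.pending_reg_task_id) return CondMatch::Pending;
    return CondMatch::None;
}
```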

2. AICPU Data Structure Changes

  • CoreExecState expansion:
    • Split executing_reg_task_id into running_reg_task_id + pending_reg_task_id
    • Add pending_slot_state and pending_subslot pointers
    • Extend CoreTracker with pending_states_ bitmap to track pending slot occupancy (one bit per core)
  • Layout remains 64-byte cache-line aligned, no space waste
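The layout above could look roughly like this (a sketch only; the actual Simpler field names, pointer types, and core count are assumptions):

```cpp
#include <cstdint>

// Illustrative one-cache-line-per-core layout; real fields may differ.
struct alignas(64) CoreExecState {
    uint32_t running_reg_task_id;   // was: executing_reg_task_id
    uint32_t pending_reg_task_id;   // new: second slot
    void*    running_subslot;       // bookkeeping pointers (hypothetical)
    void*    pending_subslot;       // new: pending task bookkeeping
    uint8_t  running_slot_state;
    uint8_t  pending_slot_state;    // new: pending slot lifecycle
    uint8_t  pad[64 - 2 * sizeof(uint32_t) - 2 * sizeof(void*) - 2];
};
static_assert(sizeof(CoreExecState) == 64, "one cache line per core");

// One bit per core marks whether its pending slot is occupied/reserved.
struct CoreTracker {
    uint64_t pending_states_ = 0;   // supports up to 64 cores here
    void set_pending(unsigned core)    { pending_states_ |=  (1ull << core); }
    void clear_pending(unsigned core)  { pending_states_ &= ~(1ull << core); }
    bool pending(unsigned core) const  { return (pending_states_ >> core) & 1ull; }
};
```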

3. Completion Detection State Machine (4 cases)

  • Case A: pending FIN — both tasks complete in one poll cycle (task executed very fast, AICPU missed intermediate ACK/FIN)
  • Case B: pending ACK — running completes, pending promoted to running (normal pipeline flow)
  • Case C: running ACK — release pending slot reservation (enable second dispatch after ACK)
  • Case D: running FIN — task complete, clear state (reached idle state)
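The four cases can be condensed into a small decision function. This is a sketch of the state machine's structure, not the actual implementation; the enum names and encodings are invented for illustration.

```cpp
// Hypothetical signal/slot encodings for the four completion cases.
enum class Signal { ACK, FIN };
enum class Slot   { Running, Pending };

enum class Action {
    RetireBoth,         // Case A: pending FIN => both tasks done in one poll
    PromotePending,     // Case B: pending ACK => running done, pending promoted
    ReleasePendingSlot, // Case C: running ACK => second dispatch may proceed
    RetireRunning       // Case D: running FIN => clear state, core idle
};

// Completion detection: which action a (slot, signal) observation implies.
inline Action on_completion(Slot slot, Signal sig) {
    if (slot == Slot::Pending)
        return sig == Signal::FIN ? Action::RetireBoth : Action::PromotePending;
    return sig == Signal::FIN ? Action::RetireRunning : Action::ReleasePendingSlot;
}
```

Note how case B works: observing the *pending* task's ACK implies the running task already finished, because AICore only ACKs the next payload after completing the current one.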

4. Dispatch Strategy

  • First Dispatch (core idle): task enters running slot, pending slot reserved (prevents early second dispatch before ACK)
  • Second Dispatch (pending slot free): task enters pending slot after observing first dispatch's ACK (payload reuse safe)
  • Payload Reuse: single per-core payload buffer, safety guaranteed by ACK barrier (AICore must complete dcci before AICPU overwrites)
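The dispatch ordering rules above can be sketched as a tiny guard around the two slots (state names and helper functions are hypothetical): the pending slot starts *reserved* on first dispatch and is released only by the running task's ACK, which is what makes single-buffer payload reuse safe.

```cpp
#include <cstdint>

// Hypothetical slot states mirroring the dispatch rules above.
enum class SlotState : uint8_t { Idle, Reserved, Occupied };

struct DispatchCtx {
    SlotState running = SlotState::Idle;
    SlotState pending = SlotState::Idle;
};

// First dispatch: core idle -> running slot filled, pending slot reserved
// so no second dispatch can land before the first task's ACK.
inline bool first_dispatch(DispatchCtx& c) {
    if (c.running != SlotState::Idle) return false;
    c.running = SlotState::Occupied;
    c.pending = SlotState::Reserved;
    return true;
}

// Running task ACKed: AICore has completed dcci on the payload, so the
// single per-core payload buffer may now be overwritten.
inline void on_running_ack(DispatchCtx& c) { c.pending = SlotState::Idle; }

// Second dispatch: only legal once the ACK released the reservation.
inline bool second_dispatch(DispatchCtx& c) {
    if (c.running != SlotState::Occupied || c.pending != SlotState::Idle) return false;
    c.pending = SlotState::Occupied;
    return true;
}
```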

5. AICore Side

  • No functionality changes required — main loop naturally supports dual-slot:
    • Each poll cycle: read DATA_MAIN_BASE → dcci(payload) → ACK → execute → FIN
    • If new task written to DATA_MAIN_BASE during execution, next poll detects it immediately (zero gap)
    • dispatch_seq counter ensures monotonic reg_task_id per core (differentiates from task_id)
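A host-side simulation of why the unchanged main loop already supports dual-slot dispatch (the register access here is a placeholder; real hardware uses MMIO reads and the dcci cache primitive): the loop keys off a monotonic reg_task_id, so any new value written during execution is consumed on the very next poll.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simulated AICore; trace records the protocol steps of each cycle.
struct Core {
    uint64_t last_seen = 0;          // last reg_task_id consumed
    std::vector<std::string> trace;

    // One poll cycle of the (unchanged) main loop:
    // read DATA_MAIN_BASE -> dcci(payload) -> ACK -> execute -> FIN.
    bool poll(uint64_t reg_task_id) {
        if (reg_task_id == last_seen) return false;  // nothing new
        last_seen = reg_task_id;                     // monotonic per core
        trace.push_back("dcci");
        trace.push_back("ACK");
        trace.push_back("execute");
        trace.push_back("FIN");
        return true;
    }
};
```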

Alternatives Considered

  1. Double-buffered payload: allocate two payload slots per core

    • ✗ Wastes GM space (payload is 64B-aligned)
    • ✗ AICore must select which payload to read based on task_id parity
    • Single-buffer with ACK barrier is simpler and safer
  2. Implicit payload locking: rely on kernel execution timing

    • ✗ Unsafe without explicit signal (dcci timing is not guaranteed)
    • Explicit ACK-based handshake is more robust

The implementation is a natural extension of the existing single-slot dispatch: the AICore main loop requires no changes, and the added complexity is concentrated in the AICPU's completion detection and dispatch logic.

Related prior art: CANN PyPTO scheduler (Ascend training framework) implements a similar dual-slot scheduler; Simpler design is an architectural simplification optimizing for clarity and correctness.

Metadata

Labels: enhancement (New feature or request)
Status: In Progress