[Feature] Support dual-slot kernel dispatch in Simpler #492

@zhusy54

Description

Summary

Implement dual-slot (pipeline) kernel dispatch for Simpler AICPU runtime to eliminate scheduling gaps and improve core utilization. Allow two tasks (running + pending) to coexist on a single core, with the pending task ready to execute immediately after the running task completes.

Motivation / Use Case

Current single-slot dispatch creates gaps between kernel execution on AICore:

  • AICPU dispatch task → AICore executes → completion → AICPU detects completion → dispatch next task
  • AICore sits idle waiting for the next payload during the AICPU's dispatch latency

Dual-slot dispatch enables:

  1. Zero-gap pipelining: the pending task can start executing immediately after the running task's FIN, without waiting for the AICPU's next dispatch cycle
  2. Improved core utilization: reduce wasted core cycles due to dispatch overhead
  3. Better throughput: enable tighter kernel-to-kernel scheduling on dependent task chains

Proposed Design

The design spans three main areas:

1. Register Protocol (no changes needed)

  • AICPU↔AICore communication via DATA_MAIN_BASE (down) and COND (up) registers unchanged
  • Extended interpretation: COND task_id can match either running_reg_task_id or pending_reg_task_id
  • ACK signal semantics expanded: ACK now also signals that AICore has locked the payload cache line (dcci complete), enabling safe payload reuse
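A minimal sketch of the extended COND interpretation (all struct and field names here are hypothetical, since the actual register layout is not shown in this issue): the AICPU completion poll must accept a task_id matching either slot.

```cpp
#include <cstdint>

// Hypothetical per-core slot state; names are illustrative only.
struct CoreSlots {
    uint32_t running_reg_task_id;
    uint32_t pending_reg_task_id;
    bool     pending_valid;
};

enum class CondMatch { Running, Pending, None };

// Extended interpretation: a COND task_id may refer to either slot.
inline CondMatch match_cond(const CoreSlots& s, uint32_t cond_task_id) {
    if (cond_task_id == s.running_reg_task_id) return CondMatch::Running;
    if (s.pending_valid && cond_task_id == s.pending_reg_task_id) return CondMatch::Pending;
    return CondMatch::None;
}
```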

2. AICPU Data Structure Changes

  • CoreExecState expansion:
    • Split executing_reg_task_id into running_reg_task_id + pending_reg_task_id
    • Add pending_slot_state and pending_subslot pointers
    • Extend CoreTracker with pending_states_ bitmap to track pending slot occupancy (one bit per core)
  • Layout remains 64-byte cache-line aligned, no space waste
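The layout above could look roughly like this (a sketch only; the actual Simpler field names, pointer types, and core count are assumptions):

```cpp
#include <cstdint>

// Illustrative one-cache-line-per-core layout; real fields may differ.
struct alignas(64) CoreExecState {
    uint32_t running_reg_task_id;   // was: executing_reg_task_id
    uint32_t pending_reg_task_id;   // new: second slot
    void*    running_subslot;       // bookkeeping pointers (hypothetical)
    void*    pending_subslot;       // new: pending task bookkeeping
    uint8_t  running_slot_state;
    uint8_t  pending_slot_state;    // new: pending slot lifecycle
    uint8_t  pad[64 - 2 * sizeof(uint32_t) - 2 * sizeof(void*) - 2];
};
static_assert(sizeof(CoreExecState) == 64, "one cache line per core");

// One bit per core marks whether its pending slot is occupied/reserved.
struct CoreTracker {
    uint64_t pending_states_ = 0;   // supports up to 64 cores here
    void set_pending(unsigned core)    { pending_states_ |=  (1ull << core); }
    void clear_pending(unsigned core)  { pending_states_ &= ~(1ull << core); }
    bool pending(unsigned core) const  { return (pending_states_ >> core) & 1ull; }
};
```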

3. Completion Detection State Machine (4 cases)

  • Case A: pending FIN — both tasks complete in one poll cycle (task executed very fast, AICPU missed intermediate ACK/FIN)
  • Case B: pending ACK — running completes, pending promoted to running (normal pipeline flow)
  • Case C: running ACK — release pending slot reservation (enable second dispatch after ACK)
  • Case D: running FIN — task complete, clear state (reached idle state)
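The four cases can be condensed into a small decision function. This is a sketch of the state machine's structure, not the actual implementation; the enum names and encodings are invented for illustration.

```cpp
// Hypothetical signal/slot encodings for the four completion cases.
enum class Signal { ACK, FIN };
enum class Slot   { Running, Pending };

enum class Action {
    RetireBoth,         // Case A: pending FIN => both tasks done in one poll
    PromotePending,     // Case B: pending ACK => running done, pending promoted
    ReleasePendingSlot, // Case C: running ACK => second dispatch may proceed
    RetireRunning       // Case D: running FIN => clear state, core idle
};

// Completion detection: which action a (slot, signal) observation implies.
inline Action on_completion(Slot slot, Signal sig) {
    if (slot == Slot::Pending)
        return sig == Signal::FIN ? Action::RetireBoth : Action::PromotePending;
    return sig == Signal::FIN ? Action::RetireRunning : Action::ReleasePendingSlot;
}
```

Note how case B works: observing the *pending* task's ACK implies the running task already finished, because AICore only ACKs the next payload after completing the current one.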

4. Dispatch Strategy

  • First Dispatch (core idle): task enters running slot, pending slot reserved (prevents early second dispatch before ACK)
  • Second Dispatch (pending slot free): task enters pending slot after observing first dispatch's ACK (payload reuse safe)
  • Payload Reuse: single per-core payload buffer, safety guaranteed by ACK barrier (AICore must complete dcci before AICPU overwrites)
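The dispatch ordering rules above can be sketched as a tiny guard around the two slots (state names and helper functions are hypothetical): the pending slot starts *reserved* on first dispatch and is released only by the running task's ACK, which is what makes single-buffer payload reuse safe.

```cpp
#include <cstdint>

// Hypothetical slot states mirroring the dispatch rules above.
enum class SlotState : uint8_t { Idle, Reserved, Occupied };

struct DispatchCtx {
    SlotState running = SlotState::Idle;
    SlotState pending = SlotState::Idle;
};

// First dispatch: core idle -> running slot filled, pending slot reserved
// so no second dispatch can land before the first task's ACK.
inline bool first_dispatch(DispatchCtx& c) {
    if (c.running != SlotState::Idle) return false;
    c.running = SlotState::Occupied;
    c.pending = SlotState::Reserved;
    return true;
}

// Running task ACKed: AICore has completed dcci on the payload, so the
// single per-core payload buffer may now be overwritten.
inline void on_running_ack(DispatchCtx& c) { c.pending = SlotState::Idle; }

// Second dispatch: only legal once the ACK released the reservation.
inline bool second_dispatch(DispatchCtx& c) {
    if (c.running != SlotState::Occupied || c.pending != SlotState::Idle) return false;
    c.pending = SlotState::Occupied;
    return true;
}
```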

5. AICore Side

  • No functionality changes required — main loop naturally supports dual-slot:
    • Each poll cycle: read DATA_MAIN_BASE → dcci(payload) → ACK → execute → FIN
    • If new task written to DATA_MAIN_BASE during execution, next poll detects it immediately (zero gap)
    • dispatch_seq counter ensures monotonic reg_task_id per core (differentiates from task_id)
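A host-side simulation of why the unchanged main loop already supports dual-slot dispatch (the register access here is a placeholder; real hardware uses MMIO reads and the dcci cache primitive): the loop keys off a monotonic reg_task_id, so any new value written during execution is consumed on the very next poll.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simulated AICore; trace records the protocol steps of each cycle.
struct Core {
    uint64_t last_seen = 0;          // last reg_task_id consumed
    std::vector<std::string> trace;

    // One poll cycle of the (unchanged) main loop:
    // read DATA_MAIN_BASE -> dcci(payload) -> ACK -> execute -> FIN.
    bool poll(uint64_t reg_task_id) {
        if (reg_task_id == last_seen) return false;  // nothing new
        last_seen = reg_task_id;                     // monotonic per core
        trace.push_back("dcci");
        trace.push_back("ACK");
        trace.push_back("execute");
        trace.push_back("FIN");
        return true;
    }
};
```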

Alternatives Considered

  1. Double-buffered payload: allocate two payload slots per core

    • ✗ Wastes GM space (payload is 64B-aligned)
    • ✗ AICore must select which payload to read based on task_id parity
    • Single-buffer with ACK barrier is simpler and safer
  2. Implicit payload locking: rely on kernel execution timing

    • ✗ Unsafe without explicit signal (dcci timing is not guaranteed)
    • Explicit ACK-based handshake is more robust

The implementation is a natural extension of the existing single-slot dispatch: the AICore main loop requires no changes, and the added complexity is concentrated in the AICPU's completion detection and dispatch logic.

Related prior art: CANN PyPTO scheduler (Ascend training framework) implements a similar dual-slot scheduler; Simpler design is an architectural simplification optimizing for clarity and correctness.

Metadata

Labels: enhancement (New feature or request)
Status: In Progress