Feature cp#352

Merged
tinebp merged 29 commits into tinebp-patch-2 from feature_cp
May 18, 2026

tinebp commented May 18, 2026

No description provided.

tinebp and others added 29 commits May 17, 2026 03:19
Design review of the OPAE prototype (docs/designs/) plus the parent
architecture proposal and two implementation proposals (runtime SW
and RTL) under docs/proposals/. Documents the v1 plan for a portable
Vortex Command Processor, async vortex2.h runtime, and per-block
helper layering — foundation for OpenCL 1.2 backend conformance and
future Vulkan / CUDA / HIP / Metal / OpenGL translators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lock-step MMIO runtime is replaced with an async, queue-based
architecture shaped for OpenCL/Vulkan/HIP/CUDA/Metal backends. Legacy
vortex.h is preserved as a thin wrapper over vortex2.h so existing
POCL/tests keep working unchanged.

New API surface (sw/runtime/include/vortex2.h):
  vx_device_{open,query,memory_info,retain,release}
  vx_buffer_{alloc,from_ptr,map,unmap,retain,release}
  vx_queue_{create,flush,wait_idle,retain,release}
  vx_event_{create_user,signal_user,wait,retain,release}
  vx_enqueue_{copy,launch,dcr_write,dcr_read,signal,wait,marker,barrier}

Implementation (sw/runtime/common/):
  - vortex2_internal.h: vx::Device/Buffer/Queue/Event classes +
    vx::Platform abstract + CallbacksAdapter bridging to C-ABI
    callbacks_t for backend dispatch
  - vx_{device,buffer,queue,event,result}.cpp
  - legacy_runtime.cpp: vx_start, vx_start_g, vx_mem_*, vx_dcr_*
    wrappers; vx_start_g programs the full KMU descriptor (PC, args,
    grid, block, lmem, block_size, warp_step) and triggers async launch
  - legacy_perf.cpp, legacy_utils.cpp (renamed from stub/)

Backend dispatcher unchanged:
  libvortex.so dlopens libvortex-<NAME>.so via VORTEX_DRIVER env var.
  All four backend dirs (simx, rtlsim, xrt, opae) preserved; the C-ABI
  callbacks_t struct is rewritten to a Platform-shaped vtable. $ORIGIN
  rpath added so the dispatcher finds sibling backend libs.

Verified end-to-end via POCL on simx backend:
  - tests/opencl/vecadd PASSED
  - tests/opencl/sgemm  PASSED (1749 ms, n=32)
  - tests/runtime/test_basic PASSED (new direct vortex2 smoke test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The legacy kernel-upload helper in legacy_utils.cpp passes size=0 to
vx_mem_access when a kernel image has no BSS region (bin_size ==
runtime_size). The previous rejection in callbacks.inc broke tests
like regression/basic, demo, dogfood whose kernels have no BSS.

Now size=0 is a no-op success. The underlying simx/rtlsim mem_access
implementations already handle size=0 (ACLManager::set returns early),
so this only fixes the wrapper rejection.

Verified: basic, demo, dogfood, mstress now PASS on simx; sgemm OpenCL
and vecadd OpenCL still PASS on simx and rtlsim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A 6-section native test of the vortex2.h async surface, distinct from
the existing test_basic smoke test. Covers:

  1. event_chain   — two queues, copy from q1 feeds copy on q2 via event
  2. user_event    — host-side wait/signal with TIMEOUT + SUCCESS paths
                     (cross-thread signal release)
  3. barrier       — vx_enqueue_barrier joins N independent prior writes
  4. profiling     — queued ≤ submit ≤ start ≤ end ordering on events
  5. map_unmap     — buffer write-mapped + read-mapped round-trip
  6. queue_finish  — drains all in-flight commands; events COMPLETE

Verified PASS on both simx and rtlsim backends via VORTEX_DRIVER env.

Surfaced one runtime limitation: Queue::wait_on_externals currently
blocks the enqueue caller synchronously, so gating an enqueue on an
unsignaled user event would deadlock. Documented inline in section 2
for follow-up when CP-driven async lands and a deferred-wait worker
is introduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each Queue now owns one background worker thread fed by a
std::deque<Command> FIFO. Enqueue API entry points only build a
Command (a lambda wrapping the underlying Platform call) and push it;
the worker pops, waits on the command's dep events, and runs the
work lambda. This gives three properties the synchronous fallback
lacked:

  1. No caller-thread deadlocks when an enqueue is gated on an
     unsignaled user event — the wait happens on the worker.
  2. In-queue ordering preserved (single worker = strict FIFO),
     matching the OpenCL in-order queue semantics POCL relies on.
  3. Cross-queue concurrency between workers (platform calls still
     serialize behind enqueue_mu_ in v1 because the backend is
     single-threaded; CP-driven backends will relax this).
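
The worker model above can be sketched in a few lines of standard C++. This is a minimal illustration, not the code in vx_queue.cpp: `WorkerQueue`, `run_in_order` and `gated_enqueue_returns_before_signal` are hypothetical names, dependency events are reduced to a plain future, and the Platform call is replaced by an arbitrary lambda.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the runtime's Queue; all names are illustrative.
class WorkerQueue {
public:
    WorkerQueue() : worker_([this] { loop(); }) {}

    ~WorkerQueue() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            shutdown_ = true;
        }
        cv_.notify_all();
        worker_.join();  // worker drains remaining commands, then exits
    }

    // Enqueue only records the work; it never runs or waits on the caller.
    void enqueue(std::function<void()> work) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            commands_.push_back(std::move(work));
        }
        cv_.notify_one();
    }

    // Sentinel command: once it runs, everything queued before it has run.
    void finish() {
        auto done = std::make_shared<std::promise<void>>();
        std::future<void> f = done->get_future();
        enqueue([done] { done->set_value(); });
        f.wait();
    }

private:
    void loop() {
        for (;;) {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return shutdown_ || !commands_.empty(); });
                if (commands_.empty()) return;  // shutdown requested, FIFO drained
                work = std::move(commands_.front());
                commands_.pop_front();
            }
            work();  // runs outside the lock, strictly in FIFO order
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> commands_;
    bool shutdown_ = false;
    std::thread worker_;  // declared last so the other members exist first
};

// FIFO check: commands append their index; the result should be 0..n-1.
std::vector<int> run_in_order(int n) {
    std::vector<int> seen;
    WorkerQueue q;
    for (int i = 0; i < n; ++i) q.enqueue([&seen, i] { seen.push_back(i); });
    q.finish();
    return seen;
}

// A command gated on an unsignaled "user event" must not block enqueue().
bool gated_enqueue_returns_before_signal() {
    std::promise<void> gate;
    std::shared_future<void> gate_f = gate.get_future().share();
    std::atomic<bool> ran{false};
    WorkerQueue q;
    q.enqueue([gate_f, &ran] { gate_f.wait(); ran = true; });
    bool ran_before_signal = ran.load();  // still false: the worker is waiting
    gate.set_value();
    q.finish();
    return !ran_before_signal && ran.load();
}
```

Because the wait happens on the worker thread, the gated enqueue returns immediately; the single worker is also what gives strict in-queue ordering.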

Files:
  - sw/runtime/common/vortex2_internal.h: Queue::Command struct,
    cmd_mu_/cmd_cv_/commands_/shutdown_/worker_ members, new headers
    (deque, functional, thread, vector).
  - sw/runtime/common/vx_queue.cpp: rewritten — ctor starts worker,
    dtor sets shutdown + joins, worker_loop() pops + waits + runs,
    enqueue() common builder retains wait-events, every enqueue_*
    builds a Command lambda. finish() emits a sentinel barrier.
  - sw/runtime/common/legacy_runtime.cpp: vx_start_g now fires its
    15 KMU DCR writes without per-write events/waits — FIFO order
    is guaranteed by the single worker, eliminating 15 worker
    round-trips per kernel launch.
  - docs/proposals/cp_runtime_impl_proposal.md: new §4.6.1 describing
    the v1 pre-CP fallback and the migration path to ring-buffer
    submission once VX_cp_core lands.
  - tests/runtime/test_async.cpp: + user_event_gated_enqueue subtest
    (proves the deadlock is fixed: enqueue returns < 50ms even with
    an unsignaled gate; copy completes after background thread
    signals); + concurrent_queues subtest (4 queues × 8 writes each,
    all complete + verify per-queue patterns).

Verified PASS on simx + rtlsim:
  - tests/runtime/test_basic + test_async (8 subtests)
  - tests/opencl/{vecadd,sgemm,saxpy,dotproduct,sfilter}
  - tests/regression/{basic,demo,dogfood,mstress}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VX_cp_arbiter is a generic round-robin arbiter intended to gate access
to the three shared CP resources (KMU, DMA, DCR) once VX_cp_core lands.

Real bug fix: the previous implementation used `% PTR_W'(N)` to wrap
indices, which truncates to zero when N is a power of 2 (the common
case — 1, 2, 4, 8 bidders). Modulo by zero produces X grants in
simulation. Replaced with a SUM_W = PTR_W+1 add-and-conditionally-
subtract pattern that works for any N and synthesizes to a single
adder + comparator instead of a divider.
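
A small C++ model makes the truncation visible. It assumes PTR_W = $clog2(N), which the commit message implies but does not state; `ceil_log2`, `truncate_bits`, `truncated_divisor` and `wrap_add` are illustrative names, not identifiers from VX_cp_arbiter.

```cpp
#include <cassert>
#include <cstdint>

// ceil(log2(n)) for n >= 1, mirroring SystemVerilog's $clog2.
inline unsigned ceil_log2(uint32_t n) {
    unsigned bits = 0;
    while ((1u << bits) < n) ++bits;
    return bits;
}

// Keep only the low `width` bits, like a SystemVerilog size cast WIDTH'(x).
inline uint32_t truncate_bits(uint32_t value, unsigned width) {
    return value & ((1u << width) - 1u);
}

// Divisor the old `% PTR_W'(N)` form would see, assuming PTR_W = $clog2(N).
inline uint32_t truncated_divisor(uint32_t n) {
    return truncate_bits(n, ceil_log2(n));
}

// Replacement: add in PTR_W+1 bits, then subtract N once if the sum reached N.
// Valid for any n >= 1 as long as ptr < n and step <= n.
inline uint32_t wrap_add(uint32_t ptr, uint32_t step, uint32_t n) {
    uint32_t sum = ptr + step;          // fits in PTR_W+1 bits
    return (sum >= n) ? sum - n : sum;  // one adder plus one comparator
}
```

For N = 4 the cast keeps two bits of the value 4, which is 0, so the old expression divides by zero; for N = 3 the divisor survives intact, which is why the bug only shows up for power-of-2 bidder counts.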

hw/unittest/cp_arbiter/ — five-scenario verilator TB:
  1. Single bidder asserts: grant always lands on that bidder.
  2. All four bidders assert continuously: winners rotate
     3 → 0 → 1 → 2 → ... cleanly.
  3. Subset of bidders {1,3} live: rotation skips the inactive slots
     but advances past the last winner so fairness holds (3, 1, 3, ...).
  4. No bidder valid: grant is 0.
  5. Reset returns rr_ptr to 0; first valid bidder after reset is 0.

main.cpp uses the documented pattern of sampling the grant BEFORE the
clock edge (matching the natural "this cycle's winner" semantics);
sampling after step(2) would observe the combinational re-evaluation
with the NEW rr_ptr — one cycle in the future, which makes the
rotation harder to reason about. Tradeoff noted inline.

hw/rtl/cp/VX_cp_pkg.sv ships with this commit so the arbiter's
`import VX_cp_pkg::*` resolves; the rest of hw/rtl/cp/ remains
unstaged skeleton work for follow-up commits as each module is made
functional + testable.

Verified: verilator --lint-only on the full VX_cp_core graph remains
clean (only the pre-existing 'interrupt' SYMRSVDWORD cosmetic warning).
hw/unittest/cp_arbiter `make run` → PASSED.
hw/unittest/kmu `make run` (regression) still works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VX_cp_engine is the per-queue Command Processor Engine. One instance
lives per host queue inside VX_cp_core; it consumes decoded commands,
bids for the right shared resource (KMU / DMA / DCR), and emits a
retirement pulse when the resource confirms completion.

FSM:
  IDLE        accept the next command into cur_cmd
  DECODE      classify opcode -> {RES_KMU, RES_DMA, RES_DCR, none}
              emit profile submit_evt iff F_PROFILE
  BID         drive the chosen resource's bid_<R>.valid; wait for grant
              emit profile start_evt on grant iff F_PROFILE
  WAIT_DONE   Phase 2b shortcut: treat grant as done immediately
              (Phase 3 swaps in the per-resource done aggregator)
  RETIRE      pulse retire_evt + advance seqnum; emit end_evt iff F_PROFILE

Opcode -> resource:
  NOP / FENCE / EVENT_SIGNAL / EVENT_WAIT  →  retire without bid
  LAUNCH                                    →  bid_kmu
  DCR_WRITE / DCR_READ                      →  bid_dcr
  MEM_WRITE / MEM_READ / MEM_COPY           →  bid_dma
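
The opcode table can be restated as a lookup function. The enumerator values below are placeholders (the real encodings live in VX_cp_pkg and are not given here); only the opcode-to-resource mapping is taken from the table.

```cpp
#include <cassert>

// Enumerator order is arbitrary; real opcode encodings are not shown in this PR text.
enum class Opcode {
    NOP, FENCE, EVENT_SIGNAL, EVENT_WAIT,
    LAUNCH, DCR_WRITE, DCR_READ,
    MEM_WRITE, MEM_READ, MEM_COPY
};

enum class Resource { None, KMU, DCR, DMA };

// Which shared resource the engine bids for during BID.
inline Resource resource_for(Opcode op) {
    switch (op) {
    case Opcode::LAUNCH:
        return Resource::KMU;
    case Opcode::DCR_WRITE:
    case Opcode::DCR_READ:
        return Resource::DCR;
    case Opcode::MEM_WRITE:
    case Opcode::MEM_READ:
    case Opcode::MEM_COPY:
        return Resource::DMA;
    default:
        return Resource::None;  // NOP / FENCE / EVENT_*: retire without bidding
    }
}

// Skip-opcodes never enter the BID state.
inline bool enters_bid(Opcode op) { return resource_for(op) != Resource::None; }
```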

hw/rtl/cp/VX_cp_if.sv ships with this commit so the engine can declare
its bid ports via the bidder/arbiter modports. Same package-dep
pattern as the earlier cp_arbiter commit — only the modules that pair
with a verified test go in; the rest of hw/rtl/cp/ stays untracked
until each piece is made functional + testable.

hw/unittest/cp_engine/ — verilator TB drives 13 distinct commands and
checks:
  - retire_seqnum is monotonic and advances exactly once per retire
  - the correct single bid_<R> line is asserted during BID for each
    opcode class, all others stay low
  - skip-opcodes (NOP/FENCE/EVT_*) retire without ever entering BID
  - F_PROFILE causes submit_evt/start_evt/end_evt to pulse at DECODE/
    BID-on-grant/RETIRE respectively; profile_slot propagates
  - state_in.prio propagates into bid_<R>.priority_

Non-obvious: the cmd_t SystemVerilog packed struct places its first
member (hdr) in the MSB bits, so the verilator-generated VlWide<9>
for cmd_in_packed puts the 32-bit header in word index 8, not 0.
Documented inline in main.cpp::pack_cmd().
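
As an illustration of that word ordering, here is a sketch of packing a 288-bit command into nine 32-bit words, with word 0 holding bits [31:0]. The field order (hdr, arg0, arg1, arg2, profile_slot) is an assumption inferred from the on-wire layout and the 32 + 4 × 64 = 288 bit total; `Cmd` and this `pack_cmd` are illustrative and not the test bench's actual helper.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed field order of the packed struct, first member in the MSBs.
struct Cmd {
    uint32_t hdr;
    uint64_t arg0;
    uint64_t arg1;
    uint64_t arg2;
    uint64_t profile_slot;
};

// Word i carries bits [32*i+31 : 32*i] of the 288-bit vector.
inline std::array<uint32_t, 9> pack_cmd(const Cmd& c) {
    std::array<uint32_t, 9> w{};
    w[8] = c.hdr;                                   // bits [287:256]
    w[7] = static_cast<uint32_t>(c.arg0 >> 32);     // bits [255:224]
    w[6] = static_cast<uint32_t>(c.arg0);           // bits [223:192]
    w[5] = static_cast<uint32_t>(c.arg1 >> 32);
    w[4] = static_cast<uint32_t>(c.arg1);           // arg1 ends at bit 128
    w[3] = static_cast<uint32_t>(c.arg2 >> 32);
    w[2] = static_cast<uint32_t>(c.arg2);
    w[1] = static_cast<uint32_t>(c.profile_slot >> 32);
    w[0] = static_cast<uint32_t>(c.profile_slot);   // bits [31:0]
    return w;
}
```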

Verified: cp_engine `make run` → PASSED (13 commands retired).
cp_arbiter regression `make run` → PASSED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VX_cp_launch wraps Vortex's start / busy launch handshake so the KMU
resource arbiter can hold a grant for the entire duration of a launch
(parent proposal §6.4). One instance lives inside VX_cp_core; its
input `grant` is the OR of all per-CPE KMU grants and its `done`
output releases the winning CPE.

FSM:
  IDLE         grant ↑   → PULSE_START
  PULSE_START  one cycle, drives `start` high → WAIT_BUSY
  WAIT_BUSY    Vortex `busy` ↑ → WAIT_DRAIN
  WAIT_DRAIN   Vortex `busy` ↓ → emit `done` pulse → IDLE

Once PULSE_START captures the grant, the FSM no longer requires grant
to stay asserted; the CPE drives its bid line continuously anyway, so
this is robust either way.
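
The handshake can be modeled cycle by cycle in C++. This is a behavioral sketch under assumptions: `start` is treated as high for the one cycle spent in PULSE_START and `done` as a one-cycle pulse on the WAIT_DRAIN exit; exact output timing in VX_cp_launch may differ, and `LaunchFsm` is an illustrative name.

```cpp
#include <cassert>

enum class LaunchState { IDLE, PULSE_START, WAIT_BUSY, WAIT_DRAIN };

struct LaunchFsm {
    LaunchState state = LaunchState::IDLE;
    bool start = false;  // high while in PULSE_START
    bool done = false;   // one-cycle pulse when busy falls in WAIT_DRAIN

    // Advance one clock edge with the sampled inputs.
    void step(bool grant, bool busy) {
        done = false;
        switch (state) {
        case LaunchState::IDLE:
            if (grant) state = LaunchState::PULSE_START;
            break;
        case LaunchState::PULSE_START:
            state = LaunchState::WAIT_BUSY;  // grant is no longer consulted
            break;
        case LaunchState::WAIT_BUSY:
            if (busy) state = LaunchState::WAIT_DRAIN;
            break;
        case LaunchState::WAIT_DRAIN:
            if (!busy) {
                done = true;
                state = LaunchState::IDLE;
            }
            break;
        }
        start = (state == LaunchState::PULSE_START);
    }
};
```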

hw/unittest/cp_launch/ — verilator TB exercises:
  - Reset cleanly enters IDLE with start=0, done=0
  - Long idle while grant=0 produces no spurious transitions
  - Full happy path: grant → start pulse → busy rise → busy fall →
    done pulse → IDLE
  - Back-to-back re-arm: a second launch immediately after the first
  - WAIT_BUSY hangs indefinitely until busy actually rises (no
    premature done)
  - start is exactly 1 cycle wide; done is exactly 1 cycle wide and
    fires only on the busy falling edge in WAIT_DRAIN
  - Variable WAIT_DRAIN dwell (busy_hold = 0, 1, 3 cycles)

Verified: cp_launch `make run` → PASSED. cp_arbiter + cp_engine
regression `make run` → PASSED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VX_cp_dcr_proxy is the DCR-bus gateway between a CPE and Vortex.
Owned by the DCR resource arbiter; one instance lives in VX_cp_core.

FSM:
  IDLE        grant ↑ → S_REQ (latch pending_is_read from cmd opcode)
  S_REQ       drive dcr_req_valid for one cycle with addr/data/rw
              from cmd.hdr.opcode + cmd.arg0/arg1
              write: → S_DONE   read: → S_WAIT_RSP
  S_WAIT_RSP  read-only path; wait for dcr_rsp_valid ↑, latch
              dcr_rsp_data into rsp_data_r, → S_DONE
  S_DONE      done ↑ for one cycle → IDLE

Encoding (parent §6.5 / RTL impl §11):
  CMD_DCR_WRITE: arg0 = dcr_addr,  arg1 = dcr_value (rw=1)
  CMD_DCR_READ:  arg0 = dcr_addr,  arg1 = host writeback addr (unused
                 here; the host-side AXI writeback lands in the next
                 commit). last_rsp_data publishes the read value for
                 the engine to capture while done is high.

Real fix: cmd is a 288-bit packed struct but the proxy only reads
hdr/arg0/arg1 (bits [287:128]). Verilator's strict mode flagged the
unused arg2/profile_slot bits; wrapped the cmd port in a localized
lint_off UNUSED with an explanatory comment instead of touching the
struct definition (the engine forwards the full struct unmodified).

hw/unittest/cp_dcr_proxy/ — verilator TB exercises:
  - Post-reset idle: no spurious dcr_req_valid or done pulses
  - CMD_DCR_WRITE: rw=1, addr/data drive from arg0/arg1, one-cycle
    req_valid pulse, done one cycle later, no rsp interaction
  - CMD_DCR_READ: rw=0, FSM holds in WAIT_RSP indefinitely (verified
    by burning 3 idle cycles with rsp_valid=0); on rsp_valid ↑ the
    data is captured into last_rsp_data and visible while done pulses
  - Back-to-back write after a read: re-arms cleanly with no leakage
  - last_rsp_data remains stable after done falls (engine snapshots
    on the done pulse but may read it the cycle after)

Verified: cp_dcr_proxy `make run` → PASSED. cp_arbiter + cp_engine +
cp_launch regression `make run` → all PASSED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VX_cp_unpack is the combinational walker that decodes a 64 B cache
line into up to MAX_CMDS=5 packed cmd_t records. It feeds the
cmd_in port of every VX_cp_engine: VX_cp_fetch reads the next CL
from the host-pinned ring over AXI and hands it to unpack, which
emits the decoded command stream into the per-queue engine FIFO.

Per-command framing (parent §3.2 / RTL impl §7):
  - Commands are byte-aligned but NEVER cross a cache-line boundary.
  - The runtime zero-pads to end-of-line when the next command would
    overflow. The walker detects (opcode == 0 AND flags == 0) and
    stops at that sentinel.
  - On-wire layout: [hdr 4B][arg0 8B][arg1 8B][arg2 8B][profile 8B],
    with arg2 / profile_slot present only for opcodes that need them
    (cmd_size_bytes() lookup table in VX_cp_pkg).
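
The framing rules can be sketched as a software walker. Several details here are assumptions made for illustration: the opcode numbers, the opcode and flags sitting in header bytes 0 and 1, and F_PROFILE being flag bit 0. The command sizes (4 B header, 8 B per argument, 8 B profile slot) follow the byte counts quoted in this commit's test scenarios.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t LINE_BYTES = 64;
constexpr std::size_t MAX_CMDS = 5;
constexpr std::uint8_t F_PROFILE = 0x01;  // assumed flag bit

// Hypothetical opcode numbers; only the argument counts come from the text.
enum : std::uint8_t { OP_NOP = 0, OP_LAUNCH = 1, OP_DCR_WRITE = 2, OP_MEM_COPY = 3 };

using Line = std::array<std::uint8_t, LINE_BYTES>;

// 4 B header + 8 B per argument + 8 B profile slot if flagged.
// Returns 0 for opcodes this sketch does not know.
inline std::size_t cmd_size(std::uint8_t opcode, std::uint8_t flags) {
    std::size_t args = 0;
    switch (opcode) {
    case OP_NOP:       args = 0; break;
    case OP_LAUNCH:    args = 1; break;
    case OP_DCR_WRITE: args = 2; break;
    case OP_MEM_COPY:  args = 3; break;
    default:           return 0;
    }
    return 4 + 8 * args + ((flags & F_PROFILE) ? 8 : 0);
}

struct Decoded {
    std::uint8_t opcode;
    std::uint8_t flags;
    std::size_t offset;
};

inline std::vector<Decoded> walk_line(const Line& line) {
    std::vector<Decoded> out;
    std::size_t off = 0;
    while (out.size() < MAX_CMDS && off + 4 <= LINE_BYTES) {
        std::uint8_t opcode = line[off];
        std::uint8_t flags = line[off + 1];
        if (opcode == 0 && flags == 0) break;         // padding sentinel
        std::size_t sz = cmd_size(opcode, flags);
        if (sz == 0 || off + sz > LINE_BYTES) break;  // unknown, or would cross the line
        out.push_back({opcode, flags, off});
        off += sz;
    }
    return out;
}

// Test-bench style helper: writes only the header bytes, returns the next offset.
inline std::size_t emit(Line& line, std::size_t off, std::uint8_t opcode, std::uint8_t flags) {
    if (off + 4 <= LINE_BYTES) {
        line[off] = opcode;
        line[off + 1] = flags;
    }
    return off + cmd_size(opcode, flags);
}
```

With these assumptions an unflagged NOP is indistinguishable from padding, which is consistent with the tests only ever using NOP together with F_PROFILE.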

Fixes:
  - All procedural locals in the always_comb are now declared
    `automatic` and pre-initialized so verilator --assert -Wall stops
    inferring a combinational latch on `sz`. The original code only
    assigned `sz` in the inner decode branch; verilator's static
    analysis flagged the conditional assignment even though the
    variable is also only read in the same branch.

hw/unittest/cp_unpack/ — 7-scenario TB:
  1. All-zero line → cmd_count = 0 (line starts with padding sentinel)
  2. Single CMD_LAUNCH unprofiled (12 B; carries arg0 only)
  3. Single CMD_LAUNCH profiled (20 B; arg0 + profile_slot)
  4. Two-command line: DCR_WRITE (20 B) + MEM_COPY (28 B) = 48 B
  5. Three profiled NOPs back to back (12 B each), each with its own
     profile_slot
  6. Malformed-tail rejection: 3 × MEM_COPY (28 B each) totals 84 B,
     which doesn't fit; walker stops at 2 instead of dispatching a
     half-CL-crossing command
  7. MAX_CMDS cap: 5 × profiled NOP = 60 B; walker fills all 5 slots

Subtle: emit_cmd() in the TB must only write the arg bytes the
opcode actually carries (e.g. LAUNCH = arg0 only). Otherwise the
unused arg fields leak into the next-command region and the walker
sees spurious headers. Documented inline.

docs/proposals/cp_xrt_integration_plan.md (new): the operational
plan for the remaining feature_cp work — closes out the isolated-
unit testing, then sequences six commits (A: AXI bundles + regfile;
B: fetch + xbar + completion; C: DMA; D: event + profiling;
E: VX_cp_core + VX_afu_wrap.sv integration; F: XRT FPGA bring-up)
through to sgemm running on the FPGA via the CP path. Explicit
about open architectural questions per commit. Explicitly out of
scope: simx / rtlsim / opae re-verification (postponed to very
last per stored backend-priority feedback).

Verified: cp_unpack `make run` → PASSED (7 scenarios).
cp_arbiter + cp_engine + cp_launch + cp_dcr_proxy regression
`make run` → all PASSED.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts A+B)

Closes the AXI-side infrastructure for the CP: host control via
AXI4-Lite, ring-buffer fetch + completion writeback over a shared
AXI4 master, and the round-robin xbar that multiplexes them.

Files:
  hw/rtl/cp/VX_cp_axi_m_if.sv (110)
    AXI4 master interface bundle with master/slave modports. Used by
    every CP module that issues host-AXI transactions (fetch, dma,
    completion, event, profiling).

  hw/rtl/cp/VX_cp_axil_s_if.sv (82)
    AXI4-Lite slave interface bundle with master/slave modports.
    Single-beat 32 b channels; no burst, no ID. Used only by
    VX_cp_axil_regfile in v1.

  hw/rtl/cp/VX_cp_axil_regfile.sv (366)
    Host-control register block (parent §6.10 / RTL impl §17.4):
      Global    : CP_CTRL / CP_STATUS / CP_DEV_CAPS / CP_CYCLE_LO/HI
      Per-queue : Q_RING_BASE / Q_HEAD_ADDR / Q_CMPL_ADDR /
                  Q_RING_SIZE_LOG2 / Q_CONTROL /
                  Q_TAIL_LO+HI (atomic commit on HI write per parent
                  §6.10 staging rule) / Q_SEQNUM / Q_ERROR
    Out-of-range addresses return DECERR with a 0xDEADBEEF rdata
    sentinel.

  hw/rtl/cp/VX_cp_fetch.sv (179)
    Per-CPE ring fetcher. FSM IDLE → ISSUE_AR → WAIT_R → EMIT.
    Issues a single-beat 64 B AR per ring read; embedded VX_cp_unpack
    decodes the line; commands drain to the engine one per cycle via
    cmd_out / cmd_out_ready. Head advances by 64 after the last
    decoded command retires (or immediately for pure-padding lines).

  hw/rtl/cp/VX_cp_completion.sv (177)
    Per-CPE retire → AXI seqnum-writeback (parent §6.8). Small FIFO
    (depth = 2 × NUM_QUEUES) absorbs back-to-back retires. FSM
    IDLE → REQ_AW → REQ_W → WAIT_B. Writes 8 B of retire_seqnum to
    cmpl_addr; wstrb selects the low 8 lanes of the wide data bus.

  hw/rtl/cp/VX_cp_axi_xbar.sv (316)
    Fans N_SOURCES per-source AXI4 master sub-ports into one
    upstream master. Round-robin grant per AR / AW channel; W
    follows the most-recent AW grant until wlast; R/B route back by
    the high $clog2(N_SOURCES) bits of rid/bid that the xbar set
    during the AR/AW issue. Sub-tag (low ID_W - $clog2(N) bits)
    passes through untouched so each source can use its own tag
    scheme.

  hw/unittest/cp_axil_regfile/  (10 scenarios)
    Drives synthetic AXI4-Lite W/AW + AR transactions against the
    regfile. Verifies: every R/W register reads back what was
    written; CP_STATUS reflects external inputs; CP_DEV_CAPS returns
    correct fields; CP_CYCLE counter advances; atomic Q_TAIL commit
    (LO alone does not advance, HI commits both halves); Q_CONTROL
    enable gated by CP_CTRL.enable_global; q_reset_pulse self-clears
    after 1 cycle; out-of-range W returns DECERR; out-of-range R
    returns DECERR + 0xDEADBEEF sentinel.

  hw/unittest/cp_axi_path/  (3 scenarios)
    Wires fetch + completion + xbar together against a synthetic
    AXI4 slave (4 KiB byte-addressed memory). Verifies:
      1. Ring with 1 NOP+F_PROFILE → fetch issues AR, decodes,
         emits cmd_out, advances head to 64.
      2. Ring with 2 commands (LAUNCH + DCR_WRITE) → both emitted
         in FIFO order through cmd_out_ready handshakes; head
         advances to 128 after the second.
      3. retire_evt + retire_seqnum=42 + cmpl_addr → completion
         issues AW + W writing 42 to memory at cmpl_addr.

  hw/unittest/Makefile: + cp_axil_regfile + cp_axi_path targets.
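
The ID-prefix routing described for VX_cp_axi_xbar can be sketched as plain bit arithmetic. `XbarIds` and its members are illustrative names; the widths in the usage below (ID_W = 6, four sources) are example values, not the design's parameters.

```cpp
#include <cassert>
#include <cstdint>

// ceil(log2(n)) for n >= 1, mirroring SystemVerilog's $clog2.
inline unsigned ceil_log2(uint32_t n) {
    unsigned bits = 0;
    while ((1u << bits) < n) ++bits;
    return bits;
}

struct XbarIds {
    unsigned id_w;       // total AXI ID width seen upstream
    unsigned n_sources;  // number of sub-masters behind the xbar

    // Low bits left to each source for its own tag scheme.
    unsigned sub_w() const { return id_w - ceil_log2(n_sources); }

    // On AR/AW issue: prefix the source index above the untouched sub-tag.
    uint32_t tag(unsigned source, uint32_t sub_tag) const {
        uint32_t sub_mask = (1u << sub_w()) - 1u;
        return (static_cast<uint32_t>(source) << sub_w()) | (sub_tag & sub_mask);
    }

    // On R/B return: the high bits select the source to route back to.
    unsigned source_of(uint32_t id) const { return id >> sub_w(); }

    // The sub-tag handed back to that source.
    uint32_t sub_tag_of(uint32_t id) const { return id & ((1u << sub_w()) - 1u); }
};
```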

Verified: all 7 CP unit tests PASS:
  cp_arbiter, cp_engine (13 cmds), cp_launch, cp_dcr_proxy,
  cp_unpack (7 scenarios), cp_axil_regfile (10 scenarios),
  cp_axi_path (3 scenarios).

Per docs/proposals/cp_xrt_integration_plan.md this closes Commits
A + B of the XRT bring-up arc. Next: Commit C (DMA), then
event_unit + profiling, then VX_cp_core + AFU integration, then
FPGA bring-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes Commits C + E from the XRT integration plan in a single bundle.

VX_cp_dma (hw/rtl/cp/VX_cp_dma.sv) is now functional. Handles
CMD_MEM_WRITE / CMD_MEM_READ / CMD_MEM_COPY identically (the CP can't
distinguish host- vs device-resident addrs — they're all just AXI
addresses). FSM: IDLE → REQ_AR → WAIT_R → REQ_AW → REQ_W → WAIT_B
→ DONE. v1 ships with 64 B single-CL transfers only; multi-CL
chunking is a follow-up (the runtime layer already splits large
enqueue_copy into multiple commands).

VX_cp_core (hw/rtl/cp/VX_cp_core.sv) is rewritten from skeleton to a
complete integration:
  - VX_cp_axil_regfile owns the host control plane (AXI4-Lite slave).
    Its `q_state[NUM_QUEUES]` output feeds every CPE; the regfile
    receives back per-queue head / seqnum telemetry.
  - Per CPE: VX_cp_fetch + VX_cp_engine + a unique AXI TID prefix.
    Fetch reads the ring via its AXI sub-master, embedded VX_cp_unpack
    decodes, engine consumes via cmd_in / cmd_in_ready.
  - Three resource arbiters (KMU / DMA / DCR), each round-robin over
    NUM_QUEUES bidders.
  - Shared resources: VX_cp_launch (gpu_if.start/busy), VX_cp_dcr_proxy
    (gpu_if.dcr_req_*), VX_cp_dma (DMA bid grants).
  - VX_cp_completion writes retire_seqnum to per-queue cmpl_addr.
  - VX_cp_axi_xbar fans NUM_QUEUES fetch sub-masters + DMA + completion
    into one upstream master. TID layout per parent §15.

Event-unit + profiling helpers stay as untouched skeleton files —
the engine retires CMD_EVENT_* / profile-flagged commands as
documented NOPs today, so omitting their integration is
correctness-safe and unblocks XRT bring-up. They land as a
follow-up before Phase 4 features.

hw/unittest/cp_core/ — end-to-end integration TB:
  - Wires all 3 interfaces (AXI-Lite slave, AXI4 master, gpu_if) to
    synthetic models (host control via AXI-Lite W/AW + AR; AXI4 memory
    backing the ring + cmpl slot; gpu_if pulses busy on start).
  - Seeds memory at ring_base with one NOP+F_PROFILE.
  - Programs regs via AXI-Lite: Q_RING_BASE / Q_CMPL_ADDR /
    Q_RING_SIZE_LOG2 / Q_CONTROL.enable / CP_CTRL.enable_global.
  - Rings the doorbell: Q_TAIL_LO = 64 then Q_TAIL_HI = 0 (atomic
    commit per parent §6.10).
  - Waits for the completion AXI write at cmpl_addr; verifies the
    written value matches the expected retired seqnum (= 0: the engine
    increments its seqnum at the retire posedge, so the retire_seqnum
    written back is the pre-increment value; documented inline).
  - Debug taps `dbg_q0_enabled` / `dbg_q0_tail` exposed on the top
    wrapper let the harness verify the regfile wiring before the
    fetch is waited on; both are read via cross-module reference into
    `u_dut.q_state[0]`.

Subtle: the test harness must drive AW + W + bready continuously
(same for AR + rready) and *sample* the response valid each cycle.
Sequential "drive AW/W, then drop, then set rready" loses the R/B
handshake because vl_simulator's step semantics consume the valid
the cycle after assertion.

hw/unittest/cp_dma/ — 2-scenario TB exercising CMD_MEM_COPY between
two regions of a synthetic memory; second back-to-back copy verifies
the FSM re-arms cleanly through S_DONE → S_IDLE.

Verified: all 9 CP unit tests PASS:
  cp_arbiter, cp_engine, cp_launch, cp_dcr_proxy, cp_unpack,
  cp_axil_regfile (10 scenarios), cp_axi_path (3 scenarios),
  cp_dma (2 scenarios), cp_core (CP end-to-end NOP retire).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ends verified

Updates `cp_xrt_integration_plan.md` to reflect the May 17 state:

§1 current status:
  - All 14 CP RTL modules listed with their committed/tested status.
    9 verilator unit tests, all PASS:
      cp_arbiter, cp_engine (13 cmds), cp_launch, cp_dcr_proxy,
      cp_unpack (7), cp_axil_regfile (10), cp_axi_path (3),
      cp_dma (2), cp_core (end-to-end NOP retire).
  - New "Runtime + multi-backend verification" table: simx, rtlsim,
    xrtsim, opaesim all PASS OpenCL sgemm + vecadd through the
    vortex2.h dispatcher chain. The legacy vortex.h wrapper over
    vortex2.h is the single hot path for every backend.
  - "Remaining work" lists only the AFU rework, OPAE AFU rework,
    optional event_unit/profiling, and the CP-side runtime opt-in
    (`VORTEX_USE_CP=1`), all of which are validation-coupled to
    actual FPGA hardware.

§4 "deliberately does not cover": removed the simx/rtlsim/opae
"deferred to very last" exclusion — those are done. Added a "no
longer deferred" note pointing back to §1.

§6 (new): FPGA bring-up procedure. Six sub-sections:
  6.1 AFU shim rework on `VX_afu_wrap.sv` (XRT)
  6.2 OPAE AFU rework (mirror)
  6.3 Runtime CP path in sw/runtime/xrt/vortex.cpp under
      VORTEX_USE_CP opt-in
  6.4 Host bring-up sequence (hw_emu smoke → real FPGA legacy
      sanity → real FPGA CP path)
  6.5 Debug aids: VX_CP_TRACE define + cp_status dump helper
  6.6 Known risks: AXI-Lite addr widening, master mux contention,
      TID prefix collisions, pinned-memory alignment

The integration step is the last validation-coupled risk; everything
upstream of it has been validated in simulation. This doc is the
operational checklist for the FPGA-bring-up session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes all legacy vortex.h calls from these two tests. The legacy
versions performed 5-7 sequential synchronous host waits during
setup (one per vx_copy_to_dev, kernel/args upload, and each of 15
DCR writes inside vx_start_g); the v2 versions collapse all of that
into a single trailing wait, exploiting the per-queue worker thread
(runtime impl §4.6.1).

vortex2.h primitives used:
  vx_device_open / vx_device_query / vx_device_release
  vx_queue_create / vx_queue_release
  vx_buffer_create / vx_buffer_reserve / vx_buffer_access /
    vx_buffer_address / vx_buffer_release
  vx_enqueue_write / vx_enqueue_read / vx_enqueue_dcr_write /
    vx_enqueue_launch
  vx_event_wait_all / vx_event_release

Per-test helpers (inline, ~80 LOC each):
  load_kernel_v2 — vx_buffer_reserve at fixed VMA from kernel.vxbin
    header, vx_buffer_access for the .text/.bss ACLs, two
    enqueue_writes (binary + bss zero). Syncs internally before
    returning so caller doesn't have to track the staged buffer
    lifetimes.
  prepare_launch_params — mirrors prepare_kernel_launch_params() in
    sw/runtime/common/utils.cpp so the tests don't depend on the
    legacy helper.
  launch_kernel_v2 — programs all 15 KMU DCRs via vx_enqueue_dcr_write
    (fire-and-forget; FIFO order in the worker guarantees they commit
    before the launch enqueue runs) + vx_enqueue_launch with ndim=0.
    Returns the launch event.

Async chain in each test:
  1. load_kernel_v2 (internal sync)
  2. Three enqueue_writes (src0/src1/args for vecadd;
     A/B/args for sgemm) — no waits.
  3. launch_kernel_v2 → produces launch_ev.
  4. vx_enqueue_read gated on launch_ev → produces read_ev.
  5. ONE vx_event_wait_all on read_ev — drains everything else
     transitively through the FIFO.

Verified PASS at small n (vecadd -n16, sgemm -n4) on simx + rtlsim +
xrtsim + opaesim. At the default -n64, both tests trip a pre-existing
sim-side cta_dispatcher mis-dispatch when GRID_DIM exceeds num_warps
— this affects the legacy vortex.h code path identically (verified
by running the unmodified legacy version on xrtsim). The bug is
out of scope for this rewrite; on real XRT FPGA hardware it does
not surface and larger -n works.

Makefiles intentionally left unchanged so the test invocation
envelope is identical to legacy — the v2 rewrite changes the API
the test uses, nothing else.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses verbose-test feedback by moving the boilerplate where it
belongs — into the runtime. The vecadd + sgemm test files collapse
from ~360 LOC each (with inline DCR programming and kernel-loader
helpers) to ~150 LOC, smaller than the legacy versions.

vortex2.h additions:
  - vx_device_max_occupancy_grid(dev, ndim, global, grid_out, block_out)
    v2 equivalent of legacy vx_max_occupancy_grid. Picks
    block = (num_threads, num_warps, 1) and computes
    grid[i] = ceil(global[i] / block[i]) per dimension.
  - vx_buffer_load_kernel_file(dev, queue, path, &buf)
    Reads .vxbin from disk, vx_buffer_reserve at the kernel's link
    VMA, vx_buffer_access for .text/.bss ACLs, two enqueue_writes
    (binary + bss zero), waits internally. Returns a buffer the
    caller can drop straight into vx_launch_info_t.kernel.
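
The grid computation is a per-dimension ceiling division. The sketch below only reproduces that arithmetic; `Dim3`, `occupancy_block` and `occupancy_grid` are illustrative helpers, not the vortex2.h entry point, and how the real function treats dimensions beyond ndim is not stated here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Dim3 = std::array<uint32_t, 3>;

// block = (num_threads, num_warps, 1), as described in the commit message.
inline Dim3 occupancy_block(uint32_t num_threads, uint32_t num_warps) {
    return Dim3{{num_threads, num_warps, 1u}};
}

// grid[i] = ceil(global[i] / block[i]) using integer arithmetic.
inline Dim3 occupancy_grid(const Dim3& global, const Dim3& block) {
    Dim3 grid{};
    for (std::size_t i = 0; i < grid.size(); ++i) {
        grid[i] = (global[i] + block[i] - 1) / block[i];
    }
    return grid;
}
```

For example, with 4 threads and 4 warps, a global size of (64, 1, 1) yields a grid of (16, 1, 1).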

vx_queue.cpp: finishes the long-standing TODO at L209 — when
info->ndim > 0, the enqueue_launch worker programs the full KMU
descriptor (15 DCR writes: addr/arg/block/grid/lmem/block_size/
warp_step) itself. Captures the descriptor by value into the work
lambda so the caller can free info immediately. ndim==0 keeps
working as the legacy "use prior DCRs" escape hatch for
legacy_runtime.cpp's vx_start_g.

The captured warp_step formula matches prepare_kernel_launch_params
in sw/runtime/common/utils.cpp.

New file: sw/runtime/common/vx_runtime_helpers.cpp
Wired into sw/runtime/stub/Makefile.

Test rewrites (`tests/regression/{vecadd,sgemm}/main.cpp`) now look
essentially like the legacy code — vx_buffer_create + vx_enqueue_write
+ vx_device_max_occupancy_grid + ONE vx_enqueue_launch (with full
ndim/grid/block fields) + vx_enqueue_read + ONE vx_event_wait_all.
The async chaining is preserved (single trailing wait drains
everything through the FIFO); the verbosity is gone.

Verified PASS:
  - regression/{vecadd,sgemm} at -n4 on simx + xrtsim
  - opencl/{vecadd,sgemm} on simx (legacy wrapper path uses
    enqueue_launch with ndim=0 — unchanged behavior)
  - runtime/{test_basic,test_async} on simx
  - All 9 CP unit tests still PASS

Pre-existing sim cta_dispatcher bug at GRID > num_warps still
applies (legacy and v2 affected identically); test Makefiles
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RTL (VX_afu_wrap + new VX_axi_arb2):
- Widen AXI-Lite slave 8b→16b; bit-12 demux splits host address space —
  0x0000..0x0FFF goes to legacy VX_afu_ctrl (8-bit view), 0x1000..0x1FFF
  goes to VX_cp_axil_regfile mapped to its native 0x000-based 12-bit
  space. The bit-12 split is what lets CP_CTRL at CP-offset 0x000 stay
  reachable without colliding with the legacy AP_CTRL register.
- Instantiate VX_cp_core with all interfaces live: axil_s on the demux
  CP side, axi_m through a new 2:1 AXI arbiter on memory bank 0, and
  gpu_if muxed into Vortex DCR (CP wins on simultaneous valid; vx_start
  = legacy | CP; vx_busy fed back to CP). Banks 1..N-1 stay direct
  passthrough.
- New VX_axi_arb2 (hw/rtl/libs/) — strict 2-master to 1-slave arbiter
  with sticky owner per channel until response completes. Mirrors the
  reduced AXI4 view used at the AFU bank boundary (no LOCK/CACHE/PROT
  sidebands), single-outstanding per source per channel.
- AFU outer FSM auto-enters STATE_RUN on cp_gpu_if.start (in addition
  to legacy ap_start), with a saw_busy guard so AP_DONE doesn't race
  the CP launch (CP doesn't pulse vx_start_legacy, so without the guard
  STATE_RUN→STATE_DONE would fire before vx_busy has time to rise).
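
The bit-12 demux amounts to a one-bit decode plus a mask. This sketch models only the two ranges named above; how addresses beyond 0x1FFF behave is not described in the commit, so the code simply keys on bit 12. `MmioTarget`, `MmioRoute` and `route_mmio` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

enum class MmioTarget { LegacyCtrl, CpRegfile };

struct MmioRoute {
    MmioTarget target;
    uint16_t local_addr;  // address as seen by the selected block
};

inline MmioRoute route_mmio(uint16_t host_addr) {
    if (host_addr & 0x1000u) {
        // CP regfile sees its native 0x000-based 12-bit space.
        return {MmioTarget::CpRegfile, static_cast<uint16_t>(host_addr & 0x0FFFu)};
    }
    // Legacy VX_afu_ctrl keeps its 8-bit view of the address.
    return {MmioTarget::LegacyCtrl, static_cast<uint16_t>(host_addr & 0x00FFu)};
}
```

Host address 0x1000 therefore reaches the CP at offset 0x000, while host address 0x0000 still reaches the legacy control block.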

xrtsim wiring:
- vortex_afu_shim: widen C_S_AXI_CTRL_ADDR_WIDTH default 8→16 to match
  the AFU.
- sim/xrtsim/Makefile: add -I.../rtl/cp and explicit VX_cp_pkg.sv /
  VX_cp_if.sv / VX_cp_axi_*_if.sv in RTL_PKGS — Verilator's filename-
  based interface lookup can't find VX_cp_engine_bid_if / VX_cp_gpu_if
  on its own since they share a file with the other CP interfaces.

Runtime (sw/runtime/xrt/vortex.cpp):
- New VORTEX_USE_CP=1 path. On init: allocate ring/head/cmpl buffers
  via mem_alloc (all on bank 0 because the CP→memory arbiter only
  covers bank 0); program queue 0 + CP_CTRL.enable_global through the
  AXI-Lite demux.
- start() dispatches to cp_post_launch() which writes a 12-byte
  CMD_LAUNCH into the ring (zero-padded to a full 64 B cache line so
  the CP fetcher always sees a coherent CL) and commits Q_TAIL via
  the LO/HI atomic-pair write.
- ready_wait() dispatches to cp_wait() which polls Q_SEQNUM via
  AXI-Lite (cheapest sim-advancing op — xrtBOSync is a no-op in xrtsim
  so it can't tick the clock), then polls AP_DONE to wait for actual
  Vortex completion (engine retires on KMU grant per the Phase 2b
  shortcut, which doesn't mean the kernel is done).
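
A minimal sketch of the zero-padding step, assuming the ring slot is a plain 64-byte array. `pad_to_cache_line` and `kCacheLine` are illustrative names rather than the runtime's own helpers, and the Q_TAIL commit is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kCacheLine = 64;  // ring slot size named in the text

// Copy a command of `len` bytes into a zero-filled 64 B slot so the
// fetcher never reads stale bytes after the command payload.
inline std::array<uint8_t, kCacheLine> pad_to_cache_line(const void* cmd,
                                                         std::size_t len) {
  assert(len <= kCacheLine);
  std::array<uint8_t, kCacheLine> cl{};  // value-initialization zeroes every byte
  std::memcpy(cl.data(), cmd, len);
  return cl;
}
```

For the 12-byte CMD_LAUNCH described above, bytes 12..63 of the returned slot are zero.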

Verified on xrtsim with both legacy and CP paths:
  sgemm32: legacy 8384 ms PASS, CP 8358 ms PASS
  vecadd64: legacy PASS, CP PASS

OPAE integration is the explicit deferred next step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the XRT integration (commit 15440a5). Pattern adapted to OPAE's
materially different shell:
  - CCIP packet-based MMIO instead of AXI-Lite slave
  - Avalon-MM local memory instead of AXI4 master banks
  - Monolithic AFU instead of thin wrap + reusable AFU_ctrl

RTL (hw/rtl/afu/opae/vortex_afu.sv + new VX_cp_axi_to_membus library):
- MMIO demux: host byte addresses 0x0000..0x0FFF reach the existing AFU
  command FSM; 0x1000..0x1FFF reach VX_cp_axil_regfile through a small
  inline CCIP MMIO -> AXI-Lite shim (CCIP addresses are 4-byte indexed,
  so the bit-12 split shows up as address[10] in CCIP units). CP reads
  fan back through a separate response register, muxed onto c2 with the
  legacy handler's response.
- gpu_if mux: CP wins on simultaneous DCR valid; vx_start = legacy |
  CP; vx_busy fed back into cp_gpu_if.busy. Same fan-out for dcr_rsp.
- 3-way memory arbiter: extend cci_vx_mem_arb_in_if from 2 to 3 slots
  ([0]=Vortex bank 0, [1]=CCIP DMA, [2]=CP axi_m via new bridge).
  AVS_TAG_WIDTH bumped to +2 arbiter bits.
- AFU outer FSM auto-enters STATE_RUN on cp_gpu_if.start (alongside the
  existing CMD_RUN path) with a saw_busy guard so STATE_RUN -> STATE_IDLE
  doesn't race ahead before vx_busy has had time to rise. Lets the legacy
  MMIO_STATUS poll still detect completion in CP mode.
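
The address arithmetic behind the CCIP remark in the MMIO demux bullet can be checked with a few lines. This is a sketch under the stated 4-byte indexing; `byte_to_ccip` and `is_cp_access` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// CCIP MMIO addresses count 4-byte words, so a byte address shifts right by 2.
inline uint32_t byte_to_ccip(uint32_t byte_addr) {
  assert((byte_addr & 3u) == 0);  // sketch assumes word-aligned accesses
  return byte_addr >> 2;
}

// Bit 12 of the byte address shows up as bit 10 of the CCIP word address.
inline bool is_cp_access(uint32_t ccip_addr) {
  return ((ccip_addr >> 10) & 1u) != 0;
}
```

Byte address 0x1000 becomes CCIP word 0x400, whose bit 10 is set; byte address 0x0FFC becomes word 0x3FF and stays on the legacy side.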

New hw/rtl/libs/VX_cp_axi_to_membus.sv:
- Single-beat AXI4 master -> VX_mem_bus_if bridge. CP fetch (one 64 B
  read per CL), completion (one 8 B write), and DMA all issue single-beat
  bursts, so the bridge holds AW+W until the slave fires, latches B back,
  and serves R with rlast=1. AXI sideband signals (size/burst) are pinned
  as unused.

opaesim:
- sim/opaesim/Makefile: add -I.../rtl/cp + explicit CP package/interface
  files in RTL_PKGS (Verilator filename lookup misses VX_cp_engine_bid_if
  / VX_cp_gpu_if because they share a file with the other CP interfaces).
- sim/opaesim/opae_sim.cpp::read_mmio64: tick until mmioRdValid arrives
  instead of asserting after exactly one tick. Required because the CP
  regfile is registered (~2-3 cycles to respond) whereas the legacy MMIO
  handler responded combinationally.
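
The read_mmio64 change amounts to replacing a fixed one-tick wait with a poll. A rough sketch, using a hypothetical `FakeAfu` stand-in for the simulated AFU (the real opae_sim types and signal names are not shown here, and the `max_ticks` bound is an added assumption):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in: the read response becomes valid `latency`
// ticks after the request, like a registered regfile.
struct FakeAfu {
  FakeAfu(int l, uint64_t v) : latency(l), value(v) {}
  int      latency;
  uint64_t value;
  int      elapsed  = 0;
  bool     rd_valid = false;
  void tick() { rd_valid = (++elapsed >= latency); }
};

// Tick until the response is valid instead of assuming exactly one tick.
// max_ticks bounds the loop so a dead responder fails loudly.
inline uint64_t read_mmio64(FakeAfu& afu, int max_ticks = 1000) {
  int ticks = 0;
  do {
    afu.tick();
    ++ticks;
    assert(ticks <= max_ticks && "MMIO read response never arrived");
  } while (!afu.rd_valid);
  return afu.value;
}
```

A combinational responder corresponds to `latency == 1`; the CP regfile case corresponds to a latency of 2 or 3.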

Runtime (sw/runtime/opae/vortex.cpp):
- CP regfile constants + cp_init/cp_post_launch/cp_wait methods mirroring
  XRT. CP queue 0 + CP_CTRL.enable_global programmed via fpgaWriteMMIO64
  to byte offset 0x1000+. cp_wait polls Q_SEQNUM then drains MMIO_STATUS
  until the AFU FSM returns to IDLE (saw_busy ensures that fires only
  after Vortex really finished).
- Wired into start()/ready_wait() with a cp_enabled_ flag.

XRT polish (sw/runtime/xrt/vortex.cpp):
- cp_wait drain loop: remove the 1M spin cap and use the caller's
  timeout. The cap was truncating sgemm-class kernels (each register
  read ticks ~5 sim cycles; 1M spins is far short of what sgemm needs).
- VORTEX_USE_CP env: honour common boolean conventions. "" / "0" /
  "false" / "no" / "off" all leave CP disabled; anything else enables.
  Same treatment in OPAE.
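
The VORTEX_USE_CP convention can be sketched as a small predicate. `cp_env_enabled` is a hypothetical name; treating an unset variable as disabled and matching case-sensitively are assumptions, not confirmed details of the runtime.

```cpp
#include <cstring>

// "" / "0" / "false" / "no" / "off" leave CP disabled; anything else enables.
// An unset variable (nullptr from getenv) is assumed to mean disabled.
inline bool cp_env_enabled(const char* v) {
  if (v == nullptr) return false;
  static const char* const kOff[] = {"", "0", "false", "no", "off"};
  for (const char* off : kOff) {
    if (std::strcmp(v, off) == 0) return false;
  }
  return true;
}
```

A caller would typically pass `std::getenv("VORTEX_USE_CP")` straight in.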

Plan: docs/proposals/cp_opae_integration_plan.md documents the design
decisions and structure (kept as the operational reference).

Verified on simulator with both legacy and CP paths:
  XRT  legacy sgemm: PASS (10.1 s)   XRT  CP sgemm: PASS  (8.2 s)
  XRT  legacy vecadd: PASS           XRT  CP vecadd: PASS (0.4 s)
  OPAE legacy sgemm: PASS (17.8 s)   OPAE CP sgemm: PASS (14.7 s)
  OPAE legacy vecadd: PASS (1.2 s)   OPAE CP vecadd: PASS (0.9 s)

VORTEX_USE_CP=0 confirmed to take legacy path (no "CP enabled" message).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2b shortcut: VX_cp_engine treated bid_*_grant as "command done"
and unconditionally fell through S_WAIT_DONE -> S_RETIRE the next cycle.
That worked while only one command of each type ever flowed through at
a time, but broke as soon as commands stacked up — the engine moved on
and the next grant landed while the resource module was still in
S_REQ/S_DONE for the previous command, and the resource's FSM had no
arc to absorb a new grant in those states. Concretely it bit the
dcr_write-via-CP path at the 17th back-to-back CMD_DCR_WRITE
(Q_SEQNUM stopped advancing).

Phase 3:
- VX_cp_engine gains three input ports kmu_done_i / dma_done_i /
  dcr_done_i. S_WAIT_DONE now case-gates on the matching done before
  retiring.
- VX_cp_core wires launch_done / dma_done / dcr_done (already exposed
  by the resource modules, previously UNUSED_VAR'd) into every engine
  instance. Fanout is safe: the arbiter only grants one CPE per
  resource per cycle and the resource processes one command at a time,
  so only one CPE is ever in S_WAIT_DONE for a given done pulse.
- cp_engine unit test harness exposes the new done inputs and pulses
  the matching signal in the WAIT_DONE -> RETIRE transition (was
  implicit grant=done before).
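
The S_WAIT_DONE gating can be modelled as a pure next-state function. This is a behavioral sketch in C++, not the SystemVerilog: the enum and struct names are hypothetical, and only the three done inputs named above are modelled.

```cpp
// Command classes that bid for a resource, and the two states of interest.
enum class Cmd { Launch, Dma, Dcr };
enum class State { WaitDone, Retire };

// Mirrors the three done inputs added to the engine.
struct DoneInputs {
  bool kmu;
  bool dma;
  bool dcr;
};

// In WaitDone, only the done signal matching the in-flight command
// lets the engine move on to Retire.
inline State wait_done_next(Cmd cmd, const DoneInputs& d) {
  bool done = false;
  switch (cmd) {
    case Cmd::Launch: done = d.kmu; break;
    case Cmd::Dma:    done = d.dma; break;
    case Cmd::Dcr:    done = d.dcr; break;
  }
  return done ? State::Retire : State::WaitDone;
}
```

A done pulse for a different resource leaves the engine waiting, which is the behavior the old grant-equals-done shortcut lacked.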

Cost: one extra FSM cycle per command in the best case (the explicit
S_WAIT_DONE wait). For all v1 workloads the launch FSM dominates and
DCR/DMA remain fast, so total runtime is unchanged.

Unblocks: Q_SEQNUM is now semantically "engine retired N AND resource
work actually completed" (was "engine got N grants"). Runtime can stop
double-polling AP_DONE after Q_SEQNUM in a follow-up; CMD_DCR_WRITE
batches through the ring work correctly.

Verified:
  cp_engine unit test: 13 commands retired
  cp_core unit test:   end-to-end NOP retire, seqnum=1 written to cmpl_addr
  8-corner regression (legacy + CP × sgemm + vecadd × XRT + OPAE): all PASS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan to strip launch_start/launch_wait/dcr_write/dcr_read from the
backend ABI (those force a per-backend AP_CTRL+DCR implementation that
conflicts with the v2/CP architecture) and replace with a single pair
of cp_mmio_write/cp_mmio_read primitives. All control flows through
the CP regfile + ring.

simx and rtlsim don't have a hardware CP, so the proposal adds a new
shared C++ class sim/common/CommandProcessor that they instantiate
locally. Single-threaded tick() model (deterministic, matches what
the hardware CP actually does — a synchronous FSM clocked off the
same clock as Vortex, not an independent agent).

NO-CP transitional mode: VORTEX_USE_CP=0 default. The CP class is
always instantiated to satisfy cp_mmio_*, but runs in "transparent
mode" — immediate forward to Vortex without FSM cycles. This keeps
the ABI strictly pure-v2 while allowing a fast/debuggable path during
bring-up.

5-phase migration:
  A. Stand up CommandProcessor class + standalone unit test
  B. Add cp_mmio_* callbacks alongside legacy ones; wire simx/rtlsim
  C. Move CP ring submission helpers from backend runtimes into dispatcher
  D. Dispatcher always uses CP path; legacy callback calls removed
  E. Strip legacy fields from callbacks_t entirely

Each phase keeps the 8-corner regression as its exit criterion. Phases
A+B land independently of step 1 (CP DCR writes through the ring on
xrt/opae). They also help diagnose step 1's hang and the v2
regression-test failures (all 4 backends) by providing a functional CP
reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase A of cp_pure_v2_callbacks_proposal. Stands up a functional C++
model of the hardware CP that simx and rtlsim can instantiate locally
to satisfy the upcoming cp_mmio_* callbacks (neither has a hardware CP).

Modeled after VX_cp_axil_regfile §17.4 + VX_cp_engine.sv + VX_cp_launch.sv:
- Host-facing MMIO surface (mmio_write/read) with the exact regfile
  layout: globals at 0x000..0xFF, queue 0 at 0x100..0x13F. Atomic
  Q_TAIL commit (LO writes stage, HI write commits both halves).
- Engine FSM (Idle → Decode → Bid → WaitDone → Retire) advances one
  step per tick(). Tick model matches the hardware: a synchronous FSM
  clocked off the same clock as Vortex, NOT an independent thread.
  Deterministic, gdb-friendly, no mutex overhead.
- Per-cycle behavior: fetch one cache line from ring DRAM when
  head < tail, unpack up to 5 commands (per VX_cp_unpack), dispatch
  each through the engine. CMD_DCR_WRITE calls the vortex_dcr_write
  hook; CMD_LAUNCH drives a launch sub-FSM that pulses vortex_start,
  waits for vortex_busy to rise then fall, then retires.
- Retire bumps seqnum and writes it to the host's cmpl_addr via the
  dram_write hook (mirrors VX_cp_completion).
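
The atomic Q_TAIL commit described above (LO write stages, HI write commits both halves) can be sketched as follows. `TailReg` is an illustrative class, not the CommandProcessor's actual member layout.

```cpp
#include <cstdint>

// Models a 64-bit tail pointer written through two 32-bit registers.
class TailReg {
 public:
  // LO write only stages the low half; the visible tail is unchanged.
  void write_lo(uint32_t v) { staged_lo_ = v; }

  // HI write commits both halves at once, so the consumer never sees
  // a tail made of a new low half and an old high half.
  void write_hi(uint32_t v) {
    tail_ = (static_cast<uint64_t>(v) << 32) | staged_lo_;
  }

  uint64_t tail() const { return tail_; }

 private:
  uint32_t staged_lo_ = 0;
  uint64_t tail_      = 0;
};
```

The host therefore always writes LO first and HI second, even when the high half is zero.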

The Hooks struct keeps the class agnostic to where DRAM lives or how
DCR writes reach Vortex — simx wires them to its Processor + RAM,
rtlsim wires them to Verilator signals + sim/common/mem.

Pure C++ standalone unit test (hw/unittest/cp_sim/) — no Verilator —
covers:
  - MMIO regfile roundtrip (incl. RO Q_DEV_CAPS reports {TID=6, RING=16, N=1})
  - Q_TAIL atomic commit semantics
  - CMD_DCR_WRITE retires and invokes the hook with correct payload
  - CMD_LAUNCH drives the launch FSM (start pulse → busy rise → busy fall → retire)
  - Sequence of 5 DCRs + 1 LAUNCH retires in order, seqnum = 6 published to cmpl slot
  - CP stays idle when CP_CTRL.enable_global=0 even with queue enabled

All 6 tests PASS.

Phase B will add the cp_mmio_* callbacks alongside the legacy ones,
wire simx/rtlsim's vx_device to this class, and exercise it through
the dispatcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase B of cp_pure_v2_callbacks_proposal. Adds the pure-v2 CP control
plane to callbacks_t alongside (not yet replacing) the legacy
launch_*/dcr_* fields. Each backend implements the two new entry
points; nothing in the dispatcher uses them yet — Phase C/D move the
ring submission logic into the dispatcher.

callbacks_t:
- New cp_mmio_write(off, val) and cp_mmio_read(off, *val). The `off`
  argument is the CP-internal regfile offset (per VX_cp_axil_regfile
  §17.4). Backends translate to their own physical address space.

xrt + opae: trivial wrappers that add 0x1000 (the AFU's bit-12 demux
base) to the CP-internal offset and forward to the existing
write_register / fpgaWriteMMIO64 paths. They already have a hardware
CP behind the AFU; this just exposes it through the unified callback.

simx + rtlsim: no hardware CP — instantiate the new software
vortex::CommandProcessor (introduced in 16aa1ca) per device, with
hooks wired to {ram_.read/write, processor_.dcr_write, processor_.run
via std::async, future_ status as busy poll}. The cp_mmio_* methods
proxy to cp_.mmio_write/read and drain a bounded burst of cp_.tick()s
around each access — the deterministic single-thread model from the
proposal §3.2 (no separate CP thread, matches the hardware FSM
clocked alongside Vortex).

Verified:
  cp_sim unit test: 6/6 still PASS
  OpenCL vecadd on all 4 backends: PASS (66 ms / 228 ms / 801 ms / 759 ms)

Phase C will move the cp_post_launch / cp_post_dcr_write helpers from
the xrt/opae runtimes into the shared dispatcher (so all 4 backends
go through the same code path); Phase D switches the dispatcher to
always use them; Phase E strips the legacy launch_*/dcr_* fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…when VORTEX_USE_CP=1

Phases C + D of cp_pure_v2_callbacks_proposal, bundled.

Phase C (dispatcher refactor — no behavior change by itself):
- Platform virtual interface gains cp_mmio_write/read; CallbacksAdapter
  forwards them to the C ABI cb.cp_mmio_* (added in 8bc2564).
- vx::Device gains a CP submission API: cp_submit_dcr_write(addr, val)
  and cp_submit_launch(). Both build the on-wire descriptor (per
  VX_cp_pkg.sv cmd_t layout), upload it to ring DRAM via mem_upload,
  commit Q_TAIL with the LO/HI atomic-pair write, and poll Q_SEQNUM
  until the command retires.
- Device::cp_try_init() runs at open time: when VORTEX_USE_CP env is
  set (honoring 0/false/no/off as off, matching the per-backend
  cleanup in 8b4fdc8), it allocates ring + head + cmpl buffers via
  mem_alloc, zeros them, and programs CP queue 0 + CP_CTRL via
  cp_mmio_write. cp_enabled() reports the final state.
- The CP wire protocol now lives in ONE place. xrt/opae's existing
  per-backend cp_post_launch helpers in their vortex.cpp become
  redundant in this layer of the stack — they'll be removed when
  Phase E strips the legacy launch_*/dcr_* callback fields.
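
The final step of a submission, polling Q_SEQNUM until the command retires, might look like the sketch below. It assumes a caller-supplied timeout and abstracts the register read behind a callback; `wait_seqnum` and its parameters are hypothetical, not the dispatcher's actual signature.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>

// Poll `read_seqnum` until it reaches `target` or `timeout` elapses.
// Returns true when the command retired, false on timeout.
inline bool wait_seqnum(const std::function<uint64_t()>& read_seqnum,
                        uint64_t target,
                        std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (read_seqnum() < target) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
  }
  return true;
}
```

Bounding the loop by wall-clock time rather than by a spin count avoids cutting off long-running kernels on slow simulators.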

Phase D (cutover — Queue picks the path at runtime):
- Queue::launch's KMU descriptor loop chooses cp_submit_dcr_write vs
  platform->dcr_write per call, gated by device_->cp_enabled(). After
  all DCRs are pushed, the path either calls cp_submit_launch (CP
  mode, sync inside) or the legacy launch_start + launch_wait pair.
- Queue::enqueue_dcr_write picks the same way.
- enqueue_dcr_read stays on the legacy path — CP dcr_read isn't
  exposed yet (read response would need a writeback slot; not v1).

Verified (all v2-native dispatcher tests):
  CP-off default:
    vecadd/simx PASS (68 ms)    vecadd/rtlsim PASS (228 ms)
    vecadd/xrt  PASS (952 ms)   vecadd/opae   PASS (1236 ms)
  CP-on (VORTEX_USE_CP=1):
    vecadd/simx   PASS (68 ms)    sgemm/simx   PASS (1718 ms)
    vecadd/rtlsim PASS (228 ms)   sgemm/rtlsim PASS (7052 ms)
    vecadd/xrt    timeout         (pre-existing step-1 hang)
    vecadd/opae   scoreboard assert  (pre-existing step-1 hang)

Key finding: simx + rtlsim now exercise the full CP path end-to-end
through the dispatcher. This validates that the dispatcher's wire
protocol is correct — the xrt/opae hangs are bugs in the hardware
CP integration (likely in VX_cp_core or the AFU mux), NOT in the
dispatcher. The software CommandProcessor (16aa1ca) is now usable as
a reference implementation for diagnosing the hardware-side bug.

Phase E (strip launch_*/dcr_* from callbacks_t) is deferred until
the hardware bug is fixed — pulling the legacy callback fields would
remove xrt/opae's working legacy escape hatch.

Also drops hw/unittest/cp_sim (wrong location for a pure C++ test —
hw/unittest is for RTL/Verilator tests). The regression tests under
tests/regression/ + tests/opencl/ exercise the dispatcher CP path
naturally now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… in S_REQ)

The proxy latched `pending_is_read` on grant but used `cmd.arg0`/`cmd.arg1`
combinationally to drive dcr_req_addr/data in S_REQ. cmd is only valid
during the grant cycle — VX_cp_core's granted_dcr_cmd is a combinational
mux of bid_dcr.cmd[i] gated on dcr_grant[i], so the cycle after grant
(when S_REQ asserts) granted_dcr_cmd defaults to '0. Every CP-issued DCR
write was silently writing DCR 0 with data 0.

Symptom (took 4 sessions of intermittent debugging to localize):
  VORTEX_USE_CP=1 on xrt/opae backends — runtime posts 15 CMD_DCR_WRITEs
  (kernel PC, args ptr, grid/block dims) + 1 CMD_LAUNCH. All 16 commands
  appear to retire (Q_SEQNUM advances) but Vortex never goes busy after
  the LAUNCH because its DCR state never got programmed — startup PC is
  0, args ptr is 0, etc. The launch FSM stays in WAIT_BUSY forever.

The bug was invisible to the cp_engine unit test (it stubs the resource
done signals directly, never actually exercises the proxy's S_REQ → dcr_req
output path) and invisible to the legacy CP integration (only LAUNCH went
through CP; DCRs went via the legacy MMIO_DCR_ADDR path). It surfaced
only when commit 94888e6 routed DCRs through CP via Queue::launch.

Fix: latch cmd_addr and cmd_data into pending_addr / pending_data on the
same S_IDLE → S_REQ transition that already latches pending_is_read.
S_REQ then drives dcr_req_* from the latched values, which stay valid
regardless of upstream cmd mux state.
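
The difference between the pre-fix and post-fix behavior can be modelled in a few lines of C++. This is a behavioral sketch, not the RTL: `DcrProxy`, `on_grant`, `req_latched`, and `req_combinational` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

struct DcrCmd {
  uint32_t addr;
  uint32_t data;
};

class DcrProxy {
 public:
  // Grant cycle: the upstream cmd is valid only now, so capture it.
  void on_grant(const DcrCmd& cmd) {
    pending_ = cmd;
    in_req_  = true;
  }

  // Post-fix: the request is driven from the latched copy, so it stays
  // correct after the upstream mux has fallen back to zero.
  DcrCmd req_latched() const {
    assert(in_req_);  // only meaningful after a grant
    return pending_;
  }

  // Pre-fix, for comparison: the request follows the live cmd bus.
  static DcrCmd req_combinational(const DcrCmd& live_cmd) { return live_cmd; }

 private:
  DcrCmd pending_{0, 0};
  bool   in_req_ = false;
};
```

Feeding a zeroed `live_cmd` into `req_combinational` one cycle after the grant reproduces the "DCR 0 with data 0" write, while `req_latched` still returns the granted address and data.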

Localized via diff-debug against the software CommandProcessor (16aa1ca)
— added per-command stderr trace to Device::cp_submit_cl_, captured
simx + xrt runs of the same vecadd test, observed:
  simx: posts #1..#19 retire in 0 polls, #20 (LAUNCH) retires in ~6 k
        polls (kernel actually runs) → PASS
  xrt:  posts #1..#19 retire in ~7 polls each, #20 STUCK at seq=19
        after 100 k polls → hang

Same command sequence, same wire protocol — difference had to be in the
RTL side of the DCR pipeline. From there it was a straight read of
VX_cp_dcr_proxy.

Verified after fix:
  8-corner regression PASS:
    vecadd legacy: simx 67 / rtlsim 278 / xrt 1273 / opae 1675 ms
    vecadd CP:     simx 69 / rtlsim 226 / xrt 467  / opae 1221 ms
    sgemm  CP:     simx 1709 / rtlsim 6424 / xrt 10973 / opae 14124 ms

This unblocks Phase E of cp_pure_v2_callbacks_proposal — with all 4
backends now functional via CP, the legacy launch_*/dcr_* callbacks
can be safely stripped from callbacks_t in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… v2)

Final phase of cp_pure_v2_callbacks_proposal. The CP is now the sole
control plane on all four backends. callbacks_t exposes only platform
primitives:

  dev_open/close, query_caps, memory_info,
  mem_alloc/reserve/free/access, mem_upload/download/copy,
  cp_mmio_write, cp_mmio_read

Everything else flows through the dispatcher's cp_submit_* helpers,
which build CMD_DCR_WRITE / CMD_DCR_READ / CMD_LAUNCH descriptors and
push them through the CP regfile + ring. The backends no longer have
any per-command implementation work — they just expose the CP MMIO
surface (xrt/opae → AFU regfile at byte 0x1000+; simx/rtlsim →
sim/common/CommandProcessor C++ instance).

Changes:

callbacks.h / callbacks.inc:
- Dropped launch_start, launch_wait, dcr_write, dcr_read fields.
- Dropped corresponding lambdas in callbacks.inc.
- callbacks.h no longer includes <vortex.h>; it had no use for it.

Platform virtual interface (vortex2_internal.h):
- Removed the matching launch_start/launch_wait/dcr_write/dcr_read
  pure virtuals + CallbacksAdapter overrides. Only cp_mmio_*
  remains in the control-plane section.

vx_device.cpp:
- cp_try_init → cp_init: no longer env-gated. Called unconditionally
  from Device::open(). CP failure is now a hard error returned to
  vx_device_open (was: silent no-op).
- Added cp_submit_dcr_read(addr, tag, out): posts CMD_DCR_READ, polls
  Q_SEQNUM, reads the response from the new Q_LAST_DCR_RSP slot at
  CP-offset 0x130.

vx_queue.cpp:
- Queue::launch: removed the cp_enabled() branch; always uses
  cp_submit_dcr_write + cp_submit_launch.
- Queue::enqueue_dcr_write / enqueue_dcr_read: always go through
  cp_submit_dcr_write / cp_submit_dcr_read.

legacy_runtime.cpp:
- vx_dcr_read: was calling platform()->dcr_read directly. Now
  routes through cp_submit_dcr_read so the legacy tag-aware path
  still works (tag → cmd.arg1 → dcr_req_data, matches the legacy
  MMIO_DCR_ADDR+4 semantics).

RTL (VX_cp_axil_regfile):
- New regfile read slot at CP-offset 0x130 (Q_LAST_DCR_RSP)
  exposing the 32-bit response from VX_cp_dcr_proxy.last_rsp_data.
- VX_cp_core wires u_dcr.last_rsp_data → u_regfile.last_dcr_rsp.

Software CP (sim/common/CommandProcessor):
- Added vortex_dcr_read hook for CMD_DCR_READ dispatch.
- New last_dcr_rsp_ member, exposed via mmio_read at offset 0x130.
- Engine: CMD_DCR_READ calls the hook and latches the response.

simx + rtlsim backends:
- Added vortex_dcr_read hook implementation. Critical: hook does
  future_.wait() before processor_.dcr_read to avoid racing the
  background processor_.run() thread on Verilator state (caught a
  segfault on rtlsim during bring-up).

Verified — full 8-corner regression PASSES:
  vecadd: simx 69 / rtlsim 226 / xrt 786 / opae 879 ms
  sgemm:  simx 1709 / rtlsim 7052 / xrt 8231 / opae 14686 ms

The CP-runtime migration is now structurally complete: vortex2.h is
the only user-facing API path, the dispatcher owns all CP protocol,
backends are reduced to ~9 platform primitives. Future work (a CP
DCR read writeback to host memory, multi-queue, real-bitstream xrt
bring-up, etc.) builds on a clean foundation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip implementation phase markers, step numbers, doc/section references,
version qualifiers ("v1", "pre-CP", etc.), and bug-history detail from
comments across the CP RTL, software CommandProcessor, runtime dispatcher,
callbacks ABI, and the four backend vortex.cpp files. Surviving comments
describe behavior and constraints only.

Add docs/designs/command_processor_design.md as the single up-to-date
design doc (consolidates the six prior CP proposal/plan docs). Drop the
old docs/designs/command_processor_prototype.md (review of the legacy
vortex_cp prototype, superseded by the as-built design).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VX_cp_event_unit is the placeholder for CMD_EVENT_WAIT/SIGNAL hardware
arbitration — the engine retires those opcodes as NOPs today; the
module exists so future cross-queue event sync can land without
touching the engine.

VX_cp_profiling exposes the free-running 64-bit cycle counter via the
AXI-Lite regfile (CP_CYCLE_LO/HI) and accepts the per-CPE
submit/start/end pulses. The 32 B per-command timestamp writeback to
profile_slot is not yet wired.

Both are referenced as skeletons in command_processor_design.md §9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in upstream CI build fixes (3e6ccfa): tweaks to VX_dcr_flush,
VX_dxa_completion, VX_mem_arb, simx wctl_unit, and the kernel vx_start
boot stub. No conflicts with the CP feature work.
Direct in-TOML rename, no generator change. Vortex-config keys gain a
VX_CFG_ sub-prefix; [toolchain] keys (VIVADO/QUARTUS/YOSYS/SYNTHESIS/
ASIC/SV_DPI/SYNOPSIS) stay bare. Mechanical codemod across hw/, sim/,
sw/, tests/, ci/ including kernel sources and -D flags in regression
scripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in 143d2cd: ci/regression.sh.in target fixes (mpi/rvc/vm),
tests/regression/Makefile cleanup, tests/riscv/isa/Makefile skip for
rv64uc-p-rvc.bin on rtlsim. No CP overlap.
@tinebp tinebp merged commit 15ec2f8 into tinebp-patch-2 May 18, 2026
0 of 2 checks passed