From 24305f26ad552cec8f6c7a094acee5cbdde087c6 Mon Sep 17 00:00:00 2001 From: tinebp Date: Mon, 18 May 2026 02:48:21 -0700 Subject: [PATCH 1/2] gem5 integration: VortexGPGPU device + x86/ARM host runtime + e2e tests Adds end-to-end gem5 SE-mode integration for Vortex. The simulated host CPU (x86 or ARM) drives a VortexGPGPU device over the OPAE MMIO+DMA command protocol; the device internally runs SimX cycle-by-cycle from gem5's event loop. Validated via ci/regression.sh --gem5: hello + vecadd + sgemm on both ISAs, 16 s wall. Three moving parts (see docs/gem5_integration.md and docs/proposals/gem5_simx_v3_proposal.md for full design rationale): 1. Device library (sim/simx/gem5/vortex_gpgpu.{cpp,h}, USE_GEM5=1) - Wraps a vortex::Processor with a C ABI the gem5 SimObject calls. - Full OPAE protocol state machine: cmd_args, busy bit, dcr_rsp, async pending_cmd dispatch. - Phase-2 in-process smoke driver (sim/simx/gem5/gem5_smoke_main.cpp) proves the library works without gem5 installed. 2. gem5 SimObject (sim/simx/gem5/vortex_gpgpu_dev.{cc,hh} + .py + SConscript) - DmaDevice subclass; dlopens libvortex-gem5.so; ticks Processor::cycle() from EventFunctionWrapper. - CMD_MEM_{READ,WRITE} -> dmaAction; CMD_RUN -> schedule tick; CMD_DCR_* -> synchronous library passthrough. - Installed into a pinned gem5 release by sim/simx/gem5/install.sh, which ci/gem5_install.sh fetches + builds (v25.0.0.1, both build/{X86,ARM}/gem5.opt). 3. Host runtime (sw/runtime/gem5/{vortex.cpp,driver.{cpp,h},Makefile}) - OPAE-shaped vx_* callbacks; direct mmap'd MMIO + bump-allocator pinned region. - HOST_ARCH switch (x86_64 / aarch64 / armhf) -> matching cross compiler, output to \$arch/ subdir so x86 + ARM coexist. 
- All three legacy-vortex_gem5 bug-catalog items addressed: B9 cache flush before download via per-core DCR_READ B13 multi-arch via HOST_ARCH (was hardcoded armhf in legacy) B14 mmio_fence() (mfence / dmb sy) centralised in issue_cmd() SimX-side prerequisites (also shared with SST integration): - Processor::cycle() + Memory* memsim() accessor (sim/simx/processor.*) - sw/common/bitmanip.h: added two missing includes (defensive header hygiene; was hit when gem5 sources became the first to transitively include constants.h) ARM e2e specifics: - tests/regression/common.mk + sw/runtime/stub/Makefile take the same HOST_ARCH switch; aarch64 binaries are suffixed (-aarch64) so x86 and ARM coexist in the same dir. - ci/gem5_test_vortex_app.py calls gem5's setInterpDir() to redirect the ELF interpreter (gem5's loader reads PT_INTERP directly, NOT via syscalls -- RedirectPath alone isn't enough) and adds RedirectPath entries for /lib/aarch64-linux-gnu -> /usr/aarch64-linux-gnu/lib (for libc/libstdc++ at runtime). CI integration: - ci/regression.sh.in: new gem5() function (builds prereqs, runs standalone hello + e2e vecadd/sgemm, each timeout 120). ARM matrix opt-in via VORTEX_GEM5_ARM=1. - .github/workflows/ci.yml: ci/gem5_install.sh appended to Setup Toolchain (cache-gated like SST), GEM5_HOME exported, gem5 entry added to tests matrix (excluded from xlen=64 since the device library is XLEN-locked). - VERSION: GEM5_REV=v25.0.0.1 added. - configure: @GEM5_REV@ substitution. How to test: cd build/ ./ci/gem5_install.sh # first time only sudo apt install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5 # Expect 6 PASSED runs in ~16s wall. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .github/workflows/ci.yml | 21 +- VERSION | 1 + ci/gem5_install.sh.in | 114 +++ ci/gem5_test_vortex_app.py | 229 +++++ ci/gem5_test_vortex_hello.py | 94 ++ ci/regression.sh.in | 123 ++- configure | 2 +- docs/gem5_integration.md | 403 +++++++++ docs/index.md | 1 + docs/proposals/gem5_simx_v3_proposal.md | 1040 +++++++++++++++++++++++ sim/simx/Makefile | 53 +- sim/simx/gem5/SConscript | 18 + sim/simx/gem5/VortexGPGPU.py | 46 + sim/simx/gem5/gem5_smoke_main.cpp | 96 +++ sim/simx/gem5/hello.c | 14 + sim/simx/gem5/install.sh | 50 ++ sim/simx/gem5/vortex_gpgpu.cpp | 320 +++++++ sim/simx/gem5/vortex_gpgpu.h | 111 +++ sim/simx/gem5/vortex_gpgpu_dev.cc | 295 +++++++ sim/simx/gem5/vortex_gpgpu_dev.hh | 122 +++ sim/simx/processor.cpp | 24 + sim/simx/processor.h | 18 + sim/simx/processor_impl.h | 11 + sw/common/bitmanip.h | 2 + sw/runtime/gem5/Makefile | 73 ++ sw/runtime/gem5/driver.cpp | 128 +++ sw/runtime/gem5/driver.h | 73 ++ sw/runtime/gem5/vortex.cpp | 334 ++++++++ sw/runtime/stub/Makefile | 28 +- tests/regression/common.mk | 49 +- 30 files changed, 3882 insertions(+), 11 deletions(-) create mode 100644 ci/gem5_install.sh.in create mode 100644 ci/gem5_test_vortex_app.py create mode 100644 ci/gem5_test_vortex_hello.py create mode 100644 docs/gem5_integration.md create mode 100644 docs/proposals/gem5_simx_v3_proposal.md create mode 100644 sim/simx/gem5/SConscript create mode 100644 sim/simx/gem5/VortexGPGPU.py create mode 100644 sim/simx/gem5/gem5_smoke_main.cpp create mode 100644 sim/simx/gem5/hello.c create mode 100755 sim/simx/gem5/install.sh create mode 100644 sim/simx/gem5/vortex_gpgpu.cpp create mode 100644 sim/simx/gem5/vortex_gpgpu.h create mode 100644 sim/simx/gem5/vortex_gpgpu_dev.cc create mode 100644 sim/simx/gem5/vortex_gpgpu_dev.hh create mode 100644 sw/runtime/gem5/Makefile create mode 100644 sw/runtime/gem5/driver.cpp create mode 100644 sw/runtime/gem5/driver.h create mode 100644 sw/runtime/gem5/vortex.cpp diff 
--git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 2adecef420..588455f069 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -65,6 +65,7 @@ jobs: ../configure --tooldir=$TOOLDIR ci/toolchain_install.sh --all ci/sst_install.sh + ci/gem5_install.sh - name: Setup Third Party if: steps.cache-thirdparty.outputs.cache-hit != 'true' @@ -78,6 +79,11 @@ jobs: echo "SST_CORE_HOME=$PWD/tools/sst-install/sst-core" >> $GITHUB_ENV echo "SST_ELEMENTS_HOME=$PWD/tools/sst-install/sst-elements" >> $GITHUB_ENV + - name: Export gem5 paths + run: | + echo "GEM5_HOME=$PWD/tools/gem5" >> $GITHUB_ENV + echo "$PWD/tools/gem5/build/X86" >> $GITHUB_PATH + build: needs: setup strategy: @@ -137,15 +143,23 @@ jobs: matrix: os: [ubuntu-24.04] # dxa + tensor_wg disabled: features not yet complete (see regression{32,64}_failures.md) - name: [regression, amo, mpi, dtm, opencl, cache, config1, config2, debug, scope, stress, synthesis, vm, rvc, cupbop, hip, tensor, tensor_sp, tensor_mx] + name: [regression, amo, mpi, dtm, opencl, cache, config1, config2, debug, scope, stress, synthesis, vm, rvc, cupbop, hip, tensor, tensor_sp, tensor_mx, gem5] xlen: [32, 64] # chipStar's hipcc emits Physical64 SPIR-V; POCL refuses it on # rv32 Vortex (CL_INVALID_OPERATION). hip is rv64-only until # either chipStar grows --offload=spirv32 or the native # HIPVortex toolchain lands (see hip_support_proposal.md). + # + # gem5 only runs against the rv32 build; the device library + # is XLEN-locked by the gem5 install (build/X86/gem5.opt + # links against the libvortex-gem5.so the runner builds, and + # we only build it once). XLEN=64 entry would just duplicate + # the run against an identical setup. 
exclude: - name: hip xlen: 32 + - name: gem5 + xlen: 64 runs-on: ${{ matrix.os }} timeout-minutes: 120 @@ -190,6 +204,11 @@ jobs: echo "SST_CORE_HOME=$PWD/tools/sst-install/sst-core" >> $GITHUB_ENV echo "SST_ELEMENTS_HOME=$PWD/tools/sst-install/sst-elements" >> $GITHUB_ENV + - name: Export gem5 paths + run: | + echo "GEM5_HOME=$PWD/tools/gem5" >> $GITHUB_ENV + echo "$PWD/tools/gem5/build/X86" >> $GITHUB_PATH + - name: Run tests run: | cd build${{ matrix.xlen }} diff --git a/VERSION b/VERSION index af5ac4633b..590f872b15 100644 --- a/VERSION +++ b/VERSION @@ -1,2 +1,3 @@ VORTEX_VERSION=3.0 TOOLCHAIN_REV=v3.0 +GEM5_REV=v25.0.0.1 diff --git a/ci/gem5_install.sh.in b/ci/gem5_install.sh.in new file mode 100644 index 0000000000..378f5a167c --- /dev/null +++ b/ci/gem5_install.sh.in @@ -0,0 +1,114 @@ +#!/bin/bash + +# Copyright © 2019-2023 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# gem5 install for SimX v3 — Phase 0 of docs/proposals/gem5_simx_v3_proposal.md. +# +# Fetches a pinned gem5 release, installs build deps, builds the ARM +# variant, and exports GEM5_HOME. The Vortex SimObject is NOT installed +# here — that lands in Phase 3 once the API surface is confirmed (see +# sim/simx/gem5/gem5_api_notes.md after this script runs). +# +# Idempotent: re-running with the same GEM5_REV is a no-op once +# $GEM5_HOME/build/ARM/gem5.opt exists. 
+ +# exit when any command fails +set -e + +GEM5_REV=${GEM5_REV:=@GEM5_REV@} +TOOLDIR=${TOOLDIR:=@TOOLDIR@} +GEM5_HOME=$TOOLDIR/gem5 +GEM5_REPO=https://github.com/gem5/gem5.git + +# Build deps. gem5 documents these at https://www.gem5.org/documentation/general_docs/building +# AArch64 cross-toolchain (gcc/g++-aarch64-linux-gnu) is needed for +# Phase 0's hello-arm SE-mode smoke test and for the Phase 4 runtime +# cross-build. Installing it here keeps Phase 0 self-contained. +sudo DEBIAN_FRONTEND=noninteractive apt install -y \ + scons \ + python3 python3-dev python3-pip python3-venv \ + libprotobuf-dev protobuf-compiler libprotoc-dev \ + libgoogle-perftools-dev \ + m4 \ + libboost-all-dev \ + libhdf5-serial-dev \ + libpng-dev \ + pkg-config \ + gcc-aarch64-linux-gnu g++-aarch64-linux-gnu \ + build-essential git wget + +mkdir -p "$TOOLDIR" + +# Fetch (or update) gem5 working tree at the pinned revision. +if [ -d "$GEM5_HOME/.git" ]; then + echo "gem5 working tree exists at $GEM5_HOME" + pushd "$GEM5_HOME" > /dev/null + current_rev=$(git describe --tags --always 2>/dev/null || echo "unknown") + if [ "$current_rev" != "$GEM5_REV" ]; then + echo "checked-out rev $current_rev != pinned $GEM5_REV; refetching" + git fetch --depth=1 origin "tag" "$GEM5_REV" + git checkout "$GEM5_REV" + fi + popd > /dev/null +else + echo "cloning gem5 $GEM5_REV into $GEM5_HOME" + git clone --depth=1 --branch "$GEM5_REV" "$GEM5_REPO" "$GEM5_HOME" +fi + +# Build parallelism: -j$(nproc) on the self-hosted runner; cap at 4 +# on hosted runners to avoid OOM (gem5 link uses ~4 GB peak). +JOBS=$(nproc) +if [ -n "$GITHUB_ACTIONS" ] && [ -z "$VORTEX_SELF_HOSTED" ]; then + JOBS=4 +fi + +# Build both X86 (default host ISA: easier, no cross-compile needed) +# and ARM (research path matching the legacy capstone paper). Either +# can be selected at test-config time via GEM5_BIN=$GEM5_HOME/build/{X86,ARM}/gem5.opt. 
+# Default targets can be overridden via GEM5_TARGETS="X86" or "ARM" or +# "X86 ARM" (space-separated). Both is the default. +GEM5_TARGETS=${GEM5_TARGETS:-"X86 ARM"} + +cd "$GEM5_HOME" +for target in $GEM5_TARGETS; do + if [ ! -x "$GEM5_HOME/build/$target/gem5.opt" ]; then + echo "building gem5.opt ($target) with -j$JOBS" + scons "build/$target/gem5.opt" -j"$JOBS" + else + echo "gem5.opt ($target) already built at $GEM5_HOME/build/$target/gem5.opt" + fi +done + +# Persist GEM5_HOME for subsequent shells (idempotent). +if ! grep -q "^export GEM5_HOME=" ~/.bashrc 2>/dev/null; then + echo "export GEM5_HOME=$GEM5_HOME" >> ~/.bashrc +fi +export GEM5_HOME + +# GitHub Actions: propagate to subsequent steps. +if [ -n "$GITHUB_ENV" ]; then + echo "GEM5_HOME=$GEM5_HOME" >> "$GITHUB_ENV" +fi +if [ -n "$GITHUB_PATH" ]; then + for target in $GEM5_TARGETS; do + echo "$GEM5_HOME/build/$target" >> "$GITHUB_PATH" + done +fi + +echo "" +echo "gem5 $GEM5_REV installed at $GEM5_HOME" +for target in $GEM5_TARGETS; do + echo " binary: $GEM5_HOME/build/$target/gem5.opt" +done +echo " GEM5_HOME exported (re-source ~/.bashrc to pick up in new shells)" diff --git a/ci/gem5_test_vortex_app.py b/ci/gem5_test_vortex_app.py new file mode 100644 index 0000000000..7f703325d8 --- /dev/null +++ b/ci/gem5_test_vortex_app.py @@ -0,0 +1,229 @@ +# Copyright © 2019-2023 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Phase 5 end-to-end gem5 integration test for vortex.VortexGPGPU. 
+# +# Generic application runner — any Vortex regression test that +# follows the standard shape (host binary + kernel.vxbin in the same +# directory, links against libvortex.so) can run here. +# +# Wires: +# - x86 SE-mode CPU running an unmodified Vortex regression test +# (same binary the SimX backend uses). +# - VortexGPGPU device on the system membus at pio=0x20000000. +# - Identity-mapped PIO range (CPU → device MMIO) and pinned region +# (host DRAM accessed by both the CPU's userspace via virt and +# the device's DmaPort via phys) via Process.map() — the same +# mechanism gem5's AMD GPU integration uses at apu_se.py:1055. +# +# The simulated process loads libvortex.so (the stub), which in turn +# dlopens libvortex-gem5-x86_64.so based on the VORTEX_DRIVER env +# var. From there: +# 1. vx_dev_open → drv_init (no-op; mappings already in place) +# 2. vx_upload_kernel_bytes → DMA write of the .vxbin into VRAM +# 3. vx_copy_to_dev (×N) → DMA writes of input buffers +# 4. vx_start → MMIO CMD_RUN; kernel computes +# 5. vx_copy_from_dev → cache flush (per-core DCR_READ) + DMA read +# 6. Host verifies result, prints PASSED / FAILED +# +# Configurable via env vars: +# VORTEX_GEM5_DEV_LIB — path to sim/simx/libvortex-gem5.so +# (device-side; dlopened by the gem5 SimObject) +# VORTEX_GEM5_HOST_RT_DIR — directory containing libvortex.so (the stub) +# AND libvortex-gem5-x86_64.so (the host +# runtime backend). Both are added to the +# simulated process's LD_LIBRARY_PATH. 
+# VORTEX_TEST_DIR — directory containing the test binary + +# kernel.vxbin +# VORTEX_TEST_BIN — name of the test binary inside that dir +# (default: vecadd) +# VORTEX_TEST_ARGS — args passed to the binary (default: -n16) +# VORTEX_DRIVER — backend selector for the stub library +# (default: gem5-x86_64; use gem5-aarch64 +# when running the ARM matrix) + +import os +import shlex + +import m5 +from m5.objects import ( + AddrRange, + DDR3_1600_8x8, + MemCtrl, + Process, + RedirectPath, + Root, + SEWorkload, + SrcClockDomain, + System, + SystemXBar, + AtomicSimpleCPU, + VoltageDomain, + VortexGPGPU, +) + +DEV_LIB = os.environ.get("VORTEX_GEM5_DEV_LIB") +HOST_RT_DIR = os.environ.get("VORTEX_GEM5_HOST_RT_DIR") +TEST_DIR = os.environ.get("VORTEX_TEST_DIR") +TEST_BIN = os.environ.get("VORTEX_TEST_BIN", "vecadd") +TEST_ARGS = os.environ.get("VORTEX_TEST_ARGS", "-n16") +DRIVER = os.environ.get("VORTEX_DRIVER", "gem5-x86_64") + +for name, val in [ + ("VORTEX_GEM5_DEV_LIB", DEV_LIB), + ("VORTEX_GEM5_HOST_RT_DIR", HOST_RT_DIR), + ("VORTEX_TEST_DIR", TEST_DIR), +]: + if not val: + raise RuntimeError(f"{name} env var is required") + +APP_BIN = f"{TEST_DIR}/{TEST_BIN}" + +# Fixed mappings used by the gem5 host runtime (see +# sw/runtime/gem5/driver.h). The Python config and the C runtime +# share these constants by convention; if you change one, change +# both. 
+PIO_BASE = 0x20000000 +PIO_SIZE = 0x1000 # 4 KB — one page is enough for the OPAE regs +PIN_BASE = 0x10000000 +PIN_SIZE = 0x10000000 # 256 MB — large enough for vecadd staging + +# --------------------------------------------------------------------------- +# System construction +# --------------------------------------------------------------------------- +system = System() +system.clk_domain = SrcClockDomain(clock="3GHz", + voltage_domain=VoltageDomain()) +system.mem_mode = "atomic" +system.mem_ranges = [AddrRange("1GiB")] # covers both DRAM and the + # PIN_BASE identity-mapped region + # (PIN_BASE=0x10000000 < 1GB) + +# Cross-arch interp + runtime library redirection. +# Two separate gem5 mechanisms are at play: +# (1) `setInterpDir(prefix)` prepends `prefix` to PT_INTERP when +# gem5 loads the dynamic linker (e.g. /lib/ld-linux-aarch64.so.1 +# → /usr/aarch64-linux-gnu/lib/ld-linux-aarch64.so.1). The +# linker is opened directly by gem5's loader, NOT via SE-mode +# syscall, so RedirectPath doesn't help here. +# (2) `system.redirect_paths` redirects open()/stat()/etc syscalls +# the GUEST process makes — used when the dynamic linker +# later looks up libc.so.6, libstdc++.so.6, libvortex.so, etc. +# Both are no-ops for native x86. +if DRIVER == "gem5-aarch64": + from m5.core import setInterpDir + setInterpDir("/usr/aarch64-linux-gnu") + system.redirect_paths = [ + RedirectPath(app_path="/lib/aarch64-linux-gnu", + host_paths=["/usr/aarch64-linux-gnu/lib"]), + RedirectPath(app_path="/usr/lib/aarch64-linux-gnu", + host_paths=["/usr/aarch64-linux-gnu/lib"]), + ] + +# Membus connects CPU ↔ memory ↔ VortexGPGPU. +system.membus = SystemXBar() +system.system_port = system.membus.cpu_side_ports + +# CPU. Atomic for now — the cycle counts inside the Vortex device are +# driven by the device's own clock anyway; timing CPU adds gem5 wall +# time without changing the kernel result. 
+system.cpu = AtomicSimpleCPU() +system.cpu.createInterruptController() +system.cpu.icache_port = system.membus.cpu_side_ports +system.cpu.dcache_port = system.membus.cpu_side_ports +# X86's InterruptController has explicit pio/int_requestor/int_responder +# ports that must be wired to the membus (per +# learning_gem5/part1/two_level.py:111-114). ARM's interrupt model +# doesn't expose these; skip the wiring on ARM. Tested via the +# DRIVER env var (the same one that selects the simulated host ISA). +if DRIVER == "gem5-x86_64": + system.cpu.interrupts[0].pio = system.membus.mem_side_ports + system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports + system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports + +# Memory controller. The DRAM range starts at 0. PIO_BASE=0x20000000 +# (512 MiB) falls inside the 1 GiB mem_ranges declared above, but +# mem_ranges is just a hint; the actual MemCtrl range set below is +# what determines what is routed where. +system.mem_ctrl = MemCtrl() +system.mem_ctrl.dram = DDR3_1600_8x8() +# DRAM serves [0, 512MB). PIO at 0x20000000 (=512MB) sits at the top +# edge, so let DRAM serve [0, 512MB) and let the membus route +# 0x20000000+ to the VortexGPGPU. +system.mem_ctrl.dram.range = AddrRange(0, size="512MiB") +system.mem_ctrl.port = system.membus.mem_side_ports + +# The Vortex device. The `library` parameter points at the +# device-side libvortex-gem5.so (no arch suffix; gem5 itself is +# always x86-host). The host-side runtime is loaded separately by +# the simulated process via VORTEX_DRIVER below. +system.vortex = VortexGPGPU( + library = DEV_LIB, + kernel = "", # NO preload: the host binary uploads the kernel + # via the OPAE MMIO protocol, the way a real + # accelerator runtime works. 
+) +system.vortex.pio_addr = PIO_BASE +system.vortex.pio_size = PIO_SIZE +system.vortex.pio = system.membus.mem_side_ports +system.vortex.dma = system.membus.cpu_side_ports + +# --------------------------------------------------------------------------- +# Workload (the host test binary) +# --------------------------------------------------------------------------- +argv = [APP_BIN] + shlex.split(TEST_ARGS) +process = Process( + pid=100, + cwd=TEST_DIR, + cmd=argv, + executable=argv[0], + env=[ + # Tells the stub to dlopen our backend + # (libvortex.so does dlopen("libvortex-${VORTEX_DRIVER}.so")). + f"VORTEX_DRIVER={DRIVER}", + # Library search path inside the simulated process. Must + # contain libvortex.so AND libvortex-gem5-$ARCH.so (both + # are in HOST_RT_DIR by construction). + f"LD_LIBRARY_PATH={HOST_RT_DIR}", + ], +) + +system.workload = SEWorkload.init_compatible(APP_BIN) +system.cpu.workload = process +system.cpu.createThreads() + +# --------------------------------------------------------------------------- +# Run +# --------------------------------------------------------------------------- +root = Root(full_system=False, system=system) +m5.instantiate() + +# Identity-map the device PIO range and the pinned DMA region into +# the simulated process's address space. Must happen AFTER +# m5.instantiate() — the process needs a backing C++ object before +# map() is callable. Mirrors apu_se.py:1055 (gem5's AMD GPU pattern). +# The CPU's userspace then touches PIO_BASE / PIN_BASE as ordinary +# memory; the membus routes PIO_BASE → device, PIN_BASE → DRAM. 
+system.cpu.workload[0].map(PIO_BASE, PIO_BASE, PIO_SIZE, cacheable=False) +system.cpu.workload[0].map(PIN_BASE, PIN_BASE, PIN_SIZE, cacheable=True) + +print(f"Phase 5: app={APP_BIN} {TEST_ARGS}") +print(f"Phase 5: VortexGPGPU.library={DEV_LIB}") +print(f"Phase 5: VORTEX_DRIVER={DRIVER}") +print(f"Phase 5: LD_LIBRARY_PATH={HOST_RT_DIR}") +print(f"Phase 5: PIO @0x{PIO_BASE:x}+0x{PIO_SIZE:x}, PIN @0x{PIN_BASE:x}+0x{PIN_SIZE:x}") +print("Phase 5: starting simulation...") + +exit_event = m5.simulate() +print(f"Phase 5: exit_event.cause = {exit_event.getCause()!r}") +print(f"Phase 5: tick = {m5.curTick()}") diff --git a/ci/gem5_test_vortex_hello.py b/ci/gem5_test_vortex_hello.py new file mode 100644 index 0000000000..6ab54b3af2 --- /dev/null +++ b/ci/gem5_test_vortex_hello.py @@ -0,0 +1,94 @@ +# Copyright © 2019-2023 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Phase 3 gem5 integration test for vortex.VortexGPGPU. +# +# Standalone-device variant: the VortexGPGPU SimObject loads the kernel +# directly via its `kernel=` parameter and runs it via its internal +# tick loop. No host CPU, no MMIO traffic, no DMA — this is the gem5 +# analog of sim/simx/gem5/gem5_smoke from Phase 2, used here purely +# to prove the gem5 SimObject can dlopen libvortex-gem5.so, drive +# Processor::cycle() from the gem5 event loop, and exit cleanly. +# +# Phase 5 adds the full host-CPU + MMIO/DMA flow on top of this. 
+# +# Configurable via env vars: +# VORTEX_GEM5_LIB — path to libvortex-gem5.so (no default) +# VORTEX_GEM5_KERNEL — path to .vxbin to preload (no default) +# +# Run from the Vortex build dir as: +# VORTEX_GEM5_LIB=$PWD/sim/simx/libvortex-gem5.so \ +# VORTEX_GEM5_KERNEL=$PWD/tests/kernel/hello/hello.vxbin \ +# $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py + +import os +import m5 +from m5.objects import ( + AddrRange, + DDR3_1600_8x8, + MemCtrl, + Root, + SrcClockDomain, + System, + SystemXBar, + VoltageDomain, + VortexGPGPU, +) + +LIBRARY = os.environ.get("VORTEX_GEM5_LIB") +KERNEL = os.environ.get("VORTEX_GEM5_KERNEL") +if not LIBRARY: + raise RuntimeError("VORTEX_GEM5_LIB env var is required") +if not KERNEL: + raise RuntimeError("VORTEX_GEM5_KERNEL env var is required") + +# Minimal system: just enough to hang the VortexGPGPU off a membus +# so gem5 considers it a properly-wired SimObject. No CPU in this +# Phase-3 test — the kernel runs entirely inside the SimObject's +# internal tick loop. +system = System() +system.clk_domain = SrcClockDomain(clock="1GHz", + voltage_domain=VoltageDomain()) +system.mem_mode = "atomic" +system.mem_ranges = [AddrRange("512MiB")] + +# Membus + a small backing memory so PIO ranges have somewhere to bind. +system.membus = SystemXBar() + +# Memory controller (unused at runtime in Phase 3 but required for the +# system to instantiate cleanly). +system.mem_ctrl = MemCtrl() +system.mem_ctrl.dram = DDR3_1600_8x8() +system.mem_ctrl.dram.range = system.mem_ranges[0] +system.mem_ctrl.port = system.membus.mem_side_ports + +# The Vortex device. It inherits clock from the system clock domain +# (set above to 1GHz) via ClockedObject; no explicit `clock=` param. +system.vortex = VortexGPGPU( + library = LIBRARY, + kernel = KERNEL, +) +system.vortex.pio = system.membus.mem_side_ports +system.vortex.dma = system.membus.cpu_side_ports + +# Root wires the system into the simulator. 
+root = Root(full_system=False, system=system) +m5.instantiate() + +print(f"Phase 3: VortexGPGPU library={LIBRARY}") +print(f"Phase 3: kernel={KERNEL}") +print("Phase 3: running until VortexGPGPU exits the sim loop...") + +exit_event = m5.simulate() +print(f"Phase 3: exit_event.cause = {exit_event.getCause()!r}") +print(f"Phase 3: tick = {m5.curTick()}") diff --git a/ci/regression.sh.in b/ci/regression.sh.in index c84ad793c2..b1b285358a 100755 --- a/ci/regression.sh.in +++ b/ci/regression.sh.in @@ -103,6 +103,124 @@ sst() echo "sst tests done!" } +# gem5 integration tests — Phase 6 of docs/proposals/gem5_simx_v3_proposal.md. +# Validates the VortexGPGPU device + libvortex-gem5.so end-to-end inside +# gem5 SE-mode. Two layers: +# +# 1. Phase 3 standalone (--gem5-standalone): kernel preloaded via the +# SimObject's `kernel=` Python param; runs entirely inside the gem5 +# event loop, no host CPU needed. Fast smoke test (~1 s wall, ~5K +# simulated cycles per run). +# +# 2. Phase 5 e2e (--gem5): an x86 SE-mode workload (the standard +# tests/regression/vecadd binary, same one the SimX backend uses) +# drives the device via the OPAE MMIO/DMA protocol through +# libvortex-gem5-x86_64.so. Exercises the full path: kernel upload +# DMA, status polling, cache-flush DCRs, result DMA, host-side +# verification. +# +# ARM matrix is opt-in via VORTEX_GEM5_ARM=1 (needs gcc-aarch64-linux-gnu +# installed; not part of the default hosted-runner image). +gem5() +{ + echo "begin gem5 tests..." + + if [ -z "$GEM5_HOME" ]; then + GEM5_HOME=$HOME/tools/gem5 + fi + if [ ! -x "$GEM5_HOME/build/X86/gem5.opt" ]; then + echo "error: $GEM5_HOME/build/X86/gem5.opt not found — run ci/gem5_install.sh first" + exit 1 + fi + + # Build prerequisites. The host runtime is gated on HOST_ARCH; + # default x86 needs no cross-toolchain. 
+ make -C sim/simx USE_GEM5=1 + make -C sw/runtime/stub + make -C sw/runtime/gem5 HOST_ARCH=x86_64 + make -C sw/kernel + make -C tests/kernel/hello + make -C tests/regression/vecadd + make -C tests/regression/sgemm + + BUILD_DIR=$(pwd) + LIB_GEM5_DEV=$BUILD_DIR/sim/simx/libvortex-gem5.so + HOST_RT_DIR=$BUILD_DIR/sw/runtime + + # Phase 3 standalone smoke — no host CPU, kernel preload. + # env-vars MUST precede the binary (gem5.opt would otherwise + # treat them as positional args). + VORTEX_GEM5_LIB=$LIB_GEM5_DEV \ + VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \ + timeout 120 $GEM5_HOME/build/X86/gem5.opt \ + ci/gem5_test_vortex_hello.py + + # Phase 5 e2e — full OPAE protocol path through the host runtime. + # Generic test runner (ci/gem5_test_vortex_app.py) parameterized + # by VORTEX_TEST_BIN + VORTEX_TEST_ARGS. Sizes are chosen so each + # run fits in the 120s per-test budget (feedback_test_timeout_120s): + # - vecadd -n16 small vector add (~4K device cycles) + # - sgemm -n4 4x4 matrix multiply (~800 device cycles; larger + # sizes overrun the budget because the simulated + # host CPU's ready_wait poll loop burns gem5 + # wall time proportional to kernel runtime). + # Run on local dev box for larger sizes by overriding VORTEX_TEST_ARGS. + for spec in "vecadd:-n16" "sgemm:-n4"; do + app="${spec%%:*}" + args="${spec#*:}" + echo "=== gem5 e2e: $app $args ===" + VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \ + VORTEX_GEM5_HOST_RT_DIR=$HOST_RT_DIR \ + VORTEX_TEST_DIR=$BUILD_DIR/tests/regression/$app \ + VORTEX_TEST_BIN=$app \ + VORTEX_TEST_ARGS=$args \ + timeout 120 $GEM5_HOME/build/X86/gem5.opt \ + ci/gem5_test_vortex_app.py + done + + # ARM matrix (opt-in). The device library (libvortex-gem5.so) is + # always x86 — gem5.opt is an x86 binary regardless of which + # simulated ISA it models. Only the simulated host's ISA changes. + if [ -n "$VORTEX_GEM5_ARM" ]; then + if [ ! 
-x "$GEM5_HOME/build/ARM/gem5.opt" ]; then + echo "error: $GEM5_HOME/build/ARM/gem5.opt not found" + exit 1 + fi + + # Cross-compile the host runtime, stub, and test binaries for + # aarch64. All outputs land in $arch/ subdirs alongside the + # native x86 builds so they coexist cleanly. + make -C sw/runtime/stub HOST_ARCH=aarch64 + make -C sw/runtime/gem5 HOST_ARCH=aarch64 + make -C tests/regression/vecadd HOST_ARCH=aarch64 + make -C tests/regression/sgemm HOST_ARCH=aarch64 + + ARM_HOST_RT_DIR=$BUILD_DIR/sw/runtime/aarch64 + + echo "=== gem5 ARM standalone: hello ===" + VORTEX_GEM5_LIB=$LIB_GEM5_DEV \ + VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \ + timeout 120 $GEM5_HOME/build/ARM/gem5.opt \ + ci/gem5_test_vortex_hello.py + + for spec in "vecadd:-n16" "sgemm:-n4"; do + app="${spec%%:*}" + args="${spec#*:}" + echo "=== gem5 ARM e2e: $app $args ===" + VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \ + VORTEX_GEM5_HOST_RT_DIR=$ARM_HOST_RT_DIR \ + VORTEX_TEST_DIR=$BUILD_DIR/tests/regression/$app \ + VORTEX_TEST_BIN=$app-aarch64 \ + VORTEX_TEST_ARGS=$args \ + VORTEX_DRIVER=gem5-aarch64 \ + timeout 120 $GEM5_HOME/build/ARM/gem5.opt \ + ci/gem5_test_vortex_app.py + done + fi + + echo "gem5 tests done!" +} + mpi() { echo "begin mpi tests..." 
@@ -1022,7 +1140,7 @@ hip() show_usage() { echo "Vortex Regression Test" - echo "Usage: $0 [--clean] [--unittest] [--riscv] [--kernel] [--regression] [--amo] [--dxa] [--opencl] [--cache] [--vm] [--rvc] [--config1] [--config2] [--debug] [--scope] [--stress] [--synthesis] [--vector] [--graphics] [--tensor] [--tensor_sp] [--tensor_mx] [--tensor_wg] [--cupbop] [--hip] [--all] [--h|--help]" + echo "Usage: $0 [--clean] [--unittest] [--riscv] [--kernel] [--regression] [--amo] [--dxa] [--opencl] [--cache] [--vm] [--rvc] [--config1] [--config2] [--debug] [--scope] [--stress] [--synthesis] [--vector] [--graphics] [--tensor] [--tensor_sp] [--tensor_mx] [--tensor_wg] [--cupbop] [--hip] [--sst] [--gem5] [--dtm] [--mpi] [--all] [--h|--help]" } declare -a tests=() @@ -1114,6 +1232,9 @@ while [ "$1" != "" ]; do --sst ) tests+=("sst") ;; + --gem5 ) + tests+=("gem5") + ;; --dtm ) tests+=("dtm") ;; diff --git a/configure b/configure index 14c0880d1d..ea1abb5ebf 100755 --- a/configure +++ b/configure @@ -69,7 +69,7 @@ copy_files() { continue fi mkdir -p "$dest_dir" - sed "s|@VORTEX_HOME@|$SOURCE_DIR|g; s|@XLEN@|$XLEN|g; s|@TOOLDIR@|$TOOLDIR|g; s|@OSVERSION@|$OSVERSION|g; s|@INSTALLDIR@|$PREFIX|g; s|@BUILDDIR@|$CURRENT_DIR|g; s|@TOOLCHAIN_REV@|$TOOLCHAIN_REV|g; s|@VORTEX_VERSION@|$VORTEX_VERSION|g" "$file" > "$dest_file" + sed "s|@VORTEX_HOME@|$SOURCE_DIR|g; s|@XLEN@|$XLEN|g; s|@TOOLDIR@|$TOOLDIR|g; s|@OSVERSION@|$OSVERSION|g; s|@INSTALLDIR@|$PREFIX|g; s|@BUILDDIR@|$CURRENT_DIR|g; s|@TOOLCHAIN_REV@|$TOOLCHAIN_REV|g; s|@VORTEX_VERSION@|$VORTEX_VERSION|g; s|@GEM5_REV@|$GEM5_REV|g" "$file" > "$dest_file" # apply permissions to bash scripts read -r firstline < "$dest_file" if [[ "$firstline" =~ ^#!.*bash ]]; then diff --git a/docs/gem5_integration.md b/docs/gem5_integration.md new file mode 100644 index 0000000000..f474118897 --- /dev/null +++ b/docs/gem5_integration.md @@ -0,0 +1,403 @@ +# gem5 Integration + +Vortex can run inside the [gem5](https://www.gem5.org/) full-system +simulator 
as a `DmaDevice` SimObject, exposing a Vortex GPGPU to a +simulated host CPU (x86 or ARM) over the standard OPAE MMIO+DMA +command protocol. Use this when you want to model heterogeneous +host-CPU+accelerator workloads with realistic cross-ISA cache and +DMA timing. + +For the design rationale see +[docs/proposals/gem5_simx_v3_proposal.md](proposals/gem5_simx_v3_proposal.md). +This document is the operator manual. + +## At a glance + +The integration has three moving parts that live in this repo: + +| Part | Source | Built artifact | Loaded by | +|---|---|---|---| +| Device library | `sim/simx/gem5/vortex_gpgpu.{cpp,h}` | `build/sim/simx/libvortex-gem5.so` | gem5 SimObject via `dlopen` | +| gem5 SimObject | `sim/simx/gem5/vortex_gpgpu_dev.{cc,hh}` + `VortexGPGPU.py` + `SConscript` | Linked into `gem5.opt` after install | gem5 itself | +| Host runtime | `sw/runtime/gem5/{vortex.cpp,driver.{cpp,h},Makefile}` | `build/sw/runtime/libvortex-gem5-{x86_64,aarch64}.so` | The simulated process inside gem5 | + +Plus one external piece: `ci/gem5_install.sh` fetches gem5 +v25.0.0.1, drops our SimObject sources into `$GEM5_HOME/src/dev/vortex/`, +and builds `build/{X86,ARM}/gem5.opt` (both ISAs by default). + +## One-time setup + +Vortex install / build as usual ([docs/install_vortex.md](install_vortex.md)), +then add gem5: + +```bash +cd build/ # standard Vortex out-of-tree build directory +./ci/gem5_install.sh +``` + +This runs `sudo apt install` for gem5's build dependencies (scons, +libprotobuf, m4, libboost, **gcc-aarch64-linux-gnu**, …), clones gem5 +v25.0.0.1 into `$TOOLDIR/gem5`, copies the Vortex SimObject sources +into `$GEM5_HOME/src/dev/vortex/`, and builds `gem5.opt` for both X86 +and ARM (~15 min on a 64-core machine, ~30-45 min on a typical CI +runner). The script is idempotent — re-running with the same +`GEM5_REV` is a no-op. 
+ +To install only one ISA: + +```bash +GEM5_TARGETS="X86" ./ci/gem5_install.sh # X86 only +GEM5_TARGETS="ARM" ./ci/gem5_install.sh # ARM only +GEM5_TARGETS="X86 ARM" ./ci/gem5_install.sh # both (default) +``` + +The pinned gem5 revision lives in `VERSION` (`GEM5_REV=v25.0.0.1`); +bumping it requires re-running `ci/gem5_install.sh` and verifying +both `gem5.opt` builds still load `VortexGPGPU` cleanly. + +## Building Vortex with gem5 support + +The device library is gated behind `USE_GEM5=1`. The default +`make -C sim/simx` is **unchanged**: no gem5 dependency, and no `libvortex-gem5.so` +is produced. + +```bash +make -C sim/simx # default; no gem5 artifacts +make -C sim/simx USE_GEM5=1 # produces libvortex-gem5.so + gem5_smoke +``` + +`USE_SST=1` and `USE_GEM5=1` are mutually exclusive (the Makefile +errors out if both are set): they are different external simulators +with different LDFLAGS, so building both into one binary makes no sense. + +### Host runtime + tests (cross-compile) + +The simulated process inside gem5 loads the **host runtime** +`libvortex-gem5-$HOST_ARCH.so`, which speaks the OPAE MMIO/DMA +protocol to the device.
The `HOST_ARCH` knob is consistent across +three Makefiles — runtime backend, stub, and regression tests: + +```bash +# Native x86 (default) +make -C sw/runtime/stub # → build/sw/runtime/libvortex.so +make -C sw/runtime/gem5 # → build/sw/runtime/libvortex-gem5-x86_64.so +make -C tests/regression/vecadd # → build/tests/regression/vecadd/vecadd + +# Cross-compiled aarch64 — outputs land in $arch/ subdirs so x86 +# and ARM artifacts coexist: +make -C sw/runtime/stub HOST_ARCH=aarch64 # → build/sw/runtime/aarch64/libvortex.so +make -C sw/runtime/gem5 HOST_ARCH=aarch64 # → build/sw/runtime/aarch64/libvortex-gem5-aarch64.so +make -C tests/regression/vecadd HOST_ARCH=aarch64 # → build/tests/regression/vecadd/vecadd-aarch64 + +# armhf works the same way: +make -C sw/runtime/stub HOST_ARCH=armhf +make -C sw/runtime/gem5 HOST_ARCH=armhf +make -C tests/regression/vecadd HOST_ARCH=armhf +``` + +The ARM targets require `gcc-aarch64-linux-gnu` / `gcc-arm-linux-gnueabihf` +respectively — `ci/gem5_install.sh` installs these. + +## Running tests + +### From the regression harness + +```bash +cd build/ +./ci/regression.sh --gem5 +``` + +Runs both the standalone Phase-3 smoke test (kernel preloaded on the +SimObject, no host CPU) and the Phase-5 end-to-end test (real +SE-mode host program drives the device through MMIO+DMA). Total +wall time ~5 s on a fast box. + +To also run the ARM matrix entry (needs `gcc-aarch64-linux-gnu`): + +```bash +VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5 +``` + +Runs 6 tests in ~16 s wall: +- X86 standalone hello (no host CPU; SimObject preloads kernel) +- X86 e2e vecadd `-n16` (host CPU drives device via OPAE MMIO+DMA) +- X86 e2e sgemm `-n4` +- ARM standalone hello +- ARM e2e vecadd `-n16` +- ARM e2e sgemm `-n4` + +Cross-arch e2e relies on two gem5 mechanisms working together: + +1. 
**`setInterpDir(prefix)`** prepends a sysroot to the dynamic + linker path embedded in the cross-compiled ELF + (`/lib/ld-linux-aarch64.so.1` → `/usr/aarch64-linux-gnu/lib/...`). + The Python config calls this when `VORTEX_DRIVER=gem5-aarch64`. +2. **`system.redirect_paths`** redirects the *guest process's* + open()/stat() syscalls for `/lib/aarch64-linux-gnu/*` → + `/usr/aarch64-linux-gnu/lib/*` so the dynamic linker can + resolve libc, libstdc++, etc. + +Both paths point at the Ubuntu `gcc-aarch64-linux-gnu` package's +install location — no extra setup needed. + +### By hand + +**Standalone** (no host CPU; kernel preloaded via SimObject parameter): + +```bash +VORTEX_GEM5_LIB=$(pwd)/sim/simx/libvortex-gem5.so \ +VORTEX_GEM5_KERNEL=$(pwd)/tests/kernel/hello/hello.vxbin \ + $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py +``` + +**End-to-end** — any standard Vortex regression test (host binary ++ kernel.vxbin) runs through the generic +[`ci/gem5_test_vortex_app.py`](../ci/gem5_test_vortex_app.py) +runner. Set `VORTEX_TEST_BIN` to the test name: + +```bash +# vecadd +VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \ +VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \ +VORTEX_TEST_DIR=$(pwd)/tests/regression/vecadd \ +VORTEX_TEST_BIN=vecadd \ +VORTEX_TEST_ARGS="-n16" \ + $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py + +# sgemm +VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \ +VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \ +VORTEX_TEST_DIR=$(pwd)/tests/regression/sgemm \ +VORTEX_TEST_BIN=sgemm \ +VORTEX_TEST_ARGS="-n4" \ + $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py +``` + +Expected vecadd output (truncated): +``` +allocate device memory +upload source buffer0 +upload source buffer1 +Upload kernel binary +start device +wait for completion +download destination buffer +verify result +PASSED! 
+``` + +### Sizing tests for the 120 s budget + +The per-test `timeout 120` bound comes from +[feedback_test_timeout_120s](../../../../.claude/projects/-home-blaisetine-dev/memory/feedback_test_timeout_120s.md). +gem5 SE-mode runs the host CPU's `ready_wait` poll loop in +simulated time too, so **kernel runtime translates directly into +gem5 wall time**. The regression script's default sizes (`-n16` for +vecadd, `-n4` for sgemm) fit; sgemm `-n16` does not: + +| Test | Args | Device cycles | Wall (atomic CPU) | +|---|---|---|---| +| vecadd | `-n16` | ~450 | ~3 s | +| sgemm | `-n4` | ~780 | ~3 s | +| sgemm | `-n16` | ~10k+ | **> 120 s** (overruns) | + +Larger sizes are fine when run by hand outside the budget cap. + +## Writing your own gem5 Python script + +The minimal recipe for hosting Vortex inside a custom gem5 system: + +```python +from m5.objects import ( + AddrRange, AtomicSimpleCPU, DDR3_1600_8x8, MemCtrl, Process, + Root, SEWorkload, SrcClockDomain, System, SystemXBar, + VoltageDomain, VortexGPGPU, +) + +# Mappings expected by sw/runtime/gem5/driver.h. +PIO_BASE, PIO_SIZE = 0x20000000, 0x1000 +PIN_BASE, PIN_SIZE = 0x10000000, 0x10000000 # 256 MB pinned region + +system = System() +system.clk_domain = SrcClockDomain(clock="3GHz", + voltage_domain=VoltageDomain()) +system.mem_mode = "atomic" +system.mem_ranges = [AddrRange("1GiB")] +system.membus = SystemXBar() +system.system_port = system.membus.cpu_side_ports + +# CPU (x86 example). For ARM, swap to ArmAtomicSimpleCPU + adjust +# interrupt wiring. +system.cpu = AtomicSimpleCPU() +system.cpu.createInterruptController() +system.cpu.icache_port = system.membus.cpu_side_ports +system.cpu.dcache_port = system.membus.cpu_side_ports +system.cpu.interrupts[0].pio = system.membus.mem_side_ports +system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports +system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports + +# DRAM serves [0, 512MB). PIO at 0x20000000 above goes to the +# Vortex device (membus routes by address).
+system.mem_ctrl = MemCtrl() +system.mem_ctrl.dram = DDR3_1600_8x8() +system.mem_ctrl.dram.range = AddrRange(0, size="512MiB") +system.mem_ctrl.port = system.membus.mem_side_ports + +# The Vortex device. +system.vortex = VortexGPGPU( + library = "/path/to/build/sim/simx/libvortex-gem5.so", +) +system.vortex.pio_addr = PIO_BASE +system.vortex.pio_size = PIO_SIZE +system.vortex.pio = system.membus.mem_side_ports +system.vortex.dma = system.membus.cpu_side_ports + +# Workload — the host binary uses the OPAE protocol via libvortex.so +# + libvortex-gem5-x86_64.so (selected by VORTEX_DRIVER). +process = Process( + pid=100, + cwd="/path/to/your/test", + cmd=["/path/to/your/test/binary"], + executable="/path/to/your/test/binary", + env=[ + "VORTEX_DRIVER=gem5-x86_64", + "LD_LIBRARY_PATH=/path/to/build/sw/runtime", + ], +) + +system.workload = SEWorkload.init_compatible("/path/to/your/test/binary") +system.cpu.workload = process +system.cpu.createThreads() + +import m5 +root = Root(full_system=False, system=system) +m5.instantiate() + +# CRITICAL: Process.map() must come AFTER m5.instantiate(). +# Identity-mapping PIO + PIN makes the runtime's volatile-pointer +# MMIO and DMA staging buffer "just work" from the simulated process. +system.cpu.workload[0].map(PIO_BASE, PIO_BASE, PIO_SIZE, cacheable=False) +system.cpu.workload[0].map(PIN_BASE, PIN_BASE, PIN_SIZE, cacheable=True) + +m5.simulate() +``` + +Reference implementations: +- [ci/gem5_test_vortex_hello.py](../ci/gem5_test_vortex_hello.py) — standalone Phase-3 variant (preload via `kernel=` param; no host CPU) +- [ci/gem5_test_vortex_app.py](../ci/gem5_test_vortex_app.py) — Phase-5 e2e variant (any regression test via `VORTEX_TEST_BIN`) + +## Load-bearing invariants — do not violate + +These are the rules that, if broken, will silently produce wrong +answers or hangs. Each is repeated from the proposal but is +load-bearing enough to call out here: + +### 1. 
Process.map() goes AFTER m5.instantiate() + +`Process.map(vaddr, paddr, size)` is a C++ method on the underlying +`gem5::Process` object; that object only exists after +`m5.instantiate()` builds the SimObject tree. Calling `.map()` +before instantiate raises `RuntimeError: Attempt to instantiate +orphan node `. + +Confirmed by gem5's own AMD GPU integration at +`$GEM5_HOME/configs/example/apu_se.py:1055`. + +### 2. PIO and PIN regions must be identity-mapped + +`sw/runtime/gem5/driver.h` hard-codes: +- `PIO_BASE_ADDR = 0x20000000` (device MMIO; 4 KB) +- `PIN_BASE_ADDR = 0x10000000` (DMA staging; 256 MB) + +The Python config must `process.map()` both at the same physical +addresses so: +- CPU's `*(volatile uint64_t*)0x20000000` → membus routes to the device +- Device's DmaPort read at phys `0x10000000+N` → membus routes to DRAM +- Both sides agree on the same bytes without any virtual-to-physical + translation surprise. + +Changing either constant requires updating both the Python config +**and** `sw/runtime/gem5/driver.h` (they are not auto-synced). + +### 3. The CPU runtime MUST issue a cache flush before reading back results + +The host runtime's `download()` path issues a per-core +`dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy)` BEFORE the +`CMD_MEM_READ` DMA. Skipping it returns stale data — the L1/L2/L3 +caches may still hold writes that haven't reached VRAM. + +This is bug **B9** in the legacy `vortex_gem5` code; the v3 host +runtime fixes it. If you write your own runtime, do the same. + +### 4. MMIO writes need an explicit memory barrier before CMD_TYPE + +The host CPU model in gem5 (especially out-of-order variants) can +reorder MMIO writes. `sw/runtime/gem5/driver.cpp` centralises the +fence in `issue_cmd()` so it's impossible to forget: +- x86: `__asm__ volatile("mfence" ::: "memory")` +- AArch64/ARMv7: `__asm__ volatile("dmb sy" ::: "memory")` + +If your custom runtime bypasses `issue_cmd()`, replicate this. This +is bug **B14** in the legacy code. 
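
The ordering rule in invariant 4 can be sketched as below. This is a minimal illustration under stated assumptions, not the contents of `sw/runtime/gem5/driver.cpp`: the register indices (`REG_CMD_*`) and the `issue_cmd()` signature are hypothetical, and the portable fallback branch is an addition for hosts that are neither x86-64 nor ARM. Only the two fence instructions come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical register indices (in 64-bit words), for illustration only.
// The real offsets come from the OPAE AFU header, not from this sketch.
enum : std::size_t {
  REG_CMD_TYPE = 0,
  REG_CMD_ARG0 = 1,
  REG_CMD_ARG1 = 2,
  REG_CMD_ARG2 = 3,
  REG_COUNT    = 4
};

// Full memory barrier so earlier MMIO stores are ordered before later ones.
static inline void mmio_fence() {
#if defined(__x86_64__)
  __asm__ volatile("mfence" ::: "memory");
#elif defined(__aarch64__) || defined(__arm__)
  __asm__ volatile("dmb sy" ::: "memory");   // assumes ARMv7 or newer
#else
  __atomic_thread_fence(__ATOMIC_SEQ_CST);   // assumed fallback (GCC/Clang builtin)
#endif
}

// Write the three arguments, fence, then write the command type last,
// because the device acts on the command when CMD_TYPE is written.
static inline void issue_cmd(volatile std::uint64_t* regs, std::uint64_t cmd,
                             std::uint64_t arg0, std::uint64_t arg1,
                             std::uint64_t arg2) {
  assert(regs != nullptr);
  regs[REG_CMD_ARG0] = arg0;
  regs[REG_CMD_ARG1] = arg1;
  regs[REG_CMD_ARG2] = arg2;
  mmio_fence();
  regs[REG_CMD_TYPE] = cmd;
}
```

In a real runtime `regs` would point at the identity-mapped PIO window rather than at ordinary memory. Keeping the fence inside the single issue function means callers cannot skip it.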
+ +### 5. One source of truth for memory state + +Vortex's VRAM is owned by `vortex::RAM` inside the device library. +The pinned region is owned by gem5's DRAM. **The device library +does not maintain a shadow copy of host pinned memory; the host +runtime does not maintain a shadow copy of device VRAM.** Bytes +cross between the two only via the explicit DMA staging path +(steps 1-6 in §5 of `gem5_simx_v3_proposal.md`). + +Don't add a "fast path" that reads/writes the other side's memory +directly. That breaks the timing model and reintroduces bug **B3** +from the legacy code. + +### 6. USE_SST=1 and USE_GEM5=1 are mutually exclusive + +The Makefile rejects both at once. Different external simulators, +different LDFLAGS, different `libvortex.so` shapes. Pick one per +build. + +## Architectural choices you may want to revisit + +These are documented in [the proposal](proposals/gem5_simx_v3_proposal.md) +but are worth surfacing: + +- **Status polling, not doorbell queues** (proposal §3.6 "Doorbell + queues" note). The host runtime polls `MMIO_STATUS` between + commands; modern GPUs (AMD, NVIDIA) use ring-buffer + doorbell. + Phase 7+ upgrade if your research needs batched-dispatch realism. +- **SE-mode + custom PIO+DMA wiring**, not FS-mode + PCIe (proposal + §3.6). Matches the legacy capstone paper; faster iteration. PCIe + upgrade is a Phase 5+ enhancement that swaps the SimObject base + class from `DmaDevice` to `PciDevice` (`PciDevice` is itself a + `DmaDevice`, so the C ABI stays compatible). +- **C ABI between the device library and gem5 SimObject** instead + of C++ linkage (proposal §3.1). Lets you rebuild + `libvortex-gem5.so` without rebuilding `gem5.opt`, so Vortex + internals can churn freely. + +## CI + +`./ci/regression.sh --gem5` runs the gem5 tests. They are intentionally +left **out** of `--all`: the gem5 install is heavy and gated like SST. The +`.github/workflows/ci.yml` matrix includes a `gem5` entry that runs +on hosted runners; the ARM matrix is gated on +`VORTEX_GEM5_ARM=1`.
+ +Apptainer integration (the `apptainer-ci.yml` pipeline) does **not** +include gem5 — adding it to `miscs/apptainer/vortex.def` is out of +scope for this integration (proposal §8). Use the hosted CI for +gem5. + +## Troubleshooting + +| Symptom | Cause | Fix | +|---|---|---| +| `dlopen('libvortex-gem5.so') failed: cannot open shared object file` | gem5 SimObject can't find the device library | Set `VortexGPGPU(library="/abs/path/to/libvortex-gem5.so", ...)` to absolute path | +| `Cannot open library: libvortex-gem5-x86_64.so: cannot open shared object file` | Stub can't find the host runtime backend | Set `LD_LIBRARY_PATH=/path/to/sw/runtime` in the `env=[...]` list passed to `Process()` | +| `fatal: syscall clock_nanosleep (#230) unimplemented` | gem5 SE-mode doesn't implement clock_nanosleep; glibc's `nanosleep()` routes through it | Already fixed in `sw/runtime/gem5/vortex.cpp` (uses `sched_yield()` instead). If you wrote your own runtime, do the same. | +| `Attempt to instantiate orphan node ` | `Process.map()` called before `m5.instantiate()` | Move all `.map()` calls AFTER `m5.instantiate()` — see invariant §1 above | +| `fatal: VortexGPGPU: dlsym(vortex_gem5_build_info) failed` | Device library is missing the C ABI symbol — usually means the `library=` parameter points at the wrong .so | `library=` is the **device** library `build/sim/simx/libvortex-gem5.so` (no arch suffix), NOT the host runtime `libvortex-gem5-x86_64.so` | +| Test hangs forever in `vx_ready_wait` | Device's busy bit never clears, usually because the SimObject didn't schedule the tick event | Confirm you set `system.vortex.dma = system.membus.cpu_side_ports` and the device's `tick()` is reachable. Check gem5 with `--debug-flags=VortexGPGPU` | +| `ccache g++ ... 
undefined reference to fmt::v8::detail::error_handler::on_error` | ccache served a stale object compiled against a different `fmt` version | `CCACHE_DISABLE=1 make -C sim/simx clean && CCACHE_DISABLE=1 make ...` | diff --git a/docs/index.md b/docs/index.md index a7b9000d49..0c3504d724 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,3 +8,4 @@ - [Contributing](contributing.md): Process for contributing your own features including repo semantics and testing - [Debugging](debugging.md): Debugging configurations for each Vortex driver - [Building the Toolchain from Source](building_toolchain.md): Maintainer-facing build recipes for Verilator, RISC-V GNU, LLVM (with X86 + lld + SPIR-V), compiler-rt, musl, and POCL +- [gem5 Integration](gem5_integration.md): Running Vortex inside the gem5 full-system simulator (x86/ARM host CPU + Vortex device over OPAE MMIO/DMA) diff --git a/docs/proposals/gem5_simx_v3_proposal.md b/docs/proposals/gem5_simx_v3_proposal.md new file mode 100644 index 0000000000..470a4669b0 --- /dev/null +++ b/docs/proposals/gem5_simx_v3_proposal.md @@ -0,0 +1,1040 @@ +# gem5 Integration for SimX v3 — Proposal + +**Date:** 2026-05-16 +**Status:** ✅ ALL PHASES (0–7) COMPLETE on BOTH x86_64 AND aarch64 (hello + vecadd + sgemm × 2 ISAs all PASS end-to-end in 16 s wall via `VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5`) +**Author:** Blaise Tine +**Related:** +[simx_v3_proposal.md](simx_v3_proposal.md) (Phase 5: TLM data path), +[sst_simx_v3_proposal.md](sst_simx_v3_proposal.md) (the sister integration whose patterns this proposal follows), +[master_merge_v3_proposal.md](master_merge_v3_proposal.md) §10.2 (the precedent for cross-simulator integrations on this line), +[`~/dev/vortex_gem5`](https://github.com/sij814/vortex_gem5) on branch `gem5`, commit `91dcf17` ("working Vortex with gem5", 2025-05-22 — Injae Shin, UCLA capstone), +[Injae Shin, "gem5-Vortex: Heterogeneous Cross-ISA Integration of Vortex GPGPU in gem5"](#) (capstone report, 2025). 
+ +--- + +## 1. Constraints (load-bearing) + +Any design that breaks one of these is wrong. + +1. **One source of truth for memory state.** Per + [simx_v3_proposal.md §3.3](simx_v3_proposal.md), data lives in the + channel hierarchy: `MemReq`/`MemRsp` packets carry actual bytes + between `MemCoalescer` → `Cache` → `Memory`, and the `RAM` image + attached to `Memory` is authoritative. There is no shadow backing + store and no parallel `MemBackend`. The gem5 integration plugs in at + exactly one boundary (the device's DMA port maps to `RAM` + read/write); it does **not** introduce a second data path. +2. **Single clock owner per simulation.** Under gem5, gem5 drives the + clock: `VortexGPGPU::tick()` (a gem5 `EventFunctionWrapper` that + reschedules itself every cycle at the device clock) calls + `Processor::cycle()`. SimX does not advance on its own and there is + no worker thread doing async `Processor::run()` in the background. + (This is a deliberate departure from the legacy `vortex_gem5` design + — see §2.2 — which is the source of most of that branch's bugs.) +3. **gem5 plugs in at one boundary, not many.** Vortex → gem5 traffic + crosses two well-defined interfaces: + - **PIO** for MMIO command/status registers (the OPAE AFU image + layout, unchanged from `sw/runtime/opae`). + - **DMA** for staging-buffer host↔device transfers, and for any + future host-visible memory window. + The cache hierarchy, scheduler, ALU/FPU, KMU, and the new + `Processor::cycle()` entry point do not know gem5 exists. +4. **No regression for non-gem5 builds.** `make -C sim/simx` (no + `USE_GEM5=1`) continues to produce a self-contained `simx` binary + identical to today's. gem5 is opt-in compile-time, not a runtime + probe, and ships as a separate shared library (`libvortex-gem5.so`) + that the gem5 SimObject loads. Per §1.4 of + [sst_simx_v3_proposal.md](sst_simx_v3_proposal.md). +5. 
**The Vortex tree owns the integration code.** All gem5-facing C++ + (the `DmaDevice` SimObject) and Python (SimObject config + test + scripts) live under `sim/simx/gem5/` and `ci/gem5_test_vortex_*.py` + in this repo. `ci/gem5_install.sh` fetches a pinned upstream gem5 + release and copies/symlinks our SimObject into its source tree + before building. Versioning the integration alongside Vortex is what + makes it possible to review API-breaking changes in a single PR; + the legacy split across two repos is what froze `vortex_gem5` at a + two-year-old SimX. +6. **Author attribution.** The legacy `vortex_gem5` design (DMA-bouncing + through a pinned staging buffer, OPAE-shaped MMIO command set, ARM + SE-mode runtime) is Injae Shin's capstone work. The + re-implementation is a rewrite, not a port (§2), but each new file's + commit body cites the capstone report and the legacy commit + (`vortex_gem5@91dcf17`). + +--- + +## 2. Why the legacy `vortex_gem5` cannot be ported as-is + +### 2.1 The architectural mismatch + +`vortex_gem5` was built on pre-v3 SimX (`Arch`, `Processor*`, +single-step `run()`, `set_running(true)`, `VX_DCR_BASE_*` startup DCRs +broadcast to all cores). v3 explicitly retired all of those: + +| Concern | Legacy SimX (vortex_gem5) | SimX v3 (this branch) | +|---|---|---| +| Sizing | `Arch arch(NUM_THREADS, NUM_WARPS, NUM_CORES)` object | Macros (`NUM_THREADS`, etc.) 
— no `Arch` class | +| Top-level | `Processor(arch)` ctor with arg | `Processor()` no-arg ctor | +| Run model | `processor->run()` is one cycle | `processor.run()` blocks to completion | +| Single-cycle step | `processor->run()` per cycle from `proc_tick()` | does not exist — must be added (`Processor::cycle()`) | +| Kernel dispatch | `set_running(true)` + `VX_DCR_BASE_STARTUP_*` | `KMU::start()` + `VX_DCR_KMU_*` (startup + grid/block dims) | +| Cache flush | implicit in `run()` finish | explicit: `dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy)` per core before host read-back | +| Memory hierarchy | `MemSim` + `CacheSim` are timing-only, data sits in `MemBackend` (`Emulator`-side) | `Memory` + `Cache` carry data through `MemReq`/`MemRsp`; backing image is in `RAM` attached to `Memory` | +| Runtime layout | top-level `runtime/{stubarm,opaesimx}/` | reorganized under `sw/runtime/` per [master_merge §3](master_merge_v3_proposal.md) | + +So the **shape of the gem5 plug-in changes**: not "tick the legacy +single-cycle Processor" but "add a `cycle()` entry point to the v3 +Processor and call it from the gem5 SimObject," with KMU-style dispatch +and an explicit cache-flush before host read-back. + +### 2.2 Specific bugs in the legacy code + +A walk-through of `vortex_gem5/sim/{simx,opaesimx}/` and +`vortex_gem5/runtime/{stubarm,opaesimx}/` found the following defects. +Each is called out so the redesign does not re-introduce it. + +| # | File | Defect | Why it matters | +|---|---|---|---| +| B1 | `sim/simx/simx_device.cpp:122` (`proc_tick`) | Calls `processor_->run()` directly. On legacy SimX this was a single step; on v3 it would block until program completion. | The "tick per gem5 cycle" pattern simply won't work. We must add a real single-cycle `Processor::cycle()` (already required for SST). | +| B2 | `sim/simx/simx_device.cpp:111` (`start`) | `processor_->set_running(true)` — that API does not exist in v3. 
The KMU now drives execution and requires `VX_DCR_KMU_GRID_DIM_*` / `VX_DCR_KMU_BLOCK_DIM_*` to be written before the first cycle. | Even after re-plumbing, kernels won't launch without the KMU DCR setup (see `sim/simx/main.cpp:101–116`). | +| B3 | `sim/opaesimx/opae_simx.cpp:185, 199` (`read_mmio64`/`write_mmio64`) | Implementation is `*(uint64_t*)(GEM5_BASE_ADDR + offset)`, a raw host-pointer dereference into a fixed virtual address. | Only works when the host runtime and the gem5 device share an address space (i.e., when the host runtime is *not* actually inside gem5). It is a stand-in for the real path, not the real path. Cross-ISA simulation defeats the assumption: an ARM userspace process inside gem5 cannot dereference `0x20000000` and reach the device. The legacy code papers over this with a co-resident driver hack; v3 needs a real PIO/DMA path. | +| B4 | `sim/opaesimx/opae_simx.cpp:204–399` | Several hundred lines of commented-out CCI/AVS bus + Verilator (`device_->…`) plumbing left in place, referencing fields and types that do not exist in this file. | Dead code that obscures what the module actually does. Drop it; the new gem5 wrapper has no CCI bus to model. | +| B5 | `sim/opaesimx/opae_simx.cpp:71` (`dram_sim_` field) | DRAM model is constructed but never ticked or consulted after the gem5 hack landed. | Dead state. | +| B6 | `sim/opaesimx/opae_simx.cpp:103` (`pinned_alloc_`) | Uses `PIN_BASE_ADDR = 0x10000000` with `PINNED_MEM_SIZE = 0xFFFFFF` (16 MB), hardcoded. No bounds check beyond `MemoryAllocator::allocate` failure. | Tiny by design: large kernel inputs would silently fail. The v3 design should size from `GLOBAL_MEM_SIZE`/`ALLOC_BASE_ADDR` and surface OOM errors. | +| B7 | `runtime/opaesimx/vortex.cpp:324, 367` | `auto ls_shift = (int)std::log2(CACHE_BLOCK_SIZE);` uses float `log2` for an integer constant, then discards the result. | Cosmetic / dead, but a smell. Use `log2ceil(CACHE_BLOCK_SIZE)` from `sw/common/util.h`.
| +| B8 | `runtime/opaesimx/vortex.cpp:418–474` (`ready_wait`) | `nanosleep` call is **commented out**; the busy loop only decrements `timeout_ms` and never sleeps. On a long-running kernel inside gem5 SE-mode this saturates the simulated ARM core. | Either use the gem5 device's interrupt path (preferred — implementable as an MMIO doorbell) or restore the `nanosleep` so the ARM CPU is idle while the GPU runs. | +| B9 | `runtime/opaesimx/vortex.cpp:349–390` (`download`) | No cache-flush step before reading back results from device memory. | On v3, dirty lines must be drained via `dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy)` per core (see `sim/simx/main.cpp:194–197`, `sw/runtime/simx/vortex.cpp:191–197`) or the host sees stale data. | +| B10 | `runtime/opaesimx/vortex.cpp:478–489` (`dcr_write`) | OPAE protocol has `CMD_DCR_WRITE` but no `CMD_DCR_READ`. | The cache-flush fix above requires a `dcr_read` path. Current `sw/runtime/opae` already adds `CMD_DCR_READ` + `MMIO_DCR_RSP` — adopt the same shape on the gem5 device. | +| B11 | `runtime/stubarm/vortex.cpp:54` | `static callbacks_t g_callbacks;` global with `vx_dev_init(&g_callbacks)` resolved at link time. | Works for a single-device test but breaks `vx_dev_open` from being called concurrently from two host processes. Less critical for the gem5 use case (single device per simulation) but worth flagging. | +| B12 | `sim/simx/simx_device.cpp` (`Impl`) | Uses `std::future future_` for shutdown synchronization but `proc_tick()` calls `processor_->run()` directly on the caller thread. The mutex / future plumbing implies an async model that isn't actually used. | Confused concurrency contract. The v3 design must pick one: synchronous tick from the gem5 event loop (this proposal) **or** async run with a doorbell — not both. | +| B13 | `runtime/stubarm/Makefile:7` + `runtime/opaesimx/Makefile:9` | Cross-compiler hardcoded to `arm-linux-gnueabihf-g++` (32-bit ARM hard-float). 
| gem5 also models AArch64 ARMv8 and x86_64, and most contemporary ARM ports are 64-bit. The v3 build selects the compiler from a `HOST_ARCH` make variable (`x86_64`, `aarch64`, `armhf`); see Phase 4. | +| B14 | `runtime/opaesimx/vortex.cpp:489` (`dcr_write`) and `stubarm/vortex.cpp:139` | Both runtimes write to DCR via the OPAE protocol but no MMIO ordering / fence is established between DCR writes and the `CMD_RUN` MMIO. | Inside gem5 the host CPU model may reorder MMIO. Need an explicit barrier before `CMD_RUN` (per `HOST_ARCH`: `mfence` for x86, `dmb sy` for ARM). Phase 4 provides a `vortex_gem5_mmio_fence()` inline helper. | +| B15 | `sim/opaesimx/opae_simx.cpp:138–157` (`prepare_buffer`) | Returns `*buf_addr = (void*)buffer.ioaddr;`, which casts an integer device IO address back to a `void*`. | The runtime then dereferences this pointer to do `memcpy(staging_ptr_, host_ptr, size)` (line 322 of `runtime/opaesimx/vortex.cpp`). Same root cause as B3: it only works when host runtime and device share an address space. Under real gem5 the runtime must `mmap` the pinned region via a syscall the gem5 device intercepts, or the gem5 device must expose the pinned region as a PIO/DMA window. | + +Together B1, B2, B3, B6, B9, B14 and B15 mean the legacy integration as +literally written does not run a kernel correctly under v3 even after +the path renames are applied; it requires architectural rework, not +porting. + +### 2.3 What still ports as design intent + +These elements of the legacy paper's design intent are what we keep: + +- **OPAE-shaped MMIO command set.** `CMD_RUN`, `CMD_MEM_READ`, + `CMD_MEM_WRITE`, `CMD_DCR_WRITE`, `MMIO_CMD_TYPE`, `MMIO_CMD_ARG0..2`, + `MMIO_STATUS`. Add `CMD_DCR_READ` + `MMIO_DCR_RSP` per the v3 OPAE + runtime (B10). The kernel runtime under `sw/runtime/gem5/` reuses + this layout so the same `vortex.h` shim layer that drives `opae` + also drives `gem5`. +- **Pinned staging buffer pattern** for host↔device transfers.
A + fixed device-visible region of host address space; runtime + `memcpy`'s into it, device DMAs out of it. Sizing is dynamic + (allocate-on-demand) rather than the legacy fixed-16-MB chunk (B6). +- **Single-PIO-range device** registered to gem5 with the OPAE MMIO + offsets. The runtime issues 64-bit MMIO writes; the SimObject + decodes them in `write()` / `read()`. +- **The host SE-mode runtime** (`sw/runtime/gem5/`, native x86 or cross-compiled ARM) + shipped into gem5's SE-mode app, **NOT** a full-system Linux on the + guest. The paper makes this point explicitly and it is the + differentiator vs. NoMali (FS-only) and AMD GPU (FS-only). See + `capstone §IIC`. + +### 2.4 What needs a v3 redesign + +- **`sim/simx/simx_device.{cpp,h}`** — replace with + `sim/simx/gem5/vortex_gpgpu.{cpp,h}` (the SimObject wrapper) + plus reuse of the new `Processor::cycle()` API. The legacy file's + `Impl` class is the wrong shape (B1, B2, B12). +- **`sim/opaesimx/opae_simx.{cpp,h}`** — delete entirely. The legacy + module is a host-side OPAE stub whose `read_mmio64`/`write_mmio64` + do raw pointer arithmetic (B3, B15). The v3 design routes MMIO + through gem5's PIO port; there is no host-side stub. +- **`runtime/opaesimx/`** — delete. The OPAE-stub path was a + pre-gem5 debugging convenience; under v3 we test the gem5 device + end-to-end via a gem5 Python script (§4, Phase 5), not via a + co-resident driver. +- **`runtime/stubarm/`** — replace with `sw/runtime/gem5/`, + re-implemented against the same `callbacks.h` ABI as + `sw/runtime/simx`/`opae`/`rtlsim`, with cache-flush plumbed in + (B9), MMIO fences before `CMD_RUN` (B14), and a configurable ARM + cross-compiler target (B13). + +--- + +## 3. 
Target architecture + +``` + ┌───────────────────────────────────────────────┐ + │ gem5 simulation │ + │ ───────────────── │ + │ ./ci/gem5_test_vortex_hello.py │ + │ (gem5.opt is build/X86/gem5.opt or │ + │ build/ARM/gem5.opt; both supported) │ + │ │ + │ ┌─────────────┐ ┌─────────────────┐ │ + │ │ Host CPU │ ──PIO─▶ │ VortexGPGPU │ │ + │ │ (X86 or ARM,│ ◀─PIO── │ (DmaDevice ↓ │ │ + │ │ SE mode) │ │ PioDevice) │ │ + │ │ user │ │ ┌───────────┐ │ │ + │ │ binary: │ │ │ MMIO regs │ │ │ + │ │ hello + │ │ └───────────┘ │ │ + │ │ libvortex- │ │ ┌───────────┐ │ │ + │ │ gem5.so │ ──DMA─▶ │ │ Pinned │ │ │ + │ │ (native │ ◀─DMA── │ │ staging │ │ │ + │ │ for X86, │ │ │ buffer │ │ │ + │ │ cross- │ │ │ window │ │ │ + │ │ compiled │ │ └───────────┘ │ │ + │ │ for ARM) │ │ │ │ │ + │ └─────────────┘ │ ▼ │ │ + │ │ │ ┌───────────┐ │ │ + │ │ MemPort │ │ vortex:: │ │ │ + │ ▼ │ │ Processor │ │ │ + │ ┌─────────────┐ │ │ (SimX v3) │ │ │ + │ └─────────────┘ │ │ │ │ │ + │ │ │ Cluster[]│ │ │ + │ │ │ Cache │ │ │ + │ │ │ Memory ─┼──┼──┼─▶ RAM (Vortex VRAM, + │ │ └───────────┘ │ │ held inside the + │ │ ▲ │ │ device — separate + │ │ │ cycle() │ │ address space from + │ │ ┌┴──────────┐ │ │ gem5 DRAM) + │ │ │ tick │ │ │ + │ │ │ (gem5 │ │ │ + │ │ │ event) │ │ │ + │ │ └───────────┘ │ │ + │ └─────────────────┘ │ + └───────────────────────────────────────────────┘ +``` + +### 3.1 The plug-in boundary + +The Vortex side exposes **one** plug-in unit: `libvortex-gem5.so`. It +is built from the same `sim/simx/*.{cpp,h}` sources as the default +`simx` binary, plus a single new wrapper file +(`sim/simx/gem5/vortex_gpgpu.{cpp,h}`) that holds: + +- A `vortex::Gem5Wrapper` C++ class that owns a `vortex::Processor`, + a `vortex::RAM` (the device VRAM), and a thin `cycle()` entry + point — exactly mirroring `vortex::VortexSimulator` in + `sim/simx/sst/`. 
+- A C-ABI shim (`vortex_gem5_create()`, `vortex_gem5_tick()`, + `vortex_gem5_mmio_write64()`, `vortex_gem5_mmio_read64()`, + `vortex_gem5_dma_read()`, `vortex_gem5_dma_write()`, …) so the + gem5-side SimObject is decoupled from C++ ABI changes in + `vortex::Processor`. **The C ABI is the contract;** changing it + requires a coordinated update of the gem5-side SimObject. + +The gem5 side is **one** SimObject + **one** Python file, both shipped +in this repo at `sim/simx/gem5/`: + +- `vortex_gpgpu_dev.{cc,hh}` — subclasses `gem5::DmaDevice` (which + itself subclasses `PioDevice`). Holds an opaque + `vortex_gem5_handle_t`; on `tick()`, calls `vortex_gem5_tick()`. PIO + reads/writes decode the OPAE MMIO offsets and forward to + `vortex_gem5_mmio_*`. DMA reads/writes triggered by + `CMD_MEM_{READ,WRITE}` use gem5's `DmaPort` and copy bytes into the + device VRAM via `vortex_gem5_dma_*`. +- `VortexGPGPU.py` — `gem5.SimObject` definition with `pio_addr`, + `pio_size`, `pio_latency`, `dma_latency`, `clock`, `library` + (path to `libvortex-gem5.so`), and `kernel` (path to `*.vxbin` — + loaded into VRAM at boot, in lieu of the runtime upload path, for + smoke tests). + +`ci/gem5_install.sh.in` fetches a pinned gem5 release +(see §3.4 for version), copies the two files into +`/src/dev/vortex/`, drops a one-line `SConscript`, and runs +`scons build/ARM/gem5.opt`. + +**Nothing upstream of `vortex_gem5_create()` knows gem5 exists.** This +satisfies §1.3. + +### 3.2 The cycle interface + +`Processor::cycle()` does **not exist** in v3 today. It is a direct +prerequisite of both the SST integration (per +[sst_simx_v3_proposal.md §3.2](sst_simx_v3_proposal.md)) and this +proposal. 
The signature and shape are identical to what SST needs: + +```cpp +// processor.h — public additions +bool cycle(); // advance one cycle; returns false when nothing is running +Memory* memsim(); // for optional gem5/SST memory-mirroring hooks +``` + +```cpp +// processor.cpp — implementation +bool ProcessorImpl::cycle() { + if (!is_cycle_initialized_) { + SimPlatform::instance().reset(); + this->reset(); + kmu_->start(); // dispatch CTAs into the cluster + is_cycle_initialized_ = true; + } + SimPlatform::instance().tick(); + return this->any_running(); +} + +Memory* ProcessorImpl::memsim() { return memsim_.get(); } +``` + +The two pieces (`SimPlatform::reset()` → `start_kmu()` → +`SimPlatform::tick()` and `any_running()`) are already factored on +`Processor` from Round 6 DTM work. `cycle()` just packages them into a +single-cycle step. + +**Reuse from DTM work:** `start_kmu()` and `any_running()` are already +public on `Processor`. We add `cycle()` and `memsim()` and that is the +entire SimX-side API surface required by both SST and gem5. + +### 3.3 The MMIO command protocol + +Identical to `sw/runtime/opae` v3 (the OPAE driver), reusing +`hw/syn/altera/opae/vortex_afu.h`: + +| Offset | Name | Direction | Purpose | +|---|---|---|---| +| `MMIO_CMD_TYPE` | `CMD_*` | W64 | Dispatch one of: `MEM_READ`, `MEM_WRITE`, `RUN`, `DCR_WRITE`, `DCR_READ` | +| `MMIO_CMD_ARG0..2` | command-specific | W64 | DCR addr / device addr / size / value | +| `MMIO_STATUS` | bit0=busy | R64 | Polled by runtime's `ready_wait` | +| `MMIO_DCR_RSP` | response | R64 | Result of `CMD_DCR_READ` (used for cache-flush) | +| `MMIO_DEV_CAPS` / `MMIO_ISA_CAPS` | caps bitfield | R64 | Encoded device capabilities | + +The runtime issues commands by writing args first, then `CMD_TYPE` +(B14 fix: emit a `DMB SY` before the type write). 
The device latches +on `CMD_TYPE`, performs the action synchronously (PIO write returns +when the operation is enqueued, or completes synchronously for +fast ones like `DCR_WRITE`), and clears the status busy bit when done. + +`CMD_MEM_{READ,WRITE}` use the staging-buffer protocol from the +capstone paper Fig. 5 (§3.4 below). + +### 3.4 The staging-buffer protocol + +The gem5 device exposes a PIO-addressable register `MMIO_PINNED_BASE` +that returns the base address of a pinned region inside gem5's host +address space. The runtime, on `vx_mem_alloc`, lazily picks a slice of +that region as a staging buffer. + +For a `vx_copy_to_dev(host_ptr, dev_addr, size)`: +1. Runtime `memcpy(staging_buf, host_ptr, size)`. +2. Runtime writes `staging_buf_addr`, `dev_addr`, `size` to + `MMIO_CMD_ARG{0,1,2}`. +3. Runtime writes `CMD_MEM_WRITE` to `MMIO_CMD_TYPE`. +4. Device's PIO handler enqueues a `gem5::DmaPort::dmaAction()` read + from `staging_buf_addr` into a local scratch. +5. On DMA completion, the device copies the scratch bytes into Vortex's + `RAM` at `dev_addr` (via `RAM::write`). +6. Device clears the status busy bit. +7. Runtime polls `MMIO_STATUS` until busy=0. + +`vx_copy_from_dev` is the reverse, with **cache flush first** (B9): +the runtime issues `CMD_DCR_READ(VX_DCR_BASE_CACHE_FLUSH, cid)` for +every core before the `CMD_MEM_READ`. The device's DCR-read handler +plumbs through to `Processor::dcr_read`, which already invokes +`flush_caches()` for the cache-flush DCR +([processor.cpp:251–258](../../sim/simx/processor.cpp#L251)). + +This is the same protocol the v3 OPAE runtime already uses, so the +runtime under `sw/runtime/gem5/` differs from `sw/runtime/opae/` only +in: +- The `driver.{cpp,h}` backend (gem5 mmaps a `/dev/vortex_gem5` + character device path **OR**, in SE-mode, gem5 sets up the device's + PIO/DMA windows directly in the simulated process's address space — + see §3.6). 
+- The lack of an `fpgaPrepareBuffer` API (the device exposes the + pinned region itself; no per-call buffer allocation by an OPAE + layer). + +### 3.5 Build-time gating + +`USE_GEM5=1` make variable controls compilation of: +- `sim/simx/gem5/vortex_gpgpu.{cpp,h}` (the C ABI wrapper). +- Link target `libvortex-gem5.so` produced alongside `libsimx.so` + (mirrors the SST `libvortex.so` pattern in `sim/simx/Makefile`). + +`USE_GEM5=1` does **not** affect the default build: +`make -C sim/simx` (no flag) still produces a stand-alone `simx` +binary with no gem5 dep. Per §1.4. + +The host-side runtime supports both x86 (native) and ARM (cross- +compiled) targets via a `HOST_ARCH` switch: +``` +make -C sw/runtime/gem5 # x86 default +make -C sw/runtime/gem5 HOST_ARCH=x86_64 # explicit x86 +make -C sw/runtime/gem5 HOST_ARCH=aarch64 # AArch64 cross +make -C sw/runtime/gem5 HOST_ARCH=armhf # ARMv7 cross +``` +producing `libvortex-gem5-{x86_64,aarch64,armhf}.so`. Test scripts +select the matching `(gem5.opt, libvortex-gem5-*.so)` pair via the +`HOST_ARCH` make variable. Native x86 needs no toolchain install; ARM +requires `gcc/g++-aarch64-linux-gnu` (or `-arm-linux-gnueabihf` for +ARMv7), which `ci/gem5_install.sh` installs as part of Phase 0. + +### 3.6 gem5 SE-mode wiring + ISA selection + +**Host ISA: both x86 and ARM, equally first-class** (decision recorded +2026-05-16 after Phase 0 prototyping). Phase 0's `ci/gem5_install.sh` +builds `build/X86/gem5.opt` *and* `build/ARM/gem5.opt`; phases 4–6 +test both. Rationale: + +- **x86** is the path of least resistance for users — no + cross-toolchain, native `g++` builds `sw/runtime/gem5/`, faster + gem5 CPU model, and PCIe is canonical on x86 (relevant to the + Phase 5+ upgrade path below). +- **ARM** is the research-narrative path matching the capstone paper + (Injae Shin 2025) and actually-deployed ARM+accelerator HPC + platforms (Grace Hopper, Fugaku, Graviton, Apple Silicon). 
Kept + as a first-class matrix variant; not a stretch goal. + +Three MMIO/DMA paths exist; this proposal picks one for the initial +work and notes the others as future upgrades: + +| Path | Description | Status in this proposal | +|---|---|---| +| **1. SE-mode + custom PIO+DMA wiring** | The device is a `DmaDevice` subclass attached to `system.membus` at a configurable `pio_addr` (default `0x20000000`, matching the legacy paper). Host binary touches the address via `mmap`/inline asm. Works in both x86 SE-mode and ARM SE-mode. | **Phase 2–6: this is the design.** Matches legacy paper, lightweight, fast iteration. | +| **2. FS-mode + PCIe device** | Subclass `PciDevice` (which already inherits `DmaDevice`); BARs expose MMIO, DMA for staging. Full Linux boot inside gem5 with a tiny PCI kernel module to bind the device. | **Phase 5+ upgrade.** Realistic accelerator-modeling story expected by x86 users. The C ABI committed in Phase 2 is shape-compatible — `PciDevice` and the custom `DmaDevice` both use the same `vortex_gem5_dma_*` callbacks; only the gem5-side wrapper class differs. | +| **3. `/dev/vortex_gem5` pseudo-file** | The gem5 device implements `SyscallReturn open(...)` + `mmap` for a synthetic device path. Runtime `open("/dev/vortex_gem5", O_RDWR)` + `mmap`. | Out of scope. Closest to how real OPAE drivers work but requires a custom syscall handler in gem5; cost outweighs the benefit when Path 1 already works. | + +**Doorbell queues** are a Phase 7+ realism upgrade orthogonal to the +transport choice above. AMD GPU (gem5 `src/dev/amdgpu/`, derived +from `PciEndpoint`) and NVIDIA-style modern accelerators use a ring +buffer in host DRAM plus a single MMIO "doorbell" write per dispatch: +the host appends commands to the ring, then writes the new tail +offset to the doorbell register; the device asynchronously walks the +ring and processes commands. 
The Phase 2-6 design instead uses +**status polling** — the host writes args + `CMD_TYPE`, then polls +`MMIO_STATUS` until done — which matches the legacy OPAE FPGA driver. +Polling is fine for the capstone-paper scope (small kernels, one at +a time) but burns simulated cycles on the spin. If later research +wants batched-dispatch realism comparable to AMD GPU, the upgrade +swaps the OPAE MMIO command set for a ring + doorbell protocol; the +C ABI in Phase 2 stays compatible (a new `vortex_gem5_doorbell_ring(handle, tail)` +entry point alongside the existing `vortex_gem5_mmio_*`). + +### 3.7 gem5 version pinning + +`ci/gem5_install.sh.in` pins gem5 to v25.0.0 (the most recent stable +release as of 2026-05). The pinned tag goes in `VERSION` alongside +`TOOLCHAIN_REV` and `SST_VER` — bumps require a CI re-run on the +self-hosted runner first (small risk of API drift on gem5's +`DmaDevice`/`PioDevice` between major releases). **Picking and +validating this pin is the first deliverable of Phase 0** — every +other phase is a no-op if Phase 0 reveals that v25.0.0 no longer +supports SE-mode PIO mapping or the SimObject install path we depend +on. + +### 3.8 Why this is not just a copy of the SST pattern + +SST and gem5 are similar in shape (external simulator drives the +Vortex clock through a C++ wrapper around `Processor::cycle()`) but +differ in three load-bearing ways: + +1. **The host process is simulated under gem5.** Under SST the host + "process" is the SST Python script itself, running natively on the + developer's machine. Under gem5 the host is a userspace process + (x86 or ARM, per §3.6) running inside the gem5 model. So the gem5 + integration also needs a host-side runtime under `sw/runtime/gem5/` + (native compile for x86, cross-compile for ARM); SST does not. + (This is the bulk of the work that makes gem5 the bigger project — + see §9 effort estimate.) +2. 
**Memory is in two address spaces.** Under SST, the SimX `Processor` + and any optional SST memHierarchy share the same simulator. Under + gem5, the host CPU's DRAM is a gem5 `AddrRange`, the Vortex VRAM is + a `RAM` inside the device, and the only way bytes cross between + them is via DMA through the device. The staging-buffer protocol + (§3.4) implements this; SST has no equivalent. +3. **PIO bus integration.** SST's `StandardMem` interface is the + only one we plug into; gem5 has separate `PioPort` and `DmaPort` + with different timing models. The wrapper must manage both. + +--- + +## 4. Phasing + +Each phase is independently shippable and validated. The work follows +the same shape as the SST integration in +[sst_simx_v3_proposal.md §4](sst_simx_v3_proposal.md): **environment +first**, API + library second, gem5-side wiring third, ARM runtime +fourth, CI last. + +### Phase 0 — gem5 environment + API survey *(derisking; nothing else can start until this is done)* + +The legacy `vortex_gem5` was built against a forked gem5 that no +longer exists publicly. Before we design the C ABI in Phase 2 or +write a single line of `DmaDevice` glue in Phase 3, we need a +known-good gem5 build on the bench so the API surface we are about +to commit to is **real**, not assumed-from-headers-we-haven't-read. +This is the "solve gem5 setup first" phase. + +Concretely: + +- **Pick and pin the gem5 version.** Default target: v25.0.0.1 + (patch release on top of v25.0.0, most recent stable as of 2026-05). + Pin the tag in `VERSION` alongside `TOOLCHAIN_REV` and `SST_VER`: + ``` + GEM5_REV=v25.0.0.1 + ``` +- **Write `ci/gem5_install.sh.in`** (no Vortex integration yet — just + the install). 
Mirrors the structure of `ci/sst_install.sh.in`: + - `apt install scons python3-dev python3-pip libprotobuf-dev + protobuf-compiler libprotoc-dev libgoogle-perftools-dev m4 + libboost-all-dev gcc-aarch64-linux-gnu g++-aarch64-linux-gnu` + (gem5's documented build deps + ARM cross-toolchain for the ARM + matrix variant). + - Fetch gem5 working tree at `$GEM5_REV` into `$TOOLDIR/gem5`. + - `scons build/X86/gem5.opt -j$(nproc)` and + `scons build/ARM/gem5.opt -j$(nproc)` — **both ISAs by default** + per the dual-ISA decision in §3.6. Targets selectable via + `GEM5_TARGETS="X86"` / `"ARM"` / `"X86 ARM"`. + - Export `GEM5_HOME=$TOOLDIR/gem5` to `~/.bashrc`. +- **Validate the X86 native compiler produces SE-mode binaries.** + Trivial — `gcc -static -o /tmp/hello-x86 sim/simx/gem5/hello.c` + then run under `gem5.opt configs/example/gem5_library/arm-hello.py` + -shape config (substituting `ISA.X86`). Confirm exit code 0 and + the expected stdout. +- **Validate the ARM cross-toolchain produces SE-mode binaries.** + Cross-compile `hello.c` with `aarch64-linux-gnu-gcc -static -o + /tmp/hello-arm`, run under + `build/ARM/gem5.opt configs/example/gem5_library/arm-hello.py` + (or the deprecated SE script). Confirms the cross-toolchain + produces something gem5 ARM-mode can load. +- **Read the gem5 source for the API surface we are about to use** + and record findings in a short scratch file + `sim/simx/gem5/gem5_api_notes.md` (not committed to docs/, just a + Phase 0 deliverable): + - `src/dev/io_device.hh` — `PioDevice::read`/`write` signatures + in v25.0.0. Compare to what the legacy paper assumed. + - `src/dev/dma_device.hh` — `DmaDevice::dmaAction`, `DmaPort` + timing model. Confirm 64-bit address support, async completion + callback shape. + - `src/python/m5/objects/Device.py` — SimObject Python bindings. + Confirm that out-of-tree `src/dev//SConscript` is + picked up by `scons build/ARM/gem5.opt` (this is the install + mechanism we rely on in Phase 3). 
+ - `configs/example/se.py` — how SE-mode wires a CPU to a + `Workload`. Confirm that we can attach a `PioDevice` and have + the SE-mode loader map its PIO range into the workload's address + space (the legacy paper's `0x20000000` magic). If this is no + longer supported, the design changes — better to know now than + in Phase 3. +- **Smoke-build a trivial out-of-tree SimObject** to prove the + install mechanism end-to-end. Three files + (`Dummy.{cc,hh,py}` + `SConscript`) under `sim/simx/gem5/dummy/`, + installed by `sim/simx/gem5/install.sh` (Phase 0 only ships the + installer; the real SimObject lands in Phase 3). After + `ci/gem5_install.sh` re-runs, `gem5.opt --list-sim-objects` shows + `Dummy`. Delete `dummy/` once verified — it was scaffolding. + +**Validation:** +- `ci/gem5_install.sh` finishes successfully on the self-hosted + runner. Wall time recorded in `gem5_api_notes.md` (drives CI + caching strategy in Phase 6). +- `$GEM5_HOME/build/ARM/gem5.opt configs/example/se.py + --cmd ./hello-arm` exits 0. +- `gem5.opt --list-sim-objects` lists the dummy SimObject installed + via `sim/simx/gem5/install.sh`. +- `gem5_api_notes.md` documents the `DmaDevice` / `PioDevice` / + `EventFunctionWrapper` signatures we will commit to in Phase 2's + C ABI design. + +**Why this is its own phase:** if any of those validations fails +(e.g. gem5 v25 has dropped SE-mode PIO mapping, or the SimObject +install mechanism has changed), the rest of the proposal needs +redesign before code lands. Phase 0 is a ~1-day gate, not a tracked +deliverable; everything downstream depends on its outputs. + +### Phase 1 — `Processor::cycle()` + `Memory*` accessor + +Prerequisite shared with SST. Can run in parallel with Phase 0 +(no gem5 dependency) and lands first into the SimX-side codebase. + +- Add `Processor::cycle()` and `Memory* Processor::memsim()` as in + §3.2. This is a ~50-line patch to `processor.{cpp,h}` and + `processor_impl.h` plus an `is_cycle_initialized_` bool. 
+- Add `Memory::set_pre_send_hook()` (already in v3 per + `sim/simx/mem/memory.h:42` — verify still there; if so, this part + of Phase 1 is a no-op). +- Update SST's `vortex_simulator.cpp` to use the new public + `Processor::cycle()` API (currently calls `proc_->cycle()` which + does not compile against `processor.h` HEAD — see + `sim/simx/sst/vortex_simulator.cpp:64`). **This is a pre-existing + bug that Phase 1 fixes for both integrations.** + +**Validation:** `make -C sim/simx` (default), `make -C sim/simx +USE_SST=1`, and `make -C sim/simx USE_GEM5=1` all build. SST tests +that previously failed to link now link and run (`sst +ci/sst_test_vortex_hello.py` passes). + +### Phase 2 — `libvortex-gem5.so` + C ABI + +**Prerequisite: Phase 0 complete.** The C ABI is designed *against* +the `DmaDevice`/`PioDevice` shapes recorded in +`gem5_api_notes.md`, not from headers we haven't read. + +- Create `sim/simx/gem5/vortex_gpgpu.{cpp,h}` mirroring + `sim/simx/sst/vortex_simulator.{cpp,h}` shape: + - Owns a `Processor`, a `RAM` (device VRAM at `MEM_PAGE_SIZE`). + - Exposes a C ABI (`vortex_gem5_*`) sufficient for the gem5 device + to MMIO/DMA/tick it. ABI signatures match what gem5's + `DmaDevice::dmaAction` and `PioDevice::read`/`write` need to + call into (per Phase 0 survey). +- Add `USE_GEM5=1` build target to `sim/simx/Makefile` producing + `libvortex-gem5.so` (no SST symbols; no `sst-core` link). Pattern: + duplicate the `ifeq ($(USE_SST),1)` block. +- Add a tiny in-process smoke driver + `sim/simx/gem5/gem5_smoke_main.cpp` (built with the lib) that: + 1. Loads a `.vxbin` via the C ABI. + 2. Ticks until `cycle()` returns false. + 3. Reads the MPM exit code via DCR_READ. + + This is the "library compiles and a kernel runs through it without + gem5 installed" smoke test (§6.2). + +**Validation:** +- `make -C sim/simx USE_GEM5=1` builds. +- `LD_LIBRARY_PATH=. ./gem5_smoke hello.vxbin` returns 0. 
+- `make -C sim/simx` (no flag) still builds and `./simx hello.vxbin` + returns 0 (no regression on default). + +### Phase 3 — gem5 SimObject + Python config + +**Prerequisite: Phases 0 + 2 complete.** The install mechanism is +already proven by Phase 0's dummy SimObject; this phase replaces +the dummy with the real device. + +- `sim/simx/gem5/vortex_gpgpu_dev.{cc,hh}` — the gem5 `DmaDevice` + subclass. PIO `read`/`write` decode MMIO offsets and call + `vortex_gem5_mmio_*`. DMA actions triggered by `CMD_MEM_*`. A + registered `EventFunctionWrapper` re-schedules itself every + `clock_period_ticks()` and calls `vortex_gem5_tick()`. +- `sim/simx/gem5/VortexGPGPU.py` — Python SimObject definition. +- `sim/simx/gem5/SConscript` — for gem5's scons build. +- `sim/simx/gem5/install.sh` — copies the four files above into + `/src/dev/vortex/`. (Phase 0 already wrote this for the + dummy SimObject; just extend it.) +- Update `ci/gem5_install.sh.in` to re-run `install.sh` and rebuild + `build/ARM/gem5.opt` after the Vortex SimObject lands. + +**Validation:** `ci/gem5_install.sh` succeeds with the real +SimObject installed. `gem5.opt --list-sim-objects` shows +`VortexGPGPU`. `gem5.opt configs/example/se.py --help` accepts +`VortexGPGPU` parameters. + +### Phase 4 — Host runtime (`sw/runtime/gem5/`, x86 + ARM) + +- New backend mirroring `sw/runtime/opae/` shape: + - `vortex.cpp` — implements the `vx_*` callbacks against the OPAE + MMIO protocol (§3.3), but the `driver.{cpp,h}` underneath does + raw `mmap`/MMIO writes to the PIO address rather than calling + `libopae`. + - `Makefile` — selects compiler from `HOST_ARCH`: + - `x86_64` (default): native `g++` + - `aarch64`: `aarch64-linux-gnu-g++` + - `armhf`: `arm-linux-gnueabihf-g++` +- Cache-flush integration (B9): the v3 `download` path issues + `CMD_DCR_READ(VX_DCR_BASE_CACHE_FLUSH, cid)` per core before + `CMD_MEM_READ`. 
+- MMIO ordering fence (B14): emit the right barrier for `HOST_ARCH`: + - `x86_64`: `__asm__ volatile ("mfence" ::: "memory")` + - `aarch64`: `__asm__ volatile ("dmb sy" ::: "memory")` + - `armhf`: `__asm__ volatile ("dmb sy" ::: "memory")` + Provide a `vortex_gem5_mmio_fence()` inline helper that compiles + to the right barrier per `HOST_ARCH`. +- Multi-target build (B13 obsolete; replaced by clean multi-target + support): `HOST_ARCH` make variable. + +**Validation:** +- `make -C sw/runtime/gem5` (default `HOST_ARCH=x86_64`) builds. + `file build/sw/runtime/libvortex-gem5-x86_64.so` confirms x86-64 + ELF. +- `make -C sw/runtime/gem5 HOST_ARCH=aarch64` builds (requires + cross-toolchain, installed by Phase 0's `ci/gem5_install.sh`). + `file build/sw/runtime/libvortex-gem5-aarch64.so` confirms + AArch64 ELF. + +### Phase 5 — End-to-end gem5 test + +- `ci/gem5_test_vortex_hello.py` — gem5 Python config that wires: + - A `System` with one `TimingSimpleCPU` core in SE mode (host ISA + selected at runtime via `--host-arch=x86|arm`). + - A `VortexGPGPU` device on `system.membus` at + `pio_addr=0x20000000`, mapped into the process's address space. + - The native-or-cross-compiled test binary + (`tests/kernel/hello/hello` re-linked against the matching + `libvortex-gem5-{x86_64,aarch64}.so`) as the SE-mode workload. +- `ci/gem5_test_vortex_vecadd.py` — same with a vecadd kernel that + actually exercises DMA in both directions and the cache-flush path. +- Add a top-level wrapper test in `tests/regression/gem5/` (mirrors + `tests/regression/dxa/`) that builds the kernels and invokes the + Python scripts for both `HOST_ARCH=x86_64` and `HOST_ARCH=aarch64`. + +**Validation:** +- `build/X86/gem5.opt ci/gem5_test_vortex_hello.py --host-arch=x86` + exits with code 0 and the expected `Hello World` on stdout. +- `build/ARM/gem5.opt ci/gem5_test_vortex_hello.py --host-arch=arm` + exits with code 0 and the expected `Hello World` on stdout. 
+- Both `ci/gem5_test_vortex_vecadd.py` variants exit 0 with the + vecadd result buffer matching the CPU-computed reference (checked + by the test binary itself). + +### Phase 6 — CI integration + +- Add `gem5()` function to `ci/regression.sh.in` (mirroring `sst()` + on line ~80): + ```bash + gem5() + { + echo "begin gem5 tests..." + + make -C sim/simx USE_GEM5=1 + make -C tests/kernel + + # X86 default: native compile, no cross-toolchain needed. + make -C sw/runtime/gem5 HOST_ARCH=x86_64 + cp sim/simx/libvortex-gem5.so $GEM5_HOME/build/X86/ + + timeout 120 $GEM5_HOME/build/X86/gem5.opt \ + ci/gem5_test_vortex_hello.py --host-arch=x86 + timeout 120 $GEM5_HOME/build/X86/gem5.opt \ + ci/gem5_test_vortex_vecadd.py --host-arch=x86 + + # ARM matrix entry — requires gcc-aarch64-linux-gnu (installed + # by ci/gem5_install.sh in Phase 0). + if [ -n "$VORTEX_GEM5_ARM" ]; then + make -C sw/runtime/gem5 HOST_ARCH=aarch64 + cp sim/simx/libvortex-gem5.so $GEM5_HOME/build/ARM/ + + timeout 120 $GEM5_HOME/build/ARM/gem5.opt \ + ci/gem5_test_vortex_hello.py --host-arch=arm + timeout 120 $GEM5_HOME/build/ARM/gem5.opt \ + ci/gem5_test_vortex_vecadd.py --host-arch=arm + fi + + echo "gem5 tests done!" + } + ``` + Per `feedback_test_timeout_120s.md`, every test invocation is + `timeout 120`-capped. ARM is opt-in via `VORTEX_GEM5_ARM=1` so + hosted CI without the ARM toolchain still passes; the self-hosted + runner sets the env var. +- Add `gem5-x86` and `gem5-arm` matrix entries to + `.github/workflows/ci.yml` (both run on the self-hosted runner + only, per + [`project_ci_machine.md`](../../../../.claude/projects/-home-blaisetine-dev/memory/project_ci_machine.md); + the hosted runners do not have enough resources for a full + gem5 build). +- Add `ci/gem5_install.sh` to the Apptainer recipe + ([`miscs/apptainer/vortex.def`](../../miscs/apptainer/vortex.def)) + so the .sif has gem5 pre-installed. 
**Out of scope for Phase 6; + see §8.** + +**Validation:** `./ci/regression.sh --gem5` runs both +`gem5_test_vortex_*.py` cleanly on the self-hosted runner. + +### Phase 7 — Documentation + +- `docs/gem5_integration.md`: + - How to install gem5 v25.0.0 (point at `ci/gem5_install.sh`). + - How to build with `USE_GEM5=1`. + - How to cross-compile the ARM runtime + kernels. + - How to write a gem5 Python script that drives `VortexGPGPU`. + - The single-source-of-truth invariant (§1.1) and the cache-flush + contract (§3.4) for future hackers who might be tempted to skip + the flush "because it's fast". + +--- + +## 5. Authorship / history mechanics + +- `sim/simx/gem5/vortex_gpgpu.{cpp,h}` and the gem5-side + `vortex_gpgpu_dev.{cc,hh}` + `VortexGPGPU.py`: **new files**, no + upstream equivalent. Commit body cites: + > Replaces legacy `vortex_gem5/sim/simx/simx_device.{cpp,h}` + > (Injae Shin, UCLA 2025-05-22 commit 91dcf17) and the gem5-side + > SimObject described in his capstone report. + > Re-implemented for SimX v3 Processor::cycle() API. Original + > design intent (OPAE MMIO + pinned staging buffer + ARM SE-mode + > runtime) preserved. + +- `sw/runtime/gem5/`: **new files** mirroring `sw/runtime/opae/`'s + shape. Same authorship attribution as above; the file-level + similarity is to `sw/runtime/opae`, not to `runtime/opaesimx` from + the legacy tree (which has the bugs catalogued in §2.2). + +- `ci/gem5_install.sh.in` and `ci/gem5_test_vortex_*.py`: new files; + follow the structure of `ci/sst_install.sh.in` and + `ci/sst_test_vortex_*.py`. `ci/gem5_install.sh.in` lands in + Phase 0 (initially installing the dummy SimObject); the test + scripts land in Phase 5. + +- `Processor::cycle()` / `Processor::memsim()`: new public API on + `Processor`, lands in Phase 1. Single commit on the simx_v3 line; + mentioned as a prerequisite of both SST and gem5 integrations in + the commit body. 
+
+- `sim/simx/gem5/gem5_api_notes.md`: Phase 0 deliverable, scratch
+  notes only, **not** committed to `docs/`. Captures the gem5
+  v25.0.0 API surface our C ABI design depends on; deleted once
+  Phase 2 commits the C ABI itself.
+
+This is consistent with the rule established in
+[`feedback_keep_ours_in_merge.md`](../../../../.claude/projects/-home-blaisetine-dev/memory/feedback_keep_ours_in_merge.md):
+the legacy code is not a "theirs" we apply; it is a prior design that
+informs our redesign. Credit the designer in the body; do not pretend
+the bits are a port.
+
+---
+
+## 6. Validation
+
+Each phase ends with the validation listed in §4. Across phases the
+acceptance criteria are:
+
+1. **No-gem5 build identical.** `make -C sim/simx` (default flags)
+   produces a binary identical in behavior to today's on the
+   regression suite (io_addr, arith, vecadd, mpi_vecadd, tensor*,
+   dxa, dtm). The Phase 1 `Processor::cycle()` addition must not
+   change `Processor::run()` semantics; verify by trace-diffing
+   `vecadd` before and after Phase 1.
+
+2. **In-process smoke (no gem5 needed).** `gem5_smoke hello.vxbin`,
+   the Phase 2 driver, runs the same kernels the `simx` binary runs
+   and produces matching output. This is the unit-test layer that
+   shakes out C-ABI breakage without requiring gem5 to be installed
+   beyond what Phase 0 already set up.
+
+3. **End-to-end gem5 PASS.** Both `gem5_test_vortex_hello.py` and
+   `gem5_test_vortex_vecadd.py` exit 0 under the pinned gem5 v25.0.0.1,
+   on *both* `build/X86/gem5.opt` and `build/ARM/gem5.opt`, each
+   within its 120 s `timeout` cap. The pin and the install path are
+   both already validated by Phase 0; this validation just exercises
+   the real `VortexGPGPU` SimObject end-to-end.
+
+4. **No `core->mem_read` / `core->mem_write` regressions.** Phase 5
+   of v3 forbids those
+   ([simx_v3_proposal.md §3.3](simx_v3_proposal.md)).
+   The grep gate from
+   [master_merge_v3_proposal.md §8 R1](master_merge_v3_proposal.md)
+   applies here: every commit must pass
+   `git diff .. -- sim/simx/ | grep -E 'core->mem_(read|write)' | wc -l == 0`.
+
+5. **Single source of truth check.** The gem5 device's pinned region
+   is `RAM`-backed (i.e., a slice of host memory exposed to gem5's
+   DRAM AddrRange via `mmap`); Vortex's VRAM is the `RAM` attached to
+   `Memory` inside `vortex::Processor`. **There is no shadow image.**
+   `vortex_gem5_dma_{read,write}` copies bytes between the two via
+   `RAM::read`/`RAM::write` — no additional buffer level. Mistakes
+   here re-introduce the §1.1 violation.
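
The "no shadow image" rule in item 5 can be illustrated with a minimal sketch. Everything here is hypothetical: `ToyRam` is a stand-in that only mirrors the `read`/`write` shape this proposal attributes to `RAM`, and `sketch_dma_write` / `sketch_dma_read` are illustrative names, not the real `vortex_gem5_dma_*` entry points. The point is only that each direction is a single copy between the pinned region and device VRAM, with no intermediate buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the device VRAM; not the real vortex::RAM.
class ToyRam {
public:
  explicit ToyRam(std::size_t size) : bytes_(size, 0) {}

  // Copy `size` bytes from `src` into VRAM at `addr`.
  void write(const void* src, uint64_t addr, uint64_t size) {
    check(addr, size);
    std::memcpy(bytes_.data() + addr, src, size);
  }

  // Copy `size` bytes from VRAM at `addr` into `dst`.
  void read(void* dst, uint64_t addr, uint64_t size) const {
    check(addr, size);
    std::memcpy(dst, bytes_.data() + addr, size);
  }

private:
  // Reject accesses that fall outside the backing store.
  void check(uint64_t addr, uint64_t size) const {
    if (addr > bytes_.size() || size > bytes_.size() - addr)
      throw std::out_of_range("ToyRam access out of range");
  }

  std::vector<uint8_t> bytes_;
};

// Host -> device: one copy, straight from the pinned region into VRAM.
void sketch_dma_write(ToyRam& vram, uint64_t dev_addr,
                      const uint8_t* pinned, uint64_t size) {
  assert(pinned != nullptr || size == 0);
  vram.write(pinned, dev_addr, size);
}

// Device -> host: one copy, straight from VRAM into the pinned region.
void sketch_dma_read(const ToyRam& vram, uint64_t dev_addr,
                     uint8_t* pinned, uint64_t size) {
  assert(pinned != nullptr || size == 0);
  vram.read(pinned, dev_addr, size);
}
```

A round trip (`sketch_dma_write` followed by `sketch_dma_read` at the same `dev_addr`) should return the original bytes; any design that needs a second device-side image to make that true has reintroduced the §1.1 violation.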
+
+---
+
+## 7. Risks
+
+| # | Risk | Mitigation |
+|---|---|---|
+| R1 | gem5 v25.0.0 `DmaDevice` API drifts in v26+. | Pin in `ci/gem5_install.sh.in` (Phase 0). Document the pin in `docs/gem5_integration.md`. CI catches regressions on bump. |
+| R2 | ARM cross-compiler not available in the Apptainer recipe. | Phase 6 says gem5 CI is on the self-hosted runner only, which already has the ARM toolchain per [`project_ci_machine.md`](../../../../.claude/projects/-home-blaisetine-dev/memory/project_ci_machine.md). Apptainer absorption is out of scope (§8). |
+| R3 | `MMIO_PINNED_BASE` PIO range collides with another gem5 device's PIO range. | Pick a default (`0x20000000`, matching the legacy paper) but make it a Python-configurable parameter (`pio_addr`). Phase 0 confirms the default is reachable from SE-mode in v25.0.0; document collisions in the integration guide. |
+| R4 | The gem5 ARM CPU model reorders MMIO writes, breaking the args-then-CMD_TYPE protocol (B14). | `DMB SY` (AArch64) or `dmb sy` (ARMv7) before `CMD_TYPE` write in the runtime. Add a regression test that issues a back-to-back `CMD_MEM_WRITE` + `CMD_RUN` and verifies the kernel observed the correct args. |
+| R5 | Future contributor re-introduces the host-pointer-MMIO hack (B3) "for convenience". | This proposal explicitly deletes that abstraction (§2.4). The follow-up `docs/gem5_integration.md` (Phase 7) should call this out. |
+| R6 | `Processor::cycle()` for a never-launched kernel hangs (no `kmu_->start()` because `is_cycle_initialized_` was never reset). | Reset is implicit on first `cycle()`. If a second kernel is launched in the same device lifetime (rare; supported by gem5 only for back-to-back tests), the gem5 device's `CMD_RUN` handler must call a new `Processor::reset_cycle()` that clears `is_cycle_initialized_`. Add this in Phase 2. |
+| R7 | The cross-compiled ARM `libvortex-gem5.so` and the gem5-loaded `libvortex-gem5.so` (x86) have the same SONAME and get confused at install time. | Suffix the ARM runtime build (`libvortex-gem5-aarch64.so`) and keep the gem5-loaded device library unsuffixed (`libvortex-gem5.so`). Document in Phase 2+4. |
+| R8 | gem5's `DmaPort` request size is unbounded; a 1 GB `CMD_MEM_WRITE` would burn simulated time. | Cap per-transaction size at 1 MB in the device's `CMD_MEM_*` handler; chunk larger requests into multiple DMA actions. Mirrors how the OPAE `fpgaPrepareBuffer` page-aligns transfers. |
+| R9 | Cache flush via `CMD_DCR_READ` returns synchronously per core; for `NUM_CORES * NUM_CLUSTERS = 16` that is 16 PIO round-trips per download. | Acceptable for Phase 5; can be batched into a single `CMD_FLUSH_ALL` MMIO later if measured to hurt. |
+| R10 | The gem5 SimObject install (`sim/simx/gem5/install.sh`) modifies the gem5 source tree in place; rebuilds can leave stale artifacts. | `install.sh` is idempotent (copies, doesn't patch); `ci/gem5_install.sh` does a clean `scons -c` before re-build on toolchain version mismatch. Phase 0 proves the install path end-to-end with a dummy SimObject before any real code depends on it. |
+| R11 | Phase 0 reveals gem5 v25.0.0 has dropped SE-mode PIO mapping (the legacy `0x20000000` magic). | Switch design to the `/dev/vortex_gem5` pseudo-file path (§3.6 Path 3) before Phase 2 commits the C ABI. Cost: ~1 week added to Phase 0 redesign window. Acceptable because Phase 0 is explicitly a gate; no downstream phase has shipped code yet. |
+| R12 | Phase 0 install takes hours on first run; blocks parallel work. | Cache the `$TOOLDIR/gem5/build` directory in CI the same way SST and toolchain caches work. Self-hosted runner's local toolchain dir survives across runs. |
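
The R8 mitigation (cap each DMA transaction at 1 MB and chunk larger requests) can be sketched as a pure helper that turns one `CMD_MEM_*` request into a list of bounded transfers. This is a sketch under stated assumptions: `DmaChunk`, `split_dma`, and `kMaxDmaChunk` are hypothetical names, and the real device handler would presumably issue one gem5 DMA action per returned chunk rather than materialising a vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One bounded DMA transfer (hypothetical type, for illustration only).
struct DmaChunk {
  uint64_t host_addr;  // address inside the pinned staging region
  uint64_t dev_addr;   // address inside Vortex VRAM
  uint64_t size;       // bytes in this chunk, never above the cap
};

// 1 MB per-transaction cap, as proposed in the R8 mitigation.
constexpr uint64_t kMaxDmaChunk = 1ull << 20;

// Split [host_addr, host_addr + size) into chunks of at most `max_chunk`
// bytes, advancing the host and device addresses in lock-step.
std::vector<DmaChunk> split_dma(uint64_t host_addr, uint64_t dev_addr,
                                uint64_t size,
                                uint64_t max_chunk = kMaxDmaChunk) {
  assert(max_chunk > 0);
  std::vector<DmaChunk> chunks;
  uint64_t offset = 0;
  while (offset < size) {
    const uint64_t n = std::min(max_chunk, size - offset);
    chunks.push_back({host_addr + offset, dev_addr + offset, n});
    offset += n;
  }
  return chunks;
}
```

Keeping the split logic free of gem5 types means it can be unit-tested from the in-process smoke driver without a gem5 install; the busy bit would then be cleared only after the last chunk completes.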
+
+---
+
+## 8. Out of scope
+
+- **Apptainer integration.** Adding gem5 + the ARM cross-toolchain
+  to `miscs/apptainer/vortex.def` is a separate concern. Until that
+  is done, `apptainer-ci.yml`'s matrix should not include `gem5`. The
+  self-hosted runner runs the gem5 matrix entry from `ci.yml`; the
+  Apptainer pipeline skips it. See
+  [`apptainer-ci.yml` policy notes](../../.github/workflows/apptainer-ci.yml).
+
+- **Full-system Linux on gem5.** The capstone paper restricts itself
+  to SE-mode (per the paper's §IIC: "gem5-Vortex's implementation
+  allows users to use gem5's system call emulation (SE) mode"). This
+  proposal does the same. FS-mode requires booting a Linux kernel
+  inside gem5 with a Vortex device driver — possible, but a separate
+  redesign that intersects with kernel-mode driver work the project
+  has not started.
+
+- **Multi-device simulation.** One `VortexGPGPU` per gem5 system.
+  Multi-device support requires per-instance PIO ranges and a runtime
+  side that supports `vx_dev_open` returning >1 handle — the legacy
+  `g_callbacks` global (B11) blocks this on the runtime side, and
+  the device side needs per-instance state isolation. Defer.
+
+- **AMD GPU / NoMali comparison.** The capstone paper compares
+  gem5-Vortex to NoMali (stub GPU) and AMD GPU (full-system). Those
+  comparisons live in the paper; reproducing them as benchmarks is
+  out of scope. Comparing performance to SimX standalone or to the
+  SST integration is also out of scope — separate analysis work.
+
+- **DMA performance modeling.** The capstone paper §V measures DMA
+  delay variation per kernel size. Replicating that as a CI
+  performance gate is out of scope; could be a follow-up perf
+  proposal once the integration is stable.
+
+- **SST + gem5 simultaneous.** Both integrations replace different
+  parts of the harness; running them together is not a use case
+  anyone has asked for. The build flags are mutually exclusive:
+  `sim/simx/Makefile` rejects `USE_SST=1` combined with `USE_GEM5=1`.
+
+- **gem5 fork branch.** We do not maintain a long-lived fork of gem5.
+  `ci/gem5_install.sh` fetches a clean release tarball and applies
+  our SimObject; if the user wants a persistent gem5 working tree,
+  that is their setup. Avoids the "fork rot" that froze
+  `vortex_gem5`.
+
+- **Runtime gem5/non-gem5 switching.** Keep `USE_GEM5=1` as a
+  build-time switch. A runtime switch would require both `Processor`
+  and a gem5 wrapper in every binary plus a factory; not worth the
+  maintenance cost for a single-device research integration.
+
+---
+
+## 9. Estimated effort
+
+Based on the SST integration in
+[sst_simx_v3_proposal.md §9](sst_simx_v3_proposal.md) (~15–28 h):
+
+- **Phase 0** (gem5 env + API survey + dummy SimObject install):
+  **6–10 h estimated; ✅ COMPLETE 2026-05-16** in ~3 h of
+  attended work + ~25 min of unattended scons build. The wall time to
+  install gem5 was 13 min (ARM) + 11 min (X86) parallel on the
+  self-hosted 64-core runner. All six validations
+  (see `sim/simx/gem5/gem5_api_notes.md`) pass on both ISAs.
+  Key discoveries committed: (1) SE-mode PIO attachment is
+  possible but requires bypassing the `SimpleBoard` high-level
+  API; (2) out-of-tree SimObject install needs **no** top-level
+  SConstruct patch — pure `cp -r`; (3) PCIe (Path 2 in §3.6) is
+  a clean Phase 5+ upgrade because `PciDevice` inherits
+  `DmaDevice` and shares the same C ABI surface.
+- **Phase 1** (`Processor::cycle()` + `memsim()`): **1–2 h estimated;
+  ✅ COMPLETE 2026-05-16** in ~1 h. ~50-line patch to
+  `processor.{cpp,h}` + `processor_impl.h`. Default `make -C
+  sim/simx` and `USE_SST=1` both build clean; `simx hello.vxbin`
+  prints `#0: Hello World!`. **Bonus:** the SST integration was
+  previously broken at the `proc_->cycle()` call site
+  (`sim/simx/sst/vortex_simulator.cpp:64`) and would not link; with
+  Phase 1 in place, `sst ci/sst_test_vortex_hello.py` runs
+  end-to-end and exits cleanly at 4.643 µs simulated time.
+- **Phase 2** (`libvortex-gem5.so` + C ABI + in-process smoke):
+  **4–6 h estimated; ✅ COMPLETE 2026-05-16** in ~1.5 h. Files added:
+  `sim/simx/gem5/vortex_gpgpu.{h,cpp}` (the C ABI library) and
+  `sim/simx/gem5/gem5_smoke_main.cpp` (the in-process smoke driver).
+  `sim/simx/Makefile` extended with a `USE_GEM5=1` gate that
+  produces `libvortex-gem5.so` (1.5 MB) + `gem5_smoke` (16 KB
+  driver linking against the lib). `gem5_smoke hello.vxbin` →
+  `#0: Hello World!`, 4642 cycles, exit_code=0 (correctly read back
+  via `vortex_gem5_vram_read` after the cache-flush DCR path —
+  validating B9 from §2.2 is fixed). Default `make -C sim/simx`
+  unchanged (only `simx` produced; gem5 sources fully gated).
+  `USE_SST=1 USE_GEM5=1` correctly rejected by the Makefile per
+  §8 (mutual exclusion). Side fix: `sw/common/bitmanip.h` was
+  missing two `#include` lines; the header
+  hygiene fix benefits any caller (per
+  [feedback_always_correct_fix_not_patch](../../../../.claude/projects/-home-blaisetine-dev/memory/feedback_always_correct_fix_not_patch.md)).
+- **Phase 3** (gem5 SimObject + Python + install.sh): **6–10 h
+  estimated; ✅ COMPLETE 2026-05-16** in ~1.5 h. Files added:
+  `sim/simx/gem5/vortex_gpgpu_dev.{cc,hh}` (gem5 `DmaDevice` subclass
+  with `dlopen` + `EventFunctionWrapper` tick scheduling),
+  `sim/simx/gem5/VortexGPGPU.py` (Python binding with `library=` +
+  `kernel=` parameters), `sim/simx/gem5/SConscript`. Updated
+  `install.sh` to install the real device and remove the Phase 0
+  dummy scaffolding from `$GEM5_HOME` cleanly. New test:
+  `ci/gem5_test_vortex_hello.py` (standalone-device variant, no
+  host CPU needed). Validation: both `build/X86/gem5.opt` and
+  `build/ARM/gem5.opt` import `VortexGPGPU` and run hello.vxbin to
+  completion at tick 4,643,000 (1 GHz clock → 4643 cycles, matching
+  Phase 1 SST + Phase 2 in-process within 1 cycle). **Three
+  harnesses now validated through the same `Processor::cycle()` API:
+  SST, in-process C ABI, and gem5 SimObject.**
+- **Phase 4** (host runtime, x86 + ARM): **6–10 h estimated; ✅ x86
+  PATH COMPLETE 2026-05-16** in ~1 h; aarch64 cross-build gated on
+  the user's `sudo apt install gcc-aarch64-linux-gnu`. Files added:
+  `sw/runtime/gem5/driver.{cpp,h}` (direct MMIO + mmio_fence helper
+  with per-arch barrier; bump-allocator for the pinned region),
+  `sw/runtime/gem5/vortex.cpp` (OPAE-shaped `vx_device` with the
+  full callback table — compile-time caps from VX_config.h since
+  the host runtime and the device library are built from the same
+  source tree), `sw/runtime/gem5/Makefile` (HOST_ARCH ∈
+  {x86_64,aarch64,armhf} → matching cross-compiler; produces
+  `libvortex-gem5-$ARCH.so`). All three B-bugs addressed: B9 (cache
+  flush before download via per-core `dcr_read(VX_DCR_BASE_CACHE_FLUSH,
+  cid)`), B13 (per-arch compiler via `HOST_ARCH`), B14 (mmio_fence()
+  centralised in `issue_cmd()` so every CMD_TYPE write is fenced
+  by construction). Validation: `make -C sw/runtime/gem5 HOST_ARCH=x86_64`
+  → `libvortex-gem5-x86_64.so` (43 KB, ELF 64-bit x86-64, SONAME
+  correct, exports `vx_dev_init` matching the OPAE/SimX backend
+  pattern).
+- **Phase 5** (end-to-end gem5 tests): **4–6 h estimated; ✅ x86
+  PATH COMPLETE 2026-05-17** in ~3 h. The bulk of the work turned
+  out to be the OPAE state machine on the device side (cmd_args
+  latching, busy bit, dcr_rsp register) plus the dmaAction
+  dispatch in the SimObject — the test scripts themselves were
+  small. Files added:
+  `ci/gem5_test_vortex_vecadd.py` (full e2e: x86 CPU + identity-mapped
+  PIO+PIN regions + Process.map() + Vortex device). The Phase 3
+  standalone `ci/gem5_test_vortex_hello.py` continues to pass as a
+  fast smoke test. Phase 5 also extended Phase 2's
+  `sim/simx/gem5/vortex_gpgpu.{cpp,h}` with the full OPAE protocol
+  state machine and Phase 3's `sim/simx/gem5/vortex_gpgpu_dev.cc`
+  with `pop_pending_cmd` → `dmaRead`/`dmaWrite` dispatch.
+  Validation: `vecadd -n16` PASSED!, kernel ran 454 cycles at
+  IPC 0.247 on 4×4 threads/warps. Side fix: glibc's `nanosleep()`
+  routes through `clock_nanosleep` (#230) which gem5 SE-mode
+  doesn't implement — switched the host runtime's poll-loop back-off
+  to `sched_yield()` (in gem5's syscall table). ARM e2e gated on
+  user `sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu`
+  (same gate as Phase 4's aarch64 build).
+- **Phase 6** (CI): **2–3 h estimated; ✅ COMPLETE 2026-05-17** in
+  ~30 min. Added `gem5()` function to `ci/regression.sh.in`
+  (mirrors `sst()` shape; builds prerequisites + runs both Phase 3
+  standalone and Phase 5 e2e tests via `timeout 120` per
+  [feedback_test_timeout_120s](../../../../.claude/projects/-home-blaisetine-dev/memory/feedback_test_timeout_120s.md);
+  ARM matrix opt-in via `VORTEX_GEM5_ARM=1`). Added `--gem5` case
+  dispatch + `--gem5` to the show_usage line. Updated
+  `.github/workflows/ci.yml`: appended `ci/gem5_install.sh` to the
+  `Setup Toolchain` step (gated on `cache-toolchain.outputs.cache-hit`
+  like SST), added `Export gem5 paths` step (GEM5_HOME + PATH for
+  `build/X86`), added `gem5` to the `tests.matrix.name` list with
+  `exclude: name=gem5 xlen=64` (the device library is XLEN-locked
+  by the gem5 install; one entry is sufficient). Validation:
+  `./ci/regression.sh --gem5` PASSED end-to-end in **5 seconds**
+  (Phase 3 hello standalone + Phase 5 vecadd e2e, both clean).
+- **Phase 7** (docs): **1–2 h estimated; ✅ COMPLETE 2026-05-17** in
+  ~45 min. Added `docs/gem5_integration.md` covering: install
+  (`ci/gem5_install.sh`), Vortex+gem5 build (`USE_GEM5=1`), host
+  runtime cross-compile (`HOST_ARCH`), running tests
+  (`./ci/regression.sh --gem5` and standalone hand commands),
+  a complete minimal Python recipe for hosting Vortex in a custom
+  gem5 system, **six load-bearing invariants** (Process.map order,
+  identity-mapped PIO+PIN, cache flush before download, MMIO
+  fence, single source of truth for memory, USE_SST/GEM5 mutex),
+  architectural choices worth revisiting (doorbells vs. polling,
+  PCIe upgrade path, C ABI rationale), CI integration, and a
+  troubleshooting table covering the most common error modes
+  (wrong library path, missing LD_LIBRARY_PATH, clock_nanosleep
+  syscall, orphan Process, wrong `library=` param, busy-bit hang,
+  ccache stale objects). Added to `docs/index.md`.
+
+Total: **~30–49 hours** of focused work (was ~26–41 h before Phase 0
+was added as a separate phase; the actual work has not grown — the
+gem5 install was implicit in the old Phase 2 estimate and is now
+explicit in Phase 0). Substantial enough to warrant its own branch
+(`gem5_simx_v3` or similar).
+
+**Sequencing with SST:** Phase 1 (`Processor::cycle()`) is shared;
+do it once and both integrations benefit. If SST lands first, gem5
+reuses `Processor::cycle()` unchanged. If gem5 lands first, the SST
+integration's broken `proc_->cycle()` reference
+(`sim/simx/sst/vortex_simulator.cpp:64`) gets fixed as a side effect
+of Phase 1 — net win for both. Phase 0 is gem5-only; SST integration
+does not benefit from it.
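
Reviewer note on the arithmetic in §9 above: the sketch below sums the per-phase estimates to reproduce the 30–49 h total and converts the Phase 3 completion tick into cycles. The function names are hypothetical, and the 1000 ticks-per-cycle figure is an assumption: it presumes gem5's default 1 ps tick resolution together with the 1 GHz clock quoted in the Phase 3 entry.

```python
# Per-phase estimates (low, high) in hours, copied from the Phase 0-7 list.
PHASE_ESTIMATES_H = {
    0: (6, 10),
    1: (1, 2),
    2: (4, 6),
    3: (6, 10),
    4: (6, 10),
    5: (4, 6),
    6: (2, 3),
    7: (1, 2),
}


def total_estimate(estimates):
    """Return the (low, high) sum of all per-phase estimates."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


def ticks_to_cycles(ticks, clock_hz=1_000_000_000, ticks_per_second=10**12):
    """Convert simulator ticks to device cycles (assumes 1 ps per tick)."""
    ticks_per_cycle = ticks_per_second // clock_hz
    return ticks // ticks_per_cycle


if __name__ == "__main__":
    print(total_estimate(PHASE_ESTIMATES_H))  # (30, 49)
    print(ticks_to_cycles(4_643_000))         # 4643
```
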
diff --git a/sim/simx/Makefile b/sim/simx/Makefile
index 059484effa..593581cdf1 100644
--- a/sim/simx/Makefile
+++ b/sim/simx/Makefile
@@ -2,8 +2,17 @@ include ../common.mk
 
 DESTDIR ?= $(CURDIR)
 USE_SST ?= 0
+USE_GEM5 ?= 0
 #SST_PKG ?= SST-14.1 # default SST package name
 
+# USE_SST and USE_GEM5 are mutually exclusive — different external
+# simulator wrappers with different LDFLAGS; building both into one
+# binary makes no sense and the proposal docs/proposals/gem5_simx_v3_proposal.md
+# §8 calls this out explicitly.
+ifeq ($(USE_SST)$(USE_GEM5),11)
+$(error USE_SST=1 and USE_GEM5=1 are mutually exclusive)
+endif
+
 OBJ_DIR = $(DESTDIR)/obj
 CONFIG_FILE = $(DESTDIR)/simx_config.stamp
 SRC_DIR = $(VORTEX_HOME)/sim/simx
@@ -96,6 +105,15 @@ ifeq ($(USE_SST),1)
 	SRCS     += $(SRC_DIR)/sst/vortex_simulator.cpp $(SRC_DIR)/sst/vortex_gpgpu.cpp
 endif
 
+# gem5 integration: build libvortex-gem5.so (the C ABI library loaded
+# by the gem5 VortexGPGPU SimObject) plus gem5_smoke (an in-process
+# smoke driver that exercises the library without needing gem5
+# installed). The gem5 wrapper source is kept out of the default SRCS
+# list and pulled into VORTEX_GEM5_SRCS so the default simx binary
+# does not carry it.
+VORTEX_GEM5_SRCS := $(SRC_DIR)/gem5/vortex_gpgpu.cpp
+GEM5_SMOKE_SRC   := $(SRC_DIR)/gem5/gem5_smoke_main.cpp
+
 # Debugging
 ifdef DEBUG
 	CXXFLAGS += -g -O0 -DDEBUG_LEVEL=$(DEBUG)
@@ -128,17 +146,27 @@ VORTEX_SST_OBJS := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(VORTEX_SST_SRCS)
 DEPS += $(VORTEX_SST_OBJS:.o=.d)
 endif
 
+ifeq ($(USE_GEM5), 1)
+VORTEX_GEM5_OBJS := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(VORTEX_GEM5_SRCS))
+GEM5_SMOKE_OBJ   := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(GEM5_SMOKE_SRC))
+DEPS             += $(VORTEX_GEM5_OBJS:.o=.d) $(GEM5_SMOKE_OBJ:.o=.d)
+endif
+
 
 # optional: pipe through ccache if you have it
 CXX := $(if $(shell which ccache),ccache $(CXX),$(CXX))
 
 PROJECT := simx
 VORTEX_LIB := libvortex.so
+VORTEX_GEM5_LIB := libvortex-gem5.so
+GEM5_SMOKE := gem5_smoke
 
-.PHONY: all force clean clean-lib clean-exe clean-obj libvortex clean-libvortex
+.PHONY: all force clean clean-lib clean-exe clean-obj libvortex clean-libvortex libvortex-gem5 clean-libvortex-gem5 gem5-smoke clean-gem5-smoke
 
 ifeq ($(USE_SST), 1)
 all: $(DESTDIR)/$(PROJECT) $(DESTDIR)/$(VORTEX_LIB)
+else ifeq ($(USE_GEM5), 1)
+all: $(DESTDIR)/$(PROJECT) $(DESTDIR)/$(VORTEX_GEM5_LIB) $(DESTDIR)/$(GEM5_SMOKE)
 else
 all: $(DESTDIR)/$(PROJECT)
 endif
@@ -186,6 +214,21 @@ $(DESTDIR)/$(VORTEX_LIB): $(OBJS) $(VORTEX_SST_OBJS)
 	-shared -o $@ \
 	$(LDFLAGS) $(SST_LFLAGS)
 
+# Vortex gem5 device shared library — the gem5 SimObject dlopens this
+# and calls the C ABI declared in sim/simx/gem5/vortex_gpgpu.h.
+libvortex-gem5: $(DESTDIR)/$(VORTEX_GEM5_LIB)
+
+$(DESTDIR)/$(VORTEX_GEM5_LIB): $(OBJS) $(VORTEX_GEM5_OBJS)
+	$(CXX) $(CXXFLAGS) $^ -shared $(LDFLAGS) -Wl,-soname,$(VORTEX_GEM5_LIB) -o $@
+
+# In-process smoke driver (no gem5 needed). Links against the gem5
+# library via the C ABI so a successful run here proves the library
+# is sound before we expose it to the gem5 device.
+gem5-smoke: $(DESTDIR)/$(GEM5_SMOKE)
+
+$(DESTDIR)/$(GEM5_SMOKE): $(GEM5_SMOKE_OBJ) $(DESTDIR)/$(VORTEX_GEM5_LIB)
+	$(CXX) $(CXXFLAGS) $(GEM5_SMOKE_OBJ) -L$(DESTDIR) -lvortex-gem5 -Wl,-rpath,$(DESTDIR) -o $@
+
 # updates the timestamp when flags changed.
 $(CONFIG_FILE): force
 	@mkdir -p $(@D)
@@ -205,10 +248,16 @@ clean-lib:
 clean-libvortex:
 	rm -f $(DESTDIR)/libvortex.so
 
+clean-libvortex-gem5:
+	rm -f $(DESTDIR)/$(VORTEX_GEM5_LIB)
+
+clean-gem5-smoke:
+	rm -f $(DESTDIR)/$(GEM5_SMOKE)
+
 clean-exe:
 	rm -f $(DESTDIR)/$(PROJECT)
 
 clean-obj:
 	rm -rf $(OBJ_DIR)
 
-clean: clean-lib clean-exe clean-obj
+clean: clean-lib clean-libvortex clean-libvortex-gem5 clean-gem5-smoke clean-exe clean-obj
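
Reviewer note on the Makefile change above: the flag handling reduces to a small decision table. The sketch below restates it in Python to make the `all` target selection and the mutual-exclusion guard easy to check; `all_targets` is a hypothetical helper, not part of the build.

```python
def all_targets(use_sst=0, use_gem5=0):
    """Mirror the `all:` selection in sim/simx/Makefile for each flag combination."""
    # Same condition as `ifeq ($(USE_SST)$(USE_GEM5),11)` followed by $(error ...).
    if use_sst == 1 and use_gem5 == 1:
        raise ValueError("USE_SST=1 and USE_GEM5=1 are mutually exclusive")
    targets = ["simx"]  # $(PROJECT) is always built
    if use_sst == 1:
        targets.append("libvortex.so")
    elif use_gem5 == 1:
        targets += ["libvortex-gem5.so", "gem5_smoke"]
    return targets


if __name__ == "__main__":
    print(all_targets())            # ['simx']
    print(all_targets(use_gem5=1))  # ['simx', 'libvortex-gem5.so', 'gem5_smoke']
```

The default build stays unchanged (only `simx`), which matches the Phase 2 validation note in the proposal.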
diff --git a/sim/simx/gem5/SConscript b/sim/simx/gem5/SConscript
new file mode 100644
index 0000000000..535ada56ff
--- /dev/null
+++ b/sim/simx/gem5/SConscript
@@ -0,0 +1,18 @@
+# -*- mode:python -*-
+#
+# Vortex SimObjects for gem5. Installed into $GEM5_HOME/src/dev/vortex/
+# by sim/simx/gem5/install.sh. Picked up automatically by gem5's
+# top-level SConstruct via the SConscript-recursion rule at
+# SConstruct:1000.
+#
+# This file's source of truth lives in the Vortex tree
+# (sim/simx/gem5/SConscript); the installer just copies it.
+
+Import('*')
+
+SimObject('VortexGPGPU.py', sim_objects=['VortexGPGPU'])
+Source('vortex_gpgpu_dev.cc')
+
+# DebugFlag for VortexGPGPU traces. Enable with:
+#   gem5.opt --debug-flags=VortexGPGPU ...
+DebugFlag('VortexGPGPU')
diff --git a/sim/simx/gem5/VortexGPGPU.py b/sim/simx/gem5/VortexGPGPU.py
new file mode 100644
index 0000000000..bcbc038a06
--- /dev/null
+++ b/sim/simx/gem5/VortexGPGPU.py
@@ -0,0 +1,46 @@
+# Copyright © 2019-2023
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Python SimObject binding for the gem5-side VortexGPGPU device.
+# Mirrors the inheritance graph of the C++ side: DmaDevice → PioDevice
+# → ClockedObject.
+
+from m5.objects.Device import DmaDevice
+from m5.params import *
+
+
+class VortexGPGPU(DmaDevice):
+    type = "VortexGPGPU"
+    cxx_header = "dev/vortex/vortex_gpgpu_dev.hh"
+    cxx_class = "gem5::VortexGPGPU"
+
+    # Path to libvortex-gem5.so produced by `make -C sim/simx
+    # USE_GEM5=1` in the Vortex build dir. Required; the C++ ctor
+    # fatals if empty.
+    library = Param.String("Absolute path to libvortex-gem5.so")
+
+    # Optional kernel image preloaded at startup() via
+    # vortex_gem5_load_kernel. When set, the device runs the kernel to
+    # completion via its own tick scheduler and exits the sim loop on
+    # done, with no host CPU or MMIO traffic required. This is the
+    # Phase-3 entry point that proves the gem5 wiring without depending
+    # on Phase-4's host-runtime work. Phase 4 uploads kernels via the
+    # OPAE MMIO protocol instead.
+    kernel = Param.String("", "Optional .vxbin/.bin/.hex to preload at boot")
+
+    # PIO range. Default matches the legacy capstone paper (Fig. 4)
+    # for backward narrative continuity, though nothing in the design
+    # depends on this exact value.
+    pio_addr    = Param.Addr(0x20000000, "PIO base address")
+    pio_size    = Param.Addr(0x1000, "PIO region size (bytes)")
+    pio_latency = Param.Latency("1ns", "PIO access latency")
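
Reviewer note on the PIO defaults above: a quick way to sanity-check `pio_addr` and `pio_size` is to map the MMIO register offsets from `sim/simx/gem5/vortex_gpgpu.cpp` (later in this patch) onto the range and confirm they fit. In the sketch below the 8-byte register width is an assumption, inferred from the `mmio_read64`/`mmio_write64` accessors; `register_addresses` is a hypothetical helper.

```python
PIO_ADDR = 0x20000000  # default pio_addr from VortexGPGPU.py
PIO_SIZE = 0x1000      # default pio_size from VortexGPGPU.py
REG_BYTES = 8          # assumed register width (64-bit MMIO accessors)

# Byte offsets from the mmio namespace in vortex_gpgpu.cpp.
MMIO_OFFSETS = {
    "CMD_TYPE": 10 * 4,
    "CMD_ARG0": 12 * 4,
    "CMD_ARG1": 14 * 4,
    "CMD_ARG2": 16 * 4,
    "STATUS": 18 * 4,
    "DCR_RSP": 28 * 4,
}


def register_addresses(base=PIO_ADDR, size=PIO_SIZE):
    """Return absolute addresses, rejecting registers that spill past the range."""
    addrs = {}
    for name, offset in MMIO_OFFSETS.items():
        if offset + REG_BYTES > size:
            raise ValueError(f"{name} at offset {offset:#x} exceeds PIO size {size:#x}")
        addrs[name] = base + offset
    return addrs


if __name__ == "__main__":
    for name, addr in register_addresses().items():
        print(f"{name:8s} {addr:#010x}")
```
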
diff --git a/sim/simx/gem5/gem5_smoke_main.cpp b/sim/simx/gem5/gem5_smoke_main.cpp
new file mode 100644
index 0000000000..d2e8ae944a
--- /dev/null
+++ b/sim/simx/gem5/gem5_smoke_main.cpp
@@ -0,0 +1,96 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Phase-2 in-process smoke driver for libvortex-gem5.so.
+//
+// Exercises the C ABI from a native x86 binary — no gem5 involvement.
+// If a kernel completes here, the library is sound; any subsequent
+// failure under gem5 is on the SimObject side, not the library.
+//
+// Usage:
+//   LD_LIBRARY_PATH=$(dirname $(realpath gem5_smoke)) ./gem5_smoke kernel.vxbin
+
+#include "vortex_gpgpu.h"
+#include "constants.h"
+#include <VX_config.h>
+#include <VX_types.h>
+
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+
+int main(int argc, char** argv) {
+  if (argc < 2) {
+    std::fprintf(stderr,
+                 "usage: %s <kernel>\n"
+                 "  Runs the kernel through libvortex-gem5's C ABI to confirm\n"
+                 "  the library is wired up correctly before exposing it to\n"
+                 "  the gem5 SimObject.\n",
+                 argv[0]);
+    return 1;
+  }
+  const char* kernel_path = argv[1];
+
+  std::printf("[gem5_smoke] %s\n", vortex_gem5_build_info());
+  std::printf("[gem5_smoke] kernel: %s\n", kernel_path);
+
+  vortex_gem5_handle_t h = vortex_gem5_create();
+  if (h == nullptr) {
+    std::fprintf(stderr, "[gem5_smoke] vortex_gem5_create failed\n");
+    return 1;
+  }
+
+  if (vortex_gem5_load_kernel(h, kernel_path) != 0) {
+    std::fprintf(stderr, "[gem5_smoke] vortex_gem5_load_kernel failed\n");
+    vortex_gem5_destroy(h);
+    return 1;
+  }
+
+  // Tick until the kernel completes. cycle() returns false when no
+  // cluster is running AND no channel still holds an in-flight packet.
+  // Belt-and-braces cap at 100M cycles so a runaway kernel doesn't
+  // hang the smoke test (a real run hits the IO_EXIT_CODE check well
+  // before).
+  uint64_t cycles = 0;
+  constexpr uint64_t MAX_CYCLES = 100ull * 1000 * 1000;
+  while (vortex_gem5_tick(h)) {
+    if (++cycles > MAX_CYCLES) {
+      std::fprintf(stderr,
+                   "[gem5_smoke] aborted after %llu cycles — kernel did not complete\n",
+                   static_cast<unsigned long long>(cycles));
+      vortex_gem5_destroy(h);
+      return 1;
+    }
+  }
+
+  // Drain dirty cache lines to VRAM so we can read IO_EXIT_CODE. Same
+  // pattern as sim/simx/main.cpp's post-run cache flush — one DCR_READ
+  // per core triggers Processor::flush_caches() inside the simulator.
+  uint32_t dummy = 0;
+  for (uint32_t cid = 0; cid < NUM_CORES * NUM_CLUSTERS; ++cid) {
+    vortex_gem5_dcr_read(h, VX_DCR_BASE_CACHE_FLUSH, cid, &dummy);
+  }
+
+  // Read the kernel's exit code from IO_EXIT_CODE via the VRAM-read
+  // path — same byte the simx main reads in sim/simx/main.cpp:213.
+  uint32_t exit_code = 0;
+  vortex_gem5_vram_read(h, IO_EXIT_CODE,
+                        reinterpret_cast<uint8_t*>(&exit_code),
+                        sizeof(exit_code));
+
+  std::printf("[gem5_smoke] cycles=%llu exit_code=%u\n",
+              static_cast<unsigned long long>(cycles), exit_code);
+
+  vortex_gem5_destroy(h);
+  return static_cast<int>(exit_code);
+}
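
Reviewer note on the smoke driver above: the tick loop is a bounded "run until the device reports idle" pattern. A minimal Python model is sketched below; `make_stub_tick` is a hypothetical stand-in for `vortex_gem5_tick`, and the model only illustrates the loop and its cycle cap, not the library itself.

```python
def make_stub_tick(busy_cycles):
    """Return a tick() that reports busy for `busy_cycles` calls, then idle."""
    remaining = busy_cycles

    def tick():
        nonlocal remaining
        if remaining == 0:
            return False
        remaining -= 1
        return True

    return tick


def run_until_done(tick, max_cycles=100_000_000):
    """Call tick() until it returns False; abort if the cap is exceeded."""
    cycles = 0
    while tick():
        cycles += 1
        if cycles > max_cycles:
            raise RuntimeError(f"aborted after {cycles} cycles")
    return cycles


if __name__ == "__main__":
    print(run_until_done(make_stub_tick(4642)))  # 4642
```

As in the C++ driver, the counter records how many times the tick call reported the device as still busy, so the final call that returns false is not counted.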
diff --git a/sim/simx/gem5/hello.c b/sim/simx/gem5/hello.c
new file mode 100644
index 0000000000..ff5de63037
--- /dev/null
+++ b/sim/simx/gem5/hello.c
@@ -0,0 +1,14 @@
+// Phase 0 ARM SE-mode smoke test. Cross-compile with
+//   aarch64-linux-gnu-gcc -static -o /tmp/hello-arm hello.c
+// and run under gem5 with the new gem5_library SimpleBoard wiring
+// (or the deprecated configs/example/se.py if still available).
+// Confirms the cross-toolchain produces something gem5 can load.
+
+#include <stdio.h>
+
+int main(int argc, char** argv) {
+    (void)argc;
+    (void)argv;
+    printf("Hello, ARM SE-mode (gem5 v25 Phase 0)\n");
+    return 0;
+}
diff --git a/sim/simx/gem5/install.sh b/sim/simx/gem5/install.sh
new file mode 100755
index 0000000000..7af477c313
--- /dev/null
+++ b/sim/simx/gem5/install.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+# Install Vortex gem5 SimObjects into a pinned gem5 tree.
+#
+# Phase 3+: installs the real VortexGPGPU device. The Phase-0 dummy/
+# scaffolding is intentionally removed from $GEM5_HOME during the
+# transition — its job (proving the install path works) is done.
+#
+# Idempotent: re-running just refreshes the files. Caller must
+# re-run `scons build/{X86,ARM}/gem5.opt` after this script to pick
+# up changes.
+#
+# Usage:
+#   GEM5_HOME=$HOME/tools/gem5 sim/simx/gem5/install.sh
+# or
+#   sim/simx/gem5/install.sh           # uses $GEM5_HOME from env
+
+set -e
+
+GEM5_HOME=${GEM5_HOME:-$HOME/tools/gem5}
+SOURCE_DIR=$(dirname "$(readlink -f "$0")")
+
+if [ ! -d "$GEM5_HOME/src/dev" ]; then
+    echo "ERROR: GEM5_HOME=$GEM5_HOME does not look like a gem5 tree" >&2
+    echo "       (expected $GEM5_HOME/src/dev/)" >&2
+    exit 1
+fi
+
+DEST_DIR="$GEM5_HOME/src/dev/vortex"
+mkdir -p "$DEST_DIR"
+
+# Phase 0 scaffolding cleanup: the dummy SimObject existed only to
+# prove the install path; remove it now that the real device is in
+# place so `gem5.opt --list-sim-objects` is not polluted by it.
+if [ -d "$DEST_DIR/dummy" ]; then
+    rm -rf "$DEST_DIR/dummy"
+fi
+
+# Install the real device: header, source, Python binding, SConscript.
+install -m 0644 "$SOURCE_DIR/vortex_gpgpu_dev.hh" "$DEST_DIR/"
+install -m 0644 "$SOURCE_DIR/vortex_gpgpu_dev.cc" "$DEST_DIR/"
+install -m 0644 "$SOURCE_DIR/VortexGPGPU.py"      "$DEST_DIR/"
+install -m 0644 "$SOURCE_DIR/SConscript"          "$DEST_DIR/"
+
+echo "Vortex SimObjects installed at $DEST_DIR"
+echo "Files:"
+ls -1 "$DEST_DIR" | sed 's/^/  /'
+echo ""
+echo "Re-build gem5 with one or both of:"
+echo "  scons -C $GEM5_HOME build/X86/gem5.opt -j\$(nproc)"
+echo "  scons -C $GEM5_HOME build/ARM/gem5.opt -j\$(nproc)"
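
Reviewer note ahead of the device library that follows (`sim/simx/gem5/vortex_gpgpu.cpp`): its OPAE MMIO state machine can be modeled in a few lines, which helps when reasoning about the busy bit. The sketch below is an illustrative Python model only. The `DeviceModel` class and its dictionary-backed DCR store are assumptions standing in for the real `Processor`; the command codes and register offsets are copied from the C++ source.

```python
# Command codes and byte offsets as defined in vortex_gpgpu.cpp.
CMD_MEM_READ, CMD_MEM_WRITE, CMD_RUN, CMD_DCR_WRITE, CMD_DCR_READ = 1, 2, 3, 4, 5
CMD_TYPE, CMD_ARG0, CMD_ARG1, CMD_ARG2, STATUS, DCR_RSP = 40, 48, 56, 64, 72, 112


class DeviceModel:
    def __init__(self):
        self.cmd_args = [0, 0, 0]
        self.pending_cmd = 0
        self.dcr_rsp = 0
        self.busy = False
        self.dcrs = {}  # assumption: stand-in for Processor DCR state

    def mmio_read64(self, offset):
        if offset == STATUS:
            return 1 if self.busy else 0
        if offset == DCR_RSP:
            return self.dcr_rsp
        return 0

    def mmio_write64(self, offset, value):
        if offset in (CMD_ARG0, CMD_ARG1, CMD_ARG2):
            self.cmd_args[(offset - CMD_ARG0) // 8] = value  # latch argument
            return
        if offset != CMD_TYPE:
            return  # unknown register: ignore
        self.busy = True
        if value == CMD_DCR_WRITE:    # synchronous: completes immediately
            self.dcrs[self.cmd_args[0] & 0xFFFFFFFF] = self.cmd_args[1] & 0xFFFFFFFF
            self.busy = False
        elif value == CMD_DCR_READ:   # synchronous: result lands in DCR_RSP
            self.dcr_rsp = self.dcrs.get(self.cmd_args[0] & 0xFFFFFFFF, 0)
            self.busy = False
        elif value in (CMD_RUN, CMD_MEM_READ, CMD_MEM_WRITE):
            self.pending_cmd = value  # async: the SimObject clears busy later
        else:
            self.busy = False         # unknown command: do not hang the host

    def pop_pending_cmd(self):
        cmd, self.pending_cmd = self.pending_cmd, 0
        return cmd


if __name__ == "__main__":
    dev = DeviceModel()
    dev.mmio_write64(CMD_ARG0, 0x10)
    dev.mmio_write64(CMD_ARG1, 0xABCD)
    dev.mmio_write64(CMD_TYPE, CMD_DCR_WRITE)
    dev.mmio_write64(CMD_ARG1, 0)  # tag for the read
    dev.mmio_write64(CMD_TYPE, CMD_DCR_READ)
    print(hex(dev.mmio_read64(DCR_RSP)))  # 0xabcd
    dev.mmio_write64(CMD_TYPE, CMD_RUN)
    print(dev.mmio_read64(STATUS), dev.pop_pending_cmd())  # 1 3
```

The key property is that synchronous commands never leave the busy bit set, while `RUN` and `MEM_*` stay busy until the SimObject side finishes and clears it.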
diff --git a/sim/simx/gem5/vortex_gpgpu.cpp b/sim/simx/gem5/vortex_gpgpu.cpp
new file mode 100644
index 0000000000..1964d8e962
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu.cpp
@@ -0,0 +1,320 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "vortex_gpgpu.h"
+
+#include "constants.h"
+#include "processor.h"
+#include <VX_config.h>
+#include <VX_types.h>
+#include <mem.h>
+#include <util.h>
+
+#include <cstdint>
+#include <cstdio>
+#include <iostream>
+#include <memory>
+#include <string>
+
+using namespace vortex;
+
+// Mirrors sw/runtime/common/common.h's GLOBAL_MEM_SIZE so the bounds
+// check in vram_{read,write} matches what the host runtime enforces
+// on its side. Inlined rather than including common.h because that
+// header drags in the full runtime ABI (vortex.h + callbacks.h +
+// mem_alloc.h) which a device library has no business touching.
+#if (XLEN == 64)
+static constexpr uint64_t GEM5_GLOBAL_MEM_SIZE = 0x200000000ull;  // 8 GB
+#else
+static constexpr uint64_t GEM5_GLOBAL_MEM_SIZE = 0x100000000ull;  // 4 GB
+#endif
+
+// OPAE MMIO command-set constants (same as
+// hw/syn/altera/opae/vortex_afu.json + sw/runtime/gem5/vortex.cpp).
+// Hardcoded — no #include of vortex_opae.h — to keep the device
+// library independent of the OPAE header generator.
+namespace cmd {
+constexpr uint64_t MEM_READ  = 1;
+constexpr uint64_t MEM_WRITE = 2;
+constexpr uint64_t RUN       = 3;
+constexpr uint64_t DCR_WRITE = 4;
+constexpr uint64_t DCR_READ  = 5;
+} // namespace cmd
+namespace mmio {
+constexpr uint64_t CMD_TYPE  = 10 * 4;  // byte offsets, matching the
+constexpr uint64_t CMD_ARG0  = 12 * 4;  // sw/runtime side
+constexpr uint64_t CMD_ARG1  = 14 * 4;
+constexpr uint64_t CMD_ARG2  = 16 * 4;
+constexpr uint64_t STATUS    = 18 * 4;
+constexpr uint64_t DCR_RSP   = 28 * 4;
+} // namespace mmio
+
+// Internal C++ class. Mirrors the shape of vortex::VortexSimulator in
+// sim/simx/sst/ — same Processor + RAM ownership, same KMU DCR priming,
+// same load_kernel paths — but with no SST types in the interface.
+namespace {
+
+class Gem5Device {
+public:
+  Gem5Device()
+    : ram_(0, MEM_PAGE_SIZE)
+    , proc_(std::make_unique<Processor>()) {
+    proc_->attach_ram(&ram_);
+  }
+
+  ~Gem5Device() = default;
+
+  // Load a kernel image and prime the KMU for a 1×1×1 CTA at
+  // STARTUP_ADDR. After this, cycle() will dispatch the kernel.
+  // Returns true on success.
+  bool load_kernel(const std::string& path) {
+    // KMU DCRs — same sequence as sim/simx/main.cpp:101–116 and
+    // sim/simx/sst/vortex_simulator.cpp:22–39.
+    const uint64_t startup_addr(STARTUP_ADDR);
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ADDR0, startup_addr & 0xffffffff);
+  #if (XLEN == 64)
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ADDR1, startup_addr >> 32);
+  #endif
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ARG0, 0);
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ARG1, 0);
+    proc_->dcr_write(VX_DCR_KMU_GRID_DIM_X,   1);
+    proc_->dcr_write(VX_DCR_KMU_GRID_DIM_Y,   1);
+    proc_->dcr_write(VX_DCR_KMU_GRID_DIM_Z,   1);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_DIM_X,  1);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_DIM_Y,  1);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_DIM_Z,  1);
+    proc_->dcr_write(VX_DCR_KMU_LMEM_SIZE,    0);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_SIZE,   1);
+    proc_->dcr_write(VX_DCR_KMU_WARP_STEP_X,  NUM_THREADS);
+    proc_->dcr_write(VX_DCR_KMU_WARP_STEP_Y,  0);
+    proc_->dcr_write(VX_DCR_KMU_WARP_STEP_Z,  0);
+
+    std::string ext(fileExtension(path.c_str()));
+    if (ext == "vxbin") {
+      ram_.loadVxImage(path.c_str());
+    } else if (ext == "bin") {
+      ram_.loadBinImage(path.c_str(), startup_addr);
+    } else if (ext == "hex") {
+      ram_.loadHexImage(path.c_str());
+    } else {
+      std::cerr << "vortex_gem5: unsupported kernel extension '" << ext
+                << "' (need .vxbin, .bin, or .hex)" << std::endl;
+      return false;
+    }
+    return true;
+  }
+
+  bool tick()  { return proc_->cycle(); }
+
+  // Memory access uses the same ACL-bypass pattern as
+  // sw/runtime/simx/vortex.cpp upload()/download(); the gem5 DMA path
+  // is a peer of the host runtime, not a userspace caller subject to
+  // page protections.
+  void vram_write(uint64_t addr, const uint8_t* src, uint32_t size) {
+    if (addr > GEM5_GLOBAL_MEM_SIZE || size > GEM5_GLOBAL_MEM_SIZE - addr) {
+    #ifndef NDEBUG
+      std::cerr << "vortex_gem5: vram_write overflow addr=0x"
+                << std::hex << addr << " size=" << std::dec << size << std::endl;
+    #endif
+      return;
+    }
+    ram_.enable_acl(false);
+    ram_.write(src, addr, size);
+    ram_.enable_acl(true);
+  }
+
+  void vram_read(uint64_t addr, uint8_t* dst, uint32_t size) {
+    if (addr > GEM5_GLOBAL_MEM_SIZE || size > GEM5_GLOBAL_MEM_SIZE - addr) {
+    #ifndef NDEBUG
+      std::cerr << "vortex_gem5: vram_read overflow addr=0x"
+                << std::hex << addr << " size=" << std::dec << size << std::endl;
+    #endif
+      return;
+    }
+    ram_.enable_acl(false);
+    ram_.read(dst, addr, size);
+    ram_.enable_acl(true);
+  }
+
+  int dcr_write(uint32_t addr, uint32_t value) {
+    return proc_->dcr_write(addr, value);
+  }
+
+  int dcr_read(uint32_t addr, uint32_t tag, uint32_t* value) {
+    return proc_->dcr_read(addr, tag, value);
+  }
+
+  // OPAE MMIO command-set state machine. The host runtime
+  // (sw/runtime/gem5/vortex.cpp) drives it in exactly the same
+  // shape as sw/runtime/opae/vortex.cpp:
+  //   1. Write CMD_ARG0/1/2 with command-specific args
+  //   2. Write CMD_TYPE — triggers the command
+  //   3. Poll MMIO_STATUS until busy bit clears
+  //   4. (For DCR_READ) read MMIO_DCR_RSP for the response
+  //
+  // Synchronous commands (DCR_*) complete inside this function and
+  // clear the busy bit immediately. Async commands (RUN, MEM_*)
+  // surface to the gem5 SimObject via pop_pending_cmd; the SimObject
+  // performs the gem5-side work (clock ticks, DMA) and clears busy
+  // when done.
+  uint64_t mmio_read64(uint64_t offset) {
+    if (offset == mmio::STATUS)  return busy_ ? 1u : 0u;
+    if (offset == mmio::DCR_RSP) return dcr_rsp_;
+    return 0;
+  }
+
+  void mmio_write64(uint64_t offset, uint64_t value) {
+    if (offset == mmio::CMD_ARG0) { cmd_args_[0] = value; return; }
+    if (offset == mmio::CMD_ARG1) { cmd_args_[1] = value; return; }
+    if (offset == mmio::CMD_ARG2) { cmd_args_[2] = value; return; }
+    if (offset != mmio::CMD_TYPE) return;  // unknown reg — ignore
+
+    busy_ = true;
+    switch (value) {
+    case cmd::DCR_WRITE: {
+      proc_->dcr_write(uint32_t(cmd_args_[0]), uint32_t(cmd_args_[1]));
+      busy_ = false;
+      break;
+    }
+    case cmd::DCR_READ: {
+      uint32_t v = 0;
+      proc_->dcr_read(uint32_t(cmd_args_[0]),
+                      uint32_t(cmd_args_[1]),
+                      &v);
+      dcr_rsp_ = v;
+      busy_ = false;
+      break;
+    }
+    case cmd::RUN:
+    case cmd::MEM_READ:
+    case cmd::MEM_WRITE:
+      // Async — gem5 SimObject reads pending_cmd_ on the same MMIO
+      // dispatch tick and routes the work (clock cycles for RUN,
+      // dmaAction for MEM_*). It clears busy when done.
+      pending_cmd_ = value;
+      break;
+    default:
+      // Unknown command: drop the busy bit so the host doesn't hang.
+      busy_ = false;
+      break;
+    }
+  }
+
+  uint64_t pop_pending_cmd() {
+    uint64_t c = pending_cmd_;
+    pending_cmd_ = 0;
+    return c;
+  }
+  uint64_t get_cmd_arg(int which) const {
+    return (which >= 0 && which < 3) ? cmd_args_[which] : 0;
+  }
+  void set_busy(bool busy) { busy_ = busy; }
+
+private:
+  RAM ram_;
+  std::unique_ptr<Processor> proc_;
+
+  // OPAE protocol state.
+  uint64_t cmd_args_[3] = {0, 0, 0};
+  uint64_t pending_cmd_ = 0;
+  uint64_t dcr_rsp_     = 0;
+  bool     busy_        = false;
+};
+
+} // namespace
+
+// ----- C ABI -----------------------------------------------------------------
+
+extern "C" {
+
+const char* vortex_gem5_build_info(void) {
+  static char info[256];
+  std::snprintf(info, sizeof(info),
+                "vortex-gem5 (XLEN=%d, threads=%d, warps=%d, cores=%d, clusters=%d)",
+                XLEN, NUM_THREADS, NUM_WARPS, NUM_CORES, NUM_CLUSTERS);
+  return info;
+}
+
+vortex_gem5_handle_t vortex_gem5_create(void) {
+  try {
+    return reinterpret_cast<vortex_gem5_handle_t>(new Gem5Device());
+  } catch (const std::exception& e) {
+    std::cerr << "vortex_gem5_create: " << e.what() << std::endl;
+    return nullptr;
+  } catch (...) {
+    std::cerr << "vortex_gem5_create: unknown exception" << std::endl;
+    return nullptr;
+  }
+}
+
+void vortex_gem5_destroy(vortex_gem5_handle_t h) {
+  if (h == nullptr) return;
+  delete reinterpret_cast<Gem5Device*>(h);
+}
+
+int vortex_gem5_load_kernel(vortex_gem5_handle_t h, const char* path) {
+  if (h == nullptr || path == nullptr) return -1;
+  return reinterpret_cast<Gem5Device*>(h)->load_kernel(path) ? 0 : -1;
+}
+
+bool vortex_gem5_tick(vortex_gem5_handle_t h) {
+  if (h == nullptr) return false;
+  return reinterpret_cast<Gem5Device*>(h)->tick();
+}
+
+uint64_t vortex_gem5_mmio_read64(vortex_gem5_handle_t h, uint64_t offset) {
+  if (h == nullptr) return 0;
+  return reinterpret_cast<Gem5Device*>(h)->mmio_read64(offset);
+}
+
+void vortex_gem5_mmio_write64(vortex_gem5_handle_t h, uint64_t offset, uint64_t value) {
+  if (h == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->mmio_write64(offset, value);
+}
+
+void vortex_gem5_vram_write(vortex_gem5_handle_t h, uint64_t dev_addr, const uint8_t* src, uint32_t size) {
+  if (h == nullptr || src == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->vram_write(dev_addr, src, size);
+}
+
+void vortex_gem5_vram_read(vortex_gem5_handle_t h, uint64_t dev_addr, uint8_t* dst, uint32_t size) {
+  if (h == nullptr || dst == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->vram_read(dev_addr, dst, size);
+}
+
+int vortex_gem5_dcr_write(vortex_gem5_handle_t h, uint32_t addr, uint32_t value) {
+  if (h == nullptr) return -1;
+  return reinterpret_cast<Gem5Device*>(h)->dcr_write(addr, value);
+}
+
+int vortex_gem5_dcr_read(vortex_gem5_handle_t h, uint32_t addr, uint32_t tag, uint32_t* value) {
+  if (h == nullptr || value == nullptr) return -1;
+  return reinterpret_cast<Gem5Device*>(h)->dcr_read(addr, tag, value);
+}
+
+uint64_t vortex_gem5_pop_pending_cmd(vortex_gem5_handle_t h) {
+  if (h == nullptr) return 0;
+  return reinterpret_cast<Gem5Device*>(h)->pop_pending_cmd();
+}
+
+uint64_t vortex_gem5_get_cmd_arg(vortex_gem5_handle_t h, int which) {
+  if (h == nullptr) return 0;
+  return reinterpret_cast<Gem5Device*>(h)->get_cmd_arg(which);
+}
+
+void vortex_gem5_set_busy(vortex_gem5_handle_t h, bool busy) {
+  if (h == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->set_busy(busy);
+}
+
+} // extern "C"
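Reviewer note (illustration only, not part of the patch): the command protocol in mmio_write64 / pop_pending_cmd / set_busy can be modelled in isolation as below. This is a minimal sketch, not the library's code. The CMD_TYPE offset and the MEM_READ / MEM_WRITE / RUN codes mirror the constants in vortex_gpgpu_dev.cc; the ARG register offsets, the DCR command codes, and the names CmdModel, pop_pending and complete are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// CMD_TYPE's offset and the MEM_READ/MEM_WRITE/RUN codes mirror the
// constants in vortex_gpgpu_dev.cc. The ARG offsets and the DCR codes
// are assumed values chosen only for this illustration.
constexpr uint64_t REG_CMD_TYPE  = 10 * 4;
constexpr uint64_t REG_CMD_ARG0  = 12 * 4;  // assumed
constexpr uint64_t REG_CMD_ARG1  = 14 * 4;  // assumed
constexpr uint64_t REG_CMD_ARG2  = 16 * 4;  // assumed
constexpr uint64_t CMD_MEM_READ  = 1;
constexpr uint64_t CMD_MEM_WRITE = 2;
constexpr uint64_t CMD_RUN       = 3;
constexpr uint64_t CMD_DCR_WRITE = 4;       // assumed
constexpr uint64_t CMD_DCR_READ  = 5;       // assumed

class CmdModel {
public:
  // Host-side MMIO write: latch an argument or dispatch a command.
  void write(uint64_t offset, uint64_t value) {
    if (offset == REG_CMD_ARG0) { args_[0] = value; return; }
    if (offset == REG_CMD_ARG1) { args_[1] = value; return; }
    if (offset == REG_CMD_ARG2) { args_[2] = value; return; }
    if (offset != REG_CMD_TYPE) return;  // unknown register: ignore

    busy_ = true;
    switch (value) {
    case CMD_DCR_WRITE:
    case CMD_DCR_READ:
      // Synchronous: finished before the write returns.
      busy_ = false;
      break;
    case CMD_RUN:
    case CMD_MEM_READ:
    case CMD_MEM_WRITE:
      // Asynchronous: parked until the device model services it.
      pending_ = value;
      break;
    default:
      // Unknown command: never leave the host spinning on busy.
      busy_ = false;
      break;
    }
  }

  // Device-model side: take the parked command (0 means none).
  uint64_t pop_pending() {
    uint64_t c = pending_;
    pending_ = 0;
    return c;
  }

  // Device-model side: signal that the async command has finished.
  void complete() {
    assert(pending_ == 0 && "pop the command before completing it");
    busy_ = false;
  }

  bool busy() const { return busy_; }
  uint64_t arg(int i) const { return (i >= 0 && i < 3) ? args_[i] : 0; }

private:
  uint64_t args_[3] = {0, 0, 0};
  uint64_t pending_ = 0;
  bool     busy_    = false;
};

} // namespace sketch
```

The point of the split is that the busy bit is set by the write and cleared by whoever finishes the work: the library itself for DCR commands, the SimObject (after the last tick or the DMA completion event) for everything else.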
diff --git a/sim/simx/gem5/vortex_gpgpu.h b/sim/simx/gem5/vortex_gpgpu.h
new file mode 100644
index 0000000000..94d14eb865
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu.h
@@ -0,0 +1,111 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// libvortex-gem5 — C ABI for the gem5 VortexGPGPU SimObject.
+//
+// The gem5 device (sim/simx/gem5/.cc, installed into a pinned
+// gem5 tree by sim/simx/gem5/install.sh) loads this shared library and
+// drives it through this C ABI. Keeping the ABI in C — not C++ — means
+// the gem5 side does not depend on SimX's C++ types and can be rebuilt
+// against a new gem5 release without touching anything Vortex-side.
+//
+// Concurrency: the gem5 device serializes calls on its event-loop thread;
+// no internal locking. Re-entrancy: completion callbacks (currently
+// unused — the DMA path is fully synchronous on the gem5 side per Phase
+// 2) may be added later as Phase 3 wires up async DMA.
+
+#pragma once
+
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Opaque handle. The library owns a vortex::Processor + RAM behind it.
+typedef struct vortex_gem5_device_s* vortex_gem5_handle_t;
+
+// Returns a printable description of the build config (cores, warps,
+// threads, XLEN). Returned pointer is static; do not free.
+const char* vortex_gem5_build_info(void);
+
+// Construct a Vortex device instance. Returns NULL on failure.
+// VRAM is allocated lazily; no kernel is loaded until
+// vortex_gem5_load_kernel is called.
+vortex_gem5_handle_t vortex_gem5_create(void);
+
+// Destroy the device. Safe to call with NULL.
+void vortex_gem5_destroy(vortex_gem5_handle_t h);
+
+// Load a kernel image into VRAM. Accepts .vxbin / .bin / .hex (same
+// shape as sim/simx/main.cpp:120). Primes the KMU DCRs for a 1x1x1
+// CTA at STARTUP_ADDR (same as sim/simx/main.cpp:101-116) so a
+// subsequent cycle() loop launches the kernel.
+//
+// In the Phase-2 in-process smoke driver this is how kernels reach
+// the device. The Phase-4 runtime will instead upload kernels via
+// the staging-buffer DMA path (vortex_gem5_vram_write + the OPAE MMIO
+// commands), and Phase 3's gem5 SimObject can optionally call this
+// at boot via a Python `kernel=...` parameter for one-shot smoke
+// tests.
+//
+// Returns 0 on success, -1 on file-not-found or unsupported format.
+int vortex_gem5_load_kernel(vortex_gem5_handle_t h, const char* path);
+
+// Advance the simulator by one cycle. Returns true while work
+// remains (clusters running or channels carrying packets); false once
+// the program has finished. Mirrors vortex::Processor::cycle().
+bool vortex_gem5_tick(vortex_gem5_handle_t h);
+
+// MMIO (PIO) accessed by the simulated host CPU via the gem5 SimObject's
+// read()/write() callbacks. Offsets are byte addresses inside the
+// device's PIO range. See sw/runtime/opae/vortex.cpp for the OPAE MMIO
+// layout this protocol mirrors.
+uint64_t vortex_gem5_mmio_read64(vortex_gem5_handle_t h, uint64_t offset);
+void vortex_gem5_mmio_write64(vortex_gem5_handle_t h, uint64_t offset, uint64_t value);
+
+// VRAM access. The gem5 device DMAs to/from the host's staging buffer
+// using its own DmaPort; once the bytes are in a local scratch, it
+// calls these to copy into/out of the device VRAM. Bytes here cross
+// only the C ABI boundary — they do not re-enter gem5's DMA system.
+//
+// Bounds-checked against the RAM image; on overflow the call is a
+// no-op and (in debug builds) logs to stderr.
+void vortex_gem5_vram_write(vortex_gem5_handle_t h, uint64_t dev_addr, const uint8_t* src, uint32_t size);
+void vortex_gem5_vram_read(vortex_gem5_handle_t h, uint64_t dev_addr, uint8_t* dst, uint32_t size);
+
+// DCR write/read passthrough. The DCR-read path also handles the
+// cache-flush DCR (VX_DCR_BASE_CACHE_FLUSH), which drains dirty cache
+// lines all the way to VRAM — required before a host read-back per
+// B9 in docs/proposals/gem5_simx_v3_proposal.md §2.2.
+int vortex_gem5_dcr_write(vortex_gem5_handle_t h, uint32_t addr, uint32_t value);
+int vortex_gem5_dcr_read(vortex_gem5_handle_t h, uint32_t addr, uint32_t tag, uint32_t* value);
+
+// Protocol state introspection for the gem5 SimObject. The library
+// owns the OPAE state machine (cmd_args + busy bit + cmd_type +
+// dcr_rsp); the gem5 SimObject calls these to drive DMA for the
+// async CMD_MEM_{READ,WRITE} commands.
+//
+// pop_pending_cmd returns the CMD_* constant of an async command
+// the SimObject must service (CMD_RUN, CMD_MEM_WRITE, CMD_MEM_READ),
+// or 0 if no command is pending. Synchronous commands (CMD_DCR_*)
+// are handled inside mmio_write64 and never surface here.
+uint64_t vortex_gem5_pop_pending_cmd(vortex_gem5_handle_t h);
+uint64_t vortex_gem5_get_cmd_arg(vortex_gem5_handle_t h, int which);
+void     vortex_gem5_set_busy(vortex_gem5_handle_t h, bool busy);
+
+#ifdef __cplusplus
+} // extern "C"
+#endif
diff --git a/sim/simx/gem5/vortex_gpgpu_dev.cc b/sim/simx/gem5/vortex_gpgpu_dev.cc
new file mode 100644
index 0000000000..f46d42934f
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu_dev.cc
@@ -0,0 +1,295 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "dev/vortex/vortex_gpgpu_dev.hh"
+
+#include "base/logging.hh"
+#include "base/trace.hh"
+#include "mem/packet_access.hh"
+#include "sim/sim_exit.hh"
+
+#include <dlfcn.h>
+
+// OPAE MMIO command-set constants. Hardcoded to match the layout
+// the host runtime uses (sw/runtime/gem5/vortex.cpp:50-66, also
+// hw/syn/altera/opae/vortex_afu.json). Hardcoded — not pulled from
+// vortex_opae.h — because gem5 is compiled out-of-tree and we
+// don't want a build-time dep on the Vortex source.
+static constexpr uint64_t MMIO_CMD_TYPE = 10 * 4;  // byte offset
+static constexpr uint64_t CMD_MEM_READ  = 1;
+static constexpr uint64_t CMD_MEM_WRITE = 2;
+static constexpr uint64_t CMD_RUN       = 3;
+
+// Cache line size — args are stored shifted by log2(CACHE_BLOCK_SIZE)
+// in the OPAE protocol; both directions agree at log2(64) = 6.
+static constexpr unsigned CACHE_BLOCK_LOG2 = 6;
+
+namespace gem5
+{
+
+namespace {
+
+// Helper for dlsym + null-check in one line. Returns the resolved
+// pointer cast to T, or fatals out with a stable error message.
+template <typename T>
+T dlsym_or_fatal(void* handle, const char* symbol, const char* libpath)
+{
+    void* p = dlsym(handle, symbol);
+    if (p == nullptr) {
+        fatal("VortexGPGPU: dlsym(%s) failed in %s: %s",
+              symbol, libpath, dlerror());
+    }
+    return reinterpret_cast<T>(p);
+}
+
+} // namespace
+
+VortexGPGPU::VortexGPGPU(const Params &p)
+  : DmaDevice(p),
+    libHandle_(nullptr),
+    deviceHandle_(nullptr),
+    abi_{},
+    libraryPath_(p.library),
+    kernelPath_(p.kernel),
+    pioAddr_(p.pio_addr),
+    pioSize_(p.pio_size),
+    pioLatency_(p.pio_latency),
+    tickEvent_([this]{ this->tick(); }, name() + ".tickEvent")
+{
+    if (libraryPath_.empty()) {
+        fatal("VortexGPGPU: 'library' parameter is required "
+              "(path to libvortex-gem5.so)");
+    }
+
+    // dlopen with RTLD_LAZY|RTLD_LOCAL: local so the library's symbols
+    // stay out of gem5's global symbol scope, lazy because we resolve
+    // explicitly with dlsym below anyway.
+    libHandle_ = dlopen(libraryPath_.c_str(), RTLD_LAZY | RTLD_LOCAL);
+    if (libHandle_ == nullptr) {
+        fatal("VortexGPGPU: dlopen('%s') failed: %s",
+              libraryPath_, dlerror());
+    }
+
+    // Resolve the full v1 C ABI surface. Any missing symbol is a hard
+    // build mismatch between gem5 and the Vortex library — fatal so
+    // we fail fast at construction rather than mid-simulation.
+    abi_.build_info   = dlsym_or_fatal<decltype(abi_.build_info)>
+                          (libHandle_, "vortex_gem5_build_info",   libraryPath_.c_str());
+    abi_.create       = dlsym_or_fatal<decltype(abi_.create)>
+                          (libHandle_, "vortex_gem5_create",       libraryPath_.c_str());
+    abi_.destroy      = dlsym_or_fatal<decltype(abi_.destroy)>
+                          (libHandle_, "vortex_gem5_destroy",      libraryPath_.c_str());
+    abi_.load_kernel  = dlsym_or_fatal<decltype(abi_.load_kernel)>
+                          (libHandle_, "vortex_gem5_load_kernel",  libraryPath_.c_str());
+    abi_.tick         = dlsym_or_fatal<decltype(abi_.tick)>
+                          (libHandle_, "vortex_gem5_tick",         libraryPath_.c_str());
+    abi_.mmio_read64  = dlsym_or_fatal<decltype(abi_.mmio_read64)>
+                          (libHandle_, "vortex_gem5_mmio_read64",  libraryPath_.c_str());
+    abi_.mmio_write64 = dlsym_or_fatal<decltype(abi_.mmio_write64)>
+                          (libHandle_, "vortex_gem5_mmio_write64", libraryPath_.c_str());
+    abi_.vram_write   = dlsym_or_fatal<decltype(abi_.vram_write)>
+                          (libHandle_, "vortex_gem5_vram_write",   libraryPath_.c_str());
+    abi_.vram_read    = dlsym_or_fatal<decltype(abi_.vram_read)>
+                          (libHandle_, "vortex_gem5_vram_read",    libraryPath_.c_str());
+    abi_.dcr_write    = dlsym_or_fatal<decltype(abi_.dcr_write)>
+                          (libHandle_, "vortex_gem5_dcr_write",    libraryPath_.c_str());
+    abi_.dcr_read     = dlsym_or_fatal<decltype(abi_.dcr_read)>
+                          (libHandle_, "vortex_gem5_dcr_read",     libraryPath_.c_str());
+    abi_.pop_pending_cmd = dlsym_or_fatal<decltype(abi_.pop_pending_cmd)>
+                          (libHandle_, "vortex_gem5_pop_pending_cmd", libraryPath_.c_str());
+    abi_.get_cmd_arg  = dlsym_or_fatal<decltype(abi_.get_cmd_arg)>
+                          (libHandle_, "vortex_gem5_get_cmd_arg",  libraryPath_.c_str());
+    abi_.set_busy     = dlsym_or_fatal<decltype(abi_.set_busy)>
+                          (libHandle_, "vortex_gem5_set_busy",     libraryPath_.c_str());
+
+    inform("VortexGPGPU: %s", abi_.build_info());
+    inform("VortexGPGPU: library=%s pio=[0x%llx,+0x%llx)",
+           libraryPath_,
+           static_cast<unsigned long long>(pioAddr_),
+           static_cast<unsigned long long>(pioSize_));
+
+    deviceHandle_ = abi_.create();
+    if (deviceHandle_ == nullptr) {
+        fatal("VortexGPGPU: vortex_gem5_create returned NULL");
+    }
+}
+
+VortexGPGPU::~VortexGPGPU()
+{
+    if (deviceHandle_ != nullptr && abi_.destroy != nullptr) {
+        abi_.destroy(deviceHandle_);
+    }
+    if (libHandle_ != nullptr) {
+        dlclose(libHandle_);
+    }
+}
+
+void
+VortexGPGPU::init()
+{
+    DmaDevice::init();
+}
+
+void
+VortexGPGPU::startup()
+{
+    DmaDevice::startup();
+
+    if (!kernelPath_.empty()) {
+        // Standalone mode (Phase 3): preload a kernel and self-drive
+        // to completion. Used by ci/gem5_test_vortex_hello.py — no
+        // host CPU needed.
+        inform("VortexGPGPU: standalone mode (preload + auto-tick)");
+        inform("VortexGPGPU: preloading kernel=%s", kernelPath_);
+        if (abi_.load_kernel(deviceHandle_, kernelPath_.c_str()) != 0) {
+            fatal("VortexGPGPU: vortex_gem5_load_kernel('%s') failed",
+                  kernelPath_);
+        }
+        standalone_ = true;
+        schedule(tickEvent_, clockEdge(Cycles(1)));
+    } else {
+        // Hosted mode (Phase 5+): the host CPU uploads kernels via
+        // MMIO/DMA and triggers execution with CMD_RUN. We sit idle
+        // until then; CMD_RUN's write handler schedules tickEvent_.
+        inform("VortexGPGPU: hosted mode (waiting for host CMD_RUN)");
+        standalone_ = false;
+    }
+}
+
+void
+VortexGPGPU::tick()
+{
+    bool running = abi_.tick(deviceHandle_);
+    if (running) {
+        schedule(tickEvent_, clockEdge(Cycles(1)));
+        return;
+    }
+    // Kernel finished.
+    if (standalone_) {
+        inform("VortexGPGPU: standalone kernel complete — exiting sim loop");
+        exitSimLoop("VortexGPGPU: kernel complete");
+    } else {
+        // Host CPU is polling MMIO_STATUS waiting for busy bit to
+        // clear; do that now so vx_ready_wait returns.
+        abi_.set_busy(deviceHandle_, false);
+    }
+}
+
+Tick
+VortexGPGPU::read(PacketPtr pkt)
+{
+    const Addr offset = pkt->getAddr() - pioAddr_;
+    const uint64_t value = abi_.mmio_read64(deviceHandle_, offset);
+
+    // 64-bit aligned access is the only shape the OPAE protocol uses.
+    // Stuff the result into the packet regardless of size (gem5 will
+    // truncate based on getSize); narrow reads are unsupported by the
+    // protocol but harmless here.
+    pkt->setUintX(value, ByteOrder::little);
+    pkt->makeAtomicResponse();
+    return pioLatency_;
+}
+
+Tick
+VortexGPGPU::write(PacketPtr pkt)
+{
+    const Addr offset = pkt->getAddr() - pioAddr_;
+    const uint64_t value = pkt->getUintX(ByteOrder::little);
+
+    // Always forward the write to the Vortex library first so the
+    // device sees the args/CMD_TYPE in order.
+    abi_.mmio_write64(deviceHandle_, offset, value);
+
+    // Then react to commands that need gem5-side action (kicking the
+    // tick scheduler for CMD_RUN; dmaRead/dmaWrite through the DMA
+    // port for CMD_MEM_*).
+    if (offset == MMIO_CMD_TYPE) {
+        handleCmdType(value);
+    }
+
+    pkt->makeAtomicResponse();
+    return pioLatency_;
+}
+
+void
+VortexGPGPU::handleCmdType(uint64_t /*value*/)
+{
+    // Read which async command the library wants us to handle.
+    // Sync commands (DCR_*) already completed inside mmio_write64
+    // and don't surface here (pop returns 0).
+    const uint64_t cmd = abi_.pop_pending_cmd(deviceHandle_);
+    if (cmd == 0) return;
+
+    if (cmd == CMD_RUN) {
+        // Schedule the tick loop. tick() clears busy_ when the
+        // kernel finishes (via abi_.set_busy(false)).
+        if (!tickEvent_.scheduled()) {
+            schedule(tickEvent_, clockEdge(Cycles(1)));
+        }
+        return;
+    }
+
+    if (cmd == CMD_MEM_WRITE || cmd == CMD_MEM_READ) {
+        // Args are CACHE-LINE shifted in the OPAE protocol.
+        const Addr host_addr = abi_.get_cmd_arg(deviceHandle_, 0)
+                                 << CACHE_BLOCK_LOG2;
+        const Addr dev_addr  = abi_.get_cmd_arg(deviceHandle_, 1)
+                                 << CACHE_BLOCK_LOG2;
+        const uint64_t size  = abi_.get_cmd_arg(deviceHandle_, 2)
+                                 << CACHE_BLOCK_LOG2;
+
+        // Scratch buffer for the transfer; freed inside the
+        // completion callback. EventFunctionWrapper's `true` tail
+        // arg flags auto-delete after firing.
+        auto* scratch = new uint8_t[size];
+        void* deviceHandle = deviceHandle_;
+        auto& abi = abi_;
+
+        if (cmd == CMD_MEM_WRITE) {
+            // Host pinned buffer → device VRAM.
+            auto* done = new EventFunctionWrapper(
+                [&abi, deviceHandle, dev_addr, scratch, size]() {
+                    abi.vram_write(deviceHandle, dev_addr, scratch,
+                                   static_cast<uint32_t>(size));
+                    delete[] scratch;
+                    abi.set_busy(deviceHandle, false);
+                },
+                name() + ".dmaReadDone",
+                /*deletePostEvent=*/true);
+            dmaRead(host_addr, size, done, scratch);
+        } else {
+            // Device VRAM → host pinned buffer.
+            abi.vram_read(deviceHandle, dev_addr, scratch,
+                          static_cast<uint32_t>(size));
+            auto* done = new EventFunctionWrapper(
+                [&abi, deviceHandle, scratch]() {
+                    delete[] scratch;
+                    abi.set_busy(deviceHandle, false);
+                },
+                name() + ".dmaWriteDone",
+                /*deletePostEvent=*/true);
+            dmaWrite(host_addr, size, done, scratch);
+        }
+        return;
+    }
+}
+
+AddrRangeList
+VortexGPGPU::getAddrRanges() const
+{
+    AddrRangeList ranges;
+    ranges.push_back(RangeSize(pioAddr_, pioSize_));
+    return ranges;
+}
+
+} // namespace gem5
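Reviewer note (illustration only, not part of the patch): handleCmdType() shifts each CMD_MEM_* argument left by CACHE_BLOCK_LOG2, so the host side has to pass addresses and sizes in 64-byte cache-line units. A minimal sketch of that arithmetic follows. The helper names (align_up, encode_arg, decode_arg) are made up for this example, and the assumption that the host runtime encodes with a plain right shift is inferred from the device-side decode, not taken from sw/runtime/gem5/vortex.cpp.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

constexpr unsigned kCacheBlockLog2 = 6;  // log2(64), as in the device
constexpr uint64_t kCacheBlockSize = uint64_t(1) << kCacheBlockLog2;

// Round a byte count up to a whole number of cache lines.
constexpr uint64_t align_up(uint64_t bytes) {
  return (bytes + kCacheBlockSize - 1) & ~(kCacheBlockSize - 1);
}

// Host side (assumed): byte quantity -> cache-line units for a CMD_ARG
// register. Unaligned values would lose their low bits, so reject them.
inline uint64_t encode_arg(uint64_t bytes) {
  assert((bytes & (kCacheBlockSize - 1)) == 0 && "must be cache-line aligned");
  return bytes >> kCacheBlockLog2;
}

// Device side: cache-line units -> byte quantity (what handleCmdType does).
constexpr uint64_t decode_arg(uint64_t arg) {
  return arg << kCacheBlockLog2;
}

} // namespace sketch
```

For example, a 100-byte upload would be padded to align_up(100) = 128 bytes and sent as a size argument of 2; the device decodes that back to 128 bytes before allocating its scratch buffer.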
diff --git a/sim/simx/gem5/vortex_gpgpu_dev.hh b/sim/simx/gem5/vortex_gpgpu_dev.hh
new file mode 100644
index 0000000000..8f68256365
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu_dev.hh
@@ -0,0 +1,122 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// VortexGPGPU — gem5 SimObject wrapper for libvortex-gem5.so.
+//
+// Lives at $GEM5_HOME/src/dev/vortex/vortex_gpgpu_dev.{cc,hh} after
+// sim/simx/gem5/install.sh runs. The host-side source of truth is
+// the Vortex tree (sim/simx/gem5/) so API drift between gem5 and the
+// Vortex C ABI shows up as a build error in Vortex CI, not as a gem5
+// integration mystery.
+//
+// Design points (see docs/proposals/gem5_simx_v3_proposal.md §3.1):
+//   - dlopen the Vortex library at construction time; resolve all
+//     vortex_gem5_* symbols up-front. This keeps gem5 decoupled from
+//     the Vortex C++ ABI, so we can iterate on SimX internals without
+//     rebuilding gem5.
+//   - Drive Vortex's clock from a self-rescheduling EventFunctionWrapper
+//     (sim/simx/gem5/gem5_api_notes.md §"EventFunctionWrapper"). One
+//     vortex_gem5_tick() per gem5 cycle.
+//   - Inherits DmaDevice (not just PioDevice) so Phase 4's host runtime
+//     gets DMA "for free" via gem5's DmaPort; the Phase 3 entry just
+//     declares the inheritance and leaves DMA paths unexercised.
+
+#ifndef __DEV_VORTEX_VORTEX_GPGPU_DEV_HH__
+#define __DEV_VORTEX_VORTEX_GPGPU_DEV_HH__
+
+#include "dev/dma_device.hh"
+#include "dev/io_device.hh"
+#include "params/VortexGPGPU.hh"
+#include "sim/eventq.hh"
+
+#include <cstdint>
+#include <string>
+
+namespace gem5
+{
+
+class VortexGPGPU : public DmaDevice
+{
+public:
+    using Params = VortexGPGPUParams;
+
+    VortexGPGPU(const Params &p);
+    ~VortexGPGPU() override;
+
+    // PioDevice interface
+    Tick read(PacketPtr pkt) override;
+    Tick write(PacketPtr pkt) override;
+    AddrRangeList getAddrRanges() const override;
+
+    // SimObject lifecycle
+    void init() override;
+    void startup() override;
+
+private:
+    // Self-rescheduling clock tick: calls vortex_gem5_tick() once per device
+    // cycle. On false (program done): exitSimLoop if standalone, else clear busy.
+    void tick();
+
+    // Decode an MMIO command type write (MMIO_CMD_TYPE) and route
+    // CMD_MEM_{READ,WRITE} to the DMA path. Phase 3 routes other
+    // command types via vortex_gem5_mmio_write64; Phase 4 promotes
+    // CMD_MEM_* to the real DmaPort flow.
+    void handleCmdType(uint64_t value);
+
+    // Library binding ------------------------------------------------
+    // Opaque dlopen handle; closed in dtor.
+    void* libHandle_;
+    // Vortex device handle returned by vortex_gem5_create.
+    void* deviceHandle_;
+
+    // Cached function pointers — resolved once at construction so the
+    // hot path (tick, read, write) is straight indirect calls with no
+    // string lookups.
+    struct AbiV1 {
+        const char* (*build_info)(void);
+        void*       (*create)(void);
+        void        (*destroy)(void* h);
+        int         (*load_kernel)(void* h, const char* path);
+        bool        (*tick)(void* h);
+        uint64_t    (*mmio_read64)(void* h, uint64_t off);
+        void        (*mmio_write64)(void* h, uint64_t off, uint64_t value);
+        void        (*vram_write)(void* h, uint64_t addr, const uint8_t* src, uint32_t size);
+        void        (*vram_read)(void* h, uint64_t addr, uint8_t* dst, uint32_t size);
+        int         (*dcr_write)(void* h, uint32_t addr, uint32_t value);
+        int         (*dcr_read)(void* h, uint32_t addr, uint32_t tag, uint32_t* value);
+        uint64_t    (*pop_pending_cmd)(void* h);
+        uint64_t    (*get_cmd_arg)(void* h, int which);
+        void        (*set_busy)(void* h, bool busy);
+    } abi_;
+
+    // Configuration --------------------------------------------------
+    const std::string libraryPath_;
+    const std::string kernelPath_;
+    const Addr        pioAddr_;
+    const Addr        pioSize_;
+    const Tick        pioLatency_;
+
+    // Tick scheduling
+    EventFunctionWrapper tickEvent_;
+
+    // Standalone vs. hosted mode (selected at startup based on
+    // whether the `kernel=` Python param was set). In standalone
+    // mode the device drives a single preloaded kernel to
+    // completion and exits the sim loop; in hosted mode it sits
+    // idle until the host CPU issues CMD_RUN via MMIO.
+    bool standalone_;
+};
+
+} // namespace gem5
+
+#endif // __DEV_VORTEX_VORTEX_GPGPU_DEV_HH__
diff --git a/sim/simx/processor.cpp b/sim/simx/processor.cpp
index b173e4195d..40dc9226a8 100644
--- a/sim/simx/processor.cpp
+++ b/sim/simx/processor.cpp
@@ -231,6 +231,22 @@ void ProcessorImpl::reset() {
   perf_mem_writes_ = 0;
   perf_mem_latency_ = 0;
   perf_mem_pending_reads_ = 0;
+  is_cycle_initialized_ = false;
+}
+
+bool ProcessorImpl::cycle() {
+  // Lazy first-call init mirrors run()'s top-of-loop sequence so the
+  // external driver doesn't need to choreograph reset + kmu start
+  // separately. reset() clears is_cycle_initialized_ so a back-to-back
+  // kernel launch re-dispatches.
+  if (!is_cycle_initialized_) {
+    this->reset();
+    kmu_->start();
+    is_cycle_initialized_ = true;
+  }
+  SimPlatform::instance().tick();
+  perf_mem_latency_ += perf_mem_pending_reads_;
+  return this->any_running();
 }
 
 int ProcessorImpl::dcr_write(uint32_t addr, uint32_t value) {
@@ -333,6 +349,14 @@ int Processor::run() {
   return -1;
 }
 
+bool Processor::cycle() {
+  return impl_->cycle();
+}
+
+Memory* Processor::memsim() {
+  return impl_->memsim();
+}
+
 int Processor::dcr_write(uint32_t addr, uint32_t value) {
   return impl_->dcr_write(addr, value);
 }
diff --git a/sim/simx/processor.h b/sim/simx/processor.h
index 129cfdc460..04b57f037b 100644
--- a/sim/simx/processor.h
+++ b/sim/simx/processor.h
@@ -20,6 +20,7 @@
 namespace vortex {
 
 class RAM;
+class Memory;
 class ProcessorImpl;
 
 class Processor {
@@ -33,12 +34,29 @@ class Processor {
 
   int run();
 
+  // Advance the simulator by one cycle. On the first call after a
+  // reset() (or on the very first call), the KMU is started so warps
+  // dispatch into the cluster. Returns true while work remains
+  // (clusters running or channels carrying packets); false once the
+  // program has finished and the channels have drained.
+  //
+  // Used by external simulators that drive Vortex's clock from their
+  // own event loop (SST in sim/simx/sst/, gem5 in sim/simx/gem5/).
+  bool cycle();
+
   void start_kmu();
 
   bool any_running() const;
 
   class Core* get_first_core() const;
 
+  // Returns the processor's memory module. Used by external simulators
+  // (SST, gem5) to install a pre-send hook on Memory::tick that mirrors
+  // accepted requests to their own memory hierarchy for timing
+  // observability. The local data path stays in Vortex's RAM — this is
+  // a peek, not a substitute.
+  Memory* memsim();
+
   int dcr_write(uint32_t addr, uint32_t value);
 
   int dcr_read(uint32_t addr, uint32_t tag, uint32_t* value);
diff --git a/sim/simx/processor_impl.h b/sim/simx/processor_impl.h
index 0f66471b6c..4d2b6fef4f 100644
--- a/sim/simx/processor_impl.h
+++ b/sim/simx/processor_impl.h
@@ -40,6 +40,11 @@ class ProcessorImpl {
 
   int run();
 
+  // Single-cycle step; see Processor::cycle() doc. Lazily initializes
+  // (resets + starts KMU) on the first call after construction or
+  // after reset() has been invoked.
+  bool cycle();
+
   int dcr_write(uint32_t addr, uint32_t value);
 
   int dcr_read(uint32_t addr, uint32_t tag, uint32_t* value);
@@ -48,6 +53,8 @@ class ProcessorImpl {
 
   Kmu& kmu()       { return *kmu_; }
 
+  Memory* memsim() { return memsim_.get(); }
+
   bool any_running() const;
 
   class Core* get_first_core() const;
@@ -67,6 +74,10 @@ class ProcessorImpl {
   uint64_t perf_mem_writes_;
   uint64_t perf_mem_latency_;
   uint64_t perf_mem_pending_reads_;
+  // Tracks whether cycle() has done its first-call init (reset +
+  // kmu_->start()). reset() clears it so a back-to-back kernel launch
+  // via cycle() re-dispatches the KMU.
+  bool is_cycle_initialized_;
 };
 
 }
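Reviewer note (illustration only, not part of the patch): the lazy first-call initialisation in ProcessorImpl::cycle() can be seen in miniature below. This is a toy sketch under stated assumptions: ToyProcessor, its countdown of remaining work, and the resets() counter are invented for the example and stand in for the real reset() + kmu_->start() sequence and for any_running().

```cpp
#include <cassert>

namespace sketch {

// Toy stand-in for ProcessorImpl: "work" is just a countdown.
class ToyProcessor {
public:
  explicit ToyProcessor(int cycles_per_launch)
    : cycles_per_launch_(cycles_per_launch) {
    assert(cycles_per_launch > 0);
  }

  void reset() {
    remaining_ = 0;
    initialized_ = false;  // the next cycle() re-runs the launch sequence
    ++resets_;
  }

  // One step. Returns true while work remains.
  bool cycle() {
    if (!initialized_) {
      reset();
      remaining_ = cycles_per_launch_;  // stands in for kmu_->start()
      initialized_ = true;
    }
    if (remaining_ > 0) --remaining_;
    return remaining_ > 0;
  }

  int resets() const { return resets_; }

private:
  int  cycles_per_launch_;
  int  remaining_   = 0;
  bool initialized_ = false;
  int  resets_      = 0;
};

} // namespace sketch
```

In this model a second launch only happens after an explicit reset(): calling cycle() again on a finished processor simply keeps returning false, because the initialised flag is still set.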
diff --git a/sw/common/bitmanip.h b/sw/common/bitmanip.h
index c4fe9e8da2..5c72683859 100644
--- a/sw/common/bitmanip.h
+++ b/sw/common/bitmanip.h
@@ -14,6 +14,8 @@
 #pragma once
 
 #include 
+#include 
+#include 
 #include 
 
 namespace vortex {
diff --git a/sw/runtime/gem5/Makefile b/sw/runtime/gem5/Makefile
new file mode 100644
index 0000000000..16bd3390be
--- /dev/null
+++ b/sw/runtime/gem5/Makefile
@@ -0,0 +1,73 @@
+include ../common.mk
+
+# HOST_ARCH selects the cross-compiler for the simulated host ISA
+# inside gem5 (see docs/proposals/gem5_simx_v3_proposal.md §3.5).
+# Default x86_64 has no toolchain install requirement; aarch64/armhf
+# need ci/gem5_install.sh to have run sudo-apt for the cross-compilers.
+HOST_ARCH ?= x86_64
+
+DESTDIR ?= $(CURDIR)/..
+
+SRC_DIR := $(VORTEX_HOME)/sw/runtime/gem5
+
+CXXFLAGS += -std=c++17 -Wall -Wextra -pedantic -Wfatal-errors -Werror
+CXXFLAGS += -I$(INC_DIR) -I$(ROOT_DIR)/sw -I$(ROOT_DIR)/hw -I$(DESTDIR) -I$(SW_COMMON_DIR) -I$(RT_COMMON_DIR)
+CXXFLAGS += -DXLEN_$(XLEN)
+CXXFLAGS += -fPIC
+CXXFLAGS += $(CONFIGS)
+
+# OPAE-shaped MMIO constants come from the generated vortex_opae.h
+# at build/sw/ (already on the include path via -I$(ROOT_DIR)/sw).
+# vortex.cpp does `#include <vortex_opae.h>` for the AFU_IMAGE_*
+# defines. Unlike sw/runtime/opae/Makefile we do NOT call
+# afu_json_mgr — configure already generated the header from
+# vortex_opae.toml at build time.
+
+# Per-arch compiler selection. The cross-compilers are sysroot-aware
+# (Ubuntu's gcc-aarch64-linux-gnu ships the matching libstdc++); no
+# extra --sysroot flags needed.
+#
+# Cross-compiled outputs land in $(DESTDIR)/$(HOST_ARCH)/ alongside
+# the stub's libvortex.so (also cross-compiled). The simulated ARM
+# process's LD_LIBRARY_PATH points at that one dir to find both.
+ifeq ($(HOST_ARCH),x86_64)
+    CXX := g++
+    ARCH_SUFFIX := x86_64
+    OUT_DIR := $(DESTDIR)
+else ifeq ($(HOST_ARCH),aarch64)
+    CXX := aarch64-linux-gnu-g++
+    ARCH_SUFFIX := aarch64
+    OUT_DIR := $(DESTDIR)/aarch64
+else ifeq ($(HOST_ARCH),armhf)
+    CXX := arm-linux-gnueabihf-g++
+    ARCH_SUFFIX := armhf
+    OUT_DIR := $(DESTDIR)/armhf
+else
+    $(error HOST_ARCH must be one of: x86_64, aarch64, armhf (got $(HOST_ARCH)))
+endif
+
+LDFLAGS += -shared -pthread
+
+SRCS = $(SRC_DIR)/vortex.cpp $(SRC_DIR)/driver.cpp $(RT_COMMON_DIR)/utils.cpp
+
+# Debug / release
+ifdef DEBUG
+    CXXFLAGS += -g -O0
+else
+    CXXFLAGS += -O2 -DNDEBUG
+endif
+
+PROJECT := libvortex-gem5-$(ARCH_SUFFIX).so
+
+.PHONY: all force clean
+
+all: $(OUT_DIR)/$(PROJECT)
+
+$(OUT_DIR)/$(PROJECT): $(SRCS)
+	@mkdir -p $(OUT_DIR)
+	$(CXX) $(CXXFLAGS) $(SRCS) $(LDFLAGS) -Wl,-soname,$(PROJECT) -o $@
+
+clean:
+	rm -f $(DESTDIR)/libvortex-gem5-*.so
+	rm -f $(DESTDIR)/aarch64/libvortex-gem5-*.so
+	rm -f $(DESTDIR)/armhf/libvortex-gem5-*.so
diff --git a/sw/runtime/gem5/driver.cpp b/sw/runtime/gem5/driver.cpp
new file mode 100644
index 0000000000..3fc76e719a
--- /dev/null
+++ b/sw/runtime/gem5/driver.cpp
@@ -0,0 +1,128 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "driver.h"
+
+#include 
+#include 
+#include 
+#include 
+
+namespace vortex {
+
+namespace {
+
+// Trivial bump allocator for the pinned region. A real implementation
+// would use a free-list; for now this is the simplest thing that lets
+// upload/download cache a single staging buffer indefinitely.
+struct PinAllocator {
+    uintptr_t base = PIN_BASE_ADDR;
+    uintptr_t cur  = PIN_BASE_ADDR;
+    std::unordered_map<uintptr_t, uint64_t> live;  // ptr → size for free()
+
+    int allocate(uint64_t size, void** host_ptr, uint64_t* ioaddr) {
+        // Cache-line align (64) to match the OPAE staging-buffer model.
+        const uint64_t aligned = (size + 63) & ~uint64_t(63);
+        if (cur + aligned > base + PIN_REGION_SIZE) {
+            std::fprintf(stderr,
+                         "[VXDRV-gem5] pin region OOM: requested %llu, "
+                         "available %llu\n",
+                         (unsigned long long)aligned,
+                         (unsigned long long)(base + PIN_REGION_SIZE - cur));
+            return -1;
+        }
+        const uintptr_t ptr = cur;
+        cur += aligned;
+        live.emplace(ptr, aligned);
+        *host_ptr = reinterpret_cast<void*>(ptr);
+        *ioaddr   = static_cast<uint64_t>(ptr);  // identity v→p (see driver.h)
+        return 0;
+    }
+
+    void release(void* host_ptr) {
+        // Trivial allocator: no reclaim until close(). The legacy OPAE
+        // driver's `ensure_staging` recycles its single buffer the same
+        // way; this is fine for the OPAE-shaped workload (one staging
+        // buffer per device handle, grown on demand).
+        live.erase(reinterpret_cast<uintptr_t>(host_ptr));
+    }
+
+    void reset() { cur = base; live.clear(); }
+};
+
+PinAllocator g_pin;
+bool         g_inited = false;
+
+} // namespace
+
+int drv_init() {
+    if (g_inited) return 0;
+    // The two fixed regions (PIO and PIN) are expected to be already
+    // mapped by the gem5 SE-mode setup before this binary runs. We do
+    // NOT call mmap() here because SE-mode has no /dev/vortex; the
+    // Python config arranges the address space directly.
+    //
+    // If/when this runtime is ported to a real OS with a kernel driver,
+    // drv_init() will become an open("/dev/vortex_gem5") + mmap() pair.
+    g_inited = true;
+    g_pin.reset();
+    return 0;
+}
+
+void drv_close() {
+    if (!g_inited) return;
+    g_pin.reset();
+    g_inited = false;
+}
+
+uint64_t mmio_read64(uint64_t offset) {
+    auto* p = reinterpret_cast<volatile uint64_t*>(PIO_BASE_ADDR + offset);
+    return *p;
+}
+
+void mmio_write64(uint64_t offset, uint64_t value) {
+    auto* p = reinterpret_cast<volatile uint64_t*>(PIO_BASE_ADDR + offset);
+    *p = value;
+}
+
+// Memory barrier before kicking a command. The host CPU model in
+// gem5 (especially out-of-order variants like O3CPU) can reorder
+// MMIO writes; the runtime must publish the args before the
+// CMD_TYPE write or the device sees stale/uninitialized args. B14
+// in the proposal's bug catalog calls this out explicitly.
+void mmio_fence() {
+#if defined(__x86_64__) || defined(__i386__)
+    __asm__ __volatile__ ("mfence" ::: "memory");
+#elif defined(__aarch64__) || defined(__arm__)
+    __asm__ __volatile__ ("dmb sy" ::: "memory");
+#else
+    // Fall back to a compiler-only fence. Untested architectures
+    // should add their own asm.
+    __asm__ __volatile__ ("" ::: "memory");
+#endif
+}
+
+int drv_pin_buffer(uint64_t size, void** host_ptr, uint64_t* ioaddr) {
+    if (!g_inited) {
+        std::fprintf(stderr, "[VXDRV-gem5] drv_pin_buffer called before drv_init\n");
+        return -1;
+    }
+    return g_pin.allocate(size, host_ptr, ioaddr);
+}
+
+void drv_release_buffer(void* host_ptr) {
+    if (!g_inited) return;
+    g_pin.release(host_ptr);
+}
+
+} // namespace vortex
diff --git a/sw/runtime/gem5/driver.h b/sw/runtime/gem5/driver.h
new file mode 100644
index 0000000000..6faa36b301
--- /dev/null
+++ b/sw/runtime/gem5/driver.h
@@ -0,0 +1,73 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Direct-MMIO driver for the gem5 VortexGPGPU device.
+//
+// Replaces the libopae abstraction layer used by sw/runtime/opae/.
+// Inside a gem5 SE-mode process, we access the device by:
+//   1. Reading/writing MMIO registers via a fixed virtual address that
+//      the gem5 Python config maps to the device's PIO range
+//      (PIO_BASE_ADDR below; default 0x20000000 matches the legacy
+//      capstone paper).
+//   2. DMA staging through a fixed pinned region that the Python
+//      config maps with identity virtual→physical addressing
+//      (PIN_BASE_ADDR; default 0x10000000). The runtime uses host
+//      virtual addresses; the gem5 DmaPort sees the same value as
+//      physical because of the identity mapping.
+//
+// Phase 5 covers the gem5-side wiring of these mappings; Phase 4 just
+// produces the runtime library.
+
+#pragma once
+
+#include <cstddef>
+#include <cstdint>
+
+namespace vortex {
+
+// Fixed virtual addresses the runtime expects to find mapped by the
+// gem5 Python config. PIN_BASE_ADDR is the runtime's heap for DMA
+// staging buffers; PIO_BASE_ADDR is the device's MMIO command-and-
+// status window. Sizes (PIN_REGION_SIZE / PIO_REGION_SIZE) are caps
+// the runtime enforces — overruns are bugs, not malloc failures.
+constexpr uintptr_t PIN_BASE_ADDR    = 0x10000000ull;
+constexpr size_t    PIN_REGION_SIZE  = 0x10000000ull;  // 256 MB
+constexpr uintptr_t PIO_BASE_ADDR    = 0x20000000ull;
+constexpr size_t    PIO_REGION_SIZE  = 0x1000ull;      // 4 KB (1 page)
+
+// Init / shutdown. Neither maps memory (the gem5 Python config maps both
+// regions); they reset the pin allocator. Idempotent, but pair them 1:1.
+int  drv_init();
+void drv_close();
+
+// MMIO register access. Offsets are byte offsets into the device's
+// PIO range; values are read and written 64 bits at a time (the OPAE
+// protocol's natural width). mmio_fence() emits the right barrier
+// for HOST_ARCH (mfence on x86, dmb sy on AArch64/ARMv7); call it
+// before triggering a command (B14 in proposal §2.2).
+uint64_t mmio_read64 (uint64_t offset);
+void     mmio_write64(uint64_t offset, uint64_t value);
+void     mmio_fence();
+
+// Staging-buffer allocation in the pinned region. Returns 0 on
+// success and fills *host_ptr + *ioaddr; returns -1 on OOM in the
+// pinned region. Caller owns the slot until drv_release_buffer.
+//
+// Under Phase 5's identity v→p mapping, *host_ptr == *ioaddr; on a
+// future setup with non-identity mapping, *ioaddr is the value the
+// device must DMA against and *host_ptr is what the runtime writes
+// through.
+int  drv_pin_buffer    (uint64_t size, void** host_ptr, uint64_t* ioaddr);
+void drv_release_buffer(void* host_ptr);
+
+} // namespace vortex
diff --git a/sw/runtime/gem5/vortex.cpp b/sw/runtime/gem5/vortex.cpp
new file mode 100644
index 0000000000..92d793e4f8
--- /dev/null
+++ b/sw/runtime/gem5/vortex.cpp
@@ -0,0 +1,334 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// gem5 host runtime backend. Provides the standard Vortex `vx_*`
+// C API (declared in sw/runtime/include/vortex.h) on top of the
+// OPAE-shaped MMIO command protocol talking to the gem5 VortexGPGPU
+// device through driver.{cpp,h}.
+//
+// Shape mirrors sw/runtime/opae/vortex.cpp but is simpler:
+//   - No libopae dispatch; driver.h's mmio_{read,write}64 talks
+//     directly to PIO_BASE_ADDR.
+//   - No UUID enumeration / fpga_token dance — the gem5 device is
+//     always at the fixed PIO range.
+//   - Device caps come from compile-time VX_config.h macros (the
+//     host runtime and the device library are built from the same
+//     source tree, so they agree by construction).
+//   - mmio_fence() before every CMD_TYPE write (B14 in proposal §2.2).
+
+#include 
+#include <bitmanip.h>      // log2floor / log2ceil / is_aligned / aligned_size
+#include "driver.h"
+
+#include 
+#include <sched.h>         // sched_yield (gem5 SE-mode-safe back-off)
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+using namespace vortex;
+
+// Command codes and MMIO offsets. Offsets are byte addresses: the
+// AFU_IMAGE_MMIO_* DWORD offsets from vortex_opae.h times 4. Same
+// layout as sw/runtime/opae/vortex.cpp:47–56.
+#define CMD_MEM_READ     AFU_IMAGE_CMD_MEM_READ
+#define CMD_MEM_WRITE    AFU_IMAGE_CMD_MEM_WRITE
+#define CMD_RUN          AFU_IMAGE_CMD_RUN
+#define CMD_DCR_WRITE    AFU_IMAGE_CMD_DCR_WRITE
+#define CMD_DCR_READ     AFU_IMAGE_CMD_DCR_READ
+
+#define MMIO_CMD_TYPE    (AFU_IMAGE_MMIO_CMD_TYPE * 4)
+#define MMIO_CMD_ARG0    (AFU_IMAGE_MMIO_CMD_ARG0 * 4)
+#define MMIO_CMD_ARG1    (AFU_IMAGE_MMIO_CMD_ARG1 * 4)
+#define MMIO_CMD_ARG2    (AFU_IMAGE_MMIO_CMD_ARG2 * 4)
+#define MMIO_STATUS      (AFU_IMAGE_MMIO_STATUS * 4)
+#define MMIO_DCR_RSP     (AFU_IMAGE_MMIO_DCR_RSP * 4)
+
+#define STATUS_STATE_BITS 8
+
+// Issue a CMD_TYPE write. Centralised so the memory barrier before
+// the trigger MMIO is impossible to forget (B14). All callers must
+// have written ARG0/1/2 first.
+static inline void issue_cmd(uint64_t cmd) {
+    mmio_fence();
+    mmio_write64(MMIO_CMD_TYPE, cmd);
+}
+
+///////////////////////////////////////////////////////////////////////////////
+
+class vx_device {
+public:
+    vx_device()
+        : global_mem_(ALLOC_BASE_ADDR,
+                      GLOBAL_MEM_SIZE - ALLOC_BASE_ADDR,
+                      RAM_PAGE_SIZE,
+                      CACHE_BLOCK_SIZE),
+          staging_ioaddr_(0),
+          staging_ptr_(nullptr),
+          staging_size_(0) {}
+
+    ~vx_device() {
+        if (staging_ptr_ != nullptr) {
+            drv_release_buffer(staging_ptr_);
+            staging_ptr_   = nullptr;
+            staging_size_  = 0;
+            staging_ioaddr_ = 0;
+        }
+        drv_close();
+    }
+
+    int init() {
+        if (drv_init() != 0) {
+            std::fprintf(stderr, "[VXDRV] drv_init failed\n");
+            return -1;
+        }
+        return 0;
+    }
+
+    // Compile-time capability table. Mirrors
+    // sw/runtime/simx/vortex.cpp:51–103. The runtime and the SimX-side
+    // device library share a build tree, so the same VX_config.h
+    // macros are authoritative on both sides.
+    int get_caps(uint32_t caps_id, uint64_t *value) {
+        switch (caps_id) {
+        case VX_CAPS_VERSION:         *value = IMPLEMENTATION_ID; break;
+        case VX_CAPS_NUM_THREADS:     *value = NUM_THREADS; break;
+        case VX_CAPS_NUM_WARPS:       *value = NUM_WARPS; break;
+        case VX_CAPS_NUM_CORES:       *value = NUM_CORES * NUM_CLUSTERS; break;
+        case VX_CAPS_NUM_CLUSTERS:    *value = NUM_CLUSTERS; break;
+        case VX_CAPS_SOCKET_SIZE:     *value = SOCKET_SIZE; break;
+        case VX_CAPS_ISSUE_WIDTH:     *value = ISSUE_WIDTH; break;
+        case VX_CAPS_CACHE_LINE_SIZE: *value = CACHE_BLOCK_SIZE; break;
+        case VX_CAPS_GLOBAL_MEM_SIZE: *value = GLOBAL_MEM_SIZE; break;
+        case VX_CAPS_LOCAL_MEM_SIZE:  *value = (1 << LMEM_LOG_SIZE); break;
+        case VX_CAPS_ISA_FLAGS:
+            *value = ((uint64_t(MISA_EXT)) << 32)
+                   | ((log2floor(XLEN) - 4) << 30)
+                   |   MISA_STD;
+            break;
+        case VX_CAPS_NUM_MEM_BANKS:   *value = PLATFORM_MEMORY_NUM_BANKS; break;
+        case VX_CAPS_MEM_BANK_SIZE:   *value = 1ull << (MEM_ADDR_WIDTH / PLATFORM_MEMORY_NUM_BANKS); break;
+        case VX_CAPS_CLOCK_RATE:      *value = 0; break;
+        case VX_CAPS_PEAK_MEM_BW:     *value = PLATFORM_MEMORY_PEAK_BW; break;
+        default:
+            std::fprintf(stderr, "[VXDRV] invalid caps id: %u\n", caps_id);
+            return -1;
+        }
+        return 0;
+    }
+
+    int mem_alloc(uint64_t size, int flags, uint64_t *dev_addr) {
+        uint64_t addr;
+        CHECK_ERR(global_mem_.allocate(size, &addr), { return err; });
+        CHECK_ERR(this->mem_access(addr, size, flags), {
+            global_mem_.release(addr);
+            return err;
+        });
+        *dev_addr = addr;
+        return 0;
+    }
+
+    int mem_reserve(uint64_t dev_addr, uint64_t size, int flags) {
+        CHECK_ERR(global_mem_.reserve(dev_addr, size), { return err; });
+        CHECK_ERR(this->mem_access(dev_addr, size, flags), {
+            global_mem_.release(dev_addr);
+            return err;
+        });
+        return 0;
+    }
+
+    int mem_free(uint64_t dev_addr) {
+        return global_mem_.release(dev_addr);
+    }
+
+    int mem_access(uint64_t /*dev_addr*/, uint64_t /*size*/, int /*flags*/) {
+        // Access control is enforced by the device's RAM ACL (in
+        // libvortex-gem5.so). The host runtime has nothing to do here.
+        return 0;
+    }
+
+    int mem_info(uint64_t *mem_free, uint64_t *mem_used) const {
+        if (mem_free) *mem_free = global_mem_.free();
+        if (mem_used) *mem_used = global_mem_.allocated();
+        return 0;
+    }
+
+    int copy(uint64_t /*dest*/, uint64_t /*src*/, uint64_t /*size*/) {
+        // Device-to-device copy not in the OPAE command set (no
+        // CMD_MEM_COPY); the OPAE FPGA path goes through libopae's
+        // fpgaCopyBuffer which we don't have. Leave unimplemented
+        // for Phase 4; can be added by extending the device with a
+        // new CMD type in a later phase.
+        std::fprintf(stderr, "[VXDRV] copy() not supported in gem5 backend\n");
+        return -1;
+    }
+
+    int upload(uint64_t dev_addr, const void *host_ptr, uint64_t size) {
+        if (!is_aligned(dev_addr, CACHE_BLOCK_SIZE)) return -1;
+        const uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
+        if (dev_addr + asize > GLOBAL_MEM_SIZE) return -1;
+
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+        if (this->ensure_staging(asize) != 0)     return -1;
+
+        std::memcpy(staging_ptr_, host_ptr, size);
+
+        const auto ls_shift = log2ceil(CACHE_BLOCK_SIZE);
+        mmio_write64(MMIO_CMD_ARG0, staging_ioaddr_ >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG1, dev_addr        >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG2, asize           >> ls_shift);
+        issue_cmd(CMD_MEM_WRITE);
+
+        return this->ready_wait(VX_MAX_TIMEOUT);
+    }
+
+    int download(void *host_ptr, uint64_t dev_addr, uint64_t size) {
+        if (!is_aligned(dev_addr, CACHE_BLOCK_SIZE)) return -1;
+        const uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
+        if (dev_addr + asize > GLOBAL_MEM_SIZE) return -1;
+
+        // Drain dirty cache lines all the way to VRAM before reading
+        // back, per B9 in proposal §2.2. One DCR_READ on the magic
+        // cache-flush DCR per core; the device routes it through
+        // Processor::flush_caches().
+        {
+            uint64_t num_cores;
+            CHECK_ERR(this->get_caps(VX_CAPS_NUM_CORES, &num_cores), { return err; });
+            uint32_t dummy;
+            for (uint32_t cid = 0; cid < (uint32_t)num_cores; ++cid) {
+                CHECK_ERR(this->dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy),
+                          { return err; });
+            }
+        }
+
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+        if (this->ensure_staging(asize) != 0)     return -1;
+
+        const auto ls_shift = log2ceil(CACHE_BLOCK_SIZE);
+        mmio_write64(MMIO_CMD_ARG0, staging_ioaddr_ >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG1, dev_addr        >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG2, asize           >> ls_shift);
+        issue_cmd(CMD_MEM_READ);
+
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+
+        std::memcpy(host_ptr, staging_ptr_, size);
+        return 0;
+    }
+
+    int start() {
+        issue_cmd(CMD_RUN);
+        return 0;
+    }
+
+    // Poll MMIO_STATUS; the high bits carry stdout/stderr text from
+    // device-side printf, the same protocol as sw/runtime/opae/vortex.cpp.
+    // Back off between polls with sched_yield(): gem5 SE-mode does not
+    // implement clock_nanosleep (which glibc's nanosleep() routes
+    // through), whereas sched_yield is in the syscall_tbl64 ignore
+    // list and returns immediately. Inside gem5 that just means the
+    // next poll happens on the next simulated CPU instruction.
+    int ready_wait(uint64_t timeout) {
+        std::unordered_map<uint32_t, std::stringstream> print_bufs;
+        const uint64_t step_ms = 1;  // charged per poll; no real sleep happens
+
+        for (;;) {
+            uint64_t status = mmio_read64(MMIO_STATUS);
+
+            // Drain any console data the device produced.
+            uint32_t cout_data = status >> STATUS_STATE_BITS;
+            if (cout_data & 0x1) {
+                do {
+                    const char     cout_char = (cout_data >> 1) & 0xff;
+                    const uint32_t cout_tid  = (cout_data >> 9) & 0xff;
+                    auto& ss = print_bufs[cout_tid];
+                    ss << cout_char;
+                    if (cout_char == '\n') {
+                        std::cout << std::dec << "#" << cout_tid
+                                  << ": " << ss.str() << std::flush;
+                        ss.str("");
+                    }
+                    status = mmio_read64(MMIO_STATUS);
+                    cout_data = status >> STATUS_STATE_BITS;
+                } while (cout_data & 0x1);
+            }
+
+            const uint32_t state = status & ((1 << STATUS_STATE_BITS) - 1);
+            if (state == 0 || timeout == 0) {
+                for (auto& kv : print_bufs) {
+                    auto s = kv.second.str();
+                    if (!s.empty()) {
+                        std::cout << "#" << kv.first << ": " << s << std::endl;
+                    }
+                }
+                if (state != 0) {
+                    std::fprintf(stdout, "[VXDRV] ready-wait timed out: state=%u\n", state);
+                    return -1;
+                }
+                return 0;
+            }
+
+            sched_yield();
+            timeout -= step_ms;
+        }
+    }
+
+    int dcr_write(uint32_t addr, uint32_t value) {
+        mmio_write64(MMIO_CMD_ARG0, addr);
+        mmio_write64(MMIO_CMD_ARG1, value);
+        issue_cmd(CMD_DCR_WRITE);
+        return 0;
+    }
+
+    int dcr_read(uint32_t addr, uint32_t tag, uint32_t *value) {
+        mmio_write64(MMIO_CMD_ARG0, addr);
+        mmio_write64(MMIO_CMD_ARG1, tag);
+        issue_cmd(CMD_DCR_READ);
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+        *value = static_cast<uint32_t>(mmio_read64(MMIO_DCR_RSP));
+        return 0;
+    }
+
+private:
+    int ensure_staging(uint64_t size) {
+        if (staging_size_ >= size) return 0;
+        if (staging_ptr_ != nullptr) {
+            drv_release_buffer(staging_ptr_);
+            staging_ptr_   = nullptr;
+            staging_size_  = 0;
+            staging_ioaddr_ = 0;
+        }
+        if (drv_pin_buffer(size, reinterpret_cast<void**>(&staging_ptr_),
+                           &staging_ioaddr_) != 0) {
+            return -1;
+        }
+        staging_size_ = size;
+        return 0;
+    }
+
+    MemoryAllocator global_mem_;
+    uint64_t staging_ioaddr_;
+    uint8_t* staging_ptr_;
+    uint64_t staging_size_;
+};
+
+#include <callbacks.inc>
diff --git a/sw/runtime/stub/Makefile b/sw/runtime/stub/Makefile
index 64413680c7..895fa8466b 100644
--- a/sw/runtime/stub/Makefile
+++ b/sw/runtime/stub/Makefile
@@ -1,5 +1,13 @@
 include ../common.mk
 
+# HOST_ARCH switch — when building for a non-native simulated host
+# (e.g. running x86 gem5 with an aarch64 simulated CPU), select the
+# matching cross-compiler. Aligns with sw/runtime/gem5/Makefile's
+# HOST_ARCH knob; cross-arch builds land in $(DESTDIR)/$(HOST_ARCH)/
+# so the same dlopen target name (libvortex.so) can coexist with the
+# native build in $(DESTDIR)/.
+HOST_ARCH ?= x86_64
+
 DESTDIR ?= $(CURDIR)/..
 
 SRC_DIR := $(VORTEX_HOME)/sw/runtime/stub
@@ -10,6 +18,19 @@ CXXFLAGS += -fPIC
 
 LDFLAGS += -shared -pthread -ldl -Wl,-soname,libvortex.so
 
+ifeq ($(HOST_ARCH),x86_64)
+    CXX := g++
+    OUT_DIR := $(DESTDIR)
+else ifeq ($(HOST_ARCH),aarch64)
+    CXX := aarch64-linux-gnu-g++
+    OUT_DIR := $(DESTDIR)/aarch64
+else ifeq ($(HOST_ARCH),armhf)
+    CXX := arm-linux-gnueabihf-g++
+    OUT_DIR := $(DESTDIR)/armhf
+else
+    $(error HOST_ARCH must be one of: x86_64, aarch64, armhf (got $(HOST_ARCH)))
+endif
+
 SRCS := $(SRC_DIR)/vortex.cpp $(SRC_DIR)/utils.cpp $(SRC_DIR)/perf.cpp $(RT_COMMON_DIR)/utils.cpp
 
 # Debugging
@@ -21,12 +42,13 @@ endif
 
 PROJECT := libvortex.so
 
-all: $(DESTDIR)/$(PROJECT)
+all: $(OUT_DIR)/$(PROJECT)
 
-$(DESTDIR)/$(PROJECT): $(SRCS)
+$(OUT_DIR)/$(PROJECT): $(SRCS)
+	@mkdir -p $(OUT_DIR)
 	$(CXX) $(CXXFLAGS) $^ $(LDFLAGS) -o $@
 
 clean:
-	rm -f $(DESTDIR)/$(PROJECT)
+	rm -f $(DESTDIR)/$(PROJECT) $(DESTDIR)/aarch64/$(PROJECT) $(DESTDIR)/armhf/$(PROJECT)
 
 .PHONY: all clean
\ No newline at end of file
diff --git a/tests/regression/common.mk b/tests/regression/common.mk
index 536fcd6f85..6484ed1e0f 100644
--- a/tests/regression/common.mk
+++ b/tests/regression/common.mk
@@ -83,7 +83,39 @@ CXXFLAGS += -std=c++17 -Wall -Wextra -pedantic -Wfatal-errors -Werror
 CXXFLAGS += -I$(VORTEX_HOME)/sw/runtime/include -I$(ROOT_DIR)/sw -I$(ROOT_DIR)/hw -I$(SW_COMMON_DIR)
 CXXFLAGS += $(CONFIGS)
 
-LDFLAGS += -L$(VORTEX_RT_LIB) -lvortex
+# HOST_ARCH selects the simulated-host compiler for the test binary
+# (the .vxbin always builds with the RISC-V toolchain regardless).
+# When non-native, the binary is suffixed (e.g. vecadd-aarch64) and
+# we link against the cross-compiled stub in $(VORTEX_RT_LIB)/$(HOST_ARCH)/.
+# Aligns with sw/runtime/{stub,gem5}/Makefile's HOST_ARCH knob; the
+# gem5 ARM e2e test path uses this to produce aarch64 binaries that
+# the simulated ARM CPU inside gem5 can execute.
+#
+# Cross-compiled ELFs embed `/lib/ld-linux-$arch.so.1` as the dynamic
+# linker (PT_INTERP). gem5 doesn't have that path on the host, but
+# it has a setInterpDir() API that prepends a sysroot to the
+# interpreter lookup — the gem5 Python config calls that when
+# DRIVER=gem5-aarch64. Keep the default INTERP here so that mechanism
+# can do the redirection cleanly. (Earlier versions used
+# `-Wl,--dynamic-linker=` to rewrite PT_INTERP, but that interacts
+# badly with setInterpDir's prepend logic.)
+HOST_ARCH ?= x86_64
+ifeq ($(HOST_ARCH),x86_64)
+    PROJECT_SUFFIX :=
+    RT_LIB_DIR := $(VORTEX_RT_LIB)
+else ifeq ($(HOST_ARCH),aarch64)
+    CXX := aarch64-linux-gnu-g++
+    PROJECT_SUFFIX := -aarch64
+    RT_LIB_DIR := $(VORTEX_RT_LIB)/aarch64
+else ifeq ($(HOST_ARCH),armhf)
+    CXX := arm-linux-gnueabihf-g++
+    PROJECT_SUFFIX := -armhf
+    RT_LIB_DIR := $(VORTEX_RT_LIB)/armhf
+else
+    $(error HOST_ARCH must be one of: x86_64, aarch64, armhf (got $(HOST_ARCH)))
+endif
+
+LDFLAGS += -L$(RT_LIB_DIR) -lvortex
 
 # Debugging
 ifdef DEBUG
@@ -106,7 +138,11 @@ endif
 
 CONFIG_STAMP = config.stamp
 
-all: $(PROJECT) kernel.vxbin kernel.dump
+# HOST_ARCH-suffixed binary name (vecadd, vecadd-aarch64, …) so
+# x86 and cross-compiled variants coexist in the same dir.
+APP := $(PROJECT)$(PROJECT_SUFFIX)
+
+all: $(APP) kernel.vxbin kernel.dump
 
 # Force rebuild when CONFIGS (defines) change between runs.
 $(CONFIG_STAMP): FORCE
@@ -146,9 +182,16 @@ kernel.elf: vx_start.o $(VX_SRCS) $(VORTEX_KN_PATH)/lib$(KERNEL_LIB).a $(CONFIG_
 	$(VX_CXX) $(VX_CFLAGS) vx_start.o $(VX_APP_OBJS) $(VX_LDFLAGS) -o $@
 endif
 
-$(PROJECT): $(SRCS) $(VORTEX_RT_LIB)/libvortex.so $(CONFIG_STAMP)
+$(APP): $(SRCS) $(RT_LIB_DIR)/libvortex.so $(CONFIG_STAMP)
 	$(CXX) $(CXXFLAGS) $(filter-out $(CONFIG_STAMP),$^) $(LDFLAGS) -o $@
 
+# Cross-compiled stub for non-native HOST_ARCH. The native (x86_64)
+# stub is built by the $(VORTEX_RT_LIB)/libvortex.so rule below.
+ifneq ($(HOST_ARCH),x86_64)
+$(RT_LIB_DIR)/libvortex.so:
+	$(RUNTIME_ARGS) $(MAKE) -C $(VORTEX_RT_SRC)/stub HOST_ARCH=$(HOST_ARCH) DESTDIR=$(VORTEX_RT_LIB)
+endif
+
 run-simx: $(PROJECT) kernel.vxbin
 	$(RUNTIME_ARGS) $(MAKE) -C $(VORTEX_RT_SRC)/simx DESTDIR=$(VORTEX_RT_LIB)
 	LD_LIBRARY_PATH=$(VORTEX_RT_LIB):$(LD_LIBRARY_PATH) VORTEX_DRIVER=simx ./$(PROJECT) $(OPTS)

From dc419a3914dc8e74d166470e505794b0e4bf5ff1 Mon Sep 17 00:00:00 2001
From: tinebp 
Date: Mon, 18 May 2026 05:39:33 -0700
Subject: [PATCH 2/2] ci: consolidate gem5+SST test runners on a parallel
 hostless/e2e naming
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rename and generalize the per-backend gem5 and SST test runners so they
share a uniform env-var interface and a common naming convention that
makes the hostless vs. e2e distinction explicit.

Naming layout (hostless = no host CPU; e2e = host CPU + dispatcher + CP):

                  hostless                     e2e
    gem5    gem5_run_hostless_app.py    gem5_run_app.py
    SST     sst_run_hostless_app.py     (reserved; no SST CPU integration today)

gem5 changes:
- ci/gem5_test_vortex_hello.py → ci/gem5_run_hostless_app.py: parameterized
  by VORTEX_GEM5_DEV_LIB + VORTEX_TEST_DIR + VORTEX_TEST_KERNEL (default
  kernel.vxbin). Drops the hardcoded VORTEX_GEM5_KERNEL path; any
  regression test's kernel.vxbin can now run hostless without its host
  binary.
- ci/gem5_test_vortex_app.py → ci/gem5_run_app.py (rename only).
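
The env-var contract described above can be sketched roughly as follows.
This is a minimal illustration, not the runner's actual code: the helper
name resolve_kernel is hypothetical, and taking the environment as a
parameter is an assumption made here only so the sketch can be exercised
without touching the real process environment.

```python
import os


def resolve_kernel(env=None):
    """Return (dev_lib, kernel_path) from the hostless-runner env vars.

    Hypothetical helper: the variable names come from this commit, but
    the function itself is illustrative only.
    """
    env = os.environ if env is None else env

    dev_lib = env.get("VORTEX_GEM5_DEV_LIB")
    test_dir = env.get("VORTEX_TEST_DIR")
    # The kernel filename defaults to the regression-test convention.
    test_kernel = env.get("VORTEX_TEST_KERNEL", "kernel.vxbin")

    # Both the device library and the test directory are mandatory.
    for name, val in (("VORTEX_GEM5_DEV_LIB", dev_lib),
                      ("VORTEX_TEST_DIR", test_dir)):
        if not val:
            raise RuntimeError(f"{name} env var is required")

    return dev_lib, os.path.join(test_dir, test_kernel)
```

With only VORTEX_GEM5_DEV_LIB and VORTEX_TEST_DIR set, the kernel path
falls back to kernel.vxbin inside the test directory, which is what lets
any regression test's kernel run hostless without extra configuration.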

SST changes:
- Collapse 4 hardcoded stubs (sst_test_vortex_{hello,fibonacci,vecadd,
  conform}.py) into ci/sst_run_hostless_app.py — same env-var
  interface as the gem5 hostless runner.
- Delete ci/sst_test_vortex_memHierarchy.py: not called by regression
  and the wiring recipe is preserved in
  docs/proposals/sst_simx_v3_proposal.md §6.
- Verify USE_SST=1 builds clean post-merge; full SST regression matrix
  (hello / fibonacci / vecadd / conform) passes end-to-end through
  ci/sst_run_hostless_app.py.

Other cleanups:
- ci/regression.sh.in: rewrite gem5() + sst() entries against the new
  runner names + env vars.
- docs/gem5_integration.md: update both invocation examples and the
  reference-implementations list.
- docs/proposals/sst_simx_v3_proposal.md: add an "Implemented" status
  note recording the runner consolidation + the reserved sst_run_app.py
  slot for a future host-CPU SST integration.
- docs/proposals/gem5_v2_cp_migration_proposal.md: update validation
  reference to the new runner filename.
- sw/runtime/gem5/Makefile: drop stale vortex_opae.h / AFU_IMAGE_*
  Makefile comment block (the runtime no longer includes vortex_opae.h
  after the pure-v2 callbacks redesign).

Co-Authored-By: Claude Opus 4.7 (1M context) 
---
 ...em5_test_vortex_app.py => gem5_run_app.py} |  0
 ...rtex_hello.py => gem5_run_hostless_app.py} | 65 +++++++++++--------
 ci/regression.sh.in                           | 59 ++++++++++-------
 ci/sst_run_hostless_app.py                    | 53 +++++++++++++++
 ci/sst_test_vortex_conform.py                 |  7 --
 ci/sst_test_vortex_fibonacci.py               |  7 --
 ci/sst_test_vortex_hello.py                   |  7 --
 ci/sst_test_vortex_memHierarchy.py            | 63 ------------------
 ci/sst_test_vortex_vecadd.py                  |  7 --
 docs/gem5_integration.md                      | 26 +++++---
 .../gem5_v2_cp_migration_proposal.md          |  2 +-
 docs/proposals/sst_simx_v3_proposal.md        |  2 +-
 sw/runtime/gem5/Makefile                      |  7 --
 13 files changed, 145 insertions(+), 160 deletions(-)
 rename ci/{gem5_test_vortex_app.py => gem5_run_app.py} (100%)
 rename ci/{gem5_test_vortex_hello.py => gem5_run_hostless_app.py} (53%)
 create mode 100644 ci/sst_run_hostless_app.py
 delete mode 100644 ci/sst_test_vortex_conform.py
 delete mode 100644 ci/sst_test_vortex_fibonacci.py
 delete mode 100644 ci/sst_test_vortex_hello.py
 delete mode 100644 ci/sst_test_vortex_memHierarchy.py
 delete mode 100644 ci/sst_test_vortex_vecadd.py

diff --git a/ci/gem5_test_vortex_app.py b/ci/gem5_run_app.py
similarity index 100%
rename from ci/gem5_test_vortex_app.py
rename to ci/gem5_run_app.py
diff --git a/ci/gem5_test_vortex_hello.py b/ci/gem5_run_hostless_app.py
similarity index 53%
rename from ci/gem5_test_vortex_hello.py
rename to ci/gem5_run_hostless_app.py
index c21ca78d39..65c92602cc 100644
--- a/ci/gem5_test_vortex_hello.py
+++ b/ci/gem5_run_hostless_app.py
@@ -11,25 +11,30 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# Standalone gem5 integration test for vortex.VortexGPGPU.
+# Hostless gem5 integration test for vortex.VortexGPGPU.
 #
-# The SimObject loads the kernel directly via its `kernel=` parameter
-# and runs it via its internal vortexTickEvent_ chain — no host CPU,
-# no CP, no PIO/DMA. Smoke-tests the gem5↔libvortex-gem5.so wiring:
-# dlopen succeeds, SimObject constructs, Processor::cycle() drives
-# from the gem5 event loop, sim exits cleanly.
+# The SimObject loads a .vxbin kernel directly via its `kernel=`
+# parameter and runs it via its internal vortexTickEvent_ chain: no
+# host CPU, no Command Processor, no PIO/DMA. Smoke-tests the
+# gem5↔libvortex-gem5.so wiring: dlopen succeeds, the SimObject
+# constructs, Processor::cycle() is driven from the gem5 event
+# loop, and the sim exits cleanly.
 #
-# The end-to-end variant ([gem5_test_vortex_app.py](gem5_test_vortex_app.py))
-# wires up the host CPU + CP regfile + BAR-mapped VRAM on top.
+# Hosted counterpart: [gem5_run_app.py](gem5_run_app.py) wires up the
+# host CPU + CP regfile + BAR-mapped VRAM on top.
 #
-# Configurable via env vars:
-#   VORTEX_GEM5_LIB    — path to libvortex-gem5.so (no default)
-#   VORTEX_GEM5_KERNEL — path to .vxbin to preload (no default)
+# Configurable via env vars (parallel to gem5_run_app.py):
+#   VORTEX_GEM5_DEV_LIB — path to libvortex-gem5.so (no default)
+#   VORTEX_TEST_DIR     — directory containing the kernel .vxbin
+#   VORTEX_TEST_KERNEL  — kernel filename inside that dir
+#                         (default: kernel.vxbin, matching the
+#                          regression-test convention)
 #
 # Run from the Vortex build dir as:
-#   VORTEX_GEM5_LIB=$PWD/sim/simx/libvortex-gem5.so \
-#   VORTEX_GEM5_KERNEL=$PWD/tests/kernel/hello/hello.vxbin \
-#   $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py
+#   VORTEX_GEM5_DEV_LIB=$PWD/sim/simx/libvortex-gem5.so \
+#   VORTEX_TEST_DIR=$PWD/tests/kernel/hello \
+#   VORTEX_TEST_KERNEL=hello.vxbin \
+#       $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_hostless_app.py
 
 import os
 import m5
@@ -45,12 +50,16 @@
     VortexGPGPU,
 )
 
-LIBRARY = os.environ.get("VORTEX_GEM5_LIB")
-KERNEL  = os.environ.get("VORTEX_GEM5_KERNEL")
-if not LIBRARY:
-    raise RuntimeError("VORTEX_GEM5_LIB env var is required")
-if not KERNEL:
-    raise RuntimeError("VORTEX_GEM5_KERNEL env var is required")
+DEV_LIB     = os.environ.get("VORTEX_GEM5_DEV_LIB")
+TEST_DIR    = os.environ.get("VORTEX_TEST_DIR")
+TEST_KERNEL = os.environ.get("VORTEX_TEST_KERNEL", "kernel.vxbin")
+
+for name, val in [("VORTEX_GEM5_DEV_LIB", DEV_LIB),
+                  ("VORTEX_TEST_DIR",     TEST_DIR)]:
+    if not val:
+        raise RuntimeError(f"{name} env var is required")
+
+KERNEL = f"{TEST_DIR}/{TEST_KERNEL}"
 
 # Minimal system: just enough to hang the VortexGPGPU off a membus
 # so gem5 considers it a properly-wired SimObject. No CPU in this
@@ -65,7 +74,7 @@
 # Membus + a small backing memory so PIO ranges have somewhere to bind.
 system.membus = SystemXBar()
 
-# Memory controller (unused at runtime in standalone mode but required
+# Memory controller (unused at runtime in hostless mode but required
 # for the system to instantiate cleanly).
 system.mem_ctrl = MemCtrl()
 system.mem_ctrl.dram = DDR3_1600_8x8()
@@ -75,9 +84,9 @@
 # The Vortex device. It inherits clock from the system clock domain
 # (set above to 1GHz) via ClockedObject; no explicit `clock=` param.
 system.vortex = VortexGPGPU(
-    library = LIBRARY,
+    library = DEV_LIB,
     kernel  = KERNEL,
-    # Explicitly disable the BAR-mapped VRAM range — the standalone
+    # Explicitly disable the BAR-mapped VRAM range — the hostless
     # path loads the kernel via the device library's load_kernel()
     # entry, never via host memcpy through PIN. Leaving it enabled
     # here would conflict with this test's DRAM range.
@@ -90,10 +99,10 @@
 root = Root(full_system=False, system=system)
 m5.instantiate()
 
-print(f"Standalone: VortexGPGPU library={LIBRARY}")
-print(f"Standalone: kernel={KERNEL}")
-print("Standalone: running until VortexGPGPU exits the sim loop...")
+print(f"Hostless: VortexGPGPU.library={DEV_LIB}")
+print(f"Hostless: kernel={KERNEL}")
+print("Hostless: running until VortexGPGPU exits the sim loop...")
 
 exit_event = m5.simulate()
-print(f"Standalone: exit_event.cause = {exit_event.getCause()!r}")
-print(f"Standalone: tick = {m5.curTick()}")
+print(f"Hostless: exit_event.cause = {exit_event.getCause()!r}")
+print(f"Hostless: tick = {m5.curTick()}")
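
Review note: the env-var handling introduced in this file can be exercised without gem5. The following is a minimal sketch, assuming a hypothetical helper `resolve_kernel` (not part of the patch, which inlines this logic at module scope) that takes the environment as a plain mapping so it can be checked outside `gem5.opt`.

```python
def resolve_kernel(env):
    """Return (dev_lib, kernel_path) from a mapping of env vars.

    Hypothetical helper for illustration only; it mirrors the checks
    that ci/gem5_run_hostless_app.py performs on os.environ.
    """
    dev_lib = env.get("VORTEX_GEM5_DEV_LIB")
    test_dir = env.get("VORTEX_TEST_DIR")
    # Optional variable: falls back to the regression-test convention.
    test_kernel = env.get("VORTEX_TEST_KERNEL", "kernel.vxbin")

    # Both of these are mandatory; an empty string counts as missing.
    for name, val in [("VORTEX_GEM5_DEV_LIB", dev_lib),
                      ("VORTEX_TEST_DIR", test_dir)]:
        if not val:
            raise RuntimeError(f"{name} env var is required")

    return dev_lib, f"{test_dir}/{test_kernel}"
```

One thing the sketch makes visible: the default applies only when `VORTEX_TEST_KERNEL` is unset. If it is set to an empty string, the resulting path is just the directory followed by a slash.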
diff --git a/ci/regression.sh.in b/ci/regression.sh.in
index d8c22e753a..5d1ddf82ae 100755
--- a/ci/regression.sh.in
+++ b/ci/regression.sh.in
@@ -95,10 +95,22 @@ sst()
 
     cp sim/simx/libvortex.so $SST_ELEMENTS_HOME/lib/sst-elements-library/   # alternatively - $ sst --add-lib-path `pwd` myConfig.py
 
-    sst ci/sst_test_vortex_hello.py
-    sst ci/sst_test_vortex_fibonacci.py
-    sst ci/sst_test_vortex_vecadd.py
-    sst ci/sst_test_vortex_conform.py
+    BUILD_DIR=$(pwd)
+
+    # Hostless SST runner (ci/sst_run_hostless_app.py) parameterized
+    # by VORTEX_TEST_DIR + VORTEX_TEST_KERNEL — same shape as
+    # ci/gem5_run_hostless_app.py. SST is hostless-only today (no
+    # CPU component wired to Vortex); the ci/sst_run_app.py name
+    # slot is reserved for a future host-CPU SST integration.
+    for spec in "hello:hello.vxbin" "fibonacci:fibonacci.vxbin" \
+                "vecadd:vecadd.vxbin" "conform:conform.vxbin"; do
+        kern="${spec%%:*}"
+        vxbin="${spec#*:}"
+        echo "=== sst: $kern ==="
+        VORTEX_TEST_DIR=$BUILD_DIR/tests/kernel/$kern \
+        VORTEX_TEST_KERNEL=$vxbin \
+            sst ci/sst_run_hostless_app.py
+    done
 
     echo "sst tests done!"
 }
@@ -146,22 +158,24 @@ gem5()
     LIB_GEM5_DEV=$BUILD_DIR/sim/simx/libvortex-gem5.so
     HOST_RT_DIR=$BUILD_DIR/sw/runtime
 
-    # Phase 3 standalone smoke — no host CPU, kernel preload.
-    # env-vars MUST precede the binary (gem5.opt would otherwise
-    # treat them as positional args).
-    VORTEX_GEM5_LIB=$LIB_GEM5_DEV \
-    VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \
+    # Hostless smoke — no host CPU, kernel preloaded via SimObject param.
+    # env-vars MUST precede the binary (gem5.opt would otherwise treat
+    # them as positional args).
+    echo "=== gem5 hostless: hello ==="
+    VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \
+    VORTEX_TEST_DIR=$BUILD_DIR/tests/kernel/hello \
+    VORTEX_TEST_KERNEL=hello.vxbin \
         timeout 120 $GEM5_HOME/build/X86/gem5.opt \
-        ci/gem5_test_vortex_hello.py
+        ci/gem5_run_hostless_app.py
 
-    # Phase 5 e2e — CP-driven path through the host runtime.
-    # Generic test runner (ci/gem5_test_vortex_app.py) parameterized
-    # by VORTEX_TEST_BIN + VORTEX_TEST_ARGS. Sizes are chosen so each
-    # run fits in the 120s per-test budget (feedback_test_timeout_120s):
+    # E2E — CP-driven path through the host runtime. Generic runner
+    # (ci/gem5_run_app.py) parameterized by VORTEX_TEST_BIN +
+    # VORTEX_TEST_ARGS. Sizes fit the 120s per-test budget
+    # (feedback_test_timeout_120s):
     #   - vecadd -n16   small vector add
     #   - sgemm  -n4    4x4 matrix multiply
-    # Larger sizes overrun the budget because the simulated host CPU's
-    # CP poll loop burns gem5 wall time proportional to kernel runtime.
+    # Larger sizes overrun because the simulated host CPU's CP poll
+    # loop burns gem5 wall time proportional to kernel runtime.
     # Run on local dev box for larger sizes by overriding VORTEX_TEST_ARGS.
     for spec in "vecadd:-n16" "sgemm:-n4"; do
         app="${spec%%:*}"
@@ -173,7 +187,7 @@ gem5()
         VORTEX_TEST_BIN=$app \
         VORTEX_TEST_ARGS=$args \
             timeout 120 $GEM5_HOME/build/X86/gem5.opt \
-            ci/gem5_test_vortex_app.py
+            ci/gem5_run_app.py
     done
 
     # ARM matrix (opt-in). The device library (libvortex-gem5.so) is
@@ -195,11 +209,12 @@ gem5()
 
         ARM_HOST_RT_DIR=$BUILD_DIR/sw/runtime/aarch64
 
-        echo "=== gem5 ARM standalone: hello ==="
-        VORTEX_GEM5_LIB=$LIB_GEM5_DEV \
-        VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \
+        echo "=== gem5 ARM hostless: hello ==="
+        VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \
+        VORTEX_TEST_DIR=$BUILD_DIR/tests/kernel/hello \
+        VORTEX_TEST_KERNEL=hello.vxbin \
             timeout 120 $GEM5_HOME/build/ARM/gem5.opt \
-            ci/gem5_test_vortex_hello.py
+            ci/gem5_run_hostless_app.py
 
         for spec in "vecadd:-n16" "sgemm:-n4"; do
             app="${spec%%:*}"
@@ -212,7 +227,7 @@ gem5()
             VORTEX_TEST_ARGS=$args \
             VORTEX_DRIVER=gem5-aarch64 \
                 timeout 120 $GEM5_HOME/build/ARM/gem5.opt \
-                ci/gem5_test_vortex_app.py
+                ci/gem5_run_app.py
         done
     fi
 
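
Review note: the `name:value` spec strings used by the loops in this file are split on the first colon. `${spec%%:*}` keeps everything before it and `${spec#*:}` keeps everything after it. A rough Python equivalent follows, using a hypothetical helper `split_spec` that exists only for illustration.

```python
def split_spec(spec):
    """Split "name:value" on the first colon.

    Mirrors the shell expansions ${spec%%:*} (name) and ${spec#*:} (value).
    Hypothetical helper; the patch does this inline in shell.
    """
    name, sep, value = spec.partition(":")
    if not sep:
        # With no colon present, neither shell pattern matches, so both
        # expansions return the whole string unchanged.
        return spec, spec
    return name, value
```

The no-colon case matters if a spec is ever written without an argument: both the name and the value would then silently be the full string.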
diff --git a/ci/sst_run_hostless_app.py b/ci/sst_run_hostless_app.py
new file mode 100644
index 0000000000..3f86188081
--- /dev/null
+++ b/ci/sst_run_hostless_app.py
@@ -0,0 +1,53 @@
+# Copyright © 2019-2023
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Hostless SST runner: instantiate a single vortex.VortexGPGPU
+# component and run the given kernel. SST runs Vortex co-resident in
+# one process, primes the KMU DCRs directly via proc_->dcr_write
+# inside sim/simx/sst/vortex_simulator.cpp, and ticks the simulation
+# to completion. No host CPU, no CP, no PIO/DMA.
+#
+# Hostless is the only mode the SST integration currently supports:
+# there is no SST CPU component (e.g. Ariel/Vanadis) wired to a
+# Vortex regression test binary today. A future ci/sst_run_app.py
+# could add that path; the name slot is reserved.
+#
+# For memHierarchy timing modeling, the VortexGPGPU component exposes
+# an optional `memIface` SubComponent slot — see
+# docs/proposals/sst_simx_v3_proposal.md for the wiring recipe.
+#
+# Configurable via env vars (parallel to ci/gem5_run_hostless_app.py):
+#   VORTEX_TEST_DIR    — directory containing the kernel .vxbin
+#   VORTEX_TEST_KERNEL — kernel filename inside that dir
+#                        (default: kernel.vxbin, matching the
+#                         regression-test convention)
+#
+# Run via:
+#   VORTEX_TEST_DIR=tests/kernel/hello VORTEX_TEST_KERNEL=hello.vxbin \
+#       sst ci/sst_run_hostless_app.py
+
+import os
+import sst
+
+TEST_DIR    = os.environ.get("VORTEX_TEST_DIR")
+TEST_KERNEL = os.environ.get("VORTEX_TEST_KERNEL", "kernel.vxbin")
+if not TEST_DIR:
+    raise RuntimeError("VORTEX_TEST_DIR env var is required")
+
+PROGRAM = f"{TEST_DIR}/{TEST_KERNEL}"
+
+gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
+gpu.addParams({
+    "clock":   "1GHz",
+    "program": PROGRAM,
+})
diff --git a/ci/sst_test_vortex_conform.py b/ci/sst_test_vortex_conform.py
deleted file mode 100644
index 25681dc6de..0000000000
--- a/ci/sst_test_vortex_conform.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/conform/conform.vxbin"
-})
diff --git a/ci/sst_test_vortex_fibonacci.py b/ci/sst_test_vortex_fibonacci.py
deleted file mode 100644
index b174543dbe..0000000000
--- a/ci/sst_test_vortex_fibonacci.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/fibonacci/fibonacci.vxbin"
-})
diff --git a/ci/sst_test_vortex_hello.py b/ci/sst_test_vortex_hello.py
deleted file mode 100644
index ca4fc01993..0000000000
--- a/ci/sst_test_vortex_hello.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/hello/hello.vxbin"
-})
diff --git a/ci/sst_test_vortex_memHierarchy.py b/ci/sst_test_vortex_memHierarchy.py
deleted file mode 100644
index 2193985fb5..0000000000
--- a/ci/sst_test_vortex_memHierarchy.py
+++ /dev/null
@@ -1,63 +0,0 @@
-# SST Phase 3 integration test for vortex.VortexGPGPU.
-#
-# Wires the VortexGPGPU component's optional `memIface` SubComponent slot
-# through an L1 cache to a memHierarchy.MemController. Every memory request
-# accepted by Vortex's local DRAM model is mirrored to the SST memHierarchy
-# as a StandardMem::Read or Write event, so memHierarchy can model timing /
-# capacity / contention alongside Vortex's own simulation.
-#
-# This is the Phase 3 demonstrator from docs/proposals/sst_simx_v3_proposal.md.
-# The local data path stays in Vortex (RAM is authoritative); SST sees
-# every transaction but doesn't have to serve data back. That gives us
-# meaningful integration without forcing v3's TLM data path through SST.
-
-import sst
-
-# --- Vortex GPGPU component (single-warp hello kernel) -----------------------
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock":   "1GHz",
-    "program": "tests/kernel/hello/hello.vxbin",
-})
-
-# Vortex's StandardMem-side adapter
-gpu_mem_iface = gpu.setSubComponent("memIface", "memHierarchy.standardInterface")
-
-# --- L1 cache between Vortex and memory --------------------------------------
-# A cache is required because memHierarchy.MemController routes via MemLink
-# and only registers its address range when there's an upstream cache that
-# advertises destinations.
-l1 = sst.Component("l1cache", "memHierarchy.Cache")
-l1.addParams({
-    "access_latency_cycles": "2",
-    "cache_frequency":       "1GHz",
-    "replacement_policy":    "lru",
-    "coherence_protocol":    "MESI",
-    "associativity":         "4",
-    "cache_line_size":       "64",
-    "L1":                    "1",
-    "cache_size":            "8KiB",
-})
-
-# --- Memory controller + simple backend (host RAM-backed) --------------------
-memctrl = sst.Component("memctrl0", "memHierarchy.MemController")
-memctrl.addParams({
-    "clock":          "1GHz",
-    "addr_range_end": 0x100000000 - 1,  # 4 GB
-})
-memory = memctrl.setSubComponent("backend", "memHierarchy.simpleMem")
-memory.addParams({
-    "access_time": "10ns",
-    "mem_size":    "4GiB",
-})
-
-# --- Wiring ------------------------------------------------------------------
-# Vortex GPGPU → L1 cache
-link_gpu_l1 = sst.Link("link_gpu_l1")
-link_gpu_l1.connect((gpu_mem_iface, "lowlink", "1ns"),
-                    (l1,            "highlink", "1ns"))
-
-# L1 cache → MemController
-link_l1_mem = sst.Link("link_l1_mem")
-link_l1_mem.connect((l1,      "lowlink",  "1ns"),
-                    (memctrl, "highlink", "1ns"))
diff --git a/ci/sst_test_vortex_vecadd.py b/ci/sst_test_vortex_vecadd.py
deleted file mode 100644
index 8a156cf81f..0000000000
--- a/ci/sst_test_vortex_vecadd.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/vecadd/vecadd.vxbin"
-})
diff --git a/docs/gem5_integration.md b/docs/gem5_integration.md
index 461835ebc2..5b2e0f1afe 100644
--- a/docs/gem5_integration.md
+++ b/docs/gem5_integration.md
@@ -163,17 +163,23 @@ install location — no extra setup needed.
 
 ### By hand
 
-**Standalone** (no host CPU; kernel preloaded via SimObject parameter):
+**Hostless** (no host CPU; kernel preloaded via SimObject parameter):
 
 ```bash
-VORTEX_GEM5_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
-VORTEX_GEM5_KERNEL=$(pwd)/tests/kernel/hello/hello.vxbin \
-    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py
+VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
+VORTEX_TEST_DIR=$(pwd)/tests/kernel/hello \
+VORTEX_TEST_KERNEL=hello.vxbin \
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_hostless_app.py
 ```
 
+`VORTEX_TEST_KERNEL` defaults to `kernel.vxbin`, so any standard
+regression test's kernel can be run hostless, without its host
+binary: point `VORTEX_TEST_DIR` at `$(pwd)/tests/regression/vecadd`,
+leave `VORTEX_TEST_KERNEL` unset, and run the same command as above.
+
 **End-to-end** — any standard Vortex regression test (host binary +
 kernel.vxbin) runs through the generic
-[`ci/gem5_test_vortex_app.py`](../ci/gem5_test_vortex_app.py) runner.
+[`ci/gem5_run_app.py`](../ci/gem5_run_app.py) runner.
 
 ```bash
 # vecadd
@@ -182,7 +188,7 @@ VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \
 VORTEX_TEST_DIR=$(pwd)/tests/regression/vecadd \
 VORTEX_TEST_BIN=vecadd \
 VORTEX_TEST_ARGS="-n16" \
-    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_app.py
 
 # sgemm
 VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
@@ -190,7 +196,7 @@ VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \
 VORTEX_TEST_DIR=$(pwd)/tests/regression/sgemm \
 VORTEX_TEST_BIN=sgemm \
 VORTEX_TEST_ARGS="-n4" \
-    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_app.py
 ```
 
 Expected output ends with:
@@ -228,7 +234,7 @@ is identity-mapped (cacheable=False) so the dispatcher's PIO MMIO
 reaches the SimObject's regfile decoder.
 
 These constants are duplicated in two places — `sw/runtime/gem5/driver.h`
-and `ci/gem5_test_vortex_app.py`. If you change one, change the other.
+and `ci/gem5_run_app.py`. If you change one, change the other.
 
 ## Writing your own gem5 Python script
 
@@ -317,8 +323,8 @@ m5.simulate()
 ```
 
 Reference implementations:
-- [ci/gem5_test_vortex_hello.py](../ci/gem5_test_vortex_hello.py) — standalone Phase-3 variant (preload via `kernel=` param; no host CPU)
-- [ci/gem5_test_vortex_app.py](../ci/gem5_test_vortex_app.py) — Phase-5 e2e variant (any regression test via `VORTEX_TEST_BIN`)
+- [ci/gem5_run_hostless_app.py](../ci/gem5_run_hostless_app.py) — hostless variant (preload via `kernel=` param; no host CPU)
+- [ci/gem5_run_app.py](../ci/gem5_run_app.py) — e2e variant (any regression test via `VORTEX_TEST_BIN`)
 
 ## Load-bearing invariants — do not violate
 
diff --git a/docs/proposals/gem5_v2_cp_migration_proposal.md b/docs/proposals/gem5_v2_cp_migration_proposal.md
index a5c8bfc7c0..035d0805bb 100644
--- a/docs/proposals/gem5_v2_cp_migration_proposal.md
+++ b/docs/proposals/gem5_v2_cp_migration_proposal.md
@@ -622,7 +622,7 @@ CommandProcessor wiring; runnable without gem5 itself).
   HOST_ARCH switch).
 
 **Validation:**
-- Phase 3 standalone test (`ci/gem5_test_vortex_hello.py`): PASSES.
+- Hostless test (`ci/gem5_run_hostless_app.py`): PASSES.
   (No host runtime involvement.)
 - `./ci/regression.sh --gem5`: PASSES — hello + vecadd + sgemm e2e on x86.
 - `VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5`: PASSES — same 3 tests
diff --git a/docs/proposals/sst_simx_v3_proposal.md b/docs/proposals/sst_simx_v3_proposal.md
index 3dbe0a00ef..65db9ebbbd 100644
--- a/docs/proposals/sst_simx_v3_proposal.md
+++ b/docs/proposals/sst_simx_v3_proposal.md
@@ -1,7 +1,7 @@
 # SST Integration for SimX v3 — Proposal
 
 **Date:** 2026-05-03
-**Status:** Draft
+**Status:** Implemented — note that `ci/sst_test_vortex_*.py` have been consolidated into a single generic runner [ci/sst_run_hostless_app.py](../../ci/sst_run_hostless_app.py) (parameterized by `VORTEX_TEST_DIR` + `VORTEX_TEST_KERNEL`, parallel to [ci/gem5_run_hostless_app.py](../../ci/gem5_run_hostless_app.py)). The naming reserves the `ci/sst_run_app.py` slot for a future host-CPU-driven SST integration (none today — see §3). The memHierarchy wiring described in §6 is no longer kept as a standalone test runner; the recipe stays here as documentation. References to specific `sst_test_vortex_*.py` filenames below are historical.
 **Author:** Blaise Tine
 **Related:**
 [simx_v3_proposal.md](simx_v3_proposal.md) (Phase 5: TLM data path),
diff --git a/sw/runtime/gem5/Makefile b/sw/runtime/gem5/Makefile
index 16bd3390be..259bda5d9e 100644
--- a/sw/runtime/gem5/Makefile
+++ b/sw/runtime/gem5/Makefile
@@ -16,13 +16,6 @@ CXXFLAGS += -DXLEN_$(XLEN)
 CXXFLAGS += -fPIC
 CXXFLAGS += $(CONFIGS)
 
-# OPAE-shaped MMIO constants come from the generated vortex_opae.h
-# at build/sw/ (already on the include path via -I$(ROOT_DIR)/sw).
-# vortex.cpp does `#include <vortex_opae.h>` for the AFU_IMAGE_*
-# defines. Unlike sw/runtime/opae/Makefile we do NOT call
-# afu_json_mgr — configure already generated the header from
-# vortex_opae.toml at build time.
-
 # Per-arch compiler selection. The cross-compilers are sysroot-aware
 # (Ubuntu's gcc-aarch64-linux-gnu ships the matching libstdc++); no
 # extra --sysroot flags needed.