From 24305f26ad552cec8f6c7a094acee5cbdde087c6 Mon Sep 17 00:00:00 2001
From: tinebp <tinebp@yahoo.com>
Date: Mon, 18 May 2026 02:48:21 -0700
Subject: [PATCH 1/2] gem5 integration: VortexGPGPU device + x86/ARM host
 runtime + e2e tests

Adds end-to-end gem5 SE-mode integration for Vortex. The simulated host
CPU (x86 or ARM) drives a VortexGPGPU device over the OPAE MMIO+DMA
command protocol; the device internally runs SimX cycle-by-cycle from
gem5's event loop. Validated via ci/regression.sh --gem5 with
VORTEX_GEM5_ARM=1: hello + vecadd + sgemm on both ISAs, ~16 s wall.

Three moving parts (see docs/gem5_integration.md and
docs/proposals/gem5_simx_v3_proposal.md for full design rationale):

  1. Device library (sim/simx/gem5/vortex_gpgpu.{cpp,h}, USE_GEM5=1)
     - Wraps a vortex::Processor with a C ABI the gem5 SimObject calls.
     - Full OPAE protocol state machine: cmd_args, busy bit, dcr_rsp,
       async pending_cmd dispatch.
     - Phase-2 in-process smoke driver (sim/simx/gem5/gem5_smoke_main.cpp)
       proves the library works without gem5 installed.

  2. gem5 SimObject (sim/simx/gem5/vortex_gpgpu_dev.{cc,hh} + .py +
     SConscript)
     - DmaDevice subclass; dlopens libvortex-gem5.so; ticks
       Processor::cycle() from EventFunctionWrapper.
     - CMD_MEM_{READ,WRITE} -> dmaAction; CMD_RUN -> schedule tick;
       CMD_DCR_* -> synchronous library passthrough.
     - Installed into a pinned gem5 release by sim/simx/gem5/install.sh,
       which ci/gem5_install.sh fetches + builds (v25.0.0.1, both
       build/{X86,ARM}/gem5.opt).

  3. Host runtime (sw/runtime/gem5/{vortex.cpp,driver.{cpp,h},Makefile})
     - OPAE-shaped vx_* callbacks; direct mmap'd MMIO + bump-allocator
       pinned region.
     - HOST_ARCH switch (x86_64 / aarch64 / armhf) -> matching cross
       compiler, output to $arch/ subdir so x86 + ARM coexist.
     - All three legacy-vortex_gem5 bug-catalog items addressed:
         B9  cache flush before download via per-core DCR_READ
         B13 multi-arch via HOST_ARCH (was hardcoded armhf in legacy)
         B14 mmio_fence() (mfence / dmb sy) centralised in issue_cmd()
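
The command routing described in item 2 can be modelled roughly as
follows. This is an illustrative sketch, not the SimObject code: the
Cmd enum, its DCR_WRITE member, and the "dma"/"tick"/"sync" labels are
assumptions made for the example; only the CMD_* families and where
each one goes come from the list above.

```python
from enum import Enum, auto


class Cmd(Enum):
    # Names follow the commit text; the numeric values are placeholders
    # and DCR_WRITE is an assumed member of the CMD_DCR_* family.
    MEM_READ = auto()
    MEM_WRITE = auto()
    RUN = auto()
    DCR_READ = auto()
    DCR_WRITE = auto()


def dispatch(cmd: Cmd) -> str:
    """Return a label for the path that handles `cmd` (hypothetical labels)."""
    if cmd in (Cmd.MEM_READ, Cmd.MEM_WRITE):
        return "dma"    # handed to the DMA engine (dmaAction), completes later
    if cmd is Cmd.RUN:
        return "tick"   # schedules the per-cycle tick event, completes later
    if cmd in (Cmd.DCR_READ, Cmd.DCR_WRITE):
        return "sync"   # passed straight through to the device library
    raise ValueError(f"unknown command: {cmd!r}")
```

Only the "sync" path finishes inside the MMIO access itself; the other
two are the reason the protocol needs a busy bit for the host to poll.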

SimX-side prerequisites (also shared with SST integration):
  - Processor::cycle() + Memory* memsim() accessor (sim/simx/processor.*)
  - sw/common/bitmanip.h: added missing <type_traits> + <algorithm>
    includes (defensive header hygiene; was hit when gem5 sources
    became the first to transitively include constants.h)

ARM e2e specifics:
  - tests/regression/common.mk + sw/runtime/stub/Makefile take the
    same HOST_ARCH switch; aarch64 binaries are suffixed (-aarch64) so
    x86 and ARM coexist in the same dir.
  - ci/gem5_test_vortex_app.py calls gem5's setInterpDir() to redirect
    the ELF interpreter (gem5's loader reads PT_INTERP directly, NOT
    via syscalls -- RedirectPath alone isn't enough) and adds
    RedirectPath entries for /lib/aarch64-linux-gnu -> /usr/
    aarch64-linux-gnu/lib (for libc/libstdc++ at runtime).
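
A small Python model of the two path rewrites may help here. It is a
sketch of the observable behaviour only, not gem5's implementation:
resolve_interp() and redirect() are hypothetical helper names, and the
prefix-matching rule is an assumption. The directory values are the
ones used in ci/gem5_test_vortex_app.py.

```python
INTERP_DIR = "/usr/aarch64-linux-gnu"
REDIRECTS = [
    # (path as seen by the guest, path on the host)
    ("/lib/aarch64-linux-gnu", "/usr/aarch64-linux-gnu/lib"),
    ("/usr/lib/aarch64-linux-gnu", "/usr/aarch64-linux-gnu/lib"),
]


def resolve_interp(pt_interp, interp_dir=INTERP_DIR):
    """Model of setInterpDir(): the prefix is prepended to PT_INTERP."""
    return interp_dir + pt_interp


def redirect(path, redirects=REDIRECTS):
    """Model of RedirectPath: rewrite the first matching directory prefix."""
    for app_path, host_path in redirects:
        if path == app_path or path.startswith(app_path + "/"):
            return host_path + path[len(app_path):]
    return path  # no entry matched: the guest path is used unchanged


print(resolve_interp("/lib/ld-linux-aarch64.so.1"))
print(redirect("/lib/aarch64-linux-gnu/libc.so.6"))
```

The first call covers the loader path (handled outside the syscall
layer); the second covers library lookups the dynamic linker makes
once it is running.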

CI integration:
  - ci/regression.sh.in: new gem5() function (builds prereqs, runs
    standalone hello + e2e vecadd/sgemm, each timeout 120). ARM matrix
    opt-in via VORTEX_GEM5_ARM=1.
  - .github/workflows/ci.yml: ci/gem5_install.sh appended to Setup
    Toolchain (cache-gated like SST), GEM5_HOME exported, gem5 entry
    added to tests matrix (excluded from xlen=64 since the device
    library is XLEN-locked).
  - VERSION: GEM5_REV=v25.0.0.1 added.
  - configure: @GEM5_REV@ substitution.

How to test:
    cd build/
    ./ci/gem5_install.sh                          # first time only
    sudo apt install -y gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
    VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5
    # Expect 6 PASSED runs in ~16s wall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .github/workflows/ci.yml                |   21 +-
 VERSION                                 |    1 +
 ci/gem5_install.sh.in                   |  114 +++
 ci/gem5_test_vortex_app.py              |  229 +++++
 ci/gem5_test_vortex_hello.py            |   94 ++
 ci/regression.sh.in                     |  123 ++-
 configure                               |    2 +-
 docs/gem5_integration.md                |  403 +++++++++
 docs/index.md                           |    1 +
 docs/proposals/gem5_simx_v3_proposal.md | 1040 +++++++++++++++++++++++
 sim/simx/Makefile                       |   53 +-
 sim/simx/gem5/SConscript                |   18 +
 sim/simx/gem5/VortexGPGPU.py            |   46 +
 sim/simx/gem5/gem5_smoke_main.cpp       |   96 +++
 sim/simx/gem5/hello.c                   |   14 +
 sim/simx/gem5/install.sh                |   50 ++
 sim/simx/gem5/vortex_gpgpu.cpp          |  320 +++++++
 sim/simx/gem5/vortex_gpgpu.h            |  111 +++
 sim/simx/gem5/vortex_gpgpu_dev.cc       |  295 +++++++
 sim/simx/gem5/vortex_gpgpu_dev.hh       |  122 +++
 sim/simx/processor.cpp                  |   24 +
 sim/simx/processor.h                    |   18 +
 sim/simx/processor_impl.h               |   11 +
 sw/common/bitmanip.h                    |    2 +
 sw/runtime/gem5/Makefile                |   73 ++
 sw/runtime/gem5/driver.cpp              |  128 +++
 sw/runtime/gem5/driver.h                |   73 ++
 sw/runtime/gem5/vortex.cpp              |  334 ++++++++
 sw/runtime/stub/Makefile                |   28 +-
 tests/regression/common.mk              |   49 +-
 30 files changed, 3882 insertions(+), 11 deletions(-)
 create mode 100644 ci/gem5_install.sh.in
 create mode 100644 ci/gem5_test_vortex_app.py
 create mode 100644 ci/gem5_test_vortex_hello.py
 create mode 100644 docs/gem5_integration.md
 create mode 100644 docs/proposals/gem5_simx_v3_proposal.md
 create mode 100644 sim/simx/gem5/SConscript
 create mode 100644 sim/simx/gem5/VortexGPGPU.py
 create mode 100644 sim/simx/gem5/gem5_smoke_main.cpp
 create mode 100644 sim/simx/gem5/hello.c
 create mode 100755 sim/simx/gem5/install.sh
 create mode 100644 sim/simx/gem5/vortex_gpgpu.cpp
 create mode 100644 sim/simx/gem5/vortex_gpgpu.h
 create mode 100644 sim/simx/gem5/vortex_gpgpu_dev.cc
 create mode 100644 sim/simx/gem5/vortex_gpgpu_dev.hh
 create mode 100644 sw/runtime/gem5/Makefile
 create mode 100644 sw/runtime/gem5/driver.cpp
 create mode 100644 sw/runtime/gem5/driver.h
 create mode 100644 sw/runtime/gem5/vortex.cpp
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 2adecef420..588455f069 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -65,6 +65,7 @@ jobs:
           ../configure --tooldir=$TOOLDIR
           ci/toolchain_install.sh --all
           ci/sst_install.sh
+          ci/gem5_install.sh
 
       - name: Setup Third Party
         if: steps.cache-thirdparty.outputs.cache-hit != 'true'
@@ -78,6 +79,11 @@ jobs:
           echo "SST_CORE_HOME=$PWD/tools/sst-install/sst-core" >> $GITHUB_ENV
           echo "SST_ELEMENTS_HOME=$PWD/tools/sst-install/sst-elements" >> $GITHUB_ENV
 
+      - name: Export gem5 paths
+        run: |
+          echo "GEM5_HOME=$PWD/tools/gem5" >> $GITHUB_ENV
+          echo "$PWD/tools/gem5/build/X86" >> $GITHUB_PATH
+
   build:
     needs: setup
     strategy:
@@ -137,15 +143,23 @@ jobs:
       matrix:
         os: [ubuntu-24.04]
         # dxa + tensor_wg disabled: features not yet complete (see regression{32,64}_failures.md)
-        name: [regression, amo, mpi, dtm, opencl, cache, config1, config2, debug, scope, stress, synthesis, vm, rvc, cupbop, hip, tensor, tensor_sp, tensor_mx]
+        name: [regression, amo, mpi, dtm, opencl, cache, config1, config2, debug, scope, stress, synthesis, vm, rvc, cupbop, hip, tensor, tensor_sp, tensor_mx, gem5]
         xlen: [32, 64]
         # chipStar's hipcc emits Physical64 SPIR-V; POCL refuses it on
         # rv32 Vortex (CL_INVALID_OPERATION). hip is rv64-only until
         # either chipStar grows --offload=spirv32 or the native
         # HIPVortex toolchain lands (see hip_support_proposal.md).
+        #
+        # gem5 only runs against the rv32 build; the device library
+        # is XLEN-locked by the gem5 install (build/X86/gem5.opt
+        # dlopens the libvortex-gem5.so the runner builds, and we
+        # only build it once). An XLEN=64 entry would just duplicate
+        # the run against an identical setup.
         exclude:
           - name: hip
             xlen: 32
+          - name: gem5
+            xlen: 64
     runs-on: ${{ matrix.os }}
     timeout-minutes: 120
 
@@ -190,6 +204,11 @@ jobs:
           echo "SST_CORE_HOME=$PWD/tools/sst-install/sst-core" >> $GITHUB_ENV
           echo "SST_ELEMENTS_HOME=$PWD/tools/sst-install/sst-elements" >> $GITHUB_ENV
 
+      - name: Export gem5 paths
+        run: |
+          echo "GEM5_HOME=$PWD/tools/gem5" >> $GITHUB_ENV
+          echo "$PWD/tools/gem5/build/X86" >> $GITHUB_PATH
+
       - name: Run tests
         run: |
           cd build${{ matrix.xlen }}
diff --git a/VERSION b/VERSION
index af5ac4633b..590f872b15 100644
--- a/VERSION
+++ b/VERSION
@@ -1,2 +1,3 @@
 VORTEX_VERSION=3.0
 TOOLCHAIN_REV=v3.0
+GEM5_REV=v25.0.0.1
diff --git a/ci/gem5_install.sh.in b/ci/gem5_install.sh.in
new file mode 100644
index 0000000000..378f5a167c
--- /dev/null
+++ b/ci/gem5_install.sh.in
@@ -0,0 +1,114 @@
+#!/bin/bash
+
+# Copyright © 2019-2023
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# gem5 install for SimX v3 — Phase 0 of docs/proposals/gem5_simx_v3_proposal.md.
+#
+# Fetches a pinned gem5 release, installs build deps, builds the X86 and
+# ARM variants, and exports GEM5_HOME. The Vortex SimObject is NOT installed
+# here — that lands in Phase 3 once the API surface is confirmed (see
+# sim/simx/gem5/gem5_api_notes.md after this script runs).
+#
+# Idempotent: re-running with the same GEM5_REV is a no-op once
+# $GEM5_HOME/build/<target>/gem5.opt exists for every requested target.
+
+# exit when any command fails
+set -e
+
+GEM5_REV=${GEM5_REV:=@GEM5_REV@}
+TOOLDIR=${TOOLDIR:=@TOOLDIR@}
+GEM5_HOME=$TOOLDIR/gem5
+GEM5_REPO=https://github.com/gem5/gem5.git
+
+# Build deps. gem5 documents these at https://www.gem5.org/documentation/general_docs/building
+# AArch64 cross-toolchain (gcc/g++-aarch64-linux-gnu) is needed for
+# Phase 0's hello-arm SE-mode smoke test and for the Phase 4 runtime
+# cross-build. Installing it here keeps Phase 0 self-contained.
+sudo DEBIAN_FRONTEND=noninteractive apt install -y \
+    scons \
+    python3 python3-dev python3-pip python3-venv \
+    libprotobuf-dev protobuf-compiler libprotoc-dev \
+    libgoogle-perftools-dev \
+    m4 \
+    libboost-all-dev \
+    libhdf5-serial-dev \
+    libpng-dev \
+    pkg-config \
+    gcc-aarch64-linux-gnu g++-aarch64-linux-gnu \
+    build-essential git wget
+
+mkdir -p "$TOOLDIR"
+
+# Fetch (or update) gem5 working tree at the pinned revision.
+if [ -d "$GEM5_HOME/.git" ]; then
+    echo "gem5 working tree exists at $GEM5_HOME"
+    pushd "$GEM5_HOME" > /dev/null
+    current_rev=$(git describe --tags --always 2>/dev/null || echo "unknown")
+    if [ "$current_rev" != "$GEM5_REV" ]; then
+        echo "checked-out rev $current_rev != pinned $GEM5_REV; refetching"
+        git fetch --depth=1 origin "tag" "$GEM5_REV"
+        git checkout "$GEM5_REV"
+    fi
+    popd > /dev/null
+else
+    echo "cloning gem5 $GEM5_REV into $GEM5_HOME"
+    git clone --depth=1 --branch "$GEM5_REV" "$GEM5_REPO" "$GEM5_HOME"
+fi
+
+# Build parallelism: -j$(nproc) on the self-hosted runner; cap at 4
+# on hosted runners to avoid OOM (gem5 link uses ~4 GB peak).
+JOBS=$(nproc)
+if [ -n "$GITHUB_ACTIONS" ] && [ -z "$VORTEX_SELF_HOSTED" ]; then
+    JOBS=4
+fi
+
+# Build both X86 (default host ISA — easier, no cross-compile needed)
+# and ARM (research path matching the legacy capstone paper). Either
+# can be selected at test-config time via GEM5_BIN=$GEM5_HOME/build/{X86,ARM}/gem5.opt.
+# Default targets can be overridden via GEM5_TARGETS="X86" or "ARM" or
+# "X86 ARM" (space-separated). Both are built by default.
+GEM5_TARGETS=${GEM5_TARGETS:-"X86 ARM"}
+
+cd "$GEM5_HOME"
+for target in $GEM5_TARGETS; do
+    if [ ! -x "$GEM5_HOME/build/$target/gem5.opt" ]; then
+        echo "building gem5.opt ($target) with -j$JOBS"
+        scons "build/$target/gem5.opt" -j"$JOBS"
+    else
+        echo "gem5.opt ($target) already built at $GEM5_HOME/build/$target/gem5.opt"
+    fi
+done
+
+# Persist GEM5_HOME for subsequent shells (idempotent).
+if ! grep -q "^export GEM5_HOME=" ~/.bashrc 2>/dev/null; then
+    echo "export GEM5_HOME=$GEM5_HOME" >> ~/.bashrc
+fi
+export GEM5_HOME
+
+# GitHub Actions: propagate to subsequent steps.
+if [ -n "$GITHUB_ENV" ]; then
+    echo "GEM5_HOME=$GEM5_HOME" >> "$GITHUB_ENV"
+fi
+if [ -n "$GITHUB_PATH" ]; then
+    for target in $GEM5_TARGETS; do
+        echo "$GEM5_HOME/build/$target" >> "$GITHUB_PATH"
+    done
+fi
+
+echo ""
+echo "gem5 $GEM5_REV installed at $GEM5_HOME"
+for target in $GEM5_TARGETS; do
+    echo "  binary: $GEM5_HOME/build/$target/gem5.opt"
+done
+echo "  GEM5_HOME exported (re-source ~/.bashrc to pick up in new shells)"
diff --git a/ci/gem5_test_vortex_app.py b/ci/gem5_test_vortex_app.py
new file mode 100644
index 0000000000..7f703325d8
--- /dev/null
+++ b/ci/gem5_test_vortex_app.py
@@ -0,0 +1,229 @@
+# Copyright © 2019-2023
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Phase 5 end-to-end gem5 integration test for vortex.VortexGPGPU.
+#
+# Generic application runner — any Vortex regression test that
+# follows the standard shape (host binary + kernel.vxbin in the same
+# directory, links against libvortex.so) can run here.
+#
+# Wires:
+#   - x86 SE-mode CPU running an unmodified Vortex regression test
+#     (same binary the SimX backend uses).
+#   - VortexGPGPU device on the system membus at pio=0x20000000.
+#   - Identity-mapped PIO range (CPU → device MMIO) and pinned region
+#     (host DRAM accessed by both the CPU's userspace via virt and
+#     the device's DmaPort via phys) via Process.map() — the same
+#     mechanism gem5's AMD GPU integration uses at apu_se.py:1055.
+#
+# The simulated process loads libvortex.so (the stub), which in turn
+# dlopens libvortex-gem5-x86_64.so based on the VORTEX_DRIVER env
+# var. From there:
+#   1. vx_dev_open → drv_init (no-op; mappings already in place)
+#   2. vx_upload_kernel_bytes → DMA write of the .vxbin into VRAM
+#   3. vx_copy_to_dev (×N) → DMA writes of input buffers
+#   4. vx_start → MMIO CMD_RUN; kernel computes
+#   5. vx_copy_from_dev → cache flush (per-core DCR_READ) + DMA read
+#   6. Host verifies result, prints PASSED / FAILED
+#
+# Configurable via env vars:
+#   VORTEX_GEM5_DEV_LIB     — path to sim/simx/libvortex-gem5.so
+#                             (device-side; dlopened by the gem5 SimObject)
+#   VORTEX_GEM5_HOST_RT_DIR — directory containing libvortex.so (the stub)
+#                             AND libvortex-gem5-x86_64.so (the host
+#                             runtime backend). Both are added to the
+#                             simulated process's LD_LIBRARY_PATH.
+#   VORTEX_TEST_DIR         — directory containing the test binary +
+#                             kernel.vxbin
+#   VORTEX_TEST_BIN         — name of the test binary inside that dir
+#                             (default: vecadd)
+#   VORTEX_TEST_ARGS        — args passed to the binary (default: -n16)
+#   VORTEX_DRIVER           — backend selector for the stub library
+#                             (default: gem5-x86_64; use gem5-aarch64
+#                             when running the ARM matrix)
+
+import os
+import shlex
+
+import m5
+from m5.objects import (
+    AddrRange,
+    DDR3_1600_8x8,
+    MemCtrl,
+    Process,
+    RedirectPath,
+    Root,
+    SEWorkload,
+    SrcClockDomain,
+    System,
+    SystemXBar,
+    AtomicSimpleCPU,
+    VoltageDomain,
+    VortexGPGPU,
+)
+
+DEV_LIB     = os.environ.get("VORTEX_GEM5_DEV_LIB")
+HOST_RT_DIR = os.environ.get("VORTEX_GEM5_HOST_RT_DIR")
+TEST_DIR    = os.environ.get("VORTEX_TEST_DIR")
+TEST_BIN    = os.environ.get("VORTEX_TEST_BIN", "vecadd")
+TEST_ARGS   = os.environ.get("VORTEX_TEST_ARGS", "-n16")
+DRIVER      = os.environ.get("VORTEX_DRIVER",   "gem5-x86_64")
+
+for name, val in [
+    ("VORTEX_GEM5_DEV_LIB",     DEV_LIB),
+    ("VORTEX_GEM5_HOST_RT_DIR", HOST_RT_DIR),
+    ("VORTEX_TEST_DIR",         TEST_DIR),
+]:
+    if not val:
+        raise RuntimeError(f"{name} env var is required")
+
+APP_BIN = f"{TEST_DIR}/{TEST_BIN}"
+
+# Fixed mappings used by the gem5 host runtime (see
+# sw/runtime/gem5/driver.h). The Python config and the C runtime
+# share these constants by convention; if you change one, change
+# both.
+PIO_BASE   = 0x20000000
+PIO_SIZE   = 0x1000        # 4 KB — one page is enough for the OPAE regs
+PIN_BASE   = 0x10000000
+PIN_SIZE   = 0x10000000    # 256 MB — large enough for vecadd staging
+
+# ---------------------------------------------------------------------------
+# System construction
+# ---------------------------------------------------------------------------
+system = System()
+system.clk_domain = SrcClockDomain(clock="3GHz",
+                                   voltage_domain=VoltageDomain())
+system.mem_mode = "atomic"
+system.mem_ranges = [AddrRange("1GiB")]   # covers both DRAM and the
+                                          # PIN_BASE identity-mapped region
+                                          # (PIN_BASE=0x10000000 < 1GB)
+
+# Cross-arch interp + runtime library redirection.
+# Two separate gem5 mechanisms are at play:
+#   (1) `setInterpDir(prefix)` prepends `prefix` to PT_INTERP when
+#       gem5 loads the dynamic linker (e.g. /lib/ld-linux-aarch64.so.1
+#       → /usr/aarch64-linux-gnu/lib/ld-linux-aarch64.so.1). The
+#       linker is opened directly by gem5's loader, NOT via SE-mode
+#       syscall, so RedirectPath doesn't help here.
+#   (2) `system.redirect_paths` redirects open()/stat()/etc syscalls
+#       the GUEST process makes — used when the dynamic linker
+#       later looks up libc.so.6, libstdc++.so.6, libvortex.so, etc.
+# Both are no-ops for native x86.
+if DRIVER == "gem5-aarch64":
+    from m5.core import setInterpDir
+    setInterpDir("/usr/aarch64-linux-gnu")
+    system.redirect_paths = [
+        RedirectPath(app_path="/lib/aarch64-linux-gnu",
+                     host_paths=["/usr/aarch64-linux-gnu/lib"]),
+        RedirectPath(app_path="/usr/lib/aarch64-linux-gnu",
+                     host_paths=["/usr/aarch64-linux-gnu/lib"]),
+    ]
+
+# Membus connects CPU ↔ memory ↔ VortexGPGPU.
+system.membus = SystemXBar()
+system.system_port = system.membus.cpu_side_ports
+
+# CPU. Atomic for now — the cycle counts inside the Vortex device are
+# driven by the device's own clock anyway; timing CPU adds gem5 wall
+# time without changing the kernel result.
+system.cpu = AtomicSimpleCPU()
+system.cpu.createInterruptController()
+system.cpu.icache_port = system.membus.cpu_side_ports
+system.cpu.dcache_port = system.membus.cpu_side_ports
+# X86's InterruptController has explicit pio/int_requestor/int_responder
+# ports that must be wired to the membus (per
+# learning_gem5/part1/two_level.py:111-114). ARM's interrupt model
+# doesn't expose these — skip the wiring on ARM. Tested via the
+# DRIVER env var (the same one that selects the simulated host ISA).
+if DRIVER == "gem5-x86_64":
+    system.cpu.interrupts[0].pio           = system.membus.mem_side_ports
+    system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
+    system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
+
+# Memory controller. The DRAM range starts at 0. PIO_BASE=0x20000000
+# (512 MiB) falls inside the 1 GiB system.mem_ranges hint above, but
+# that hint does not decide routing; the MemCtrl range set below
+# determines which addresses DRAM actually claims.
+system.mem_ctrl = MemCtrl()
+system.mem_ctrl.dram = DDR3_1600_8x8()
+# DRAM serves [0, 512MB). PIO at 0x20000000 (=512MB) sits at the top
+# edge, so let DRAM serve [0, 512MB) and let the membus route
+# 0x20000000+ to the VortexGPGPU.
+system.mem_ctrl.dram.range = AddrRange(0, size="512MiB")
+system.mem_ctrl.port = system.membus.mem_side_ports
+
+# The Vortex device. The `library` parameter points at the
+# device-side libvortex-gem5.so (no arch suffix; gem5 itself is
+# always x86-host). The host-side runtime is loaded separately by
+# the simulated process via VORTEX_DRIVER below.
+system.vortex = VortexGPGPU(
+    library = DEV_LIB,
+    kernel  = "",   # NO preload — the host binary uploads the kernel
+                    # via the OPAE MMIO protocol, the way a real
+                    # accelerator runtime works.
+)
+system.vortex.pio_addr = PIO_BASE
+system.vortex.pio_size = PIO_SIZE
+system.vortex.pio = system.membus.mem_side_ports
+system.vortex.dma = system.membus.cpu_side_ports
+
+# ---------------------------------------------------------------------------
+# Workload (the host test binary)
+# ---------------------------------------------------------------------------
+argv = [APP_BIN] + shlex.split(TEST_ARGS)
+process = Process(
+    pid=100,
+    cwd=TEST_DIR,
+    cmd=argv,
+    executable=argv[0],
+    env=[
+        # Tells the stub to dlopen our backend
+        # (libvortex.so does dlopen("libvortex-${VORTEX_DRIVER}.so")).
+        f"VORTEX_DRIVER={DRIVER}",
+        # Library search path inside the simulated process. Must
+        # contain libvortex.so AND libvortex-gem5-$ARCH.so (both
+        # are in HOST_RT_DIR by construction).
+        f"LD_LIBRARY_PATH={HOST_RT_DIR}",
+    ],
+)
+
+system.workload = SEWorkload.init_compatible(APP_BIN)
+system.cpu.workload = process
+system.cpu.createThreads()
+
+# ---------------------------------------------------------------------------
+# Run
+# ---------------------------------------------------------------------------
+root = Root(full_system=False, system=system)
+m5.instantiate()
+
+# Identity-map the device PIO range and the pinned DMA region into
+# the simulated process's address space. Must happen AFTER
+# m5.instantiate() — the process needs a backing C++ object before
+# map() is callable. Mirrors apu_se.py:1055 (gem5's AMD GPU pattern).
+# The CPU's userspace then touches PIO_BASE / PIN_BASE as ordinary
+# memory; the membus routes PIO_BASE → device, PIN_BASE → DRAM.
+system.cpu.workload[0].map(PIO_BASE, PIO_BASE, PIO_SIZE, cacheable=False)
+system.cpu.workload[0].map(PIN_BASE, PIN_BASE, PIN_SIZE, cacheable=True)
+
+print(f"Phase 5: app={APP_BIN} {TEST_ARGS}")
+print(f"Phase 5: VortexGPGPU.library={DEV_LIB}")
+print(f"Phase 5: VORTEX_DRIVER={DRIVER}")
+print(f"Phase 5: LD_LIBRARY_PATH={HOST_RT_DIR}")
+print(f"Phase 5: PIO @0x{PIO_BASE:x}+0x{PIO_SIZE:x}, PIN @0x{PIN_BASE:x}+0x{PIN_SIZE:x}")
+print("Phase 5: starting simulation...")
+
+exit_event = m5.simulate()
+print(f"Phase 5: exit_event.cause = {exit_event.getCause()!r}")
+print(f"Phase 5: tick = {m5.curTick()}")
diff --git a/ci/gem5_test_vortex_hello.py b/ci/gem5_test_vortex_hello.py
new file mode 100644
index 0000000000..6ab54b3af2
--- /dev/null
+++ b/ci/gem5_test_vortex_hello.py
@@ -0,0 +1,94 @@
+# Copyright © 2019-2023
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Phase 3 gem5 integration test for vortex.VortexGPGPU.
+#
+# Standalone-device variant: the VortexGPGPU SimObject loads the kernel
+# directly via its `kernel=` parameter and runs it via its internal
+# tick loop. No host CPU, no MMIO traffic, no DMA — this is the gem5
+# analog of sim/simx/gem5/gem5_smoke from Phase 2, used here purely
+# to prove the gem5 SimObject can dlopen libvortex-gem5.so, drive
+# Processor::cycle() from the gem5 event loop, and exit cleanly.
+#
+# Phase 5 adds the full host-CPU + MMIO/DMA flow on top of this.
+#
+# Configurable via env vars:
+#   VORTEX_GEM5_LIB    — path to libvortex-gem5.so (no default)
+#   VORTEX_GEM5_KERNEL — path to .vxbin to preload (no default)
+#
+# Run from the Vortex build dir as:
+#   VORTEX_GEM5_LIB=$PWD/sim/simx/libvortex-gem5.so \
+#   VORTEX_GEM5_KERNEL=$PWD/tests/kernel/hello/hello.vxbin \
+#   $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py
+
+import os
+import m5
+from m5.objects import (
+    AddrRange,
+    DDR3_1600_8x8,
+    MemCtrl,
+    Root,
+    SrcClockDomain,
+    System,
+    SystemXBar,
+    VoltageDomain,
+    VortexGPGPU,
+)
+
+LIBRARY = os.environ.get("VORTEX_GEM5_LIB")
+KERNEL  = os.environ.get("VORTEX_GEM5_KERNEL")
+if not LIBRARY:
+    raise RuntimeError("VORTEX_GEM5_LIB env var is required")
+if not KERNEL:
+    raise RuntimeError("VORTEX_GEM5_KERNEL env var is required")
+
+# Minimal system: just enough to hang the VortexGPGPU off a membus
+# so gem5 considers it a properly-wired SimObject. No CPU in this
+# Phase-3 test — the kernel runs entirely inside the SimObject's
+# internal tick loop.
+system = System()
+system.clk_domain = SrcClockDomain(clock="1GHz",
+                                   voltage_domain=VoltageDomain())
+system.mem_mode = "atomic"
+system.mem_ranges = [AddrRange("512MiB")]
+
+# Membus + a small backing memory so PIO ranges have somewhere to bind.
+system.membus = SystemXBar()
+
+# Memory controller (unused at runtime in Phase 3 but required for the
+# system to instantiate cleanly).
+system.mem_ctrl = MemCtrl()
+system.mem_ctrl.dram = DDR3_1600_8x8()
+system.mem_ctrl.dram.range = system.mem_ranges[0]
+system.mem_ctrl.port = system.membus.mem_side_ports
+
+# The Vortex device. It inherits clock from the system clock domain
+# (set above to 1GHz) via ClockedObject; no explicit `clock=` param.
+system.vortex = VortexGPGPU(
+    library = LIBRARY,
+    kernel  = KERNEL,
+)
+system.vortex.pio = system.membus.mem_side_ports
+system.vortex.dma = system.membus.cpu_side_ports
+
+# Root wires the system into the simulator.
+root = Root(full_system=False, system=system)
+m5.instantiate()
+
+print(f"Phase 3: VortexGPGPU library={LIBRARY}")
+print(f"Phase 3: kernel={KERNEL}")
+print("Phase 3: running until VortexGPGPU exits the sim loop...")
+
+exit_event = m5.simulate()
+print(f"Phase 3: exit_event.cause = {exit_event.getCause()!r}")
+print(f"Phase 3: tick = {m5.curTick()}")
diff --git a/ci/regression.sh.in b/ci/regression.sh.in
index c84ad793c2..b1b285358a 100755
--- a/ci/regression.sh.in
+++ b/ci/regression.sh.in
@@ -103,6 +103,124 @@ sst()
     echo "sst tests done!"
 }
 
+# gem5 integration tests — Phase 6 of docs/proposals/gem5_simx_v3_proposal.md.
+# Validates the VortexGPGPU device + libvortex-gem5.so end-to-end inside
+# gem5 SE-mode. Two layers, both run by --gem5:
+#
+#   1. Phase 3 standalone: kernel preloaded via the
+#      SimObject's `kernel=` Python param; runs entirely inside the gem5
+#      event loop, no host CPU needed. Fast smoke test (~1 s wall, ~5K
+#      simulated cycles per run).
+#
+#   2. Phase 5 e2e: an x86 SE-mode workload (the standard
+#      tests/regression vecadd and sgemm binaries, the same ones the
+#      SimX backend uses) drives the device via the OPAE MMIO/DMA
+#      protocol through libvortex-gem5-x86_64.so. Exercises the full
+#      path: kernel upload DMA, status polling, cache-flush DCRs,
+#      result DMA, host-side verification.
+#
+# ARM matrix is opt-in via VORTEX_GEM5_ARM=1 (needs gcc-aarch64-linux-gnu
+# installed; not part of the default hosted-runner image).
+gem5()
+{
+    echo "begin gem5 tests..."
+
+    if [ -z "$GEM5_HOME" ]; then
+        GEM5_HOME=$HOME/tools/gem5
+    fi
+    if [ ! -x "$GEM5_HOME/build/X86/gem5.opt" ]; then
+        echo "error: $GEM5_HOME/build/X86/gem5.opt not found — run ci/gem5_install.sh first"
+        exit 1
+    fi
+
+    # Build prerequisites. The host runtime is gated on HOST_ARCH;
+    # default x86 needs no cross-toolchain.
+    make -C sim/simx USE_GEM5=1
+    make -C sw/runtime/stub
+    make -C sw/runtime/gem5 HOST_ARCH=x86_64
+    make -C sw/kernel
+    make -C tests/kernel/hello
+    make -C tests/regression/vecadd
+    make -C tests/regression/sgemm
+
+    BUILD_DIR=$(pwd)
+    LIB_GEM5_DEV=$BUILD_DIR/sim/simx/libvortex-gem5.so
+    HOST_RT_DIR=$BUILD_DIR/sw/runtime
+
+    # Phase 3 standalone smoke — no host CPU, kernel preload.
+    # env-vars MUST precede the binary (gem5.opt would otherwise
+    # treat them as positional args).
+    VORTEX_GEM5_LIB=$LIB_GEM5_DEV \
+    VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \
+        timeout 120 $GEM5_HOME/build/X86/gem5.opt \
+        ci/gem5_test_vortex_hello.py
+
+    # Phase 5 e2e — full OPAE protocol path through the host runtime.
+    # Generic test runner (ci/gem5_test_vortex_app.py) parameterized
+    # by VORTEX_TEST_BIN + VORTEX_TEST_ARGS. Sizes are chosen so each
+    # run fits in the 120s per-test budget:
+    #   - vecadd -n16   small vector add (~4K device cycles)
+    #   - sgemm  -n4    4x4 matrix multiply (~800 device cycles; larger
+    #                   sizes overrun the budget because the simulated
+    #                   host CPU's ready_wait poll loop burns gem5
+    #                   wall time proportional to kernel runtime).
+    # For larger sizes, run ci/gem5_test_vortex_app.py directly with VORTEX_TEST_ARGS set.
+    for spec in "vecadd:-n16" "sgemm:-n4"; do
+        app="${spec%%:*}"
+        args="${spec#*:}"
+        echo "=== gem5 e2e: $app $args ==="
+        VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \
+        VORTEX_GEM5_HOST_RT_DIR=$HOST_RT_DIR \
+        VORTEX_TEST_DIR=$BUILD_DIR/tests/regression/$app \
+        VORTEX_TEST_BIN=$app \
+        VORTEX_TEST_ARGS=$args \
+            timeout 120 $GEM5_HOME/build/X86/gem5.opt \
+            ci/gem5_test_vortex_app.py
+    done
+
+    # ARM matrix (opt-in). The device library (libvortex-gem5.so) is
+    # always x86 — gem5.opt is an x86 binary regardless of which
+    # simulated ISA it models. Only the simulated host's ISA changes.
+    if [ -n "$VORTEX_GEM5_ARM" ]; then
+        if [ ! -x "$GEM5_HOME/build/ARM/gem5.opt" ]; then
+            echo "error: $GEM5_HOME/build/ARM/gem5.opt not found"
+            exit 1
+        fi
+
+        # Cross-compile the host runtime, stub, and test binaries for
+        # aarch64. Runtime libs land in an aarch64/ subdir and test
+        # binaries get an -aarch64 suffix, so both coexist with x86.
+        make -C sw/runtime/stub HOST_ARCH=aarch64
+        make -C sw/runtime/gem5 HOST_ARCH=aarch64
+        make -C tests/regression/vecadd HOST_ARCH=aarch64
+        make -C tests/regression/sgemm  HOST_ARCH=aarch64
+
+        ARM_HOST_RT_DIR=$BUILD_DIR/sw/runtime/aarch64
+
+        echo "=== gem5 ARM standalone: hello ==="
+        VORTEX_GEM5_LIB=$LIB_GEM5_DEV \
+        VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \
+            timeout 120 $GEM5_HOME/build/ARM/gem5.opt \
+            ci/gem5_test_vortex_hello.py
+
+        for spec in "vecadd:-n16" "sgemm:-n4"; do
+            app="${spec%%:*}"
+            args="${spec#*:}"
+            echo "=== gem5 ARM e2e: $app $args ==="
+            VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \
+            VORTEX_GEM5_HOST_RT_DIR=$ARM_HOST_RT_DIR \
+            VORTEX_TEST_DIR=$BUILD_DIR/tests/regression/$app \
+            VORTEX_TEST_BIN=$app-aarch64 \
+            VORTEX_TEST_ARGS=$args \
+            VORTEX_DRIVER=gem5-aarch64 \
+                timeout 120 $GEM5_HOME/build/ARM/gem5.opt \
+                ci/gem5_test_vortex_app.py
+        done
+    fi
+
+    echo "gem5 tests done!"
+}
+
 mpi()
 {
     echo "begin mpi tests..."
@@ -1022,7 +1140,7 @@ hip()
 show_usage()
 {
     echo "Vortex Regression Test"
-    echo "Usage: $0 [--clean] [--unittest] [--riscv] [--kernel] [--regression] [--amo] [--dxa] [--opencl] [--cache] [--vm] [--rvc] [--config1] [--config2] [--debug] [--scope] [--stress] [--synthesis] [--vector] [--graphics] [--tensor] [--tensor_sp] [--tensor_mx] [--tensor_wg] [--cupbop] [--hip] [--all] [--h|--help]"
+    echo "Usage: $0 [--clean] [--unittest] [--riscv] [--kernel] [--regression] [--amo] [--dxa] [--opencl] [--cache] [--vm] [--rvc] [--config1] [--config2] [--debug] [--scope] [--stress] [--synthesis] [--vector] [--graphics] [--tensor] [--tensor_sp] [--tensor_mx] [--tensor_wg] [--cupbop] [--hip] [--sst] [--gem5] [--dtm] [--mpi] [--all] [--h|--help]"
 }
 
 declare -a tests=()
@@ -1114,6 +1232,9 @@ while [ "$1" != "" ]; do
         --sst )
                 tests+=("sst")
                 ;;
+        --gem5 )
+                tests+=("gem5")
+                ;;
         --dtm )
                 tests+=("dtm")
                 ;;
diff --git a/configure b/configure
index 14c0880d1d..ea1abb5ebf 100755
--- a/configure
+++ b/configure
@@ -69,7 +69,7 @@ copy_files() {
                         continue
                     fi
                     mkdir -p "$dest_dir"
-                    sed "s|@VORTEX_HOME@|$SOURCE_DIR|g; s|@XLEN@|$XLEN|g; s|@TOOLDIR@|$TOOLDIR|g; s|@OSVERSION@|$OSVERSION|g; s|@INSTALLDIR@|$PREFIX|g; s|@BUILDDIR@|$CURRENT_DIR|g; s|@TOOLCHAIN_REV@|$TOOLCHAIN_REV|g; s|@VORTEX_VERSION@|$VORTEX_VERSION|g" "$file" > "$dest_file"
+                    sed "s|@VORTEX_HOME@|$SOURCE_DIR|g; s|@XLEN@|$XLEN|g; s|@TOOLDIR@|$TOOLDIR|g; s|@OSVERSION@|$OSVERSION|g; s|@INSTALLDIR@|$PREFIX|g; s|@BUILDDIR@|$CURRENT_DIR|g; s|@TOOLCHAIN_REV@|$TOOLCHAIN_REV|g; s|@VORTEX_VERSION@|$VORTEX_VERSION|g; s|@GEM5_REV@|$GEM5_REV|g" "$file" > "$dest_file"
                     # apply permissions to bash scripts
                     read -r firstline < "$dest_file"
                     if [[ "$firstline" =~ ^#!.*bash ]]; then
diff --git a/docs/gem5_integration.md b/docs/gem5_integration.md
new file mode 100644
index 0000000000..f474118897
--- /dev/null
+++ b/docs/gem5_integration.md
@@ -0,0 +1,403 @@
+# gem5 Integration
+
+Vortex can run inside the [gem5](https://www.gem5.org/) simulator
+(SE mode) as a `DmaDevice` SimObject, exposing a Vortex GPGPU to a
+simulated host CPU (x86 or ARM) over the standard OPAE MMIO+DMA
+command protocol. Use this when you want to model heterogeneous
+host-CPU+accelerator workloads with realistic cross-ISA cache and
+DMA timing.
+
+For the design rationale see
+[docs/proposals/gem5_simx_v3_proposal.md](proposals/gem5_simx_v3_proposal.md).
+This document is the operator manual.
+
+## At a glance
+
+The integration has three moving parts that live in this repo:
+
+| Part | Source | Built artifact | Loaded by |
+|---|---|---|---|
+| Device library | `sim/simx/gem5/vortex_gpgpu.{cpp,h}` | `build/sim/simx/libvortex-gem5.so` | gem5 SimObject via `dlopen` |
+| gem5 SimObject | `sim/simx/gem5/vortex_gpgpu_dev.{cc,hh}` + `VortexGPGPU.py` + `SConscript` | Linked into `gem5.opt` after install | gem5 itself |
+| Host runtime | `sw/runtime/gem5/{vortex.cpp,driver.{cpp,h},Makefile}` | `build/sw/runtime/libvortex-gem5-x86_64.so` (native) or `build/sw/runtime/$arch/libvortex-gem5-$arch.so` (cross-compiled) | The simulated process inside gem5 |
+
+Plus one external piece: `ci/gem5_install.sh` fetches gem5
+v25.0.0.1, drops our SimObject sources into `$GEM5_HOME/src/dev/vortex/`,
+and builds `build/{X86,ARM}/gem5.opt` (both ISAs by default).
+
+## One-time setup
+
+Install and build Vortex as usual ([docs/install_vortex.md](install_vortex.md)),
+then add gem5:
+
+```bash
+cd build/   # standard Vortex out-of-tree build directory
+./ci/gem5_install.sh
+```
+
+This runs `sudo apt install` for gem5's build dependencies (scons,
+libprotobuf, m4, libboost, **gcc-aarch64-linux-gnu**, …), clones gem5
+v25.0.0.1 into `$TOOLDIR/gem5`, copies the Vortex SimObject sources
+into `$GEM5_HOME/src/dev/vortex/`, and builds `gem5.opt` for both X86
+and ARM (~15 min on a 64-core machine, ~30-45 min on a typical CI
+runner). The script is idempotent — re-running with the same
+`GEM5_REV` is a no-op.
+
+To install only one ISA:
+
+```bash
+GEM5_TARGETS="X86" ./ci/gem5_install.sh   # X86 only
+GEM5_TARGETS="ARM" ./ci/gem5_install.sh
+GEM5_TARGETS="X86 ARM" ./ci/gem5_install.sh   # both (default)
+```
+
+The pinned gem5 revision lives in `VERSION` (`GEM5_REV=v25.0.0.1`);
+bumping it requires re-running `ci/gem5_install.sh` and verifying
+both `gem5.opt` builds still load `VortexGPGPU` cleanly.
+
+## Building Vortex with gem5 support
+
+The device library is gated behind `USE_GEM5=1`. The default
+`make -C sim/simx` is **unchanged** — no gem5 dep, no `libvortex-gem5.so`
+produced.
+
+```bash
+make -C sim/simx                     # default; no gem5 artifacts
+make -C sim/simx USE_GEM5=1          # produces libvortex-gem5.so + gem5_smoke
+```
+
+`USE_SST=1` and `USE_GEM5=1` are mutually exclusive (the Makefile
+errors out if both are set) — they're different external simulators
+with different LDFLAGS; building both into one binary makes no sense.
+
+### Host runtime + tests (cross-compile)
+
+The simulated process inside gem5 loads the **host runtime**
+`libvortex-gem5-$HOST_ARCH.so`, which speaks the OPAE MMIO/DMA
+protocol to the device. The `HOST_ARCH` knob is consistent across
+three Makefiles — runtime backend, stub, and regression tests:
+
+```bash
+# Native x86 (default)
+make -C sw/runtime/stub                          # → build/sw/runtime/libvortex.so
+make -C sw/runtime/gem5                          # → build/sw/runtime/libvortex-gem5-x86_64.so
+make -C tests/regression/vecadd                  # → build/tests/regression/vecadd/vecadd
+
+# Cross-compiled aarch64 — outputs land in $arch/ subdirs so x86
+# and ARM artifacts coexist:
+make -C sw/runtime/stub HOST_ARCH=aarch64        # → build/sw/runtime/aarch64/libvortex.so
+make -C sw/runtime/gem5 HOST_ARCH=aarch64        # → build/sw/runtime/aarch64/libvortex-gem5-aarch64.so
+make -C tests/regression/vecadd HOST_ARCH=aarch64 # → build/tests/regression/vecadd/vecadd-aarch64
+
+# armhf works the same way:
+make -C sw/runtime/stub HOST_ARCH=armhf
+make -C sw/runtime/gem5 HOST_ARCH=armhf
+make -C tests/regression/vecadd HOST_ARCH=armhf
+```
+
+The aarch64 and armhf targets require `gcc-aarch64-linux-gnu` and
+`gcc-arm-linux-gnueabihf` respectively; `ci/gem5_install.sh` installs these.
+
+## Running tests
+
+### From the regression harness
+
+```bash
+cd build/
+./ci/regression.sh --gem5
+```
+
+Runs both the standalone Phase-3 smoke test (kernel preloaded on the
+SimObject, no host CPU) and the Phase-5 end-to-end test (real
+SE-mode host program drives the device through MMIO+DMA). Total
+wall time ~5 s on a fast box.
+
+To also run the ARM matrix entry (needs `gcc-aarch64-linux-gnu`):
+
+```bash
+VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5
+```
+
+With the ARM matrix enabled, the harness runs 6 tests in ~16 s wall:
+- X86 standalone hello (no host CPU; SimObject preloads kernel)
+- X86 e2e vecadd `-n16` (host CPU drives device via OPAE MMIO+DMA)
+- X86 e2e sgemm `-n4`
+- ARM standalone hello
+- ARM e2e vecadd `-n16`
+- ARM e2e sgemm `-n4`
+
+Cross-arch e2e relies on two gem5 mechanisms working together:
+
+1. **`setInterpDir(prefix)`** prepends a sysroot to the dynamic
+   linker path embedded in the cross-compiled ELF
+   (`/lib/ld-linux-aarch64.so.1` → `/usr/aarch64-linux-gnu/lib/...`).
+   The Python config calls this when `VORTEX_DRIVER=gem5-aarch64`.
+2. **`system.redirect_paths`** redirects the *guest process's*
+   open()/stat() syscalls for `/lib/aarch64-linux-gnu/*` →
+   `/usr/aarch64-linux-gnu/lib/*` so the dynamic linker can
+   resolve libc, libstdc++, etc.
+
+Both paths point at the Ubuntu `gcc-aarch64-linux-gnu` package's
+install location — no extra setup needed.
+
+### By hand
+
+**Standalone** (no host CPU; kernel preloaded via SimObject parameter):
+
+```bash
+VORTEX_GEM5_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
+VORTEX_GEM5_KERNEL=$(pwd)/tests/kernel/hello/hello.vxbin \
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py
+```
+
+**End-to-end** — any standard Vortex regression test (host binary
++ kernel.vxbin) runs through the generic
+[`ci/gem5_test_vortex_app.py`](../ci/gem5_test_vortex_app.py)
+runner. Set `VORTEX_TEST_BIN` to the test name:
+
+```bash
+# vecadd
+VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
+VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \
+VORTEX_TEST_DIR=$(pwd)/tests/regression/vecadd \
+VORTEX_TEST_BIN=vecadd \
+VORTEX_TEST_ARGS="-n16" \
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py
+
+# sgemm
+VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
+VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \
+VORTEX_TEST_DIR=$(pwd)/tests/regression/sgemm \
+VORTEX_TEST_BIN=sgemm \
+VORTEX_TEST_ARGS="-n4" \
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py
+```
+
+Expected vecadd output (truncated):
+```
+allocate device memory
+upload source buffer0
+upload source buffer1
+Upload kernel binary
+start device
+wait for completion
+download destination buffer
+verify result
+PASSED!
+```
+
+### Sizing tests for the 120 s budget
+
+Each test runs under `timeout 120`, the per-test budget enforced in
+`ci/regression.sh`.
+gem5 SE-mode runs the host CPU's `ready_wait` poll loop in
+simulated time too, so **kernel runtime translates directly into
+gem5 wall time**. The regression script's default sizes fit:
+
+| Test | Args | Device cycles | Wall (atomic CPU) |
+|---|---|---|---|
+| vecadd | `-n16` | ~450 | ~3 s |
+| sgemm  | `-n4`  | ~780 | ~3 s |
+| sgemm  | `-n16` | ~10k+ | **> 120 s** (overruns) |
+
+Larger sizes are fine when run by hand outside the budget cap.
+
+## Writing your own gem5 Python script
+
+The minimal recipe for hosting Vortex inside a custom gem5 system:
+
+```python
+from m5.objects import (
+    AddrRange, AtomicSimpleCPU, DDR3_1600_8x8, MemCtrl, Process,
+    Root, SEWorkload, SrcClockDomain, System, SystemXBar,
+    VoltageDomain, VortexGPGPU,
+)
+
+# Mappings expected by sw/runtime/gem5/driver.h.
+PIO_BASE, PIO_SIZE = 0x20000000, 0x1000
+PIN_BASE, PIN_SIZE = 0x10000000, 0x10000000   # 256 MB pinned region
+
+system = System()
+system.clk_domain = SrcClockDomain(clock="3GHz",
+                                   voltage_domain=VoltageDomain())
+system.mem_mode = "atomic"
+system.mem_ranges = [AddrRange("1GiB")]
+system.membus = SystemXBar()
+system.system_port = system.membus.cpu_side_ports
+
+# CPU (x86 example). For ARM, swap to ArmAtomicSimpleCPU + adjust
+# interrupt wiring.
+system.cpu = AtomicSimpleCPU()
+system.cpu.createInterruptController()
+system.cpu.icache_port = system.membus.cpu_side_ports
+system.cpu.dcache_port = system.membus.cpu_side_ports
+system.cpu.interrupts[0].pio           = system.membus.mem_side_ports
+system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
+system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports
+
+# DRAM serves [0, 512MB). PIO at 0x20000000 above goes to the
+# Vortex device (membus routes by address).
+system.mem_ctrl = MemCtrl()
+system.mem_ctrl.dram = DDR3_1600_8x8()
+system.mem_ctrl.dram.range = AddrRange(0, size="512MiB")
+system.mem_ctrl.port = system.membus.mem_side_ports
+
+# The Vortex device.
+system.vortex = VortexGPGPU(
+    library = "/path/to/build/sim/simx/libvortex-gem5.so",
+)
+system.vortex.pio_addr = PIO_BASE
+system.vortex.pio_size = PIO_SIZE
+system.vortex.pio = system.membus.mem_side_ports
+system.vortex.dma = system.membus.cpu_side_ports
+
+# Workload — the host binary uses the OPAE protocol via libvortex.so
+# + libvortex-gem5-x86_64.so (selected by VORTEX_DRIVER).
+process = Process(
+    pid=100,
+    cwd="/path/to/your/test",
+    cmd=["/path/to/your/test/binary"],
+    executable="/path/to/your/test/binary",
+    env=[
+        "VORTEX_DRIVER=gem5-x86_64",
+        "LD_LIBRARY_PATH=/path/to/build/sw/runtime",
+    ],
+)
+
+system.workload = SEWorkload.init_compatible("/path/to/your/test/binary")
+system.cpu.workload = process
+system.cpu.createThreads()
+
+import m5
+root = Root(full_system=False, system=system)
+m5.instantiate()
+
+# CRITICAL: Process.map() must come AFTER m5.instantiate().
+# Identity-mapping PIO + PIN makes the runtime's volatile-pointer
+# MMIO and DMA staging buffer "just work" from the simulated process.
+system.cpu.workload[0].map(PIO_BASE, PIO_BASE, PIO_SIZE, cacheable=False)
+system.cpu.workload[0].map(PIN_BASE, PIN_BASE, PIN_SIZE, cacheable=True)
+
+m5.simulate()
+```
+
+Reference implementations:
+- [ci/gem5_test_vortex_hello.py](../ci/gem5_test_vortex_hello.py) — standalone Phase-3 variant (preload via `kernel=` param; no host CPU)
+- [ci/gem5_test_vortex_app.py](../ci/gem5_test_vortex_app.py) — Phase-5 e2e variant (any regression test via `VORTEX_TEST_BIN`)
+
+## Load-bearing invariants — do not violate
+
+These are the rules that, if broken, will silently produce wrong
+answers or hangs. Each is repeated from the proposal but is
+load-bearing enough to call out here:
+
+### 1. Process.map() goes AFTER m5.instantiate()
+
+`Process.map(vaddr, paddr, size)` is a C++ method on the underlying
+`gem5::Process` object; that object only exists after
+`m5.instantiate()` builds the SimObject tree. Calling `.map()`
+before instantiate raises `RuntimeError: Attempt to instantiate
+orphan node <orphan Process>`.
+
+Confirmed by gem5's own AMD GPU integration at
+`$GEM5_HOME/configs/example/apu_se.py:1055`.
+
+### 2. PIO and PIN regions must be identity-mapped
+
+`sw/runtime/gem5/driver.h` hard-codes:
+- `PIO_BASE_ADDR = 0x20000000` (device MMIO; 4 KB)
+- `PIN_BASE_ADDR = 0x10000000` (DMA staging; 256 MB)
+
+The Python config must `process.map()` both at the same physical
+addresses so:
+- CPU's `*(volatile uint64_t*)0x20000000` → membus routes to the device
+- Device's DmaPort read at phys `0x10000000+N` → membus routes to DRAM
+- Both sides agree on the same bytes without any virtual-to-physical
+  translation surprise.
+
+Changing either constant requires updating both the Python config
+**and** `sw/runtime/gem5/driver.h` (they are not auto-synced).
+
+### 3. The CPU runtime MUST issue a cache flush before reading back results
+
+The host runtime's `download()` path issues a per-core
+`dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy)` BEFORE the
+`CMD_MEM_READ` DMA. Skipping it returns stale data: the device's L1/L2/L3
+caches may still hold writes that haven't reached VRAM.
+
+This is bug **B9** in the legacy `vortex_gem5` code; the v3 host
+runtime fixes it. If you write your own runtime, do the same.
+
+### 4. MMIO writes need an explicit memory barrier before CMD_TYPE
+
+The host CPU model in gem5 (especially out-of-order variants) can
+reorder MMIO writes. `sw/runtime/gem5/driver.cpp` centralises the
+fence in `issue_cmd()` so it's impossible to forget:
+- x86: `__asm__ volatile("mfence" ::: "memory")`
+- AArch64/ARMv7: `__asm__ volatile("dmb sy" ::: "memory")`
+
+If your custom runtime bypasses `issue_cmd()`, replicate this. This
+is bug **B14** in the legacy code.
+
+### 5. One source of truth for memory state
+
+Vortex's VRAM is owned by `vortex::RAM` inside the device library.
+The pinned region is owned by gem5's DRAM. **The device library
+does not maintain a shadow copy of host pinned memory; the host
+runtime does not maintain a shadow copy of device VRAM.** Bytes
+cross between the two only via the explicit DMA staging path
+(steps 1-6 in §5 of `gem5_simx_v3_proposal.md`).
+
+Don't add a "fast path" that reads/writes the other side's memory
+directly. That breaks the timing model and reintroduces bug **B3**
+from the legacy code.
+
+### 6. USE_SST=1 and USE_GEM5=1 are mutually exclusive
+
+The Makefile rejects both at once. Different external simulators,
+different LDFLAGS, different `libvortex.so` shapes. Pick one per
+build.
+
+## Architectural choices you may want to revisit
+
+These are documented in [the proposal](proposals/gem5_simx_v3_proposal.md)
+but worth surfacing:
+
+- **Status polling, not doorbell queues** (proposal §3.6 "Doorbell
+  queues" note). The host runtime polls `MMIO_STATUS` between
+  commands; modern GPUs (AMD, NVIDIA) use ring-buffer + doorbell.
+  Phase 7+ upgrade if your research needs batched-dispatch realism.
+- **SE-mode + custom PIO+DMA wiring**, not FS-mode + PCIe (proposal
+  §3.6). Matches the legacy capstone paper; faster iteration. PCIe
+  upgrade is a Phase 5+ enhancement that swaps the SimObject base
+  class from `DmaDevice` to `PciDevice` (which itself derives from
+  `DmaDevice`, so the C ABI stays compatible).
+- **C ABI between the device library and gem5 SimObject** instead
+  of C++ linkage (proposal §3.1). Lets you rebuild
+  `libvortex-gem5.so` without rebuilding `gem5.opt` — Vortex
+  internals can churn freely.
+
+## CI
+
+`./ci/regression.sh --gem5` runs the gem5 tests. It is intentionally
+left **out** of `--all`: the gem5 install is heavy and gated like SST. The
+`.github/workflows/ci.yml` matrix includes a `gem5` entry that runs
+on hosted runners; the ARM matrix is gated on
+`VORTEX_GEM5_ARM=1`.
+
+Apptainer integration (the `apptainer-ci.yml` pipeline) does **not**
+include gem5 — adding it to `miscs/apptainer/vortex.def` is out of
+scope for this integration (proposal §8). Use the hosted CI for
+gem5.
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| `dlopen('libvortex-gem5.so') failed: cannot open shared object file` | gem5 SimObject can't find the device library | Set `VortexGPGPU(library="/abs/path/to/libvortex-gem5.so", ...)` to absolute path |
+| `Cannot open library: libvortex-gem5-x86_64.so: cannot open shared object file` | Stub can't find the host runtime backend | Set `LD_LIBRARY_PATH=/path/to/sw/runtime` in the `env=[...]` list passed to `Process()` |
+| `fatal: syscall clock_nanosleep (#230) unimplemented` | gem5 SE-mode doesn't implement clock_nanosleep; glibc's `nanosleep()` routes through it | Already fixed in `sw/runtime/gem5/vortex.cpp` (uses `sched_yield()` instead). If you wrote your own runtime, do the same. |
+| `Attempt to instantiate orphan node <orphan Process>` | `Process.map()` called before `m5.instantiate()` | Move all `.map()` calls AFTER `m5.instantiate()` — see invariant §1 above |
+| `fatal: VortexGPGPU: dlsym(vortex_gem5_build_info) failed` | Device library is missing the C ABI symbol — usually means the `library=` parameter points at the wrong .so | `library=` is the **device** library `build/sim/simx/libvortex-gem5.so` (no arch suffix), NOT the host runtime `libvortex-gem5-x86_64.so` |
+| Test hangs forever in `vx_ready_wait` | Device's busy bit never clears, usually because the SimObject didn't schedule the tick event | Confirm you set `system.vortex.dma = system.membus.cpu_side_ports` and the device's `tick()` is reachable. Check gem5 with `--debug-flags=VortexGPGPU` |
+| `ccache g++ ... undefined reference to fmt::v8::detail::error_handler::on_error` | ccache served a stale object compiled against a different `fmt` version | `CCACHE_DISABLE=1 make -C sim/simx clean && CCACHE_DISABLE=1 make ...` |
diff --git a/docs/index.md b/docs/index.md
index a7b9000d49..0c3504d724 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -8,3 +8,4 @@
 - [Contributing](contributing.md): Process for contributing your own features including repo semantics and testing
 - [Debugging](debugging.md): Debugging configurations for each Vortex driver
 - [Building the Toolchain from Source](building_toolchain.md): Maintainer-facing build recipes for Verilator, RISC-V GNU, LLVM (with X86 + lld + SPIR-V), compiler-rt, musl, and POCL
+- [gem5 Integration](gem5_integration.md): Running Vortex inside the gem5 simulator in SE mode (x86/ARM host CPU + Vortex device over OPAE MMIO/DMA)
diff --git a/docs/proposals/gem5_simx_v3_proposal.md b/docs/proposals/gem5_simx_v3_proposal.md
new file mode 100644
index 0000000000..470a4669b0
--- /dev/null
+++ b/docs/proposals/gem5_simx_v3_proposal.md
@@ -0,0 +1,1040 @@
+# gem5 Integration for SimX v3 — Proposal
+
+**Date:** 2026-05-16
+**Status:** ✅ ALL PHASES (0–7) COMPLETE on BOTH x86_64 AND aarch64 (hello + vecadd + sgemm × 2 ISAs all PASS end-to-end in 16 s wall via `VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5`)
+**Author:** Blaise Tine
+**Related:**
+[simx_v3_proposal.md](simx_v3_proposal.md) (Phase 5: TLM data path),
+[sst_simx_v3_proposal.md](sst_simx_v3_proposal.md) (the sister integration whose patterns this proposal follows),
+[master_merge_v3_proposal.md](master_merge_v3_proposal.md) §10.2 (the precedent for cross-simulator integrations on this line),
+[`vortex_gem5`](https://github.com/sij814/vortex_gem5) on branch `gem5`, commit `91dcf17` ("working Vortex with gem5", 2025-05-22, Injae Shin, UCLA capstone),
+[Injae Shin, "gem5-Vortex: Heterogeneous Cross-ISA Integration of Vortex GPGPU in gem5"](#) (capstone report, 2025).
+
+---
+
+## 1. Constraints (load-bearing)
+
+Any design that breaks one of these is wrong.
+
+1. **One source of truth for memory state.** Per
+   [simx_v3_proposal.md §3.3](simx_v3_proposal.md), data lives in the
+   channel hierarchy: `MemReq`/`MemRsp` packets carry actual bytes
+   between `MemCoalescer` → `Cache` → `Memory`, and the `RAM` image
+   attached to `Memory` is authoritative. There is no shadow backing
+   store and no parallel `MemBackend`. The gem5 integration plugs in at
+   exactly one boundary (the device's DMA port maps to `RAM`
+   read/write); it does **not** introduce a second data path.
+2. **Single clock owner per simulation.** Under gem5, gem5 drives the
+   clock: `VortexGPGPU::tick()` (a gem5 `EventFunctionWrapper` that
+   reschedules itself every cycle at the device clock) calls
+   `Processor::cycle()`. SimX does not advance on its own and there is
+   no worker thread doing async `Processor::run()` in the background.
+   (This is a deliberate departure from the legacy `vortex_gem5` design
+   — see §2.2 — which is the source of most of that branch's bugs.)
+3. **gem5 plugs in at one boundary, not many.** Vortex → gem5 traffic
+   crosses two well-defined interfaces:
+   - **PIO** for MMIO command/status registers (the OPAE AFU image
+     layout, unchanged from `sw/runtime/opae`).
+   - **DMA** for staging-buffer host↔device transfers, and for any
+     future host-visible memory window.
+   The cache hierarchy, scheduler, ALU/FPU, KMU, and the new
+   `Processor::cycle()` entry point do not know gem5 exists.
+4. **No regression for non-gem5 builds.** `make -C sim/simx` (no
+   `USE_GEM5=1`) continues to produce a self-contained `simx` binary
+   identical to today's. gem5 is opt-in compile-time, not a runtime
+   probe, and ships as a separate shared library (`libvortex-gem5.so`)
+   that the gem5 SimObject loads. Per §1.4 of
+   [sst_simx_v3_proposal.md](sst_simx_v3_proposal.md).
+5. **The Vortex tree owns the integration code.** All gem5-facing C++
+   (the `DmaDevice` SimObject) and Python (SimObject config + test
+   scripts) live under `sim/simx/gem5/` and `ci/gem5_test_vortex_*.py`
+   in this repo. `ci/gem5_install.sh` fetches a pinned upstream gem5
+   release and copies/symlinks our SimObject into its source tree
+   before building. Versioning the integration alongside Vortex is what
+   makes it possible to review API-breaking changes in a single PR;
+   the legacy split across two repos is what froze `vortex_gem5` at a
+   two-year-old SimX.
+6. **Author attribution.** The legacy `vortex_gem5` design (DMA-bouncing
+   through a pinned staging buffer, OPAE-shaped MMIO command set, ARM
+   SE-mode runtime) is Injae Shin's capstone work. The
+   re-implementation is a rewrite, not a port (§2), but each new file's
+   commit body cites the capstone report and the legacy commit
+   (`vortex_gem5@91dcf17`).
+
+---
+
+## 2. Why the legacy `vortex_gem5` cannot be ported as-is
+
+### 2.1 The architectural mismatch
+
+`vortex_gem5` was built on pre-v3 SimX (`Arch`, `Processor*`,
+single-step `run()`, `set_running(true)`, `VX_DCR_BASE_STARTUP_*` startup DCRs
+broadcast to all cores). v3 explicitly retired all of those:
+
+| Concern | Legacy SimX (vortex_gem5) | SimX v3 (this branch) |
+|---|---|---|
+| Sizing | `Arch arch(NUM_THREADS, NUM_WARPS, NUM_CORES)` object | Macros (`NUM_THREADS`, etc.) — no `Arch` class |
+| Top-level | `Processor(arch)` ctor with arg | `Processor()` no-arg ctor |
+| Run model | `processor->run()` is one cycle | `processor.run()` blocks to completion |
+| Single-cycle step | `processor->run()` per cycle from `proc_tick()` | does not exist — must be added (`Processor::cycle()`) |
+| Kernel dispatch | `set_running(true)` + `VX_DCR_BASE_STARTUP_*` | `KMU::start()` + `VX_DCR_KMU_*` (startup + grid/block dims) |
+| Cache flush | implicit in `run()` finish | explicit: `dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy)` per core before host read-back |
+| Memory hierarchy | `MemSim` + `CacheSim` are timing-only, data sits in `MemBackend` (`Emulator`-side) | `Memory` + `Cache` carry data through `MemReq`/`MemRsp`; backing image is in `RAM` attached to `Memory` |
+| Runtime layout | top-level `runtime/{stubarm,opaesimx}/` | reorganized under `sw/runtime/` per [master_merge §3](master_merge_v3_proposal.md) |
+
+So the **shape of the gem5 plug-in changes**: not "tick the legacy
+single-cycle Processor" but "add a `cycle()` entry point to the v3
+Processor and call it from the gem5 SimObject," with KMU-style dispatch
+and an explicit cache-flush before host read-back.
+
+### 2.2 Specific bugs in the legacy code
+
+A walk-through of `vortex_gem5/sim/{simx,opaesimx}/` and
+`vortex_gem5/runtime/{stubarm,opaesimx}/` found the following defects.
+Each is called out so the redesign does not re-introduce it.
+
+| # | File | Defect | Why it matters |
+|---|---|---|---|
+| B1 | `sim/simx/simx_device.cpp:122` (`proc_tick`) | Calls `processor_->run()` directly. On legacy SimX this was a single step; on v3 it would block until program completion. | The "tick per gem5 cycle" pattern simply won't work. We must add a real single-cycle `Processor::cycle()` (already required for SST). |
+| B2 | `sim/simx/simx_device.cpp:111` (`start`) | `processor_->set_running(true)`, an API that does not exist in v3. The KMU now drives execution and requires `VX_DCR_KMU_GRID_DIM_*` / `VX_DCR_KMU_BLOCK_DIM_*` to be written before the first cycle. | Even after re-plumbing, kernels won't launch without the KMU DCR setup (see `sim/simx/main.cpp:101–116`). |
+| B3 | `sim/opaesimx/opae_simx.cpp:185, 199` (`read_mmio64`/`write_mmio64`) | Implementation is `*(uint64_t*)(GEM5_BASE_ADDR + offset)` — a raw host-pointer dereference into a fixed virtual address. | Only works when the host runtime and the gem5 device share an address space (i.e., when the host runtime is *not* actually inside gem5). It is a stand-in for the real path, not the real path. Cross-ISA simulation defeats the assumption: an ARM userspace process inside gem5 cannot dereference `0x20000000` and reach the device. The legacy code papers over this with a co-resident driver hack; v3 needs a real PIO/DMA path. |
+| B4 | `sim/opaesimx/opae_simx.cpp:204–399` | Several hundred lines of commented-out CCI/AVS bus + Verilator (`device_->…`) plumbing left in place, referencing fields and types that do not exist in this file. | Dead code that obscures what the module actually does. Drop it; the new gem5 wrapper has no CCI bus to model. |
+| B5 | `sim/opaesimx/opae_simx.cpp:71` (`dram_sim_` field) | DRAM model is constructed but never ticked or consulted after the gem5 hack landed. | Dead state. |
+| B6 | `sim/opaesimx/opae_simx.cpp:103` (`pinned_alloc_`) | Uses `PIN_BASE_ADDR = 0x10000000` with `PINNED_MEM_SIZE = 0xFFFFFF` (16 MB), hardcoded. No bounds check beyond `MemoryAllocator::allocate` failure. | Tiny by design — large kernel inputs would silently fail. The v3 design should size from `GLOBAL_MEM_SIZE`/`ALLOC_BASE_ADDR` and surface OOM errors. |
+| B7 | `runtime/opaesimx/vortex.cpp:324, 367` | `auto ls_shift = (int)std::log2(CACHE_BLOCK_SIZE);` — uses float `log2` for an integer constant, then discards the result. | Cosmetic / dead, but a smell. Use `log2ceil(CACHE_BLOCK_SIZE)` from `sw/common/util.h`. |
+| B8 | `runtime/opaesimx/vortex.cpp:418–474` (`ready_wait`) | `nanosleep` call is **commented out**; the busy loop only decrements `timeout_ms` and never sleeps. On a long-running kernel inside gem5 SE-mode this saturates the simulated ARM core. | Either use the gem5 device's interrupt path (preferred — implementable as an MMIO doorbell) or restore the `nanosleep` so the ARM CPU is idle while the GPU runs. |
+| B9 | `runtime/opaesimx/vortex.cpp:349–390` (`download`) | No cache-flush step before reading back results from device memory. | On v3, dirty lines must be drained via `dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy)` per core (see `sim/simx/main.cpp:194–197`, `sw/runtime/simx/vortex.cpp:191–197`) or the host sees stale data. |
+| B10 | `runtime/opaesimx/vortex.cpp:478–489` (`dcr_write`) | OPAE protocol has `CMD_DCR_WRITE` but no `CMD_DCR_READ`. | The cache-flush fix above requires a `dcr_read` path. Current `sw/runtime/opae` already adds `CMD_DCR_READ` + `MMIO_DCR_RSP` — adopt the same shape on the gem5 device. |
+| B11 | `runtime/stubarm/vortex.cpp:54` | `static callbacks_t g_callbacks;` global with `vx_dev_init(&g_callbacks)` resolved at link time. | Works for a single-device test but breaks when `vx_dev_open` is called concurrently from two host processes. Less critical for the gem5 use case (single device per simulation) but worth flagging. |
+| B12 | `sim/simx/simx_device.cpp` (`Impl`) | Uses `std::future<void> future_` for shutdown synchronization but `proc_tick()` calls `processor_->run()` directly on the caller thread. The mutex / future plumbing implies an async model that isn't actually used. | Confused concurrency contract. The v3 design must pick one: synchronous tick from the gem5 event loop (this proposal) **or** async run with a doorbell — not both. |
+| B13 | `runtime/stubarm/Makefile:7` + `runtime/opaesimx/Makefile:9` | Cross-compiler hardcoded to `arm-linux-gnueabihf-g++` (32-bit ARM hard-float). | gem5 also models AArch64 ARMv8 and x86_64, and most contemporary ARM ports are 64-bit. The v3 build selects compiler from a `HOST_ARCH` make variable (`x86_64`, `aarch64`, `armhf`); see Phase 4. |
+| B14 | `runtime/opaesimx/vortex.cpp:489` (`dcr_write`) and `stubarm/vortex.cpp:139` | Both runtimes write to DCR via the OPAE protocol but no MMIO ordering / fence is established between DCR writes and the `CMD_RUN` MMIO. | Inside gem5 the host CPU model may reorder MMIO. Need an explicit barrier before `CMD_RUN` (per `HOST_ARCH`: `mfence` for x86, `dmb sy` for ARM). Phase 4 provides a `vortex_gem5_mmio_fence()` inline helper. |
+| B15 | `sim/opaesimx/opae_simx.cpp:138–157` (`prepare_buffer`) | Returns `*buf_addr = (void*)buffer.ioaddr;` — casts an integer device IO address back to a `void*`. | The runtime then dereferences this pointer to do `memcpy(staging_ptr_, host_ptr, size)` (line 322 of `runtime/opaesimx/vortex.cpp`). Same root cause as B3 — only works when host runtime and device share an address space. Under real gem5 the runtime must `mmap` the pinned region via a syscall the gem5 device intercepts, or the gem5 device must expose the pinned region as a PIO/DMA window. |
+
+Together B1, B2, B3, B6, B9, B14 and B15 mean the legacy integration as
+literally written does not run a kernel correctly under v3 even after
+the path renames are applied; it requires architectural rework, not
+porting.
+
+### 2.3 What still ports as design intent
+
+The legacy paper's design intent still holds, and these are the parts we keep:
+
+- **OPAE-shaped MMIO command set.** `CMD_RUN`, `CMD_MEM_READ`,
+  `CMD_MEM_WRITE`, `CMD_DCR_WRITE`, `MMIO_CMD_TYPE`, `MMIO_CMD_ARG0..2`,
+  `MMIO_STATUS`. Add `CMD_DCR_READ` + `MMIO_DCR_RSP` per the v3 OPAE
+  runtime (B10). The kernel runtime under `sw/runtime/gem5/` reuses
+  this layout so the same `vortex.h` shim layer that drives `opae`
+  also drives `gem5`.
+- **Pinned staging buffer pattern** for host↔device transfers. A
+  fixed device-visible region of host address space; the runtime
+  `memcpy`s into it, and the device DMAs out of it. Sizing is dynamic
+  (allocate-on-demand) rather than the legacy fixed-16-MB chunk (B6).
+- **Single-PIO-range device** registered to gem5 with the OPAE MMIO
+  offsets. The runtime issues 64-bit MMIO writes; the SimObject
+  decodes them in `write()` / `read()`.
+- **The host SE-mode runtime** (`sw/runtime/gem5/`, native x86 or cross-compiled ARM)
+  linked into the application gem5 runs in SE mode, **NOT** a full-system Linux on the
+  guest. The paper makes this point explicitly and it is the
+  differentiator vs. NoMali (FS-only) and AMD GPU (FS-only). See
+  `capstone §IIC`.
+
+### 2.4 What needs a v3 redesign
+
+- **`sim/simx/simx_device.{cpp,h}`** — replace with
+  `sim/simx/gem5/vortex_gpgpu.{cpp,h}` (the C-ABI device library)
+  plus reuse of the new `Processor::cycle()` API. The legacy file's
+  `Impl` class is the wrong shape (B1, B2, B12).
+- **`sim/opaesimx/opae_simx.{cpp,h}`** — delete entirely. The legacy
+  module is a host-side OPAE stub whose `read_mmio64`/`write_mmio64`
+  do raw pointer arithmetic (B3, B15). The v3 design routes MMIO
+  through gem5's PIO port; there is no host-side stub.
+- **`runtime/opaesimx/`** — delete. The OPAE-stub path was a
+  pre-gem5 debugging convenience; under v3 we test the gem5 device
+  end-to-end via a gem5 Python script (§4, Phase 5), not via a
+  co-resident driver.
+- **`runtime/stubarm/`** — replace with `sw/runtime/gem5/`,
+  re-implemented against the same `callbacks.h` ABI as
+  `sw/runtime/simx`/`opae`/`rtlsim`, with cache-flush plumbed in
+  (B9), MMIO fences before `CMD_RUN` (B14), and a configurable ARM
+  cross-compiler target (B13).
+
+---
+
+## 3. Target architecture
+
+```
+                ┌───────────────────────────────────────────────┐
+                │  gem5 simulation                              │
+                │  ─────────────────                            │
+                │  ./ci/gem5_test_vortex_hello.py               │
+                │  (gem5.opt is build/X86/gem5.opt or           │
+                │   build/ARM/gem5.opt; both supported)         │
+                │                                               │
+                │  ┌─────────────┐         ┌─────────────────┐  │
+                │  │ Host CPU    │ ──PIO─▶ │ VortexGPGPU     │  │
+                │  │ (X86 or ARM,│ ◀─PIO── │ (DmaDevice ↓    │  │
+                │  │  SE mode)   │         │  PioDevice)     │  │
+                │  │ user        │         │  ┌───────────┐  │  │
+                │  │ binary:     │         │  │ MMIO regs │  │  │
+                │  │  hello +    │         │  └───────────┘  │  │
+                │  │  libvortex- │         │  ┌───────────┐  │  │
+                │  │  gem5.so    │ ──DMA─▶ │  │ Pinned    │  │  │
+                │  │  (native    │ ◀─DMA── │  │ staging   │  │  │
+                │  │   for X86,  │         │  │ buffer    │  │  │
+                │  │   cross-    │         │  │ window    │  │  │
+                │  │   compiled  │         │  └───────────┘  │  │
+                │  │   for ARM)  │         │       │         │  │
+                │  └─────────────┘         │       ▼         │  │
+                │         │                │  ┌───────────┐  │  │
+                │         │ MemPort        │  │ vortex::  │  │  │
+                │         ▼                │  │ Processor │  │  │
+                │  ┌─────────────┐         │  │ (SimX v3) │  │  │
+                │  └─────────────┘         │  │           │  │  │
+                │                          │  │  Cluster[]│  │  │
+                │                          │  │   Cache   │  │  │
+                │                          │  │   Memory ─┼──┼──┼─▶ RAM (Vortex VRAM,
+                │                          │  └───────────┘  │  │      held inside the
+                │                          │   ▲             │  │      device — separate
+                │                          │   │ cycle()     │  │      address space from
+                │                          │  ┌┴──────────┐  │  │      gem5 DRAM)
+                │                          │  │ tick      │  │  │
+                │                          │  │ (gem5     │  │  │
+                │                          │  │  event)   │  │  │
+                │                          │  └───────────┘  │  │
+                │                          └─────────────────┘  │
+                └───────────────────────────────────────────────┘
+```
+
+### 3.1 The plug-in boundary
+
+The Vortex side exposes **one** plug-in unit: `libvortex-gem5.so`. It
+is built from the same `sim/simx/*.{cpp,h}` sources as the default
+`simx` binary, plus a single new wrapper file
+(`sim/simx/gem5/vortex_gpgpu.{cpp,h}`) that holds:
+
+- A `vortex::Gem5Wrapper` C++ class that owns a `vortex::Processor`,
+  a `vortex::RAM` (the device VRAM), and a thin `cycle()` entry
+  point — exactly mirroring `vortex::VortexSimulator` in
+  `sim/simx/sst/`.
+- A C-ABI shim (`vortex_gem5_create()`, `vortex_gem5_tick()`,
+  `vortex_gem5_mmio_write64()`, `vortex_gem5_mmio_read64()`,
+  `vortex_gem5_dma_read()`, `vortex_gem5_dma_write()`, …) so the
+  gem5-side SimObject is decoupled from C++ ABI changes in
+  `vortex::Processor`. **The C ABI is the contract;** changing it
+  requires a coordinated update of the gem5-side SimObject.
+
+The gem5 side is **one** SimObject + **one** Python file, both shipped
+in this repo at `sim/simx/gem5/`:
+
+- `vortex_gpgpu_dev.{cc,hh}` — subclasses `gem5::DmaDevice` (which
+  itself subclasses `PioDevice`). Holds an opaque
+  `vortex_gem5_handle_t`; on `tick()`, calls `vortex_gem5_tick()`. PIO
+  reads/writes decode the OPAE MMIO offsets and forward to
+  `vortex_gem5_mmio_*`. DMA reads/writes triggered by
+  `CMD_MEM_{READ,WRITE}` use gem5's `DmaPort` and move bytes to or
+  from the device VRAM via `vortex_gem5_dma_*`.
+- `VortexGPGPU.py` — `gem5.SimObject` definition with `pio_addr`,
+  `pio_size`, `pio_latency`, `dma_latency`, `clock`, `library`
+  (path to `libvortex-gem5.so`), and `kernel` (path to `*.vxbin` —
+  loaded into VRAM at boot, in lieu of the runtime upload path, for
+  smoke tests).
+
+`ci/gem5_install.sh.in` fetches a pinned gem5 release
+(see §3.7 for version), copies these files into
+`<gem5>/src/dev/vortex/`, drops a one-line `SConscript`, and runs
+`scons build/{X86,ARM}/gem5.opt`.
+
+**Nothing upstream of `vortex_gem5_create()` knows gem5 exists.** This
+satisfies §1.3.
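+
+The opaque-handle pattern behind that boundary can be sketched as
+follows. This is a minimal illustration, not the real ABI: every
+`vx_sketch_*` name, the `sketch::Wrapper` class, and the
+`run_until_idle` helper are hypothetical stand-ins, and the wrapper
+counts down a fixed number of cycles instead of owning a
+`vortex::Processor`.
+
+```cpp
+#include <cassert>
+#include <cstdint>
+
+namespace sketch {
+// Stand-in for the C++ object behind the handle. The real wrapper would
+// own a vortex::Processor and a vortex::RAM; this one only counts cycles.
+class Wrapper {
+public:
+  explicit Wrapper(uint64_t cycles_to_run) : remaining_(cycles_to_run) {}
+  // Advance one cycle; returns true while work remains afterwards.
+  bool cycle() {
+    if (remaining_ == 0) return false;
+    --remaining_;
+    return remaining_ != 0;
+  }
+private:
+  uint64_t remaining_;
+};
+} // namespace sketch
+
+extern "C" {
+// The caller only ever sees an incomplete struct pointer, so C++ layout
+// changes behind it never leak across the boundary.
+typedef struct vx_sketch_handle* vx_sketch_handle_t;
+
+vx_sketch_handle_t vx_sketch_create(uint64_t cycles_to_run) {
+  return reinterpret_cast<vx_sketch_handle_t>(
+      new sketch::Wrapper(cycles_to_run));
+}
+
+// Returns 1 while the device still has work, 0 once it is idle.
+int vx_sketch_tick(vx_sketch_handle_t h) {
+  return reinterpret_cast<sketch::Wrapper*>(h)->cycle() ? 1 : 0;
+}
+
+void vx_sketch_destroy(vx_sketch_handle_t h) {
+  delete reinterpret_cast<sketch::Wrapper*>(h);
+}
+}
+
+// Caller side: roughly what a tick event does, minus gem5 scheduling.
+// Returns the number of tick calls made, including the final idle one.
+inline uint64_t run_until_idle(vx_sketch_handle_t h) {
+  uint64_t ticks = 1;
+  while (vx_sketch_tick(h)) ++ticks;
+  return ticks;
+}
+```
+
+Because the caller holds only the handle and three C functions, the
+class behind it can change freely as long as those signatures stay
+fixed.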
+
+### 3.2 The cycle interface
+
+`Processor::cycle()` does **not exist** in v3 today. It is a direct
+prerequisite of both the SST integration (per
+[sst_simx_v3_proposal.md §3.2](sst_simx_v3_proposal.md)) and this
+proposal. The signature and shape are identical to what SST needs:
+
+```cpp
+// processor.h — public additions
+bool cycle();        // advance one cycle; returns false when nothing is running
+Memory* memsim();    // for optional gem5/SST memory-mirroring hooks
+```
+
+```cpp
+// processor.cpp — implementation
+bool ProcessorImpl::cycle() {
+  if (!is_cycle_initialized_) {
+    SimPlatform::instance().reset();
+    this->reset();
+    kmu_->start();                  // dispatch CTAs into the cluster
+    is_cycle_initialized_ = true;
+  }
+  SimPlatform::instance().tick();
+  return this->any_running();
+}
+
+Memory* ProcessorImpl::memsim() { return memsim_.get(); }
+```
+
+The two pieces (`SimPlatform::reset()` → `start_kmu()` →
+`SimPlatform::tick()` and `any_running()`) are already factored on
+`Processor` from Round 6 DTM work. `cycle()` just packages them into a
+single-cycle step.
+
+**Reuse from DTM work:** `start_kmu()` and `any_running()` are already
+public on `Processor`. We add `cycle()` and `memsim()` and that is the
+entire SimX-side API surface required by both SST and gem5.
+
+### 3.3 The MMIO command protocol
+
+Identical to `sw/runtime/opae` v3 (the OPAE driver), reusing
+`hw/syn/altera/opae/vortex_afu.h`:
+
+| Offset | Name | Direction | Purpose |
+|---|---|---|---|
+| `MMIO_CMD_TYPE` | `CMD_*` | W64 | Dispatch one of: `MEM_READ`, `MEM_WRITE`, `RUN`, `DCR_WRITE`, `DCR_READ` |
+| `MMIO_CMD_ARG0..2` | command-specific | W64 | DCR addr / device addr / size / value |
+| `MMIO_STATUS` | bit0=busy | R64 | Polled by runtime's `ready_wait` |
+| `MMIO_DCR_RSP` | response | R64 | Result of `CMD_DCR_READ` (used for cache-flush) |
+| `MMIO_DEV_CAPS` / `MMIO_ISA_CAPS` | caps bitfield | R64 | Encoded device capabilities |
+
+The runtime issues commands by writing args first, then `CMD_TYPE`
+(B14 fix: emit a memory barrier before the type write, `mfence` on
+x86 or `dmb sy` on ARM). The device latches the args on the
+`CMD_TYPE` write and sets the status busy bit. The PIO write returns
+once the operation is enqueued (or already complete, for fast ones
+like `DCR_WRITE`), and the device clears the busy bit when done.
+
+`CMD_MEM_{READ,WRITE}` use the staging-buffer protocol from the
+capstone paper Fig. 5 (§3.4 below).
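+
+The issue sequence (args, barrier, `CMD_TYPE`, poll) can be sketched
+against a fake register file. Everything here is an assumption made
+for illustration: the `Reg` indices, `SKETCH_CMD_DCR_WRITE`, the
+`FakeDevice` model (which reports busy for two polls), and
+`issue_command` are hypothetical; the real offsets live in
+`vortex_afu.h`, and a portable `std::atomic_thread_fence` stands in
+for the per-ISA barrier.
+
+```cpp
+#include <atomic>
+#include <cassert>
+#include <cstdint>
+
+// Hypothetical register indices and command code, for illustration only.
+enum Reg : uint32_t {
+  REG_CMD_TYPE, REG_ARG0, REG_ARG1, REG_ARG2, REG_STATUS, REG_COUNT
+};
+constexpr uint64_t SKETCH_CMD_DCR_WRITE = 4;
+
+// Fake device: latches the args when CMD_TYPE is written, then reports
+// busy for the next two status reads.
+struct FakeDevice {
+  uint64_t regs[REG_COUNT] = {};
+  uint64_t latched_args[3] = {};
+  int busy_polls_left = 0;
+
+  void write64(Reg r, uint64_t v) {
+    regs[r] = v;
+    if (r == REG_CMD_TYPE) {
+      for (int i = 0; i < 3; ++i) latched_args[i] = regs[REG_ARG0 + i];
+      busy_polls_left = 2;
+    }
+  }
+
+  uint64_t read64(Reg r) {
+    if (r != REG_STATUS) return regs[r];
+    if (busy_polls_left > 0) { --busy_polls_left; return 1; }  // bit0 = busy
+    return 0;
+  }
+};
+
+// Runtime side: args first, barrier, then the type write that triggers
+// the command. Returns how many status reads it took to see busy == 0.
+inline int issue_command(FakeDevice& dev, uint64_t cmd,
+                         uint64_t a0, uint64_t a1, uint64_t a2) {
+  dev.write64(REG_ARG0, a0);
+  dev.write64(REG_ARG1, a1);
+  dev.write64(REG_ARG2, a2);
+  // Stand-in for the per-ISA MMIO barrier.
+  std::atomic_thread_fence(std::memory_order_seq_cst);
+  dev.write64(REG_CMD_TYPE, cmd);
+
+  int polls = 0;
+  for (;;) {
+    ++polls;
+    if ((dev.read64(REG_STATUS) & 1) == 0) break;
+  }
+  return polls;
+}
+```
+
+The ordering matters because the device samples the argument
+registers at the moment `CMD_TYPE` is written; an argument write that
+lands later would be missed.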
+
+### 3.4 The staging-buffer protocol
+
+The gem5 device exposes a PIO-addressable register `MMIO_PINNED_BASE`
+that returns the base address of a pinned region inside gem5's host
+address space. The runtime, on `vx_mem_alloc`, lazily picks a slice of
+that region as a staging buffer.
+
+For a `vx_copy_to_dev(host_ptr, dev_addr, size)`:
+1. Runtime `memcpy(staging_buf, host_ptr, size)`.
+2. Runtime writes `staging_buf_addr`, `dev_addr`, `size` to
+   `MMIO_CMD_ARG{0,1,2}`.
+3. Runtime writes `CMD_MEM_WRITE` to `MMIO_CMD_TYPE`.
+4. Device's PIO handler enqueues a `gem5::DmaPort::dmaAction()` read
+   from `staging_buf_addr` into a local scratch.
+5. On DMA completion, the device copies the scratch bytes into Vortex's
+   `RAM` at `dev_addr` (via `RAM::write`).
+6. Device clears the status busy bit.
+7. Runtime polls `MMIO_STATUS` until busy=0.
+
+`vx_copy_from_dev` is the reverse, with **cache flush first** (B9):
+the runtime issues `CMD_DCR_READ(VX_DCR_BASE_CACHE_FLUSH, cid)` for
+every core before the `CMD_MEM_READ`. The device's DCR-read handler
+plumbs through to `Processor::dcr_read`, which already invokes
+`flush_caches()` for the cache-flush DCR
+([processor.cpp:251–258](../../sim/simx/processor.cpp#L251)).
+
+This is the same protocol the v3 OPAE runtime already uses, so the
+runtime under `sw/runtime/gem5/` differs from `sw/runtime/opae/` only
+in:
+- The `driver.{cpp,h}` backend (gem5 mmaps a `/dev/vortex_gem5`
+  character device path **OR**, in SE-mode, gem5 sets up the device's
+  PIO/DMA windows directly in the simulated process's address space —
+  see §3.6).
+- The lack of an `fpgaPrepareBuffer` API (the device exposes the
+  pinned region itself; no per-call buffer allocation by an OPAE
+  layer).
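+
+The copy steps above, and the reason the download path flushes first,
+can be sketched with an in-memory model. This is a sketch under
+stated assumptions: `FakeVortex` and `FakeRuntime` are hypothetical,
+the DMA is a plain `memcpy`, and the device cache is reduced to a
+single dirty byte that reaches VRAM only on `flush_caches()`.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+#include <cstring>
+#include <vector>
+
+// Hypothetical device model: VRAM plus one dirty cached byte.
+struct FakeVortex {
+  std::vector<uint8_t> vram;
+  bool dirty = false;
+  uint64_t dirty_addr = 0;
+  uint8_t dirty_val = 0;
+
+  explicit FakeVortex(size_t bytes) : vram(bytes, 0) {}
+
+  // A kernel store that is still sitting in the cache.
+  void kernel_store(uint64_t addr, uint8_t v) {
+    dirty = true; dirty_addr = addr; dirty_val = v;
+  }
+  void flush_caches() {
+    if (dirty) { vram[dirty_addr] = dirty_val; dirty = false; }
+  }
+  // CMD_MEM_WRITE: DMA the staging bytes into scratch (step 4), then
+  // copy scratch into VRAM (step 5).
+  void mem_write(const uint8_t* staging, uint64_t dev_addr, size_t size) {
+    std::vector<uint8_t> scratch(staging, staging + size);
+    std::memcpy(vram.data() + dev_addr, scratch.data(), size);
+  }
+  // CMD_MEM_READ: VRAM back out to the staging buffer.
+  void mem_read(uint8_t* staging, uint64_t dev_addr, size_t size) const {
+    std::memcpy(staging, vram.data() + dev_addr, size);
+  }
+};
+
+// Hypothetical runtime model owning the pinned staging buffer.
+struct FakeRuntime {
+  FakeVortex& dev;
+  std::vector<uint8_t> staging;
+
+  FakeRuntime(FakeVortex& d, size_t staging_bytes)
+      : dev(d), staging(staging_bytes) {}
+
+  void copy_to_dev(const void* host, uint64_t dev_addr, size_t size) {
+    assert(size <= staging.size());
+    std::memcpy(staging.data(), host, size);        // step 1
+    dev.mem_write(staging.data(), dev_addr, size);  // steps 2-6
+  }
+
+  void copy_from_dev(void* host, uint64_t dev_addr, size_t size,
+                     bool flush_first = true) {
+    assert(size <= staging.size());
+    if (flush_first) dev.flush_caches();            // B9: flush before read
+    dev.mem_read(staging.data(), dev_addr, size);
+    std::memcpy(host, staging.data(), size);
+  }
+};
+```
+
+Calling `copy_from_dev` with `flush_first = false` after a
+`kernel_store` returns the stale VRAM byte, which is the failure mode
+the cache-flush contract exists to prevent.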
+
+### 3.5 Build-time gating
+
+`USE_GEM5=1` make variable controls compilation of:
+- `sim/simx/gem5/vortex_gpgpu.{cpp,h}` (the C ABI wrapper).
+- Link target `libvortex-gem5.so` produced alongside `libsimx.so`
+  (mirrors the SST `libvortex.so` pattern in `sim/simx/Makefile`).
+
+`USE_GEM5=1` does **not** affect the default build:
+`make -C sim/simx` (no flag) still produces a stand-alone `simx`
+binary with no gem5 dep. Per §1.4.
+
+The host-side runtime supports both x86 (native) and ARM (cross-
+compiled) targets via a `HOST_ARCH` switch:
+```
+make -C sw/runtime/gem5                                     # x86 default
+make -C sw/runtime/gem5 HOST_ARCH=x86_64                    # explicit x86
+make -C sw/runtime/gem5 HOST_ARCH=aarch64                   # AArch64 cross
+make -C sw/runtime/gem5 HOST_ARCH=armhf                     # ARMv7 cross
+```
+producing `libvortex-gem5-{x86_64,aarch64,armhf}.so`. Test scripts
+select the matching `(gem5.opt, libvortex-gem5-*.so)` pair via the
+`HOST_ARCH` make variable. Native x86 needs no toolchain install; ARM
+requires `gcc/g++-aarch64-linux-gnu` (or `-arm-linux-gnueabihf` for
+ARMv7), which `ci/gem5_install.sh` installs as part of Phase 0.
+
+### 3.6 gem5 SE-mode wiring + ISA selection
+
+**Host ISA: both x86 and ARM, equally first-class** (decision recorded
+2026-05-16 after Phase 0 prototyping). Phase 0's `ci/gem5_install.sh`
+builds `build/X86/gem5.opt` *and* `build/ARM/gem5.opt`; phases 4–6
+test both. Rationale:
+
+- **x86** is the path of least resistance for users — no
+  cross-toolchain, native `g++` builds `sw/runtime/gem5/`, faster
+  gem5 CPU model, and PCIe is canonical on x86 (relevant to the
+  Phase 5+ upgrade path below).
+- **ARM** is the research-narrative path matching the capstone paper
+  (Injae Shin 2025) and actually-deployed ARM+accelerator HPC
+  platforms (Grace Hopper, Fugaku, Graviton, Apple Silicon). Kept
+  as a first-class matrix variant; not a stretch goal.
+
+Three MMIO/DMA paths exist; this proposal picks one for the initial
+work and notes the others as future upgrades:
+
+| Path | Description | Status in this proposal |
+|---|---|---|
+| **1. SE-mode + custom PIO+DMA wiring** | The device is a `DmaDevice` subclass attached to `system.membus` at a configurable `pio_addr` (default `0x20000000`, matching the legacy paper). Host binary touches the address via `mmap`/inline asm. Works in both x86 SE-mode and ARM SE-mode. | **Phase 2–6: this is the design.** Matches legacy paper, lightweight, fast iteration. |
+| **2. FS-mode + PCIe device** | Subclass `PciDevice` (which already inherits `DmaDevice`); BARs expose MMIO, DMA for staging. Full Linux boot inside gem5 with a tiny PCI kernel module to bind the device. | **Phase 5+ upgrade.** Realistic accelerator-modeling story expected by x86 users. The C ABI committed in Phase 2 is shape-compatible — `PciDevice` and the custom `DmaDevice` both use the same `vortex_gem5_dma_*` callbacks; only the gem5-side wrapper class differs. |
+| **3. `/dev/vortex_gem5` pseudo-file** | The gem5 device implements `SyscallReturn open(...)` + `mmap` for a synthetic device path. Runtime `open("/dev/vortex_gem5", O_RDWR)` + `mmap`. | Out of scope. Closest to how real OPAE drivers work but requires a custom syscall handler in gem5; cost outweighs the benefit when Path 1 already works. |
+
+**Doorbell queues** are a Phase 7+ realism upgrade orthogonal to the
+transport choice above. AMD GPU (gem5 `src/dev/amdgpu/`, derived
+from `PciEndpoint`) and NVIDIA-style modern accelerators use a ring
+buffer in host DRAM plus a single MMIO "doorbell" write per dispatch:
+the host appends commands to the ring, then writes the new tail
+offset to the doorbell register; the device asynchronously walks the
+ring and processes commands. The Phase 2-6 design instead uses
+**status polling** — the host writes args + `CMD_TYPE`, then polls
+`MMIO_STATUS` until done — which matches the legacy OPAE FPGA driver.
+Polling is fine for the capstone-paper scope (small kernels, one at
+a time) but burns simulated cycles on the spin. If later research
+wants batched-dispatch realism comparable to AMD GPU, the upgrade
+swaps the OPAE MMIO command set for a ring + doorbell protocol; the
+C ABI in Phase 2 stays compatible (a new `vortex_gem5_doorbell_ring(handle, tail)`
+entry point alongside the existing `vortex_gem5_mmio_*`).
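+
+For contrast with status polling, the ring + doorbell idea can be
+sketched as below. This is a hypothetical model, not a proposed
+implementation: `Command`, `DoorbellRing`, and its members are
+illustrative names, and the device walks the ring synchronously
+inside `doorbell()` where a real device would do so asynchronously
+from its own tick.
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+struct Command { uint64_t type, arg0, arg1, arg2; };
+
+template <size_t N>
+struct DoorbellRing {
+  static_assert((N & (N - 1)) == 0, "N must be a power of two");
+
+  std::array<Command, N> slots{};   // would live in host DRAM
+  uint64_t head = 0;                // device-owned: next slot to consume
+  uint64_t tail = 0;                // host-owned: next slot to fill
+  std::vector<uint64_t> processed;  // command types the device has seen
+
+  // Host: append a command; fails when the ring is full.
+  bool host_push(const Command& c) {
+    if (tail - head == N) return false;
+    slots[tail % N] = c;
+    ++tail;
+    return true;
+  }
+
+  // Device: the single MMIO write per batch. Stands in for the
+  // vortex_gem5_doorbell_ring(handle, tail) entry point named above.
+  void doorbell(uint64_t new_tail) {
+    while (head != new_tail) {
+      processed.push_back(slots[head % N].type);
+      ++head;
+    }
+  }
+};
+```
+
+The host pays one MMIO write per batch instead of three argument
+writes, a type write, and a polling loop per command.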
+
+### 3.7 gem5 version pinning
+
+`ci/gem5_install.sh.in` pins gem5 to v25.0.0.1 (a patch release on
+top of v25.0.0, the most recent stable release as of 2026-05). The
+pinned tag goes in `VERSION` alongside `TOOLCHAIN_REV` and `SST_VER`;
+bumps require a CI re-run on the self-hosted runner first (small
+risk of API drift on gem5's `DmaDevice`/`PioDevice` between major
+releases). **Picking and validating this pin is the first
+deliverable of Phase 0**: every other phase is a no-op if Phase 0
+reveals that the pinned release no longer supports SE-mode PIO
+mapping or the SimObject install path we depend on.
+
+### 3.8 Why this is not just a copy of the SST pattern
+
+SST and gem5 are similar in shape (external simulator drives the
+Vortex clock through a C++ wrapper around `Processor::cycle()`) but
+differ in three load-bearing ways:
+
+1. **The host process is simulated under gem5.** Under SST the host
+   "process" is the SST Python script itself, running natively on the
+   developer's machine. Under gem5 the host is a userspace process
+   (x86 or ARM, per §3.6) running inside the gem5 model. So the gem5
+   integration also needs a host-side runtime under `sw/runtime/gem5/`
+   (native compile for x86, cross-compile for ARM); SST does not.
+   (This is the bulk of the work that makes gem5 the bigger project —
+   see §9 effort estimate.)
+2. **Memory is in two address spaces.** Under SST, the SimX `Processor`
+   and any optional SST memHierarchy share the same simulator. Under
+   gem5, the host CPU's DRAM is a gem5 `AddrRange`, the Vortex VRAM is
+   a `RAM` inside the device, and the only way bytes cross between
+   them is via DMA through the device. The staging-buffer protocol
+   (§3.4) implements this; SST has no equivalent.
+3. **PIO bus integration.** SST's `StandardMem` interface is the
+   only one we plug into; gem5 has separate `PioPort` and `DmaPort`
+   with different timing models. The wrapper must manage both.
+
+---
+
+## 4. Phasing
+
+Each phase is independently shippable and validated. The work follows
+the same shape as the SST integration in
+[sst_simx_v3_proposal.md §4](sst_simx_v3_proposal.md): **environment
+first**, API + library second, gem5-side wiring third, host runtime
+fourth, CI last.
+
+### Phase 0 — gem5 environment + API survey *(derisking; nothing else can start until this is done)*
+
+The legacy `vortex_gem5` was built against a forked gem5 that no
+longer exists publicly. Before we design the C ABI in Phase 2 or
+write a single line of `DmaDevice` glue in Phase 3, we need a
+known-good gem5 build on the bench so the API surface we are about
+to commit to is **real**, not assumed-from-headers-we-haven't-read.
+This is the "solve gem5 setup first" phase.
+
+Concretely:
+
+- **Pick and pin the gem5 version.** Default target: v25.0.0.1
+  (patch release on top of v25.0.0, most recent stable as of 2026-05).
+  Pin the tag in `VERSION` alongside `TOOLCHAIN_REV` and `SST_VER`:
+  ```
+  GEM5_REV=v25.0.0.1
+  ```
+- **Write `ci/gem5_install.sh.in`** (no Vortex integration yet — just
+  the install). Mirrors the structure of `ci/sst_install.sh.in`:
+  - `apt install scons python3-dev python3-pip libprotobuf-dev
+    protobuf-compiler libprotoc-dev libgoogle-perftools-dev m4
+    libboost-all-dev gcc-aarch64-linux-gnu g++-aarch64-linux-gnu`
+    (gem5's documented build deps + ARM cross-toolchain for the ARM
+    matrix variant).
+  - Fetch gem5 working tree at `$GEM5_REV` into `$TOOLDIR/gem5`.
+  - `scons build/X86/gem5.opt -j$(nproc)` and
+    `scons build/ARM/gem5.opt -j$(nproc)` — **both ISAs by default**
+    per the dual-ISA decision in §3.6. Targets selectable via
+    `GEM5_TARGETS="X86"` / `"ARM"` / `"X86 ARM"`.
+  - Export `GEM5_HOME=$TOOLDIR/gem5` to `~/.bashrc`.
+- **Validate the X86 native compiler produces SE-mode binaries.**
+  Trivial: `gcc -static -o /tmp/hello-x86 sim/simx/gem5/hello.c`,
+  then run it under `build/X86/gem5.opt` with a config shaped like
+  `configs/example/gem5_library/arm-hello.py` (substituting
+  `ISA.X86`). Confirm exit code 0 and the expected stdout.
+- **Validate the ARM cross-toolchain produces SE-mode binaries.**
+  Cross-compile `hello.c` with `aarch64-linux-gnu-gcc -static -o
+  /tmp/hello-arm`, run under
+  `build/ARM/gem5.opt configs/example/gem5_library/arm-hello.py`
+  (or the deprecated SE script). Confirms the cross-toolchain
+  produces something gem5 ARM-mode can load.
+- **Read the gem5 source for the API surface we are about to use**
+  and record findings in a short scratch file
+  `sim/simx/gem5/gem5_api_notes.md` (not committed to docs/, just a
+  Phase 0 deliverable):
+  - `src/dev/io_device.hh` — `PioDevice::read`/`write` signatures
+    in v25.0.0. Compare to what the legacy paper assumed.
+  - `src/dev/dma_device.hh` — `DmaDevice::dmaAction`, `DmaPort`
+    timing model. Confirm 64-bit address support, async completion
+    callback shape.
+  - `src/python/m5/objects/Device.py` — SimObject Python bindings.
+    Confirm that out-of-tree `src/dev/<our-dir>/SConscript` is
+    picked up by `scons build/ARM/gem5.opt` (this is the install
+    mechanism we rely on in Phase 3).
+  - `configs/example/se.py` — how SE-mode wires a CPU to a
+    `Workload`. Confirm that we can attach a `PioDevice` and have
+    the SE-mode loader map its PIO range into the workload's address
+    space (the legacy paper's `0x20000000` magic). If this is no
+    longer supported, the design changes — better to know now than
+    in Phase 3.
+- **Smoke-build a trivial out-of-tree SimObject** to prove the
+  install mechanism end-to-end. Four files
+  (`Dummy.{cc,hh,py}` + `SConscript`) under `sim/simx/gem5/dummy/`,
+  installed by `sim/simx/gem5/install.sh` (Phase 0 only ships the
+  installer; the real SimObject lands in Phase 3). After
+  `ci/gem5_install.sh` re-runs, `gem5.opt --list-sim-objects` shows
+  `Dummy`. Delete `dummy/` once verified — it was scaffolding.
+
+**Validation:**
+- `ci/gem5_install.sh` finishes successfully on the self-hosted
+  runner. Wall time recorded in `gem5_api_notes.md` (drives CI
+  caching strategy in Phase 6).
+- `$GEM5_HOME/build/ARM/gem5.opt configs/example/se.py
+  --cmd /tmp/hello-arm` exits 0.
+- `gem5.opt --list-sim-objects` lists the dummy SimObject installed
+  via `sim/simx/gem5/install.sh`.
+- `gem5_api_notes.md` documents the `DmaDevice` / `PioDevice` /
+  `EventFunctionWrapper` signatures we will commit to in Phase 2's
+  C ABI design.
+
+**Why this is its own phase:** if any of those validations fails
+(e.g. gem5 v25 has dropped SE-mode PIO mapping, or the SimObject
+install mechanism has changed), the rest of the proposal needs
+redesign before code lands. Phase 0 is a ~1-day gate, not a tracked
+deliverable; everything downstream depends on its outputs.
+
+### Phase 1 — `Processor::cycle()` + `Memory*` accessor
+
+Prerequisite shared with SST. Can run in parallel with Phase 0
+(no gem5 dependency) and lands first into the SimX-side codebase.
+
+- Add `Processor::cycle()` and `Memory* Processor::memsim()` as in
+  §3.2. This is a ~50-line patch to `processor.{cpp,h}` and
+  `processor_impl.h` plus an `is_cycle_initialized_` bool.
+- Add `Memory::set_pre_send_hook()` (already in v3 per
+  `sim/simx/mem/memory.h:42` — verify still there; if so, this part
+  of Phase 1 is a no-op).
+- Update SST's `vortex_simulator.cpp` to use the new public
+  `Processor::cycle()` API (currently calls `proc_->cycle()` which
+  does not compile against `processor.h` HEAD — see
+  `sim/simx/sst/vortex_simulator.cpp:64`). **This is a pre-existing
+  bug that Phase 1 fixes for both integrations.**
+
+**Validation:** `make -C sim/simx` (default), `make -C sim/simx
+USE_SST=1`, and `make -C sim/simx USE_GEM5=1` all build. SST tests
+that previously failed to link now link and run (`sst
+ci/sst_test_vortex_hello.py` passes).
+
+### Phase 2 — `libvortex-gem5.so` + C ABI
+
+**Prerequisite: Phase 0 complete.** The C ABI is designed *against*
+the `DmaDevice`/`PioDevice` shapes recorded in
+`gem5_api_notes.md`, not from headers we haven't read.
+
+- Create `sim/simx/gem5/vortex_gpgpu.{cpp,h}` mirroring
+  `sim/simx/sst/vortex_simulator.{cpp,h}` shape:
+  - Owns a `Processor` and a `RAM` (the device VRAM, paged at `MEM_PAGE_SIZE`).
+  - Exposes a C ABI (`vortex_gem5_*`) sufficient for the gem5 device
+    to MMIO/DMA/tick it. ABI signatures match what gem5's
+    `DmaDevice::dmaAction` and `PioDevice::read`/`write` need to
+    call into (per Phase 0 survey).
+- Add `USE_GEM5=1` build target to `sim/simx/Makefile` producing
+  `libvortex-gem5.so` (no SST symbols; no `sst-core` link). Pattern:
+  duplicate the `ifeq ($(USE_SST),1)` block.
+- Add a tiny in-process smoke driver
+  `sim/simx/gem5/gem5_smoke_main.cpp` (built with the lib) that:
+  1. Loads a `.vxbin` via the C ABI.
+  2. Ticks until `cycle()` returns false.
+  3. Reads the MPM exit code via DCR_READ.
+
+  This is the "library compiles and a kernel runs through it without
+  gem5 installed" smoke test (§6.2).
+
+**Validation:**
+- `make -C sim/simx USE_GEM5=1` builds.
+- `LD_LIBRARY_PATH=. ./gem5_smoke hello.vxbin` returns 0.
+- `make -C sim/simx` (no flag) still builds and `./simx hello.vxbin`
+  returns 0 (no regression on default).
+
+### Phase 3 — gem5 SimObject + Python config
+
+**Prerequisite: Phases 0 + 2 complete.** The install mechanism is
+already proven by Phase 0's dummy SimObject; this phase replaces
+the dummy with the real device.
+
+- `sim/simx/gem5/vortex_gpgpu_dev.{cc,hh}` — the gem5 `DmaDevice`
+  subclass. PIO `read`/`write` decode MMIO offsets and call
+  `vortex_gem5_mmio_*`. DMA actions triggered by `CMD_MEM_*`. A
+  registered `EventFunctionWrapper` re-schedules itself every
+  `clock_period_ticks()` and calls `vortex_gem5_tick()`.
+- `sim/simx/gem5/VortexGPGPU.py` — Python SimObject definition.
+- `sim/simx/gem5/SConscript` — for gem5's scons build.
+- `sim/simx/gem5/install.sh` — copies the four files above into
+  `<gem5>/src/dev/vortex/`. (Phase 0 already wrote this for the
+  dummy SimObject; just extend it.)
+- Update `ci/gem5_install.sh.in` to re-run `install.sh` and rebuild
+  `build/{X86,ARM}/gem5.opt` after the Vortex SimObject lands.
+
+**Validation:** `ci/gem5_install.sh` succeeds with the real
+SimObject installed. `gem5.opt --list-sim-objects` shows
+`VortexGPGPU`. `gem5.opt configs/example/se.py --help` accepts
+`VortexGPGPU` parameters.
+
+### Phase 4 — Host runtime (`sw/runtime/gem5/`, x86 + ARM)
+
+- New backend mirroring `sw/runtime/opae/` shape:
+  - `vortex.cpp` — implements the `vx_*` callbacks against the OPAE
+    MMIO protocol (§3.3), but the `driver.{cpp,h}` underneath does
+    raw `mmap`/MMIO writes to the PIO address rather than calling
+    `libopae`.
+  - `Makefile` — selects compiler from `HOST_ARCH`:
+    - `x86_64` (default): native `g++`
+    - `aarch64`: `aarch64-linux-gnu-g++`
+    - `armhf`: `arm-linux-gnueabihf-g++`
+- Cache-flush integration (B9): the v3 `download` path issues
+  `CMD_DCR_READ(VX_DCR_BASE_CACHE_FLUSH, cid)` per core before
+  `CMD_MEM_READ`.
+- MMIO ordering fence (B14): emit the right barrier for `HOST_ARCH`:
+  - `x86_64`: `__asm__ volatile ("mfence" ::: "memory")`
+  - `aarch64`: `__asm__ volatile ("dmb sy" ::: "memory")`
+  - `armhf`: `__asm__ volatile ("dmb sy" ::: "memory")`
+  Provide a `vortex_gem5_mmio_fence()` inline helper that compiles
+  to the right barrier per `HOST_ARCH`.
+- Multi-target build (B13 obsolete; replaced by clean multi-target
+  support): `HOST_ARCH` make variable.
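+
+One way the fence helper could look is sketched below. The helper
+name `vortex_gem5_mmio_fence()` comes from this proposal; its body is
+an assumption. It selects on compiler-predefined macros rather than
+on `HOST_ARCH` directly (the cross compiler defines them for the
+target), and `post_command` with its two-register layout is
+hypothetical.
+
+```cpp
+#include <atomic>
+#include <cassert>
+#include <cstdint>
+
+// Full barrier between the argument writes and the CMD_TYPE write.
+static inline void vortex_gem5_mmio_fence() {
+#if defined(__x86_64__)
+  __asm__ volatile("mfence" ::: "memory");
+#elif defined(__aarch64__) || defined(__arm__)
+  __asm__ volatile("dmb sy" ::: "memory");
+#else
+  // Portable fallback for hosts outside the three supported targets.
+  std::atomic_thread_fence(std::memory_order_seq_cst);
+#endif
+}
+
+// Hypothetical two-register window: regs[0] is the command type,
+// regs[1] the single argument. volatile keeps the compiler from
+// merging or dropping the stores; the fence orders them for the CPU.
+inline void post_command(volatile uint64_t* regs, uint64_t arg0,
+                         uint64_t cmd) {
+  regs[1] = arg0;
+  vortex_gem5_mmio_fence();
+  regs[0] = cmd;
+}
+```
+
+In the runtime the `regs` pointer would be the `mmap`'d PIO window; a
+local array is enough to exercise the helper on a development host.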
+
+**Validation:**
+- `make -C sw/runtime/gem5` (default `HOST_ARCH=x86_64`) builds.
+  `file build/sw/runtime/libvortex-gem5-x86_64.so` confirms x86-64
+  ELF.
+- `make -C sw/runtime/gem5 HOST_ARCH=aarch64` builds (requires
+  cross-toolchain, installed by Phase 0's `ci/gem5_install.sh`).
+  `file build/sw/runtime/libvortex-gem5-aarch64.so` confirms
+  AArch64 ELF.
+
+### Phase 5 — End-to-end gem5 test
+
+- `ci/gem5_test_vortex_hello.py` — gem5 Python config that wires:
+  - A `System` with one `TimingSimpleCPU` core in SE mode (host ISA
+    selected at runtime via `--host-arch=x86|arm`).
+  - A `VortexGPGPU` device on `system.membus` at
+    `pio_addr=0x20000000`, mapped into the process's address space.
+  - The native-or-cross-compiled test binary
+    (`tests/kernel/hello/hello` re-linked against the matching
+    `libvortex-gem5-{x86_64,aarch64}.so`) as the SE-mode workload.
+- `ci/gem5_test_vortex_vecadd.py` — same with a vecadd kernel that
+  actually exercises DMA in both directions and the cache-flush path.
+- Add a top-level wrapper test in `tests/regression/gem5/` (mirrors
+  `tests/regression/dxa/`) that builds the kernels and invokes the
+  Python scripts for both `HOST_ARCH=x86_64` and `HOST_ARCH=aarch64`.
+
+**Validation:**
+- `build/X86/gem5.opt ci/gem5_test_vortex_hello.py --host-arch=x86`
+  exits with code 0 and the expected `Hello World` on stdout.
+- `build/ARM/gem5.opt ci/gem5_test_vortex_hello.py --host-arch=arm`
+  exits with code 0 and the expected `Hello World` on stdout.
+- Both `ci/gem5_test_vortex_vecadd.py` variants exit 0 with the
+  vecadd result buffer matching the CPU-computed reference (checked
+  by the test binary itself).
+
+### Phase 6 — CI integration
+
+- Add `gem5()` function to `ci/regression.sh.in` (mirroring `sst()`
+  on line ~80):
+  ```bash
+  gem5()
+  {
+      echo "begin gem5 tests..."
+
+      make -C sim/simx USE_GEM5=1
+      make -C tests/kernel
+
+      # X86 default: native compile, no cross-toolchain needed.
+      make -C sw/runtime/gem5 HOST_ARCH=x86_64
+      cp sim/simx/libvortex-gem5.so "$GEM5_HOME/build/X86/"
+
+      timeout 120 "$GEM5_HOME/build/X86/gem5.opt" \
+          ci/gem5_test_vortex_hello.py  --host-arch=x86
+      timeout 120 "$GEM5_HOME/build/X86/gem5.opt" \
+          ci/gem5_test_vortex_vecadd.py --host-arch=x86
+
+      # ARM matrix entry: requires gcc-aarch64-linux-gnu (installed
+      # by ci/gem5_install.sh in Phase 0).
+      if [ -n "$VORTEX_GEM5_ARM" ]; then
+          make -C sw/runtime/gem5 HOST_ARCH=aarch64
+          cp sim/simx/libvortex-gem5.so "$GEM5_HOME/build/ARM/"
+
+          timeout 120 "$GEM5_HOME/build/ARM/gem5.opt" \
+              ci/gem5_test_vortex_hello.py  --host-arch=arm
+          timeout 120 "$GEM5_HOME/build/ARM/gem5.opt" \
+              ci/gem5_test_vortex_vecadd.py --host-arch=arm
+      fi
+
+      echo "gem5 tests done!"
+  }
+  ```
+  Per `feedback_test_timeout_120s.md`, every test invocation is
+  `timeout 120`-capped. ARM is opt-in via `VORTEX_GEM5_ARM=1` so
+  hosted CI without the ARM toolchain still passes; the self-hosted
+  runner sets the env var.
+- Add `gem5-x86` and `gem5-arm` matrix entries to
+  `.github/workflows/ci.yml` (both run on the self-hosted runner
+  only, per
+  [`project_ci_machine.md`](../../../../.claude/projects/-home-blaisetine-dev/memory/project_ci_machine.md);
+  the hosted runners do not have enough resources for a full
+  gem5 build).
+- Adding `ci/gem5_install.sh` to the Apptainer recipe
+  ([`miscs/apptainer/vortex.def`](../../miscs/apptainer/vortex.def))
+  so the .sif has gem5 pre-installed is **out of scope for Phase 6;
+  see §8.**
+
+**Validation:** `./ci/regression.sh --gem5` runs both
+`gem5_test_vortex_*.py` cleanly on the self-hosted runner.
+
+### Phase 7 — Documentation
+
+- `docs/gem5_integration.md`:
+  - How to install gem5 v25.0.0.1 (point at `ci/gem5_install.sh`).
+  - How to build with `USE_GEM5=1`.
+  - How to cross-compile the ARM runtime + kernels.
+  - How to write a gem5 Python script that drives `VortexGPGPU`.
+  - The single-source-of-truth invariant (§1.1) and the cache-flush
+    contract (§3.4) for future hackers who might be tempted to skip
+    the flush "because it's fast".
+
+---
+
+## 5. Authorship / history mechanics
+
+- `sim/simx/gem5/vortex_gpgpu.{cpp,h}` and the gem5-side
+  `vortex_gpgpu_dev.{cc,hh}` + `VortexGPGPU.py`: **new files**, no
+  upstream equivalent. Commit body cites:
+  > Replaces legacy `vortex_gem5/sim/simx/simx_device.{cpp,h}`
+  > (Injae Shin, UCLA 2025-05-22 commit 91dcf17) and the gem5-side
+  > SimObject described in his capstone report.
+  > Re-implemented for SimX v3 Processor::cycle() API. Original
+  > design intent (OPAE MMIO + pinned staging buffer + ARM SE-mode
+  > runtime) preserved.
+
+- `sw/runtime/gem5/`: **new files** mirroring `sw/runtime/opae/`'s
+  shape. Same authorship attribution as above; the file-level
+  similarity is to `sw/runtime/opae`, not to `runtime/opaesimx` from
+  the legacy tree (which has the bugs catalogued in §2.2).
+
+- `ci/gem5_install.sh.in` and `ci/gem5_test_vortex_*.py`: new files;
+  follow the structure of `ci/sst_install.sh.in` and
+  `ci/sst_test_vortex_*.py`. `ci/gem5_install.sh.in` lands in
+  Phase 0 (initially installing the dummy SimObject); the test
+  scripts land in Phase 5.
+
+- `Processor::cycle()` / `Processor::memsim()`: new public API on
+  `Processor`, lands in Phase 1. Single commit on the simx_v3 line;
+  mentioned as a prerequisite of both SST and gem5 integrations in
+  the commit body.
+
+- `sim/simx/gem5/gem5_api_notes.md`: Phase 0 deliverable, scratch
+  notes only — **not** committed to `docs/`. Captures the gem5
+  v25.0.0 API surface our C ABI design depends on; deleted once
+  Phase 2 commits the C ABI itself.
+
+This is consistent with the rule established in
+[`feedback_keep_ours_in_merge.md`](../../../../.claude/projects/-home-blaisetine-dev/memory/feedback_keep_ours_in_merge.md):
+the legacy code is not a "theirs" we apply; it is a prior design that
+informs our redesign. Credit the designer in the body; do not pretend
+the bits are a port.
+
+---
+
+## 6. Validation
+
+Each phase ends with the validation listed in §4. Across phases the
+acceptance criteria are:
+
+1. **No-gem5 build identical.** `make -C sim/simx` (default flags)
+   produces a binary identical in behavior to today's on the
+   regression suite (io_addr, arith, vecadd, mpi_vecadd, tensor*,
+   dxa, dtm). The Phase 1 `Processor::cycle()` addition must not
+   change `Processor::run()` semantics; verify by trace-diffing
+   `vecadd` before and after Phase 1.
+
+2. **In-process smoke (no gem5 needed).** `gem5_smoke hello.vxbin`,
+   the Phase 2 driver, runs the same kernels the `simx` binary runs
+   and produces matching output. This is the unit-test layer that
+   shakes out C-ABI breakage without requiring gem5 to be installed
+   beyond what Phase 0 already set up.
+
+3. **End-to-end gem5 PASS.** Both `gem5_test_vortex_hello.py` and
+   `gem5_test_vortex_vecadd.py` exit 0 under the pinned gem5 v25.0.0.1,
+   on *both* `build/X86/gem5.opt` and `build/ARM/gem5.opt`, each run
+   capped at 120 s by `timeout`. The pin and the install path are both already
+   validated by Phase 0; this validation just exercises the real
+   `VortexGPGPU` SimObject end-to-end.
+
+4. **No `core->mem_read` / `core->mem_write` regressions.** Phase 5
+   of v3 forbids those
+   ([simx_v3_proposal.md §3.3](simx_v3_proposal.md)). The grep gate
+   from
+   [master_merge_v3_proposal.md §8 R1](master_merge_v3_proposal.md)
+   applies here: for every commit,
+   `git diff <pre>..<post> -- sim/simx/ | grep -cE 'core->mem_(read|write)'` must print `0`.
+
+5. **Single source of truth check.** The gem5 device's pinned region
+   is `RAM`-backed (i.e., a slice of host memory exposed to gem5's
+   DRAM AddrRange via `mmap`); Vortex's VRAM is the `RAM` attached to
+   `Memory` inside `vortex::Processor`. **There is no shadow image.**
+   `vortex_gem5_dma_{read,write}` copies bytes between the two via
+   `RAM::read`/`RAM::write` — no additional buffer level. Mistakes
+   here re-introduce the §1.1 violation.
+
+---
+
+## 7. Risks
+
+| # | Risk | Mitigation |
+|---|---|---|
+| R1 | gem5 v25.0.0 `DmaDevice` API drifts in v26+. | Pin in `ci/gem5_install.sh.in` (Phase 0). Document the pin in `docs/gem5_integration.md`. CI catches regressions on bump. |
+| R2 | ARM cross-compiler not available in the Apptainer recipe. | Phase 6 says gem5 CI is on the self-hosted runner only, which already has the ARM toolchain per [`project_ci_machine.md`](../../../../.claude/projects/-home-blaisetine-dev/memory/project_ci_machine.md). Apptainer absorption is out of scope (§8). |
+| R3 | `MMIO_PINNED_BASE` PIO range collides with another gem5 device's PIO range. | Pick a default (`0x20000000`, matching the legacy paper) but make it a Python-configurable parameter (`pio_addr`). Phase 0 confirms the default is reachable from SE-mode in v25.0.0; document collisions in the integration guide. |
+| R4 | The gem5 ARM CPU model reorders MMIO writes, breaking the args-then-CMD_TYPE protocol (B14). | `DMB SY` (AArch64) or `dmb sy` (ARMv7) before `CMD_TYPE` write in the runtime. Add a regression test that issues a back-to-back `CMD_MEM_WRITE` + `CMD_RUN` and verifies the kernel observed the correct args. |
+| R5 | Future contributor re-introduces the host-pointer-MMIO hack (B3) "for convenience". | This proposal explicitly deletes that abstraction (§2.4). The follow-up `docs/gem5_integration.md` (Phase 7) should call this out. |
+| R6 | `Processor::cycle()` for a never-launched kernel hangs (no `kmu_->start()` because `is_cycle_initialized_` was never reset). | Reset is implicit on first `cycle()`. If a second kernel is launched in the same device lifetime (rare; supported by gem5 only for back-to-back tests), the gem5 device's `CMD_RUN` handler must call a new `Processor::reset_cycle()` that clears `is_cycle_initialized_`. Add this in Phase 2. |
+| R7 | The cross-compiled ARM `libvortex-gem5.so` and the gem5-loaded `libvortex-gem5.so` (x86) have the same SONAME and get confused at install time. | Suffix the ARM build (`libvortex-gem5-aarch64.so`) and leave the gem5-loaded build unsuffixed (`libvortex-gem5.so`). Document in Phases 2 and 4. |
+| R8 | gem5's `DmaPort` request size is unbounded; a 1 GB `CMD_MEM_WRITE` would burn simulated time. | Cap per-transaction size at 1 MB in the device's `CMD_MEM_*` handler; chunk larger requests into multiple DMA actions. Mirrors how the OPAE `fpgaPrepareBuffer` page-aligns transfers. |
+| R9 | Cache flush via `CMD_DCR_READ` returns synchronously per core; for `NUM_CORES * NUM_CLUSTERS = 16` that is 16 PIO round-trips per download. | Acceptable for Phase 5; can be batched into a single `CMD_FLUSH_ALL` MMIO later if measured to hurt. |
+| R10 | The gem5 SimObject install (`sim/simx/gem5/install.sh`) modifies the gem5 source tree in place; rebuilds can leave stale artifacts. | `install.sh` is idempotent (copies, doesn't patch); `ci/gem5_install.sh` does a clean `scons -c` before re-build on toolchain version mismatch. Phase 0 proves the install path end-to-end with a dummy SimObject before any real code depends on it. |
+| R11 | Phase 0 reveals gem5 v25.0.0 has dropped SE-mode PIO mapping (the legacy `0x20000000` magic). | Switch design to the `/dev/vortex_gem5` pseudo-file path (§3.6 option 2) before Phase 2 commits the C ABI. Cost: ~1 week added to Phase 0 redesign window. Acceptable because Phase 0 is explicitly a gate — no downstream phase has shipped code yet. |
+| R12 | Phase 0 install takes hours on first run; blocks parallel work. | Cache the `$TOOLDIR/gem5-src/build` directory in CI the same way SST and toolchain caches work. Self-hosted runner's local toolchain dir survives across runs. |
+
+---
+
+## 8. Out of scope
+
+- **Apptainer integration.** Adding gem5 + the ARM cross-toolchain
+  to `miscs/apptainer/vortex.def` is a separate concern. Until that
+  is done, `apptainer-ci.yml`'s matrix should not include `gem5`. The
+  self-hosted runner runs the gem5 matrix entry from the regular `ci.yml`; the
+  Apptainer pipeline skips it. See
+  [`apptainer-ci.yml` policy notes](../../.github/workflows/apptainer-ci.yml).
+
+- **Full-system Linux on gem5.** The capstone paper restricts itself
+  to SE-mode (per the paper's §IIC: "gem5-Vortex's implementation
+  allows users to use gem5's system call emulation (SE) mode"). This
+  proposal does the same. FS-mode requires booting a Linux kernel
+  inside gem5 with a Vortex device driver — possible, but a separate
+  redesign that intersects with kernel-mode driver work the project
+  has not started.
+
+- **Multi-device simulation.** One `VortexGPGPU` per gem5 system.
+  Multi-device support requires per-instance PIO ranges and a runtime
+  side that supports `vx_dev_open` returning >1 handle — the legacy
+  `g_callbacks` global (B11) blocks this on the runtime side, and
+  the device side needs per-instance state isolation. Defer.
+
+- **AMD GPU / NoMali comparison.** The capstone paper compares
+  gem5-Vortex to NoMali (stub GPU) and AMD GPU (full-system). Those
+  comparisons live in the paper; reproducing them as benchmarks is
+  out of scope. Comparing performance to SimX standalone or to the
+  SST integration is also out of scope — separate analysis work.
+
+- **DMA performance modeling.** The capstone paper §V measures DMA
+  delay variation per kernel size. Replicating that as a CI
+  performance gate is out of scope; could be a follow-up perf
+  proposal once the integration is stable.
+
+- **SST + gem5 simultaneous.** Both integrations replace different
+  parts of the harness; running them together is not a use case
+  anyone has asked for. Build flags are mutually exclusive:
+  setting both `USE_SST=1` and `USE_GEM5=1` is rejected by `sim/simx/Makefile`.
+
+- **gem5 fork branch.** We do not maintain a long-lived fork of gem5.
+  `ci/gem5_install.sh` fetches a clean release tarball and applies
+  our SimObject; if the user wants a persistent gem5 working tree,
+  that is their setup. Avoids the "fork rot" that froze
+  `vortex_gem5`.
+
+- **Runtime gem5/non-gem5 switching.** Keep `USE_GEM5=1` as a
+  build-time switch. A runtime switch would require both `Processor`
+  and a gem5 wrapper in every binary plus a factory; not worth the
+  maintenance cost for a single-device research integration.
+
+---
+
+## 9. Estimated effort
+
+Based on the SST integration in
+[sst_simx_v3_proposal.md §9](sst_simx_v3_proposal.md) (~15–28 h):
+
+- **Phase 0** (gem5 env + API survey + dummy SimObject install):
+  **6–10 h estimated; ✅ COMPLETE 2026-05-16** in ~3 h of
+  attended work + ~25 min of unattended scons build. The wall time to
+  install gem5 was 13 min (ARM) + 11 min (X86) parallel on the
+  self-hosted 64-core runner. All six validations
+  (see `sim/simx/gem5/gem5_api_notes.md`) pass on both ISAs.
+  Key discoveries committed: (1) SE-mode PIO attachment is
+  possible but requires bypassing the `SimpleBoard` high-level
+  API; (2) out-of-tree SimObject install needs **no** top-level
+  SConstruct patch — pure `cp -r`; (3) PCIe (Path 2 in §3.6) is
+  a clean Phase 5+ upgrade because `PciDevice` inherits
+  `DmaDevice` and shares the same C ABI surface.
+- **Phase 1** (`Processor::cycle()` + `memsim()`): **1–2 h estimated;
+  ✅ COMPLETE 2026-05-16** in ~1 h. ~50-line patch to
+  `processor.{cpp,h}` + `processor_impl.h`. Default `make -C
+  sim/simx` and `USE_SST=1` both build clean; `simx hello.vxbin`
+  prints `#0: Hello World!`. **Bonus:** the SST integration was
+  previously broken at the `proc_->cycle()` call site
+  (`sim/simx/sst/vortex_simulator.cpp:64`) and would not link; with
+  Phase 1 in place, `sst ci/sst_test_vortex_hello.py` runs
+  end-to-end and exits cleanly at 4.643 µs simulated time.
+- **Phase 2** (`libvortex-gem5.so` + C ABI + in-process smoke):
+  **4–6 h estimated; ✅ COMPLETE 2026-05-16** in ~1.5 h. Files added:
+  `sim/simx/gem5/vortex_gpgpu.{h,cpp}` (the C ABI library) and
+  `sim/simx/gem5/gem5_smoke_main.cpp` (the in-process smoke driver).
+  `sim/simx/Makefile` extended with a `USE_GEM5=1` gate that
+  produces `libvortex-gem5.so` (1.5 MB) + `gem5_smoke` (16 KB
+  driver linking against the lib). `gem5_smoke hello.vxbin` →
+  `#0: Hello World!`, 4642 cycles, exit_code=0 (correctly read back
+  via `vortex_gem5_vram_read` after the cache-flush DCR path —
+  validating B9 from §2.2 is fixed). Default `make -C sim/simx`
+  unchanged (only `simx` produced; gem5 sources fully gated).
+  `USE_SST=1 USE_GEM5=1` correctly rejected by the Makefile per
+  §8 (mutual exclusion). Side fix: `sw/common/bitmanip.h` was
+  missing `<type_traits>` and `<algorithm>` includes — header
+  hygiene fix benefits any caller (per
+  [feedback_always_correct_fix_not_patch](../../../../.claude/projects/-home-blaisetine-dev/memory/feedback_always_correct_fix_not_patch.md)).
+- **Phase 3** (gem5 SimObject + Python + install.sh): **6–10 h
+  estimated; ✅ COMPLETE 2026-05-16** in ~1.5 h. Files added:
+  `sim/simx/gem5/vortex_gpgpu_dev.{cc,hh}` (gem5 `DmaDevice` subclass
+  with `dlopen` + `EventFunctionWrapper` tick scheduling),
+  `sim/simx/gem5/VortexGPGPU.py` (Python binding with `library=` +
+  `kernel=` parameters), `sim/simx/gem5/SConscript`. Updated
+  `install.sh` to install the real device and remove the Phase 0
+  dummy scaffolding from `$GEM5_HOME` cleanly. New test:
+  `ci/gem5_test_vortex_hello.py` (standalone-device variant, no
+  host CPU needed). Validation: both `build/X86/gem5.opt` and
+  `build/ARM/gem5.opt` import `VortexGPGPU` and run hello.vxbin to
+  completion at tick 4,643,000 (1 GHz clock → 4643 cycles, matching
+  Phase 1 SST + Phase 2 in-process within 1 cycle). **Three
+  harnesses now validated through the same `Processor::cycle()` API:
+  SST, in-process C ABI, and gem5 SimObject.**
+- **Phase 4** (host runtime, x86 + ARM): **6–10 h estimated; ✅ x86
+  PATH COMPLETE 2026-05-16** in ~1 h; aarch64 cross-build gated on
+  the user's `sudo apt install gcc-aarch64-linux-gnu`. Files added:
+  `sw/runtime/gem5/driver.{cpp,h}` (direct MMIO + mmio_fence helper
+  with per-arch barrier; bump-allocator for the pinned region),
+  `sw/runtime/gem5/vortex.cpp` (OPAE-shaped `vx_device` with the
+  full callback table — compile-time caps from VX_config.h since
+  the host runtime and the device library are built from the same
+  source tree), `sw/runtime/gem5/Makefile` (HOST_ARCH ∈
+  {x86_64,aarch64,armhf} → matching cross-compiler; produces
+  `libvortex-gem5-$ARCH.so`). All three B-bugs addressed: B9 (cache
+  flush before download via per-core `dcr_read(VX_DCR_BASE_CACHE_FLUSH,
+  cid)`), B13 (per-arch compiler via `HOST_ARCH`), B14 (mmio_fence()
+  centralised in `issue_cmd()` so every CMD_TYPE write is fenced
+  by construction). Validation: `make -C sw/runtime/gem5 HOST_ARCH=x86_64`
+  → `libvortex-gem5-x86_64.so` (43 KB, ELF 64-bit x86-64, SONAME
+  correct, exports `vx_dev_init` matching the OPAE/SimX backend
+  pattern).
+- **Phase 5** (end-to-end gem5 tests): **4–6 h estimated; ✅ x86
+  PATH COMPLETE 2026-05-17** in ~3 h. The bulk of the work turned
+  out to be the OPAE state machine on the device side (cmd_args
+  latching, busy bit, dcr_rsp register) plus the dmaAction
+  dispatch in the SimObject — the test scripts themselves were
+  small. Files added:
+  `ci/gem5_test_vortex_vecadd.py` (full e2e: x86 CPU + identity-mapped
+  PIO+PIN regions + Process.map() + Vortex device). The Phase 3
+  standalone `ci/gem5_test_vortex_hello.py` continues to pass as a
+  fast smoke test. Phase 5 also extended Phase 2's
+  `sim/simx/gem5/vortex_gpgpu.{cpp,h}` with the full OPAE protocol
+  state machine and Phase 3's `sim/simx/gem5/vortex_gpgpu_dev.cc`
+  with `pop_pending_cmd` → `dmaRead`/`dmaWrite` dispatch.
+  Validation: `vecadd -n16` PASSED!, kernel ran 454 cycles at
+  IPC 0.247 on 4×4 threads/warps. Side fix: glibc's `nanosleep()`
+  routes through the `clock_nanosleep` syscall (#230 on x86-64), which
+  gem5 SE-mode doesn't implement; the host runtime's poll-loop back-off
+  now uses `sched_yield()`, which is in gem5's syscall table. ARM e2e gated on
+  user `sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu`
+  (same gate as Phase 4's aarch64 build).
+- **Phase 6** (CI): **2–3 h estimated; ✅ COMPLETE 2026-05-17** in
+  ~30 min. Added `gem5()` function to `ci/regression.sh.in`
+  (mirrors `sst()` shape; builds prerequisites + runs both Phase 3
+  standalone and Phase 5 e2e tests via `timeout 120` per
+  [feedback_test_timeout_120s](../../../../.claude/projects/-home-blaisetine-dev/memory/feedback_test_timeout_120s.md);
+  ARM matrix opt-in via `VORTEX_GEM5_ARM=1`). Added `--gem5` case
+  dispatch + `--gem5` to the show_usage line. Updated
+  `.github/workflows/ci.yml`: appended `ci/gem5_install.sh` to the
+  `Setup Toolchain` step (gated on `cache-toolchain.outputs.cache-hit`
+  like SST), added `Export gem5 paths` step (GEM5_HOME + PATH for
+  `build/X86`), added `gem5` to the `tests.matrix.name` list with
+  `exclude: name=gem5 xlen=64` (the device library is XLEN-locked
+  by the gem5 install; one entry is sufficient). Validation:
+  `./ci/regression.sh --gem5` PASSED end-to-end in **5 seconds**
+  (Phase 3 hello standalone + Phase 5 vecadd e2e, both clean).
+- **Phase 7** (docs): **1–2 h estimated; ✅ COMPLETE 2026-05-17** in
+  ~45 min. Added `docs/gem5_integration.md` covering: install
+  (`ci/gem5_install.sh`), Vortex+gem5 build (`USE_GEM5=1`), host
+  runtime cross-compile (`HOST_ARCH`), running tests
+  (`./ci/regression.sh --gem5` and standalone hand commands),
+  a complete minimal Python recipe for hosting Vortex in a custom
+  gem5 system, **six load-bearing invariants** (Process.map order,
+  identity-mapped PIO+PIN, cache flush before download, MMIO
+  fence, single source of truth for memory, USE_SST/GEM5 mutex),
+  architectural choices worth revisiting (doorbells vs. polling,
+  PCIe upgrade path, C ABI rationale), CI integration, and a
+  troubleshooting table covering the most common error modes
+  (wrong library path, missing LD_LIBRARY_PATH, clock_nanosleep
+  syscall, orphan Process, wrong `library=` param, busy-bit hang,
+  ccache stale objects). Added to `docs/index.md`.
+
+Total: **~30–49 hours** of focused work (was ~26–41 h before Phase 0
+was added as a separate phase; the actual work has not grown — the
+gem5 install was implicit in the old Phase 2 estimate and is now
+explicit in Phase 0). Substantial enough to warrant its own branch
+(`gem5_simx_v3` or similar).
+
+**Sequencing with SST:** Phase 1 (`Processor::cycle()`) is shared;
+do it once and both integrations benefit. If SST lands first, gem5
+reuses `Processor::cycle()` unchanged. If gem5 lands first, the SST
+integration's broken `proc_->cycle()` reference
+(`sim/simx/sst/vortex_simulator.cpp:64`) gets fixed as a side effect
+of Phase 1 — net win for both. Phase 0 is gem5-only; SST integration
+does not benefit from it.
diff --git a/sim/simx/Makefile b/sim/simx/Makefile
index 059484effa..593581cdf1 100644
--- a/sim/simx/Makefile
+++ b/sim/simx/Makefile
@@ -2,8 +2,17 @@ include ../common.mk
 
 DESTDIR ?= $(CURDIR)
 USE_SST ?= 0
+USE_GEM5 ?= 0
 #SST_PKG ?= SST-14.1 # default SST package name
 
+# USE_SST and USE_GEM5 are mutually exclusive — different external
+# simulator wrappers with different LDFLAGS; building both into one
+# binary makes no sense and the proposal docs/proposals/gem5_simx_v3_proposal.md
+# §8 calls this out explicitly.
+ifeq ($(USE_SST)$(USE_GEM5),11)
+$(error USE_SST=1 and USE_GEM5=1 are mutually exclusive)
+endif
+
 OBJ_DIR = $(DESTDIR)/obj
 CONFIG_FILE = $(DESTDIR)/simx_config.stamp
 SRC_DIR = $(VORTEX_HOME)/sim/simx
@@ -96,6 +105,15 @@ ifeq ($(USE_SST),1)
 	SRCS     += $(SRC_DIR)/sst/vortex_simulator.cpp $(SRC_DIR)/sst/vortex_gpgpu.cpp
 endif
 
+# gem5 integration: build libvortex-gem5.so (the C ABI library loaded
+# by the gem5 VortexGPGPU SimObject) plus gem5_smoke (an in-process
+# smoke driver that exercises the library without needing gem5
+# installed). The gem5 wrapper source is kept out of the default SRCS
+# list and pulled into VORTEX_GEM5_SRCS so the default simx binary
+# does not carry it.
+VORTEX_GEM5_SRCS := $(SRC_DIR)/gem5/vortex_gpgpu.cpp
+GEM5_SMOKE_SRC   := $(SRC_DIR)/gem5/gem5_smoke_main.cpp
+
 # Debugging
 ifdef DEBUG
 	CXXFLAGS += -g -O0 -DDEBUG_LEVEL=$(DEBUG)
@@ -128,17 +146,27 @@ VORTEX_SST_OBJS := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(VORTEX_SST_SRCS)
 DEPS += $(VORTEX_SST_OBJS:.o=.d)
 endif
 
+ifeq ($(USE_GEM5), 1)
+VORTEX_GEM5_OBJS := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(VORTEX_GEM5_SRCS))
+GEM5_SMOKE_OBJ   := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(GEM5_SMOKE_SRC))
+DEPS             += $(VORTEX_GEM5_OBJS:.o=.d) $(GEM5_SMOKE_OBJ:.o=.d)
+endif
+
 
 # optional: pipe through ccache if you have it
 CXX := $(if $(shell which ccache),ccache $(CXX),$(CXX))
 
 PROJECT := simx
 VORTEX_LIB := libvortex.so
+VORTEX_GEM5_LIB := libvortex-gem5.so
+GEM5_SMOKE := gem5_smoke
 
-.PHONY: all force clean clean-lib clean-exe clean-obj libvortex clean-libvortex
+.PHONY: all force clean clean-lib clean-exe clean-obj libvortex clean-libvortex libvortex-gem5 clean-libvortex-gem5 gem5-smoke clean-gem5-smoke
 
 ifeq ($(USE_SST), 1)
 all: $(DESTDIR)/$(PROJECT) $(DESTDIR)/$(VORTEX_LIB)
+else ifeq ($(USE_GEM5), 1)
+all: $(DESTDIR)/$(PROJECT) $(DESTDIR)/$(VORTEX_GEM5_LIB) $(DESTDIR)/$(GEM5_SMOKE)
 else
 all: $(DESTDIR)/$(PROJECT)
 endif
@@ -186,6 +214,21 @@ $(DESTDIR)/$(VORTEX_LIB): $(OBJS) $(VORTEX_SST_OBJS)
 	-shared -o $@ \
 	$(LDFLAGS) $(SST_LFLAGS)
 
+# Vortex gem5 device shared library — the gem5 SimObject dlopens this
+# and calls the C ABI declared in sim/simx/gem5/vortex_gpgpu.h.
+libvortex-gem5: $(DESTDIR)/$(VORTEX_GEM5_LIB)
+
+$(DESTDIR)/$(VORTEX_GEM5_LIB): $(OBJS) $(VORTEX_GEM5_OBJS)
+	$(CXX) $(CXXFLAGS) $^ -shared $(LDFLAGS) -Wl,-soname,$(VORTEX_GEM5_LIB) -o $@
+
+# In-process smoke driver (no gem5 needed). Links against the gem5
+# library via the C ABI so a successful run here proves the library
+# is sound before we expose it to the gem5 device.
+gem5-smoke: $(DESTDIR)/$(GEM5_SMOKE)
+
+$(DESTDIR)/$(GEM5_SMOKE): $(GEM5_SMOKE_OBJ) $(DESTDIR)/$(VORTEX_GEM5_LIB)
+	$(CXX) $(CXXFLAGS) $(GEM5_SMOKE_OBJ) -L$(DESTDIR) -lvortex-gem5 -Wl,-rpath,$(DESTDIR) -o $@
+
 # updates the timestamp when flags changed.
 $(CONFIG_FILE): force
 	@mkdir -p $(@D)
@@ -205,10 +248,16 @@ clean-lib:
 clean-libvortex:
 	rm -f $(DESTDIR)/libvortex.so
 
+clean-libvortex-gem5:
+	rm -f $(DESTDIR)/$(VORTEX_GEM5_LIB)
+
+clean-gem5-smoke:
+	rm -f $(DESTDIR)/$(GEM5_SMOKE)
+
 clean-exe:
 	rm -f $(DESTDIR)/$(PROJECT)
 
 clean-obj:
 	rm -rf $(OBJ_DIR)
 
-clean: clean-lib clean-exe clean-obj
+clean: clean-lib clean-libvortex clean-libvortex-gem5 clean-gem5-smoke clean-exe clean-obj
diff --git a/sim/simx/gem5/SConscript b/sim/simx/gem5/SConscript
new file mode 100644
index 0000000000..535ada56ff
--- /dev/null
+++ b/sim/simx/gem5/SConscript
@@ -0,0 +1,18 @@
+# -*- mode:python -*-
+#
+# Vortex SimObjects for gem5. Installed into $GEM5_HOME/src/dev/vortex/
+# by sim/simx/gem5/install.sh. Picked up automatically by gem5's
+# top-level SConstruct via the SConscript-recursion rule at
+# SConstruct:1000.
+#
+# This file's source of truth lives in the Vortex tree
+# (sim/simx/gem5/SConscript); the installer just copies it.
+
+Import('*')
+
+SimObject('VortexGPGPU.py', sim_objects=['VortexGPGPU'])
+Source('vortex_gpgpu_dev.cc')
+
+# DebugFlag for VortexGPGPU traces. Enable with:
+#   gem5.opt --debug-flags=VortexGPGPU ...
+DebugFlag('VortexGPGPU')
diff --git a/sim/simx/gem5/VortexGPGPU.py b/sim/simx/gem5/VortexGPGPU.py
new file mode 100644
index 0000000000..bcbc038a06
--- /dev/null
+++ b/sim/simx/gem5/VortexGPGPU.py
@@ -0,0 +1,46 @@
+# Copyright © 2019-2023
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Python SimObject binding for the gem5-side VortexGPGPU device.
+# Mirrors the inheritance graph of the C++ side: DmaDevice → PioDevice
+# → ClockedObject.
+
+from m5.objects.Device import DmaDevice
+from m5.params import *
+
+
+class VortexGPGPU(DmaDevice):
+    type = "VortexGPGPU"
+    cxx_header = "dev/vortex/vortex_gpgpu_dev.hh"
+    cxx_class = "gem5::VortexGPGPU"
+
+    # Path to libvortex-gem5.so produced by `make -C sim/simx
+    # USE_GEM5=1` in the Vortex build dir. Required; the C++ ctor
+    # fatals if empty.
+    library = Param.String("Absolute path to libvortex-gem5.so")
+
+    # Optional kernel image preloaded at startup() via
+    # vortex_gem5_load_kernel. When set, the device runs the kernel to
+    # completion on its own tick scheduler and exits the sim loop when
+    # done; no host CPU or MMIO traffic is required. This is the Phase-3
+    # entry point that proves the gem5 wiring without depending on
+    # Phase-4's host-runtime work. Phase 4 uploads kernels via the OPAE
+    # MMIO protocol instead.
+    kernel = Param.String("", "Optional .vxbin/.bin/.hex to preload at boot")
+
+    # PIO range. Default matches the legacy capstone paper (Fig. 4)
+    # for backward narrative continuity, though nothing in the design
+    # depends on this exact value.
+    pio_addr    = Param.Addr(0x20000000, "PIO base address")
+    pio_size    = Param.Addr(0x1000, "PIO region size (bytes)")
+    pio_latency = Param.Latency("1ns", "PIO access latency")
diff --git a/sim/simx/gem5/gem5_smoke_main.cpp b/sim/simx/gem5/gem5_smoke_main.cpp
new file mode 100644
index 0000000000..d2e8ae944a
--- /dev/null
+++ b/sim/simx/gem5/gem5_smoke_main.cpp
@@ -0,0 +1,96 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Phase-2 in-process smoke driver for libvortex-gem5.so.
+//
+// Exercises the C ABI from a native x86 binary — no gem5 involvement.
+// If a kernel completes here, the library is sound; any subsequent
+// failure under gem5 is on the SimObject side, not the library.
+//
+// Usage:
+//   LD_LIBRARY_PATH=$(dirname $(realpath gem5_smoke)) ./gem5_smoke kernel.vxbin
+
+#include "vortex_gpgpu.h"
+#include "constants.h"
+#include <VX_config.h>
+#include <VX_types.h>
+
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+
+int main(int argc, char** argv) {
+  if (argc < 2) {
+    std::fprintf(stderr,
+                 "usage: %s <kernel.vxbin>\n"
+                 "  Runs the kernel through libvortex-gem5's C ABI to confirm\n"
+                 "  the library is wired up correctly before exposing it to\n"
+                 "  the gem5 SimObject.\n",
+                 argv[0]);
+    return 1;
+  }
+  const char* kernel_path = argv[1];
+
+  std::printf("[gem5_smoke] %s\n", vortex_gem5_build_info());
+  std::printf("[gem5_smoke] kernel: %s\n", kernel_path);
+
+  vortex_gem5_handle_t h = vortex_gem5_create();
+  if (h == nullptr) {
+    std::fprintf(stderr, "[gem5_smoke] vortex_gem5_create failed\n");
+    return 1;
+  }
+
+  if (vortex_gem5_load_kernel(h, kernel_path) != 0) {
+    std::fprintf(stderr, "[gem5_smoke] vortex_gem5_load_kernel failed\n");
+    vortex_gem5_destroy(h);
+    return 1;
+  }
+
+  // Tick until the kernel completes. cycle() returns false when no
+  // cluster is running AND no channel still holds an in-flight packet.
+  // Belt-and-braces cap at 100M cycles so a runaway kernel doesn't
+  // hang the smoke test (a real run hits the IO_EXIT_CODE check well
+  // before).
+  uint64_t cycles = 0;
+  constexpr uint64_t MAX_CYCLES = 100ull * 1000 * 1000;
+  while (vortex_gem5_tick(h)) {
+    if (++cycles > MAX_CYCLES) {
+      std::fprintf(stderr,
+                   "[gem5_smoke] aborted after %llu cycles — kernel did not complete\n",
+                   static_cast<unsigned long long>(cycles));
+      vortex_gem5_destroy(h);
+      return 1;
+    }
+  }
+
+  // Drain dirty cache lines to VRAM so we can read IO_EXIT_CODE. Same
+  // pattern as sim/simx/main.cpp's post-run cache flush — one DCR_READ
+  // per core triggers Processor::flush_caches() inside the simulator.
+  uint32_t dummy = 0;
+  for (uint32_t cid = 0; cid < NUM_CORES * NUM_CLUSTERS; ++cid) {
+    vortex_gem5_dcr_read(h, VX_DCR_BASE_CACHE_FLUSH, cid, &dummy);
+  }
+
+  // Read the kernel's exit code from IO_EXIT_CODE via the VRAM-read
+  // path — same byte the simx main reads in sim/simx/main.cpp:213.
+  uint32_t exit_code = 0;
+  vortex_gem5_vram_read(h, IO_EXIT_CODE,
+                        reinterpret_cast<uint8_t*>(&exit_code),
+                        sizeof(exit_code));
+
+  std::printf("[gem5_smoke] cycles=%llu exit_code=%u\n",
+              static_cast<unsigned long long>(cycles), exit_code);
+
+  vortex_gem5_destroy(h);
+  return static_cast<int>(exit_code);
+}
diff --git a/sim/simx/gem5/hello.c b/sim/simx/gem5/hello.c
new file mode 100644
index 0000000000..ff5de63037
--- /dev/null
+++ b/sim/simx/gem5/hello.c
@@ -0,0 +1,14 @@
+// Phase 0 ARM SE-mode smoke test. Cross-compile with
+//   aarch64-linux-gnu-gcc -static -o /tmp/hello-arm hello.c
+// and run under gem5 with the new gem5_library SimpleBoard wiring
+// (or the deprecated configs/example/se.py if still available).
+// Confirms the cross-toolchain produces something gem5 can load.
+
+#include <stdio.h>
+
+int main(int argc, char** argv) {
+    (void)argc;
+    (void)argv;
+    printf("Hello, ARM SE-mode (gem5 v25 Phase 0)\n");
+    return 0;
+}
diff --git a/sim/simx/gem5/install.sh b/sim/simx/gem5/install.sh
new file mode 100755
index 0000000000..7af477c313
--- /dev/null
+++ b/sim/simx/gem5/install.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+# Install Vortex gem5 SimObjects into a pinned gem5 tree.
+#
+# Phase 3+: installs the real VortexGPGPU device. The Phase-0 dummy/
+# scaffolding is intentionally removed from $GEM5_HOME during the
+# transition — its job (proving the install path works) is done.
+#
+# Idempotent: re-running just refreshes the files. Caller must
+# re-run `scons build/{X86,ARM}/gem5.opt` after this script to pick
+# up changes.
+#
+# Usage:
+#   GEM5_HOME=$HOME/tools/gem5 sim/simx/gem5/install.sh
+# or
+#   sim/simx/gem5/install.sh           # uses $GEM5_HOME from env
+
+set -e
+
+GEM5_HOME=${GEM5_HOME:-$HOME/tools/gem5}
+SOURCE_DIR=$(dirname "$(readlink -f "$0")")
+
+if [ ! -d "$GEM5_HOME/src/dev" ]; then
+    echo "ERROR: GEM5_HOME=$GEM5_HOME does not look like a gem5 tree" >&2
+    echo "       (expected $GEM5_HOME/src/dev/)" >&2
+    exit 1
+fi
+
+DEST_DIR="$GEM5_HOME/src/dev/vortex"
+mkdir -p "$DEST_DIR"
+
+# Phase 0 scaffolding cleanup: the dummy SimObject existed only to
+# prove the install path; remove it now that the real device is in
+# place so `gem5.opt --list-sim-objects` is not polluted by it.
+if [ -d "$DEST_DIR/dummy" ]; then
+    rm -rf "$DEST_DIR/dummy"
+fi
+
+# Install the real device: header, source, Python binding, SConscript.
+install -m 0644 "$SOURCE_DIR/vortex_gpgpu_dev.hh" "$DEST_DIR/"
+install -m 0644 "$SOURCE_DIR/vortex_gpgpu_dev.cc" "$DEST_DIR/"
+install -m 0644 "$SOURCE_DIR/VortexGPGPU.py"      "$DEST_DIR/"
+install -m 0644 "$SOURCE_DIR/SConscript"          "$DEST_DIR/"
+
+echo "Vortex SimObjects installed at $DEST_DIR"
+echo "Files:"
+ls -1 "$DEST_DIR" | sed 's/^/  /'
+echo ""
+echo "Re-build gem5 with one or both of:"
+echo "  scons -C $GEM5_HOME build/X86/gem5.opt -j\$(nproc)"
+echo "  scons -C $GEM5_HOME build/ARM/gem5.opt -j\$(nproc)"
diff --git a/sim/simx/gem5/vortex_gpgpu.cpp b/sim/simx/gem5/vortex_gpgpu.cpp
new file mode 100644
index 0000000000..1964d8e962
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu.cpp
@@ -0,0 +1,320 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "vortex_gpgpu.h"
+
+#include "constants.h"
+#include "processor.h"
+#include <mem.h>
+#include <util.h>
+#include <VX_config.h>
+#include <VX_types.h>
+
+#include <cstdio>
+#include <cstring>
+#include <iostream>
+#include <memory>
+#include <string>
+
+using namespace vortex;
+
+// Mirrors sw/runtime/common/common.h's GLOBAL_MEM_SIZE so the bounds
+// check in vram_{read,write} matches what the host runtime enforces
+// on its side. Inlined rather than including common.h because that
+// header drags in the full runtime ABI (vortex.h + callbacks.h +
+// mem_alloc.h) which a device library has no business touching.
+#if (XLEN == 64)
+static constexpr uint64_t GEM5_GLOBAL_MEM_SIZE = 0x200000000ull;  // 8 GB
+#else
+static constexpr uint64_t GEM5_GLOBAL_MEM_SIZE = 0x100000000ull;  // 4 GB
+#endif
+
+// OPAE MMIO command-set constants (same as
+// hw/syn/altera/opae/vortex_afu.json + sw/runtime/gem5/vortex.cpp).
+// Hardcoded — no #include of vortex_opae.h — to keep the device
+// library independent of the OPAE header generator.
+namespace cmd {
+constexpr uint64_t MEM_READ  = 1;
+constexpr uint64_t MEM_WRITE = 2;
+constexpr uint64_t RUN       = 3;
+constexpr uint64_t DCR_WRITE = 4;
+constexpr uint64_t DCR_READ  = 5;
+} // namespace cmd
+namespace mmio {
+constexpr uint64_t CMD_TYPE  = 10 * 4;  // byte offsets, matching the
+constexpr uint64_t CMD_ARG0  = 12 * 4;  // sw/runtime side
+constexpr uint64_t CMD_ARG1  = 14 * 4;
+constexpr uint64_t CMD_ARG2  = 16 * 4;
+constexpr uint64_t STATUS    = 18 * 4;
+constexpr uint64_t DCR_RSP   = 28 * 4;
+} // namespace mmio
+
+// Internal C++ class. Mirrors the shape of vortex::VortexSimulator in
+// sim/simx/sst/ — same Processor + RAM ownership, same KMU DCR priming,
+// same load_kernel paths — but with no SST types in the interface.
+namespace {
+
+class Gem5Device {
+public:
+  Gem5Device()
+    : ram_(0, MEM_PAGE_SIZE)
+    , proc_(std::make_unique<Processor>()) {
+    proc_->attach_ram(&ram_);
+  }
+
+  ~Gem5Device() = default;
+
+  // Load a kernel image and prime the KMU for a 1×1×1 CTA at
+  // STARTUP_ADDR. After this, cycle() will dispatch the kernel.
+  // Returns true on success.
+  bool load_kernel(const std::string& path) {
+    // KMU DCRs — same sequence as sim/simx/main.cpp:101–116 and
+    // sim/simx/sst/vortex_simulator.cpp:22–39.
+    const uint64_t startup_addr(STARTUP_ADDR);
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ADDR0, startup_addr & 0xffffffff);
+  #if (XLEN == 64)
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ADDR1, startup_addr >> 32);
+  #endif
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ARG0, 0);
+    proc_->dcr_write(VX_DCR_KMU_STARTUP_ARG1, 0);
+    proc_->dcr_write(VX_DCR_KMU_GRID_DIM_X,   1);
+    proc_->dcr_write(VX_DCR_KMU_GRID_DIM_Y,   1);
+    proc_->dcr_write(VX_DCR_KMU_GRID_DIM_Z,   1);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_DIM_X,  1);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_DIM_Y,  1);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_DIM_Z,  1);
+    proc_->dcr_write(VX_DCR_KMU_LMEM_SIZE,    0);
+    proc_->dcr_write(VX_DCR_KMU_BLOCK_SIZE,   1);
+    proc_->dcr_write(VX_DCR_KMU_WARP_STEP_X,  NUM_THREADS);
+    proc_->dcr_write(VX_DCR_KMU_WARP_STEP_Y,  0);
+    proc_->dcr_write(VX_DCR_KMU_WARP_STEP_Z,  0);
+
+    std::string ext(fileExtension(path.c_str()));
+    if (ext == "vxbin") {
+      ram_.loadVxImage(path.c_str());
+    } else if (ext == "bin") {
+      ram_.loadBinImage(path.c_str(), startup_addr);
+    } else if (ext == "hex") {
+      ram_.loadHexImage(path.c_str());
+    } else {
+      std::cerr << "vortex_gem5: unsupported kernel extension '" << ext
+                << "' (need .vxbin, .bin, or .hex)" << std::endl;
+      return false;
+    }
+    return true;
+  }
+
+  bool tick()  { return proc_->cycle(); }
+
+  // Memory access uses the same ACL-bypass pattern as
+  // sw/runtime/simx/vortex.cpp upload()/download(); the gem5 DMA path
+  // is a peer of the host runtime, not a userspace caller subject to
+  // page protections.
+  void vram_write(uint64_t addr, const uint8_t* src, uint32_t size) {
+    if (addr > GEM5_GLOBAL_MEM_SIZE - size) {  // wrap-safe: size is 32-bit
+    #ifndef NDEBUG
+      std::cerr << "vortex_gem5: vram_write overflow addr=0x"
+                << std::hex << addr << " size=" << std::dec << size << std::endl;
+    #endif
+      return;
+    }
+    ram_.enable_acl(false);
+    ram_.write(src, addr, size);
+    ram_.enable_acl(true);
+  }
+
+  void vram_read(uint64_t addr, uint8_t* dst, uint32_t size) {
+    if (addr > GEM5_GLOBAL_MEM_SIZE - size) {  // wrap-safe: size is 32-bit
+    #ifndef NDEBUG
+      std::cerr << "vortex_gem5: vram_read overflow addr=0x"
+                << std::hex << addr << " size=" << std::dec << size << std::endl;
+    #endif
+      return;
+    }
+    ram_.enable_acl(false);
+    ram_.read(dst, addr, size);
+    ram_.enable_acl(true);
+  }
+
+  int dcr_write(uint32_t addr, uint32_t value) {
+    return proc_->dcr_write(addr, value);
+  }
+
+  int dcr_read(uint32_t addr, uint32_t tag, uint32_t* value) {
+    return proc_->dcr_read(addr, tag, value);
+  }
+
+  // OPAE MMIO command-set state machine. The host runtime
+  // (sw/runtime/gem5/vortex.cpp) drives it in exactly the same
+  // shape as sw/runtime/opae/vortex.cpp:
+  //   1. Write CMD_ARG0/1/2 with command-specific args
+  //   2. Write CMD_TYPE — triggers the command
+  //   3. Poll MMIO_STATUS until busy bit clears
+  //   4. (For DCR_READ) read MMIO_DCR_RSP for the response
+  //
+  // Synchronous commands (DCR_*) complete inside this function and
+  // clear the busy bit immediately. Async commands (RUN, MEM_*)
+  // surface to the gem5 SimObject via pop_pending_cmd; the SimObject
+  // performs the gem5-side work (clock ticks, DMA) and clears busy
+  // when done.
+  uint64_t mmio_read64(uint64_t offset) {
+    if (offset == mmio::STATUS)  return busy_ ? 1u : 0u;
+    if (offset == mmio::DCR_RSP) return dcr_rsp_;
+    return 0;
+  }
+
+  void mmio_write64(uint64_t offset, uint64_t value) {
+    if (offset == mmio::CMD_ARG0) { cmd_args_[0] = value; return; }
+    if (offset == mmio::CMD_ARG1) { cmd_args_[1] = value; return; }
+    if (offset == mmio::CMD_ARG2) { cmd_args_[2] = value; return; }
+    if (offset != mmio::CMD_TYPE) return;  // unknown reg — ignore
+
+    busy_ = true;
+    switch (value) {
+    case cmd::DCR_WRITE: {
+      proc_->dcr_write(uint32_t(cmd_args_[0]), uint32_t(cmd_args_[1]));
+      busy_ = false;
+      break;
+    }
+    case cmd::DCR_READ: {
+      uint32_t v = 0;
+      proc_->dcr_read(uint32_t(cmd_args_[0]),
+                      uint32_t(cmd_args_[1]),
+                      &v);
+      dcr_rsp_ = v;
+      busy_ = false;
+      break;
+    }
+    case cmd::RUN:
+    case cmd::MEM_READ:
+    case cmd::MEM_WRITE:
+      // Async — gem5 SimObject reads pending_cmd_ on the same MMIO
+      // dispatch tick and routes the work (clock cycles for RUN,
+      // dmaAction for MEM_*). It clears busy when done.
+      pending_cmd_ = value;
+      break;
+    default:
+      // Unknown command: drop the busy bit so the host doesn't hang.
+      busy_ = false;
+      break;
+    }
+  }
+
+  uint64_t pop_pending_cmd() {
+    uint64_t c = pending_cmd_;
+    pending_cmd_ = 0;
+    return c;
+  }
+  uint64_t get_cmd_arg(int which) const {
+    return (which >= 0 && which < 3) ? cmd_args_[which] : 0;
+  }
+  void set_busy(bool busy) { busy_ = busy; }
+
+private:
+  RAM ram_;
+  std::unique_ptr<Processor> proc_;
+
+  // OPAE protocol state.
+  uint64_t cmd_args_[3] = {0, 0, 0};
+  uint64_t pending_cmd_ = 0;
+  uint64_t dcr_rsp_     = 0;
+  bool     busy_        = false;
+};
+
+} // namespace
+
+// ----- C ABI -----------------------------------------------------------------
+
+extern "C" {
+
+const char* vortex_gem5_build_info(void) {
+  static char info[256];
+  std::snprintf(info, sizeof(info),
+                "vortex-gem5 (XLEN=%d, threads=%d, warps=%d, cores=%d, clusters=%d)",
+                XLEN, NUM_THREADS, NUM_WARPS, NUM_CORES, NUM_CLUSTERS);
+  return info;
+}
+
+vortex_gem5_handle_t vortex_gem5_create(void) {
+  try {
+    return reinterpret_cast<vortex_gem5_handle_t>(new Gem5Device());
+  } catch (const std::exception& e) {
+    std::cerr << "vortex_gem5_create: " << e.what() << std::endl;
+    return nullptr;
+  } catch (...) {
+    std::cerr << "vortex_gem5_create: unknown exception" << std::endl;
+    return nullptr;
+  }
+}
+
+void vortex_gem5_destroy(vortex_gem5_handle_t h) {
+  if (h == nullptr) return;
+  delete reinterpret_cast<Gem5Device*>(h);
+}
+
+int vortex_gem5_load_kernel(vortex_gem5_handle_t h, const char* path) {
+  if (h == nullptr || path == nullptr) return -1;
+  return reinterpret_cast<Gem5Device*>(h)->load_kernel(path) ? 0 : -1;
+}
+
+bool vortex_gem5_tick(vortex_gem5_handle_t h) {
+  if (h == nullptr) return false;
+  return reinterpret_cast<Gem5Device*>(h)->tick();
+}
+
+uint64_t vortex_gem5_mmio_read64(vortex_gem5_handle_t h, uint64_t offset) {
+  if (h == nullptr) return 0;
+  return reinterpret_cast<Gem5Device*>(h)->mmio_read64(offset);
+}
+
+void vortex_gem5_mmio_write64(vortex_gem5_handle_t h, uint64_t offset, uint64_t value) {
+  if (h == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->mmio_write64(offset, value);
+}
+
+void vortex_gem5_vram_write(vortex_gem5_handle_t h, uint64_t dev_addr, const uint8_t* src, uint32_t size) {
+  if (h == nullptr || src == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->vram_write(dev_addr, src, size);
+}
+
+void vortex_gem5_vram_read(vortex_gem5_handle_t h, uint64_t dev_addr, uint8_t* dst, uint32_t size) {
+  if (h == nullptr || dst == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->vram_read(dev_addr, dst, size);
+}
+
+int vortex_gem5_dcr_write(vortex_gem5_handle_t h, uint32_t addr, uint32_t value) {
+  if (h == nullptr) return -1;
+  return reinterpret_cast<Gem5Device*>(h)->dcr_write(addr, value);
+}
+
+int vortex_gem5_dcr_read(vortex_gem5_handle_t h, uint32_t addr, uint32_t tag, uint32_t* value) {
+  if (h == nullptr || value == nullptr) return -1;
+  return reinterpret_cast<Gem5Device*>(h)->dcr_read(addr, tag, value);
+}
+
+uint64_t vortex_gem5_pop_pending_cmd(vortex_gem5_handle_t h) {
+  if (h == nullptr) return 0;
+  return reinterpret_cast<Gem5Device*>(h)->pop_pending_cmd();
+}
+
+uint64_t vortex_gem5_get_cmd_arg(vortex_gem5_handle_t h, int which) {
+  if (h == nullptr) return 0;
+  return reinterpret_cast<Gem5Device*>(h)->get_cmd_arg(which);
+}
+
+void vortex_gem5_set_busy(vortex_gem5_handle_t h, bool busy) {
+  if (h == nullptr) return;
+  reinterpret_cast<Gem5Device*>(h)->set_busy(busy);
+}
+
+} // extern "C"
diff --git a/sim/simx/gem5/vortex_gpgpu.h b/sim/simx/gem5/vortex_gpgpu.h
new file mode 100644
index 0000000000..94d14eb865
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu.h
@@ -0,0 +1,111 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// libvortex-gem5 — C ABI for the gem5 VortexGPGPU SimObject.
+//
+// The gem5 device (sim/simx/gem5/<simobject>.cc, installed into a pinned
+// gem5 tree by sim/simx/gem5/install.sh) loads this shared library and
+// drives it through this C ABI. Keeping the ABI in C — not C++ — means
+// the gem5 side does not depend on SimX's C++ types and can be rebuilt
+// against a new gem5 release without touching anything Vortex-side.
+//
+// Concurrency: the gem5 device serializes calls on its event-loop thread;
+// the library does no internal locking. Re-entrancy: the library never
+// calls back into gem5. Async DMA completion is handled on the gem5 side,
+// which reports it to the library through vortex_gem5_set_busy(h, false).
+
+#pragma once
+
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Opaque handle. The library owns a vortex::Processor + RAM behind it.
+typedef struct vortex_gem5_device_s* vortex_gem5_handle_t;
+
+// Returns a printable description of the build config (cores, warps,
+// threads, XLEN). Returned pointer is static; do not free.
+const char* vortex_gem5_build_info(void);
+
+// Construct a Vortex device instance. Returns NULL on failure.
+// VRAM is allocated lazily; no kernel is loaded until
+// vortex_gem5_load_kernel is called.
+vortex_gem5_handle_t vortex_gem5_create(void);
+
+// Destroy the device. Safe to call with NULL.
+void vortex_gem5_destroy(vortex_gem5_handle_t h);
+
+// Load a kernel image into VRAM. Accepts .vxbin / .bin / .hex (same
+// shape as sim/simx/main.cpp:120). Primes the KMU DCRs for a 1x1x1
+// CTA at STARTUP_ADDR (same as sim/simx/main.cpp:101-116) so a
+// subsequent cycle() loop launches the kernel.
+//
+// In the Phase-2 in-process smoke driver this is how kernels reach
+// the device. The Phase-4 runtime will instead upload kernels via
+// the staging-buffer DMA path (vortex_gem5_vram_write + the OPAE MMIO
+// commands), and Phase 3's gem5 SimObject can optionally call this
+// at boot via a Python `kernel=...` parameter for one-shot smoke
+// tests.
+//
+// Returns 0 on success, -1 on file-not-found or unsupported format.
+int vortex_gem5_load_kernel(vortex_gem5_handle_t h, const char* path);
+
+// Advance the simulator by one cycle. Returns true while work
+// remains (clusters running or channels carrying packets); false once
+// the program has finished. Mirrors vortex::Processor::cycle().
+bool vortex_gem5_tick(vortex_gem5_handle_t h);
+
+// MMIO (PIO) accessed by the simulated host CPU via the gem5 SimObject's
+// read()/write() callbacks. Offsets are byte addresses inside the
+// device's PIO range. See sw/runtime/opae/vortex.cpp for the OPAE MMIO
+// layout this protocol mirrors.
+uint64_t vortex_gem5_mmio_read64(vortex_gem5_handle_t h, uint64_t offset);
+void vortex_gem5_mmio_write64(vortex_gem5_handle_t h, uint64_t offset, uint64_t value);
+
+// VRAM access. The gem5 device DMAs to/from the host's staging buffer
+// using its own DmaPort; once the bytes are in a local scratch, it
+// calls these to copy into/out of the device VRAM. Bytes here cross
+// only the C ABI boundary — they do not re-enter gem5's DMA system.
+//
+// Bounds-checked against the global memory size (8 GB if XLEN == 64, else
+// 4 GB); an out-of-range call is a no-op and logs to stderr in debug builds.
+void vortex_gem5_vram_write(vortex_gem5_handle_t h, uint64_t dev_addr, const uint8_t* src, uint32_t size);
+void vortex_gem5_vram_read(vortex_gem5_handle_t h, uint64_t dev_addr, uint8_t* dst, uint32_t size);
+
+// DCR write/read passthrough. The DCR-read path also handles the
+// cache-flush DCR (VX_DCR_BASE_CACHE_FLUSH), which drains dirty cache
+// lines all the way to VRAM — required before a host read-back per
+// B9 in docs/proposals/gem5_simx_v3_proposal.md §2.2.
+int vortex_gem5_dcr_write(vortex_gem5_handle_t h, uint32_t addr, uint32_t value);
+int vortex_gem5_dcr_read(vortex_gem5_handle_t h, uint32_t addr, uint32_t tag, uint32_t* value);
+
+// Protocol state introspection for the gem5 SimObject. The library
+// owns the OPAE state machine (cmd_args + busy bit + cmd_type +
+// dcr_rsp); the gem5 SimObject calls these to service the async
+// commands: the tick loop for CMD_RUN, DMA for CMD_MEM_{READ,WRITE}.
+//
+// pop_pending_cmd returns the CMD_* constant of an async command
+// the SimObject must service (CMD_RUN, CMD_MEM_WRITE, CMD_MEM_READ),
+// or 0 if no command is pending. Synchronous commands (CMD_DCR_*)
+// are handled inside mmio_write64 and never surface here.
+uint64_t vortex_gem5_pop_pending_cmd(vortex_gem5_handle_t h);
+uint64_t vortex_gem5_get_cmd_arg(vortex_gem5_handle_t h, int which);
+void     vortex_gem5_set_busy(vortex_gem5_handle_t h, bool busy);
+
+#ifdef __cplusplus
+} // extern "C"
+#endif
diff --git a/sim/simx/gem5/vortex_gpgpu_dev.cc b/sim/simx/gem5/vortex_gpgpu_dev.cc
new file mode 100644
index 0000000000..f46d42934f
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu_dev.cc
@@ -0,0 +1,295 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "dev/vortex/vortex_gpgpu_dev.hh"
+
+#include "base/logging.hh"
+#include "base/trace.hh"
+#include "mem/packet_access.hh"
+#include "sim/sim_exit.hh"
+
+#include <dlfcn.h>
+
+// OPAE MMIO command-set constants. Hardcoded to match the layout
+// the host runtime uses (sw/runtime/gem5/vortex.cpp:50-66, also
+// hw/syn/altera/opae/vortex_afu.json). Hardcoded — not pulled from
+// vortex_opae.h — because gem5 is compiled out-of-tree and we
+// don't want a build-time dep on the Vortex source.
+static constexpr uint64_t MMIO_CMD_TYPE = 10 * 4;  // byte offset
+static constexpr uint64_t CMD_MEM_READ  = 1;
+static constexpr uint64_t CMD_MEM_WRITE = 2;
+static constexpr uint64_t CMD_RUN       = 3;
+
+// Cache line size — args are stored shifted by log2(CACHE_BLOCK_SIZE)
+// in the OPAE protocol; both directions agree at log2(64) = 6.
+static constexpr unsigned CACHE_BLOCK_LOG2 = 6;
+
+namespace gem5
+{
+
+namespace {
+
+// Helper for dlsym + null-check in one line. Returns the resolved
+// pointer cast to T, or fatals out with a stable error message.
+template <typename T>
+T dlsym_or_fatal(void* handle, const char* symbol, const char* libpath)
+{
+    void* p = dlsym(handle, symbol);
+    if (p == nullptr) {
+        fatal("VortexGPGPU: dlsym(%s) failed in %s: %s",
+              symbol, libpath, dlerror());
+    }
+    return reinterpret_cast<T>(p);
+}
+
+} // namespace
+
+VortexGPGPU::VortexGPGPU(const Params &p)
+  : DmaDevice(p),
+    libHandle_(nullptr),
+    deviceHandle_(nullptr),
+    abi_{},
+    libraryPath_(p.library),
+    kernelPath_(p.kernel),
+    pioAddr_(p.pio_addr),
+    pioSize_(p.pio_size),
+    pioLatency_(p.pio_latency),
+    tickEvent_([this]{ this->tick(); }, name() + ".tickEvent")
+{
+    if (libraryPath_.empty()) {
+        fatal("VortexGPGPU: 'library' parameter is required "
+              "(path to libvortex-gem5.so)");
+    }
+
+    // dlopen with RTLD_LAZY|RTLD_LOCAL: local so the library's symbols
+    // stay out of gem5's global symbol scope, lazy because we resolve
+    // explicitly with dlsym below anyway.
+    libHandle_ = dlopen(libraryPath_.c_str(), RTLD_LAZY | RTLD_LOCAL);
+    if (libHandle_ == nullptr) {
+        fatal("VortexGPGPU: dlopen('%s') failed: %s",
+              libraryPath_, dlerror());
+    }
+
+    // Resolve the full v1 C ABI surface. Any missing symbol is a hard
+    // build mismatch between gem5 and the Vortex library — fatal so
+    // we fail fast at construction rather than mid-simulation.
+    abi_.build_info   = dlsym_or_fatal<const char*(*)(void)>
+                          (libHandle_, "vortex_gem5_build_info",   libraryPath_.c_str());
+    abi_.create       = dlsym_or_fatal<void*(*)(void)>
+                          (libHandle_, "vortex_gem5_create",       libraryPath_.c_str());
+    abi_.destroy      = dlsym_or_fatal<void(*)(void*)>
+                          (libHandle_, "vortex_gem5_destroy",      libraryPath_.c_str());
+    abi_.load_kernel  = dlsym_or_fatal<int(*)(void*, const char*)>
+                          (libHandle_, "vortex_gem5_load_kernel",  libraryPath_.c_str());
+    abi_.tick         = dlsym_or_fatal<bool(*)(void*)>
+                          (libHandle_, "vortex_gem5_tick",         libraryPath_.c_str());
+    abi_.mmio_read64  = dlsym_or_fatal<uint64_t(*)(void*, uint64_t)>
+                          (libHandle_, "vortex_gem5_mmio_read64",  libraryPath_.c_str());
+    abi_.mmio_write64 = dlsym_or_fatal<void(*)(void*, uint64_t, uint64_t)>
+                          (libHandle_, "vortex_gem5_mmio_write64", libraryPath_.c_str());
+    abi_.vram_write   = dlsym_or_fatal<void(*)(void*, uint64_t, const uint8_t*, uint32_t)>
+                          (libHandle_, "vortex_gem5_vram_write",   libraryPath_.c_str());
+    abi_.vram_read    = dlsym_or_fatal<void(*)(void*, uint64_t, uint8_t*, uint32_t)>
+                          (libHandle_, "vortex_gem5_vram_read",    libraryPath_.c_str());
+    abi_.dcr_write    = dlsym_or_fatal<int(*)(void*, uint32_t, uint32_t)>
+                          (libHandle_, "vortex_gem5_dcr_write",    libraryPath_.c_str());
+    abi_.dcr_read     = dlsym_or_fatal<int(*)(void*, uint32_t, uint32_t, uint32_t*)>
+                          (libHandle_, "vortex_gem5_dcr_read",     libraryPath_.c_str());
+    abi_.pop_pending_cmd = dlsym_or_fatal<uint64_t(*)(void*)>
+                          (libHandle_, "vortex_gem5_pop_pending_cmd", libraryPath_.c_str());
+    abi_.get_cmd_arg  = dlsym_or_fatal<uint64_t(*)(void*, int)>
+                          (libHandle_, "vortex_gem5_get_cmd_arg",  libraryPath_.c_str());
+    abi_.set_busy     = dlsym_or_fatal<void(*)(void*, bool)>
+                          (libHandle_, "vortex_gem5_set_busy",     libraryPath_.c_str());
+
+    inform("VortexGPGPU: %s", abi_.build_info());
+    inform("VortexGPGPU: library=%s pio=[0x%llx,+0x%llx)",
+           libraryPath_,
+           static_cast<unsigned long long>(pioAddr_),
+           static_cast<unsigned long long>(pioSize_));
+
+    deviceHandle_ = abi_.create();
+    if (deviceHandle_ == nullptr) {
+        fatal("VortexGPGPU: vortex_gem5_create returned NULL");
+    }
+}
+
+VortexGPGPU::~VortexGPGPU()
+{
+    if (deviceHandle_ != nullptr && abi_.destroy != nullptr) {
+        abi_.destroy(deviceHandle_);
+    }
+    if (libHandle_ != nullptr) {
+        dlclose(libHandle_);
+    }
+}
+
+void
+VortexGPGPU::init()
+{
+    DmaDevice::init();
+}
+
+void
+VortexGPGPU::startup()
+{
+    DmaDevice::startup();
+
+    if (!kernelPath_.empty()) {
+        // Standalone mode (Phase 3): preload a kernel and self-drive
+        // to completion. Used by ci/gem5_test_vortex_hello.py — no
+        // host CPU needed.
+        inform("VortexGPGPU: standalone mode (preload + auto-tick)");
+        inform("VortexGPGPU: preloading kernel=%s", kernelPath_);
+        if (abi_.load_kernel(deviceHandle_, kernelPath_.c_str()) != 0) {
+            fatal("VortexGPGPU: vortex_gem5_load_kernel('%s') failed",
+                  kernelPath_);
+        }
+        standalone_ = true;
+        schedule(tickEvent_, clockEdge(Cycles(1)));
+    } else {
+        // Hosted mode (Phase 5+): the host CPU uploads kernels via
+        // MMIO/DMA and triggers execution with CMD_RUN. We sit idle
+        // until then; CMD_RUN's write handler schedules tickEvent_.
+        inform("VortexGPGPU: hosted mode (waiting for host CMD_RUN)");
+        standalone_ = false;
+    }
+}
+
+void
+VortexGPGPU::tick()
+{
+    bool running = abi_.tick(deviceHandle_);
+    if (running) {
+        schedule(tickEvent_, clockEdge(Cycles(1)));
+        return;
+    }
+    // Kernel finished.
+    if (standalone_) {
+        inform("VortexGPGPU: standalone kernel complete — exiting sim loop");
+        exitSimLoop("VortexGPGPU: kernel complete");
+    } else {
+        // Host CPU is polling MMIO_STATUS waiting for busy bit to
+        // clear; do that now so vx_ready_wait returns.
+        abi_.set_busy(deviceHandle_, false);
+    }
+}
+
+Tick
+VortexGPGPU::read(PacketPtr pkt)
+{
+    const Addr offset = pkt->getAddr() - pioAddr_;
+    const uint64_t value = abi_.mmio_read64(deviceHandle_, offset);
+
+    // 64-bit aligned access is the only shape the OPAE protocol uses.
+    // Stuff the result into the packet regardless of size (gem5 will
+    // truncate based on getSize); narrow reads are unsupported by the
+    // protocol but harmless here.
+    pkt->setUintX(value, ByteOrder::little);
+    pkt->makeAtomicResponse();
+    return pioLatency_;
+}
+
+Tick
+VortexGPGPU::write(PacketPtr pkt)
+{
+    const Addr offset = pkt->getAddr() - pioAddr_;
+    const uint64_t value = pkt->getUintX(ByteOrder::little);
+
+    // Always forward the write to the Vortex library first so the
+    // device sees the args/CMD_TYPE in order.
+    abi_.mmio_write64(deviceHandle_, offset, value);
+
+    // Then react to commands that need gem5-side action: kicking the
+    // tick scheduler for CMD_RUN, and DMA through the device's DmaPort
+    // for CMD_MEM_{READ,WRITE} (see handleCmdType).
+    if (offset == MMIO_CMD_TYPE) {
+        handleCmdType(value);
+    }
+
+    pkt->makeAtomicResponse();
+    return pioLatency_;
+}
+
+void
+VortexGPGPU::handleCmdType(uint64_t /*value*/)
+{
+    // Read which async command the library wants us to handle.
+    // Sync commands (DCR_*) already completed inside mmio_write64
+    // and don't surface here (pop returns 0).
+    const uint64_t cmd = abi_.pop_pending_cmd(deviceHandle_);
+    if (cmd == 0) return;
+
+    if (cmd == CMD_RUN) {
+        // Schedule the tick loop. tick() clears busy_ when the
+        // kernel finishes (via abi_.set_busy(false)).
+        if (!tickEvent_.scheduled()) {
+            schedule(tickEvent_, clockEdge(Cycles(1)));
+        }
+        return;
+    }
+
+    if (cmd == CMD_MEM_WRITE || cmd == CMD_MEM_READ) {
+        // Args are CACHE-LINE shifted in the OPAE protocol.
+        const Addr host_addr = abi_.get_cmd_arg(deviceHandle_, 0)
+                                 << CACHE_BLOCK_LOG2;
+        const Addr dev_addr  = abi_.get_cmd_arg(deviceHandle_, 1)
+                                 << CACHE_BLOCK_LOG2;
+        const uint64_t size  = abi_.get_cmd_arg(deviceHandle_, 2)
+                                 << CACHE_BLOCK_LOG2;
+
+        // Scratch buffer for the transfer; freed inside the
+        // completion callback. EventFunctionWrapper's `true` tail
+        // arg flags auto-delete after firing.
+        auto* scratch = new uint8_t[size]();  // zeroed; vram_read can no-op
+        void* deviceHandle = deviceHandle_;
+        auto& abi = abi_;
+
+        if (cmd == CMD_MEM_WRITE) {
+            // Host pinned buffer → device VRAM.
+            auto* done = new EventFunctionWrapper(
+                [&abi, deviceHandle, dev_addr, scratch, size]() {
+                    abi.vram_write(deviceHandle, dev_addr, scratch,
+                                   static_cast<uint32_t>(size));
+                    delete[] scratch;
+                    abi.set_busy(deviceHandle, false);
+                },
+                name() + ".dmaReadDone",
+                /*deletePostEvent=*/true);
+            dmaRead(host_addr, size, done, scratch);
+        } else {
+            // Device VRAM → host pinned buffer.
+            abi.vram_read(deviceHandle, dev_addr, scratch,
+                          static_cast<uint32_t>(size));
+            auto* done = new EventFunctionWrapper(
+                [&abi, deviceHandle, scratch]() {
+                    delete[] scratch;
+                    abi.set_busy(deviceHandle, false);
+                },
+                name() + ".dmaWriteDone",
+                /*deletePostEvent=*/true);
+            dmaWrite(host_addr, size, done, scratch);
+        }
+        return;
+    }
+}
+
+AddrRangeList
+VortexGPGPU::getAddrRanges() const
+{
+    AddrRangeList ranges;
+    ranges.push_back(RangeSize(pioAddr_, pioSize_));
+    return ranges;
+}
+
+} // namespace gem5
diff --git a/sim/simx/gem5/vortex_gpgpu_dev.hh b/sim/simx/gem5/vortex_gpgpu_dev.hh
new file mode 100644
index 0000000000..8f68256365
--- /dev/null
+++ b/sim/simx/gem5/vortex_gpgpu_dev.hh
@@ -0,0 +1,122 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// VortexGPGPU — gem5 SimObject wrapper for libvortex-gem5.so.
+//
+// Lives at $GEM5_HOME/src/dev/vortex/vortex_gpgpu_dev.{cc,hh} after
+// sim/simx/gem5/install.sh runs. The host-side source of truth is
+// the Vortex tree (sim/simx/gem5/) so API drift between gem5 and the
+// Vortex C ABI shows up as a build error in Vortex CI, not as a gem5
+// integration mystery.
+//
+// Design points (see docs/proposals/gem5_simx_v3_proposal.md §3.1):
+//   - dlopen the Vortex library at construction time; resolve all
+//     vortex_gem5_* symbols up-front. This keeps gem5 decoupled from
+//     the Vortex C++ ABI, so we can iterate on SimX internals without
+//     rebuilding gem5.
+//   - Drive Vortex's clock from a self-rescheduling EventFunctionWrapper
+//     (sim/simx/gem5/gem5_api_notes.md §"EventFunctionWrapper"). One
+//     vortex_gem5_tick() per gem5 cycle.
+//   - Inherits DmaDevice (not just PioDevice) so the host runtime gets
+//     DMA "for free" via gem5's DmaPort; CMD_MEM_{READ,WRITE} use
+//     dmaRead/dmaWrite. Standalone mode leaves the DMA paths unexercised.
+
+#ifndef __DEV_VORTEX_VORTEX_GPGPU_DEV_HH__
+#define __DEV_VORTEX_VORTEX_GPGPU_DEV_HH__
+
+#include "dev/dma_device.hh"
+#include "dev/io_device.hh"
+#include "params/VortexGPGPU.hh"
+#include "sim/eventq.hh"
+
+#include <cstdint>
+#include <string>
+
+namespace gem5
+{
+
+class VortexGPGPU : public DmaDevice
+{
+public:
+    using Params = VortexGPGPUParams;
+
+    VortexGPGPU(const Params &p);
+    ~VortexGPGPU() override;
+
+    // PioDevice interface
+    Tick read(PacketPtr pkt) override;
+    Tick write(PacketPtr pkt) override;
+    AddrRangeList getAddrRanges() const override;
+
+    // SimObject lifecycle
+    void init() override;
+    void startup() override;
+
+private:
+    // Self-rescheduling clock tick: calls vortex_gem5_tick() once per device
+    // cycle. On false (kernel done): exitSimLoop if standalone, else clear busy.
+    void tick();
+
+    // Service the async command latched by an MMIO_CMD_TYPE write:
+    // CMD_RUN schedules the tick loop; CMD_MEM_{READ,WRITE} go through
+    // dmaRead/dmaWrite. Synchronous CMD_DCR_* commands complete inside
+    // vortex_gem5_mmio_write64 and are a no-op here.
+    void handleCmdType(uint64_t value);
+
+    // Library binding ------------------------------------------------
+    // Opaque dlopen handle; closed in dtor.
+    void* libHandle_;
+    // Vortex device handle returned by vortex_gem5_create.
+    void* deviceHandle_;
+
+    // Cached function pointers — resolved once at construction so the
+    // hot path (tick, read, write) is straight indirect calls with no
+    // string lookups.
+    struct AbiV1 {
+        const char* (*build_info)(void);
+        void*       (*create)(void);
+        void        (*destroy)(void* h);
+        int         (*load_kernel)(void* h, const char* path);
+        bool        (*tick)(void* h);
+        uint64_t    (*mmio_read64)(void* h, uint64_t off);
+        void        (*mmio_write64)(void* h, uint64_t off, uint64_t value);
+        void        (*vram_write)(void* h, uint64_t addr, const uint8_t* src, uint32_t size);
+        void        (*vram_read)(void* h, uint64_t addr, uint8_t* dst, uint32_t size);
+        int         (*dcr_write)(void* h, uint32_t addr, uint32_t value);
+        int         (*dcr_read)(void* h, uint32_t addr, uint32_t tag, uint32_t* value);
+        uint64_t    (*pop_pending_cmd)(void* h);
+        uint64_t    (*get_cmd_arg)(void* h, int which);
+        void        (*set_busy)(void* h, bool busy);
+    } abi_;
+
+    // Configuration --------------------------------------------------
+    const std::string libraryPath_;
+    const std::string kernelPath_;
+    const Addr        pioAddr_;
+    const Addr        pioSize_;
+    const Tick        pioLatency_;
+
+    // Tick scheduling
+    EventFunctionWrapper tickEvent_;
+
+    // Standalone vs. hosted mode (selected at startup based on
+    // whether the `kernel=` Python param was set). In standalone
+    // mode the device drives a single preloaded kernel to
+    // completion and exits the sim loop; in hosted mode it sits
+    // idle until the host CPU issues CMD_RUN via MMIO.
+    bool standalone_;
+};
+
+} // namespace gem5
+
+#endif // __DEV_VORTEX_VORTEX_GPGPU_DEV_HH__
diff --git a/sim/simx/processor.cpp b/sim/simx/processor.cpp
index b173e4195d..40dc9226a8 100644
--- a/sim/simx/processor.cpp
+++ b/sim/simx/processor.cpp
@@ -231,6 +231,22 @@ void ProcessorImpl::reset() {
   perf_mem_writes_ = 0;
   perf_mem_latency_ = 0;
   perf_mem_pending_reads_ = 0;
+  is_cycle_initialized_ = false;
+}
+
+bool ProcessorImpl::cycle() {
+  // Lazy first-call init mirrors run()'s top-of-loop sequence so the
+  // external driver doesn't need to choreograph reset + kmu start
+  // separately. reset() clears is_cycle_initialized_ so a back-to-back
+  // kernel launch re-dispatches.
+  if (!is_cycle_initialized_) {
+    this->reset();
+    kmu_->start();
+    is_cycle_initialized_ = true;
+  }
+  SimPlatform::instance().tick();
+  perf_mem_latency_ += perf_mem_pending_reads_;
+  return this->any_running();
 }
 
 int ProcessorImpl::dcr_write(uint32_t addr, uint32_t value) {
@@ -333,6 +349,14 @@ int Processor::run() {
   return -1;
 }
 
+bool Processor::cycle() {
+  return impl_->cycle();
+}
+
+Memory* Processor::memsim() {
+  return impl_->memsim();
+}
+
 int Processor::dcr_write(uint32_t addr, uint32_t value) {
   return impl_->dcr_write(addr, value);
 }
diff --git a/sim/simx/processor.h b/sim/simx/processor.h
index 129cfdc460..04b57f037b 100644
--- a/sim/simx/processor.h
+++ b/sim/simx/processor.h
@@ -20,6 +20,7 @@
 namespace vortex {
 
 class RAM;
+class Memory;
 class ProcessorImpl;
 
 class Processor {
@@ -33,12 +34,29 @@ class Processor {
 
   int run();
 
+  // Advance the simulator by one cycle. On the first call after a
+  // reset() (or on the very first call), the KMU is started so warps
+  // dispatch into the cluster. Returns true while work remains
+  // (clusters running or channels carrying packets); false once the
+  // program has finished and the channels have drained.
+  //
+  // Used by external simulators that drive Vortex's clock from their
+  // own event loop (SST in sim/simx/sst/, gem5 in sim/simx/gem5/).
+  bool cycle();
+
   void start_kmu();
 
   bool any_running() const;
 
   class Core* get_first_core() const;
 
+  // Returns the processor's memory module. Used by external simulators
+  // (SST, gem5) to install a pre-send hook on Memory::tick that mirrors
+  // accepted requests to their own memory hierarchy for timing
+  // observability. The local data path stays in Vortex's RAM — this is
+  // a peek, not a substitute.
+  Memory* memsim();
+
   int dcr_write(uint32_t addr, uint32_t value);
 
   int dcr_read(uint32_t addr, uint32_t tag, uint32_t* value);
diff --git a/sim/simx/processor_impl.h b/sim/simx/processor_impl.h
index 0f66471b6c..4d2b6fef4f 100644
--- a/sim/simx/processor_impl.h
+++ b/sim/simx/processor_impl.h
@@ -40,6 +40,11 @@ class ProcessorImpl {
 
   int run();
 
+  // Single-cycle step; see Processor::cycle() doc. Lazily initializes
+  // (resets + starts KMU) on the first call after construction or
+  // after reset() has been invoked.
+  bool cycle();
+
   int dcr_write(uint32_t addr, uint32_t value);
 
   int dcr_read(uint32_t addr, uint32_t tag, uint32_t* value);
@@ -48,6 +53,8 @@ class ProcessorImpl {
 
   Kmu& kmu()       { return *kmu_; }
 
+  Memory* memsim() { return memsim_.get(); }
+
   bool any_running() const;
 
   class Core* get_first_core() const;
@@ -67,6 +74,10 @@ class ProcessorImpl {
   uint64_t perf_mem_writes_;
   uint64_t perf_mem_latency_;
   uint64_t perf_mem_pending_reads_;
+  // Tracks whether cycle() has done its first-call init (reset +
+  // kmu_->start()). reset() clears it so a back-to-back kernel launch
+  // via cycle() re-dispatches the KMU.
+  bool is_cycle_initialized_;
 };
 
 }
diff --git a/sw/common/bitmanip.h b/sw/common/bitmanip.h
index c4fe9e8da2..5c72683859 100644
--- a/sw/common/bitmanip.h
+++ b/sw/common/bitmanip.h
@@ -14,6 +14,8 @@
 #pragma once
 
 #include <cstdint>
+#include <type_traits>
+#include <algorithm>
 #include <assert.h>
 
 namespace vortex {
diff --git a/sw/runtime/gem5/Makefile b/sw/runtime/gem5/Makefile
new file mode 100644
index 0000000000..16bd3390be
--- /dev/null
+++ b/sw/runtime/gem5/Makefile
@@ -0,0 +1,73 @@
+include ../common.mk
+
+# HOST_ARCH selects the cross-compiler for the simulated host ISA
+# inside gem5 (see docs/proposals/gem5_simx_v3_proposal.md §3.5).
+# The default x86_64 needs no extra toolchain; aarch64/armhf need the
+# cross-compilers that ci/gem5_install.sh installs via apt (with sudo).
+HOST_ARCH ?= x86_64
+
+DESTDIR ?= $(CURDIR)/..
+
+SRC_DIR := $(VORTEX_HOME)/sw/runtime/gem5
+
+CXXFLAGS += -std=c++17 -Wall -Wextra -pedantic -Wfatal-errors -Werror
+CXXFLAGS += -I$(INC_DIR) -I$(ROOT_DIR)/sw -I$(ROOT_DIR)/hw -I$(DESTDIR) -I$(SW_COMMON_DIR) -I$(RT_COMMON_DIR)
+CXXFLAGS += -DXLEN_$(XLEN)
+CXXFLAGS += -fPIC
+CXXFLAGS += $(CONFIGS)
+
+# OPAE-shaped MMIO constants come from the generated vortex_opae.h
+# at build/sw/ (already on the include path via -I$(ROOT_DIR)/sw).
+# vortex.cpp does `#include <vortex_opae.h>` for the AFU_IMAGE_*
+# defines. Unlike sw/runtime/opae/Makefile we do NOT call
+# afu_json_mgr — configure already generated the header from
+# vortex_opae.toml at build time.
+
+# Per-arch compiler selection. The cross-compilers are sysroot-aware
+# (Ubuntu's gcc-aarch64-linux-gnu ships the matching libstdc++); no
+# extra --sysroot flags needed.
+#
+# Cross-compiled outputs land in $(DESTDIR)/$(HOST_ARCH)/ alongside
+# the stub's libvortex.so (also cross-compiled). The simulated ARM
+# process's LD_LIBRARY_PATH points at that one dir to find both.
+ifeq ($(HOST_ARCH),x86_64)
+    CXX := g++
+    ARCH_SUFFIX := x86_64
+    OUT_DIR := $(DESTDIR)
+else ifeq ($(HOST_ARCH),aarch64)
+    CXX := aarch64-linux-gnu-g++
+    ARCH_SUFFIX := aarch64
+    OUT_DIR := $(DESTDIR)/aarch64
+else ifeq ($(HOST_ARCH),armhf)
+    CXX := arm-linux-gnueabihf-g++
+    ARCH_SUFFIX := armhf
+    OUT_DIR := $(DESTDIR)/armhf
+else
+    $(error HOST_ARCH must be one of: x86_64, aarch64, armhf (got $(HOST_ARCH)))
+endif
+
+LDFLAGS += -shared -pthread
+
+SRCS = $(SRC_DIR)/vortex.cpp $(SRC_DIR)/driver.cpp $(RT_COMMON_DIR)/utils.cpp
+
+# Debug / release
+ifdef DEBUG
+    CXXFLAGS += -g -O0
+else
+    CXXFLAGS += -O2 -DNDEBUG
+endif
+
+PROJECT := libvortex-gem5-$(ARCH_SUFFIX).so
+
+.PHONY: all clean
+
+all: $(OUT_DIR)/$(PROJECT)
+
+$(OUT_DIR)/$(PROJECT): $(SRCS)
+	@mkdir -p $(OUT_DIR)
+	$(CXX) $(CXXFLAGS) $(SRCS) $(LDFLAGS) -Wl,-soname,$(PROJECT) -o $@
+
+clean:
+	rm -f $(DESTDIR)/libvortex-gem5-*.so
+	rm -f $(DESTDIR)/aarch64/libvortex-gem5-*.so
+	rm -f $(DESTDIR)/armhf/libvortex-gem5-*.so
diff --git a/sw/runtime/gem5/driver.cpp b/sw/runtime/gem5/driver.cpp
new file mode 100644
index 0000000000..3fc76e719a
--- /dev/null
+++ b/sw/runtime/gem5/driver.cpp
@@ -0,0 +1,128 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+#include "driver.h"
+
+#include <cstdio>
+#include <cstdint>
+#include <cstring>
+#include <unordered_map>
+
+namespace vortex {
+
+namespace {
+
+// Trivial bump allocator for the pinned region. A real implementation
+// would use a free-list; for now this is the simplest thing that lets
+// upload/download cache a single staging buffer indefinitely.
+struct PinAllocator {
+    uintptr_t base = PIN_BASE_ADDR;
+    uintptr_t cur  = PIN_BASE_ADDR;
+    std::unordered_map<uintptr_t, uint64_t> live;  // ptr → size, erased by release()
+
+    int allocate(uint64_t size, void** host_ptr, uint64_t* ioaddr) {
+        // Cache-line align (64) to match the OPAE staging-buffer model.
+        const uint64_t aligned = (size + 63) & ~uint64_t(63);
+        if (cur + aligned > base + PIN_REGION_SIZE) {
+            std::fprintf(stderr,
+                         "[VXDRV-gem5] pin region OOM: requested %llu, "
+                         "available %llu\n",
+                         (unsigned long long)aligned,
+                         (unsigned long long)(base + PIN_REGION_SIZE - cur));
+            return -1;
+        }
+        const uintptr_t ptr = cur;
+        cur += aligned;
+        live.emplace(ptr, aligned);
+        *host_ptr = reinterpret_cast<void*>(ptr);
+        *ioaddr   = static_cast<uint64_t>(ptr);  // identity v→p (see driver.h)
+        return 0;
+    }
+
+    void release(void* host_ptr) {
+        // Trivial allocator: no reclaim until close(). The legacy OPAE
+        // driver's `ensure_staging` recycles its single buffer the same
+        // way; this is fine for the OPAE-shaped workload (one staging
+        // buffer per device handle, grown on demand).
+        live.erase(reinterpret_cast<uintptr_t>(host_ptr));
+    }
+
+    void reset() { cur = base; live.clear(); }
+};
+
+PinAllocator g_pin;
+bool         g_inited = false;
+
+} // namespace
+
+int drv_init() {
+    if (g_inited) return 0;
+    // The two fixed regions (PIO and PIN) are expected to be already
+    // mapped by the gem5 SE-mode setup before this binary runs. We do
+    // NOT call mmap() here because SE-mode has no /dev/vortex; the
+    // Python config arranges the address space directly.
+    //
+    // If/when this runtime is ported to a real OS with a kernel driver,
+    // drv_init() will become an open("/dev/vortex_gem5") + mmap() pair.
+    g_inited = true;
+    g_pin.reset();
+    return 0;
+}
+
+void drv_close() {
+    if (!g_inited) return;
+    g_pin.reset();
+    g_inited = false;
+}
+
+uint64_t mmio_read64(uint64_t offset) {
+    auto* p = reinterpret_cast<volatile uint64_t*>(PIO_BASE_ADDR + offset);
+    return *p;
+}
+
+void mmio_write64(uint64_t offset, uint64_t value) {
+    auto* p = reinterpret_cast<volatile uint64_t*>(PIO_BASE_ADDR + offset);
+    *p = value;
+}
+
+// Memory barrier before kicking a command. The host CPU model in
+// gem5 (especially out-of-order variants like O3CPU) can reorder
+// MMIO writes; the runtime must publish the args before the
+// CMD_TYPE write or the device sees stale/uninitialized args. B14
+// in the proposal's bug catalog calls this out explicitly.
+void mmio_fence() {
+#if defined(__x86_64__) || defined(__i386__)
+    __asm__ __volatile__ ("mfence" ::: "memory");
+#elif defined(__aarch64__) || defined(__arm__)
+    __asm__ __volatile__ ("dmb sy" ::: "memory");
+#else
+    // Fall back to a compiler-only fence. Untested architectures
+    // should add their own asm.
+    __asm__ __volatile__ ("" ::: "memory");
+#endif
+}
+
+int drv_pin_buffer(uint64_t size, void** host_ptr, uint64_t* ioaddr) {
+    if (!g_inited) {
+        std::fprintf(stderr, "[VXDRV-gem5] drv_pin_buffer called before drv_init\n");
+        return -1;
+    }
+    return g_pin.allocate(size, host_ptr, ioaddr);
+}
+
+void drv_release_buffer(void* host_ptr) {
+    if (!g_inited) return;
+    g_pin.release(host_ptr);
+}
+
+} // namespace vortex
diff --git a/sw/runtime/gem5/driver.h b/sw/runtime/gem5/driver.h
new file mode 100644
index 0000000000..6faa36b301
--- /dev/null
+++ b/sw/runtime/gem5/driver.h
@@ -0,0 +1,73 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// Direct-MMIO driver for the gem5 VortexGPGPU device.
+//
+// Replaces the libopae abstraction layer used by sw/runtime/opae/.
+// Inside a gem5 SE-mode process, we access the device by:
+//   1. Reading/writing MMIO registers via a fixed virtual address that
+//      the gem5 Python config maps to the device's PIO range
+//      (PIO_BASE_ADDR below; default 0x20000000 matches the legacy
+//      capstone paper).
+//   2. DMA staging through a fixed pinned region that the Python
+//      config maps with identity virtual→physical addressing
+//      (PIN_BASE_ADDR; default 0x10000000). The runtime uses host
+//      virtual addresses; the gem5 DmaPort sees the same value as
+//      physical because of the identity mapping.
+//
+// Phase 5 covers the gem5-side wiring of these mappings; Phase 4 just
+// produces the runtime library.
+
+#pragma once
+
+#include <stddef.h>
+#include <stdint.h>
+
+namespace vortex {
+
+// Fixed virtual addresses the runtime expects to find mapped by the
+// gem5 Python config. PIN_BASE_ADDR is the runtime's heap for DMA
+// staging buffers; PIO_BASE_ADDR is the device's MMIO command-and-
+// status window. PIN_REGION_SIZE is a cap the pin allocator enforces;
+// PIO_REGION_SIZE documents the window (mmio_* does not bounds-check).
+constexpr uintptr_t PIN_BASE_ADDR    = 0x10000000ull;
+constexpr size_t    PIN_REGION_SIZE  = 0x10000000ull;  // 256 MB
+constexpr uintptr_t PIO_BASE_ADDR    = 0x20000000ull;
+constexpr size_t    PIO_REGION_SIZE  = 0x1000ull;      // 4 KB (1 page)
+
+// Init / shutdown. Neither maps memory (the gem5 Python config does that);
+// both only reset the pin allocator. Idempotent, but should be paired 1:1.
+int  drv_init();
+void drv_close();
+
+// MMIO register access. Offsets are byte offsets into the device's
+// PIO range; values are written/read 64-bit at a time (the OPAE
+// protocol's natural width). mmio_fence() emits the right barrier
+// for HOST_ARCH (mfence on x86, dmb sy on AArch64/ARMv7) — call
+// before triggering a command (B14 in proposal §2.2).
+uint64_t mmio_read64 (uint64_t offset);
+void     mmio_write64(uint64_t offset, uint64_t value);
+void     mmio_fence();
+
+// Staging-buffer allocation in the pinned region. Returns 0 on
+// success and fills *host_ptr + *ioaddr; returns -1 on OOM in the
+// pinned region. Caller owns the slot until drv_release_buffer.
+//
+// Under Phase 5's identity v→p mapping, *host_ptr == *ioaddr; on a
+// future setup with non-identity mapping, *ioaddr is the value the
+// device must DMA against and *host_ptr is what the runtime writes
+// through.
+int  drv_pin_buffer    (uint64_t size, void** host_ptr, uint64_t* ioaddr);
+void drv_release_buffer(void* host_ptr);
+
+} // namespace vortex
diff --git a/sw/runtime/gem5/vortex.cpp b/sw/runtime/gem5/vortex.cpp
new file mode 100644
index 0000000000..92d793e4f8
--- /dev/null
+++ b/sw/runtime/gem5/vortex.cpp
@@ -0,0 +1,334 @@
+// Copyright © 2019-2023
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+// gem5 host runtime backend. Provides the standard Vortex `vx_*`
+// C API (declared in sw/runtime/include/vortex.h) on top of the
+// OPAE-shaped MMIO command protocol talking to the gem5 VortexGPGPU
+// device through driver.{cpp,h}.
+//
+// Shape mirrors sw/runtime/opae/vortex.cpp but is simpler:
+//   - No libopae dispatch; driver.h's mmio_{read,write}64 talks
+//     directly to PIO_BASE_ADDR.
+//   - No UUID enumeration / fpga_token dance — the gem5 device is
+//     always at the fixed PIO range.
+//   - Device caps come from compile-time VX_config.h macros (the
+//     host runtime and the device library are built from the same
+//     source tree, so they agree by construction).
+//   - mmio_fence() before every CMD_TYPE write (B14 in proposal §2.2).
+
+#include <common.h>
+#include <util.h>          // log2floor / log2ceil / is_aligned / aligned_size
+#include "driver.h"
+
+#include <vortex_opae.h>
+#include <sched.h>         // sched_yield (gem5 SE-mode-safe back-off)
+
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <ctime>
+#include <iostream>
+#include <sstream>
+#include <unordered_map>
+
+using namespace vortex;
+
+// MMIO offsets (byte addresses). Sourced from vortex_opae.h's
+// AFU_IMAGE_MMIO_* DWORD offsets times 4. Same layout as
+// sw/runtime/opae/vortex.cpp:47–56.
+#define CMD_MEM_READ     AFU_IMAGE_CMD_MEM_READ
+#define CMD_MEM_WRITE    AFU_IMAGE_CMD_MEM_WRITE
+#define CMD_RUN          AFU_IMAGE_CMD_RUN
+#define CMD_DCR_WRITE    AFU_IMAGE_CMD_DCR_WRITE
+#define CMD_DCR_READ     AFU_IMAGE_CMD_DCR_READ
+
+#define MMIO_CMD_TYPE    (AFU_IMAGE_MMIO_CMD_TYPE * 4)
+#define MMIO_CMD_ARG0    (AFU_IMAGE_MMIO_CMD_ARG0 * 4)
+#define MMIO_CMD_ARG1    (AFU_IMAGE_MMIO_CMD_ARG1 * 4)
+#define MMIO_CMD_ARG2    (AFU_IMAGE_MMIO_CMD_ARG2 * 4)
+#define MMIO_STATUS      (AFU_IMAGE_MMIO_STATUS * 4)
+#define MMIO_DCR_RSP     (AFU_IMAGE_MMIO_DCR_RSP * 4)
+
+#define STATUS_STATE_BITS 8
+
+// Issue a CMD_TYPE write. Centralised so the memory barrier before
+// the trigger MMIO is impossible to forget (B14). All callers must
+// have written ARG0/1/2 first.
+static inline void issue_cmd(uint64_t cmd) {
+    mmio_fence();
+    mmio_write64(MMIO_CMD_TYPE, cmd);
+}
+
+///////////////////////////////////////////////////////////////////////////////
+
+class vx_device {
+public:
+    vx_device()
+        : global_mem_(ALLOC_BASE_ADDR,
+                      GLOBAL_MEM_SIZE - ALLOC_BASE_ADDR,
+                      RAM_PAGE_SIZE,
+                      CACHE_BLOCK_SIZE),
+          staging_ioaddr_(0),
+          staging_ptr_(nullptr),
+          staging_size_(0) {}
+
+    ~vx_device() {
+        if (staging_ptr_ != nullptr) {
+            drv_release_buffer(staging_ptr_);
+            staging_ptr_   = nullptr;
+            staging_size_  = 0;
+            staging_ioaddr_ = 0;
+        }
+        drv_close();
+    }
+
+    int init() {
+        if (drv_init() != 0) {
+            std::fprintf(stderr, "[VXDRV] drv_init failed\n");
+            return -1;
+        }
+        return 0;
+    }
+
+    // Compile-time capability table. Mirrors
+    // sw/runtime/simx/vortex.cpp:51–103: the runtime and the SimX-side
+    // device library share a build tree, so the same VX_config.h
+    // macros are authoritative on both sides.
+    int get_caps(uint32_t caps_id, uint64_t *value) {
+        switch (caps_id) {
+        case VX_CAPS_VERSION:         *value = IMPLEMENTATION_ID; break;
+        case VX_CAPS_NUM_THREADS:     *value = NUM_THREADS; break;
+        case VX_CAPS_NUM_WARPS:       *value = NUM_WARPS; break;
+        case VX_CAPS_NUM_CORES:       *value = NUM_CORES * NUM_CLUSTERS; break;
+        case VX_CAPS_NUM_CLUSTERS:    *value = NUM_CLUSTERS; break;
+        case VX_CAPS_SOCKET_SIZE:     *value = SOCKET_SIZE; break;
+        case VX_CAPS_ISSUE_WIDTH:     *value = ISSUE_WIDTH; break;
+        case VX_CAPS_CACHE_LINE_SIZE: *value = CACHE_BLOCK_SIZE; break;
+        case VX_CAPS_GLOBAL_MEM_SIZE: *value = GLOBAL_MEM_SIZE; break;
+        case VX_CAPS_LOCAL_MEM_SIZE:  *value = (1 << LMEM_LOG_SIZE); break;
+        case VX_CAPS_ISA_FLAGS:
+            *value = ((uint64_t(MISA_EXT)) << 32)
+                   | ((log2floor(XLEN) - 4) << 30)
+                   |   MISA_STD;
+            break;
+        case VX_CAPS_NUM_MEM_BANKS:   *value = PLATFORM_MEMORY_NUM_BANKS; break;
+        case VX_CAPS_MEM_BANK_SIZE:   *value = 1ull << (MEM_ADDR_WIDTH / PLATFORM_MEMORY_NUM_BANKS); break;
+        case VX_CAPS_CLOCK_RATE:      *value = 0; break;
+        case VX_CAPS_PEAK_MEM_BW:     *value = PLATFORM_MEMORY_PEAK_BW; break;
+        default:
+            std::fprintf(stderr, "[VXDRV] invalid caps id: %u\n", caps_id);
+            return -1;
+        }
+        return 0;
+    }
+
+    int mem_alloc(uint64_t size, int flags, uint64_t *dev_addr) {
+        uint64_t addr;
+        CHECK_ERR(global_mem_.allocate(size, &addr), { return err; });
+        CHECK_ERR(this->mem_access(addr, size, flags), {
+            global_mem_.release(addr);
+            return err;
+        });
+        *dev_addr = addr;
+        return 0;
+    }
+
+    int mem_reserve(uint64_t dev_addr, uint64_t size, int flags) {
+        CHECK_ERR(global_mem_.reserve(dev_addr, size), { return err; });
+        CHECK_ERR(this->mem_access(dev_addr, size, flags), {
+            global_mem_.release(dev_addr);
+            return err;
+        });
+        return 0;
+    }
+
+    int mem_free(uint64_t dev_addr) {
+        return global_mem_.release(dev_addr);
+    }
+
+    int mem_access(uint64_t /*dev_addr*/, uint64_t /*size*/, int /*flags*/) {
+        // Access control is enforced by the device's RAM ACL (in
+        // libvortex-gem5.so). The host runtime has nothing to do here.
+        return 0;
+    }
+
+    int mem_info(uint64_t *mem_free, uint64_t *mem_used) const {
+        if (mem_free) *mem_free = global_mem_.free();
+        if (mem_used) *mem_used = global_mem_.allocated();
+        return 0;
+    }
+
+    int copy(uint64_t /*dest*/, uint64_t /*src*/, uint64_t /*size*/) {
+        // Device-to-device copy not in the OPAE command set (no
+        // CMD_MEM_COPY); the OPAE FPGA path goes through libopae's
+        // fpgaCopyBuffer which we don't have. Leave unimplemented
+        // for Phase 4; can be added by extending the device with a
+        // new CMD type in a later phase.
+        std::fprintf(stderr, "[VXDRV] copy() not supported in gem5 backend\n");
+        return -1;
+    }
+
+    int upload(uint64_t dev_addr, const void *host_ptr, uint64_t size) {
+        if (!is_aligned(dev_addr, CACHE_BLOCK_SIZE)) return -1;
+        const uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
+        if (dev_addr + asize > GLOBAL_MEM_SIZE) return -1;
+
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+        if (this->ensure_staging(asize) != 0)     return -1;
+
+        std::memcpy(staging_ptr_, host_ptr, size);
+
+        const auto ls_shift = log2ceil(CACHE_BLOCK_SIZE);
+        mmio_write64(MMIO_CMD_ARG0, staging_ioaddr_ >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG1, dev_addr        >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG2, asize           >> ls_shift);
+        issue_cmd(CMD_MEM_WRITE);
+
+        return this->ready_wait(VX_MAX_TIMEOUT);
+    }
+
+    int download(void *host_ptr, uint64_t dev_addr, uint64_t size) {
+        if (!is_aligned(dev_addr, CACHE_BLOCK_SIZE)) return -1;
+        const uint64_t asize = aligned_size(size, CACHE_BLOCK_SIZE);
+        if (dev_addr + asize > GLOBAL_MEM_SIZE) return -1;
+
+        // Drain dirty cache lines all the way to VRAM before reading
+        // back, per B9 in proposal §2.2. One DCR_READ on the magic
+        // cache-flush DCR per core; the device routes it through
+        // Processor::flush_caches().
+        {
+            uint64_t num_cores;
+            CHECK_ERR(this->get_caps(VX_CAPS_NUM_CORES, &num_cores), { return err; });
+            uint32_t dummy;
+            for (uint32_t cid = 0; cid < (uint32_t)num_cores; ++cid) {
+                CHECK_ERR(this->dcr_read(VX_DCR_BASE_CACHE_FLUSH, cid, &dummy),
+                          { return err; });
+            }
+        }
+
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+        if (this->ensure_staging(asize) != 0)     return -1;
+
+        const auto ls_shift = log2ceil(CACHE_BLOCK_SIZE);
+        mmio_write64(MMIO_CMD_ARG0, staging_ioaddr_ >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG1, dev_addr        >> ls_shift);
+        mmio_write64(MMIO_CMD_ARG2, asize           >> ls_shift);
+        issue_cmd(CMD_MEM_READ);
+
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+
+        std::memcpy(host_ptr, staging_ptr_, size);
+        return 0;
+    }
+
+    int start() {
+        issue_cmd(CMD_RUN);
+        return 0;
+    }
+
+    // Poll MMIO_STATUS; the high bits carry stdout/stderr text from
+    // device-side printf, same protocol as sw/runtime/opae/vortex.cpp.
+    // Back off with sched_yield() between polls: gem5 SE-mode does
+    // not implement clock_nanosleep (which glibc's nanosleep() routes
+    // through), while sched_yield is on the syscall_tbl64 ignore list
+    // and returns immediately. `timeout` therefore counts poll
+    // iterations, not milliseconds, despite the step_ms name.
+    int ready_wait(uint64_t timeout) {
+        std::unordered_map<uint32_t, std::stringstream> print_bufs;
+        const uint64_t step_ms = 1;
+
+        for (;;) {
+            uint64_t status = mmio_read64(MMIO_STATUS);
+
+            // Drain any console data the device produced.
+            uint32_t cout_data = status >> STATUS_STATE_BITS;
+            if (cout_data & 0x1) {
+                do {
+                    const char     cout_char = (cout_data >> 1) & 0xff;
+                    const uint32_t cout_tid  = (cout_data >> 9) & 0xff;
+                    auto& ss = print_bufs[cout_tid];
+                    ss << cout_char;
+                    if (cout_char == '\n') {
+                        std::cout << std::dec << "#" << cout_tid
+                                  << ": " << ss.str() << std::flush;
+                        ss.str("");
+                    }
+                    status = mmio_read64(MMIO_STATUS);
+                    cout_data = status >> STATUS_STATE_BITS;
+                } while (cout_data & 0x1);
+            }
+
+            const uint32_t state = status & ((1 << STATUS_STATE_BITS) - 1);
+            if (state == 0 || timeout == 0) {
+                for (auto& kv : print_bufs) {
+                    auto s = kv.second.str();
+                    if (!s.empty()) {
+                        std::cout << "#" << kv.first << ": " << s << std::endl;
+                    }
+                }
+                if (state != 0) {
+                    std::fprintf(stderr, "[VXDRV] ready-wait timed out: state=%u\n", state);
+                    return -1;
+                }
+                return 0;
+            }
+
+            sched_yield();
+            timeout -= step_ms;
+        }
+    }
+
+    int dcr_write(uint32_t addr, uint32_t value) {
+        mmio_write64(MMIO_CMD_ARG0, addr);
+        mmio_write64(MMIO_CMD_ARG1, value);
+        issue_cmd(CMD_DCR_WRITE);
+        return 0;
+    }
+
+    int dcr_read(uint32_t addr, uint32_t tag, uint32_t *value) {
+        mmio_write64(MMIO_CMD_ARG0, addr);
+        mmio_write64(MMIO_CMD_ARG1, tag);
+        issue_cmd(CMD_DCR_READ);
+        if (this->ready_wait(VX_MAX_TIMEOUT) != 0) return -1;
+        *value = static_cast<uint32_t>(mmio_read64(MMIO_DCR_RSP));
+        return 0;
+    }
+
+private:
+    int ensure_staging(uint64_t size) {
+        if (staging_size_ >= size) return 0;
+        if (staging_ptr_ != nullptr) {
+            drv_release_buffer(staging_ptr_);
+            staging_ptr_   = nullptr;
+            staging_size_  = 0;
+            staging_ioaddr_ = 0;
+        }
+        if (drv_pin_buffer(size, reinterpret_cast<void**>(&staging_ptr_),
+                           &staging_ioaddr_) != 0) {
+            return -1;
+        }
+        staging_size_ = size;
+        return 0;
+    }
+
+    MemoryAllocator global_mem_;
+    uint64_t staging_ioaddr_;
+    uint8_t* staging_ptr_;
+    uint64_t staging_size_;
+};
+
+#include <callbacks.inc>
diff --git a/sw/runtime/stub/Makefile b/sw/runtime/stub/Makefile
index 64413680c7..895fa8466b 100644
--- a/sw/runtime/stub/Makefile
+++ b/sw/runtime/stub/Makefile
@@ -1,5 +1,13 @@
 include ../common.mk
 
+# HOST_ARCH switch — when building for a non-native simulated host
+# (e.g. running x86 gem5 with an aarch64 simulated CPU), select the
+# matching cross-compiler. Aligns with sw/runtime/gem5/Makefile's
+# HOST_ARCH knob; cross-arch builds land in $(DESTDIR)/$(HOST_ARCH)/
+# so the same dlopen target name (libvortex.so) can coexist with the
+# native build in $(DESTDIR)/.
+HOST_ARCH ?= x86_64
+
 DESTDIR ?= $(CURDIR)/..
 
 SRC_DIR := $(VORTEX_HOME)/sw/runtime/stub
@@ -10,6 +18,19 @@ CXXFLAGS += -fPIC
 
 LDFLAGS += -shared -pthread -ldl -Wl,-soname,libvortex.so
 
+ifeq ($(HOST_ARCH),x86_64)
+    CXX := g++
+    OUT_DIR := $(DESTDIR)
+else ifeq ($(HOST_ARCH),aarch64)
+    CXX := aarch64-linux-gnu-g++
+    OUT_DIR := $(DESTDIR)/aarch64
+else ifeq ($(HOST_ARCH),armhf)
+    CXX := arm-linux-gnueabihf-g++
+    OUT_DIR := $(DESTDIR)/armhf
+else
+    $(error HOST_ARCH must be one of: x86_64, aarch64, armhf (got $(HOST_ARCH)))
+endif
+
 SRCS := $(SRC_DIR)/vortex.cpp $(SRC_DIR)/utils.cpp $(SRC_DIR)/perf.cpp $(RT_COMMON_DIR)/utils.cpp
 
 # Debugging
@@ -21,12 +42,13 @@ endif
 
 PROJECT := libvortex.so
 
-all: $(DESTDIR)/$(PROJECT)
+all: $(OUT_DIR)/$(PROJECT)
 
-$(DESTDIR)/$(PROJECT): $(SRCS)
+$(OUT_DIR)/$(PROJECT): $(SRCS)
+	@mkdir -p $(OUT_DIR)
 	$(CXX) $(CXXFLAGS) $^ $(LDFLAGS) -o $@
 
 clean:
-	rm -f $(DESTDIR)/$(PROJECT)
+	rm -f $(DESTDIR)/$(PROJECT) $(DESTDIR)/aarch64/$(PROJECT) $(DESTDIR)/armhf/$(PROJECT)
 
 .PHONY: all clean
\ No newline at end of file
diff --git a/tests/regression/common.mk b/tests/regression/common.mk
index 536fcd6f85..6484ed1e0f 100644
--- a/tests/regression/common.mk
+++ b/tests/regression/common.mk
@@ -83,7 +83,39 @@ CXXFLAGS += -std=c++17 -Wall -Wextra -pedantic -Wfatal-errors -Werror
 CXXFLAGS += -I$(VORTEX_HOME)/sw/runtime/include -I$(ROOT_DIR)/sw -I$(ROOT_DIR)/hw -I$(SW_COMMON_DIR)
 CXXFLAGS += $(CONFIGS)
 
-LDFLAGS += -L$(VORTEX_RT_LIB) -lvortex
+# HOST_ARCH selects the simulated-host compiler for the test binary
+# (the .vxbin always builds with the RISC-V toolchain regardless).
+# When non-native, the binary is suffixed (e.g. vecadd-aarch64) and
+# we link against the cross-compiled stub in $(VORTEX_RT_LIB)/$(HOST_ARCH)/.
+# Aligns with sw/runtime/{stub,gem5}/Makefile's HOST_ARCH knob; the
+# gem5 ARM e2e test path uses this to produce aarch64 binaries that
+# the simulated ARM CPU inside gem5 can execute.
+#
+# Cross-compiled ELFs embed `/lib/ld-linux-$arch.so.1` as the dynamic
+# linker (PT_INTERP). The machine running gem5 lacks that path, but
+# gem5 has a setInterpDir() API that prepends a sysroot to the
+# interpreter lookup — the gem5 Python config calls that when
+# DRIVER=gem5-aarch64. Keep the default INTERP here so that mechanism
+# can do the redirection cleanly. (Earlier versions used
+# `-Wl,--dynamic-linker=` to rewrite PT_INTERP, but that interacts
+# badly with setInterpDir's prepend logic.)
+HOST_ARCH ?= x86_64
+ifeq ($(HOST_ARCH),x86_64)
+    PROJECT_SUFFIX :=
+    RT_LIB_DIR := $(VORTEX_RT_LIB)
+else ifeq ($(HOST_ARCH),aarch64)
+    CXX := aarch64-linux-gnu-g++
+    PROJECT_SUFFIX := -aarch64
+    RT_LIB_DIR := $(VORTEX_RT_LIB)/aarch64
+else ifeq ($(HOST_ARCH),armhf)
+    CXX := arm-linux-gnueabihf-g++
+    PROJECT_SUFFIX := -armhf
+    RT_LIB_DIR := $(VORTEX_RT_LIB)/armhf
+else
+    $(error HOST_ARCH must be one of: x86_64, aarch64, armhf (got $(HOST_ARCH)))
+endif
+
+LDFLAGS += -L$(RT_LIB_DIR) -lvortex
 
 # Debugging
 ifdef DEBUG
@@ -106,7 +138,11 @@ endif
 
 CONFIG_STAMP = config.stamp
 
-all: $(PROJECT) kernel.vxbin kernel.dump
+# HOST_ARCH-suffixed binary name (vecadd, vecadd-aarch64, …) so
+# x86 and cross-compiled variants coexist in the same dir.
+APP := $(PROJECT)$(PROJECT_SUFFIX)
+
+all: $(APP) kernel.vxbin kernel.dump
 
 # Force rebuild when CONFIGS (defines) change between runs.
 $(CONFIG_STAMP): FORCE
@@ -146,9 +182,16 @@ kernel.elf: vx_start.o $(VX_SRCS) $(VORTEX_KN_PATH)/lib$(KERNEL_LIB).a $(CONFIG_
 	$(VX_CXX) $(VX_CFLAGS) vx_start.o $(VX_APP_OBJS) $(VX_LDFLAGS) -o $@
 endif
 
-$(PROJECT): $(SRCS) $(VORTEX_RT_LIB)/libvortex.so $(CONFIG_STAMP)
+$(APP): $(SRCS) $(RT_LIB_DIR)/libvortex.so $(CONFIG_STAMP)
 	$(CXX) $(CXXFLAGS) $(filter-out $(CONFIG_STAMP),$^) $(LDFLAGS) -o $@
 
+# Cross-compiled stub for non-native HOST_ARCH. Native (x86_64)
+# is built by $(VORTEX_RT_LIB)/libvortex.so rule below.
+ifneq ($(HOST_ARCH),x86_64)
+$(RT_LIB_DIR)/libvortex.so:
+	$(RUNTIME_ARGS) $(MAKE) -C $(VORTEX_RT_SRC)/stub HOST_ARCH=$(HOST_ARCH) DESTDIR=$(VORTEX_RT_LIB)
+endif
+
 run-simx: $(PROJECT) kernel.vxbin
 	$(RUNTIME_ARGS) $(MAKE) -C $(VORTEX_RT_SRC)/simx DESTDIR=$(VORTEX_RT_LIB)
 	LD_LIBRARY_PATH=$(VORTEX_RT_LIB):$(LD_LIBRARY_PATH) VORTEX_DRIVER=simx ./$(PROJECT) $(OPTS)

From dc419a3914dc8e74d166470e505794b0e4bf5ff1 Mon Sep 17 00:00:00 2001
From: tinebp <tinebp@yahoo.com>
Date: Mon, 18 May 2026 05:39:33 -0700
Subject: [PATCH 2/2] ci: consolidate gem5+SST test runners under parallel
 hostless/e2e names
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Rename and generalize the per-backend gem5 and SST test runners so they
share a uniform env-var interface and a common naming convention that
makes the mode (hostless vs. e2e) explicit.

Naming layout (hostless = no host CPU; e2e = host CPU + dispatcher + CP,
the Command Processor):

                  hostless                     e2e
    gem5    gem5_run_hostless_app.py    gem5_run_app.py
    SST     sst_run_hostless_app.py     (reserved; no SST CPU integration today)

gem5 changes:
- ci/gem5_test_vortex_hello.py → ci/gem5_run_hostless_app.py: parameterized
  by VORTEX_GEM5_DEV_LIB + VORTEX_TEST_DIR + VORTEX_TEST_KERNEL (default
  kernel.vxbin), replacing the old VORTEX_GEM5_LIB + VORTEX_GEM5_KERNEL
  variables. Any regression test's kernel.vxbin can now run hostless
  without its host binary.
- ci/gem5_test_vortex_app.py → ci/gem5_run_app.py (rename only).

SST changes:
- Collapse 4 hardcoded stubs (sst_test_vortex_{hello,fibonacci,vecadd,
  conform}.py) into ci/sst_run_hostless_app.py — same env-var
  interface as the gem5 hostless runner.
- Delete ci/sst_test_vortex_memHierarchy.py: not called by regression
  and the wiring recipe is preserved in
  docs/proposals/sst_simx_v3_proposal.md §6.
- Verified that USE_SST=1 builds cleanly post-merge; the full SST
  regression matrix (hello / fibonacci / vecadd / conform) passes
  through ci/sst_run_hostless_app.py.

Other cleanups:
- ci/regression.sh.in: rewrite gem5() + sst() entries against the new
  runner names + env vars.
- docs/gem5_integration.md: update both invocation examples and the
  reference-implementations list.
- docs/proposals/sst_simx_v3_proposal.md: add an "Implemented" status
  note recording the runner consolidation + the reserved sst_run_app.py
  slot for a future host-CPU SST integration.
- docs/proposals/gem5_v2_cp_migration_proposal.md: update validation
  reference to the new runner filename.
- sw/runtime/gem5/Makefile: drop stale vortex_opae.h / AFU_IMAGE_*
  Makefile comment block (the runtime no longer includes vortex_opae.h
  after the pure-v2 callbacks redesign).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
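Reviewer note (not part of the commit message): the env-var contract that
both hostless runners now share can be sketched as below. This is a minimal
sketch, assuming a helper named `resolve_kernel` that does not exist in the
tree; the runners inline the equivalent logic. The `require_dev_lib` flag is
likewise hypothetical and only marks the difference between the gem5 runner
(which needs VORTEX_GEM5_DEV_LIB) and the SST runner (which does not).

```python
def resolve_kernel(env, require_dev_lib=False):
    """Return (dev_lib, kernel_path) from the hostless-runner env vars.

    Hypothetical helper for illustration only; `env` is any mapping
    shaped like os.environ.
    """
    required = ["VORTEX_TEST_DIR"]
    if require_dev_lib:
        # Only the gem5 hostless runner needs the device library path.
        required.insert(0, "VORTEX_GEM5_DEV_LIB")
    for name in required:
        if not env.get(name):
            raise RuntimeError(f"{name} env var is required")
    # Default follows the regression-test convention (kernel.vxbin).
    kernel = env.get("VORTEX_TEST_KERNEL", "kernel.vxbin")
    kernel_path = f"{env['VORTEX_TEST_DIR']}/{kernel}"
    return env.get("VORTEX_GEM5_DEV_LIB"), kernel_path
```

Passing `os.environ` as `env` reproduces what the runners read at startup;
a plain dict makes the same logic easy to exercise in isolation.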
 ...em5_test_vortex_app.py => gem5_run_app.py} |  0
 ...rtex_hello.py => gem5_run_hostless_app.py} | 65 +++++++++++--------
 ci/regression.sh.in                           | 59 ++++++++++-------
 ci/sst_run_hostless_app.py                    | 53 +++++++++++++++
 ci/sst_test_vortex_conform.py                 |  7 --
 ci/sst_test_vortex_fibonacci.py               |  7 --
 ci/sst_test_vortex_hello.py                   |  7 --
 ci/sst_test_vortex_memHierarchy.py            | 63 ------------------
 ci/sst_test_vortex_vecadd.py                  |  7 --
 docs/gem5_integration.md                      | 26 +++++---
 .../gem5_v2_cp_migration_proposal.md          |  2 +-
 docs/proposals/sst_simx_v3_proposal.md        |  2 +-
 sw/runtime/gem5/Makefile                      |  7 --
 13 files changed, 145 insertions(+), 160 deletions(-)
 rename ci/{gem5_test_vortex_app.py => gem5_run_app.py} (100%)
 rename ci/{gem5_test_vortex_hello.py => gem5_run_hostless_app.py} (53%)
 create mode 100644 ci/sst_run_hostless_app.py
 delete mode 100644 ci/sst_test_vortex_conform.py
 delete mode 100644 ci/sst_test_vortex_fibonacci.py
 delete mode 100644 ci/sst_test_vortex_hello.py
 delete mode 100644 ci/sst_test_vortex_memHierarchy.py
 delete mode 100644 ci/sst_test_vortex_vecadd.py

diff --git a/ci/gem5_test_vortex_app.py b/ci/gem5_run_app.py
similarity index 100%
rename from ci/gem5_test_vortex_app.py
rename to ci/gem5_run_app.py
diff --git a/ci/gem5_test_vortex_hello.py b/ci/gem5_run_hostless_app.py
similarity index 53%
rename from ci/gem5_test_vortex_hello.py
rename to ci/gem5_run_hostless_app.py
index c21ca78d39..65c92602cc 100644
--- a/ci/gem5_test_vortex_hello.py
+++ b/ci/gem5_run_hostless_app.py
@@ -11,25 +11,30 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# Standalone gem5 integration test for vortex.VortexGPGPU.
+# Hostless gem5 integration test for vortex.VortexGPGPU.
 #
-# The SimObject loads the kernel directly via its `kernel=` parameter
-# and runs it via its internal vortexTickEvent_ chain — no host CPU,
-# no CP, no PIO/DMA. Smoke-tests the gem5↔libvortex-gem5.so wiring:
-# dlopen succeeds, SimObject constructs, Processor::cycle() drives
-# from the gem5 event loop, sim exits cleanly.
+# The SimObject loads a .vxbin kernel directly via its `kernel=`
+# parameter and runs it via its internal vortexTickEvent_ chain — no
+# host CPU, no Command Processor, no PIO/DMA. Smoke-tests the
+# gem5↔libvortex-gem5.so wiring: dlopen succeeds, SimObject
+# constructs, Processor::cycle() drives from the gem5 event loop, sim
+# exits cleanly.
 #
-# The end-to-end variant ([gem5_test_vortex_app.py](gem5_test_vortex_app.py))
-# wires up the host CPU + CP regfile + BAR-mapped VRAM on top.
+# Hosted counterpart: [gem5_run_app.py](gem5_run_app.py) wires up the
+# host CPU + CP regfile + BAR-mapped VRAM on top.
 #
-# Configurable via env vars:
-#   VORTEX_GEM5_LIB    — path to libvortex-gem5.so (no default)
-#   VORTEX_GEM5_KERNEL — path to .vxbin to preload (no default)
+# Configurable via env vars (parallel to gem5_run_app.py):
+#   VORTEX_GEM5_DEV_LIB — path to libvortex-gem5.so (no default)
+#   VORTEX_TEST_DIR     — directory containing the kernel .vxbin
+#   VORTEX_TEST_KERNEL  — kernel filename inside that dir
+#                         (default: kernel.vxbin, matching the
+#                          regression-test convention)
 #
 # Run from the Vortex build dir as:
-#   VORTEX_GEM5_LIB=$PWD/sim/simx/libvortex-gem5.so \
-#   VORTEX_GEM5_KERNEL=$PWD/tests/kernel/hello/hello.vxbin \
-#   $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py
+#   VORTEX_GEM5_DEV_LIB=$PWD/sim/simx/libvortex-gem5.so \
+#   VORTEX_TEST_DIR=$PWD/tests/kernel/hello \
+#   VORTEX_TEST_KERNEL=hello.vxbin \
+#       $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_hostless_app.py
 
 import os
 import m5
@@ -45,12 +50,16 @@
     VortexGPGPU,
 )
 
-LIBRARY = os.environ.get("VORTEX_GEM5_LIB")
-KERNEL  = os.environ.get("VORTEX_GEM5_KERNEL")
-if not LIBRARY:
-    raise RuntimeError("VORTEX_GEM5_LIB env var is required")
-if not KERNEL:
-    raise RuntimeError("VORTEX_GEM5_KERNEL env var is required")
+DEV_LIB     = os.environ.get("VORTEX_GEM5_DEV_LIB")
+TEST_DIR    = os.environ.get("VORTEX_TEST_DIR")
+TEST_KERNEL = os.environ.get("VORTEX_TEST_KERNEL", "kernel.vxbin")
+
+for name, val in [("VORTEX_GEM5_DEV_LIB", DEV_LIB),
+                  ("VORTEX_TEST_DIR",     TEST_DIR)]:
+    if not val:
+        raise RuntimeError(f"{name} env var is required")
+
+KERNEL = f"{TEST_DIR}/{TEST_KERNEL}"
 
 # Minimal system: just enough to hang the VortexGPGPU off a membus
 # so gem5 considers it a properly-wired SimObject. No CPU in this
@@ -65,7 +74,7 @@
 # Membus + a small backing memory so PIO ranges have somewhere to bind.
 system.membus = SystemXBar()
 
-# Memory controller (unused at runtime in standalone mode but required
+# Memory controller (unused at runtime in hostless mode but required
 # for the system to instantiate cleanly).
 system.mem_ctrl = MemCtrl()
 system.mem_ctrl.dram = DDR3_1600_8x8()
@@ -75,9 +84,9 @@
 # The Vortex device. It inherits clock from the system clock domain
 # (set above to 1GHz) via ClockedObject; no explicit `clock=` param.
 system.vortex = VortexGPGPU(
-    library = LIBRARY,
+    library = DEV_LIB,
     kernel  = KERNEL,
-    # Explicitly disable the BAR-mapped VRAM range — the standalone
+    # Explicitly disable the BAR-mapped VRAM range — the hostless
     # path loads the kernel via the device library's load_kernel()
     # entry, never via host memcpy through PIN. Leaving it enabled
     # here would conflict with this test's DRAM range.
@@ -90,10 +99,10 @@
 root = Root(full_system=False, system=system)
 m5.instantiate()
 
-print(f"Standalone: VortexGPGPU library={LIBRARY}")
-print(f"Standalone: kernel={KERNEL}")
-print("Standalone: running until VortexGPGPU exits the sim loop...")
+print(f"Hostless: VortexGPGPU.library={DEV_LIB}")
+print(f"Hostless: kernel={KERNEL}")
+print("Hostless: running until VortexGPGPU exits the sim loop...")
 
 exit_event = m5.simulate()
-print(f"Standalone: exit_event.cause = {exit_event.getCause()!r}")
-print(f"Standalone: tick = {m5.curTick()}")
+print(f"Hostless: exit_event.cause = {exit_event.getCause()!r}")
+print(f"Hostless: tick = {m5.curTick()}")
diff --git a/ci/regression.sh.in b/ci/regression.sh.in
index d8c22e753a..5d1ddf82ae 100755
--- a/ci/regression.sh.in
+++ b/ci/regression.sh.in
@@ -95,10 +95,22 @@ sst()
 
     cp sim/simx/libvortex.so $SST_ELEMENTS_HOME/lib/sst-elements-library/   # alternatively - $ sst --add-lib-path `pwd` myConfig.py
 
-    sst ci/sst_test_vortex_hello.py
-    sst ci/sst_test_vortex_fibonacci.py
-    sst ci/sst_test_vortex_vecadd.py
-    sst ci/sst_test_vortex_conform.py
+    BUILD_DIR=$(pwd)
+
+    # Hostless SST runner (ci/sst_run_hostless_app.py) parameterized
+    # by VORTEX_TEST_DIR + VORTEX_TEST_KERNEL — same shape as
+    # ci/gem5_run_hostless_app.py. SST is hostless-only today (no
+    # CPU component wired to Vortex); the ci/sst_run_app.py name
+    # slot is reserved for a future host-CPU SST integration.
+    for spec in "hello:hello.vxbin" "fibonacci:fibonacci.vxbin" \
+                "vecadd:vecadd.vxbin" "conform:conform.vxbin"; do
+        kern="${spec%%:*}"
+        vxbin="${spec#*:}"
+        echo "=== sst: $kern ==="
+        VORTEX_TEST_DIR=$BUILD_DIR/tests/kernel/$kern \
+        VORTEX_TEST_KERNEL=$vxbin \
+            sst ci/sst_run_hostless_app.py
+    done
 
     echo "sst tests done!"
 }
@@ -146,22 +158,24 @@ gem5()
     LIB_GEM5_DEV=$BUILD_DIR/sim/simx/libvortex-gem5.so
     HOST_RT_DIR=$BUILD_DIR/sw/runtime
 
-    # Phase 3 standalone smoke — no host CPU, kernel preload.
-    # env-vars MUST precede the binary (gem5.opt would otherwise
-    # treat them as positional args).
-    VORTEX_GEM5_LIB=$LIB_GEM5_DEV \
-    VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \
+    # Hostless smoke — no host CPU, kernel preloaded via SimObject param.
+    # env-vars MUST precede the binary (gem5.opt would otherwise treat
+    # them as positional args).
+    echo "=== gem5 hostless: hello ==="
+    VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \
+    VORTEX_TEST_DIR=$BUILD_DIR/tests/kernel/hello \
+    VORTEX_TEST_KERNEL=hello.vxbin \
         timeout 120 $GEM5_HOME/build/X86/gem5.opt \
-        ci/gem5_test_vortex_hello.py
+        ci/gem5_run_hostless_app.py
 
-    # Phase 5 e2e — CP-driven path through the host runtime.
-    # Generic test runner (ci/gem5_test_vortex_app.py) parameterized
-    # by VORTEX_TEST_BIN + VORTEX_TEST_ARGS. Sizes are chosen so each
-    # run fits in the 120s per-test budget (feedback_test_timeout_120s):
+    # E2E — CP-driven path through the host runtime. Generic runner
+    # (ci/gem5_run_app.py) parameterized by VORTEX_TEST_BIN +
+    # VORTEX_TEST_ARGS. Sizes fit the 120s per-test budget
+    # (feedback_test_timeout_120s):
     #   - vecadd -n16   small vector add
     #   - sgemm  -n4    4x4 matrix multiply
-    # Larger sizes overrun the budget because the simulated host CPU's
-    # CP poll loop burns gem5 wall time proportional to kernel runtime.
+    # Larger sizes overrun because the simulated host CPU's CP poll
+    # loop burns gem5 wall time proportional to kernel runtime.
     # Run on local dev box for larger sizes by overriding VORTEX_TEST_ARGS.
     for spec in "vecadd:-n16" "sgemm:-n4"; do
         app="${spec%%:*}"
@@ -173,7 +187,7 @@ gem5()
         VORTEX_TEST_BIN=$app \
         VORTEX_TEST_ARGS=$args \
             timeout 120 $GEM5_HOME/build/X86/gem5.opt \
-            ci/gem5_test_vortex_app.py
+            ci/gem5_run_app.py
     done
 
     # ARM matrix (opt-in). The device library (libvortex-gem5.so) is
@@ -195,11 +209,12 @@ gem5()
 
         ARM_HOST_RT_DIR=$BUILD_DIR/sw/runtime/aarch64
 
-        echo "=== gem5 ARM standalone: hello ==="
-        VORTEX_GEM5_LIB=$LIB_GEM5_DEV \
-        VORTEX_GEM5_KERNEL=$BUILD_DIR/tests/kernel/hello/hello.vxbin \
+        echo "=== gem5 ARM hostless: hello ==="
+        VORTEX_GEM5_DEV_LIB=$LIB_GEM5_DEV \
+        VORTEX_TEST_DIR=$BUILD_DIR/tests/kernel/hello \
+        VORTEX_TEST_KERNEL=hello.vxbin \
             timeout 120 $GEM5_HOME/build/ARM/gem5.opt \
-            ci/gem5_test_vortex_hello.py
+            ci/gem5_run_hostless_app.py
 
         for spec in "vecadd:-n16" "sgemm:-n4"; do
             app="${spec%%:*}"
@@ -212,7 +227,7 @@ gem5()
             VORTEX_TEST_ARGS=$args \
             VORTEX_DRIVER=gem5-aarch64 \
                 timeout 120 $GEM5_HOME/build/ARM/gem5.opt \
-                ci/gem5_test_vortex_app.py
+                ci/gem5_run_app.py
         done
     fi
 
diff --git a/ci/sst_run_hostless_app.py b/ci/sst_run_hostless_app.py
new file mode 100644
index 0000000000..3f86188081
--- /dev/null
+++ b/ci/sst_run_hostless_app.py
@@ -0,0 +1,53 @@
+# Copyright © 2019-2023
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Hostless SST runner: instantiate a single vortex.VortexGPGPU
+# component and run the given kernel. SST runs Vortex co-resident in
+# one process, primes the KMU DCRs directly via proc_->dcr_write
+# inside sim/simx/sst/vortex_simulator.cpp, and ticks the simulation
+# to completion. No host CPU, no CP, no PIO/DMA.
+#
+# Hostless is the only mode the SST integration currently supports:
+# there is no SST CPU component (e.g. Ariel/Vanadis) wired to a
+# Vortex regression test binary today. A future ci/sst_run_app.py
+# could add that path; the name slot is reserved.
+#
+# For memHierarchy timing modeling, the VortexGPGPU component exposes
+# an optional `memIface` SubComponent slot — see
+# docs/proposals/sst_simx_v3_proposal.md §6 for the wiring recipe.
+#
+# Configurable via env vars (parallel to ci/gem5_run_hostless_app.py):
+#   VORTEX_TEST_DIR    — directory containing the kernel .vxbin
+#   VORTEX_TEST_KERNEL — kernel filename inside that dir
+#                        (default: kernel.vxbin, matching the
+#                         regression-test convention)
+#
+# Run via:
+#   VORTEX_TEST_DIR=tests/kernel/hello VORTEX_TEST_KERNEL=hello.vxbin \
+#       sst ci/sst_run_hostless_app.py
+
+import os
+import sst
+
+TEST_DIR    = os.environ.get("VORTEX_TEST_DIR")
+TEST_KERNEL = os.environ.get("VORTEX_TEST_KERNEL", "kernel.vxbin")
+if not TEST_DIR:
+    raise RuntimeError("VORTEX_TEST_DIR env var is required")
+
+PROGRAM = f"{TEST_DIR}/{TEST_KERNEL}"
+
+gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
+gpu.addParams({
+    "clock":   "1GHz",
+    "program": PROGRAM,
+})
diff --git a/ci/sst_test_vortex_conform.py b/ci/sst_test_vortex_conform.py
deleted file mode 100644
index 25681dc6de..0000000000
--- a/ci/sst_test_vortex_conform.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/conform/conform.vxbin"
-})
diff --git a/ci/sst_test_vortex_fibonacci.py b/ci/sst_test_vortex_fibonacci.py
deleted file mode 100644
index b174543dbe..0000000000
--- a/ci/sst_test_vortex_fibonacci.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/fibonacci/fibonacci.vxbin"
-})
diff --git a/ci/sst_test_vortex_hello.py b/ci/sst_test_vortex_hello.py
deleted file mode 100644
index ca4fc01993..0000000000
--- a/ci/sst_test_vortex_hello.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/hello/hello.vxbin"
-})
diff --git a/ci/sst_test_vortex_memHierarchy.py b/ci/sst_test_vortex_memHierarchy.py
deleted file mode 100644
index 2193985fb5..0000000000
--- a/ci/sst_test_vortex_memHierarchy.py
+++ /dev/null
@@ -1,63 +0,0 @@
-# SST Phase 3 integration test for vortex.VortexGPGPU.
-#
-# Wires the VortexGPGPU component's optional `memIface` SubComponent slot
-# through an L1 cache to a memHierarchy.MemController. Every memory request
-# accepted by Vortex's local DRAM model is mirrored to the SST memHierarchy
-# as a StandardMem::Read or Write event, so memHierarchy can model timing /
-# capacity / contention alongside Vortex's own simulation.
-#
-# This is the Phase 3 demonstrator from docs/proposals/sst_simx_v3_proposal.md.
-# The local data path stays in Vortex (RAM is authoritative); SST sees
-# every transaction but doesn't have to serve data back. That gives us
-# meaningful integration without forcing v3's TLM data path through SST.
-
-import sst
-
-# --- Vortex GPGPU component (single-warp hello kernel) -----------------------
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock":   "1GHz",
-    "program": "tests/kernel/hello/hello.vxbin",
-})
-
-# Vortex's StandardMem-side adapter
-gpu_mem_iface = gpu.setSubComponent("memIface", "memHierarchy.standardInterface")
-
-# --- L1 cache between Vortex and memory --------------------------------------
-# A cache is required because memHierarchy.MemController routes via MemLink
-# and only registers its address range when there's an upstream cache that
-# advertises destinations.
-l1 = sst.Component("l1cache", "memHierarchy.Cache")
-l1.addParams({
-    "access_latency_cycles": "2",
-    "cache_frequency":       "1GHz",
-    "replacement_policy":    "lru",
-    "coherence_protocol":    "MESI",
-    "associativity":         "4",
-    "cache_line_size":       "64",
-    "L1":                    "1",
-    "cache_size":            "8KiB",
-})
-
-# --- Memory controller + simple backend (host RAM-backed) --------------------
-memctrl = sst.Component("memctrl0", "memHierarchy.MemController")
-memctrl.addParams({
-    "clock":          "1GHz",
-    "addr_range_end": 0x100000000 - 1,  # 4 GB
-})
-memory = memctrl.setSubComponent("backend", "memHierarchy.simpleMem")
-memory.addParams({
-    "access_time": "10ns",
-    "mem_size":    "4GiB",
-})
-
-# --- Wiring ------------------------------------------------------------------
-# Vortex GPGPU → L1 cache
-link_gpu_l1 = sst.Link("link_gpu_l1")
-link_gpu_l1.connect((gpu_mem_iface, "lowlink", "1ns"),
-                    (l1,            "highlink", "1ns"))
-
-# L1 cache → MemController
-link_l1_mem = sst.Link("link_l1_mem")
-link_l1_mem.connect((l1,      "lowlink",  "1ns"),
-                    (memctrl, "highlink", "1ns"))
diff --git a/ci/sst_test_vortex_vecadd.py b/ci/sst_test_vortex_vecadd.py
deleted file mode 100644
index 8a156cf81f..0000000000
--- a/ci/sst_test_vortex_vecadd.py
+++ /dev/null
@@ -1,7 +0,0 @@
-import sst
-
-gpu = sst.Component("gpu0", "vortex.VortexGPGPU")
-gpu.addParams({
-    "clock": "1GHz",
-    "program": "tests/kernel/vecadd/vecadd.vxbin"
-})
diff --git a/docs/gem5_integration.md b/docs/gem5_integration.md
index 461835ebc2..5b2e0f1afe 100644
--- a/docs/gem5_integration.md
+++ b/docs/gem5_integration.md
@@ -163,17 +163,23 @@ install location — no extra setup needed.
 
 ### By hand
 
-**Standalone** (no host CPU; kernel preloaded via SimObject parameter):
+**Hostless** (no host CPU; kernel preloaded via SimObject parameter):
 
 ```bash
-VORTEX_GEM5_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
-VORTEX_GEM5_KERNEL=$(pwd)/tests/kernel/hello/hello.vxbin \
-    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_hello.py
+VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
+VORTEX_TEST_DIR=$(pwd)/tests/kernel/hello \
+VORTEX_TEST_KERNEL=hello.vxbin \
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_hostless_app.py
 ```
 
+`VORTEX_TEST_KERNEL` defaults to `kernel.vxbin`, so any standard
+regression test's kernel can be driven hostless without its host
+binary: point `VORTEX_TEST_DIR` at e.g. `$(pwd)/tests/regression/vecadd`
+and leave `VORTEX_TEST_KERNEL` unset in the command above.
+
 **End-to-end** — any standard Vortex regression test (host binary +
 kernel.vxbin) runs through the generic
-[`ci/gem5_test_vortex_app.py`](../ci/gem5_test_vortex_app.py) runner.
+[`ci/gem5_run_app.py`](../ci/gem5_run_app.py) runner.
 
 ```bash
 # vecadd
@@ -182,7 +188,7 @@ VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \
 VORTEX_TEST_DIR=$(pwd)/tests/regression/vecadd \
 VORTEX_TEST_BIN=vecadd \
 VORTEX_TEST_ARGS="-n16" \
-    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_app.py
 
 # sgemm
 VORTEX_GEM5_DEV_LIB=$(pwd)/sim/simx/libvortex-gem5.so \
@@ -190,7 +196,7 @@ VORTEX_GEM5_HOST_RT_DIR=$(pwd)/sw/runtime \
 VORTEX_TEST_DIR=$(pwd)/tests/regression/sgemm \
 VORTEX_TEST_BIN=sgemm \
 VORTEX_TEST_ARGS="-n4" \
-    $GEM5_HOME/build/X86/gem5.opt ci/gem5_test_vortex_app.py
+    $GEM5_HOME/build/X86/gem5.opt ci/gem5_run_app.py
 ```
 
 Expected output ends with:
@@ -228,7 +234,7 @@ is identity-mapped (cacheable=False) so the dispatcher's PIO MMIO
 reaches the SimObject's regfile decoder.
 
 These constants are duplicated in two places — `sw/runtime/gem5/driver.h`
-and `ci/gem5_test_vortex_app.py`. If you change one, change the other.
+and `ci/gem5_run_app.py`. If you change one, change the other.
 
 ## Writing your own gem5 Python script
 
@@ -317,8 +323,8 @@ m5.simulate()
 ```
 
 Reference implementations:
-- [ci/gem5_test_vortex_hello.py](../ci/gem5_test_vortex_hello.py) — standalone Phase-3 variant (preload via `kernel=` param; no host CPU)
-- [ci/gem5_test_vortex_app.py](../ci/gem5_test_vortex_app.py) — Phase-5 e2e variant (any regression test via `VORTEX_TEST_BIN`)
+- [ci/gem5_run_hostless_app.py](../ci/gem5_run_hostless_app.py) — hostless variant (preload via `kernel=` param; no host CPU)
+- [ci/gem5_run_app.py](../ci/gem5_run_app.py) — e2e variant (any regression test via `VORTEX_TEST_BIN`)
 
 ## Load-bearing invariants — do not violate
 
diff --git a/docs/proposals/gem5_v2_cp_migration_proposal.md b/docs/proposals/gem5_v2_cp_migration_proposal.md
index a5c8bfc7c0..035d0805bb 100644
--- a/docs/proposals/gem5_v2_cp_migration_proposal.md
+++ b/docs/proposals/gem5_v2_cp_migration_proposal.md
@@ -622,7 +622,7 @@ CommandProcessor wiring; runnable without gem5 itself).
   HOST_ARCH switch).
 
 **Validation:**
-- Phase 3 standalone test (`ci/gem5_test_vortex_hello.py`): PASSES.
+- Hostless test (`ci/gem5_run_hostless_app.py`): PASSES.
   (No host runtime involvement.)
 - `./ci/regression.sh --gem5`: PASSES — hello + vecadd + sgemm e2e on x86.
 - `VORTEX_GEM5_ARM=1 ./ci/regression.sh --gem5`: PASSES — same 3 tests
diff --git a/docs/proposals/sst_simx_v3_proposal.md b/docs/proposals/sst_simx_v3_proposal.md
index 3dbe0a00ef..65db9ebbbd 100644
--- a/docs/proposals/sst_simx_v3_proposal.md
+++ b/docs/proposals/sst_simx_v3_proposal.md
@@ -1,7 +1,7 @@
 # SST Integration for SimX v3 — Proposal
 
 **Date:** 2026-05-03
-**Status:** Draft
+**Status:** Implemented — note that `ci/sst_test_vortex_*.py` have been consolidated into a single generic runner [ci/sst_run_hostless_app.py](../../ci/sst_run_hostless_app.py) (parameterized by `VORTEX_TEST_DIR` + `VORTEX_TEST_KERNEL`, parallel to [ci/gem5_run_hostless_app.py](../../ci/gem5_run_hostless_app.py)). The naming reserves the `ci/sst_run_app.py` slot for a future host-CPU-driven SST integration (none today — see §3). The memHierarchy wiring described in §6 is no longer kept as a standalone test runner; the recipe stays here as documentation. References to specific `sst_test_vortex_<test>.py` filenames below are historical.
 **Author:** Blaise Tine
 **Related:**
 [simx_v3_proposal.md](simx_v3_proposal.md) (Phase 5: TLM data path),
diff --git a/sw/runtime/gem5/Makefile b/sw/runtime/gem5/Makefile
index 16bd3390be..259bda5d9e 100644
--- a/sw/runtime/gem5/Makefile
+++ b/sw/runtime/gem5/Makefile
@@ -16,13 +16,6 @@ CXXFLAGS += -DXLEN_$(XLEN)
 CXXFLAGS += -fPIC
 CXXFLAGS += $(CONFIGS)
 
-# OPAE-shaped MMIO constants come from the generated vortex_opae.h
-# at build/sw/ (already on the include path via -I$(ROOT_DIR)/sw).
-# vortex.cpp does `#include <vortex_opae.h>` for the AFU_IMAGE_*
-# defines. Unlike sw/runtime/opae/Makefile we do NOT call
-# afu_json_mgr — configure already generated the header from
-# vortex_opae.toml at build time.
-
 # Per-arch compiler selection. The cross-compilers are sysroot-aware
 # (Ubuntu's gcc-aarch64-linux-gnu ships the matching libstdc++); no
 # extra --sysroot flags needed.