Reproducibility-oriented benchmarking framework for CPU and accelerator workloads on commodity Linux and Apple Silicon macOS. Designed for multi-host operation: a dev/methodology host, Apple Silicon smoke hosts, Intel iGPU hosts, and CUDA execution hosts with NVIDIA GPUs.
| Profile | Hardware target | What runs |
|---|---|---|
| `cpu` | CPU + iGPU OpenCL only (no NVIDIA stack) | `bench-cpu`, `bench-standard-cpu`, `bench-opencl`, `bench-opencl-gemm` |
| `cuda` | CPU + iGPU OpenCL + NVIDIA GPU + CUDA toolkit | everything in `cpu` plus `bench-cuda-vector-add`, `bench-cuda-gemm`, optional `bench-cuda-uvm-access` |
| `apple` | macOS Apple Silicon (M1/M4) | `lab-apple-smoke`, `bench-apple-metal`, optional `bench-apple-mps` |
`lab-pipeline` adds a reviewer-facing planning layer on top of the existing
suite runner. It maps research themes to supported environments, suite configs,
workloads, tunables, metrics, and claim boundaries.
```
lab-pipeline list
lab-pipeline show forest-uvm-access
lab-pipeline review
lab-pipeline plan consumer-accelerator-baseline --profile cpu
lab-pipeline plan forest-uvm-access --profile cuda --sweep
```

The default matrix is installed to `~/.config/lab/pipelines/research-matrix.yaml`. Current status:
| Track | Status | Purpose |
|---|---|---|
| `consumer-accelerator-baseline` | wired | CPU/OpenCL/CUDA baseline performance, energy, and reproducibility |
| `forest-uvm-access` | wired | Forest-inspired CUDA managed-memory access-pattern probe |
| `memory-hierarchy-pim` | partial | CUDA GEMV/SpMV microbenchmarks for bandwidth-dominated memory kernels |
| `edge-ai-cnn-transformer` | planned | CNN/Transformer accelerator workloads |
| `multi-tenant-migration-storage` | planned | tensor migration, UVM, storage oversubscription |
| `cluster-communication` | planned | NCCL/PXN rail and topology experiments |
| `security-counter-cache` | planned | defensive cache/counter characterization |
Important claim boundary: `bench-cuda-uvm-access` uses `cudaMallocManaged` and
Forest-style access classes (`ls`, `hchi`, `hcli`, `lc`) to probe real-hardware
UVM symptoms. It is not an implementation of Forest's modified UVM driver,
access-time tracker, heterogeneous TBNp, or pseudo-LRU eviction. Page-fault,
migration, re-fault, and thrashing claims require CUPTI/Nsight/driver counters
or simulator instrumentation in addition to this harness.
Validate configs and generated suites with:
```
lab-validate matrix ~/.config/lab/pipelines/research-matrix.yaml
lab-validate suite-config ~/.config/lab/suites/forest-uvm.yaml
lab-validate suite-dir <suite_dir>
```

Host acceptance artifact:
```
lab-host-acceptance                       # readiness logs under ~/lab/_acceptance
lab-host-acceptance --run                 # also run host-specific smoke benchmark
lab-host-acceptance --run --uvm-profile   # CUDA hosts: also capture a small Nsight UVM profile
lab-acceptance-verify <acceptance_dir> --expect-profile cuda --require-run --require-uvm-profile
lab-acceptance-bundle <acceptance_dir> --expect-profile cuda --require-run --require-uvm-profile
lab-acceptance-bundle --check-bundle <bundle.tar.gz> --expect-profile cuda --require-run --require-uvm-profile
lab-acceptance-collect --profile cuda --run --uvm-profile --require-provenance --require-gpu-name "RTX 5060" --min-gpu-memory-mib 7600 --require-compute-cap 12.0 --require-cuda-sm 120
lab-remote-acceptance user@rtx-host --profile cuda --run --uvm-profile --require-provenance --require-gpu-name "RTX 5060" --min-gpu-memory-mib 7600 --require-compute-cap 12.0 --require-cuda-sm 120
lab-remote-acceptance user@intel-host --profile cpu --run --require-provenance --require-opencl-device Intel
lab-acceptance-collect --profile apple --run --require-provenance --require-apple-chip "Apple M1"
lab-acceptance-matrix --bundle-dir ~/lab/_acceptance_bundles
lab-acceptance-matrix --bundle-dir ~/lab/_acceptance_bundles --next-commands
lab-acceptance-stage --out ~/lab/_drive_stage/lab-acceptance-bundles
lab-acceptance-import ~/lab/_drive_stage/lab-acceptance-bundles
```

`lab-acceptance-matrix` checks all collected bundles against
`config/acceptance/required-hosts.json`: MacBook M1, MacBook M4, Ubuntu Intel
iGPU, Ubuntu RTX 5060 8GB, and Ubuntu RTX 5080 16GB. Bundles are plain
`.tar.gz` files with `.sha256` sidecars, so they can be moved by scp, USB, or
Google Drive as long as the sidecar is kept with the bundle.
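Assuming the sidecars are plain `sha256sum` output (the repo only guarantees a `.sha256` file, so check this first), a transferred set can be re-checked by hand before import. The `verify_bundles` helper below is a hypothetical sketch, not part of the toolkit:

```shell
# Hedged sketch: verify every bundle in a directory against its .sha256
# sidecar, assuming each sidecar is plain `sha256sum` output.
verify_bundles() {
  local dir=$1 sidecar
  for sidecar in "$dir"/*.sha256; do
    # run the check from inside the directory so relative names resolve
    ( cd "$dir" && sha256sum -c "$(basename "$sidecar")" ) || return 1
  done
}
# usage: verify_bundles ~/lab/_acceptance_bundles
#        prints "<bundle>.tar.gz: OK" per bundle on a clean transfer
```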
`lab-acceptance-stage` copies the latest passing bundle per matrix target, the
acceptance config used for verification, and `STAGE-MANIFEST.json`. The
resulting folder can be uploaded to Google Drive or copied to USB without
dragging along stale bundles or relying on the staging Mac's local config path.
`lab-acceptance-import` copies a staged folder back into
`~/lab/_acceptance_bundles`, verifies the sidecar hashes, and prints the matrix
result using the staged config.
`lab-remote-acceptance` also installs PyYAML in the remote user's Python
environment when needed, because YAML parsing is required before acceptance can
run.
RTX host smoke and UVM mechanism profiling:
```
lab-rtx-smoke          # dry readiness check
lab-rtx-smoke --run    # executes small CUDA UVM/GEMV/SpMV/GCN probes
lab-uvm-profile --pattern hchi --mb 12288 --passes 2
LAB_CUDA_ARCH=sm_120 bench-cuda-gemv
```

`lab-uvm-profile` uses NVIDIA Nsight Systems Unified Memory CPU/GPU page-fault
tracing. NVIDIA documents these as high-overhead tracing options, so keep them
out of normal timing suites and use them to explain mechanisms after locating
interesting UVM cases.
Memory-kernel sweep:
```
lab-pipeline plan memory-hierarchy-pim --profile cuda --sweep
bench-suite-config ~/.config/lab/suites/memory-kernels.yaml
```

Apple Silicon smoke:
```
lab-apple-smoke
LAB_APPLE_ELEMENTS=1048576 lab-apple-smoke --run
LAB_APPLE_ELEMENTS=1048576 lab-apple-smoke --run --run-mps   # optional PyTorch MPS smoke
```

The profile auto-detects from `nvidia-smi` + `nvcc`. Override:
```
lab-profile set cuda         # persistent (writes ~/.config/lab/profile)
lab-profile clear            # back to auto-detect
LAB_PROFILE=cpu lab-doctor   # transient
```

`lab-doctor` and `bench-suite` honor the profile: on `cpu`, CUDA tools are listed as info/optional and CUDA workloads are skipped automatically; on `cuda`, CUDA tools become required and `bench-suite` includes the `cuda-vector` and `cuda-gemm` workloads.
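The selection order can be approximated as the hedged sketch below. This is a guess at the precedence (`detect_profile` is a hypothetical helper), not the actual `lab-profile` code:

```shell
# Hedged sketch of profile precedence: transient env override first,
# then the persisted file, then platform/tool probes.
detect_profile() {
  if [ -n "$LAB_PROFILE" ]; then echo "$LAB_PROFILE"; return; fi
  if [ -r "$HOME/.config/lab/profile" ]; then
    cat "$HOME/.config/lab/profile"; return
  fi
  if [ "$(uname -s)" = "Darwin" ]; then echo apple; return; fi
  if command -v nvidia-smi >/dev/null 2>&1 && command -v nvcc >/dev/null 2>&1; then
    echo cuda
  else
    echo cpu
  fi
}
# usage: detect_profile   -> cpu | cuda | apple
```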
```
git clone <THIS-REPO-URL> ~/lab-tools
cd ~/lab-tools
python3 -m pip install --user PyYAML   # needed for YAML suites, lab-pipeline, and acceptance validation
bash bin/lab-tools-install             # copies into ~/bin, ~/.config/lab, ~/notes
export PATH="$HOME/bin:$PATH"          # if ~/bin is not already in your shell PATH
lab-doctor                             # sanity check
lab-host-acceptance                    # reproducible host readiness artifact
lab-acceptance-verify ~/lab/_acceptance/<dir>
lab-acceptance-matrix --dry-run        # see required cross-host gates
```

For the current completion status and the remaining RTX hardware gate, see
`notes/completion-audit.md`.
After install, on Intel CPUs:

```
sudo lab-pin-system enable-rapl   # one-shot per boot for energy measurement
```

For a measurement campaign:
```
sudo lab-pin-system pin            # governor=performance, no_turbo=1, ASLR=0
bench-suite-config baseline.yaml   # full suite with stats + reports
sudo lab-pin-system restore        # back to powersave/turbo on
```

Every `bench-suite` run produces a directory under `~/lab/<experiment>/suites/<id>/`:
- `summary.csv`: one row per run, all metrics + duration + RAPL energy + (cuda profile) NVML energy + phase + thermal events
- `stats.csv`: per (workload, phase=all/cold/steady, metric): n, mean, median, SD, CV%, ±1.96σ/√n, 95% bootstrap CI, MAD, outlier count, quality grade
- `report.md`: Markdown report with results + statistics tables
- `method.md`: paper-ready §Methodology section auto-filled from `manifest.json`
- `reproducibility.md`: ACM-style artifact checklist (auto-scored against artifacts present)
- `execution-order.csv`: randomized (workload, repeat) order with seed for reproducibility
- per-run dirs with `manifest.json` (system snapshot + sha256 of all sources), `monitor.csv` (1–2 Hz thermal/load/RAPL/NVML samples), `result.json` (energy_j, avg_power_w, thermal events, system_pinned, gpu_max_temp_c)
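For a quick sanity pass before trusting `stats.csv`, the per-run rows in `summary.csv` can be summarized directly. The `col_stats` helper below is hypothetical, and the assumption that the metric of interest sits in a given numeric column must be checked against the real header:

```shell
# Hedged sketch: mean and CV% of one numeric column in summary.csv.
# The column index is an assumption -- check the real header first.
col_stats() {   # usage: col_stats <csv> <column-index>
  awk -F, -v c="$2" 'NR > 1 { n++; s += $c; ss += $c * $c }
    END {
      mean = s / n
      sd = sqrt((ss - n * mean * mean) / (n - 1))   # sample SD
      printf "mean=%.3f cv=%.1f%%\n", mean, 100 * sd / mean
    }' "$1"
}
# usage: col_stats ~/lab/<experiment>/suites/<id>/summary.csv 3
```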
```
# Host A (cpu profile): build a baseline suite
sudo lab-pin-system pin
bench-suite-config baseline.yaml
sudo lab-pin-system restore

# Package it for Host B
lab-handoff <suite_dir>
# -> ~/lab/_handoffs/<experiment>-<id>-handoff-<date>.tar.zst

# Transfer to Host B and re-run with CUDA workloads added
scp ~/lab/_handoffs/*.tar.zst hostB:~/
ssh hostB
git clone <THIS-REPO-URL> ~/lab-tools && cd ~/lab-tools && bash bin/lab-tools-install
lab-profile set cuda
sudo lab-pin-system pin
bench-suite-config baseline.yaml   # auto-includes cuda-vector + cuda-gemm
sudo lab-pin-system restore

# Compare cross-host (works on either side)
suite-compare <hostA_suite_dir> <hostB_suite_dir> --md compare.md
```

- Mean, median, SD, CV%, parametric ±1.96σ/√n
- 95% bootstrap percentile CI over 10000 resamples (seed-fixed via `LAB_BOOTSTRAP_SEED`)
- MAD-based outlier flagging (Iglewicz & Hoaglin)
- Phase split: `cold` = first run of each workload by execution order; `steady` = rest
- Suite A vs B: Mann-Whitney U two-sided + Cliff's δ + Romano (2006) effect-size thresholds (negligible/small/medium/large)
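For intuition about the outlier flagging, the Iglewicz & Hoaglin rule (modified z-score M = 0.6745·(x − median)/MAD, flag |M| > 3.5) can be reproduced standalone. This is a hedged shell re-implementation, not the code path the suite runner uses:

```shell
# Hedged sketch of the MAD outlier rule on a list of samples.
median() {
  printf '%s\n' "$@" | sort -n |
    awk '{ v[NR] = $1 } END { print (NR % 2) ? v[(NR+1)/2] : (v[NR/2] + v[NR/2+1]) / 2 }'
}

mad_outliers() {
  local med mad x devs
  med=$(median "$@")
  devs=$(for x in "$@"; do
    awk -v a="$x" -v m="$med" 'BEGIN { d = a - m; print (d < 0) ? -d : d }'
  done)
  mad=$(median $devs)   # word-splitting on newlines is intentional here
  for x in "$@"; do
    # guard: MAD of 0 would divide by zero; real code needs a fallback
    awk -v a="$x" -v m="$med" -v s="$mad" \
      'BEGIN { if (s > 0) { z = 0.6745 * (a - m) / s; if (z < 0) z = -z; if (z > 3.5) print a } }'
  done
}

mad_outliers 10 11 10 12 11 100   # prints 100
```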
- `~/.config/lab/containers/Containerfile.cpu`: Ubuntu 24.04 + clinfo/OpenCL/OpenBLAS/sysbench/python
- `~/.config/lab/containers/Containerfile.cuda`: `nvidia/cuda:13.1.0-devel-ubuntu24.04` + the same toolchain (Blackwell sm_120 ready)

Both containers include PyYAML for suite parsing and SciPy for the Mann-Whitney
statistics in `suite-compare`.
```
lab-container-build cpu                   # cpu profile
lab-container-build cuda                  # cuda profile
lab-container-run -- bench-standard-cpu
```

The active source of truth is `~/bin/` and `~/.config/lab/`. After editing, run `lab-tools-sync` to copy changes back into this repo and commit. Other hosts pull and run `lab-tools-install`.
This framework targets commodity Linux CPU/iGPU, Apple Silicon smoke testing, and consumer-tier NVIDIA CUDA. It does not attempt to support:
- Multi-GPU / large LLM training / H100 baselines
- AMD ROCm or Intel oneAPI/SYCL (stubs only)
- Distributed/HPC measurement, except planned NCCL/PXN scaffolding
- Windows or non-Apple-Silicon macOS execution
For workloads beyond the local hardware envelope, use `lab-handoff` to package a suite and run it on by-the-hour cloud GPUs (RunPod / Lambda / Vast.ai) or shared cluster resources (KISTI Nurion, the NIPA AI voucher program, university GPU clusters).