BAND (Bandwidth Assessment for Native DDR) is a portable memory-bandwidth measurement tool. It runs the four classic STREAM kernels — Copy, Scale, Add, Triad — and reports sustained memory bandwidth in decimal GB/s, directly comparable to STREAM.C output.
The hard part of measuring "DDR bandwidth" portably is that true memory‑controller
counters live behind privileged, per‑SoC interfaces (Intel uncore IMC, AMD µProf,
ARM PMU, Apple powermetrics) that differ on every platform. BAND's answer is to
measure at several levels and always tell you which level produced each number:
| Tier | Backend | Requires | What it measures |
|---|---|---|---|
| 0 | NumPy | nothing (Python + NumPy) | bandwidth achievable from Python |
| 1 | Native | a C compiler | bandwidth achievable from compiled code (≈ hardware achievable) |
| 2 | DRAM counters | Linux perf + privileges | actual DRAM traffic at the memory controller |
Higher tiers are more accurate but less portable. BAND auto‑uses whatever is available and degrades gracefully — on a bare machine you still get Tier 0, on a machine with a compiler you also get Tier 1, and with privileges you get the ground‑truth Tier 2 number.
In addition, a working‑set sweep runs a single‑core kernel across array sizes from sub‑L1 up past L3. The plateau at large sizes is a measured estimate of per‑core DRAM bandwidth — no assumptions about cache size required.
Earlier versions of BAND tried to convince you that NumPy results matched STREAM.C. That comparison is fragile for two real reasons:
- NumPy does not emit non‑temporal (streaming) stores. A logical "write" of N bytes can cost a hidden read‑for‑ownership, so real DRAM traffic exceeds the byte count STREAM reports. Whether that matters depends on the CPU and compiler.
- A naive NumPy Triad allocates a temporary array, so it measures allocator churn, not memory bandwidth. (The previous version "fixed" this by doubling the Triad number — which simply hid the bug.)
BAND now computes the Triad in place with a cache‑resident temporary (no large intermediate allocation), so Tier 0 Triad is a real bandwidth figure rather than an artifact. And rather than pretend Python equals C, BAND shows Tier 0 and Tier 1 side by side so you can see the gap directly.
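The in-place Triad described above can be sketched as a blocked loop. This is a minimal illustration, not BAND's actual code: the function name `triad_chunked` and the block size are assumptions chosen so that the scratch buffer (256 KiB here) stays cache-resident on many CPUs.

```python
import numpy as np

def triad_chunked(a, b, c, scalar, block=32 * 1024):
    """Compute a[:] = b + scalar * c block by block, writing into `a` in place.

    `block` is the number of elements per chunk (hypothetical default).
    """
    tmp = np.empty(block, dtype=a.dtype)  # small scratch buffer, reused every block
    for start in range(0, a.size, block):
        stop = min(start + block, a.size)
        t = tmp[: stop - start]                       # view, handles the short last block
        np.multiply(c[start:stop], scalar, out=t)     # t = scalar * c
        np.add(b[start:stop], t, out=a[start:stop])   # a = b + t, no new allocation
    return a

n = 1_000_000
b = np.arange(n, dtype=np.float64)
c = np.ones(n, dtype=np.float64)
a = np.empty(n, dtype=np.float64)
triad_chunked(a, b, c, 3.0)
```

Because every ufunc call writes through `out=`, the only allocation is the one small scratch buffer, so the timing is not dominated by allocating a full-size intermediate array.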
- Python 3.7+
- NumPy (required)
- psutil (optional — used only for the total‑memory line in system info)
- A C compiler (optional; enables Tier 1: `cc`, `gcc`, or `clang`)
- Linux `perf` + privileges (optional; enables Tier 2)
git clone https://github.com/kylefoxaustin/band.git
cd band
pip install -r requirements.txt
chmod +x band.py

./band.py # default: 2 GB total, all CPUs (capped at 8)
./band.py --size 4 --threads 8 # larger working set, 8 threads
./band.py --dram-counters # also attempt Tier 2 (needs perf + privileges)

--size FLOAT Total memory in GB across all three arrays (default: 2.0)
--iterations INT Timed iterations per operation (default: 7)
--threads INT Worker threads (default: all available CPUs, capped at 8)
--no-pin Do not pin threads to CPUs
--no-numpy Skip Tier 0
--no-native Skip Tier 1
--dram-counters Attempt Tier 2 (DRAM counters via perf)
--dram-driver CHOICE Workload driving the Tier 2 window: auto|native|numpy
(auto = native if available, else numpy)
--no-sweep Skip the working-set sweep
--verbose Show the compiler command and extra diagnostics
--stream-file PATH Compare against a STREAM.C output file
--peak-mts FLOAT Memory transfer rate (e.g. 6000 for DDR5-6000)
--channels INT Populated memory channels (with --peak-mts: % of peak)
[Tier 0] NumPy (achievable from Python)
Copy 72.51 Scale 97.28 Add 54.72 Triad 70.81 (GB/s, median)
[Tier 1] Native (achievable from compiled C + OpenMP)
Copy 78.06 Scale 79.22 Add 78.37 Triad 78.09 (GB/s, best-of-reps)
Working-set sweep (single core, Copy kernel)
512.0 KB 153.55 ######################################## <- in L2
2.0 MB 61.92 ################ <- exceeds L2
256.0 MB 52.07 ############## <- DRAM plateau
Estimated per-core DRAM bandwidth (plateau): ~52 GB/s
A few things worth noting in real output:
- Each result shows median, min, max, and CV% (coefficient of variation). A `<- high variance` flag appears when CV > 5%, which usually means CPU frequency scaling, thermal throttling, or a busy machine; pin the clocks and close other workloads for stable numbers.
- When the Tier 1 native kernel is bus-bound, all four operations converge to roughly the same GB/s; that is the correct signature of saturated memory bandwidth. Tier 0 spreads out more because it also reflects NumPy's per-operation efficiency.
- The sweep makes the cache hierarchy visible: bandwidth peaks while the working set fits in L1/L2, then steps down to the DRAM plateau.
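The per-result statistics and the variance flag mentioned above could be computed roughly as follows. This is a sketch under assumptions: the helper name `summarize` is hypothetical, and whether BAND uses the population or the sample standard deviation is not stated in this document (population is used here).

```python
import statistics

def summarize(samples_gbps, cv_threshold=5.0):
    """Return median, min, max, CV% and a high-variance flag for timed samples."""
    mean = statistics.mean(samples_gbps)
    # CV% = standard deviation as a percentage of the mean
    cv_pct = 100.0 * statistics.pstdev(samples_gbps) / mean
    return {
        "median": statistics.median(samples_gbps),
        "min": min(samples_gbps),
        "max": max(samples_gbps),
        "cv_pct": cv_pct,
        "high_variance": cv_pct > cv_threshold,  # the 5% threshold comes from the text
    }

steady = summarize([78.0, 78.2, 77.9, 78.1, 78.0])  # illustrative values
noisy = summarize([60.0, 78.0, 70.0, 80.0, 55.0])   # illustrative values
```

With the illustrative inputs, `steady` stays well under the threshold and `noisy` is flagged.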
Tier 2 reads the CPU's uncore IMC counters via perf while a sustained workload
runs, giving the actual bytes moved across the memory bus — including the
read‑for‑ownership traffic the logical byte count misses.
# perf must be allowed to read uncore counters:
sudo sysctl kernel.perf_event_paranoid=0 # or run band.py under sudo
./band.py --dram-counters

BAND probes several IMC event sets and uses the first the kernel actually returns numbers for:
- `uncore_imc_free_running/data_read/` + `data_write/` (client / Raptor Lake etc.)
- `unc_m_cas_count_rd` + `unc_m_cas_count_wr` (per-controller CAS counts)
- older `uncore_imc/cas_count_*` and `uncore_imc/data_*` spellings
perf reports some counters pre‑scaled (e.g. in MiB) and others as raw cache‑line counts; BAND reads perf's unit column and converts accordingly (raw counts ×64 B). While a saturating multi‑threaded Triad runs in‑process, perf counts DRAM traffic system‑wide over a fixed window, so the figure is true bus traffic — it includes read‑for‑ownership and can legitimately differ from the logical Tier 0/1 numbers. If no event set works, or privileges are missing, Tier 2 is skipped with an explanation.
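The unit handling described above might look like the sketch below. The helper names and the exact unit labels in `UNIT_TO_BYTES` are assumptions for illustration; check the labels against the unit column of your own perf output.

```python
CACHE_LINE_BYTES = 64  # raw cache-line / CAS counts are multiplied by this

# Assumed unit labels for pre-scaled counters (binary prefixes).
UNIT_TO_BYTES = {"Bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

def counter_to_bytes(value, unit):
    """Convert one counter reading to bytes, based on its unit label."""
    unit = unit.strip()
    if unit == "":
        return value * CACHE_LINE_BYTES       # no unit: treat as a raw count
    if unit in UNIT_TO_BYTES:
        return value * UNIT_TO_BYTES[unit]    # pre-scaled by perf
    raise ValueError(f"unrecognized unit: {unit!r}")

def dram_gbps(readings, window_seconds):
    """Sum read + write traffic over the window and report decimal GB/s."""
    total_bytes = sum(counter_to_bytes(v, u) for v, u in readings)
    return total_bytes / window_seconds / 1e9

# Made-up readings: (value, unit) for reads and writes over a 1 s window.
readings = [(30_000.0, "MiB"), (15_000.0, "MiB")]
bw = dram_gbps(readings, window_seconds=1.0)
```

Note the mix of conventions: the counters may be scaled in binary MiB, while the reported bandwidth stays in decimal GB/s, so the conversion has to go through bytes.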
By default the counting window is driven by the native kernel (`--dram-driver auto`), which sustains traffic closest to the hardware ceiling; use `--dram-driver numpy` to measure the DRAM traffic of a NumPy workload instead.
A useful side observation: when the native kernel's measured DRAM bandwidth
roughly equals its logical Tier 1 number, the compiler is emitting non‑temporal
(streaming) stores — the write bypasses cache, so there is no read‑for‑ownership
inflation.
Tier 2 is currently Linux/`perf`-only and has been verified on Intel Raptor Lake. On other platforms and microarchitectures it falls back gracefully rather than reporting a guessed number.
The included setup_stream.sh downloads, compiles, and runs the reference
STREAM benchmark:
./setup_stream.sh # writes stream_results.txt
./band.py --stream-file stream_results.txt

BAND parses the STREAM table and prints a side-by-side comparison of STREAM.C, BAND's native (Tier 1), and NumPy (Tier 0) results, with each tier's share of STREAM.C. Units are decimal throughout (1 GB/s = 1000 MB/s), so the percentages are apples-to-apples.
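As a rough illustration of the comparison step, the sketch below pulls the "Best Rate MB/s" column out of a STREAM results table and converts it to decimal GB/s. The function names and the sample numbers are hypothetical, and BAND's real parser may differ.

```python
import re

# Matches rows such as "Copy:   80000.0   0.0205   0.0200   0.0210"
ROW = re.compile(r"^\s*(Copy|Scale|Add|Triad):\s+([0-9]+(?:\.[0-9]+)?)", re.MULTILINE)

def parse_stream_gbps(text):
    """Return {kernel: best rate in decimal GB/s} from STREAM output text."""
    return {name: float(mbps) / 1000.0 for name, mbps in ROW.findall(text)}

def share_of_stream(band_gbps, stream_gbps):
    """BAND result as a percentage of the STREAM.C result."""
    return 100.0 * band_gbps / stream_gbps

# Illustrative table in STREAM's layout; the numbers are made up.
sample = """\
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           80000.0     0.020500     0.020000     0.021000
Scale:          79000.0     0.020800     0.020300     0.021200
Add:            81000.0     0.030200     0.029600     0.030900
Triad:          80500.0     0.030300     0.029800     0.031000
"""
stream = parse_stream_gbps(sample)
copy_share = share_of_stream(78.06, stream["Copy"])
```

Dividing MB/s by 1000 (not 1024) keeps the figures in the same decimal units as the rest of the report.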
- Decimal units, matching STREAM: 1 GB = 1e9 bytes; bandwidth = bytes / time / 1e9.
- `time.perf_counter()` for Tier 0; Tier 1 times inside C with `omp_get_wtime`, excluding all Python overhead and reporting best-of-reps.
- Logical traffic counted per element: Copy/Scale = 16 B (1 read + 1 write), Add/Triad = 24 B (2 reads + 1 write), identical to STREAM's accounting.
- Thread pinning + parallel first‑touch so pages are allocated on the NUMA node that uses them; NumPy ufuncs release the GIL, so threading yields real parallelism for large arrays.
- The default `--size` (2 GB) far exceeds the caches of most current CPUs, so the steady bandwidth reflects DRAM, not cache. Use the sweep to confirm you've exceeded cache on your hardware.
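The accounting in the list above reduces to a one-line formula. The sketch below assumes 8-byte (float64) elements, which is what the 16 B and 24 B figures imply; the names are hypothetical.

```python
# Logical bytes moved per array element, using STREAM's accounting.
BYTES_PER_ELEMENT = {"copy": 16, "scale": 16, "add": 24, "triad": 24}

def logical_gbps(kernel, n_elements, seconds):
    """Decimal GB/s: bytes / time / 1e9."""
    return BYTES_PER_ELEMENT[kernel] * n_elements / seconds / 1e9

# Example: a Triad over 100 million elements that took 30 ms (made-up timing).
triad_bw = logical_gbps("triad", 100_000_000, 0.030)
copy_bw = logical_gbps("copy", 100_000_000, 0.020)
```

Both example calls come out at about 80 GB/s: Triad moves 1.5 times the bytes of Copy per element, so it needs 1.5 times the time to reach the same bandwidth.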
- Scan thread counts (`--threads` 1, 2, 4, 8, … across runs) to find where bandwidth saturates; it is often well below your core count, set by the number of memory channels.
- Match threads to memory channels rather than to cores.
- Pin CPU frequency (disable turbo / set the `performance` governor) and close other memory-heavy processes to drop the CV%.
- Pass `--peak-mts` and `--channels` to see your result as a percentage of theoretical peak, e.g. `--peak-mts 6000 --channels 2` for dual-channel DDR5-6000 (= 96 GB/s peak).
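The 96 GB/s figure in the last tip follows from transfer rate × bytes per transfer × channels. The sketch below assumes a 64-bit (8-byte) data path per channel, which is consistent with that figure; the function names are hypothetical.

```python
def peak_gbps(mts, channels, bytes_per_transfer=8):
    """Theoretical peak in decimal GB/s for a given MT/s rate and channel count."""
    return mts * 1e6 * bytes_per_transfer * channels / 1e9

def percent_of_peak(measured_gbps, mts, channels):
    """Measured bandwidth as a percentage of theoretical peak."""
    return 100.0 * measured_gbps / peak_gbps(mts, channels)

peak = peak_gbps(6000, 2)               # dual-channel DDR5-6000
pct = percent_of_peak(78.09, 6000, 2)   # 78.09 GB/s is an illustrative measurement
```

Sustained results typically land below theoretical peak, so a percentage in the high double digits on a STREAM-style kernel usually indicates the memory bus is close to saturated.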
Maintained by Kyle Fox (@kylefoxaustin). Intended for educational and performance‑measurement purposes. Contributions, bug reports, and feature requests are welcome.
MIT — see the LICENSE file.