CPU Microarchitecture Benchmark Suite

I keep reading claims like "L1 data cache access times are 10-20ns" with no detail: under which conditions, warm or cold, and how was it measured? And what about my machine? I used C++23, Google Benchmark and RDTSC cycle counting for the measurements. My CPU is an AMD Ryzen 7 9800X3D (Zen 5).

Organisation

src/ – C++ benchmarks
data/ – CSV samples
python/ – Plotting scripts
plots/ – Resulting graphs

Benchmarks

0. CPU specification.

The cache sizes and core count of the test CPU appear at the top of the "Benchmark output" section.

1. Cache hierarchy latency

We walk a random‑order pointer chain to avoid the prefetcher and measure L1/L2/L3 reads.
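The pointer chain described above can be sketched as follows. This is a minimal illustration, not the code in src/: the names `make_chain` and `chase` are hypothetical, and it assumes the chain is an index array shuffled with Sattolo's algorithm so that the walk forms a single cycle over every slot.

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random single-cycle permutation (Sattolo's algorithm).
// next[i] is the slot visited after slot i; starting anywhere, the walk
// touches every slot exactly once before returning to the start.
std::vector<std::size_t> make_chain(std::size_t n, unsigned seed) {
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    if (n < 2) return next;
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follow the chain for `steps` hops. Each load address depends on the
// previous load's result, so the hops cannot overlap in time.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t idx = 0;
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    return idx;
}
```

Dividing the time of a long `chase` call by `steps` gives an average latency per load for a working set of `n * sizeof(std::size_t)` bytes. The return value should be consumed (for example printed or passed to a benchmark sink) so the compiler cannot drop the loop.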

Plotting the initial data, I was confused because the 4KB pages were artificially outperforming the large pages. This turned out to be an artifact of OS demand paging. Because I did not pre-fault the memory, the physical pages were mapped sequentially during the setup loop, allowing the hardware prefetcher to anticipate the access pattern and hide the latency.

After adding a pre-faulting loop to ensure a truly random physical layout, I compared the standard 4KB pages to 2MB large pages again. The corrected results showed that exceeding the 12MB reach of the L2 TLB (with 4KB pages) does not cause a massive latency cliff. The true TLB miss penalty is only about 5 cycles, probably because the page tables themselves remain fully cached within the 96MB L3 cache.
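A pre-faulting pass of the kind mentioned above might look like this. It is a sketch under assumptions: the function name `prefault` is hypothetical, the 4096-byte default matches the 4KB pages discussed here, and the actual loop in src/ may differ.

```cpp
#include <cstddef>

// Write one byte into every page of [base, base + bytes) so the OS backs
// the whole buffer with physical memory before the timed region starts.
// Returns the number of pages touched.
std::size_t prefault(unsigned char* base, std::size_t bytes,
                     std::size_t page_size = 4096) {
    // volatile keeps the compiler from removing the otherwise unused stores
    volatile unsigned char* p = base;
    std::size_t touched = 0;
    for (std::size_t off = 0; off < bytes; off += page_size) {
        p[off] = 0;
        ++touched;
    }
    return touched;
}
```

Note that a `std::vector<unsigned char>(n)` zero-fills its storage on construction and therefore already touches every page; an explicit pass like this matters for buffers obtained without initialization, such as `new unsigned char[n]` or a raw OS allocation.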

2. Branch prediction & speculation

Compares sorted vs. shuffled data to force real branch instructions.
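A sketch of the kind of loop this comparison relies on, assuming a simple threshold test on byte values. The names `sum_above`, `make_shuffled` and `make_sorted` are hypothetical and not taken from src/.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sum only the elements at or above the threshold. The `if` is the branch
// under test: predictable on sorted input, close to random on shuffled input.
std::uint64_t sum_above(const std::vector<std::uint8_t>& data,
                        std::uint8_t threshold) {
    std::uint64_t sum = 0;
    for (std::uint8_t v : data) {
        if (v >= threshold) sum += v;
    }
    return sum;
}

// Uniformly random bytes in arbitrary order.
std::vector<std::uint8_t> make_shuffled(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<std::uint8_t> data(n);
    for (auto& v : data) v = static_cast<std::uint8_t>(dist(rng));
    return data;
}

// Same values, ascending order.
std::vector<std::uint8_t> make_sorted(std::vector<std::uint8_t> data) {
    std::sort(data.begin(), data.end());
    return data;
}
```

Both inputs produce the same sum, so any timing difference comes from the access pattern of the branch. An optimizing compiler may replace the `if` with a conditional move or vectorize the loop, in which case there is no branch left to mispredict, so it is worth checking the generated assembly.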

3. L3 latency & TLB thrashing

Every single memory access is timed with RDTSC + _mm_lfence.
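Timing one access with RDTSC and _mm_lfence can be sketched like this. The helper name `time_load` and the exact fence placement are assumptions for illustration; the code is x86-64 only and relies on the `__rdtsc` and `_mm_lfence` compiler intrinsics.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // MSVC: __rdtsc, _mm_lfence
#else
#include <x86intrin.h>   // GCC/Clang: __rdtsc, _mm_lfence
#endif

// Time a single load in TSC ticks and hand the loaded value back through
// value_out so the load cannot be optimized away.
std::uint64_t time_load(const volatile std::uint64_t* addr,
                        std::uint64_t& value_out) {
    _mm_lfence();                          // let earlier instructions finish
    const std::uint64_t start = __rdtsc();
    _mm_lfence();                          // keep the load after the first read
    value_out = *addr;
    _mm_lfence();                          // wait for the load to complete
    const std::uint64_t stop = __rdtsc();
    _mm_lfence();
    return stop - start;
}
```

The result includes the overhead of the fences and of RDTSC itself, so it is usually compared against a baseline measured with no load in between. The TSC also counts at a fixed reference rate rather than the current core clock, so ticks and core cycles are not the same unit.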

4. 3D V‑Cache performance

AMD's patent for "Balanced Latency stacked cache" (https://patents.google.com/patent/US20260003794A1/en) claims a near-perfect, single-spike latency distribution thanks to the 9800X3D's "flipped" design (CCD on top of the cache). I checked, and it looks like they are right! (The cache sweep image above shows this too.)

5. False Sharing

Measures the impact of false sharing. Not much to say here; see the speedup table below.
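For readers who have not seen the effect before, here is a minimal sketch of the two layouts being compared. The struct and function names are hypothetical, and the 64-byte alignment follows the "Cache-Aligned (64B)" row of the table; the benchmark in src/ may be structured differently.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Both counters are adjacent and will normally share one cache line.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter starts on its own 64-byte boundary.
struct AlignedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Two threads, each incrementing only its own counter. With the packed
// layout the cores still fight over ownership of the shared line.
template <typename Counters>
std::uint64_t hammer(Counters& c, std::uint64_t iters) {
    std::thread t1([&c, iters] {
        for (std::uint64_t i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&c, iters] {
        for (std::uint64_t i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Timing `hammer` on a `PackedCounters` and on an `AlignedCounters` instance with the same iteration count gives the kind of comparison reported in the table. The threads never touch each other's data, so the results are identical and only the run time differs.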

6. Instruction-Level Parallelism (ILP) & Dependency Chains

Two benchmarks: ILP_Dependent tests the raw latency of the CPU's ALUs by making each instruction wait for the previous one, and ILP_Independent visualizes the speedup gained by Out-of-Order execution and ILP.

From what I read, the Zen 5 architecture should have 6 ALUs (https://chipsandcheese.com/p/zen-5s-leaked-slides), but our graph starts sloping upward at 5 instead of staying flat through 6.

The likely explanation is that those 6 ALUs are not identical. Our ILP_Independent benchmark uses a Xorshift sequence (x ^= (x << 13); x ^= (x >> 7)) to prevent the compiler from cheating. This operation relies heavily on bit shifts.

While Zen 5 can dispatch simple additions to all 6 integer ALUs, more complex hardware such as barrel shifters (required for << and >>) may not be duplicated across every port, possibly because of the area and power cost of the extra register file routing. If that is the case, our graph shows a throughput bottleneck on Zen 5's shift-capable units at the 5-chain mark, which limits the core before we can saturate the theoretical 6-ALU width.
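The difference between the two benchmarks can be sketched as below, using the two shifts quoted above. The function names are hypothetical and the chain count is a template parameter for illustration; this is not the code in src/.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One step of the shift/xor sequence quoted in the text.
inline std::uint64_t xorshift_step(std::uint64_t x) {
    x ^= x << 13;
    x ^= x >> 7;
    return x;
}

// Dependent: a single chain, every step needs the previous result.
std::uint64_t run_dependent(std::uint64_t seed, std::size_t steps) {
    std::uint64_t x = seed;
    for (std::size_t s = 0; s < steps; ++s) x = xorshift_step(x);
    return x;
}

// Independent: N chains advanced in lockstep. The chains share no data,
// so an out-of-order core can work on several of them at once.
template <std::size_t N>
std::array<std::uint64_t, N> run_independent(std::uint64_t seed,
                                             std::size_t steps) {
    std::array<std::uint64_t, N> x{};
    for (std::size_t k = 0; k < N; ++k) x[k] = seed + k;
    for (std::size_t s = 0; s < steps; ++s)
        for (std::size_t k = 0; k < N; ++k) x[k] = xorshift_step(x[k]);
    return x;
}
```

Sweeping N and plotting the time per step is one way to produce a curve like the one discussed here: it should stay roughly flat while the core has spare execution resources and start rising once the chains compete for the same units.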

Benchmark	Strategy	CPU time per iteration (avg)	Throughput	Speedup (throughput ratio)
ILP	Dependent Chains	3749 ns	2.67 G/s	1.0x (Baseline)
ILP	Independent (6+ ALUs)	725 ns	27.57 G/s	10.3x
Coherency	False Sharing (Packed)	12.83 ms	155.8 M/s	1.0x (Baseline)
Coherency	Cache-Aligned (64B)	3.84 ms	520.7 M/s	3.3x

Benchmark output:

Run on (16 X 4700 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 98304 KiB (x1)
--------------------------------------------------------------------------------------
Benchmark                     Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
ILP_Dependent              3849 ns         3749 ns       179200 items_per_second=2.66716G/s
ILP_Independent             721 ns          725 ns      1120000 items_per_second=27.5692G/s
FalseSharing_Conflict  13406570 ns     12834821 ns           56 items_per_second=155.826M/s
FalseSharing_Aligned    3880062 ns      3840782 ns          179 items_per_second=520.727M/s
BM_BranchPrediction/0     59282 ns        58594 ns        11200
BM_BranchPrediction/1     51594 ns        51618 ns        11200
TLB_Walk                4342550 ns      4360465 ns          172 items_per_second=120.237M/s
BM_CacheHierarchy/4096     0.768 ns        0.781 ns   1120000000
BM_CacheHierarchy/8192     0.766 ns        0.767 ns    896000000
BM_CacheHierarchy/16384    0.765 ns        0.767 ns   1120000000
BM_CacheHierarchy/32768    0.769 ns        0.767 ns    896000000
BM_CacheHierarchy/65536     2.68 ns         2.67 ns    263529412
BM_CacheHierarchy/131072    2.68 ns         2.57 ns    248888889
BM_CacheHierarchy/262144    2.68 ns         2.73 ns    263529412
BM_CacheHierarchy/524288    2.69 ns         2.61 ns    263529412
BM_CacheHierarchy/1048576   3.87 ns         3.84 ns    235789474
BM_CacheHierarchy/2097152   8.03 ns         7.85 ns     89600000
BM_CacheHierarchy/4194304   9.88 ns         10.0 ns     74666667
BM_CacheHierarchy/8388608   11.0 ns         11.0 ns     64000000
BM_CacheHierarchy/16777216  12.6 ns         12.3 ns     56000000
BM_CacheHierarchy/33554432  18.8 ns         19.0 ns     34461538
BM_CacheHierarchy/67108864  37.9 ns         38.4 ns     17920000
BM_CacheHierarchy/134217728 58.5 ns         57.5 ns      8960000
BM_CacheHierarchy/268435456 82.5 ns         82.0 ns      8960000

Build & run

Requirements

C++23 compiler (GCC 13+ or Clang 16+)
CMake 3.31+
Google Benchmark
Python 3.10+ with pandas, matplotlib, seaborn
Group protocols must be enabled (a bat file is provided)

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .
