I keep on reading things like "L1 data cache access times are 10-20ns." under which conditions, warm/cold, how did they measure it,... and similar claims but what about my machine? I used C++23, Google benchmark and RDTSC cycle counting for the measurements. My CPU is a AMD Ryzen 7 9800X3D (Zen 5).
src/– C++ benchmarksdata/– CSV samples.python/– Plotting scriptsplots/– The resulting graphs.
Below the Cache sizes and CPU specifications I used:
We walk a random‑order pointer chain to avoid the prefetcher and measure L1/L2/L3 reads.
Plotting the initial data, I was confused because the 4KB pages were artificially outperforming the large pages. This turned out to be an artifact of OS demand paging. Because I did not pre-fault the memory, the physical pages were mapped sequentially during the setup loop, allowing the hardware prefetcher to anticipate the access pattern and hide the latency.
After adding a pre-faulting loop to ensure a truly random physical layout, I compared the standard 4KB pages to 2MB large pages again. The corrected results showed that exceeding the 12MB L2 TLB capacity does not cause a massive latency cliff. The true TLB miss penalty is only about 5 cycles, as the page tables themselves probably remain fully cached within the 96MB L3 cache.
Compares sorted vs. shuffled data to force real branch instructions.
Every single memory access is timed with RDTSC + _mm_lfence.

AMD's patent for "Balanced Latency stacked cache" (https://patents.google.com/patent/US20260003794A1/en), claims near‑perfect, single‑spike latency distribution due to the 9800X3D’s “flipped” design (CCD on top of the cache).
I checked and it looks like they are right! (We can see this in the above cache sweep image too.)

Measuring the impact of false sharing, not much to say here. See speedup table below.
Two benchmarks: ILP_Dependent tests the raw latency of the CPU's ALUs by making each instruction wait for the previous one, and ILP_Independent visualizes the speedup gained by Out-of-Order execution and ILP.
From what I read, the Zen 5 architecture should have 6 ALUs (https://chipsandcheese.com/p/zen-5s-leaked-slides), but our graph starts sloping upward at 5 instead of staying flat through 6.
It must be that those 6 ALUs are not identical. Our ILP_Independent benchmark uses a Xorshift sequence (x ^= (x << 13); x ^= (x >> 7)) to prevent the compiler from cheating. This operation relies heavily on bit shifts.
While Zen 5 can dispatch simple additions to all 6 Integer ALUs, complex hardware like barrel shifters (required for << and >>) are may not be duplicated across every single port
due to the massive area and power requirements of the register file routing. Our graph proves that we are hitting a physical throughput bottleneck
on Zen 5's dedicated shift units right at that 5-chain mark, bottlenecking the core before we can fully saturate the theoretical 6-ALU width.
| Benchmark | Strategy | Latency (Avg) | Throughput | Speedup |
|---|---|---|---|---|
| ILP | Dependent Chains | 3749 ns | 2.67 G/s | 1.0x (Baseline) |
| ILP | Independent (6+ ALUs) | 725 ns | 27.57 G/s | 10.3x faster |
| Coherency | False Sharing (Packed) | 12.83 ms | 155.8 M/s | 1.0x (Baseline) |
| Coherency | Cache-Aligned (64B) | 3.84 ms | 520.7 M/s | 3.3x faster |
Run on (16 X 4700 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 98304 KiB (x1)
--------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------
ILP_Dependent 3849 ns 3749 ns 179200 items_per_second=2.66716G/s
ILP_Independent 721 ns 725 ns 1120000 items_per_second=27.5692G/s
FalseSharing_Conflict 13406570 ns 12834821 ns 56 items_per_second=155.826M/s
FalseSharing_Aligned 3880062 ns 3840782 ns 179 items_per_second=520.727M/s
BM_BranchPrediction/0 59282 ns 58594 ns 11200
BM_BranchPrediction/1 51594 ns 51618 ns 11200
TLB_Walk 4342550 ns 4360465 ns 172 items_per_second=120.237M/s
BM_CacheHierarchy/4096 0.768 ns 0.781 ns 1120000000
BM_CacheHierarchy/8192 0.766 ns 0.767 ns 896000000
BM_CacheHierarchy/16384 0.765 ns 0.767 ns 1120000000
BM_CacheHierarchy/32768 0.769 ns 0.767 ns 896000000
BM_CacheHierarchy/65536 2.68 ns 2.67 ns 263529412
BM_CacheHierarchy/131072 2.68 ns 2.57 ns 248888889
BM_CacheHierarchy/262144 2.68 ns 2.73 ns 263529412
BM_CacheHierarchy/524288 2.69 ns 2.61 ns 263529412
BM_CacheHierarchy/1048576 3.87 ns 3.84 ns 235789474
BM_CacheHierarchy/2097152 8.03 ns 7.85 ns 89600000
BM_CacheHierarchy/4194304 9.88 ns 10.0 ns 74666667
BM_CacheHierarchy/8388608 11.0 ns 11.0 ns 64000000
BM_CacheHierarchy/16777216 12.6 ns 12.3 ns 56000000
BM_CacheHierarchy/33554432 18.8 ns 19.0 ns 34461538
BM_CacheHierarchy/67108864 37.9 ns 38.4 ns 17920000
BM_CacheHierarchy/134217728 58.5 ns 57.5 ns 8960000
BM_CacheHierarchy/268435456 82.5 ns 82.0 ns 8960000
Requirements
- C++23 compiler (GCC 13+ or Clang 16+)
- CMake 3.31+
- Google Benchmark
- Python 3.10+ with
pandas,matplotlib,seaborn - must have group protocols enabled (bat file provided)
Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .

