test(gpu-d3d12): synthetic-load harness to validate sandwich additivity by mledour · Pull Request #22 · mledour/XR-Layer-Monitor

mledour · 2026-06-09T08:55:42Z

The D3D12 GPU sandwich shows a ~270us floor on an empty pre->post run (target_gpu_us ~= 270us with no target between the layers) vs ~0 on D3D11. Root cause: D3D11 places both timestamps inline in the app's immediate context (adjacent, zero gap), while D3D12 has no immediate context so pre and post each submit their own ExecuteCommandLists -- the gap between the two is inter-submission latency, not target work.

Before building a calibrate-and-subtract correction, we must know whether that overhead is ADDITIVE (a constant we can subtract) or ABSORBED by a real target's GPU work (in which case subtracting a constant over-corrects). This harness measures it.

How it works: the PRE side, only when MLEDOUR_GPULOAD_ITERS is set, submits a self-timed CopyBufferRegion loop on the app queue AFTER recording T_pre and BEFORE forwarding, so on the queue it lands between T_pre and T_post. The merge's target_gpu_us then measures S = K + O, where K is the loop's inline self-timed duration (gpuload-<pid>.csv) and O is the overhead. Sweeping the iteration count and joining K against S reveals whether O is constant (additive) or collapses as K grows (absorbed).
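The join step can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes each sweep run is one process, that K and S samples are already loaded into per-pid lists of microseconds, and that a median is a reasonable per-run summary. The function name `join_by_pid` and the sample values are hypothetical.

```python
from statistics import median


def join_by_pid(k_samples, s_samples):
    """Join per-run K and S samples on pid and derive O = S - K.

    k_samples / s_samples: dict mapping pid -> list of durations in us.
    Both the dict layout and the use of a median are assumptions made
    for this sketch; the real CSV schema is not shown in the PR text.
    Returns a list of (pid, K, S, O) tuples, one per pid present in both.
    """
    rows = []
    for pid in sorted(k_samples.keys() & s_samples.keys()):
        k = median(k_samples[pid])  # inline self-timed loop duration
        s = median(s_samples[pid])  # sandwich reading (target_gpu_us)
        rows.append((pid, k, s, s - k))
    return rows


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the PR.
    k = {101: [100, 110, 90], 202: [1000, 1010, 990]}
    s = {101: [370, 380, 360], 202: [1200], 303: [5]}
    for pid, k_us, s_us, o_us in join_by_pid(k, s):
        print(f"pid={pid} K={k_us}us S={s_us}us O={o_us}us")
```

If O stays roughly the same across rows as K grows, the overhead looks additive; if O shrinks, the target's work is absorbing it.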

Contents:

- utils/synthetic_load.{h,cpp}: self-timed D3D12 copy-loop load, mirroring the fence/ring discipline of gpu_timer.cpp's D3D12 backend
- layer.cpp: gated PRE-side build (xrCreateSession) / record (xrEndFrame) / poll / flush (xrDestroySession + dtor fallback)
- openxr-api-layer.vcxproj: register the two new files
- scripts/validate_additivity.py: join by pid, regress S = a*K + b, print verdict
- docs/ADDITIVITY_VALIDATION.md: build + sweep + analyze protocol
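The regression named for scripts/validate_additivity.py can be illustrated with an ordinary least-squares fit. The script itself is not shown here, so this is only a sketch under stated assumptions: under the additive hypothesis S = K + O with constant O, the fit should give a slope a close to 1 and an intercept b close to O, while an absorbed overhead shrinks with K and pulls the slope below 1. The function names, the slope tolerance, and the verdict strings are hypothetical.

```python
def fit_line(ks, ss):
    """Ordinary least-squares fit of S = a*K + b. Returns (a, b)."""
    if len(ks) != len(ss) or len(ks) < 2:
        raise ValueError("need at least two paired (K, S) points")
    n = len(ks)
    mean_k = sum(ks) / n
    mean_s = sum(ss) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    if sxx == 0:
        raise ValueError("all K values are identical; slope is undefined")
    sxy = sum((k - mean_k) * (s - mean_s) for k, s in zip(ks, ss))
    a = sxy / sxx
    b = mean_s - a * mean_k
    return a, b


def verdict(a, slope_tol=0.02):
    """Classify the slope. slope_tol is an assumed threshold, not the PR's."""
    if abs(a - 1.0) <= slope_tol:
        return "ADDITIVE"      # O behaves like a constant offset
    if a < 1.0:
        return "ABSORBED"      # O shrinks as the target work grows
    return "INCONCLUSIVE"


if __name__ == "__main__":
    # Illustrative sweep, not measured data: constant 270us offset.
    a, b = fit_line([100, 1000, 3000], [370, 1270, 3270])
    print(f"a={a:.3f} b={b:.1f}us -> {verdict(a)}")
```

A slope test alone can miss a floor that is flat over part of the range and only collapses later, so inspecting O = S - K per sweep point alongside the fit is a sensible cross-check.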

Off by default: with MLEDOUR_GPULOAD_ITERS unset, no code runs and the monitor is behaviourally unchanged. Every site is tagged TEMPORARY for removal once the additivity question is decided. Excludes the unrelated external/ submodule change.


layer.cpp is compiled by the test project too, and it references MakeSyntheticGpuLoad; the tests vcxproj is its own link unit, so add synthetic_load.{cpp,h} there exactly like gpu_timer.cpp already is. Remove when the additivity harness is deleted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The gpuload-<pid>.csv was written only at xrDestroySession / the dtor, but the frame+GPU CSVs are finalized and merged at Ctrl+F9-OFF (ApplyToggle stop), mid-session. Killing hello_xr after stopping monitoring skipped clean teardown, so the merged CSV was present but gpuload never appeared -- exactly the reported symptom (log shows 'SYNTHETIC GPU LOAD active', no file). Now flush at the Ctrl+F9-OFF point too (the one path guaranteed reached when the user stops monitoring), reset the K accumulator on Ctrl+F9-ON to match g_csv's truncate-on-start, and log the written row count so an empty result is diagnosable from the .log. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The synthetic-load harness (51bc334, e0e059b, 476cff0) did its job: it proved the D3D12 GPU sandwich's ~290us overhead is NOT additive -- a heavy target absorbs it, a light one is dominated by it, so it can't be calibrated away by subtracting a constant (O = S - K held ~288us for K up to ~300us, then fell to ~74us at K ~= 3.4ms). No current layer does per-frame GPU work (fov_crop = projection metadata + a one-shot overlay upload composited by the runtime), so the harness has no consumer. Removed: synthetic_load.{h,cpp}, the gated pre-side injection/flush in layer.cpp, the vcxproj entries (both projects), validate_additivity.py, the run-protocol doc, and the unused gpu_probe.h scaffold. The finding is documented in gpu_timer.h + README (next commit); restore these commits if a GPU-heavy layer ever needs the inline-probe path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
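The over-correction can be worked through with the figures quoted above. This is a sketch, assuming a constant of 288us calibrated at light load and reconstructing S as K + O from the approximate values in the commit message; `corrected_target_us` is a hypothetical helper, not code from the repository.

```python
CALIBRATED_FLOOR_US = 288.0  # overhead observed for K up to ~300us


def corrected_target_us(s_us, floor_us=CALIBRATED_FLOOR_US):
    """Naive calibrate-and-subtract: remove a constant floor from S."""
    return s_us - floor_us


if __name__ == "__main__":
    # (K, O) pairs in us, rounded from the values reported in the PR.
    points = [(300.0, 288.0), (3400.0, 74.0)]
    for k_us, o_us in points:
        s_us = k_us + o_us                      # what the sandwich reports
        error_us = corrected_target_us(s_us) - k_us
        print(f"K={k_us:.0f}us S={s_us:.0f}us "
              f"corrected={corrected_target_us(s_us):.0f}us "
              f"error={error_us:+.0f}us")
```

At the light point the correction is exact by construction, but at K ~= 3.4ms the corrected value comes out about 214us below the true target work, which is the over-correction the harness was built to detect.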

gpu_timer.h LIMITATIONS + a README subsection: the D3D12 GPU sandwich carries a ~290us inter-submission overhead floor that a heavy target absorbs (O fell from ~290us at <=300us of target work to ~74us at ~3.4ms) -- not additive, not subtractable. D3D11 is floor-free (inline shared stream). target_gpu_us is reliable for >~1ms targets, dominated by the floor for <~300us; CPU unaffected; to profile your own layer, time it inline. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mledour and others added 5 commits June 9, 2026 10:54

mledour merged commit 001f233 into main Jun 9, 2026
3 checks passed

mledour merged 5 commits into main from test/gpu-d3d12-additivity-harness
