test(gpu-d3d12): synthetic-load harness to validate sandwich additivity #22
Merged
The D3D12 GPU sandwich shows a ~270us floor on an empty pre->post run
(target_gpu_us ~= 270us with no target between the layers) vs ~0 on D3D11.
Root cause: D3D11 places both timestamps inline in the app's immediate
context (adjacent, zero gap), while D3D12 has no immediate context so pre
and post each submit their own ExecuteCommandLists -- the gap between the
two is inter-submission latency, not target work.
Before building a calibrate-and-subtract correction, we must know whether
that overhead is ADDITIVE (a constant we can subtract) or ABSORBED by a real
target's GPU work (in which case subtracting a constant over-corrects). This
harness measures it.
How it works: the PRE side, only when MLEDOUR_GPULOAD_ITERS is set, submits a
self-timed CopyBufferRegion loop on the app queue AFTER recording T_pre and
BEFORE forwarding, so on the queue it lands between T_pre and T_post. The
merge's target_gpu_us then measures S = K + O, where K is the loop's inline
self-timed duration (gpuload-<pid>.csv) and O is the overhead. Sweeping the
iteration count and joining K against S reveals whether O is constant
(additive) or collapses as K grows (absorbed).
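
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 30us tolerance, and the sample sweep values are hypothetical and are not taken from the harness or from any measurement.

```python
from statistics import mean


def overhead_residuals(pairs):
    """Return O = S - K for each (K, S) pair, all in microseconds."""
    return [s - k for k, s in pairs]


def classify(pairs, tolerance_us=30.0):
    """Label the overhead 'additive' if O stays within tolerance_us of its
    mean across the sweep, otherwise 'absorbed'.

    The tolerance is an assumed value chosen for illustration.
    """
    residuals = overhead_residuals(pairs)
    centre = mean(residuals)
    spread = max(abs(o - centre) for o in residuals)
    return "additive" if spread <= tolerance_us else "absorbed"


# Hypothetical (K, S) sweep points in microseconds, for illustration only.
constant_overhead = [(0.0, 270.0), (500.0, 772.0), (2000.0, 2268.0)]
collapsing_overhead = [(0.0, 270.0), (500.0, 700.0), (2000.0, 2050.0)]

print(classify(constant_overhead))
print(classify(collapsing_overhead))
```

In the first made-up sweep O stays near 270us at every K, so a constant could be subtracted. In the second, O shrinks as K grows, which is the case where a constant correction would over-correct.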
Contents:
- utils/synthetic_load.{h,cpp}: self-timed D3D12 copy-loop load, mirroring the
fence/ring discipline of gpu_timer.cpp's D3D12 backend
- layer.cpp: gated PRE-side build (xrCreateSession) / record (xrEndFrame) /
poll / flush (xrDestroySession + dtor fallback)
- openxr-api-layer.vcxproj: register the two new files
- scripts/validate_additivity.py: join by pid, regress S = a*K + b, print verdict
- docs/ADDITIVITY_VALIDATION.md: build + sweep + analyze protocol
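
A minimal sketch of the join-and-regress step named for scripts/validate_additivity.py, assuming one (K, S) summary value per process. The actual script is not shown here, so the helper names, the dict inputs, and the slope tolerance are hypothetical, and the verdict rule (slope near 1 means additive) is one plausible reading of S = a*K + b rather than the script's confirmed logic.

```python
def join_by_pid(k_by_pid, s_by_pid):
    """Pair K and S for every pid present in both mappings, ordered by pid."""
    common = sorted(set(k_by_pid) & set(s_by_pid))
    return [(k_by_pid[pid], s_by_pid[pid]) for pid in common]


def fit_line(pairs):
    """Ordinary least squares for S = a*K + b. Returns (a, b)."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two sweep points")
    mean_k = sum(k for k, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    var_k = sum((k - mean_k) ** 2 for k, _ in pairs)
    if var_k == 0:
        raise ValueError("K must vary across the sweep")
    cov = sum((k - mean_k) * (s - mean_s) for k, s in pairs)
    a = cov / var_k
    b = mean_s - a * mean_k
    return a, b


def verdict(a, slope_tolerance=0.05):
    """A slope near 1 leaves b as a constant overhead (additive);
    a slope below 1 means the overhead shrinks as K grows (absorbed)."""
    return "additive" if abs(a - 1.0) <= slope_tolerance else "absorbed"


# Hypothetical per-pid values in microseconds, for illustration only.
k_by_pid = {101: 0.0, 102: 1000.0, 103: 2000.0}
s_by_pid = {101: 270.0, 102: 1270.0, 103: 2270.0, 999: 5.0}

a, b = fit_line(join_by_pid(k_by_pid, s_by_pid))
print(a, b, verdict(a))
```

If O were constant, S = K + O gives a slope of 1 and an intercept equal to O, which is why the intercept b is the candidate correction only when a stays close to 1.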
Off by default: with MLEDOUR_GPULOAD_ITERS unset, no code runs and the monitor
is behaviourally unchanged. Every site is tagged TEMPORARY for removal once the
additivity question is decided. Excludes the unrelated external/ submodule change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
layer.cpp is compiled by the test project too, and it references MakeSyntheticGpuLoad; the tests vcxproj is its own link unit, so add synthetic_load.{cpp,h} there exactly like gpu_timer.cpp already is. Remove when the additivity harness is deleted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The gpuload-<pid>.csv was written only at xrDestroySession / the dtor, but the frame+GPU CSVs are finalized and merged at Ctrl+F9-OFF (ApplyToggle stop), mid-session. Killing hello_xr after stopping monitoring skipped clean teardown, so the merged CSV was present but gpuload never appeared -- exactly the reported symptom (log shows 'SYNTHETIC GPU LOAD active', no file). Now flush at the Ctrl+F9-OFF point too (the one path guaranteed to be reached when the user stops monitoring), reset the K accumulator on Ctrl+F9-ON to match g_csv's truncate-on-start, and log the written row count so an empty result is diagnosable from the .log.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
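
The reset-on-start / flush-on-stop discipline described in this commit can be sketched as below. The real change lives in the C++ layer; this Python class, its method names, and the `frame`/`k_us` column names are hypothetical stand-ins used only to show the lifecycle.

```python
import csv
import io


class KAccumulator:
    """Hypothetical stand-in for the harness's K accumulator."""

    def __init__(self):
        self._rows = []

    def on_monitor_start(self):
        # Mirror a truncate-on-start CSV: drop rows from any earlier window.
        self._rows.clear()

    def record(self, frame_index, k_us):
        self._rows.append((frame_index, k_us))

    def on_monitor_stop(self, out):
        # Flush at the stop point instead of waiting for teardown, and
        # return the row count so the caller can log it.
        writer = csv.writer(out)
        writer.writerow(["frame", "k_us"])
        writer.writerows(self._rows)
        return len(self._rows)


acc = KAccumulator()
acc.record(0, 1.0)        # stale row from before monitoring was started
acc.on_monitor_start()    # the "Ctrl+F9-ON" moment in this sketch
acc.record(1, 2.5)
buf = io.StringIO()
rows_written = acc.on_monitor_stop(buf)  # the "Ctrl+F9-OFF" moment
print(f"gpuload rows written: {rows_written}")
```

Returning the count from the flush keeps the "empty result" case visible: a log line reporting zero rows distinguishes "nothing recorded" from "file never written".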
The synthetic-load harness (51bc334, e0e059b, 476cff0) did its job: it proved the D3D12 GPU sandwich's ~290us overhead is NOT additive -- a heavy target absorbs it, a light one is dominated by it, so it can't be calibrated away by subtracting a constant (O = S - K held ~288us for K up to ~300us, then fell to ~74us at K ~= 3.4ms). No current layer does per-frame GPU work (fov_crop = projection metadata + a one-shot overlay upload composited by the runtime), so the harness has no consumer.
Removed: synthetic_load.{h,cpp}, the gated pre-side injection/flush in layer.cpp, the vcxproj entries (both projects), validate_additivity.py, the run-protocol doc, and the unused gpu_probe.h scaffold. The finding is documented in gpu_timer.h + README (next commit); restore these commits if a GPU-heavy layer ever needs the inline-probe path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
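
As a worked check of why a constant cannot be subtracted, the arithmetic below uses the approximate figures quoted in this commit message (O ~= 288us at K ~= 300us, O ~= 74us at K ~= 3.4ms). The function name is hypothetical and the inputs are the rounded values from the text, not raw measurements.

```python
# Approximate (K, O) figures quoted above, in microseconds.
LIGHT = (300.0, 288.0)
HEAVY = (3400.0, 74.0)

# The constant a light-load calibration would produce.
CALIBRATED_FLOOR_US = 288.0


def constant_correction_error(k_us, o_us, floor_us=CALIBRATED_FLOOR_US):
    """Error left after subtracting a fixed floor from S = K + O.

    Zero means the correction recovers K exactly; a negative value means
    the corrected figure under-reports the target's real GPU time.
    """
    s_us = k_us + o_us          # what target_gpu_us would report
    corrected_us = s_us - floor_us
    return corrected_us - k_us


light_error = constant_correction_error(*LIGHT)
heavy_error = constant_correction_error(*HEAVY)
print(light_error, heavy_error)  # 0.0 -214.0
```

With these figures the correction is exact for the light target but under-reports the heavy one by about 214us, which is the over-correction the harness was built to detect.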
gpu_timer.h LIMITATIONS + a README subsection: the D3D12 GPU sandwich carries a ~290us inter-submission overhead floor that a heavy target absorbs (O fell from ~290us at <=300us of target work to ~74us at ~3.4ms) -- not additive, not subtractable. D3D11 is floor-free (inline shared stream). target_gpu_us is reliable for >~1ms targets, dominated by the floor for <~300us; CPU unaffected; to profile your own layer, time it inline.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>