Context
The C-LEARN VM run is branch-mispredict-bound: ~68% of branch-misses come from the eval_bytecode dispatch indirect branch, at an IPC of ~3.3. During the PR #599 campaign, an instruction-count-reducing change (the to_runtime_view memcpy, −4.2% retired instructions) produced no wall-clock movement; the out-of-order core absorbed the freed instructions in spare IPC. Bounds-check elimination behaved the same way when investigated alone (docs/design/engine-performance.md R1: sub-noise at opt-level=3).
Idea
Treat individually-sub-noise branch/dispatch reductions as synergy candidates, not discards. The dispatch indirect branch's target-history working set likely sits just above the BTB/predictor capacity, so cumulatively shrinking the program's distinct branch sites / dispatches may cross a threshold for a non-linear wall-clock win that none of the changes show alone.
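To illustrate why a working set just above capacity could produce a cliff rather than a slope, here is a toy sketch. It assumes a fully associative, LRU-replaced table visited in a fixed cyclic order; this is a deliberate simplification, not a model of any real BTB or indirect predictor. The function name miss_rate and its parameters are hypothetical.

```rust
use std::collections::VecDeque;

/// Toy model only: an LRU table of `capacity` entries standing in for a
/// predictor structure. Real predictors are set-associative and
/// history-indexed, so the real cliff (if any) will be softer.
/// Returns the steady-state miss rate when `distinct_sites` branch sites
/// are visited cyclically for `rounds` rounds (round 0 is warm-up).
fn miss_rate(capacity: usize, distinct_sites: usize, rounds: usize) -> f64 {
    assert!(capacity > 0, "capacity must be non-zero");
    let mut table: VecDeque<usize> = VecDeque::with_capacity(capacity);
    let mut misses = 0usize;
    let mut lookups = 0usize;

    for round in 0..rounds {
        for site in 0..distinct_sites {
            let hit = match table.iter().position(|&s| s == site) {
                Some(pos) => {
                    // Hit: take the entry out so it can be re-inserted as MRU.
                    table.remove(pos);
                    true
                }
                None => {
                    // Miss: evict the least recently used entry if full.
                    if table.len() == capacity {
                        table.pop_front();
                    }
                    false
                }
            };
            table.push_back(site);

            // Only count lookups after the warm-up round.
            if round > 0 {
                lookups += 1;
                if !hit {
                    misses += 1;
                }
            }
        }
    }

    if lookups == 0 {
        0.0
    } else {
        misses as f64 / lookups as f64
    }
}
```

Under these assumptions, miss_rate(8, 9, 10) is 1.0 while miss_rate(8, 8, 10) is 0.0: removing a single site flips the steady state from every lookup missing to none missing. Whether the real predictor behaves anything like this is exactly what the bundled measurement has to establish.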
Concretely:
- Maintain a set of marginal reductions (more superinstructions, removing conditional arms from the hot loop, view-validity branch hoisting, etc.).
- Measure them as a bundle with perf stat -e instructions,cycles,branches,branch-misses,L1-icache-load-misses, watching the branch-miss rate and IPC, not just wall-clock.
- Re-evaluate bounds-check elimination in the bundled context. It is sub-noise alone (R1), but it removes ~127 panic_bounds_check branch sites from eval_bytecode, so it may contribute to a threshold tip. (It would require unsafe get_unchecked plus a static validation pass, see R1, so it is gated on that decision and not pursued solo.)
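The measurement step above can be sketched as a small helper that turns raw perf stat counts into the ratios worth comparing between a baseline run and a bundled run. This is a minimal sketch: the Counters struct, its field names, and rel_change are hypothetical, and the counts are assumed to be copied by hand from perf stat output rather than parsed.

```rust
/// Raw event counts from one `perf stat` run (hypothetical container;
/// values are assumed to be entered manually from the perf output).
#[derive(Debug, Clone, Copy)]
struct Counters {
    instructions: u64,
    cycles: u64,
    branches: u64,
    branch_misses: u64,
    l1_icache_load_misses: u64,
}

impl Counters {
    /// Instructions retired per cycle.
    fn ipc(&self) -> f64 {
        self.instructions as f64 / self.cycles as f64
    }

    /// Fraction of branches that were mispredicted.
    fn branch_miss_rate(&self) -> f64 {
        self.branch_misses as f64 / self.branches as f64
    }

    /// Branch misses per 1000 retired instructions. Useful because the
    /// bundle changes the instruction count as well as the miss count.
    fn branch_mpki(&self) -> f64 {
        self.branch_misses as f64 * 1000.0 / self.instructions as f64
    }

    /// L1 instruction-cache load misses per 1000 retired instructions.
    fn icache_mpki(&self) -> f64 {
        self.l1_icache_load_misses as f64 * 1000.0 / self.instructions as f64
    }
}

/// Relative change of `after` versus `before` (-0.10 means a 10% drop).
fn rel_change(before: f64, after: f64) -> f64 {
    (after - before) / before
}
```

Comparing rel_change of branch_miss_rate() and ipc() between the two runs, alongside wall-clock, shows whether the bundle moved the predictor even when a single change did not. Absolute miss counts are also worth keeping, since a ratio can shift merely because the bundle removed well-predicted branches.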
Refs
docs/design/engine-performance.md: R1 (bounds-check elimination, sub-noise alone), R3 (dispatch).