engine: bundle marginal branch/dispatch reductions to test for a CPU predictor threshold #604

@bpowers

Description

Context

The C-LEARN VM run is branch-mispredict-bound: ~68% of branch-misses come from the eval_bytecode dispatch indirect branch, at IPC ~3.3. During the PR #599 campaign, an instruction-count-reducing change (the to_runtime_view memcpy, −4.2% retired instructions) produced no wall-clock movement; the out-of-order core absorbed the freed instructions in spare IPC. The same was true of bounds-check elimination when investigated alone (docs/design/engine-performance.md R1: sub-noise at opt-level=3).

Idea

Treat individually-sub-noise branch/dispatch reductions as synergy candidates, not discards. The dispatch indirect branch's target-history working set likely sits just above the BTB/predictor capacity, so cumulatively shrinking the program's distinct branch sites / dispatches may cross a threshold and yield a non-linear wall-clock win that none of the changes shows alone.
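To make the threshold intuition concrete, here is a toy model, not a description of any real CPU. It assumes a fully associative table with LRU replacement and a strictly round-robin visit order over the branch sites; real BTBs and indirect predictors are set-associative, history-indexed, and far more complex. The function name `miss_rate` and its parameters are hypothetical. The point is only that, under these assumptions, the miss rate is not linear in the number of sites: it stays near zero while the working set fits and jumps once it exceeds capacity by a single entry.

```rust
use std::collections::VecDeque;

/// Toy model (assumption: fully associative, LRU-replaced table).
/// Visits `sites` distinct branch sites round-robin for `rounds` passes
/// and returns the fraction of accesses that miss the table.
fn miss_rate(capacity: usize, sites: usize, rounds: usize) -> f64 {
    assert!(capacity > 0 && sites > 0 && rounds > 0);
    // Front of the deque is least recently used, back is most recently used.
    let mut table: VecDeque<usize> = VecDeque::with_capacity(capacity);
    let mut misses = 0usize;
    let mut accesses = 0usize;
    for _ in 0..rounds {
        for site in 0..sites {
            accesses += 1;
            if let Some(pos) = table.iter().position(|&s| s == site) {
                // Hit: take the entry out so it can be re-inserted as MRU.
                let _ = table.remove(pos);
            } else {
                misses += 1;
                if table.len() == capacity {
                    // Table is full: evict the least recently used entry.
                    table.pop_front();
                }
            }
            table.push_back(site);
        }
    }
    misses as f64 / accesses as f64
}
```

In this model, going from 65 sites to 64 with a 64-entry table takes the miss rate from 100% to cold misses only, while going from 70 to 66 changes nothing. That is the shape of effect the bundle is meant to test for; whether the real predictor behaves anything like this is exactly what the measurement has to show.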

Concretely:

  • Maintain a set of marginal reductions (more superinstructions, removing conditional arms from the hot loop, view-validity branch hoisting, etc.).
  • Measure them as a bundle with perf stat -e instructions,cycles,branches,branch-misses,L1-icache-load-misses; watch the branch-miss rate and IPC, not just wall-clock.
  • Re-evaluate bounds-check elimination in the bundled context. It is sub-noise alone (R1), but it removes ~127 panic_bounds_check branch sites from eval_bytecode; it may contribute to a threshold tip. (This would require unsafe get_unchecked plus a static validation pass (see R1), so it is gated on that decision and is not pursued solo.)
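For the last bullet, a minimal sketch of what "unsafe get_unchecked plus a static validation pass" could look like. Everything here is hypothetical: the `Op` enum, the `Validated` wrapper, and the `validate`/`eval` functions are illustrative stand-ins, not the engine's actual bytecode or the design discussed in R1. The idea is that every slot index is checked once when the program is built, so the hot loop can index without a per-access panic_bounds_check branch.

```rust
/// Hypothetical register-style opcodes; not the engine's real bytecode.
#[derive(Clone, Copy)]
enum Op {
    Add { dst: usize, a: usize, b: usize },
    Mul { dst: usize, a: usize, b: usize },
}

/// A program whose slot indices have all been checked against `n_slots`.
/// In real code the fields would be private to a module so that
/// `validate` is the only way to construct this type.
struct Validated {
    ops: Vec<Op>,
    n_slots: usize,
}

/// Static validation pass: reject any program that names a slot >= n_slots.
fn validate(ops: Vec<Op>, n_slots: usize) -> Option<Validated> {
    let ok = ops.iter().all(|op| match *op {
        Op::Add { dst, a, b } | Op::Mul { dst, a, b } => {
            dst < n_slots && a < n_slots && b < n_slots
        }
    });
    if ok {
        Some(Validated { ops, n_slots })
    } else {
        None
    }
}

/// Hot loop: one length check up front, no bounds checks per access.
fn eval(prog: &Validated, slots: &mut [f64]) {
    assert!(slots.len() >= prog.n_slots, "slot buffer too small");
    for op in &prog.ops {
        // SAFETY: `validate` proved every index is < n_slots, and the
        // assert above proved n_slots <= slots.len().
        unsafe {
            match *op {
                Op::Add { dst, a, b } => {
                    let v = *slots.get_unchecked(a) + *slots.get_unchecked(b);
                    *slots.get_unchecked_mut(dst) = v;
                }
                Op::Mul { dst, a, b } => {
                    let v = *slots.get_unchecked(a) * *slots.get_unchecked(b);
                    *slots.get_unchecked_mut(dst) = v;
                }
            }
        }
    }
}
```

The soundness of the `unsafe` block rests entirely on `Validated` being unforgeable, which is why this remains gated on the R1 decision rather than being something to slip in as part of the bundle.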

Refs

Metadata

    Labels

    engine (Issues with the rust-based simulation engine), enhancement (New feature or request)
