Context
During the PR #599 VM-interpreter perf campaign on the C-LEARN model (the largest model: ~5,200 root slots, ~63k opcodes, 1000 Euler steps), a fresh perf profile showed that the run is branch-mispredict-bound, not instruction-bound (IPC ~3.3). perf record -e branch-misses attributes ~68% of all branch-misses to Vm::eval_bytecode, specifically to its central dispatch loop, while pc < code.len() { match &code[pc] { ... } }, which lowers to a single data-dependent indirect branch (one jump table). C-LEARN's ~25k-opcode flow program is executed 1000×, and the target-history working set of that one indirect branch exceeds the BTB/predictor capacity, so it mispredicts heavily (an estimated ~15-18% of run cycles).
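For readers unfamiliar with the pattern, here is a minimal sketch of a central match-dispatch loop of the shape described above. The opcode set, the stack layout, and the function signature are hypothetical and chosen only for illustration; the real Vm::eval_bytecode in vm.rs is considerably larger.

```rust
// Hypothetical toy opcode set; the engine's real opcodes differ.
#[derive(Clone, Copy, Debug)]
enum Op {
    LoadConst(f64),   // push a constant
    LoadSlot(usize),  // push slots[i]
    Add,              // pop two values, push their sum
    Mul,              // pop two values, push their product
    StoreSlot(usize), // pop a value into slots[i]
}

// Central dispatch: one loop, one match. Compilers typically lower the match
// to a single jump table, so every opcode transition in the program goes
// through the same indirect branch.
fn eval_bytecode(code: &[Op], slots: &mut [f64]) {
    let mut stack: Vec<f64> = Vec::with_capacity(16);
    let mut pc = 0;
    while pc < code.len() {
        match &code[pc] {
            Op::LoadConst(c) => stack.push(*c),
            Op::LoadSlot(i) => stack.push(slots[*i]),
            Op::Add => {
                let r = stack.pop().expect("stack underflow");
                let l = stack.pop().expect("stack underflow");
                stack.push(l + r);
            }
            Op::Mul => {
                let r = stack.pop().expect("stack underflow");
                let l = stack.pop().expect("stack underflow");
                stack.push(l * r);
            }
            Op::StoreSlot(i) => slots[*i] = stack.pop().expect("stack underflow"),
        }
        pc += 1;
    }
}
```

Because the predictor sees only one branch site, it has to predict the next handler from the history of that single site. That is cheap for a short program and expensive for a ~25k-opcode one.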
Opcode fusion (PR #599) chips away at this by cutting the dispatch count (the 3-operand and global/const fusions reduced branch-misses by ~0.95% and ~3.8% respectively), but the dispatch mechanism itself remains the structural bottleneck.
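As an illustration of why fusion reduces dispatch count, the sketch below fuses a three-opcode sequence into one superinstruction with a peephole pass. The AddSlotConst opcode and the fuse and run helpers are hypothetical; they are not the fusions implemented in PR #599. The toy bytecode has no jumps, so the pass does not need to remap jump targets, which a real pass would.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    LoadConst(f64),
    LoadSlot(usize),
    Add,
    StoreSlot(usize),
    // Hypothetical superinstruction: push slots[i] + c in a single dispatch.
    AddSlotConst(usize, f64),
}

// Peephole pass: rewrite LoadSlot(i), LoadConst(c), Add into AddSlotConst(i, c).
fn fuse(code: &[Op]) -> Vec<Op> {
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        if let [Op::LoadSlot(s), Op::LoadConst(c), Op::Add, ..] = &code[i..] {
            out.push(Op::AddSlotConst(*s, *c));
            i += 3;
        } else {
            out.push(code[i]);
            i += 1;
        }
    }
    out
}

// Runs the program and returns the number of dispatches it took.
fn run(code: &[Op], slots: &mut [f64]) -> usize {
    let mut stack: Vec<f64> = Vec::new();
    let mut dispatches = 0;
    for op in code {
        dispatches += 1;
        match *op {
            Op::LoadConst(c) => stack.push(c),
            Op::LoadSlot(i) => stack.push(slots[i]),
            Op::Add => {
                let r = stack.pop().expect("stack underflow");
                let l = stack.pop().expect("stack underflow");
                stack.push(l + r);
            }
            Op::StoreSlot(i) => slots[i] = stack.pop().expect("stack underflow"),
            Op::AddSlotConst(i, c) => stack.push(slots[i] + c),
        }
    }
    dispatches
}
```

Fusion removes dispatches but leaves the remaining ones on the same single branch site, which is consistent with the modest branch-miss reductions reported above.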
Idea
Threaded dispatch. Instead of one central match, give each opcode handler its own continuation that dispatches the next opcode directly (each handler tail-calls the next via the opcode table). This spreads the single indirect branch across many sites, so the predictor can correlate per-opcode successors. This is the classic interpreter speedup (CPython's "computed goto" dispatch, LuaJIT, etc.). In Rust the portable equivalent is guaranteed tail calls via the become keyword.
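A rough sketch of the handler-per-opcode shape follows, written for stable Rust so it compiles today. All names here (Inst, State, TABLE, the op_* handlers, run_threaded) are hypothetical. The important caveat: on stable Rust the call in tail position is an ordinary call that the optimizer may or may not turn into a jump, so stack depth can grow with program length. With guaranteed tail calls, that call is where become would be used instead.

```rust
const LOAD_CONST: u8 = 0;
const LOAD_SLOT: u8 = 1;
const ADD: u8 = 2;
const STORE_SLOT: u8 = 3;
const HALT: u8 = 4;

#[derive(Clone, Copy)]
struct Inst {
    op: u8,     // index into TABLE
    idx: usize, // slot index, used by LOAD_SLOT / STORE_SLOT
    imm: f64,   // immediate, used by LOAD_CONST
}

struct State {
    stack: Vec<f64>,
    slots: Vec<f64>,
}

// Every handler has the same signature, which is what makes a uniform
// tail call from one handler to the next possible.
type Handler = fn(&mut State, &[Inst], usize);

static TABLE: [Handler; 5] = [op_load_const, op_load_slot, op_add, op_store_slot, op_halt];

// Inlined into each handler, so each handler ends in its own indirect call
// site rather than sharing one central branch.
#[inline(always)]
fn dispatch(st: &mut State, code: &[Inst], pc: usize) {
    // Not a guaranteed tail call on stable Rust (see the caveat above).
    TABLE[code[pc].op as usize](st, code, pc)
}

fn op_load_const(st: &mut State, code: &[Inst], pc: usize) {
    st.stack.push(code[pc].imm);
    dispatch(st, code, pc + 1)
}

fn op_load_slot(st: &mut State, code: &[Inst], pc: usize) {
    st.stack.push(st.slots[code[pc].idx]);
    dispatch(st, code, pc + 1)
}

fn op_add(st: &mut State, code: &[Inst], pc: usize) {
    let r = st.stack.pop().expect("stack underflow");
    let l = st.stack.pop().expect("stack underflow");
    st.stack.push(l + r);
    dispatch(st, code, pc + 1)
}

fn op_store_slot(st: &mut State, code: &[Inst], pc: usize) {
    st.slots[code[pc].idx] = st.stack.pop().expect("stack underflow");
    dispatch(st, code, pc + 1)
}

// Terminates the chain: returns without dispatching further.
fn op_halt(_st: &mut State, _code: &[Inst], _pc: usize) {}

// Runs a HALT-terminated program and returns the final slot values.
fn run_threaded(code: &[Inst], slots: Vec<f64>) -> Vec<f64> {
    let mut st = State { stack: Vec::new(), slots };
    dispatch(&mut st, code, 0);
    st.slots
}
```

The program must end with a HALT instruction, since there is no central loop condition to stop execution. Whether this layout actually improves prediction on C-LEARN is exactly what a spike would need to measure.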
Caveat / blocker
become is unstable (nightly Rust only, behind the explicit_tail_calls feature gate), so adopting it is a toolchain/policy decision for the project, not a code-level change. It is neither unsafe nor assembly. Until the project accepts nightly or become stabilizes, the only portable lever is more superinstructions (dispatch-count reduction).
Expected impact
Potentially the largest single remaining win (it attacks the 68%-of-branch-misses bottleneck directly), but unquantified and gated on the nightly decision. Worth a spike to measure on a become-based prototype before committing.
Refs
src/simlin-engine/src/vm.rs: Vm::eval_bytecode, the while pc < code.len() { match &code[pc] } dispatch loop.
docs/design/engine-performance.md: R3 (faster dispatch); notes become is unstable.