engine: threaded VM dispatch (guaranteed tail calls) to cut the dispatch branch-miss bottleneck #601

## Context

During the PR #599 VM-interpreter perf campaign on the C-LEARN model (the largest model: ~5,200 root slots, ~63k opcodes, 1000 Euler steps), a fresh `perf` profile showed the run is **branch-mispredict-bound, not instruction-bound** (IPC ~3.3). `perf record -e branch-misses` attributes **~68% of all branch-misses to `Vm::eval_bytecode`** — specifically its central dispatch `while pc < code.len() { match &code[pc] { ... } }`, which lowers to a single data-dependent indirect branch (one jump table). With C-LEARN's ~25k-opcode flow program executed 1000×, the indirect branch's target-history working set exceeds the BTB/predictor capacity, so it mispredicts heavily (~15-18% of run cycles, est.).
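For readers unfamiliar with the shape being discussed, here is a minimal sketch of a central-dispatch loop. The opcode set, the stack layout, and the free-function signature are hypothetical stand-ins, not the actual `Vm::eval_bytecode`; the point is only that every opcode passes through the one `match`, which typically compiles to a single jump-table indirect branch.

```rust
// Hypothetical miniature opcode set; the real engine's opcodes differ.
#[derive(Debug, Clone, Copy)]
enum Op {
    PushConst(f64),
    LoadSlot(usize),
    Add,
    Mul,
    StoreSlot(usize),
}

/// Central dispatch: one loop, one `match`, so one shared indirect branch
/// has to predict the successor of every opcode in the program.
fn eval_bytecode(code: &[Op], slots: &mut [f64]) {
    let mut stack: Vec<f64> = Vec::with_capacity(16);
    let mut pc = 0;
    while pc < code.len() {
        match &code[pc] {
            Op::PushConst(v) => stack.push(*v),
            Op::LoadSlot(i) => stack.push(slots[*i]),
            Op::Add => {
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a + b);
            }
            Op::Mul => {
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a * b);
            }
            Op::StoreSlot(i) => slots[*i] = stack.pop().expect("stack underflow"),
        }
        pc += 1;
    }
}

fn main() {
    // slots[1] = slots[0] * 2.0 + 1.0
    let code = [
        Op::LoadSlot(0),
        Op::PushConst(2.0),
        Op::Mul,
        Op::PushConst(1.0),
        Op::Add,
        Op::StoreSlot(1),
    ];
    let mut slots = [3.0, 0.0];
    eval_bytecode(&code, &mut slots);
    println!("{:?}", slots);
}
```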

Opcode fusion (PR #599) chips at this by cutting *dispatch count* (the 3-operand and global/const fusions dropped branch-misses ~0.95% and ~3.8% respectively), but the dispatch *mechanism* is the structural bottleneck.
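To illustrate what "cutting dispatch count" means, here is a rough sketch of a peephole fusion pass. The opcodes and the `LoadSlotAddConst` superinstruction are invented for illustration and are not the fusions that landed in PR #599; the sketch also assumes straight-line code with no jump targets inside a fused window.

```rust
// Hypothetical opcode set, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    LoadSlot(usize),
    PushConst(f64),
    Add,
    StoreSlot(usize),
    // Hypothetical superinstruction: pushes slots[i] + c in one dispatch.
    LoadSlotAddConst(usize, f64),
}

/// Rewrites every `LoadSlot, PushConst, Add` window into one fused opcode,
/// so the interpreter pays for one dispatch instead of three.
/// Assumes no jump lands in the middle of a fused window.
fn fuse(code: &[Op]) -> Vec<Op> {
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        match &code[i..] {
            [Op::LoadSlot(s), Op::PushConst(c), Op::Add, ..] => {
                out.push(Op::LoadSlotAddConst(*s, *c));
                i += 3;
            }
            _ => {
                out.push(code[i]);
                i += 1;
            }
        }
    }
    out
}

fn main() {
    let code = [
        Op::LoadSlot(0),
        Op::PushConst(1.0),
        Op::Add,
        Op::StoreSlot(1),
    ];
    let fused = fuse(&code);
    println!("{} opcodes -> {} opcodes", code.len(), fused.len());
}
```

Fusion shrinks the number of times the indirect branch executes, but every remaining opcode still goes through the same shared branch.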

## Idea

**Threaded dispatch.** Instead of one central `match`, each opcode handler ends by dispatching the next opcode itself: it looks up the next handler in the opcode table and tail-calls it. This replaces the single shared indirect branch with one per handler, so the predictor can learn per-opcode successor patterns. This is the classic interpreter speedup (CPython's "computed goto" dispatch, LuaJIT's interpreter, etc.). In Rust the portable equivalent is **guaranteed tail calls via the `become` keyword**.
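A minimal sketch of the handler-table shape follows. All names (`Instr`, `Vm`, `TABLE`, the `op_*` handlers) are hypothetical and unrelated to the real engine types. To stay compilable on stable Rust, it uses ordinary calls in tail position, which the compiler may or may not turn into jumps; a nightly prototype would enable `#![feature(explicit_tail_calls)]` and write `become` at the marked sites so the stack does not grow with the number of executed opcodes.

```rust
#[derive(Clone, Copy)]
struct Instr {
    op: usize,   // index into TABLE
    slot: usize, // operand for LOAD_SLOT / STORE_SLOT
    imm: f64,    // operand for PUSH_CONST
}

struct Vm {
    code: Vec<Instr>,
    slots: Vec<f64>,
    stack: Vec<f64>,
}

// Every handler shares one signature, which `become` would require.
type Handler = fn(&mut Vm, usize);

const PUSH_CONST: usize = 0;
const LOAD_SLOT: usize = 1;
const ADD: usize = 2;
const MUL: usize = 3;
const STORE_SLOT: usize = 4;
const HALT: usize = 5;

static TABLE: [Handler; 6] =
    [op_push_const, op_load_slot, op_add, op_mul, op_store_slot, op_halt];

// Inlined into each handler, so each handler gets its own indirect call site.
#[inline(always)]
fn dispatch(vm: &mut Vm, pc: usize) {
    let handler = TABLE[vm.code[pc].op];
    handler(vm, pc) // nightly: `become handler(vm, pc)`
}

fn op_push_const(vm: &mut Vm, pc: usize) {
    let imm = vm.code[pc].imm;
    vm.stack.push(imm);
    dispatch(vm, pc + 1) // tail position
}

fn op_load_slot(vm: &mut Vm, pc: usize) {
    let v = vm.slots[vm.code[pc].slot];
    vm.stack.push(v);
    dispatch(vm, pc + 1)
}

fn op_add(vm: &mut Vm, pc: usize) {
    let b = vm.stack.pop().expect("stack underflow");
    let a = vm.stack.pop().expect("stack underflow");
    vm.stack.push(a + b);
    dispatch(vm, pc + 1)
}

fn op_mul(vm: &mut Vm, pc: usize) {
    let b = vm.stack.pop().expect("stack underflow");
    let a = vm.stack.pop().expect("stack underflow");
    vm.stack.push(a * b);
    dispatch(vm, pc + 1)
}

fn op_store_slot(vm: &mut Vm, pc: usize) {
    let slot = vm.code[pc].slot;
    vm.slots[slot] = vm.stack.pop().expect("stack underflow");
    dispatch(vm, pc + 1)
}

// The only handler that does not dispatch; programs must end with HALT.
fn op_halt(_vm: &mut Vm, _pc: usize) {}

fn instr(op: usize, slot: usize, imm: f64) -> Instr {
    Instr { op, slot, imm }
}

fn run(code: Vec<Instr>, slots: Vec<f64>) -> Vec<f64> {
    let mut vm = Vm { code, slots, stack: Vec::new() };
    dispatch(&mut vm, 0);
    vm.slots
}

fn main() {
    // slots[1] = slots[0] * 2.0 + 1.0
    let code = vec![
        instr(LOAD_SLOT, 0, 0.0),
        instr(PUSH_CONST, 0, 2.0),
        instr(MUL, 0, 0.0),
        instr(PUSH_CONST, 0, 1.0),
        instr(ADD, 0, 0.0),
        instr(STORE_SLOT, 1, 0.0),
        instr(HALT, 0, 0.0),
    ];
    println!("{:?}", run(code, vec![3.0, 0.0]));
}
```

Without guaranteed tail calls this form is only safe for short programs, since each executed opcode may add a stack frame; that is exactly the gap `become` is meant to close.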

## Caveat / blocker

`become` is **unstable (nightly Rust only)**, so adopting it is a toolchain/policy decision for the project, not a code-level change. It is neither `unsafe` nor assembly. Until that decision is made, the only portable lever is more superinstructions (dispatch-count reduction).

## Expected impact

Potentially the **largest single remaining win** (it attacks the 68%-of-branch-misses bottleneck directly), but unquantified and gated on the nightly decision. Worth a spike to measure on a `become`-based prototype before committing.

## Refs
- `src/simlin-engine/src/vm.rs` — `Vm::eval_bytecode`, the `while pc < code.len() { match &code[pc] }` dispatch loop.
- `docs/design/engine-performance.md` — R3 (faster dispatch); notes `become` is unstable.
- PR #599 — the fusion wins that reduced branch-misses, evidence the bottleneck is dispatch.

