engine: C-LEARN interpreter perf -- run -24% via array-view + opcode fusion #599

Merged: bpowers merged 5 commits into main from engine-perf-clearn-interp on May 20, 2026

@bpowers bpowers commented May 20, 2026

Summary

A hill-climbing pass on the bytecode VM interpreter, benchmarked on the C-LEARN model (our largest: ~53k MDL lines, 5726 root slots, 1000 Euler steps), LTM off, native release (opt-level=3 + LTO). Each step was profiled, A/B-measured, and validated numerically.

Cumulative: run_to_end ~155 -> ~118 ms/iter (about -24%). Every commit is numerically faithful (byte-exact vs Vensim's Ref.vdf via the clearn_residual_exactness guard, plus the full engine + simulate suites green).

Commits

| Change | C-LEARN run | Notes |
| --- | --- | --- |
| flat_offset dense fast-path (skip the per-element sparse_lookup SmallVec when a view has no sparse mappings) | 155 -> 123 ms (-19.8%) | bit-identical; flat_offset 18.9% -> 8.6% of run |
| 3-operand register fusion of leaf assigns dst = a op b (operator-specialized Assign{Add,Sub,Mul,Div}{VarVar,VarConst,ConstVar}{Curr,Next} + stack-leaf forms) | 123 -> 121 ms (-2.2%) | operator-in-tag keeps Opcode at 8 bytes; branch-misses -0.95% |
| to_runtime_view memcpy (SmallVec::from_slice vs element-wise clone) | neutral | retired instructions -4.2%; wall-clock-neutral (the run is branch/latency-bound) |
| fuse LoadGlobalVar and two-literal operands into pushing binops (BinGlobal* / BinConstConst) | 121 -> ~118 ms (-1.2%) | 1858 dispatches fused; branches -2.9%, branch-misses -3.8% |

Both fusion passes run at Vm::new on the execution bytecode, so CompiledSimulation stays a pure, salsa-cached, symbolizable artifact (the fused opcodes have an exhaustive unreachable! arm in symbolize_opcode). size_of::<Opcode>() stays 8 bytes throughout.
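
To illustrate the 8-byte claim, here is a minimal sketch of how putting the operator in the variant tag leaves room for three u16 operands. The enum and variant set below are a toy stand-in, not the engine's actual Opcode definition; the `#[repr(u8)]` attribute and the `WideOpcode` contrast type (which assumes a fourth u16 payload field) are assumptions made only so the layout is well defined and the growth is visible.

```rust
use std::mem::size_of;

/// Toy opcode set (hypothetical, not the engine's real enum).
/// The operator is encoded in the variant tag, so each payload is
/// three u16 slot offsets: 6 bytes of payload plus a 1-byte tag,
/// padded to the u16 alignment.
#[allow(dead_code)]
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Opcode {
    AssignAddVarVarCurr { l: u16, r: u16, dst: u16 },
    AssignSubVarVarCurr { l: u16, r: u16, dst: u16 },
    AssignMulVarVarCurr { l: u16, r: u16, dst: u16 },
    AssignDivVarVarCurr { l: u16, r: u16, dst: u16 },
}

/// Contrast case (assumption for illustration): carrying one more
/// u16 in the payload pushes the enum past 8 bytes.
#[allow(dead_code)]
#[repr(u8)]
pub enum WideOpcode {
    Assign { op: u16, l: u16, r: u16, dst: u16 },
    Nop,
}

/// Returns (size of the tag-specialized enum, size of the wide enum).
pub fn opcode_sizes() -> (usize, usize) {
    (size_of::<Opcode>(), size_of::<WideOpcode>())
}
```

Under these assumptions `opcode_sizes()` reports 8 bytes for the tag-specialized enum and 10 for the wide one, which is the kind of encoding growth the operator-in-tag design avoids.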

Key finding: the run is branch-mispredict-bound, not instruction-bound

A fresh perf profile attributes 68% of branch-misses to the eval_bytecode dispatch indirect branch; IPC is ~3.3. So wall-clock wins come from cutting dispatches (fewer indirect-branch mispredicts), not instruction count -- which is exactly why the to_runtime_view memcpy (-4.2% instructions) was wall-clock-neutral while the fusion passes (which cut dispatches) moved the needle and the branch-miss rate.

Ruled out (data-driven NO-GOs)

  • Full register VM: an interior-node analysis showed ~0 marginal dispatch reduction over the fusion passes (trailing-op-into-store is already fully fused; interior binops are 1 dispatch in both stack and register models). The ~2.9% remaining dispatch ceiling was captured by the cheap global/const operand fusion instead.
  • Hot-loop allocation: run_to_end is allocation-free (the malloc/free seen in profiles is the one-time compile + the harness's per-iter Vm::new/clone).
  • Constant folding / strength reduction: C-LEARN has too few powers/const-divisions, and R2 already folds leaf constants.
  • Bounds-check elimination (R1): sub-noise at opt-level=3 (and would need unsafe).

Open follow-ups (not in this PR)

  • Threaded dispatch via guaranteed tail calls (become): the one remaining step-change for the branch-bound dispatch loop, but it requires nightly Rust (a toolchain decision).
  • Incremental array-machinery wins: lookup/graphical-function hot path (~3-4% of run); flat_offset stride-stepping for the BeginIter non-contiguous precompute.

Validation

Every commit passed the full pre-commit hook. C-LEARN matches Ref.vdf byte-for-byte (clearn_residual_exactness) on all four commits. Operand order for the non-commutative Sub/Div and the global-vs-module_off slot distinction are covered by dedicated, verified-discriminating tests (injecting the bug fails the test and the exactness guard).
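
As a rough illustration of what a "verified-discriminating" operand-order test looks like, the sketch below compares a stack-style evaluation (pop r, then l, compute l op r) with a fused evaluation that reads operands in load order, plus a deliberately swapped variant. All names here are hypothetical and much simpler than the real VM arms.

```rust
#[derive(Clone, Copy)]
pub enum Op {
    Add,
    Sub,
    Mul,
    Div,
}

/// Raw IEEE arithmetic, always computed as `l op r`.
fn apply(op: Op, l: f64, r: f64) -> f64 {
    match op {
        Op::Add => l + r,
        Op::Sub => l - r,
        Op::Mul => l * r,
        Op::Div => l / r,
    }
}

/// Unfused shape: Load l; Load r; Op2. The stack pops r first, then l.
pub fn eval_stack(op: Op, curr: &[f64], l: usize, r: usize) -> f64 {
    let mut stack = vec![curr[l], curr[r]]; // pushed in load order
    let rv = stack.pop().unwrap();
    let lv = stack.pop().unwrap();
    apply(op, lv, rv)
}

/// Fused shape: operands are read directly and kept in load order.
pub fn eval_fused(op: Op, curr: &[f64], l: usize, r: usize) -> f64 {
    apply(op, curr[l], curr[r])
}

/// Injected bug: operands swapped. A discriminating test must fail on this.
pub fn eval_fused_swapped(op: Op, curr: &[f64], l: usize, r: usize) -> f64 {
    apply(op, curr[r], curr[l])
}
```

With `curr = [6.0, 3.0]`, the swapped variant still agrees with the stack form for Add and Mul, so only Sub and Div can expose the bug; that is why the tests target the non-commutative operators.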

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 975b9b5510

Comment thread src/simlin-engine/src/vm_profile.rs Outdated
Comment on lines +46 to +49
tally(&fused, &mut p.fused_histogram);
let mut fused_stocks = module.compiled_stocks.as_ref().clone();
fused_stocks.fuse_three_address();
tally(&fused_stocks, &mut p.fused_histogram);

[P2] Exclude fused clones from total_literals accounting

bytecode_profile() now calls tally() on fused and fused_stocks, but tally() increments p.total_literals as a side effect. Because those fused bytecodes are temporary clones of the same module bytecode, every module’s flow/stock literal table is counted twice, so total_literals is inflated and no longer reflects the compiled artifact size. This skews profiling output and can mislead perf/size comparisons that rely on literal counts.
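
One way to avoid this kind of double count is to separate the histogram update from the literal accounting, so temporary fused clones only ever touch a histogram. The sketch below uses hypothetical types (`ByteCode`, `Profile`, string opcode names) that are far simpler than the real vm_profile.rs structures; it only shows the shape of the split.

```rust
use std::collections::HashMap;

/// Toy bytecode: opcode names plus a literal table (hypothetical layout).
pub struct ByteCode {
    pub code: Vec<&'static str>,
    pub literals: Vec<f64>,
}

#[derive(Default)]
pub struct Profile {
    pub histogram: HashMap<&'static str, usize>,
    pub fused_histogram: HashMap<&'static str, usize>,
    pub total_literals: usize,
}

/// Histogram-only tally: safe to call on temporary fused clones.
fn tally_histogram(bc: &ByteCode, hist: &mut HashMap<&'static str, usize>) {
    for &name in &bc.code {
        *hist.entry(name).or_insert(0) += 1;
    }
}

/// Full tally for the real compiled artifact: histogram plus literal count.
fn tally(bc: &ByteCode, p: &mut Profile) {
    tally_histogram(bc, &mut p.histogram);
    p.total_literals += bc.literals.len();
}

/// Profiles an original stream and its fused clone, which shares the
/// same literal table because fusion rewrites opcodes only.
pub fn demo_profile() -> Profile {
    let original = ByteCode {
        code: vec!["LoadVar", "LoadVar", "BinOpAssignCurr"],
        literals: vec![1.0, 2.0],
    };
    let fused = ByteCode {
        code: vec!["AssignAddVarVarCurr"],
        literals: original.literals.clone(),
    };
    let mut p = Profile::default();
    tally(&original, &mut p);
    tally_histogram(&fused, &mut p.fused_histogram);
    p
}
```

In this toy run the literal table is counted once (two literals), while the fused stream still contributes its opcode counts to the separate fused histogram.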

claude Bot commented May 20, 2026

Automated review

Reviewed the bytecode-fusion + array-view perf changes for correctness (not just the numeric guards in the PR). No blocking issues found. The correctness-critical surfaces I traced all hold:

  • Operand order for non-commutative ops: Op2 and BinOpAssign{Curr,Next} both pop r then l and compute l op r; the fused Assign{Sub,Div}*/Bin* forms keep l/r in load order, so Sub/Div are byte-faithful. eval_op2's Add/Sub/Mul/Div are raw IEEE ops, identical to the inlined arithmetic in the operator-specialized arms.
  • Global vs module-relative slots: the _global fields read curr[g] (no module_off) matching LoadGlobalVar, while var fields use curr[module_off + v] matching LoadVar; the dedicated submodule test (test_global_operand_binop_reads_global_not_module_relative) exercises module_off > 0.
  • flat_offset dense fast-path: numerically identical to the general path when sparse is empty (every actual_idx == idx).
  • to_runtime_view: SmallVec::from_slice on the Copy-element vectors is equivalent to the prior element-wise clone.
  • Jump fixup / stack safety: the old→new PC remap is unchanged and correct for both 3→1 and 2→1 folds; fusion can only shrink peak stack depth (guarded by the *preserve_max_stack_depth tests).
  • Symbolic-layer isolation: fusion runs only on the Vm's private execution copy in CompiledSlicedSimulation::build; symbolize_opcode exhaustively unreachable!s every fused variant, so none can leak into the salsa-cached artifact.
  • Override + RK stock-offset detection consistency: collect_constant_info and collect_stock_offsets both run on the un-fused CompiledModule map, and fusion preserves both the literal pool (overrides mutate by literal_id) and the dst offsets (RK reads next[off] - curr[off]), so the un-fused detection stays consistent with fused execution.

Overall correctness verdict: correct. The patch preserves observable VM semantics; existing code/tests should not break.

🤖 Generated with Claude Code

bpowers added 5 commits May 20, 2026 16:13
flat_offset is a per-element hot spot (~19% of the C-LEARN run), hit at
every vector-op/reducer dispatch site. It rebuilt a per-dimension
sparse_lookup SmallVec and did per-index Option branching on every call,
even for dense views with no sparse mappings (the common case). Skip that
work when self.sparse is empty; numerically identical there (empty sparse
implies actual_idx == idx). Measured C-LEARN run_to_end ~155 -> ~125
ms/iter (-20%); engine and simulate suites unchanged.
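
The dense fast-path idea can be sketched roughly as follows. The `View` struct here is a simplified, hypothetical stand-in (plain Vec fields instead of SmallVec, a guessed representation for sparse mappings), so it should be read as an illustration of the early-return shape rather than the engine's flat_offset.

```rust
/// Simplified stand-in for an array view (hypothetical layout).
pub struct View {
    pub offset: usize,
    pub strides: Vec<usize>,
    /// (dimension, map from view index to parent index), listed only
    /// for dimensions that have a sparse mapping.
    pub sparse: Vec<(usize, Vec<usize>)>,
}

impl View {
    /// General path: rebuilds a per-dimension lookup on every call and
    /// branches on an Option for every index.
    pub fn flat_offset_general(&self, idx: &[usize]) -> usize {
        let mut lookup: Vec<Option<&Vec<usize>>> = vec![None; self.strides.len()];
        for (dim, map) in &self.sparse {
            lookup[*dim] = Some(map);
        }
        let mut off = self.offset;
        for (d, &i) in idx.iter().enumerate() {
            let actual = match lookup[d] {
                Some(map) => map[i],
                None => i,
            };
            off += actual * self.strides[d];
        }
        off
    }

    /// Fast path: with no sparse mappings, actual_idx == idx for every
    /// dimension, so the lookup table is skipped entirely.
    pub fn flat_offset(&self, idx: &[usize]) -> usize {
        if self.sparse.is_empty() {
            let dot: usize = idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
            return self.offset + dot;
        }
        self.flat_offset_general(idx)
    }
}
```

For a dense view both methods return the same offset; the fast path simply avoids building the lookup and the per-index Option branch.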

R2 fused subexpression loads (Load;Load;Op2 -> BinVarVar) but left
leaf assignments dst = a op b as Load a; Load b; BinOpAssign (3
dispatches + stack traffic). Add operator-specialized 3-operand
opcodes Assign{Add,Sub,Mul,Div}{VarVar,VarConst,ConstVar}{Curr,Next}
plus 2-operand stack-leaf forms, emitted by an extended
fuse_three_address pass at Vm::new. Operator-in-the-tag keeps the
payload at 3xu16 = 6 bytes, so size_of::<Opcode>() stays 8 (no
encoding growth) and dispatch is straight-line. These ops never
enter the symbolic layer (exhaustive unreachable! arm). Numerically
exact. C-LEARN: flow opcodes 26539 -> 25215 (973 sites), run_to_end
~124 -> ~121 ms (-2.2%), retired instructions -1.4%, branch-misses
-0.95%.

vm.rs was 32 lines under the 6000-line per-file cap, so the added
dispatch arms and tests pushed it over. Relocate the unrelated,
all-public-API set_value tests to a #[path]-included child module
vm_set_value_tests.rs (still crate::vm's child, so super::* reaches
the parent internals with no visibility changes) to restore headroom.
Pure relocation, no behaviour change.
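
The leaf-assign fusion described in this commit message can be sketched as a small peephole pass. This is a toy version under stated assumptions: the opcode names, the single Add-only pattern, and absolute jump targets are all hypothetical simplifications, and the real fuse_three_address pass handles many more forms. It shows the two pieces the reviews call out, the jump-target guard and the old-to-new PC remap.

```rust
/// Toy opcode set (hypothetical; the real VM has many more variants).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    LoadVar(u16),
    /// Pops r then l, writes curr[dst] = l + r.
    AddAssignCurr(u16),
    /// Fused form of LoadVar l; LoadVar r; AddAssignCurr dst.
    AssignAddVarVarCurr { l: u16, r: u16, dst: u16 },
    /// Absolute target pc (an assumption made for this sketch).
    Jump(usize),
    Ret,
}

pub fn fuse(code: &[Op]) -> Vec<Op> {
    // Mark jump targets: a fused window must not swallow one.
    let mut is_target = vec![false; code.len() + 1];
    for op in code {
        if let Op::Jump(t) = op {
            is_target[*t] = true;
        }
    }

    let mut out = Vec::with_capacity(code.len());
    let mut remap = vec![0usize; code.len() + 1]; // old pc -> new pc
    let mut pc = 0;
    while pc < code.len() {
        remap[pc] = out.len();
        // Only the window's first op may be a jump target.
        if pc + 2 < code.len() && !is_target[pc + 1] && !is_target[pc + 2] {
            if let (Op::LoadVar(l), Op::LoadVar(r), Op::AddAssignCurr(dst)) =
                (code[pc], code[pc + 1], code[pc + 2])
            {
                remap[pc + 1] = out.len();
                remap[pc + 2] = out.len();
                out.push(Op::AssignAddVarVarCurr { l, r, dst });
                pc += 3;
                continue;
            }
        }
        out.push(code[pc]);
        pc += 1;
    }
    remap[code.len()] = out.len();

    // Rewrite jump targets into the shrunken stream.
    for op in &mut out {
        if let Op::Jump(t) = op {
            *t = remap[*t];
        }
    }
    out
}
```

A 3-to-1 fold turns three dispatches into one, and any jump that pointed past the window is shifted by the remap table; a jump landing inside the window blocks the fold so the target stays addressable.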

PushStaticView clones the static view's dims/strides/dim_ids into a
RuntimeView on every execution (~1M times/run on C-LEARN). smallvec
1.15 (built without the specialization feature) lowers SmallVec::clone
to an element-wise Extend, so use SmallVec::from_slice (a single memcpy)
for the Copy-element buffers and an empty allocation for the
near-always-empty sparse list. Numerically identical; removes ~4.2% of
retired run instructions. Wall-clock is unchanged here -- the C-LEARN
run is dispatch-mispredict/latency-bound, not instruction-bound (IPC
~3.3), so the core already absorbed these tiny L1-resident copies -- but
it is a strict efficiency improvement worth keeping.

R2 fused Load;Load;Op2 for module-relative var operands but left
LoadGlobalVar operands and two-literal operands as separate loads.
Extend the Vm::new-time fuse_three_address pass with pushing binop
variants that read a global operand directly from curr[g] (absolute,
no module_off): BinGlobalVar/BinVarGlobal/BinGlobalConst/BinConstGlobal/
BinGlobalGlobal/BinStackGlobal, plus BinConstConst for two-literal
operands. Numerically exact (still computes l OP r at run time); the
ops live only in the Vm execution copy (exhaustive unreachable! arm in
symbolize_opcode), so CompiledSimulation stays symbolizable. C-LEARN:
1858 dispatches fused, branches -2.9%, branch-misses -3.8% (the
dispatch loop is branch-bound); run ~120 -> ~118 ms (noise floor).
size_of::<Opcode>() unchanged (8 bytes).
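
The absolute versus module-relative distinction can be made concrete with a small sketch. The function names and the flat `curr` slice are hypothetical simplifications of the VM's state; the buggy variant is included only to show why a test needs module_off > 0 to tell the two addressing modes apart.

```rust
/// Module-relative operand: slot v inside the module instance that
/// starts at module_off (mirrors the LoadVar addressing described above).
pub fn load_var(curr: &[f64], module_off: usize, v: u16) -> f64 {
    curr[module_off + v as usize]
}

/// Global operand: absolute slot, no module_off added
/// (mirrors the LoadGlobalVar addressing described above).
pub fn load_global(curr: &[f64], g: u16) -> f64 {
    curr[g as usize]
}

/// Toy fused "global minus var" binop: reads both operands directly
/// and still computes l - r at run time.
pub fn bin_global_var_sub(curr: &[f64], module_off: usize, g: u16, v: u16) -> f64 {
    load_global(curr, g) - load_var(curr, module_off, v)
}

/// Injected bug: treats the global operand as module-relative.
pub fn bin_global_var_sub_buggy(curr: &[f64], module_off: usize, g: u16, v: u16) -> f64 {
    load_var(curr, module_off, g) - load_var(curr, module_off, v)
}
```

With `curr = [10.0, 1.0, 2.0, 3.0, 4.0]`, `g = 0`, and `v = 1`, the two functions agree when module_off is 0 and diverge when module_off is 3, which is the situation a submodule test has to exercise.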

The fused-stream histogram added in the global/const operand-fusion
commit tallied the fused flow/stock clones through the same closure
that increments total_literals. Those clones share the real bytecode's
literal table (fusion rewrites opcodes, never the literal pool), so
every module's flow+stock literals were counted twice and
total_literals was inflated (profiling-only; no VM behavior change).
Split out a histogram-only tally for the fused streams and add a
regression test asserting total_literals counts each phase once.
Addresses the P2 review comment on #599.

claude Bot commented May 20, 2026

Review — engine: C-LEARN interpreter perf (#599)

I reviewed all six files with a focus on numerical faithfulness of the fusion passes: the flat_offset dense fast-path, the new global/const/leaf-assign opcodes (definitions, stack_effect, name, the exhaustive symbolize_opcode guard), the greedy fuse_three_address pass, the VM dispatch arms, to_runtime_view, and the bytecode_profile double-count fix.

The implementation holds up well. Each new dispatch arm matches its opcode definition; operand order is preserved for the non-commutative Sub/Div; _global operands correctly read curr[g] (absolute) vs. curr[module_off + v] (module-relative); the leaf-assign 3-window's empty-stack assumption is sound (a BinOpAssign combiner always begins with an empty arithmetic stack), with the comparison/Exp operators correctly falling through to the operator-in-payload AssignStack*/BinStack* forms; the jump-target guard prevents fusing across jump targets; and set_value mutates the same fused execution copy via stable literal-table indices. The flat_offset fast-path is bit-identical to the general path when sparse is empty, and the profiling fix correctly excludes the shared literal pool from total_literals.

[P3] Stale comment contradicts the leaf-assign opcodes added in this same PR

src/simlin-engine/src/bytecode.rs lines 669-670 state: "A 3-operand dst = a op b would exceed the 8-byte Opcode budget, so the assign stays a separate op." This PR's own leaf-assign block immediately below (the AssignAddVarVarCurr-family, {l, r, dst} = 3×u16) folds exactly such a 3-operand assign into one opcode by encoding the operator in the variant tag, so the stated limitation no longer holds in general. A reader of the pushing-forms block would conclude 3-operand assigns are impossible and then be confused by the block right below. Worth scoping the sentence to the pushing forms (e.g. note the leaf-assign forms fold the store via operator-in-tag). Non-blocking.

Overall correctness: correct

I found no correctness issues. The patch is numerically faithful, the new opcode set is handled exhaustively in every match, and the change is well covered by the added discriminating tests (operand order, the submodule global-vs-module_off distinction, the Next-family stock path, and the total_literals regression guard). The only finding above is a non-blocking doc inconsistency.

@bpowers bpowers force-pushed the engine-perf-clearn-interp branch from 0c143bf to 047e192 May 20, 2026 23:26
@bpowers bpowers merged commit 899e4cd into main May 20, 2026
13 checks passed
@bpowers bpowers deleted the engine-perf-clearn-interp branch May 20, 2026 23:26

claude Bot commented May 20, 2026

Review: engine C-LEARN interpreter perf (opcode fusion + array-view fast paths)

I did a line-by-line correctness audit of the new fused opcodes, the fusion pass, and the array-view fast paths. No blocking issues found — the patch looks correct.

Things I verified:

  • VM dispatch arms (all ~40 new opcodes): operand order is l op r for every form (load-bearing for Sub/Div), Curr writes curr[dst] / Next writes next[dst] with operands read from curr, and _global operands index curr[g] absolutely while plain vars use curr[module_off + v]. The tag-specialized Assign{Add,Sub,Mul,Div}* arms use raw + - * /, which is byte-identical to eval_op2 (no SAFEDIV/special-casing for those four), and every non-{Add,Sub,Mul,Div} operator stays in a payload form that routes through eval_op2.
  • fuse_three_address: longest-match-first, jump-target guard and old→new PC remap preserve jump semantics; leaf-assign forms only fire for the four specialized operators and otherwise fall through to the operator-agnostic 2-window or stay unfused (the global + BinOpAssign combiner case correctly leaves the global load standalone).
  • collect_stock_offsets reads the unfused CompiledModule map, so RK stock-offset collection is unaffected; the fused *Next ops write the same next[off], so next[off] - curr[off] derivative extraction is unchanged. rk_integration_tests.rs exercises this with fusion on.
  • set_value patches the literals table (untouched by fusion) and AssignConstCurr opcodes (not a fusion target) on the execution copy; literal ids stay valid because fusion never rewrites the literal pool.
  • flat_offset dense fast-path is numerically identical to the general path when sparse is empty; to_runtime_view from_slice/empty-sparse shortcut is equivalent to the prior clone() for the Copy-element vectors.
  • symbolize_opcode exhaustively lists every new opcode under the unreachable! guard, and stack_effect is correct for all new variants, so max_stack_depth's safety proof (fixed STACK_CAPACITY, fusion only lowers peaks) holds.
  • Test relocation of the set_value tests to vm_set_value_tests.rs is clean (19 moved; the one not moved, test_binop_assign_curr_present_in_bytecode, was intentionally replaced by test_binop_assign_curr_fuses_to_leaf_assign to reflect the new fused shape).

Overall correctness verdict: correct. No bugs or test-breaking changes identified.

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 66.11842% with 103 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.87%. Comparing base (59e48a9) to head (047e192).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/simlin-engine/src/vm.rs | 56.94% | 62 Missing ⚠️ |
| src/simlin-engine/src/bytecode.rs | 70.92% | 41 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff            @@
##             main     #599    +/-   ##
========================================
  Coverage   82.87%   82.87%            
========================================
  Files         260      261     +1     
  Lines       69576    69828   +252     
========================================
+ Hits        57659    57872   +213     
- Misses      11917    11956    +39     
