engine: C-LEARN performance (module dispatch, native build levers, 3-address fusion) #594
The EvalModule opcode rebuilt a (String, BTreeSet<String>) module key and SipHashed it for a HashMap lookup on every module evaluation, every timestep -- ~1.3M key constructions on C-LEARN, and the entire per-timestep allocation churn. Replace the three key-keyed module maps with a single indexed Vec<ResolvedModule> plus a per-module child_targets table resolved once at Vm::new, so the eval recursion threads a module index and array-indexes its children. The ModuleKey map survives only for the cold set_value/clear_values literal-override paths. On C-LEARN: run -17% and run_to_end drops from ~2.9M allocations to zero.

reify_0_arity_builtins called to_lowercase() -- a heap allocation -- on every variable reference just to test membership in a 9-element ASCII set; add the allocation-free is_0_arity_builtin_fn_ci and only materialize the lowercased name in the rare genuine-builtin case (compile -3%). compile_var_fragment now reads the salsa-cached project_datamodel_dims instead of rebuilding the dims vec per variable.

Adds a profiling harness (examples/clearn_profile.rs: per-stage timing plus a gated counting allocator) and CompiledSimulation::bytecode_profile() / Opcode::name() for opcode-histogram diagnostics.

Behavior-preserving: the engine lib suite, the simulate integration tests, and clearn_residual_exactness pass with byte-identical compiled bytecode.
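The keyed-to-indexed change can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the field names, the `local_value` placeholder standing in for a module's bytecode, and the `key` and `demo_root_value` helpers are assumptions made for the example. Only the shape follows the description above (a `Vec<ResolvedModule>`, a per-module `child_targets` table built once in `Vm::new`, and a key map kept for cold paths).

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for the (model name, input set) module key described above.
type ModuleKey = (String, BTreeSet<String>);

/// One module with its children pre-resolved to indices into `Vm::modules`.
struct ResolvedModule {
    /// Placeholder for the module's own per-step work (hypothetical).
    local_value: f64,
    /// child_targets[i] is the index of the i-th child module.
    child_targets: Vec<usize>,
}

struct Vm {
    modules: Vec<ResolvedModule>,
    /// Kept only for cold paths that still address modules by key.
    index_by_key: HashMap<ModuleKey, usize>,
}

impl Vm {
    /// Resolve every key to an index once, up front.
    fn new(defs: Vec<(ModuleKey, f64, Vec<ModuleKey>)>) -> Vm {
        let index_by_key: HashMap<ModuleKey, usize> = defs
            .iter()
            .enumerate()
            .map(|(i, (key, _, _))| (key.clone(), i))
            .collect();
        let modules = defs
            .iter()
            .map(|(_, local_value, children)| ResolvedModule {
                local_value: *local_value,
                // Panics on an unknown child key; a real VM would report an error.
                child_targets: children.iter().map(|k| index_by_key[k]).collect(),
            })
            .collect();
        Vm { modules, index_by_key }
    }

    /// Hot path: no key construction, no hashing, no allocation.
    fn eval(&self, module: usize) -> f64 {
        let m = &self.modules[module];
        let mut total = m.local_value;
        for &child in &m.child_targets {
            total += self.eval(child);
        }
        total
    }
}

/// Build a key from a name and its input names (helper for the example only).
fn key(name: &str, inputs: &[&str]) -> ModuleKey {
    (name.to_string(), inputs.iter().map(|s| s.to_string()).collect())
}

/// Evaluate a three-module tree: main (1.0) with two sector children (2.0, 3.0).
fn demo_root_value() -> f64 {
    let vm = Vm::new(vec![
        (key("main", &[]), 1.0, vec![key("sector", &["x"]), key("sector", &["y"])]),
        (key("sector", &["x"]), 2.0, vec![]),
        (key("sector", &["y"]), 3.0, vec![]),
    ]);
    let root = vm.index_by_key[&key("main", &[])];
    vm.eval(root)
}
```

Calling `demo_root_value()` builds and hashes keys only during setup and the single root lookup; the recursive `eval` touches nothing but `usize` indices.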
The release profile was opt-level="z" (size-tuned for the WASM bundle) and native CLI/server/MCP/pysimlin builds inherited it. Set [profile.release] opt-level=3 and force the WASM bundle back to opt-level=z via a target-keyed .cargo/config.toml ([target.wasm32-unknown-unknown] rustflags) -- keyed on the target so every wasm build path stays size-optimized regardless of invocation (verified: wasm bundle unchanged at ~7.2 MB; opt=3 would bloat it to ~9.7 MB). On C-LEARN, opt=3 is ~-30% compile / ~-41% simulate vs "z".

Install mimalloc as the global allocator on native builds (the engine compile path is allocation-heavy; mimalloc roughly halves allocator time -- another ~-40% compile on top of opt=3). simlin-serve and simlin-mcp set it in main.rs; libsimlin gates it behind an opt-in `mimalloc` feature additionally cfg'd off for wasm32, which pysimlin (Makefile, build_wheels.py) enables for its cdylib. simlin-cli routes mimalloc through libsimlin's feature rather than declaring its own global allocator, so there is exactly one per artifact even under `cargo clippy --all-features`. WASM links no mimalloc.

Cumulative on C-LEARN with the prior engine commit: compile 3574->1459 ms (-59%), run 342->168 ms (-51%). Full profile and remaining proposals are in docs/design/engine-performance.md.
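Both mimalloc and a counting allocator plug into Rust's `#[global_allocator]` hook, and the wasm32 exclusion is an ordinary `cfg` attribute on that static. The std-only sketch below shows the hook with that gate. `CountingAlloc`, `COUNTING`, `ALLOCATIONS`, and `count_allocations` are hypothetical names for illustration: the type wraps the system allocator and is not mimalloc. With the `mimalloc` crate the static's type would be `mimalloc::MiMalloc` and no `unsafe impl` would be needed in the calling crate.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Gate: allocations are only counted while this is true.
static COUNTING: AtomicBool = AtomicBool::new(false);
/// Number of `alloc` calls observed while the gate was open.
static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

/// Illustrative allocator that forwards to the system allocator.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if COUNTING.load(Ordering::Relaxed) {
            ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        }
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Native targets only: a wasm32 build never sees this static and keeps
// the default allocator.
#[cfg(not(target_arch = "wasm32"))]
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Run `f` with counting enabled and return its result together with the
/// number of allocations made while it ran (by any thread).
pub fn count_allocations<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    COUNTING.store(true, Ordering::Relaxed);
    let out = f();
    COUNTING.store(false, Ordering::Relaxed);
    (out, ALLOCATIONS.load(Ordering::Relaxed) - before)
}
```

Because a program may declare only one global allocator, the hook belongs in the final binary or behind a single opt-in feature, which is the constraint the commit works around for simlin-cli.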
Measured the get_unchecked ceiling on the hottest scalar opcode arms plus the dispatch code[pc] access: the C-LEARN run moved less than run-to-run noise (165-172 ms vs ~167 ms checked). At opt-level=3 an always-in-bounds check is a predicted, never-taken branch with an out-of-line cold panic path, so it is effectively free; the dispatch index is already elided in safe code because the `while pc < code.len()` loop guard dominates it.

Records the safe-vs-unsafe analysis -- the data-driven curr[module_off+off] and literals[id] indices can only be elided with unsafe get_unchecked plus a static validation pass; the safe idioms (sequential iteration, fixed-size arrays, power-of-two masking) do not fit data-driven random access -- and the decision: do not add unsafe to a deny(unsafe_code) crate for a sub-noise gain. The run's instruction count, not its bounds checks, is the lever (R2). Also indexes the doc in docs/README.md (missed when the doc was first added).
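The two kinds of index discussed above can be seen side by side in a toy dispatch loop. This is a sketch under assumed names (the three-variant `Opcode`, `validate`, and `run` are invented for the example and are far smaller than the engine's VM). It stays in safe code, and `validate` only illustrates the one-time pass that an unchecked variant would have to depend on.

```rust
#[derive(Clone, Copy)]
enum Opcode {
    LoadVar { off: u16 },
    LoadConst { id: u16 },
    Add,
}

/// One-time static validation: every data-driven operand is in range.
fn validate(code: &[Opcode], module_off: usize, curr_len: usize, literals_len: usize) -> bool {
    code.iter().all(|op| match *op {
        Opcode::LoadVar { off } => (module_off + off as usize) < curr_len,
        Opcode::LoadConst { id } => (id as usize) < literals_len,
        Opcode::Add => true,
    })
}

/// Safe interpreter loop; returns the value left on top of the stack.
fn run(code: &[Opcode], module_off: usize, curr: &[f64], literals: &[f64]) -> f64 {
    let mut stack: Vec<f64> = Vec::with_capacity(8);
    let mut pc = 0;
    while pc < code.len() {
        // The loop guard has just established pc < code.len(), so the
        // optimizer can typically drop the bounds check on code[pc].
        match code[pc] {
            // Data-driven indices: the offset comes from the bytecode, so the
            // check remains, but it never fails on validated bytecode and is
            // a well-predicted branch.
            Opcode::LoadVar { off } => stack.push(curr[module_off + off as usize]),
            Opcode::LoadConst { id } => stack.push(literals[id as usize]),
            Opcode::Add => {
                let r = stack.pop().expect("stack underflow");
                let l = stack.pop().expect("stack underflow");
                stack.push(l + r);
            }
        }
        pc += 1;
    }
    stack.pop().expect("empty program")
}
```

Replacing the two data-driven indexing expressions with `get_unchecked` is the change that was measured and rejected: it would require `unsafe` and would make `validate` load-bearing for memory safety.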
A stack VM spends ~70% of its dispatches on load/store/binop. Fold the leaf operand loads of a binary op into the op: `LoadX; LoadY; Op2` becomes one BinVarVar/BinVarConst/BinConstVar (3->1), and `LoadX; Op2` (lhs already on the stack) becomes one BinStackVar/BinStackConst (2->1). The curr[] slot array is already the register file, so the fused ops read operands straight from it (or from literals); the stack carries only nested subexpression results. A 3-operand `dst = a op b` would exceed the asserted 8-byte Opcode budget (3xu16 + Op2), so the trailing store keeps the existing BinOpAssignCurr.

The fusion is a late ByteCode::fuse_three_address pass applied to the Vm's flow/stock execution bytecode at Vm::new -- not at compile time -- so the CompiledSimulation stays a pure, symbolizable, salsa-cached artifact (the symbolic roundtrip path has no form for the fused opcodes). It reuses peephole_optimize's jump-target guard and old->new PC remap and preserves max_stack_depth, so the Stack-safety proof is unchanged. Initials are left unfused (run once; their AssignCurr targets are read by extract_assign_curr_offsets).

On C-LEARN: flow opcodes 34673 -> 26539 (-23.5%), run 166.8 -> 155.4 ms (-6.8%). The runtime gain is smaller than the opcode reduction because the f64 arithmetic, stock phase, and array machinery are untouched -- only the scalar dispatch shrinks; scalar-heavy models benefit more.

Behavior-preserving: full engine suite, simulate integration tests, and clearn_residual_exactness pass, with dedicated fusion-pass and end-to-end operand-order unit tests. Bytecode profiling moved to a vm_profile.rs sibling for the per-file line cap.
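The fusion rules can be illustrated on a toy instruction set. The sketch below is an assumption-laden miniature: the tuple-variant `Opcode` layout, the `BinOp` enum, and the `apply` and `run` helpers are invented for the example, and straight-line code is assumed, so the jump-target guard and PC remap mentioned above are omitted. Only the fused opcode names and the 3->1 / 2->1 patterns come from the description.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum BinOp { Add, Sub, Mul, Div }

#[derive(Clone, Copy, Debug, PartialEq)]
enum Opcode {
    LoadVar(u16),
    LoadConst(u16),
    Op2(BinOp),
    BinVarVar(BinOp, u16, u16),
    BinVarConst(BinOp, u16, u16),
    BinConstVar(BinOp, u16, u16),
    BinStackVar(BinOp, u16),
    BinStackConst(BinOp, u16),
}

/// Fold leaf operand loads into the binary op that consumes them.
/// Assumes no jumps target the folded instructions.
fn fuse_three_address(code: &[Opcode]) -> Vec<Opcode> {
    use Opcode::*;
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        // Try the 3->1 patterns first, then the 2->1 patterns.
        let fused = match &code[i..] {
            [LoadVar(a), LoadVar(b), Op2(op), ..] => Some((BinVarVar(*op, *a, *b), 3)),
            [LoadVar(a), LoadConst(b), Op2(op), ..] => Some((BinVarConst(*op, *a, *b), 3)),
            [LoadConst(a), LoadVar(b), Op2(op), ..] => Some((BinConstVar(*op, *a, *b), 3)),
            [LoadVar(b), Op2(op), ..] => Some((BinStackVar(*op, *b), 2)),
            [LoadConst(b), Op2(op), ..] => Some((BinStackConst(*op, *b), 2)),
            _ => None,
        };
        match fused {
            Some((op, consumed)) => { out.push(op); i += consumed; }
            None => { out.push(code[i]); i += 1; }
        }
    }
    out
}

fn apply(op: BinOp, l: f64, r: f64) -> f64 {
    match op {
        BinOp::Add => l + r,
        BinOp::Sub => l - r,
        BinOp::Mul => l * r,
        BinOp::Div => l / r,
    }
}

/// Reference interpreter for both fused and unfused code.
fn run(code: &[Opcode], curr: &[f64], literals: &[f64]) -> f64 {
    use Opcode::*;
    let mut stack: Vec<f64> = Vec::new();
    for &op in code {
        let v = match op {
            LoadVar(a) => curr[a as usize],
            LoadConst(a) => literals[a as usize],
            Op2(o) => {
                let r = stack.pop().expect("stack underflow");
                let l = stack.pop().expect("stack underflow");
                apply(o, l, r)
            }
            BinVarVar(o, a, b) => apply(o, curr[a as usize], curr[b as usize]),
            BinVarConst(o, a, b) => apply(o, curr[a as usize], literals[b as usize]),
            BinConstVar(o, a, b) => apply(o, literals[a as usize], curr[b as usize]),
            // The lhs is a nested subexpression result already on the stack.
            BinStackVar(o, b) => apply(o, stack.pop().expect("stack underflow"), curr[b as usize]),
            BinStackConst(o, b) => apply(o, stack.pop().expect("stack underflow"), literals[b as usize]),
        };
        stack.push(v);
    }
    stack.pop().expect("empty program")
}
```

For `(curr[0] - curr[1]) * literals[0]`, the five-instruction sequence `LoadVar(0); LoadVar(1); Op2(Sub); LoadConst(0); Op2(Mul)` fuses to `BinVarVar(Sub, 0, 1); BinStackConst(Mul, 0)`. Using a non-commutative operator such as `Sub` in a check like this is what catches a swapped operand order.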
Review: engine performance (module dispatch, native build levers, 3-address fusion)
I focused on the correctness-sensitive areas: the
[P3] Comment contradicts the actual fusion call site
This comment states the 3-address fusion is applied by
[P3] RUSTFLAGS caveat understates which contexts set the env var
The caveat says "Today only the asan test scripts set RUSTFLAGS, and they build for the host." But
[P3]
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #594 +/- ##
==========================================
- Coverage 82.92% 82.80% -0.12%
==========================================
Files 258 260 +2
Lines 69079 69286 +207
==========================================
+ Hits 57284 57374 +90
- Misses 11795 11912 +117
Profiles compiling + simulating the C-LEARN hero model (~53k MDL lines, 5726 slots, 1000 Euler steps) and lands the clear-win optimizations, with the investigation written up in `docs/design/engine-performance.md`. A reusable harness (`examples/clearn_profile.rs`) drove the measurements (per-stage timing + a gated counting allocator; `perf`/`callgrind` for CPU).

Headline results on C-LEARN

All behavior-preserving: the engine lib suite, the `simulate` integration tests, and the `clearn_residual_exactness` guard (matches genuine Vensim's `Ref.vdf` byte-for-byte) pass on every commit.

What's here (4 commits)

engine: index VM module dispatch -- `EvalModule` rebuilt a `(String, BTreeSet<String>)` key and SipHashed it for a `HashMap` lookup on every module eval every timestep. Replaced the keyed module maps with an indexed `Vec<ResolvedModule>` + per-module `child_targets` resolved once at `Vm::new`. Run -17%; per-timestep allocations 2.94M -> 0. Also an allocation-free 0-arity-builtin check on the parse hot path (compile -3%) and a cached `project_datamodel_dims`.

build: opt-level=3 + mimalloc on native targets -- `[profile.release]` was `opt-level="z"` (right for the WASM bundle; native inherited it). Native now gets `opt-level=3` (compile -30%, run -41%) with WASM forced back to `z` via a target-keyed `.cargo/config.toml` (bundle unchanged at 7.2 MB; verified). mimalloc as the global allocator on the native binaries + an opt-in libsimlin feature for pysimlin/C FFI (compile -40% more; compile is allocation-bound). Both native-only; WASM links no mimalloc.

doc: R1 (bounds-check elimination) investigated, not worth it -- measured the `get_unchecked` ceiling on the hottest scalar arms + the dispatch index: sub-noise (~0). At `opt-level=3` an always-in-bounds check is a predicted, never-taken branch with an out-of-line cold panic path; the dispatch index is already elided in safe code (the loop guard dominates it). Records the safe-vs-unsafe analysis and the decision not to add `unsafe` to a `#![deny(unsafe_code)]` crate for no measurable gain.

engine: fuse leaf operand loads into binary ops (3-address, R2) -- folds `Load; Load; Op2` (3->1) and `Load; Op2` (2->1) into 3-address binops that read operands straight from `curr[]`/`literals` (the slot array is already the register file). A late `fuse_three_address` pass on the Vm's execution bytecode at `Vm::new`, so `CompiledSimulation` stays a pure, symbolizable, salsa-cached artifact. Flow opcodes -23.5%, run -6.8%.

Remaining proposals (in the doc)

R4 (`RuntimeView::flat_offset` rebuilds a `SmallVec` per element; now the largest remaining run lever for arrayed models), R3 (more superinstructions), and compile-side C2/C3, plus a note on a full register VM vs the pragmatic 2-operand fusion landed here.