engine: C-LEARN performance (module dispatch, native build levers, 3-address fusion) by bpowers · Pull Request #594 · bpowers/simlin

bpowers · 2026-05-20T14:53:42Z

Profiles compiling + simulating the C-LEARN hero model (~53k MDL lines, 5726 slots, 1000 Euler steps) and lands the clear-win optimizations, with the investigation written up in docs/design/engine-performance.md. A reusable harness (examples/clearn_profile.rs) drove the measurements (per-stage timing + a gated counting allocator; perf/callgrind for CPU).

Headline results on C-LEARN

	before	after	delta
compile	3574 ms	1459 ms	-59%
simulate (run)	342 ms	155 ms	-55%

All behavior-preserving: the engine lib suite, the simulate integration tests, and the clearn_residual_exactness guard (matches genuine Vensim's Ref.vdf byte-for-byte) pass on every commit.

What's here (4 commits)

engine: index VM module dispatch — EvalModule rebuilt a (String, BTreeSet<String>) key and SipHashed it for a HashMap lookup on every module eval every timestep. Replaced the keyed module maps with an indexed Vec<ResolvedModule> + per-module child_targets resolved once at Vm::new. Run -17%; per-timestep allocations 2.94M -> 0. Also an allocation-free 0-arity-builtin check on the parse hot path (compile -3%) and a cached project_datamodel_dims.
build: opt-level=3 + mimalloc on native targets — [profile.release] was opt-level="z" (right for the WASM bundle; native inherited it). Native now gets opt-level=3 (compile -30%, run -41%) with WASM forced back to z via a target-keyed .cargo/config.toml (bundle unchanged at 7.2 MB; verified). mimalloc as the global allocator on the native binaries + an opt-in libsimlin feature for pysimlin/C FFI (compile -40% more; compile is allocation-bound). Both native-only; WASM links no mimalloc.
doc: R1 (bounds-check elimination) investigated, not worth it — measured the get_unchecked ceiling on the hottest scalar arms + the dispatch index: sub-noise (~0). At opt-level=3 an always-in-bounds check is a predicted, never-taken branch with an out-of-line cold panic path; the dispatch index is already elided in safe code (the loop guard dominates it). Records the safe-vs-unsafe analysis and the decision not to add unsafe to a #![deny(unsafe_code)] crate for no measurable gain.
engine: fuse leaf operand loads into binary ops (3-address, R2) — folds Load; Load; Op2 (3->1) and Load; Op2 (2->1) into 3-address binops that read operands straight from curr[]/literals (the slot array is already the register file). A late fuse_three_address pass on the Vm's execution bytecode at Vm::new, so CompiledSimulation stays a pure, symbolizable, salsa-cached artifact. Flow opcodes -23.5%, run -6.8%.

Remaining proposals (in the doc)

R4 (RuntimeView::flat_offset rebuilds a SmallVec per element — now the largest remaining run lever for arrayed models), R3 (more superinstructions), and compile-side C2/C3, plus a note on a full register VM vs the pragmatic 2-operand fusion landed here.

The EvalModule opcode rebuilt a (String, BTreeSet<String>) module key and SipHashed it for a HashMap lookup on every module evaluation, every timestep -- ~1.3M key constructions on C-LEARN, and the entire per-timestep allocation churn. Replace the three key-keyed module maps with a single indexed Vec<ResolvedModule> plus a per-module child_targets table resolved once at Vm::new, so the eval recursion threads a module index and array-indexes its children. The ModuleKey map survives only for the cold set_value/clear_values literal-override paths. On C-LEARN: run -17% and run_to_end drops from ~2.9M allocations to zero. reify_0_arity_builtins called to_lowercase() -- a heap allocation -- on every variable reference just to test membership in a 9-element ASCII set; add the allocation-free is_0_arity_builtin_fn_ci and only materialize the lowercased name in the rare genuine-builtin case (compile -3%). compile_var_fragment now reads the salsa-cached project_datamodel_dims instead of rebuilding the dims vec per variable. Adds a profiling harness (examples/clearn_profile.rs: per-stage timing plus a gated counting allocator) and CompiledSimulation::bytecode_profile() / Opcode::name() for opcode-histogram diagnostics. Behavior-preserving: the engine lib suite, the simulate integration tests, and clearn_residual_exactness pass with byte-identical compiled bytecode.

The release profile was opt-level="z" (size-tuned for the WASM bundle) and native CLI/server/MCP/pysimlin builds inherited it. Set [profile.release] opt-level=3 and force the WASM bundle back to opt-level=z via a target-keyed .cargo/config.toml ([target.wasm32-unknown-unknown] rustflags) -- keyed on the target so every wasm build path stays size-optimized regardless of invocation (verified: wasm bundle unchanged at ~7.2 MB; opt=3 would bloat it to ~9.7 MB). On C-LEARN, opt=3 is ~-30% compile / ~-41% simulate vs "z". Install mimalloc as the global allocator on native builds (the engine compile path is allocation-heavy; mimalloc roughly halves allocator time -- another ~-40% compile on top of opt=3). simlin-serve and simlin-mcp set it in main.rs; libsimlin gates it behind an opt-in `mimalloc` feature additionally cfg'd off for wasm32, which pysimlin (Makefile, build_wheels.py) enables for its cdylib. simlin-cli routes mimalloc through libsimlin's feature rather than declaring its own global allocator, so there is exactly one per artifact even under `cargo clippy --all-features`. WASM links no mimalloc. Cumulative on C-LEARN with the prior engine commit: compile 3574->1459 ms (-59%), run 342->168 ms (-51%). Full profile and remaining proposals are in docs/design/engine-performance.md.

Measured the get_unchecked ceiling on the hottest scalar opcode arms plus the dispatch code[pc] access: the C-LEARN run moved less than run-to-run noise (165-172 ms vs ~167 ms checked). At opt-level=3 an always-in-bounds check is a predicted, never-taken branch with an out-of-line cold panic path, so it is effectively free; the dispatch index is already elided in safe code because the `while pc < code.len()` loop guard dominates it. Records the safe-vs-unsafe analysis -- the data-driven curr[module_off+off] and literals[id] indices can only be elided with unsafe get_unchecked plus a static validation pass; the safe idioms (sequential iteration, fixed-size arrays, power-of-two masking) do not fit data-driven random access -- and the decision: do not add unsafe to a deny(unsafe_code) crate for a sub-noise gain. The run's instruction count, not its bounds checks, is the lever (R2). Also indexes the doc in docs/README.md (missed when the doc was first added).

A stack VM spends ~70% of its dispatches on load/store/binop. Fold the leaf operand loads of a binary op into the op: `LoadX; LoadY; Op2` becomes one BinVarVar/BinVarConst/BinConstVar (3->1), and `LoadX; Op2` (lhs already on the stack) becomes one BinStackVar/BinStackConst (2->1). The curr[] slot array is already the register file, so the fused ops read operands straight from it (or from literals); the stack carries only nested subexpression results. A 3-operand `dst = a op b` would exceed the asserted 8-byte Opcode budget (3xu16 + Op2), so the trailing store keeps the existing BinOpAssignCurr. The fusion is a late ByteCode::fuse_three_address pass applied to the Vm's flow/stock execution bytecode at Vm::new -- not at compile time -- so the CompiledSimulation stays a pure, symbolizable, salsa-cached artifact (the symbolic roundtrip path has no form for the fused opcodes). It reuses peephole_optimize's jump-target guard and old->new PC remap and preserves max_stack_depth, so the Stack-safety proof is unchanged. Initials are left unfused (run once; their AssignCurr targets are read by extract_assign_curr_offsets). On C-LEARN: flow opcodes 34673 -> 26539 (-23.5%), run 166.8 -> 155.4 ms (-6.8%). The runtime gain is smaller than the opcode reduction because the f64 arithmetic, stock phase, and array machinery are untouched -- only the scalar dispatch shrinks; scalar-heavy models benefit more. Behavior-preserving: full engine suite, simulate integration tests, and clearn_residual_exactness pass, with dedicated fusion-pass and end-to-end operand-order unit tests. Bytecode profiling moved to a vm_profile.rs sibling for the per-file line cap.

claude · 2026-05-20T15:04:11Z

Review: engine performance (module dispatch, native build levers, 3-address fusion)

I focused on the correctness-sensitive areas: the fuse_three_address pass and its operand-order/jump-fixup logic, the indexed module dispatch threading (module_idx/child_targets), the LoadVar vs LoadGlobalVar offset handling in the fused opcodes, the ASCII-insensitive 0-arity builtin check, and the build-config changes (opt-level split + mimalloc global allocator). The core logic checks out: the fused ops read curr[module_off+off]/literals with the correct operand order, LoadGlobalVar is correctly left unfused (so no module_off is wrongly applied), the jump-target guard mirrors the proven peephole_optimize pattern, and the single-global-allocator routing is consistent (serve/mcp declare their own; cli routes through libsimlin; serve/mcp don't depend on libsimlin so there's no double-allocator). The wasm bundle stays size-optimized because the CI jobs that set RUSTFLAGS build for the host only, while the frontend/serve-smoke jobs that build wasm don't set it. The findings below are all documentation/diagnostic-accuracy only.

[P3] Comment contradicts the actual fusion call site

src/simlin-engine/src/compiler/symbolic.rs:1182-1185

This comment states the 3-address fusion is applied by assemble_module ("the production assembler assemble_module applies it to this function's output instead"), but fuse_three_address is only ever called in CompiledSlicedSimulation::build at Vm::new (vm.rs:397) — never in assemble_module. The adjacent comment in assemble_module (db.rs) correctly says fusion "is applied later, at Vm::new." A maintainer following this pointer would look in the wrong place for the boundary that keeps CompiledModule symbolizable.

    // `resolve_module` is a pure symbolic<->concrete primitive (the roundtrip
    // tests symbolize its output again), so the 3-address fusion (R2) is NOT
    // applied here -- it is applied later at `Vm::new`, to the execution copy of
    // the bytecode, where the result is never re-symbolized.

[P3] RUSTFLAGS caveat understates which contexts set the env var

.cargo/config.toml:13-14

The caveat says "Today only the asan test scripts set RUSTFLAGS, and they build for the host." But .github/workflows/ci.yaml also sets RUSTFLAGS: -D warnings in the build, rust (clippy), and coverage jobs. The conclusion still holds today (those jobs build for the host, not wasm32, so the bundle is unaffected), but the premise the caveat rests on is inaccurate — the real invariant is "no RUSTFLAGS-setting context cross-compiles to wasm32," and a future wasm build added to one of those jobs would silently inherit opt-level=3. Worth rewording so the warning points at the actual hazard.

[P3] `estimate_fused_len` over-counts fusions vs. the implemented pass (diagnostics only)

src/simlin-engine/src/bytecode.rs:1386-1393

is_leaf_load here (and the docstring above it) count LoadGlobalVar as a fusable leaf, but fuse_three_address only fuses LoadVar/LoadConstant (there is no BinGlobalVar* opcode). So BytecodeProfile.flow_opcodes_after_fusion overstates the reduction for any model that loads global vars into binops. This only affects the clearn_profile diagnostic output, not the engine, but the estimate and the real pass should agree on what a leaf load is.

                Opcode::LoadVar { .. } | Opcode::LoadConstant { .. }

(and drop the LoadGlobalVar mention from the docstring on line 1386).

Overall correctness: correct

No functional bugs found. The fusion pass is operand-order-correct and stack-effect-preserving, the indexed dispatch is faithfully threaded, and the build/allocator changes are sound. Existing behavior and tests are preserved. The three findings above are documentation/diagnostic-accuracy nits and do not block.

codecov · 2026-05-20T16:22:22Z

Codecov Report

❌ Patch coverage is 46.36364% with 177 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.80%. Comparing base (2ed9395) to head (bf51b34).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/simlin-engine/src/bytecode.rs	41.87%	93 Missing ⚠️
src/simlin-engine/src/vm.rs	67.82%	37 Missing ⚠️
src/simlin-engine/src/vm_profile.rs	0.00%	28 Missing ⚠️
src/simlin-engine/examples/clearn_profile.rs	0.00%	19 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #594      +/-   ##
==========================================
- Coverage   82.92%   82.80%   -0.12%     
==========================================
  Files         258      260       +2     
  Lines       69079    69286     +207     
==========================================
+ Hits        57284    57374      +90     
- Misses      11795    11912     +117

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

bpowers added 4 commits May 20, 2026 06:37

bpowers merged commit bf9b19e into main May 20, 2026
13 checks passed

bpowers deleted the engine-perf-clearn branch May 20, 2026 15:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

engine: C-LEARN performance (module dispatch, native build levers, 3-address fusion)#594

engine: C-LEARN performance (module dispatch, native build levers, 3-address fusion)#594
bpowers merged 4 commits into
mainfrom
engine-perf-clearn

bpowers commented May 20, 2026

Uh oh!

claude Bot commented May 20, 2026

Uh oh!

Uh oh!

codecov Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bpowers commented May 20, 2026

Headline results on C-LEARN

What's here (4 commits)

Remaining proposals (in the doc)

Uh oh!

claude Bot commented May 20, 2026

Review: engine performance (module dispatch, native build levers, 3-address fusion)

[P3] Comment contradicts the actual fusion call site

[P3] RUSTFLAGS caveat understates which contexts set the env var

[P3] estimate_fused_len over-counts fusions vs. the implemented pass (diagnostics only)

Overall correctness: correct

Uh oh!

Uh oh!

codecov Bot commented May 20, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

[P3] `estimate_fused_len` over-counts fusions vs. the implemented pass (diagnostics only)