Skip to content

engine: C-LEARN performance (module dispatch, native build levers, 3-address fusion)#594

Merged
bpowers merged 4 commits into
mainfrom
engine-perf-clearn
May 20, 2026
Merged

engine: C-LEARN performance (module dispatch, native build levers, 3-address fusion)#594
bpowers merged 4 commits into
mainfrom
engine-perf-clearn

Conversation

@bpowers
Copy link
Copy Markdown
Owner

@bpowers bpowers commented May 20, 2026

Profiles compiling + simulating the C-LEARN hero model (~53k MDL lines, 5726 slots, 1000 Euler steps) and lands the clear-win optimizations, with the investigation written up in docs/design/engine-performance.md. A reusable harness (examples/clearn_profile.rs) drove the measurements (per-stage timing + a gated counting allocator; perf/callgrind for CPU).

Headline results on C-LEARN

before after delta
compile 3574 ms 1459 ms -59%
simulate (run) 342 ms 155 ms -55%

All behavior-preserving: the engine lib suite, the simulate integration tests, and the clearn_residual_exactness guard (matches genuine Vensim's Ref.vdf byte-for-byte) pass on every commit.

What's here (4 commits)

  1. engine: index VM module dispatchEvalModule rebuilt a (String, BTreeSet<String>) key and SipHashed it for a HashMap lookup on every module eval every timestep. Replaced the keyed module maps with an indexed Vec<ResolvedModule> + per-module child_targets resolved once at Vm::new. Run -17%; per-timestep allocations 2.94M -> 0. Also an allocation-free 0-arity-builtin check on the parse hot path (compile -3%) and a cached project_datamodel_dims.

  2. build: opt-level=3 + mimalloc on native targets[profile.release] was opt-level="z" (right for the WASM bundle; native inherited it). Native now gets opt-level=3 (compile -30%, run -41%) with WASM forced back to z via a target-keyed .cargo/config.toml (bundle unchanged at 7.2 MB; verified). mimalloc as the global allocator on the native binaries + an opt-in libsimlin feature for pysimlin/C FFI (compile -40% more; compile is allocation-bound). Both native-only; WASM links no mimalloc.

  3. doc: R1 (bounds-check elimination) investigated, not worth it — measured the get_unchecked ceiling on the hottest scalar arms + the dispatch index: sub-noise (~0). At opt-level=3 an always-in-bounds check is a predicted, never-taken branch with an out-of-line cold panic path; the dispatch index is already elided in safe code (the loop guard dominates it). Records the safe-vs-unsafe analysis and the decision not to add unsafe to a #![deny(unsafe_code)] crate for no measurable gain.

  4. engine: fuse leaf operand loads into binary ops (3-address, R2) — folds Load; Load; Op2 (3->1) and Load; Op2 (2->1) into 3-address binops that read operands straight from curr[]/literals (the slot array is already the register file). A late fuse_three_address pass on the Vm's execution bytecode at Vm::new, so CompiledSimulation stays a pure, symbolizable, salsa-cached artifact. Flow opcodes -23.5%, run -6.8%.

Remaining proposals (in the doc)

R4 (RuntimeView::flat_offset rebuilds a SmallVec per element — now the largest remaining run lever for arrayed models), R3 (more superinstructions), and compile-side C2/C3, plus a note on a full register VM vs the pragmatic 2-operand fusion landed here.

bpowers added 4 commits May 20, 2026 06:37
The EvalModule opcode rebuilt a (String, BTreeSet<String>) module key and
SipHashed it for a HashMap lookup on every module evaluation, every timestep
-- ~1.3M key constructions on C-LEARN, and the entire per-timestep allocation
churn. Replace the three key-keyed module maps with a single indexed
Vec<ResolvedModule> plus a per-module child_targets table resolved once at
Vm::new, so the eval recursion threads a module index and array-indexes its
children. The ModuleKey map survives only for the cold set_value/clear_values
literal-override paths. On C-LEARN: run -17% and run_to_end drops from ~2.9M
allocations to zero.

reify_0_arity_builtins called to_lowercase() -- a heap allocation -- on every
variable reference just to test membership in a 9-element ASCII set; add the
allocation-free is_0_arity_builtin_fn_ci and only materialize the lowercased
name in the rare genuine-builtin case (compile -3%). compile_var_fragment now
reads the salsa-cached project_datamodel_dims instead of rebuilding the dims
vec per variable.

Adds a profiling harness (examples/clearn_profile.rs: per-stage timing plus a
gated counting allocator) and CompiledSimulation::bytecode_profile() /
Opcode::name() for opcode-histogram diagnostics. Behavior-preserving: the
engine lib suite, the simulate integration tests, and clearn_residual_exactness
pass with byte-identical compiled bytecode.
The release profile was opt-level="z" (size-tuned for the WASM bundle) and
native CLI/server/MCP/pysimlin builds inherited it. Set [profile.release]
opt-level=3 and force the WASM bundle back to opt-level=z via a target-keyed
.cargo/config.toml ([target.wasm32-unknown-unknown] rustflags) -- keyed on the
target so every wasm build path stays size-optimized regardless of invocation
(verified: wasm bundle unchanged at ~7.2 MB; opt=3 would bloat it to ~9.7 MB).
On C-LEARN, opt=3 is ~-30% compile / ~-41% simulate vs "z".

Install mimalloc as the global allocator on native builds (the engine compile
path is allocation-heavy; mimalloc roughly halves allocator time -- another
~-40% compile on top of opt=3). simlin-serve and simlin-mcp set it in main.rs;
libsimlin gates it behind an opt-in `mimalloc` feature additionally cfg'd off
for wasm32, which pysimlin (Makefile, build_wheels.py) enables for its cdylib.
simlin-cli routes mimalloc through libsimlin's feature rather than declaring its
own global allocator, so there is exactly one per artifact even under
`cargo clippy --all-features`. WASM links no mimalloc.

Cumulative on C-LEARN with the prior engine commit: compile 3574->1459 ms
(-59%), run 342->168 ms (-51%). Full profile and remaining proposals are in
docs/design/engine-performance.md.
Measured the get_unchecked ceiling on the hottest scalar opcode arms plus the
dispatch code[pc] access: the C-LEARN run moved less than run-to-run noise
(165-172 ms vs ~167 ms checked). At opt-level=3 an always-in-bounds check is a
predicted, never-taken branch with an out-of-line cold panic path, so it is
effectively free; the dispatch index is already elided in safe code because the
`while pc < code.len()` loop guard dominates it.

Records the safe-vs-unsafe analysis -- the data-driven curr[module_off+off] and
literals[id] indices can only be elided with unsafe get_unchecked plus a static
validation pass; the safe idioms (sequential iteration, fixed-size arrays,
power-of-two masking) do not fit data-driven random access -- and the decision:
do not add unsafe to a deny(unsafe_code) crate for a sub-noise gain. The run's
instruction count, not its bounds checks, is the lever (R2). Also indexes the
doc in docs/README.md (missed when the doc was first added).
A stack VM spends ~70% of its dispatches on load/store/binop. Fold the leaf
operand loads of a binary op into the op: `LoadX; LoadY; Op2` becomes one
BinVarVar/BinVarConst/BinConstVar (3->1), and `LoadX; Op2` (lhs already on the
stack) becomes one BinStackVar/BinStackConst (2->1). The curr[] slot array is
already the register file, so the fused ops read operands straight from it (or
from literals); the stack carries only nested subexpression results. A 3-operand
`dst = a op b` would exceed the asserted 8-byte Opcode budget (3xu16 + Op2), so
the trailing store keeps the existing BinOpAssignCurr.

The fusion is a late ByteCode::fuse_three_address pass applied to the Vm's
flow/stock execution bytecode at Vm::new -- not at compile time -- so the
CompiledSimulation stays a pure, symbolizable, salsa-cached artifact (the
symbolic roundtrip path has no form for the fused opcodes). It reuses
peephole_optimize's jump-target guard and old->new PC remap and preserves
max_stack_depth, so the Stack-safety proof is unchanged. Initials are left
unfused (run once; their AssignCurr targets are read by
extract_assign_curr_offsets).

On C-LEARN: flow opcodes 34673 -> 26539 (-23.5%), run 166.8 -> 155.4 ms (-6.8%).
The runtime gain is smaller than the opcode reduction because the f64
arithmetic, stock phase, and array machinery are untouched -- only the scalar
dispatch shrinks; scalar-heavy models benefit more. Behavior-preserving: full
engine suite, simulate integration tests, and clearn_residual_exactness pass,
with dedicated fusion-pass and end-to-end operand-order unit tests. Bytecode
profiling moved to a vm_profile.rs sibling for the per-file line cap.
@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Review: engine performance (module dispatch, native build levers, 3-address fusion)

I focused on the correctness-sensitive areas: the fuse_three_address pass and its operand-order/jump-fixup logic, the indexed module dispatch threading (module_idx/child_targets), the LoadVar vs LoadGlobalVar offset handling in the fused opcodes, the ASCII-insensitive 0-arity builtin check, and the build-config changes (opt-level split + mimalloc global allocator). The core logic checks out: the fused ops read curr[module_off+off]/literals with the correct operand order, LoadGlobalVar is correctly left unfused (so no module_off is wrongly applied), the jump-target guard mirrors the proven peephole_optimize pattern, and the single-global-allocator routing is consistent (serve/mcp declare their own; cli routes through libsimlin; serve/mcp don't depend on libsimlin so there's no double-allocator). The wasm bundle stays size-optimized because the CI jobs that set RUSTFLAGS build for the host only, while the frontend/serve-smoke jobs that build wasm don't set it. The findings below are all documentation/diagnostic-accuracy only.


[P3] Comment contradicts the actual fusion call site

src/simlin-engine/src/compiler/symbolic.rs:1182-1185

This comment states the 3-address fusion is applied by assemble_module ("the production assembler assemble_module applies it to this function's output instead"), but fuse_three_address is only ever called in CompiledSlicedSimulation::build at Vm::new (vm.rs:397) — never in assemble_module. The adjacent comment in assemble_module (db.rs) correctly says fusion "is applied later, at Vm::new." A maintainer following this pointer would look in the wrong place for the boundary that keeps CompiledModule symbolizable.

    // `resolve_module` is a pure symbolic<->concrete primitive (the roundtrip
    // tests symbolize its output again), so the 3-address fusion (R2) is NOT
    // applied here -- it is applied later at `Vm::new`, to the execution copy of
    // the bytecode, where the result is never re-symbolized.

[P3] RUSTFLAGS caveat understates which contexts set the env var

.cargo/config.toml:13-14

The caveat says "Today only the asan test scripts set RUSTFLAGS, and they build for the host." But .github/workflows/ci.yaml also sets RUSTFLAGS: -D warnings in the build, rust (clippy), and coverage jobs. The conclusion still holds today (those jobs build for the host, not wasm32, so the bundle is unaffected), but the premise the caveat rests on is inaccurate — the real invariant is "no RUSTFLAGS-setting context cross-compiles to wasm32," and a future wasm build added to one of those jobs would silently inherit opt-level=3. Worth rewording so the warning points at the actual hazard.


[P3] estimate_fused_len over-counts fusions vs. the implemented pass (diagnostics only)

src/simlin-engine/src/bytecode.rs:1386-1393

is_leaf_load here (and the docstring above it) count LoadGlobalVar as a fusable leaf, but fuse_three_address only fuses LoadVar/LoadConstant (there is no BinGlobalVar* opcode). So BytecodeProfile.flow_opcodes_after_fusion overstates the reduction for any model that loads global vars into binops. This only affects the clearn_profile diagnostic output, not the engine, but the estimate and the real pass should agree on what a leaf load is.

                Opcode::LoadVar { .. } | Opcode::LoadConstant { .. }

(and drop the LoadGlobalVar mention from the docstring on line 1386).


Overall correctness: correct

No functional bugs found. The fusion pass is operand-order-correct and stack-effect-preserving, the indexed dispatch is faithfully threaded, and the build/allocator changes are sound. Existing behavior and tests are preserved. The three findings above are documentation/diagnostic-accuracy nits and do not block.

@bpowers bpowers merged commit bf9b19e into main May 20, 2026
13 checks passed
@bpowers bpowers deleted the engine-perf-clearn branch May 20, 2026 15:06
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 46.36364% with 177 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.80%. Comparing base (2ed9395) to head (bf51b34).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/simlin-engine/src/bytecode.rs 41.87% 93 Missing ⚠️
src/simlin-engine/src/vm.rs 67.82% 37 Missing ⚠️
src/simlin-engine/src/vm_profile.rs 0.00% 28 Missing ⚠️
src/simlin-engine/examples/clearn_profile.rs 0.00% 19 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #594      +/-   ##
==========================================
- Coverage   82.92%   82.80%   -0.12%     
==========================================
  Files         258      260       +2     
  Lines       69079    69286     +207     
==========================================
+ Hits        57284    57374      +90     
- Misses      11795    11912     +117     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant