Merged
38 commits
1c5605c
doc: add layout quality eval design plan
bpowers May 23, 2026
f27ac2b
doc: refine layout quality eval plan per review
bpowers May 23, 2026
c33fcd3
engine: expose diagram geometry modules to layout (pub(crate))
bpowers May 23, 2026
39c09f0
engine: add rect overlap + segment-clip helpers to diagram::common
bpowers May 23, 2026
d643bdb
engine: factor arc geometry out of render_arc into a shared polyline …
bpowers May 23, 2026
65e564c
engine: count view crossings on sampled connector polylines
bpowers May 23, 2026
e72936e
engine: add pure LayoutMetrics quality core (metrics.rs)
bpowers May 23, 2026
69b1c99
engine: phase 1 fixups
bpowers May 23, 2026
7dac9f5
engine: fix label_overlap double-count and narrow crossings scale com…
bpowers May 23, 2026
fb32e36
engine: add pure layout eval statistics primitives (eval_stats.rs)
bpowers May 23, 2026
a94a930
engine: add ModelStats/CorpusReport constructors
bpowers May 23, 2026
26a2232
engine: add baseline-vs-candidate compare() with Mann-Whitney signifi…
bpowers May 23, 2026
1e069d7
engine: add worst_seed tie-break regression test
bpowers May 23, 2026
df6ac00
engine: add layout_eval example skeleton + expose LAYOUT_SEEDS
bpowers May 23, 2026
e3bea95
engine: layout_eval per-seed sweep + ModelStats/CorpusReport
bpowers May 23, 2026
7139212
engine: layout_eval renders best/median/worst + reference PNGs
bpowers May 23, 2026
36419e4
engine: make layout deterministic per seed (fix #633)
bpowers May 23, 2026
7eadbca
engine: layout_eval emits metrics.json + index.html contact-sheet
bpowers May 23, 2026
3272316
engine: layout_eval baseline diff via compare()
bpowers May 23, 2026
38afb57
engine: layout_eval skip-on-failure + full-sweep smoke check
bpowers May 23, 2026
9d0c9ff
engine: add default_projects models to layout_eval corpus
bpowers May 23, 2026
ac1a148
engine: compute node_overlap and node_connector_overlap on bare shape…
bpowers May 23, 2026
8a1cd15
engine: suppress spurious crossings where links meet flow valves/atta…
bpowers May 23, 2026
6113910
engine: compute loop_compactness term (isoperimetric loop quality)
bpowers May 23, 2026
71afe1b
engine: make label_overlap measure per-label obscuration (small-colli…
bpowers May 23, 2026
386b701
engine: commit calibrated default MetricWeights (readability-dominant)
bpowers May 23, 2026
360024c
engine: add human-vs-auto reference-pair ordering test (AC5.2)
bpowers May 23, 2026
634db4c
engine: layout_eval uses calibrated default weights + reseed baseline
bpowers May 23, 2026
d2f85c3
engine: document population anchor margin and loop cycle canonicaliza…
bpowers May 23, 2026
ccdd783
engine: rung 0 - select best layout by weighted_cost
bpowers May 23, 2026
095ea6b
engine: test rung-0 weighted_cost selection (incl. more-crossings case)
bpowers May 23, 2026
829396e
engine: add deterministic weighted_cost regression guard
bpowers May 23, 2026
53d9c2c
engine: assert per-seed layout determinism (byte-identical)
bpowers May 23, 2026
e78c8a5
engine: make select_best_layout prefer finite cost over NaN regardles…
bpowers May 23, 2026
4754e6c
doc: update simlin-engine context for layout-quality metric
bpowers May 23, 2026
37e2719
doc: add test plan for layout quality eval
bpowers May 23, 2026
35ef355
engine: score loop_compactness on flow valves, not pipe-extent centers
bpowers May 23, 2026
001bf34
engine: union per-segment node-connector overlap (no double-count acr…
bpowers May 24, 2026
2 changes: 2 additions & 0 deletions docs/README.md
@@ -32,9 +32,11 @@
- [design-plans/2026-05-19-clearn-residual.md](design-plans/2026-05-19-clearn-residual.md) -- Close C-LEARN's residual (#590/#591) as general Vensim import/simulation primitives: arrayed inline graphical functions, import-time macro shadowing, user-macro INITIAL recurrence, residual attribution; 5 phases
- [design-plans/2026-05-20-wasm-backend.md](design-plans/2026-05-20-wasm-backend.md) -- WebAssembly code-generation backend: compile a model to one self-contained wasm module as an alternative to the bytecode VM (for fast interactive re-simulation), validated to full VM parity; 8 phases
- [design-plans/2026-05-22-engine-wasm-sim.md](design-plans/2026-05-22-engine-wasm-sim.md) -- Integrate the wasm backend into `@simlin/engine` as a selectable engine (`Model.simulate({engine:'wasm'})`): vm-vs-wasm demux below the `Sim` facade in `DirectBackend`, a resumable blob run ABI for `runTo`, and a node VM-vs-wasm benchmark; 4 phases
- [design-plans/2026-05-22-layout-quality-eval.md](design-plans/2026-05-22-layout-quality-eval.md) -- Layout quality evaluation + hill-climbing harness: a pure geometry-accurate `LayoutMetrics` (overlap/sprawl/accurate-arc crossings) and benchstat-style seed-distribution stats, an on-demand corpus sweep that renders and scores layouts against human references, and Rung 0 (rank seeds by `weighted_cost`); 5 phases
- [plans/](plans/README.md) -- Implementation plans (active and completed)
- [test-plans/](test-plans/) -- Human verification plans for completed features
- [test-plans/2026-05-22-engine-wasm-sim.md](test-plans/2026-05-22-engine-wasm-sim.md) -- Manual verification for the `@simlin/engine` selectable wasm engine (`Model.simulate({engine:'wasm'})`): re-running the automated gates, driving the gated/`#[ignore]`d heavy tests, and the human-judged extras (interactive scrubbing feel, VM-vs-wasm benchmark numbers); all 25 ACs already have automated coverage
- [test-plans/2026-05-22-layout-quality-eval.md](test-plans/2026-05-22-layout-quality-eval.md) -- Manual verification for the layout-quality eval: running the on-demand corpus sweep and inspecting its `target/layout-eval/` artifacts (metrics.json, the worst-first contact-sheet), plus the human-judgment calibration gate (best/median/worst ordering, reference-vs-auto scoring, weight magnitudes)
- `implementation-plans/` -- Detailed phase-by-phase implementation plans, created during plan execution

## Security
564 changes: 564 additions & 0 deletions docs/design-plans/2026-05-22-layout-quality-eval.md

Large diffs are not rendered by default.

85 changes: 85 additions & 0 deletions docs/test-plans/2026-05-22-layout-quality-eval.md
@@ -0,0 +1,85 @@
# Test Plan: Layout Quality Evaluation

Human verification plan for the layout-quality-eval feature (implementation plan
`docs/implementation-plans/2026-05-22-layout-quality-eval/`). The automated suite
proves the metric math, the selection rule, and per-seed determinism. This plan
covers what automated tests cannot: that the on-demand corpus **sweep** emits the
right artifacts, and that the **human-judgment** calls (best/median/worst
ordering, reference-vs-auto scoring, weight magnitudes) match a modeler's eye.
This is the gate for AC3.*, AC4.1-4.3, and the human-in-the-loop part of AC5.

## Prerequisites

- Repo at a commit including the layout-quality-eval branch, clean working tree.
Run `./scripts/dev-init.sh`.
- Toolchain that can build `resvg` (the `png_render` feature):
`cargo build -p simlin-engine --features png_render,file_io --example layout_eval`
should finish without error.
- A browser to open `target/layout-eval/index.html`, and a JSON viewer / `jq`
for `target/layout-eval/metrics.json`.
- Automated gate already green:
`cargo test -p simlin-engine --lib layout::` and
`cargo test -p simlin-engine --features file_io --test layout`.

## Phase 1: Time-boxed smoke run (fast confidence)

| Step | Action | Expected |
|------|--------|----------|
| 1 | `LAYOUT_EVAL_MODELS=teacup,sir LAYOUT_EVAL_SEEDS=4 cargo run --release -p simlin-engine --features png_render,file_io --example layout_eval` | Exits 0 (AC3.1). stdout prints a per-model `sir: median=… p25/p75=…/… best_of_k=… (M=4)` line and `corpus: geomean_of_medians=… (2 model(s) scored)`. |
| 2 | `ls target/layout-eval/` | Contains `metrics.json`, `index.html`, and PNGs: `sir_best/median/worst/reference.png`, `teacup_best/median/worst/reference.png`. |
| 3 | `git status --porcelain target/` | Empty — nothing under `target/` is tracked (AC3.5). |
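
The per-model line in step 1 summarizes the distribution of per-seed costs. A minimal sketch of how such a summary could be derived, assuming linear-interpolated quantiles and assuming `best_of_k` is simply the lowest cost among the swept seeds (`quantile`, `summarize`, and `Summary` are hypothetical names, not the engine's API):

```rust
/// Linear-interpolated quantile over an already-sorted slice (q in 0..=1).
/// The interpolation rule is an assumption; the engine may use another one.
fn quantile(sorted: &[f64], q: f64) -> f64 {
    assert!(!sorted.is_empty(), "need at least one sample");
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    sorted[lo] + (sorted[hi] - sorted[lo]) * frac
}

/// Hypothetical per-model summary mirroring the fields printed on stdout.
struct Summary {
    median: f64,
    p25: f64,
    p75: f64,
    best_of_k: f64,
    m: usize,
}

/// Summarize one model's per-seed weighted costs.
fn summarize(costs: &[f64]) -> Summary {
    let mut sorted = costs.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    Summary {
        median: quantile(&sorted, 0.5),
        p25: quantile(&sorted, 0.25),
        p75: quantile(&sorted, 0.75),
        // Assumption: best-of-k is the minimum cost over the M seeds.
        best_of_k: sorted[0],
        m: sorted.len(),
    }
}

fn main() {
    // Made-up costs for four seeds, matching LAYOUT_EVAL_SEEDS=4.
    let s = summarize(&[4.0, 1.0, 3.0, 2.0]);
    println!(
        "sir: median={:.3} p25/p75={:.3}/{:.3} best_of_k={:.3} (M={})",
        s.median, s.p25, s.p75, s.best_of_k, s.m
    );
}
```

The costs in `main` are illustrative only; real values come from the sweep.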

## Phase 2: Full corpus sweep + artifact inspection

| Step | Action | Expected |
|------|--------|----------|
| 1 | `cargo run --release -p simlin-engine --features png_render,file_io --example layout_eval` (no env overrides: all corpus keys, M=25) | Exits 0. Each model prints its median/spread/best-of-k line; corpus aggregate at the end. Runtime is minutes (deliberately kept out of `cargo test`). |
| 2 | Open `target/layout-eval/metrics.json` | Valid JSON. Each `per_model[]` has the full `LayoutMetrics` breakdown (`node_overlap`, `node_connector_overlap`, `label_overlap`, `crossings`, `sprawl`, `edge_length_cv`, `aspect_penalty`, `loop_compactness`, `chain_straightness`) + `weighted_cost`, `median_cost`, `spread`, `best_of_k_cost`, `best/median/worst_seed`. Top level has `geomean_of_medians` and the `weights` set (AC3.2). |
| 3 | Verify AC4.2 by hand: collect each model's `median_cost`, compute their (epsilon-floored) geometric mean, compare to `geomean_of_medians` | The two agree to a few decimals. |
| 4 | Open `target/layout-eval/index.html` in a browser | Contact sheet sorted **worst weighted_cost first**. Each model row shows best/median/worst (and reference where present) thumbnails with a per-term cost breakdown and the `median / p25/p75 / best_of_k / M=25` summary (AC3.3). Header shows `geomean_of_medians` and the weight set. |
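
The hand check in step 3 can be reproduced with a few lines of code. This is a sketch only: it assumes the floor is applied as `max(median, epsilon)` before taking logs, and the epsilon value is a placeholder, since the plan does not state either detail. `geomean_floored` is a hypothetical helper name.

```rust
/// Geometric mean of per-model medians, with each median floored at `epsilon`
/// so a zero-cost model does not collapse the product to zero.
/// Assumption: the floor is `max(median, epsilon)`; the engine may differ.
fn geomean_floored(medians: &[f64], epsilon: f64) -> f64 {
    assert!(!medians.is_empty(), "need at least one model");
    // Average the logs rather than multiplying, to avoid overflow/underflow.
    let log_sum: f64 = medians.iter().map(|m| m.max(epsilon).ln()).sum();
    (log_sum / medians.len() as f64).exp()
}

fn main() {
    // Made-up medians; in practice copy each `median_cost` from metrics.json.
    let medians = [2.0, 8.0, 0.0];
    let epsilon = 1e-9; // placeholder value, not taken from the engine
    println!("geomean_of_medians ~= {:.6}", geomean_floored(&medians, epsilon));
}
```

Compare the printed value with the top-level `geomean_of_medians`; a mismatch beyond a few decimals suggests either a different epsilon or a different flooring rule than assumed here.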

## Phase 3: Human-judgment checks (the calibration gate, AC5.1 / AC5.2)

These are the calls only a human can make; sign-off here closes the
human-in-the-loop component of AC5.

| Step | Action | Expected (human judgment) |
|------|--------|---------------------------|
| 1 (best/median/worst ordering) | For 3-4 models (e.g. `sir`, `fishbanks`, `reliability`, `population`), look at the three generated thumbnails side by side | "best" should genuinely look cleanest (fewest overlaps/crossings, labels readable); "worst" messiest. If the metric's "best" looks worse than its "worst", that is calibration feedback — record it, do not silently accept it. |
| 2 (reference vs auto) | For each model shipping a `*_reference.png`, compare it to that model's `*_best.png` and read both `weighted_cost` values | For `reliability`, `fishbanks`, `population`, `logistic-growth`: the hand-authored reference should both look cleaner and carry the lower `weighted_cost` (the human<auto direction the AC5.2 tests pin). For `sir`: the reference deliberately obscures more labels, so the auto scores lower — confirm that asymmetry looks right. |
| 3 (weight magnitudes, AC5.1) | Read the weight set in the `index.html` header / `metrics.json` | Overlap + crossings family carry the dominant weights; `sprawl`/`edge_length_cv`/`aspect_penalty` are 0; `loop_compactness` is a small positive nudge (0.25); `chain_straightness` is 0. Confirm these still match intent over the contact sheet, then sign off. |
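
To make the weight discussion in step 3 concrete, here is a rough sketch of how a weight set could turn the per-term breakdown into a single `weighted_cost`, and how rung 0 might pick the best seed from those costs. It assumes `weighted_cost` is a plain weighted sum, reuses one hypothetical `Terms` struct for both metrics and weights, and uses placeholder weights of 1.0 for the overlap and crossings family (only the zeros and the 0.25 for `loop_compactness` come from this plan). The NaN handling follows the commit title about preferring finite costs; the fallback when no cost is finite is an assumption.

```rust
/// Hypothetical container used for both a metric breakdown and a weight set.
#[derive(Clone, Copy, Default)]
struct Terms {
    node_overlap: f64,
    node_connector_overlap: f64,
    label_overlap: f64,
    crossings: f64,
    sprawl: f64,
    edge_length_cv: f64,
    aspect_penalty: f64,
    loop_compactness: f64,
    chain_straightness: f64,
}

/// Assumption: the scalar cost is a plain weighted sum of the terms.
fn weighted_cost(m: &Terms, w: &Terms) -> f64 {
    m.node_overlap * w.node_overlap
        + m.node_connector_overlap * w.node_connector_overlap
        + m.label_overlap * w.label_overlap
        + m.crossings * w.crossings
        + m.sprawl * w.sprawl
        + m.edge_length_cv * w.edge_length_cv
        + m.aspect_penalty * w.aspect_penalty
        + m.loop_compactness * w.loop_compactness
        + m.chain_straightness * w.chain_straightness
}

/// Index of the lowest finite cost (earliest index wins ties).
/// Assumption: if no cost is finite, fall back to the first candidate.
fn select_best(costs: &[f64]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, c) in costs.iter().enumerate() {
        if !c.is_finite() {
            continue;
        }
        match best {
            Some(b) if costs[b] <= *c => {}
            _ => best = Some(i),
        }
    }
    best.or(if costs.is_empty() { None } else { Some(0) })
}

fn main() {
    // Placeholder weights: 1.0 for the dominant family is NOT the calibrated value.
    let w = Terms {
        node_overlap: 1.0,
        node_connector_overlap: 1.0,
        label_overlap: 1.0,
        crossings: 1.0,
        loop_compactness: 0.25,
        ..Terms::default()
    };
    let m = Terms { crossings: 2.0, sprawl: 100.0, loop_compactness: 4.0, ..Terms::default() };
    println!("weighted_cost = {}", weighted_cost(&m, &w));
    println!("best index = {:?}", select_best(&[f64::NAN, 5.0, 3.0]));
}
```

With zero weights on `sprawl`, `edge_length_cv`, `aspect_penalty`, and `chain_straightness`, those terms are still reported in the breakdown but cannot influence which seed is selected.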

## End-to-End: baseline-vs-candidate regression diff (AC4.3)

Validates the full statistical-comparison path (per-model + aggregate deltas with
Mann-Whitney U p-values + significance) a future tuning change would rely on.

1. Seed a baseline: `LAYOUT_EVAL_WRITE_BASELINE=1 cargo run --release -p simlin-engine --features png_render,file_io --example layout_eval`. stdout notes the baseline was written to `examples/layout_eval_baseline.json`.
2. Run a plain candidate sweep (no `WRITE_BASELINE`).
3. In stdout and the `index.html` "baseline diff" section: each model shows a signed `delta_ratio` %, a `p_value`, and a significance verdict; an aggregate delta + verdict is shown.
4. Sanity check (matches automated AC4.5): an unchanged candidate compared against the just-written baseline shows deltas near 0% and is non-significant everywhere. A genuinely different candidate (e.g. after a deliberate weight change) shows non-zero deltas; large, consistent ones read as significant.
5. Reset the committed baseline when done: `git checkout examples/layout_eval_baseline.json` (unless intentionally updating it).
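
For readers who want to sanity-check a reported `delta_ratio` or `p_value` by hand, the comparison can be approximated as below. This is a sketch under stated assumptions, not the engine's `compare()`: it assumes `delta_ratio` is the relative change in medians, uses the normal approximation to the Mann-Whitney U distribution without tie or continuity correction, approximates `erfc` with the Abramowitz-Stegun 7.1.26 polynomial, and takes 0.05 as the significance threshold. All function names are hypothetical.

```rust
/// Assumption: relative change of the candidate median versus the baseline median.
fn delta_ratio(baseline_median: f64, candidate_median: f64) -> f64 {
    (candidate_median - baseline_median) / baseline_median
}

/// Mann-Whitney U statistic for sample `a`, using midranks for tied values.
fn mann_whitney_u(a: &[f64], b: &[f64]) -> f64 {
    // Pool both samples, remembering which values came from `a`.
    let mut pooled: Vec<(f64, bool)> = a
        .iter()
        .map(|&x| (x, true))
        .chain(b.iter().map(|&x| (x, false)))
        .collect();
    pooled.sort_by(|x, y| x.0.total_cmp(&y.0));

    let mut rank_sum_a = 0.0;
    let mut i = 0;
    while i < pooled.len() {
        // Find the run of tied values [i, j); they share the average rank.
        let mut j = i;
        while j < pooled.len() && pooled[j].0 == pooled[i].0 {
            j += 1;
        }
        let midrank = (i + 1 + j) as f64 / 2.0;
        let count_a = pooled[i..j].iter().filter(|p| p.1).count();
        rank_sum_a += midrank * count_a as f64;
        i = j;
    }
    let n_a = a.len() as f64;
    rank_sum_a - n_a * (n_a + 1.0) / 2.0
}

/// Complementary error function for x >= 0 (Abramowitz-Stegun 7.1.26).
fn erfc_approx(x: f64) -> f64 {
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = t
        * (0.254829592
            + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    poly * (-x * x).exp()
}

/// Two-sided p-value from the normal approximation (no tie/continuity correction).
fn mann_whitney_p(a: &[f64], b: &[f64]) -> f64 {
    let (n_a, n_b) = (a.len() as f64, b.len() as f64);
    let u = mann_whitney_u(a, b);
    let mean = n_a * n_b / 2.0;
    let sd = (n_a * n_b * (n_a + n_b + 1.0) / 12.0).sqrt();
    let z = (u - mean) / sd;
    erfc_approx(z.abs() / std::f64::consts::SQRT_2)
}

fn main() {
    // Made-up per-seed costs; real ones come from the baseline and candidate sweeps.
    let baseline = [10.0, 11.0, 12.0, 13.0, 14.0];
    let candidate = [10.5, 11.5, 12.5, 13.5, 14.5];
    let p = mann_whitney_p(&candidate, &baseline);
    let delta = delta_ratio(12.0, 12.5);
    println!("delta_ratio={:+.1}% p_value={:.3} significant={}", delta * 100.0, p, p < 0.05);
}
```

The normal approximation is reasonable at M=25 seeds per side but coarse for very small samples, so treat small-M p-values from this sketch as indicative only.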

## End-to-End: skip-on-failure (AC3.6)

Confirms one bad model never aborts the sweep.

1. Run a sweep including a model whose file you temporarily make missing/unreadable.
2. Expected: a `WARN: skipping {key}: {err}` line is printed, that model is absent from `metrics.json`/`index.html`, and the sweep still exits 0 and writes a report for the survivors. Restore the file afterward.
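
The behavior being verified here amounts to a loop that records failures instead of propagating them. A minimal sketch, assuming a hypothetical `sweep` function and a caller-supplied scoring closure (neither is the engine's actual API, and whether the warning goes to stdout or stderr is also an assumption):

```rust
use std::collections::BTreeMap;

/// Score every model key, skipping (not aborting on) any that fail.
/// Returns the scored survivors plus the list of skipped keys.
fn sweep<F>(keys: &[&str], score: F) -> (BTreeMap<String, f64>, Vec<String>)
where
    F: Fn(&str) -> Result<f64, String>,
{
    let mut report = BTreeMap::new();
    let mut skipped = Vec::new();
    for &key in keys {
        match score(key) {
            Ok(cost) => {
                report.insert(key.to_string(), cost);
            }
            Err(err) => {
                // Warn and move on; the output stream used here is an assumption.
                eprintln!("WARN: skipping {key}: {err}");
                skipped.push(key.to_string());
            }
        }
    }
    (report, skipped)
}

fn main() {
    // Stand-in scorer: pretend one model file is missing.
    let (report, skipped) = sweep(&["teacup", "missing", "sir"], |key| {
        if key == "missing" {
            Err("file not found".to_string())
        } else {
            Ok(1.0)
        }
    });
    println!("{} model(s) scored, {} skipped", report.len(), skipped.len());
}
```

Because the error is handled inside the loop, the process can still exit 0 and write a report that covers only the surviving models.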

## Human Verification Required

| Criterion | Why Manual | Steps |
|-----------|------------|-------|
| AC5.1 (weight magnitudes) | Final numeric weights are a taste call over the contact sheet, not derivable from a test. | Phase 3 step 3. |
| AC5.2 (reference-pair selection + sign-off) | Which models are agreed anchors and whether the human layout truly looks better is human judgment. | Phase 3 steps 1-2. |
| AC8.2 (rungs 1-3 documented) | Documentation criterion; no implementation phase. | Read the "Additional Considerations / hill-climbing ladder" of `docs/design-plans/2026-05-22-layout-quality-eval.md`; confirm Rung 1 (`config.rs`/`sfdp.rs`/`annealing.rs`), Rung 2 (`annealing.rs`), Rung 3 (overlap-removal / obstacle-aware routing) are each named with their seam. |

## Notes

- Automated coverage was validated as PASS against
`docs/implementation-plans/2026-05-22-layout-quality-eval/test-requirements.md`
(20/20 automated criteria pass; AC3.* and AC4.1-4.3 are operational by design; AC8.2 is a documentation criterion).
- The corpus sweep is intentionally **not** part of `cargo test` (it renders PNGs
and runs for minutes). It is an on-demand developer tool whose artifacts live
under the gitignored `target/layout-eval/`.