diff --git a/docs/baseline-policy.md b/docs/baseline-policy.md
new file mode 100644
index 00000000..8c86062e
--- /dev/null
+++ b/docs/baseline-policy.md
@@ -0,0 +1,137 @@
+# Equivalence baseline policy
+
+This document defines what the `python//tests/test_equivalence.py`
+suites assert, why they only run on Linux, and how a contributor
+promotes a new platform to canonical when the project's needs change.
+
+Tracking issue: [#213](https://github.com/k-yoshimi/task/issues/213).
+Design rationale: `docs/superpowers/specs/2026-05-26-linux-canonical-equiv-policy-design.md`.
+
+## What the 1e-10 contract asserts
+
+For each of 20 cases under `test_run/baselines//metrics.json`,
+the test loads `libapi.so` (via Python wrapper), replays the
+fixture parameters, advances the simulation, serializes the resulting
+state, and compares against the committed JSON at relative tolerance
+`1e-10`.
+
+The contract is: **same Fortran source + same compiler + same libm =
+bit-stable output**. It does NOT promise byte-equivalent output across
+different compilers or libm vendors.
+
+## What the contract does NOT assert
+
+- **Cross-platform bit-equivalence**. macOS Homebrew GCC 15.2.0 +
+  Apple libm and Ubuntu CI gfortran 13.x + glibc produce slightly
+  different floating-point output (single-ULP drift in transcendental
+  intrinsics like `exp` / `sin`). For tight iterative
+  solvers (FP collision operator, ray-tracing) the per-step drift
+  amplifies past 1e-10 within a handful of iterations.
+
+  The 4 cases that surface this on macOS today (issue #213):
+  `fp_dt1` (RPCT[0] rel_err 2.354e-10), `fp_iter01` (40+ RPCT
+  mismatches, worst 4.447e-9), `wrx_demo` (~1 scalar), `wrx_iter01`
+  (`pwr_tot` rel_err 1.382e-9).
+
+- **Per-platform reproducibility on non-canonical platforms**. The
+  policy is "one canonical platform's baseline is the truth". On
+  non-canonical platforms the test does not run.
+
+## The canonical platform
+
+**Ubuntu CI runner with gfortran 13.x**. The full set of
+`linux-gcc13` baselines under `test_run/baselines/*/metrics.json`
+was generated on clavius
+(memory `reference_clavius_baseline_regen.md`) and is exercised on
+every push by `.github/workflows/python-tests.yml` line 323's
+whole-tree pytest (which sweeps the 7 module `test_equivalence.py`
+suites).
+
+## Non-Linux behavior
+
+`python//tests/test_equivalence.py` carries:
+
+```python
+IS_LINUX = sys.platform.startswith("linux")
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. ... See docs/baseline-policy.md.",
+)
+class TestEquivalence(...):
+```
+
+On macOS / FreeBSD / Windows / any non-Linux: tests skip. On WSL
+or Linux containers running on macOS Docker: `sys.platform.startswith("linux")` is True,
+so they run. The policy is "Linux userland", not "physical host OS".
+
+The skip is **NOT overridable** by env var. The policy is binary:
+Linux userland or no equivalence check.
+
+## What macOS dev gets
+
+- `libapi.so` build and run normally via the Python wrappers
+  (`Tot`, `Eqlib`, `Trlib`, etc.). Other test suites
+  (`python//tests/test_lib.py`, etc.) exercise the
+  wrappers and ARE run on macOS; they catch ABI / load /
+  call-pattern issues.
+- Equivalence at 1e-10 is verified by Ubuntu CI on every push. Pulling
+  the PR after CI green is the verification gate.
+
+## How to verify equivalence locally on non-Linux
+
+Run a Linux container with gfortran 13.x:
+
+```bash
+docker run --rm -it -v "$(pwd)":/work -w /work ubuntu:24.04 bash -c "
+  apt-get update && apt-get install -y gfortran python3 python3-pip
+  pip3 install --break-system-packages pytest pytest-forked pytest-timeout pytest-mock
+  ./scripts/setup.sh   # bpsd clone + lib*api.so build
+  python3 -m pytest python/ --forked --timeout=120 --timeout-method=signal
+"
+```
+
+This reproduces the CI gate locally. Slower than running on macOS
+directly, but the only way to get the 1e-10 contract verified
+off-CI.
+
+## How to promote a new platform to canonical
+
+If a project priority arises (e.g. macOS becomes a supported
+production target, not just dev), two structural prerequisites must
+be addressed first; these block the platform-keyed baseline
+approach that was attempted and rejected on 2026-05-26 (Codex
+2-round execution blocker analysis):
+
+1. **Graphics-stubs gap**: 6 of 7 modules (`fp` / `ti` / `tr` /
+   `eq` / `wr` / `wrx`) link their standalone Fortran binaries
+   against real GFLIBS (`-lg3d-gfc64 -lgsp-gfc64 -lgdp-gfc64`),
+   which Homebrew does not provide. Only `tot` has
+   `tot_static_stubs.f90` for graphics-free linking.
+   `reference_clavius_baseline_regen.md` documents this.
+   Phase-L-sized work to write per-module `_static_stubs.f90` would
+   close the gap.
+2. **Python-fixture gap**: 6-8 of 20 baseline cases lack
+   `_params.py` Python fixtures (`eq_jt60`, `fp_jt60`,
+   `ti_min`, `ti_w`, `tr_m0904`, `wrx_jt60`, plus `tot_*_short`
+   name-mismatch resolution). They were generated by the
+   Linux-only standalone-binary regen workflow and are dead baselines
+   from the Python test surface's perspective. See [#215](https://github.com/k-yoshimi/task/issues/215)
+   for the gap inventory + how to close it case-by-case.
+
+Once both gaps close, the rejected platform-keyed design
+(`docs/superpowers/specs/2026-05-26-platform-keyed-baselines-design.md`,
+marked SUPERSEDED) can be revisited as a follow-up.
+
+## References
+
+- Memory: `feedback_equivalence_must_pass.md`: distinguishes
+  principled platform-scoped skip (this) from invisibility skip
+  (forbidden).
+- Memory: `reference_clavius_baseline_regen.md`: Linux canonical
+  baseline-gen host conventions.
+- Spec: `docs/superpowers/specs/2026-05-26-linux-canonical-equiv-policy-design.md`
+  (this design).
+- Superseded specs (kept as design history):
+  - `docs/superpowers/specs/2026-05-26-platform-keyed-baselines-design.md`
+  - `docs/superpowers/plans/2026-05-26-platform-keyed-baselines-implementation.md`
diff --git a/python/eqlib/tests/test_equivalence.py b/python/eqlib/tests/test_equivalence.py
index d3da8e8c..0adb0043 100644
--- a/python/eqlib/tests/test_equivalence.py
+++ b/python/eqlib/tests/test_equivalence.py
@@ -163,6 +163,18 @@ def _compare_with_baseline(actual: dict, case_name: str, tol: str = "1e-10") ->
     pass
 
 
+IS_LINUX = sys.platform.startswith("linux")
+
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. The 1e-10 baselines "
+    "live in test_run/baselines//metrics.json and were "
+    "generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
+    "non-Linux dev runs the libeqapi.so via the Python wrapper; "
+    "correctness is verified by Linux CI on every push. See "
+    "docs/baseline-policy.md.",
+)
 @unittest.skipUnless(
     _any_so_exists(),
     "libeqapi.so not built at any candidate path "
diff --git a/python/fplib/tests/test_equivalence.py b/python/fplib/tests/test_equivalence.py
index 45fed507..a6476b2e 100644
--- a/python/fplib/tests/test_equivalence.py
+++ b/python/fplib/tests/test_equivalence.py
@@ -165,6 +165,18 @@ def _fplib_importable() -> bool:
     return True
 
 
+IS_LINUX = sys.platform.startswith("linux")
+
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. The 1e-10 baselines "
+    "live in test_run/baselines//metrics.json and were "
+    "generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
+    "non-Linux dev runs the libfpapi.so via the Python wrapper; "
+    "correctness is verified by Linux CI on every push. See "
+    "docs/baseline-policy.md.",
+)
 @unittest.skipUnless(
     DEFAULT_SO.exists(),
     f"libfpapi.so not built at {DEFAULT_SO}; run `make -C fp libfpapi.so`",
diff --git a/python/tilib/tests/test_equivalence.py b/python/tilib/tests/test_equivalence.py
index 51a9fc75..8b5f67f2 100644
--- a/python/tilib/tests/test_equivalence.py
+++ b/python/tilib/tests/test_equivalence.py
@@ -113,6 +113,18 @@ def _tilib_importable() -> bool:
     return True
 
 
+IS_LINUX = sys.platform.startswith("linux")
+
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. The 1e-10 baselines "
+    "live in test_run/baselines//metrics.json and were "
+    "generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
+    "non-Linux dev runs the libtiapi.so via the Python wrapper; "
+    "correctness is verified by Linux CI on every push. See "
+    "docs/baseline-policy.md.",
+)
 @unittest.skipUnless(
     DEFAULT_SO.exists(),
     f"libtiapi.so not built at {DEFAULT_SO}; run `make -C ti libtiapi.so`",
diff --git a/python/totlib/tests/test_equivalence.py b/python/totlib/tests/test_equivalence.py
index b07583d7..01f2427c 100644
--- a/python/totlib/tests/test_equivalence.py
+++ b/python/totlib/tests/test_equivalence.py
@@ -176,6 +176,18 @@ def _compare_with_baseline(actual: dict, case_name: str, tol: str = "1e-10") ->
     pass
 
 
+IS_LINUX = sys.platform.startswith("linux")
+
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. The 1e-10 baselines "
+    "live in test_run/baselines//metrics.json and were "
+    "generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
+    "non-Linux dev runs the libtotapi.so via the Python wrapper; "
+    "correctness is verified by Linux CI on every push. See "
+    "docs/baseline-policy.md.",
+)
 @unittest.skipUnless(
     _any_so_exists(),
     "libtotapi.so not built at any candidate path "
diff --git a/python/trlib/tests/test_equivalence.py b/python/trlib/tests/test_equivalence.py
index 0eee179f..b3e0024b 100644
--- a/python/trlib/tests/test_equivalence.py
+++ b/python/trlib/tests/test_equivalence.py
@@ -149,6 +149,18 @@ def _trlib_importable() -> bool:
     return True
 
 
+IS_LINUX = sys.platform.startswith("linux")
+
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. The 1e-10 baselines "
+    "live in test_run/baselines//metrics.json and were "
+    "generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
+    "non-Linux dev runs the libtrapi.so via the Python wrapper; "
+    "correctness is verified by Linux CI on every push. See "
+    "docs/baseline-policy.md.",
+)
 @unittest.skipUnless(
     DEFAULT_SO.exists(),
     f"libtrapi.so not built at {DEFAULT_SO}; run `make -C tr libtrapi.so`",
diff --git a/python/wrlib/tests/test_equivalence.py b/python/wrlib/tests/test_equivalence.py
index cdae6c84..c2367c9d 100644
--- a/python/wrlib/tests/test_equivalence.py
+++ b/python/wrlib/tests/test_equivalence.py
@@ -232,6 +232,18 @@ def _wrlib_importable() -> bool:
     return True
 
 
+IS_LINUX = sys.platform.startswith("linux")
+
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. The 1e-10 baselines "
+    "live in test_run/baselines//metrics.json and were "
+    "generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
+    "non-Linux dev runs the libwrapi.so via the Python wrapper; "
+    "correctness is verified by Linux CI on every push. See "
+    "docs/baseline-policy.md.",
+)
 @unittest.skipUnless(
     DEFAULT_SO.exists(),
     f"libwrapi.so not built at {DEFAULT_SO}; run `make -C wr libwrapi.so`",
diff --git a/python/wrxlib/tests/test_equivalence.py b/python/wrxlib/tests/test_equivalence.py
index ae3e5d67..e31241ee 100644
--- a/python/wrxlib/tests/test_equivalence.py
+++ b/python/wrxlib/tests/test_equivalence.py
@@ -127,6 +127,18 @@ def _compare_with_baseline(actual: dict, case_name: str, tol: str = "1e-10") ->
     pass
 
 
+IS_LINUX = sys.platform.startswith("linux")
+
+
+@unittest.skipUnless(
+    IS_LINUX,
+    "Equivalence tests are Linux-canonical. The 1e-10 baselines "
+    "live in test_run/baselines//metrics.json and were "
+    "generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
+    "non-Linux dev runs the libwrxapi.so via the Python wrapper; "
+    "correctness is verified by Linux CI on every push. See "
+    "docs/baseline-policy.md.",
+)
 @unittest.skipUnless(
     _any_so_exists(),
     "libwrxapi.so not built at any candidate path "
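
Review note: the policy document relies on a relative tolerance of `1e-10`, but the diff does not show the body of `_compare_with_baseline`. The sketch below is only an illustration of what such a check might look like, not the project's implementation. The names `rel_err` and `within_contract`, the baseline value `1.0`, and the zero-baseline fallback are all hypothetical. It shows why a single-ULP difference passes the contract while the `fp_dt1` drift reported in the document (rel_err 2.354e-10) does not.

```python
import sys

TOL = 1e-10  # relative tolerance named in docs/baseline-policy.md


def rel_err(actual: float, expected: float) -> float:
    """Relative error of `actual` against a baseline value.

    Hypothetical helper: falls back to absolute error when the
    baseline is exactly zero (an assumption, not the project's rule).
    """
    if expected == 0.0:
        return abs(actual)
    return abs(actual - expected) / abs(expected)


def within_contract(actual: float, expected: float, tol: float = TOL) -> bool:
    """True when `actual` matches the baseline at relative tolerance `tol`."""
    return rel_err(actual, expected) <= tol


if __name__ == "__main__":
    baseline = 1.0  # made-up baseline value, for illustration only

    # One ULP above the baseline: far inside the 1e-10 contract.
    one_ulp_off = baseline + sys.float_info.epsilon
    print(within_contract(one_ulp_off, baseline))

    # Drift of the size reported for fp_dt1 on macOS: outside the contract.
    drifted = baseline * (1.0 + 2.354e-10)
    print(within_contract(drifted, baseline))
```

A single-ULP difference alone is about six orders of magnitude below the tolerance, so the macOS failures the document lists are consistent with drift that has been amplified over solver iterations rather than with one rounding difference.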