137 changes: 137 additions & 0 deletions docs/baseline-policy.md
@@ -0,0 +1,137 @@
# Equivalence baseline policy

This document defines what the `python/<mod>/tests/test_equivalence.py`
suites assert, why they only run on Linux, and how a contributor
promotes a new platform to canonical when the project's needs change.

Tracking issue: [#213](https://github.com/k-yoshimi/task/issues/213).
Design rationale: `docs/superpowers/specs/2026-05-26-linux-canonical-equiv-policy-design.md`.

## What the 1e-10 contract asserts

For each of 20 cases under `test_run/baselines/<case>/metrics.json`,
the test loads `lib<mod>api.so` (via Python wrapper), replays the
fixture parameters, advances the simulation, serializes the resulting
state, and compares against the committed JSON at relative tolerance
`1e-10`.
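
As a minimal sketch of what a relative-tolerance comparison at `1e-10` can look like, the snippet below walks a nested metrics structure and reports every path whose relative error exceeds the tolerance. The names `rel_err` and `find_mismatches` and the sample metrics layout are hypothetical, chosen for illustration; the project's own `_compare_with_baseline` helper may differ in both schema and behavior.

```python
import json


def rel_err(actual: float, expected: float) -> float:
    """Relative error; falls back to absolute error when expected is 0."""
    if expected == 0.0:
        return abs(actual)
    return abs(actual - expected) / abs(expected)


def find_mismatches(actual, expected, tol=1e-10, path=""):
    """Recursively compare nested dicts/lists; return (path, reason) pairs."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return [(path, "type")]
        out = []
        for key, exp_val in expected.items():
            sub = f"{path}.{key}" if path else str(key)
            if key not in actual:
                out.append((sub, "missing"))
            else:
                out.extend(find_mismatches(actual[key], exp_val, tol, sub))
        return out
    if isinstance(expected, list):
        if not isinstance(actual, list) or len(actual) != len(expected):
            return [(path, "shape")]
        out = []
        for i, (a, e) in enumerate(zip(actual, expected)):
            out.extend(find_mismatches(a, e, tol, f"{path}[{i}]"))
        return out
    if isinstance(expected, (int, float)) and not isinstance(expected, bool):
        if not isinstance(actual, (int, float)) or isinstance(actual, bool):
            return [(path, "type")]
        err = rel_err(float(actual), float(expected))
        # "err <= tol" is False for NaN, so NaN is reported as a mismatch.
        return [] if err <= tol else [(path, err)]
    return [] if actual == expected else [(path, "value")]


# Hypothetical metrics layout, for illustration only.
baseline = json.loads('{"RPCT": [1.0, 2.0], "pwr_tot": 3.0}')
actual = {"RPCT": [1.0 + 2.354e-10, 2.0], "pwr_tot": 3.0}
mismatches = find_mismatches(actual, baseline)
```

Here `mismatches` holds a single entry for `RPCT[0]`, because its relative error of roughly `2.354e-10` is above the tolerance while every other value matches exactly.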

The contract is: **same Fortran source + same compiler + same libm =
bit-stable output**. It does NOT promise bit-equivalent output across
different compilers or libm vendors.

## What the contract does NOT assert

- **Cross-platform bit-equivalence**. macOS Homebrew GCC 15.2.0 +
Apple libm and Ubuntu CI gfortran 13.x + glibc produce slightly
different floating-point output (single-ULP drift in transcendental
intrinsics like `exp` / `sin`). For tight iterative
solvers (FP collision operator, ray-tracing) the per-step drift
amplifies past 1e-10 within a handful of iterations.

The 4 cases that surface this on macOS today (issue #213):
`fp_dt1` (RPCT[0] rel_err 2.354e-10), `fp_iter01` (40+ RPCT
mismatches, worst 4.447e-9), `wrx_demo` (~1 scalar), `wrx_iter01`
(`pwr_tot` rel_err 1.382e-9).

- **Per-platform reproducibility on non-canonical platforms**. The
policy is "one canonical platform's baseline is the truth". On
non-canonical platforms the test does not run.
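
To illustrate how a single-ULP difference can grow past `1e-10`, here is a toy model, not the FP or ray-tracing solver. It iterates the map `x -> 2*x - 1`, which has an unstable fixed point at `x = 1` and doubles any deviation on every step. The function name `steps_until_drift` and the choice of map are assumptions made for this sketch.

```python
import sys


def steps_until_drift(tol: float = 1e-10, max_steps: int = 1000) -> int:
    """Count iterations until a 1-ULP perturbation exceeds `tol`.

    Toy map: x -> 2*x - 1. The reference trajectory sits exactly on the
    fixed point x = 1; the perturbed one starts one ULP above it.
    """
    reference = 1.0
    perturbed = 1.0 + sys.float_info.epsilon  # smallest float above 1.0
    for step in range(1, max_steps + 1):
        reference = 2.0 * reference - 1.0
        perturbed = 2.0 * perturbed - 1.0
        if abs(perturbed - reference) / abs(reference) > tol:
            return step
    return -1  # tolerance never exceeded within max_steps


steps = steps_until_drift()  # 19: 2**-33 is the first deviation above 1e-10
```

With a per-step gain of 2, the initial deviation of `2**-52` crosses `1e-10` after 19 steps. A real solver has a different and state-dependent gain, so the step count there will differ; only the mechanism carries over.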

## The canonical platform

**Ubuntu CI runner with gfortran 13.x**. The full set of
`linux-gcc13` baselines under `test_run/baselines/*/metrics.json`
was generated on clavius
(memory `reference_clavius_baseline_regen.md`) and is exercised on
every push by the whole-tree pytest step at line 323 of
`.github/workflows/python-tests.yml`, which sweeps the 7 module
`test_equivalence.py` suites.

## Non-Linux behavior

`python/<mod>/tests/test_equivalence.py` carries:

```python
IS_LINUX = sys.platform.startswith("linux")

@unittest.skipUnless(
    IS_LINUX,
    "Equivalence tests are Linux-canonical. ... See docs/baseline-policy.md.",
)
class TestEquivalence(...):
```

On macOS / FreeBSD / Windows / any other non-Linux platform, the tests skip. On WSL
or in a Linux container under Docker on macOS, `sys.platform.startswith("linux")` is True,
so they run. The policy is "Linux userland", not "physical host OS".

The skip is **NOT overridable** by env var. The policy is binary:
Linux userland or no equivalence check.
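
The following self-contained sketch shows how such a class-level skip shows up in a `unittest` result: the test is reported as skipped rather than silently dropped. `DemoEquivalence` and `run_demo` are hypothetical names for this illustration and are not part of the project's test suites.

```python
import sys
import unittest

IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(IS_LINUX, "Equivalence tests are Linux-canonical.")
class DemoEquivalence(unittest.TestCase):
    """Stand-in for a real equivalence suite."""

    def test_placeholder(self):
        self.assertTrue(True)


def run_demo() -> unittest.TestResult:
    """Run the demo suite and return the raw result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoEquivalence)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_demo()
# result.skipped lists (test, reason) pairs; it is empty on Linux.
for test, reason in result.skipped:
    print(f"SKIPPED {test.id()}: {reason}")
```

Because the decision depends only on `sys.platform`, no environment variable changes the outcome. On non-Linux platforms the skip reason appears in the test report.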

## What macOS dev gets

- The `lib<mod>api.so` libraries build and run normally via the Python wrappers
(`Tot`, `Eqlib`, `Trlib`, etc.). Other test suites
(`python/<mod>/tests/test_<mod>lib.py`, etc.) exercise the
wrappers and ARE run on macOS; they catch ABI, load, and
call-pattern issues.
- Equivalence at 1e-10 is verified by Ubuntu CI on every push. Waiting
for CI to go green before pulling the PR is the verification gate.

## How to verify equivalence locally on non-Linux

Run a Linux container with gfortran 13.x:

```bash
docker run --rm -it -v "$(pwd)":/work -w /work ubuntu:24.04 bash -c "
  set -e  # stop at the first failing step
  apt-get update && apt-get install -y gfortran python3 python3-pip
  # Ubuntu 24.04 marks the system Python as externally managed (PEP 668)
  pip3 install --break-system-packages pytest pytest-forked pytest-timeout pytest-mock
  ./scripts/setup.sh  # bpsd clone + lib*api.so build
  python3 -m pytest python/ --forked --timeout=120 --timeout-method=signal
"
```

This reproduces the CI gate locally. It is slower than running on macOS
directly, but it is the only way to verify the 1e-10 contract
off-CI.

## How to promote a new platform to canonical

If a project priority arises (e.g. macOS becomes a supported
production target, not just dev), two structural prerequisites must
be addressed first — these block the platform-keyed baseline
approach that was attempted and rejected on 2026-05-26 (Codex
2-round execution blocker analysis):

1. **Graphics-stubs gap**: 6 of 7 modules (`fp` / `ti` / `tr` /
`eq` / `wr` / `wrx`) link their standalone Fortran binaries
against real GFLIBS (`-lg3d-gfc64 -lgsp-gfc64 -lgdp-gfc64`),
which Homebrew does not provide. Only `tot` has
`tot_static_stubs.f90` for graphics-free linking.
`reference_clavius_baseline_regen.md` documents this. Phase-L-
sized work to write per-module `<mod>_static_stubs.f90` would
close the gap.
2. **Python-fixture gap**: 6-8 of 20 baseline cases lack
`<case>_params.py` Python fixtures (`eq_jt60`, `fp_jt60`,
`ti_min`, `ti_w`, `tr_m0904`, `wrx_jt60`, plus `tot_*_short`
name-mismatch resolution). They were generated by the Linux-
only standalone-binary regen workflow and are dead baselines
from the Python test surface's perspective. See [#215](https://github.com/k-yoshimi/task/issues/215)
for the gap inventory + how to close it case-by-case.
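
As a rough sketch of how the Python-fixture gap could be inventoried, the function below lists baseline cases that have a `metrics.json` but no matching `<case>_params.py`. The function name and the assumption that fixtures can be found by a recursive search under a single directory are hypothetical; the actual fixture layout is not specified here, and the `tot_*_short` name mismatch would need separate handling.

```python
from pathlib import Path
from typing import List


def cases_missing_fixtures(baselines_dir: Path, fixtures_dir: Path) -> List[str]:
    """Return baseline case names with metrics.json but no <case>_params.py."""
    missing = []
    for metrics in sorted(baselines_dir.glob("*/metrics.json")):
        case = metrics.parent.name
        # Assumed layout: fixtures live anywhere under fixtures_dir.
        if not any(fixtures_dir.rglob(f"{case}_params.py")):
            missing.append(case)
    return missing
```

A call such as `cases_missing_fixtures(Path("test_run/baselines"), Path("python"))` would then print the candidate dead baselines, assuming the fixtures live under `python/`.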

Once both gaps close, the rejected platform-keyed design
(`docs/superpowers/specs/2026-05-26-platform-keyed-baselines-design.md`
— marked SUPERSEDED) can be revisited as a follow-up.

## References

- Memory: `feedback_equivalence_must_pass.md` — distinguishes
principled platform-scoped skip (this) from invisibility skip
(forbidden).
- Memory: `reference_clavius_baseline_regen.md` — Linux canonical
baseline-gen host conventions.
- Spec: `docs/superpowers/specs/2026-05-26-linux-canonical-equiv-policy-design.md`
(this design).
- Superseded specs (kept as design history):
- `docs/superpowers/specs/2026-05-26-platform-keyed-baselines-design.md`
- `docs/superpowers/plans/2026-05-26-platform-keyed-baselines-implementation.md`
12 changes: 12 additions & 0 deletions python/eqlib/tests/test_equivalence.py
@@ -163,6 +163,18 @@ def _compare_with_baseline(actual: dict, case_name: str, tol: str = "1e-10") ->
pass


IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(
IS_LINUX,
"Equivalence tests are Linux-canonical. The 1e-10 baselines "
"live in test_run/baselines/<case>/metrics.json and were "
"generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
"non-Linux dev runs the libeqapi.so via the Python wrapper; "
"correctness is verified by Linux CI on every push. See "
"docs/baseline-policy.md.",
)
@unittest.skipUnless(
_any_so_exists(),
"libeqapi.so not built at any candidate path "
12 changes: 12 additions & 0 deletions python/fplib/tests/test_equivalence.py
@@ -165,6 +165,18 @@ def _fplib_importable() -> bool:
return True


IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(
IS_LINUX,
"Equivalence tests are Linux-canonical. The 1e-10 baselines "
"live in test_run/baselines/<case>/metrics.json and were "
"generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
"non-Linux dev runs the libfpapi.so via the Python wrapper; "
"correctness is verified by Linux CI on every push. See "
"docs/baseline-policy.md.",
)
@unittest.skipUnless(
DEFAULT_SO.exists(),
f"libfpapi.so not built at {DEFAULT_SO}; run `make -C fp libfpapi.so`",
12 changes: 12 additions & 0 deletions python/tilib/tests/test_equivalence.py
@@ -113,6 +113,18 @@ def _tilib_importable() -> bool:
return True


IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(
IS_LINUX,
"Equivalence tests are Linux-canonical. The 1e-10 baselines "
"live in test_run/baselines/<case>/metrics.json and were "
"generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
"non-Linux dev runs the libtiapi.so via the Python wrapper; "
"correctness is verified by Linux CI on every push. See "
"docs/baseline-policy.md.",
)
@unittest.skipUnless(
DEFAULT_SO.exists(),
f"libtiapi.so not built at {DEFAULT_SO}; run `make -C ti libtiapi.so`",
12 changes: 12 additions & 0 deletions python/totlib/tests/test_equivalence.py
@@ -176,6 +176,18 @@ def _compare_with_baseline(actual: dict, case_name: str, tol: str = "1e-10") ->
pass


IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(
IS_LINUX,
"Equivalence tests are Linux-canonical. The 1e-10 baselines "
"live in test_run/baselines/<case>/metrics.json and were "
"generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
"non-Linux dev runs the libtotapi.so via the Python wrapper; "
"correctness is verified by Linux CI on every push. See "
"docs/baseline-policy.md.",
)
@unittest.skipUnless(
_any_so_exists(),
"libtotapi.so not built at any candidate path "
12 changes: 12 additions & 0 deletions python/trlib/tests/test_equivalence.py
@@ -149,6 +149,18 @@ def _trlib_importable() -> bool:
return True


IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(
IS_LINUX,
"Equivalence tests are Linux-canonical. The 1e-10 baselines "
"live in test_run/baselines/<case>/metrics.json and were "
"generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
"non-Linux dev runs the libtrapi.so via the Python wrapper; "
"correctness is verified by Linux CI on every push. See "
"docs/baseline-policy.md.",
)
@unittest.skipUnless(
DEFAULT_SO.exists(),
f"libtrapi.so not built at {DEFAULT_SO}; run `make -C tr libtrapi.so`",
12 changes: 12 additions & 0 deletions python/wrlib/tests/test_equivalence.py
@@ -232,6 +232,18 @@ def _wrlib_importable() -> bool:
return True


IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(
IS_LINUX,
"Equivalence tests are Linux-canonical. The 1e-10 baselines "
"live in test_run/baselines/<case>/metrics.json and were "
"generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
"non-Linux dev runs the libwrapi.so via the Python wrapper; "
"correctness is verified by Linux CI on every push. See "
"docs/baseline-policy.md.",
)
@unittest.skipUnless(
DEFAULT_SO.exists(),
f"libwrapi.so not built at {DEFAULT_SO}; run `make -C wr libwrapi.so`",
12 changes: 12 additions & 0 deletions python/wrxlib/tests/test_equivalence.py
@@ -127,6 +127,18 @@ def _compare_with_baseline(actual: dict, case_name: str, tol: str = "1e-10") ->
pass


IS_LINUX = sys.platform.startswith("linux")


@unittest.skipUnless(
IS_LINUX,
"Equivalence tests are Linux-canonical. The 1e-10 baselines "
"live in test_run/baselines/<case>/metrics.json and were "
"generated on Linux gfortran 13.x (Ubuntu CI runner). macOS / "
"non-Linux dev runs the libwrxapi.so via the Python wrapper; "
"correctness is verified by Linux CI on every push. See "
"docs/baseline-policy.md.",
)
@unittest.skipUnless(
_any_so_exists(),
"libwrxapi.so not built at any candidate path "