[CI] Add slow marker and remove un-necessary tests#711
Conversation
Add a `slow` pytest marker, mark the worst-case tests with it, and have
`tests/run_tests.py` skip those tests by default (use `--run-slow` to include
them, or `pytest -m slow` to run only those).
Picked from macOS CI per-file timing (QD_FILE_TIMING=1, run 26083950810):
phase 1 totals 6415s across 8641 test calls; the slowest 3 files alone
(test_eig, test_tile16, test_linalg) cover 55%. The cost of test_eig /
test_make_spd is super-linear in matrix size n (n=12 ≈ 5x n=9).
Marked slow:
- Parametrize cases n in {6, 9, 12} (and 7..11 for inverse_large) across
test_eig.py and test_linalg.py.
- Rectangular (9, 12) / (12, 3) cases in test_frobenius_inner_rectangular.
- test_matmul_chain_qipc_sizes_{f32,f64} (>130s each on macOS CI).
- test_clear_all_gradients (180s/invocation).
- test_reset_ndarrays::test_ndarray_doesnt_crash_on_gc (127s).
- test_mpm88::{test_mpm88, test_mpm88_numpy_and_ndarray} (~30s/invocation).
- test_struct::test_2d_nested (122s/invocation).
run_tests.py composes `not slow` with any user-supplied `-m` expression, so
existing CI invocations like `-m "not needs_torch"` become
`(not needs_torch) and not slow`. Note that this also drops slow tests from
GPU / Linux / macOS CI runs — a separate workflow (or `--run-slow` job) is
needed if we still want to exercise the n>=6 / n=12 paths in CI.
The previous lists ([4, 5, 6, 9, 12], [2, 3, 4, 6, 9, 12], [5..12], etc.) gave
the Householder/QR path a lot of redundant size coverage. For routine CI we
only need to exercise a small size + the largest supported size (12, which
also doubles as the slow-marked stress case): if a bug shows up only at
n=7 or n=11 it almost certainly also shows up at n=12.
test_eig.py
sym_eig_general_{f32,f64} [4,5,6,9,12] -> [4, 12*]
make_spd_{f32,f64} [4,6,9,12] -> [4, 12*]
sym_eig_alpha_identity_f64 [4,6,9,12] -> [4, 12*]
make_spd_idempotent_f64 [4,6,9,12] -> [4, 12*]
make_spd_negative_definite_zero_f64 [4,6,9,12] -> [4, 12*]
sym_eig_sort_order_{f32,f64} [2,3,4,6,9,12] -> [3, 12*]
test_linalg.py
frobenius_inner_{f32,f64} [2,3,6,9,12] -> [3, 12*]
inverse_large_{f32,f64} [5..12] -> [5, 12*]
* n=12 retains the `slow` marker, so default `run_tests.py` invocations only
hit n=4 / n=3 / n=5. `--run-slow` runs both.
Closed-form 2x2/3x3 paths in test_sym_eig_sort_order: dropped n=2 in favour
of n=3 (per directive); the 2x2 path is still covered by
test_sym_eig2x2_{f32,f64}. The 3x3 closed-form path stays covered by n=3.
Other parametrize lists left untouched:
- rectangular (rows, cols) tuples in test_frobenius_inner_rectangular (it's
varying shape, not pure size).
- test_mat_inverse_size's `range(1, 5)` (tiny sizes only).
- `a00` integer parametrize in test_sym_eig3x3_{f32,f64}.
The blocked-Cholesky demo previously hard-coded N=92, N_ENVS=4096,
WARMUP=50, ITERS=200 as module globals. The unit-test wrapper
test_tile16_cholesky_blocked_demo runs the demo as a subprocess and
only cares that it returns 0; at the hard-coded sizes that takes ~74 s
on cluster CUDA, dominated by JIT-compiling 3 large unrolled kernels
at N=92 and running the 4096-env x 250-iter benchmark loop.
Expose all four knobs as command-line flags with the previous values as
defaults, so:
python misc/demos/cholesky_blocked.py # unchanged, full demo
python misc/demos/cholesky_blocked.py --n 32 --n-envs 64 \
--num-warmup 1 --num-iters 1 # smoke-mode
The test will switch to the smoke-mode invocation in a follow-up
commit so it stops dominating the slow critical path.
Flag names (--n, --n-envs, --num-warmup, --num-iters) follow the user
spec; using argparse + ArgumentDefaultsHelpFormatter so --help shows
the full demo defaults.
Pass small CLI overrides (--n 32 --n-envs 64 --num-warmup 1 --num-iters 1) so the demo runs end-to-end in seconds instead of ~74 s. The test contract is just "demo exits 0"; it doesn't read any of the benchmark numbers, so the smaller workload still satisfies the smoke test. The full N=92 / N_ENVS=4096 / 50+200-iter demo is still what humans running misc/demos/cholesky_blocked.py see by default (argparse defaults match the previous hard-coded values). Together with the previous commit, this drops the test_tile16_cholesky_blocked_demo wall time on cluster CUDA from ~74 s to (expected) a few seconds, removing the largest remaining single-test outlier on hp/mark-slow-tests.
Previously the test hard-coded the qipc IPC sizes (9x12) · (12x12) · (12x9). On cluster CUDA those two cases (f32 + f64) take ~92.7s and ~87.3s respectively -- the top two single-test outliers in the suite, each holding one xdist worker for ~90s of contiguous JIT-compile + unrolled-FMA work. Parametrize `_test_matmul_chain` on (rows_a, cols_a, cols_b, cols_c). Default lane runs the small (3,4,4,3) chain to exercise the same Matrix.__matmul__ codegen path; the original (9,12,12,9) qipc-sized chain is slow-marked so it still runs on --run-slow (i.e. CI's nightly / release lane, once that's wired up). Estimated saving: ~180s CPU, ~70s wall (these tests were on the critical path of the branch run). No function-level coverage lost: both f32 and f64 versions still run the same chain by default, just at a smaller size.
Previously hard-coded N=30 (900 particles), n_grid=120, steps=32 -- 26s on cluster CUDA. The test's actual contract is that the AD-validation checker raises QuadrantsAssertionError on the global-data-access violation in g2p (`v[f, p] = new_v`), which fires on the first substep regardless of grid / particle / step counts. Parametrize on (particles_side, n_grid_size, num_steps) with a small default (8, 32, 4) and slow-marked original (30, 120, 32). The default still exercises the same diff-MPM pipeline (p2g / grid_op / g2p, qd.ad.Tape with validation=True, `with pytest.raises(...)`) and still triggers the assertion error. Estimated CPU saving: ~22s; wall saving ~3s on the branch run.
…ne op-parametrized test
The three reduce variants (and the three scan variants) shared an identical
kernel signature, identical input shape, and differed only in (a) which
qd.algorithms.device_<op> function they called and (b) overflow vs
bitwise-exact verification. Collapse each triple into a single op-parametrized
test:
test_device_reduce(op, dtype, N) # op in {add, min, max}
test_device_exclusive_scan(op, dtype, N) # op in {add, min, max}
Behavior, coverage and the parametrize space are unchanged -- pytest still
collects the same number of parametrize cases, just under unified test names.
This is purely a code-dedup refactor (~130 LOC less) which makes the next
op-axis sampling change (if/when we choose to drop A vs B vs C from the
sweep) a one-line edit.
…run_tests help string Pure formatting fix from `pre-commit run -a`; no behavior change.
…ppers into 2 op-parametrized tests
Lines 3608-3694 in test_simt.py were 18 ~5-line wrappers each calling
``_check_full_matches_tiled(subgroup.<op>, subgroup.<op>_tiled, ...)``.
Lines 3841-3848 were 2 more, parametrized on dtype. ``_check_full_matches_tiled``
already accepts the full / tiled functions as Python arguments (closure-captured
into ``@qd.kernel``), so collapsing the family is a pure dedup move:
test_subgroup_full_matches_tiled(op_name, host_init)
# 18 cases: {reduce, inclusive, exclusive}_{add,min,max,mul,and,or,xor} on qd.i32
test_subgroup_full_matches_tiled_float(op_name, dtype)
# 4 cases: {reduce_add, inclusive_add} x {qd.f32, qd.f64}
Behavior + coverage unchanged (still 22 parametrize cases, same dtype + init
configurations). Pytest ids are designed to match the original test-name
suffixes (e.g. ``[reduce_add]``, ``[inclusive_mul]``) so ``-k`` selectors and
test reports stay readable. Drops ~50 LOC net.
…zed tests The six block-reduce tests (3 single-output + 3 broadcast) share an identical kernel skeleton, parametrize axes, and verification loop. They only differ in which `block.reduce_*` function is called (closure-captured into `@qd.kernel` via getattr), the host-side reference oracle, the init pattern (sequential for `add` so the running sum has signal; permuted hash for `min` / `max` so the result depends on lanes other than first / last), and the float tolerance regime (relative for accumulating `add`, absolute for picker `min` / `max`). Collapse the six tests into two op-parametrized tests: test_block_reduce(sg_per_block, dtype, op_name, ...) # single-output, 3 ops test_block_reduce_all(sg_per_block, dtype, op_name, ...) # broadcast, 3 ops Parametrize space is unchanged (3 sg x 5 dtype x 3 op = 45 cases per fused test, matching the original 3 tests x 15 cases each). Pytest ids use plain `[add|min|max]` suffixes so `-k` selectors remain readable. Drops ~100 LOC of boilerplate -- two new small helpers (`_init_block_reduce_src` and `_assert_block_reduce_close`) capture the per-op behavioral differences in one place each.
…zed test The three block inclusive scan tests share the same kernel skeleton and only differ in the closure-captured `block.inclusive_<op>` function, the host-side reference oracle, the init pattern (sequential for `add` -- sums grow with prefix length; permuted for `min` / `max` -- result depends on lanes other than first / last), and the float tolerance regime (relative for `add`, absolute for `min` / `max`). Collapse into one op-parametrized test: test_block_inclusive(sg_per_block, dtype, op_name, ...) Identical param count to the original three tests (3 sg x 5 dtype x 3 op = 45 cases vs original 3 x 15). Pulls a shared `_assert_block_scan_close` helper out so the int / relative-float / absolute-float regime is encoded in one place; the relative-float branch keeps the floor-on-tol-base trick needed by the original `test_block_exclusive_add` (also routed through the same helper). `test_block_exclusive_add` stays as its own function for now because the matching exclusive `min` / `max` cases need dtype-derived sentinel identities + ``isinf`` handling that's different enough that fusing them in would create more branches than it removes; can address that in a follow-up if needed.
…trized test `test_block_exclusive_min` and `test_block_exclusive_max` share the same permuted-init pattern and only differ in the dtype-derived sentinel identity (``+inf`` / ``iinfo.max`` for min, ``-inf`` / ``iinfo.min`` for max) and the inf-sign check at lane 0. Collapse into one op-parametrized test that takes ``(op_name, sentinel_fn, py_op, inf_sign)`` and dispatches via getattr + the (already module-level) `_PY_MIN` / `_PY_MAX` lambdas. Identical param count to the original pair (3 sg x 5 dtype x 2 op = 30 cases vs original 2 x 15 each = 30). `test_block_exclusive_add` remains its own function because the integer identity is `0` (not `iinfo.max/min`) and the init pattern is sequential -- different enough that fusing it in would add more branches than it removes. Drops ~30 LOC.
The PR's `Check line wrapping` CI agent flagged three comments wrapped at
the AI-default ~78-90c instead of the project's 120c target. Reflow each
to the full target width:
- tests/python/test_tile16.py:1791 (78c -> 120c) docstring for
test_tile16_cholesky_blocked_demo.
- tests/python/test_ad_gdar_diffmpm.py:8 (85c -> 120c) the
"defaults shrink ..." comment above the parametrize block.
- tests/run_tests.py:60 (90c -> 120c) the "--run-slow opts back in"
comment.
Also collapse the dangling-backslash continuation in
misc/demos/cholesky_blocked.py's Usage example onto one line (69c -> 109c).
No behavior change; comments only. Verified via the cursor
find-underwrapped skill that the remaining flagged runs in my diff are
all 103-116c with save~=0 (already-tight runs the greedy heuristic still
reports), comfortably in the agent's "not borderline" exemption.
Hugh requested in PR #709 review comment that the testing bullet collapse to just a pointer at unit_testing.md, since the long inline summary duplicates the dedicated doc immediately below.
Documents the test launcher, the @pytest.mark.slow marker (whole-test and parametrize-case variants), how to write a new parametrized test with the test_utils.test decorator, and the Advanced section with the per-test timeout, kernel compilation cache, and per-file timing knobs. Modeled on the structure of the equivalent doc on hp/mark-slow-tests (after Hugh's two rounds of PR review feedback there) but with all @pytest.mark.sample references stripped, since the @sample marker is not part of this branch.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a85c6ecbcc
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| sm70: Can only run on GPU with compute capability 7.0 or higher. | ||
| needs_torch: mark test as requiring PyTorch. | ||
| slow: mark test (or parametrize case) as slow. Skipped by default by tests/run_tests.py; | ||
| pass --run-slow to include them, or directly `pytest -m slow` to run only the slow ones. |
There was a problem hiding this comment.
Keep slow marker description in one ini entry
The wrapped description line is parsed as a second marker entry, so pytest now registers an unintended marker named pass --run-slow to include them, or directly \pytest -m slow` to run only the slow ones.instead of treating it as part ofslow's description. You can see this in pytest --markers, and it can confuse marker tooling (and future strict marker validation) because tests/pytest.inino longer defines markers as onename: description` per line.
Useful? React with 👍 / 👎.
|
ok to merge |
pytest registers the `markers` ini option as a `linelist` type, so
config.getini("markers") splits the value on newlines and every line
becomes a separate registered marker. configparser preserves
indented-continuation lines as literal newlines inside the value, so
the previous two-line entry
slow: mark test (or parametrize case) as slow. Skipped by default by tests/run_tests.py;
pass --run-slow to include them, or directly `pytest -m slow` to run only the slow ones.
was being parsed as TWO markers: `slow: ...` and a phantom marker
whose "name" is the literal text `pass --run-slow to include them, ...`.
The phantom marker shows up in `pytest --markers` output and would
either pollute or break future `--strict-markers` validation.
Collapse to a single line; trim wording slightly to keep it readable.
Verified via configparser that the markers list now has 4 entries (was 5).
Reported by chatgpt-codex on PR #711.
pytest registers the `markers` ini option as a `linelist` type, so
config.getini("markers") splits the value on newlines and every line
becomes a separate registered marker. configparser preserves
indented-continuation lines as literal newlines inside the value, so
the two multi-line entries here were being parsed as 9 markers (4 real
+ 5 phantom whose "names" were the continuation lines, e.g. a marker
named `pass --run-slow to include them, ...` and another named
`--no-sample, nodeid-paste).`).
Phantom markers pollute `pytest --markers` output and would either
mask or break future `--strict-markers` validation, since strict mode
would accept the literal text `pass` or `--no-sample,` as valid marker
names on any test.
Collapse both descriptions to single lines (with light wording trim to
keep the source readable). Verified via configparser that the markers
list now has exactly 5 real entries (was 9).
Reported by chatgpt-codex on the sibling PR #711.
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough