fix(dynacell/eval): unblock dynacell-configs CI (lazy cubic import) + cross-condition probe hardening#447
Merged
alxndrkalinin merged 3 commits intoJun 2, 2026
Conversation
…airs _probe_pair re-read each NPZ from disk on every call, so the shared mock reference side was loaded once per pair: run_for_group read the 8 mock files (4 features x 2 sources) 16 times, and run re-read each condition once per pair. Thread an optional per-call dict cache keyed on the NPZ path through _probe_pair -> _load_embeddings; run / run_for_group create one and pass it so each file is read once per group. Measured on a synthetic 3-condition group: run_for_group drops from 32 to 24 np.load reads (~26 MB of redundant mock NPZ I/O per group avoided). The cache is local to each call and released on return, so memory stays bounded to one group's embeddings rather than accumulating across groups (which a module-level lru_cache would do). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
run_for_group built its condition->dir map with a last-wins assignment, so two dirs mapping to the same condition (e.g. two *_mock dirs) silently overwrote each other and produced a probe against an arbitrary dir with no signal. run() already raises ValueError on duplicates; mirror that here so an ambiguous group surfaces as an error (the grouped pipeline catches it and safely skips the probe). Unrecognized-token dirs are still skipped, as documented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
instance_metrics imported cubic.metrics.average_precision at module top level, and pipeline_cache imports DEFAULT_IOU_THRESHOLDS from it — so the whole pipeline import chain pulled the GPU-only cubic stack. The minimal `test-dynacell-configs` CI job (uv sync --group test, no eval extra) then failed to import dynacell.evaluation.pipeline with `ModuleNotFoundError: No module named 'cubic'`, taking down every grouped / parallel / parallel_cpu eval test at collection. Move the cubic import into instance_average_precision (its only use), so importing the module for the threshold constant no longer needs cubic. The AP computation still hard-requires cubic and fails loudly there. Mirrors the lazy torch_fidelity imports in feature_metrics. Verified: dynacell.evaluation.pipeline now imports with cubic unavailable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
alxndrkalinin
added a commit
that referenced
this pull request
Jun 3, 2026
* ci: drop Python 3.11 from matrix and trigger on modular-viscy-staging All workspace packages require python>=3.12 but the CI matrix included 3.11, causing uv sync to fail with 'incompatible with the project's Python requirement: >=3.12' and fail-fast cancelling all other matrix jobs. Also extend push/pull_request triggers to modular-viscy-staging so PRs targeting the 0.5 staging branch run lint/test. * fix(deps): declare viscy-transforms test dep and hydra-core for dynacell - packages/viscy-data tests import viscy_transforms (test_combined.py, test_hcs.py, test_hcs_weighted_crop.py) but it was not declared in the test group, causing ModuleNotFoundError in Test Data Extras CI. - applications/dynacell/src/dynacell/evaluation/pipeline.py:9 imports hydra unconditionally. hydra-core was only present in the 'eval' and 'report' optional extras, so 'uv sync --group test' (used by Test dynacell benchmark configs CI) failed at collection with ModuleNotFoundError: No module named 'hydra'. Promote hydra-core to a base dependency since it's needed at import time. * fix(dynaclr): rename test_2d_mip_augmentation.py to demo_; scope pytest to tests/ The file is a Jupyter-style demo script (# %% cells) with a hardcoded absolute home path /home/eduardo.hirata/... It is meant to be invoked via 'uv run python ...', not collected by pytest, but its test_ prefix caused 'Test (dynaclr)' CI to fail with FileNotFoundError on the missing parquet at collection time. - Rename to demo_2d_mip_augmentation.py so pytest does not collect it. - Add testpaths=['tests'] under [tool.pytest.ini_options] in dynaclr's pyproject.toml so future demo scripts under scripts/ are not picked up when CI invokes pytest from applications/dynaclr/. * style: satisfy pre-commit (ruff format, EOF, D100/D102, E741, NPY002) Auto-fixes from prek/pre-commit hooks: - ruff format: distance.py, lca.py, normalize.py, precompute.py, conftest.py (5 files reformatted, no semantic change) - end-of-file-fixer: append trailing newline to .envrc Manual fixes: - E741: rename ambiguous 'O' to 'Ov' (overlap) in celldiff_wrapper.py sliding-window patcher. - D100: add module docstrings to plot_dim_reduct.py, tau_sampling.py, channel_dropout.py. - D102: add forward() docstring in ChannelDropout. - NPY002: replace np.random.randn() with np.random.default_rng() in test_mp_utils.py. The renamed demo_2d_mip_augmentation.py is in this commit because ruff format touched its imports as part of the same auto-pass. * fix(dynaclr): unpack tuple from ExperimentRegistry.from_cell_index in e2e test from_cell_index returns (ExperimentRegistry, DataFrame); the fixture assigned the tuple to a bare `registry` and passed it to MultiExperimentIndex, which raised `AttributeError: 'tuple' object has no attribute 'experiments'` at setup of every TestFlatParquetRegistry / TestChannelModes / TestSamplerGroupings test. Unpack and discard the DataFrame (the fixture builds its own). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell): point predict-leaf test at split celldiff inference configs celldiff's predict__ipsc_confocal.yml was split into per-method variants (__denoise, __iterative, __sliding_window), but test_predict_leaf_composes still expected the single file for every model in PREDICT_LEAVES, raising FileNotFoundError for all four celldiff organelles. Add PREDICT_LEAF_FILES parametrizing celldiff's three variants plus unetvit3d's single file, and switch the predict-composition test to it. PREDICT_LEAVES is retained for the eval symlink test, whose eval__ipsc_confocal.yaml is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * style: ruff format PREDICT_LEAF_FILES comprehension Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(viscy-data): pin iohub codec pipeline to zarr-python, not zarrs iohub auto-enables the Rust `zarrs` codec pipeline whenever `zarrs` is importable (it ships as a transitive dep of iohub 0.3.x). `zarrs` 0.2.3 builds a rank-6 read offset for rank-5 sharded zarr-v3 arrays, raising `RuntimeError: incompatible offset [0,0,0,0,0,0] for region with start [0,0,0,0,0]` on every sharded v3 read, and deadlocks spawned DDP ranks. iohub re-applies its codec choice on every store open, so a one-time `zarr.config.set(...)` is clobbered by the next `open_ome_zarr`. Override iohub's ZarrConfig field default instead, applied on import of viscy_data (the lowest zarr-reading package; everything else depends on it) so the native `BatchedCodecPipeline` is in place before any store is opened. Fixes the zarr_v3 failures in test-data / test-data-extras and the test_combined_ddp hang. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell): lazy-import torch_fidelity in feature_metrics torch_fidelity is in dynacell's [eval] extra, but feature_metrics.py imported its metric helpers at module top. The minimal CI test job (test group only, no eval extras) imports the eval pipeline via test_evaluation_grouped.py, so collection failed with `ModuleNotFoundError: No module named 'torch_fidelity'` at pipeline.py:18 -> feature_metrics.py:18. Move the four torch_fidelity imports into the _fid/_kid/_bootstrap_prc/ _mind functions that use them. The cache-only grouped parity test short-circuits before any metric runs, so the eval pipeline now imports without torch_fidelity, matching the CI job's documented minimal-env intent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynaclr): repoint class-path-resolution tests at demo configs configs/training/fit.yml and multi_experiment_fit.yml were removed in the config reorg (ecf7c08 "refactor training configs into composable _base/arch/data layers"), but two tests still asserted they exist and that their class_paths resolve, failing with "Config file not found". Point them at the current full demo configs: demo/demo_2d_fit.yml and demo/demo_bag_of_channels_v3_fit.yml. The predict.yml parametrization is unchanged (it still exists). All resolved class_paths import cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell): align 4gpu-constraint test with hardware_4gpu.yml (h100|h200) test_4gpu_train_leaves_inherit_a100_exclude asserted constraint 'h100|h200|a40|a6000|l40s', but hardware_4gpu.yml sets 'h100|h200' and its prose explains why: 4-GPU FCMAE/UNeXt2 train at large patches where a single DDP rank needs 30-50 GB, so the 48 GB A40/A6000/L40S cards have no headroom. The config is authoritative; the test constant was inconsistent (both landed together in #404). Update the expected constraint and the test's rationale comment to match. The file's override-hint (opt in to smaller cards) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(viscy-data): make mmap-sharing test child picklable for spawn test_mmap_preload_multi_process_sharing defined its child target as a nested closure, which fails under the spawn start method (Windows/macOS default) with `AttributeError: Can't get local object`. fork (Linux) inherits the function so it passed there. Move the child to module level as _mmap_sharing_child and pass the data path explicitly instead of closing over the fixture. Verified passing under a forced spawn context. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell): reconcile benchmark-config tests with split/redesigned leaves The config layout was refactored but test_benchmark_config_composition.py still asserted the old layout, failing once collection-blocking imports were fixed. Reconcile the tests against the current configs: - test_a549_predict_leaf_composes: predict__a549_mantis.yml was split per treatment condition (_mock/_denv/_zikv). Generate expectations over gene-info x model x condition; assert the condition-specific data_path, dataset_ref, and experiment_id each composed leaf actually produces. - test_migrated_target_predict_resolves_to_test_store: celldiff's predict__ipsc_confocal.yml was split per inference method; point celldiff at the sliding_window representative (all three resolve the same store). - test_joint_train_leaf_composes / _smoke_: the joint train leaf was redesigned to single-GPU H200 (strategy auto, devices 1, bare-string channels) and the a549 child path moved to mantis_v1/train/SEC61B_all.zarr. Update assertions to match; topology still satisfies the SLURM invariant. Configs unchanged. Full file: 165 passed, 52 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(viscy-data): use fork context + Windows skip for mmap-sharing test Making the child target module-level was necessary but not sufficient: under the spawn start method (Windows default) the child re-imports the test module, which pytest loaded via --import-mode=importlib, failing with ModuleNotFoundError: No module named 'packages'. Use an explicit fork context (child inherits the loaded module, no re-import) and skip on Windows where fork is unavailable -- the same pattern test_combined_ddp.py already uses for this exact issue. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(dynaclr): import NTXentLoss from viscy_models Engine's isinstance check targets viscy_models' NTXentLoss subclass, so importing pml's base class fell through to the triplet branch and passed a 2-D projection as the 1-D label arg ("labels must be a 1D tensor"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(dynaclr): use bag-of-channels for heterogeneous experiments All-channels mode is undefined when experiments carry different channel sets: norm_meta collapses to one channel per cell after dedup, and _scatter_channels zips the marker union positionally against a single experiment's patch. Bag-of-channels (channels_per_sample=1) is the supported path — one channel per sample keyed channel_0. - _C 2 -> 1, channel_dropout_channels [1] -> [0] - drop bogus collection_path kwarg from the parquet case Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(dynaclr): repoint multi-experiment parquet test at cell_index_path entry point * deps: regenerate uv.lock after green-up merge * fix(dynacell): green up dynacell-configs CI (run_for_group import + a549 any-GPU test) (#446) * feat(dynacell/eval): add run_for_group cross-condition probe entry point pipeline.py imports run_for_group from cross_condition_probe to write a per-condition mock-vs-infected probe CSV into each condition's eval dir, but the function was never committed. As a result dynacell.evaluation.pipeline was unimportable in a clean checkout (ImportError: cannot import name 'run_for_group'), so the grouped and parallel eval tests failed at collection in CI. Add run_for_group (and the _write_rows helper) matching the pipeline call site: it runs the existing pairwise probe per infected condition against mock and writes one CSV per eval dir, colocated with the dir the reporting layer already resolves per (model, pool, organelle, condition). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(dynacell): unpin predict from H200 via any-GPU launcher profile Predict jobs were pinned to H200 by every leaf composing hardware_h200_single.yml. Predict at FP32 fits in ~7 GB and is compute-bound (measured 6.6 GB / 100% SM / 15-20% bandwidth on celldiff a549-trained membrane mock, H200), so Hopper HBM/bandwidth is unused — A40 / A6000 / L40S / L4 all run it and drain the queue faster. Add hardware_predict_any_gpu.yml (same resource shape as hardware_h200_single, constraint=null) and swap all 233 tracked predict leaves across celldiff, fcmae_vscyto3d_pretrained, fcmae_vscyto3d_scratch, fnet3d_paper, pix2pix3d_unetvit, and unetvit3d to use it. Train leaves stay on hardware_h200_single / hardware_4gpu (memory + bandwidth profile differs). README updated. test_a549_predict_leaf_composes now asserts sbatch.constraint is None (any GPU) instead of "h200". This is the config-side fix for the 24 a549 composition failures #435 left open by re-pinning predicts to H200. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(dynacell/eval): tidy run_for_group (reuse _DEFAULT_PAIRS, drop dead param) Post-review cleanups, no behavior change: - Iterate _DEFAULT_PAIRS instead of hardcoding ("denv","zikv") + building ("mock", cond) by hand — keeps the mock-vs-infected pair set in one place, so a future infected condition is picked up automatically. - Drop the unused filename parameter (no caller passes it) and write to GROUP_PROBE_FILENAME directly. - Drop the needless list() around the _FIELDNAMES tuple in _write_rows (csv.DictWriter accepts any sequence). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(dynacell): assert predict constraint key present-and-null test_a549_predict_leaf_composes used .get("constraint") is None, which also passes if the hardware profile is dropped from the leaf's base chain (missing key -> None). Subscript the key so a dropped any-GPU profile fails loudly instead of silently — the any-GPU profile composes an explicit constraint=null, so the key is always present when applied. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell/eval): unblock dynacell-configs CI (lazy cubic import) + cross-condition probe hardening (#447) * perf(dynacell/eval): memoize embedding loads across cross-condition pairs _probe_pair re-read each NPZ from disk on every call, so the shared mock reference side was loaded once per pair: run_for_group read the 8 mock files (4 features x 2 sources) 16 times, and run re-read each condition once per pair. Thread an optional per-call dict cache keyed on the NPZ path through _probe_pair -> _load_embeddings; run / run_for_group create one and pass it so each file is read once per group. Measured on a synthetic 3-condition group: run_for_group drops from 32 to 24 np.load reads (~26 MB of redundant mock NPZ I/O per group avoided). The cache is local to each call and released on return, so memory stays bounded to one group's embeddings rather than accumulating across groups (which a module-level lru_cache would do). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell/eval): raise on duplicate condition in run_for_group run_for_group built its condition->dir map with a last-wins assignment, so two dirs mapping to the same condition (e.g. two *_mock dirs) silently overwrote each other and produced a probe against an arbitrary dir with no signal. run() already raises ValueError on duplicates; mirror that here so an ambiguous group surfaces as an error (the grouped pipeline catches it and safely skips the probe). Unrecognized-token dirs are still skipped, as documented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell/eval): lazily import cubic in instance_metrics instance_metrics imported cubic.metrics.average_precision at module top level, and pipeline_cache imports DEFAULT_IOU_THRESHOLDS from it — so the whole pipeline import chain pulled the GPU-only cubic stack. The minimal `test-dynacell-configs` CI job (uv sync --group test, no eval extra) then failed to import dynacell.evaluation.pipeline with `ModuleNotFoundError: No module named 'cubic'`, taking down every grouped / parallel / parallel_cpu eval test at collection. Move the cubic import into instance_average_precision (its only use), so importing the module for the threshold constant no longer needs cubic. The AP computation still hard-requires cubic and fails loudly there. Mirrors the lazy torch_fidelity imports in feature_metrics. Verified: dynacell.evaluation.pipeline now imports with cubic unavailable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(ci): install dynacell eval extra instead of hiding cubic imports The test-dynacell-configs job installed only base deps + the test group, so the eval-pipeline tests (which import dynacell.evaluation.pipeline) failed with `ModuleNotFoundError: No module named 'cubic'` once the import chain reached instance_metrics / segmentation. cubic is a *hard* dependency of the eval pipeline (evaluation/CLAUDE.md: do not hide its imports), and it is CPU-capable — it falls back to numpy/scikit-image without CUDA — so the right fix is to install it, not to lazily defer its import. - CI: `uv sync --frozen --extra eval --group test` (CPU; eval_gpu cupy/cucim stays out — those are the only CUDA-only deps). - Revert the lazy `from cubic.metrics import average_precision` in instance_metrics back to a module-level import (restores the convention; the deferral added earlier is no longer needed now that cubic is present). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(dynacell/eval): restore module-level torch_fidelity imports torch-fidelity is a hard dependency of the eval extra (and the test-dynacell-configs CI job now installs that extra), so there is no reason to defer its imports into _fid/_kid/_mind/_bootstrap_prc. Move fid/kid/mind/prc helper imports back to the module top, matching the convention that eval deps are imported eagerly and fail loud if absent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell/eval): don't hard-require cupy/cucim; use cubic CPU fallback _require_cubic (in metrics.py and io.py) raised "<pkg> is required for GPU-backed metrics … uv sync --extra eval_gpu" whenever cupy/cucim were absent — even on the CPU path. That defeated cubic's own numpy/scikit-image fallback: compute_pixel_metrics and _cp_raw_regionprops already gate the GPU upload on `torch.cuda.is_available()` and use asnumpy/numpy otherwise, where cubic dispatches to CPU. The eager gate fired before that code ran, so a GPU-less runner with the eval extra (but no eval_gpu) — e.g. GitHub CI — failed instead of falling back. Drop the cupy/cucim requirement from both _require_cubic; keep the cubic presence check. On an actual GPU run without cupy, `ascupy` still raises a clear "GPU requested but not available". Verified: the eval-pipeline + grouped tests pass with the GPU hidden (CUDA_VISIBLE_DEVICES="") across the serial and spawn-process paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(dynacell/eval): resolve NPZ path for the embedding cache key _load_embeddings keyed its memoization cache on the unresolved eval_dir/embeddings/<f>.npz path while the docstring promised the *resolved* path — a doc/code mismatch that could also defeat caching when the same file is reached via different relative paths or symlinks. Resolve the path (strict=False, so a missing file still surfaces as np.load's FileNotFoundError) before the cache lookup and load. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * buump iohub 0.3.6 --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Sricharan Reddy Varra <sricharan.varra@biohub.org> Co-authored-by: Alexandr Kalinin <1107762+alxndrkalinin@users.noreply.github.com> Co-authored-by: Alexandr Kalinin <alxndrkalinin@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Follow-up to #446 (which merged the run_for_group import + a549 any-GPU test). Three independent
cross_condition_probe/ instance-metrics improvements, the first of which unblocks thetest-dynacell-configsCI job.Commits
fix: lazily import cubic ininstance_metrics—instance_metricsimportedcubic.metrics.average_precisionat module top level, andpipeline_cacheimportsDEFAULT_IOU_THRESHOLDSfrom it, so the wholepipelineimport chain pulled the GPU-only cubic stack. The minimaltest-dynacell-configsjob (uv sync --group test, noevalextra) then died importingdynacell.evaluation.pipelinewithModuleNotFoundError: No module named 'cubic', taking down every grouped / parallel / parallel_cpu eval test at collection. Move the import intoinstance_average_precision(its only use); the AP computation still hard-requires cubic and fails loudly there. Verifiedpipelineimports with cubic unavailable.fix: raise on duplicate condition inrun_for_group(Copilot review on fix(dynacell): green up dynacell-configs CI (run_for_group import + a549 any-GPU test) #446) — the condition→dir map used last-wins, so two dirs mapping to the same condition (e.g. two*_mock) silently overwrote each other and probed an arbitrary dir.run()already raisesValueError; mirror it. Unrecognized-token dirs are still skipped.perf: cache cross-condition embedding loads —_probe_pairre-read each NPZ per call, so the shared mock reference was loaded once per pair (run_for_group: 16× for 8 files). Thread a per-call dict cache keyed on the NPZ path; each file is read once per group (32 → 24 reads, ~26 MB redundant I/O avoided/group), released on return (no process-lifetime accumulation).Verification
pipeline/pipeline_cacheimport cleanly with cubic blocked (simulating the minimal CI env).instance_metrics_test,test_evaluation_grouped, bothtest_evaluation_pipeline_parallel*, andtest_cross_condition_probepass with cubic present; a load-count check confirms the 32→24 read reduction and the duplicate-conditionValueError.🤖 Generated with Claude Code