…on4py into ci_for_standalone_driver
…s.py Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com>
Merge main:
- refactor: prefix VelocityAdvection internal attributes with underscore (#1180)
- fortran bindings: split py2fgen and bindings (#1185)
- Rename `single_node_default` to `single_node_exchange` (#1176)
- remove the duplication of fixtures (#1186)
- Update GT4Py to v1.1.9 (#1187)
- cleanup 3 unused vars from solve_nh_init (#1169)
```toml
[tool.uv.sources]
dace = {index = "gridtools"}
gt4py = {git = "https://github.com/GridTools/gt4py", branch = "workaround_caching_issue_with_embedded_inverse_image_caching"}
```
@msimberg is this still needed or was it merged into gt4py main?
Pull request overview
This PR adds MPI/distributed execution support to the icon4py standalone driver, extending decomposition/exchange/reduction plumbing and introducing MPI test coverage and CI integration for the standalone driver component.
Changes:
- Enable distributed (MPI) standalone-driver runs by wiring `ProcessProperties`, `DecompositionInfo`, halo exchange, and global reductions through driver initialization, static-field factories, and initial conditions.
- Extend decomposition utilities with halo-level index selection and updated logging/reduction utilities for distributed execution.
- Add standalone-driver MPI test suite and run it in distributed CI.
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| uv.lock | Updates locked dependencies (notably gt4py to a git source) and optional CUDA/ROCm variants. |
| pyproject.toml | Pins gt4py to a git branch via tool.uv.sources. |
| model/testing/src/icon4py/model/testing/test_utils.py | Adds a helper for “non-blocking” array comparisons (prints instead of raising). |
| model/testing/src/icon4py/model/testing/parallel_helpers.py | Enhances distributed field comparison helpers; adds halo-level comparisons. |
| model/testing/src/icon4py/model/testing/definitions.py | Adds JW diffusion config for tests/experiments. |
| model/standalone_driver/tests/standalone_driver/mpi_tests/__init__.py | Introduces MPI test package for standalone driver. |
| model/standalone_driver/tests/standalone_driver/mpi_tests/test_parallel_standalone_driver.py | Adds MPI tests comparing single-rank vs multi-rank standalone driver outputs and serialized reference fields. |
| model/standalone_driver/tests/standalone_driver/mpi_tests/test_parallel_initial_conditions.py | Adds MPI tests comparing single-rank vs multi-rank JW initial conditions. |
| model/standalone_driver/tests/standalone_driver/integration_tests/test_standalone_driver.py | Updates for main.main() return type and backend handling. |
| model/standalone_driver/tests/standalone_driver/integration_tests/test_initial_condition.py | Passes exchange into initial condition and fixes serialbox typing import. |
| model/standalone_driver/tests/standalone_driver/fixtures.py | Imports backend fixture for tests. |
| model/standalone_driver/src/icon4py/model/standalone_driver/testcases/utils.py | Refactors parts of init_w to use reusable math helpers. |
| model/standalone_driver/src/icon4py/model/standalone_driver/testcases/initial_condition.py | Adds exchange usage to sync intermediate fields for distributed initial condition generation. |
| model/standalone_driver/src/icon4py/model/standalone_driver/standalone_driver.py | Adds MPI detection, distributed output-path handling, decomposition info exposure, exchange usage, and global reductions. |
| model/standalone_driver/src/icon4py/model/standalone_driver/main.py | CLI/main now supports distributed flags and returns (DriverStates, DecompositionInfo). |
| model/standalone_driver/src/icon4py/model/standalone_driver/driver_utils.py | Updates grid manager creation for distributed decomposers and routes exchange/reductions into factories; updates logging filter usage. |
| model/driver/src/icon4py/model/driver/initialization_utils.py | Switches logging filter import to the decomposition definitions module. |
| model/common/tests/common/grid/mpi_tests/test_parallel_grid_manager.py | Adjusts limited-area skip logic and halo-check behavior. |
| model/common/tests/common/grid/mpi_tests/test_parallel_geometry.py | Simplifies/removes redundant mean-field MPI tests and uses limited-area gating. |
| model/common/tests/common/decomposition/unit_tests/test_definitions.py | Adds tests for SingleNodeExchange warning behavior. |
| model/common/src/icon4py/model/common/model_backends.py | Adds is_cpu_backend() / is_gpu_backend() helpers. |
| model/common/src/icon4py/model/common/metrics/metrics_factory.py | Enables exchange for wgtfac_e computed field. |
| model/common/src/icon4py/model/common/math/stencils/generic_math_operations_array_ns.py | New NumPy/array-namespace math helpers used by standalone driver init code. |
| model/common/src/icon4py/model/common/grid/grid_manager.py | Tightens typing of run_properties parameter. |
| model/common/src/icon4py/model/common/grid/geometry.py | Adds bounds parameters for mean-field reductions over local domains. |
| model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py | Adjusts reductions APIs (sum/mean now accept bounds) and removes ParallelLogger (moved to definitions). |
| model/common/src/icon4py/model/common/decomposition/definitions.py | Adds halo-level entry types, warning-on-use for SingleNodeExchange, bounds in reductions API, and new ParallelLogger. |
| model/atmosphere/diffusion/tests/diffusion/mpi_tests/test_parallel_diffusion.py | Adds JW case and performs an extra halo exchange for JW. |
| ci/distributed.yml | Runs MPI tests for standalone_driver in distributed CI. |
```python
# TODO (Chia Rui): remove lower and upper bound when global reduction is implemented,
# currently set to avoid including halo cells in the reduction
global_total_mass = self.global_reductions.sum(
    local_mass, lower_bound=gtx.int32(0), upper_bound=gtx.int32(self.grid.num_cells)
)
```
The bounds passed to global_reductions.sum() are 0..self.grid.num_cells, which appears to cover the full local buffer and therefore doesn’t actually exclude halo cells (despite the comment). If the intent is to sum only owned/local zone cells, pass the proper start_index/end_index for the owned/local domain (or use an owner mask) so halo contributions are not included.
Suggested change:

```diff
-# TODO (Chia Rui): remove lower and upper bound when global reduction is implemented, currently set to avoid including halo cells in the reduction
-global_total_mass = self.global_reductions.sum(
-    local_mass, lower_bound=gtx.int32(0), upper_bound=gtx.int32(self.grid.num_cells)
+# Reduce only over owned/local cells; the full local buffer can include halo cells.
+owned_cell_start = gtx.int32(0)
+owned_cell_end = gtx.int32(self.grid.num_cells_local)
+global_total_mass = self.global_reductions.sum(
+    local_mass, lower_bound=owned_cell_start, upper_bound=owned_cell_end
```
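To make the comment's point concrete, here is a minimal NumPy sketch (sizes and values are hypothetical; `num_cells_local` stands in for the owned-cell count): summing the full local buffer includes halo copies of cells owned by neighbouring ranks, while slicing to the owned range does not.

```python
import numpy as np

# Hypothetical layout: owned cells first, halo cells appended at the end.
num_cells_local = 4  # number of cells owned by this rank
local_mass = np.array([1.0, 1.0, 1.0, 1.0, 9.0, 9.0])  # last two are halo copies

full_sum = local_mass.sum()                     # 22.0, halo values included
owned_sum = local_mass[:num_cells_local].sum()  # 4.0, halos excluded
```

Across ranks, those halo values are owned (and already summed) by a neighbouring rank, so including them locally double-counts mass in the global reduction.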
```diff
 def sum(
-    self, buffer: data_alloc.NDArray, array_ns: ModuleType = np
+    self,
+    buffer: data_alloc.NDArray,
+    lower_bound: gtx.int32,
+    upper_bound: gtx.int32,
+    array_ns: ModuleType = np,
 ) -> state_utils.ScalarType: ...
```
Reductions.sum() now requires lower_bound/upper_bound with no defaults. There are existing call sites that invoke sum(buffer, array_ns=...) (e.g. model/common/tests/common/decomposition/mpi_tests/test_mpi_decomposition.py:438), which will raise TypeError after this change. Consider making the bounds optional (defaulting to full buffer) or adding a compatibility overload, and then enforce bounds only where halo exclusion is needed.
```python
    upper_bound: gtx.int32,
    array_ns: ModuleType = np,
) -> state_utils.ScalarType:
    return array_ns.sum(buffer).item()
```
SingleNodeReductions.sum() ignores the lower_bound/upper_bound parameters and always sums the full buffer. This makes single-node behavior inconsistent with GlobalReductions.sum() (which slices) and defeats the intended halo-exclusion mechanism used by callers. Slice buffer[lower_bound:upper_bound] (matching the MPI implementation) before summing.
Suggested change:

```diff
-    return array_ns.sum(buffer).item()
+    return array_ns.sum(buffer[lower_bound:upper_bound]).item()
```
```python
    upper_bound: gtx.int32,
    array_ns: ModuleType = np,
) -> state_utils.ScalarType:
    return array_ns.sum(buffer).item() / buffer.size
```
SingleNodeReductions.mean() also ignores lower_bound/upper_bound, and uses buffer.size from the full buffer. For consistency with the MPI reductions (and to allow callers to exclude halos), slice to buffer[lower_bound:upper_bound] before computing the mean and divisor.
Suggested change:

```diff
-    return array_ns.sum(buffer).item() / buffer.size
+    buffer_slice = buffer[lower_bound:upper_bound]
+    return array_ns.sum(buffer_slice).item() / buffer_slice.size
```
```python
import icon4py.model.common.dimension as dims
import icon4py.model.common.utils.data_allocation as data_alloc
from icon4py.model.common import dimension as dims
```
This test file imports dims twice under the same name (import ...dimension as dims and from icon4py.model.common import dimension as dims). The second import overwrites the first and makes the imports confusing. Drop one of them to avoid accidental divergence and keep linting clean.
Suggested change:

```python
from icon4py.model.common import dimension as dims
```
```python
    / f"{output_path.name}_{datetime.date.today()}_{current_time.hour}h_{current_time.minute}m_{current_time.second}s"
)
output_path.mkdir(parents=True, exist_ok=False)
```
In MPI runs, only rank 0 creates (and may rename) output_path, but other ranks keep the original output_path value and can start writing before the directory exists. This can lead to mismatched paths across ranks and race conditions. Consider computing the final path on rank 0, broadcasting it to all ranks, and adding a barrier before continuing so every rank uses the same existing directory.
Suggested change:

```python
if with_mpi:
    comm = mpi_decomp.mpi4py.MPI.COMM_WORLD
    output_path = pathlib.Path(
        comm.bcast(str(output_path) if parallel_props.rank == 0 else None, root=0)
    )
    comm.Barrier()
```
```python
return (
    True
    if self._rank_id == 0
    else (self._print_distributed_debug_msg and record.levelno == logging.DEBUG)
)
```
ParallelLogger.filter() currently suppresses all non-root log records unless print_distributed_debug_msg is enabled and the record is DEBUG. This means warnings/errors logged on ranks != 0 will be dropped, which can hide failures in distributed runs. Consider always allowing WARNING+ through (and possibly INFO too, depending on desired verbosity), while still reducing routine INFO spam from non-root ranks.
Suggested change:

```diff
-    else (self._print_distributed_debug_msg and record.levelno == logging.DEBUG)
+    else (
+        record.levelno >= logging.WARNING
+        or (self._print_distributed_debug_msg and record.levelno == logging.DEBUG)
+    )
```
```python
    exchange: decomposition_defs.ExchangeRuntime = decomposition_defs.SingleNodeExchange,
) -> driver_states.DriverStates:
```
The default value for exchange is set to the class SingleNodeExchange, not an instance (e.g. single_node_exchange or SingleNodeExchange()). If jablonowski_williamson() is called without explicitly passing exchange, the later exchange(...) calls will attempt to call the class constructor with halo-exchange arguments and fail. Use an instance as the default.
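A minimal stand-in illustrating the class-vs-instance pitfall (a simplified mock, not the real `SingleNodeExchange`; the field and dimension names are hypothetical): with the class as the default, `exchange(...)` would invoke the constructor with halo-exchange arguments and fail, while an instance dispatches to `__call__` as intended.

```python
class SingleNodeExchange:
    # Simplified mock of a single-node exchange: nothing to communicate.
    def __call__(self, *fields, dim=None):
        return None  # no-op halo "exchange"


# Default to an *instance*, not the class, so calling exchange(...)
# without an explicitly passed argument works.
def jablonowski_williamson(exchange=SingleNodeExchange()):
    return exchange("vn", dim="EdgeDim")  # hypothetical field/dim names
```

Had the default been the bare class, `jablonowski_williamson()` would call `SingleNodeExchange("vn", dim="EdgeDim")` and raise a `TypeError`.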
```python
max_diff = np.max(np.abs(np.asarray(actual) - np.asarray(desired)))
color = "\033[1;31m" if max_diff > 0 else "\033[32m"
```
nonblocking_dallclose() currently ignores its atol/verbose parameters and prints red for any non-zero diff (even if within tolerance). It also forces np.asarray(...), which can be expensive or fail for GPU arrays (e.g., CuPy) if this ever gets used in GPU tests. Consider comparing max_diff against atol (and/or rtol), dropping unused parameters, and avoiding unconditional host conversion.
Suggested change:

```diff
-max_diff = np.max(np.abs(np.asarray(actual) - np.asarray(desired)))
-color = "\033[1;31m" if max_diff > 0 else "\033[32m"
+passed = dallclose(actual, desired, atol=atol)
+try:
+    max_diff = np.abs(actual - desired).max()  # type: ignore[operator]
+except Exception:
+    max_diff = np.max(np.abs(np.asarray(actual) - np.asarray(desired)))
+if passed and not verbose:
+    return
+color = "\033[32m" if passed else "\033[1;31m"
```
```yaml
- if: $COMPONENT == 'standalone_driver'
  variables:
    SLURM_TIMELIMIT: '00:50:00'
```
The standalone_driver SLURM time limit rule is duplicated twice with identical content. This makes the CI config harder to maintain and can cause confusion about which rule is intended. Remove the duplicate block.
Suggested change (remove the duplicate block):

```diff
-- if: $COMPONENT == 'standalone_driver'
-  variables:
-    SLURM_TIMELIMIT: '00:50:00'
```
cscs-ci run default

cscs-ci run distributed
Mandatory Tests: please make sure you run these tests via comment before you merge!
Optional Tests: to run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatests for static field computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
add distributed standalone driver