…on4py into ci_for_standalone_driver
…s.py Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com>
Merge main:
- refactor: prefix VelocityAdvection internal attributes with underscore (#1180)
- fortran bindings: split py2fgen and bindings (#1185)
- Rename `single_node_default` to `single_node_exchange` (#1176)
- remove the duplication of fixtures (#1186)
- Update GT4Py to v1.1.9 (#1187)
- cleanup 3 unused vars from solve_nh_init (#1169)
```toml
[tool.uv.sources]
dace = {index = "gridtools"}
gt4py = {git = "https://github.com/GridTools/gt4py", branch = "workaround_caching_issue_with_embedded_inverse_image_caching"}
```
@msimberg is this still needed or was it merged into gt4py main?
Pull request overview
This PR adds MPI/distributed execution support to the icon4py standalone driver, extending decomposition/exchange/reduction plumbing and introducing MPI test coverage and CI integration for the standalone driver component.
Changes:
- Enable distributed (MPI) standalone-driver runs by wiring `ProcessProperties`, `DecompositionInfo`, halo exchange, and global reductions through driver initialization, static-field factories, and initial conditions.
- Extend decomposition utilities with halo-level index selection and updated logging/reduction utilities for distributed execution.
- Add standalone-driver MPI test suite and run it in distributed CI.
Reviewed changes
Copilot reviewed 28 out of 29 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| uv.lock | Updates locked dependencies (notably gt4py to a git source) and optional CUDA/ROCm variants. |
| pyproject.toml | Pins gt4py to a git branch via tool.uv.sources. |
| model/testing/src/icon4py/model/testing/test_utils.py | Adds a helper for “non-blocking” array comparisons (prints instead of raising). |
| model/testing/src/icon4py/model/testing/parallel_helpers.py | Enhances distributed field comparison helpers; adds halo-level comparisons. |
| model/testing/src/icon4py/model/testing/definitions.py | Adds JW diffusion config for tests/experiments. |
| model/standalone_driver/tests/standalone_driver/mpi_tests/__init__.py | Introduces MPI test package for standalone driver. |
| model/standalone_driver/tests/standalone_driver/mpi_tests/test_parallel_standalone_driver.py | Adds MPI tests comparing single-rank vs multi-rank standalone driver outputs and serialized reference fields. |
| model/standalone_driver/tests/standalone_driver/mpi_tests/test_parallel_initial_conditions.py | Adds MPI tests comparing single-rank vs multi-rank JW initial conditions. |
| model/standalone_driver/tests/standalone_driver/integration_tests/test_standalone_driver.py | Updates for main.main() return type and backend handling. |
| model/standalone_driver/tests/standalone_driver/integration_tests/test_initial_condition.py | Passes exchange into initial condition and fixes serialbox typing import. |
| model/standalone_driver/tests/standalone_driver/fixtures.py | Imports backend fixture for tests. |
| model/standalone_driver/src/icon4py/model/standalone_driver/testcases/utils.py | Refactors parts of init_w to use reusable math helpers. |
| model/standalone_driver/src/icon4py/model/standalone_driver/testcases/initial_condition.py | Adds exchange usage to sync intermediate fields for distributed initial condition generation. |
| model/standalone_driver/src/icon4py/model/standalone_driver/standalone_driver.py | Adds MPI detection, distributed output-path handling, decomposition info exposure, exchange usage, and global reductions. |
| model/standalone_driver/src/icon4py/model/standalone_driver/main.py | CLI/main now supports distributed flags and returns (DriverStates, DecompositionInfo). |
| model/standalone_driver/src/icon4py/model/standalone_driver/driver_utils.py | Updates grid manager creation for distributed decomposers and routes exchange/reductions into factories; updates logging filter usage. |
| model/driver/src/icon4py/model/driver/initialization_utils.py | Switches logging filter import to the decomposition definitions module. |
| model/common/tests/common/grid/mpi_tests/test_parallel_grid_manager.py | Adjusts limited-area skip logic and halo-check behavior. |
| model/common/tests/common/grid/mpi_tests/test_parallel_geometry.py | Simplifies/removes redundant mean-field MPI tests and uses limited-area gating. |
| model/common/tests/common/decomposition/unit_tests/test_definitions.py | Adds tests for SingleNodeExchange warning behavior. |
| model/common/src/icon4py/model/common/model_backends.py | Adds is_cpu_backend() / is_gpu_backend() helpers. |
| model/common/src/icon4py/model/common/metrics/metrics_factory.py | Enables exchange for wgtfac_e computed field. |
| model/common/src/icon4py/model/common/math/stencils/generic_math_operations_array_ns.py | New NumPy/array-namespace math helpers used by standalone driver init code. |
| model/common/src/icon4py/model/common/grid/grid_manager.py | Tightens typing of run_properties parameter. |
| model/common/src/icon4py/model/common/grid/geometry.py | Adds bounds parameters for mean-field reductions over local domains. |
| model/common/src/icon4py/model/common/decomposition/mpi_decomposition.py | Adjusts reductions APIs (sum/mean now accept bounds) and removes ParallelLogger (moved to definitions). |
| model/common/src/icon4py/model/common/decomposition/definitions.py | Adds halo-level entry types, warning-on-use for SingleNodeExchange, bounds in reductions API, and new ParallelLogger. |
| model/atmosphere/diffusion/tests/diffusion/mpi_tests/test_parallel_diffusion.py | Adds JW case and performs an extra halo exchange for JW. |
| ci/distributed.yml | Runs MPI tests for standalone_driver in distributed CI. |
```python
# TODO (Chia Rui): remove lower and upper bound when global reduction is implemented,
# currently set to avoid including halo cells in the reduction
global_total_mass = self.global_reductions.sum(
    local_mass, lower_bound=gtx.int32(0), upper_bound=gtx.int32(self.grid.num_cells)
)
```
The bounds passed to global_reductions.sum() are 0..self.grid.num_cells, which appears to cover the full local buffer and therefore doesn’t actually exclude halo cells (despite the comment). If the intent is to sum only owned/local zone cells, pass the proper start_index/end_index for the owned/local domain (or use an owner mask) so halo contributions are not included.
Suggested change:

```diff
-# TODO (Chia Rui): remove lower and upper bound when global reduction is implemented, currently set to avoid including halo cells in the reduction
-global_total_mass = self.global_reductions.sum(
-    local_mass, lower_bound=gtx.int32(0), upper_bound=gtx.int32(self.grid.num_cells)
+# Reduce only over owned/local cells; the full local buffer can include halo cells.
+owned_cell_start = gtx.int32(0)
+owned_cell_end = gtx.int32(self.grid.num_cells_local)
+global_total_mass = self.global_reductions.sum(
+    local_mass, lower_bound=owned_cell_start, upper_bound=owned_cell_end
```
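To make the comment's point concrete, here is a minimal NumPy sketch (sizes and values are hypothetical; `num_cells_local` stands in for the owned-cell count): summing the full local buffer includes halo copies of cells owned by neighbouring ranks, while slicing to the owned range does not.

```python
import numpy as np

# Hypothetical layout: owned cells first, halo cells appended at the end.
num_cells_local = 4  # number of cells owned by this rank
local_mass = np.array([1.0, 1.0, 1.0, 1.0, 9.0, 9.0])  # last two are halo copies

full_sum = local_mass.sum()                     # 22.0, halo values included
owned_sum = local_mass[:num_cells_local].sum()  # 4.0, halos excluded
```

Across ranks, those halo values are owned (and already summed) by a neighbouring rank, so including them locally double-counts mass in the global reduction.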
```diff
 def sum(
-    self, buffer: data_alloc.NDArray, array_ns: ModuleType = np
+    self,
+    buffer: data_alloc.NDArray,
+    lower_bound: gtx.int32,
+    upper_bound: gtx.int32,
+    array_ns: ModuleType = np,
 ) -> state_utils.ScalarType: ...
```
Reductions.sum() now requires lower_bound/upper_bound with no defaults. There are existing call sites that invoke sum(buffer, array_ns=...) (e.g. model/common/tests/common/decomposition/mpi_tests/test_mpi_decomposition.py:438), which will raise TypeError after this change. Consider making the bounds optional (defaulting to full buffer) or adding a compatibility overload, and then enforce bounds only where halo exclusion is needed.
```python
    upper_bound: gtx.int32,
    array_ns: ModuleType = np,
) -> state_utils.ScalarType:
    return array_ns.sum(buffer).item()
```
SingleNodeReductions.sum() ignores the lower_bound/upper_bound parameters and always sums the full buffer. This makes single-node behavior inconsistent with GlobalReductions.sum() (which slices) and defeats the intended halo-exclusion mechanism used by callers. Slice buffer[lower_bound:upper_bound] (matching the MPI implementation) before summing.
Suggested change:

```diff
-    return array_ns.sum(buffer).item()
+    return array_ns.sum(buffer[lower_bound:upper_bound]).item()
```
```python
    upper_bound: gtx.int32,
    array_ns: ModuleType = np,
) -> state_utils.ScalarType:
    return array_ns.sum(buffer).item() / buffer.size
```
SingleNodeReductions.mean() also ignores lower_bound/upper_bound, and uses buffer.size from the full buffer. For consistency with the MPI reductions (and to allow callers to exclude halos), slice to buffer[lower_bound:upper_bound] before computing the mean and divisor.
Suggested change:

```diff
-    return array_ns.sum(buffer).item() / buffer.size
+    buffer_slice = buffer[lower_bound:upper_bound]
+    return array_ns.sum(buffer_slice).item() / buffer_slice.size
```
```python
import icon4py.model.common.dimension as dims
import icon4py.model.common.utils.data_allocation as data_alloc
from icon4py.model.common import dimension as dims
```
This test file imports dims twice under the same name (import ...dimension as dims and from icon4py.model.common import dimension as dims). The second import overwrites the first and makes the imports confusing. Drop one of them to avoid accidental divergence and keep linting clean.
Suggested change:

```python
from icon4py.model.common import dimension as dims
```
```python
    / f"{output_path.name}_{datetime.date.today()}_{current_time.hour}h_{current_time.minute}m_{current_time.second}s"
)
output_path.mkdir(parents=True, exist_ok=False)
```
In MPI runs, only rank 0 creates (and may rename) output_path, but other ranks keep the original output_path value and can start writing before the directory exists. This can lead to mismatched paths across ranks and race conditions. Consider computing the final path on rank 0, broadcasting it to all ranks, and adding a barrier before continuing so every rank uses the same existing directory.
Suggested change:

```python
if with_mpi:
    comm = mpi_decomp.mpi4py.MPI.COMM_WORLD
    output_path = pathlib.Path(
        comm.bcast(str(output_path) if parallel_props.rank == 0 else None, root=0)
    )
    comm.Barrier()
```
```python
return (
    True
    if self._rank_id == 0
    else (self._print_distributed_debug_msg and record.levelno == logging.DEBUG)
)
```
ParallelLogger.filter() currently suppresses all non-root log records unless print_distributed_debug_msg is enabled and the record is DEBUG. This means warnings/errors logged on ranks != 0 will be dropped, which can hide failures in distributed runs. Consider always allowing WARNING+ through (and possibly INFO too, depending on desired verbosity), while still reducing routine INFO spam from non-root ranks.
Suggested change:

```diff
-    else (self._print_distributed_debug_msg and record.levelno == logging.DEBUG)
+    else (
+        record.levelno >= logging.WARNING
+        or (self._print_distributed_debug_msg and record.levelno == logging.DEBUG)
+    )
```
```python
    exchange: decomposition_defs.ExchangeRuntime = decomposition_defs.SingleNodeExchange,
) -> driver_states.DriverStates:
```
The default value for exchange is set to the class SingleNodeExchange, not an instance (e.g. single_node_exchange or SingleNodeExchange()). If jablonowski_williamson() is called without explicitly passing exchange, the later exchange(...) calls will attempt to call the class constructor with halo-exchange arguments and fail. Use an instance as the default.
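A minimal stand-in illustrating the class-vs-instance pitfall (a simplified mock, not the real `SingleNodeExchange`; the field and dimension names are hypothetical): with the class as the default, `exchange(...)` would invoke the constructor with halo-exchange arguments and fail, while an instance dispatches to `__call__` as intended.

```python
class SingleNodeExchange:
    # Simplified mock of a single-node exchange: nothing to communicate.
    def __call__(self, *fields, dim=None):
        return None  # no-op halo "exchange"


# Default to an *instance*, not the class, so calling exchange(...)
# without an explicitly passed argument works.
def jablonowski_williamson(exchange=SingleNodeExchange()):
    return exchange("vn", dim="EdgeDim")  # hypothetical field/dim names
```

Had the default been the bare class, `jablonowski_williamson()` would call `SingleNodeExchange("vn", dim="EdgeDim")` and raise a `TypeError`.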
```python
max_diff = np.max(np.abs(np.asarray(actual) - np.asarray(desired)))
color = "\033[1;31m" if max_diff > 0 else "\033[32m"
```
nonblocking_dallclose() currently ignores its atol/verbose parameters and prints red for any non-zero diff (even if within tolerance). It also forces np.asarray(...), which can be expensive or fail for GPU arrays (e.g., CuPy) if this ever gets used in GPU tests. Consider comparing max_diff against atol (and/or rtol), dropping unused parameters, and avoiding unconditional host conversion.
Suggested change:

```diff
-max_diff = np.max(np.abs(np.asarray(actual) - np.asarray(desired)))
-color = "\033[1;31m" if max_diff > 0 else "\033[32m"
+passed = dallclose(actual, desired, atol=atol)
+try:
+    max_diff = np.abs(actual - desired).max()  # type: ignore[operator]
+except Exception:
+    max_diff = np.max(np.abs(np.asarray(actual) - np.asarray(desired)))
+if passed and not verbose:
+    return
+color = "\033[32m" if passed else "\033[1;31m"
```
```yaml
- if: $COMPONENT == 'standalone_driver'
  variables:
    SLURM_TIMELIMIT: '00:50:00'
```
The standalone_driver SLURM time limit rule is duplicated twice with identical content. This makes the CI config harder to maintain and can cause confusion about which rule is intended. Remove the duplicate block.
Suggested change (remove the duplicate block):

```diff
-- if: $COMPONENT == 'standalone_driver'
-  variables:
-    SLURM_TIMELIMIT: '00:50:00'
```
cscs-ci run default

cscs-ci run distributed
Mandatory Tests: please make sure you run these tests via comment before you merge!
Optional Tests: to run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatests for static field computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
add distributed standalone driver