Allow stochastic `CoupledStepper` training by jpdunc23 · Pull Request #750 · ai2cm/ace

jpdunc23 · 2026-01-21T17:47:34Z

Adds n_ensemble to CoupledTrainStepperConfig and handling for the ensemble dimension in CoupledTrainStepper.

Changes:

- `CoupledTrainStepper` init now takes `config: CoupledTrainStepperConfig` as an arg rather than `loss: CoupledStepperTrainLoss`.
- Tests added.

fme/coupled/test_train.py

elynnwu · 2026-04-06T22:14:23Z

fme/coupled/stepper.py

+            gen_data=ocean_gen_data,
            target_data=add_ensemble_dim(dict(ocean_forward_data.data)),
-            time=gen_data.ocean_data.time,
+            time=ocean_forward_data.time,


Is this equivalent to gen_data.ocean_data.time when no ensemble is used?

Ya, the target data is ocean_forward_data so this is just a different but equivalent source for the time dim.

elynnwu · 2026-04-06T22:19:40Z

fme/coupled/stepper.py

+        """
+        atmos_gen_data = process_ensemble_prediction_generator_list(
+            [
+                EnsembleTensorDict(x.data)  # FIXME: fix output_list typing


Maybe just cast x.data to EnsembleTensorDict?

Refactored things a bit to get the typing in line with fme/ace/stepper/single_module.py. CoupledTrainStepper now deals with ComponentEnsembleStepPrediction, while CoupledStepper uses ComponentStepPrediction.

elynnwu · 2026-04-06T22:21:40Z

fme/coupled/stepper.py

+            gen_data=atmos_gen_data,
            target_data=add_ensemble_dim(dict(atmos_forward_data.data)),
-            time=gen_data.atmosphere_data.time,
+            time=atmos_forward_data.time,  # Use original time (not broadcasted)


So is this the reason for the switch to forward data instead of gen for ocean too?

The main reason is that there is no more gen_data, and ocean_gen_data and atmos_gen_data here are no longer BatchData objects. Getting the time attribute from ocean_forward_data or atmos_forward_data is equivalent to what we had before (note that this happens before stepped = stepped.prepend_initial_condition(ic)).

elynnwu · 2026-04-06T22:40:31Z

fme/coupled/stepper.py

        self._stepper = stepper
-        self._loss = loss
+        self._config = config
+        self._loss = self._config._build_loss(stepper)


[nit] This makes CoupledTrainStepper harder to build in tests, since you now need the full config. You can still just build the loss in get_train_stepper, right?

The changes to the tests are luckily minor since we were already building the CoupledTrainStepperConfig and using its get_train_stepper() method to build the CoupledTrainStepper. I can see some benefit to letting the loss be built outside of the CoupledTrainStepper init, but on the other hand this makes the init args identical to the fme.ace.TrainStepper init and avoids having to add an n_ensemble: int arg, both of which are nice. Don't feel super strongly one way or another.

Okay, let's keep as it is to be consistent with fme.ace.TrainStepper
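
For readers weighing the trade-off discussed in this thread, here is a minimal sketch of the "config in, loss built inside init" pattern. Every name in it (`ToyTrainStepperConfig`, `ToyTrainStepper`, `build_loss`, `loss_weight`) is a hypothetical stand-in, not the actual `fme` API; it only illustrates why passing the config avoids a separate `n_ensemble: int` argument.

```python
from dataclasses import dataclass


@dataclass
class ToyTrainStepperConfig:
    """Hypothetical stand-in for a train-stepper config."""

    n_ensemble: int = 1
    loss_weight: float = 1.0

    def build_loss(self, stepper):
        # The config knows how to build the loss; here it is a weighted
        # squared error (an assumption made purely for illustration).
        weight = self.loss_weight
        return lambda pred, target: weight * (pred - target) ** 2

    def get_train_stepper(self, stepper):
        # Tests can keep using this factory instead of calling init directly.
        return ToyTrainStepper(stepper=stepper, config=self)


class ToyTrainStepper:
    """Hypothetical train stepper that takes the whole config."""

    def __init__(self, stepper, config):
        self._stepper = stepper
        self._config = config
        # The loss is built here, so callers never pass it in.
        self._loss = config.build_loss(stepper)

    @property
    def n_ensemble(self):
        # No extra init arg needed: the value is read from the config.
        return self._config.n_ensemble

    def loss(self, pred, target):
        return self._loss(pred, target)


config = ToyTrainStepperConfig(n_ensemble=4, loss_weight=0.5)
train_stepper = config.get_train_stepper("toy-stepper")
```

The cost raised in the review still applies to this sketch: constructing `ToyTrainStepper` directly requires a full config object rather than a bare loss.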

elynnwu · 2026-04-06T22:44:23Z

fme/coupled/stepper.py

+        # Ensemble support: broadcast atmosphere data for ensemble training
+        n_ensemble = self._config.n_ensemble
+        atmos_data_ensemble = data.atmosphere_data.broadcast_ensemble(n_ensemble)
+        ocean_data_ensemble = data.ocean_data.broadcast_ensemble(n_ensemble)


My understanding is that you broadcast ocean data as well even though we currently only support training stochastic atmosphere, and the stochastic losses are propagated to ocean via the surface forcing variables. Can you add a short description in n_ensemble?

I added some info about stochastic training assumptions to the CoupledTrainStepper docstring. Lmk if this is more clear now.

Thanks, the docs are good.
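
As a rough illustration of what broadcasting a batch for ensemble training can look like, the sketch below repeats each sample along the leading batch axis using NumPy. The layout (members of the same sample stored contiguously) and the free-function signature are assumptions; the thread does not show how `broadcast_ensemble` is actually implemented.

```python
import numpy as np


def broadcast_ensemble(data, n_ensemble):
    """Repeat every sample n_ensemble times along axis 0.

    Hypothetical helper: a batch of size B becomes B * n_ensemble, with the
    copies of each sample adjacent to one another.
    """
    return {
        name: np.repeat(arr, n_ensemble, axis=0) for name, arr in data.items()
    }


# Two samples with three values each; "sst" is an illustrative variable name.
batch = {"sst": np.arange(6.0).reshape(2, 3)}
ensemble_batch = broadcast_ensemble(batch, n_ensemble=3)
```

Under this assumption both components see identical copies of their inputs, so any spread between members has to come from the stochastic model itself.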

elynnwu · 2026-04-06T23:01:45Z

fme/coupled/stepper.py

+        else:
+            gen_data = self._stepper._process_prediction_generator_list(
+                output_list, data_ensemble
+            )
+            ocean_gen_data = unfold_ensemble_dim(
+                dict(gen_data.ocean_data.data), n_ensemble=1
+            )
+            atmos_gen_data = unfold_ensemble_dim(
+                dict(gen_data.atmosphere_data.data), n_ensemble=1
+            )


Would be better if _process_ensemble_prediction_generator_list also accepts n_ensemble=1 so you don't need to have if/else here.

Was able to remove these if n_ensemble > 1: blocks.
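
One way to see why the branch is unnecessary: if unfolding is a plain reshape, `n_ensemble=1` is just the degenerate case that inserts a singleton ensemble axis. The NumPy sketch below assumes a `[batch * n_ensemble, ...]` to `[batch, n_ensemble, ...]` convention and reuses the names `unfold_ensemble_dim` and `add_ensemble_dim` from the diff only as labels; the real helpers may differ.

```python
import numpy as np


def unfold_ensemble_dim(data, n_ensemble):
    """Reshape [batch * n_ensemble, ...] into [batch, n_ensemble, ...]."""
    out = {}
    for name, arr in data.items():
        n_total = arr.shape[0]
        if n_total % n_ensemble != 0:
            raise ValueError(
                f"leading dim {n_total} is not divisible by {n_ensemble}"
            )
        out[name] = arr.reshape(
            n_total // n_ensemble, n_ensemble, *arr.shape[1:]
        )
    return out


def add_ensemble_dim(data):
    """Insert a singleton ensemble axis so targets broadcast against members."""
    return {name: arr[:, None] for name, arr in data.items()}


gen = {"t": np.zeros((4, 5))}
deterministic = unfold_ensemble_dim(gen, n_ensemble=1)  # shape (4, 1, 5)
stochastic = unfold_ensemble_dim(gen, n_ensemble=2)  # shape (2, 2, 5)
target = add_ensemble_dim(gen)  # shape (4, 1, 5)
```

With that convention the deterministic and ensemble paths share one code path, and targets with a singleton ensemble axis line up with predictions in both cases.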

…hastic-atmos


jpdunc23 · 2026-04-08T04:30:48Z

fme/coupled/stepper.py

-    def detach(self, optimizer: OptimizationABC) -> "ComponentStepPrediction":
-        """Detach the data tensor map from the computation graph."""
-        return ComponentStepPrediction(
-            realm=self.realm,
-            data=optimizer.detach_if_using_gradient_accumulation(self.data),
-            step=self.step,
-        )


ComponentStepPrediction is not returned by the CoupledTrainStepper methods now, so it doesn't need this detach() method.

jpdunc23 · 2026-04-08T04:32:10Z

fme/coupled/stepper.py

-class CoupledStepperTrainLoss:
-    def __init__(
-        self,
-        ocean_loss: StepLossABC,
-        atmosphere_loss: StepLossABC,
-    ):
-        self._loss_objs = {
-            "ocean": ocean_loss,
-            "atmosphere": atmosphere_loss,
-        }
-
-    @property
-    def effective_loss_scaling(self) -> CoupledTensorMapping:
-        return CoupledTensorMapping(
-            ocean=self._loss_objs["ocean"].effective_loss_scaling,
-            atmosphere=self._loss_objs["atmosphere"].effective_loss_scaling,
-        )
-
-    def step_is_optimized(self, realm: str, step: int) -> bool:
-        return self._loss_objs[realm].step_is_optimized(step)
-
-    def __call__(
-        self,
-        prediction: ComponentStepPrediction,
-        target_data: TensorMapping,
-    ) -> torch.Tensor | None:
-        loss_obj = self._loss_objs[prediction.realm]
-        if loss_obj.step_is_optimized(prediction.step):
-            return loss_obj(prediction, target_data)
-        return None


Moved below for a bit better organization. The only change is to the typing of the prediction arg to __call__, which is now prediction: ComponentEnsembleStepPrediction.
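
The dispatch logic in the moved class can be exercised in isolation. The sketch below mirrors the `__call__` shown in the diff (look up the loss by realm, return `None` for steps that are not optimized) but uses hypothetical toy classes (`ToyStepLoss`, `ToyPrediction`, `ToyCoupledLoss`) and scalar data in place of `StepLossABC`, `ComponentEnsembleStepPrediction`, and tensors.

```python
from dataclasses import dataclass


@dataclass
class ToyPrediction:
    """Hypothetical stand-in for a per-realm step prediction."""

    realm: str
    data: float
    step: int


class ToyStepLoss:
    """Hypothetical per-realm loss that is only active on some steps."""

    def __init__(self, optimized_steps, fn):
        self._optimized_steps = set(optimized_steps)
        self._fn = fn

    def step_is_optimized(self, step):
        return step in self._optimized_steps

    def __call__(self, prediction, target):
        return self._fn(prediction.data, target)


class ToyCoupledLoss:
    """Routes each prediction to the loss object for its realm."""

    def __init__(self, ocean_loss, atmosphere_loss):
        self._loss_objs = {"ocean": ocean_loss, "atmosphere": atmosphere_loss}

    def __call__(self, prediction, target):
        loss_obj = self._loss_objs[prediction.realm]
        if loss_obj.step_is_optimized(prediction.step):
            return loss_obj(prediction, target)
        return None  # this step contributes nothing to the total loss


coupled_loss = ToyCoupledLoss(
    ocean_loss=ToyStepLoss({0}, lambda p, t: abs(p - t)),
    atmosphere_loss=ToyStepLoss({0, 1}, lambda p, t: (p - t) ** 2),
)
```

Callers need to handle the `None` return explicitly, for example by skipping the step when accumulating the total loss.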

jpdunc23 · 2026-04-08T04:35:14Z

fme/coupled/stepper.py

+    def step(self) -> int:
+        return self._step
+
+    def detach_if_using_gradient_accumulation(


This replaces ComponentStepPrediction.detach()

jpdunc23 added 7 commits January 21, 2026 09:45

Allow stochastic atmos training in CoupledStepper

328c978

Merge branch 'main' of github.com:ai2cm/ace into coupled-ft-with-stoc…

eb19d3d

…hastic-atmos

Merge branch 'main' of github.com:ai2cm/ace into coupled-ft-with-stoc…

8533d85

…hastic-atmos

Fix test

81f7329

Merge branch 'main' of github.com:ai2cm/ace into coupled-ft-with-stoc…

e5131f4

…hastic-atmos

Merge branch 'main' of github.com:ai2cm/ace into coupled-ft-with-stoc…

604d22c

…hastic-atmos

Allow for ensemble ocean training

1f39fb5

jpdunc23 changed the title ~~Allow stochastic atmos training in CoupledStepper~~ Allow stochastic CoupledStepper training Apr 2, 2026

Merge branch 'main' into coupled-ft-with-stochastic-atmos

1ceb8d6

jpdunc23 marked this pull request as ready for review April 6, 2026 21:11

elynnwu reviewed Apr 6, 2026

jpdunc23 added 3 commits April 7, 2026 16:50

Add CoupledTrainStepper._accumulate_loss()

4f9585e

Merge branch 'main' of github.com:ai2cm/ace into coupled-ft-with-stoc…

cec0022

…hastic-atmos

Address additional review comments

8743338

jpdunc23 commented Apr 8, 2026

jpdunc23 and others added 2 commits April 7, 2026 21:37

Fix unhelpful docstring

76c13e1

Merge branch 'main' into coupled-ft-with-stochastic-atmos

30ffd90

jpdunc23 requested a review from elynnwu April 9, 2026 15:58

elynnwu approved these changes Apr 9, 2026

Merge branch 'main' into coupled-ft-with-stochastic-atmos

d7e12a5

jpdunc23 enabled auto-merge (squash) April 9, 2026 16:49

jpdunc23 merged commit 366f117 into main Apr 9, 2026
7 checks passed

jpdunc23 deleted the coupled-ft-with-stochastic-atmos branch April 9, 2026 16:58
