Conversation
…l but needs further testing with batching.
… are too many right now
Greptile Summary
This PR adds a self-contained … Key observations:
| Filename | Overview |
|---|---|
| recipes/tc_tracking/src/tempest_extremes.py | Core implementation of the TempestExtremes integration; provides a synchronous TempestExtremes class and an asynchronous AsyncTempestExtremes class. AsyncTempestExtremes.__call__ correctly submits tracking to a background thread pool, enabling GPU/CPU overlap, but cleanup()/wait_for_completion() abort on the first task failure and silently abandon the remaining failed tasks. |
| recipes/tc_tracking/src/modes/generate_tc_hunt_ensembles.py | Main inference loop orchestrating ensemble generation, stability checking, and cyclone tracking; logic is sound and correctly uses the async TempestExtremes API; previously flagged debug comments have been cleaned up. |
| recipes/tc_tracking/pyproject.toml | Package metadata and dependencies; contains a placeholder description ("no, i won't") and unpinned git sources for both earth2studio and torch-harmonics, which reduce build reproducibility. |
| recipes/tc_tracking/Dockerfile | Docker build environment that compiles TempestExtremes from source; clones TempestExtremes at HEAD without a pinned tag/commit, which makes image builds non-reproducible. |
| recipes/tc_tracking/tc_hunt.py | Entry-point script with Hydra configuration; still contains an informal print("finished **yaaayyyy**") celebration message (previously flagged). |
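The unpinned TempestExtremes clone flagged for the Dockerfile above can be made reproducible by checking out an explicit ref. A minimal sketch of such a build step; the tag name is an assumption and should be replaced by whichever release the recipe was actually validated against:

```dockerfile
# Pin the TempestExtremes source to a known ref (the tag name here is hypothetical)
ARG TE_REF=v2.3.1
RUN git clone --branch ${TE_REF} --depth 1 \
        https://github.com/ClimateGlobalChange/tempestextremes.git /opt/tempestextremes
```

The same reasoning applies to the git sources in pyproject.toml: pinning each to a tag or commit hash restores build reproducibility.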
Last reviewed commit: "TE workers"
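The cleanup()/wait_for_completion() behaviour noted above, aborting on the first failure, could instead drain every future and report all errors at the end. A minimal sketch using concurrent.futures; `wait_for_all` and the toy division tasks are hypothetical stand-ins, not the recipe's actual API:

```python
import concurrent.futures


def wait_for_all(futures):
    """Collect every task's outcome instead of aborting on the first failure.

    Hypothetical helper illustrating the review point; the real
    AsyncTempestExtremes API may differ.
    """
    results, errors = [], []
    for fut in concurrent.futures.as_completed(futures):
        try:
            results.append(fut.result())
        except Exception as exc:  # record the error, keep draining remaining tasks
            errors.append(exc)
    return results, errors


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # 1 // 0 raises ZeroDivisionError inside the worker; the others succeed
    futures = [pool.submit(lambda n=n: 1 // n) for n in (1, 0, 2)]
    ok, failed = wait_for_all(futures)

print(len(ok), len(failed))  # 2 successes, 1 ZeroDivisionError
```

This keeps the GPU/CPU overlap intact while ensuring no failing tracker task is silently abandoned.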
```python
for ic, mems, seed in ic_mems:
    mini_batch_size = len(mems)

    data_source = data_source_mngr.select_data_source(ic)

    # if new IC, fetch data, create iterator
    if ic != ic_prev:
        if cfg.store_type == "netcdf":
            store = initialise_netcdf_output(cfg, out_coords, ic, ic_mems)
        x0, coords0 = fetch_data(
            data_source,
            time=[np.datetime64(ic)],
            lead_time=model.input_coords()["lead_time"],
            variable=model.input_coords()["variable"],
            device=dist.device,
        )
        ic_prev = ic

    coords = {"ensemble": np.array(mems)} | coords0.copy()
    xx = x0.unsqueeze(0).repeat(mini_batch_size, *([1] * x0.ndim))

    if stability_check:
        stability_check.reset(deepcopy(coords))
        # print(stability_check.input_coords)
        # exit()

    # set random state or apply perturbation
    if ("model" not in cfg) or (cfg.model == "fcn3"):
        model.set_rng(seed=seed)
    elif (
        cfg.model[:4] == "aifs"
    ):  # no need for perturbation, but also cannot set internal noise state
        pass
    else:
        sg = SphericalGaussian(noise_amplitude=0.0005)
        xx, coords = sg(xx, coords)

    iterator = model.create_iterator(xx, coords)

    # roll out the model and record data as desired
    for _, (xx, coords) in tqdm(
        zip(range(cfg.n_steps + 1), iterator), total=cfg.n_steps + 1
    ):
        write_to_store(store, xx, coords, out_coords)
        if cyclone_tracking:
            cyclone_tracking.record_state(xx, coords)

        if stability_check:
            yy, coy = map_coords(xx, coords, stability_check.input_coords)
            stab, _ = stability_check(yy, coy)
            if not stab.all():
                ic_mems.append((ic, mems, seed + 1))
                print(
                    f"CAUTION: one of members {mems} became unstable. will re-create with new seed."
                )
                break
```
Unbounded retry loop for unstable members
When a member is detected as unstable (line 260), it is re-appended to ic_mems with seed + 1. Because iterating over a Python list also visits items appended during iteration, this creates an unbounded retry cycle: there is no guard on how many times a given (ic, mems) combination can be re-queued.
If a particular initial condition consistently produces unstable trajectories (e.g., a known degenerate edge case), the job will never terminate. A maximum-retry counter should be tracked per (ic, seed) pair, and members that exceed the limit should be skipped with a warning rather than re-queued indefinitely.
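The bounded-retry suggestion can be sketched as follows; `run_member` and `MAX_RETRIES` are hypothetical stand-ins for the ensemble loop body and a tunable limit, not names from the PR:

```python
from collections import defaultdict

MAX_RETRIES = 3  # assumed limit; tune per workload


def run_with_retry(ic_mems, run_member):
    """Sketch of a bounded-retry queue for unstable ensemble members.

    run_member(ic, mems, seed) is a hypothetical stand-in that returns
    True when the rollout stays stable.
    """
    retries = defaultdict(int)  # retry count per (ic, tuple(mems))
    queue = list(ic_mems)
    for ic, mems, seed in queue:  # list iteration also visits appended retries
        if run_member(ic, mems, seed):
            continue
        key = (ic, tuple(mems))
        retries[key] += 1
        if retries[key] >= MAX_RETRIES:
            # skip with a warning instead of re-queueing indefinitely
            print(f"WARNING: {key} unstable after {MAX_RETRIES} attempts; skipping")
        else:
            queue.append((ic, mems, seed + 1))
```

A persistently unstable member is now attempted at most MAX_RETRIES times before being dropped, so the job terminates even for degenerate initial conditions.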
In practice, such jobs will be killed by the system after exceeding their allocated time.
In a future version I want to update the scheduling to something smarter anyway, since individual ensemble members might then not always take roughly the same time to execute, as they do now.
…en nested thread pools
Earth2Studio Pull Request
Description
Checklist
Dependencies
Licences for all Python dependencies declared in `pyproject.toml`. Information sourced from PyPI on 2026-04-09.
Core Dependencies
Optional – plot
Optional – dev