Eval recipe pipelines and config reorg #810
pzharrington wants to merge 18 commits into NVIDIA:main from …/earth2studio:vnv-recipe-skeleton
Conversation
Greptile Summary
This PR introduces a `Pipeline` abstraction for the eval recipe and reorganizes its config system.
| Filename | Overview |
|---|---|
| recipes/eval/src/pipeline.py | New file — well-structured ABC with two built-in implementations; shared run() correctly handles output filtering, ensemble injection, and threaded-write flushing before markers. |
| recipes/eval/src/output.py | Refactored — OutputManager.__init__ slimmed down; validate_output_store added for deferred store creation; flush() for resume sync; build_forecast_coords / build_diagnostic_coords extracted as standalone helpers. |
| recipes/eval/src/work.py | Added write_marker, filter_completed_items, clear_progress, and progress_dir; also fixes _parse_initial_times to treat start_times: null as absent so campaign configs can use the IC-block path. |
| recipes/eval/main.py | Cleanly delegates to build_pipeline / pipeline.setup / pipeline.run; resume early-exit is globally consistent because filter_completed_items reads from the same shared filesystem across all ranks. |
| recipes/eval/predownload.py | Split into _predownload_forecast and _predownload_diagnostic; custom pipelines (fully-qualified class names) will hit a ValueError with a message that doesn't hint users need to implement their own pre-download step. |
| recipes/eval/test/_multigpu_worker.py | Updated to use ForecastPipeline; manually assigns attributes instead of calling pipeline.setup(), bypassing the run_on_rank0_first barrier inside load_prognostic. |
| recipes/eval/test/test_resume.py | New — comprehensive coverage of progress tracking, resume/flush, and pipeline resume integration. |
| recipes/eval/test/test_pipeline.py | New — covers ABC enforcement, registry lookup, custom-class-by-FQN path, and the stub pipeline's build/run methods. |
| recipes/eval/test/test_diagnostic_inference.py | New — covers single-IC, multi-IC, ensemble, and empty-work-item paths for DiagnosticPipeline. |
| recipes/eval/cfg/default.yaml | Adds pipeline, resume, and predownload keys; moves predownload config into the shared default. |
| recipes/eval/cfg/campaign/fcn3_2024_full.yaml | New campaign config for FCN3 full-year ensemble; correctly sets start_times: null to activate the IC-block path and resume: true for multi-job splitting. |
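The resume scheme the table describes for `work.py` and `main.py` can be sketched as a marker-file pattern. The function names follow the table, but the signatures, the `.done` file layout, and the string item IDs are assumptions for illustration:

```python
import tempfile
from pathlib import Path

def write_marker(progress_dir: Path, item_id: str) -> None:
    """Record completion of one work item as an empty marker file.
    (Hypothetical sketch; the real helpers live in recipes/eval/src/work.py.)"""
    progress_dir.mkdir(parents=True, exist_ok=True)
    (progress_dir / f"{item_id}.done").touch()

def filter_completed_items(progress_dir: Path, items: list) -> list:
    """Drop items whose marker already exists on the shared filesystem,
    so a resumed job only processes the remaining work."""
    if not progress_dir.exists():
        return list(items)
    done = {p.stem for p in progress_dir.glob("*.done")}
    return [i for i in items if str(i) not in done]

# Demo: one of two items was already completed by a previous job.
progress = Path(tempfile.mkdtemp()) / "progress"
write_marker(progress, "2024-01-01T00")
remaining = filter_completed_items(progress, ["2024-01-01T00", "2024-01-02T00"])
```

Because every rank reads markers from the same shared filesystem, this makes the resume early-exit globally consistent, as the `main.py` row notes.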
Reviews (1): Last reviewed commit: "Merge branch 'main' into vnv-recipe-exte..."
```python
if self._executor is not None:
    for f in self._futures:
        f.result()
    self._futures.clear()
```
flush() leaves stale futures on exception
If any future raises, f.result() re-raises immediately and self._futures.clear() is never reached. __exit__ will then encounter the same failed future again (it wraps f.result() in try/except, so it's safe), but the list isn't left in a clean state after a failed flush() call. Consider using a swap-and-clear pattern to guarantee the list is cleared:
Suggested change:

```python
def flush(self) -> None:
    """Wait for all pending threaded writes to complete.

    Called between work items during resume runs to guarantee that all
    zarr writes have landed before a completion marker is written.
    Raises immediately if any pending write failed.
    """
    if self._executor is not None:
        futures, self._futures = self._futures, []
        for f in futures:
            f.result()
```
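A minimal, self-contained demonstration of why swap-and-clear matters (the `Writer` class here is a stand-in, not the real `OutputManager`): even when a pending write fails and `flush()` re-raises, the pending list is left empty.

```python
from concurrent.futures import ThreadPoolExecutor

class Writer:
    """Toy stand-in for a threaded-write manager."""
    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=2)
        self._futures = []

    def submit(self, fn):
        self._futures.append(self._executor.submit(fn))

    def flush(self):
        # Swap-and-clear: detach the pending list before waiting, so the
        # list is clean even if result() re-raises a write failure.
        futures, self._futures = self._futures, []
        for f in futures:
            f.result()

def failing_write():
    raise RuntimeError("write failed")

w = Writer()
w.submit(failing_write)
try:
    w.flush()
    flushed_ok = True
except RuntimeError:
    flushed_ok = False
w._executor.shutdown(wait=True)
```

After this runs, `flushed_ok` is `False` but `w._futures` is empty, so a later `__exit__` or retry does not re-encounter the failed future.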
```python
# --- Sentinel file ------------------------------------------------------
if dist.distributed:
    torch.distributed.barrier()

if dist.rank == 0:
    sp = sentinel_path(cfg)
```
Custom pipelines not supported with an unhelpful error message
build_pipeline() in pipeline.py accepts fully-qualified class names for custom pipelines, but predownload.py only handles "forecast" and "diagnostic". If a user sets pipeline: my_module.MyPipeline and runs predownload.py, they get ValueError: Unknown pipeline 'my_module.MyPipeline'. Expected 'forecast' or 'diagnostic'. with no hint that they need to implement their own pre-download step. Consider improving the error message:
Suggested change:

```python
else:
    raise ValueError(
        f"Unknown built-in pipeline '{pipeline}'. "
        "predownload.py only supports 'forecast' and 'diagnostic'. "
        "For custom pipelines, implement pre-download logic directly "
        "or extend this script."
    )
```
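For context, the fully-qualified class-name resolution that `build_pipeline()` is described as doing can be sketched with `importlib`. This is an assumption about the mechanism, not the actual implementation; the built-in `'forecast'`/`'diagnostic'` registry branch is elided:

```python
import importlib

def resolve_pipeline_class(name: str):
    """Resolve 'pkg.module.ClassName' to the class object.
    Hypothetical helper; the real build_pipeline() also handles the
    built-in 'forecast' and 'diagnostic' names via its registry."""
    module_name, _, class_name = name.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a fully-qualified class name, got {name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Any importable class can be resolved this way, e.g. `resolve_pipeline_class("collections.OrderedDict")` returns the `OrderedDict` class itself.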
```python
with OutputManager(cfg) as output_mgr:
    output_mgr.validate_output_store(total_coords, list(VARIABLES))
    pipeline.run(
        work_items=my_items,
        prognostic=prognostic,
        data_source=data_source,
        output_mgr=output_mgr,
        nsteps=nsteps,
        output_variables=list(VARIABLES),
        device=dist.device,
    )
```
Bypasses `pipeline.setup()` distributed barriers
The pipeline is assembled via direct attribute assignment instead of pipeline.setup(cfg, device). ForecastPipeline.setup() contains the comment "All ranks must participate in model loading for barrier correctness" because load_prognostic uses run_on_rank0_first. This is safe here only because Persistence never triggers those barriers, but the pattern could mislead future test maintainers. Consider calling pipeline.setup(cfg, dist.device) or at least leaving a comment explaining why direct assignment is used.
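The rank-0-first pattern this comment refers to can be illustrated with plain threads standing in for ranks. This is a hedged sketch: the real `run_on_rank0_first` helper in the recipe uses `torch.distributed` barriers, and its exact signature is not shown here.

```python
import threading

def run_on_rank0_first(rank: int, barrier: threading.Barrier, fn):
    """Hypothetical sketch: rank 0 runs fn (e.g. a model download/load)
    before any other rank starts, so the other ranks hit a warm cache.
    Every rank must reach the barrier or the others hang forever, which
    is why skipping setup() on some ranks is dangerous in general."""
    if rank == 0:
        result = fn()
        barrier.wait()  # release the waiting ranks
    else:
        barrier.wait()  # block until rank 0 has finished
        result = fn()
    return result

# Two threads standing in for two ranks; record execution order.
order = []
barrier = threading.Barrier(2)
threads = [
    threading.Thread(target=run_on_rank0_first,
                     args=(r, barrier, lambda r=r: order.append(r)))
    for r in (0, 1)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Rank 0 always appends before releasing the barrier, so `order` ends up `[0, 1]`; a test that assigns attributes directly never reaches the barrier, which is only safe if no other rank is waiting on it.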
Earth2Studio Pull Request
Description
An incremental update to the eval recipe, making it more flexible for different types of models via the introduction of the `Pipeline` interface (description copied from the README below). It also reorganizes the config system to reduce bloat from supporting different models and inference campaigns.
Pipeline interface
All inference logic is driven by a `Pipeline`, an abstract base class (`src/pipeline.py`) that separates per-work-item inference from the shared scaffolding (work iteration, output filtering, ensemble injection, zarr writes). Subclasses implement three methods:

- `setup(cfg, device)`
- `build_total_coords(times, ensemble_size)`
- `run_item(item, data_source, device)`: yields `(tensor, coords)` pairs for one work item

The base class `Pipeline.run()` handles everything else: iterating work items, building the output variable filter, injecting the ensemble dimension, and writing to the `OutputManager`.

Two built-in pipelines are provided:
- `ForecastPipeline` (`pipeline: forecast`): prognostic rollout with optional diagnostic models. Yields one output per lead-time step.
- `DiagnosticPipeline` (`pipeline: diagnostic`): diagnostic-only (no prognostic model). Yields a single output per work item.
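The interface described above can be sketched as follows. Method names follow the README; everything beyond the three listed hooks (the `run()` signature, the toy subclass, and the in-memory output manager) is a simplified assumption, not the recipe's actual code:

```python
from abc import ABC, abstractmethod

class Pipeline(ABC):
    """Sketch of the ABC: three abstract hooks plus a shared run() loop."""

    @abstractmethod
    def setup(self, cfg, device):
        """Load models and any per-rank state."""

    @abstractmethod
    def build_total_coords(self, times, ensemble_size):
        """Return the full output coordinate system."""

    @abstractmethod
    def run_item(self, item, data_source, device):
        """Yield (tensor, coords) pairs for one work item."""

    def run(self, work_items, data_source, output_mgr, device):
        # Shared scaffolding: iterate work items and write every output.
        # (Output filtering and ensemble injection are elided here.)
        for item in work_items:
            for tensor, coords in self.run_item(item, data_source, device):
                output_mgr.write(tensor, coords)

class EchoPipeline(Pipeline):
    """Toy subclass: yields each work item back as a single output."""
    def setup(self, cfg, device):
        pass
    def build_total_coords(self, times, ensemble_size):
        return {"time": times, "ensemble": list(range(ensemble_size))}
    def run_item(self, item, data_source, device):
        yield item, {"time": item}

class ListOutput:
    """Stand-in output manager that collects writes in memory."""
    def __init__(self):
        self.written = []
    def write(self, tensor, coords):
        self.written.append((tensor, coords))

out = ListOutput()
EchoPipeline().run([1, 2], data_source=None, output_mgr=out, device=None)
```

The split keeps per-model logic in `run_item` while `run()` owns the loop, which is what lets both built-in pipelines and custom subclasses share the same output path.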
Custom pipelines
To add a custom inference loop, subclass `Pipeline` and set `pipeline` in your Hydra config to the fully-qualified class name:
Custom pipelines inherit the full shared machinery — distributed output
management, ensemble dimension handling, threaded zarr writes — for free.
Checklist
Dependencies