Opt-in array-backed loading for !include netCDF resources #4

Open
bjarketol wants to merge 1 commit into EUFLOW:main from bjarketol:array-backed-netcdf-loader

Summary

Adds an opt-in way to keep !included netCDF resources as numpy arrays instead of nested Python lists, avoiding a large memory blow-up for big resources (e.g. per-turbine time series).

  • load_yaml(..., nc_data="array") keeps included netCDF data as numpy arrays. nc_data is threaded through _get_YAML/_ds2yml and propagates into nested includes. Default ("list") is unchanged.
  • _fmt is made ndarray-safe (the elementwise != {} filter raised on arrays).
  • validate(..., array_data=True) adds structure-only validation for array-backed inputs: arrays are replaced by [] so jsonschema validates keys/dims without materialising or iterating the bulk data (jsonschema cannot accept ndarrays, and iterating large arrays is O(N)).
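The structure-only validation described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in validator.py: the helper name strip_arrays and the sample resource dict are hypothetical, and only numpy is assumed.

```python
import numpy as np


def strip_arrays(node):
    """Return a copy of `node` with every ndarray replaced by an empty list.

    Hypothetical helper: the result keeps keys and dims intact, so a schema
    validator can check the structure without touching the bulk data.
    """
    if isinstance(node, np.ndarray):
        return []  # placeholder; the array itself is never iterated
    if isinstance(node, dict):
        return {key: strip_arrays(value) for key, value in node.items()}
    if isinstance(node, (list, tuple)):
        return [strip_arrays(value) for value in node]
    return node


# Illustrative array-backed resource (names are made up for this sketch).
resource = {
    "name": "site",
    "wind_speed": {"dims": ["time"], "data": np.zeros(1000)},
}
skeleton = strip_arrays(resource)
```

In this sketch the original resource is left untouched and only the skeleton would be handed to jsonschema. One consequence of the placeholder approach is that schema constraints on the data itself (for example minItems or item types) are not exercised.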

Why

_ds2yml currently calls xr.Dataset.to_dict(), which turns array data into nested Python lists — roughly 4–28× the numpy footprint (Python float objects vs packed float64). For large time-series resources this dominates load memory.
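The lower end of that range can be reproduced with a rough back-of-the-envelope check. The sketch below assumes 64-bit CPython and numpy; it compares a flat float64 array with the equivalent Python list and does not involve xarray or the loader itself.

```python
import sys

import numpy as np

n = 100_000
arr = np.random.default_rng(0).random(n)  # packed float64, 8 bytes per value
as_list = arr.tolist()                    # one Python float object per value

array_bytes = arr.nbytes
# List cost: the list's pointer table plus each boxed float object.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
ratio = list_bytes / array_bytes

print(f"array: {array_bytes} B, list: {list_bytes} B, ratio: {ratio:.1f}x")
```

For a flat list this typically lands near the bottom of the quoted range; nested lists add per-row list overhead on top, which is one plausible reason the upper figure is much higher.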

Measured on a 16 MB wind_resource.nc:

| Mode | Load peak |
|---|---|
| nc_data="list" (default) | 76 MB |
| nc_data="array" | 19 MB |

→ ~4× lower peak, with no change to default behaviour.

Compatibility

Both additions are opt-in; the default dict-of-lists representation and full validation are untouched, so existing consumers are unaffected. Includes tests for array round-trip equivalence and structure-only validation.

Note: the 3 failing tests on main (test_schema, wind_farm_schema_unit_test, turbine regression) are pre-existing and unrelated to this change (it touches only yaml.py/validator.py).

🤖 Generated with Claude Code

Keep included netCDF data as numpy arrays instead of nested Python lists,
avoiding a ~4-28x memory blow-up for large resources:

- load_yaml/_get_YAML/_ds2yml gain an nc_data option ("list" default,
  "array" keeps numpy arrays); nc_data propagates into nested includes
- _fmt is made ndarray-safe (the elementwise "!= {}" filter broke on arrays)
- validate() gains array_data=True for structure-only validation: arrays are
  replaced by [] so jsonschema checks keys/dims without materialising or
  iterating the bulk data
- tests for array round-trip equivalence and structure-only validation
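To illustrate the _fmt point: comparing an ndarray with {} is evaluated elementwise by numpy, so using the result as a filter condition tends to fail with an ambiguous-truth-value error. A minimal sketch of an ndarray-safe filter is shown below; is_empty_dict and drop_empty_dicts are hypothetical names, not the functions in yaml.py.

```python
import numpy as np


def is_empty_dict(value):
    """True only for an empty dict; an ndarray is never compared with {}."""
    return isinstance(value, dict) and len(value) == 0


def drop_empty_dicts(mapping):
    """Hypothetical stand-in for the filtering step: drop entries equal to {}."""
    return {key: value for key, value in mapping.items() if not is_empty_dict(value)}


# Illustrative record mixing an array, an empty dict, and a plain list.
record = {"data": np.arange(3.0), "attrs": {}, "dims": ["time"]}
cleaned = drop_empty_dicts(record)
```

Checking the type before comparing keeps the behaviour identical for lists and scalars while letting arrays pass through unchanged.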

Default behaviour (lists, full validation) is unchanged; both are opt-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>