Skip to content

Out-of-core fields: move unstructured interpolators to .isel, and add a time-windowed array cache (WindowedArray) #2656

@fluidnumericsJoe

Description

@fluidnumericsJoe

Summary

For datasets whose full time series doesn't fit in RAM (the common production case — global 1/12° u,v,w for two time levels alone is ~10–21 GiB), Parcels v4 must open fields lazily with dask. In that regime field sampling is orders of magnitude slower than it needs to be, and the unstructured path can't be made lazy at all as currently written. This proposes two changes:

  1. Unstructured interpolators: replace field.data.values[...] with vectorized .isel(...) — a prerequisite for out-of-core UxGrid fields and for any caching layer.
  2. Add a WindowedArray wrapper that transparently keeps only the time level(s) bracketing the simulation clock resident as NumPy, loading on demand and retiring stale levels — a near-drop-in that removes both the read-amplification and the dask scheduling overhead.

Full measured evidence + reproducers: https://github.com/FluidNumerics/uxxarray-perf-modeling

Motivation (measured)

On an identical structured-grid advection (2000 particles, 480 steps), dask-backed vs .load()-ed-into-RAM:

field backing wall time
in-memory (numpy) 6.6 s
dask-backed (on-disk) 2156 s (≈ 327×)

This is not disk-bound — the field is cache-resident and the process is CPU-pinned. Two independent causes:

  • Read amplification. Per kernel step, scattered per-particle isel pulls whole chunks; the same time-slab chunks are re-read every step. Measured 40–52× bytes-read vs bytes-used (and small chunks collapse to fio-4 KiB-random throughput, ~34 MB/s).
  • dask per-compute() scheduling overhead. With the field fully in RAM (.persist(), zero I/O), a single scattered isel().compute() is ~200× slower than the numpy equivalent (0.3 ms → 63 ms). Profiling (cProfile + VizTracer) attributes it to the threaded scheduler's lock/condition machinery plus per-call graph build/tokenize — not array work.

We checked the dask-only escape hatches; none close the gap:

  • scheduler="synchronous" + larger chunks: ~6–12× faster, but floors at ~47× numpy (graph build/tokenize is paid per .compute()).
  • Batching U/V/W into one dask.compute(): reduces call count (RK2-3D: 6→2) but not wall time — cost is per-task, not per-call.
  • Bigger chunks: fewer tasks (faster in RAM) but read amplification is unchanged on disk — only pays off inside a resident window.

The structural fix is to stop sampling lazily per step: read the bracketing time level(s) once into NumPy, reuse across all particles and sub-steps, evict when the clock advances. A prototype of exactly this is 6× faster and reads 12× less data at production scale (20 GB Atlantic), with byte-identical trajectories.

Root cause in the code

  • Structured (interpolators/_xinterpolators.py) samples lazily: data.isel(selection_dict).data.reshape(...) (lines 82, 244, 373, 563) then value.compute() if is_dask_collection(value) (135, 299–300, 381, 570, 626). Lazy seam exists.
  • Unstructured (interpolators/_uxinterpolators.py) samples via field.data.values[ti, zi, fi] (lines 20, 41–42, 65, 87–88). .values materializes the entire field, then NumPy fancy-indexes. For a dask-backed UxGrid field this requires the whole field in memory (or recomputes it per call) → precludes out-of-core operation, and leaves no lazy seam to cache behind.
  • .sel is not used anywhere in the interpolators.

Proposal 1 — unstructured interpolators → .isel

Index before materializing, mirroring the structured path:

# now (materializes the whole field):
field.data.values[ti, zi, fi]
field.data.values[ti[:, None], zi[:, None], node_ids]

# proposed (lazy, only the requested points):
field.data.isel(time=ti, n_z=zi, n_face=fi).data
field.data.isel(time=ti[:, None], n_z=zi[:, None], n_node=node_ids).data

Benefits independent of caching: UxGrid fields become genuinely out-of-core-capable (no full materialization / no OOM on large meshes), and the structured & unstructured paths share one lazy access idiom. The mesh connectivity (uxgrid.face_node_connectivity, node coords) stays a separate, static, one-time load — only data variables are touched lazily.

Proposal 2 — WindowedArray

A thin wrapper around field.data that exploits the fact that all particles share the simulation clock, so any access touches only {ti, ti+1}:

class WindowedArray:
    """Wrap a lazy DataArray; .isel() loads only the requested time levels to
    NumPy, retires levels below the current minimum, and gathers on the small
    resident block. All other attributes forward to the wrapped array."""
    # on .isel(time=..., ...):
    #   1. levels = unique(time indices)
    #   2. load missing levels -> NumPy (one bulk read each); evict levels < min(levels)
    #   3. gather on the stacked resident block (vectorized isel in NumPy)

Because the result is NumPy, the structured interpolator's value.compute() if is_dask_collection(value) guard auto-skips .compute() — so this removes the re-reads and the scheduler tax with no interpolator change. Integration is one line where Field.data is set (e.g. in FieldSet.from_sgrid_conventions / from_ugrid_conventions, guarded by "is dask-backed"):

field.data = WindowedArray(field.data)

Verified in the reference repo: identical to dask (max|Δ| = 0), each time level loaded once (20 loads / 60 steps vs 120 naive gathers), and ≤ 2 levels resident at any time. Works for XGrid today; works for UxGrid once Proposal 1 lands (the wrapper is grid-shape-agnostic — it windows the leading time axis and forwards the rest).

Why this design

The working set is tiny even when the dataset is enormous: a simulation only ever needs the two time levels bracketing "now." Windowing turns "hundreds of thousands of tiny scattered dask gathers" into "one bulk sequential read per time level + free NumPy indexing," with memory bounded to ~2 levels regardless of series length.

Scope / open questions

  • Correctness guard: valid only while requested time indices stay within the resident window. Add a cheap assert (or fall back to a wider load) for kernels that sample arbitrary past/future times. Needs care for backward-in-time runs and multiple dts.
  • Prefetch the next level on a background thread to overlap the boundary reload (prototype has this in the standalone sampler).
  • Spatial/depth subsetting (load only the particles' bbox / depth range) is a natural follow-on for very large meshes.
  • Assumes time is the leading dim of field.data (true for the current Field construction).

Alternatives considered

  • dask.cache.Cache(...).register() — zero-code, caches loaded chunks across .compute() calls (removes re-reads) but keeps the per-call scheduler tax; eviction is byte-budget/LRU, not time-aware.
  • Custom xarray BackendArray duck-array — the most robust/idiomatic home for the window logic (xarray's lazy-indexing adapter calls __getitem__ with windowed indices), but more plumbing. Could be the long-term form of WindowedArray.
  • dask-only tuning (synchronous scheduler, larger chunks, batched compute) — worth setting as defaults for the lazy path, but caps at ~47× numpy; doesn't replace windowing.

Proposed work / acceptance criteria

  • Switch _uxinterpolators.py data access from .values[...] to vectorized .isel(...); confirm unchanged results on existing UxGrid tests and that a chunked/dask UxGrid field runs without full materialization.
  • Add WindowedArray (or a BackendArray equivalent) with load-on-demand + retire-stale + optional prefetch and a window-span assertion.
  • Wire field.data = WindowedArray(field.data) for dask-backed fields in the FieldSet constructors (opt-in flag initially).
  • Benchmark: identical trajectories vs the current path; report speedup + peak memory on a large structured and a large unstructured case.

Filed by Claude Code on behalf of Joe Schoonover (FluidNumerics). Evidence and reproducers: https://github.com/FluidNumerics/uxxarray-perf-modeling

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions