[data] Zarr datasource by alexandrplashchinsky · Pull Request #63003 · ray-project/ray

alexandrplashchinsky · 2026-04-28T21:19:05Z

Description

This PR introduces ray.data.read_zarr() and the backing ZarrV2Datasource for reading Zarr v2 stores with Ray Data.

This adds a dedicated public API and datasource implementation for Zarr v2 so users can read chunk metadata from consolidated Zarr v2 stores through the standard Ray Data read API surface.

Related issues

N/A

Additional information

This PR adds:

ray.data.read_zarr() as a new public Ray Data read API
ZarrV2Datasource as the datasource implementation used by the API
unit tests covering the datasource behavior and API wiring

Example usage:

import ray

ds = ray.data.read_zarr("/path/to/store")

gemini-code-assist

Code Review

This pull request introduces support for reading Zarr v2 stores in Ray Data by adding the ZarrV2Datasource and a public read_zarrv2 API. The implementation includes support for various storage backends (local, S3, Azure) and handles chunk metadata, slice bounds, and padding. Feedback focuses on reducing logic duplication in chunk calculation, simplifying path normalization, converting a utility method to a static method, and adhering to standard Python formatting for keyword arguments.

richardliaw · 2026-05-12T04:50:00Z

Can we rename this so that it's read_zarr?

Address four edge-case review findings in ZarrV2Datasource:

1. Pin local:// stores to the driver node: set supports_distributed_reads from the path scheme (like FileBasedDatasource) so read tasks aren't scheduled on workers that can't see the driver's local disk.
2. Detect consolidated metadata by trying open_consolidated rather than a separately-built exists() probe. The probe could disagree with the mapper's key lookup (e.g. archive/root stores with an empty store path) and wrongly treat a consolidated store as unconsolidated.
3. Reject a group path passed via array_paths on an unconsolidated store with a clear "is a group, not an array" error instead of a confusing AttributeError later. (The consolidated and full-scan paths already filter to arrays.)
4. Validate array_paths for single root-level array stores so a bad path errors instead of silently returning the root array.

Add a test for each.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
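As a rough illustration of the first finding (pinning local:// stores to the driver), the scheme check might look like the sketch below. The helper name `supports_distributed_reads_for` is hypothetical, and the sketch assumes the decision depends only on whether the path uses the local:// scheme; the datasource in this PR may decide differently.

```python
from urllib.parse import urlparse


def supports_distributed_reads_for(path: str) -> bool:
    """Hypothetical helper: False when the store lives on the driver's disk.

    Assumption: only the explicit "local" URI scheme marks a store as
    driver-local. Plain paths and remote schemes stay distributable.
    """
    return urlparse(path).scheme != "local"


if __name__ == "__main__":
    for p in ("local:///tmp/store.zarr", "s3://bucket/store.zarr", "/tmp/store.zarr"):
        print(p, supports_distributed_reads_for(p))
```

A read planner could use a flag like this to keep read tasks on the driver node when the result is False.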

The "This guide covers" list linked to #cloud-storage-and-credentials, but that section was renamed to "Zarr's .zattrs". Sphinx emits a myst.xref_missing warning, which ReadTheDocs (fail_on_warning: true) turns into a build failure -- though Buildkite's doc build tolerates it. Repoint the bullet to the .zattrs section via an explicit `(zarr-zattrs)=` target so the link doesn't depend on the auto-generated heading slug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

chunk_shapes validation used isinstance(x, int), which rejected NumPy scalar integers (numpy.int64, etc.) even when positive -- a common case since chunk sizes are often derived from array metadata. Accept any numbers.Integral (excluding bool) via a shared _is_positive_int helper, and normalize stored values to plain ints. Adds a test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
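For readers unfamiliar with the pattern, a check of this kind could be sketched as follows. The names `is_positive_int` and `normalize_chunk_shape` are illustrative and not the PR's actual helpers; the sketch relies only on the standard `numbers` module. NumPy registers its integer scalar types as `numbers.Integral`, which is why such a check would accept them.

```python
import numbers
from typing import Iterable, Tuple


def is_positive_int(x: object) -> bool:
    """True for any positive Integral value, excluding bool."""
    # bool is a subclass of int, so it must be rejected explicitly.
    if isinstance(x, bool):
        return False
    return isinstance(x, numbers.Integral) and int(x) > 0


def normalize_chunk_shape(shape: Iterable[object]) -> Tuple[int, ...]:
    """Validate each entry and store it as a plain Python int."""
    values = list(shape)
    for v in values:
        if not is_positive_int(v):
            raise ValueError(f"chunk sizes must be positive integers, got {v!r}")
    return tuple(int(v) for v in values)
```

Converting with `int()` after validation means downstream code never has to care whether the caller passed Python ints or array-library scalars.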

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

richardliaw · 2026-06-18T22:27:17Z

+    def read_fn() -> Iterable[pd.DataFrame]:
+        yield pd.DataFrame(
+            {
+                "array": [d.array_name for d in batch],
+                "chunk_index": [d.chunk_index for d in batch],
+                "chunk_slices": [d.chunk_slices for d in batch],
+                "chunk": [
+                    _read_chunk(root, d.array_name, d.chunk_slices) for d in batch
+                ],
+            }
+        )
+
+    return read_fn


should we yield pyarrow instead?

You are right. I checked other datasources and they use DelegatingBlockBuilder in this place so I adopted that.

elliot-barn · 2026-06-18T22:55:21Z

 torchvision==0.24.0
 confluent-kafka
+zarr<3 ; python_version >= '3.11'  # zarr 2.18.4+ requires py3.11+ (v2 API)
+zarr>=2.18,<2.18.4 ; python_version < '3.11'  # 2.18.3: last v2 line supporting py3.10


can you remove the upper bound? should recompile without issue since the lock files have a pinned version

tldr on why here: https://iscinumpy.dev/post/bound-version-constraints/#tldr

Build read-task output with DelegatingBlockBuilder (-> ArrowBlockBuilder) instead of hand-constructing pandas DataFrames, matching the tensor/per-row datasources (image, audio, video, torch). Blocks are now pyarrow Tables and the Arrow tensor extension handles the variable-shaped `chunk` column (shorter trailing-edge chunks) automatically. Drops the pandas dependency in the datasource.

Test updates:

- _execute_read_tasks converts each (now-Arrow) block to pandas.
- _reconstruct_array sorts by a tuple key, since chunk_index/chunk_slices round-trip as Arrow lists, not Python tuples.
- Drop ray_start_regular_shared from the two auto-init tests: building an Arrow block auto-inits Ray, which conflicted with the fixture's unguarded ray.init() (the rest of the module already relies on auto-init).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
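To make the tuple-key point concrete, here is a small sketch in plain Python of reassembling a 1-D array from rows whose `chunk_index` comes back as a list rather than a tuple. The row layout mirrors the `chunk_index` and `chunk` columns mentioned in this PR, but the data and the `reconstruct_1d` helper are made up for illustration.

```python
from typing import Any, Dict, List


def reconstruct_1d(rows: List[Dict[str, Any]]) -> List[int]:
    """Order rows by chunk_index and concatenate their chunks.

    tuple() accepts any iterable, so the sort key is the same whether
    chunk_index arrives as a list, a tuple, or another sequence type.
    """
    ordered = sorted(rows, key=lambda r: tuple(r["chunk_index"]))
    out: List[int] = []
    for row in ordered:
        out.extend(row["chunk"])
    return out


# Rows arrive out of order; the trailing-edge chunk is shorter than the rest.
rows = [
    {"chunk_index": [1], "chunk": [4, 5]},
    {"chunk_index": [0], "chunk": [0, 1, 2, 3]},
]
result = reconstruct_1d(rows)
```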

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Remove three tests whose coverage is fully subsumed elsewhere:

- test_align_axis_0_accepts_per_array_chunk_shapes: dict chunk_shapes resolution is covered by test_chunk_shapes_resolution_across_mixed_rank (asserts _array_chunks directly), and aligned wide-row output by test_align_axis_0_emits_wide_rows; the aligned path consumes the resolved chunks regardless of dict vs sequence, so the combination adds no path.
- test_overlap_enables_windowing_without_cross_row_loss: its assertion is pure arithmetic on the per-row data extents already asserted by test_overlap_extends_chunk_data; it exercises no new datasource behavior.
- test_align_axis_0_column_set: the no-array_paths case duplicated the column assertion in test_align_axis_0_emits_wide_rows; de-parametrized to keep only the array_paths-filtering case, which is its unique coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Fix ray-project#2 (limit pushdown): the read fns now slice their batch to per_task_row_limit, so a downstream limit(K) reads ~K chunks instead of the whole batch's I/O. Previously ReadTask only truncated the already-built block (_iter_sliced_blocks), so every chunk in the batch was still fetched.

Fix ray-project#4 (retries): chunk reads are wrapped in iterate_with_retry(match=DataContext.retried_io_errors) -- the same mechanism FileBasedDatasource uses -- so zarr reads now honor Ray Data's retry config. The underlying filesystem's own retry still applies underneath.

Tests: per_task_row_limit caps the number of _read_chunk calls (not just the output row count); _read_chunk retries a transient error then succeeds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
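The two behaviors described here can be sketched without Ray. The following is not Ray's iterate_with_retry; `call_with_retry` and `read_batch` are hypothetical stand-ins, and the sketch assumes that retryable errors are recognized by matching substrings of the error message. It shows the limit being applied before any I/O and each chunk read being retried on a matching error.

```python
import time
from typing import Callable, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    *,
    match: Sequence[str],
    max_attempts: int = 3,
    backoff_s: float = 0.0,
) -> T:
    """Call fn, retrying only errors whose message contains a match pattern."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            retryable = any(pattern in str(exc) for pattern in match)
            if not retryable or attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)
    raise ValueError("max_attempts must be at least 1")


def read_batch(
    chunk_ids: Sequence[int],
    read_chunk: Callable[[int], T],
    per_task_row_limit: Optional[int] = None,
    match: Sequence[str] = ("transient",),
) -> Iterator[T]:
    """Yield chunks, slicing the batch first so skipped chunks are never read."""
    if per_task_row_limit is not None:
        chunk_ids = chunk_ids[:per_task_row_limit]
    for cid in chunk_ids:
        yield call_with_retry(lambda cid=cid: read_chunk(cid), match=match)
```

Slicing the id list before the loop is what turns the limit into an I/O saving; truncating the finished output would still have paid for every read.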

_get_long_form_read_tasks no longer materializes a per-chunk descriptor list on the driver -- the product(grid) enumeration was O(total chunks) and ran even for take(1)/limit (e.g. ~64,800 descriptors for one MUR SST array). Read tasks now describe a contiguous flat range of the chunk grid; the read fn unravels each flat index to an N-D chunk_index lazily on the worker. Planning is O(n_tasks) per array, independent of chunk count.

- New _ChunkRange (replaces per-chunk _ChunkDescriptor) + _unravel (row-major, preserving the previous itertools.product ordering).
- size_bytes is now an O(1) upper-bound estimate (full-size chunk per index) instead of an O(chunks) exact sum.
- per_task_row_limit caps the range, not a list slice; aligned path unchanged (already O(output rows)).

Adds a test asserting chunk_index order is identical to grid enumeration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
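The flat-range idea can be sketched in a few lines of standard-library Python. The function names below (`chunk_grid`, `unravel`, `plan_ranges`, `estimate_size_bytes`) are illustrative and are not the PR's internals; the sketch assumes row-major ordering and an even split of the flat index space across tasks.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def chunk_grid(shape: Sequence[int], chunks: Sequence[int]) -> Tuple[int, ...]:
    """Number of chunks along each axis (edge chunks count as one)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def unravel(flat: int, grid: Sequence[int]) -> Tuple[int, ...]:
    """Row-major flat index -> N-D chunk index (last axis varies fastest)."""
    index = []
    for dim in reversed(grid):
        flat, remainder = divmod(flat, dim)
        index.append(remainder)
    return tuple(reversed(index))


def plan_ranges(n_chunks: int, n_tasks: int) -> List[Tuple[int, int]]:
    """Split [0, n_chunks) into contiguous half-open ranges, one per task."""
    n_tasks = max(1, min(n_tasks, n_chunks))
    base, extra = divmod(n_chunks, n_tasks)
    ranges, start = [], 0
    for i in range(n_tasks):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def estimate_size_bytes(rng: Tuple[int, int], chunks: Sequence[int], itemsize: int) -> int:
    """O(1) upper bound: assume every index in the range is a full-size chunk."""
    return (rng[1] - rng[0]) * math.prod(chunks) * itemsize


grid = chunk_grid(shape=(5, 4), chunks=(2, 3))
lazy_order = [unravel(i, grid) for i in range(math.prod(grid))]
eager_order = list(itertools.product(*(range(n) for n in grid)))
```

Here `lazy_order` and `eager_order` are equal, which is the ordering property the commit's new test asserts. The planner only stores start and stop per task, so its cost depends on the number of tasks rather than the number of chunks.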

Replace the hand-rolled flat-index -> N-D unravel helper with numpy's np.unravel_index (the recognized primitive for exactly this). Its default C-order matches the previous ordering, so the emitted chunk_index sequence is unchanged (the ordering test still passes); int() keeps the indices as Python ints. Drops the _unravel helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

ArturNiederfahrenhorst · 2026-06-19T12:54:38Z

I did another round of self-review and also asked CC to review to make this quicker.
Also ran some Anyscale jobs to test again.

pyrefly (CI lint) flagged 3 type errors in the new files. None are runtime bugs (the suite passes); the fixes clarify intent or suppress an intentional fake:

- _create_aligned_read_fn: annotate `row: dict[str, Any]` so assigning a chunk ndarray isn't checked against the all-int TypedDict pyrefly inferred from the t_start/t_stop literals.
- _is_positive_int: `int(x) > 0` (pyrefly can't type `>` on numbers.Integral).
- test_read_chunk_retries_transient_io: `# pyrefly: ignore[bad-argument-type]` for the deliberate fake _Root() passed to _read_chunk (repo convention).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Use the public example dataset in the docs and add an integration test (matching read_zarr, ray-project#63003): the read_lerobot and LeRobotDatasource docstring examples now read s3://anonymous@ray-example-data/lerobot/libero-mini, and test_read_lerobot_integration_public_s3 reads it end-to-end.

Review fixes (Ray Data conventions):

- Expose num_cpus/num_gpus/memory/ray_remote_args/concurrency on read_lerobot, forwarded to read_datasource (mirrors read_images) so the decode-heavy read tasks can be tuned; document the override_num_blocks vs partitioning interaction.
- Dictionary-encode the per-dataset-constant stats and task columns instead of repeating the multi-KB stats JSON on every row, and count the appended columns in the in-memory size estimate.
- Close the fsspec file handles in _CredsVideoDecoderCache.clear() (was leaking a file descriptor per decoded video file).
- Add image-based v3 unit coverage: a synthetic image_camera fixture and test_read_lerobot_image_camera.
- Drop the redundant per-task driver ray.get(roots_ref); pass the already-materialized roots to _LeRobotReadTask, keeping the ref only for the worker read path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit fa16bd3.

richardliaw · 2026-06-19T19:15:16Z

+    if ray.is_initialized():
+        ray.shutdown()
+    ray.init(
+        num_cpus=1,
+        logging_level=logging.ERROR,
+        log_to_driver=False,
+        runtime_env={"worker_process_setup_hook": _register_codec},
+    )
+    try:
+        ds = ray.data.read_zarr(str(store_path))
+        rows = sorted(ds.take_all(), key=lambda r: tuple(r["chunk_index"]))
+        recon = np.concatenate([r["chunk"] for r in rows])
+        np.testing.assert_array_equal(recon, np.arange(8, dtype="u1"))
+    finally:
+        ray.shutdown()


is there no existing ray fixture for this? if not, can we use a fixture instead?

richardliaw · 2026-06-19T19:15:45Z

+        zarrv2_datasource.ZarrV2Datasource(str(tmp_path))
+
+
+def test_explicit_filesystem_strips_uri_scheme(tmp_path):


none of these tests are well-isolated, i think you need the shutdown_only fixture or something

Thanks! Isolated.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

The datasource is backend-agnostic: path/filesystem resolution is delegated to pyarrow/fsspec (shared Ray Data machinery), so a live remote read exercises generic pyarrow/fsspec, not datasource logic. Filesystem handling is already covered hermetically by test_read_zarr_basic_across_filesystems (parametrized over fs flavors on local paths), so the unit-test file stays network-free. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

alexandrplashchinsky requested a review from a team as a code owner April 28, 2026 21:19

gemini-code-assist Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/read_api.py Outdated

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

alexandrplashchinsky force-pushed the zarr-datasource branch from 7390b42 to 1da5b21 Compare April 28, 2026 21:30

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed Apr 28, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

ayushk7102 added the "go" label (add ONLY when ready to merge, run all tests) Apr 28, 2026

ray-gardener Bot added the "data" label (Ray Data-related issues) Apr 29, 2026

alexandrplashchinsky removed the "go" label (add ONLY when ready to merge, run all tests) Apr 29, 2026

alexandrplashchinsky self-assigned this Apr 29, 2026

alexandrplashchinsky force-pushed the zarr-datasource branch 2 times, most recently from 901636d to 687c8fa Compare April 29, 2026 03:12

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/data/tests/datasource/test_zarrv2.py Outdated

alexandrplashchinsky force-pushed the zarr-datasource branch 2 times, most recently from 34ab253 to a57fd8c Compare April 29, 2026 18:34

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/data/tests/datasource/test_zarrv2.py Outdated

cursor Bot reviewed Apr 29, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

alexandrplashchinsky force-pushed the zarr-datasource branch from d991098 to 27539fe Compare April 30, 2026 19:31

alexandrplashchinsky added the "go" label (add ONLY when ready to merge, run all tests) May 1, 2026

cursor Bot reviewed May 1, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

alexandrplashchinsky added and removed the "go" label (add ONLY when ready to merge, run all tests) May 1, 2026

cursor Bot reviewed May 11, 2026

Comment thread python/ray/data/__init__.py Outdated

cursor Bot reviewed May 13, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/tests/datasource/test_zarrv2.py Outdated

cursor Bot reviewed May 13, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed May 13, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed May 14, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed May 14, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

cursor Bot reviewed Jun 18, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

cursor Bot reviewed Jun 18, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

ArturNiederfahrenhorst and others added 2 commits June 18, 2026 23:23

delete test

d4b97ec

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

richardliaw reviewed Jun 18, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py Outdated

richardliaw reviewed Jun 18, 2026

elliot-barn reviewed Jun 18, 2026

ArturNiederfahrenhorst and others added 6 commits June 19, 2026 09:57

datasource polish

5cf44f2

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

fix test

a7ebe8f

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed Jun 19, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

cursor Bot reviewed Jun 19, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

polish tests

78e4242

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

simplify retries

fa16bd3

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

cursor Bot reviewed Jun 19, 2026

Comment thread python/ray/data/_internal/datasource/zarrv2_datasource.py

richardliaw reviewed Jun 19, 2026

ArturNiederfahrenhorst and others added 4 commits June 21, 2026 23:03

Richard's comment

5d29c90

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

polish

e869a76

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>

Merge branch 'master' into zarr-datasource

2c087d4
