[Data] Add glob pattern support to read_xxx APIs by jiangxt2 · Pull Request #64247 · ray-project/ray

jiangxt2 · 2026-06-21T14:32:41Z

Description

Add glob pattern support (*, **, ?, [...]) to all read_xxx APIs in Ray Data, enabling ray.data.read_parquet("s3://bucket/*.parquet") to work natively.

The implementation injects glob expansion into _resolve_paths_and_filesystem() — the single resolution path shared by V1 and V2 datasources — so all read_xxx callers benefit automatically.

Key design decisions:

Uses pathlib.Path.glob() for local paths and FileSelector listing + segment-aware matching for cloud paths (S3, GCS, etc.)
* matches within a single path segment; ** spans multiple segments (dynamic programming matcher)
Bucket-level globs (s3://bucket-*/*.parquet) are explicitly rejected with a clear error
URI query strings (?X-Amz-Signature=...) are not treated as glob characters

Related issues

Fixes #59304

Additional information

Files changed (8):

path_util.py: New helpers _has_glob_chars, _split_glob_base, _glob_match_path, _expand_glob; inject expansion into _resolve_paths_and_filesystem
file_based_datasource.py: Enable expand_globs=True, fix ignore_missing_paths for empty glob results
parquet_datasource.py / parquet_datasource_v2.py / file_indexer.py: Enable glob expansion
read_api.py: Add ignore_missing_paths to read_parquet, add glob examples to all read_xxx docstrings

Tests (31 total):

test_glob_patterns.py (16): Integration tests for local + S3 (moto), V1/V2 parameterization
test_path_util_glob.py (15): Unit tests for pure helper functions

Add _has_glob_chars, _split_glob_base, _glob_match_path, _glob_match_segs, _expand_glob to path_util.py. Modify _resolve_paths_and_filesystem with expand_globs parameter to detect and expand glob patterns before path resolution. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>

Pass expand_globs=True to _resolve_paths_and_filesystem in ParquetDatasource, ParquetDatasourceV2, FileBasedDatasource, and NonSamplingFileIndexer. Relax assert in file_indexer to support multiple resolved paths from glob expansion. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>

Add ignore_missing_paths parameter to read_parquet API, consistent with other read_xxx functions. Add glob pattern usage examples to docstring. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>

Add integration tests for local glob patterns (*, **, ?, [...]) across multiple datasources with V1/V2 parameterization. Add unit tests for glob helper functions in path_util. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>

- Fix recursive listing: check glob chars in non-terminal path segments instead of simply checking '/' in pattern (e.g. sub*/*.parquet needs recursive, but *.parquet does not) - Add S3 (moto) tests: single wildcard, recursive glob, bucket glob rejection Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>

V1: FileBasedDatasource.__init__ skips expand_paths when glob returns empty paths with ignore_missing_paths=True, avoiding zip(*[]) crash. Stores empty lists so get_read_tasks returns an empty Dataset. V2: _read_datasource_v2 returns empty Dataset when sample is empty and ignore_missing_paths=True, instead of unconditionally raising ValueError. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>

gemini-code-assist

Code Review

This pull request introduces robust glob pattern support (including recursive and segment-aware matching) across Ray Data readers, resolving patterns before path resolution. The review feedback highlights a few important issues: a variable shadowing bug in _resolve_paths_and_filesystem that overwrites the shared filesystem parameter across loop iterations; OS-dependent case-sensitivity bugs on Windows due to using fnmatch.fnmatch instead of fnmatch.fnmatchcase for cloud paths; a potential unpacking crash when paths is empty; and a risk of AttributeError when directly accessing datasource.ignore_missing_paths instead of using getattr.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-21T14:34:52Z

    resolved_paths = []
    for path in paths:
+        # Expand glob patterns before resolving individual paths.
+        # Only enabled when callers explicitly opt in (expand_globs=True),
+        # since this function is also used by write, download, and
+        # partition paths where glob expansion is undesirable.
+        if expand_globs and _has_glob_chars(path):
+            # Resolve relative paths to absolute before glob expansion
+            # to ensure consistency across processes (e.g., Ray workers).
+            # Only for local paths — cloud URIs (s3://, gs://, etc.) are
+            # already absolute; os.path.isabs misidentifies them as relative.
+            parsed = urlparse(path)
+            is_cloud_fs = filesystem is not None and not isinstance(
+                filesystem, LocalFileSystem
+            )
+            if not parsed.scheme and not os.path.isabs(path) and not is_cloud_fs:
+                path = os.path.abspath(path)
+
+            if filesystem is None:
+                try:
+                    filesystem, _ = _resolve_single_path_with_fallback(path, filesystem)
+                except (ValueError, ImportError) as e:
+                    if ignore_missing_paths:
+                        logger.debug(
+                            f"Failed to resolve filesystem for glob "
+                            f"'{path}': {e}, skipping"
+                        )
+                        continue
+                    raise ValueError(
+                        f"Failed to resolve filesystem for glob pattern '{path}': {e}"
+                    ) from e
+            expanded = _expand_glob(path, filesystem, ignore_missing_paths)
+            resolved_paths.extend(expanded)
+            continue


Reassigning the filesystem variable inside the loop (filesystem, _ = _resolve_single_path_with_fallback(...)) shadows and overwrites the input filesystem parameter for all subsequent iterations. If the input list contains multiple paths that require different filesystems (or if the first is a glob and the second is a non-glob), subsequent paths will incorrectly use the filesystem resolved from the first path.

To fix this, track the resolved filesystem in a separate variable (e.g., resolved_filesystem) and use a local resolved_fs inside the loop to avoid mutating the shared filesystem state.

resolved_filesystem = filesystem resolved_paths = [] for path in paths: # Expand glob patterns before resolving individual paths. # Only enabled when callers explicitly opt in (expand_globs=True), # since this function is also used by write, download, and # partition paths where glob expansion is undesirable. if expand_globs and _has_glob_chars(path): # Resolve relative paths to absolute before glob expansion # to ensure consistency across processes (e.g., Ray workers). # Only for local paths — cloud URIs (s3://, gs://, etc.) are # already absolute; os.path.isabs misidentifies them as relative. parsed = urlparse(path) is_cloud_fs = filesystem is not None and not isinstance( filesystem, LocalFileSystem ) if not parsed.scheme and not os.path.isabs(path) and not is_cloud_fs: path = os.path.abspath(path) resolved_fs = filesystem if resolved_fs is None: try: resolved_fs, _ = _resolve_single_path_with_fallback(path, resolved_fs) except (ValueError, ImportError) as e: if ignore_missing_paths: logger.debug( f"Failed to resolve filesystem for glob " f"'{path}': {e}, skipping" ) continue raise ValueError( f"Failed to resolve filesystem for glob pattern '{path}': {e}" ) from e expanded = _expand_glob(path, resolved_fs, ignore_missing_paths) resolved_paths.extend(expanded) resolved_filesystem = resolved_fs continue

gemini-code-assist · 2026-06-21T14:34:53Z

+            for j in range(1, n_rel + 1):
+                if fnmatch.fnmatch(rel_segs[j - 1], seg):
+                    dp[i][j] = dp[i - 1][j - 1]


fnmatch.fnmatch is OS-dependent (case-insensitive on Windows, case-sensitive on Unix). However, cloud filesystems (S3, GCS, ADLS) are strictly case-sensitive regardless of the operating system of the driver or worker. Using fnmatch.fnmatch can lead to inconsistent behavior and bugs when running on Windows drivers. Use fnmatch.fnmatchcase instead to ensure consistent, case-sensitive matching for cloud paths.

Suggested change

for j in range(1, n_rel + 1):

if fnmatch.fnmatch(rel_segs[j - 1], seg):

dp[i][j] = dp[i - 1][j - 1]

for j in range(1, n_rel + 1):

if fnmatch.fnmatchcase(rel_segs[j - 1], seg):

dp[i][j] = dp[i - 1][j - 1]

gemini-code-assist · 2026-06-21T14:34:53Z

        if ignore_missing_paths and len(paths) == 0:
-            raise ValueError(
-                "None of the provided paths exist. "
-                "The 'ignore_missing_paths' field is set to True."
+            file_sizes = []
+        else:
+            paths, file_sizes = map(
+                list,
+                zip(
+                    *meta_provider.expand_paths(
+                        paths,
+                        self._filesystem,
+                        partitioning,
+                        ignore_missing_paths=ignore_missing_paths,
+                    )
+                ),
            )


If paths is empty and ignore_missing_paths is False, the code will fall through to the else block and attempt to unpack paths, file_sizes = map(list, zip(*...)). Since zip(*[]) is empty, this will raise a cryptic ValueError: not enough values to unpack (expected 2, got 0). Handling the empty paths case explicitly with a clear error message improves robustness and user experience.

Suggested change

if ignore_missing_paths and len(paths) == 0:

raise ValueError(

"None of the provided paths exist. "

"The 'ignore_missing_paths' field is set to True."

file_sizes = []

else:

paths, file_sizes = map(

list,

zip(

*meta_provider.expand_paths(

paths,

self._filesystem,

partitioning,

ignore_missing_paths=ignore_missing_paths,

)

),

)

if len(paths) == 0:

if not ignore_missing_paths:

raise ValueError("No paths resolved or provided.")

file_sizes = []

else:

paths, file_sizes = map(

list,

zip(

*meta_provider.expand_paths(

paths,

self._filesystem,

partitioning,

ignore_missing_paths=ignore_missing_paths,

)

),

)

gemini-code-assist · 2026-06-21T14:34:53Z

+        if datasource.ignore_missing_paths:
+            return from_blocks([])


Directly accessing datasource.ignore_missing_paths can raise an AttributeError if the datasource implementation does not define or expose this attribute (or if it is stored as a private attribute like _ignore_missing_paths). Using getattr with a fallback is safer and more defensive.

Suggested change

if datasource.ignore_missing_paths:

return from_blocks([])

if getattr(datasource, "ignore_missing_paths", False) or getattr(datasource, "_ignore_missing_paths", False):

return from_blocks([])

Follow-up improvements to the glob pattern implementation: Bug fixes: - Store ignore_missing_paths in ParquetDatasource. - Reject empty brackets in glob character detection. - Return absolute paths for local glob expansion. - Raise clear errors on glob filesystem resolution failures. Code quality: - Clarify ignore_missing_paths behavior in FileBasedDatasource. - Remove unreachable RetryingPyFileSystem branch. Documentation: - Add glob pattern examples to read_images, read_json, read_csv, read_text, read_avro, and read_binary_files docstrings. Tests: - Add direct unit tests for _expand_glob local expansion. Signed-off-by: Chang-Tong <zdcheerful@hotmail.com> Co-Authored-By: jiangxt2 <jiangxt2@vip.qq.com>

- Fix filesystem variable shadowing in _resolve_paths_and_filesystem - Use fnmatchcase for case-sensitive cloud path matching - Handle empty paths unpacking crash in FileBasedDatasource - Skip partition/extension filters when paths is empty - Add local glob fallback when filesystem resolution fails - Add early return for empty glob in ParquetDatasource - Add edge case tests for ignore_missing_paths + glob Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

- Add _is_local_filesystem() helper that unwraps RetryingPyFileSystem before checking, fixing wrapped LocalFileSystem misclassified as cloud - Use getattr for datasource.ignore_missing_paths in V2 read path Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

The CI `check-test-run.py` requires all `tests/test_*.py` files to have a corresponding Bazel `py_test` target. The new `test_glob_patterns.py` was missing from BUILD.bazel, causing microcheck lint:test_coverage to fail. Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

- Add "/" boundary check in _expand_glob to prevent string-prefix matching (e.g. "bucket/data" matching "bucket/dataextra.parquet") - Wrap FileSelector.get_file_info in try/except so that missing cloud prefixes return [] when ignore_missing_paths=True instead of raising Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

…hema - Fix black formatting for prefix boundary check in _expand_glob - Set resolved_filesystem to LocalFileSystem after pathlib glob fallback succeeds, preventing None filesystem in downstream ops - Use pa.schema([]) instead of None for file_schema/read_schema when ignore_missing_paths=True and no files match in V1 ParquetDatasource, avoiding TypeError in _derive_schema Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

- Only treat '?' as a glob metachar for cloud paths (with URI scheme). Local filenames may contain literal '?' and worked unchanged before expand_globs was enabled. - Narrow _expand_glob's exception handler from bare Exception to (FileNotFoundError, OSError) so that permission, network, and configuration errors propagate to the caller instead of being silently swallowed when ignore_missing_paths=True. Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit 9eecafb. Configure here.}

cursor · 2026-06-22T16:01:41Z

+        return True
+    # Bracket expression: [ followed by ], not preceded by a space.
+    # Require at least one character between brackets to avoid matching empty [].
+    return bool(re.search(r"(?<! )\[[^\]]+\]", check_target))


Local question-mark globs not expanded

Medium Severity

Glob expansion is gated on _has_glob_chars, but that helper only treats ? as a metacharacter when the path has a URI scheme. Local patterns such as part-?.parquet are therefore not expanded and are resolved as literal paths, so single-character ? globs on local filesystems do not match the documented behavior or the new tests.

^{Reviewed by Cursor Bugbot for commit 9eecafb. Configure here.}

- Only treat '?' as a glob metachar for cloud paths (with URI scheme). Local filenames may contain literal '?' and worked unchanged before expand_globs was enabled. - Narrow _expand_glob's exception handler from bare Exception to (FileNotFoundError, OSError) so that permission, network, and configuration errors propagate to the caller instead of being silently swallowed when ignore_missing_paths=True. - Update test_question_mark to reflect new behavior. Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

jiangxt2 and others added 6 commits June 21, 2026 20:16

jiangxt2 requested a review from a team as a code owner June 21, 2026 14:32

gemini-code-assist Bot reviewed Jun 21, 2026

View reviewed changes

cursor Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread python/ray/data/datasource/file_based_datasource.py Outdated

Comment thread python/ray/data/datasource/path_util.py

Comment thread python/ray/data/_internal/datasource/parquet_datasource.py

cursor Bot reviewed Jun 21, 2026

View reviewed changes

Comment thread python/ray/data/datasource/path_util.py Outdated

ray-gardener Bot added docs An issue or change related to documentation data Ray Data-related issues community-contribution Contributed by the community labels Jun 21, 2026

jiangxt2 force-pushed the feat/glob-pattern-support branch from f6dc591 to daf2d28 Compare June 22, 2026 04:53