Skip to content

[Data] Add glob pattern support to read_xxx APIs#64247

Open
jiangxt2 wants to merge 14 commits into
ray-project:masterfrom
jiangxt2:feat/glob-pattern-support
Open

[Data] Add glob pattern support to read_xxx APIs#64247
jiangxt2 wants to merge 14 commits into
ray-project:masterfrom
jiangxt2:feat/glob-pattern-support

Conversation

@jiangxt2

Copy link
Copy Markdown

Description

Add glob pattern support (*, **, ?, [...]) to all read_xxx APIs in Ray Data, enabling ray.data.read_parquet("s3://bucket/*.parquet") to work natively.

The implementation injects glob expansion into _resolve_paths_and_filesystem() — the single resolution path shared by V1 and V2 datasources — so all read_xxx callers benefit automatically.

Key design decisions:

  • Uses pathlib.Path.glob() for local paths and FileSelector listing + segment-aware matching for cloud paths (S3, GCS, etc.)
  • * matches within a single path segment; ** spans multiple segments (dynamic programming matcher)
  • Bucket-level globs (s3://bucket-*/*.parquet) are explicitly rejected with a clear error
  • URI query strings (?X-Amz-Signature=...) are not treated as glob characters

Related issues

Fixes #59304

Additional information

Files changed (8):

  • path_util.py: New helpers _has_glob_chars, _split_glob_base, _glob_match_path, _expand_glob; inject expansion into _resolve_paths_and_filesystem
  • file_based_datasource.py: Enable expand_globs=True, fix ignore_missing_paths for empty glob results
  • parquet_datasource.py / parquet_datasource_v2.py / file_indexer.py: Enable glob expansion
  • read_api.py: Add ignore_missing_paths to read_parquet, add glob examples to all read_xxx docstrings

Tests (31 total):

  • test_glob_patterns.py (16): Integration tests for local + S3 (moto), V1/V2 parameterization
  • test_path_util_glob.py (15): Unit tests for pure helper functions

jiangxt2 and others added 6 commits June 21, 2026 20:16
Add _has_glob_chars, _split_glob_base, _glob_match_path,
_glob_match_segs, _expand_glob to path_util.py. Modify
_resolve_paths_and_filesystem with expand_globs parameter
to detect and expand glob patterns before path resolution.

Co-authored-by: Chang-Tong <zdcheerful@hotmail.com>
Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Pass expand_globs=True to _resolve_paths_and_filesystem in
ParquetDatasource, ParquetDatasourceV2, FileBasedDatasource,
and NonSamplingFileIndexer. Relax assert in file_indexer to
support multiple resolved paths from glob expansion.

Co-authored-by: Chang-Tong <zdcheerful@hotmail.com>
Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Add ignore_missing_paths parameter to read_parquet API,
consistent with other read_xxx functions. Add glob pattern
usage examples to docstring.

Co-authored-by: Chang-Tong <zdcheerful@hotmail.com>
Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Add integration tests for local glob patterns (*, **, ?, [...])
across multiple datasources with V1/V2 parameterization.
Add unit tests for glob helper functions in path_util.

Co-authored-by: Chang-Tong <zdcheerful@hotmail.com>
Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
- Fix recursive listing: check glob chars in non-terminal path segments
  instead of simply checking '/' in pattern (e.g. sub*/*.parquet needs
  recursive, but *.parquet does not)
- Add S3 (moto) tests: single wildcard, recursive glob, bucket glob rejection

Co-authored-by: Chang-Tong <zdcheerful@hotmail.com>
Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
V1: FileBasedDatasource.__init__ skips expand_paths when glob returns
empty paths with ignore_missing_paths=True, avoiding zip(*[]) crash.
Stores empty lists so get_read_tasks returns an empty Dataset.

V2: _read_datasource_v2 returns empty Dataset when sample is empty
and ignore_missing_paths=True, instead of unconditionally raising
ValueError.

Co-authored-by: Chang-Tong <zdcheerful@hotmail.com>
Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
@jiangxt2 jiangxt2 requested a review from a team as a code owner June 21, 2026 14:32

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces robust glob pattern support (including recursive and segment-aware matching) across Ray Data readers, resolving patterns before path resolution. The review feedback highlights a few important issues: a variable shadowing bug in _resolve_paths_and_filesystem that overwrites the shared filesystem parameter across loop iterations; OS-dependent case-sensitivity bugs on Windows due to using fnmatch.fnmatch instead of fnmatch.fnmatchcase for cloud paths; a potential unpacking crash when paths is empty; and a risk of AttributeError when directly accessing datasource.ignore_missing_paths instead of using getattr.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 616 to +649
resolved_paths = []
for path in paths:
# Expand glob patterns before resolving individual paths.
# Only enabled when callers explicitly opt in (expand_globs=True),
# since this function is also used by write, download, and
# partition paths where glob expansion is undesirable.
if expand_globs and _has_glob_chars(path):
# Resolve relative paths to absolute before glob expansion
# to ensure consistency across processes (e.g., Ray workers).
# Only for local paths — cloud URIs (s3://, gs://, etc.) are
# already absolute; os.path.isabs misidentifies them as relative.
parsed = urlparse(path)
is_cloud_fs = filesystem is not None and not isinstance(
filesystem, LocalFileSystem
)
if not parsed.scheme and not os.path.isabs(path) and not is_cloud_fs:
path = os.path.abspath(path)

if filesystem is None:
try:
filesystem, _ = _resolve_single_path_with_fallback(path, filesystem)
except (ValueError, ImportError) as e:
if ignore_missing_paths:
logger.debug(
f"Failed to resolve filesystem for glob "
f"'{path}': {e}, skipping"
)
continue
raise ValueError(
f"Failed to resolve filesystem for glob pattern '{path}': {e}"
) from e
expanded = _expand_glob(path, filesystem, ignore_missing_paths)
resolved_paths.extend(expanded)
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Reassigning the filesystem variable inside the loop (filesystem, _ = _resolve_single_path_with_fallback(...)) shadows and overwrites the input filesystem parameter for all subsequent iterations. If the input list contains multiple paths that require different filesystems (or if the first is a glob and the second is a non-glob), subsequent paths will incorrectly use the filesystem resolved from the first path.

To fix this, track the resolved filesystem in a separate variable (e.g., resolved_filesystem) and use a local resolved_fs inside the loop to avoid mutating the shared filesystem state.

    resolved_filesystem = filesystem
    resolved_paths = []
    for path in paths:
        # Expand glob patterns before resolving individual paths.
        # Only enabled when callers explicitly opt in (expand_globs=True),
        # since this function is also used by write, download, and
        # partition paths where glob expansion is undesirable.
        if expand_globs and _has_glob_chars(path):
            # Resolve relative paths to absolute before glob expansion
            # to ensure consistency across processes (e.g., Ray workers).
            # Only for local paths — cloud URIs (s3://, gs://, etc.) are
            # already absolute; os.path.isabs misidentifies them as relative.
            parsed = urlparse(path)
            is_cloud_fs = filesystem is not None and not isinstance( 
                filesystem, LocalFileSystem
            )
            if not parsed.scheme and not os.path.isabs(path) and not is_cloud_fs:
                path = os.path.abspath(path)

            resolved_fs = filesystem
            if resolved_fs is None:
                try:
                    resolved_fs, _ = _resolve_single_path_with_fallback(path, resolved_fs)
                except (ValueError, ImportError) as e:
                    if ignore_missing_paths:
                        logger.debug(
                            f"Failed to resolve filesystem for glob "
                            f"'{path}': {e}, skipping"
                        )
                        continue
                    raise ValueError(
                        f"Failed to resolve filesystem for glob pattern '{path}': {e}"
                    ) from e
            expanded = _expand_glob(path, resolved_fs, ignore_missing_paths)
            resolved_paths.extend(expanded)
            resolved_filesystem = resolved_fs
            continue

Comment on lines +282 to +284
for j in range(1, n_rel + 1):
if fnmatch.fnmatch(rel_segs[j - 1], seg):
dp[i][j] = dp[i - 1][j - 1]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

fnmatch.fnmatch is OS-dependent (case-insensitive on Windows, case-sensitive on Unix). However, cloud filesystems (S3, GCS, ADLS) are strictly case-sensitive regardless of the operating system of the driver or worker. Using fnmatch.fnmatch can lead to inconsistent behavior and bugs when running on Windows drivers. Use fnmatch.fnmatchcase instead to ensure consistent, case-sensitive matching for cloud paths.

Suggested change
for j in range(1, n_rel + 1):
if fnmatch.fnmatch(rel_segs[j - 1], seg):
dp[i][j] = dp[i - 1][j - 1]
for j in range(1, n_rel + 1):
if fnmatch.fnmatchcase(rel_segs[j - 1], seg):
dp[i][j] = dp[i - 1][j - 1]

Comment on lines 190 to 203
if ignore_missing_paths and len(paths) == 0:
raise ValueError(
"None of the provided paths exist. "
"The 'ignore_missing_paths' field is set to True."
file_sizes = []
else:
paths, file_sizes = map(
list,
zip(
*meta_provider.expand_paths(
paths,
self._filesystem,
partitioning,
ignore_missing_paths=ignore_missing_paths,
)
),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If paths is empty and ignore_missing_paths is False, the code will fall through to the else block and attempt to unpack paths, file_sizes = map(list, zip(*...)). Since zip(*[]) is empty, this will raise a cryptic ValueError: not enough values to unpack (expected 2, got 0). Handling the empty paths case explicitly with a clear error message improves robustness and user experience.

Suggested change
if ignore_missing_paths and len(paths) == 0:
raise ValueError(
"None of the provided paths exist. "
"The 'ignore_missing_paths' field is set to True."
file_sizes = []
else:
paths, file_sizes = map(
list,
zip(
*meta_provider.expand_paths(
paths,
self._filesystem,
partitioning,
ignore_missing_paths=ignore_missing_paths,
)
),
)
if len(paths) == 0:
if not ignore_missing_paths:
raise ValueError("No paths resolved or provided.")
file_sizes = []
else:
paths, file_sizes = map(
list,
zip(
*meta_provider.expand_paths(
paths,
self._filesystem,
partitioning,
ignore_missing_paths=ignore_missing_paths,
)
),
)

Comment thread python/ray/data/read_api.py Outdated
Comment on lines +494 to +495
if datasource.ignore_missing_paths:
return from_blocks([])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Directly accessing datasource.ignore_missing_paths can raise an AttributeError if the datasource implementation does not define or expose this attribute (or if it is stored as a private attribute like _ignore_missing_paths). Using getattr with a fallback is safer and more defensive.

Suggested change
if datasource.ignore_missing_paths:
return from_blocks([])
if getattr(datasource, "ignore_missing_paths", False) or getattr(datasource, "_ignore_missing_paths", False):
return from_blocks([])

Comment thread python/ray/data/datasource/file_based_datasource.py Outdated
Comment thread python/ray/data/datasource/path_util.py
Comment thread python/ray/data/_internal/datasource/parquet_datasource.py
Comment thread python/ray/data/datasource/path_util.py Outdated
@ray-gardener ray-gardener Bot added docs An issue or change related to documentation data Ray Data-related issues community-contribution Contributed by the community labels Jun 21, 2026
@jiangxt2 jiangxt2 force-pushed the feat/glob-pattern-support branch from f6dc591 to daf2d28 Compare June 22, 2026 04:53
Comment thread python/ray/data/datasource/path_util.py
Chang-Tong and others added 3 commits June 22, 2026 17:02
Follow-up improvements to the glob pattern implementation:

Bug fixes:
- Store ignore_missing_paths in ParquetDatasource.
- Reject empty brackets in glob character detection.
- Return absolute paths for local glob expansion.
- Raise clear errors on glob filesystem resolution failures.

Code quality:
- Clarify ignore_missing_paths behavior in FileBasedDatasource.
- Remove unreachable RetryingPyFileSystem branch.

Documentation:
- Add glob pattern examples to read_images, read_json, read_csv,
  read_text, read_avro, and read_binary_files docstrings.

Tests:
- Add direct unit tests for _expand_glob local expansion.

Signed-off-by: Chang-Tong <zdcheerful@hotmail.com>
Co-Authored-By: jiangxt2 <jiangxt2@vip.qq.com>
- Fix filesystem variable shadowing in _resolve_paths_and_filesystem
- Use fnmatchcase for case-sensitive cloud path matching
- Handle empty paths unpacking crash in FileBasedDatasource
- Skip partition/extension filters when paths is empty
- Add local glob fallback when filesystem resolution fails
- Add early return for empty glob in ParquetDatasource
- Add edge case tests for ignore_missing_paths + glob

Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
- Add _is_local_filesystem() helper that unwraps RetryingPyFileSystem
  before checking, fixing wrapped LocalFileSystem misclassified as cloud
- Use getattr for datasource.ignore_missing_paths in V2 read path

Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
@jiangxt2 jiangxt2 force-pushed the feat/glob-pattern-support branch 2 times, most recently from 037a5de to db997c5 Compare June 22, 2026 12:57
The CI `check-test-run.py` requires all `tests/test_*.py` files to have
a corresponding Bazel `py_test` target. The new `test_glob_patterns.py`
was missing from BUILD.bazel, causing microcheck lint:test_coverage to
fail.

Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
Comment thread python/ray/data/datasource/path_util.py Outdated
- Add "/" boundary check in _expand_glob to prevent string-prefix
  matching (e.g. "bucket/data" matching "bucket/dataextra.parquet")
- Wrap FileSelector.get_file_info in try/except so that missing
  cloud prefixes return [] when ignore_missing_paths=True instead
  of raising

Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
Comment thread python/ray/data/datasource/path_util.py
Comment thread python/ray/data/_internal/datasource/parquet_datasource.py
…hema

- Fix black formatting for prefix boundary check in _expand_glob
- Set resolved_filesystem to LocalFileSystem after pathlib glob
  fallback succeeds, preventing None filesystem in downstream ops
- Use pa.schema([]) instead of None for file_schema/read_schema
  when ignore_missing_paths=True and no files match in V1
  ParquetDatasource, avoiding TypeError in _derive_schema

Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
Comment thread python/ray/data/datasource/path_util.py
Comment thread python/ray/data/datasource/path_util.py Outdated
- Only treat '?' as a glob metachar for cloud paths (with URI
  scheme). Local filenames may contain literal '?' and worked
  unchanged before expand_globs was enabled.
- Narrow _expand_glob's exception handler from bare Exception to
  (FileNotFoundError, OSError) so that permission, network, and
  configuration errors propagate to the caller instead of being
  silently swallowed when ignore_missing_paths=True.

Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 9eecafb. Configure here.

return True
# Bracket expression: [ followed by ], not preceded by a space.
# Require at least one character between brackets to avoid matching empty [].
return bool(re.search(r"(?<! )\[[^\]]+\]", check_target))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local question-mark globs not expanded

Medium Severity

Glob expansion is gated on _has_glob_chars, but that helper only treats ? as a metacharacter when the path has a URI scheme. Local patterns such as part-?.parquet are therefore not expanded and are resolved as literal paths, so single-character ? globs on local filesystems do not match the documented behavior or the new tests.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9eecafb. Configure here.

- Only treat '?' as a glob metachar for cloud paths (with URI
  scheme). Local filenames may contain literal '?' and worked
  unchanged before expand_globs was enabled.
- Narrow _expand_glob's exception handler from bare Exception to
  (FileNotFoundError, OSError) so that permission, network, and
  configuration errors propagate to the caller instead of being
  silently swallowed when ignore_missing_paths=True.
- Update test_question_mark to reflect new behavior.

Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues docs An issue or change related to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Add glob pattern support to read_xxx APIs

2 participants