[Data] Add glob pattern support to read_xxx APIs#64247
Conversation
Add _has_glob_chars, _split_glob_base, _glob_match_path, _glob_match_segs, _expand_glob to path_util.py. Modify _resolve_paths_and_filesystem with expand_globs parameter to detect and expand glob patterns before path resolution. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Pass expand_globs=True to _resolve_paths_and_filesystem in ParquetDatasource, ParquetDatasourceV2, FileBasedDatasource, and NonSamplingFileIndexer. Relax assert in file_indexer to support multiple resolved paths from glob expansion. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Add ignore_missing_paths parameter to read_parquet API, consistent with other read_xxx functions. Add glob pattern usage examples to docstring. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
Add integration tests for local glob patterns (*, **, ?, [...]) across multiple datasources with V1/V2 parameterization. Add unit tests for glob helper functions in path_util. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
- Fix recursive listing: check glob chars in non-terminal path segments instead of simply checking '/' in pattern (e.g. sub*/*.parquet needs recursive, but *.parquet does not) - Add S3 (moto) tests: single wildcard, recursive glob, bucket glob rejection Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
V1: FileBasedDatasource.__init__ skips expand_paths when glob returns empty paths with ignore_missing_paths=True, avoiding zip(*[]) crash. Stores empty lists so get_read_tasks returns an empty Dataset. V2: _read_datasource_v2 returns empty Dataset when sample is empty and ignore_missing_paths=True, instead of unconditionally raising ValueError. Co-authored-by: Chang-Tong <zdcheerful@hotmail.com> Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com>
There was a problem hiding this comment.
Code Review
This pull request introduces robust glob pattern support (including recursive and segment-aware matching) across Ray Data readers, resolving patterns before path resolution. The review feedback highlights a few important issues: a variable shadowing bug in _resolve_paths_and_filesystem that overwrites the shared filesystem parameter across loop iterations; OS-dependent case-sensitivity bugs on Windows due to using fnmatch.fnmatch instead of fnmatch.fnmatchcase for cloud paths; a potential unpacking crash when paths is empty; and a risk of AttributeError when directly accessing datasource.ignore_missing_paths instead of using getattr.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| resolved_paths = [] | ||
| for path in paths: | ||
| # Expand glob patterns before resolving individual paths. | ||
| # Only enabled when callers explicitly opt in (expand_globs=True), | ||
| # since this function is also used by write, download, and | ||
| # partition paths where glob expansion is undesirable. | ||
| if expand_globs and _has_glob_chars(path): | ||
| # Resolve relative paths to absolute before glob expansion | ||
| # to ensure consistency across processes (e.g., Ray workers). | ||
| # Only for local paths — cloud URIs (s3://, gs://, etc.) are | ||
| # already absolute; os.path.isabs misidentifies them as relative. | ||
| parsed = urlparse(path) | ||
| is_cloud_fs = filesystem is not None and not isinstance( | ||
| filesystem, LocalFileSystem | ||
| ) | ||
| if not parsed.scheme and not os.path.isabs(path) and not is_cloud_fs: | ||
| path = os.path.abspath(path) | ||
|
|
||
| if filesystem is None: | ||
| try: | ||
| filesystem, _ = _resolve_single_path_with_fallback(path, filesystem) | ||
| except (ValueError, ImportError) as e: | ||
| if ignore_missing_paths: | ||
| logger.debug( | ||
| f"Failed to resolve filesystem for glob " | ||
| f"'{path}': {e}, skipping" | ||
| ) | ||
| continue | ||
| raise ValueError( | ||
| f"Failed to resolve filesystem for glob pattern '{path}': {e}" | ||
| ) from e | ||
| expanded = _expand_glob(path, filesystem, ignore_missing_paths) | ||
| resolved_paths.extend(expanded) | ||
| continue |
There was a problem hiding this comment.
Reassigning the filesystem variable inside the loop (filesystem, _ = _resolve_single_path_with_fallback(...)) shadows and overwrites the input filesystem parameter for all subsequent iterations. If the input list contains multiple paths that require different filesystems (or if the first is a glob and the second is a non-glob), subsequent paths will incorrectly use the filesystem resolved from the first path.
To fix this, track the resolved filesystem in a separate variable (e.g., resolved_filesystem) and use a local resolved_fs inside the loop to avoid mutating the shared filesystem state.
resolved_filesystem = filesystem
resolved_paths = []
for path in paths:
# Expand glob patterns before resolving individual paths.
# Only enabled when callers explicitly opt in (expand_globs=True),
# since this function is also used by write, download, and
# partition paths where glob expansion is undesirable.
if expand_globs and _has_glob_chars(path):
# Resolve relative paths to absolute before glob expansion
# to ensure consistency across processes (e.g., Ray workers).
# Only for local paths — cloud URIs (s3://, gs://, etc.) are
# already absolute; os.path.isabs misidentifies them as relative.
parsed = urlparse(path)
is_cloud_fs = filesystem is not None and not isinstance(
filesystem, LocalFileSystem
)
if not parsed.scheme and not os.path.isabs(path) and not is_cloud_fs:
path = os.path.abspath(path)
resolved_fs = filesystem
if resolved_fs is None:
try:
resolved_fs, _ = _resolve_single_path_with_fallback(path, resolved_fs)
except (ValueError, ImportError) as e:
if ignore_missing_paths:
logger.debug(
f"Failed to resolve filesystem for glob "
f"'{path}': {e}, skipping"
)
continue
raise ValueError(
f"Failed to resolve filesystem for glob pattern '{path}': {e}"
) from e
expanded = _expand_glob(path, resolved_fs, ignore_missing_paths)
resolved_paths.extend(expanded)
resolved_filesystem = resolved_fs
continue| for j in range(1, n_rel + 1): | ||
| if fnmatch.fnmatch(rel_segs[j - 1], seg): | ||
| dp[i][j] = dp[i - 1][j - 1] |
There was a problem hiding this comment.
fnmatch.fnmatch is OS-dependent (case-insensitive on Windows, case-sensitive on Unix). However, cloud filesystems (S3, GCS, ADLS) are strictly case-sensitive regardless of the operating system of the driver or worker. Using fnmatch.fnmatch can lead to inconsistent behavior and bugs when running on Windows drivers. Use fnmatch.fnmatchcase instead to ensure consistent, case-sensitive matching for cloud paths.
| for j in range(1, n_rel + 1): | |
| if fnmatch.fnmatch(rel_segs[j - 1], seg): | |
| dp[i][j] = dp[i - 1][j - 1] | |
| for j in range(1, n_rel + 1): | |
| if fnmatch.fnmatchcase(rel_segs[j - 1], seg): | |
| dp[i][j] = dp[i - 1][j - 1] |
| if ignore_missing_paths and len(paths) == 0: | ||
| raise ValueError( | ||
| "None of the provided paths exist. " | ||
| "The 'ignore_missing_paths' field is set to True." | ||
| file_sizes = [] | ||
| else: | ||
| paths, file_sizes = map( | ||
| list, | ||
| zip( | ||
| *meta_provider.expand_paths( | ||
| paths, | ||
| self._filesystem, | ||
| partitioning, | ||
| ignore_missing_paths=ignore_missing_paths, | ||
| ) | ||
| ), | ||
| ) |
There was a problem hiding this comment.
If paths is empty and ignore_missing_paths is False, the code will fall through to the else block and attempt to unpack paths, file_sizes = map(list, zip(*...)). Since zip(*[]) is empty, this will raise a cryptic ValueError: not enough values to unpack (expected 2, got 0). Handling the empty paths case explicitly with a clear error message improves robustness and user experience.
| if ignore_missing_paths and len(paths) == 0: | |
| raise ValueError( | |
| "None of the provided paths exist. " | |
| "The 'ignore_missing_paths' field is set to True." | |
| file_sizes = [] | |
| else: | |
| paths, file_sizes = map( | |
| list, | |
| zip( | |
| *meta_provider.expand_paths( | |
| paths, | |
| self._filesystem, | |
| partitioning, | |
| ignore_missing_paths=ignore_missing_paths, | |
| ) | |
| ), | |
| ) | |
| if len(paths) == 0: | |
| if not ignore_missing_paths: | |
| raise ValueError("No paths resolved or provided.") | |
| file_sizes = [] | |
| else: | |
| paths, file_sizes = map( | |
| list, | |
| zip( | |
| *meta_provider.expand_paths( | |
| paths, | |
| self._filesystem, | |
| partitioning, | |
| ignore_missing_paths=ignore_missing_paths, | |
| ) | |
| ), | |
| ) |
| if datasource.ignore_missing_paths: | ||
| return from_blocks([]) |
There was a problem hiding this comment.
Directly accessing datasource.ignore_missing_paths can raise an AttributeError if the datasource implementation does not define or expose this attribute (or if it is stored as a private attribute like _ignore_missing_paths). Using getattr with a fallback is safer and more defensive.
| if datasource.ignore_missing_paths: | |
| return from_blocks([]) | |
| if getattr(datasource, "ignore_missing_paths", False) or getattr(datasource, "_ignore_missing_paths", False): | |
| return from_blocks([]) |
f6dc591 to
daf2d28
Compare
Follow-up improvements to the glob pattern implementation: Bug fixes: - Store ignore_missing_paths in ParquetDatasource. - Reject empty brackets in glob character detection. - Return absolute paths for local glob expansion. - Raise clear errors on glob filesystem resolution failures. Code quality: - Clarify ignore_missing_paths behavior in FileBasedDatasource. - Remove unreachable RetryingPyFileSystem branch. Documentation: - Add glob pattern examples to read_images, read_json, read_csv, read_text, read_avro, and read_binary_files docstrings. Tests: - Add direct unit tests for _expand_glob local expansion. Signed-off-by: Chang-Tong <zdcheerful@hotmail.com> Co-Authored-By: jiangxt2 <jiangxt2@vip.qq.com>
- Fix filesystem variable shadowing in _resolve_paths_and_filesystem - Use fnmatchcase for case-sensitive cloud path matching - Handle empty paths unpacking crash in FileBasedDatasource - Skip partition/extension filters when paths is empty - Add local glob fallback when filesystem resolution fails - Add early return for empty glob in ParquetDatasource - Add edge case tests for ignore_missing_paths + glob Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
- Add _is_local_filesystem() helper that unwraps RetryingPyFileSystem before checking, fixing wrapped LocalFileSystem misclassified as cloud - Use getattr for datasource.ignore_missing_paths in V2 read path Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
037a5de to
db997c5
Compare
The CI `check-test-run.py` requires all `tests/test_*.py` files to have a corresponding Bazel `py_test` target. The new `test_glob_patterns.py` was missing from BUILD.bazel, causing microcheck lint:test_coverage to fail. Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
- Add "/" boundary check in _expand_glob to prevent string-prefix matching (e.g. "bucket/data" matching "bucket/dataextra.parquet") - Wrap FileSelector.get_file_info in try/except so that missing cloud prefixes return [] when ignore_missing_paths=True instead of raising Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
…hema - Fix black formatting for prefix boundary check in _expand_glob - Set resolved_filesystem to LocalFileSystem after pathlib glob fallback succeeds, preventing None filesystem in downstream ops - Use pa.schema([]) instead of None for file_schema/read_schema when ignore_missing_paths=True and no files match in V1 ParquetDatasource, avoiding TypeError in _derive_schema Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
- Only treat '?' as a glob metachar for cloud paths (with URI scheme). Local filenames may contain literal '?' and worked unchanged before expand_globs was enabled. - Narrow _expand_glob's exception handler from bare Exception to (FileNotFoundError, OSError) so that permission, network, and configuration errors propagate to the caller instead of being silently swallowed when ignore_missing_paths=True. Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9eecafb. Configure here.
| return True | ||
| # Bracket expression: [ followed by ], not preceded by a space. | ||
| # Require at least one character between brackets to avoid matching empty []. | ||
| return bool(re.search(r"(?<! )\[[^\]]+\]", check_target)) |
There was a problem hiding this comment.
Local question-mark globs not expanded
Medium Severity
Glob expansion is gated on _has_glob_chars, but that helper only treats ? as a metacharacter when the path has a URI scheme. Local patterns such as part-?.parquet are therefore not expanded and are resolved as literal paths, so single-character ? globs on local filesystems do not match the documented behavior or the new tests.
Reviewed by Cursor Bugbot for commit 9eecafb. Configure here.
- Only treat '?' as a glob metachar for cloud paths (with URI scheme). Local filenames may contain literal '?' and worked unchanged before expand_globs was enabled. - Narrow _expand_glob's exception handler from bare Exception to (FileNotFoundError, OSError) so that permission, network, and configuration errors propagate to the caller instead of being silently swallowed when ignore_missing_paths=True. - Update test_question_mark to reflect new behavior. Signed-off-by: jiangxt2 <jiangxt2@vip.qq.com> Co-Authored-By: Chang-Tong <zdcheerful@hotmail.com>


Description
Add glob pattern support (
*,**,?,[...]) to allread_xxxAPIs in Ray Data, enablingray.data.read_parquet("s3://bucket/*.parquet")to work natively.The implementation injects glob expansion into
_resolve_paths_and_filesystem()— the single resolution path shared by V1 and V2 datasources — so allread_xxxcallers benefit automatically.Key design decisions:
pathlib.Path.glob()for local paths andFileSelectorlisting + segment-aware matching for cloud paths (S3, GCS, etc.)*matches within a single path segment;**spans multiple segments (dynamic programming matcher)s3://bucket-*/*.parquet) are explicitly rejected with a clear error?X-Amz-Signature=...) are not treated as glob charactersRelated issues
Fixes #59304
Additional information
Files changed (8):
path_util.py: New helpers_has_glob_chars,_split_glob_base,_glob_match_path,_expand_glob; inject expansion into_resolve_paths_and_filesystemfile_based_datasource.py: Enableexpand_globs=True, fixignore_missing_pathsfor empty glob resultsparquet_datasource.py/parquet_datasource_v2.py/file_indexer.py: Enable glob expansionread_api.py: Addignore_missing_pathstoread_parquet, add glob examples to allread_xxxdocstringsTests (31 total):
test_glob_patterns.py(16): Integration tests for local + S3 (moto), V1/V2 parameterizationtest_path_util_glob.py(15): Unit tests for pure helper functions