Filter out metadata directories from HUDI partitions#814
Filter out metadata directories from HUDI partitions#814caglareker wants to merge 2 commits intoapache:mainfrom
Conversation
vinishjail97
left a comment
There was a problem hiding this comment.
Thanks for contributing this critical bug fix @caglareker, added comments to keep the check more restrictive. Please review and let me know what you think.
| return true; | ||
| } | ||
| String name = new Path(p).getName(); | ||
| return !name.startsWith("_") && !name.startsWith("."); |
There was a problem hiding this comment.
The startsWith("_") filter is too broad. It would silently drop legitimate user partitions where the partition column name starts with _, e.g., _status=active or _year=2024 — getName() returns the last path component, so a partition path like region=US/_status=active would be incorrectly filtered.
Consider filtering only known metadata directory names (.hoodie, _delta_log) instead of any path starting with these characters?
| List<String> result = | ||
| mockHudiCatalogPartitionSyncTool.getAllPartitionPathsOnStorage(TEST_BASE_PATH); | ||
|
|
||
| assertEquals(Arrays.asList("key1", "key2"), result); |
There was a problem hiding this comment.
filterMetadataPaths is called in 3 places (BaseFileUpdatesExtractor, HudiDataFileExtractor, HudiCatalogPartitionSyncTool) but is only tested indirectly through one of them. There is no TestHudiPathUtils unit test for the method itself.
Please add a dedicated unit test covering:
- Nested path with a metadata dir as the last segment (e.g.,
year=2024/_delta_log→ filtered) - Empty string → kept
- A partition whose column name starts with
_(e.g.,_status=active) to explicitly document whether this is expected to be filtered or not
| if (p.isEmpty()) { | ||
| return true; | ||
| } | ||
| String name = new Path(p).getName(); |
There was a problem hiding this comment.
Nit: new Path(p).getName() allocates a Hadoop Path object just to extract the last segment of a relative partition path string. Consider p.substring(p.lastIndexOf('/') + 1) to avoid the unnecessary allocation.
|
good point, tightened the filter to only match known metadata dirs (.hoodie, _delta_log) instead of the broad startsWith check. Also added TestHudiPathUtils with the cases you mentioned and replaced new Path(p).getName() with substring. |
What does this PR do?
Closes #813
The _delta_log folder was being treated as a partition when reading HUDI tables with checkpoint parquet files. Added a utility method to filter out these metadata paths from the partition list.
Applied the filter in BaseFileUpdatesExtractor and HudiDataFileExtractor where we fetch partitions from the filesystem.
How was this patch tested?
Ran the existing test suite locally, all passing. Added a test case to TestHudiCatalogPartitionSyncTool to cover the filtering logic.