Filter out metadata directories from HUDI partitions #814

Open
caglareker wants to merge 2 commits into apache:main from caglareker:fix/issue-813-delta-log-folder-to-be-considered-as-par

@caglareker

What does this PR do?

Closes #813

The _delta_log folder was being treated as a partition when reading HUDI tables with checkpoint parquet files. Added a utility method to filter out these metadata paths from the partition list.

Applied the filter in BaseFileUpdatesExtractor and HudiDataFileExtractor where we fetch partitions from the filesystem.

How was this patch tested?

Ran the existing test suite locally, all passing. Added a test case to TestHudiCatalogPartitionSyncTool to cover the filtering logic.

@vinishjail97 (Contributor) left a comment

Thanks for contributing this critical bug fix, @caglareker. I added comments suggesting a more restrictive check. Please review and let me know what you think.

  return true;
}
String name = new Path(p).getName();
return !name.startsWith("_") && !name.startsWith(".");

The startsWith("_") filter is too broad. It would silently drop legitimate user partitions whose partition column name starts with _, e.g., _status=active or _year=2024. getName() returns the last path component, so a partition path like region=US/_status=active would be incorrectly filtered.

Consider filtering only known metadata directory names (.hoodie, _delta_log) instead of any path whose last component starts with these characters.
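
A minimal sketch of the narrower check suggested here, assuming the filter operates on relative partition path strings. The class name, the constant, and both method signatures are hypothetical and are not the PR's actual code; only the two directory names come from the comment above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in class; not the PR's actual utility.
class MetadataPathFilterSketch {
  // The two metadata directory names mentioned in the review comment.
  private static final Set<String> METADATA_DIR_NAMES =
      new HashSet<>(Arrays.asList(".hoodie", "_delta_log"));

  // Returns true when the path should be kept as a partition.
  static boolean isPartitionPath(String p) {
    if (p.isEmpty()) {
      return true; // empty paths are kept, as in the original check
    }
    // Compare only the last path segment against the known names.
    String name = p.substring(p.lastIndexOf('/') + 1);
    return !METADATA_DIR_NAMES.contains(name);
  }

  // Assumed signature: takes relative partition paths, returns the kept ones.
  static List<String> filterMetadataPaths(List<String> paths) {
    return paths.stream()
        .filter(MetadataPathFilterSketch::isPartitionPath)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> paths =
        Arrays.asList("key1", "_delta_log", "region=US/_status=active", ".hoodie");
    // prints [key1, region=US/_status=active]
    System.out.println(filterMetadataPaths(paths));
  }
}
```

With an exact-name match, region=US/_status=active survives because its last segment is _status=active, not one of the listed directory names.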

List<String> result =
    mockHudiCatalogPartitionSyncTool.getAllPartitionPathsOnStorage(TEST_BASE_PATH);

assertEquals(Arrays.asList("key1", "key2"), result);

filterMetadataPaths is called in 3 places (BaseFileUpdatesExtractor, HudiDataFileExtractor, HudiCatalogPartitionSyncTool) but is only tested indirectly through one of them. There is no TestHudiPathUtils unit test for the method itself.

Please add a dedicated unit test covering:

  1. Nested path with a metadata dir as the last segment (e.g., year=2024/_delta_log → filtered)
  2. Empty string → kept
  3. A partition whose column name starts with _ (e.g., _status=active) to explicitly document whether this is expected to be filtered or not
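
The three cases above could be written down as a small table of inputs and expected outcomes. This is a sketch in plain Java rather than the project's test framework so that it runs on its own; the class name and the keep method are hypothetical stand-ins for the method under test, and the expectation for case 3 (kept) assumes the filter matches only the exact names .hoodie and _delta_log.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; a real test would call the project's own utility.
class FilterMetadataPathsCasesSketch {
  // Stand-in for the method under test (assumed behavior:
  // drop a path only when its last segment is .hoodie or _delta_log).
  static boolean keep(String p) {
    String name = p.substring(p.lastIndexOf('/') + 1);
    return !name.equals(".hoodie") && !name.equals("_delta_log");
  }

  // Maps each input path to whether it is expected to be kept.
  static Map<String, Boolean> cases() {
    Map<String, Boolean> c = new LinkedHashMap<>();
    c.put("year=2024/_delta_log", false); // case 1: nested metadata dir
    c.put("", true);                      // case 2: empty string
    c.put("_status=active", true);        // case 3: column name starts with _
    return c;
  }

  public static void main(String[] args) {
    for (Map.Entry<String, Boolean> e : cases().entrySet()) {
      if (keep(e.getKey()) != e.getValue()) {
        throw new AssertionError("unexpected result for: \"" + e.getKey() + "\"");
      }
    }
    System.out.println("all cases pass");
  }
}
```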

if (p.isEmpty()) {
  return true;
}
String name = new Path(p).getName();

Nit: new Path(p).getName() allocates a Hadoop Path object just to extract the last segment of a relative partition path string. Consider p.substring(p.lastIndexOf('/') + 1) to avoid the unnecessary allocation.
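
A short sketch of how the suggested substring expression behaves on a few inputs. The helper name lastSegment is hypothetical. Note one assumption: the input never ends in '/', since a trailing slash would produce an empty segment here, which may differ from what Path.getName() returns.

```java
// Hypothetical helper illustrating the substring-based extraction.
class LastSegmentSketch {
  // Returns everything after the last '/'. When there is no '/',
  // lastIndexOf returns -1, so substring(0) yields the whole string.
  static String lastSegment(String p) {
    return p.substring(p.lastIndexOf('/') + 1);
  }

  public static void main(String[] args) {
    System.out.println(lastSegment("year=2024/_delta_log")); // prints _delta_log
    System.out.println(lastSegment("_delta_log"));           // prints _delta_log
    System.out.println(lastSegment("").isEmpty());           // prints true
  }
}
```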

@caglareker (Author)

Good point. I tightened the filter to match only known metadata dirs (.hoodie, _delta_log) instead of using the broad startsWith check. I also added TestHudiPathUtils with the cases you mentioned and replaced new Path(p).getName() with substring.

Development

Successfully merging this pull request may close these issues.

_delta_log folder to be considered as partition when checkpoint parquet files present by hudi reader without metadata table