gc: implement detailed report for --dry run #10937
Summary
Fixes #10585
This PR significantly improves the output of `dvc gc --dry`. Previously, a dry run reported only the count of objects to be removed, which left users guessing about what exactly would be deleted. Now it prints a detailed table listing the objects, allowing users to verify the cleanup target before execution.
Example
Given the setup in one of the new unit tests, the dry run produces output like the following:
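Purely for illustration, here is a sketch of what such a table might look like. The OIDs, sizes, and timestamps below are made up, not taken from the actual test; only the column set (`Type`, `OID (MD5)`, `Size`, `Modified time`, `Path`) comes from this PR:

```python
# Render a dry-run-style report table with invented sample data.
rows = [
    # (Type, OID (MD5), Size, Modified time, Path)
    ("file", "d3b07384d113edec49eaa6238ad5ff00", "4.0B",
     "2024-05-01 10:32:11", ".dvc/cache/files/md5/d3/b07384d113edec49eaa6238ad5ff00"),
    ("file", "c157a79031e1c40f85931829bc5fc552", "2.1GB",
     "2024-05-07 18:05:43", ".dvc/cache/files/md5/c1/57a79031e1c40f85931829bc5fc552"),
]
header = ("Type", "OID (MD5)", "Size", "Modified time", "Path")

# Pad each column to the widest cell so the table lines up.
widths = [max(len(str(r[i])) for r in rows + [header]) for i in range(5)]
fmt = "  ".join(f"{{:<{w}}}" for w in widths)

print(fmt.format(*header))
for row in rows:
    print(fmt.format(*row))
```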
Changes
- Extended the `dvc-data` GC interface to iterate over garbage objects.
- The dry-run report lists each object with the columns `Type`, `OID (MD5)`, `Size`, `Modified time`, and `Path`.

You will notice the `Path` column displays the internal cache path (e.g., `.dvc/cache/files/md5/...`) rather than the original workspace filename.

Why internal paths?
Retrieving the original workspace path for a garbage object is complex. Since gc works at the ODB (Object Database) level, it doesn't inherently know where a file came from. Finding the original name would require scanning `git reflog` or refactoring the upper-layer architecture, both of which involve significant complexity or performance costs. However, this output is still highly valuable:

- The `Size` and `Modified time` columns act as strong evidence: users can often identify "that 2GB model file from last Tuesday" just by looking at these metadata columns.
- It matches the behavior of comparable low-level tools (e.g., `git fsck` or `git prune`), where only hashes are displayed.

For this PR, I opted for this simple, robust implementation. It provides immediate value while leaving room to discuss more advanced reverse-lookup logic in future iterations.
Performance & Remote Storage
The logic explicitly checks `isinstance(odb.fs, LocalFileSystem)` before fetching detailed metadata (Size, Modified time).

Reasoning: retrieving `stat` information for every single garbage object on remote storage (S3, Azure, etc.) would trigger a separate network request per object. For large projects with thousands of garbage files, this would make `dvc gc --dry` unacceptably slow and could hit API rate limits. Therefore, detailed metadata is skipped for non-local filesystems to maintain performance.

Testing
Added comprehensive tests in `tests/func/test_gc.py` covering:

- the new `--dry` mode
- the `--cloud` flag (scanning both local and remote caches)

All tests pass locally.
Dependencies
Requires updated dvc-data (PR: treeverse/dvc-data#650)