@GreenHatHG
Contributor

Summary

Fixes #10585

This PR significantly improves the output of dvc gc --dry. Previously, a dry run only reported the count of objects to be removed, which left users guessing about what exactly would be deleted.

Now, it provides a detailed table listing the objects, allowing users to verify the cleanup target before execution.

Example

The unit test shown after the captured output below prints:

***** captured *****
total 3 objects, 14.7k reclaimed
Type  OID             Size  Modified             Path
----  --------  ----------  -------------------  ----
file  f27f5596          6B  2025-12-25 15:22:51  /tmp/pytest-of-jooooody/pytest-5/test_gc_dry_run_report_output0/.dvc/cache/files/md5/f2/7f5596d752510b7b1e97e2e1870a45
dir   933b0b31         61B  2025-12-25 15:22:51  /tmp/pytest-of-jooooody/pytest-5/test_gc_dry_run_report_output0/.dvc/cache/files/md5/93/3b0b3162b40298a8e961e8dd238a11.dir
file  32fc8a46       14.6k  2025-12-25 15:22:51  /tmp/pytest-of-jooooody/pytest-5/test_gc_dry_run_report_output0/.dvc/cache/files/md5/32/fc8a4605bce98cdac4bf5e3edc882e

********************
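import os

from dvc.cli import main


def test_gc_dry_run_report_output(tmp_dir, dvc, capsys):
    # Garbage object 1: a standalone file (15,000 bytes, reported as 14.6k)
    (garbage_stage,) = tmp_dir.dvc_gen("garbage_file", "this is garbage" * 1000)

    # Garbage objects 2 & 3: a directory and its inner content
    (garbage_dir_stage,) = tmp_dir.dvc_gen({"garbage_dir": {"f": "in dir"}})

    # Delete the .dvc files so the cached objects are no longer referenced
    os.remove(garbage_stage.relpath)
    os.remove(garbage_dir_stage.relpath)

    # Dry run: report what would be collected without deleting anything
    ret = main(["gc", "-w", "--dry"])
    assert ret == 0

    # Echo the captured report so it is visible in the pytest output
    captured = capsys.readouterr().out
    print("***** captured *****")
    print(captured)
    print("*" * 20)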
Changes

  • Integrated with the refactored dvc-data GC interface to iterate over garbage objects.
  • Implemented a tabular output showing: Type, OID (MD5), Size, Modified time, and Path.
  • Added a summary line showing total objects and estimated space reclaimed.
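
To make the report layout concrete, here is a minimal sketch of how such a summary line and table could be rendered. It is not the code from this PR: GarbageEntry, human_size and render_report are hypothetical names, and the size rounding is only an assumption chosen to resemble the sample output above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class GarbageEntry:
    """Hypothetical record for one garbage object (not a dvc-data type)."""

    kind: str     # "file" or "dir"
    oid: str      # full object hash
    size: int     # size in bytes
    mtime: float  # modification time as a POSIX timestamp
    path: str     # location of the object inside the cache


def human_size(num_bytes: int) -> str:
    """Format a byte count with 1024-based units, e.g. 6B or 14.6k."""
    if num_bytes < 1024:
        return f"{num_bytes}B"
    value = num_bytes / 1024
    for unit in ("k", "M", "G"):
        if value < 1024:
            return f"{value:.1f}{unit}"
        value /= 1024
    return f"{value:.1f}T"


def render_report(entries: List[GarbageEntry]) -> str:
    """Build the summary line followed by one table row per object."""
    total = sum(e.size for e in entries)
    lines = [f"total {len(entries)} objects, {human_size(total)} reclaimed"]
    lines.append(f"{'Type':<4}  {'OID':<8}  {'Size':>10}  {'Modified':<19}  Path")
    for e in entries:
        modified = datetime.fromtimestamp(e.mtime).strftime("%Y-%m-%d %H:%M:%S")
        lines.append(
            f"{e.kind:<4}  {e.oid[:8]:<8}  {human_size(e.size):>10}  "
            f"{modified:<19}  {e.path}"
        )
    return "\n".join(lines)
```

Truncating the OID to eight characters keeps the rows narrow, while the full hash remains recoverable from the cache path in the last column.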

⚠️ Important Note on "Path"

You will notice the Path column displays the internal cache path (e.g., .dvc/cache/files/md5/...) rather than the original workspace filename.

Why internal paths?
Retrieving the original workspace path for a garbage object is complex. Since gc works at the ODB (Object Database) level, it doesn't inherently know where the file came from. Finding the original name would require scanning git reflog or refactoring upper-layer architecture, both of which involve significant complexity or performance costs.

However, this output is still highly valuable:

  1. Traceability: The Size and Modified timestamps act as strong evidence. Users can often identify "that 2GB model file from last Tuesday" just by looking at these metadata columns.
  2. Consistency: This behavior mirrors how Git handles dangling objects (e.g., in git fsck or prune), where only hashes are displayed.

For this PR, I opted for this simple, robust implementation. It provides immediate value while leaving room to discuss more advanced reverse-lookup logic in future iterations.

Performance & Remote Storage

The logic explicitly checks if isinstance(odb.fs, LocalFileSystem) before fetching detailed metadata (Size, Modified Time).

Reasoning:
Retrieving stat information for every single garbage object on remote storage (S3, Azure, etc.) would trigger a separate network request per object. For large projects with thousands of garbage files, this would make dvc gc --dry unacceptably slow and potentially hit API rate limits. Therefore, detailed metadata is skipped for non-local filesystems to maintain performance.
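
A minimal sketch of that guard, under stated assumptions: the two filesystem classes below are stand-ins defined only so the example runs on its own (the real check uses the project's LocalFileSystem class), and object_metadata is a hypothetical helper name, not a function from this PR.

```python
import os
from typing import Optional, Tuple


class LocalFileSystem:
    """Stand-in for the real local filesystem class (assumption)."""


class RemoteFileSystem:
    """Stand-in for any non-local filesystem such as S3 or Azure (assumption)."""


def object_metadata(fs, path: str) -> Tuple[Optional[int], Optional[float]]:
    """Return (size, mtime) for local objects and (None, None) otherwise."""
    if not isinstance(fs, LocalFileSystem):
        # Skip the stat call: on remote storage it would cost one
        # network request per garbage object.
        return None, None
    try:
        st = os.stat(path)
    except OSError:
        # A missing or unreadable object should not abort the report.
        return None, None
    return st.st_size, st.st_mtime
```

With this shape, a report for a remote cache can still list the paths and leave the Size and Modified columns empty.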

Testing

Added comprehensive tests in tests/func/test_gc.py to cover the new --dry mode:

  • Accurate identification and reporting of garbage objects
  • Correct structured tabular report formatting, including human-readable sizes, timestamps, and summary statistics
  • Proper behavior when no garbage is present
  • Integration with --cloud flag (scanning both local and remote caches)
  • Robustness against edge cases, such as corrupted cache directories or missing files (graceful handling without crashes)

All tests pass locally.

Dependencies

Requires updated dvc-data (PR: treeverse/dvc-data#650)

@codecov

codecov bot commented Dec 25, 2025

Codecov Report

❌ Patch coverage is 33.93939% with 109 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.31%. Comparing base (2431ec6) to head (e924dd8).
⚠️ Report is 172 commits behind head on main.

Files with missing lines  Patch %  Lines
------------------------  -------  -----
dvc/repo/gc.py            42.30%   57 Missing and 3 partials ⚠️
tests/func/test_gc.py     19.67%   49 Missing ⚠️

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #10937       +/-   ##
===========================================
- Coverage   90.68%   70.31%   -20.38%     
===========================================
  Files         504      503        -1     
  Lines       39795    41016     +1221     
  Branches     3141     3237       +96     
===========================================
- Hits        36087    28839     -7248     
- Misses       3042    11323     +8281     
- Partials      666      854      +188     

@GreenHatHG
Contributor Author

Hi maintainers,

The CI is failing as expected because this work depends on an unmerged PR in dvc-data which introduces the is_dir_hash function.
treeverse/dvc-data#650

I'll convert this to a Draft PR until the dependency is merged.

@GreenHatHG GreenHatHG marked this pull request as draft December 25, 2025 10:33
@skshetry
Collaborator

skshetry commented Jan 3, 2026

Hi, I think you are complicating the feature and implementation a lot. I get the intent behind separating collection and removal, but at this stage it feels like too much work for limited gain.

There's also no guarantee we'll always be able to maintain that separation, for example, if we implement #829, separating collection from removal may not be feasible.

I don’t think we need tables here. The dir/file distinction and oid are internal implementation details that we don’t expose to users, and I don’t see “Modified” as particularly meaningful for a content-addressable storage. The only field that really matters to users is the path (and maybe the count of objects that will be deleted).

While size information can be useful, since we’re dealing with garbage objects it may add unnecessary overhead.

If dvc_data.hashfile.gc.gc() doesn’t currently provide paths, we could consider breaking the API to return them instead of just a file count, and then display those paths here.

Alternatively, we could just log them inside gc() as they are deleted.
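
For illustration only, the path-returning variant suggested above could look roughly like the sketch below. This is not the dvc_data.hashfile.gc.gc() API: gc_sketch, its parameters, and the plain dict of OIDs to cache paths are all hypothetical, chosen to show how a dry run and a real run can share one code path.

```python
import os
from typing import Callable, Dict, List, Set


def gc_sketch(
    objects: Dict[str, str],
    used: Set[str],
    dry: bool = False,
    remove: Callable[[str], object] = os.unlink,
) -> List[str]:
    """Return the cache paths of unused objects, deleting them unless dry.

    `objects` maps each OID to its cache path and `used` holds the OIDs
    that must be kept. Both are hypothetical inputs for this sketch.
    """
    removed: List[str] = []
    for oid, path in objects.items():
        if oid in used:
            continue
        if not dry:
            remove(path)
        removed.append(path)
    return removed
```

A caller could then print the returned paths for --dry, or log them after a real run, without a separate collection pass.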

Development

Successfully merging this pull request may close these issues.

gc --dry does not show what files are going to be removed
