Fix reflink cloning on filesystems with large block sizes such as ZFS #356
Merged
OpenZFS 2.2+ implements FICLONERANGE, so the CanClone probe now succeeds on ZFS, but actual clones often fail: ZFS requires alignment to its recordsize (128k by default, reported as st_blksize) and returns EAGAIN when the source range hasn't been committed to disk yet. Extraction with seeds then failed with "invalid argument" or "resource temporarily unavailable" errors.

On top of that, the clone math in fileSeedSegment and nullChunkSection assumed every segment contains at least one full aligned block, which is always true for 4k blocks but not for 128k ones. Segments smaller than a block made the aligned length underflow, made the head/tail copies write outside the segment, and could call FICLONERANGE with a zero length, which the kernel interprets as "clone to the end of the source file", silently corrupting the target.

Guard against ranges that contain no full aligned block by copying them instead, and fall back to copying the blocks whenever CloneRange fails. Cloning failures are no longer fatal on any filesystem.

Verified against a real ZFS pool (zfs 2.2.2, zfs_bclone_enabled=1): previously failing extraction tests now pass and properly aligned ranges are still reflinked.

Fixes #353
Fixes #353
Problem
Tests (and real extraction with seeds) fail on ZFS-backed storage with "invalid argument" or "resource temporarily unavailable" errors from CloneRange. Two separate causes:
1. ZFS reflink semantics. OpenZFS 2.2+ implements FICLONERANGE, so the 0-byte CanClone probe succeeds, but real clones are much stricter than on btrfs/XFS: offsets must be aligned to the recordsize (128k by default, reported as st_blksize), and cloning a range that hasn't been committed to disk yet returns EAGAIN (zfs_bclone_wait_dirty=0 default), which is exactly what happens when the null seed clones from its freshly written zero blockfile. On pre-2.2 ZFS the ioctl didn't exist, the probe failed, and everything quietly copied instead, which is why this is new.
2. A latent bug in the clone math, exposed by 128k blocks. fileSeedSegment.clone() and nullChunkSection.clone() assume every segment contains at least one full aligned block. That always holds with 4k blocks and ≥16k chunks, but not with 128k blocks. For segments smaller than a block:
   - alignLength underflows (uint64), producing absurd clone lengths,
   - the head/tail copies write outside the segment,
   - when alignLength == 0, FICLONERANGE is called with length 0, which the kernel defines as "clone to the end of the source file", silently overwriting the target far beyond the segment.
The last two can corrupt output on any filesystem given a large enough st_blksize.
Fix
- Ranges that contain no full aligned block are copied instead of cloned.
- CloneRange failure now falls back to copying that range instead of failing the extraction, the same approach coreutils cp takes.
- The null seed skips the copy entirely when the target is still blank.
- seed.go gains a small cloneRange indirection so tests can simulate filesystems where the probe passes but cloning fails; new tests cover the fallback and small-segment geometry.
Verification
Reproduced and verified on a real ZFS pool (Ubuntu 24.04 VM, zfs-2.2.2, zfs_bclone_enabled=1, 128k recordsize):
- Previously failing TestExtractCommand subtests now pass on ZFS.
- Properly aligned ranges are still reflinked on ZFS through FICLONERANGE (verified via zpool get bcloneused), so ZFS keeps the dedup/speed benefit rather than degrading to copy-only.
- go test ./... also passes on ext4.