Skip to content

storage: add splitfdstream for efficient layer transfer with reflink support#651

Open
giuseppe wants to merge 8 commits into
podman-container-tools:mainfrom
giuseppe:splitfdstream
Open

storage: add splitfdstream for efficient layer transfer with reflink support#651
giuseppe wants to merge 8 commits into
podman-container-tools:mainfrom
giuseppe:splitfdstream

Conversation

@giuseppe

Copy link
Copy Markdown
Contributor

Add a new splitfdstream package that enables efficient container layer transfer between storage instances by separating tar metadata from file content. File content is passed as file descriptors.

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 6% of the code base for an extremely niche use case :/

I’m afraid I can’t spend time on this now; and anyway some other work on pkg/archive needs to happen first, and expect the rebase to be somewhat non-trivial.

@giuseppe

Copy link
Copy Markdown
Contributor Author
  • 6% of the code base for an extremely niche use case :/

I’m afraid I can’t spend time on this now; and anyway some other work on pkg/archive needs to happen first, and expect the rebase to be somewhat non-trivial.

what other work would block this feature?

I don't love either the amount of extra code, but it is almost all in new APIs/packages that are experimental and not leaking too much to the rest of the library.

We went around this problem for a long time, and in the end this seems like an ~ok solution to not expose too much of the containers storage internals augmenting the tar format that is already plugged everywhere.

@giuseppe

Copy link
Copy Markdown
Contributor Author

to add more context: the goal of this feature is to be able to copy a layer, similarly to what we would do with Diff, but having the possibility to deduplicate files using reflinks. We need two separate copies because OS files can't be kept on the containers storage for different reasons, a podman system reset -f should not break the system, and OS files need to be relabeled.

There are other interesting use cases enabled with a RPC like facilitate usage from different languages (Rust in our case), inject into a build container to grab R/O access to layers/images, but this is out of scope now.

I realize complexity is not trivial just to get access to the raw files, but what alternatives do we have?

From the store API PoV we could just extend it to return a map $LAYER_FILE -> $PATH_TO_THE_FILE, but how would we use it from Rust?

Any suggestions?

@mtrmac

mtrmac commented Feb 10, 2026

Copy link
Copy Markdown
Contributor

(Just to avoid a misunderstanding, I didn’t really read the PR. I have, at this point, ~no opinion on RPC vs. Go API.)

Comment thread storage/pkg/archive/archive.go Outdated
return &tarReaderIterator{
tr: tr,
trBuf: trBuf,
buffer: make([]byte, 1<<20),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like 1<<20 should be in a const somewhere with a rationale?


// splitFDStreamSocketDescriptor is the fd for the Unix socket used to
// receive file descriptors via SCM_RIGHTS in the re-exec child.
const splitFDStreamSocketDescriptor = 5

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding this seems a bit fragile, can't we pass it as an env var or a CLI arg?

Comment on lines +34 to +35
// FDs are streamed to the child process one-at-a-time over a Unix socket
// using SCM_RIGHTS, avoiding EMFILE from inheriting too many FDs at exec.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have to be truly one at a time, see also https://github.com/cgwalters/jsonrpc-fdpass?tab=readme-ov-file#41-fd-batching

In my initial PoC work here I handled this by just sending the splitfdstream inline data itself as a fd (could be O_TMPFILE or memfd or pipe), and then sending all the associated fds over jsonrpc-fdpass which automatically handles buffering.

Ensure we processing just 1 recvmsg() at a time (only ~200 fds on Linux) seems really reasonable and unlikely to get anywhere close to modern fd limits.


So if we do choose jsonrpc-fdpass, I think we also need to standardize this "splitfdstream serialization".

Comment thread storage/pkg/fileutils/reflink_linux.go Outdated
return err
}

// Try copy_file_range - kernel-level copy, more efficient than userspace

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprising we weren't doing this before.

Comment thread storage/pkg/splitfdstream/fdpass.go Outdated

// ReadLine reads bytes until a newline is encountered, using recvmsg.
// Any FDs received are closed since line-reading doesn't expect them.
func (p *FDPasser) ReadLine() ([]byte, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would one want this?

Comment thread storage/pkg/splitfdstream/protocol.go Outdated

// JSON-RPC 2.0 protocol constants
const (
JSONRPCVersion = "2.0"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@giuseppe giuseppe force-pushed the splitfdstream branch 2 times, most recently from 23ed544 to 8d74f84 Compare February 11, 2026 16:07
@giuseppe

Copy link
Copy Markdown
Contributor Author

still PoC, so you can avoid a review as it is not yet ready but we can use to analyze the complexity.

Moved to use https://github.com/cgwalters/jsonrpc-fdpass-go

Now the code size is almost 50% than the last revision

@packit-as-a-service

Copy link
Copy Markdown

Packit jobs failed. @containers/packit-build please check.

@giuseppe giuseppe force-pushed the splitfdstream branch 2 times, most recently from 1b087bb to bc6c5ff Compare February 12, 2026 09:15
@giuseppe giuseppe force-pushed the splitfdstream branch 7 times, most recently from a0876b6 to d77fbc4 Compare February 17, 2026 12:44
@giuseppe giuseppe changed the title [WIP] storage: add splitfdstream for efficient layer transfer with reflink support storage: add splitfdstream for efficient layer transfer with reflink support Feb 19, 2026
@giuseppe giuseppe marked this pull request as ready for review February 19, 2026 14:58
@kolyshkin

Copy link
Copy Markdown
Contributor

I think these two commits can deserve own PR and merged separately:

  • storage, overlay: use openat2 instead of using procfs
  • composefs: do not leak open file

@giuseppe

giuseppe commented May 5, 2026

Copy link
Copy Markdown
Contributor Author

I think these two commits can deserve own PR and merged separately:

  • storage, overlay: use openat2 instead of using procfs
  • composefs: do not leak open file

I've created two new PRs with these patches:

I've left "storage, overlay: use openat2 instead of using procfs" as part of this PR as well to avoid rebase conflicts, I will drop it once #820 is merged

Comment thread storage/pkg/archive/archive.go Outdated
Comment on lines +122 to +123
s.TrackConnection()
go s.HandleConnection(unixConn)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since storage/go.mod has go 1.25, you can use WaitGroup.Go instead of Add/Done. IOW replace this with

s.connections.Go(func() {
   s.HandleConnection(unixConn)
})

and remove TrackConnection and the call to s.connections.Done().

Unless, of course, you need TrackConnection separately (I suspect I am missing something here since you made it public).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, you're using it later.

What if we make this internal, and make HandleConnection call connections.Go(func() { handleConnection(conn) })? I mean, do not expose the refcount details to the API

Comment thread storage/store.go
roLayerStoresUseGetters []roLayerStore // Almost all users should use the provided accessors instead of accessing this field directly.

// FIXME: The following fields need locking, and dont have it.
// FIXME: The following fields need locking, and don't have it.

@kolyshkin kolyshkin May 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked there are another 62 instances of this symbol (it is https://en.wikipedia.org/wiki/Right_single_quotation_mark)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

508 instances in the whole container-libs

Comment thread storage/store.go Outdated
}
driver, ok := s.graphDriver.(splitfdstream.SplitFDStreamDriver)
if !ok {
s.stopUsingGraphDriver()

This comment was marked as outdated.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, I missed the startUsingGraphDriver.

Comment thread storage/store.go Outdated
Comment on lines +4176 to +4179
if !ok {
s.stopUsingGraphDriver()
return nil, nil, fmt.Errorf("driver does not support splitfdstream")
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already did this check above, no need to check it twice.

Comment on lines +33 to +35
if options == nil {
return nil, nil, fmt.Errorf("options cannot be nil")
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not important, but you can do

if options == nil {
   options = &splitfdstream.GetSplitFDStreamOpts{}
}

instead to slightly simplify usage by allowing nil for use "defaults" (if such usage makes sense).

}

flatPath := string(buf[:n])
if strings.Contains(flatPath, "..") || filepath.IsAbs(flatPath) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having .. in the file name is OK here (having .. as an element in path is not). The best way to check is probably something like

if filepath.Clean("/"+flatPath) != "/"+flatPath {

OR, if redirect attribute never contains ".." then the existing check is fine.

var fd int
var openErr error

if composefsMountFd >= 0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe factor out reading the attribute into a separate function, so that the code below can be unified?

Comment on lines +185 to +189
// Clean the file name to prevent path traversal
cleanName := filepath.Clean(header.Name)
if strings.Contains(cleanName, "..") {
return false, fmt.Errorf("invalid file path: %s", header.Name)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

giuseppe and others added 8 commits May 12, 2026 13:01
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Extend the store with splitfdstream capabilities exposed via a
UNIX socket for JSON-RPC communication.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Implement the SplitFDStreamDriver interface for the overlay driver,
enabling efficient layer operations with reflink support.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Add a new json-proxy method that returns a Unix socket file descriptor
to the client.  The client then speaks the jsonrpc-fdpass protocol
directly over that socket for splitfdstream operations, bypassing the
json-proxy for bulk data transfer.

Bump protocolVersion to "0.2.9".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@cgwalters

Copy link
Copy Markdown
Contributor

Hi, I wanted to raise here that after some more investigation, I am proposing a change in direction. Basically, I found that varlink has "de facto" decent support for fd passing (originating from systemd). I think it makes sense to use that as a future replacement for the current skopeo proxy protocol, and to not use jsonrpc-fdpass.

See composefs/composefs-rs#309 for some PoC work.

@mtrmac

mtrmac commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

I don’t know the details but ISTR Podman has felt burned by varlink in the past. @mheon ?

@cgwalters

Copy link
Copy Markdown
Contributor

Right, obviously there's a huge history here in trying to use varlink as the RPC mechanism instead of the Docker HTTP API. No one is proposing having varlink at a scope as large as that, this would really just be a subset of image APIs that need fd passing and can't really be covered by the HTTP APIs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to "common" package image Related to "image" package storage Related to "storage" package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants