[Question] efs-proxy EFS-side socket: is the absence of SO_KEEPALIVE and TCP_USER_TIMEOUT intentional? #353

@NathanKim91

What We Observed

We've been experiencing recurring NodeNotReady incidents caused by EFS NFS mounts becoming stuck after the efs-proxy → EFS TCP connection goes silent. Across multiple incidents, we consistently noticed a ~2-minute gap between the start of TCP retransmits on the efs-proxy → EFS connection and complete NFS session collapse.

Environment

  • efs-utils: 3.1.1
  • aws-efs-csi-driver: v3.1.0 (EKS managed add-on)
  • Kubernetes: v1.34 (Amazon EKS, ap-northeast-2)
  • Node: c7i.8xlarge

What We Noticed in the Source

Looking at configure_stream() in src/proxy/src/connections.rs, we noticed it only sets TCP_NODELAY:

pub fn configure_stream(tcp_stream: TcpStream) -> TcpStream {
    match tcp_stream.set_nodelay(true) {
        Ok(()) => {}
        Err(e) => { warn!("failed to set TCP_NODELAY: {:?}", e); }
    }
    tcp_stream
    // SO_KEEPALIVE and TCP_USER_TIMEOUT do not appear to be set here
}

Cargo.toml includes tokio = { features = ["full"] } and libc, which we understand provide the APIs needed to configure both options — so we're wondering if this was an intentional design choice, or something that could be added.
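To make the question concrete, here is a minimal sketch of what setting both options could look like. It is not taken from efs-proxy: `set_opt`, `get_opt`, and `enable_dead_peer_detection` are hypothetical names, the constants are the Linux values for x86_64 and aarch64, and `setsockopt`/`getsockopt` are declared by hand only so the snippet builds with std alone (in efs-proxy, `libc::setsockopt` and the `libc` constants would be the natural choice). The helpers are generic over `AsRawFd`, which `tokio::net::TcpStream` implements on Unix.

```rust
use std::io;
use std::mem::size_of;
use std::os::raw::{c_int, c_void};
use std::os::unix::io::AsRawFd;

// Linux option numbers (x86_64 and aarch64); the libc crate exports the
// same constants under these names.
pub const SOL_SOCKET: c_int = 1;
pub const SO_KEEPALIVE: c_int = 9;
pub const IPPROTO_TCP: c_int = 6;
pub const TCP_USER_TIMEOUT: c_int = 18;

// Declared by hand so the sketch has no crate dependencies.
unsafe extern "C" {
    fn setsockopt(fd: c_int, level: c_int, name: c_int, val: *const c_void, len: u32) -> c_int;
    fn getsockopt(fd: c_int, level: c_int, name: c_int, val: *mut c_void, len: *mut u32) -> c_int;
}

/// Set an integer-valued socket option on any socket exposing a raw fd.
pub fn set_opt<S: AsRawFd>(sock: &S, level: c_int, name: c_int, value: c_int) -> io::Result<()> {
    let ptr = &value as *const c_int as *const c_void;
    let rc = unsafe { setsockopt(sock.as_raw_fd(), level, name, ptr, size_of::<c_int>() as u32) };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

/// Read an integer-valued socket option back (useful for verification).
pub fn get_opt<S: AsRawFd>(sock: &S, level: c_int, name: c_int) -> io::Result<c_int> {
    let mut value: c_int = 0;
    let mut len = size_of::<c_int>() as u32;
    let ptr = &mut value as *mut c_int as *mut c_void;
    let rc = unsafe { getsockopt(sock.as_raw_fd(), level, name, ptr, &mut len) };
    if rc == 0 { Ok(value) } else { Err(io::Error::last_os_error()) }
}

/// Hypothetical helper: turn on keepalive probes and cap how long sent data
/// may remain unacknowledged before the kernel aborts the connection.
pub fn enable_dead_peer_detection<S: AsRawFd>(sock: &S, user_timeout_ms: c_int) -> io::Result<()> {
    set_opt(sock, SOL_SOCKET, SO_KEEPALIVE, 1)?;
    set_opt(sock, IPPROTO_TCP, TCP_USER_TIMEOUT, user_timeout_ms)
}
```

A call such as `enable_dead_peer_detection(&tcp_stream, 25_000)` could sit next to the existing `set_nodelay(true)` call. SO_KEEPALIVE on its own uses the system-wide keepalive timers, so TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT would probably need tuning as well for it to be useful.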

Our Interpretation

Our understanding (happy to be corrected):

  • Without TCP_USER_TIMEOUT, when the TCP path becomes unresponsive during active NFS I/O, the kernel falls back to RTO exponential backoff before declaring the connection dead. How long that takes is governed by net.ipv4.tcp_retries2 (the stock Linux default of 15 allows roughly 15 minutes in the worst case); in our incidents the collapse came after approximately 2 minutes.
  • SO_KEEPALIVE would help detect dead connections during idle periods, but wouldn't fire when NFS I/O is actively in flight, which is our typical scenario.
  • NFS v4.1 session lease is ~90 seconds (RFC 5661 default). If detection takes ~2 minutes, efs-proxy reconnects after the session has already expired on the EFS side, making session recovery impossible.
  • Setting TCP_USER_TIMEOUT to a value shorter than the NFS v4.1 lease (e.g., 25s) might allow efs-proxy to detect and reconnect within the lease window, enabling transparent session recovery.
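As a rough cross-check on the timing in the bullets above, the sketch below models the worst-case retransmission time as a function of net.ipv4.tcp_retries2. It assumes the 200 ms minimum and 120 s maximum RTO described in the kernel's ip-sysctl documentation; the function name is ours, and a real connection starts from an RTT-derived RTO, so treat the result as an approximation only.

```rust
/// Approximate time (in milliseconds) before the kernel gives up on an
/// unacknowledged segment, for a given net.ipv4.tcp_retries2 value.
/// Assumes the RTO starts at 200 ms, doubles after each retransmission,
/// and is capped at 120 s.
pub fn modeled_timeout_ms(tcp_retries2: u32) -> u64 {
    const RTO_MIN_MS: u64 = 200;
    const RTO_MAX_MS: u64 = 120_000;

    let mut rto = RTO_MIN_MS;
    let mut total = 0;
    // The initial transmission plus each retry waits one RTO interval.
    for _ in 0..=tcp_retries2 {
        total += rto;
        rto = (rto * 2).min(RTO_MAX_MS);
    }
    total
}
```

Under this model, `modeled_timeout_ms(15)` gives 924,600 ms (about 15 minutes) and `modeled_timeout_ms(8)` gives 102,200 ms, so the ~2-minute figure is worth checking against the actual `sysctl net.ipv4.tcp_retries2` value on the affected nodes. Either way, a TCP_USER_TIMEOUT of 25,000 ms would bound detection well inside a 90-second lease.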

We recognize this is our interpretation based on observed timing. If the 2-minute behavior is intentional, or if there's something else in the connection lifecycle we're missing, we'd appreciate the clarification.

Question for Maintainers

Is the absence of TCP_USER_TIMEOUT (and SO_KEEPALIVE) on the EFS-side socket intentional? If so, is there another mechanism that handles dead connection detection within the NFS v4.1 lease window?
