[BUG] short 20 seconds connection timeout could cause potential recordings loss #1087

@cruizba

Describe the bug

When the Egress process cannot communicate with LiveKit over Redis PSRPC, it shuts down if it is unable to connect within a 20-second timeout, which risks losing ongoing recordings. This short timeout seems related to this issue, as indicated by the comment:

// TODO change to 10 min once we understand PSRPC failures

I see that the Redis pub/sub RPC issue has been solved, and this timeout will probably never trigger once I update my Egress and LiveKit services. Even so, I think the timeout is too short and could cause unwanted shutdowns, for example when Redis is unreachable for more than 20 seconds.

I suggest increasing it to 10 minutes or removing it entirely.

Egress Version
v1.11.0

Egress Request
Any ongoing Egress request

Additional context

I've experienced this timeout in one of our clusters, likely because I'm using a version that doesn't include this PR. I created this issue to confirm that the shutdown is related to what I describe above, and to suggest increasing the timeout, because it can cause unwanted Egress process shutdowns and lost recordings.

Logs

github.com/livekit/egress/pkg/info.NewIOClient.func1
	/workspace/pkg/info/io.go:98
2025-12-16T14:58:48.236Z	ERROR	egress	server/server.go:77	shutting down server on io client watchdog trigger	{"nodeID": "NE_6eEH8bQHqkGN", "clusterID": "", "error": "io client failure"}
github.com/livekit/egress/pkg/server.NewServer.func1
	/workspace/pkg/server/server.go:77
github.com/livekit/egress/pkg/info.NewIOClient.func1
	/workspace/pkg/info/io.go:100

Labels

bug (Something isn't working)
