Description
Describe the bug
When the Egress process has trouble communicating with LiveKit over Redis PSRPC, it shuts down if it cannot connect within a 20-second timeout, which risks losing ongoing recordings. This short timeout seems related to this issue, as indicated by the comment:
// TODO change to 10 min once we understand PSRPC failures
I see that the Redis Pub/Sub RPC issue has been fixed, and this timeout will probably never trigger once I update my Egress and LiveKit services. Still, I think the timeout is too short and could cause unwanted shutdowns, for example when Redis is unreachable for more than 20 seconds.
I suggest increasing it to 10 minutes or removing it entirely.
Egress Version
v1.11.0
Egress Request
Any ongoing Egress request
Additional context
I hit this timeout in one of our clusters, likely because I'm running a version that doesn't include this PR. I opened this issue to confirm that the shutdown is related to the timeout described above, and to suggest increasing it to avoid unwanted Egress process shutdowns and lost recordings.
Logs
github.com/livekit/egress/pkg/info.NewIOClient.func1
/workspace/pkg/info/io.go:98
2025-12-16T14:58:48.236Z ERROR egress server/server.go:77 shutting down server on io client watchdog trigger {"nodeID": "NE_6eEH8bQHqkGN", "clusterID": "", "error": "io client failure"}
github.com/livekit/egress/pkg/server.NewServer.func1
/workspace/pkg/server/server.go:77
github.com/livekit/egress/pkg/info.NewIOClient.func1
/workspace/pkg/info/io.go:100