Room.sid hangs indefinitely when room_sid_changed FFI event is not delivered #651

@damonvjanis

Summary

Calling `await room.sid` from inside an agent's `JobContext` entrypoint occasionally hangs forever, even though the LiveKit server has assigned a SID to the room (visible in Cloud Analytics). Other FFI-driven events for the same `Room` instance (`participant_connected`, `track_published`, `connection_quality_changed`, and `participant_disconnected` on SIP hangup) continue to flow normally throughout the hang.

The caller experiences silent dial-tone — the agent never publishes audio — until they give up and hang up.

Environment

  • livekit==1.1.5
  • livekit-agents==1.5.6
  • LiveKit Cloud, SIP-inbound voice agents, ~10 concurrent sessions per worker pod

Mechanism

`Room._first_sid_future` is created in `Room.__init__` and is only resolved by the `room_sid_changed` FFI event handler:

```python
# rtc/room.py:189
@property
async def sid(self) -> str:
    if self._info.sid:
        return self._info.sid
    return await self._first_sid_future

# rtc/room.py:770
elif which == "room_sid_changed":
    if not self._info.sid:
        self._first_sid_future.set_result(event.room_sid_changed.sid)
    self._info.sid = event.room_sid_changed.sid
```

In the failure case, `_first_sid_future` is never set. There is no exception and no log; the awaiter just suspends forever. Meanwhile, the same `Room` instance continues to receive and dispatch all other FFI events.
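To make the mechanism concrete, here is a minimal model of it that runs without the SDK. `ModelRoom`, `dispatch`, and `simulate` are hypothetical names written for this issue, not LiveKit APIs; the only thing taken from the SDK is the shape of the two excerpts above. The short `wait_for` timeout exists only so the demo terminates.

```python
import asyncio


class ModelRoom:
    """Toy stand-in for rtc.Room (hypothetical, not the SDK class)."""

    def __init__(self):
        self._sid = ""
        self._first_sid_future = asyncio.get_running_loop().create_future()
        self.seen = []  # names of non-SID events that were dispatched

    @property
    async def sid(self):
        if self._sid:
            return self._sid
        # Suspends until someone calls set_result(); there is no other exit.
        return await self._first_sid_future

    def dispatch(self, which, payload=""):
        """Mimic the event handler's branch on the event name."""
        if which == "room_sid_changed":
            if not self._sid:
                self._first_sid_future.set_result(payload)
            self._sid = payload
        else:
            self.seen.append(which)


async def simulate(deliver_sid_event):
    room = ModelRoom()
    room.dispatch("participant_connected")
    if deliver_sid_event:
        room.dispatch("room_sid_changed", "RM_example")
    room.dispatch("track_published")
    try:
        sid = await asyncio.wait_for(room.sid, timeout=0.1)
    except asyncio.TimeoutError:
        sid = None  # without the timeout this await would never return
    return sid, room.seen


if __name__ == "__main__":
    print(asyncio.run(simulate(deliver_sid_event=True)))
    print(asyncio.run(simulate(deliver_sid_event=False)))
```

In the second run the other events are still recorded in `seen`, which matches what we observe in production: everything flows except the one event that resolves the future.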

Recurrence

  • 21 confirmed occurrences over 30 days in one production deployment (~0.7/day)
  • Rate is steady — not correlated with any deploy or service version
  • Affects multiple pods, multiple service versions, multiple SIP callers
  • Each occurrence costs one customer call (caller hangs up at ~60s SIP ringback timeout)

Evidence the SID exists server-side

For affected sessions, the LiveKit Cloud Analytics REST API (`GET /api/project//sessions/`) returns the room with the expected SID. The server assigned it; only the SDK's local future never resolves.

Suspected cause

Race or lost event in the FFI event queue / event-loop bridge specific to `room_sid_changed`. We haven't reproduced it in development yet — the rate is too low.

Workaround we're applying

```python
import asyncio

try:
    sid = await asyncio.wait_for(room.sid, timeout=5.0)
except asyncio.TimeoutError:
    # log + fall back; abort the job so the worker frees up
    ...
```

Converts the silent hang into a recoverable error.
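One caveat with a bare `asyncio.wait_for`: when the timeout fires, asyncio cancels the awaiting task, and that cancellation propagates to the future being awaited. Going by the excerpt above, a `room_sid_changed` event that arrives late would then call `set_result` on a cancelled future, which raises `asyncio.InvalidStateError`. A possible variant that avoids this wraps the await in `asyncio.shield`. This is a sketch only; `sid_or_none` and `StubRoom` are hypothetical names, and the stub stands in for the real `Room`.

```python
import asyncio
import logging

logger = logging.getLogger("sid-guard")


async def sid_or_none(sid_awaitable, timeout=5.0):
    """Return the SID, or None if it does not arrive within `timeout` seconds."""
    # `room.sid` returns a coroutine; wrap it in a Task so it can be shielded.
    task = asyncio.ensure_future(sid_awaitable)
    try:
        # shield() stops the timeout from cancelling `task`, and therefore
        # the underlying future, so a late event can still resolve it.
        return await asyncio.wait_for(asyncio.shield(task), timeout=timeout)
    except asyncio.TimeoutError:
        logger.warning("room SID not resolved after %.2fs", timeout)
        return None


class StubRoom:
    """Hypothetical stand-in exposing the same awaitable `sid` property."""

    def __init__(self):
        self.first_sid_future = asyncio.get_running_loop().create_future()

    @property
    async def sid(self):
        return await self.first_sid_future


async def demo():
    room = StubRoom()
    result = await sid_or_none(room.sid, timeout=0.05)
    still_pending = not room.first_sid_future.done()
    # A late event can still resolve the future without raising.
    room.first_sid_future.set_result("RM_example")
    return result, still_pending


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is that the shielded task stays pending until the event arrives or the loop shuts down, so this fits best when the job is aborted shortly after the timeout anyway.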

What would help from upstream

  • Confirmation of whether this is a known/possible failure mode
  • An internal timeout on `_first_sid_future` after `connect()` succeeds, raising `RuntimeError` rather than hanging — so callers don't have to arm their own timeout
  • Optionally, a way to detect "room connected but SID unresolved" without polling
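For illustration, the internal timeout requested above could look roughly like the following. This is a sketch of the behavior we are asking for, not a patch against the SDK: `TimeoutRoom`, `SID_TIMEOUT`, and the error message are hypothetical, and here the timer starts when `sid` is first awaited rather than when `connect()` returns.

```python
import asyncio


class TimeoutRoom:
    """Hypothetical model of a Room whose `sid` fails loudly instead of hanging."""

    SID_TIMEOUT = 10.0  # seconds; assumed default, not an SDK constant

    def __init__(self, sid_timeout=SID_TIMEOUT):
        self._sid = ""
        self._sid_timeout = sid_timeout
        self._first_sid_future = asyncio.get_running_loop().create_future()

    @property
    async def sid(self):
        if self._sid:
            return self._sid
        try:
            # shield() keeps the future usable if the event arrives late.
            return await asyncio.wait_for(
                asyncio.shield(self._first_sid_future), self._sid_timeout
            )
        except asyncio.TimeoutError:
            raise RuntimeError(
                f"room SID not received within {self._sid_timeout}s"
            ) from None


async def demo():
    room = TimeoutRoom(sid_timeout=0.05)
    try:
        await room.sid
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(asyncio.run(demo()))
```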

Happy to share anonymized trace fingerprints if useful.
