Room.sid hangs indefinitely when room_sid_changed FFI event is not delivered

## Summary

Calling `await room.sid` from inside an agent's `JobContext` entrypoint occasionally hangs forever, even though the LiveKit server has assigned a SID to the room (visible in Cloud Analytics). Other FFI-driven events for the same `Room` instance (`participant_connected`, `track_published`, `connection_quality_changed`, SIP `hangup` → `participant_disconnected`) continue to flow normally throughout the hang.

From the caller's side the line stays silent (the agent never publishes audio) until they give up and hang up.

## Environment

- `livekit==1.1.5`
- `livekit-agents==1.5.6`
- LiveKit Cloud, SIP-inbound voice agents, ~10 concurrent sessions per worker pod

## Mechanism

`Room._first_sid_future` is created in `Room.__init__` and is only resolved by the `room_sid_changed` FFI event handler:

```python
# rtc/room.py:189
@property
async def sid(self) -> str:
    if self._info.sid:
        return self._info.sid
    return await self._first_sid_future

# rtc/room.py:770
elif which == "room_sid_changed":
    if not self._info.sid:
        self._first_sid_future.set_result(event.room_sid_changed.sid)
    self._info.sid = event.room_sid_changed.sid
```

In the failure case, `_first_sid_future` is never resolved. No exception is raised and nothing is logged; the awaiter just stays suspended forever. Meanwhile the same `Room` instance continues to receive and dispatch all other FFI events.
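
The hang can be illustrated in isolation with a stand-in class. This is a minimal sketch, not SDK code: `FakeRoom`, `on_room_sid_changed`, and `sid_resolves` are hypothetical names that mirror only the SID logic quoted above, and the short timeout exists only so the demonstration terminates.

```python
import asyncio


class FakeRoom:
    """Hypothetical stand-in that mirrors only the SID logic quoted above."""

    def __init__(self) -> None:
        self._sid = ""
        # Resolved only by on_room_sid_changed, like _first_sid_future.
        self._first_sid_future = asyncio.get_running_loop().create_future()

    @property
    async def sid(self) -> str:
        if self._sid:
            return self._sid
        return await self._first_sid_future

    def on_room_sid_changed(self, sid: str) -> None:
        if not self._sid:
            self._first_sid_future.set_result(sid)
        self._sid = sid


async def sid_resolves(deliver_event: bool) -> bool:
    """Return True if the SID resolves, False if the await would hang."""
    room = FakeRoom()
    if deliver_event:
        room.on_room_sid_changed("RM_example")
    try:
        # Without this timeout the undelivered case would never return.
        await asyncio.wait_for(room.sid, timeout=0.1)
        return True
    except asyncio.TimeoutError:
        return False
```

Running `asyncio.run(sid_resolves(False))` models the failure: nothing else in the class can resolve the future, so only an external timeout ends the wait.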

## Recurrence

- 21 confirmed occurrences over 30 days in one production deployment (~0.7/day)
- Rate is steady — not correlated with any deploy or service version
- Affects multiple pods, multiple service versions, multiple SIP callers
- Each occurrence costs one customer call (caller hangs up at ~60s SIP ringback timeout)

## Evidence the SID exists server-side

For affected sessions, the LiveKit Cloud Analytics REST API (`GET /api/project/<id>/sessions/<roomId>`) returns the room with the expected SID. The server assigned it; only the SDK's local future never resolves.

## Suspected cause

We suspect a race or a lost event in the FFI event queue / event-loop bridge that is specific to `room_sid_changed`. We haven't reproduced it in development yet; the rate is too low.

## Workaround we're applying

```python
try:
    sid = await asyncio.wait_for(room.sid, timeout=5.0)
except asyncio.TimeoutError:
    # log + fall back; abort the job so the worker frees up
    ...
```

This converts the silent hang into a recoverable error.
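
One caveat, going only by the code quoted under Mechanism: when `asyncio.wait_for` times out it cancels the awaiting task, and that cancellation propagates to the future being awaited, here `_first_sid_future`. A late `room_sid_changed` would then call `set_result` on a cancelled future, which raises `asyncio.InvalidStateError`. If the job is aborted straight away this may not matter. The sketch below is one possible refinement that wraps the await in `asyncio.shield`; `sid_or_raise` and `SidTimeoutError` are hypothetical names, not part of the SDK.

```python
import asyncio


class SidTimeoutError(RuntimeError):
    """Hypothetical error type: the room SID did not arrive in time."""


async def sid_or_raise(room, timeout: float = 5.0) -> str:
    """Await room.sid, raising SidTimeoutError instead of hanging."""
    try:
        # shield() keeps the timeout from cancelling the future that
        # room.sid awaits internally, so a late event can still resolve it.
        return await asyncio.wait_for(asyncio.shield(room.sid), timeout)
    except asyncio.TimeoutError as exc:
        raise SidTimeoutError(
            f"room SID unresolved after {timeout}s"
        ) from exc
```

The trade-off is that the shielded inner task stays pending until the event arrives or the event loop shuts down, so this suits the case where the room may keep being used after the timeout.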

## What would help from upstream

- Confirmation of whether this is a known/possible failure mode
- An internal timeout on `_first_sid_future` after `connect()` succeeds, raising `RuntimeError` rather than hanging, so callers don't have to arm their own timeout
- Optionally, a way to detect "room connected but SID unresolved" without polling
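
To make the second request concrete, here is a rough sketch of what an internal timeout could look like. It is an illustration under assumptions, not a proposed patch: `RoomSketch`, `_on_connected`, `_fail_sid`, `_on_room_sid_changed`, and the `sid_timeout` parameter are all hypothetical and do not correspond to the SDK's actual structure.

```python
import asyncio


class RoomSketch:
    """Hypothetical stand-in; not the SDK's Room."""

    def __init__(self, sid_timeout: float = 5.0) -> None:
        self._sid_timeout = sid_timeout
        self._sid_timer = None
        self._first_sid_future = asyncio.get_running_loop().create_future()

    def _on_connected(self) -> None:
        # Arm a one-shot timer once connect() has succeeded.
        loop = asyncio.get_running_loop()
        self._sid_timer = loop.call_later(self._sid_timeout, self._fail_sid)

    def _fail_sid(self) -> None:
        # Fail the future so every awaiter gets an error instead of hanging.
        if not self._first_sid_future.done():
            self._first_sid_future.set_exception(
                RuntimeError("connected, but room_sid_changed never arrived")
            )

    def _on_room_sid_changed(self, sid: str) -> None:
        if self._sid_timer is not None:
            self._sid_timer.cancel()
        if not self._first_sid_future.done():
            self._first_sid_future.set_result(sid)

    @property
    async def sid(self) -> str:
        return await self._first_sid_future
```

With something along these lines, `await room.sid` would raise `RuntimeError` after the deadline, and callers would not need their own `wait_for`.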

Happy to share anonymized trace fingerprints if useful.
