DTLS handshake fails for most clients in multi-client str0m-to-str0m setup #932

@Teyk0o

Environment

  • str0m version: latest (both sides)
  • Setup: str0m-based SFU server + str0m-based load test client (str0m-stress), both using Rtc::builder().build()
  • Network: client runs on a remote machine, SFU in Docker on a VPS (2 cores), UDP ports 16384-16385 mapped through Docker
  • Scale: 2-30+ concurrent clients connecting to the same SFU

Symptom

ICE completes successfully for all clients (Checking → Connected → Completed), but DTLS only completes for ~20-30% of connections. The remaining clients never get past the DTLS handshake: no data channel opens, no media flows, and the connection eventually times out.

Observed behavior

Working client (minority):

ICE state -> Connected
ICE state -> Completed
DTLS setup is: true
ClientHello: DTLS version=DTLS1_2, cookie_len=0, offering 3 cipher suites
[ServerHello received, handshake completes]
ChannelOpen("data")
MediaData received ✓

Failing client (majority):

ICE state -> Connected
ICE state -> Completed
DTLS setup is: true
ClientHello: DTLS version=DTLS1_2, cookie_len=0, offering 3 cipher suites
[No ServerHello — nothing comes back]
Flight timeout in: 1.189s
[Retransmit ClientHello]
Flight timeout in: 0.802s
[Retransmit again, eventually gives up after connect timeout (40s)]

Key data points

  1. ICE works fine for all clients — STUN binding succeeds, candidate pairs are validated, consent checks flow at ~1 pps. UDP connectivity is not the issue.
  2. The server receives the DTLS ClientHello — our packet dispatch logs confirm the UDP packet is received on the correct socket, dispatched to the correct Rtc instance via handle_input(Input::Receive(...)). No packets are dropped.
  3. The server does not produce a DTLS response: after handle_input processes the ClientHello, poll_output() does not return a Transmit containing a ServerHello/HelloVerifyRequest. The SFU's event loop calls poll_output in a tight loop until Timeout, so it's not a missed-poll issue.
  4. The SFU runs 2 threads with one UdpSocket per thread. Each client is assigned to a specific thread/socket via hash-based sharding. The working and failing clients are distributed across both threads, so it's not a per-thread issue.
  5. All clients share the same pre-generated DtlsCert on the server side (via Arc). The client side generates a fresh cert per Rtc::builder().build().
  6. Timing pattern: the first 1-2 clients almost always succeed. As more clients connect concurrently (even just 4-8 total), the DTLS success rate drops dramatically. With 8 concurrent clients, typically only 1-2 complete DTLS.
  7. When a client is torn down and reconnected (new Rtc, new WHIP POST, new UDP socket), it sometimes succeeds on the retry, suggesting the issue is timing/state-dependent, not a permanent crypto mismatch.
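
To double-check points 2 and 3 on the wire, independent of str0m's own logging, the raw UDP payloads on both the receive and the send path could be classified before they are handed over. The following is only a sketch: it assumes the RFC 7983 first-byte demultiplexing ranges and the DTLS 1.2 record layout (13-byte record header, handshake type in the next byte), and the names `PacketKind`, `DtlsKind`, and `classify` are hypothetical helpers, not part of str0m.

```rust
/// Coarse packet classes on a multiplexed WebRTC UDP socket (RFC 7983 ranges).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PacketKind {
    Stun,
    Dtls(DtlsKind),
    Rtp,
    Other,
}

/// What a DTLS record carries, as far as it can be read without decrypting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DtlsKind {
    ClientHello,
    ServerHello,
    HelloVerifyRequest,
    OtherHandshake,
    NonHandshake,
}

/// Classify a UDP payload by its first byte (hypothetical helper, not a str0m API).
fn classify(buf: &[u8]) -> PacketKind {
    match buf.first().copied() {
        Some(0..=3) => PacketKind::Stun,
        Some(20..=63) => PacketKind::Dtls(classify_dtls(buf)),
        Some(128..=191) => PacketKind::Rtp,
        _ => PacketKind::Other,
    }
}

/// Assumes the DTLS 1.2 record layout: type(1) version(2) epoch(2) seq(6) len(2).
fn classify_dtls(buf: &[u8]) -> DtlsKind {
    // Record content type 22 means "handshake".
    if buf.first() != Some(&22) {
        return DtlsKind::NonHandshake;
    }
    // Bytes 3..5 hold the epoch. Only epoch 0 records are plaintext,
    // so the handshake type is readable only there.
    if buf.get(3..5) != Some(&[0u8, 0u8][..]) {
        return DtlsKind::OtherHandshake;
    }
    // Byte 13 is the handshake message type that follows the record header.
    match buf.get(13) {
        Some(&1) => DtlsKind::ClientHello,
        Some(&2) => DtlsKind::ServerHello,
        Some(&3) => DtlsKind::HelloVerifyRequest,
        _ => DtlsKind::OtherHandshake,
    }
}
```

Logging `classify(..)` for every datagram read from the socket and for every Transmit payload written to it would show whether the server emits its own ClientHello toward a failing client (which would point at a role conflict) or emits no DTLS record at all.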

SDP setup

  • Client (offerer): creates offer with a=setup:actpass
  • Server (answerer): accept_offer() generates answer — presumably with a=setup:active (server initiates DTLS)

The client-side log shows DTLS setup is: true followed by ClientHello, which suggests the client thinks it's the DTLS initiator (active role). If the server also considers itself active (from the SDP answer), both sides would send ClientHello to each other, and neither would respond with ServerHello — a role conflict deadlock.

However, we haven't confirmed this theory because we can't easily inspect the negotiated a=setup attribute at runtime.
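
One way to test the role theory without touching str0m internals is to read the a=setup lines straight from the SDP text that already passes through the signalling code (for example the server_answer string on the client). This is a minimal sketch under that assumption; `Setup`, `setup_roles`, `answer_is_consistent`, and `offerer_is_dtls_client` are hypothetical helpers, and holdconn handling is deliberately left out of the consistency check.

```rust
/// Values of the SDP `a=setup` attribute (RFC 4145).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Setup {
    Active,
    Passive,
    ActPass,
    HoldConn,
}

/// Collect every `a=setup:` value found in an SDP string, in order.
fn setup_roles(sdp: &str) -> Vec<Setup> {
    sdp.lines()
        .filter_map(|line| line.trim().strip_prefix("a=setup:"))
        .filter_map(|value| match value.trim() {
            "active" => Some(Setup::Active),
            "passive" => Some(Setup::Passive),
            "actpass" => Some(Setup::ActPass),
            "holdconn" => Some(Setup::HoldConn),
            _ => None,
        })
        .collect()
}

/// True if the answer picks a role that complements the offer.
/// Simplification: holdconn is treated as inconsistent here.
fn answer_is_consistent(offer: Setup, answer: Setup) -> bool {
    matches!(
        (offer, answer),
        (Setup::ActPass, Setup::Active)
            | (Setup::ActPass, Setup::Passive)
            | (Setup::Active, Setup::Passive)
            | (Setup::Passive, Setup::Active)
    )
}

/// Given the answer's role, tell whether the offerer should send the ClientHello.
/// Returns None when the answer does not settle the roles.
fn offerer_is_dtls_client(answer: Setup) -> Option<bool> {
    match answer {
        Setup::Passive => Some(true),
        Setup::Active => Some(false),
        Setup::ActPass | Setup::HoldConn => None,
    }
}
```

Logging the result of `offerer_is_dtls_client` next to the client's "DTLS setup is: true" line for both a working and a failing connection would show whether the two sides ever disagree about who sends the ClientHello.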

Reproduction

// Server side (SFU)
let mut rtc = Rtc::builder()
    .set_dtls_cert(shared_cert.clone())
    .set_fingerprint_verification(true)
    .enable_bwe(Some(Bitrate::kbps(1000)))
    .build(Instant::now());

let candidate = Candidate::host(local_addr, "udp")?;
rtc.add_local_candidate(candidate);

let answer = rtc.sdp_api().accept_offer(client_offer)?;
// → send answer back to client via HTTP
// → send Rtc to event loop thread

// Client side (bench tool)
let mut rtc = Rtc::builder()
    .set_stats_interval(Some(Duration::from_secs(2)))
    .build(Instant::now());

let candidate = Candidate::host(local_addr, "udp")?;
rtc.add_local_candidate(candidate);

let mut change = rtc.sdp_api();
change.add_media(MediaKind::Audio, Direction::SendOnly, None, None, None);
change.add_channel("data".to_string());
let (offer, pending) = change.apply()?;

let answer = SdpAnswer::from_sdp_string(&server_answer)?;
rtc.sdp_api().accept_answer(pending, answer)?;

// Event loop: poll_output → Transmit (send), recv_from → Input::Receive

Connect 8+ clients simultaneously to reproduce. First 1-2 usually succeed, rest hang at DTLS.

Questions

  1. Is the DTLS role (active/passive) correctly derived from the SDP a=setup attribute in accept_offer / accept_answer? Could both sides end up as active?
  2. Is there a known concurrency issue when multiple Rtc instances on the same thread/process perform DTLS handshakes simultaneously?
  3. Could the shared DtlsCert cause issues when multiple handshakes use the same certificate concurrently?

Note: I used AI to write this issue because I'm French and I want to be clear.
