Fix SBP-2 session-scheduler deadlock (kernel registry busy-timeout panic) #33

Open
mhellevang wants to merge 1 commit into mrmidi:main from mhellevang:fix/sbp2-session-scheduler-deadlock

@mhellevang

Problem

A full kernel panic (busy timeout (60s): multiple entries holding the registry busy ... @IOService.cpp) can be triggered during SBP-2 session
teardown. The dext wedges instead of terminating, the IOKit termination
queue backs up, and after 60s the registry-busy watchdog panics the machine.

Root cause is an AB-BA deadlock between the user-client queue and the driver
queue, over the DriverKitSessionScheduler lock:

  • ASFWDriverUserClient-Default (teardown): UserClient::Stop →
    SessionRegistry::ReleaseOwner → LoginSession::Logout →
    StartLogoutTimer → ScheduleAfter → ArmNextLocked, which calls
    timer_->WakeAtTime() while holding lock_. WakeAtTime is
    RPC-dispatched to the timer's queue (ASFWDriver-Default), so the thread
    blocks waiting for that queue.
  • ASFWDriver-Default (concurrent logout-write completion off the AR
    interrupt): OnLogoutWriteComplete → StartLogoutTimer → ScheduleAfter
    → tries to take lock_.

The user-client thread holds lock_ and waits for the driver queue to service
the WakeAtTime RPC; the driver queue's thread holds no lock but is blocked
trying to take lock_, so it never services the RPC. Deadlock → panic.
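
To make the cycle explicit, here is a small standard-C++ model of the wait-for relationships described above. It is only an illustration: `WaitGraph`, `HasWaitCycle`, `BeforeFix`, and `AfterFix` are hypothetical names that do not exist in the driver, and the graph edges are transcribed by hand from the description rather than derived from real lock state.

```cpp
#include <map>
#include <set>
#include <string>

// Each key is blocked on (or, for a lock, held by) its mapped value.
using WaitGraph = std::map<std::string, std::string>;

// Returns true if following wait edges from any node loops back on itself.
bool HasWaitCycle(const WaitGraph& graph) {
    for (const auto& entry : graph) {
        std::set<std::string> visited;
        std::string node = entry.first;
        while (graph.count(node) != 0) {
            if (!visited.insert(node).second) return true;  // revisited: cycle
            node = graph.at(node);
        }
    }
    return false;
}

// Wait-for edges as described for the unpatched teardown path.
WaitGraph BeforeFix() {
    return {
        {"user-client thread", "driver queue"},  // blocked in the WakeAtTime() RPC
        {"driver queue", "lock_"},               // completion handler wants lock_
        {"lock_", "user-client thread"},         // lock_ is held across the RPC
    };
}

// With lock_ released before the RPC, nothing holds lock_ while waiting.
WaitGraph AfterFix() {
    return {
        {"user-client thread", "driver queue"},  // still an RPC, but lock-free
    };
}
```

`HasWaitCycle(BeforeFix())` reports a cycle and `HasWaitCycle(AfterFix())` does not, which is the whole argument for moving the RPC out of the critical section.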

Fix

Move the WakeAtTime() call out of the critical section. ArmNextLocked is
split into:

  • EarliestDeadlineLocked() — pure read of the next deadline under lock_.
  • ArmTimerUnlocked() — calls WakeAtTime() with lock_ released.

Each call site computes the deadline (and retains an OSSharedPtr to the
timer for lifetime safety) inside the lock, then arms outside it. The timer's
queue handlers can now take lock_ while another thread arms, so the cycle
can't form.
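
The call-site shape can be sketched in portable C++ as follows. This is not the code in the diff: `FakeTimer` and `SchedulerSketch` are hypothetical stand-ins, `std::shared_ptr` replaces `OSSharedPtr`, `std::mutex` replaces the scheduler's real lock, and the `on_arm` hook runs synchronously on the calling thread to model a handler on the timer's queue that needs lock_ (in the driver that handler runs on a different queue).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <optional>
#include <set>
#include <utility>

// Hypothetical stand-in for the DriverKit timer. WakeAtTime() runs a hook
// first, modelling work on the timer's queue that may need the scheduler lock.
struct FakeTimer {
    std::function<void()> on_arm;   // models a handler on the timer's queue
    std::optional<uint64_t> armed;  // last deadline passed to WakeAtTime()

    void WakeAtTime(uint64_t deadline) {
        if (on_arm) on_arm();
        armed = deadline;
    }
};

class SchedulerSketch {
public:
    explicit SchedulerSketch(std::shared_ptr<FakeTimer> timer)
        : timer_(std::move(timer)) {
        assert(timer_ != nullptr);
    }

    void ScheduleAfter(uint64_t deadline) {
        std::optional<uint64_t> earliest;
        std::shared_ptr<FakeTimer> timer;
        {
            std::lock_guard<std::mutex> guard(lock_);
            deadlines_.insert(deadline);
            earliest = EarliestDeadlineLocked();  // pure read under lock_
            timer = timer_;  // retain the timer while still under the lock
        }
        ArmTimerUnlocked(timer, earliest);  // lock_ is released by this point
    }

    // Takes lock_; safe to call from the timer hook only because arming
    // happens outside the critical section.
    std::size_t PendingCount() {
        std::lock_guard<std::mutex> guard(lock_);
        return deadlines_.size();
    }

private:
    std::optional<uint64_t> EarliestDeadlineLocked() const {
        if (deadlines_.empty()) return std::nullopt;
        return *deadlines_.begin();
    }

    static void ArmTimerUnlocked(const std::shared_ptr<FakeTimer>& timer,
                                 std::optional<uint64_t> deadline) {
        if (timer && deadline) timer->WakeAtTime(*deadline);
    }

    std::mutex lock_;
    std::multiset<uint64_t> deadlines_;
    std::shared_ptr<FakeTimer> timer_;
};
```

If `on_arm` is set to call `PendingCount()`, the call returns normally because `ScheduleAfter` no longer holds `lock_` when it reaches `WakeAtTime()`. With arming still inside the `lock_guard` scope, the same hook would try to re-lock a mutex its own thread already owns, which is the single-threaded analogue of the cross-queue deadlock.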

Testing

  • All host C++ tests pass (1143/1143).
  • Verified on hardware (macOS 26 / Tahoe, OHCI FireWire): 30 rapid
    login + forced-Stop teardown cycles with the patched dext — no panic,
    dext stayed responsive every cycle. The same teardown path reliably
    panicked the machine before the fix.

Note

A benign residual race remains (two threads can arm concurrently, so the
timer may arm slightly late); it self-corrects on the next HandleTimerFired
re-arm and never drops a callback. Fully serializing arming onto the timer's
own queue is possible but pulls in lifetime concerns, so it is left out of
this minimal fix.
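
The residual race can be walked through with a deterministic single-threaded model. Everything here is an assumption made for illustration: `RaceModel` is a hypothetical type, the interleaving is replayed by hand rather than produced by real threads, and `HandleTimerFired` is assumed to run every due entry and then re-arm for the earliest remaining one.

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

struct RaceModel {
    std::multiset<uint64_t> deadlines;  // pending callbacks, keyed by deadline
    std::optional<uint64_t> armed;      // deadline the timer is armed for
    std::vector<uint64_t> fired;        // deadlines whose callbacks have run

    // Step 1 of a call site: record the deadline and read the earliest one
    // (the part done under lock_ in the PR).
    uint64_t AddAndSnapshot(uint64_t deadline) {
        deadlines.insert(deadline);
        return *deadlines.begin();
    }

    // Step 2: arm with a snapshot taken earlier (done with lock_ released).
    void Arm(uint64_t snapshot) { armed = snapshot; }

    // Assumed handler shape: run everything that is due, then re-arm.
    void HandleTimerFired(uint64_t now) {
        while (!deadlines.empty() && *deadlines.begin() <= now) {
            fired.push_back(*deadlines.begin());
            deadlines.erase(deadlines.begin());
        }
        if (deadlines.empty()) {
            armed.reset();
        } else {
            armed = *deadlines.begin();
        }
    }
};
```

Replaying the bad interleaving: one caller snapshots 200, a second adds 100 and snapshots 100, the second arms first, and the first then arms with its stale 200. The timer ends up armed for 200, so the 100 entry runs late, but the handler at time 200 still runs both entries, so nothing is dropped.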

WakeAtTime is RPC-dispatched to the timer's queue, whose handlers re-enter
the scheduler and take lock_. Holding lock_ across it deadlocked the
user-client teardown queue against the driver queue and tripped the 60s
IOKit registry busy-timeout kernel panic.