feat(sbp2): add session registry and command flow #19
2823267 to 9113139
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 91131392ab
if (lock_ != nullptr) {
    IOLockFree(lock_);
    lock_ = nullptr;
Unlock the registry lock before freeing it
When the registry is destroyed, this frees lock_ while the IOLockGuard created at the top of the destructor is still alive. As the function returns, the guard destructor will call IOLockUnlock on the already-freed lock pointer it captured, which can crash or corrupt memory during driver shutdown/unload. End the guard scope before calling IOLockFree or free the lock after it has been unlocked.
Fixed in the current branch head (166b8a7). The registry destructor now ends the lock guard scope before freeing lock_, so the guard cannot unlock a freed IOLock during shutdown.
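For readers following the thread, the ordering described in this fix can be sketched as below. This is a minimal illustration, not the code from 166b8a7: `FakeLock`, `FakeLockGuard`, `SketchRegistry`, and the `Events()` log are hypothetical stand-ins for IOLock, IOLockGuard, and the registry, so the sequence can be checked on a host without DriverKit.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for IOLock / IOLockGuard; NOT the DriverKit API.
struct FakeLock { bool held = false; };

// Records the order of lock operations so the sequence can be inspected.
std::vector<std::string>& Events() {
    static std::vector<std::string> events;
    return events;
}

FakeLock* FakeLockAlloc() { return new FakeLock(); }
void FakeLockLock(FakeLock* lock) { lock->held = true; Events().push_back("lock"); }
void FakeLockUnlock(FakeLock* lock) { lock->held = false; Events().push_back("unlock"); }
void FakeLockFree(FakeLock* lock) { Events().push_back("free"); delete lock; }

class FakeLockGuard {
public:
    explicit FakeLockGuard(FakeLock* lock) : lock_(lock) { FakeLockLock(lock_); }
    ~FakeLockGuard() { FakeLockUnlock(lock_); }
    FakeLockGuard(const FakeLockGuard&) = delete;
    FakeLockGuard& operator=(const FakeLockGuard&) = delete;
private:
    FakeLock* lock_;
};

class SketchRegistry {
public:
    SketchRegistry() : lock_(FakeLockAlloc()) {}
    ~SketchRegistry() {
        if (lock_ == nullptr) return;
        {
            FakeLockGuard guard(lock_);
            sessions_.clear();  // teardown that needs the lock
        }  // guard unlocks here, while lock_ is still valid
        FakeLockFree(lock_);    // safe: nothing holds or references the lock now
        lock_ = nullptr;
    }
private:
    FakeLock* lock_;
    std::vector<int> sessions_;
};
```

Destroying a `SketchRegistry` should record lock, unlock, free in that order; the bug in the original review would have recorded free before unlock.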
orb->SetFetchAgentWriteRetries(testFetchAgentWriteRetries_);
orb->SetAppended(true);
outstandingORBs_[MakeORBKey(orb->GetORBAddress())] = orb;
After a command ORB is prepared and inserted into outstandingORBs_, no code starts the SBP2CommandORB timer. In the case where the fetch-agent write succeeds but the device never writes a status block, the registry leaves commandInFlight set and callers polling GetCommandResult wait forever despite SubmitCommand having configured a timeout on the ORB. Start the ORB timer when the ORB is actually submitted to the fetch agent/chain, and cancel it on normal status completion as this code already does.
Fixed in the current branch head (166b8a7). Submitted command ORBs now start their timeout once the fetch-agent write succeeds and the target can fetch the ORB; normal completion and failure paths still cancel the timer.
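The timer lifecycle described here can be sketched roughly as follows. It is an illustration under stated assumptions, not the branch's implementation: `SketchOrb`, `SketchOrbTracker`, and `SketchResult` are hypothetical names, and a boolean flag stands in for the real SBP2CommandORB timer.

```cpp
#include <cstdint>
#include <map>
#include <optional>

enum class SketchResult { Completed, TimedOut, SubmitFailed };

// Hypothetical ORB stand-in; timerArmed models the ORB's real timer.
struct SketchOrb {
    uint64_t address = 0;
    uint32_t timeoutMs = 0;   // configured by the caller before submit
    bool timerArmed = false;
};

class SketchOrbTracker {
public:
    // Record the ORB as outstanding; the timer is deliberately not armed yet.
    void Track(SketchOrb* orb) {
        outstanding_[orb->address] = orb;
        commandInFlight_ = true;
    }

    // Arm the timeout only once the fetch-agent write succeeded,
    // i.e. once the target is able to fetch the ORB.
    void OnFetchAgentWriteDone(uint64_t address, bool ok) {
        auto it = outstanding_.find(address);
        if (it == outstanding_.end()) return;
        if (!ok) { Finish(it, SketchResult::SubmitFailed); return; }
        if (it->second->timeoutMs != 0) it->second->timerArmed = true;
    }

    void OnStatusBlock(uint64_t address) {
        auto it = outstanding_.find(address);
        if (it != outstanding_.end()) Finish(it, SketchResult::Completed);
    }

    void OnTimerFired(uint64_t address) {
        auto it = outstanding_.find(address);
        if (it != outstanding_.end() && it->second->timerArmed) {
            Finish(it, SketchResult::TimedOut);
        }
    }

    bool CommandInFlight() const { return commandInFlight_; }
    std::optional<SketchResult> Result() const { return result_; }

private:
    using Map = std::map<uint64_t, SketchOrb*>;

    // Every exit path cancels the timer and clears the in-flight flag,
    // so a poller never waits on a command that can no longer complete.
    void Finish(Map::iterator it, SketchResult result) {
        it->second->timerArmed = false;
        outstanding_.erase(it);
        commandInFlight_ = false;
        result_ = result;
    }

    Map outstanding_;
    bool commandInFlight_ = false;
    std::optional<SketchResult> result_;
};
```

The point of the sketch is that a device which never writes a status block still ends in `TimedOut` instead of leaving the command in flight forever.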
    CleanupCommandResources(*rec);
});

if (!record->session->SubmitORB(orb.get())) {
Avoid submit callbacks while holding the registry lock
This calls SubmitORB while still holding the registry IOLock, but SubmitORB can synchronously invoke the ORB completion callback on immediate submit failure (for example when WriteBlock returns an empty handle via AppendORBImmediate → FailSubmittedORB). That callback immediately tries to take the same lock_, so a transient transport submission failure deadlocks the user-client request path instead of returning an error. Move the submission outside the locked section or ensure synchronous submit failures do not call back under this lock.
Fixed in the current branch head (166b8a7). SubmitCommand now prepares and records the command under the registry lock, releases that lock before calling SubmitORB, and cleans up safely if submission fails. This avoids synchronous submit-failure callbacks re-entering the registry lock.
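A minimal sketch of the "record under the lock, submit outside it" shape described above. It assumes a non-recursive `std::mutex` in place of the registry IOLock, and `SketchRegistry`, `SubmitFn`, and `Completion` are hypothetical names rather than the PR's types.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_set>
#include <utility>

class SketchRegistry {
public:
    using Completion = std::function<void(bool ok)>;
    // Hypothetical transport hook; it may invoke `done` synchronously on failure.
    using SubmitFn = std::function<bool(uint64_t handle, const Completion& done)>;

    explicit SketchRegistry(SubmitFn submit) : submit_(std::move(submit)) {}

    bool SubmitCommand(uint64_t handle) {
        {
            std::lock_guard<std::mutex> guard(lock_);
            if (!inFlight_.insert(handle).second) return false;  // already busy
        }  // lock released before any transport call

        // The completion callback takes the same lock, so it must never
        // run while SubmitCommand still holds it.
        Completion done = [this, handle](bool) {
            std::lock_guard<std::mutex> guard(lock_);
            inFlight_.erase(handle);
        };

        if (!submit_(handle, done)) {
            std::lock_guard<std::mutex> guard(lock_);
            inFlight_.erase(handle);  // idempotent if the callback already ran
            return false;
        }
        return true;
    }

    bool IsInFlight(uint64_t handle) {
        std::lock_guard<std::mutex> guard(lock_);
        return inFlight_.count(handle) != 0;
    }

private:
    std::mutex lock_;
    std::unordered_set<uint64_t> inFlight_;
    SubmitFn submit_;
};
```

With a transport that fails synchronously and calls back before returning, `SubmitCommand` returns `false` and leaves no in-flight entry; holding `lock_` across `submit_` would self-deadlock in that same case.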
eff2132 to 76a8f1d
76a8f1d to 166b8a7
The failing check on this PR is from the C++ coverage processing step, not from the SBP-2 code or tests. The run completed all 480 tests successfully, then failed while merging LLVM profile data. I split the CI fix into #20. That PR isolates LLVM profile outputs for both test discovery and test execution, and its Build and Test check passes. Once #20 lands, this PR should only need a rerun/rebase against the updated workflow.
Thanks @gly11, this looks like a great starting point for the SBP-2 implementation. I’m going to merge this as a foundation. The next step will be to adapt and split the SBP-2 pieces so they align better with the newer protocol/device architecture we now have on the DICE branch. That follow-up restructuring can happen separately; this PR gives us a useful base to build from.
I have a few follow-up draft branches from the old stack, but I do not want to open follow-up PRs against the wrong base or architecture. Should small independent fixes still target main, or would you prefer follow-up work to target DICE while the newer protocol/device architecture is being developed there? I can re-split the queue so SBP-2-related work waits for the DICE-aligned structure, while only truly independent app/debug fixes go to main if that is preferred.
Preferably wait a bit. Or, if you have some tokens to burn and time to debug/test, read below :). I have some ideas for how to reorganize the different protocols; the main idea can be seen in f19d2d2. Briefly: decouple audio leakage from discovery, and clearly separate how protocols should be loaded. At the same time, I'm finishing the full Bus/IRM manager implementation. So the goal is not to grow the driver for every single device, but to make it hardware-agnostic where possible: follow the specs first, quirks later. So for SBP-2 I see it like this (it's a draft, just the core idea):
Ping me on Discord; we could chat about it more there!
That makes sense. I’ll test the DICE branch with my Nikon hardware next. If I find any issues, I’ll report them in an issue or open a focused PR with a fix. I’ll also keep an eye on Discord for follow-up discussion. |
…tagging FW-55 (foundation deltas for the SBP-2 session/command port). The session layer ported from PR #19 tags each of its address-space ranges (login ORB, login response, status FIFO, reconnect/logout ORBs) with a human-readable label for diagnostics. DICE's AddressSpaceManager already covers every other API the session layer needs and already had the remote-write callback lifetime safety (callback copied out of the lock before firing), so this is the only foundation gap. Diagnostics-only: the label feeds range-dump logging and is a safe no-op for unknown/zero handles and null labels. Added a contract test covering bad input and that labelling never perturbs range state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
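The "safe no-op" labelling contract in this commit message can be illustrated with a small host-side sketch. The names here (`SketchAddressSpace`, `SetLabel`, `SketchRange`) are hypothetical; the real AddressSpaceManager API is not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

struct SketchRange {
    uint64_t address = 0;
    uint32_t length = 0;
    std::string label;  // diagnostics only; never consulted for routing
};

class SketchAddressSpace {
public:
    // Handles start at 1 so that 0 can mean "no handle".
    uint64_t Allocate(uint64_t address, uint32_t length) {
        const uint64_t handle = nextHandle_++;
        ranges_[handle] = SketchRange{address, length, std::string()};
        return handle;
    }

    // Labelling is a no-op for zero/unknown handles and null labels,
    // and it touches nothing but the label field.
    void SetLabel(uint64_t handle, const char* label) {
        if (handle == 0 || label == nullptr) return;
        auto it = ranges_.find(handle);
        if (it == ranges_.end()) return;
        it->second.label = label;
    }

    const SketchRange* Find(uint64_t handle) const {
        auto it = ranges_.find(handle);
        return it == ranges_.end() ? nullptr : &it->second;
    }

    std::size_t Count() const { return ranges_.size(); }

private:
    std::unordered_map<uint64_t, SketchRange> ranges_;
    uint64_t nextHandle_ = 1;
};
```

A contract test in the spirit of the one mentioned above would feed bad input and then check that the range count, address, and length are unchanged.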
Decomposes PR #19's two monoliths (SBP2LoginSession.cpp 1847 lines, SBP2SessionRegistry.cpp 799 lines) into DICE-style single-purpose components, and maps every #19 ORB call onto DICE's foundation API. LoginSession -> LoginSession (orchestrator/state) + LoginOrbExchange (management plane) + FetchAgent (command plane) + UnsolicitedStatusSink. SessionRegistry -> SessionRegistry (identity/lifecycle) + CommandExecutor (command plane) + slim SessionRecord. Captures the single-Default-queue simplification (delete #19's owned timeout-queue machinery) and the small CommandORB foundation additions FW-56 needs (IsValid, bool SetCommandBlock, kern_return_t returns). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extracts the post-login ORB submission machinery from #19's SBP2LoginSession into a focused FetchAgent owned by the session (composition). Submits immediate ORBs to the fetch-agent register, chains subsequent ORBs, rings the doorbell, retries failed fetch-agent writes, tracks outstanding ORBs, times them out, and matches incoming status blocks back to their ORB. Adapted to DICE: driven by an explicit Binding (generation/node/agent addresses) the session supplies on login, instead of reading login state directly; ORB timeouts and write-retry backoff go through the injected ISessionScheduler (not the IOSleep-on-queue path); CommandORB kern_return_t returns are propagated; async bus callbacks guarded by a weak lifetime token. Adds CommandORB::GetTimeout. 4 FetchAgentTests cover unbound rejection, immediate submit→write→timeout-arm, status completion, ORB timeout, and write-retry exhaustion → agent reset. Green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
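One common way to guard async callbacks with a weak lifetime token is sketched below. It is an assumption about the general technique, not the FetchAgent source: `SketchFetchAgent` and `MakeWriteCallback` are hypothetical, and the sketch assumes callbacks and destruction run on the same queue, so the expiry check cannot race with teardown.

```cpp
#include <functional>
#include <memory>

class SketchFetchAgent {
public:
    SketchFetchAgent() : alive_(std::make_shared<int>(0)) {}

    // Returns a completion callback that may outlive the agent.
    // The callback reports whether it was delivered.
    std::function<bool()> MakeWriteCallback() {
        std::weak_ptr<int> token = alive_;
        return [this, token]() {
            if (token.expired()) return false;  // agent is gone: drop the completion
            ++completions_;                     // safe: `this` is still alive
            return true;
        };
    }

    int Completions() const { return completions_; }

private:
    std::shared_ptr<int> alive_;  // destroyed with the agent, expiring all tokens
    int completions_ = 0;
};
```

A callback invoked after the agent is destroyed returns `false` without dereferencing `this`, which is the property a late bus completion needs.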
Decompose PR #19's SBP2LoginSession (1847 lines): the login/reconnect/logout state machine + status-FIFO routing stays here as Session/LoginSession; the post-login command plane is the composed FetchAgent (step 2).
DICE adaptations:
* Timers run on the injected ISessionScheduler (one cancelable management timer at a time), replacing #19's two-queue IOSleep model (SetTimeoutQueue/EnsureTimeoutQueue/owned queue all removed, §4).
* SubmitORB/ResetFetchAgent/solicited-status routing delegate to the FetchAgent; login/reconnect success Bind() it, bus reset/logout Unbind() it.
* Logs under the Async category (DICE has no SBP2 log category).
Also fixes a FetchAgent port slip: the fetch-agent write retry backoff was 1 ms but PR #19 (the behavioral oracle) uses 1000 ms; the ported ImmediateORBRetry... test pins the 1000 ms timing.
Ported SBP2LoginSessionTests -> LoginSessionTests, adapted to the scheduler model and FetchAgent delegation; #19's two-queue-specific test is dropped. 39 SBP2 host tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
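The injected-scheduler model with a single cancelable management timer might look roughly like this on the host side. The interface below is hypothetical (`ISketchScheduler`, `ManualScheduler`, `SketchLoginSession`); the actual ISessionScheduler signature is not shown in this thread, so treat every name and parameter as an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <utility>

// Hypothetical scheduler interface, injected instead of sleeping on a queue.
class ISketchScheduler {
public:
    using TimerId = uint64_t;
    virtual ~ISketchScheduler() = default;
    virtual TimerId Schedule(uint32_t delayMs, std::function<void()> fn) = 0;
    virtual void Cancel(TimerId id) = 0;
};

// Manual-clock fake for host tests: nothing sleeps, time advances on demand.
class ManualScheduler : public ISketchScheduler {
public:
    TimerId Schedule(uint32_t delayMs, std::function<void()> fn) override {
        const TimerId id = nextId_++;
        timers_[id] = Timer{nowMs_ + delayMs, std::move(fn)};
        return id;
    }
    void Cancel(TimerId id) override { timers_.erase(id); }

    void Advance(uint32_t ms) {
        nowMs_ += ms;
        for (;;) {  // re-scan after each callback, since callbacks may cancel or schedule
            auto it = std::find_if(timers_.begin(), timers_.end(),
                [this](const auto& entry) { return entry.second.deadlineMs <= nowMs_; });
            if (it == timers_.end()) break;
            auto fn = std::move(it->second.fn);
            timers_.erase(it);
            fn();
        }
    }

private:
    struct Timer { uint64_t deadlineMs = 0; std::function<void()> fn; };
    std::map<TimerId, Timer> timers_;
    uint64_t nowMs_ = 0;
    TimerId nextId_ = 1;
};

class SketchLoginSession {
public:
    enum class State { Idle, LoggingIn, LoggedIn, TimedOut };
    explicit SketchLoginSession(ISketchScheduler& scheduler) : scheduler_(scheduler) {}

    void StartLogin(uint32_t timeoutMs) {
        state_ = State::LoggingIn;
        ArmManagementTimer(timeoutMs, [this] { state_ = State::TimedOut; });
    }
    void OnLoginStatus() {
        CancelManagementTimer();
        state_ = State::LoggedIn;
    }
    State GetState() const { return state_; }

private:
    void ArmManagementTimer(uint32_t delayMs, std::function<void()> onFire) {
        CancelManagementTimer();  // at most one management timer at a time
        timer_ = scheduler_.Schedule(delayMs, [this, onFire = std::move(onFire)] {
            timer_.reset();
            onFire();
        });
    }
    void CancelManagementTimer() {
        if (timer_) { scheduler_.Cancel(*timer_); timer_.reset(); }
    }

    ISketchScheduler& scheduler_;
    std::optional<ISketchScheduler::TimerId> timer_;
    State state_ = State::Idle;
};
```

Because the clock is manual, a test can pin exact timings (for example the 1000 ms retry backoff mentioned above) without any real waiting.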
Decompose PR #19's SBP2SessionRegistry (799 lines, god-object record):
* SessionRegistry (Session/SessionRegistry.{hpp,cpp}) - identity & lifecycle only: Create/StartLogin/GetSessionState/Release/OnBusReset/RefreshTargets, dup-target reject (afcbd9f), owner validation (8b64806), release-sessions-before-ranges ordering (9ca0d8e), async logout-retain (retiringSessions_ + SetReleaseLogoutCallback).
* SessionRecord - slim value type; the command god-object is lifted into a per-record CommandExecutor.
* CommandExecutor (Session/CommandExecutor.{hpp,cpp}) - command plane: owns command ORB / page table / management ORB / in-flight + result state, drives the session's FetchAgent via LoginSession::SubmitORB, preserves inquiry-failure status (f8b0403) and failed-ORB resource release (45a5609).
DICE adaptations vs #19:
* LoginSession timers run on an injected ISessionScheduler (registry ctor argument); the two-queue model is gone. ManagementORB (foundation) keeps its single work queue, passed through for task-management timers.
* ReleaseSession no longer blocks the single Default queue with an IOSleep(10) wait-loop; like ReleaseOwner it starts logout, retires the session, and lets the async logout completion / scheduler timeout erase it (the wait-loop path was untested).
* CleanupCommandResources clears the session's fetch-agent ORB tracking via LoginSession::ClearCommandTracking (mirrors #19's session->ClearORBTracking), cancelling any in-flight fetch-agent write when a command completes/fails/aborts.
Ported SBP2SessionRegistryTests -> SessionRegistryTests, adapted to the decomposed API + scheduler. Two assertions updated for DICE's 16-byte NormalORB header (PR #19 assumed 20). The two SBP2Handler-dependent tests defer to FW-57 (the session-aware user-client handler is not on DICE yet).
Full host suite green: 1166/1166.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Handler)
Expose the FW-56 session/command layer across the DriverKit user-client
boundary (host-testable core):
* UserClient/WireFormats/SBP2CommandWireFormats.hpp — net-new ABI records
(SBP2CommandRequestWire / SBP2CommandResultWire), ported verbatim from
PR #19 with the static_assert layout guards intact.
* UserClient/Handlers/SBP2Handler.hpp — re-thread #19's session-aware
handler onto DICE's decomposed SessionRegistry (type/path adapted from
SBP2SessionRegistry). Adds Create/StartLogin/GetSessionState/Inquiry/
Command/CommandResult/TaskManagement/ReleaseSession on top of the
foundation address-space methods. The registry pointer defaults to null
so the existing address-space-only construction keeps compiling until
the registry is wired into the driver lifecycle (FW-58). Owner-validation
contract (void* owner + opaque handle) passes straight through (8b64806);
SubmitSBP2Command hardens the structure-input ABI (1ee4515); ReleaseOwner
releases sessions before address ranges (9ca0d8e).
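As an illustration of what static_assert layout guards and structure-input hardening can look like, here is a hedged sketch. The field layout of `SketchCommandRequestWire` is invented for the example and is not the real SBP2CommandRequestWire; `ParseCommandRequest` is likewise a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <type_traits>

// Hypothetical wire record; the real SBP2CommandRequestWire fields are not
// shown in this thread, so this layout is an assumption for illustration.
struct SketchCommandRequestWire {
    uint64_t sessionHandle;
    uint32_t cdbLength;
    uint32_t transferLength;
    uint32_t timeoutMs;
    uint32_t flags;
    uint8_t  cdb[16];
};

// Layout guards: an accidental field reorder or padding change fails the build
// instead of silently breaking the user-client ABI.
static_assert(std::is_trivially_copyable<SketchCommandRequestWire>::value,
              "wire record must be memcpy-safe");
static_assert(std::is_standard_layout<SketchCommandRequestWire>::value,
              "wire record must have a predictable layout");
static_assert(sizeof(SketchCommandRequestWire) == 40, "unexpected wire size");
static_assert(offsetof(SketchCommandRequestWire, cdb) == 24, "unexpected cdb offset");

// Input hardening: reject null or wrongly sized buffers and out-of-range
// lengths before any field is trusted.
std::optional<SketchCommandRequestWire> ParseCommandRequest(const void* input,
                                                            std::size_t inputSize) {
    if (input == nullptr || inputSize != sizeof(SketchCommandRequestWire)) {
        return std::nullopt;
    }
    SketchCommandRequestWire wire;
    std::memcpy(&wire, input, sizeof(wire));  // avoids unaligned or aliased reads
    if (wire.cdbLength == 0 || wire.cdbLength > sizeof(wire.cdb)) {
        return std::nullopt;
    }
    return wire;
}
```

A handler built this way can return an argument error for a short or oversized structure input rather than reading past the caller's buffer.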
Ported the two SBP2Handler tests deferred from FW-56 → SBP2HandlerTests:
GetSBP2SessionState scalar-output sizing + SubmitSBP2Command ABI hardening.
Full host suite green: 1168/1168.
Remaining FW-57 driver-integration (paired with FW-58, needs Xcode/IIG to
verify): add the SBP-2 session selectors to ASFWDriver.iig + dispatch table,
and construct the SessionRegistry in UserClientRuntimeState (it needs the
scheduler + device manager the driver lifecycle owns). The .dext compile of
those selectors lands in FW-59.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
DICE and main had diverged into two parallel lines. This merge records DICE (40a796c) as a second parent so main's prior history — including gly11's merged PRs #18/#19/#20 and the sbp2/ci/foundation fixes — is preserved as ancestry, while the resulting tree is taken wholesale from DICE. main-only code is superseded by DICE's implementation but remains recoverable from history (refs/backup/main-pre-dice-merge = 757456d). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
This PR builds on the foundation changes merged in #18 and splits out the SBP-2 session and command core from the larger SBP-2 bring-up branch. The branch has been rebased onto current main, so the visible diff is now limited to the SBP-2 session layer.
SBP-2 Session Core
ORB and Addressing
Tests
Why this split
This PR intentionally excludes local debug UI, diagnostic handlers, install-helper changes, diagnostic scripts, and documentation experiments. Those can be reviewed separately or kept local. The async discovery and bus-reset foundations landed in #18; this layer adds SBP-2 session and command behavior on top of that.
Verification
After #18 merged, this branch was rebased onto current upstream/main; git diff --check upstream/main..pr/sbp2-session-core passes.