
io: SQ flow control rework — diagnostics, SqController, session lifecycle#305

Open
maning00 wants to merge 6 commits into main from sq-flow-control-phase1a


maning00 commented May 7, 2026

Summary

  • Phase 1A — diagnostics (4df7bbb): rewrite SQ full hint to print worker-local QP / session qpPerTransfer / numWorkerThreads separately, and add five per-EP CQE diagnostic counters (lastPollAttemptTime / lastNonEmptyCqeTime / recentCqeCount / recentBatchReleaseWr / ledger RecordCount) into the timeout hint so operators can triage CQ-poll stall vs slow drain vs submission burst from a single error line. Adds optional MORI_IO_SQ_SIGNAL_INTERVAL_WR (default 0 = off) for in-call signal cadence, and a SplitWork small-batch round-robin that no longer pins single-entry transfers to worker 0 / global ep[0].
  • Phase 2 — SqController + ledger boundary + session lifecycle (ee44512): introduce per-EP SqController as the single owner of SQ credits / waiters / degraded state, with CAS + epoch-protected condition_variable wait and MORI_IO_SQ_RESUME_WATERMARK_WR hysteresis. SubmissionLedger no longer mutates sqDepth; new CancelTentative and ExtractOrphanedRecords return full records so callers can do exactly-once meta failure convergence. All four failure paths (TryReserveSqDepth fail, RecheckBeforePost fail, ibv_post_send fail, fatal CQE) converge through one helper MovePendingUnsignaledToOrphanedForEndpoint that releases any held submit guard, takes unique recovery guard, marks endpoint terminal degraded, inserts an orphaned record, and fails unique metas. Fatal CQE white list: IBV_WC_RETRY_EXC_ERR / RNR / LOC_QP_OP_ERR / REM_ACCESS_ERR / REM_INV_REQ_ERR / FATAL_ERR; IBV_WC_WR_FLUSH_ERR stays as flush cascade only. RdmaBackendSession::Alive() reflects any endpoint terminal degraded; sessionCache holds shared_ptr and invalidates unhealthy entries on cache hit/miss; RdmaManager::CountEndpoint / GetAllEndpoint filter out terminal-degraded EpPair so new sessions only consume healthy endpoints (BuildRdmaConn refills qpPerTransfer if needed).
  • Phase 3 — executor admission gate + dynamic per-QP load balancing (8d81a91): move backpressure from worker-thread retry loops up to the user thread. SqController grows a soft queuedDepth_ (admitted but not yet posted) plus TryAcquireAdmission / ReleaseAdmission / WaitForAdmissionChange so the executor reserves SQ slots before submitting work to a worker; under sustained SQ pressure callers block on the admission cv instead of failing on a worker-thread reserve timeout. New move-only RAII AdmissionToken is owned by each Task — even on worker shutdown the queued tasks release their admission and complete with ERR_BAD_STATE. MultithreadExecutor::RdmaBatchReadWriteWithAdmission is the new dispatcher when MORI_IO_SQ_EXECUTOR_ADMISSION=1; falls back to the legacy SplitWork path when disabled (default). Three split policies via MORI_IO_SQ_SPLIT_POLICY={static,least_loaded,capacity}: static keeps the Phase 1A round-robin; least_loaded picks the active QP with the lowest Depth + QueuedDepth (skipping degraded QPs); capacity (gated behind MORI_IO_SQ_ENABLE_EXPERIMENTAL_CAPACITY=1) splits a large batch by per-QP free slots. Admission-side WR estimation reuses a new EstimateMergedWrCount helper (SGL merging aware) shared with the worker post path. Five new counters in SqCqeDiagnostics (executorAdmissionWaitCount / executorAdmissionWaitUs / executorAdmissionTimeoutCount / leastLoadedSelectionCount / queuedWrHighWatermark) appear in the SQ-full timeout hint so operators can confirm the gate is actively backpressuring upstream callers.
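The least_loaded policy described in Phase 3 can be pictured with a minimal sketch. It assumes a hypothetical `QpLoad` snapshot and a hypothetical `PickLeastLoaded` helper; neither name comes from the PR, and the tie-break (lowest index wins) is an assumption made only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-QP snapshot; field names are illustrative, not the PR's types.
struct QpLoad {
  std::uint32_t depth;        // WRs posted and not yet released by a CQE
  std::uint32_t queuedDepth;  // WRs admitted but not yet posted
  bool degraded;              // degraded QPs are skipped entirely
};

// Returns the index of the healthy QP with the smallest depth + queuedDepth,
// or std::nullopt when every QP is degraded. Ties go to the lowest index
// (an assumption of this sketch).
inline std::optional<std::size_t> PickLeastLoaded(const std::vector<QpLoad>& qps) {
  std::optional<std::size_t> best;
  std::uint64_t bestLoad = 0;
  for (std::size_t i = 0; i < qps.size(); ++i) {
    if (qps[i].degraded) continue;
    const std::uint64_t load = std::uint64_t{qps[i].depth} + qps[i].queuedDepth;
    if (!best || load < bestLoad) {
      best = i;
      bestLoad = load;
    }
  }
  return best;
}
```

Summing the hard depth and the soft queued depth means a QP that has many admitted-but-unposted WRs is not mistaken for an idle one.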

Behavior summary

  • Phase 1A + 2 only (default deploy of this PR): worker-thread cv-wait gated by MORI_IO_SQ_BACKOFF_TIMEOUT_US, exactly-once orphan convergence, terminal-degraded EpPair filtering, shared_ptr session cache with unhealthy invalidation. SQ-full pressure still surfaces as user-visible errors when the backoff window closes.
  • Phase 3 enabled (opt-in: MORI_IO_SQ_EXECUTOR_ADMISSION=1): user thread blocks in the admission gate until the chosen QP has effective free slots; sustained pressure produces tail latency rather than errors (errors only when the admission deadline expires).
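For the admission gate to produce latency rather than leaked credits, every admitted task has to give its slots back exactly once. The move-only RAII token from the summary could look roughly like the sketch below. `AdmissionCounter` is a hypothetical stand-in for the controller's queued-depth accounting, and the token's interface here is assumed, not taken from the PR.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the soft "admitted but not yet posted" counter.
struct AdmissionCounter {
  std::uint32_t queued = 0;
  void Acquire(std::uint32_t n) { queued += n; }
  void Release(std::uint32_t n) { queued -= n; }
};

// Move-only token: acquires on construction, releases at most once.
class AdmissionToken {
 public:
  AdmissionToken() = default;
  AdmissionToken(AdmissionCounter* counter, std::uint32_t wr)
      : counter_(counter), wr_(wr) {
    if (counter_ != nullptr) counter_->Acquire(wr_);
  }
  AdmissionToken(const AdmissionToken&) = delete;
  AdmissionToken& operator=(const AdmissionToken&) = delete;
  AdmissionToken(AdmissionToken&& other) noexcept
      : counter_(std::exchange(other.counter_, nullptr)),
        wr_(std::exchange(other.wr_, 0)) {}
  AdmissionToken& operator=(AdmissionToken&& other) noexcept {
    if (this != &other) {
      Release();  // give back whatever this token currently holds
      counter_ = std::exchange(other.counter_, nullptr);
      wr_ = std::exchange(other.wr_, 0);
    }
    return *this;
  }
  ~AdmissionToken() { Release(); }

  // Idempotent: a second call (or a call on a moved-from token) is a no-op.
  void Release() noexcept {
    if (counter_ != nullptr) {
      counter_->Release(wr_);
      counter_ = nullptr;
      wr_ = 0;
    }
  }

 private:
  AdmissionCounter* counter_ = nullptr;
  std::uint32_t wr_ = 0;
};
```

Because the destructor releases, a task that is dropped during worker shutdown still returns its admission without any explicit cleanup path.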

New env vars (defaults are safe — Phase 3 is opt-in)

| Env var | Default | Phase | Purpose |
| --- | --- | --- | --- |
| MORI_IO_SQ_BACKOFF_TIMEOUT_US | 10000 (10 ms) | 1A | worker-thread reserve timeout |
| MORI_IO_SQ_SIGNAL_INTERVAL_WR | 0 (off) | 1A | optional periodic signaled-WR cadence |
| MORI_IO_SQ_RESUME_WATERMARK_WR | 0 (off) | 2 | hysteresis for waking SQ-credit waiters |
| MORI_IO_SQ_EXECUTOR_ADMISSION | false | 3 | enable user-thread admission gate |
| MORI_IO_SQ_ADMISSION_TIMEOUT_US | 1000000 (1 s) | 3 | user-thread admission deadline |
| MORI_IO_SQ_ADMISSION_RESUME_WATERMARK_WR | 2048 | 3 | hysteresis for waking admission waiters |
| MORI_IO_SQ_ADMISSION_WAIT_SLICE_US | 100 | 3 | inner cv slice for diagnostic accounting |
| MORI_IO_SQ_ADMISSION_CHUNK_WR | 0 (auto) | 3 | per-chunk admission size cap |
| MORI_IO_SQ_ADMISSION_MAX_CHUNKS_PER_CALL | 0 (= activeQps) | 3 | per-call pipeline depth cap |
| MORI_IO_SQ_SPLIT_POLICY | static | 3 | static / least_loaded / capacity |
| MORI_IO_SQ_ENABLE_EXPERIMENTAL_CAPACITY | false | 3 | gate the experimental capacity policy |
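As a rough illustration of how such variables might be read with safe defaults, the sketch below uses two hypothetical helpers, `EnvU64` and `ParseSplitPolicy`. Both names are assumptions, and so is the behavior of falling back to static for unknown values and for capacity when the experimental gate is off; the PR does not spell out its actual parsing rules.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

enum class SplitPolicy { kStatic, kLeastLoaded, kCapacity };

// Reads an unsigned integer env var; returns `fallback` when the variable is
// unset, empty, or not a plain decimal number.
inline std::uint64_t EnvU64(const char* name, std::uint64_t fallback) {
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw < '0' || *raw > '9') return fallback;
  char* end = nullptr;
  const unsigned long long value = std::strtoull(raw, &end, 10);
  return (*end == '\0') ? static_cast<std::uint64_t>(value) : fallback;
}

// Maps a policy string to an enum. Assumption of this sketch: unknown values,
// and "capacity" without the experimental gate, fall back to static.
inline SplitPolicy ParseSplitPolicy(const std::string& value, bool capacityEnabled) {
  if (value == "least_loaded") return SplitPolicy::kLeastLoaded;
  if (value == "capacity" && capacityEnabled) return SplitPolicy::kCapacity;
  return SplitPolicy::kStatic;
}
```

A caller would then write something like `EnvU64("MORI_IO_SQ_BACKOFF_TIMEOUT_US", 10000)` so that an unset variable yields the documented default.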

Test plan

  • tests/cpp/io/test_engine (24/24 passing locally):
    • submission_ledger_basic (extended for ReleaseByCqe / CancelTentative / ExtractOrphanedRecords)
    • sq_controller_reserve_release_wait (CV wakeup + resume watermark hysteresis)
    • sq_controller_terminal_degraded (waiter wakes on degrade; ReleaseDrainedOrphaned does not restore admission)
    • sq_controller_recheck_rolls_back (RecheckBeforePost rollback after degrade)
    • sq_controller_admission_counters (admission acquire/release/WaitForAdmissionChange, hard + soft credit interaction)
    • sq_controller_admission_degraded (terminal-degraded SQ rejects admission)
    • sq_controller_admission_mark_degraded_wakes (MarkDegraded wakes admission waiters)
    • admission_token_move_releases_once (move-only token RAII + idempotent Release())
    • worker_shutdown_drains_tokens_and_promises (worker shutdown releases queued tokens and completes promises with ERR_BAD_STATE)
    • pending_unsignaled_recheck_failure_orphans (1-EP helper path)
    • pending_unsignaled_reserve_failure_orphans (2-EP helper path, multi-meta accumulation)
    • pending_unsignaled_orphaning_closes_admission_before_recovery (admission closed before recovery guard)
    • rdma_session_alive_checks_terminal_sq
  • Multi-node DSR1 8k1k disagg conc=1024/2048/4096 regression, two configurations on the same workload:
    • Phase 1A+2 baseline (admission off): MORI_IO_SQ_BACKOFF_TIMEOUT_US=5000000, expect SQ-full timeout count to drop ~10× and the per-EP diagnostic counters in the timeout hint to classify residual stalls.
    • Phase 3 enabled: MORI_IO_SQ_EXECUTOR_ADMISSION=1 MORI_IO_SQ_ADMISSION_TIMEOUT_US=2000000 MORI_IO_SQ_BACKOFF_TIMEOUT_US=1000000, expect SQ-full errors to approach 0 with executorAdmissionWaitCount > 0 confirming the gate is actively backpressuring upstream.
  • MORI_IO_SQ_SPLIT_POLICY=least_loaded A/B vs static at 4QP/4Worker conc=4096 — track per-QP Depth distribution stddev and leastLoadedSelectionCount.
  • Optional MORI_IO_SQ_SIGNAL_INTERVAL_WR=64/128/256 cadence matrix once the backoff=1s, interval=0 baseline is captured.
  • Submission-guard throughput regression at 4QP/4Worker conc=4096 vs. baseline (no exclusive-guard fallback enabled).
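For the least_loaded vs. static A/B comparison, the per-QP Depth spread can be summarized with a population standard deviation. The helper below is a minimal sketch with a hypothetical name (`DepthStddev`); how the depths are actually sampled in the benchmark is not described in this PR.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Population standard deviation of sampled per-QP depths.
// Returns 0 for an empty sample.
inline double DepthStddev(const std::vector<std::uint32_t>& depths) {
  if (depths.empty()) return 0.0;
  const double n = static_cast<double>(depths.size());
  double mean = 0.0;
  for (std::uint32_t d : depths) mean += static_cast<double>(d);
  mean /= n;
  double sumSq = 0.0;
  for (std::uint32_t d : depths) {
    const double diff = static_cast<double>(d) - mean;
    sumSq += diff * diff;
  }
  return std::sqrt(sumSq / n);
}
```

A lower value under least_loaded than under static, for the same workload, would indicate that the policy is spreading work more evenly across QPs.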

Follow-ups (separate PRs)

  • 4-class partial-post mock tests (no WR posted, signaled-tail-posted, fatal-CQE → terminal-degraded → endpoint rebuild).
  • End-to-end mock for Alive()=false → InvalidateUnhealthySessions → fresh session, plus shared_ptr lifetime test under in-flight transfer + terminal degrade.
  • RouteTable GC for terminal-degraded EpPair (use_count-guarded).
  • RdmaManager::Search GPU-branch fall-through fix (separate narrower change).
  • least_loaded / capacity policy benchmark under non-uniform NIC link speeds (e.g., one slow rank in an 8-rank × 8-NIC node).
  • Flip-the-default PR for MORI_IO_SQ_EXECUTOR_ADMISSION=1 + MORI_IO_SQ_SPLIT_POLICY=least_loaded once production validation is captured.


Copilot AI left a comment


Pull request overview

Reworks RDMA send-queue (SQ) flow control and failure handling by introducing a per-endpoint SqController, expanding SQ/CQE diagnostics for SQ-full timeouts, and tightening session/endpoint lifecycle behavior when endpoints become terminal-degraded.

Changes:

  • Introduces SqController as the owner of SQ credit reservation/release, degraded state, and waiter coordination (with resume watermark hysteresis).
  • Refactors SubmissionLedger APIs to return full SubmissionRecords and adds orphan extraction/cancel APIs for exactly-once failure convergence.
  • Updates session caching/liveness behavior to evict unhealthy sessions and avoid allocating terminal-degraded endpoints; adds targeted unit tests.
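The credit ownership and resume-watermark idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PR's `SqController`: the class name `SqCredits` is hypothetical, the condition-variable and epoch machinery is omitted, and treating a watermark of 0 as "wake on every release" is an assumption based on the "0 (off)" default.

```cpp
#include <atomic>
#include <cstdint>

// Simplified credit counter: CAS-based reserve plus a hysteresis hint on release.
class SqCredits {
 public:
  SqCredits(std::uint32_t capacity, std::uint32_t resumeWatermark)
      : capacity_(capacity), resumeWatermark_(resumeWatermark) {}

  // Succeeds only if `n` more WRs fit and the SQ has not been marked degraded.
  bool TryReserve(std::uint32_t n) {
    std::uint32_t cur = depth_.load(std::memory_order_relaxed);
    while (true) {
      if (degraded_.load(std::memory_order_acquire)) return false;
      if (n > capacity_ - cur) return false;  // cur never exceeds capacity_
      if (depth_.compare_exchange_weak(cur, cur + n, std::memory_order_acq_rel,
                                       std::memory_order_relaxed)) {
        return true;
      }
      // On failure `cur` holds the fresh value; loop and re-check.
    }
  }

  // Releases `n` credits (caller must have reserved them). Returns true when
  // waiters should be woken: free slots have reached the watermark, or the
  // watermark is 0 (assumed to mean hysteresis is off).
  bool Release(std::uint32_t n) {
    const std::uint32_t before = depth_.fetch_sub(n, std::memory_order_acq_rel);
    const std::uint32_t freeSlots = capacity_ - (before - n);
    return resumeWatermark_ == 0 || freeSlots >= resumeWatermark_;
  }

  void MarkDegraded() { degraded_.store(true, std::memory_order_release); }
  std::uint32_t Depth() const { return depth_.load(std::memory_order_acquire); }

 private:
  const std::uint32_t capacity_;
  const std::uint32_t resumeWatermark_;
  std::atomic<std::uint32_t> depth_{0};
  std::atomic<bool> degraded_{false};
};
```

The hysteresis avoids waking a large reserver for every single freed slot: waiters are notified only once enough room has accumulated to make a retry likely to succeed.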

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/cpp/io/test_engine.cpp | Adds/extends unit tests for SubmissionLedger, SqController, orphaning helper, and session Alive() behavior. |
| src/io/rdma/ledger.cpp | Updates ledger APIs (InsertOrphaned, ReleaseByCqe, CancelTentative, ExtractOrphanedRecords, RecordCount). |
| src/io/rdma/executor.hpp | Replaces split return type with a richer WorkSplit (worker + ep + range). |
| src/io/rdma/executor.cpp | Implements round-robin “small batch” work splitting to avoid pinning small transfers to worker/ep 0. |
| src/io/rdma/common.hpp | Adds SqController, diagnostics structs/enums, extends EpPair, and declares SQ/CQE diagnostic + orphaning helpers. |
| src/io/rdma/common.cpp | Implements SqController, SQ-full diagnostic hinting, signal cadence env var, and unified orphaning/failure helper. |
| src/io/rdma/backend_impl.hpp | Switches session cache to shared_ptr and adds unhealthy-session invalidation helper. |
| src/io/rdma/backend_impl.cpp | Filters terminal-degraded endpoints for new sessions, adds CQ polling diagnostics, terminal CQE handling, session Alive() checks, and cache invalidation of unhealthy sessions. |


Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread src/io/rdma/ledger.cpp

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Comment thread tests/cpp/io/test_engine.cpp Outdated
Comment thread tests/cpp/io/test_engine.cpp Outdated
Comment thread tests/cpp/io/test_engine.cpp Outdated
Comment thread src/io/rdma/common.hpp
Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread src/io/rdma/backend_impl.cpp