Skip to content

feat(ble-proxy): serialize BLE connects + structured slot-exhaustion errors#704

Draft
Apollon77 wants to merge 14 commits into
mainfrom
fix/better-errorhandling
Draft

feat(ble-proxy): serialize BLE connects + structured slot-exhaustion errors#704
Apollon77 wants to merge 14 commits into
mainfrom
fix/better-errorhandling

Conversation

@Apollon77

@Apollon77 Apollon77 commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Summary

A single physical device advertising several rotating BLE addresses caused matter.js parallel-PASE to fire concurrent BLE connect attempts at it. The proxy backend (HA/ESPHome) ran out of connection slots and commissioning failed with a cascade of timeouts (8 of 9 started attempt(s) failed).

This serializes BLE connects through the proxy and surfaces slot exhaustion as a clear, structured error. Scope: JS/TS server side only — see "Deferred" below for the python-client counterpart.

What changed

  • Connect gate (ProxyBleCentralInterface): a matter.js Semaphore (concurrency = 1) acquired before Connect and held for the channel lifetime, released idempotently on close/failure. At most one BLE connection in flight, so the connect storm never forms. A queued connect aborts on interface close, an incoming options.abort, or a 20 s bounded wait.
  • Orphan teardown: matter.js never signals a blocked openChannel (it only races the promise via Abort.attempt, never passes options.abort on the BLE path) and Transport.close() only fires at node teardown. So a connect finishing after its commission was abandoned would hold the single slot forever. We track handed-out channels (#openChannels) and a generation check tears down a connect that completes after an abort; ProxyBleCentralInterface.abortPendingConnects() rejects queued waiters and closes orphans.
  • Generic commissioning lifecycle: ControllerCommandHandler emits policy-free commissioningStarted / commissioningEnded (per attempt; never for invalid requests). matter-server/src/bleConnectGate.ts wireBleConnectGate() ref-counts them and calls abortPendingConnects() only when no commission is in progress — so finishing one attempt can't tear down a concurrent attempt's connect. These events are generic and reusable by other JS consumers.
  • Structured errors (server side): new wire codes out_of_connection_slots + connection_aborted, propagated as a typed BleProxyError (carries .code). isOutOfConnectionSlotsError() classifies by structured code or a case-insensitive connection_failed message-text fallback, so an un-updated python client still gets the clear, actionable error ("increase connection_slots / add proxies").
  • Python client: only the discover_faileddiscovery_failed wire-code fix is included here (independent bug — the canonical code in the enum + docs is discovery_failed).

Tests

  • TS: serialization (only one Connect in flight), queue-abort on interface close, abortPendingConnects reject-but-reusable, orphan teardown (already-open + finishes-after-abort), clear slot-exhaustion error, typed-error round-trip; commissioningStarted/commissioningEnded balanced/per-attempt/not-on-invalid-args; wireBleConnectGate ref-count (release only after last concurrent attempt).
  • Full suite 247/247 (incl. live Commission-On-Network); python 6 passed; format/lint/build green.

Deferred to the bleak_retry_connector connect_strategy PR

The python client's structured connect-error classification (_classify_connect_error + bleak marker matching) was reverted from this PR to keep it JS-only; it belongs with the python-side connect path in the connect_strategy work. The server's message-text fallback keeps things working until then. The python/HA side also needs to wire the commissioning gate behavior. Tracked locally in ble-proxy/python-structured-connect-errors.

Other follow-ups (pre-existing, not in this PR)

  • _handle_connect TimeoutError returns connection_failed rather than the existing timeout code.
  • BleProxyHandler.#cleanupClient rejects with a plain Error (could be BleProxyError).
  • matter.js upstream: ControllerCommissioner omits { abort } to ble.openChannel (TCP path passes it); plumbing it would let the gate cancel queued connects instantly instead of via the 20 s backstop.

🤖 Generated with Claude Code

Apollon77 and others added 8 commits June 1, 2026 14:41
Add out_of_connection_slots and connection_aborted wire codes, a
BleProxyError carrying the code, and isOutOfConnectionSlotsError() which
classifies by structured code with a connection_failed message-text
fallback for proxy clients predating the code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#handleResponse now rejects failed commands with BleProxyError so callers
can read the structured .code (e.g. out_of_connection_slots) instead of
only a flattened message. Adds a TestProxyClientError test helper so the
test client can emit a chosen wire error code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odes

_handle_connect now maps BLE connect exceptions to out_of_connection_slots,
connection_aborted, device_not_found, or bluetooth_unavailable (else
connection_failed), matching by exception class name + message markers so
bleak_retry_connector stays a soft dependency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a Semaphore(1) connect gate in ProxyBleCentralInterface so matter.js
parallel-PASE against a device's rotating BLE addresses can't exhaust the
proxy backend's connection slots. The slot is held for the channel lifetime
and released idempotently on close/failure; a queued connect aborts on
interface close, an incoming options.abort, or a 20s bounded wait.

Because matter.js never signals a blocked openChannel, also track handed-out
channels and tear down orphans (plus reject queued waiters) via
abortPendingConnects(), and a generation check tears down a connect that
completes after an abort — so a cancelled/timed-out commission can't leave a
channel holding the single slot forever. Surface a clear actionable error on
slot exhaustion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Emit a commissioningEnded event from ControllerCommandHandler.commissionNode
(finally, so success and failure both fire) and wire it in MatterServer to
ProxyBle.abortPendingConnects(). BLE is only used during commissioning, so
this releases queued connect waiters and tears down any orphaned channel
holding the single connection slot once an attempt finishes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ted codes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orphan branch threw into the connect-failure catch, which already sends
the proxy Disconnect — so it was sent twice. Let the catch own it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
node_modules/lockfile were on 0.17.0 while package.json pins the alpha; the
branch did not type-check until synced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 1, 2026 15:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens BLE Proxy commissioning by preventing concurrent BLE connect storms (which can exhaust proxy backend connection slots) and by introducing structured, typed error codes for clearer diagnosis across the TS server and Python proxy client.

Changes:

  • Serialize BLE connects through the proxy with a single-slot semaphore, plus lifecycle abort/cleanup to prevent “orphaned” connections from holding the only slot.
  • Add structured wire error codes (out_of_connection_slots, connection_aborted) and propagate them as typed errors (BleProxyError) end-to-end.
  • Add focused TS + Python tests covering connect gating, abort behavior, orphan cleanup, and error classification/round-trip.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
python_ble_proxy/tests/test_connect_errors.py Adds unit tests for Python-side connect-error classification into structured wire codes.
python_ble_proxy/matter_ble_proxy/client.py Implements connect-error classification and uses it when connect fails.
packages/ws-controller/test/CommissioningLifecycleTest.ts Adds tests asserting a new commissioning lifecycle event fires on success/failure.
packages/ws-controller/src/controller/ControllerCommandHandler.ts Introduces commissioningEnded event emitted after commissionNode attempt finishes.
packages/matter-server/src/MatterServer.ts Hooks commissioning lifecycle to abort pending proxy BLE connects when commissioning ends.
packages/ble-proxy/test/BleProxyTestClient.ts Enhances test proxy client to return structured error codes from handlers.
packages/ble-proxy/test/BleProxyProtocolTest.ts Adds tests for slot-exhaustion classification and typed error preservation.
packages/ble-proxy/test/BleProxyIntegrationTest.ts Adds integration tests for connect serialization, abort behavior, orphan cleanup, and error messaging.
packages/ble-proxy/src/ProxyBleChannel.ts Implements connect concurrency gate, abort generation handling, and slot-exhaustion error surfacing.
packages/ble-proxy/src/ProxyBle.ts Exposes abortPendingConnects() to trigger central-interface connect abort/cleanup.
packages/ble-proxy/src/BleProxyProtocol.ts Adds new error codes, slot-exhaustion message marker matching, and BleProxyError.
packages/ble-proxy/src/BleProxyHandler.ts Preserves structured error codes by rejecting pending commands with BleProxyError.
package-lock.json Updates workspace Node engine constraints in the lockfile to >=22.13.0.
docs/ble-proxy-protocol.md Documents the new structured connect error codes.

Comment thread python_ble_proxy/matter_ble_proxy/client.py Outdated
Comment thread packages/ws-controller/src/controller/ControllerCommandHandler.ts
Apollon77 and others added 3 commits June 1, 2026 17:38
The client sent discover_failed; the protocol enum and docs use
discovery_failed. Align the client with the canonical wire code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Mirror the TS isOutOfConnectionSlotsError lowercasing so mixed-case backend
messages still classify as slot-exhaustion/aborted instead of falling back
to connection_failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keep commissioningStarted/commissioningEnded as plain lifecycle facts on the
controller (emitted per attempt, never for invalid requests) and move the
"release the BLE connect gate only when no commission is in progress" policy
into wireBleConnectGate, the consumer-side ref-counter. Finishing one
commission can no longer abort a concurrent attempt's connect, and the
controller event stays generic and reusable.

Addresses PR review feedback on the commissioningEnded finally placement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

python_ble_proxy/matter_ble_proxy/client.py:498

  • The PR description's follow-ups mention a pre-existing wire-code drift (discover_failed vs discovery_failed) as not addressed in this PR, but this change updates the Python client to emit discovery_failed. Please update the PR description/follow-ups so it matches what the code actually does (or revert this change if it was meant to stay out of scope).
        if conn.client.is_connected:
            await conn.client.disconnect()
        await self._send_success(cmd_id)

    async def _handle_discover_services(self, cmd_id: int, args: dict[str, Any]) -> None:
        conn = self._get_connection(args["connection_handle"])

Apollon77 and others added 2 commits June 1, 2026 19:41
Keep PR #704 JS-only. The python client's _classify_connect_error (and its
case-insensitive markers) move to the bleak_retry_connector connect_strategy
PR, where the python-side connect path lives. Server still classifies
slot-exhaustion via the connection_failed message-text fallback, so an
un-updated client keeps working. discovery_failed fix retained.

Tracked in todo: ble-proxy/python-structured-connect-errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Comment on lines 484 to 488
async close() {
this.#connected = false;
this.#releaseSlot();
this.#cleanupObservers();
this.#terminateIterator();
Comment on lines +332 to +335
if (this.#closed || this.#abortGeneration !== abortGenAtStart) {
await proxyChannel.close();
throw new BleError(`BLE connect to ${peripheralAddress} aborted before completion`);
}
@Apollon77 Apollon77 marked this pull request as draft June 1, 2026 19:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants