Description
`RoundRobinLoadBalancer.dispatch` holds a single global `asyncio.Lock` across `await connection.dispatch(...)`, so the dispatch/handshake phase of every concurrent dispatch is serialized through one lock. Dispatch throughput therefore does not scale with worker count: it is flat-to-declining as workers are added.
Multi-worker prototype measurement (round-robin, empty payload, 64 concurrent dispatches, Python connection, ops/s):
| lock | N=1 | N=2 | N=4 | scaling 1→4 |
|---|---|---|---|---|
| global (current) | 1470 | 1327 | 1175 | 0.80× |
| per-worker | 1492 | 2140 | 2985 | 2.00× |
Adding workers under the current lock makes throughput worse; a per-worker lock scales ~2× at 4 workers (~2.5× the global lock's N=4 throughput).
Expected behavior
Adding workers increases dispatch throughput (scaling toward N× until a caller-side limit is reached), while preserving the load balancer's existing anti-thundering-herd guarantee: a burst of concurrent dispatches must not stampede a single worker.
Root cause
The lock is deliberate load-shaping, not merely rotation-index protection. The rotation index only advances on success after the handshake, so without serialization a burst of concurrent dispatches all read the same stale index and target the same worker (thundering herd). One global lock held across the handshake prevents that — but over-broadly: it also serializes dispatches to different workers, which need not be serialized.
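The stale-index effect described above can be reproduced with a small simulation. This is a minimal sketch, not the load balancer's code: `burst`, `handshake`, and the `serialize` flag are hypothetical names, and the handshake is a short `asyncio.sleep`. As in the description, the index is read before the handshake and advanced only after it succeeds.

```python
import asyncio
from collections import Counter


async def handshake(worker: int, hits: Counter) -> None:
    """Simulated handshake; records which worker was targeted."""
    hits[worker] += 1
    await asyncio.sleep(0.001)


async def burst(n_workers: int = 4, n_dispatches: int = 8, serialize: bool = False) -> dict:
    index = 0                 # shared rotation index (hypothetical)
    hits = Counter()          # handshakes started per worker
    lock = asyncio.Lock()     # stands in for the single global lock

    async def dispatch() -> None:
        nonlocal index
        if serialize:
            await lock.acquire()
        try:
            target = index % n_workers      # read the index before the handshake
            await handshake(target, hits)
            index = target + 1              # advance only after success
        finally:
            if serialize:
                lock.release()

    await asyncio.gather(*(dispatch() for _ in range(n_dispatches)))
    return dict(hits)
```

With `serialize=False` every dispatch in the burst reads index 0 before any handshake completes, so all of them target worker 0. With `serialize=True` the burst rotates across workers, at the cost of running one handshake at a time.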
The fix is a per-worker lock: hold a brief lock only to advance the rotation index, then acquire a lock scoped to the selected worker across its handshake. Dispatches to different workers overlap; dispatches to the same worker serialize — cross-worker parallelism with the anti-herd guarantee intact.
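One possible shape for the per-worker scheme is sketched below. It is illustrative only: `FakeConnection`, `PerWorkerLockBalancer`, and `demo` are hypothetical names, the handshake is simulated with `asyncio.sleep`, and eviction and retries are left out.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for a worker connection (not the real class)."""

    def __init__(self, uid: str, handshake_s: float = 0.01) -> None:
        self.uid = uid
        self.handshake_s = handshake_s
        self.in_flight = 0   # handshakes currently running on this worker
        self.peak = 0        # most handshakes ever running at once on this worker
        self.handled = 0

    async def dispatch(self, payload: bytes) -> str:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            await asyncio.sleep(self.handshake_s)  # simulated handshake
            self.handled += 1
            return self.uid
        finally:
            self.in_flight -= 1


class PerWorkerLockBalancer:
    """Sketch: a brief index lock plus one lock per worker."""

    def __init__(self, connections: list) -> None:
        self._connections = connections
        self._index = 0
        self._index_lock = asyncio.Lock()
        self._worker_locks = {c.uid: asyncio.Lock() for c in connections}

    async def dispatch(self, payload: bytes) -> str:
        # Short critical section: select a worker and advance the index.
        # Nothing is awaited inside it in this sketch.
        async with self._index_lock:
            conn = self._connections[self._index % len(self._connections)]
            self._index += 1
        # Long critical section, scoped to the selected worker only.
        async with self._worker_locks[conn.uid]:
            return await conn.dispatch(payload)


async def demo(workers: int = 4, dispatches: int = 16) -> list:
    conns = [FakeConnection(f"w{i}") for i in range(workers)]
    balancer = PerWorkerLockBalancer(conns)
    await asyncio.gather(*(balancer.dispatch(b"") for _ in range(dispatches)))
    return conns
```

Running `asyncio.run(demo())` spreads the burst evenly across the workers, and each worker's `peak` stays at 1, which is the anti-herd property. Handshakes on different workers are free to overlap because they never contend on the same lock.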
Narrowing the lock changes the concurrency of index advancement, so exhaustion detection must be reworked. The current per-call `checkpoint` relies on rotation-index identity to detect "tried every worker". Once concurrent dispatches share and advance the index, that identity no longer holds: when a worker is evicted from a multi-worker pool, a single dispatch can skip or miss workers. Track a per-dispatch tried-set of attempted worker uids rather than index identity.
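A tried-set version of exhaustion detection might look like the following. This is a sketch under assumptions: `WorkerUnavailable`, `AllWorkersExhausted`, `FlakyConnection`, and `TriedSetBalancer` are hypothetical names, and eviction is modeled as removing the connection from a list after a failed handshake.

```python
import asyncio


class WorkerUnavailable(Exception):
    """Hypothetical error raised when a worker's handshake fails."""


class AllWorkersExhausted(Exception):
    """Hypothetical error raised when no untried worker remains."""


class FlakyConnection:
    """Hypothetical connection whose handshake fails when unhealthy."""

    def __init__(self, uid: str, healthy: bool = True) -> None:
        self.uid = uid
        self.healthy = healthy

    async def dispatch(self, payload: bytes) -> str:
        await asyncio.sleep(0)  # simulated handshake
        if not self.healthy:
            raise WorkerUnavailable(self.uid)
        return self.uid


class TriedSetBalancer:
    def __init__(self, connections) -> None:
        self._connections = list(connections)
        self._index = 0
        self._index_lock = asyncio.Lock()
        self._worker_locks = {c.uid: asyncio.Lock() for c in self._connections}

    async def _next_untried(self, tried: set):
        # Scan at most one full rotation, starting at the shared index.
        async with self._index_lock:
            n = len(self._connections)
            for offset in range(n):
                pos = (self._index + offset) % n
                conn = self._connections[pos]
                if conn.uid not in tried:
                    self._index = (pos + 1) % n
                    return conn
            return None  # every remaining worker is already in `tried`

    async def dispatch(self, payload: bytes) -> str:
        tried = set()  # per-dispatch, keyed by worker uid rather than index
        while True:
            conn = await self._next_untried(tried)
            if conn is None:
                raise AllWorkersExhausted(sorted(tried))
            tried.add(conn.uid)
            async with self._worker_locks[conn.uid]:
                if conn not in self._connections:
                    continue  # evicted while this dispatch waited for the lock
                try:
                    return await conn.dispatch(payload)
                except WorkerUnavailable:
                    self._connections.remove(conn)  # evict, then try another
```

Because membership is tracked by uid, an eviction that shifts list positions does not cause a dispatch to retry a worker it has already attempted or to give up while an untried worker remains.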
Note: increased handshake concurrency can surface a separate worker-side `proxy_pool` "Lock is bound to a different event loop" race. Validate against it (or file it separately) when landing this.