Fix Random Network Exceptions Caused by Reusing Stale Peer Connections #96

@ninan-nn

1. Background

In high-frequency API calling or mixed-traffic scenarios, the Python SDK intermittently hits network exceptions. The error logs often report SSL decryption failures, but these are only a symptom surfaced by the TLS layer. The underlying cause is that the peer has already closed the connection, so the client is reusing a stale connection.

2. Root Cause Analysis

A. Core Issue: Delayed Detection of Peer Closure

  • Server-Side Behavior: Servers and load balancers (e.g., Nginx, SLB) typically enforce an idle timeout (e.g., 60s). Once it elapses, the server closes the connection by sending a TCP FIN.
  • Race Condition: If the client's connection pool has not yet noticed the FIN when it picks the connection for reuse, the request is written to a "dead" or "half-closed" socket.
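
The race described above can be reproduced locally without any network. The following is a minimal sketch, assuming a plain (non-TLS) socket pair stands in for the client and server; the function name and the request bytes are illustrative only. It shows that after the peer half-closes, a write on the client side is still accepted, and the closure only becomes visible on the next read.

```python
import socket


def demo_half_closed_reuse():
    """Simulate a peer that closes its side while the client still holds the socket."""
    client, server = socket.socketpair()
    try:
        # The "server" hits its idle timeout and sends FIN (half-close).
        server.shutdown(socket.SHUT_WR)

        # The client has not read from the socket, so it has not seen the FIN.
        # The write is still accepted by the local socket layer.
        sent = client.send(b"GET / HTTP/1.1\r\n\r\n")

        # Only when the client reads does it learn the peer is gone:
        # recv() returns an empty bytes object, meaning end of stream.
        reply = client.recv(1024)
        return sent, reply
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    sent, reply = demo_half_closed_reuse()
    print(f"bytes accepted on write: {sent}, reply: {reply!r}")
```

With a real TCP connection behind TLS, the same late discovery happens inside the TLS layer, which is where the misleading error in section B comes from.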

B. Why SSL Errors Appear as a Symptom

  • Encrypted communication depends on both sides staying in sync on TLS record sequence numbers and on every record passing its MAC integrity check.
  • When the client reuses a connection that the peer has already closed or truncated, OpenSSL receives an incomplete byte stream and MAC verification fails. The failure is reported as a DECRYPTION_FAILED error, which masks the underlying connection-closed event.

C. Connection Pool Contamination

  • The SDK handles both long-lived SSE streams and short-lived REST calls.
  • Because the peer can close an idle pooled connection at any time once its timeout elapses, the pool accumulates "zombie connections" that look available but are no longer valid for new requests.

3. Solution

The fundamental fix is to ensure the client proactively abandons idle connections before the server does, eliminating the timing conflict.
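
The timing argument can be checked with a small model. This is a simplified sketch, not SDK code: the function names are hypothetical, time is counted in whole seconds, and network latency and clock differences are ignored. A connection is "risky" at a given idle duration if the client would still reuse it while the server has already closed it.

```python
def server_has_closed(idle_seconds, server_idle_timeout):
    """The server closes a connection once it has been idle for its timeout."""
    return idle_seconds >= server_idle_timeout


def client_will_reuse(idle_seconds, keepalive_expiry):
    """The client reuses a pooled connection only while it is younger than the expiry."""
    return idle_seconds < keepalive_expiry


def risky_idle_durations(keepalive_expiry, server_idle_timeout, max_idle=120):
    """Idle durations (in seconds) at which the client would reuse a dead connection."""
    return [
        t
        for t in range(max_idle + 1)
        if client_will_reuse(t, keepalive_expiry)
        and server_has_closed(t, server_idle_timeout)
    ]


if __name__ == "__main__":
    # Values from this issue: 15s client expiry against a 60s server timeout.
    print(risky_idle_durations(keepalive_expiry=15, server_idle_timeout=60))
    # Hypothetical misconfiguration: the client keeps connections longer than the server.
    print(risky_idle_durations(keepalive_expiry=90, server_idle_timeout=60))
```

In this model the risky window is empty whenever the client expiry is shorter than the server timeout, and spans the gap between the two values otherwise. In practice some margin is still needed for latency.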

Connection Pool Optimization:

  • Proactive Expiration: Set keepalive_expiry to a value shorter than the peer's idle timeout (e.g., 15s against a 60s server timeout). The client then retires idle connections before the peer's timeout can fire.
  • Disable Environment Interference: Explicitly set trust_env=False so that proxy settings picked up from the environment cannot interfere with low-level connection state monitoring.

4. Comparison with Kotlin Implementation

  • Kotlin (OkHttp): When taking a connection from the pool, OkHttp runs a proactive health check (isHealthy()). If the peer has sent a FIN or the socket is otherwise unhealthy, OkHttp silently creates a new connection.
  • Python (httpx): The Python stack detects stale connections more passively. In complex TLS environments, a shortened keepalive_expiry is needed to compensate for this detection delay.
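
For comparison, a check in the spirit of OkHttp's isHealthy() can be sketched in Python at the socket level. This is an illustration only, not how httpx or the SDK is implemented: looks_stale is a hypothetical helper, and it assumes a plain socket that should be idle, so any readability means either end of stream from the peer or unexpected data.

```python
import select
import socket


def looks_stale(sock):
    """Return True if an idle socket is readable, i.e. the peer closed it or sent unexpected data."""
    # Zero timeout: poll the socket state without blocking.
    readable, _, _ = select.select([sock], [], [], 0)
    return bool(readable)


if __name__ == "__main__":
    client, server = socket.socketpair()
    print(looks_stale(client))  # healthy idle connection
    server.shutdown(socket.SHUT_WR)  # peer sends FIN
    print(looks_stale(client))  # closure is now detectable before reuse
    client.close()
    server.close()
```

A check like this narrows the race but cannot remove it, since the FIN can still arrive between the check and the write. That is why the shorter keepalive_expiry remains the main fix.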

5. Conclusion

By shortening the client-side keep-alive duration, the SDK no longer risks reusing "zombie connections," which removes the root cause of the instability: requests sent on connections the peer has already closed.

Labels: bug