
Commit e025134

Merge pull request #578 from PlanExeOrg/fix/worker-heartbeat-pool-dispose
fix: dispose connection pool on PGRES_TUPLES_OK in heartbeat upsert
2 parents: ba852cb + c56a6b1

File tree

2 files changed: +178 −1 lines

database_api/model_worker.py

Lines changed: 12 additions & 1 deletion

```diff
@@ -59,4 +59,15 @@ def upsert_heartbeat(cls, worker_id: str, current_task_id: Optional[str] = None)
             db.session.commit()
         except Exception as e:
             logger.error(f"Worker {worker_id}: Database error during heartbeat upsert: {e}", exc_info=True)
-            db.session.rollback()
+            try:
+                db.session.rollback()
+            except Exception:
+                pass
+            try:
+                db.engine.dispose()
+            except Exception:
+                pass
+            try:
+                db.session.remove()
+            except Exception:
+                pass
```
Lines changed: 166 additions & 0 deletions
# Proposal: Address PGRES_TUPLES_OK Connection Corruption on Railway

**Author:** Simon Strandgaard + Claude
**Date:** 15 April 2026
**Status:** Draft
**Topic:** Database reliability, connection pool management

## 1. Problem

The `worker_plan_database` service on Railway intermittently encounters
`psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq`
errors. This is a psycopg2-level protocol corruption: the underlying TCP connection
is alive, but the PostgreSQL wire protocol state is broken.

### Symptoms

- LLM calls complete successfully, but the subsequent `db.session.commit()` (for
  token metrics, progress updates, or heartbeat upserts) fails with `PGRES_TUPLES_OK`.
- The corrupted connection is returned to SQLAlchemy's connection pool.
- Other Luigi worker threads check out the bad connection; `pool_pre_ping` (`SELECT 1`)
  hangs on it because the protocol is broken but the TCP socket is still open.
- All 4 Luigi worker threads deadlock. The pipeline appears stuck at ~3% forever.
- `stop_requested` is ignored because the callback that checks it never fires.
- The worker replica is completely stuck until restarted.

### Affected code paths (patched so far)

| Location | Error handler |
|----------|---------------|
| `worker_plan_internal/llm_util/token_metrics_store.py` | `record_token_usage()` |
| `worker_plan_database/app.py` | `_handle_task_completion()` callback retries |
| `worker_plan_database/app.py` | `update_task_state_with_retry()` |
| `worker_plan_database/app.py` | `update_task_progress_with_retry()` |
| `worker_plan_database/app.py` | `_update_failure_diagnostics()` |
| `database_api/model_worker.py` | `upsert_heartbeat()` |

Each was patched with the same pattern: `db.session.rollback()` → `db.engine.dispose()`
→ `db.session.remove()`. This destroys the pool so the corrupted connection is closed
outright instead of being recycled.

### Current mitigation

```python
except Exception as e:
    try:
        db.session.rollback()
    except Exception:
        pass
    try:
        db.engine.dispose()
    except Exception:
        pass
    try:
        db.session.remove()
    except Exception:
        pass
```

This works: the pipeline recovers and continues. But it is reactive (we add the
pattern each time a new code path surfaces), and `db.engine.dispose()` is heavy-handed
(closes all pooled connections, not just the bad one).
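The per-site pattern could be collapsed into one best-effort helper. A minimal sketch, assuming `db` is the Flask-SQLAlchemy handle (the name `reset_db_connections` is hypothetical, not from the patch):

```python
def reset_db_connections(db, logger=None):
    """Best-effort teardown after a suspected corrupted connection.

    Runs session rollback, engine-pool dispose, and scoped-session
    removal in order, isolating each step so that a failure in one
    cannot skip the remaining cleanup steps.
    """
    for step in (db.session.rollback, db.engine.dispose, db.session.remove):
        try:
            step()
        except Exception:
            if logger is not None:
                logger.warning("connection-reset step failed", exc_info=True)
```

Each error handler then shrinks to a single `reset_db_connections(db)` call while keeping the dispose semantics identical to the inline version above.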
## 2. Root cause hypotheses

1. **Railway's TCP proxy** — Railway routes Postgres traffic through an internal proxy.
   The proxy may drop or corrupt packets under load or during container migrations without
   cleanly closing the TCP connection.

2. **Idle connection timeout** — Railway or the Postgres instance may silently close
   idle connections after a timeout. psycopg2 does not notice until the next operation,
   and the resulting error is `PGRES_TUPLES_OK` instead of a clean disconnect.

3. **Concurrent session/connection sharing** — Luigi runs 4 worker threads. If
   Flask-SQLAlchemy's scoped session or the connection pool has a threading edge case,
   two threads could interleave operations on the same raw connection, corrupting the
   protocol state.

4. **psycopg2 bug** — There are known issues with `PGRES_TUPLES_OK` in psycopg2 when
   the connection is in an unexpected state. psycopg3 (aka `psycopg`) has a different
   connection architecture that may not be affected.

## 3. Proposed solutions

### Option A: Global SQLAlchemy error handler (recommended short-term)

Register an engine-level event that intercepts all `PGRES_TUPLES_OK` errors and
invalidates the specific connection, preventing it from re-entering the pool. This
replaces all per-site patches with a single handler.

```python
from sqlalchemy import event

@event.listens_for(db.engine, "handle_error")
def _handle_pgres_error(context):
    """Invalidate connections corrupted by PGRES_TUPLES_OK."""
    orig = getattr(context, "original_exception", None)
    if orig and "PGRES_TUPLES_OK" in str(orig):
        # Mark the error as a disconnect so SQLAlchemy invalidates this
        # connection, but keep the rest of the pool intact.
        context.is_disconnect = True
        context.invalidate_pool_on_disconnect = False
        logger.warning("PGRES_TUPLES_OK detected; invalidating connection.")
```

**Pros:** Single point of fix, no per-site patching, only invalidates the bad
connection (not the entire pool).
**Cons:** Requires testing that forcing `is_disconnect` behaves correctly for this
error type (the disconnect machinery was designed for clean connection drops).
### Option B: Migrate to psycopg3

Replace `psycopg2` with `psycopg` (psycopg3). The newer driver has:

- Native async support
- Better connection state tracking
- Automatic connection recovery
- No `PGRES_TUPLES_OK` corruption issues (different C binding)

**Pros:** Eliminates the root cause at the driver level.
**Cons:** Requires dependency changes, potential API differences, testing across
all services.
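Under SQLAlchemy 2.0, much of the migration reduces to selecting the psycopg3 dialect in the database URL. A sketch, where the `DATABASE_URL` variable name and the fallback default are assumptions about this deployment:

```python
import os


def database_uri() -> str:
    """Build the SQLAlchemy URL, selecting the psycopg3 driver.

    SQLAlchemy 2.0 picks psycopg3 for 'postgresql+psycopg://' URLs;
    a bare 'postgresql://' URL still resolves to psycopg2.
    """
    raw = os.environ.get("DATABASE_URL", "postgresql://localhost/planexe")
    for prefix in ("postgres://", "postgresql://"):
        if raw.startswith(prefix):
            return "postgresql+psycopg://" + raw[len(prefix):]
    return raw
```

URLs already carrying an explicit driver (e.g. `postgresql+psycopg://`) pass through unchanged, so the rewrite is idempotent.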
### Option C: Connection pooler (PgBouncer)

Deploy PgBouncer between the worker and PostgreSQL. PgBouncer manages connections
at the protocol level, handles reconnection transparently, and isolates the
application from transport-layer issues.

**Pros:** Handles all connection lifecycle issues, reduces the connection count to
Postgres, industry standard.
**Cons:** Additional infrastructure to deploy and maintain on Railway; adds
latency; transaction-mode pooling requires careful session management.
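If PgBouncer were run in transaction-pooling mode, the application-side pool becomes redundant and can conflict with it. A sketch of the corresponding engine options (the option-dict name and values are illustrative, not measured):

```python
from sqlalchemy.pool import NullPool

# Behind a transaction-mode PgBouncer, the pooler owns the connection
# lifecycle, so disable SQLAlchemy's own pooling and pre-ping.
PGBOUNCER_ENGINE_OPTIONS = {
    "poolclass": NullPool,   # every checkout is a fresh PgBouncer handoff
    "pool_pre_ping": False,  # PgBouncer health-checks its server connections
}
```

These would replace the current `SQLALCHEMY_ENGINE_OPTIONS` only for deployments that actually sit behind the pooler.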
### Option D: Aggressive pool configuration

Tune SQLAlchemy pool settings to reduce exposure to stale connections:

```python
app.config['SQLALCHEMY_ENGINE_OPTIONS'] = {
    'pool_recycle': 60,     # recycle connections every 60 s (currently 280)
    'pool_pre_ping': True,  # already enabled
    'pool_size': 5,
    'max_overflow': 2,
    'pool_timeout': 10,     # fail fast instead of waiting
}
```

**Pros:** Simple configuration change, no code changes.
**Cons:** Does not prevent the corruption, only narrows the window; more
frequent reconnections add overhead.

## 4. Recommendation

**Short-term (now):** Implement Option A (global error handler). This replaces
the scattered per-site patches with a single, centralized handler and is the
lowest-risk change.

**Medium-term:** Investigate Option B (psycopg3 migration). If `PGRES_TUPLES_OK`
continues to occur frequently, the driver-level fix is the most robust solution.

**Long-term:** Consider Option C (PgBouncer) if the Railway Postgres proxy
continues to be unreliable, especially as the service scales to more worker
replicas.

## 5. References

- [psycopg2 PGRES_TUPLES_OK issue](https://github.com/psycopg/psycopg2/issues)
- [SQLAlchemy handle_error event](https://docs.sqlalchemy.org/en/20/core/events.html#sqlalchemy.events.ConnectionEvents.handle_error)
- [SQLAlchemy dealing with disconnects](https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects)
- PRs: #573, #574, #578
