|
| 1 | +# Proposal: Address PGRES_TUPLES_OK Connection Corruption on Railway |
| 2 | + |
| 3 | +**Author:** Simon Strandgaard + Claude |
| 4 | +**Date:** 15 April 2026 |
| 5 | +**Status:** Draft |
| 6 | +**Topic:** Database reliability, connection pool management |
| 7 | + |
| 8 | +## 1. Problem |
| 9 | + |
| 10 | +The `worker_plan_database` service on Railway intermittently encounters |
| 11 | +`psycopg2.DatabaseError: error with status PGRES_TUPLES_OK and no message from the libpq` |
| 12 | +errors. This is a psycopg2-level protocol corruption where the underlying TCP connection |
| 13 | +is alive but the PostgreSQL wire protocol state is broken. |
| 14 | + |
| 15 | +### Symptoms |
| 16 | + |
| 17 | +- LLM calls complete successfully, but the subsequent `db.session.commit()` (for |
| 18 | + token metrics, progress updates, or heartbeat upserts) fails with `PGRES_TUPLES_OK`. |
| 19 | +- The corrupted connection is returned to SQLAlchemy's connection pool. |
| 20 | +- Other Luigi worker threads check out the bad connection; `pool_pre_ping` (`SELECT 1`) |
| 21 | + hangs on it because the protocol is broken but the TCP socket is still open. |
| 22 | +- All 4 Luigi worker threads deadlock. The pipeline appears stuck at ~3% forever. |
| 23 | +- `stop_requested` is ignored because the callback that checks it never fires. |
| 24 | +- The worker replica is completely stuck until restarted. |
| 25 | + |
| 26 | +### Affected code paths (patched so far) |
| 27 | + |
| 28 | +| Location | Error handler | |
| 29 | +|----------|--------------| |
| 30 | +| `worker_plan_internal/llm_util/token_metrics_store.py` | `record_token_usage()` | |
| 31 | +| `worker_plan_database/app.py` | `_handle_task_completion()` callback retries | |
| 32 | +| `worker_plan_database/app.py` | `update_task_state_with_retry()` | |
| 33 | +| `worker_plan_database/app.py` | `update_task_progress_with_retry()` | |
| 34 | +| `worker_plan_database/app.py` | `_update_failure_diagnostics()` | |
| 35 | +| `database_api/model_worker.py` | `upsert_heartbeat()` | |
| 36 | + |
| 37 | +Each was patched with the same pattern: `db.session.rollback()` → `db.engine.dispose()` |
| 38 | +→ `db.session.remove()`. This destroys the pool so the corrupted connection is closed |
| 39 | +outright instead of being recycled. |
| 40 | + |
| 41 | +### Current mitigation |
| 42 | + |
| 43 | +```python |
| 44 | +except Exception as e: |
| 45 | + try: |
| 46 | + db.session.rollback() |
| 47 | + except Exception: |
| 48 | + pass |
| 49 | + try: |
| 50 | + db.engine.dispose() |
| 51 | + except Exception: |
| 52 | + pass |
| 53 | + try: |
| 54 | + db.session.remove() |
| 55 | + except Exception: |
| 56 | + pass |
| 57 | +``` |
| 58 | + |
| 59 | +This works — the pipeline recovers and continues. But it is reactive (we add the |
| 60 | +pattern each time a new code path surfaces) and `db.engine.dispose()` is heavy-handed |
| 61 | +(closes all pooled connections, not just the bad one). |
| 62 | + |
| 63 | +## 2. Root cause hypotheses |
| 64 | + |
| 65 | +1. **Railway's TCP proxy** — Railway routes Postgres traffic through an internal proxy. |
| 66 | + The proxy may drop/corrupt packets under load or during container migrations without |
| 67 | + cleanly closing the TCP connection. |
| 68 | + |
| 69 | +2. **Idle connection timeout** — Railway or the Postgres instance may silently close |
| 70 | + idle connections after a timeout. psycopg2 doesn't notice until the next operation, |
| 71 | + and the resulting error is `PGRES_TUPLES_OK` instead of a clean disconnect. |
| 72 | + |
| 73 | +3. **Concurrent session/connection sharing** — Luigi runs 4 worker threads. If |
| 74 | + Flask-SQLAlchemy's scoped session or the connection pool has a threading edge case, |
| 75 | + two threads could interleave operations on the same raw connection, corrupting the |
| 76 | + protocol state. |
| 77 | + |
| 78 | +4. **psycopg2 bug** — There are known issues with `PGRES_TUPLES_OK` in psycopg2 when |
| 79 | + the connection is in an unexpected state. psycopg3 (aka `psycopg`) has a different |
| 80 | + connection architecture that may not be affected. |
| 81 | + |
| 82 | +## 3. Proposed solutions |
| 83 | + |
| 84 | +### Option A: Global SQLAlchemy error handler (recommended short-term) |
| 85 | + |
| 86 | +Register an engine-level event that intercepts all `PGRES_TUPLES_OK` errors and |
| 87 | +invalidates the specific connection, preventing it from re-entering the pool. This |
| 88 | +replaces all per-site patches with a single handler. |
| 89 | + |
| 90 | +```python |
| 91 | +from sqlalchemy import event |
| 92 | + |
| 93 | +@event.listens_for(db.engine, "handle_error") |
| 94 | +def _handle_pgres_error(context): |
| 95 | + """Invalidate connections corrupted by PGRES_TUPLES_OK.""" |
| 96 | + orig = getattr(context, "original_exception", None) |
| 97 | + if orig and "PGRES_TUPLES_OK" in str(orig): |
| 98 | + context.invalidate_pool_on_disconnect = True |
| 99 | + logger.warning("PGRES_TUPLES_OK detected — invalidating connection.") |
| 100 | +``` |
| 101 | + |
| 102 | +**Pros:** Single point of fix, no per-site patching, only invalidates the bad |
| 103 | +connection (not the entire pool). |
| 104 | +**Cons:** Requires testing that `invalidate_pool_on_disconnect` works correctly |
| 105 | +for this error type (it was designed for disconnect errors). |
| 106 | + |
| 107 | +### Option B: Migrate to psycopg3 |
| 108 | + |
| 109 | +Replace `psycopg2` with `psycopg` (psycopg3). The newer driver has: |
| 110 | +- Native async support |
| 111 | +- Better connection state tracking |
| 112 | +- Automatic connection recovery |
| 113 | +- No `PGRES_TUPLES_OK` corruption issues (different C binding) |
| 114 | + |
| 115 | +**Pros:** Eliminates the root cause at the driver level. |
| 116 | +**Cons:** Requires dependency changes, potential API differences, testing across |
| 117 | +all services. |
| 118 | + |
| 119 | +### Option C: Connection pooler (PgBouncer) |
| 120 | + |
| 121 | +Deploy PgBouncer between the worker and PostgreSQL. PgBouncer manages connections |
| 122 | +at the protocol level, handles reconnection transparently, and isolates the |
| 123 | +application from transport-layer issues. |
| 124 | + |
| 125 | +**Pros:** Handles all connection lifecycle issues, reduces connection count to |
| 126 | +Postgres, industry standard. |
| 127 | +**Cons:** Additional infrastructure to deploy and maintain on Railway, adds |
| 128 | +latency, transaction-mode pooling requires careful session management. |
| 129 | + |
| 130 | +### Option D: Aggressive pool configuration |
| 131 | + |
| 132 | +Tune SQLAlchemy pool settings to reduce exposure to stale connections: |
| 133 | + |
| 134 | +```python |
| 135 | +app.config['SQLALCHEMY_ENGINE_OPTIONS'] = { |
| 136 | + 'pool_recycle': 60, # recycle connections every 60s (currently 280) |
| 137 | + 'pool_pre_ping': True, # already enabled |
| 138 | + 'pool_size': 5, |
| 139 | + 'max_overflow': 2, |
| 140 | + 'pool_timeout': 10, # fail fast instead of waiting |
| 141 | +} |
| 142 | +``` |
| 143 | + |
| 144 | +**Pros:** Simple configuration change, no code changes. |
| 145 | +**Cons:** Doesn't prevent the corruption — just reduces the window. More |
| 146 | +frequent reconnections add overhead. |
| 147 | + |
| 148 | +## 4. Recommendation |
| 149 | + |
| 150 | +**Short-term (now):** Implement Option A (global error handler). This replaces |
| 151 | +the scattered per-site patches with a single, centralized handler and is the |
| 152 | +lowest-risk change. |
| 153 | + |
| 154 | +**Medium-term:** Investigate Option B (psycopg3 migration). If PGRES_TUPLES_OK |
| 155 | +continues to occur frequently, the driver-level fix is the most robust solution. |
| 156 | + |
| 157 | +**Long-term:** Consider Option C (PgBouncer) if the Railway Postgres proxy |
| 158 | +continues to be unreliable, especially as the service scales to more worker |
| 159 | +replicas. |
| 160 | + |
| 161 | +## 5. References |
| 162 | + |
| 163 | +- [psycopg2 PGRES_TUPLES_OK issue](https://github.com/psycopg/psycopg2/issues) |
| 164 | +- [SQLAlchemy handle_error event](https://docs.sqlalchemy.org/en/20/core/events.html#sqlalchemy.events.ConnectionEvents.handle_error) |
| 165 | +- [SQLAlchemy dealing with disconnects](https://docs.sqlalchemy.org/en/20/core/pooling.html#dealing-with-disconnects) |
| 166 | +- PRs: #573, #574, #578 |
0 commit comments