Problem
In our Kubernetes deployment of Plane Enterprise v2.6.3 (chart v2.6.1), CSV exports intermittently fail without surfacing any user-visible error. The exporters table shows records stuck in "queued" status indefinitely — the task is never picked up by the Celery worker. Over the past 7 days we observed 5 completed exports and 5 stuck in "queued" (50% silent failure rate).
Observed evidence
RabbitMQ logs contain 936 occurrences of:
client unexpectedly closed TCP connection
...across two API pods over the same 7-day window. These events are produced when gunicorn workers are rotated due to the hardcoded --max-requests 1200 setting in the image entrypoint.
Timestamp correlation of stuck export records against gunicorn rotation events recorded in the RabbitMQ logs (June 29 sample), with API pod log evidence from Loki:
| Export created_at (UTC) | API pod | duration_ms | levelname | Worker received? | Failure mode |
|---|---|---|---|---|---|
| 21:36:26 | n/a | n/a | n/a | No | Rotation window (~21:36–21:38, log boundary) |
| 21:58:31 | 9kvdc | 81 | INFO | No | Stale pool, silent drop |
| 22:30:53 | 9kvdc | 4,295 | WARNING | No | Rotation-window AMQP timeout |
| 22:33:20 | j8bgm | 112 | INFO | No | Stale pool, silent drop |
The Celery worker pod (ptx5c) received and completed three other exports in the same window (22:22:03, 22:32:47, 22:37:37) but shows zero log entries at 21:58:31 or 22:33:20 — the tasks were never published to RabbitMQ.
22:30:53 case: The 4,295 ms duration and WARNING-level log indicate Kombu hit a broken AMQP connection during the rotation reconnect window (rotation at 22:30:42, reconnection at 22:30:54). After an internal socket timeout, task.delay() returned without raising an exception — the task was silently discarded.
21:58:31 and 22:33:20 cases: Both completed in under 120 ms at INFO level — no AMQP exception, no warning, no visible error. Kombu selected a stale TCP connection from its broker pool: the socket still appeared open on the client side but had been closed by RabbitMQ as part of the earlier rotation churn. The kernel-level socket write succeeded instantly; RabbitMQ never received the message.
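The correlation above can be reproduced with a small script. This is a minimal sketch using only the timestamps quoted in this report; the first window is approximate ("~21:36–21:38"), and the names `ROTATION_WINDOWS`, `STUCK_EXPORTS`, and `classify` are illustrative, not part of any Plane tooling.

```python
from datetime import time

# Rotation windows taken from the report (UTC, June 29 sample).
# The first one is approximate; the second is rotation -> reconnection.
ROTATION_WINDOWS = [
    (time(21, 36, 0), time(21, 38, 0)),
    (time(22, 30, 42), time(22, 30, 54)),
]

# created_at values of the four stuck export records.
STUCK_EXPORTS = [
    time(21, 36, 26),
    time(21, 58, 31),
    time(22, 30, 53),
    time(22, 33, 20),
]


def in_rotation_window(ts, windows=ROTATION_WINDOWS):
    """Return True if ts falls inside any (start, end) window, inclusive."""
    return any(start <= ts <= end for start, end in windows)


def classify(ts):
    """Label a stuck export by whether it coincides with a rotation window."""
    return "rotation window" if in_rotation_window(ts) else "stale pool suspected"


if __name__ == "__main__":
    for ts in STUCK_EXPORTS:
        print(ts.isoformat(), classify(ts))
```

Two of the four stuck exports fall inside a rotation window and two fall outside, which matches the split between the two failure modes in the table.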
Root cause
There are two compounding failure modes.
1. Disabled AMQP heartbeat (latent risk, ruled out as current cause)
py-amqp defaults to heartbeat=0 (kombu/connection.py, line 173, v5.6.2), which silently overrides RabbitMQ's proposed 60-second timeout during AMQP negotiation (py-amqp amqp/connection.py, lines 425–440, v5.3.1). With no heartbeat active, stale idle connections accumulate without detection.
A 7-day scan of RabbitMQ logs found zero "missed heartbeats from client" events — all 936 connection-close events were "client unexpectedly closed TCP connection" (client-initiated, caused by gunicorn rotation). This rules out heartbeat failure as the active cause of the observed 50% failure rate. It remains a latent risk: a Celery worker idle for more than ~120 s would hold a stale AMQP connection and silently discard the next task.delay() call.
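For readers unfamiliar with AMQP heartbeat negotiation, the behavior described above can be paraphrased as follows. This is a sketch of our reading of the cited code, not a verbatim copy, and the function name `negotiated_heartbeat` is ours.

```python
def negotiated_heartbeat(client_heartbeat: int, server_heartbeat: int) -> int:
    """Return the effective heartbeat interval in seconds (0 = disabled).

    Paraphrase of the negotiation described in this report: a client value
    of 0 disables heartbeats regardless of what the server proposes.
    """
    if not client_heartbeat:
        # Client disabled heartbeats: the server's proposal is ignored.
        return 0
    if server_heartbeat == 0:
        # Server has no preference: use the client's value.
        return client_heartbeat
    # Both sides proposed a value: the smaller one wins.
    return min(client_heartbeat, server_heartbeat)


if __name__ == "__main__":
    # The case in this report: client default 0, RabbitMQ proposes 60 s.
    print(negotiated_heartbeat(0, 60))
```

With the client default of 0 and RabbitMQ's 60-second proposal, the result is 0, i.e. no heartbeat on the connection.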
2. Gunicorn worker rotation — hardcoded and not configurable (confirmed primary cause)
The image entrypoint (docker-entrypoint-api-ee.sh) contains:
exec gunicorn -w "$GUNICORN_WORKERS" -k uvicorn.workers.UvicornWorker \
    plane.asgi:application \
    --bind 0.0.0.0:"${PORT:-8000}" \
    --max-requests 1200 \
    --max-requests-jitter 1000 \
    --access-logfile -
--max-requests and --max-requests-jitter are hardcoded literals. There is no environment variable (GUNICORN_MAX_REQUESTS, etc.) that can override them. The GUNICORN_WORKERS env var is respected, but the rotation settings are not.
Kubernetes best practice: The gunicorn docs describe --max-requests as "a simple method to help limit the damage of memory leaks" and note that "if this is set to zero (the default) then the automatic worker restarts are disabled" (gunicorn docs — max_requests). In container environments managed by Kubernetes, this leak-mitigation role is already covered by the OOMKiller and pod restart policies. Setting --max-requests 0 disables rotation entirely, leaving pod lifecycle to the orchestrator. The current hardcoded values cause each worker process to be replaced every ~1,200–2,200 requests, producing a 10–30 second AMQP reconnect window on each rotation — directly generating the "client unexpectedly closed TCP connection" events observed in RabbitMQ.
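The ~1,200–2,200 range follows from the gunicorn docs, which state that the jitter randomizes each worker's restart point by `randint(0, max_requests_jitter)`. A minimal sketch of that arithmetic (the function name `restart_threshold` is ours, and returning `None` for "never" is a simplification for illustration):

```python
import random
from typing import Optional


def restart_threshold(
    max_requests: int, jitter: int, rng: random.Random
) -> Optional[int]:
    """Number of requests after which a worker is recycled, or None if never.

    Follows the documented rule: max_requests + randint(0, max_requests_jitter),
    with max_requests == 0 disabling automatic restarts.
    """
    if max_requests <= 0:
        return None
    return max_requests + rng.randint(0, jitter)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [restart_threshold(1200, 1000, rng) for _ in range(5)]
    print(samples)                        # each value lies in [1200, 2200]
    print(restart_threshold(0, 0, rng))   # rotation disabled
```

With the hardcoded values every worker is guaranteed to be recycled, and only the timing varies; with `--max-requests 0` the threshold disappears entirely.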
Why silent failures instead of retries
Two chart omissions make the failures completely invisible:
- CELERY_TASK_PUBLISH_RETRY is not exposed. Without publish retry, Kombu treats a failed or swallowed publish as final and returns control to the caller without raising an exception. The HTTP API returns 200 (the exporters record is written to the DB), but the task is never enqueued, leaving the record in "queued" forever.
- CELERY_BROKER_POOL_LIMIT is not exposed. The default unlimited pool allows stale connections to accumulate without bound. Kombu does not health-check connections before reuse: a socket that appears open on the client side but is closed on the RabbitMQ side accepts a write silently at the kernel level. The message is dropped with no exception and no log output from Kombu or py-amqp. This was confirmed by a keyword search across all API pod log lines (amqp, broker, channel, kombu, celery, publish, connect) over the 21:40–23:00 UTC window: zero results for all terms.
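The kernel-level behavior referred to above can be shown in isolation. This sketch uses plain loopback TCP sockets only; it does not involve Kombu, py-amqp, or RabbitMQ, and the function name is ours. It illustrates that the first write to a socket whose peer has already closed returns successfully.

```python
import socket


def write_to_closed_peer(payload: bytes = b"task message") -> int:
    """Send payload on a connection the remote side has already closed.

    Returns the number of bytes the kernel accepted. No exception is raised
    on this first write, even though nobody will ever read the data.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # any free loopback port
    server.listen(1)

    client = socket.create_connection(server.getsockname())
    peer, _ = server.accept()

    # The "broker" side closes its end; the client is not told at the API level.
    peer.close()
    server.close()

    try:
        return client.send(payload)
    finally:
        client.close()


if __name__ == "__main__":
    print(write_to_closed_peer())
```

The write is accepted in full, which is consistent with the sub-120 ms INFO-level requests in the table: from the application's point of view nothing went wrong.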
Requested changes
1. Make gunicorn --max-requests configurable
Add GUNICORN_MAX_REQUESTS and GUNICORN_MAX_REQUESTS_JITTER env var support to the API entrypoint:
exec gunicorn -w "$GUNICORN_WORKERS" -k uvicorn.workers.UvicornWorker \
    plane.asgi:application \
    --bind 0.0.0.0:"${PORT:-8000}" \
    --max-requests "${GUNICORN_MAX_REQUESTS:-1200}" \
    --max-requests-jitter "${GUNICORN_MAX_REQUESTS_JITTER:-1000}" \
    --access-logfile -
Expose these in values.yaml so operators targeting Kubernetes can set them to 0:
env:
  gunicorn_max_requests: 0
  gunicorn_max_requests_jitter: 0
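To make the intended defaulting explicit, the proposed entrypoint's argument expansion can be modeled as below. This is an illustrative model only: `env_or_default` mimics the shell's `${NAME:-default}` (unset or empty falls back to the default), and `gunicorn_argv` is a hypothetical helper, not something the image ships.

```python
from typing import List, Mapping


def env_or_default(env: Mapping[str, str], name: str, default: str) -> str:
    """Mimic shell ${NAME:-default}: unset or empty values use the default."""
    value = env.get(name, "")
    return value if value != "" else default


def gunicorn_argv(env: Mapping[str, str]) -> List[str]:
    """Build the argument list the proposed entrypoint would exec."""
    return [
        "gunicorn",
        "-w", env.get("GUNICORN_WORKERS", ""),
        "-k", "uvicorn.workers.UvicornWorker",
        "plane.asgi:application",
        "--bind", "0.0.0.0:" + env_or_default(env, "PORT", "8000"),
        "--max-requests", env_or_default(env, "GUNICORN_MAX_REQUESTS", "1200"),
        "--max-requests-jitter",
        env_or_default(env, "GUNICORN_MAX_REQUESTS_JITTER", "1000"),
        "--access-logfile", "-",
    ]


if __name__ == "__main__":
    k8s_env = {
        "GUNICORN_WORKERS": "2",
        "GUNICORN_MAX_REQUESTS": "0",
        "GUNICORN_MAX_REQUESTS_JITTER": "0",
    }
    print(" ".join(gunicorn_argv(k8s_env)))
```

Existing deployments that set neither variable keep today's 1200/1000 behavior, so the change is backward compatible.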
2. Expose CELERY_TASK_PUBLISH_RETRY in chart values
env:
  celery_task_publish_retry: true
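Rendered into app-env.yaml as (the value is compared against the string "false" because Sprig's default treats an explicit false as empty and would replace it with true, making the setting impossible to disable):
CELERY_TASK_PUBLISH_RETRY: "{{ ne (toString .Values.env.celery_task_publish_retry) "false" | ternary "True" "False" }}"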
3. Expose CELERY_BROKER_POOL_LIMIT in chart values
env:
  celery_broker_pool_limit: 10
Setting a bounded limit prevents stale connections from accumulating without bound. The Celery docs recommend tuning this value relative to the number of threads or green threads in use (broker_pool_limit docs).
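On the application side, the two new variables would need to reach Celery's `task_publish_retry` and `broker_pool_limit` settings. How Plane's settings module actually loads Celery configuration is not known to us, so the following is only a sketch under that assumption; `celery_overrides` is a hypothetical helper, and the fallback values mirror the chart values proposed above.

```python
import os
from typing import Any, Dict, Mapping

_TRUTHY = {"1", "true", "yes"}


def celery_overrides(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Translate the proposed env vars into Celery setting names.

    Fallbacks ("True", "10") mirror the values proposed in this issue.
    """
    retry_raw = env.get("CELERY_TASK_PUBLISH_RETRY", "True")
    pool_raw = env.get("CELERY_BROKER_POOL_LIMIT", "10")
    return {
        # The chart renders "True"/"False"; accept a few common spellings.
        "task_publish_retry": retry_raw.strip().lower() in _TRUTHY,
        "broker_pool_limit": int(pool_raw),
    }


if __name__ == "__main__":
    print(celery_overrides({}))
    print(celery_overrides({"CELERY_TASK_PUBLISH_RETRY": "False",
                            "CELERY_BROKER_POOL_LIMIT": "4"}))
```

The resulting dictionary could then be applied with something like `app.conf.update(celery_overrides())` wherever the Celery app is created.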
Environment
- Chart: plane-enterprise v2.6.1
- Image: makeplane/backend-commercial (extracted entrypoint as above)
- Kubernetes: EKS; 2 API pod replicas
- RabbitMQ: in-cluster StatefulSet (chart-managed)
AMQP_URL injected via external secret (ExternalSecret → K8s Secret), so the chart's app-env.yaml AMQP_URL template is bypassed in our case.