plane-enterprise: AMQP broker stability gaps cause silent task dispatch failures in Kubernetes #261

@leedsjb

Problem

In our Kubernetes deployment of Plane Enterprise v2.6.3 (chart v2.6.1), CSV exports intermittently fail without surfacing any user-visible error. The exporters table shows records stuck in "queued" status indefinitely — the task is never picked up by the Celery worker. Over the past 7 days we observed 5 completed exports and 5 stuck in "queued" (50% silent failure rate).

Observed evidence

RabbitMQ logs contain 936 occurrences of:

client unexpectedly closed TCP connection

...across two API pods over the same 7-day window. These events are produced when gunicorn workers are rotated due to the hardcoded --max-requests 1200 setting in the image entrypoint.

Timestamp correlation of stuck export records against RabbitMQ rotation events (June 29 sample), with API pod log evidence from Loki:

| Export created_at (UTC) | API pod | duration_ms | levelname | Worker received? | Failure mode |
|---|---|---|---|---|---|
| 21:36:26 | | | | No | Rotation window (~21:36–21:38, log boundary) |
| 21:58:31 | 9kvdc | 81 | INFO | No | Stale pool: silent drop |
| 22:30:53 | 9kvdc | 4,295 | WARNING | No | Rotation-window AMQP timeout |
| 22:33:20 | j8bgm | 112 | INFO | No | Stale pool: silent drop |

The Celery worker pod (ptx5c) received and completed three other exports in the same window (22:22:03, 22:32:47, 22:37:37) but shows zero log entries at 21:58:31 or 22:33:20 — the tasks were never published to RabbitMQ.

22:30:53 case: The 4,295 ms duration and WARNING-level log indicate Kombu hit a broken AMQP connection during the rotation reconnect window (rotation at 22:30:42, reconnection at 22:30:54). After an internal socket timeout, task.delay() returned without raising an exception — the task was silently discarded.
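The correlation in the table can be reproduced with a few lines of Python. This is a minimal sketch, not the tooling used for the analysis: the window boundaries are the rotation and reconnection times quoted above (the 21:36–21:38 window is approximate), and `ROTATION_WINDOWS`, `in_rotation_window`, and `classify` are hypothetical names.

```python
from datetime import time

# (start, end) of each rotation/reconnect window, UTC, June 29 sample.
# The first window is approximate (log boundary); the second is the
# rotation at 22:30:42 and reconnection at 22:30:54 quoted above.
ROTATION_WINDOWS = [
    (time(21, 36, 0), time(21, 38, 0)),
    (time(22, 30, 42), time(22, 30, 54)),
]

# created_at values of the four stuck export records.
STUCK_EXPORTS = [
    time(21, 36, 26),
    time(21, 58, 31),
    time(22, 30, 53),
    time(22, 33, 20),
]


def in_rotation_window(ts, windows=ROTATION_WINDOWS):
    """Return True if ts falls inside any known rotation window."""
    return any(start <= ts <= end for start, end in windows)


def classify(ts):
    """Label a stuck export by whether it coincides with a rotation."""
    return "rotation-window" if in_rotation_window(ts) else "outside-window"


classification = {ts.isoformat(): classify(ts) for ts in STUCK_EXPORTS}
```

Exports labelled "outside-window" cannot be explained by an in-progress rotation alone, which is what points to a second failure mode.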

21:58:31 and 22:33:20 cases: Both completed in under 120 ms at INFO level — no AMQP exception, no warning, no visible error. Kombu selected a stale TCP connection from its broker pool: the socket still appeared open on the client side but had been closed by RabbitMQ as part of the earlier rotation churn. The kernel-level socket write succeeded instantly; RabbitMQ never received the message.
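The "write succeeds on a dead connection" behaviour can be demonstrated without RabbitMQ. The sketch below is a stand-in under stated assumptions: a plain TCP listener on loopback plays the broker, closes its end, and the client's first write afterwards still returns normally. It illustrates ordinary TCP semantics as seen on Linux, not Kombu's or py-amqp's internals.

```python
import socket
import time

# A loopback listener stands in for the broker (hypothetical, not RabbitMQ).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)

client = socket.create_connection(listener.getsockname())
broker_side, _ = listener.accept()

# The "broker" closes its end. The client is not notified unless it reads.
broker_side.close()
listener.close()
time.sleep(0.2)  # give the close time to reach the client on loopback

# First write after the peer closed: the kernel accepts the bytes into the
# send buffer and returns at once, although nothing will ever read them.
payload = b"task payload"
sent = client.send(payload)
client.close()
```

A later write on the same socket would typically fail once the peer's reset arrives, but a publisher that sends one message and returns never gets that far.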

Root cause

There are two compounding failure modes.

1. Disabled AMQP heartbeat (latent risk, ruled out as current cause)

Kombu defaults to heartbeat=0 (kombu/connection.py, line 173, v5.6.2). py-amqp treats a client heartbeat of 0 as disabled and ignores RabbitMQ's proposed 60-second timeout during AMQP negotiation (py-amqp amqp/connection.py, lines 425–440, v5.3.1). With no heartbeat active, stale idle connections accumulate without detection.

A 7-day scan of RabbitMQ logs found zero "missed heartbeats from client" events — all 936 connection-close events were "client unexpectedly closed TCP connection" (client-initiated, caused by gunicorn rotation). This rules out heartbeat failure as the active cause of the observed 50% failure rate. It remains a latent risk: a Celery worker idle for more than ~120 s would hold a stale AMQP connection and silently discard the next task.delay() call.

2. Gunicorn worker rotation — hardcoded and not configurable (confirmed primary cause)

The image entrypoint (docker-entrypoint-api-ee.sh) contains:

exec gunicorn -w "$GUNICORN_WORKERS" -k uvicorn.workers.UvicornWorker \
  plane.asgi:application \
  --bind 0.0.0.0:"${PORT:-8000}" \
  --max-requests 1200 \
  --max-requests-jitter 1000 \
  --access-logfile -

--max-requests and --max-requests-jitter are hardcoded literals. There is no environment variable (GUNICORN_MAX_REQUESTS, etc.) that can override them. The GUNICORN_WORKERS env var is respected, but the rotation settings are not.

Kubernetes best practice: The gunicorn docs describe --max-requests as "a simple method to help limit the damage of memory leaks" and note that "if this is set to zero (the default) then the automatic worker restarts are disabled" (gunicorn docs — max_requests). In container environments managed by Kubernetes, this leak-mitigation role is already covered by the OOMKiller and pod restart policies. Setting --max-requests 0 disables rotation entirely, leaving pod lifecycle to the orchestrator. The current hardcoded values cause each worker process to be replaced every ~1,200–2,200 requests, producing a 10–30 second AMQP reconnect window on each rotation — directly generating the "client unexpectedly closed TCP connection" events observed in RabbitMQ.
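The ~1,200–2,200 range follows from how the two flags combine. A minimal sketch, modeled on gunicorn's documented behaviour (a random jitter between 0 and --max-requests-jitter is added to --max-requests, and 0 disables automatic restarts); `restart_threshold` is a hypothetical helper, not a gunicorn API:

```python
import random
import sys


def restart_threshold(max_requests, jitter, rng=random):
    """Requests a worker serves before it is restarted (hypothetical helper)."""
    if max_requests <= 0:
        # 0 means automatic worker restarts are disabled.
        return sys.maxsize
    # Each worker draws its own jitter, so restarts are spread out.
    return max_requests + rng.randint(0, jitter)


# Current hardcoded values: every worker restarts somewhere in 1200..2200.
current = [restart_threshold(1200, 1000) for _ in range(1000)]

# Proposed Kubernetes setting: no request-count-based rotation at all.
disabled = restart_threshold(0, 0)
```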

Why silent failures instead of retries

Two chart omissions make the failures completely invisible:

  1. CELERY_TASK_PUBLISH_RETRY is not exposed. Without publish retry, Kombu treats a failed or swallowed publish as final and returns control to the caller without raising an exception. The HTTP API returns 200 (the exporters record is written to the DB), but the task is never enqueued, leaving the record in "queued" forever.

  2. CELERY_BROKER_POOL_LIMIT is not exposed. The default unlimited pool allows stale connections to accumulate without bound. Kombu does not health-check connections before reuse: a socket that appears open on the client side but is closed on the RabbitMQ side accepts a write silently at the kernel level. The message is dropped with no exception and no log output from Kombu or py-amqp. This was confirmed by a keyword search across all API pod log lines (amqp, broker, channel, kombu, celery, publish, connect) over the 21:40–23:00 UTC window — zero results for all terms.


Requested changes

1. Make gunicorn --max-requests configurable

Add GUNICORN_MAX_REQUESTS and GUNICORN_MAX_REQUESTS_JITTER env var support to the API entrypoint:

exec gunicorn -w "$GUNICORN_WORKERS" -k uvicorn.workers.UvicornWorker \
  plane.asgi:application \
  --bind 0.0.0.0:"${PORT:-8000}" \
  --max-requests "${GUNICORN_MAX_REQUESTS:-1200}" \
  --max-requests-jitter "${GUNICORN_MAX_REQUESTS_JITTER:-1000}" \
  --access-logfile -

Expose these in values.yaml so operators targeting Kubernetes can set them to 0:

env:
  gunicorn_max_requests: 0
  gunicorn_max_requests_jitter: 0
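
An alternative to interpolating the flags in the shell entrypoint would be a gunicorn config file, which is plain Python. A minimal sketch, assuming the image were started with a config file (for example `gunicorn -c gunicorn.conf.py ...`); the file name and the `rotation_settings` helper are hypothetical, while `max_requests` and `max_requests_jitter` are gunicorn's own setting names:

```python
import os


def rotation_settings(environ=os.environ):
    """Read the rotation settings, falling back to today's hardcoded values."""
    return {
        "max_requests": int(environ.get("GUNICORN_MAX_REQUESTS", "1200")),
        "max_requests_jitter": int(
            environ.get("GUNICORN_MAX_REQUESTS_JITTER", "1000")
        ),
    }


# gunicorn picks up module-level names from its config file.
_settings = rotation_settings()
max_requests = _settings["max_requests"]
max_requests_jitter = _settings["max_requests_jitter"]
```

Command-line flags take precedence over the config file, so the hardcoded --max-requests flags would have to be dropped from the entrypoint for this variant to take effect.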

2. Expose CELERY_TASK_PUBLISH_RETRY in chart values

env:
  celery_task_publish_retry: true

Rendered into app-env.yaml as:

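# Keep the default (true) in values.yaml: piping through `default true` would turn an explicit false back into true.
CELERY_TASK_PUBLISH_RETRY: {{ .Values.env.celery_task_publish_retry | ternary "True" "False" | quote }}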

3. Expose CELERY_BROKER_POOL_LIMIT in chart values

env:
  celery_broker_pool_limit: 10

Setting a bounded limit prevents stale connections from accumulating without bound. The Celery docs recommend tuning this value relative to the number of threads or green threads in use (broker_pool_limit docs).
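
On the application side, the rendered strings still have to be parsed before they reach Celery. A minimal sketch of that step, assuming the backend reads these variables in its settings module (this is not taken from Plane's code); `env_bool`, `env_int`, and `celery_overrides` are hypothetical helpers, while `task_publish_retry` and `broker_pool_limit` are Celery's setting names:

```python
import os


def env_bool(name, default, environ=os.environ):
    """Parse "True"/"False"-style strings as rendered by the chart."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")


def env_int(name, default, environ=os.environ):
    """Parse an integer setting, treating unset or empty as the default."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


def celery_overrides(environ=os.environ):
    """Build the Celery settings controlled by the two proposed chart values."""
    return {
        "task_publish_retry": env_bool("CELERY_TASK_PUBLISH_RETRY", True, environ),
        # 10 mirrors the chart value proposed above.
        "broker_pool_limit": env_int("CELERY_BROKER_POOL_LIMIT", 10, environ),
    }
```

Parsing the boolean explicitly matters here: a bare `bool("False")` is `True` in Python, so a naive cast would make the setting impossible to disable.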


Environment

  • Chart: plane-enterprise v2.6.1
  • Image: makeplane/backend-commercial (extracted entrypoint as above)
  • Kubernetes: EKS; 2 API pod replicas
  • RabbitMQ: in-cluster StatefulSet (chart-managed)
  • AMQP_URL injected via external secret (ExternalSecret → K8s Secret), so the chart's app-env.yaml AMQP_URL template is bypassed in our case.
