plane-enterprise: AMQP broker stability gaps cause silent task dispatch failures in Kubernetes

## Problem

In our Kubernetes deployment of Plane Enterprise v2.6.3 (chart v2.6.1), CSV exports intermittently fail without surfacing any user-visible error. The `exporters` table shows records stuck in `"queued"` status indefinitely — the task is never picked up by the Celery worker. Over the past 7 days we observed 5 completed exports and 5 stuck in `"queued"` (50% silent failure rate).

<img width="995" height="447" alt="Image" src="https://github.com/user-attachments/assets/99f38d1e-9e15-4785-b0f3-007cc629d3a9" />

### Observed evidence

**RabbitMQ logs** contain 936 occurrences of:

```
client unexpectedly closed TCP connection
```

...across two API pods over the same 7-day window. These events are produced when gunicorn workers are rotated due to the hardcoded `--max-requests 1200` setting in the image entrypoint.

**Timestamp correlation** of stuck export records against gunicorn rotation events seen in the RabbitMQ logs (June 29 sample), with API pod log evidence from Loki:

| Export `created_at` (UTC) | API pod | `duration_ms` | `levelname` | Worker received? | Failure mode |
|---|---|---|---|---|---|
| 21:36:26 | — | — | — | No | Rotation window (~21:36–21:38, log boundary) |
| 21:58:31 | `9kvdc` | 81 | INFO | No | **Stale pool — silent drop** |
| 22:30:53 | `9kvdc` | 4,295 | **WARNING** | No | **Rotation-window AMQP timeout** |
| 22:33:20 | `j8bgm` | 112 | INFO | No | **Stale pool — silent drop** |

The Celery worker pod (`ptx5c`) received and completed three other exports in the same window (22:22:03, 22:32:47, 22:37:37) but shows **zero log entries** at 21:58:31 or 22:33:20 — the tasks were never published to RabbitMQ.

**22:30:53 case**: The 4,295 ms duration and WARNING-level log indicate Kombu hit a broken AMQP connection during the rotation reconnect window (rotation at 22:30:42, reconnection at 22:30:54). After an internal socket timeout, `task.delay()` returned without raising an exception — the task was silently discarded.
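The window check behind this classification can be sketched as follows. This is only an illustration of the correlation step, using the two timestamps quoted above; the function name and constants are hypothetical and not part of any Plane tooling.

```python
from datetime import time

# Times taken from the 22:30:53 case above (UTC, June 29 sample).
ROTATION_AT = time(22, 30, 42)     # gunicorn worker rotation seen in RabbitMQ logs
RECONNECTED_AT = time(22, 30, 54)  # first successful reconnection


def in_reconnect_window(
    created_at: time,
    start: time = ROTATION_AT,
    end: time = RECONNECTED_AT,
) -> bool:
    """Return True if an export was created while the AMQP connection was down."""
    return start <= created_at <= end


if __name__ == "__main__":
    for stamp in (time(21, 58, 31), time(22, 30, 53), time(22, 33, 20)):
        print(stamp.isoformat(), in_reconnect_window(stamp))
```

Exports that fall outside every such window but still never reach the worker are the ones classified as stale-pool drops.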

**21:58:31 and 22:33:20 cases**: Both completed in under 120 ms at INFO level — no AMQP exception, no warning, no visible error. Kombu selected a stale TCP connection from its broker pool: the socket still appeared open on the client side but had been closed by RabbitMQ as part of the earlier rotation churn. The kernel-level socket write succeeded instantly; RabbitMQ never received the message.
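The "write succeeds on a socket the peer already closed" behaviour can be reproduced with plain TCP sockets, independent of Kombu or RabbitMQ. The sketch below is a stand-in for the broker connection, not the library's code path: a local listener plays the broker, closes its end, and the client then writes. It assumes Linux-style TCP semantics on loopback; the function name is hypothetical.

```python
import socket
from typing import Tuple


def write_to_peer_closed_socket(payload: bytes = b"publish") -> Tuple[int, bytes]:
    """Send on a connection whose remote end has already been closed.

    Returns (bytes_accepted_by_send, bytes_seen_by_peek).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    # The "broker" closes its end; the client object still looks usable.
    server_side.close()
    listener.close()

    client.settimeout(2.0)
    # A read-side check reveals the close: recv returns b"" once the FIN arrives.
    peeked = client.recv(1, socket.MSG_PEEK)
    # The write is still accepted by the local kernel, with no exception raised.
    sent = client.send(payload)
    client.close()
    return sent, peeked


if __name__ == "__main__":
    print(write_to_peer_closed_socket())
```

The point of the sketch is that only a read-side check (or a heartbeat) exposes the closed peer; the first write alone gives the caller no signal. A later write on the same socket would typically fail, but by then the first message is already lost.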

### Root cause

There are two compounding failure modes.

#### 1. Disabled AMQP heartbeat (latent risk, ruled out as current cause)

Kombu's `Connection` defaults to `heartbeat=0` ([`kombu/connection.py`, line 173, v5.6.2](https://github.com/celery/kombu/blob/v5.6.2/kombu/connection.py#L173)), and py-amqp treats a client value of 0 as "heartbeats disabled", which silently overrides RabbitMQ's proposed 60-second timeout during AMQP negotiation ([py-amqp `amqp/connection.py`, lines 425–440, v5.3.1](https://github.com/celery/py-amqp/blob/v5.3.1/amqp/connection.py#L425)). With no heartbeat active, stale idle connections accumulate without detection.

A 7-day scan of RabbitMQ logs found **zero** "missed heartbeats from client" events. All 936 connection-close events were "client unexpectedly closed TCP connection" (client-initiated, caused by gunicorn rotation). This rules out heartbeat failure as the active cause of the observed 50% failure rate. It remains a latent risk: a publishing process that sits idle for more than ~120 s could hold a stale AMQP connection and silently discard the next `task.delay()` call.
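The negotiation outcome described in this section can be modelled in a few lines. This is a simplified model written for illustration, not a copy of py-amqp's code; the function name is hypothetical, and the rule it encodes (a client value of 0 disables heartbeats regardless of the server's proposal) is the behaviour claimed above.

```python
def negotiate_heartbeat(client_heartbeat: int, server_heartbeat: int) -> int:
    """Return the effective heartbeat interval in seconds (0 means disabled)."""
    if not client_heartbeat:
        # Client asked for no heartbeat: the server's proposal is ignored.
        return 0
    if not server_heartbeat:
        # Server proposed none: the client's value is used as-is.
        return client_heartbeat
    # Both sides proposed a value: the smaller interval wins.
    return min(client_heartbeat, server_heartbeat)


if __name__ == "__main__":
    print(negotiate_heartbeat(0, 60))   # default client value vs RabbitMQ's 60 s
    print(negotiate_heartbeat(30, 60))  # explicit client value
```

Under this model, any non-zero client value is enough to re-enable heartbeats on the connection.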

#### 2. Gunicorn worker rotation — hardcoded and not configurable (confirmed primary cause)

The image entrypoint (`docker-entrypoint-api-ee.sh`) contains:

```bash
exec gunicorn -w "$GUNICORN_WORKERS" -k uvicorn.workers.UvicornWorker \
  plane.asgi:application \
  --bind 0.0.0.0:"${PORT:-8000}" \
  --max-requests 1200 \
  --max-requests-jitter 1000 \
  --access-logfile -
```

`--max-requests` and `--max-requests-jitter` are hardcoded literals. There is no environment variable (`GUNICORN_MAX_REQUESTS`, etc.) that can override them. The `GUNICORN_WORKERS` env var is respected, but the rotation settings are not.

**Kubernetes best practice**: The gunicorn docs describe `--max-requests` as "a simple method to help limit the damage of memory leaks" and note that "if this is set to zero (the default) then the automatic worker restarts are disabled" ([gunicorn docs: `max_requests`](https://gunicorn.org/reference/settings/#max_requests)). In container environments managed by Kubernetes, this leak-mitigation role is already covered by the OOM killer and pod restart policies. Setting `--max-requests 0` disables rotation entirely, leaving pod lifecycle to the orchestrator. With the current hardcoded values, each worker process is replaced after roughly 1,200–2,200 requests (the base of 1,200 plus a random jitter of up to 1,000). Each rotation produces a 10–30 second AMQP reconnect window and directly generates the "client unexpectedly closed TCP connection" events observed in RabbitMQ.
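The 1,200–2,200 range follows from how the two flags combine: each worker restarts after the base value plus a random jitter between 0 and the jitter setting. The sketch below models that arithmetic under this assumption; the function is hypothetical and is not gunicorn's own code.

```python
import random
from typing import Optional


def restart_threshold(
    max_requests: int, jitter: int, rng: random.Random
) -> Optional[int]:
    """Number of requests a worker serves before being replaced.

    Returns None when rotation is disabled (max_requests == 0).
    """
    if max_requests <= 0:
        return None
    # Each worker draws its own jitter, so restarts are spread out.
    return max_requests + rng.randint(0, jitter)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [restart_threshold(1200, 1000, rng) for _ in range(5)]
    print(samples)                         # every value lies in [1200, 2200]
    print(restart_threshold(0, 0, rng))    # None: rotation disabled
```

With `--max-requests 0` the threshold never triggers, which is the setting this issue asks to make reachable.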

### Why silent failures instead of retries

Two chart omissions make the failures completely invisible:

1. **`CELERY_TASK_PUBLISH_RETRY` is not exposed.** Celery documents `task_publish_retry` as enabled by default, but the chart offers no way to confirm, set, or tune it or its retry policy. When a publish fails or is swallowed and no retry takes effect, control returns to the caller without raising an exception. The HTTP API returns 200 (the `exporters` record is written to the DB), but the task is never enqueued, leaving the record in `"queued"` forever.

2. **`CELERY_BROKER_POOL_LIMIT` is not exposed.** Celery documents a default of 10 for `broker_pool_limit`, but the chart provides no way to lower it or to disable pooling, so operators cannot limit how many pooled connections can go stale. Kombu does not health-check connections before reuse: a socket that appears open on the client side but is closed on the RabbitMQ side accepts a write silently at the kernel level. The message is dropped with no exception and no log output from Kombu or py-amqp. This was confirmed by a keyword search across all API pod log lines (`amqp`, `broker`, `channel`, `kombu`, `celery`, `publish`, `connect`) over the 21:40–23:00 UTC window: **zero results** for all terms.

---

## Requested changes

### 1. Make gunicorn `--max-requests` configurable

Add `GUNICORN_MAX_REQUESTS` and `GUNICORN_MAX_REQUESTS_JITTER` env var support to the API entrypoint:

```bash
exec gunicorn -w "$GUNICORN_WORKERS" -k uvicorn.workers.UvicornWorker \
  plane.asgi:application \
  --bind 0.0.0.0:"${PORT:-8000}" \
  --max-requests "${GUNICORN_MAX_REQUESTS:-1200}" \
  --max-requests-jitter "${GUNICORN_MAX_REQUESTS_JITTER:-1000}" \
  --access-logfile -
```

Expose these in `values.yaml` so operators targeting Kubernetes can set them to `0`:

```yaml
env:
  gunicorn_max_requests: 0
  gunicorn_max_requests_jitter: 0
```

### 2. Expose `CELERY_TASK_PUBLISH_RETRY` in chart values

```yaml
env:
  celery_task_publish_retry: true
```

Rendered into `app-env.yaml` as:

```yaml
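# `default true` would turn an explicit `false` back into true, because
# Helm's `default` treats false as empty; compare against "false" instead.
CELERY_TASK_PUBLISH_RETRY: {{ ne (toString .Values.env.celery_task_publish_retry) "false" | ternary "True" "False" | quote }}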
```

### 3. Expose `CELERY_BROKER_POOL_LIMIT` in chart values

```yaml
env:
  celery_broker_pool_limit: 10
```

Exposing the limit lets operators cap how many pooled connections can go stale. Celery's documented default is 10, and a value of `0` or `None` disables the pool so that a connection is established and closed for every use. The Celery docs recommend tuning this value relative to the number of threads or green threads in use ([`broker_pool_limit` docs](https://docs.celeryq.dev/en/stable/userguide/configuration.html#broker-pool-limit)).

---

## Environment

- Chart: `plane-enterprise` v2.6.1
- Image: `makeplane/backend-commercial` (extracted entrypoint as above)
- Kubernetes: EKS; 2 API pod replicas
- RabbitMQ: in-cluster StatefulSet (chart-managed)
- `AMQP_URL` injected via external secret (ExternalSecret → K8s Secret), so the chart's `app-env.yaml` `AMQP_URL` template is bypassed in our case.
