OOM kill under sustained concurrency with long-form audio: _optimal_cold_workers has no RAM gating

## Summary

`_optimal_cold_workers()` predicts how many cold subprocess workers can be spawned safely, but gates **only on VRAM** (`_cold_vram_ema_gb * 1.2`). It does not consider host RAM pressure. With a corpus of long-form audio (>60s clips) under sustained concurrent load, the worker pool spawns enough subprocesses that combined RSS — driver buffers, audio bodies in flight, re-queued payloads after `transcribe failed (Connection lost)`, multipart uploads buffered in uvicorn — exceeds host RAM. The result is a global OOM kill of the uvicorn process by the Linux kernel.

This is reproducible and was triggered today (2026-05-30) on `sphinx` while processing the gaia.riosa.com asterisk monitor backlog.

## Reproduction context

Server: `sphinx` (RTX 5090 / 32 GB VRAM / 128 GB DDR5 / Ubuntu 26.10)
Service: `uttera-stt-hotcold.service` (commit at time of incident: head of `master` running with `COLD_POOL_SIZE=10`, `COLD_WORKER_IDLE_TIMEOUT=60`, `WHISPER_MODEL=turbo`, `WHISPER_FP16=1`)
Co-tenants on GPU: `uttera-sentiment-vllm` (~16 GB VRAM), `comfyui` (~0.5 GB VRAM). VRAM free at start ~13 GB.

Client: gaia.riosa.com bulk reprocessing of asterisk MixMonitor recordings via `speech-recog-asterisk-wrapper_bulk_dir` with `WORKERS=10` (xargs `-P 10`). 10 concurrent HTTP POST `/v1/audio/transcriptions` sustained.

Corpus (~16700 wavs of real phone calls):

| Bucket | Count | % |
|---|---|---|
| < 200 KB (~1-12 s) | 4468 | 27% |
| 200 KB – 1 MB (12-60 s) | 9145 | 55% |
| 1 – 5 MB (60-300 s) | 2764 | 17% |
| > 5 MB (> 5 min) | 317 | 2% |

(The published `librispeech-test-clean` benchmark uses clips of 4-20 s, mean 7.4 s — i.e. nothing above the small bucket. The long-form distribution above is what the bench does not exercise.)

## Symptoms observed before the OOM kill

`journalctl -u uttera-stt-hotcold` repeated this cycle indefinitely:

```
--- POOL MGR: target=3 cold workers (active=2, loading=0) | queue=510.4s audio (31.0s drain) → spawning ---
--- POOL MGR: pool worker ready, total_active=3, idle_timeout=140s ---
--- POOL WORKER: idle timeout (140s), exiting ---
--- POOL WORKER: transcribe failed (Connection lost), re-queuing to hot lane ---
```

Despite `queue=510.4s` of audio pending, cold workers were hitting their idle timeout without consuming the queue: the manager process was in `D` state (uninterruptible sleep), waiting on swap-in of its own pages, so it could not dispatch. Every "Connection lost" line re-queued the original request body to the hot lane.

`ss -tn` on the gaia side showed 20+ ESTAB connections to :9005, each with a Recv-Q of 95-200 KB pending on the server side: the kernel had buffered the multipart uploads but userspace never read them. Curl timed out at 180 s with `rc=28` for all in-flight requests.

GPU was at 0% utilization throughout the incident. This was purely a host-RAM problem.

## OOM kill record (`dmesg -T`)

```
[Sat May 30 10:43:31 2026] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
                            cpuset=user.slice,mems_allowed=0,global_oom,
                            task_memcg=/system.slice/uttera-stt-hotcold.service,
                            task=uvicorn,pid=150221,uid=1001
[Sat May 30 10:43:31 2026] Out of memory: Killed process 150221 (uvicorn)
                            total-vm:137967328kB, anon-rss:106210652kB,
                            file-rss:57456kB, shmem-rss:9792kB
[Sat May 30 10:43:34 2026] oom_reaper: reaped process 150221 (uvicorn),
                            now anon-rss:48kB, file-rss:108kB, shmem-rss:52kB
```

That's **101 GB of anonymous RSS in one uvicorn process** (`anon-rss:106210652kB`, about 101.3 GiB). `CONSTRAINT_NONE` + `global_oom` confirm this was a global system OOM, not a cgroup limit. systemd's `Restart=on-failure` brought the service back up at 10:43:55 with a fresh RSS of 3 GB, which indicates the memory was accumulated at runtime (the leak was real) and that the service is otherwise healthy.
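For reference, the unit conversion behind that figure can be checked with a few lines of Python. This is only arithmetic on the values copied from the `dmesg` excerpt; the helper name `kib_to_gib` is ours, and it assumes the kernel's `kB` fields are 1024-byte units.

```python
# Values copied from the dmesg excerpt above; kernel "kB" are 1024-byte units.
KIB_PER_GIB = 1024 ** 2


def kib_to_gib(kib: int) -> float:
    """Convert a kernel-reported kB (KiB) figure to GiB."""
    return kib / KIB_PER_GIB


anon_rss_gib = kib_to_gib(106_210_652)   # anon-rss of the killed uvicorn
total_vm_gib = kib_to_gib(137_967_328)   # total-vm of the killed uvicorn

print(f"anon-rss: {anon_rss_gib:.1f} GiB, total-vm: {total_vm_gib:.1f} GiB")
```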

Kernel stack of the blocked main process while the leak was in progress (before the OOM kill):
```
State: D (disk sleep)
VmRSS: 81853104 kB
[<0>] folio_wait_bit_common+0x11d/0x2f0
[<0>] __folio_lock_or_retry+0x34b/0x570
[<0>] do_swap_page+0x662/0x1010
[<0>] handle_pte_fault+0x1b9/0x1f0
[<0>] handle_mm_fault+0xe7/0x2f0
```

The process was paging its own RSS in and out of swap, effectively stalling itself: it was swap-thrashing rather than making progress.

## Root cause analysis

`main_stt.py:820` `_optimal_cold_workers()` docstring:

> *"Capped by COLD_POOL_SIZE (safety) and available VRAM."*

The implementation:

```python
def _optimal_cold_workers() -> int:
    if _hot_ema_sps is None or _work_queue_audio_seconds <= 0:
        return 0
    cold_start = _get_cold_start_time_stt()
    ...
    total_work_s = _work_queue_audio_seconds * _hot_ema_sps
    limit = 2.0 * total_work_s / cold_start
    N_total = 1
    while N_total * (N_total - 1) < limit:
        N_total += 1
    N_total -= 1
    cold = N_total - 1
    if COLD_POOL_SIZE > 0:
        cold = min(cold, COLD_POOL_SIZE)
    return max(0, cold)
```

Variables consulted: `_hot_ema_sps`, `_work_queue_audio_seconds`, `_get_cold_start_time_stt()`, `COLD_POOL_SIZE`, and (via `_vram_per_cold_worker()`) `_cold_vram_ema_gb`.

Not consulted: any host-RAM signal (`psutil.virtual_memory().available`, `MemAvailable` from `/proc/meminfo`, swap pressure, current process RSS).
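As an illustration of how cheap such a signal is to obtain, here is a minimal sketch that reads `MemAvailable` from `/proc/meminfo` using only the standard library. The function names and the sample text are hypothetical (the sample numbers are not from the incident), and none of this exists in `main_stt.py` today.

```python
from typing import Optional


def parse_mem_available_kib(meminfo_text: str) -> Optional[int]:
    """Return MemAvailable in KiB from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    2048000 kB"
            return int(line.split()[1])
    return None


def read_mem_available_kib(path: str = "/proc/meminfo") -> Optional[int]:
    """Read the live value on Linux; returns None if the file is unreadable."""
    try:
        with open(path, encoding="ascii") as fh:
            return parse_mem_available_kib(fh.read())
    except OSError:
        return None


# Illustrative sample only, not captured from sphinx.
SAMPLE = """MemTotal:       131072000 kB
MemFree:         1024000 kB
MemAvailable:    2048000 kB
"""

print(parse_mem_available_kib(SAMPLE))
```

`psutil.virtual_memory().available` would give the same kind of reading in bytes if `psutil` is already a dependency.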

The implicit assumption — that host-RAM usage scales monotonically with VRAM usage, so VRAM gating subsumes RAM gating — holds for short clips (the benchmarked regime) but fails for long-form audio because per-worker RAM footprint includes audio bodies and decoder context that don't live in VRAM:

- Audio buffer per worker for a 5-min wav at 16 kHz mono float32: ~19 MB
- Multipart upload body retained by uvicorn until handler reads it: ~5 MB per in-flight request
- `transcribe failed → re-queue to hot lane` duplicates the body each retry; with workers cycling idle-timeout-die-respawn, retries pile up
- Whisper decoder state for long sequences holds intermediate tensors in CPU memory beyond what the EMA captures (EMA only sees VRAM drop after the worker exits)

With 10 cold workers spawned + manager + cumulative re-queue retries on the ~317 wavs >5 min mixed into the workload, RSS scaled to 101 GB.

## Suggested fix

Gate `_optimal_cold_workers()` on host RAM in addition to VRAM. Two complementary checks:

1. **Hard floor on available memory**: cap `N` such that `(N+1) * EMA(RAM_per_worker)` ≤ `psutil.virtual_memory().available - MIN_RAM_HEADROOM_GB`, with both sides expressed in the same unit (`psutil` reports bytes). Symmetric to the existing `_cold_vram_ema_gb * SAFETY_FACTOR` logic.
2. **Worst-case audio scaling**: pass `max(audio_length_s in _work_queue)` (not just sum) into the projection. The current `total_work_s` aggregate doesn't capture that one 5-minute clip ties up a worker for ~5× longer than mean.
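A minimal sketch of check 1, written as a pure function so it can be unit-tested without touching the host. All names here are hypothetical, and reading the `+1` as the already-resident hot process is our interpretation of the formula, not something taken from `main_stt.py`. In the service, `available_bytes` would presumably come from `psutil.virtual_memory().available`.

```python
def ram_capped_cold_workers(desired_cold: int,
                            ram_per_worker_bytes: float,
                            available_bytes: float,
                            min_headroom_bytes: float) -> int:
    """Largest N <= desired_cold with (N + 1) * ram_per_worker <= available - headroom."""
    if desired_cold <= 0:
        return 0
    if ram_per_worker_bytes <= 0:
        # No RAM estimate yet: leave the existing caps (COLD_POOL_SIZE, VRAM) in charge.
        return desired_cold
    budget = available_bytes - min_headroom_bytes
    if budget <= 0:
        return 0
    max_total = int(budget // ram_per_worker_bytes)  # this is N + 1
    return max(0, min(desired_cold, max_total - 1))


GIB = 1024 ** 3
# Illustrative numbers only: 10 workers wanted, 8 GiB each, 40 GiB available, 8 GiB headroom.
print(ram_capped_cold_workers(10, 8 * GIB, 40 * GIB, 8 * GIB))
```

The result would then be combined with the existing caps via `min(...)`, so the RAM gate can only lower the worker count, never raise it.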

Plus a defensive fix on `re-queue to hot lane`: drop the body bytes from the in-memory request object after first re-queue attempt, or impose a per-request retry cap, so a worker losing its connection mid-transcribe doesn't multiply RAM pressure.
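One way the retry-cap variant could look is sketched below. The `PendingRequest` type, `MAX_REQUEUES` and `requeue_to_hot_lane` are hypothetical names that do not come from the codebase. The idea is that the same request object is re-enqueued by reference (no second copy of the body), and once the cap is hit the body is released and the request fails instead of cycling again.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional

MAX_REQUEUES = 2  # hypothetical cap; the right value would need tuning


@dataclass
class PendingRequest:
    request_id: str
    body: Optional[bytes]      # raw upload bytes; None once released
    requeue_count: int = 0


class RetryLimitExceeded(Exception):
    """Raised when a request has been re-queued too many times."""


def requeue_to_hot_lane(req: PendingRequest,
                        hot_queue: Deque[PendingRequest],
                        max_requeues: int = MAX_REQUEUES) -> None:
    """Re-enqueue the same object (no body copy), or give up and free the body."""
    if req.requeue_count >= max_requeues:
        req.body = None  # drop the reference so the bytes can be reclaimed
        raise RetryLimitExceeded(f"{req.request_id}: exceeded {max_requeues} re-queues")
    req.requeue_count += 1
    hot_queue.append(req)


hot_lane: Deque[PendingRequest] = deque()
demo = PendingRequest("demo-1", b"\x00" * 1024)
requeue_to_hot_lane(demo, hot_lane)
requeue_to_hot_lane(demo, hot_lane)
try:
    requeue_to_hot_lane(demo, hot_lane)
except RetryLimitExceeded as exc:
    print("gave up:", exc)
```

The caller would translate `RetryLimitExceeded` into an error response to the client, which bounds the RAM a single failing request can hold.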

Optionally: a metrics counter `uttera_stt_oom_predictor_skipped_total`, incremented whenever `_optimal_cold_workers()` would have spawned a worker but the RAM gate vetoed it. This gives ops visibility before the host gets close to the OOM boundary.

## Workaround until fixed

Operators with long-form corpora should set `COLD_POOL_SIZE` empirically below the value `_optimal_cold_workers()` would pick. For our corpus (~17% clips >60 s, ~2% clips >5 min) on a 128 GB host with sentiment+comfyui co-tenants, even `COLD_POOL_SIZE=3` may overcommit RAM under sustained load. The safer path is to switch the wrapper to `uttera-stt-vllm` (single-process AsyncLLM, no per-worker model duplication) for the bulk backlog and keep `hotcold` for live MixMonitor sidecar traffic (low concurrency, short calls usually).

## Bench coverage gap

`results/2026-04-17-run1-hotcold-librispeech/` and the sustained profile at `0.5 × burst@64 rps` validate behaviour for the LibriSpeech-shaped distribution. They do not exercise the corner case of long-form audio under sustained concurrent load on a shared GPU.

Proposal: add a fourth STT corpus to `PROTOCOL.md`, `uttera-stt-longform`, with a clip-length distribution skewed toward 60-300 s (e.g. resampled call-recording WAVs, public-domain podcast cuts). In addition, add the new `sustained-overload` profile mentioned under "Open questions" in the run1 `notes.md`, which exercises continuous load when HOT is saturated, with the RAM gauge captured at minutes 0, 1, 2, 3, 4 and 5.
