Summary
_optimal_cold_workers() predicts how many cold subprocess workers can be spawned safely, but gates only on VRAM (_cold_vram_ema_gb * 1.2). It does not consider host RAM pressure. With a corpus of long-form audio (>60s clips) under sustained concurrent load, the worker pool spawns enough subprocesses that combined RSS — driver buffers, audio bodies in flight, re-queued payloads after transcribe failed (Connection lost), multipart uploads buffered in uvicorn — exceeds host RAM. The result is a global OOM kill of the uvicorn process by the Linux kernel.
This is reproducible and was triggered today (2026-05-30) on sphinx while processing the gaia.riosa.com asterisk monitor backlog.
Reproduction context
Server: sphinx (RTX 5090 / 32 GB VRAM / 128 GB DDR5 / Ubuntu 26.10)
Service: uttera-stt-hotcold.service (commit at time of incident: head of master running with COLD_POOL_SIZE=10, COLD_WORKER_IDLE_TIMEOUT=60, WHISPER_MODEL=turbo, WHISPER_FP16=1)
Co-tenants on GPU: uttera-sentiment-vllm (~16 GB VRAM), comfyui (~0.5 GB VRAM). VRAM free at start ~13 GB.
Client: gaia.riosa.com bulk reprocessing of asterisk MixMonitor recordings via speech-recog-asterisk-wrapper_bulk_dir with WORKERS=10 (xargs -P 10). 10 concurrent HTTP POST /v1/audio/transcriptions sustained.
Corpus (~16700 wavs of real phone calls):
| Bucket | Count | % |
| --- | --- | --- |
| < 200 KB (~1-12 s) | 4468 | 27% |
| 200 KB – 1 MB (12-60 s) | 9145 | 55% |
| 1 – 5 MB (60-300 s) | 2764 | 17% |
| > 5 MB (> 5 min) | 317 | 2% |
(The published librispeech-test-clean benchmark uses clips of 4-20 s, mean 7.4 s — i.e. nothing above the small bucket. The long-form distribution above is what the bench does not exercise.)
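The size-to-duration pairs in the bucket labels work out to roughly 16 kB of audio per second, which is what 8 kHz, 16-bit, mono PCM would produce. The recording format is not stated in this report, so that encoding is an assumption. Under it, a rough way to bucket a recording and estimate its duration from file size might look like this:

```python
# Assumption: recordings are 8 kHz, 16-bit, mono PCM WAV with a 44-byte header.
# The report does not state the format; these constants are illustrative only.
SAMPLE_RATE_HZ = 8_000
BYTES_PER_SAMPLE = 2
WAV_HEADER_BYTES = 44


def estimate_duration_s(file_size_bytes: int) -> float:
    """Estimate clip duration in seconds from the WAV file size."""
    payload = max(0, file_size_bytes - WAV_HEADER_BYTES)
    return payload / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def bucket(file_size_bytes: int) -> str:
    """Map a file size onto the four corpus buckets listed above."""
    kb = file_size_bytes / 1024
    if kb < 200:
        return "< 200 KB"
    if kb < 1024:
        return "200 KB - 1 MB"
    if kb < 5 * 1024:
        return "1 - 5 MB"
    return "> 5 MB"
```

If the recordings use a different codec or sample rate, only the three constants need to change.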
Symptoms observed before the OOM kill
journalctl -u uttera-stt-hotcold repeated this cycle indefinitely:
--- POOL MGR: target=3 cold workers (active=2, loading=0) | queue=510.4s audio (31.0s drain) → spawning ---
--- POOL MGR: pool worker ready, total_active=3, idle_timeout=140s ---
--- POOL WORKER: idle timeout (140s), exiting ---
--- POOL WORKER: transcribe failed (Connection lost), re-queuing to hot lane ---
Despite queue=510.4s of audio pending, cold workers were idle-timing-out without consuming the queue (manager process was in D state — uninterruptible disk sleep — waiting on swap-in for its own pages, so it couldn't dispatch). Every "Connection lost" line re-queued the original request body to the hot lane.
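To spot this cycle from the journal without reading it by eye, the POOL MGR line can be parsed into numbers. The sketch below is an illustration written against the one log line quoted above; the field meanings (queue as seconds of pending audio, drain as estimated seconds to process it) are inferred from the log text, not confirmed from the service source.

```python
import re
from typing import Optional

# Pattern written against the single "POOL MGR: target=..." line in this report.
# Other log variants, if any exist, are not covered.
POOL_MGR_RE = re.compile(
    r"target=(?P<target>\d+) cold workers "
    r"\(active=(?P<active>\d+), loading=(?P<loading>\d+)\) \| "
    r"queue=(?P<queue_s>[\d.]+)s audio \((?P<drain_s>[\d.]+)s drain\)"
)


def parse_pool_mgr(line: str) -> Optional[dict]:
    """Return the numeric fields of a POOL MGR sizing line, or None if no match."""
    m = POOL_MGR_RE.search(line)
    if m is None:
        return None
    return {
        "target": int(m.group("target")),
        "active": int(m.group("active")),
        "loading": int(m.group("loading")),
        "queue_s": float(m.group("queue_s")),
        "drain_s": float(m.group("drain_s")),
    }
```

A queue_s value that stays flat or grows across successive lines while workers report idle timeouts is the signature described here.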
ss -tn on the gaia side showed 20+ ESTAB connections to :9005, each with a Recv-Q of 95-200 KB pending on the server side: the kernel had buffered the multipart uploads, but userspace never read them. curl timed out at 180 s with rc=28 for all in-flight requests.
GPU was at 0% utilization throughout the incident. This was purely a host-RAM problem.
OOM kill record (dmesg -T)
[Sat May 30 10:43:31 2026] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
cpuset=user.slice,mems_allowed=0,global_oom,
task_memcg=/system.slice/uttera-stt-hotcold.service,
task=uvicorn,pid=150221,uid=1001
[Sat May 30 10:43:31 2026] Out of memory: Killed process 150221 (uvicorn)
total-vm:137967328kB, anon-rss:106210652kB,
file-rss:57456kB, shmem-rss:9792kB
[Sat May 30 10:43:34 2026] oom_reaper: reaped process 150221 (uvicorn),
now anon-rss:48kB, file-rss:108kB, shmem-rss:52kB
That's 101 GB of anonymous RSS in one uvicorn process. CONSTRAINT_NONE + global_oom confirm this was global system OOM, not a cgroup limit. systemd Restart=on-failure brought the service back up at 10:43:55 with fresh RSS of 3 GB — confirming the leak was real and the service is otherwise healthy.
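The 101 GB figure comes from converting the kernel's kB fields, which are 1024-byte units, into GiB. A small helper (the function name is hypothetical) that does the conversion for every memory field in an OOM record:

```python
import re
from typing import Dict

KIB_PER_GIB = 1024 * 1024  # the kernel reports these fields in units of 1024 bytes


def oom_fields_gib(record: str) -> Dict[str, float]:
    """Extract total-vm and the *-rss fields from an OOM-kill record, in GiB."""
    fields = re.findall(r"(total-vm|anon-rss|file-rss|shmem-rss):(\d+)kB", record)
    return {name: int(kb) / KIB_PER_GIB for name, kb in fields}


record = (
    "total-vm:137967328kB, anon-rss:106210652kB,\n"
    "file-rss:57456kB, shmem-rss:9792kB"
)
sizes = oom_fields_gib(record)
print(round(sizes["anon-rss"], 1))  # 101.3
```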
Kernel stack of the blocked main process while leak was in progress (before OOM):
State: D (disk sleep)
VmRSS: 81853104 kB
[<0>] folio_wait_bit_common+0x11d/0x2f0
[<0>] __folio_lock_or_retry+0x34b/0x570
[<0>] do_swap_page+0x662/0x1010
[<0>] handle_pte_fault+0x1b9/0x1f0
[<0>] handle_mm_fault+0xe7/0x2f0
The process was thrashing: it kept faulting its own pages back in from swap and made no forward progress, which is why it could no longer dispatch work.
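The same two fields (State and VmRSS) can be checked programmatically from /proc/<pid>/status. The following is a sketch of a watchdog-style heuristic; the threshold parameter and both function names are hypothetical, and a D state on its own is not proof of swap thrashing.

```python
def parse_proc_status(text: str) -> dict:
    """Parse the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            out[key.strip()] = value.strip()
    return out


def looks_like_swap_stall(status: dict, rss_limit_kb: int) -> bool:
    """Heuristic: process is in uninterruptible sleep while holding a very large RSS."""
    state = status.get("State", "")
    rss_kb = int(status.get("VmRSS", "0 kB").split()[0])
    return state.startswith("D") and rss_kb >= rss_limit_kb
```

In practice the text would come from reading /proc/<pid>/status for the uvicorn PID, sampled a few times, since a single D-state reading can be a momentary I/O wait.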
Root cause analysis
main_stt.py:820 _optimal_cold_workers() docstring:
"Capped by COLD_POOL_SIZE (safety) and available VRAM."
The implementation:
def _optimal_cold_workers() -> int:
    if _hot_ema_sps is None or _work_queue_audio_seconds <= 0:
        return 0
    cold_start = _get_cold_start_time_stt()
    ...
    total_work_s = _work_queue_audio_seconds * _hot_ema_sps
    limit = 2.0 * total_work_s / cold_start
    N_total = 1
    while N_total * (N_total - 1) < limit:
        N_total += 1
    N_total -= 1
    cold = N_total - 1
    if COLD_POOL_SIZE > 0:
        cold = min(cold, COLD_POOL_SIZE)
    return max(0, cold)
Variables consulted: _hot_ema_sps, _work_queue_audio_seconds, _get_cold_start_time_stt(), COLD_POOL_SIZE, and (via _vram_per_cold_worker()) _cold_vram_ema_gb.
Not consulted: any host-RAM signal (psutil.virtual_memory().available, MemAvailable from /proc/meminfo, swap pressure, current process RSS).
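To make the sizing rule concrete, here is a standalone restatement of the visible part of the loop with its inputs passed as arguments. It omits the elided section of the original, including the VRAM cap, so it is a sketch of the queue-driven part only. The example values are assumptions: the 4.0 s cold-start time is hypothetical, and hot_ema_sps is derived by assuming the logged "31.0s drain" equals queue seconds times hot_ema_sps.

```python
def optimal_cold_workers(queue_audio_s, hot_ema_sps, cold_start_s, cold_pool_size):
    """Queue-driven worker target; mirrors the visible loop, without the VRAM cap."""
    if hot_ema_sps is None or queue_audio_s <= 0:
        return 0
    total_work_s = queue_audio_s * hot_ema_sps
    limit = 2.0 * total_work_s / cold_start_s
    # Find the largest N_total with N_total * (N_total - 1) < limit.
    n_total = 1
    while n_total * (n_total - 1) < limit:
        n_total += 1
    n_total -= 1
    cold = n_total - 1  # one of the N_total lanes is the hot worker
    if cold_pool_size > 0:
        cold = min(cold, cold_pool_size)
    return max(0, cold)


# Assumed inputs: 510.4 s queued, 31.0 s drain, 4.0 s cold start, COLD_POOL_SIZE=10.
target = optimal_cold_workers(510.4, 31.0 / 510.4, 4.0, 10)
print(target)  # 3
```

With those assumed inputs the target matches the target=3 seen in the log. The point of the exercise is that no argument carries any host-RAM information, so the result cannot react to memory pressure.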
The implicit assumption — that host-RAM usage scales monotonically with VRAM usage, so VRAM gating subsumes RAM gating — holds for short clips (the benchmarked regime) but fails for long-form audio because per-worker RAM footprint includes audio bodies and decoder context that don't live in VRAM:
- Audio buffer per worker for a 5-min wav at 16 kHz mono float32: ~19 MB
- Multipart upload body retained by uvicorn until handler reads it: ~5 MB per in-flight request
- transcribe failed → re-queue to hot lane duplicates the body each retry; with workers cycling idle-timeout-die-respawn, retries pile up
- Whisper decoder state for long sequences holds intermediate tensors in CPU memory beyond what the EMA captures (EMA only sees VRAM drop after the worker exits)
With 10 cold workers spawned + manager + cumulative re-queue retries on the ~317 wavs >5 min mixed into the workload, RSS scaled to 101 GB.
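The per-request payload figures above follow from simple arithmetic. The sketch below reproduces the ~19 MB decoded-buffer number and adds a payload-only lower bound per request. The retained_body_copies parameter is a hypothetical stand-in for however many copies of the upload are alive at once, and the bound deliberately excludes model weights, decoder state, and allocator overhead.

```python
def decoded_audio_bytes(duration_s, sample_rate_hz=16_000, bytes_per_sample=4, channels=1):
    """Size of a decoded float32 sample buffer for a clip of the given duration."""
    return int(duration_s * sample_rate_hz * bytes_per_sample * channels)


def payload_floor_bytes(duration_s, upload_bytes, retained_body_copies):
    """Payload-only lower bound: decoded samples plus retained upload copies."""
    return decoded_audio_bytes(duration_s) + upload_bytes * retained_body_copies


five_min = decoded_audio_bytes(300)
print(five_min)  # 19200000 bytes, i.e. ~19 MB
```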
Suggested fix
Gate _optimal_cold_workers() on host RAM in addition to VRAM. Two complementary checks:
- Hard floor on available memory: cap N such that (N+1) * EMA(RAM_per_worker) ≤ psutil.virtual_memory().available - MIN_RAM_HEADROOM_GB. Symmetric to the existing _cold_vram_ema_gb * SAFETY_FACTOR logic.
- Worst-case audio scaling: pass max(audio_length_s in _work_queue) (not just sum) into the projection. The current total_work_s aggregate doesn't capture that one 5-minute clip ties up a worker for ~5× longer than mean.
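A minimal sketch of the first check, assuming the worker count requested by the existing logic is available as an integer and that a per-worker RAM EMA exists (it does not today; ram_per_worker_gb is hypothetical). To stay dependency-free it reads MemAvailable from /proc/meminfo text rather than calling psutil. Either source works, but the units must be made consistent first: psutil reports bytes and /proc/meminfo reports kB.

```python
import math


def read_mem_available_gb(meminfo_text: str) -> float:
    """Return MemAvailable from /proc/meminfo contents, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def ram_capped_cold_workers(requested, available_gb, ram_per_worker_gb, min_headroom_gb):
    """Largest N <= requested with (N + 1) * ram_per_worker_gb <= available - headroom."""
    if ram_per_worker_gb <= 0:
        raise ValueError("ram_per_worker_gb must be positive")
    budget_gb = available_gb - min_headroom_gb
    if budget_gb <= 0:
        return 0  # already inside the headroom: spawn nothing
    fit = math.floor(budget_gb / ram_per_worker_gb) - 1  # the +1 reserves one slot
    return max(0, min(requested, fit))
```

For example, with 40 GiB available, 8 GiB of headroom, and 8 GiB per worker, a request for 10 workers is cut to 3, since (3 + 1) * 8 = 32 fits the 32 GiB budget exactly. The result would be applied as one more min() alongside the existing COLD_POOL_SIZE cap.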
Plus a defensive fix on re-queue to hot lane: drop the body bytes from the in-memory request object after first re-queue attempt, or impose a per-request retry cap, so a worker losing its connection mid-transcribe doesn't multiply RAM pressure.
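One way the retry-cap variant could look, as a sketch only: the PendingRequest type, the MAX_REQUEUES value, and the list-based hot lane are all hypothetical and are not taken from main_stt.py. The request is re-queued by reference so the body is never copied, and the body is released once the cap is hit.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_REQUEUES = 2  # hypothetical cap; tune against real failure rates


@dataclass
class PendingRequest:
    request_id: str
    body: Optional[bytes]
    requeues: int = 0


def requeue_or_fail(req: PendingRequest, hot_lane: List[PendingRequest]) -> bool:
    """Re-queue a failed request at most MAX_REQUEUES times.

    Returns True if the request was re-queued. Returns False once the cap is
    reached, in which case the body reference is dropped so it can be freed.
    """
    if req.requeues >= MAX_REQUEUES:
        req.body = None
        return False
    req.requeues += 1
    hot_lane.append(req)  # same object, so the body bytes are not duplicated
    return True
```

When the function returns False, the caller would answer the client with an error instead of holding the upload in memory indefinitely.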
Optionally: a metrics counter uttera_stt_oom_predictor_skipped_total, incremented whenever _optimal_cold_workers() would have spawned a worker but the RAM gate vetoed it. This gives ops visibility before the host gets close to the OOM boundary.
Workaround until fixed
Operators with long-form corpora should set COLD_POOL_SIZE empirically below the value _optimal_cold_workers() would pick. For our corpus (~17% clips >60 s, ~2% clips >5 min) on a 128 GB host with sentiment+comfyui co-tenants, even COLD_POOL_SIZE=3 may overcommit RAM under sustained load. The safer path is to switch the wrapper to uttera-stt-vllm (single-process AsyncLLM, no per-worker model duplication) for the bulk backlog and keep hotcold for live MixMonitor sidecar traffic (low concurrency, short calls usually).
Bench coverage gap
results/2026-04-17-run1-hotcold-librispeech/ and the sustained profile at 0.5 × burst@64 rps validate behaviour for the LibriSpeech-shaped distribution. They do not exercise the corner case of long-form audio under sustained concurrent load on a shared GPU.
Proposal: add a fourth STT corpus to PROTOCOL.md, uttera-stt-longform, with a clip-length distribution skewed toward 60-300 s (e.g. resampled call-recording WAVs, public-domain podcast cuts). Also add a new sustained-overload profile (mentioned in run1 notes.md "Open questions") that exercises continuous load while HOT is saturated, with the RAM gauge captured at minutes 0, 1, 2, 3, 4, and 5.
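The minute-by-minute RAM capture could be a small sampler that runs alongside the load generator. This is a sketch under assumptions: the function name and its injectable sleep and clock parameters are hypothetical, and the caller supplies the reader (for example, something that returns MemAvailable or the service RSS in GB).

```python
import time


def sample_ram_gauge(read_gb, minutes=(0, 1, 2, 3, 4, 5),
                     sleep=time.sleep, clock=time.monotonic):
    """Record (minute, reading) pairs at each requested minute offset from start."""
    start = clock()
    samples = []
    for minute in minutes:
        delay = start + minute * 60 - clock()
        if delay > 0:
            sleep(delay)
        samples.append((minute, read_gb()))
    return samples
```

Scheduling against a monotonic clock rather than sleeping a fixed 60 s keeps the samples aligned to the minute marks even when a read is slow, which is likely on a host that is already swapping.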