OOM kill under sustained concurrency with long-form audio: _optimal_cold_workers has no RAM gating #2

@fakehec

Summary

_optimal_cold_workers() predicts how many cold subprocess workers can be spawned safely, but gates only on VRAM (_cold_vram_ema_gb * 1.2). It does not consider host RAM pressure. With a corpus of long-form audio (>60s clips) under sustained concurrent load, the worker pool spawns enough subprocesses that combined RSS — driver buffers, audio bodies in flight, re-queued payloads after transcribe failed (Connection lost), multipart uploads buffered in uvicorn — exceeds host RAM. The result is a global OOM kill of the uvicorn process by the Linux kernel.

This is reproducible and was triggered today (2026-05-30) on sphinx while processing the gaia.riosa.com asterisk monitor backlog.

Reproduction context

Server: sphinx (RTX 5090 / 32 GB VRAM / 128 GB DDR5 / Ubuntu 26.10)
Service: uttera-stt-hotcold.service (commit at time of incident: head of master running with COLD_POOL_SIZE=10, COLD_WORKER_IDLE_TIMEOUT=60, WHISPER_MODEL=turbo, WHISPER_FP16=1)
Co-tenants on GPU: uttera-sentiment-vllm (~16 GB VRAM), comfyui (~0.5 GB VRAM). VRAM free at start ~13 GB.

Client: gaia.riosa.com bulk reprocessing of asterisk MixMonitor recordings via speech-recog-asterisk-wrapper_bulk_dir with WORKERS=10 (xargs -P 10). 10 concurrent HTTP POST /v1/audio/transcriptions sustained.

Corpus (~16700 wavs of real phone calls):

Bucket                     Count   %
< 200 KB (~1-12 s)          4468   27%
200 KB – 1 MB (12-60 s)     9145   55%
1 – 5 MB (60-300 s)         2764   17%
> 5 MB (> 5 min)             317   2%

(The published librispeech-test-clean benchmark uses clips of 4-20 s, mean 7.4 s — i.e. nothing above the small bucket. The long-form distribution above is what the bench does not exercise.)

Symptoms observed before the OOM kill

journalctl -u uttera-stt-hotcold repeated this cycle indefinitely:

--- POOL MGR: target=3 cold workers (active=2, loading=0) | queue=510.4s audio (31.0s drain) → spawning ---
--- POOL MGR: pool worker ready, total_active=3, idle_timeout=140s ---
--- POOL WORKER: idle timeout (140s), exiting ---
--- POOL WORKER: transcribe failed (Connection lost), re-queuing to hot lane ---

Despite queue=510.4s of audio pending, cold workers were idle-timing-out without consuming the queue (manager process was in D state — uninterruptible disk sleep — waiting on swap-in for its own pages, so it couldn't dispatch). Every "Connection lost" line re-queued the original request body to the hot lane.

ss -tn at gaia side showed 20+ ESTAB connections to :9005 each with Recv-Q of 95-200 KB pending on the server side — the kernel buffered the multipart uploads but the userspace never read them. Curl timed out at 180 s with rc=28 for all in-flight requests.

GPU was at 0% utilization throughout the incident. This was purely a host-RAM problem.

OOM kill record (dmesg -T)

[Sat May 30 10:43:31 2026] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
                            cpuset=user.slice,mems_allowed=0,global_oom,
                            task_memcg=/system.slice/uttera-stt-hotcold.service,
                            task=uvicorn,pid=150221,uid=1001
[Sat May 30 10:43:31 2026] Out of memory: Killed process 150221 (uvicorn)
                            total-vm:137967328kB, anon-rss:106210652kB,
                            file-rss:57456kB, shmem-rss:9792kB
[Sat May 30 10:43:34 2026] oom_reaper: reaped process 150221 (uvicorn),
                            now anon-rss:48kB, file-rss:108kB, shmem-rss:52kB

That's about 101 GiB (106,210,652 kB) of anonymous RSS in one uvicorn process. CONSTRAINT_NONE and global_oom confirm this was a global system OOM, not a cgroup limit. systemd Restart=on-failure brought the service back up at 10:43:55 with a fresh RSS of 3 GB, which confirms the leak was real and the service is otherwise healthy.
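For anyone triaging a similar record, the unit conversion can be sketched as follows. This is a minimal illustration, not part of the service: the helper name rss_fields_gib is hypothetical, and the sample line is the dmesg record above joined onto one line. The kernel's "kB" figures are multiples of 1024 bytes, so they are divided by 1024² to get GiB.

```python
import re

# The "Out of memory" record from dmesg above, joined onto one line.
OOM_LINE = (
    "Out of memory: Killed process 150221 (uvicorn) "
    "total-vm:137967328kB, anon-rss:106210652kB, "
    "file-rss:57456kB, shmem-rss:9792kB"
)


def rss_fields_gib(line: str) -> dict[str, float]:
    """Return every '<name>:<n>kB' field of an OOM record, converted to GiB.

    Hypothetical helper for illustration only.
    """
    fields = re.findall(r"([a-z-]+):(\d+)kB", line)
    return {name: int(kb) / (1024 ** 2) for name, kb in fields}


if __name__ == "__main__":
    for name, gib in rss_fields_gib(OOM_LINE).items():
        print(f"{name}: {gib:.2f} GiB")
```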

Kernel stack of the blocked main process while leak was in progress (before OOM):

State: D (disk sleep)
VmRSS: 81853104 kB
[<0>] folio_wait_bit_common+0x11d/0x2f0
[<0>] __folio_lock_or_retry+0x34b/0x570
[<0>] do_swap_page+0x662/0x1010
[<0>] handle_pte_fault+0x1b9/0x1f0
[<0>] handle_mm_fault+0xe7/0x2f0

The process was paging its own RSS in and out of swap (thrashing), which effectively stalled it.

Root cause analysis

main_stt.py:820 _optimal_cold_workers() docstring:

"Capped by COLD_POOL_SIZE (safety) and available VRAM."

The implementation:

def _optimal_cold_workers() -> int:
    if _hot_ema_sps is None or _work_queue_audio_seconds <= 0:
        return 0
    cold_start = _get_cold_start_time_stt()
    ...
    total_work_s = _work_queue_audio_seconds * _hot_ema_sps
    limit = 2.0 * total_work_s / cold_start
    N_total = 1
    while N_total * (N_total - 1) < limit:
        N_total += 1
    N_total -= 1
    cold = N_total - 1
    if COLD_POOL_SIZE > 0:
        cold = min(cold, COLD_POOL_SIZE)
    return max(0, cold)

Variables consulted: _hot_ema_sps, _work_queue_audio_seconds, _get_cold_start_time_stt(), COLD_POOL_SIZE, and (via _vram_per_cold_worker()) _cold_vram_ema_gb.

Not consulted: any host-RAM signal (psutil.virtual_memory().available, MemAvailable from /proc/meminfo, swap pressure, current process RSS).
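To make the sizing rule easier to reason about, here is a sketch of the visible part of the loop as a pure function. It covers only the lines quoted above (the elided "..." section is not reproduced), takes the module globals as explicit parameters, and uses a hypothetical function name. The example inputs (31.0 s of projected work, a 4.0 s cold start) are illustrative assumptions, not measured values.

```python
def cold_workers_from_queue(total_work_s: float,
                            cold_start_s: float,
                            cold_pool_size: int = 0) -> int:
    """Mirror of the quoted sizing loop, with globals passed in explicitly.

    Finds the largest N_total with N_total * (N_total - 1) < limit, reserves
    one slot for the hot worker, then applies the COLD_POOL_SIZE cap.
    No RAM or VRAM signal is involved in this part of the calculation.
    """
    limit = 2.0 * total_work_s / cold_start_s
    n_total = 1
    while n_total * (n_total - 1) < limit:
        n_total += 1
    n_total -= 1
    cold = n_total - 1
    if cold_pool_size > 0:
        cold = min(cold, cold_pool_size)
    return max(0, cold)


if __name__ == "__main__":
    # Assumed inputs: 31.0 s of work, 4.0 s cold start, COLD_POOL_SIZE=10.
    print(cold_workers_from_queue(31.0, 4.0, 10))
    # A much deeper queue is limited only by the pool-size cap.
    print(cold_workers_from_queue(1000.0, 4.0, 10))
```

The second call shows the concern: once the queue is deep enough, the only brake in this code path is COLD_POOL_SIZE.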

The implicit assumption — that host-RAM usage scales monotonically with VRAM usage, so VRAM gating subsumes RAM gating — holds for short clips (the benchmarked regime) but fails for long-form audio because per-worker RAM footprint includes audio bodies and decoder context that don't live in VRAM:

  • Audio buffer per worker for a 5-min wav at 16 kHz mono float32: ~19 MB
  • Multipart upload body retained by uvicorn until handler reads it: ~5 MB per in-flight request
  • transcribe failed → re-queue to hot lane duplicates the body each retry; with workers cycling idle-timeout-die-respawn, retries pile up
  • Whisper decoder state for long sequences holds intermediate tensors in CPU memory beyond what the EMA captures (EMA only sees VRAM drop after the worker exits)
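The ~19 MB figure in the first bullet follows directly from duration × sample rate × sample width. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def pcm_buffer_bytes(duration_s: float,
                     sample_rate_hz: int = 16_000,
                     channels: int = 1,
                     bytes_per_sample: int = 4) -> int:
    """Size of a decoded PCM buffer; the defaults are 16 kHz mono float32."""
    return int(duration_s * sample_rate_hz * channels * bytes_per_sample)


if __name__ == "__main__":
    five_min = pcm_buffer_bytes(300)
    print(f"5-min clip: {five_min / 1e6:.1f} MB")  # 300 * 16000 * 4 bytes
```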

With 10 cold workers spawned + manager + cumulative re-queue retries on the ~317 wavs >5 min mixed into the workload, RSS scaled to 101 GB.

Suggested fix

Gate _optimal_cold_workers() on host RAM in addition to VRAM. Two complementary checks:

  1. Hard floor on available memory: cap N such that (N+1) * EMA(RAM_per_worker) ≤ psutil.virtual_memory().available - MIN_RAM_HEADROOM_GB. Symmetric to the existing _cold_vram_ema_gb * SAFETY_FACTOR logic.
  2. Worst-case audio scaling: pass max(audio_length_s in _work_queue) (not just sum) into the projection. The current total_work_s aggregate doesn't capture that one 5-minute clip ties up a worker for ~5× longer than mean.
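A minimal sketch of the first check, under stated assumptions: the function names, the per-worker RAM EMA, and the headroom parameter are hypothetical and do not exist in main_stt.py today. It reads MemAvailable from /proc/meminfo-style text rather than psutil so the example has no third-party dependency; psutil.virtual_memory().available would serve the same purpose.

```python
import math


def mem_available_gb(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 ** 2)
    raise ValueError("MemAvailable not found")


def ram_gated_workers(requested: int,
                      ram_per_worker_ema_gb: float,
                      available_gb: float,
                      min_headroom_gb: float) -> int:
    """Cap `requested` so that (N + 1) * per-worker RAM fits in the budget.

    The +1 accounts for the manager/hot process. With no RAM measurement yet
    (EMA <= 0) the request is passed through unchanged; that is a design
    choice of this sketch, and a conservative default may be preferable.
    """
    if ram_per_worker_ema_gb <= 0:
        return max(0, requested)
    budget_gb = available_gb - min_headroom_gb
    max_n = math.floor(budget_gb / ram_per_worker_ema_gb) - 1
    return max(0, min(requested, max_n))


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:  # Linux only
        available = mem_available_gb(fh.read())
    # Assumed values: 10 requested workers, 10 GB per worker, 16 GB headroom.
    print(ram_gated_workers(10, 10.0, available, 16.0))
```

In the real function this would presumably be applied after the COLD_POOL_SIZE cap, so the RAM gate can only lower the result.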

Plus a defensive fix on re-queue to hot lane: drop the body bytes from the in-memory request object after first re-queue attempt, or impose a per-request retry cap, so a worker losing its connection mid-transcribe doesn't multiply RAM pressure.
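The retry-cap variant could look roughly like this. Everything here is hypothetical (PendingRequest, requeue_to_hot_lane, MAX_REQUEUES); the actual request object and hot-lane queue in the service may be shaped differently. The point is that a re-queue passes the same object along instead of copying the body, and that the body reference is released once the cap is hit.

```python
from dataclasses import dataclass
from typing import Optional

MAX_REQUEUES = 1  # hypothetical cap; tune per deployment


@dataclass
class PendingRequest:
    request_id: str
    body: Optional[bytes]  # raw upload bytes; None once released
    requeues: int = 0


def requeue_to_hot_lane(req: PendingRequest,
                        hot_queue: list,
                        max_requeues: int = MAX_REQUEUES) -> bool:
    """Re-queue a failed request at most `max_requeues` times.

    Returns False when the cap is reached; the caller is then expected to
    fail the request back to the client. The body reference is dropped at
    that point so the bytes can be garbage-collected.
    """
    if req.requeues >= max_requeues:
        req.body = None
        return False
    req.requeues += 1
    hot_queue.append(req)  # same object, no copy of the body
    return True
```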

Optionally: a metrics counter uttera_stt_oom_predictor_skipped_total incremented when _optimal_cold_workers() would have spawned but RAM gate vetoed — gives ops visibility before the host gets close to the OOM boundary.

Workaround until fixed

Operators with long-form corpora should set COLD_POOL_SIZE empirically below the value _optimal_cold_workers() would pick. For our corpus (~17% clips >60 s, ~2% clips >5 min) on a 128 GB host with sentiment+comfyui co-tenants, even COLD_POOL_SIZE=3 may overcommit RAM under sustained load. The safer path is to switch the wrapper to uttera-stt-vllm (single-process AsyncLLM, no per-worker model duplication) for the bulk backlog and keep hotcold for live MixMonitor sidecar traffic (low concurrency, usually short calls).

Bench coverage gap

results/2026-04-17-run1-hotcold-librispeech/ and the sustained profile at 0.5 × burst@64 rps validate behaviour for the LibriSpeech-shaped distribution. They do not exercise the corner case of long-form audio under sustained concurrent load on a shared GPU.

Proposal: add a fourth STT corpus to PROTOCOL.md, uttera-stt-longform, with a clip-length distribution skewed toward 60-300 s (e.g. resampled call-recording WAVs, public-domain podcast cuts). Also add the sustained-overload profile mentioned in run1 notes.md "Open questions", which exercises continuous load when HOT is saturated, with the RAM gauge captured at minutes 0, 1, 2, 3, 4, and 5.
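The per-minute RAM capture could be as simple as the sketch below. The function names are hypothetical and this is not part of the existing bench harness; the reader and sleep function are injectable so the sampler can be exercised without waiting five minutes.

```python
import time
from typing import Callable, List, Tuple


def read_mem_available_kb(path: str = "/proc/meminfo") -> int:
    """Return MemAvailable in kB as reported by the kernel (Linux only)."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def sample_ram_gauge(read_kb: Callable[[], int] = read_mem_available_kb,
                     minutes: int = 5,
                     sleep: Callable[[float], None] = time.sleep
                     ) -> List[Tuple[int, int]]:
    """Sample the gauge at minute 0, 1, ..., `minutes` (inclusive)."""
    samples = []
    for minute in range(minutes + 1):
        if minute:
            sleep(60)  # wait one minute between samples
        samples.append((minute, read_kb()))
    return samples
```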
