ceac: race-guard duplicate sglang launch when instance still loading by catoneone · Pull Request #493 · AffineFoundation/affine-cortex

catoneone · 2026-05-18T04:35:51Z

Summary

_ensure_sglang_running probed /model_info with a 10 s timeout and treated any non-200 response as 'server dead → launch a new one'. But a 60 GB checkpoint takes ~3-5 min to load its shards into GPU memory, and during that window /model_info does not return 200 yet. The worker then spawned a second sglang.launch_server. Both processes raced to allocate ~136 GB of model into the same 143 GB GPU, the second OOMed mid-load, and the worker spent the next 15 min on the readiness deadline waiting for a launch that never finished.

Production today: job 5HWMR5yt7C failed after 15 min with "GPU 0 has a total capacity of 139.35 GiB of which 182.00 MiB is free", with two sglang.launch_server processes alive on the GPU host.

Add an SSH pgrep -f sglang.launch_server check before the launch step: if a process is already alive, skip the launch and just poll /model_info until it comes up.
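The guard described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the names `launch_process_alive`, `should_launch`, and the injectable `run` callable are hypothetical, and the sketch assumes the worker can reach the GPU host with a plain `ssh <host>` command. It relies on pgrep's documented exit codes (0 when a process matched, 1 when none matched) and treats anything else, such as ssh's 255 on a connection failure, as an error rather than as "no server".

```python
import subprocess
from typing import Callable, Sequence

# A runner executes a command and returns its exit code (injectable for tests).
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=30,
    ).returncode


def launch_process_alive(host: str, run: Runner = _run) -> bool:
    """Return True if pgrep on the host finds an sglang.launch_server process."""
    code = run(["ssh", host, "pgrep -f sglang.launch_server"])
    if code == 0:  # pgrep: at least one process matched
        return True
    if code == 1:  # pgrep: no process matched
        return False
    # Anything else (pgrep error, ssh failure) is ambiguous; do not guess.
    raise RuntimeError(f"pgrep check failed with exit code {code}")


def should_launch(model_info_ok: bool, host: str, run: Runner = _run) -> bool:
    """Launch only if the server is not ready AND no launch is already in progress."""
    if model_info_ok:
        return False
    return not launch_process_alive(host, run)
```

One caveat worth checking in a real deployment: `pgrep -f` matches against full command lines, so depending on the remote shell, the shell that ssh starts to run the command may itself match the pattern. A common workaround is a pattern such as `[s]glang.launch_server`, which still matches the server but not its own literal text.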

Test plan

Anticopy unit tests pass locally (63 passed)
After deploy, restart a worker mid-load and verify it doesn't spawn a duplicate

A 60 GB model takes ~3-5 min to load its shards into the GPU. During that window /model_info does not return 200 yet, so the worker's existing probe in _ensure_sglang_running concluded 'no server' and spawned ANOTHER launch. Both processes then tried to load the same model into the same GPU concurrently, the second OOMed with the allocator at ~136 GB of 143 GB, and the worker burned the 15-min readiness deadline waiting on a launch that never finished. Before launching, SSH-pgrep for sglang.launch_server. If one is already there, skip the launch and just poll /model_info; the loading instance will come up on its own.
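The "skip the launch and just poll" path could look something like the sketch below. The 10 s probe timeout and the 15-minute (900 s) deadline come from the description above; the 5 s poll interval, the function names, and the injectable `clock` and `sleep` parameters are assumptions made for illustration, and the actual worker code may be structured differently.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_model_info(base_url: str, timeout_s: float = 10.0) -> bool:
    """Return True only if GET <base_url>/model_info answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/model_info", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or non-2xx: the server is not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float = 900.0,   # 15-minute readiness deadline
    interval_s: float = 5.0,     # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it succeeds or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

Passing the probe in as a callable keeps the loop testable without a live server, for example `wait_until_ready(lambda: probe_model_info("http://127.0.0.1:30000"))`, where the URL is a placeholder.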

catoneone wants to merge 1 commit into main from ceac/sglang-launch-race-guard
