ceac: race-guard duplicate sglang launch when instance still loading #493

Open
catoneone wants to merge 1 commit into main from ceac/sglang-launch-race-guard

Conversation

@catoneone

Collaborator

Summary

_ensure_sglang_running probed /model_info with a 10 s timeout and treated any non-200 as 'server dead → launch a new one'. But a 60 GB checkpoint takes ~3-5 min to load its shards into GPU memory, and during that window /model_info does not return 200 yet. The worker therefore spawned a second sglang.launch_server. Both processes raced to allocate ~136 GB of model into the same 143 GB GPU, the second OOMed mid-load, and the worker spent the next 15 min on the readiness deadline waiting for a launch that never finished.

Production today: job 5HWMR5yt7C failed after 15 min with "GPU 0 has a total capacity of 139.35 GiB of which 182.00 MiB is free", with two sglang.launch_server processes alive on the GPU host.

Add an SSH pgrep -f sglang.launch_server check before the launch step: if a process is already alive, skip the launch and just poll /model_info until it comes up.
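
For reviewers, the guard described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ensure_server`, `remote_launch_in_progress`, and their parameters are hypothetical names, and the readiness probe and launch step are passed in as callables so the decision logic stays independent of how the worker actually talks to the host. The `[s]glang` bracket in the pattern is a common trick so that `pgrep -f` does not match the remote shell that carries the pattern on its own command line.

```python
import shlex
import subprocess
import time
from typing import Callable


def remote_launch_in_progress(host: str, pattern: str = "[s]glang.launch_server") -> bool:
    """Return True if a process matching `pattern` is alive on `host` (hypothetical helper)."""
    # Quote the pattern so the remote shell does not try to glob the brackets.
    result = subprocess.run(
        ["ssh", host, f"pgrep -f {shlex.quote(pattern)}"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    if result.returncode == 0:  # pgrep: at least one process matched
        return True
    if result.returncode == 1:  # pgrep: no process matched
        return False
    # Anything else (e.g. 255 from ssh itself) means we could not tell.
    raise RuntimeError(f"pgrep over ssh failed: {result.stderr.strip()}")


def ensure_server(
    is_ready: Callable[[], bool],
    launch_in_progress: Callable[[], bool],
    launch: Callable[[], None],
    deadline_s: float = 900.0,  # assumed to mirror the 15-min readiness deadline
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Launch only if no server is ready AND no launch is already loading."""
    if is_ready():
        return "already-ready"
    if launch_in_progress():
        # A previous launch is still loading shards: do not start a second one.
        action = "waited-for-existing"
    else:
        launch()
        action = "launched"
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if is_ready():
            return action
        sleep(poll_s)
    raise TimeoutError("server did not become ready before the deadline")
```

Treating an unknown pgrep/ssh exit code as an error, rather than as "no process", is a deliberate choice in this sketch: guessing "no process" on a flaky SSH connection would reintroduce the duplicate launch.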

Test plan

  • Anticopy unit tests pass locally (63 passed)
  • After deploy, restart a worker mid-load and verify it doesn't spawn a duplicate
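
For the post-deploy check, one possible way to confirm that no duplicate was spawned is to list the matching PIDs on the GPU host after restarting the worker. The helper names below (`parse_pgrep_pids`, `launch_pids`) are hypothetical and the host argument is a placeholder. A single launch may itself show more than one PID if it forks workers with the same command line, so the count is best compared against a known-good baseline rather than assumed to be exactly one.

```python
import shlex
import subprocess
from typing import List


def parse_pgrep_pids(stdout: str) -> List[int]:
    """Turn pgrep's one-PID-per-line output into a list of ints."""
    return [int(token) for token in stdout.split()]


def launch_pids(host: str, pattern: str = "[s]glang.launch_server") -> List[int]:
    """List PIDs on `host` whose command line matches `pattern` (hypothetical helper)."""
    result = subprocess.run(
        ["ssh", host, f"pgrep -f {shlex.quote(pattern)}"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    # pgrep exits 0 on a match and 1 on no match; anything else is a failure.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"pgrep over ssh failed: {result.stderr.strip()}")
    return parse_pgrep_pids(result.stdout)
```

Running `launch_pids` once before the worker restart and once a minute or so after gives two PID lists; an unchanged list suggests the worker reused the loading instance.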

A 60 GB model takes ~3-5 min to load shards into the GPU. During
that window /model_info doesn't 200 yet, so the worker's existing
probe in _ensure_sglang_running concluded 'no server' and spawned
ANOTHER launch — both processes then tried to load the same model
into the same GPU concurrently, the second OOMed with the allocator
at ~136 GB of 143 GB, and the worker burned the 15-min readiness
deadline waiting on a launch that never finished.

Before launching, SSH-pgrep for sglang.launch_server. If one's
already there, skip the launch and just poll /model_info — the
loading instance will come up on its own.