fix(comfyui/provisioning): mkdir parent dir before lockfile mutex (avoid ENOENT deadlock) #197
Open
evolv3ai wants to merge 1 commit into
download_hf_file() acquires a per-file mutex via mkdir "${output_path}.lock".
The parent directory of $output_path is only created inside the post-download
success branch, so if the parent does not yet exist when the lockfile mkdir
runs, it fails with ENOENT and the while-loop misclassifies the error as
"lock held by another process", causing an infinite wait.
When the Vast.ai platform launches the comfy image, the image init pre-creates
/workspace/ComfyUI/models/{vae,diffusion_models,text_encoders,...} which masks
the bug. Running the script from a non-Vast invocation path (dstack apply
against the vastai backend) hits the deadlock for 25 min until run timeout.
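
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the code from the provisioning scripts: the temporary paths and variable names are hypothetical, and the retry count is capped at three so the demonstration terminates instead of waiting forever.

```shell
#!/usr/bin/env bash
# Sketch of the failure mode. All paths and names here are hypothetical.
workdir="$(mktemp -d)"
output_path="$workdir/models/vae/example.safetensors"   # parent does not exist yet
lockdir="${output_path}.lock"

attempts=0
acquired=no
# The loop treats every mkdir failure as "lock held"; it is bounded to
# 3 tries here so the demonstration ends.
while [ "$attempts" -lt 3 ]; do
    if mkdir "$lockdir" 2>/dev/null; then
        acquired=yes
        break
    fi
    attempts=$((attempts + 1))
    echo "Another process is downloading... (attempt $attempts)"
done

# No lock directory exists, so no other process can be holding the lock:
# the failure came from the missing parent, not from contention.
if [ "$acquired" = no ] && [ ! -d "$lockdir" ]; then
    echo "mkdir failed but no lock exists; parent missing: $(dirname "$output_path")"
fi
rm -rf "$workdir"
```

Because mkdir without -p never creates intermediate directories, the lock can never be acquired here, no matter how long the loop waits.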
Insert mkdir -p "$(dirname "$output_path")" before the lockfile loop.
Idempotent. Applied uniformly to the nine scripts sharing this pattern;
the four scripts using the newer flock-based mutex are unchanged.
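
The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the patched function itself: download_with_lock is a hypothetical stand-in for download_hf_file(), the download is replaced by writing an empty placeholder file, and the bounded wait is an addition made only for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for download_hf_file(); the real download step is
# replaced by writing a placeholder file.
download_with_lock() {
    local output_path="$1"
    local lockdir="${output_path}.lock"

    # Create the parent first so the lock mkdir can only fail because the
    # lock directory already exists. -p makes this a no-op when it is present.
    mkdir -p "$(dirname "$output_path")"

    local tries=0
    while ! mkdir "$lockdir" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge 5 ]; then    # bounded wait is an addition for this sketch
            echo "gave up waiting for $lockdir" >&2
            return 1
        fi
        sleep 1
    done

    : > "$output_path"      # placeholder for the actual download
    rmdir "$lockdir"        # release the lock
}

workdir="$(mktemp -d)"
download_with_lock "$workdir/models/vae/example.safetensors" && echo "lock acquired and released"
```

With the parent directory guaranteed to exist, a failing lock mkdir can reasonably be read as contention, which is what the loop already assumes.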
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Problem

The download_hf_file() helper in nine ComfyUI provisioning scripts uses mkdir "${output_path}.lock" as a per-file mutex.

The parent directory of $output_path (e.g. /workspace/ComfyUI/models/vae/) is only created inside the post-download success branch. So if the parent directory doesn't exist when the lockfile mkdir runs, that mkdir fails with ENOENT (not EEXIST). The while-loop swallows the error code and treats every failure as "lock held by another process", printing the misleading "Another process is downloading…" message forever.

When the Vast.ai platform runs these scripts, the comfyui image's normal init pre-creates /workspace/ComfyUI/models/{vae,diffusion_models,text_encoders,…} first, so the lockfile mkdir succeeds on the first try and the bug is masked. We hit it after invoking the provisioning script directly from a non-Vast environment (a dstack apply run against a vastai backend): the image's pre-init was bypassed, and all three parallel downloads deadlocked in the lockfile loop for 25 minutes until the run timed out. Concrete reproduction logs are available in the comment below if useful.

Fix

Insert a single mkdir -p "$(dirname "$output_path")" immediately before the lockfile-acquisition loop in download_hf_file(). It is idempotent (-p) and carries a three-line comment explaining why, so a future reader doesn't undo it.

Applied uniformly to all nine scripts that share the same broken pattern:

- flux.2-dev.sh
- hidream-i1-full.sh
- juggernaut-xi.sh
- mochi-1-preview.sh
- qwen-image.sh
- realvisxl-v5.0.sh
- sdxl.sh
- wan2.2-i2v.sh
- wan2.2-t2v.sh

(ltx-2.sh and the serverless scripts in this dir use a different flock-based lockfile pattern and already include the parent-dir mkdir -p, so no change is needed there.)

The diff is 5 inserted lines per script × 9 scripts = 45 insertions, 0 deletions, and no behavioral change to any other code path. The post-success mkdir -p is left in place as a harmless idempotent safety net (deliberately not consolidated, to minimize review surface).

Verification

Patched flux.2-dev.sh was validated by re-running the same dstack apply that hit the deadlock: the second run progressed past the lockfile-acquire step cleanly and the three parallel hf download calls started downloading actual bytes. (Happy to post the green-path log excerpt in a follow-up comment if helpful.)

The other eight scripts share the identical download_hf_file() shape (verified by diff); the same patch resolves the same bug.

Co-Authored-By: Claude Opus 4.7 noreply@anthropic.com
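
For reviewers who want to repeat the cross-script comparison, one possible way to check that a function body is identical across scripts is sketched below. It is not necessarily how the comparison in this PR was done. The extract_fn helper is hypothetical, and it assumes the function definition starts at column 0 and ends at the first line beginning with "}". Two throwaway scripts stand in for the real provisioning scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the body of download_hf_file() from a script,
# assuming the definition starts at column 0 and ends with a lone "}".
extract_fn() {
    sed -n '/^download_hf_file()/,/^}/p' "$1"
}

# Two throwaway scripts stand in for the real provisioning scripts.
workdir="$(mktemp -d)"
for name in a.sh b.sh; do
    cat > "$workdir/$name" <<'EOF'
#!/bin/bash
download_hf_file() {
    mkdir -p "$(dirname "$output_path")"
    echo "download"
}
echo "script-specific code"
EOF
done
echo "echo extra" >> "$workdir/b.sh"   # differs only outside the function

reference="$(extract_fn "$workdir/a.sh")"
mismatches=0
for script in "$workdir"/*.sh; do
    if [ "$(extract_fn "$script")" != "$reference" ]; then
        echo "differs: $script"
        mismatches=$((mismatches + 1))
    fi
done
echo "mismatching scripts: $mismatches"
```

Comparing only the extracted function keeps script-specific model lists and URLs out of the comparison.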