fix(comfyui/provisioning): mkdir parent dir before lockfile mutex (avoid ENOENT deadlock)#197

Open
evolv3ai wants to merge 1 commit into
vast-ai:mainfrom
evolv3ai:fix/comfyui-provisioning-lockfile-race

Problem

The download_hf_file() helper in nine ComfyUI provisioning scripts uses mkdir "${output_path}.lock" as a per-file mutex:

local lockfile="${output_path}.lock"
...
# Acquire lock for this specific file
while ! mkdir "$lockfile" 2>/dev/null; do
  echo "Another process is downloading to $output_path (waiting...)"
  sleep 1
done

The parent directory of $output_path (e.g. /workspace/ComfyUI/models/vae/) is only created inside the post-download success branch:

# Success - move file and clean up
mkdir -p "$(dirname "$output_path")"
mv "$temp_dir/$file_path" "$output_path"

So if the parent directory doesn't exist when the lockfile mkdir runs, that mkdir fails with ENOENT (not EEXIST). The while-loop swallows the error code and treats every failure as "lock held by another process" — printing the misleading "Another process is downloading…" message forever.
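The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not code from the provisioning scripts: the temporary path, the file name example.safetensors, and the retry cap of three attempts are all assumptions made for illustration (the real loop has no cap and sleeps between attempts).

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; paths and the retry cap are illustrative only.
set -u

workdir="$(mktemp -d)"
# The parent directory models/vae/ is deliberately NOT created.
output_path="$workdir/models/vae/example.safetensors"
lockfile="${output_path}.lock"

attempts=0
max_attempts=3   # bounded here so the demo terminates; the original loop is unbounded
acquired=no
while [ "$attempts" -lt "$max_attempts" ]; do
  # Same acquisition call as in download_hf_file(): stderr is discarded,
  # so a missing parent looks exactly like a held lock.
  if mkdir "$lockfile" 2>/dev/null; then
    acquired=yes
    break
  fi
  attempts=$((attempts + 1))
done

# No lock directory exists, so no other process holds the lock,
# yet every acquisition attempt failed.
if [ -d "$lockfile" ]; then lock_exists=yes; else lock_exists=no; fi
echo "acquired=$acquired attempts=$attempts lock_exists=$lock_exists"

rm -rf "$workdir"
```

Because the lock directory never exists, waiting cannot help: nothing will ever release a lock that nobody holds.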

When the Vast.ai platform runs these scripts, the comfyui image's normal init pre-creates /workspace/ComfyUI/models/{vae,diffusion_models,text_encoders,…} first, so the lockfile mkdir succeeds on the first try and the bug is masked. We hit it after invoking the provisioning script directly from a non-Vast environment (a dstack apply run against a vastai backend) — the image's pre-init was bypassed, and all three parallel downloads deadlocked in the lockfile loop for 25 minutes until the run timed out. Concrete reproduction logs available in the comment below if useful.

Fix

Insert a single mkdir -p "$(dirname "$output_path")" immediately before the lockfile-acquisition loop in download_hf_file(). The call is idempotent (-p) and carries a three-line comment explaining why it is there, so a future reader doesn't remove it.
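As a rough sketch of the patched acquisition step: the wrapper function acquire_lock, the temporary directory, and the file name are hypothetical and exist only to make the example self-contained. In the actual scripts the logic sits inline in download_hf_file().

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the patched lock acquisition; names are illustrative.
set -u

acquire_lock() {
  local output_path="$1"
  local lockfile="${output_path}.lock"

  # Create the parent directory first, so that a failing mkdir below
  # can only mean "lock already held" and not "parent directory missing".
  mkdir -p "$(dirname "$output_path")"

  # Acquire lock for this specific file
  while ! mkdir "$lockfile" 2>/dev/null; do
    echo "Another process is downloading to $output_path (waiting...)"
    sleep 1
  done
}

workdir="$(mktemp -d)"
target="$workdir/models/vae/example.safetensors"   # parent does not exist yet

acquire_lock "$target"
if [ -d "${target}.lock" ]; then lock_held=yes; else lock_held=no; fi
echo "lock_held=$lock_held"

rmdir "${target}.lock"   # release the lock
rm -rf "$workdir"
```

With the parent created up front, the first mkdir of the lock directory succeeds even when no prior init step has populated the models tree.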

Applied uniformly to all nine scripts that share the same broken pattern:

  • flux.2-dev.sh
  • hidream-i1-full.sh
  • juggernaut-xi.sh
  • mochi-1-preview.sh
  • qwen-image.sh
  • realvisxl-v5.0.sh
  • sdxl.sh
  • wan2.2-i2v.sh
  • wan2.2-t2v.sh

(ltx-2.sh and the serverless scripts in this dir use a different flock-based lockfile pattern and already include the parent-dir mkdir -p — no change needed there.)

The diff is 5 inserted lines per script × 9 scripts = 45 insertions and 0 deletions, with no behavioral change to any other code path. The post-success mkdir -p is left in place as a harmless idempotent safety net (deliberately not consolidated, to minimize review surface).

Verification

The patched flux.2-dev.sh was validated by re-running the same dstack apply that hit the deadlock: the second run progressed past the lockfile-acquire step cleanly, and the three parallel hf download calls started downloading actual bytes. (Happy to post the green-path log excerpt in a follow-up comment if helpful.)

The other eight scripts share the identical download_hf_file() shape (verified by diff); the same patch resolves the same bug.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

download_hf_file() acquires a per-file mutex via mkdir "${output_path}.lock".
The parent directory of $output_path is only created inside the post-download
success branch, so if the parent does not yet exist when the lockfile mkdir
runs, it fails with ENOENT and the while-loop misclassifies the error as
"lock held by another process", causing an infinite wait.

When the Vast.ai platform launches the comfy image, the image init pre-creates
/workspace/ComfyUI/models/{vae,diffusion_models,text_encoders,...} which masks
the bug. Running the script from a non-Vast invocation path (dstack apply
against the vastai backend) hits the deadlock for 25 min until run timeout.

Insert mkdir -p "$(dirname "$output_path")" before the lockfile loop.
Idempotent. Applied uniformly to the nine scripts sharing this pattern;
the four scripts using the newer flock-based mutex are unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>