Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117)#13802

Merged: comfyanonymous merged 27 commits into Comfy-Org:master from rattus128:dev/threaded-loader on May 21, 2026

rattus128 commented May 8, 2026

DGX Spark users: please try this and comment with your results. I am getting major improvements in load behaviour.

To test, switch to this ComfyUI PR branch, also available here: https://github.com/rattus128/ComfyUI/tree/dev/threaded-loader
Don't forget to update your pip packages from requirements.txt.


Some people have so little RAM that they can't fit a single large model in it.
Others have systems that are so fast, loading the model cold from disk is by far the slowest part of their comfy experience.

So make loading models from disk a lot faster.

Modern NVMe disks require a small fleet of CPU threads to actually saturate their read bandwidth. At the same time, we are using MMAP+cudaMemcpy(pageable), which does single-threaded, per-page synchronous faulting. This limits progress to one disk thread serialized with the OS disk activity needed to service MMAP page faults, which is pretty slow. Here is current comfy loading a model (this one is from when the page cache is hot):

image

How to interpret this (left to right):

The bright green is unpinned transfer - these are slow as they imply full CPU synchronization with a (presumably) single threaded memcpy (notice the CUDA API track with the long red calls to cuMemcpy)

After that there is the cuMemAllocHost to allocate pinned memory, which is not a lot better. A huge amount of time is tied up in the ioctl for the allocation, and it turns out this is delayed because the kernel needs to prefault the pinned range synchronously and single-threaded. The allocation takes more than double the time needed to actually copy the weight from the page cache (the following read()).

The time to copy from the pinned memory to GPU is then tiny (the darker green).

The blue is the only time the GPU is doing computational work.

So let's fix it by dumping MMAP and torch.to completely and instead implementing a fleet of threads to do the whole lot in comfy-aimdo. This will be released in comfy-aimdo 0.4.0 (CORE-43)

Comfy-Org/comfy-aimdo#46

How it works:

Transfers to the GPU always go via pinned memory allocated via the aimdo HostBuffer API. HostBuffers that are expected to grow (like the set of pins for an active model over the course of model init) have a speculative prefetcher, which prefaults the pages for subsequent allocations. This keeps the single-threaded cuMemHostRegister from bearing the page faulting burden.

When allocating in the growing hostbuf, cuMemHostRegister should be fast if the RAM is already prefaulted (confirmed with extensive experiments).

To actually copy the data, a straight chunked multi-threaded cuMemcpy is used.

Non-blocking GPU transfer is then used from there.
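
As a rough illustration of the chunked multi-threaded copy step, here is a minimal Python sketch. It is not the comfy-aimdo implementation: `threaded_read`, its parameters, and the use of a plain `bytearray` in place of a pinned HostBuffer are all assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def threaded_read(path, num_threads=8, chunk_size=4 << 20):
    """Read a whole file into one preallocated buffer using a pool of threads.

    Hypothetical helper: the bytearray stands in for a pinned host buffer.
    """
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)

    def read_chunk(offset):
        end = min(offset + chunk_size, size)
        # Each task opens its own unbuffered handle so seeks never race.
        with open(path, "rb", buffering=0) as f:
            f.seek(offset)
            pos = offset
            while pos < end:
                n = f.readinto(view[pos:end])  # may return a short read
                if not n:
                    raise IOError(f"unexpected end of file at offset {pos}")
                pos += n

    # Chunks are independent, so they can be read in any order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(read_chunk, range(0, size, chunk_size)))
    return buf
```

File reads release the interpreter lock while they block, so the worker threads can keep several reads in flight at once, which is the property the loader relies on to saturate the disk.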

Here is what it looks like after:

image

TLDR: Denser blue. Denser dark green. No bright green. Smaller gaps == Faster load.

The pthread_cond_wait() is the main thread parking itself as it waits for the fleet of copy threads to do the copy. Notice the cond_wait is overlapping the GPU transfer (dark green), so it is effectively using non-blocking GPU DMA to read the next weight from the disk as the previous one copies. The goal is as much time reading from disk as possible (more pthread_cond_wait == good).
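
The overlap described above (reading weight N+1 from disk while weight N is still being transferred) can be sketched as a two-stage pipeline. This is only an illustration under assumptions: `pipelined_load`, `read_fn` and `transfer_fn` are hypothetical names, and a single-worker executor stands in for the non-blocking GPU copy.

```python
from concurrent.futures import ThreadPoolExecutor


def pipelined_load(names, read_fn, transfer_fn):
    """Overlap the read of item i+1 with the transfer of item i.

    read_fn(name) blocks the calling thread (like the main thread parked
    while the copy threads fill the buffer). transfer_fn(name, data) runs
    in the background (a stand-in for an asynchronous GPU copy).
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        in_flight = None
        for name in names:
            data = read_fn(name)  # previous transfer is still running here
            if in_flight is not None:
                results.append(in_flight.result())  # retire previous transfer
            in_flight = dma.submit(transfer_fn, name, data)
        if in_flight is not None:
            results.append(in_flight.result())
    return results
```

With this shape, the time spent waiting inside `read_fn` is not wasted as long as a transfer is in flight, which matches the "more pthread_cond_wait == good" reading of the trace.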


RAM behaviors and Interaction with --cache-ram (NEW)

This work significantly changes the caching of models with respect to pins. To compensate for the lower tendency to cache models in RAM without an MMAP, the pinned memory pool is expanded on Windows to go above the 50% shared memory threshold, and a pressure mechanism is introduced to move pins from one resident model to another.

The catch is that this is all committed memory on Windows. To compensate, we deliver on the long-held goal of making --cache-ram the default caching mode for ComfyUI. Existing workflows that used to ride the MMAP paging semantics instead cache what they can with committed (including over-shared-limit) pins, and both cached intermediates and these model pins respect the RAM cache threshold.

There was a missing feature in --cache-ram to properly manage the amount of RAM that background workflows are allowed to accrue. This is added as a second argument to --cache-ram, with a default of 25%.

A few loose ends on the semantics of --cache-ram are tied off.

The general priority for RAM occupancy is:

(HIGHEST)
1: Intermediates of the active workflow
2: Pinned memory for the current model
3: Pinned memory for inactive models
4: Intermediates of inactive workflows
5: Pins of models in inactive workflows (freed via total MP destruction in cache itself)
(LOWEST)
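
To make the ordering concrete, here is a small sketch of eviction that walks the list above from lowest to highest priority. The names `RamClass` and `pick_evictions` and the tuple layout are hypothetical and are not taken from the ComfyUI code.

```python
from enum import IntEnum


class RamClass(IntEnum):
    """Priority classes from the list above (1 = highest priority)."""
    ACTIVE_INTERMEDIATES = 1
    CURRENT_MODEL_PINS = 2
    INACTIVE_MODEL_PINS = 3
    INACTIVE_INTERMEDIATES = 4
    INACTIVE_WORKFLOW_MODEL_PINS = 5


def pick_evictions(entries, bytes_needed):
    """Choose what to free, starting from the lowest-priority class.

    entries: list of (name, RamClass, size_in_bytes) tuples.
    Returns the names to evict, in eviction order.
    """
    victims = []
    # Highest numeric value is the lowest priority, so it is evicted first.
    for name, _cls, size in sorted(entries, key=lambda e: e[1], reverse=True):
        if bytes_needed <= 0:
            break
        victims.append(name)
        bytes_needed -= size
    return victims
```

For example, with one entry in each of four classes, a request slightly larger than the lowest-priority entry evicts that entry and the next-lowest one, and leaves the active workflow's intermediates and the current model's pins alone.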


Finer details and other changes:

Even for non-offloaded weights, this system is still used. The offload streams now have an associated re-usable pin_stream_buffer that each transfer can use to stage and transfer weights. This is used for first step loads, or when normal pinned memory is fully exhausted.

This necessitated a deeper priority hierarchy for pinned memory with JIT memory pressure release. The stream pin buffer is top priority and will trash all other pins to make space on demand. The active models then take precedence over inactive models. This also fixes a bug where pins were not freed from inactive models on Windows (CORE-164). Since model loading now fully bypasses the mmap, there is no need for the RAM-based pressure mechanism on model load, so it is removed along with the predictive pin memory pressure mechanism. The now-unused Windows-specific memory logic is cleaned up.

Making this new pinning logic interact with regular ModelPatcher pins is awkward, so we take a first step towards deprecating non-dynamic VRAM by switching off smart memory completely for non-dynamic models. Going forward, new models should not need VRAM estimates (dynamic VRAM handles this at runtime), so this avoids needing such estimates while solving our pin interaction problem (CORE-152).

Performance after this enhancement was then immediately gated on Loras still being unpinned, so this follows up with general pinned-memory support for Loras, which was deferred in the original Lora async work in #13618. (CORE-165)


Matrix test results:

Windows RTX5060, 16GB RAM, PCIE4x4 NVME
Wan 2.2 2x14B
Varying precision and RAM
2 runs, disk cache is forced cold before first run

image

full workflow execution time in seconds
red: before, green: after

Same test on PCIe Gen1 (slow disk test)

image

Example test conditions:

Windows RTX5060, 16GB RAM, PCIE4x4 NVME
LTX2.3 FP8 template workflow (with Lora), 720x520x121f

Before:

scr
Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [08:59<00:00, 67.40s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:37<00:00, 72.58s/it]
Prompt executed in 00:13:59

After:

scr
Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [02:11<00:00, 16.43s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:56<00:00, 18.96s/it]
Prompt executed in 246.86 seconds

When running without the lora:

LTX2.3 FP8 T2V 360Px121f, Windows, RTX5060, 16GB RAM

Before:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:48<00:00, 28.56s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:42<00:00, 34.26s/it]
Prompt executed in 464.36 seconds

After:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:08<00:00,  8.60s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00,  8.95s/it]
Prompt executed in 143.79 seconds

Example test conditions:

Linux, DGX spark
Flux2 FP8 in 3 different quants (forces 3x models)

image image image image

Before (step speed ok but huge time spent in loading):

model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:48<00:00,  2.42s/it]
Requested to load AutoencoderKL
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:50<00:00,  2.51s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:48<00:00,  2.41s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Prompt executed in 00:14:10

Before, with --disable-dynamic-vram (killed by the OOM killer):

Using mixed precision operations
model weight dtype torch.float8_e5m2, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Unloaded partially: 33080.59 MB freed, 0.00 MB remains loaded, 4480.00 MB buffer reserved, lowvram patches: 0
Killed

After:

model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.35s/it]
Requested to load AutoencoderKL
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.36s/it]
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.37s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Prompt executed in 226.51 seconds
image
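
For a quick comparison, the "Prompt executed in" lines from the logs above can be turned into speedup factors. The snippet below only does arithmetic on the reported totals; the helper names and run labels are made up for the example. The three runs work out to roughly 3.2x to 3.8x.

```python
def to_seconds(stamp):
    """Convert 'HH:MM:SS' or a plain seconds string to float seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# (before, after) totals copied from the "Prompt executed in" log lines.
RUNS = {
    "LTX2.3 FP8 with Lora (Windows, RTX5060)": ("00:13:59", "246.86"),
    "LTX2.3 FP8 without Lora (Windows, RTX5060)": ("464.36", "143.79"),
    "Flux2 in 3 quants (Linux, DGX Spark)": ("00:14:10", "226.51"),
}


def speedups(runs):
    """Return {label: before_seconds / after_seconds}."""
    return {
        label: to_seconds(before) / to_seconds(after)
        for label, (before, after) in runs.items()
    }


if __name__ == "__main__":
    for label, factor in speedups(RUNS).items():
        print(f"{label}: {factor:.2f}x faster")
```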

Regression tests:

Linux, 5090, 96GB, LTX2.3 ✅
Linux, 5090, 96GB, Ace step 1.5 turbo XL ✅
Linux, 5090, 96GB, stable cascade -> Flux2 ✅
Linux, 5090, 96GB, ZIT w/lora ✅

socket-security Bot commented May 8, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License
Updated | comfy-aimdo@0.3.0 ⏵ 0.4.3 | 99 (+1) | 100 | 100 | 100 | 70

rattus128 commented May 9, 2026

This is bad:

qwen 2048x2048 on my 5060.

image

There is a sync in there that shouldn't be happening. Pretty sure it's the per-stream pre-buffer use syncing in too dumb a way.

rattus128 added 12 commits May 12, 2026 11:04
Disable smart memory outright for non dynamic models.

This is a minor step towards deprecation of --disable-dynamic-vram
and the legacy ModelPatcher.

This is needed for estimate-free model development, where new models
can opt-out of supplying a memory estimate and not have to worry
about hard VRAM allocations due to legacy non-dynamic model patchers

This is also a general stability increase for a lot of stray use cases
where estimates may still be off and going forward we are not going
to accurately maintain such estimates.
Use a single growable buffer so we can do threaded pre-warming on
pinned memory.
Aimdo implements a faster threaded loader.
Introduce per-offload-stream HostBuffer reuse for pinned staging,
include it in cast buffer reset synchronization.

Defer actual casts that go via this pin path to a separate pass
such that the buffer can be allocated monolithically (to avoid
cudaHostRegister thrash).
Replace the predictive pin pressure mechanism with JIT PIN memory
pressure.
This was defeatured in aimdo iteration
rattus128 force-pushed the dev/threaded-loader branch from 1d792e1 to 44c0a06 on May 12, 2026 01:55
rattus128 marked this pull request as ready for review May 12, 2026 13:30
rattus128 commented:

> This is bad:
>
> qwen 2048x2048 on my 5060.
> image
>
> There is a sync in there that shouldn't be happening. Pretty sure its the per stream pre-buffer use syncing in too dumb a way.

This is fixed.

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ced4184c-9bbd-4f19-b2c3-44eb29216aa1

📥 Commits

Reviewing files that changed from the base of the PR and between 09a98a9 and c836a5d.

📒 Files selected for processing (2)
  • tests/execution/test_async_nodes.py
  • tests/execution/test_execution.py

📝 Walkthrough

Reworks prefetch/prepare to write into shared destination slices and adds stream-scoped host-backed pin buffers with headroom and allocation helpers. Introduces mmap-dirtiness tracking (DIRTY_MMAPS) and pin-budget eviction (free_pins, ensure_pin_budget, ensure_pin_registerable). Low‑VRAM patch prepare signatures changed; pinned-memory APIs became subset-aware and unpin was removed. cast/ops now queue and flush stream pin transfers via host buffers. memory_management added a HostBuffer fast-path and extra_ram_release free_active flag. CLI/cache logic now supports active/inactive RAM thresholds; execution frees pins on RAM shortfall. requirements bump to comfy-aimdo 0.4.3.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly describes the main objective: implementing multi-threaded model loading from disk with performance improvements and offload-to-disk functionality, supported by the referenced ticket numbers.
Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the motivation, implementation approach, RAM caching behavior changes, performance metrics, and test results.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@comfy/model_management.py`:
- Around line 1193-1200: The call to offload_stream.synchronize() inside
get_pin_buffer causes unwanted stream-wide blocking when reusing
STREAM_PIN_BUFFERS and should be removed to allow overlapping disk reads,
pinned-memory staging and GPU transfers (matching the non-blocking pattern used
in get_aimdo_cast_buffer()); update get_pin_buffer to drop the
offload_stream.synchronize() call when reusing an existing buffer, or if
synchronization is truly required, document in a clear comment why it is
necessary and add a targeted synchronization mechanism (e.g., per-buffer
event/query) instead of synchronizing the entire offload_stream; reference
get_pin_buffer, STREAM_PIN_BUFFERS, offload_stream.synchronize(),
get_offload_stream and get_aimdo_cast_buffer when making the change.

In `@comfy/pinned_memory.py`:
- Around line 19-29: The HostBuffer.extend call is using a relative increment
but expects an absolute target size, causing truncation and out-of-bounds
slices; update the extend invocation in the block that sets module._pin so it
grows to the absolute new size (current hostbuf.size + requested size) instead
of passing size directly. Specifically, compute the absolute target (e.g.,
offset + size or hostbuf.size + size) and pass that to hostbuf.extend(size=...),
keeping the subsequent
comfy_aimdo.torch.hostbuf_to_tensor(hostbuf)[offset:offset+size] slice and the
assignment to module._pin and module._pin.untyped_storage()._comfy_hostbuf
unchanged. Ensure use of the same behavior as
resize_pin_buffer()/pin_buffer.extend(...).

In `@requirements.txt`:
- Line 26: The requirements file pins a non-existent package version
"comfy-aimdo==0.4.0" which will fail installs; update the dependency line for
the comfy-aimdo package (the literal "comfy-aimdo" entry) to a published version
such as "comfy-aimdo==0.3.0" or remove the strict pin (e.g.,
"comfy-aimdo>=0.3.0") so installs succeed until 0.4.0 is published.
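
The absolute-versus-relative distinction in the `comfy/pinned_memory.py` comment above can be shown with a toy buffer. `GrowableBuffer` and `allocate_slice` are hypothetical stand-ins written for this example; the real HostBuffer API in comfy-aimdo may differ.

```python
class GrowableBuffer:
    """Toy buffer whose extend() takes an ABSOLUTE target size (assumed)."""

    def __init__(self):
        self.data = bytearray()

    @property
    def size(self):
        return len(self.data)

    def extend(self, size):
        # Grow to `size` bytes in total; a smaller target is a no-op.
        if size > len(self.data):
            self.data.extend(bytes(size - len(self.data)))


def allocate_slice(buf, nbytes):
    """Append a region of `nbytes` and return its (start, end) bounds."""
    offset = buf.size
    buf.extend(size=offset + nbytes)  # absolute target, not the increment
    return offset, offset + nbytes
```

Passing only the increment to such an `extend` would leave the buffer too short after the first allocation, so a later `[offset:offset+size]` slice would fall outside it, which is the failure mode the review comment describes.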

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 33b744a4-878e-4d18-84e1-79c21c2872d0

📥 Commits

Reviewing files that changed from the base of the PR and between 20e4394 and 44c0a06.

📒 Files selected for processing (9)
  • comfy/lora.py
  • comfy/memory_management.py
  • comfy/model_management.py
  • comfy/model_patcher.py
  • comfy/ops.py
  • comfy/pinned_memory.py
  • comfy/utils.py
  • comfy/windows.py
  • requirements.txt
💤 Files with no reviewable changes (2)
  • comfy/windows.py
  • comfy/utils.py

Comment thread comfy/model_management.py
Comment thread comfy/pinned_memory.py
Comment thread requirements.txt Outdated
rattus128 added 3 commits May 13, 2026 22:23
This was syncing with the offload stream which itself is synced with the
compute stream, so this was syncing CPU with compute transitively. Define
the event to sync it more gently.
Pinning is more important than inactive intermediates and the stream
pin buffer is more important than even active intermediates.
Add back proper pin freeing on RAM pressure
coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
comfy/model_management.py (1)

1398-1399: 💤 Low value

Behavior change: pin_memory no longer hard‑rejects above MAX_PINNED_MEMORY.

Previously the function returned False once TOTAL_PINNED_MEMORY + size > MAX_PINNED_MEMORY. Now it best‑effort frees RAM cache + dynamic pins via ensure_pin_budget(size) and then unconditionally proceeds to cudaHostRegister. If ensure_pin_budget cannot make enough room (e.g., everything is the active stream pin, which the priority hierarchy keeps), the cap is silently exceeded — and PIN_PRESSURE_HYSTERESIS makes the actual ceiling effectively MAX_PINNED_MEMORY + 128 MiB even on the happy path.

This appears to be intentional given the PR's "pin priority hierarchy" notes, but it would be helpful to either (a) document MAX_PINNED_MEMORY as a soft target rather than a hard cap, or (b) bail when ensure_pin_budget couldn't free the shortfall, so we don't quietly drift past the user‑visible budget on systems with marginal RAM.
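
Option (b) above can be sketched as follows. This is a simplified model with hypothetical names (`try_pin`, a `free_pins` callback that reports the bytes it released); it is not the actual `pin_memory` code and it ignores the hysteresis term.

```python
def try_pin(size, total_pinned, max_pinned, free_pins):
    """Return (ok, new_total); pin only if the budget holds after eviction.

    free_pins(n) is asked to release at least n bytes and returns how many
    bytes it actually freed (possibly fewer).
    """
    shortfall = total_pinned + size - max_pinned
    if shortfall > 0:
        total_pinned -= free_pins(shortfall)
    if total_pinned + size > max_pinned:
        # Eviction could not cover the shortfall: refuse, keeping a hard cap.
        return False, total_pinned
    return True, total_pinned + size
```

Option (a) would instead drop the second check and document the limit as a soft target.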

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@comfy/model_management.py` around lines 1398 - 1399, pin_memory's behavior
changed to allow exceeding MAX_PINNED_MEMORY because it calls
ensure_pin_budget(size) then proceeds to cudaHostRegister regardless; update
pin_memory to verify the budget after calling ensure_pin_budget and bail if the
shortfall remains: after ensure_pin_budget(size) check if TOTAL_PINNED_MEMORY +
size <= MAX_PINNED_MEMORY (or account for PIN_PRESSURE_HYSTERESIS if intended)
and if not, return False (or raise) instead of proceeding to cudaHostRegister so
the cap remains a hard limit; reference the pin_memory function,
ensure_pin_budget, TOTAL_PINNED_MEMORY, MAX_PINNED_MEMORY,
PIN_PRESSURE_HYSTERESIS, and cudaHostRegister when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@comfy/model_management.py`:
- Around line 516-525: In free_pins(), avoid indexing into
model.model.dynamic_pins[model.load_device] directly; instead safely
.get(model.load_device) and check for None before accessing ["active"] so
incomplete initialization won't raise; update the condition in free_pins
(referencing current_loaded_models, loaded_model.model, model.is_dynamic(),
model.partially_unload_ram) and apply the same defensive .get() pattern to the
other two occurrences noted around lines where dynamic pin checks occur (the
other methods that reference model.model.dynamic_pins[...] at the end of file)
to ensure you only call partially_unload_ram when pin_state is present and its
"active" flag is evaluated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 79af7e61-7191-4ca1-9529-c80ae1a90494

📥 Commits

Reviewing files that changed from the base of the PR and between 44c0a06 and 3f71781.

📒 Files selected for processing (5)
  • comfy/memory_management.py
  • comfy/model_management.py
  • comfy/ops.py
  • comfy/pinned_memory.py
  • execution.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • comfy/ops.py
  • comfy/pinned_memory.py

Comment thread comfy/model_management.py Outdated
Comment on lines +516 to +525
def free_pins(size, evict_active=False):
    if size <= 0:
        return

    for loaded_model in reversed(current_loaded_models):
        model = loaded_model.model
        if model is not None and model.is_dynamic() and (evict_active or not model.model.dynamic_pins[model.load_device]["active"]):
            size -= model.partially_unload_ram(size)
            if size <= 0:
                break

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm dynamic_pins is always populated for load_device before any code path that might
# trigger ensure_pin_budget/free_pins.
ast-grep --pattern 'dynamic_pins[$_] = $_'
rg -nP --type=py -C3 '\bdynamic_pins\b'

Repository: Comfy-Org/ComfyUI

Length of output: 3537


🏁 Script executed:

#!/bin/bash
# Understand the class structure where dynamic_pins is initialized
cd comfy
# Find the class containing the constructor with the initialization
rg -B20 'if self.load_device not in self.model.dynamic_pins:' model_patcher.py | head -40

# Check what function/method contains the problematic line 522
rg -B10 'def free_pins' model_management.py

# Find where current_loaded_models is populated
rg 'current_loaded_models.append|current_loaded_models +=' model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 1511


🏁 Script executed:

#!/bin/bash
# Find is_dynamic() method definition
rg 'def is_dynamic' comfy/

# Check where models are added to current_loaded_models with more context
rg -B5 -A5 'current_loaded_models.append|current_loaded_models +=' comfy/model_management.py | head -50

# Check if ModelPatcherDynamic is the only class with is_dynamic
rg 'class.*ModelPatcher' comfy/model_patcher.py

# Look for any guard/check before accessing dynamic_pins in other places
rg -C2 'is_dynamic\(\)' comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 1513


🏁 Script executed:

#!/bin/bash
# Get full definition of is_dynamic methods
rg -A5 'def is_dynamic' comfy/model_patcher.py

# Check the LoadedModel class and when models are added
rg -B10 -A10 'class LoadedModel' comfy/model_management.py

# Find where models are loaded and added to current_loaded_models
rg -B8 -A8 'current_loaded_models.append.*LoadedModel' comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 969


🏁 Script executed:

#!/bin/bash
# Find where models get loaded/added to current_loaded_models
rg -B15 'loaded_model = LoadedModel' comfy/model_management.py | head -50

# Check for load_model function and how models are prepared
rg 'def load_model' comfy/model_management.py

# Look for any git history or PR context - check if this is new code
git log --oneline comfy/model_management.py 2>/dev/null | head -5 || echo "Git log not available"
git diff HEAD^ comfy/model_management.py 2>/dev/null | grep -A10 -B10 "free_pins\|dynamic_pins" | head -50 || echo "Git diff not available"

Repository: Comfy-Org/ComfyUI

Length of output: 656


🏁 Script executed:

#!/bin/bash
# Check if there's any code that changes load_device after construction
rg 'load_device\s*=' comfy/model_patcher.py | head -20

# Look for any re-attachment or device change logic
rg 'attach|detach|reload' comfy/model_patcher.py | grep -i 'def '

# Check if dynamic_pins gets updated for new devices anywhere
rg -B5 -A5 'dynamic_pins\[.*\] = {' comfy/

# Look at the full context of line 522 and around it
sed -n '510,535p' comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 1454


🏁 Script executed:

#!/bin/bash
# Fix the regex and search for dynamic_pins initialization pattern
rg -C5 'dynamic_pins\[' comfy/model_patcher.py | grep -E '(def |class |dynamic_pins)'

# Look for any code that might change load_device or reuse models across devices
rg 'model\.load_device' comfy/model_management.py | head -20

# Check if there's any scenario where a model might be used with different load_devices
rg -B10 'for loaded_model in.*current_loaded_models' comfy/model_management.py | head -40

# Search for attach/detach patterns that might change device association
rg -B3 -A8 'def.*attach|def.*prepare' comfy/model_patcher.py | head -60

Repository: Comfy-Org/ComfyUI

Length of output: 3858


🏁 Script executed:

#!/bin/bash
# Check if load_device can change or if models can be reused across devices
rg -B5 -A5 'load_device.*=' comfy/model_patcher.py | grep -A8 -B8 '__init__'

# Search for any scenario where a model might be attached/prepared with a new device
rg -C10 'load_device\s*=' comfy/model_management.py

# Check the git log to understand when this code was added
git log --all --full-history -p -- comfy/model_management.py 2>/dev/null | grep -A30 'def free_pins' | head -50 || echo "Checking commit history..."

# Look for any recent changes to dynamic_pins handling
git show HEAD:comfy/model_management.py 2>/dev/null | sed -n '516,525p' || git log -1 --format=%h comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 4627


🏁 Script executed:

#!/bin/bash
# Verify if load_device is ever reassigned after __init__
rg 'self\.load_device\s*=' comfy/model_patcher.py

# Check if there's any code path where a model could have load_device changed
git log -1 --format="%H %s" comfy/model_management.py

# Verify the exact changes in this commit
git show --stat comfy/model_management.py 2>/dev/null | head -20

# Check if prepare() method modifies load_device or affects dynamic_pins initialization
rg -B5 -A15 'def prepare\(' comfy/model_patcher.py | head -40

Repository: Comfy-Org/ComfyUI

Length of output: 1494


🏁 Script executed:

#!/bin/bash
# Final check: verify there's no other code path that could create a dynamic model without full initialization
rg 'is_dynamic\s*=\s*True|class.*Dynamic' comfy/ --type py

# Confirm that every place accessing dynamic_pins for write operations has it initialized
rg 'dynamic_pins\[.*\]\s*=' comfy/

Repository: Comfy-Org/ComfyUI

Length of output: 533


Use .get() to safely access dynamic_pins[load_device] in case initialization is incomplete.

While the current initialization logic in ModelPatcherDynamic.__init__() should populate dynamic_pins[load_device] before any dynamic model reaches free_pins(), defensive access is prudent in this RAM-pressure code path. Consider:

if model is not None and model.is_dynamic():
    pin_state = model.model.dynamic_pins.get(model.load_device)
    if pin_state is not None and (evict_active or not pin_state["active"]):
        size -= model.partially_unload_ram(size)
        if size <= 0:
            break

This same pattern should also be applied at lines 1249 and 1251 in the same file for consistency.


@rattus128 rattus128 marked this pull request as draft May 13, 2026 15:14
@frauttauteffasu

Yes I think defaults should only work in the DGX sparks favour. do you have any results one way or the other? For spark it should only come into play with a lot of intermediates.

Setting cache-ram above 0 seems to trigger model eviction due to memory pressure. With cache-ram at 0, this saves about 20 seconds per run because the text encoder and diffusion model do not have to be reloaded. Is ComfyUI aware of unified memory systems?

@zwukong

zwukong commented May 21, 2026

Please do not remove --disable-dynamic-vram. Dynamic VRAM is not stable: its RAM cost is huge and it often gets stuck. My most effective and stable setting is: --disable-pinned-memory --disable-dynamic-vram --reserve-vram 2

@zwukong

zwukong commented May 21, 2026

Most of your optimization is based on fp8 models, I think, but most of us use GGUF models, which save RAM and disk space and are good enough to run.

@kingp0dd

Most of your optimization is based on fp8 models, I think, but most of us use GGUF models, which save RAM and disk space and are good enough to run.

is this detrimental to gguf? i still haven't got to test it

@kingp0dd

for what it's worth, i tried this update just now. and the issue i'm describing in city96/ComfyUI-GGUF#427 (comment) , is gone. CLIP and UNET are now loaded from RAM instead of Disk.

@kingp0dd

from 160-200s, now it's 100-115s

@m8rr

m8rr commented May 21, 2026

However, in my case, after applying this PR, ComfyUI completely reloads the CLIP from scratch even without any prompt changes, or it reloads the CLIP again instead of moving straight from Sampler 1 to Sampler 2.

But it's probably an issue with GGUF.
Before
city96/ComfyUI-GGUF#427 (comment)
After
city96/ComfyUI-GGUF#427 (comment)

#13802 (comment) --cache-ram 0 fixes this.

@rattus128
Contributor Author

@frauttauteffasu is that performance issue you experience on DGX proportional to the size of the RAM cache, or does any size of RAM cache arg cause the issue? For example, if you --cache-ram 5 is it something in between or same perf?

@frauttauteffasu

if you --cache-ram 5 is it something in between or same perf?

@rattus128 with --cache-ram 5 performance is the same as 0. The trigger point is 21: with that value or above, models are unloaded. There is no change in performance / memory usage until that point.

Kosinkadink pushed a commit that referenced this pull request May 21, 2026
Brings in 18 commits from master so worksplit-multigpu does not regress
fixes that landed on main since the last sync:

- #13699 Hunyuan 3D 2.1 batch-size fixes (overlap with our own backport;
  conflict resolved in favor of the shape>=2 gate that binds
  swap_cfg_halves once and reuses it for the output swap-back)
- #14031 ModelPatcherDynamic lora reshape / backup restore fix
- #13802 Multi-threaded model load (memory_management / pinned_memory /
  model_management / aimdo plumbing)
- #12679 lanczos single-channel tensor fix
- #14010 Stable Audio 3 support
- assorted partner-node, openapi, workflow-template, and tooling updates

Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>
@rattus128
Contributor Author

if you --cache-ram 5 is it something in between or same perf?

@rattus128 with --cache-ram 5 performance is the same as 0. The trigger point is 21: with that value or above, models are unloaded. There is no change in performance / memory usage until that point.

ok its likely workflow dependent too. what are your models?

@m8rr

m8rr commented May 22, 2026

@zwukong It might just be that a feature that used to work well is broken now, but the new feature is getting better. Give it a try.

LTX2.3, 12G VRAM, 32G RAM, 1280x768x121 5/3steps, Run 3 times(Change the prompt each time)
0.22 stable + original GGUF node: 180s 180s 180s
0.22 stable + D-VRAM GGUF node: 145s 110s 110s
this PR(--cache-ram 0) + D-VRAM GGUF node: 127s 83s 80s

https://github.com/rattus128/ComfyUI-GGUF/tree/dynamic-vram

git clone -b dynamic-vram https://github.com/rattus128/ComfyUI-GGUF
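
For comparison, the timings reported above can be turned into per-run speedup factors relative to the "0.22 stable + original GGUF node" baseline. This is only a small arithmetic sketch over the three reported runs per configuration; the dictionary keys and the helper name `speedups_vs_baseline` are made up for illustration.

```python
# Run times in seconds, copied from the comment above (three runs each).
timings = {
    "stable_original_gguf": [180, 180, 180],
    "stable_dvram_gguf": [145, 110, 110],
    "pr_cache_ram_0_dvram_gguf": [127, 83, 80],
}


def speedups_vs_baseline(baseline, candidate):
    """Per-run speedup factor: baseline time divided by candidate time."""
    return [round(b / c, 2) for b, c in zip(baseline, candidate)]


baseline = timings["stable_original_gguf"]
for name, runs in timings.items():
    print(name, speedups_vs_baseline(baseline, runs))
```

The first run in each configuration includes the cold load, so the second and third runs are the more representative figures for repeated generations.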

@zwukong

zwukong commented May 22, 2026

really? thanks for the gguf pr. i will try

@frauttauteffasu

ok its likely workflow dependent too. what are your models?

ComfyUI's memory use with these models does use all but 20GB of the system's memory. Does --cache-ram 21 limit ComfyUI's memory use to total system memory - 21?

@brendanhoar
Contributor

Hopefully the author sees that there seem to be new performance problems arising under some circumstances with this PR after it was merged (see the SwarmUI one from today, for example): mcmonkeyprojects/SwarmUI#1391

Not to say I am not grateful for this work being done, but there are new issues now.

@Ph0rk0z

Ph0rk0z commented May 25, 2026

I have spinning drives and not NVME but plenty of ram..there is no magic to make it load "faster". It is always great to assume.

@Ph0rk0z

Ph0rk0z commented May 25, 2026

Ok.. so with some testing.. this hot garbage is loading from disk every gen on int8 klein with dynamic vram disabled. With dynamic vram enabled its like it was for regens but if the prompt changes the TE is loaded from DISK!?

So now I cannot disable dynamic vram nor can I use dynamic vram. And a workflow that has caching (e.g chromacache) is completely broken because caching doesn't work with dynamic vram.
My gens go from 7s on a new prompt to 10s while rerolls are the same exact speed prior to this commit. WTF! I'm reverting.

v.22 works fine.. after this.. not fine.

@brendanhoar
Contributor

Are these pull requests being tested across all the target OSes before merging? Linux, Windows, MacOS, etc.? Or just Linux?

simonri pushed a commit to simonri/ComfyUI-flash-attention-3 that referenced this pull request May 26, 2026
…load to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117) (Comfy-Org#13802)

* model_management: disable non-dynamic smart memory

Disable smart memory outright for non dynamic models.

This is a minor step towards deprecation of --disable-dynamic-vram
and the legacy ModelPatcher.

This is needed for estimate-free model development, where new models
can opt-out of supplying a memory estimate and not have to worry
about hard VRAM allocations due to legacy non-dynamic model patchers

This is also a general stability increase for a lot of stray use cases
where estimates may still be off and going forward we are not going
to accurately maintain such estimates.

* pinned_memory: implement with aimdo growable buffer

Use a single growable buffer so we can do threaded pre-warming on
pinned memory.

* mm: use aimdo to do transfer from disk to pin

Aimdo implements a faster threaded loader.

* Add stream host pin buffer for AIMDO casts

Introduce per-offload-stream HostBuffer reuse for pinned staging,
include it in cast buffer reset synchronization.

Defer actual casts that go via this pin path to a separate pass
such that the buffer can be allocated monolithically (to avoid
cudaHostRegister thrash).

* remove old pin path

* Implement JIT pinned memory pressure

Replace the predictive pin pressure mechanism with JIT PIN memory
pressure.

* LowVRAMPatch: change to two-phase visit

* lora: re-implement as inplace swiss-army-knife operation

* prepare for multiple pin sets

* implement pinned loras

* requirements: comfy-aimdo 0.4.0

* ops: remove unused arg

This was defeatured in aimdo iteration

* ops: sync the CPU with only the offload stream activity

This was syncing with the offload stream which itself is synced with the
compute stream, so this was syncing CPU with compute transitively. Define
the event to sync it more gently.

* pins: implement freeing intermediate for pinned memory

Pinning is more important than inactive intermediates and the stream
pin buffer is more important than even active intermediates.

* execution: implement pin eviction on RAM pressure

Add back proper pin freeing on RAM pressure

* implement pin registration swaps

Uncap the windows pins from 50% by extending the pool and have a pressure
mechanism to move the pin reservations on demand.

This unfortunately implies a GPU sync to do the freeing so significant
hysteresis needs to be added to consolidate these pressure events.

* cli_args/execution: Implement lower background cache-ram threshold

Limit the amount of RAM background intermediates can use, so that
switching workflows doesn't degrade performance too much.

* make default

* bump aimdo

* model-patcher: force-cast tiny weights

Flux 2 gets crazy stalls due to a mix of tiny and giant weights
creating lopsided stream buffer rotations which creates stalls.

* ops: refactor in prep for chunking

* mm: delegate pin-on-the-way to aimdo

Aimdo is able to chunk and slice this on the way for better CPU->GPU
overlap. The main advantage is the ability to shorten the bus contention
window between previous weight transfer and the next weights vbar
fault.

* bump aimdo

* pinning updates

* specify hostbuf max allocation size

There are signs of virtual memory exhaustion on some linux systems when
throwing 128GB for every little piece. Pass the actual to save aimdo
from over-estimates

* tests: update execution tests for caching

The default caching changed to ram-cache so update these tests
accordingly.

Remove the LRU 0 test as this also falls through to RAM cache.
@frauttauteffasu

@Ph0rk0z and @brendanhoar any improvement with #14116? If not, with that pull applied, what if you add --cache-ram 0 or --cache-ram 1?

@Ph0rk0z

Ph0rk0z commented May 26, 2026

I will try but I'm on linux and not windows. Technically cache ram should never have evicted anything since I have >96g of ram.

@brendanhoar
Contributor

brendanhoar commented May 27, 2026

@Ph0rk0z and @brendanhoar any improvement with #14116? If not, with that pull applied, what if you add --cache-ram 0 or --cache-ram 1?

Using:
git switch -
git fetch origin "pull/14116/head:pr/14116"
git switch pr/14116

pulled it down and...restarted, then loaded up a T2U2V workflow and when it loads LTX dev...

20:43:39.985 [Debug] [ComfyUI-0/STDERR] [WARNING] Warning: TAESD previews enabled, but could not find models/vae_approx/None
20:43:39.990 [Debug] [ComfyUI-0/STDERR] [INFO] Requested to load LTXAV
20:43:41.128 [Debug] [ComfyUI-0/STDERR] [INFO] Model LTXAV prepared for dynamic VRAM loading. 40050MB Staged. 1660 patches attached. Force pre-loaded 608 weights: 3303 KB.
20:43:50.850 [Debug] [ComfyUI-0/STDERR]
20:43:50.851 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s]
20:49:27.450 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s,   Model Initializing ...  ]
20:49:27.451 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [05:36<?, ?it/s,  Model Initialization complete!  ]

This is on first run from comfyui restart, but not machine restart. Task Manager shows no storage IO, so presumably it is taking 5.5 minutes (see timestamps for lines 5 to 6 above) to pull the model out of the Windows OS filecache and load it onto the card? Or maybe from storage?

On second run, I see more storage IO than the first time (with 512GB of RAM, why???), but it appears to be extremely low read rate, and this is the result when we get to LTX again, this time a six minute wait (see lines 3 and 4 below) to load the model from storage

20:54:03.677 [Debug] [ComfyUI-0/STDERR] [INFO] Model LTXAV prepared for dynamic VRAM loading. 40050MB Staged. 1660 patches attached. Force pre-loaded 608 weights: 3303 KB.
20:54:13.110 [Debug] [ComfyUI-0/STDERR]
20:54:13.113 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s]
21:00:44.042 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s,   Model Initializing ...  ]
21:00:44.043 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [06:30<?, ?it/s,  Model Initialization complete!  ]
21:00:53.990 [Debug] [ComfyUI-0/STDERR]  17%|█?        | 1/6 [06:30<32:34, 390.93s/it,  Model Initialization complete!  ]
image

Why can't we just keep the models in the OS cache or in somewhere controlled by comfyui? It seems like these keep getting ejected and then pulled from storage again for no freaking reason. Sometimes storage can be slow, why isn't this taken into consideration for these changes?
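
The stall lengths in the two logs above can be checked directly from the timestamps on either side of each gap. The snippet below is just an arithmetic sketch; the helper name `elapsed_seconds` is made up here, and it assumes both timestamps fall on the same day (no midnight wrap).

```python
from datetime import datetime

LOG_TIME_FORMAT = "%H:%M:%S.%f"  # e.g. "20:43:50.851"


def elapsed_seconds(start, end):
    """Seconds between two same-day log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Timestamps copied from the log lines bracketing each stall.
first_run = elapsed_seconds("20:43:50.851", "20:49:27.450")
second_run = elapsed_seconds("20:54:13.113", "21:00:44.042")

print(f"first run stall:  {first_run:.3f} s ({first_run / 60:.1f} min)")
print(f"second run stall: {second_run:.3f} s ({second_run / 60:.1f} min)")
```

These gaps line up with the `05:36` and `06:30` elapsed times that the progress bar itself reports in the same logs.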

@coderabbitai coderabbitai Bot mentioned this pull request May 27, 2026
1 task