Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117) by rattus128 · Pull Request #13802 · Comfy-Org/ComfyUI

rattus128 · 2026-05-08T14:34:36Z

DGX-spark users please try this and comment your results - I am getting major improvements in load behaviours

Change to this ComfyUI PR, also available here: https://github.com/rattus128/ComfyUI/tree/dev/threaded-loader
Don't forget to update pip with requirements.txt.

Some people have so little RAM that they can't fit a single large model in it.
Others have systems that are so fast, loading the model cold from disk is by far the slowest part of their comfy experience.

So make loading models from disk a lot faster.

Modern NVMe disks require a small fleet of CPU threads to actually saturate their read bandwidth. At the same time, we are using MMAP + cudaMemcpy (pageable), which faults pages synchronously, one at a time, on a single thread. This limits progress to one disk thread serialized with the OS disk activity that services the MMAP page faults, which is pretty slow. Here is current comfy loading a model (this one is from when the page cache is hot):

How to interpret this (left to right):

The bright green is unpinned transfer - these are slow as they imply full CPU synchronization with a (presumably) single threaded memcpy (notice the CUDA API track with the long red calls to cuMemcpy)

After that there is the cuMemAllocHost to allocate pinned memory, which is not a lot better. A huge amount of time is tied up in the ioctl for the allocation, and it turns out this is delayed because the kernel needs to prefault the pinned range synchronously and single-threaded. It takes more than double the time needed to actually copy the weight from the page cache (the following read()).

The time to copy from the pinned memory to GPU is then tiny (the darker green).

The blue is the only time the GPU is doing computational work.

So let's fix it by dumping MMAP and torch.to completely and instead implementing a fleet of threads to do the whole lot in comfy-aimdo. This will be released in comfy-aimdo 0.4.0 (CORE-43).

Comfy-Org/comfy-aimdo#46

How it works:

Transfers to the GPU always go via pinned memory allocated through the aimdo HostBuffer API. HostBuffers that are expected to grow (like the set of pins for an active model over the course of model init) have a speculative prefetcher, which prefaults the pages for subsequent allocations. This stops single-threaded cuMemHostRegister from taking on the page faulting burden.

When allocating in the growing hostbuf, cuMemHostRegister should be fast if the RAM is already prefaulted (confirmed with extensive experiments).

To actually copy the data, a straight chunked multi-threaded cuMemcpy is used.

Non-blocking GPU transfer is then used from there.
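As a rough illustration of the chunked multi-threaded copy step, here is a minimal pure-Python sketch. It is not the comfy-aimdo implementation: a plain `bytearray` stands in for the pinned HostBuffer, ordinary file reads stand in for the copy, and `CHUNK_SIZE`, `read_chunk` and `threaded_load` are hypothetical names made up for this example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical chunk size, not taken from comfy-aimdo


def read_chunk(path, view, offset, length):
    """Fill view[offset:offset+length] from the same byte range of the file."""
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        done = 0
        while done < length:
            # Raw reads may return fewer bytes than requested, so loop until full.
            n = f.readinto(view[offset + done:offset + length])
            if not n:
                raise EOFError(f"short read at offset {offset + done}")
            done += n
    return done


def threaded_load(path, num_threads=8, chunk_size=CHUNK_SIZE):
    """Read a whole file into one preallocated buffer using a pool of threads."""
    size = os.path.getsize(path)
    staging = bytearray(size)  # stand-in for a pinned HostBuffer (assumption)
    view = memoryview(staging)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(read_chunk, path, view, off, min(chunk_size, size - off))
            for off in range(0, size, chunk_size)
        ]
        total = sum(f.result() for f in futures)
    if total != size:
        raise IOError("file changed size while loading")
    return staging


# Small self-check: write a temp file, load it with 4 threads, compare.
payload = os.urandom(3 * 1024 * 1024 + 123)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
loaded = threaded_load(tmp.name, num_threads=4, chunk_size=1024 * 1024)
os.unlink(tmp.name)
print(bytes(loaded) == payload)
```

Each worker opens its own file handle, so no seek position is shared between threads. CPython releases the GIL during file reads, which is what lets the reads overlap in this sketch.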

Here is what it looks like after:

TLDR: Denser blue. Denser dark green. No bright green. Smaller gaps == Faster load.

The pthread_cond_wait() is the main thread parking itself as it waits for the fleet of copy threads to do the copy. Notice the cond_wait is overlapping the GPU transfer (dark green), so it is effectively using non-blocking GPU DMA to read the next weight from the disk as the previous one copies. The goal is as much time reading from disk as possible (more pthread_cond_wait == good).
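The overlap described here is a classic two-stage pipeline. The sketch below shows the general shape only, under loud assumptions: `pipelined_load`, `fake_read` and `fake_transfer` are hypothetical, a single background thread plays the role of the disk reader, and the "transfer" is just a function call rather than a non-blocking GPU DMA.

```python
from concurrent.futures import ThreadPoolExecutor


def pipelined_load(names, read_weight, transfer_weight):
    """Read weight i+1 in the background while weight i is being transferred."""
    if not names:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as reader:
        pending = reader.submit(read_weight, names[0])
        for next_name in names[1:]:
            data = pending.result()                          # wait for the current read
            pending = reader.submit(read_weight, next_name)  # start the next read
            results.append(transfer_weight(data))            # runs while that read proceeds
        results.append(transfer_weight(pending.result()))    # last weight
    return results


# Trivial stand-ins so the sketch runs anywhere (no disk or GPU involved).
def fake_read(name):
    return name.upper()


def fake_transfer(data):
    return data


out = pipelined_load(["w0", "w1", "w2"], fake_read, fake_transfer)
print(out)
```

In the PR the roles are arranged differently (the main thread parks while the copy threads work and the GPU DMA runs asynchronously), but the goal is the same: the disk should rarely sit idle waiting for the previous weight to finish transferring.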

RAM behaviors and Interaction with --cache-ram (NEW)

This work significantly changes the caching of models with respect to pins. To compensate for the lower tendency to RAM-cache models without an MMAP, the pinned memory pool is expanded on Windows to go above the 50% shared memory threshold, and a pressure mechanism is introduced to move pins from one resident model to another.

The catch is that this is all committed memory on Windows. To compensate for that, we deliver on the long-held goal of making --cache-ram the default caching mode for ComfyUI. Existing workflows that used to ride the MMAP paging semantics instead cache what they can with committed (incl. over-shared-limit) pins, and both cached intermediates and these model pins respect the RAM cache threshold.

There was a missing feature in --cache-ram to properly manage the amount of RAM that background workflows are allowed to accrue. This is added as a second argument to --cache-ram, with a default of 25%.

A few loose ends on the semantics of --cache-ram are tied off.

The general priority for RAM occupancy is:

(HIGHEST)
1: Intermediates of the active workflow
2: Pinned memory for the current model
3: Pinned memory for inactive models
4: Intermediates of inactive workflows
5: Pins of models in inactive workflows (freed via total MP destruction in cache itself)
(LOWEST)
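To make the ordering concrete, here is a small sketch of lowest-priority-first eviction over the five tiers listed above. The names (`PRIORITY`, `Occupant`, `pick_evictions`) and the sizes are hypothetical, and the sketch does not model which tiers are ever allowed to be evicted in practice; it only shows the order in which candidates would be considered.

```python
from dataclasses import dataclass

# Lower number == higher priority, mirroring the list above.
PRIORITY = {
    "active_intermediates": 1,
    "current_model_pins": 2,
    "inactive_model_pins": 3,
    "inactive_intermediates": 4,
    "inactive_workflow_model_pins": 5,
}


@dataclass
class Occupant:
    name: str
    kind: str
    size_mb: int


def pick_evictions(occupants, needed_mb):
    """Choose occupants to free, lowest priority first, until needed_mb is covered."""
    victims = []
    freed = 0
    for occ in sorted(occupants, key=lambda o: PRIORITY[o.kind], reverse=True):
        if freed >= needed_mb:
            break
        victims.append(occ)
        freed += occ.size_mb
    return victims, freed


occupants = [
    Occupant("latents", "active_intermediates", 500),
    Occupant("current model pins", "current_model_pins", 8000),
    Occupant("other model pins", "inactive_model_pins", 2000),
    Occupant("old images", "inactive_intermediates", 1500),
    Occupant("old workflow pins", "inactive_workflow_model_pins", 1000),
]
victims, freed = pick_evictions(occupants, needed_mb=3000)
print([v.name for v in victims], freed)
```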

Finer details and other changes:

Even for non-offloaded weights, this system is still used. The offload streams now have an associated re-usable pin_stream_buffer that each transfer can use to stage and transfer weights. This is used for first step loads, or when normal pinned memory is fully exhausted.

This created the need to implement a deeper priority hierarchy for pinned memory with JIT memory pressure release. The stream pin buffer is top priority and will trash all other pins to make space on demand. Active models then take precedence over inactive models. This also fixes a bug where pins were not being freed from inactive models on Windows (CORE-164). Since model loading now fully bypasses the mmap, there is no need for the RAM-based pressure mechanism on model load, so it is removed along with the pin memory pressure mechanism. The now-unused Windows-specific memory logic is cleaned up.

Making this new pinning logic interact with regular ModelPatcher pins is awkward, so we take a first step towards deprecation on non-dynamic VRAM by switching off smart memory completely for non-dynamic. Going forward, new models should not need VRAM estimates (as dynamic does it dynamically) so avoid needing such estimates while solving our pin interaction problem (CORE-152).

Performance following this enhancement was then immediately gated on Loras still being unpinned, so follow up with general pinned-memory support for Loras which was deferred in the original lora async work in #13618. (CORE-165)

Matrix test results:

Windows RTX5060, 16GB RAM, PCIE4x4 NVME
Wan 2.2 2x14B
Varying precision and RAM
2 runs, disk cache is forced cold before first run

full workflow execution time in seconds
red - before, green after

Same test on PCIe Gen1 (slow disk test)

Example test conditions:

Windows RTX5060, 16GB RAM, PCIE4x4 NVME
LTX2.3 FP8 template workflow (with Lora), 720x520x121f

Before:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [08:59<00:00, 67.40s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:37<00:00, 72.58s/it]

Prompt executed in 00:13:59

After:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [02:11<00:00, 16.43s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:56<00:00, 18.96s/it]

Prompt executed in 246.86 seconds

When running without the lora:

LTX2.3 FP8 T2V 360Px121f, Windows, RTX5060, 16GB RAM

Before:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:48<00:00, 28.56s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:42<00:00, 34.26s/it]

Prompt executed in 464.36 seconds

After:

Requested to load LTXAV
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:08<00:00,  8.60s/it]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00,  8.95s/it]

Prompt executed in 143.79 seconds

Example test conditions:

Linux, DGX spark
Flux2 FP8 in 3 different quants (forces 3x models)

Before (step speed ok but huge time spent in loading):

model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:48<00:00,  2.42s/it]
Requested to load AutoencoderKL
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:50<00:00,  2.51s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:48<00:00,  2.41s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Prompt executed in 00:14:10

Before, with --disable-dynamic-vram (OOM killed):

Using mixed precision operations
model weight dtype torch.float8_e5m2, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Unloaded partially: 33080.59 MB freed, 0.00 MB remains loaded, 4480.00 MB buffer reserved, lowvram patches: 0
Killed

After:

model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.35s/it]
Requested to load AutoencoderKL
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.36s/it]
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.bfloat16, manual cast: torch.bfloat16
model_type FLUX
Requested to load Flux2
Model Flux2 prepared for dynamic VRAM loading. 33813MB Staged. 0 patches attached. Force pre-loaded 128 weights: 39 KB.
100%|█████████████████████████████████████████████████████████████████████████████| 20/20 [00:47<00:00,  2.37s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 160MB Staged. 0 patches attached.
Prompt executed in 226.51 seconds
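
For convenience, the end-to-end speedups implied by the "Prompt executed in" lines above can be worked out as below. The timings are copied from the logs in this description; `to_seconds` is just a helper written for this example.

```python
def to_seconds(text):
    """Parse either 'HH:MM:SS' or a plain seconds value as printed in the logs."""
    if ":" in text:
        hours, minutes, seconds = (int(part) for part in text.split(":"))
        return hours * 3600 + minutes * 60 + seconds
    return float(text)


# (before, after) "Prompt executed in" values from the logs above.
runs = {
    "LTX2.3 FP8 with Lora (Windows, RTX5060)": ("00:13:59", "246.86"),
    "LTX2.3 FP8 T2V without Lora (Windows, RTX5060)": ("464.36", "143.79"),
    "Flux2 FP8, 3 quants (Linux, DGX Spark)": ("00:14:10", "226.51"),
}

speedups = {
    name: to_seconds(before) / to_seconds(after)
    for name, (before, after) in runs.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster")
```

That works out to roughly 3.4x, 3.2x and 3.75x respectively for full workflow execution.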

Regression tests:

Linux, 5090, 96GB LTX2.3 ✅
Linux, 5090, 96GB, Ace step 1.5 turbo XL ✅
Linux, 5090, 96GB, stable cascade -> Flux2 ✅
Linux, 5090, 96GB, ZIT w/lora ✅

socket-security · 2026-05-08T14:35:05Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	comfy-aimdo@0.3.0 ⏵ 0.4.3	⁺¹


rattus128 · 2026-05-09T01:34:14Z

This is bad:

qwen 2048x2048 on my 5060.

There is a sync in there that shouldn't be happening. Pretty sure it's the per-stream pre-buffer use syncing in too dumb a way.

Disable smart memory outright for non-dynamic models. This is a minor step towards deprecation of --disable-dynamic-vram and the legacy ModelPatcher. This is needed for estimate-free model development, where new models can opt out of supplying a memory estimate and not have to worry about hard VRAM allocations due to legacy non-dynamic model patchers. This is also a general stability increase for a lot of stray use cases where estimates may still be off, and going forward we are not going to accurately maintain such estimates.

Use a single growable buffer so we can do threaded pre-warming on pinned memory.

Aimdo implements a faster threaded loader.

Introduce per-offload-stream HostBuffer reuse for pinned staging, include it in cast buffer reset synchronization. Defer actual casts that go via this pin path to a separate pass such that the buffer can be allocated monolithically (to avoid cudaHostRegister thrash).

Replace the predictive pin pressure mechanism with JIT PIN memory pressure.

This was defeatured in aimdo iteration

rattus128 · 2026-05-12T13:30:43Z

This is bad:

qwen 2048x2048 on my 5060.

There is a sync in there that shouldn't be happening. Pretty sure it's the per-stream pre-buffer use syncing in too dumb a way.

This is fixed.

coderabbitai · 2026-05-12T13:38:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ced4184c-9bbd-4f19-b2c3-44eb29216aa1

📥 Commits

Reviewing files that changed from the base of the PR and between 09a98a9 and c836a5d.

📒 Files selected for processing (2)

tests/execution/test_async_nodes.py
tests/execution/test_execution.py

📝 Walkthrough

Walkthrough

Reworks prefetch/prepare to write into shared destination slices and adds stream-scoped host-backed pin buffers with headroom and allocation helpers. Introduces mmap-dirtiness tracking (DIRTY_MMAPS) and pin-budget eviction (free_pins, ensure_pin_budget, ensure_pin_registerable). Low‑VRAM patch prepare signatures changed; pinned-memory APIs became subset-aware and unpin was removed. cast/ops now queue and flush stream pin transfers via host buffers. memory_management added a HostBuffer fast-path and extra_ram_release free_active flag. CLI/cache logic now supports active/inactive RAM thresholds; execution frees pins on RAM shortfall. requirements bump to comfy-aimdo 0.4.3.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main objective: implementing multi-threaded model loading from disk with performance improvements and offload-to-disk functionality, supported by the referenced ticket numbers.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the motivation, implementation approach, RAM caching behavior changes, performance metrics, and test results.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@comfy/model_management.py`:
- Around line 1193-1200: The call to offload_stream.synchronize() inside
get_pin_buffer causes unwanted stream-wide blocking when reusing
STREAM_PIN_BUFFERS and should be removed to allow overlapping disk reads,
pinned-memory staging and GPU transfers (matching the non-blocking pattern used
in get_aimdo_cast_buffer()); update get_pin_buffer to drop the
offload_stream.synchronize() call when reusing an existing buffer, or if
synchronization is truly required, document in a clear comment why it is
necessary and add a targeted synchronization mechanism (e.g., per-buffer
event/query) instead of synchronizing the entire offload_stream; reference
get_pin_buffer, STREAM_PIN_BUFFERS, offload_stream.synchronize(),
get_offload_stream and get_aimdo_cast_buffer when making the change.

In `@comfy/pinned_memory.py`:
- Around line 19-29: The HostBuffer.extend call is using a relative increment
but expects an absolute target size, causing truncation and out-of-bounds
slices; update the extend invocation in the block that sets module._pin so it
grows to the absolute new size (current hostbuf.size + requested size) instead
of passing size directly. Specifically, compute the absolute target (e.g.,
offset + size or hostbuf.size + size) and pass that to hostbuf.extend(size=...),
keeping the subsequent
comfy_aimdo.torch.hostbuf_to_tensor(hostbuf)[offset:offset+size] slice and the
assignment to module._pin and module._pin.untyped_storage()._comfy_hostbuf
unchanged. Ensure use of the same behavior as
resize_pin_buffer()/pin_buffer.extend(...).
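
To illustrate the absolute-versus-relative distinction this comment is about, here is a toy sketch. `GrowableBuffer` is a hypothetical stand-in, not the aimdo HostBuffer; it simply assumes, as the comment states, that `extend(size=...)` takes an absolute target size.

```python
class GrowableBuffer:
    """Toy growable buffer whose extend() takes an absolute target size (assumption)."""

    def __init__(self):
        self.data = bytearray()

    @property
    def size(self):
        return len(self.data)

    def extend(self, size):
        """Grow the buffer so its total size is at least `size` bytes."""
        if size > self.size:
            self.data.extend(bytes(size - self.size))


def allocate(buf, nbytes):
    """Reserve nbytes at the end of buf and return the (start, end) byte range."""
    offset = buf.size
    buf.extend(size=offset + nbytes)  # absolute target, not the increment
    return offset, offset + nbytes


good = GrowableBuffer()
first = allocate(good, 100)
second = allocate(good, 50)
print(first, second, good.size)

# Passing the increment to an absolute-size API silently fails to grow:
bad = GrowableBuffer()
bad.extend(size=100)
bad.extend(size=50)            # meant "50 more bytes", but 50 < current size
print(bad.size, len(bad.data[100:150]))
```

With the relative call, the second slice comes back empty, which matches the truncation and out-of-bounds symptom described above.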

In `@requirements.txt`:
- Line 26: The requirements file pins a non-existent package version
"comfy-aimdo==0.4.0" which will fail installs; update the dependency line for
the comfy-aimdo package (the literal "comfy-aimdo" entry) to a published version
such as "comfy-aimdo==0.3.0" or remove the strict pin (e.g.,
"comfy-aimdo>=0.3.0") so installs succeed until 0.4.0 is published.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 33b744a4-878e-4d18-84e1-79c21c2872d0

📥 Commits

Reviewing files that changed from the base of the PR and between 20e4394 and 44c0a06.

📒 Files selected for processing (9)

comfy/lora.py
comfy/memory_management.py
comfy/model_management.py
comfy/model_patcher.py
comfy/ops.py
comfy/pinned_memory.py
comfy/utils.py
comfy/windows.py
requirements.txt

💤 Files with no reviewable changes (2)

comfy/windows.py
comfy/utils.py

This was syncing with the offload stream which itself is synced with the compute stream, so this was syncing CPU with compute transitively. Define the event to sync it more gently.

Pinning is more important than inactive intermediates and the stream pin buffer is more important than even active intermediates.

Add back proper pin freeing on RAM pressure

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

comfy/model_management.py (1)
1398-1399: 💤 Low value

Behavior change: pin_memory no longer hard‑rejects above MAX_PINNED_MEMORY.

Previously the function returned False once TOTAL_PINNED_MEMORY + size > MAX_PINNED_MEMORY. Now it best‑effort frees RAM cache + dynamic pins via ensure_pin_budget(size) and then unconditionally proceeds to cudaHostRegister. If ensure_pin_budget cannot make enough room (e.g., everything is the active stream pin, which the priority hierarchy keeps), the cap is silently exceeded — and PIN_PRESSURE_HYSTERESIS makes the actual ceiling effectively MAX_PINNED_MEMORY + 128 MiB even on the happy path.

This appears to be intentional given the PR's "pin priority hierarchy" notes, but it would be helpful to either (a) document MAX_PINNED_MEMORY as a soft target rather than a hard cap, or (b) bail when ensure_pin_budget couldn't free the shortfall, so we don't quietly drift past the user‑visible budget on systems with marginal RAM.
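
The two options can be contrasted with a small model. This is a sketch only: `PinBudget` and its fields are hypothetical and do not mirror the real `pin_memory` / `ensure_pin_budget` code, and the hysteresis constant is left out.

```python
class PinBudget:
    """Toy model of a pinned-memory budget with best-effort eviction (hypothetical)."""

    def __init__(self, max_pinned, total_pinned, evictable):
        self.max_pinned = max_pinned
        self.total_pinned = total_pinned  # currently pinned, includes evictable
        self.evictable = evictable        # portion that may be freed on demand

    def ensure_budget(self, size):
        """Free evictable pins until `size` fits or nothing evictable is left."""
        shortfall = self.total_pinned + size - self.max_pinned
        freed = max(0, min(shortfall, self.evictable))
        self.evictable -= freed
        self.total_pinned -= freed
        return freed

    def pin(self, size, hard_cap):
        """Pin `size` units; with hard_cap=True, refuse if the cap would be exceeded."""
        self.ensure_budget(size)
        if hard_cap and self.total_pinned + size > self.max_pinned:
            return False
        self.total_pinned += size
        return True


# Option (b): bail when eviction cannot cover the shortfall.
hard = PinBudget(max_pinned=1000, total_pinned=900, evictable=100)
hard_ok = hard.pin(300, hard_cap=True)

# Option (a): treat the cap as a soft target and proceed anyway.
soft = PinBudget(max_pinned=1000, total_pinned=900, evictable=100)
soft_ok = soft.pin(300, hard_cap=False)

print(hard_ok, hard.total_pinned, soft_ok, soft.total_pinned)
```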
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@comfy/model_management.py` around lines 1398 - 1399, pin_memory's behavior
changed to allow exceeding MAX_PINNED_MEMORY because it calls
ensure_pin_budget(size) then proceeds to cudaHostRegister regardless; update
pin_memory to verify the budget after calling ensure_pin_budget and bail if the
shortfall remains: after ensure_pin_budget(size) check if TOTAL_PINNED_MEMORY +
size <= MAX_PINNED_MEMORY (or account for PIN_PRESSURE_HYSTERESIS if intended)
and if not, return False (or raise) instead of proceeding to cudaHostRegister so
the cap remains a hard limit; reference the pin_memory function,
ensure_pin_budget, TOTAL_PINNED_MEMORY, MAX_PINNED_MEMORY,
PIN_PRESSURE_HYSTERESIS, and cudaHostRegister when making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@comfy/model_management.py`:
- Around line 516-525: In free_pins(), avoid indexing into
model.model.dynamic_pins[model.load_device] directly; instead safely
.get(model.load_device) and check for None before accessing ["active"] so
incomplete initialization won't raise; update the condition in free_pins
(referencing current_loaded_models, loaded_model.model, model.is_dynamic(),
model.partially_unload_ram) and apply the same defensive .get() pattern to the
other two occurrences noted around lines where dynamic pin checks occur (the
other methods that reference model.model.dynamic_pins[...] at the end of file)
to ensure you only call partially_unload_ram when pin_state is present and its
"active" flag is evaluated.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 79af7e61-7191-4ca1-9529-c80ae1a90494

📥 Commits

Reviewing files that changed from the base of the PR and between 44c0a06 and 3f71781.

📒 Files selected for processing (5)

comfy/memory_management.py
comfy/model_management.py
comfy/ops.py
comfy/pinned_memory.py
execution.py

🚧 Files skipped from review as they are similar to previous changes (2)

comfy/ops.py
comfy/pinned_memory.py

coderabbitai · 2026-05-13T12:33:13Z

+def free_pins(size, evict_active=False):
+    if size <= 0:
+        return
+
+    for loaded_model in reversed(current_loaded_models):
+        model = loaded_model.model
+        if model is not None and model.is_dynamic() and (evict_active or not model.model.dynamic_pins[model.load_device]["active"]):
+            size -= model.partially_unload_ram(size)
+            if size <= 0:
+                break


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm dynamic_pins is always populated for load_device before any code path that might
# trigger ensure_pin_budget/free_pins.
ast-grep --pattern 'dynamic_pins[$_] = $_'
rg -nP --type=py -C3 '\bdynamic_pins\b'

Repository: Comfy-Org/ComfyUI

Length of output: 3537

🏁 Script executed:

#!/bin/bash
# Understand the class structure where dynamic_pins is initialized
cd comfy
# Find the class containing the constructor with the initialization
rg -B20 'if self.load_device not in self.model.dynamic_pins:' model_patcher.py | head -40
# Check what function/method contains the problematic line 522
rg -B10 'def free_pins' model_management.py
# Find where current_loaded_models is populated
rg 'current_loaded_models.append|current_loaded_models +=' model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 1511

🏁 Script executed:

#!/bin/bash
# Find is_dynamic() method definition
rg 'def is_dynamic' comfy/
# Check where models are added to current_loaded_models with more context
rg -B5 -A5 'current_loaded_models.append|current_loaded_models +=' comfy/model_management.py | head -50
# Check if ModelPatcherDynamic is the only class with is_dynamic
rg 'class.*ModelPatcher' comfy/model_patcher.py
# Look for any guard/check before accessing dynamic_pins in other places
rg -C2 'is_dynamic' comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 1513

🏁 Script executed:

#!/bin/bash
# Get full definition of is_dynamic methods
rg -A5 'def is_dynamic' comfy/model_patcher.py
# Check the LoadedModel class and when models are added
rg -B10 -A10 'class LoadedModel' comfy/model_management.py
# Find where models are loaded and added to current_loaded_models
rg -B8 -A8 'current_loaded_models.append.*LoadedModel' comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 969

🏁 Script executed:

#!/bin/bash
# Find where models get loaded/added to current_loaded_models
rg -B15 'loaded_model = LoadedModel' comfy/model_management.py | head -50
# Check for load_model function and how models are prepared
rg 'def load_model' comfy/model_management.py
# Look for any git history or PR context - check if this is new code
git log --oneline comfy/model_management.py 2>/dev/null | head -5 || echo "Git log not available"
git diff HEAD^ comfy/model_management.py 2>/dev/null | grep -A10 -B10 "free_pins\|dynamic_pins" | head -50 || echo "Git diff not available"

Repository: Comfy-Org/ComfyUI

Length of output: 656

🏁 Script executed:

#!/bin/bash
# Check if there's any code that changes load_device after construction
rg 'load_device\s*=' comfy/model_patcher.py | head -20
# Look for any re-attachment or device change logic
rg 'attach|detach|reload' comfy/model_patcher.py | grep -i 'def '
# Check if dynamic_pins gets updated for new devices anywhere
rg -B5 -A5 'dynamic_pins\[.*\] = {' comfy/
# Look at the full context of line 522 and around it
sed -n '510,535p' comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 1454

🏁 Script executed:

#!/bin/bash
# Fix the regex and search for dynamic_pins initialization pattern
rg -C5 'dynamic_pins\[' comfy/model_patcher.py | grep -E '(def |class |dynamic_pins)'
# Look for any code that might change load_device or reuse models across devices
rg 'model\.load_device' comfy/model_management.py | head -20
# Check if there's any scenario where a model might be used with different load_devices
rg -B10 'for loaded_model in.*current_loaded_models' comfy/model_management.py | head -40
# Search for attach/detach patterns that might change device association
rg -B3 -A8 'def.*attach|def.*prepare' comfy/model_patcher.py | head -60

Repository: Comfy-Org/ComfyUI

Length of output: 3858

🏁 Script executed:

#!/bin/bash
# Check if load_device can change or if models can be reused across devices
rg -B5 -A5 'load_device.*=' comfy/model_patcher.py | grep -A8 -B8 '__init__'
# Search for any scenario where a model might be attached/prepared with a new device
rg -C10 'load_device\s*=' comfy/model_management.py
# Check the git log to understand when this code was added
git log --all --full-history -p -- comfy/model_management.py 2>/dev/null | grep -A30 'def free_pins' | head -50 || echo "Checking commit history..."
# Look for any recent changes to dynamic_pins handling
git show HEAD:comfy/model_management.py 2>/dev/null | sed -n '516,525p' || git log -1 --format=%h comfy/model_management.py

Repository: Comfy-Org/ComfyUI

Length of output: 4627

🏁 Script executed:

#!/bin/bash
# Verify if load_device is ever reassigned after __init__
rg 'self\.load_device\s*=' comfy/model_patcher.py
# Check if there's any code path where a model could have load_device changed
git log -1 --format="%H %s" comfy/model_management.py
# Verify the exact changes in this commit
git show --stat comfy/model_management.py 2>/dev/null | head -20
# Check if prepare() method modifies load_device or affects dynamic_pins initialization
rg -B5 -A15 'def prepare\(' comfy/model_patcher.py | head -40

Repository: Comfy-Org/ComfyUI

Length of output: 1494

🏁 Script executed:

#!/bin/bash
# Final check: verify there's no other code path that could create a dynamic model without full initialization
rg 'is_dynamic\s*=\s*True|class.*Dynamic' comfy/ --type py
# Confirm that every place accessing dynamic_pins for write operations has it initialized
rg 'dynamic_pins\[.*\]\s*=' comfy/

Repository: Comfy-Org/ComfyUI

Length of output: 533

Use .get() to safely access dynamic_pins[load_device] in case initialization is incomplete.

While the current initialization logic in ModelPatcherDynamic.__init__() should populate dynamic_pins[load_device] before any dynamic model reaches free_pins(), defensive access is prudent in this RAM-pressure code path. Consider:

if model is not None and model.is_dynamic():
    pin_state = model.model.dynamic_pins.get(model.load_device)
    if pin_state is not None and (evict_active or not pin_state["active"]):
        size -= model.partially_unload_ram(size)
        if size <= 0:
            break

This same pattern should also be applied at lines 1249 and 1251 in the same file for consistency.
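
For anyone wanting to try the suggested pattern in isolation, here is a self-contained sketch with stub objects. `StubModel` is a hypothetical stand-in for a dynamic model patcher, the loaded-model list is passed in explicitly rather than read from a global, and the return value (bytes freed) is added purely so the behaviour is easy to check.

```python
from types import SimpleNamespace


class StubModel:
    """Hypothetical stand-in for a dynamic model patcher; not the ComfyUI class."""

    def __init__(self, load_device, pinned, active, has_pin_state=True):
        self.load_device = load_device
        self.pinned = pinned
        pins = {load_device: {"active": active}} if has_pin_state else {}
        self.model = SimpleNamespace(dynamic_pins=pins)

    def is_dynamic(self):
        return True

    def partially_unload_ram(self, size):
        freed = min(size, self.pinned)
        self.pinned -= freed
        return freed


def free_pins(loaded_models, size, evict_active=False):
    """Free up to `size` of pins, walking the list in reverse; returns amount freed."""
    if size <= 0:
        return 0
    requested = size
    for loaded_model in reversed(loaded_models):
        model = loaded_model.model
        if model is None or not model.is_dynamic():
            continue
        # Defensive lookup: a missing entry is skipped instead of raising KeyError.
        pin_state = model.model.dynamic_pins.get(model.load_device)
        if pin_state is not None and (evict_active or not pin_state["active"]):
            size -= model.partially_unload_ram(size)
            if size <= 0:
                break
    return requested - size


active = StubModel("cuda:0", pinned=500, active=True)
inactive = StubModel("cuda:0", pinned=300, active=False)
uninitialised = StubModel("cuda:0", pinned=400, active=False, has_pin_state=False)
loaded = [SimpleNamespace(model=m) for m in (active, inactive, uninitialised)]

freed_inactive = free_pins(loaded, 400)                   # only the inactive model gives up pins
freed_forced = free_pins(loaded, 100, evict_active=True)  # now the active model may too
print(freed_inactive, freed_forced, active.pinned, uninitialised.pinned)
```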


frauttauteffasu · 2026-05-21T00:25:07Z

Yes, I think the defaults should only work in the DGX Spark's favour. Do you have any results one way or the other? For Spark it should only come into play with a lot of intermediates.

A cache-ram value above 0 seems to trigger model eviction due to memory pressure. With cache-ram set to 0, this saves 20 seconds or so per run, as the text encoder and diffusion model do not have to be reloaded. Is ComfyUI aware of unified memory systems?

zwukong · 2026-05-21T03:08:54Z

Please do not remove --disable-dynamic-vram. Dynamic VRAM is not stable, its RAM cost is huge, and it often gets stuck. My most effective and stable setting is: --disable-pinned-memory --disable-dynamic-vram --reserve-vram 2

zwukong · 2026-05-21T03:10:39Z

Most of your optimization is based on fp8 model i think, but most of us use gguf model, ram and disk saver, and good enough to run

kingp0dd · 2026-05-21T06:07:49Z

Most of your optimization is based on fp8 model i think, but most of us use gguf model, ram and disk saver, and good enough to run

is this detrimental to gguf? i still haven't got to test it

kingp0dd · 2026-05-21T06:39:08Z

For what it's worth, I tried this update just now, and the issue I described in city96/ComfyUI-GGUF#427 (comment) is gone. CLIP and UNET are now loaded from RAM instead of disk.

kingp0dd · 2026-05-21T06:41:23Z

from 160-200s, now it's 100-115s

m8rr · 2026-05-21T07:14:18Z

However, in my case, after applying this PR, ComfyUI completely reloads the CLIP from scratch even without any prompt changes, or it reloads the CLIP again instead of moving straight from Sampler 1 to Sampler 2.

But it's probably an issue with GGUF.
Before
city96/ComfyUI-GGUF#427 (comment)
After
city96/ComfyUI-GGUF#427 (comment)

#13802 (comment) --cache-ram 0 fixes this.

rattus128 · 2026-05-21T13:08:03Z

@frauttauteffasu is that performance issue you experience on DGX proportional to the size of the RAM cache, or does any size of RAM cache arg cause the issue? For example, if you --cache-ram 5 is it something in between or same perf?

frauttauteffasu · 2026-05-21T16:08:18Z

if you --cache-ram 5 is it something in between or same perf?

@rattus128 with --cache-ram 5 performance is the same as 0. The trigger point is 21: with that value or above, models are unloaded. There is no change in performance / memory usage until that point.

Brings in 18 commits from master so worksplit-multigpu does not regress fixes that landed on main since the last sync:

- #13699 Hunyuan 3D 2.1 batch-size fixes (overlap with our own backport; conflict resolved in favor of the shape>=2 gate that binds swap_cfg_halves once and reuses it for the output swap-back)
- #14031 ModelPatcherDynamic lora reshape / backup restore fix
- #13802 Multi-threaded model load (memory_management / pinned_memory / model_management / aimdo plumbing)
- #12679 lanczos single-channel tensor fix
- #14010 Stable Audio 3 support
- assorted partner-node, openapi, workflow-template, and tooling updates

Amp-Thread-ID: https://ampcode.com/threads/T-019e4a00-fe3d-76bd-a2f2-a8c8c4040082
Co-authored-by: Amp <amp@ampcode.com>

rattus128 · 2026-05-21T22:31:07Z

if you --cache-ram 5 is it something in between or same perf?

@rattus128 with --cache-ram 5 performance is the same as 0. The trigger point is 21: with that value or above, models are unloaded. There is no change in performance / memory usage until that point.

ok its likely workflow dependent too. what are your models?

m8rr · 2026-05-22T00:22:43Z

@zwukong It might just be that a feature that used to work well is broken now, but the new feature is getting better. Give it a try.

LTX2.3, 12G VRAM, 32G RAM, 1280x768x121 5/3steps, Run 3 times(Change the prompt each time)
0.22 stable + original GGUF node: 180s 180s 180s
0.22 stable + D-VRAM GGUF node: 145s 110s 110s
this PR(--cache-ram 0) + D-VRAM GGUF node: 127s 83s 80s

https://github.com/rattus128/ComfyUI-GGUF/tree/dynamic-vram

git clone -b dynamic-vram https://github.com/rattus128/ComfyUI-GGUF
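To put the timings reported above in relative terms, the per-run speedup over the 0.22 stable + original GGUF node baseline can be computed directly. This is only arithmetic on the numbers in the comment; the dictionary keys and the `speedups` helper are illustrative names.

```python
# Run times in seconds, copied from the comment above (three runs each).
timings = {
    "0.22 stable + original GGUF node": [180, 180, 180],
    "0.22 stable + D-VRAM GGUF node": [145, 110, 110],
    "this PR (--cache-ram 0) + D-VRAM GGUF node": [127, 83, 80],
}
baseline = timings["0.22 stable + original GGUF node"]


def speedups(runs, reference):
    """Return reference/run for each run, rounded to two decimals."""
    return [round(ref / run, 2) for ref, run in zip(reference, runs)]


for name, runs in timings.items():
    print(f"{name}: {speedups(runs, baseline)}")
```

On the third run, the PR configuration comes out at 2.25x the baseline and about 1.38x the 0.22 stable D-VRAM configuration (110s / 80s).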

zwukong · 2026-05-22T01:50:36Z

really? thanks for the gguf pr. i will try

frauttauteffasu · 2026-05-22T16:40:15Z

ok its likely workflow dependent too. what are your models?

ComfyUI's memory use with these models does use all but 20GB of the system's memory. Does --cache-ram 21 limit ComfyUI's memory use to total system memory - 21?

brendanhoar · 2026-05-24T18:45:20Z

Hopefully the author sees that there seem to be new performance problems arising under some circumstances with this PR after it was merged (see the SwarmUI one from today, for example): mcmonkeyprojects/SwarmUI#1391

Not to say I am not grateful for this work being done, but there are new issues now.

Ph0rk0z · 2026-05-25T11:38:50Z

I have spinning drives and not NVME but plenty of ram..there is no magic to make it load "faster". It is always great to assume.

Ph0rk0z · 2026-05-25T12:07:40Z

Ok.. so with some testing.. this hot garbage is loading from disk every gen on int8 klein with dynamic vram disabled. With dynamic vram enabled its like it was for regens but if the prompt changes the TE is loaded from DISK!?

So now I cannot disable dynamic vram nor can I use dynamic vram. And a workflow that has caching (e.g chromacache) is completely broken because caching doesn't work with dynamic vram.
My gens go from 7s on a new prompt to 10s while rerolls are the same exact speed prior to this commit. WTF! I'm reverting.

v.22 works fine.. after this.. not fine.

brendanhoar · 2026-05-25T16:09:59Z

Are these pull requests being tested across all the target OSes before merging? Linux, Windows, MacOS, etc.? Or just Linux?

…load to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117) (Comfy-Org#13802)

* model_management: disable non-dynamic smart memory
Disable smart memory outright for non dynamic models. This is a minor step towards deprecation of --disable-dynamic-vram and the legacy ModelPatcher. This is needed for estimate-free model development, where new models can opt-out of supplying a memory estimate and not have to worry about hard VRAM allocations due to legacy non-dynamic model patchers. This is also a general stability increase for a lot of stray use cases where estimates may still be off, and going forward we are not going to accurately maintain such estimates.

* pinned_memory: implement with aimdo growable buffer
Use a single growable buffer so we can do threaded pre-warming on pinned memory.

* mm: use aimdo to do transfer from disk to pin
Aimdo implements a faster threaded loader.

* Add stream host pin buffer for AIMDO casts
Introduce per-offload-stream HostBuffer reuse for pinned staging, include it in cast buffer reset synchronization. Defer actual casts that go via this pin path to a separate pass such that the buffer can be allocated monolithically (to avoid cudaHostRegister thrash).

* remove old pin path

* Implement JIT pinned memory pressure
Replace the predictive pin pressure mechanism with JIT PIN memory pressure.

* LowVRAMPatch: change to two-phase visit

* lora: re-implement as inplace swiss-army-knife operation

* prepare for multiple pin sets

* implement pinned loras

* requirements: comfy-aimdo 0.4.0

* ops: remove unused arg
This was defeatured in aimdo iteration.

* ops: sync the CPU with only the offload stream activity
This was syncing with the offload stream which itself is synced with the compute stream, so this was syncing CPU with compute transitively. Define the event to sync it more gently.

* pins: implement freeing intermediate for pinned memory
Pinning is more important than inactive intermediates and the stream pin buffer is more important than even active intermediates.

* execution: implement pin eviction on RAM presure
Add back proper pin freeing on RAM pressure.

* implement pin registration swaps
Uncap the windows pins from 50% by extending the pool and have a pressure mechanism to move the pin reservations on demand. This unfortunately implies a GPU sync to do the freeing, so significant hysteresis needs to be added to consolidate these pressure events.

* cli_args/execution: Implement lower background cache-ram threshold
Limit the amount of RAM background intermediates can use, so that switching workflows doesn't degrade performance too much.

* make default

* bump aimdo

* model-patcher: force-cast tiny weights
Flux 2 gets crazy stalls due to a mix of tiny and giant weights creating lopsided stream buffer rotations, which creates stalls.

* ops: refactor in prep for chunking

* mm: delegate pin-on-the-way to aimdo
Aimdo is able to chunk and slice this on the way for better CPU->GPU overlap. The main advantage is the ability to shorten the bus contention window between previous weight transfer and the next weights vbar fault.

* bump aimdo

* pinning updates

* specify hostbuf max allocation size
There are signs of virtual memory exhaustion on some linux systems when throwing 128GB for every little piece. Pass the actual to save aimdo from over-estimates.

* tests: update execution tests for caching
The default caching changed to ram-cache so update these tests accordingly. Remove the LRU 0 test as this also falls through to RAM cache.

frauttauteffasu · 2026-05-26T18:37:58Z

@Ph0rk0z and @brendanhoar any improvement with #14116? If not, with that pull applied, what if you add --cache-ram 0 or --cache-ram 1?

Ph0rk0z · 2026-05-26T22:07:44Z

I will try but I'm on linux and not windows. Technically cache ram should never have evicted anything since I have >96g of ram.

brendanhoar · 2026-05-27T00:52:55Z

@Ph0rk0z and @brendanhoar any improvement with #14116? If not, with that pull applied, what if you add --cache-ram 0 or --cache-ram 1?

Using:
git switch -
git fetch origin "pull/14116/head:pr/14116"
git switch pr/14116

pulled it down and...restarted, then loaded up a T2U2V workflow and when it loads LTX dev...

20:43:39.985 [Debug] [ComfyUI-0/STDERR] [WARNING] Warning: TAESD previews enabled, but could not find models/vae_approx/None
20:43:39.990 [Debug] [ComfyUI-0/STDERR] [INFO] Requested to load LTXAV
20:43:41.128 [Debug] [ComfyUI-0/STDERR] [INFO] Model LTXAV prepared for dynamic VRAM loading. 40050MB Staged. 1660 patches attached. Force pre-loaded 608 weights: 3303 KB.
20:43:50.850 [Debug] [ComfyUI-0/STDERR]
20:43:50.851 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s]
20:49:27.450 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s,   Model Initializing ...  ]
20:49:27.451 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [05:36<?, ?it/s,  Model Initialization complete!  ]

This is on first run from comfyui restart, but not machine restart. Task Manager shows no storage IO, so presumably it is taking 5.5 minutes (see timestamps for lines 5 to 6 above) to pull the model out of the Windows OS filecache and load it onto the card? Or maybe from storage?

On second run, I see more storage IO than the first time (with 512GB of RAM, why???), but it appears to be extremely low read rate, and this is the result when we get to LTX again, this time a six minute wait (see lines 3 and 4 below) to load the model from storage

20:54:03.677 [Debug] [ComfyUI-0/STDERR] [INFO] Model LTXAV prepared for dynamic VRAM loading. 40050MB Staged. 1660 patches attached. Force pre-loaded 608 weights: 3303 KB.
20:54:13.110 [Debug] [ComfyUI-0/STDERR]
20:54:13.113 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s]
21:00:44.042 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [00:00<?, ?it/s,   Model Initializing ...  ]
21:00:44.043 [Debug] [ComfyUI-0/STDERR]   0%|          | 0/6 [06:30<?, ?it/s,  Model Initialization complete!  ]
21:00:53.990 [Debug] [ComfyUI-0/STDERR]  17%|█▋        | 1/6 [06:30<32:34, 390.93s/it,  Model Initialization complete!  ]

Why can't we just keep the models in the OS cache or in somewhere controlled by comfyui? It seems like these keep getting ejected and then pulled from storage again for no freaking reason. Sometimes storage can be slow, why isn't this taken into consideration for these changes?
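The two stalls in the logs above can be measured from the timestamps themselves. The sketch below only does that arithmetic; `stall_seconds` is an illustrative helper, and the effective-rate figure assumes, purely for illustration, that the whole stall was spent moving the 40050MB reported as staged, which the logs do not confirm.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # log timestamp format, e.g. 20:43:50.851


def stall_seconds(start, end):
    """Elapsed seconds between two same-day log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the two log excerpts above.
first = stall_seconds("20:43:50.851", "20:49:27.450")
second = stall_seconds("20:54:13.113", "21:00:44.042")

STAGED_MB = 40050  # "40050MB Staged" from the log line

for label, secs in (("first run", first), ("second run", second)):
    # Assumption: the full stall is attributed to loading the staged weights.
    print(f"{label}: {secs:.3f} s, about {STAGED_MB / secs:.0f} MB/s if all of it was load time")
```

The first stall works out to 336.599 s and the second to 390.929 s, the latter agreeing with the 390.93s/it that tqdm reports.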

rattus128 mentioned this pull request May 11, 2026

Bypass safetensors mmap when --disable-mmap is set #13609

Closed

rattus128 added 12 commits May 12, 2026 11:04

pinned_memory: implement with aimdo growable buffer

157965a

Use a single growable buffer so we can do threaded pre-warming on pinned memory.

mm: use aimdo to do transfer from disk to pin

b66b642

Aimdo implements a faster threaded loader.

remove old pin path

1795523

Implement JIT pinned memory pressure

8187cd7

Replace the predictive pin pressure mechanism with JIT PIN memory pressure.

LowVRAMPatch: change to two-phase visit

2b927e1

lora: re-implement as inplace swiss-army-knife operation

8e473d7

prepare for multiple pin sets

e48dace

implement pinned loras

3a3b75a

requirements: comfy-aimdo 0.4.0

c395f2d

ops: remove unused arg

44c0a06

This was defeatured in aimdo iteration
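The idea behind the "mm: use aimdo to do transfer from disk to pin" commit above, several threads filling one destination buffer instead of a single thread faulting pages through mmap, can be sketched in plain Python. This is not comfy-aimdo's implementation: `threaded_read`, its parameters, and the use of a `bytearray` as a stand-in for a pinned host buffer are all assumptions made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def threaded_read(path, num_threads=4, chunk_size=1 << 20):
    """Read a whole file into one preallocated buffer using several threads."""
    size = os.path.getsize(path)
    buf = bytearray(size)      # stand-in for a pinned host buffer
    view = memoryview(buf)

    def read_chunk(offset):
        end = min(offset + chunk_size, size)
        # Each worker opens its own unbuffered handle so seeks do not race.
        with open(path, "rb", buffering=0) as f:
            f.seek(offset)
            pos = offset
            while pos < end:
                # readinto writes straight into the shared buffer slice.
                n = f.readinto(view[pos:end])
                if not n:
                    raise IOError("unexpected end of file")
                pos += n

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() forces evaluation so worker exceptions propagate here.
        list(pool.map(read_chunk, range(0, size, chunk_size)))
    return buf
```

File reads release the GIL in CPython, so the workers can keep multiple requests in flight against the disk. Whether that helps depends on the storage, as later comments in this thread about spinning drives point out.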

rattus128 force-pushed the dev/threaded-loader branch from 1d792e1 to 44c0a06 Compare May 12, 2026 01:55

rattus128 marked this pull request as ready for review May 12, 2026 13:30

rattus128 requested review from Kosinkadink, alexisrolland, comfyanonymous, guill and kijai as code owners May 12, 2026 13:30

coderabbitai Bot reviewed May 12, 2026

Comment thread comfy/model_management.py

Comment thread comfy/pinned_memory.py

Comment thread requirements.txt Outdated

rattus128 added 3 commits May 13, 2026 22:23

ops: sync the CPU with only the offload stream activity

ee927aa

This was syncing with the offload stream which itself is synced with the compute stream, so this was syncing CPU with compute transitively. Define the event to sync it more gently.

pins: implement freeing intermediate for pinned memory

d61026d

Pinning is more important than inactive intermediates and the stream pin buffer is more important than even active intermediates.

execution: implement pin eviction on RAM presure

3f71781

Add back proper pin freeing on RAM pressure

coderabbitai Bot reviewed May 13, 2026

rattus128 marked this pull request as draft May 13, 2026 15:14

rattus128 mentioned this pull request May 21, 2026

Feature Request: Auto-detect encrypted volumes and disable mmap to prevent system freezes #14006

Open

kingp0dd mentioned this pull request May 21, 2026

Dynamic VRAM support city96/ComfyUI-GGUF#427

Draft

Kosinkadink mentioned this pull request May 22, 2026

Defer @pollockjj's tiled-VAE and UPSCALE_MODEL MultiGPU lanes #14066

Merged

coderabbitai Bot mentioned this pull request May 23, 2026

ComfyUI fails to cache model to RAM, reloads from disk every run. #14076

Open

1 task

mcmonkey4eva mentioned this pull request May 24, 2026

SDXL models seem to reload on every generation mcmonkeyprojects/SwarmUI#1391

Closed

coderabbitai Bot mentioned this pull request May 26, 2026

Running with "--cpu" flag still tries to use Kornia #14107

Open

1 task

Ph0rk0z mentioned this pull request May 26, 2026

Threaded Loader performance fixes / improvements (+ Aimdo 0.4.6) #14116

Draft

coderabbitai Bot mentioned this pull request May 27, 2026

OOM error after commit 0a2dd86 #14126

Open

1 task


8 participants
