Dynamic VRAM support #427

Draft
rattus128 wants to merge 7 commits into city96:main from rattus128:dynamic-vram

rattus128 (Contributor) commented Mar 5, 2026

The new dynamic VRAM system in the ComfyUI core improves both RAM and VRAM management. Models are no longer offloaded from VRAM to RAM (which has a habit of becoming swap) and can now be loaded asynchronously on the first sampler iteration. This gives a significant speedup for big multi-model workflows on low-resource systems. VRAM offloading is handled on demand, so there is no longer any need for VRAM usage estimates.

The core has already upstreamed several of the resource saving features of GGUF in various forms.

  • The core linear layers are now initialized unallocated to avoid the naked commit charge for the empty tensor.
  • Models are loaded with assign=True to avoid a deep copy and committed memory on model load (GGUF does something similar, but with _load_state_dict hooking).
  • The sft (safetensors) file is mmapped read-only to avoid that commit charge. GGUF does this too.
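
The read-only mapping in the last bullet can be illustrated with the standard library alone. This is a minimal sketch, not the loader code from ComfyUI or ComfyUI-GGUF: the helper name `map_read_only` and the file contents are made up for the example. The point is only that a read-only file mapping stays backed by the file itself, so the OS can drop and re-read its pages instead of holding them as committed memory.

```python
import mmap
import os
import tempfile


def map_read_only(path):
    """Map a whole file read-only; its pages stay backed by the file on disk."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file. ACCESS_READ forbids writes through the map.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Stand-in for a weights file (hypothetical contents, not a real checkpoint).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"HEADER" + bytes(range(16)))

mapping = map_read_only(demo_path)
first_four = mapping[6:10]       # bytes are read from the mapping on demand

try:
    mapping[0] = 0               # writes are rejected on a read-only map
    write_rejected = False
except TypeError:
    write_rejected = True

mapping.close()
os.remove(demo_path)
```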

So this implements a QuantizedTensor backend and subclasses the new ModelPatcherDynamic to bring GGUF+dynamic without needing custom ops.

The patcher subclass is needed to unhook the LoRA so that it is applied on the fly. Otherwise it's just a matter of loading the state dict into the new QuantizedTensor and going.

This brings the full feature set of the core comfy caster to GGUF, including async offload (and async primary load), pinned memory, and now the dynamic management.

There's some boilerplate to implement downgrading back to ModelPatcher. This is needed for things like the torch compiler and hooks, where Dynamic VRAM support is TBD.

Still drafting; I will post some more performance results. I am going to pull a RAM stick and try some 16GB RAM flows with GGUF.

Example Test conditions:

WAN2.2 14B Q8 GGUF, 640x640x81f, RTX5090, Linux, 96GB, 2x Runs (disk caches warm with model first runs)

Before

Prompt executed in 60.31 seconds
Prompt executed in 55.99 seconds

After

Prompt executed in 48.75 seconds
Prompt executed in 43.35 seconds
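
For reference, the relative improvement implied by these timings can be worked out directly. The helper name `time_saved_pct` is only for illustration; the four timings are the ones quoted above.

```python
def time_saved_pct(before_s, after_s):
    """Percentage of wall-clock time saved going from before_s to after_s."""
    return (before_s - after_s) / before_s * 100.0


# (before, after) per run, in seconds, taken from the results above.
runs = [(60.31, 48.75), (55.99, 43.35)]
savings = [round(time_saved_pct(before, after), 1) for before, after in runs]
print(savings)  # [19.2, 22.6]
```

So on this setup both runs finish roughly a fifth faster.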

Vibe code. To be reviewed.

If in dynamic mode, load GGUF as a QT.

Refactor this to support the new reconstructability protocol in the
comfy core. This is needed for DynamicVRAM (to support legacy
demotion for fallbacks). Add the logic for dynamic_vram construction.

This is also needed for worksplit multi-gpu branch where the model
is deep-cloned via reconstruction to put the model on two parallel
GPUs.

Factor this out to a helper and implement the new core reconstruction
protocol. Consider the mmap_released flag 1:1 with the underlying model
such that it moves with the base model in model_override.

m8rr commented Mar 6, 2026

https://github.com/rattus128/ComfyUI-GGUF/tree/dynamic-vram

Is this the same thing?
I used the branch above, and an error occurs when using CLIPLoader (GGUF) with a GGUF model:


D:\AI\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --disable-api-nodes --output-directory E:\output --temp-directory E:\output
Setting output directory to: E:\output
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Checkpoint files will always be loaded safely.
Total VRAM 12282 MB, total RAM 32085 MB
pytorch version: 2.10.0+cu130
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 SUPER : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 14438.0
working around nvidia conv3d memory bug.
Using pytorch attention
aimdo: src-win/cuda-detour.c:77:INFO:aimdo_setup_hooks: found driver at 00007FFB60C00000, installing 4 hooks
aimdo: src-win/cuda-detour.c:61:DEBUG:install_hook_entrys: hooks successfully installed
aimdo: src/control.c:66:INFO:comfy-aimdo inited for GPU: NVIDIA GeForce RTX 4070 SUPER (VRAM: 12281 MB)
DynamicVRAM support detected and enabled
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.16.3
Setting temp directory to: E:\output\temp
ComfyUI frontend version: 1.39.19
[Prompt Server] web root: D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfyui_frontend_package\static
ComfyUI-GGUF: Allowing full torch compile

Import times for custom nodes:
   0.0 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\websocket_image_save.py
   0.0 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-KJNodes
   0.1 seconds: D:\AI\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-GGUF

Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.056s (created=0, skipped_existing=81, orphans_pruned=0, total_seen=85)
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
gguf qtypes: F32 (289), Q6_K (337)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 262208
Dequantizing token_embd.weight to prevent runtime OOM.
clip missing: ['multi_modal_projector.mm_input_projection_weight', 
....
....
'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias']
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
Model LTXAVTEModel_ prepared for dynamic VRAM loading. 50881MB Staged. 0 patches attached. Force pre-loaded 290 weights: 2995 KB.
!!! Exception during processing !!! shape '[4096, 3840]' is invalid for input of size 12902400
Traceback (most recent call last):
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 524, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 333, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 307, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\execution.py", line 295, in process_inputs
    result = f(**inputs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\nodes.py", line 80, in encode
    return (clip.encode_from_tokens_scheduled(tokens), )
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 313, in encode_from_tokens_scheduled
    pooled_dict = self.encode_from_tokens(tokens, return_pooled=return_pooled, return_dict=True)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd.py", line 377, in encode_from_tokens
    o = self.cond_stage_model.encode_token_weights(tokens)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\lt.py", line 167, in encode_token_weights
    out, pooled, extra = self.gemma3_12b.encode_token_weights(token_weight_pairs)
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 45, in encode_token_weights
    o = self.encode(to_encode)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 306, in encode
    return self(tokens)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\sd1_clip.py", line 279, in forward
    outputs = self.transformer(None, attention_mask_model, embeds=embeds, num_tokens=num_tokens, intermediate_output=intermediate_output, final_layer_norm_intermediate=self.layer_norm_hidden_state, dtype=torch.float32, embeds_info=embeds_info)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 794, in forward
    return self.model(input_ids, *args, **kwargs)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 719, in forward
    x, current_kv = layer(
                    ~~~~~^
        x=x,
        ^^^^
    ...<3 lines>...
        past_key_value=past_kv,
        ^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 605, in forward
    x, present_key_value = self.self_attn(
                           ~~~~~~~~~~~~~~^
        hidden_states=x,
        ^^^^^^^^^^^^^^^^
    ...<4 lines>...
        sliding_window=sliding_window,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\text_encoders\llama.py", line 466, in forward
    xq = self.q_proj(hidden_states)
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 373, in forward
    return self.forward_comfy_cast_weights(*args, **kwargs)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 365, in forward_comfy_cast_weights
    weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
                                   ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 228, in cast_bias_weight
    return cast_bias_weight_with_vbar(s, dtype, device, bias_dtype, non_blocking, compute_dtype, want_requant)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\ops.py", line 148, in cast_bias_weight_with_vbar
    comfy.model_management.cast_to_gathered(xfer_source, pin)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\model_management.py", line 1204, in cast_to_gathered
    dest_views = comfy.memory_management.interpret_gathered_like(tensors, r)
  File "D:\AI\ComfyUI_windows_portable\ComfyUI\comfy\memory_management.py", line 71, in interpret_gathered_like
    actuals[attr] = gathered[offset:offset+size].view(dtype=template.dtype).view(template.shape)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
RuntimeError: shape '[4096, 3840]' is invalid for input of size 12902400
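
The numbers in this error are consistent with the weight still being in its packed Q6_K form. In GGML's Q6_K layout, each super-block of 256 weights occupies 210 bytes (128 bytes of low bits, 64 of high bits, 16 scales, and a 2-byte fp16 scale), so a [4096, 3840] tensor packs into exactly 12902400 bytes. The sketch below only checks that arithmetic. It is one reading of the traceback, not a confirmed diagnosis, and `q6_k_packed_bytes` is a hypothetical helper, not part of ComfyUI or ComfyUI-GGUF.

```python
import math

QK_K = 256               # weights per Q6_K super-block
Q6_K_BLOCK_BYTES = 210   # 128 (ql) + 64 (qh) + 16 (scales) + 2 (fp16 d)


def q6_k_packed_bytes(shape):
    """Bytes needed to store a tensor of the given logical shape as Q6_K."""
    n_weights = math.prod(shape)
    if n_weights % QK_K:
        raise ValueError("Q6_K needs a multiple of 256 weights")
    return n_weights // QK_K * Q6_K_BLOCK_BYTES


logical_shape = (4096, 3840)              # shape from the error message
logical_elements = math.prod(logical_shape)
packed = q6_k_packed_bytes(logical_shape)
print(logical_elements)                   # 15728640 elements the view expects
print(packed)                             # 12902400, the "input of size" in the error
```

If that reading is right, the gathered buffer holds the quantized bytes while the template describes the dequantized shape, which would explain why the final `.view(template.shape)` fails.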

m8rr commented Mar 15, 2026

This version definitely has a speed boost. However, if you're getting errors with the GGUF text encoder like me, try modifying the code as follows. With this change only the text encoder operates the old way, so it should serve as a good temporary workaround until the update.

nodes.py, around line 206 (False -> True):

def _load_gguf_clip_patcher(clip_paths, clip_type, disable_dynamic=True):
    return _load_gguf_clip(clip_paths, clip_type, disable_dynamic=disable_dynamic).patcher

def _load_gguf_clip(clip_paths, clip_type, disable_dynamic=True):

kingp0dd commented

That means it's already working. How much (in %) did it save you?

m8rr commented Mar 17, 2026

without --disable-dynamic-vram

Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:29<00:00,  5.93s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.72s/it]
Requested to load AudioVAE
loaded completely; 1968.11 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 164.28 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:24<00:00,  4.81s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.69s/it]
Requested to load AudioVAE
loaded completely; 1934.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 90.59 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:23<00:00,  4.63s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.68s/it]
Requested to load AudioVAE
loaded completely; 1966.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 77.65 seconds

with --disable-dynamic-vram

Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
loaded partially; 9564.67 MB usable, 9525.25 MB loaded, 7689.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:30<00:00,  6.09s/it]
Unloaded partially: 620.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1287.84 MB freed, 7616.93 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.33s/it]
Requested to load AudioVAE
loaded completely; 2233.75 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 193.22 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.43s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:42<00:00, 14.15s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 97.68 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.25s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:41<00:00, 13.87s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 95.82 seconds
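
The `Prompt executed in ... seconds` lines from the two logs above can be compared run by run. A minimal sketch, assuming the log text is available as a string; the function name `prompt_times` is made up for the example, and the first run in each set also includes cold-start loading.

```python
import re

PROMPT_TIME = re.compile(r"Prompt executed in ([0-9.]+) seconds")


def prompt_times(log_text):
    """Return every 'Prompt executed in N seconds' value found in a log, in order."""
    return [float(m.group(1)) for m in PROMPT_TIME.finditer(log_text)]


# Timing lines copied from the two logs above.
dynamic_log = """
Prompt executed in 164.28 seconds
Prompt executed in 90.59 seconds
Prompt executed in 77.65 seconds
"""
legacy_log = """
Prompt executed in 193.22 seconds
Prompt executed in 97.68 seconds
Prompt executed in 95.82 seconds
"""

for dyn, old in zip(prompt_times(dynamic_log), prompt_times(legacy_log)):
    saved = (old - dyn) / old * 100.0
    print(f"{old:.2f}s -> {dyn:.2f}s ({saved:.1f}% less time)")
```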

kingp0dd commented Mar 17, 2026 via email

m8rr commented Mar 17, 2026

Check if the quant_ops.py file exists inside the ComfyUI-GGUF folder. If it’s not there, the installation wasn't done correctly.

I installed it like this:
git clone -b dynamic-vram https://github.com/rattus128/ComfyUI-GGUF
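
That check can be scripted. A minimal sketch, assuming the node is installed under `ComfyUI/custom_nodes/ComfyUI-GGUF`; the helper name `has_quant_ops` is made up here, and the only fact it relies on is that the dynamic-vram branch ships a `quant_ops.py` file.

```python
from pathlib import Path


def has_quant_ops(node_dir):
    """True if the ComfyUI-GGUF checkout at node_dir contains quant_ops.py."""
    return (Path(node_dir) / "quant_ops.py").is_file()


if __name__ == "__main__":
    # Adjust this path to your own installation (assumed location).
    node_dir = Path("ComfyUI") / "custom_nodes" / "ComfyUI-GGUF"
    if has_quant_ops(node_dir):
        print("quant_ops.py found: dynamic-vram branch files are present")
    else:
        print("quant_ops.py missing: re-clone with -b dynamic-vram")
```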

kingp0dd commented Mar 27, 2026

EDIT:

Sorry to bother. It works now after nuking my whole comfyui installation.


That's weird, I did exactly that, but it's still not activating.

I have that file:

~/comfy/ComfyUI/custom_nodes/ComfyUI-GGUF$ ls
dequant.py   LICENSE    nodes.py  __pycache__     quant_ops.py  requirements.txt
__init__.py  loader.py  ops.py    pyproject.toml  README.md     tools
~/comfy/ComfyUI/custom_nodes/ComfyUI-GGUF$ git status
On branch dynamic-vram
Your branch is up to date with 'origin/dynamic-vram'.

nothing to commit, working tree clean

But when loading the GGUF Q4KM Wan2.2, I still can't see the dynamic loading log:

Requested to load WAN21
loaded partially; 5395.22 MB usable, 5292.63 MB loaded, 4044.55 MB offloaded, 102.59 MB buffer reserved, lowvram patches: 0
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True

I know dynamic loading is enabled in ComfyUI because other models have that log:

Unloaded partially: 3616.85 MB freed, 1431.91 MB remains loaded, 276.90 MB buffer reserved, lowvram patches: 276
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.

I tried this in both v0.16.4 and the latest ComfyUI v0.18.2.

kingp0dd commented

I ran a few tests with Wan2.2 Q4KM GGUF. It seems that GGUF Dynamic VRAM is slower than non-dynamic.
These are all third or fourth runs. I also noticed that the CLIP model (which is a Q5KM GGUF) is consistently not loaded from cache, no matter how many consecutive runs I try. The RTXVideoSuperResolution node also finishes slower (100s), whereas with non-dynamic VRAM it finishes after only 2s.

Dynamic VRAM:

Requested to load WanTEModel
loaded completely; 5674.75 MB usable, 5129.20 MB loaded, full load: True
-----------------#29:1044 [CLIPTextEncode]: 14.30s - vram 5894787076b
0 models unloaded.
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 17.00s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:961 [KSamplerAdvanced]: 64.11s - vram 20b

Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:34<00:00, 17.05s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:968 [KSamplerAdvanced]: 45.32s - vram 3356487796b


Requested to load WanTEModel


loaded completely; 5672.75 MB usable, 5129.20 MB loaded, full load: True
-----------------#29:1044 [CLIPTextEncode]: 22.36s - vram 5894787076b
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
-----------------#29:960 [WanImageToVideo]: 7.20s - vram 2374470160b
-----------------#29:1076 [CFGZeroStar]: 0.00s - vram 0b
-----------------#29:1276 [PrimitiveInt]: 0.00s - vram 0b
Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:32<00:00, 16.47s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:961 [KSamplerAdvanced]: 74.66s - vram 20b
Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
Requested to load WAN21
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 9202MB Staged. 1054 patches attached.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.67s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:968 [KSamplerAdvanced]: 61.17s - vram 3358965876b

Without Dynamic VRAM:

Requested to load WAN21
loaded partially; 5661.19 MB usable, 5558.60 MB loaded, 3778.57 MB offloaded, 102.59 MB buffer reserved, lowvram patches: 0
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.85s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:961 [KSamplerAdvanced]: 39.28s - vram 2035224048b
Warning: TAESD previews enabled, but could not find models/vae_approx/lighttaew2_1
Requested to load WAN21
loaded partially; 5661.19 MB usable, 5558.60 MB loaded, 3778.57 MB offloaded, 102.59 MB buffer reserved, lowvram patches: 0
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.87s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#29:968 [KSamplerAdvanced]: 38.92s - vram 3326974068b
-----------------#1323 [PrimitiveInt]: 0.00s - vram 0b
-----------------#1321 [SomethingToString]: 0.00s - vram 0b
-----------------#1320 [XDateTimeString]: 0.00s - vram 0b
-----------------#1317 [StringFunction|pysssss]: 0.00s - vram 0b
-----------------#1324 [PreviewAny]: 0.00s - vram 0b
-----------------#29:1307 [XLatentSave]: 0.01s - vram 0b
-----------------#29:1285:1280 [easy ifElse]: 0.00s - vram 0b
Requested to load WanVAE
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
-----------------#29:963 [VAEDecode]: 10.03s - vram 3296513680b
-----------------#29:1263 [Context (rgthree)]: 0.00s - vram 0b
-----------------#29:1265 [Context Switch (rgthree)]: 0.00s - vram 0b
-----------------#29:1285:1280 [easy ifElse]: 0.00s - vram 0b
-----------------#29:1284:1099 [easy ifElse]: 0.00s - vram 0b
-----------------#1340 [RTXVideoSuperResolution]: 2.24s - vram 357212160b
-----------------#28 [VHS_VideoCombine]: 3.59s - vram 0b
-----------------#9 [VHS_PruneOutputs]: 0.00s - vram 0b
Prompt executed in 108.08 seconds

m8rr commented Mar 27, 2026

I ran Wan2.2 three times. Unlike LTX2.3, there wasn't a huge difference, but it doesn't seem any slower either.

Total VRAM 12282 MB, total RAM 32085 MB
pytorch version: 2.11.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 SUPER : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 14438.0
Using sage attention
aimdo: src-win/cuda-detour.c:77:INFO:aimdo_setup_hooks: found driver at 00007FFD38C80000, installing 4 hooks
aimdo: src-win/cuda-detour.c:61:DEBUG:install_hook_entrys: hooks successfully installed
aimdo: src/control.c:69:INFO:comfy-aimdo inited for GPU: NVIDIA GeForce RTX 4070 SUPER (VRAM: 12281 MB)
DynamicVRAM support detected and enabled
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.18.2
comfy-aimdo version: 0.2.12
comfy-kitchen version: 0.2.8
ComfyUI frontend version: 1.43.7
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.09s/it]
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:39<00:00, 49.68s/it]
Requested to load WanVAE
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
Prompt executed in 183.19 seconds
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.07s/it]
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:38<00:00, 49.45s/it]
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
Prompt executed in 156.86 seconds
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.11s/it]
0 models unloaded.
Model WAN21 prepared for dynamic VRAM loading. 8339MB Staged. 400 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:38<00:00, 49.48s/it]
0 models unloaded.
Model WanVAE prepared for dynamic VRAM loading. 242MB Staged. 0 patches attached. Force pre-loaded 52 weights: 28 KB.
Prompt executed in 140.45 seconds
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
loaded partially; 8181.12 MB usable, 8103.49 MB loaded, 370.42 MB offloaded, 96.28 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.26s/it]
gguf qtypes: F16 (694), Q4_K (356), Q5_K (44), F32 (1)
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 8473.91 MB offloaded, 527.89 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:41<00:00, 50.55s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 205.24 seconds
Requested to load WAN21
loaded partially; 8170.12 MB usable, 8089.41 MB loaded, 384.49 MB offloaded, 96.28 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:13<00:00,  6.80s/it]
Requested to load WAN21
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 8473.91 MB offloaded, 527.89 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:40<00:00, 50.42s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 155.54 seconds
Requested to load WAN21
loaded partially; 8170.12 MB usable, 8089.41 MB loaded, 384.49 MB offloaded, 96.28 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.61s/it]
Requested to load WAN21
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 8473.91 MB offloaded, 527.89 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:40<00:00, 50.44s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 144.12 seconds
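
The total run times in the logs above can be pulled out and compared with a short script. This is only a minimal sketch: `prompt_times` is an illustrative helper (not part of ComfyUI or ComfyUI-GGUF), and the two sample strings simply repeat the "Prompt executed" lines from the log that shows "prepared for dynamic VRAM loading" and from the log that shows "loaded partially".

```python
import re
from statistics import mean

# Matches the summary line ComfyUI prints at the end of each run.
PROMPT_RE = re.compile(r"Prompt executed in ([0-9.]+) seconds")


def prompt_times(log_text: str) -> list[float]:
    """Return every 'Prompt executed in N seconds' value, in log order."""
    return [float(value) for value in PROMPT_RE.findall(log_text)]


# Sample lines copied from the two logs above.
dynamic_log = """\
Prompt executed in 183.19 seconds
Prompt executed in 156.86 seconds
Prompt executed in 140.45 seconds
"""
partial_log = """\
Prompt executed in 205.24 seconds
Prompt executed in 155.54 seconds
Prompt executed in 144.12 seconds
"""

dynamic = prompt_times(dynamic_log)
partial = prompt_times(partial_log)

for label, times in (("dynamic VRAM", dynamic), ("loaded partially", partial)):
    # The first run includes model loading, so report it separately.
    print(f"{label}: first={times[0]:.2f}s later_mean={mean(times[1:]):.2f}s")
```

Separating the first run from the later ones keeps the initial model load from dominating the comparison.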

@jorismak

I hope this doesn't get merged, or that there is always a proper way to keep it disabled.
If ComfyUI really goes ahead and removes the option to disable dynamic-vram, the GGUF loader is currently the only workaround because it doesn't use it.

Even using --disable-dynamic-vram currently does not disable it 100% in 'core' ComfyUI, so on some systems GGUF is already the only type of model that works.

I also don't really get whether it's needed in the GGUF loader. It was already pretty good and smart about offloading.


Let's be clear (before I have to explain or edit this post):
The idea of the dynamic-vram patches sounds awesome, and it can enable some workflows that didn't work at all before... if your system is properly supported.

On my RTX Blackwell system, it's pretty neat. It's enabling workflows that I had to memory-juggle manually before or that just gave OOM errors.

It has also increased processing time for other workflows, most often the simpler ones using smaller models, because it keeps unloading and loading parts that otherwise fit nicely in memory.
Or, as an example, a simple flux2klein workflow where the text-encoder was loaded in RAM, and flux2klein was loaded in RAM + offloaded a bit. The model still ran fast, and I didn't have to reload the text-encoder for every new generation. So after changing a prompt or LoRA strength it was pretty fast to try another generation. Now it's way slower because it has to load the text-encoder again, then load the flux model again, then load the text model again, then load the flux model again, etc...

So the dynamic-vram work currently isn't a silver bullet. I still see lots of issues reported with it, and the fact that the ComfyUI maintainers say they are going to remove the disable option (which works for 90% or less) makes me a bit scared.

With the dynamic vram work activated, ComfyUI is now completely unusable on my Intel system. Everything makes my system go OOM and the OS kills ComfyUI. After switching to the GGUF loader (even with Q8 and F16 models), everything is fine.

It sometimes ignores --disable-dynamic-vram, or ignores it for some parts, it seems to ignore --reserve-vram, and it seems to be confused on unified memory / shared memory systems.

Until dynamic-vram in ComfyUI is stable and people are happy with it, please leave the GGUF loader alone. It's the only fallback.

@kingp0dd

@rattus128 may we know if this is still incomplete? My ltx2.3 workflow OOMs without using this PR branch.

It works great with the GGUF model, but not with GGUF CLIP. The CLIP always gets loaded from SSD and is not cached in RAM. If I use a safetensors CLIP, it gets loaded from RAM and is faster.

Edit: oh maybe the clip gguf is not yet implemented? #427 (comment)

@kingp0dd

without --disable-dynamic-vram

Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:29<00:00,  5.93s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.72s/it]
Requested to load AudioVAE
loaded completely; 1968.11 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 164.28 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:24<00:00,  4.81s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.69s/it]
Requested to load AudioVAE
loaded completely; 1934.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 90.59 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:23<00:00,  4.63s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.68s/it]
Requested to load AudioVAE
loaded completely; 1966.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 77.65 seconds

with --disable-dynamic-vram

Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
loaded partially; 9564.67 MB usable, 9525.25 MB loaded, 7689.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:30<00:00,  6.09s/it]
Unloaded partially: 620.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1287.84 MB freed, 7616.93 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.33s/it]
Requested to load AudioVAE
loaded completely; 2233.75 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 193.22 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:32<00:00,  6.43s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:42<00:00, 14.15s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 97.68 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.25s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:41<00:00, 13.87s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 95.82 seconds
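
The "loaded partially" lines in the log above carry the per-model memory split. Below is a minimal sketch of reading them, assuming the line format stays exactly as printed here; `PartialLoad` and `parse_partial_loads` are illustrative names, not ComfyUI APIs, and the two sample lines are copied from the log above.

```python
import re
from typing import NamedTuple


class PartialLoad(NamedTuple):
    """Numbers from one 'loaded partially' log line, all in MB."""
    usable_mb: float
    loaded_mb: float
    offloaded_mb: float
    buffer_mb: float

    @property
    def loaded_fraction(self) -> float:
        """Share of the model (loaded + offloaded) reported as loaded."""
        total = self.loaded_mb + self.offloaded_mb
        return self.loaded_mb / total if total else 0.0


PARTIAL_RE = re.compile(
    r"loaded partially; ([0-9.]+) MB usable, ([0-9.]+) MB loaded, "
    r"([0-9.]+) MB offloaded, ([0-9.]+) MB buffer reserved"
)


def parse_partial_loads(log_text: str) -> list[PartialLoad]:
    """Extract every 'loaded partially' entry from a block of log text."""
    return [
        PartialLoad(*(float(group) for group in match.groups()))
        for match in PARTIAL_RE.finditer(log_text)
    ]


sample = """\
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
loaded partially; 9564.67 MB usable, 9525.25 MB loaded, 7689.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
"""

for entry in parse_partial_loads(sample):
    print(f"{entry.loaded_mb:.2f} MB loaded, {entry.offloaded_mb:.2f} MB offloaded "
          f"({entry.loaded_fraction:.1%} loaded)")
```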

@m8rr May I ask if your CLIP is also GGUF? Does it cache to vram for you? In my tests, GGUF CLIP always loads from SSD even if I try Q3 and my RAM usage is well below 60%. My ltx model is Q4 GGUF and should fit in my RAM as well.

@m8rr

m8rr commented May 21, 2026

@kingp0dd
Yes, the CLIP text encoder uses GGUF as well. And dynamic VRAM isn't used for CLIP.
First generation: CLIP and the model are both loaded from the model drive.
Second and third generations (with changed prompts): CLIP is loaded from the virtual memory drive, while the model is loaded from the model drive.
At least on Windows Task Manager, it’s not reading the drive where the CLIP model is stored when loading CLIP.
So, in conclusion, it seems like it's loading CLIP from the page file in my case, which means I could say it's reading it from RAM.

got prompt
Requested to load LTXAVTEModel_
loaded partially; 7210.68 MB usable, 0.00 MB loaded, 14131.07 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.33s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:28<00:00,  9.39s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 176.88 seconds
got prompt
Requested to load LTXAVTEModel_
loaded partially; 7210.68 MB usable, 0.00 MB loaded, 14131.07 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:31<00:00,  6.26s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:27<00:00,  9.25s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 171.64 seconds

@kingp0dd

@m8rr do you use the ltx text projection safetensors along with the gemma clip gguf?

@m8rr

m8rr commented May 21, 2026

@kingp0dd
That's right, I'm using Gemma-Q6 GGUF and projection-BF16 safetensors. Honestly, I barely use them after testing. If I remember correctly, the default template's gemma_3_12B_it_fp4_mixed.safetensors felt faster when I tested it before. The file size is pretty much the same, too. Anyway, since GGUF CLIP doesn't seem to be supported yet, I don't think there's any need to stick with GGUF.

@kingp0dd

I use the same. I retested after updating to:
Comfy-Org/ComfyUI#13802

and now it works. Both CLIP and UNET are being loaded from RAM and not from disk.

@m8rr

m8rr commented May 21, 2026

Interesting! In my previous test, I was on version 0.22 stable. But now that I’ve updated to the latest commit, it’s reading the CLIP from the disk. It's behaving just like when I run Comfy for the very first time.

It even goes as far as stopping at sampler 1 and reloading the CLIP.

got prompt
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:23<00:00,  4.71s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:27<00:00,  9.24s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 68.23 seconds
got prompt
gguf qtypes: F32 (289), Q6_K (337)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 262208
Dequantizing token_embd.weight to prevent runtime OOM.
clip missing: ['multi_modal_projector.mm_input_projection_weight', 'multi_modal_projector.mm_soft_emb_norm.weight', 
...
'vision_model.encoder.layers.26.mlp.fc2.weight', 'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias']
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
loaded partially; 8462.35 MB usable, 495.93 MB loaded, 13636.29 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (211)
loaded partially; 8462.35 MB usable, 495.93 MB loaded, 13636.29 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:27<00:00,  5.50s/it]
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
gguf qtypes: F32 (289), Q6_K (337)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 262208
Dequantizing token_embd.weight to prevent runtime OOM.
clip missing: ['multi_modal_projector.mm_input_projection_weight', 'multi_modal_projector.mm_soft_emb_norm.weight', 
...
'vision_model.encoder.layers.26.mlp.fc2.weight', 'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias']
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
loaded partially; 8462.35 MB usable, 495.93 MB loaded, 13636.29 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (211)
loaded partially; 8462.35 MB usable, 495.93 MB loaded, 13636.29 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:25<00:00,  5.12s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:27<00:00,  9.06s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 290.53 seconds

@rattus128
Contributor Author

@m8rr @kingp0dd I have some time to look at this. Can you send me any of:

  • Code diffs you think you need.
  • Models you are using.
  • Workflows that are regressive.

And your hardware specs. Hopefully we can get something in that works faster for everyone, as the latest stuff should be significantly faster for gguf users, especially the lower end of hardware.

I'll be running on my RTX5060 with 16GB of RAM as a general rule, but I can reasonably match a fair few profiles if you have a regressive workflow, and we can go from there.

@m8rr

m8rr commented May 21, 2026

Modified code
git clone -b dynamic-vram https://github.com/rattus128/ComfyUI-GGUF

nodes.py line 206~ (disable_dynamic: False->True)

def _load_gguf_clip_patcher(clip_paths, clip_type, disable_dynamic=True):
    return _load_gguf_clip(clip_paths, clip_type, disable_dynamic=disable_dynamic).patcher

def _load_gguf_clip(clip_paths, clip_type, disable_dynamic=True):

In the ComfyUI default template, I only changed the loader and ran it twice, but it keeps reloading the CLIP from scratch.

D:\AI\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast fp16_accumulation --use-sage-attention --disable-api-nodes --output-directory E:\output --temp-directory E:\output
setup plugin alembic.autogenerate.schemas
setup plugin alembic.autogenerate.tables
setup plugin alembic.autogenerate.types
setup plugin alembic.autogenerate.constraints
setup plugin alembic.autogenerate.defaults
setup plugin alembic.autogenerate.comments
Setting output directory to: E:\output
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_mxfp8', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_mxfp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Checkpoint files will always be loaded safely.
Total VRAM 12282 MB, total RAM 32085 MB
pytorch version: 2.12.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 SUPER : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 12834.0
Using sage attention
aimdo: src-win/cuda-detour.c:38:INFO:aimdo_setup_hooks: installing 6 hooks
aimdo: src-win/cuda-detour.c:28:DEBUG:install_hook_entries: hooks successfully installed
aimdo: src-win/shmem-detect.c:80:INFO:comfy-aimdo WDDM adapter match: NVIDIA GeForce RTX 4070 SUPER runtime_luid=00000000:0000dab7 dxgi_luid=00000000:0000dab7
aimdo: src/control.c:237:INFO:comfy-aimdo inited for GPU: NVIDIA GeForce RTX 4070 SUPER (VRAM: 12281 MB)
DynamicVRAM support detected and enabled
Python version: 3.13.9 (tags/v3.13.9:8183fa5, Oct 14 2025, 14:09:13) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.22.0
comfy-aimdo version: 0.4.3
comfy-kitchen version: 0.2.8
Setting temp directory to: E:\output\temp
comfyui-frontend-package version: 1.43.18
comfyui-workflow-templates version: 0.9.79
comfyui-embedded-docs version: 0.5.0
comfy-kitchen version: 0.2.8
comfy-aimdo version: 0.4.3
[Prompt Server] web root: D:\AI\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfyui_frontend_package\static
Asset seeder disabled
ComfyUI-GGUF: Allowing full torch compile

got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
gguf qtypes: F32 (289), Q6_K (337)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 262208
Dequantizing token_embd.weight to prevent runtime OOM.
clip missing: ['multi_modal_projector.mm_input_projection_weight', 'multi_modal_projector.mm_soft_emb_norm.weight', 
'vision_model.encoder.layers.26.mlp.fc2.weight', 'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias']
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
loaded partially; 8543.00 MB usable, 576.58 MB loaded, 13556.30 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (166)
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:43<00:00,  5.47s/it]
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:30<00:00, 10.19s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 174.28 seconds
got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
gguf qtypes: F32 (289), Q6_K (337)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 262208
Dequantizing token_embd.weight to prevent runtime OOM.
clip missing: ['multi_modal_projector.mm_input_projection_weight', 'multi_modal_projector.mm_soft_emb_norm.weight', 
...
'vision_model.encoder.layers.26.mlp.fc2.weight', 'vision_model.post_layernorm.weight', 'vision_model.post_layernorm.bias']
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
loaded partially; 8474.88 MB usable, 508.46 MB loaded, 13623.49 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (230)
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:45<00:00,  5.72s/it]
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:30<00:00, 10.18s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 169.80 seconds

@kingp0dd

I turned off RAM swap. Now it's still able to load CLIP from cache, but the first-pass UNET is always loaded from disk; the second pass (upscale) is loaded from cache.

sudo sysctl vm.swappiness=0

def _load_gguf_clip_patcher(clip_paths, clip_type, disable_dynamic=True):
    return _load_gguf_clip(clip_paths, clip_type, disable_dynamic=disable_dynamic).patcher

def _load_gguf_clip(clip_paths, clip_type, disable_dynamic=True):

I also noticed an error when loading UNET model:

Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 13597MB Staged. 1660 patches attached. Force pre-loaded 608 weights: 6567 KB.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
aimdo: /project/src-posix/hostbuf-plat.c:25:ERROR:hostbuf_reserve_address_space: mmap reserve failed for 0 bytes
  0%|                                    | 0/8 [00:00<?, ?it/s,   Model Initializing ...  ]Interrupting prompt 4dc8589d-cb5a-48b2-bd7a-1e35fb650eb4

@kingp0dd

continuing from #427 (comment)

Now I tried using Gemma Q3KM (~4GB smaller than the Q6K), and everything loads from cache. I set sudo sysctl vm.swappiness=60 again (but I'll try turning it off again and run another test).

What I don't understand is that I still have unused RAM when using the Gemma Q6K, yet with it the models don't load from the RAM cache.

Requested to load LTXAVTEModel_
loaded partially; 9618.03 MB usable, 1651.62 MB loaded, 8998.94 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
-----------------#2277:2643 [CLIPTextEncode]: 4.82s - vram 3762354112b
-----------------#2277:2642 [ComfySwitchNode]: 0.00s - vram 0b
-----------------#2277:2254 [ConditioningZeroOut]: 0.00s - vram 0b
-----------------#2277:2256 [LTXVConditioning]: 0.00s - vram 0b
-----------------#2277:2227 [CFGGuider]: 0.00s - vram 0b
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 13597MB Staged. 1660 patches attached. Force pre-loaded 608 weights: 6567 KB.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
aimdo: /project/src-posix/hostbuf-plat.c:25:ERROR:hostbuf_reserve_address_space: mmap reserve failed for 0 bytes
100%|████████████████████████████████████████████████████████| 8/8 [00:41<00:00,  5.18s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#2277:2230 [SamplerCustomAdvanced]: 45.17s - vram 3193856b

Model LTXAV prepared for dynamic VRAM loading. 13597MB Staged. 1660 patches attached. Force pre-loaded 608 weights: 6567 KB.
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = True
aimdo: /project/src-posix/hostbuf-plat.c:25:ERROR:hostbuf_reserve_address_space: mmap reserve failed for 0 bytes
100%|████████████████████████████████████████████████████████| 3/3 [00:31<00:00, 10.54s/it]
Patching torch settings: torch.backends.cuda.matmul.allow_fp16_accumulation = False
-----------------#2277:2500 [SamplerCustomAdvanced]: 33.01s - vram 1469008732b

Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
-----------------#2277:2217 [LTXVAudioVAEDecode]: 0.42s - vram 941427987b
-----------------#2277:2785 [Context (rgthree)]: 0.00s - vram 0b
-----------------#2277:2789 [Context Switch (rgthree)]: 0.00s - vram 0b
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
-----------------#2277:2221 [VAEDecode]: 7.19s - vram 3090071420b
-----------------#2277:2274 [Context (rgthree)]: 0.00s - vram 0b
-----------------#2277:2223 [ComfySwitchNode]: 0.00s - vram 0b
-----------------#2277:2341 [ComfySwitchNode]: 0.00s - vram 0b
-----------------#2196 [VHS_VideoCombine]: 2.13s - vram 0b
-----------------#2182 [VHS_PruneOutputs]: 0.01s - vram 0b
Prompt executed in 97.76 seconds

@m8rr

m8rr commented May 21, 2026

Comfy-Org/ComfyUI#13802 (comment)
I added --cache-ram 0 after reading this post, and it seems to be working fine now.
When I changed the prompt, it got about 40 seconds faster compared to before. (From 175 down to 135)

CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load LTXAVTEModel_
loaded partially; 8543.00 MB usable, 576.58 MB loaded, 13556.30 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (166)
loaded partially; 8478.88 MB usable, 512.46 MB loaded, 13620.91 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:24<00:00,  4.84s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:27<00:00,  9.13s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 148.59 seconds
got prompt
Requested to load LTXAVTEModel_
loaded partially; 8462.35 MB usable, 495.93 MB loaded, 13636.29 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:26<00:00,  5.24s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00,  8.95s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 136.24 seconds
got prompt
Requested to load LTXAVTEModel_
loaded partially; 8462.35 MB usable, 495.93 MB loaded, 13636.29 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:25<00:00,  5.07s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:26<00:00,  8.99s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 138.89 seconds

One minor issue: I'm not sure which version it started with, but CLIP loading is quite slow and barely uses any VRAM.
[image: Untitled]

@rattus128
Contributor Author

@m8rr @kingp0dd I think I have made progress on this. I can see how the RAM cache is pressuring @m8rr out: you are right on the threshold of not being able to cache that, and the RAM cache pressure headroom makes the difference. I think I might send a PR to core to set the default RAM cache a bit higher. I won't go for 0, as we need the pressure for other cases and we don't have an everybody-wins solution just yet, but I think we can do better for this profile of hardware.

Can I get your generated resolutions to compare numbers?

Here is where I am on my 5060 at 1280x720x5s:

100%|██████████| 8/8 [01:15<00:00,  9.50s/it]                                   
Model LTXAV prepared for dynamic VRAM loading. 16915MB Staged. 0 patches attached. Force pre-loaded 608 weights: 6567 KB.
100%|██████████| 3/3 [01:48<00:00, 36.33s/it]                                   
Requested to load AudioVAE

Consider installing https://github.com/kijai/ComfyUI-MemoryVisualization to watch the load levels as you go.

With --cache-ram 2 and 8GB VRAM and 32GB RAM my retention of the TE is partial:

[image]

The missing piece is barely noticeable with the threaded loader, though, and I have downgraded my PCIe bus to gen 1 to simulate a slow disk.

@m8rr are you on SATA or HDD by any chance?

@rattus128
Contributor Author

rattus128 commented May 21, 2026

@m8rr your workaround outright disables dynamic VRAM for CLIP. I have put the bugfix on top of the branch now, so you should be able to undo your workaround and get dynamic CLIP too. With dynamic CLIP, your likelihood of getting swapped or completely dumped by the RAM cache is much lower.

--cache-ram 0 will still make it better. The question is how reliant you are on it with the CLIP fixed.

@m8rr

m8rr commented May 21, 2026

@rattus128 Sorry, I changed CLIP from q6 to q4. Anyway, I updated the node with the new commit and ran it three times with different prompts. The CLIP (both q4) loading speed is noticeably faster: it dropped from over 20 seconds to just a few seconds.

IGPU+EGPU(OcuLink), PCIE4 NVME SSD, 1280x768x121, 5/3steps

without --cache-ram: 12Xs, 12Xs(hard reset)
--cache-ram 0: 127s, 90s, 90s
--cache-ram 1: 128s, 87s, 78s
--cache-ram 2: 126s, 84s, 127s(hard reset)
--cache-ram 0: 129s, 82s, 73s
--cache-ram 2: 127s, hard reset

--cache-ram 0: Shared memory usage is high.
[image: Untitled]

--cache-ram 2: Shared memory usage is low.
[image: Untitled2]

@kingp0dd

kingp0dd commented May 21, 2026

@rattus128 seems CLIP doesn't want to pin.

[image: Screenshot_20260521_225152_Firefox]

@polarathene

polarathene commented May 21, 2026

> What i don't understand is i still have unused RAM when using the Gemma Q6K but when using that, the models don't load from RAM cache.

@kingp0dd is this linux host or WSL/guest?

I'm not too familiar with the discussion going on, so if "RAM cache" refers to a ComfyUI feature that caches to anonymous memory pages, then ignore me. But the vm.swappiness you mentioned defines the bias between keeping page cache (disk reads cached in spare memory) and swapping out anonymous pages. It ranges from 0 to 200: 0 means don't swap any anon pages to disk, and 200 means always read files from disk (don't spare memory for caching disk I/O).
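
To check the value being discussed, a minimal sketch that reads the standard Linux sysctl file `/proc/sys/vm/swappiness`. The helper name `read_swappiness` is hypothetical and the path is a parameter only so it can be pointed at a test file:

```python
from pathlib import Path


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Not on Linux, file missing, or unexpected content.
        return None


if __name__ == "__main__":
    print("vm.swappiness =", read_swappiness())
```

On a non-Linux host (or a Windows host outside WSL) this simply returns `None`.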


The RAM screenshot in your last comment says approx 12/32GB RAM used? It then shows Python using 20GB; is the 5GB pinned separate from that? So approx 7GB spare? Check free -m in the terminal to compare. It reports what is assigned to buffer cache (disk cache in memory), which might contribute to the issue if ComfyUI's view of memory is keeping it from doing something expected.
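
For comparing against what a Python process sees, here is a rough sketch that parses `/proc/meminfo` and reports approximately the same headline numbers as `free -m`. The function names are hypothetical, and summing `Buffers + Cached + SReclaimable` for the buff/cache figure is my understanding of how procps computes it, so treat that part as an assumption:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize_mib(fields):
    """Return headline memory numbers in MiB (integer division, like free -m)."""
    buff_cache_kb = (
        fields.get("Buffers", 0)
        + fields.get("Cached", 0)
        + fields.get("SReclaimable", 0)
    )
    return {
        "total": fields.get("MemTotal", 0) // 1024,
        "free": fields.get("MemFree", 0) // 1024,
        "available": fields.get("MemAvailable", 0) // 1024,
        "buff_cache": buff_cache_kb // 1024,
    }


if __name__ == "__main__":
    # Linux only: /proc/meminfo does not exist elsewhere.
    with open("/proc/meminfo") as f:
        print(summarize_mib(parse_meminfo(f.read())))
```

A large `buff_cache` next to a small `free` usually just means the kernel is caching disk reads; `available` is the better estimate of what can still be allocated.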

I also know that memory monitoring is a bit iffy with containers, for example. In that scenario memory can appear inaccurate. My memory is a bit fuzzy on specifics, but I recall it being related to cgroups v2 memory stats, and that there had been some changes there that are important for calculating usage correctly. I do recall a PR was made to ComfyUI some time ago to try to improve ComfyUI's view of memory related to cgroups, but the PR was largely ignored and closed IIRC.
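
As an illustration of the cgroups v2 point, a sketch of reading the container's own memory accounting rather than host-wide numbers. It assumes a cgroup v2 hierarchy mounted at `/sys/fs/cgroup` and uses the standard `memory.max`, `memory.current`, and `memory.stat` files. The function name is hypothetical, and subtracting `inactive_file` to get a working-set figure mirrors what container tooling commonly does; it is not something ComfyUI is known to do:

```python
from pathlib import Path


def read_cgroup_v2_memory(cgroup_dir="/sys/fs/cgroup"):
    """Return limit/current/working-set bytes for a cgroup v2 directory.

    Returns None if the expected files are missing (for example on cgroup v1,
    or in the root cgroup on a host, which has no memory.max).
    """
    root = Path(cgroup_dir)
    try:
        raw_max = (root / "memory.max").read_text().strip()
        current = int((root / "memory.current").read_text().strip())
        stat_text = (root / "memory.stat").read_text()
    except (OSError, ValueError):
        return None

    # memory.max holds the literal string "max" when no limit is set.
    limit = None if raw_max == "max" else int(raw_max)

    stat = {}
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stat[key] = int(value)

    inactive_file = stat.get("inactive_file", 0)
    return {
        "limit": limit,
        "current": current,
        "inactive_file": inactive_file,
        # Page cache that can be reclaimed cheaply is excluded here.
        "working_set": max(current - inactive_file, 0),
    }


if __name__ == "__main__":
    print(read_cgroup_v2_memory())
```

The gap between `current` and `working_set` is reclaimable file cache, which is one way a container can look fuller than it really is.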


Another concern with Linux can be when swap is enabled, not necessarily as disk swap but via zram, which uses up system memory with compressed pages whilst reporting the uncompressed swap size as a separate device; that can mess with memory heuristics a bit. zswap is better integrated into the kernel's memory management and avoids that issue with its memory pool allocations, but it presently requires disk swap, which also forces less compressible pages to disk swap by default (you can opt out per cgroup though). Not sure if you're using either of these (some distros enable one by default), but it might be worth taking into consideration since you are using swap.

Additionally, on Linux, nvidia GPUs lack shared memory support, as the driver doesn't integrate with GTT IIRC (something like that), while AMD and Intel GPUs are capable of leveraging system memory as a fallback. With WSL on Windows, however, the nvidia integration makes the GPU available from the Windows host + driver, where shared memory is supported (with 50% of system memory as the cap).

@polarathene

I'm a bit busy elsewhere for roughly a week, but if it'd be helpful I can also assist towards this PR?

I have a laptop with 32GB of RAM and an nvidia 4060 (8GB VRAM), Windows 11 host with WSL2 + Docker Desktop used for running ComfyUI via a container.

As noted in my prior comment, there are some known caveats with Linux memory management, and that gets worse with WSL (I think the stance from ComfyUI in their announcement of Dynamic VRAM was that WSL would not be officially supported because of this).

By default WSL is permitted 50% of system memory as its allocation ceiling. Nvidia's shared memory (effectively like swap, but from VRAM to system memory) also has a 50% host memory ceiling. WSL is also configured by default from Windows with swap of 25% of WSL RAM IIRC. The kernel may also have zswap enabled, and I believe it's now using cgroups v2 (an important difference); I'd need to verify that.
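
To put numbers on those defaults for a 32GB machine, a small arithmetic sketch. The percentages are the ones recalled above (50% WSL RAM, 50% shared GPU memory, swap at 25% of WSL RAM) and are unverified assumptions; the function name is hypothetical:

```python
def wsl_default_ceilings(host_ram_gib):
    """Estimate default ceilings in GiB from the host's RAM size.

    The fractions are assumptions taken from the comment, not verified values.
    """
    wsl_ram = host_ram_gib * 0.5  # WSL allocation ceiling
    return {
        "wsl_ram_gib": wsl_ram,
        "gpu_shared_gib": host_ram_gib * 0.5,  # shared GPU memory cap
        "wsl_swap_gib": wsl_ram * 0.25,  # swap sized relative to WSL RAM
    }


if __name__ == "__main__":
    print(wsl_default_ceilings(32))
```

Under these assumptions a 32GB host leaves roughly 16GB for WSL, with about 4GB of swap on top, which is the budget ComfyUI would actually be working within.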

The page cache in WSL's system memory does appear as allocated memory in the Windows host's Task Manager and also results in disk usage IIRC. I've had scenarios where I pulled a container image while low on disk: even though barely any extra memory is used for that operation, and sufficient disk for the image itself would have been available IIRC, the overhead of the kernel's file cache affecting the page file on the Windows host resulted in 0 bytes left, which completely crashes WSL.

I've got the weaker system in VRAM and memory (given preference to run in WSL + container), so it might be helpful for insights. I'm not sure if I can test with the same workloads being discussed, advice on what kind of load to test for is appreciated.

@m8rr

m8rr commented May 25, 2026

Comfy-Org/ComfyUI#14089
@rattus128 I still need --cache-ram 0 even after the latest patch. The initialization happens from --cache-ram 2, so it makes sense that it's still too high even though it's now tweaked to 90% (3G).
Anyway, I'm happy with --cache-ram 0.

--cache-ram 0
[image: Untitled]

without --cache-ram 0
The log says 'Enabled pinned memory 12834.0', but unlike when using --cache-ram 0, it doesn't actually seem to be pinning.
[image: Untitled2]
After Sampler 1, the TE completely disappears from the list. After that, the TE is reloaded from scratch, and Samplers 1 and 2 are completed successfully.

[image: Untitled3]

@kingp0dd

Comfy-Org/ComfyUI#14089

Before applying the patch, my baseline is:

comfy launch -- --listen 0.0.0.0 --cache-ram 0 --reserve-vram=0.05
> 140s
> TE always loaded from disk

Then I pulled the latest patch. All results are from the 4th/5th run:

comfy launch -- --listen 0.0.0.0 --cache-ram 0 --reserve-vram=0.05
> 130s
> TE always loaded from disk
[image]
comfy launch -- --listen 0.0.0.0 --cache-ram 0
> 137s
> TE always loaded from disk
comfy launch -- --listen 0.0.0.0 --reserve-vram=0.05
> 135s
> TE always loaded from disk
[image]


5 participants