Dynamic VRAM support #427
Vibe code. To be reviewed.
If in dynamic mode, load GGUF as a QT.
Refactor this to support the new reconstructability protocol in the comfy core. This is needed for DynamicVRAM (to support legacy demotion for fallbacks). Add the logic for dynamic_vram construction. This is also needed for worksplit multi-gpu branch where the model is deep-cloned via reconstruction to put the model on two parallel GPUs.
Factor this out to a helper and implement the new core reconstruction protocol. Consider the mmap_released flag 1:1 with the underlying model such that it moves with the base model in model_override.
|
https://github.com/rattus128/ComfyUI-GGUF/tree/dynamic-vram Is this the same thing? |
|
This version definitely has a speed boost. However, if you're getting errors with the GGUF text encoder like me, try modifying the code as follows. With this change only the text encoder operates the old way; it should serve as a good temporary workaround until the update. nodes.py line 206~ (False->True) |
That means it's already working. How much (in %) did it save you?
|
without --disable-dynamic-vram with --disable-dynamic-vram |
|
Are there other CLI flags needed to enable it? I'm on v16.4, and my startup logs have:
DynamicVRAM support detected and enabled
but when the model is loaded, I don't get the same as yours:
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
I'm using GGUF Wan2.2
On Tue, Mar 17, 2026 at 8:20 AM m8rr wrote (comment on city96/ComfyUI-GGUF#427):
without --disable-dynamic-vram
Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:29<00:00, 5.93s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.72s/it]
Requested to load AudioVAE
loaded completely; 1968.11 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 164.28 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:24<00:00, 4.81s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.69s/it]
Requested to load AudioVAE
loaded completely; 1934.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 90.59 seconds
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:23<00:00, 4.63s/it]
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 16918MB Staged. 0 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:35<00:00, 11.68s/it]
Requested to load AudioVAE
loaded completely; 1966.00 MB usable, 693.46 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 77.65 seconds
with --disable-dynamic-vram
Requested to load LTXAVTEModel_
loaded partially; 8523.00 MB usable, 556.58 MB loaded, 13574.77 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (267)
loaded partially; 8457.88 MB usable, 491.46 MB loaded, 13639.97 MB offloaded, 7966.42 MB buffer reserved, lowvram patches: 0
gguf qtypes: F32 (2672), BF16 (28), Q6_K (1744)
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load LTXAV
loaded partially; 9564.67 MB usable, 9525.25 MB loaded, 7689.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:30<00:00, 6.09s/it]
Unloaded partially: 620.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1287.84 MB freed, 7616.93 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.33s/it]
Requested to load AudioVAE
loaded completely; 2233.75 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 193.22 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:32<00:00, 6.43s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:42<00:00, 14.15s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 97.68 seconds
Requested to load LTXAV
loaded partially; 9560.67 MB usable, 9521.25 MB loaded, 7693.63 MB offloaded, 39.42 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:31<00:00, 6.25s/it]
Unloaded partially: 616.48 MB freed, 8904.77 MB remains loaded, 39.42 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 1301.00 MB freed, 7603.77 MB remains loaded, 39.47 MB buffer reserved, lowvram patches: 0
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:41<00:00, 13.87s/it]
Requested to load AudioVAE
loaded completely; 2244.90 MB usable, 693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 1384.94 MB offloaded, 378.02 MB buffer reserved, lowvram patches: 0
Prompt executed in 95.82 seconds
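For a quick side-by-side of the two logs above, a small script can pull out the "Prompt executed in ..." lines and compare them run by run. This is only a sketch: the helper names (`prompt_times`, `speedups`) are made up for illustration, and the timings are copied from the logs in this comment, not re-measured.

```python
import re

# Timings copied from the two logs above, in run order.
LOG_DYNAMIC = """
Prompt executed in 164.28 seconds
Prompt executed in 90.59 seconds
Prompt executed in 77.65 seconds
"""
LOG_LEGACY = """
Prompt executed in 193.22 seconds
Prompt executed in 97.68 seconds
Prompt executed in 95.82 seconds
"""

PROMPT_RE = re.compile(r"Prompt executed in ([0-9.]+) seconds")


def prompt_times(log_text):
    """Return every 'Prompt executed in N seconds' value as a float."""
    return [float(value) for value in PROMPT_RE.findall(log_text)]


def speedups(dynamic, legacy):
    """Per-run ratio legacy/dynamic; above 1.0 means dynamic VRAM was faster."""
    return [leg / dyn for dyn, leg in zip(dynamic, legacy)]


dyn = prompt_times(LOG_DYNAMIC)
leg = prompt_times(LOG_LEGACY)
ratios = speedups(dyn, leg)
for run, (d, l, r) in enumerate(zip(dyn, leg, ratios), start=1):
    print(f"run {run}: dynamic {d:.2f}s, legacy {l:.2f}s, ratio {r:.2f}x")
```

The first run in each log includes the initial model load, so the second and third runs are the fairer comparison.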
|
|
Check if the quant_ops.py file exists inside the ComfyUI-GGUF folder. If it’s not there, the installation wasn't done correctly. I installed like this. |
|
That's weird, I did exactly that but it's still not activating. I have that file: But when loading the GGUF Q4KM Wan2.2, I still can't see the dynamic loading log: I know dynamic loading is enabled in ComfyUI because other models have that log: I tried this both in v0.16.4 and the latest ComfyUI v0.18.2. EDIT: Sorry to bother. It works now after nuking my whole ComfyUI installation. |
|
I ran a few tests with Wan2.2 Q4KM GGUF. It seems that GGUF Dynamic VRAM is slower than non-dynamic: Dynamic VRAM: Without Dynamic VRAM: |
|
I ran Wan2.2 three times. Unlike LTX2.3, there wasn't a huge difference, but it doesn't seem any slower either. |
|
I hope this doesn't get merged, or always keeps a proper way to keep this disabled. Even using I also don't really get if it's needed in the GGUF loader? It was already pretty good and smart about offloading. Let's be clear (before I have to explain or edit this post): On my RTX Blackwell system, it's pretty neat. It's enabling workflows that I had to memory-juggle manually before or just gave OOM errors. It's also increased processing time for other workflows - most often the workflows that are simpler and using smaller models - because it keeps unloading and loading parts that otherwise fitted nicely in memory. So, the dynamic-vram work - currently - isn't a silver bullet and I still see lots of issues reported with it, and the thought that ComfyUI maintainers say they are going to remove the disable option (that works for 90% or less) makes me a bit scared. With the dynamic vram work activated, on my Intel system ComfyUI is now completely unusable. everything makes my system go OOM and the OS kills ComfyUI. Switching to the GGUF loader (even with Q8 and F16 models) and everything is fine. It ignores Until dynamic-vram in ComfyUI is stable and people are happy with it, please leave the GGUF loader alone. It's the only fallback. |
|
@rattus128 may we know if this is still incomplete? My ltx2.3 workflow OOMs without using this PR branch. It works great with the model GGUF, but it does not with the GGUF CLIP. The CLIP always gets loaded from SSD and is not cached in RAM. If I use a safetensors CLIP, it gets loaded from RAM and is faster. Edit: oh, maybe the CLIP GGUF is not yet implemented? #427 (comment) |
@m8rr May i ask if your clip is also gguf? Does it cache to vram for you? In my tests, gguf CLIP always loads from SSD even if I try Q3 and my RAM usage well below 60%. My ltx model is Q4 GGUF and should fit in my RAM as well. |
|
@kingp0dd |
|
@m8rr do you use the ltx text projection safetensors along with the gemma clip gguf? |
|
@kingp0dd |
|
i use the same. i retested after updating to: and now it works. both CLIP and UNET are being loaded from RAM and not from disk. |
|
Interesting! In my previous test, I was on version 0.22 stable. But now that I’ve updated to the latest commit, it’s reading the CLIP from the disk. It's behaving just like when I run Comfy for the very first time. It even goes as far as stopping at sampler 1 and reloading the CLIP. |
|
@m8rr @kingp0dd I have some time to look at this. Can you send me any of:
And your hardware specs. Hopefully we can get something in that works faster for everyone, as the latest stuff should be significantly faster for gguf users, especially on the lower end of hardware. I'll be running on my RTX5060 with 16GB of RAM as a general rule, but I can reasonably match a fair few profiles if you have a workflow that regresses, and we can go from there. |
|
Modified code nodes.py line 206~ (disable_dynamic: False->True) In the ComfyUI default template, I only changed the loader and ran it twice, but it keeps reloading the CLIP from scratch.
|
|
continuing from #427 (comment) now i try using Gemma Q3KM (~4GB smaller than the Q6K), everything loads from cache now. I turned on sudo sysctl vm.swappiness=60 again (but i'll try to turn it off again and ran a test again). What i don't understand is i still have unused RAM when using the Gemma Q6K but when using that, the models don't load from RAM cache.
|
|
Comfy-Org/ComfyUI#13802 (comment) The minor issue is, I'm not sure which version it started from, but clip loading is quite slow and it barely uses any VRAM. |
|
@m8rr @kingp0dd I think I have progress on this. I can see how RAM cache is pressuring @m8rr out, as you are right on the threshold of not being able to cache that, with the RAM cache pressure headroom making the difference. I think I might send a PR to core to default the RAM cache a bit higher. I won't go for 0, as we need pressure for other cases and we don't have an everybody-wins just yet, but I think we can do better for this profile of hardware. Can I get your generated resolutions to compare numbers? I am here for my 5060 1280x720x5s: Consider installing this https://github.com/kijai/ComfyUI-MemoryVisualization to watch along with the load levels. With --cache-ram 2 and 8GB VRAM and 32GB RAM my retention of the TE is partial:
The missing piece is barely noticeable though with the threaded loader, and I have downgraded my PCIe bus to gen 1 to simulate a slow disk. @m8rr are you on SATA or HDD by any chance? |
|
@m8rr your workaround outright disables dynamic VRAM for CLIP. I have put the bugfix on top of the branch now, so you should be able to undo your workaround and get dynamic CLIP too. With dynamic CLIP your likelihood of getting swapped or completely dumped by the RAM cache is much lower. --cache-ram 0 will still make it better. The question is how reliant you are on it with the CLIP fixed. |
|
@rattus128 Sorry, I changed CLIP from q6 to q4. Anyway, I updated node with the new commit and ran it three times with different prompts. The CLIP (both q4) loading speed is noticeably faster—it dropped from over 20 seconds to just a few seconds. IGPU+EGPU(OcuLink), PCIE4 NVME SSD, 1280x768x121, 5/3steps without --cache-ram: 12Xs, 12Xs(hard reset) |
|
@rattus128 seems CLIP doesn't want to pin.
|
@kingp0dd is this a Linux host or WSL/guest? I'm not too familiar with the discussion going on, so if "RAM cache" is referring to a feature of ComfyUI caching to anonymous memory pages then ignore me, but for The RAM screenshot in your last comment says approx 12/32GB RAM used? Then shows Python using 20GB and is that 5GB pinned separate from that? So approx 7GB spare? Cheek I also know that memory monitoring is a bit iffy with containers, for example. In that scenario memory can appear inaccurate; my memory is a bit fuzzy on specifics, but I recall it being related to cgroups v2 memory stats and that there had been some changes there that are important to calculate usage correctly. I do recall a PR was made to ComfyUI some time ago to try to improve ComfyUI's view of memory related to cgroups, but the PR was largely ignored and closed IIRC. Another concern with Linux can be when swap is enabled, not necessarily as disk swap but via zram, which will use up system memory with compressed pages whilst reporting uncompressed swap size as a separate device; that can mess with memory heuristics a bit. zswap is better integrated into the kernel's memory management to avoid that issue with its memory pool allocations, but presently requires disk swap, which also forces less compressible pages to disk swap by default (can opt out per cgroup though). Not sure if you're using either of these (some distros enable them by default), but it might be worth taking into consideration since you are using swap. Additionally, on Linux nvidia GPUs lack shared memory support, as the driver doesn't integrate with GTT IIRC? (something like that) AMD and Intel GPUs, meanwhile, are capable of leveraging system memory as a fallback. With WSL on Windows however, the nvidia integration makes the GPU available from the Windows host + driver, where shared memory is supported (50% system memory as cap). |
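To check what a container actually sees, the cgroup v2 memory files can be read directly. The sketch below assumes a cgroup v2 unified hierarchy mounted at `/sys/fs/cgroup` and uses the standard `memory.current` and `memory.max` interface files; the function name is made up for illustration, and this is not how ComfyUI itself measures memory. On a bare host the root cgroup normally does not expose these files, in which case the function returns `None`.

```python
from pathlib import Path


def read_cgroup_v2_memory(root="/sys/fs/cgroup"):
    """Return (current_bytes, limit_bytes) from cgroup v2, or None if unavailable.

    limit_bytes is None when memory.max contains "max" (no limit set).
    """
    base = Path(root)
    try:
        current = int((base / "memory.current").read_text().strip())
        raw_limit = (base / "memory.max").read_text().strip()
        limit = None if raw_limit == "max" else int(raw_limit)
    except (OSError, ValueError):
        # Files missing (cgroup v1, root cgroup, non-Linux) or unparsable.
        return None
    return current, limit


usage = read_cgroup_v2_memory()
if usage is None:
    print("no cgroup v2 memory files found at the default path")
else:
    current, limit = usage
    print(f"cgroup memory.current={current} memory.max={limit}")
```

Comparing these values with what the monitoring tool reports is one way to spot the container accounting mismatch described above.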
|
I'm a bit busy elsewhere for roughly a week, but if it'd be helpful I can also assist towards this PR? I have a laptop with 32GB of RAM and an nvidia 4060 (8GB VRAM), Windows 11 host with WSL2 + Docker Desktop used for running ComfyUI via a container. As noted in my prior comment there are some known caveats with Linux memory management and that gets worse with WSL (I think the stance from ComfyUI in their announcement of Dynamic VRAM was that WSL would not be officially supported because of this). By default WSL is permitted 50% of system memory as its allocation ceiling. Nvidia's shared memory (effectively like swap but from VRAM to system memory) also has a 50% host memory ceiling. WSL is configured by default from Windows with a swap of 25% WSL RAM IIRC too, the kernel may also have zswap enabled and I believe it's now using cgroups v2 (important difference), I'd need to verify that. The page cache in system memory of WSL does appear as allocated memory from the Windows host's task manager and also results in disk usage IIRC (I've had scenarios where I've pulled a container image on low disk, even though barely any extra memory is used to do that operation, and sufficient disk for the image itself would have been available IIRC, the overhead of the kernel's file cache affecting the page file on the Windows host resulted in 0 bytes left, which completely crashes WSL). I've got the weaker system in VRAM and memory (given my preference to run in WSL + container), so it might be helpful for insights. I'm not sure if I can test with the same workloads being discussed; advice on what kind of load to test for is appreciated. |
|
Comfy-Org/ComfyUI#14089 without --cache-ram 0
|














The new dynamic VRAM system in the comfy core enhances both RAM and VRAM management. Models are no longer offloaded from VRAM to RAM (which has a habit of becoming swap) and are now loadable asynchronously on the sampler's first iteration. This gives a significant speedup to big multi-model workflows on low-resource systems. VRAM offloading is managed by demand offloading, such that there is no need for VRAM usage estimates anymore.
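As a rough mental model of demand offloading, the toy below keeps a fixed VRAM budget, loads a weight the first time it is requested, and evicts the least recently used entries only when space is actually needed. It is purely illustrative: `DemandVramCache`, the layer names, the sizes, and the LRU policy are all assumptions for the sketch and do not describe the comfy core's actual implementation.

```python
from collections import OrderedDict


class DemandVramCache:
    """Toy model of demand offloading: load on first use, evict LRU when full."""

    def __init__(self, budget_mb):
        self.budget_mb = budget_mb
        self.resident = OrderedDict()  # name -> size_mb, least recently used first
        self.evictions = []            # names evicted so far, in order

    def used_mb(self):
        return sum(self.resident.values())

    def request(self, name, size_mb):
        """Make `name` resident, evicting least recently used entries as needed."""
        if size_mb > self.budget_mb:
            raise ValueError(f"{name} ({size_mb} MB) exceeds the budget")
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return
        while self.used_mb() + size_mb > self.budget_mb:
            victim, _ = self.resident.popitem(last=False)
            self.evictions.append(victim)
        self.resident[name] = size_mb


cache = DemandVramCache(budget_mb=100)
for layer in ["block0", "block1", "block2", "block0", "block3"]:
    cache.request(layer, 40)
print("resident:", list(cache.resident))
print("evicted:", cache.evictions)
```

Nothing here needs an up-front estimate of how much VRAM the whole model will use; the budget is enforced only at the moment a request does not fit.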
The core has already upstreamed several of the resource saving features of GGUF in various forms.
So this implements a QuantizedTensor backend and subclasses the new ModelPatcherDynamic to bring GGUF+dynamic without needing custom ops.
The patcher subclass is needed to unhook the LoRA so that it is applied on-the-fly. Otherwise it just loads the state dict into the new QuantizedTensor and goes.
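The idea of keeping a weight quantized and only dequantizing (and patching) it when it is used can be shown with a toy in plain Python. The block layout is modeled loosely on GGUF's Q8_0 (32 values sharing one scale); `ToyQuantizedWeight` and its methods are hypothetical names for this sketch and are not the core's QuantizedTensor API or this PR's code.

```python
BLOCK = 32  # values per block, as in GGUF's Q8_0 layout


def quantize_q8(values):
    """Split into blocks of BLOCK values; keep one float scale plus small ints."""
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127 if amax else 0.0
        quants = [round(v / scale) if scale else 0 for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8(blocks):
    """Rebuild approximate float values from (scale, quants) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


class ToyQuantizedWeight:
    """Keeps the weight quantized; dequantizes and applies patches only on access."""

    def __init__(self, values):
        self.blocks = quantize_q8(values)
        self.patches = []  # (strength, delta) pairs, never baked into the blocks

    def add_patch(self, strength, delta):
        self.patches.append((strength, delta))

    def materialize(self):
        out = dequantize_q8(self.blocks)
        for strength, delta in self.patches:
            out = [w + strength * d for w, d in zip(out, delta)]
        return out


weights = [i / 10 - 3.0 for i in range(64)]
w = ToyQuantizedWeight(weights)
w.add_patch(0.5, [1.0] * 64)  # stand-in for a LoRA delta
patched = w.materialize()
print(f"first value: original {weights[0]:.3f}, patched {patched[0]:.3f}")
```

Because patches live beside the quantized blocks instead of being merged into them, removing or swapping a patch never requires requantizing the stored weight.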
This brings the full feature set of the core comfy caster to GGUF, including async offload (and async primary load), pinned memory and now the dynamic management.
There's some boilerplate to implement the downgrade back to ModelPatcher. This is needed for things like torch compile and hooks, where Dynamic VRAM is TBD.
Still drafting and will post some more performance results. I am going to pull a RAM stick and go for some 16GB RAM flows with GGUF.
Example Test conditions:
WAN2.2 14B Q8 GGUF, 640x640x81f, RTX5090, Linux, 96GB, 2x Runs (disk caches warm with model first runs)
Before
After