Threaded Loader performance fixes / improvements (+ Aimdo 0.4.6) by rattus128 · Pull Request #14116 · Comfy-Org/ComfyUI

rattus128 · 2026-05-26T16:26:35Z

A handful of RAM optimizations, particularly for Windows with slow disks.

Dismantle the stream-pin-buffer. Instead, aimdo 0.4.6 has a direct file -> VRAM load API that uses the same threaded load, but with a static ring buffer that matches the chunk size and does coalescing in C. This saves a lot of RAM and also avoids the prefault delay of a larger stream-pin-buffer allocation, while also skirting the giant-weight issue WRT RAM.
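For illustration only, the static ring buffer idea can be sketched in Python. This is not the aimdo API (which does this in C): `ring_load`, `CHUNK_SIZE` and `RING_SLOTS` are hypothetical names, and a `bytearray` stands in for VRAM. The point of the sketch is that staging memory stays fixed at `slots * chunk_size` no matter how large the file is.

```python
import io
import queue
import threading

CHUNK_SIZE = 4  # bytes per ring slot; tiny on purpose for the example
RING_SLOTS = 2  # staging memory is fixed at CHUNK_SIZE * RING_SLOTS


def ring_load(src, dest, chunk_size=CHUNK_SIZE, slots=RING_SLOTS):
    """Copy src (binary file object) into dest (bytearray standing in for VRAM)."""
    ring = [bytearray(chunk_size) for _ in range(slots)]
    free = queue.Queue()    # slot indices the reader thread may fill
    filled = queue.Queue()  # (slot index, byte count) ready to be consumed
    for i in range(slots):
        free.put(i)

    def reader():
        while True:
            slot = free.get()             # blocks while every slot is in flight
            n = src.readinto(ring[slot])  # read straight into the reused slot
            if not n:
                filled.put(None)          # end of file
                return
            filled.put((slot, n))

    thread = threading.Thread(target=reader)
    thread.start()
    offset = 0
    while (item := filled.get()) is not None:
        slot, n = item
        dest[offset:offset + n] = ring[slot][:n]  # stand-in for the device upload
        offset += n
        free.put(slot)  # hand the slot back only after the copy is done
    thread.join()
    return offset


data = bytes(range(10))
vram = bytearray(len(data))
copied = ring_load(io.BytesIO(data), vram)
```

Because the slots are reused, the reader can never run more than `slots` chunks ahead of the consumer, which is what bounds the RAM cost in this sketch.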

From there, change the pin allocation and movement strategy to always max out pin allocation on the current model, even if there isn't enough reservation quota. Instead, move pins on the fly (taking the CUDA sync hit), as that is preferable to risking a disk hit or having to do a RAM deep copy. The MRU 2GB chunk gets evicted repeatedly and rotated through the shortfall to avoid LRU all-weights eviction as the transformer cycles through everything.
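A toy simulation, not the PR's code, shows why evicting the most recently used entry helps when weights are read in the same order on every pass. The function name, the weight count and the capacity below are made up for the example, and entries are whole weights rather than 2GB pin chunks.

```python
def simulate(policy, n_weights=6, capacity=4, passes=5):
    """Count misses when weights 0..n_weights-1 are read in order, repeatedly."""
    resident = []  # least recently used first, most recently used last
    misses = 0
    for _ in range(passes):
        for w in range(n_weights):
            if w in resident:
                # Hit: refresh recency by moving the weight to the end.
                resident.remove(w)
                resident.append(w)
                continue
            misses += 1
            if len(resident) >= capacity:
                # LRU drops the oldest entry, MRU drops the newest one.
                resident.pop(0 if policy == "lru" else -1)
            resident.append(w)
    return misses


lru_misses = simulate("lru")
mru_misses = simulate("mru")
print(f"LRU misses: {lru_misses}, MRU misses: {mru_misses}")
```

With a cyclic access pattern and a cache smaller than the model, LRU always evicts the entry that will be needed soonest, so every access misses. MRU confines the churn to a small rotating slot and keeps most of the set resident.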

De-committing memory for the sake of pin buffer freeing is made lightly asynchronous to get this out of the CPU main thread critical path.
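As a rough sketch of the general pattern (not the PR's implementation), release work can be handed to a worker thread so the caller does not wait for it. `DeferredReleaser` is a hypothetical name, and the callbacks here only record a label instead of de-committing real memory.

```python
import queue
import threading


class DeferredReleaser:
    """Runs release callbacks on a worker thread so the caller does not block."""

    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            release = self._work.get()
            if release is None:  # shutdown sentinel
                return
            try:
                release()
            except Exception:
                pass  # a real implementation would log the failure
            finally:
                self._work.task_done()

    def submit(self, release):
        """Queue a zero-argument callable and return immediately."""
        self._work.put(release)

    def close(self):
        """Wait for queued releases to finish, then stop the worker."""
        self._work.join()
        self._work.put(None)
        self._thread.join()


freed = []
releaser = DeferredReleaser()
releaser.submit(lambda: freed.append("pin buffer A"))
releaser.submit(lambda: freed.append("pin buffer B"))
releaser.close()
```

A single worker keeps releases in submission order, which makes the behaviour easy to reason about at the cost of no parallelism between releases.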

Pinned memory is improved with an offload balancer algorithm. A max scatter algorithm is used to spread out the weights that miss out on getting loaded to RAM, so disk bandwidth can be maximized by evening out the load.
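The commit list for this PR mentions a bit reverse utility that is "useful for max scattering something ordered". One way such a scatter could work is sketched below. This is an assumption about the approach, not the code in the PR: the function names are hypothetical and the sketch ignores weight sizes, which the real balancer takes into account.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value, e.g. 0b001 -> 0b100 for bits=3."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def scatter_order(n):
    """Order indices 0..n-1 so that any prefix is spread across the whole range."""
    bits = max(1, (n - 1).bit_length())
    return sorted(range(n), key=lambda i: bit_reverse(i, bits))


def pick_offloaded(n_weights, n_offload):
    """Choose which weight indices stay on disk, spread evenly over the model."""
    return sorted(scatter_order(n_weights)[:n_offload])


order = scatter_order(8)
offloaded = pick_offloaded(8, 4)
print(order, offloaded)
```

Taking the first few entries of the bit-reversed order picks indices that are far apart, so the disk reads are interleaved with RAM-resident weights instead of arriving as one long burst.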

Aimdo 0.4.7 improves VRAM load patterns by not loading past the VRAM usage once all yet-to-be-loaded pages are accounted for. This avoids a disk revisit for these weights.

Finally, fix the file open mode on Windows and unify it with the aimdo open, which makes disks just a little faster on Windows.

Example test conditions:

Windows, RTX5060, 32GB RAM, PCIE x4 Gen1 (downgraded)
LTX2.3 960x540x10s

Before:

[INFO] got prompt
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load LTXAVTEModel_
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
[INFO] Found quantization metadata version 1
[INFO] Detected mixed precision quantization
[INFO] Using mixed precision operations
[INFO] Native ops: nvfp4, int8_blockwise, float8_e4m3fn_rowwise, mxfp8, hybrid_mxfp8, float8_e5m2, float8_e4m3fn_blockwise, float8_e4m3fn, int8_tensorwise
[INFO] model weight dtype torch.bfloat16, manual cast: torch.bfloat16
[INFO] model_type FLUX
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[WARNING] no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
[INFO] 0 models unloaded.
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:44<00:00, 28.03s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [02:01<00:00, 40.46s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 463.97 seconds

After:

[INFO] got prompt
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load LTXAVTEModel_
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
[INFO] Found quantization metadata version 1
[INFO] Detected mixed precision quantization
[INFO] Using mixed precision operations
[INFO] Native ops: float8_e4m3fn_rowwise, float8_e4m3fn_blockwise, nvfp4, int8_tensorwise, int8_blockwise, mxfp8, float8_e5m2, float8_e4m3fn, hybrid_mxfp8
[INFO] model weight dtype torch.bfloat16, manual cast: torch.bfloat16
[INFO] model_type FLUX
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[WARNING] no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
[INFO] 0 models unloaded.
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:43<00:00, 12.94s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:04<00:00, 21.59s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 277.78 seconds

v0.22.0:

Model LTXAV prepared for dynamic VRAM loading. 23838MB Staged. 1660 patches attached. Force pre-loaded 1496 weights: 44 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:53<00:00, 14.16s/it]
Model LTXAV prepared for dynamic VRAM loading. 23838MB Staged. 1660 patches attached. Force pre-loaded 1496 weights: 44 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:05<00:00, 21.95s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 361.52 seconds

After + #13971:

[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:09<00:00,  8.71s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:46<00:00, 15.65s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 239.95 seconds
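To make the comparison easier to read, the total prompt times from the four logs above can be turned into speedup factors. The script below only does arithmetic on the "Prompt executed in" values already shown. The helper name is made up for the example.

```python
# Total prompt times in seconds, copied from the logs above.
runs = {
    "Before": 463.97,
    "v0.22.0": 361.52,
    "After": 277.78,
    "After + #13971": 239.95,
}


def speedups(times, baseline_label):
    """Return how many times faster each run is than the baseline run."""
    baseline = times[baseline_label]
    return {label: baseline / seconds for label, seconds in times.items()}


vs_before = speedups(runs, "Before")
for label, factor in vs_before.items():
    print(f"{label:>15}: {runs[label]:7.2f} s  {factor:.2f}x vs Before")
```

On these numbers, "After" is about 1.67x faster than "Before", and "After + #13971" is about 1.93x faster.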

socket-security · 2026-05-26T16:27:24Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	websocket-client@1.9.0
	pytest-asyncio@1.4.0

View full report

Ph0rk0z · 2026-05-26T22:33:22Z

Whole thing is still net negative for me.

#13802 (comment)

v22 dynamic vram disabled


INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
Requested to load Flux2
loaded completely; 14564.04 MB usable, 8996.02 MB loaded, full load: True
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.86s/it]
Requested to load TAESD
loaded completely; 4994.88 MB usable, 10.21 MB loaded, full load: True
Prompt executed in 33.81 seconds << compile (auto, no node)
got prompt
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
Prompt executed in 6.11 seconds << reroll

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.46s/it]
Prompt executed in 7.19 seconds << new prompt
got prompt

master dynamic vram enabled


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.20 seconds << (compile)
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.27 seconds << (slower reroll)
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.38 seconds << (owo, what's this?)

PR

[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 31.16 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.22 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.90 seconds

PR --cache-ram 1


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.03 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.24 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.91 seconds

alldtes9-tech · 2026-05-27T04:35:51Z

Whole thing is still net negative for me.


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.03 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.24 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.91 seconds

You're using INT8 quant, which isn't natively supported in Comfy. AFAIK, you need the Comfy Kitchen fork and a custom node to make it work.

Also, you're using GGUF for the text encoder, which I don't think supports Dynamic vram yet. rattus has a draft PR on the ComfyUI-GGUF repo, but I don't know if it's a complete implementation yet since it's still marked as draft.

Why not compare performance using quants that are natively supported in Comfy instead?

Edit:

Here are my results using quants that are supported in Comfy, using zimage BF16 + Qwen3 4B BF16.

Master with --disable-dynamic-vram args.

[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5677.80 MB usable, 5437.25 MB loaded, 2235.00 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] 0 models unloaded.
[INFO] Unloaded partially: 277.87 MB freed, 5159.38 MB remains loaded, 237.50 MB buffer reserved, lowvram patches: 0
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5621.67 MB usable, 5245.48 MB loaded, 6494.06 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
  0%|                                                                                            | 0/4 [00:00<?, ?it/s][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
 25%|█████████████████████                                                               | 1/4 [00:04<00:12,  4.03s/it][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
 50%|██████████████████████████████████████████                                          | 2/4 [00:05<00:05,  2.64s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00,  2.33s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4710.57 MB usable, 4334.25 MB loaded, 7405.29 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 112
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00,  4.53s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 74.20 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.78s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4733.45 MB usable, 4358.45 MB loaded, 7381.09 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00,  4.56s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 50.51 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5612.68 MB usable, 5374.75 MB loaded, 2297.50 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.80s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4735.93 MB usable, 4360.93 MB loaded, 7378.61 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:35<00:00,  4.46s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 53.89 seconds <<<< 3rd run (change prompt)

master with dynamic vram enabled (default)

[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] 0 models unloaded.
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
  0%|                                                                | 0/4 [00:00<?, ?it/s,   Model Initializing ...  ][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
 25%|████████████▎                                    | 1/4 [00:07<00:23,  7.73s/it,  Model Initialization complete!  ][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
 50%|██████████████████████████████████████████                                          | 2/4 [00:02<00:02,  1.20s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.14s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:23<00:00,  3.00s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 42.62 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.10s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00,  2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 32.90 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.12s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00,  2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 35.02 seconds <<<< 3rd run (change prompt)

Ph0rk0z · 2026-05-27T12:13:14Z

You're using INT8 quant, which isn't natively supported in Comfy. AFAIK, you need the Comfy Kitchen fork and a custom node to make it work.

Because this used to work for me. I am comparing the thing I actually want to use, not some theoretical. One could ask "why not compare the results on a system with an 8gb GPU and 16gb of ram?" Why not compare SDXL, etc.

Master with --disable-dynamic-vram args.

Can't do that anymore because on master dynamic vram can no longer be properly disabled. The last PR that caused this disk-reloading behavior functionally deprecated this option. I can give you a log of .22 vs master if you'd like. You cannot induce a regression and then say that your "fix" makes it better.

alldtes9-tech · 2026-05-27T13:12:58Z

Because this used to work for me. I am comparing the thing I actually want to use, not some theoretical. One could ask "why not compare the results on a system with an 8gb GPU and 16gb of ram?" Why not compare SDXL, etc.

I'm not asking you to run a different model or use a different GPU or memory setup. I was asking why not compare using quants that are natively supported in Comfy.

If Comfy makes changes that improve performance for the native path and those changes end up affecting performance in your custom setup, I don't think it's fair to immediately conclude that Comfy introduced a performance regression just because a custom integration becomes slower.

Since you're using gguf/INT8 through a custom node / Comfy Kitchen fork, it might also be worth asking those maintainers whether the latest master introduced changes they need to adapt to.

Personally, I mostly use models and quants that work through native Comfy paths, and I've generally seen benefits from recent changes.

That said, if the slowdown also happens on native setups, then that's a different discussion.

Ph0rk0z · 2026-05-27T21:42:50Z

There is literally no scenario where loading from disk will help me because my disks are slow and I have plenty of ram just for this reason.

I was asking why not compare using quants that are natively supported in Comfy

Because those quants don't work for me. If they had, I wouldn't have sought different ones. I am having to re-invent the wheel and most likely fix those things myself, they are not the only broken nodes from this change. Prior to #13802 I was able to turn it off for my use, and you were able to leave it on for your use. We could both have our cake.

Exclude live loras from the numbers to avoid the case where the reported loaded memory exceeds the size of the model. This caused me confusion in the Kijai visualizer when it looked fully loaded but was hitting disk due to this accounting discrepancy.

frauttauteffasu mentioned this pull request May 26, 2026

Multi-threaded load of models from disk (big load time speedups & Offload to disk) (CORE-43,CORE-152,CORE-164,CORE-165,CORE-117) #13802

Merged

rattus128 added 8 commits May 28, 2026 19:18

memory_management: Add direct to read GPU mode

04e6329

Make destination optional (or make it optionally GPU) and use aimdo to file_read direct to GPU.

ops: Remove stream pin buffers and use aimdo reads

fca9842

This consumed too much RAM and it's better to just take the hit on the CPU syncing back the stream on a short ring buffer. Aimdo implements this, so just rip the stream pin buffer from comfy.

model_management: all active pin registration movement

61c7887

It's better to just let the active model load past the pin limit as pins and let the pins move around. This saves the HDD and SATA people disk traffic while only costing a few GPU syncs.

utils: use aimdo file handle

5b72c2c

This opens on Windows with more favourable flags.

utils: add bit reverse utility

39a66d8

useful for max scattering something ordered.

pinned_memory: Implement offload balancing

83e267e

Use a max scatter algorithm to prioritize pins of the same size such that when doing a little bit of offloading it gets scattered, allowing the prefetcher to more evenly swallow the offload.

comfy-aimdo 0.4.7

bf0ac49

Aimdo 0.4.7 implements VRAM buffer exhaustion prediction to avoid early speculative load of weights that definitely won't fit once the inference gets further in.

rattus128 force-pushed the prs/aimdo-046-threaded-loader-2 branch from 1eeb963 to bf0ac49 Compare May 28, 2026 09:19

rattus128 wants to merge 8 commits into Comfy-Org:master from rattus128:prs/aimdo-046-threaded-loader-2
