Threaded Loader performance fixes / improvements (+ Aimdo 0.4.6)#14116

Draft
rattus128 wants to merge 8 commits into Comfy-Org:master from rattus128:prs/aimdo-046-threaded-loader-2

rattus128 (Contributor) commented May 26, 2026

A handful of RAM optimizations, particularly for Windows with slow disks.

Dismantle the stream-pin-buffer. Instead, aimdo 0.4.6 has a direct file -> VRAM load API that uses the same threaded load, but with a static ring buffer that matches the chunk size and does the coalescing in C. This saves a lot of RAM and also avoids the prefault delay of a larger stream-pin-buffer allocation, while skirting the giant-weight issue WRT RAM.
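To illustrate the static ring buffer idea, here is a minimal Python sketch. It is not the aimdo API: `ring_read`, `CHUNK_SIZE`, `RING_SLOTS` and the `sink` callback (a stand-in for the upload to VRAM) are all hypothetical names, and the coalescing that aimdo does in C is left out. The point is only that a small, fixed set of chunk-sized buffers is allocated once and reused, so RAM use does not grow with the size of the weight being streamed.

```python
import queue
import threading

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size
RING_SLOTS = 4          # hypothetical number of slots in the ring


def ring_read(path, sink, chunk_size=CHUNK_SIZE, slots=RING_SLOTS):
    """Stream a file through a fixed set of reusable buffers.

    sink(view) stands in for the device upload. It must be done with the
    memoryview before it returns, because the slot is reused afterwards.
    Returns the total number of bytes delivered to sink.
    """
    buffers = [bytearray(chunk_size) for _ in range(slots)]  # allocated once
    free = queue.Queue()
    filled = queue.Queue()
    for i in range(slots):
        free.put(i)

    def reader():
        with open(path, "rb", buffering=0) as f:
            while True:
                i = free.get()          # blocks while every slot is in flight
                n = f.readinto(buffers[i])
                if not n:
                    filled.put(None)    # end-of-file marker
                    return
                filled.put((i, n))

    t = threading.Thread(target=reader, daemon=True)
    t.start()
    total = 0
    while True:
        item = filled.get()
        if item is None:
            break
        i, n = item
        sink(memoryview(buffers[i])[:n])
        total += n
        free.put(i)                     # hand the slot back to the reader
    t.join()
    return total
```

The reader thread stalls as soon as all slots are in flight, so the disk never runs further ahead of the consumer than `slots * chunk_size` bytes.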

From there, change the pin allocation and movement strategy to always max out pin allocation on the current model, even if there isn't enough reservation quota. Instead, move pins on the fly (taking the CUDA sync hit), as that is preferable to risking a disk hit or having to do a RAM deep copy. The MRU 2GB chunk gets evicted repeatedly and rotated through the shortfall, which avoids LRU-style eviction of all weights as the transformer cycles through everything.
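Why MRU rather than LRU for a cyclic access pattern can be shown with a toy simulation. This is an abstract sketch, not the pin mover in this PR: `count_misses` is a hypothetical helper, every weight is one unit in size, and a "miss" stands for whatever the fallback costs (a disk hit or a copy).

```python
from collections import OrderedDict


def count_misses(policy, n_weights, capacity, passes):
    """Count misses when weights 0..n_weights-1 are visited in order, repeatedly.

    policy is "lru" (evict the least recently used entry) or
    "mru" (evict the most recently used entry).
    """
    cache = OrderedDict()  # oldest use first, newest use last
    misses = 0
    for _ in range(passes):
        for w in range(n_weights):
            if w in cache:
                cache.move_to_end(w)   # refresh recency on a hit
                continue
            misses += 1
            if len(cache) >= capacity:
                # last=True pops the newest entry (MRU), last=False the oldest (LRU)
                cache.popitem(last=(policy == "mru"))
            cache[w] = True
    return misses


lru = count_misses("lru", n_weights=10, capacity=8, passes=5)
mru = count_misses("mru", n_weights=10, capacity=8, passes=5)
print(f"LRU misses: {lru}, MRU misses: {mru}")
```

With 10 weights and room for 8, LRU misses on every single access (50 misses over 5 passes), because each weight is evicted just before it comes around again. MRU misses 10 times on the first pass and then only 2 times per pass, which is the size of the shortfall.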

De-committing memory to free pin buffers is made lightly asynchronous to get it out of the critical path of the CPU main thread.
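As a rough picture of what "lightly asynchronous" can look like, the sketch below hands buffers to a single background thread that does the freeing. `DeferredReleaser` and `free_fn` are hypothetical names for illustration; the actual de-commit path in this PR is not shown here.

```python
import queue
import threading


class DeferredReleaser:
    """Frees buffers on a worker thread so the caller never waits on the free."""

    def __init__(self, free_fn):
        self._free_fn = free_fn          # stand-in for the real de-commit call
        self._q = queue.Queue()
        self._t = threading.Thread(target=self._run, daemon=True)
        self._t.start()

    def release(self, buf):
        """Queue a buffer for freeing and return immediately."""
        self._q.put(buf)

    def _run(self):
        while True:
            buf = self._q.get()
            if buf is None:              # shutdown marker
                return
            self._free_fn(buf)

    def close(self):
        """Free everything still queued, then stop the worker."""
        self._q.put(None)
        self._t.join()
```

The main thread pays only for a queue put. The cost of the free itself lands on the worker.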

Pinned memory is improved with an offload balancer algorithm. A max scatter algorithm spreads out the weights that miss out on getting loaded to RAM, so disk bandwidth can be maximized by evening out the load.
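One simple way to scatter a few items across an ordered sequence is to pick evenly spaced indices. The sketch below is an assumption about the general idea only: `max_scatter` is a hypothetical helper, it treats all weights as the same size, and it is not the implementation in this PR.

```python
def max_scatter(n_items, n_picked):
    """Return n_picked distinct indices in range(n_items), spaced as evenly as possible."""
    if not 0 <= n_picked <= n_items:
        raise ValueError("n_picked must be between 0 and n_items")
    # Take the midpoint of each of n_picked equal-width bins over the sequence.
    return [(2 * i + 1) * n_items // (2 * n_picked) for i in range(n_picked)]


# Example: 10 weights in load order, 3 of them do not fit in RAM.
offloaded = max_scatter(10, 3)
print(offloaded)  # [1, 5, 8]
```

Offloading weights 1, 5 and 8 instead of the last three (7, 8, 9) leaves gaps between disk reads, which gives a prefetcher time to fetch each one while the weights in between are served from RAM.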

Aimdo 0.4.7 improves VRAM load patterns by not loading past the VRAM usage limit when accounting for all yet-to-be-loaded pages. This avoids a disk revisit for these weights.

Finally, fix the file open mode on Windows and unify it with the aimdo open, which makes disks just a little faster on Windows.

Example test conditions:

Windows, RTX 5060, 32GB RAM, PCIe x4 Gen1 (downgraded)
LTX2.3 960x540x10s


Before:

[INFO] got prompt
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load LTXAVTEModel_
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
[INFO] Found quantization metadata version 1
[INFO] Detected mixed precision quantization
[INFO] Using mixed precision operations
[INFO] Native ops: nvfp4, int8_blockwise, float8_e4m3fn_rowwise, mxfp8, hybrid_mxfp8, float8_e5m2, float8_e4m3fn_blockwise, float8_e4m3fn, int8_tensorwise
[INFO] model weight dtype torch.bfloat16, manual cast: torch.bfloat16
[INFO] model_type FLUX
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[WARNING] no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
[INFO] 0 models unloaded.
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [03:44<00:00, 28.03s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [02:01<00:00, 40.46s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 463.97 seconds

After:

[INFO] got prompt
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load LTXAVTEModel_
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
[INFO] Found quantization metadata version 1
[INFO] Detected mixed precision quantization
[INFO] Using mixed precision operations
[INFO] Native ops: float8_e4m3fn_rowwise, float8_e4m3fn_blockwise, nvfp4, int8_tensorwise, int8_blockwise, mxfp8, float8_e5m2, float8_e4m3fn, hybrid_mxfp8
[INFO] model weight dtype torch.bfloat16, manual cast: torch.bfloat16
[INFO] model_type FLUX
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[WARNING] no CLIP/text encoder weights in checkpoint, the text encoder model will not be loaded.
[INFO] 0 models unloaded.
[INFO] Model LTXAVTEModel_ prepared for dynamic VRAM loading. 25440MB Staged. 0 patches attached. Force pre-loaded 400 weights: 1745 KB.
[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:43<00:00, 12.94s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:04<00:00, 21.59s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 277.78 seconds

v0.22.0:

Model LTXAV prepared for dynamic VRAM loading. 23838MB Staged. 1660 patches attached. Force pre-loaded 1496 weights: 44 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:53<00:00, 14.16s/it]
Model LTXAV prepared for dynamic VRAM loading. 23838MB Staged. 1660 patches attached. Force pre-loaded 1496 weights: 44 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:05<00:00, 21.95s/it]
Requested to load AudioVAE
loaded completely;  693.46 MB loaded, full load: True
Requested to load VideoVAE
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
Prompt executed in 361.52 seconds

After + #13971:

[INFO] Requested to load LTXAV
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:09<00:00,  8.71s/it]
[INFO] Model LTXAV prepared for dynamic VRAM loading. 23835MB Staged. 1660 patches attached. Force pre-loaded 2104 weights: 3308 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:46<00:00, 15.65s/it]
[INFO] Requested to load AudioVAE
[INFO] loaded completely;  693.46 MB loaded, full load: True
[INFO] Requested to load VideoVAE
[INFO] 0 models unloaded.
[INFO] Model VideoVAE prepared for dynamic VRAM loading. 1384MB Staged. 0 patches attached.
[INFO] Prompt executed in 239.95 seconds
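
For a quick comparison of the four runs, the totals from the "Prompt executed in" lines above can be turned into ratios. This is plain arithmetic on the logged numbers. The `speedup` helper is only for illustration.

```python
# Totals copied from the "Prompt executed in ... seconds" lines above.
totals = {
    "Before": 463.97,
    "v0.22.0": 361.52,
    "After": 277.78,
    "After + #13971": 239.95,
}


def speedup(baseline_s, new_s):
    """Ratio of baseline wall time to new wall time (above 1.0 means faster)."""
    return baseline_s / new_s


for name, secs in totals.items():
    ratio = speedup(totals["Before"], secs)
    print(f"{name:>15}: {secs:7.2f} s, {ratio:.2f}x vs Before")
```

On these numbers, "After" is about 1.67x faster than "Before" end to end, and "After + #13971" is about 1.93x faster.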

socket-security Bot commented May 26, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

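| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | websocket-client@1.9.0 | 98 | 100 | 100 | 100 | 100 |
| Added | pytest-asyncio@1.4.0 | 100 | 100 | 100 | 100 | 100 |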

View full report

Ph0rk0z commented May 26, 2026

Whole thing is still net negative for me.

#13802 (comment)

v22 dynamic vram disabled


INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
Requested to load Flux2
loaded completely; 14564.04 MB usable, 8996.02 MB loaded, full load: True
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.86s/it]
Requested to load TAESD
loaded completely; 4994.88 MB usable, 10.21 MB loaded, full load: True
Prompt executed in 33.81 seconds << compile (auto, no node)
got prompt
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
Prompt executed in 6.11 seconds << reroll

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.46s/it]
Prompt executed in 7.19 seconds << new prompt
got prompt

master dynamic vram enabled


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.20 seconds << (compile)
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.27 seconds << (slower reroll)
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.38 seconds << (owo, what's this?)

PR

[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 31.16 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.47s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.22 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.90 seconds

PR --cache-ram 1


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.03 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.24 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.91 seconds

alldtes9-tech commented May 27, 2026

> Whole thing is still net negative for me.


[INFO] INT8 Grouped LoRA: Stacked 4 LoRAs: klein_snofs_v1_3.safetensors, lenovo_flux_klein9b.safetensors, nicegirls_flux_klein9b.safetensors, Realism_Engine_Klein_V2.safetensors
[INFO] gguf qtypes: Q6_K (37), F32 (145), Q4_K (217)
[MINIMAL] Dequantizing token_embd.weight to prevent runtime OOM.
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 21002.17 MB usable, 6829.34 MB loaded, full load: True
[INFO] Requested to load Flux2
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Requested to load TAESD
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 30.03 seconds
[INFO] got prompt
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 6.24 seconds
[INFO] got prompt
[INFO] Requested to load Flux2TEModel_
[INFO] loaded completely; 11899.98 MB usable, 6829.34 MB loaded, full load: True
[INFO] Model Flux2 prepared for dynamic VRAM loading. 8996MB Staged. 112 patches attached. Force pre-loaded 80 weights: 59 KB.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.48s/it]
[INFO] Model TAESD prepared for dynamic VRAM loading. 10MB Staged. 0 patches attached. Force pre-loaded 16 weights: 12 KB.
[INFO] Prompt executed in 10.91 seconds

You're using an INT8 quant, which isn't natively supported in Comfy. AFAIK, you need the Comfy Kitchen fork and a custom node to make it work.

Also, you're using GGUF for the text encoder, which I don't think supports dynamic VRAM yet. rattus has a draft PR on the ComfyUI-GGUF repo, but I don't know if it's a complete implementation yet since it's still marked as draft.

Why not compare performance using quants that are natively supported in Comfy instead?

Edit:

Here are my results using quants that are supported in Comfy, using zimage BF16 + Qwen3 4B BF16.

Master with --disable-dynamic-vram args.

[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5677.80 MB usable, 5437.25 MB loaded, 2235.00 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] 0 models unloaded.
[INFO] Unloaded partially: 277.87 MB freed, 5159.38 MB remains loaded, 237.50 MB buffer reserved, lowvram patches: 0
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5621.67 MB usable, 5245.48 MB loaded, 6494.06 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
  0%|                                                                                            | 0/4 [00:00<?, ?it/s][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
 25%|█████████████████████                                                               | 1/4 [00:04<00:12,  4.03s/it][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
 50%|██████████████████████████████████████████                                          | 2/4 [00:05<00:05,  2.64s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00,  2.33s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4710.57 MB usable, 4334.25 MB loaded, 7405.29 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 112
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00,  4.53s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 74.20 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.78s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4733.45 MB usable, 4358.45 MB loaded, 7381.09 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00,  4.56s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 50.51 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5612.68 MB usable, 5374.75 MB loaded, 2297.50 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.80s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4735.93 MB usable, 4360.93 MB loaded, 7378.61 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:35<00:00,  4.46s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 53.89 seconds <<<< 3rd run (change prompt)

master with dynamic vram enabled (default)

[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] 0 models unloaded.
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
  0%|                                                                | 0/4 [00:00<?, ?it/s,   Model Initializing ...  ][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
 25%|████████████▎                                    | 1/4 [00:07<00:23,  7.73s/it,  Model Initialization complete!  ][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
 50%|██████████████████████████████████████████                                          | 2/4 [00:02<00:02,  1.20s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.14s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:23<00:00,  3.00s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 42.62 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.10s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00,  2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 32.90 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.12s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00,  2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 35.02 seconds <<<< 3rd run (change prompt)

Ph0rk0z commented May 27, 2026

> You're using INT8 quant, which isn't natively supported in Comfy. AFAIK, you need the Comfy Kitchen fork and a custom node to make it work.

Because this used to work for me. I am comparing the thing I actually want to use, not something theoretical. One could just as well ask, "Why not compare the results on a system with an 8GB GPU and 16GB of RAM? Why not compare SDXL?", etc.

> Master with --disable-dynamic-vram args.

Can't do that anymore because on master dynamic vram can no longer be properly disabled. The last PR that caused this disk-reloading behavior functionally deprecated this option. I can give you a log of .22 vs master if you'd like. You cannot induce a regression and then say that your "fix" makes it better.

alldtes9-tech commented

> Because this used to work for me. I am comparing the thing I actually want to use not some theoretical. One could ask "why not compare the results on a system with an 8gb GPU and 16gb of ram. Why not compare SDXL, etc.

I'm not asking you to run a different model or use a different GPU or memory setup. I was asking why not compare using quants that are natively supported in Comfy.

If Comfy makes changes that improve performance for the native path and those changes end up affecting performance in your custom setup, I don't think it's fair to immediately conclude that Comfy introduced a performance regression just because a custom integration becomes slower.

Since you're using GGUF/INT8 through a custom node / Comfy Kitchen fork, it might also be worth asking those maintainers whether the latest master introduced changes they need to adapt to.

Personally, I mostly use models and quants that work through native Comfy paths, and I've generally seen benefits from recent changes.

That said, if the slowdown also happens on native setups, then that's a different discussion.

Ph0rk0z commented May 27, 2026

There is literally no scenario where loading from disk will help me because my disks are slow and I have plenty of ram just for this reason.

> I was asking why not compare using quants that are natively supported in Comfy

Because those quants don't work for me. If they had, I wouldn't have sought different ones. I am having to re-invent the wheel and most likely fix those things myself, they are not the only broken nodes from this change. Prior to #13802 I was able to turn it off for my use, and you were able to leave it on for your use. We could both have our cake.

rattus128 added 8 commits May 28, 2026 19:18
Make destination optional (or make it optionally GPU) and use aimdo
to file_read direct to GPU.
This consumed too much RAM and it's better to just take the hit on
the CPU syncing back the stream on a short ring buffer. Aimdo
implements this, so just rip the stream pin buffer out of comfy.
It's better to just let the active model load past the pin limit as
pins and let the pins move around. This saves the HDD and SATA
people disk traffic while only costing a few GPU syncs.
This opens files on Windows with more favourable flags.
Exclude live loras from the numbers to avoid the case where the reported
loaded memory exceeds the size of the model.

This caused me confusion in the Kijai visualizer when it looked fully
loaded but was hitting disk due to this accounting discrepancy.
Useful for max scattering something ordered.
Use a max scatter algorithm to prioritize pins of the same size such
that when doing a little bit of offloading it gets scattered, allowing
the prefetcher to more evenly swallow the offload.
Aimdo 0.4.7 implements VRAM buffer exhaustion prediction to avoid
early speculative load of weights that definitely won't fit once the
inference gets further in.
@rattus128 rattus128 force-pushed the prs/aimdo-046-threaded-loader-2 branch from 1eeb963 to bf0ac49 Compare May 28, 2026 09:19