Threaded Loader performance fixes / improvements (+ Aimdo 0.4.6) #14116
rattus128 wants to merge 8 commits into
Whole thing is still net negative for me. Compared: v22 with dynamic vram disabled, master with dynamic vram enabled, PR, and PR with --cache-ram 1.
You're using an INT8 quant, which isn't natively supported in Comfy. AFAIK, you need the Comfy Kitchen fork and a custom node to make it work. Also, you're using GGUF for the text encoder, which I don't think supports dynamic vram yet. rattus has a draft PR on the ComfyUI-GGUF repo, but I don't know if it's a complete implementation yet since it's still marked as draft. Why not compare performance using quants that are natively supported in Comfy instead?
Edit: Here are my results using quants that are supported in Comfy, using zimage BF16 + Qwen3 4B BF16.
Master with
[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5677.80 MB usable, 5437.25 MB loaded, 2235.00 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] 0 models unloaded.
[INFO] Unloaded partially: 277.87 MB freed, 5159.38 MB remains loaded, 237.50 MB buffer reserved, lowvram patches: 0
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5621.67 MB usable, 5245.48 MB loaded, 6494.06 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
0%| | 0/4 [00:00<?, ?it/s][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
25%|█████████████████████ | 1/4 [00:04<00:12, 4.03s/it][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
50%|██████████████████████████████████████████ | 2/4 [00:05<00:05, 2.64s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:09<00:00, 2.33s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4710.57 MB usable, 4334.25 MB loaded, 7405.29 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 112
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00, 4.53s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 74.20 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00, 1.78s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4733.45 MB usable, 4358.45 MB loaded, 7381.09 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:36<00:00, 4.56s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 50.51 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Requested to load ZImageTEModel_
[INFO] loaded partially; 5612.68 MB usable, 5374.75 MB loaded, 2297.50 MB offloaded, 237.50 MB buffer reserved, lowvram patches: 0
[INFO] Requested to load Lumina2
[INFO] loaded partially; 5612.67 MB usable, 5237.67 MB loaded, 6501.87 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 100
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00, 1.80s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] loaded partially; 4735.93 MB usable, 4360.93 MB loaded, 7378.61 MB offloaded, 375.00 MB buffer reserved, lowvram patches: 111
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:35<00:00, 4.46s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] loaded partially; 0.00 MB usable, 0.00 MB loaded, 159.87 MB offloaded, 13.50 MB buffer reserved, lowvram patches: 0
[INFO] Prompt executed in 53.89 seconds <<<< 3rd run (change prompt)
master with dynamic vram enabled (default)
[INFO] got prompt
[INFO] Using pytorch attention in VAE
[INFO] Using pytorch attention in VAE
[INFO] VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[INFO] model weight dtype torch.bfloat16, manual cast: None
[INFO] model_type FLOW
[INFO] CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
[INFO] Requested to load ZImageTEModel_
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] 0 models unloaded.
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[WARNING] [FeatureInjLatent] Reference latent: shape=torch.Size([1, 16, 90, 68])
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
0%| | 0/4 [00:00<?, ?it/s, Model Initializing ... ][WARNING] [FeatureInjLatent] step=1 | progress=0.00 | eff_str=0.150 | no mask
25%|████████████▎ | 1/4 [00:07<00:23, 7.73s/it, Model Initialization complete! ][WARNING] [FeatureInjLatent] step=2 | progress=0.03 | eff_str=0.143 | no mask
50%|██████████████████████████████████████████ | 2/4 [00:02<00:02, 1.20s/it][WARNING] [FeatureInjLatent] step=3 | progress=0.06 | eff_str=0.135 | no mask
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.14s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:23<00:00, 3.00s/it]
[INFO] Requested to load AutoencodingEngine
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 42.62 seconds <<<< 1st run
[INFO] got prompt
[INFO] Requested to load Lumina2
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.10s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00, 2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 32.90 seconds <<<< 2nd run (change seed)
[INFO] got prompt
[INFO] Model ZImageTEModel_ prepared for dynamic VRAM loading. 7671MB Staged. 0 patches attached. Force pre-loaded 145 weights: 383 KB.
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.12s/it]
[INFO] Requested to load Lumina2
[INFO] 0 models unloaded.
[INFO] Model Lumina2 prepared for dynamic VRAM loading. 11738MB Staged. 166 patches attached. Force pre-loaded 205 weights: 1045 KB.
100%|████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:22<00:00, 2.86s/it]
[INFO] 0 models unloaded.
[INFO] Model AutoencodingEngine prepared for dynamic VRAM loading. 159MB Staged. 0 patches attached. Force pre-loaded 108 weights: 182 KB.
[INFO] Prompt executed in 35.02 seconds <<<< 3rd run (change prompt)
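For a quick side-by-side of the two logs above, the "Prompt executed" totals can be compared per run. This is only arithmetic on the numbers already printed; the names `first_log` and `dynamic_vram_log` are labels chosen here for illustration, not anything Comfy emits.

```python
# "Prompt executed in ..." totals copied from the two logs above, in seconds.
# Order: 1st run, 2nd run (change seed), 3rd run (change prompt).
first_log = [74.20, 50.51, 53.89]         # the log showing "loaded partially"
dynamic_vram_log = [42.62, 32.90, 35.02]  # the log showing "prepared for dynamic VRAM loading"

# Ratio > 1 means the dynamic-vram log finished faster for that run.
speedups = [a / b for a, b in zip(first_log, dynamic_vram_log)]

for run, (a, b, s) in enumerate(zip(first_log, dynamic_vram_log, speedups), start=1):
    print(f"run {run}: {a:.2f}s vs {b:.2f}s -> {s:.2f}x")
```

These are single samples per run from one machine, so they show a trend rather than a measured distribution.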
Because this used to work for me. I am comparing the thing I actually want to use, not something theoretical. One could just as well ask "why not compare the results on a system with an 8gb GPU and 16gb of ram?" or "why not compare SDXL?", etc.
Can't do that anymore because on master dynamic vram can no longer be properly disabled. The last PR that caused this disk-reloading behavior functionally deprecated this option. I can give you a log of .22 vs master if you'd like. You cannot induce a regression and then say that your "fix" makes it better.
I'm not asking you to run a different model or use a different GPU or memory setup. If Comfy makes changes that improve performance for the native path and those changes end up affecting performance in your custom setup, I don't think it's fair to immediately conclude that Comfy introduced a performance regression just because a custom integration becomes slower. Since you're using gguf/INT8 through a custom node / Comfy Kitchen fork, it might also be worth asking those maintainers whether latest master introduced changes they need to adapt to. Personally, I mostly use models and quants that work through native Comfy paths, and I've generally seen benefits from recent changes. That said, if the slowdown also happens on native setups, then that's a different discussion.
There is literally no scenario where loading from disk will help me because my disks are slow and I have plenty of ram just for this reason.
Because those quants don't work for me. If they had, I wouldn't have sought different ones. I am having to re-invent the wheel and most likely fix those things myself; they are not the only broken nodes from this change. Prior to #13802 I was able to turn it off for my use, and you were able to leave it on for your use. We could both have our cake.
Make destination optional (or make it optionally GPU) and use aimdo to file_read directly to the GPU.
This consumed too much RAM and it's better to just take the hit on the CPU syncing back the stream on a short ring buffer. Aimdo implements this, so just rip the stream pin buffer from comfy.
It's better to just let the active model load past the pin limit as pins and let the pins move around. This saves the HDD and SATA people disk traffic while only costing a few GPU syncs.
This opens on Windows with more favourable flags.
Exclude live loras from the numbers to avoid the case where the reported loaded memory exceeds the size of the model. This caused me confusion in the Kijai visualizer when it looked fully loaded but was hitting disk due to this accounting discrepancy.
Useful for max scattering something ordered.
Use a max scatter algorithm to prioritize pins of the same size such that when doing a little bit of offloading it gets scattered, allowing the prefetcher to more evenly swallow the offload.
Aimdo 0.4.7 implements VRAM buffer exhaustion prediction to avoid early speculative load of weights that definitely won't fit once the inference gets further in.
1eeb963 to bf0ac49
A handful of RAM optimizations, particularly on Windows with slow disks.
Dismantle the stream-pin-buffer; instead, aimdo 0.4.6 has a direct file -> VRAM load API using the same threaded load but with a static ring buffer that matches the chunk size and does coalescence in C. This saves a lot of RAM and also avoids the prefault delay of a larger stream-pin-buffer allocation, while skirting the giant-weight problem WRT RAM.
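To illustrate the ring buffer idea in plain Python: a reader thread fills a fixed set of chunk-sized staging buffers while the consumer drains them into the destination. This is a minimal sketch under stated assumptions, not aimdo's API or implementation. `ring_read`, `CHUNK_SIZE` and `RING_SLOTS` are hypothetical names, and a `bytearray` stands in for the VRAM destination.

```python
import queue
import threading

CHUNK_SIZE = 1 << 20  # hypothetical chunk size (1 MiB)
RING_SLOTS = 4        # hypothetical number of staging buffers in the ring


def ring_read(path, dest, chunk_size=CHUNK_SIZE, slots=RING_SLOTS):
    """Copy the file at `path` into `dest` through a fixed ring of staging buffers.

    Staging memory is slots * chunk_size bytes no matter how large the file is.
    `dest` is any writable buffer at least as large as the file; here it stands
    in for the GPU destination. Returns the number of bytes copied.
    """
    ring = [bytearray(chunk_size) for _ in range(slots)]
    free = queue.Queue()    # indices of slots the reader may overwrite
    filled = queue.Queue()  # (slot, nbytes) pairs ready for the consumer
    for i in range(slots):
        free.put(i)

    def reader():
        try:
            with open(path, "rb", buffering=0) as f:
                while True:
                    slot = free.get()           # blocks while the ring is full
                    n = f.readinto(ring[slot])  # read straight into the slot
                    if not n:                   # 0 bytes means end of file
                        break
                    filled.put((slot, n))
        finally:
            filled.put(None)                    # always wake the consumer

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    view = memoryview(dest)
    offset = 0
    while True:
        item = filled.get()
        if item is None:
            break
        slot, n = item
        view[offset:offset + n] = memoryview(ring[slot])[:n]
        offset += n
        free.put(slot)                          # hand the slot back to the reader
    thread.join()
    return offset
```

The point of the fixed ring is that staging memory no longer scales with the largest weight, at the cost of the consumer having to return each slot before the reader can reuse it.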
From there, change the pin allocation and movement strategy to always max out pin allocation on the current model even if there isn't enough reservation quota. Instead, move pins on the fly (taking the CUDA sync hit), as that is preferable to risking a disk hit or having to do a RAM deep copy. The MRU 2GB chunk gets evicted repeatedly and rotated through the shortfall to avoid LRU all-weights eviction as the transformer cycles through everything.
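The reason MRU eviction beats LRU for a cyclic access pattern can be shown with a toy simulation. This is only a sketch of the general caching effect, not the PR's pin-moving code: `simulate` is a hypothetical helper, and it counts whole weights rather than 2GB chunks.

```python
from collections import OrderedDict


def simulate(num_weights, capacity, passes, policy):
    """Count misses when weights 0..num_weights-1 are visited in order, repeatedly.

    `capacity` is how many weights fit in the cache at once.
    `policy` is "lru" (evict least recently used) or "mru" (evict most recently used).
    """
    cache = OrderedDict()  # key order: least recently used first
    misses = 0
    for _ in range(passes):
        for w in range(num_weights):
            if w in cache:
                cache.move_to_end(w)  # mark as most recently used
                continue
            misses += 1
            if len(cache) >= capacity:
                # last=True pops the most recently used entry, last=False the least.
                cache.popitem(last=(policy == "mru"))
            cache[w] = True
    return misses


if __name__ == "__main__":
    for policy in ("lru", "mru"):
        print(policy, simulate(num_weights=10, capacity=8, passes=5, policy=policy))
```

With 10 weights and room for 8, LRU misses on every access because each weight is evicted just before it comes around again, while MRU only misses on the shortfall of 2 per pass once warm.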
De-committing memory for the sake of pin buffer freeing is made lightly asynchronous to get it out of the CPU main thread's critical path.
Pinned memory is improved with an offload balancer algorithm. A max scatter algorithm is used to spread out the weights that miss out on getting loaded to RAM, so disk bandwidth can be maximized by evening out the load.
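One simple way to scatter the offloaded weights is to pick evenly spaced positions in load order. This sketch assumes equally sized weights and is not the PR's actual algorithm; `scatter_offload` is a hypothetical name.

```python
def scatter_offload(num_weights, num_offloaded):
    """Pick which positions (in load order) to leave off RAM, spread as evenly as possible.

    Returns a sorted list of `num_offloaded` distinct indices in range(num_weights).
    Index i is placed at the midpoint of the i-th of `num_offloaded` equal segments.
    """
    k = max(0, min(num_offloaded, num_weights))
    # Integer form of floor((i + 0.5) * num_weights / k).
    return [((2 * i + 1) * num_weights) // (2 * k) for i in range(k)]


if __name__ == "__main__":
    # 10 weights, 3 of which do not fit in RAM.
    print(scatter_offload(10, 3))
```

Compared with offloading the last few weights as one contiguous block, spacing the misses out gives a prefetcher time to pull each one from disk while the pinned weights in between are being processed.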
Aimdo 0.4.7 improves VRAM load patterns by not loading past the VRAM usage once all yet-to-be-loaded pages are accounted for. This avoids a disk revisit for these weights.
Finally, fix the file open mode on Windows and unify it with the aimdo open, which makes disks just a little faster on Windows.
Example test conditions:
Windows, RTX5060, 32GB RAM, PCIE x4 Gen1 (downgraded)
LTX2.3 960x540x10s
Before:
After:
v0.22.0:
After + #13971: