MultiGPU Work Units For Accelerated Sampling (CORE-184) #7063
Merged
…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…nsHook are not yet operational
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
…t torch hardware device
This was an attempt at a fast path: ensure the file slice was created by the owning thread and refuse otherwise, without needing a mutex. But worksplit-multigpu doesn't work that way. Go mutex. Shoot me for overthinking next time.
Comfy-aimdo 0.4.4 contains a small bugfix to allow recovery of a hostbuf after full truncation. This pattern doesn't happen as a general rule, but does happen in the upcoming worksplit-multigpu branch.
Introduce tiled_scale_multidim_multigpu in comfy/utils.py: a tile scheduler that dispatches per-device tile functions through the existing MultiGPUThreadPool and merges per-device CPU output buffers in deterministic key order. The worker only catches BaseException at the thread boundary to funnel errors to the main thread; bare torch.cuda.set_device and torch.cuda.synchronize calls inside the worker fail loud if the device is not CUDA, which is part of the primitive's contract. Add UPSCALE_MODEL input on the MultiGPU CFG Split node and an upscale-model descriptor deepclone helper in comfy/multigpu.py. Clones stay CPU-resident until execute time and are returned to CPU afterward. ImageUpscaleWithModel dispatches through tiled_scale_multidim_multigpu when a multigpu descriptor is attached; the single-device path runs unchanged when no clones are present.
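The dispatch-and-merge pattern described in this commit can be sketched roughly as follows. This is an illustration only, not the code in comfy/utils.py: it uses the stdlib `ThreadPoolExecutor` in place of `MultiGPUThreadPool`, plain Python values in place of tensors, and the names `run_tiles_multidevice` and `tile_fns` plus the round-robin assignment are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tiles_multidevice(tiles, tile_fns):
    """Spread tiles over per-device functions and merge results in key order.

    tiles:    dict mapping a sortable tile key (e.g. (y, x)) to tile data
    tile_fns: dict mapping a device label to a function that processes one tile
    """
    devices = sorted(tile_fns)
    keys = sorted(tiles)
    # Round-robin assignment (an assumption; the real scheduler may differ).
    assignment = {dev: keys[i::len(devices)] for i, dev in enumerate(devices)}

    def worker(dev):
        out = {}
        try:
            for key in assignment[dev]:
                out[key] = tile_fns[dev](tiles[key])
        except BaseException as exc:  # caught only at the thread boundary
            return out, exc
        return out, None

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        results = list(pool.map(worker, devices))

    merged = {}
    for out, exc in results:
        if exc is not None:
            raise exc  # surface the worker's error on the main thread
        merged.update(out)
    # Output order depends only on the keys, not on which thread finished first.
    return [merged[key] for key in keys]
```

Merging by sorted key is what keeps the output independent of thread timing; the per-device functions are free to finish in any order.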
fixup threaded loader with worksplit multi-gpu
* Revert "Add tiled VAE lane to MultiGPU Work Units"
This reverts commit 4d3d68e. The tiled VAE lane will land as part of a follow-up PR alongside the UPSCALE_MODEL lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused.
* Revert "Add UPSCALE_MODEL lane to MultiGPU CFG Split"
This reverts commit 74b0a82. The UPSCALE_MODEL lane will land as part of a follow-up PR alongside the tiled VAE lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused.
---------
Co-authored-by: John Pollock <pollockjj@gmail.com>
Two fixes for single-GPU users on non-NVIDIA backends; multi-GPU
non-CUDA support is intentionally out of scope here (tracked separately).
1. get_all_torch_devices: add AMD/ROCm, MLU, and a generic fallback arm.
Previously the function only enumerated NVIDIA, Intel XPU, and Ascend
NPU when cpu_state==GPU; on AMD/ROCm (which exposes its GPU through
torch.cuda.*) and DirectML it fell through to an empty list. The
biggest user-visible regression: unload_all_models() iterates this
list, so it became a silent no-op on AMD/ROCm. /free, manager
unloads, and shutdown stopped releasing VRAM.
- is_amd() now shares the torch.cuda.* arm with is_nvidia(), since
ROCm reuses the CUDA API surface.
- is_mlu() gets its own arm using torch.mlu.device_count().
- A final fallback appends get_torch_device() for any GPU backend
the explicit arms miss (notably DirectML), so callers see at
least the current device and unload_all_models works.
MPS users are unaffected: cpu_state==MPS already routes to the
else branch which appends get_torch_device() returning mps.
2. main.py DynamicVRAM init: guard the comfy_aimdo branch with an
explicit is_nvidia() check.
The outer condition allows entering the DynamicVRAM init block when
the user passes --enable-dynamic-vram explicitly, bypassing the
implicit is_nvidia() gate. On non-NVIDIA backends this then runs
comfy_aimdo.control.init_devices(range(torch.cuda.device_count())),
which is comfy-aimdo-only territory and may crash at startup. Add a
leading is_nvidia() check that logs a clean warning and falls back
to the legacy ModelPatcher path.
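The enumeration arms and the final fallback described in item 1 can be sketched in plain Python. This is a simplified stand-in, not the real `get_all_torch_devices`: the function name `enumerate_devices`, its parameters, and the string device labels are assumptions made so the sketch runs without torch.

```python
def enumerate_devices(backend, device_count, current_device):
    """Return device labels for one backend, never an empty list.

    backend:        e.g. "nvidia", "amd", "xpu", "npu", "mlu", "directml"
    device_count:   number of devices the backend reports
    current_device: label of the device the process is currently using
    """
    # ROCm reuses the CUDA API surface, so AMD shares the NVIDIA arm.
    prefix_by_backend = {
        "nvidia": "cuda",
        "amd": "cuda",
        "xpu": "xpu",
        "npu": "npu",
        "mlu": "mlu",
    }
    devices = []
    prefix = prefix_by_backend.get(backend)
    if prefix is not None:
        devices = [f"{prefix}:{i}" for i in range(device_count)]
    # Fallback for backends no explicit arm covers (e.g. DirectML):
    # callers still see at least the current device.
    if not devices:
        devices.append(current_device)
    return devices
```

With a fallback like this, a caller that iterates the list (such as an unload-everything routine) always has at least one device to act on.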
Fix single-GPU non-CUDA regressions on worksplit-multigpu (AMD/ROCm unload, DynamicVRAM crash)
Master commit cf758bd (PR #13663, "chore(api-nodes): increase default timeout for partner API node tasks") removed three explicit max_poll_attempts=280 overrides from nodes_kling.py so the new 480 default in util/client.py would take effect. The May 19 merge of master into worksplit-multigpu (ff766e5) silently discarded those three deletions in the 3-way resolve - nodes_kling.py had no textual conflict but the resolution kept the pre-cf758bd2 lines. The other seven files cf758bd touched were merged correctly; this restores nodes_kling.py to match master. Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13 Co-authored-by: Amp <amp@ampcode.com>
….cuda The aimdo init call on worksplit-multigpu was using comfy_aimdo.control.init_devices(range(torch.cuda.device_count())) which required adding `import torch` at the top of main.py (violating the "torch should never be imported before this point" expectation) and an inner is_nvidia() guard added in PR #14068 to defend the raw cuda call on non-NVIDIA systems where --enable-dynamic-vram is explicitly passed. Replace the call with comfy_aimdo.control.init_devices( d.index for d in comfy.model_management.get_all_torch_devices() if d.type == "cuda" and d.index is not None ) comfy_aimdo.control.init_devices accepts any iterable of int-coercible device indices and returns False on an empty iterable, so on non-cuda systems the elif naturally falls through to the existing "No working comfy-aimdo install detected" fallback - no extra vendor gate needed. HIP devices appear as type "cuda" in torch, so ROCm setups (which comfy-aimdo supports via aimdo_rocm.so) are handled correctly too. This lets us drop both the `import torch` at the top of main.py and the inner is_nvidia() guard, leaving a single logical-line divergence from master (init_device(single index) -> init_devices(generator of cuda indices)) for multi-GPU aimdo support. Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13 Co-authored-by: Amp <amp@ampcode.com>
get_all_torch_devices() only enumerates one vendor at a time (the
is_nvidia/is_intel_xpu/is_ascend_npu branches are exclusive and each
constructs devices via torch.device("type", i) with a real integer
index), and aimdo_control.init_devices short-circuits on lib is None
before iterating, so the d.type == "cuda" and d.index is not None
filter cannot ever change the result. Match master's trust level and
just pass the indices directly.
Reduces the divergence from master to a single line:
init_device(get_torch_device().index)
-> init_devices(d.index for d in get_all_torch_devices())
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Remove the device-selection widgets that were added directly to existing
loader nodes (and the new CheckpointLoaderDevice / ImageOnlyCheckpointLoaderDevice
variants):
- nodes.py:
- delete CheckpointLoaderDevice class and its NODE_CLASS_MAPPINGS /
NODE_DISPLAY_NAME_MAPPINGS entries
- remove the optional `device` input + VALIDATE_INPUTS + resolve logic
from UNETLoader, VAELoader, CLIPLoader, DualCLIPLoader
- restore CLIPLoader/DualCLIPLoader `device` options to ["default", "cpu"]
- comfy_extras/nodes_video_model.py:
- delete ImageOnlyCheckpointLoaderDevice class + its mapping entries
- comfy_extras/nodes_lt_audio.py:
- restore LTXAVTextEncoderLoader `device` options to ["default", "cpu"]
and revert the resolve logic back to the simple `if device == "cpu"`
branch
The replacement approach is a small set of passthrough Select*Device
nodes (added in the next commit) that retarget MODEL/CLIP/VAE devices
without bloating every loader's UI or duplicating loaders.
The cuda_device_context helper and the model_management helpers
(get_gpu_device_options / resolve_gpu_device_option) from #13483 are
kept; they are still used by the new selector nodes.
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Replace the per-loader device widgets removed in the previous commit with
three small passthrough selector nodes registered under advanced/multigpu:
- Select Model Device (MODEL in/out) - options: default / cpu / gpu:N
- Select CLIP Device (CLIP in/out) - options: default / cpu / gpu:N
- Select VAE Device (VAE in/out) - options: default / gpu:N (no cpu)
Each node clones the inbound patcher (model.clone() / clip.clone() / copy.copy(vae)+vae.patcher.clone()) and retargets load_device (and offload_device for cpu / vae_offload_device for VAE).
Portability across machines with different GPU counts:
- VALIDATE_INPUTS returns True so an unknown gpu:N value (e.g. a workflow saved on a 2-GPU machine opened on a 1-GPU machine) does not error at validation time.
- At runtime, resolve_gpu_device_option(...) returns None for unknown options (with a warning), and each selector then logs a per-node info message and passes through unchanged, matching the no-op style used by MultiGPU CFG Split's "No extra torch devices need initialization..." log.
Also adds comfy.model_management.get_gpu_device_options_no_cpu() which the VAE selector uses; on a single-GPU box this collapses to just ["default"], which is fine.
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
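A minimal sketch of the portability behavior described in this commit, assuming string device labels and a known GPU count. The names `resolve_device_option` and `select_device` are hypothetical; the real `resolve_gpu_device_option` signature is not shown here and may differ.

```python
def resolve_device_option(option, gpu_count):
    """Map a widget value to a device label, or None if unavailable here."""
    if option == "cpu":
        return "cpu"
    if option.startswith("gpu:"):
        suffix = option[len("gpu:"):]
        if suffix.isdigit() and int(suffix) < gpu_count:
            return f"cuda:{int(suffix)}"
    # "default" and unknown options both resolve to nothing.
    return None


def select_device(current_device, option, gpu_count):
    """Return the device to use; pass through unchanged on an unknown option."""
    resolved = resolve_device_option(option, gpu_count)
    if resolved is None:
        return current_device
    return resolved
```

A workflow saved with `gpu:1` therefore still runs on a single-GPU machine: the lookup fails softly and the node behaves as a no-op.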
When --enable-dynamic-vram is on, every ModelPatcher is a
ModelPatcherDynamic whose underlying model has a per-device dynamic_pins
dict, initialized in __init__ for self.load_device only. If a cloned
patcher's load_device is later reassigned (as the Select{Model,CLIP,VAE}
Device nodes do), the new device key is missing and partially_unload_ram
raises KeyError: device(type='cuda', index=N).
Fix:
- Extract the per-device dynamic_pins init in ModelPatcherDynamic.__init__
into a new helper method register_load_device(device) which is now also
called from __init__.
- Each Select*Device node calls clone.patcher.register_load_device(resolved)
after retargeting load_device, guarded by hasattr so non-dynamic
patchers (plain ModelPatcher in non-dynamic-vram installs) skip it.
Caught by happy-path test where SelectCLIPDevice retargeted CLIP from
cuda:0 to cuda:1 and CLIPTextEncode then crashed in
partially_unload_ram -> dynamic_pins[cuda:1].
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
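The failure and the fix in this commit can be illustrated with a toy class. This is a sketch under simplifying assumptions: `DynamicPatcherSketch` and `retarget` are hypothetical, devices are plain strings, and the per-device dict lives on the patcher itself rather than on the underlying model.

```python
class DynamicPatcherSketch:
    """Toy stand-in for a patcher that keeps per-device bookkeeping."""

    def __init__(self, load_device):
        self.load_device = load_device
        self.dynamic_pins = {}
        self.register_load_device(load_device)

    def register_load_device(self, device):
        # Idempotent: existing pins are kept if the device is already known.
        self.dynamic_pins.setdefault(device, [])

    def partially_unload_ram(self):
        # Raises KeyError if load_device was reassigned without registering it.
        return len(self.dynamic_pins[self.load_device])


def retarget(patcher, device):
    """Point a patcher at a new device and register its bookkeeping."""
    patcher.load_device = device
    # hasattr guard: patchers without per-device state skip registration.
    if hasattr(patcher, "register_load_device"):
        patcher.register_load_device(device)
    return patcher
```

Assigning `load_device` directly reproduces the `KeyError`; going through `retarget` does not.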
V3 io.ComfyNode subclasses use the lowercase `validate_inputs` hook for opting out of strict combo validation (execution.py line 862); the uppercase `VALIDATE_INPUTS` is the V1 spelling and is ignored on V3 nodes. The strict combo check at execution.py line 1025 is gated on `if x not in validate_function_inputs`, so renaming to `validate_inputs(cls, device='default')` lets unknown `gpu:N` values pass validation and fall through to the runtime fallback. Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13 Co-authored-by: Amp <amp@ampcode.com>
True reset semantics for "default":
- On first selector application, cache the loader's original load_device / offload_device on the underlying model object (which is shared across patcher clones) and restore those base values when the user picks "default". Previously "default" meant "passthrough" so SelectXDevice(gpu:1) -> SelectXDevice(default) silently kept the gpu:1 routing.
CPU + dynamic VRAM:
- When SelectModelDevice / SelectCLIPDevice resolves to CPU on a ModelPatcherDynamic, also call clone(disable_dynamic=True) so the result is a plain ModelPatcher, matching ModelPatcherDynamic.__new__'s intent that CPU loads never run through the dynamic path. Fallback to the regular dynamic clone if disable_dynamic is unsupported on that patcher.
MultiGPU collision pruning:
- After SelectModelDevice retargets the primary patcher, drop any multigpu clone (from a prior MultiGPU CFG Split) whose load_device now matches the primary; otherwise two patchers would be bound to the same device. Logs the prune at info level.
SelectVAEDevice: reject CPU at runtime:
- The UI uses get_gpu_device_options_no_cpu(), but a workflow opened from another machine could still pass "cpu" through validate_inputs. Detect that case explicitly, log a "CPU is not a supported choice" passthrough message, and leave the VAE unchanged.
Cosmetic:
- Update VAE node docstring to accurately reflect the runtime CPU rejection rather than the older "intentionally not offered" claim.
- Demote the fallback warnings inside resolve_gpu_device_option to no log at all; the Select*Device nodes now own a single context-rich info-level message per failed lookup, so there is no double logging.
Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
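The "default" reset semantics can be sketched as below. The classes and the `base_load_device` attribute name are hypothetical; the only idea taken from the commit message is that the loader's original device is cached once on the model object shared by all patcher clones.

```python
class Model:
    """Stand-in for the underlying model object shared by patcher clones."""


class Patcher:
    def __init__(self, model, load_device):
        self.model = model
        self.load_device = load_device

    def clone(self):
        # Clones share the same model object but carry their own load_device.
        return Patcher(self.model, self.load_device)


def select_device(patcher, option):
    clone = patcher.clone()
    # Cache the loader's device once, on the shared model object.
    if not hasattr(clone.model, "base_load_device"):
        clone.model.base_load_device = patcher.load_device
    if option == "default":
        clone.load_device = clone.model.base_load_device
    else:
        clone.load_device = option
    return clone
```

Chaining `select_device(p, "cuda:1")` into `select_device(..., "default")` now returns to the loader's device instead of silently keeping `cuda:1`.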
rattus128
reviewed
May 23, 2026
    n.model = copy.deepcopy(n.model)
    # unlike for normal clone, backup dicts that shared same ref should not;
    # otherwise, patchers that have deep copies of base models will erroneously influence each other.
    n.backup = copy.deepcopy(n.backup)
Contributor
The backup is not validly clonable in the case of reconstruction.
How close are we to just being able to call clone() as-is? What's the main divergence in logic?
Member
Author
I think the logic here is now more cleaned up; there were some assumptions made a year ago that have not been updated since then.
…for CLIP/VAE; Select*Device retargets via deepclone
- ModelPatcher.deepclone_multigpu: remove copy.deepcopy fallback. Require cached_patcher_init (raise a descriptive RuntimeError if missing) and always go through clone(model_override=...) with empty backup containers so the per-device clone owns a pristine, unpatched module instead of a deepcopy of an already-loaded/already-patched one. Also call register_load_device on the new patcher so ModelPatcherDynamic per-device bookkeeping (e.g. dynamic_pins) is populated for the requested load device.
- comfy/sd.py: register cached_patcher_init on the CLIP and VAE patchers returned by load_checkpoint_guess_config, and on the patcher returned by load_diffusion_model's companion paths. Add load_checkpoint_clip_patcher, load_checkpoint_vae_patcher, and load_vae_patcher reload helpers so the same loader context can be reused to produce per-device clones.
- nodes.py: VAELoader registers cached_patcher_init on the produced VAE's patcher when there is a single backing file (skip for pixel_space and composite image-TAESDs which aren't addressable by a single path).
- comfy_extras/nodes_multigpu.py: SelectModelDevice / SelectCLIPDevice / SelectVAEDevice now retarget via deepclone_multigpu when the requested device differs from the current load_device, so the consumed model is not just relabeled but actually rehomed onto the chosen device.
Verified on runner-2 (2x RTX 4090, comfy-aimdo 0.4.4):
- 10/10 focused unit tests (deepclone behavior, missing-factory error path, Select*Device behavior).
- Device-switch-after-consumption end-to-end (SD1.5) produces bit-identical PNGs on cuda:0 and cuda:1.
- Z Image multigpu CFG split: ~1.90x speedup (10.5s vs 19.9s steady).
- Qwen Image multigpu CFG split (real text negative, cfg=4): ~1.69x speedup (32.5s vs 54.8s steady) -- matches pre-refactor numbers.
- Baseline (patch stashed) and patched produce identical timings on both models, so the refactor is performance-neutral.
Amp-Thread-ID: https://ampcode.com/threads/T-019e5783-b810-74b1-8ca9-09d675de1479 Co-authored-by: Amp <amp@ampcode.com>
…s_multigpu_base_clone
Two issues surfaced while testing the worksplit-multigpu PR:
1. Select Model Device -> CPU sampled at roughly 0.01 it/s, looking like an indefinite hang. PyTorch's CPU conv2d kernels do not have native fp16/bf16 paths and software-emulate at ~500-600x slower than fp32. Force fp32 compute via set_model_compute_dtype when the target is CPU; this keeps weights fp16 in memory and casts at use so peak memory does not double.
2. After running SelectModelDevice(gpu:N) and then activating MultiGPU CFG Split, only one GPU did real work even though both were loaded. create_multigpu_deepclones' reuse_loaded path matched the prior SelectModelDevice patcher (same clone_base_uuid, same device) but never set is_multigpu_base_clone, so the cond scheduler later filtered it out. Restrict reuse to clones that already carry the flag and always set it on the chosen patcher.
Also fix a related sharp edge: extra-device selection used get_all_torch_devices(exclude_current=True), which assumes the primary lives on the process's current CUDA device. After SelectModelDevice(gpu:N) that is not true. Exclude the primary model's actual load_device instead.
Amp-Thread-ID: https://ampcode.com/threads/T-019e6131-7175-719e-ad94-df5d65507375
Co-authored-by: Amp <amp@ampcode.com>
This was referenced May 27, 2026
Overview
This PR adds support for MultiGPU acceleration via 'work unit' splitting; by default, conditioning is treated as work units. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units: positive+negative, multiple positive/masked conditioning, etc. The code is extensible, so extensions can implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.
As long as the GPU is a heavy bottleneck, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g., an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.
The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node does effectively nothing, so workflows making use of the node will stay compatible between single and multi-GPU setups:

The feature works best when work splitting is symmetrical (the GPUs are the same or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load-balancing code about the relative performance of the devices in the MultiGPU setup.
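How relative performance could feed into load balancing is sketched below. This is one plausible scheme (proportional split with largest-remainder rounding), not a description of what the MultiGPU Options node actually does; the function name `split_work_units` and the `relative_speed` mapping are assumptions.

```python
def split_work_units(num_units, relative_speed):
    """Assign work-unit counts to devices in proportion to relative speed.

    relative_speed: dict mapping a device label to a positive weight,
                    e.g. {"cuda:0": 2.0, "cuda:1": 1.0}
    """
    total = sum(relative_speed.values())
    exact = {dev: num_units * speed / total for dev, speed in relative_speed.items()}
    # Start from the integer part of each device's share.
    counts = {dev: int(share) for dev, share in exact.items()}
    leftover = num_units - sum(counts.values())
    # Hand remaining units to the devices with the largest fractional share.
    by_remainder = sorted(exact, key=lambda dev: exact[dev] - counts[dev], reverse=True)
    for dev in by_remainder[:leftover]:
        counts[dev] += 1
    return counts
```

With equal weights this reduces to an even split, which matches the symmetrical case where the slowest GPU sets the pace.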

Nvidia (CUDA): Tested, works ✅
AMD (ROCm): Untested, will validate soon
AMD (DirectML): Untested
Intel (Arc XPU): Tested, does not work on Windows but works on Linux
Implementation Details
Based on `max_gpus` and the available amount of devices, the main ModelPatcher is cloned and relevant properties (like `model`) are deepcloned after the values are unloaded. MultiGPU clones are stored on the ModelPatcher's `additional_models` under key `multigpu`. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any `additional_models` kept consistent. To avoid unnecessarily deepcloning models, `currently_loaded_models` from `comfy.model_management` are checked for a matching deepcloned model, in which case they are (soft) cloned and made to match the main ModelPatcher.
When native conds are used as the work units, `_calc_cond_batch` calls and returns `_calc_cond_batch_multigpu` to avoid potential regression in performance if single-GPU code was to be refactored. In the future, this can be revisited to reuse the same code while carefully comparing performance for various models. No processes are created, only python threads; while GIL does limit CPU performance, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound. This vastly improves compatibility with existing code.
Since deepcloning requires that the base model is 'clean', `comfy.model_management` has received a `unload_model_and_clones` function to unload only specific models and their clones.
The `--cuda-device` startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage. This can be used to not only limit ComfyUI's visibility to a subset of devices per instance, but also their order (the first id is treated as device:0, second as device:1, etc.)
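The ordering behavior of `--cuda-device` can be illustrated with a small sketch. It assumes the ids are given as a comma-separated string in the style of `CUDA_VISIBLE_DEVICES`; the exact accepted format is not stated above, and `map_visible_devices` is a hypothetical helper, not part of ComfyUI.

```python
def map_visible_devices(cuda_device_arg):
    """Map logical device labels to the physical ids listed in the argument.

    The first id becomes cuda:0, the second cuda:1, and so on.
    """
    physical_ids = [part.strip() for part in cuda_device_arg.split(",") if part.strip()]
    return {f"cuda:{logical}": physical for logical, physical in enumerate(physical_ids)}
```

Under this assumption, passing `"1,0"` would swap the two GPUs so that physical GPU 1 is seen as `cuda:0`, while a single id such as `"0"` behaves like the old integer form.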
Performance (will add more examples soon)
Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.


Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090
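To put the uplift figures in context, speedup can be converted to parallel efficiency (speedup divided by GPU count). The helpers below are an illustrative sketch with hypothetical names; the inputs are the numbers reported on this page.

```python
def speedup_from_timings(single_gpu_seconds, multi_gpu_seconds):
    """Speedup of the multi-GPU run relative to the single-GPU run."""
    return single_gpu_seconds / multi_gpu_seconds


def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling that was achieved."""
    return speedup / num_gpus
```

For two RTX 4090s, 1.85x corresponds to 92.5% efficiency and 1.89x to 94.5%. The steady-state timings quoted in the commit messages are consistent with their stated speedups: 19.9s / 10.5s is about 1.90x and 54.8s / 32.5s is about 1.69x.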

