
MultiGPU Work Units For Accelerated Sampling (CORE-184) #7063

Merged
comfyanonymous merged 173 commits into master from worksplit-multigpu on May 26, 2026
@Kosinkadink Kosinkadink commented Mar 4, 2025

Overview

This PR adds support for MultiGPU acceleration via 'work unit' splitting. By default, conditioning is treated as work units, so any model that uses more than a single conditioning can be sped up via MultiGPU Work Units: positive+negative, multiple positive/masked conditioning, etc. The code is extensible so that extensions can implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.

As long as there is a heavy bottleneck on the GPU, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.

The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node does effectively nothing, so workflows making use of the node will stay compatible between single and multi-GPU setups.

The feature works best when work splitting is symmetrical (GPUs are the same/have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform load balancing code about the relative performance of the MultiGPU setup.
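To make "relative performance" concrete, here is a minimal sketch of proportional work splitting. The helper name `split_work_units` and the largest-remainder rounding are assumptions for illustration only; the PR text does not show how the actual load balancing code distributes work.

```python
from typing import Dict


def split_work_units(num_units: int, relative_speed: Dict[str, float]) -> Dict[str, int]:
    """Hypothetical helper: assign work units to devices in proportion to speed."""
    if num_units < 0 or not relative_speed:
        raise ValueError("need a non-negative unit count and at least one device")
    total = sum(relative_speed.values())
    if total <= 0:
        raise ValueError("relative speeds must sum to a positive number")

    # Ideal (fractional) share of the work for each device.
    ideal = {dev: num_units * weight / total for dev, weight in relative_speed.items()}
    # Start from the floor of each share.
    counts = {dev: int(share) for dev, share in ideal.items()}
    # Hand the leftover units to the devices with the largest fractional remainder.
    leftover = num_units - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda dev: ideal[dev] - counts[dev], reverse=True)
    for dev in by_remainder[:leftover]:
        counts[dev] += 1
    return counts
```

With two equal GPUs and a positive+negative pair, each device gets one work unit; with a 2:1 speed ratio and three units, the faster device gets two.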

Nvidia (CUDA): Tested, works ✅.
AMD (ROCm): Untested, will validate soon.
AMD (DirectML): Untested.
Intel (Arc XPU): Tested, works on Linux but does not work on Windows ⚠️.

Implementation Details

Based on max_gpus and the number of available devices, the main ModelPatcher is cloned and relevant properties (like model) are deepcloned after the values are unloaded. MultiGPU clones are stored in the ModelPatcher's additional_models under the key multigpu. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any additional_models kept consistent. To avoid unnecessarily deepcloning models, currently_loaded_models from comfy.model_management is checked for a matching deepcloned model; if one is found, it is (soft) cloned and made to match the main ModelPatcher.

When native conds are used as the work units, _calc_cond_batch calls and returns _calc_cond_batch_multigpu; keeping the two code paths separate avoids the potential performance regression that refactoring the single-GPU code could introduce. In the future, this can be revisited to reuse the same code while carefully comparing performance for various models. No processes are created, only Python threads; while the GIL does limit CPU performance, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound. Using threads instead of processes vastly improves compatibility with existing code.
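The thread-per-device idea can be sketched as follows. This is not the PR's `_calc_cond_batch_multigpu`; `run_on_devices` and `fake_model_call` are hypothetical names, and the model call is a pure-Python stand-in for a GPU-bound forward pass.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_on_devices(
    work: Dict[str, List[int]],
    fn: Callable[[str, int], str],
) -> Dict[str, List[str]]:
    """Run each device's work units on its own thread and collect results per device."""

    def worker(device: str) -> List[str]:
        return [fn(device, unit) for unit in work[device]]

    with ThreadPoolExecutor(max_workers=max(1, len(work))) as pool:
        futures = {device: pool.submit(worker, device) for device in work}
        # result() re-raises any worker exception in the calling thread.
        return {device: future.result() for device, future in futures.items()}


def fake_model_call(device: str, unit: int) -> str:
    # Stand-in for a GPU-bound model call on one conditioning (assumption).
    return f"{device}:cond{unit}"
```

Because all workers live in one process, they can share the same Python objects (patchers, hooks, options) without serialization, which is the compatibility benefit the paragraph above refers to.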

Since deepcloning requires that the base model is 'clean', comfy.model_management has received an unload_model_and_clones function to unload only specific models and their clones.

The --cuda-device startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage.
This can be used to not only limit ComfyUI's visibility to a subset of devices per instance, but also their order (the first id is treated as device:0, second as device:1, etc.).
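A minimal sketch of a string-valued `--cuda-device` argument and the resulting ordering. The function names are hypothetical, and how ComfyUI applies the value (for example by exporting it as `CUDA_VISIBLE_DEVICES` before torch is imported) is an assumption here, not something this description states.

```python
import argparse
from typing import Dict, List, Optional


def parse_cuda_device(argv: List[str]) -> Optional[str]:
    parser = argparse.ArgumentParser()
    # A string keeps `--cuda-device 1` working and also allows `--cuda-device 1,0`.
    parser.add_argument("--cuda-device", type=str, default=None, metavar="DEVICE_ID")
    return parser.parse_args(argv).cuda_device


def visible_device_mapping(cuda_device: str) -> Dict[str, str]:
    """Map in-process device names to the physical ids, in the order given."""
    ids = [part.strip() for part in cuda_device.split(",") if part.strip()]
    return {f"device:{i}": physical for i, physical in enumerate(ids)}
```

For example, `--cuda-device 1,0` would expose physical GPU 1 as device:0 and physical GPU 0 as device:1.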

Performance (will add more examples soon)

Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.

Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090.

API Node PR Checklist

Scope

  • Is API Node Change

Pricing & Billing

  • Need pricing update
  • No pricing update

If Need pricing update:

  • Metronome rate cards updated
  • Auto‑billing tests updated and passing

QA

  • QA done
  • QA not required

Comms

  • Informed Kosinkadink

…about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'
…ches to use target_dict instead of target so that more information can be provided about the current execution environment if needed
… to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)
… hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type
…ptions to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch
…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time
…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)
…added some doc strings and removed a so-far unused variable
…ok to InjectionsHook (not yet implemented, but at least getting the naming figured out)
rattus128 and others added 18 commits May 23, 2026 01:00
This was an attempt at a fast path: ensure the file slice was created by
the owning thread and refuse otherwise, without needing a mutex. But
worksplit-multigpu doesn't work that way. Go mutex.

Shoot me for overthinking next time.
Comfy-aimdo 0.4.4 contains a small bugfix to allow recovery of a hostbuf
after full truncation.

This pattern doesn't happen as a general rule, but does happen in the
upcoming worksplit-multigpu branch.
Introduce tiled_scale_multidim_multigpu in comfy/utils.py: a tile scheduler
that dispatches per-device tile functions through the existing
MultiGPUThreadPool and merges per-device CPU output buffers in deterministic
key order. The worker only catches BaseException at the thread boundary to
funnel errors to the main thread; bare torch.cuda.set_device and
torch.cuda.synchronize calls inside the worker fail loud if the device is
not CUDA, which is part of the primitive's contract.
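The scheduling pattern described above (per-device workers, errors funneled to the main thread, deterministic merge by key) can be sketched in plain Python. This is not `tiled_scale_multidim_multigpu`; the name `process_tiles`, the round-robin assignment, and the integer "tile outputs" are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Tile = Tuple[int, int]  # (row, col) key of a tile


def process_tiles(
    tiles: List[Tile],
    devices: List[str],
    tile_fn: Callable[[str, Tile], int],
) -> List[Tuple[Tile, int]]:
    # Round-robin assignment of tiles to devices (assumption).
    assignment = {dev: tiles[i::len(devices)] for i, dev in enumerate(devices)}

    def worker(dev: str) -> Tuple[Dict[Tile, int], Optional[BaseException]]:
        try:
            return {tile: tile_fn(dev, tile) for tile in assignment[dev]}, None
        except BaseException as exc:  # caught only at the thread boundary
            return {}, exc

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        outcomes = list(pool.map(worker, devices))

    merged: Dict[Tile, int] = {}
    for partial, error in outcomes:
        if error is not None:
            raise error  # surface the worker failure on the main thread
        merged.update(partial)
    # Merge in sorted key order so the result does not depend on thread timing.
    return [(key, merged[key]) for key in sorted(merged)]
```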

Add UPSCALE_MODEL input on the MultiGPU CFG Split node and an upscale-model
descriptor deepclone helper in comfy/multigpu.py. Clones stay CPU-resident
until execute time and are returned to CPU afterward.

ImageUpscaleWithModel dispatches through tiled_scale_multidim_multigpu when
a multigpu descriptor is attached; the single-device path runs unchanged
when no clones are present.
fixup threaded loader with worksplit multi-gpu
* Revert "Add tiled VAE lane to MultiGPU Work Units"

This reverts commit 4d3d68e.

The tiled VAE lane will land as part of a follow-up PR alongside the
UPSCALE_MODEL lane, separated from the threaded-loader fix PR (#14052)
to keep the upstream merge focused.

* Revert "Add UPSCALE_MODEL lane to MultiGPU CFG Split"

This reverts commit 74b0a82.

The UPSCALE_MODEL lane will land as part of a follow-up PR alongside the
tiled VAE lane, separated from the threaded-loader fix PR (#14052) to
keep the upstream merge focused.

---------

Co-authored-by: John Pollock <pollockjj@gmail.com>
Two fixes for single-GPU users on non-NVIDIA backends; multi-GPU
non-CUDA support is intentionally out of scope here (tracked separately).

1. get_all_torch_devices: add AMD/ROCm, MLU, and a generic fallback arm.

   Previously the function only enumerated NVIDIA, Intel XPU, and Ascend
   NPU when cpu_state==GPU; on AMD/ROCm (which exposes its GPU through
   torch.cuda.*) and DirectML it fell through to an empty list. The
   biggest user-visible regression: unload_all_models() iterates this
   list, so it became a silent no-op on AMD/ROCm. /free, manager
   unloads, and shutdown stopped releasing VRAM.

   - is_amd() now shares the torch.cuda.* arm with is_nvidia(), since
     ROCm reuses the CUDA API surface.
   - is_mlu() gets its own arm using torch.mlu.device_count().
   - A final fallback appends get_torch_device() for any GPU backend
     the explicit arms miss (notably DirectML), so callers see at
     least the current device and unload_all_models works.

   MPS users are unaffected: cpu_state==MPS already routes to the
   else branch which appends get_torch_device() returning mps.

2. main.py DynamicVRAM init: guard the comfy_aimdo branch with an
   explicit is_nvidia() check.

   The outer condition allows entering the DynamicVRAM init block when
   the user passes --enable-dynamic-vram explicitly, bypassing the
   implicit is_nvidia() gate. On non-NVIDIA backends this then runs
   comfy_aimdo.control.init_devices(range(torch.cuda.device_count())),
   which is comfy-aimdo-only territory and may crash at startup. Add a
   leading is_nvidia() check that logs a clean warning and falls back
   to the legacy ModelPatcher path.
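The shape of the first fix can be sketched with a simplified, string-based enumeration. This is not the real `get_all_torch_devices`; the `backend` argument, the device strings, and the `"directml:0"` placeholder are assumptions that stand in for the torch device objects and vendor checks used in ComfyUI.

```python
from typing import List


def enumerate_devices(backend: str, device_count: int, current_device: str) -> List[str]:
    """Simplified stand-in for a per-backend device enumeration."""
    if backend in ("nvidia", "amd"):
        # ROCm reuses the CUDA API surface, so both vendors share the "cuda" arm.
        prefix = "cuda"
    elif backend == "intel_xpu":
        prefix = "xpu"
    elif backend == "ascend_npu":
        prefix = "npu"
    elif backend == "mlu":
        prefix = "mlu"
    else:
        # Generic fallback (e.g. DirectML): report at least the current device,
        # so loops such as "unload on every device" are never a silent no-op.
        return [current_device]
    return [f"{prefix}:{i}" for i in range(device_count)]
```

Before the fix, the AMD and fallback cases would have produced an empty list in this sketch, which is why anything iterating over the result did nothing.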
Fix single-GPU non-CUDA regressions on worksplit-multigpu (AMD/ROCm unload, DynamicVRAM crash)
Master commit cf758bd (PR #13663, "chore(api-nodes): increase default
timeout for partner API node tasks") removed three explicit
max_poll_attempts=280 overrides from nodes_kling.py so the new 480
default in util/client.py would take effect.

The May 19 merge of master into worksplit-multigpu (ff766e5) silently
discarded those three deletions in the 3-way resolve - nodes_kling.py
had no textual conflict but the resolution kept the pre-cf758bd2 lines.
The other seven files cf758bd touched were merged correctly; this
restores nodes_kling.py to match master.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
….cuda

The aimdo init call on worksplit-multigpu was using
  comfy_aimdo.control.init_devices(range(torch.cuda.device_count()))
which required adding `import torch` at the top of main.py (violating the
"torch should never be imported before this point" expectation) and an
inner is_nvidia() guard added in PR #14068 to defend the raw cuda call
on non-NVIDIA systems where --enable-dynamic-vram is explicitly passed.

Replace the call with
  comfy_aimdo.control.init_devices(
      d.index for d in comfy.model_management.get_all_torch_devices()
      if d.type == "cuda" and d.index is not None
  )

comfy_aimdo.control.init_devices accepts any iterable of int-coercible
device indices and returns False on an empty iterable, so on non-cuda
systems the elif naturally falls through to the existing "No working
comfy-aimdo install detected" fallback - no extra vendor gate needed.
HIP devices appear as type "cuda" in torch, so ROCm setups (which
comfy-aimdo supports via aimdo_rocm.so) are handled correctly too.

This lets us drop both the `import torch` at the top of main.py and the
inner is_nvidia() guard, leaving a single logical-line divergence from
master (init_device(single index) -> init_devices(generator of cuda
indices)) for multi-GPU aimdo support.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
get_all_torch_devices() only enumerates one vendor at a time (the
is_nvidia/is_intel_xpu/is_ascend_npu branches are exclusive and each
constructs devices via torch.device("type", i) with a real integer
index), and aimdo_control.init_devices short-circuits on lib is None
before iterating, so the d.type == "cuda" and d.index is not None
filter cannot ever change the result. Match master's trust level and
just pass the indices directly.

Reduces the divergence from master to a single line:
    init_device(get_torch_device().index)
  -> init_devices(d.index for d in get_all_torch_devices())

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Remove the device-selection widgets that were added directly to existing
loader nodes (and the new CheckpointLoaderDevice / ImageOnlyCheckpointLoaderDevice
variants):

- nodes.py:
  - delete CheckpointLoaderDevice class and its NODE_CLASS_MAPPINGS /
    NODE_DISPLAY_NAME_MAPPINGS entries
  - remove the optional `device` input + VALIDATE_INPUTS + resolve logic
    from UNETLoader, VAELoader, CLIPLoader, DualCLIPLoader
  - restore CLIPLoader/DualCLIPLoader `device` options to ["default", "cpu"]
- comfy_extras/nodes_video_model.py:
  - delete ImageOnlyCheckpointLoaderDevice class + its mapping entries
- comfy_extras/nodes_lt_audio.py:
  - restore LTXAVTextEncoderLoader `device` options to ["default", "cpu"]
    and revert the resolve logic back to the simple `if device == "cpu"`
    branch

The replacement approach is a small set of passthrough Select*Device
nodes (added in the next commit) that retarget MODEL/CLIP/VAE devices
without bloating every loader's UI or duplicating loaders.

The cuda_device_context helper and the model_management helpers
(get_gpu_device_options / resolve_gpu_device_option) from #13483 are
kept; they are still used by the new selector nodes.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Replace the per-loader device widgets removed in the previous commit
with three small passthrough selector nodes registered under
advanced/multigpu:

- Select Model Device  (MODEL  in/out)  - options: default / cpu / gpu:N
- Select CLIP Device   (CLIP   in/out)  - options: default / cpu / gpu:N
- Select VAE Device    (VAE    in/out)  - options: default / gpu:N (no cpu)

Each node clones the inbound patcher (model.clone() / clip.clone() /
copy.copy(vae)+vae.patcher.clone()) and retargets load_device (and
offload_device for cpu / vae_offload_device for VAE).

Portability across machines with different GPU counts:
- VALIDATE_INPUTS returns True so an unknown gpu:N value (e.g. a
  workflow saved on a 2-GPU machine opened on a 1-GPU machine) does
  not error at validation time.
- At runtime, resolve_gpu_device_option(...) returns None for
  unknown options (with a warning), and each selector then logs a
  per-node info message and passes through unchanged, matching the
  no-op style used by MultiGPU CFG Split's
  "No extra torch devices need initialization..." log.

Also adds comfy.model_management.get_gpu_device_options_no_cpu() which
the VAE selector uses; on a single-GPU box this collapses to just
["default"], which is fine.
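The portability behaviour described above can be sketched as follows. `resolve_gpu_option` and `select_device` are hypothetical stand-ins; the real `resolve_gpu_device_option` signature is not shown in this commit message, and "default" is modelled here as a plain passthrough.

```python
import logging
from typing import List, Optional


def resolve_gpu_option(option: str, available: List[str]) -> Optional[str]:
    """Map 'cpu' / 'gpu:N' to a device string, or None if it cannot be resolved."""
    if option == "cpu":
        return "cpu"
    if option.startswith("gpu:"):
        try:
            index = int(option[len("gpu:"):])
        except ValueError:
            return None
        if 0 <= index < len(available):
            return available[index]
    return None


def select_device(current: str, option: str, available: List[str]) -> str:
    if option == "default":
        return current
    resolved = resolve_gpu_option(option, available)
    if resolved is None:
        # Unknown option (e.g. gpu:1 on a 1-GPU machine): log and pass through.
        logging.info("Device option %r is unavailable; passing through unchanged", option)
        return current
    return resolved
```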

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
When --enable-dynamic-vram is on, every ModelPatcher is a
ModelPatcherDynamic whose underlying model has a per-device dynamic_pins
dict, initialized in __init__ for self.load_device only. If a cloned
patcher's load_device is later reassigned (as the Select{Model,CLIP,VAE}
Device nodes do), the new device key is missing and partially_unload_ram
raises KeyError: device(type='cuda', index=N).

Fix:
- Extract the per-device dynamic_pins init in ModelPatcherDynamic.__init__
  into a new helper method register_load_device(device) which is now also
  called from __init__.
- Each Select*Device node calls clone.patcher.register_load_device(resolved)
  after retargeting load_device, guarded by hasattr so non-dynamic
  patchers (plain ModelPatcher in non-dynamic-vram installs) skip it.

Caught by happy-path test where SelectCLIPDevice retargeted CLIP from
cuda:0 to cuda:1 and CLIPTextEncode then crashed in
partially_unload_ram -> dynamic_pins[cuda:1].
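A minimal reproduction of the failure mode and the fix, using toy classes. The class names and the list stored per device are assumptions; only `dynamic_pins`, `register_load_device`, `partially_unload_ram`, and the hasattr guard come from the description above.

```python
class DynamicModelSketch:
    def __init__(self) -> None:
        self.dynamic_pins = {}  # per-device bookkeeping, keyed by device


class DynamicPatcherSketch:
    def __init__(self, model: DynamicModelSketch, load_device: str) -> None:
        self.model = model
        self.load_device = load_device
        self.register_load_device(load_device)

    def register_load_device(self, device: str) -> None:
        # Idempotent: create the per-device entry only if it is missing.
        self.model.dynamic_pins.setdefault(device, [])

    def partially_unload_ram(self) -> list:
        # Raises KeyError if load_device was reassigned without registration.
        return self.model.dynamic_pins[self.load_device]


def retarget(patcher, device: str) -> None:
    patcher.load_device = device
    # Guarded so plain (non-dynamic) patchers without the method skip it.
    if hasattr(patcher, "register_load_device"):
        patcher.register_load_device(device)
```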

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
V3 io.ComfyNode subclasses use the lowercase `validate_inputs` hook for opting out of strict combo validation (execution.py line 862); the uppercase `VALIDATE_INPUTS` is the V1 spelling and is ignored on V3 nodes. The strict combo check at execution.py line 1025 is gated on `if x not in validate_function_inputs`, so renaming to `validate_inputs(cls, device='default')` lets unknown `gpu:N` values pass validation and fall through to the runtime fallback.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
True reset semantics for "default":
- On first selector application, cache the loader's original
  load_device / offload_device on the underlying model object (which
  is shared across patcher clones) and restore those base values when
  the user picks "default". Previously "default" meant "passthrough"
  so SelectXDevice(gpu:1) -> SelectXDevice(default) silently kept the
  gpu:1 routing.
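The reset semantics can be sketched with toy classes. The attribute name `_original_load_device` and the classes below are hypothetical; the point is only that the original device is cached on the model object shared by all clones, so a later "default" can restore it.

```python
class ModelSketch:
    pass


class PatcherSketch:
    def __init__(self, model: ModelSketch, load_device: str) -> None:
        self.model = model
        self.load_device = load_device

    def clone(self) -> "PatcherSketch":
        # Soft clone: the underlying model object is shared.
        return PatcherSketch(self.model, self.load_device)


def select_device(patcher: PatcherSketch, option: str) -> PatcherSketch:
    clone = patcher.clone()
    model = clone.model
    # Cache the loader's original device once, on the shared model object.
    if not hasattr(model, "_original_load_device"):
        model._original_load_device = patcher.load_device
    if option == "default":
        clone.load_device = model._original_load_device
    else:
        clone.load_device = option
    return clone
```

Chaining `select_device(p, "cuda:1")` and then `select_device(..., "default")` returns a patcher on the loader's original device instead of silently keeping cuda:1.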

CPU + dynamic VRAM:
- When SelectModelDevice / SelectCLIPDevice resolves to CPU on a
  ModelPatcherDynamic, also call clone(disable_dynamic=True) so the
  result is a plain ModelPatcher, matching ModelPatcherDynamic.__new__'s
  intent that CPU loads never run through the dynamic path. Fallback to
  the regular dynamic clone if disable_dynamic is unsupported on that
  patcher.

MultiGPU collision pruning:
- After SelectModelDevice retargets the primary patcher, drop any
  multigpu clone (from a prior MultiGPU CFG Split) whose load_device
  now matches the primary; otherwise two patchers would be bound to
  the same device. Logs the prune at info level.
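Collision pruning amounts to a filter over the existing multigpu clones. The sketch below uses a hypothetical `CloneSketch` dataclass and function name; where the real clones are stored and how they are compared is not shown here.

```python
import logging
from dataclasses import dataclass
from typing import List


@dataclass
class CloneSketch:
    load_device: str


def prune_colliding_clones(primary_device: str, clones: List[CloneSketch]) -> List[CloneSketch]:
    """Drop clones whose load_device now matches the retargeted primary."""
    kept = [clone for clone in clones if clone.load_device != primary_device]
    dropped = len(clones) - len(kept)
    if dropped:
        logging.info("Pruned %d multigpu clone(s) bound to %s", dropped, primary_device)
    return kept
```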

SelectVAEDevice: reject CPU at runtime:
- The UI uses get_gpu_device_options_no_cpu(), but a workflow opened
  from another machine could still pass "cpu" through validate_inputs.
  Detect that case explicitly, log a "CPU is not a supported choice"
  passthrough message, and leave the VAE unchanged.

Cosmetic:
- Update VAE node docstring to accurately reflect the runtime CPU
  rejection rather than the older "intentionally not offered" claim.
- Demote the fallback warnings inside resolve_gpu_device_option to no
  log at all; the Select*Device nodes now own a single context-rich
  info-level message per failed lookup, so there is no double logging.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
Comment thread comfy/model_patcher.py Outdated
n.model = copy.deepcopy(n.model)
# unlike for normal clone, backup dicts that shared same ref should not;
# otherwise, patchers that have deep copies of base models will erroneously influence each other.
n.backup = copy.deepcopy(n.backup)
@rattus128 rattus128 May 23, 2026

the backup is not validly clonable in the case of reconstruction.

How close are we to just being able to call clone() as-is? What's the main divergence in logic?

Reply from the PR author (Kosinkadink):

I think the logic here is now more cleaned up; there were some assumptions made a year ago that have not been updated since then.

Kosinkadink and others added 4 commits May 23, 2026 19:11
…for CLIP/VAE; Select*Device retargets via deepclone

- ModelPatcher.deepclone_multigpu: remove copy.deepcopy fallback. Require
  cached_patcher_init (raise a descriptive RuntimeError if missing) and
  always go through clone(model_override=...) with empty backup containers
  so the per-device clone owns a pristine, unpatched module instead of a
  deepcopy of an already-loaded/already-patched one. Also call
  register_load_device on the new patcher so ModelPatcherDynamic per-device
  bookkeeping (e.g. dynamic_pins) is populated for the requested load
  device.

- comfy/sd.py: register cached_patcher_init on the CLIP and VAE patchers
  returned by load_checkpoint_guess_config, and on the patcher returned by
  load_diffusion_model's companion paths. Add load_checkpoint_clip_patcher,
  load_checkpoint_vae_patcher, and load_vae_patcher reload helpers so the
  same loader context can be reused to produce per-device clones.

- nodes.py: VAELoader registers cached_patcher_init on the produced VAE's
  patcher when there is a single backing file (skip for pixel_space and
  composite image-TAESDs which aren't addressable by a single path).

- comfy_extras/nodes_multigpu.py: SelectModelDevice / SelectCLIPDevice /
  SelectVAEDevice now retarget via deepclone_multigpu when the requested
  device differs from the current load_device, so the consumed model is
  not just relabeled but actually rehomed onto the chosen device.

Verified on runner-2 (2x RTX 4090, comfy-aimdo 0.4.4):
- 10/10 focused unit tests (deepclone behavior, missing-factory error path,
  Select*Device behavior).
- Device-switch-after-consumption end-to-end (SD1.5) produces bit-identical
  PNGs on cuda:0 and cuda:1.
- Z Image multigpu CFG split: ~1.90x speedup (10.5s vs 19.9s steady).
- Qwen Image multigpu CFG split (real text negative, cfg=4): ~1.69x
  speedup (32.5s vs 54.8s steady) -- matches pre-refactor numbers.
- Baseline (patch stashed) and patched produce identical timings on both
  models, so the refactor is performance-neutral.

Amp-Thread-ID: https://ampcode.com/threads/T-019e5783-b810-74b1-8ca9-09d675de1479
Co-authored-by: Amp <amp@ampcode.com>
…s_multigpu_base_clone

Two issues surfaced while testing the worksplit-multigpu PR:

1. Select Model Device -> CPU sampled at roughly 0.01 it/s, looking
   like an indefinite hang. PyTorch's CPU conv2d kernels do not have
   native fp16/bf16 paths and software-emulate at ~500-600x slower
   than fp32. Force fp32 compute via set_model_compute_dtype when the
   target is CPU; this keeps weights fp16 in memory and casts at use
   so peak memory does not double.

2. After running SelectModelDevice(gpu:N) and then activating
   MultiGPU CFG Split, only one GPU did real work even though both
   were loaded. create_multigpu_deepclones' reuse_loaded path matched
   the prior SelectModelDevice patcher (same clone_base_uuid, same
   device) but never set is_multigpu_base_clone, so the cond
   scheduler later filtered it out. Restrict reuse to clones that
   already carry the flag and always set it on the chosen patcher.

   Also fix a related sharp edge: extra-device selection used
   get_all_torch_devices(exclude_current=True), which assumes the
   primary lives on the process's current CUDA device. After
   SelectModelDevice(gpu:N) that is not true. Exclude the primary
   model's actual load_device instead.
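The second part of this fix can be illustrated with device strings. Both function names are hypothetical; the first mirrors the old "exclude the current device" behaviour and the second the corrected "exclude the primary's load_device" behaviour.

```python
from typing import List


def extra_devices_excluding_current(all_devices: List[str], current_device: str) -> List[str]:
    # Old behaviour: assumes the primary model lives on the process's current device.
    return [dev for dev in all_devices if dev != current_device]


def extra_devices_excluding_primary(all_devices: List[str], primary_load_device: str) -> List[str]:
    # Fixed behaviour: exclude wherever the primary model is actually loaded.
    return [dev for dev in all_devices if dev != primary_load_device]
```

With the primary retargeted to cuda:1 while the current device is still cuda:0, the old selection would pick cuda:1 again as the "extra" device; the fixed selection picks cuda:0.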

Amp-Thread-ID: https://ampcode.com/threads/T-019e6131-7175-719e-ad94-df5d65507375
Co-authored-by: Amp <amp@ampcode.com>