MultiGPU Work Units For Accelerated Sampling (CORE-184) by Kosinkadink · Pull Request #7063 · Comfy-Org/ComfyUI

Kosinkadink · 2025-03-04T06:37:49Z

Overview

This PR adds support for MultiGPU acceleration via 'work unit' splitting. By default, conditioning is treated as work units, so any model that uses more than one conditioning (positive+negative, multiple positive/masked conditioning, etc.) can be sped up via MultiGPU Work Units. The code is extensible so that extensions can implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.

As long as the GPU is a heavy bottleneck, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead of splitting and combining work units will result in a performance loss compared to using just one GPU.

The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node does effectively nothing, so workflows that use the node stay compatible between single-GPU and multi-GPU setups.

The feature works best when work splitting is symmetrical (the GPUs are the same or have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform the load balancing code about the relative performance of the devices in the MultiGPU setup.
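As a rough illustration of what proportional load balancing could look like, the sketch below splits a number of work units across devices according to relative speed, using largest-remainder rounding. The function name, the device labels, and the rounding rule are assumptions made for this example; the PR text does not show how the MultiGPU Options node feeds the actual load balancer.

```python
def split_work_units(num_units: int, relative_speed: dict[str, float]) -> dict[str, int]:
    """Assign work units to devices in proportion to their relative speed.

    Hypothetical helper: not taken from the PR, only a sketch of the idea.
    """
    if num_units < 0:
        raise ValueError("num_units must be non-negative")
    total = sum(relative_speed.values())
    if total <= 0:
        raise ValueError("relative speeds must sum to a positive value")

    # Ideal (fractional) share for every device.
    exact = {dev: num_units * speed / total for dev, speed in relative_speed.items()}
    # Start from the rounded-down share.
    counts = {dev: int(share) for dev, share in exact.items()}
    leftover = num_units - sum(counts.values())

    # Hand the remaining units to the devices with the largest fractional part.
    # sorted() is stable, so ties keep the original device order.
    by_remainder = sorted(exact, key=lambda dev: exact[dev] - counts[dev], reverse=True)
    for dev in by_remainder[:leftover]:
        counts[dev] += 1
    return counts


# Symmetrical setup: positive and negative conds go to one device each.
print(split_work_units(2, {"cuda:0": 1.0, "cuda:1": 1.0}))
# Asymmetrical setup: the first device is assumed to be three times faster.
print(split_work_units(4, {"cuda:0": 3.0, "cuda:1": 1.0}))
```

With only two work units (positive and negative conditioning) an asymmetrical split cannot be finer than one unit per device, which is consistent with the slowest GPU acting as the limiter.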

Nvidia (CUDA): Tested, works ✅.
AMD (ROCm): Untested, will validate soon.
AMD (DirectML): Untested.
Intel (Arc XPU): Tested, does not work on Windows but works on Linux ⚠️.

Implementation Details

Based on max_gpus and the number of available devices, the main ModelPatcher is cloned and relevant properties (like model) are deepcloned after the values are unloaded. MultiGPU clones are stored in the ModelPatcher's additional_models under the key multigpu. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any additional_models kept consistent. To avoid unnecessarily deepcloning models, currently_loaded_models from comfy.model_management is checked for a matching deepcloned model; if one is found, it is (soft) cloned and made to match the main ModelPatcher.

When native conds are used as the work units, _calc_cond_batch calls and returns _calc_cond_batch_multigpu. A separate function is used to avoid a potential performance regression from refactoring the single-GPU code. In the future, this can be revisited to reuse the same code while carefully comparing performance for various models. No processes are created, only Python threads; while the GIL does limit CPU performance, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound. Using threads instead of processes vastly improves compatibility with existing code.
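The thread-per-device approach described above can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the PR's _calc_cond_batch_multigpu: run_work_units and the fn callable are hypothetical names, and the per-device model call is replaced by a stand-in function.

```python
from concurrent.futures import ThreadPoolExecutor


def run_work_units(work_by_device, fn):
    """Run fn(device, unit) for every unit, using one thread per device.

    Hypothetical helper: `fn` stands in for the per-device model call.
    Returns a dict mapping each device to the list of its results.
    """
    def run_device(device):
        # Units assigned to one device run sequentially on that device's thread.
        return [fn(device, unit) for unit in work_by_device[device]]

    with ThreadPoolExecutor(max_workers=max(1, len(work_by_device))) as pool:
        futures = {device: pool.submit(run_device, device) for device in work_by_device}
        # result() re-raises any exception from the worker in the calling thread.
        return {device: future.result() for device, future in futures.items()}


results = run_work_units(
    {"cuda:0": ["positive"], "cuda:1": ["negative"]},
    lambda device, unit: f"{unit}@{device}",
)
print(results)
```

Because everything stays in one process, the workers share the same Python objects, which is the compatibility benefit the description refers to.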

Since deepcloning requires that the base model is 'clean', comfy.model_management has received an unload_model_and_clones function to unload only specific models and their clones.

The --cuda-device startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided while not breaking any existing usage.

This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to set their order (the first id is treated as device:0, the second as device:1, etc.).
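A minimal sketch of how a string-valued --cuda-device argument might be handled is shown below. It assumes the ids are comma-separated and that the value ends up in the CUDA_VISIBLE_DEVICES environment variable, whose order determines how devices are renumbered; the helper names are hypothetical and the PR text does not confirm either detail.

```python
import argparse


def parse_cuda_device(value: str) -> list[int]:
    """Parse a string such as "0" or "1,0" into a list of device ids.

    Assumption: ids are comma-separated.
    """
    ids = [int(part) for part in value.split(",") if part.strip()]
    if not ids:
        raise ValueError("expected at least one device id")
    return ids


parser = argparse.ArgumentParser()
# A string type keeps the old single-id usage ("--cuda-device 0") working.
parser.add_argument("--cuda-device", type=str, default=None)


def visible_devices_env(argv: list[str]) -> dict[str, str]:
    """Return the environment variables a launcher could set for the given argv."""
    args = parser.parse_args(argv)
    if args.cuda_device is None:
        return {}
    ids = parse_cuda_device(args.cuda_device)
    # The first listed id becomes device 0 inside the process, the second device 1.
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in ids)}


print(visible_devices_env(["--cuda-device", "1,0"]))
```

Passing "1,0" in this sketch would make physical GPU 1 appear as device 0 inside the process, matching the ordering behavior described above.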

Performance (will add more examples soon)

Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.

Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090.
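To put the uplift figures in context, parallel efficiency can be computed as speedup divided by device count. The helper below is only arithmetic on the two numbers reported above; its name is made up for this example.

```python
def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / num_gpus


for name, speedup in [("Wan 1.3B t2v", 1.85), ("Wan 14B t2v", 1.89)]:
    efficiency = parallel_efficiency(speedup, 2)
    print(f"{name}: {efficiency:.1%} of ideal 2-GPU scaling")
```

Both results are above 90% of linear scaling, which fits the claim that split-and-combine overhead is small when the GPU is the bottleneck.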

API Node PR Checklist

Scope

Is API Node Change

Pricing & Billing

Need pricing update
No pricing update

If Need pricing update:

Metronome rate cards updated
Auto‑billing tests updated and passing

QA

QA done
QA not required

Comms

Informed Kosinkadink

…ade AddModelsHook operational and compliant with should_register result, moved TransformerOptionsHook handling out of ModelPatcher.register_all_hook_patches, support patches in TransformerOptionsHook properly by casting any patches/wrappers/hooks to proper device at sample time

…ops nodes by properly caching between positive and negative conds, make hook_patches_backup behave as intended (in the case that something pre-registers WeightHooks on the ModelPatcher instead of registering it at sample time)

Introduce tiled_scale_multidim_multigpu in comfy/utils.py: a tile scheduler that dispatches per-device tile functions through the existing MultiGPUThreadPool and merges per-device CPU output buffers in deterministic key order. The worker only catches BaseException at the thread boundary to funnel errors to the main thread; bare torch.cuda.set_device and torch.cuda.synchronize calls inside the worker fail loud if the device is not CUDA, which is part of the primitive's contract. Add UPSCALE_MODEL input on the MultiGPU CFG Split node and an upscale-model descriptor deepclone helper in comfy/multigpu.py. Clones stay CPU-resident until execute time and are returned to CPU afterward. ImageUpscaleWithModel dispatches through tiled_scale_multidim_multigpu when a multigpu descriptor is attached; the single-device path runs unchanged when no clones are present.
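The scheduling pattern described above (per-device workers, errors funneled to the main thread, deterministic merge order) can be sketched without any GPU code. This is not tiled_scale_multidim_multigpu itself: the function name, the round-robin assignment, and the tile_fn callable are assumptions chosen to keep the example self-contained.

```python
import threading


def run_tiles_multidevice(tiles, devices, tile_fn):
    """Process tiles on several devices and merge outputs in sorted key order.

    Hypothetical sketch: `tiles` maps a sortable key to tile data and
    `tile_fn(device, tile)` stands in for the per-device tile function.
    """
    keys = sorted(tiles)
    # Round-robin assignment of tile keys to devices (an assumption).
    assignment = {dev: keys[i::len(devices)] for i, dev in enumerate(devices)}
    outputs = {dev: {} for dev in devices}  # one output buffer per device
    errors = []

    def worker(dev):
        try:
            for key in assignment[dev]:
                outputs[dev][key] = tile_fn(dev, tiles[key])
        except BaseException as exc:
            # Catch only at the thread boundary so the main thread can re-raise.
            errors.append(exc)

    threads = [threading.Thread(target=worker, args=(dev,)) for dev in devices]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    if errors:
        raise errors[0]

    merged = {}
    for dev in devices:
        merged.update(outputs[dev])
    # Merge order depends only on the keys, never on thread timing.
    return [merged[key] for key in keys]


print(run_tiles_multidevice({(0, 0): 1, (0, 1): 2, (1, 0): 3}, ["cuda:0", "cuda:1"],
                            lambda dev, tile: tile * 10))
```

Keeping one output buffer per device avoids contention between workers, and sorting the keys before merging makes the result independent of which thread finishes first.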

* Revert "Add tiled VAE lane to MultiGPU Work Units"

  This reverts commit 4d3d68e. The tiled VAE lane will land as part of a follow-up PR alongside the UPSCALE_MODEL lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused.

* Revert "Add UPSCALE_MODEL lane to MultiGPU CFG Split"

  This reverts commit 74b0a82. The UPSCALE_MODEL lane will land as part of a follow-up PR alongside the tiled VAE lane, separated from the threaded-loader fix PR (#14052) to keep the upstream merge focused.

---------

Co-authored-by: John Pollock <pollockjj@gmail.com>

Two fixes for single-GPU users on non-NVIDIA backends; multi-GPU non-CUDA support is intentionally out of scope here (tracked separately).

1. get_all_torch_devices: add AMD/ROCm, MLU, and a generic fallback arm.

   Previously the function only enumerated NVIDIA, Intel XPU, and Ascend NPU when cpu_state==GPU; on AMD/ROCm (which exposes its GPU through torch.cuda.*) and DirectML it fell through to an empty list. The biggest user-visible regression: unload_all_models() iterates this list, so it became a silent no-op on AMD/ROCm. /free, manager unloads, and shutdown stopped releasing VRAM.

   - is_amd() now shares the torch.cuda.* arm with is_nvidia(), since ROCm reuses the CUDA API surface.
   - is_mlu() gets its own arm using torch.mlu.device_count().
   - A final fallback appends get_torch_device() for any GPU backend the explicit arms miss (notably DirectML), so callers see at least the current device and unload_all_models works.

   MPS users are unaffected: cpu_state==MPS already routes to the else branch which appends get_torch_device() returning mps.

2. main.py DynamicVRAM init: guard the comfy_aimdo branch with an explicit is_nvidia() check.

   The outer condition allows entering the DynamicVRAM init block when the user passes --enable-dynamic-vram explicitly, bypassing the implicit is_nvidia() gate. On non-NVIDIA backends this then runs comfy_aimdo.control.init_devices(range(torch.cuda.device_count())), which is comfy-aimdo-only territory and may crash at startup. Add a leading is_nvidia() check that logs a clean warning and falls back to the legacy ModelPatcher path.
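The enumeration logic in the first fix can be modeled without torch. The sketch below uses plain strings for backends and devices; enumerate_devices, the backend labels, and the device label format are all hypothetical stand-ins for get_all_torch_devices and the is_nvidia()/is_amd()/is_mlu() checks.

```python
def enumerate_devices(backend: str, device_count: int, current_device: str) -> list[str]:
    """Return device labels for a backend, never an empty list.

    Hypothetical model of the arms described above, using strings instead of
    torch.device objects.
    """
    devices: list[str] = []
    if backend in ("nvidia", "amd"):
        # ROCm reuses the CUDA API surface, so both vendors share one arm.
        devices = [f"cuda:{i}" for i in range(device_count)]
    elif backend == "intel_xpu":
        devices = [f"xpu:{i}" for i in range(device_count)]
    elif backend == "ascend_npu":
        devices = [f"npu:{i}" for i in range(device_count)]
    elif backend == "mlu":
        devices = [f"mlu:{i}" for i in range(device_count)]

    if not devices:
        # Generic fallback: callers see at least the current device, so a loop
        # such as "unload everything on every device" is never a silent no-op.
        devices.append(current_device)
    return devices


print(enumerate_devices("amd", 2, "cuda:0"))
print(enumerate_devices("directml", 1, "current-device"))
```

The key property is the last branch: an unrecognized backend still yields a one-element list rather than an empty one.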

Master commit cf758bd (PR #13663, "chore(api-nodes): increase default timeout for partner API node tasks") removed three explicit max_poll_attempts=280 overrides from nodes_kling.py so the new 480 default in util/client.py would take effect. The May 19 merge of master into worksplit-multigpu (ff766e5) silently discarded those three deletions in the 3-way resolve: nodes_kling.py had no textual conflict but the resolution kept the pre-cf758bd2 lines. The other seven files cf758bd touched were merged correctly; this restores nodes_kling.py to match master.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>

….cuda

The aimdo init call on worksplit-multigpu was using comfy_aimdo.control.init_devices(range(torch.cuda.device_count())), which required adding `import torch` at the top of main.py (violating the "torch should never be imported before this point" expectation) and an inner is_nvidia() guard added in PR #14068 to defend the raw cuda call on non-NVIDIA systems where --enable-dynamic-vram is explicitly passed.

Replace the call with

    comfy_aimdo.control.init_devices(
        d.index for d in comfy.model_management.get_all_torch_devices()
        if d.type == "cuda" and d.index is not None
    )

comfy_aimdo.control.init_devices accepts any iterable of int-coercible device indices and returns False on an empty iterable, so on non-cuda systems the elif naturally falls through to the existing "No working comfy-aimdo install detected" fallback; no extra vendor gate is needed. HIP devices appear as type "cuda" in torch, so ROCm setups (which comfy-aimdo supports via aimdo_rocm.so) are handled correctly too.

This lets us drop both the `import torch` at the top of main.py and the inner is_nvidia() guard, leaving a single logical-line divergence from master (init_device(single index) -> init_devices(generator of cuda indices)) for multi-GPU aimdo support.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>

get_all_torch_devices() only enumerates one vendor at a time (the is_nvidia/is_intel_xpu/is_ascend_npu branches are exclusive and each constructs devices via torch.device("type", i) with a real integer index), and aimdo_control.init_devices short-circuits on lib is None before iterating, so the d.type == "cuda" and d.index is not None filter cannot ever change the result. Match master's trust level and just pass the indices directly.

Reduces the divergence from master to a single line:

    init_device(get_torch_device().index) -> init_devices(d.index for d in get_all_torch_devices())

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
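The argument that the filter is redundant can be checked on a toy model. In the sketch below, Device and init_devices are hypothetical stand-ins (not torch.device or the comfy-aimdo function); the only behavior borrowed from the commit messages is that init_devices takes an iterable of int-coercible indices and returns False when it is empty.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Device:
    """Stand-in for a torch device: a type string plus an optional index."""
    type: str
    index: Optional[int]


def init_devices(indices: Iterable[int]) -> bool:
    """Hypothetical stand-in: returns False when no index was provided."""
    initialized = [int(i) for i in indices]
    return len(initialized) > 0


def filtered_indices(devices: list[Device]) -> list[int]:
    # The earlier, defensive form of the call.
    return [d.index for d in devices if d.type == "cuda" and d.index is not None]


def direct_indices(devices: list[Device]) -> list[int]:
    # The simplified form: trust the enumeration.
    return [d.index for d in devices]


devices = [Device("cuda", 0), Device("cuda", 1)]
print(filtered_indices(devices) == direct_indices(devices))
print(init_devices(d.index for d in devices), init_devices(iter([])))
```

When the enumeration yields a single vendor and every device has a real integer index, both forms produce the same list, which is the premise of the simplification.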

Remove the device-selection widgets that were added directly to existing loader nodes (and the new CheckpointLoaderDevice / ImageOnlyCheckpointLoaderDevice variants):

- nodes.py:
  - delete CheckpointLoaderDevice class and its NODE_CLASS_MAPPINGS / NODE_DISPLAY_NAME_MAPPINGS entries
  - remove the optional `device` input + VALIDATE_INPUTS + resolve logic from UNETLoader, VAELoader, CLIPLoader, DualCLIPLoader
  - restore CLIPLoader/DualCLIPLoader `device` options to ["default", "cpu"]
- comfy_extras/nodes_video_model.py:
  - delete ImageOnlyCheckpointLoaderDevice class + its mapping entries
- comfy_extras/nodes_lt_audio.py:
  - restore LTXAVTextEncoderLoader `device` options to ["default", "cpu"] and revert the resolve logic back to the simple `if device == "cpu"` branch

The replacement approach is a small set of passthrough Select*Device nodes (added in the next commit) that retarget MODEL/CLIP/VAE devices without bloating every loader's UI or duplicating loaders.

The cuda_device_context helper and the model_management helpers (get_gpu_device_options / resolve_gpu_device_option) from #13483 are kept; they are still used by the new selector nodes.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>

Replace the per-loader device widgets removed in the previous commit with three small passthrough selector nodes registered under advanced/multigpu:

- Select Model Device (MODEL in/out) - options: default / cpu / gpu:N
- Select CLIP Device (CLIP in/out) - options: default / cpu / gpu:N
- Select VAE Device (VAE in/out) - options: default / gpu:N (no cpu)

Each node clones the inbound patcher (model.clone() / clip.clone() / copy.copy(vae)+vae.patcher.clone()) and retargets load_device (and offload_device for cpu / vae_offload_device for VAE).

Portability across machines with different GPU counts:

- VALIDATE_INPUTS returns True so an unknown gpu:N value (e.g. a workflow saved on a 2-GPU machine opened on a 1-GPU machine) does not error at validation time.
- At runtime, resolve_gpu_device_option(...) returns None for unknown options (with a warning), and each selector then logs a per-node info message and passes through unchanged, matching the no-op style used by MultiGPU CFG Split's "No extra torch devices need initialization..." log.

Also adds comfy.model_management.get_gpu_device_options_no_cpu() which the VAE selector uses; on a single-GPU box this collapses to just ["default"], which is fine.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
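The portability behavior described above (unknown gpu:N passes validation, then falls back to passthrough at runtime) might look roughly like the following. Both functions are simplified, hypothetical stand-ins that work on strings: the real resolve_gpu_device_option signature and return type are not shown in the source.

```python
import logging


def resolve_gpu_device_option(option: str, gpu_count: int):
    """Hypothetical stand-in: map an option string to a device label.

    Returns None when the option does not exist on this machine.
    """
    if option == "cpu":
        return "cpu"
    if option.startswith("gpu:"):
        suffix = option[len("gpu:"):]
        if suffix.isdigit() and int(suffix) < gpu_count:
            return f"cuda:{int(suffix)}"
    return None


def select_device(current_device: str, option: str, gpu_count: int) -> str:
    """Return the device a selector node would target, or pass through."""
    if option == "default":
        return current_device
    resolved = resolve_gpu_device_option(option, gpu_count)
    if resolved is None:
        # Unknown option (e.g. workflow saved on a machine with more GPUs):
        # log once and leave the input unchanged instead of failing.
        logging.info("Device option %r is unavailable; passing through unchanged.", option)
        return current_device
    return resolved


print(select_device("cuda:0", "gpu:1", gpu_count=2))  # device exists
print(select_device("cuda:0", "gpu:1", gpu_count=1))  # falls back to passthrough
```

The same workflow therefore runs on both machines; only the routing differs.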

When --enable-dynamic-vram is on, every ModelPatcher is a ModelPatcherDynamic whose underlying model has a per-device dynamic_pins dict, initialized in __init__ for self.load_device only. If a cloned patcher's load_device is later reassigned (as the Select{Model,CLIP,VAE} Device nodes do), the new device key is missing and partially_unload_ram raises KeyError: device(type='cuda', index=N).

Fix:

- Extract the per-device dynamic_pins init in ModelPatcherDynamic.__init__ into a new helper method register_load_device(device) which is now also called from __init__.
- Each Select*Device node calls clone.patcher.register_load_device(resolved) after retargeting load_device, guarded by hasattr so non-dynamic patchers (plain ModelPatcher in non-dynamic-vram installs) skip it.

Caught by happy-path test where SelectCLIPDevice retargeted CLIP from cuda:0 to cuda:1 and CLIPTextEncode then crashed in partially_unload_ram -> dynamic_pins[cuda:1].

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
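The failure mode and the fix can be reproduced with a toy class. DynamicPatcherSketch and retarget are hypothetical; only the method names register_load_device and partially_unload_ram and the dynamic_pins attribute come from the commit message, and the contents of dynamic_pins (an empty list per device here) are an assumption.

```python
class DynamicPatcherSketch:
    """Toy stand-in for a patcher with per-device bookkeeping."""

    def __init__(self, load_device: str):
        self.load_device = load_device
        self.dynamic_pins: dict[str, list] = {}
        # The per-device init lives in one helper so it can be reused later.
        self.register_load_device(load_device)

    def register_load_device(self, device: str) -> None:
        # Idempotent: an existing entry is left untouched.
        self.dynamic_pins.setdefault(device, [])

    def partially_unload_ram(self) -> list:
        # Raises KeyError if load_device was changed without registration.
        return self.dynamic_pins[self.load_device]


def retarget(patcher, device: str):
    """Point a patcher at a new device, registering it when supported."""
    patcher.load_device = device
    if hasattr(patcher, "register_load_device"):  # plain patchers skip this
        patcher.register_load_device(device)
    return patcher


patcher = DynamicPatcherSketch("cuda:0")
retarget(patcher, "cuda:1")
print(patcher.partially_unload_ram())  # no KeyError after registration
```

Reassigning load_device directly, without calling register_load_device, reproduces the KeyError described in the commit message.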

V3 io.ComfyNode subclasses use the lowercase `validate_inputs` hook for opting out of strict combo validation (execution.py line 862); the uppercase `VALIDATE_INPUTS` is the V1 spelling and is ignored on V3 nodes. The strict combo check at execution.py line 1025 is gated on `if x not in validate_function_inputs`, so renaming to `validate_inputs(cls, device='default')` lets unknown `gpu:N` values pass validation and fall through to the runtime fallback.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>

True reset semantics for "default":

- On first selector application, cache the loader's original load_device / offload_device on the underlying model object (which is shared across patcher clones) and restore those base values when the user picks "default". Previously "default" meant "passthrough" so SelectXDevice(gpu:1) -> SelectXDevice(default) silently kept the gpu:1 routing.

CPU + dynamic VRAM:

- When SelectModelDevice / SelectCLIPDevice resolves to CPU on a ModelPatcherDynamic, also call clone(disable_dynamic=True) so the result is a plain ModelPatcher, matching ModelPatcherDynamic.__new__'s intent that CPU loads never run through the dynamic path. Fallback to the regular dynamic clone if disable_dynamic is unsupported on that patcher.

MultiGPU collision pruning:

- After SelectModelDevice retargets the primary patcher, drop any multigpu clone (from a prior MultiGPU CFG Split) whose load_device now matches the primary; otherwise two patchers would be bound to the same device. Logs the prune at info level.

SelectVAEDevice: reject CPU at runtime:

- The UI uses get_gpu_device_options_no_cpu(), but a workflow opened from another machine could still pass "cpu" through validate_inputs. Detect that case explicitly, log a "CPU is not a supported choice" passthrough message, and leave the VAE unchanged.

Cosmetic:

- Update VAE node docstring to accurately reflect the runtime CPU rejection rather than the older "intentionally not offered" claim.
- Demote the fallback warnings inside resolve_gpu_device_option to no log at all; the Select*Device nodes now own a single context-rich info-level message per failed lookup, so there is no double logging.

Amp-Thread-ID: https://ampcode.com/threads/T-019e52b4-31ee-72cd-996b-64ecd9420e13
Co-authored-by: Amp <amp@ampcode.com>
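The "default" reset semantics can be illustrated with a toy patcher. The class names and the base_load_device attribute are made up for this sketch; what it borrows from the commit message is the idea of caching the original device on the model object that all clones share.

```python
class ModelSketch:
    """Stand-in for the underlying model object shared across patcher clones."""


class PatcherSketch:
    def __init__(self, model: ModelSketch, load_device: str):
        self.model = model
        self.load_device = load_device

    def clone(self) -> "PatcherSketch":
        # A soft clone shares the same model object.
        return PatcherSketch(self.model, self.load_device)


def select_device(patcher: PatcherSketch, option: str) -> PatcherSketch:
    """Return a clone routed to `option`, where "default" restores the original."""
    model = patcher.model
    if not hasattr(model, "base_load_device"):
        # First selector application: remember where the loader put the model.
        model.base_load_device = patcher.load_device
    clone = patcher.clone()
    clone.load_device = model.base_load_device if option == "default" else option
    return clone


base = PatcherSketch(ModelSketch(), "cuda:0")
moved = select_device(base, "cuda:1")
restored = select_device(moved, "default")
print(moved.load_device, restored.load_device)
```

Because the cached value lives on the shared model rather than on any one clone, a later selector can recover the loader's original device even when it only sees an already-retargeted clone.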

rattus128 · 2026-05-23T11:23:16Z

+            n.model = copy.deepcopy(n.model)
+        # unlike for normal clone, backup dicts that shared same ref should not;
+        # otherwise, patchers that have deep copies of base models will erroneously influence each other.
+        n.backup = copy.deepcopy(n.backup)


The backup is not validly clonable in the case of reconstruction.

How close are we to just being able to call clone() as-is? What's the main divergence in logic?

I think the logic here is now more cleaned up; there were some assumptions made a year ago that have not been updated since then.
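The concern behind the diff comment in this thread (clones that share one backup dict can influence each other) can be shown in a few lines. PatcherSketch and both clone methods are hypothetical and deliberately minimal; they are not the ModelPatcher implementation.

```python
import copy


class PatcherSketch:
    """Toy patcher that keeps original weights in a backup dict."""

    def __init__(self):
        self.backup: dict[str, float] = {}

    def soft_clone(self) -> "PatcherSketch":
        clone = PatcherSketch()
        clone.backup = self.backup  # same dict object as the source
        return clone

    def deep_clone(self) -> "PatcherSketch":
        clone = PatcherSketch()
        clone.backup = copy.deepcopy(self.backup)  # independent copy
        return clone


base = PatcherSketch()
base.backup["layer.weight"] = 1.0

soft = base.soft_clone()
deep = base.deep_clone()
soft.backup["layer.weight"] = 2.0  # also visible through base.backup

print(base.backup["layer.weight"], deep.backup["layer.weight"])
```

A shared backup is intended when clones wrap the same model, but it becomes a hazard once each clone owns its own copy of the model.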

…for CLIP/VAE; Select*Device retargets via deepclone

- ModelPatcher.deepclone_multigpu: remove copy.deepcopy fallback. Require cached_patcher_init (raise a descriptive RuntimeError if missing) and always go through clone(model_override=...) with empty backup containers so the per-device clone owns a pristine, unpatched module instead of a deepcopy of an already-loaded/already-patched one. Also call register_load_device on the new patcher so ModelPatcherDynamic per-device bookkeeping (e.g. dynamic_pins) is populated for the requested load device.
- comfy/sd.py: register cached_patcher_init on the CLIP and VAE patchers returned by load_checkpoint_guess_config, and on the patcher returned by load_diffusion_model's companion paths. Add load_checkpoint_clip_patcher, load_checkpoint_vae_patcher, and load_vae_patcher reload helpers so the same loader context can be reused to produce per-device clones.
- nodes.py: VAELoader registers cached_patcher_init on the produced VAE's patcher when there is a single backing file (skip for pixel_space and composite image-TAESDs which aren't addressable by a single path).
- comfy_extras/nodes_multigpu.py: SelectModelDevice / SelectCLIPDevice / SelectVAEDevice now retarget via deepclone_multigpu when the requested device differs from the current load_device, so the consumed model is not just relabeled but actually rehomed onto the chosen device.

Verified on runner-2 (2x RTX 4090, comfy-aimdo 0.4.4):

- 10/10 focused unit tests (deepclone behavior, missing-factory error path, Select*Device behavior).
- Device-switch-after-consumption end-to-end (SD1.5) produces bit-identical PNGs on cuda:0 and cuda:1.
- Z Image multigpu CFG split: ~1.90x speedup (10.5s vs 19.9s steady).
- Qwen Image multigpu CFG split (real text negative, cfg=4): ~1.69x speedup (32.5s vs 54.8s steady), which matches pre-refactor numbers.
- Baseline (patch stashed) and patched produce identical timings on both models, so the refactor is performance-neutral.
Amp-Thread-ID: https://ampcode.com/threads/T-019e5783-b810-74b1-8ca9-09d675de1479 Co-authored-by: Amp <amp@ampcode.com>
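The "rebuild from a cached factory instead of deep-copying" idea in the commit above can be sketched as follows. PatcherSketch is a toy class, and storing cached_patcher_init as a (factory, args) tuple is an assumption made for this example; the source only says that cached_patcher_init is required and that a RuntimeError is raised when it is missing.

```python
class PatcherSketch:
    """Toy patcher that can rebuild a pristine model from a cached factory."""

    def __init__(self, model, load_device, cached_patcher_init=None):
        self.model = model
        self.load_device = load_device
        # Assumed layout: (factory_callable, args_tuple).
        self.cached_patcher_init = cached_patcher_init
        self.backup = {}

    def deepclone_multigpu(self, new_load_device):
        if self.cached_patcher_init is None:
            raise RuntimeError(
                "deepclone_multigpu requires cached_patcher_init so that a clean "
                "model can be rebuilt for the new device"
            )
        factory, args = self.cached_patcher_init
        # Rebuild from the original source instead of copying the (possibly
        # already patched) model held by this patcher.
        fresh_model = factory(*args)
        # The new patcher starts with empty backup containers.
        return PatcherSketch(fresh_model, new_load_device, self.cached_patcher_init)


def load_model(path):
    # Hypothetical loader standing in for reading weights from disk.
    return {"path": path, "patched": False}


base = PatcherSketch(load_model("model.safetensors"), "cuda:0",
                     (load_model, ("model.safetensors",)))
base.model["patched"] = True  # simulate patches applied on the primary
clone = base.deepclone_multigpu("cuda:1")
print(clone.model, clone.load_device)
```

The clone never sees the patched state of the primary, which is the property the commit message describes as owning a pristine, unpatched module.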

…s_multigpu_base_clone

Two issues surfaced while testing the worksplit-multigpu PR:

1. Select Model Device -> CPU sampled at roughly 0.01 it/s, looking like an indefinite hang. PyTorch's CPU conv2d kernels do not have native fp16/bf16 paths and software-emulate at ~500-600x slower than fp32. Force fp32 compute via set_model_compute_dtype when the target is CPU; this keeps weights fp16 in memory and casts at use so peak memory does not double.
2. After running SelectModelDevice(gpu:N) and then activating MultiGPU CFG Split, only one GPU did real work even though both were loaded. create_multigpu_deepclones' reuse_loaded path matched the prior SelectModelDevice patcher (same clone_base_uuid, same device) but never set is_multigpu_base_clone, so the cond scheduler later filtered it out. Restrict reuse to clones that already carry the flag and always set it on the chosen patcher.

Also fix a related sharp edge: extra-device selection used get_all_torch_devices(exclude_current=True), which assumes the primary lives on the process's current CUDA device. After SelectModelDevice(gpu:N) that is not true. Exclude the primary model's actual load_device instead.

Amp-Thread-ID: https://ampcode.com/threads/T-019e6131-7175-719e-ad94-df5d65507375
Co-authored-by: Amp <amp@ampcode.com>
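The "sharp edge" at the end of this commit message comes down to which device gets excluded when picking extra devices. The sketch below compares the two choices on plain strings; pick_extra_devices and its parameters are hypothetical and only mirror the described behavior.

```python
def pick_extra_devices(all_devices: list[str], primary_load_device: str, max_gpus: int) -> list[str]:
    """Choose devices for multigpu clones, skipping the primary's real device."""
    extras = [dev for dev in all_devices if dev != primary_load_device]
    # The primary counts toward max_gpus, so at most max_gpus - 1 extras.
    return extras[: max(0, max_gpus - 1)]


all_devices = ["cuda:0", "cuda:1"]
process_current_device = "cuda:0"
primary_load_device = "cuda:1"  # e.g. after the model was retargeted to gpu:1

# Excluding the process's current device would pick the primary's own device.
print(pick_extra_devices(all_devices, process_current_device, 2))
# Excluding the primary's actual load device picks the other GPU.
print(pick_extra_devices(all_devices, primary_load_device, 2))
```

In the first call the "extra" device is the one the primary already occupies, so both patchers would land on the same GPU; the second call gives each patcher its own device.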

Kosinkadink added 30 commits December 29, 2024 15:49

Add 'sigmas' to transformer_options so that downstream code can know about the full scope of current sampling run, fix Hook Keyframes' guarantee_steps=1 inconsistent behavior with sampling split across different Sampling nodes/sampling runs by referencing 'sigmas'

72bbf49

Merge branch 'master' into hooks_part2

bf21be0

Merge branch 'master' into hooks_part2

d44295e

Cleaned up hooks.py, refactored Hook.should_register and add_hook_patches to use target_dict instead of target so that more information can be provided about the current execution environment if needed

5a2ad03

Refactor WrapperHook into TransformerOptionsHook, as there is no need to separate out Wrappers/Callbacks/Patches into different hook types (all affect transformer_options)

776aa73

Refactored HookGroup to also store a dictionary of hooks separated by hook_type, modified necessary code to no longer need to manually separate out hooks by hook_type

111fd0c

In inner_sample, change "sigmas" to "sampler_sigmas" in transformer_options to not conflict with the "sigmas" that will overwrite "sigmas" in _calc_cond_batch

6620d86

Merge branch 'add_sample_sigmas' into hooks_part2

db2d7ad

Made hook clone code sane, made clear ObjectPatchHook and SetInjectionsHook are not yet operational

4446c86

Filter only registered hooks on self.conds in CFGGuider.sample

0a7e2ae

Merge branch 'master' into hooks_part2

6463c39

Make hook_scope functional for TransformerOptionsHook

f48f90e

Merge branch 'master' into hooks_part2

2724ac4

removed 4 whitespace lines to satisfy Ruff,

1b38f5b

Add a get_injections function to ModelPatcher

58bf881

Made TransformerOptionsHook contribute to registered hooks properly, added some doc strings and removed a so-far unused variable

216fea1

Merge branch 'master' into hooks_part2

11c6d56

Rename AddModelsHooks to AdditionalModelsHook, rename SetInjectionsHook to InjectionsHook (not yet implemented, but at least getting the naming figured out)

3cd4c5c

Clean up a typehint

7333281

Merge branch 'comfyanonymous:master' into multigpu_support

66838eb

Add get_all_torch_devices to get detected devices intended for current torch hardware device

871258a

Initial proof of concept of giving splitting cond sampling between multiple GPUs

7448f02

Merge branch 'comfyanonymous:master' into multigpu_support

d3cf2b7

Fix cond_cat to not try to cast anything that doesn't have a 'to' function

e88c6c0

Merge branch 'master' into multigpu_support

8d4b501

Make test node for multigpu instead of storing it in just a local __init__.py

d508807

Merge branch 'master' into multigpu_support

ec16ee2

Add nodes_multigpu.py to loaded nodes

198953c

rattus128 and others added 18 commits May 23, 2026 01:00

memory_management: replace thread refusal with mutex

df17b56

This was an attempt at a fast path: ensure the file slice was created by the owning thread and refuse otherwise, without needing a mutex. worksplit-multigpu doesn't work that way, so go with a mutex. Shoot me for overthinking next time.
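The general shape of "replace thread refusal with a mutex" is sketched below. The source does not show the memory_management code, so SliceCache, get_or_create, and the idea of a keyed cache are all assumptions; the point is only that a lock lets any thread use the shared resource safely, instead of rejecting callers that are not the owning thread.

```python
import threading


class SliceCache:
    """Toy shared cache guarded by a mutex (hypothetical, not the PR's code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slices = {}

    def get_or_create(self, key, factory):
        # Any thread may call this; the lock serializes creation so the
        # factory runs at most once per key.
        with self._lock:
            if key not in self._slices:
                self._slices[key] = factory()
            return self._slices[key]


cache = SliceCache()
calls = []


def make_slice():
    calls.append(1)
    return object()


threads = [threading.Thread(target=cache.get_or_create, args=("slice", make_slice))
           for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(calls))  # the factory ran once even though eight threads asked
```

The cost is a lock acquisition on every access, which is the trade-off the commit message accepts in exchange for correctness under multi-threaded work splitting.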

comfy-aimdo 0.4.4

7a18f9a

Comfy-aimdo 0.4.4 contains a small bugfix to allow recovery of a hostbuf after full truncation. This pattern doesn't happen as a general rule, but does happen in the upcoming worksplit-multigpu branch.

Add tiled VAE lane to MultiGPU Work Units

4d3d68e

Merge pull request #14052 from rattus128/prs/worksplit-t-load-fix

cb83c41

fixup threaded loader with worksplit multi-gpu

Merge pull request #14068 from Comfy-Org/fix/single-gpu-non-cuda

e6c65fa

Fix single-GPU non-CUDA regressions on worksplit-multigpu (AMD/ROCm unload, DynamicVRAM crash)

Merge branch 'master' into worksplit-multigpu

2e5211e

Merge branch 'master' into worksplit-multigpu

5c2e34c

rattus128 reviewed May 23, 2026

Comment thread comfy/model_patcher.py Outdated

Kosinkadink and others added 4 commits May 23, 2026 19:11

multigpu: drop unused copy import; sync requirements.txt with master

ac5b7e8

Amp-Thread-ID: https://ampcode.com/threads/T-019e5783-b810-74b1-8ca9-09d675de1479 Co-authored-by: Amp <amp@ampcode.com>

Merge branch 'master' into worksplit-multigpu

de487c1

comfyanonymous merged commit 0a2dd86 into master May 26, 2026
16 checks passed

comfyanonymous deleted the worksplit-multigpu branch May 26, 2026 01:26

coderabbitai Bot mentioned this pull request May 26, 2026

Running with "--cpu" flag still tries to use Kornia #14107

Open

1 task

Kosinkadink mentioned this pull request May 26, 2026

multigpu: use unet_manual_cast for SelectModelDevice compute dtype #14108

Merged

This was referenced May 27, 2026

OOM error after commit 0a2dd86 #14126

Open

Regression in memory management in a commit made after v0.22.0, Flux 2 Dev can no longer be run in full precision on 3090 #14165

Closed

comfyanonymous merged 173 commits into master from worksplit-multigpu

20 participants
