feat: Support NVIDIA PixelDiT and PiD (CORE-201)#14103
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis pull request adds PixelDiT, a pixel-space multimodal diffusion model, and PiD, a pixel-diffusion decoder variant, to ComfyUI. The implementation includes new pixel- and patch-level transformer modules, a Gemma2-2B-based text encoder/tokenizer, PiD low-quality latent injection and gating, updates to model detection/registration, model wrappers, and a ComfyUI node for PiD conditioning. 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Hi, quick question about NVIDIA licensing.
I didn’t find the restrictive clause (“may not be distributed, deployed…”) on the HF page or in the LICENSE file. Since NVIDIA sometimes applies multiple overlapping licenses to their research models, could you clarify whether we should always assume the more restrictive NVIDIA Research license applies, even when the HuggingFace page only shows NSCLv1? Just trying to understand the policy you follow for ComfyUI integration. Thanks. |
|
Really want to try PiD, but I can't get access to the TE model yet. |
It's two different repos, the new PiD model uses the PixelDiT as base, this PR implements both. PixelDiT license: https://huggingface.co/nvidia/PixelDiT-1300M-1024px/blob/main/LICENSE So okay to repackage and share when license is included.
Edit: They have now changed the license to match PixelDiT: https://huggingface.co/nvidia/PiD/commit/b87dba45e5a2b2a18bac9515fca883f52b957558 |
I've made the repo public now: https://huggingface.co/Comfy-Org/PixelDiT |
Thank you! Just tested it, but I'm still a bit confused. Is PiD sensitive to the dimensions/image size? I tried with a vertical image, the source is 1088×1440. Then I downscaled it by 0.5 using the After that, I used another Image Resize node and set it to 2048×2048 with the method set to
If I increase it to something like 2880×2880, the image looks fine.
I also noticed it always adds a green color at the bottom part of image, even though I'm not seeing that in your example results?
|
|
I added better explanation for the different models, it's important to choose the correct one closest to your resolution, their naming is a bit confusing. |
Input image is too large maybe? That selected model works with 512x512 inputs. Also make sure to use the ELM variant of the Gemma2... at least with the PixelDit T2I the standard Gemma2 gives broken outputs. |
|
I have used the same workflow - [z_image_turbo_to_pid_02.json] but kept returning with error on the incorrect dimensions, e.g. Even after the PID conditioning, it does not work. Not sure what ight be the exact issue, thanks |
Z-image uses flux1 VAE, do you have that selected in the conditioning node? |
Thank you for the prompt reply. Yes, Flux1 VAE is selected in the conditioning node (latent_format = flux1) and the PID model itself being flux1 (e.g. [PiD_res2kto4k_sr4x_official_flux_distill_4step]) but same issue; perhaps I'll reinstall ComfyUI for a re-test thanks! |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
comfy/ldm/pixeldit/modules.py (1)
149-170: ⚡ Quick winAvoid rebuilding the same RoPE table in every block call.
_fetch_pos()allocates a fresh position tensor for eachPiTBlock.forward(). In this path the same(Hs, Ws, dtype, rope_options)values repeat across blocks and denoise steps, so this is just extra latency and GPU memory churn. Reusing a per-forward cache would pay off here. As per coding guidelinescomfy/**: Core ML/diffusion engine. Focus on performance implications in hot paths.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@comfy/ldm/pixeldit/modules.py` around lines 149 - 170, PiTBlock.forward currently calls _fetch_pos each block which calls precompute_freqs_cis_2d and reallocates the same RoPE tensor repeatedly; add a small per-instance cache keyed by (Hs, Ws, device, dtype, rope_options) (e.g., self._rope_cache = {}) and modify _fetch_pos to check the cache and return the cached tensor when present, otherwise compute, store, and return it; ensure the cache key includes transformer_options.get("rope_options") and that tensors are moved to the correct device/dtype if needed and invalidated when shapes or devices change to avoid stale tensors.comfy/ldm/pixeldit/pid.py (1)
117-131: ⚡ Quick winAdd an explicit LQ latent channel check.
LQProjection2Dis configured for a fixedlatent_channels, but wrong latent formats currently hit the first conv and fail with a backend shape-mismatch error. A small upfront validation would make this failure deterministic and much easier to diagnose.Suggested fix
def _align_latent_to_patch_grid(self, lq_latent: torch.Tensor, pH: int, pW: int) -> torch.Tensor: B, z_dim = lq_latent.shape[:2] + if z_dim != self.latent_channels: + raise ValueError(f"Expected lq_latent with {self.latent_channels} channels, got {z_dim}") if self.z_to_patch_ratio >= 1: if lq_latent.shape[2] != pH or lq_latent.shape[3] != pW: z_aligned = F.interpolate(lq_latent, size=(pH, pW), mode="nearest")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@comfy/ldm/pixeldit/pid.py` around lines 117 - 131, Add a deterministic channel validation at the start of _align_latent_to_patch_grid: check that lq_latent.shape[1] (channels) equals the configured latent_channels (or self.latent_channels / expected channel count used by LQProjection2D) and raise a clear ValueError if it does not; this prevents an opaque backend shape-mismatch later in the forward pass and directs the user to the expected tensor format before any interpolation or reshape logic runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@comfy_extras/nodes_pid.py`:
- Around line 37-40: When latent_format == "flux" explicitly validate the
channel count instead of treating all non-128 shapes as Flux1: check
samples.shape[1] and set fmt_cls = comfy.latent_formats.Flux2 when it equals
128, set fmt_cls = comfy.latent_formats.Flux when it equals the expected Flux
channel count (e.g., 64 or your project's Flux channels), and otherwise raise a
clear ValueError/RuntimeError referencing the invalid channel count and
latent_format so the caller fails fast; update the code that assigns fmt_cls
(the branch using latent_format, samples.shape[1], comfy.latent_formats.Flux,
and comfy.latent_formats.Flux2) to perform this validation and error handling.
In `@comfy/ldm/pixeldit/model.py`:
- Around line 200-206: The code silently drops border pixels when computing
patches (Hs = H // self.patch_size, Ws = W // self.patch_size and using F.unfold
with stride=self.patch_size), so reject inputs whose height or width are not
divisible by self.patch_size: in the beginning of the patchifying logic (around
where H, W are read and pos_img is computed and x_patches is created, and
likewise where F.fold is used to reconstruct) add a guard that checks H %
self.patch_size == 0 and W % self.patch_size == 0 and raise a clear ValueError
(mentioning expected patch_size and actual H/W) instead of proceeding; reference
the functions/variables _fetch_patch_pos, x_patches (F.unfold), and the
reconstruction code that uses F.fold to locate where to add the guard.
- Around line 221-223: The loop over self.patch_blocks is hardcoding the
attention mask as None when calling each block, so padded text tokens remain
visible; update the call inside the loop in model.py (the code invoking
self.patch_blocks and _pre_patch_block) to pass the attention_mask from the
outer forward signature into each block call instead of None (i.e., forward the
attention_mask to MMDiTJointAttention.forward via the existing
transformer_options/kwargs), ensuring every patch block receives the same
attention_mask.
In `@comfy/ldm/pixeldit/modules.py`:
- Around line 128-139: The module sizes compress_to_attn and expand_from_attn
using the constructor patch_size but forward() recomputes a caller-supplied
patch_size; add a fast-fail check to ensure they match: store the constructor
patch_size (or p2) on self (e.g., self._patch_size or self._p2) in __init__, and
in forward() assert the recomputed patch_size (or p2) equals the stored value
and raise a clear error referencing compress_to_attn/expand_from_attn if not;
apply the same guard for the related logic referenced around the other block
(lines ~155-159) to prevent opaque reshape failures.
In `@comfy/ldm/pixeldit/pid.py`:
- Around line 200-209: The _forward method currently calls degrade_sigma.to(...)
without checking for None and doesn't validate its length; update the beginning
of PidNet._forward to (1) raise a clear ValueError if degrade_sigma is None, (2)
ensure degrade_sigma is a tensor (or convert it) before calling .to(device=...,
dtype=...), and (3) validate that degrade_sigma.numel() is either 1 or equals
the batch size B, raising a ValueError on mismatch; keep the existing
expand(B).contiguous() behavior when numel()==1 so downstream gating logic
receives a 1-D tensor of length B on the correct device and dtype.
In `@comfy/model_detection.py`:
- Around line 466-481: The PiD detection only checks for an unprefixed lq_proj
key but later checks a core-prefixed pixel_embedder key, causing mis-detection
when checkpoints use a core.* prefix; update the detection to look for both
prefixed and unprefixed forms: when testing for the PiD marker, check both
'{}lq_proj.latent_proj.0.weight' and '{}core.lq_proj.latent_proj.0.weight' (and
build the gate prefix similarly from whichever matched), and when falling back
to PixelDiT T2I check both '{}core.pixel_embedder.proj.weight' and
'{}pixel_embedder.proj.weight', so the code correctly handles checkpoints with
or without the core. prefix.
---
Nitpick comments:
In `@comfy/ldm/pixeldit/modules.py`:
- Around line 149-170: PiTBlock.forward currently calls _fetch_pos each block
which calls precompute_freqs_cis_2d and reallocates the same RoPE tensor
repeatedly; add a small per-instance cache keyed by (Hs, Ws, device, dtype,
rope_options) (e.g., self._rope_cache = {}) and modify _fetch_pos to check the
cache and return the cached tensor when present, otherwise compute, store, and
return it; ensure the cache key includes transformer_options.get("rope_options")
and that tensors are moved to the correct device/dtype if needed and invalidated
when shapes or devices change to avoid stale tensors.
In `@comfy/ldm/pixeldit/pid.py`:
- Around line 117-131: Add a deterministic channel validation at the start of
_align_latent_to_patch_grid: check that lq_latent.shape[1] (channels) equals the
configured latent_channels (or self.latent_channels / expected channel count
used by LQProjection2D) and raise a clear ValueError if it does not; this
prevents an opaque backend shape-mismatch later in the forward pass and directs
the user to the expected tensor format before any interpolation or reshape logic
runs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1afcc0ec-af3d-48ac-849b-40d5b1da9892
📒 Files selected for processing (12)
comfy/latent_formats.pycomfy/ldm/modules/diffusionmodules/mmdit.pycomfy/ldm/pixeldit/model.pycomfy/ldm/pixeldit/modules.pycomfy/ldm/pixeldit/pid.pycomfy/model_base.pycomfy/model_detection.pycomfy/sd.pycomfy/supported_models.pycomfy/text_encoders/pixeldit.pycomfy_extras/nodes_pid.pynodes.py
|
Hello kijai, It is lack of example in original paper for flux2. I am not sure it is a model issue or implementation issue. |
|
Hello, I am facing the following error: "TypeError: HostBuffer.init() takes 2 positional arguments but 4 were given" I know this error can happen when there is an incompatibility between some custom nodes and the installed ComfyUI version. I had an updated version of ComfyUI installed, and then I did the following:
After that, I started getting this error in every workflow I try to run. My guess is that this may be caused by a version mismatch between ComfyUI and one or more custom nodes/dependencies, but I am not sure which versions I should be using. If anyone could help me identify what needs to be updated or downgraded, I would really appreciate it. If not, no problem — I can wait until this PR is officially merged. Thanks! |
This is probably due to the PR code being new and requiring the latest |
|
|
||
|
|
||
| def _build_padded_tokens(combined_text: str, spiece_tokenizer, pad_id: int, chi_token_count: int): | ||
| # Right-pad to chi_token_count + 300 - 2 (matches upstream's max_length_all). |
There was a problem hiding this comment.
Is this actually needed? SDTokenizer class already has functionality to pad the tokens.
There was a problem hiding this comment.
It's indeed not, also other redundant things in this file so cleaned it up more.
Well that is completely wrong, especially for this model that's best used as part of a pipeline, as the decode/upscale stage. Only reason SUPIR/SEEDVR2 were standalone was the effort/knowledge needed to integrate them to the core, not because they'd work better standalone, very much the opposite. Also I'd appreciate if you didn't come here telling me I'm wasting my time and advertising your node pack (FYI there are already multiple other wrappers for it too). It is not better approach, just one that takes less effort. |
I never meant to offend you or belittle your work—it was just a figure of speech. I believe what you say, and yes, if we commit to implementing it, things will certainly improve.... I apologize; I didn’t mean to offend you by posting the link to the node I created... I just wanted to help in some way by trying out the technology in an alternative way until the implementation is stable. [ ]'s |
Ok, well appreciate the clarification. The implementation is stable and tested to match the original code, with additional memory optimizations, rest is up to the workflows. It's not the easiest model to use due to the resolution and some aspect ratio restrictions, the green artifact comes from going outside the model's training for example (and yes tiling can be one way to combat that). I know my examples here are just barebones basic ones and should only be taken as starting point. |
|
Heads' up for anyone following along: the |
Yes, sorry for not mentioning that here, for the actually merged version it's just "flux" and "sd3" currently since it's easy to autodetect flux2, this makes the node automatic for anything but sd3. |
|
Thanks @kijai - I believe the attached JSON workflows from first post still use the old values. |
|
Hi @kijai . Why PiD for flux2, will make the picture turn whitish? I use ERNIE-Image to generate the origin picture |
It's just how the model is, I don't know why, could be the distillation (undistilled models are not released yet) or just their training in general.. all their examples are bright scenes too. The ComfyUI implementation matches what I get from their code:
|
|
What would have the most "real" details, generating at 1536x1536 natively, or using PiD to get from 1024x1024 to 4096x4096 ? |
I suppose it depends on the model, but probably the best use is to generate at given models native resolution, ending it early and passing the latent with the noise to PiD, then using the degrade_sigma to finish and decode the image. I don't know the optimal workflow for this yet, but seems to work great with Z-image at least. |
|
Hey guys, we have a new checkpoint for Thanks for merging PiD to ComfyUI!! |



















Adds support for Nvidia PixelDiT T2I image model, as well as the new PiD models.
Models (nsclv1 license):
https://huggingface.co/Comfy-Org/PixelDiT
PixelDiT text to image
pixeldit_test_01.json

PiD
Encode - decode example:
pid_512-2048_flux1_upscale_example_02.json
Z-image to 4096 example:
z_image_turbo_to_pid_03.json
