Fix BO mode wired as 'none' in non-wavelet SURE samplers by pycms-nube · Pull Request #486 · Panchovix/stable-diffusion-webui-reForge

pycms-nube · 2026-05-16T10:16:32Z

What changed

All 8 non-wavelet SURE samplers (sample_sure, sample_dpmpp_2s_a_sure, sample_dpmpp_2s_a_sure_adaptive, sample_dpmpp_2m_sure, sample_dpmpp_2m_sde_sure, sample_dpmpp_3m_sde_sure, sample_dpmpp_2m_sde_sure_adaptive, sample_sure_adaptive) were initialising _adam_state with:

_adam_state = {'optimizer': None, 'param': None} if sure_adam_mode != 'none' else None

This allocated a real dict when sure_adam_mode='bo' and passed it as a non-None adam_state to _sure_correct_x0.

Why it was wrong

_sure_correct_x0 only activates Adam when adam_mode in ('adam', 'adamw'). When adam_mode='bo', the adam block is skipped and the function falls through to plain SGD — the non-None dict is silently ignored. BO mode in the non-wavelet path ran as plain SGD, wasting an allocation each call and masking the mismatch.

The wavelet sampler sample_sure_wavelet already had the correct check (sure_adam_mode in ('adam', 'adamw')) so _adam_state=None for BO mode.

Fix

Changed all 8 non-wavelet samplers to match the wavelet pattern:

_adam_state = {'optimizer': None, 'param': None} if sure_adam_mode in ('adam', 'adamw') else None
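
As a minimal sketch of the difference, the two conditional expressions can be compared side by side. The helper names `old_adam_state` and `new_adam_state` are hypothetical and exist only for illustration; only the two expressions themselves are taken from this PR.

```python
# Hypothetical helpers wrapping the two expressions quoted in this PR.

def old_adam_state(sure_adam_mode):
    # Pre-fix: any mode other than 'none' (including 'bo') gets a dict.
    return {'optimizer': None, 'param': None} if sure_adam_mode != 'none' else None


def new_adam_state(sure_adam_mode):
    # Post-fix: only the two Adam-family modes get a dict.
    return {'optimizer': None, 'param': None} if sure_adam_mode in ('adam', 'adamw') else None


# Print, per mode, whether each variant allocates a state dict.
for mode in ('none', 'adam', 'adamw', 'bo'):
    print(mode, old_adam_state(mode) is not None, new_adam_state(mode) is not None)
```

Only the 'bo' row differs between the two columns, which is the behaviour change this PR makes.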

Test plan

- Run a generation with each SURE sampler variant and sure_adam_mode='bo'; confirm no runtime errors
- Verify sure_adam_mode='adam' and 'adamw' still initialise _adam_state correctly (non-None)
- Confirm sure_adam_mode='none' keeps _adam_state=None

🤖 Generated with Claude Code

Add diffusers support for UNet and LoRA. Also, Claude now has the capability to handle complex usage

And we are now ready for some fun features

The MPS backend has some updates now, so let's bump it up

Mer

This commit fixes all NoneType and other issues caused by a weird loading problem. It also adds typing. The diffusers pipeline can now compile with LoRA. This commit also introduces the first version of Forge auto-offloading (maximum fit, offloading by layers) using diffusers auto device map inference

The diffusers pipeline can now load models through diffusers. This allows advanced model support and better model loading. Assisted by Claude Opus

This may fix a problem in the original DoRA support, though I am not sure. Claude says the original is wrong, but at this point both the reForge pipeline and the diffusers pipeline support it. Restore the original multi-chunk CLIP support for diffusers.

This commit introduces a fix for the case where the pipeline is not used. It also creates a stub for future optimization

This should solve the problem of SDP being inefficient for non-technical users, and makes it easier to check whether we hit the performance maximum

On macOS, memory footprint is essential. This commit uses autocast so we can run in bf16 or, in the extreme case, fp16

This commit supports mapping against diffusers schedulers and adds diffusers schedulers. Expand the LRU cache to high-cost diffusers functions

…webui-reForge into reforge_upstream

Reforge upstream

Fix noise scale on EDM for Euler A2. Also add DC-sampler, SURE, a trajectory sampler.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This fix introduces some performance improvements around the SURE sampler. A proper preheat is added, along with variants of DPM++ 2M / 2S a. I suggest using DPM++ 2S a SURE: it is the best one, roughly matching the paper without introducing weird artifacts. For SURE it's sure now.

Reimplement SURE after noticing that noise is added back later. We can now also use model metadata to infer what is right

This fix allows SURE to actually run under the sampling assumption. The problem is that sampling needs high sigma while SURE does not like it. We also injected noise incorrectly, so something blew up. Sure, this is a SURE fix ;)

Mostly you should not change this, but... yeah, why not?

Replace the fixed sigma-scaled SGD step in _sure_correct_x0 with an optional Adam/AdamW optimiser whose state (m, v, t) persists across diffusion steps, mapping each denoising step to one optimiser iteration.
Adam normalises each pixel's gradient by its historical variance so that alpha becomes a true scale-invariant learning rate rather than a raw gradient magnitude knob. The sigma-scaling heuristic (alpha / (1 + sigma_t)) is therefore dropped when Adam is active: it is redundant and would double-suppress early steps that Adam already handles via its second moment.
AdamW adds decoupled weight decay applied directly to x0_hat, pulling the corrected estimate toward zero each step without contaminating the moment estimates.
Four new UI options are exposed under Settings:
- sure_adam_mode: none / adam / adamw (Radio)
- sure_adam_beta1: first-moment decay (Slider, default 0.9)
- sure_adam_beta2: second-moment decay (Slider, default 0.999)
- sure_adam_wd: AdamW weight decay (Slider, default 0.01)
All eight SURE samplers receive the new parameters via the existing signature-inspection path in sd_samplers_common.py.
Diagnostic logging now reports eff_grad_rms and adam_ratio (eff/raw) so the per-pixel adaptation can be verified at runtime.
Empirically the adam_ratio and eff_grad_rms trajectories are nearly identical across a 5× range of alpha values, confirming that Adam has absorbed the scale sensitivity that previously required manual tuning.
Signed-off-by: PYCMS <zenghongyi2004@gmail.com>
Co-developed-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Claude Sonnet 4.6 <noreply@anthropic.com>
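
The persistent-state update described in this commit can be illustrated with a small pure-Python sketch. It is not the code from _sure_correct_x0: the function name `adam_step`, the flat-list representation of x0_hat, and the `eps` value are assumptions made for illustration. It shows the state (m, v, t) surviving across calls, the decoupled weight decay, and why a rescaled gradient yields nearly the same step.

```python
import math


def adam_step(x0_hat, grad, state, alpha, beta1=0.9, beta2=0.999,
              weight_decay=0.0, eps=1e-8):
    """One Adam/AdamW update on a flat list of floats (hypothetical helper).

    `state` holds m, v, t and is mutated in place so it persists across calls.
    """
    if not state:
        state.update(m=[0.0] * len(grad), v=[0.0] * len(grad), t=0)
    state['t'] += 1
    t = state['t']
    out = []
    for i, (x, g) in enumerate(zip(x0_hat, grad)):
        m = state['m'][i] = beta1 * state['m'][i] + (1 - beta1) * g
        v = state['v'][i] = beta2 * state['v'][i] + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        x = x - alpha * weight_decay * x      # decoupled decay; no-op when 0
        out.append(x - alpha * m_hat / (math.sqrt(v_hat) + eps))
    return out


x = [1.0, -2.0]
g = [0.3, -0.04]
small = adam_step(x, g, {}, alpha=0.1)
large = adam_step(x, [5 * gi for gi in g], {}, alpha=0.1)  # 5x larger gradient

# Reusing one state dict across calls advances t, as across diffusion steps.
shared = {}
adam_step(x, g, shared, alpha=0.1)
adam_step(x, g, shared, alpha=0.1)
```

In this sketch `small` and `large` agree to within roughly `eps`, because the gradient is divided by the square root of its own second moment. The weight decay touches only `x`, never `m` or `v`.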

Restructures SURE-AG (Attention-Guided SURE) from a bespoke DPM++2M sampler into a `post_cfg_function` guidance node matching the SAG/APG pattern, so it works with any sampler, not just DPM++2M.
New files
- ldm_patched/contrib/nodes_sure_ag.py: SureAttentionGuidance node
- extensions-builtin/sd_forge_sure_ag/: Forge UI accordion script
Changed
- sure_attention.py: add public build_capture_model_options() API; fix entropy NaN from bf16 log (cast to float32, clamp ≥ 0); add NaN/Inf guard before returning entropy map
- sampling.py: replace sample_sure_attention body with NotImplementedError stub (API-compatible tombstone, clear migration message)
- sd_samplers_kdiffusion.py: remove 'SURE Attention' sampler entry
- sd_samplers_common.py: remove sure_attn_* param forwarding and metadata
- shared_options.py: remove sure_attn_group/weight/blocks settings
The guidance node runs one extra UNet forward pass per step (same cost as SAG), captures per-layer attention entropy, and applies the SURE-AG weighted correction:
x0_hat − α·(1 + w·U)·2·(x0_hat − model(x0_hat, σ))
Alpha is auto-clamped to < 1/(2·(1+w)) per Lean §3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
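
The weighted correction and the alpha clamp can be sketched elementwise in plain Python. This is an illustration only, not the node's code: the function names, the `margin` used to stay strictly below the bound, and the assumption that the entropy map U is normalised to [0, 1] are all hypothetical.

```python
def clamp_alpha(alpha, w, margin=0.999):
    """Keep alpha strictly below 1 / (2 * (1 + w)); `margin` is an assumption."""
    limit = 1.0 / (2.0 * (1.0 + w))
    return min(alpha, margin * limit)


def sure_ag_correct(x0_hat, denoised_again, entropy, alpha, w):
    """Apply x0_hat - a*(1 + w*U)*2*(x0_hat - model(x0_hat, sigma)) per element.

    `denoised_again` stands in for model(x0_hat, sigma); `entropy` is U,
    assumed here to lie in [0, 1].
    """
    a = clamp_alpha(alpha, w)
    return [x - a * (1.0 + w * u) * 2.0 * (x - d)
            for x, d, u in zip(x0_hat, denoised_again, entropy)]


# Worst case for the step size: maximum entropy and an oversized alpha.
corrected = sure_ag_correct([1.0], [0.0], [1.0], alpha=1.0, w=1.0)
```

Under the U ∈ [0, 1] assumption, the clamp keeps the coefficient a·(1 + w·U)·2 below 1, so each corrected value stays between x0_hat and the second model output instead of overshooting it.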

…RU block loading
Task 1 (compile is now opt-in):
- Add --forge-diffusers-compile flag (cmd_args.py)
- Whole-model torch.compile in apply_model() gated on self._compile
- Per-block _install_compile_regions() in _setup_auto_offload() gated on self._compile
- Default run no longer stalls 1-3 min on first generation
Task 2 (LRU dynamic block loading, mirrors reForge default behaviour):
- New diff_pipeline/_lru_blocks.py: LRUBlockCache + estimate_capacity()
- Decouples weights (param.data redirect) from structure (Module tree stays put)
- GPU tensors allocated on load, freed on evict; no repeated malloc overhead
- CUDA transfer stream for async H→D copies; compute stream waits via Event
- Pinned CPU copies for fast DMA transfers
- Default path in apply_model() checks VRAM via _should_use_lru(); automatically falls back to LRU block loading when UNet + 512 MiB headroom exceeds free VRAM
- Non-block children (conv_in, time_embedding, etc.) stay on device permanently
- _reset_lru_offload() wired into _sync_lora() so LoRA swaps rebuild the cache
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
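
The LRU block-loading idea in Task 2 can be sketched without any GPU code. The class below is a toy and not the LRUBlockCache from this commit: its name, the `load`/`evict` callbacks, and the block names are hypothetical, and it assumes a capacity of at least 1.

```python
from collections import OrderedDict


class TinyLRUBlockCache:
    """Toy LRU over block names; not the LRUBlockCache from this commit."""

    def __init__(self, capacity, load, evict):
        self.capacity = capacity       # max blocks resident at once (>= 1)
        self.load = load               # callback: move a block to the device
        self.evict = evict             # callback: move a block back to CPU
        self.resident = OrderedDict()  # iteration order == recency order

    def acquire(self, name):
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return
        while len(self.resident) >= self.capacity:
            victim, _ = self.resident.popitem(last=False)  # least recent
            self.evict(victim)
        self.load(name)
        self.resident[name] = True


events = []
cache = TinyLRUBlockCache(
    capacity=2,
    load=lambda n: events.append(('load', n)),
    evict=lambda n: events.append(('evict', n)),
)
for block in ['down0', 'down1', 'down0', 'mid', 'down1']:
    cache.acquire(block)
print(events)
```

Re-acquiring 'down0' before 'mid' is requested makes 'down1' the least recently used block, so it is the one evicted first. In a real pipeline the callbacks would wrap the actual device transfers.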

…rectness
The original param.data storage-pointer redirect approach was producing a device-mismatch error at runtime: GroupNorm weight remained on CPU despite the redirect logic completing without error.
Root cause: module.to() is the only safe primitive for cross-device parameter movement that reliably updates all internal PyTorch device bookkeeping. Direct param.data = gpu_tensor storage reassignment is fragile across PyTorch versions and diffusers' UNet subclass hierarchy.
Fix: use module.to(device) for load and module.to("cpu") for evict, the same primitives the existing auto-offload pre/post hooks use. module.to() is CPU-synchronous (non_blocking=False), so parameters are valid before the pre-hook returns and the block's forward begins.
CUDA transfer stream is preserved: submitting the module.to() inside torch.cuda.stream(xfer_stream) routes the DMA copies to a dedicated stream, keeping the default compute stream free during the CPU stall. No CUDA event handshake is needed because CPU synchrony already guarantees completion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

_sync_lora() was calling _reset_lru_offload() whenever patches_uuid changed, even on the very first apply_model() call when the uuid transitions from None to a real value. This tore out the pre-hooks at step 5 before the UNet forward at step 6b, leaving all block parameters on CPU and causing a device-mismatch crash in GroupNorm.
PEFT load_lora_adapter / delete_adapter only adds/removes adapter sub-layers inside the block modules; it never replaces the DownBlock2D / MidBlock2D / UpBlock2D objects that the hooks are registered on. The hooks remain valid across LoRA swaps.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add _inference_headroom(x) helper in pipeline.py that uses reForge's minimum_inference_memory() (1 GiB) plus a resolution-scaled activation estimate (B × 320 × H × W × 2 bytes × 4) derived from the latent shape
- Update estimate_capacity() in _lru_blocks.py to accept x_shape and apply the same headroom formula instead of the hardcoded 512 MiB
- Pass x to both _should_use_lru() and _setup_lru_offload() at call sites so capacity estimation reflects the actual inference tensor dimensions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nt OOM
- Use soft_empty_cache() (handles CUDA/XPU/MPS/NPU) before VRAM measurement in estimate_capacity() to return fragmented reserved memory to the allocator
- Subtract inactive_split_bytes.all.current from free-VRAM estimate in both estimate_capacity() and _should_use_lru(): fragmented reserved blocks cannot service large (100s-MB) block allocations even when the formula counts them
- Increase activation headroom factor 4→16 in _inference_headroom() and estimate_capacity(): SDXL skip-connection tensors from all encoder blocks are held live until consumed by the matching decoder block (~172 MB at 1280×1280), plus ~200 MB intermediate activations; factor=16 yields ~524 MB extra on top of the 1 GiB base from minimum_inference_memory()
- Log detailed VRAM snapshot (cuda_free/reserved/active/fragmented/effective) so capacity decisions are auditable in the debug log
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
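
The headroom arithmetic can be checked with a short worked example. The function below is a hypothetical stand-in for _inference_headroom(), built only from the formula quoted in these commit messages (base + B × 320 × H × W × 2 bytes × factor). The latent shape is an assumption: a 1280×1280 image at the usual 8× VAE downscale gives a 160×160 latent, and a batch of 2 is assumed for cond + uncond.

```python
GIB = 1024 ** 3


def inference_headroom_bytes(latent_shape, base_bytes=GIB, factor=16):
    """base + B * 320 * H * W * 2 bytes * factor, per the commit messages.

    `latent_shape` is (B, C, H, W); `base_bytes` stands in for
    minimum_inference_memory(), taken as 1 GiB here.
    """
    b, _c, h, w = latent_shape
    activation = b * 320 * h * w * 2 * factor
    return base_bytes + activation


# Assumption: 1280x1280 SDXL -> 160x160 latent, batch 2 (cond + uncond).
shape = (2, 4, 160, 160)
extra_old = inference_headroom_bytes(shape, factor=4) - GIB
extra_new = inference_headroom_bytes(shape, factor=16) - GIB
print(extra_old / 1e6, extra_new / 1e6)  # extra headroom in MB
```

Under these assumptions the factor-16 term comes to 524,288,000 bytes, which lines up with the ~524 MB quoted in the commit message; the earlier factor of 4 gives a quarter of that.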

…VRAM GPUs
Adds a generalized VRAM pool allocator (VRAMAllocator) with RRIP-inspired generation-based eviction so that UNet, CLIP, and VAE blocks can be streamed on-demand from CPU to GPU without holding the full model in VRAM.
Key changes:
- diff_pipeline/vram_allocator.py: new pool allocator with LRU + 4-state generation counter (COLD/COOL/WARM/HOT), free-list for GC'd modules, TorchDispatchMode activation tracking, weakref lifecycle hooks, prepare_for_prefix for atomic prefix-scoped eviction, sync_device_state to repair flags after external free_memory() calls
- diff_pipeline/adapter.py: block-level registration for VAE (10 blocks) and CLIP (44 layers); _decode_needs_tiling() pre-check skips wasteful full decode for large latents and goes straight to tiled fallback
- diff_pipeline/pipeline.py: tracking_context() wraps UNet forward; pre-evicts CLIP before UNet sampling starts so VRAM is available from step 1; sync_device_state() after free_memory() keeps allocator coherent
- diff_pipeline/_lru_blocks.py: flush_to_cpu() for VRAM pressure hook; 1.15x CUDA alignment overhead in capacity estimate to prevent OOM
- ldm_patched/modules/model_management.py: VRAM pressure hook registry so VRAMAllocator can respond to free_memory() calls from other consumers
Validated: 6-step 1280×1280 SDXL generation on RTX 2080 Max-Q (8 GB). CLIP loads once per generation; VAE tiling decision is made pre-emptively; UNet blocks load and age across steps without unnecessary reloads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
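
A generation-aged eviction policy of the kind named above can be sketched as follows. Only the four state names come from the commit message. The promotion rule (any access jumps to HOT), the aging rule (one state per step), the LRU tie-break, and all class and block names are assumptions, so this should not be read as how VRAMAllocator actually behaves.

```python
from enum import IntEnum
from itertools import count


class Gen(IntEnum):
    COLD = 0
    COOL = 1
    WARM = 2
    HOT = 3


class GenerationPool:
    """Toy generation-aged pool; the aging rules here are assumptions."""

    def __init__(self):
        self.gen = {}         # block name -> Gen
        self.last_use = {}    # block name -> monotonically increasing tick
        self._tick = count()

    def touch(self, name):
        # Assumed rule: any access promotes the block straight to HOT.
        self.gen[name] = Gen.HOT
        self.last_use[name] = next(self._tick)

    def age(self):
        # Assumed rule: every block cools by one state per sampling step.
        for name, g in self.gen.items():
            self.gen[name] = Gen(max(g - 1, Gen.COLD))

    def pick_victim(self):
        # Coldest generation first; least recently used breaks ties.
        return min(self.gen, key=lambda n: (self.gen[n], self.last_use[n]))

    def evict(self):
        victim = self.pick_victim()
        del self.gen[victim], self.last_use[victim]
        return victim


pool = GenerationPool()
pool.touch('clip.layer0')
pool.touch('unet.down0')
pool.age()                  # both cool from HOT to WARM
pool.touch('unet.down0')    # reused this step, back to HOT
victim = pool.evict()
print(victim)
```

Here the CLIP layer is evicted because it has cooled while the UNet block was reused, which is the kind of outcome the commit describes for blocks that "load and age across steps".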

…-diffusers-lru-headroom
The base activation headroom reserved by the LRU allocator was hardcoded to reForge's minimum_inference_memory() (~1 GiB). Users with low fragmentation can now pass --forge-diffusers-lru-headroom <MB> to reduce it, allowing more UNet blocks to stay resident between steps at the cost of a smaller OOM safety margin.
The resolution-scaled skip-connection estimate is always added on top of the specified base, so the flag only controls the floor, not the total.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
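
The "floor, not total" behaviour can be sketched with argparse. The flag name is taken from the commit message; its type, its default, the MiB interpretation of the value, and both helper functions are assumptions for illustration and may not match cmd_args.py.

```python
import argparse

parser = argparse.ArgumentParser()
# Flag name is from the commit message; type and default are assumptions.
parser.add_argument('--forge-diffusers-lru-headroom', type=int, default=None,
                    metavar='MB')


def base_headroom_bytes(args, default_bytes=1024 ** 3):
    """Return the headroom floor; the default stands in for
    minimum_inference_memory()."""
    if args.forge_diffusers_lru_headroom is None:
        return default_bytes
    # Assumption: the flag value is interpreted as MiB.
    return args.forge_diffusers_lru_headroom * 1024 * 1024


def total_headroom_bytes(args, activation_bytes):
    # The resolution-scaled estimate is always added on top of the floor.
    return base_headroom_bytes(args) + activation_bytes


args = parser.parse_args(['--forge-diffusers-lru-headroom', '256'])
default_args = parser.parse_args([])
lowered = total_headroom_bytes(args, activation_bytes=500_000_000)
default = total_headroom_bytes(default_args, activation_bytes=500_000_000)
print(lowered, default)
```

Lowering the flag shrinks only the base term; the activation estimate (an arbitrary example value here) contributes the same amount in both cases.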

…F841)
- adapter.py: split `import gc, logging as _log` onto two lines (×2)
- vram_allocator.py: remove unused `size_mb` in `_do_evict`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>