Fix BO mode wired as 'none' in non-wavelet SURE samplers #486

Open
pycms-nube wants to merge 138 commits into Panchovix:main from pycms-nube:main

@pycms-nube

What changed

All 8 non-wavelet SURE samplers (sample_sure, sample_dpmpp_2s_a_sure, sample_dpmpp_2s_a_sure_adaptive, sample_dpmpp_2m_sure, sample_dpmpp_2m_sde_sure, sample_dpmpp_3m_sde_sure, sample_dpmpp_2m_sde_sure_adaptive, sample_sure_adaptive) were initialising _adam_state with:

_adam_state = {'optimizer': None, 'param': None} if sure_adam_mode != 'none' else None

This allocated a real dict when sure_adam_mode='bo' and passed it as a non-None adam_state to _sure_correct_x0.

Why it was wrong

_sure_correct_x0 only activates Adam when adam_mode in ('adam', 'adamw'). When adam_mode='bo', the adam block is skipped and the function falls through to plain SGD — the non-None dict is silently ignored. BO mode in the non-wavelet path ran as plain SGD, wasting an allocation each call and masking the mismatch.

The wavelet sampler sample_sure_wavelet already had the correct check (sure_adam_mode in ('adam', 'adamw')) so _adam_state=None for BO mode.

Fix

Changed all 8 non-wavelet samplers to match the wavelet pattern:

_adam_state = {'optimizer': None, 'param': None} if sure_adam_mode in ('adam', 'adamw') else None
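
The difference between the two conditions can be checked in isolation. The sketch below is only an illustration: `make_adam_state_old` and `make_adam_state_fixed` are hypothetical helper names (the samplers inline the expression rather than calling a helper), and the mode strings are the ones named in this description.

```python
from typing import Optional

# Modes for which _sure_correct_x0 activates Adam, per the description above.
ADAM_MODES = ("adam", "adamw")


def make_adam_state_old(sure_adam_mode: str) -> Optional[dict]:
    """Hypothetical wrapper around the old condition: any mode except 'none' gets a dict."""
    return {"optimizer": None, "param": None} if sure_adam_mode != "none" else None


def make_adam_state_fixed(sure_adam_mode: str) -> Optional[dict]:
    """Hypothetical wrapper around the fixed condition: only Adam modes get a dict."""
    return {"optimizer": None, "param": None} if sure_adam_mode in ADAM_MODES else None


if __name__ == "__main__":
    for mode in ("none", "adam", "adamw", "bo"):
        old = make_adam_state_old(mode) is not None
        new = make_adam_state_fixed(mode) is not None
        print(f"{mode:6s} old_allocates={old} fixed_allocates={new}")
```

Only the 'bo' row differs between the two columns, which is the case this PR targets.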

Test plan

  • Run a generation with each SURE sampler variant and sure_adam_mode='bo' — confirm no runtime errors
  • Verify sure_adam_mode='adam' and 'adamw' still initialise _adam_state correctly (non-None)
  • Confirm sure_adam_mode='none' keeps _adam_state=None

🤖 Generated with Claude Code

pycms-nube and others added 30 commits March 28, 2026 13:47
Add diffuser support for UNet and LoRA.
Claude can now also handle complex usage.
And we are now ready for some fun features.
The MPS backend has some updates now, so let's bump it up.
This commit fixes all NoneType and other issues caused by a weird loading problem. It also adds typing.

The diffuser pipeline can now compile with LoRA.
This commit also introduces the first version of forge auto offloading (maximum-fit offloading by layers) using diffuser auto device map inference.
The diffuser pipeline can now load models using diffuser. This allows advanced model support and better model loading.
Assisted by Claude Opus.
This more or less fixes a problem in the original DoRA support, though I am not sure. Claude says the original is wrong, but at this point both the reForge pipeline and diffuser have support.

Restore the original multi-chunk CLIP support for diffuser.
This commit introduces a fix related to not using the pipeline.

Also create a stub for future optimization.
This should solve the problem of SDP being inefficient for non-technical users, and it makes it easier to check whether we have hit the performance maximum.
On macOS, memory footprint is essential. This commit uses autocast so we can run in bf16 or, in the extreme case, fp16.
This commit supports mapping against diffuser schedulers and adds a diffuser scheduler.
Expand the LRU cache to high-cost diffuser functions.
Fix noise scale on EDM for Euler A2. Also add DC-sampler, SURE, a trajectory sampler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This fix introduces some performance improvements around the SURE sampler.
A proper preheat is added, along with variants of DPM++ 2M/2S a.
I suggest using DPM++ 2S a SURE. It is the best one: it roughly matches the paper without introducing weird artifacts.
For SURE it's sure now.
Reimplement SURE after noticing that noise is added back later.
We can now also use model metadata to infer what is right.
This fix allows SURE to actually run under the sampling assumption.
The problem is that sampling needs high sigma, while SURE does not like it.
We also injected noise incorrectly, so something blew up.
Sure this is a SURE fix;)
Though mostly you should not change but...
Yeah why not?
Replace the fixed sigma-scaled SGD step in _sure_correct_x0 with an
optional Adam/AdamW optimiser whose state (m, v, t) persists across
diffusion steps, mapping each denoising step to one optimizer iteration.

Adam normalises each pixel's gradient by its historical variance so that
alpha becomes a true scale-invariant learning rate rather than a raw
gradient magnitude knob.  The sigma-scaling heuristic (alpha / (1 +
sigma_t)) is therefore dropped when Adam is active — it is redundant
and would double-suppress early steps that Adam already handles via its
second moment.

AdamW adds decoupled weight decay applied directly to x0_hat, pulling
the corrected estimate toward zero each step without contaminating the
moment estimates.
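
As a rough illustration of the update described above, here is a minimal pure-Python sketch of one Adam/AdamW iteration on a flat list of values. It is not the code in _sure_correct_x0: the function name `adam_step`, the state layout, and `eps` are assumptions, and the real sampler operates on tensors. It shows the two properties the message relies on: the state (m, v, t) persists between calls, and the step size is nearly independent of the gradient scale.

```python
import math


def adam_step(x, grad, state, alpha, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam (weight_decay=0) or AdamW (weight_decay>0) iteration.

    `state` holds per-element moments m, v and the step count t, and is
    meant to be reused across diffusion steps (one call per step).
    """
    state["t"] += 1
    t = state["t"]
    out = []
    for i, (xi, gi) in enumerate(zip(x, grad)):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * gi
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * gi * gi
        # Bias correction for the zero-initialised moments.
        m_hat = state["m"][i] / (1.0 - beta1 ** t)
        v_hat = state["v"][i] / (1.0 - beta2 ** t)
        # Decoupled decay acts on the value itself, not on the moments.
        xi = xi * (1.0 - alpha * weight_decay)
        out.append(xi - alpha * m_hat / (math.sqrt(v_hat) + eps))
    return out


def new_state(n):
    """Fresh optimizer state for n elements."""
    return {"m": [0.0] * n, "v": [0.0] * n, "t": 0}


if __name__ == "__main__":
    x = [1.0, -2.0]
    small = adam_step(x, [0.5, -0.25], new_state(2), alpha=0.1)
    large = adam_step(x, [50.0, -25.0], new_state(2), alpha=0.1)
    print(small, large)  # nearly identical despite a 100x gradient scale
```

Because the update divides by the running root-mean-square of the gradient, multiplying the gradient by a constant leaves the step almost unchanged, which is consistent with dropping the sigma-scaling heuristic when Adam is active.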

Four new UI options are exposed under Settings:
  sure_adam_mode   — none / adam / adamw  (Radio)
  sure_adam_beta1  — first-moment decay   (Slider, default 0.9)
  sure_adam_beta2  — second-moment decay  (Slider, default 0.999)
  sure_adam_wd     — AdamW weight decay   (Slider, default 0.01)

All eight SURE samplers receive the new parameters via the existing
signature-inspection path in sd_samplers_common.py.  Diagnostic logging
now reports eff_grad_rms and adam_ratio (eff/raw) so the per-pixel
adaptation can be verified at runtime.

Empirically the adam_ratio and eff_grad_rms trajectories are nearly
identical across a 5× range of alpha values, confirming that Adam has
absorbed the scale sensitivity that previously required manual tuning.

Signed-off-by: PYCMS <zenghongyi2004@gmail.com>
Co-developed-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Claude Sonnet 4.6 <noreply@anthropic.com>
pycms-nube and others added 30 commits June 5, 2026 21:00
Restructures SURE-AG (Attention-Guided SURE) from a bespoke DPM++2M
sampler into a `post_cfg_function` guidance node matching the SAG/APG
pattern — works with any sampler, not just DPM++2M.

New files
- ldm_patched/contrib/nodes_sure_ag.py      SureAttentionGuidance node
- extensions-builtin/sd_forge_sure_ag/      Forge UI accordion script

Changed
- sure_attention.py: add public build_capture_model_options() API;
  fix entropy NaN from bf16 log (cast to float32, clamp ≥ 0);
  add NaN/Inf guard before returning entropy map
- sampling.py: replace sample_sure_attention body with NotImplementedError
  stub (API-compatible tombstone, clear migration message)
- sd_samplers_kdiffusion.py: remove 'SURE Attention' sampler entry
- sd_samplers_common.py: remove sure_attn_* param forwarding and metadata
- shared_options.py: remove sure_attn_group/weight/blocks settings

The guidance node runs one extra UNet forward pass per step (same cost
as SAG), captures per-layer attention entropy, and applies the
SURE-AG weighted correction:
  x0_hat − α·(1 + w·U)·2·(x0_hat − model(x0_hat, σ))
Alpha is auto-clamped to < 1/(2·(1+w)) per Lean §3.
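
The correction and the alpha clamp can be sketched on flat lists as follows. This is an illustration under stated assumptions, not the node's code: `clamp_alpha`, `sure_ag_correct`, the `margin` factor used to stay strictly below the bound, and the toy denoiser are all hypothetical, and U is assumed to lie in [0, 1].

```python
def clamp_alpha(alpha, w, margin=0.99):
    """Keep alpha strictly below 1 / (2 * (1 + w)); `margin` is an assumption."""
    bound = 1.0 / (2.0 * (1.0 + w))
    return min(alpha, margin * bound)


def sure_ag_correct(x0_hat, denoise, sigma, U, alpha, w):
    """x0_hat - alpha * (1 + w*U) * 2 * (x0_hat - denoise(x0_hat, sigma)), per element."""
    alpha = clamp_alpha(alpha, w)
    denoised = denoise(x0_hat, sigma)  # the one extra forward pass per step
    return [
        x - alpha * (1.0 + w * u) * 2.0 * (x - d)
        for x, u, d in zip(x0_hat, U, denoised)
    ]


if __name__ == "__main__":
    # Toy denoiser that pulls everything to zero (stand-in for the UNet).
    toy_denoise = lambda x, sigma: [0.0] * len(x)
    out = sure_ag_correct([1.0, 1.0], toy_denoise, sigma=1.0,
                          U=[0.0, 1.0], alpha=0.1, w=1.0)
    print(out)  # the high-uncertainty element (U=1) is corrected twice as hard
```

With U in [0, 1], the per-element factor 2·α·(1 + w·U) stays below 1 under this clamp, so the corrected value never moves past the denoiser output.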

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…RU block loading

Task 1 — compile is now opt-in:
  - Add --forge-diffusers-compile flag (cmd_args.py)
  - Whole-model torch.compile in apply_model() gated on self._compile
  - Per-block _install_compile_regions() in _setup_auto_offload() gated on self._compile
  - Default run no longer stalls 1-3 min on first generation

Task 2 — LRU dynamic block loading (mirrors reForge default behaviour):
  - New diff_pipeline/_lru_blocks.py: LRUBlockCache + estimate_capacity()
  - Decouples weights (param.data redirect) from structure (Module tree stays put)
  - GPU tensors allocated on load, freed on evict — no repeated malloc overhead
  - CUDA transfer stream for async H→D copies; compute stream waits via Event
  - Pinned CPU copies for fast DMA transfers
  - Default path in apply_model() checks VRAM via _should_use_lru(); automatically
    falls back to LRU block loading when UNet + 512 MiB headroom exceeds free VRAM
  - Non-block children (conv_in, time_embedding, etc.) stay on device permanently
  - _reset_lru_offload() wired into _sync_lora() so LoRA swaps rebuild the cache

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rectness

The original param.data storage-pointer redirect approach was producing a
device-mismatch error at runtime: GroupNorm weight remained on CPU despite
the redirect logic completing without error.

Root cause: module.to() is the only safe primitive for cross-device parameter
movement that reliably updates all internal PyTorch device bookkeeping. Direct
param.data = gpu_tensor storage reassignment is fragile across PyTorch versions
and diffusers' UNet subclass hierarchy.

Fix: use module.to(device) for load and module.to("cpu") for evict — the same
primitives the existing auto-offload pre/post hooks use. module.to() is
CPU-synchronous (non_blocking=False), so parameters are valid before the
pre-hook returns and the block's forward begins.

CUDA transfer stream is preserved: submitting the module.to() inside
torch.cuda.stream(xfer_stream) routes the DMA copies to a dedicated stream,
keeping the default compute stream free during the CPU stall. No CUDA event
handshake is needed because CPU synchrony already guarantees completion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_sync_lora() was calling _reset_lru_offload() whenever patches_uuid
changed — even on the very first apply_model() call when the uuid
transitions from None to a real value. This tore out the pre-hooks at
step 5 before the UNet forward at step 6b, leaving all block parameters
on CPU and causing a device-mismatch crash in GroupNorm.

PEFT load_lora_adapter / delete_adapter only adds/removes adapter
sub-layers inside the block modules; it never replaces the
DownBlock2D / MidBlock2D / UpBlock2D objects that the hooks are
registered on. The hooks remain valid across LoRA swaps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add _inference_headroom(x) helper in pipeline.py that uses reForge's
  minimum_inference_memory() (1 GiB) plus a resolution-scaled activation
  estimate (B × 320 × H × W × 2 bytes × 4) derived from the latent shape
- Update estimate_capacity() in _lru_blocks.py to accept x_shape and apply
  the same headroom formula instead of the hardcoded 512 MiB
- Pass x to both _should_use_lru() and _setup_lru_offload() at call sites
  so capacity estimation reflects the actual inference tensor dimensions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt OOM

- Use soft_empty_cache() (handles CUDA/XPU/MPS/NPU) before VRAM measurement
  in estimate_capacity() to return fragmented reserved memory to the allocator
- Subtract inactive_split_bytes.all.current from free-VRAM estimate in both
  estimate_capacity() and _should_use_lru(): fragmented reserved blocks cannot
  service large (100s-MB) block allocations even when the formula counts them
- Increase activation headroom factor 4→16 in _inference_headroom() and
  estimate_capacity(): SDXL skip-connection tensors from all encoder blocks are
  held live until consumed by the matching decoder block (~172 MB at 1280×1280),
  plus ~200 MB intermediate activations; factor=16 yields ~524 MB extra on top
  of the 1 GiB base from minimum_inference_memory()
- Log detailed VRAM snapshot (cuda_free/reserved/active/fragmented/effective)
  so capacity decisions are auditable in the debug log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…VRAM GPUs

Adds a generalized VRAM pool allocator (VRAMAllocator) with RRIP-inspired
generation-based eviction so that UNet, CLIP, and VAE blocks can be streamed
on-demand from CPU to GPU without holding the full model in VRAM.

Key changes:
- diff_pipeline/vram_allocator.py: new pool allocator — LRU + 4-state
  generation counter (COLD/COOL/WARM/HOT), free-list for GC'd modules,
  TorchDispatchMode activation tracking, weakref lifecycle hooks,
  prepare_for_prefix for atomic prefix-scoped eviction, sync_device_state
  to repair flags after external free_memory() calls
- diff_pipeline/adapter.py: block-level registration for VAE (10 blocks)
  and CLIP (44 layers); _decode_needs_tiling() pre-check skips wasteful
  full decode for large latents and goes straight to tiled fallback
- diff_pipeline/pipeline.py: tracking_context() wraps UNet forward;
  pre-evicts CLIP before UNet sampling starts so VRAM is available from
  step 1; sync_device_state() after free_memory() keeps allocator coherent
- diff_pipeline/_lru_blocks.py: flush_to_cpu() for VRAM pressure hook;
  1.15x CUDA alignment overhead in capacity estimate to prevent OOM
- ldm_patched/modules/model_management.py: VRAM pressure hook registry
  so VRAMAllocator can respond to free_memory() calls from other consumers

Validated: 6-step 1280×1280 SDXL generation on RTX 2080 Max-Q (8 GB).
CLIP loads once per generation; VAE tiling decision is made pre-emptively;
UNet blocks load and age across steps without unnecessary reloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-diffusers-lru-headroom

The base activation headroom reserved by the LRU allocator was hardcoded
to reForge's minimum_inference_memory() (~1 GiB).  Users with low
fragmentation can now pass --forge-diffusers-lru-headroom <MB> to reduce
it, allowing more UNet blocks to stay resident between steps at the cost
of a smaller OOM safety margin.

The resolution-scaled skip-connection estimate is always added on top of
the specified base, so the flag only controls the floor, not the total.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…F841)

- adapter.py: split `import gc, logging as _log` onto two lines (×2)
- vram_allocator.py: remove unused `size_mb` in `_do_evict`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n prediction

Integrates GPU temperature throttling as a built-in reForge module instead
of requiring the external extension.

New module: modules/gpu_temperature.py
- Sensor backends: nvidia-smi (Win+Linux), ROCm-smi (Linux), OpenHardwareMonitor (Win)
- 1-D Kalman filter tracking temperature + rate of change (pure Python, no numpy)
- Predictive throttling: pauses before threshold using projected temp at configurable horizon
- Thermal plateau detection: if GPU hits max passive heat exchange and temp won't drop,
  triggers warn_and_continue or abort_generation after configurable timeout
- Lazy initialization + reinitialize() when sensor setting changes
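
A constant-velocity Kalman filter of the kind described above fits in a few lines of pure Python. The sketch below is not the code in modules/gpu_temperature.py: the class name `TempKalman`, the default noise values, and the initial covariance are assumptions. The state is [temperature, rate of change], and only the temperature is measured.

```python
class TempKalman:
    """Tracks temperature and its rate of change from scalar readings."""

    def __init__(self, q_pos=0.01, q_vel=0.01, r=1.0):
        self.q_pos, self.q_vel, self.r = q_pos, q_vel, r
        self.x = None                       # [temp, rate]
        self.p = [[1.0, 0.0], [0.0, 1.0]]   # state covariance

    def update(self, z, dt):
        """Fold in one reading z taken dt seconds after the previous one."""
        if self.x is None:
            self.x = [float(z), 0.0]
            return self.x
        t, v = self.x
        (p00, p01), (p10, p11) = self.p
        # Predict: x = F x, P = F P F^T + Q with F = [[1, dt], [0, 1]].
        t_pred = t + v * dt
        a00 = p00 + dt * (p01 + p10) + dt * dt * p11 + self.q_pos
        a01 = p01 + dt * p11
        a10 = p10 + dt * p11
        a11 = p11 + self.q_vel
        # Update with measurement matrix H = [1, 0].
        s = a00 + self.r
        k0, k1 = a00 / s, a10 / s
        y = z - t_pred
        self.x = [t_pred + k0 * y, v + k1 * y]
        self.p = [[(1.0 - k0) * a00, (1.0 - k0) * a01],
                  [a10 - k1 * a00, a11 - k1 * a01]]
        return self.x

    def predict(self, horizon):
        """Projected temperature `horizon` seconds ahead."""
        t, v = self.x
        return t + v * horizon

    def should_pause(self, threshold, horizon):
        """Predictive throttling: pause if the projection reaches the threshold."""
        return self.predict(horizon) >= threshold


if __name__ == "__main__":
    kf = TempKalman()
    for i in range(20):            # readings rising by 1 degree per second
        kf.update(60.0 + i, dt=1.0)
    print(kf.x, kf.predict(5.0), kf.should_pause(threshold=82.0, horizon=5.0))
```

In this run the filtered temperature is still below the threshold when `should_pause` fires, which is the point of projecting ahead by the prediction horizon.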

Wired into modules/sd_samplers_common.store_latent (checked at each sampler step).

Settings added to modules/shared_options.py under "GPU Temperature Protection":
- Enable/disable, sensor choice, device index, OHM GPU name filter
- Sleep/wake thresholds, poll interval, max pause duration
- Kalman: enable, q_pos, q_vel, R, prediction horizon
- Plateau: timeout, action (warn_and_continue | abort_generation)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(gpu-temp): first-party GPU temperature protection with Kalman prediction

So we finally have good protection for long runs.
Got caught by CodeQL ;(
…by attention entropy

Combines location-aware (SURE-AG attention entropy) and frequency-aware
(wavelet SURE) guidance into a single correction pass:

  - Wavelet-decomposes the SURE residual into approximation + detail subbands.
  - Projects the attention entropy map U to each subband's spatial scale via
    bilinear resampling → per-subband spatial weight W_k = 1 + w·Ũ_k.
  - Corrects each subband independently: x0_k − α·approx_coeff·W_k·r_k.
  - Optional FFT low-pass scale on the approximation subband gradient
    (FreeU-style: boosts structure, dampens high-freq noise in correction).
  - Graceful fallback: pywt missing → pixel-space SURE-AG; attn_weight=0 →
    wavelet SURE without attention weighting; fft_scale=1.0 → no FFT.

New files:
  ldm_patched/k_diffusion/sure_wav_ag.py          — core correction function
  ldm_patched/contrib/nodes_sure_wav_ag.py        — post-CFG guidance node
  extensions-builtin/sd_forge_sure_wav_ag/        — Forge UI extension

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four alpha modes for the post-CFG guidance node:
  - fixed:      static alpha clamped by Lean §3 bound (original behaviour)
  - sigma:      _sure_effective_alpha sigma scaling — EDM/Karras aware
  - analytical: α* = ⟨r,g⟩/‖g‖² closed-form via Parseval dot products in
                wavelet space (zero extra UNet calls; non-trivial when
                attn_weight>0 makes g non-collinear with r)
  - bo:         Bayesian optimisation (optuna) with cross-step warm-start
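
The closed form in the analytical mode is the least-squares step size: α* = ⟨r,g⟩/‖g‖² minimises ‖r − α·g‖². A minimal sketch on flat lists follows; `analytical_alpha` and `eps` are hypothetical names, and the real code is described as accumulating the dot products per wavelet subband rather than over one flat vector.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def analytical_alpha(r, g, eps=1e-12):
    """alpha* = <r, g> / ||g||^2; eps guards against a zero gradient."""
    return dot(r, g) / (dot(g, g) + eps)


if __name__ == "__main__":
    r = [1.0, 0.0]
    g = [1.0, 1.0]          # not collinear with r
    a = analytical_alpha(r, g)
    leftover = [ri - a * gi for ri, gi in zip(r, g)]
    print(a, dot(leftover, g))  # leftover residual is orthogonal to g
```

For an orthonormal transform the dot products are the same in coefficient space and in pixel space, which is presumably why no extra UNet call is needed to evaluate them.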

Cross-step state (_step_state) is captured in the post_cfg closure and
auto-resets when σ_t jumps > 1.5× (new generation / highres-fix restart).
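
One way to read the auto-reset rule is sketched below. This is an assumption-laden illustration: the dict layout, the function names, and the interpretation of "jumps > 1.5×" as the current sigma exceeding 1.5 times the previous one are not confirmed by the commit message.

```python
def fresh_step_state():
    """Hypothetical cross-step state: last sigma seen and steps since reset."""
    return {"prev_sigma": None, "steps": 0}


def advance(state, sigma, jump_ratio=1.5):
    """Record one step; reset first if sigma jumped upward by more than jump_ratio."""
    prev = state["prev_sigma"]
    did_reset = prev is not None and sigma > jump_ratio * prev
    if did_reset:
        state.clear()
        state.update(fresh_step_state())
    state["prev_sigma"] = sigma
    state["steps"] += 1
    return did_reset


if __name__ == "__main__":
    state = fresh_step_state()
    for sigma in (14.6, 7.0, 2.0, 0.5, 14.6, 7.0):
        print(sigma, advance(state, sigma), state["steps"])
```

Within one generation sigma only decreases, so the reset fires only when a new schedule starts from a high sigma again.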

Debug prints: every step emits [SURE-AGWAV-DBG] lines covering sigma,
entropy_rms, per-subband rdg/gsq, Parseval totals, alpha_eff, delta_rms.
Controlled by _DEBUG_PRINT = True in sure_wav_ag.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PassthroughAttnProcessor silently dropped all patches_replace["attn1"]
hooks, causing SURE-AGWAV entropy capture to yield attn_layers=0 and
U=None on every step when diff_pipeline=True.

Replace PassthroughAttnProcessor with ForgeAttnSelfProcessor, which:
  - Burns in (block_name, ldm_idx, t_idx) at construction (same as
    ForgeAttnProcessor for attn2)
  - Checks patches_replace["attn1"] for (block, idx, t) or (block, idx)
  - If hook found: computes Q/K/V explicitly in flat (B,N,heads*dim)
    format and calls hook(q,k,v,extra_options,mask) — compatible with
    attention_basic_with_sim used by _make_entropy_hook
  - If no hook: delegates to AttnProcessor2_0 (identical to old behaviour)
  - Adds [SURE-AGWAV-DBG] print when hook fires so attn_layers > 0

PassthroughAttnProcessor kept as alias for external compatibility.
_install_attn_processors updated to pass (block_name, ldm_idx, t_idx).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ization

Per-image min-max in _aggregate_entropy_map collapses to zero whenever
attention is spatially uniform — which happens every step in SDXL middle
block at high noise (entropy ≈ log(1600) ≈ 7.38 everywhere, U_max−U_min ≈ 0
→ divide by ε → all zeros → entropy_rms=0.0000 every step).

Fix: normalize each layer's raw entropy by log(N_k) (theoretical maximum).
This gives absolute uncertainty in [0,1]:
- High noise:  entropy ≈ log(N) → U ≈ 1 everywhere → W ≈ 1+w uniform
- Low noise:   entropy varies spatially → targeted correction
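
The collapse and the fix can be reproduced with a few lines of pure Python. The helper names and the `eps` values are assumptions for illustration; the real code works on per-layer attention tensors.

```python
import math


def entropy(p, eps=1e-12):
    """Shannon entropy in nats of one attention row, clamped at zero."""
    return max(0.0, -sum(pi * math.log(pi + eps) for pi in p))


def minmax_normalize(values, eps=1e-8):
    """Per-image min-max scaling (the old behaviour)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def log_n_normalize(values, n):
    """Divide by log(n), the maximum entropy of an n-way distribution."""
    return [v / math.log(n) for v in values]


if __name__ == "__main__":
    n = 1600
    uniform_row = [1.0 / n] * n
    # Four query positions that all attend uniformly (high-noise case).
    ent = [entropy(uniform_row) for _ in range(4)]
    print(ent[0])                    # close to log(1600)
    print(minmax_normalize(ent))     # collapses to zeros
    print(log_n_normalize(ent, n))   # close to 1 everywhere
```

Min-max scaling discards the absolute level, so a spatially uniform map carries no signal after it; dividing by log(N) keeps the level and needs no per-image statistics.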

Also expand entropy debug print to show min/max/mean for next test run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ff_pipeline

In attention_basic_with_sim the else-branch computes similarity in the
input dtype (fp16 on SDXL diff_pipeline path).  At seq=1600, dim_head=64
the fp16 dot-products overflow to inf → softmax([inf,0,…]) = [1,0,…] →
one-hot attention → entropy = 0 every step.

The ldm_patched path avoids this because it sets attn_precision=torch.float32
via transformer_options; the diff_pipeline path leaves it None.

Fix: force attn_precision=torch.float32 unconditionally in _make_entropy_hook.
This only controls the sim computation — the returned out is still cast back to
v.dtype (fp16), so the correction pass is unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…2 sim

torch.autocast (fp16/bf16) downcasts fp32 inputs before eligible compute ops
(einsum, matmul) even after explicit .float() conversion.  At seq=1600
(SDXL middle block) fp16 dot-products overflow to inf, softmax collapses to
one-hot, entropy=0 every step.  The earlier attn_precision=torch.float32 fix
only changed the branch selection inside attention_basic_with_sim but autocast
still ran the einsum in fp16.

Fix: wrap the attention_basic_with_sim call in
  torch.autocast(device_type=device_type, enabled=False)
to guarantee true fp32 computation regardless of outer autocast state.

Also: return out in original dtype so the model stays in fp16/bf16.
Add a one-shot diagnostic print of sim row-max and raw entropy stats.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sim_row_max mean=0.0085 (near-uniform) but ent=0 is paradoxical.
Add dissection prints: sim dtype/shape, row sums, term=p*log(p+e)
statistics, raw_neg before clamp, and a torch.special.entr cross-check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The clamp was applied before the negation, not after:

  Wrong: -(sum(p*log(p+e),-1)).clamp(0)
         sum ≈ -6.7  →  clamp → 0  →  negate → 0

  Right: (-(sum(p*log(p+e),-1))).clamp(0)
         sum ≈ -6.7  →  negate → +6.7  →  clamp → +6.7

This single missing pair of parentheses caused entropy to be exactly 0
across all steps despite correct fp32 sim values (mean entropy 6.7 nats
confirmed by diagnostic and cross-checked with torch.special.entr).
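
The precedence issue is a property of Python itself: a method call binds more tightly than unary minus, so `-(x).clamp(0)` means `-(x.clamp(0))`. The sketch below uses a tiny stand-in class, `Scalar`, which is hypothetical and exists only to avoid depending on a tensor library; a tensor with a `clamp` method parses the same way.

```python
class Scalar:
    """Minimal stand-in for a tensor: supports negation and a lower clamp."""

    def __init__(self, value):
        self.value = float(value)

    def __neg__(self):
        return Scalar(-self.value)

    def clamp(self, lower):
        return Scalar(max(self.value, lower))


if __name__ == "__main__":
    s = Scalar(-6.7)             # sum(p * log(p + e)) is negative
    wrong = -(s).clamp(0)        # parsed as -(s.clamp(0)): clamp first, then negate
    right = (-(s)).clamp(0)      # negate first, then clamp
    print(wrong.value, right.value)
```

In the wrong form the clamp sees the negative sum and flattens it to zero before the sign flip, which matches the symptom of entropy being exactly 0 on every step.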

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Strip all diagnostic prints added during the entropy=0 and shape-mismatch
investigations — EDIAG, SHAPE-DIAG, ForgeAttnSelfProcessor, and per-step
[SURE-AGWAV-DBG] lines. Set _DEBUG_PRINT=False in sure_wav_ag.py. Ongoing
diagnostics are now via _logger.debug/info only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>