Fix BO mode wired as 'none' in non-wavelet SURE samplers#486
Open
pycms-nube wants to merge 138 commits into
Add diffuser support for UNet and LoRA. Claude can now also handle complex usage
And we are now ready for some fun features
MPS backend has some updates now, let's bump up
This commit fixes all NoneType and other issues caused by a weird loading problem. It also adds typing. The diffuser pipeline can now compile with LoRA. This commit also introduces the first version of Forge auto offloading (maximum-fit offloading by layers) using diffuser auto device map inference
The diffuser pipeline can now load models using diffuser. This allows advanced model support and better model loading. Assisted by Claude Opus
We kind of fixed a problem in the original DoRA support???? I'm not sure... Claude says the original is wrong, but at this point both the reForge pipeline and diffuser have support. Restore original support of multi-chunk CLIP for diffuser.
This commit introduces a fix for the case where the pipeline is not used. It also creates a stub for future optimization
This should solve the problem of SDP being inefficient for non-technical users, and makes it easier to check whether we hit maximum performance
On macOS, memory footprint is essential. This commit uses autocast so we can run on bf16 or extreme fp16
This commit supports mapping against diffuser schedulers and adds a diffuser scheduler. It also expands the LRU cache to high-cost diffuser functions
…webui-reForge into reforge_upstream
Reforge upstream
Fix noise scale on EDM for Euler A2. Also add DC-sampler and SURE, a trajectory sampler. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This fix introduces some performance improvements around the SURE sampler. A proper preheat is added, along with variants of DPM++ 2M/2S a. I suggest you use DPM++ 2S a SURE: it is the one that best matches the paper without introducing weird artifacts. For SURE it's sure now.
SURE reimplemented after noticing that noise is added back later. We can now also use model metadata to infer what is right
This fix allows SURE to actually run under the sampling assumption. The problem is that sampling needs high sigma while SURE doesn't like it. We also injected noise wrongly, so it blew something up. Sure, this is a SURE fix ;)
Though mostly you should not change but... Yeah why not?
Replace the fixed sigma-scaled SGD step in _sure_correct_x0 with an optional Adam/AdamW optimiser whose state (m, v, t) persists across diffusion steps, mapping each denoising step to one optimizer iteration.
Adam normalises each pixel's gradient by its historical variance so that alpha becomes a true scale-invariant learning rate rather than a raw gradient magnitude knob. The sigma-scaling heuristic (alpha / (1 + sigma_t)) is therefore dropped when Adam is active: it is redundant and would double-suppress early steps that Adam already handles via its second moment.
AdamW adds decoupled weight decay applied directly to x0_hat, pulling the corrected estimate toward zero each step without contaminating the moment estimates.
Four new UI options are exposed under Settings:
- sure_adam_mode: none / adam / adamw (Radio)
- sure_adam_beta1: first-moment decay (Slider, default 0.9)
- sure_adam_beta2: second-moment decay (Slider, default 0.999)
- sure_adam_wd: AdamW weight decay (Slider, default 0.01)
All eight SURE samplers receive the new parameters via the existing signature-inspection path in sd_samplers_common.py.
Diagnostic logging now reports eff_grad_rms and adam_ratio (eff/raw) so the per-pixel adaptation can be verified at runtime. Empirically the adam_ratio and eff_grad_rms trajectories are nearly identical across a 5× range of alpha values, confirming that Adam has absorbed the scale sensitivity that previously required manual tuning.
Signed-off-by: PYCMS <zenghongyi2004@gmail.com>
Co-developed-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Claude Sonnet 4.6 <noreply@anthropic.com>
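To make the scale-invariance claim concrete, here is a minimal pure-Python sketch of one Adam/AdamW update whose state dict survives across calls. It is not the code in `_sure_correct_x0`: the function name `adam_correct`, the flat-list representation, and the `alpha * weight_decay * x` decay form are assumptions made for illustration.

```python
import math

def adam_correct(x0, grad, state, alpha=0.1, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=0.0):
    """One Adam/AdamW update on a flat list; `state` persists across calls.

    Hypothetical stand-in for the per-pixel correction, not the real function.
    """
    if not state:  # first diffusion step: allocate moments lazily
        state.update(m=[0.0] * len(x0), v=[0.0] * len(x0), t=0)
    state["t"] += 1
    t = state["t"]
    out = []
    for i, (x, g) in enumerate(zip(x0, grad)):
        m = beta1 * state["m"][i] + (1 - beta1) * g
        v = beta2 * state["v"][i] + (1 - beta2) * g * g
        state["m"][i], state["v"][i] = m, v
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        step = alpha * m_hat / (math.sqrt(v_hat) + eps)
        # AdamW-style decoupled decay (assumed form): acts on x, not on m or v
        out.append(x - step - alpha * weight_decay * x)
    return out

state_small, state_large = {}, {}
a = adam_correct([1.0], [0.5], state_small)
b = adam_correct([1.0], [500.0], state_large)
print(a, b)  # both steps have magnitude close to alpha despite a 1000x gradient gap
```

Because the first-step update is roughly `alpha * sign(g)`, the raw gradient magnitude drops out, which is the property the commit message attributes to Adam.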
Restructures SURE-AG (Attention-Guided SURE) from a bespoke DPM++2M sampler into a `post_cfg_function` guidance node matching the SAG/APG pattern; it works with any sampler, not just DPM++2M.
New files
- ldm_patched/contrib/nodes_sure_ag.py: SureAttentionGuidance node
- extensions-builtin/sd_forge_sure_ag/: Forge UI accordion script
Changed
- sure_attention.py: add public build_capture_model_options() API; fix entropy NaN from bf16 log (cast to float32, clamp ≥ 0); add NaN/Inf guard before returning entropy map
- sampling.py: replace sample_sure_attention body with NotImplementedError stub (API-compatible tombstone, clear migration message)
- sd_samplers_kdiffusion.py: remove 'SURE Attention' sampler entry
- sd_samplers_common.py: remove sure_attn_* param forwarding and metadata
- shared_options.py: remove sure_attn_group/weight/blocks settings
The guidance node runs one extra UNet forward pass per step (same cost as SAG), captures per-layer attention entropy, and applies the SURE-AG weighted correction:
x0_hat − α·(1 + w·U)·2·(x0_hat − model(x0_hat, σ))
Alpha is auto-clamped to < 1/(2·(1+w)) per Lean §3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
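The correction formula and its alpha bound can be sketched on plain lists as follows. The helper names `clamp_alpha` and `sure_ag_correct` and the `margin` parameter (used only to keep the inequality strict) are hypothetical; `denoised_again` stands in for `model(x0_hat, σ)` and `U` is assumed to lie in [0, 1].

```python
def clamp_alpha(alpha, w, margin=0.99):
    """Keep alpha strictly below 1 / (2 * (1 + w)); `margin` is an assumed knob."""
    bound = 1.0 / (2.0 * (1.0 + w))
    return min(alpha, margin * bound)

def sure_ag_correct(x0_hat, denoised_again, U, alpha, w):
    """x0_hat - a * (1 + w*U) * 2 * (x0_hat - model(x0_hat, sigma)), elementwise."""
    a = clamp_alpha(alpha, w)
    return [x - a * (1.0 + w * u) * 2.0 * (x - d)
            for x, d, u in zip(x0_hat, denoised_again, U)]

# With U <= 1 the total factor a * (1 + w*U) * 2 stays below 1,
# so the corrected value moves toward the re-denoised value without overshooting.
print(sure_ag_correct([1.0], [0.0], [1.0], alpha=1.0, w=1.0))
```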
…RU block loading
Task 1 — compile is now opt-in:
- Add --forge-diffusers-compile flag (cmd_args.py)
- Whole-model torch.compile in apply_model() gated on self._compile
- Per-block _install_compile_regions() in _setup_auto_offload() gated on self._compile
- Default run no longer stalls 1-3 min on first generation
Task 2 — LRU dynamic block loading (mirrors reForge default behaviour):
- New diff_pipeline/_lru_blocks.py: LRUBlockCache + estimate_capacity()
- Decouples weights (param.data redirect) from structure (Module tree stays put)
- GPU tensors allocated on load, freed on evict — no repeated malloc overhead
- CUDA transfer stream for async H→D copies; compute stream waits via Event
- Pinned CPU copies for fast DMA transfers
- Default path in apply_model() checks VRAM via _should_use_lru(); automatically
falls back to LRU block loading when UNet + 512 MiB headroom exceeds free VRAM
- Non-block children (conv_in, time_embedding, etc.) stay on device permanently
- _reset_lru_offload() wired into _sync_lora() so LoRA swaps rebuild the cache
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
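The LRU policy described in Task 2 can be illustrated with a small stdlib sketch. It is not the `LRUBlockCache` from `_lru_blocks.py`: the class name, the `load`/`evict` callbacks (stand-ins for moving a block to GPU or back to CPU), and the block-count capacity are assumptions.

```python
from collections import OrderedDict

class LRUBlockCacheSketch:
    """Keeps at most `capacity` blocks resident; evicts the least recently used."""

    def __init__(self, capacity, load, evict):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load = load      # hypothetical callback: move block to the GPU
        self.evict = evict    # hypothetical callback: move block back to the CPU
        self.resident = OrderedDict()

    def touch(self, name):
        """Call before a block's forward pass (e.g. from a pre-hook)."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return
        while len(self.resident) >= self.capacity:
            victim, _ = self.resident.popitem(last=False)  # oldest entry
            self.evict(victim)
        self.load(name)
        self.resident[name] = True

events = []
cache = LRUBlockCacheSketch(2,
                            load=lambda n: events.append(("load", n)),
                            evict=lambda n: events.append(("evict", n)))
for block in ["down0", "down1", "down0", "mid"]:
    cache.touch(block)
print(events)
```

In this run `down1` is evicted when `mid` is requested, because `down0` was touched more recently.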
…rectness
The original param.data storage-pointer redirect approach was producing a
device-mismatch error at runtime: GroupNorm weight remained on CPU despite
the redirect logic completing without error.
Root cause: module.to() is the only safe primitive for cross-device parameter
movement that reliably updates all internal PyTorch device bookkeeping. Direct
param.data = gpu_tensor storage reassignment is fragile across PyTorch versions
and diffusers' UNet subclass hierarchy.
Fix: use module.to(device) for load and module.to("cpu") for evict — the same
primitives the existing auto-offload pre/post hooks use. module.to() is
CPU-synchronous (non_blocking=False), so parameters are valid before the
pre-hook returns and the block's forward begins.
CUDA transfer stream is preserved: submitting the module.to() inside
torch.cuda.stream(xfer_stream) routes the DMA copies to a dedicated stream,
keeping the default compute stream free during the CPU stall. No CUDA event
handshake is needed because CPU synchrony already guarantees completion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_sync_lora() was calling _reset_lru_offload() whenever patches_uuid changed — even on the very first apply_model() call when the uuid transitions from None to a real value. This tore out the pre-hooks at step 5 before the UNet forward at step 6b, leaving all block parameters on CPU and causing a device-mismatch crash in GroupNorm. PEFT load_lora_adapter / delete_adapter only adds/removes adapter sub-layers inside the block modules; it never replaces the DownBlock2D / MidBlock2D / UpBlock2D objects that the hooks are registered on. The hooks remain valid across LoRA swaps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add _inference_headroom(x) helper in pipeline.py that uses reForge's minimum_inference_memory() (1 GiB) plus a resolution-scaled activation estimate (B × 320 × H × W × 2 bytes × 4) derived from the latent shape
- Update estimate_capacity() in _lru_blocks.py to accept x_shape and apply the same headroom formula instead of the hardcoded 512 MiB
- Pass x to both _should_use_lru() and _setup_lru_offload() at call sites so capacity estimation reflects the actual inference tensor dimensions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt OOM
- Use soft_empty_cache() (handles CUDA/XPU/MPS/NPU) before VRAM measurement in estimate_capacity() to return fragmented reserved memory to the allocator
- Subtract inactive_split_bytes.all.current from free-VRAM estimate in both estimate_capacity() and _should_use_lru(): fragmented reserved blocks cannot service large (100s-MB) block allocations even when the formula counts them
- Increase activation headroom factor 4→16 in _inference_headroom() and estimate_capacity(): SDXL skip-connection tensors from all encoder blocks are held live until consumed by the matching decoder block (~172 MB at 1280×1280), plus ~200 MB intermediate activations; factor=16 yields ~524 MB extra on top of the 1 GiB base from minimum_inference_memory()
- Log detailed VRAM snapshot (cuda_free/reserved/active/fragmented/effective) so capacity decisions are auditable in the debug log
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
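The headroom arithmetic from these two commits can be checked with a few lines of Python. The function name `inference_headroom_bytes` is hypothetical, and the worked numbers assume a 1280×1280 image maps to a 160×160 latent with a batch of 2 (conditional plus unconditional); neither assumption is stated in the commit messages.

```python
GIB = 1024 ** 3

def inference_headroom_bytes(batch, latent_h, latent_w, base_bytes=GIB,
                             channels=320, bytes_per_elem=2, factor=16):
    """Base reserve plus B x 320 x H x W x 2 bytes x factor, as in the commit text."""
    activation = batch * channels * latent_h * latent_w * bytes_per_elem * factor
    return base_bytes + activation

# Assumed case: batch of 2, 160x160 latent.
extra_new = inference_headroom_bytes(2, 160, 160, factor=16) - GIB
extra_old = inference_headroom_bytes(2, 160, 160, factor=4) - GIB
print(extra_new, extra_old)  # activation bytes with factor 16 and with factor 4
```

Under those assumptions the factor-16 term comes to 524,288,000 bytes, which agrees with the "~524 MB" quoted above, and the earlier factor of 4 gives a quarter of that.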
…VRAM GPUs
Adds a generalized VRAM pool allocator (VRAMAllocator) with RRIP-inspired generation-based eviction so that UNet, CLIP, and VAE blocks can be streamed on-demand from CPU to GPU without holding the full model in VRAM.
Key changes:
- diff_pipeline/vram_allocator.py: new pool allocator with LRU + 4-state generation counter (COLD/COOL/WARM/HOT), free-list for GC'd modules, TorchDispatchMode activation tracking, weakref lifecycle hooks, prepare_for_prefix for atomic prefix-scoped eviction, sync_device_state to repair flags after external free_memory() calls
- diff_pipeline/adapter.py: block-level registration for VAE (10 blocks) and CLIP (44 layers); _decode_needs_tiling() pre-check skips wasteful full decode for large latents and goes straight to tiled fallback
- diff_pipeline/pipeline.py: tracking_context() wraps UNet forward; pre-evicts CLIP before UNet sampling starts so VRAM is available from step 1; sync_device_state() after free_memory() keeps allocator coherent
- diff_pipeline/_lru_blocks.py: flush_to_cpu() for VRAM pressure hook; 1.15x CUDA alignment overhead in capacity estimate to prevent OOM
- ldm_patched/modules/model_management.py: VRAM pressure hook registry so VRAMAllocator can respond to free_memory() calls from other consumers
Validated: 6-step 1280×1280 SDXL generation on RTX 2080 Max-Q (8 GB). CLIP loads once per generation; VAE tiling decision is made pre-emptively; UNet blocks load and age across steps without unnecessary reloads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
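One way a 4-state generation counter could drive eviction is sketched below. This is a guess at the general idea only: the class `GenerationPool`, the choice to insert new blocks as COOL, the promote-by-one rule, and the "age everyone when nothing is COLD" step are all assumptions, not the behaviour of `VRAMAllocator`.

```python
COLD, COOL, WARM, HOT = range(4)

class GenerationPool:
    """Evicts COLD entries first; ages all entries when none is COLD."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.gen = {}  # block name -> generation, in insertion order

    def access(self, name):
        """Returns the list of block names evicted to make room."""
        evicted = []
        if name in self.gen:
            self.gen[name] = min(HOT, self.gen[name] + 1)  # promote on reuse
            return evicted
        while len(self.gen) >= self.capacity:
            evicted.append(self._evict_one())
        self.gen[name] = COOL  # assumed insertion generation
        return evicted

    def _evict_one(self):
        while True:
            for name, g in self.gen.items():
                if g == COLD:
                    del self.gen[name]
                    return name  # return immediately after mutating the dict
            for name in self.gen:
                self.gen[name] -= 1  # nothing COLD yet: age every entry

pool = GenerationPool(2)
pool.access("unet.down0")
pool.access("unet.down1")
pool.access("unet.down0")       # reused, so promoted to WARM
print(pool.access("unet.mid"))  # the block that was not reused is evicted
```

Compared with plain LRU, a block that is reused often survives several rounds of ageing before it becomes an eviction candidate.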
…-diffusers-lru-headroom
The base activation headroom reserved by the LRU allocator was hardcoded to reForge's minimum_inference_memory() (~1 GiB). Users with low fragmentation can now pass --forge-diffusers-lru-headroom <MB> to reduce it, allowing more UNet blocks to stay resident between steps at the cost of a smaller OOM safety margin. The resolution-scaled skip-connection estimate is always added on top of the specified base, so the flag only controls the floor, not the total.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…F841)
- adapter.py: split `import gc, logging as _log` onto two lines (×2)
- vram_allocator.py: remove unused `size_mb` in `_do_evict`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n prediction
Integrates GPU temperature throttling as a built-in reForge module instead of requiring the external extension.
New module: modules/gpu_temperature.py
- Sensor backends: nvidia-smi (Win+Linux), ROCm-smi (Linux), OpenHardwareMonitor (Win)
- 1-D Kalman filter tracking temperature + rate of change (pure Python, no numpy)
- Predictive throttling: pauses before threshold using projected temp at configurable horizon
- Thermal plateau detection: if GPU hits max passive heat exchange and temp won't drop, triggers warn_and_continue or abort_generation after configurable timeout
- Lazy initialization + reinitialize() when sensor setting changes
Wired into modules/sd_samplers_common.store_latent (checked at each sampler step).
Settings added to modules/shared_options.py under "GPU Temperature Protection":
- Enable/disable, sensor choice, device index, OHM GPU name filter
- Sleep/wake thresholds, poll interval, max pause duration
- Kalman: enable, q_pos, q_vel, R, prediction horizon
- Plateau: timeout, action (warn_and_continue | abort_generation)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
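A pure-Python Kalman filter that tracks temperature and its rate of change, then projects ahead by a horizon, might look like the sketch below. It uses a standard constant-velocity model; the class name `TempKalman`, the default noise values, and the `should_pause` rule are assumptions and do not come from `modules/gpu_temperature.py`.

```python
class TempKalman:
    """Constant-velocity Kalman filter over state [temperature, rate]."""

    def __init__(self, temp0, q_pos=0.01, q_vel=0.01, r=1.0):
        self.temp, self.rate = temp0, 0.0
        self.p = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q_pos, self.q_vel, self.r = q_pos, q_vel, r

    def step(self, z, dt):
        """Fold in one sensor reading `z` taken `dt` seconds after the last one."""
        (p00, p01), (p10, p11) = self.p
        # Predict: temperature advances by rate * dt; covariance P = F P F^T + Q
        temp = self.temp + self.rate * dt
        p00, p01, p10, p11 = (
            p00 + dt * (p10 + p01) + dt * dt * p11 + self.q_pos,
            p01 + dt * p11,
            p10 + dt * p11,
            p11 + self.q_vel,
        )
        # Update: only temperature is measured (H = [1, 0])
        s = p00 + self.r
        k0, k1 = p00 / s, p10 / s
        y = z - temp
        self.temp = temp + k0 * y
        self.rate += k1 * y
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]

    def projected(self, horizon):
        """Temperature expected `horizon` seconds ahead at the current rate."""
        return self.temp + self.rate * horizon

    def should_pause(self, threshold, horizon):
        return self.projected(horizon) >= threshold

kf = TempKalman(60.0)
for t in range(1, 101):
    kf.step(60.0 + 0.5 * t, dt=1.0)  # synthetic ramp of 0.5 degrees per second
print(round(kf.rate, 2), kf.should_pause(threshold=120.0, horizon=30.0))
```

Projecting with the filtered rate is what lets the throttle act before the raw reading crosses the threshold.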
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(gpu-temp): first-party GPU temperature protection with Kalman prediction. So we finally have good protection on long runs
…by attention entropy
Combines location-aware (SURE-AG attention entropy) and frequency-aware
(wavelet SURE) guidance into a single correction pass:
- Wavelet-decomposes the SURE residual into approximation + detail subbands.
- Projects the attention entropy map U to each subband's spatial scale via
bilinear resampling → per-subband spatial weight W_k = 1 + w·Ũ_k.
- Corrects each subband independently: x0_k − α·approx_coeff·W_k·r_k.
- Optional FFT low-pass scale on the approximation subband gradient
(FreeU-style: boosts structure, dampens high-freq noise in correction).
- Graceful fallback: pywt missing → pixel-space SURE-AG; attn_weight=0 →
wavelet SURE without attention weighting; fft_scale=1.0 → no FFT.
New files:
ldm_patched/k_diffusion/sure_wav_ag.py — core correction function
ldm_patched/contrib/nodes_sure_wav_ag.py — post-CFG guidance node
extensions-builtin/sd_forge_sure_wav_ag/ — Forge UI extension
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four alpha modes for the post-CFG guidance node:
- fixed: static alpha clamped by Lean §3 bound (original behaviour)
- sigma: _sure_effective_alpha sigma scaling — EDM/Karras aware
- analytical: α* = ⟨r,g⟩/‖g‖² closed-form via Parseval dot products in
wavelet space (zero extra UNet calls; non-trivial when
attn_weight>0 makes g non-collinear with r)
- bo: Bayesian optimisation (optuna) with cross-step warm-start
Cross-step state (_step_state) is captured in the post_cfg closure and
auto-resets when σ_t jumps > 1.5× (new generation / highres-fix restart).
Debug prints: every step emits [SURE-AGWAV-DBG] lines covering sigma,
entropy_rms, per-subband rdg/gsq, Parseval totals, alpha_eff, delta_rms.
Controlled by _DEBUG_PRINT = True in sure_wav_ag.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
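The `analytical` mode's closed form and the cross-step reset rule can be sketched on plain lists. Here `r` and `g` stand for the flattened residual and gradient, `eps` is an assumed guard against division by zero, and `StepState` is a hypothetical stand-in for `_step_state`.

```python
def analytical_alpha(r, g, eps=1e-12):
    """alpha* = <r, g> / ||g||^2, the minimiser of ||r - alpha * g||^2."""
    rg = sum(a * b for a, b in zip(r, g))
    gg = sum(b * b for b in g)
    return rg / (gg + eps)

class StepState:
    """Carries values across steps; resets when sigma jumps by more than 1.5x."""

    def __init__(self):
        self.prev_sigma = None
        self.memory = {}

    def observe(self, sigma):
        """Returns True if the state was reset on this call."""
        reset = self.prev_sigma is not None and sigma > 1.5 * self.prev_sigma
        if reset:
            self.memory.clear()  # new generation or highres-fix restart
        self.prev_sigma = sigma
        return reset

print(analytical_alpha([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # g = 2r, so alpha* is about 0.5
```

When `g` is exactly collinear with `r` the result is just the inverse scale factor, which is why the commit calls the mode non-trivial only once `attn_weight > 0` breaks that collinearity.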
PassthroughAttnProcessor silently dropped all patches_replace["attn1"]
hooks, causing SURE-AGWAV entropy capture to yield attn_layers=0 and
U=None on every step when diff_pipeline=True.
Replace PassthroughAttnProcessor with ForgeAttnSelfProcessor, which:
- Burns in (block_name, ldm_idx, t_idx) at construction (same as
ForgeAttnProcessor for attn2)
- Checks patches_replace["attn1"] for (block, idx, t) or (block, idx)
- If hook found: computes Q/K/V explicitly in flat (B,N,heads*dim)
format and calls hook(q,k,v,extra_options,mask) — compatible with
attention_basic_with_sim used by _make_entropy_hook
- If no hook: delegates to AttnProcessor2_0 (identical to old behaviour)
- Adds [SURE-AGWAV-DBG] print when hook fires so attn_layers > 0
PassthroughAttnProcessor kept as alias for external compatibility.
_install_attn_processors updated to pass (block_name, ldm_idx, t_idx).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ization
Per-image min-max in _aggregate_entropy_map collapses to zero whenever attention is spatially uniform, which happens every step in SDXL middle block at high noise (entropy ≈ log(1600) ≈ 7.38 everywhere, U_max−U_min ≈ 0 → divide by ε → all zeros → entropy_rms=0.0000 every step).
Fix: normalize each layer's raw entropy by log(N_k) (theoretical maximum). This gives absolute uncertainty in [0,1]:
- High noise: entropy ≈ log(N) → U ≈ 1 everywhere → W ≈ 1+w uniform
- Low noise: entropy varies spatially → targeted correction
Also expand entropy debug print to show min/max/mean for next test run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
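The effect of dividing by log(N) can be seen on a single attention row. This is a minimal sketch with a hypothetical helper name, operating on a Python list rather than on the tensors used by `_aggregate_entropy_map`.

```python
import math

def normalized_entropy(p, eps=1e-12):
    """Shannon entropy of one attention row, scaled by its maximum log(N)."""
    n = len(p)
    if n < 2:
        raise ValueError("need at least two keys for a meaningful entropy")
    h = -sum(pi * math.log(pi + eps) for pi in p)
    return max(0.0, h) / math.log(n)  # clamp after negating, then normalise

uniform = [1.0 / 1600] * 1600    # spatially uniform attention at high noise
peaked = [1.0] + [0.0] * 1599    # one-hot attention
print(normalized_entropy(uniform), normalized_entropy(peaked))
```

The uniform row lands near 1 and the one-hot row near 0, so a layer with uniform attention no longer collapses to zero the way per-image min-max did.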
…ff_pipeline
In attention_basic_with_sim the else-branch computes similarity in the input dtype (fp16 on SDXL diff_pipeline path). At seq=1600, dim_head=64 the fp16 dot-products overflow to inf → softmax([inf,0,…]) = [1,0,…] → one-hot attention → entropy = 0 every step. The ldm_patched path avoids this because it sets attn_precision=torch.float32 via transformer_options; the diff_pipeline path leaves it None.
Fix: force attn_precision=torch.float32 unconditionally in _make_entropy_hook. This only controls the sim computation; the returned out is still cast back to v.dtype (fp16), so the correction pass is unaffected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…2 sim
torch.autocast (fp16/bf16) downcasts fp32 inputs before eligible compute ops (einsum, matmul) even after explicit .float() conversion. At seq=1600 (SDXL middle block) fp16 dot-products overflow to inf, softmax collapses to one-hot, entropy=0 every step. The earlier attn_precision=torch.float32 fix only changed the branch selection inside attention_basic_with_sim but autocast still ran the einsum in fp16.
Fix: wrap the attention_basic_with_sim call in torch.autocast(device_type=device_type, enabled=False) to guarantee true fp32 computation regardless of outer autocast state.
Also: return out in original dtype so the model stays in fp16/bf16. Add a one-shot diagnostic print of sim row-max and raw entropy stats.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sim_row_max mean=0.0085 (near-uniform) but ent=0 is paradoxical. Add dissection prints: sim dtype/shape, row sums, term=p*log(p+e) statistics, raw_neg before clamp, and a torch.special.entr cross-check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The clamp was applied before the negation, not after:
Wrong: -(sum(p*log(p+e),-1)).clamp(0)
sum ≈ -6.7 → clamp → 0 → negate → 0
Right: (-(sum(p*log(p+e),-1))).clamp(0)
sum ≈ -6.7 → negate → +6.7 → clamp → +6.7
This single missing pair of parentheses caused entropy to be exactly 0
across all steps despite correct fp32 sim values (mean entropy 6.7 nats
confirmed by diagnostic and cross-checked with torch.special.entr).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
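The precedence issue is that a method call binds tighter than unary minus, so `-x.clamp(0)` means `-(x.clamp(0))`. The sketch below reproduces it without torch, using a tiny hypothetical `Val` class that stands in for a tensor with a `clamp` method.

```python
class Val(float):
    """Minimal stand-in for a tensor: supports negation and clamp(min)."""

    def __neg__(self):
        return Val(-float(self))

    def clamp(self, lo):
        return Val(max(float(self), lo))

s = Val(-6.7)            # sum(p * log(p + e)) is negative for a spread-out row

wrong = -(s).clamp(0)    # parsed as -((s).clamp(0)): clamp to 0 first, then negate
right = (-(s)).clamp(0)  # negate to +6.7 first, then clamp

print(float(wrong), float(right))
```

The extra parentheses around the negation are the whole fix: they force the clamp to see the positive entropy instead of the negative sum.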
Strip all diagnostic prints added during the entropy=0 and shape-mismatch investigations — EDIAG, SHAPE-DIAG, ForgeAttnSelfProcessor, and per-step [SURE-AGWAV-DBG] lines. Set _DEBUG_PRINT=False in sure_wav_ag.py. Ongoing diagnostics are now via _logger.debug/info only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What changed
All 8 non-wavelet SURE samplers (`sample_sure`, `sample_dpmpp_2s_a_sure`, `sample_dpmpp_2s_a_sure_adaptive`, `sample_dpmpp_2m_sure`, `sample_dpmpp_2m_sde_sure`, `sample_dpmpp_3m_sde_sure`, `sample_dpmpp_2m_sde_sure_adaptive`, `sample_sure_adaptive`) were initialising `_adam_state` with a check that allocated a real dict when `sure_adam_mode='bo'` and passed it as a non-None `adam_state` to `_sure_correct_x0`.
Why it was wrong
`_sure_correct_x0` only activates Adam when `adam_mode in ('adam', 'adamw')`. When `adam_mode='bo'`, the Adam block is skipped and the function falls through to plain SGD; the non-None dict is silently ignored. BO mode in the non-wavelet path therefore ran as plain SGD, wasting an allocation each call and masking the mismatch.
The wavelet sampler `sample_sure_wavelet` already had the correct check (`sure_adam_mode in ('adam', 'adamw')`), so `_adam_state=None` for BO mode.
Fix
Changed all 8 non-wavelet samplers to match the wavelet pattern, allocating `_adam_state` only when `sure_adam_mode in ('adam', 'adamw')`.
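The original before/after snippets did not survive on this page, so the following is only a sketch of the pattern the description implies. The helper names are hypothetical, and the "old" condition (`!= 'none'`) is an assumption inferred from the described behaviour, not a quote from the diff.

```python
ADAM_MODES = ("adam", "adamw")

def init_adam_state_old(sure_adam_mode):
    """ASSUMED previous behaviour: any mode other than 'none' gets a dict."""
    return {} if sure_adam_mode != "none" else None

def init_adam_state(sure_adam_mode):
    """Pattern described in the fix: allocate only for modes that use Adam."""
    return {} if sure_adam_mode in ADAM_MODES else None

for mode in ("none", "adam", "adamw", "bo"):
    print(mode, init_adam_state_old(mode), init_adam_state(mode))
```

Under that assumption the two helpers differ only for `'bo'`, which is the one mode the test plan checks for a `None` state.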
Test plan
- `sure_adam_mode='bo'`: confirm no runtime errors
- `sure_adam_mode='adam'` and `'adamw'` still initialise `_adam_state` correctly (non-None)
- `sure_adam_mode='none'` keeps `_adam_state=None`
🤖 Generated with Claude Code