Skip to content

fix: clamp H and W to epsilon floor after each LS-NMF multiplicative update (Rust)#38

Open
collord wants to merge 2 commits into
quanted:mainfrom
collord:fix/ls-nmf-epsilon-clamp
Open

fix: clamp H and W to epsilon floor after each LS-NMF multiplicative update (Rust)#38
collord wants to merge 2 commits into
quanted:mainfrom
collord:fix/ls-nmf-epsilon-clamp

Conversation

@collord

@collord collord commented Feb 23, 2026

Copy link
Copy Markdown

Summary

The Rust ls_nmf implementation diverges to a spurious local minimum
(Q ≈ 1e5) and then falsely reports converged=True when the model is
initialised with the default column_mean method. The Python path is
unaffected.

Root cause

SA.initialize() clips any negative draws from the random normal
initialisation to 1e-12, leaving ~50 % of W and H entries at that
floor. With these near-zero entries, the denominator of the
multiplicative update

H ← H * (W.T @ (We·V)) / (W.T @ (We·W@H))

underflows in f32 on the first few steps, producing ratios >> 1 that
push H and W to a large-but-finite local minimum. Because the Q window
eventually stabilises there, the convergence check fires on garbage
factors.

The Python fallback (ls_nmf.py) happens to be numerically stable
enough to self-correct from the same starting point, so the mismatch
only surfaces when comparing the two paths directly.

Minimal reproducer

import numpy as np
from esat.model.sa import SA
from esat.metrics import q_loss

rng = np.random.default_rng(0)
m, n, p = 50, 12, 3
F = rng.uniform(0.1, 1.0, size=(p, n))
G = rng.uniform(0.1, 1.0, size=(m, p))
V = G @ F
U = 0.001 + 0.05 * np.abs(V)

for label, use_rust in [("Rust  ", True), ("Python", False)]:
    sa = SA(V=V, U=U, factors=p, method="ls-nmf", seed=42, parallel=False, verbose=False)
    sa.optimized = use_rust
    sa.train(max_iter=20000, converge_delta=0.01, converge_n=100)
    W, H = sa.W.astype(float), sa.H.astype(float)
    print(f"{label}: Q={q_loss(V,U,W,H):.2f}  converged={sa.converged}")

Before fix:

Rust  : Q=122014.37  converged=True   ← wrong
Python: Q=5.10       converged=True

Fix

Clamp every element of H and W to [1e-12, ∞) after each multiplicative
update step (ls_nmf_update_cpu and ls_nmf_update_gpu), mirroring the
epsilon floor that SA.initialize() already applies to the initial
matrices.

Test

tests/model/test_ls_nmf_rust_stability.py — three assertions:

Test Pre-fix Expected post-fix
Rust Q is finite ✅ pass ✅ pass
Rust Q ≈ Python Q under column_mean init ❌ fail (ratio ≈ 24 000×) ✅ pass
Rust Q ≈ Python Q under kmeans init ❌ fail (ratio ≈ 4 000 000×) ✅ pass

The Rust extension will need to be recompiled (maturin develop or
pip install -e .) before running the test suite.

🤖 Generated with Claude Code

…update

The default `column_mean` initialisation clips ~50 % of W and H entries to
1e-12 (negative draws from the normal distribution).  These near-zero entries
caused the denominator of the multiplicative update

    H ← H * (W.T @ (We·V)) / (W.T @ (We·W@H))

to underflow in f32, producing a ratio >> 1 on the first several steps.  The
Rust implementation accumulated this error, converged to a spurious local
minimum (Q ≈ 1e5), and reported `converged=True` on garbage factors because
the Q-window eventually stabilised at the wrong value.  The Python path avoided
the trap due to differences in floating-point evaluation order with numpy.

Fix: after each H and W multiplicative update, clamp every element to
[1e-12, ∞), matching the epsilon floor that `SA.initialize()` already applies
to the initial matrices.  The same clamp is added to the GPU path for
consistency.

Adds a regression test (`tests/model/test_ls_nmf_rust_stability.py`) with
three assertions:
  - Rust Q is finite (not NaN/Inf) — was already passing
  - Rust Q is within 20 % of Python Q under column_mean init — was failing
  - Rust Q is within 20 % of Python Q under kmeans init  — was failing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 23, 2026 17:38

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical numerical instability bug in the Rust LS-NMF implementation where near-zero matrix entries caused multiplicative update ratios to overflow, leading to spurious convergence at a poor local minimum (Q ≈ 1e5 instead of the expected Q ≈ 5).

Changes:

  • Added epsilon clamping (1e-12 floor) after each H and W multiplicative update in both CPU and GPU Rust paths
  • Added comprehensive regression test to verify Rust and Python implementations produce comparable results

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
tests/model/test_ls_nmf_rust_stability.py New regression test with three test cases verifying numerical stability under column_mean and kmeans initialization
rust/lib.rs Added post-update clamping to prevent NaN/Inf from near-zero denominators in both ls_nmf_update_cpu and ls_nmf_update_gpu

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rust/lib.rs Outdated
w = w.mul(&w_delta)?;
// Same clamp for W.
let eps_w = Tensor::full(1e-12f64, w.shape(), w.device())?;
w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?;

Copilot AI Feb 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the H update above, the W clamping in the GPU path may not handle NaN values consistently with the CPU implementation. The CPU version explicitly replaces non-finite values with epsilon, while the GPU version relies on Tensor::clamp behavior which may propagate NaN. Consider adding explicit NaN handling for consistency.

Suggested change
w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?;
w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?;
// Explicitly sanitize non-finite values in W to match CPU behavior.
// Any NaN or infinite entries are replaced with epsilon to avoid propagation.
w = w.nan_to_num(1e-12f64)?;

Copilot uses AI. Check for mistakes.
Comment thread rust/lib.rs Outdated
Comment on lines +469 to +470
let eps_h = Tensor::full(1e-12f64, h.shape(), h.device())?;
h = h.clamp(&eps_h, &Tensor::full(f64::MAX, h.shape(), h.device())?)?;

Copilot AI Feb 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GPU implementation creates a new epsilon tensor for every H update in every iteration, which is inefficient. This tensor creation should be moved outside the loop to avoid the overhead of allocating the same tensor repeatedly during training (which can run for 20,000 iterations). The same epsilon tensor can be reused across all iterations.

Copilot uses AI. Check for mistakes.
Comment thread rust/lib.rs Outdated
Comment on lines +480 to +481
let eps_w = Tensor::full(1e-12f64, w.shape(), w.device())?;
w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?;

Copilot AI Feb 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GPU implementation creates a new epsilon tensor for every W update in every iteration. This should be moved outside the loop and reused, similar to the recommendation for the H epsilon tensor. Creating f64::MAX tensors in every iteration is particularly inefficient.

Copilot uses AI. Check for mistakes.
Comment thread rust/lib.rs Outdated
Comment on lines +468 to +470
// Clamp: guard against NaN/Inf from near-zero denominators.
let eps_h = Tensor::full(1e-12f64, h.shape(), h.device())?;
h = h.clamp(&eps_h, &Tensor::full(f64::MAX, h.shape(), h.device())?)?;

Copilot AI Feb 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CPU and GPU implementations handle non-finite values differently. The CPU version explicitly checks for non-finite values (NaN or Inf) using !x.is_finite() and replaces them with the epsilon floor. However, the GPU version uses Tensor::clamp which may not handle NaN values the same way - depending on the clamp implementation, NaN values might propagate through rather than being replaced with epsilon. Consider using an explicit NaN check for the GPU path to ensure consistent behavior, or verify that Tensor::clamp handles NaN appropriately.

Copilot uses AI. Check for mistakes.
Address Copilot review feedback on quanted#38:

- Move eps_h, eps_w, max_val tensor allocations before the loop to
  avoid 4 × max_iter GPU allocations per training run.
- Remove fill_nan calls — Tensor::fill_nan is not a standard
  candle-core method.  NaN cannot arise at this point anyway: the
  epsilon floor on both W and H from the previous iteration guarantees
  strictly positive denominators in the multiplicative update.  Add
  comment explaining this reasoning.
- CPU path is unchanged (already uses nalgebra .map() with an explicit
  is_finite() check per element).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants