fix: clamp H and W to epsilon floor after each LS-NMF multiplicative update (Rust)#38
fix: clamp H and W to epsilon floor after each LS-NMF multiplicative update (Rust)#38collord wants to merge 2 commits into
Conversation
…update
The default `column_mean` initialisation clips ~50 % of W and H entries to
1e-12 (negative draws from the normal distribution). These near-zero entries
caused the denominator of the multiplicative update
H ← H * (W.T @ (We·V)) / (W.T @ (We·W@H))
to underflow in f32, producing a ratio >> 1 on the first several steps. The
Rust implementation accumulated this error, converged to a spurious local
minimum (Q ≈ 1e5), and reported `converged=True` on garbage factors because
the Q-window eventually stabilised at the wrong value. The Python path avoided
the trap due to differences in floating-point evaluation order with numpy.
Fix: after each H and W multiplicative update, clamp every element to
[1e-12, ∞), matching the epsilon floor that `SA.initialize()` already applies
to the initial matrices. The same clamp is added to the GPU path for
consistency.
Adds a regression test (`tests/model/test_ls_nmf_rust_stability.py`) with
three assertions:
- Rust Q is finite (not NaN/Inf) — was already passing
- Rust Q is within 20 % of Python Q under column_mean init — was failing
- Rust Q is within 20 % of Python Q under kmeans init — was failing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical numerical instability bug in the Rust LS-NMF implementation where near-zero matrix entries caused multiplicative update ratios to overflow, leading to spurious convergence at a poor local minimum (Q ≈ 1e5 instead of the expected Q ≈ 5).
Changes:
- Added epsilon clamping (1e-12 floor) after each H and W multiplicative update in both CPU and GPU Rust paths
- Added comprehensive regression test to verify Rust and Python implementations produce comparable results
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
tests/model/test_ls_nmf_rust_stability.py |
New regression test with three test cases verifying numerical stability under column_mean and kmeans initialization |
rust/lib.rs |
Added post-update clamping to prevent NaN/Inf from near-zero denominators in both ls_nmf_update_cpu and ls_nmf_update_gpu |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| w = w.mul(&w_delta)?; | ||
| // Same clamp for W. | ||
| let eps_w = Tensor::full(1e-12f64, w.shape(), w.device())?; | ||
| w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?; |
There was a problem hiding this comment.
Similar to the H update above, the W clamping in the GPU path may not handle NaN values consistently with the CPU implementation. The CPU version explicitly replaces non-finite values with epsilon, while the GPU version relies on Tensor::clamp behavior which may propagate NaN. Consider adding explicit NaN handling for consistency.
| w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?; | |
| w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?; | |
| // Explicitly sanitize non-finite values in W to match CPU behavior. | |
| // Any NaN or infinite entries are replaced with epsilon to avoid propagation. | |
| w = w.nan_to_num(1e-12f64)?; |
| let eps_h = Tensor::full(1e-12f64, h.shape(), h.device())?; | ||
| h = h.clamp(&eps_h, &Tensor::full(f64::MAX, h.shape(), h.device())?)?; |
There was a problem hiding this comment.
The GPU implementation creates a new epsilon tensor for every H update in every iteration, which is inefficient. This tensor creation should be moved outside the loop to avoid the overhead of allocating the same tensor repeatedly during training (which can run for 20,000 iterations). The same epsilon tensor can be reused across all iterations.
| let eps_w = Tensor::full(1e-12f64, w.shape(), w.device())?; | ||
| w = w.clamp(&eps_w, &Tensor::full(f64::MAX, w.shape(), w.device())?)?; |
There was a problem hiding this comment.
The GPU implementation creates a new epsilon tensor for every W update in every iteration. This should be moved outside the loop and reused, similar to the recommendation for the H epsilon tensor. Creating f64::MAX tensors in every iteration is particularly inefficient.
| // Clamp: guard against NaN/Inf from near-zero denominators. | ||
| let eps_h = Tensor::full(1e-12f64, h.shape(), h.device())?; | ||
| h = h.clamp(&eps_h, &Tensor::full(f64::MAX, h.shape(), h.device())?)?; |
There was a problem hiding this comment.
The CPU and GPU implementations handle non-finite values differently. The CPU version explicitly checks for non-finite values (NaN or Inf) using !x.is_finite() and replaces them with the epsilon floor. However, the GPU version uses Tensor::clamp which may not handle NaN values the same way - depending on the clamp implementation, NaN values might propagate through rather than being replaced with epsilon. Consider using an explicit NaN check for the GPU path to ensure consistent behavior, or verify that Tensor::clamp handles NaN appropriately.
Address Copilot review feedback on quanted#38: - Move eps_h, eps_w, max_val tensor allocations before the loop to avoid 4 × max_iter GPU allocations per training run. - Remove fill_nan calls — Tensor::fill_nan is not a standard candle-core method. NaN cannot arise at this point anyway: the epsilon floor on both W and H from the previous iteration guarantees strictly positive denominators in the multiplicative update. Add comment explaining this reasoning. - CPU path is unchanged (already uses nalgebra .map() with an explicit is_finite() check per element). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
The Rust
ls_nmfimplementation diverges to a spurious local minimum(Q ≈ 1e5) and then falsely reports
converged=Truewhen the model isinitialised with the default
column_meanmethod. The Python path isunaffected.
Root cause
SA.initialize()clips any negative draws from the random normalinitialisation to
1e-12, leaving ~50 % of W and H entries at thatfloor. With these near-zero entries, the denominator of the
multiplicative update
underflows in f32 on the first few steps, producing ratios >> 1 that
push H and W to a large-but-finite local minimum. Because the Q window
eventually stabilises there, the convergence check fires on garbage
factors.
The Python fallback (
ls_nmf.py) happens to be numerically stableenough to self-correct from the same starting point, so the mismatch
only surfaces when comparing the two paths directly.
Minimal reproducer
Before fix:
Fix
Clamp every element of H and W to
[1e-12, ∞)after each multiplicativeupdate step (
ls_nmf_update_cpuandls_nmf_update_gpu), mirroring theepsilon floor that
SA.initialize()already applies to the initialmatrices.
Test
tests/model/test_ls_nmf_rust_stability.py— three assertions:column_meaninitkmeansinitThe Rust extension will need to be recompiled (
maturin developorpip install -e .) before running the test suite.🤖 Generated with Claude Code