diff --git a/README.md b/README.md index dfdca25..0877c92 100644 --- a/README.md +++ b/README.md @@ -108,30 +108,30 @@ The `refactor!` + `\` loop is allocation-free apart from the tiny `O(r)` dense p ## Benchmarks -A realistic BVP/PDE-style system: a banded sparse interior (`bw = 2`) plus `r = 4` dense -boundary rows. Factoring `S` with PureKLU and applying the rank-`r` Woodbury correction -avoids both the `O(n³)` dense cost and the fill-in that the dense rows inflict on a -general sparse `LU`. The cached solve and the symbolic-reuse `refactor!` (the Newton path) -are sub-millisecond. (`SWDRC` below = this package.) +A banded sparse interior plus `r = 8` dense rows (`n = 5000`). Factoring only `S` and +applying the rank-`r` Woodbury correction never assembles the dense rows into the +factorization, so it avoids the `O(n³)` dense cost and the fill those rows inflict on a +direct sparse `LU`. The honest baseline is **KLU** (PureKLU): its BTF + AMD ordering +*isolates* a few dense rows, so it handles the assembled matrix well — UMFPACK's COLAMD does +**not** (it fills), and dense `LU` is `O(n³)`. ``` -n = 5000, r = 4 (nnz(S) = 24994, solve rel err vs dense ≈ 1e-12) - SWDRC factorize + solve : 4.85 ms - dense lu + solve : 807.67 ms (166× slower) - SuiteSparse lu on sparse(A) : 24.71 ms (5× slower; dense rows fill in) - SWDRC ldiv! (cached) : 0.27 ms - SWDRC refactor! (Newton) : 0.68 ms (7× faster than re-factorizing) - -n = 10000, r = 4 - SWDRC factorize + solve : 9.45 ms - dense lu + solve : 3160.7 ms (334× slower) - SuiteSparse lu on sparse(A) : 79.69 ms (8× slower) - SWDRC refactor! (Newton) : 1.36 ms +n = 5000, r = 8 (factor + solve, solve rel err vs dense ≈ 1e-14) + SWDRC (Woodbury) : 2.9 ms ← this package + KLU on the assembled S + U·V : 4.9 ms (1.7× slower; the right sparse baseline) + UMFPACK lu on the assembled : 345 ms (~120×; COLAMD fills on the dense rows) + dense lu : 820 ms (~280×; O(n³)) + + fixed S, changing low-rank correction — reuse, per solve: + SWDRC update_lowrank! : 1.0 ms ← reuses S's factorization + KLU klu! (refactor assembled) : 3.0 ms (2.9× slower) ``` -(`julia -O2`, single thread; `@elapsed` best-of-30. The advantage holds whenever `S` has -good sparse structure — banded, local, or low-fill; for a globally-scattered `S` with no -good ordering, any sparse `LU` degrades and dense `lu` may win.) +(`julia -O2`, single thread, `@elapsed` best-of. The win over **KLU** is modest at small +`r` (~1.5–3×) and grows with the border width `r` — KLU fills more as you add dense rows, +while the Woodbury correction stays `r×r`. The win over UMFPACK/dense is large because both +fill or scale as `O(n³)`. `S` must have good sparse structure for any of this to help; a +globally-scattered `S` degrades every sparse method.) ## What is fast @@ -176,7 +176,7 @@ symbolic analysis** (no re-analysis): F = lu(A) # analyze (cache symbolic) + numeric factor x1 = F \ b1; x2 = F \ b2 # reuse for multiple solves for step in 1:maxiters - lu!(F, A_new) # new values, same pattern → reuses cached symbolic (≈7× faster than re-`lu`) + lu!(F, A_new) # new values, same pattern → reuses cached symbolic analysis (no re-analysis) Δ = F \ residual end ``` @@ -189,9 +189,9 @@ once + `klu!`/`solve!` per step. The structure pays off most in the classic Sherman–Morrison–Woodbury setting: a **fixed** sparse base `S`, factored once, solved repeatedly under a **changing low-rank correction** `U·V` — varying boundary conditions / coupling, adding or removing a constraint, sensitivity -or parameter sweeps, KKT / saddle-point systems with changing borders. Here the dense rows -or columns are exactly what make a direct sparse `LU` of the assembled `S + U·V` fill in -catastrophically; Woodbury sidesteps them by keeping only `S` factored. +or parameter sweeps, KKT / saddle-point systems with changing borders. Reusing `S`'s +factorization across the updates is the win: re-factoring the whole assembled `S + U·V` +each step redoes work that doesn't change. [`update_lowrank!`](@ref) updates the low-rank factors while **reusing `S`'s cached factorization** (no re-factorization of `S`, and `Z = S⁻¹U` is recomputed only if `U` @@ -205,16 +205,18 @@ for step in 1:maxiters end ``` -Measured (`n = 20000`, `r = 8` dense rows, 25 solves, `S` fixed): +Measured (`n = 5000`, `r = 8`, `S` fixed, per solve): ``` - Woodbury reuse via update_lowrank! : ~2.7 ms / solve - refactor! (re-factors S each step) : ~7.3 ms / solve (wastes the S re-factorization) - re-KLU the assembled S + U·V : ~1740 ms / solve (~650× slower — dense rows fill) + update_lowrank! (reuse S's factorization) : 1.0 ms / solve + refactor! (re-factors S each step) : ~3 ms / solve (wastes the S re-factorization) + KLU klu! on the refactored assembled S + U·V : 3.0 ms / solve (~3× slower; grows with r) ``` -Use `update_lowrank!` when only the low-rank part changes, `refactor!` when `S`'s values -change (Newton), and `factorize` to (re)analyze a new pattern. +The advantage over KLU is modest at small `r` and grows with the border width — KLU must +refactor the dense rows every step, while `update_lowrank!` touches only the `r×r` +correction. Use `update_lowrank!` when only the low-rank part changes, `refactor!` when +`S`'s values change (Newton), and `factorize` to (re)analyze a new pattern. ## LinearSolve.jl integration