
Commit 1605fb3

.rst to .md for docs and DRY for readme/docs
1 parent 36d2df8 commit 1605fb3

23 files changed

Lines changed: 1124 additions & 1195 deletions

.github/workflows/python-publish.yml

Lines changed: 30 additions & 0 deletions

@@ -6,8 +6,38 @@ on:
   workflow_dispatch:
 
 jobs:
+  sync-citation:
+    name: Sync citation.cff with package version
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          token: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Sync citation
+        uses: gojiplus/citation-sync@main
+        with:
+          citation_path: CITATION.cff
+          package_path: pyproject.toml
+
+      - name: Commit and push if changed
+        run: |
+          git config --local user.email "action@github.com"
+          git config --local user.name "GitHub Action"
+          git add CITATION.cff
+          if git diff --staged --quiet; then
+            echo "No changes to commit"
+          else
+            git commit -m "Update citation.cff version"
+            git push
+          fi
+
   build:
     name: Build distribution
+    needs:
+      - sync-citation
     runs-on: ubuntu-latest
 
     steps:

README.md

Lines changed: 12 additions & 42 deletions

@@ -6,39 +6,13 @@
 [![License](https://img.shields.io/badge/dynamic/toml?url=https://raw.githubusercontent.com/finite-sample/alsgls/main/pyproject.toml&query=$.project.license.text&label=License)](https://opensource.org/licenses/MIT)
 
 
-When a GLS problem involves hundreds of equations, the $K × K$ covariance matrix becomes the computational bottleneck. A simple statistical remedy is to assume that most of the cross‑equation dependence can be captured by a *handful of latent factors* plus equation‑specific noise. This "low‑rank + diagonal" assumption slashes the number of unknowns from roughly $K^2$ to about $K×k$ parameters, where **k** (the latent factor rank) is much smaller than $K$. The model alone, however, does **not** guarantee speed: we still have to fit the parameters.
-
-### Installation
-
-Install the library from PyPI:
-
-```bash
-pip install alsgls
+```{include} docs/source/_snippets/synopsis.md
 ```
 
-For local development, clone the repo and use an editable install:
-
-```bash
-pip install -e .
+```{include} docs/source/_snippets/installation.md
 ```
 
-### Usage
-
-```python
-from alsgls import ALSGLS, ALSGLSSystem, simulate_sur
-
-Xs_tr, Y_tr, Xs_te, Y_te = simulate_sur(N_tr=240, N_te=120, K=60, p=3, k=4)
-
-# Scikit-learn style estimator
-est = ALSGLS(rank="auto", max_sweeps=12)
-est.fit(Xs_tr, Y_tr)
-test_score = est.score(Xs_te, Y_te)  # negative test NLL per observation
-
-# Statsmodels-style system interface
-system = {f"eq{j}": (Y_tr[:, j], Xs_tr[j]) for j in range(Y_tr.shape[1])}
-sys_model = ALSGLSSystem(system, rank="auto")
-sys_results = sys_model.fit()
-params = sys_results.params_as_series()  # pandas optional
+```{include} docs/source/_snippets/basic_usage.md
 ```
 
 See `examples/compare_als_vs_em.py` for a complete ALS versus EM comparison. The

@@ -50,29 +24,26 @@ peak memory (via Memray, Fil, or the POSIX RSS high-water mark).
 
 Background material and reproducible experiments are available in the notebooks under [`als_sim/`](als_sim/), such as [`als_sim/als_comparison.ipynb`](als_sim/als_comparison.ipynb) and [`als_sim/als_sur.ipynb`](als_sim/als_sur.ipynb).
 
-### Solving low‑rank GLS: EM versus ALS
+### Solving low‑rank GLS: EM versus ALS
 
-The classic EM algorithm alternates between updating the regression coefficients $\beta$ and updating the factor loadings $F$ and the diagonal noise $D$. Even though $\hat{\Sigma}$ is low‑rank, EMs M‑step recreates the **full** $K × K$ inverse, wiping out the memory win.
+The classic EM algorithm alternates between updating the regression coefficients $\beta$ and updating the factor loadings $F$ and the diagonal noise $D$. Even though $\hat{\Sigma}$ is low‑rank, EM's M‑step recreates the **full** $K \times K$ inverse, wiping out the memory win.
 
-An alternative is **Alternating‑Least‑Squares (ALS)**. The Woodbury identity reduces the expensive inverse to a tiny k × k system, and the β‑update can be written without explicitly forming the dense matrix at all. In practice, ALS converges in 5–6 sweeps and never allocates more than $O(Kk)$ memory, while EM allocates $O(K^²)$.
+An alternative is **Alternating‑Least‑Squares (ALS)**. The Woodbury identity reduces the expensive inverse to a tiny k × k system, and the β‑update can be written without explicitly forming the dense matrix at all. In practice, ALS converges in 5–6 sweeps and never allocates more than $O(Kk)$ memory, while EM allocates $O(K^2)$.
 
 **Rule of thumb:** if your GLS routine keeps looping between $\beta$ and a fresh $\hat{\Sigma}$, replacing the $\hat{\Sigma}$‑update by a factor‑ALS step yields the same statistical fit with an order‑of‑magnitude smaller memory footprint.
 
 ### Beyond SUR: where the idea travels
 
-Random‑effects models, feasible GLS with estimated heteroskedastic weights, optimal‑weight GMM, and spatial autoregressive GLS all iterate β ↔ Σ̂. Each can adopt the same ALS trick: treat the weight matrix as low‑rank + diagonal, invert only the k × k core, and avoid the dense K × K algebra. Memory savings in published examples range from 5× to 20×, depending on k.
+Random‑effects models, feasible GLS with estimated heteroskedastic weights, optimal‑weight GMM, and spatial autoregressive GLS all iterate β ↔ Σ̂. Each can adopt the same ALS trick: treat the weight matrix as low‑rank + diagonal, invert only the k × k core, and avoid the dense K × K algebra. Memory savings in published examples range from 5× to 20×, depending on k.
 
 ### A concrete case‑study: Seemingly‑Unrelated Regressions
 
-To show the magnitude, we ran a Monte‑Carlo experiment with N = 300 observations, three regressors, rank‑3 factors, and K set to 50,80,120. EM was given 45 iterations; ALS, six sweeps. The largest array EM holds is the dense Σ⁻¹, whereas ALSs largest is the skinny factor matrix F. The table summarises six replications:
+To show the magnitude, we ran a Monte‑Carlo experiment with N = 300 observations, three regressors, rank‑3 factors, and K set to 50, 80, 120. EM was given 45 iterations; ALS, six sweeps. The largest array EM holds is the dense Σ⁻¹, whereas ALS's largest is the skinny factor matrix F. The table summarises six replications:
 
-| K | β‑RMSE EM | β‑RMSE ALS | Peak MB EM | Peak MB ALS | Memory ratio |
-| --: | :-------: | :--------: | ---------: | ----------: | -----------: |
-| 50 | 0.021 | 0.021 | 0.020 | 0.002 | 10× |
-| 80 | 0.020 | 0.020 | 0.051 | 0.003 | 17× |
-| 120 | 0.020 | 0.020 | 0.115 | 0.004 | 29× |
+```{include} docs/source/_snippets/performance_table.md
+```
 
-Statistically, the two estimators are indistinguishable (paired‑test p ≥ 0.14). Computationally, ALS needs only a few megabytes whereas EM needs tens to hundreds.
+Statistically, the two estimators are indistinguishable (paired‑test p ≥ 0.14). Computationally, ALS needs only a few megabytes whereas EM needs tens to hundreds.
 
 ### Defaults, tuning knobs, and failure modes
 

@@ -93,5 +64,4 @@
 can make the β-step CG solve slow; adjust `cg_tol`/`cg_maxit`, add stronger
 ridge, or re-scale predictors. If `info["accept_t"]` stays at zero and the
 NLL does not improve, the factor rank may be too large relative to the sample
-size.
-
+size.

docs/source/_snippets/basic_usage.md

Lines changed: 18 additions & 0 deletions

## Usage

```python
from alsgls import ALSGLS, ALSGLSSystem, simulate_sur

Xs_tr, Y_tr, Xs_te, Y_te = simulate_sur(N_tr=240, N_te=120, K=60, p=3, k=4)

# Scikit-learn style estimator
est = ALSGLS(rank="auto", max_sweeps=12)
est.fit(Xs_tr, Y_tr)
test_score = est.score(Xs_te, Y_te)  # negative test NLL per observation

# Statsmodels-style system interface
system = {f"eq{j}": (Y_tr[:, j], Xs_tr[j]) for j in range(Y_tr.shape[1])}
sys_model = ALSGLSSystem(system, rank="auto")
sys_results = sys_model.fit()
params = sys_results.params_as_series()  # pandas optional
```
docs/source/_snippets/installation.md

Lines changed: 13 additions & 0 deletions

## Installation

Install the library from PyPI:

```bash
pip install alsgls
```

For local development, clone the repo and use an editable install:

```bash
pip install -e .
```
docs/source/_snippets/performance_table.md

Lines changed: 5 additions & 0 deletions

| K | β‑RMSE EM | β‑RMSE ALS | Peak MB EM | Peak MB ALS | Memory ratio |
| --: | :-------: | :--------: | ---------: | ----------: | -----------: |
| 50 | 0.021 | 0.021 | 0.020 | 0.002 | 10× |
| 80 | 0.020 | 0.020 | 0.051 | 0.003 | 17× |
| 120 | 0.020 | 0.020 | 0.115 | 0.004 | 29× |

docs/source/_snippets/synopsis.md

Lines changed: 1 addition & 0 deletions

When a GLS problem involves hundreds of equations, the $K \times K$ covariance matrix becomes the computational bottleneck. A simple statistical remedy is to assume that most of the cross‑equation dependence can be captured by a *handful of latent factors* plus equation‑specific noise. This "low‑rank + diagonal" assumption slashes the number of unknowns from roughly $K^2$ to about $K \times k$ parameters, where **k** (the latent factor rank) is much smaller than $K$. The model alone, however, does **not** guarantee speed: we still have to fit the parameters.
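
To make the parameter-count claim concrete, here is a small back-of-the-envelope check (plain Python written for this page, not package code; the K and k values are illustrative):

```python
# Unknowns in the error covariance, counted two ways.
# Dense symmetric K×K: K*(K+1)/2 free parameters (roughly K^2 / 2).
# Low-rank + diagonal (F is K×k, D is length K): K*k + K free parameters.
k = 4  # illustrative latent rank
for K in (50, 80, 120):
    dense = K * (K + 1) // 2
    low_rank = K * k + K
    print(f"K={K:3d}  dense={dense:6d}  low-rank+diag={low_rank:4d}  ≈{dense / low_rank:.0f}× fewer")
```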

docs/source/als_vs_em.md

Lines changed: 139 additions & 0 deletions

# ALS vs EM Comparison

This section compares the Alternating Least Squares (ALS) and Expectation-Maximization (EM)
approaches for low-rank+diagonal GLS estimation.

## Mathematical Background

Both algorithms solve the same statistical problem: estimating regression coefficients β
when the error covariance has a low-rank+diagonal structure:

Σ = FF^T + diag(D)

where F is the K×k matrix of factor loadings and D is the length-K vector of idiosyncratic noise variances.
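
As a quick sanity check on this structure (a self-contained numpy sketch written for this page, not package code), data generated from a latent-factor model has exactly this covariance:

```python
import numpy as np

# If y = F z + e with z ~ N(0, I_k) and e ~ N(0, diag(D)),
# then Cov(y) = F F^T + diag(D).
rng = np.random.default_rng(0)
K, k, n = 8, 2, 200_000

F = rng.normal(size=(K, k))         # factor loadings
D = rng.uniform(0.5, 1.5, size=K)   # idiosyncratic variances

z = rng.normal(size=(n, k))
e = rng.normal(size=(n, K)) * np.sqrt(D)
y = z @ F.T + e

Sigma_model = F @ F.T + np.diag(D)
Sigma_hat = np.cov(y, rowvar=False)
print(np.max(np.abs(Sigma_hat - Sigma_model)))  # small, shrinking as n grows
```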

## EM Algorithm

The EM algorithm alternates between:

- **E-step**: Compute expected sufficient statistics given current parameters
- **M-step**: Update parameters by solving weighted least squares with the full Σ^(-1)

The critical issue: EM's M-step explicitly forms the K×K inverse covariance matrix,
requiring O(K²) memory even though the model has only O(Kk) parameters.

```python
# EM's expensive step (pseudocode)
Sigma = F @ F.T + np.diag(D)      # K×K dense matrix
Sigma_inv = np.linalg.inv(Sigma)  # K×K inversion
# Use Sigma_inv for the β updates...
```

## ALS Algorithm

ALS also alternates between updating β and updating (F, D), but uses the Woodbury
matrix identity to avoid forming dense matrices:

Σ^(-1) = D^(-1) - D^(-1)F(I + F^T D^(-1)F)^(-1)F^T D^(-1)

The key insight: we only need to invert a k×k matrix (I + F^T D^(-1)F), not K×K.

```python
# ALS's efficient step (pseudocode)
D_inv = 1.0 / D                                                     # K×1 vector
small_inv = np.linalg.inv(np.eye(k) + F.T @ (D_inv[:, None] * F))   # k×k inverse
# Apply the Woodbury formula without ever forming a K×K matrix
```
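
To see why this is enough, the following self-contained check (plain numpy, independent of the package internals) confirms that the Woodbury route reproduces the dense solve while only ever factorizing a k×k matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
K, k = 200, 5
F = rng.normal(size=(K, k))
D = rng.uniform(0.5, 2.0, size=K)
r = rng.normal(size=K)  # e.g. a residual vector to be whitened

# Dense route (what EM effectively pays for): build and solve with the K×K matrix.
Sigma = F @ F.T + np.diag(D)
x_dense = np.linalg.solve(Sigma, r)

# Woodbury route: only a k×k system is ever solved.
Dinv_r = r / D
FD = F / D[:, None]                           # D^(-1) F, still K×k
core = np.eye(k) + F.T @ FD                   # k×k matrix
x_woodbury = Dinv_r - FD @ np.linalg.solve(core, F.T @ Dinv_r)

print(np.max(np.abs(x_dense - x_woodbury)))   # agreement to machine precision
```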

## Memory Comparison

| Algorithm | Largest Array | Memory Complexity |
|-----------|---------------|-------------------|
| EM | Σ^(-1) (K×K dense) | O(K²) |
| ALS | F (K×k skinny) | O(Kk) |

For K=100 equations and k=5 factors:

- EM: 100×100 = 10,000 floats
- ALS: 100×5 = 500 floats
- **Ratio: 20×**
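
The same arithmetic in a couple of lines (float64, 8 bytes per entry; a rough count that ignores temporaries):

```python
K, k = 100, 5
mb_em = K * K * 8 / 1e6   # dense Σ^(-1): 0.08 MB
mb_als = K * k * 8 / 1e6  # skinny F:     0.004 MB
print(f"{mb_em:.3f} MB vs {mb_als:.3f} MB  ->  {mb_em / mb_als:.0f}× smaller")
```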

## Computational Comparison

```python
from alsgls import simulate_sur, als_gls, em_gls
import time

# Generate a test problem
K = 100  # equations
N = 300  # observations
k = 5    # factors
# Keep the test split; it is reused in the statistical-equivalence check below.
Xs_tr, Y_tr, Xs_te, Y_te = simulate_sur(N_tr=N, N_te=50, K=K, p=3, k=k)

# Time ALS
t0 = time.time()
B_als, F_als, D_als, mem_als, info_als = als_gls(
    Xs_tr, Y_tr, k=k, sweeps=8
)
time_als = time.time() - t0

# Time EM
t0 = time.time()
B_em, F_em, D_em, mem_em, info_em = em_gls(
    Xs_tr, Y_tr, k=k, iters=30
)
time_em = time.time() - t0

print(f"ALS: {time_als:.2f}s, {mem_als:.1f}MB")
print(f"EM:  {time_em:.2f}s, {mem_em:.1f}MB")
print(f"Memory ratio: {mem_em/mem_als:.1f}×")
```

## Statistical Equivalence

Despite different computational approaches, both algorithms optimize the same
likelihood and produce statistically indistinguishable estimates:

```python
from alsgls import XB_from_Blist, mse
import numpy as np

# Compare predictions on the held-out test split generated above
Y_pred_als = XB_from_Blist(Xs_te, B_als)
Y_pred_em = XB_from_Blist(Xs_te, B_em)

# MSE should be nearly identical
mse_als = mse(Y_te, Y_pred_als)
mse_em = mse(Y_te, Y_pred_em)

print(f"ALS MSE: {mse_als:.6f}")
print(f"EM MSE:  {mse_em:.6f}")
print(f"Difference: {abs(mse_als - mse_em):.2e}")

# Factor structures should be equivalent (up to rotation)
cov_als = F_als @ F_als.T + np.diag(D_als)
cov_em = F_em @ F_em.T + np.diag(D_em)
print(f"Covariance difference: {np.linalg.norm(cov_als - cov_em):.2e}")
```

## When to Use Each

**Use ALS when:**

- K is large (>50 equations)
- Memory is constrained
- You need the Woodbury form for other computations
- k << K (the low-rank assumption holds)

**Use EM when:**

- K is small (<30 equations)
- You need the full covariance matrix anyway
- Implementing a standard EM framework
- Debugging (EM is conceptually simpler)
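
Condensed into a rough heuristic (a sketch for this page only, not part of the `alsgls` API; the thresholds are simply the ones quoted above):

```python
def suggest_solver(K: int, k: int) -> str:
    """Pick a routine name based on the rules of thumb above."""
    if K > 50 and 10 * k <= K:  # large system with a genuinely low rank
        return "als_gls"
    if K < 30:                  # small system: dense EM is cheap and easy to debug
        return "em_gls"
    return "als_gls"            # otherwise default to the memory-light option


print(suggest_solver(K=120, k=4))  # als_gls
print(suggest_solver(K=20, k=2))   # em_gls
```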

## Implementation Details

The package provides both for comparison:

- `als_gls()`: Memory-efficient ALS implementation
- `em_gls()`: Baseline EM implementation

Both use the same convergence criteria and regularization options for fair comparison.
