d-morrison · d-morrison · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -46,9 +46,11 @@ Suggests:
     ggdag,
     ggplot2,
     glmx,
+    gratia (>= 0.9),
     haven,
     kableExtra,
     KMsurv,
+    mgcv,
     muhaz,
     palmerpenguins,
     parameters,

diff --git a/_quarto-book.yml b/_quarto-book.yml
@@ -29,6 +29,7 @@ book:
        - chapters/causal-inference.qmd
        - chapters/predictor-selection.qmd
        - chapters/intro-multilevel-models.qmd
+       - chapters/generalized-additive-models.qmd
     - part: chapters/time-to-event-models.qmd
       chapters:
        - chapters/intro-to-survival-analysis.qmd

diff --git a/_quarto-website.yml b/_quarto-website.yml
@@ -12,6 +12,7 @@ project:
     - chapters/causal-inference.qmd
     - chapters/predictor-selection.qmd
     - chapters/intro-multilevel-models.qmd
+    - chapters/generalized-additive-models.qmd
     - chapters/time-to-event-models.qmd
     - chapters/intro-to-survival-analysis.qmd
     - chapters/proportional-hazards-models.qmd
@@ -77,6 +78,8 @@ website:
             href: chapters/predictor-selection.qmd
           - text: "Multilevel Models"
             href: chapters/intro-multilevel-models.qmd
+          - text: "Generalized Additive Models"
+            href: chapters/generalized-additive-models.qmd
           - text: "---"
           - text: "Time to Event Models"
             href: chapters/time-to-event-models.qmd

diff --git a/_subfiles/generalized-additive-models/_exr-gams-applied.qmd b/_subfiles/generalized-additive-models/_exr-gams-applied.qmd
@@ -0,0 +1,160 @@
+::: notes
+Applied exercises adapted from @james2021islr2e [§7.9] and
+@hastie2009esl2e [§9.1].
+:::
+
+### Polynomial regression and step functions (ISLRv2 Applied 7) {#sec-exr-wage-poly}
+
+::::{#exr-wage-polynomial-cv}
+
+Using the `Wage` dataset from the `ISLR2` package [@james2021islr2e]:
+
+**(a)** Use 10-fold cross-validation to choose the polynomial degree $d^*$ for
+predicting `wage` from `age`. Plot the CV errors for degrees 1 through 10.
+
+**(b)** Fit the degree-$d^*$ polynomial and plot it with a 95% confidence band.
+
+**(c)** Separately, choose the number of cut-points $K^*$ for a step-function
+model using cross-validation. Plot the chosen step-function fit.
+
+::::
+
+
+::::{#sol-wage-polynomial-cv}
+
+::: {.panel-tabset}
+
+#### R
+
+```{r}
+#| code-fold: true
+#| eval: false
+library(ISLR2)
+library(boot)
+library(ggplot2)
+set.seed(1)
+cv_poly <- vapply(1:10, function(d) {
+  cv.glm(Wage, glm(wage ~ poly(age, d), data = Wage), K = 10)$delta[1]
+}, numeric(1))
+best_d <- which.min(cv_poly)
+cat("CV-selected degree:", best_d, "\n")
+plot(1:10, cv_poly, type = "b", xlab = "Degree", ylab = "CV MSE",
+     main = "Polynomial degree selection")
+points(best_d, cv_poly[best_d], col = "red", pch = 19)
+```
+
+#### Python
+
+```{python}
+#| eval: false
+#| python.reticulate: false
+import numpy as np
+from sklearn.preprocessing import PolynomialFeatures
+from sklearn.linear_model import LinearRegression
+from sklearn.pipeline import Pipeline
+from sklearn.model_selection import cross_val_score
+from ISLP import load_data
+Wage = load_data("Wage")
+X = Wage["age"].values.reshape(-1, 1)
+y = Wage["wage"].values
+cv_errs = [-cross_val_score(
+    Pipeline([("poly", PolynomialFeatures(d, include_bias=False)),
+              ("reg", LinearRegression())]),
+    X, y, cv=10, scoring="neg_mean_squared_error").mean() for d in range(1, 11)]
+best_d = int(np.argmin(cv_errs)) + 1
+print(f"CV-selected degree: {best_d}")
+```
+
+:::
+
+::::
+
+
+### Backfitting algorithm (ESL §9.1.1 / ISLRv2 Applied 11) {#sec-exr-backfitting}
+
+::::{#exr-backfitting}
+
+The **backfitting algorithm** [@hastie2009esl2e, §9.1.1] fits an additive model
+$Y = \a + \sum_j f_j(X_j) + \eps$ by iteratively smoothing partial residuals.
+
+For $p=2$ predictors with linear component smoothers $\hat f_j(X_j) = \hat\b_j X_j$:
+
+1. Initialise: $\hat\a = \bar{y}$, $\hat f_1 = \hat f_2 = 0$.
+2. Repeat until convergence (Gauss–Seidel order — re-centre each
+   $\hat f_j$ immediately, so the centred $\hat f_1$ is used when
+   computing $r_2$):
+   a. $r_1 \leftarrow y - \hat\a - \hat f_2(x_2)$; regress $r_1$ on
+      $x_1$ to update $\hat f_1$; $\hat f_1 \leftarrow \hat f_1 -
+      \overline{\hat f_1}$.
+   b. $r_2 \leftarrow y - \hat\a - \hat f_1(x_1)$; regress $r_2$ on
+      $x_2$ to update $\hat f_2$; $\hat f_2 \leftarrow \hat f_2 -
+      \overline{\hat f_2}$.
+
+**(a)** Implement this on simulated data with known true coefficients.
+
+**(b)** Verify the coefficients match `lm(y ~ x1 + x2)`.
+
+::::
+
+
+::::{#sol-backfitting}
+
+::: {.panel-tabset}
+
+#### R
+
+```{r}
+#| code-fold: true
+#| eval: false
+set.seed(1)
+n <- 100
+x1 <- rnorm(n)
+x2 <- rnorm(n)
+y <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)
+alpha_hat <- mean(y)
+f1 <- f2 <- rep(0, n)
+for (iter in seq_len(500)) {
+  r1 <- y - alpha_hat - f2
+  b1 <- coef(lm(r1 ~ x1))[2]
+  f1 <- b1 * x1
+  f1 <- f1 - mean(f1)
+  r2 <- y - alpha_hat - f1
+  b2 <- coef(lm(r2 ~ x2))[2]
+  f2 <- b2 * x2
+  f2 <- f2 - mean(f2)
+}
+lm_fit <- lm(y ~ x1 + x2)
+cat(sprintf("Backfitting: b1=%.4f, b2=%.4f\n", b1, b2))
+cat(sprintf("lm():        b1=%.4f, b2=%.4f\n",
+            coef(lm_fit)[2], coef(lm_fit)[3]))
+```
+
+#### Python
+
+```{python}
+#| eval: false
+#| python.reticulate: false
+import numpy as np
+rng = np.random.default_rng(1)
+n = 100; x1 = rng.standard_normal(n); x2 = rng.standard_normal(n)
+y = 2 + 3*x1 - 1.5*x2 + rng.normal(0, 0.5, n)
+alpha_hat = y.mean(); f1 = np.zeros(n); f2 = np.zeros(n)
+for _ in range(500):
+    r1 = y - alpha_hat - f2
+    b1 = float(np.cov(x1, r1)[0, 1] / np.var(x1, ddof=1))
+    f1 = b1*x1 - (b1*x1).mean()
+    r2 = y - alpha_hat - f1
+    b2 = float(np.cov(x2, r2)[0, 1] / np.var(x2, ddof=1))
+    f2 = b2*x2 - (b2*x2).mean()
+X_ols = np.column_stack([np.ones(n), x1, x2])
+b_ols = np.linalg.lstsq(X_ols, y, rcond=None)[0]
+print(f"Backfitting: b1={b1:.4f}, b2={b2:.4f}")
+print(f"OLS:         b1={b_ols[1]:.4f}, b2={b_ols[2]:.4f}")
+```
+
+:::
+
+The coefficients agree, confirming backfitting converges to OLS when
+all component smoothers are linear [@hastie2009esl2e, §9.1.1].
+
+::::
diff --git a/_subfiles/generalized-additive-models/_exr-gams-conceptual.qmd b/_subfiles/generalized-additive-models/_exr-gams-conceptual.qmd
@@ -0,0 +1,76 @@
+::: notes
+Exercises adapted from @james2021islr2e [§7.9] and @hastie2009esl2e [§9.1].
+:::
+
+### Piecewise polynomial continuity (ISLRv2 Conceptual 1) {#sec-exr-continuity}
+
+::::{#exr-spline-continuity}
+
+A cubic polynomial with one interior knot at $\xi$ uses the truncated power basis:
+
+$$f(X) = \b_0 + \b_1 X + \b_2 X^2 + \b_3 X^3 + \b_4 (X-\xi)_+^3$$
+
+where $(u)_+ = \max(u,0)$.
+
+**(a)** Write $f(X)$ as two cubics: $f_1(X)$ for $X < \xi$ and $f_2(X)$ for $X \ge \xi$.
+
+**(b)** Show $f_1(\xi) = f_2(\xi)$ (continuity).
+
+**(c)** Show the first derivatives match at $\xi$.
+
+**(d)** Show the second derivatives match at $\xi$.
+
+**(e)** Do the third derivatives match? What order of smoothness does this cubic spline achieve?
+
+::::
+
+
+::::{#sol-spline-continuity}
+
+**(a)** For $X < \xi$: $(X-\xi)_+ = 0$, giving $f_1(X) = \b_0 + \b_1 X + \b_2 X^2 + \b_3 X^3$.
+For $X \ge \xi$: $f_2(X) = f_1(X) + \b_4(X-\xi)^3$.
+
+**(b)** At $X=\xi$: $(X-\xi)_+^3 = 0$, so $f_2(\xi) = f_1(\xi)$. $\checkmark$
+
+**(c)** $f_2^{\prime}(X) = f_1^{\prime}(X) + 3\b_4(X-\xi)_+^2$.
+At $X=\xi$: $(X-\xi)_+^2 = 0$, so the derivatives agree. $\checkmark$
+
+**(d)** $f_2^{\prime\prime}(X) = f_1^{\prime\prime}(X) + 6\b_4(X-\xi)_+$.
+At $X=\xi$: $(X-\xi)_+ = 0$. $\checkmark$
+
+**(e)** $f_1^{\prime\prime\prime} = 6\b_3$ and $f_2^{\prime\prime\prime} = 6\b_3 + 6\b_4$ for $X > \xi$.
+These differ (unless $\b_4=0$). The truncated power basis gives a **cubic spline**:
+continuous with continuous 1st and 2nd derivatives, but a jump in the 3rd derivative at each knot.
+
+::::
+
+
+### Smoothing spline penalty behavior (ISLRv2 Conceptual 2) {#sec-exr-smoothing-penalty}
+
+::::{#exr-smoothing-penalty}
+
+The smoothing spline minimizes
+
+$$\hat{g} = \arg\min_g \left\{ \sum_{i=1}^n (y_i - g(x_i))^2 + \l \int g^{\prime\prime}(t)^2 \, dt \right\}.$$
+
+**(a)** What is the limiting form of $\hat{g}$ as $\l \to \infty$?
+
+**(b)** What is the limiting form as $\l \to 0$?
+
+**(c)** For $\l \to 0$, is the training RSS equal to zero?
+
+::::
+
+
+::::{#sol-smoothing-penalty}
+
+**(a)** As $\l \to \infty$ the roughness penalty dominates. The only $g$ with
+$\int g^{\prime\prime}(t)^2\,dt = 0$ is a linear function $g(x) = a + bx$.
+So $\hat{g}$ converges to the ordinary least-squares line.
+
+**(b)** As $\l \to 0$ there is no penalty; the minimizer interpolates every
+training point: $g(x_i) = y_i$ for all $i$.
+
+**(c)** Yes. Interpolation means training RSS $= 0$ -- the overfit extreme.
+
+::::
diff --git a/_subfiles/generalized-additive-models/_sec_basis_functions.qmd b/_subfiles/generalized-additive-models/_sec_basis_functions.qmd
@@ -0,0 +1,96 @@
+::: notes
+Adapted from @james2021islr2e [§7.3] and @wood2017generalized [§4].
+:::
+
+Polynomial regression and step-function regression look very
+different, but they share a common structure: both replace $X$ with
+a fixed set of **transformations** of $X$, then fit a linear
+regression in those transformations. This shared structure is the
+**basis function** framework, and it's the foundation for splines.
+
+::::{#def-basis-expansion}
+#### Basis expansion
+
+A **basis expansion** of $X$ is a fixed (data-free) set of functions
+$b_1(X), \ldots, b_K(X)$. The basis-expansion regression model
+replaces $X$ with these transformations:
+
+$$y_i = \b_0 + \b_1 b_1(x_i) + \b_2 b_2(x_i) + \cdots + \b_K b_K(x_i) + \eps_i.$$
+
+The basis functions $b_k(\cdot)$ are chosen ahead of time; the
+coefficients $\b_k$ are estimated by least squares (or, more
+generally, by maximum likelihood for a GLM with link
+$g(\E{Y \mid X}) = \b_0 + \sum_k \b_k b_k(X)$).
+::::
+
+::::{#exm-basis-polynomial}
+#### Polynomial regression as a basis expansion
+
+For polynomial regression of degree $d$ (@def-polynomial-regression),
+the basis functions are the monomials
+
+$$b_k(X) = X^k, \quad k = 1, \ldots, d.$$
+::::
+
+::::{#exm-basis-step}
+#### Step-function regression as a basis expansion
+
+For step-function regression with cutpoints $c_1 < \cdots < c_K$
+(@def-step-function-regression), the basis functions are the bin
+indicators
+
+$$b_k(X) = \indicp{c_{k-1} < X \le c_k}, \quad k = 1, \ldots, K+1.$$
+::::
+
+## Why the basis-function abstraction matters
+
+Once we see polynomial and step-function regression as instances of
+the same template, we can ask: *which other basis functions might be
+worth trying?* Two answers turn out to be especially useful:
+
+- **Truncated power basis functions** —
+  $b_k(X) = (X - \xi_k)_+^d$ for chosen knots $\xi_k$. These give the
+  **regression spline** family, which enforces piecewise-polynomial
+  structure with continuity and smoothness constraints at the knots
+  (developed in the next section).
+- **B-spline basis functions** — a numerically well-conditioned
+  alternative basis that spans the same space of piecewise
+  polynomials as the truncated power basis, but with locally
+  supported basis functions (each $b_k$ is nonzero only on a small
+  region of the $x$-axis).
+
+Both of these are still **linear** in the coefficients, so all of the
+GLM machinery — likelihood, standard errors, Wald tests, AIC, etc. —
+applies unchanged. The only thing that changes is the design matrix:
+each row $i$ of the design matrix has entries
+$1, b_1(x_i), \ldots, b_K(x_i)$ instead of $1, x_i$.
+
+## The general estimation recipe
+
+For a continuous response $Y$ and a basis expansion
+$b_1(\cdot), \ldots, b_K(\cdot)$:
+
+1. **Build the design matrix.** Construct the $n \times (K+1)$
+   matrix
+
+   $$\mat{B} = \sbmat{
+   1 & b_1(x_1) & \cdots & b_K(x_1) \\
+   1 & b_1(x_2) & \cdots & b_K(x_2) \\
+   \vdots & \vdots & \ddots & \vdots \\
+   1 & b_1(x_n) & \cdots & b_K(x_n)
+   }.$$
+
+2. **Fit by least squares (or MLE for a GLM).**
+   The OLS estimator $\hat\b = (\tp{\mat{B}} \mat{B})^{-1} \tp{\mat{B}} \mat{y}$
+   minimizes the residual sum of squares — exactly the linear
+   regression machinery from earlier chapters, just with a different
+   design matrix.
+
+3. **Plot the fitted curve.** For a dense grid of $x$ values, compute
+   $\hat y(x) = \hat\b_0 + \sum_k \hat\b_k b_k(x)$ and plot. Standard
+   errors and pointwise confidence bands come from the usual linear
+   model formulas applied to $\mat{B}$.
+
+The remaining design decision is **which basis functions to use** —
+and that choice is what distinguishes regression splines from
+smoothing splines from local regression from GAMs.