diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 46133eec5..df4cc2c24 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -266,6 +266,21 @@ Do not use generic acknowledgements without locators or plaintext author-title references when a BibTeX citation is available. +## Observational vs Causal Estimands + +Always distinguish observational estimands +from causal estimands in notation and prose. + +- Use observational notation + (for example, standardized risks based on `\E{Y \mid A=a, Z=z}`) + when discussing model-based associations. +- Use potential-outcome notation + (for example, `\Pr(Y^a = 1)`) + only when making a causal claim. +- If observational and causal estimands are equated, + explicitly state identification assumptions + (consistency, exchangeability, and positivity). + ## Variable Definitions in Exercises When introducing model variables in exercises, @@ -383,6 +398,113 @@ When introducing or editing formal statistical definitions in `.qmd` files: ensure those terms also have formal `#def-` div definitions in the relevant scope before relying on them +## Slidebreaks Before Theorem-Type Divs + +Always add `{{< slidebreak >}}` on a blank line immediately before +every theorem-type div opener. +This ensures slide-format output stays readable. + +Theorem-type div types (per [Quarto cross-reference docs](https://quarto.org/docs/authoring/cross-references.html#theorems-and-proofs)): +`#thm-`, `#lem-`, `#cor-`, `#prp-`, `#cnj-`, `#def-`, `#exm-`, `#exr-`, `#rem-` + +### Slidebreaks in including vs. included files + +When a subfile's **first** content is a theorem-type div, +place the `{{< slidebreak >}}` in the **including** (parent) file, +immediately before the `{{< include >}}` shortcode — +**not** at the start of the included subfile itself. + +**Correct** — slidebreak in the including file: +```qmd + +{{< slidebreak >}} + +{{< include _subfiles/chapter/_exm-my-example.qmd >}} +``` + +```qmd + +:::{#exm-my-example} + +#### My example title + +Content... + +::: +``` + +**Incorrect** — slidebreak inside the included file: +```qmd + +{{< include _subfiles/chapter/_exm-my-example.qmd >}} +``` + +```qmd + +{{< slidebreak >}} + +:::{#exm-my-example} + +#### My example title + +Content... + +::: +``` + +**Correct** example (non-leading slidebreak, same file): +```qmd +{{< slidebreak >}} + +:::{#def-collapsibility} + +#### Collapsibility + +A measure is *collapsible* if ... + +::: +``` + +**Incorrect** (missing slidebreak): +```qmd +:::{#def-collapsibility} + +#### Collapsibility + +A measure is *collapsible* if ... + +::: +``` + +## Example Formatting + +All worked examples in `.qmd` files must be wrapped in a Quarto `#exm-` div. +Never leave a named example as a plain markdown section. + +**Correct:** +```qmd +:::{#exm-wcgs-marginal-rd} + +##### Example: Marginal risk difference + +Content of the example... + +::: +``` + +**Incorrect:** +```qmd +##### Example: Marginal risk difference + +Content of the example... +``` + +- Use an id beginning `#exm-` (for example, `#exm-wcgs-marginal-rd`) +- Put the example title in a heading inside the div, + at the heading level matching the surrounding section depth +- All content for the example (setup, computation, interpretation) + should live inside the div + ## Div Titles vs. Markdown Headings **CRITICAL**: Div titles (headings inside divs like `:::{#def-...}`, `:::{#thm-...}`, `:::{#exm-...}`, etc.) are NOT the same as regular markdown headings. @@ -822,6 +944,7 @@ Content here. More content. ``` +When a subfile begins with a theorem-type div (`#thm-`, `#lem-`, `#cor-`, `#prp-`, `#cnj-`, `#def-`, `#exm-`, `#exr-`, `#rem-`), place the preceding `{{< slidebreak >}}` in the **parent** file (before the `{{< include >}}`), not inside the subfile. The subfile itself should not start with `{{< slidebreak >}}`. ## Computer Algebra Systems (CAS) diff --git a/CLAUDE.md b/CLAUDE.md index b099e21cf..30532008a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -41,6 +41,8 @@ Before committing any `.qmd`, `.R`, or config file change: ### Quarto - Use `{{< slidebreak >}}` instead of `---` for slide breaks +- Add `{{< slidebreak >}}` immediately before every theorem-type div (`#thm-`, `#lem-`, `#cor-`, `#prp-`, `#cnj-`, `#def-`, `#exm-`, `#exr-`, `#rem-`) +- When a subfile begins with a theorem-type div, put the preceding `{{< slidebreak >}}` in the **parent** file (before the `{{< include >}}`), not inside the subfile - Default to `#| code-fold: true` for figure/table chunks - Use div format (`:::{#fig-...}`) for figures and tables, not chunk-option `fig-cap`/`tbl-cap` - Do not indent `:::` fenced div markers inside lists diff --git a/_subfiles/logistic-regression/_exm-collapsibility.qmd b/_subfiles/logistic-regression/_exm-collapsibility.qmd new file mode 100644 index 000000000..99a2dbf47 --- /dev/null +++ b/_subfiles/logistic-regression/_exm-collapsibility.qmd @@ -0,0 +1,68 @@ +:::{#exm-collapsibility} + +#### Collapsibility: numerical illustration + +Consider a hypothetical scenario with two strata +($Z = 0$: low-risk, $Z = 1$: high-risk). +We make two simplifying assumptions: + +1. **No confounding** ($A \perp Z$): + exposure is independent of the covariate, + so $\p(Z = z \mid A = a) = \p(Z = z)$ for all $a$ and $z$. +2. **Equal stratum sizes** ($\p(Z = 0) = \p(Z = 1) = \tfrac{1}{2}$): + both strata have the same marginal probability. + +Together, these imply +$\p(Z = z \mid A = a) = \tfrac{1}{2}$ for all $a$ and $z$, +so the simple (equal-weight) average of stratum-specific effects +equals both the observed marginal risk $\pi(a)$ and the causal marginal risk $\pi_a$: + +```{r} +#| label: collapsibility-example +#| code-fold: true + +strata <- tibble::tibble( + stratum = c("Z = 0 (low risk)", "Z = 1 (high risk)"), + pi_0 = c(0.05, 0.30), + pi_1 = c(0.10, 0.50) +) |> + dplyr::mutate( + RD = pi_1 - pi_0, + RR = pi_1 / pi_0, + OR = (pi_1 / (1 - pi_1)) / (pi_0 / (1 - pi_0)) + ) + +pi1_marg <- mean(strata$pi_1) +pi0_marg <- mean(strata$pi_0) + +tibble::tibble( + Measure = c("Risk difference", "Risk ratio", "Odds ratio"), + Marginal = c( + pi1_marg - pi0_marg, + pi1_marg / pi0_marg, + (pi1_marg / (1 - pi1_marg)) / (pi0_marg / (1 - pi0_marg)) + ), + `Avg. conditional` = c( + mean(strata$RD), mean(strata$RR), mean(strata$OR) + ), + `Marginal = avg. conditional?` = c("Yes", "No", "No") +) |> + knitr::kable(digits = 3) +``` + +Even with no confounding, the marginal RR and marginal OR both differ from +the average of their conditional counterparts, +while the marginal RD equals the average conditional RD exactly. + +Non-collapsibility is distinct from effect-measure modification. +Here the stratum-specific effects also vary across strata +(risk ratios $2.00$ vs. $1.67$, odds ratios $2.11$ vs. $2.33$), +but that variation is not what drives the discrepancy: +even if the conditional odds ratio were held *constant* across strata, +the marginal odds ratio would generally still differ from it +(and lie closer to the null), +whereas a constant conditional risk difference +always reproduces the marginal risk difference. + +::: + diff --git a/_subfiles/logistic-regression/_sec-OR-alternatives.qmd b/_subfiles/logistic-regression/_sec-OR-alternatives.qmd index be69a9809..d86f3de3a 100644 --- a/_subfiles/logistic-regression/_sec-OR-alternatives.qmd +++ b/_subfiles/logistic-regression/_sec-OR-alternatives.qmd @@ -1,16 +1,16 @@ -### Objections to odds ratios +## Objections to odds ratios {{< include _subfiles/logistic-regression/_sec_OR_objections.qmd >}} -### Deriving risk ratios and risk differences from logistic regression models +## Deriving risk ratios and risk differences from logistic regression models {{< include _subfiles/logistic-regression/_sec-logistic-RR-RD.qmd >}} -### Other link functions for Bernoulli outcomes +## Other link functions for Bernoulli outcomes {{< include _subfiles/logistic-regression/_sec-non-logistic-bernoulli-models.qmd >}} -### Quasibinomial +## Quasibinomial See [Hua Zhou](https://hua-zhou.github.io/)'s [lecture notes](https://ucla-biostat-200c-2020spring.github.io/slides/04-binomial/binomial.html#:~:text=0.05%20%27.%27%200.1%20%27%20%27%201-,Quasi%2Dbinomial,-Another%20way%20to) diff --git a/_subfiles/logistic-regression/_sec-bootstrap-boot-package.qmd b/_subfiles/logistic-regression/_sec-bootstrap-boot-package.qmd new file mode 100644 index 000000000..d9f2983fd --- /dev/null +++ b/_subfiles/logistic-regression/_sec-bootstrap-boot-package.qmd @@ -0,0 +1,51 @@ +The [`boot`](https://cran.r-project.org/package=boot) package provides a more streamlined interface for bootstrap inference. +Here's how to compute the same confidence interval using `boot::boot()`: + +```{r} +#| label: boot-package-example +#| code-fold: show +#| eval: false + +library(boot) + +statistic_fn <- function(data, indices) { + boot_data <- data[indices, ] + + boot_fit <- glm( + chd69_binary ~ dibpat + age, + data = boot_data, + family = binomial(link = "logit") + ) + + compute_marginal_rd( + model = boot_fit, + data = boot_data, + exposure_var = "dibpat", + exposed_level = "Type A", + unexposed_level = "Type B" + ) +} + +set.seed(20260512) +boot_results <- boot( + data = wcgs_clean, + statistic = statistic_fn, + # Keep the rendered example fast; increase to 2000+ for final analyses. + R = 300 +) + +boot_results + +boot.ci(boot_results, type = c("perc", "bca")) +``` + +The [`boot`](https://cran.r-project.org/package=boot) package provides several types of confidence intervals, +including the percentile method (`perc`) +and the bias-corrected and accelerated (BCa) method (`bca`), +which can provide better coverage in some situations. +The BCa method is more demanding than the percentile method: +its bias-correction and acceleration estimates are unstable +at the illustrative `R = 300` used here +(and can warn or fail outright), +so use a substantially larger `R` (at least 2000) +before relying on `bca` intervals. diff --git a/_subfiles/logistic-regression/_sec-bootstrap-inference.qmd b/_subfiles/logistic-regression/_sec-bootstrap-inference.qmd new file mode 100644 index 000000000..1323f34b9 --- /dev/null +++ b/_subfiles/logistic-regression/_sec-bootstrap-inference.qmd @@ -0,0 +1,42 @@ +Adapted from +[@vittinghoff2e, Section 3.6, p. 62], +[@hastie2009esl2e, Section 7.11, p. 249], +and +[@james2021islr2e, Chapter 5, Section 5.2, p. 209]. + +The bootstrap is a resampling method +that allows us to estimate the sampling distribution +of a statistic without making strong parametric assumptions. +For an introduction to the bootstrap, +see [Bootstrap Confidence Intervals](basic-statistical-methods.qmd#sec-bootstrap-ci). + +#### Bootstrap algorithm + +To construct a bootstrap confidence interval +for a marginal risk difference: + +1. For $b = 1, \ldots, B$ (e.g., $B = 1000$): + a. Draw a bootstrap sample of size $n$ with replacement from the original data + b. Fit the logistic regression model to the bootstrap sample + c. Compute the marginal risk difference from the fitted model +2. The bootstrap distribution of the $B$ risk difference estimates + approximates the sampling distribution +3. Construct a confidence interval using the percentile method + (e.g., the 2.5th and 97.5th percentiles for a 95% CI) + +The bootstrap standard error is the standard deviation +of the bootstrap distribution. + +{{< slidebreak >}} + +:::{#exm-wcgs-marginal-rd} + +#### Example: CHD risk and behavioral pattern + +{{< include _subfiles/logistic-regression/_sec-wcgs-bootstrap-example.qmd >}} + +::: + +#### Alternative: Using the boot package + +{{< include _subfiles/logistic-regression/_sec-bootstrap-boot-package.qmd >}} diff --git a/_subfiles/logistic-regression/_sec-logistic-RR-RD.qmd b/_subfiles/logistic-regression/_sec-logistic-RR-RD.qmd index 37601edea..5c07e2b39 100644 --- a/_subfiles/logistic-regression/_sec-logistic-RR-RD.qmd +++ b/_subfiles/logistic-regression/_sec-logistic-RR-RD.qmd @@ -1,16 +1,63 @@ - If you want to report risk differences or risk ratios instead of odds ratios, you can obtain estimates from logistic regression models, as long as you didn't stratify sampling by the outcome; in other words, not in case-control studies (see @sec-CC-studies). -To compute risk ratios from logistic regression models: +### Computing marginal risk differences + +::: {.callout-note} +This section is **optional for 2026** and will not be tested on exams. +::: + +#### Conceptual approach -- Apply the expit function to the linear predictor for each covariate pattern -to compute the (estimated) risks, +To compute risk ratios or risk differences from logistic regression models: + +- Apply the expit function (@def-expit) + to the linear predictor + for each covariate pattern + to compute the (estimated) risks, - Then take the differences or ratios of the risks, as needed. +For example, +suppose we have a logistic regression model +with a binary exposure $A$ and a covariate $Z$: + +$$ +\logit\{\Pr(Y = 1 | A, Z)\} = \b_0 + \b_A A + \b_Z Z +$$ + +The predicted risk for an individual with exposure $A = a$ and covariate $Z = z$ is: + +$$ +\hat\pi(a, z) = \expit(\hb_0 + \hb_A a + \hb_Z z) +$$ + +The **risk difference** comparing exposure levels $a = 1$ vs. $a = 0$ +at a fixed covariate value $z$ is: + +$$ +\widehat{\text{RD}}(z) = \hat\pi(1, z) - \hat\pi(0, z) +$$ + +The **risk ratio** comparing exposure levels $a = 1$ vs. $a = 0$ +at a fixed covariate value $z$ is: + +$$ +\widehat{\text{RR}}(z) = \frac{\hat\pi(1, z)}{\hat\pi(0, z)} +$$ + +#### Marginal vs. conditional effects + +{{< include _subfiles/logistic-regression/_sec-marginal-vs-conditional.qmd >}} + +{{< slidebreak >}} + +{{< include _subfiles/logistic-regression/_sec-marginal-point-estimate.qmd >}} + +#### Standard errors and confidence intervals + To quantify uncertainty for risk difference or risk ratio estimates derived from logistic regression models (e.g., to calculate SEs, CIs, and p-values), @@ -19,126 +66,19 @@ bootstrap, the multivariate delta method, or some other special technique. -::: notes -Adapted from -[@vittinghoff2e, Section 3.6, p. 62], -[@hastie2009esl2e, Section 7.11, p. 249], -and -[@james2021islr2e, Chapter 5, Section 5.2, p. 209]. -::: +The **bootstrap** is generally the most straightforward approach +and does not require deriving complex formulas. -#### Bootstrap workflow for derived risk measures - -When using nonparametric bootstrap inference -for a derived quantity -like a marginal risk difference or risk ratio: - -1. Fit the logistic model on the observed data. -2. Compute the target estimand - (for example, a g-computation risk difference or risk ratio). -3. Repeat for - $b = 1, \ldots, B$: - draw a bootstrap sample of size $n$ - with replacement, - refit the model, - and recompute the target estimand. -4. Use the empirical standard deviation of the - $B$ bootstrap estimates - as the bootstrap standard error. -5. Build a confidence interval from the bootstrap distribution - (percentile or BCa intervals are common when the bootstrap distribution is skewed). - -In practice, -larger $B$ is usually needed for interval estimation than for standard errors; -for percentile or BCa intervals, -it is common to use at least 1,000 resamples, -and often 2,000 or more for stability. - -::: {#exm-wcgs-bootstrap-rd} - -#### Bootstrap CI for marginal risk difference and risk ratio (WCGS) - -We estimate the marginal risk difference and risk ratio -comparing Type A vs. Type B personality patterns in the WCGS study, -using g-computation to derive the estimands -and the nonparametric bootstrap to quantify uncertainty. - -**Steps 1–2: Point estimates via g-computation** - -```{r} -#| label: exm-boot-gcomp-point -#| code-fold: true -# Fit logistic model: CHD ~ behavioral pattern + age -fit_wcgs <- glm( - chd69 ~ dibpat + age, - family = binomial, - data = wcgs -) - -# G-computation: set everyone to Type A, then to Type B, -# and average the predicted risks -dibpat_levels <- levels(wcgs$dibpat) - -newdata_A <- wcgs |> mutate(dibpat = factor("Type A", levels = dibpat_levels)) -newdata_B <- wcgs |> mutate(dibpat = factor("Type B", levels = dibpat_levels)) - -risk_A <- mean(predict(fit_wcgs, newdata = newdata_A, type = "response")) -risk_B <- mean(predict(fit_wcgs, newdata = newdata_B, type = "response")) - -tibble::tibble( - measure = c( - "Risk (Type A)", "Risk (Type B)", - "Risk difference", "Risk ratio" - ), - estimate = c(risk_A, risk_B, risk_A - risk_B, risk_A / risk_B) -) -``` - -**Steps 3–5: Bootstrap SE and BCa CIs** - -```{r} -#| label: exm-boot-gcomp-ci -#| code-fold: true -library(boot) -library(dplyr) - -boot_gcomp <- function(data, indices) { - d <- data[indices, ] - fit <- glm(chd69 ~ dibpat + age, family = binomial, data = d) - lev <- levels(d$dibpat) - risk_A <- mean( - predict(fit, newdata = mutate(d, dibpat = factor("Type A", levels = lev)), - type = "response") - ) - risk_B <- mean( - predict(fit, newdata = mutate(d, dibpat = factor("Type B", levels = lev)), - type = "response") - ) - c(RD = risk_A - risk_B, RR = risk_A / risk_B) -} - -set.seed(42) -boot_wcgs <- boot(data = wcgs, statistic = boot_gcomp, R = 1000) - -boot_wcgs -``` - -```{r} -#| label: exm-boot-gcomp-ci-rd -#| code-fold: true -# BCa CI for risk difference (index = 1) -boot.ci(boot_wcgs, type = "bca", index = 1) -``` - -```{r} -#| label: exm-boot-gcomp-ci-rr -#| code-fold: true -# BCa CI for risk ratio (index = 2) -boot.ci(boot_wcgs, type = "bca", index = 2) -``` - -The bootstrap standard errors and BCa intervals -provide valid inference for both the risk difference and risk ratio -without requiring any closed-form delta-method derivation. +### Bootstrap inference -::: +{{< include _subfiles/logistic-regression/_sec-bootstrap-inference.qmd >}} + +### Computing marginal risk ratios + +{{< include _subfiles/logistic-regression/_sec-marginal-rr.qmd >}} + +{{< slidebreak >}} + +### Notes and caveats + +{{< include _subfiles/logistic-regression/_sec-marginal-notes.qmd >}} diff --git a/_subfiles/logistic-regression/_sec-logistic-marginal-considerations.qmd b/_subfiles/logistic-regression/_sec-logistic-marginal-considerations.qmd new file mode 100644 index 000000000..7ef9cbdbd --- /dev/null +++ b/_subfiles/logistic-regression/_sec-logistic-marginal-considerations.qmd @@ -0,0 +1,53 @@ +**Important considerations when computing marginal effects:** + +1. **Target population**: + The marginal effect is specific to the covariate distribution + in your sample. + If your sample is not representative of your target population, + you may want to use standardization or weighting. + +2. **Collapsibility** (see @def-collapsibility): + Risk differences are **collapsible**: + the causal marginal RD ($\pi_1 - \pi_0$) equals the average + of stratum-specific causal RDs ($\pi(1,z) - \pi(0,z)$) + weighted by the marginal distribution $f_Z$, + + $$\pi_1 - \pi_0 + = \E{\pi(1, Z)} - \E{\pi(0, Z)} + = \E{\pi(1, Z) - \pi(0, Z)} + = \E{\text{RD}(Z)},$$ + + so there is no collapsibility bias from averaging over a covariate $Z$. + Risk ratios are **not** collapsible in general either: + the marginal RR is not a weighted average of conditional RRs, + because a ratio of expectations is not the same as an expectation of ratios: + + $$\frac{\pi_1}{\pi_0} + = \frac{\E{\pi(1, Z)}}{\E{\pi(0, Z)}} + \neq \E{\frac{\pi(1, Z)}{\pi(0, Z)}} + = \E{\text{RR}(Z)}.$$ + + Odds ratios are non-collapsible in an even stronger sense: + the marginal OR is not any weighted average of stratum-specific odds ratios, + and can differ from every conditional OR + even when there is no confounding (see @exm-collapsibility). + +3. **Bootstrap sample size**: + Use at least 1000 bootstrap iterations for standard errors, + and at least 2000 for confidence intervals. + More iterations provide more stable estimates + but take longer to compute. + +4. **Alternative methods**: + The multivariate delta method can also be used + to compute standard errors analytically, + but the formulas are complex. + The [`margins`](https://cran.r-project.org/package=margins) package in R + provides implementations of the delta method + for marginal effects. + +5. **Case-control studies**: + These methods are **not valid** for case-control studies, + where sampling is stratified by outcome. + In case-control designs, + only odds ratios can be estimated consistently. diff --git a/_subfiles/logistic-regression/_sec-marginal-notes.qmd b/_subfiles/logistic-regression/_sec-marginal-notes.qmd new file mode 100644 index 000000000..e1930d6b3 --- /dev/null +++ b/_subfiles/logistic-regression/_sec-marginal-notes.qmd @@ -0,0 +1,23 @@ +:::{#def-collapsibility} + +#### Collapsibility + +An estimand $\theta$ is **collapsible** with respect to a covariate $Z$ +if the population-level estimand equals the marginal-distribution-weighted average +of stratum-specific estimands: + +$$\theta = \E{\theta(Z)} = \int \theta(z)\, f_Z(z)\, dz.$$ + +**Collapsibility bias** is the discrepancy that arises when $\theta \neq \E{\theta(Z)}$. + +::: + +{{< slidebreak >}} + +{{< include _subfiles/logistic-regression/_thm-collapsibility.qmd >}} + +{{< include _subfiles/logistic-regression/_sec-logistic-marginal-considerations.qmd >}} + +{{< slidebreak >}} + +{{< include _subfiles/logistic-regression/_exm-collapsibility.qmd >}} diff --git a/_subfiles/logistic-regression/_sec-marginal-point-estimate.qmd b/_subfiles/logistic-regression/_sec-marginal-point-estimate.qmd new file mode 100644 index 000000000..24e21ab7d --- /dev/null +++ b/_subfiles/logistic-regression/_sec-marginal-point-estimate.qmd @@ -0,0 +1,108 @@ +:::{#exm-marginal-point-estimate} + +#### Numerical point-estimate example + +Using the WCGS data setup and model, +we can estimate the **causal marginal risks** +(i.e., potential probabilities @def-potential-probability) +via g-computation (@def-g-computation), +by averaging model predictions over the observed covariate distribution: + +```{r} +#| label: wcgs-data-load-and-model-fit +#| echo: false +#| message: false +#| warning: false + +library(dplyr) +library(haven) +library(ggplot2) + +# Load and prepare WCGS data +wcgs_path <- here::here("Data", "wcgs.rda") +if (!file.exists(wcgs_path)) { + stop( + "Could not find WCGS data file at: ", wcgs_path, + ". This file should be in the repository's Data directory. ", + paste0( + "Run this chapter from the repository root and verify ", + "that Data/wcgs.rda is present." + ) + ) +} +wcgs_env <- new.env(parent = emptyenv()) +invisible(load(wcgs_path, envir = wcgs_env)) +wcgs <- wcgs_env$wcgs + +wcgs_clean <- wcgs |> + mutate( + dibpat = dibpat |> + as_factor() |> + relevel(ref = "Type B"), + chd69_binary = as.numeric(chd69 == "Yes"), + across(where(is.labelled), haven::as_factor) + ) |> + filter(!is.na(chd69_binary), !is.na(dibpat), !is.na(age)) + +fit <- glm( + chd69_binary ~ dibpat + age, + data = wcgs_clean, + family = binomial(link = "logit") +) + +dibpat_levels <- levels(wcgs_clean$dibpat) +wcgs_counterfactual_a <- wcgs_clean |> + mutate( + dibpat = factor("Type A", levels = dibpat_levels) + ) +wcgs_counterfactual_b <- wcgs_clean |> + mutate( + dibpat = factor("Type B", levels = dibpat_levels) + ) + +pi1_hat <- mean( + predict(fit, newdata = wcgs_counterfactual_a, type = "response") +) +pi0_hat <- mean( + predict(fit, newdata = wcgs_counterfactual_b, type = "response") +) +rd_hat <- pi1_hat - pi0_hat + +pi1_obs <- mean(wcgs_clean$chd69_binary[wcgs_clean$dibpat == "Type A"]) +pi0_obs <- mean(wcgs_clean$chd69_binary[wcgs_clean$dibpat == "Type B"]) +``` + +The **observed marginal risks** +$\hat\pi(a) = \Pr(Y=1 \mid A=a)$ +estimated directly from the data are: + +$$ +\hat\pi(1) = `r round(pi1_obs, 3)`, \quad +\hat\pi(0) = `r round(pi0_obs, 3)` +$$ + +The **causal marginal risks** +$\hat\pi_a = \frac{1}{n}\sumin \hat\pi(a, z_i)$ +standardized over the full covariate distribution are: + +$$ +\hat\pi_1 = `r round(pi1_hat, 3)`, \quad +\hat\pi_0 = `r round(pi0_hat, 3)` +$$ + +Note that $\hat\pi(1) \neq \hat\pi_1$ and $\hat\pi(0) \neq \hat\pi_0$: +because age is associated with behavioral pattern (Type A and Type B men +have different age distributions), the observed marginal and causal marginal +risks differ. +Standardization via g-computation adjusts for this difference. + +The causal marginal risk difference estimate is: + +$$ +\widehat{\text{Causal Marginal RD}} = +\hat\pi_1 - \hat\pi_0 = +`r round(pi1_hat, 3)` - `r round(pi0_hat, 3)` = +`r round(rd_hat, 3)` +$$ + +::: diff --git a/_subfiles/logistic-regression/_sec-marginal-rr.qmd b/_subfiles/logistic-regression/_sec-marginal-rr.qmd new file mode 100644 index 000000000..b50bae7e9 --- /dev/null +++ b/_subfiles/logistic-regression/_sec-marginal-rr.qmd @@ -0,0 +1,72 @@ +The same approach can be used to compute marginal risk ratios. +We simply modify our statistic function +to compute ratios instead of differences: + +```{r} +#| label: marginal-rr-example +#| code-fold: show +#| eval: false + +library(boot) + +compute_marginal_rr <- function(model, + data, + exposure_var, + exposed_level, + unexposed_level) { + counterfactual_data <- make_counterfactual_datasets( + data = data, + exposure_var = exposure_var, + exposed_level = exposed_level, + unexposed_level = unexposed_level + ) + data_exposed <- counterfactual_data$data_exposed + data_unexposed <- counterfactual_data$data_unexposed + + risk_exposed <- predict( + model, + newdata = data_exposed, + type = "response" + ) + + risk_unexposed <- predict( + model, + newdata = data_unexposed, + type = "response" + ) + + mean(risk_exposed) / mean(risk_unexposed) +} + +statistic_fn_rr <- function(data, indices) { + boot_data <- data[indices, ] + boot_fit <- glm( + chd69_binary ~ dibpat + age, + data = boot_data, + family = binomial(link = "logit") + ) + compute_marginal_rr( + model = boot_fit, + data = boot_data, + exposure_var = "dibpat", + exposed_level = "Type A", + unexposed_level = "Type B" + ) +} + +set.seed(20260512) +boot_results_rr <- boot( + data = wcgs_clean, + statistic = statistic_fn_rr, + # Keep the rendered example fast; increase to 2000+ for final analyses. + R = 300 +) + +boot_ci_rr <- boot.ci(boot_results_rr, type = c("perc", "bca")) + +cat("Marginal Risk Ratio (Type A vs Type B):", + round(boot_results_rr$t0, 3), "\n") +cat("95% Percentile CI: (", + round(boot_ci_rr$percent[4], 3), ",", + round(boot_ci_rr$percent[5], 3), ")\n") +``` diff --git a/_subfiles/logistic-regression/_sec-marginal-vs-conditional.qmd b/_subfiles/logistic-regression/_sec-marginal-vs-conditional.qmd new file mode 100644 index 000000000..f61f46a32 --- /dev/null +++ b/_subfiles/logistic-regression/_sec-marginal-vs-conditional.qmd @@ -0,0 +1,149 @@ +Note that these risk differences and risk ratios are **conditional** on the covariate value $z$. +In many applications, +we want to summarize the effect across the population, +which requires computing a **marginal** risk difference or risk ratio. + +{{< slidebreak >}} + +:::{#def-observed-marginal-risk} + +#### Observed marginal risk + +The **observed marginal risk** under exposure level $A = a$ is: + +$$\pi(a) \eqdef \Pr(Y = 1 \mid A = a)$$ + +::: + +{{< slidebreak >}} + +{{< include _subfiles/logistic-regression/_thm-observed-marginal-risk.qmd >}} + +{{< slidebreak >}} + +:::{#def-observed-marginal-rd} + +#### Observed marginal risk difference + +$$ +\text{Observed Marginal RD} \eqdef \pi(1) - \pi(0) +$$ + +::: + +In causal inference, +[potential outcomes](causal-inference.qmd#def-potential-outcomes) $Y^a$ +represent the outcome each unit would have under exposure level $a$. +Population-level causal estimands summarize these potential outcomes: + +{{< slidebreak >}} + +:::{#def-potential-mean} + +#### Potential mean + +The **potential mean** under treatment $A = a$ is: + +$$\mu_a \eqdef \E{Y^a}$$ + +This is the expected value of the potential outcome $Y^a$, +representing what the average outcome would be +if the entire population were assigned treatment $a$. + +::: + +{{< slidebreak >}} + +:::{#def-potential-probability} + +#### Potential probability + +For a binary outcome $Y \in \{0, 1\}$, +the **potential probability** under treatment $A = a$ is: + +$$\pi_a \eqdef \Pr(Y^a = 1) = \E{Y^a}$$ + +This is a special case of the potential mean (@def-potential-mean): +since $Y$ is binary, the expectation equals the probability. + +::: + +{{< slidebreak >}} + +:::{#def-causal-marginal-risk} + +#### Causal marginal risk + +The **causal marginal risk** (also: **potential marginal risk**) +under exposure level $a$ is the potential probability (@def-potential-probability): + +$$\pi_a = \Pr(Y^a = 1)$$ + +::: + +We use the term **causal marginal risk** throughout. + +Under causal identification assumptions +(see [Consistency](causal-inference.qmd#def-consistency), +[Exchangeability](causal-inference.qmd#def-exchangeability), +and [Positivity](causal-inference.qmd#def-positivity)), +the causal marginal risk $\pi_a$ (@def-causal-marginal-risk) +can be expressed in terms of observed data: + +{{< slidebreak >}} + +{{< include _subfiles/logistic-regression/_thm-causal-marginal-risk.qmd >}} + +{{< slidebreak >}} + +:::{#def-causal-marginal-rd} + +#### Causal marginal risk difference + +$$ +\text{Causal Marginal RD} \eqdef \pi_1 - \pi_0 +$$ + +::: + +{{< slidebreak >}} + +:::{#def-g-computation} + +#### G-computation + +**G-computation** (also called the **g-formula** or **standardization**) +is a procedure for estimating the causal marginal risk $\pi_a$ +(@def-causal-marginal-risk) from observed data: + +1. Fit a regression model for $\hat\pi(a, z) = \widehat{\Pr}(Y = 1 \mid A = a, Z = z)$ +2. For each observation $i$, predict the risk under exposure level $a$: $\hat\pi(a, z_i)$ +3. Average the predicted risks over the observed covariate distribution: + +$$\hat\pi_a = \frac{1}{n}\sumin \hat\pi(a, z_i)$$ + +Under identification assumptions (@thm-causal-marginal-risk), +$\hat\pi_a$ consistently estimates $\pi_a = \E{\pi(a, Z)}$. + +::: + +{{< slidebreak >}} + +:::{#def-predictive-margins} + +#### Predictive margins + +**Predictive margins** are model-predicted probabilities averaged over a +covariate distribution. + +- Averaged over the **full sample** distribution of $Z$: + estimates the causal marginal risk $\pi_a$ (@def-causal-marginal-risk) + — equivalent to g-computation (@def-g-computation) +- Averaged over the **subgroup** with $A_i = a$: + estimates the observed marginal risk $\pi(a)$ (@def-observed-marginal-risk) + (under correct model specification) + +The term "predictive margins" is commonly used in social science +and in Stata's `margins` command. + +::: diff --git a/_subfiles/logistic-regression/_sec-wcgs-bootstrap-example.qmd b/_subfiles/logistic-regression/_sec-wcgs-bootstrap-example.qmd new file mode 100644 index 000000000..13161266d --- /dev/null +++ b/_subfiles/logistic-regression/_sec-wcgs-bootstrap-example.qmd @@ -0,0 +1,35 @@ +We'll use the Western Collaborative Group Study (WCGS) data +to illustrate computing marginal risk differences from a logistic regression model. + +The research question is: +what is the marginal effect of Type A behavioral pattern on CHD risk, +adjusting for age? + +##### Fit the model + +We reuse `wcgs_clean` and `fit` +from the setup chunk above: + +```{r} +#| label: fit-wcgs-model +#| code-fold: true + +broom::tidy(fit, conf.int = TRUE) |> + pander::pander() +``` + +##### Compute the point estimate + +{{< include _subfiles/logistic-regression/_sec-wcgs-rd-compute.qmd >}} + +##### Bootstrap confidence interval + +{{< include _subfiles/logistic-regression/_sec-wcgs-rd-bootstrap.qmd >}} + +##### Visualize the bootstrap distribution + +{{< include _subfiles/logistic-regression/_sec-wcgs-rd-viz.qmd >}} + +##### Interpretation + +{{< include _subfiles/logistic-regression/_sec-wcgs-rd-interpretation.qmd >}} diff --git a/_subfiles/logistic-regression/_sec-wcgs-rd-bootstrap.qmd b/_subfiles/logistic-regression/_sec-wcgs-rd-bootstrap.qmd new file mode 100644 index 000000000..e6b7b9f6d --- /dev/null +++ b/_subfiles/logistic-regression/_sec-wcgs-rd-bootstrap.qmd @@ -0,0 +1,76 @@ +Now we use the bootstrap to estimate +the standard error and construct a 95% confidence interval. +In the non-parametric bootstrap, +each iteration applies both the model-fitting step +and the covariate standardization to the same bootstrap resample. +This mirrors the original world in the bootstrap world +and is standard practice for this type of estimator. +(An alternative would fix the covariates at the observed values, +but is less common for g-computation estimators.) + +```{r} +#| label: bootstrap-ci +#| code-fold: show +#| cache: true +#| message: false +#| warning: false + +# Set seed for reproducibility +set.seed(20260512) + +# Number of bootstrap iterations in this rendered example +# (illustrative only; use >= 2000 for final CIs) +B <- 300 + +boot_estimates <- numeric(B) + +for (b in 1:B) { + + boot_indices <- sample( + seq_len(nrow(wcgs_clean)), + size = nrow(wcgs_clean), + replace = TRUE + ) + + boot_data <- wcgs_clean[boot_indices, ] + + boot_fit <- glm( + chd69_binary ~ dibpat + age, + data = boot_data, + family = binomial(link = "logit") + ) + + boot_estimates[b] <- compute_marginal_rd( + model = boot_fit, + data = boot_data, + exposure_var = "dibpat", + exposed_level = "Type A", + unexposed_level = "Type B" + ) +} + +boot_se <- sd(boot_estimates) + +# percentile method: quantiles of the bootstrap distribution +ci_lower <- as.numeric(quantile(boot_estimates, 0.025)) +ci_upper <- as.numeric(quantile(boot_estimates, 0.975)) +``` + +::: {#tbl-marginal-rd-results} + +```{r} +#| code-fold: true + +tibble( + Estimate = marginal_rd_est, + `Std. Error` = boot_se, + `95% CI Lower` = ci_lower, + `95% CI Upper` = ci_upper +) |> + knitr::kable(digits = 4) +``` + +Marginal risk difference for Type A vs Type B behavioral pattern +(adjusted for age) with bootstrap standard error and 95% confidence interval + +::: diff --git a/_subfiles/logistic-regression/_sec-wcgs-rd-compute.qmd b/_subfiles/logistic-regression/_sec-wcgs-rd-compute.qmd new file mode 100644 index 000000000..508fadcd2 --- /dev/null +++ b/_subfiles/logistic-regression/_sec-wcgs-rd-compute.qmd @@ -0,0 +1,81 @@ +We define helper functions for g-computation — +these are reused in the bootstrap loop below — +and extract the point estimate already computed in the setup chunk above +for the contrast comparing Type A to Type B behavioral patterns: + +```{r} +#| label: compute-marginal-rd +#| code-fold: true + +make_counterfactual_datasets <- function(data, + exposure_var, + exposed_level, + unexposed_level) { + if (!is.factor(data[[exposure_var]])) { + stop("The exposure variable must be a factor.") + } + exposure_levels <- levels(data[[exposure_var]]) + if (!all(c(exposed_level, unexposed_level) %in% exposure_levels)) { + stop( + sprintf( + paste0( + "Specified exposure levels '%s' and '%s' ", + "are not in factor levels: %s" + ), + exposed_level, + unexposed_level, + paste(exposure_levels, collapse = ", ") + ) + ) + } + + data_exposed <- data + data_exposed[[exposure_var]] <- factor( + rep(exposed_level, nrow(data_exposed)), + levels = exposure_levels + ) + + data_unexposed <- data + data_unexposed[[exposure_var]] <- factor( + rep(unexposed_level, nrow(data_unexposed)), + levels = exposure_levels + ) + + list( + data_exposed = data_exposed, + data_unexposed = data_unexposed + ) +} + +compute_marginal_rd <- function(model, + data, + exposure_var, + exposed_level, + unexposed_level) { + counterfactual_data <- make_counterfactual_datasets( + data = data, + exposure_var = exposure_var, + exposed_level = exposed_level, + unexposed_level = unexposed_level + ) + data_exposed <- counterfactual_data$data_exposed + data_unexposed <- counterfactual_data$data_unexposed + + risk_exposed <- predict( + model, + newdata = data_exposed, + type = "response" + ) + + risk_unexposed <- predict( + model, + newdata = data_unexposed, + type = "response" + ) + + mean(risk_exposed) - mean(risk_unexposed) +} + +# alias to rd_hat so both computation paths stay in sync +marginal_rd_est <- rd_hat +``` diff --git a/_subfiles/logistic-regression/_sec-wcgs-rd-interpretation.qmd b/_subfiles/logistic-regression/_sec-wcgs-rd-interpretation.qmd new file mode 100644 index 000000000..481931d01 --- /dev/null +++ b/_subfiles/logistic-regression/_sec-wcgs-rd-interpretation.qmd @@ -0,0 +1,21 @@ +Under the identification assumptions +([Consistency](causal-inference.qmd#def-consistency), +[Exchangeability](causal-inference.qmd#def-exchangeability), +and [Positivity](causal-inference.qmd#def-positivity)), +we estimate that Type A behavioral pattern +causally increases CHD risk by +`r round(marginal_rd_est * 100, 2)` percentage points +compared to Type B behavioral pattern, +after standardizing over the observed age distribution +(95% CI: +`r round(ci_lower * 100, 2)`% to +`r round(ci_upper * 100, 2)`%). + +Because this rendered example uses `r B` bootstrap samples, +the displayed confidence interval is illustrative. +For final inferential reporting, +rerun with at least 2000 bootstrap samples. + +A 95% CI that excludes zero does not by itself imply +practical or clinical importance; +also interpret the estimated risk difference magnitude. diff --git a/_subfiles/logistic-regression/_sec-wcgs-rd-viz.qmd b/_subfiles/logistic-regression/_sec-wcgs-rd-viz.qmd new file mode 100644 index 000000000..49121eda9 --- /dev/null +++ b/_subfiles/logistic-regression/_sec-wcgs-rd-viz.qmd @@ -0,0 +1,51 @@ +We can visualize the bootstrap distribution +to assess the sampling distribution of our estimate: + +::: {#fig-bootstrap-dist} + +```{r} +#| code-fold: true +#| fig-width: 7 +#| fig-height: 4 + +tibble(boot_rd = boot_estimates) |> + ggplot() + + aes(x = boot_rd) + + geom_histogram( + bins = 50, + fill = "steelblue", + alpha = 0.7, + color = "black" + ) + + geom_vline( + xintercept = marginal_rd_est, + color = "red", + linetype = "dashed", + linewidth = 1 + ) + + geom_vline( + xintercept = ci_lower, + color = "darkgreen", + linetype = "dotted", + linewidth = 1 + ) + + geom_vline( + xintercept = ci_upper, + color = "darkgreen", + linetype = "dotted", + linewidth = 1 + ) + + labs( + x = "Bootstrap Marginal Risk Difference", + y = "Frequency", + title = "Bootstrap Distribution of Marginal Risk Difference" + ) + + theme_minimal() +``` + +Bootstrap distribution of the marginal risk difference +(Type A vs Type B behavioral pattern, adjusted for age). +Red dashed line: point estimate. +Green dotted lines: 95% CI bounds. + +::: diff --git a/_subfiles/logistic-regression/_thm-causal-marginal-risk.qmd b/_subfiles/logistic-regression/_thm-causal-marginal-risk.qmd new file mode 100644 index 000000000..318ebca4c --- /dev/null +++ b/_subfiles/logistic-regression/_thm-causal-marginal-risk.qmd @@ -0,0 +1,39 @@ +:::{#thm-causal-marginal-risk} + +#### Causal marginal risk under identification assumptions + +Under [Consistency](causal-inference.qmd#def-consistency), +[Conditional Exchangeability](causal-inference.qmd#def-exchangeability), +and [Positivity](causal-inference.qmd#def-positivity): + +$$\pi_a = \E{\teal{\pi(a, Z)}}$$ + +where the expectation is over the **marginal** distribution of $Z$ +(not the conditional distribution $\p(Z=z \mid A=a)$). + +::: {.proof} + +$$ +\ba +\pi_a +&= \Pr(Y^a = 1) +\\ +&= \int \Pr(Y^a = 1, Z=z)\, dz +\\ +&= \int \Pr(Y^a = 1 \mid Z=z)\, \red{\p(Z=z)}\, dz +\\ +&= \int \Pr(Y^a = 1 \mid A=a, Z=z)\, \red{\p(Z=z)}\, dz +&& \text{[by conditional exchangeability]} +\\ +&= \int \Pr(Y=1 \mid A=a, Z=z)\, \red{\p(Z=z)}\, dz +&& \text{[by consistency]} +\\ +&= \int \teal{\pi(a, z)}\, \red{\p(Z=z)}\, dz +\\ +&= \E{\teal{\pi(a, Z)}} +\ea +$$ + +::: + +::: diff --git a/_subfiles/logistic-regression/_thm-collapsibility.qmd b/_subfiles/logistic-regression/_thm-collapsibility.qmd new file mode 100644 index 000000000..914e33fb8 --- /dev/null +++ b/_subfiles/logistic-regression/_thm-collapsibility.qmd @@ -0,0 +1,22 @@ +:::{#thm-collapsibility} + +#### Which estimands are collapsible? + +For **causal** estimands (@def-causal-marginal-risk) defined via g-computation, +@def-collapsibility uses the **marginal** distribution $f_Z(z)$ as weights. +Additive (difference) measures are collapsible by linearity of expectation: + +$$\pi_1 - \pi_0 + = \E{\pi(1,Z)} - \E{\pi(0,Z)} + = \E{\pi(1,Z) - \pi(0,Z)}.$$ + +Ratio and log-odds measures are not collapsible in general +(see @exm-collapsibility). + +For **observational** estimands (@def-observed-marginal-risk), +the marginal measure uses the conditional distribution +$f_{Z \mid A}(z \mid a)$ as weights rather than the marginal $f_Z(z)$. +Collapsibility with respect to $f_Z$ therefore holds +only when $Z \perp A$ (no confounding). + +::: diff --git a/_subfiles/logistic-regression/_thm-observed-marginal-risk.qmd b/_subfiles/logistic-regression/_thm-observed-marginal-risk.qmd new file mode 100644 index 000000000..7fa764df5 --- /dev/null +++ b/_subfiles/logistic-regression/_thm-observed-marginal-risk.qmd @@ -0,0 +1,30 @@ +:::{#thm-observed-marginal-risk} + +#### Marginal risk as a conditional expectation + +By the law of total probability, + +$$\pi(a) = \E{\teal{\pi(a, Z)} \mid \red{A = a}}$$ + +where $\teal{\pi(a, z)} \eqdef \Pr(Y = 1 \mid A = a, Z = z)$ is the conditional risk function. + +::: {.proof} + +$$ +\ba +\pi(a) +&= \Pr(Y = 1 \mid A = a) +\\ +&= \int \Pr(Y = 1, Z=z \mid A = a)\, dz +\\ +&= \int \Pr(Y = 1 \mid Z=z, A = a)\, \red{\p(Z=z \mid A=a)}\, dz +\\ +&= \int \teal{\pi(a, z)}\, \red{\p(Z=z \mid A=a)}\, dz +\\ +&= \E{\teal{\pi(a, Z)} \mid \red{A = a}} +\ea +$$ + +::: + +::: diff --git a/inst/WORDLIST b/inst/WORDLIST index dfc2e21ed..f84ed85fd 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -11,6 +11,7 @@ CIF CLT Cady Cholangitis +Collapsibility Confounder DAGs De @@ -102,6 +103,7 @@ cdot cdots cmprsk coef +collapsibility collinear confounder counterfactual