From 60d3af326c996d0800e6811e32461d0976c85bed Mon Sep 17 00:00:00 2001 From: "claude[bot]" <41898282+claude[bot]@users.noreply.github.com> Date: Sat, 16 May 2026 06:57:30 +0000 Subject: [PATCH 01/34] revise count regression chapter (issue #747) - Fix duplicate "if" and "value is predicted" typos in overdispersion definition - Replace \mathbb{E} with \Expp macro in rate-ratio derivation for consistency - Replace \text{sign} with \signt macro in deviance residual formula and callout - Rewrite first-person language in Quasipoisson section - Move NB models explanatory content out of notes-only block so it renders in HTML/PDF - Add worked solutions for zero-inflation PMF and E/Var exercises - Flesh out inference section with explicit CI, hypothesis test, and LRT formulas Co-authored-by: Douglas Ezra Morrison --- .../count-regression/_sec-overdispersion.qmd | 4 +- .../count-regression/_sec_poisson_RRs.qmd | 2 +- .../count-regression/_sec_poisson_dx.qmd | 10 +-- .../_sec_poisson_inference.qmd | 34 ++++++-- .../count-regression/_sec_zero-inflation.qmd | 87 ++++++++++++++++++- chapters/count-regression.qmd | 20 +++-- 6 files changed, 136 insertions(+), 21 deletions(-) diff --git a/_subfiles/count-regression/_sec-overdispersion.qmd b/_subfiles/count-regression/_sec-overdispersion.qmd index 753b166b52..56331ec240 100644 --- a/_subfiles/count-regression/_sec-overdispersion.qmd +++ b/_subfiles/count-regression/_sec-overdispersion.qmd @@ -13,8 +13,8 @@ In practice, many count distributions will have a variance substantially larger A random variable $X$ is **overdispersed** relative to a model $\p(X=x)$ if -if its empirical variance in a dataset is larger than -the value is predicted by the fitted model $\hat{\p}(X=x)$. +its empirical variance in a dataset is larger than +the value predicted by the fitted model $\hat{\p}(X=x)$. :::: diff --git a/_subfiles/count-regression/_sec_poisson_RRs.qmd b/_subfiles/count-regression/_sec_poisson_RRs.qmd index 48b836b241..4a82cd0065 100644 --- a/_subfiles/count-regression/_sec_poisson_RRs.qmd +++ b/_subfiles/count-regression/_sec_poisson_RRs.qmd @@ -37,7 +37,7 @@ And accordingly, $$ \begin{aligned} \frac -{\mathbb{E}[Y |{\color{red}{X_j = a}}, \vX_{-j} = \vx_{-j}, T = t] +{\Expp[Y |{\color{red}{X_j = a}}, \vX_{-j} = \vx_{-j}, T = t] } { \Expp[Y |{\color{red}{X_j = b}}, \vX_{-j}=\vx_{-j},T=t] diff --git a/_subfiles/count-regression/_sec_poisson_dx.qmd b/_subfiles/count-regression/_sec_poisson_dx.qmd index dc91df9c45..f98a3d9c1b 100644 --- a/_subfiles/count-regression/_sec_poisson_dx.qmd +++ b/_subfiles/count-regression/_sec_poisson_dx.qmd @@ -17,17 +17,17 @@ where $h$ is the "leverage" (which we will continue to leave undefined). #### Deviance residuals $$ -d_k = \text{sign}(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\} +d_k = \signt(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\} $$ :::{.callout-note} -$$\text{sign}(x) \eqdef \frac{x}{|x|}$$ +$$\signt(x) \eqdef \frac{x}{|x|}$$ In other words: -* $\text{sign}(x) = -1$ if $x < 0$ -* $\text{sign}(x) = 0$ if $x = 0$ -* $\text{sign}(x) = 1$ if $x > 0$ +* $\signt(x) = -1$ if $x < 0$ +* $\signt(x) = 0$ if $x = 0$ +* $\signt(x) = 1$ if $x > 0$ ::::{.content-hidden} diff --git a/_subfiles/count-regression/_sec_poisson_inference.qmd b/_subfiles/count-regression/_sec_poisson_inference.qmd index 0851e1460e..f19a7288d8 100644 --- a/_subfiles/count-regression/_sec_poisson_inference.qmd +++ b/_subfiles/count-regression/_sec_poisson_inference.qmd @@ -1,26 +1,46 @@ ### Confidence intervals for regression coefficients and rate ratios -As usual: +A Wald 95% confidence interval for a single coefficient $\beta_j$ is: $$ -\beta \in \left[\ci \right] +\beta_j \in \left[\ci \right] $$ -Rate ratios: exponentiate CI endpoints +where $z_{1-\alpha/2} \approx 1.96$ for $\alpha = 0.05$. + +Because the log-rate scale is related to the rate scale by exponentiation, +we obtain a confidence interval for the rate ratio $e^{\beta_j}$ +by exponentiating both endpoints: $$ -\exp{\beta} \in \left[\exp{\ci} \right] +e^{\beta_j} \in \left[\exp{\ci} \right] $$ ### Hypothesis tests for regression coefficients +To test $H_0: \beta_j = \beta_0$ against a one- or two-sided alternative, +compute the Wald $z$-statistic: + $$ -z = \frac{\hat \beta - \beta_0}{\hse{\hb}} +z = \frac{\hat \beta_j - \beta_0}{\hse{\hat\beta_j}} $$ -Compare $z$ or $|z|$ to the tails of the standard Gaussian distribution, according to the null hypothesis. +and compare $z$ (one-sided) or $|z|$ (two-sided) to the tails of the +standard Gaussian distribution. +The most common null hypothesis is $H_0: \beta_j = 0$, +i.e., that covariate $j$ has no association with the outcome rate. ### Comparing nested models -log(likelihood ratio) tests, as usual. +To compare a smaller model (with $p_0$ parameters) to a larger model +(with $p_1 > p_0$ parameters), use the likelihood ratio test statistic: + +$$ +G^2 = 2\bigl[\ell(\hat\beta_1) - \ell(\hat\beta_0)\bigr] +$$ + +Under $H_0$ that the additional $p_1 - p_0$ parameters are all zero, +$G^2 \dot\sim \chi^2_{p_1 - p_0}$. + +See @dobson4e [Chapter 9] and @vittinghoff2e [§8.1] for details. diff --git a/_subfiles/count-regression/_sec_zero-inflation.qmd b/_subfiles/count-regression/_sec_zero-inflation.qmd index c49dfd0336..15bd551c66 100644 --- a/_subfiles/count-regression/_sec_zero-inflation.qmd +++ b/_subfiles/count-regression/_sec_zero-inflation.qmd @@ -34,9 +34,94 @@ $$ Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x,T=t)$ and $P(Y=y|Z=0,X=x,T=t)$. ::: +::: {.solution} + +Let $\pi = \P(Z=1|X=x)$ and $\mu = \Expp[Y|Z=0,X=x,T=t]$. + +**$P(Y=0)$:** $Y=0$ occurs either because $Z=1$ (always zero) +or because $Z=0$ and the Poisson draw equals 0: + +$$ +\ba +\P(Y=0|X=x,T=t) +&= \P(Z=1|X=x) + \P(Z=0|X=x)\,\P(Y=0|Z=0,X=x,T=t)\\ +&= \pi + (1-\pi)\,e^{-\mu} +\ea +$$ + +**$P(Y=1)$:** $Z=1$ can never produce $Y=1$, so: + +$$ +\P(Y=1|X=x,T=t) += (1-\pi)\,\P(Y=1|Z=0,X=x,T=t) += (1-\pi)\,\mu e^{-\mu} +$$ + +**$P(Y=y)$ for $y \geq 1$:** Identical reasoning gives + +$$ +\P(Y=y|X=x,T=t) += (1-\pi)\,\frac{\mu^y e^{-\mu}}{y!} +$$ + +::: + --- ::: {#exr-zinf-moments} -Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $P(Z=1|X=x,T=t)$ and $\Expp[Y|Z=0,X=x,T=t]$. +Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $\pi = P(Z=1|X=x,T=t)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. +::: + +::: {.solution} + +Let $\pi = \P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. + +**Expected value.** By the Law of Total Expectation: + +$$ +\ba +\Expp[Y|X=x,T=t] +&= \Expp[Y|Z=1]\,\pi + \Expp[Y|Z=0]\,(1-\pi)\\ +&= 0 \cdot \pi + \mu_0(1-\pi)\\ +&= (1-\pi)\,\mu_0 +\ea +$$ + +**Variance.** By the Law of Total Variance: + +$$ +\Var{Y} = \Expp[\Var{Y|Z}] + \Var{\Expp[Y|Z]} +$$ + +For the first term, since $\Var{Y|Z=1}=0$ and $\Var{Y|Z=0}=\mu_0$ (Poisson): + +$$ +\Expp[\Var{Y|Z}] = 0 \cdot \pi + \mu_0(1-\pi) = (1-\pi)\mu_0 +$$ + +For the second term, $\Expp[Y|Z]$ takes the value 0 (with prob $\pi$) +or $\mu_0$ (with prob $1-\pi$), so: + +$$ +\ba +\Var{\Expp[Y|Z]} +&= \pi(0 - (1-\pi)\mu_0)^2 + (1-\pi)(\mu_0 - (1-\pi)\mu_0)^2\\ +&= \pi(1-\pi)^2\mu_0^2 + (1-\pi)\pi^2\mu_0^2\\ +&= \pi(1-\pi)\mu_0^2[\,(1-\pi)+\pi\,]\\ +&= \pi(1-\pi)\mu_0^2 +\ea +$$ + +Combining: + +$$ +\Var{Y|X=x,T=t} = (1-\pi)\mu_0 + \pi(1-\pi)\mu_0^2 += (1-\pi)\mu_0\bigl(1 + \pi\mu_0\bigr) +$$ + +Since $(1-\pi)\mu_0\bigl(1+\pi\mu_0\bigr) \geq (1-\pi)\mu_0 = \Expp[Y|X,T]$ for any $\pi > 0$, +zero-inflated count models always exhibit overdispersion relative to a Poisson model +with the same mean. + ::: diff --git a/chapters/count-regression.qmd b/chapters/count-regression.qmd index e302040286..bc3e5cee76 100644 --- a/chapters/count-regression.qmd +++ b/chapters/count-regression.qmd @@ -72,13 +72,17 @@ This content is adapted from: ::: notes There are alternatives to the Poisson model. -Most notably, +Most notably, the [negative binomial model](probability.qmd#sec-nb-dist). +::: -We can still model $\mu$ as a function of $X$ and $T$ as before, -and we can combine this model with zero-inflation +The [negative binomial distribution](probability.qmd#sec-nb-dist) +is a common alternative to the Poisson distribution for count outcomes. +It adds a dispersion parameter that allows the variance to exceed the mean, +making it more flexible when overdispersion is present. +We can still model $\mu$ as a function of $X$ and $T$ as before, +and we can combine this model with zero-inflation (as the conditional distribution for the non-zero component). -::: --- @@ -88,7 +92,13 @@ and we can combine this model with zero-inflation ## Quasipoisson -An alternative to Negative binomial is the "quasipoisson" distribution. I've never used it, but it seems to be a method-of-moments type approach rather than maximum likelihood. It models the variance as $\Var{Y} = \mu\theta$, and estimates $\theta$ accordingly. +An alternative to the negative binomial model is the quasi-Poisson approach. +Rather than specifying a full probability distribution, +it uses a method-of-moments approach rather than maximum likelihood estimation. +It models the variance as $\Var{Y} = \mu\theta$ +and estimates the dispersion parameter $\theta$ accordingly. +This approach is simpler to implement but provides less information +than the full negative binomial likelihood. See `?quasipoisson` in R for more. From 3323cd564b90988cbb4103c9001df61404e99f05 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 2 Jun 2026 07:10:53 +0000 Subject: [PATCH 02/34] Fix wrong variables in needle-sharing exploratory and prediction figures MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit fig-needles plotted y = shsyryn (binary yes/no indicator) but its caption reads "Rates of needle sharing" and the Poisson model below it fits shared_syr ~ ... + offset(log(nivdu)) — i.e., the rate of shared syringes per injection. Switch to y = shared_syr / nivdu (rate) with nivdu > 0 filter to avoid divide-by-zero, and label the y-axis. fig-needles-pois used #| tbl-cap with a fig- label; the chunk produces a plot, so swap to fig-cap to keep the cross-reference type consistent. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing.qmd | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index baae46b332..59d359d48c 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -72,22 +72,24 @@ needles #| label: fig-needles library(ggplot2) -needles |> +needles |> + dplyr::filter(nivdu > 0) |> ggplot( aes( x = age, - y = shsyryn, + y = shared_syr / nivdu, shape = sex, col = ethn ) - ) + + ) + geom_point( - aes(size = nivdu), + aes(size = nivdu), alpha = .5) + scale_size_area(max_size = 4) + facet_grid( - cols = vars(polydrug), + cols = vars(polydrug), rows = vars(homeless)) + + labs(y = "Shared needles per injection") + theme(legend.position = "bottom") ``` @@ -149,7 +151,7 @@ glm1 |> parameters(exponentiate = TRUE) |> --- ```{r} -#| tbl-cap: "fitted Poisson model for needle-sharing data" +#| fig-cap: "fitted Poisson model for needle-sharing data" #| label: fig-needles-pois library(sjPlot) glm1 |> From ceaac412dc79dd084978804d3b893da09aa128f6 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 2 Jun 2026 08:24:31 +0000 Subject: [PATCH 03/34] Add demographics table and enrich data dictionary MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Data dictionary improvements: - Add labels for the nine variables that previously had no label (id, age, dprsn_dx, sexabuse, shared_syr, hplsns, nivdu, sqrtnivd, shsyr), pulling descriptions from the rmb package's documentation (https://d-morrison.github.io/rmb/man/needle_sharing.html). - Fix typo on sqrtninj: "sqrt(No. of infections in 30 days)" → "sqrt(No. of injections in 30 days)" (verified against .dta values; sqrtninj is the square root of nivdu, the injection count). - Tighten and standardise wording on the other labels. - Add a short prose blurb naming the dataset source (RMB2e chapter 8, UCSF companion website) and sample size (n = 128, 17 variables). New demographics table (tbl-needles-demographics): - Reports N, age (median, Q1–Q3), sex, ethnicity, HIV status, homelessness, polydrug use, and the two count variables (injections and shared syringes per 30 days), all on the analysis sample (Trans-sex singleton excluded). - Verified locally: n = 127 after Trans exclusion, age median 41 (IQR 35-48), 24% female, 59% White, 7% HIV+, 50% homeless. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing.qmd | 100 +++++++++++++++++++++++++++----- 1 file changed, 87 insertions(+), 13 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 59d359d48c..be4efb8047 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -31,17 +31,32 @@ needles = sex = sex |> factor() |> relevel(ref = "M") ) |> labelled::set_variable_labels( - "sex" = "sex (reference = Male)", - "ethn" = "ethnicity (reference = White)", - "shsyryn" = "shared syringe yes/no (1 = yes, 0 = no)", + "id" = "Participant identifier", + "sex" = "Biological sex (reference = Male)", + "ethn" = "Self-reported ethnicity (reference = White)", + "age" = "Age at first interview (years)", + "dprsn_dx" = "Depression diagnosis indicator", + "sexabuse" = "History of sexual abuse (1 = yes, 0 = no)", + "shared_syr" = "Number of shared syringes (in 30 days)", + "hivstat" = "HIV serostatus (reference = HIV-)", + "hplsns" = "Hopelessness score", + "nivdu" = "Number of injections (in 30 days)", + "shsyryn" = "Shared syringe? (1 = yes, 0 = no)", + "sqrtnivd" = "sqrt(No. of injections in 30 days)", "logshsyr" = "log(No. of shared needles)", - "polydrug" = "how many drugs used?", - "sqrtninj" = "sqrt(No. of infections in 30 days)", - "homeless" = "Homeless (1 = yes, 0 = no)", - "hivstat" = "HIV status (reference = HIV-)" + "polydrug" = "Multiple non-injection drugs used?", + "sqrtninj" = "sqrt(No. of injections in 30 days)", + "homeless" = "Homeless? (reference = not homeless)", + "shsyr" = "Number of shared needles (in 30 days)" ) ``` +Dataset source: Chapter 8 of *Regression Methods in Biostatistics*, 2e +([UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems)) — +a longitudinal panel study of injection drug users measuring +repeated needle-sharing behaviour. +$n = 128$ participants, 17 variables. + :::{#tbl-needles-data-dict} ```{r} @@ -59,6 +74,69 @@ Data dictionary for `needles` data --- +:::{#tbl-needles-demographics} + +```{r} +#| code-fold: true + +demog_data <- needles |> + dplyr::filter(sex != "Trans") + +med_iqr <- function(x) { + q <- stats::quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE) + sprintf("%.0f (%.0f, %.0f)", q[2], q[1], q[3]) +} + +n_pct <- function(x, level) { + sprintf( + "%d (%.0f%%)", + sum(x == level, na.rm = TRUE), + 100 * mean(x == level, na.rm = TRUE) + ) +} + +ethn_levels <- levels(demog_data$ethn) +ethn_rows <- vapply( + ethn_levels, + function(L) n_pct(demog_data$ethn, L), + character(1) +) + +demog_tbl <- tibble::tibble( + Characteristic = c( + "N", + "Age (years), median (Q1, Q3)", + "Female, n (%)", + paste0("Ethnicity — ", ethn_levels, ", n (%)"), + "HIV+, n (%)", + "Homeless, n (%)", + "Multiple drugs used, n (%)", + "Injections in 30 days, median (Q1, Q3)", + "Shared syringes in 30 days, median (Q1, Q3)" + ), + Value = c( + as.character(nrow(demog_data)), + med_iqr(demog_data$age), + n_pct(demog_data$sex, "F"), + ethn_rows, + n_pct(demog_data$hivstat, "HIV+"), + n_pct(demog_data$homeless, "homeless"), + n_pct(demog_data$polydrug, "multiple drugs used"), + med_iqr(demog_data$nivdu), + med_iqr(demog_data$shared_syr) + ) +) + +demog_tbl |> pander::pander() +``` + +Demographics and key behavioural variables for the needle-sharing dataset +(excluding one individual with `sex == "Trans"`). + +::: + +--- + ```{r} #| tbl-cap: "Needle-sharing data" #| label: tbl-needle-data @@ -82,13 +160,9 @@ needles |> col = ethn ) ) + - geom_point( - aes(size = nivdu), - alpha = .5) + + geom_point(aes(size = nivdu), alpha = .5) + scale_size_area(max_size = 4) + - facet_grid( - cols = vars(polydrug), - rows = vars(homeless)) + + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + labs(y = "Shared needles per injection") + theme(legend.position = "bottom") ``` From 6316dcce64a6f3f64939fa259312cc3e6a45c016 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 2 Jun 2026 08:38:44 +0000 Subject: [PATCH 04/34] Address #860 review findings: CI formulas, dsim, T-conditioning, slidebreaks, dup label MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit _sec_poisson_inference.qmd: - Make the coefficient-CI formula's subscripts internally consistent: expand \ci to \hat\beta_j \pm \ciradf{\hat\beta_j} (matching the \beta_j on the LHS) instead of using the unsubscripted \ci macro. - Rewrite the rate-ratio CI as two explicit endpoints [\exp{\hat\beta_j - rad}, \exp{\hat\beta_j + rad}] — the prose says "exponentiating both endpoints" but \exp{\ci} expands to a single exp{...} expression with the ± inside the braces, not two endpoints. - Switch \dot\sim to the project's \dsim macro on the LRT line so the spacing matches the rest of the codebase. _sec_zero-inflation.qmd: - exr-zinf-moments exercise text defined π conditional on T, but the model definition and the solution both define π conditional only on X. Drop T=t from the exercise prompt so a student's derivation cannot legitimately diverge from the solution. - Replace the new --- slide separator above the exercise with {{< slidebreak >}} per CLAUDE.md. exr-needle-sharing.qmd: - Replace the --- I added between the new demographics table and the raw data table with {{< slidebreak >}}. - Clarify the sqrtninj label: sqrtninj and sqrtnivd are numerically identical columns in the .dta (verified: zero rows differ across all 128), so note the duplication instead of giving the two columns identical descriptions. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- _subfiles/count-regression/_sec_poisson_inference.qmd | 10 +++++++--- _subfiles/count-regression/_sec_zero-inflation.qmd | 4 ++-- chapters/exr-needle-sharing.qmd | 4 ++-- 3 files changed, 11 insertions(+), 7 deletions(-) diff --git a/_subfiles/count-regression/_sec_poisson_inference.qmd b/_subfiles/count-regression/_sec_poisson_inference.qmd index f19a7288d8..5bfba6720c 100644 --- a/_subfiles/count-regression/_sec_poisson_inference.qmd +++ b/_subfiles/count-regression/_sec_poisson_inference.qmd @@ -4,7 +4,7 @@ A Wald 95% confidence interval for a single coefficient $\beta_j$ is: $$ -\beta_j \in \left[\ci \right] +\beta_j \in \left[\hat\beta_j \pm \ciradf{\hat\beta_j}\right] $$ where $z_{1-\alpha/2} \approx 1.96$ for $\alpha = 0.05$. @@ -14,7 +14,11 @@ we obtain a confidence interval for the rate ratio $e^{\beta_j}$ by exponentiating both endpoints: $$ -e^{\beta_j} \in \left[\exp{\ci} \right] +e^{\beta_j} \in + \left[ + \exp{\hat\beta_j - \ciradf{\hat\beta_j}},\; + \exp{\hat\beta_j + \ciradf{\hat\beta_j}} + \right] $$ ### Hypothesis tests for regression coefficients @@ -41,6 +45,6 @@ G^2 = 2\bigl[\ell(\hat\beta_1) - \ell(\hat\beta_0)\bigr] $$ Under $H_0$ that the additional $p_1 - p_0$ parameters are all zero, -$G^2 \dot\sim \chi^2_{p_1 - p_0}$. +$G^2 \dsim \chi^2_{p_1 - p_0}$. See @dobson4e [Chapter 9] and @vittinghoff2e [§8.1] for details. diff --git a/_subfiles/count-regression/_sec_zero-inflation.qmd b/_subfiles/count-regression/_sec_zero-inflation.qmd index 15bd551c66..6864782ee6 100644 --- a/_subfiles/count-regression/_sec_zero-inflation.qmd +++ b/_subfiles/count-regression/_sec_zero-inflation.qmd @@ -66,11 +66,11 @@ $$ ::: ---- +{{< slidebreak >}} ::: {#exr-zinf-moments} -Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $\pi = P(Z=1|X=x,T=t)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. +Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $\pi = P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. ::: ::: {.solution} diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index be4efb8047..b3df4bdf39 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -45,7 +45,7 @@ needles = "sqrtnivd" = "sqrt(No. of injections in 30 days)", "logshsyr" = "log(No. of shared needles)", "polydrug" = "Multiple non-injection drugs used?", - "sqrtninj" = "sqrt(No. of injections in 30 days)", + "sqrtninj" = "sqrt(No. of injections in 30 days); duplicate of sqrtnivd", "homeless" = "Homeless? (reference = not homeless)", "shsyr" = "Number of shared needles (in 30 days)" ) @@ -135,7 +135,7 @@ Demographics and key behavioural variables for the needle-sharing dataset ::: ---- +{{< slidebreak >}} ```{r} #| tbl-cap: "Needle-sharing data" From 72fdee5d687ff3ea3d4e0f8f91d8c37022da1e97 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 2 Jun 2026 09:00:52 +0000 Subject: [PATCH 05/34] Lump rare ethnicities via forcats::fct_lump_min MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The needle-sharing data has five ethnicity categories with n = 1 or 2 participants (Asian, Filipino, Indian, "Indian & White", "White & Hispa"), which makes the data dictionary and demographics tables noisy and would make any ethn-stratified model have empty/near-empty cells. Collapse those into "Other" via forcats::fct_lump_min(min = 5). Verified: post-lump ethn has 4 levels — White (76), AA (36), Hispanic (10), Other (6). Lump correctly bundles the five rare categories. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing.qmd | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index b3df4bdf39..cf9a8a029b 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -27,13 +27,17 @@ needles = 0 ~ "not homeless") |> factor() |> relevel(ref = "not homeless"), - ethn = ethn |> factor() |> relevel(ref = "White"), + ethn = + ethn |> + factor() |> + forcats::fct_lump_min(min = 5, other_level = "Other") |> + relevel(ref = "White"), sex = sex |> factor() |> relevel(ref = "M") ) |> labelled::set_variable_labels( "id" = "Participant identifier", "sex" = "Biological sex (reference = Male)", - "ethn" = "Self-reported ethnicity (reference = White)", + "ethn" = "Self-reported ethnicity (reference = White; n<5 lumped to Other)", "age" = "Age at first interview (years)", "dprsn_dx" = "Depression diagnosis indicator", "sexabuse" = "History of sexual abuse (1 = yes, 0 = no)", From d4b48d908cb8e3a911f61d9a48fee6c82b1dcd22 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 2 Jun 2026 09:04:00 +0000 Subject: [PATCH 06/34] fig-needles-pois: plot model predictions on the rate scale MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit sjPlot::plot_model(type="pred") doesn't expose the offset, so its y-axis was on the *count* scale and didn't match @fig-needles (which plots shared_syr / nivdu). Replace with a direct prediction grid where nivdu = 1 forces log(nivdu) = 0 in the offset; predict(type="response") then returns the Poisson rate per injection. Overlay observed shared_syr / nivdu points (sized by nivdu) and the fitted rate lines, faceted by homeless × polydrug, coloured by sex — same panel structure as @fig-needles. Verified locally: pred_grid 400 rows, rates 4.5e-9 to 0.066; observed data 121 complete-case rows after dropping NAs in homeless/polydrug/ sex/age (also removes the previous stray "homeless: NA" facet). https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing.qmd | 41 +++++++++++++++++++++++++++------ 1 file changed, 34 insertions(+), 7 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index cf9a8a029b..cdc7d6cf16 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -229,15 +229,42 @@ glm1 |> parameters(exponentiate = TRUE) |> --- ```{r} -#| fig-cap: "fitted Poisson model for needle-sharing data" +#| fig-cap: "Fitted Poisson rate model overlaid on observed needle-sharing rates" #| label: fig-needles-pois -library(sjPlot) -glm1 |> - sjPlot::plot_model( - type = "pred", - terms = c("age", "sex", "homeless", "polydrug"), - show.data = TRUE +#| code-fold: true +library(ggplot2) + +# Build a prediction grid that varies age and the categorical covariates, +# with nivdu = 1 so the offset log(nivdu) = 0 and predict(type = "response") +# returns the Poisson rate per injection (matching the y-axis scale of +# @fig-needles). +pred_grid <- with(glm1$model, expand.grid( + age = seq(min(age), max(age), length.out = 100), + sex = levels(droplevels(sex)), + homeless = levels(homeless), + polydrug = levels(polydrug), + nivdu = 1 +)) +pred_grid$rate <- predict(glm1, newdata = pred_grid, type = "response") + +obs_data <- needles |> + dplyr::filter( + nivdu > 0, + !is.na(homeless), !is.na(polydrug), !is.na(sex), !is.na(age) + ) |> + dplyr::mutate(rate = shared_syr / nivdu) + +ggplot(obs_data, aes(x = age, y = rate, color = sex)) + + geom_point(aes(size = nivdu), alpha = 0.4) + + geom_line(data = pred_grid, linewidth = 0.8) + + scale_size_area(max_size = 4) + + facet_grid( + rows = vars(homeless), + cols = vars(polydrug), + labeller = label_both ) + + labs(y = "Shared needles per injection") + + theme_bw() + theme(legend.position = "bottom") ``` From 05d412bc1ba9b72a56cad969cc7923afbf778c13 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 2 Jun 2026 09:05:46 +0000 Subject: [PATCH 07/34] Cross-link to def-orthogonal-vectors; tighten fig-needles-pois caption exr-prac-glm-score's solution invokes orthogonality of the residual vector against each predictor column. def-orthogonal-vectors already exists in the math-prereqs linear-algebra section and matches the usage exactly, so cross-link to it and add the explicit inner-product equation so a student following the link back can verify the claim. Also shorten fig-needles-pois's fig-cap below the 80-char limit. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- _subfiles/count-regression/_exr-prac-glm-score.qmd | 7 ++++++- chapters/exr-needle-sharing.qmd | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/_subfiles/count-regression/_exr-prac-glm-score.qmd b/_subfiles/count-regression/_exr-prac-glm-score.qmd index 38961caf7b..457290f79b 100644 --- a/_subfiles/count-regression/_exr-prac-glm-score.qmd +++ b/_subfiles/count-regression/_exr-prac-glm-score.qmd @@ -69,7 +69,12 @@ weighted by $x_i$. More generally, these score equations say that the **residuals $(y_i - \hat\mu_i)$ -are orthogonal to each predictor column**. +are [orthogonal](math-prereqs.qmd#def-orthogonal-vectors) +to each predictor column**: +for each predictor $j$, the residual vector $(y - \hat\mu)$ satisfies +$\tp{\vx_{(j)}}(y - \hat\mu) = 0$, +where $\vx_{(j)} = (x_{1j}, \ldots, x_{nj})$ is the column of $j$-th +predictor values across observations. This is the GLM analogue of the OLS normal equations. ::: diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index cdc7d6cf16..e8f0fa16f6 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -229,7 +229,7 @@ glm1 |> parameters(exponentiate = TRUE) |> --- ```{r} -#| fig-cap: "Fitted Poisson rate model overlaid on observed needle-sharing rates" +#| fig-cap: "Fitted Poisson rate model and observed needle-sharing rates" #| label: fig-needles-pois #| code-fold: true library(ggplot2) From b763dcc35eb17d69c0c9996856433badf0c76ae2 Mon Sep 17 00:00:00 2001 From: "claude[bot]" <209825114+claude[bot]@users.noreply.github.com> Date: Tue, 2 Jun 2026 17:09:31 +0000 Subject: [PATCH 08/34] Fix P(Z=1|X=x,T=t) to P(Z=1|X=x) in exr-zinf-pmf exercise Co-Authored-By: Claude Sonnet 4.6 --- _subfiles/count-regression/_sec_zero-inflation.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_subfiles/count-regression/_sec_zero-inflation.qmd b/_subfiles/count-regression/_sec_zero-inflation.qmd index 6864782ee6..1005e9172b 100644 --- a/_subfiles/count-regression/_sec_zero-inflation.qmd +++ b/_subfiles/count-regression/_sec_zero-inflation.qmd @@ -31,7 +31,7 @@ $$ --- ::: {#exr-zinf-pmf} -Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x,T=t)$ and $P(Y=y|Z=0,X=x,T=t)$. +Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x)$ and $P(Y=y|Z=0,X=x,T=t)$. ::: ::: {.solution} From 0cf320a1bfa3958d2d210730b5fd8499a461c61e Mon Sep 17 00:00:00 2001 From: "claude[bot]" <209825114+claude[bot]@users.noreply.github.com> Date: Tue, 2 Jun 2026 17:12:30 +0000 Subject: [PATCH 09/34] Fix (y - \hat\mu) to (\vy - \hat\mu) for bold vector consistency Co-Authored-By: Claude Sonnet 4.6 --- _subfiles/count-regression/_exr-prac-glm-score.qmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_subfiles/count-regression/_exr-prac-glm-score.qmd b/_subfiles/count-regression/_exr-prac-glm-score.qmd index 457290f79b..fe39484c3e 100644 --- a/_subfiles/count-regression/_exr-prac-glm-score.qmd +++ b/_subfiles/count-regression/_exr-prac-glm-score.qmd @@ -71,8 +71,8 @@ More generally, these score equations say that the **residuals $(y_i - \hat\mu_i)$ are [orthogonal](math-prereqs.qmd#def-orthogonal-vectors) to each predictor column**: -for each predictor $j$, the residual vector $(y - \hat\mu)$ satisfies -$\tp{\vx_{(j)}}(y - \hat\mu) = 0$, +for each predictor $j$, the residual vector $(\vy - \hat\mu)$ satisfies +$\tp{\vx_{(j)}}(\vy - \hat\mu) = 0$, where $\vx_{(j)} = (x_{1j}, \ldots, x_{nj})$ is the column of $j$-th predictor values across observations. This is the GLM analogue of the OLS normal equations. From 5e115325b7cdb1299ad095927d9ed4baae47cc0a Mon Sep 17 00:00:00 2001 From: "claude[bot]" <209825114+claude[bot]@users.noreply.github.com> Date: Tue, 2 Jun 2026 17:13:12 +0000 Subject: [PATCH 10/34] Add fig-needles-raw: scatter of shared_syr vs nivdu Shows the raw count relationship between shared syringes and injections, complementing fig-needles (which shows the rate shared_syr/nivdu). Co-Authored-By: Claude Sonnet 4.6 --- chapters/exr-needle-sharing.qmd | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index e8f0fa16f6..da94d26d7e 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -171,6 +171,33 @@ needles |> theme(legend.position = "bottom") ``` + +{{< slidebreak >}} + +```{r} +#| fig-cap: "Shared syringes vs. number of injections (raw counts)" +#| label: fig-needles-raw +#| code-fold: true +library(ggplot2) + +needles |> + ggplot( + aes( + x = nivdu, + y = shared_syr, + shape = sex, + col = ethn + ) + ) + + geom_point(alpha = .5) + + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + + labs( + x = "Number of injections (in 30 days)", + y = "Number of shared syringes (in 30 days)" + ) + + theme(legend.position = "bottom") +``` + --- #### Covariate counts From 74378b8d0bdefddd0369afc9fe9b17485ecb6de9 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 10:45:26 -0700 Subject: [PATCH 11/34] Add x=y reference line, wrap legend, and causal DAG to needle-sharing - fig-needles-raw: add dashed x=y line (shared syringes can't exceed injections) and wrap the bottom legends onto multiple rows so they don't get clipped - add fig-needles-dag: hypothesized causal structure (demographics + social factors -> nivdu -> shared_syr) Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 69 ++++++++++++++++++++++++++++++++- 1 file changed, 68 insertions(+), 1 deletion(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index da94d26d7e..690b42379b 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -61,6 +61,68 @@ a longitudinal panel study of injection drug users measuring repeated needle-sharing behaviour. $n = 128$ participants, 17 variables. +::: {#fig-needles-dag} + +```{r} +#| fig-width: 7 +#| fig-height: 5 +#| code-fold: true + +library(dagitty) + +needles_dag <- dagitty( + 'dag { + age [pos="0,-2"] + sex [pos="0,-1"] + ethn [pos="0,0"] + homeless [pos="0,1"] + polydrug [pos="0,2"] + nivdu [pos="2,0"] + shared_syr [outcome, pos="4,0"] + + age -> nivdu + sex -> nivdu + ethn -> nivdu + homeless -> nivdu + polydrug -> nivdu + + age -> shared_syr + sex -> shared_syr + ethn -> shared_syr + homeless -> shared_syr + polydrug -> shared_syr + + nivdu -> shared_syr + }' +) + +plot(needles_dag) +``` + +Hypothesized causal structure for the needle-sharing example. +Demographic factors (`age`, `sex`, `ethn`) and social factors +(`homeless`, `polydrug`) influence both how often a participant injects +(`nivdu`) and how many of those injections involve a shared syringe +(`shared_syr`, the outcome). + +::: + +::: notes +In @fig-needles-dag: + +- **`nivdu`** (number of injections in 30 days) acts as the denominator +for the sharing rate: more injections create more opportunities to share. +The Poisson model enters it as an offset (`offset = log(nivdu)`), +so the covariates are modeled as predictors of the *rate* +of sharing per injection. +- The demographic and social factors have both an **indirect** path +to `shared_syr` (through injection frequency `nivdu`) +and a **direct** path (their effect on the propensity to share +per injection). +- This DAG is simplified; a fuller model might include +HIV status, depression, or hopelessness. +::: + :::{#tbl-needles-data-dict} ```{r} @@ -190,12 +252,17 @@ needles |> ) ) + geom_point(alpha = .5) + + geom_abline(slope = 1, intercept = 0, linetype = "dashed", alpha = .5) + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + labs( x = "Number of injections (in 30 days)", y = "Number of shared syringes (in 30 days)" ) + - theme(legend.position = "bottom") + guides( + col = guide_legend(nrow = 2), + shape = guide_legend(nrow = 2) + ) + + theme(legend.position = "bottom", legend.box = "vertical") ``` --- From db60071abd76de65ad17e646f159e1ca78a94626 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 10:48:13 -0700 Subject: [PATCH 12/34] Use gtsummary for tbl-needles-demographics; show all variables incl. Trans Replace the hand-built demographics table with gtsummary::tbl_summary() over the full needles dataset (no longer filtering out sex == "Trans"), summarizing every variable except the id identifier. Variable labels are picked up automatically from the labelled data. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 58 ++++----------------------------- 1 file changed, 6 insertions(+), 52 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 690b42379b..5f9d0d3407 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -144,60 +144,14 @@ Data dictionary for `needles` data ```{r} #| code-fold: true - -demog_data <- needles |> - dplyr::filter(sex != "Trans") - -med_iqr <- function(x) { - q <- stats::quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE) - sprintf("%.0f (%.0f, %.0f)", q[2], q[1], q[3]) -} - -n_pct <- function(x, level) { - sprintf( - "%d (%.0f%%)", - sum(x == level, na.rm = TRUE), - 100 * mean(x == level, na.rm = TRUE) - ) -} - -ethn_levels <- levels(demog_data$ethn) -ethn_rows <- vapply( - ethn_levels, - function(L) n_pct(demog_data$ethn, L), - character(1) -) - -demog_tbl <- tibble::tibble( - Characteristic = c( - "N", - "Age (years), median (Q1, Q3)", - "Female, n (%)", - paste0("Ethnicity — ", ethn_levels, ", n (%)"), - "HIV+, n (%)", - "Homeless, n (%)", - "Multiple drugs used, n (%)", - "Injections in 30 days, median (Q1, Q3)", - "Shared syringes in 30 days, median (Q1, Q3)" - ), - Value = c( - as.character(nrow(demog_data)), - med_iqr(demog_data$age), - n_pct(demog_data$sex, "F"), - ethn_rows, - n_pct(demog_data$hivstat, "HIV+"), - n_pct(demog_data$homeless, "homeless"), - n_pct(demog_data$polydrug, "multiple drugs used"), - med_iqr(demog_data$nivdu), - med_iqr(demog_data$shared_syr) - ) -) - -demog_tbl |> pander::pander() +needles |> + dplyr::select(-id) |> + gtsummary::tbl_summary() ``` -Demographics and key behavioural variables for the needle-sharing dataset -(excluding one individual with `sex == "Trans"`). +Demographics and behavioural variables for the needle-sharing dataset +(all participants, including the one individual with `sex == "Trans"`; +the `id` identifier column is omitted). ::: From 9e75c255f29399a0620fd4c66c1d107a77ef3887 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 10:55:03 -0700 Subject: [PATCH 13/34] Clean up lint findings in needle-sharing data/model chunks Use <- for assignment, fix mutate indentation, add infix spaces around *, break pipe continuations onto their own lines, and shorten the tbl-count-needles caption to fit the line-length limit. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 51 ++++++++++++++------------------- 1 file changed, 21 insertions(+), 30 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 5f9d0d3407..49accd4635 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -1,39 +1,29 @@ ```{r} library(tidyverse) library(haven) -needles = - here::here("inst/extdata/needle_sharing.dta") |> +needles <- + here::here("inst/extdata/needle_sharing.dta") |> read_dta() |> as_tibble() |> mutate( - hivstat = - hivstat |> - case_match( - 1 ~ "HIV+", - 0 ~ "HIV-") |> - factor() |> + hivstat = hivstat |> + case_match(1 ~ "HIV+", 0 ~ "HIV-") |> + factor() |> relevel(ref = "HIV-"), - polydrug = - polydrug |> - case_match( - 1 ~ "multiple drugs used", - 0 ~ "one drug used") |> + polydrug = polydrug |> + case_match(1 ~ "multiple drugs used", 0 ~ "one drug used") |> factor() |> relevel(ref = "one drug used"), - homeless = - homeless |> - case_match( - 1 ~ "homeless", - 0 ~ "not homeless") |> + homeless = homeless |> + case_match(1 ~ "homeless", 0 ~ "not homeless") |> factor() |> relevel(ref = "not homeless"), - ethn = - ethn |> + ethn = ethn |> factor() |> forcats::fct_lump_min(min = 5, other_level = "Other") |> relevel(ref = "White"), sex = sex |> factor() |> relevel(ref = "M") - ) |> + ) |> labelled::set_variable_labels( "id" = "Participant identifier", "sex" = "Biological sex (reference = Male)", @@ -224,11 +214,11 @@ needles |> #### Covariate counts ```{r} -#| tbl-cap: "Counts of observations in `needles` dataset by sex, unhoused status, and multiple drug use" +#| tbl-cap: "Counts in `needles` by sex, housing status, and multiple drug use" #| label: tbl-count-needles #| code-fold: show -needles |> - dplyr::select(sex, homeless, polydrug) |> +needles |> + dplyr::select(sex, homeless, polydrug) |> summary() ``` @@ -241,7 +231,7 @@ We will remove that individual: ```{r} #| label: remove-trans-obs #| code-summary: 'remove singleton observation with sex == Trans' -needles = needles |> filter(sex != "Trans") +needles <- needles |> filter(sex != "Trans") ``` @@ -250,13 +240,13 @@ needles = needles |> filter(sex != "Trans") ### Model {.smaller} ```{r} -glm1 = - needles |> - dplyr::filter(nivdu > 0) |> +glm1 <- + needles |> + dplyr::filter(nivdu > 0) |> glm( offset = log(nivdu), family = stats::poisson, - formula = shared_syr ~ age + sex + homeless*polydrug + formula = shared_syr ~ age + sex + homeless * polydrug ) library(equatiomatic) equatiomatic::extract_eq(glm1) @@ -270,7 +260,8 @@ equatiomatic::extract_eq(glm1) library(parameters) -glm1 |> parameters(exponentiate = TRUE) |> +glm1 |> + parameters(exponentiate = TRUE) |> print_md() ``` From 26d2d983321813400801f58aba9e9bd1cf0bab98 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 11:12:54 -0700 Subject: [PATCH 14/34] Address review findings on count-regression PR - LTE step in zero-inflation moments solution now carries (X=x,T=t) conditioning explicitly; add note justifying suppressed conditioning in the variance steps - Disambiguate LRT notation (maximized log-likelihoods of M_0/M_1) from the scalar Wald null value beta_0 - Standardize PMF solution on mu_0 to match the moments solution - Convert fig-needles-raw to div format, add a prose cross-reference and a caption note that it predates the Trans-participant removal - Move the Poisson offset into the model formula so predict() honors it Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .../_sec_poisson_inference.qmd | 11 +++++++--- .../count-regression/_sec_zero-inflation.qmd | 22 +++++++++++++------ chapters/exr-needle-sharing.qmd | 21 ++++++++++++++---- 3 files changed, 40 insertions(+), 14 deletions(-) diff --git a/_subfiles/count-regression/_sec_poisson_inference.qmd b/_subfiles/count-regression/_sec_poisson_inference.qmd index 5bfba6720c..fb3167539e 100644 --- a/_subfiles/count-regression/_sec_poisson_inference.qmd +++ b/_subfiles/count-regression/_sec_poisson_inference.qmd @@ -37,13 +37,18 @@ i.e., that covariate $j$ has no association with the outcome rate. ### Comparing nested models -To compare a smaller model (with $p_0$ parameters) to a larger model -(with $p_1 > p_0$ parameters), use the likelihood ratio test statistic: +To compare a smaller model $M_0$ (with $p_0$ parameters) to a larger model +$M_1$ (with $p_1 > p_0$ parameters), use the likelihood ratio test statistic: $$ -G^2 = 2\bigl[\ell(\hat\beta_1) - \ell(\hat\beta_0)\bigr] +G^2 = 2\bigl[\hat\ell_1 - \hat\ell_0\bigr] $$ +where $\hat\ell_1$ and $\hat\ell_0$ are the maximized log-likelihoods +of $M_1$ and $M_0$ respectively. +(Here the subscripts index the two *models*; they are unrelated to the +scalar null value $\beta_0$ used in the Wald test above.) + Under $H_0$ that the additional $p_1 - p_0$ parameters are all zero, $G^2 \dsim \chi^2_{p_1 - p_0}$. diff --git a/_subfiles/count-regression/_sec_zero-inflation.qmd b/_subfiles/count-regression/_sec_zero-inflation.qmd index 1005e9172b..801392b62a 100644 --- a/_subfiles/count-regression/_sec_zero-inflation.qmd +++ b/_subfiles/count-regression/_sec_zero-inflation.qmd @@ -36,7 +36,7 @@ Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions ::: {.solution} -Let $\pi = \P(Z=1|X=x)$ and $\mu = \Expp[Y|Z=0,X=x,T=t]$. +Let $\pi = \P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. **$P(Y=0)$:** $Y=0$ occurs either because $Z=1$ (always zero) or because $Z=0$ and the Poisson draw equals 0: @@ -45,7 +45,7 @@ $$ \ba \P(Y=0|X=x,T=t) &= \P(Z=1|X=x) + \P(Z=0|X=x)\,\P(Y=0|Z=0,X=x,T=t)\\ -&= \pi + (1-\pi)\,e^{-\mu} +&= \pi + (1-\pi)\,e^{-\mu_0} \ea $$ @@ -54,14 +54,14 @@ $$ $$ \P(Y=1|X=x,T=t) = (1-\pi)\,\P(Y=1|Z=0,X=x,T=t) -= (1-\pi)\,\mu e^{-\mu} += (1-\pi)\,\mu_0 e^{-\mu_0} $$ **$P(Y=y)$ for $y \geq 1$:** Identical reasoning gives $$ \P(Y=y|X=x,T=t) -= (1-\pi)\,\frac{\mu^y e^{-\mu}}{y!} += (1-\pi)\,\frac{\mu_0^y e^{-\mu_0}}{y!} $$ ::: @@ -77,18 +77,26 @@ Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as fu Let $\pi = \P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. -**Expected value.** By the Law of Total Expectation: +**Expected value.** By the Law of Total Expectation +(conditioning on $Z$, within the subpopulation $\{X=x, T=t\}$): $$ \ba \Expp[Y|X=x,T=t] -&= \Expp[Y|Z=1]\,\pi + \Expp[Y|Z=0]\,(1-\pi)\\ +&= \Expp[Y|Z=1,X=x,T=t]\,\pi + \Expp[Y|Z=0,X=x,T=t]\,(1-\pi)\\ &= 0 \cdot \pi + \mu_0(1-\pi)\\ &= (1-\pi)\,\mu_0 \ea $$ -**Variance.** By the Law of Total Variance: +The substitution $\Expp[Y|Z=0,X=x,T=t] = \mu_0$ follows immediately +from the definition of $\mu_0$ above. + +**Variance.** By the Law of Total Variance. +To reduce clutter, we suppress the $(X=x, T=t)$ conditioning in the +intermediate steps below: every expectation and variance is taken within +the subpopulation $\{X=x, T=t\}$, and we restore the explicit conditioning +in the final line. $$ \Var{Y} = \Expp[\Var{Y|Z}] + \Var{\Expp[Y|Z]} diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 49accd4635..91e32dbad2 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -180,9 +180,14 @@ needles |> {{< slidebreak >}} +@fig-needles-raw shows the underlying counts: +the number of shared syringes against the number of injections, +with a dashed $y = x$ line marking the ceiling +where every injection involves a shared syringe. + +::: {#fig-needles-raw} + ```{r} -#| fig-cap: "Shared syringes vs. number of injections (raw counts)" -#| label: fig-needles-raw #| code-fold: true library(ggplot2) @@ -209,6 +214,14 @@ needles |> theme(legend.position = "bottom", legend.box = "vertical") ``` +Number of shared syringes versus number of injections +(raw counts, 30-day window). +The dashed line marks $y = x$ (every injection involves a shared syringe). +This panel is drawn before the singleton `sex = "Trans"` participant +is removed, so that participant appears as a third shape category. + +::: + --- #### Covariate counts @@ -244,9 +257,9 @@ glm1 <- needles |> dplyr::filter(nivdu > 0) |> glm( - offset = log(nivdu), family = stats::poisson, - formula = shared_syr ~ age + sex + homeless * polydrug + formula = shared_syr ~ age + sex + homeless * polydrug + + offset(log(nivdu)) ) library(equatiomatic) equatiomatic::extract_eq(glm1) From 7bc9155f53b3b7f5f514a3a87b7f39c869f876f5 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 11:19:54 -0700 Subject: [PATCH 15/34] Address non-blocking review notes: consistent conditioning and vector notation - Use specific-value conditioning E[Y|X=x,T=t] in the overdispersion conclusion, matching the rest of the zero-inflation proof - Mark the fitted-mean residual vector with the project's vector macro (\v{\hat\mu}) so it matches \vy's tilde convention Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _subfiles/count-regression/_exr-prac-glm-score.qmd | 4 ++-- _subfiles/count-regression/_sec_zero-inflation.qmd | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/_subfiles/count-regression/_exr-prac-glm-score.qmd b/_subfiles/count-regression/_exr-prac-glm-score.qmd index fe39484c3e..db93777e15 100644 --- a/_subfiles/count-regression/_exr-prac-glm-score.qmd +++ b/_subfiles/count-regression/_exr-prac-glm-score.qmd @@ -71,8 +71,8 @@ More generally, these score equations say that the **residuals $(y_i - \hat\mu_i)$ are [orthogonal](math-prereqs.qmd#def-orthogonal-vectors) to each predictor column**: -for each predictor $j$, the residual vector $(\vy - \hat\mu)$ satisfies -$\tp{\vx_{(j)}}(\vy - \hat\mu) = 0$, +for each predictor $j$, the residual vector $(\vy - \v{\hat\mu})$ satisfies +$\tp{\vx_{(j)}}(\vy - \v{\hat\mu}) = 0$, where $\vx_{(j)} = (x_{1j}, \ldots, x_{nj})$ is the column of $j$-th predictor values across observations. This is the GLM analogue of the OLS normal equations. diff --git a/_subfiles/count-regression/_sec_zero-inflation.qmd b/_subfiles/count-regression/_sec_zero-inflation.qmd index 801392b62a..8d77b43f47 100644 --- a/_subfiles/count-regression/_sec_zero-inflation.qmd +++ b/_subfiles/count-regression/_sec_zero-inflation.qmd @@ -128,7 +128,7 @@ $$ = (1-\pi)\mu_0\bigl(1 + \pi\mu_0\bigr) $$ -Since $(1-\pi)\mu_0\bigl(1+\pi\mu_0\bigr) \geq (1-\pi)\mu_0 = \Expp[Y|X,T]$ for any $\pi > 0$, +Since $(1-\pi)\mu_0\bigl(1+\pi\mu_0\bigr) \geq (1-\pi)\mu_0 = \Expp[Y|X=x,T=t]$ for any $\pi > 0$, zero-inflated count models always exhibit overdispersion relative to a Poisson model with the same mean. From 2b65df6b5e89f810e433f55f39c5773035f760f6 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 2 Jun 2026 18:56:56 +0000 Subject: [PATCH 16/34] Convention: estimator symbol on top of vector symbol MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per @d-morrison's review on PR #860: https://github.com/d-morrison/rme/pull/860#pullrequestreview-4412643614 For estimators/estimates of vector-valued estimands, the estimator symbol (\hat, \bar, \tilde, …) wraps the already-vectorized symbol — write \hat{\v{\mu}} (hat on top), not \v{\hat\mu} (hat inside the vector decoration). Document the rule in CLAUDE.md and .github/copilot-instructions.md, and update both occurrences of the fitted-mean residual vector in _exr-prac-glm-score.qmd to the correct form. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- .github/copilot-instructions.md | 4 ++++ CLAUDE.md | 3 +++ _subfiles/count-regression/_exr-prac-glm-score.qmd | 4 ++-- 3 files changed, 9 insertions(+), 2 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 46133eec51..fdee9e8300 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -588,6 +588,10 @@ Key macros to use: - **Greek letters**: Use `\b` for $\beta$, `\g` for $\gamma$, `\a` for $\alpha$ - **Formatting**: Use `\red{...}` and `\blue{...}` for colored text in math - **Deviation/error notation**: Use `\erf{...}` for deviations of estimates/estimators from their estimands; use `\devn(...)` for all other deviations (e.g., observations from population means) +- **Estimators of vector estimands**: the estimator symbol (e.g. `\hat`, + `\bar`, `\tilde`) goes on top of the vector symbol, not inside it — + write `\hat{\v{\mu}}`, not `\v{\hat\mu}`. The hat sits on top of the + already-vectorized symbol. matrix-product helper macros: diff --git a/CLAUDE.md b/CLAUDE.md index 786f4d7540..e076fc8b7e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -50,6 +50,9 @@ Before committing any `.qmd`, `.R`, or config file change: - Key macros: `\E{Y|X=x}`, `\ba`/`\ea`, `\tp{v}`, `\b`, `\g`, `\a`, `\devn(...)`, `\erf{...}` - Include every intermediate step in derivations — do not skip steps - Color coding: `\red{...}` for focal/extra terms, `\blue{...}` for shared terms +- Estimators of vector estimands: the estimator symbol (e.g. `\hat`) goes + on top of the vector symbol, not inside it — write `\hat{\v{\mu}}`, not + `\v{\hat\mu}`. (Same for `\bar`, `\tilde`, etc.) - Ratios vs. factors: - Use the generic `\ratio`/`\ratiof` macro when a ratio's inputs are the **quantities themselves** (the odds, hazards, rates, etc.) — e.g. `\ratio(\odds_1, \odds_2)`, **not** `\ror(\odds_1, \odds_2)` — because the type of ratio is clear from the inputs. - Use the type-subscripted ratio macros (`\ror` for odds ratios, `\hazratio`/`\hr` for hazard ratios, `\rateratio`, `\riskratio`, `\prevratio`, `\cuhazratio`, …) only when the inputs are **covariate patterns** (e.g. `\ror(\vx,\vxs)`, `\hr(t\mid\vx:\vxs)`), where the subscript is needed to say which kind of ratio it is. diff --git a/_subfiles/count-regression/_exr-prac-glm-score.qmd b/_subfiles/count-regression/_exr-prac-glm-score.qmd index db93777e15..009fdc9924 100644 --- a/_subfiles/count-regression/_exr-prac-glm-score.qmd +++ b/_subfiles/count-regression/_exr-prac-glm-score.qmd @@ -71,8 +71,8 @@ More generally, these score equations say that the **residuals $(y_i - \hat\mu_i)$ are [orthogonal](math-prereqs.qmd#def-orthogonal-vectors) to each predictor column**: -for each predictor $j$, the residual vector $(\vy - \v{\hat\mu})$ satisfies -$\tp{\vx_{(j)}}(\vy - \v{\hat\mu}) = 0$, +for each predictor $j$, the residual vector $(\vy - \hat{\v{\mu}})$ satisfies +$\tp{\vx_{(j)}}(\vy - \hat{\v{\mu}}) = 0$, where $\vx_{(j)} = (x_{1j}, \ldots, x_{nj})$ is the column of $j$-th predictor values across observations. This is the GLM analogue of the OLS normal equations. From ca62297790e751c5cf4319b1518d969db2ff91a7 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 12:42:30 -0700 Subject: [PATCH 17/34] Contrast rate vs count models, add coefficient interpretation, move DAG MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Fix fig-needles-pois legend: stack/​wrap the color+size legends so they don't clip (legend.box = "vertical") - Move the causal DAG to after the data dictionary and demographics table; rewrite its notes to frame nivdu as a mediator - Fit both a rate model (offset(log(nivdu)) -> per-injection direct effect) and a count model (no offset -> total effect), since the DAG shows nivdu is a mediator - Add an "Interpreting the coefficients" section (parallel to the linear and logistic chapters) reading the rate ratios vs count ratios and explaining the direct/total (mediation) contrast; flag the sparse interaction cell - Normalize heading levels under the example (## / ###; no skipped levels) - Defer the sex == "Trans" singleton mention to the removal step (drop the premature notes in the demographics and fig-needles-raw captions) - Update renv.lock (bit64, clipr, cpp11) Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 216 ++++++++++++++++++++++++-------- renv.lock | 12 +- 2 files changed, 171 insertions(+), 57 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 91e32dbad2..2cb7069e06 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -51,6 +51,42 @@ a longitudinal panel study of injection drug users measuring repeated needle-sharing behaviour. $n = 128$ participants, 17 variables. +:::{#tbl-needles-data-dict} + +```{r} +dict <- tibble( + variable = names(needles), + description = labelled::get_variable_labels(needles) |> + sapply(function(x) ifelse(is.null(x), "", x)), +) +dict |> pander::pander() +``` + +Data dictionary for `needles` data + +::: + +--- + +:::{#tbl-needles-demographics} + +```{r} +#| code-fold: true +needles |> + dplyr::select(-id) |> + gtsummary::tbl_summary() +``` + +Demographics and behavioural variables for the needle-sharing dataset +(all participants; the `id` identifier column is omitted). + +::: + +{{< slidebreak >}} + +Before modeling, we sketch a hypothesized causal structure for these +variables. + ::: {#fig-needles-dag} ```{r} @@ -98,55 +134,25 @@ Demographic factors (`age`, `sex`, `ethn`) and social factors ::: ::: notes -In @fig-needles-dag: - -- **`nivdu`** (number of injections in 30 days) acts as the denominator -for the sharing rate: more injections create more opportunities to share. -The Poisson model enters it as an offset (`offset = log(nivdu)`), -so the covariates are modeled as predictors of the *rate* -of sharing per injection. -- The demographic and social factors have both an **indirect** path -to `shared_syr` (through injection frequency `nivdu`) -and a **direct** path (their effect on the propensity to share -per injection). -- This DAG is simplified; a fuller model might include +In @fig-needles-dag, injection frequency `nivdu` is a **mediator**: +each covariate can affect the number of shared syringes both +**directly** (its effect on the propensity to share *per injection*) +and **indirectly** (through how often the participant injects). + +This distinction drives the two models we fit below: + +- Putting `nivdu` in the model as an **offset** (`offset(log(nivdu))`) +conditions on the mediator. The covariate coefficients then describe the +*per-injection sharing rate* — the **direct** path, holding injection +frequency fixed. +- Leaving `nivdu` out of the model lets the covariate coefficients +describe the *total* number of shared syringes — the **total** effect, +including the indirect path through `nivdu`. + +This DAG is simplified; a fuller model might include HIV status, depression, or hopelessness. ::: -:::{#tbl-needles-data-dict} - -```{r} -dict <- tibble( - variable = names(needles), - description = labelled::get_variable_labels(needles) |> - sapply(function(x) ifelse(is.null(x), "", x)), -) -dict |> pander::pander() -``` - -Data dictionary for `needles` data - -::: - ---- - -:::{#tbl-needles-demographics} - -```{r} -#| code-fold: true -needles |> - dplyr::select(-id) |> - gtsummary::tbl_summary() -``` - -Demographics and behavioural variables for the needle-sharing dataset -(all participants, including the one individual with `sex == "Trans"`; -the `id` identifier column is omitted). - -::: - -{{< slidebreak >}} - ```{r} #| tbl-cap: "Needle-sharing data" #| label: tbl-needle-data @@ -217,14 +223,12 @@ needles |> Number of shared syringes versus number of injections (raw counts, 30-day window). The dashed line marks $y = x$ (every injection involves a shared syringe). -This panel is drawn before the singleton `sex = "Trans"` participant -is removed, so that participant appears as a third shape category. ::: --- -#### Covariate counts +## Covariate counts ```{r} #| tbl-cap: "Counts in `needles` by sex, housing status, and multiple drug use" @@ -250,7 +254,23 @@ needles <- needles |> filter(sex != "Trans") --- -### Model {.smaller} +## Model {.smaller} + +As @fig-needles-dag suggests, injection frequency `nivdu` is a mediator, +so we fit two complementary Poisson models: + +- a **rate model** that includes `nivdu` as an offset, estimating the +effect of each covariate on the *per-injection* sharing rate +(the direct path, holding injection frequency fixed); and +- a **count model** that omits `nivdu`, estimating the *total* effect of +each covariate on the number of shared syringes +(direct plus the indirect path through `nivdu`). + +### Rate model {.smaller} + +The offset `offset(log(nivdu))` fixes the coefficient of `log(nivdu)` +at 1, which is equivalent to modeling the *rate* of shared syringes +per injection (`shared_syr / nivdu`): ```{r} glm1 <- @@ -316,7 +336,101 @@ ggplot(obs_data, aes(x = age, y = rate, color = sex)) + labeller = label_both ) + labs(y = "Shared needles per injection") + + guides( + color = guide_legend(nrow = 1), + size = guide_legend(nrow = 1) + ) + theme_bw() + - theme(legend.position = "bottom") + theme(legend.position = "bottom", legend.box = "vertical") ``` +--- + +### Count model {.smaller} + +Dropping the offset models the *count* of shared syringes directly. +We keep the same sample of injectors (`nivdu > 0`) so the two models are +comparable; the only difference is whether `nivdu` enters as an offset: + +```{r} +glm2 <- + needles |> + dplyr::filter(nivdu > 0) |> + glm( + family = stats::poisson, + formula = shared_syr ~ age + sex + homeless * polydrug + ) +equatiomatic::extract_eq(glm2) +``` + +--- + +```{r} +#| tbl-cap: "Count model (no offset) for needle-sharing data" +#| label: tbl-needles-count +glm2 |> + parameters(exponentiate = TRUE) |> + print_md() +``` + +## Interpreting the coefficients {.smaller} + +For a Poisson model with a log link, +exponentiating a coefficient turns it into a *ratio*: +$\exp(\beta_j)$ is the multiplicative change in the expected outcome +per one-unit increase in $x_j$, +holding the other covariates fixed +(this is the rate-ratio interpretation introduced earlier in this chapter). +In the **rate model** this ratio is a *rate ratio* (shared syringes per +injection); in the **count model** it is a *count ratio* +(total shared syringes). + +```{r} +#| code-fold: true +rr_rate <- exp(coef(glm1)) +rr_count <- exp(coef(glm2)) +``` + +**Baseline (intercept).** +In the rate model, $\exp(\hat\beta_0) = `r round(rr_rate[["(Intercept)"]], 3)`$ +is the estimated sharing rate per injection for the reference participant +(male, not homeless, one drug used), +extrapolated to age 0. +Age 0 is outside the data, so the intercept is a mathematical anchor +rather than a directly interpretable quantity. + +**Age.** +The rate ratio per additional year of age is +$\exp(\hat\beta_{\text{age}}) = `r round(rr_rate[["age"]], 3)`$, +i.e., essentially no change in the per-injection sharing rate with age. + +**Sex.** +The rate ratio for female versus male is +$\exp(\hat\beta_{\text{sexF}}) = `r round(rr_rate[["sexF"]], 2)`$: +women share needles at about `r round(rr_rate[["sexF"]], 2)` times +the per-injection rate of men with the same age, housing status, and +drug-use pattern. +In the count model the corresponding count ratio is larger, +about `r round(rr_count[["sexF"]], 2)`. +Because the total-count ratio exceeds the per-injection rate ratio, +part of women's higher total sharing operates **indirectly** — +through injecting more often +(the `sex → nivdu → shared_syr` path in @fig-needles-dag) — +on top of their higher per-injection rate. + +**Homelessness.** +Among participants using one drug, +the rate ratio for homeless versus not-homeless is +$\exp(\hat\beta_{\text{homeless}}) = `r round(rr_rate[["homelesshomeless"]], 2)`$, +versus a count ratio of about `r round(rr_count[["homelesshomeless"]], 2)` +in the count model — +again, a larger *total* association than *direct* (per-injection) one. + +**Interaction (caution).** +The `homeless × polydrug` cell is sparse, +which produces an extreme, unstable interaction estimate +(a rate ratio in the hundreds of thousands). +We would not interpret that coefficient without more data; +it is a reminder to check cell counts (see @tbl-count-needles) +before trusting an interaction term. + diff --git a/renv.lock b/renv.lock index f5008745dd..9b94d44ed9 100644 --- a/renv.lock +++ b/renv.lock @@ -1014,7 +1014,7 @@ }, "bit64": { "Package": "bit64", - "Version": "4.8.0", + "Version": "4.8.2", "Source": "Repository", "Title": "A S3 Class for Vectors of 64bit Integers", "Authors@R": "c( person(\"Michael\", \"Chirico\", email=\"michaelchirico4@gmail.com\", role=c(\"aut\", \"cre\")), person(\"Jens\", \"Oehlschlägel\", role=\"aut\"), person(\"Leonardo\", \"Silvestri\", role=\"ctb\"), person(\"Ofek\", \"Shilon\", role=\"ctb\"), person(\"Christian\", \"Ullerich\", role=\"ctb\") )", @@ -1884,7 +1884,7 @@ }, "clipr": { "Package": "clipr", - "Version": "0.8.0", + "Version": "0.8.1", "Source": "Repository", "Type": "Package", "Title": "Read and Write from the System Clipboard", @@ -1906,10 +1906,10 @@ "VignetteBuilder": "knitr", "Encoding": "UTF-8", "Language": "en-US", - "RoxygenNote": "7.1.2", + "RoxygenNote": "7.3.3", "SystemRequirements": "xclip (https://github.com/astrand/xclip) or xsel (http://www.vergenet.net/~conrad/software/xsel/) for accessing the X11 clipboard, or wl-clipboard (https://github.com/bugaevc/wl-clipboard) for systems using Wayland.", "NeedsCompilation": "no", - "Author": "Matthew Lincoln [aut, cre] (), Louis Maddox [ctb], Steve Simpson [ctb], Jennifer Bryan [ctb]", + "Author": "Matthew Lincoln [aut, cre] (ORCID: ), Louis Maddox [ctb], Steve Simpson [ctb], Jennifer Bryan [ctb]", "Maintainer": "Matthew Lincoln ", "Repository": "CRAN" }, @@ -1929,7 +1929,7 @@ "License": "GPL (>= 2)", "URL": "https://www.R-project.org", "NeedsCompilation": "yes", - "Repository": "https://packagemanager.posit.co/cran/__linux__/noble/latest", + "Repository": "https://packagemanager.posit.co/cran/latest", "Encoding": "UTF-8" }, "coda": { @@ -2164,7 +2164,7 @@ }, "cpp11": { "Package": "cpp11", - "Version": "0.5.4", + "Version": "0.5.5", "Source": "Repository", "Title": "A C++11 Interface for R's C Interface", "Authors@R": "c( person(\"Davis\", \"Vaughan\", email = \"davis@posit.co\", role = c(\"aut\", \"cre\"), comment = c(ORCID = \"0000-0003-4777-038X\")), person(\"Jim\",\"Hester\", role = \"aut\", comment = c(ORCID = \"0000-0002-2739-7082\")), person(\"Romain\", \"François\", role = \"aut\", comment = c(ORCID = \"0000-0002-2444-4226\")), person(\"Benjamin\", \"Kietzman\", role = \"ctb\"), person(\"Posit Software, PBC\", role = c(\"cph\", \"fnd\")) )", From bbf9c262c385d8850ddaf2c63df47a14ab565f63 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 13:02:42 -0700 Subject: [PATCH 18/34] Address review: slidebreaks, div tables, cross-refs, macros, shsyr label - Replace two raw --- separators with {{< slidebreak >}} - Convert tbl-needles-count to div format (matches tbl-needles-demographics) - Add prose cross-references for @tbl-needles-demographics and @tbl-needles-count - Clarify the shsyr dictionary label (sparse duplicate of shared_syr) - Use the \exp{} macro and \b for beta in the interpretation section - Drop the id-column note from the demographics caption (per author) Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 2cb7069e06..fd1153b9df 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -41,7 +41,7 @@ needles <- "polydrug" = "Multiple non-injection drugs used?", "sqrtninj" = "sqrt(No. of injections in 30 days); duplicate of sqrtnivd", "homeless" = "Homeless? (reference = not homeless)", - "shsyr" = "Number of shared needles (in 30 days)" + "shsyr" = "Shared needles (30 days); sparse duplicate of shared_syr" ) ``` @@ -68,6 +68,9 @@ Data dictionary for `needles` data --- +@tbl-needles-demographics summarizes the demographic and behavioural +variables for the participants. + :::{#tbl-needles-demographics} ```{r} @@ -78,7 +81,7 @@ needles |> ``` Demographics and behavioural variables for the needle-sharing dataset -(all participants; the `id` identifier column is omitted). +(all participants). ::: @@ -344,7 +347,7 @@ ggplot(obs_data, aes(x = age, y = rate, color = sex)) + theme(legend.position = "bottom", legend.box = "vertical") ``` ---- +{{< slidebreak >}} ### Count model {.smaller} @@ -363,27 +366,34 @@ glm2 <- equatiomatic::extract_eq(glm2) ``` ---- +{{< slidebreak >}} + +:::{#tbl-needles-count} ```{r} -#| tbl-cap: "Count model (no offset) for needle-sharing data" -#| label: tbl-needles-count +#| code-fold: true glm2 |> parameters(exponentiate = TRUE) |> print_md() ``` +Count model (no offset) for needle-sharing data. + +::: + ## Interpreting the coefficients {.smaller} For a Poisson model with a log link, exponentiating a coefficient turns it into a *ratio*: -$\exp(\beta_j)$ is the multiplicative change in the expected outcome +$\exp{\b_j}$ is the multiplicative change in the expected outcome per one-unit increase in $x_j$, holding the other covariates fixed (this is the rate-ratio interpretation introduced earlier in this chapter). In the **rate model** this ratio is a *rate ratio* (shared syringes per injection); in the **count model** it is a *count ratio* (total shared syringes). +The fitted ratios are reported in @tbl-needles-pois (rate model) and +@tbl-needles-count (count model). ```{r} #| code-fold: true @@ -392,7 +402,7 @@ rr_count <- exp(coef(glm2)) ``` **Baseline (intercept).** -In the rate model, $\exp(\hat\beta_0) = `r round(rr_rate[["(Intercept)"]], 3)`$ +In the rate model, $\exp{\hat\b_0} = `r round(rr_rate[["(Intercept)"]], 3)`$ is the estimated sharing rate per injection for the reference participant (male, not homeless, one drug used), extrapolated to age 0. @@ -401,12 +411,12 @@ rather than a directly interpretable quantity. **Age.** The rate ratio per additional year of age is -$\exp(\hat\beta_{\text{age}}) = `r round(rr_rate[["age"]], 3)`$, +$\exp{\hat\b_{\text{age}}} = `r round(rr_rate[["age"]], 3)`$, i.e., essentially no change in the per-injection sharing rate with age. **Sex.** The rate ratio for female versus male is -$\exp(\hat\beta_{\text{sexF}}) = `r round(rr_rate[["sexF"]], 2)`$: +$\exp{\hat\b_{\text{sexF}}} = `r round(rr_rate[["sexF"]], 2)`$: women share needles at about `r round(rr_rate[["sexF"]], 2)` times the per-injection rate of men with the same age, housing status, and drug-use pattern. @@ -421,7 +431,7 @@ on top of their higher per-injection rate. **Homelessness.** Among participants using one drug, the rate ratio for homeless versus not-homeless is -$\exp(\hat\beta_{\text{homeless}}) = `r round(rr_rate[["homelesshomeless"]], 2)`$, +$\exp{\hat\b_{\text{homeless}}} = `r round(rr_rate[["homelesshomeless"]], 2)`$, versus a count ratio of about `r round(rr_count[["homelesshomeless"]], 2)` in the count model — again, a larger *total* association than *direct* (per-injection) one. From 482b8dbe45d8cf5a4c1ad0e9d6d876396d682a9a Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 13:59:58 -0700 Subject: [PATCH 19/34] Model the count directly: drop the nivdu offset and the interaction MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per Vittinghoff RMB 2e §8.3.1, which models shared_syr as a Poisson count on homelessness with no offset: - Collapse the rate/count two-model structure to a single Poisson count model: shared_syr ~ age + sex + homeless + polydrug (no offset) - Explain why we don't use nivdu as an offset: it's a mediator (not an exposure window) and an offset would impose an untested 1:1 proportionality and block the indirect path; we want total associations - Drop the homeless × polydrug interaction (unidentifiable — no non-homeless multiple-drug user shared any needles) - Rewrite fig-needles-pois to plot fitted counts; rewrite the interpretation section as count ratios (IRRs) with an overdispersion caveat per Vittinghoff - Update the DAG notes (don't adjust for the mediator nivdu) - Drop the interaction from the NB and zero-inflated models too, so the Poisson-vs-NB coefficient comparison aligns Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing-extensions.qmd | 11 +- chapters/exr-needle-sharing.qmd | 225 ++++++++------------- 2 files changed, 93 insertions(+), 143 deletions(-) diff --git a/chapters/exr-needle-sharing-extensions.qmd b/chapters/exr-needle-sharing-extensions.qmd index d3000c723e..63f0783bff 100644 --- a/chapters/exr-needle-sharing-extensions.qmd +++ b/chapters/exr-needle-sharing-extensions.qmd @@ -3,7 +3,7 @@ library(MASS) #need this for glm.nb() glm1.nb = glm.nb( data = needles, - shared_syr ~ age + sex + homeless*polydrug + shared_syr ~ age + sex + homeless + polydrug ) equatiomatic::extract_eq(glm1.nb) @@ -32,8 +32,8 @@ library(glmmTMB) zinf_fit1 = glmmTMB( family = "poisson", data = needles, - formula = shared_syr ~ age + sex + homeless*polydrug, - ziformula = ~ age + sex + homeless + polydrug # fit won't converge with interaction + formula = shared_syr ~ age + sex + homeless + polydrug, + ziformula = ~ age + sex + homeless + polydrug ) zinf_fit1 |> @@ -57,9 +57,8 @@ library(glmmTMB) zinf_fit1 = glmmTMB( family = nbinom2, data = needles, - formula = shared_syr ~ age + sex + homeless*polydrug, - ziformula = ~ age + sex + homeless + polydrug - # fit won't converge with interaction + formula = shared_syr ~ age + sex + homeless + polydrug, + ziformula = ~ age + sex + homeless + polydrug ) zinf_fit1 |> diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index fd1153b9df..8796740d66 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -139,18 +139,15 @@ Demographic factors (`age`, `sex`, `ethn`) and social factors ::: notes In @fig-needles-dag, injection frequency `nivdu` is a **mediator**: each covariate can affect the number of shared syringes both -**directly** (its effect on the propensity to share *per injection*) +**directly** (its effect on the propensity to share per injection) and **indirectly** (through how often the participant injects). -This distinction drives the two models we fit below: - -- Putting `nivdu` in the model as an **offset** (`offset(log(nivdu))`) -conditions on the mediator. The covariate coefficients then describe the -*per-injection sharing rate* — the **direct** path, holding injection -frequency fixed. -- Leaving `nivdu` out of the model lets the covariate coefficients -describe the *total* number of shared syringes — the **total** effect, -including the indirect path through `nivdu`. +Because `nivdu` lies *on* these causal paths, we do **not** adjust for it +(neither as a covariate nor as an offset): conditioning on a mediator +would block the indirect path and leave only a partial effect. +Modeling the count of shared syringes on the covariates alone therefore +estimates each covariate's **total** association with needle-sharing — +the approach taken by @vittinghoff2e [§8.3.1]. This DAG is simplified; a fuller model might include HIV status, depression, or hopelessness. @@ -259,31 +256,41 @@ needles <- needles |> filter(sex != "Trans") ## Model {.smaller} -As @fig-needles-dag suggests, injection frequency `nivdu` is a mediator, -so we fit two complementary Poisson models: - -- a **rate model** that includes `nivdu` as an offset, estimating the -effect of each covariate on the *per-injection* sharing rate -(the direct path, holding injection frequency fixed); and -- a **count model** that omits `nivdu`, estimating the *total* effect of -each covariate on the number of shared syringes -(direct plus the indirect path through `nivdu`). - -### Rate model {.smaller} - -The offset `offset(log(nivdu))` fixes the coefficient of `log(nivdu)` -at 1, which is equivalent to modeling the *rate* of shared syringes -per injection (`shared_syr / nivdu`): +Following @vittinghoff2e [§8.3.1], +we model the **count** of shared syringes directly +with a Poisson regression (log link), +rather than modeling a rate. + +**Why not use `nivdu` as an offset?** +A Poisson *rate* model would add `offset(log(nivdu))`, +forcing the expected count to be proportional to injection frequency +(the coefficient of `log(nivdu)` fixed at exactly 1). +We avoid that here for two reasons: + +- **`nivdu` is a mediator, not an exposure window.** +An offset is appropriate when its variable is a known denominator — +person-time or population at risk — that is external to the causal +question. Injection frequency is instead *caused* by the covariates +(@fig-needles-dag), so conditioning on it (whether as an offset or as a +covariate) would block the indirect path and estimate only a partial +effect. We want each covariate's *total* association with needle-sharing. +- **It imposes an untested proportionality assumption.** +Fixing the `log(nivdu)` coefficient at 1 assumes sharing scales +one-for-one with injecting, which the data need not support. + +Vittinghoff's analysis uses neither an offset nor `nivdu` as a covariate. + +We also use **main effects only**: +a `homeless × polydrug` interaction is not identifiable here, +because no non-homeless participant who used multiple drugs shared any +needles, leaving that cell empty for the outcome. ```{r} -glm1 <- - needles |> - dplyr::filter(nivdu > 0) |> - glm( - family = stats::poisson, - formula = shared_syr ~ age + sex + homeless * polydrug + - offset(log(nivdu)) - ) +glm1 <- glm( + formula = shared_syr ~ age + sex + homeless + polydrug, + family = stats::poisson, + data = needles +) library(equatiomatic) equatiomatic::extract_eq(glm1) ``` @@ -291,7 +298,7 @@ equatiomatic::extract_eq(glm1) --- ```{r} -#| tbl-cap: "Poisson model for needle-sharing data" +#| tbl-cap: "Poisson count model for needle-sharing data" #| label: tbl-needles-pois @@ -304,143 +311,87 @@ glm1 |> --- ```{r} -#| fig-cap: "Fitted Poisson rate model and observed needle-sharing rates" +#| fig-cap: "Fitted Poisson count model and observed shared-syringe counts" #| label: fig-needles-pois #| code-fold: true library(ggplot2) -# Build a prediction grid that varies age and the categorical covariates, -# with nivdu = 1 so the offset log(nivdu) = 0 and predict(type = "response") -# returns the Poisson rate per injection (matching the y-axis scale of -# @fig-needles). +# Predicted expected counts across age and the categorical covariates. +# The model has no offset, so predict(type = "response") returns the +# expected number of shared syringes directly. pred_grid <- with(glm1$model, expand.grid( age = seq(min(age), max(age), length.out = 100), sex = levels(droplevels(sex)), homeless = levels(homeless), - polydrug = levels(polydrug), - nivdu = 1 + polydrug = levels(polydrug) )) -pred_grid$rate <- predict(glm1, newdata = pred_grid, type = "response") - -obs_data <- needles |> - dplyr::filter( - nivdu > 0, - !is.na(homeless), !is.na(polydrug), !is.na(sex), !is.na(age) - ) |> - dplyr::mutate(rate = shared_syr / nivdu) +pred_grid$shared_syr <- predict(glm1, newdata = pred_grid, type = "response") -ggplot(obs_data, aes(x = age, y = rate, color = sex)) + - geom_point(aes(size = nivdu), alpha = 0.4) + +ggplot(glm1$model, aes(x = age, y = shared_syr, color = sex)) + + geom_point(alpha = 0.4) + geom_line(data = pred_grid, linewidth = 0.8) + - scale_size_area(max_size = 4) + facet_grid( rows = vars(homeless), cols = vars(polydrug), labeller = label_both ) + - labs(y = "Shared needles per injection") + - guides( - color = guide_legend(nrow = 1), - size = guide_legend(nrow = 1) - ) + + labs(y = "Number of shared syringes (30 days)") + + guides(color = guide_legend(nrow = 1)) + theme_bw() + - theme(legend.position = "bottom", legend.box = "vertical") -``` - -{{< slidebreak >}} - -### Count model {.smaller} - -Dropping the offset models the *count* of shared syringes directly. -We keep the same sample of injectors (`nivdu > 0`) so the two models are -comparable; the only difference is whether `nivdu` enters as an offset: - -```{r} -glm2 <- - needles |> - dplyr::filter(nivdu > 0) |> - glm( - family = stats::poisson, - formula = shared_syr ~ age + sex + homeless * polydrug - ) -equatiomatic::extract_eq(glm2) -``` - -{{< slidebreak >}} - -:::{#tbl-needles-count} - -```{r} -#| code-fold: true -glm2 |> - parameters(exponentiate = TRUE) |> - print_md() + theme(legend.position = "bottom") ``` -Count model (no offset) for needle-sharing data. - -::: - ## Interpreting the coefficients {.smaller} For a Poisson model with a log link, -exponentiating a coefficient turns it into a *ratio*: -$\exp{\b_j}$ is the multiplicative change in the expected outcome -per one-unit increase in $x_j$, -holding the other covariates fixed -(this is the rate-ratio interpretation introduced earlier in this chapter). -In the **rate model** this ratio is a *rate ratio* (shared syringes per -injection); in the **count model** it is a *count ratio* -(total shared syringes). -The fitted ratios are reported in @tbl-needles-pois (rate model) and -@tbl-needles-count (count model). +exponentiating a coefficient turns it into a *ratio of expected counts* +(an **incidence rate ratio**, IRR, in @vittinghoff2e's terminology): +$\exp{\b_j}$ is the multiplicative change in the expected number of +shared syringes per one-unit increase in $x_j$, +holding the other covariates fixed. +Because we do not condition on the mediator `nivdu`, +each ratio is a *total* association. +The fitted ratios are reported in @tbl-needles-pois. ```{r} #| code-fold: true -rr_rate <- exp(coef(glm1)) -rr_count <- exp(coef(glm2)) +irr <- exp(coef(glm1)) ``` **Baseline (intercept).** -In the rate model, $\exp{\hat\b_0} = `r round(rr_rate[["(Intercept)"]], 3)`$ -is the estimated sharing rate per injection for the reference participant -(male, not homeless, one drug used), -extrapolated to age 0. +$\exp{\hat\b_0} = `r round(irr[["(Intercept)"]], 3)`$ +is the expected count of shared syringes for the reference participant +(male, not homeless, one drug used), extrapolated to age 0. Age 0 is outside the data, so the intercept is a mathematical anchor rather than a directly interpretable quantity. **Age.** -The rate ratio per additional year of age is -$\exp{\hat\b_{\text{age}}} = `r round(rr_rate[["age"]], 3)`$, -i.e., essentially no change in the per-injection sharing rate with age. +$\exp{\hat\b_{\text{age}}} = `r round(irr[["age"]], 3)`$ per year: +essentially no association between age and the expected count. **Sex.** -The rate ratio for female versus male is -$\exp{\hat\b_{\text{sexF}}} = `r round(rr_rate[["sexF"]], 2)`$: -women share needles at about `r round(rr_rate[["sexF"]], 2)` times -the per-injection rate of men with the same age, housing status, and -drug-use pattern. -In the count model the corresponding count ratio is larger, -about `r round(rr_count[["sexF"]], 2)`. -Because the total-count ratio exceeds the per-injection rate ratio, -part of women's higher total sharing operates **indirectly** — -through injecting more often -(the `sex → nivdu → shared_syr` path in @fig-needles-dag) — -on top of their higher per-injection rate. +$\exp{\hat\b_{\text{sexF}}} = `r round(irr[["sexF"]], 2)`$: +women's expected count of shared syringes is about +`r round(irr[["sexF"]], 2)` times men's, +adjusting for age, housing status, and drug-use pattern. **Homelessness.** -Among participants using one drug, -the rate ratio for homeless versus not-homeless is -$\exp{\hat\b_{\text{homeless}}} = `r round(rr_rate[["homelesshomeless"]], 2)`$, -versus a count ratio of about `r round(rr_count[["homelesshomeless"]], 2)` -in the count model — -again, a larger *total* association than *direct* (per-injection) one. - -**Interaction (caution).** -The `homeless × polydrug` cell is sparse, -which produces an extreme, unstable interaction estimate -(a rate ratio in the hundreds of thousands). -We would not interpret that coefficient without more data; -it is a reminder to check cell counts (see @tbl-count-needles) -before trusting an interaction term. +$\exp{\hat\b_{\text{homeless}}} = `r round(irr[["homelesshomeless"]], 2)`$: +homeless participants share syringes about +`r round(irr[["homelesshomeless"]], 2)` times as often as the non-homeless, +adjusting for the other covariates. +@vittinghoff2e [§8.3.1] reports a comparable unadjusted IRR of about 3.3 +for homelessness in a single-predictor model. + +**Multiple-drug use.** +$\exp{\hat\b_{\text{polydrug}}} = `r round(irr[["polydrugmultiple drugs used"]], 2)`$: +multiple-drug users have a *lower* expected count than single-drug users, +though only `r sum(glm1$model[["polydrug"]] == "multiple drugs used")` +participants used multiple drugs, so this estimate is imprecise. + +**Overdispersion caveat.** +These data are highly overdispersed — the variance far exceeds the mean — +so the model-based Poisson standard errors are too small. +@vittinghoff2e [§8.3.1] recommends robust standard errors or a +negative-binomial model for this example. From 576db1d65661d1b71041445f505abaa9ed836332 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 14:07:35 -0700 Subject: [PATCH 20/34] Add dataset introduction; plot counts in fig-needles - Add an Introduction section adapting the needle-sharing dataset writeup from the rmb dataset vignette (HIV/HCV transmission context, homelessness and drug-use risk factors, study/source/citation, outcome definition), framed for the count analysis (no log-transform/linear-model specifics) - Switch fig-needles from a per-injection rate to the shared-syringe count, consistent with the count model Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 8796740d66..fd09c7eba7 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -45,11 +45,23 @@ needles <- ) ``` -Dataset source: Chapter 8 of *Regression Methods in Biostatistics*, 2e -([UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems)) — -a longitudinal panel study of injection drug users measuring -repeated needle-sharing behaviour. -$n = 128$ participants, 17 variables. +## Introduction + +Needle sharing among injection drug users (IDUs) +is a primary route of HIV and hepatitis C transmission. +Unstable housing (homelessness) and patterns of drug use +have been proposed as structural drivers of risk behavior, +including receptive syringe sharing, +while injection frequency reflects overall exposure and addiction severity. + +These data come from a longitudinal panel study of injection drug users +measuring repeated needle-sharing behavior, +analyzed in @vittinghoff2e [§8.3.1] +and available from the +[UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems). +There are $n = 128$ participants and 17 variables. +Our outcome is `shared_syr`, +the number of times a participant shared a syringe in the past 30 days. :::{#tbl-needles-data-dict} @@ -162,7 +174,7 @@ needles --- ```{r} -#| fig-cap: "Rates of needle sharing" +#| fig-cap: "Needle-sharing counts by age" #| label: fig-needles library(ggplot2) @@ -171,7 +183,7 @@ needles |> ggplot( aes( x = age, - y = shared_syr / nivdu, + y = shared_syr, shape = sex, col = ethn ) @@ -179,7 +191,7 @@ needles |> geom_point(aes(size = nivdu), alpha = .5) + scale_size_area(max_size = 4) + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + - labs(y = "Shared needles per injection") + + labs(y = "Number of shared syringes (30 days)") + theme(legend.position = "bottom") ``` From f7756c1e901e2e70044723d1d7982c9c9b425924 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 14:16:55 -0700 Subject: [PATCH 21/34] Address review: cite/soften intro claims, cross-ref fig-needles, drop filter MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Reframe the Introduction: attribute the example to @vittinghoff2e [§8.3.1] and soften the uncited epidemiological assertions (no bib entry supports a "primary route of HIV/HCV transmission" claim, and I won't fabricate one) - Add a prose @fig-needles cross-reference - Drop the now-cosmetic filter(nivdu > 0) from fig-needles so the EDA sample matches the count model's Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index fd09c7eba7..2e433217f0 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -47,17 +47,18 @@ needles <- ## Introduction -Needle sharing among injection drug users (IDUs) -is a primary route of HIV and hepatitis C transmission. -Unstable housing (homelessness) and patterns of drug use -have been proposed as structural drivers of risk behavior, -including receptive syringe sharing, -while injection frequency reflects overall exposure and addiction severity. +This example, from @vittinghoff2e [§8.3.1], +models *risky drug-use behavior* — needle sharing — among +injection drug users (IDUs), +a behavior of public-health interest because shared syringes can transmit +blood-borne infections. +We examine how needle-sharing frequency relates to participants' age, sex, +housing status (homelessness), and drug-use patterns; +injection frequency reflects each participant's overall exposure. These data come from a longitudinal panel study of injection drug users measuring repeated needle-sharing behavior, -analyzed in @vittinghoff2e [§8.3.1] -and available from the +available from the [UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems). There are $n = 128$ participants and 17 variables. Our outcome is `shared_syr`, @@ -173,13 +174,15 @@ needles --- +@fig-needles plots the shared-syringe counts against age, +split by the covariate subgroups (point size shows injection frequency). + ```{r} #| fig-cap: "Needle-sharing counts by age" #| label: fig-needles library(ggplot2) needles |> - dplyr::filter(nivdu > 0) |> ggplot( aes( x = age, From 4cb8fe6bab91d2cc8c9cfb60b25132e97f81f52b Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 14:22:51 -0700 Subject: [PATCH 22/34] Reword "exposure" in intro to avoid offset connotation Address the non-blocking review note: "exposure" carries a person-time / offset meaning; reword to "how often each participant has the opportunity to share a syringe." Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 2e433217f0..9fd43367b2 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -54,7 +54,8 @@ a behavior of public-health interest because shared syringes can transmit blood-borne infections. We examine how needle-sharing frequency relates to participants' age, sex, housing status (homelessness), and drug-use patterns; -injection frequency reflects each participant's overall exposure. +injection frequency reflects how often each participant has the +opportunity to share a syringe. These data come from a longitudinal panel study of injection drug users measuring repeated needle-sharing behavior, From 5788ac538a32b882671031c85ecd43d046238ffc Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 14:46:16 -0700 Subject: [PATCH 23/34] Add aside: design-determined/exogenous exposure vs. behavioral mediator MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Make the offset criterion explicit — an offset's exposure is usually a design-fixed, exogenous denominator (person-time, population, area, trials); when the denominator is a behavior the predictors also drive (injection frequency), it's a mediator and we model the count directly. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 9fd43367b2..c353835ba8 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -296,6 +296,18 @@ one-for-one with injecting, which the data need not support. Vittinghoff's analysis uses neither an offset nor `nivdu` as a covariate. +::: notes +**Rule of thumb.** +An offset's exposure is usually something fixed by the *study design* — +person-time of follow-up, population at risk, area surveyed, number of +trials — a known denominator that is *exogenous* to the question +(not caused by the predictors of interest). +When the candidate denominator is instead something the participants +*did*, and the predictors also influence it (here, how often they inject), +it is a mediator rather than an exposure window, and we model the count +directly instead of dividing by it. +::: + We also use **main effects only**: a `homeless × polydrug` interaction is not identifiable here, because no non-homeless participant who used multiple drugs shared any From a500a711ef6b59ac04f2c5502e019a4143f2ce88 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 3 Jun 2026 01:33:37 +0000 Subject: [PATCH 24/34] Add hivstat to model+figures and expand DAG to all dataset variables MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Model: - glm1 now adjusts for hivstat: shared_syr ~ age + sex + homeless + polydrug + hivstat. The fitted IRR ≈ 0.18 (n = 8 HIV+) gets a new interpretation paragraph; the baseline-intercept prose now names HIV-negative as the reference level. Figures: - fig-needles and fig-needles-raw: color by hivstat (replacing ethn, which isn't in the model), keep shape = sex. - fig-needles-pois: predict over an age × sex × homeless × polydrug × hivstat grid; map color = hivstat to points and lines and linetype = sex to the prediction lines so all 4 (sex × hivstat) lines are distinguishable per panel. DAG (fig-needles-dag): - Add the four remaining non-transformation, non-ID dataset variables — hivstat, sexabuse, dprsn_dx, hplsns — each with arrows to nivdu and shared_syr, mirroring the structure used for the existing covariates. - Attach data-dictionary labels via dagitty's labels(); switch the plot to ggdag(use_labels = "label", text = TRUE) so each node shows both its variable name and its human-readable label. - Refresh the figure caption to list all four covariate groups (demographic, social, mental-health, serostatus). Verified locally that the model + fig-needles-pois grid + predict() sequence runs cleanly. The DAG chunk uses dagitty + ggdag (both pinned in renv) and can't be exercised in this sandbox; relying on CI build to catch any rendering issue there. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing.qmd | 75 +++++++++++++++++++++++++-------- 1 file changed, 58 insertions(+), 17 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index c353835ba8..f498b092d9 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -112,14 +112,19 @@ variables. #| code-fold: true library(dagitty) +library(ggdag) needles_dag <- dagitty( 'dag { - age [pos="0,-2"] - sex [pos="0,-1"] - ethn [pos="0,0"] - homeless [pos="0,1"] - polydrug [pos="0,2"] + age [pos="0,-4"] + sex [pos="0,-3"] + ethn [pos="0,-2"] + homeless [pos="0,-1"] + polydrug [pos="0,0"] + hivstat [pos="0,1"] + sexabuse [pos="0,2"] + dprsn_dx [pos="0,3"] + hplsns [pos="0,4"] nivdu [pos="2,0"] shared_syr [outcome, pos="4,0"] @@ -128,25 +133,53 @@ needles_dag <- dagitty( ethn -> nivdu homeless -> nivdu polydrug -> nivdu + hivstat -> nivdu + sexabuse -> nivdu + dprsn_dx -> nivdu + hplsns -> nivdu age -> shared_syr sex -> shared_syr ethn -> shared_syr homeless -> shared_syr polydrug -> shared_syr + hivstat -> shared_syr + sexabuse -> shared_syr + dprsn_dx -> shared_syr + hplsns -> shared_syr nivdu -> shared_syr }' ) -plot(needles_dag) +# Pair each variable name with the human-readable label from the data +# dictionary so the DAG shows both. +labels(needles_dag) <- c( + age = "Age (years)", + sex = "Biological sex", + ethn = "Ethnicity", + homeless = "Homeless", + polydrug = "Multiple drugs used", + hivstat = "HIV serostatus", + sexabuse = "Sexual abuse history", + dprsn_dx = "Depression diagnosis", + hplsns = "Hopelessness score", + nivdu = "Injections (30 d)", + shared_syr = "Shared syringes (30 d)" +) + +ggdag(needles_dag, use_labels = "label", text = TRUE) + + theme_dag() ``` Hypothesized causal structure for the needle-sharing example. -Demographic factors (`age`, `sex`, `ethn`) and social factors -(`homeless`, `polydrug`) influence both how often a participant injects -(`nivdu`) and how many of those injections involve a shared syringe -(`shared_syr`, the outcome). +Demographic (`age`, `sex`, `ethn`), social (`homeless`, `polydrug`), +mental-health (`dprsn_dx`, `hplsns`, `sexabuse`), and serostatus +(`hivstat`) factors all plausibly influence both how often a participant +injects (`nivdu`) and how many of those injections involve a shared +syringe (`shared_syr`, the outcome). +Each node carries both its variable name (top) and its data-dictionary +label. ::: @@ -189,7 +222,7 @@ needles |> x = age, y = shared_syr, shape = sex, - col = ethn + col = hivstat ) ) + geom_point(aes(size = nivdu), alpha = .5) + @@ -219,7 +252,7 @@ needles |> x = nivdu, y = shared_syr, shape = sex, - col = ethn + col = hivstat ) ) + geom_point(alpha = .5) + @@ -315,7 +348,7 @@ needles, leaving that cell empty for the outcome. ```{r} glm1 <- glm( - formula = shared_syr ~ age + sex + homeless + polydrug, + formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, family = stats::poisson, data = needles ) @@ -351,13 +384,14 @@ pred_grid <- with(glm1$model, expand.grid( age = seq(min(age), max(age), length.out = 100), sex = levels(droplevels(sex)), homeless = levels(homeless), - polydrug = levels(polydrug) + polydrug = levels(polydrug), + hivstat = levels(hivstat) )) pred_grid$shared_syr <- predict(glm1, newdata = pred_grid, type = "response") -ggplot(glm1$model, aes(x = age, y = shared_syr, color = sex)) + +ggplot(glm1$model, aes(x = age, y = shared_syr, color = hivstat, shape = sex)) + geom_point(alpha = 0.4) + - geom_line(data = pred_grid, linewidth = 0.8) + + geom_line(data = pred_grid, aes(linetype = sex), linewidth = 0.8) + facet_grid( rows = vars(homeless), cols = vars(polydrug), @@ -389,7 +423,7 @@ irr <- exp(coef(glm1)) **Baseline (intercept).** $\exp{\hat\b_0} = `r round(irr[["(Intercept)"]], 3)`$ is the expected count of shared syringes for the reference participant -(male, not homeless, one drug used), extrapolated to age 0. +(male, not homeless, one drug used, HIV-negative), extrapolated to age 0. Age 0 is outside the data, so the intercept is a mathematical anchor rather than a directly interpretable quantity. @@ -417,6 +451,13 @@ multiple-drug users have a *lower* expected count than single-drug users, though only `r sum(glm1$model[["polydrug"]] == "multiple drugs used")` participants used multiple drugs, so this estimate is imprecise. +**HIV serostatus.** +$\exp{\hat\b_{\text{hivstatHIV+}}} = `r round(irr[["hivstatHIV+"]], 2)`$: +HIV-positive participants have a lower expected count of shared syringes +than HIV-negative participants, adjusting for the other covariates. +Only `r sum(glm1$model[["hivstat"]] == "HIV+")` participants are HIV+ +in this sample, so this estimate is imprecise. + **Overdispersion caveat.** These data are highly overdispersed — the variance far exceeds the mean — so the model-based Poisson standard errors are too small. From cb604433e6d0c423ad477cd41abf3d461d786b3b Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 3 Jun 2026 01:42:41 +0000 Subject: [PATCH 25/34] DAG: revert to plot(needles_dag); inline labels in caption a500a71's DAG chunk failed render with: ! could not find function "labels<-" The pinned dagitty in renv doesn't expose a labels<-() setter, so ggdag(needles_dag, use_labels = "label") can't pull node labels via that path. Going for the safest fix that still surfaces the variable labels: keep the expanded DAG structure (all 11 non-transformation, non-ID dataset variables), revert plotting to plot(needles_dag), and inline each variable's data-dictionary label parenthetically in the figure caption. The DAG figure now shows the variable names on the nodes; the caption gives the human-readable labels next to each variable. If we later want both name and label on each node, that needs a tidy-dagitty/ggdag pattern that's been verified against the pinned ggdag version. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing.qmd | 36 +++++++++------------------------ 1 file changed, 10 insertions(+), 26 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index f498b092d9..eaa1c1a110 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -112,7 +112,6 @@ variables. #| code-fold: true library(dagitty) -library(ggdag) needles_dag <- dagitty( 'dag { @@ -152,34 +151,19 @@ needles_dag <- dagitty( }' ) -# Pair each variable name with the human-readable label from the data -# dictionary so the DAG shows both. -labels(needles_dag) <- c( - age = "Age (years)", - sex = "Biological sex", - ethn = "Ethnicity", - homeless = "Homeless", - polydrug = "Multiple drugs used", - hivstat = "HIV serostatus", - sexabuse = "Sexual abuse history", - dprsn_dx = "Depression diagnosis", - hplsns = "Hopelessness score", - nivdu = "Injections (30 d)", - shared_syr = "Shared syringes (30 d)" -) - -ggdag(needles_dag, use_labels = "label", text = TRUE) + - theme_dag() +plot(needles_dag) ``` Hypothesized causal structure for the needle-sharing example. -Demographic (`age`, `sex`, `ethn`), social (`homeless`, `polydrug`), -mental-health (`dprsn_dx`, `hplsns`, `sexabuse`), and serostatus -(`hivstat`) factors all plausibly influence both how often a participant -injects (`nivdu`) and how many of those injections involve a shared -syringe (`shared_syr`, the outcome). -Each node carries both its variable name (top) and its data-dictionary -label. +Demographic (`age` — age in years; `sex` — biological sex; `ethn` — ethnicity), +social (`homeless`; `polydrug` — multiple drugs used), +mental-health (`sexabuse` — sexual-abuse history; `dprsn_dx` — depression diagnosis; +`hplsns` — hopelessness score), +and serostatus (`hivstat` — HIV serostatus) factors +all plausibly influence both how often a participant injects +(`nivdu` — injections in 30 days) +and how many of those injections involve a shared syringe +(`shared_syr` — shared syringes in 30 days; the outcome). ::: From b3ea696e066a56303b7d9c37e36a857610943023 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 3 Jun 2026 01:51:27 +0000 Subject: [PATCH 26/34] Add hivstat to the NB and zero-inflated models too cb60443's build-deploy crashed in chapters/count-regression.qmd at tbl-compare-poisson-nb because that chunk binds coef(glm1) and coef(glm1.nb) into one tibble, but glm1 now has 6 coefficients (after adding hivstat) while glm1.nb was still on the old 5-coef formula. Add hivstat to (a) the NB model used by tbl-compare-poisson-nb and (b) both the count and zero-inflation formulas of the zero-inflated Poisson / zero-inflated NB models in tbl-zeroinf-poisson and tbl-zeroinf-nb so all four follow-up models are aligned with glm1. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing-extensions.qmd | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/chapters/exr-needle-sharing-extensions.qmd b/chapters/exr-needle-sharing-extensions.qmd index 63f0783bff..58d774e6c5 100644 --- a/chapters/exr-needle-sharing-extensions.qmd +++ b/chapters/exr-needle-sharing-extensions.qmd @@ -3,7 +3,7 @@ library(MASS) #need this for glm.nb() glm1.nb = glm.nb( data = needles, - shared_syr ~ age + sex + homeless + polydrug + shared_syr ~ age + sex + homeless + polydrug + hivstat ) equatiomatic::extract_eq(glm1.nb) @@ -32,8 +32,8 @@ library(glmmTMB) zinf_fit1 = glmmTMB( family = "poisson", data = needles, - formula = shared_syr ~ age + sex + homeless + polydrug, - ziformula = ~ age + sex + homeless + polydrug + formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, + ziformula = ~ age + sex + homeless + polydrug + hivstat ) zinf_fit1 |> @@ -57,8 +57,8 @@ library(glmmTMB) zinf_fit1 = glmmTMB( family = nbinom2, data = needles, - formula = shared_syr ~ age + sex + homeless + polydrug, - ziformula = ~ age + sex + homeless + polydrug + formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, + ziformula = ~ age + sex + homeless + polydrug + hivstat ) zinf_fit1 |> From 543355419276679bd01b5ec1aee97df2ee0ae191 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 19:07:33 -0700 Subject: [PATCH 27/34] Soften unsourced "longitudinal panel study" claim in needle-sharing intro MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The data's provenance dead-ends at the UCSF companion website download: the textbook (§8.3.1), the companion site, and the rmb package doc all cite no primary study, and the .dta carries no embedded citation. The data as distributed is also one row per participant (n=128, single 30-day window), not repeated measures. Replace the unverifiable "longitudinal panel study measuring repeated needle-sharing behavior" with a defensible description and an explicit note that the original study is undocumented. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index eaa1c1a110..65099f5e89 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -57,10 +57,11 @@ housing status (homelessness), and drug-use patterns; injection frequency reflects how often each participant has the opportunity to share a syringe. -These data come from a longitudinal panel study of injection drug users -measuring repeated needle-sharing behavior, -available from the -[UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems). +These data are a sample of injection drug users +distributed with @vittinghoff2e [§8.3.1] +via the +[UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems); +the original study is not documented in the textbook or the companion materials. There are $n = 128$ participants and 17 variables. Our outcome is `shared_syr`, the number of times a participant shared a syringe in the past 30 days. From aab866d232078918ef63c7de03d05395c11ae60d Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Tue, 2 Jun 2026 20:15:26 -0700 Subject: [PATCH 28/34] Note Tsui et al. (2009) as a possible (unconfirmed) data source MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add @tsui2009hcv (Drug Alcohol Depend) to the needle-sharing intro as a possible source: a study of injection drug users in San Francisco co-authored by RMB author E. Vittinghoff. Flagged unconfirmed — the data's provenance remains undocumented and the variable battery isn't a verified match. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 3 +++ references.bib | 11 +++++++++++ 2 files changed, 14 insertions(+) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 65099f5e89..4d16e768d1 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -62,6 +62,9 @@ distributed with @vittinghoff2e [§8.3.1] via the [UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems); the original study is not documented in the textbook or the companion materials. +One possible (though unconfirmed) source is @tsui2009hcv, +a study of injection-drug-using participants in San Francisco +co-authored by one of the textbook's authors (E. Vittinghoff). There are $n = 128$ participants and 17 variables. Our outcome is `shared_syr`, the number of times a participant shared a syringe in the past 30 days. diff --git a/references.bib b/references.bib index c4c0be3eda..19127ce6c8 100644 --- a/references.bib +++ b/references.bib @@ -1689,3 +1689,14 @@ @article{debruyn2011mira year={2011}, doi={10.1136/sti.2010.044990} } + +@article{tsui2009hcv, + title={Risk behaviors after hepatitis C virus seroconversion in young injection drug users in {San Francisco}}, + author={Tsui, Judith I. and Vittinghoff, Eric and Hahn, Judith A. and Evans, Jennifer L. and Davidson, Peter J. and Page, Kimberly}, + journal={Drug and Alcohol Dependence}, + volume={105}, + number={1-2}, + pages={160--163}, + year={2009}, + doi={10.1016/j.drugalcdep.2009.05.022} +} From a95e486ec3a0914db019d038c58e5f45d877b7ac Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 3 Jun 2026 18:36:54 +0000 Subject: [PATCH 29/34] Address #860 review: hivstat consistency + ziformula safety MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per the latest @claude review's four findings: 1. Intro covariate list (line ~55): add HIV serostatus alongside age, sex, housing status, and drug-use patterns; hivstat is in glm1 now and was conspicuously missing from the overview sentence. 2. Stale "simplified DAG" note (line ~187): the previous note said a fuller model could add HIV status, depression, hopelessness — but the DAG already shows all three (and glm1 includes hivstat). Replaced with a note that explains which DAG nodes the fitted model does and doesn't condition on, and points to the mental-health nodes as an extension exercise. 3. Sex IRR paragraph (line ~426): drop the hard-coded adjustment list ("age, housing status, and drug-use pattern") and use the generic "the other covariates" phrasing the homelessness and polydrug paragraphs already use — now consistent and won't drift if the covariate set changes again. 4. Zero-inflation ziformulas (extensions, lines 36 and 61): drop hivstat from the zero-inflation submodel — with only ~8/128 HIV+ participants this risks the same convergence failure the prior code explicitly avoided for the homeless*polydrug interaction. Keep hivstat in the count formula and add a comment recording the rationale. https://claude.ai/code/session_01PtDgktdMme5SZtze1EGx9U --- chapters/exr-needle-sharing-extensions.qmd | 10 ++++++++-- chapters/exr-needle-sharing.qmd | 10 ++++++---- 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/chapters/exr-needle-sharing-extensions.qmd b/chapters/exr-needle-sharing-extensions.qmd index 58d774e6c5..5e2f029603 100644 --- a/chapters/exr-needle-sharing-extensions.qmd +++ b/chapters/exr-needle-sharing-extensions.qmd @@ -33,7 +33,10 @@ zinf_fit1 = glmmTMB( family = "poisson", data = needles, formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, - ziformula = ~ age + sex + homeless + polydrug + hivstat + # hivstat omitted from ziformula because only ~8 of 128 participants are + # HIV+, which risks convergence failure in the zero-inflation submodel + # (mirroring why the homeless*polydrug interaction was avoided above). + ziformula = ~ age + sex + homeless + polydrug ) zinf_fit1 |> @@ -58,7 +61,10 @@ zinf_fit1 = glmmTMB( family = nbinom2, data = needles, formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, - ziformula = ~ age + sex + homeless + polydrug + hivstat + # hivstat omitted from ziformula because only ~8 of 128 participants are + # HIV+, which risks convergence failure in the zero-inflation submodel + # (mirroring why the homeless*polydrug interaction was avoided above). + ziformula = ~ age + sex + homeless + polydrug ) zinf_fit1 |> diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 4d16e768d1..92412f7532 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -53,7 +53,7 @@ injection drug users (IDUs), a behavior of public-health interest because shared syringes can transmit blood-borne infections. We examine how needle-sharing frequency relates to participants' age, sex, -housing status (homelessness), and drug-use patterns; +housing status (homelessness), drug-use patterns, and HIV serostatus; injection frequency reflects how often each participant has the opportunity to share a syringe. @@ -184,8 +184,10 @@ Modeling the count of shared syringes on the covariates alone therefore estimates each covariate's **total** association with needle-sharing — the approach taken by @vittinghoff2e [§8.3.1]. -This DAG is simplified; a fuller model might include -HIV status, depression, or hopelessness. +This DAG depicts the hypothesized causal structure; the fitted Poisson +model in @tbl-needles-pois adjusts only for the demographic, social, +and serostatus subset, not for the mental-health nodes (`dprsn_dx`, +`hplsns`, `sexabuse`) — extending the model is left as an exercise. ::: ```{r} @@ -423,7 +425,7 @@ essentially no association between age and the expected count. $\exp{\hat\b_{\text{sexF}}} = `r round(irr[["sexF"]], 2)`$: women's expected count of shared syringes is about `r round(irr[["sexF"]], 2)` times men's, -adjusting for age, housing status, and drug-use pattern. +adjusting for the other covariates. **Homelessness.** $\exp{\hat\b_{\text{homeless}}} = `r round(irr[["homelesshomeless"]], 2)`$: From b00607fcc300fcc52233e122dc0c724b43f5757a Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Wed, 3 Jun 2026 12:39:37 -0700 Subject: [PATCH 30/34] Fix offset xref; drop ZI covariates; pseudo-log count axes; un-nest aes - Migrate def-offset into the rendered count chapter and link it with @def-offset (the old poisson.qmd#def-offset target is an orphan chapter that never renders, so the link was dead). - tbl-zeroinf-nb: reduce the zero-inflation submodel to ziformula = ~ 1. With all four covariates the NB overdispersion and the zero-inflation component compete to explain the zeros, the logistic submodel separates, and the odds ratios run off to +/-Inf. The DAG also gives no covariates to a structural-zero class. Add a note explaining both reasons. - Put shared_syr (count) y-axes on a pseudo_log scale in fig-needles, fig-needles-raw, and fig-needles-pois (handles the many zeros). For fig-needles-raw, transform both axes so the y = x reference line stays straight. - Un-nest aes() out of ggplot() in those three figures. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- .../count-regression/_sec_pois-reg_intro.qmd | 10 +++- chapters/exr-needle-sharing-extensions.qmd | 36 +++++++++++--- chapters/exr-needle-sharing.qmd | 47 +++++++++++++------ 3 files changed, 71 insertions(+), 22 deletions(-) diff --git a/_subfiles/count-regression/_sec_pois-reg_intro.qmd b/_subfiles/count-regression/_sec_pois-reg_intro.qmd index fc2b957702..4949e23d54 100644 --- a/_subfiles/count-regression/_sec_pois-reg_intro.qmd +++ b/_subfiles/count-regression/_sec_pois-reg_intro.qmd @@ -69,7 +69,15 @@ $$ In contrast with the other covariates (represented by $\vX$), $t$ enters this expression with a $\log{}$ transformation and without a corresponding $\beta$ coefficient; -in other words, $\logf{t}$ is an [offset term](poisson.qmd#def-offset). +in other words, $\logf{t}$ is an **offset** term (see @def-offset). +::: + +:::{#def-offset} +#### Offset + +When the linear component of a model involves a term without an unknown coefficient, +that term is called an **offset**. + ::: --- diff --git a/chapters/exr-needle-sharing-extensions.qmd b/chapters/exr-needle-sharing-extensions.qmd index 5e2f029603..456df67885 100644 --- a/chapters/exr-needle-sharing-extensions.qmd +++ b/chapters/exr-needle-sharing-extensions.qmd @@ -57,19 +57,43 @@ Another R package for zero-inflated models is [`pscl`](https://cran.r-project.or #| tbl-cap: "Zero-inflated negative binomial model" #| label: tbl-zeroinf-nb library(glmmTMB) -zinf_fit1 = glmmTMB( +zinf_fit_nb = glmmTMB( family = nbinom2, data = needles, formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, - # hivstat omitted from ziformula because only ~8 of 128 participants are - # HIV+, which risks convergence failure in the zero-inflation submodel - # (mirroring why the homeless*polydrug interaction was avoided above). - ziformula = ~ age + sex + homeless + polydrug + # Intercept-only zero-inflation submodel (see note below). + ziformula = ~ 1 ) -zinf_fit1 |> +zinf_fit_nb |> parameters(exponentiate = TRUE) |> print_md() ``` +::: notes +Unlike the zero-inflated Poisson model above, +here the zero-inflation submodel uses only an intercept (`ziformula = ~ 1`). +Two reasons to drop the covariates from this submodel: + +- **The DAG gives it no covariates.** +@fig-needles-dag posits a *single* process +in which the covariates act on `shared_syr` +(directly and through the mediator `nivdu`); +it does not posit a separate latent class of "structural never-sharers" +for those covariates to predict. +So a covariate-laden zero-inflation submodel +is not identified by the assumed causal structure. + +- **The data cannot identify it either.** +The negative binomial's overdispersion parameter and the zero-inflation component +both compete to explain the excess zeros, +so with all four covariates in the zero-inflation submodel +the logistic fit *separates*: +its coefficients diverge and the exponentiated estimates (odds ratios) +run off to $\pm\infty$. +Reducing the zero-inflation submodel to a single constant +removes the separation +while still allowing a fixed fraction of structural zeros. +::: + diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 92412f7532..9d928c6ce2 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -207,16 +207,19 @@ split by the covariate subgroups (point size shows injection frequency). library(ggplot2) needles |> - ggplot( - aes( - x = age, - y = shared_syr, - shape = sex, - col = hivstat - ) + ggplot() + + aes( + x = age, + y = shared_syr, + shape = sex, + col = hivstat ) + geom_point(aes(size = nivdu), alpha = .5) + scale_size_area(max_size = 4) + + scale_y_continuous( + transform = "pseudo_log", + breaks = c(0, 1, 3, 10, 30, 60) + ) + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + labs(y = "Number of shared syringes (30 days)") + theme(legend.position = "bottom") @@ -237,16 +240,25 @@ where every injection involves a shared syringe. library(ggplot2) needles |> - ggplot( - aes( - x = nivdu, - y = shared_syr, - shape = sex, - col = hivstat - ) + ggplot() + + aes( + x = nivdu, + y = shared_syr, + shape = sex, + col = hivstat ) + geom_point(alpha = .5) + geom_abline(slope = 1, intercept = 0, linetype = "dashed", alpha = .5) + + # both axes are counts, so pseudo-log both: a shared transform keeps the + # y = x reference line straight (data diagonal -> panel diagonal) + scale_x_continuous( + transform = "pseudo_log", + breaks = c(0, 3, 10, 30, 100, 300, 900) + ) + + scale_y_continuous( + transform = "pseudo_log", + breaks = c(0, 1, 3, 10, 30, 60) + ) + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + labs( x = "Number of injections (in 30 days)", @@ -379,9 +391,14 @@ pred_grid <- with(glm1$model, expand.grid( )) pred_grid$shared_syr <- predict(glm1, newdata = pred_grid, type = "response") -ggplot(glm1$model, aes(x = age, y = shared_syr, color = hivstat, shape = sex)) + +ggplot(glm1$model) + + aes(x = age, y = shared_syr, color = hivstat, shape = sex) + geom_point(alpha = 0.4) + geom_line(data = pred_grid, aes(linetype = sex), linewidth = 0.8) + + scale_y_continuous( + transform = "pseudo_log", + breaks = c(0, 1, 3, 10, 30, 60) + ) + facet_grid( rows = vars(homeless), cols = vars(polydrug), From e433d8ebedd63102ef2008f9429593c2d94f83a3 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Wed, 3 Jun 2026 12:51:45 -0700 Subject: [PATCH 31/34] Show covariate causal structure in fig-needles-dag Rebuild the needle-sharing DAG to depict structure among the covariates: - Birth-fixed variables (age, sex, ethn) stay exogenous. - The six time-varying variables (homeless, polydrug, hivstat, sexabuse, dprsn_dx, hplsns) are joined pairwise by bidirected edges (<->), denoting associations whose causal direction is unknown. Lay the time-varying nodes on a hexagon and render with ggdag so the bidirected web is legible; explain the bidirected edges in the caption and notes. Add "bidirected" and "leftrightarrow" to WORDLIST. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 99 ++++++++++++++++++++------------- inst/WORDLIST | 2 + 2 files changed, 62 insertions(+), 39 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index 9d928c6ce2..cc5ce1a3c8 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -111,51 +111,61 @@ variables. ::: {#fig-needles-dag} ```{r} -#| fig-width: 7 -#| fig-height: 5 +#| fig-width: 8 +#| fig-height: 5.5 #| code-fold: true library(dagitty) +library(ggdag) + +# Variables fixed at birth -- or, like age, a deterministic function of birth +# date and interview date -- are exogenous: any causal arrow points away from +# them. +birthfixed <- c("age", "sex", "ethn") + +# Time-varying variables: we have no basis for ordering them causally, so we +# connect every pair with a bidirected edge (<->), denoting an association of +# unknown direction (e.g. an unmeasured common cause). +timevarying <- c( + "homeless", "polydrug", "hivstat", "sexabuse", "dprsn_dx", "hplsns" +) -needles_dag <- dagitty( - 'dag { - age [pos="0,-4"] - sex [pos="0,-3"] - ethn [pos="0,-2"] - homeless [pos="0,-1"] - polydrug [pos="0,0"] - hivstat [pos="0,1"] - sexabuse [pos="0,2"] - dprsn_dx [pos="0,3"] - hplsns [pos="0,4"] - nivdu [pos="2,0"] - shared_syr [outcome, pos="4,0"] - - age -> nivdu - sex -> nivdu - ethn -> nivdu - homeless -> nivdu - polydrug -> nivdu - hivstat -> nivdu - sexabuse -> nivdu - dprsn_dx -> nivdu - hplsns -> nivdu - - age -> shared_syr - sex -> shared_syr - ethn -> shared_syr - homeless -> shared_syr - polydrug -> shared_syr - hivstat -> shared_syr - sexabuse -> shared_syr - dprsn_dx -> shared_syr - hplsns -> shared_syr - - nivdu -> shared_syr - }' +covariates <- c(birthfixed, timevarying) + +needles_dag <- dagitty(paste0("dag {", paste(c( + # every covariate may affect needle-sharing directly and through `nivdu` + paste(covariates, "-> nivdu"), + paste(covariates, "-> shared_syr"), + "nivdu -> shared_syr", + # mutual associations of unknown direction among the time-varying variables + combn(timevarying, 2, function(p) paste(p[1], "<->", p[2])) +), collapse = "; "), "}")) + +# lay the time-varying variables on a hexagon so the bidirected web is legible +hex_x <- c(0.6, 2, 2, 0.6, -0.8, -0.8) +hex_y <- c(4, 3, 1, 0, 1, 3) +coordinates(needles_dag) <- list( + x = c( + age = -4, sex = -4, ethn = -4, + setNames(hex_x, timevarying), nivdu = 4.5, shared_syr = 6.5 + ), + y = c( + age = 3, sex = 2, ethn = 1, + setNames(hex_y, timevarying), nivdu = 2, shared_syr = 2 + ) ) -plot(needles_dag) +needles_dag |> + tidy_dagitty() |> + ggplot() + + aes(x = x, y = y, xend = xend, yend = yend) + + geom_dag_edges( + start_cap = ggraph::circle(6, "mm"), + end_cap = ggraph::circle(6, "mm") + ) + + geom_dag_point(colour = "grey85") + + geom_dag_text(colour = "black", size = 2.3) + + theme_dag() ``` Hypothesized causal structure for the needle-sharing example. @@ -168,6 +178,11 @@ all plausibly influence both how often a participant injects (`nivdu` — injections in 30 days) and how many of those injections involve a shared syringe (`shared_syr` — shared syringes in 30 days; the outcome). +Variables fixed at birth (`age`, `sex`, `ethn`) are treated as exogenous, +while the time-varying variables +(`homeless`, `polydrug`, `hivstat`, `sexabuse`, `dprsn_dx`, `hplsns`) +are joined by bidirected arrows ($\leftrightarrow$), +denoting likely associations whose causal direction is unknown. ::: @@ -184,6 +199,12 @@ Modeling the count of shared syringes on the covariates alone therefore estimates each covariate's **total** association with needle-sharing — the approach taken by @vittinghoff2e [§8.3.1]. +The bidirected arrows among the time-varying variables say that we expect them +to be associated but cannot order them causally (each pair may share an +unmeasured common cause). They are an honest admission of ignorance, not extra +data: with the covariates mutually entangled in this way, a single fitted +coefficient is an **adjusted association**, not an isolated causal effect. + This DAG depicts the hypothesized causal structure; the fitted Poisson model in @tbl-needles-pois adjusts only for the demographic, social, and serostatus subset, not for the mental-health nodes (`dprsn_dx`, diff --git a/inst/WORDLIST b/inst/WORDLIST index 2c8b0f2d73..11e138ea8c 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -93,6 +93,7 @@ antiderivative antiderivatives arg ba +bidirected binom biomarkers bmatrix @@ -164,6 +165,7 @@ ki ldots le leftarrow +leftrightarrow leq lik linreg From 97a3a72be3158e63d60a39134a28ae87eb582617 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Wed, 3 Jun 2026 13:12:50 -0700 Subject: [PATCH 32/34] Fix offset link target; protect hepatitis {C} in tsui2009hcv - The offset reference pointed at poisson.qmd#def-offset, but poisson.qmd is include-only (pulled into probability.qmd) and has no standalone page. Point the link at probability.qmd#def-offset, where the existing def-offset anchor actually renders. (Earlier I had added a local def-offset definition; that duplicated the anchor, so this replaces it with the corrected link.) - Brace-protect the C in "hepatitis {C}" in the tsui2009hcv title so sentence-case citation styles don't lowercase it. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- _subfiles/count-regression/_sec_pois-reg_intro.qmd | 10 +--------- references.bib | 2 +- 2 files changed, 2 insertions(+), 10 deletions(-) diff --git a/_subfiles/count-regression/_sec_pois-reg_intro.qmd b/_subfiles/count-regression/_sec_pois-reg_intro.qmd index 4949e23d54..af9303f79b 100644 --- a/_subfiles/count-regression/_sec_pois-reg_intro.qmd +++ b/_subfiles/count-regression/_sec_pois-reg_intro.qmd @@ -69,15 +69,7 @@ $$ In contrast with the other covariates (represented by $\vX$), $t$ enters this expression with a $\log{}$ transformation and without a corresponding $\beta$ coefficient; -in other words, $\logf{t}$ is an **offset** term (see @def-offset). -::: - -:::{#def-offset} -#### Offset - -When the linear component of a model involves a term without an unknown coefficient, -that term is called an **offset**. - +in other words, $\logf{t}$ is an [offset term](probability.qmd#def-offset). ::: --- diff --git a/references.bib b/references.bib index 19127ce6c8..a09a277cef 100644 --- a/references.bib +++ b/references.bib @@ -1691,7 +1691,7 @@ @article{debruyn2011mira } @article{tsui2009hcv, - title={Risk behaviors after hepatitis C virus seroconversion in young injection drug users in {San Francisco}}, + title={Risk behaviors after hepatitis {C} virus seroconversion in young injection drug users in {San Francisco}}, author={Tsui, Judith I. and Vittinghoff, Eric and Hahn, Judith A. and Evans, Jennifer L. and Davidson, Peter J. and Page, Kimberly}, journal={Drug and Alcohol Dependence}, volume={105}, From 61886400ccbadce41d6a122a637c61807ffb9ac7 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Wed, 3 Jun 2026 16:10:44 -0700 Subject: [PATCH 33/34] Redesign fig-needles-dag: cross-sectional, homelessness-focal causal structure MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Recast the needle-sharing DAG around homelessness as the exposure of interest (following Vittinghoff §8.3.1) and the cross-sectional study design: - Mark `homeless` as exposure and `shared_syr` as outcome. - Cross-sectional => the current covariates can't be ordered causally; each is drawn as a proxy for an unobserved *past state* (`U_x`, hollow node) that causes both the current measurement and the 30-day outcome. These latent past states are the confounders. - `nivdu` (injection frequency) is the only genuine mediator: a shared syringe is a subset of injections. - Add a conditioned-on study-inclusion node (`selected`); several covariates affect being sampled, so conditioning on it induces selection bias. - All directed edges (no bidirected), which also removes the ggdag double-arrow rendering ambiguity from the prior version. Update the caption and notes to explain the cross-sectional reading, the single mediator, the latent-past confounders, and selection bias. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing.qmd | 162 +++++++++++++++++++------------- 1 file changed, 96 insertions(+), 66 deletions(-) diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index cc5ce1a3c8..c15ca1b4be 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -111,104 +111,134 @@ variables. ::: {#fig-needles-dag} ```{r} -#| fig-width: 8 -#| fig-height: 5.5 +#| fig-width: 8.5 +#| fig-height: 6 #| code-fold: true library(dagitty) library(ggdag) +library(dplyr) -# Variables fixed at birth -- or, like age, a deterministic function of birth -# date and interview date -- are exogenous: any causal arrow points away from -# them. -birthfixed <- c("age", "sex", "ethn") - -# Time-varying variables: we have no basis for ordering them causally, so we -# connect every pair with a bidirected edge (<->), denoting an association of -# unknown direction (e.g. an unmeasured common cause). +# This is a cross-sectional study: every covariate is measured at one interview, +# so we cannot order the *current* covariates causally among themselves. The +# only genuine mediator is injection frequency `nivdu` -- sharing a syringe +# requires injecting, so a shared syringe is a subset of injections. +demographics <- c("age", "sex", "ethn") # fixed at (or determined by) birth timevarying <- c( "homeless", "polydrug", "hivstat", "sexabuse", "dprsn_dx", "hplsns" ) -covariates <- c(birthfixed, timevarying) +# The real confounders are the *past* versions of the time-varying covariates +# (a person's earlier circumstances), which we never observe. Each latent past +# state `U_x` causes both its current measured proxy `x` and the 30-day outcome. +past <- paste0("U_", timevarying) needles_dag <- dagitty(paste0("dag {", paste(c( - # every covariate may affect needle-sharing directly and through `nivdu` - paste(covariates, "-> nivdu"), - paste(covariates, "-> shared_syr"), - "nivdu -> shared_syr", - # mutual associations of unknown direction among the time-varying variables - combn(timevarying, 2, function(p) paste(p[1], "<->", p[2])) + # demographics cause the latent past states and the outcome directly + as.vector(outer(demographics, past, function(a, b) paste(a, "->", b))), + paste(demographics, "-> shared_syr"), + # each latent past state -> its current measured proxy, and -> the outcome + paste(past, "->", timevarying), + paste(past, "-> shared_syr"), + # the only mediator: past homelessness -> more injecting -> more sharing + "U_homeless -> nivdu", "nivdu -> shared_syr", + # study inclusion is conditioned on; several covariates affect being sampled, + # conditioning on `selected` then induces associations (selection bias) + paste(c(demographics, "homeless", "polydrug", "hivstat"), "-> selected") ), collapse = "; "), "}")) -# lay the time-varying variables on a hexagon so the bidirected web is legible -hex_x <- c(0.6, 2, 2, 0.6, -0.8, -0.8) -hex_y <- c(4, 3, 1, 0, 1, 3) +# layout: time flows left -> right (birth-fixed -> latent past -> current proxy) +y6 <- seq(3.5, -3.5, length.out = 6) coordinates(needles_dag) <- list( x = c( - age = -4, sex = -4, ethn = -4, - setNames(hex_x, timevarying), nivdu = 4.5, shared_syr = 6.5 + age = -1, sex = -1, ethn = -1, + setNames(rep(1, 6), past), setNames(rep(3, 6), timevarying), + nivdu = 4.3, shared_syr = 6, selected = 3 ), y = c( - age = 3, sex = 2, ethn = 1, - setNames(hex_y, timevarying), nivdu = 2, shared_syr = 2 + age = 1, sex = 0, ethn = -1, + setNames(y6, past), setNames(y6, timevarying), + nivdu = 2.4, shared_syr = 0, selected = -4.7 ) ) needles_dag |> tidy_dagitty() |> + mutate(role = case_when( + name == "homeless" ~ "exposure", + name == "shared_syr" ~ "outcome", + name == "nivdu" ~ "mediator", + name == "selected" ~ "study inclusion (conditioned on)", + startsWith(name, "U_") ~ "latent past state (unobserved)", + TRUE ~ "measured covariate" + )) |> ggplot() + aes(x = x, y = y, xend = xend, yend = yend) + geom_dag_edges( - start_cap = ggraph::circle(6, "mm"), - end_cap = ggraph::circle(6, "mm") + start_cap = ggraph::circle(5, "mm"), + end_cap = ggraph::circle(5, "mm") ) + - geom_dag_point(colour = "grey85") + - geom_dag_text(colour = "black", size = 2.3) + - theme_dag() + geom_dag_point(aes(colour = role, shape = role), size = 7) + + geom_dag_text(colour = "black", size = 1.7) + + scale_shape_manual(values = c( + "exposure" = 16, "outcome" = 16, "mediator" = 16, + "study inclusion (conditioned on)" = 15, + "latent past state (unobserved)" = 1, "measured covariate" = 16 + )) + + scale_colour_manual(values = c( + "exposure" = "#d73027", "outcome" = "#4575b4", "mediator" = "#1a9850", + "study inclusion (conditioned on)" = "grey30", + "latent past state (unobserved)" = "grey55", + "measured covariate" = "grey80" + )) + + theme_dag() + + theme(legend.position = "bottom", legend.title = element_blank()) ``` -Hypothesized causal structure for the needle-sharing example. -Demographic (`age` — age in years; `sex` — biological sex; `ethn` — ethnicity), -social (`homeless`; `polydrug` — multiple drugs used), -mental-health (`sexabuse` — sexual-abuse history; `dprsn_dx` — depression diagnosis; -`hplsns` — hopelessness score), -and serostatus (`hivstat` — HIV serostatus) factors -all plausibly influence both how often a participant injects -(`nivdu` — injections in 30 days) -and how many of those injections involve a shared syringe -(`shared_syr` — shared syringes in 30 days; the outcome). -Variables fixed at birth (`age`, `sex`, `ethn`) are treated as exogenous, -while the time-varying variables -(`homeless`, `polydrug`, `hivstat`, `sexabuse`, `dprsn_dx`, `hplsns`) -are joined by bidirected arrows ($\leftrightarrow$), -denoting likely associations whose causal direction is unknown. +Hypothesized causal structure for the needle-sharing example, +drawn for a **cross-sectional** study. +The exposure of interest is homelessness (`homeless`, red) +and the outcome is the count of shared syringes (`shared_syr`, blue). +Because every covariate is measured at a single interview, +the current measured covariates (grey) cannot be ordered causally among +themselves; each is a proxy for an unobserved **past state** (`U_…`, hollow) +that is the actual cause of both the current measurement and the 30-day behaviour. +The only genuine mediator is injection frequency (`nivdu`, green): +sharing a syringe requires injecting. +Demographic variables (`age`, `sex`, `ethn`) are fixed at (or determined by) birth. +The study-inclusion indicator (`selected`, square) is conditioned on by design; +because several covariates influence who is sampled, +conditioning on it induces associations among them (selection bias). ::: ::: notes -In @fig-needles-dag, injection frequency `nivdu` is a **mediator**: -each covariate can affect the number of shared syringes both -**directly** (its effect on the propensity to share per injection) -and **indirectly** (through how often the participant injects). - -Because `nivdu` lies *on* these causal paths, we do **not** adjust for it -(neither as a covariate nor as an offset): conditioning on a mediator -would block the indirect path and leave only a partial effect. -Modeling the count of shared syringes on the covariates alone therefore -estimates each covariate's **total** association with needle-sharing — -the approach taken by @vittinghoff2e [§8.3.1]. - -The bidirected arrows among the time-varying variables say that we expect them -to be associated but cannot order them causally (each pair may share an -unmeasured common cause). They are an honest admission of ignorance, not extra -data: with the covariates mutually entangled in this way, a single fitted -coefficient is an **adjusted association**, not an isolated causal effect. - -This DAG depicts the hypothesized causal structure; the fitted Poisson -model in @tbl-needles-pois adjusts only for the demographic, social, -and serostatus subset, not for the mental-health nodes (`dprsn_dx`, -`hplsns`, `sexabuse`) — extending the model is left as an exercise. +We treat **homelessness as the exposure of interest** +and ask how it affects the number of shared syringes — +following @vittinghoff2e [§8.3.1], who model this outcome on `homeless`. + +Injection frequency `nivdu` is a **mediator**: +a shared syringe is necessarily one of the injections, +so homelessness can raise sharing partly by raising how often a person injects. +Because `nivdu` lies *on* the causal path from `homeless` to `shared_syr`, +we do **not** adjust for it +(neither as a covariate nor as an offset); +doing so would block that path and estimate only a partial effect. +We want the **total** effect of homelessness on needle-sharing. + +Because the study is **cross-sectional**, +the remaining covariates are measured at the same interview as the exposure, +so none of them can be ordered causally relative to `homeless`. +We read each one as a proxy for an unobserved **past state** (`U_…`) +that confounds the homelessness–sharing relationship, +and we adjust for the measured proxies to control that confounding. +A single fitted coefficient is therefore an **adjusted association**, +not an isolated causal effect. + +Conditioning on study inclusion (`selected`) is unavoidable — +we only observe people who were sampled — +but because several covariates influence who is sampled, +it can induce **selection bias** that no covariate adjustment removes. ::: ```{r} From 7d2fcde1cdfc10c3d64e1c7e1f81ce5684a8bf10 Mon Sep 17 00:00:00 2001 From: Douglas Ezra Morrison Date: Thu, 4 Jun 2026 11:41:39 -0700 Subject: [PATCH 34/34] Reframe needle-sharing example around homelessness as the exposure MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Following Vittinghoff §8.3.1 (which models shared_syr on homelessness) and the cross-sectional DAG, recenter the analysis on homelessness as the exposure of interest: - Fit shared_syr ~ homeless + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns: homelessness as exposure, the other measured covariates as proxies for the latent past confounders (all except the mediator nivdu). Add sexabuse to the data prep. Exclude dprsn_dx, whose 1/5 coding is undocumented in the source data (the book uses only homeless). - Headline the homelessness IRR in the interpretation (adjusted ~6, vs Vittinghoff's unadjusted ~3.3); add ethn/sexabuse/hplsns coefficients. - Recast the fitted-model figure as the homeless-vs-not contrast over age (other covariates at reference), on the pseudo-log scale. - NB/ZIP/ZINB use the same count model; both zero-inflation submodels use ziformula = ~ homeless, matching Vittinghoff's inflate(i.homeless). This also resolves the review finding that the old ZINB note (ziformula = ~ 1) contradicted the ZIP submodel: both are now consistent and book-faithful. - Clean up pre-existing = / glm1.nb lint in the rewritten extensions chunks. Co-Authored-By: Claude Opus 4.8 (1M context) Co-authored-by: d-morrison <2474437+d-morrison@users.noreply.github.com> --- chapters/exr-needle-sharing-extensions.qmd | 81 +++++++-------- chapters/exr-needle-sharing.qmd | 109 ++++++++++++++------- 2 files changed, 106 insertions(+), 84 deletions(-) diff --git a/chapters/exr-needle-sharing-extensions.qmd b/chapters/exr-needle-sharing-extensions.qmd index 456df67885..6497939398 100644 --- a/chapters/exr-needle-sharing-extensions.qmd +++ b/chapters/exr-needle-sharing-extensions.qmd @@ -1,18 +1,18 @@ - ```{r} -library(MASS) #need this for glm.nb() -glm1.nb = glm.nb( - data = needles, - shared_syr ~ age + sex + homeless + polydrug + hivstat +library(MASS) # for glm.nb() +glm1_nb <- glm.nb( + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, + data = needles ) -equatiomatic::extract_eq(glm1.nb) +equatiomatic::extract_eq(glm1_nb) ``` ```{r} #| tbl-cap: "Negative binomial model for needle-sharing data" #| label: tbl-needles-nb -summary(glm1.nb) +summary(glm1_nb) ``` --- @@ -20,29 +20,28 @@ summary(glm1.nb) ```{r} #| tbl-cap: "Poisson versus Negative Binomial Regression coefficient estimates" #| label: tbl-compare-poisson-nb -tibble(name = names(coef(glm1)), poisson = coef(glm1), nb = coef(glm1.nb)) +tibble(name = names(coef(glm1)), poisson = coef(glm1), nb = coef(glm1_nb)) ``` #### zero-inflation ```{r} -#| tbl-cap: "Zero-inflated poisson model" +#| tbl-cap: "Zero-inflated Poisson model" #| label: tbl-zeroinf-poisson library(glmmTMB) -zinf_fit1 = glmmTMB( +zinf_pois <- glmmTMB( family = "poisson", - data = needles, - formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, - # hivstat omitted from ziformula because only ~8 of 128 participants are - # HIV+, which risks convergence failure in the zero-inflation submodel - # (mirroring why the homeless*polydrug interaction was avoided above). - ziformula = ~ age + sex + homeless + polydrug + data = needles, + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, + # the zero-inflation submodel uses the exposure, matching Vittinghoff's + # `inflate(i.homeless)` (@vittinghoff2e, §8.3.1) + ziformula = ~ homeless ) -zinf_fit1 |> +zinf_pois |> parameters(exponentiate = TRUE) |> print_md() - ``` ::: notes @@ -56,44 +55,30 @@ Another R package for zero-inflated models is [`pscl`](https://cran.r-project.or ```{r} #| tbl-cap: "Zero-inflated negative binomial model" #| label: tbl-zeroinf-nb -library(glmmTMB) -zinf_fit_nb = glmmTMB( +zinf_nb <- glmmTMB( family = nbinom2, - data = needles, - formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, - # Intercept-only zero-inflation submodel (see note below). - ziformula = ~ 1 + data = needles, + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, + ziformula = ~ homeless ) -zinf_fit_nb |> +zinf_nb |> parameters(exponentiate = TRUE) |> print_md() - ``` ::: notes -Unlike the zero-inflated Poisson model above, -here the zero-inflation submodel uses only an intercept (`ziformula = ~ 1`). -Two reasons to drop the covariates from this submodel: +Both zero-inflated models keep the zero-inflation (structural-zero) submodel +parsimonious — just the exposure, `ziformula = ~ homeless` — +matching @vittinghoff2e's `inflate(i.homeless)` [§8.3.1]. -- **The DAG gives it no covariates.** -@fig-needles-dag posits a *single* process -in which the covariates act on `shared_syr` -(directly and through the mediator `nivdu`); -it does not posit a separate latent class of "structural never-sharers" -for those covariates to predict. -So a covariate-laden zero-inflation submodel -is not identified by the assumed causal structure. - -- **The data cannot identify it either.** -The negative binomial's overdispersion parameter and the zero-inflation component +A richer zero-inflation submodel is poorly identified here. +With the negative binomial in particular, +the overdispersion parameter and the zero-inflation component both compete to explain the excess zeros, -so with all four covariates in the zero-inflation submodel -the logistic fit *separates*: -its coefficients diverge and the exponentiated estimates (odds ratios) -run off to $\pm\infty$. -Reducing the zero-inflation submodel to a single constant -removes the separation -while still allowing a fixed fraction of structural zeros. +so adding several covariates to the submodel makes the logistic fit +*separate*: its coefficients diverge and the exponentiated estimates +(odds ratios) run off to $\pm\infty$. +Keeping the submodel small avoids that. ::: - diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index c15ca1b4be..40e72eeb87 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -18,6 +18,10 @@ needles <- case_match(1 ~ "homeless", 0 ~ "not homeless") |> factor() |> relevel(ref = "not homeless"), + sexabuse = sexabuse |> + case_match(1 ~ "yes", 0 ~ "no") |> + factor() |> + relevel(ref = "no"), ethn = ethn |> factor() |> forcats::fct_lump_min(min = 5, other_level = "Other") |> @@ -359,9 +363,13 @@ needles <- needles |> filter(sex != "Trans") ## Model {.smaller} Following @vittinghoff2e [§8.3.1], -we model the **count** of shared syringes directly -with a Poisson regression (log link), -rather than modeling a rate. +we take **homelessness as the exposure of interest** +and ask how it affects the **count** of shared syringes, +using a Poisson regression (log link) rather than modeling a rate. +The remaining covariates enter as adjustments for confounding +(@fig-needles-dag): because the study is cross-sectional, +each is read as a proxy for an unobserved past state +that may influence both homelessness and sharing. **Why not use `nivdu` as an offset?** A Poisson *rate* model would add `offset(log(nivdu))`, @@ -372,10 +380,12 @@ We avoid that here for two reasons: - **`nivdu` is a mediator, not an exposure window.** An offset is appropriate when its variable is a known denominator — person-time or population at risk — that is external to the causal -question. Injection frequency is instead *caused* by the covariates -(@fig-needles-dag), so conditioning on it (whether as an offset or as a -covariate) would block the indirect path and estimate only a partial -effect. We want each covariate's *total* association with needle-sharing. +question. Injection frequency is instead on the causal path from +homelessness to sharing (@fig-needles-dag): being homeless can raise +sharing partly by raising how often a person injects. Conditioning on it +(whether as an offset or as a covariate) would block that path and estimate +only a partial effect. We want the *total* effect of homelessness on +needle-sharing. - **It imposes an untested proportionality assumption.** Fixing the `log(nivdu)` coefficient at 1 assumes sharing scales one-for-one with injecting, which the data need not support. @@ -394,14 +404,18 @@ it is a mediator rather than an exposure window, and we model the count directly instead of dividing by it. ::: -We also use **main effects only**: -a `homeless × polydrug` interaction is not identifiable here, -because no non-homeless participant who used multiple drugs shared any -needles, leaving that cell empty for the outcome. +We adjust for the measured covariates as proxies for the latent past +confounders in @fig-needles-dag — demographic (`age`, `sex`, `ethn`), +social (`polydrug`), serostatus (`hivstat`), and trauma/mental-health +(`sexabuse`, hopelessness score `hplsns`) — using **main effects only**. +We exclude the mediator `nivdu` (above) and the depression indicator +`dprsn_dx`, whose `1`/`5` coding is undocumented in the source data and +so cannot be interpreted. ```{r} glm1 <- glm( - formula = shared_syr ~ age + sex + homeless + polydrug + hivstat, + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, family = stats::poisson, data = needles ) @@ -425,38 +439,40 @@ glm1 |> --- ```{r} -#| fig-cap: "Fitted Poisson count model and observed shared-syringe counts" +#| fig-cap: "Fitted Poisson model: expected sharing by age and housing status" #| label: fig-needles-pois #| code-fold: true library(ggplot2) -# Predicted expected counts across age and the categorical covariates. -# The model has no offset, so predict(type = "response") returns the -# expected number of shared syringes directly. -pred_grid <- with(glm1$model, expand.grid( - age = seq(min(age), max(age), length.out = 100), - sex = levels(droplevels(sex)), - homeless = levels(homeless), - polydrug = levels(polydrug), - hivstat = levels(hivstat) -)) +# Predicted expected counts across age for the two homelessness groups, +# holding the other covariates at their reference levels (hplsns at its +# median). The model has no offset, so predict(type = "response") returns +# the expected number of shared syringes directly. +ref_lvl <- function(v) levels(glm1$model[[v]])[1] +pred_grid <- expand.grid( + age = seq(min(glm1$model$age), max(glm1$model$age), length.out = 100), + homeless = levels(glm1$model$homeless), + sex = ref_lvl("sex"), + ethn = ref_lvl("ethn"), + polydrug = ref_lvl("polydrug"), + hivstat = ref_lvl("hivstat"), + sexabuse = ref_lvl("sexabuse"), + hplsns = median(glm1$model$hplsns) +) pred_grid$shared_syr <- predict(glm1, newdata = pred_grid, type = "response") ggplot(glm1$model) + - aes(x = age, y = shared_syr, color = hivstat, shape = sex) + + aes(x = age, y = shared_syr, colour = homeless) + geom_point(alpha = 0.4) + - geom_line(data = pred_grid, aes(linetype = sex), linewidth = 0.8) + + geom_line(data = pred_grid, linewidth = 1) + scale_y_continuous( transform = "pseudo_log", breaks = c(0, 1, 3, 10, 30, 60) ) + - facet_grid( - rows = vars(homeless), - cols = vars(polydrug), - labeller = label_both + labs( + y = "Number of shared syringes (30 days)", + colour = "Housing status" ) + - labs(y = "Number of shared syringes (30 days)") + - guides(color = guide_legend(nrow = 1)) + theme_bw() + theme(legend.position = "bottom") ``` @@ -469,8 +485,11 @@ exponentiating a coefficient turns it into a *ratio of expected counts* $\exp{\b_j}$ is the multiplicative change in the expected number of shared syringes per one-unit increase in $x_j$, holding the other covariates fixed. +Our **focal estimate is the IRR for homelessness**; +the remaining coefficients are adjustments for confounding, +not effects of separate interest. Because we do not condition on the mediator `nivdu`, -each ratio is a *total* association. +the homelessness IRR is a *total* effect. The fitted ratios are reported in @tbl-needles-pois. ```{r} @@ -481,7 +500,8 @@ irr <- exp(coef(glm1)) **Baseline (intercept).** $\exp{\hat\b_0} = `r round(irr[["(Intercept)"]], 3)`$ is the expected count of shared syringes for the reference participant -(male, not homeless, one drug used, HIV-negative), extrapolated to age 0. +(male, not homeless, White, one drug used, HIV-negative, +no sexual-abuse history, hopelessness score 0), extrapolated to age 0. Age 0 is outside the data, so the intercept is a mathematical anchor rather than a directly interpretable quantity. @@ -495,13 +515,15 @@ women's expected count of shared syringes is about `r round(irr[["sexF"]], 2)` times men's, adjusting for the other covariates. -**Homelessness.** +**Homelessness (the focal exposure).** $\exp{\hat\b_{\text{homeless}}} = `r round(irr[["homelesshomeless"]], 2)`$: homeless participants share syringes about `r round(irr[["homelesshomeless"]], 2)` times as often as the non-homeless, adjusting for the other covariates. -@vittinghoff2e [§8.3.1] reports a comparable unadjusted IRR of about 3.3 -for homelessness in a single-predictor model. +@vittinghoff2e [§8.3.1] reports an *unadjusted* IRR of about 3.3 +for homelessness in a single-predictor model; +adjusting for the confounder-proxies here moves the estimate +to about `r round(irr[["homelesshomeless"]], 1)`. **Multiple-drug use.** $\exp{\hat\b_{\text{polydrug}}} = `r round(irr[["polydrugmultiple drugs used"]], 2)`$: @@ -516,6 +538,21 @@ than HIV-negative participants, adjusting for the other covariates. Only `r sum(glm1$model[["hivstat"]] == "HIV+")` participants are HIV+ in this sample, so this estimate is imprecise. +**Ethnicity.** +The `ethn` coefficients compare each retained group to the White reference +(small groups were lumped into "Other"); they adjust for any confounding +of the homelessness effect by ethnicity but are not of separate interest. + +**Sexual-abuse history.** +$\exp{\hat\b_{\text{sexabuse}}} = `r round(irr[["sexabuseyes"]], 2)`$: +the expected count for participants reporting a history of sexual abuse +relative to those not reporting one, adjusting for the other covariates. + +**Hopelessness.** +$\exp{\hat\b_{\text{hplsns}}} = `r round(irr[["hplsns"]], 2)`$ per point: +the multiplicative change in the expected count per one-point increase +in the hopelessness score. + **Overdispersion caveat.** These data are highly overdispersed — the variance far exceeds the mean — so the model-based Poisson standard errors are too small.