diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 46133eec51..fdee9e8300 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -588,6 +588,10 @@ Key macros to use: - **Greek letters**: Use `\b` for $\beta$, `\g` for $\gamma$, `\a` for $\alpha$ - **Formatting**: Use `\red{...}` and `\blue{...}` for colored text in math - **Deviation/error notation**: Use `\erf{...}` for deviations of estimates/estimators from their estimands; use `\devn(...)` for all other deviations (e.g., observations from population means) +- **Estimators of vector estimands**: the estimator symbol (e.g. `\hat`, + `\bar`, `\tilde`) goes on top of the vector symbol, not inside it — + write `\hat{\v{\mu}}`, not `\v{\hat\mu}`. The hat sits on top of the + already-vectorized symbol. matrix-product helper macros: diff --git a/CLAUDE.md b/CLAUDE.md index f087254695..e27448bb22 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -50,6 +50,9 @@ Before committing any `.qmd`, `.R`, or config file change: - Key macros: `\E{Y|X=x}`, `\ba`/`\ea`, `\tp{v}`, `\b`, `\g`, `\a`, `\devn(...)`, `\erf{...}` - Include every intermediate step in derivations — do not skip steps - Color coding: `\red{...}` for focal/extra terms, `\blue{...}` for shared terms +- Estimators of vector estimands: the estimator symbol (e.g. `\hat`) goes + on top of the vector symbol, not inside it — write `\hat{\v{\mu}}`, not + `\v{\hat\mu}`. (Same for `\bar`, `\tilde`, etc.) - Ratios vs. factors: - Use the generic `\ratio`/`\ratiof` macro when a ratio's inputs are the **quantities themselves** (the odds, hazards, rates, etc.) — e.g. `\ratio(\odds_1, \odds_2)`, **not** `\ror(\odds_1, \odds_2)` — because the type of ratio is clear from the inputs. - Use the type-subscripted ratio macros (`\ror` for odds ratios, `\hazratio`/`\hr` for hazard ratios, `\rateratio`, `\riskratio`, `\prevratio`, `\cuhazratio`, …) only when the inputs are **covariate patterns** (e.g. `\ror(\vx,\vxs)`, `\hr(t\mid\vx:\vxs)`), where the subscript is needed to say which kind of ratio it is. diff --git a/_subfiles/count-regression/_exr-prac-glm-score.qmd b/_subfiles/count-regression/_exr-prac-glm-score.qmd index 38961caf7b..009fdc9924 100644 --- a/_subfiles/count-regression/_exr-prac-glm-score.qmd +++ b/_subfiles/count-regression/_exr-prac-glm-score.qmd @@ -69,7 +69,12 @@ weighted by $x_i$. More generally, these score equations say that the **residuals $(y_i - \hat\mu_i)$ -are orthogonal to each predictor column**. +are [orthogonal](math-prereqs.qmd#def-orthogonal-vectors) +to each predictor column**: +for each predictor $j$, the residual vector $(\vy - \hat{\v{\mu}})$ satisfies +$\tp{\vx_{(j)}}(\vy - \hat{\v{\mu}}) = 0$, +where $\vx_{(j)} = (x_{1j}, \ldots, x_{nj})$ is the column of $j$-th +predictor values across observations. This is the GLM analogue of the OLS normal equations. ::: diff --git a/_subfiles/count-regression/_sec-overdispersion.qmd b/_subfiles/count-regression/_sec-overdispersion.qmd index 753b166b52..56331ec240 100644 --- a/_subfiles/count-regression/_sec-overdispersion.qmd +++ b/_subfiles/count-regression/_sec-overdispersion.qmd @@ -13,8 +13,8 @@ In practice, many count distributions will have a variance substantially larger A random variable $X$ is **overdispersed** relative to a model $\p(X=x)$ if -if its empirical variance in a dataset is larger than -the value is predicted by the fitted model $\hat{\p}(X=x)$. +its empirical variance in a dataset is larger than +the value predicted by the fitted model $\hat{\p}(X=x)$. :::: diff --git a/_subfiles/count-regression/_sec_pois-reg_intro.qmd b/_subfiles/count-regression/_sec_pois-reg_intro.qmd index fc2b957702..af9303f79b 100644 --- a/_subfiles/count-regression/_sec_pois-reg_intro.qmd +++ b/_subfiles/count-regression/_sec_pois-reg_intro.qmd @@ -69,7 +69,7 @@ $$ In contrast with the other covariates (represented by $\vX$), $t$ enters this expression with a $\log{}$ transformation and without a corresponding $\beta$ coefficient; -in other words, $\logf{t}$ is an [offset term](poisson.qmd#def-offset). +in other words, $\logf{t}$ is an [offset term](probability.qmd#def-offset). ::: --- diff --git a/_subfiles/count-regression/_sec_poisson_dx.qmd b/_subfiles/count-regression/_sec_poisson_dx.qmd index dc91df9c45..f98a3d9c1b 100644 --- a/_subfiles/count-regression/_sec_poisson_dx.qmd +++ b/_subfiles/count-regression/_sec_poisson_dx.qmd @@ -17,17 +17,17 @@ where $h$ is the "leverage" (which we will continue to leave undefined). #### Deviance residuals $$ -d_k = \text{sign}(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\} +d_k = \signt(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\} $$ :::{.callout-note} -$$\text{sign}(x) \eqdef \frac{x}{|x|}$$ +$$\signt(x) \eqdef \frac{x}{|x|}$$ In other words: -* $\text{sign}(x) = -1$ if $x < 0$ -* $\text{sign}(x) = 0$ if $x = 0$ -* $\text{sign}(x) = 1$ if $x > 0$ +* $\signt(x) = -1$ if $x < 0$ +* $\signt(x) = 0$ if $x = 0$ +* $\signt(x) = 1$ if $x > 0$ ::::{.content-hidden} diff --git a/_subfiles/count-regression/_sec_poisson_inference.qmd b/_subfiles/count-regression/_sec_poisson_inference.qmd index 0851e1460e..fb3167539e 100644 --- a/_subfiles/count-regression/_sec_poisson_inference.qmd +++ b/_subfiles/count-regression/_sec_poisson_inference.qmd @@ -1,26 +1,55 @@ ### Confidence intervals for regression coefficients and rate ratios -As usual: +A Wald 95% confidence interval for a single coefficient $\beta_j$ is: $$ -\beta \in \left[\ci \right] +\beta_j \in \left[\hat\beta_j \pm \ciradf{\hat\beta_j}\right] $$ -Rate ratios: exponentiate CI endpoints +where $z_{1-\alpha/2} \approx 1.96$ for $\alpha = 0.05$. + +Because the log-rate scale is related to the rate scale by exponentiation, +we obtain a confidence interval for the rate ratio $e^{\beta_j}$ +by exponentiating both endpoints: $$ -\exp{\beta} \in \left[\exp{\ci} \right] +e^{\beta_j} \in + \left[ + \exp{\hat\beta_j - \ciradf{\hat\beta_j}},\; + \exp{\hat\beta_j + \ciradf{\hat\beta_j}} + \right] $$ ### Hypothesis tests for regression coefficients +To test $H_0: \beta_j = \beta_0$ against a one- or two-sided alternative, +compute the Wald $z$-statistic: + $$ -z = \frac{\hat \beta - \beta_0}{\hse{\hb}} +z = \frac{\hat \beta_j - \beta_0}{\hse{\hat\beta_j}} $$ -Compare $z$ or $|z|$ to the tails of the standard Gaussian distribution, according to the null hypothesis. +and compare $z$ (one-sided) or $|z|$ (two-sided) to the tails of the +standard Gaussian distribution. +The most common null hypothesis is $H_0: \beta_j = 0$, +i.e., that covariate $j$ has no association with the outcome rate. ### Comparing nested models -log(likelihood ratio) tests, as usual. +To compare a smaller model $M_0$ (with $p_0$ parameters) to a larger model +$M_1$ (with $p_1 > p_0$ parameters), use the likelihood ratio test statistic: + +$$ +G^2 = 2\bigl[\hat\ell_1 - \hat\ell_0\bigr] +$$ + +where $\hat\ell_1$ and $\hat\ell_0$ are the maximized log-likelihoods +of $M_1$ and $M_0$ respectively. +(Here the subscripts index the two *models*; they are unrelated to the +scalar null value $\beta_0$ used in the Wald test above.) + +Under $H_0$ that the additional $p_1 - p_0$ parameters are all zero, +$G^2 \dsim \chi^2_{p_1 - p_0}$. + +See @dobson4e [Chapter 9] and @vittinghoff2e [§8.1] for details. diff --git a/_subfiles/count-regression/_sec_zero-inflation.qmd b/_subfiles/count-regression/_sec_zero-inflation.qmd index c49dfd0336..8d77b43f47 100644 --- a/_subfiles/count-regression/_sec_zero-inflation.qmd +++ b/_subfiles/count-regression/_sec_zero-inflation.qmd @@ -31,12 +31,105 @@ $$ --- ::: {#exr-zinf-pmf} -Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x,T=t)$ and $P(Y=y|Z=0,X=x,T=t)$. +Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x)$ and $P(Y=y|Z=0,X=x,T=t)$. ::: ---- +::: {.solution} + +Let $\pi = \P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. + +**$P(Y=0)$:** $Y=0$ occurs either because $Z=1$ (always zero) +or because $Z=0$ and the Poisson draw equals 0: + +$$ +\ba +\P(Y=0|X=x,T=t) +&= \P(Z=1|X=x) + \P(Z=0|X=x)\,\P(Y=0|Z=0,X=x,T=t)\\ +&= \pi + (1-\pi)\,e^{-\mu_0} +\ea +$$ + +**$P(Y=1)$:** $Z=1$ can never produce $Y=1$, so: + +$$ +\P(Y=1|X=x,T=t) += (1-\pi)\,\P(Y=1|Z=0,X=x,T=t) += (1-\pi)\,\mu_0 e^{-\mu_0} +$$ + +**$P(Y=y)$ for $y \geq 1$:** Identical reasoning gives + +$$ +\P(Y=y|X=x,T=t) += (1-\pi)\,\frac{\mu_0^y e^{-\mu_0}}{y!} +$$ + +::: + +{{< slidebreak >}} ::: {#exr-zinf-moments} -Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $P(Z=1|X=x,T=t)$ and $\Expp[Y|Z=0,X=x,T=t]$. +Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $\pi = P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. +::: + +::: {.solution} + +Let $\pi = \P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$. + +**Expected value.** By the Law of Total Expectation +(conditioning on $Z$, within the subpopulation $\{X=x, T=t\}$): + +$$ +\ba +\Expp[Y|X=x,T=t] +&= \Expp[Y|Z=1,X=x,T=t]\,\pi + \Expp[Y|Z=0,X=x,T=t]\,(1-\pi)\\ +&= 0 \cdot \pi + \mu_0(1-\pi)\\ +&= (1-\pi)\,\mu_0 +\ea +$$ + +The substitution $\Expp[Y|Z=0,X=x,T=t] = \mu_0$ follows immediately +from the definition of $\mu_0$ above. + +**Variance.** By the Law of Total Variance. +To reduce clutter, we suppress the $(X=x, T=t)$ conditioning in the +intermediate steps below: every expectation and variance is taken within +the subpopulation $\{X=x, T=t\}$, and we restore the explicit conditioning +in the final line. + +$$ +\Var{Y} = \Expp[\Var{Y|Z}] + \Var{\Expp[Y|Z]} +$$ + +For the first term, since $\Var{Y|Z=1}=0$ and $\Var{Y|Z=0}=\mu_0$ (Poisson): + +$$ +\Expp[\Var{Y|Z}] = 0 \cdot \pi + \mu_0(1-\pi) = (1-\pi)\mu_0 +$$ + +For the second term, $\Expp[Y|Z]$ takes the value 0 (with prob $\pi$) +or $\mu_0$ (with prob $1-\pi$), so: + +$$ +\ba +\Var{\Expp[Y|Z]} +&= \pi(0 - (1-\pi)\mu_0)^2 + (1-\pi)(\mu_0 - (1-\pi)\mu_0)^2\\ +&= \pi(1-\pi)^2\mu_0^2 + (1-\pi)\pi^2\mu_0^2\\ +&= \pi(1-\pi)\mu_0^2[\,(1-\pi)+\pi\,]\\ +&= \pi(1-\pi)\mu_0^2 +\ea +$$ + +Combining: + +$$ +\Var{Y|X=x,T=t} = (1-\pi)\mu_0 + \pi(1-\pi)\mu_0^2 += (1-\pi)\mu_0\bigl(1 + \pi\mu_0\bigr) +$$ + +Since $(1-\pi)\mu_0\bigl(1+\pi\mu_0\bigr) \geq (1-\pi)\mu_0 = \Expp[Y|X=x,T=t]$ for any $\pi > 0$, +zero-inflated count models always exhibit overdispersion relative to a Poisson model +with the same mean. + ::: diff --git a/chapters/count-regression.qmd b/chapters/count-regression.qmd index e302040286..bc3e5cee76 100644 --- a/chapters/count-regression.qmd +++ b/chapters/count-regression.qmd @@ -72,13 +72,17 @@ This content is adapted from: ::: notes There are alternatives to the Poisson model. -Most notably, +Most notably, the [negative binomial model](probability.qmd#sec-nb-dist). +::: -We can still model $\mu$ as a function of $X$ and $T$ as before, -and we can combine this model with zero-inflation +The [negative binomial distribution](probability.qmd#sec-nb-dist) +is a common alternative to the Poisson distribution for count outcomes. +It adds a dispersion parameter that allows the variance to exceed the mean, +making it more flexible when overdispersion is present. +We can still model $\mu$ as a function of $X$ and $T$ as before, +and we can combine this model with zero-inflation (as the conditional distribution for the non-zero component). -::: --- @@ -88,7 +92,13 @@ and we can combine this model with zero-inflation ## Quasipoisson -An alternative to Negative binomial is the "quasipoisson" distribution. I've never used it, but it seems to be a method-of-moments type approach rather than maximum likelihood. It models the variance as $\Var{Y} = \mu\theta$, and estimates $\theta$ accordingly. +An alternative to the negative binomial model is the quasi-Poisson approach. +Rather than specifying a full probability distribution, +it uses a method-of-moments approach rather than maximum likelihood estimation. +It models the variance as $\Var{Y} = \mu\theta$ +and estimates the dispersion parameter $\theta$ accordingly. +This approach is simpler to implement but provides less information +than the full negative binomial likelihood. See `?quasipoisson` in R for more. diff --git a/chapters/exr-needle-sharing-extensions.qmd b/chapters/exr-needle-sharing-extensions.qmd index d3000c723e..6497939398 100644 --- a/chapters/exr-needle-sharing-extensions.qmd +++ b/chapters/exr-needle-sharing-extensions.qmd @@ -1,18 +1,18 @@ - ```{r} -library(MASS) #need this for glm.nb() -glm1.nb = glm.nb( - data = needles, - shared_syr ~ age + sex + homeless*polydrug +library(MASS) # for glm.nb() +glm1_nb <- glm.nb( + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, + data = needles ) -equatiomatic::extract_eq(glm1.nb) +equatiomatic::extract_eq(glm1_nb) ``` ```{r} #| tbl-cap: "Negative binomial model for needle-sharing data" #| label: tbl-needles-nb -summary(glm1.nb) +summary(glm1_nb) ``` --- @@ -20,26 +20,28 @@ summary(glm1.nb) ```{r} #| tbl-cap: "Poisson versus Negative Binomial Regression coefficient estimates" #| label: tbl-compare-poisson-nb -tibble(name = names(coef(glm1)), poisson = coef(glm1), nb = coef(glm1.nb)) +tibble(name = names(coef(glm1)), poisson = coef(glm1), nb = coef(glm1_nb)) ``` #### zero-inflation ```{r} -#| tbl-cap: "Zero-inflated poisson model" +#| tbl-cap: "Zero-inflated Poisson model" #| label: tbl-zeroinf-poisson library(glmmTMB) -zinf_fit1 = glmmTMB( +zinf_pois <- glmmTMB( family = "poisson", - data = needles, - formula = shared_syr ~ age + sex + homeless*polydrug, - ziformula = ~ age + sex + homeless + polydrug # fit won't converge with interaction + data = needles, + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, + # the zero-inflation submodel uses the exposure, matching Vittinghoff's + # `inflate(i.homeless)` (@vittinghoff2e, §8.3.1) + ziformula = ~ homeless ) -zinf_fit1 |> +zinf_pois |> parameters(exponentiate = TRUE) |> print_md() - ``` ::: notes @@ -53,18 +55,30 @@ Another R package for zero-inflated models is [`pscl`](https://cran.r-project.or ```{r} #| tbl-cap: "Zero-inflated negative binomial model" #| label: tbl-zeroinf-nb -library(glmmTMB) -zinf_fit1 = glmmTMB( +zinf_nb <- glmmTMB( family = nbinom2, - data = needles, - formula = shared_syr ~ age + sex + homeless*polydrug, - ziformula = ~ age + sex + homeless + polydrug - # fit won't converge with interaction + data = needles, + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, + ziformula = ~ homeless ) -zinf_fit1 |> +zinf_nb |> parameters(exponentiate = TRUE) |> print_md() - ``` +::: notes +Both zero-inflated models keep the zero-inflation (structural-zero) submodel +parsimonious — just the exposure, `ziformula = ~ homeless` — +matching @vittinghoff2e's `inflate(i.homeless)` [§8.3.1]. + +A richer zero-inflation submodel is poorly identified here. +With the negative binomial in particular, +the overdispersion parameter and the zero-inflation component +both compete to explain the excess zeros, +so adding several covariates to the submodel makes the logistic fit +*separate*: its coefficients diverge and the exponentiated estimates +(odds ratios) run off to $\pm\infty$. +Keeping the submodel small avoids that. +::: diff --git a/chapters/exr-needle-sharing.qmd b/chapters/exr-needle-sharing.qmd index baae46b332..40e72eeb87 100644 --- a/chapters/exr-needle-sharing.qmd +++ b/chapters/exr-needle-sharing.qmd @@ -1,47 +1,78 @@ ```{r} library(tidyverse) library(haven) -needles = - here::here("inst/extdata/needle_sharing.dta") |> +needles <- + here::here("inst/extdata/needle_sharing.dta") |> read_dta() |> as_tibble() |> mutate( - hivstat = - hivstat |> - case_match( - 1 ~ "HIV+", - 0 ~ "HIV-") |> - factor() |> + hivstat = hivstat |> + case_match(1 ~ "HIV+", 0 ~ "HIV-") |> + factor() |> relevel(ref = "HIV-"), - polydrug = - polydrug |> - case_match( - 1 ~ "multiple drugs used", - 0 ~ "one drug used") |> + polydrug = polydrug |> + case_match(1 ~ "multiple drugs used", 0 ~ "one drug used") |> factor() |> relevel(ref = "one drug used"), - homeless = - homeless |> - case_match( - 1 ~ "homeless", - 0 ~ "not homeless") |> + homeless = homeless |> + case_match(1 ~ "homeless", 0 ~ "not homeless") |> factor() |> relevel(ref = "not homeless"), - ethn = ethn |> factor() |> relevel(ref = "White"), + sexabuse = sexabuse |> + case_match(1 ~ "yes", 0 ~ "no") |> + factor() |> + relevel(ref = "no"), + ethn = ethn |> + factor() |> + forcats::fct_lump_min(min = 5, other_level = "Other") |> + relevel(ref = "White"), sex = sex |> factor() |> relevel(ref = "M") - ) |> + ) |> labelled::set_variable_labels( - "sex" = "sex (reference = Male)", - "ethn" = "ethnicity (reference = White)", - "shsyryn" = "shared syringe yes/no (1 = yes, 0 = no)", + "id" = "Participant identifier", + "sex" = "Biological sex (reference = Male)", + "ethn" = "Self-reported ethnicity (reference = White; n<5 lumped to Other)", + "age" = "Age at first interview (years)", + "dprsn_dx" = "Depression diagnosis indicator", + "sexabuse" = "History of sexual abuse (1 = yes, 0 = no)", + "shared_syr" = "Number of shared syringes (in 30 days)", + "hivstat" = "HIV serostatus (reference = HIV-)", + "hplsns" = "Hopelessness score", + "nivdu" = "Number of injections (in 30 days)", + "shsyryn" = "Shared syringe? (1 = yes, 0 = no)", + "sqrtnivd" = "sqrt(No. of injections in 30 days)", "logshsyr" = "log(No. of shared needles)", - "polydrug" = "how many drugs used?", - "sqrtninj" = "sqrt(No. of infections in 30 days)", - "homeless" = "Homeless (1 = yes, 0 = no)", - "hivstat" = "HIV status (reference = HIV-)" + "polydrug" = "Multiple non-injection drugs used?", + "sqrtninj" = "sqrt(No. of injections in 30 days); duplicate of sqrtnivd", + "homeless" = "Homeless? (reference = not homeless)", + "shsyr" = "Shared needles (30 days); sparse duplicate of shared_syr" ) ``` +## Introduction + +This example, from @vittinghoff2e [§8.3.1], +models *risky drug-use behavior* — needle sharing — among +injection drug users (IDUs), +a behavior of public-health interest because shared syringes can transmit +blood-borne infections. +We examine how needle-sharing frequency relates to participants' age, sex, +housing status (homelessness), drug-use patterns, and HIV serostatus; +injection frequency reflects how often each participant has the +opportunity to share a syringe. + +These data are a sample of injection drug users +distributed with @vittinghoff2e [§8.3.1] +via the +[UCSF companion website](https://regression.ucsf.edu/second-edition/data-examples-and-problems); +the original study is not documented in the textbook or the companion materials. +One possible (though unconfirmed) source is @tsui2009hcv, +a study of injection-drug-using participants in San Francisco +co-authored by one of the textbook's authors (E. Vittinghoff). +There are $n = 128$ participants and 17 variables. +Our outcome is `shared_syr`, +the number of times a participant shared a syringe in the past 30 days. + :::{#tbl-needles-data-dict} ```{r} @@ -59,6 +90,161 @@ Data dictionary for `needles` data --- +@tbl-needles-demographics summarizes the demographic and behavioural +variables for the participants. + +:::{#tbl-needles-demographics} + +```{r} +#| code-fold: true +needles |> + dplyr::select(-id) |> + gtsummary::tbl_summary() +``` + +Demographics and behavioural variables for the needle-sharing dataset +(all participants). + +::: + +{{< slidebreak >}} + +Before modeling, we sketch a hypothesized causal structure for these +variables. + +::: {#fig-needles-dag} + +```{r} +#| fig-width: 8.5 +#| fig-height: 6 +#| code-fold: true + +library(dagitty) +library(ggdag) +library(dplyr) + +# This is a cross-sectional study: every covariate is measured at one interview, +# so we cannot order the *current* covariates causally among themselves. The +# only genuine mediator is injection frequency `nivdu` -- sharing a syringe +# requires injecting, so a shared syringe is a subset of injections. +demographics <- c("age", "sex", "ethn") # fixed at (or determined by) birth +timevarying <- c( + "homeless", "polydrug", "hivstat", "sexabuse", "dprsn_dx", "hplsns" +) + +# The real confounders are the *past* versions of the time-varying covariates +# (a person's earlier circumstances), which we never observe. Each latent past +# state `U_x` causes both its current measured proxy `x` and the 30-day outcome. +past <- paste0("U_", timevarying) + +needles_dag <- dagitty(paste0("dag {", paste(c( + # demographics cause the latent past states and the outcome directly + as.vector(outer(demographics, past, function(a, b) paste(a, "->", b))), + paste(demographics, "-> shared_syr"), + # each latent past state -> its current measured proxy, and -> the outcome + paste(past, "->", timevarying), + paste(past, "-> shared_syr"), + # the only mediator: past homelessness -> more injecting -> more sharing + "U_homeless -> nivdu", "nivdu -> shared_syr", + # study inclusion is conditioned on; several covariates affect being sampled, + # conditioning on `selected` then induces associations (selection bias) + paste(c(demographics, "homeless", "polydrug", "hivstat"), "-> selected") +), collapse = "; "), "}")) + +# layout: time flows left -> right (birth-fixed -> latent past -> current proxy) +y6 <- seq(3.5, -3.5, length.out = 6) +coordinates(needles_dag) <- list( + x = c( + age = -1, sex = -1, ethn = -1, + setNames(rep(1, 6), past), setNames(rep(3, 6), timevarying), + nivdu = 4.3, shared_syr = 6, selected = 3 + ), + y = c( + age = 1, sex = 0, ethn = -1, + setNames(y6, past), setNames(y6, timevarying), + nivdu = 2.4, shared_syr = 0, selected = -4.7 + ) +) + +needles_dag |> + tidy_dagitty() |> + mutate(role = case_when( + name == "homeless" ~ "exposure", + name == "shared_syr" ~ "outcome", + name == "nivdu" ~ "mediator", + name == "selected" ~ "study inclusion (conditioned on)", + startsWith(name, "U_") ~ "latent past state (unobserved)", + TRUE ~ "measured covariate" + )) |> + ggplot() + + aes(x = x, y = y, xend = xend, yend = yend) + + geom_dag_edges( + start_cap = ggraph::circle(5, "mm"), + end_cap = ggraph::circle(5, "mm") + ) + + geom_dag_point(aes(colour = role, shape = role), size = 7) + + geom_dag_text(colour = "black", size = 1.7) + + scale_shape_manual(values = c( + "exposure" = 16, "outcome" = 16, "mediator" = 16, + "study inclusion (conditioned on)" = 15, + "latent past state (unobserved)" = 1, "measured covariate" = 16 + )) + + scale_colour_manual(values = c( + "exposure" = "#d73027", "outcome" = "#4575b4", "mediator" = "#1a9850", + "study inclusion (conditioned on)" = "grey30", + "latent past state (unobserved)" = "grey55", + "measured covariate" = "grey80" + )) + + theme_dag() + + theme(legend.position = "bottom", legend.title = element_blank()) +``` + +Hypothesized causal structure for the needle-sharing example, +drawn for a **cross-sectional** study. +The exposure of interest is homelessness (`homeless`, red) +and the outcome is the count of shared syringes (`shared_syr`, blue). +Because every covariate is measured at a single interview, +the current measured covariates (grey) cannot be ordered causally among +themselves; each is a proxy for an unobserved **past state** (`U_…`, hollow) +that is the actual cause of both the current measurement and the 30-day behaviour. +The only genuine mediator is injection frequency (`nivdu`, green): +sharing a syringe requires injecting. +Demographic variables (`age`, `sex`, `ethn`) are fixed at (or determined by) birth. +The study-inclusion indicator (`selected`, square) is conditioned on by design; +because several covariates influence who is sampled, +conditioning on it induces associations among them (selection bias). + +::: + +::: notes +We treat **homelessness as the exposure of interest** +and ask how it affects the number of shared syringes — +following @vittinghoff2e [§8.3.1], who model this outcome on `homeless`. + +Injection frequency `nivdu` is a **mediator**: +a shared syringe is necessarily one of the injections, +so homelessness can raise sharing partly by raising how often a person injects. +Because `nivdu` lies *on* the causal path from `homeless` to `shared_syr`, +we do **not** adjust for it +(neither as a covariate nor as an offset); +doing so would block that path and estimate only a partial effect. +We want the **total** effect of homelessness on needle-sharing. + +Because the study is **cross-sectional**, +the remaining covariates are measured at the same interview as the exposure, +so none of them can be ordered causally relative to `homeless`. +We read each one as a proxy for an unobserved **past state** (`U_…`) +that confounds the homelessness–sharing relationship, +and we adjust for the measured proxies to control that confounding. +A single fitted coefficient is therefore an **adjusted association**, +not an isolated causal effect. + +Conditioning on study inclusion (`selected`) is unavoidable — +we only observe people who were sampled — +but because several covariates influence who is sampled, +it can induce **selection bias** that no covariate adjustment removes. +::: + ```{r} #| tbl-cap: "Needle-sharing data" #| label: tbl-needle-data @@ -67,40 +253,95 @@ needles --- +@fig-needles plots the shared-syringe counts against age, +split by the covariate subgroups (point size shows injection frequency). + ```{r} -#| fig-cap: "Rates of needle sharing" +#| fig-cap: "Needle-sharing counts by age" #| label: fig-needles library(ggplot2) -needles |> - ggplot( - aes( - x = age, - y = shsyryn, - shape = sex, - col = ethn - ) - ) + - geom_point( - aes(size = nivdu), - alpha = .5) + +needles |> + ggplot() + + aes( + x = age, + y = shared_syr, + shape = sex, + col = hivstat + ) + + geom_point(aes(size = nivdu), alpha = .5) + scale_size_area(max_size = 4) + - facet_grid( - cols = vars(polydrug), - rows = vars(homeless)) + + scale_y_continuous( + transform = "pseudo_log", + breaks = c(0, 1, 3, 10, 30, 60) + ) + + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + + labs(y = "Number of shared syringes (30 days)") + theme(legend.position = "bottom") ``` + +{{< slidebreak >}} + +@fig-needles-raw shows the underlying counts: +the number of shared syringes against the number of injections, +with a dashed $y = x$ line marking the ceiling +where every injection involves a shared syringe. + +::: {#fig-needles-raw} + +```{r} +#| code-fold: true +library(ggplot2) + +needles |> + ggplot() + + aes( + x = nivdu, + y = shared_syr, + shape = sex, + col = hivstat + ) + + geom_point(alpha = .5) + + geom_abline(slope = 1, intercept = 0, linetype = "dashed", alpha = .5) + + # both axes are counts, so pseudo-log both: a shared transform keeps the + # y = x reference line straight (data diagonal -> panel diagonal) + scale_x_continuous( + transform = "pseudo_log", + breaks = c(0, 3, 10, 30, 100, 300, 900) + ) + + scale_y_continuous( + transform = "pseudo_log", + breaks = c(0, 1, 3, 10, 30, 60) + ) + + facet_grid(cols = vars(polydrug), rows = vars(homeless)) + + labs( + x = "Number of injections (in 30 days)", + y = "Number of shared syringes (in 30 days)" + ) + + guides( + col = guide_legend(nrow = 2), + shape = guide_legend(nrow = 2) + ) + + theme(legend.position = "bottom", legend.box = "vertical") +``` + +Number of shared syringes versus number of injections +(raw counts, 30-day window). +The dashed line marks $y = x$ (every injection involves a shared syringe). + +::: + --- -#### Covariate counts +## Covariate counts ```{r} -#| tbl-cap: "Counts of observations in `needles` dataset by sex, unhoused status, and multiple drug use" +#| tbl-cap: "Counts in `needles` by sex, housing status, and multiple drug use" #| label: tbl-count-needles #| code-fold: show -needles |> - dplyr::select(sex, homeless, polydrug) |> +needles |> + dplyr::select(sex, homeless, polydrug) |> summary() ``` @@ -113,23 +354,71 @@ We will remove that individual: ```{r} #| label: remove-trans-obs #| code-summary: 'remove singleton observation with sex == Trans' -needles = needles |> filter(sex != "Trans") +needles <- needles |> filter(sex != "Trans") ``` --- -### Model {.smaller} +## Model {.smaller} + +Following @vittinghoff2e [§8.3.1], +we take **homelessness as the exposure of interest** +and ask how it affects the **count** of shared syringes, +using a Poisson regression (log link) rather than modeling a rate. +The remaining covariates enter as adjustments for confounding +(@fig-needles-dag): because the study is cross-sectional, +each is read as a proxy for an unobserved past state +that may influence both homelessness and sharing. + +**Why not use `nivdu` as an offset?** +A Poisson *rate* model would add `offset(log(nivdu))`, +forcing the expected count to be proportional to injection frequency +(the coefficient of `log(nivdu)` fixed at exactly 1). +We avoid that here for two reasons: + +- **`nivdu` is a mediator, not an exposure window.** +An offset is appropriate when its variable is a known denominator — +person-time or population at risk — that is external to the causal +question. Injection frequency is instead on the causal path from +homelessness to sharing (@fig-needles-dag): being homeless can raise +sharing partly by raising how often a person injects. Conditioning on it +(whether as an offset or as a covariate) would block that path and estimate +only a partial effect. We want the *total* effect of homelessness on +needle-sharing. +- **It imposes an untested proportionality assumption.** +Fixing the `log(nivdu)` coefficient at 1 assumes sharing scales +one-for-one with injecting, which the data need not support. + +Vittinghoff's analysis uses neither an offset nor `nivdu` as a covariate. + +::: notes +**Rule of thumb.** +An offset's exposure is usually something fixed by the *study design* — +person-time of follow-up, population at risk, area surveyed, number of +trials — a known denominator that is *exogenous* to the question +(not caused by the predictors of interest). +When the candidate denominator is instead something the participants +*did*, and the predictors also influence it (here, how often they inject), +it is a mediator rather than an exposure window, and we model the count +directly instead of dividing by it. +::: + +We adjust for the measured covariates as proxies for the latent past +confounders in @fig-needles-dag — demographic (`age`, `sex`, `ethn`), +social (`polydrug`), serostatus (`hivstat`), and trauma/mental-health +(`sexabuse`, hopelessness score `hplsns`) — using **main effects only**. +We exclude the mediator `nivdu` (above) and the depression indicator +`dprsn_dx`, whose `1`/`5` coding is undocumented in the source data and +so cannot be interpreted. ```{r} -glm1 = - needles |> - dplyr::filter(nivdu > 0) |> - glm( - offset = log(nivdu), - family = stats::poisson, - formula = shared_syr ~ age + sex + homeless*polydrug - ) +glm1 <- glm( + formula = shared_syr ~ homeless + + age + sex + ethn + polydrug + hivstat + sexabuse + hplsns, + family = stats::poisson, + data = needles +) library(equatiomatic) equatiomatic::extract_eq(glm1) ``` @@ -137,27 +426,136 @@ equatiomatic::extract_eq(glm1) --- ```{r} -#| tbl-cap: "Poisson model for needle-sharing data" +#| tbl-cap: "Poisson count model for needle-sharing data" #| label: tbl-needles-pois library(parameters) -glm1 |> parameters(exponentiate = TRUE) |> +glm1 |> + parameters(exponentiate = TRUE) |> print_md() ``` --- ```{r} -#| tbl-cap: "fitted Poisson model for needle-sharing data" +#| fig-cap: "Fitted Poisson model: expected sharing by age and housing status" #| label: fig-needles-pois -library(sjPlot) -glm1 |> - sjPlot::plot_model( - type = "pred", - terms = c("age", "sex", "homeless", "polydrug"), - show.data = TRUE +#| code-fold: true +library(ggplot2) + +# Predicted expected counts across age for the two homelessness groups, +# holding the other covariates at their reference levels (hplsns at its +# median). The model has no offset, so predict(type = "response") returns +# the expected number of shared syringes directly. +ref_lvl <- function(v) levels(glm1$model[[v]])[1] +pred_grid <- expand.grid( + age = seq(min(glm1$model$age), max(glm1$model$age), length.out = 100), + homeless = levels(glm1$model$homeless), + sex = ref_lvl("sex"), + ethn = ref_lvl("ethn"), + polydrug = ref_lvl("polydrug"), + hivstat = ref_lvl("hivstat"), + sexabuse = ref_lvl("sexabuse"), + hplsns = median(glm1$model$hplsns) +) +pred_grid$shared_syr <- predict(glm1, newdata = pred_grid, type = "response") + +ggplot(glm1$model) + + aes(x = age, y = shared_syr, colour = homeless) + + geom_point(alpha = 0.4) + + geom_line(data = pred_grid, linewidth = 1) + + scale_y_continuous( + transform = "pseudo_log", + breaks = c(0, 1, 3, 10, 30, 60) + ) + + labs( + y = "Number of shared syringes (30 days)", + colour = "Housing status" ) + + theme_bw() + theme(legend.position = "bottom") ``` +## Interpreting the coefficients {.smaller} + +For a Poisson model with a log link, +exponentiating a coefficient turns it into a *ratio of expected counts* +(an **incidence rate ratio**, IRR, in @vittinghoff2e's terminology): +$\exp{\b_j}$ is the multiplicative change in the expected number of +shared syringes per one-unit increase in $x_j$, +holding the other covariates fixed. +Our **focal estimate is the IRR for homelessness**; +the remaining coefficients are adjustments for confounding, +not effects of separate interest. +Because we do not condition on the mediator `nivdu`, +the homelessness IRR is a *total* effect. +The fitted ratios are reported in @tbl-needles-pois. + +```{r} +#| code-fold: true +irr <- exp(coef(glm1)) +``` + +**Baseline (intercept).** +$\exp{\hat\b_0} = `r round(irr[["(Intercept)"]], 3)`$ +is the expected count of shared syringes for the reference participant +(male, not homeless, White, one drug used, HIV-negative, +no sexual-abuse history, hopelessness score 0), extrapolated to age 0. +Age 0 is outside the data, so the intercept is a mathematical anchor +rather than a directly interpretable quantity. + +**Age.** +$\exp{\hat\b_{\text{age}}} = `r round(irr[["age"]], 3)`$ per year: +essentially no association between age and the expected count. + +**Sex.** +$\exp{\hat\b_{\text{sexF}}} = `r round(irr[["sexF"]], 2)`$: +women's expected count of shared syringes is about +`r round(irr[["sexF"]], 2)` times men's, +adjusting for the other covariates. + +**Homelessness (the focal exposure).** +$\exp{\hat\b_{\text{homeless}}} = `r round(irr[["homelesshomeless"]], 2)`$: +homeless participants share syringes about +`r round(irr[["homelesshomeless"]], 2)` times as often as the non-homeless, +adjusting for the other covariates. +@vittinghoff2e [§8.3.1] reports an *unadjusted* IRR of about 3.3 +for homelessness in a single-predictor model; +adjusting for the confounder-proxies here moves the estimate +to about `r round(irr[["homelesshomeless"]], 1)`. + +**Multiple-drug use.** +$\exp{\hat\b_{\text{polydrug}}} = `r round(irr[["polydrugmultiple drugs used"]], 2)`$: +multiple-drug users have a *lower* expected count than single-drug users, +though only `r sum(glm1$model[["polydrug"]] == "multiple drugs used")` +participants used multiple drugs, so this estimate is imprecise. + +**HIV serostatus.** +$\exp{\hat\b_{\text{hivstatHIV+}}} = `r round(irr[["hivstatHIV+"]], 2)`$: +HIV-positive participants have a lower expected count of shared syringes +than HIV-negative participants, adjusting for the other covariates. +Only `r sum(glm1$model[["hivstat"]] == "HIV+")` participants are HIV+ +in this sample, so this estimate is imprecise. + +**Ethnicity.** +The `ethn` coefficients compare each retained group to the White reference +(small groups were lumped into "Other"); they adjust for any confounding +of the homelessness effect by ethnicity but are not of separate interest. + +**Sexual-abuse history.** +$\exp{\hat\b_{\text{sexabuse}}} = `r round(irr[["sexabuseyes"]], 2)`$: +the expected count for participants reporting a history of sexual abuse +relative to those not reporting one, adjusting for the other covariates. + +**Hopelessness.** +$\exp{\hat\b_{\text{hplsns}}} = `r round(irr[["hplsns"]], 2)`$ per point: +the multiplicative change in the expected count per one-point increase +in the hopelessness score. + +**Overdispersion caveat.** +These data are highly overdispersed — the variance far exceeds the mean — +so the model-based Poisson standard errors are too small. +@vittinghoff2e [§8.3.1] recommends robust standard errors or a +negative-binomial model for this example. + diff --git a/inst/WORDLIST b/inst/WORDLIST index dfc2e21edf..9ca1d5d989 100644 --- a/inst/WORDLIST +++ b/inst/WORDLIST @@ -93,6 +93,7 @@ antiderivative antiderivatives arg ba +bidirected binom biomarkers bmatrix @@ -166,6 +167,7 @@ ki ldots le leftarrow +leftrightarrow leq lik linreg diff --git a/references.bib b/references.bib index 845703a653..dc28bee0c5 100644 --- a/references.bib +++ b/references.bib @@ -1689,3 +1689,14 @@ @article{debruyn2011mira year={2011}, doi={10.1136/sti.2010.044990} } + +@article{tsui2009hcv, + title={Risk behaviors after hepatitis {C} virus seroconversion in young injection drug users in {San Francisco}}, + author={Tsui, Judith I. and Vittinghoff, Eric and Hahn, Judith A. and Evans, Jennifer L. and Davidson, Peter J. and Page, Kimberly}, + journal={Drug and Alcohol Dependence}, + volume={105}, + number={1-2}, + pages={160--163}, + year={2009}, + doi={10.1016/j.drugalcdep.2009.05.022} +} diff --git a/renv.lock b/renv.lock index f5008745dd..9b94d44ed9 100644 --- a/renv.lock +++ b/renv.lock @@ -1014,7 +1014,7 @@ }, "bit64": { "Package": "bit64", - "Version": "4.8.0", + "Version": "4.8.2", "Source": "Repository", "Title": "A S3 Class for Vectors of 64bit Integers", "Authors@R": "c( person(\"Michael\", \"Chirico\", email=\"michaelchirico4@gmail.com\", role=c(\"aut\", \"cre\")), person(\"Jens\", \"Oehlschlägel\", role=\"aut\"), person(\"Leonardo\", \"Silvestri\", role=\"ctb\"), person(\"Ofek\", \"Shilon\", role=\"ctb\"), person(\"Christian\", \"Ullerich\", role=\"ctb\") )", @@ -1884,7 +1884,7 @@ }, "clipr": { "Package": "clipr", - "Version": "0.8.0", + "Version": "0.8.1", "Source": "Repository", "Type": "Package", "Title": "Read and Write from the System Clipboard", @@ -1906,10 +1906,10 @@ "VignetteBuilder": "knitr", "Encoding": "UTF-8", "Language": "en-US", - "RoxygenNote": "7.1.2", + "RoxygenNote": "7.3.3", "SystemRequirements": "xclip (https://github.com/astrand/xclip) or xsel (http://www.vergenet.net/~conrad/software/xsel/) for accessing the X11 clipboard, or wl-clipboard (https://github.com/bugaevc/wl-clipboard) for systems using Wayland.", "NeedsCompilation": "no", - "Author": "Matthew Lincoln [aut, cre] (), Louis Maddox [ctb], Steve Simpson [ctb], Jennifer Bryan [ctb]", + "Author": "Matthew Lincoln [aut, cre] (ORCID: ), Louis Maddox [ctb], Steve Simpson [ctb], Jennifer Bryan [ctb]", "Maintainer": "Matthew Lincoln ", "Repository": "CRAN" }, @@ -1929,7 +1929,7 @@ "License": "GPL (>= 2)", "URL": "https://www.R-project.org", "NeedsCompilation": "yes", - "Repository": "https://packagemanager.posit.co/cran/__linux__/noble/latest", + "Repository": "https://packagemanager.posit.co/cran/latest", "Encoding": "UTF-8" }, "coda": { @@ -2164,7 +2164,7 @@ }, "cpp11": { "Package": "cpp11", - "Version": "0.5.4", + "Version": "0.5.5", "Source": "Repository", "Title": "A C++11 Interface for R's C Interface", "Authors@R": "c( person(\"Davis\", \"Vaughan\", email = \"davis@posit.co\", role = c(\"aut\", \"cre\"), comment = c(ORCID = \"0000-0003-4777-038X\")), person(\"Jim\",\"Hester\", role = \"aut\", comment = c(ORCID = \"0000-0002-2739-7082\")), person(\"Romain\", \"François\", role = \"aut\", comment = c(ORCID = \"0000-0002-2444-4226\")), person(\"Benjamin\", \"Kietzman\", role = \"ctb\"), person(\"Posit Software, PBC\", role = c(\"cph\", \"fnd\")) )",