Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
60d3af3
revise count regression chapter (issue #747)
github-actions[bot] May 16, 2026
7181ab9
Merge remote-tracking branch 'origin/main' into claude/issue-747-2026…
d-morrison Jun 2, 2026
3323cd5
Fix wrong variables in needle-sharing exploratory and prediction figures
claude Jun 2, 2026
ceaac41
Add demographics table and enrich data dictionary
claude Jun 2, 2026
6316dcc
Address #860 review findings: CI formulas, dsim, T-conditioning, slid…
claude Jun 2, 2026
72fdee5
Lump rare ethnicities via forcats::fct_lump_min
claude Jun 2, 2026
d4b48d9
fig-needles-pois: plot model predictions on the rate scale
claude Jun 2, 2026
05d412b
Cross-link to def-orthogonal-vectors; tighten fig-needles-pois caption
claude Jun 2, 2026
b763dcc
Fix P(Z=1|X=x,T=t) to P(Z=1|X=x) in exr-zinf-pmf exercise
claude[bot] Jun 2, 2026
0cf320a
Fix (y - \hat\mu) to (\vy - \hat\mu) for bold vector consistency
claude[bot] Jun 2, 2026
5e11532
Add fig-needles-raw: scatter of shared_syr vs nivdu
claude[bot] Jun 2, 2026
74378b8
Add x=y reference line, wrap legend, and causal DAG to needle-sharing
d-morrison Jun 2, 2026
db60071
Use gtsummary for tbl-needles-demographics; show all variables incl. …
d-morrison Jun 2, 2026
9e75c25
Clean up lint findings in needle-sharing data/model chunks
d-morrison Jun 2, 2026
26d2d98
Address review findings on count-regression PR
d-morrison Jun 2, 2026
7bc9155
Address non-blocking review notes: consistent conditioning and vector…
d-morrison Jun 2, 2026
2b65df6
Convention: estimator symbol on top of vector symbol
claude Jun 2, 2026
ca62297
Contrast rate vs count models, add coefficient interpretation, move DAG
d-morrison Jun 2, 2026
df1d1bd
Merge remote-tracking branch 'origin/claude/issue-747-20260516-0652' …
d-morrison Jun 2, 2026
bbf9c26
Address review: slidebreaks, div tables, cross-refs, macros, shsyr label
d-morrison Jun 2, 2026
482b8db
Model the count directly: drop the nivdu offset and the interaction
d-morrison Jun 2, 2026
576db1d
Add dataset introduction; plot counts in fig-needles
d-morrison Jun 2, 2026
f7756c1
Address review: cite/soften intro claims, cross-ref fig-needles, drop…
d-morrison Jun 2, 2026
4cb8fe6
Reword "exposure" in intro to avoid offset connotation
d-morrison Jun 2, 2026
5788ac5
Add aside: design-determined/exogenous exposure vs. behavioral mediator
d-morrison Jun 2, 2026
a500a71
Add hivstat to model+figures and expand DAG to all dataset variables
claude Jun 3, 2026
cb60443
DAG: revert to plot(needles_dag); inline labels in caption
claude Jun 3, 2026
b3ea696
Add hivstat to the NB and zero-inflated models too
claude Jun 3, 2026
5433554
Soften unsourced "longitudinal panel study" claim in needle-sharing i…
d-morrison Jun 3, 2026
aab866d
Note Tsui et al. (2009) as a possible (unconfirmed) data source
d-morrison Jun 3, 2026
0c68ab5
Merge branch 'main' into claude/issue-747-20260516-0652
d-morrison Jun 3, 2026
c87f507
Merge branch 'main' into claude/issue-747-20260516-0652
d-morrison Jun 3, 2026
a95e486
Address #860 review: hivstat consistency + ziformula safety
claude Jun 3, 2026
b00607f
Fix offset xref; drop ZI covariates; pseudo-log count axes; un-nest aes
d-morrison Jun 3, 2026
e433d8e
Show covariate causal structure in fig-needles-dag
d-morrison Jun 3, 2026
97a3a72
Fix offset link target; protect hepatitis {C} in tsui2009hcv
d-morrison Jun 3, 2026
6188640
Redesign fig-needles-dag: cross-sectional, homelessness-focal causal …
d-morrison Jun 3, 2026
7d2fcde
Reframe needle-sharing example around homelessness as the exposure
d-morrison Jun 4, 2026
8d2d87a
Merge branch 'main' into claude/issue-747-20260516-0652
d-morrison Jun 5, 2026
03effe5
Merge remote-tracking branch 'origin/main' into claude/issue-747-2026…
claude Jun 6, 2026
141e3ab
Merge branch 'main' into claude/issue-747-20260516-0652
d-morrison Jun 9, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
Expand Up @@ -588,6 +588,10 @@ Key macros to use:
- **Greek letters**: Use `\b` for $\beta$, `\g` for $\gamma$, `\a` for $\alpha$
- **Formatting**: Use `\red{...}` and `\blue{...}` for colored text in math
- **Deviation/error notation**: Use `\erf{...}` for deviations of estimates/estimators from their estimands; use `\devn(...)` for all other deviations (e.g., observations from population means)
- **Estimators of vector estimands**: the estimator symbol (e.g. `\hat`,
`\bar`, `\tilde`) goes on top of the vector symbol, not inside it —
write `\hat{\v{\mu}}`, not `\v{\hat\mu}`. The hat sits on top of the
already-vectorized symbol.

matrix-product helper macros:

Expand Down
3 changes: 3 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,9 @@ Before committing any `.qmd`, `.R`, or config file change:
- Key macros: `\E{Y|X=x}`, `\ba`/`\ea`, `\tp{v}`, `\b`, `\g`, `\a`, `\devn(...)`, `\erf{...}`
- Include every intermediate step in derivations — do not skip steps
- Color coding: `\red{...}` for focal/extra terms, `\blue{...}` for shared terms
- Estimators of vector estimands: the estimator symbol (e.g. `\hat`) goes
on top of the vector symbol, not inside it — write `\hat{\v{\mu}}`, not
`\v{\hat\mu}`. (Same for `\bar`, `\tilde`, etc.)
- Ratios vs. factors:
- Use the generic `\ratio`/`\ratiof` macro when a ratio's inputs are the **quantities themselves** (the odds, hazards, rates, etc.) — e.g. `\ratio(\odds_1, \odds_2)`, **not** `\ror(\odds_1, \odds_2)` — because the type of ratio is clear from the inputs.
- Use the type-subscripted ratio macros (`\ror` for odds ratios, `\hazratio`/`\hr` for hazard ratios, `\rateratio`, `\riskratio`, `\prevratio`, `\cuhazratio`, …) only when the inputs are **covariate patterns** (e.g. `\ror(\vx,\vxs)`, `\hr(t\mid\vx:\vxs)`), where the subscript is needed to say which kind of ratio it is.
Expand Down
7 changes: 6 additions & 1 deletion _subfiles/count-regression/_exr-prac-glm-score.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,12 @@ weighted by $x_i$.

More generally,
these score equations say that the **residuals $(y_i - \hat\mu_i)$
are orthogonal to each predictor column**.
are [orthogonal](math-prereqs.qmd#def-orthogonal-vectors)
to each predictor column**:
for each predictor $j$, the residual vector $(\vy - \hat{\v{\mu}})$ satisfies
$\tp{\vx_{(j)}}(\vy - \hat{\v{\mu}}) = 0$,
where $\vx_{(j)} = (x_{1j}, \ldots, x_{nj})$ is the column of $j$-th
predictor values across observations.
This is the GLM analogue of the OLS normal equations.

:::
4 changes: 2 additions & 2 deletions _subfiles/count-regression/_sec-overdispersion.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,8 @@ In practice, many count distributions will have a variance substantially larger

A random variable $X$ is **overdispersed**
relative to a model $\p(X=x)$ if
if its empirical variance in a dataset is larger than
the value is predicted by the fitted model $\hat{\p}(X=x)$.
its empirical variance in a dataset is larger than
the value predicted by the fitted model $\hat{\p}(X=x)$.

::::

Expand Down
2 changes: 1 addition & 1 deletion _subfiles/count-regression/_sec_pois-reg_intro.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ $$
In contrast with the other covariates (represented by $\vX$),
$t$ enters this expression with a $\log{}$ transformation
and without a corresponding $\beta$ coefficient;
in other words, $\logf{t}$ is an [offset term](poisson.qmd#def-offset).
in other words, $\logf{t}$ is an [offset term](probability.qmd#def-offset).
:::

---
Expand Down
10 changes: 5 additions & 5 deletions _subfiles/count-regression/_sec_poisson_dx.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -17,17 +17,17 @@ where $h$ is the "leverage" (which we will continue to leave undefined).
#### Deviance residuals

$$
d_k = \text{sign}(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\}
d_k = \signt(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\}
$$

:::{.callout-note}

$$\text{sign}(x) \eqdef \frac{x}{|x|}$$
$$\signt(x) \eqdef \frac{x}{|x|}$$
In other words:

* $\text{sign}(x) = -1$ if $x < 0$
* $\text{sign}(x) = 0$ if $x = 0$
* $\text{sign}(x) = 1$ if $x > 0$
* $\signt(x) = -1$ if $x < 0$
* $\signt(x) = 0$ if $x = 0$
* $\signt(x) = 1$ if $x > 0$

::::{.content-hidden}

Expand Down
43 changes: 36 additions & 7 deletions _subfiles/count-regression/_sec_poisson_inference.qmd
Original file line number Diff line number Diff line change
@@ -1,26 +1,55 @@

### Confidence intervals for regression coefficients and rate ratios

As usual:
A Wald 95% confidence interval for a single coefficient $\beta_j$ is:

$$
\beta \in \left[\ci \right]
\beta_j \in \left[\hat\beta_j \pm \ciradf{\hat\beta_j}\right]
$$

Rate ratios: exponentiate CI endpoints
where $z_{1-\alpha/2} \approx 1.96$ for $\alpha = 0.05$.

Because the log-rate scale is related to the rate scale by exponentiation,
we obtain a confidence interval for the rate ratio $e^{\beta_j}$
by exponentiating both endpoints:

$$
\exp{\beta} \in \left[\exp{\ci} \right]
e^{\beta_j} \in
\left[
\exp{\hat\beta_j - \ciradf{\hat\beta_j}},\;
\exp{\hat\beta_j + \ciradf{\hat\beta_j}}
\right]
$$

### Hypothesis tests for regression coefficients

To test $H_0: \beta_j = \beta_0$ against a one- or two-sided alternative,
compute the Wald $z$-statistic:

$$
z = \frac{\hat \beta - \beta_0}{\hse{\hb}}
z = \frac{\hat \beta_j - \beta_0}{\hse{\hat\beta_j}}
$$

Compare $z$ or $|z|$ to the tails of the standard Gaussian distribution, according to the null hypothesis.
and compare $z$ (one-sided) or $|z|$ (two-sided) to the tails of the
standard Gaussian distribution.
The most common null hypothesis is $H_0: \beta_j = 0$,
i.e., that covariate $j$ has no association with the outcome rate.

### Comparing nested models

log(likelihood ratio) tests, as usual.
To compare a smaller model $M_0$ (with $p_0$ parameters) to a larger model
$M_1$ (with $p_1 > p_0$ parameters), use the likelihood ratio test statistic:

$$
G^2 = 2\bigl[\hat\ell_1 - \hat\ell_0\bigr]
$$

where $\hat\ell_1$ and $\hat\ell_0$ are the maximized log-likelihoods
of $M_1$ and $M_0$ respectively.
(Here the subscripts index the two *models*; they are unrelated to the
scalar null value $\beta_0$ used in the Wald test above.)

Under $H_0$ that the additional $p_1 - p_0$ parameters are all zero,
$G^2 \dsim \chi^2_{p_1 - p_0}$.

See @dobson4e [Chapter 9] and @vittinghoff2e [§8.1] for details.
99 changes: 96 additions & 3 deletions _subfiles/count-regression/_sec_zero-inflation.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -31,12 +31,105 @@ $$
---

::: {#exr-zinf-pmf}
Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x,T=t)$ and $P(Y=y|Z=0,X=x,T=t)$.
Expand $P(Y=0|X=x,T=t)$, $P(Y=1|X=x,T=t)$ and $P(Y=y|X=x,T=t)$ into expressions involving $P(Z=1|X=x)$ and $P(Y=y|Z=0,X=x,T=t)$.
:::

---
::: {.solution}

Let $\pi = \P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$.

**$P(Y=0)$:** $Y=0$ occurs either because $Z=1$ (always zero)
or because $Z=0$ and the Poisson draw equals 0:

$$
\ba
\P(Y=0|X=x,T=t)
&= \P(Z=1|X=x) + \P(Z=0|X=x)\,\P(Y=0|Z=0,X=x,T=t)\\
&= \pi + (1-\pi)\,e^{-\mu_0}
\ea
$$

**$P(Y=1)$:** $Z=1$ can never produce $Y=1$, so:

$$
\P(Y=1|X=x,T=t)
= (1-\pi)\,\P(Y=1|Z=0,X=x,T=t)
= (1-\pi)\,\mu_0 e^{-\mu_0}
$$

**$P(Y=y)$ for $y \geq 1$:** Identical reasoning gives

$$
\P(Y=y|X=x,T=t)
= (1-\pi)\,\frac{\mu_0^y e^{-\mu_0}}{y!}
$$

:::

{{< slidebreak >}}

::: {#exr-zinf-moments}

Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $P(Z=1|X=x,T=t)$ and $\Expp[Y|Z=0,X=x,T=t]$.
Derive the expected value and variance of $Y$, conditional on $X$ and $T$, as functions of $\pi = P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$.
:::

::: {.solution}

Let $\pi = \P(Z=1|X=x)$ and $\mu_0 = \Expp[Y|Z=0,X=x,T=t]$.

**Expected value.** By the Law of Total Expectation
(conditioning on $Z$, within the subpopulation $\{X=x, T=t\}$):

$$
\ba
\Expp[Y|X=x,T=t]
&= \Expp[Y|Z=1,X=x,T=t]\,\pi + \Expp[Y|Z=0,X=x,T=t]\,(1-\pi)\\
&= 0 \cdot \pi + \mu_0(1-\pi)\\
&= (1-\pi)\,\mu_0
\ea
$$

The substitution $\Expp[Y|Z=0,X=x,T=t] = \mu_0$ follows immediately
from the definition of $\mu_0$ above.

**Variance.** By the Law of Total Variance.
To reduce clutter, we suppress the $(X=x, T=t)$ conditioning in the
intermediate steps below: every expectation and variance is taken within
the subpopulation $\{X=x, T=t\}$, and we restore the explicit conditioning
in the final line.

$$
\Var{Y} = \Expp[\Var{Y|Z}] + \Var{\Expp[Y|Z]}
$$

For the first term, since $\Var{Y|Z=1}=0$ and $\Var{Y|Z=0}=\mu_0$ (Poisson):

$$
\Expp[\Var{Y|Z}] = 0 \cdot \pi + \mu_0(1-\pi) = (1-\pi)\mu_0
$$

For the second term, $\Expp[Y|Z]$ takes the value 0 (with prob $\pi$)
or $\mu_0$ (with prob $1-\pi$), so:

$$
\ba
\Var{\Expp[Y|Z]}
&= \pi(0 - (1-\pi)\mu_0)^2 + (1-\pi)(\mu_0 - (1-\pi)\mu_0)^2\\
&= \pi(1-\pi)^2\mu_0^2 + (1-\pi)\pi^2\mu_0^2\\
&= \pi(1-\pi)\mu_0^2[\,(1-\pi)+\pi\,]\\
&= \pi(1-\pi)\mu_0^2
\ea
$$

Combining:

$$
\Var{Y|X=x,T=t} = (1-\pi)\mu_0 + \pi(1-\pi)\mu_0^2
= (1-\pi)\mu_0\bigl(1 + \pi\mu_0\bigr)
$$

Since $(1-\pi)\mu_0\bigl(1+\pi\mu_0\bigr) \geq (1-\pi)\mu_0 = \Expp[Y|X=x,T=t]$ for any $\pi > 0$,
zero-inflated count models always exhibit overdispersion relative to a Poisson model
with the same mean.

:::
20 changes: 15 additions & 5 deletions chapters/count-regression.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -72,13 +72,17 @@ This content is adapted from:

::: notes
There are alternatives to the Poisson model.
Most notably,
Most notably,
the [negative binomial model](probability.qmd#sec-nb-dist).
:::

We can still model $\mu$ as a function of $X$ and $T$ as before,
and we can combine this model with zero-inflation
The [negative binomial distribution](probability.qmd#sec-nb-dist)
is a common alternative to the Poisson distribution for count outcomes.
It adds a dispersion parameter that allows the variance to exceed the mean,
making it more flexible when overdispersion is present.
We can still model $\mu$ as a function of $X$ and $T$ as before,
and we can combine this model with zero-inflation
(as the conditional distribution for the non-zero component).
:::

---

Expand All @@ -88,7 +92,13 @@ and we can combine this model with zero-inflation

## Quasipoisson

An alternative to Negative binomial is the "quasipoisson" distribution. I've never used it, but it seems to be a method-of-moments type approach rather than maximum likelihood. It models the variance as $\Var{Y} = \mu\theta$, and estimates $\theta$ accordingly.
An alternative to the negative binomial model is the quasi-Poisson approach.
Rather than specifying a full probability distribution,
it uses a method-of-moments approach rather than maximum likelihood estimation.
It models the variance as $\Var{Y} = \mu\theta$
and estimates the dispersion parameter $\theta$ accordingly.
This approach is simpler to implement but provides less information
than the full negative binomial likelihood.

See `?quasipoisson` in R for more.

Expand Down
60 changes: 37 additions & 23 deletions chapters/exr-needle-sharing-extensions.qmd
Original file line number Diff line number Diff line change
@@ -1,45 +1,47 @@

```{r}
library(MASS) #need this for glm.nb()
glm1.nb = glm.nb(
data = needles,
shared_syr ~ age + sex + homeless*polydrug
library(MASS) # for glm.nb()
glm1_nb <- glm.nb(
formula = shared_syr ~ homeless +
age + sex + ethn + polydrug + hivstat + sexabuse + hplsns,
data = needles
)

equatiomatic::extract_eq(glm1.nb)
equatiomatic::extract_eq(glm1_nb)
```

```{r}
#| tbl-cap: "Negative binomial model for needle-sharing data"
#| label: tbl-needles-nb
summary(glm1.nb)
summary(glm1_nb)
```

---

```{r}
#| tbl-cap: "Poisson versus Negative Binomial Regression coefficient estimates"
#| label: tbl-compare-poisson-nb
tibble(name = names(coef(glm1)), poisson = coef(glm1), nb = coef(glm1.nb))
tibble(name = names(coef(glm1)), poisson = coef(glm1), nb = coef(glm1_nb))
```

#### zero-inflation

```{r}
#| tbl-cap: "Zero-inflated poisson model"
#| tbl-cap: "Zero-inflated Poisson model"
#| label: tbl-zeroinf-poisson
library(glmmTMB)
zinf_fit1 = glmmTMB(
zinf_pois <- glmmTMB(
family = "poisson",
data = needles,
formula = shared_syr ~ age + sex + homeless*polydrug,
ziformula = ~ age + sex + homeless + polydrug # fit won't converge with interaction
data = needles,
formula = shared_syr ~ homeless +
age + sex + ethn + polydrug + hivstat + sexabuse + hplsns,
# the zero-inflation submodel uses the exposure, matching Vittinghoff's
# `inflate(i.homeless)` (@vittinghoff2e, §8.3.1)
ziformula = ~ homeless
)

zinf_fit1 |>
zinf_pois |>
parameters(exponentiate = TRUE) |>
print_md()

```

::: notes
Expand All @@ -53,18 +55,30 @@ Another R package for zero-inflated models is [`pscl`](https://cran.r-project.or
```{r}
#| tbl-cap: "Zero-inflated negative binomial model"
#| label: tbl-zeroinf-nb
library(glmmTMB)
zinf_fit1 = glmmTMB(
zinf_nb <- glmmTMB(
family = nbinom2,
data = needles,
formula = shared_syr ~ age + sex + homeless*polydrug,
ziformula = ~ age + sex + homeless + polydrug
# fit won't converge with interaction
data = needles,
formula = shared_syr ~ homeless +
age + sex + ethn + polydrug + hivstat + sexabuse + hplsns,
ziformula = ~ homeless
)

zinf_fit1 |>
zinf_nb |>
parameters(exponentiate = TRUE) |>
print_md()

```

::: notes
Both zero-inflated models keep the zero-inflation (structural-zero) submodel
parsimonious — just the exposure, `ziformula = ~ homeless` —
matching @vittinghoff2e's `inflate(i.homeless)` [§8.3.1].

A richer zero-inflation submodel is poorly identified here.
With the negative binomial in particular,
the overdispersion parameter and the zero-inflation component
both compete to explain the excess zeros,
so adding several covariates to the submodel makes the logistic fit
*separate*: its coefficients diverge and the exponentiated estimates
(odds ratios) run off to $\pm\infty$.
Keeping the submodel small avoids that.
:::
Loading
Loading