HINTS/hints.Rmd at master · mlaviolet/HINTS · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
---
title: "HINTS with R"
author:
- name: Michael Laviolette, PhD MPH
  email: statman54@gmail.com
date: "`r format(Sys.time(), '%Y-%m-%d %H:%M')`"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 4)
```

```{r load-pkgs, echo=FALSE, message=FALSE}
library(broom)
library(dplyr)
library(haven)
library(purrr)
library(srvyr)
library(survey)
library(tidyr)
library(kableExtra)
```

# Introduction

The Health Information National Trends Survey (HINTS) is conducted by the National Cancer Institute (NCI) to collect nationally representative data about the American public's use of cancer-related information.

This document shows how to analyze the public HINTS data using the R statistics-graphics language. It follows the discussion in *Overview of the HINTS 5 Cycle 3 Survey and Data Analysis Recommendations*, available [here](https://hints.cancer.gov/data/download-data.aspx). Page numbers and section citations refer to that document.

# Analyzing HINTS data

## Recommendations for statistical analyses using HINTS 5 Cycle 3

NCI strongly recommends that analysts first assess for possible differences in the three survey modalities between their target variables. The modalities are

* Paper only
* Option to respond by paper or online ("Web option")
* Paper/online option with a \$10 incentive ("Web bonus")

If group differences are not found, the analyst can proceed with analyzing the entire sample. Otherwise the analyst can control for group membership or analyze only a particular group. The data for HINTS 5, Cycle 3 contains several sets of weights appropriate for different analyses. The weights consist of a final sampling weight for point estimates and a series of replicate weights for computing standard errors and confidence intervals. This table lists the weights that are appropriate for particular situations.

```{r wgt_tbl, echo=FALSE}
data.frame(
  `Analysis data` = c("Combined sample", "Combined sample, controlling for group differences", "Paper only", "Web option", "Web bonus"),
  `Final weight` = c("TG_all_FINWT0", "nwgt0", "TG1_FINWT0", "TG2_FINWT0",
                     "TG3_FINWT0"),
  `Replicate weights` = c("nwgt1 - nwgt150", "TG_all_FINWT1 - TG_all_FINWT50",
                          "TG_1_FINWT1 - TG_1_FINWT50", "TG_2_FINWT1 - TG_2_FINWT50",
                          "TG_3_FINWT1 - TG_3_FINWT50")) %>%
  kable() %>%
  kable_styling()
```

### Analyzing HINTS 5 Cycle 3 data using the composite sample

I'll begin with analyses of the composite sample. Later I'll discuss methods for assessing and dealing with group differences.

To start, load the following packages.

```{r show-pkgs, eval=FALSE}
library(broom)
library(dplyr)
library(haven)
library(purrr)
library(srvyr)
library(survey)
library(tidyr)
```

Most of the work is done by the `survey` and `srvyr` packages. The `survey` package provides functions for analysis of data from complex samples with similar functionality to SUDAAN. `srvyr` is designed to work with the `tidyverse` suite. I'll use the `tidyverse` packages extensively.

### Import and recode data
##### Pages 14-15

HINTS data are provided in three formats: SAS `.sas7bdat` data set, SPSS (`.sav`, `.por`), and Stata (`.dta`, versions prior to 13). All can be imported by R; I'll work with the SAS data. First we import the SAS data set into R using `read_sas()` from the `haven` package, then do some recoding.

```{r recodes}
edu_lbl <- c("Less than high school",
             "12 years or completed high school",
             "Some college",
             "College graduate or higher")
health_lbl <- c("Excellent", "Very good", "Good", "Fair", "Poor")
group_lbl <-  c("Paper only", "Web option", "Web bonus")
# convenience function to show percentages to four places like SAS
percent <- function(x, decimals = 4) round(100 * x, decimals)
# import SAS data set contained in .zip file
hints5_3 <- read_sas(unz("HINTS5_Cycle3_SAS_03112020.zip",
                         "hints5_cycle3_public.sas7bdat")) %>%
  mutate(gender = factor(GenderC, 1:2, c("Male", "Female")),
         # recode negative GeneralHealth values to missing
         GeneralHealth = if_else(GeneralHealth < 0, NA_real_, GeneralHealth),
         # collapse education to four levels
         edu = case_when(Education %in% 1:2 ~ 1,
                         Education == 3     ~ 2,
                         Education %in% 4:5 ~ 3,
                         Education %in% 6:7 ~ 4,
                         TRUE ~ NA_real_)) %>%
  # change edu and Treatment_H5C3 from numeric to factor
  mutate(edu = factor(edu, 1:4, edu_lbl),
         Treatment_H5C3 = factor(Treatment_H5C3, 1:3, group_lbl),
         # "No" must be the referent level of SeekCancerInfo to model
         # probability of "Yes". By default the first level of a factor is
         # the referent level, so I'll reverse the order from the data.
         SeekCancerInfo = factor(SeekCancerInfo, 2:1, c("No", "Yes")))
```

To analyze complex survey data in R we create a survey object that contains information about the sampling design and the survey data. For a replicate-weight design, we need to specify

* The data frame containing the survey data
* The variable containing the final sampling weight
* The variables containing the replicate weights
* The type of replication weights, such as jackknife or balanced repeated replication (BRR)

Standard errors can also be computed with Taylor series linearization. For more, see the section "Final Sample and Replicate Weights for Jackknife Replication."

### Frequency table and chi-square test
##### Pages 15-16

Using `survey_mean()` from `srvyr` I'll generate a cross-frequency table of education by gender, along with a Wald chi-squared test of independence. First  construct a replicate weights survey object using the `TG_all_FINWT` series of 50 replicate weights.

```{r survey_rep, warning = FALSE}
# construct replicate weights survey object
hints5_3_rep <- as_survey_rep(hints5_3, weights = "TG_all_FINWT0",
                              repweights = paste0("TG_all_FINWT", 1:50),
                              type = "JK1", scale = 49/50, mse = TRUE)

```

Now we can generate a table of unweighted sample sizes and percentages of gender within each education level along with confidence intervals.

```{r table_1, warning=FALSE}
tbl_1 <- hints5_3_rep %>%
  group_by(edu, gender) %>%
  summarize(n = unweighted(n()),
            pct = survey_mean(na.rm = TRUE, vartype = c("se", "ci"))) %>%
  drop_na() %>%
  mutate(across(starts_with("pct"), percent))

tbl_1 %>%
  kable() %>%
  kable_styling()
```

To perform the chi-square test we use `svychisq()` from `survey`. Wald and adjusted Wald tests are available.

```{r chi_sq_rep}
svychisq(~ edu + gender, hints5_3_rep, statistic = "Wald") %>%
  tidy()


svychisq(~ edu + gender, hints5_3_rep, statistic = "adjWald") %>%
  tidy()

```

Results match the table on page 16.

### Multivariable logistic regression of gender and education on `SeekCancerInfo`
##### Page 17

This example illustrates a multivariable logistic regression model using the `svyglm()` function from `survey`. The first argument is a modeling formula, the second is the survey object, and the third specifies a binary response. Store the fitted model in an object so that you can extract fitted values, residuals, and model metrics. For a logistic regression model use the argument `family = quasibinomial`.

```{r logistic}
model03 <- svyglm(SeekCancerInfo ~ gender + edu, hints5_3_rep,
                  family = quasibinomial)

```

To test for model effects, use the `regTermTest()` function. The first argument is the model object and the second is the model term or terms being tested (as a formula). Note the `df = 49` argument that adjusts for the replicate weight design. See the discussion "Denominator Degrees of Freedom (DDF)" on page 12 of the document. These results match the "Type 3 Analysis of Effects" table on page 17.

```{r test_terms}
regTermTest(model03, ~ gender, df = 49)
regTermTest(model03, ~ edu, df = 49)

```

To obtain odds ratios with confidence intervals, exponentiate the model coefficients and interval endpoints. Note the `ddf = 49` argument to the `tidy` function. Results match the "Odds Ratio Estimates" table on page 18. Note that you must have version 0.7.0 or later of the `broom` package to get correct results.

```{r odds_ratios}
# odds ratios
tidy(model03, conf.int = TRUE, exponentiate = TRUE, ddf = 49) %>%
  kable() %>%
  kable_styling()
```

### Multivariable linear regression of gender and education on `GeneralHealth`
##### Pages 18-19

Although `GeneralHealth` is actually a categorical variable I'll follow the document and treat it as continuous for purposes of discussion. Here's the linear regression model fitting `GeneralHealth` against `Treatment_H5C3`. In the following, omitting the `family` argument defaults to the normal regression model. This code reproduces the results on page 19.

```{r lin_model}
model04 <- svyglm(GeneralHealth ~ gender + edu, hints5_3_rep)
tidy(model04) %>%
  kable() %>%
  kable_styling()
```

```{r chi_sq_lin}
regTermTest(model04, ~ gender + edu)
regTermTest(model04, ~ gender)
regTermTest(model04, ~ edu)

```

### Taylor series linearization variance estimation method

Replication is the recommended method for variance estimation in HINTS. The variables VAR_CLUSTER and VAR_STRATUM are provided for variance estimation using Taylor series linearization. These variables are provided for users of software packages which don't support replicate weights. The following code constructs the survey object for analyzing the combined sample (no group differences) and replicates the results on pages 20 and 21. Standard errors and confidence intervals differ slightly from those obtained using replicate weights.

```{r taylor, warning=FALSE}
hints5_3_lin <-
  as_survey_design(hints5_3, ids = VAR_CLUSTER, strata = VAR_STRATUM,
                   weight = TG_all_FINWT0, nest = TRUE)
hints5_3_lin %>%
  group_by(edu, gender) %>%
  summarize(n = unweighted(n()),
            pct = survey_mean(na.rm = TRUE, vartype = c("se", "ci"))) %>%
  drop_na() %>%
  mutate(across(starts_with("pct"), percent)) %>%
  kable() %>%
  kable_styling()

```

```{r chi_sq_taylor}
svychisq(~ edu + gender, hints5_3_lin, statistic = "Wald") %>%
  tidy()

svychisq(~ edu + gender, hints5_3_lin, statistic = "adjWald") %>%
  tidy()

```

To assess for differences across modality groups, create a survey object `hints5_3_grp` that uses `nwgt0` as the sampling weight and `nwgt1` through `nwgt150` as replicate weights. Then fit a regression model using the variable of interest as the response and the group variable `Treatment_H5C3` as the covariate. Since the survey function `svychisq()` currently supports only two-way tables, use `regTermTest()` to test for group effect.

## Assessing differences across groups
#### Page 8

```{r group_diff}
hints5_3_grp <- as_survey_rep(hints5_3, weights = "nwgt0",
                              repweights = paste0("nwgt", 1:150),
                              type = "JK1", scale = 49/50, mse = TRUE)
```

### Assessing differences across groups, binary outcome
##### Page 8

This model tests for differences across groups in the binary outcome `SeekCancerInfo`.

```{r group_diff_binary}
model01 <- svyglm(SeekCancerInfo ~ Treatment_H5C3, hints5_3_grp,
                  family = quasibinomial)
tidy(model01)
regTermTest(model01, ~ Treatment_H5C3)
```

Note that "Paper only" is the referent because it's the first level of `Treatment_H5C3`. The results indicate no significant differences in `SeekCancerInfo` between the treatment groups.

### Assessing differences across groups, continuous outcome

This model tests for differences across groups in the outcome `GeneralHealth`, considered as continuous.

```{r group_diff_cont}
model02 <- svyglm(GeneralHealth ~ Treatment_H5C3, hints5_3_grp)
tidy(model02)
regTermTest(model02, ~ Treatment_H5C3)

```

### If group differences are found

#### Option A: Use the combined sample and control for group assignment

```{r control_for_group, eval=FALSE}
# predictor1 and predictor2 are placeholders
model00 <-
  svyglm(SeekCancerInfo ~ Treatment_H5C3 + predictor1 + predictor2 + ...,
                 hints5_3_grp, family = quasibinomial)
```

#### Option B: Use one group only, without accounting for group differences

```{r one_group_only, eval=FALSE}
# using "Paper only" group
hints5_3_paper <- hints5_3 %>%
  filter(Treatment_H5C3 == "Paper only") %>%
  as_survey_rep(weights = "TG1_FINWT0",
                repweights = paste0("TG1_FINWT", 1:50,
                type = "JK1", scale = 49/50, mse = TRUE)

model03 <-
  svyglm(SeekCancerInfo ~ Treatment_H5C3 + predictor1 + predictor2 + ...,
                 hints5_3_grp, family = "quasibinomial")

```

### If group differences are not found

#### Option C: Use the combined sample without accounting for group differences

For Cycle 3, if group differences are not found the appropriate weights are the sample weight `TG_all_FINWT0` and replicate weights `TG_all_FINWT1-TG_all_FINWT150`.

```{r combined_sample, eval=FALSE}
hints5_3_all <- hints5_3 %>%
  as_survey_rep(weights = "TG_all_FINWT0",
                repweights = paste0("TG_all_FINWT", 1:50,
                type = "JK1", scale = 49/50, mse = TRUE)

model04 <-
  svyglm(SeekCancerInfo ~ predictor1 + predictor2 + ...,
                 hints5_3_grp, family = quasibinomial)
```

# Merging HINTS survey iterations
##### Pages 52-54

The following code (pp. 52-54) merges HINTS 5, Cycles 2 and 3 assuming no differences between groups.

```{r merge_23}
# Cycle 2, 100 replicate weights
# cycle2a, import Cycle 2 data
# cycle2b, new replicate weights 1 - 50: copy PERSON_FINWT1 - PERSON_FINWT50 as
#   Merged_NWGT1 - Merged_NWGT50
# cycle 2c, new replicate weights 51 - 100: copy PERSON_FINWT0 50 times as
#   Merged_NWGT51 - Merged_NWGT100
# temp_hints5_cycle2: join cycle2a, cycle2b, cycle2c by PersonID
cycle2a <-
  read_sas(unz("HINTS_5_Cycle_2_SAS_03192020.zip",
               "HINTS 5- Cycle 2-SAS-03192020/hints5_cycle2_public.sas7bdat"))
cycle2b <- cycle2a %>%
  select(PersonID, c(paste0("PERSON_FINWT", 0:50))) %>%
  rename_at(paste0("PERSON_FINWT", 0:50), ~ paste0("Merged_NWGT", 0:50))
cycle2c <- map(1:50, ~ select(cycle2b, PersonID, Merged_NWGT0)) %>%
  reduce(inner_join, by = "PersonID") %>%
  set_names(c("PersonID", paste0("Merged_NWGT", 51:100)))
temp_hints5_cycle2 <- list(cycle2a, cycle2b, cycle2c) %>%
  reduce(inner_join, by = "PersonID") %>%
  mutate(survey = 1)
# clean up
rm(cycle2a, cycle2b, cycle2c)

# Cycle 3, 100 replicate weights
# cycle3a, import Cycle 3 data
# cycle3b, new replicate weights 51 - 100: copy TG_all_FINWT1 - TG_all_FINWT50 as
#   Merged_NWGT51 - Merged_NWGT100
# cycle 3c, new replicate weights 1 - 50: copy TG_all_FINWT0 50 times as
#   Merged_NWGT1 - Merged_NWGT50
# temp_hints5_cycle3: join cycle3a, cycle3b, cycle3c by PersonID
cycle3a <- read_sas(unz("HINTS5_Cycle3_SAS_03112020.zip",
                         "hints5_cycle3_public.sas7bdat"))
cycle3b <- cycle3a %>%
  select(PersonID, c(paste0("TG_all_FINWT", 0:50))) %>%
  rename_at(paste0("TG_all_FINWT", 0:50), ~ paste0("Merged_NWGT", c(0, 51:100)))
cycle3c <- map(1:50, ~ select(cycle3b, PersonID, Merged_NWGT0)) %>%
  reduce(inner_join, by = "PersonID") %>%
  set_names(c("PersonID", paste0("Merged_NWGT", 1:50)))
temp_hints5_cycle3  <- list(cycle3a, cycle3b, cycle3c) %>%
  reduce(inner_join, by = "PersonID") %>%
  mutate(survey = 2)
rm(cycle3a, cycle3b, cycle3c)

# stack Cycle 2 and Cycle 3 into a single table and create survey object with
#   the new 100 replicate weights
# resulting table has 8,942 rows (3,504 for Cycle 2 and 5,438 for Cycle 3)
#   and 1,016 columns
#     914 variables appear in either or both of Cycle 2 and Cycle 3
#     102 added variables: cycle ID, final sample weight, 100 replicate
#     weights
mergeHINTSS5C2_HINTS5C3 <-
  bind_rows(temp_hints5_cycle2, temp_hints5_cycle3) %>%
  # reorder columns to put new weights in front
  select(survey, PersonID, num_range("Merged_NWGT", 0:100), everything()) %>%
  # variable to distinguish survey iterations
  mutate(survey = factor(survey, 1:2,
                         c("HINTS 5 Cycle 2", "HINTS 5 Cycle 3"))) %>%
  # survey design object
  as_survey_rep(weights = "Merged_NWGT0",
                repweights = paste0("Merged_NWGT", 1:100),
                type = "JK1", scale = 49/50, mse = TRUE)
rm(temp_hints5_cycle2, temp_hints5_cycle3)

```

The following code replicates the results obtained from the merged data using SAS.

```{r merge_seek_info}
mergeHINTSS5C2_HINTS5C3 %>%
  mutate(SeekHealthInfo =
           factor(SeekHealthInfo, labels = c("NA", "Yes", "No"))) %>%
  group_by(SeekHealthInfo) %>%
  summarize(n = unweighted(n()),
            pct = survey_mean(na.rm = TRUE))  %>%
  mutate(across(starts_with("pct"), percent))
```

```{r merge_ask_ques}
mergeHINTSS5C2_HINTS5C3 %>%
  mutate_at("ChanceAskQuestions", factor,
            labels = c(paste0("NA", 1:5),
                       "Always", "Usually", "Sometimes", "Never")) %>%
  group_by(ChanceAskQuestions) %>%
  summarize(n = unweighted(n()),
            pct = survey_mean(na.rm = TRUE)) %>%
  mutate(across(starts_with("pct"), percent))
```

See the script `HINTS-merge5.R` for details of merging HINTS 5, Cycles 1, 2, and 3

# Conclusion

The R language, with the packages `survey` and `srvyr`, provides an effective means for analyzing HINTS data without the need for expensive proprietary software.