generated from rstudio/bookdown-demo
-
Notifications
You must be signed in to change notification settings - Fork 1
/
03-Using-R.Rmd
563 lines (357 loc) · 20.4 KB
/
03-Using-R.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
# Using R {#using_r}
The R environment for Statistical Computing is a free, open source environment for managing and visualizing data and performing analyses. Its capabilities can be augmented by downloading software packages from repositories, such as the [Comprehensive R Archival Network (CRAN)](https://cran.r-project.org/), and more recently, [GitHub](https://github.com/). [Rstudio](https://rstudio.com/) is a powerful development environment that makes it easier to use R, expands its functionality, and allows the integration of other tools, such as Git, into an analyst's workflow. If you are new to R or would like additional resources on using R in practice, see [the appendix](#appendix_b_using_r).
In addition to R and Rstudio, we will also need to install [RTools](https://cran.r-project.org/bin/windows/Rtools/), also known as the R toolchain. These are software tools for compiling software packages. Some of the packages we use may have to be compiled after being downloaded, and RTools provides all the software needed to do so. CRAN provides instruction on which version of RTools you should use, depending on your version of R, and how to install it.
This section is primarily about how to use R and associated software packages. While this does include demonstration of how to implement statistical models and hypothesis tests, a larger discussion of their appropriate use and interpretation will take place in later sections.
--------------------------------------------------------------------------------
## Installing Packages
Once we have Rtools installed, we are ready to install the required packages from CRAN and Github. The `install.packages()` function in R can be used to install packages from CRAN, R-Forge, BioC, and other repositories via the command line. In Rstudio, users can also use the 'Packages' tab to see which packages are installed, their current version, and whether or not they have been loaded into the workspace for use. If you have installed the `devtools` package, `devtools::install_github()` can be used to install packages from Github or other version control repositories.
If you would like to explore which R packages are available for a given application or research area, the [CRAN Task Views](https://cran.r-project.org/web/views/), including the [CRAN Clinical Trials taskview](https://cran.r-project.org/web/views/ClinicalTrials.html), are worth exploring.
--------------------------------------------------------------------------------
### Packages from CRAN
Below are some of the packages that we will use from CRAN, along with a brief description of their purposes:
- [devtools](https://cloud.r-project.org/web/packages/devtools/index.html) - A suite of tools for R package development
- [cobalt](https://cran.r-project.org/web/packages/cobalt/index.html) - Creating tables and plots for assessing covariate balance
- [knitr](https://cran.r-project.org/web/packages/knitr/index.html) - Tools for literate programming: including code in reproducible reports
- [margins](https://cran.r-project.org/web/packages/margins/index.html) - Calculating marginal or partial effects from regression models
- [mgcv](https://cran.r-project.org/web/packages/mgcv/index.html) - Fitting generalized additive models
- [sandwich](https://cran.r-project.org/web/packages/sandwich/index.html) Robust covariance matrix estimation
- [speff2trial](https://cran.r-project.org/web/packages/speff2trial/index.html) Semiparametric estimation of the marginal hazard ratio using inverse probability weighting
- [survminer](https://cran.r-project.org/web/packages/survminer/index.html) - Creating plots of time-to-event data
- [survRM2](https://cran.r-project.org/web/packages/survRM2/index.html) - Calculating the restricted mean survival time (RMST) with and without covariate adjustment.
- [table1](https://cran.r-project.org/web/packages/table1/index.html) - Creating simple tabulations in aggregate and by treatment arm
- [tidyverse](https://www.tidyverse.org/packages/) - An ecosystem of packages for working with data
```{r install-packages, eval = FALSE}
required_packages <-
c("devtools",
"cobalt",
"knitr",
"margins",
"mgcv",
"sandwich",
"speff2trial"
"survminer",
"survRM2",
"table1",
"tidyverse"
)
packages_to_install <-
setdiff(
x = required_packages,
y = installed.packages(.Library)[, "Package"]
)
install.packages(packages_to_install)
```
--------------------------------------------------------------------------------
### Packages from Github
We will also use the following packages from Github:
- [simul](https://github.com/nt-williams/simul) Inference based on Efficient Influence Function and Multiplier Bootstrap
- [adjrct](https://github.com/nt-williams/adjrct) Doubly Robust, Efficient Estimators for Survival and Time to Event Outcomes
```{r install-github, eval = FALSE}
devtools::install_github("nt-williams/simul")
devtools::install_github("nt-williams/adjrct")
```
### Loading Packages into Workspace
Once a package has been successfully installed, we use the `library()` command to load it into the workspace for use:
```{r load-packages}
library(tidyverse) # Data manipulation: dplyr, tidyr
library(table1) # Creation of Summary tables
```
It is possible that different packages contain a function of the same name. For example, the `table1` package contains the function `table1()`: there are other packages that also have a function named `table1()`, such as the [`furniture`](https://cran.r-project.org/web/packages/furniture/index.html) package. If both of these packages are loaded, it can cause confusion about which version should be used when `table1()` is called. R will warn when such conflicts can arise, but as a best practice, it is useful to specify the package and function as as follows: `table1::table1()`. This makes code easier to use and potentially reduces ambiguity.
--------------------------------------------------------------------------------
## Loading the Data: MISTIE-III {#Load_MISTIE}
The [data dictionary](#mistie_iii) and more information about the MISTIE III study were presented earlier [@Hanley2019]. All of the data needed for the examples are available on the web. To load these, we create a file connection to the URL using the `url()` function, and then use `read.csv()` to read in the comma separated values (CSV) file. To load data from a local file path, the `file.path()` function is useful for creating file paths. We can start by loading the simulated MISTIE III data. Once we have loaded the full data, we can use `dplyr::slice` to take the first 500 rows.
```{r Load-MISTIE-III-Data}
data_url <-
paste0("https://github.com/jbetz-jhu/CovariateAdjustmentTutorial",
"/raw/main/Simulated_MISTIE_III_v1.2.csv")
sim_miii_full <- read.csv(file = url(data_url))
# Read in data: Recast categorical variables as factors
sim_miii_full <-
sim_miii_full %>%
dplyr::tibble() %>%
dplyr::mutate(
# Convert variables from binary indicators to labeled categorical variables
male =
factor(
x = male,
levels = 0:1,
labels = c("0. Female", "1. Male")
),
across(
.cols =
all_of(
x = c("hx_cvd", "hx_hyperlipidemia",
"on_anticoagulants", "on_antiplatelets")
),
.fns = function(x) factor(x, levels = 0:1, labels = c("0. No", "1. Yes"))
),
# Convert GCS and MRS variables from character data to categorical variables
across(
.cols = starts_with("gcs") | starts_with("mrs"),
.fns = factor
),
ich_location =
factor(
x = ich_location,
levels = c("Deep", "Lobar")
),
arm =
factor(
x = arm,
levels = c("medical", "surgical")
),
tx = 1*(arm == "surgical")
)
# Take the first 500 rows
sim_miii <-
sim_miii_full %>%
dplyr::slice(1:500)
```
Other useful functions include:
- `head()`/`tail()` - Looking at the first $n$ rows of a dataset
- `nrow()`/`ncol()` - Counting the rows/columns of a dataset
- `colnames()`/`rownames()` - Getting the row/column names of a dataset
```{r Useful-Functions}
head(sim_miii)
nrow(sim_miii)
ncol(sim_miii)
colnames(sim_miii)
```
--------------------------------------------------------------------------------
## Assessing Baseline Balance
While randomization will tend to produce treatment groups that are similar on both measured and unmeasured factors, there will always be some degree of imbalance between groups in some characteristics. It is important to remember that these differences only represent confounding if groups are imbalanced on variables that are associated with the outcome. To assess the degree of imbalance, we can tabulate characteristics by treatment arm, and compute standardized differences to get a scale-free measure of the magnitude of imbalance.
```{r mistie-iii-table-1}
table1(
~ age + male +
on_antiplatelets + ich_location + ich_s_volume + ivh_s_volume +
gcs_category | arm,
data = sim_miii
)
```
This allows us to compare the means, medians, frequencies, and ranges of variables between groups.
```{r mistie-iii-standardized-differences}
library(cobalt)
cobalt::bal.tab(
x =
# Only tabulate baseline variables
sim_miii %>%
dplyr::select(
dplyr::all_of(
x = c("age", "male", "on_antiplatelets",
"ich_location", "ich_s_volume", "ivh_s_volume", "gcs_category")
)
),
treat = sim_miii$arm,
# Compute standardized differences for both binary and continuous variables
binary = "std",
continuous = "std"
)
```
--------------------------------------------------------------------------------
## Visualizing Data
One of the many strength of R is the powerful and flexible data visualization tools in its software ecosystem: see [the R Graph Gallery](https://r-graph-gallery.com/) for some examples. The [`ggplot2`](https://ggplot2.tidyverse.org/) package, and some related packages like [survminer](https://cran.r-project.org/web/packages/survminer/index.html), are useful for assessing baseline balance and visualizing outcome data.
For example, we may want to assess the cumulative distribution of a baseline covariate instead of just checking the summary statistics.
```{r Plot-eCDF-Age}
library(ggplot2)
ggplot(
data = sim_miii,
aes(
x = age
)
) +
stat_ecdf(
alpha = 0.6,
) +
stat_ecdf(
aes(color = arm),
alpha = 0.6,
) +
theme_bw()
```
Or we may want to create a plot of the Kaplan-Meier estimate of the survival function (see the [estimands](#estimands) section for its definition, and the [time-to-event](#timetoevent) section for further discussion).
```{r Plot-KM-MISTIE-III}
library(survival)
library(survminer)
# Create a 'survival object' from event times and indicators
miii_surv <-
with(sim_miii,
survival::Surv(
time = days_on_study,
event = died_on_study
)
)
# Use survfit to calculate survival data
time_to_death_km <-
survival::survfit(
formula = miii_surv ~ arm,
data = sim_miii
)
# Create the Kaplan-Meier Plot
survminer::ggsurvplot(
fit = time_to_death_km,
conf.int = TRUE,
risk.table = TRUE,
xlab = "Days",
ylab = "Survival probability"
)
```
From this plot, we can see that most of the mortality occurs soon after randomization, with greater mortality in the medical management arm early in the trial. After the first 90 days, the rate of events decreases in both arms.
--------------------------------------------------------------------------------
## Fitting Regression Models
While covariate adjustment does involve regression modeling, an in-depth discussion of regression modeling is not needed for the purposes of implementing covariate adjustment. For those wanting a more in-depth presentation of fitting generalized linear models (GLMs) in R see Dobson & Barnett [-@DobsonBarnett2018].
### Generalized Linear Model
Fitting a GLM in R is done using the `glm()` function. For example if we wanted to model the probability of being assigned to the surgical arm in MISTIE III by an individual's age, ICH volume, and IVH volume, the code is as follows:
```{r Binomial-GLM-MISTIE-III}
pr_mis_glm <-
glm(
formula =
tx ~ age + ich_s_volume + ivh_s_volume,
data = sim_miii,
family = binomial(link = "logit")
)
```
Once the model has been fit, we can use it for creating summary tables, calculating confidence intervals, or generating fitted values for a new dataset:
```{r Using-Fitted-Binomial-GLM-MISTIE-III}
# Produce GLM Summary Table
summary(pr_mis_glm)
# Calculate Confidence Intervals for Coefficients
confint(pr_mis_glm)
# Calculate fitted probabilities for new data:
# 1. 65 years old, 30 mL ICH, 0 mL IVH
# 2. 70 years old, 50 mL ICH, 15 mL IVH
predict(
object = pr_mis_glm,
newdata =
data.frame(
age = c(65, 70),
ich_s_volume = c(30, 50),
ivh_s_volume = c(0, 15)
),
type = "response"
)
```
### Logrank Test and Cox Proportional Hazards Model
For time-to-event outcomes, such as mortality, the logrank test and Cox Proportional Hazards (PH) model are commonly used analytic approaches. Using the survival object `miii_surv` created earlier using `survival::Surv()`, we can pass this object to other functions to perform the logrank test (using `survival::survdiff`) and fit the Cox PH model (using `survival::coxph`).
```{r Logrank-MISTIE-III}
mortality_logrank <-
survival::survdiff(
formula = miii_surv ~ arm,
data = sim_miii
)
mortality_logrank
```
For the Cox PH model, the `ties = "efron"` argument specifies how tied survival times are addressed, and the `robust = TRUE` computes robust estimates of the covariance matrix of regression coefficients:
```{r Cox-PH-MISTIE-III}
mortality_cox <-
survival::coxph(
formula = miii_surv ~ arm,
data = sim_miii,
ties = "efron",
robust = TRUE
)
summary(mortality_cox)
```
This model assumes that the ratio of the rates of events between the treatment and control arm is approximately constant over time. We can assess using the `survival::cox.zph` function:
```{r Cox-PH-Test-MISTIE-III}
# Test Proportionality Assumption: Schoenfeld Residuals
cox.zph(fit = mortality_cox)
# Plot smoothed residuals
plot(cox.zph(fit = mortality_cox))
abline(h = 0, col = "red")
```
From the plot, we can see that the hazard ratio is initially negative, but increases towards zero as time goes on. It seems as if the treatment has a beneficial effect on mortality for a few weeks after randomization, which then diminishes. This is a violation of the proportional hazards assumption, which is further discussed in the chapter on [estimands](#survival_estimands).
--------------------------------------------------------------------------------
## Variance Estimation
### Robust Standard Errors
Most of the standard errors reported in software, such as those returned by `vcov()`, are model-based estimates of the standard error, which assume that the model is correctly specified. Robust or "Sandwich" standard errors can be used to obtain a consistent estimate of the standard error in such cases. Note that robust estimates of standard errors are different from robust estimates of the regression coefficients themselves.
The `sandwich::vcovHC()` function can be used to obtain different types of robust standard errors:
```{r HC-Covariance-Estimation}
library(sandwich)
# Model-based standard errors
vcov(object = pr_mis_glm)
# Robust standard errors
sandwich::vcovHC(x = pr_mis_glm, type = "HC3")
```
These can be passed as an argument to other functions for computing confidence intervals for contrasts and marginal means.
--------------------------------------------------------------------------------
### Bootstrap Estimator
The bootstrap procedure uses resampling to obtain an estimate of the variance of the sampling distribution of an estimator. In R, this is done using the `boot` package, which is part of base R.
First, we need to write a function to produce the statistic of interest. In this case, we will bootstrap the Mann-Whitney U statistic: see the [estimands](#ordinal_estimands) for more information on this estimand. The first argument to this function must be the data, and the second argument should be a vector of indices for our bootstrap sample, and any other arguments can be supplied thereafter.
```{r Boot-Example-Step-1-Create-Function}
# 1. Write a function that produces the test statistic:
wilcox_to_auc <-
function(data, indices = NULL, formula){
# Input data must be a data.frame
if(!all(class(data) == "data.frame")){
stop("`data` must be a data.frame: use `as.data.frame()` for a tibble.")
}
# If bootstrap indices not supplied, use entire dataset
if(is.null(indices)) indices <- 1:nrow(data)
# Extract Outcome/Treatment from Formula
outcome <- all.vars(update(formula, . ~ 0))
treatment <- all.vars(update(formula, 0 ~ .))
stopifnot(length(treatment) == 1)
# Convert outcome to numeric using levels: Assumes levels are ordered
if(!is.numeric(data[, outcome])){
data[, outcome] <- as.numeric(data[, outcome])
}
# Run Wilcoxon Rank Sum on data using the bootstrap indices
wrst_result <-
wilcox.test(
formula = formula,
data = data[indices,]
)
# Compute AUC statistic
return(wrst_result$statistic/prod(table(data[indices, treatment])))
}
```
Now, we call this function using `boot`, passing any arguments required: `boot` will resample the data, and pass the indices to the function, and evaluate the result. From the result, we can calculate the standard error or estimate of the bootstrap distribution:
```{r Boot-Example-Step-2-Call-Boot, eval = FALSE}
library(boot)
# Perform 10,000 bootstraps of data
n_boot_samples <- 10000
mrs_365d_auc_boot <-
boot::boot(
data = as.data.frame(sim_miii),
statistic = wilcox_to_auc,
R = n_boot_samples,
formula = mrs_365d ~ arm
)
```
```{r Load-Boot-Package, echo = FALSE}
library(boot)
```
Once the bootstrap procedure is complete, we can view and summarize the results.
```{r Boot-Example-Step-3-Use-Results, fig.height = 0.6*fig_w}
# Bootstrap Standard Error
sd(mrs_365d_auc_boot$t[,1])
# Bootstrap Variance (SE^2)
var(mrs_365d_auc_boot$t[,1])
# Produce a histogram and quantile-quantile plot
plot(mrs_365d_auc_boot)
```
Confidence intervals can be computed using `boot::boot.ci`:
```{r Boot-Example-Compute-CI, eval = FALSE}
mrs_365d_auc_boot_ci <-
boot::boot.ci(
boot.out = mrs_365d_auc_boot,
conf = 0.95,
type = "bca",
index = 1
)
```
The results can be printed out or extracted for use elsewhere:
```{r Boot-Example-Report-CI}
# boot.ci result
mrs_365d_auc_boot_ci
# Extract the lower/upper confidence limits
mrs_365d_auc_boot_ci_result <-
tail(x = mrs_365d_auc_boot_ci$bca[1,], n = 2)
# Print out result
mrs_365d_auc_boot_ci_result
```
The most straightforward way to calculate a p-value using the bootstrap involves finding the smallest value of $\alpha$ at which the $100(1 - \alpha)\%$ confidence interval does not contain the null value, resulting in a rejection of the null hypothesis. While not implemented in `boot`, this can be done using a binary search algorithm for the values of $\alpha$ typically used in practice: $0.05 \ge \alpha \ge 0.0001$.
--------------------------------------------------------------------------------
## Addressing Multiplicity
For pivotal trials, it is often required to have an analysis plan that control the probability of rejecting at least one truly null hypothesis, also known as the familywise type I error rate (FWER). When there are more than one primary comparison of interest, due to having multiple pairwise treatment contrasts, endpoints, sequential analyses, or timepoints, a strategy must be used to control the FWER at a pre-specified level.
A later chapter is devoted to [Group Sequential Designs](#groupsequential), a methodology for doing pre-planned analyses that allow stopping a randomized trial early for success or futility. When all hypotheses are tested simultaneously, the [multcomp](https://cran.r-project.org/web/packages/multcomp/index.html) package allows for decisions and confidence intervals that appropriately control the FWER. When there is a specific ordering to how hypotheses are tested, the [gMCP](https://cran.r-project.org/web/packages/gMCP/index.html) package implements graphical methods for addressing multiple comparisons. We will focus mainly on controlling the FWER for one pairwise treatment comparison on a single endpoint, with one or more pre-planned analyses.