---
title: Randomization Inference Primer for SBST
author:
- name: Nathaniel Higgins
  affiliation: SBST
- name: Jake Bowers
  affiliation: SBST and University of Illinois
date: '`r format(Sys.Date(), "%B %d, %Y")`'
bibliography: references.bib
output:
  html_document:
    theme: cosmo
    toc: yes
  pdf_document:
    keep_tex: true
    number_sections: true
    fig_width: 5
    fig_height: 5
    fig_caption: true
    template: bowersarticletemplate.latex
...
```{r include=FALSE, cache=FALSE}
# Some customization. You can alter or delete as desired (if you know what you are doing).
# knitr settings to control how R chunks work.
## To make the html file do
## render("thefile.Rmd",output_format=html_document(fig_retina=FALSE))
## To make the pdf file do
## render("thefile.Rmd",output_format=pdf_document())
require(knitr)
opts_chunk$set(tidy=TRUE,echo=TRUE,results='markup',strip.white=TRUE,cache=FALSE,highlight=TRUE,width.cutoff=60,size='footnotesize',out.width='.5\\textwidth',message=FALSE,comment=NA,fig.env="figure", fig.align="center",fig.lp="fig:",fig.pos="H")
```
*Purpose of this document:* To explain what *Randomization Inference
(RI)* is, how it can be used by SBST, and how to use RI to calculate
$p$-values for test statistics (and maybe point estimates).^[Chapter 2 of @rosenbaum2010design provides an excellent introduction to this.]
***1. What Randomization Inference does***
Randomization Inference (RI) is used primarily to evaluate hypotheses.
So… a question: how is it different from regular old hypothesis testing?
Answer: RI doesn’t use distributional assumptions to calculate critical
values. The critical values generated by RI are, in this sense, *exact*,
rather than being generated from a model meant to approximate the
sampling distribution of a test statistic (which is what happens in
ordinary hypothesis testing).
So, for instance, RI could be used to test whether or not the mean
$\bar{x}$ of a bunch of numbers is zero. More generally, RI
can be used to test hypotheses about the value of a statistic $\theta$.
***2. How does Randomization Inference work?***
To begin with, RI is only useful when the data we observe is generated
by a process that involves random assignment -- an experiment. RI uses
the physical act of randomization -- and the fact that the randomization
which we observe *could have gone differently* -- to examine a full set
of alternative values for a statistic $\theta$.
***3. An example to illustrate basic mechanics***
The value of an outcome variable $y$ could have been observed for both
treatment and control units, as in the table below.

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         1
2             5         1
3             11        1
4             10        1
5             3         1
6             4         0
7             6         0
8             2         0
9             2         0
10            5         0

The first 5 observations are of treatment units ($T_{i} = 1$) and the
second 5 observations are of control units ($T_{i} = 0$). If we want
to test the null hypothesis that the treatment had no effect on $y$, the
usual way to do this would be to compare $\bar{y}_{1}$ to
$\bar{y}_{0}$ (i.e., evaluate
$\theta = \bar{y}_{1} - \bar{y}_{0}$), where
$\bar{y}_{1}$ represents the mean of $y$ for treatment units
and $\bar{y}_{0}$ represents the mean of $y$ for control
units. To implement this example in R:
```{r}
# y values for treatment units
y1 <- c(4,5,11,10,3)
n1 <- length(y1)
# y values for control units
y0 <- c(4,6,2,2,5)
n0 <- length(y0)
# means
y1.bar <- sum(y1)/n1
y0.bar <- sum(y0)/n0
```
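As a quick check, R's built-in `mean` function returns the same values:

```{r}
## Same means computed with the built-in function
mean(y1)
mean(y0)
```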
In the above example, $\bar{y}_{1} = 6.6$ and
$\bar{y}_{0} = 3.8$. To test $H_{0}:\mu_{1} - \mu_{0} = 0$
(no effect of treatment) using classical statistics, we compare the two
means, $\bar{y}_{1} - \bar{y}_{0} = 2.8$,
calculate the standard error of the difference in means (SEDM), and
create a t-statistic. To calculate the SEDM, first calculate the
standard errors of $\bar{y}_{1}$ and $\bar{y}_{0}$:

$$se_{x} = sd_{x}/\sqrt{n_{x}}$$
Or in R:
```{r}
var1 <- sum((y1 - y1.bar)^2)/(n1-1)
sd1 <- sqrt(var1)
se1 <- sd1/sqrt(n1)
var0 <- sum((y0 - y0.bar)^2)/(n0-1)
sd0 <- sqrt(var0)
se0 <- sd0/sqrt(n0)
```
To evaluate the null hypothesis, we then calculate the SEDM as

$$SEDM = \sqrt{se_{1}^{2} + se_{0}^{2}}$$
Or in R:
```{r}
sedm <- sqrt(se1^2 + se0^2)
```
which yields 1.81659. Take the difference in means of 2.8 and divide by
the SEDM to get the t-statistic of 1.541349. Or, of course, just take a
shortcut and get right to the t-stat with:
```{r}
ttest1 <- t.test(y1,y0)
ttest1
```
However we get there, we do not reject the null hypothesis (at
$\alpha = 0.05$) because the t-statistic is less than 2 (a common rule-of-thumb critical value).
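For reference, the exact two-sided $p$-value and the critical value implied by the Welch degrees of freedom can be pulled directly from the `t.test` object:

```{r}
## Two-sided p-value reported by the Welch t-test
ttest1$p.value
## Critical value for alpha = 0.05 at the Welch degrees of freedom
qt(0.975, df = ttest1$parameter)
```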
Now, how would something similar look using an RI framework? In the RI
framework, we utilize the fact that although the randomization worked
out such that observations 1 through 5 were assigned to treatment and
observations 6 through 10 were assigned to control, this was not the
only way assignment could have gone. It could have, in fact, gone a
whole bunch of other ways. Instead of the treatment group being made up
of observations 1 through 5,
$\{ 1,2,3,4,5\}$,
it could have been made up of observations 1 through 4, plus observation
6 (or 7 or 8 or 9 or 10),
$\{ 1,2,3,4,6\}$
...
$\{ 1,2,3,4,10\}.$
In fact, when separating a dataset with 10 observations into two sets of 5,
you are simply creating one group of 5 (the treatment group, say) by
sampling without replacement, with the other group of 5 (the control
group) being the complement (whatever is not sampled to be a part of the
first group). That is, you are choosing 5 from a group of 10 without
replacement. There are a total of n-choose-k (`choose(10,5)` = 252)
different randomizations that could have occurred. This is the key to
randomization inference.
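A quick check of that count in R:

```{r}
## Number of distinct ways to choose 5 treated units out of 10
choose(10, 5)
```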
***4. Randomization Inference***
To test the null hypothesis of no effect in the RI framework, we don’t
test $H_{0}:\mu_{1} - \mu_{0} = 0$ -- we test the null hypothesis of *no
effect*, i.e. no effect at all on any unit. Testing this null is
different from testing for no effect *on average*. This means that we
test whether or not the data are consistent with a treatment effect of 0
on each unit.
If the effect of treatment on each unit were precisely 0, then the
assignment to treatment was immaterial to generating the $y$ data that
we actually observed. No matter what we did -- no matter which of the
252 possible randomizations we ended up with -- we would have observed
the same $y$ data that we did actually observe (unit 1 would have
produced $y_{1} = 4$, and so on). Our goal now will be to find the value
of the test statistic $\theta$ under each of the 252 possible
assignments to treatment, so that we can see where our test statistic
falls in the distribution. Is it unusual/extreme? Or is it a value in
the heart of the distribution? Let’s find out.
Recall the data as we observed it:

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         1
2             5         1
3             11        1
4             10        1
5             3         1
6             4         0
7             6         0
8             2         0
9             2         0
10            5         0

If instead the assignment to treatment were exactly reversed, as in

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         0
2             5         0
3             11        0
4             10        0
5             3         0
6             4         1
7             6         1
8             2         1
9             2         1
10            5         1

then the statistic of interest $\theta$ (difference in means) would have
taken on a value of -2.8 instead of 2.8. If instead the assignment to
treatment were

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         1
2             5         0
3             11        1
4             10        0
5             3         1
6             4         0
7             6         1
8             2         0
9             2         1
10            5         0

then $\theta$ is equal to 0. What we need is a systematic way to find
each of the possible assignments to treatment, calculate $\theta$ each
time, and save the values of $\theta$ so that we can explore its
distribution. We want to look at each possible assignment of units 1
through 10 to treatment, with the complement of that set being assigned
to control. To do this, we can use the `combinations` function in the
**gtools** package. Invoking `combinations(10,5)` provides us with a matrix of
all 252 combinations, with each row representing the units assigned to
treatment. For example, the first 10 rows of this matrix would look
like:
```{r}
# Create the set of all possible assignments to treatment
library(gtools)
treat <- combinations(10,5)
treat[1:10,]
```
We can use this to easily recalculate $\theta$ for each of the 252
possible assignments, storing the values as we go.
```{r}
# Create a vector y containing all outcomes
y <- c(y1,y0)
# Initialize a vector to store all the thetas
theta <- rep(0,choose(10,5))
# Calculate y1.bar for the first assignment
mean(y[treat[1,]])
# Calculate y0.bar for the first assignment
mean(y[-treat[1,]])
# Calculate theta for the first assignment
mean(y[treat[1,]]) - mean(y[-treat[1,]])
# Now automate that pattern for each of the 252 assignments
for (i in 1:(choose(10,5))){
  theta[i] <- mean(y[treat[i,]]) - mean(y[-treat[i,]])
}
```
Done! Now we can look at the vector theta and examine how 2.8 (the
original statistic that we calculated based on the randomization as it
actually happened) compares, i.e. where it falls in the distribution.
It’s probably easiest to first look at data like this in an ordered
fashion.
```{r}
# Sort from lowest to highest
theta <- theta[order(theta)]
# Calculate exact quantile (1/252) for each possible theta
# How many of the 252 are less than or equal to theta_i?
idx <- 1:(length(theta))
quant <- idx/(length(idx))
# Now find 2.8 on this mapping
which(theta == 2.8)
# Note that there is more than one match (because of ties)
# Look up the quantiles of theta = 2.8
quant[which(theta == 2.8)]
```
So depending on how we deal with ties, anywhere between about 89% and
92% of randomization assignments would have resulted in a theta less
than or equal to 2.8 (the theta we observed), given that the treatment
had no effect. So should we be surprised by a theta of 2.8? Does a theta
of 2.8 provide evidence against the null? Some. If the null is true, 2.8
is unusually large, in the sense that theta is *usually* less than 2.8.
Note that all of this is akin to calculating a p-value for a one-sided
test. To do a two-sided test, we’d look at the absolute values of theta
rather than their actual values.
```{r}
# Calculate absolute values of theta
abs.theta <- abs(theta)
# Re-sort from lowest to highest
abs.theta <- abs.theta[order(abs.theta)]
# Look up the quantiles of |theta| = 2.8 (the two-sided analogue of the step above)
quant[which(abs.theta == 2.8)]
```
So, because of ties, there are actually twenty different random
assignments that all lead to a $|\theta|$ of 2.8. This makes a theta of 2.8
seem a little less unlikely, but it's still unlikely (at least 77% of the
$|\theta|$ values are less than 2.8 when there is no treatment effect).
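Equivalently, rather than reading quantiles off the sorted vectors, we can compute the randomization $p$-values directly as the proportion of assignments at least as extreme as the one we observed, using the `theta` vector built above:

```{r}
## One-sided p-value: proportion of assignments with theta >= 2.8
mean(theta >= 2.8)
## Two-sided p-value: proportion of assignments with |theta| >= 2.8
mean(abs(theta) >= 2.8)
```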
# What about larger experiments?
Randomization inference reflects the design of the experiment. Because of this, it turns out to have good properties even when sample sizes are large. For example, @imbens2005robust show that 2SLS has terrible properties when the instrumental variable is a weak instrument, even with a sample size of about 500,000.

With a sample that large, however, you could not calculate $p$-values by enumerating the exact randomization distribution. Rather, you would want to sample from that distribution. Here, for example, we use only 100 samples from the exact distribution that we enumerated above. We also show a tool that makes this slightly easier.
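As a point of reference, here is a minimal base-R sketch of that sampling idea before we turn to RItools (the object name `theta.sim` and the seed are just for illustration):

```{r}
set.seed(20150101) # arbitrary seed, for reproducibility only
## Draw 100 random assignments of 5 treated units out of 10 and recompute theta for each
theta.sim <- replicate(100, {
  treated <- sample(1:10, 5)
  mean(y[treated]) - mean(y[-treated])
})
## Approximate one-sided p-value from the sampled randomization distribution
mean(theta.sim >= 2.8)
```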
First, install the development version of RItools ("Randomization inference tools") [@bowers2009ritools]. (The following installs the development version of RItools and devtools into a local R library rather than your global R library.)
```{r}
if(!dir.exists("libraries")){
  dir.create("libraries")
}
.libPaths("libraries") ## make the local directory the default place for libraries
## Install devtools into the local library if it is not already installed
installedpackages <- installed.packages(lib.loc="libraries")
if(!any(installedpackages[,"Package"]=="devtools")) {
  install.packages("devtools", repos="http://cran.rstudio.com")
}
library(devtools) ## install_github() below needs devtools loaded even if it was already installed
## Install the RItools development version if it is not already installed
if(!any(installedpackages[,"Package"]=="RItools")) {
  install_github("markmfredrickson/RItools@randomization-distribution")
}
library(RItools)
```
Now, remember that randomization inference requires a description of the experimental design, a hypothesis, and a test statistic.
```{r}
## the design is simple: complete random assignment of 5 out of 10 to treatment
## Create a treatment assignment column
Z <- c(rep(1,5), rep(0,5))
## For now, with the development package, create a blocking indicator that is all 1.
b <- rep(1,10)
randomassignment <- simpleRandomSampler(z=Z, b=b)
## Notice that randomassignment is a function that produces draws from the set of possible treatments:
randomassignment(5)$samples
## Let's use the mean difference test statistic for simplicity.
## A bunch of these are hard coded in the package, but I'll write one out here.
meandiffTZ <- function(ys, z) {
  mean(ys[z==1]) - mean(ys[z==0])
}
meandifftestExact <- RItest(y=y,
                            z=Z,
                            test.stat=meandiffTZ,
                            sampler = randomassignment)
## This provides a one-sided p-value by default
meandifftestExact
## Here is the two-sided p-value
2 * min( meandifftestExact, 1 - meandifftestExact)
```
Now, let's use only 100 draws from the possible 252. And, to show that this method comes with a bit of simulation error, I do it multiple times and calculate the standard deviation of the $p$-values. Notice that if you drew 10,000 times, your simulation error would be very small (and you could approximate it using $\sqrt{p(1-p)/B}$, where $B$ is the number of simulations). Notice that sampling in this way provides an unbiased estimate of the exact $p$-value. Also notice that simulation error can be large --- a difference of two simulation standard errors (about $2 \times .03$ here) in the $p$-value can easily push you over the magic .05 line. So, in this case, we would much prefer to use the exact distribution. That said, if we had drawn 10,000 simulations from a larger study (say, with $N=100$), then we could have had simulation error of roughly .003.
```{r ritest, cache=TRUE}
set.seed(1234567)
meandifftestSim <- RItest(y=y,
                          z=Z,
                          test.stat=meandiffTZ,
                          sampler = randomassignment,
                          samples = 100)
meandifftestSim
simedps <- replicate(1000, RItest(y=y,
                                  z=Z,
                                  test.stat=meandiffTZ,
                                  sampler = randomassignment,
                                  samples = 100))
summary(simedps)
sd(simedps)
## Approximate and quick formula for error based on simulation.
## sqrt( ( p * (1-p)) / number of sims )
sqrt( .11*(1-.11) / 100 )
```
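For comparison, the same back-of-the-envelope formula with $B = 10{,}000$ draws gives a simulation error of roughly .003:

```{r}
## Approximate simulation error with 10,000 draws instead of 100
sqrt( .11*(1-.11) / 10000 )
```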
# References