---
title: Randomization Inference Primer for SBST
author:
- name: Nathaniel Higgins
  affiliation: SBST
- name: Jake Bowers
  affiliation: SBST and University of Illinois
date: '`r format(Sys.Date(), "%B %d, %Y")`'
bibliography: references.bib
output:
  html_document:
    theme: cosmo
    toc: yes
  pdf_document:
    keep_tex: true
    number_sections: true
    fig_width: 5
    fig_height: 5
    fig_caption: true
    template: bowersarticletemplate.latex
...
```{r include=FALSE, cache=FALSE}
# Some customization. You can alter or delete as desired (if you know what you are doing).
# knitr settings to control how R chunks work.
## To make the html file do
## render("thefile.Rmd",output_format=html_document(fig_retina=FALSE))
## To make the pdf file do
## render("thefile.Rmd",output_format=pdf_document())
require(knitr)
opts_chunk$set(tidy=TRUE,echo=TRUE,results='markup',strip.white=TRUE,cache=FALSE,highlight=TRUE,width.cutoff=60,size='footnotesize',out.width='.5\\textwidth',message=FALSE,comment=NA,fig.env="figure", fig.align="center",fig.lp="fig:",fig.pos="H")
```
*Purpose of this document:* To explain what *Randomization Inference
(RI)* is, how it can be used by SBST, and how to use RI to calculate
$p$-values for test statistics (and maybe point estimates).^[Chapter 2 of @rosenbaum2010design provides an excellent introduction to this.]
***1. What Randomization Inference does***
Randomization Inference (RI) is used primarily to evaluate hypotheses.
So… a question: how is it different from regular old hypothesis testing?
Answer: RI doesn’t use distributional assumptions to calculate critical
values. The critical values generated by RI are, in this sense, *exact*,
rather than being generated from a model meant to approximate the
sampling distribution of a test statistic (which is what happens in
ordinary hypothesis testing).
So, for instance, RI could be used to test whether or not the mean
$\bar{x}$ of a bunch of numbers is zero. More generally, RI
can be used to test hypotheses about the value of a statistic $\theta$.
***2. How does Randomization Inference work?***
To begin with, RI is only useful when the data we observe is generated
by a process that involves random assignment -- an experiment. RI uses
the physical act of randomization -- and the fact that the randomization
which we observe *could have gone differently* -- to examine a full set
of alternative values for a statistic $\theta$.
***3. An example to illustrate basic mechanics***
The value of an outcome variable $y$ could have been observed for both
treatment and control units, as in the table below.

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         1
2             5         1
3             11        1
4             10        1
5             3         1
6             4         0
7             6         0
8             2         0
9             2         0
10            5         0

The first 5 observations are of treatment units ($T_{i} = 1$) and the
second 5 observations are of control units ($T_{i} = 0$). If we want
to test the null hypothesis that the treatment had no effect on $y$, the
usual way to do this would be to compare $\bar{y}_{1}$ to
$\bar{y}_{0}$ (i.e., evaluate
$\theta = \bar{y}_{1} - \bar{y}_{0}$), where
$\bar{y}_{1}$ represents the mean of $y$ for treatment units
and $\bar{y}_{0}$ represents the mean of $y$ for control
units. To implement this example in R:
```{r}
# y values for treatment units
y1 <- c(4,5,11,10,3)
n1 <- length(y1)
# y values for control units
y0 <- c(4,6,2,2,5)
n0 <- length(y0)
# means
y1.bar <- sum(y1)/n1
y0.bar <- sum(y0)/n0
```
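As a quick check, R's built-in `mean` function returns the same values:

```{r}
## Same means computed with the built-in function
mean(y1)
mean(y0)
```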
In the above example, $\bar{y}_{1} = 6.6$ and
$\bar{y}_{0} = 3.8$. To test $H_{0}:\mu_{1} - \mu_{0} = 0$
(no effect of treatment) using classical statistics, we compare the two
means, $\bar{y}_{1} - \bar{y}_{0} = 2.8$,
calculate the standard error of the difference in means (SEDM), and
create a t-statistic. To calculate the SEDM, first calculate the
standard errors of $\bar{y}_{1}$ and $\bar{y}_{0}$:

$$se_{x} = sd_{x}/\sqrt{n_{x}}$$
Or in R:
```{r}
var1 <- sum((y1 - y1.bar)^2)/(n1-1)
sd1 <- sqrt(var1)
se1 <- sd1/sqrt(n1)
var0 <- sum((y0 - y0.bar)^2)/(n0-1)
sd0 <- sqrt(var0)
se0 <- sd0/sqrt(n0)
```
To evaluate the null hypothesis, we then calculate the SEDM as

$$SEDM = \sqrt{se_{1}^{2} + se_{0}^{2}}$$
Or in R:
```{r}
sedm <- sqrt(se1^2 + se0^2)
```
which yields 1.81659. Take the difference in means of 2.8 and divide by
the SEDM to get the t-statistic of 1.541349. Or, of course, just take a
shortcut and get right to the t-stat with:
```{r}
ttest1 <- t.test(y1,y0)
ttest1
```
However we get there, we do not reject the null hypothesis (at
$\alpha = 0.05$) because the t-statistic is less than 2 (a common rule-of-thumb critical value).
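For reference, the exact two-sided $p$-value and the critical value implied by the Welch degrees of freedom can be pulled directly from the `t.test` object:

```{r}
## Two-sided p-value reported by the Welch t-test
ttest1$p.value
## Critical value for alpha = 0.05 at the Welch degrees of freedom
qt(0.975, df = ttest1$parameter)
```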
Now, how would something similar look using an RI framework? In the RI
framework, we utilize the fact that although the randomization worked
out such that observations 1 through 5 were assigned to treatment and
observations 6 through 10 were assigned to control, this was not the
only way assignment could have gone. It could have, in fact, gone a
whole bunch of other ways. Instead of the treatment group being made up
of observations 1 through 5,
$\{ 1,2,3,4,5\}$,
it could have been made up of observations 1 through 4, plus observation
6 (or 7 or 8 or 9 or 10),
$\{ 1,2,3,4,6\}$
...
$\{ 1,2,3,4,10\}.$
In fact, when separating a dataset with 10 observations into two sets of 5,
you are simply creating one group of 5 (the treatment group, say) by
sampling without replacement, with the other group of 5 (the control
group) being the complement (whatever is not sampled to be a part of the
first group). That is, you are choosing 5 from a group of 10 without
replacement. There are a total of n-choose-k (`choose(10,5)` = 252)
different randomizations that could have occurred. This is the key to
randomization inference.
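A quick check of that count in R:

```{r}
## Number of distinct ways to choose 5 treated units out of 10
choose(10, 5)
```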
***4. Randomization Inference***
To test the null hypothesis of no effect in the RI framework, we don’t
test $H_{0}:\mu_{1} - \mu_{0} = 0$ -- we test the null hypothesis of *no
effect*, i.e. no effect at all on any unit. Testing this null is
different from testing for no effect *on average*. This means that we
test whether or not the data are consistent with a treatment effect of 0
on each unit.
If the effect of treatment on each unit were precisely 0, then the
assignment to treatment was immaterial to generating the $y$ data that
we actually observed. No matter what we did -- no matter which of the
252 possible randomizations we ended up with -- we would have observed
the same $y$ data that we did actually observe (unit 1 would have
produced $y_{1} = 4$, and so on). Our goal now will be to find the value
of the test statistic $\theta$ under each of the 252 possible
assignments to treatment, so that we can see where our test statistic
falls in the distribution. Is it unusual/extreme? Or is it a value in
the heart of the distribution? Let’s find out.
Recall the data as we observed it:

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         1
2             5         1
3             11        1
4             10        1
5             3         1
6             4         0
7             6         0
8             2         0
9             2         0
10            5         0

If instead the assignment to treatment were exactly reversed, as in

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         0
2             5         0
3             11        0
4             10        0
5             3         0
6             4         1
7             6         1
8             2         1
9             2         1
10            5         1

then the statistic of interest $\theta$ (difference in means) would have
taken on a value of -2.8 instead of 2.8. If instead the assignment to
treatment were

Unit Number   $y_{i}$   $T_{i}$
------------- --------- ---------
1             4         1
2             5         0
3             11        1
4             10        0
5             3         1
6             4         0
7             6         1
8             2         0
9             2         1
10            5         0

then $\theta$ is equal to 0. What we need is a systematic way to find
each of the possible assignments to treatment, calculate $\theta$ each
time, and save the values of $\theta$ so that we can explore its
distribution. We want to look at each possible assignment of units 1
through 10 to treatment, with the complement of that set being assigned
to control. To do this, we can use the `combinations` function in the
**gtools** package. Invoking `combinations(10,5)` provides us with a matrix of
all 252 combinations, with each row representing the units assigned to
treatment. For example, the first 10 rows of this matrix would look
like:
```{r}
# Create the set of all possible assignments to treatment
library(gtools)
treat <- combinations(10,5)
treat[1:10,]
```
We can use this to easily recalculate $\theta$ for each of the 252
possible assignments, storing the values as we go.
```{r}
# Create a vector y containing all outcomes
y <- c(y1,y0)
# Initialize a vector to store all the thetas
theta <- rep(0,choose(10,5))
# Calculate y1.bar for the first assignment
mean(y[treat[1,]])
# Calculate y0.bar for the first assignment
mean(y[-treat[1,]])
# Calculate theta for the first assignment
mean(y[treat[1,]]) - mean(y[-treat[1,]])
# Now automate that pattern for each of the 252 assignments
for (i in 1:(choose(10,5))){
  theta[i] <- mean(y[treat[i,]]) - mean(y[-treat[i,]])
}
```
Done! Now we can look at the vector theta and examine how 2.8 (the
original statistic that we calculated based on the randomization as it
actually happened) compares, i.e. where it falls in the distribution.
It’s probably easiest to first look at data like this in an ordered
fashion.
```{r}
# Sort from lowest to highest
theta <- theta[order(theta)]
# Calculate exact quantile (1/252) for each possible theta
# How many of the 252 are less than or equal to theta_i?
idx <- 1:(length(theta))
quant <- idx/(length(idx))
# Now find 2.8 on this mapping
which(theta == 2.8)
# Note that there is more than one match (because of ties)
# Look up the quantiles of theta = 2.8
quant[which(theta == 2.8)]
```
So depending on how we deal with ties, anywhere between about 89% and
92% of randomization assignments would have resulted in a theta less
than or equal to 2.8 (the theta we observed), given that the treatment
had no effect. So should we be surprised by a theta of 2.8? Does a theta
of 2.8 provide evidence against the null? Some. If the null is true, 2.8
is unusually large, in the sense that theta is *usually* less than 2.8.
Note that all of this is akin to calculating a p-value for a one-sided
test. To do a two-sided test, we’d look at the absolute values of theta
rather than their actual values.
```{r}
# Calculate absolute values of theta
abs.theta <- abs(theta)
# Re-sort from lowest to highest
abs.theta <- abs.theta[order(abs.theta)]
# Look up the quantiles of |theta| = 2.8 (the two-sided analogue of the step above)
quant[which(abs.theta == 2.8)]
```
So, because of ties, there are actually twenty different random
assignments that all lead to a $|\theta|$ of 2.8. This makes a theta of 2.8
seem a little less unlikely, but it's still unlikely (at least 77% of the
$|\theta|$ values are less than 2.8 when there is no treatment effect).
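Equivalently, rather than reading quantiles off the sorted vectors, we can compute the randomization $p$-values directly as the proportion of assignments at least as extreme as the one we observed, using the `theta` vector built above:

```{r}
## One-sided p-value: proportion of assignments with theta >= 2.8
mean(theta >= 2.8)
## Two-sided p-value: proportion of assignments with |theta| >= 2.8
mean(abs(theta) >= 2.8)
```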
# What about larger experiments?
Randomization inference reflects the design of the experiment. Because of this, it turns out to have good properties even when sample sizes are large. For example, @imbens2005robust show that 2SLS has terrible properties when the instrumental variable is a weak instrument, even with a sample size of about 500,000.

With a sample that large, however, you could not calculate $p$-values by enumerating the exact randomization distribution. Rather, you would want to sample from that distribution. Here, for example, we use only 100 samples from the exact distribution that we enumerated above. We also show a tool that makes this slightly easier.
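As a point of reference, here is a minimal base-R sketch of that sampling idea before we turn to RItools (the object name `theta.sim` and the seed are just for illustration):

```{r}
set.seed(20150101) # arbitrary seed, for reproducibility only
## Draw 100 random assignments of 5 treated units out of 10 and recompute theta for each
theta.sim <- replicate(100, {
  treated <- sample(1:10, 5)
  mean(y[treated]) - mean(y[-treated])
})
## Approximate one-sided p-value from the sampled randomization distribution
mean(theta.sim >= 2.8)
```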
First, install the development version of RItools ("Randomization inference tools") [@bowers2009ritools]. (The following installs the development version of RItools and devtools into a local R library rather than your global R library.)
```{r}
if(!dir.exists("libraries")){
  dir.create("libraries")
}
.libPaths("libraries") ## make the local directory the default place for libraries
## Install devtools into the local library if it is not already installed
installedpackages <- installed.packages(lib.loc="libraries")
if(!any(installedpackages[,"Package"]=="devtools")) {
  install.packages("devtools", repos="http://cran.rstudio.com")
}
library(devtools) ## install_github() below needs devtools loaded even if it was already installed
## Install the RItools development version if it is not already installed
if(!any(installedpackages[,"Package"]=="RItools")) {
  install_github("markmfredrickson/RItools@randomization-distribution")
}
library(RItools)
```
Now, remember that randomization inference requires a description of the experimental design, a hypothesis, and a test statistic.
```{r}
## the design is simple: complete random assignment of 5 out of 10 to treatment
## Create a treatment assignment column
Z <- c(rep(1,5), rep(0,5))
## For now, with the development package, create a blocking indicator that is all 1.
b <- rep(1,10)
randomassignment <- simpleRandomSampler(z=Z, b=b)
## Notice that randomassignment is a function that produces draws from the set of possible treatments:
randomassignment(5)$samples
## Let's use the mean difference test statistic for simplicity.
## A bunch of these are hard coded in the package, but I'll write one out here.
meandiffTZ <- function(ys, z) {
  mean(ys[z==1]) - mean(ys[z==0])
}
meandifftestExact <- RItest(y=y,
                            z=Z,
                            test.stat=meandiffTZ,
                            sampler = randomassignment)
## This provides a one-sided p-value by default
meandifftestExact
## Here is the two-sided p-value
2 * min( meandifftestExact, 1 - meandifftestExact)
```
Now, let's use only 100 draws from the possible 252. And, to show that this method comes with a bit of simulation error, I do it multiple times and calculate the standard deviation of the $p$-values. Notice that if you drew 10,000 times, your simulation error would be very small (and you could approximate it using $\sqrt{p(1-p)/B}$, where $B$ is the number of simulations). Notice that sampling in this way provides an unbiased estimate of the exact $p$-value. Also notice that simulation error can be large --- a difference of two simulation standard errors (about $2 \times .03$ here) in the $p$-value can easily push you over the magic .05 line. So, in this case, we would much prefer to use the exact distribution. That said, if we had drawn 10,000 simulations from a larger study (say, with $N=100$), then we could have had simulation error of roughly .003.
```{r ritest, cache=TRUE}
set.seed(1234567)
meandifftestSim <- RItest(y=y,
                          z=Z,
                          test.stat=meandiffTZ,
                          sampler = randomassignment,
                          samples = 100)
meandifftestSim
simedps <- replicate(1000, RItest(y=y,
                                  z=Z,
                                  test.stat=meandiffTZ,
                                  sampler = randomassignment,
                                  samples = 100))
summary(simedps)
sd(simedps)
## Approximate and quick formula for error based on simulation.
## sqrt( ( p * (1-p)) / number of sims )
sqrt( .11*(1-.11) / 100 )
```
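For comparison, the same back-of-the-envelope formula with $B = 10{,}000$ draws gives a simulation error of roughly .003:

```{r}
## Approximate simulation error with 10,000 draws instead of 100
sqrt( .11*(1-.11) / 10000 )
```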
# References