---
title: |
  | Taking Stock \&
  | Linear Regression in Experiments and Observational Studies
date: '`r format(Sys.Date(), "%B %d, %Y")`'
author: |
  | ICPSR 2023 Session 1
  | Jake Bowers \& Tom Leavitt
bibliography:
  - BIB/MasterBibliography.bib
fontsize: 10pt
geometry: margin=1in
graphics: yes
biblio-style: authoryear-comp
output:
  beamer_presentation:
    slide_level: 2
    keep_tex: true
    latex_engine: xelatex
    citation_package: biblatex
    template: icpsr.beamer
    incremental: true
    includes:
      in_header:
        - defs-all.sty
    md_extensions: +raw_attribute-tex_math_single_backslash+autolink_bare_uris+ascii_identifiers+tex_math_dollars
---
<!-- To show notes -->
<!-- https://stackoverflow.com/questions/44906264/add-speaker-notes-to-beamer-presentations-using-rmarkdown -->
```{r setup1_env, echo=FALSE, include=FALSE}
library(here)
source(here::here("rmd_setup.R"))
```
```{r setup2_loadlibs, echo=FALSE, include=FALSE}
## Load all of the libraries that we will use when we compile this file
## We are using the renv system. So these will all be loaded from a local library directory
library(dplyr)
library(ggplot2)
library(estimatr)
library(coin)
library(DeclareDesign)
```
## Today
1. Agenda: Overview and context for where we are in the course, Modes of
Statistical Inference for Counterfactual Causal Inference, Linear
regression for estimation in randomized experiments, Linear regression for
covariance adjustment in randomized experiments
2. Maybe some open time to work on HW 1 with the TAs and me around.
3. Questions arising from the reading or assignments or life.
# An overview of approaches to statistical inference for causal quantities
## Three General Approaches To Learning About The Unobserved Using Data
\centering
\includegraphics[width=.5\textwidth]{images/MorrowPlots.jpg}
\hfill
\includegraphics[width=.5\textwidth]{images/aces-research-history-morrow-plots--uw.jpg}
## Potential Outcomes
\includegraphics[width=.65\textwidth]{images/cartoonNeymanBayesFisherCropped.pdf}
Imagine we would observe so many bushels of corn, $y$, if plot $i$ were randomly assigned to new fertilizer, $y_{i,Z_i=1}$ (where $Z_i=1$ means "assigned to new fertilizer" and $Z_i=0$ means "assigned status quo fertilizer") and another amount of corn, $y_{i,Z_i=0}$, if the same plot were assigned the status quo fertilizer condition.
These $y$ are *potential* or *partially observed* outcomes.
## Notation
- *Treatment* $Z_i=1$ for treatment and $Z_i=0$ for control for unit $i$
- In a two arm experiment each unit has at least a pair of *potential outcomes* $(y_{i,Z_i=1},y_{i,Z_i=0})$
(also written $(y_{i,1},y_{i,0})$ to indicate that $y_{1,Z_1=1,Z_2=1} = y_{1,Z_1=1,Z_2=0}$ --- unit 1's outcome depends only on treatment assignment to unit 1 and not to unit 2.)
- *Causal Effect* for unit $i$ is $\tau_i$, $\tau_i =
f(y_{i,1},y_{i,0})$. For example, $\tau_i = y_{i,1} - y_{i,0}$.
- *Fundamental Problem of (Counterfactual) Causality* We only see one potential outcome per unit in our observed outcome, $Y_i = Z_i y_{i,1} + (1-Z_i) y_{i,0}$: treatment assignment reveals one potential outcome to us in a simple randomized experiment.
\bigskip
So how do we learn about $\tau_i$ if we cannot directly see it?
## Design Based 1: Compare Models of Potential Outcomes to Data
1. Make a guess about (or model of) $\tau_i = f(y_{i,1},y_{i,0})$. For example
$H_0: y_{i,1}=y_{i,0}+\tau_{i}$ and $\tau_i=0$ is the sharp null hypothesis
of no effects.
2. Measure consistency of the data with this model given the research design and choice of test statistic (summarizing the treatment-to-outcome relationship).
\centering
\includegraphics[width=.65\textwidth]{images/cartoonFisherQuestionMarks.pdf}
## Design Based 1: Compare Models of Potential Outcomes to Data
1. Make a guess about (or model of) $\tau_i$.
2. Measure consistency of data with this model given the design and test statistic.
\centering
\includegraphics[width=.65\textwidth]{images/cartoonFisher.pdf}
## Design Based 1: Compare Models of Potential Outcomes to Data
Comparing the model, $H_0: \tau_i = 0 \implies y_{i,1} = y_{i,0}$ to data:
\centering
\includegraphics[width=\textwidth]{images/cartoonFisherNew1.pdf}
## Design Based 1: Compare Models of Potential Outcomes to Data
\centering
\includegraphics[width=.9\textwidth]{images/cartoonFisherNew2.pdf}
## Design Based 1: Compare Models of Potential Outcomes to Data
\framesubtitle{Testing Models of No-Effects.}
Here is some fake data from a tiny experiment with weird outcomes.
```{r makesmdat, echo=FALSE}
smdat <- data.frame(Z=c(0,1,0,1),y0=c(16,22,7,3990),y1=c(16,24,10,4000))
smdat$Y <- with(smdat,Z*y1 + (1-Z)*y0)
smdat$zF <- factor(smdat$Z)
smdat$rY <- rank(smdat$Y)
print(smdat)
```
Next, define a function that compares treated to control outcomes:
```{r teststats_and_fns, echo=TRUE}
## A mean difference test statistic
tz_mean_diff <- function(z, y) {
  mean(y[z == 1]) - mean(y[z == 0])
}
## A mean difference of ranks test statistic
tz_mean_rank_diff <- function(z, y) {
  ry <- rank(y)
  mean(ry[z == 1]) - mean(ry[z == 0])
}
```
And define a function to repeat the experimental randomization:
```{r echo=TRUE}
## Function to repeat the experimental randomization
newexp <- function(z) {
  sample(z)
}
```
```{r echo=FALSE}
set.seed(12345)
```
## Design Based 1: Compare Models of Potential Outcomes to Data
\framesubtitle{Testing Models of No-Effects.}
Here we **approximate the randomization distribution**:
```{r repexp, echo=TRUE, cache=TRUE}
set.seed(12345)
rand_dist_md <- with(smdat,replicate(1000,tz_mean_diff(z=newexp(Z),y=Y)))
rand_dist_rank_md <- with(smdat,replicate(1000,tz_mean_rank_diff(z=newexp(Z),y=Y)))
obs_md <- with(smdat,tz_mean_diff(z=Z,y=Y))
obs_rank_md <- with(smdat,tz_mean_rank_diff(z=Z,y=Y))
c(observed_mean_diff=obs_md,observed_mean_rank_diff=obs_rank_md)
table(rand_dist_md)/1000 ## Probability Distributions Under the Null of No Effects
table(rand_dist_rank_md)/1000
p_md <- mean(rand_dist_md >= obs_md) ## P-Values
p_rank_md <- mean(rand_dist_rank_md >= obs_rank_md)
c(mean_diff_p=p_md, mean_rank_diff_p=p_rank_md)
## Now using R directly
library(coin)
coin_test1 <- oneway_test(Y~zF,data=smdat,distribution=approximate(nresample=1000),alternative="less")
coin_test2 <- oneway_test(rY~zF,data=smdat,distribution=approximate(nresample=1000),alternative="less")
pvalue(coin_test1)
pvalue(coin_test2)
```
## Design Based 1: Compare Models of Potential Outcomes to Data
\framesubtitle{Testing Models of Effects.}
To learn whether the data are consistent with $\tau_i=100$ for all $i$, notice how treatment assignment reveals part of the unobserved outcomes: $Y_{i} = Z_i y_{i,1} + (1-Z_i) y_{i,0}$, so if $H_0: \tau_i=100$, i.e. $H_0: y_{i,1}=y_{i,0}+100$, then:
\begin{align}
Y_{i} & = Z_i ( y_{i,0} + 100 ) + (1-Z_i) y_{i,0} \\
& = Z_i y_{i,0} + Z_i 100 + y_{i,0} - Z_i y_{i,0} \\
& = Z_i 100 + y_{i,0} \\
y_{i,0} & = Y_{i} - Z_i 100
\end{align}
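For example, with the toy data from the earlier slides, subtracting $Z_i \cdot 100$ from the observed outcomes recovers the hypothesized $y_{i,0}$ for every unit (a quick check of the algebra above, not a new estimator):
```{r adjcheck, echo=TRUE}
## Under H0: y_{i,1} = y_{i,0} + 100, the hypothesized y_{i,0} is Y - Z*100
smdat$y0_hyp <- with(smdat, Y - Z * 100)
smdat[, c("Z", "Y", "y0_hyp")]
```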
## Design Based 1: Compare Models of Potential Outcomes to Data
\framesubtitle{Testing Models of Effects.}
To test a *model of causal effects* we adjust the observed outcomes to be consistent
with our hypothesis about unobserved outcomes and then repeat the experiment:
```{r model_effects, echo=TRUE}
tz_mean_diff_effects <- function(z, y, tauvec) {
  ## Adjust the observed outcomes to the hypothesized y_{i,0}, then compare mean ranks
  adjy <- y - z * tauvec
  radjy <- rank(adjy)
  mean(radjy[z == 1]) - mean(radjy[z == 0])
}
rand_dist_md_tau_cae <- with(smdat,replicate(1000,tz_mean_diff_effects(z=newexp(Z),y=Y,tauvec=c(100,100,100,100))))
obs_md_tau_cae <- with(smdat,tz_mean_diff_effects(z=Z,y=Y,tauvec=c(100,100,100,100)))
mean(rand_dist_md_tau_cae >= obs_md_tau_cae)
```
## Design Based 1: Compare Models of Potential Outcomes to Data
\framesubtitle{Testing Models of Effects.}
Now let's test a model with different effects for each unit --- $H_0: \symbf{\tau} = \{0,2,3,10\}$
\smallskip
```{r model_effects2, echo=TRUE}
rand_dist_md_taux <- with(smdat, replicate(1000,
  tz_mean_diff_effects(z = newexp(Z), y = Y, tauvec = c(0, 2, 3, 10))))
obs_md_taux <- with(smdat,tz_mean_diff_effects(z=Z,y=Y,tauvec=c(0,2,3,10)))
mean(rand_dist_md_taux >= obs_md_taux)
```
So: We can learn about claims about unobserved potential outcomes by gauging the
consistency of our observed results with the distributions implied by the
claims + test statistics + research design (including sample size and randomization
scheme).
\bigskip
Questions about this first approach to using statistical inference to do counterfactual causal inference?
## Design Based 2: Estimate Averages of Potential Outcomes
1. Notice that the observed $Y_i$ are a sample from the (small, finite) population of unobserved potential outcomes $(y_{i,1},y_{i,0})$.
2. Decide to focus on the average, $\bar{\tau}$, because sample averages, $\hat{\bar{\tau}}$, are unbiased and consistent estimators of population averages.
3. Estimate $\bar{\tau}$ with the observed difference in means as $\hat{\bar{\tau}}$.
\centering
\includegraphics[width=.8\textwidth]{images/cartoonNeyman.pdf}
## Design Based 2: Estimate Averages of Potential Outcomes
\centering
\includegraphics[width=.95\textwidth]{images/cartoonNeyman.pdf}
## Design Based 2: Estimate Averages of Potential Outcomes
Here we use Neyman's standard errors (the same as the HC2 SEs) and Central Limit Theorem based $p$-values and 95% confidence intervals.
Notice that OLS/linear regression is just a difference-of-means calculator here.
\smallskip
```{r dim, echo=TRUE}
with(smdat,mean(Y[Z==1]) - mean(Y[Z==0]))
est_se_est_ate <- with(smdat,sqrt( var(Y[Z==1])/sum(Z) + var(Y[Z==0])/(sum(1-Z)) ))
est_se_est_ate
est1 <- difference_in_means(Y~Z,data=smdat)
tidy(est1)
lm1 <- lm_robust(Y~Z,data=smdat)
tidy(lm1)
```
Those $p$-values raise the next question: Where does the randomization
distribution for these tests come from?
## Design Based 2: Test hypotheses about averages.
A focus on the difference of average potential outcomes, on an
average causal effect, also allows for testing hypotheses about these average
causal effects. This is called a test of the "weak null" hypothesis.
The challenge:
\medskip
A hypothesis like $H_0: \bar{\tau}=\tau_0$ is compatible with many different sharp
null hypotheses: for example, $\tau=\{-5,0,0,5\}$ and $\tau=\{0,0,0,0\}$
are both compatible with $\bar{\tau}=0$.
\medskip
And a hypothesis test requires that we compare what we observe with what the
hypothesis implies that we would observe. But a hypothesis about an average
implies many different probability distributions: one for each sharp
hypothesis.
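Here is a small sketch of that point using the tiny dataset and functions from above: the sharp nulls $\tau=\{-5,0,0,5\}$ and $\tau=\{0,0,0,0\}$ share $\bar{\tau}=0$ but imply different reference distributions for the mean-difference statistic (which units get which $\tau_i$ is arbitrary here; this is only an illustration).
```{r weaknull_sketch, echo=TRUE}
## Adjust the observed outcomes under each sharp null, then re-randomize
adj_a <- with(smdat, Y - Z * c(-5, 0, 0, 5))
adj_b <- with(smdat, Y - Z * c(0, 0, 0, 0))
dist_a <- with(smdat, replicate(1000, tz_mean_diff(z = newexp(Z), y = adj_a)))
dist_b <- with(smdat, replicate(1000, tz_mean_diff(z = newexp(Z), y = adj_b)))
## Same hypothesized average effect, different reference distributions:
rbind(tau_a = summary(dist_a), tau_b = summary(dist_b))
```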
## Hypothesis tests of the weak null
\begin{itemize}
\item The finite population CLT tells us that
\begin{align*}
\cfrac{\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right) - \E\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}{\sqrt{\Var\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}} & \overset{d}{\to} \mathcal{N}\left(0, 1\right)
\end{align*}
\item Diff-in-Means is unbiased, so write
\begin{align*}
\cfrac{\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right) - \tau}{\sqrt{\Var\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}} & \overset{d}{\to} \mathcal{N}\left(0, 1\right)
\end{align*}
\item The CLT is an asymptotic result as $N \to \infty$
\item But we can bound the error of the Normal approximation for fixed $N$
\item Thus, with experiments of at least moderate size and outcomes that are not too skewed and have no extreme outliers,
\begin{align*}
\cfrac{\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right) - \tau}{\sqrt{\Var\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}} & \overset{\text{approx.}}{\sim} \mathcal{N}\left(0, 1\right)
\end{align*}
\item This justifies using the standard Normal distribution for hypothesis tests
\end{itemize}
```{r, warning=FALSE}
y_C <- c(10, 15, 20, 20, 10, 15, 15)
y_T <- c(15, 15, 30, 15, 20, 15, 30)
tau <- y_T - y_C
ATE <- mean(tau)
## Imagine one Z:
Z <- c(1,1,0,0,0,0,0)
Y <- Z*y_T + (1-Z) * y_C
est_ATE <- mean(Y[Z==1]) - mean(Y[Z==0])
asymp_ests_exact <- function(.y_C,
.y_T,
.prop_T,
.h){
y_C = c(.y_C, rep(x = .y_C, times = .h - 1))
y_T = c(.y_T, rep(x = .y_T, times = .h - 1))
Omega = apply(X = combn(x = length(y_C),
m = length(y_C) * .prop_T,
simplify = TRUE),
MARGIN = 2,
FUN = function(x) as.integer(1:(length(y_C)) %in% x))
obs_pot_outs = sapply(X = 1:ncol(Omega),
FUN = function(x) { y_T * Omega[,x] + y_C * (1 - Omega[,x]) })
diff_means_ests = sapply(X = 1:ncol(Omega),
FUN = function(x) { mean(obs_pot_outs[,x][Omega[,x] == 1]) -
mean(obs_pot_outs[,x][Omega[,x] == 0]) })
return(diff_means_ests)
}
## Show asymptotic properties
asymp_ests_approx <- function(.y_C,
.y_T,
.prop_T,
.h){
y_C = c(.y_C, rep(x = .y_C, times = .h - 1))
y_T = c(.y_T, rep(x = .y_T, times = .h - 1))
true_EV_diff_means_est = mean(y_T) - mean(y_C)
true_var_diff_means_est = diff_means_var(.n = length(y_C),
.n_t = length(y_C) * .prop_T,
.y_C = y_C,
.y_T = y_T)
return(rnorm(n = 10^3,
mean = true_EV_diff_means_est,
sd = sqrt(true_var_diff_means_est)))
}
diff_means_var <- function(.n,
.n_t,
.y_C,
.y_T) {
var_y_C = mean((.y_C - mean(.y_C))^2)
var_y_T = mean((.y_T - mean(.y_T))^2)
cov_y_C_y_T = mean((.y_C - mean(.y_C)) * (.y_T - mean(.y_T)))
var = (1/(.n - 1)) * ((.n_t * var_y_C) / (.n - .n_t) +
(((.n - .n_t) * var_y_T) / .n_t) +
(2 * cov_y_C_y_T))
return(var)
}
diff_means_var_est <- function(.n,
.n_1,
.z,
.y) {
obs_y_C = .y[which(.z == 0)]
obs_y_T = .y[which(.z == 1)]
est_mean_y_C = mean(obs_y_C)
est_mean_y_T = mean(obs_y_T)
est_var_y_C = ((.n - 1)/(.n * ((.n - .n_1) - 1))) * sum((obs_y_C - est_mean_y_C)^2)
est_var_y_T = ((.n - 1)/(.n * (.n_1 - 1))) * sum((obs_y_T - est_mean_y_T)^2)
return((.n/(.n - 1)) * ( (est_var_y_C/(.n - .n_1)) + (est_var_y_T/.n_1)))
}
stand_ests_plot_data <- data.frame(
  est = c((asymp_ests_exact(.y_C = y_C, .y_T = y_T, .prop_T = 2/7, .h = 1) - 5) / sqrt(21.19048),
          (asymp_ests_exact(.y_C = y_C, .y_T = y_T, .prop_T = 2/7, .h = 3) - 5) / sqrt(6.357143),
          (asymp_ests_approx(.y_C = y_C, .y_T = y_T, .prop_T = 2/7, .h = 15) - 5) / sqrt(0.4049136),
          (asymp_ests_approx(.y_C = y_C, .y_T = y_T, .prop_T = 2/7, .h = 150) - 5) / sqrt(0.002690911)),
  ## .h = 15 and .h = 150 copies of the 7 units give N = 105 and N = 1050
  N = c(rep(x = "N = 7", times = 21),
        rep(x = "N = 21", times = choose(21, 6)),
        rep(x = "N = 105", times = 10^3),
        rep(x = "N = 1050", times = 10^3)))
stand_ests_plot_data$N <- factor(x = stand_ests_plot_data$N, levels = c("N = 7", "N = 21", "N = 105", "N = 1050"))
##devtools::install_github("teunbrand/ggh4x")
library(ggh4x)
asymp_stand_ests_plot <- ggplot(data = stand_ests_plot_data,
mapping = aes(x = est)) +
geom_histogram(aes(y = after_stat(density))) +
stat_theodensity(colour = "red") +
xlab(label = "Standardized Diff-in-Means estimates") +
ylab(label = "Probability") +
theme_bw() +
facet_wrap(facets = .~ N,
nrow = 1,
ncol = 4,
scales = "free") +
scale_x_continuous(labels = 0, breaks = 1) +
theme(axis.ticks.y = element_blank(),
axis.text.y = element_blank())
ggsave(plot = asymp_stand_ests_plot,
file = "asymp_stand_ests_plot.pdf",
width = 6,
height = 4,
units = "in",
dpi = 600)
```
## Hypothesis tests of the weak null
\begin{itemize}
\item In the ``village heads'' example we calculate an estimate for each of the
possible ways to randomize and subtract off the expected value (``5'') so
that the distribution is now centered at 0 (a hypothesized value).
\begin{figure}[H]
\includegraphics[width=\linewidth]{asymp_stand_ests_plot.pdf}
\end{figure}
\end{itemize}
## Hypothesis tests of the weak null
\begin{itemize}
\item To test the null hypothesis against an alternative
\begin{align*}
H_0: & \tau = \tau_0 \text{ versus either } \\
H_A: & \tau > \tau_0, \, H_A: \tau < \tau_0 \text{ or } H_A: \left\lvert \tau \right\rvert > \left\lvert \tau_0 \right \rvert
\end{align*}
\item Calculate the upper ($u$), lower ($l$), or two-sided ($t$) p-value (see the R sketch after this list) as
\begin{align*}
p_u & = 1 - \Phi\left(\frac{\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right) - \tau_0}{\sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}}\right) \\
p_l & = \Phi\left(\frac{\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right) - \tau_0}{\sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}}\right) \\
p_t & = 2\left(1 - \Phi\left(\frac{\left\lvert\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right) - \tau_0\right\rvert}{\sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}}\right)\right)
\end{align*}
\item If the p-value is less than the size ($\alpha$-level) of the test, reject. Otherwise, don't
\item Note that, since we don't know $\Var\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]$, \\ we have used its conservative estimator instead, $\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]$
\end{itemize}
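As a minimal sketch of this arithmetic, reusing the tiny `lm1` fit from the small example above (with $N=4$ the Normal approximation is surely not trustworthy; this only shows the mechanics):
```{r normal_pvals, echo=TRUE}
est <- coef(lm1)[["Z"]]
se <- lm1$std.error[["Z"]]
tau0 <- 0 ## the hypothesized average effect
zstat <- (est - tau0) / se
c(p_upper = 1 - pnorm(zstat),
  p_lower = pnorm(zstat),
  p_two = 2 * (1 - pnorm(abs(zstat))))
```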
## Hypothesis tests of the weak null
\bh{Hypothesis tests are susceptible to two errors}:
\begin{itemize}
\item Type I error: Rejecting null hypothesis when it is true
\item Type II error: \textit{Failing} to reject null hypothesis when it is false
\end{itemize}
\bh{A good test} controls these errors (see the simulation sketch after this list):
\begin{enumerate}
\item Type I error probability is less than or equal to size ($\alpha$-level) of test
\item Power (1 - type II error probability) is at least as great as $\alpha$-level
\item Power tends to $1$ as $N \to \infty$
\end{enumerate}
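A simulation sketch of property (1) using the village-heads outcomes defined above: under a true sharp null of no effects the outcomes do not respond to assignment, so the rejection rate across re-randomizations estimates the size of the test (with $N=7$ we should not be surprised by deviations from $\alpha$, per the next slide):
```{r type1_sketch, echo=TRUE, cache=TRUE}
sim_pvals <- replicate(2000, {
  z_new <- sample(c(1, 1, 0, 0, 0, 0, 0)) ## re-randomize 2 of 7 to treatment
  tidy(lm_robust(y_C ~ z_new))$p.value[2] ## y_C is the outcome under no effects
})
mean(sim_pvals <= 0.05) ## should be at or below about 0.05
```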
## Hypothesis tests of the weak null
\begin{itemize}
\item We can prove that tests of the weak null satisfy (1) -- (3) as $N \to \infty$ \vfill
\item Thus, when experiments are large, we can often safely use such tests \vfill
\item But (1) -- (3) may not always be satisfied when experiments are small, have skewed outcome distributions, or have extreme outliers \vfill
\end{itemize}
## Confidence intervals
\begin{itemize}
\item Equivalence between hypothesis testing and confidence intervals
\item Confidence interval is set of null hypotheses we fail to reject
\end{itemize}
Consider two-sided confidence interval, $\mathcal{C}_t$:
\footnotesize
\begin{equation*}
\begin{split}
\mathcal{C}_t & = \left\{\tau_0 : \left\lvert \cfrac{\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)- \tau_0}{\sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}} \right\rvert \leq z_{1 - \alpha/2}\right\} \\
& = \left\{\tau_0 : - z_{1 - \alpha/2} \leq \cfrac{\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)- \tau_0}{\sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right]}} \leq z_{1 - \alpha/2} \right\} \\
& = \left\{\tau_0 : - z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\leq \hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)- \tau_0 \leq z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\right\} \\
& = \left\{\tau_0 : -\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)- z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\leq - \tau_0 \leq - \hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)+ z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\right\} \\
& = \left\{\tau_0 : \hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)+ z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\geq \tau_0 \geq \hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)- z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\right\} \\
& = \left\{\tau_0 : \hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)- z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\leq \tau_0 \leq \hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)+ z_{1 - \alpha/2} \sqrt{\widehat{\Var}\left[\hat{\tau}\left(\mathbf{Z}, \mathbf{Y}\right)\right] }\right\}
\end{split}
\end{equation*}
\normalsize
## Confidence intervals {.allowframebreaks}
Here is an example showing how a confidence interval is a collection of
not-rejected hypotheses (here using the weak null hypothesis):
```{r, echo=TRUE, warning=FALSE}
vh_dat <- data.frame(Y = Y, Z = Z, y_C = y_C, y_T = y_T)
p_val_fun <- function(h, thealpha = .05, return_p = FALSE) {
  ## Return a p-value for each weak null hypothesis
  test_res <- tidy(lm_robust(I(Y - h * Z) ~ Z, data = vh_dat))
  the_p <- test_res %>% filter(term == "Z") %>% select(p.value)
  if (return_p) {
    return(the_p[[1]])
  } else {
    return(thealpha - the_p[[1]])
  }
}
## First just test many hypotheses in a row. This is a "grid search".
the_hyps <- seq(-20, 20, .1)
res <- sapply(the_hyps, function(h) {
  p_val_fun(h = h, return_p = TRUE)
})
hyp_res <- data.frame(h = the_hyps, p = res)
## Keep the largest and smallest hypotheses for which p >= .05
hyp_res %>% filter(abs(p) >= .05) %>% reframe(range(h))
## Second, directly search for the hypotheses where alpha - p = 0.
upper_lim <- uniroot(p_val_fun, interval = c(0, 20), extendInt = "no", trace = 2)
lower_lim <- uniroot(p_val_fun, interval = c(-20, 0), extendInt = "no", trace = 2)
ci2 <- c(lower_lim$root, upper_lim$root)
## Third, use the analytic formula with standard errors etc.
lm2 <- lm_robust(Y ~ Z, data = vh_dat)
lm2_df <- tidy(lm2)
lm2_df %>% mutate(ll = estimate - qt(p = .975, df = 5) * std.error,
                  ul = estimate + qt(p = .975, df = 5) * std.error)
confint(lm2, parm = "Z")
```
## Model Based 1: Predict Distributions of Potential Outcomes
\smallskip
\centering
\includegraphics[width=.95\textwidth]{images/cartoonBayesNew.pdf}
## Model Based 1: Predict Distributions of $(y_{i,1},y_{i,0})$
> 1. Given a model of $Y_i$ (see [this website](https://mc-stan.org/users/documentation/case-studies/model-based_causal_inference_for_RCT.html)):
\begin{equation}
\mathrm{Pr}(Y_{i}^{obs} \vert \mathrm{Z}, \theta) \sim \mathsf{Normal}(Z_{i} \cdot \mu_{1} + (1 - Z_{i}) \cdot \mu_{0} , Z_{i} \sigma_{1}^{2} + (1 - Z_{i}) \cdot \sigma_{0}^{2})
\end{equation}
where $\mu_{0}=\alpha$ and $\mu_{1}=\alpha + \tau$.
> 2. And a model of the pair $\{y_{i,0},y_{i,1}\} \equiv \{Y_{i}(0), Y_{i}(1)\}$, now random rather than fixed as before (and so written in upper case):
\begin{equation}
\begin{pmatrix} Y_{i}(0) \\ Y_{i}(1) \end{pmatrix} \biggm\vert \theta
\sim
\mathsf{Normal}
\begin{pmatrix}
\begin{pmatrix} \mu_{0} \\ \mu_{1} \end{pmatrix},
\begin{pmatrix} \sigma_{0}^{2} & \rho \sigma_{0} \sigma_{1} \\ \rho \sigma_{0} \sigma_{1} & \sigma_{1}^{2} \end{pmatrix}
\end{pmatrix}
\end{equation}
> 3. And a model of $Z_i$ that is known because of randomization, so we can write: $\mathrm{Pr}(\mathrm{Z}|\mathrm{Y}(0), \mathrm{Y}(1)) = \mathrm{Pr}(\mathrm{Z})$
> 4. And given priors on $\theta = \{\alpha, \tau, \sigma_c, \sigma_t\}$ (here we make them all independent Normal(0,5)).
We can generate the posterior distribution of $\alpha$, $\tau$, $\sigma_c$, and $\sigma_t$ and thus can impute $\{Y_{i}(0),Y_{i}(1)\}$ to generate a distribution for $\tau_i$.
## Model Based 1: Predict Distributions of Potential Outcomes
A snippet of `rctbayes.stan`:
```
model {
  // PRIORS
  alpha ~ normal(0, 5);
  tau ~ normal(0, 5);
  sigma_c ~ normal(0, 5);
  sigma_t ~ normal(0, 5);
  // LIKELIHOOD
  y ~ normal(alpha + tau*w, sigma_t*w + sigma_c*(1 - w));
}
```
## Model Based 1: Predict Distributions of Potential Outcomes
```{r loadstan, echo=FALSE, message=FALSE,results='hide'}
## Not putting rstan in the top because installation can be involved and we don't plan to use it otherwise in this course.
library(rstan)
rstan_options(auto_write = TRUE)
```
```{r stanbayes, echo=TRUE, cache=TRUE, results="hide",warning=FALSE,message=FALSE}
## rho is correlation between the potential outcomes
stan_data <- list(N = 4, y = smdat$Y, w = smdat$Z, rho = 0)
# Compile and run the stan model
fit_simdat <- stan(file = "rctbayes.stan", data = stan_data, iter = 5000, warmup=2500,chains = 4,control=list(adapt_delta=.99))
res <- as.matrix(fit_simdat)
```
```{r rctbayes2, echo=TRUE, results="markup"}
## Summary of the posterior draws of the treatment effects for units 1 and 4
t(apply(res[,c("tau_unit[1]","tau_unit[4]")],2,summary))
## Probability that effect on unit 1 is greater than 0
mean(res[,"tau_unit[1]"]>0)
## Overall mean of the effects:
mean_tau <- rowMeans(res[,c("tau_unit[1]","tau_unit[2]","tau_unit[3]","tau_unit[4]")])
summary(mean_tau)
```
## Summary: Modes of Statistical Inference for Causal Effects
We can infer about unobserved counterfactuals by:
1. assessing claims or models or hypotheses about relationships between
unobserved potential outcomes (Fisher's sharp null testing approach via
Rosenbaum) (this includes testing hypotheses of "no effects at all", "the
same effect for all", different effects for each unit, or anything in
between).
2. estimating averages (or other summaries) of unobserved potential outcomes
(Neyman's estimation approach) (unbiased estimators of effects,
conservative variance estimators, and the Central Limit Theorem also allow
us to use Normal approximations with the difference of means as a test
statistic to test weak null hypotheses).
3. predicting individual level outcomes based on probability models of
outcomes, interventions, etc. (Bayes's predictive approach via Rubin)
We can go from tests to intervals (even to quite narrow intervals) because a
confidence interval is a collection of not-rejected hypothesis tests.
## Summary: Modes of Statistical Inference for Causal Effects
Statistical inferences --- formalized reasoning about "what if" statements
("What if I had randomly assigned other plots to treatment?") --- and their properties (like bias, error rates, precision) arise from:
1. Repeating the design and using the hypothesis and test statistics to
generate a reference distribution that describes the variation in the hypothetical world. Compare the observed to the hypothesized to measure consistency between hypothesis, or model, and observed outcomes (*Fisher and Rosenbaum's
randomization-based inference for individual causal effects*).
2. Repeating the design and the estimation such that standard errors, $p$-values,
and confidence intervals reflect design-based variability. Probability distributions (like the Normal or t-distribution) arise from Limit Theorems in large samples.
(*Neyman's randomization-based inference for average causal effects*).
3. Repeatedly drawing from the probability distributions that generate the
observed data (that represent the design) --- the likelihood and the
priors --- to describe a posterior distribution for unit-level causal
effects. Calculate posterior distributions for aggregated causal effects (like averages of individual level
effects). (*Bayes and Rubin's predictive model-based causal inference*).
## Summary: Applications of the Model-Based Prediction Approach
Examples of use of the model-based prediction approach:
- Estimating causal effects when we need to model processes of missing outcomes, missing treatment indicators, or complex non-compliance with treatment \parencite{barnard2003psa}
- Searching for heterogeneity (subgroup differences) in how units react to treatment (e.g., \parencite{hahn2020bayesian}, but see also the literature on BART and Bayesian machine learning as applied to causal inference questions).
## Summary: Applications of the Testing Approach
Examples of use of the testing approach:
- Assessing evidence of Pareto-optimal effects or of no aberrant effect (i.e., that no
unit was made worse off by the treatment) \parencite{caughey2016beyond, rosenbaum2008aberrant}.
- Assessing evidence that the treatment group was made better off than the control
group, while remaining agnostic about the precise nature of the difference (e.g.,
$p>.2$ with a difference of means but $p<.001$ with a difference of ranks in
an Office of Evaluation Sciences study of General Services Administration
auctions).
- Focusing on detection rather than on estimation (for example to identify
promising sites for future research in experiments with many blocks or
strata) (Bowers and Chen 2020 working paper).
- Assessing hypotheses of no effects in small samples, with rare outcomes,
cluster randomization, or other designs where reference distributions may
not be Normal (see for example, \parencite{gerber2012field}).
- Assessing structural models of causal effects (for example models of
treatment effect propagation across networks)
\parencite{bowers2016research,bowers2013sutva,bowers2018models}.
# Linear Regression to Enhance Precision in Experiments
## What does linear regression do in an experiment?
Let's make a slightly bigger dataset where the potential outcomes depend on a background covariate:
```{r newdat, echo=TRUE}
set.seed(12345) ## seed first so the simulated data are reproducible
N <- 100
tau <- .2
dat <- data.frame(id = 1:N,
                  x1 = rpois(n = N, lambda = 10))
dat <- mutate(dat,
              y0 = .5 * sd(x1) * x1 + runif(n = N, min = -2 * sd(x1), max = 2 * sd(x1)),
              y1 = y0 + tau * sd(y0) #+ runif(n = N,min=-.5*sd(y0),max=.5*sd(y0)),
)
dat$Z <- complete_ra(N = N, m = floor(N / 2))
dat <- mutate(dat, Y = Z * y1 + (1 - Z) * y0)
dat$tau <- with(dat, y1 - y0)
head(dat)
## summary(lm(y0~x1,data=dat))$r.squared
## blah <- lm_robust(Y~Z,data=dat); blah$p.value["Z"]
## blah2 <- lm_robust(Y~Z+x1,data=dat); blah2$p.value["Z"]
```
## What does linear regression do in an experiment?
Notice that the potential outcomes are now a function of a background covariate:
```{r echo=TRUE,out.width="50%"}
with(dat,cor(x1,y0))
summary(lm(y0~x1,data=dat))$r.squared
with(dat,plot(x1,y0))
```
## What does linear regression do in an experiment?
First, how does linear regression "adjust" for `x1`?
```{r simpcovadj, echo=TRUE}
unadj_est <- lm_robust(Y~Z,data=dat)
adj_est <- lm_robust(Y~Z+x1,data=dat)
c(coef(unadj_est)[2],coef(adj_est)[2])
```
## What does linear regression do in an experiment?
First, how does linear regression "adjust" for x1? (Answer: by removing linear relationships between `x1` and `Z` and between `x1` and `Y`):
```{r covadj2, echo=TRUE}
lm_Y_x1 <- lm(Y~x1,data=dat)
lm_Z_x1 <- lm(Z~x1,data=dat)
dat$resid_Y_x1 <- resid(lm_Y_x1)
dat$resid_Z_x1 <- resid(lm_Z_x1)
lm_resid_Y_x1 <- lm(resid_Y_x1~x1,data=dat)
lm_resid_Z_x1 <- lm(resid_Z_x1~x1,data=dat)
```
## What does linear regression do in an experiment?
First, how does linear regression "adjust" for x1? (Answer: by removing linear relationships between `x1` and `Z` and between `x1` and `Y`):
```{r plotresids, echo=FALSE,out.width=".8\\textwidth"}
par(mfrow=c(2,2),mar=c(2,3,1,0),mgp=c(1.25,.5,0),oma=rep(0,4))
with(dat,plot(x1,Y,col=c("black","blue")[dat$Z+1],pch=c(1,19)[dat$Z+1]))
abline(lm_Y_x1)
with(dat,plot(x1,resid_Y_x1,ylab="Y - b*x1 or Y without linear relation with x1",col=c("black","blue")[dat$Z+1],pch=c(1,19)[dat$Z+1]))
abline(lm_resid_Y_x1)
# with(dat,plot(x1,jitter(Z,factor=.1)))
with(dat,plot(x1,Z,col=c("black","blue")[dat$Z+1],pch=c(1,19)[dat$Z+1]))
abline(lm_Z_x1)
with(dat,plot(x1,resid_Z_x1,ylab="Z - b*x1 or Z without linear relation with x1",col=c("black","blue")[dat$Z+1],pch=c(1,19)[dat$Z+1]))
abline(lm_resid_Z_x1)
```
## What does linear regression do in an experiment?
Linear regression "adjusts" the $Z \rightarrow Y$ relationship for a background covariate $x$ by removing linear relationships: For example, here we show how it works after removing the linear relationships between `x1` and `Z` and between `x1` and `Y`):
```{r residlm, echo=TRUE}
coef(adj_est)[2]
adj_est2 <- lm(resid_Y_x1 ~ resid_Z_x1,data=dat)
coef(adj_est2)[2]
stopifnot(all.equal(coef(adj_est)[[2]],coef(adj_est2)[[2]]))
```
## What does linear regression do in an experiment?
Linear regression "adjusts" the $Z \rightarrow Y$ relationship for a background covariate $x$ by removing linear relationships: For example, here we show how it works after removing the linear relationships between `x1` and `Z` and between `x1` and `Y`):
```{r plot2, out.width=".8\\textwidth"}
dat$ZF <- factor(dat$Z)
with(dat,plot(x1,Y,col=c("black","blue")[dat$Z+1],pch=c(1,19)[dat$Z+1]))
preddat <- expand.grid(Z=c(0,1),x1=range(dat$x1))
preddat$fit <- predict(adj_est,newdata=preddat)
with(preddat[preddat$Z==0,],lines(x1,fit))
with(preddat[preddat$Z==1,],lines(x1,fit,col="blue",lwd=2))
```
## What does linear regression do in an experiment?
Why might we adjust in an experiment? The idea is to increase precision. Notice the smaller standard errors, which arise because we have removed noise that is independent of treatment from the outcome (recall that background covariates are independent of treatment assignment).
```{r expadj, echo=TRUE}
unadj_est$std.error["Z"]
adj_est$std.error["Z"]
```
But what about bias? The point estimates differ here (because in any finite
sample the realized correlation between $Z$ and a covariate is not exactly zero, even though randomization makes them independent on average):
```{r bias1, echo=TRUE}
coef(unadj_est)[["Z"]]
coef(adj_est)[["Z"]]
```
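A quick check of that claim: in the one randomization we actually ran, the sample correlation between `Z` and `x1` is not exactly zero.
```{r zx1cor, echo=TRUE}
with(dat, cor(Z, x1))
```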
## How would we know whether the bias is too big? {.allowframebreaks}
Different may not mean worse. Let's see how these (and some other) estimators work. First, just specify a bunch of different estimators:
```{r biassimsetup, echo=TRUE}
## The truth:
trueATE_estimand <- with(dat, mean(y1 - y0))
## A function to make a new experiment.
new_exp <- function(Z) {
  ## This next shuffles Z
  newZ <- sample(Z)
  return(newZ)
}
## The est1..est6 functions are estimators of the trueATE, each returning a single number --- the estimated ATE
est1 <- function(Z, Y, x = NULL) {
  mean(Y[Z == 1]) - mean(Y[Z == 0])
}
est2 <- function(Z, Y, x = NULL) {
  coef(lm(Y ~ Z))[["Z"]]
}
est3 <- function(Z, Y, x) {
  ## "controlling for" but not really because this is an experiment.
  coef(lm(Y ~ Z + x))[["Z"]]
}
est4 <- function(Z, Y, x) {
  quantY <- quantile(Y, .9)
  ## Recode the highest 10% of outcomes to the 90th percentile to limit the influence of outliers.
  Y[Y > quantY] <- quantY
  coef(lm(Y ~ Z + x))[["Z"]]
}
est5 <- function(Z, Y, x) {
  ## Use the Lin method of covariance adjustment to minimize bias
  thedat <- data.frame(Z = Z, Y = Y, x = x)
  lm_lin_mod <- lm_lin(Y ~ Z, covariates = ~x, data = thedat)
  coef(lm_lin_mod)[["Z"]]
}
est6 <- function(Z, Y, x) {
  ## Use the Rosenbaum method of covariance adjustment to increase precision
  resids1 <- resid(lm(Y ~ x))
  coef(lm(resids1 ~ Z))[["Z"]]
}
## A function to use them all on a new experiment --- after repeating it and re-revealing potential outcomes.
compare_ests <- function(Z, y1, y0, x) {
  newZ <- new_exp(Z)
  newY <- newZ * y1 + (1 - newZ) * y0
  est1_coef <- est1(Z = newZ, Y = newY)
  est2_coef <- est2(Z = newZ, Y = newY)
  est3_coef <- est3(Z = newZ, Y = newY, x = x)
  est4_coef <- est4(Z = newZ, Y = newY, x = x)
  est5_coef <- est5(Z = newZ, Y = newY, x = x)
  est6_coef <- est6(Z = newZ, Y = newY, x = x)
  return(c(est1 = est1_coef, est2 = est2_coef, est3 = est3_coef,
           est4 = est4_coef, est5 = est5_coef, est6 = est6_coef))
}
```
## How would we know whether the bias is too big? {.allowframebreaks}
Different may not mean worse. Next, repeat the experiment many times, applying each estimator to each re-randomization:
```{r dobiassims, echo=TRUE, cache=TRUE}
set.seed(1235)
nsims <- 5000
dist_ests <- with(dat,
                  replicate(nsims, compare_ests(Z = Z, y1 = y1, y0 = y0, x = x1)))
apply(dist_ests, 1, summary)
## First row: true SEs (the SD of each estimator); second row: bias (estimand minus mean estimate)
rbind(apply(dist_ests, 1, sd),
      apply(dist_ests, 1, function(x){ trueATE_estimand - mean(x) }))
```
## How would we know whether the bias is too big?
Different may not mean worse. Let's see how these estimators work (the black
line is the true ATE):
```{r out.width=".9\\textwidth"}
estdat <- data.frame(tauhat = c(dist_ests[1,],dist_ests[2,],dist_ests[3,],dist_ests[4,],dist_ests[5,],dist_ests[6,]),
est=rep(c("mean diff","lm","cov adj","topcoded+adj","lin adj","rose adj"),each=nsims))
estmeans <- estdat %>% group_by(est) %>% summarize(mnest=mean(tauhat))
# http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette
# The palette with black:
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# To use for fills, add
# scale_fill_manual(values=cbPalette)
# To use for line and point colors, add scale_colour_manual(values=cbbPalette)
g1 <- ggplot(data=estdat,aes(x=tauhat,group=est,color=est))+
geom_density() +
geom_point(data=estmeans,aes(y=.1,x=mnest,group=est,color=est)) +
geom_vline(xintercept=trueATE_estimand) +
scale_colour_manual(values=cbbPalette) +
theme_bw()
print(g1)
```
## More on the Rosenbaum 2002 approach:
The first approach worked because the treatment was independent of the covariate. Yet, in a finite sample, there is still some relationship --- that was why the plot of `Z` by `x1` and the plot of `resid_Z_x1` by `x1` were not the same.
```{r echo=TRUE}
summary(lm(Z~x1,dat))$r.squared
```
Another approach (from @rosenbaum2002) does not adjust `Z` at all --- after all, we know that it is randomized. We have no need to adjust `Z`; we just want to remove non-treatment-related noise from our outcome in order to increase precision. Notice that we also gain precision here by not reducing the variation in `Z`.
```{r echo=TRUE}
rose_mod <- lm_robust(resid_Y_x1~Z,data=dat)
rose_mod$p.value["Z"]
unadj_est$p.value["Z"]
adj_est$p.value["Z"]
```
## More on the Lin 2013 approach:
@freedman2008a showed the bias in simple covariance-adjusted estimates of the ATE. @lin2013 showed how to substantially reduce this bias.
```{r linapproach, echo=TRUE}
## Center the covariate (subtract its mean)
dat$x1_c <- with(dat, x1 - mean(x1))
## Fit a model interacting the centered covariate with the treatment
lm_lin1 <- lm_robust(Y ~ Z + Z * x1_c, data = dat)
coef(lm_lin1)[["Z"]]
lm_lin2 <- lm_lin(Y ~ Z, covariates = ~x1, data = dat)
coef(lm_lin2)[["Z"]]
c(lm_lin2$std.error["Z"],
  lm_lin1$std.error["Z"],
  adj_est$std.error["Z"],
  rose_mod$std.error["Z"],
  unadj_est$std.error["Z"])
```
## Linear Regression for Covariance Adjustment in a RCT
Covariance adjustment is the removal of linear relationships. We can think of
"removing a linear relationship" as "residualization".
- In randomized experiments:
- Direct covariance adjustment can increase precision in the estimation of average
causal effects.
- The precision comes with some bias (perhaps a very small amount; the bias-versus-variance tradeoff may be worth making in some cases but not in others).
- We can reduce the bias following @lin2013.
- More precision can arise if we don't adjust the treatment following @rosenbaum2002 (could involve some bias here too).
\medskip
We don't say "controlling for" when we adjust for covariates in an experiment
because we are not removing their relationship with $Z$ (randomization is
supposed to have already done that: if asked "What is the correlation
between $X$ and $Z$?", the answer is "On average, 0.").
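A quick simulation of that "on average, 0" answer, using the simulated experiment from above:
```{r avg_cor_zero, echo=TRUE}
## Average the realized cor(x1, Z) over many re-randomizations of this design
mean(replicate(2000, with(dat, cor(x1, sample(Z)))))
```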
## Summary of the Day
- There is more than one way to use what we observe to reason about potential
outcomes: testing, estimating, predicting.
- Linear regression is the most common way to "adjust for" background
covariates in non-experimental studies. We learned what it means to
adjust for covariates in experimental studies so that we can better
evaluate linear regression in observational studies (tomorrow).
- Structure of the course. Where we are now. Where we are going: away from
randomized experiments toward observational studies, but with the
statistical basis developed from our understanding of randomized
experiments.
## References