---
title: |
| "Statistical Adjustment and Assessment of Adjustment in Observational Studies --- Matched Stratification for One and Multiple Variables."
date: '`r format(Sys.Date(), "%B %d, %Y")`'
author: |
| ICPSR 2021 Session 1
| Jake Bowers, Ben Hansen, Tom Leavitt
bibliography:
- 'BIB/MasterBibliography.bib'
- 'BIB/master.bib'
- 'BIB/misc.bib'
- 'BIB/refs.bib'
fontsize: 10pt
geometry: margin=1in
graphics: yes
biblio-style: authoryear-comp
colorlinks: true
biblatexoptions:
- natbib=true
output:
beamer_presentation:
slide_level: 2
keep_tex: true
latex_engine: xelatex
citation_package: biblatex
template: icpsr.beamer
incremental: true
includes:
in_header:
- defs-all.sty
md_extensions: +raw_attribute-tex_math_single_backslash+autolink_bare_uris+ascii_identifiers+tex_math_dollars
pandoc_args: [ "--csl", "chicago-author-date.csl" ]
---
<!-- To show notes -->
<!-- https://stackoverflow.com/questions/44906264/add-speaker-notes-to-beamer-presentations-using-rmarkdown -->
```{r setup1_env, echo=FALSE, include=FALSE}
library(here)
source(here::here("rmd_setup.R"))
## Print code by default
opts_chunk$set(echo = TRUE)
```
```{r setup2_loadlibs, echo=FALSE, include=FALSE}
## Load all of the libraries that we will use when we compile this file
## We are using the renv system. So these will all be loaded from a local library directory
library(dplyr)
library(ggplot2)
library(coin)
library(RItools)
library(optmatch)
```
## Today
\begin{enumerate}
\item Agenda: Continue to think about an adjustment strategy to enhance counterfactual causal interpretations in non-randomized studies: \\
Need to adjust $\rightarrow$ "fair comparison" $\rightarrow$
stratification $\rightarrow$ evaluation of the stratification;
How to do this with one variable and/or more than one
variable.
\item Questions arising from the reading or assignments or life?
\end{enumerate}
# But first, review:
## Review of randomization assessment in a randomized experiment
How, in principle, might one assess whether randomization produced treated and control groups with comparable covariate distributions?
```{r echo=FALSE, cache=TRUE}
load(url("http://jakebowers.org/Data/meddat.rda"))
```
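A minimal sketch of one approach (treating `nhTrt` here as if it had been randomized, purely to illustrate the mechanics): compare covariate distributions across the arms and summarize the comparison with an omnibus test. Under randomization, any imbalance should look like chance.

```{r}
## If nhTrt had been randomized, covariate differences between arms should be
## consistent with chance. xBalance reports standardized differences and an
## omnibus (chi-square) test of that hypothesis.
xbRand <- xBalance(nhTrt ~ nhAboveHS + nhPopD, data = meddat, report = "all")
xbRand$overall
```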
## Review of linear model "control for"
Did the Metrocable intervention decrease
violence in those neighborhoods? We have Homicides per 1000 people in 2008 (`HomRate08`) as a function of Metrocable.
```{r}
meddat<- mutate(meddat, HomRate03=(HomCount2003/Pop2003)*1000,
HomRate08=(HomCount2008/Pop2008)*1000)
row.names(meddat) <- meddat$nh
lmRaw <- lm(HomRate08~nhTrt,data=meddat)
coef(lmRaw)[["nhTrt"]]
```
What do we need to believe or know in order to claim that we have done a good job adjusting for the proportion with more than a high-school education below? (Concerns: extrapolation, interpolation, linearity, influential points, parallel slopes.) (Here we set aside whether we have done a good job **isolating** the causal relationship of interest and focus only on defending the claim that we have adjusted for HS education well enough.)
```{r}
lmAdj1 <- lm(HomRate08 ~ nhTrt + nhAboveHS, data=meddat)
coef(lmAdj1)["nhTrt"]
```
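Before leaning on `lmAdj1`, one might probe some of those concerns directly. A minimal sketch (illustrative checks added here, not a complete diagnostic suite; the $4/n$ cutoff for Cook's distance is only a common rule of thumb):

```{r}
## Overlap/extrapolation: do treated and control neighborhoods share a common
## range of nhAboveHS?
with(meddat, tapply(nhAboveHS, nhTrt, range))
## Influential points: flag observations with large Cook's distance.
which(cooks.distance(lmAdj1) > 4 / nrow(meddat))
## Parallel slopes: does allowing an interaction change the story?
coef(lm(HomRate08 ~ nhTrt * nhAboveHS, data = meddat))[["nhTrt"]]
```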
## Review of linear model "control for"
What about when we try to adjust for more than one variable? (All of the previous
problems, now with modeling choices that are harder to explain, plus the curse of
dimensionality; see the small illustration below.)
```{r}
lmAdj2 <- lm(HomRate08 ~ nhTrt + nhAboveHS + nhRent, data=meddat)
coef(lmAdj2)["nhTrt"]
```
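To make the curse of dimensionality concrete, here is a small illustration (added for intuition, not part of the original analysis): even coarsening just two covariates into three levels each spreads the neighborhoods thinly across cells, so many cells lack both a treated and a control unit and exact stratification quickly breaks down.

```{r}
## Cross-tabulate coarsened versions of two covariates by treatment status:
## most cells are nearly empty, so exact stratification is infeasible.
with(meddat, table(cut(nhAboveHS, 3), cut(nhRent, 3), nhTrt))
```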
# Matching on one variable to create strata
## The optmatch workflow: The distance matrix
The goal of optmatch: collect observations into strata that minimize the
within-stratum differences (compute the absolute treated-versus-control
differences within each stratum, sum them across strata, and minimize that total).
```{r}
tmp <- meddat$nhAboveHS
names(tmp) <- rownames(meddat)
absdist <- match_on(tmp, z = meddat$nhTrt,data=meddat)
absdist[1:3,1:3]
abs(meddat$nhAboveHS[meddat$nhTrt==1][1] - meddat$nhAboveHS[meddat$nhTrt==0][1] )
```
## Do the match
We will discuss the details later; for now, we merely use the software.
```{r}
fm1 <- fullmatch(absdist,data=meddat)
summary(fm1, min.controls=0, max.controls=Inf )
table(meddat$nhTrt,fm1)
pm1 <- pairmatch(absdist,data=meddat)
summary(pm1, min.controls=0, max.controls=Inf )
table(meddat$nhTrt,pm1,exclude=c())
```
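To see the quantity that `fullmatch()` is trying to minimize (the total of within-set treated-versus-control differences described above), we can add it up by hand. This is a sketch of the bookkeeping; `optmatch::matched.distances()` offers a built-in way to extract within-set distances.

```{r}
## Sum, across matched sets, of the absolute treated-vs-control differences in
## nhAboveHS: the total that fullmatch() tries to make small.
sum(sapply(split(meddat, fm1), function(d) {
  sum(abs(outer(d$nhAboveHS[d$nhTrt == 1], d$nhAboveHS[d$nhTrt == 0], "-")))
}))
```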
## Evaluate the design: Within set differences
Look within sets:
```{r echo=FALSE}
meddat$fm1 <- fm1
meddat$pm1 <- pm1
```
```{r echo=FALSE, out.width=".7\\textwidth"}
library(gridExtra)
bpfm1 <- ggplot(meddat,aes(x=fm1,y=nhAboveHS)) +
geom_boxplot() +
stat_summary(fun=mean, geom="point", shape=20, size=3, color="red", fill="red")
bporig <- ggplot(meddat,aes(y=nhAboveHS))+
geom_boxplot()
grid.arrange(bpfm1,bporig,ncol=2,layout_matrix=matrix(c(1,1,1,1,2),nrow=1))
```
## Evaluate the design: Within set differences
Look within sets:
```{r echo=FALSE, out.width=".7\\textwidth"}
bppm1 <- ggplot(meddat,aes(x=pm1,y=nhAboveHS)) +
geom_boxplot() +
stat_summary(fun=mean, geom="point", shape=20, size=3, color="red", fill="red")
bporig <- ggplot(meddat,aes(y=nhAboveHS))+
geom_boxplot()
grid.arrange(bppm1,bporig,ncol=2,layout_matrix=matrix(c(1,1,1,1,2),nrow=1))
```
## Evaluate the design: Within set differences
```{r}
rawmndiffs <- with(meddat, mean(nhAboveHS[nhTrt==1]) - mean(nhAboveHS[nhTrt==0]))
setdiffsfm1 <- meddat %>% group_by(fm1) %>% summarize(mneddiffs =
mean(nhAboveHS[nhTrt==1]) -
mean(nhAboveHS[nhTrt==0]),
mnAboveHS = mean(nhAboveHS),
minAboveHS = min(nhAboveHS),
maxAboveHS = max(nhAboveHS))
setdiffsfm1
#summary(setdiffsfm1$mneddiffs)
#quantile(setdiffsfm1$mneddiffs, seq(0,1,.1))
```
## Evaluate the design: Within set differences
```{r}
setdiffspm1 <- meddat %>% group_by(pm1) %>% summarize(mneddiffs =
mean(nhAboveHS[nhTrt==1]) -
mean(nhAboveHS[nhTrt==0]),
mnAboveHS = mean(nhAboveHS),
minAboveHS = min(nhAboveHS),
maxAboveHS = max(nhAboveHS))
setdiffspm1
```
## Evaluate the design: Compare to a randomized experiment.
How do these within-set differences compare with what we would expect from a
block-randomized experiment? `xBalance` helps us make that comparison.
```{r}
xbHS2 <- xBalance(nhTrt~nhAboveHS,
strata=list(nostrat=NULL,
fm = ~fm1,
pm = ~pm1),
data=meddat,report="all")
xbHS2$results
xbHS2$overall
```
## What is xBalance doing?
```{r}
setmeanDiffs <- meddat %>% group_by(fm1) %>%
summarise(diffAboveHS=mean(nhAboveHS[nhTrt==1])-mean(nhAboveHS[nhTrt==0]),
nb=n(),
nTb = sum(nhTrt),
nCb = sum(1-nhTrt),
hwt = ( 2*( nCb * nTb ) / (nTb + nCb))
)
setmeanDiffs
```
## What is xBalance doing with multiple sets/blocks?
The test statistic is a weighted average of the set-specific differences (the same
approach we would use to test the null hypothesis in a block-randomized experiment).
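In symbols (matching the `hwt` column in `setmeanDiffs` above), with $d_b$ the treated-minus-control mean difference in set $b$ and $n_{Tb}$, $n_{Cb}$ the numbers of treated and control units in that set:

$$h_b = \frac{2\, n_{Tb}\, n_{Cb}}{n_{Tb} + n_{Cb}}, \qquad
\hat{\Delta} = \frac{\sum_b h_b\, d_b}{\sum_b h_b}.$$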
```{r}
## The descriptive adj.mean diff from xBalance
with(setmeanDiffs, sum(diffAboveHS*nTb/sum(nTb)))
## The mean diff used as the observed value in the testing
with(setmeanDiffs, sum(diffAboveHS*hwt/sum(hwt)))
## Compare to xBalance output
xbHS2$results[,,"fm"]
```
```{r}
btHS2 <- xBalance(nhTrt~nhAboveHS, strata=list(fm1=~fm1, pm1=~pm1),
data=meddat,report="all")
btHS2
```
# Matching on Many Covariates: Using the Mahalanobis Distance
## Dimension reduction using the Mahalanobis Distance
The general idea is dimension reduction: when we convert many covariate columns into a single column of distances, we reduce the dimensions of the comparison to one.
```{r echo=FALSE}
X <- meddat[,c("nhAboveHS","nhPopD")]
plot(meddat$nhAboveHS,meddat$nhPopD,xlim=c(-.3,.6),ylim=c(50,700))
```
## Dimension reduction using the Mahalanobis Distance
First, let's look at Euclidean distance: $\sqrt{ (x_1 - x_2)^2 + (y_1 - y_2)^2 }$
```{r echo=FALSE, out.width=".8\\textwidth"}
par(mgp=c(1.25,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
plot(meddat$nhAboveHS,meddat$nhPopD,xlim=c(-.3,.6),ylim=c(50,700))
points(mean(X[,1]),mean(X[,2]),pch=19,cex=1)
arrows(mean(X[,1]),mean(X[,2]),X["407",1],X["407",2])
text(.4,200,label=round(dist(rbind(colMeans(X),X["407",])),2))
```
## Dimension reduction using the Mahalanobis Distance
First, let's look at Euclidean distance: $\sqrt{ (x_1 - x_2)^2 + (y_1 - y_2)^2 }$
```{r echo=FALSE, out.width=".4\\textwidth"}
par(mgp=c(1.25,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
plot(meddat$nhAboveHS,meddat$nhPopD,xlim=c(-.3,.6),ylim=c(50,700))
points(mean(X[,1]),mean(X[,2]),pch=19,cex=1)
arrows(mean(X[,1]),mean(X[,2]),X["407",1],X["407",2])
text(.4,200,label=round(dist(rbind(colMeans(X),X["407",])),2))
```
Euclidean distance between the centroid (the mean on both covariates) and unit "407".
```{r}
tmp <- rbind(colMeans(X),X["407",])
tmp
sqrt( (tmp[1,1] - tmp[2,1])^2 + (tmp[1,2]-tmp[2,2])^2 )
```
Problem: this overweights variables with larger scales (here, population density dominates the distance).
## Dimension reduction using the Mahalanobis Distance
Now the Euclidean distance on a standardized scale, so that neither variable dominates.
```{r echo=TRUE}
Xsd <-scale(X)
apply(Xsd,2,sd)
apply(Xsd,2,mean)
```
```{r echo=FALSE,out.width=".5\\textwidth"}
plot(Xsd[,1],Xsd[,2],xlab="nhAboveHS/sd",ylab="nhPopD/sd")
points(mean(Xsd[,1]),mean(Xsd[,2]),pch=19,cex=1)
arrows(mean(Xsd[,1]),mean(Xsd[,2]),Xsd["407",1],Xsd["407",2])
text(2,-1.2,label=round(dist(rbind(colMeans(Xsd),Xsd["407",])),2))
```
## Dimension reduction using the Mahalanobis Distance
The Mahalanobis distance avoids the scale problem of the Euclidean distance.^[For more [see here](https://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance)] In the figure below, each circle collects points at the same Mahalanobis distance from the center.
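For covariate vectors $x_i$ and $x_j$ and the sample covariance matrix $S$ of the covariates, the Mahalanobis distance is

$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{T} S^{-1} (x_i - x_j)},$$

i.e., a Euclidean distance computed after the covariates have been rescaled and de-correlated. (Note that R's `mahalanobis()`, used below, returns the squared distance $d_M^2$.)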
```{r mhfig, echo=FALSE,out.width=".6\\textwidth"}
library(chemometrics)
par(mgp=c(1.5,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
mh <- mahalanobis(X,center=colMeans(X),cov=cov(X))
drawMahal(X,center=colMeans(X),covariance=cov(X),
quantile = c(0.975, 0.75, 0.5, 0.25))
abline(v=mean(meddat$nhAboveHS),h=mean(meddat$nhPopD))
pts <-c("401","407","411","202")
arrows(rep(mean(X[,1]),4),rep(mean(X[,2]),4),X[pts,1],X[pts,2])
text(X[pts,1],X[pts,2],labels=round(mh[pts],2),pos=1)
```
```{r}
Xsd <- scale(X)
tmp<-rbind(c(0,0),Xsd["407",])
mahalanobis(tmp,center=c(0,0),cov=cov(Xsd)) ## squared Mahalanobis distances; compare with the Euclidean distances above
```
## Dimension reduction using the Mahalanobis Distance
```{r echo=FALSE, out.width=".6\\textwidth"}
plot(Xsd[,1],Xsd[,2],xlab="nhAboveHS/sd",ylab="nhPopD/sd")
```
## Dimension reduction using the Mahalanobis Distance
```{r out.width=".6\\textwidth"}
drawMahal(X,center=colMeans(X),covariance=cov(X),quantile=c(.1,.2,.5,.6))
```
## Dimension reduction using the Mahalanobis Distance
The contours should be more circular if the covariates had no covariance:
```{r echo=TRUE}
covX <- cov(X)
newcovX <- covX
newcovX[1,2] <- 0
newcovX[2,1] <- 0
```
```{r out.width=".6\\textwidth"}
drawMahal(X,center=colMeans(X),covariance=newcovX,quantile=c(.1,.2))
```
## Matching on the Mahalanobis Distance
Here we use the rank-based Mahalanobis distance following DOS Chap. 8 (rank each covariate before computing the distance, which limits the influence of outliers and of variance differences), comparing it to the ordinary version.
```{r}
mhdist <- match_on(nhTrt~nhPopD+nhAboveHS,data=meddat,method="rank_mahalanobis")
mhdist[1:3,1:3]
mhdist2 <- match_on(nhTrt~nhPopD+nhAboveHS,data=meddat)
mhdist2[1:3,1:3]
mhdist2[,"407"]
```
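A simplified sketch of the rank-based idea (added for intuition; it is not identical to `match_on`'s `rank_mahalanobis`, which also rescales the covariance of the ranks, so the numbers will not agree exactly): rank each covariate first, then compute an ordinary Mahalanobis distance on the ranks.

```{r}
## Rank each covariate, then compute Mahalanobis distances on the ranks from
## control unit "407" to three treated units.
Xrank <- apply(meddat[, c("nhPopD", "nhAboveHS")], 2, rank)
rownames(Xrank) <- rownames(meddat)
sqrt(mahalanobis(Xrank, center = Xrank["407", ], cov = cov(Xrank)))[c("101", "102", "202")]
## Compare (roughly) with the rank-based distances from match_on:
mhdist[c("101", "102", "202"), "407"]
```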
## Matching on the Mahalanobis Distance
```{r echo=FALSE}
par(mgp=c(1.5,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
drawMahal(X,center=colMeans(X),covariance=cov(X),
quantile = c(0.975, 0.75, 0.5, 0.25))
abline(v=mean(meddat$nhAboveHS),h=mean(meddat$nhPopD))
cpts <-c("401","407","411")
tpts <-c("101","102","202")
arrows(X[tpts,1],X[tpts,2],rep(X["407",1]),rep(X["407",2]))
text(X[tpts,1],X[tpts,2],labels=round(mhdist2[tpts,"407"],2),pos=1)
mhdist2[tpts,"407"]
```
## Matching on the Mahalanobis Distance
```{r}
fmMh <- fullmatch(mhdist,data=meddat)
summary(fmMh,min.controls=0,max.controls=Inf)
## Require no more than one treated in each set
fmMh1 <- fullmatch(mhdist,data=meddat,min.controls=1)
summary(fmMh1,min.controls=0,max.controls=Inf)
```
```{r}
xbMh <- xBalance(nhTrt~nhAboveHS+nhPopD,strata=list(unstrat=NULL,fmMh=~fmMh,fmMh1=~fmMh1),report="all",data=meddat)
xbMh
```
# Matching on Many Covariates: Using Propensity Scores
## Matching on the propensity score
**Make the score**^[Note that we will be using `brglm` or `bayesglm` in the
future because of logit separation problems when the number of covariates
increases.]
```{r}
theglm <- glm(nhTrt~nhPopD+nhAboveHS+HomRate03,data=meddat,family=binomial(link="logit"))
thepscore <- theglm$linear.predictors
thepscore01 <- predict(theglm,type="response")
```
We tend to match on the linear predictor rather than on the predicted probability,
which is constrained to lie between 0 and 1 and so compresses differences near the boundaries.
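As a reminder of the two scales for the logit link, with $x_i^{T}\hat\beta$ the linear predictor:

$$\hat\pi_i = g^{-1}(x_i^{T}\hat\beta) = \frac{1}{1 + e^{-x_i^{T}\hat\beta}},$$

so differences that are large on the linear-predictor scale can look tiny once $\hat\pi_i$ is pushed toward 0 or 1.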
```{r echo=FALSE, out.width=".7\\textwidth"}
par(mfrow=c(1,2),oma=rep(0,4),mar=c(3,3,2,0),mgp=c(1.5,.5,0))
boxplot(split(thepscore,meddat$nhTrt),main="Linear Predictor (XB)")
stripchart(split(thepscore,meddat$nhTrt),add=TRUE,vertical=TRUE)
boxplot(split(thepscore01,meddat$nhTrt),main="Inverse Link Function (g^-1(XB))")
stripchart(split(thepscore01,meddat$nhTrt),add=TRUE,vertical=TRUE)
```
## Matching on the propensity score
```{r}
psdist <- match_on(theglm,data=meddat)
psdist[1:4,1:4]
fmPs <- fullmatch(psdist,data=meddat)
summary(fmPs,min.controls=0,max.controls=Inf)
```
## Propensity score versus Mahalanobis score
- The **Mahalanobis distance** measures distance from the center of the
  distribution of all covariates; distances between units are distances in $X$
  space. The rank-based Mahalanobis distance aims to weight all covariates
  equally (or at least not to overweight any covariate merely because of
  variance differences). (See @rosenbaum2010, Chap. 8.)
- The **Propensity score distance** uses distance between propensity scores
--- which are weighted sums of covariates. The weights are selected so that
covariates that relate more to treatment are weighted more, and covariates
that relate less to treatment are weighted less.
Overall goal: Break the $X \rightarrow Z$ relationship to isolate $Z
\rightarrow Y$ from $X$ (where $X$ is the matrix of **observed covariates**).
## Can you do better?
**Challenge:** Improve the matched design by adding covariates or functions of
covariates, using either or both of the propensity score and the Mahalanobis
distance (rank-based or not). So far we have:
```{r}
thecovs <- unique(c(names(meddat)[c(5:7,9:24)],"HomRate03"))
balfmla<-reformulate(thecovs,response="nhTrt")
xb5 <- xBalance(balfmla,strata=list(unstrat=NULL,fmMh=~fmMh,fmPs=~fmPs),
data=meddat,report="all")
xb5$overall
```
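One direction to try (a sketch, not a recommendation: the caliper width of 2 standardized units is an arbitrary choice for illustration): combine the rank-based Mahalanobis distance with a propensity-score caliper and re-check balance against the earlier designs. If the caliper is too tight, `fullmatch()` will fail or drop units, and the width should be loosened.

```{r}
## Mahalanobis distance plus a propensity-score caliper, then a balance check.
fmComb <- fullmatch(mhdist + caliper(psdist, 2), data = meddat)
summary(fmComb, min.controls = 0, max.controls = Inf)
xbComb <- xBalance(balfmla,
  strata = list(fmMh = ~fmMh, fmPs = ~fmPs, fmComb = ~fmComb),
  data = meddat, report = "all")
xbComb$overall
```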
## Can you do better?
Challenge: Improve the matched design. So far we have:
```{r}
plot(xb5)
```
## Summary:
What do you think?
- Statistical adjustment with linear regression models is hard to justify.
- Stratification via matching is easier to justify and assess (and describe).
- Matching makes the comparison transparent and checkable. (Question: "Did you
  adjust enough for X?" Answer: "Here is some evidence about how well I did.")
- You can adjust for one variable or more than one (if more than one, you need
to choose one or more methods for reducing many columns to one column).
- The workflow involves creating a distance matrix, asking an algorithm to
  find the configuration of sets that minimizes the distances within sets, and
  checking balance. (Eventually, we will also attend to the effective sample
  size.)
- Next week: more on the differences between full matching, optimal matching,
  greedy matching, and matching with and without replacement. (Also: handling
  missing data, calipers, and other ways to improve a design.)
- Next: how to estimate causal effects and test causal hypotheses with a
  matched design. (Although I bet you can already guess how to do that, given
  our discussions about estimation and testing with stratified experiments.)
## References