---
title: |
| "Statistical Adjustment and Assessment of Adjustment in Observational Studies --- Matched Stratification for One and Multiple Variables."
date: '`r format(Sys.Date(), "%B %d, %Y")`'
author: |
| ICPSR 2021 Session 1
| Jake Bowers, Ben Hansen, Tom Leavitt
bibliography:
- 'BIB/MasterBibliography.bib'
- 'BIB/master.bib'
- 'BIB/misc.bib'
- 'BIB/refs.bib'
fontsize: 10pt
geometry: margin=1in
graphics: yes
biblio-style: authoryear-comp
colorlinks: true
biblatexoptions:
- natbib=true
output:
beamer_presentation:
slide_level: 2
keep_tex: true
latex_engine: xelatex
citation_package: biblatex
template: icpsr.beamer
incremental: true
includes:
in_header:
- defs-all.sty
md_extensions: +raw_attribute-tex_math_single_backslash+autolink_bare_uris+ascii_identifiers+tex_math_dollars
pandoc_args: [ "--csl", "chicago-author-date.csl" ]
---
<!-- To show notes -->
<!-- https://stackoverflow.com/questions/44906264/add-speaker-notes-to-beamer-presentations-using-rmarkdown -->
```{r setup1_env, echo=FALSE, include=FALSE}
library(here)
source(here::here("rmd_setup.R"))
## Print code by default
opts_chunk$set(echo = TRUE)
```
```{r setup2_loadlibs, echo=FALSE, include=FALSE}
## Load all of the libraries that we will use when we compile this file
## We are using the renv system. So these will all be loaded from a local library directory
library(dplyr)
library(ggplot2)
library(coin)
library(RItools)
library(optmatch)
```
## Today
\begin{enumerate}
\item Agenda: Continue to think about an adjustment strategy to enhance counterfactual causal interpretations in non-randomized studies: \\
Need to adjust $\rightarrow$ "fair comparison" $\rightarrow$
stratification $\rightarrow$ evaluation of the stratification;
How to do this with one variable and/or more than one
variable.
\item Questions arising from the reading or assignments or life?
\end{enumerate}
# But first, review:
## Review of randomization assessment in a randomized experiment
How, in principle, might one assess whether randomization produced treated and control groups with comparable covariate distributions?
```{r echo=FALSE, cache=TRUE}
load(url("http://jakebowers.org/Data/meddat.rda"))
```
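A minimal sketch of one approach (treating `nhTrt` here as if it had been randomized, purely to illustrate the mechanics): compare covariate distributions across the arms and summarize the comparison with an omnibus test. Under randomization, any imbalance should look like chance.

```{r}
## If nhTrt had been randomized, covariate differences between arms should be
## consistent with chance. xBalance reports standardized differences and an
## omnibus (chi-square) test of that hypothesis.
xbRand <- xBalance(nhTrt ~ nhAboveHS + nhPopD, data = meddat, report = "all")
xbRand$overall
```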
## Review of linear model "control for"
Did the Metrocable intervention decrease
violence in those neighborhoods? We have Homicides per 1000 people in 2008 (`HomRate08`) as a function of Metrocable.
```{r}
meddat<- mutate(meddat, HomRate03=(HomCount2003/Pop2003)*1000,
HomRate08=(HomCount2008/Pop2008)*1000)
row.names(meddat) <- meddat$nh
lmRaw <- lm(HomRate08~nhTrt,data=meddat)
coef(lmRaw)[["nhTrt"]]
```
What do we need to believe or know in order to claim that we have done a good job adjusting for the proportion with more than a high-school education below? (Concerns: extrapolation, interpolation, linearity, influential points, parallel slopes.) (Here we set aside whether we have done a good job **isolating** the causal relationship of interest and focus only on defending the claim that we have adjusted for HS education well enough.)
```{r}
lmAdj1 <- lm(HomRate08 ~ nhTrt + nhAboveHS, data=meddat)
coef(lmAdj1)["nhTrt"]
```
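Before leaning on `lmAdj1`, one might probe some of those concerns directly. A minimal sketch (illustrative checks added here, not a complete diagnostic suite; the $4/n$ cutoff for Cook's distance is only a common rule of thumb):

```{r}
## Overlap/extrapolation: do treated and control neighborhoods share a common
## range of nhAboveHS?
with(meddat, tapply(nhAboveHS, nhTrt, range))
## Influential points: flag observations with large Cook's distance.
which(cooks.distance(lmAdj1) > 4 / nrow(meddat))
## Parallel slopes: does allowing an interaction change the story?
coef(lm(HomRate08 ~ nhTrt * nhAboveHS, data = meddat))[["nhTrt"]]
```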
## Review of linear model "control for"
What about when we try to adjust for more than one variable? (All of the previous
problems, now with modeling choices that are harder to explain, plus the curse of
dimensionality; see the small illustration below.)
```{r}
lmAdj2 <- lm(HomRate08 ~ nhTrt + nhAboveHS + nhRent, data=meddat)
coef(lmAdj2)["nhTrt"]
```
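To make the curse of dimensionality concrete, here is a small illustration (added for intuition, not part of the original analysis): even coarsening just two covariates into three levels each spreads the neighborhoods thinly across cells, so many cells lack both a treated and a control unit and exact stratification quickly breaks down.

```{r}
## Cross-tabulate coarsened versions of two covariates by treatment status:
## most cells are nearly empty, so exact stratification is infeasible.
with(meddat, table(cut(nhAboveHS, 3), cut(nhRent, 3), nhTrt))
```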
# Matching on one variable to create strata
## The optmatch workflow: The distance matrix
The goal of optmatch: collect observations into strata that minimize the
within-stratum differences (compute the absolute treated-versus-control
differences within each stratum, sum them across strata, and minimize that total).
```{r}
tmp <- meddat$nhAboveHS
names(tmp) <- rownames(meddat)
absdist <- match_on(tmp, z = meddat$nhTrt,data=meddat)
absdist[1:3,1:3]
abs(meddat$nhAboveHS[meddat$nhTrt==1][1] - meddat$nhAboveHS[meddat$nhTrt==0][1] )
```
## Do the match
We will discuss the details later; for now, we merely use the software.
```{r}
fm1 <- fullmatch(absdist,data=meddat)
summary(fm1, min.controls=0, max.controls=Inf )
table(meddat$nhTrt,fm1)
pm1 <- pairmatch(absdist,data=meddat)
summary(pm1, min.controls=0, max.controls=Inf )
table(meddat$nhTrt,pm1,exclude=c())
```
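To see the quantity that `fullmatch()` is trying to minimize (the total of within-set treated-versus-control differences described above), we can add it up by hand. This is a sketch of the bookkeeping; `optmatch::matched.distances()` offers a built-in way to extract within-set distances.

```{r}
## Sum, across matched sets, of the absolute treated-vs-control differences in
## nhAboveHS: the total that fullmatch() tries to make small.
sum(sapply(split(meddat, fm1), function(d) {
  sum(abs(outer(d$nhAboveHS[d$nhTrt == 1], d$nhAboveHS[d$nhTrt == 0], "-")))
}))
```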
## Evaluate the design: Within set differences
Look within sets:
```{r echo=FALSE}
meddat$fm1 <- fm1
meddat$pm1 <- pm1
```
```{r echo=FALSE, out.width=".7\\textwidth"}
library(gridExtra)
bpfm1 <- ggplot(meddat,aes(x=fm1,y=nhAboveHS)) +
geom_boxplot() +
stat_summary(fun=mean, geom="point", shape=20, size=3, color="red", fill="red")
bporig <- ggplot(meddat,aes(y=nhAboveHS))+
geom_boxplot()
grid.arrange(bpfm1,bporig,ncol=2,layout_matrix=matrix(c(1,1,1,1,2),nrow=1))
```
## Evaluate the design: Within set differences
Look within sets:
```{r echo=FALSE, out.width=".7\\textwidth"}
bppm1 <- ggplot(meddat,aes(x=pm1,y=nhAboveHS)) +
geom_boxplot() +
stat_summary(fun=mean, geom="point", shape=20, size=3, color="red", fill="red")
bporig <- ggplot(meddat,aes(y=nhAboveHS))+
geom_boxplot()
grid.arrange(bppm1,bporig,ncol=2,layout_matrix=matrix(c(1,1,1,1,2),nrow=1))
```
## Evaluate the design: Within set differences
```{r}
rawmndiffs <- with(meddat, mean(nhAboveHS[nhTrt==1]) - mean(nhAboveHS[nhTrt==0]))
setdiffsfm1 <- meddat %>% group_by(fm1) %>% summarize(mneddiffs =
mean(nhAboveHS[nhTrt==1]) -
mean(nhAboveHS[nhTrt==0]),
mnAboveHS = mean(nhAboveHS),
minAboveHS = min(nhAboveHS),
maxAboveHS = max(nhAboveHS))
setdiffsfm1
#summary(setdiffsfm1$mneddiffs)
#quantile(setdiffsfm1$mneddiffs, seq(0,1,.1))
```
## Evaluate the design: Within set differences
```{r}
setdiffspm1 <- meddat %>% group_by(pm1) %>% summarize(mneddiffs =
mean(nhAboveHS[nhTrt==1]) -
mean(nhAboveHS[nhTrt==0]),
mnAboveHS = mean(nhAboveHS),
minAboveHS = min(nhAboveHS),
maxAboveHS = max(nhAboveHS))
setdiffspm1
```
## Evaluate the design: Compare to a randomized experiment.
How do these within-set differences compare with what we would expect from a
block-randomized experiment? `xBalance` helps us make that comparison.
```{r}
xbHS2 <- xBalance(nhTrt~nhAboveHS,
strata=list(nostrat=NULL,
fm = ~fm1,
pm = ~pm1),
data=meddat,report="all")
xbHS2$results
xbHS2$overall
```
## What is xBalance doing?
```{r}
setmeanDiffs <- meddat %>% group_by(fm1) %>%
summarise(diffAboveHS=mean(nhAboveHS[nhTrt==1])-mean(nhAboveHS[nhTrt==0]),
nb=n(),
nTb = sum(nhTrt),
nCb = sum(1-nhTrt),
hwt = ( 2*( nCb * nTb ) / (nTb + nCb))
)
setmeanDiffs
```
## What is xBalance doing with multiple sets/blocks?
The test statistic is a weighted average of the set-specific differences (the same
approach we would use to test the null hypothesis in a block-randomized experiment).
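In symbols (matching the `hwt` column in `setmeanDiffs` above), with $d_b$ the treated-minus-control mean difference in set $b$ and $n_{Tb}$, $n_{Cb}$ the numbers of treated and control units in that set:

$$h_b = \frac{2\, n_{Tb}\, n_{Cb}}{n_{Tb} + n_{Cb}}, \qquad
\hat{\Delta} = \frac{\sum_b h_b\, d_b}{\sum_b h_b}.$$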
```{r}
## The descriptive adj.mean diff from xBalance
with(setmeanDiffs, sum(diffAboveHS*nTb/sum(nTb)))
## The mean diff used as the observed value in the testing
with(setmeanDiffs, sum(diffAboveHS*hwt/sum(hwt)))
## Compare to xBalance output
xbHS2$results[,,"fm"]
```
```{r}
btHS2 <- xBalance(nhTrt~nhAboveHS, strata=list(fm1=~fm1, pm1=~pm1),
data=meddat,report="all")
btHS2
```
# Matching on Many Covariates: Using the Mahalanobis Distance
## Dimension reduction using the Mahalanobis Distance
The general idea is dimension reduction: when we convert many covariate columns into a single column of distances, we reduce the dimensions of the comparison to one.
```{r echo=FALSE}
X <- meddat[,c("nhAboveHS","nhPopD")]
plot(meddat$nhAboveHS,meddat$nhPopD,xlim=c(-.3,.6),ylim=c(50,700))
```
## Dimension reduction using the Mahalanobis Distance
First, let's look at Euclidean distance: $\sqrt{ (x_1 - x_2)^2 + (y_1 - y_2)^2 }$
```{r echo=FALSE, out.width=".8\\textwidth"}
par(mgp=c(1.25,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
plot(meddat$nhAboveHS,meddat$nhPopD,xlim=c(-.3,.6),ylim=c(50,700))
points(mean(X[,1]),mean(X[,2]),pch=19,cex=1)
arrows(mean(X[,1]),mean(X[,2]),X["407",1],X["407",2])
text(.4,200,label=round(dist(rbind(colMeans(X),X["407",])),2))
```
## Dimension reduction using the Mahalanobis Distance
First, let's look at Euclidean distance: $\sqrt{ (x_1 - x_2)^2 + (y_1 - y_2)^2 }$
```{r echo=FALSE, out.width=".4\\textwidth"}
par(mgp=c(1.25,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
plot(meddat$nhAboveHS,meddat$nhPopD,xlim=c(-.3,.6),ylim=c(50,700))
points(mean(X[,1]),mean(X[,2]),pch=19,cex=1)
arrows(mean(X[,1]),mean(X[,2]),X["407",1],X["407",2])
text(.4,200,label=round(dist(rbind(colMeans(X),X["407",])),2))
```
Euclidean distance between the centroid (the mean on both covariates) and unit "407".
```{r}
tmp <- rbind(colMeans(X),X["407",])
tmp
sqrt( (tmp[1,1] - tmp[2,1])^2 + (tmp[1,2]-tmp[2,2])^2 )
```
Problem: this overweights variables with larger scales (here, population density dominates the distance).
## Dimension reduction using the Mahalanobis Distance
Now the Euclidean distance on a standardized scale, so that neither variable dominates.
```{r echo=TRUE}
Xsd <-scale(X)
apply(Xsd,2,sd)
apply(Xsd,2,mean)
```
```{r echo=FALSE,out.width=".5\\textwidth"}
plot(Xsd[,1],Xsd[,2],xlab="nhAboveHS/sd",ylab="nhPopD/sd")
points(mean(Xsd[,1]),mean(Xsd[,2]),pch=19,cex=1)
arrows(mean(Xsd[,1]),mean(Xsd[,2]),Xsd["407",1],Xsd["407",2])
text(2,-1.2,label=round(dist(rbind(colMeans(Xsd),Xsd["407",])),2))
```
## Dimension reduction using the Mahalanobis Distance
The Mahalanobis distance avoids the scale problem of the Euclidean distance.^[For more [see here](https://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance)] In the figure below, each circle collects points at the same Mahalanobis distance from the center.
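For covariate vectors $x_i$ and $x_j$ and the sample covariance matrix $S$ of the covariates, the Mahalanobis distance is

$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{T} S^{-1} (x_i - x_j)},$$

i.e., a Euclidean distance computed after the covariates have been rescaled and de-correlated. (Note that R's `mahalanobis()`, used below, returns the squared distance $d_M^2$.)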
```{r mhfig, echo=FALSE,out.width=".6\\textwidth"}
library(chemometrics)
par(mgp=c(1.5,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
mh <- mahalanobis(X,center=colMeans(X),cov=cov(X))
drawMahal(X,center=colMeans(X),covariance=cov(X),
quantile = c(0.975, 0.75, 0.5, 0.25))
abline(v=mean(meddat$nhAboveHS),h=mean(meddat$nhPopD))
pts <-c("401","407","411","202")
arrows(rep(mean(X[,1]),4),rep(mean(X[,2]),4),X[pts,1],X[pts,2])
text(X[pts,1],X[pts,2],labels=round(mh[pts],2),pos=1)
```
```{r}
Xsd <- scale(X)
tmp<-rbind(c(0,0),Xsd["407",])
mahalanobis(tmp,center=c(0,0),cov=cov(Xsd)) ## squared Mahalanobis distances; compare with the Euclidean distances above
```
## Dimension reduction using the Mahalanobis Distance
```{r echo=FALSE, out.width=".6\\textwidth"}
plot(Xsd[,1],Xsd[,2],xlab="nhAboveHS/sd",ylab="nhPopD/sd")
```
## Dimension reduction using the Mahalanobis Distance
```{r out.width=".6\\textwidth"}
drawMahal(X,center=colMeans(X),covariance=cov(X),quantile=c(.1,.2,.5,.6))
```
## Dimension reduction using the Mahalanobis Distance
The contours should be more circular if the covariates had no covariance:
```{r echo=TRUE}
covX <- cov(X)
newcovX <- covX
newcovX[1,2] <- 0
newcovX[2,1] <- 0
```
```{r out.width=".6\\textwidth"}
drawMahal(X,center=colMeans(X),covariance=newcovX,quantile=c(.1,.2))
```
## Matching on the Mahalanobis Distance
Here we use the rank-based Mahalanobis distance following DOS Chap. 8 (rank each covariate before computing the distance, which limits the influence of outliers and of variance differences), comparing it to the ordinary version.
```{r}
mhdist <- match_on(nhTrt~nhPopD+nhAboveHS,data=meddat,method="rank_mahalanobis")
mhdist[1:3,1:3]
mhdist2 <- match_on(nhTrt~nhPopD+nhAboveHS,data=meddat)
mhdist2[1:3,1:3]
mhdist2[,"407"]
```
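A simplified sketch of the rank-based idea (added for intuition; it is not identical to `match_on`'s `rank_mahalanobis`, which also rescales the covariance of the ranks, so the numbers will not agree exactly): rank each covariate first, then compute an ordinary Mahalanobis distance on the ranks.

```{r}
## Rank each covariate, then compute Mahalanobis distances on the ranks from
## control unit "407" to three treated units.
Xrank <- apply(meddat[, c("nhPopD", "nhAboveHS")], 2, rank)
rownames(Xrank) <- rownames(meddat)
sqrt(mahalanobis(Xrank, center = Xrank["407", ], cov = cov(Xrank)))[c("101", "102", "202")]
## Compare (roughly) with the rank-based distances from match_on:
mhdist[c("101", "102", "202"), "407"]
```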
## Matching on the Mahalanobis Distance
```{r echo=FALSE}
par(mgp=c(1.5,.5,0),oma=rep(0,4),mar=c(3,3,0,0))
drawMahal(X,center=colMeans(X),covariance=cov(X),
quantile = c(0.975, 0.75, 0.5, 0.25))
abline(v=mean(meddat$nhAboveHS),h=mean(meddat$nhPopD))
cpts <-c("401","407","411")
tpts <-c("101","102","202")
arrows(X[tpts,1],X[tpts,2],rep(X["407",1]),rep(X["407",2]))
text(X[tpts,1],X[tpts,2],labels=round(mhdist2[tpts,"407"],2),pos=1)
mhdist2[tpts,"407"]
```
## Matching on the Mahalanobis Distance
```{r}
fmMh <- fullmatch(mhdist,data=meddat)
summary(fmMh,min.controls=0,max.controls=Inf)
## Require no more than one treated in each set
fmMh1 <- fullmatch(mhdist,data=meddat,min.controls=1)
summary(fmMh1,min.controls=0,max.controls=Inf)
```
```{r}
xbMh <- xBalance(nhTrt~nhAboveHS+nhPopD,strata=list(unstrat=NULL,fmMh=~fmMh,fmMh1=~fmMh1),report="all",data=meddat)
xbMh
```
# Matching on Many Covariates: Using Propensity Scores
## Matching on the propensity score
**Make the score**^[Note that we will be using `brglm` or `bayesglm` in the
future because of logit separation problems when the number of covariates
increases.]
```{r}
theglm <- glm(nhTrt~nhPopD+nhAboveHS+HomRate03,data=meddat,family=binomial(link="logit"))
thepscore <- theglm$linear.predictors
thepscore01 <- predict(theglm,type="response")
```
We tend to match on the linear predictor rather than on the predicted probability,
which is constrained to lie between 0 and 1 and so compresses differences near the boundaries.
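As a reminder of the two scales for the logit link, with $x_i^{T}\hat\beta$ the linear predictor:

$$\hat\pi_i = g^{-1}(x_i^{T}\hat\beta) = \frac{1}{1 + e^{-x_i^{T}\hat\beta}},$$

so differences that are large on the linear-predictor scale can look tiny once $\hat\pi_i$ is pushed toward 0 or 1.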
```{r echo=FALSE, out.width=".7\\textwidth"}
par(mfrow=c(1,2),oma=rep(0,4),mar=c(3,3,2,0),mgp=c(1.5,.5,0))
boxplot(split(thepscore,meddat$nhTrt),main="Linear Predictor (XB)")
stripchart(split(thepscore,meddat$nhTrt),add=TRUE,vertical=TRUE)
boxplot(split(thepscore01,meddat$nhTrt),main="Inverse Link Function (g^-1(XB))")
stripchart(split(thepscore01,meddat$nhTrt),add=TRUE,vertical=TRUE)
```
## Matching on the propensity score
```{r}
psdist <- match_on(theglm,data=meddat)
psdist[1:4,1:4]
fmPs <- fullmatch(psdist,data=meddat)
summary(fmPs,min.controls=0,max.controls=Inf)
```
## Propensity score versus Mahalanobis score
- The **Mahalanobis distance** measures distance from the center of the
  distribution of all covariates; distances between units are distances in $X$
  space. The rank-based Mahalanobis distance aims to weight all covariates
  equally (or at least not to overweight any covariate merely because of
  variance differences). (See @rosenbaum2010, Chap. 8.)
- The **Propensity score distance** uses distance between propensity scores
--- which are weighted sums of covariates. The weights are selected so that
covariates that relate more to treatment are weighted more, and covariates
that relate less to treatment are weighted less.
Overall goal: Break the $X \rightarrow Z$ relationship to isolate $Z
\rightarrow Y$ from $X$ (where $X$ is the matrix of **observed covariates**).
## Can you do better?
**Challenge:** Improve the matched design by adding covariates or functions of
covariates, using either or both of the propensity score and the Mahalanobis
distance (rank-based or not). So far we have:
```{r}
thecovs <- unique(c(names(meddat)[c(5:7,9:24)],"HomRate03"))
balfmla<-reformulate(thecovs,response="nhTrt")
xb5 <- xBalance(balfmla,strata=list(unstrat=NULL,fmMh=~fmMh,fmPs=~fmPs),
data=meddat,report="all")
xb5$overall
```
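One direction to try (a sketch, not a recommendation: the caliper width of 2 standardized units is an arbitrary choice for illustration): combine the rank-based Mahalanobis distance with a propensity-score caliper and re-check balance against the earlier designs. If the caliper is too tight, `fullmatch()` will fail or drop units, and the width should be loosened.

```{r}
## Mahalanobis distance plus a propensity-score caliper, then a balance check.
fmComb <- fullmatch(mhdist + caliper(psdist, 2), data = meddat)
summary(fmComb, min.controls = 0, max.controls = Inf)
xbComb <- xBalance(balfmla,
  strata = list(fmMh = ~fmMh, fmPs = ~fmPs, fmComb = ~fmComb),
  data = meddat, report = "all")
xbComb$overall
```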
## Can you do better?
Challenge: Improve the matched design. So far we have:
```{r}
plot(xb5)
```
## Summary:
What do you think?
- Statistical adjustment with linear regression models is hard to justify.
- Stratification via matching is easier to justify and assess (and describe).
- Matching makes the comparison transparent and checkable. (Question: "Did you
  adjust enough for X?" Answer: "Here is some evidence about how well I did.")
- You can adjust for one variable or more than one (if more than one, you need
to choose one or more methods for reducing many columns to one column).
- The workflow involves creating a distance matrix, asking an algorithm to
  find the configuration of sets that minimizes the distances within sets, and
  checking balance. (Eventually, we will also attend to the effective sample
  size.)
- Next week: more on the differences between full matching, optimal matching,
  greedy matching, and matching with and without replacement. (Also: handling
  missing data, calipers, and other ways to improve a design.)
- Next: how to estimate causal effects and test causal hypotheses with a
  matched design. (Although I bet you can already guess how to do that, given
  our discussions about estimation and testing with stratified experiments.)
## References