-
Notifications
You must be signed in to change notification settings - Fork 6
/
day12-Scores2.Rmd
359 lines (279 loc) · 12.7 KB
/
day12-Scores2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
title: |
| Statistical Adjustment in Observational Studies,
| Matched Stratification for Multiple Variables.
date: '`r format(Sys.Date(), "%B %d, %Y")`'
author: |
| ICPSR 2023 Session 1
| Jake Bowers, Ben Hansen, Tom Leavitt
bibliography:
- 'BIB/MasterBibliography.bib'
fontsize: 10pt
geometry: margin=1in
graphics: yes
biblio-style: authoryear-comp
biblatexoptions:
- natbib=true
output:
beamer_presentation:
slide_level: 2
keep_tex: true
latex_engine: xelatex
citation_package: biblatex
template: icpsr.beamer
incremental: true
includes:
in_header:
- defs-all.sty
md_extensions: +raw_attribute-tex_math_single_backslash+autolink_bare_uris+ascii_identifiers+tex_math_dollars
pandoc_args: [ "--csl", "chicago-author-date.csl" ]
---
<!-- To show notes -->
<!-- https://stackoverflow.com/questions/44906264/add-speaker-notes-to-beamer-presentations-using-rmarkdown -->
```{r setup1_env, echo=FALSE, include=FALSE}
library(here)
source(here::here("rmd_setup.R"))
```
```{r setup2_loadlibs, echo=FALSE, include=FALSE}
## Load all of the libraries that we will use when we compile this file
## We are using the renv system. So these will all be loaded from a local library directory
library(dplyr)
library(ggplot2)
library(coin)
library(RItools)
library(optmatch)
library(estimatr)
```
```{r echo=FALSE, cache=TRUE}
load(url("http://jakebowers.org/Data/meddat.rda"))
meddat <- mutate(meddat,
HomRate03 = (HomCount2003 / Pop2003) * 1000,
HomRate08 = (HomCount2008 / Pop2008) * 1000
)
row.names(meddat) <- meddat$nh
```
## Today
1. Agenda: Discuss methods to adjust for more than one background covariate at the same time (two examples: Mahalanobis distances, Propensity distances). Discuss greedy versus optimal matching, with and without replacement to develop a bit more intuition about "matching" versus simple "stratification": we are creating stratified research designs but we are deploying algorithms to build up strata rather than imposing those strata (i.e. divisions of one or more variables) from the start.
2. We are now moving into the "Observational Studies" or Matching part of the course (see the reading there).
3. Questions arising from the reading or assignments or life?
# But first, review
## Matching on one variable.
Two approaches so far:
1. Simple stratification by hand.
```{r simpstrat, echo=TRUE}
meddat$nhAboveHScut3 <- with(meddat,cut(nhAboveHS,breaks=quantile(nhAboveHS,c(0,.75,1)),include.lowest=TRUE))
with(meddat,table(nhTrt,nhAboveHScut3,exclude=c()))
## lm1cut3 <- lm_robust(HomRate08~nhTrt,fixed_effects=~nhAboveHScut3,data=meddat)
```
2. Create strata by solving an optimal matching problem:
a) Represent relationships among units as pairwise "distances" or "differences"
```{r dist1, echo=TRUE}
tmp <- meddat$nhAboveHS
names(tmp) <- rownames(meddat)
absdist <- match_on(tmp, z = meddat$nhTrt,data=meddat)
absdist[1:3,1:3] ## first 3 treated and 3 control neighborhoods
```
## Matching on one variable.
b) Find an optimal collection of sets
```{r fm1, echo=TRUE}
fm1 <- fullmatch(absdist,data=meddat,min.controls=.5)
summary(fm1, min.controls=0, max.controls=Inf)
meddat$fm1 <- fm1
with(meddat,table(nhTrt,fm1))
```
## Compare the two approaches: minimizing distances?
A good stratified design creates homogeneous sets on the focal variables. What are the within set differences in `nhAboveHS`?
```{r comparedists, echo=TRUE}
setdists1 <- meddat %>% group_by(nhAboveHScut3) %>% summarize(covdiff=mean(nhAboveHS[nhTrt==1]) - mean(nhAboveHS[nhTrt==0]),.groups="keep") %>% arrange(desc(abs(covdiff)))
setdists2 <- meddat %>% group_by(fm1) %>% summarize(covdiff=mean(nhAboveHS[nhTrt==1]) - mean(nhAboveHS[nhTrt==0]),.groups="keep") %>% arrange(desc(abs(covdiff)))
setdists1
mean(setdists1$covdiff)
summary(setdists2$covdiff)
```
## Compare the two approaches: equalizing distributions?
Looks like `fullmatch` does a better job than me in these terms.
```{r xb1}
xb1 <- xBalance(nhTrt~nhAboveHS,strata=list(unstrat=NULL,nhAboveHScut3 = ~nhAboveHScut3 ,fm1=~fm1),data=meddat,report="all")
xb1$results
xb1$results[,"adj.diff",]
xb1$overall
```
## Dimension reduction using the Mahalanobis Distance
The general idea: dimension reduction. When we convert many columns into one column we reduce the dimensions of the dataset (to one column). If using distance matrices and optimal matching allows us to side-step the problem of choosing cut-points for stratification, we can use the idea of **multivariate distance** to produce distance matrices to minimize **multivariate distances**.
Here is some fake data:
```{r mhexample, out.width=".6\\textwidth"}
library(MASS)
set.seed(12345)
X <- mvrnorm(n=100,mu=c(.5,500),Sigma=matrix(c(.2,1,1,10),2,2))
plot(X)
```
## Dimension reduction using the Mahalanobis Distance
The mahalanobis distance avoids the scale problem in the euclidean distance.^[For more [see here](https://stats.stackexchange.com/questions/62092/bottom-to-top-explanation-of-the-mahalanobis-distance)] Each circle shows points with the same MH distance from the center.
```{r mhfig, echo=FALSE,out.width=".7\\textwidth"}
library(chemometrics)
par(mgp = c(1.5, .5, 0), oma = rep(0, 4), mar = c(3, 3, 0, 0))
drawMahal(X,
center = colMeans(X), covariance = cov(X),
quantile = c(.1, .25, 0.975, 0.75, 0.5, 0.25)
)
abline(v = mean(X[,1]), h = mean(X[,2]))
pts <- c(25,33,51,57,59,72)
text(X[pts,1],X[pts,2],labels=pts)
```
## Dimension reduction using the Mahalanobis Distance
Notice that the points that are all the same Mahalanobis distance from the center vary in their euclidean distance:
```{r}
Xsd <- scale(X)
## This next tricks the dist function into looking at distance from the point of means.
newX <- rbind(colMeans(Xsd),Xsd[pts,])
edist <- as.matrix(dist(newX))
mh <- mahalanobis(X, center = colMeans(X), cov = cov(X))
compdists <- rbind(euclidean=edist[1,-1],
mahalanobis=mh[pts])
colnames(compdists) <- pts
compdists
```
```{r mhfig2, echo=FALSE,out.width=".7\\textwidth"}
par(mgp = c(1.5, .5, 0), oma = rep(0, 4), mar = c(3, 3, 0, 0))
drawMahal(X,
center = colMeans(X), covariance = cov(X),
quantile = c(.1, .25, 0.975, 0.75, 0.5, 0.25)
)
abline(v = mean(X[,1]), h = mean(X[,2]))
pts <- c(25,33,51,57,59,72)
text(X[pts,1],X[pts,2],labels=pts)
```
## Matching on the Mahalanobis Distance: Setup
Here using the rank based Mahalanobis distance following DOS Chap. 8 (but comparing to the ordinary version).
```{r mhmatch, echo=TRUE}
mhdist <- match_on(nhTrt ~ nhPopD + nhAboveHS + HomRate03, data = meddat, method = "rank_mahalanobis")
mhdist[1:3, 1:3]
mhdist2 <- match_on(nhTrt ~ nhPopD + nhAboveHS + HomRate03, data = meddat)
mhdist2[1:3, 1:3]
mhdist2[, "407"]
```
## Matching on the Mahalanobis Distance: Creation
```{r matchesmh, echo=TRUE}
fmMh <- fullmatch(mhdist, data = meddat)
summary(fmMh, min.controls = 0, max.controls = Inf)
fmMh1 <- fullmatch(mhdist, data = meddat, min.controls = .5)
summary(fmMh1, min.controls = 0, max.controls = Inf)
fmMh2 <- fullmatch(mhdist2, data = meddat)
summary(fmMh2, min.controls = 0, max.controls = Inf)
pmMh1 <- pairmatch(mhdist, data = meddat)
summary(pmMh1, min.controls = 0, max.controls = Inf)
```
## Matching on the Mahalanobis Distance: Evaluation
```{r xbmh, echo=TRUE}
xbMh <- xBalance(nhTrt ~ nhAboveHS + nhPopD + HomRate03,
strata = list(unstrat = NULL, fm1=~fm1, fmMh = ~fmMh, fmMh1 = ~fmMh1, fmMh2=~fmMh2,pmMh1=~pmMh1),
report = "all", data = meddat)
xbMh$overall
## xbMh$results
xbMh$results[,"std.diff",]
xbMh$results[,"adj.diff",]
## Another way to communicate with xBalance about the stratifications
strat_dat <- with(meddat,data.frame(unstrat = as.factor(rep(1,nrow(meddat))),
fm1=fm1, fmMh = fmMh, fmMh1 = fmMh1, fmMh2=fmMh2,pmMh1=pmMh1))
xbMh2 <- xBalance(nhTrt ~ nhAboveHS + nhPopD + HomRate03,
strata = strat_dat,
report = "all", data = meddat)
xbMh2$overall
```
# Matching on Many Covariates: Using Propensity Scores
## Matching on the propensity score
**Make the score**^[Note that we will be using `brglm` or `bayesglm` in the
future because of logit separation problems when the number of covariates
increases.]
```{r glm1, echo=TRUE}
theglm <- glm(nhTrt ~ nhPopD + nhAboveHS + HomRate03, data = meddat, family = binomial(link = "logit"))
thepscore <- theglm$linear.predictor
thepscore01 <- predict(theglm, type = "response")
## Add to the data
meddat$thepscore <- thepscore
meddat$thepscore01 <- thepscore01
````
## Matching on the propensity score
We tend to match on the linear predictor.
```{r echo=FALSE, out.width=".6\\textwidth"}
par(mfrow = c(1, 2), oma = rep(0, 4), mar = c(3, 3, 2, 0), mgp = c(1.5, .5, 0))
boxplot(split(thepscore, meddat$nhTrt), main = "Linear Predictor (XB)")
stripchart(split(thepscore, meddat$nhTrt), add = TRUE, vertical = TRUE)
boxplot(split(thepscore01, meddat$nhTrt), main = "Inverse Link Function")
stripchart(split(thepscore01, meddat$nhTrt), add = TRUE, vertical = TRUE)
```
```{r pscorediffs, echo=TRUE,tidy=FALSE}
minpscores <- meddat %>% group_by(nhTrt) %>% summarize(minp=min(thepscore),
minp01=min(thepscore01))
filter(minpscores,nhTrt==1) - filter(minpscores,nhTrt==0)
```
## Why a propensity score? {.fragile}
1. The Mahanobis distance weights all covariates equally (and the rank based version especially tries to do this). Maybe not all covariates matter equally for $Z$.
\begin{center}
\begin{tikzcd}[column sep=large,every arrow/.append style=-latex]
& Z \arrow[from=1-2,to=1-3, "\tau"] & y \\
x_1 \arrow[from=2-1,to=1-2, "\beta_1" ] \arrow[from=2-1,to=1-3,grey] &
x_2 \arrow[from=2-2,to=1-2, "\beta_2" ] \arrow[from=2-2,to=1-3,grey] &
\ldots &
x_p \arrow[from=2-4,to=1-2, "\beta_p" near start ] \arrow[from=2-4,to=1-3,grey]
\end{tikzcd}
\end{center}
2. @rosenbaum:rubi:1983 and @rosenbaum:rubi:1984a show that if we knew the true propensity score (ex. we knew the true model and the correct covariates) then we could stratify on the p-score and have adjusted for all of the covariates. (In practice, we don't know either. But the propensity score often performs well as an ingredient in a matched design)
## Matching on the propensity score: setup \& creation
```{r pssetup, echo=TRUE}
psdist <- match_on(theglm, data = meddat)
psdist[1:4, 1:4]
fmPs <- fullmatch(psdist, data = meddat)
summary(fmPs, min.controls = 0, max.controls = Inf, propensity.model=theglm)
fmPs2 <- fullmatch(psdist, data = meddat,min.controls=.5)
summary(fmPs2, min.controls = 0, max.controls = Inf, propensity.model=theglm)
```
## What about many covariates?
```{r manycovs, echo=TRUE}
thecovs <- unique(c(names(meddat)[c(5:7, 9:24)], "HomRate03"))
balfmla <- reformulate(thecovs, response = "nhTrt")
glmbig <- glm(balfmla,data=meddat,family=binomial(link="logit"))
psbig <- match_on(glmbig,data=meddat)
mhbig <- match_on(balfmla,data=meddat)
fmMhbig <- fullmatch(mhbig,data=meddat)
summary(fmMhbig,min.controls=0,max.controls=Inf)
meddat$fmMhbig <- fmMhbig
fmpsbig <- fullmatch(psbig,data=meddat)
summary(fmpsbig,min.controls=0,max.controls=Inf)
meddat$fmpsbig <- fmpsbig
xb5 <- xBalance(balfmla,
strata = list(unstrat=NULL,fmMh = ~fmMh, fmMh2=~fmMh2,fmPs = ~fmPs,fmMhbig=~fmMhbig,fmpsbig=~fmpsbig),
data = meddat, report = "all"
)
xb5$overall
resp <- xb5$results[,"p",]
resp[order(resp[,1],resp[,4]),]
```
## What about many covariates?
```{r}
plot(xb5)
```
## Summary:
What do you think?
- Statistical adjustment with linear regression models is hard to justify.
- Stratification via matching is easier to justify and assess (and describe).
- Matching solves the problem of making comparisons that are transparent
(Question: "Did you adjust enough for X?" Ans: "Here is some evidence about
how well I did. (1) Compared to a shared standard of unconfounded designs
and (2) Compared to what I know about the context and the theoretical and
explanatory goals of the research project.")
- You can adjust for one variable or more than one (if more than one, you need
to choose one or more methods for reducing many columns to one column
although you can **combine** approaches).
- The workflow involves the creation of a distance matrix, asking an algorithm
to find the best configuration of sets that minimize the distances within
set, and checking balance. (Eventually, it will also be concerned about the
effective sample size.)
- Next: We will get more into the differences between full matching, optimal
matching, greedy matching, matching with and without replacement, etc..
next week. (Also: handling missing data, calipers and other methods of
improving design).
- Next: How to estimate causal effects and test causal hypotheses with a
matched design. (Hint: If we can treat a stratified design created via
matching as-if it were a block-randomized study, then we know what to do.)
## References