-
Notifications
You must be signed in to change notification settings - Fork 6
/
day12-MatchingTools-2021.Rmd
481 lines (379 loc) · 18 KB
/
day12-MatchingTools-2021.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
---
title: |
| Statistical Adjustment in Observational Studies,
| Matched Stratification for Multiple Variables, Matching Modes, Tools.
date: '`r format(Sys.Date(), "%B %d, %Y")`'
author: |
| ICPSR 2021 Session 1
| Jake Bowers, Ben Hansen, Tom Leavitt
bibliography:
- 'BIB/MasterBibliography.bib'
fontsize: 10pt
geometry: margin=1in
graphics: yes
biblio-style: authoryear-comp
colorlinks: true
biblatexoptions:
- natbib=true
output:
beamer_presentation:
slide_level: 2
keep_tex: true
latex_engine: xelatex
citation_package: biblatex
template: icpsr.beamer
incremental: true
includes:
in_header:
- defs-all.sty
md_extensions: +raw_attribute-tex_math_single_backslash+autolink_bare_uris+ascii_identifiers+tex_math_dollars
pandoc_args: [ "--csl", "chicago-author-date.csl" ]
---
<!-- To show notes -->
<!-- https://stackoverflow.com/questions/44906264/add-speaker-notes-to-beamer-presentations-using-rmarkdown -->
```{r setup1_env, echo=FALSE, include=FALSE}
library(here)
source(here::here("rmd_setup.R"))
```
```{r setup2_loadlibs, echo=FALSE, include=FALSE}
## Load all of the libraries that we will use when we compile this file
## We are using the renv system. So these will all be loaded from a local library directory
library(dplyr)
library(ggplot2)
library(coin)
library(RItools)
library(optmatch)
library(estimatr)
```
```{r echo=FALSE, cache=TRUE}
load(url("http://jakebowers.org/Data/meddat.rda"))
meddat <- mutate(meddat,
HomRate03 = (HomCount2003 / Pop2003) * 1000,
HomRate08 = (HomCount2008 / Pop2008) * 1000
)
row.names(meddat) <- meddat$nh
```
## Today
1. Agenda: Types of algorithms for matching --- with-replacement,
without-replacement, fixed ratio, etc..; Producing useful propensity
scores when the number of covariates is large (relative to $N$); Methods
for excluding observations (`caliper`) and restricting the matched designs
to reflect substantive knowledge and research goals (`exactMatch`),
combining distance matrices; Maybe talk about `effective sample size`.
3. Questions arising from the reading or assignments or life?
# But first, review
## What have we done so far?
How can we argue that we have adequate adjustment for background covariates in observational studies?
- The potential for making this case for one or more variables using
**stratification** stratification reduces extrapolation and interpolation
without requiring models of adjustment (models relating $\bx$ to $Z$ and
$Y$).
- Using optimal, full matching technology to make and evaluate stratified
research designs: solve an optimization problem rather than stratify by
hand.
- Scalar distance --- to represent substantive knowledge about key
alternative explanations.
- Mahalanobis distance to compare units on $\bx$ --- to try to include all
covariates equally.
- The propensity score to compare units on $\hat{Z} \leftarrow \bx$ --- to
downweight covariates not relevant for confounding.
- Assessing stratifications using the block-randomized design as a standard
of comparison. (What does this mean in practice? What does `xBalance` do?
How does `xBalance` adjust differences of means for the stratifications?)
## Matching on the Mahalanobis Distance: Setup
Here using the rank based Mahalanobis distance following DOS Chap. 8 (but comparing to the ordinary version).
```{r mhmatch, echo=TRUE}
mhdist <- match_on(nhTrt ~ nhPopD + nhAboveHS + HomRate03, data = meddat)
```
## Matching on the Mahalanobis Distance: Creation
```{r matchesmh, echo=TRUE}
fmMh <- fullmatch(mhdist, data = meddat)
summary(fmMh, min.controls = 0, max.controls = Inf)
meddat$fmMh <- fmMh
```
## Matching on the Mahalanobis Distance: Evaluation
For a single variable (or could imagine using the propensity score as a summary):
```{r}
library(gridExtra)
bpfm1 <- ggplot(meddat, aes(x = fmMh, y = nhAboveHS)) +
geom_boxplot(alpha=.5) +
geom_jitter(width=0)+
stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "red", fill = "red")
meddat$nostrata <- rep(1, 45)
bporig <- ggplot(meddat, aes(x = nostrata, y = nhAboveHS)) +
geom_boxplot() +
stat_summary(
fun = mean, geom = "point",
shape = 20, size = 3, color = "red", fill = "red"
)
grid.arrange(bpfm1, bporig, ncol = 2, layout_matrix = matrix(c(1, 1, 1, 1, 2), nrow = 1))
```
## Matching on the Mahalanobis Distance: Evaluation
```{r xbmh, echo=TRUE}
xbMh <- xBalance(nhTrt ~ nhAboveHS + nhPopD + HomRate03,
strata = list(unstrat = NULL, fmMh = ~fmMh),
report = "all", data = meddat
)
xbMh$overall
xbMh$results[, "std.diff", ]
xbMh$results[, "adj.diff", ]
```
## Why a propensity score? {.fragile}
1. The Mahanobis distance weights all covariates equally (and the rank based version especially tries to do this). Maybe not all covariates matter equally for $Z$.
\begin{center}
\begin{tikzcd}[column sep=large,every arrow/.append style=-latex]
& Z \arrow[from=1-2,to=1-3, "\tau"] & y \\
x_1 \arrow[from=2-1,to=1-2, "\beta_1" ] \arrow[from=2-1,to=1-3,grey] &
x_2 \arrow[from=2-2,to=1-2, "\beta_2" ] \arrow[from=2-2,to=1-3,grey] &
\ldots &
x_p \arrow[from=2-4,to=1-2, "\beta_p" near start ] \arrow[from=2-4,to=1-3,grey]
\end{tikzcd}
\end{center}
2. @rosenbaum:rubi:1983 and @rosenbaum:rubi:1984a show that if we knew the true propensity score (ex. we knew the true model and the correct covariates) then we could stratify on the p-score and have adjusted for all of the covariates. (In practice, we don't know either. But the propensity score often performs well as an ingredient in a matched design)
# More on the propensity score: interpreting distances and separation problems
## Matching on the propensity score: What do the distance matrix entries mean?
`optmatch` creates a scaled propensity score distance by default --- scaling by,
roughly, the pooled median absolute deviation of the covariate (or here, the
propensity score). So, the distance matrix entries are like standard deviations
--- standardized scores.
```{r glm1, echo=TRUE}
meddat <- as.data.frame(meddat)
row.names(meddat) <- meddat$nh
theglm <- glm(nhTrt ~ nhPopD + nhAboveHS + HomRate03, data = meddat, family = binomial(link = "logit"))
thepscore <- theglm$linear.predictor
````
```{r echo=TRUE}
## Create a distance matrix using the propensity scores
psdist <- match_on(theglm, data = meddat)
psdist[1:4, 1:4]
```
## Matching on the propensity score: What do the distance matrix entries mean? {.allowframebreaks}
What do those distances mean? (They are standardized absolute differences.)
```{r echo=TRUE}
## Calculate the simple absolute distances in propensity score
simpdist <- outer(thepscore, thepscore, function(x, y) {
abs(x - y)
})
## Median Absolute Deviations
mad(thepscore[meddat$nhTrt == 1],na.rm=TRUE)
mad(thepscore[meddat$nhTrt == 0], na.rm=TRUE)
## Pooled median absolute deviation (average MAD between the treated and control groups)
(mad(thepscore[meddat$nhTrt == 1],na.rm=TRUE) + mad(thepscore[meddat$nhTrt == 0],na.rm=TRUE)) / 2
pooledmad <- (mad(thepscore[meddat$nhTrt == 1],na.rm=TRUE) + mad(thepscore[meddat$nhTrt == 0],na.rm=TRUE)) / 2
## We can see the actual R function here: optmatch:::match_on_szn_scale
optmatch:::match_on_szn_scale(thepscore, trtgrp = meddat$nhTrt)
simpdist["101", c("401", "402", "403")]
simpdist["101", c("401", "402", "403")] / 1.569
simpdist["101", c("401", "402", "403")] /
psdist["101", c("401", "402", "403")]
```
## What about using many covariates? The separation problem in logistic regression
What if we want to match on more than two covariates? Let's step through the following to discover a problem with logistic regression when the number of covariates is large relative to the size of the dataset.
```{r echo=TRUE}
library(splines)
library(arm)
thecovs <- unique(c(names(meddat)[c(5:7, 9:24)], "HomRate03"))
balfmla <- reformulate(thecovs, response = "nhTrt")
psfmla <- update(balfmla, . ~ . + ns(HomRate03, 2) + ns(nhPopD, 2) + ns(nhHS, 2))
glm0 <- glm(balfmla, data = meddat, family = binomial(link = "logit"))
glm1 <- glm(psfmla, data = meddat, family = binomial(link = "logit"))
bayesglm0 <- bayesglm(balfmla, data = meddat, family = binomial(link = "logit"))
bayesglm1 <- bayesglm(psfmla, data = meddat, family = binomial(link = "logit"))
psg1 <- predict(glm1, type = "response")
psg0 <- predict(glm0, type = "response")
psb1 <- predict(bayesglm1, type = "response")
psb0 <- predict(bayesglm0, type = "response")
```
## The separation problem
Logistic regression is excellent at discriminating between groups \ldots often **too excellent** for us \autocite{gelman2008weakly}. First evidence of this is big and/or missing coefficients in the propensity score model. See the coefficients below (recall that we are predicting `nhTrt` with these covariates in those models):
```{r echo=FALSE}
thecoefs <- rbind(
glm0 = coef(glm0)[1:20],
glm1 = coef(glm1)[1:20],
bayesglm0 = coef(bayesglm0)[1:20],
bayesglm1 = coef(bayesglm1)[1:20]
)
thecoefs[, 1:5]
```
## The separation problem
```{r, echo=FALSE, out.width=".9\\textwidth"}
par(mfrow = c(1, 2))
matplot(t(thecoefs), axes = FALSE)
axis(2)
axis(1, at = 0:19, labels = colnames(thecoefs), las = 2)
matplot(t(thecoefs), axes = FALSE, ylim = c(-15, 10))
axis(2)
axis(1, at = 0:19, labels = colnames(thecoefs), las = 2)
legend("topright", col = 1:4, lty = 1:4, legend = c("glm0", "glm1", "bayesglm0", "bayesglm1"))
```
## The separation problem in logistic regression
So, if we are interested in using the propensity score to compare observations in regards the multi-dimensional space of many covariates, we would probably prefer a dimensional reduction model like `bayesglm` over `glm`.
```{r out.width=".9\\textwidth", echo=FALSE}
par(mfrow = c(2, 2), mar = c(3, 3, 2, .1))
boxplot(psg0 ~ meddat$nhTrt, main = paste("Logit", length(coef(glm0)), " parms", sep = " "))
stripchart(psg0 ~ meddat$nhTrt, vertical = TRUE, add = TRUE)
boxplot(psg1 ~ meddat$nhTrt, main = paste("Logit", length(coef(glm1)), " parms", sep = " "))
stripchart(psg1 ~ meddat$nhTrt, vertical = TRUE, add = TRUE)
boxplot(psb0 ~ meddat$nhTrt, main = paste("Shrinkage Logit", length(coef(bayesglm0)), " parms", sep = " "))
stripchart(psb0 ~ meddat$nhTrt, vertical = TRUE, add = TRUE)
boxplot(psb1 ~ meddat$nhTrt, main = paste("Shrinkage Logit", length(coef(bayesglm1)), " parms", sep = " "))
stripchart(psb1 ~ meddat$nhTrt, vertical = TRUE, add = TRUE)
```
# Major Matching modes: Briefly, Greed versus Optimal Matching
## Optimal (communitarian) vs greedy (individualistic) matching {.fragile}
Compare the greedy to optimal matched designs:
\begin{center}
\begin{tabular}{l|cccc}
& \multicolumn{4}{c}{Illustrator} \\
Writer & Mo& John & Terry & Pat \\ \hline
Ben & 0 & 1 & 1 & 10 \\
Jake & 10& 0 & 10 & 10 \\
Tom & 1& 1 & 20 & $\infty$ \\ \hline
\end{tabular}
\end{center}
## Optimal (communitarian) vs greedy (individualistic) matching {.fragile}
Compare the greedy to optimal matched designs:
\begin{center}
\begin{tabular}{l|cccc}
& \multicolumn{4}{c}{Illustrator} \\
Writer & Mo& John & Terry & Pat \\ \hline
Ben & 0 & 1 & 1 & 10 \\
Jake & 10& 0 & 10 & 10 \\
Tom & 1& 1 & 20 & $\infty$ \\ \hline
\end{tabular}
\end{center}
Greedy match without replacement has mean distance (0+0+20)/3=6.67. The optimal
match keeps all the obs, and has mean distance (1+0+1)/3=.67.
```{r simpmatch, echo=TRUE, warning=FALSE, messages=FALSE}
bookmat <- matrix(c(0, 1, 1, 10, 10, 0, 10, 10, 1, 1, 20, 100 / 0), nrow = 3, byrow = TRUE)
dimnames(bookmat) <- list(c("Ben", "Jake", "Tom"), c("Mo", "John", "Terry", "Pat"))
pairmatch(bookmat)
fullmatch(bookmat)
```
\note{
*Greedy:* Ben-Mo (0) , Jake-John (0), Tom-Terry (20)
*Optimal:* Ben-Terry (1), Jake-John (0), Tom-Mo (1)
}
# Matching Tricks of the Trade: Calipers, Exact Matching, Missing Data
## Calipers {.allowframebreaks}
The optmatch package allows calipers (which forbids certain pairs from being matched).^[You can implement penalties by hand.] Here, for example, we forbid comparisons which differ by more than 2 propensity score standardized distances.
```{r echo=TRUE}
## First inspect the distance matrix itself: how are the distances distributed?
quantile(as.vector(psdist), seq(0, 1, .1))
## Next, apply a caliper (setting entries to Infinite)
psdistCal <- psdist + caliper(psdist, 2)
as.matrix(psdist)[5:10, 5:10]
as.matrix(psdistCal)[5:10, 5:10]
```
## Calipers
The optmatch package allows calipers (which forbid certain pairs from being matched).^[You can implement penalties by hand.] Here, for example, we forbid comparisons which differ by more than 2 standard deviations on the propensity score. (Notice that we also use the `propensity.model` option to `summary` here to get a quick look at the balance test:)
```{r echo=TRUE}
fmCal1 <- fullmatch(psdist + caliper(psdist, 2), data = meddat, tol = .00001)
summary(fmCal1, min.controls = 0, max.controls = Inf, propensity.model = theglm)
pmCal1 <- pairmatch(psdist + caliper(psdist, 2), data = meddat, remove.unmatchables = TRUE)
summary(pmCal1, propensity.model = theglm)
```
## Calipers
Another example: We may want to match on propensity distance but disallow any pairs with extreme mahalnobis distance and/or extreme differences in baseline homicide rates (here using many covariates all together).
```{r echo=TRUE}
## Create an R formulate object from vectors of variable names
balfmla <- reformulate(c("nhPopD", "nhAboveHS"), response = "nhTrt")
## Create a mahalanobis distance matrix (of rank transformed data)
mhdist <- match_on(balfmla, data = meddat, method = "rank_mahalanobis")
## Now make a matrix recording absolute differences between neighborhoods in
## terms of baseline homicide rate
tmpHom03 <- meddat$HomRate03
names(tmpHom03) <- rownames(meddat)
absdist <- match_on(tmpHom03, z = meddat$nhTrt, data = meddat)
absdist[1:3, 1:3]
quantile(as.vector(absdist), seq(0, 1, .1))
quantile(as.vector(mhdist), seq(0, 1, .1))
## Now create a new distance matrix using two calipers:
distCal <- psdist + caliper(mhdist, 9) + caliper(absdist, 2)
as.matrix(distCal)[5:10, 5:10]
## Compare to:
as.matrix(mhdist)[5:10, 5:10]
```
## Calipers
Now, use this new matrix for the creation of stratified designs --- but possibly excluding some units (also showing here the `tol` argument. The version with the tighter tolerance produces a solution with smaller overall distances)
```{r echo=TRUE}
fmCal2a <- fullmatch(distCal, data = meddat, tol = .1)
summary(fmCal2a, min.controls = 0, max.controls = Inf, propensity.model = theglm)
fmCal2b <- fullmatch(distCal, data = meddat, tol = .00001)
summary(fmCal2b, min.controls = 0, max.controls = Inf, propensity.model = theglm)
meddat$fmCal2a <- fmCal2a
meddat$fmCal2b <- fmCal2b
fmCal2a_dists <- matched.distances(fmCal2a, distCal)
fmCal2b_dists <- matched.distances(fmCal2b, distCal)
mean(unlist(fmCal2a_dists))
mean(unlist(fmCal2b_dists))
```
## Exact Matching
We often have covariates that are categorical/nominal and for which we really care about strong balance. One approach to solve this problem is match **exactly** on one or more of such covariates. If `fullmatch` or `match_on` is going slow, this is also an approach to speed things up.
```{r echo=FALSE}
meddat$classLowHi <- ifelse(meddat$nhClass %in% c(2, 3), "hi", "lo")
```
```{r}
dist2 <- psdist + exactMatch(nhTrt ~ classLowHi, data = meddat)
## or mhdist <- match_on(balfmla,within=exactMatch(nhTrt~classLowHi,data=meddat),data=meddat,method="rank_mahalanobis")
## or fmEx1 <- fullmatch(update(balfmla,.~.+strata(classLowHi)),data=meddat,method="rank_mahalanobis")
fmEx1 <- fullmatch(dist2, data = meddat, tol = .00001)
fmEx1 <- fullmatch(dist2, data = meddat, tol = .00001, min.controls=c("hi"=1,"lo"=.5))
summary(fmEx1, min.controls = 0, max.controls = Inf, propensity.model = theglm)
print(fmEx1, grouped = T)
meddat$fmEx1 <- fmEx1
```
## Exact Matching
```{r}
ftable(Class = meddat$classLowHi, Trt = meddat$nhTrt, fmEx1, col.vars = c("Class", "Trt"))
```
## Missing data and matching
What if `nhPopD` had some missing data?
```{r md1, echo=TRUE}
set.seed(12345)
meddat$nhPopD[sample(1:45, 10)] <- NA
summary(meddat$nhPopD)
xb0 <- xBalance(nhTrt ~ nhPopD + nhAboveHS, data = meddat, report = "all")
```
We would want to compare units who are equally likely to have `nhPopD` missing. So, we create a new variable:
```{r md2, echo=TRUE}
newdat <- fill.NAs(meddat[, c("nhAboveHS", "nhPopD")])
head(newdat)
stopifnot(all.equal(row.names(newdat), row.names(meddat)))
newdat <- cbind(newdat, meddat[, c("nhTrt", "HomRate08", "HomRate03")])
head(newdat)
```
## Missing data and matching
And we include that variable in our balance testing and matching:
```{r echo=TRUE}
theglm <- arm::bayesglm(nhTrt ~ nhAboveHS + nhPopD + nhPopD.NA + HomRate03, data = newdat)
psdist <- match_on(theglm, data = meddat)
maxCaliper(theglm$linear.predictor, z = newdat$nhTrt, widths = c(.1, .5, 1))
balfmla <- formula(theglm)
fm0 <- fullmatch(psdist + caliper(psdist, 2), data = newdat)
summary(fm0, min.controls = 0, max.controls = Inf, propensity.model = theglm)
summary(unlist(matched.distances(fm0, psdist)))
newdat$fm0 <- fm0
xb0 <- xBalance(balfmla, strata = list(fm0 = ~fm0), data = newdat, report = "all")
xb0$overall
xb0$results
```
## Missing Data and Matching
So:
1. Missing data on covariates is not a big problem --- such data reveals
information to us about the units pre-treatment, so we just stratify on it.
We treat missing data as just another covariate.
2. Missing data the outcome (let alone $Z$ ("treatment", "intervention",
"driver") is a bigger problem: we will tend to use bounds to report on the
range of possible answers in such cases.
## Summary and Questions:
- We can build a stratified research design allowing an algorithm to do some
work, and bringing in our substantive expertise and rhetorical strategies
for other work.
- We can restrict certain matches --- using calipers or exact-matching, using
different distance matrices.
- We can match on missing values on covariates.
- Some choices of score creation / dimension reduction raise other questions
and problems (ex. Mahalanobis versus Rank-Based Mahalanobis distances;
Propensity Score creation without overfitting.)
## References