matchability function #5

Closed
josherrickson opened this issue Mar 16, 2016 · 20 comments

Comments

@josherrickson
Collaborator

After a discussion with Ben, I created a function tentatively called matchability. The idea is to take some version of a distance matrix containing some Inf entries (from a caliper or otherwise) and identify which observations are completely unmatchable. E.g.

>   set.seed(1)
>   d <- data.frame(z=rep(0:1, each=5),
+                   x=rnorm(10))
>   rownames(d) <- letters[1:10]
>   m <- match_on(z ~ x, data=d)
>   m1 <- caliper(m, width=1)
> m1
       control
treated   a   b   c   d   e
      f   0 Inf   0 Inf Inf
      g Inf   0 Inf Inf   0
      h Inf   0 Inf Inf   0
      i Inf   0 Inf Inf   0
      j   0   0   0 Inf   0
>   mm1 <- matchability(m1)
> mm1
$matchable
$matchable$treatment
[1] "f" "g" "h" "i" "j"

$matchable$control
[1] "a" "b" "c" "e"


$unmatchable
$unmatchable$treatment
character(0)

$unmatchable$control
[1] "d"

Starting this issue for commentary, especially on the function name, what the output should look like, etc.
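
For a plain matrix, the idea can be sketched in base R roughly as follows. This is only an illustration, not the implementation under discussion: the name `matchability_sketch` is hypothetical, and it assumes rows are treated units and columns are controls, as in the printout above.

```r
# Hypothetical helper: classify the rows (treated) and columns (control) of a
# dense distance matrix as matchable or unmatchable.
matchability_sketch <- function(m) {
  finite <- is.finite(m)          # FALSE for Inf, NA and NaN
  row_ok <- rowSums(finite) > 0   # treated units with at least one finite distance
  col_ok <- colSums(finite) > 0   # controls with at least one finite distance
  list(matchable   = list(treatment = rownames(m)[row_ok],
                          control   = colnames(m)[col_ok]),
       unmatchable = list(treatment = rownames(m)[!row_ok],
                          control   = colnames(m)[!col_ok]))
}

# Same pattern of Inf entries as m1 above
m <- matrix(c(  0, Inf,   0, Inf, Inf,
              Inf,   0, Inf, Inf,   0,
              Inf,   0, Inf, Inf,   0,
              Inf,   0, Inf, Inf,   0,
                0,   0,   0, Inf,   0),
            nrow = 5, byrow = TRUE,
            dimnames = list(treated = letters[6:10], control = letters[1:5]))
res <- matchability_sketch(m)
res$unmatchable$control
```

Treating NA and NaN the same way as Inf is a choice made for this sketch.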

@benthestatistician
Collaborator

I'd suggest reporting this information in a summary method for ISMs.

@josherrickson
Collaborator Author

How would that work with matrices? Right now I have (albeit not implemented) matchability.matrix as well as matchability.InfinitySparseMatrix, matchability.BlockedInfinitySparseMatrix.

@benthestatistician
Collaborator

I'm not sure we should bother with a matrix summary method. Somebody might
have optmatch on their search path but be doing other stuff with matrices,
then get confused by a matrix summary that's totally optmatch oriented. On
the other hand, if they have an ISM then we know it's there for use in
matching.


@josherrickson
Collaborator Author

How's this for a mockup of the results of print.summary.InfinitySparseMatrix?

Membership: 42 treatment, 33 control
Total eligible potential matches: 945
Total ineligible potential matches: 441

Unmatchable treatment members:
  d, f, g

Unmatchable control members:
  A, C, N

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01831 0.28170 0.43070 0.42760 0.60310 0.76680 

The unmatchable sections would drop if empty. (And always drop in print.summary.DenseMatrix.)

For print.summary.BlockedInfinitySparseMatrix, I'm thinking of running this for each block.

(I'm thinking of having summary.InfinitySparseMatrix, summary.BlockedInfinitySparseMatrix and summary.DenseMatrix all return the same type of object, maybe summary.optmatch.distance or something, so the calls to print.___ above are pseudo-code; in reality there will only be print.summary.optmatch.distance.)
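
A rough sketch of that arrangement, using a plain matrix as a stand-in for the distance classes: the function name `summarize_distances` and the field names are assumptions for illustration only; the class name is the one floated above.

```r
# Hypothetical stand-in for the summary methods: every input type would
# return an object of the same class.
summarize_distances <- function(m) {
  finite <- is.finite(m)
  structure(list(membership = c(treatment = nrow(m), control = ncol(m)),
                 eligible   = sum(finite),    # finite distances
                 ineligible = sum(!finite),   # Inf, NA or NaN
                 distances  = summary(m[finite])),
            class = "summary.optmatch.distance")
}

# A single print method then handles the shared class.
print.summary.optmatch.distance <- function(x, ...) {
  cat("Membership:", x$membership[["treatment"]], "treatment,",
      x$membership[["control"]], "control\n")
  cat("Total eligible potential matches:", x$eligible, "\n")
  cat("Total ineligible potential matches:", x$ineligible, "\n\n")
  cat("Summary of distances:\n")
  print(x$distances)
  invisible(x)
}

s <- summarize_distances(matrix(c(1, Inf, 2, 3), nrow = 2))
print(s)
```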

@benthestatistician
Collaborator

I like the way this is going. I think we could wind up with something very useful to users. Let me lay out some general comments first, then specific suggestions

General thoughts

  1. The main usage scenario I envision is that the user is managing a potentially very large matching problem and wants advice on whether/how to cut it down. If you envision others, do speak up so we can take them into account in this discussion.
  2. As we evaluate possible features, including those that I may suggest, let's try to bear in mind several resource constraints: Developer time; runtime demands.

Comments on @josherrickson's mockup:

  1. summary.DenseMatrix results should be able to have unmatchable treatment and control group members too -- those would be the rows/columns all of whose entries are in {Inf, NA, NaN}
  2. I quite like the idea of a print.summary S3 method, as separate from the summary method itself. Among other things this would enable us to store potentially voluminous artifacts as part of the summary itself, then selectively print them.
  3. First candidates for this selective printing: "Unmatchable treatment members", "Unmatchable control members". We might print up to the first 5, followed by a "..." if there are more, along with a count of how many there are. (And a hint about how to access them w/in the summary artefact?)
  4. Large problems may have many many blocks. I'd think that the default for print.summary.BlockedInfinitySparseMatrix should be a collapsed representation across blocks, enriched by a concise summary of the blocking structure.
  5. That said, it would be nice if the summary object produced for a BlockedInfinitySparseMatrix had easily accessible components for each block, so that e.g.
summary(mybISM)$blockA

would invoke print.summary.InfinitySparseMatrix for "blockA". If so, then it'd be helpful if the print.summary.BlockedInfinitySparseMatrix's concise summary of the blocking structure either indicated block names or indicated to the user how to infer them from the summary result (eg names(summary(mybISM))[-(1:3)] or similar).
  6. I like the Summary of distances part. See below for discussion of a potential enhancement, probably not something for now but maybe something for later.
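
The selective printing suggested above (show up to five names, then "..." and a hint) could look something like this. The helper name `print_unmatchable`, its arguments and the exact wording of the hint are hypothetical.

```r
# Hypothetical helper: print at most `max_show` names, then "..." and a hint
# on where to find the full list.
print_unmatchable <- function(ids, group, max_show = 5) {
  n <- length(ids)
  if (n == 0) return(invisible(NULL))   # the section drops when empty
  cat(n, " unmatchable ", group, " member", if (n > 1) "s", ":\n", sep = "")
  cat("    ", paste(head(ids, max_show), collapse = ", "),
      if (n > max_show) ", ...", "\n", sep = "")
  if (n > max_show)
    cat("See summary(x)$unmatchable$", group, " for a complete list.\n", sep = "")
  invisible(ids)
}

print_unmatchable(LETTERS[1:7], "control")
print_unmatchable("d", "treatment")
```

Keeping the full vector in the summary object and truncating only at print time is what makes the hint in the last line useful.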

Possible additional features

  1. For summary.BlockedInfinitySparseMatrix and/or print.summary.BlockedInfinitySparseMatrix, perhaps add a table, with blocks as rows, indicating matchable and unmatchable treatment and control counts (ie 4 separate columns) by block.
  2. I think it's possible for an ISM to have data entries of Inf, maybe also NA, NaN, etc. If that's correct, then it's possible for a given subject to appear to be matchable based on a quick scan of the ISM to see if there are potential matches there involving him; you would have to scan further to detect that each of those potential matches is in fact forbidden. Perhaps the summary methods should make an effort to remove these entries signaling unmatchability before proceeding to the summarizing.
  3. Summary of distances enhancement: To my mind, still more useful would be a summary of the minimum distances per treatment group member -- from this you could infer how many treatment group members you'd leave out by imposing a caliper of .2, or .5 or whatever, on the distance in question. This strikes me as being likely to be computationally expensive, however. Maybe it would be cheaper for ISMs/block ISMs that have already been sorted? Not obvious to me how to realize that potential performance advantage, but if this is correct then this sort of summary enhancement is maybe something for a future point when we've begun to standardize around sorted ISMs.
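
Items 2 and 3 can be illustrated on a toy sparse representation. The data frame below is a made-up stand-in with one row per stored treated/control pair, not the actual ISM layout, and the caliper width is arbitrary.

```r
# Toy stand-in for a sparse distance object: one row per stored pair.
entries <- data.frame(
  treated = c("a", "a", "b", "b", "c"),
  control = c("x", "y", "x", "z", "y"),
  dist    = c(0.10, 0.45, Inf, 0.30, NA),
  stringsAsFactors = FALSE
)

# Item 2: drop stored entries that in fact signal a forbidden match.
kept <- entries[is.finite(entries$dist), ]

# Item 3: minimum finite distance for each treated unit that still has a match.
min_dist <- tapply(kept$dist, kept$treated, min)

# Treated units that would be left out under a caliper of the given width.
caliper_width <- 0.2
retained <- names(min_dist)[min_dist <= caliper_width]
lost <- setdiff(unique(entries$treated), retained)
lost
```

Here "c" is lost because its only stored entry is NA, and "b" because its closest finite distance exceeds the caliper.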

@josherrickson
Collaborator Author

Given the expanding scope of this function, is it safe to say this is no longer targeted for the 0.9-6 release? Or is there some more limited version you'd like to try and push through by next week?

@benthestatistician
Collaborator

Correct, let's not target this for the 0.9-6 release.

Maybe we can get some user feedback before we release it out into the wild. I could probably get some students in my class this term to try out our working version of the thing, particularly if we got it up and running over the next several days.

The operation of determining the closest match for each treatment group member could be done with tapply, but I'd expect that to be slow with large ISMs. It seems like something that would be better done in C or C++. I don't program in those languages myself, but I see that the "Kmisc" package has an R function to generate C++ versions of tapply calls. See "C++ Function Generators," this vignette.

josherrickson added a commit that referenced this issue Mar 19, 2016
Various changes related to comments on #5
@josherrickson
Collaborator Author

Here's what I've got so far.

  1. I've not delved into the distances measures yet.
  2. There's obviously a bug with the call to summary(m5) yielding a double statement. Fixed, next comment has updated output.

Wall of text incoming.

> data(nuclearplants)
> m1 <- match_on(pr ~ cost, data=nuclearplants)
> summary(m1)
Membership: 10 treatment, 22 control
Total eligible potential matches: 220 
Total eligible potential matches: 0 

Summary of distances:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.006858 0.420900 1.002000 1.102000 1.539000 3.858000 

> m2 <- match_on(pr ~ cost, data=nuclearplants, caliper=1)
> summary(m2)
Membership: 10 treatment, 22 control
Total eligible potential matches: 109 
Total eligible potential matches: 111 

1 unmatchable control member:
    V

Summary of distances:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.006858 0.181000 0.413200 0.453300 0.724000 0.993000 

> m3 <- match_on(pr ~ cost, data=nuclearplants, caliper=.05)
> summary(m3)
Membership: 10 treatment, 22 control
Total eligible potential matches: 9 
Total eligible potential matches: 211 

5 unmatchable treatment members:
    A, B, F, G, b

16 unmatchable control members:
    J, K, M, N, O, ...
See summary(m3)$unmatchable$control for a complete list.

Summary of distances:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.006858 0.017030 0.026270 0.025840 0.027780 0.047190 

> summary(m3)$unmatchable$control
 [1] "J" "K" "M" "N" "O" "P" "Q" "S" "T" "U" "V" "W" "X" "Y" "Z" "d"
> m4 <- match_on(pr ~ cost + strata(pt), data=nuclearplants)
> summary(m4)
Summary across all blocks:
Membership: 10 treatment, 22 control
Total eligible potential matches: 118 
Total eligible potential matches: 102 

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01703 0.30750 0.81410 0.90400 1.35000 3.43800 

To see summaries for individual blocks, call for example summary(m4)$`0`.

> summary(m4)$`1`
Membership: 3 treatment, 3 control
Total eligible potential matches: 9 
Total eligible potential matches: 0 

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02627 0.05736 0.10330 0.21250 0.39230 0.42340 


@josherrickson
Collaborator Author

Fixed the double statement above, new version below

> nuclearplants$strat <- rep(letters[1:3], times=15)[1:32]
> m5 <- match_on(pr ~ cost + strata(strat), data=nuclearplants, caliper=.2)
> summary(m5)
Summary across all blocks:
Membership: 10 treatment, 22 control
Total eligible potential matches: 2 
Total eligible potential matches: 218 

6 unmatchable treatment members:
    A, C, D, G, a, ...
See summary(m5)$unmatchable$treatment for a complete list.

17 unmatchable control members:
    H, I, K, L, M, ...
See summary(m5)$unmatchable$control for a complete list.

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02732 0.07945 0.08236 0.09011 0.12080 0.14060 

To see summaries for individual blocks, call for example summary(m5)$`a`.

> summary(m5)$`a`
Membership: 2 treatment, 9 control
Total eligible potential matches: 2 
Total eligible potential matches: 16 

1 unmatchable treatment member:
    b

7 unmatchable control members:
    H, L, Q, T, V, ...
See summary(object)$`a`$unmatchable$control for a complete list.

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.07945 0.09475 0.11000 0.11000 0.12530 0.14060 

> print(summary(m5), printAllBlocks=TRUE)
Summary across all blocks:
Membership: 10 treatment, 22 control
Total eligible potential matches: 2 
Total eligible potential matches: 218 

6 unmatchable treatment members:
    A, C, D, G, a, ...
See summary(m5)$unmatchable$treatment for a complete list.

17 unmatchable control members:
    H, I, K, L, M, ...
See summary(m5)$unmatchable$control for a complete list.

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02732 0.07945 0.08236 0.09011 0.12080 0.14060 

Indiviual blocks:

$a
Membership: 2 treatment, 9 control
Total eligible potential matches: 2 
Total eligible potential matches: 16 

1 unmatchable treatment member:
    b

7 unmatchable control members:
    H, L, Q, T, V, ...
See summary(object)$`a`$unmatchable$control for a complete list.

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.07945 0.09475 0.11000 0.11000 0.12530 0.14060 


$b
Membership: 3 treatment, 8 control
Total eligible potential matches: 3 
Total eligible potential matches: 21 

5 unmatchable control members:
    I, M, O, U, Z

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02732 0.05484 0.08236 0.07681 0.10160 0.12080 


$c
Membership: 5 treatment, 5 control
Total eligible potential matches: 0 
Total eligible potential matches: 25 

5 unmatchable treatment members:
    A, C, D, G, a

5 unmatchable control members:
    K, P, S, W, d

@josherrickson
Collaborator Author

Just noting I've pushed up these changes finally, feel free to play around.

@josherrickson
Collaborator Author

Also, pointing out that I addressed Ben's concerns about Inf as an actual entry (demo below in ISM, same logic at work in DenseMatrix)

> data(nuclearplants)
> np <- subset(nuclearplants, pt==1)
> m <- match_on(pr ~ cost, data=np, caliper=1)
> m
       control
treated         d         e         f
      a       Inf 0.2016450 0.1122457
      b 0.2451029       Inf       Inf
      c       Inf 0.4412846 0.3518854
> summary(m)
Membership: 3 treatment, 3 control
Total eligible potential matches: 5 
Total eligible potential matches: 4 

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1122  0.2016  0.2451  0.2704  0.3519  0.4413 

> m[3] <- Inf
> m@.Data
[1] 0.2016450 0.1122457       Inf 0.4412846 0.3518854
> summary(m)
Membership: 3 treatment, 3 control
Total eligible potential matches: 4 
Total eligible potential matches: 5 

1 unmatchable treatment member:
    b

1 unmatchable control member:
    d

Summary of distances:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1122  0.1793  0.2768  0.2768  0.3742  0.4413 

@benthestatistician
Collaborator

Coming along nicely here! Suggestion: relabel at least one of the two "Total eligible potential matches" lines to clarify what they're giving back. (I'm not getting this myself.)

@josherrickson
Collaborator Author

"Total eligible potential matches" is basically sum(is.finite(m@.Data)). (And the second should be Ineligible.) Any suggestion for better wording?

@benthestatistician
Collaborator

Changing the second one to "Total ineligible..." will clear my issue right up!

@josherrickson
Collaborator Author

New version of BISM has been pushed, fixing the typo above, and adding a table describing block structure (only appears when printAllBlocks is FALSE [the default] and blockStructure is TRUE [also default]).

Block structure:
    Matchable Txt Matchable Ctl Unmatchable Txt Unmatchable Ctl
`a`             1             2               1               7
`b`             3             3               0               5
`c`             0             0               5               5

I shrank the titles from "Treatment" and "Control" to "Txt" and "Ctl" to make the table less wide; alternatively I could mess with cat and nchar to make the colnames two lines. That would require a bit of work, so if that's desired, I can return to it later.
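
For reference, a table with the same four columns can be built from per-block dense matrices along these lines. The toy blocks are made up, and this is a sketch rather than the code that was pushed.

```r
# Toy blocks: rows are treated units, columns are controls.
blocks <- list(
  a = matrix(c(0.1, Inf, Inf, Inf), nrow = 2),
  b = matrix(c(0.2, 0.3, Inf, Inf, 0.5, Inf), nrow = 2)
)

# One row per block, with counts of units that have (or lack) a finite distance.
block_structure <- t(sapply(blocks, function(m) {
  finite <- is.finite(m)
  c("Matchable Txt"   = sum(rowSums(finite) > 0),
    "Matchable Ctl"   = sum(colSums(finite) > 0),
    "Unmatchable Txt" = sum(rowSums(finite) == 0),
    "Unmatchable Ctl" = sum(colSums(finite) == 0))
}))
block_structure
```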

Also, the only item left on your original comments, Ben, is the distances changes.

@josherrickson
Collaborator Author

First version of the change to distances summary is up.

I'm using built-in tapply as I was getting some issues trying to use Kmisc's functions. Overall, I saw some speed-up with them; with a 5000x5000 ISM, Kmisc::tapply_ was about 3 times faster than tapply. I saw no additional gain from the Rcpp generator methods.

Overall, speed may be an issue; the same 5000x5000 matrix takes several seconds to run. On first pass, there's no obvious bottleneck, rather a series of slowdowns.

Summary of minimum matchable distance per treatment member:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.006858 0.019340 0.042050 0.041290 0.058320 0.079450 

Any feedback on that description?

@benthestatistician
Collaborator

Great! Built in tapply is the right call for now. There are various ways
it might be speeded up, something to ponder. In the meantime, how about an
option to turn this on or off, so that if you want to skip the performance
hit you can?


@benthestatistician
Collaborator

PS: One of the "various ways it might be speeded up" would be to encode the
ISM info in a data.table:
https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf


@josherrickson
Collaborator Author

Added flag to turn off distance summary. Also moved flags from print.summary.BlockedInfinitySparseMatrix to summary.BlockedInfinitySparseMatrix to get rid of some unwieldy syntax.

summary.BlockedInfinitySparseMatrix <- function(object, ...,
                                                distanceSummary=TRUE,
                                                printAllBlocks=FALSE,
                                                blockStructure=TRUE)

@josherrickson
Collaborator Author

This set of functions has been moved over to Optmatch.
