performance tuning for `mlm` #1

benthestatistician · 2015-11-30T02:39:12Z

@josherrickson reports that mlm is pretty slow (on a problem with 30K or so matched sets and 70K or so matched observations). I suspect the sapply call, but maybe I'm all wet - the first step should be profiling. If the sapply calls are slow, I suggest replacing them with sparse matrix ops, perhaps making use of SparseMMFromFactor as defined in the RItools clusters branch.

josherrickson · 2015-11-30T14:01:27Z

In a run with ~4000 obs and 7 covariates, it takes ~ 30 secs to run mlm.
~15 sec are spent in the matchCsr <- as(theMatch, "matrix.csr") line.
~7-8 secs each spent in nt <- sapply(levels(theMatch), function(l) { sum(theMatch == l & z) }) and its nc counterpart.

benthestatistician · 2015-12-01T15:09:15Z

Thanks for the info, @josherrickson. Looking inside the definition of that matrix.csr conversion, looks like there's a similarly pricey loop inside of there. I think that both of these are pretty avoidable. @josherrickson, would you be game for:

1. Coming up with a self-contained example of the performance gap, for use in testing and development?
2. Replacing the sapply that defines nt with the crossprod of the treatment vector and a sparse model matrix made from the factor variable defining the match, and similarly for nc?
3. Working on non-looping replacements for the offending loops inside of the matrix.csr conversion op?
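
As a rough illustration of item 2, here is a minimal sketch on toy data. It assumes the Matrix package as a stand-in for SparseMMFromFactor (whose interface is not shown in this thread); theMatch and z below are made up for the example and contain no NAs. Multiplying the level-by-observation indicator matrix by z is the same computation as the crossprod of the observation-by-level model matrix with z.

```r
library(Matrix)  # assumption: Matrix is used here in place of the RItools/SparseM helpers

set.seed(1)
n_sets   <- 500
theMatch <- factor(sample(rep(seq_len(n_sets), each = 4)))  # toy matched-set labels
z        <- rbinom(length(theMatch), 1, 0.3) == 1           # toy treatment indicator

# Loop-style counts, as in the sapply calls quoted above
nt_slow <- sapply(levels(theMatch), function(l) sum(theMatch == l & z))
nc_slow <- sapply(levels(theMatch), function(l) sum(theMatch == l & !z))

# Sparse indicator matrix: one row per level, one column per observation
ind <- fac2sparse(theMatch)

# A single sparse matrix-vector product gives all per-set counts at once
nt_fast <- as.vector(ind %*% as.numeric(z))
nc_fast <- as.vector(ind %*% as.numeric(!z))
names(nt_fast) <- names(nc_fast) <- rownames(ind)
```

The sapply version rescans the whole vector once per level, so its cost grows with (number of sets) x (number of observations); the sparse product touches each observation once.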

josherrickson · 2015-12-02T19:56:38Z

The speed issues aren't as bad as I originally thought. It turns out that subsetting an optmatch object, as in

match[subset, drop=TRUE]

properly drops any blank levels but does not similarly truncate the matched.distances attribute. Cleaning those up speeds things up significantly. (Variable names masked.)

> d2$match <- d$match[usematches, drop=TRUE]
> system.time({
+ mlm(y ~ x + match, data=d2)
+ })
   user  system elapsed
 12.630   0.576  13.243
> attr(d2$match, "matched.distances") <- attr(d2$match, "matched.distances")[names(attr(d2$match, "matched.distances")) %in% levels(d2$match)]

> system.time({
+ mlm(y ~ x + match, data=d2)
+ })
   user  system elapsed
  2.104   0.243   2.362
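
The attribute cleanup above can be wrapped in a small helper. This is only a sketch: trim_matched_distances is a hypothetical name, not an optmatch function, and the example uses a plain factor with a hand-attached matched.distances attribute to mimic the stale state, not a real optmatch object.

```r
# Hypothetical helper: keep only the matched.distances entries whose names
# are still levels of the (already subsetted) match factor.
trim_matched_distances <- function(m) {
  md <- attr(m, "matched.distances")
  if (!is.null(md)) {
    attr(m, "matched.distances") <- md[names(md) %in% levels(m)]
  }
  m
}

# Toy stand-in for a subsetted match: levels "a" and "b" remain,
# but the attribute still carries an entry for the dropped set "c".
sub <- factor(c("a", "a", "b", "b"))
attr(sub, "matched.distances") <- c(a = 0.1, b = 0.2, c = 0.3)

sub <- trim_matched_distances(sub)
names(attr(sub, "matched.distances"))
```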

For some context, d is an n=85000 data.frame with 17000 matched sets, which does completely choke mlm. d2 is a subset of 1000 matched sets, ending up with around n=4000.

nmatches <- 1000
usematches <- d$match %in% sample(levels(d$match), nmatches)
d2 <- d[usematches,]

The relative slowdowns I mentioned in my first comment still exist. I'll crosspost this to the appropriate issue in optmatch, #96.

benthestatistician · 2015-12-02T22:27:38Z

Thanks, @josherrickson. This prompted a couple of issues on the optmatch side, the most pertinent being optmatch #107.

Inside of setAs("optmatch", "matrix.csr",...), I think we can replace

  # list of positions of treatment member(s), then
  # control group members; by matched set
  pos.tc <- lapply( levels(from), function(lev) c(which(from==lev & zz),
                                                  which(from==lev & !zz)))

with

pos.tc <- order(from, zz)

and some subsequent adjustments for the fact that pos.tc will now be a vector rather than a list. This will speed up matters substantially.
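
A small sketch of the equivalence on toy data (from and zz below are made up, with no NAs, and are not taken from the optmatch source). One caveat worth checking: order() sorts FALSE before TRUE, so order(from, zz) lists controls first within each set. The sketch sorts on !zz to reproduce the treatment-first layout of the lapply version; which ordering the downstream code needs is an assumption here.

```r
set.seed(42)
from <- factor(sample(rep(paste0("m", 1:50), each = 3)))  # toy matched sets
zz   <- !duplicated(from)                                 # toy: first member of each set is treated

# Loop version: treatment position(s), then control positions, per matched set
pos_list <- lapply(levels(from), function(lev) {
  c(which(from == lev & zz), which(from == lev & !zz))
})

# Vectorized version: sort by set, then treated (FALSE in !zz) before controls
pos_vec <- order(from, !zz)

# The flattened list and the vector agree position by position
same <- identical(unlist(pos_list), pos_vec)

# If a per-set list is still needed downstream, it can be rebuilt cheaply
pos_split <- split(pos_vec, from[pos_vec])
```

Real optmatch objects can contain NA for unmatched units; which() silently drops those while order() places them at the end, so the adjustments mentioned above would need to account for that.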

josherrickson · 2015-12-17T16:37:36Z

283e156 addresses 2. from Ben's first comment above; in the test code I included in the commit, mlm went from ~3.7 secs to ~2.1 sec.

josherrickson · 2015-12-17T17:19:36Z

Great call on the pos.tc, Ben. With the changes in ac9a699, I'm seeing a drop from the previous ~2.1 sec to ~0.2 sec.

josherrickson mentioned this issue Dec 2, 2015

Drop used levels after subsetting markmfredrickson/optmatch#96

Closed

benthestatistician mentioned this issue Dec 2, 2015

[.optmatch should drop matched distances attribute markmfredrickson/optmatch#107

Closed

josherrickson closed this as completed Dec 17, 2015
