README.Rmd

---
output: github_document
---


[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/ctmle)](https://CRAN.R-project.org/package=ctmle)
[![](http://cranlogs.r-pkg.org/badges/ctmle)](https://CRAN.R-project.org/package=ctmle) [![](http://cranlogs.r-pkg.org/badges/grand-total/ctmle)](https://CRAN.R-project.org/package=ctmle)


<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
      collapse = TRUE,
      comment = "#>",
      fig.path = "README-"
)
```


# Collaborative Targeted Maximum Likelihood Estimation 

Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) is an extention of Targeted Maximum Likelihood Estimation (TMLE). It applies variable/model selection for nuisance parameter (e.g. the propensity score) estimation in a 'collaborative' way, by directly optimizing the empirical metric on the causal estimator.

In this package, we implemented the general template of C-TMLE, for the estimation of the average treatment effect (ATE). 

The package also offers convenient functions for discrete C-TMLE for variable selection, and LASSO-C-TMLE for model selection of LASSO, in estimation of the propensity score (PS).

# Installation

To install the CRAN release version of `ctmle`:

```R
install.packages('ctmle')
```

To install the development version (requires the devtools package):

```R
devtools::install_github('jucheng1992/ctmle')
```


## C-TMLE for variable selection

In this section, we start with examples of discrete C-TMLE for variable selection, using greedy forward searching, and scalable discrete C-TMLE with pre-ordering option.


```{r,eval=TRUE}
library(ctmle)
library(dplyr)
set.seed(123)

N <- 1000
p = 5
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
beta0 <- 2+2*Wmat[,1]+2*Wmat[,2]+2*Wmat[,5]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)

g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)

epsilon <-rnorm(N, 0, 1)
Y  <- beta0 + tau * A + epsilon

# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))

time_greedy <- system.time(
      ctmle_discrete_fit1 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
                                           preOrder = FALSE, detailed = TRUE)
)
ctmle_discrete_fit2 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat),
                                     preOrder = FALSE, detailed = TRUE)


time_preorder <- system.time(
      ctmle_discrete_fit3 <- ctmleDiscrete(Y = Y, A = A, W = data.frame(Wmat), Q = Q,
                                           preOrder = TRUE,
                                           order = rev(1:p), detailed = TRUE)
)
```

Scalable (discrete) C-TMLE takes much less computation time:
```{r,eval = TRUE}
time_greedy
time_preorder
```

Show the brief results from greedy CTMLE:
```{r,eval = TRUE}
ctmle_discrete_fit1
```
Summary function offers detial information of which variable is selected.
```{r,eval = TRUE}
summary(ctmle_discrete_fit1)
```


## LASSO-C-TMLE for model selection of LASSO

In this section, we introduce the LASSO-C-TMLE algorithm for model selection of LASSO in the estimation of the propensity score. We implemented three variations of the LASSO-C-TMLE algorithm. For simplicity, we call them C-TMLE1-3. See technical details in the corresponding references.

```{r,eval = TRUE}
# Generate high-dimensional data
set.seed(123)

N <- 1000
p = 100
Wmat <- matrix(rnorm(N * p), ncol = p)
beta1 <- 4 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
beta0 <- 2 + 2 * Wmat[,1] + 2 * Wmat[,2] + 2 * Wmat[,5] + 2 * Wmat[,6] + 2 * Wmat[,8]
tau <- 2
gcoef <- matrix(c(-1,-1,rep(-(3/((p)-2)),(p)-2)),ncol=1)
W <- as.matrix(Wmat)

g <- 1/(1+exp(W%*%gcoef /3))
A <- rbinom(N, 1, prob = g)

epsilon <-rnorm(N, 0, 1)
Y  <- beta0 + tau * A + epsilon

# With initial estimate of Q
Q <- cbind(rep(mean(Y[A == 0]), N), rep(mean(Y[A == 1]), N))

glmnet_fit <- cv.glmnet(y = A, x = W, family = 'binomial', nlambda = 20)
```

We start build a sequence of lambdas from the lambda selected by cross-validation, as the model selected by cv.glmnet would over-smooth w.r.t. the target parameter.

```{r}
lambdas <- glmnet_fit$lambda[(which(glmnet_fit$lambda==glmnet_fit$lambda.min)):length(glmnet_fit$lambda)]
```

We fit C-TMLE1 algorithm by feed the algorithm with a vector of lambda, in decreasing order:
```{r}
time_ctmlelasso1 <- system.time(
      ctmle_fit1 <- ctmleGlmnet(Y = Y, A = A,
                                W = data.frame(W = W),
                                Q = Q, lambdas = lambdas, ctmletype=1, 
                                family="gaussian",gbound=0.025, V=5)
)
```


We fit C-TMLE2 algorithm:
```{r}
time_ctmlelasso2 <- system.time(
      ctmle_fit2 <- ctmleGlmnet(Y = Y, A = A,
                                W = data.frame(W = W),
                                Q = Q, lambdas = lambdas, ctmletype=2, 
                                family="gaussian",gbound=0.025, V=5)
)
```

For C-TMLE3, we need two gn estimators, one with lambda selected by cross-validation, and the other with lambda slightly different from the selected lambda:
```{r}
gcv <- predict.cv.glmnet(glmnet_fit, newx=W, s="lambda.min",type="response")
gcv <- bound(gcv,c(0.025,0.975))

s_prev <- glmnet_fit$lambda[(which(glmnet_fit$lambda == glmnet_fit$lambda.min))] * (1+5e-2)
gcvPrev <- predict.cv.glmnet(glmnet_fit,newx = W,s = s_prev,type="response")
gcvPrev <- bound(gcvPrev,c(0.025,0.975))

time_ctmlelasso3 <- system.time(
      ctmle_fit3 <- ctmleGlmnet(Y = Y, A = A, W = W, Q = Q,
                                ctmletype=3, g1W = gcv, g1WPrev = gcvPrev,
                                family="gaussian",
                                gbound=0.025, V = 5)
)
```

Les't compare the running time for each LASSO-C-TMLE
 
```{r,eval = TRUE}
time_ctmlelasso1
time_ctmlelasso2
time_ctmlelasso3
```

Finally, we compare three C-TMLE estimates:

```{r,eval = TRUE}
ctmle_fit1
ctmle_fit2
ctmle_fit3
```

Show which regularization parameter (lambda) is selected by C-TMLE1:

```{r,eval = TRUE}
lambdas[ctmle_fit1$best_k]
```

In comparison, we show which regularization parameter (lambda) is selected by cv.glmnet:
```{r,eval = TRUE}
glmnet_fit$lambda.min
```

## Advanced topic: the general template of C-TMLE

In this section, we briefly introduce the general template of C-TMLE. In this function, the gn candidates could be a user-specified matrix, each column stand for the estimated PS for each unit. The estimators should be ordered by their empirical fit.

As C-TMLE requires cross-validation, it needs two gn estimate: one from cross-validated prediction, one from a vanilla prediction. For example, consider 5-folds cross-validation, where argument `folds` is the list of indices for each folds, then the (i,j)-th element in input `gn_candidates_cv` should be the predicted value of i-th unit, predicted by j-th unit, trained by other 4 folds where all of them do not contain i-th unit. `gn_candidates` should be just the predicted PS for each estimator trained on the whole data.

We could easily use `SuperLearner` package and `build_gn_seq` function to easily achieve this:

```{r,eval = TRUE}
lasso_fit <- cv.glmnet(x = as.matrix(W), y = A, alpha = 1, nlambda = 100, nfolds = 10)
lasso_lambdas <- lasso_fit$lambda[lasso_fit$lambda <= lasso_fit$lambda.min][1:5]

# Build SL template for glmnet
SL.glmnet_new <- function(Y, X, newX, family, obsWeights, id, alpha = 1,
                           nlambda = 100, lambda = 0,...){
      # browser()
      if (!is.matrix(X)) {
            X <- model.matrix(~-1 + ., X)
            newX <- model.matrix(~-1 + ., newX)
      }
      fit <- glmnet::glmnet(x = X, y = Y,
                            lambda = lambda,
                            family = family$family, alpha = alpha)
      pred <- predict(fit, newx = newX, type = "response")
      fit <- list(object = fit)
      class(fit) <- "SL.glmnet"
      out <- list(pred = pred, fit = fit)
      return(out)
}

# Use a sequence of estimator to build gn sequence:
SL.cv1lasso <- function (... , alpha = 1, lambda = lasso_lambdas[1]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.cv2lasso <- function (... , alpha = 1, lambda = lasso_lambdas[2]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.cv3lasso <- function (... , alpha = 1, lambda = lasso_lambdas[3]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.cv4lasso <- function (... , alpha = 1, lambda = lasso_lambdas[4]){
      SL.glmnet_new(... , alpha = alpha, lambda = lambda)
}

SL.library = c('SL.cv1lasso', 'SL.cv2lasso', 'SL.cv3lasso', 'SL.cv4lasso', 'SL.glm')
```

Construct the object `folds`, which is a list of indices for each fold
```{r}
V = 5
folds <-by(sample(1:N,N), rep(1:V, length=N), list)
```

Use `folds` and SuperLearner template to compute `gn_candidates` and `gn_candidates_cv`

```{r}
gn_seq <- build_gn_seq(A = A, W = W, SL.library = SL.library, folds = folds)
```

Lets look at the output of `build_gn_seq`

```{r,eval = TRUE}
gn_seq$gn_candidates %>% dim
gn_seq$gn_candidates_cv %>% dim
gn_seq$folds %>% length
```

Then we could use `ctmleGeneral` algorithm. As input estimator is already trained, it is much faster than previous C-TMLE algorithms. 

*Note: we recommand use the same `folds` as `build_gn_seq` for `ctmleGeneral`, to make cross-validation objective.*

```{r,eval = TRUE}
ctmle_general_fit1 <- ctmleGeneral(Y = Y, A = A, W = W, Q = Q,
                                   ctmletype = 1, 
                                   gn_candidates = gn_seq$gn_candidates,
                                   gn_candidates_cv = gn_seq$gn_candidates_cv,
                                   folds = folds, V = 5)

ctmle_general_fit1

```

## Citation

If you used `ctmle` package in your research, please cite:

>Ju, Cheng; Susan, Gruber; van der Laan, Mark J.; ctmle: Collaborative Targeted Maximum Likelihood Estimation. R package version 0.1.1, https://CRAN.R-project.org/package=ctmle.

```{bibtex,eval = FALSE}
@Manual{,
    title = {ctmle: Collaborative Targeted Maximum Likelihood Estimation},
    author = {Cheng Ju and Susan Gruber and Mark van der Laan},
    year = {2017},
    note = {R package version 0.1.1},
    url = {https://CRAN.R-project.org/package=ctmle},
}
```

## References (by inverse chronological order)

### C-TMLE for Adaptive Propensity Score Truncation

>Ju, Cheng, Joshua Schwab, and Mark J. van der Laan. "On adaptive propensity score truncation in causal inference." Statistical methods in medical research 28.6 (2019): 1741-1760.

### LASSO-C-TMLE  

>Ju, Cheng, et al. "Collaborative-controlled LASSO for constructing propensity score-based estimators in high-dimensional data." Statistical methods in medical research 28.4 (2019): 1044-1063.

### Scalable Discrete C-TMLE with Pre-ordering
>Ju, Cheng, et al. "Scalable collaborative targeted learning for high-dimensional data." Statistical methods in medical research 28.2 (2019): 532-554.

### Discrete C-TMLE with Greedy Search
>Susan, Gruber, and van rder Laan, Mark J.. "An Application of Collaborative Targeted Maximum Likelihood Estimation in Causal Inference and Genomics." The International Journal of Biostatistics 6.1 (2010): 1-31.

### General Template of C-TMLE
>van der Laan, Mark J., and Susan Gruber. "Collaborative double robust targeted maximum likelihood estimation." The international journal of biostatistics 6.1 (2010): 1-71.

### C-TMLE for Model Selection 

>In preperation