4tmle.Rmd

# TMLE

```{r tmlesetup01t, include=FALSE}
require(knitr)
require(glmnet)
require(kableExtra)
require(dplyr)
require(xgboost)
require(SuperLearner)
require(Publish)
require(tableone)
options(knitr.kable.NA = '')
cachex=TRUE
```

## Doubly robust estimators

Now that we have covered 

- outcome models (e.g., G-computation) and 
- exposure models (e.g., propensity score models), 

let us talk about doubly robust (DR) estimators. DR has several important properties: 

* They use information from both 
  - the exposure and 
  - the outcome models. 
* They provide a **consistent estimator** if either of the above mentioned models is correctly specified.
  - consistent estimator means as the sample size increases, distribution of the estimates gets concentrated near the true parameter
* They provide an **efficient estimator** if both the exposure and the outcome model are correctly specified. 
  - efficient estimator means estimates approximates the true parameter in terms of a chosen loss function (e.g., could be RMSE).

## TMLE

Targeted Maximum Likelihood Estimation (TMLE) is a DR method, using 

- an initial estimate from the outcome model (G-computation)
- the propensity score (exposure) model to improve.

In addition to being DR, TMLE has several other desirable properties:

* It allows the use of **data-adaptive algorithms** like machine learning without sacrificing interpretability.
  * ML is only used in intermediary steps to develop the estimator, so the optimization and interpretation of the estimator as a whole remains intact.
  * The use of machine learning can help mitigate model misspecification. 
* It has been shown to outperform other methods, particularly in **sparse data settings**. 

## TMLE Steps

According to @luque2018targeted, we need to the following steps (2-7) for obtaining point estimates when dealing with binary outcome. But as we are dealing with continuous outcome, we need an added transformation step at the beginning, and also at the end.

|  |  |
|-|-|
|Step 1| Transformation of continuous outcome variable |
|Step 2| Predict from initial outcome modelling: G-computation|
|Step 3| Predict from propensity score model|
|Step 4| Estimate clever covariate $H$ |
|Step 5| Estimate fluctuation parameter $\epsilon$|
|Step 6| Update the initial outcome model prediction based on targeted adjustment of the initial predictions using the PS model |
|Step 7| Find treatment effect estimate  |
|Step 8| Transform back the treatment effect estimate in the original outcome scale  |
|Step 9| Confidence interval estimation based on closed form formula|

- We will go through the steps of TMLE one-by-one, using the RHC dataset presented in previous chapters. 
- As a reminder, the exposure we are considering is RHC (right heart catheterization) and the outcome of interest is length of stay in the hospital. 

```{r dataload_02, cache=TRUE, echo = TRUE}
# Read the data saved at the last chapter
ObsData <- readRDS(file = "data/rhcAnalytic.RDS")
```

## Step 1: Transformation of Y

In our example, the outcome is continuous. 

```{r transf1, cache=TRUE, echo = TRUE}
summary(ObsData$Y)
plot(density(ObsData$Y), main = "Observed Y")
```

```{block, type='rmdcomment'}
General recommendation is to **transform** continuous outcome to be within the range [0,1] [@gruber2010targeted].
```

<!--- However, it is recommended that even with continuous outcomes we use a log-likelihood loss function for the maximum likelihood estimation, as we would with a binary outcome.  --->


```{r transf, cache=TRUE, echo = TRUE}
min.Y <- min(ObsData$Y)
max.Y <- max(ObsData$Y)
ObsData$Y.bounded <- (ObsData$Y-min.Y)/(max.Y-min.Y)
```

Check the range of our transformed outcome variable

```{r transf2, cache=TRUE, echo = TRUE}
summary(ObsData$Y.bounded)
```

## Step 2: Initial G-comp estimate

```{block, type='rmdcomment'}
We construct our outcome model, and make our initial predictions. 
```

For this step, we will use **SuperLearner**. This requires no apriori assumptions about the structure of our outcome model. 

```{r SL_out0, cache=TRUE, message=FALSE, warning=FALSE}
library(SuperLearner)
set.seed(123)
ObsData.noY <- dplyr::select(ObsData, !c(Y,Y.bounded))
Y.fit.sl <- SuperLearner(Y=ObsData$Y.bounded, 
                       X=ObsData.noY, 
                       cvControl = list(V = 3),
                       SL.library=c("SL.glm", 
                                    "SL.glmnet", 
                                    "SL.xgboost"),
                       method="method.CC_nloglik", 
                       family="gaussian")
```

```{r SL_out01x, cache=TRUE}
ObsData$init.Pred <- predict(Y.fit.sl, newdata = ObsData.noY, 
                           type = "response")$pred

summary(ObsData$init.Pred)
# alternatively, we could write
# ObsData$init.Pred <- Y.fit.sl$SL.predict
```

- We will use these initial prediction values later. 

```{block, type='rmdcomment'}
$Q^0(A,L)$ is often used to represent the predictions from initial G-comp model.
```

### Get predictions under both treatments $A = 0$ and $1$

- We could estimate the treatment effect from this initial model.
- We will need the $Q^0(A=1,L)$ and $Q^0(A=0,L)$ predictions later.
- $Q^0(A=1,L)$ predictions:

```{r SL_out01, cache=TRUE}
ObsData.noY$A <- 1
ObsData$Pred.Y1 <- predict(Y.fit.sl, newdata = ObsData.noY, 
                           type = "response")$pred
summary(ObsData$Pred.Y1)
```

-  $Q^0(A=0,L)$ predictions:

```{r SL_out02, cache=TRUE}
ObsData.noY$A <- 0
ObsData$Pred.Y0 <- predict(Y.fit.sl, newdata = ObsData.noY, 
                           type = "response")$pred
summary(ObsData$Pred.Y0)
```

### Get initial treatment effect estimate

```{r SL_out03, cache=cachex, echo = TRUE}
ObsData$Pred.TE <- ObsData$Pred.Y1 - ObsData$Pred.Y0   
```

```{r SL_out04, cache=cachex, echo = TRUE}
summary(ObsData$Pred.TE) 
```

```{r SL_out, cache=TRUE, message=FALSE, warning=FALSE, include = FALSE}
# # specify the library of machine learning algorithms our SuperLearner should use
# # we will use the same set that is the default set in the tmle package
# Q.SL.library = c("SL.glm", "SL.xgboost", "SL.glmnet")
# # check number of missing values, and omit rows with missing values if there are not too many
# sapply(ObsData, function(x) sum(is.na(x))) 
# ObsData <- na.omit(ObsData) 
# # fit our SuperLearner
# X <- dplyr::select(ObsData, !Y)
# QbarSL <- SuperLearner(Y=ObsData$Y, X=X, SL.library=Q.SL.library, family="gaussian", method="method.CC_nloglik")
# # make predictions with our fitted model
# QbarAW <- QbarSL$SL.predict
# 
# # predict the counterfactual outcomes under treatment/no treatment for each observation
# X1 <- X
# X1$A <- 1
# X0 <- X
# X0$A <- 0
# # predicted outcome under treatment
# Qbar1W<- SuperLearner(Y=ObsData$Y, X=X, SL.library=Q.SL.library, family="gaussian", method="method.CC_nloglik", newX=X1)$SL.predict  
# # predicted outcome under no treatment
# Qbar0W<- SuperLearner(Y=ObsData$Y, X=X, SL.library=Q.SL.library, family="gaussian", method="method.CC_nloglik", newX=X0)$SL.predict   
# 
# # initial estimate of the effect of A on Y:
# PsiHat.SS <- mean(Qbar1W - Qbar0W)
# cat("/n Our initial estimate of the effect of RHC on length of stay is: ", PsiHat.SS)
```

## Step 3: PS model

At this point, we have our initial estimate and now want to perform our targeted improvement. 

```{r SL_out0ps, cache=TRUE, message=FALSE, warning=FALSE}
library(SuperLearner)
set.seed(124)
ObsData.noYA <- dplyr::select(ObsData, !c(Y,Y.bounded,
                                          A,init.Pred,
                                          Pred.Y1,Pred.Y0,
                                          Pred.TE))
PS.fit.SL <- SuperLearner(Y=ObsData$A, 
                       X=ObsData.noYA, 
                       cvControl = list(V = 3),
                       SL.library=c("SL.glm", 
                                    "SL.glmnet", 
                                    "SL.xgboost"),
                       method="method.CC_nloglik",
                       family="binomial")  
```


```{r SL_out01ps2, cache=TRUE}
all.pred <- predict(PS.fit.SL, type = "response")
ObsData$PS.SL <- all.pred$pred 
```

```{block, type='rmdcomment'}
These propensity score predictions (`PS.SL`) are represented as $g(A_i=1|L_i)$.
```

- We can estimate $g(A_i=0|L_i)$ as $1 - g(A_i=1|L_i)$ or `1 - PS.SL`.

```{r ipw2psx2btmle, cache=TRUE, echo = TRUE}
summary(ObsData$PS.SL)
tapply(ObsData$PS.SL, ObsData$A, summary)
plot(density(ObsData$PS.SL[ObsData$A==0]), 
     col = "red", main = "")
lines(density(ObsData$PS.SL[ObsData$A==1]), 
      col = "blue", lty = 2)
legend("topright", c("No RHC","RHC"), 
       col = c("red", "blue"), lty=1:2) 
```


```{r SL_ps, cache=TRUE, message=FALSE, error = FALSE, warning=FALSE, include = FALSE}
# library(SuperLearner)
# set.seed(123) 
# 
# # specify the library of machine learning algorithms our SuperLearner should use
# # we will use the same set that is the default set in the tmle package
# G.SL.library = c("SL.glm", "SL.gam", "tmle.SL.dbarts.k.5")
# # construct the propensity score model using SuperLearner
# gHatSL <- SuperLearner(Y=ObsData$A, X=subset(ObsData, select= -c(A,Y)), 
#                        SL.library=G.SL.library, family="binomial")
# # get the probability of receiving each treatment for each observation
# gHat1W <- gHatSL$SL.predict # predicted probabilities of A=1 given baseline chars
# gHat0W <- 1 - gHat1W
# # get the probability of receiving the treatment they did receive for each observation
# gHatAW <- rep(NA, nrow(ObsData))
# gHatAW[ObsData$A==1] <- gHat1W[ObsData$A==1]
# gHatAW[ObsData$A==0] <- gHat0W[ObsData$A==0]
```

## Step 4: Estimate $H$ 

```{block, type='rmdcomment'}
Clever covariate $H(A_i, L_i) = \frac{I(A_i=1)}{g(A_i=1|L_i)} - \frac{I(A_i=0)}{g(A_i=0|L_i)}$ [@luque2018targeted]
```

```{r hestimate2, cache=TRUE, warning=FALSE}
ObsData$H.A1L <- (ObsData$A) / ObsData$PS.SL 
ObsData$H.A0L <- (1-ObsData$A) / (1- ObsData$PS.SL)
ObsData$H.AL <- ObsData$H.A1L - ObsData$H.A0L
summary(ObsData$H.AL)
tapply(ObsData$H.AL, ObsData$A, summary)
t(apply(cbind(-ObsData$H.A0L,ObsData$H.A1L), 
      2, summary)) 
```

Aggregated or individual clever covariate components show slight difference in their summaries.

## Step 5: Estimate $\epsilon$

```{block, type='rmdcomment'}
Fluctuation parameter $\epsilon$, representing **how large of an adjustment we will make** to the initial estimate. 
```

- The fluctuation parameter $\hat\epsilon$ could be 
  - a scalar or 
  - a vector with 2 components $\hat\epsilon_0$ and $\hat\epsilon_1$. 
- It  is estimated through MLE, using a model with an offset based on the initial estimate, and clever covariates as independent variables [@gruber2009targeted]:

  $E(Y|A,L)(\epsilon) = \frac{1}{1+\exp(-\log\frac{\bar Q^0(A,L)}{(1-\bar Q^0(A,L))}-\epsilon \times H(A,L))}$ 

### $\hat\epsilon$ = $\hat\epsilon_0$ and $\hat\epsilon_1$ 

This is closer to how `tmle` package has implement clever covariates

```{r eestimate, cache=TRUE, warning=FALSE}
eps_mod <- glm(Y.bounded ~ -1 + H.A1L + H.A0L +  
                 offset(qlogis(init.Pred)), 
               family = "binomial",
               data = ObsData)
epsilon <- coef(eps_mod)  
epsilon["H.A1L"]
epsilon["H.A0L"] 
```

Note that, if `init.Pred` includes negative values, `NaNs` would be produced after applying `qlogis()`.

### Only 1 $\hat\epsilon$

For demonstration purposes

```{r eestimate2, cache=TRUE, warning=FALSE}
eps_mod1 <- glm(Y.bounded ~ -1 + H.AL +
                 offset(qlogis(init.Pred)),
               family = "binomial",
               data = ObsData)
epsilon1 <- coef(eps_mod1) 
epsilon1 
```

Alternative could be to use `H.AL` as weights (not shown here).

## Step 6: Update

### $\hat\epsilon$ = $\hat\epsilon_0$ and $\hat\epsilon_1$ 

We can use `epsilon["H.A1L"]` and `epsilon["H.A0L"]` to update

```{r teestimate, cache=TRUE, warning=FALSE}
ObsData$Pred.Y1.update <- plogis(qlogis(ObsData$Pred.Y1) +  
                                   epsilon["H.A1L"]*ObsData$H.A1L)
ObsData$Pred.Y0.update <- plogis(qlogis(ObsData$Pred.Y0) + 
                                   epsilon["H.A0L"]*ObsData$H.A0L)
summary(ObsData$Pred.Y1.update)
summary(ObsData$Pred.Y0.update)  
```

### Only 1 $\hat\epsilon$

Alternatively, we could use `epsilon` to from `H.AL` to update

```{r teestimateb, cache=TRUE, warning=FALSE}
ObsData$Pred.Y1.update1 <- plogis(qlogis(ObsData$Pred.Y1) +  
                                   epsilon1*ObsData$H.AL)
ObsData$Pred.Y0.update1 <- plogis(qlogis(ObsData$Pred.Y0) + 
                                   epsilon1*ObsData$H.AL)
summary(ObsData$Pred.Y1.update1)
summary(ObsData$Pred.Y0.update1)   
```

Note that, if `Pred.Y1` and `Pred.Y0` include negative values, `NaNs` would be produced after applying `qlogis()`.

```{r hestimate, cache=TRUE, warning=FALSE, include = FALSE}
# # clever covariates
# H1W <- ObsData$A / gHat1W
# H0W <- (1-ObsData$A) / gHat0W
# 
# # fluctuation parameter
# eps_mod <- glm(ObsData$Y ~ -1 + H0W + H1W + offset(qlogis(QbarAW)), family = "binomial")
# epsilon <- coef(eps_mod)
# cat("Epsilon:", epsilon)
# 
# # updated estimates
# Q0W_1 <- plogis(qlogis(Qbar0W) + epsilon[1]*H0W)
# Q1W_1 <- plogis(qlogis(Qbar1W) + epsilon[2]*H1W)
```

## Step 7: Effect estimate

Now that the updated predictions of our outcome models are calculated, we can calculate the ATE.

### $\hat\epsilon$ = $\hat\epsilon_0$ and $\hat\epsilon_1$ 

```{r teestimate2, cache=TRUE, warning=FALSE}
ATE.TMLE.bounded.vector <- ObsData$Pred.Y1.update -  
                           ObsData$Pred.Y0.update
summary(ATE.TMLE.bounded.vector) 
ATE.TMLE.bounded <- mean(ATE.TMLE.bounded.vector, 
                         na.rm = TRUE) 
ATE.TMLE.bounded 
```

### Only 1 $\hat\epsilon$

Alternatively, using `H.AL`:

```{r teestimate2b, cache=TRUE, warning=FALSE}
ATE.TMLE.bounded.vector1 <- ObsData$Pred.Y1.update1 -  
                           ObsData$Pred.Y0.update1
summary(ATE.TMLE.bounded.vector1) 
ATE.TMLE.bounded1 <- mean(ATE.TMLE.bounded.vector1, 
                         na.rm = TRUE) 
ATE.TMLE.bounded1 
```

## Step 8: Rescale effect estimate

We make sure to transform back to our original scale. 

### $\hat\epsilon$ = $\hat\epsilon_0$ and $\hat\epsilon_1$ 

```{r meantmle, cache=TRUE}
ATE.TMLE <- (max.Y-min.Y)*ATE.TMLE.bounded   
ATE.TMLE 
```

### Only 1 $\hat\epsilon$

Alternatively, using `H.AL`:

```{r meantmle2, cache=TRUE}
ATE.TMLE1 <- (max.Y-min.Y)*ATE.TMLE.bounded1
ATE.TMLE1 
```

## Step 9: Confidence interval estimation

- Since the machine learning algorithms were used only in intermediary steps, rather than estimating our parameter of interest directly, 95% confidence intervals can be calculated directly [@luque2018targeted]. 

```{block, type='rmdcomment'}
Based on semi-parametric theory, closed form variance formula is already derived [@van2012targeted]. 
```

- Time-consuming bootstrap procedure is not necessary.

```{r tmleinf2, cache=TRUE}
ci.estimate <- function(data = ObsData, H.AL.components = 1){
  min.Y <- min(data$Y)
  max.Y <- max(data$Y)
  # transform predicted outcomes back to original scale
  if (H.AL.components == 2){
    data$Pred.Y1.update.rescaled <- 
      (max.Y- min.Y)*data$Pred.Y1.update + min.Y
    data$Pred.Y0.update.rescaled <- 
      (max.Y- min.Y)*data$Pred.Y0.update + min.Y
  } 
  if (H.AL.components == 1) {
    data$Pred.Y1.update.rescaled <- 
      (max.Y- min.Y)*data$Pred.Y1.update1 + min.Y
    data$Pred.Y0.update.rescaled <- 
      (max.Y- min.Y)*data$Pred.Y0.update1 + min.Y
  }
  EY1_TMLE1 <- mean(data$Pred.Y1.update.rescaled, 
                    na.rm = TRUE)
  EY0_TMLE1 <- mean(data$Pred.Y0.update.rescaled, 
                    na.rm = TRUE)
  # ATE efficient influence curve
  D1 <- data$A/data$PS.SL*
    (data$Y - data$Pred.Y1.update.rescaled) + 
    data$Pred.Y1.update.rescaled - EY1_TMLE1
  D0 <- (1 - data$A)/(1 - data$PS.SL)*
    (data$Y - data$Pred.Y0.update.rescaled) + 
    data$Pred.Y0.update.rescaled - EY0_TMLE1
  EIC <- D1 - D0
  # ATE variance
  n <- nrow(data)
  varHat.IC <- var(EIC, na.rm = TRUE)/n
  # ATE 95% CI
  if (H.AL.components == 2) {
    ATE.TMLE.CI <- c(ATE.TMLE - 1.96*sqrt(varHat.IC), 
                   ATE.TMLE + 1.96*sqrt(varHat.IC))
  }
  if (H.AL.components == 1) {
    ATE.TMLE.CI <- c(ATE.TMLE1 - 1.96*sqrt(varHat.IC), 
                   ATE.TMLE1 + 1.96*sqrt(varHat.IC))
  }
  return(ATE.TMLE.CI) 
}
```

### $\hat\epsilon$ = $\hat\epsilon_0$ and $\hat\epsilon_1$ 

```{r tmleinf2b, cache=TRUE}
CI2 <- ci.estimate(data = ObsData, H.AL.components = 2) 
CI2
```

### Only 1 $\hat\epsilon$

```{r tmleinf2c, cache=TRUE}
CI1 <- ci.estimate(data = ObsData, H.AL.components = 1) 
CI1
```

```{r tmleinf, cache=TRUE, include = FALSE}
# # transform predicted outcomes back to original scale
# Q1W_1_sc <- (b-a)*Q1W_1+a
# Q0W_1_sc <- (b-a)*Q0W_1+a
# 
# EY1_TMLE1 <- mean(Q1W_1_sc)
# EY0_TMLE1 <- mean(Q0W_1_sc)
# 
# # ATE efficient influence curve
# D1 <- ObsData$A/gHat1W*(ObsData$Y - Q1W_1_sc) + Q1W_1_sc - EY1_TMLE1
# D0 <- (1 - ObsData$A)/(1 - gHat1W)*(ObsData$Y - Q0W_1_sc) + Q0W_1_sc - EY0_TMLE1
# EIC <- D1 - D0
# #ATE variance
# n <- nrow(ObsData)
# varHat.IC <- var(EIC)/n
# #ATE 95% CI
# ATE_TMLE1_CI <- c(ATE_TMLE1 - 1.96*sqrt(varHat.IC), ATE_TMLE1 + 1.96*sqrt(varHat.IC))
# cat("ATE: ", ATE_TMLE1, "  (", ATE_TMLE1_CI[1], ", ", ATE_TMLE1_CI[2], ")", sep = "")
```

```{r, cache=TRUE, echo = TRUE}
saveRDS(ATE.TMLE, file = "data/tmlepointh.RDS") 
saveRDS(CI2, file = "data/tmlecih.RDS")
```