bootR2

An R package for fast, Rcpp-based computation of predictive R-squared statistics via the nonparametric bootstrap. Heavy use is made of RcppEigen.

Predictive R-squared statistics use two different samples: one for training a linear model and one for validating it. The bootR2 package draws two nonparametric bootstrap samples, to be used as training and validation data, and calculates a predictive R-squared from them. Repeating this many times yields a bootstrap distribution, which provides a better assessment of the proportion of variation that the fitted model could explain in other samples from the population. NOTE: the adjusted R-squared does not provide this assessment.
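
For reference, the statistic computed for a single pair of bootstrap samples can be written as follows. The notation is an assumption introduced here, chosen to match the pure R implementation in the Example section: subscript t marks the training resample, subscript v the validation resample, and \hat\beta_t is the least-squares coefficient vector fitted to the training resample.

```latex
R^2_{\mathrm{pred}}
  = 1 - \frac{\sum_{i=1}^{n} \left( y_{v,i} - x_{v,i}^{\top} \hat\beta_t \right)^2}
             {\sum_{i=1}^{n} \left( y_{v,i} - \bar{y}_v \right)^2},
\qquad
\bar{y}_v = \frac{1}{n} \sum_{i=1}^{n} y_{v,i}
```

Unlike the in-sample R-squared, the numerator is not minimised over the validation data, so the ratio can exceed one.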

Why?

The technometrics literature has argued for quite a while that the adjusted R-squared statistic is biased as an estimate of the ability of a fitted model to explain variation in the population. However, analytical bias corrections:

are challenging to construct
have been proposed in large numbers, with none dominating the others in all settings
do not readily yield estimates of uncertainty in the point estimates

The nonparametric bootstrap is a good alternative, but obviously slower. By using RcppEigen, we are able to compute fast approximate sampling distributions of out-of-sample predictive R-squared statistics.

Example

library(bootR2)
## simulate data
set.seed(1)
n <- 100  # number of observations (e.g. cases/sites)
p <- 10  # number of explanatory variables
X <- cbind(1, matrix(rnorm(n * p), n, p))  # model matrix
y <- X %*% rnorm(p + 1, sd = 0.1) + rnorm(n)  # response variable

## compute bootstrap sample and give its summary
yBootR2 <- bootR2(X, y, nBoot = 10000)
summary(yBootR2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5490  0.0276  0.1110  0.0992  0.1880  0.4550
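
Beyond the five-number summary, a bootstrap vector also gives an interval estimate directly. The following is a minimal base R sketch of that idea, not the package's own code: lm.fit() stands in for the compiled least-squares routine, and the helper name oneRep and the choice of 2000 replicates are assumptions made here for illustration. A 95% percentile interval is only one of several possible bootstrap intervals.

```r
## Base R sketch: percentile interval for a predictive R-squared.
## lm.fit() stands in for the package's compiled least-squares code.
set.seed(1)
n <- 100
p <- 10
X <- cbind(1, matrix(rnorm(n * p), n, p))
y <- drop(X %*% rnorm(p + 1, sd = 0.1) + rnorm(n))

## oneRep (hypothetical helper): fit on one resample, validate on another
oneRep <- function(X, y) {
    n <- nrow(X)
    it <- sample.int(n, n, replace = TRUE)  # training rows
    iv <- sample.int(n, n, replace = TRUE)  # validation rows
    b <- lm.fit(X[it, , drop = FALSE], y[it])$coefficients
    res <- y[iv] - drop(X[iv, , drop = FALSE] %*% b)
    1 - sum(res^2) / sum((y[iv] - mean(y[iv]))^2)
}

r2 <- replicate(2000, oneRep(X, y))
ci <- quantile(r2, c(0.025, 0.975))  # 95% percentile interval
ci
```

The same quantile() call should apply unchanged to the vector returned by bootR2(), assuming it can be treated as a plain numeric vector, as the summary() and hist() calls in this example suggest.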

hist(yBootR2, 50, main = "", xlab = expression(paste("Predictive ", R^2)))

Note the bootstrap samples with negative R-squared statistics. This makes sense for out-of-sample predictions: a negative value indicates that the squared prediction errors on the validation data exceed the variation of the validation data about their own mean.
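
A tiny worked example may make the sign of the statistic concrete. The helper predR2 below is hypothetical (it is not part of the package) and simply applies the sum-of-squares formula to a validation response and a vector of predictions.

```r
## predR2 (hypothetical helper): sum-of-squares predictive R-squared
predR2 <- function(yv, fitHat) {
    SSerr <- sum((yv - fitHat)^2)    # squared prediction errors
    SStot <- sum((yv - mean(yv))^2)  # variation about the validation mean
    1 - SSerr / SStot
}

yv <- c(1, 2, 3)
predR2(yv, c(1, 2, 3))  # perfect predictions: 1
predR2(yv, c(2, 2, 2))  # predicting the validation mean: 0
predR2(yv, c(3, 2, 1))  # predictions worse than the mean: -3
```

Any set of predictions that does worse than the constant validation mean gives a negative value, which cannot happen for an in-sample least-squares fit that includes an intercept.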

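## pure R version of a single bootstrap replicate, for comparison
bootR2pureR1 <- function(X, y) {
    n <- nrow(X)
    ## resample row indices with replacement: training and validation
    prmt <- sample.int(n, n, replace = TRUE)
    prmv <- sample.int(n, n, replace = TRUE)
    Xt <- X[prmt, ]
    Xv <- X[prmv, ]
    yt <- y[prmt]
    yv <- y[prmv]
    ## least-squares coefficients from the training sample
    betaHat <- bootR2:::betaHat(Xt, yt)
    ## predictions for the validation sample
    fitHat <- Xv %*% betaHat
    SSerr <- sum((yv - fitHat)^2)
    SStot <- sum((yv - mean(yv))^2)
    1 - (SSerr/SStot)
}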
bootR2pureR <- function(X, y, nBoot = 1) replicate(nBoot, bootR2pureR1(X, y))
yBootR2pureR <- bootR2pureR(X, y, 10000)
hist(yBootR2pureR, 50, main = "", xlab = expression(paste("Predictive ", R^2)))

library(rbenchmark)
benchmark(bootR2(X, y, nBoot = 10000), bootR2pureR(X, y, nBoot = 10000), order = "relative", 
    replications = 10)

##                               test replications elapsed relative user.self
## 1      bootR2(X, y, nBoot = 10000)           10   2.228    1.000     2.174
## 2 bootR2pureR(X, y, nBoot = 10000)           10   7.468    3.352     7.459
##   sys.self user.child sys.child
## 1    0.054          0         0
## 2    0.009          0         0

Limitations

currently only least-squares fits are permitted
currently only sum-of-squares-based R-squared statistics are computed
currently only nonparametric bootstraps are allowed

Interested in helping remove these limitations?
