This example demonstrates how the bootstrap can be used to explore model uncertainty.
library(knitr)
library(rms) # for validate function
library(MASS) # for stepAIC
Use the set.seed() function in R to initialize the random number generator.
set.seed(2041971)
Read in the data:
dace<- read.csv("data/longnosedace.csv")
Fit a model using all 6 predictors, then use stepAIC to implement backwards selection to choose a “best” model.
fullmod.lm<-lm(longnosedace~acreage+do2+maxdepth+no3+so4+temp,data=dace)
stepAIC(fullmod.lm)
## Start: AIC=511.82
## longnosedace ~ acreage + do2 + maxdepth + no3 + so4 + temp
##
## Df Sum of Sq RSS AIC
## - so4 1 0.2 102787 509.82
## - do2 1 2165.8 104952 511.24
## <none> 102787 511.82
## - temp 1 4432.8 107219 512.69
## - maxdepth 1 6638.2 109425 514.08
## - no3 1 11876.0 114663 517.26
## - acreage 1 14230.1 117017 518.64
##
## Step: AIC=509.82
## longnosedace ~ acreage + do2 + maxdepth + no3 + temp
##
## Df Sum of Sq RSS AIC
## - do2 1 2169.2 104956 509.24
## <none> 102787 509.82
## - temp 1 4447.6 107234 510.70
## - maxdepth 1 6668.3 109455 512.10
## - no3 1 11935.8 114723 515.29
## - acreage 1 14268.0 117055 516.66
##
## Step: AIC=509.24
## longnosedace ~ acreage + maxdepth + no3 + temp
##
## Df Sum of Sq RSS AIC
## - temp 1 2948.0 107904 509.13
## <none> 104956 509.24
## - maxdepth 1 6108.5 111064 511.09
## - acreage 1 14588.0 119544 516.09
## - no3 1 16501.4 121457 517.17
##
## Step: AIC=509.13
## longnosedace ~ acreage + maxdepth + no3
##
## Df Sum of Sq RSS AIC
## <none> 107904 509.13
## - maxdepth 1 6058.4 113962 510.84
## - acreage 1 14652.0 122556 515.78
## - no3 1 16489.3 124393 516.80
##
## Call:
## lm(formula = longnosedace ~ acreage + maxdepth + no3, data = dace)
##
## Coefficients:
## (Intercept) acreage maxdepth no3
## -23.829067 0.001988 0.336605 8.673044
Validate will use the bootstrap to calculate “honest” measures of model fit. We can also visualize “model uncertainty” in the “best model” by using bw=T (which tells R to use backwards selection to choose the best model). The "*" below indicate, which variables are included in the “optimal model” for each bootstrap replicate.
After applying a backwards model selection algorithm, we end up with a model containing only acreage and no3. The \(R^2\) of this model = 0.24, which describes the variance in longnosedace explained by these two predictors. If we were to apply this same model to a new data set, we would expect the amount of variance that would be explained to be much lower. We can obtain a more “honest” measure of the variance by: a) creating 2 bootstrap data sets (one for model training and one for model testing); b) applying our model selection algorithm to the training data set and calculating the resulting \(R^2\); c) use the same model to predict the response in the bootstrap test data set and use these predictions to calculate a second \(R^2\); d) calculate a measure of “optimism” by subtracting the average \(R^2\) from part c from the average \(R^2\) in part b; e) subtract this estimate of optimism from the \(R^2\) obtained from our original data set. The validate function will do this for us!
fullmod.ols<-ols(longnosedace~acreage+do2+maxdepth+no3+so4+temp,data=dace, x=T, y=T)
validate(fullmod.ols, bw=T, B=100)
##
## Backwards Step-down - Original Model
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC R2
## so4 0.00 1 0.9911 0.00 1 0.9911 -2.00 0.314
## do2 1.29 1 0.2565 1.29 2 0.5253 -2.71 0.300
## temp 1.75 1 0.1859 3.04 3 0.3860 -2.96 0.280
## maxdepth 3.60 1 0.0579 6.63 4 0.1566 -1.37 0.239
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept -2.861457 1.053e+01 -0.2717 0.7858203
## acreage 0.002325 6.502e-04 3.5754 0.0003497
## no3 9.012197 2.767e+00 3.2573 0.0011246
##
## Factors in Final Model
##
## [1] acreage no3
## index.orig training test optimism index.corrected n
## R-square 0.2394 0.3390 0.1501 0.1888 0.0505 100
## MSE 1675.9166 1469.3389 1872.5905 -403.2516 2079.1682 100
## g 25.0102 29.8839 23.0645 6.8194 18.1908 100
## Intercept 0.0000 0.0000 7.9427 -7.9427 7.9427 100
## Slope 1.0000 1.0000 0.8019 0.1981 0.8019 100
##
## Factors Retained in Backwards Elimination
##
## acreage do2 maxdepth no3 so4 temp
## *
## * *
## * *
## * * *
## * *
## * * *
## * *
## * *
## * * *
## * * *
## * * *
## * * *
## * * *
## * *
## * * *
## * * *
## * *
## * * *
## * * * *
## * *
## * * *
## * *
## * *
## * * *
## * * * *
## * *
## * *
## *
## * *
## * * *
## * *
## * * *
## *
## *
## * *
## * *
## * *
## *
## * * *
## * *
## * *
## * *
## * *
## * * *
## * *
## * * * *
## * * * * * *
## * *
## *
## * *
## * *
## * * *
## * * *
## * * *
## * * *
## * * * *
## * * *
## * *
## * *
## * *
## * *
## * * *
## * * * *
## * *
## * * *
## * * * *
## * *
## * * *
## * *
## * *
## * *
## * *
## * *
## * * *
## * *
## * *
## * * *
## * *
## * * *
## * *
## * *
## * * *
## * * *
## * * * *
## * * * *
## *
## * * *
## * *
## * *
## * *
## * * *
## * *
## * *
## * *
## * *
## * * *
## * *
## * * *
## * *
## * * * *
##
## Frequencies of Numbers of Factors Retained
##
## 1 2 3 4 6
## 7 50 33 9 1
We see that the different bootstrap samples result in different models being chosen as optimal. The number of predictor variables included ranges from 1 (in 6 models) to 6 (in 1 model).
We see that our original estimate of \(R^2\) (0.24) is likely quite optimistic (our estimate of optimism = 0.20). Thus, we end up with a corrected estimate of \(R^2\) = 0.037 (quite depressing!).