This simulation example demonstrates how to conduct a permutation-based test for a partial regression coefficient in a multiple linear regression model.
# Load Libraries
library(knitr)
library(mosaic)
library(ggplot2)
library(MASS)
# Set knitr options
opts_chunk$set(fig.width = 6, fig.height = 5)
# Clear Environment (optional)
rm(list = ls())
# Set seed
set.seed(314159)
Here we consider a simple simulation in which a response variable, y, is related to two predictor variables, x1 and x2. The predictors are themselves correlated. The response is generated as y = 0.2*x1 - 0.5*x2 + error, with Normal(0, sd = 2) errors. We will illustrate a simple permutation-based test for the effect of x1, adjusted for x2.
The analysis proceeds in two steps: first, capture the part of x1 that is not related to x2; second, build a permutation distribution for its t-statistic by shuffling those residuals. We begin by setting up the simulation.
Simulation parameters
Sigma <- matrix(c(10, 3, 3, 2), nrow = 2, ncol = 2)
Beta <- c(0.2, -0.5)
Create correlated predictors
X <- mvrnorm(n = 100, mu = rep(0, 2), Sigma = Sigma)
cor(X)
## [,1] [,2]
## [1,] 1.0000000 0.6054432
## [2,] 0.6054432 1.0000000
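The sample correlation is close to the population value implied by Sigma, which we can compute directly as a check:
# Population correlation between the predictors: 3 / sqrt(10 * 2), about 0.67
Sigma[1, 2] / sqrt(Sigma[1, 1] * Sigma[2, 2])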
Form the response variable
y <- X %*% Beta + rnorm(100, 0, 2)
Mydata <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2])
Fit regression model to the data
lmsim <- lm(y ~ x1 + x2, data = Mydata)
summary(lmsim)
##
## Call:
## lm(formula = y ~ x1 + x2, data = Mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9605 -1.3618 -0.1088 1.2206 5.3751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.05740 0.21377 -0.269 0.78888
## x1 0.18488 0.08781 2.105 0.03783 *
## x2 -0.66629 0.20420 -3.263 0.00152 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.134 on 97 degrees of freedom
## Multiple R-squared: 0.09913, Adjusted R-squared: 0.08056
## F-statistic: 5.337 on 2 and 97 DF, p-value: 0.006326
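Note that the estimates for x1 and x2 (0.185 and -0.666) are in the neighborhood of the true coefficients in Beta; we can line them up directly:
# Compare the fitted slopes (dropping the intercept) to the true values
cbind(estimate = coef(lmsim)[-1], truth = Beta)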
Step 1: capture the part of x1 that is not related to x2
lm1 <- lm(x1 ~ x2, data = Mydata)
Mydata <- Mydata %>% mutate(x1resid = resid(lm1))
Demonstrate that using the residuals here results in the same coefficient, standard error, t-statistic, and p-value for x1 as in our original regression (lmsim).
lmsim2 <- lm(y ~ x1resid + x2, data = Mydata)
summary(lmsim2)
##
## Call:
## lm(formula = y ~ x1resid + x2, data = Mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9605 -1.3618 -0.1088 1.2206 5.3751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.04442 0.21368 -0.208 0.8358
## x1resid 0.18488 0.08781 2.105 0.0378 *
## x2 -0.40599 0.16252 -2.498 0.0142 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.134 on 97 degrees of freedom
## Multiple R-squared: 0.09913, Adjusted R-squared: 0.08056
## F-statistic: 5.337 on 2 and 97 DF, p-value: 0.006326
Store the t-statistic for x1. It is identical in the two models, so we take it from the original fit (lmsim).
(tstat <- summary(lmsim)$coefficients[2, 3])
## [1] 2.105456
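As a quick check (not part of the original output), we can confirm programmatically that the two fits report the same t-statistic for x1:
# The x1 and x1resid t-statistics agree, as the two summaries above show
all.equal(summary(lmsim)$coefficients[2, 3], summary(lmsim2)$coefficients[2, 3])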
Step 2: create the permutation distribution
# Refit the model 10,000 times, shuffling x1resid to break its
# association with y while leaving x2 in the model
randsims <- do(10000) * {
  lmrand <- lm(y ~ shuffle(x1resid) + x2, data = Mydata)
  summary(lmrand)$coefficients[2, 3]  # t-statistic for the shuffled term
}
head(randsims)
## result
## 1 0.3131506
## 2 0.6579098
## 3 0.7203800
## 4 -0.2128813
## 5 -0.5701575
## 6 0.3548467
Plot the randomization distribution with our original statistic
histogram(~ result, data = randsims, v = tstat, col = "gray")  # v marks the observed t-statistic
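Since ggplot2 is already loaded, one possible equivalent plot marks both tails of the permutation distribution (the result column name comes from do(), as seen above):
# Histogram of permutation t-statistics with the observed value marked in both tails
ggplot(randsims, aes(x = result)) +
  geom_histogram(bins = 40, fill = "gray", colour = "white") +
  geom_vline(xintercept = c(-tstat, tstat), linetype = "dashed") +
  labs(x = "permutation t-statistic", y = "count")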
Determine the two-sided p-value: the proportion of permutation t-statistics at least as extreme, in absolute value, as the observed one. Taking abs(tstat) guards against the case where the observed statistic is negative.
prop(~ I(abs(result) >= abs(tstat)), data = randsims)
## prop_TRUE
## 0.0393
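For comparison, the classical t-test p-value for x1 can be pulled directly from the original fit; it appears as 0.03783 in the summary table above:
# Normal-theory p-value for x1 (about 0.0378, per summary(lmsim))
summary(lmsim)$coefficients[2, 4]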
The permutation-based approach allows us to relax the Normality assumption. Our randomization-based p-value (0.0393) is very close to the p-value from the original t-test (0.0378). This is not surprising, given that the assumptions of linear regression (constant variance, Normality, linearity) all hold in this simulation example.