Objective

This simulation example demonstrates how to conduct a permutation-based test for a partial regression coefficient in a multiple linear regression model.

Document Preamble

# Load libraries
library(knitr)
library(mosaic)    # do(), shuffle(), prop()
library(ggplot2)
library(dplyr)     # %>% and mutate() (also attached by mosaic as a dependency)
library(MASS)      # mvrnorm()

# Set knitr options
opts_chunk$set(fig.width = 6, fig.height = 5)

# Clear environment (optional)
rm(list = ls())

# Set seed for reproducibility
set.seed(314159)

Simulation Example

Here we will consider a simple simulation where a response variable, y, is related to two predictor variables, x1 and x2. The predictors are themselves correlated. We will illustrate a simple permutation-based test for the effect of x1, adjusted for x2.

Steps (collected into a single code sketch after this list):

  1. Fit a linear regression model predicting x1 from x2.
  2. Add the residuals from this model to the original data set.
  3. Create the permutation distribution by shuffling these residuals and refitting.
  4. Determine the p-value by comparing the t-statistic from the original fit to the permutation distribution of this same statistic.
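
A minimal sketch collecting these four steps into one helper; perm_test_x1 is a hypothetical name, not from any package, and the step-by-step version below is what we actually run.

# Hypothetical helper sketching the full procedure
perm_test_x1 <- function(y, x1, x2, nperm = 10000) {
  x1resid <- resid(lm(x1 ~ x2))                      # steps 1-2
  tobs <- summary(lm(y ~ x1resid + x2))$coefficients[2, 3]
  tperm <- replicate(nperm, {                        # step 3: shuffle and refit
    summary(lm(y ~ sample(x1resid) + x2))$coefficients[2, 3]
  })
  mean(abs(tperm) >= abs(tobs))                      # step 4: two-sided p-value
}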

Simulation parameters

Sigma <- matrix(c(10, 3, 3, 2), 2, 2)  # covariance matrix of the predictors
Beta <- c(0.2, -0.5)                   # true regression coefficients

Create correlated predictors

# Draw 100 observations from a bivariate normal with mean zero
X <- mvrnorm(n = 100, mu = rep(0, 2), Sigma = Sigma)
cor(X)
##           [,1]      [,2]
## [1,] 1.0000000 0.6054432
## [2,] 0.6054432 1.0000000
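
As a check, the correlation implied by Sigma can be computed with cov2cor(); the sample correlation above is close to this theoretical value.

cov2cor(Sigma)  # implied correlation: 3 / sqrt(10 * 2), about 0.67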

Form the response variable

# Response: linear combination of the predictors plus N(0, sd = 2) noise
y <- X %*% Beta + rnorm(100, 0, 2)
Mydata <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2])

Fit regression model to the data

lmsim <- lm(y ~ x1 + x2, data = Mydata)
summary(lmsim)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = Mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9605 -1.3618 -0.1088  1.2206  5.3751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.05740    0.21377  -0.269  0.78888   
## x1           0.18488    0.08781   2.105  0.03783 * 
## x2          -0.66629    0.20420  -3.263  0.00152 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.134 on 97 degrees of freedom
## Multiple R-squared:  0.09913,    Adjusted R-squared:  0.08056 
## F-statistic: 5.337 on 2 and 97 DF,  p-value: 0.006326
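
Since this is a simulation, we know the generating values; a quick comparison of the fitted coefficients with the truth:

# True intercept is 0; true slopes are Beta = c(0.2, -0.5)
cbind(True = c(0, Beta), Estimated = coef(lmsim))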

Step 1: capture the part of x1 that is not related to x2

# Regress x1 on x2; the residuals are the part of x1 orthogonal to x2
lm1 <- lm(x1 ~ x2, data = Mydata)
Mydata <- Mydata %>% mutate(x1resid = resid(lm1))
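
By construction, these residuals are orthogonal to x2; a quick check:

with(Mydata, cor(x1resid, x2))  # zero up to floating-point error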

Demonstrate that using the residuals here yields the same coefficient, standard error, t-statistic, and p-value for x1 as in our original regression (lmsim). This is a consequence of the Frisch-Waugh-Lovell theorem.

lmsim2 <- lm(y ~ x1resid + x2, data = Mydata)
summary(lmsim2)
## 
## Call:
## lm(formula = y ~ x1resid + x2, data = Mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9605 -1.3618 -0.1088  1.2206  5.3751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.04442    0.21368  -0.208   0.8358  
## x1resid      0.18488    0.08781   2.105   0.0378 *
## x2          -0.40599    0.16252  -2.498   0.0142 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.134 on 97 degrees of freedom
## Multiple R-squared:  0.09913,    Adjusted R-squared:  0.08056 
## F-statistic: 5.337 on 2 and 97 DF,  p-value: 0.006326
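
Because x1resid is orthogonal to both the intercept and x2, even the simple regression of y on x1resid alone recovers the same slope, although its standard error differs since x2 is no longer in the model:

coef(lm(y ~ x1resid, data = Mydata))["x1resid"]  # same slope as above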

Store the t-statistic for x1; since it is identical in lmsim and lmsim2, we extract it from the original fit

(tstat <- summary(lmsim)$coefficients[2, 3])  # row 2 = x1, column 3 = t value
## [1] 2.105456
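
Extracting by name is a less fragile alternative if the model formula ever changes:

summary(lmsim)$coefficients["x1", "t value"]  # same value as above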

Step 2: create the permutation distribution

randsims <- do(10000) * {
  # shuffle() permutes x1resid, breaking its link with y
  # while leaving the y-x2 relationship intact
  lmrand <- lm(y ~ shuffle(x1resid) + x2, data = Mydata)
  summary(lmrand)$coefficients[2, 3]  # t-statistic for the shuffled term
}
head(randsims)
##       result
## 1  0.3131506
## 2  0.6579098
## 3  0.7203800
## 4 -0.2128813
## 5 -0.5701575
## 6  0.3548467

Plot the randomization distribution with our original statistic

histogram(~ result, data = randsims, v = tstat, col = "gray")
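
An equivalent plot with ggplot2 (already loaded in the preamble), for readers who prefer it to lattice:

ggplot(randsims, aes(x = result)) +
  geom_histogram(bins = 30, fill = "gray", colour = "black") +
  geom_vline(xintercept = tstat)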

Determine our p-value

prop(~ I(abs(result) >= abs(tstat)), data = randsims)  # two-sided: compare |t| values
## prop_TRUE 
##    0.0393
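
The same two-sided p-value can also be computed in base R, without mosaic:

mean(abs(randsims$result) >= abs(tstat))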

Conclusions

The permutation-based approach allows us to relax the normality assumption. Our randomization-based p-value (0.0393) is very close to the p-value from the original t-test (0.0378). This result is not surprising, given that the assumptions of linear regression (constant variance, normality, linearity) all hold in this simulation example.