Video 1

Linear Regression


1.1. Simple linear regression

Example. Cells are grown under different temperature conditions from \(20^\circ\) to \(40^\circ\)C. A researcher would like to find the dependency between temperature (T) and cell number. Let us build a linear regression and then visualize it with confidence and prediction intervals.
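The model behind this is \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) with \(\varepsilon_i \sim N(0,\sigma^2)\); least squares estimates the slope and intercept as

\(\hat{\beta}_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)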

Data = read.table("http://edu.modas.lu/data/txt/cells.txt", header=T,sep="\t",quote="\"")
x = Data$Temperature
y = Data$Cell.Number
mod = lm(y~x) 
mod
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -190.78        15.33
summary(mod)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.183 -26.190  -1.185  22.147  54.477 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -190.784     35.032  -5.446 2.96e-05 ***
## x             15.332      1.145  13.395 3.96e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.76 on 19 degrees of freedom
## Multiple R-squared:  0.9042, Adjusted R-squared:  0.8992 
## F-statistic: 179.4 on 1 and 19 DF,  p-value: 3.958e-11
par(mfcol=c(1,2))
# draw the data
plot(x,y,pch=19,main=sprintf("Linear model y(x) = %.1f*x + %.1f",coef(mod)[2],coef(mod)[1]))

# add the regression line and its 95% confidence interval
ci = predict(mod, int="confidence")
lines(x, ci[,"fit"], col=2, lwd=2)
lines(x, ci[,"lwr"], col=2)
lines(x, ci[,"upr"], col=2)

# add the 95% prediction interval for individual observations
pri = predict(mod, int="pred")
lines(x, pri[,"lwr"], col=4)
lines(x, pri[,"upr"], col=4)

## draw residuals: observed minus fitted values (equivalent to residuals(mod))
residuals = y - (x*coef(mod)[2] + coef(mod)[1])
plot(x, residuals,pch=19,main="Residuals")
abline(h=0,col=2,lwd=2)
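Beyond the manual residual plot, base R provides standard diagnostics for lm objects; a minimal sketch using built-in functions (not part of the original example):

## optional: built-in diagnostic plots and a normality check
par(mfcol=c(2,2))
plot(mod)                      # residuals vs fitted, Q-Q, scale-location, leverage
shapiro.test(residuals(mod))   # test the residuals for normality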

Now, let’s predict how many cells we should expect at Temperature = \(37^\circ\)C.

## confidence interval (where the mean response will be)
predict(mod,data.frame(x=37),int = "confidence")
##        fit      lwr      upr
## 1 376.5177 354.3436 398.6919
## prediction interval (where an actual observation will be)
predict(mod,data.frame(x=37),int = "pred")
##        fit      lwr      upr
## 1 376.5177 306.4377 446.5978
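The widths differ because the two intervals answer different questions. For a new point \(x_0\), the confidence interval for the mean response is \(\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\, s\sqrt{\tfrac{1}{n} + \tfrac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}\), while the prediction interval for an individual observation adds the residual variance: \(\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\, s\sqrt{1 + \tfrac{1}{n} + \tfrac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}\), where \(s\) is the residual standard error. This is why the prediction interval (306.4, 446.6) is much wider than the confidence interval (354.3, 398.7).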

1.2. Multiple linear regression

Often one variable is not enough, and we need several independent variables to predict the dependent one. Let’s consider R’s built-in swiss dataset: a standardized fertility measure together with socio-economic indicators for 47 French-speaking provinces of Switzerland from about 1888. See ?swiss

str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
## overview of the data
#install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
chart.Correlation(swiss)
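If PerformanceAnalytics is not installed, a similar overview can be obtained with base R alone (an alternative sketch, not used in the original):

## base-R alternative: scatterplot matrix and pairwise correlations
pairs(swiss, pch=19, cex=0.5)
round(cor(swiss), 2)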

## see complete model
modAll = lm(Fertility ~ . , data = swiss)
summary(modAll)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
## remove the correlated & non-significant variable (Examination)... not a rigorous procedure
modSig = lm(Fertility ~ . - Examination , data = swiss)
summary(modSig)
## 
## Call:
## lm(formula = Fertility ~ . - Examination, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10
## build a simple regression using the most significant variable (Education)
mod1 = lm(Fertility ~ Education , data = swiss)
summary(mod1)
## 
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.036  -6.711  -1.011   9.526  19.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.6101     2.1041  37.836  < 2e-16 ***
## Education    -0.8624     0.1448  -5.954 3.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared:  0.4406, Adjusted R-squared:  0.4282 
## F-statistic: 35.45 on 1 and 45 DF,  p-value: 3.659e-07
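Instead of dropping variables by hand, base R's step() performs automated AIC-based selection; a minimal sketch (not part of the original workflow):

## automated backward selection by AIC
modStep = step(modAll, direction="backward", trace=0)
formula(modStep)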
## Let's obtain uncorrelated features by PCA
PC = prcomp(swiss[,-1])
swiss2 = data.frame(Fertility = swiss$Fertility, PC$x)
chart.Correlation(swiss2)

modPC = lm(Fertility ~ ., data = swiss2)
summary(modPC)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 70.14255    1.04518  67.111  < 2e-16 ***
## PC1          0.14376    0.02437   5.900 6.00e-07 ***
## PC2         -0.11115    0.04930  -2.255   0.0296 *  
## PC3          0.98882    0.13774   7.179 9.23e-09 ***
## PC4          0.29483    0.28341   1.040   0.3043    
## PC5          0.96327    0.38410   2.508   0.0162 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
modPCSig = lm(Fertility ~ .-PC4, data = swiss2)
summary(modPCSig)
## 
## Call:
## lm(formula = Fertility ~ . - PC4, data = swiss2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.276  -3.959  -0.651   4.519  14.416 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 70.14255    1.04620  67.045  < 2e-16 ***
## PC1          0.14376    0.02439   5.895 5.63e-07 ***
## PC2         -0.11115    0.04934  -2.253   0.0296 *  
## PC3          0.98882    0.13788   7.172 8.26e-09 ***
## PC5          0.96327    0.38448   2.505   0.0162 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.172 on 42 degrees of freedom
## Multiple R-squared:  0.699,  Adjusted R-squared:  0.6703 
## F-statistic: 24.38 on 4 and 42 DF,  p-value: 1.76e-10
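Note that the estimates for PC1, PC2, PC3, and PC5 are identical in modPC and modPCSig: principal components are uncorrelated by construction, so dropping one does not change the remaining coefficients. This is easy to verify:

## the components are orthogonal: off-diagonal correlations are (near) zero
round(cor(PC$x), 3)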

How do we choose the proper model? There are several approaches:

## calculate AIC and select the model with the minimal AIC
AIC(mod1,modSig, modAll, modPC, modPCSig)
##          df      AIC
## mod1      3 348.4223
## modSig    6 325.2408
## modAll    7 326.0716
## modPC     7 326.0716
## modPCSig  6 325.2961
BIC(mod1,modSig, modAll, modPC, modPCSig)
##          df      BIC
## mod1      3 353.9727
## modSig    6 336.3417
## modAll    7 339.0226
## modPC     7 339.0226
## modPCSig  6 336.3970
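Both criteria balance fit against complexity: \(AIC = 2k - 2\ln L\) and \(BIC = k\ln n - 2\ln L\), where \(k\) is the number of estimated parameters, \(n\) the number of observations, and \(L\) the maximized likelihood. Lower is better; BIC penalizes extra parameters more heavily than AIC whenever \(\ln n > 2\) (i.e. \(n \geq 8\)).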
## Is the fit significantly better? Compare nested models with a likelihood-based F-test
anova(modAll,modSig)
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## Model 2: Fertility ~ (Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality) - Examination
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     41 2105.0                           
## 2     42 2158.1 -1   -53.027 1.0328 0.3155
anova(modAll,mod1)
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## Model 2: Fertility ~ Education
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     41 2105.0                                  
## 2     45 4015.2 -4   -1910.2 9.3012 1.916e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## this is not very informative (the predictions are made on the training set itself)
plot(swiss$Fertility, predict(modAll,swiss),xlab="Real Fertility",ylab="Predicted Fertility",pch=19)
abline(a=0,b=1,col=2,lty=2)
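Since this plot reuses the training data, an out-of-sample check is more honest; a minimal leave-one-out cross-validation sketch (not part of the original material):

## leave-one-out cross-validation for the modSig formula
pred = numeric(nrow(swiss))
for (i in 1:nrow(swiss)) {
  m = lm(Fertility ~ . - Examination, data = swiss[-i,])
  pred[i] = predict(m, swiss[i,])
}
sqrt(mean((swiss$Fertility - pred)^2))   # RMSE on held-out observations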

Exercise 1

  1. A biology student wishes to determine the relationship between temperature and heart rate in the leopard frog, Rana pipiens. He manipulates the temperature in \(2^\circ\) increments ranging from \(2^\circ\) to \(18^\circ\)C and records the heart rate at each interval. The data are presented in the table rana.txt
  • Build the model and provide the p-value for the linear dependency
  • Provide an interval estimate for the slope of the dependency
  • Estimate the 95% prediction interval for heart rate at \(15^\circ\)C

Functions to use: lm, summary, predict, confint

  2. Could you estimate Lean.tissues.weight using the other variables of the Mice dataset? Please note that there are some missing data points, which should be excluded if you are going to compare models.

Functions to use: lm, summary, anova, AIC



By Petr Nazarov