Example. Cells are grown under different temperature conditions from \(20^\circ\) to \(40^\circ\) C. A researched would like to find a dependency between T and cell number. Let us build a linear regression. And then - visualize it with confidence and predictions intervals.
Data = read.table("http://edu.modas.lu/data/txt/cells.txt", header=T,sep="\t",quote="\"")
x = Data$Temperature
y = Data$Cell.Number
mod = lm(y~x)
mod
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -190.78 15.33
summary(mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.183 -26.190 -1.185 22.147 54.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -190.784 35.032 -5.446 2.96e-05 ***
## x 15.332 1.145 13.395 3.96e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.76 on 19 degrees of freedom
## Multiple R-squared: 0.9042, Adjusted R-squared: 0.8992
## F-statistic: 179.4 on 1 and 19 DF, p-value: 3.958e-11
par(mfcol=c(1,2))
# draw the data
plot(x,y,pch=19,main=sprintf("Linear model y(x) = %.1f*x + %.1f",coef(mod)[2],coef(mod)[1]))
# add the regression line and its confidence interval (95%)
lines(x, predict(mod,int = "confidence")[,1],col=2,lwd=2)
lines(x, predict(mod,int = "confidence")[,2],col=2)
lines(x, predict(mod,int = "confidence")[,3],col=2)
# add the prediction interval for the values (95%)
lines(x, predict(mod,int = "pred")[,2],col=4)
lines(x, predict(mod,int = "pred")[,3],col=4)
## draw residuals
residuals = y - (x*coef(mod)[2] + coef(mod)[1])
plot(x, residuals,pch=19,main="Residuals")
abline(h=0,col=2,lwd=2)
Now, let’s predict how many cells we should expect under Temperature = \(37^\circ\)C.
## confidence (where the mean will be)
predict(mod,data.frame(x=37),int = "confidence")
## fit lwr upr
## 1 376.5177 354.3436 398.6919
## prediction (where the actual observation will be)
predict(mod,data.frame(x=37),int = "pred")
## fit lwr upr
## 1 376.5177 306.4377 446.5978
Often one variable is not enough, and we need several independent
variables to predict dependent one. Let’s consider internal
swiss
dataset: standardized fertility measure and
socio-economic indicators for 47 French-speaking provinces of
Switzerland at about 1888. See ?swiss
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
## overview of the data
#install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
chart.Correlation(swiss)
## see complete model
modAll = lm(Fertility ~ . , data = swiss)
summary(modAll)
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
## remove correlated & non-significant variable... not really correct
modSig = lm(Fertility ~ . - Examination , data = swiss)
summary(modSig)
##
## Call:
## lm(formula = Fertility ~ . - Examination, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
## make simple regression using the most significant variable
mod1 = lm(Fertility ~ Education , data = swiss)
summary(mod1)
##
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.036 -6.711 -1.011 9.526 19.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.6101 2.1041 37.836 < 2e-16 ***
## Education -0.8624 0.1448 -5.954 3.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared: 0.4406, Adjusted R-squared: 0.4282
## F-statistic: 35.45 on 1 and 45 DF, p-value: 3.659e-07
## Let's remove correlated features by PCA
PC = prcomp(swiss[,-1])
swiss2 = data.frame(Fertility = swiss$Fertility, PC$x)
chart.Correlation(swiss2)
modPC = lm(Fertility ~ ., data = swiss2)
summary(modPC)
##
## Call:
## lm(formula = Fertility ~ ., data = swiss2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.14255 1.04518 67.111 < 2e-16 ***
## PC1 0.14376 0.02437 5.900 6.00e-07 ***
## PC2 -0.11115 0.04930 -2.255 0.0296 *
## PC3 0.98882 0.13774 7.179 9.23e-09 ***
## PC4 0.29483 0.28341 1.040 0.3043
## PC5 0.96327 0.38410 2.508 0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
modPCSig = lm(Fertility ~ .-PC4, data = swiss2)
summary(modPCSig)
##
## Call:
## lm(formula = Fertility ~ . - PC4, data = swiss2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.276 -3.959 -0.651 4.519 14.416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.14255 1.04620 67.045 < 2e-16 ***
## PC1 0.14376 0.02439 5.895 5.63e-07 ***
## PC2 -0.11115 0.04934 -2.253 0.0296 *
## PC3 0.98882 0.13788 7.172 8.26e-09 ***
## PC5 0.96327 0.38448 2.505 0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.172 on 42 degrees of freedom
## Multiple R-squared: 0.699, Adjusted R-squared: 0.6703
## F-statistic: 24.38 on 4 and 42 DF, p-value: 1.76e-10
How to choose the proper model? Several approaches:
maximizing adjusted R2
Add / remove variable and compare models using likelihood ratio or chi2 test.
## calculate AIC and select minimal AIC
AIC(mod1,modSig, modAll, modPC, modPCSig)
## df AIC
## mod1 3 348.4223
## modSig 6 325.2408
## modAll 7 326.0716
## modPC 7 326.0716
## modPCSig 6 325.2961
BIC(mod1,modSig, modAll, modPC, modPCSig)
## df BIC
## mod1 3 353.9727
## modSig 6 336.3417
## modAll 7 339.0226
## modPC 7 339.0226
## modPCSig 6 336.3970
## Is fitting significantly better? Likelihood
anova(modAll,modSig)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
## Model 2: Fertility ~ (Agriculture + Examination + Education + Catholic +
## Infant.Mortality) - Examination
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 2105.0
## 2 42 2158.1 -1 -53.027 1.0328 0.3155
anova(modAll,mod1)
## Analysis of Variance Table
##
## Model 1: Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
## Model 2: Fertility ~ Education
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 2105.0
## 2 45 4015.2 -4 -1910.2 9.3012 1.916e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## this is not very informative (use training set)
plot(swiss$Fertility, predict(modAll,swiss),xlab="Real Fertility",ylab="Predicted Fertility",pch=19)
abline(a=0,b=1,col=2,lty=2)
- A biology student wishes to determine the relationship between temperature and heart rate in leopard frog, Rana pipiens. He manipulates the temperature in \(2^\circ\) increment ranging from \(2^\circ\) to \(18^\circ\)C and records the heart rate at each interval. His data are presented in table rana.txt
- Build the model and provide the p-value for linear dependency
- Provide interval estimation for the slope of the dependency
- Estimate 95% prediction interval for heart rate at \(15^\circ\)C
lm
,summary
,predict
,confint
- Could you estimate
Lean.tissues.weight
using other variables of Mice dataset? Please note that there are some missing datapoints which should be excluded if you are going to compare models.
lm
,summary
,anova
,AIC
Home Next |