---
title: "test"
author: "Bruce Campbell"
date: "October 27, 2017"
output: pdf_document
---
```{r setup, include=FALSE,echo=FALSE}
rm(list = ls())
knitr::opts_chunk$set(echo = FALSE)
library(pander)
```
## 10.1 (a - c) Subset Selection with prostate data
For 10.1 (a): Please use Backward Elimination in 3 ways: (i) a 0.05 p-value criterion as the stopping rule, (ii) using AIC as the stopping rule, and (iii) using BIC as the stopping rule.
For 10.1 (b-c): You should be comparing all possible subsets.
Use the prostate data with lpsa as the response and the other variables as predictors. Implement the following variable selection methods to determine the "best" model:
### (a) Backward elimination
It was not clear to me whether it is possible to use regsubsets with the backward method to perform backward elimination based on a p-value criterion, so we perform the p-value-based elimination manually below.
```{r}
rm(list = ls())
data(prostate, package="faraway")
df <- prostate
n <- nrow(df)
# Fit the full model, then drop the least significant predictor at each step
# until all remaining predictors meet the 0.05 p-value criterion.
lm.fit <- lm(lpsa ~ ., data=prostate)
summary(lm.fit)
lm.subset1 <- update(lm.fit, . ~ . - gleason)
summary(lm.subset1)
lm.subset1 <- update(lm.subset1, . ~ . - lcp)
summary(lm.subset1)
lm.subset1 <- update(lm.subset1, . ~ . - pgg45)
summary(lm.subset1)
lm.subset1 <- update(lm.subset1, . ~ . - age)
summary(lm.subset1)
lm.subset1 <- update(lm.subset1, . ~ . - lbph)
summary(lm.subset1)
```
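As a minimal sketch (not part of the original solution), the manual elimination above can be automated with a loop that repeatedly drops the predictor with the largest p-value until all remaining predictors meet the 0.05 criterion. It assumes every predictor is numeric, so coefficient names match column names.
```{r}
# Automated backward elimination on a p-value criterion (sketch).
lm.be <- lm(lpsa ~ ., data = prostate)
repeat {
  pvals <- summary(lm.be)$coefficients[-1, "Pr(>|t|)"]  # exclude the intercept row
  if (max(pvals) < 0.05) break
  worst <- names(which.max(pvals))                      # least significant predictor
  lm.be <- update(lm.be, as.formula(paste(". ~ . -", worst)))
}
summary(lm.be)
```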
We can use regsubsets with the backward method to find the best model by the BIC criterion. The plot method shows the top models. Interestingly, there does not appear to be a way to use this plot with AIC.
```{r}
library(leaps)
regsubsets.out <- regsubsets(lpsa ~ .,data=prostate,method = "backward",nvmax=10)
rs <- summary(regsubsets.out)
rs$which
plot(regsubsets.out, scale = "bic", main = "BIC")
```
There does not appear to be a scale="aic" option for the regsubsets plot, which is interesting to note.
We therefore compute and plot the AIC for the backward-selected models here, with the exhaustive search in part (b) serving as a reference.
```{r}
regsubsets.out <- regsubsets(lpsa ~ ., data=prostate, method = "backward", nvmax=10)
rs <- summary(regsubsets.out)
rs$which
# AIC = n*log(RSS/n) + 2p, where p counts the intercept plus predictors (2..9)
AIC <- n*log(rs$rss/n) + (2:9)*2
plot(AIC ~ I(1:8), ylab="AIC", xlab="Number of Predictors")
```
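For reference (a sketch not in the original write-up), base R's `step()` performs backward elimination with AIC as the stopping rule when `k = 2`, and with BIC when `k = log(n)`:
```{r}
# Backward elimination with step(): k = 2 gives the AIC penalty,
# k = log(n) gives the BIC penalty; trace = 0 suppresses the step-by-step output.
lm.full <- lm(lpsa ~ ., data = prostate)
step.aic <- step(lm.full, direction = "backward", k = 2, trace = 0)
step.bic <- step(lm.full, direction = "backward", k = log(n), trace = 0)
formula(step.aic)
formula(step.bic)
```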
### (b) AIC
```{r}
regsubsets.out <- regsubsets(lpsa ~ .,data=prostate,method = "exhaustive",nvmax=10)
rs <- summary(regsubsets.out)
rs$which
AIC <- n*log(rs$rss/n) + (2:9)*2
plot(AIC ~ I(1:8), ylab="AIC", xlab="Number of Predictors")
plot(regsubsets.out, scale = "bic")
```
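As a small follow-up sketch (not in the original), we can read off the subset size that minimizes the AIC computed above and inspect the corresponding coefficients from the exhaustive search:
```{r}
# Subset size (number of predictors) minimizing AIC among the exhaustive-search models
best.size <- which.min(AIC)
best.size
coef(regsubsets.out, best.size)
```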
### (c) Adjusted $R^2$
```{r}
regsubsets.out <- regsubsets(lpsa ~ .,data=prostate,method = "exhaustive",nvmax=8)
rs <- summary(regsubsets.out)
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")
```
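Similarly, a short sketch (not part of the original) picks the subset size maximizing adjusted $R^2$ and shows its coefficients:
```{r}
# Subset size maximizing adjusted R^2 from the exhaustive search
best.adjr2 <- which.max(rs$adjr2)
best.adjr2
coef(regsubsets.out, best.adjr2)
```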
## 10.4 Simplifying trees model
Using the trees data, fit a model with log(Volume) as the response and a second-order polynomial (including the interaction term) in Girth and Height. Determine whether the model may be reasonably simplified.
```{r}
rm(list = ls())
data(trees, package="faraway")
df <- trees
n <-nrow(df)
lm.fit <- lm(log(Volume) ~ polym(Girth,Height,degree=2) , data=trees)
summary(lm.fit)
```
Now we run subset selection. We use an exhaustive search since there are not many predictors, and we use Mallows' $C_p$ as our criterion.
```{r}
lm.fit$coefficients
# The degree-2 polynomial expansion of Girth and Height contributes 5 terms.
regsubsets.out <- regsubsets(log(Volume) ~ polym(Girth,Height,degree=2), data=trees, method = "exhaustive", nvmax=8)
rs <- summary(regsubsets.out)
rs$which
plot(regsubsets.out, scale = "Cp", main = "Mallows' C_p")
# Parameter counts run from 2 (one term plus intercept) to 6 (all five terms plus intercept)
AIC <- n*log(rs$rss/n) + (2:6)*2
plot(AIC ~ I(1:5), ylab="AIC", xlab="Number of Predictors")
```
The AIC criterion indicates the best model is the full model with all the polynomial terms, while Mallows' $C_p$ indicates the best model is a reduced one with the first three terms of the polynomial expansion:
$\log(Volume) \sim \mathrm{polym}(Girth, Height, 2)_{1.0} + \mathrm{polym}(Girth, Height, 2)_{2.0} + \mathrm{polym}(Girth, Height, 2)_{0.1}$
i.e., in raw terms, $\log(Volume) \sim Girth + Girth^2 + Height$.
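As a check (a sketch, not part of the original analysis), the reduced raw-term model is nested within the full second-order fit, so an F-test can confirm whether the simplification is reasonable:
```{r}
# Girth, Girth^2, Height span a subspace of the full degree-2 polynomial basis,
# so anova() gives a valid nested-model F-test.
lm.full    <- lm(log(Volume) ~ polym(Girth, Height, degree = 2), data = trees)
lm.reduced <- lm(log(Volume) ~ Girth + I(Girth^2) + Height, data = trees)
anova(lm.reduced, lm.full)
```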
## 10.5 Model reduction in stackloss data
### Fit a linear model to the stackloss data with stack.loss as the response and the other variables as predictors.
```{r}
data(stackloss, package="faraway")
lm.fit <- lm(stack.loss ~ . , data=stackloss);
df <- stackloss
n <-nrow(df)
summary(lm.fit)
plot(fitted(lm.fit),residuals(lm.fit),xlab="Fitted",ylab="Residuals")
abline(h=0)
```
### Simplify the model if possible.
Now we run subset selection. We use an exhaustive search since there are not many predictors, and we use Mallows' $C_p$ as our criterion.
```{r}
regsubsets.out <- regsubsets(stack.loss ~ .,data=df,method = "exhaustive",nvmax=8)
rs <- summary(regsubsets.out)
rs$which
plot(regsubsets.out, scale = "Cp", main = "Mallow C_p")
n <-nrow(df)
AIC <- n*log(rs$rss/n) + (2:4)*2
plot(AIC ~ I(1:3), ylab="AIC", xlab="Number of Predictors")
```
The reduced model with the lowest AIC has two variables and is $stack.loss \sim Air.Flow + Water.Temp$. We note this is the same model indicated by the Mallows' $C_p$ criterion.
### Check the model for outliers and influential points.
#### Check for outliers.
```{r}
lm.fit <- lm(stack.loss ~ Air.Flow + Water.Temp, data = df)
numPredictors <- 2
# Studentized (externally deleted) residuals for outlier detection
studentized.residuals <- rstudent(lm.fit)
max.residual <- studentized.residuals[which.max(abs(studentized.residuals))]
range.residuals <- range(studentized.residuals)
names(range.residuals) <- c("left", "right")
pander(data.frame(range.residuals=t(range.residuals)), caption="Range of Studentized residuals")
p <- numPredictors + 1
n <- nrow(df)
# Bonferroni-corrected critical t-value: alpha = 0.05 split over n two-sided tests
t.val.alpha <- qt(.05/(n*2), n-p-1)
pander(data.frame(t.val.alpha = t.val.alpha), caption = "Bonferroni corrected t-value")
outlier.index <- abs(studentized.residuals) > abs(t.val.alpha)
outliers <- df[outlier.index == TRUE, ]
if (nrow(outliers) >= 1)
{
  pander(outliers, caption = "outliers")
}
```
Here we look for studentized residuals that fall outside the interval given by the Bonferroni corrected t-values. In the case of the reduced model we do not see any outliers.
#### Check for influential points.
We plot the Cook's distances and the residual-leverage plot with level-set contours of Cook's distance.
```{r}
plot(lm.fit,which =4)
plot(lm.fit,which = 5)
```
We see that observation 21 is an influential point for the reduced model under the criterion $D_i > \frac{1}{2}$. Observations 1 and 3 are also influential under the criterion $D_i > \frac{4}{n}$.
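A minimal numeric check (not in the original) lists the observations whose Cook's distance exceeds the rough $4/n$ cutoff for the reduced model:
```{r}
# Observations with Cook's distance above 4/n for the reduced model
cook <- cooks.distance(lm.fit)
pander(data.frame(cooks.distance = cook[cook > 4 / n]),
       caption = "Cook's distance > 4/n (reduced model)")
```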
### Now return to the full model and determine whether there are any outliers or influential points
#### Check for outliers.
```{r}
lm.fit <- lm(stack.loss ~ .,data=df)
numPredictors<- 3
studentized.residuals <- rstudent(lm.fit)
max.residual <- studentized.residuals[which.max(abs(studentized.residuals))]
range.residuals <- range(studentized.residuals)
names(range.residuals) <- c("left", "right")
pander(data.frame(range.residuals=t(range.residuals)), caption="Range of Studentized residuals")
p<-numPredictors+1
n<-nrow(df)
t.val.alpha <- qt(.05/(n*2),n-p-1)
pander(data.frame(t.val.alpha = t.val.alpha), caption = "Bonferroni corrected t-value")
outlier.index <- abs(studentized.residuals) > abs(t.val.alpha)
outliers <- df[outlier.index==TRUE,]
if(nrow(outliers)>=1)
{
pander(outliers, caption = "outliers")
}
```
Here we look for studentized residuals that fall outside the interval given by the Bonferroni-corrected t-value. We see there are no outliers for the full model.
#### Check for influential points.
We plot the Cook's distances and the residual-leverage plot with level-set contours of Cook's distance.
```{r}
plot(lm.fit,which =4)
plot(lm.fit,which = 5)
```
We see that observation 21 is an influential point, and that observations 1 and 4 are also influential under the criterion $D_i > \frac{4}{n}$. Since observation 1 is influential in both the full and reduced models, we remove it along with observation 21.
### Eliminate the outliers and influential points for the full model and then repeat the variable selection procedures.
```{r}
df <- df[-c(1,21),]
regsubsets.out <- regsubsets(stack.loss ~ .,data=df,method = "exhaustive",nvmax=8)
rs <- summary(regsubsets.out)
rs$which
plot(regsubsets.out, scale = "Cp", main = "Mallow C_p")
n <-nrow(df)
AIC <- n*log(rs$rss/n) + (2:4)*2
plot(AIC ~ I(1:3), ylab="AIC", xlab="Number of Predictors")
```
We see that the subset selection routine has chosen the same model, $stack.loss \sim Air.Flow + Water.Temp$.
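As a final confirming sketch (not part of the original write-up), we refit the selected two-predictor model on the cleaned data:
```{r}
# Refit stack.loss ~ Air.Flow + Water.Temp after removing observations 1 and 21
lm.reduced <- lm(stack.loss ~ Air.Flow + Water.Temp, data = df)
summary(lm.reduced)
```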