---
title: "NCSU ST 503 Discussion 11"
subtitle: "Problem 2.1 Faraway, Julian J. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press."
author: "Bruce Campbell"
fontsize: 12pt
output: pdf_document
---
```{r setup, include=FALSE,echo=FALSE}
rm(list = ls())
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(dev = 'pdf')
knitr::opts_chunk$set(cache=TRUE)
knitr::opts_chunk$set(tidy=TRUE)
knitr::opts_chunk$set(prompt=FALSE)
knitr::opts_chunk$set(fig.height=5)
knitr::opts_chunk$set(fig.width=7)
knitr::opts_chunk$set(warning=FALSE)
knitr::opts_chunk$set(message=FALSE)
knitr::opts_knit$set(root.dir = ".")
library(latex2exp)
library(pander)
library(ggplot2)
library(GGally)
```
## 2.1 wbca analysis
```{r}
rm(list = ls())
library(faraway)
data("wbca", package="faraway")
df <- wbca
```
The dataset wbca comes from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumors of which 238 are actually malignant. Determining whether a tumor is really malignant is traditionally determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumor status.
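As a quick sanity check on these counts (in `wbca`, $Class = 1$ codes benign and $Class = 0$ malignant):
```{r}
# Class balance: 1 = benign, 0 = malignant; should show 443 benign
# and 238 malignant out of the 681 cases described above.
table(wbca$Class)
```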
### (a) Plot the relationship between the classification and BNucl.

i. Explain why plot(Class ~ BNucl, wbca) does not work well.
ii. Create a factor version of the response and produce a version of the first panel of Figure 2.1. Comment on the shape of the boxplots.
iii. Produce a version of the second panel of Figure 2.1. What does this plot say about the distribution?
iv. Produce a version of the interleaved histogram shown in Figure 2.2 and comment on the distribution.
Here we plot $Class \sim BNucl$:
```{r}
plot(Class ~ BNucl, wbca)
```
Since $BNucl$ takes only a small number of discrete values, the points overlap and the plot gives little sense of how the variable is distributed within each class. A box plot provides a better visualization of the distribution by class.
```{r}
df$factor.class <- as.factor(wbca$Class)
boxplot(BNucl ~ factor.class, data = df)
```
The boxplots show us that the $BNucl$ feature is a viable candidate for predicting cancer status. We can also add noise (jitter) to the $Class \sim BNucl$ plot to remove the overlap in the points.
```{r}
plot(jitter(Class, 0.15) ~ jitter(BNucl), data = wbca,
     xlab = "BNucl", ylab = "Class : tumor status", pch = ".", col = "red")
```
It looks like the $BNucl$ feature may, conditionally on the class, be modeled as a multinomial distribution. Most of the mass for the benign class ($Class = 1$) sits at $BNucl = 1$, while most of the mass for the malignant class ($Class = 0$) sits at $BNucl = 10$.
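As a quick check on this claim, a cross-tabulation of $BNucl$ by class shows where the conditional mass sits:
```{r}
# Counts of each BNucl value within each class; most benign (Class = 1)
# cases should sit at BNucl = 1 and most malignant (Class = 0) at BNucl = 10.
xtabs(~ BNucl + Class, wbca)
```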
```{r}
ggplot(df, aes(x = BNucl, color = factor.class)) +
  geom_histogram(aes(y = ..density..), position = "dodge", binwidth = 3)
```
### (b) Produce a version of Figure 2.3 for the predictors BNucl and Thick. Produce an alternative version with only one panel but where the two types are plotted differently. Compare the two plots and describe what they say about the ability to distinguish the two types using these two predictors.
```{r}
ggplot(wbca, aes(x = BNucl, y = Thick)) +
  geom_point(alpha = 0.2, position = position_jitter()) +
  facet_grid(~ Class)
```
```{r}
qplot(x = BNucl, y = Thick, data = df, colour = factor.class)
```
We see that a higher value of $BNucl$ is associated with an elevated value of $Thick$ and that a lower value of $BNucl$ is associated with a lower value of $Thick$. $Thick$ is a good candidate for inclusion, alongside $BNucl$, in a model discriminating cancer status.
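As a supplementary view (not part of Figure 2.3), here is a minimal pairs-style sketch of the same two predictors coloured by class, using the GGally package already loaded in the setup chunk:
```{r}
# Pairwise view of BNucl and Thick coloured by tumor status; purely
# illustrative, using the factor version of the response created above.
GGally::ggpairs(df, columns = c("BNucl", "Thick"),
                mapping = aes(colour = factor.class))
```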
### (c) Fit a binary regression with Class as the response and the other nine variables as predictors. Report the residual deviance and associated degrees of freedom. Can this information be used to determine if this model fits the data? Explain.
```{r}
lm.logistic <- glm(Class ~ ., family = binomial, wbca)
summary(lm.logistic)
```
The deviance is used for hypothesis testing in model comparison. Since our response is Bernoulli, we cannot use the residual deviance as a goodness-of-fit statistic: with ungrouped binary data its $\chi^2$ asymptotics do not hold, so the residual deviance carries essentially no information about fit. To use the deviance in this setting we would bin the responses to approximate a binomially distributed response, as sketched below.
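A minimal sketch of the binning idea, assuming we group cases into fixed-width intervals of fitted probability and compare observed benign counts with the counts the model expects (an informal Hosmer-Lemeshow-style check; the choice of ten bins is arbitrary):
```{r}
# Bin the fitted probabilities into ten fixed-width intervals and compare
# the observed number of benign cases in each bin with the model's
# expected count; empty bins show up as NA.
fitted.p <- predict(lm.logistic, type = "response")
bins <- cut(fitted.p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
observed <- tapply(wbca$Class, bins, sum)
expected <- tapply(fitted.p, bins, sum)
cbind(observed, expected)
```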
### (e) Suppose that a cancer is classified as benign if p > 0.5 and malignant if p < 0.5. Compute the number of errors of both types that will be made if this method is applied to the current data with the reduced model.

Part (d), which selects a reduced model, is skipped here, so we apply the cutoff to the full model fitted above.
```{r}
pred.prob <- predict(lm.logistic, type="response")
class.predicted <- pred.prob > 0.5
TB <- table(df$factor.class, class.predicted)
TB
```
Treating benign ($Class = 1$) as the positive class, the false positive rate is $10/(228+10) = 0.04201681$ and the false negative rate is $9/(434+9) = 0.02031603$.
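The same rates can be read off the confusion matrix programmatically; a small sketch, assuming TB keeps its default row labels (0/1) and column labels (FALSE/TRUE):
```{r}
# Row "0" is malignant, row "1" benign; column "TRUE" means predicted benign.
fpr <- TB["0", "TRUE"] / sum(TB["0", ])   # malignant predicted benign
fnr <- TB["1", "FALSE"] / sum(TB["1", ])  # benign predicted malignant
c(false.positive = fpr, false.negative = fnr)
```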
### (f) Suppose we change the cutoff to 0.9 so that p < 0.9 is classified as malignant and p > 0.9 as benign. Compute the number of errors in this case.
```{r}
pred.prob <- predict(lm.logistic, type="response")
class.predicted <- pred.prob > 0.9
TB <- table(df$factor.class, class.predicted)
TB
```
With this more conservative cutoff, the false positive rate is $1/(237+1) = 0.004201681$ and the false negative rate is $16/(427+16) = 0.03611738$.
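Since parts (e) and (f) repeat the same computation with different cutoffs, a small hypothetical helper (error.counts is our own name, not from the text) reduces the comparison to two calls:
```{r}
# Hypothetical helper: confusion matrix for an arbitrary cutoff on the
# fitted probabilities from the full model.
error.counts <- function(cutoff) {
  pred <- predict(lm.logistic, type = "response") > cutoff
  table(actual = wbca$Class, predicted = pred)
}
error.counts(0.5)
error.counts(0.9)
```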
### (h) It is usually misleading to use the same data to fit a model and test its predictive ability. To investigate this, split the data into two parts - assign every third observation to a test set and the remaining two thirds of the data to a training set. Use the training set to determine the model and the test set to assess its predictive performance. Compare the outcome to the previously obtained results.

Rather than taking every third observation, we use a random two-thirds/one-third split (seeded for reproducibility); a sketch of the literal every-third split follows.
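For reference, a minimal sketch of the split the problem statement literally asks for (every third observation to the test set); it is not used in the analysis below:
```{r}
# Deterministic every-third split, shown only to match the problem wording.
test.idx <- seq(3, nrow(wbca), by = 3)
DFTest.alt <- wbca[test.idx, ]
DFTrain.alt <- wbca[-test.idx, ]
c(train = nrow(DFTrain.alt), test = nrow(DFTest.alt))
```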
```{r}
set.seed(123)
train <- sample(nrow(wbca), floor(nrow(wbca) * 2/3))
DFTrain <- wbca[train, ]
DFTest <- wbca[-train, ]
# Check that the class balance is similar in the two sets
TDTrain <- table(DFTrain$Class)
TDTest <- table(DFTest$Class)
ratio.class <- TDTrain / TDTest
lm.logistic <- glm(Class ~ ., family = binomial, DFTrain)
summary(lm.logistic)
pred.prob.test <- predict(lm.logistic, type = "response", newdata = DFTest)
class.predicted.test <- pred.prob.test > 0.9
TB <- table(DFTest$Class, class.predicted.test)
pander(TB, caption = "Confusion matrix p=0.9")
class.predicted.test <- pred.prob.test > 0.5
TB <- table(DFTest$Class, class.predicted.test)
pander(TB, caption = "Confusion matrix p=0.5")
# Trace out sensitivity and specificity over a grid of cutoffs
thresh <- seq(0.01, 0.95, 0.01)
Sensitivity <- numeric(length(thresh))
Specificity <- numeric(length(thresh))
for (j in seq_along(thresh)) {
  pp <- ifelse(pred.prob.test < thresh[j], "no", "yes")
  xx <- xtabs(~ Class + pp, DFTest)
  Specificity[j] <- xx[1, 1] / (xx[1, 1] + xx[1, 2])
  Sensitivity[j] <- xx[2, 2] / (xx[2, 1] + xx[2, 2])
}
# Mark the p = 0.9 classifier on the ROC curve
ry <- Sensitivity[thresh == 0.9]
rx <- 1 - Specificity[thresh == 0.9]
plot(1 - Specificity, Sensitivity, type = "l",
     main = "ROC curve - 0.9 classifier marked in red")
points(x = rx, y = ry, pch = "*", col = "red", cex = 3)
```
We have an error rate of $(1+6)/(73+1+6+147) = 0.030837$ using the $p=0.9$ threshold, and an error rate of $(4+3)/(70+4+3+150) = 0.030837$ using the $p=0.5$ threshold. In this case we have good evidence that the classifier will perform well on new data.

We note that the total error rates for the two cutoffs are the same on the test set; we may prefer one over the other based on the class-conditional error rates. An ROC curve can help with this kind of tuning, and the AUC sketch below summarizes it.
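To summarize the ROC curve in one number, a sketch of a trapezoid-rule approximation to the area under the curve, using the Sensitivity and Specificity vectors computed above (the threshold grid is coarse, so this is only approximate):
```{r}
# Order the operating points by increasing false positive rate and apply
# the trapezoid rule; values near 1 indicate near-perfect separation.
roc.fpr <- rev(1 - Specificity)
roc.tpr <- rev(Sensitivity)
auc <- sum(diff(roc.fpr) * (head(roc.tpr, -1) + tail(roc.tpr, -1)) / 2)
auc
```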