-
Notifications
You must be signed in to change notification settings - Fork 0
/
Project 2.Rmd
485 lines (315 loc) · 19.5 KB
/
Project 2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
---
title: "What factors determine survival among Cardiovascular Heart Disease (CHD) Patients?"
author: "Yordanos Alemu, Rabab Mohammed, Kovenda Mubale"
date: "1/16/2021"
output: word_document
toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Abstract:
This data set from Plos One, is about Cardiovascular Heart Disease (CHD). The study is based on 299 patients of Cardiovascular Heart Disease (CHD) comprising of 105 women and 194 men. All the patients were more than 40 years old, having left ventricular systolic dysfunction and falling in NYHA class III and IV. Cardiovascular Heart Disease (CHD) was diagnosed by cardiac echo report or notes written by physician. The information related to risk factors were taken from blood reports while smoking status and blood pressure were taken from physician's notes.
Cardiovascular Heart Disease (CHD) is now top reason causing 31% of deaths globally. Pakistan is also included in the list of countries where prevalence of CHD is increasing significantly. According to report by Al-Shifa hospital, 33% of Pakistani population above 45 has hypertension, 25% of patients over 45 years suffer diabetes mellitus, and CHD deaths in Pakistan has reached about 200,000 per year i.e. 410/100,000 of the population). All this results in increased prevalence of heart failure. Rate of heart failure patients in Pakistan is estimated to be 110 per million. Rising stress of economic and social issues in the modern era, greasy food with little exercise results towards increased prevalence of heart failure in Pakistan.
The main objective of this study is to determine the factors that are critical for survival among Cardiovascular Heart Disease (CHD) Patients by using data from Faisalabad (third most populous city of Pakistan).
```{r}
library(readr)
survival <- read_csv("S1Data_fixed.csv")
attach (survival)
head(survival)
```
```{r}
na.omit(survival)
```
```{r}
survival$index <-1:nrow(survival)
head(survival)
```
## Data Characteristic:
Response variable:
Survival - Cardiovascular Heart Disease (CHD) Pateints that were still alive after the follow up period by the physicians. If the patient was alive by the follow up date: 1 and if not: 0.
Predictor variables:
Age, serum sodium, serum creatinine, gender, smoking, Blood Pressure (BP), Ejection Fraction (EF), anemia, platelets, Creatinine Phosphokinase (CPK) and diabetes. These are considered as potential variables explaining mortality caused by Cardiovascular Heart Disease (CHD).
Age and Time are continuous variables. EF, serum sodium, platelets, CPK and serum creatinine were taken as categorical variables.
Time is follow up time was 4 to 285 days with an average of 130 days.
EF was divided into three levels (i.e. EF<30 (coded as 0), 30<EF<45 (coded as 1) and EF>45 (coded as 2)) and platelets was also divided into three level on the basis of quartiles.
Sodium was also coded into three levels based on the quartiles
Serum Creatinine greater than its normal level (1.5) is an indicator of renal dysfunction. Its effect on mortality was studied as creatinine >1.5 (coded as 0) vs <1.5 (coded as 1).
Anemia in patients was assessed by their haematocrit level. The patients with haematocrit less than 36 (minimum normal level of haematocrit) were taken as anemic.
```{r}
library (ggplot2)
library (tidyr)
ggplot(gather(survival), aes(value)) +
geom_histogram(bins = 8) +
facet_wrap(~key, scales = 'free_x')
```
All the predictor variables are not exteremely right or left skewed. We think that all the varaibles are catagorical variales expect for Age and Time.
# Scatterplot matrix of columns:
```{r}
# Scatterplot matrix of columns
pairs (survival, col=ifelse (Survival==0, "green", "black") )
#pairs (survival[,2:10], col= EF)
```
The Explantory Analysis shows that all our variables are categorical execpt for Time and Age.
```{r}
transcorr = cor(survival [, c(1:10, 10, 12:13)], use= "complete.obs")
library("corrplot")
```
```{r}
## corrplot 0.84 loaded
corrplot (transcorr, method = "number", number.cex=0.6)
```
The correlation matrix shows that the varaible Time and surviaval are higly correlated (0.53). Gender and Smoking have a high correlation as well(0.45). Survaial and Serum Creatinine appears to have a moderate correlation(0.37).
```{r}
library(MASS)
table1= xtabs(~Survival+EF)
table1
```
93 patients had an Ejection Fraction of less than 30 %, while 146 have an Ejection Fraction of 30 to 45 % and only 60 have an Ejection Fraction higher than 45% . Of all the patients that had an ejection fraction less than 30 %, 55 % died of cardio-vascular heart disease. That is 32% more than the patients who died that had an ejection fraction higher than 45%. Patients with an ejection fraction between 30 and 45 % had the highest survival rate at 79 %. You have a higher chance of dying if your ejection fraction is less than 30 %.
```{r}
table2= xtabs(~Survival+Gender)
table2
```
Odds of surviving
. Male 132/194 = 0.680
. Female 71/105 = 0.676 .... Odds Ratio for Male VS Female -> 0.680/0.676 = 1.0059
Female patients are 1.0059 times more likely to survive cardio-vascular heart diseases than male patients. This essentially means the chances of survival for male and female patients is approximately 1 to 1.
```{r}
tabel4=xtabs(~Survival+Diabetes)
tabel4
```
Odds of not surviving
. Diabetic 40/125 = 0.32
. Non-diabetic 56/174 = 0.32 .... Odds Ratio for Diabetic VS Non-Diabetic -> 0.32/0.32 = 1
This essentially means the chances of not surviving for Diabetic and non-Diabetic patients is 1 to 1.
```{r}
tabel7=xtabs(~Survival+Sodium)
tabel7
```
Urinary Sodium to creatine ratio or Sodium levels 0 to 3 represent the quartiles of the collected Sodium ratios, 0 being less 25th quartile and 3 being greater than 75th quartile. We can see that when the patient's Sodium to Creatine ratio is less than the 25th quartile they have less than 50% chance of surving cardio-vascular heart disaese. While pateints with Sodium to Creatine ratio greater than the 25th quartile all have higher than 70% chance of surviving cardio-vascular heart disease.
#Initial Model:
```{r}
order1_fit1 <- glm (Survival ~ as.factor(EF) + `Serum Creatinine` + Diabetes + TIME + Gender + Smoking + BP + Anaemia + Age + Sodium + platelets + CPK, family = binomial)
summary(order1_fit1)
```
```{r}
exp (order1_fit1$coefficients)
```
```{r}
exp (confint (order1_fit1))
```
The model shows that pateints having an ejection fraction between 30 to 45 % and higher 45 % are significantly different from patients who have an ejection fraction of less than 30 %. The model also shows that Serum Creatinine, Time, Age and CPK are significant predictors for predicting the survival of cardio-vascular heart disease. For every unit increase in Serum Creatinine the odds of surviving cardio-vascular disease go up by 5.3554847-folds. The odds of surviving cardio-vascular heart disease for male pateints vs female pateints is not only 1 to 1 as we predicted earlier but essentially ranges from 0.74725273 to 4.1808057 -folds.
```{r}
# Make a plot similar to Y vs Y-hat for linear regression
order1_fit1.logit = predict (order1_fit1)
plot (jitter (Survival, 0.2) ~ order1_fit1.logit, data=survival)
# Add the overall fitted logistic curve to the plot, and a lowess fit
pihat.order1_fit1 = predict (order1_fit1, type='response')
pihat.ord = order (pihat.order1_fit1)
lines (order1_fit1.logit [pihat.ord], pihat.order1_fit1 [pihat.ord])
lines (lowess (order1_fit1.logit [pihat.ord], pihat.order1_fit1 [pihat.ord]), col='red', lty=2)
```
The jittered response vs. predicted values with the fitted logistic curve and a lowess fit shows that the model fits the data aaproximately well since the solid and the jittered line are approximately similar in shape.
```{r}
plot(predict(order1_fit1),residuals(order1_fit1))
abline(h=0,lty=1,col="brown")
```
The linaerity condition is statisfied because the plot is flat and right on zero.
```{r}
par (mfrow=c(1,2))
plot (order1_fit1, which=c(5))
```
There are no points with high leverage or high Cook's Distance therefore those influence measures do not appear on the plot. The point index 240 has a standerized residual around 6. We will see whether it is an influential point .
#Inference Analysis to confirm that there is no outliers:
```{r}
survival$Residual = round (order1_fit1$residuals, 4)
survival$leverage = round (hatvalues(order1_fit1), 4)
survival$rstudent = round (rstudent(order1_fit1), 4)
```
Pull out the cases with high leverage and large residuals.
```{r}
high.levg.resd = survival [survival$leverage > 3*(order1_fit1$rank) / (order1_fit1$rank + order1_fit1$df.residual) |abs (order1_fit1$residuals) > 30 , ]
high.levg.resd[order(-high.levg.resd$rstudent),][c(1,9:17)]
```
Although index 240 has a large residual value, however, its 0.0131 leverage value is not bigger than the Leverage Cutoff value at 0.13043478 (3(k+1)/n). Therefore, this point is an influential point.
We will see whether there are any other influential points that do not have high residuals as index 240.
```{r}
high.Leverage = survival [survival$leverage > 0.13043478 & is.na(survival$leverage) == FALSE, ]
high.Leverage[order(-high.Leverage$leverage),][c(1,9:17)]
```
There are only 2 points with leverage values higher than the Leverage Cutoff value at 0.13043478 (3(k+1)/n). The point index 242 has the highest leverage value of 0.134.
```{r}
car::vif(order1_fit1)
```
All of the VIF are less than 5. Hence, multi-colinearity is low in our model. We will be doing stepwise regresion to find the best model.
#Model Selection:
```{r}
step.AIC = step (order1_fit1, direction = "both")
summary(step.AIC)
beta2 = coefficients(step.AIC)
exp (beta2)
exp (confint (step.AIC))
```
```{r}
step.BIC = step(order1_fit1, direction = "both", k=log (order1_fit1$rank + order1_fit1$df.residual))
summary(step.BIC)
```
```{r}
summary(step.AIC)
summary(step.BIC)
```
```{r}
anova ( step.BIC, step.AIC )
pchisq(2.0858, 1,lower.tail = FALSE)
```
We choose the AIC model to do the interaction effects on becuase it has significant predictors and has a lower residual deviance of 214.42 Residual Deviance on 292 degrees of freedom. The pvalue (0.1486743) which is less than 0.05 the typical cutoff.
```{r}
# Center quantitative predictors and add interaction effects
my.center = function (y) y - mean (y)
survival$TIME.c = my.center (survival$TIME)
survival$Age.c = my.center (survival$Age)
fit1_InteractionEffects = glm (Survival ~ (as.factor(EF) + TIME.c + CPK + `Serum Creatinine` + Age.c)^2,
data=survival,
family = binomial)
summary (fit1_InteractionEffects)
```
#Final Stepwise Regression
```{r}
step.AIC_final = step (fit1_InteractionEffects, direction = "both")
summary(step.AIC_final)
```
```{r}
step.BIC_final = step (fit1_InteractionEffects, direction = "both",k=log (fit1_InteractionEffects$rank + fit1_InteractionEffects$df.residual))
summary(step.BIC_final)
```
```{r}
summary(step.AIC_final)
summary(step.BIC_final)
```
```{r}
anova (step.AIC_final, step.BIC_final)
pchisq(11.949, 3, lower.tail = FALSE)
```
We choose AIC model as final becuase it's Residual Deviance is 202.47 which less than the Residual Deviance 214.42 of the BIC model. The pvalue (0.0075599) is less than 0.05 which means that the two models are statistically differnt from each other.
```{r}
final_model_1 = step.AIC_final
summary(final_model_1)
```
```{r}
### Interaction plots
# New categorize function
categorize = function (x, quantiles=(1:3)/4) {
cutoffs = quantile (x, quantiles)
n.cutoffs = length (cutoffs)
result = rep ("C1", length (x))
for (j in 1:n.cutoffs) {
result [x > cutoffs [j]] = paste ("C", j+1, sep="")
}
return (result)
}
library (ggplot2)
qplot (TIME, predict (final_model_1), data=survival, color=categorize (EF)) + geom_smooth (method="lm")
qplot (`Serum Creatinine`, predict (final_model_1), data=survival, color=categorize (TIME)) + geom_smooth (method="lm")
```
The interaction plot between time and Ejection Fracton shows that the predicted logits of suriving cardio-vascular disease increases over time for all ejection fraction (EF) categories(EF<30 (coded as 0), 30<EF<45 (coded as 1)). Patients with 30<EF<45 have the highest rise in the predicted logits of suriving cardio-vascular disease over time. We had mentioned earlier with our frequency table that patients with an ejection fraction (EF) between 30 and 45 % had the highest survival rate at 79 %. However, with this interaction plot with time we see that statement holds true only for patients whose follow up time was longer than 100 days. We see that for patients with a follow up time of less than 25 days, they have lowest predicted logits of suriving cardio-vascular disease.
The interaction plot between Serum Creatinine and time shows the predicted logits of surviving cardio-vascular disease increases over time because C1 (25th quartile of follow time) is the lowest and C4 (75th quartile of follow time) is the highest in the graph. This effect was already accounted for in the previuos interaction plot. Serum Creatinine was coded as follows, >1.5(indicator of renal dysfunction coded as 0) and < 1.5(normal level coded as 1). We can see from the positive slope of all four of the TIME graphs that all patients with renal dysfunction (0) have lower predicted logits of surviving than those who do not (1). All patients with follow up time in C4 (75 quartile),regardless of whether they have renal dysfuntion, have higher predicted logits of suriving cardio-vascular disease than all other patients in C1,2 & 3, regardless of whether those patients do not have renal dysfuntion. Patients with follow up time in C4 (75 quartile), also have the most noticeable change in the predicted logits for surviving cardio-vascular disease between patients with renal dysfunction(0) and patients without (0).
#Final Model:
```{r}
exp (confint (final_model_1))
```
```{r}
exp (final_model_1$coefficients)
```
An increase in a unit of Creatinine Phosphokinase (CPK) decreases a patients of odds of surviving by 0.331-folds. A unit increase in Age decreases a patients odds of surviving by a range of 0.0808 to 0.01856-folds.
```{r}
preds = predict (final_model_1, se.fit = T)
pred.df = cbind.data.frame (survival, as.data.frame (preds))
pred.df$lwr = pred.df$fit - 1.96 * pred.df$se.fit
pred.df$upr = pred.df$fit + 1.96 * pred.df$se.fit
pred.df$fit.pr = round (exp (pred.df$fit) / (1 + exp (pred.df$fit)), 3)
pred.df$lwr.pr = round (exp (pred.df$lwr) / (1 + exp (pred.df$lwr)), 3)
pred.df$upr.pr = round (exp (pred.df$upr) / (1 + exp (pred.df$upr)), 3)
head(pred.df)
```
#Patients Predictions:
```{r}
# Predictions for patients in their 70s and older.
pred.df[c(2, 17, 88, 48, 40),c(1,2,3,8,17:20)]
```
```{r}
# Predictions for patients in their 40s and 50s
pred.df[c(103, 66, 83, 21, 11),c(1,2,3,8,17:20)]
```
#Model Diagnostics:
```{r}
# Diagnostic plots for model hp9
par (mfrow=c(1,2))
plot (jitter (Survival, 0.2) ~ predict (final_model_1), data=survival)
logit.9 = predict (final_model_1)
ord9 = order (logit.9)
pihat.9 = exp (logit.9) / (1 + exp (logit.9))
lines (logit.9 [ord9], pihat.9 [ord9])
# Lowess fit
lines (lowess (logit.9, survival$Survival), col='red', lty=2)
plot (final_model_1, which=1)
```
```{r}
# Likelihood ratio test for first model
pchisq (order1_fit1$null.deviance - order1_fit1$deviance,
order1_fit1$df.null - order1_fit1$df.residual, lower.tail = F)
```
The pvalue (1.570568e-28) is less than 0.05, therfore we reject the null hypothesis and conclude that at least one of our parameters is different from zero.
```{r}
# Likelihood ratio test for final model
pchisq (final_model_1$null.deviance - final_model_1$deviance,
final_model_1$df.null - final_model_1$df.residual, lower.tail = F)
```
The pvalue (1.553978e-32) is less than 0.05, therfore we reject the null hypothesis and conclude that at least one of our parameters is different from zero.
```{r}
# Goodness of fit test
pchisq (final_model_1$deviance, final_model_1$df.residual, lower.tail = F)
```
The p_value(0.9999696) is bigger than (0.05), suggesting that there is no significant lack of fit in the model.
#Influence diagnostics analysis:
```{r}
plot(final_model_1,which = 5)
```
# ROC plot:
```{r}
# ROC Curve for model final_model_1
# ROC curve - install package ROCR
par (mfrow=c(1,1))
library(ROCR)
pred1 <- prediction(final_model_1$fitted.values, final_model_1$y)
perf1 <- performance(pred1,"tpr","fpr")
auc1 <- performance(pred1,"auc")@y.values[[1]]
auc1
plot(perf1, lwd=2, col=2)
abline(0,1)
legend(0.6, 0.3, c(paste ("AUC=", round (auc1, 4), sep="")), lwd=2, col=2)
# Extract the X and Y values from the ROC plot, as well as the probability cutoffs
roc.x = slot (perf1, "x.values") [[1]]
roc.y = slot (perf1, "y.values") [[1]]
cutoffs = slot (perf1, "alpha.values") [[1]]
auc.table = cbind.data.frame(cutoff=pred1@cutoffs,
tp=pred1@tp, fp=pred1@fp, tn=pred1@tn, fn=pred1@fn)
names (auc.table) = c("Cutoff", "TP", "FP", "TN", "FN")
auc.table$sensitivity = auc.table$TP / (auc.table$TP + auc.table$FN)
auc.table$specificity = auc.table$TN / (auc.table$TN + auc.table$FP)
auc.table$FalsePosRate = 1 - auc.table$specificity
auc.table$sens_spec = auc.table$sensitivity + auc.table$specificity
# Find the row(s) in the AUC table where sensitivity + specificity is maximized
auc.best = auc.table [auc.table$sens_spec == max (auc.table$sens_spec),]
auc.best
# Plot the maximum point(s) on the ROC plot
points (auc.best$FalsePosRate, auc.best$sensitivity, cex=1.3)
```
The AUC is 0.916, meaning that the model correctly predicted for 91.6% of the cases in our dataset. This suggests that our final model predicts surviving cardio-vascular disease better than simply guessing. Our cutoff for determining predictions of failure or no failure was 0.6412. This has a false positive rate of 156 and a sensitivity (probability of a true positive prediction) of 0.84.
#Final Conclusions:
We can conclude that, among the predictor variables from our dataset, the presence of Ejection Fraction, Time, Creatinine Phosphokinase (CPK), Serum Creatinine and Age were helpful in explaining patients' survival Cardiovascular Heart Disease (CHD).
We also found a significant interaction effect between Time and Ejection Fraction. The AUC of this model is 0.916, showing that these five variables and the four interaction plots help predict patients' survival Cardiovascular Heart Disease (CHD) better than simply guessing, and the low p-value of 1.553978e-32 from the Likelihood Ratio test confirmed the significance of the model. The model’s sensitivity was 0.84 and specificity was 0.84375, suggesting that the model does a much better job predicting non-failures than failures.
This analysis begs the question of what other variables can explain dam failure, as the variables from this set could not fully account for dam failure.
Some follow-up questions to this analysis could include:
1. Our analysis shows that an increase in age decreases a patient's chance of survival; it also indicates that if a patient is older than 75 years, with EF of 1 or 2, no renal dysfunction and has a follow-up time after 100 days(not sick). Those patients will have a better survival rate than those who have the same age but have renal dysfunction and EF of 0.
Will creating an educational program that prevents renal dysfunction and maintains an EF level of 1 or 2 to improve the survival rate for patients older than 75 years and who have a follow-up time after 100 days.
2. Are there other factors that could be measured, such as obeisty, cholestrol, family history and physical inactivity that could improve the predictability of the model?