Project-01.Rmd

---
title: 'STA4234: Project 1'
author: 'Benjamin Hendrix'
date: 'Last updated: `r Sys.Date()`'
output:
  html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
# pull in packages we need
library(gsheet)
library(tidyverse)
library(ggpubr)
# NOTE 1 -- if you have not been engaging with R prior to this project, you may need to install packages.
# NOTE 2 -- you do *not* need to edit anything below this note in the code chunk. All of the data management has been done for you. Once you confirm that the data has pulled in properly, you are ready to begin the project.
options(scipen = 999)
# call in data
data <- as_tibble(gsheet2tbl("https://docs.google.com/spreadsheets/d/1H3TP-2SBMGleriJLESOe1cdCjtSj2F76bUh5iBqC8tI"))
# keep the variables of interest
data <- data %>% select(subjid, sbp, age, sex, BMI, HbA1c, perceivedStress, lifetimeDiscrm)
# keep complete cases
data <- na.omit(data)
```

#### 1. Construct the correlation matrix for all variables in the dataset. Looking at the correlations between systolic blood pressure (<span style="font-family:Courier;">sbp</span>) and the others, what potential relationships do you see?
```{r}
data_copy_without_sex_variable <- data[, -4]
cor(data_copy_without_sex_variable, method = c("pearson"))
```

#### 2. Construct the regression model that models systolic blood pressure (<span style="font-family:Courier;">sbp</span>) as a function of age (<span style="font-family:Courier;">age</span>), sex (<span style="font-family:Courier;">sex</span>), body mass index (<span style="font-family:Courier;">BMI</span>), hemoglobin A1c (<span style="font-family:Courier;">HbA1c</span>), perceived stress (<span style="font-family:Courier;">perceivedStress</span>), and lifetime discrimination (<span style="font-family:Courier;">lifetimeDiscrm</span>). This is model 1 (m1).
\[ \hat{Y} = \beta_0 + \beta_{age} X_{\text{age}} + \beta_{sex} X_{\text{sex}} + \beta_{BMI} X_{\text{BMI}} + \beta_{HbA1c} X_{\text{HbA1c}} + \beta_{stress} X_{\text{stress}} + \beta_{discrimination} X_{\text{discrimination}}\]
```{r}
m1 <- lm(sbp ~ age + sex + BMI + HbA1c + perceivedStress + lifetimeDiscrm, data = data)
summary(m1)
```
\[\hat{Y} = 89.012 + (0.44943) X_{\text{age}} + (4.09604) X_{\text{sex}} + (0.27336) X_{\text{BMI}} + (0.53173) X_{\text{HbA1c}} + (0.08744) X_{\text{stress}} - (0.33151) X_{\text{discrimination}}\]


#### 3. Use the appropriate hypothesis test to show that perceived stress and lifetime discrimination can be removed from the model simultaneously. Test at the $\alpha=0.05$ level. 
```{r}
drop1(m1, .~., test="F")
```
**Hypotheses**
&emsp;&emsp; \[H_0: \beta_{stress} = \beta_{discrimination} = 0\  in\ the\ model\ Y=\beta_0 + \beta^*_{age}X_{age} + \beta^*_{sex}X_{sex} + \beta^*_{BMI}X_{BMI} + \beta^*_{HbA1c}X_{HbA1c} + \beta^*_{stress}X_{stress} + \beta^*_{discrimination}X_{discrimination}\]
&emsp;&emsp; \[H_1: at \ least \ one \ \beta^*_i \neq 0\  in\ the\ model\ Y=\beta_0 + \beta^*_{age}X_{age} + \beta^*_{sex}X_{sex} + \beta^*_{BMI}X_{BMI} + \beta^*_{HbA1c}X_{HbA1c} + \beta^*_{stress}X_{stress} + \beta^*_{discrimination}X_{discrimination}\]

**Test Statistic**
&emsp;&emsp; $F_{stress}$ = 1.4023
$F_{discrimination}$ = 5.0827
``` {r}
reduced <- lm(sbp ~ age + sex + BMI + HbA1c, data = data)
anova(reduced, m1)
```

***p*-value**
&emsp;&emsp; $p_{stress}$ = 0.23645
$p_{discrimination} \in (0.01, 0.025)$

**Rejection Region**
&emsp;&emsp; Reject if $p < \alpha$, where $\alpha=0.05$.

**Conclusion and Interpretation**
&emsp;&emsp; Fail to Reject $H_0$. There is not sufficient evidence to suggest that both perceived stress and lifetime discrimination are insignificant predictors to systolic blood pressure.


#### 4. Reconstruct the regression model *without* perceived stress (<span style="font-family:Courier;">perceivedStress</span>) and lifetime discrimination (<span style="font-family:Courier;">lifetimeDiscrm</span>). This is model 2 (m2).
\[\hat{Y} = \beta_0 + \beta_{age} X_{\text{age}} + \beta_{sex} X_{\text{sex}} + \beta_{BMI} X_{\text{BMI}} + \beta_{HbA1c} X_{\text{HbA1c}} + \beta_{stress} X_{\text{stress}} + \beta_{discrimination} X_{\text{discrimination}}\]
```{r}
m2 <- lm(sbp ~ age + sex + BMI + HbA1c, data = data)
summary(m2)
```
\[\hat{Y} = 88.77404 + (0.44386) X_{\text{age}} + (3.85831) X_{\text{sex}} + (0.27100) X_{\text{BMI}} + (0.55311) X_{\text{HbA1c}}\]


#### 5. Find $R^2_{\text{adj}}$ for both models and provide interpretations. Does $R^2_{\text{adj}}$ support our decision to remove perceived stress and lifetime discrimination from the model?
```{r}
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
```
\[R^{2}_{\text{adj|m1}} = 0.130779  > R^{2}_{\text{adj|m2}} = 0.1295746\]
Yes, the $R^2_{adj}$ value supports our decision to remove perceived stress and lifetime discrimination from the model as the $R^2_{adj}$ value for model 1 is greater than the $R^2_{adj}$ value for model 2.

#### 6. Use the appropriate hypothesis test to determine if model 2 is a significant regression model. (i.e., is at least one slope non-zero?)  Test at the $\alpha=0.05$ level.
**Hypotheses**
&emsp;&emsp; \[H_0: \beta^*_i = 0\  in\ the\ model\ Y=\beta_0 + \beta^*_{age}X_{age} + \beta^*_{sex}X_{sex} + \beta^*_{BMI}X_{BMI} + \beta^*_{HbA1c}X_{HbA1c}\]
&emsp;&emsp; \[H_1: at \ least \ one \ \beta^*_i \neq 0\  in\ the\ model\ Y=\beta_0 + \beta^*_{age}X_{age} + \beta^*_{sex}X_{sex} + \beta^*_{BMI}X_{BMI} + \beta^*_{HbA1c}X_{HbA1c}\]

**Test Statistic**
&emsp;&emsp; \[t_0 = \frac{Estimate}{Std. Error}\]
\[t_{age} = \frac{0.44386}{0.02588} = 17.152\]
\[t_{sex} = \frac{3.85831}{0.62551} = 6.168\]
\[t_{BMI} = \frac{0.27100}{0.04420} = 6.131\]
\[t_{HbA1c} = \frac{0.55311}{0.27638} = 2.001\]

***p*-value**
```{r}
format(round(summary(m2)$coefficients[,4], 3), nsmall = 3)
```
&emsp;&emsp;  
$p_{age}$ < 0.001  
$p_{sex}$ < 0.001  
$p_{BMI}$ < 0.001  
$p_{HbA1c}$ < 0.05  

**Rejection Region**
&emsp;&emsp; Reject if $p < \alpha$, where $\alpha=0.05$.

**Conclusion and Interpretation**
&emsp;&emsp; Reject $H_0$. There is sufficient evidence to suggest that age, sex, BMI, and HbA1c are significant predictors for systolic blood pressure.


#### 7. Create a table containing the estimated slope, 95% confidence interval, $t$ statistic, and $p$-value for each predictor in m2.
|       | **$\hat{\beta}$ (95% CI)** | **$t$** | **$p$-value** |
|-------|---------------------------------------------|-----------------|--------------------------------|
| Age   | (0.3931, 0.4946)     |  17.152    | < 0.001      |
| Sex   | (2.6317, 5.0849)     |   6.168    | < 0.001      |
| BMI   | (0.1843, 0.3577)     |   6.131    | < 0.001      |
| HbA1c | (0.0111, 1.0951)     |   2.001    |   0.045      |

```{r}
confint(m2)
```
#### 8. Based on your results in the table above, which, if any, are significant predictors of systolic blood pressure? You do not need to provide the formal hypothesis test for each predictor, however, you must justify your response.
Age, sex, BMI, and HbA1c are all significant predictors of systolic blood pressure as every p-value is less than \alpha = 0.05. The lower our p-values are the more the predictors in the model represents the true value of the data. 

#### 9. Provide a brief interpretation for the relationship between age and systolic blood pressure.
Age is the biggest predictor for systolic blood pressure of all of the possible predictors in the model. This can be concluded by referencing the t-value presented in the summary graph above. Age has the greatest value which means that it contributes to the systolic blood pressure the most of all of the other predictors.

#### BONUS. Graph the regression line for males (<span style="font-family:Courier;">sex = 1</span>) with a BMI of 30 (<span style="font-family:Courier;">BMI = 1</span>) and HbA1c of 5.9 (<span style="font-family:Courier;">HbA1c = 5.9</span>). Include the confidence and prediction bands.
```{r}
BMI <- 30
HbA1c <- 5.9

new <- tibble(BMI, HbA1c)

one_cb <- bind_rows(data, new)
one_cb$Prediction <- predict(m2, newdata = one_cb)
one_cb$LCL <- predict(m2, newdata = one_cb, interval = "confidence", level = 0.95)[,2]
one_cb$UCL <- predict(m2, newdata = one_cb, interval = "confidence", level = 0.95)[,3]
select(filter(one_cb, BMI == 30 & HbA1c == 5.9), LCL, UCL)

one_pb <- bind_rows(data, new)
one_pb$Prediction <- predict(m2, newdata = one_pb)
one_pb$LCL <- predict(m2, newdata = one_pb, interval = "prediction", level = 0.95)[,2]
one_pb$UCL <- predict(m2, newdata = one_pb, interval = "prediction", level = 0.95)[,3]
select(filter(one_pb, BMI == 30 & HbA1c == 5.9), LCL, UCL)
```