Project 1 - MATH 372.Rmd

---
title: "Project 1 - Math 372"
author: "Hadley Dixon"
date: "2023-09-27"
output: html_document
---

## Data visualization and pre-processing

**Data Description**: The data in the file “UN.txt” contains PPgdp, the 2001 gross national product per person in US dollars, and Fertility, the birth rate per 1000 females in the population in the year 2000. The data are for 184 localities, mostly UN member countries, but also other areas such as Hong Kong that are not independent countries. In this problem, we study the relationship between Fertility and PPgdp.

```{r}
UN <- read.table(file = "UN.txt", header = TRUE)
library(ggplot2)
library(stats)
```

**1. Draw the scatterplot of Fertility on the vertical axis versus PPgdp on the horizontal axis and summarize the information in this graph. Does a simple linear regression model seem to be a plausible for a summary of this graph?**

*It does not seem that a simple linear regression model fits this data well, in the state that it is in. We will likely need to transform either X, Y, or both to be able to apply a SLR model to this data.*

```{r}
original_scat <- ggplot(UN, aes(x=PPgdp, y=Fertility)) + geom_point()
original_scat
```


**2. In order to get a better fit, we seek to transform the variables. What transformations you would take so that a simple linear regression model is proper? State why you choose these transformations. Draw the scatter plot of the transformed variables. Comment on the plot.**

*In order to get a better fit, I have chosen to transform X and Y. The transformations I have chosen is $\tilde{X} = log(X)$ and $\tilde{Y} = log(Y)$. I have chosen this transformation because the original data looks similar to a negative logarithmic function, as opposed to linear, after initial inspection. I confirmed this hypothesis by plotting the relationship between X and Y using each of the most common transformations ($log(X)$, $sqrt(X)$, $X^2$, $X^3$, $\sqrt[3]{X}$, $1/X$, respectively for Y as well). I found that the transformation which created the most linear relationship between X and Y was log(X) and log(Y), which will be confirmed when diagnostic plots are run in Part 4. Commenting on the scatter plot below, there seems to be a [relatively] strong negative linear relationship between $\tilde{X}$ and $\tilde{Y}$, with minimal outliers in the data*

```{r}
transX <- log(UN$PPgdp)
transY <- log(UN$Fertility)
transformed_scat <- ggplot(UN, aes(transX, transY)) + geom_point()
transformed_scat
```

## Model fitting and diagnostics

**3. Fit the simple linear model on the transformed data through three ways. Report the least square estimates for the coefficients and R2. Add the fitted line to the scatter plot on the transformed data and comment on the fit.**

*The least squares estimators for the coefficients of our SLR model are:*<br />
*$β_0 = 2.87607$*<br />
*$β_1 = -0.23749$*<br /><br />
*The $R^2$ value is:*<br />
*$R^2 = 0.5813$*

**(a) Plain coding (not using the ‘lm’ function or matrix manipulation)**

```{r}
n <- 184
xbar <- mean(transX) # 7.662578
ybar <- mean(transY) # 1.056323

ssx <- sum((transX-xbar)^2) # 510.8974
ssy <- sum((transY-ybar)^2) # 49.56609
ssxy <- sum((transX-xbar)*(transY-ybar)) # -121.3306

b1 <- ssxy/ssx # -0.2374852        
b0 <- ybar - b1*xbar # 2.876072

ssr <- b1^2 * ssx # 28.81422
sst <- ssy # 49.56609
r2 <- ssr/sst # 0.5813292
```

**(b) Using the ‘lm’ function**

```{r}
model <- lm(transY ~ transX, data = UN)
summary(model)
```

**(c) Through matrix manipulation**

```{r}
X <- as.matrix(cbind(1,transX))
Y <- as.matrix(transY)

B <- solve(t(X) %*% X) %*% t(X) %*% Y

H <- X %*% solve(t(X) %*% X) %*% t(X)
I <- matrix(0, nrow = n, ncol = n)
I[row(I) == col(I)] <- 1
J <-matrix(1, nrow = n, ncol = n)
temp <- I - (J/n)

SSE <- t(Y) %*% (I - H) %*% Y
SST <- t(Y) %*% temp %*% Y

R2 <- 1 - (SSE/SST)
```

**Graph the fitted regression line**

```{r}
fitted_scat <- transformed_scat + geom_smooth(method=lm , color="red", se=FALSE)
fitted_scat
```

**4. Draw the diagnostic plots and comment**

*My first diagnostic plot is the residual plot, which shows the difference between the observed response and the fitted response values. Ideally, the (null) residual plot should show a random scatter of points forming an approximately constant width band around 0. Looking at the residual plot below, this is, in fact, the case. We see no pattern, which points to homoscedasticity, that our $\sigma^2$ does not depend on $x_i$. Also, we also see the points centered around $e_i = 0$ and no presence of outliers.*

```{r}
# Residual plot
yhatFunct <- function(ximp){
  b0 + b1 * ximp}

yhat <- b0 + b1*transX

residual <- transY - yhatFunct(transX)

ggplot(UN, aes(x=transX, y=residual)) + geom_point() + geom_smooth(method=lm , color="blue", se=FALSE)
```

*My second diagnostic plot is the histogram of the residuals. This primarily assesses the normality of the residuals as well as the presence of outliers. Ideally, the histogram is a bell shape curve. Looking at the plot below, there is a symmetric bell-shape which is approximately normally distributed around zero. There is slight under dispersion, however, when interpreting alongside the residual plot and QQ plot, I have concluded there is not cause for concern.*

```{r}
# Histogram of residual plot
hist(residual)
```

*My third diagnostic plot is the Q-Q plot (Normal Probability Plot), which primarily assesses the normality of the residuals. It is a graphical method for comparing 2 probability distributions by plotting their quantiles against each other. The quantile of each observed residual is calculated (observed quantile), and then compared to what it should be under a perfect Normal distribution (theoretical quantile). Looking at the Q-Q plot below, we see very slight light tails, however, alongside the plot of the residuals and histogram above, I would be comfortable performing inference given the Q-Q plot results.*

```{r}
# Q-Q plot
qqnorm(residual, pch = 1, frame = FALSE)
qqline(residual , col = "green")
```

## Inference

**5. Test whether there is a linear relationship between the transformed variables**

Null and alternative hypothesis: $H_0: β_1 = 0, H_1: β_1 ≠ 0$ <br />
Significance level = 0.05  <br />
Test statistic ($t$) = $βˆ_1/SE(βˆ_1)$ <br />
$t ∼ t_{n-2}$ under the null hypothesis <br />
Decision rule: If $|t| > t^*_{n-2}(1 − α/2)$ then we reject $H_0$ and conclude that PPgdp is a significant predictor of fertility<br /><br />
*Conclusion: Our quantile is less than the absolute value of our t-statistic so we reject the null hypothesis and conclude that per person GDP is a significant predictor of fertility.*

$T=\dfrac{\hat{\beta_1}-0}{\widehat{SE}(\hat{\beta}_1)}=-15.89683$

```{r}
sigmahat2 <- sum(residual^2) / (n-2) # 0.1140213 (aka: MSE)

seb1 <- sqrt(sigmahat2/ssx) # 0.01493916

alpha95 <- 0.05
quantile95 <- qt(1-alpha95/2, n-2) # 1.973084

t <- b1/seb1 # -15.89683
abs(t) > quantile95 # TRUE
```

**6. Provide a 99% confidence interval on the expected Fertility for a region with PPgdp 20,000 US dollars in 2001.**

Before I begin the discussion, it is important to note that the values mentioned below are of the transformed data. The equivalent original data can be found by calculating the natural exponential of X or Y, respectively ($e^X$; $e^Y$). The reversions to original data are denoted in [brackets]<br />

*The expected fertility for a region with PPgdp of \$9.903488 [original: \$20,000] in 2001 is 0.5241401 [original: 1.689006]. We find this expected value by transforming our $x_i$ value from 20000 to 9.903488 using a log(x) transformation, to stay consistent with our previous transformation. A 99% confidence interval for this expected fertility is (0.4155429, 0.6327373) [original: 1.515193, 1.882757]. This means that we know with 99% certainty that our expected fertility, 0.5241401, will fall between 0.4155429 and 0.6327373 In more precise terms, if we took repeated samples and then calculated repeated confidence intervals, we can expect 99% of these confidence intervals would contain 0.5241401.*

```{r}
trans20000 <- log(20000) # 9.903488

seFunct_ci <- function(xinp){
  se <- sqrt(sigmahat2) * sqrt((1/n) + (((xinp-xbar)^2) / ssx))
  return(se)}

alpha99 <- 0.01
quantile99 <- qt(1-alpha99/2, n-2) # 2.603112

lb99 <- yhatFunct(trans20000) - quantile99 * seFunct_ci(trans20000) # 0.4155429
ub99 <- yhatFunct(trans20000) + quantile99 * seFunct_ci(trans20000) # 0.6327373
```

**7. Provide a 95% confidence band for the relation between the expected Fertility and PPgdp. Add the bands to the scatter plot of the original data.**

```{r}
xgrid <- seq(min(transX), max(transX), length.out = 100) # Create a range of evenly distributed x values from minimum and maximum values
df <- data.frame(x = numeric(100), lower_bound = numeric(100), upper_bound = numeric(10)) # Create data frame
WHquantile <- sqrt(2*qf(1-alpha95, 2, n-2))

# Calculate simultaneous CI over the range of x
for (i in 1:100) {
  xi <- xgrid[i]
  df$x[i] <- xi
  df$lower_bound[i] <- yhatFunct(xi) - WHquantile * seFunct_ci(xi)
  df$upper_bound[i] <- yhatFunct(xi) + WHquantile * seFunct_ci(xi)
}

# Connect the lower bound points together and the upper bound points together
fitted_scat + geom_line(data=df, aes(x=x, y=lower_bound)) + geom_line(data=df, aes(x=x, y=upper_bound))

# Connect the lower bound points together and the upper bound points together
original_scat + geom_line(data=exp(df), aes(x=x, y=lower_bound)) + geom_line(data=exp(df), aes(x=x, y=upper_bound))
```

**8. Assuming that the same relationship between Fertility and PPgdp holds, give a 99% prediction interval on Fertility for a region with PPgdp 25,000 US dollars in 2018.**

Before I begin the discussion, it is important to note that the values mentioned below are of the transformed data. The equivalent original data can be found by calculating the natural exponential of X or Y, respectively ($e^X$; $e^Y$). The reversions to original data are denoted in [brackets]<br />

*Assuming that the same relationship between Fertility and PPgdp holds, a 99% prediction interval on Fertility for a region with PPgdp \$25,000 in 2018 is (-0.415426, 1.35772) [0.660059, 3.88732]. We find this by transforming our $x_i$ value from 25000 to 10.12663 using a log(x) transformation, to stay consistent with our previous transformation.*

```{r}
trans25000 <- log(25000) # 10.12663
predict(model, data.frame(transX = trans25000) , interval = "prediction", level = 0.99) # [-0.415426, 1.35772]
```

**9. Based on the diagnostic plots in Part 4, do you have any concern on the above hypothesis testing and inferences? If so, what are the concerns?**

*Based on the diagnostic plots in Part 4, I have very minimal concern on the above hypothesis testing and inferences. Once the data was transformed, our 3 assumptions hold. There is linearity of X and Y, shown in the scatter plot of the transformed data. There is equal variance, such that sigma squared is constant and does not depend on x (homoscedasticity). This supports our final assumption that the residuals follow are iid Normally distributed with mean 0 and variance sigma squared. We know this because our residuals are centered around 0 with no pattern observed or presence of an outlier (shown in the residual scatter plot and histogram of residuals). Similarly, our QQ plot shows points falling approximately on the line y = x, with slight light tails. I would be comfortable performing inference based on the hypothesis testing and inferences. *