---
title: "Regression Modelling in `R`"
subtitle: "A bit of theory and application"
institute: "North Central London ICB"
date: "24/11/2022"
author: "Chris Mainey - c.mainey1@nhs.net"
output:
  xaringan::moon_reader:
    css: 
      - default
      - nhsr-theme/css/nhsr.css
      - nhsr-theme/css/nhsr-fonts.css
    lib_dir: libs
    seal: false
    self_contained: true
    nature:
      ratio: '16:9'
      highlightStyle: github
      highlightLines: true
      highlightLanguage: ["r"]
      countIncrementalSlides: false

---
class: title-slide, left, bottom
```{r setup, include = FALSE}
library(ragg)
ragg_png = function(..., res = 192) {
  ragg::agg_png(..., res = res, units = "in")
}
knitr::opts_chunk$set(dev = "ragg_png", fig.ext = "png")
library(knitr)
library(tidyverse)
library(NHSRtheme)
library(xaringanExtra)
library(ggforce)
# set default options
opts_chunk$set(echo = FALSE,
               fig.width = 6,
               fig.height = 3.3,
               dpi = 300,
               dev = "png", 
               dev.args = list(type = "cairo-png"))

# use the NHS-R theme colours by default (comment out the following lines to revert to the ggplot2 defaults)
scale_fill_continuous <- partial(scale_fill_nhs, discrete = FALSE)
scale_fill_discrete <- partial(scale_fill_nhs, discrete = TRUE)
scale_colour_continuous <- partial(scale_colour_nhs, discrete = FALSE)
scale_colour_discrete <- partial(scale_colour_nhs, discrete = TRUE)

theme_set(
theme_bw() +
    theme(legend.position = "right",
          plot.subtitle = element_text(face = "italic"
                                       , family = "sans"
                                       , size=10))
)

alpha <- 5.4776
X<-c(1,2,2)
Y<-alpha + (X * 1.2507)
Y[2] <- Y[1]
dt2  <- data.frame(X,Y)

triangle<- data.frame(X=c(1.5,2.13,0),
Y = c(6.3, 7.3, 6.2), 
#value=c("1", "1.5", "2"),
label= c("1", "\u03B2", "\u03B1" ))


```

```{r echo=FALSE}
xaringanExtra::use_logo(
  image_url = "nhsr-theme/img/logo-nhs-blue.png",
  exclude_class = c("title-slide", "inverse", "hide-logo")
)
```

# `r rmarkdown::metadata$title`
----
## **`r rmarkdown::metadata$subtitle`**
### `r rmarkdown::metadata$author` 
### `r icons::icon_style(icons::fontawesome("twitter"), fill = "#005EB8")`@chrismainey
#### `r rmarkdown::metadata$date`


---

# Workshop Overview

- Correlation
- Linear Regression
 - Specifying a model with `lm`
 - Interpreting the model output
 - Assessing model fit
- Multiple Regression
- Prediction
- Generalized Linear Models using `glm`
 - Logistic Regression
 

<br>
___Mixture of theory, examples and practical exercises___

---

# Relationships between variables

If two variables are related, we usually describe them as 'correlated'.


<br>
Usually interested in "strength" and "direction" of association

--

<br><br>
Two analysis techniques commonly used to investigate:


+ ___Correlation:___ shows direction, and strength of association

+ ___Regression:___ estimates how one variable changes in relation to one or more others. Usually:
  + $y$ (the variable we're interested in) is the "dependent variable" or "outcome"
  + $x$ (or more than one, $x_{i}$) are the "independent variables" or "predictors"

--

<br>
Sometimes the effects of other variables interact/mask this (___"confounding"___)

---

```{r lmsetup, echo=FALSE,include=FALSE}
set.seed(222)
x <- rnorm(50, 10, 4)
y <- runif(50, min = 0.5, 10) + (1.25*x)
z <- predict(lm(y~x))
set.seed(222)
library(ggplot2)
```

## Example:
```{r lm1, fig.align="center",fig.width = 5.9, fig.height = 3}
x <- rnorm(50, 10, 4)
y <- runif(50, min = 0.5, 10) + (1.25*x)

a<-ggplot(data.frame(x,y,z), aes(x=x,y=y))+
  geom_point(col="dodgerblue2", alpha=0.65)
a
```

---

# Correlation

+ Measured with a correlation coefficient ('Pearson' is the most common)

+ Range: 
 + __-1 to 1:__ Perfect negative to Perfect positive Correlation
 + __0:__  No Correlation

--

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg" height="300" class="center">
</center>

.footnote[
Graphic from:
Wikipedia: [Correlation and dependence:](https://en.wikipedia.org/wiki/Correlation_and_dependence)
By DenisBoigelot, https://commons.wikimedia.org/w/index.php?curid=15165296 [Accessed 24 Sept 2019]
]
---

# Correlation in R

Let's check the correlation in our generated data:

```{r correlation, echo = TRUE, collapse=TRUE}
cor(x, y)

cor.test(x,y)

```

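+ `cor.test` reports the correlation coefficient together with a significance test (a t-test for 'Pearson').
+ There are different types of correlation coefficient; the default is 'Pearson'.
+ It doesn't cope well with non-normal distributions, other data types (e.g. categorical), or more than two variables at once.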
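
A minimal sketch of the coefficient options, using the built-in `mtcars` data purely for illustration (it is not part of the workshop data). `cor()` takes a `method` argument for rank-based coefficients, and returns a matrix of pairwise correlations when given several columns:

```r
# Illustrative only: built-in mtcars data, not the workshop data
# Pearson (default) versus rank-based Spearman correlation
pearson  <- cor(mtcars$mpg, mtcars$wt)
spearman <- cor(mtcars$mpg, mtcars$wt, method = "spearman")
c(pearson = pearson, spearman = spearman)

# Several columns give a matrix, but each entry still describes one pair only
round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)
```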

---

### Regression models (1)

Regression gives us more options than correlation:

```{r lm3, fig.align="center", fig.retina=2, message=FALSE, warning=FALSE,fig.width = 5.9, fig.height = 3}
z <- lm(y~x)
print(a<- a + geom_smooth(method="lm", col="red", linetype=2,fullrange=TRUE, se=FALSE))
```


$$y= \alpha + \beta x + \epsilon$$

---

### Regression models (2)
Zooming in...
 
```{r lm35, fig.align="center", fig.retina=2, message=FALSE, warning=FALSE,fig.width = 5.9, fig.height = 3}
z <- lm(y~x)
print(a + geom_mark_circle(aes(y=5.4776 , x=0), col = "darkgoldenrod2", linetype="dashed",
             size=1.2)+
        geom_polygon(aes(x=X,y=Y), fill=NA, col = "darkgoldenrod2", linetype="dashed",
             size=1.2 ,data=dt2) +
        geom_label(aes(x=X, y = Y, label=label), data=triangle)+
        coord_cartesian(xlim = c(0,3),ylim= c(5,10))
        )
```

---

## Regression equation

$$\large{y= \alpha + \beta_{i} x_{i} + \epsilon}$$
<br>
.pull-left[
+ $y$ - is our 'outcome', or 'dependent' variable
+ $\alpha$ - is the 'intercept', the point where our line crosses the y-axis
+ $\beta$ - is a coefficient (weight) applied to $x$ 
+ $x$ - is our 'predictor', or 'independent' variable
+ $i$ - is our index, we can have $i$ predictor variables, each with a coefficient
+ $\epsilon$ - is the remaining ('residual') error
]

--

.pull-right[

We are making some assumptions:
+ Linear relationship between $x$ and $y$
+ Data points are independent (not correlated)
+ Normally distributed error
+ Homoskedastic (error doesn't vary across the range)
]
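
One way to see the equation at work is to simulate data from it and check that `lm` recovers the parameters. This is a sketch only: the values of `alpha_true` and `beta_true`, and all variable names, are hypothetical choices for illustration.

```r
# Hypothetical values chosen for illustration only
set.seed(1)
alpha_true <- 2      # intercept
beta_true  <- 0.5    # slope

x_sim   <- runif(200, 0, 10)              # predictor
eps_sim <- rnorm(200, mean = 0, sd = 1)   # normally distributed residual error
y_sim   <- alpha_true + beta_true * x_sim + eps_sim

# lm() estimates alpha ("(Intercept)") and beta ("x_sim") from the data
est <- coef(lm(y_sim ~ x_sim))
est
```

The estimates should land close to, but not exactly on, the true values, because of the random error term.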

---

### Ordinary Least Squares 'OLS'

+ The 'residual' is the distance between the prediction and the data point ( $\epsilon$ ).
--

+ The residuals sum to zero, so we square them (^2) and minimise the _'sum of the squares'_
--

```{r lm2, fig.align="center", message=FALSE, warning=FALSE, fig.width = 5.9, fig.height = 3}
a + geom_segment(aes(xend=c(x), yend=c(z)), size=0.4, arrow=arrow(length=unit(0.15,"cm")))
```
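
For a single predictor, the least-squares estimates have a closed form, so they can be checked by hand. The sketch below uses the built-in `cars` data as a stand-in (an assumption for illustration, not the slide data):

```r
# Illustrative only: built-in cars data (stopping distance vs speed)
x_ols <- cars$speed
y_ols <- cars$dist

# Closed-form OLS estimates for a single predictor
beta_hat  <- cov(x_ols, y_ols) / var(x_ols)
alpha_hat <- mean(y_ols) - beta_hat * mean(x_ols)

fit_ols <- lm(y_ols ~ x_ols)
rbind(by_hand = c(alpha_hat, beta_hat), lm = unname(coef(fit_ols)))

sum(residuals(fit_ols))     # effectively zero
sum(residuals(fit_ols)^2)   # the quantity OLS minimises
```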

---

## Regression models (3)

So now let's create a linear regression model.  I prefer to create them as objects so I can use them again later.  Let's call this one _model1_.  That's fairly bad naming, but oh well....
<br><br><br>
Let's say we have a data.frame called _mydata_, with a column called _Y_ (which we are predicting) and a column called _X_ (our predictor).

```{r lm2a, eval=FALSE, echo=TRUE}
model1 <- lm(Y ~ X, data = mydata)
```
<br><br>
We can then use other methods on this object, like `print()`, `summary()`, `plot()` and `predict()`.

<br><br>

The next two slides show the output of the summary function and plot.
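
As a runnable stand-in for the pattern above, this sketch assumes the built-in `cars` data plays the role of _mydata_, with `dist` as _Y_ and `speed` as _X_:

```r
# Illustrative only: `cars` stands in for `mydata`, `dist` for Y and `speed` for X
model_cars <- lm(dist ~ speed, data = cars)

coef(model_cars)      # intercept and slope
confint(model_cars)   # 95% confidence intervals for both

# Predictions for new values of the predictor
new_speed <- data.frame(speed = c(10, 20))
predict(model_cars, newdata = new_speed)
```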
---

## `lm` summary

```{r sum_reg}
summary(z)
```


--

+ We can test fit using F-tests, prediction error, or the R<sup>2</sup> (the proportion of variation in $y$ explained by $x$).
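
The R<sup>2</sup> can be rebuilt from the residuals, which makes the "proportion of variation explained" reading concrete. A sketch, again assuming the built-in `cars` data for illustration:

```r
# Illustrative only: built-in cars data
fit_r2 <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(fit_r2)^2)               # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
r2_by_hand <- 1 - rss / tss

c(by_hand = r2_by_hand, from_summary = summary(fit_r2)$r.squared)
```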


---

## Interpretation (1)

So how do we interpret the output?  

+ The intercept $(\alpha)$ = 5.48
+ The coefficient $(\beta)$ for $x$ = 1.25

___"For each increase of 1 in $x$,  $y$ increases by 1.25, starting at 5.48."___

--

<br><br> 

A common addition is to __"mean-centre and scale"__ our variables. So $x$ becomes:

$$ \frac{(x - \bar{x})}{\sigma_x} $$
```{r lm2b, eval = FALSE, echo=TRUE}
model1_scaled <- lm(Y ~ scale(X), data = mydata)
```
---
## Interpretation (2)

```{r xaringan-panelset, echo=FALSE}
xaringanExtra::use_panelset()
```

.panelset[

.panel[.panel-name[Original]
```{r modor}
z <- lm(y~x)
summary(z)
```
]

.panel[.panel-name[Scaled]
```{r modsc}
z <- lm(y~scale(x))
summary(z)
```

]

.panel[.panel-name[Interpretation]
<br><br>
This converts our interpretation:
+ The intercept becomes the average $y$ value
+ $\beta$ becomes the change in $y$ for one standard deviation increase in $x$ 

<br>

So...
+ Average $y$ is about 18.2 (the intercept in the 'Scaled' panel)
+ Each increase of 1 standard deviation in $x$ increases $y$ by 4.7256
]
]
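
Both effects of scaling can be verified directly. This is a sketch on the built-in `cars` data (an assumption for illustration), comparing a raw and a scaled fit:

```r
# Illustrative only: built-in cars data
fit_raw    <- lm(dist ~ speed, data = cars)
fit_scaled <- lm(dist ~ scale(speed), data = cars)

# After scaling, the intercept equals the mean of the outcome
c(intercept = unname(coef(fit_scaled)[1]), mean_y = mean(cars$dist))

# The scaled slope is the raw slope multiplied by the SD of the predictor
c(scaled_slope   = unname(coef(fit_scaled)[2]),
  raw_slope_x_sd = unname(coef(fit_raw)[2]) * sd(cars$speed))
```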

---

## Regression diagnostics (1)

A common check is to plot residuals:
```{r rplotsetup, eval=TRUE, fig.width = 5.9, fig.height = 3}
par(mfrow=c(1,2))
plot(z, which = c(1:2))
```
---
## Regression diagnostics (2)

A common check is to plot residuals:
```{r rplotsetup2, eval=TRUE, fig.width = 5.9, fig.height = 3}
par(mfrow=c(1,2))
plot(z, which = c(3:4))
```

---
class: center, middle

## Exercise 1:  Linear regression with a single predictor

---

# More than one predictor?

Our plots in earlier slides make sense in 2 dimensions, but regression is not limited to this.

--

<br><br>
If we add more predictors, our interpretation of each coefficient becomes: <br>
+ ___"The change in $y$ whilst holding all other predictors constant"___

<br><br>

We can add more predictors with `+`:
```{r formula, eval=FALSE, echo=TRUE}
lm(y ~ x1 + x2 + x3 + xi)
```
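
A small sketch of "holding the others constant", assuming the built-in `mtcars` data for illustration. Two hypothetical cars that differ by one unit of `wt` but share the same `hp` have predictions that differ by exactly the `wt` coefficient:

```r
# Illustrative only: built-in mtcars data
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit_multi)

# Two hypothetical cars: wt differs by 1, hp is held constant
new_cars <- data.frame(wt = c(3, 4), hp = c(150, 150))
pred_diff <- diff(predict(fit_multi, newdata = new_cars))

unname(pred_diff)               # change in predicted mpg
unname(coef(fit_multi)["wt"])   # the wt coefficient: the same value
```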

---

## Categorical variables

How do we enter categorical variables into a model?

+ Models can't use text directly, and numeric codes would be treated as quantities, so we use `factor` variables.

--

<br>
`Factors` are 'dummy coded':
+ _'pivoted' to binary columns_
+ One level acts as the reference. With categories "A", "B" & "C", we get:
```{r dummy, echo=TRUE}
a<- data.frame(Category = factor(c("A", "B", "C", "C", "A", "B")))
model.matrix(~., a)
```
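
The reference level is the first factor level by default, and every other coefficient is read relative to it. A sketch of changing it with `relevel()`, using a hypothetical data frame named `cat_df`:

```r
# Hypothetical data for illustration
cat_df <- data.frame(Category = factor(c("A", "B", "C", "C", "A", "B")))

levels(cat_df$Category)   # the first level ("A") is the reference
cols_default <- colnames(model.matrix(~ Category, cat_df))
cols_default

# Make "C" the reference level instead
cat_df$Category <- relevel(cat_df$Category, ref = "C")
cols_releveled <- colnames(model.matrix(~ Category, cat_df))
cols_releveled
```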

---
class: middle

## Exercise 2:  Linear regression with multiple predictors

---
class: middle center

```{r, echo=FALSE, out.width="70%", fig.cap="https://xkcd.com/1725/"}
knitr::include_graphics("https://imgs.xkcd.com/comics/linear_regression.png")
```

---

## What about non-linear data?

- Data are not necessarily linear. Death is binary, length of stay (LOS) is a count, etc.

- We can use the __Generalized Linear Model (GLM)__:

$$\large{g(\mu)= \alpha + \beta x}$$
<br>
Where $\mu$ is the _expectation_ of $Y$, and $g$ is the link function

--

+ We assume a distribution from the [Exponential family](https://en.wikipedia.org/wiki/Exponential_family):

  + Binomial for binary, TRUE/FALSE, PASS/FAIL
  + Poisson for counts

--

+ The link function transforms the expected value $\mu$ (not the raw data) onto the scale of the linear predictor

--

+ Can't use OLS for this, so we use 'maximum-likelihood' estimation, which is iterative rather than an exact, closed-form solution.

--

+ Many of the methods, and `R` functions, for `lm` are common to `glm`, but we can't use R<sup>2</sup>. Other measures include AUC ('C-statistic'), AIC, or likelihood ratio tests.
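
The link functions themselves are available from the family objects in base `R`. A short sketch of the logit link and its inverse, using hypothetical probabilities:

```r
# The logit link maps a probability to the log-odds scale, and back again
p <- c(0.1, 0.5, 0.9)                  # hypothetical probabilities
log_odds <- binomial()$linkfun(p)      # same as qlogis(p) or log(p / (1 - p))
log_odds

back <- binomial()$linkinv(log_odds)   # same as plogis(log_odds); recovers p
back

# The Poisson family uses a log link by default
poisson()$link
```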

---

## Generalized Linear Models

Let's model the probability of death in a data set from US Medicare.

+ The data are in the `COUNT` package, and are called `medpar`

+ Load the library and use the `data()` function to load it.

--

+ We'll use a `glm`, with a `binomial` distribution.

+ The `binomial` family uses the `logit` link function by default: the log-odds of the event.

```{r glm1, collapse=TRUE, warning=FALSE, message=FALSE, echo=TRUE}
library(COUNT)
data(medpar)

glm_binomial <- glm(died ~ factor(age80) + los + factor(type), data=medpar, family="binomial")

ModelMetrics::auc(glm_binomial)
```

---

```{r glm2, collapse=TRUE, warning=FALSE, message=FALSE, echo=TRUE}
summary(glm_binomial)
```

---

## Interactions

+ 'Interactions' are where predictor variables affect each other. 

+ Allows us to separate effects into each predictor's own ('main') effect and their combined ('interaction') effect

+ Can add using `*` or `:`. `a * b` expands to `a + b + a:b`, while `a:b` adds only the interaction term (see `?formula`)

--

```{r glm3, collapse=TRUE, warning=FALSE, message=FALSE, echo=TRUE}
glm_binomial2 <- glm(died ~ factor(age80) * los + factor(type), data=medpar, family="binomial")

ModelMetrics::auc(glm_binomial2)
```
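
To see what each operator adds, the terms of a formula can be listed without fitting anything. The variable names `y`, `a` and `b` here are hypothetical placeholders:

```r
# Term labels produced by each operator (hypothetical variable names)
star_terms  <- attr(terms(y ~ a * b), "term.labels")
long_terms  <- attr(terms(y ~ a + b + a:b), "term.labels")
colon_terms <- attr(terms(y ~ a:b), "term.labels")

star_terms    # main effects plus the interaction
long_terms    # the same three terms, written out in full
colon_terms   # the interaction term only
```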

---
```{r glm4, collapse=TRUE, warning=FALSE, message=FALSE, echo=TRUE}
summary(glm_binomial2)
```

---

# Interpretation

+ Our model coefficients in `lm` were straightforward multipliers
+ `glm` is similar, but it is on the scale of the ___link-function___.
 + log scale for `poisson` models, or logit (log odds) scale for `binomial`
+ It is common to transform coefficients back towards the original ('response') scale by exponentiating them.
+ This gives Incidence Rate Ratios for `poisson`, or Odds Ratios for `binomial`.

```{r coeftransform, echo=TRUE, message=FALSE, warning=FALSE}
cbind(Link=coef(glm_binomial2), Response=exp(coef(glm_binomial2)))
```
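
Confidence limits can be exponentiated in the same way. This sketch assumes the built-in `mtcars` data for illustration and uses Wald intervals from `confint.default()`, which is one choice among several:

```r
# Illustrative only: built-in mtcars data, am (1 = manual) modelled on weight
fit_or <- glm(am ~ wt, data = mtcars, family = "binomial")

# Exponentiate the coefficients and their Wald confidence limits together
or_table <- exp(cbind(OddsRatio = coef(fit_or), confint.default(fit_or)))
or_table
# Note: the exponentiated intercept is a baseline odds, not an odds ratio
```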

---

# Odds-what-now?

Odds is a concept commonly used in statistics, but often misunderstood.  Let's consider a '2 x 2 table':


```{r, echo=FALSE, out.width="60%", fig.align='center'}
knitr::include_graphics("./man/figures/2_by_2.png")
```
.pull-left[
__Risk__ (relative risk is the ratio of the two):
+ _a / (a + b)_
+ _c / (c + d)_
]

.pull-right[
__Odds:__
+ _a / b_
+ _c / d_
]
---

# Odds Ratio
.pull-left[
```{r, echo=FALSE, out.width="100%", fig.align='center'}
knitr::include_graphics("./man/figures/2_by_2_pt2.png")
```
]
.pull-right[
__Odds Ratio:__
+ _<span style="color: red;">(a / b)</span> / <span style="color: blue;">( c / d)</span>_
+ _=  <span style="color: red;">a</span><span style="color: blue;">d</span> / <span style="color: blue;">c</span><span style="color: red;">b</span>_

<br><br>
+ If odds ratio = 1, chance of outcome is the same in each group

+ If odds ratio >1  - greater chance of outcome in exposure group

+ If odds ratio <1  - lesser chance of outcome in exposure group

]

<br><br>

Great explainer:  https://www.youtube.com/watch?v=ixKhS0Silb4
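
A worked example with made-up counts. This is a sketch that assumes _a_ and _c_ are the counts with the outcome in the exposed and unexposed groups, consistent with the odds _a / b_ and _c / d_ above; all numbers are hypothetical:

```r
# Hypothetical 2 x 2 counts: a, b = exposed (outcome / no outcome),
#                            c, d = unexposed (outcome / no outcome)
tab <- c(a = 20, b = 80, c = 10, d = 90)

risk_exposed   <- tab[["a"]] / (tab[["a"]] + tab[["b"]])
risk_unexposed <- tab[["c"]] / (tab[["c"]] + tab[["d"]])
relative_risk  <- risk_exposed / risk_unexposed

odds_ratio <- (tab[["a"]] / tab[["b"]]) / (tab[["c"]] / tab[["d"]])
c(relative_risk = relative_risk, odds_ratio = odds_ratio)

# A logistic regression on the same counts returns the same odds ratio
tab_df  <- data.frame(exposed = c(1, 0), events = c(20, 10), non_events = c(80, 90))
fit_tab <- glm(cbind(events, non_events) ~ exposed, data = tab_df, family = "binomial")
glm_or  <- exp(coef(fit_tab))[["exposed"]]
glm_or
```

Here the relative risk is 2 but the odds ratio is 2.25: the two measures are related but not interchangeable.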


---

class: middle

## Exercise 3: Generalized Linear Model (GLM)

---

## Prediction (1)

- We can then use our model to predict our expected $Y$:
- Need to decide what scale to predict on: `link` or `response`

```{r pred1, warning=FALSE, message=FALSE, echo=TRUE}
library(dplyr)
medpar$preds <- predict(glm_binomial2, type="response")

# Show the first five rows, including the new `preds` column
head(medpar, 5) %>% knitr::kable(format = "html")
```
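
The two prediction scales are linked by the inverse link function. A sketch, assuming a simple logistic model on the built-in `mtcars` data rather than the `medpar` model:

```r
# Illustrative only: built-in mtcars data
fit_pred <- glm(am ~ wt, data = mtcars, family = "binomial")

link_scale     <- predict(fit_pred, type = "link")       # log-odds
response_scale <- predict(fit_pred, type = "response")   # probabilities

# Applying the inverse link (plogis for logit) to the link-scale
# predictions reproduces the response-scale predictions
all.equal(plogis(link_scale), response_scale)

# Predicted probabilities for new, hypothetical observations
new_wt <- data.frame(wt = c(2.5, 3.5))
predict(fit_pred, newdata = new_wt, type = "response")
```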

---

## Prediction (2)

- Let's see the 10 cases with the highest predicted risk of death:

```{r pred2, echo=TRUE, message=FALSE}
medpar %>% arrange(desc(preds)) %>% top_n(10) %>% knitr::kable(format = "html")
```

---
class: middle

# Exercise 4: Predicting from models

---
# Summary

- Correlation shows the direction and strength of association

- Regression allows us to quantify the relationships

--

- We can use a single, or multiple, predictors 
- Regression coefficients explain how much a change in $x$ affects $y$

--

- R<sup>2</sup> is a common measure of fit in linear models, C-statistic/AUC/ROC in logistic models

--

- Generalized Linear Models (`glm`) allow linear models on a transformed scale, e.g. logistic regression for binary variables
- Interaction terms allow us to examine predictors that affect each other

--

- Consider back-transforming GLM coefficients for interpretability (e.g. odds ratios)

--

- We can predict from our model objects, but must remember the link-function scale in `glm`

---
class: middle

# Exercise 5: Predicting 10-year CHD risk in Framingham data

---
class: middle center

```{r, echo=FALSE, out.width="70%", fig.cap="https://xkcd.com/552/"}
knitr::include_graphics("https://imgs.xkcd.com/comics/correlation.png")
```