-
Notifications
You must be signed in to change notification settings - Fork 1
/
19-gams.Rmd
159 lines (112 loc) · 6.28 KB
/
19-gams.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
```{r, setup-gams, include=FALSE}
require(mosaic)
require(ggformula)
theme_set(theme_bw(base_size=14))
knitr::opts_chunk$set(
tidy=FALSE, echo=TRUE,
size="small", message=FALSE, results='markup',
fig.width=6, fig.height=3.5, cache=FALSE)
```
# GAMs: Generalized Additive Models
So far, we have learned ways to model continuous, logical, and count response variables as functions of quantitative and categorical predictors. We started with linear models - where both the response and predictor variables are quantitative and the relationship between them is *linear*. What about nonlinear relationships?
So far, we have considered...
- **Categorical predictor variables**. Making use of indicator variables for (all but one of the) categories, we can model a situation where each value of the predictor variable has a different effect on the response. But...
- How many categories?
- What about periodicity?
- **GLMs**. In logistic, Poisson, etc. regression, the action of the link function results in a relationship between the predictors and the response variable that is linear on the scale of the link function ( = scale of the RHS of the equation), but non-linear on the scale of the response variables (LHS). But...
- Nonlinear, but **monotonic**
## Non-linear, non-monotonic relationships
It's not true that all interesting predictor-response relationships are linear or monotonic. One example is in some data on bat migration: how does the probability of bats leaving on their migratory journey depend on air humidity?
```{r}
bats <- read.csv('https://ndownloader.figshare.com/files/9348010')
```
```{r, include = FALSE}
# old url: 'http://rsbl.royalsocietypublishing.org/highwire/filestream/33982/field_highwire_adjunct_files/2/rsbl20170395supp3.csv')
```
```{r,fig.width=6.5, fig.height=2, fig.show = 'hold', echo = FALSE}
gf_boxplot(relativehumid ~ factor(migration), data = bats)
bats <- bats %>%
mutate(humid_bins = cut(bats$relativehumid, breaks=7))
bats2 <- bats %>%
group_by(humid_bins) %>%
summarize(migration_prop = prop(~migration == 1)) %>%
ungroup()
gf_point(migration_prop ~ humid_bins, data = bats2) %>%
gf_labs(x='Relative Humidity', y='Proportion Migrating')
```
Another dataset (our example for the day) -- ozone levels as a function of solar radiation (`rad`), temperature (`temp`) and wind speed (`wind`):
```{r, fig.width=2.2, fig.height=2, fig.show='hold'}
oz <- read.csv('http://geog.uoregon.edu/GeogR/data/csv/ozone.csv')
gf_point(ozone ~ rad, data=oz, alpha=0.4)
gf_point(ozone ~ temp, data=oz, alpha=0.4)
gf_point(ozone ~ wind, data=oz, alpha=0.4)
```
## Smooth terms
We can fit a model where the relationship between the response and the predictor is a ``smooth" -- no linearity or monotonicity requirement.
### Basis functions
- A smooth term is constructed as the sum of several parts, or *basis functions*. Each of these functions has a relatively simple shape, but scaled and added together, they can produce nearly any ``wiggly" shape.
- Increasing the dimension of the basis (more functions added together) can allow more wiggliness.
- Goal: allow enough wiggliness to fit the data well, without *overfitting* (smooth goes through every point in the data, or follows ``trends" that are spurious)
We will fit smooth models to data using the function `gam()` from the package `mgcv`. It includes many options for basis functions (types of smooths) - see `?mgcv::gam` or [https://rsconnect.calvin.edu:3939/content/28/] for details.
## Fitting GAMs
An excellent resource: <https://converged.yt/mgcv-workshop/>.
### Choosing model formulation
Which terms should be modelled as smooth terms? Explore the data!
- Pros:
- Cons:
### Model formula
Let's fit a simple GAM for the ozone data as a function of radiation, temperature and wind. Note the `s()` function for specifying a smooth, which takes as input:
- a variable name (or more than one, for advanced users)
- `k`
- `bs`
How do we choose? For some exploration, see: [https://rsconnect.calvin.edu:3939/content/28/].
We can also fit the model and smooths by different methods and with options:
- `method = 'GCV.Cp'`
- `method = 'REML'`
- `method = 'ML'`
- `select = TRUE` (or `FALSE`)
```{r, fig.show='hold', fig.width=2.2, fig.height=2.25, results='markup'}
require(mgcv)
oz.gam <- gam(ozone ~ s(rad, k=5, bs='tp') +
s(wind, k=5, bs='tp') +
s(temp, k=5, bs='tp'), data=oz,
method='REML', select=TRUE)
par(mar=c(4,4,2,2))
plot(oz.gam)
summary(oz.gam)
```
## Model Assessment
In addition to what you already know (...which all still holds, except linearity expectation!) `mgcv` has some nice model checking functions built in.
```{r, fig.show='hold', fig.width=5.5, fig.height=3.5, results='markup'}
par(mar=c(4,4,2,2))
gam.check(oz.gam)
```
### Concurvity
Like collinearity and multicollinearity, but for smooths...values of 0 indicate no problem, and 1 a huge problem (total lack of identifiability -- same information in multiple predictors).
Overall, does the model have problems with concurvity?
```{r}
concurvity(oz.gam, full=TRUE)
```
Or alternatively, which specific pairs of terms cause problems?
```{r}
concurvity(oz.gam, full=FALSE)
```
## Model Selection
### Shrinkage and Penalties
With GAMs, in a sense, some model selection is (or can be) done during model fitting - what smooth is best? Or is the relationship a line? A flat line? Using *shrinkage* basis or including `select=TRUE` allows for this.
### P-value selection
Cautions: **p-values are approximate!** Successfulness of the procedure best when fitting method is: ML (1st choice), REML (2nd choice).
```{r, results = 'hide'}
anova(oz.gam)
```
Interpretation as usual. Note that `anova()` (not `Anova()`) works here - especially for GAMs, it does *not* do sequential tests; and `Anova()` doesn't handle GAMs.
### Information criteria
- Conditional/Approximate - bias
- Fitting method:
- REML-based IC scores can be used to compare models with different *random effects* but **not** different predictors. (IF `select = TRUE` and using a shrinkage basis.)
- ML-based IC scores can be used to compare models with different fixed effects (regular predictors) and different `family`, but **not** different random effects
```{r}
require(MuMIn)
oz.ml <- update(oz.gam, method='ML', na.action='na.fail')
head(dredge(oz.ml, rank='AIC'),2)
```