-
Notifications
You must be signed in to change notification settings - Fork 12
/
3-umap.Rmd
153 lines (104 loc) · 4.51 KB
/
3-umap.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
---
title: "UMAP: Uniform Manifold Approximation and Projection"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Highlights
* UMAP is a nonlinear dimensionality reduction algorithm intended for data visualization.
* It seeks to capture local distances (nearby points stay together) rather than global distances (all points transformed in the same way, as in principal component analysis).
* It is inspired by **t-SNE** but arguably is preferable due to favorable properties: more scalable performance and better capture of local structure.
* It originates from topology, and discovers **manifolds**: nonlinear surfaces that connect nearby points.
## Key assumptions
* All data are locally connected. That is, the data distribution can be approximately by smoothing between observations with similar covariate profiles.
## Data prep
We're trying out a separate birth weight dataset.
```{r data_prep}
data = MASS::birthwt
summary(data)
?MASS::birthwt
data$race = factor(data$race, labels = c("white", "black", "other"))
str(data)
# Create a list to hold different variables.
task = list(
# Birth weight or low are generally our outcomes for supervised analyses.
outcomes = c("bwt", "low"),
# Variables we want to exclude from our analysis - none currently.
exclude = NULL
)
task$covariates = setdiff(names(data), task$outcomes)
# Review our data structure.
task
dplyr::glimpse(data[task$covariates])
sapply(data[task$covariates], class)
# Convert factor to indicators - don't run on outcomes though.
result = ck37r::factors_to_indicators(data, predictors = task$covariates, verbose = TRUE)
data = result$data
```
## Basic UMAP
```{r basic_umap}
library(umap)
# Conduct UMAP analysis of our matrix data, setting a random seed.
result = umap(data, random_state = 1)
```
## Plot UMAP
```{r umap_plot}
library(ggplot2)
# Compile results into a dataframe.
plot_data = data.frame(x = result$layout[, 1],
y = result$layout[, 2],
data[, task$outcomes])
# Create an initial plot object.
p = ggplot(data = plot_data, aes(x = x, y = y, color = bwt)) +
theme_minimal()
p + geom_point() + ggtitle("Continuous birth weight")
# Compare to the binarized outcome.
p + geom_point(aes(color = low)) + ggtitle("Low birth weight = 1")
```
## Hyperparameters
The most important hyperparameter is the number of neighbors, which controls how UMAP balances the detection of local versus global structure. With a low number of neighbors we concentrate on the local structure (nearby observations), whereas with a high number of neighbors global structure (patterns between all observations).
```{r modify_hyperparameters}
# Review default settings.
umap.defaults
config = umap.defaults
# Set a seed.
config$random_state = 1
config$n_neighbors = 30
# Run umap with new settings.
result2 = umap(data, config)
p + geom_point(aes(x = result2$layout[, 1],
y = result2$layout[, 2]))
# Try even more neighbors.
config$n_neighbors = 60
result3 = umap(data, config)
p + geom_point(aes(x = result3$layout[, 1],
y = result3$layout[, 2]))
```
## Additional hyperparameters
Noting a few other highlights:
* **min_dist** - how tightly UMAP can pack points together in the embedded space, default 0.1.
* **n_components** - dimensions to return, typically 2 but can be larger.
* **metric** - distance metric, such as euclidean, manhattan, cosine, etc.
* **n_epochs** - number of optimization iterations to perform, default 200.
## Challenge
* Experiment with min_dist - what value strikes the right balance for visualization?
```{r review_defaults}
# Review the many other hyperparameters.
umap.defaults
help(umap.defaults)
```
Also note that when calling umap, the function argument `method` can be changed to "umap-learn" to use the python package.
More info on hyperparameters on the [umap-learn python documentation page](https://umap-learn.readthedocs.io/en/latest/parameters.html).
```{r check_help, eval=FALSE}
?umap
```
## Challenge
* Try out on heart dataset
## Resources
* [Performance comparison of UMAP vs. tSNE, PCA, etc.](https://umap-learn.readthedocs.io/en/latest/benchmarking.html)
* Arxiv paper: (https://arxiv.org/abs/1802.03426)
* PyData 2018 talk (PCA, tSNE, and UMAP): (https://www.youtube.com/watch?v=YPJQydzTLwQ)
* PyCon 2018 talk: (https://www.youtube.com/watch?v=nq6iPZVUxZU)
## References
McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.