# ToMATo
Given a point cloud $X$, how can topology help us create clusters?
Let's understand the ToMATo algorithm (see the original paper [@tomato]) with an example and pretend we are actually inventing it. How fun!
## Inspiration
Let's load some packages first
```{julia}
using ToMATo
import GeometricDatasets as gd
using AlgebraOfGraphics, GLMakie;
using Random;
```
Suppose $X$ is the following dataset:
```{julia}
seed = MersenneTwister(0) # setting some seed
X = hcat(
    randn(seed, 2, 800)
    , randn(seed, 2, 800) .* 0.8 .+ (4, -4)
    , randn(seed, 2, 100) .* 0.3 .- (2, -2)
)
df = (x1 = X[1, :], x2 = X[2, :])
plt = data(df) * mapping(:x1, :x2)
draw(plt)
```
We can see two big clusters and possibly a small one.
Let's make a not-so-absurd supposition that our data was generated by some normal distributions (in our case, it really was; in general this can be false). Then, intuitively, the mean (or the *center*) of each distribution is the point with the highest density, i.e., the point with the most points close to it.
### Density
We can estimate the density of a dataset as follows
```{julia}
ds = gd.density_estimation(X, h = 0.5)
df = (x1 = X[1, :], x2 = X[2, :], ds = ds)
plt = data(df) * mapping(:x1, :x2, color = :ds)
draw(plt)
```
The precise definition of the density of a point $x \in X$ is
$$
\text{dens}(x) = \dfrac{1}{|X|} \cdot \sum_{y \in X} \text{exp} \left(-\dfrac{d(x, y)^2}{h^2} \right)
$$
where $\text{exp}(x) = e^x$. Informally, we want points that are close to many other points (i.e., at a small distance from them) to have a high density; this is why we compute the distance from $x$ to each $y$ and put its square, with a minus sign, inside an exponential. We then take the mean of all these values.
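To make the formula concrete, here is a minimal sketch of this estimator written directly from the definition above (a hypothetical `gaussian_density`, not the `GeometricDatasets` implementation):

```julia
# Gaussian density estimate: for each point x, average exp(-d(x, y)^2 / h^2)
# over all points y in the dataset. X is a d × n matrix (columns are points).
function gaussian_density(X::AbstractMatrix, h::Real)
    n = size(X, 2)
    dens = zeros(n)
    for i in 1:n
        s = 0.0
        for j in 1:n
            d2 = sum(abs2, X[:, i] .- X[:, j])  # squared Euclidean distance
            s += exp(-d2 / h^2)
        end
        dens[i] = s / n  # mean over all points of X
    end
    return dens
end
```

A point surrounded by many close neighbors accumulates terms near $\text{exp}(0) = 1$, so its density is high; an isolated point contributes mostly near-zero terms.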
Now we put this density estimation on another axis and plot it
```{julia}
axis = (type = Axis3, width = 800, height = 450)
df = (x1 = X[1, :], x2 = X[2, :], ds = ds)
plt = data(df) * mapping(:x1, :x2, :ds, color = :ds)
draw(plt; axis = axis)
```
Now we have mountains. Next, we will describe a sort of "mountain climbing" algorithm that assigns each point to its peak, following the path the point would take to climb the mountain the fastest way.
## Climbing mount topology
The idea now is that given a point $x$:
- if $x$ is the highest point in its corresponding mountain, then it is a new cluster;
- otherwise, we seek the highest neighbor of $x$, say $x'$, and declare that the cluster of $x$ is the same as that of $x'$.
To do that, we need to define a notion of "neighborhood" in this dataset. The easiest way is to define a graph whose vertex set is $X$ and edges connect neighbor points. Fortunately, the `ToMATo` package has a function that does exactly that. You are welcome!
```{julia}
g = proximity_graph(
    X
    , epsilon_ball_or_knn(0.5, min_ball_points = 4, max_ball_points = 6, knn_points = 3)
)
```
The graph $g$ above is constructed as follows: given $x \in X$, we create a ball of radius $0.5$ around $x$ and do the following:
- Are there fewer than 4 points in the ball? If yes, we connect $x$ to its 3 nearest neighbors; if no, we connect $x$ to its (at most) 6 closest points in the ball. In short: if the ball does not contain the amount of points we stipulated, we fall back to a knn search.
These numbers are, of course, arbitrary and can be changed at will.
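Before running the package, we can sketch the climbing step itself. The code below is a toy version under hypothetical names (`climb` is ours, not the package's; the real work happens inside `tomato`), taking plain adjacency lists and the density vector:

```julia
# For each point, find its highest neighbor (by density). Points that are
# local maxima point to themselves and seed a new cluster; every other
# point inherits the cluster of the peak it reaches by repeatedly
# stepping to its highest neighbor.
function climb(neighbors::Vector{Vector{Int}}, dens::Vector{Float64})
    n = length(dens)
    parent = collect(1:n)
    for i in 1:n
        higher = filter(j -> dens[j] > dens[i], neighbors[i])
        isempty(higher) || (parent[i] = higher[argmax(dens[higher])])
    end
    # follow parents until a local maximum (a root) is reached
    root(i) = parent[i] == i ? i : root(parent[i])
    return [root(i) for i in 1:n]
end
```

On a chain of points with a single peak, every point climbs to that peak and we get one cluster; each extra local maximum in the density yields an extra cluster, which is exactly the problem we are about to see.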
Let's see the result of our algorithm:
```{julia}
X2 = vcat(X, ds')
clusters, _ = tomato(X, g, ds, 0)
fig, ax, plt = graph_plot(X2, g, clusters .|> string)
fig
```
Look how many clusters! Something is rotten in the state of Denmark...
## The Comedy of Errors
This ~~tragedy~~ comedy happened because we did not take the "false peaks" into account: peaks that are just a little bump, not a real summit. To merge these small peaks into the big ones, we need to add the following step:
Let $\tau$ be a number that controls how small (in height) a bump must be to be merged (we will show how to choose $\tau$ below). Given $x \in X$, let $N$ be the set of its neighbors higher than $x$. Denote by $x_{max}$ the highest point in $N$, and by $c_{max}$ its cluster. Now, for each $y \in N$, ask the following:
- Is the difference in height between $x$ and $y$ less than $\tau$? If yes, we merge the cluster of $y$ with the cluster of $x_{max}$ (that is: $c_{max}$). Otherwise, do nothing.
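Putting the climb and the merging rule together, a simplified version of the whole pass could look like this (again with hypothetical names; the actual `tomato` implementation may differ, e.g. in how it picks the dominant cluster):

```julia
# Simplified ToMATo pass: visit points from highest to lowest density.
# A point with no higher neighbor starts a new cluster; otherwise it joins
# the cluster of its highest neighbor, and neighboring clusters whose peak
# rises less than tau above the current point are merged into it.
function tomato_sketch(neighbors, dens, tau)
    n = length(dens)
    parent = collect(1:n)                              # union-find parents
    find(i) = parent[i] == i ? i : (parent[i] = find(parent[i]))
    for i in sortperm(dens, rev = true)                # highest density first
        higher = filter(j -> dens[j] > dens[i], neighbors[i])
        isempty(higher) && continue                    # local peak: new cluster
        best = find(higher[argmax(dens[higher])])      # cluster of highest neighbor
        parent[i] = best
        for j in higher                                # merge shallow nearby peaks
            r = find(j)
            if r != best && dens[r] < dens[i] + tau
                parent[r] = best
            end
        end
    end
    return [find(i) for i in 1:n]
end
```

With a tiny $\tau$ the small bumps survive as their own clusters; raising $\tau$ past a bump's height makes it merge into the dominant peak, which is what the parameter is for.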
```{julia}
τ = 0.02
clusters, _ = tomato(X, g, ds, τ, max_cluster_height = τ)
X2 = vcat(X, ds')
fig, ax, plt = graph_plot(X2, g, clusters .|> string)
fig
```
Now we got something!
But how did we calculate this magic $\tau$? Should we just go on guessing real numbers? I hope not!
We usually run the ToMATo algorithm twice. The first time, we set $\tau = \infty$ and watch how the mountains of $X$ merged: we plot how long each of these mountains survived.
```{julia}
_, births_and_deaths = tomato(X, g, ds, Inf)
plot_births_and_deaths(births_and_deaths)
```
Choosing a bigger $\tau$ (say, $0.04$) will also merge the small clusters on the left:
```{julia}
τ = 0.04
clusters, _ = tomato(X, g, ds, τ, max_cluster_height = τ)
X2 = vcat(X, ds')
fig, ax, plt = graph_plot(X2, g, clusters .|> string)
fig
```
At the end of the day, you will still need to decide how high a bump can be before it gets merged.
## The original algorithm
After all this talk, maybe the original algorithm can be better understood!
Here is what we can read in [@tomato]:
![The original ToMATo algorithm. Here, $G$ is the proximity graph, $\tilde{f}$ is the density vector, $\tau$ is like above, $\mathcal{N}$ is the neighborhood of $x$ with points higher than $x$ (i.e., with higher density), and $r$ stores the cluster of each point](images/tomato-algorithm.png)
## Summing up
The ToMATo algorithm needs the following ingredients:
- A metric space $X$;
- A density function $f$;
- A graph $g$ whose vertex set is $X$ and whose edges mean "these two points are neighbors";
- A parameter $\tau$ (which can be chosen as above) that defines how high a peak can be and still count as a "false peak", to be merged into the biggest peak close to it.
There are many natural choices for $f$. The Gaussian density we used has a parameter $h$ that must be defined. Another nice choice is the *distance to measure* (as seen in [@tomato]): it computes the distance between $x$ and its $n$ nearest neighbors, where $n$ needs to be defined.
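As a sketch of that idea (hypothetical code, not the implementation from [@tomato]): for each point we average the distance to its $n$ nearest neighbors. Note this is a *distance*, so small values mean dense regions; to use it in place of a density you would flip its sign.

```julia
# Distance to measure: for each column of X, the mean Euclidean distance
# to its n nearest neighbors (excluding the point itself). Small values
# indicate dense regions.
function distance_to_measure(X::AbstractMatrix, n::Int)
    m = size(X, 2)
    dtm = zeros(m)
    for i in 1:m
        dists = [sqrt(sum(abs2, X[:, i] .- X[:, j])) for j in 1:m if j != i]
        dtm[i] = sum(sort(dists)[1:n]) / n
    end
    return dtm
end
```

Unlike the Gaussian density, there is no bandwidth $h$ to tune here; the trade-off is choosing $n$ instead.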
Choosing $g$ is tricky: which notion of "neighborhood of a point" in $X$ should you allow? In the example above, I chose a method that tries to take the points in an $\epsilon$-ball and, in case there are too few, takes some of the nearest points instead.