Skip to content

Commit

Permalink
Update documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
arturtoshev committed Oct 30, 2023
1 parent ce18638 commit ab17012
Show file tree
Hide file tree
Showing 5 changed files with 119 additions and 42 deletions.
Binary file added _images/metropolis_hastings.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
76 changes: 57 additions & 19 deletions _sources/lecture/cc-1-2-gmm-mcmc.md
Original file line number Diff line number Diff line change
Expand Up @@ -167,13 +167,19 @@ Show that the Bernoulli distribution is a member of the exponential family

## Gaussian Mixture Models

Assume that we have a set of measurements $\{x^{(1)}, \dots x^{(m)}\}$. This is one of the few unsupervised learning examples in this lecture, thus, we do not know the true labels.
Assume that we have a set of measurements $\{x^{(1)}, \dots x^{(m)}\}$. This is one of the few unsupervised learning examples in this lecture, thus, we do not know the true labels $y$.

Gaussian Mixture Models (GMMs) assume that the data comes from a mixture of $K$ Gaussian distributions in the form

$$p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x|\mu_k, \Sigma_k),$$ (gmm_model)

with $\pi_k$ called mixing coefficients. We define a K-dimensional r.v. $z$ which satisfies $z\in \{0,1\}$ and $\sum_k z_k=1$ (i.e. with only one of its dimensions being 1, while all others are 0), such that $z_k~\sim \text{Categorical}(\pi_k)$ and $p(z_k=1) = \pi_k$ . For Eq. {eq}`gmm_model` to be valid, the parameters $\{\pi_k\}$ must satisfy $0\le\pi_k\le 1$ and $\sum_k \pi_k=1$.
with

- $\pi = (\pi_1,...,\pi_k)$ called mixing coefficients, or cluster probabilities,
- $\mu = (\mu_1,...,\mu_k)$ the cluster means, and
- $\Sigma = (\Sigma_1,...,\Sigma_k)$ the cluster covariance matrices.

We define a K-dimensional r.v. $z$ which satisfies $z\in \{0,1\}$ and $\sum_k z_k=1$ (i.e. with only one of its dimensions being 1, while all others are 0), such that $z_k~\sim \text{Multinomial}(\pi_k)$ and $p(z_k=1) = \pi_k$ . For Eq. {eq}`gmm_model` to be a valid probability density, the parameters $\{\pi_k\}$ must satisfy $0\le\pi_k\le 1$ and $\sum_k \pi_k=1$.

The marginal distribution of $z$ can be equivalently written as

Expand All @@ -193,7 +199,7 @@ p(x) &= \sum_z p(x,z) \\
\end{aligned}
$$ (gmm_marginalization)
Thus, the unknown parameters are $\{\pi_k, \mu_k, \Sigma_k\}_{k=1:K}$. We can write the maximum likelihood of the data as
Thus, the unknown parameters are $\{\pi_k, \mu_k, \Sigma_k\}_{k=1:K}$. We can write the log likelihood of the data as
$$
\begin{aligned}
Expand All @@ -205,7 +211,6 @@ However, if we try to analytically solve this problem, we will see that there is

### Expectation-Maximization


```{figure} ../imgs/cc1/em_algorithm.png
---
width: 600px
Expand All @@ -219,8 +224,8 @@ There is an iterative algorithms that can solve the maximum likelihood problem b

0. Guess the number of modes $K$
1. Randomly initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$, and evaluate the likelihood
2. **(E-step)**. Evaluate the weights
$$w_k^{(i)} := p(z^{(i)}=k| x^{(i)}, \pi, \mu, \Sigma)$$ (gmm_e_step)
2. **(E-step)**. Evaluate $\omega_k^{(i)}$ assuming constant $\pi, \mu, \Sigma$ (see expression after the algorithm)
$$w_k^{(i)} := p(z^{(i)}=k| x^{(i)}, \pi, \mu, \Sigma).$$ (gmm_e_step)
3. **(M-step)**. Update the parameters by solving the maximum likelihood probelms for fixed $z_k$ values.
$$\begin{aligned}
\pi_k &:= \frac{1}{m}\sum_{i=1}^m w_k^{(i)} \\
Expand All @@ -240,16 +245,40 @@ p(z^{(i)}=k| x^{(i)},\pi,\mu,\Sigma) &= \frac{p(x^{(i)}|z^{(i)}=k, \mu, \Sigma)p
&= \frac{\pi_k \mathcal{N}(x^{(i)}|\mu_K, \Sigma_k)}{\sum_{l=1}^K \pi_l \mathcal{N}(x^{(i)}|\mu_l, \Sigma_l)}
\end{aligned}$$ (gmm_responsibilities)

The values of $p(x^{(i)}|z^{(i)}=k, \mu, \Sigma)$ can be computed by evaluating the $k$th Gaussian with parameters $\mu_k$ and $\Sigma_k$. And $p(z^{(i)}=k,\pi)$ is just $\pi_k$.
The values of $p(x^{(i)}|z^{(i)}=k, \mu, \Sigma)$ can be computed by evaluating the $k$th Gaussian with parameters $\mu_k$ and $\Sigma_k$. And $p(z^{(i)}=k,\pi)$ is just $\pi_k$.

**Exercise: derive the M-step update equations following the maximum likelihood approach.**

> Hint: look at [Bishop, 2006](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf), Section 9.2.
### Applications and Limitations

Once we have fitted a GMM on $p(x)$, we can use it for:

1. Density estimation: by evaluate the probability $p(\tilde{x})$ of any new point $\tilde{x}$, we can say how probable it is that this point comes from the same distribution as the training data.
2. Clustering: so far we have talked about density estimation, but GMMs are typically used for clustering. Given a new query point $\tilde{x}$, we can evaluate each of the $K$ Gaussians and scale their probability by the respective $\pi_k$. These will be the probabilities of $\tilde{x}$ to be part of cluster $k$.

Most limitations of this approach arive from the assumption that the indivudual clusters follow the Gaussian distribution:

- If the data does not follow a Gaussian distribution, e.g. heavy-tailed ditribution with outliers, then too much weight will be given to the outliers
- If there is an outlier, eventually one mode will focus only on this one data point. But if a Gaussian describes only one data point, then its variance will be zero and we recover a singularity/Dirac function.
- The choice of $K$ is crucial and this parameters need to be optimized in a outer loop.
- GMMs do now scale well to high dimensions.


## Sampling Methods

The opposite process to density estimation is sampling. Here, we will look at the general case of an unnormalized ($\int_{\Omega}p(\cdot)\ne 1$) *target distribution*

$$p(y,\theta) = g(\theta)f(y|\theta).$$ (target_density)

The task we will be trying to solve here is how to generate samples from this distribution. How these samples are useful and why we choose this particular form of a target distribution will be the topic of next week's lecture.

### Acceptance-Rejection Sampling

Acceptance-rejection sampling draws its random samples directly from the target posterior distribution, as we only have access to the unscaled target distribution initially we will have to draw from the unscaled target. _The acceptance-rejection algorithm is specially made for this scenario._ The acceptance-rejection algorithm draws random samples from an easier-to-sample starting distribution and then successively reshapes its distribution by only selectively accepting candidate values into the final sample. For this approach to work the candidate distribution $g_{0}(\theta)$ has to dominate the posterior distribution, i.e. there must exist an $M$ s.t.
Acceptance-rejection sampling draws its random samples directly from the target posterior distribution, as we only have access to the unscaled target distribution initially we will have to draw from the unscaled target. _The acceptance-rejection algorithm is specially made for this scenario._ The acceptance-rejection algorithm draws random samples from an easier-to-sample starting distribution and then successively reshapes its distribution by only selectively accepting candidate values into the final sample. For this approach to work the *candidate distribution* $g_{0}(\theta)$ has to dominate the posterior distribution, i.e. there must exist an $M$ s.t.

$$M \times g_{0}(\theta) \geq g(\theta) f(y|\theta), \quad \forall \theta$$
$$M \times g_{0}(\theta) \geq g(\theta) f(y|\theta), \quad \forall \theta$$ (acceptance_rejection)

Taking an example candidate density for an unscaled target as an example to show the "dominance" of the candidate distribution over the posterior distribution.

Expand All @@ -259,7 +288,7 @@ width: 450px
align: center
name: acceptance_rejection
---
Acceptance-rejection algorithm (Source: Bolstad, 2009).
Acceptance-rejection candidate and target distributions (Source: Bolstad, 2009).
```

To then apply acceptance-rejection sampling to the posterior distribution we can write out the algorithm as follows:
Expand All @@ -269,7 +298,7 @@ To then apply acceptance-rejection sampling to the posterior distribution we can
3. Calculate the candidate density at each random sample, and multiply by $M$.
4. Compute the weights for each random sample

$$ w_{i} = \frac{g(\theta_{i}) \times f(y_{1}, \ldots, y_{n}| \theta_{i})}{M \times g_{0}(\theta_{i})}$$
$$ w_{i} = \frac{g(\theta_{i}) \times f(y_{1}, \ldots, y_{n}| \theta_{i})}{M \times g_{0}(\theta_{i})}$$ (acceptance_rejection_weights)

5. Draw $N$ samples from the $U(0, 1)$ uniform distribution.
6. If $u_{i} < w_{i}$ accept $\theta_{i}$
Expand All @@ -279,7 +308,7 @@ $$ w_{i} = \frac{g(\theta_{i}) \times f(y_{1}, \ldots, y_{n}| \theta_{i})}{M \ti

The sampling-importance-resampling algorithm is a two-stage extension of the acceptance-rejection sampling which has an improved weight-calculation, but most importantly employs a _resampling_ step. This resampling step resamples from the space of parameters. The weight is then calculated as

$$w_{i} = \frac{\frac{g(\theta_{i})f(y_{1}, \ldots\ y_{n} | \theta_{i})}{g_{0}(\theta_{i})}}{\left( \sum_{i=1}^{N} \frac{g(\theta_{i})f(y_{1}, \ldots, y_{n}| \theta_{i})}{g_{0}(\theta_{i})} \right)} $$
$$w_{i} = \frac{\frac{g(\theta_{i})f(y_{1}, \ldots\ y_{n} | \theta_{i})}{g_{0}(\theta_{i})}}{\left( \sum_{i=1}^{N} \frac{g(\theta_{i})f(y_{1}, \ldots, y_{n}| \theta_{i})}{g_{0}(\theta_{i})} \right)} $$ (sampling_importance_resampling_weights)

The algorithm to sample from the posterior distribution is then:

Expand All @@ -288,11 +317,11 @@ The algorithm to sample from the posterior distribution is then:
3. Calculate the starting density at each random sample, and multiply by $M$.
4. Calculate the ratio of the unscaled posterior to the starting distribution

$$r_{i} = \frac{g(\theta_{i})f(y_{1}, \ldots, y_{n}| \theta_{i})}{g_{0}(\theta_{i})}$$
$$r_{i} = \frac{g(\theta_{i})f(y_{1}, \ldots, y_{n}| \theta_{i})}{g_{0}(\theta_{i})}$$ (sampling_importance_resampling_weights_r)

5. Calculate the importance weights

$$w_{i} = \frac{r_{i}}{\sum r_{i}}$$
$$w_{i} = \frac{r_{i}}{\sum r_{i}}$$ (sampling_importance_resampling_weights_w)

6. Draw $n \leq 0.1 \times N$ random samples with the sampling probabilities given by the importance weights.

Expand All @@ -311,36 +340,45 @@ width: 450px
align: center
name: adaptive_rejection_sapling
---
Adaptive rejection sampling (Source: Bolstad, 2009).
(Not) log-concave function (Source: Bolstad, 2009).
```

Using the tangent method our algorithm then takes the following form:
Using the tangent method, our algorithm then takes the following form:

1. Construct an upper bound from piecewise exponential functions, which dominate the log-concave unscaled posterior
2. With the envelope giving us the initial candidate density we draw $N$ random samples
3. Rejection sampling, see the preceding two subsections for details.
4. If rejected, add another exponential piece which is tangent to the target density.

As all three presented sampling approaches have their limitations, practitioners tend to rely more on Markov chain Monte Carlo methods such as Gibbs sampling, and Metropolis-Hastings.
> As all three presented sampling approaches have their limitations, practitioners often rely more on Markov chain Monte Carlo methods such as Gibbs sampling, and Metropolis-Hastings.
### Markov Chain Monte Carlo

The idea of Markov Chain Monte Carlo (MCMC) is to construct an ergodic Markov chain of samples $\{\theta^0, \theta^1, ...,\theta^N\}$ distributed according to the posterior distribution $g(\theta|y)$. This chain evolves according to a transition kernel given by $q(x_{next}|x_{current})$. Let's look at one of the most popular MCMC algorithms: Metropolis Hastings
The idea of Markov Chain Monte Carlo (MCMC) is to construct an ergodic Markov chain of samples $\{\theta^0, \theta^1, ...,\theta^N\}$ distributed according to the posterior distribution $g(\theta|y) \propto g(\theta)f(y|\theta)$. This chain evolves according to a transition kernel given by $q(x_{next}|x_{current})$. Let's look at one of the most popular MCMC algorithms: Metropolis Hastings

**Metropolis-Hastings**

The general Metropolis-Hastings prescribes a rule which guarantees that the constructed chain is representative of the target distribution $g(\theta|y)$. This is done by following the algorithm:

0. Start at an initial point $\theta_{current} = \theta^0$.
1. Sample $\theta' \sim q(\theta_{next}|\theta_{current})$
2. Compute
2. Compute the ecceptance probability
$$\alpha = min \left\{ 1, \frac{g(\theta'|y) q(\theta_{current}|\theta')}{g(\theta_{current}|y) q(\theta'|\theta_{current})} \right\}$$
3. Sample $u\sim \text{Uniform}(0,1)$
4. If $\alpha > u$, then $\theta_{current} = \theta'$, else $\theta_{current} = \theta_{current}$
5. Repeat $N$ times from step 1.

A special choice of $q(\cdot | \cdot)$ is for example the normal distribution $\mathcal{N}(\cdot | \theta_{current}, \sigma^2)$, which results in the so-called Random Walk Metropolis algorithm. Other special cases include the Metropolis-Adjusted Langevin Algorithm (MALA), as well as the Hamiltonian Monte Carlo (HMC) algorithm.

```{figure} ../imgs/cc1/metropolis_hastings.png
---
width: 450px
align: center
name: metropolis_hastings
---
Metropolis-Hastings trajectory (Source: [relguzman.blogpost.com](https://relguzman.blogspot.com/2018/04/sampling-metropolis-hastings.html)).
```

---

In summary:
Expand Down
Loading

0 comments on commit ab17012

Please sign in to comment.