Flow models don't model the probability distribution
The fundamental insight is that if you have the pdf $p(x)$ of a random variable, then its cdf maps that variable to a uniform distribution on $[0, 1]$.
Thus, instead of getting the cdf as a byproduct of training on the pdf, we can
directly learn the cdf. Note that if $z = f_\theta(x)$ is an invertible and differentiable mapping to a base distribution $p_z$, the change of variables formula gives the density of the data:

$$p_\theta(x) = p_z\big(f_\theta(x)\big)\,\left|\det \frac{\partial f_\theta(x)}{\partial x}\right|$$
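As a quick sanity check of this insight, the snippet below (a standalone sketch, not part of this repository) pushes samples from a 1-D Gaussian through that Gaussian's own cdf and verifies that the result is roughly uniform on $[0, 1]$:

```python
import torch

# If x ~ p(x), then F(x) ~ Uniform(0, 1), where F is the cdf of p.
base = torch.distributions.Normal(loc=2.0, scale=0.5)  # plays the role of the data pdf
x = base.sample((10_000,))                             # samples from p(x)
u = base.cdf(x)                                        # push the samples through the cdf
print(u.min().item(), u.max().item())                  # all values lie in (0, 1)
print(u.histc(bins=10, min=0.0, max=1.0))              # bin counts are roughly equal
```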
To train the parameters of a flow model we will be optimizing the expected
log-likelihood of the data:

$$\mathbb{E}_x\big[\log p_\theta(x)\big] = \mathbb{E}_x\left[\log p_z\big(f_\theta(x)\big) + \log\left|\det \frac{\partial f_\theta(x)}{\partial x}\right|\right]$$

Note that in order for this formula to be valid our function $f_\theta$ has to be invertible and differentiable.
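In code the objective looks roughly as follows; this is a sketch with assumed names (`flow` stands for any invertible module that returns both $f_\theta(x)$ and the log-determinant of its Jacobian), not the exact API of this repository:

```python
import torch

def nll_loss(flow, x):
    """Negative expected log-likelihood via the change of variables formula."""
    z, log_det = flow(x)                    # z = f_theta(x) and log|det df_theta/dx|
    prior = torch.distributions.Normal(0.0, 1.0)
    log_pz = prior.log_prob(z).sum(dim=1)   # log p_z(f_theta(x)), summed over dimensions
    return -(log_pz + log_det).mean()       # average over the batch, negated for minimization
```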
In order to optimize the model we have to compute the determinant of the
Jacobian at every step of the training process. Thus, we need an architecture
that allows this determinant to be computed efficiently:
- The simplest idea would be to have our flow model act independently over different dimensions of the input: $z_i = f_{\theta_i}(x_i)$, where each dimension is transformed by its own invertible function. The Jacobian is then diagonal, so its determinant is simply the product of the diagonal entries.
- The second idea would be to use an auto-regressive architecture (e.g. PixelCNN). This type of flow causes the Jacobian to be lower-triangular, since $z_i = f_\theta(x_i;\, x_{<i})$ depends only on $x_{\le i}$ and therefore $\partial z_i / \partial x_j = 0$ for $j > i$. And again the determinant is calculated by multiplying the diagonal entries.
- We can think of the auto-regressive architecture as corresponding to a full
Bayes net: every variable $x_i$ is transformed conditioned on all the previous variables $x_{<i}$. We could however design a partial Bayes net:
  - half of the variables are transformed independently,
  - the other half are transformed conditioned on the first half.

With this approach we again arrive at a lower-triangular Jacobian matrix and can calculate the determinant by multiplying the diagonal entries, as the sketch below illustrates.
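The snippet below is a toy numerical check (illustrative only, not repository code) that such a coupling-style transform indeed has a lower-triangular Jacobian whose determinant equals the product of its diagonal entries:

```python
import torch
from torch.autograd.functional import jacobian

def coupling(x):
    x1, x2 = x[:2], x[2:]        # the first half passes through unchanged
    s = torch.tanh(x1).sum()     # toy "conditioner" computed from the first half
    return torch.cat([x1, x2 * torch.exp(s)])

J = jacobian(coupling, torch.randn(4))
print(J)                                             # zeros above the diagonal
print(torch.det(J).item(), J.diag().prod().item())   # determinant == product of diagonal
```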
The Bayes net structure defines only the coupling dependency architecture that
the model uses. The next question is what invertible transformation to apply to each variable.
The most common choice is an affine transformation, i.e. scaling and translating the input: it is trivially invertible and its log-Jacobian-determinant is just the sum of the log-scales.
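Putting the coupling structure and the affine transformation together gives a layer like the one sketched below. The class name, hidden size and the `tanh` on the scales are illustrative choices, not necessarily what this repository's RealNVP implementation uses:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Keep the first half of the input fixed; scale and translate the second half."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        # Conditioner network: predicts log-scale s and translation t for the second half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                      # keep the scales in a stable range
        z2 = x2 * torch.exp(s) + t             # affine transform of the second half
        log_det = s.sum(dim=1)                 # log|det J| is the sum of the log-scales
        return torch.cat([x1, z2], dim=1), log_det

    def inverse(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(z1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (z2 - t) * torch.exp(-s)          # the affine map inverts in closed form
        return torch.cat([z1, x2], dim=1)
```

Stacking several such layers while alternating which half is transformed ensures that every dimension eventually gets updated.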
Flow models are designed to work on continuous data and cannot learn meaningfully if trained directly on discrete data. For a given data point the model would try to increase the likelihood of that specific point without putting any density on its vicinity; the model could therefore place infinitely high density on the observed points and the likelihood would diverge.
De-quantization transforms discrete data into continuous data by adding
uniform noise $u \sim U[0, 1)$ to each integer value and mapping the result to the unit interval $[0, 1)$.
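For 8-bit images a typical implementation looks like the following sketch; the exact scaling and any follow-up transform used in this repository may differ:

```python
import torch

def dequantize(x):
    x = x.float()               # integer pixel values in {0, ..., 255}
    x = x + torch.rand_like(x)  # add noise u ~ U[0, 1) independently per pixel
    return x / 256.0            # map the result into the interval [0, 1)
```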
Simple flows can be composed in order to produce a more complex flow and increase the expressivity of the model.
The log probability then becomes:

$$\log p_\theta(x) = \log p_z\big(f_k \circ \dots \circ f_1(x)\big) + \sum_{i=1}^{k} \log\left|\det \frac{\partial f_i}{\partial f_{i-1}}\right|,$$

where $f_0$ denotes the input $x$.
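Assuming each layer follows the interface of the `AffineCoupling` sketch above (returning the transformed tensor together with its log-determinant), composing flows simply amounts to summing the per-layer log-determinants:

```python
import torch
import torch.nn as nn

class ComposedFlow(nn.Module):
    """Chain several flow layers; their log-determinants add up."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        log_det = x.new_zeros(x.shape[0])
        for layer in self.layers:      # apply f_1, ..., f_k in order
            x, ld = layer(x)
            log_det = log_det + ld     # accumulate log|det| of every layer
        return x, log_det
```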
To reduce the computational cost of large models composed of multiple flows we
could drop some of the dimensions of the input as the flow progresses. In the case of images, we could
remove some of the pixels without losing the semantic information of the image.
After the first few flow steps, half of the dimensions are factored out and passed directly to the prior, while the remaining dimensions continue through the rest of the flow.
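One way to realize this, sketched below under the same assumed interface as the earlier snippets, is to split off half of the dimensions after the first stack of layers and run only the remaining half through the rest of the flow; the exact squeezing and splitting scheme in this repository may differ:

```python
import torch

def multiscale_forward(first_flows, later_flows, x):
    x, log_det = first_flows(x)       # full-width flow steps
    z_out, x = x.chunk(2, dim=1)      # factor out half of the dimensions
    x, ld = later_flows(x)            # the remaining steps run at half the width
    return torch.cat([z_out, x], dim=1), log_det + ld
```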
Hyper-parameters for training the model on two different datasets are provided:
- CIFAR10
- CelebA cropped to 32x32
To train the model run:
```
python3 run.py --seed 0 --lr 3e-4 --epochs 50 --dataset CelebA
python3 run.py --seed 0 --lr 3e-4 --epochs 350 --dataset CIFAR10
```
The script will download the corresponding dataset into a `datasets` folder and
will train the model on it. The trained model parameters will be saved to the file
`realnvp_<DATASET>.pt`.
To use the trained model for generating CelebA images run the following:
```python
import torch
import torchvision
import matplotlib.pyplot as plt

model = torch.load("realnvp_CelebA.pt")
imgs = model.sample(n=36)  # imgs.shape == (36, 3, 32, 32)
grid = torchvision.utils.make_grid(imgs, nrow=6)
plt.imshow(grid.permute(1, 2, 0))
```
This is what the model generates after training for 50 epochs on the CelebA dataset.
For generating CIFAR10 images run:
```python
model = torch.load("realnvp_CIFAR10.pt")
imgs = model.sample(n=36)  # imgs.shape == (36, 3, 32, 32)
grid = torchvision.utils.make_grid(imgs, nrow=6)
plt.imshow(grid.permute(1, 2, 0))
```
This is what the model generates after training for 350 epochs on the CIFAR10 dataset.