Vector Quantized VAE (VQ-VAE)


Purpose

Most previous VAEs learn a continuous latent space.

This paper presents a powerful generative model that instead learns a discrete latent space.

Core

  1. The encoder takes an image and outputs a grid of vectors $z_e(x)$.

  2. Each vector is compared against the codebook (a.k.a. the embedding space) and replaced by its nearest embedding vector (see the sketch after this list). The quantized result $z_q(x)$ is the latent representation, and the assignment defines the posterior $q(z|x)$.

  3. Finally, the decoder takes the quantized latent vectors and reconstructs the image.
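
As a rough sketch of step 2, the nearest-neighbour lookup might look like the following PyTorch snippet. The shapes and the `quantize` helper are illustrative, not taken from the paper or the linked repo:

```python
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Replace each encoder output vector with its nearest codebook entry.

    z_e:      (N, D) flattened encoder outputs z_e(x)
    codebook: (K, D) embedding vectors e_1, ..., e_K
    """
    # Pairwise L2 distances between encoder vectors and embeddings: (N, K)
    dist = torch.cdist(z_e, codebook)
    indices = dist.argmin(dim=1)  # index of the nearest embedding per vector
    z_q = codebook[indices]       # (N, D) quantized latents z_q(x)
    return z_q, indices
```

In a real implementation the codebook usually lives in an `nn.Embedding` so that the loss described below can update it.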

Training VQ-VAE

There are three trainable parts: the encoder parameters, the codebook, and the decoder parameters.

The objective of training is to minimize the following loss function, where $sg[\cdot]$ stands for the stop-gradient operator, which has zero partial derivatives:

$$L = -\log p(x|z_q(x)) + \|sg[z_e(x)] - e\|_2^2 + \beta \|z_e(x) - sg[e]\|_2^2$$

The first term is the reconstruction loss, which optimizes the encoder and the decoder.

The second term trains the codebook, pulling each embedding vector toward the encoder outputs assigned to it.

The last term, the commitment loss, trains the encoder output to stay close to its chosen embedding.

The second and last terms may look redundant, since both pull the codebook and the encoder outputs toward each other. However, the encoder's parameters can change much faster than the codebook can follow, so the embeddings would never settle.

We add the last term to make sure the encoder and the codebook move together.
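
As a minimal sketch in PyTorch, assuming an MSE reconstruction term (a Gaussian decoder) and illustrative variable names, the loss can be written with `.detach()` playing the role of $sg[\cdot]$:

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(x, x_recon, z_e, z_q, beta=0.25):
    """VQ-VAE loss; .detach() implements the stop-gradient operator sg[.]."""
    recon_loss = F.mse_loss(x_recon, x)            # -log p(x|z_q(x)) for a Gaussian decoder
    codebook_loss = F.mse_loss(z_q, z_e.detach())  # ||sg[z_e(x)] - e||^2, updates the codebook
    commit_loss = F.mse_loss(z_e, z_q.detach())    # ||z_e(x) - sg[e]||^2, updates the encoder
    return recon_loss + codebook_loss + beta * commit_loss
```

The two `.detach()` calls decide which module each term trains: the codebook term sees no gradient through the encoder, and the commitment term sees no gradient through the codebook.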

Why don't we have a KLD term in the objective loss function? This is because we set the prior of the latent space $z$ to the uniform distribution, $p(z) = \frac{1}{K}$.

Because the posterior $q(z|x)$ is deterministic (one-hot, see Implementation details), only one term of the KLD between the posterior and the prior survives:

$$KL(q(z|x) \,\|\, p(z)) = \sum_z q(z|x) \log \frac{q(z|x)}{p(z)} = 1 \cdot \log \frac{1}{1/K} = \log K$$

Since this is a constant with respect to all trainable parameters, we can safely drop the KLD from the loss objective.

Adding PixelCNN into VQ-VAE for image generation

After training the VQ-VAE, each image is encoded as a grid of discrete latent codes. PixelCNN learns the relationship between these codes; in other words, PixelCNN learns the prior of the latent space. This is what makes sampling realistic images possible.

PixelCNN is an autoregressive generator over the latent codes: it learns to produce code grids whose decodings are realistic images.
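
A minimal sketch of this sampling loop, assuming a hypothetical `pixelcnn` module that maps a code grid to per-position logits (the linked repo's actual API may differ):

```python
import torch

@torch.no_grad()
def sample_latents(pixelcnn, height, width, num_codes, device="cpu"):
    """Autoregressively sample a (height, width) grid of discrete latent indices.

    Assumes pixelcnn(codes) returns logits of shape (B, num_codes, H, W);
    this interface is illustrative, not the repo's actual API.
    """
    codes = torch.zeros(1, height, width, dtype=torch.long, device=device)
    for i in range(height):
        for j in range(width):
            logits = pixelcnn(codes)                   # (1, num_codes, H, W)
            probs = logits[0, :, i, j].softmax(dim=0)  # p(code_ij | previous codes)
            codes[0, i, j] = torch.multinomial(probs, 1)
    return codes  # map through the codebook and the decoder to get an image
```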

Implementation details

The posterior categorical distribution $q(z|x)$ is deterministic: it puts probability 1 on the codebook index nearest to the encoder output, i.e. $q(z = k|x) = 1$ for $k = \arg\min_j \|z_e(x) - e_j\|_2$, and 0 otherwise.
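
The nearest-neighbour lookup itself has no gradient, so implementations copy the gradient from the decoder input straight to the encoder output (the straight-through estimator). A one-line sketch:

```python
import torch

def straight_through(z_e: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
    """Forward pass returns z_q; backward pass copies gradients onto z_e,
    skipping the non-differentiable nearest-neighbour lookup."""
    return z_e + (z_q - z_e).detach()
```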

You can train and run a VQ-VAE model for CIFAR-10 using https://github.com/jinho-choi123/VQVAE-pytorch.

Paper: Neural Discrete Representation Learning, https://arxiv.org/abs/1711.00937