Variational Auto Encoder
Before machine learning algorithms process data, non-matrix data is converted into a low-dimensional vector (called a latent vector). For example, text is converted into an embedding space.
This is a crucial step: if we simply used one-hot vectors, the vector dimension would be very large and the computational cost would explode.
An encoder is an algorithm that converts real data (text, images) into a low-dimensional representation. A decoder is an algorithm that converts that low-dimensional representation back into real data. In deep learning, we often use neural networks such as CNNs for the encoder and decoder.
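As a minimal sketch of this encoder/decoder pairing, here is a plain autoencoder written with PyTorch (this page does not prescribe a framework); the layer sizes and the 784-dimensional input (a flattened 28×28 image) are illustrative choices only.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain autoencoder: x -> latent z -> reconstruction x_hat."""

    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder: real data -> low-dimensional latent vector
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: latent vector -> reconstructed data
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # deterministic latent vector
        x_hat = self.decoder(z)  # reconstruction
        return x_hat, z
```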
An autoencoder outputs a latent vector, called $z$, for an input $x$. The decoder is trained to convert $z$ back into $x$.
However, an autoencoder outputs the same latent vector whenever the input is the same, which means the decoder cannot produce varied generative outputs for the same input. Moreover, the decoder is never trained to produce valid outputs when the latent vector changes slightly.
This limitation motivates the use of a VAE.
A Variational Autoencoder (VAE) produces a distribution over the latent vector for each input.
The term "distribution of the latent vector" is hard to understand at first. Suppose the latent vector is 3-dimensional, so it can be written as $z = (z_1, z_2, z_3)$. A distribution over the latent vector means that each of $z_1, z_2, z_3$ is a distribution rather than a deterministic value, e.g. $z_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$, $z_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$, $z_3 \sim \mathcal{N}(\mu_3, \sigma_3^2)$.
However, the neural networks (CNN, RNN) used in the encoder cannot directly produce a distribution for a given input. Instead, we structure the encoder to output the mean $\mu$ and variance $\sigma^2$, and we model the latent vector as a Gaussian distribution $z \sim \mathcal{N}(\mu, \sigma^2)$.
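Written out per dimension (assuming the standard diagonal-Gaussian form, which this page does not spell out), the encoder outputs define:

$$z_i \sim \mathcal{N}\big(\mu_i(x),\ \sigma_i^2(x)\big), \qquad i = 1, \dots, d$$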
Let's think of a specific case of a VAE.
We want to build a VAE as follows:

- input $x$ : a picture taken by a camera
- latent $z$ (what we want to infer) : the angle of the camera, the focus of the lens, and the type of figure in the picture
The important fact is that we can observe $x$, but not $z$. We want to know the distribution of $z$ when $x$ is given (= $p(z \mid x)$).
By Bayes' theorem, we can write the following equation:

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$$
Can we just calculate it? No, we cannot.
The term $p(z)$ is the distribution of the latent vector, also called the prior distribution over the latent vector. It is tractable because we usually choose a simple distribution such as a Gaussian for it.
The term $p(x \mid z)$ is the distribution of the decoder output when the latent vector is given. It is tractable because it is a simple forward computation of the decoder neural network.
The term $p(x) = \int p(x \mid z)\, p(z)\, dz$, however, is intractable. As the decoder neural network gets complicated (which makes $p(x \mid z)$ complex), it becomes impossible to carry out this integral over the entire latent space.
Bayes' theorem alone is not enough.
Approximating $p(z \mid x)$ with $q_\phi(z \mid x)$ is the method used in the VAE. $q_\phi(z \mid x)$ is a well-known distribution such as a Gaussian.
We are going to use the KL-divergence to find the $q_\phi(z \mid x)$ that best matches $p(z \mid x)$.
Please look at KL-Divergence if you are not familiar with KL-divergence.
Finding the best-fitting $q_\phi(z \mid x)$ is posed as follows:

$$\phi^* = \arg\min_\phi D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big)$$
Let's unwrap the KL-divergence to take a deeper look. Substituting Bayes' theorem for $p(z \mid x)$ gives

$$D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi}\!\left[\log \frac{q_\phi(z \mid x)}{p(x \mid z)\,p(z)}\right] + \mathbb{E}_{z \sim q_\phi}\big[\log p(x)\big] \tag{1}$$

$$D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) = \mathbb{E}_{z \sim q_\phi}\!\left[\log \frac{q_\phi(z \mid x)}{p(x \mid z)\,p(z)}\right] + \log p(x) \tag{2}$$

The left term in (1) can be written using $p(x \mid z)$ and $p(z)$ because Bayes' theorem, $p(z \mid x) = p(x \mid z)\,p(z)/p(x)$, is known ($p(x \mid z)$ and $p(z)$ are tractable!).

The right term in (1) does not depend on $z$, so it can be pulled out of the expectation and written as the right term in (2), $\log p(x)$. This quantity is also referred to as the evidence.

Since a probability is between 0 and 1, the evidence $\log p(x)$ is always smaller than or equal to 0.

Also, the KL-divergence is always greater than or equal to 0.

Rearranging (2) and using $D_{KL} \ge 0$ gives

$$\log p(x) = \mathbb{E}_{z \sim q_\phi}\!\left[\log \frac{p(x \mid z)\,p(z)}{q_\phi(z \mid x)}\right] + D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z \mid x)\big) \;\ge\; \mathbb{E}_{z \sim q_\phi}\!\left[\log \frac{p(x \mid z)\,p(z)}{q_\phi(z \mid x)}\right],$$

which means that $\mathbb{E}_{z \sim q_\phi}\!\left[\log \frac{p(x \mid z)\,p(z)}{q_\phi(z \mid x)}\right]$ is the evidence lower bound, a.k.a. the ELBO.
If we find the optimal $q_\phi$, this means we can encode the input $x$ into the latent distribution $q_\phi(z \mid x)$, which is an approximation of $p(z \mid x)$.
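For reference, the ELBO above can be rearranged into a reconstruction term minus a KL term against the prior; assuming a Gaussian $q_\phi$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$ (the usual choice, not spelled out on this page), the KL term has a closed form:

$$\text{ELBO} = \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p(x \mid z)\big] \;-\; D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

$$D_{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2}\sum_i \big(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\big)$$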
The encoder learns to output the mean $\mu$ and standard deviation $\sigma$ for the input image. From these encoder outputs, the latent space is modeled as a Gaussian distribution with parameters $(\mu, \sigma)$.
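A minimal sketch of such an encoder, again assuming PyTorch (outputting the log-variance rather than the variance is a common numerical-stability choice, not something dictated by the text):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input image to the parameters (mu, log_var) of a Gaussian latent distribution."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_log_var(h)
```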
There is a critical problem in the VAE: to get a latent vector, we sample $z$ from the distribution $\mathcal{N}(\mu, \sigma^2)$. However, sampling from a distribution is not differentiable, because it just picks a random sample from the distribution, so gradients cannot flow back to $\mu$ and $\sigma$.
Reparameterization is a method that expresses the randomness using one more variable: we write $z = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. Using reparameterization, we move the random sampling step to the unimportant (gradient-free) node $\epsilon$, so $z$ stays differentiable with respect to $\mu$ and $\sigma$.
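A sketch of the reparameterization step under the same PyTorch assumption; `torch.randn_like` draws the noise $\epsilon$ with the same shape as $\mu$:

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so z remains differentiable
    with respect to mu and log_var.
    """
    std = torch.exp(0.5 * log_var)  # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)     # random node: no gradient needed
    return mu + eps * std
```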
You can train and run a VAE model to generate various MNIST pictures in .
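A rough end-to-end sketch of what such a training script could look like, assuming PyTorch and torchvision for MNIST; the loss is the negative ELBO (reconstruction error plus the Gaussian KL term), and all layer sizes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class VAE(nn.Module):
    """Small fully-connected VAE for 28x28 MNIST images (sizes are illustrative)."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, log_var

def loss_fn(x_hat, x, mu, log_var):
    # Negative ELBO: reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def train(epochs=5, batch_size=128, lr=1e-3):
    data = datasets.MNIST("./data", train=True, download=True,
                          transform=transforms.ToTensor())
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    model = VAE()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for x, _ in loader:
            x = x.view(x.size(0), -1)  # flatten 28x28 images to 784
            x_hat, mu, log_var = model(x)
            loss = loss_fn(x_hat, x, mu, log_var)
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(f"epoch {epoch}: last batch loss {loss.item():.1f}")
    # Generate new digits by decoding samples drawn from the prior N(0, I)
    with torch.no_grad():
        samples = model.dec(torch.randn(16, 16))  # 16 images from 16-dim latents
    return model, samples

if __name__ == "__main__":
    train()
```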