DALL-E
This paper describes a text-to-image model built from two components: a discrete variational autoencoder (dVAE) and an autoregressive transformer.
A discrete variational autoencoder (dVAE) is trained to compress an RGB image into a grid of image tokens (a.k.a. discrete latent codes). The discrete latent space has 8192 possible values per token.
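To make this concrete, here is a minimal PyTorch sketch of discrete encoding with an 8192-entry codebook, using a Gumbel-softmax relaxation so the discrete choice stays differentiable. The single-conv encoder, the embedding dimension, and all module names are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192   # number of possible image tokens (from the paper)
DIM = 256      # codebook embedding dimension (assumed)

class ToyDVAEEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # The real dVAE uses a deep conv stack; a single strided conv keeps this short.
        self.conv = nn.Conv2d(3, VOCAB, kernel_size=8, stride=8)  # 256x256 -> 32x32 logits
        self.codebook = nn.Embedding(VOCAB, DIM)

    def forward(self, x, tau=1.0):
        logits = self.conv(x)                        # (B, 8192, 32, 32)
        # Gumbel-softmax relaxation keeps the discrete choice differentiable.
        soft_onehot = F.gumbel_softmax(logits, tau=tau, dim=1)
        tokens = logits.argmax(dim=1)                # (B, 32, 32) hard token ids
        # Soft codebook lookup: convex combination of codebook embeddings.
        z = torch.einsum("bvhw,vd->bdhw", soft_onehot, self.codebook.weight)
        return z, tokens

enc = ToyDVAEEncoder()
img = torch.randn(1, 3, 256, 256)
z, tokens = enc(img)
print(z.shape, tokens.shape)  # (1, 256, 32, 32), (1, 32, 32)
```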
An autoregressive transformer is then trained to model the joint distribution over the caption text and the image tokens. In other words, the transformer is trained to output the next image token given the caption text and the previous image tokens.
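A toy sketch of this setup, assuming hypothetical vocabulary sizes and sequence lengths: caption tokens and (offset) image tokens are concatenated into one sequence, and a causally masked transformer is trained with next-token cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMG_VOCAB = 16384, 8192   # text vocab size assumed; 8192 image tokens
TEXT_LEN, IMG_LEN = 256, 32 * 32      # caption length assumed; 1024 image positions

class ToyJointTransformer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Image token ids are offset by TEXT_VOCAB so one embedding table covers both.
        self.embed = nn.Embedding(TEXT_VOCAB + IMG_VOCAB, dim)
        self.pos = nn.Embedding(TEXT_LEN + IMG_LEN, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, TEXT_VOCAB + IMG_VOCAB)

    def forward(self, seq):
        # seq: (B, T) caption tokens followed by offset image tokens
        T = seq.size(1)
        h = self.embed(seq) + self.pos(torch.arange(T, device=seq.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(seq.device)
        h = self.blocks(h, mask=mask)   # causal mask -> autoregressive
        return self.head(h)             # next-token logits at every position

model = ToyJointTransformer()
text = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
image = torch.randint(0, IMG_VOCAB, (1, IMG_LEN)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)
logits = model(seq)
# Shift by one: every position predicts the token that follows it.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
print(loss.item())
```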
Our goal is a model that generates an image when text is given, which means we want to maximize the evidence (the log-likelihood of images and captions) by maximizing the ELBO.
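For reference, the bound being maximized (equation 1 in the DALL-E paper), where $x$ is the image, $y$ the caption, and $z$ the image tokens; $q_\phi$ is the dVAE encoder, $p_\theta$ the dVAE decoder, and $p_\psi$ the transformer prior:

$$
\ln p_{\theta,\psi}(x, y) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\Big[\ln p_\theta(x \mid y, z) \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(y, z \mid x)\,\|\,p_\psi(y, z)\big)\Big]
$$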
How did this ELBO inequality come about?
This inequality comes from the β-VAE paper. It has the same form as the VAE loss function (reconstruction loss + KL divergence); β-VAE simply changes the coefficient on the KL term to encourage a disentangled latent space.
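For comparison, the β-VAE objective has the same two-term shape; setting $\beta = 1$ recovers the standard VAE, while $\beta > 1$ puts extra weight on the KL term:

$$
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\ln p_\theta(x \mid z)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$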
The first term on the right-hand side is the reconstruction loss.
The second term on the right-hand side is the KL divergence between the dVAE encoder distribution and the autoregressive transformer (prior) distribution.
We maximize the ELBO by training the parameters φ and θ. After this step, we have a fully trained dVAE encoder and decoder.
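A rough sketch of one stage-1 training step, continuing from the ToyDVAEEncoder above. During this stage the paper fixes the prior over codes to a uniform categorical, so the KL term reduces to KL against the uniform distribution; the stand-in decoder and the MSE reconstruction term (the paper uses a logit-Laplace likelihood, with a KL weight of β = 6.6) are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.ConvTranspose2d(DIM, 3, kernel_size=8, stride=8)  # stand-in dVAE decoder

def stage1_step(x, beta=1.0):
    z, _ = enc(x)                         # soft codebook vectors, (B, DIM, 32, 32)
    recon = decoder(z)                    # (B, 3, 256, 256)
    recon_loss = F.mse_loss(recon, x)     # stand-in for -ln p_theta(x | y, z)
    logits = enc.conv(x)                  # (B, 8192, 32, 32)
    logq = F.log_softmax(logits, dim=1)
    q = logq.exp()
    # KL(q || uniform) = log(VOCAB) - H(q), averaged over batch and grid positions
    kl = (q * logq).sum(dim=1).mean() + math.log(VOCAB)
    return recon_loss + beta * kl

x = torch.randn(2, 3, 256, 256)
loss = stage1_step(x)
loss.backward()   # gradients flow into both the encoder (phi) and decoder (theta)
```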
The prior p_ψ is then trained to maximize the ELBO. In other words, the autoregressive transformer learns to model the distribution of tokens that the dVAE encoder produces.
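Finally, a sampling sketch reusing the toy transformer above: given caption tokens, image tokens are drawn one at a time, and the resulting 32×32 grid of code ids would then be decoded back to pixels with the dVAE decoder.

```python
import torch

@torch.no_grad()
def generate(model, text_tokens, temperature=1.0):
    seq = text_tokens.clone()                           # (1, TEXT_LEN) caption ids
    for _ in range(IMG_LEN):
        logits = model(seq)[:, -1, :]                   # distribution over next token
        logits = logits[:, TEXT_VOCAB:]                 # only image tokens are valid here
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1) + TEXT_VOCAB  # back into the joint id range
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, TEXT_LEN:] - TEXT_VOCAB               # (1, 1024) dVAE code ids

caption = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
img_tokens = generate(model, caption)
print(img_tokens.shape)  # torch.Size([1, 1024]); decode these with the dVAE decoder
```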