DALL-E
This paper describes a text-to-image model built from two components: a discrete variational autoencoder (dVAE) and an autoregressive transformer.
A discrete variational autoencoder (dVAE) is trained to compress an RGB image into a grid of image tokens (a.k.a. discrete latent codes). The discrete latent space has 8192 possible values per token.
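To make this concrete, here is a minimal PyTorch sketch of discrete encoding with an 8192-entry codebook, using a Gumbel-softmax relaxation so the discrete choice stays differentiable. The single-conv encoder, the embedding dimension, and all module names are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192   # number of possible image tokens (from the paper)
DIM = 256      # codebook embedding dimension (assumed)

class ToyDVAEEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # The real dVAE uses a deep conv stack; a single strided conv keeps this short.
        self.conv = nn.Conv2d(3, VOCAB, kernel_size=8, stride=8)  # 256x256 -> 32x32 logits
        self.codebook = nn.Embedding(VOCAB, DIM)

    def forward(self, x, tau=1.0):
        logits = self.conv(x)                        # (B, 8192, 32, 32)
        # Gumbel-softmax relaxation keeps the discrete choice differentiable.
        soft_onehot = F.gumbel_softmax(logits, tau=tau, dim=1)
        tokens = logits.argmax(dim=1)                # (B, 32, 32) hard token ids
        # Soft codebook lookup: convex combination of codebook embeddings.
        z = torch.einsum("bvhw,vd->bdhw", soft_onehot, self.codebook.weight)
        return z, tokens

enc = ToyDVAEEncoder()
img = torch.randn(1, 3, 256, 256)
z, tokens = enc(img)
print(z.shape, tokens.shape)  # (1, 256, 32, 32), (1, 32, 32)
```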
An autoregressive transformer is then trained to model the joint distribution over the caption text and the image tokens. In other words, the transformer is trained to output the next image token given the caption text and the previous image tokens.
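A toy sketch of this setup, assuming hypothetical vocabulary sizes and sequence lengths: caption tokens and (offset) image tokens are concatenated into one sequence, and a causally masked transformer is trained with next-token cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMG_VOCAB = 16384, 8192   # text vocab size assumed; 8192 image tokens
TEXT_LEN, IMG_LEN = 256, 32 * 32      # caption length assumed; 1024 image positions

class ToyJointTransformer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Image token ids are offset by TEXT_VOCAB so one embedding table covers both.
        self.embed = nn.Embedding(TEXT_VOCAB + IMG_VOCAB, dim)
        self.pos = nn.Embedding(TEXT_LEN + IMG_LEN, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, TEXT_VOCAB + IMG_VOCAB)

    def forward(self, seq):
        # seq: (B, T) caption tokens followed by offset image tokens
        T = seq.size(1)
        h = self.embed(seq) + self.pos(torch.arange(T, device=seq.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(seq.device)
        h = self.blocks(h, mask=mask)   # causal mask -> autoregressive
        return self.head(h)             # next-token logits at every position

model = ToyJointTransformer()
text = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
image = torch.randint(0, IMG_VOCAB, (1, IMG_LEN)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)
logits = model(seq)
# Shift by one: every position predicts the token that follows it.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
print(loss.item())
```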
Our goal is a model that generates an image when text is given, which means we want to maximize the evidence (the log-likelihood of images and captions) by maximizing the ELBO.
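For reference, the bound being maximized (equation 1 in the DALL-E paper), where $x$ is the image, $y$ the caption, and $z$ the image tokens; $q_\phi$ is the dVAE encoder, $p_\theta$ the dVAE decoder, and $p_\psi$ the transformer prior:

$$
\ln p_{\theta,\psi}(x, y) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\Big[\ln p_\theta(x \mid y, z) \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(y, z \mid x)\,\|\,p_\psi(y, z)\big)\Big]
$$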
How did this ELBO inequality come about?
This inequality comes from the β-VAE paper. It has the same form as the VAE loss function (reconstruction loss + KL divergence); β-VAE simply changes the coefficient on the KL term to encourage a disentangled latent space.
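For comparison, the β-VAE objective has the same two-term shape; setting $\beta = 1$ recovers the standard VAE, while $\beta > 1$ puts extra weight on the KL term:

$$
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\ln p_\theta(x \mid z)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$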
The first term on the right-hand side is the reconstruction loss.
The second term on the right-hand side is the KL divergence between the dVAE encoder distribution and the autoregressive transformer (prior) distribution.
We maximize the ELBO by training the parameters φ and θ. After this step, we have a fully trained dVAE encoder and decoder.
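A rough sketch of one stage-1 training step, continuing from the ToyDVAEEncoder above. During this stage the paper fixes the prior over codes to a uniform categorical, so the KL term reduces to KL against the uniform distribution; the stand-in decoder and the MSE reconstruction term (the paper uses a logit-Laplace likelihood, with a KL weight of β = 6.6) are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.ConvTranspose2d(DIM, 3, kernel_size=8, stride=8)  # stand-in dVAE decoder

def stage1_step(x, beta=1.0):
    z, _ = enc(x)                         # soft codebook vectors, (B, DIM, 32, 32)
    recon = decoder(z)                    # (B, 3, 256, 256)
    recon_loss = F.mse_loss(recon, x)     # stand-in for -ln p_theta(x | y, z)
    logits = enc.conv(x)                  # (B, 8192, 32, 32)
    logq = F.log_softmax(logits, dim=1)
    q = logq.exp()
    # KL(q || uniform) = log(VOCAB) - H(q), averaged over batch and grid positions
    kl = (q * logq).sum(dim=1).mean() + math.log(VOCAB)
    return recon_loss + beta * kl

x = torch.randn(2, 3, 256, 256)
loss = stage1_step(x)
loss.backward()   # gradients flow into both the encoder (phi) and decoder (theta)
```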
The prior p_ψ is then trained to maximize the ELBO. In other words, the autoregressive transformer learns to model the distribution of tokens that the dVAE encoder produces.
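Finally, a sampling sketch reusing the toy transformer above: given caption tokens, image tokens are drawn one at a time, and the resulting 32×32 grid of code ids would then be decoded back to pixels with the dVAE decoder.

```python
import torch

@torch.no_grad()
def generate(model, text_tokens, temperature=1.0):
    seq = text_tokens.clone()                           # (1, TEXT_LEN) caption ids
    for _ in range(IMG_LEN):
        logits = model(seq)[:, -1, :]                   # distribution over next token
        logits = logits[:, TEXT_VOCAB:]                 # only image tokens are valid here
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1) + TEXT_VOCAB  # back into the joint id range
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, TEXT_LEN:] - TEXT_VOCAB               # (1, 1024) dVAE code ids

caption = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
img_tokens = generate(model, caption)
print(img_tokens.shape)  # torch.Size([1, 1024]); decode these with the dVAE decoder
```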