DALL-E

Purpose

This paper explains a text-to-image model that uses a discrete variational autoencoder (dVAE) and an autoregressive transformer.

Core

How does DALL-E work?

  1. A discrete variational autoencoder (dVAE) is trained to compress each $256 \times 256$ RGB image into a $32 \times 32$ grid of image tokens (a.k.a. latent vectors). The discrete latent space has 8192 possible latent vectors.

  2. An autoregressive transformer is trained to model the joint distribution over the caption text and the image tokens. In other words, the transformer is trained to output the "next image token" given the caption text and the "previous image tokens" (see the shape sketch right after this list).
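A shape-level sketch of the two stages, assuming the token counts reported in the paper (captions become up to 256 BPE tokens over a 16384-entry vocabulary, images become 32×32 = 1024 tokens over the 8192-entry codebook); the random tensors below are placeholders for a real tokenizer and dVAE encoder:

```python
# Shape-level sketch of the DALL-E pipeline (placeholder tensors, not real models).
import torch

B = 2                                              # batch size
image = torch.randn(B, 3, 256, 256)                # 256x256 RGB input

# Stage 1 (dVAE encoder): each image becomes a 32x32 grid of discrete tokens,
# each token being one of 8192 codebook entries.
image_tokens = torch.randint(0, 8192, (B, 32, 32)).flatten(1)    # (B, 1024)

# Stage 2 (autoregressive transformer): caption BPE tokens come first, then the
# image tokens, concatenated into one sequence over a shared vocabulary.
text_tokens = torch.randint(0, 16384, (B, 256))                  # (B, 256)
sequence = torch.cat([text_tokens, image_tokens + 16384], dim=1)
print(sequence.shape)                              # torch.Size([2, 1280])
```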

Objective of the Overall Process

Our goal is to build a model that generates an image $x$ when a caption $y$ is given. This means we want to maximize the evidence $\ln p_{\theta, \psi}(x, y)$, which we do by maximizing the ELBO.

$$\ln p_{\theta, \psi}(x, y) \ge \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\ln p_\theta(x \mid y, z) - \beta\, D_{KL}(q_\phi(y, z \mid x),\, p_\psi(y, z))\big]$$
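This bound comes from the factorization the paper uses for the joint distribution over images $x$, captions $y$, and image tokens $z$:

$$p_{\theta, \psi}(x, y, z) = p_\theta(x \mid y, z)\, p_\psi(y, z)$$

Here $p_\theta$ is the dVAE decoder and $p_\psi$ is the transformer prior over captions and image tokens. Applying Jensen's inequality with the dVAE encoder $q_\phi$ as the variational posterior gives the $\beta = 1$ case of the bound; since the KL term is non-negative, the bound still holds for any $\beta \ge 1$.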

How did this ELBO inequality come about?

This inequality comes from the $\beta$-VAE paper. It has the same form as the VAE loss function, reconstruction loss + KLD; $\beta$-VAE changes the coefficient of the KLD term to encourage a disentangled latent space.

The first term on the right-hand side is the reconstruction loss.

The second term on the right-hand side is the KL divergence between the dVAE encoder distribution $q_\phi(y, z \mid x)$ and the autoregressive transformer's prior distribution $p_\psi(y, z)$.
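To make these two terms concrete, here is a minimal $\beta$-VAE-style loss in PyTorch. The tensors, the MSE reconstruction term, and the uniform categorical prior are illustrative stand-ins, not the paper's exact dVAE objective (the paper uses a log-Laplace reconstruction likelihood and reports a KL weight of 6.6):

```python
# Sketch of a beta-VAE-style objective: reconstruction loss plus beta-weighted KL.
import math
import torch
import torch.nn.functional as F

beta = 6.6                                   # KL weight (> 1 puts more pressure on the latent structure)
B, K, H, W = 2, 8192, 32, 32                 # batch, codebook size, token grid

logits = torch.randn(B, K, H, W)             # encoder logits defining q_phi(z|x)
recon = torch.randn(B, 3, 256, 256)          # decoder output, stand-in for p_theta(x|y,z)
target = torch.randn(B, 3, 256, 256)         # original image

recon_loss = F.mse_loss(recon, target)       # stand-in for -ln p_theta(x|y,z)

# KL between the encoder's categorical distribution and a uniform prior over K codes:
# KL(q || U) = sum_k q_k * (ln q_k + ln K), averaged over batch and grid positions.
log_q = F.log_softmax(logits, dim=1)
kl = (log_q.exp() * (log_q + math.log(K))).sum(dim=1).mean()

loss = recon_loss + beta * kl
```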

First Step: Learning the Visual Codebook

We maximize the ELBO by training the dVAE parameters $\phi$ and $\theta$. After this step, we have a fully trained dVAE encoder and decoder.
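The hard argmax over 8192 codes is not differentiable, so the paper trains the dVAE with a gumbel-softmax relaxation. The sketch below shows that relaxation; the codebook dimension and temperature here are illustrative (the paper anneals the temperature during training):

```python
# Gumbel-softmax relaxation: a soft, differentiable selection of codebook vectors
# used during dVAE training; at inference the hard argmax over codes is used.
import torch
import torch.nn.functional as F

K, D = 8192, 512                              # codebook size; embedding dim is illustrative
codebook = torch.randn(K, D, requires_grad=True)
logits = torch.randn(2, 32, 32, K)            # encoder logits per spatial position

tau = 1.0                                     # relaxation temperature
soft_onehot = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)   # (2, 32, 32, K)
z = soft_onehot @ codebook                    # (2, 32, 32, D): relaxed codebook lookup

hard_tokens = logits.argmax(dim=-1)           # (2, 32, 32): discrete tokens at inference
```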

Second Step: Learning the Prior

The prior $p_\psi(y, z)$ is then trained to maximize the ELBO while the dVAE parameters $\phi$ and $\theta$ stay fixed. In other words, the autoregressive transformer learns the joint pattern of captions $y$ and image tokens $z$ produced by the dVAE encoder.
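A minimal sketch of this second stage: with the dVAE frozen, the transformer is trained with next-token cross-entropy over the concatenated caption and image tokens. The tiny transformer below is only a stand-in for the 12-billion-parameter sparse-attention model used in the paper:

```python
# Stage 2 sketch: autoregressive next-token prediction over [caption ; image tokens].
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMG_VOCAB = 16384, 8192
VOCAB = TEXT_VOCAB + IMG_VOCAB                # shared vocabulary: text ids, then image ids
SEQ = 256 + 1024                              # 256 caption tokens + 1024 image tokens

class ToyPrior(nn.Module):
    def __init__(self, d=128, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d)
        self.pos = nn.Embedding(SEQ, d)
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):                          # tokens: (B, L)
        L = tokens.size(1)
        pos = torch.arange(L, device=tokens.device)
        h = self.embed(tokens) + self.pos(pos)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(h, mask=causal)                 # causal mask keeps it autoregressive
        return self.head(h)                             # (B, L, VOCAB)

model = ToyPrior()
tokens = torch.randint(0, VOCAB, (2, SEQ))    # pretend [caption tokens ; dVAE image tokens]
logits = model(tokens[:, :-1])                # predict token t from tokens < t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
```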

Paper: Zero-Shot Text-to-Image Generation (DALL-E)
https://arxiv.org/abs/2102.12092