Diffusion Model

Purpose

By adding Gaussian noise to an image step by step, we can turn it into a sample from a Gaussian distribution. By denoising step by step starting from that Gaussian distribution, we can generate a high-quality image.

Core

How does the Diffusion Model work?

Forward Process

For a given image, we add Gaussian noise step by step. This procedure is called the 'forward process' and is written $q(x_t|x_{t-1})$. $x_0$ is the original image, and $x_1, x_2, ..., x_T$ are latent vectors with the same dimension as $x_0$.

Adding Gaussian noise at a single step, and the joint distribution of the latent vectors, can be expressed as follows; the $\beta_t$ are fixed for each step $t$:

$$q(x_t|x_{t-1}) := N(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I) \quad ...(1)$$

$$q(x_{1:T}|x_0) := \prod_{t=1}^{T} q(x_t|x_{t-1}) \quad ...(2)$$
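To make the forward process concrete, here is a minimal PyTorch sketch of a single noising step from equation (1). The number of steps `T` and the linear `betas` schedule are assumptions for illustration (they match common DDPM defaults), not something fixed by the equations above.

```python
import torch

T = 1000                                   # assumed number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear beta schedule

def forward_step(x_prev, t):
    """One forward step: sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1-beta_t) x_{t-1}, beta_t I)."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise
```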

Reverse Process

After $T$ steps, we end up with an (approximately) pure Gaussian distribution $N(0, I)$. The 'reverse process' denoises this Gaussian step by step to recover a high-quality image. It is written $p_\theta(x_{t-1}|x_t)$.

Denoising a single step, and the joint distribution of all latent vectors together with the generated image, can be expressed as follows; $\mu_\theta$ and $\Sigma_\theta$ are neural-network outputs with parameters $\theta$:

$$p_\theta(x_{t-1}|x_t) := N(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)) \quad ...(3)$$

$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t) \quad ...(4)$$

What is the training objective?

Our goal is to maximize the log-likelihood $p_\theta(x_0)$. The meaning of $p_\theta(x_0)$: for a given image $x_0$ and adjustable parameters $\theta$, what is the probability of the model producing $x_0$?

Our ultimate goal is to train $\theta$ to maximize this log-likelihood.

Since $\log p_\theta(x_0)$ itself is intractable, we instead maximize a lower bound of the following form:

$$\log p_\theta(x_0) \ge \log p_\theta(x_0) - KL(?,\, ?)$$

Equivalently, this can be interpreted as minimizing $L$:

$$-\log p_\theta(x_0) \le E_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] = E_q\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right] := L \quad ...(5)$$

How can we come up with this inequality?

Let's look at it step by step.

Since the KL-divergence is always non-negative, we can set up the following inequality:

$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + KL(q(x_{1:T}|x_0),\, p_\theta(x_{1:T}|x_0)) \quad ...(6)$$

$q(x_{1:T}|x_0)$ is the distribution of the latent vectors when the input image $x_0$ is given, and $p_\theta(x_{1:T}|x_0)$ is the distribution of the latent vectors when the output image $x_0$ is given. These two distributions should be as similar as possible, which is why we add a KL-divergence term comparing them.

Taking a closer look at the KL-divergence term, we can rewrite it using Bayes' theorem:

$$KL(q(x_{1:T}|x_0),\, p_\theta(x_{1:T}|x_0)) = E_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)\, p_\theta(x_0)}{p_\theta(x_{0:T})}\right] = \log p_\theta(x_0) + E_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \quad ...(7)$$

Plugging this into the inequality, we arrive at the following:

$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + \log p_\theta(x_0) + E_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \quad ...(8)$$

Using equations (2) and (4), we can simplify equation (8) into exactly the bound $L$ from equation (5):

$$-\log p_\theta(x_0) \le E_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]$$
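Spelling that substitution out: the products in (2) and (4) turn the log of a ratio of products into a sum of per-step log-ratios.

$$E_q\left[\log \frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] = E_q\left[\log \frac{\prod_{t=1}^{T} q(x_t|x_{t-1})}{p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)}\right] = E_q\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right]$$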

Deep-diving to make the loss simpler

We can make this term even simpler! (Very important!!)

$$
\begin{aligned}
L &:= E_q\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1}|x_t)}{q(x_t|x_{t-1})}\right] \\
&= E_q\left[-\log p(x_T) + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} + \sum_{t \ge 2} \log \frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)}\right] \quad ...(9) \\
&= E_q\left[-\log p(x_T) + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} + \sum_{t \ge 2} \log \frac{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)}{q(x_{t-1}|x_0)\, p_\theta(x_{t-1}|x_t)}\right] \quad ...(10) \\
&= E_q\left[-\log p(x_T) + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} + \sum_{t \ge 2} \log \frac{q(x_t|x_0)}{q(x_{t-1}|x_0)} + \sum_{t \ge 2} \log \frac{q(x_{t-1}|x_t, x_0)}{p_\theta(x_{t-1}|x_t)}\right] \\
&= E_q\left[-\log p(x_T) + \log \frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} + \log \frac{q(x_T|x_0)}{q(x_1|x_0)}\right] + \sum_{t \ge 2} E_{x_t, x_{t-1} \sim q(x_t, x_{t-1}|x_0)}\left[\log \frac{q(x_{t-1}|x_t, x_0)}{p_\theta(x_{t-1}|x_t)}\right] \quad ...(11) \\
&= D_{KL}(q(x_T|x_0)\,\|\,p(x_T)) - E_q[\log p_\theta(x_0|x_1)] + \sum_{t \ge 2} D_{KL}(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t)) \quad ...(12)
\end{aligned}
$$
How did we convert $q(x_t|x_{t-1}) = \frac{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)}{q(x_{t-1}|x_0)}$ going from (9) to (10)?

By the ordinary Bayes' theorem, $q(x_t|x_{t-1}) = \frac{q(x_{t-1}|x_t)\, q(x_t)}{q(x_{t-1})}$.

But the noising and denoising processes of the diffusion model follow the Markov property: only $x_{t-1}$ affects $x_t$ in the noising process, and vice versa in the denoising process.

So we can additionally condition the probability of $x_t$ $(t \ge 2)$ on $x_0$, which gives $q(x_t|x_{t-1}) = q(x_t|x_{t-1}, x_0) = \frac{q(x_{t-1}|x_t, x_0)\, q(x_t|x_0)}{q(x_{t-1}|x_0)}$.

Interesting fact about the $D_{KL}(q(x_T|x_0)\,\|\,p(x_T))$ term in equation (12)

$p(x_T)$ is the distribution of the latent vector $x_T$, which is the Gaussian $p(x_T) \sim N(0, I)$.

Also, $q(x_T|x_0)$ is approximately the Gaussian $N(0, I)$, because noise has been added to $x_0$ over many steps. So the difference between $q(x_T|x_0)$ and $p(x_T)$ is very small.

Simplifying right-side

Meaning of $\sum_{t \ge 2} D_{KL}(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t))$ in equation (12)

The meaning of this term is that we match the learnable denoising model $p_\theta(x_{t-1}|x_t)$ to the ground-truth denoising distribution $q(x_{t-1}|x_t, x_0)$.

We know the equation of $p_\theta(x_{t-1}|x_t)$, which is $p_\theta(x_{t-1}|x_t) := N(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))$.

The equation of $q(x_{t-1}|x_t, x_0)$ can be expressed as follows:

$$q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0)\, q(x_{t-1}|x_0)}{q(x_t|x_0)}$$

We know $q(x_t|x_{t-1}, x_0) = q(x_t|x_{t-1}) = N(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I)$.

For $q(x_t|x_0)$, we are going to use the reparameterization trick.

$$x_t \sim q(x_t|x_{t-1}): \quad x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon, \quad \epsilon \sim N(0, I)$$

$$x_{t-1} \sim q(x_{t-1}|x_{t-2}): \quad x_{t-1} = \sqrt{1-\beta_{t-1}}\, x_{t-2} + \sqrt{\beta_{t-1}}\, \epsilon, \quad \epsilon \sim N(0, I)$$

As a result, we can express $x_t$ in terms of $x_0$:

$$
\begin{aligned}
x_t &= \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_t^* \\
&= \sqrt{1-\beta_t}\left(\sqrt{1-\beta_{t-1}}\, x_{t-2} + \sqrt{\beta_{t-1}}\, \epsilon_{t-1}^*\right) + \sqrt{\beta_t}\, \epsilon_t^* \quad ...(13) \\
&= \sqrt{1-\beta_t}\sqrt{1-\beta_{t-1}}\, x_{t-2} + \sqrt{(1-\beta_t)\beta_{t-1} + \beta_t}\; \bar{\epsilon}_{t-2} \quad ...(14) \\
&= \cdots = \sqrt{\bar{a}_t}\, x_0 + \sqrt{1-\bar{a}_t}\, \epsilon
\end{aligned}
$$

$$\text{where } a_t := 1-\beta_t \ \text{ and } \ \bar{a}_t := \prod_{s=1}^{t} a_s$$

The reason we can go from (13) to (14) is that all the $\epsilon_i^*$ are i.i.d. standard Gaussians, so the two noise terms merge into a single Gaussian $\bar{\epsilon}_{t-2}$ whose variance is the sum of the two variances. As a result,

$$q(x_t|x_0) = N(x_t;\, \sqrt{\bar{a}_t}\, x_0,\, (1-\bar{a}_t) I)$$
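As a quick illustration, the closed form $q(x_t|x_0)$ lets us jump to any timestep in one shot, which is what makes training efficient. A minimal PyTorch sketch, reusing the assumed linear schedule from the earlier snippet:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed schedule, as before
alphas = 1.0 - betas                        # a_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # abar_t = prod_{s<=t} a_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I) in a single step."""
    if noise is None:
        noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
```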

Using all the facts we know, we can now write $q(x_{t-1}|x_t, x_0)$ in closed form (see the references for the detailed algebra):

$$q(x_{t-1}|x_t, x_0) = N(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I)$$

$$\text{where } \tilde{\mu}_t := \frac{\sqrt{\bar{a}_{t-1}}\,\beta_t}{1-\bar{a}_t}\, x_0 + \frac{\sqrt{a_t}\,(1-\bar{a}_{t-1})}{1-\bar{a}_t}\, x_t \quad \text{and} \quad \tilde{\beta}_t := \frac{1-\bar{a}_{t-1}}{1-\bar{a}_t}\,\beta_t$$

We can simplify $\tilde{\mu}_t$ into a function of $x_t$ alone by substituting $x_0 = \frac{1}{\sqrt{\bar{a}_t}}\left(x_t - \sqrt{1-\bar{a}_t}\,\epsilon\right)$, obtained from $x_t = \sqrt{\bar{a}_t}\, x_0 + \sqrt{1-\bar{a}_t}\,\epsilon$.

Then

$$\tilde{\mu}_t = \frac{1}{\sqrt{a_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{a}_t}}\,\epsilon\right)$$
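For completeness, here is the algebra behind that simplification, using $\sqrt{\bar{a}_t} = \sqrt{a_t}\sqrt{\bar{a}_{t-1}}$ and $a_t + \beta_t = 1$:

$$\tilde{\mu}_t = \frac{\sqrt{\bar{a}_{t-1}}\,\beta_t}{1-\bar{a}_t} \cdot \frac{x_t - \sqrt{1-\bar{a}_t}\,\epsilon}{\sqrt{\bar{a}_t}} + \frac{\sqrt{a_t}\,(1-\bar{a}_{t-1})}{1-\bar{a}_t}\, x_t = \frac{\beta_t + a_t(1-\bar{a}_{t-1})}{\sqrt{a_t}\,(1-\bar{a}_t)}\, x_t - \frac{\beta_t}{\sqrt{a_t}\,\sqrt{1-\bar{a}_t}}\,\epsilon = \frac{1}{\sqrt{a_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{a}_t}}\,\epsilon\right)$$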

Calculating the KL-divergence

$$
\begin{aligned}
\sum_{t \ge 2} D_{KL}(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t))
&= \sum_{t \ge 2} D_{KL}\!\left(N(x_{t-1};\, \tilde{\mu}_t(x_t),\, \tilde{\beta}_t I)\,\|\,N(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t))\right) \quad ...(15) \\
&= \sum_{t \ge 2} \frac{1}{2\sigma_q^2(t)}\, \|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|_2^2 \quad ...(16) \\
&= \sum_{t \ge 2} \frac{\beta_t^2}{2\sigma_q^2(t)\, a_t\, (1-\bar{a}_t)}\, \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2 \quad ...(17)
\end{aligned}
$$

Going from (15) to (16) assumes that $\Sigma_\theta$ is not trainable and is fixed to $\tilde{\beta}_t I = \sigma_q^2(t) I$.

Going from (16) to (17), we set $\mu_\theta(x_t, t) = \frac{1}{\sqrt{a_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{a}_t}}\,\epsilon_\theta(x_t, t)\right)$. We can parameterize it this way because we are trying to make $\mu_\theta(x_t, t)$ approximate $\tilde{\mu}_t(x_t, x_0)$, so we give it the same functional form and let the network $\epsilon_\theta$ predict the noise $\epsilon$.
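With this parameterization, the whole reverse process becomes a loop that repeatedly predicts the noise and applies the mean formula above. Here is a hedged PyTorch sketch of that sampling loop; `eps_model(x, t)` is a hypothetical noise-prediction network (in practice a U-Net), and fixing $\sigma_t^2 = \beta_t$ is one of the variance choices discussed in the paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed schedule, as before
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(eps_model, shape):
    """Generate an image by denoising pure Gaussian noise step by step."""
    x = torch.randn(shape)                                        # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(x, torch.full((shape[0],), t))            # predicted noise
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)      # add sigma_t * z
        else:
            x = mean                                              # no noise at the final step
    return x
```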

Simplifying left-side

Meaning of $E_q[\log p_\theta(x_0|x_1)]$ in equation (12)

This is a reconstruction loss term: it measures how well the last denoising step $p_\theta(x_0|x_1)$ reconstructs the original image. We can rewrite this term as follows:

$$p_\theta(x_0|x_1) = \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} N(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2)\, dx$$

The paper does not say much about this term. It is eventually ignored, but I still don't fully understand the reason.

Final training objective

Dropping the per-step weighting coefficient in (17), as the paper does for its simplified objective, we minimize the following loss:

$$L := \sum_{t \ge 2} \|\epsilon - \epsilon_\theta(x_t, t)\|_2^2$$
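Putting the pieces together, one training step samples a random timestep, forms $x_t$ with the closed-form $q(x_t|x_0)$, and regresses the predicted noise onto the true noise. This is a sketch under the same assumptions as the earlier snippets (`eps_model` is the hypothetical noise-prediction network):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # assumed schedule, as before
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_loss(eps_model, x0):
    """Simplified objective: ||eps - eps_theta(x_t, t)||^2 for a random t."""
    t = torch.randint(0, T, (x0.shape[0],))                       # one random timestep per sample
    eps = torch.randn_like(x0)                                    # the noise we will try to predict
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))          # reshape for broadcasting
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                  # x_t ~ q(x_t | x_0)
    return F.mse_loss(eps_model(x_t, t), eps)
```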

Implementation details

I made a simple Diffusion model using PyTorch: https://github.com/jinho-choi123/Diffusion-pytorch

Figure: Algorithm of training and sampling (generating images), from the Diffusion Model paper.

References

[1] https://www.youtube.com/watch?v=HoKDTa5jHvg&t=1528s

[2] https://dlaiml.tistory.com/entry/Diffusion-Model-%EC%88%98%EC%8B%9D-%EC%A0%95%EB%A6%AC

[3] https://www.youtube.com/watch?v=a4Yfz2FxXiY&t=1365s

Paper of the Diffusion Model (Denoising Diffusion Probabilistic Models): https://arxiv.org/abs/2006.11239