By adding Gaussian noise to an image step by step, we can turn it into a sample from a Gaussian distribution. By denoising that Gaussian sample step by step, we can generate a high-quality image.
Core
How does the Diffusion Model work?
Forward Process
The joint distribution of the latent vectors can be expressed as follows:
Reverse Process
The joint distribution of all latent vectors and the generated image can be expressed as follows:
What is the training objective?
How can we come up with this inequality?
Let's go through it step by step.
Diving deeper to make the loss simpler
We can make this term even simpler! (Very important!!)
Simplifying the right-hand side
Simplifying the left-hand side
Calculating the KL divergence
This is a reconstruction loss term. We can rewrite this term as follows:
The paper doesn't say much about this term. It is eventually ignored, but I still don't fully understand the reason.
Final training objective
Implementation details
I implemented a simple diffusion model using PyTorch.
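Below is a rough sketch of the shared building blocks, not the actual implementation: the schedule values follow the DDPM paper, and names like `EpsModel` and `alphas_bar` are placeholders I reuse in the snippets further down.

```python
import torch
import torch.nn as nn

T = 1000  # number of diffusion steps

# Linear beta schedule (the values used in the DDPM paper); alphas and their
# cumulative products are precomputed once and reused by the later snippets.
betas = torch.linspace(1e-4, 0.02, T)        # beta_t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)    # alpha_bar_t = product of alpha_s up to t

# Placeholder noise-prediction network eps_theta(x_t, t).
# The real model is a U-Net; a tiny MLP keeps this sketch short.
class EpsModel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t):
        # Append a normalized timestep so the network knows the noise level.
        t_emb = t.float().unsqueeze(-1) / T
        return self.net(torch.cat([x_t, t_emb], dim=-1))
```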
For a given image, we add Gaussian noise step by step. This procedure is called the 'forward process', and is represented as $q(x_t \mid x_{t-1})$. $x_0$ is the original image, and $x_1, x_2, \ldots, x_T$ are latent vectors that have the same dimension as $x_0$.
Adding Gaussian noise can be expressed as follows:
The variances $\beta_t$ are fixed for each step $t$.
$$q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \qquad (1)$$
$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \qquad (2)$$
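Although (1) and (2) define the forward process step by step, the product also has a well-known closed form, $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$ with $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$ (not derived in this post), so $x_t$ can be sampled in one shot. A minimal sketch, reusing `alphas_bar` from the building blocks above:

```python
import torch

def q_sample(x0, t, alphas_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over the batch
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps
```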
After $T$ steps, we end up with a pure Gaussian distribution $\mathcal{N}(0, I)$. We then perform the 'reverse process' to denoise the Gaussian sample back into a high-quality image. It is represented as $p_\theta(x_{t-1} \mid x_t)$.
Denoising the Gaussian noise can be expressed as follows:
$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \qquad (3)$$
$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \qquad (4)$$
$\mu_\theta$ and $\Sigma_\theta$ are outputs of a neural network with parameters $\theta$.
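A sketch of one reverse (denoising) step in code, with $\Sigma_\theta$ fixed to $\sigma_t^2 I = \beta_t I$ (one of the two fixed choices tried in the DDPM paper) and $\mu_\theta$ written in the $\epsilon$-parameterization that appears later in this post:

```python
import torch

@torch.no_grad()
def p_sample_step(eps_model, x_t, t, betas, alphas, alphas_bar):
    """One reverse step: draw x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I)."""
    beta_t, alpha_t, a_bar_t = betas[t], alphas[t], alphas_bar[t]
    t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
    eps_pred = eps_model(x_t, t_batch)                        # predicted noise
    mean = (x_t - beta_t / (1.0 - a_bar_t).sqrt() * eps_pred) / alpha_t.sqrt()
    if t == 0:
        return mean                                           # no noise at the final step
    z = torch.randn_like(x_t)
    return mean + beta_t.sqrt() * z                           # sigma_t^2 = beta_t
```

Starting from $x_T \sim \mathcal{N}(0, I)$ and applying this step for $t = T-1, \ldots, 0$ produces a sample.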
Our goal is to maximize the log-likelihood $p_\theta(x_0)$. The meaning of $p_\theta(x_0)$: given an image $x_0$ and adjustable parameters $\theta$, what is the probability of the model producing $x_0$?
Our ultimate goal is to train $\theta$ to maximize this log-likelihood.
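To make this concrete, $p_\theta(x_0)$ is the marginal of the joint distribution (4) over all latent vectors; this integral is intractable to evaluate directly, which is why we will work with a bound on it instead:
$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$$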
To maximize $p_\theta(x_0)$, we will minimize $-\log p_\theta(x_0)$, the left-hand side of the inequality below, by minimizing its right-hand side.
Since the KL divergence is always non-negative, we can set up the following inequality:
$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + \mathrm{KL}\!\left(q(x_{1:T} \mid x_0),\, p_\theta(x_{1:T} \mid x_0)\right) \qquad (6)$$
$q(x_{1:T} \mid x_0)$ is the distribution of the latent vectors when the input image $x_0$ is given.
$p_\theta(x_{1:T} \mid x_0)$ is the distribution of the latent vectors when the output image $x_0$ is given.
These two latent-vector distributions should be as similar as possible, so we add a KL-divergence term that compares them.
If we take a closer look at the KL-divergence term, we can rewrite it using Bayes' theorem:
$$\mathrm{KL}\!\left(q(x_{1:T} \mid x_0),\, p_\theta(x_{1:T} \mid x_0)\right) = \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)\, p_\theta(x_0)}{p_\theta(x_{0:T})}\right] = \log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \qquad (7)$$
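If the Bayes step in (7) is not obvious, writing it out helps: expanding the conditional gives
$$p_\theta(x_{1:T} \mid x_0) = \frac{p_\theta(x_{0:T})}{p_\theta(x_0)},$$
so the ratio inside the KL becomes $\frac{q(x_{1:T} \mid x_0)\, p_\theta(x_0)}{p_\theta(x_{0:T})}$, and since $\log p_\theta(x_0)$ does not depend on $x_{1:T}$, it can be pulled out of the expectation, which gives the second equality in (7).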
By plugging this back into the inequality, we get the following:
$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + \log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \qquad (8)$$
The $-\log p_\theta(x_0)$ and $+\log p_\theta(x_0)$ terms on the right-hand side cancel, leaving only the expectation.
If we use equations (2) and (4), we can simplify equation (8) as follows:
$$-\log p_\theta(x_0) \le \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right]$$
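To see where this comes from, substitute (2) for $q(x_{1:T} \mid x_0)$ and (4) for $p_\theta(x_{0:T})$ inside the log:
$$\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} = \log \frac{\prod_{t \ge 1} q(x_t \mid x_{t-1})}{p(x_T) \prod_{t \ge 1} p_\theta(x_{t-1} \mid x_t)} = -\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}$$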
Converting (15) -> (16) assumes that $\Sigma_\theta$ is not trainable and is fixed, with $\tilde{\beta}_t = \sigma_q^2(t)$.
During the conversion of (16) -> (17), we set $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$. We can parameterize it this way because we are trying to make $\mu_\theta(x_t, t)$ approximate $\tilde{\mu}_t(x_t, x_0)$ using the reparameterization trick.
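For context (the post mostly skips this, so I am restating the standard DDPM result): the forward-process posterior is also Gaussian, $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$, with
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t .$$
Substituting the forward-process reparameterization $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\big)$ into $\tilde{\mu}_t$ yields $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\big)$, which is exactly the form that $\mu_\theta$ above copies, with $\epsilon$ replaced by the network prediction $\epsilon_\theta(x_t, t)$.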
Minimize the following loss:
$$L := \sum_{t \ge 2} \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert_2^2$$
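In PyTorch, one training step for this objective can be sketched as follows (the DDPM paper samples a single random $t$ per image as a Monte Carlo estimate of the sum over $t$; `eps_model` and `alphas_bar` are the placeholder pieces from the sketches above):

```python
import torch
import torch.nn.functional as F

def training_step(eps_model, x0, alphas_bar, T):
    """Simplified objective: ||eps - eps_theta(x_t, t)||^2 at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))                # one random t per example
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward process, closed form
    eps_pred = eps_model(x_t, t)
    return F.mse_loss(eps_pred, eps)
```

Calling `training_step(...)` inside an ordinary optimizer loop (`loss.backward()`, `optimizer.step()`) is all that is needed for the simplified objective.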
Using Bayes' theorem in the usual way,
$$q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t)\, q(x_t)}{q(x_{t-1})}$$
But the noising and denoising processes of the diffusion model follow the Markov property, which means that only $x_{t-1}$ affects $x_t$ in the noising process, and vice versa in the denoising process.
So we can make the probability of $x_t$ $(t \ge 2)$ conditional on $x_0$:
$$q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$
$p(x_T)$ is the distribution of the latent vector $x_T$, which is Gaussian: $p(x_T) = \mathcal{N}(x_T;\ 0,\ I)$.
Also, $q(x_T \mid x_0)$ is approximately a standard Gaussian ($q(x_T \mid x_0) \approx \mathcal{N}(0, I)$), because noise has been added to $x_0$ over many steps.
This means that the distributions $q(x_T \mid x_0)$ and $p(x_T)$ are almost identical.
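This can be made quantitative with the closed form of the forward process (the same identity used in the implementation sketch above):
$$q(x_T \mid x_0) = \mathcal{N}\!\left(x_T;\ \sqrt{\bar{\alpha}_T}\, x_0,\ (1-\bar{\alpha}_T)\, I\right)$$
With the linear schedule $\beta_1 = 10^{-4}, \ldots, \beta_T = 0.02$ and $T = 1000$, $\bar{\alpha}_T$ is on the order of $10^{-5}$, so the mean is essentially $0$ and the variance essentially $I$. The KL divergence between $q(x_T \mid x_0)$ and $p(x_T)$ is therefore negligible and contains no trainable parameters, which is why this term can be dropped from the objective.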