By adding Gaussian noise to an image step by step, we can turn it into a sample from a Gaussian distribution. By denoising that Gaussian sample step by step, we can generate a high-quality image.
Core
How does the Diffusion Model work?
Forward Process
The joint distribution of the latent vectors can be expressed as follows:
Reverse Process
The joint distribution of all latent vectors and the generated image can be expressed as follows:
What is the training objective?
How can we come up with this inequality?
Let's go through it step by step.
Diving deeper to make the loss simpler
We can make this term even simpler! (Very important!!)
Simplifying the right-hand side
Simplifying the left-hand side
Calculating the KL divergence
This is a reconstruction loss term. We can rewrite this term as follows:
The paper doesn't say much about this term. It is eventually ignored, but I still don't fully understand the reason.
Final training objective
Implementation details
I implemented a simple diffusion model using PyTorch.
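Below is a rough sketch of the shared building blocks, not the actual implementation: the schedule values follow the DDPM paper, and names like `EpsModel` and `alphas_bar` are placeholders I reuse in the snippets further down.

```python
import torch
import torch.nn as nn

T = 1000  # number of diffusion steps

# Linear beta schedule (the values used in the DDPM paper); alphas and their
# cumulative products are precomputed once and reused by the later snippets.
betas = torch.linspace(1e-4, 0.02, T)        # beta_t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)    # alpha_bar_t = product of alpha_s up to t

# Placeholder noise-prediction network eps_theta(x_t, t).
# The real model is a U-Net; a tiny MLP keeps this sketch short.
class EpsModel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t):
        # Append a normalized timestep so the network knows the noise level.
        t_emb = t.float().unsqueeze(-1) / T
        return self.net(torch.cat([x_t, t_emb], dim=-1))
```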
For a given image, we add Gaussian noise step by step. This procedure is called the 'forward process', and is represented as $q(x_t \mid x_{t-1})$. $x_0$ is the original image, and $x_1, x_2, \ldots, x_T$ are latent vectors that have the same dimension as $x_0$.
Adding Gaussian noise can be expressed as follows:
The variances $\beta_t$ are fixed for each step $t$.
$$q(x_t \mid x_{t-1}) := \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \qquad (1)$$
$$q(x_{1:T} \mid x_0) := \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \qquad (2)$$
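Although (1) and (2) define the forward process step by step, the product also has a well-known closed form, $q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$ with $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$ (not derived in this post), so $x_t$ can be sampled in one shot. A minimal sketch, reusing `alphas_bar` from the building blocks above:

```python
import torch

def q_sample(x0, t, alphas_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over the batch
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return x_t, eps
```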
After $T$ steps, we end up with a pure Gaussian distribution $\mathcal{N}(0, I)$. We then perform the 'reverse process' to denoise the Gaussian sample back into a high-quality image. It is represented as $p_\theta(x_{t-1} \mid x_t)$.
Denoising the Gaussian noise can be expressed as follows:
$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \qquad (3)$$
$$p_\theta(x_{0:T}) := p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \qquad (4)$$
$\mu_\theta$ and $\Sigma_\theta$ are outputs of a neural network with parameters $\theta$.
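A sketch of one reverse (denoising) step in code, with $\Sigma_\theta$ fixed to $\sigma_t^2 I = \beta_t I$ (one of the two fixed choices tried in the DDPM paper) and $\mu_\theta$ written in the $\epsilon$-parameterization that appears later in this post:

```python
import torch

@torch.no_grad()
def p_sample_step(eps_model, x_t, t, betas, alphas, alphas_bar):
    """One reverse step: draw x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I)."""
    beta_t, alpha_t, a_bar_t = betas[t], alphas[t], alphas_bar[t]
    t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
    eps_pred = eps_model(x_t, t_batch)                        # predicted noise
    mean = (x_t - beta_t / (1.0 - a_bar_t).sqrt() * eps_pred) / alpha_t.sqrt()
    if t == 0:
        return mean                                           # no noise at the final step
    z = torch.randn_like(x_t)
    return mean + beta_t.sqrt() * z                           # sigma_t^2 = beta_t
```

Starting from $x_T \sim \mathcal{N}(0, I)$ and applying this step for $t = T-1, \ldots, 0$ produces a sample.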
Our goal is to maximize the log-likelihood $p_\theta(x_0)$. The meaning of $p_\theta(x_0)$: given an image $x_0$ and adjustable parameters $\theta$, what is the probability of the model producing $x_0$?
Our ultimate goal is to train $\theta$ to maximize this log-likelihood.
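To make this concrete, $p_\theta(x_0)$ is the marginal of the joint distribution (4) over all latent vectors; this integral is intractable to evaluate directly, which is why we will work with a bound on it instead:
$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$$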
To maximize $p_\theta(x_0)$, we will minimize $-\log p_\theta(x_0)$, the left-hand side of the inequality below, by minimizing its right-hand side.
Since the KL divergence is always non-negative, we can set up the following inequality:
$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + \mathrm{KL}\!\left(q(x_{1:T} \mid x_0),\, p_\theta(x_{1:T} \mid x_0)\right) \qquad (6)$$
$q(x_{1:T} \mid x_0)$ is the distribution of the latent vectors when the input image $x_0$ is given.
$p_\theta(x_{1:T} \mid x_0)$ is the distribution of the latent vectors when the output image $x_0$ is given.
These two latent-vector distributions should be as similar as possible, so we add a KL-divergence term that compares them.
If we take a closer look at the KL-divergence term, we can rewrite it using Bayes' theorem:
$$\mathrm{KL}\!\left(q(x_{1:T} \mid x_0),\, p_\theta(x_{1:T} \mid x_0)\right) = \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)\, p_\theta(x_0)}{p_\theta(x_{0:T})}\right] = \log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \qquad (7)$$
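If the Bayes step in (7) is not obvious, writing it out helps: expanding the conditional gives
$$p_\theta(x_{1:T} \mid x_0) = \frac{p_\theta(x_{0:T})}{p_\theta(x_0)},$$
so the ratio inside the KL becomes $\frac{q(x_{1:T} \mid x_0)\, p_\theta(x_0)}{p_\theta(x_{0:T})}$, and since $\log p_\theta(x_0)$ does not depend on $x_{1:T}$, it can be pulled out of the expectation, which gives the second equality in (7).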
By plugging this back into the inequality, we get the following:
$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + \log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \qquad (8)$$
The $-\log p_\theta(x_0)$ and $+\log p_\theta(x_0)$ terms on the right-hand side cancel, leaving only the expectation.
If we use equations (2) and (4), we can simplify equation (8) as follows:
$$-\log p_\theta(x_0) \le \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right]$$
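To see where this comes from, substitute (2) for $q(x_{1:T} \mid x_0)$ and (4) for $p_\theta(x_{0:T})$ inside the log:
$$\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} = \log \frac{\prod_{t \ge 1} q(x_t \mid x_{t-1})}{p(x_T) \prod_{t \ge 1} p_\theta(x_{t-1} \mid x_t)} = -\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}$$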
Converting (15) -> (16) assumes that $\Sigma_\theta$ is not trainable and is fixed, with $\tilde{\beta}_t = \sigma_q^2(t)$.
During the conversion of (16) -> (17), we set $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$. We can parameterize it this way because we are trying to make $\mu_\theta(x_t, t)$ approximate $\tilde{\mu}_t(x_t, x_0)$ using the reparameterization trick.
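For context (the post mostly skips this, so I am restating the standard DDPM result): the forward-process posterior is also Gaussian, $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$, with
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t .$$
Substituting the forward-process reparameterization $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\big)$ into $\tilde{\mu}_t$ yields $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\big)$, which is exactly the form that $\mu_\theta$ above copies, with $\epsilon$ replaced by the network prediction $\epsilon_\theta(x_t, t)$.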
Minimize the following loss:
$$L := \sum_{t \ge 2} \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert_2^2$$
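In PyTorch, one training step for this objective can be sketched as follows (the DDPM paper samples a single random $t$ per image as a Monte Carlo estimate of the sum over $t$; `eps_model` and `alphas_bar` are the placeholder pieces from the sketches above):

```python
import torch
import torch.nn.functional as F

def training_step(eps_model, x0, alphas_bar, T):
    """Simplified objective: ||eps - eps_theta(x_t, t)||^2 at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))                # one random t per example
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward process, closed form
    eps_pred = eps_model(x_t, t)
    return F.mse_loss(eps_pred, eps)
```

Calling `training_step(...)` inside an ordinary optimizer loop (`loss.backward()`, `optimizer.step()`) is all that is needed for the simplified objective.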
Using Bayes' theorem in the usual way,
$$q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t)\, q(x_t)}{q(x_{t-1})}$$
But the noising and denoising processes of the diffusion model follow the Markov property, which means that only $x_{t-1}$ affects $x_t$ in the noising process, and vice versa in the denoising process.
So we can make the probability of $x_t$ $(t \ge 2)$ conditional on $x_0$:
$$q(x_t \mid x_{t-1}) = q(x_t \mid x_{t-1}, x_0) = \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$
$p(x_T)$ is the distribution of the latent vector $x_T$, which is Gaussian: $p(x_T) = \mathcal{N}(x_T;\ 0,\ I)$.
Also, $q(x_T \mid x_0)$ is approximately a standard Gaussian ($q(x_T \mid x_0) \approx \mathcal{N}(0, I)$), because noise has been added to $x_0$ over many steps.
This means that the distributions $q(x_T \mid x_0)$ and $p(x_T)$ are almost identical.
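This can be made quantitative with the closed form of the forward process (the same identity used in the implementation sketch above):
$$q(x_T \mid x_0) = \mathcal{N}\!\left(x_T;\ \sqrt{\bar{\alpha}_T}\, x_0,\ (1-\bar{\alpha}_T)\, I\right)$$
With the linear schedule $\beta_1 = 10^{-4}, \ldots, \beta_T = 0.02$ and $T = 1000$, $\bar{\alpha}_T$ is on the order of $10^{-5}$, so the mean is essentially $0$ and the variance essentially $I$. The KL divergence between $q(x_T \mid x_0)$ and $p(x_T)$ is therefore negligible and contains no trainable parameters, which is why this term can be dropped from the objective.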