Denoising Diffusion Probabilistic Models (2020)
June 15, 2025

Denoising Diffusion Probabilistic Models generate high-quality images by learning to reverse a gradual noising process. The forward process is a fixed Markov chain that incrementally corrupts the data by adding Gaussian noise. The reverse process is a parameterized Markov chain, trained to denoise the corrupted inputs and generate samples. The authors demonstrate that training the model to predict the added noise minimizes a variational upper bound on the negative log-likelihood.
The forward (diffusion) process is defined as a sequence \(\textbf{x}_0, \dots, \textbf{x}_T\), where \(\textbf{x}_0 \sim q(\textbf{x}_0)\) represents the original data and \(\textbf{x}_T\) is nearly pure noise:
$$ p(\textbf{x}\_T)=\mathcal{N}(\textbf{x}\_T;\boldsymbol{0}, \textrm{\textbf{I}}) $$At each step, Gaussian noise is added to the data:
$$ \begin{align} q(\textbf{x}\_{1:T}|\textbf{x}\_0)&:=\prod^T\_{t=1}q(\textbf{x}\_t|\textbf{x}\_{t-1})\\\\ q(\textbf{x}\_t|\textbf{x}\_{t-1})&:=\mathcal{N}(\textbf{x}\_t;\sqrt{1-\beta\_t}\textbf{x}\_{t-1},\beta\_t\textbf{I}) \end{align} $$where the variance schedule \(\beta_1, \dots, \beta_T\) can be fixed as hyperparameters or learned.
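A single forward step can be sketched numerically. This is a minimal NumPy illustration; the linear schedule from \(10^{-4}\) to \(0.02\) with \(T=1000\) follows the paper's choice, and the arrays are zero-indexed, so `betas[t]` corresponds to the paper's \(\beta_{t+1}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule (hyperparameters from the paper): T = 1000 steps,
# beta ramping from 1e-4 to 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

x0 = rng.standard_normal((3, 32, 32))  # a toy "image" tensor
x1 = forward_step(x0, t=0)
```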
The reverse process, learned through training, is another Markov chain:
$$ \begin{align} p_\theta(\textbf{x}\_0)&:=\int p\_\theta (\textbf{x}\_{0:T})d\textbf{x}\_{1:T}\\\\ p\_\theta(\textbf{x}\_{0:T})&:=p(\textbf{x}\_T)\prod^T\_{t=1}p\_\theta (\textbf{x}\_{t-1}|\textbf{x}\_t)\\\\ p\_\theta(\textbf{x}\_{t-1}|\textbf{x}\_t)&:= \mathcal{N}(\textbf{x}\_{t-1};\boldsymbol{\mu}\_\theta (\textbf{x}\_t,t),\boldsymbol{\Sigma}\_\theta(\textbf{x}\_t,t)) \end{align} $$The model is trained by minimizing the standard variational upper bound on the negative log-likelihood:
$$ \mathbb{E}\left[-\log p\_\theta(\textbf{x}\_0)\right]\le\mathbb{E}\_q\left[-\log\frac{p\_\theta(\textbf{x}\_{0:T})}{q(\textbf{x}\_{1:T}|\textbf{x}\_0)}\right]=\mathbb{E}\_q\left[-\log p(\textbf{x}\_T)-\sum\_{t\ge 1}\log\frac{p\_\theta(\textbf{x}\_{t-1}|\textbf{x}\_t)}{q(\textbf{x}\_t|\textbf{x}\_{t-1})}\right]:=L $$This can be rewritten using KL divergence as:
$$ L=\mathbb{E}\_q\left[D\_{\textrm{KL}}(q(\textbf{x}\_T|\textbf{x}\_0)||p(\textbf{x}\_T))+\sum\_{t>1}D\_{\textrm{KL}}(q(\textbf{x}\_{t-1}|\textbf{x}\_t,\textbf{x}\_0)||p\_\theta (\textbf{x}\_{t-1}|\textbf{x}\_t))-\log p\_\theta(\textbf{x}\_0|\textbf{x}\_1)\right] $$The term involving \(q(\textbf{x}_T|\textbf{x}_0)\) can be ignored during training if \(\beta_t\) is fixed, because it does not depend on any learnable parameters. The forward process allows closed-form sampling at any timestep:
$$ \begin{equation} q(\textbf{x}\_t|\textbf{x}\_0)=\mathcal{N}(\textbf{x}\_t;\sqrt{\bar{\alpha}\_t}\textbf{x}\_0,(1-\bar{\alpha}\_t)\textbf{I}) \end{equation} $$where \(\alpha_t:=1-\beta_t\), \(\bar{\alpha}_t:=\prod^t_{s=1}\alpha_s\).
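The closed-form marginal makes it cheap to jump straight to any timestep rather than iterating the chain. A sketch reusing the same linear schedule (note that \(\bar{\alpha}_T\) is nearly zero, so \(\textbf{x}_T\) is essentially pure noise):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod of alpha_s for s <= t

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in one shot instead of t sequential steps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((3, 32, 32))
xT = q_sample(x0, T - 1)  # at t = T the signal coefficient is negligible
```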
To minimize \(L_{t-1}:=\mathbb{E}_q\left[D_{\textrm{KL}}(q(\textbf{x}_{t-1}|\textbf{x}_t,\textbf{x}_0)||p_\theta (\textbf{x}_{t-1}|\textbf{x}_t))\right]\), the KL divergence at each step, we use a closed-form expression for the posterior:
$$ q(\textbf{x}\_{t-1}|\textbf{x}\_t,\textbf{x}\_0) = \mathcal{N}(\textbf{x}\_{t-1};\tilde{\boldsymbol{\mu}}\_t(\textbf{x}\_t,\textbf{x}\_0),\tilde{\beta}\_t\textbf{I}) $$with
$$ \begin{align*} \tilde{\boldsymbol{\mu}}\_t(\textbf{x}\_t,\textbf{x}\_0)&:=\frac{\sqrt{\bar{\alpha}\_{t-1}}\beta\_t}{1-\bar{\alpha}\_t}\textbf{x}\_0+\frac{\sqrt{\alpha\_t}(1-\bar{\alpha}\_{t-1})}{1-\bar{\alpha}\_t}\textbf{x}\_t\\\\ \tilde{\beta}\_t&:=\frac{1-\bar{\alpha}\_{t-1}}{1-\bar{\alpha}\_t}\beta\_{t} \end{align*} $$Let \(\textbf{x}_t(\textbf{x}_0, \boldsymbol{\epsilon}) = \sqrt{\bar{\alpha}_t} \textbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\), with \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \textbf{I})\). The objective simplifies to:
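The posterior parameters can be computed directly from the schedule; a small sketch under the same schedule assumptions as above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_params(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), valid for t >= 1."""
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]  # beta_tilde_t
    return mean, var

mean, var = posterior_params(np.ones((2, 2)), np.ones((2, 2)), 500)
```

Note that \(\tilde{\beta}_t < \beta_t\): conditioning on \(\textbf{x}_0\) shrinks the per-step variance.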
$$ \begin{align*} L\_{t-1}-C &= \mathbb{E}\_q\left[\frac{1}{2\sigma^2\_t}||\tilde{\boldsymbol{\mu}}\_t(\textbf{x}\_t,\textbf{x}\_0)-\boldsymbol{\mu}\_\theta (\textbf{x}\_t, t)||^2\right]\\\\ &=\mathbb{E}\_{\textbf{x}\_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma^2\_t}\left|\left|\frac{1}{\sqrt{\alpha\_t}}\left(\textbf{x}\_t(\textbf{x}\_0,\boldsymbol{\epsilon})-\frac{\beta\_t}{\sqrt{1-\bar{\alpha}\_t}}\boldsymbol{\epsilon}\right)-\boldsymbol{\mu}\_{\theta}(\textbf{x}\_t(\textbf{x}\_0,\boldsymbol{\epsilon}),t)\right|\right|^2\right] \end{align*} $$The above equation implies that \(\boldsymbol{\mu}_\theta\) should be trained to approximate \(\frac{1}{\sqrt{\alpha_t}}\left(\textbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\right)\). Since \(\textbf{x}_t\) is available as input to the model, \(\boldsymbol{\mu}_\theta\) can be parameterized to predict this quantity through a noise-prediction network \(\boldsymbol{\epsilon}_\theta\), which puts \(L_{t-1}-C\) in closed form:
$$ L\_{t-1}-C=\mathbb{E}\_{\textbf{x}\_0,\boldsymbol{\epsilon}}\left[\frac{\beta^2\_t}{2\sigma^2\_t\alpha\_t(1-\bar{\alpha}\_t)}\left|\left|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}\_\theta(\sqrt{\bar{\alpha}\_t}\textbf{x}\_0+\sqrt{1-\bar{\alpha}\_t}\boldsymbol{\epsilon},t)\right|\right|^2\right] $$In short, training \(\boldsymbol{\mu}_\theta\) to predict \(\tilde{\boldsymbol{\mu}}_t\) is equivalent to learning to predict \(\boldsymbol{\epsilon}\).
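The paper further proposes a simplified objective that drops the weighting factor in front of the norm, reducing training to a plain mean-squared error on the noise. A sketch of one training step, with a hypothetical placeholder `eps_model` standing in for the trained \(\boldsymbol{\epsilon}_\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return np.zeros_like(xt)

def training_loss(x0):
    """One Monte Carlo estimate of the simplified (unweighted) objective."""
    t = rng.integers(0, T)               # uniform random timestep
    eps = rng.standard_normal(x0.shape)  # the noise the network must recover
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

loss = training_loss(rng.standard_normal((3, 32, 32)))
```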
To compute the discrete log-likelihood of \(p_\theta(\textbf{x}_0|\textbf{x}_1)\), pixel values in \(\{0, 1, \dots, 255\}\) are linearly scaled to \([-1, 1]\), and an independent discrete decoder is used to model the distribution:
$$ p\_\theta(\textbf{x}\_0|\textbf{x}\_1)=\prod^D\_{i=1}\int^{\delta\_+(x^i\_0)}\_{\delta\_-(x^i\_0)}\mathcal{N}(x;\mu^i\_\theta(\textbf{x}\_1,1),\sigma^2\_1)dx $$where \(D\) is the data dimensionality, \(i\) indexes coordinates, and the integration bounds are defined as:
$$ \begin{align*} \delta\_+(x)&=\begin{cases} \infty&\text{if}\ x=1\\\\ x+\frac{1}{255}&\text{if}\ x < 1 \end{cases}\\\\ \delta\_-(x)&=\begin{cases} -\infty&\text{if}\ x=-1\\\\ x-\frac{1}{255}&\text{if}\ x > -1 \end{cases}\\\\ \end{align*} $$During sampling, the denoising process begins from \(\textbf{x}_T \sim \mathcal{N}(0, \textbf{I})\) as:
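The decoder log-likelihood thus integrates the Gaussian density over each pixel's bin. A sketch for a single pixel, using the error function for the normal CDF (the `1e-12` floor is a numerical-stability assumption, not from the paper):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pixel_log_prob(x, mu, sigma):
    """log p(x) for one pixel value x in [-1, 1], integrating the Gaussian
    density over the pixel's bin of width 2/255 (unbounded at the edges)."""
    upper = math.inf if x >= 1.0 else x + 1.0 / 255.0
    lower = -math.inf if x <= -1.0 else x - 1.0 / 255.0
    cdf_hi = 1.0 if upper == math.inf else normal_cdf(upper, mu, sigma)
    cdf_lo = 0.0 if lower == -math.inf else normal_cdf(lower, mu, sigma)
    return math.log(max(cdf_hi - cdf_lo, 1e-12))  # floor avoids log(0)
```

Because the bins tile the real line, the probabilities over all 256 pixel values sum to one, so this defines a proper discrete distribution.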
$$ \textbf{x}\_{t-1} = \frac{1}{\sqrt{\alpha\_t}}\left(\textbf{x}\_t-\frac{\beta\_t}{\sqrt{1-\bar{\alpha}\_t}}\boldsymbol{\epsilon}\_\theta(\textbf{x}\_t,t)\right)+\sigma\_t\textbf{z},\ \textbf{z}\sim\mathcal{N}(\textbf{0}, \textbf{I})
The full procedures for training and sampling are summarized in Algorithms 1 and 2 of the paper: repeatedly sample a timestep and a noise vector, take a gradient step on the noise-prediction loss; then generate by iterating the denoising step from \(\textbf{x}_T\) down to \(\textbf{x}_0\).
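Putting the pieces together, the ancestral sampling loop can be sketched as follows, again with a hypothetical placeholder `eps_model` for the trained network, and with \(\sigma_t^2=\beta_t\) (one of the two variance choices considered in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return np.zeros_like(xt)

def sample(shape):
    """Ancestral sampling: start from x_T ~ N(0, I), then apply T reverse
    steps with sigma_t^2 = beta_t; no noise is added on the final step."""
    x = rng.standard_normal(shape)
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z
    return x

img = sample((3, 32, 32))
```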
