Denoising Diffusion Probabilistic Models (2020)
June 15, 2025
Denoising Diffusion Probabilistic Models generate high-quality images by learning to reverse a gradual noising process. The forward process is a fixed Markov chain that incrementally corrupts the data by adding Gaussian noise. The reverse process is a parameterized Markov chain, trained to denoise the corrupted inputs and generate samples. The authors demonstrate that training the model to predict the added noise minimizes a variational upper bound on the negative log-likelihood.
The forward (diffusion) process is defined as a sequence \(\textbf{x}_0, \dots, \textbf{x}_T\), where \(\textbf{x}_0 \sim q(\textbf{x}_0)\) represents the original data and \(\textbf{x}_T\) is nearly pure noise: $$ p(\textbf{x}_T)=\mathcal{N}(\textbf{x}_T;\boldsymbol{0}, \textrm{\textbf{I}}) $$ At each step, Gaussian noise is added to the data: $$ \begin{align} q(\textbf{x}_{1:T}|\textbf{x}_0)&:=\prod^T_{t=1}q(\textbf{x}_t|\textbf{x}_{t-1})\\ q(\textbf{x}_t|\textbf{x}_{t-1})&:=\mathcal{N}(\textbf{x}_t;\sqrt{1-\beta_t}\textbf{x}_{t-1},\beta_t\textbf{I}) \end{align} $$ where \(\beta_t\) can be fixed as hyperparameters.
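The step-by-step corruption can be sketched in a few lines of NumPy. This is only an illustration: the linear schedule from `1e-4` to `0.02` with `T = 1000` is an assumed choice of hyperparameters, and `make_beta_schedule`, `forward_step` and `forward_chain` are hypothetical helper names, not part of any library.

```python
import numpy as np

def make_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T (assumed values, see lead-in)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Run the whole Markov chain and return [x_0, x_1, ..., x_T]."""
    xs = [x0]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs

rng = np.random.default_rng(0)
betas = make_beta_schedule()
x0 = rng.uniform(-1.0, 1.0, size=4096)  # stand-in for data scaled to [-1, 1]
xs = forward_chain(x0, betas, rng)
x_T = xs[-1]                            # should look like a standard normal draw
```

With this schedule the final state has mean close to 0 and standard deviation close to 1, which is what makes \(\mathcal{N}(\boldsymbol{0}, \textbf{I})\) a reasonable prior for \(\textbf{x}_T\).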
The reverse process, learned through training, is another Markov chain: $$ \begin{align} p_\theta(\textbf{x}_0)&:=\int p_\theta (\textbf{x}_{0:T})d\textbf{x}_{1:T}\\ p_\theta(\textbf{x}_{0:T})&:=p(\textbf{x}_T)\prod^T_{t=1}p_\theta (\textbf{x}_{t-1}|\textbf{x}_t)\\ p_\theta(\textbf{x}_{t-1}|\textbf{x}_t)&:= \mathcal{N}(\textbf{x}_{t-1};\boldsymbol{\mu}_\theta (\textbf{x}_t,t),\boldsymbol{\Sigma}_\theta(\textbf{x}_t,t)) \end{align} $$
The model is trained by minimizing the standard variational upper bound on the negative log-likelihood:
$$ \mathbb{E}\left[-\log p_\theta(\textbf{x}_0)\right]\le\mathbb{E}_q\left[-\log\frac{p_\theta(\textbf{x}_{0:T})}{q(\textbf{x}_{1:T}|\textbf{x}_0)}\right]=\mathbb{E}_q\left[-\log p(\textbf{x}_T)-\sum_{t\ge 1}\log\frac{p_\theta(\textbf{x}_{t-1}|\textbf{x}_t)}{q(\textbf{x}_t|\textbf{x}_{t-1})}\right]:=L $$
This can be rewritten using KL divergence as: $$ L=\mathbb{E}_q\left[D_{\textrm{KL}}(q(\textbf{x}_T|\textbf{x}_0)||p(\textbf{x}_T))+\sum_{t>1}D_{\textrm{KL}}(q(\textbf{x}_{t-1}|\textbf{x}_t,\textbf{x}_0)||p_\theta (\textbf{x}_{t-1}|\textbf{x}_t))-\log p_\theta(\textbf{x}_0|\textbf{x}_1)\right] $$
The term involving \(q(\textbf{x}_T|\textbf{x}_0)\) can be ignored during training if \(\beta_t\) is fixed, because it does not depend on any learnable parameters. The forward process allows closed-form sampling at any timestep: $$ \begin{equation} q(\textbf{x}_t|\textbf{x}_0)=\mathcal{N}(\textbf{x}_t;\sqrt{\bar{\alpha}_t}\textbf{x}_0,(1-\bar{\alpha}_t)\textbf{I}) \end{equation} $$ where \(\alpha_t:=1-\beta_t\), \(\bar{\alpha}_t:=\prod^t_{s=1}\alpha_s\).
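The closed form means \(\textbf{x}_t\) can be drawn in one shot instead of iterating \(t\) steps. A minimal sketch follows, again assuming a linear \(\beta_t\) schedule; `q_sample` is a hypothetical helper name and timesteps are 1-indexed to match the notation.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule, not from the text
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bars[t - 1] = prod_{s=1..t} alpha_s

def q_sample(x0, t, eps):
    """Return x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps for 1 <= t <= T."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=16)
eps = rng.standard_normal(16)
x_500 = q_sample(x0, 500, eps)       # a sample from q(x_500 | x_0)
```

The variance \(1-\bar{\alpha}_t\) is what the one-step recursion \(v_t=\alpha_t v_{t-1}+\beta_t\) with \(v_0=0\) accumulates, which is a quick way to sanity-check a schedule.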
To minimize \(L_{t-1}:=\mathbb{E}_q\left[D_{\textrm{KL}}(q(\textbf{x}_{t-1}|\textbf{x}_t,\textbf{x}_0)||p_\theta (\textbf{x}_{t-1}|\textbf{x}_t))\right]\), the KL divergence at each step, we use a closed-form expression for the posterior: $$ q(\textbf{x}_{t-1}|\textbf{x}_t,\textbf{x}_0) = \mathcal{N}(\textbf{x}_{t-1};\tilde{\boldsymbol{\mu}}_t(\textbf{x}_t,\textbf{x}_0),\tilde{\beta}_t\textbf{I}) $$ with $$ \begin{align*} \tilde{\boldsymbol{\mu}}_t(\textbf{x}_t,\textbf{x}_0)&:=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\textbf{x}_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\textbf{x}_t\\ \tilde{\beta}_t&:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_{t} \end{align*} $$
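The posterior parameters translate directly into code. The sketch below assumes a linear \(\beta_t\) schedule and uses the convention \(\bar{\alpha}_0=1\) for the \(t=1\) case; `posterior_mean_variance` is a hypothetical helper name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule, not from the text
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_variance(x0, x_t, t):
    """Mean and scalar variance of q(x_{t-1} | x_t, x_0); t is 1-indexed."""
    beta_t = betas[t - 1]
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0  # convention: abar_0 = 1
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    variance = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return mean, variance

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=8)
x_t = rng.standard_normal(8)
mean_300, var_300 = posterior_mean_variance(x0, x_t, 300)
mean_1, var_1 = posterior_mean_variance(x0, x_t, 1)
```

At \(t=1\) the posterior collapses onto \(\textbf{x}_0\) (mean \(\textbf{x}_0\), variance 0), and for \(t>1\) the variance \(\tilde{\beta}_t\) is strictly smaller than \(\beta_t\).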
Let \(\textbf{x}_t(\textbf{x}_0, \boldsymbol{\epsilon}) = \sqrt{\bar{\alpha}_t} \textbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\), with \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \textbf{I})\), and fix the reverse-process covariance to \(\boldsymbol{\Sigma}_\theta(\textbf{x}_t,t)=\sigma^2_t\textbf{I}\). Both distributions in the KL term are then Gaussian, so the objective reduces to a squared distance between means, up to a constant \(C\) that does not depend on \(\theta\): $$ \begin{align*} L_{t-1}-C &= \mathbb{E}_q\left[\frac{1}{2\sigma^2_t}||\tilde{\boldsymbol{\mu}}_t(\textbf{x}_t,\textbf{x}_0)-\boldsymbol{\mu}_\theta (\textbf{x}_t, t)||^2\right]\\ &=\mathbb{E}_{\textbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma^2_t}\left|\left|\frac{1}{\sqrt{\alpha_t}}\left(\textbf{x}_t(\textbf{x}_0,\boldsymbol{\epsilon})-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\right)-\boldsymbol{\mu}_{\theta}(\textbf{x}_t(\textbf{x}_0,\boldsymbol{\epsilon}),t)\right|\right|^2\right] \end{align*} $$ The above equation implies that \(\boldsymbol{\mu}_\theta\) should be trained to approximate \(\frac{1}{\sqrt{\alpha_t}}\left(\textbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\right)\). Because \(\textbf{x}_t\) is given as input to the model, the only unknown is \(\boldsymbol{\epsilon}\), so the mean can be parameterized through a noise predictor \(\boldsymbol{\epsilon}_\theta\): $$ \boldsymbol{\mu}_\theta(\textbf{x}_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(\textbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\textbf{x}_t,t)\right) $$ Substituting this parameterization into the objective gives: $$ L_{t-1}-C=\mathbb{E}_{\textbf{x}_0,\boldsymbol{\epsilon}}\left[\frac{\beta^2_t}{2\sigma^2_t\alpha_t(1-\bar{\alpha}_t)}\left|\left|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\textbf{x}_0+\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},t)\right|\right|^2\right] $$ In short, training \(\boldsymbol{\mu}_\theta\) to predict \(\tilde{\boldsymbol{\mu}}_t\) is equivalent to learning to predict \(\boldsymbol{\epsilon}\).
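The equivalence between the two forms of the mean can be checked numerically. The sketch below assumes a linear \(\beta_t\) schedule; `posterior_mean`, `mean_from_noise` and `noise_loss_weight` are hypothetical helper names introduced only for this check.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule, not from the text
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean(x0, x_t, t):
    """mu_tilde_t(x_t, x_0), the mean of q(x_{t-1} | x_t, x_0)."""
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
    denom = 1.0 - alpha_bars[t - 1]
    return (np.sqrt(a_bar_prev) * betas[t - 1] / denom * x0
            + np.sqrt(alphas[t - 1]) * (1.0 - a_bar_prev) / denom * x_t)

def mean_from_noise(x_t, eps, t):
    """(x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)."""
    scale = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
    return (x_t - scale * eps) / np.sqrt(alphas[t - 1])

def noise_loss_weight(t, sigma2_t):
    """Weight beta_t^2 / (2 sigma_t^2 alpha_t (1 - abar_t)) on ||eps - eps_theta||^2."""
    return betas[t - 1] ** 2 / (2.0 * sigma2_t * alphas[t - 1] * (1.0 - alpha_bars[t - 1]))

rng = np.random.default_rng(0)
t = 500
x0 = rng.uniform(-1.0, 1.0, size=8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

# Gap between the two expressions for the target mean; should be ~0.
max_gap = np.max(np.abs(posterior_mean(x0, x_t, t) - mean_from_noise(x_t, eps, t)))
```

Plugging any guess for \(\boldsymbol{\epsilon}\) into `mean_from_noise` and comparing against `posterior_mean` reproduces the weighted squared error on the noise, which is the identity used in the last equation above.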
To compute the discrete log-likelihood of \(p_\theta(\textbf{x}_0|\textbf{x}_1)\), pixel values in \(\{0, 1, \dots, 255\}\) are linearly scaled to \([-1, 1]\), and an independent decoder is used to model the distribution:
$$
\textbf{x}_0\sim \mathcal{N}(\boldsymbol{\mu}_\theta(\textbf{x}_1,1), \sigma^2_1\textbf{I})
$$
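The probability of each pixel is the Gaussian mass of the bin around its value, taken independently over the \(D\) data dimensions:
$$
p_\theta(\textbf{x}_0|\textbf{x}_1)=\prod^D_{i=1}\int^{\delta_+(x^i_0)}_{\delta_-(x^i_0)}\mathcal{N}(x;\mu^i_\theta(\textbf{x}_1,1),\sigma^2_1)dx
$$
with integration bounds defined as: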
$$
\begin{align*}
\delta_+(x)&=\begin{cases}
\infty&\text{if}\ x=1\\
x+\frac{1}{255}&\text{if}\ x < 1
\end{cases}\\
\delta_-(x)&=\begin{cases}
-\infty&\text{if}\ x=-1\\
x-\frac{1}{255}&\text{if}\ x > -1
\end{cases}\\
\end{align*}
$$
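The bin integration can be sketched with the Gaussian CDF. This is an illustrative version only: it works from integer pixel values to avoid floating-point comparisons at the edges, uses `math.erf` from the standard library, and the names `gaussian_cdf`, `decoder_bin_probs` and `decoder_log_likelihood` are hypothetical. The small clipping constant is an assumption added for numerical safety.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise erf; math.erf accepts +/- infinity

def gaussian_cdf(x, mean, std):
    """CDF of N(mean, std^2) evaluated element-wise at x."""
    return 0.5 * (1.0 + _erf((x - mean) / (std * math.sqrt(2.0))))

def decoder_bin_probs(pixels, mean, std):
    """Probability of each integer pixel in {0..255} under the discretized Gaussian."""
    pixels = np.asarray(pixels)
    x = pixels / 127.5 - 1.0                               # scale to [-1, 1]
    upper = np.where(pixels == 255, np.inf, x + 1.0 / 255.0)   # delta_+
    lower = np.where(pixels == 0, -np.inf, x - 1.0 / 255.0)    # delta_-
    return gaussian_cdf(upper, mean, std) - gaussian_cdf(lower, mean, std)

def decoder_log_likelihood(pixels, mean, std):
    """Sum of per-pixel log-probabilities (independent dimensions)."""
    probs = decoder_bin_probs(pixels, mean, std)
    return float(np.sum(np.log(np.clip(probs, 1e-12, None))))

all_values = np.arange(256)
probs = decoder_bin_probs(all_values, mean=0.3, std=0.1)
total_mass = float(probs.sum())
```

Because the bins tile the real line, with the two edge bins extended to infinity, the probabilities over all 256 values sum to 1 for any mean and standard deviation.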
Sampling starts from \(\textbf{x}_T \sim \mathcal{N}(0, \textbf{I})\) and applies the following denoising update for \(t = T, \dots, 1\):
$$
\textbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\textbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\textbf{x}_t,t)\right)+\sigma_t\textbf{z},\ \textbf{z}\sim\mathcal{N}(\textbf{0}, \textbf{I})
$$
$$
The full procedure for training and sampling is illustrated below:
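A rough NumPy sketch of both procedures is given here, under several assumptions that are not stated in the text above: a linear \(\beta_t\) schedule, \(\sigma^2_t=\beta_t\), no noise added at the final step \(t=1\), and a loss that drops the per-timestep weight. `eps_model` stands for any callable mapping `(x_t, t)` to a noise estimate; `zero_eps_model` is a placeholder, not a trained network, and no gradient step is shown.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def training_loss(eps_model, x0_batch, rng):
    """Monte Carlo estimate of the unweighted noise-prediction loss for one batch."""
    batch = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch)               # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0_batch.shape)
    # Reshape abar_t so it broadcasts over the non-batch dimensions.
    a_bar = alpha_bars[t - 1].reshape((batch,) + (1,) * (x0_batch.ndim - 1))
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

def sample(eps_model, shape, rng):
    """Run the reverse chain from x_T ~ N(0, I) down to x_0."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        sigma_t = np.sqrt(betas[t - 1])                  # assumes sigma_t^2 = beta_t
        eps_hat = eps_model(x, np.full(shape[0], t))
        scale = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
        x = (x - scale * eps_hat) / np.sqrt(alphas[t - 1]) + sigma_t * z
    return x

def zero_eps_model(x_t, t):
    """Placeholder predictor that always outputs zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0_batch = rng.uniform(-1.0, 1.0, size=(8, 64))
loss = training_loss(zero_eps_model, x0_batch, rng)
samples = sample(zero_eps_model, (2, 4), rng)
```

In practice `eps_model` would be a neural network and `training_loss` would be minimized with a gradient-based optimizer; with the zero placeholder the loss is simply the mean of \(\boldsymbol{\epsilon}^2\), and the samples are not meaningful.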
