Denoising Diffusion Probabilistic Models (2020)
June 15, 2025

Denoising Diffusion Probabilistic Models generate high-quality images by learning to reverse a gradual noising process. The forward process is a fixed Markov chain that incrementally corrupts the data by adding Gaussian noise. The reverse process is a parameterized Markov chain, trained to denoise the corrupted inputs and generate samples. The authors demonstrate that training the model to predict the added noise minimizes a variational upper bound on the negative log-likelihood.
The forward (diffusion) process is defined as a sequence \(\textbf{x}_0, \dots, \textbf{x}_T\), where \(\textbf{x}_0 \sim q(\textbf{x}_0)\) represents the original data and \(\textbf{x}_T\) is nearly pure noise:
$$ p(\textbf{x}\_T)=\mathcal{N}(\textbf{x}\_T;\boldsymbol{0}, \textrm{\textbf{I}}) $$At each step, Gaussian noise is added to the data:
$$ \begin{align} q(\textbf{x}\_{1:T}|\textbf{x}\_0)&:=\prod^T\_{t=1}q(\textbf{x}\_t|\textbf{x}\_{t-1})\\\\ q(\textbf{x}\_t|\textbf{x}\_{t-1})&:=\mathcal{N}(\textbf{x}\_t;\sqrt{1-\beta\_t}\textbf{x}\_{t-1},\beta\_t\textbf{I}) \end{align} $$where the variance schedule \(\beta_1, \dots, \beta_T\) can be fixed as hyperparameters or learned.
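A single forward step can be sketched numerically. This is a minimal NumPy illustration; the linear schedule from \(10^{-4}\) to \(0.02\) with \(T=1000\) follows the paper's choice, and the arrays are zero-indexed, so `betas[t]` corresponds to the paper's \(\beta_{t+1}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule (hyperparameters from the paper): T = 1000 steps,
# beta ramping from 1e-4 to 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

x0 = rng.standard_normal((3, 32, 32))  # a toy "image" tensor
x1 = forward_step(x0, t=0)
```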
The reverse process, learned through training, is another Markov chain:
$$ \begin{align} p_\theta(\textbf{x}\_0)&:=\int p\_\theta (\textbf{x}\_{0:T})d\textbf{x}\_{1:T}\\\\ p\_\theta(\textbf{x}\_{0:T})&:=p(\textbf{x}\_T)\prod^T\_{t=1}p\_\theta (\textbf{x}\_{t-1}|\textbf{x}\_t)\\\\ p\_\theta(\textbf{x}\_{t-1}|\textbf{x}\_t)&:= \mathcal{N}(\textbf{x}\_{t-1};\boldsymbol{\mu}\_\theta (\textbf{x}\_t,t),\boldsymbol{\Sigma}\_\theta(\textbf{x}\_t,t)) \end{align} $$The model is trained by minimizing the standard variational upper bound on the negative log-likelihood:
$$ \mathbb{E}\left[-\log p\_\theta(\textbf{x}\_0)\right]\le\mathbb{E}\_q\left[-\log\frac{p\_\theta(\textbf{x}\_{0:T})}{q(\textbf{x}\_{1:T}|\textbf{x}\_0)}\right]=\mathbb{E}\_q\left[-\log p(\textbf{x}\_T)-\sum\_{t\ge 1}\log\frac{p\_\theta(\textbf{x}\_{t-1}|\textbf{x}\_t)}{q(\textbf{x}\_t|\textbf{x}\_{t-1})}\right]:=L $$This can be rewritten using KL divergence as:
$$ L=\mathbb{E}\_q\left[D\_{\textrm{KL}}(q(\textbf{x}\_T|\textbf{x}\_0)||p(\textbf{x}\_T))+\sum\_{t>1}D\_{\textrm{KL}}(q(\textbf{x}\_{t-1}|\textbf{x}\_t,\textbf{x}\_0)||p\_\theta (\textbf{x}\_{t-1}|\textbf{x}\_t))-\log p\_\theta(\textbf{x}\_0|\textbf{x}\_1)\right] $$The term involving \(q(\textbf{x}_T|\textbf{x}_0)\) can be ignored during training if \(\beta_t\) is fixed, because it does not depend on any learnable parameters. The forward process allows closed-form sampling at any timestep:
$$ \begin{equation} q(\textbf{x}\_t|\textbf{x}\_0)=\mathcal{N}(\textbf{x}\_t;\sqrt{\bar{\alpha}\_t}\textbf{x}\_0,(1-\bar{\alpha}\_t)\textbf{I}) \end{equation} $$where \(\alpha_t:=1-\beta_t\), \(\bar{\alpha}_t:=\prod^t_{s=1}\alpha_s\).
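The closed-form marginal makes it cheap to jump straight to any timestep rather than iterating the chain. A sketch reusing the same linear schedule (note that \(\bar{\alpha}_T\) is nearly zero, so \(\textbf{x}_T\) is essentially pure noise):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod of alpha_s for s <= t

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in one shot instead of t sequential steps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal((3, 32, 32))
xT = q_sample(x0, T - 1)  # at t = T the signal coefficient is negligible
```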
To minimize \(L_{t-1}:=\mathbb{E}_q\left[D_{\textrm{KL}}(q(\textbf{x}_{t-1}|\textbf{x}_t,\textbf{x}_0)||p_\theta (\textbf{x}_{t-1}|\textbf{x}_t))\right]\), the KL divergence at each step, we use a closed-form expression for the posterior:
$$ q(\textbf{x}\_{t-1}|\textbf{x}\_t,\textbf{x}\_0) = \mathcal{N}(\textbf{x}\_{t-1};\tilde{\boldsymbol{\mu}}\_t(\textbf{x}\_t,\textbf{x}\_0),\tilde{\beta}\_t\textbf{I}) $$with
$$ \begin{align*} \tilde{\boldsymbol{\mu}}\_t(\textbf{x}\_t,\textbf{x}\_0)&:=\frac{\sqrt{\bar{\alpha}\_{t-1}}\beta\_t}{1-\bar{\alpha}\_t}\textbf{x}\_0+\frac{\sqrt{\alpha\_t}(1-\bar{\alpha}\_{t-1})}{1-\bar{\alpha}\_t}\textbf{x}\_t\\\\ \tilde{\beta}\_t&:=\frac{1-\bar{\alpha}\_{t-1}}{1-\bar{\alpha}\_t}\beta\_{t} \end{align*} $$Let \(\textbf{x}_t(\textbf{x}_0, \boldsymbol{\epsilon}) = \sqrt{\bar{\alpha}_t} \textbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}\), with \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \textbf{I})\). The objective simplifies to:
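The posterior parameters can be computed directly from the schedule; a small sketch under the same schedule assumptions as above:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_params(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), valid for t >= 1."""
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]  # beta_tilde_t
    return mean, var

mean, var = posterior_params(np.ones((2, 2)), np.ones((2, 2)), 500)
```

Note that \(\tilde{\beta}_t < \beta_t\): conditioning on \(\textbf{x}_0\) shrinks the per-step variance.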
$$ \begin{align*} L\_{t-1}-C &= \mathbb{E}\_q\left[\frac{1}{2\sigma^2\_t}||\tilde{\boldsymbol{\mu}}\_t(\textbf{x}\_t,\textbf{x}\_0)-\boldsymbol{\mu}\_\theta (\textbf{x}\_t, t)||^2\right]\\\\ &=\mathbb{E}\_{\textbf{x}\_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma^2\_t}\left|\left|\frac{1}{\sqrt{\alpha\_t}}\left(\textbf{x}\_t(\textbf{x}\_0,\boldsymbol{\epsilon})-\frac{\beta\_t}{\sqrt{1-\bar{\alpha}\_t}}\boldsymbol{\epsilon}\right)-\boldsymbol{\mu}\_{\theta}(\textbf{x}\_t(\textbf{x}\_0,\boldsymbol{\epsilon}),t)\right|\right|^2\right] \end{align*} $$The above equation implies that \(\boldsymbol{\mu}_\theta\) should be trained to approximate \(\frac{1}{\sqrt{\alpha_t}}\left(\textbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\right)\). Since \(\textbf{x}_t\) is available as input to the model, \(\boldsymbol{\mu}_\theta\) can be parameterized to predict this quantity through a noise-prediction network \(\boldsymbol{\epsilon}_\theta\), which puts \(L_{t-1}-C\) in closed form:
$$ L\_{t-1}-C=\mathbb{E}\_{\textbf{x}\_0,\boldsymbol{\epsilon}}\left[\frac{\beta^2\_t}{2\sigma^2\_t\alpha\_t(1-\bar{\alpha}\_t)}\left|\left|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}\_\theta(\sqrt{\bar{\alpha}\_t}\textbf{x}\_0+\sqrt{1-\bar{\alpha}\_t}\boldsymbol{\epsilon},t)\right|\right|^2\right] $$In short, training \(\boldsymbol{\mu}_\theta\) to predict \(\tilde{\boldsymbol{\mu}}_t\) is equivalent to learning to predict \(\boldsymbol{\epsilon}\).
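The paper further proposes a simplified objective that drops the weighting factor in front of the norm, reducing training to a plain mean-squared error on the noise. A sketch of one training step, with a hypothetical placeholder `eps_model` standing in for the trained \(\boldsymbol{\epsilon}_\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return np.zeros_like(xt)

def training_loss(x0):
    """One Monte Carlo estimate of the simplified (unweighted) objective."""
    t = rng.integers(0, T)               # uniform random timestep
    eps = rng.standard_normal(x0.shape)  # the noise the network must recover
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

loss = training_loss(rng.standard_normal((3, 32, 32)))
```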
To compute the discrete log-likelihood of \(p_\theta(\textbf{x}_0|\textbf{x}_1)\), pixel values in \(\{0, 1, \dots, 255\}\) are linearly scaled to \([-1, 1]\), and an independent discrete decoder is used to model the distribution:
$$ p\_\theta(\textbf{x}\_0|\textbf{x}\_1)=\prod^D\_{i=1}\int^{\delta\_+(x^i\_0)}\_{\delta\_-(x^i\_0)}\mathcal{N}(x;\mu^i\_\theta(\textbf{x}\_1,1),\sigma^2\_1)dx $$where \(D\) is the data dimensionality, \(i\) indexes coordinates, and the integration bounds are defined as:
$$ \begin{align*} \delta\_+(x)&=\begin{cases} \infty&\text{if}\ x=1\\\\ x+\frac{1}{255}&\text{if}\ x < 1 \end{cases}\\\\ \delta\_-(x)&=\begin{cases} -\infty&\text{if}\ x=-1\\\\ x-\frac{1}{255}&\text{if}\ x > -1 \end{cases}\\\\ \end{align*} $$During sampling, the denoising process begins from \(\textbf{x}_T \sim \mathcal{N}(0, \textbf{I})\) as:
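The decoder log-likelihood thus integrates the Gaussian density over each pixel's bin. A sketch for a single pixel, using the error function for the normal CDF (the `1e-12` floor is a numerical-stability assumption, not from the paper):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pixel_log_prob(x, mu, sigma):
    """log p(x) for one pixel value x in [-1, 1], integrating the Gaussian
    density over the pixel's bin of width 2/255 (unbounded at the edges)."""
    upper = math.inf if x >= 1.0 else x + 1.0 / 255.0
    lower = -math.inf if x <= -1.0 else x - 1.0 / 255.0
    cdf_hi = 1.0 if upper == math.inf else normal_cdf(upper, mu, sigma)
    cdf_lo = 0.0 if lower == -math.inf else normal_cdf(lower, mu, sigma)
    return math.log(max(cdf_hi - cdf_lo, 1e-12))  # floor avoids log(0)
```

Because the bins tile the real line, the probabilities over all 256 pixel values sum to one, so this defines a proper discrete distribution.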
$$ \textbf{x}\_{t-1} = \frac{1}{\sqrt{\alpha\_t}}\left(\textbf{x}\_t-\frac{\beta\_t}{\sqrt{1-\bar{\alpha}\_t}}\boldsymbol{\epsilon}\_\theta(\textbf{x}\_t,t)\right)+\sigma\_t\textbf{z},\ \textbf{z}\sim\mathcal{N}(\textbf{0}, \textbf{I})
The full procedures for training and sampling are summarized in Algorithms 1 and 2 of the paper: repeatedly sample a timestep and a noise vector, take a gradient step on the noise-prediction loss; then generate by iterating the denoising step from \(\textbf{x}_T\) down to \(\textbf{x}_0\).
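Putting the pieces together, the ancestral sampling loop can be sketched as follows, again with a hypothetical placeholder `eps_model` for the trained network, and with \(\sigma_t^2=\beta_t\) (one of the two variance choices considered in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise-prediction network."""
    return np.zeros_like(xt)

def sample(shape):
    """Ancestral sampling: start from x_T ~ N(0, I), then apply T reverse
    steps with sigma_t^2 = beta_t; no noise is added on the final step."""
    x = rng.standard_normal(shape)
    for t in range(T - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z
    return x

img = sample((3, 32, 32))
```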
