Denoising Diffusion Probabilistic Models

Jonathan Ho; Ajay Jain; Pieter Abbeel

Abstract

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at https://github.com/hojonathanho/diffusion.

Understanding the DDPM Abstract

The Big Picture

This abstract introduces Denoising Diffusion Probabilistic Models (DDPMs) as a new approach to generating high-quality images. Rather than explaining the entire methodology here, the abstract is making four key claims:

What they're proposing: A new class of generative models inspired by physics
How to train them: A specific mathematical approach that connects to existing techniques
How to use them: A progressive decoding process with nice properties
How well they work: State-of-the-art results on standard benchmarks

Let me break down what each part means:

Part 1: What Are Diffusion Probabilistic Models?

Core Concept

Diffusion probabilistic models are a class of latent variable models. Let me unpack this:

Latent variable model: A generative model that explains observed data through hidden (latent) variables that are never observed directly. In a diffusion model, the latent variables are the progressively noisier versions of an image.
Diffusion: The model is inspired by a process from physics where systems gradually move toward equilibrium (like heat spreading through a room).

The Physics Inspiration: Nonequilibrium Thermodynamics

The paper draws inspiration from nonequilibrium thermodynamics, which studies systems that aren't in equilibrium. Here's the intuition:

Imagine starting with a clear image (ordered state)
Gradually add random noise to it (like heat being added to a system, pushing it toward disorder)
This is the forward process: data → noise

To generate new images, we reverse this:

Start with pure noise (disordered state)
Gradually learn to remove noise step-by-step
This is the reverse process: noise → data

This is conceptually elegant: if we can learn to undo the noise, we can create new images from random noise.

Part 2: The Training Approach

The Weighted Variational Bound

The abstract mentions training on a "weighted variational bound." Let me break down what this means:

Variational bound = A mathematical inequality that provides a lower bound on something we want to maximize (the probability of generating real data)

In generative modeling, we want to maximize:

$p(x)$

where $x$ represents our data (an image). This is hard to compute directly, so instead we use a variational bound:

$\log p(x) \geq \text{(something easier to compute)}$

The right-hand side is what we actually optimize during training. The "weighted" part means different terms in this bound are given different importance during training—some terms are multiplied by larger weights than others.

Connection to Denoising Score Matching

The paper makes a novel connection between:

Diffusion probabilistic models (what they're proposing)
Denoising score matching with Langevin dynamics (an existing technique for score-based generative modeling)

Score matching is a technique from statistical estimation. It involves learning the gradient of the log-probability density:

$\nabla_x \log p(x)$

where:

$\nabla_x$ is the gradient with respect to $x$ (points in the direction of increasing probability)
$\log p(x)$ is the logarithm of the probability density

Langevin dynamics is a sampling procedure, originally from statistical physics, that uses these gradients to draw samples from a distribution. The connection the paper discovers is that, under a particular parameterization, training a diffusion model is equivalent to learning these score functions at multiple noise levels.
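As a rough illustration of how Langevin dynamics turns a score into samples, the sketch below targets a one-dimensional Gaussian whose score is known in closed form. The target (mean 2, standard deviation 0.5), the step size, and the names `gaussian_score` and `langevin_sample` are illustrative assumptions, not taken from the paper; a diffusion model would replace the analytic score with a learned network.

```python
import math
import random

def gaussian_score(x, mean, std):
    """Score of N(mean, std^2): d/dx log p(x) = -(x - mean) / std^2."""
    return -(x - mean) / (std ** 2)

def langevin_sample(score_fn, n_steps=400, step_size=0.01, rng=None):
    """Unadjusted Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * z."""
    rng = rng or random.Random()
    x = rng.gauss(0.0, 1.0)  # arbitrary starting point
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score_fn(x) + math.sqrt(step_size) * z
    return x

rng = random.Random(0)
score = lambda x: gaussian_score(x, mean=2.0, std=0.5)
samples = [langevin_sample(score, rng=rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))
print(round(sample_mean, 2), round(sample_std, 2))
```

The chains only ever see the score, never the density itself, yet the sample mean and spread should end up close to the target's. With a finite step size the match is approximate rather than exact.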

This connection is powerful because it:

Provides a principled way to train the model
Links to well-understood physics concepts
Suggests better weighting schemes for the training objective

Part 3: The Decoding Process

The abstract mentions a "progressive lossy decompression scheme" that generalizes autoregressive decoding.

What This Means

Progressive: The model generates images in steps, progressively reducing noise

Step 1: Remove most of the noise from pure noise (rough image)
Step 2: Refine further (less rough)
Step 3, 4, ...: Continue until image is clean

Lossy decompression: Like decompressing a compressed image, but with some information loss at each step (we're removing noise, not perfectly reconstructing)

Generalizes autoregressive decoding: Autoregressive models generate data one piece at a time (like predicting next word in a sentence). This approach does something similar but for image generation—predicting progressively refined versions.

The key insight: you can stop the process early to get a rough sample, or continue longer for higher quality. This gives flexibility not present in many other generative models.

Part 4: Quantitative Results

The paper then provides empirical evidence of success:

CIFAR-10 (32×32 images of objects)

Inception Score = 9.46: Measures if generated images look like real object categories
FID Score = 3.17: Measures similarity between generated and real image distributions (lower is better)
Only the FID is described as state-of-the-art in the abstract; no such claim is made for the Inception score

LSUN (256×256 natural scene images)

Sample quality similar to ProgressiveGAN: At the time, ProgressiveGAN was one of the best image generation methods
This shows diffusion models are competitive with the leading approach

Implementation Available

The authors released their code, which is important for reproducibility and adoption by the research community.

Why This Matters

The combination of:

Theoretical elegance (connection to physics and score matching)
Training stability (the weighted variational bound helps with training)
Practical results (competitive or better than existing methods)
Flexibility (progressive generation allows quality/speed tradeoffs)

...makes this a significant contribution that would influence generative modeling for years to come.

1 Introduction

Deep generative models of all kinds have recently exhibited high quality samples in a wide variety of data modalities. G...

Understanding the Introduction to Diffusion Probabilistic Models

The Big Picture

This introduction section does several important things:

Positions diffusion models in the landscape of generative models — explaining where they fit among existing approaches like GANs, VAEs, and autoregressive models
Introduces the core concept — what a diffusion model is and how it works at an intuitive level
States the paper's main contributions — showing that diffusion models can generate high-quality images, and revealing a mathematical connection to score matching
Sets expectations — acknowledging trade-offs (good samples, but not the best likelihood values)

Let's break this down carefully.

Part 1: The Landscape of Generative Models

"Deep generative models of all kinds have recently exhibited high quality samples in a wide variety of data modalities. Generative adversarial networks (GANs), autoregressive models, flows, and variational autoencoders (VAEs) have synthesized striking image and audio samples..."

What this means: The authors are situating their work among four major families of generative models:

GANs (Generative Adversarial Networks): Use adversarial training — a generator competes against a discriminator
Autoregressive models: Generate data sequentially, one element at a time, conditioning on all previous elements
Flows: Use invertible transformations to map between simple distributions (like Gaussian) and complex data distributions
VAEs (Variational Autoencoders): Learn a latent representation and reconstruction process

All of these have shown impressive results. The authors are essentially saying: "Here's another approach that also works well."

Part 2: What is a Diffusion Model? (The Core Concept)

This is the crucial conceptual section. Let me break it down carefully:

The Two Directions: Diffusion vs. Reverse

"A diffusion probabilistic model (which we will call a 'diffusion model' for brevity) is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time. Transitions of this chain are learned to reverse a diffusion process, which is a Markov chain that gradually adds noise to the data in the opposite direction of sampling until signal is destroyed."

This describes two complementary processes:

Process 1: The Diffusion Process (The Forward Direction)

Think of this as corrupting data with noise:

Start with a clean image from your dataset
Gradually add Gaussian noise at each timestep
After many steps, you're left with pure noise (the "signal is destroyed")
This is a Markov chain — each step only depends on the previous step

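Mathematically: If we denote the original data as $x_0$ and the state of the diffusion process at timestep $t$ as $x_t$, then, schematically, $x_t \approx x_{t-1} + \text{small amount of noise}$ (the exact update, given in Section 2, also scales $x_{t-1}$ down slightly at each step).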

After many iterations (say 1000 steps), $x_T$ looks like completely random Gaussian noise.

Process 2: The Reverse Process (What We Learn)

This is the inverse of diffusion — removing noise to recover data:

Start with pure random noise
At each timestep, learn to remove some of the noise
After many steps, you should have a clean, realistic image
We train a neural network to learn how to do this

Key insight: If we could perfectly learn the reverse of the diffusion process, sampling would work by:

Start with random noise $x_T$
Apply the learned reverse transitions repeatedly
End up with a sample from the data distribution at $x_0$

Part 3: Why Gaussian Noise Makes This Simple

"When the diffusion consists of small amounts of Gaussian noise, it is sufficient to set the sampling chain transitions to conditional Gaussians too, allowing for a particularly simple neural network parameterization."

Why this matters:

If at each step you're adding small amounts of Gaussian noise, then the reverse process can also be parameterized using Gaussian distributions.

The mathematical beauty: For Gaussian distributions, we have nice closed-form formulas. Instead of the neural network learning the full reverse transition, it only needs to predict:

The mean (center) of the Gaussian at each step
The variance (spread) is often fixed

This is much simpler than having the network predict an entire probability distribution. It's one of the practical advantages of using Gaussian noise.

Part 4: The Paper's Main Contributions (Two Key Claims)

Contribution 1: Diffusion Models Can Generate High-Quality Images

"Diffusion models are straightforward to define and efficient to train, but to the best of our knowledge, there has been no demonstration that they are capable of generating high quality samples. We show that diffusion models actually are capable of generating high quality samples, sometimes better than the published results on other types of generative models (Section 4)."

Translation: Previously, nobody had shown that diffusion models work well for image generation. This paper demonstrates they do — and sometimes better than GANs or VAEs.

Evidence (from the abstract): They achieve FID score of 3.17 on CIFAR10, which was state-of-the-art at the time.

Contribution 2: Connection to Score Matching and Langevin Dynamics

"In addition, we show that a certain parameterization of diffusion models reveals an equivalence with denoising score matching over multiple noise levels during training and with annealed Langevin dynamics during sampling (Section 3.2)."

This is more technical, but here's the intuition:

Score matching is another established method in machine learning — it learns the gradient (called the "score") of the log-probability distribution
Langevin dynamics is a sampling technique that uses these gradients to generate samples
The authors' discovery: When you set up diffusion models the right way, training them is mathematically equivalent to training a score matching model, and sampling is equivalent to running Langevin dynamics

Why this matters: This connection validates diffusion models theoretically — they're not just an ad-hoc method, but deeply connected to established mathematical frameworks.

Part 5: The Trade-off: Sample Quality vs. Likelihood

"Despite their sample quality, our models do not have competitive log likelihoods compared to other likelihood-based models..."

What this means:

Sample quality: The images generated look good (this paper's strength)
Log likelihood: A measure of how well the model assigns probability to real data — mathematically, $\log p(x)$ for actual data $x$ . Higher is better.

The authors are being honest: their models generate beautiful images, but if you ask "what probability does this model assign to real data?", the answer isn't as competitive as other methods.

Why? The next sentence explains:

"...the majority of our models' lossless codelengths are consumed to describe imperceptible image details."

Translation: The model is spending its probability budget on tiny details humans can't perceive. If you're measuring likelihood (which treats all details equally), the model looks inefficient. But perceptually, the samples look great.

Part 6: Progressive Decoding Perspective

"...the sampling procedure of diffusion models is a type of progressive decoding that resembles autoregressive decoding along a bit ordering that vastly generalizes what is normally possible with autoregressive models."

What this means:

Progressive decoding: Refine the result step-by-step, getting better each time
Unlike autoregressive models that generate data in a fixed sequence (like left-to-right in images), diffusion models refine all positions simultaneously, progressively adding detail

This is a conceptually interesting perspective — it shows diffusion sampling as a generalization of autoregressive sampling.

Summary: What You Should Take Away

Conceptually: Diffusion models work by learning to reverse a noise-addition process. You train on the task "remove noise from images," then sample by starting with noise and repeatedly denoising.
Practically: They're simple to implement (just Gaussian distributions) and efficient to train.
Theoretically: They connect to score matching and Langevin dynamics — grounding them in established mathematical frameworks.
Empirically: They generate high-quality images, though they don't optimize for likelihood metrics.
Philosophically: The sampling is a progressive refinement that generalizes autoregressive generation.

The next sections will formalize these intuitions with mathematics.

2 Background

Diffusion models [53] are latent variable models of the form $p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T}) ...

Understanding Section 2: Background on Diffusion Models

Big Picture: What's This Section About?

This section introduces the mathematical framework of diffusion probabilistic models. Think of it as describing a two-way process:

Forward process (easy, fixed): Gradually add noise to clean data until it becomes pure random noise
Reverse process (hard, learned): Learn to gradually remove noise from random noise to reconstruct clean data

The key insight is that if you can learn to reverse the forward process, you can generate new data by starting with random noise and denoising it. This section lays out the mathematical machinery to make this work.

Part 1: What Are Diffusion Models? (The Overall Framework)

The Core Equation

A diffusion probabilistic model is a latent variable model. This means it works with hidden variables (latents) to model observed data. The key equation is:

$p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T}) \, d\mathbf{x}_{1:T}$

Let me break down the notation:

$\mathbf{x}_0$ = the observed data (e.g., an image) - this is what we want to generate
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$ = hidden variables at different noise levels (latents)
$\mathbf{x}_{0:T}$ = shorthand meaning "all variables from $\mathbf{x}_0$ to $\mathbf{x}_T$ "
$p_\theta(\mathbf{x}_{0:T})$ = the reverse process: a probability distribution parameterized by neural network weights $\theta$
The integral $\int d\mathbf{x}_{1:T}$ means we "marginalize out" or sum over all possible latent values

In plain English: To get the probability of clean data $\mathbf{x}_0$ , we consider all possible noisy versions $\mathbf{x}_1$ through $\mathbf{x}_T$ and integrate over them.

Part 2: The Reverse Process (What the Model Learns)

Equation (1) defines the reverse process in detail:

$p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$

$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) := \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$

Understanding the Components

The product notation $\prod_{t=1}^{T}$ is like a $\sum$ but for multiplication. It means we're multiplying probabilities together: $p(\mathbf{x}_T) \times p_\theta(\mathbf{x}_{T-1}|\mathbf{x}_T) \times p_\theta(\mathbf{x}_{T-2}|\mathbf{x}_{T-1}) \times \cdots \times p_\theta(\mathbf{x}_0|\mathbf{x}_1)$

This forms a Markov chain - a chain where each step depends only on the previous step (not the whole history).

Key components:

$p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$ = starting point is pure Gaussian noise (mean $\mathbf{0}$ , identity covariance $\mathbf{I}$ )
$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ = the learned conditional: "given noisy version at step $t$ , what's the cleaner version at step $t-1$ ?"
This is Gaussian with mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and covariance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ - both predicted by neural networks

In plain English: We start with pure noise and repeatedly apply learned denoising steps. Each step takes slightly noisy data and produces slightly cleaner data.
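The chain in Eq. (1) can be sketched as a simple loop. Everything model-specific here is a stand-in: `mu_theta` and `sigma` are hypothetical callables playing the role of the learned mean and a fixed standard deviation, the example is one-dimensional, and the toy "model" at the bottom is made up purely so the loop has something to run. Skipping the noise on the final step follows the convention of the paper's sampling algorithm.

```python
import random

def sample_reverse_chain(mu_theta, sigma, T, rng):
    """Draw x_0 by running x_T -> x_{T-1} -> ... -> x_0 as in Eq. (1).

    mu_theta(x_t, t) and sigma(t) are hypothetical stand-ins for the learned
    mean and the (here fixed) standard deviation of p_theta(x_{t-1} | x_t).
    """
    x = rng.gauss(0.0, 1.0)              # x_T ~ N(0, 1)
    trajectory = [x]
    for t in range(T, 0, -1):            # t = T, T-1, ..., 1
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the final step
        x = mu_theta(x, t) + sigma(t) * z
        trajectory.append(x)
    return x, trajectory

# Toy stand-in "model": pull the state halfway toward 3.0 at every step.
toy_mu = lambda x, t: x + 0.5 * (3.0 - x)
toy_sigma = lambda t: 0.01

x0, traj = sample_reverse_chain(toy_mu, toy_sigma, T=50, rng=random.Random(0))
print(len(traj), round(x0, 2))
```

In a real diffusion model the only change to this loop is that `mu_theta` is a neural network evaluated on the current noisy image and the timestep.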

Part 3: The Forward Process (Fixed, No Learning)

Equation (2) defines the forward process - the opposite direction:

$q(\mathbf{x}_{1:T}|\mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1})$

$q(\mathbf{x}_t|\mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})$

Key Differences from the Reverse Process

The forward process is fixed - no learning here! It's just a mathematical definition
It gradually adds noise using a variance schedule $\beta_1, \beta_2, \ldots, \beta_T$
- These $\beta_t$ values are hyperparameters (set by the researcher, not learned)
- Small values like $\beta_t = 0.0001$ to $0.02$ are typical
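A variance schedule of this kind is just a list of numbers. The sketch below builds a linearly spaced one over the range quoted above; the linear spacing, the choice $T = 1000$, and the function name `linear_beta_schedule` are assumptions made for illustration.

```python
def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1, ..., beta_T (returned 0-indexed)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

betas = linear_beta_schedule()
print(len(betas), betas[0], betas[-1])
```

Because the list is fixed before training starts, nothing about it is learned; it is a hyperparameter in the same sense as a learning rate.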

What Does Each Step Do?

At step $t$ , we go from $\mathbf{x}_{t-1}$ (slightly noisy) to $\mathbf{x}_t$ (more noisy) using:

$\mathbf{x}_t = \sqrt{1-\beta_t} \cdot \mathbf{x}_{t-1} + \sqrt{\beta_t} \cdot \boldsymbol{\epsilon}$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is a standard Gaussian noise vector.

Breaking this down:

$\sqrt{1-\beta_t}$ ≈ slightly less than 1 (scales down the signal)
$\sqrt{\beta_t}$ ≈ small (scales the added noise)
So each step slightly reduces the signal and adds a bit of noise

If you repeat this $T$ times (like $T=1000$), the signal is almost entirely destroyed and $\mathbf{x}_T$ is close to pure Gaussian noise.
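To make the step-by-step corruption concrete, the sketch below applies the update above $T = 1000$ times to a toy "image" of 500 identical pixels. The linear schedule from $10^{-4}$ to $0.02$, the pixel value 5.0, and the helper name `forward_step` are illustrative assumptions.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """One step of q(x_t | x_{t-1}): shrink the signal, then add Gaussian noise."""
    return [math.sqrt(1.0 - beta_t) * v + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)
            for v in x_prev]

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed linear schedule

rng = random.Random(0)
x = [5.0] * 500           # a toy "image": 500 pixels that all equal 5.0
signal_scale = 1.0        # running product of sqrt(1 - beta_t)
for beta_t in betas:
    x = forward_step(x, beta_t, rng)
    signal_scale *= math.sqrt(1.0 - beta_t)

mean = sum(x) / len(x)
var = sum((v - mean) ** 2 for v in x) / len(x)
print(signal_scale, mean, var)
```

The running `signal_scale` shows how little of the original value survives, while the empirical mean and variance of the final pixels should sit near 0 and 1, as expected for standard Gaussian noise.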

Part 4: The Key Innovation - Closed-Form Sampling

This is a crucial practical advantage. Instead of applying the forward process $T$ times sequentially, you can jump directly to any timestep $t$ using Equation (4):

$q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$

With definitions:

$\alpha_t := 1 - \beta_t$ (how much signal remains at step $t$ )
$\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$ (cumulative product - how much signal remains after $t$ steps)

Why this matters: You can randomly sample a timestep $t$ and directly compute what $\mathbf{x}_t$ looks like, without sequentially applying $t$ noising operations. This dramatically speeds up training!
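A minimal sketch of this shortcut, assuming a linear schedule and scalar "pixels": `q_sample` draws $\mathbf{x}_t$ in one shot from Eq. (4), and the loop underneath cross-checks the closed form by pushing the mean coefficient and the variance through the chain one step at a time. The names `alpha_bars` and `q_sample` are illustrative, not the authors' code.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed linear schedule
alphas = [1.0 - b for b in betas]

# alpha_bars[t - 1] holds \bar{alpha}_t = alpha_1 * ... * alpha_t
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (Eq. 4); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    return [math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for v in x0]

# Cross-check: propagate the moments step by step
# (mean <- sqrt(alpha_t) * mean, var <- alpha_t * var + beta_t).
mean_coef, var = 1.0, 0.0
for a, b in zip(alphas, betas):
    mean_coef = math.sqrt(a) * mean_coef
    var = a * var + b

x_500 = q_sample([5.0, -1.0, 0.0], t=500, rng=random.Random(0))
print(mean_coef, math.sqrt(alpha_bars[-1]), var, 1.0 - alpha_bars[-1])
```

The step-by-step moments agree with $\sqrt{\bar{\alpha}_T}$ and $1-\bar{\alpha}_T$ up to floating-point rounding, which is the content of Eq. (4) at $t = T$.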

Part 5: Training the Model (The Variational Bound)

Now here's the challenge: how do we train the reverse process to actually reverse the forward process?

The standard approach uses variational inference: specifically, minimizing a variational upper bound on the negative log-likelihood (equivalently, maximizing a lower bound on the log-likelihood):

$\mathbb{E}[-\log p_\theta(\mathbf{x}_0)] \leq \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]$

Interpreting This Inequality

Left side: The negative log-likelihood (how bad our model is at predicting real data)
Right side: An upper bound that's easier to compute
The inequality comes from a fundamental principle called Jensen's inequality
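For readers who want the missing step, here is one way to write out the Jensen argument for a single data point $\mathbf{x}_0$. It is a standard derivation, sketched here rather than quoted from the paper; it uses only the definitions of $p_\theta(\mathbf{x}_0)$ and $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$ given above and the fact that $-\log$ is convex.

```latex
\begin{aligned}
-\log p_\theta(\mathbf{x}_0)
  &= -\log \int q(\mathbf{x}_{1:T}|\mathbf{x}_0)\,
       \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\, d\mathbf{x}_{1:T} \\
  &= -\log \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\!\left[
       \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] \\
  &\leq \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\!\left[
       -\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]
\end{aligned}
```

Averaging both sides over the data distribution of $\mathbf{x}_0$ gives the inequality stated above.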

The right side expands to:

$L = \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \sum_{t \geq 1} \log \frac{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q(\mathbf{x}_t|\mathbf{x}_{t-1})}\right]$

Breaking down the sum:

First term $-\log p(\mathbf{x}_T)$ : How good are we at having a distribution that matches pure noise?
Remaining terms: At each timestep, compare our learned reverse step against the forward process

When we minimize this bound, we're training the neural network to predict the right denoising steps.

Part 6: Variance Reduction - A Better Loss Function

Computing the loss above can have high variance. The paper rewrites it more cleverly in Equation (5):

$\mathbb{E}_q\left[\underbrace{D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \| p(\mathbf{x}_T))}_{L_T} + \sum_{t>1} \underbrace{D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{L_{t-1}} + \underbrace{- \log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{L_0}\right]$

What Changed?

Instead of comparing our reverse step directly to the forward process, we compare it to $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ - the forward process conditioned on knowing the original data $\mathbf{x}_0$ .

This is a clever trick! When training on actual data, we know $\mathbf{x}_0$ . Using this information in the comparison reduces variance dramatically.

The notation $D_{\mathrm{KL}}(A \| B)$ is the Kullback-Leibler divergence - a standard measure of how different two probability distributions are. It's always non-negative and equals zero only when distributions are identical.
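For scalar Gaussians the KL divergence has a short closed form, which makes these properties easy to check numerically. The function name `kl_gaussian_1d` and the test values are illustrative.

```python
import math

def kl_gaussian_1d(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for scalar Gaussians, in nats."""
    return 0.5 * (math.log(var2 / var1) - 1.0 + var1 / var2 + (mu2 - mu1) ** 2 / var2)

kl_same = kl_gaussian_1d(0.0, 1.0, 0.0, 1.0)      # identical distributions
kl_shift = kl_gaussian_1d(0.0, 1.0, 1.0, 1.0)     # same variance, means one unit apart
kl_forward = kl_gaussian_1d(0.0, 1.0, 0.0, 4.0)   # narrow vs. wide
kl_backward = kl_gaussian_1d(0.0, 4.0, 0.0, 1.0)  # wide vs. narrow
print(kl_same, kl_shift, kl_forward, kl_backward)
```

The last two values differ, a reminder that KL divergence is not symmetric: the order of the two distributions in Eq. (5) matters.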

Part 7: The Tractable Posterior

Here's why this clever rewriting works - Equation (6) shows the forward process posterior is Gaussian:

$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$

With explicit formulas:

$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t$

$\tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$
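These formulas translate directly into code. The sketch below assumes a linear schedule and scalar values, and the helper name `posterior_mean_var` is made up for illustration. As a sanity check it uses a degenerate dataset in which every data point equals 1.5: in that case the posterior is the exact reverse step, so sampling it from $t = T$ down to $t = 1$ must land on 1.5.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed linear schedule
alphas = [1.0 - b for b in betas]
alpha_bars = [1.0]                     # alpha_bars[t] = \bar{alpha}_t, with \bar{alpha}_0 = 1
for a in alphas:
    alpha_bars.append(alpha_bars[-1] * a)

def posterior_mean_var(x_t, x_0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    a_bar_t, a_bar_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    return coef_x0 * x_0 + coef_xt * x_t, var

rng = random.Random(0)
x = rng.gauss(0.0, 1.0)                # start from (approximately) x_T ~ N(0, 1)
for t in range(T, 0, -1):
    mean, var = posterior_mean_var(x, 1.5, t)
    x = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
print(x)
```

At $t = 1$ the posterior variance is zero and the mean collapses onto $\mathbf{x}_0$, which is why the chain ends exactly on the data point.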

Why This is Amazing

Both $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ (the target) and $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ (our learned model) are Gaussians
When comparing two Gaussians using KL divergence, there's a closed-form formula - no Monte Carlo sampling needed!
This dramatically reduces training noise and variance

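In plain English: Each KL term can be evaluated exactly with a closed-form formula instead of being estimated through random sampling; only the outer expectation over training data and noisy samples still requires sampling.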

Summary

This section established:

Two parallel processes:
- Forward: gradually add noise (fixed, simple)
- Reverse: gradually remove noise (learned, complex)
Mathematical framework: Both are Markov chains of Gaussians
Training trick: Compare learned reverse steps to forward process conditioned on real data
Computational efficiency:
- Jump to any timestep instantly (Eq. 4)
- Exact loss computation for Gaussian comparisons (Eq. 6-7)

This foundation enables the training procedure described in the later sections, which achieves state-of-the-art image generation results.

$q(\mathbf{x}_{1:T}|\mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1}), \qquad q(\mathbf{x}_t|\mathbf{x}_{t-1}) := \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})$

A numerical check gives $\approx 0.92466$ for both sides, consistent with the algebraic identity $\sqrt{a} \cdot \sqrt{b} = \sqrt{a \cdot b}$.

This identity is why the per-step scaling factors combine into the single coefficient of the closed-form formula (Eq. 4): $\sqrt{\bar{\alpha}_t} = \sqrt{\prod_{s=1}^{t} \alpha_s} = \prod_{s=1}^{t} \sqrt{\alpha_s}$

Summary Table

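Here's how the forward process evolves at key timesteps for the linear noise schedule $\beta_t = 10^{-4} \cdot t$ implied by the $\beta_t$ column. The signal scale is $\sqrt{\bar{\alpha}_t}$ (the coefficient on $\mathbf{x}_0$ in Eq. 4) and the noise scale is $\sqrt{1-\bar{\alpha}_t}$ (the standard deviation of the accumulated noise):

Timestep	$\beta_t$	$\alpha_t$	$\bar{\alpha}_t$	Signal scale $\sqrt{\bar{\alpha}_t}$	Noise scale $\sqrt{1-\bar{\alpha}_t}$
$t=0$	—	1.0	1.0	100%	0%
$t=1$	0.0001	0.9999	0.9999	99.995%	1.0%
$t=25$	0.0025	0.9975	0.9680	98.4%	17.9%
$t=50$	0.0050	0.9950	0.8801	93.8%	34.6%
$t=100$	0.0100	0.9900	0.6025	77.6%	63.0%
$t=T$	—	—	→ 0	→ 0%	→ 100%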
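The quantities in a table like this can be recomputed with a few lines of code. The sketch below assumes the schedule $\beta_t = 10^{-4} \cdot t$ read off the $\beta_t$ column, and reports $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$ as the signal and noise scales; the helper name `forward_table` is illustrative.

```python
import math

def forward_table(timesteps, beta_fn):
    """Rows of (t, beta_t, alpha_t, alpha_bar_t, sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t))."""
    rows, alpha_bar = [], 1.0
    for t in range(1, max(timesteps) + 1):
        beta_t = beta_fn(t)
        alpha_bar *= 1.0 - beta_t
        if t in timesteps:
            rows.append((t, beta_t, 1.0 - beta_t, alpha_bar,
                         math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)))
    return rows

# Assumed schedule, read off the beta column of the table: beta_t = 1e-4 * t.
rows = forward_table({1, 25, 50, 100}, lambda t: 1e-4 * t)
for row in rows:
    print("t=%d  beta=%.4f  alpha=%.4f  alpha_bar=%.4f  signal=%.4f  noise=%.4f" % row)
```

Note that the squared signal and noise scales add up to 1 at every timestep, which is what keeps the overall variance bounded as noise accumulates.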

The Big Picture

What does this equation accomplish?

Defines a forward process: A mathematically tractable way to progressively corrupt data
Ensures reversibility: The Gaussian form makes the reverse process (denoising) learnable
Enables efficient training: The closed-form sampling (Eq. 4) lets us train on random timesteps
Separates signal from noise: At each timestep, we have a known mixture of original signal and Gaussian noise

Why Gaussian noise? Gaussians are:

Mathematically elegant (closed-form formulas)
Computationally efficient (KL divergence has closed form)
Well motivated (by the central limit theorem, the sum of many small independent perturbations is approximately Gaussian)
Reversible (when each $\beta_t$ is small, the reverse transitions have approximately the same Gaussian form as the forward ones)

The forward process is the foundation that makes the entire diffusion model framework work!

Figures (not reproduced): the signal retention factor $1-\beta_t$ as $\beta_t$ increases from 0 to 1; the signal scaling factor $\sqrt{1-\beta_t}$ across the noise schedule.

Worked computations (outputs not reproduced): the mean of the Gaussian for the first diffusion step; the cumulative product $\bar{\alpha}_2$ for two steps with $\beta = 0.1$; the cumulative noise factor $1-\bar{\alpha}_2$; and a check that the product of square roots equals the square root of the product, which evaluates to 0.9246621004453465.
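The two-step case with $\beta = 0.1$ mentioned above is small enough to work out by hand. The only assumption is that both steps use the same value, $\beta_1 = \beta_2 = 0.1$:

```latex
\begin{aligned}
\alpha_1 = \alpha_2 &= 1 - 0.1 = 0.9 \\
q(\mathbf{x}_1|\mathbf{x}_0) &= \mathcal{N}\!\left(\mathbf{x}_1;\ \sqrt{0.9}\,\mathbf{x}_0,\ 0.1\,\mathbf{I}\right),
  \qquad \sqrt{0.9} \approx 0.9487 \\
\bar{\alpha}_2 &= \alpha_1 \alpha_2 = 0.9 \times 0.9 = 0.81 \\
1 - \bar{\alpha}_2 &= 0.19 \\
\sqrt{\alpha_1}\,\sqrt{\alpha_2} &= 0.9 = \sqrt{0.81} = \sqrt{\bar{\alpha}_2} \\
q(\mathbf{x}_2|\mathbf{x}_0) &= \mathcal{N}\!\left(\mathbf{x}_2;\ 0.9\,\mathbf{x}_0,\ 0.19\,\mathbf{I}\right)
\end{aligned}
```

After two steps, 90% of the original signal amplitude remains and the accumulated noise has variance 0.19.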

$p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t), \qquad p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) := \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$

For two Gaussians $\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)$ and $\mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)$ in $d$ dimensions, the KL divergence has the closed form:

$D_{\mathrm{KL}}(\mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1) \| \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2)) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \text{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T\Sigma_2^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\right]$

In the DDPM setting, the paper fixes the covariance (doesn't learn it), so $\Sigma_\theta$ is constant. This means the KL divergence simplifies, up to an additive constant and a positive scale factor that do not depend on $\theta$, to a squared error between $\tilde{\boldsymbol{\mu}}_t$ and $\boldsymbol{\mu}_\theta$:

$\text{Training objective} \propto \mathbb{E}_{t,\mathbf{x}_0,\mathbf{x}_t}\left[\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|^2\right]$

This is remarkably simple: just predict the forward process posterior mean!
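To see where the squared error comes from, the general KL formula can be specialized to isotropic covariances. The sketch below assumes the reverse-process covariance is fixed to $\sigma_t^2 \mathbf{I}$ (a symbol introduced here for the fixed variance) and uses the forward posterior covariance $\tilde{\beta}_t \mathbf{I}$, for $t > 1$:

```latex
\begin{aligned}
D_{\mathrm{KL}}\!\left(\mathcal{N}(\tilde{\boldsymbol{\mu}}_t, \tilde{\beta}_t \mathbf{I}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_\theta, \sigma_t^2 \mathbf{I})\right)
 &= \frac{1}{2}\left[ d \log\frac{\sigma_t^2}{\tilde{\beta}_t} - d + d\,\frac{\tilde{\beta}_t}{\sigma_t^2}
    + \frac{1}{\sigma_t^2}\,\|\boldsymbol{\mu}_\theta - \tilde{\boldsymbol{\mu}}_t\|^2 \right] \\
 &= \frac{1}{2\sigma_t^2}\,\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|^2 + C_t,
 \qquad C_t = \frac{d}{2}\left[\log\frac{\sigma_t^2}{\tilde{\beta}_t} - 1 + \frac{\tilde{\beta}_t}{\sigma_t^2}\right]
\end{aligned}
```

Since $C_t$ does not involve $\theta$, only the squared distance between the two means affects the gradient.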

Summary: Why This Equation Matters

Property	Why It's Important
Markov factorization $\prod_{t=1}^{T}$	Makes sampling tractable: each step depends only on the current state
Gaussian transitions	Closed-form KL divergence to forward process posterior; efficient training
Learned mean $\boldsymbol{\mu}_\theta$	Neural network predicts how to denoise; more flexible than fixed schedule
Learned covariance $\boldsymbol{\Sigma}_\theta$	Optional; can be learned or fixed (DDPM fixes it)
Starting from $\mathcal{N}(\mathbf{0}, \mathbf{I})$	Natural endpoint: pure noise is easy to sample; reverse of forward process

The equation defines a learnable path from noise to data that mirrors the forward noising process, enabling a tractable variational bound on the likelihood and stable training via variational inference.

Figures and worked computations (not reproduced): the algebraic structure of the forward process posterior mean; an exponential-decay illustration ($e^{-x}$, $e^{-2x}$, $e^{-3x}$ for $x$ from 0 to 5) of how variance typically decreases as the reverse process moves from noise to data; a narrow Gaussian transition with variance 0.01 plotted for $x$ from $-0.5$ to $0.5$, representing a small refinement step; a check that any Gaussian density is normalized, $\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) dx = 1$; an illustration of how a forward process with tiny $\beta_t$ can be closely approximated by learned reverse transitions; and the general form of the KL divergence between two multivariate Gaussians before simplification.

3 Diffusion models and denoising autoencoders

Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degre...

Section 3: Diffusion Models and Denoising Autoencoders

Big Picture: What's This Section Trying to Do?

Before diving into the math, let's understand the overarching goal. The previous sections established what diffusion models do (gradually add noise to data, then learn to reverse it) and showed they can be trained using variational inference. But the authors haven't explained how to make the best design choices for these models.

This section answers that question by revealing a surprising connection: under a particular parameterization, training a diffusion model is equivalent to denoising score matching over multiple noise levels, a technique developed in a separate line of work. This connection gives the authors:

Theoretical insight into why their approach works
Practical guidance on how to parameterize and weight the training objective
A simplified objective function that works better empirically

Think of it like discovering that two seemingly different recipes produce the same dish—once you make that connection, you can borrow techniques from one to improve the other.

The Core Challenge: Too Many Design Degrees of Freedom

The opening paragraph highlights the problem:

"Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation."

What does this mean?

Recall from Section 2 that diffusion models require you to choose:

The variance schedule $\beta_1, \ldots, \beta_T$ (how much noise to add at each step)
The neural network architecture and parameterization for the reverse process

These aren't trivial choices—different choices will give different results. The section is asking: Is there a principled way to make these choices, rather than just guessing?

The answer the authors provide: Connect to denoising score matching.

Score Matching: A Brief Introduction

To understand the connection, we need to know what "score matching" means.

The score of a distribution is the gradient of its log-probability:

\nabla_{\mathbf{x}} \log p(\mathbf{x})

Here:

$\nabla_{\mathbf{x}}$ is the gradient operator (the vector of partial derivatives with respect to each component of $\mathbf{x}$ )
$\log p(\mathbf{x})$ is the natural logarithm of the probability density

Intuition: The score points in the direction of increasing probability—it's like a compass that always points toward regions where the data is more likely to appear.

Denoising score matching is a technique that learns to predict this score by training a network $s_\theta(\mathbf{x}, t)$ to match the true score at different noise levels. The key insight is:

\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t|\mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{1 - \bar{\alpha}_t}

This says: "Given a noisy version of data $\mathbf{x}_t$ , the score tells us how to remove noise to recover $\mathbf{x}_0$ ."
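A quick numerical check can make this formula concrete. The sketch below uses a scalar example with an arbitrarily chosen $\bar{\alpha}_t = 0.5$ and $\mathbf{x}_0 = 0.8$ (both are assumptions for illustration, not values from the paper). It confirms that the score equals $-\boldsymbol{\epsilon}/\sqrt{1-\bar{\alpha}_t}$, so it points opposite to the noise that was added.

```python
import math
import random

def score_q(x_t, x_0, alpha_bar):
    """Score of q(x_t | x_0) = N(sqrt(alpha_bar) * x_0, 1 - alpha_bar), evaluated at x_t."""
    return -(x_t - math.sqrt(alpha_bar) * x_0) / (1.0 - alpha_bar)

random.seed(0)
alpha_bar = 0.5   # illustrative value, not taken from the paper
x_0 = 0.8         # an arbitrary "clean" scalar data point
eps = random.gauss(0.0, 1.0)

# Forward-process sample: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
x_t = math.sqrt(alpha_bar) * x_0 + math.sqrt(1.0 - alpha_bar) * eps

score = score_q(x_t, x_0, alpha_bar)
scaled_noise = -eps / math.sqrt(1.0 - alpha_bar)
print(score, scaled_noise)  # the two numbers agree up to rounding
```

This is why a network that predicts $\boldsymbol{\epsilon}$ can be read as predicting a rescaled score.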

The Connection: Why Denoising Networks Learn Scores

This is where the magic happens. Section 3.2 establishes that predicting the noise in diffusion models is mathematically equivalent, up to a known scale factor, to predicting the score.

Here's the intuition:

In diffusion models, the network predicts what noise was added: given a noisy image, predict the noise $\boldsymbol{\epsilon}$ .

In denoising score matching, the network predicts the gradient of the log-probability (the score).

These are the same thing (up to a scaling factor). If you predict the score correctly, you implicitly predict the noise correctly, and vice versa.

This connection is useful because it ties diffusion models to the literature on score-based generative modeling. In particular, the sampling procedure can be read as a form of Langevin dynamics, with the noise predictor playing the role of a learned gradient of the data log-density.

The Practical Payoff: A Better Training Objective

The section mentions:

"leads to a simplified, weighted variational bound objective for diffusion models (Section 3.4)"

This is referring to the key result: once you embrace the score-matching perspective, you can reweight the variational bound from Equation (5) to focus on what matters.

Recall Equation (5) from Section 2:

\mathbb{E}_q\left[L_T + \sum_{t>1} L_{t-1} + L_0\right]

where:

$L_T$ = KL divergence at the final (pure noise) step
$L_{t-1}$ = KL divergence at intermediate steps
$L_0 = -\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)$ = reconstruction loss for the data itself

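The question: Should all these terms be weighted equally when training? Probably not. $L_T$ compares two nearly identical Gaussians (both close to pure noise) and, as Section 3.1 shows, does not depend on $\theta$ at all. The terms at small $t$, including $L_0$ (which reconstructs the image from the only slightly noisy $\mathbf{x}_1$), correspond to easy denoising problems, whereas the terms at larger $t$ ask the network to remove much more noise.

The simplified objective of Section 3.4 drops the coefficient that multiplies each $L_{t-1}$ term in the noise-prediction form of the bound. Relative to the true variational bound, this down-weights the small-$t$ terms so that the network can focus on the harder denoising tasks at larger $t$. The score-matching connection motivates the noise-prediction form; the reweighting itself is justified by the sample quality it produces.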

How This Section Is Organized

The paper states that its discussion is "categorized by the terms of Eq. (5)":

Section 3.1: The forward process and $L_T$ (the pure-noise term)
Section 3.2: The reverse process and the intermediate terms $L_{1:T-1}$ , including the score-matching connection (the theoretical centerpiece)
Section 3.3: Data scaling, the reverse process decoder, and $L_0$
Section 3.4: Proposes the simplified, weighted objective

This organization shows how theory (the score-matching connection) guides practical choices (reweighting).

Key Takeaway

This section is fundamentally about justifying design choices through theory:

Without this section: "We tried various parameterizations and found one that works empirically."

With this section: "These design choices emerge naturally from a principled connection to denoising score matching, which explains why they work."

This kind of theoretical grounding is valuable because:

It builds confidence in the approach
It suggests which hyperparameters matter most
It connects this work to the broader literature on score-based generative models
It enables future improvements informed by score-matching theory

The mathematical details in Sections 3.2–3.4 will flesh out this connection and derive the weighted objective, but the conceptual insight is already clear: diffusion models are denoising score matchers in disguise.

3.1 Forward process and $L_T$

p.3

We ignore the fact that the forward process variances $\beta_t$ are learnable by reparameterization and instead fix them...

Section 3.1: Forward Process and $L_T$ - Detailed Explanation

Big Picture: Why This Section Matters

Before diving into mathematics, let's understand what's happening here and why it's important:

The paper is describing how to train diffusion models by optimizing a variational bound (the loss function). In equation (5) from the background, this loss has three types of terms: $L_T$ , $L_{t-1}$ (for various $t$ ), and $L_0$ .

This section is about a useful simplification: the authors point out that one of these loss terms, specifically $L_T$ , can be ignored during training because it is a constant with respect to the learnable parameters. That means one fewer term to compute and a simpler training objective. Let's understand why this works.

Understanding the Forward Process Setup

Recall from equation (5) in the background:

L_T = D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \| p(\mathbf{x}_T))

Let me define the key quantities here:

$q(\mathbf{x}_T|\mathbf{x}_0)$ : This is the distribution of the data after $T$ steps of noise addition, starting from the original data $\mathbf{x}_0$ . Remember from equation (4) in the background: $q(\mathbf{x}_T|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_T; \sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)\mathbf{I})$
$p(\mathbf{x}_T)$ : This is the target distribution we've chosen—specifically $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$ (standard normal distribution)
$D_{\mathrm{KL}}$ : The Kullback-Leibler (KL) divergence, which measures how different two probability distributions are. It's non-negative, equals zero only when the distributions are identical, and higher values mean more different distributions.

The Key Insight: Why $L_T$ is Constant

Here's the crucial reasoning:

Step 1: Identify what has learnable parameters

In the training process, we have two things:

The forward process $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$ : This is defined by the variance schedule $\beta_1, \ldots, \beta_T$ (see equation 2). According to this section, we fix these variances to constants rather than learning them.
The reverse process $p_\theta(\mathbf{x}_{0:T})$ : This is what we're training—the neural network learns parameters $\theta$ to specify the mean and covariance of the reverse process transitions.

Step 2: Analyze $L_T$ specifically

L_T = D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \| p(\mathbf{x}_T))

Notice what happens here:

The left side $q(\mathbf{x}_T|\mathbf{x}_0)$ depends only on the forward process, which has no learnable parameters (we fixed $\beta_t$ )
The right side $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$ is a fixed standard normal distribution that doesn't depend on any parameters at all

Step 3: Why this means we can ignore it

Since neither side of the KL divergence in $L_T$ depends on the learnable parameters $\theta$ :

As we train and update $\theta$ , the value of $L_T$ never changes
It's a constant—like adding $+5$ to a loss function every training step
When we take gradients with respect to $\theta$ for optimization, the derivative of a constant is zero: $\frac{\partial L_T}{\partial \theta} = 0$

From an optimization perspective, constants don't affect which direction to move the parameters. Therefore, we can simply drop $L_T$ from the training objective without changing the optimal solution.

Mathematical Formulation

Let me show this more formally. The full loss from equation (5) is:

L = \mathbb{E}_q\left[L_T + \sum_{t>1} L_{t-1} - \log p_\theta(\mathbf{x}_0|\mathbf{x}_1)\right]

During training, we optimize by computing gradients:

\frac{\partial L}{\partial \theta} = \frac{\partial}{\partial \theta}\left[\mathbb{E}_q[L_T]\right] + \frac{\partial}{\partial \theta}\left[\mathbb{E}_q\left[\sum_{t>1} L_{t-1}\right]\right] + \frac{\partial}{\partial \theta}\left[\mathbb{E}_q[-\log p_\theta(\mathbf{x}_0|\mathbf{x}_1)]\right]

Since $L_T$ doesn't depend on $\theta$ (it only depends on the fixed forward process):

\frac{\partial}{\partial \theta}\left[\mathbb{E}_q[L_T]\right] = 0

Therefore, we can train using:

L_{\text{effective}} = \mathbb{E}_q\left[\sum_{t>1} L_{t-1} - \log p_\theta(\mathbf{x}_0|\mathbf{x}_1)\right]

without any loss in optimality.
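To see how small this constant is in practice, the sketch below evaluates $L_T$ in closed form for a single scalar pixel. It assumes the linear schedule reported in the paper's experiments ($T = 1000$, $\beta_t$ from $10^{-4}$ to $0.02$); the helper name `l_T_per_dim` is hypothetical.

```python
import math

T = 1000
# Linear variance schedule from 1e-4 to 0.02 (the setting reported in the paper's experiments).
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bar_T is the product of (1 - beta_t) over all T steps.
alpha_bar_T = 1.0
for beta in betas:
    alpha_bar_T *= 1.0 - beta

def l_T_per_dim(x_0, alpha_bar):
    """KL( N(sqrt(alpha_bar) * x_0, 1 - alpha_bar) || N(0, 1) ) in nats, for one scalar."""
    mean_sq = alpha_bar * x_0 ** 2
    var = 1.0 - alpha_bar
    return 0.5 * (var + mean_sq - 1.0 - math.log(var))

# x_0 = 1 is the largest magnitude a pixel can take after scaling to [-1, 1].
kl = l_T_per_dim(x_0=1.0, alpha_bar=alpha_bar_T)
print(alpha_bar_T, kl)  # both are tiny, and neither involves theta
```

Nothing in this computation touches the network parameters, which is the whole argument for dropping the term.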

Why This Design Choice Matters

Practical implications:

Simpler training: One fewer term to compute during each training step
Cleaner gradient flow: The optimization focuses only on the terms where the model can actually improve
Design justification: The authors note that this choice is "justified by simplicity and empirical results"

A philosophical note: The section explicitly states they ignore the possibility that $\beta_t$ could be learned through reparameterization. This is an intentional design choice—it simplifies the method and, as we'll see in Section 4, works empirically very well. Sometimes in machine learning, simpler approaches that fix certain components actually perform better than more complex ones.

Connection to the Bigger Picture

This section represents part of the authors' solution to a fundamental question: How do we design a diffusion model that trains efficiently and produces high-quality samples?

The answer involves:

Fixing forward process variances (this section)
Establishing connections to denoising score matching (Section 3.2)
Using a weighted loss that emphasizes the right noise levels (Section 3.4)

By removing the constant $L_T$ term, we simplify the training objective so the model can focus on learning what actually matters: the reverse process at noise levels where the data still has meaningful signal.

3.2 Reverse process and $L_{1:T-1}$

p.3

Now we discuss our choices in $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_...

Understanding Section 3.2: Reverse Process and the Mean Parameterization

The Big Picture

This section tackles a critical design decision in diffusion models: How should the neural network learn to predict the reverse process mean? In other words, when we're going backward from noise to images, what should our network actually output?

The section reveals something elegant: there are multiple mathematically equivalent ways to parameterize what the network predicts, and the authors show that predicting noise (denoted $\boldsymbol{\epsilon}_\theta$ ) is particularly effective. This choice connects diffusion models to a classical machine learning technique called denoising score matching, which provides theoretical justification for the approach.

Part 1: Setting the Variance (the Easy Choice)

Let's start simple. Recall from Equation (1) in the background that the reverse process is:

$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$

This says: "given noisy image $\mathbf{x}_t$ at step $t$ , the next step back (less noisy) is normally distributed with mean $\boldsymbol{\mu}_\theta$ and variance $\boldsymbol{\Sigma}_\theta$ ."

The first choice: Make the variance fixed and time-dependent.

The authors set: $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}$

where:

$\sigma_t^2$ is a constant (not learned) that depends only on timestep $t$
$\mathbf{I}$ is the identity matrix (meaning independent, equal variance in all dimensions)
The authors explore two options: $\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$

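Why this makes sense: The paper notes that these two choices are the optimal reverse variances in two extreme cases: $\sigma_t^2 = \beta_t$ when $\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ , and $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ when $\mathbf{x}_0$ is deterministically a single point. Both gave similar results in the authors' experiments. Because the variance is fixed by the schedule, the network only needs to learn the mean.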
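The two variance options can be compared numerically. The sketch below assumes the linear schedule from the paper's experiments ($T = 1000$, $\beta_t$ from $10^{-4}$ to $0.02$) and uses the posterior variance $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ with the convention $\bar{\alpha}_0 = 1$.

```python
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bars[t] holds the cumulative product up to step t, with alpha_bars[0] = 1.
alpha_bars = [1.0]
for beta in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - beta))

# Posterior variance: beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
beta_tildes = [
    (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t - 1]
    for t in range(1, T + 1)
]

for t in (1, 2, 10, 100, 1000):
    print(t, betas[t - 1], beta_tildes[t - 1])
```

Under this schedule $\tilde{\beta}_t$ never exceeds $\beta_t$, is exactly zero at $t = 1$, and becomes nearly identical to $\beta_t$ at large $t$, which is consistent with the two options behaving similarly.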

Part 2: The Mean—Finding the Right Parameterization

Now for the interesting part: What should the network output for the mean?

Starting Point: The Direct Approach

From Equation (5), the loss term is $L_{t-1}$ , which (after some calculation shown in Equation 8) equals:

$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|^2\right] + C$

What does this mean in words?

$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$ (defined in Eq. 7) is the true posterior mean—where we should go when reversing the noise
$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ is what our network predicts
The loss measures the squared distance between them
$C$ is just a constant that doesn't affect learning

The naive approach: Train a network to directly predict $\tilde{\boldsymbol{\mu}}_t$ . This would work, but there's a better way.

The Key Insight: Reparameterization

The authors show that we can expand Equation (8) further using a clever algebraic trick. Recall from Equation (4) that we can write any noisy sample $\mathbf{x}_t$ as:

$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is random Gaussian noise.

What does this mean? Any point in the noisy distribution is a weighted combination of:

The original image scaled by $\sqrt{\bar{\alpha}_t}$ (shrinks as $t$ increases)
Pure noise scaled by $\sqrt{1-\bar{\alpha}_t}$ (grows as $t$ increases)

After substituting this reparameterization into the loss (Equations 9-10), the authors derive that the network must predict:

$\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\right)$

The crucial observation: This expression involves the noise $\boldsymbol{\epsilon}$ that was added during the forward process! So instead of having the network predict $\tilde{\boldsymbol{\mu}}_t$ directly, we can have it predict the noise itself.
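The algebra behind this step can be checked numerically. The sketch below compares the forward-process posterior mean written in terms of $\mathbf{x}_0$ (Eq. 7) with the same mean written in terms of $\boldsymbol{\epsilon}$. The schedule values and function names are illustrative assumptions, not taken from the paper.

```python
import math
import random

def posterior_mean(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Forward-process posterior mean (Eq. 7), written in terms of x_t and x_0."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coef_x0 * x_0 + coef_xt * x_t

def mean_from_noise(x_t, eps, alpha_t, alpha_bar_t):
    """The same mean, written in terms of x_t and the noise eps."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)

random.seed(0)
alpha_t, alpha_bar_prev = 0.98, 0.30   # illustrative values only
alpha_bar_t = alpha_t * alpha_bar_prev
x_0, eps = 0.4, random.gauss(0.0, 1.0)
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

mu_a = posterior_mean(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev)
mu_b = mean_from_noise(x_t, eps, alpha_t, alpha_bar_t)
print(mu_a, mu_b)  # agree up to floating-point rounding
```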

The Noise Prediction Parameterization (Equation 11)

The authors propose:

$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)$

where $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is a neural network trained to predict the noise.

Why is this genius? Let's break down what happens:

The network learns to denoise: $\boldsymbol{\epsilon}_\theta$ takes a noisy image and predicts what noise was added
The formula reconstructs the mean: Once we know the predicted noise, the fraction subtracts it out and rescales to get a less-noisy version
It connects to prior work: This becomes equivalent to denoising score matching, a classical technique

Sampling with the Noise Parameterization

When actually generating images (during inference), we sample:

$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$

where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is fresh random noise.

Intuition:

The first fraction removes predicted noise from the current step
The $+ \sigma_t \mathbf{z}$ term adds back a little noise to maintain proper variance (because the process is stochastic)
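The sampling loop can be sketched for scalar data as follows. This is a toy under stated assumptions: the "dataset" is the single point `X0_TRUE`, the trained network is replaced by a hypothetical `eps_oracle` that returns the exact noise for that point, the schedule is the linear one from the paper's experiments, and $\sigma_t^2 = \beta_t$.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

X0_TRUE = 0.5  # hypothetical one-point "dataset", used only to build the oracle

def eps_oracle(x_t, t):
    """Stand-in for eps_theta: the exact noise when the data is always X0_TRUE."""
    ab = alpha_bars[t - 1]
    return (x_t - math.sqrt(ab) * X0_TRUE) / math.sqrt(1.0 - ab)

def reverse_step(x_t, t, eps_model, rng):
    """One ancestral sampling step x_t -> x_{t-1}, using sigma_t^2 = beta_t."""
    beta, alpha, ab = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    mean = (x_t - beta / math.sqrt(1.0 - ab) * eps_model(x_t, t)) / math.sqrt(alpha)
    z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise is added on the final step
    return mean + math.sqrt(beta) * z

rng = random.Random(0)
x = rng.gauss(0.0, 1.0)  # start from x_T ~ N(0, 1)
for t in range(T, 0, -1):
    x = reverse_step(x, t, eps_oracle, rng)
print(x)  # recovers X0_TRUE up to rounding, because the oracle is exact
```

With a real model, `eps_oracle` would be replaced by the trained network, and the output would be a new sample rather than a known point.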

Part 3: The Connection to Denoising Score Matching

With the noise parameterization in place, the loss simplifies dramatically to Equation (12):

$\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t(1-\bar{\alpha}_t)}\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\right\|^2\right]$

This is the key equation. What does it say?

The network is trained to predict: actual noise $\boldsymbol{\epsilon}$ vs. predicted noise $\boldsymbol{\epsilon}_\theta$
We're measuring squared error between them
We're doing this for multiple noise scales (indexed by $t$ )
The weighting factor out front varies with the schedule

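Why "denoising score matching"? Score matching trains a model to estimate the score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ ; its denoising variant does so by perturbing data with noise and learning to undo the perturbation. Equation (12) has exactly this form, evaluated at multiple noise scales indexed by $t$ , which connects the diffusion framework to the existing literature on score-based generative models.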

Part 4: Why This Parameterization?

The section concludes by noting three possible parameterizations:

Predict $\tilde{\boldsymbol{\mu}}_t$ : The mean directly (straightforward but less elegant)
Predict $\boldsymbol{\epsilon}$ : The noise (what's discussed extensively)
Predict $\mathbf{x}_0$ : The original clean image (tried but didn't work as well)

The authors chose option 2 because:

It has strong theoretical justification (connection to denoising score matching)
It resembles Langevin dynamics (a classical sampling technique from statistical physics)
It empirically works better (verified in Section 4 experiments)

Summary: The Key Takeaway

Instead of making a neural network directly learn what the next reverse step should be, we make it learn to predict and remove noise. This is:

✓ Mathematically equivalent to the original formulation
✓ Connected to classical machine learning theory (denoising score matching)
✓ Connected to classical physics (Langevin dynamics)
✓ Empirically more effective in practice

The elegance of this section is showing that what might seem like an arbitrary choice (predict noise instead of mean) actually falls out naturally from careful mathematical analysis.

L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\|^2\right] + C

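Three quick checks of the squared-error structure in this loss:

Weighting values for different noise levels: for $\sigma \in \{0.1, 0.5, 1, 2\}$ , the weight $1/(2\sigma^2)$ equals $50$ , $2$ , $0.5$ , and $0.125$ , so smaller reverse-process variances penalize mean errors more heavily.
A concrete single-sample loss: with $\sigma = 0.5$ , true mean $1.5$ , and predicted mean $1.2$ , the loss is $\frac{1}{2 \cdot 0.25}(1.5 - 1.2)^2 = 2 \cdot 0.09 = 0.18$ .
The gradient with respect to the prediction error $\mathbf{e} = \boldsymbol{\mu}_\theta - \tilde{\boldsymbol{\mu}}_t$ is $\mathbf{e}/\sigma_t^2$ , so it is linear in the error.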
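These small computations are easy to reproduce. The sketch below evaluates the weight $1/(2\sigma^2)$ at several noise levels and the single-sample loss for the concrete numbers mentioned above; the function name `mean_matching_loss` is hypothetical.

```python
def mean_matching_loss(true_mean, predicted_mean, sigma):
    """Single-sample value of (1 / (2 sigma^2)) * (true_mean - predicted_mean)^2."""
    return (true_mean - predicted_mean) ** 2 / (2.0 * sigma ** 2)

# Weight in front of the squared error for several noise levels.
for sigma in (0.1, 0.5, 1.0, 2.0):
    print(sigma, 1.0 / (2.0 * sigma ** 2))

# Concrete example: sigma = 0.5, true mean 1.5, predicted mean 1.2.
loss = mean_matching_loss(true_mean=1.5, predicted_mean=1.2, sigma=0.5)
print(loss)
```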

L_{t-1} - C = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon})\right) - \boldsymbol{\mu}_\theta(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), t)\right\|^2\right]

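As a toy illustration, take a constant $\alpha_t = 0.9$ at every step, so that $\bar{\alpha}_t = 0.9^t$ (a much more aggressive schedule than the one used in the paper). At timestep $t=25$ :

$\sqrt{\bar{\alpha}_{25}} \approx 0.268$ : The signal coefficient is ~0.27
$\sqrt{1-\bar{\alpha}_{25}} \approx 0.963$ : The noise coefficient is ~0.96
$1-\bar{\alpha}_{25} \approx 0.928$ : For unit-variance data, about 93% of the variance of $\mathbf{x}_{25}$ comes from noise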

So if the original clean sample is $\mathbf{x}_0$ (say, a coherent image), then:

$\mathbf{x}_{25} = 0.268 \cdot \mathbf{x}_0 + 0.963 \cdot \boldsymbol{\epsilon}$

where $\boldsymbol{\epsilon}$ is random Gaussian noise. The model sees mostly noise but must learn to recover the clean image. The noise prediction network learns to estimate what $\boldsymbol{\epsilon}$ was, so it can subtract it out.
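The coefficients in this example can be reproduced directly. The sketch below assumes the toy schedule $\bar{\alpha}_t = 0.9^t$ and an arbitrary scalar $\mathbf{x}_0 = 0.7$; the function name `forward_weights` is hypothetical.

```python
import math
import random

def forward_weights(t, alpha=0.9):
    """Signal and noise coefficients when every step uses the same alpha (toy schedule)."""
    alpha_bar = alpha ** t
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

signal_w, noise_w = forward_weights(25)
print(signal_w, noise_w)

# Draw one noisy sample x_25 from a scalar x_0 in a single step.
random.seed(0)
x_0 = 0.7  # arbitrary clean value, for illustration only
eps = random.gauss(0.0, 1.0)
x_25 = signal_w * x_0 + noise_w * eps
print(x_25)
```

The squared coefficients always sum to one, which is what keeps the overall scale of $\mathbf{x}_t$ under control as $t$ grows.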

Why This Reparameterization Matters

The progression from Equation (8) → (9) → (10) → (11) → (12) reveals three ways to parameterize what the network predicts when training a diffusion model:

Parameterization	What the network predicts	Loss function
Direct (Eq. 8)	$\tilde{\boldsymbol{\mu}}_t$ (true posterior mean)	MSE between predicted and true mean
X₀-prediction	Original clean sample $\mathbf{x}_0$	MSE in data space (less effective)
$\boldsymbol{\epsilon}$ -prediction (Eq. 11-12)	Noise component $\boldsymbol{\epsilon}$	Denoising score matching loss

The authors prefer $\boldsymbol{\epsilon}$ -prediction because:

✅ It connects diffusion to score matching theory
✅ It simplifies the loss to a cleaner form
✅ It gave better sample quality in their experiments when combined with the simplified objective
✅ It relates to Langevin dynamics, connecting to statistical physics

Connection to Sampling

When sampling $\mathbf{x}_{t-1}$ from the learned reverse process:

$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$

This is remarkably intuitive:

First term: Removes predicted noise from $\mathbf{x}_t$ , rescaled appropriately
Second term: Adds back stochasticity with scale $\sigma_t$ to match the reverse process distribution
Result: A step toward cleaner data, with controlled variance

This resembles Langevin dynamics, where $\boldsymbol{\epsilon}_\theta$ acts (up to a negative scale factor) as a learned gradient of the data log-density.

Summary

Equation (9) is a pivotal step that:

Reparameterizes the variational bound using the forward process formula
Reveals that training reduces to predicting the denoising direction
Enables the elegant noise-prediction parameterization (Eq. 11)
Simplifies to denoising score matching (Eq. 12)
Connects discrete diffusion models to continuous score-based models and Langevin dynamics

The beauty of this derivation is that it transforms a complex probabilistic inference problem into a simple noise-prediction task, which is both theoretically motivated and practically effective.

Plot: signal weight $\sqrt{\bar{\alpha}_t}$ and noise weight $\sqrt{1-\bar{\alpha}_t}$ across timesteps for the constant- $\alpha_t$ schedule $\bar{\alpha}_t = 0.9^t$ , $t = 0$ to $100$ .

Signal and noise weights at timestep $t=25$ for this schedule:

$\sqrt{0.9^{25}} \approx 0.267936$ , $\sqrt{1 - 0.9^{25}} \approx 0.963437$ , $1 - 0.9^{25} \approx 0.92821$ , $0.9^{25} \approx 0.0717898$

3.3 Data scaling, reverse process decoder, and $L_0$

p.4

We assume that image data consists of integers in $\{0, 1, \ldots, 255\}$ scaled linearly to $[-1, 1]$. This ensures tha...

Section 3.3: Data Scaling, Reverse Process Decoder, and $L_0$

The Big Picture

This section addresses a practical but crucial problem: how do we handle the fact that images are discrete (pixel values are integers from 0 to 255) when our diffusion model is built on continuous Gaussian distributions?

Up until this point in the paper, the diffusion process has assumed continuous data. But real image data is discrete—each pixel is an integer. This section explains:

How to scale discrete image data appropriately
How to design the final step of the reverse process to convert back to discrete pixels
Why this approach yields a valid variational bound (a lossless codelength) on discrete data without extra corrections

Let's work through this step by step.

Part 1: Data Scaling ( $[-1, 1]$ normalization)

The Problem and Solution

Problem: Image pixels naturally range from 0 to 255 (integer values). The diffusion model's reverse process starts from $p(\mathbf{x}_T)$ , which is a standard normal distribution (mean 0, variance 1). If we fed raw pixel values (0-255) into the neural network, the scales would be completely mismatched.

Solution: Scale all pixel values linearly to the range $[-1, 1]$ .

Mathematically, if $x_{\text{pixel}} \in \{0, 1, \ldots, 255\}$ , we transform it to:

x_{\text{scaled}} = \frac{2 \cdot x_{\text{pixel}}}{255} - 1 \in [-1, 1]

Why this matters:

Data in $[-1, 1]$ is centered near zero and on a scale comparable to the standard normal prior (zero mean, unit variance) from which the reverse process starts
The neural network $\boldsymbol{\epsilon}_\theta$ (or $\boldsymbol{\mu}_\theta$ ) now sees consistently scaled inputs throughout training
This improves numerical stability and learning efficiency
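The scaling is a one-line transformation. The sketch below implements it for a single pixel, together with a hypothetical inverse helper `unscale_pixel` (not described in the paper) that rounds and clips back to an integer.

```python
def scale_pixel(pixel):
    """Map an integer pixel in {0, ..., 255} linearly to [-1, 1]."""
    return 2.0 * pixel / 255.0 - 1.0

def unscale_pixel(x):
    """Inverse map, rounding and clipping back to an integer in {0, ..., 255}."""
    return min(255, max(0, round((x + 1.0) * 255.0 / 2.0)))

print(scale_pixel(0), scale_pixel(127), scale_pixel(255))
print(unscale_pixel(scale_pixel(200)))
```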

Part 2: The Discrete Decoder and $L_0$

The Core Challenge

Here's the subtle issue: our continuous Gaussian model describes $p_\theta(\mathbf{x}_0 | \mathbf{x}_1)$ as a continuous distribution. But we want to report a log-likelihood for discrete data—the actual integers from 0-255.

A naive approach would be to:

Sample $\mathbf{x}_0$ from the Gaussian
Round to the nearest integer
Report the probability

Problem with this: Rounding loses information and isn't differentiable. We'd be computing the log-likelihood of a distribution that doesn't perfectly match our actual data model.

The Solution: Discretized Continuous Decoder

Instead, the authors use a clever approach: integrate the continuous Gaussian over the region corresponding to each discrete value.

Look at Equation (13). For each pixel coordinate $i$ , the probability of observing a discrete pixel value $x_0^i$ is:

p_\theta(x_0^i|\mathbf{x}_1) = \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}(x; \mu_\theta^i(\mathbf{x}_1, 1), \sigma_1^2)\, dx

Let me break down what this means:

The variables:

$x_0^i$ is the $i$ -th coordinate (pixel) of $\mathbf{x}_0$ , a discrete integer in $\{0, 1, \ldots, 255\}$ (after scaling, in the range $[-1, 1]$ )
$\mu_\theta^i(\mathbf{x}_1, 1)$ is the mean of the Gaussian predicted by the neural network for coordinate $i$
$\sigma_1^2$ is the variance at step $t=1$ (the last noisy step before reaching $\mathbf{x}_0$ )
$\mathcal{N}(x; \mu, \sigma^2)$ is the Gaussian probability density function: $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

What the integral does: Instead of asking "what's the probability density at exactly $x_0^i$ ?", we ask "what's the total probability mass in the interval around $x_0^i$ ?"

The Binning Intervals: $\delta_+$ and $\delta_-$

The notation is a bit cryptic, so let's unpack it. The functions $\delta_+$ and $\delta_-$ define the boundaries of the integration interval:

\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}

\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}

Interpretation:

Since we scaled pixels from $\{0, 1, \ldots, 255\}$ to $[-1, 1]$ , each discrete pixel value now corresponds to a small interval of width $\frac{2}{255}$ in the continuous space.

For pixel value 0 (scaled to $-1$ ): the interval is $(-\infty, -1 + \frac{1}{255}]$ (all values from $-\infty$ up to just past $-1$ )
For pixel value 127 (scaled to $-\frac{1}{255} \approx -0.004$ ): the interval is $[-\frac{2}{255}, 0]$ (centered at $-\frac{1}{255}$ with width $\frac{2}{255}$ )
For pixel value 255 (scaled to $1$ ): the interval is $[1 - \frac{1}{255}, \infty)$ (all values from just below $1$ to $\infty$ )

Why use $\infty$ at the boundaries? This ensures that probability mass beyond the valid pixel range gets assigned to the extreme pixel values. For example, any predicted value below $-1$ contributes to the probability of pixel 0.
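The binning can be sketched for a single pixel using the Gaussian CDF. The decoder mean and standard deviation below are hypothetical values chosen for illustration, and `pixel_probability` computes one per-coordinate factor of Equation (13).

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), with explicit handling of +/- infinity."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pixel_probability(pixel, mu, sigma):
    """Probability mass the Gaussian assigns to one integer pixel value."""
    x = 2.0 * pixel / 255.0 - 1.0
    upper = math.inf if pixel == 255 else x + 1.0 / 255.0   # delta_plus
    lower = -math.inf if pixel == 0 else x - 1.0 / 255.0    # delta_minus
    return normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)

mu, sigma = 0.2, 0.05  # hypothetical decoder mean and standard deviation
probs = [pixel_probability(p, mu, sigma) for p in range(256)]
best = max(range(256), key=lambda p: probs[p])
print(best, sum(probs))  # the 256 bins tile the real line, so the masses sum to 1
```

Because the bins are adjacent and the two end bins extend to infinity, the result is a proper probability distribution over the 256 pixel values.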

Why This Approach is Clever

This discrete decoder has three important properties:

Lossless codelength: The integral represents the exact probability of observing a discrete pixel value, given the continuous Gaussian model. The variational bound computed using these probabilities is a true lower bound on the log-likelihood of discrete data.
No added noise: Unlike some VAE approaches that add noise to discrete data before processing, we don't corrupt the data. The discretization is in the model, not the data.
No Jacobian correction: The scaling from $\{0, 1, \ldots, 255\}$ to $[-1, 1]$ is linear and fixed. We don't need to worry about how this transformation affects probability densities (which would require computing Jacobians) because our final decoder directly models probabilities over the discrete values.

Part 3: Sampling and Reporting

During Sampling

When we're actually sampling images from the model:

We sample $\mathbf{x}_T$ from $p(\mathbf{x}_T)$ and run the reverse process (Algorithm 2) to get to $\mathbf{x}_1$
For the final step, instead of sampling from the discretized distribution in Equation (13), the authors simply output $\boldsymbol{\mu}_\theta(\mathbf{x}_1, 1)$ without adding noise

This is written as: "At the end of sampling, we display $\boldsymbol{\mu}_\theta(\mathbf{x}_1, 1)$ noiselessly."

Why no noise at the final step?

The mean prediction $\boldsymbol{\mu}_\theta(\mathbf{x}_1, 1)$ is already optimized to be a good reconstruction of the original data
Adding additional Gaussian noise would only blur the final image
Since $\sigma_1$ is very small (remember, we're very close to the data in the reverse process), the Gaussian in Equation (13) is already quite peaked around the mean anyway

Part 4: Connection to Other Methods

The authors note that this discrete decoder strategy is similar to approaches used in:

Variational Autoencoders (VAEs): VAEs also face the problem of modeling discrete data with continuous latent variables
Autoregressive models: Models like PixelCNN also need to convert continuous models to discrete predictions

The authors also remark that a more powerful decoder, such as a conditional autoregressive model, could replace the simple Gaussian integral. They leave that to future work; the current approach works well and is simpler.

Summary: Why All This Matters

To compute a valid log-likelihood bound on discrete image data using a continuous diffusion model, we need to:

Scale the data appropriately ( $[-1, 1]$ ) so it's compatible with the continuous model
Design the final decoder to integrate the Gaussian over discrete pixel regions, giving us proper probabilities for integer values
Avoid shortcuts like rounding or adding fake noise, which would make our log-likelihood computation invalid

This section ensures that when the paper reports likelihood-based results (log-likelihoods or codelengths), they are valid quantities for the actual discrete data rather than approximations.

3.4 Simplified training objective

p.4

With the reverse process and decoder defined above, the variational bound, consisting of terms derived from Eqs. (12) an...

Section 3.4: Simplified Training Objective

Big Picture: Why This Section Matters

The authors introduce a practical simplification of the training objective derived in the previous sections. Here's the key insight: the variational bound from Section 3.2 (Equation 12) carries a $t$ -dependent weighting factor in front of each squared-error term. The authors propose dropping these weights to create a simpler objective that they found gives better sample quality in practice and is simpler to implement.

This is an important moment in machine learning research—sometimes theory and practice diverge, and empirical evidence wins. The authors are being transparent about this trade-off.

Understanding the Simplified Objective (Equation 14)

The Basic Form

$L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\right\|^2\right]$

Let me break down what each component means:

What we're computing:

A loss function (denoted by $L_{\text{simple}}$ ) that measures how well the neural network learns to denoise images
The loss is the expected squared error between:
- The true noise $\boldsymbol{\epsilon}$ (the actual random noise added to an image)
- The predicted noise $\boldsymbol{\epsilon}_\theta(\cdot)$ (what the neural network thinks the noise is)

The expectation notation $\mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}[\cdot]$ means we're averaging over three random things:

$t$ : which timestep in the diffusion process (ranging from 1 to $T$ )
$\mathbf{x}_0$ : which original clean image we start with
$\boldsymbol{\epsilon}$ : which specific noise was added

What Gets Plugged Into the Network

The network $\boldsymbol{\epsilon}_\theta(\cdot)$ receives:

A noisy image: $\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$
- This is the forward process equation from Section 3.1
- $\sqrt{\bar{\alpha}_t}\mathbf{x}_0$ is the original image scaled down by a factor that decreases with $t$
- $\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ is noise scaled up by a factor that increases with $t$
- At small $t$ , mostly signal; at large $t$ , mostly noise
The timestep: $t$
- The network needs to know which step it's at, because the noise characteristics change at each timestep
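The pieces above can be put together in a minimal sketch of one Monte Carlo sample of the Eq. 14 integrand. It is pure Python so it runs without dependencies. The linear schedule values are taken from Section 4; `toy_eps_model` is a hypothetical stand-in for the real network (it always predicts zero noise), and the four-element `x0` is a made-up "image".

```python
import math
import random

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule (values from Section 4 of the paper)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def toy_eps_model(x_t, t):
    """Hypothetical placeholder for the U-Net: always predicts zero noise."""
    return [0.0 for _ in x_t]

def l_simple_sample(x0, eps_model, abar, rng):
    """Draw (t, eps), build the noisy input, return the squared error."""
    t = rng.randint(1, len(abar))               # t ~ Uniform{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # eps ~ N(0, I)
    a = abar[t - 1]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

rng = random.Random(0)
abar = alpha_bars(linear_betas())
x0 = [0.5, -0.25, 0.0, 1.0]                     # toy data scaled to [-1, 1]
losses = [l_simple_sample(x0, toy_eps_model, abar, rng) for _ in range(2000)]
mean_loss = sum(losses) / len(losses)
print(f"estimated L_simple for the zero predictor: {mean_loss:.3f}")
```

Because the placeholder predicts zero, the loss is just the squared norm of the noise, so the estimate should sit near the data dimension (4 here). A trained network would drive this number down.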

The Key Simplification: Dropping the Weights

To understand what was simplified away, compare Equation 14 to Equation 12:

Original (Equation 12) — with weighting factors: $\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2\sigma_t^2 \alpha_t(1-\bar{\alpha}_t)}\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\cdots)\right\|^2\right]$

Simplified (Equation 14) — without weighting factors: $\mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\cdots)\right\|^2\right]$

Notice that the coefficient in front, $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t(1-\bar{\alpha}_t)}$, has been removed. This simplifies training because:

No need to compute or store these complex weighting factors
Numerically more stable (no division by tiny numbers at early timesteps)
Implementation is cleaner
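To get a feel for the coefficient that was dropped, the sketch below evaluates it at a few timesteps. It assumes the linear schedule from Section 4 and the fixed variance choice $\sigma_t^2 = \beta_t$ (one of the two choices the paper discusses); the function names are illustrative only.

```python
def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule (values from Section 4 of the paper)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def bound_weight(t, betas, abar):
    """Eq. 12 coefficient beta_t^2 / (2 sigma_t^2 alpha_t (1 - alpha_bar_t))."""
    beta = betas[t - 1]
    alpha = 1.0 - beta
    sigma_sq = beta                      # assumption: sigma_t^2 = beta_t
    return beta ** 2 / (2.0 * sigma_sq * alpha * (1.0 - abar[t - 1]))

betas = linear_betas()
abar = alpha_bars(betas)
weights = {t: bound_weight(t, betas, abar) for t in (2, 10, 100, 1000)}
for t, w in weights.items():
    print(f"t={t:4d}  weight={w:.4f}")
```

Under these assumptions the coefficient is largest for the first few timesteps and far smaller once substantial noise has accumulated, so replacing it with a constant reduces the relative weight of the low-noise terms.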

Why This Simpler Loss Actually Works Better

The Loss Weighting Story

The authors explain that discarding the theoretical weights implicitly reweights the timesteps relative to the variational bound. Here's what happens:

At small $t$ (the last steps of sampling, small amounts of noise):

The theoretical weight $\frac{\beta_t^2}{2\sigma_t^2 \alpha_t(1-\bar{\alpha}_t)}$ is at its largest, because $1-\bar{\alpha}_t$ is tiny (about 0.27 at $t = 2$ with $\sigma_t^2 = \beta_t$ and the schedule from Section 4)
So the original bound puts its heaviest weight on these terms
Replacing the weight with a constant therefore DOWN-WEIGHTS these terms relative to the bound
Result: The network doesn't waste capacity learning to denoise almost-clean images

At large $t$ (the first steps of sampling, lots of noise):

The theoretical weight is much smaller (about 0.01 at $t = 1000$ under the same assumptions)
With uniform weighting, these hard denoising tasks gain relative emphasis
Result: The network focuses its capacity on the harder tasks

The Intuition

Think of it like learning to remove noise from photos:

If a photo is 99% clear with 1% noise, it's easy—don't spend much time on this
If a photo is 30% signal and 70% noise, it's hard—spend more time learning to do this well
By down-weighting easy tasks, the network becomes better at hard tasks overall

This is counterintuitive but makes sense: a good denoiser should excel at difficult noise levels, because those determine the final sample quality.

The Three Cases: How $t$ Relates to the Losses

The authors clarify how their simplified objective connects back to the theoretical framework:

Case 1: $t = 1$ (First denoising step)

This corresponds to $L_0$ from the variational bound
Technically, Equation 13 (the discrete decoder) involves an integral
The simplification approximates this integral by:
- Treating the Gaussian like it's flat across each bin (using density times bin width)
- Ignoring the variance $\sigma_1^2$ (treating it as small)
- Ignoring edge effects at $-1$ and $+1$ boundaries
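The first of these approximations can be checked numerically. The sketch below compares the exact Gaussian mass of one pixel bin of width $2/255$ with the density-times-width approximation. It assumes $\sigma_1^2 = \beta_1 = 10^{-4}$ and an interior pixel whose predicted mean equals the true value; both are illustrative choices, not values taken from the paper's tables.

```python
import math

def normal_cdf(x, mu, sigma):
    """Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu, sigma):
    """Gaussian probability density function."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bin_probability(x, mu, sigma):
    """Exact mass of the bin [x - 1/255, x + 1/255] (interior pixels only)."""
    half_width = 1.0 / 255.0
    return normal_cdf(x + half_width, mu, sigma) - normal_cdf(x - half_width, mu, sigma)

def density_times_width(x, mu, sigma):
    """Approximation: treat the density as flat across the bin."""
    return normal_pdf(x, mu, sigma) * (2.0 / 255.0)

sigma_1 = math.sqrt(1e-4)        # assumption: sigma_1^2 = beta_1
x, mu = 0.2, 0.2                 # pixel value and predicted mean (illustrative)
exact = bin_probability(x, mu, sigma_1)
approx = density_times_width(x, mu, sigma_1)
rel_error = abs(approx - exact) / exact
print(f"exact={exact:.4f}  approx={approx:.4f}  relative error={rel_error:.3%}")
```

Once the integral is replaced by a density, the log-probability becomes a squared error plus constants, which is what lets the $t = 1$ term share the same form as the other terms of Eq. 14.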

Cases 2 and beyond: $t > 1$

These correspond to the denoising objectives from Equation 12
The simplification uses an "unweighted version"—meaning we drop the weighting factor
This is analogous to how NCSN (Noise Conditional Score Networks) weights their losses

Why $L_T$ is absent:

Recall from Section 3.1: the forward process variances $\beta_t$ are fixed constants (not learned)
Therefore $L_T$ becomes a constant that doesn't affect training
No need to include it in the loss function

Sampling with the Simplified Objective

An important point: using a simpler training objective doesn't change how you generate samples. Once the network is trained on Equation 14, you still use Algorithm 2 from Section 3.2 to sample:

$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$

The sampling procedure is unchanged; only the training loss changes.
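As a rough illustration of that update rule, here is a pure-Python sketch of the sampling loop. It assumes $\sigma_t^2 = \beta_t$ and the linear schedule from Section 4. Since no trained network is available here, `oracle_eps` is a hypothetical stand-in: it returns the exact noise for a made-up dataset in which every pixel equals 0.5, so the loop should land back on that value.

```python
import math
import random

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule (values from Section 4 of the paper)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def ddpm_sample(eps_model, dim, betas, rng):
    """Ancestral sampling in the style of Algorithm 2 (sigma_t^2 = beta_t assumed)."""
    abar = alpha_bars(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]            # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        beta = betas[t - 1]
        coef = beta / math.sqrt(1.0 - abar[t - 1])
        sigma = math.sqrt(beta) if t > 1 else 0.0            # z = 0 at the final step
        eps_hat = eps_model(x, t)
        x = [(xi - coef * e) / math.sqrt(1.0 - beta) + sigma * rng.gauss(0.0, 1.0)
             for xi, e in zip(x, eps_hat)]
    return x

BETAS = linear_betas()
ABAR = alpha_bars(BETAS)
TARGET = 0.5    # hypothetical one-point dataset: every pixel equals 0.5

def oracle_eps(x_t, t):
    """Exact noise for the one-point dataset; stands in for a trained network."""
    a = ABAR[t - 1]
    return [(xi - math.sqrt(a) * TARGET) / math.sqrt(1.0 - a) for xi in x_t]

sample = ddpm_sample(oracle_eps, dim=4, betas=BETAS, rng=random.Random(0))
print(sample)
```

Swapping `oracle_eps` for a network trained on Eq. 14 (or on the weighted bound) leaves `ddpm_sample` untouched, which is the point made above.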

Summary: The Core Contribution

Aspect	Theoretical (Eq. 12)	Practical (Eq. 14)
Weighting factors	Timestep-dependent coefficient	Uniform over timesteps
Implementation	MSE scaled by a per-timestep coefficient	Plain MSE
Weight on easy tasks (small $t$)	Relatively large	Down-weighted
Weight on hard tasks (large $t$)	Relatively small	Up-weighted relatively
Sample quality	Good	Better

The key takeaway: Sometimes dropping theoretical baggage and using a simpler loss with implicit reweighting produces better results. This is why empirical validation matters in machine learning.

$L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\right\|^2\right]$

Here's a concrete example:

Given:

True noise: $\boldsymbol{\epsilon} = [0.1, 0.2, 0.15]$
Network prediction: $\boldsymbol{\epsilon}_\theta = [0.08, 0.22, 0.13]$

The loss for this example: $\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\right\|^2 = (0.02)^2 + (-0.02)^2 + (0.02)^2 = 0.0012$

The actual training loss would average this over many samples at different timesteps $t$ , on different data points $\mathbf{x}_0$ , and with different noise samples $\boldsymbol{\epsilon}$ .
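The arithmetic above can be reproduced in a couple of lines. This is only a check of the worked example; the variable names are illustrative.

```python
eps = [0.1, 0.2, 0.15]            # true noise from the example
eps_pred = [0.08, 0.22, 0.13]     # network prediction from the example

# Squared L2 norm of the difference, as in Eq. 14
squared_error = sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

# Many frameworks report the mean over dimensions instead of the sum
mean_squared_error = squared_error / len(eps)

print(f"sum of squares: {squared_error:.4f}, mean: {mean_squared_error:.4f}")
```

Whether an implementation sums or averages over dimensions only rescales the loss by a constant, so it does not change which parameters minimize it.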

Summary of Key Points

Aspect	Meaning
Purpose	Train a neural network to predict noise added during diffusion
Input to network	Noisy image ( $\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ ) and timestep $t$
Target	The true random noise $\boldsymbol{\epsilon}$ that was added
Loss	Mean squared error between predicted and true noise
Key insight	Dropping the weights shifts emphasis toward hard timesteps (high noise) relative to the variational bound, improving sample quality
Why it works	Network capacity is not spent on nearly clean inputs, where denoising is easy

This simple-looking equation is actually quite clever: by dropping the theoretically justified weights and treating all timesteps equally, the training procedure shifts emphasis toward the harder, high-noise denoising tasks, leading to better sample quality in practice.

[Plot: $\sqrt{0.9995^x}$ and $\sqrt{1 - 0.9995^x}$ for $x$ from 0 to 1000, visualizing how the signal-to-noise ratio changes over diffusion timesteps]

Squared error term for the concrete example: $\left\|\mathbf{v}_1 - \mathbf{v}_2\right\|^2$ where $\mathbf{v}_1 = (0.1, 0.2, 0.15)$ and $\mathbf{v}_2 = (0.08, 0.22, 0.13)$

4 Experiments


We set $T = 1000$ for all experiments so that the number of neural network evaluations needed during sampling matches pr...

Section 4: Experiments - A Detailed Explanation

Big Picture: What Are We Doing Here?

This section describes the practical implementation choices the authors made when training and using their diffusion model. Think of it as the "engineering manual" for the theory developed in earlier sections. The authors need to make concrete decisions about:

How many diffusion steps to use ( $T$ )
How much noise to add at each step (the $\beta_t$ values)
What neural network architecture to train

These aren't arbitrary choices—they're carefully calibrated based on the theory and earlier findings from related work. This section justifies each choice and explains the resulting design.

Part 1: Setting the Number of Steps and Noise Schedule

The Key Parameters: $T$ and $\{\beta_t\}$

The Number of Steps ( $T = 1000$ ):

The authors set $T = 1000$ , meaning the diffusion process has 1000 steps. This number was chosen to match previous diffusion work [53, 55]. Why does this matter? Recall from earlier sections that at each step, the model must make a neural network prediction (the $\epsilon_\theta$ function from Eq. (11)). More steps means more predictions during sampling, which affects computational cost. By matching previous work, they ensure fair comparison.

The Variance Schedule: Linear Increase from $\beta_1$ to $\beta_T$

Now for the noise levels. Remember from the forward process (mentioned in previous sections) that $\beta_t$ controls how much noise is added at step $t$ . The authors set:

$\beta_1 = 10^{-4}, \quad \beta_T = 0.02, \quad \text{increasing linearly}$

What does this mean geometrically? Think of the diffusion process as a journey through noise space:

At step 1: Add a tiny amount of noise ( $\beta_1 = 0.0001$ )
At step 1000: Add substantially more noise ( $\beta_T = 0.02$ )
In between: Increase smoothly and linearly

These values were "chosen to be small relative to data scaled to $[-1, 1]$ " (as mentioned in Section 3.3). Since image data was scaled to the range $[-1, 1]$ , these noise levels are appropriately small—they don't immediately destroy the image structure.
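The schedule is simple enough to tabulate directly. The sketch below builds the linear $\beta_t$ values quoted above and prints the signal scale $\sqrt{\bar{\alpha}_t}$ and noise scale $\sqrt{1-\bar{\alpha}_t}$ at a few timesteps. The helper names are illustrative, and the exact interpolation (endpoints included, evenly spaced) is an assumption about how "increasing linearly" is implemented.

```python
import math

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Evenly spaced values from beta_1 to beta_T, endpoints included."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

betas = linear_betas()
abar = alpha_bars(betas)

for t in (1, 250, 500, 750, 1000):
    signal = math.sqrt(abar[t - 1])
    noise = math.sqrt(1.0 - abar[t - 1])
    print(f"t={t:4d}  beta={betas[t - 1]:.5f}  signal={signal:.4f}  noise={noise:.4f}")
```

The signal scale shrinks monotonically while the noise scale grows toward 1, so by the last step the sample is almost entirely noise.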

Why These Specific Values? Three Justifications

1. Maintaining Process Symmetry:

The authors state that "reverse and forward processes have approximately the same functional form." What does this mean mathematically?

In the forward process, at step $t$ , we have: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$

where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ (this accumulates all the noise up to step $t$ ).

In the reverse process (Eq. 11), we predict $\boldsymbol{\epsilon}$ and then take a Gaussian step back toward the data. Keeping each $\beta_t$ small matters here: when every forward step adds only a little Gaussian noise, the true reverse step is also approximately Gaussian, so the Gaussian reverse process assumed by the model can match it closely.

2. Keeping the Signal-to-Noise Ratio Low:

The authors specifically mention that at step $T$ (the final step), the signal-to-noise ratio should be "as small as possible." They achieved: $L_T = D_{\mathrm{KL}}(q(\mathbf{x}_T|\mathbf{x}_0) \| \mathcal{N}(\mathbf{0}, \mathbf{I})) \approx 10^{-5} \text{ bits per dimension}$

This is technical notation for the Kullback-Leibler divergence between the final noisy distribution and a standard Gaussian. Intuitively: the Gaussian prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$ should match the distribution of $\mathbf{x}_T$ very closely. This means that by step $T$, the data has been transformed almost completely into standard normal noise, with almost nothing left of the original signal. This is ideal because sampling starts from the prior: the closer $q(\mathbf{x}_T|\mathbf{x}_0)$ is to $\mathcal{N}(\mathbf{0}, \mathbf{I})$, the smaller the mismatch between the noise the network saw in training and the noise it starts from at generation time.
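The order of magnitude of $L_T$ can be sanity-checked with the closed-form KL divergence between two Gaussians. The sketch below works per dimension and assumes the linear schedule from Section 4; `mean_sq_pixel = 0.25` is a made-up stand-in for the average squared pixel value of data scaled to $[-1, 1]$, so the printed number is only indicative.

```python
import math

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule (values from Section 4 of the paper)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def final_alpha_bar(betas):
    """alpha_bar_T = prod_t (1 - beta_t)."""
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
    return prod

def prior_kl_bits_per_dim(abar_T, mean_sq_pixel):
    """KL( N(sqrt(abar_T) x0, (1 - abar_T)) || N(0, 1) ), averaged over x0.

    Uses KL(N(m, s^2) || N(0, 1)) = 0.5 * (s^2 + m^2 - 1 - ln s^2).
    """
    kl_nats = 0.5 * (abar_T * mean_sq_pixel - abar_T - math.log1p(-abar_T))
    return kl_nats / math.log(2.0)

abar_T = final_alpha_bar(linear_betas())
mean_sq_pixel = 0.25             # hypothetical average of x0^2 over the data
l_T_bits = prior_kl_bits_per_dim(abar_T, mean_sq_pixel)
print(f"alpha_bar_T = {abar_T:.2e}, L_T ~ {l_T_bits:.1e} bits/dim")
```

Any plausible value of the mean squared pixel in $[0, 1]$ gives a result on the order of $10^{-5}$ bits per dimension, in line with the figure quoted above.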

3. Avoiding Extreme Changes:

Small $\beta_t$ values prevent any single step from adding so much noise that reversing it becomes nearly impossible. Each step is a manageable perturbation.

Part 2: The Neural Network Architecture

Overview: U-Net with Attention

The authors use a U-Net backbone, similar to an unmasked PixelCNN++ model. Let me break down what this means:

U-Net Structure:

A U-Net is a convolutional neural network with a distinctive shape:

Encoder path: Convolutional layers progressively reduce spatial resolution (downsample)
Bottleneck: A central layer at low resolution
Decoder path: Convolutional layers progressively increase spatial resolution (upsample)
Skip connections: Information from encoder layers is concatenated with decoder layers at matching resolutions

This symmetric structure is useful for image-to-image tasks because it preserves spatial information while extracting features at multiple scales.

Key Architectural Choices

1. Group Normalization:

The network uses group normalization throughout. This is a normalization technique that divides channels into groups and normalizes within each group independently. Why?

Normalization helps stabilize training by keeping activations in a reasonable range. Group normalization (unlike batch normalization) doesn't depend on batch statistics, making it more robust when training with different batch sizes.
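A minimal sketch of the normalization step itself, for a single example, is shown below. It omits the learned per-channel scale and shift that practical implementations usually add, and it represents a feature map as a plain list of channels; both are simplifications for illustration.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one example.

    x: list of C channels, each a flat list of spatial values.
    Channels are split into num_groups groups; each group is shifted and
    scaled to zero mean and (approximately) unit variance. Statistics come
    from this example only, never from the rest of the batch.
    """
    num_channels = len(x)
    assert num_channels % num_groups == 0, "channels must divide evenly into groups"
    group_size = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * group_size:(g + 1) * group_size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out

# Four channels with three spatial positions each, split into two groups
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]
y = group_norm(x, num_groups=2)
print(y[0], y[1])
```

Because nothing in `group_norm` looks at other examples, the output for an image is the same whether the batch holds one image or a thousand.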

2. Time Embedding via Sinusoidal Position Embeddings:

The authors state: "Parameters are shared across time, which is specified to the network using the Transformer sinusoidal position embedding."

This is crucial. The same neural network $\epsilon_\theta(\mathbf{x}_t, t)$ must handle all 1000 timesteps. But how does it know which timestep it's at? The answer: sinusoidal position embeddings.

These are fixed (not learned) functions of the timestep $t$, borrowed from the Transformer architecture: $PE(t, 2i) = \sin\left(\frac{t}{10000^{2i/d}}\right)$ and $PE(t, 2i+1) = \cos\left(\frac{t}{10000^{2i/d}}\right)$

where $d$ is the embedding dimension and $i$ indexes different frequency components.

Why sinusoidal embeddings? They provide a continuous, smooth representation of time that helps the network understand the relationship between different timesteps. The embedding combines sinusoids of many different frequencies, allowing the network to pick up both coarse and fine-grained differences between timesteps.
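The two formulas translate almost directly into code. The sketch below interleaves sine and cosine terms exactly as the indices $2i$ and $2i+1$ suggest; actual implementations may order the components differently (for example, all sines followed by all cosines), and the function name and `max_period` parameter are illustrative.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding: PE(t, 2i) = sin(t / max_period^(2i/dim)),
    PE(t, 2i+1) = cos(t / max_period^(2i/dim))."""
    assert dim % 2 == 0, "embedding dimension must be even"
    embedding = []
    for i in range(dim // 2):
        frequency = 1.0 / (max_period ** (2 * i / dim))
        embedding.append(math.sin(t * frequency))   # even index 2i
        embedding.append(math.cos(t * frequency))   # odd index 2i + 1
    return embedding

for t in (0, 1, 500, 1000):
    print(t, [round(v, 3) for v in timestep_embedding(t, dim=8)])
```

Low-index components oscillate quickly and separate neighbouring timesteps, while high-index components change slowly and encode the coarse position within the 1000 steps.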

3. Self-Attention at 16×16 Resolution:

The network includes self-attention layers at the $16 \times 16$ feature map resolution.

Self-attention allows the model to relate distant parts of the image to each other, capturing long-range dependencies. This is computationally expensive (the cost grows quadratically with the number of spatial positions), so the authors apply it only at the $16 \times 16$ feature map resolution. This balances expressiveness with computational efficiency.
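To make the quadratic cost concrete, here is a stripped-down single-head self-attention over a $16 \times 16$ feature map flattened to 256 tokens. It uses identity query, key, and value projections and a tiny channel count, which is a deliberate simplification; the real layer has learned projections and many more channels.

```python
import math
import random

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (simplified)."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for query in tokens:
        # One score per (query, key) pair: this is the N x N cost
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]        # stable softmax
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)])
    return outputs, weights

rng = random.Random(0)
height, width, channels = 16, 16, 4
tokens = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(height * width)]
outputs, weights = self_attention(tokens)
print(f"{len(tokens)} tokens -> attention matrix of {len(weights)} x {len(weights[0])}")
```

At $16 \times 16$ the attention matrix has $256^2 = 65{,}536$ entries; at $32 \times 32$ it would have $1024^2$, sixteen times as many.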

Why This Architecture?

The PixelCNN++ base is appropriate because:

PixelCNN++ was designed for image generation (autoregressive models)
It combines convolutional efficiency with the ability to capture complex distributions
The authors adapt it (making it "unmasked" and adding attention) to the diffusion setting where you're predicting noise from a single noisy image, not generating autoregressively pixel-by-pixel

Summary Table: The Configuration

Parameter	Value	Why?
$T$ (timesteps)	1000	Matches prior work for fair comparison
$\beta_1$	$10^{-4}$	Small relative to data range $[-1,1]$
$\beta_T$	$0.02$	Ensures final step is nearly pure Gaussian noise
Variance schedule	Linear increase	Smooth, symmetric forward/reverse processes
Architecture	U-Net + attention	Captures multi-scale spatial features + long-range dependencies
Normalization	Group norm	Stable training across batch sizes
Time specification	Sinusoidal embeddings	Allows one network to handle all timesteps

Connection to Previous Sections

Recall from Section 3.4 that the authors train using the simplified objective (Eq. 14): $L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\right\|^2\right]$

The noise schedule $\{\beta_t\}$ directly determines $\bar{\alpha}_t$ (the cumulative product of $1-\beta_s$ terms), which appears in this loss. As Section 3.4 notes, this particular schedule causes the simplified objective to down-weight the small-$t$ terms relative to the variational bound, so the network focuses on the more difficult denoising tasks at larger $t$ rather than on nearly clean inputs.

Key Takeaway

Section 4 translates the theoretical framework from Sections 2-3 into concrete engineering decisions. Every choice—from the 1000 steps to the U-Net architecture—is justified either by matching prior work, maintaining mathematical symmetry, or enabling efficient training. The result is a practical recipe for training diffusion models that achieves state-of-the-art image synthesis results (as mentioned in the abstract: FID of 3.17 on CIFAR-10).

4.1 Sample quality


Table 1 shows Inception scores, FID scores, and negative log likelihoods (lossless codelengths) on CIFAR10. With our FID...

Section 4.1 Sample Quality: Comprehensive Explanation

The Big Picture

This section is answering a fundamental question: How good are the images generated by diffusion probabilistic models? The authors are presenting empirical evidence that their method produces high-quality synthetic images by comparing against established benchmarks. This matters because throughout the paper they've been developing a theoretical framework and training procedure—now they need to demonstrate it actually works in practice.

The section also reveals an important practical insight: the theoretical "best" training approach (optimizing the true variational bound) doesn't necessarily produce the "best-looking" samples, though it does produce better log-likelihoods. This tension between different objectives is a key finding.

Understanding the Metrics

Before diving into the results, let's clarify what the authors are measuring:

1. Inception Score (IS)

This is a metric that evaluates generated image quality by:

Running generated images through a pre-trained classifier (InceptionNet)
Measuring two things:
- Discriminativeness: Does the classifier confidently assign each image to a specific class?
- Diversity: Do the generated images span different classes?

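Mathematically, IS is the exponential of the average KL divergence between the classifier's per-image class distribution $p(y|\mathbf{x})$ and the marginal class distribution $p(y)$ over all generated images. Higher scores mean the classifier is confident on each image while its predictions are diverse across images. The authors report 9.46 on CIFAR10, which is competitive.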
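The definition of the Inception Score, $\exp\left(\mathbb{E}_{\mathbf{x}}\left[D_{\mathrm{KL}}(p(y|\mathbf{x}) \| p(y))\right]\right)$, can be sketched in a few lines once the classifier outputs are available. The probability rows below are made up for illustration; a real evaluation would use Inception network predictions on many generated images.

```python
import math

def inception_score(probs):
    """exp( mean over images of KL( p(y|x) || p(y) ) ).

    probs: one row of class probabilities per generated image.
    """
    n, k = len(probs), len(probs[0])
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        # Terms with p = 0 contribute nothing (0 * log 0 is taken as 0)
        total_kl += sum(p * math.log(p / marginal[j])
                        for j, p in enumerate(row) if p > 0.0)
    return math.exp(total_kl / n)

confident_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
confident_but_collapsed = [[1.0, 0.0, 0.0]] * 3
unsure = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(inception_score(confident_and_diverse))    # best case for 3 classes
print(inception_score(confident_but_collapsed))  # no diversity
print(inception_score(unsure))                   # no confidence
```

The score ranges from 1 (no confidence or no diversity) up to the number of classes, which it reaches only when predictions are both confident and evenly spread.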

2. FID Score (Fréchet Inception Distance)

This is a more sophisticated metric that:

Extracts features from real and generated images using a pre-trained classifier
Treats these features as samples from Gaussian distributions
Measures the distance between these two distributions using the Fréchet distance

Why this matters: Unlike IS, FID directly compares generated images to real images, making it a more meaningful quality assessment.
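For two Gaussians the squared Fréchet distance is $\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2 + \mathrm{Tr}\left(\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2 - 2(\boldsymbol{\Sigma}_1\boldsymbol{\Sigma}_2)^{1/2}\right)$. The sketch below implements only the special case of diagonal covariances, where the matrix square root reduces to elementwise square roots. Real FID uses full covariance matrices of Inception features, so this is an illustration of the formula, not an FID implementation; the feature statistics are made up.

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between Gaussians with DIAGONAL covariances.

    With diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^(1/2)) becomes a sum of
    (sqrt(v1) - sqrt(v2))^2 over dimensions.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Hypothetical feature statistics for "real" and "generated" images
real_mu, real_var = [0.0, 1.0, -0.5], [1.0, 2.0, 0.5]
same = frechet_distance_diag(real_mu, real_var, real_mu, real_var)
shifted = frechet_distance_diag(real_mu, real_var, [1.0, 1.0, -0.5], real_var)
wider = frechet_distance_diag(real_mu, real_var, real_mu, [4.0, 2.0, 0.5])
print(same, shifted, wider)
```

Identical statistics give a distance of zero, and the distance grows when either the means or the spreads of the two feature distributions drift apart, which is why lower FID is better.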

The authors report 3.17 on CIFAR10, computed against the training set as is standard practice. They also note that when measured against the test set instead, the FID becomes 5.24, which is still better than many training-set FID scores in the literature.

3. Negative Log-Likelihood (NLL) / Lossless Codelength

This measures the actual probability the model assigns to real data. Think of it as: "How many bits would you need to losslessly compress data using this model as a codec?"

Key insight: This is what the variational bound (Eq. 5) measures; the per-timestep terms of Eq. 12 from Section 3.2 are pieces of it. Lower values are better.

The Core Result: Table 1 Findings

"With our FID score of 3.17, our unconditional model achieves better sample quality than most models in the literature, including class conditional models."

Let's unpack what this statement means:

Unconditional model: The model doesn't receive class labels or other conditioning information. It just generates images from random noise. This is harder than conditional generation where you tell the model "generate a dog" or "generate a cat."
Beats class conditional models: This is impressive because conditional models have additional information to work with. The diffusion model achieves comparable quality while working with less information.
FID = 3.17 vs. test set FID = 5.24: The gap between training and test FID is relatively small, suggesting the model is learning genuine image structure rather than memorizing the training set.

The Crucial Trade-off: Sample Quality vs. Likelihood

This is perhaps the most important insight in this section:

"We find that training our models on the true variational bound yields better codelengths than training on the simplified objective, as expected, but the latter yields the best sample quality."

What's happening here?

Training on the true variational bound (Eq. 5, with per-timestep terms given by Eq. 12):

Includes proper weighting terms derived from information-theoretic principles
Maximizes the actual log-likelihood of real data
Results: Better NLL/codelength measurements
But: Produces slightly lower-quality samples visually

Training on the simplified objective (Eq. 14):

Removes weighting terms to create an unweighted loss
Down-weights the easiest denoising tasks (small $t$, where there's very little noise)
Results: Best visual sample quality
But: Lower log-likelihood values

Why does this trade-off exist?

The key is in Section 3.4's explanation:

"These terms train the network to denoise data with very small amounts of noise, so it is beneficial to down-weight them so that the network can focus on more difficult denoising tasks at larger $t$ terms."

Intuition:

At small $t$ (near the end of sampling), you're removing tiny amounts of noise from almost-real images. This is technically easy.
At large $t$ (early in sampling), you're removing large amounts of noise from pure random noise. This is technically hard and requires learning the most about the data distribution.

By using the simplified objective that naturally down-weights small $t$ terms, the model focuses its "learning capacity" on the harder problems. This improves visual quality because the model becomes better at the hard parts of the denoising process.

The trade-off: The true variational bound weights each timestep by the coefficient that makes the total a valid bound on the codelength, which optimizes compression efficiency (likelihood). For visual quality, it pays to give up some of that theoretical optimality and focus on the hard problems.

Evidence from Figures

The section references three figures showing generated samples:

Figure 1 (referenced but not shown here): CIFAR10 and CelebA-HQ 256×256 samples—smaller images and celebrity faces
Figure 3: LSUN Church samples with FID = 7.89
- LSUN is a more complex, diverse dataset
- Higher FID than CIFAR10 is expected due to increased complexity
- Church exteriors are a challenging domain with complex textures and geometry
Figure 4: LSUN Bedroom samples with FID = 4.90
- Similar to churches, but bedrooms may have slightly more structure
- Better FID suggests the model adapts well across domains

What these figures demonstrate: The diffusion model can generate diverse, high-resolution images across different domains, not just toy datasets. This is evidence of genuine generalization.

Why This Matters for the Paper's Contribution

Validates the theory: All the mathematical framework from sections 2-3 isn't just elegant theory—it produces state-of-the-art results.
Justifies the simplified objective: The authors could have just reported results with the "theoretically correct" objective, but instead they honestly report both, showing that practical sample quality matters as much as theoretical properties.
Shows scalability: Achieving competitive FID on both toy datasets (CIFAR10) and complex, high-resolution datasets (256×256 LSUN) demonstrates the method scales well.
Establishes a new paradigm: Diffusion models join GANs and autoregressive models as viable generators, with the advantage of stable training (no adversarial dynamics) and principled likelihood optimization.

Summary

This section presents empirical evidence that diffusion probabilistic models achieve state-of-the-art or competitive image generation quality. The key findings are:

Aspect	Result
CIFAR10 FID	3.17 (training), 5.24 (test)
Sample quality	Beats most prior methods, including conditional models
Likelihood	True VB gives better likelihood; simplified objective gives better samples
Scalability	Works on 256×256 images across multiple domains

The crucial insight is the sample quality vs. likelihood trade-off: diffusion models trained with a simplified objective that down-weights easy denoising tasks achieve better visual quality, even though they optimize a weighted variant of the variational bound rather than the theoretically "pure" bound.

4.2 Reverse process parameterization and training objective ablation


In Table 2, we show the sample quality effects of reverse process parameterizations and training objectives (Section 3.2...

Section 4.2: Reverse Process Parameterization and Training Objective Ablation

The Big Picture

This section is investigating a crucial engineering question: How should we design the neural network to predict the reverse diffusion process, and how should we train it?

Think of it this way: we have a noisy image and want to denoise it step-by-step. But there are multiple ways to set up this denoising process mathematically, and multiple loss functions we could use to train the network. This section systematically compares these choices to see which combination actually produces the best results.

This is important because earlier sections (particularly Section 3.2) introduced different ways to parameterize the reverse process, and Section 3.4 introduced a simplified training objective. Now the authors are asking: "Do all these options work equally well, or are some combinations better than others?"

Key Concepts to Review

Before diving in, let's recall three important concepts from earlier:

Reverse process parameterization: Different mathematical formulations for what the neural network predicts at each denoising step
Variational bound: A theoretically principled loss function that provides a lower bound on the data log-likelihood
Simplified objective (Eq. 14): A practical training loss that's easier to implement and often yields better results

The Three Parameterization Choices

The section compares three different ways to structure the reverse process. Each choice changes what the neural network is trained to predict:

Option 1: Predicting $\tilde{\boldsymbol{\mu}}$ (the mean)

The neural network directly predicts $\tilde{\boldsymbol{\mu}}_\theta(\mathbf{x}_t, t)$ , which is the mean of the reverse process distribution at step $t$ .

Mathematical setup: In the reverse process $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$, we need to specify a mean and variance. This option predicts the mean directly, using the forward process posterior mean $\tilde{\boldsymbol{\mu}}_t$ as the regression target.
Training with the variational bound: When trained with the proper weighted variational bound from Section 3.2, this works well.
Training with an unweighted objective: When trained with an unweighted MSE loss akin to Equation 14, this performs poorly, plausibly because the loss no longer reflects how much each denoising step matters.

Key insight: This approach is theoretically motivated, but it's sensitive to how you weight the training signal.

Option 2: Learning the variances $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t)$

Instead of fixing the variance of the reverse process, allow the neural network to predict it as well.

The idea: Make the network learn not just "what's the clean image?" but also "how confident am I about this prediction?"
The problem: The authors found this leads to unstable training and worse sample quality.

Why might it be unstable? Learning variances adds another dimension of optimization complexity. The network has to simultaneously predict means and uncertainties while balancing gradients between them. This can cause training dynamics to oscillate or diverge.

Option 3: Predicting $\boldsymbol{\epsilon}$ (the noise)

The neural network predicts $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ , which represents the noise that was added in the forward process. This is related to denoising score matching (mentioned in the abstract and developed in Section 3.2).

Connection to the forward process: Recall from the closed form of the forward process (Eq. 4) that we can write: $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ , where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is Gaussian noise.
Training signal: Instead of predicting the posterior mean, the network learns what noise is present. Both parameterizations can represent the same reverse process mean, but they lead to differently weighted losses.
Performance with different losses:
- With the variational bound: performs comparably to predicting $\tilde{\boldsymbol{\mu}}$
- With the simplified objective: significantly better than predicting $\tilde{\boldsymbol{\mu}}$

Why is this better? The noise prediction task is more aligned with the simplified MSE objective. The simplified loss naturally emphasizes hard denoising tasks (high noise) and downweights easy ones (low noise), which matches what we want the network to learn.
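The claim that the two parameterizations describe the same mean can be verified numerically. The sketch below computes the forward process posterior mean from $\mathbf{x}_0$ and $\mathbf{x}_t$ (the formula from Section 2 of the paper) and compares it with the mean written in terms of $\boldsymbol{\epsilon}$ (the form used in Eq. 11). The scalar values of `x0`, `eps`, and `t` are arbitrary, and the linear schedule from Section 4 is assumed.

```python
import math

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule (values from Section 4 of the paper)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

betas = linear_betas()
abar = alpha_bars(betas)

t = 500                                   # arbitrary timestep with t > 1
beta_t = betas[t - 1]
alpha_t = 1.0 - beta_t
abar_t, abar_prev = abar[t - 1], abar[t - 2]

x0, eps = 0.3, -1.2                       # arbitrary scalar "pixel" and noise
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps

# Posterior mean of q(x_{t-1} | x_t, x_0), written in terms of x_0 and x_t
mu_posterior = (math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)) * x0 \
    + (math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)) * x_t

# Same mean, written in terms of x_t and the noise eps
mu_from_eps = (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(alpha_t)

print(mu_posterior, mu_from_eps)
```

Since the two expressions agree whenever the noise is known exactly, the choice between them only changes what the network outputs and how its errors are scaled in the loss.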

Reading Table 2: The Experimental Results

The section refers to Table 2, which shows sample quality metrics (Inception score and FID) for these combinations:

Approach	Trained on Variational Bound?	Fixed Variances?	Sample Quality Result
Predict $\tilde{\boldsymbol{\mu}}$	Yes	Yes	Good
Predict $\tilde{\boldsymbol{\mu}}$	No	Yes	Poor
Learn $\boldsymbol{\Sigma}_\theta$	Yes	No	Unstable/Poor
Predict $\boldsymbol{\epsilon}$	Yes	Yes	Good (similar to $\tilde{\boldsymbol{\mu}}$ )
Predict $\boldsymbol{\epsilon}$	No	Yes	Much Better

The Key Insight: Matching Training Objective to Task

The most important takeaway is this:

The choice of what to predict (parameterization) must be matched to how you train (loss function).

Think of it like this analogy:

If you're teaching someone to identify dogs, you could ask them to "describe the dog" or "describe what makes it NOT a cat."
If your evaluation focuses on distinguishing dogs from cats, the second approach might be more directly aligned with your goal, even if the first seems more complete.

Similarly:

Predicting $\tilde{\boldsymbol{\mu}}$ is conceptually appealing (directly predict the mean of the next, slightly less noisy image)
Predicting $\boldsymbol{\epsilon}$ is practically superior when using the simplified training objective because the MSE loss on noise naturally down-weights easy steps and emphasizes hard ones

Why Fixed Variances Work Better Than Learned Ones

The section notes that learning variances leads to worse performance. Here's the intuition:

Additional complexity: The network must optimize over more parameters simultaneously
Optimization interference: Gradients for predicting means and variances can conflict with each other
Initialization issues: Without careful initialization, variance predictions can collapse to unrealistic values
Empirical finding: The authors discovered that fixed variances work just fine—they don't need to be learned

This is a pragmatic choice: simpler often beats more complex when the results are comparable.
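Section 3.2 of the paper discusses two fixed choices for the reverse process variance: $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. The sketch below compares them under the linear schedule from Section 4, with $\bar{\alpha}_0 = 1$ by convention; the helper names are illustrative.

```python
def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear variance schedule (values from Section 4 of the paper)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

betas = linear_betas()
abar = alpha_bars(betas)

def posterior_variance(t):
    """beta_tilde_t = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, with abar_0 = 1."""
    abar_prev = abar[t - 2] if t > 1 else 1.0
    return (1.0 - abar_prev) / (1.0 - abar[t - 1]) * betas[t - 1]

for t in (1, 2, 10, 100, 1000):
    ratio = posterior_variance(t) / betas[t - 1]
    print(f"t={t:4d}  beta_t={betas[t - 1]:.6f}  beta_tilde/beta={ratio:.4f}")
```

The two choices differ mainly in the first few timesteps and become nearly identical later on, which is consistent with neither needing to be learned.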

Mathematical Connection to Earlier Work

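Recall the simplified objective from Section 3.4 (Eq. 14):

$L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}, t)\right\|^2\right]$

This loss function:

Takes an expectation over time steps $t$ , data $\mathbf{x}_0$ , and noise $\boldsymbol{\epsilon}$
Measures MSE between actual noise and predicted noise
Implicitly gives more relative weight to harder denoising problems than the variational bound does (the bound's coefficients are largest at small $t$ , where little noise has been added)

When you train $\boldsymbol{\epsilon}_\theta$ with this loss, errors at every timestep contribute equally, so the high-noise (hard) steps are no longer overshadowed by the low-noise steps that the variational bound weights most heavily.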

Practical Implications

The authors' choice for their final model:

Parameterization: Predict $\boldsymbol{\epsilon}$ (the noise)
Training objective: Use the simplified loss (Eq. 14)
Variances: Keep them fixed

This combination delivers:

Best sample quality (FID scores reported in Table 1)
Simpler training procedure (no variance learning complications)
Clear interpretation (predict the noise component of the noisy image)

This is why the abstract mentions achieving state-of-the-art FID scores—this engineering choice matters significantly for practical performance.

4.3 Progressive coding


Table 1 also shows the codelengths of our CIFAR10 models. The gap between train and test is at most 0.03 bits per dimens...

Section 4.3: Progressive Coding - Comprehensive Explanation

Big Picture: What's This Section About?

This section explores a fascinating property of diffusion models: they can be used as lossy compressors (like JPEG for images) and progressive generators (like a progressive JPEG that first appears blurry and sharpens as more data arrives). Rather than just generating images in one shot, diffusion models can generate progressively from coarse features to fine details, and we can understand this through the lens of information theory (the mathematical study of compression and communication).

The key insight: diffusion models naturally encode information hierarchically — they can reconstruct images well even if you only see partial information partway through the denoising process.

Part 1: Codelengths and the Lossy Compression Property

The Basic Observation

The paper reports that their diffusion model achieves:

Train-test gap: only 0.03 bits/dimension (very small, indicating no overfitting)
Lossless codelengths: not as good as other methods (like autoregressive models)
BUT: sample quality is excellent

This seems paradoxical! The resolution: diffusion models are great lossy compressors but not great lossless compressors.

Rate-Distortion Analysis

The paper uses classic rate-distortion theory from information theory. Here's the setup:

Rate-Distortion Framework:

Rate $R$ : how many bits (units of information) you need to transmit
Distortion $D$ : how wrong your reconstruction is (measured as error)
The trade-off: you can reduce distortion by increasing rate, or vice versa

In the variational bound from Equation (5) (defined earlier in the paper):

L = L_0 + L_1 + L_2 + \cdots + L_T

The paper reinterprets this as:

Rate = $L_1 + L_2 + \cdots + L_T$ (all the middle terms)
Distortion = $L_0$ (the final reconstruction error)

For their CIFAR-10 model:

Rate: 1.78 bits/dimension
Distortion: 1.97 bits/dimension

What Does This Mean Intuitively?

The fact that distortion is large doesn't mean the images look bad! The paper states that 1.97 bits/dimension corresponds to an RMSE (root mean squared error) of 0.95 on a 0-255 scale. That's roughly a pixel value error of 1 out of 256 — mostly imperceptible to human eyes.

Key insight: More than half the total bits in the lossless codelength describe imperceptible distortions. The model is spending bits on information humans can't see!
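
The "more than half" claim follows directly from the two reported numbers. A small arithmetic check, using only the figures quoted above:

```python
rate = 1.78         # bits/dim, L_1 + ... + L_T (reported in the paper)
distortion = 1.97   # bits/dim, L_0 (reported in the paper)

total = rate + distortion               # lossless codelength, bits/dim
distortion_share = distortion / total   # fraction of the codelength spent on L_0
rmse_fraction = 0.95 / 255              # reported RMSE relative to the 0-255 range

print(f"total codelength: {total:.2f} bits/dim")
print(f"share spent on distortion term: {distortion_share:.1%}")
print(f"RMSE as a fraction of pixel range: {rmse_fraction:.2%}")
```

The distortion term accounts for roughly 52.5% of the 3.75 bits/dim total, while the error it describes is under 0.4% of the pixel range.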

Part 2: Progressive Lossy Coding

The Core Idea

Imagine you're transmitting an image over a slow internet connection. You have two options:

1. Send the entire high-quality image at once (slow)
2. Send a rough version first, then progressively add details

Option 2 is what you want.

The diffusion model naturally does this! Here's how:

The Algorithm (Conceptual)

The paper's Algorithms 3 and 4 (the sender and receiver procedures) work like this:

Start from $\mathbf{x}_T \sim q(\mathbf{x}_T|\mathbf{x}_0)$ , which is almost pure noise and therefore costs close to zero bits to send
Progressively move toward the data: $\mathbf{x}_T \to \mathbf{x}_{T-1} \to \cdots \to \mathbf{x}_1 \to \mathbf{x}_0$
At each step $t$ , transmit just enough information to specify $\mathbf{x}_{t-1}$ given $\mathbf{x}_t$

The total number of bits transmitted equals, on average, the variational bound in Equation (5).

The Key Formula: Reconstructing from Partial Information

At any point during denoising (at step $t$ ), the receiver can estimate what the final image $\mathbf{x}_0$ looks like using:

\mathbf{x}_0 \approx \hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t)}{\sqrt{\bar{\alpha}_t}}

Breaking this down:

$\mathbf{x}_t$ : the current noisy image (at timestep $t$ )
$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t)$ : the neural network's prediction of the noise in $\mathbf{x}_t$
$\bar{\alpha}_t$ : a scaling factor (defined earlier in the paper) that controls how much signal vs. noise is in $\mathbf{x}_t$
Division by $\sqrt{\bar{\alpha}_t}$ : scales the reconstructed image back to the original data range

Intuition: The formula inverts the noise addition process. Since we know roughly how much noise was added to get $\mathbf{x}_t$ , we can subtract our estimate of that noise to recover $\mathbf{x}_0$ .
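
The estimate is a one-line rearrangement of the forward-process sample. A minimal NumPy sketch, assuming toy data and an illustrative value of $\bar{\alpha}_t$ (the function name `predict_x0` is ours, not the paper's):

```python
import numpy as np


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate x_0 from x_t and a noise prediction (Eq. 15)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))   # toy "image"
eps = rng.standard_normal(x0.shape)        # the noise actually added
alpha_bar_t = 0.3                          # illustrative value, not taken from the paper

# Forward process sample: x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

x0_exact = predict_x0(x_t, eps, alpha_bar_t)                  # perfect noise estimate
x0_rough = predict_x0(x_t, np.zeros_like(eps), alpha_bar_t)   # poor noise estimate
```

With the true noise the original image is recovered exactly; the quality of $\hat{\mathbf{x}}_0$ in practice is limited by how well the network predicts the noise.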

What Figure 5 Shows

Figure 5 is a rate-distortion curve — a graph showing the fundamental trade-off:

Horizontal axis (Rate): cumulative bits received so far (0 to full codelength)
Vertical axis (Distortion): RMSE between true $\mathbf{x}_0$ and reconstructed $\hat{\mathbf{x}}_0$
Key observation: The curve is steep at low rates and flattens at high rates

What this means:

You get most of the visual improvement very quickly (low rate, high distortion reduction)
Additional bits after that mostly reduce imperceptible errors
This is perfect for progressive transmission: the first bits sent buy the largest reduction in distortion

Part 3: Progressive Generation

From Decompression to Generation

Instead of receiving a compressed bitstream, the reverse process can simply sample from the model while generating:

Start with random noise $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$
Run Algorithm 2 (the reverse process from the paper)
At each step, predict $\hat{\mathbf{x}}_0$ using Equation (15) above
Watch as $\hat{\mathbf{x}}_0$ improves over time
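
The steps above can be sketched as a sampling loop that records $\hat{\mathbf{x}}_0$ along the way. This is a sketch under stated assumptions: `eps_model` is a hypothetical placeholder for the trained network (so the output is not a meaningful image), the linear $\beta$ schedule is the one reported in the paper's experiments, and the reverse variance is set to $\sigma_t^2 = \beta_t$ , one of the two fixed choices the paper discusses.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # schedule reported in the paper's experiments
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)


def eps_model(x_t, t):
    """Hypothetical placeholder for the trained network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)


def sample_with_previews(shape, rng, every=250):
    """Ancestral sampling that also stores the running estimate of x_0."""
    x = rng.standard_normal(shape)                   # x_T ~ N(0, I)
    previews = []
    for t in range(T, 0, -1):
        a, ab, b = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        eps_pred = eps_model(x, t)
        if t % every == 0:
            # Current guess at the final image (Eq. 15)
            x0_hat = (x - np.sqrt(1.0 - ab) * eps_pred) / np.sqrt(ab)
            previews.append((t, x0_hat))
        # Mean of p_theta(x_{t-1} | x_t) under the eps-prediction parameterization
        mean = (x - b / np.sqrt(1.0 - ab) * eps_pred) / np.sqrt(a)
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(b) * z                    # sigma_t^2 = beta_t
    return x, previews


rng = np.random.default_rng(0)
x0_final, previews = sample_with_previews((4, 4), rng)
preview_steps = [t for t, _ in previews]
```

Plotting the stored previews for a trained model is what produces the coarse-to-fine progressions discussed here.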

What Figures 6 and 10 show:

Generation progresses from coarse to fine
Early iterations: only large-scale structure (composition, main objects)
Later iterations: fine details (textures, edges)

This mirrors how human perception works and how artists sketch!

Figure 7: The Latent Space Structure

Figure 7 shows something interesting: when you freeze a latent $\mathbf{x}_t$ and draw several samples $\mathbf{x}_0 \sim p_\theta(\mathbf{x}_0|\mathbf{x}_t)$ , how much the samples share depends on $t$ :

Large $t$ (early, noisy $\mathbf{x}_t$ ): Only large-scale features are preserved; samples that share the same $\mathbf{x}_t$ can differ in almost everything else
Small $t$ (late, clean $\mathbf{x}_t$ ): Everything except fine detail is preserved; samples that share the same $\mathbf{x}_t$ differ only in fine details

Hint: The paper speculates this relates to conceptual compression — the idea that images can be understood as compositions of concepts at different scales.

Part 4: Connection to Autoregressive Decoding

The Theoretical Insight

This is the paper's most abstract contribution. They show that diffusion can be understood as a generalized autoregressive model.

Rewriting the Variational Bound

Equation (16) rewrites Equation (5) as:

L = D_{\mathrm{KL}}(q(\mathbf{x}_T) \| p(\mathbf{x}_T)) + \mathbb{E}_q\left[\sum_{t \geq 1} D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))\right] + H(\mathbf{x}_0)

Each term explained:

First term $D_{\mathrm{KL}}(q(\mathbf{x}_T) \| p(\mathbf{x}_T))$ : KL divergence between what the data looks like after $T$ steps of noise ( $q$ ) vs. what the model thinks it should look like ( $p$ ). After $T=1000$ steps, this is ≈ 0 since both are nearly pure Gaussian.
Second term $\mathbb{E}_q[\sum_{t \geq 1} D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))]$ : For each timestep, the KL divergence between the true reverse process and the model's learned reverse process. The expectation is over samples from the forward process.
Third term $H(\mathbf{x}_0)$ : Entropy (uncertainty) of the original data — a constant independent of the model.
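
To see why the first term is negligible, one can evaluate the KL divergence between $q(\mathbf{x}_T|\mathbf{x}_0)$ and a standard normal for a single pixel. This is a rough check under assumptions that come from the paper's experimental setup rather than from this section: a linear $\beta$ schedule from 1e-4 to 0.02 with $T = 1000$ , and pixel values scaled to $[-1, 1]$ . It uses the per-image conditional $q(\mathbf{x}_T|\mathbf{x}_0)$ , as in the $L_T$ term of the bound.

```python
import math

T = 1000
# Linear beta schedule (assumed, as reported in the paper's experiments)
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar_T = 1.0
for beta in betas:
    alpha_bar_T *= 1.0 - beta   # \bar{alpha}_T = prod_t (1 - beta_t)


def kl_to_standard_normal(mean, var):
    """KL( N(mean, var) || N(0, 1) ) in nats, for one dimension."""
    return 0.5 * (var + mean ** 2 - 1.0 - math.log(var))


x0 = 1.0   # a pixel at the edge of the [-1, 1] range
kl_T = kl_to_standard_normal(math.sqrt(alpha_bar_T) * x0, 1.0 - alpha_bar_T)

print(f"alpha_bar_T = {alpha_bar_T:.2e}")
print(f"KL per dimension = {kl_T:.2e} nats")
```

Both numbers come out on the order of 1e-5, so almost none of the signal survives to step $T$ and the prior-matching term contributes essentially nothing to the bound.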

The Autoregressive Connection

Now the paper makes a clever thought experiment. Imagine if you:

Set $T$ = number of pixels in the image
Define the forward process as: $q(\mathbf{x}_t|\mathbf{x}_0)$ masks out the first $t$ pixels, leaving the rest visible
Set $p(\mathbf{x}_T)$ = a completely blank image (all probability on blank)
Make $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ fully expressive (can represent any distribution)

Then:

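The first KL divergence term is 0, because $q(\mathbf{x}_T)$ and $p(\mathbf{x}_T)$ both put all their mass on the blank image
Each remaining KL term trains the model to copy pixels $t+1, \ldots, T$ unchanged and to predict pixel $t$ from them
This is exactly autoregressive decoding: one pixel is generated at a time, conditioned on the pixels already generated (as in PixelCNN-style models)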
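
The thought experiment can be made concrete with a toy four-"pixel" example. This is only an illustration of the masking construction; the names `mask_forward` and `reverse_targets` and the use of `None` as the blank marker are ours.

```python
MASK = None   # marker for a masked (blank) coordinate


def mask_forward(x0, t):
    """x_t under the masking 'diffusion': the first t coordinates of x0 are blanked."""
    return [MASK] * t + list(x0[t:])


def reverse_targets(chain):
    """For each reverse step x_t -> x_{t-1}: (pixel index, value to predict, visible context)."""
    targets = []
    for t in range(len(chain) - 1, 0, -1):
        context = chain[t][t:]         # pixels t+1..T (1-based), still visible in x_t
        value = chain[t - 1][t - 1]    # pixel t, revealed in x_{t-1}
        targets.append((t, value, context))
    return targets


x0 = [7, 3, 9, 4]
T = len(x0)                                           # T equals the data dimension
chain = [mask_forward(x0, t) for t in range(T + 1)]   # x_0, x_1, ..., x_T
targets = reverse_targets(chain)

for t, value, context in targets:
    print(f"step {t}: predict pixel {t} = {value} given {context}")
```

Each reverse step reveals exactly one pixel conditioned on the ones already revealed, which is the factorization an autoregressive model uses.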

The Deeper Insight

The paper argues: Gaussian diffusion = autoregressive decoding with a different "bit ordering"

Instead of ordering bits by pixel position (which is arbitrary), diffusion orders by noise level — which might better match the hierarchical structure of natural images.

Why this matters:

The order in which you predict coordinates affects model quality (known from prior work on autoregressive models)
Diffusion's ordering by noise level might be more natural for images than positional ordering
The diffusion length ( $T = 1000$ ) doesn't need to equal data dimension, allowing flexible expressiveness

Key Takeaways

Lossy compression: Diffusion models are excellent lossy compressors because the first bits transmitted capture the perceptible content, while more than half of the lossless codelength describes imperceptible detail that a lossy codec can drop.
Progressive structure: The rate-distortion curve is steep at low rates, meaning you get good reconstructions quickly. Perfect for progressive transmission/generation.
Hierarchical generation: Diffusion naturally generates coarse-to-fine, matching human perception and artistic processes.
Theoretical unification: Diffusion can be understood as an autoregressive model with noise-level-based ordering instead of spatial ordering — potentially a better inductive bias for images.

4.4 Interpolation

We can interpolate source images $\mathbf{x}_0, \mathbf{x}_0' \sim q(\mathbf{x}_0)$ in latent space using $q$ as a stoch...

Section 4.4: Interpolation – A Comprehensive Explanation

Big Picture: What Are We Trying to Accomplish?

The interpolation section demonstrates a fascinating capability of diffusion models: creating smooth, high-quality transitions between two images. Think of it like blending between two photographs, but with intelligent "cleanup" that removes the artifacts that would normally appear from naïve blending.

This is important because:

It shows that the model has learned meaningful semantic structure in its latent representations
It demonstrates that the reverse process can "repair" corrupted or blended images intelligently
It provides qualitative evidence that the model understands image features and how to interpolate between them

The Core Idea: Three-Step Process

The authors propose a clever three-step interpolation scheme:

Step 1: Encode both source images into a "noisy" latent space

Take two real images: $\mathbf{x}_0, \mathbf{x}_0' \sim q(\mathbf{x}_0)$ (note: both are sampled from the data distribution $q$ )
Use the forward diffusion process $q$ to "corrupt" them: $\mathbf{x}_t, \mathbf{x}_t' \sim q(\mathbf{x}_t|\mathbf{x}_0)$
At some intermediate timestep $t$ , both images have been partially corrupted with noise

Step 2: Linearly interpolate in this corrupted space

Instead of blending the original clean images (which would look blurry/unnatural), blend the corrupted versions:

\bar{\mathbf{x}}_t = (1-\lambda)\mathbf{x}_t + \lambda\mathbf{x}_t'

where $\lambda \in [0, 1]$ is the interpolation parameter (0 = all of image 1, 1 = all of image 2, 0.5 = equal mix)

Step 3: Decode back to image space with the reverse process

Use the learned reverse process $p_\theta$ to "denoise": $\bar{\mathbf{x}}_0 \sim p(\mathbf{x}_0|\bar{\mathbf{x}}_t)$
This generates a high-quality image that blends features from both source images

Why This Works: The Key Insight

The crucial detail the authors mention: "We fixed the noise for different values of $\lambda$ "

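This means the noise draws used to compute $\mathbf{x}_t$ and $\mathbf{x}_t'$ are held fixed while $\lambda$ varies, so the two encoded latents stay the same across a whole sequence of interpolations and only the mixing weight changes. Recall from earlier in the paper that a sample from $q(\mathbf{x}_t|\mathbf{x}_0)$ can be written as:

\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}

where $\boldsymbol{\epsilon}$ is Gaussian noise. Writing $\boldsymbol{\epsilon}$ and $\boldsymbol{\epsilon}'$ for the draws used for the two images:

$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$
$\mathbf{x}_t' = \sqrt{\bar{\alpha}_t}\mathbf{x}_0' + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}'$

When we interpolate these:

\bar{\mathbf{x}}_t = (1-\lambda)\mathbf{x}_t + \lambda\mathbf{x}_t' = \sqrt{\bar{\alpha}_t}[(1-\lambda)\mathbf{x}_0 + \lambda\mathbf{x}_0'] + \sqrt{1-\bar{\alpha}_t}[(1-\lambda)\boldsymbol{\epsilon} + \lambda\boldsymbol{\epsilon}']

Because $\boldsymbol{\epsilon}$ and $\boldsymbol{\epsilon}'$ do not change with $\lambda$ , moving along the interpolation changes only the blend weights, not the underlying noise patterns. In the special case where both images share one draw ( $\boldsymbol{\epsilon}' = \boldsymbol{\epsilon}$ ), the noise term collapses to $\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}$ and only the image components are interpolated; the paper does not state whether the draws are shared. Fixing the noise is what keeps the sequence of interpolations consistent.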
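
The encode-and-mix part of the procedure can be sketched as follows. This is a minimal sketch with toy data and an illustrative $\bar{\alpha}_t$ ; the function names are ours, and the decode step (running the learned reverse process from $\bar{\mathbf{x}}_t$ ) is omitted because it needs a trained model. The noise draws are passed in explicitly so they can be held fixed across $\lambda$ ; passing the same array for both images gives the shared-noise case.

```python
import numpy as np


def q_sample(x0, eps, alpha_bar_t):
    """Sample x_t from q(x_t | x_0) using a given noise draw."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def interpolate_latents(x_t_a, x_t_b, lam):
    """Linear interpolation between two encoded latents."""
    return (1.0 - lam) * x_t_a + lam * x_t_b


rng = np.random.default_rng(0)
x0_a = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy source image 1
x0_b = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy source image 2
alpha_bar_t = 0.08                           # illustrative corruption level

# Draw the noise once, then reuse the same latents for every lambda.
eps_a = rng.standard_normal(x0_a.shape)
eps_b = rng.standard_normal(x0_b.shape)
x_t_a = q_sample(x0_a, eps_a, alpha_bar_t)
x_t_b = q_sample(x0_b, eps_b, alpha_bar_t)

lams = np.linspace(0.0, 1.0, 5)
mixed = [interpolate_latents(x_t_a, x_t_b, lam) for lam in lams]
# Each entry of `mixed` would then be decoded with the reverse process (not shown).
```

The endpoints of `mixed` are the two encoded source latents, so decoding them yields reconstructions of the sources, and the intermediate entries yield the interpolations.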

Understanding the Reverse Process Cleanup

The key is this phrase: "use the reverse process to remove artifacts from linearly interpolating corrupted versions"

When you linearly blend two images naively, you get a "ghosting" or blurring effect. But here's the magic:

The interpolated latent $\bar{\mathbf{x}}_t$ is not a natural image—it's a corrupted mixture
The reverse process has learned to denoise—to convert corrupted, low-quality representations back to clean images
During this denoising, the model intelligently "chooses" between features from the two source images rather than averaging them

For example, instead of creating a blurry average of two faces, it might:

Use pose from one image and skin tone from the other
Smoothly transition hairstyle from one to the other
Average the background naturally

The Role of the Timestep Parameter $t$

The parameter $t$ controls the interpolation behavior:

Small $t$ (e.g., $t = 50$ ):
- Both $\mathbf{x}_t$ and $\mathbf{x}_t'$ are only slightly corrupted
- Interpolating between them mostly preserves the source images
- Reconstruction is faithful to the originals
- Interpolation shows fine details
Large $t$ (e.g., $t = 500$ in Fig. 8):
- Both source images are heavily corrupted
- Much of the detail is lost in the corruption
- The reverse process has more freedom to create variations
- Results show plausible interpolations but with coarser features
- More "creative" results as structural details blur out
Very large $t$ (e.g., $t = 1000$ ):
- Near-complete corruption (approaching random noise)
- Reverse process essentially generates novel samples
- Results are very different from source images

This is displayed in the figures: Fig. 8 (right) shows $t = 500$ , and the appendix (Fig. 9) shows $t = 1000$ with much more variation.
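
How much of the source image survives at each of these timesteps can be read off the schedule. The sketch below assumes the linear $\beta$ schedule from the paper's experiments (1e-4 to 0.02, $T = 1000$ ) and reports the signal scale $\sqrt{\bar{\alpha}_t}$ and noise scale $\sqrt{1-\bar{\alpha}_t}$ at the three timesteps discussed above.

```python
import math

T = 1000
# Linear beta schedule (assumed, as reported in the paper's experiments)
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)   # alpha_bar[t - 1] holds \bar{alpha}_t


def signal_and_noise_scale(t):
    """Return (sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t)) for timestep t in 1..T."""
    ab = alpha_bar[t - 1]
    return math.sqrt(ab), math.sqrt(1.0 - ab)


levels = {t: signal_and_noise_scale(t) for t in (50, 500, 1000)}
for t, (signal, noise) in levels.items():
    print(f"t = {t:4d}: signal scale {signal:.3f}, noise scale {noise:.3f}")
```

Under this schedule the signal scale is close to 1 at $t = 50$ , drops to roughly a quarter at $t = 500$ , and is nearly zero at $t = 1000$ , which matches the qualitative behavior described in the list.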

What the Results Tell Us

The observed interpolation properties:

Feature	What this shows
Smooth pose transitions	Model learns continuous pose variations
Skin tone blending	Color/texture interpolation in latent space
Hairstyle transitions	High-level semantic features interpolate smoothly
Expression changes	Dynamic facial features are captured
Background averaging	Spatial layout understood hierarchically
Eyewear NOT interpolating	This is interesting—some discrete features resist interpolation

The fact that most attributes interpolate smoothly but eyewear does not suggests the model may have learned discrete (on/off) representations for certain binary properties, which is actually a sign of sophisticated feature learning.

Mathematical Perspective: What's Happening in the Latent Space

From the perspective of the model:

Latent space geometry: The forward process $q(\mathbf{x}_t|\mathbf{x}_0)$ projects images into a "corrupted" latent space at level $t$
Linear structure: At intermediate corruption levels, this space appears to have some linear structure (linear interpolation produces meaningful intermediate points)
Learned manifold: The learned reverse process $p_\theta(\mathbf{x}_0|\mathbf{x}_t)$ knows the structure of realistic images and "projects" any point back onto this manifold
Generalization: Linear interpolation between corrupted points, when decoded, produces new valid points on the image manifold

This is quite different from, say, trying to linearly interpolate in pixel space or even many other latent variable models. The diffusion process essentially gives us a learned coordinate system where linear interpolation is meaningful.

Connection to Earlier Concepts

Recall from earlier sections:

Equation (15) showed: $\mathbf{x}_0 \approx \hat{\mathbf{x}}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t)) / \sqrt{\bar{\alpha}_t}$
The reverse process progressively denoises by predicting the noise: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t)$
Here, we're using that same denoising ability to clean up blended, corrupted images

Summary

The interpolation section demonstrates that:

Diffusion models create meaningful latent representations where linear interpolation is semantically meaningful
The reverse process is a powerful image enhancement tool, not just for generation from noise but also for "repairing" corrupted or blended images
Interpolation quality depends on corruption level, allowing a tunable trade-off between faithfulness to sources and creative freedom
Features interpolate at different rates, suggesting hierarchical feature learning (coarse features emerge before fine details)

This capability—smooth, attribute-preserving interpolation—is strong evidence that the model has captured the underlying structure of natural images in its learned representations.

5 Related Work

While diffusion models might resemble flows [9, 46, 10, 32, 5, 16, 23] and VAEs [33, 47, 37], diffusion models are desig...

Section 5: Related Work - A Comprehensive Explanation

Big Picture: What's This Section About?

This section positions diffusion models within the broader landscape of generative modeling research. The authors are essentially saying: "Our approach is novel and distinct, but it also connects to several important existing ideas in machine learning." They do this by:

Contrasting diffusion models with similar approaches (flows and VAEs)
Connecting to denoising score matching and Langevin dynamics
Discussing other related methods for learning Markov chains
Linking to energy-based models and autoregressive models

This matters because it establishes credibility, shows the work isn't in isolation, and reveals unexpected mathematical connections that deepen our understanding of what diffusion models actually do.

Part 1: How Diffusion Models Differ from Flows and VAEs

The Key Distinction

The authors start with an important contrast:

"diffusion models are designed so that $q$ has no parameters and the top-level latent $\mathbf{x}_T$ has nearly zero mutual information with the data $\mathbf{x}_0$ "

Let me unpack this carefully:

What this means:

$q$ has no parameters: Recall from previous sections that $q$ is the forward diffusion process (the noising process). Unlike flows and VAEs, $q$ is completely fixed—we don't learn it. We only learn the reverse process $p_\theta$ .
$\mathbf{x}_T$ has nearly zero mutual information with $\mathbf{x}_0$ : Mutual information $I(X; Y)$ measures how much knowing one variable tells us about another. If we know the final noisy state $\mathbf{x}_T$ , it tells us almost nothing about the original image $\mathbf{x}_0$ because we've added so much noise. This is intentional and different from:
- Flows: These create bijective (one-to-one) mappings, preserving all information
- VAEs: These compress information into a lower-dimensional latent space intentionally, but it's structured to contain useful information

Why This Matters

This design choice has a nice consequence: because $\mathbf{x}_T$ is essentially pure noise, generation can start from a simple prior, $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$ , with no learned encoder. The reverse process does not recover the particular $\mathbf{x}_0$ that produced a given $\mathbf{x}_T$ (that information is gone); if it is learned well, it produces a new sample from the data distribution.

Part 2: The Connection to Denoising Score Matching and Langevin Dynamics

What Are These Things?

This is the deepest part of the section. Let me introduce the concepts:

Score Matching: The "score" is the gradient of the log probability: $s(\mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x})$

where $\nabla_\mathbf{x}$ means "take the gradient with respect to $\mathbf{x}$ " (the direction of steepest increase).

Score matching is a training technique where we learn to estimate this gradient directly, rather than estimating the probability distribution itself.

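Denoising Score Matching: This is score matching applied to noise-perturbed data. Instead of matching the score of the data distribution directly, the model matches the score of data corrupted with known noise, which amounts to learning to denoise. The prior work that the paper connects to applies this at multiple noise levels.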

Langevin Dynamics: This is a sampling technique from statistical physics. To generate new samples from a distribution with score $s(\mathbf{x})$ , you iteratively update: $\mathbf{x}_{t-1} = \mathbf{x}_t + \frac{\epsilon}{2}s(\mathbf{x}_t) + \sqrt{\epsilon} \mathbf{z}_t$

where:

$\epsilon$ is a step size
$\mathbf{z}_t$ is random Gaussian noise
The score term $\frac{\epsilon}{2}s(\mathbf{x}_t)$ moves you toward higher probability regions (following the gradient)
The noise term $\sqrt{\epsilon} \mathbf{z}_t$ adds randomness to explore
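
The update rule can be tried on a distribution whose score is known in closed form. The sketch below is an illustration, not anything from the paper: the target is a one-dimensional Gaussian with mean 2 and standard deviation 0.5 (our choice), and the step size is called `step` to avoid confusion with the noise symbol $\boldsymbol{\epsilon}$ used elsewhere.

```python
import math
import random


def score(x, mu=2.0, sigma=0.5):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2


def langevin_chain(n_steps, step=0.01, x_init=0.0, seed=0):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * score(x) + sqrt(step) * z."""
    rng = random.Random(seed)
    x = x_init
    xs = []
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs


samples = langevin_chain(21000)[1000:]   # discard the first 1000 steps as burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(f"sample mean {mean:.2f} (target 2.0), sample variance {var:.2f} (target 0.25)")
```

Even though the chain starts far from the mode and only ever sees the score, its samples settle around the target mean and variance; a finite step size introduces a small bias in the variance.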

Annealed Langevin Dynamics: Run Langevin dynamics at progressively decreasing noise levels to gradually refine samples.

How Diffusion Models Connect

The authors' $\boldsymbol{\epsilon}$ -prediction parameterization (predicting the noise in the forward process) establishes a mathematical equivalence:

Training a diffusion model to predict noise = Training a Langevin dynamics sampler via variational inference

This is profound because:

It provides a new theoretical understanding of why diffusion models work
The reverse process in the diffusion model is literally performing something like annealed Langevin dynamics
But diffusion models add something important: straightforward log likelihood evaluation (the variational bound on $\log p_\theta(\mathbf{x}_0)$ can be computed directly)

Langevin dynamics alone didn't have an easy way to evaluate likelihoods, so diffusion models solve this problem while maintaining the connection.

The Bidirectional Insight

The text notes: "The connection also has the reverse implication that a certain weighted form of denoising score matching is the same as variational inference to train a Langevin-like sampler."

In other words:

Diffusion interpretation: We're learning a denoising network for a probabilistic model
Score matching interpretation: We're learning to estimate gradients of log probability at various noise levels

These are mathematically equivalent! This duality is a key insight.

Part 3: Other Related Methods for Learning Markov Chains

The authors briefly mention alternative methods:

Infusion training, variational walkback, generative stochastic networks, etc.

These are all methods for learning the transition operators (the "how to go from one state to the next") of Markov chains. A Markov chain is a sequence of random states where the next state depends only on the current state:

$\mathbf{x}_0 \to \mathbf{x}_1 \to \mathbf{x}_2 \to \cdots \to \mathbf{x}_T$

Diffusion models are one way to do this, but historically there were other approaches. The authors are positioning their work in this lineage.

Part 4: Connections to Energy-Based Models

Score Matching ↔ Energy-Based Models

There's a known mathematical relationship: $s(\mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x}) = -\nabla_\mathbf{x} E(\mathbf{x})$

where $E(\mathbf{x})$ is an "energy function."

This comes from the fact that in statistical physics, probability distributions can be expressed as: $p(\mathbf{x}) = \frac{e^{-E(\mathbf{x})}}{\text{normalization}}$
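
The reason the normalization drops out is that it does not depend on $\mathbf{x}$ , so it vanishes under the gradient. A small numerical check with an illustrative one-dimensional double-well energy (our choice, not from the paper):

```python
def energy(x):
    """Illustrative double-well energy E(x) = x^4 / 4 - x^2."""
    return 0.25 * x ** 4 - x ** 2


def energy_grad(x):
    """Analytic gradient dE/dx = x^3 - 2x."""
    return x ** 3 - 2.0 * x


def log_p_unnormalized(x):
    """log p(x) up to the constant -log Z, which does not depend on x."""
    return -energy(x)


def score_fd(x, h=1e-5):
    """Score estimated by central finite differences on the unnormalized log density."""
    return (log_p_unnormalized(x + h) - log_p_unnormalized(x - h)) / (2.0 * h)


x = 0.7
print(f"finite-difference score: {score_fd(x):.6f}")
print(f"negative energy gradient: {-energy_grad(x):.6f}")
```

The two values agree without the normalizing constant ever being computed, which is why score-based methods can work with unnormalized models.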

Why mention this? Because recent work on energy-based models might benefit from insights about diffusion models, and vice versa. The connection is bidirectional.

Part 5: Rate-Distortion and Progressive Decoding

Referencing Previous Results

The authors reference the rate-distortion curves from Section 4.3 (the curves showing how quality improves with more bits). They note this is "reminiscent of how rate-distortion curves can be computed over distortion penalties in one run of annealed importance sampling."

What does this mean?

Annealed importance sampling (AIS) is another technique where you gradually change the distribution you're sampling from (annealing) to get better estimates. Computing rate-distortion curves by progressively transmitting information is conceptually similar—you're gradually improving the reconstruction.

Part 6: Connection to Autoregressive Models

This final point references Section 4.3's discussion of autoregressive decoding.

The Key Insight

The authors showed earlier (Equation 16 in Section 4.3) that the diffusion objective can be rewritten in a form that looks like autoregressive modeling, where you predict one variable at a time conditioned on previously predicted variables.

The progressive decoding connection: Just as autoregressive models generate data one piece at a time, the diffusion model generates images progressively:

First: large-scale features appear
Last: fine details are refined

This might explain why diffusion models work well—they might inherit good inductive biases from the autoregressive modeling literature. The authors mention that different orderings of coordinates affect autoregressive model quality (prior work [38]), suggesting that the Gaussian noise schedule in diffusion might serve a similar purpose.

Summary: Why This Section Matters

By positioning diffusion models within this landscape, the authors establish:

Novelty: Diffusion models are distinct from flows and VAEs in important ways
Theoretical grounding: Deep connections to score matching and Langevin dynamics provide theoretical understanding
Practical implications: The connections to likelihood-based models, energy-based models, and autoregressive models suggest where future work might go
Broader context: This isn't an isolated technique—it's part of a rich mathematical landscape

The remarkable insight is that seemingly different approaches (diffusion, score matching, Langevin sampling, autoregressive models) are actually deeply connected mathematically. Understanding these connections helps us build better generative models.

6 Conclusion

We have presented high quality image samples using diffusion models, and we have found connections among diffusion model...

Explaining Section 6: Conclusion

The Big Picture

This conclusion section is brief but important—it's the authors stepping back to highlight what they've accomplished and why it matters. They're essentially saying: "We've shown that diffusion models work really well for images, AND we've discovered that these models connect to several other important ideas in machine learning." This is significant because when different mathematical frameworks turn out to be related, it often provides deeper insight into why something works.

Think of it like discovering that three seemingly different roads all lead to the same destination—understanding the connections helps us travel more efficiently and understand the landscape better.

Breaking Down the Key Claims

1. High-Quality Image Generation

The authors start by stating they've achieved their main goal: producing high-quality image samples using diffusion models.

From the abstract and earlier sections, we know they achieved:

Inception Score of 9.46 on CIFAR-10
FID score of 3.17 (state-of-the-art at time of publication)
Quality comparable to ProgressiveGAN on high-resolution images (256×256)

Why this matters: This demonstrates that diffusion models are practical and competitive with other state-of-the-art generative models, not just a theoretical curiosity.

2. Connections to Multiple Frameworks

The authors identify five major connections that their work has uncovered:

Connection A: Variational Inference for Markov Chains

What this means: Recall from Section 4.3 (Equation 16) that the diffusion model's objective can be written as:

$L = D_{\mathrm{KL}}(q(\mathbf{x}_T) \| p(\mathbf{x}_T)) + \mathbb{E}_q\left[\sum_{t \geq 1} D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_t) \| p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))\right] + H(\mathbf{x}_0)$

This equation is literally the variational bound used in variational inference. Breaking this down:

$D_{\mathrm{KL}}(q \| p)$ denotes the Kullback-Leibler divergence—a measure of how different two probability distributions are. Mathematically: $D_{\mathrm{KL}}(q \| p) = \mathbb{E}_q[\log q(x) - \log p(x)]$
The diffusion process $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ creates a Markov chain—a sequence where each step only depends on the previous step, not the entire history
Variational inference is a technique for approximating complex distributions by optimizing a simpler one. The diffusion model is doing exactly this.

Why it's important: This shows diffusion models aren't using some novel training principle—they're actually a specific instance of a well-understood general framework.

Connection B: Denoising Score Matching with Annealed Langevin Dynamics

What this means: From earlier sections (particularly Section 3), we learned that:

The model predicts noise $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t)$ at each diffusion step
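This is equivalent, up to a known scale factor, to learning the score (gradient of log-probability) of the noised data distribution: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t) \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$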
This connects to score matching, which trains models to estimate these gradients
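
The link between noise and score can be checked numerically for the conditional distribution $q(\mathbf{x}_t|\mathbf{x}_0)$ , which is Gaussian and therefore has a closed-form score. The sketch below uses toy data and an illustrative $\bar{\alpha}_t$ ; it covers only the conditional score that denoising score matching regresses on, not the marginal score of the noised data.

```python
import numpy as np


def conditional_score(x_t, x0, alpha_bar_t):
    """Gradient of log q(x_t | x_0) w.r.t. x_t for q = N(sqrt(abar) * x0, (1 - abar) * I)."""
    return -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=6)   # toy data point
eps = rng.standard_normal(6)          # noise used in the forward process
alpha_bar_t = 0.4                     # illustrative value

x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
score = conditional_score(x_t, x0, alpha_bar_t)

# Rescaling the score recovers the noise: eps = -sqrt(1 - abar) * score
eps_from_score = -np.sqrt(1.0 - alpha_bar_t) * score
```

Predicting the noise and predicting the score are therefore the same regression target up to the factor $-\sqrt{1-\bar{\alpha}_t}$ .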

Langevin dynamics is a sampling technique that generates samples from a distribution by:

Starting with random noise
Iteratively moving in the direction of the score (uphill on the log-probability)
Adding noise to escape local modes

The "annealed" part means you use different noise levels at different steps—exactly what diffusion does!

Why it's important: This connection reveals that diffusion models are literally training Langevin samplers through variational inference. It's a beautiful unification of two previously separate ideas.

Connection C: Autoregressive Models

This is discussed in detail in Section 4.3, but the key idea is:

If you set $T$ (the number of diffusion steps) equal to the data dimensionality and modify the forward process to progressively mask out coordinates instead of adding noise, the diffusion objective becomes indistinguishable from autoregressive modeling.

Mathematical intuition: An autoregressive model predicts: $p(\mathbf{x}_0) = p(x_1) \cdot p(x_2|x_1) \cdot p(x_3|x_1,x_2) \cdots p(x_D|x_1,...,x_{D-1})$

Gaussian diffusion instead defines: $p_\theta(\mathbf{x}_0) = \int p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \, d\mathbf{x}_{1:T}$

Both are chains of conditional predictions, and the masking construction above recovers the autoregressive case exactly. Gaussian diffusion, however, uses a generalized bit ordering that can't be expressed by simply reordering coordinates.

Why it's important: This suggests that diffusion models might have inductive biases (built-in assumptions about the structure of images) similar to autoregressive models, but potentially superior ones.

Connection D: Energy-Based Models

Through the connection to score matching, diffusion models relate to energy-based models (EBMs), which model distributions as: $p(\mathbf{x}) = \frac{1}{Z} e^{-E(\mathbf{x})}$

where $E(\mathbf{x})$ is an energy function and $Z$ is a normalizing constant.

The score (gradient of log-probability) is simply $-\nabla E(\mathbf{x})$ , so learning scores is equivalent to learning energy functions.

Why it's important: This opens doors to using techniques from energy-based modeling to improve diffusion models.

Connection E: Progressive Lossy Compression

From Section 4.3, the variational bound can be decomposed as:

Rate (bits spent transmitting the latents): $L_1 + \cdots + L_T$
Distortion (what remains to be described at the final step): $L_0$

The progressive generation procedure (Algorithms 3 and 4) literally implements a lossy compression codec where you can stop at any time and get a reconstructed image.

Why it's important: This suggests diffusion models have an inductive bias that makes them excellent lossy compressors: the perceptible content is captured by the first bits, and most of the lossless codelength describes imperceptible distortions.

Why These Connections Matter

Mathematical Unification

When you can express the same model using multiple mathematical frameworks, it suggests you've discovered something fundamental. It's like finding that three different physics equations are actually the same law written in different coordinate systems.

Theoretical Understanding

These connections help us understand why diffusion models work:

The variational inference connection shows the training is principled
The score matching connection explains why the denoising objective works
The autoregressive connection reveals the inductive biases
The compression connection shows why the model learns meaningful hierarchies

Practical Implications

These connections suggest how to improve diffusion models (borrow techniques from each framework)
They indicate which hyperparameters matter (e.g., timestep spacing affects the "bit ordering")
They open doors to combining diffusion with other techniques

The Forward-Looking Statement

The conclusion ends with:

"Since diffusion models seem to have excellent inductive biases for image data, we look forward to investigating their utility in other data modalities and as components in other types of generative models and machine learning systems."

Translation: The authors are saying:

"Excellent inductive biases for image data" — Diffusion models seem particularly good at learning image structure (as evidenced by the high-quality samples)
"Other data modalities" — Will this work for audio, text, 3D shapes, video, etc.?
"Components in other systems" — Can we use diffusion as part of larger models? (This prediction was accurate—diffusion is now used in text-to-image models, video generation, etc.)

Summary for Intuition

Imagine you develop a new type of lock that works really well. Then you discover that:

It's actually a special case of a general locking principle (variational inference)
It's closely connected to three other locking systems (score matching, autoregressive, energy-based)
It has unexpected properties (compression)

This discovery does two things:

Validates that your lock is sound (it connects to established frameworks)
Expands your understanding of locks in general

That's what this conclusion accomplishes for diffusion models.

Broader Impact

Broader Impact Section Explanation

Big Picture: Why This Section Matters

This section steps back from the technical details of diffusion models to discuss the real-world consequences of this research. It's not about mathematics or algorithms—it's about ethics and societal impact.

The authors are essentially saying: "We've developed a powerful new tool for generating images. Before researchers and practitioners use this, we need to acknowledge both the good and bad things this technology could enable."

This is important because:

Powerful tools can be misused: A hammer can build a house or cause harm
The authors have a responsibility: By publishing this work, they're making it available to everyone, so they should acknowledge potential harms
Generative models are becoming mainstream: As AI gets better, societal impacts become more significant

The Main Arguments: Broken Down

1. Diffusion Models Fit Into a Broader Category

"Our work on diffusion models takes on a similar scope as existing work on other types of deep generative models, such as efforts to improve the sample quality of GANs, flows, autoregressive models, and so forth."

What this means:

Diffusion models aren't unique in their impact—they're one tool among many (GANs, flows, autoregressive models) that can generate synthetic data
This work makes diffusion models better and more practical, so it amplifies whatever impacts generative models already have
The implications aren't entirely novel, but the effectiveness of this method matters

Analogy: If you invent a faster car engine, you're not inventing the concept of transportation, but you are making transportation more accessible and widespread.

2. The Malicious Uses Problem

The authors identify two main concerns:

Deepfakes and Misinformation

"Sample generation techniques can be employed to produce fake images and videos of high profile figures for political purposes."

What this means:

Diffusion models can generate realistic images of people who don't exist, or synthetic videos of real people doing things they never did
These can be weaponized for:
- Political manipulation (fake scandal videos)
- Fraud and impersonation
- Damage to reputation

Important nuance: The authors acknowledge this isn't new: fake images were created by hand long before software tools existed. What has changed is ease of access. With deep generative models, you don't need special skills; you just need to run a script.

Current safeguard (citation [62]): The authors note that "CNN-generated images currently have subtle flaws that allow detection." This means we can still tell synthetic images apart from real ones right now. But as models improve (like in this paper), this becomes harder.

Dataset Bias Amplification

"Generative models also reflect the biases in the datasets on which they are trained... If samples from generative models trained on these datasets proliferate throughout the internet, then these biases will only be reinforced further."

What this means:

Training data (like internet images) contains human biases:
- Racial representation imbalances
- Gender stereotypes
- Socioeconomic biases
- Historical inequities
When you train a generative model on biased data, it learns and reproduces those biases
If synthetic data from these models is then used in downstream tasks (or spreads online), those biases get amplified

Example: If a training dataset has more images of doctors that are male, the model learns this association. Generated images of doctors will then disproportionately show men, which reinforces the stereotype.

The vicious cycle:

$\text{Biased Training Data} \rightarrow \text{Model Learns Biases} \rightarrow \text{Generates Biased Samples} \rightarrow \text{Biased Data Back on Internet} \rightarrow \text{Future Models Train on Worse Data}$

3. The Potential Benefits

The authors don't want to be entirely pessimistic, so they also discuss positive applications:

Data Compression

"diffusion models may be useful for data compression, which, as data becomes higher resolution and as global internet traffic increases, might be crucial to ensure accessibility of the internet to wide audiences."

What this means:

Diffusion models learn to represent images efficiently (recall the progressive lossy coding view from Section 4.3, where an image is reconstructed from a limited bit budget)
This compression capability could reduce internet bandwidth needed for transmitting high-resolution images
This matters for global accessibility: people with slower internet connections in developing regions could still access visual content

Representation Learning

"Our work might contribute to representation learning on unlabeled raw data for a large range of downstream tasks, from image classification to reinforcement learning"

What this means:

Diffusion models learn rich internal representations of images during training
These learned representations (the "features" the model captures) can be extracted and reused for other tasks
You don't need labeled data: the model learns from unlabeled images, which is cheaper and easier to collect
This could improve many downstream applications

Analogy: A student who learns to understand images deeply (not just memorize labels) can apply that understanding to many different tasks.

Creative Applications

"diffusion models might also become viable for creative uses in art, photography, and music."

What this means:

Artists and photographers could use diffusion models as a tool to:
- Generate new ideas and inspirations
- Augment their creative process
- Explore novel artistic directions
Similar to how Photoshop became a creative tool rather than just a manipulation tool

The Tension: A Fundamental Challenge

Notice that the benefits and harms are two sides of the same coin:

Capability	Beneficial Use	Harmful Use
Generate realistic images	Creative tools, art	Deepfakes, misinformation
Learn from unlabeled data	Improve AI for good	Amplify biases at scale
Compress data efficiently	Help accessibility	Enable faster misinformation spread

The authors aren't claiming to have solved this tension—they're just acknowledging it exists.

What This Section Is Not Saying

⚠️ Common misinterpretations:

Not saying: "Don't use diffusion models" or "We regret publishing this"
Not saying: "These harms are unique to diffusion models" (they apply to GANs, VAEs, etc.)
Not saying: "We have solutions to these problems" (they're just highlighting concerns)

Why This Matters in the Context of the Paper

This section comes after the technical contributions and the conclusion. It signals that:

Impact awareness: The authors understand their work has implications beyond academic metrics
Responsibility: Publishing a powerful tool comes with acknowledging how it might be misused
Realistic optimism: There are genuine benefits, but they require thoughtful deployment

This is increasingly expected in ML research—not just to optimize metrics like FID or Inception score, but to think about the broader ecosystem in which the technology operates.

A Extended derivations

Below is a derivation of Eq. (5), the reduced variance variational bound for diffusion models. This material is from Soh...

B Experimental details

Our neural network architecture follows the backbone of PixelCNN++ [52], which is a U-Net [48] based on a Wide ResNet [7...

C Discussion on related work

Our model architecture, forward process definition, and prior differ from NCSN [55, 56] in subtle but important ways tha...

D Samples

Additional samples: Figure 11, 13, 16, 17, 18, and 19 show uncurated samples from the diffusion models trained on CelebA...
