10 Diffusion and score-based models

Diffusion models are the part of modern AI most directly continuous with statistical physics, which makes this chapter unusually easy to write for the intended audience.

The underlying idea: define a stochastic process that gradually corrupts data into noise, then learn to reverse it. The forward process is fixed and analytically tractable; the reverse process is the learned model. Sampling means integrating the reverse process from pure noise back to data.

10.1 The forward process

Start with a sample x_0 \sim p_{\text{data}}. Define a noising process indexed by time t \in [0, T]. The simplest version is the variance-preserving SDE: dx_t = -\tfrac{1}{2}\, \beta(t)\, x_t\, dt + \sqrt{\beta(t)}\, dW_t, which interpolates between x_0 (data) and approximately \mathcal{N}(0, I) at t = T. The noise schedule \beta(t) is a design choice.
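
To make the forward SDE concrete, here is a minimal Euler-Maruyama sketch in NumPy. The linear schedule \beta(t) = \beta_{\min} + t\,(\beta_{\max} - \beta_{\min}) with \beta_{\min} = 0.1, \beta_{\max} = 20 and T = 1 is an assumption (a common choice in the score-SDE literature), as are the function names and the toy data; none of this is fixed by the text.

```python
import numpy as np

def beta(t, beta_min=0.1, beta_max=20.0):
    """Linear noise schedule on t in [0, 1] (an assumed, commonly used choice)."""
    return beta_min + t * (beta_max - beta_min)

def forward_sde(x0, n_steps=1000, T=1.0, seed=0):
    """Euler-Maruyama integration of dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dW."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    dt = T / n_steps
    for i in range(n_steps):
        t = i * dt
        noise = rng.standard_normal(x.shape)
        # Drift shrinks x toward 0; the diffusion term injects fresh Gaussian noise.
        x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * noise
    return x

# Toy "data": 10,000 scalar points, all sitting at x = 3.
x0 = np.full(10_000, 3.0)
xT = forward_sde(x0)
print(xT.mean(), xT.std())  # should be close to 0 and 1 respectively
```

With this schedule the data are almost entirely forgotten by t = T, which is the sense in which x_T is "approximately" standard normal.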

Conditioned on x_0, the state at any time t has a closed form, x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), where \bar{\alpha}_t = \exp\!\left(-\int_0^t \beta(s)\, ds\right). This is the continuous-time limit of the discrete-time DDPM formulation (Ho et al. 2020), in which \bar{\alpha}_t is instead a product of per-step factors 1 - \beta_s.
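
The closed form means x_t can be drawn in one shot, with no SDE integration. A minimal sketch, again assuming a linear schedule \beta(s) = \beta_{\min} + s\,(\beta_{\max} - \beta_{\min}) with illustrative values 0.1 and 20, for which the integral in \bar{\alpha}_t is elementary:

```python
import numpy as np

def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    """exp(-integral_0^t beta(s) ds) for the assumed linear schedule."""
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
    return np.exp(-integral)

def sample_xt(x0, t, rng):
    """Draw x_t given x_0 directly; also return the noise eps that was used."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar(t)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
x0 = np.full(100_000, 3.0)   # toy data: every point at x = 3
for t in (0.0, 0.25, 1.0):
    xt, _ = sample_xt(x0, t, rng)
    # Mean should track 3*sqrt(alpha_bar), variance should track 1 - alpha_bar.
    print(t, xt.mean(), xt.var())
```

Returning eps alongside x_t is a convenience for training, where the noise itself becomes the regression target.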

10.2 The reverse process

Anderson’s theorem gives the reverse-time SDE of the above: dx_t = \left[ -\tfrac{1}{2}\, \beta(t)\, x_t - \beta(t)\, \nabla_{x_t} \log p_t(x_t) \right] dt + \sqrt{\beta(t)}\, d\bar{W}_t, where \nabla_{x_t} \log p_t(x_t) is the score function of the marginal at time t and \bar{W} is reverse-time Brownian motion. If we know the score at every (x_t, t), we can integrate this SDE backward from x_T \sim \mathcal{N}(0, I) to a sample x_0 \sim p_{\text{data}}.
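
The reverse SDE can be tried out without any learning when the score is known exactly. The sketch below assumes one-dimensional Gaussian data N(\mu, \sigma^2), for which p_t is Gaussian and its score is available in closed form, plus an assumed linear \beta(t) schedule; the names and constants are illustrative only. In a real model, true_score would be replaced by the learned network.

```python
import numpy as np

MU, SIGMA = 3.0, 0.5          # toy data distribution N(MU, SIGMA^2) (assumed)
B_MIN, B_MAX = 0.1, 20.0      # linear schedule endpoints (assumed)

def beta(t):
    return B_MIN + t * (B_MAX - B_MIN)

def alpha_bar(t):
    return np.exp(-(B_MIN * t + 0.5 * (B_MAX - B_MIN) * t**2))

def true_score(x, t):
    """Exact score of p_t for Gaussian data, where p_t = N(m_t, v_t)."""
    a = alpha_bar(t)
    m_t = np.sqrt(a) * MU
    v_t = a * SIGMA**2 + (1.0 - a)
    return -(x - m_t) / v_t

def reverse_sde_sample(n_samples, n_steps=1000, T=1.0, seed=0):
    """Euler-Maruyama on the reverse-time SDE, stepping from t = T down to 0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)      # x_T ~ N(0, 1)
    dt = T / n_steps
    for i in range(n_steps):
        t = T - i * dt
        b = beta(t)
        drift = -0.5 * b * x - b * true_score(x, t)   # bracket in the reverse SDE
        # Moving backward in time flips the sign of the drift increment.
        x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(n_samples)
    return x

samples = reverse_sde_sample(20_000)
print(samples.mean(), samples.std())  # should land near MU and SIGMA
```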

The score function is what we learn.

10.3 The training objective

Train a neural network s_\theta(x, t) \approx \nabla_x \log p_t(x) via denoising score matching (Vincent 2011; applied to SDE-based models by Song et al. 2021): \mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\, \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2 \,\right], where the conditional score has a closed form \nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}. Although the regression target is the conditional score, the minimizer is the marginal score, because the conditional expectation of the target given x_t equals \nabla_{x_t} \log p_t(x_t).
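
A small numerical check can make the objective less mysterious. The sketch below assumes Gaussian toy data and a single fixed noise level \bar{\alpha}_t = 0.5 (both illustrative choices). It first confirms that the two forms of the conditional score agree, then fits a linear model to the conditional-score targets by least squares and compares it with the exact marginal score, which for Gaussian data is itself linear in x.

```python
import numpy as np

MU, SIGMA = 3.0, 0.5     # toy Gaussian data (assumed)
A_BAR = 0.5              # value of alpha_bar_t at some fixed t (assumed)

rng = np.random.default_rng(0)
n = 200_000
x0 = MU + SIGMA * rng.standard_normal(n)
eps = rng.standard_normal(n)
xt = np.sqrt(A_BAR) * x0 + np.sqrt(1.0 - A_BAR) * eps

# Two equivalent forms of the conditional score grad log p_t(x_t | x_0).
target_a = -(xt - np.sqrt(A_BAR) * x0) / (1.0 - A_BAR)
target_b = -eps / np.sqrt(1.0 - A_BAR)
print(np.allclose(target_a, target_b))

# Minimizing the DSM loss over linear models s(x) = w*x + b is ordinary least squares.
w_fit, b_fit = np.polyfit(xt, target_b, deg=1)

# Exact marginal score for Gaussian data: -(x - m_t) / v_t.
m_t = np.sqrt(A_BAR) * MU
v_t = A_BAR * SIGMA**2 + (1.0 - A_BAR)
print(w_fit, -1.0 / v_t)     # slopes agree up to sampling error
print(b_fit, m_t / v_t)      # intercepts agree up to sampling error
```

The targets are individually very noisy, yet the regression recovers the marginal score; this averaging is what lets the loss be computed without ever knowing p_t.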

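Substituting the parametrization s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t} turns the objective into a noise-prediction loss weighted by 1 / (1 - \bar{\alpha}_t). Dropping that weight, as Ho et al. (2020) do, leaves the “predict the noise” objective: \mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2 \,\right]. That is literally the training loss of a diffusion model: a network takes a noisy image and the timestep and tries to predict the noise. Train it for millions of steps and you have a generative model.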
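
The loss is easy to write down as a Monte Carlo estimate. The sketch below is not a training loop: it only evaluates the loss for two hand-written predictors on assumed Gaussian toy data with an assumed linear \beta(t) schedule. In practice eps_model would be a neural network and the same quantity would be minimized by stochastic gradient descent.

```python
import numpy as np

MU, SIGMA = 3.0, 0.5          # toy Gaussian data (assumed)
B_MIN, B_MAX = 0.1, 20.0      # linear schedule endpoints (assumed)

def alpha_bar(t):
    return np.exp(-(B_MIN * t + 0.5 * (B_MAX - B_MIN) * t**2))

def noise_prediction_loss(eps_model, x0, rng):
    """Monte Carlo estimate of E[(eps - eps_model(x_t, t))^2] with t ~ U(0, 1)."""
    t = rng.uniform(0.0, 1.0, size=x0.shape)
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar(t)
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

def zero_model(xt, t):
    """Baseline that always predicts zero noise."""
    return np.zeros_like(xt)

def oracle_model(xt, t):
    """E[eps | x_t] in closed form for Gaussian data: the best possible predictor."""
    a = alpha_bar(t)
    v_t = a * SIGMA**2 + (1.0 - a)
    return np.sqrt(1.0 - a) * (xt - np.sqrt(a) * MU) / v_t

rng = np.random.default_rng(0)
x0 = MU + SIGMA * rng.standard_normal(100_000)
loss_zero = noise_prediction_loss(zero_model, x0, rng)
loss_oracle = noise_prediction_loss(oracle_model, x0, rng)
print(loss_zero, loss_oracle)   # the oracle loss is lower, but not zero
```

Even the optimal predictor has nonzero loss, because at small t the noisy sample carries almost no information about which noise was added.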

10.4 Probability-flow ODE

The reverse SDE has a deterministic counterpart, the probability-flow ODE, with the same marginals p_t at every t: \frac{dx_t}{dt} = -\tfrac{1}{2}\, \beta(t)\, x_t - \tfrac{1}{2}\, \beta(t)\, \nabla_{x_t} \log p_t(x_t). This is what most production samplers use, because deterministic ODE solvers (DDIM, DPM-Solver, Heun) give good samples in roughly 10–50 steps, versus on the order of a thousand for naive SDE integration. The ODE perspective also makes diffusion models look like continuous normalizing flows with a learned vector field, and unifies them with flow matching, the currently dominant variant.
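
A minimal sketch of probability-flow sampling with plain explicit Euler steps, under the same illustrative assumptions as before: Gaussian toy data so that the exact score stands in for a trained network, and a linear \beta(t) schedule. Real samplers such as DPM-Solver use more refined integrators than this.

```python
import numpy as np

MU, SIGMA = 3.0, 0.5          # toy Gaussian data (assumed)
B_MIN, B_MAX = 0.1, 20.0      # linear schedule endpoints (assumed)

def beta(t):
    return B_MIN + t * (B_MAX - B_MIN)

def alpha_bar(t):
    return np.exp(-(B_MIN * t + 0.5 * (B_MAX - B_MIN) * t**2))

def true_score(x, t):
    """Exact score of p_t for Gaussian data; a stand-in for s_theta."""
    a = alpha_bar(t)
    return -(x - np.sqrt(a) * MU) / (a * SIGMA**2 + (1.0 - a))

def pf_ode_sample(x_T, n_steps=50, T=1.0):
    """Explicit Euler on the probability-flow ODE, from t = T down to t = 0."""
    x = np.array(x_T, dtype=float)
    dt = T / n_steps
    for i in range(n_steps):
        t = T - i * dt
        b = beta(t)
        dxdt = -0.5 * b * x - 0.5 * b * true_score(x, t)
        x = x - dxdt * dt            # step backward in time
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(20_000)
x_0 = pf_ode_sample(x_T)
print(x_0.mean(), x_0.std())                      # should land near MU and SIGMA
print(np.array_equal(pf_ode_sample(x_T), x_0))    # same noise in, same sample out
```

No noise is injected during integration, so the map from x_T to x_0 is deterministic; all randomness lives in the initial draw.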

10.5 Flow matching

A more direct framing: pick a path p_t interpolating between noise (t=0) and data (t=1), and learn a vector field v_\theta(x, t) that pushes p_0 to p_1. Note that time runs in the opposite direction from the sections above, where t = 0 was data. The training objective is again a regression: match v_\theta to the conditional velocity that takes a noise sample to a data sample along the chosen path.

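Flow matching with linear (or rectified) paths, x_t = (1 - t)\, \epsilon + t\, x_0 with \epsilon \sim \mathcal{N}(0, I) and x_0 still denoting the data sample, is the recipe behind most large-scale image and video generators in 2024–2026 (Stable Diffusion 3, recent video models). For this path the regression target is simply the constant velocity x_0 - \epsilon. The linear path produces straighter sampling trajectories than DDPM and typically needs fewer sampling steps.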
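
The flow-matching recipe for the linear path can be sketched in a few lines. Assumptions here: Gaussian toy data, for which the marginal velocity E[x_0 - \epsilon \mid x_t] has a closed form and can stand in for a trained v_\theta, and plain Euler integration; all names are illustrative.

```python
import numpy as np

MU, SIGMA = 3.0, 0.5   # toy Gaussian data (assumed)

def fm_training_batch(x0, rng):
    """Build one flow-matching regression batch for the linear path."""
    t = rng.uniform(0.0, 1.0, size=x0.shape)
    eps = rng.standard_normal(x0.shape)
    xt = (1.0 - t) * eps + t * x0      # point on the path at time t
    target = x0 - eps                  # conditional velocity d x_t / dt
    return t, xt, target

def marginal_velocity(x, t):
    """E[x0 - eps | x_t = x] for Gaussian data; stands in for a trained v_theta."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    cov = t * SIGMA**2 - (1.0 - t)     # Cov(x0 - eps, x_t)
    return MU + cov / var_t * (x - t * MU)

def fm_sample(eps, n_steps=100):
    """Explicit Euler on dx/dt = v(x, t) from noise (t = 0) to data (t = 1)."""
    x = np.array(eps, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + marginal_velocity(x, i * dt) * dt
    return x

rng = np.random.default_rng(0)
x0 = MU + SIGMA * rng.standard_normal(20_000)
t, xt, target = fm_training_batch(x0, rng)   # what a network would be trained on
samples = fm_sample(rng.standard_normal(20_000))
print(samples.mean(), samples.std())         # should land near MU and SIGMA
```

A network trained on (t, xt, target) by mean squared error would approximate marginal_velocity, for the same reason that denoising score matching recovers the marginal score.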

10.6 Conditioning and guidance

For conditional generation (text-to-image, class-conditional, image-to-image) you train a score s_\theta(x, t, c) that takes a condition c. At sampling time, classifier-free guidance amplifies the conditional signal: \tilde{s}(x, t, c) = (1 + w)\, s_\theta(x, t, c) - w\, s_\theta(x, t, \emptyset), trading sample diversity for closer adherence to the condition; w = 0 recovers the plain conditional model. During training, c is randomly replaced by \emptyset so that a single network learns both the conditional and the unconditional score.
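
Both halves of classifier-free guidance are short enough to sketch directly. Everything named here is hypothetical: NULL is a placeholder id for the empty condition, the drop probability 0.1 is just a typical-looking value, and toy_score is a hand-written stand-in for a trained s_\theta, not a consistent generative model.

```python
import numpy as np

NULL = -1   # hypothetical token id standing in for the empty condition

def drop_condition(labels, p_drop, rng):
    """Training-time step: replace each label by NULL with probability p_drop."""
    labels = np.asarray(labels)
    return np.where(rng.random(labels.shape) < p_drop, NULL, labels)

def guided_score(score_fn, x, t, c, w):
    """Classifier-free guidance: (1 + w) * conditional - w * unconditional."""
    return (1.0 + w) * score_fn(x, t, c) - w * score_fn(x, t, NULL)

def toy_score(x, t, c):
    """Toy stand-in for s_theta: score of N(2, 1) when conditioned, N(0, 1) otherwise."""
    mean = 0.0 if c == NULL else 2.0
    return -(x - mean)

rng = np.random.default_rng(0)
labels = drop_condition(np.ones(10_000, dtype=int), p_drop=0.1, rng=rng)
print((labels == NULL).mean())                     # close to 0.1

x = np.linspace(-1.0, 6.0, 8)                      # the points -1, 0, ..., 6
print(guided_score(toy_score, x, 0.5, 1, w=0.0))   # identical to the conditional score
print(guided_score(toy_score, x, 0.5, 1, w=1.5))   # vanishes at x = 5, not at x = 2
```

In the toy example the guided score points toward x = 2(1 + w) rather than the conditional mean 2, which illustrates how guidance overshoots the conditional distribution in exchange for stronger adherence to c.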

10.7 Latent diffusion

For high-resolution images, diffusing in pixel space is very expensive. Latent diffusion instead trains an autoencoder to compress images to a smaller latent space (e.g., 64 \times 64 latents for 512 \times 512 images, as in the original Stable Diffusion), and runs the diffusion model entirely in latent space. This is the architecture of Stable Diffusion and most production text-to-image systems.
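
The savings are easy to quantify with a little arithmetic. The sketch below assumes an autoencoder that downsamples by a factor of 8 per side into 4 latent channels, which is the configuration of the original Stable Diffusion release; other systems use different factors and channel counts, and the helper names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, h, w) of the latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return latent_channels, height // downsample, width // downsample

def compression_ratio(height, width, image_channels=3, **kwargs):
    """How many pixel-space values there are per latent-space value."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)

print(latent_shape(512, 512))        # (4, 64, 64)
print(compression_ratio(512, 512))   # 48.0
```

Under these assumptions the diffusion model operates on 48 times fewer values than it would in pixel space.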

10.8 Why physicists should care

Three reasons beyond cultural affinity:

The math is the cleanest in modern generative modeling: well-defined SDEs, simple regression losses, no adversarial training, and little of the mode collapse seen in GANs.
The connection to non-equilibrium statistical mechanics is real: the original DDPM paper built on the earlier work of Sohl-Dickstein et al. (2015), which was framed explicitly in terms of non-equilibrium thermodynamics.
Diffusion-style models are spreading beyond images: video, audio, 3D structures, molecules, protein backbones, materials design. Anywhere there is a continuous data manifold and you want to sample from it, this is currently the dominant approach.