Theoretical Underpinnings of Diffusion Models

(T) Last month, I attended a lecture at the Statistics Department of Stanford University by Professor Andrea Montanari about his research on high-dimensional statistics with Ahmed El Alaoui, Mark Sellke, and Yuchen Wu (all of whom studied with Professor Montanari).

Montanari, El Alaoui, Sellke, and Wu have been researching a new class of algorithms for mean-field spin glass models such as the Sherrington-Kirkpatrick model. Those algorithms are closely related to denoising diffusion methods in machine learning and to the stochastic localization technique in probability theory.

What follows are, first, some preliminary materials reviewing diffusion models, stochastic localization, and mean-field spin glass models, and second, the key papers, presentations, and videos that outline that research.

But first, let us start with a short historical background.


Historical background:

“The Sherrington-Kirkpatrick (SK) model was introduced by David Sherrington and Scott Kirkpatrick in 1975 as a simple ‘solvable’ (in their words) model for spin glasses. Spin glasses are a type of magnetic alloy, and ‘solvable’ meant that the asymptotic free entropy density could be computed exactly. It turns out that the original SK solution was incorrect and in fact inconsistent (the authors knew this). A consistent conjecture for the asymptotic free energy per spin was put forward by Giorgio Parisi in 1982 and derived through the non-rigorous replica method.

It took 24 years to prove this conjecture. The final proof is due to Michel Talagrand in 2006 and is a real tour de force.” (from Professor Montanari’s class notes on stochastic processes)

In 2011, Professor Dmitry Panchenko of the University of Toronto gave a more general proof of the Parisi formula than Talagrand’s.

In 2021, Parisi was awarded the Nobel Prize in Physics “for the discovery of the interplay of disorder and fluctuations in physical systems from atomic to planetary scales”.

Preliminary materials:

Denoising diffusion models in machine learning:

Denoising diffusion models are generative models that have been used in particular to generate high-resolution images. They are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps that slowly adds random Gaussian noise to the original data, and then learn to reverse the diffusion process, i.e., to remove the noise, in order to construct the desired data samples.
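
To make this concrete, here is a minimal NumPy sketch of a DDPM-style forward noising step and the corresponding reverse (denoising) step. It is my own illustration, not code from the papers discussed below; the linear noise schedule, the number of steps, and the `predict_noise` function standing in for a trained neural network are all illustrative assumptions.

```python
import numpy as np

# Illustrative linear noise schedule (an assumption, not from the papers discussed here).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noise(x0, t, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

def reverse_step(x_t, t, predict_noise, rng):
    """One ancestral sampling step approximating p(x_{t-1} | x_t).

    `predict_noise(x_t, t)` is a placeholder for a trained denoising network.
    """
    eps_hat = predict_noise(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z
```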

Stochastic localization in statistics:

“Proposed by Professor Ronen Eldan, stochastic localization is a way of decomposing a probability measure on a high-dimensional space into a mixture of simpler measures in a “Markovian” way. These simpler measures are “localized” in the sense that they are concentrated on smaller (random) subsets of the original space. The decomposition turns out to be useful in understanding the original measure.”
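
In the linear-Gaussian form most relevant to the papers below, the stochastic localization process attached to a measure μ can be written as a tilt of μ along a Gaussian observation path. The following is a minimal sketch of that standard construction, in my own notation rather than a quote from the papers:

```latex
% Observation process: a sample x* ~ mu is revealed through Gaussian noise
% (W_t is a standard Brownian motion, independent of x*).
\[
  Y_t = t\, x^{*} + W_t , \qquad t \ge 0 .
\]
% Localized (tilted) measure: the conditional law of x* given Y_t, which
% interpolates from mu at t = 0 to a point mass as t -> infinity.
\[
  \mu_t(\mathrm{d}x) \;\propto\; \exp\!\Big( \langle Y_t , x \rangle - \tfrac{t}{2}\,\|x\|_2^2 \Big)\, \mu(\mathrm{d}x) .
\]
```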

Mean-field spin glass techniques in physics and in mathematics:

Spin glass models are high-dimensional probability distributions that were introduced by physicists in the 1970s to model the statistical properties of certain magnetic materials.
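
For concreteness, the Sherrington-Kirkpatrick model discussed throughout can be written as follows; this is the standard formulation, with notation chosen by me rather than quoted from the papers:

```latex
% Sherrington-Kirkpatrick Hamiltonian: the g_{ij} are i.i.d. standard Gaussian couplings.
\[
  H_n(x) \;=\; \frac{1}{\sqrt{n}} \sum_{1 \le i < j \le n} g_{ij}\, x_i x_j ,
  \qquad x \in \{-1,+1\}^n .
\]
% Gibbs measure at inverse temperature beta; Z_n(beta) is the partition function,
% and (1/n) log Z_n(beta) is the free energy (free entropy) per spin.
\[
  \mu_\beta(x) \;=\; \frac{1}{Z_n(\beta)} \exp\big( \beta\, H_n(x) \big) .
\]
```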

These models are actively studied by Professor Montanari’s research group:

  • Video: Ahmed El Alaoui, Methods from statistical physics [Lecture I, Lecture II, Lecture III]
  • Video: Mark Sellke, Algorithmic spin glass theory
  • Video: Ahmed El Alaoui, Optimization of mean-field spin glass Hamiltonians

Key papers:

Sampling from the Sherrington-Kirkpatrick Gibbs measure via algorithmic stochastic localization:

“We consider the Sherrington-Kirkpatrick model of spin glasses at high-temperature and no external field, and study the problem of sampling from the Gibbs distribution μ in polynomial time. We prove that, for any inverse temperature β < 1/2, there exists an algorithm with complexity O(n²) that samples from a distribution μ_alg which is close in normalized Wasserstein distance to μ. Namely, there exists a coupling of μ and μ_alg such that if (x, x_alg) ∈ {−1,+1}ⁿ × {−1,+1}ⁿ is a pair drawn from this coupling, then n⁻¹ 𝔼{‖x − x_alg‖₂²} = o_n(1). The best previous results, by Bauerschmidt and Bodineau and by Eldan, Koehler, and Zeitouni, implied efficient algorithms to approximately sample (under a stronger metric) for β < 1/4.
We complement this result with a negative one, by introducing a suitable “stability” property for sampling algorithms, which is verified by many standard techniques. We prove that no stable algorithm can approximately sample for β>1, even under the normalized Wasserstein metric.
Our sampling method is based on an algorithmic implementation of stochastic localization, which progressively tilts the measure μ towards a single configuration, together with an approximate message passing algorithm that is used to approximate the mean of the tilted measure.”
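
A rough sketch of the algorithmic idea, in my own schematic rendering rather than code from the paper: discretize the observation/localization process with an Euler scheme and, at each step, call an oracle for the mean of the tilted measure (which the paper implements with approximate message passing). Here `tilted_mean` is a hypothetical placeholder for that oracle, and the time horizon, step count, and final rounding are simplifying assumptions.

```python
import numpy as np

def sample_via_stochastic_localization(tilted_mean, n, T=10.0, n_steps=1000, seed=0):
    """Schematic Euler discretization of an algorithmic stochastic localization sampler.

    `tilted_mean(y, t)` is assumed to return (an approximation of) the mean of the
    measure mu tilted by the observation y at time t; in the paper that role is
    played by an approximate message passing (AMP) routine.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    y = np.zeros(n)                       # observation process starts at 0
    for k in range(n_steps):
        t = k * dt
        m = tilted_mean(y, t)             # mean of the tilted measure (oracle / AMP)
        dw = np.sqrt(dt) * rng.standard_normal(n)
        y = y + m * dt + dw               # Euler step: dY_t = m(Y_t, t) dt + dW_t
    # Final rounding: sample each coordinate from the (nearly polarized) tilted means.
    p_plus = (1.0 + np.clip(tilted_mean(y, T), -1.0, 1.0)) / 2.0
    return np.where(rng.random(n) < p_plus, 1.0, -1.0)
```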

Posterior Sampling from the Spiked Models via Diffusion Processes:

“Sampling from the posterior is a key technical problem in Bayesian statistics. Rigorous guarantees are difficult to obtain for Markov Chain Monte Carlo algorithms of common use. In this paper, we study an alternative class of algorithms based on diffusion processes. The diffusion is constructed in such a way that, at its final time, it approximates the target posterior distribution. The stochastic differential equation that defines this process is discretized (using an Euler scheme) to provide an efficient sampling algorithm. Our construction of the diffusion is based on the notion of observation process and the related idea of stochastic localization. Namely, the diffusion process describes a sample that is conditioned on increasing information. An overlapping family of processes was derived in the machine learning literature via time-reversal.
We apply this method to posterior sampling in the high-dimensional symmetric spiked model. We observe a rank-one matrix θθ𝖳 corrupted by Gaussian noise, and want to sample θ from the posterior. Our sampling algorithm makes use of an oracle that computes the posterior expectation of θ given the data and the additional observation process. We provide an efficient implementation of this oracle using approximate message passing. We thus develop the first sampling algorithm for this problem with approximation guarantees.”
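
For reference, the symmetric rank-one spiked model referred to here is usually written as follows; this is a common normalization, and the exact scaling used in the paper may differ:

```latex
% Symmetric spiked (rank-one) matrix model: a rank-one signal observed in GOE-type noise.
% lambda is the signal-to-noise ratio; W is symmetric with independent N(0, 1/n) entries
% above the diagonal. The goal is to sample theta from the posterior given Y.
\[
  Y \;=\; \frac{\lambda}{n}\, \theta\, \theta^{\mathsf{T}} \;+\; W ,
  \qquad \theta \in \mathbb{R}^{n}, \quad \theta_i \stackrel{\text{i.i.d.}}{\sim} \pi .
\]
```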

Sampling, Diffusions, and Stochastic Localization:

“Diffusions are a successful technique to sample from high-dimensional distributions that can be either explicitly given or learnt from a collection of samples. They implement a diffusion process whose endpoint is a sample from the target distribution and whose drift is typically represented as a neural network. Stochastic localization is a successful technique to prove mixing of Markov Chains and other functional inequalities in high dimension. An algorithmic version of stochastic localization was introduced in [EAMS2022], to obtain an algorithm that samples from certain statistical mechanics models.
These notes have three objectives: (i) Generalize the construction [EAMS2022] to other stochastic localization processes; (ii) Clarify the connection between diffusions and stochastic localization. In particular, we show that standard denoising diffusions are stochastic localizations, but other examples are naturally suggested by the proposed viewpoint; (iii) Describe some insights that follow from this viewpoint.”
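
The connection can be summarized as follows, in my own schematic notation rather than a quote from the notes: observing the forward denoising-diffusion process at time t carries the same information about the original sample as observing the linear stochastic localization process at a suitably reparametrized time.

```latex
% Variance-preserving forward diffusion: X_t = alpha_t X_0 + sqrt(1 - alpha_t^2) Z,
% with Z ~ N(0, I) and alpha_t decreasing from 1 to 0 (e.g. alpha_t = e^{-t}).
% Linear observation process of stochastic localization: Y_s = s X_0 + W_s.
% Observing X_t is informationally equivalent to observing Y_{s(t)} with
\[
  s(t) \;=\; \frac{\alpha_t^{2}}{\,1-\alpha_t^{2}\,} ,
\]
% since X_t / alpha_t = X_0 + (sqrt(1 - alpha_t^2) / alpha_t) Z and Y_s / s = X_0 + W_s / s
% are both equal to X_0 plus Gaussian noise of covariance I / s(t).
```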

Sampling from Mean-Field Gibbs Measures via Diffusion Processes:

“We consider Ising mixed p-spin glasses at high-temperature and without external field, and study the problem of sampling from the Gibbs distribution μ in polynomial time. We develop a new sampling algorithm with complexity of the same order as evaluating the gradient of the Hamiltonian and, in particular, at most linear in the input size. We prove that, at sufficiently high-temperature, it produces samples from a distribution μ_alg which is close in normalized Wasserstein distance to μ. Namely, there exists a coupling of μ and μ_alg such that if (x, x_alg) ∈ {−1,+1}ⁿ × {−1,+1}ⁿ is a pair drawn from this coupling, then n⁻¹ 𝔼{‖x − x_alg‖₂²} = o_n(1). For the case of the Sherrington-Kirkpatrick model, our algorithm succeeds in the full replica-symmetric phase.
We complement this result with a negative one for sampling algorithms satisfying a certain ‘stability’ property, which is verified by many standard techniques.
No stable algorithm can approximately sample at temperatures below the onset of shattering, even under the normalized Wasserstein metric. Further, no algorithm can sample at temperatures below the onset of replica symmetry breaking.
Our sampling method implements a discretized version of a diffusion process that has recently become popular in machine learning under the name of ‘denoising diffusion.’ We derive the same process from the general construction of stochastic localization. Implementing the diffusion process requires efficiently approximating the mean of the tilted measure. To this end, we use an approximate message passing algorithm that, as we prove, achieves sufficiently accurate mean estimation.”
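
To illustrate the kind of mean-estimation subroutine involved, here is a schematic TAP/AMP-style iteration for the mean of an SK-type tilted measure. The normalization of the coupling matrix, the form of the Onsager correction, the initialization, and the fixed iteration count are all simplifying assumptions on my part; this is not the calibrated AMP iteration analyzed in the paper.

```python
import numpy as np

def amp_tilted_mean(A, y, beta, n_iter=50):
    """Schematic AMP/TAP-style iteration for the mean of a tilted SK Gibbs measure.

    A    : symmetric n x n coupling matrix (e.g. G / sqrt(n) with Gaussian G),
    y    : external field / observation vector tilting the measure,
    beta : inverse temperature.
    Illustrative sketch only, not the calibrated algorithm of the paper.
    """
    n = A.shape[0]
    m_prev = np.zeros(n)
    m = np.tanh(y)                        # initialization from the tilting field alone
    for _ in range(n_iter):
        q = np.mean(m ** 2)               # empirical overlap of the current estimate
        # TAP-style update with an Onsager reaction term applied to the previous iterate.
        field = y + beta * (A @ m) - beta ** 2 * (1.0 - q) * m_prev
        m_prev, m = m, np.tanh(field)
    return m
```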

Note: The picture above is Rennes in France.

Copyright © 2005-2024 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.