This tutorial on generative modeling is part of the Statistical Machine Learning Tutorial by Ying Nian Wu at UCLA Statistics. The tutorial goes over the key equations and algorithms for learning recent generative models, including energy-based models, diffusion/score-based models, autoregressive/flow-based models, VAEs, and GANs, and explains the connections between these models. In contrast to most current tutorials, which approach generative modeling from the perspective of machine learning, this tutorial provides a more basic and natural perspective from statistics. Starting from very basic probability background, the tutorial is extremely learner-friendly.
- Highlights & Significance
- Prerequisite: Probability Density
- The Core Problem: Density Estimation
- Energy-based Model
- Sampling Process for Learning EBM
- Diffusion/score-based Models
- Bibliography
- Author and Citation Info
The tutorial connects different families of generative models from multiple perspectives---original formulation, the essence of sampling process, and variational formulation.
Sampling from a high-modality distribution (one with many modes) is extremely hard. The diffusion model factorizes the problem of sampling from such a distribution into a thousand small incremental steps, making the sampling much more tractable. The VAE follows an elegant formulation that tries to sample the data distribution in a single trial; however, the estimated aggregated posterior may mismatch the prior. The GAN also suffers from the single-trial limitation, but uses a discriminator to guide the generation.
Dr. Wu introduces an analogy to golf for understanding the relations between the generative models. In this perspective, the model expressivity, the sampling process, and the data modality are analogous to the number of balls, the number of strokes, and the number of holes, respectively: more balls make it possible to cover more holes, and more strokes mean sending a ball to a hole with more patience. Readers may use this analogy to understand the pros and cons of different generative models. Also see the relation between generative models from the perspective of the diffusion model.
A visualization of the golf analogy
As long as you can count, you understand everything about probability density.
Consider a cloud of n example data points in the 2-D space. The probability density tells you how the points are distributed. As the number of data points can become extremely large (
A visual demonstration of probability density
To analyze the continuous density, we can discretize the space into
Considering the number of data points in a cell (shaded area), we have:
On this basis, we have the joint density:
Given this most important concept, we can work on the three elementary operations: marginalization, conditioning, and factorization.
Calculating
Calculating
On the basis of conditioning and marginalization, we have the factorization operation:
The expectation
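The counting view of density, together with marginalization, conditioning, factorization, and expectation, can be checked numerically. The following is a minimal sketch, not the tutorial's own code: the correlated Gaussian point cloud, the grid range, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical 2-D point cloud (assumption): correlated Gaussian samples.
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
points = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

# Discretize the plane into cells of size dx * dy and count points per cell.
edges = np.linspace(-5.0, 5.0, 51)  # 50 cells per axis
dx = dy = edges[1] - edges[0]
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=[edges, edges])

# Joint density: fraction of points in a cell divided by the cell area.
p_xy = counts / n / (dx * dy)

# Marginalization: sum the joint density over y (times dy).
p_x = p_xy.sum(axis=1) * dy

# Conditioning: p(y | x) = p(x, y) / p(x), set to 0 where p(x) = 0.
p_y_given_x = np.divide(
    p_xy, p_x[:, None], out=np.zeros_like(p_xy), where=p_x[:, None] > 0
)

# Factorization: p(x, y) = p(x) * p(y | x).
p_xy_rebuilt = p_x[:, None] * p_y_given_x

# Expectation of h(x, y) = x * y, approximated by a sum over cell centers.
centers = 0.5 * (edges[:-1] + edges[1:])
expect_xy = np.sum(centers[:, None] * centers[None, :] * p_xy) * dx * dy

print("total mass:", p_xy.sum() * dx * dy)
print("E[x * y] from counting:", expect_xy)
```

With enough points and small enough cells, the total mass should be close to 1 and the counted expectation should be close to the covariance used to generate the cloud (0.6 in this sketch).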
The gold standard for density estimation is Maximum Likelihood Estimation.
The real world is not about counting points in a 2-D space. In fact, most data lives in a high-dimensional space, and the number of examples
Given the problem, what could we do? The intuitive way is to estimate a function that captures the properties of such a probability density. We have to parametrize the probability density and try to learn the underlying parameters from finite examples. We hope the density learned from the finite examples can generalize to infinite examples.
Maximum Likelihood Estimation (MLE) is the basic idea to estimate a density function.
Given finite
Now we come to the core of the MLE---defining the log-likelihood function:
Our objective is to maximize the log-likelihood function, that
Kullback-Leibler Divergence (KL-Divergence) measures the difference between two distributions
Trivially, if we are trying to maximize
Since we are calculating the expectation over
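The link between maximum likelihood and the KL-divergence can be written out in a few lines. The following is a sketch in assumed notation ($p_{\text{data}}$ for the data distribution, $p_\theta$ for the model, $x_i$ for the $n$ observed examples), which may differ from the symbols used elsewhere in this tutorial:

$$
\begin{aligned}
D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)
&= \mathbb{E}_{p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right] \\
&= \mathbb{E}_{p_{\text{data}}}\left[\log p_{\text{data}}(x)\right] - \mathbb{E}_{p_{\text{data}}}\left[\log p_\theta(x)\right], \\
\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)
&\xrightarrow{\,n \to \infty\,} \mathbb{E}_{p_{\text{data}}}\left[\log p_\theta(x)\right].
\end{aligned}
$$

The first term on the second line does not depend on $\theta$, so maximizing the average log-likelihood is, in the large-sample limit, the same as minimizing $D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta)$.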
Energy-based Model (EBM) is the most basic generative model.
The target density
We can calculate the derivative over
where we get an important property that
Bringing the EBM formulation into the MLE formulation, we have:
and the derivative of
However, computing the expectation is extremely hard. We have to use Monte Carlo sampling to draw examples from the estimated density. The goal is to match the average of observed examples and the average of synthesized examples. The main problem of learning an EBM is that sampling from the model density is highly non-trivial.
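The statement about matching observed and synthesized averages follows from a short derivation. The sketch below assumes the common parametrization $p_\theta(x) = \exp(f_\theta(x)) / Z(\theta)$; the symbols $f_\theta$, $Z(\theta)$, and $L(\theta)$ are assumptions and may differ from the tutorial's own notation:

$$
\begin{aligned}
Z(\theta) &= \int \exp(f_\theta(x))\, dx, \\
\frac{\partial}{\partial \theta} \log Z(\theta)
&= \frac{1}{Z(\theta)} \int \exp(f_\theta(x))\, \frac{\partial f_\theta(x)}{\partial \theta}\, dx
= \mathbb{E}_{p_\theta}\left[\frac{\partial f_\theta(x)}{\partial \theta}\right], \\
L(\theta) &= \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)
= \frac{1}{n}\sum_{i=1}^{n} f_\theta(x_i) - \log Z(\theta), \\
\frac{\partial L(\theta)}{\partial \theta}
&= \frac{1}{n}\sum_{i=1}^{n} \frac{\partial f_\theta(x_i)}{\partial \theta}
- \mathbb{E}_{p_\theta}\left[\frac{\partial f_\theta(x)}{\partial \theta}\right].
\end{aligned}
$$

The gradient vanishes when the average of $\partial f_\theta / \partial \theta$ over the observed examples equals its average over examples synthesized from the model, which is why sampling from $p_\theta$ is needed at every learning step.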
Following the KL-Divergence perspective, we can also interpret the Monte-Carlo Sampling process for EBM in a similar way: consider the model in
and the
A visual demonstration of contrastive divergence
If we treat the current model
The mode chasing behavior in W-GAN
Noise Contrastive Estimation (NCE) introduces a reference distribution
To learn the model, a more practical way is to view the problem from an adversarial perspective. If we draw
Small steps get us to faraway places. 不積跬步,無以至千里。
As mentioned above, the core problem of generative modeling is estimating the model density. In this section, we start by reviewing a commonly used sampling method, Langevin Dynamics, a special case of Markov-Chain Monte-Carlo (MCMC) Sampling.
We cannot sample from the model density all at once. Hence, we use Langevin Dynamics to sample in steps along the time axis. Here we denote the target distribution
We discretize the time axis and each
Considering the case where the time step becomes very small, i.e.,
A more sophisticated version of MCMC Sampling is Hamiltonian Monte-Carlo (HMC), which adds a momentum to smooth the trajectory.
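The Langevin update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: it uses the common discretization $x_{t+\Delta t} = x_t + \frac{\Delta t}{2} \nabla_x \log p(x_t) + \sqrt{\Delta t}\, e_t$ with $e_t \sim N(0, I)$, a 1-D Gaussian target whose score is known in closed form, and hypothetical names (`langevin_sample`, `gaussian_score`) that do not come from the tutorial.

```python
import numpy as np

def langevin_sample(score, x0, step, n_steps, rng):
    """Run discretized Langevin dynamics on an array of parallel chains."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Gradient ascent on the log-density plus a random perturbation.
        x = x + 0.5 * step * score(x) + np.sqrt(step) * noise
    return x

# Assumed toy target: N(mu, sigma^2), with score -(x - mu) / sigma^2.
mu, sigma = 2.0, 0.5

def gaussian_score(x):
    return -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x_init = np.full(5000, -3.0)  # all chains start far from the target
samples = langevin_sample(gaussian_score, x_init, step=0.01, n_steps=2000, rng=rng)

print("sample mean:", samples.mean())
print("sample std:", samples.std())
```

After enough steps the chains forget their starting point, and the sample mean and standard deviation should be close to `mu` and `sigma`. The finite step size introduces a small bias, which shrinks as `step` goes to zero.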
An interesting observation in Langevin Dynamics is: once
How does this come about? Let us look back into the updating equation of Langevin Dynamics, into the two terms---the gradient ascent term
Explaining Langevin Dynamics with equilibrium sampling: (1) gradient ascent as squeezing; (2) random perturbation as diffusion
To analyze the phenomenon mathematically, we may look into the Taylor expansion of the test function
The derivation for the first-order Taylor expansion in gradient ascent is as follows:
This derivation shows that the remainder of the Taylor expansion of gradient ascent is negative.
The derivation for the second-order Taylor expansion in diffusion is as follows:
This derivation shows that the remainder of the Taylor expansion of diffusion is positive.
On the basis of equilibrium sampling, we are introducing score-based/diffusion models.
Though coming with the merit of equilibrium sampling, Langevin Dynamics suffers from very slow convergence, especially when the model density has a lot of localized modes (high modalities). To address this problem, we introduce a temperature parameter
Imagine you are playing golf. You can see exactly where the hole $x_0$ is. But you want to use a thousand strokes to get back to the hole. You do not want to shoot back in one stroke because the chance that you hit the hole is very small. Rather, you see where the hole is, and you go toward the hole in small steps.
Unlike the EBM, where we directly target the log-density, the diffusion model essentially learns a sampling process. The diffusion model decomposes sampling from the density into a large number of very small incremental steps.
The forward process of the diffusion model gradually adds noise to a clean image until it becomes Gaussian noise, using non-equilibrium sampling.
Let
We can also look into the Taylor expansion of the test function
A visual demonstration of the forward time in non-equilibrium sampling
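The forward noising process can be simulated directly. The sketch below assumes a variance-exploding forward step $x_{t+\Delta t} = x_t + \sigma\sqrt{\Delta t}\, e_t$ with $e_t \sim N(0, I)$, which may differ in detail from the parametrization used in this tutorial; the two-mode 1-D "data" and all variable names are also assumptions. It compares adding noise in many small steps with adding the accumulated noise in one shot.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n_steps = 1.0, 4.0, 400
dt = T / n_steps

# Hypothetical clean data (assumption): a two-mode 1-D distribution.
x0 = rng.choice([-1.0, 1.0], size=20_000) + 0.1 * rng.standard_normal(20_000)

# Step-by-step forward process: add a little Gaussian noise at every step.
x = x0.copy()
for _ in range(n_steps):
    x = x + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
x_T_stepwise = x

# One-shot equivalent: the small noises add up to N(0, sigma^2 * T).
x_T_direct = x0 + sigma * np.sqrt(T) * rng.standard_normal(x0.shape)

print("var(x0):", x0.var())
print("var(x_T), stepwise:", x_T_stepwise.var())
print("var(x_T), direct:", x_T_direct.var())
```

Both versions of `x_T` should have variance close to `var(x0) + sigma**2 * T`. As `T` grows, the two modes of the data are washed out and the samples look increasingly like a single wide Gaussian.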
After changing the clean image into white noise, now we are trying to walk back.
We only need to reverse the deterministic step, from noise level
Similar to Langevin Dynamics, we have two variants of reverse updating when the time steps become very small (i.e.,
If we only consider the deterministic step, we have an Ordinary Differential Equation (ODE):
A visual demonstration of the reverse time in non-equilibrium sampling (ODE)
If we consider the random step, we have a Stochastic Differential Equation (SDE):
A visual demonstration of the reverse time in non-equilibrium sampling (SDE)
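Both reverse-time variants can be tried on a toy problem where the score is known exactly. The sketch below is an illustration under assumptions, not the tutorial's implementation: it assumes the forward step $x_{t+\Delta t} = x_t + \sigma\sqrt{\Delta t}\, e_t$, 1-D Gaussian data $N(m, s^2)$ so that the noisy marginal is $p_t = N(m, s^2 + \sigma^2 t)$, the reverse ODE step $x_{t-\Delta t} = x_t + \frac{\sigma^2 \Delta t}{2} \nabla \log p_t(x_t)$, and the reverse SDE step $x_{t-\Delta t} = x_t + \sigma^2 \Delta t\, \nabla \log p_t(x_t) + \sigma\sqrt{\Delta t}\, e_t$. The function names are hypothetical.

```python
import numpy as np

m, s = 1.0, 0.5  # assumed data distribution N(m, s^2)
sigma, T, n_steps = 1.0, 1.0, 1000
dt = T / n_steps

def score(x, t):
    """Exact score of the noisy marginal p_t = N(m, s^2 + sigma^2 * t)."""
    return -(x - m) / (s**2 + sigma**2 * t)

def reverse(x_T, rng=None):
    """Walk back from t = T to t = 0; ODE if rng is None, SDE otherwise."""
    x = x_T.copy()
    for k in range(n_steps, 0, -1):
        t = k * dt
        if rng is None:
            # Deterministic step only.
            x = x + 0.5 * sigma**2 * dt * score(x, t)
        else:
            # Deterministic step plus random step.
            noise = rng.standard_normal(x.shape)
            x = x + sigma**2 * dt * score(x, t) + sigma * np.sqrt(dt) * noise
    return x

rng = np.random.default_rng(0)
# Start from the noisy marginal at time T.
x_T = m + np.sqrt(s**2 + sigma**2 * T) * rng.standard_normal(5000)

x0_ode = reverse(x_T)
x0_sde = reverse(x_T, rng)

print("ODE: mean, std =", x0_ode.mean(), x0_ode.std())
print("SDE: mean, std =", x0_sde.mean(), x0_sde.std())
```

Both variants should return samples with mean close to `m` and standard deviation close to `s`. In practice the exact `score` is unavailable and is replaced by a learned network.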
Here we go into the core problem of learning a diffusion model: how do we estimate the score
From the clean image
From line 1 to line 2 we integrate over
On the basis of the Vincent Identity, we have:
U-Net: encoding the noisy version of the image to decode the clean version of the image
Under this implementation, we take relatively big steps to estimate noise:
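As a sanity check of the denoising idea behind the Vincent identity, the following toy sketch regresses the score of the noising kernel on the noisy sample and compares the result with the true score of the noisy marginal. Everything here is an assumption made for illustration: 1-D Gaussian data $x_0 \sim N(0, s^2)$, a single noise level $\sigma$, and a hypothetical one-parameter linear score model $s_w(x) = w x$ standing in for the U-Net.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, sigma = 200_000, 1.0, 0.5

x0 = s * rng.standard_normal(n)  # clean samples, x0 ~ N(0, s^2)
eps = rng.standard_normal(n)
x = x0 + sigma * eps             # noisy samples

# Denoising target: score of the Gaussian noising kernel q(x | x0),
# which equals -(x - x0) / sigma^2 = -eps / sigma.
target = -(x - x0) / sigma**2

# Hypothetical one-parameter score model s_w(x) = w * x,
# fitted by least squares against the denoising target.
w = np.sum(x * target) / np.sum(x * x)

# True score of the noisy marginal N(0, s^2 + sigma^2) is true_w * x.
true_w = -1.0 / (s**2 + sigma**2)

print("fitted w:", w)
print("true w:", true_w)
```

Although the regression target only uses the pair $(x_0, x)$ and never the marginal density, the fitted coefficient should land close to the coefficient of the true marginal score, which is the point of the identity.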
We can alternatively reformulate the score-based methods in a variational way. The forward process from
The derivation starts from applying Bayes' rule to obtain
Hence, this variational formulation reduces the extremely hard problem of estimating the conditional distribution to estimating a very simple Gaussian distribution.
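To see why the reverse conditional becomes approximately Gaussian for small steps, here is a sketch of the argument. It assumes the forward step $x_t = x_{t-\Delta t} + \sigma\sqrt{\Delta t}\, e_t$ with $e_t \sim N(0, I)$ and writes $p_t$ for the marginal density at time $t$; this notation is an assumption and may differ from the tutorial's own symbols.

$$
\begin{aligned}
\log q(x_{t-\Delta t} \mid x_t)
&= \log p_{t-\Delta t}(x_{t-\Delta t}) - \frac{\|x_t - x_{t-\Delta t}\|^2}{2\sigma^2 \Delta t} + \text{const} \\
&\approx \nabla_x \log p_t(x_t)^\top (x_{t-\Delta t} - x_t) - \frac{\|x_{t-\Delta t} - x_t\|^2}{2\sigma^2 \Delta t} + \text{const} \\
&= -\frac{\left\|x_{t-\Delta t} - \left(x_t + \sigma^2 \Delta t\, \nabla_x \log p_t(x_t)\right)\right\|^2}{2\sigma^2 \Delta t} + \text{const}.
\end{aligned}
$$

The first line is Bayes' rule, the second is a first-order Taylor expansion of $\log p_{t-\Delta t}$ around $x_t$ (terms that do not depend on $x_{t-\Delta t}$ are absorbed into the constant), and the third completes the square. Under these assumptions, $q(x_{t-\Delta t} \mid x_t) \approx N\!\left(x_t + \sigma^2 \Delta t\, \nabla_x \log p_t(x_t),\ \sigma^2 \Delta t\, I\right)$, so only the mean has to be learned.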
Recall our gold standard, MLE---as we have obtained the conditional distribution, naturally we can formulate the variational form in KL-Divergence:
In the reverse process, we can execute noise reduction by decomposing the KL-Divergence:
The diffusion model can be viewed as an autoregressive model in the time domain, which reverses time, going from white noise
The diffusion model can be viewed as a flow-based model. The flow-based model starts from white noise
The diffusion model can be viewed as a refined version of the Variational Auto-Encoder (VAE). The VAE starts from white noise
VAE estimates
A visualization of the relation between different generative models from the perspective of Diffusion model
The following animations show pairs of counterparts that need to be distinguished.
Mode covering vs. mode chasing
Contrastive divergence vs. EM algorithm
Equilibrium sampling vs. non-equilibrium sampling
- Divergence Triangle for Joint Training of Generator Model, Energy-based Model, and Inference Model - CVPR'19, 2019.
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics - ICML'15, 2015.
- Denoising Diffusion Probabilistic Models - NeurIPS'20, 2020.
- Score-Based Generative Modeling through Stochastic Differential Equations - ICLR'21, 2021.
This tutorial entry is composed by Yu-Zhe Shi under the supervision of Dr. Ying Nian Wu.
Dr. Ying Nian Wu is currently a professor in the Department of Statistics, UCLA. He received his A.M. and Ph.D. degrees in statistics from Harvard University in 1994 and 1996, respectively. He was an assistant professor in the Department of Statistics, University of Michigan from 1997 to 1999. He joined UCLA in 1999 and has been a full professor since 2006. Wu's research areas include generative modeling, representation learning, computer vision, computational neuroscience, and bioinformatics.
@InCollection{shi2022generative,
author = {Shi, Yu-Zhe and Wu, Ying Nian},
title = {{Generative Modeling Explained}},
booktitle = {Statistical Machine Learning Tutorials},
howpublished = {\url{https://github.com/YuzheSHI/generative-modeling-explained}},
year = {2022},
edition = {{S}ummer 2022},
publisher = {Department of Statistics, UCLA}
}
The authors thank Dr. Yixin Zhu for helpful suggestions, Zhangqian Bi for debugging the markdown maths renderer, and Ms. Zhen Chen for helping design the animations.