Generative Modeling by Estimating Gradients of the Data Distribution

score function generative modeling

I’ll refer to the ICLR “Diffusion explained” blog post for the motivation behind score-based modeling

The idea is that the objective we want to minimize is the Fisher divergence between the true score and our model, and score matching gives us ways of minimizing it without access to the true score values
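Writing it out (my reconstruction of the formulas, in the blog’s notation):

```latex
% Fisher divergence between the true score and the model s_theta
D_F = \tfrac{1}{2}\,\mathbb{E}_{p(\mathbf{x})}\!\left[\left\| \nabla_{\mathbf{x}} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \right\|_2^2\right]

% score matching (integration by parts) rewrites this, up to a constant,
% as an objective with no reference to the unknown true score:
\mathbb{E}_{p(\mathbf{x})}\!\left[\operatorname{tr}\!\big(\nabla_{\mathbf{x}} \mathbf{s}_\theta(\mathbf{x})\big) + \tfrac{1}{2}\left\| \mathbf{s}_\theta(\mathbf{x}) \right\|_2^2\right] + \text{const}
```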

The problem is that the learned score function might be inaccurate in regions of low density

→ this motivates perturbing the data with noise, and training score-based models on the noised data instead, which has fewer low-density regions

  • this trades off corrupting the data against getting a better score estimate

Idea: use many different noise amounts, and train a noise-conditional score-based model to handle all of them: minimize the weighted sum of the Fisher divergences for the different noise scales
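Concretely (my sketch of the setup): perturb the data with Gaussian noise at scales σ₁ < … < σ_L, and fit one model sθ(x, σ) across all of them:

```latex
% perturbed densities
p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y})\, \mathcal{N}(\mathbf{x};\, \mathbf{y},\, \sigma_i^2 I)\, d\mathbf{y}

% weighted sum of Fisher divergences over noise scales
\min_\theta \sum_{i=1}^{L} \lambda(\sigma_i)\,
  \mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\!\left[\left\| \nabla_{\mathbf{x}} \log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, \sigma_i) \right\|_2^2\right]

% denoising score matching makes each term tractable: for Gaussian noise,
% \nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y}) = -(\mathbf{x} - \mathbf{y})/\sigma_i^2
```

a common weighting choice is λ(σᵢ) = σᵢ².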

then to actually sample, use annealed Langevin dynamics: run Langevin dynamics with the learned scores, annealing through decreasing noise scales
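a minimal numpy sketch of annealed Langevin dynamics — all names here are mine, and the “score model” is an analytic Gaussian score standing in for a trained sθ (data ~ N(μ, I), so the perturbed score is known in closed form); the step-size schedule αᵢ = ε·σᵢ²/σ_L² follows Song & Ermon:

```python
import numpy as np

def gaussian_score(x, sigma):
    # Toy stand-in for a learned s_theta(x, sigma): data ~ N(mu, I),
    # so p_sigma = N(mu, (1 + sigma^2) I) and its score is analytic.
    mu = np.array([2.0, -1.0])
    return (mu - x) / (1.0 + sigma**2)

def annealed_langevin(score_fn, sigmas, n_steps=100, eps=1e-4, dim=2, seed=0):
    """Langevin dynamics, annealed through decreasing noise scales."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim) * sigmas[0]          # init from the widest scale
    for sigma in sigmas:                          # sigmas must be decreasing
        alpha = eps * (sigma / sigmas[-1]) ** 2   # step size per noise level
        for _ in range(n_steps):
            z = rng.normal(size=dim)
            # Langevin update: drift along the score + injected noise
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x

sigmas = np.geomspace(10.0, 0.01, num=10)
sample = annealed_langevin(gaussian_score, sigmas)
```

averaging samples over many seeds recovers the data mean μ = (2, −1), since each noise level’s Langevin chain is stationary around the perturbed distribution.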

SDEs

Idea: take the number of noise scales to infinity, so that you get a continuum of pdfs p_t(x) for t \in [0, T]

  • p_0(x) = p(x) is the data distribution
  • p_T(x) is (approximately) pure noise — a tractable prior

general form of an SDE

  • dw is the increment of Brownian motion (a Wiener process)
  • in practice, you hand design the SDE. The SDE is part of the model, and it dictates how you add noise
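Written out (my reconstruction, in the blog’s notation), with f the drift and g the diffusion coefficient:

```latex
d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, dt + g(t)\, d\mathbf{w}

% example: the Variance Exploding (VE) SDE, which recovers the discrete
% noise scales \sigma_1 < \dots < \sigma_L in the continuum limit
d\mathbf{x} = \sqrt{\frac{d\,[\sigma^2(t)]}{dt}}\; d\mathbf{w}
```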

Any SDE has a reverse SDE

  • notice the score function popping out
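The reverse SDE (Anderson’s result; my transcription):

```latex
d\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g^2(t)\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right] dt + g(t)\, d\bar{\mathbf{w}}
```

where dw̄ is Brownian motion running backward in time — the score ∇ₓ log p_t(x) is the only unknown, so estimating it is all we need to sample.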

once again, we train a time-dependent score function by minimizing a weighted Fisher divergence. When you choose lambda(t) = g^2(t), you actually get that this objective is an upper bound on the KL divergence between p_0(x) and p_theta(x)!
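In symbols (my reconstruction), with π the prior that p_T approximates:

```latex
% time-dependent weighted score matching objective
J_{\mathrm{SM}}(\theta; \lambda) = \mathbb{E}_{t \sim \mathcal{U}(0, T)}\, \lambda(t)\,
  \mathbb{E}_{p_t(\mathbf{x})}\!\left[\left\| \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, t) \right\|_2^2\right]

% with the likelihood weighting \lambda(t) = g^2(t):
\mathrm{KL}\big(p_0 \,\|\, p_\theta\big) \;\le\; J_{\mathrm{SM}}(\theta; \lambda) + \mathrm{KL}\big(p_T \,\|\, \pi\big)
```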

use numerical SDE solvers (e.g. Euler–Maruyama) to integrate the reverse SDE for sampling.
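a toy Euler–Maruyama sketch of reverse-SDE sampling for the VE SDE — again all names are mine, and the analytic Gaussian score stands in for a learned sθ(x, t):

```python
import numpy as np

# Toy setup: data ~ N(MU, I); VE SDE with sigma(t) = S_MIN * (S_MAX/S_MIN)^t.
# The perturbed marginal p_t = N(MU, (1 + sigma(t)^2) I) has an analytic score,
# standing in for a trained s_theta(x, t).
MU = np.array([2.0, -1.0])
S_MIN, S_MAX = 0.01, 10.0

def sigma(t):
    return S_MIN * (S_MAX / S_MIN) ** t

def score(x, t):
    return (MU - x) / (1.0 + sigma(t) ** 2)

def reverse_sde_sample(n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    log_ratio = np.log(S_MAX / S_MIN)
    x = rng.normal(size=2) * S_MAX           # start from the prior p_T ~ N(0, S_MAX^2 I)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):          # integrate from t = 1 down toward t = 0
        t = i / n_steps
        g2 = 2.0 * log_ratio * sigma(t) ** 2  # g(t)^2 = d[sigma^2(t)]/dt for the VE SDE
        z = rng.normal(size=2)
        # Euler-Maruyama step of the reverse SDE (drift f = 0 for VE):
        # dx = -g^2 * score dt + g dw_bar, run backward in time
        x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * z
    return x
```

as with the Langevin sketch, averaging samples over seeds should land near MU, since the reverse SDE with the exact score reproduces the data marginal.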

Probability Flow ODE

Problem: with SDEs, we can’t compute the exact log likelihood of an x_0

You can convert any SDE into an ODE (the difference: an ODE is deterministic) that has the same marginals p_t — no guarantees about individual trajectories as you vary t continuously, but the marginals match. Plugging in the approximation s_theta(x, t) turns this into a neural ODE / continuous normalizing flow, so you can use numerical ODE solvers to compute exact p_0 likelihoods.
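The probability flow ODE and the likelihood computation (my reconstruction, via the instantaneous change-of-variables formula from continuous normalizing flows):

```latex
% probability flow ODE: same marginals p_t as the SDE
d\mathbf{x} = \underbrace{\left[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2} g^2(t)\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]}_{\tilde{\mathbf{f}}(\mathbf{x}, t)}\, dt

% exact log-likelihood by integrating the divergence along the trajectory
\log p_0(\mathbf{x}(0)) = \log p_T(\mathbf{x}(T)) + \int_0^T \nabla \cdot \tilde{\mathbf{f}}_\theta(\mathbf{x}(t), t)\, dt
```

where f̃_θ plugs sθ(x, t) in for the true score.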

Controllable generation

Bayes’ rule for score functions: you get the first term on the right by score matching, but their story about how you can get the second term lowkey seems wrong. I feel like I like the Lil’Log explanation of conditional diffusion more.
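For reference, Bayes’ rule in score form (my transcription):

```latex
\nabla_{\mathbf{x}} \log p_t(\mathbf{x} \mid y)
  = \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) + \nabla_{\mathbf{x}} \log p_t(y \mid \mathbf{x})
```

the ∇ₓ log p(y) term vanishes because p(y) doesn’t depend on x; the first term is the unconditional score model, and the second needs a separate model of p(y | x) (e.g. a classifier).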

What is this useful for?

  • class conditional image generation

  • image inpainting

  • image coloration (hey wait a second that looks somewhat similar to what I’m doing for my UROP uh oh)

Connection to diffusion models

on the surface, seems like they’re different because

  • score based models are trained by score matching and sampled by Langevin dynamics, while
  • diffusion models are trained by ELBO and sampled with a learned decoder

but it turns out the ELBO loss is equivalent to the weighted Fisher divergence objective
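The bridge (my reconstruction, in DDPM notation with x_t = √ᾱ_t x₀ + σ_t ε): noise prediction is score prediction up to a scale, so the simple ELBO loss is a weighted denoising score matching objective:

```latex
% noise prediction vs. score
\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sigma_t}
  \approx \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)

% the DDPM (simple) loss as weighted denoising score matching,
% using \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid \mathbf{x}_0) = -\boldsymbol{\epsilon}/\sigma_t:
\mathbb{E}\!\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right\|^2\right]
  = \sigma_t^2\; \mathbb{E}\!\left[\left\|\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid \mathbf{x}_0) - \mathbf{s}_\theta(\mathbf{x}_t, t)\right\|^2\right]
```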

so they’re different perspectives on the same model family