The AiEdge+: Diffusion models - Stable Diffusion vs DALLE-2 vs Imagen

The models that beat GANs at generating realistic images

Mar 20, 2023


Today we dig into the models that beat GANs at generating realistic images: diffusion models! Stable Diffusion, OpenAI's DALL-E 2, and Google's Imagen are state-of-the-art image generation models conditioned on a text prompt. They look similar, but a few differences make each of them special. We are going to look at:

What is a diffusion model?
The differences and similarities between Stable Diffusion, DALLE-2 and Imagen
GitHub repositories, articles, and YouTube videos about diffusion models

What is a diffusion model?

What is a diffusion model in machine learning? Conceptually, it is very simple! You add some noise to an image, and you learn to remove it. Train a machine learning model that takes a noisy image as input and produces a denoised image as output, and you have a denoising model.
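To make the "add noise, learn to remove it" idea concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the "images" are 16-pixel vectors drawn from a 2-dimensional subspace, the noise level is 0.5, and the "model" is a plain linear map fitted by least squares rather than the neural network a real diffusion model would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 16-pixel vectors that live in a 2-dimensional subspace
# (an illustrative assumption, not real image data).
basis = rng.standard_normal((2, 16))

def sample_clean(n):
    """Draw n clean toy images."""
    return rng.standard_normal((n, 2)) @ basis

def add_noise(x, sigma=0.5):
    """Corrupt images with white Gaussian noise of standard deviation sigma."""
    return x + sigma * rng.standard_normal(x.shape)

# Training: fit a linear map W so that noisy @ W is close to clean.
clean_train = sample_clean(2000)
noisy_train = add_noise(clean_train)
W, *_ = np.linalg.lstsq(noisy_train, clean_train, rcond=None)

# Evaluation on fresh data: compare the error before and after denoising.
clean_test = sample_clean(500)
noisy_test = add_noise(clean_test)
denoised_test = noisy_test @ W

mse_noisy = np.mean((noisy_test - clean_test) ** 2)
mse_denoised = np.mean((denoised_test - clean_test) ** 2)
print(f"MSE before denoising: {mse_noisy:.3f}")
print(f"MSE after denoising:  {mse_denoised:.3f}")
```

The linear denoiser works here only because the toy data is so structured. Real images need a far more expressive model, but the training recipe (noisy input, clean target) is the same idea.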

The typical way to do it is to assume that the noise is normally distributed and to parametrize the mean and the covariance matrix of that distribution. In practice, the covariance is often held fixed, so the problem simplifies to learning just the mean. The process can be divided into the forward process, where white (Gaussian) noise is progressively added to a clean image, and the reverse process, where a learned model progressively removes the noise until the image is clean again.
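The forward process described above can be sketched in a few lines. The update rule and the linear noise schedule below are assumptions borrowed from the commonly used DDPM formulation, not something specified in this post, and the 8x8 array of ones is just a stand-in for a clean image.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Progressively add Gaussian noise to x0 and return every intermediate image."""
    xs = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # Each step slightly shrinks the current image and adds a little white noise.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        xs.append(x)
    return xs

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                    # stand-in for a clean image (assumption)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule

trajectory = forward_diffusion(x0, betas, rng)
x_T = trajectory[-1]

# Fraction of the original image still present in the final step.
signal_scale = np.sqrt(np.prod(1.0 - betas))
print(f"remaining signal scale: {signal_scale:.4f}")
print(f"final image mean: {x_T.mean():.3f}, std: {x_T.std():.3f}")
```

After enough steps almost none of the original image is left and the result is close to pure Gaussian noise. That is the starting point the reverse process has to work back from.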

Why is that called a diffusion model? What does it have to do with the diffusive process of particles in a fluid with a concentration gradient (see Wikipedia)? This is due to the way mathematicians have abused the jargon of the physical process to formalize a mathematical concept. It happens that physical phenomena like Fickian diffusion, heat diffusion, and Brownian motion are all well described by the diffusion equation:

\(\frac{\partial \phi(\mathbf{r}, t)}{\partial t} = D\nabla_{\mathbf{r}}^2\phi(\mathbf{r}, t)\)

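In words, the first time derivative of a state function is proportional to the second space derivative of that state function, with the diffusion coefficient D as the proportionality constant. That diffusion equation has an equivalent stochastic formulation known as the Langevin equation (shown here in its simplest, drift-free form):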

\(d\mathbf{r}=\sqrt{2D}d\mathbf{W}\)

At the core of the Langevin equation is a mathematical object called the Wiener process W. Interestingly enough, this process is also called Brownian motion (not to be confused with the physical process). It can be thought of as a random walk with infinitesimally small steps. The key feature of the Wiener process is that its increment over any time interval is normally distributed, with zero mean and a variance equal to the length of the interval. That is why the concept of "diffusion" is intertwined with the white noise generation process, and that is why those ML models are called diffusion models!
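As a quick numerical illustration, the Langevin equation above can be simulated by drawing each Wiener increment from a normal distribution with variance dt. The values of D, dt, and the number of paths below are arbitrary choices for this sketch. If the discretization is right, the variance of the final positions should land close to 2Dt, which is the spread predicted by the diffusion equation.

```python
import numpy as np

def simulate_brownian_paths(n_paths, n_steps, dt, D, rng):
    """Simulate dr = sqrt(2D) dW in one dimension, starting from r = 0."""
    # A Wiener increment over a time step dt is Normal(0, dt).
    dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
    dr = np.sqrt(2.0 * D) * dW
    # Positions are the running sum of the increments.
    return np.cumsum(dr, axis=1)

rng = np.random.default_rng(0)
D, dt, n_steps = 0.5, 0.05, 200          # arbitrary illustrative values
paths = simulate_brownian_paths(5000, n_steps, dt, D, rng)

t_final = n_steps * dt
empirical_var = paths[:, -1].var()
theoretical_var = 2.0 * D * t_final
print(f"empirical variance at t={t_final}: {empirical_var:.2f}")
print(f"theoretical variance 2Dt:        {theoretical_var:.2f}")
```

Each simulated path is a random walk built from small Gaussian steps, which is the same ingredient the forward process of a diffusion model uses when it adds white noise to an image.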

Those diffusion models are generative models because data is generated starting from a Gaussian prior, and they are at the core of text-to-image generative models such as Stable Diffusion, DALL-E 2, and Imagen.

Stable Diffusion vs DALLE-2 vs Imagen

The state-of-the-art image generation models conditioned on a text prompt have three things in common: a text encoder, a way to inject the text information into the image, and a diffusion mechanism. Stable Diffusion, OpenAI's DALL-E 2, and Google's Imagen are very similar in that regard but also have small differences that make them special.
