How does a Stable Diffusion or a DALL-E model work?
Generating images using diffusion. What is that? I should probably find out. There are things like DALL-E and DALL-E 2, Imagen from Google, and now Stable Diffusion as well. I've spent quite a long time messing about with Stable Diffusion, and I'm having quite a lot of fun with it. So what I thought I'd do is download the code, read the papers, work out what's going on, and then we can talk about it.
I delved into this code and realised there's quite a lot to these things, right? It's not so much that they're complicated, it's just that there are a lot of moving parts. So let's have a quick reminder of generative adversarial networks, which were, until now, the standard way of generating images, and then we can talk about how diffusion is different and why we'd use it instead.
Having a deep network trained to produce the same image over and over again is not very interesting. So we have some random noise that we use to make the output different each time, and we have a very large generator network, which for our purposes is a black box, a big neural network that turns out an image that hopefully looks like the thing we're trying to produce.
Faces, landscapes, people, you know. Is this how those anonymous faces on This Person Does Not Exist are made? Yeah, that's exactly how they work. That's using, I think, StyleGAN, and it's that exact idea. It's trained on a large corpus of faces, and it just generates faces at random.
Right, or at least mostly at random. The way we train this is, we have millions and millions of pictures of the thing we're trying to produce. We give the generator noise, it produces an image, and we have to tell whether that image is good or bad, right? We need to give this network some signal on whether the image looks like a face.
Otherwise, it's not going to train. So we have another network, a discriminator, which is sort of the opposite: it says, is this a real image or a fake one? Half the time we give it fake images and half the time we give it real faces, so it trains and gets better at discriminating between the fakes produced by the generator and the real images from the training set.
And in doing so, the generator has to get better at faking them, and so on, and the hope is that they both just get better and better. Now, that kind of works. The problem is that GANs are very hard to train. You have a lot of problems with things like mode collapse, where the generator just produces the same face every time.
If it produces a face that fools the discriminator every time, there's not a lot of incentive for it to do anything interesting, because it's solved the problem: it's beaten the discriminator, let's move on. So if you're not careful with your training process, these kinds of things can happen. And I suppose, intuitively, it's quite difficult to go from a bit of noise to a beautiful-looking, high-resolution image without there being some oddities, some things that go a bit wrong. So what we're going to do, using diffusion models, is simplify this into a kind of iterative, small-step process, where the work the network has to do at each step is slightly smaller, and you just run it a number of times to make things better.
Right, we'll start again on the paper so we can clean things up a bit. So we've got an image, let's say an image of a rabbit. We add some noise to it, so we've got the same rabbit with a bit of noise on top. Now, it's not really speckly noise like I've drawn, it's Gaussian noise, but I can't draw that.
And then we add another bit of noise. It's the same rabbit, just with a bit more noise, and we keep going, step after step, until we end up with what is essentially just noise; it looks like nonsense. And so the question is: how do we craft a training algorithm, and then an inference process, so that we can deploy a network that can undo this?
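(To make that forward "noising" process concrete, here's a minimal sketch, not taken from any particular paper, of repeatedly mixing a little Gaussian noise into an image until it's unrecognisable. The variable names and the noise strength are just illustrative.)

```python
import torch

def add_a_little_noise(image, noise_strength=0.1):
    """One forward 'noising' step: blend the image with fresh Gaussian noise.

    `image` is a tensor scaled to roughly [-1, 1]; `noise_strength` controls
    how much of the signal we replace with noise at this step.
    """
    noise = torch.randn_like(image)
    # Mix the image and the noise so the overall variance stays roughly constant.
    return (1 - noise_strength) ** 0.5 * image + noise_strength ** 0.5 * noise

# Start from a (pretend) rabbit image and keep noising it until it's just static.
x = torch.rand(3, 64, 64) * 2 - 1   # placeholder image in [-1, 1]
for step in range(10):
    x = add_a_little_noise(x)
```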
The first question is, how much noise do we add each time? Why don't we just add loads of noise in one go? Delete all these intermediate images, add loads of noise, and then say, right, give me the original back, and you've got a pair of training examples you could use. The answer is, it'll kind of work, but that's a very difficult job.
You're sort of back to the same problem as the GAN: you're trying to do everything in one go. The intuition is that it's perhaps slightly easier to go from this image to this one, just removing a little bit of noise, and then from this one to the next, removing a little bit more. But in traditional image processing there are noise-removal techniques, right?
Yeah. So it's not difficult to do that, is it? No, I mean, it's difficult in the sense that you don't know what the original image was. So what we're trying to do is train a network to undo this process. That's the idea. And if we can do that, then we can start with random noise, a bit like our GAN, and just iterate this process
and produce an image. Now, there are a lot of missing parts here, so we'll build up the complexity a little bit. The first thing is, let's go back to our question of how much noise to add. We could add a small amount of noise, then the same amount again, and the same amount again, and keep adding it until we have something that looks essentially like random noise over here.
That would be what we call a linear schedule: the same amount of noise each time. It's not very interesting, but it works. The other thing you could do is add very little noise at the beginning and then ramp up the amount you add later. There are different strategies, depending on which paper you read, for the best way of adding noise, but it's called the schedule.
So the idea is you have a schedule that says: this is the image at time t = 0, this is t = 1, and so on, up to some capital T, which is the total number of steps you've got. The last one represents essentially all noise, and each one in between has some amount of noise.
And you can change how much noise each step adds. The nice thing is that, because Gaussians add together very nicely, you can say, I want t = 7, and you don't have to produce all the intermediate images: you can jump straight to t = 7, add exactly the right amount of noise in one go, and hand that to the network.
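(As a rough sketch of how that works in code, here's the schedule idea with illustrative numbers: a list of per-step noise amounts, their cumulative product, and the "jump straight to t" shortcut that follows from Gaussians adding together. None of this is from a specific codebase; the names and values are made up for illustration.)

```python
import torch

T = 1000
# A simple linear schedule of per-step noise amounts (often called betas).
# Real models use various schedules, but the idea is the same.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product, one value per t

def noisy_image_at(x0, t):
    """Jump straight to timestep t without doing the steps one by one."""
    noise = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
    return xt, noise   # keep the noise too: it becomes the training target

x0 = torch.rand(3, 64, 64) * 2 - 1   # placeholder "rabbit" image in [-1, 1]
x7, eps = noisy_image_at(x0, t=7)    # straight to t = 7, no intermediate images
```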
So when you train this, you can give it random images from your training set with random amounts of noise added, based on this schedule, with t varying randomly between 1 and T. You can say: here's a really noisy image, undo it; here's a slightly less noisy image, undo that. What you do is take your noisy image, I'm going to keep going with this rabbit, it's taller than it was before, at some time, let's say t = 5, and you have a giant U-Net-shaped network. We've talked about encoder-decoder networks before; there's nothing particularly surprising about this one.
You also put in the time, because if you're running a schedule where different times have different amounts of noise, you need to tell the network where it is, so it knows whether it has to remove a lot of noise this time or just a little. What do we produce as output?
We could go the whole hog and say we'll just produce the original rabbit image. But then you've got a situation where you have to go from pure noise all the way back to the rabbit, and that's a little bit difficult. Mathematically, it works out a little easier if we just try to predict the noise.
We want to know: what noise was added to this image, such that you could use it to get back to the original? So it's all the noise from t = 1, 2, 3, 4 and 5, and what you get out is basically just noise, with no rabbit. That's the hope. And then, theoretically, you could subtract that from the noisy image and you'd get the rabbit back.
Now, if you did that from the very noisiest image, you'd find it's a little bit iffy, because predicting all the noise back to this rabbit is quite difficult. But if you did it from an image nearer the start, it may not be quite so difficult. We want to predict noise, so one thing we could do is predict the noise at, let's say, t = 5, and just say: give me the noise that takes us back to t = 4.
And then t = 3, and t = 2, and so on. The problem with that is you're stuck doing exactly the time steps the schedule used: if you used a thousand time steps for training, now you've got to use a thousand time steps for inference, and you can't speed it up. So what we try to do instead is say: whatever time step you're at, you've got some amount of noise.
Remove it all: predict all the noise in the image, and give me back the noise I can take away to get to the original image. And that's what we do. During training, we pick a random source image, we pick a random time step, and we add the amount of noise our schedule says for that step.
So we have a noisy image and a time step t. We put those into the network and ask: what was the noise we just added to that image? Now, we haven't given it the original image, and that's what's difficult about this: we have the original, clean image, we're not showing it, we've added some noise, and we want that noise back.
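(A rough sketch of that single training step, assuming a hypothetical `model` that takes a noisy image batch and a timestep, and the `alpha_bars` schedule from before. Real implementations batch this and differ in details, but the core is: add noise, predict it, and penalise the difference.)

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alpha_bars, optimizer):
    """One training step: noise a clean image to a random timestep and ask the
    network to predict the noise we just added."""
    t = torch.randint(0, len(alpha_bars), (1,)).item()   # random timestep
    noise = torch.randn_like(x0)
    # Jump straight to timestep t, as in the earlier sketch.
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise

    predicted_noise = model(xt.unsqueeze(0), torch.tensor([t]))
    loss = F.mse_loss(predicted_noise, noise.unsqueeze(0))   # how wrong was the guess?

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```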
And we can do that very easily: we've got millions, or billions, of images in our data set, we can add random bits of noise to them and ask, what was that noise? And over time the network builds up a picture of what that noise looks like. So this sounds like it would make a really good plug-in for Photoshop or something, right?
It's going to be a noise-removal plug-in. How does that turn into creating new images? Yeah, in some sense that's the clever bit: how we use this network that predicts noise to undo the noise. We've got a network which, given an image with some noise added to it, and a time step that roughly represents how much noise that is, or where we are in the noising process, produces an estimate of what that noise is in total.
And theoretically, if we take that noise away, we get back to the original image. Now, that is not a perfect process: the network is not going to be perfect, so if you give it an incredibly noisy image and take away what it predicts, you'll get a sort of vague shape.
So what we want to do is take it a little more slowly. We take the predicted noise and subtract it from our image, and we get another image, which is our estimate for what the original, t = 0, image was.
It's not going to look very good the first time. But then we add a bunch of that noise back again, and we get to a t slightly less than the one we started at. So maybe this was t = 10; we add back, say, nine-tenths of the noise, and we get to roughly t = 9. Now we have a slightly less noisy image, and we can repeat the process.
We put the slightly less noisy image in, predict how to get back to t = 0, add back most but not all of the noise, and repeat. Each time we go around the loop, we get a little closer to the original image. It was very difficult to predict the noise at t = 10.
It's slightly easier at t = 9, and very easy at t = 1, because it's mostly the image with a little noise on it. So if we just feel our way towards it, taking off little bits of noise at a time, we can produce an image. You start with a noisy image, you predict all the noise, remove it, and then add most of it back.
At each step you have an estimate for what the original image was, and you have the next image, which is just a little less noisy than the one before. You loop this several times, and that's basically how the image generation process works: you take your noisy image, you iterate, and you gradually remove the noise until you end up at what the network thinks the original image was.
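(Here's a minimal sketch of that generation loop, reusing the hypothetical `model` and `alpha_bars` from earlier. This is roughly the deterministic, DDIM-flavoured version of the update: estimate the clean image from the predicted noise, then re-noise that estimate to the previous timestep. Actual samplers differ in exactly how much noise they add back.)

```python
import torch

@torch.no_grad()
def sample(model, alpha_bars, shape=(1, 3, 64, 64)):
    """Generate an image: start from pure noise and repeatedly predict the
    noise, estimate the clean image, then re-noise to a slightly earlier t."""
    x = torch.randn(shape)                       # start from pure Gaussian noise
    for t in reversed(range(1, len(alpha_bars))):
        eps = model(x, torch.tensor([t]))        # network's guess at the noise
        # Estimate of the clean (t = 0) image implied by that guess.
        x0_hat = (x - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()
        # Add "most but not all" of the noise back, landing at timestep t - 1.
        x = alpha_bars[t - 1].sqrt() * x0_hat + (1 - alpha_bars[t - 1]).sqrt() * eps
    return x0_hat
```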
And you're doing this by predicting noise and taking it away, rather than spitting out a less noisy image directly, and that mathematically works out to be a lot easier to train and a lot more stable than a GAN. There's an elephant in the room here. There is. You're kind of talking about how to make random images, effectively.
How do we direct this? So that's where the complexity starts ramping up. We've got a structure where we can train a network to produce random images, but there's no way of saying, I want a frog-rabbit hybrid. Which I've done, and it's very weird. So how do we do that? The answer is that we condition this network.
That's the word we'd use: we give it access to the text as well. Alright, so let's infer an image on my piece of paper, bearing in mind the output is going to be hand-drawn by me, so it's going to be terrible. You start with a random noise image; this is just an image you've generated by taking random Gaussian noise.
Mathematically, this is centered around 0, so you have negative and positive numbers; you don't go from 0 to 255, because it's just easier for the network to train that way. You also put in your time step. Let's say you want to do 50 iterations, so we put in a time step right at the end of our schedule that says time step equals 50,
which is our most-noised image. Then you pass it through the network and say: estimate the noise for me. We also take our string, which is "frogs on stilts". Oh, I'll have to try that later. I'm looking forward to this one. Anyway, we could easily spend another 20 or 30 minutes producing frogs on stilts.
We embed this using our GPT-style transformer embedding, and we stick that in as well. Then the network produces an estimate of how much noise it thinks is in that image. That estimate at t = 50 is going to be a bit average; it's not going to produce you a picture of a frog on stilts.
It's going to produce a grey image or a brown image or something like that, because that is a very, very difficult problem to solve. However, if you subtract this predicted noise from the image, you get your first estimate of what the final image might be. Then you add back a bunch of noise and you get to t = 49.
So now we've got slightly less noise, and maybe the vaguest outline of a frog on a stilt. This is t = 49. You take your text embedding and put that in as well, and you get another, maybe slightly better, estimate of the noise in the image. And then we loop. It's a for loop, alright?
We've done those before. You take the output, subtract it, add noise back, and repeat the process, and you keep feeding in the text embedding. Now, there's one final trick they use to make things a little better. If you just do what I've described, you will get a picture that maybe looks slightly frog-like.
Maybe there's a stilt in it. But it won't look anything like the images you see on the internet produced by these tools, because they do another trick to tie the output even more closely to the text. It's something called classifier-free guidance. You put the image in twice: once including the embedding of the text, and once without it.
The network is maybe slightly better at estimating the noise when it has the text. So you put in two copies of the image: one with the embedding and one with no embedding. The prediction without the embedding is maybe slightly more random, and the one with it is slightly more frog-like, or at least slightly better at moving towards the right thing.
We can calculate the difference between these two noise predictions, amplify that signal, and feed it back. Essentially we're saying: if the network wasn't given any information about what was in the image, and then it was, what's the difference between those two predictions?
And can we amplify that, each time we go around the loop, to target this kind of output? The idea is you're forcing the network, or the loop, to point in the direction of the scene we want. That's called classifier-free guidance, and it is somewhat of a hack bolted onto the end of the network.
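(In code, the trick is just a weighted extrapolation between the two noise predictions. This sketch assumes a hypothetical `model` that also takes a text embedding; the guidance scale is a knob you turn, and a value around 7 or so is typical for Stable Diffusion.)

```python
import torch

@torch.no_grad()
def guided_noise_estimate(model, x, t, text_embedding, empty_embedding,
                          guidance_scale=7.5):
    """Classifier-free guidance: run the network twice, once with the text
    embedding and once without, then exaggerate the difference between the
    two noise predictions so the loop is pushed towards the prompt."""
    eps_uncond = model(x, t, empty_embedding)    # "no idea what's in the image"
    eps_text = model(x, t, text_embedding)       # "it should be frogs on stilts"
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)
```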
But it does work. If you turn it off, which I've done, you get vague sorts of structures that only roughly look right. It's not terrible; I think I did "a Muppet cooking in the kitchen", and it just produced a picture of a generic kitchen with no Muppet in it.
But if you do this, then you suddenly are targeting what you want. Standard question, I've got to ask it: is this something people can play with, without just going to one of these websites and typing some words? Well, yeah. The thing is, it costs hundreds of thousands of dollars to train one of these networks, because of how many images they use and how much processing power they take.
The good news is that there are ones like Stable Diffusion that are available to use for free, and you can run them through things like Google Colab. I did this through Google Colab and it worked well, and maybe we'll talk about that in another blog post, where we delve into the code and see all of these bits happening.
I blew through my free Google allowance very, very quickly; I had to pay my £8 for premium Google access. Eight pounds? Hang on, I've got some money here, here you go. Thank you, yeah. So, you know, never let it be said that I spare any expense on Computerphile when it comes to getting access to proper compute hardware.
Could Beast do something like that? It could, yeah. Most of our servers could; I'm just a bit lazy and haven't set them up to do so. But the code is quite easy to run. With the entry-level version of the code, you can literally just call one Python function and it will produce an image.
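(For reference, that "one Python function" route is roughly what you get with Hugging Face's diffusers library. This is a sketch rather than a guaranteed recipe: the model name and arguments may have changed between versions, and you need a GPU with enough memory.)

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the pretrained weights and build the full sampling pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")   # needs a GPU with enough memory

# One call runs the whole denoising loop described above.
image = pipe("frogs on stilts", num_inference_steps=50).images[0]
image.save("frogs_on_stilts.png")
```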
I'm using code that's a little more detailed: it's got the full loop in it, and I can go in and inject things and change things so I can understand it better. We'll talk through that next time, perhaps. The only other interesting thing about the current networks is that the weights here, and here, and here are shared, so they are the same.
Because otherwise, this one here would always be learning… It's the same amount of time to make one sandwich, but you've got two people doing it, so they make twice as many sandwiches each time they make a sandwich. Same with the computer.