AI Digest

How Stable Diffusion Works: The Secrets of AI Image Generation





Author: 盘古课堂 | Source: 盘古课堂


AI image generation is the most recent AI capability blowing people's minds (mine included). The ability to create striking visuals from text descriptions has a magical quality to it and points clearly to a shift in how humans create art.

The release of Stable Diffusion is a clear milestone in this development because it made a high-performance model available to the masses (performance in terms of image quality, as well as speed and relatively low resource/memory requirements).

After experimenting with AI image generation, you may start to wonder how it works.

Stable Diffusion is versatile in that it can be used in a number of different ways. Let's focus at first on image generation from text only (text2img). The image above shows an example text input and the resulting generated image (the actual complete prompt is here). Aside from text to image, another main way of using it is by making it alter images (so inputs are text + image).
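
To make those two workflows concrete, here is a minimal sketch using the Hugging Face diffusers library (the library, model id, and prompts are my own illustration, not something from the original article):

```python
# Minimal text2img and img2img sketch (illustrative; assumes a GPU and the diffusers library).
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # example checkpoint, not the only option

# text2img: the only input is a text prompt
text2img = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = text2img("paradise cosmic beach", num_inference_steps=50).images[0]

# img2img: the inputs are a text prompt plus an existing image to alter
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
altered = img2img(prompt="paradise cosmic beach at sunset", image=image, strength=0.6).images[0]
```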

The Components of Stable Diffusion

Stable Diffusion is a system made up of several components and models. It is not one monolithic model.

As we look under the hood, the first observation we can make is that there's a text-understanding component that translates the text information into a numeric representation that captures the ideas in the text.

We're starting with a high-level view and we'll get into more machine learning details later in this article. However, we can say that this text encoder is a special Transformer language model (technically: the text encoder of a CLIP model). It takes the input text and outputs a list of numbers representing each word/token in the text (a vector per token).
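
A rough sketch of that encoding step, using the transformers library's CLIP classes (the model id shown is the text encoder commonly paired with Stable Diffusion v1; treat the exact names as an assumption):

```python
# Sketch: a prompt goes in, one 768-dimensional vector per token position comes out.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "paradise cosmic beach",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
embeddings = text_encoder(tokens.input_ids).last_hidden_state
print(embeddings.shape)  # torch.Size([1, 77, 768]): 77 token positions, 768 numbers each
```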

That information is then presented to the Image Generator, which is composed of a couple of components itself.

The image generator goes through two stages:

1 - Image information creator

This component is the secret sauce of Stable Diffusion. It's where a lot of the performance gain over previous models is achieved.

This component runs for multiple steps to generate image information. This is the steps parameter in Stable Diffusion interfaces and libraries, which often defaults to 50 or 100.

The image information creator works completely in the image information space (or latent space). We'll talk more about what that means later in the post. This property makes it faster than previous diffusion models that worked in pixel space. In technical terms, this component is made up of a UNet neural network and a scheduling algorithm.
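
A small sketch of how that steps setting maps onto the scheduler (diffusers names; the model id is only an example):

```python
# Sketch: the "steps" parameter tells the scheduler how many denoising
# timesteps to spread across the full noise schedule.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(num_inference_steps=50)  # the common default of 50 steps
print(scheduler.timesteps)  # 50 timesteps, ordered from very noisy to nearly clean
```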

The word "diffusion" describes what happens in this component. It is the step-by-step processing of information that leads to a high-quality image being generated in the end (by the next component, the image decoder).

2 - Image Decoder

The image decoder paints a picture from the information it got from the information creator. It runs only once at the end of the process to produce the final pixel image.

With this we come to see the three main components (each with its own neural network) that make up Stable Diffusion:

· ClipText for text encoding.

Input: text.

Output: 77 token embedding vectors, each in 768 dimensions.

· UNet + Scheduler to gradually process/diffuse information in the information (latent) space.

Input: text embeddings and a starting multi-dimensional array (structured lists of numbers, also called a tensor) made up of noise.

Output: A processed information array

· Autoencoder Decoder that paints the final image using the processed information array.

Input: The processed information array (dimensions: (4, 64, 64))

Output: The resulting image (dimensions: (3, 512, 512), which are (red/green/blue, width, height))
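
To see how those three pieces and their shapes fit together, here is a minimal end-to-end sketch with diffusers and torch (the model id is an example, and the random text embeddings stand in for the real ClipText output; a real pipeline also adds classifier-free guidance):

```python
# Shape-level sketch of the three components working together (illustrative, not the real pipeline code).
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # example checkpoint
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")
scheduler.set_timesteps(50)

text_embeddings = torch.randn(1, 77, 768)  # stand-in for the ClipText output
latents = torch.randn(1, 4, 64, 64)        # random starting information array (pure noise)

with torch.no_grad():
    for t in scheduler.timesteps:          # UNet + scheduler: refine the latents step by step
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    image = vae.decode(latents / 0.18215).sample  # autoencoder decoder: (1, 4, 64, 64) -> (1, 3, 512, 512)

print(image.shape)
```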

How Stable Diffusion Works: What Is Diffusion?

Diffusion is the process that takes place inside the pink "image information creator" component. Having the token embeddings that represent the input text, and a random starting image information array (these are also called latents), the process produces an information array that the image decoder uses to paint the final image.

This process happens in a step-by-step fashion. Each step adds more relevant information. To get an intuition of the process, we can inspect the random latents array, and see that it translates to visual noise. Visual inspection in this case is passing it through the image decoder.

Diffusion happens in multiple steps. Each step operates on an input latents array and produces another latents array that better resembles the input text and all the visual information the model picked up from all the images it was trained on.

We can visualize a set of these latents to see what information gets added at each step.

The process is quite breathtaking to look at.

Something especially fascinating happens between steps 2 and 4 in this case. It's as if the outline emerges from the noise.

How Stable Diffusion Works: How Does Diffusion Happen?

The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations. Diffusion models approach image generation by framing the problem as follows:

Say we have an image, we generate some noise, and add it to the image.

This can now be considered a training example. We can use this same formula to create lots of training examples to train the central component of our image generation model.

While this example shows a few noise amount values from the original image (amount 0, no noise) to total noise (amount 4, total noise), we can easily control how much noise to add to the image, and so we can spread it over tens of steps, creating tens of training examples per image for all the images in a training dataset.
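
A minimal sketch of that noising recipe in PyTorch (the linear mixing here is a deliberate simplification of the real noise schedule):

```python
# Sketch: one clean image becomes several training examples by mixing in
# increasing amounts of noise (simplified schedule, purely for intuition).
import torch

image = torch.rand(3, 512, 512)      # a stand-in training image
noise = torch.randn_like(image)      # the noise we will ask the model to predict

examples = []
for step, amount in enumerate(torch.linspace(0.0, 1.0, steps=5)):  # amount 0 (no noise) ... amount 4 (all noise)
    noisy = (1 - amount) * image + amount * noise
    examples.append((noisy, noise, step))  # (noisy input, target noise, noise amount) = one training example
```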

With this dataset, we can train the noise predictor and end up with a great noise predictor that actually creates images when run in a certain configuration. A training step should look familiar if you've had ML exposure:
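
Sketched in PyTorch, one such training step could look like this (noise_predictor and optimizer are placeholders for whatever model and optimizer you train; the add_noise mixing is again a simplification):

```python
# Sketch of one noise-predictor training step: add noise, predict it, compare, update.
import torch
import torch.nn.functional as F

def add_noise(images, noise, amounts):
    # simplified linear mixing; real schedules weight by sqrt(alpha_bar) terms
    a = amounts.view(-1, 1, 1, 1)
    return (1 - a) * images + a * noise

def training_step(noise_predictor, optimizer, clean_images):
    amounts = torch.rand(clean_images.shape[0])               # pick a random noise amount per image
    noise = torch.randn_like(clean_images)
    noisy_images = add_noise(clean_images, noise, amounts)    # build the training input
    predicted_noise = noise_predictor(noisy_images, amounts)  # the model guesses the noise that was added
    loss = F.mse_loss(predicted_noise, noise)                 # compare the guess with the actual noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```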

Let's now see how this can generate images.

Painting images by removing noise

The trained noise predictor can take a noisy image and the number of the denoising step, and is able to predict a slice of noise.

The sampled noise is predicted so that if we subtract it from the image, we get an image that's closer to the images the model was trained on (not the exact images themselves, but the distribution: the world of pixel arrangements where the sky is usually blue and above the ground, people have two eyes, and cats look a certain way, with pointy ears and clearly unimpressed).
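
A conceptual sketch of that generate-by-denoising loop (in real samplers the scheduler decides exactly how much of the predicted noise to remove at each step; the simple division below is only for intuition, and noise_predictor is a placeholder):

```python
# Conceptual sketch: start from pure noise, repeatedly predict and remove a slice of it.
import torch

def generate(noise_predictor, steps=50, shape=(1, 3, 512, 512)):
    image = torch.randn(shape)                    # start from pure noise
    for step in reversed(range(steps)):
        predicted_noise = noise_predictor(image, step)
        image = image - predicted_noise / steps   # remove one slice of the predicted noise
    return image                                  # should now sit closer to the training distribution
```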

If the training dataset was of aesthetically pleasing images (e.g., LAION Aesthetics, which Stable Diffusion was trained on), then the resulting image would tend to be aesthetically pleasing. If we train it on images of logos, we end up with a logo-generating model.

This concludes the description of image generation by diffusion models, mostly as described in Denoising Diffusion Probabilistic Models. Now that you have this intuition of diffusion, you know the main components of not only Stable Diffusion, but also Dall-E 2 and Google's Imagen.

Note that the diffusion process we described so far generates images without using any text data. So if we deploy this model, it would generate great looking images, but we'd have no way of controlling if it's an image of a pyramid or a cat or anything else. In the next sections we'll describe how text is incorporated in the process in order to control what type of image the model generates.

Speed Boost: Diffusion on Compressed (Latent) Data Instead of the Pixel Image

To speed up the image generation process, the Stable Diffusion paper runs the diffusion process not on the pixel images themselves, but on a compressed version of the image. The paper calls this "Departure to Latent Space".

This compression (and later decompression/painting) is done via an autoencoder. The autoencoder compresses the image into the latent space using its encoder, then reconstructs it using only the compressed information using the decoder.

Now the forward diffusion process is done on the compressed latents. The slices of noise are of noise applied to those latents, not to the pixel image. And so the noise predictor is actually trained to predict noise in the compressed representation (the latent space).

The forward process (using the autoencoder's encoder) is how we generate the data to train the noise predictor. Once it's trained, we can generate images by running the reverse process (using the autoencoder's decoder).
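
Here is a sketch of that round trip with diffusers' AutoencoderKL (the model id is an example; the latent scaling factor applied by the real pipeline is omitted for brevity):

```python
# Sketch: the autoencoder moves between pixel space and the compressed latent space,
# and the forward noising now happens on the latents rather than on pixels.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

pixels = torch.rand(1, 3, 512, 512)                        # a pixel-space image
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()      # compress: (1, 3, 512, 512) -> (1, 4, 64, 64)
    noisy_latents = latents + torch.randn_like(latents)    # forward diffusion applied in latent space
    reconstructed = vae.decode(latents).sample             # decompress back to (1, 3, 512, 512)
```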

These two flows are what's shown in Figure 3 of the LDM/Stable Diffusion paper:

This figure additionally shows the "conditioning" components, which in this case are the text prompts describing what image the model should generate. So let's dig into the text components.

The Text Encoder: A Transformer Language Model

A Transformer language model is used as the language understanding component that takes the text prompt and produces token embeddings. The released Stable Diffusion model uses ClipText (a GPT-based model), while the paper used BERT.

The choice of language model is shown by the Imagen paper to be an important one. Swapping in larger language models had more of an effect on generated image quality than larger image generation components.

Larger/better language models have a significant effect on the quality of image generation models. Source: Google Imagen paper by Saharia et al., Figure A.5.

The early Stable Diffusion models just plugged in the pre-trained ClipText model released by OpenAI. It's possible that future models may switch to the newly released and much larger OpenCLIP variants of CLIP (Nov 2022 update: true enough, Stable Diffusion V2 uses OpenCLIP). This new batch includes text models of sizes up to 354M parameters, as opposed to the 63M parameters in ClipText.

How Stable Diffusion Works: How Is CLIP Trained?

CLIP is trained on a dataset of images and their captions. Think of a dataset looking like this, only with 400 million images and their captions:

In actuality, CLIP was trained on images crawled from the web along with their "alt" tags.

CLIP is a combination of an image encoder and a text encoder. Its training process can be simplified to thinking of taking an image and its caption. We encode them both with the image and text encoders respectively.

We then compare the resulting embeddings using cosine similarity. When we begin the training process, the similarity will be low, even if the text describes the image correctly.

We update the two models so that the next time we embed them, the resulting embeddings are similar.

By repeating this across the dataset and with large batch sizes, we end up with the encoders being able to produce embeddings where an image of a dog and the sentence "a picture of a dog" are similar. Just like in word2vec, the training process also needs to include negative examples of images and captions that don't match, and the model needs to assign them low similarity scores.
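
A compact sketch of that training signal (a simplified CLIP-style contrastive loss; the image and text encoders themselves are placeholders here):

```python
# Sketch: matching image/caption pairs sit on the diagonal of a similarity matrix;
# every off-diagonal pair acts as a negative example with a low target similarity.
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature   # cosine similarity for every pair in the batch
    targets = torch.arange(len(logits))                   # image i matches caption i
    return (F.cross_entropy(logits, targets) +            # pull matching pairs together,
            F.cross_entropy(logits.T, targets)) / 2       # push mismatched pairs apart

# usage (hypothetical encoders): loss = clip_style_loss(image_encoder(images), text_encoder(captions))
```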

Feeding Text Information Into The Image Generation Process

To make text a part of the image generation process, we have to adjust our noise predictor to use the text as an input.

Our dataset now includes the encoded text. Since we're operating in the latent space, both the input images and predicted noise are in the latent space.
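
In code, that adjustment shows up as the text embeddings being handed to the noise predictor alongside the noisy latents and the timestep. A sketch with diffusers' UNet2DConditionModel (the model id is an example, and the random embeddings stand in for real ClipText output):

```python
# Sketch: the noise predictor now takes the text embeddings as a third input.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

noisy_latents = torch.randn(1, 4, 64, 64)   # latent-space input
timestep = torch.tensor([999])              # which denoising step we are on
text_embeddings = torch.randn(1, 77, 768)   # stand-in for the encoded prompt

with torch.no_grad():
    noise_pred = unet(noisy_latents, timestep, encoder_hidden_states=text_embeddings).sample
print(noise_pred.shape)  # (1, 4, 64, 64): the predicted noise, also in latent space
```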

To get a better sense of how the text tokens are used in the Unet, let's look deeper inside the Unet.

Layers of the Unet Noise predictor (without text)

Let's first look at a diffusion Unet that does not use text. Its inputs and outputs would look like this:

Inside, we see that:

  • The Unet is a series of layers that work on transforming the latents array
  • Each layer operates on the output of the previous layer
  • Some of the outputs are fed (via residual connections) into the processing later in the network

  • The timestep is transformed into a time step embedding vector, and that's what gets used in the layers

Layers of the Unet Noise predictor WITH text

Let's now look at how to alter this system to include attention to the text.

The main change we need to make to the system to add support for text inputs (technical term: text conditioning) is to add an attention layer between the ResNet blocks.

Note that the ResNet block doesn't directly look at the text. But the attention layers merge those text representations in the latents. And now the next ResNet can utilize that incorporated text information in its processing.
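
A minimal sketch of such a cross-attention layer (simplified to a single head, without the extra projections and normalization real implementations use):

```python
# Sketch: cross-attention lets the latent features "look at" the text tokens.
# Queries come from the image latents; keys and values come from the text embeddings.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim)
        self.to_k = nn.Linear(text_dim, latent_dim)
        self.to_v = nn.Linear(text_dim, latent_dim)

    def forward(self, latents, text_embeddings):
        q = self.to_q(latents)                       # (batch, h*w, latent_dim)
        k = self.to_k(text_embeddings)               # (batch, 77, latent_dim)
        v = self.to_v(text_embeddings)
        scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)         # how much each latent position attends to each token
        return attn @ v                              # text information merged into the latents

layer = CrossAttention()
out = layer(torch.randn(1, 64 * 64, 320), torch.randn(1, 77, 768))  # -> (1, 4096, 320)
```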

Conclusion

I hope this gives you a good first intuition about how Stable Diffusion works. Lots of other concepts are involved, but I believe they're easier to understand once you're familiar with the building blocks above.
