
👨‍🎨 [AI] AI Image Generation Technology Overview

· 16 min read
卤代烃
WeChat Official Account: 卤代烃实验室
Quote

"The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material." ― Michelangelo


If ChatGPT's launch in late 2022 marked the beginning of the AI large model era, then the impressive text-to-image models released earlier that year were the prelude to it. People were amazed by ChatGPT's intelligence, and equally stunned by the creative images generated by DALL·E 2, Midjourney, and Stable Diffusion.

This document provides a comprehensive introduction to AI image generation technology, covering its development history, technical principles, and practical applications to help you enter this creative world.


Main Applications

Currently, there are three major international AI image generation applications:

| Series | Open Source | Latest Version | Usage Method |
| --- | --- | --- | --- |
| DALL·E | No (OpenAI product) | DALL·E 3 | Directly available in ChatGPT |
| Midjourney | No | Midjourney 6 | Discord bot |
| Stable Diffusion | Yes | Stable Diffusion 3, XL, etc. | Self-deployment; beginners typically use integration packages |

Many Chinese companies have also released image generation applications, some developed in-house and others fine-tuned from the open-source Stable Diffusion models. However, their overall quality doesn't yet match DALL·E and Midjourney, so I won't list them all here.


Development Timeline

| Year | Key Releases |
| --- | --- |
| 2020 | DETR, DDPM, DDIM |
| 2021 | CLIP, DALL·E, DMBG |
| 2022 | BLIP, DALL·E 2, Midjourney V3, Stable Diffusion, Stable Diffusion 2, ChatGPT |
| 2023 | BLIP-2, GPT-4, Midjourney V5, SAM, DALL·E 3, Midjourney 6 |
| 2024 | Sora, Stable Diffusion 3 |

The table above lists key technical milestones that influenced image generation development, all occurring in recent years:

  • 2020: Landmark papers DDPM and DDIM established the theoretical foundation of Diffusion Models, while DETR brought Transformers into computer vision
  • 2021: CLIP paper established text-image relationships, laying the theoretical groundwork for text2image
  • 2022: Application explosion with DALL·E 2, Midjourney 3, and Stable Diffusion creating stunning works
  • 2023+: Focus on model optimization and parameter scaling to improve generation stability and quality ceiling

Feeling overwhelmed by the technical terms above? Don't worry. The next section will explain the model details in simple terms, helping you understand the principles of image generation models to better use these services.


Technical Principles

Two of the three major AI image generation applications are closed-source, but based on their published papers and Stable Diffusion's open-source code, industry analysis suggests they share a similar technical foundation built on two core models:

  • CLIP establishes relationships between text and images, enabling cross-modal semantic alignment
  • Diffusion Models can generate images from noise through a "creation from nothing" process

The combination of these two models enables text-to-image generation. Let's explore how this works in detail.


CLIP

Let's start with CLIP.

In traditional deep learning models, different functionalities are typically separated. NLP models only process text content, while CV models handle image-related tasks. This led to the question: could a single model handle multiple modalities simultaneously?

Today's GPT-4o embodies this multimodal approach, accepting text, images, audio, and video as input. Several years ago, however, no such capability existed. In January 2021, OpenAI released CLIP (Contrastive Language-Image Pre-training), a pre-trained neural network model designed to match images with text and achieve cross-modal semantic alignment.

This influential paper became foundational; both DALL·E and Stable Diffusion rely on CLIP as their TextEncoder. Let's examine how this algorithm works.


Training Data

To achieve image-text alignment, a relevant training dataset is required. Computer scientists needed to prepare "text-image pairs" - text descriptions of image content - to feed into the model.

Where can we find such vast amounts of text<->image data in real life? HTML provides the answer:

```html
<img
  src="/media/cc0-images/grapefruit-slice-332-332.jpg"
  alt="Grapefruit slice atop a pile of other slices"
/>
```

HTML img tags naturally include alt attributes that typically describe the image content. OpenAI, known for scaling through massive data, downloaded 400 million images with alt text descriptions from the internet for their training set.
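As a toy illustration (using only Python's standard-library HTML parser; the class and variable names here are my own, not OpenAI's pipeline), (image URL, alt text) pairs can be collected like this:

```python
# Toy example: collect (image URL, alt text) pairs from HTML.
from html.parser import HTMLParser

class ImgAltCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Keep only <img> tags that carry both a src and a non-empty alt attribute.
        if tag == "img":
            attrs = dict(attrs)
            if attrs.get("src") and attrs.get("alt"):
                self.pairs.append((attrs["src"], attrs["alt"]))

collector = ImgAltCollector()
collector.feed(
    '<img src="/media/cc0-images/grapefruit-slice-332-332.jpg" '
    'alt="Grapefruit slice atop a pile of other slices" />'
)
print(collector.pairs)
# [('/media/cc0-images/grapefruit-slice-332-332.jpg',
#   'Grapefruit slice atop a pile of other slices')]
```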

You might question whether alt descriptions accurately reflect image content. This concern is valid: the alt text only goes through an initial cleaning pass to remove obvious junk (such as advertising copy), but the sheer scale of 400 million pairs smooths over many of the remaining problems. Of course, more precise descriptions do improve model capability. Later models like DALL·E 3 and Stable Diffusion 3 raised their overall accuracy by improving the quality of their training captions, but that's a later development.


Pre-training

With the data prepared, training can begin. CLIP uses contrastive learning to help models find text-image matching relationships. Since it involves two modalities, two separate models extract features:

  • Text Encoder: Extracts text features using an NLP model, such as a Text Transformer
    • Input: Text
    • Output: Feature values [T1, T2, ... TN]
  • Image Encoder: Extracts image features using a common CV CNN model
    • Input: Image
    • Output: Feature values [I1, I2, ... IN]

Next comes contrastive learning: pairing the two feature sets yields an $N \times N$ similarity matrix, i.e. $N^2$ possible text-image pairs. The positive samples lie on the matrix diagonal (only $N$ of them), while the negative samples total $N^2 - N$. CLIP's training objective is to maximize the similarity of the $N$ positive pairs while minimizing the similarity of the negative pairs.
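A minimal PyTorch-style sketch of this objective, loosely following the pseudocode in the CLIP paper (the encoder modules and the temperature value are placeholders, not the real CLIP hyperparameters):

```python
# Sketch of CLIP's symmetric contrastive loss over a batch of N text-image pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(images, texts, image_encoder, text_encoder, temperature=0.07):
    # Encode both modalities and L2-normalize the features: [N, D] each.
    image_features = F.normalize(image_encoder(images), dim=-1)
    text_features = F.normalize(text_encoder(texts), dim=-1)

    # N x N cosine-similarity matrix: the diagonal holds the N positive pairs,
    # the remaining N^2 - N entries are negatives.
    logits = image_features @ text_features.T / temperature

    # Symmetric cross-entropy pulls the diagonal up and pushes everything else down.
    labels = torch.arange(logits.shape[0], device=logits.device)
    loss_image_to_text = F.cross_entropy(logits, labels)
    loss_text_to_image = F.cross_entropy(logits.T, labels)
    return (loss_image_to_text + loss_text_to_image) / 2
```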

After learning from 400 million data points, the text-image matching relationship is established.

tip

CLIP has zero-shot classification capabilities, but that's not the focus of this document, so we'll skip it for now.


Diffusion

Origin of the Term

"Diffusion" - meaning spread or propagation - is borrowed from thermodynamics.

In thermodynamics, diffusion is defined as:

info

Diffusion phenomenon refers to the process where material molecules transfer from high-concentration areas to low-concentration areas until uniform distribution is achieved.


You can observe diffusion in a video of blue ink spreading through a glass of water (video: diffusion process; final state):

Over time, the water becomes uniformly blue, reaching an isotropic normal distribution.


Applying this phenomenon to images creates an analogy:

Starting with a normal image (a glass of water), we add random noise (blue ink). As noise increases (ink continues spreading), the image eventually becomes pure noise (uniformly blue water):

The similarities between these processes led to naming this approach "Diffusion Model."


Conceptual Understanding

In computer science, unlike in physics where we accept that "spilled water cannot be recovered," we have more control: since we are the ones actively adding the noise, we have room to manipulate the process.

If we record the noise added at each step, we can reverse the process by gradually reducing noise, potentially restoring the original image from noise:


Technical Analysis

The above gives an intuitive picture. Now let's examine the technical details (without diving deep into the mathematical formulas).

To generate an image from noise, we need to denoise step by step. While the denoising structure is similar, the denoising intensity clearly differs between step 1 and step 1000:


To help the model know the appropriate denoising intensity for each step, we modify the model. The denoise input now includes not only the noisy image but also the step number:


What does the denoise module look like inside? The diagrams below show its key component, a Noise Predictor, which predicts the noise contained in the input at each step; subtracting this predicted noise from the input produces a clearer output:


Now the problem becomes how to train a Noise Predictor. The key challenge is how the Noise Predictor knows whether its output is correct. This requires training through reverse thinking - we need to create relevant training data.

Creating this data is straightforward. We take images from the internet and add incremental noise, creating step-by-step noise images. This process is called Forward Process or Diffusion Process:


This process creates abundant training data! Here's how to understand it:

The Forward Process input/output:

  • Input: Previous noise image, step, noise
  • Output: Next noise image

The Noise Predictor input/output:

  • Input: Next noise image, step
  • Output: noise

This training approach is ingenious - we create massive data through the forward process to train the Noise Predictor needed for reverse operation.
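To make this concrete, here is a heavily simplified DDPM-style training step (a sketch only: it assumes a noise_predictor(noisy_images, t) model is defined elsewhere, and the noise-schedule constants are illustrative):

```python
# Simplified sketch of training a Noise Predictor with the Forward Process.
import torch
import torch.nn.functional as F

T = 1000                                  # total number of noising steps
betas = torch.linspace(1e-4, 0.02, T)     # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(noise_predictor, clean_images):
    batch_size = clean_images.shape[0]
    t = torch.randint(0, T, (batch_size,))           # a random step for each image
    noise = torch.randn_like(clean_images)           # the noise we are about to add

    # Forward Process: jump straight to step t using the closed-form formula.
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    noisy_images = a_bar.sqrt() * clean_images + (1 - a_bar).sqrt() * noise

    # The Noise Predictor sees the noisy image plus the step number,
    # and is trained to recover exactly the noise that was added.
    predicted_noise = noise_predictor(noisy_images, t)
    return F.mse_loss(predicted_noise, noise)
```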


At this point we have only basic, unconditional image generation. To add text guidance, we also feed the text into the denoising phase so it can steer the model's output, enabling text-to-image generation:


Of course, since the Noise Predictor's inputs have changed, the training process must be adjusted as well: the training data produced by the Forward Process must now carry the corresponding text:

With this in place, text-to-image generation becomes possible.


AI Image Generation Architecture

With the foundation prepared, combining both components enables text-to-image generation!

Text-to-image models consist of 3 main components (each is an independent neural network model):

  • A TextEncoder
    • Input: Text descriptions
    • Output: Text feature vectors
  • A Generation Model (typically a Diffusion Model)
    • Input: Text feature vectors and noise
    • Output: Intermediate Latent Representation (essentially a compressed version of the image)
  • A Decoder
    • Input: Latent representation
    • Output: Final image

Research papers show that Stable Diffusion and DALL·E share the same architecture (as shown below). Let's examine each component in detail:


TextEncoder

As we saw in the diffusion section, we already have basic text-to-image capability. A Google research study found that for text-to-image models, image quality correlates strongly with the quality of the text language model, but has little correlation with the size of the image generation model. In other words, the text side deserves the investment: we need a more powerful language model.

The model needs to establish text-image relationships and output "text feature vectors." The CLIP model we discussed earlier can accomplish this. Both Stable Diffusion and DALL·E's TextEncoders are based on CLIP, so we won't elaborate further here.


Decoder

Let's start with the Decoder for better overall understanding.

The Decoder's role is to generate the final image from a noise-like intermediate product. How do we obtain this Decoder, or how do we train it? Decoder training is quite ingenious.

Training a model requires data. Let's first assume the intermediate product is a "thumbnail." The problem then becomes: how can we decode a "thumbnail" into a "high-resolution image"?

  • The internet contains many high-resolution images. We can compress "high-resolution images" into "thumbnails," manually creating "thumbnail-high-resolution" pairs
  • Using these data pairs, we can train a Decoder to restore "thumbnails" to "high-resolution images"


If the intermediate product isn't a "thumbnail" but a noise-like, high-dimensional representation that the human eye cannot interpret (which we call the Latent Representation), how do we create data pairs? The answer is to train an autoencoder (a minimal sketch follows the list):

  • This model first Encodes images into Latent Representation
  • Then can Decode Latent Representation back into images
  • This way we get both an Encoder and a Decoder, where the Decoder can be used in the final step
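A minimal autoencoder sketch (layer sizes are illustrative, chosen so that a 3×512×512 image compresses into a 4×64×64 latent, the shape mentioned for Stable Diffusion later in this document):

```python
# Sketch of an autoencoder: Encoder compresses an image into a latent,
# Decoder reconstructs the image from that latent.
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 3 x 512 x 512 image -> 4 x 64 x 64 latent (downsample by 8).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 4, 3, stride=2, padding=1),
        )
        # Decoder: latent -> reconstructed 3 x 512 x 512 image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, image):
        latent = self.encoder(image)              # the Latent Representation
        reconstruction = self.decoder(latent)     # trained to match the input image,
        return reconstruction, latent             # e.g. with an MSE reconstruction loss
```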


Generation Model

This section was introduced in the Diffusion chapter above, but there are slight differences due to engineering performance considerations.

In the previous Forward Process stage, noise was added directly to the images themselves (DDPM). If the images are large, this is computationally expensive:

For this reason, engineering practice moves these operations from the original image resolution onto the much smaller intermediate Latent Representation, the latent diffusion approach. For Stable Diffusion, for example, the latent is a fixed $(4, 64, 64)$ tensor, much smaller than the variable original image sizes.

So how is Latent Representation generated? The autoencoder above is already trained, and the Encoder can be used directly in the Diffusion Model's Forward Process stage. The original image goes through the Encoder to generate an initial Latent Representation, then noise is gradually added to this Latent to calculate the final required Latent:

The above diagram evolves into this, with the yellowish middle section being the Latent Representation


The final Diffusion Model architecture is as follows; each Denoise layer works like this (see the code sketch after the list):

  • Input: Round number, previous round's Latent, and "text feature vectors"
  • Output: Partially denoised Latent
  • Multiple iterations
  • Finally sent to autoencoder's Decoder to produce the final image
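Putting the pieces together, here is a rough sketch of that sampling loop (scheduler_step stands in for the real DDPM/DDIM update math, and the three models are assumed to be already trained; none of these names come from an actual library):

```python
# High-level sketch of text-to-image sampling in latent space.
import torch

@torch.no_grad()
def generate_image(prompt_tokens, text_encoder, noise_predictor, decoder,
                   scheduler_step, steps=50):
    text_features = text_encoder(prompt_tokens)    # "text feature vectors"
    latent = torch.randn(1, 4, 64, 64)             # start from pure noise in latent space

    for step in reversed(range(steps)):
        # Each Denoise round: step number + previous latent + text features in,
        # a partially denoised latent out.
        predicted_noise = noise_predictor(latent, step, text_features)
        latent = scheduler_step(latent, predicted_noise, step)

    return decoder(latent)                         # the autoencoder's Decoder makes the final image
```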


This concludes our introduction to AI image generation's underlying architecture:


Relationship with Transformers

As we know, this current AI wave was initiated by OpenAI, and today's most famous LLMs are based on Transformers. So what's the connection between text-to-image and Transformers?

This becomes clear once we separate the definitions. A Diffusion Model is an algorithmic concept (a generative process), while a Transformer is a neural network architecture. Today the most widely used backbone for Diffusion Models is the UNet, but it can be swapped for a Transformer. Last month's Stable Diffusion 3 update switched to a Transformer-based architecture, and OpenAI's Sora (a text-to-video product) is also speculated to be built on Transformers.


Core Capabilities

This section introduces the most common fundamental capabilities in AI image generation.

Text2image (Text-to-Image)

Text-to-image is the core function of image generation. The workflow is as follows:

  • Extract text embeddings from the input prompt using the TextEncoder
  • Generate random noise
  • Combine text embeddings and noise, feed into Diffusion Model
  • Iterate multiple times, then obtain final image through Decoder

Several parameters significantly affect the final image quality in text-to-image (a usage sketch follows the list):

  • Steps: Number of denoising or sampling steps during inference. More steps produce better image quality but require longer inference time. SD typically generates stable images with 30-50 steps
  • CFG_Scale: How strongly the output follows the prompt. Higher values keep the output closer to the prompt but may cause distortion; lower values drift from the prompt but often have better image quality. Recommended range: 7-10
  • negative_prompt: Reverse prompt describing content you want to avoid in the image
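For reference, here is roughly how these parameters appear when calling Stable Diffusion through the Hugging Face diffusers library (the model ID and prompts are placeholders, and a CUDA GPU is assumed):

```python
# Text-to-image usage sketch with diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",    # example model ID; substitute your own
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a lighthouse on a cliff at dawn, oil painting",
    negative_prompt="blurry, low quality, extra fingers",
    num_inference_steps=30,              # Steps
    guidance_scale=7.5,                  # CFG_Scale
).images[0]
image.save("lighthouse.png")
```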

Image2image (Image-to-Image)

Image-to-image is a variant of text-to-image. While text-to-image starts with random noise, image-to-image takes an existing image, adds noise (diffusion), combines the noisy image with the prompt, then denoises to learn structures from the original image:

Compared with text-to-image, an extra step adds noise to the original image

Compared to text-to-image, image-to-image has one additional parameter, strength (see the usage sketch after this list):

  • strength: Parameter between 0-1 indicating the amount of noise added to the input image. Higher values add more noise and disrupt the original image more. Recommended range: 0.6-0.8
    • When strength=1, it becomes random noise, equivalent to pure text-to-image
    • When strength=0, little or no noise is added, so the output clings to the original image (prone to "overfitting" the input)
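A corresponding image-to-image sketch with diffusers (again, the model ID and file names are placeholders):

```python
# Image-to-image usage sketch with diffusers.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",    # example model ID; substitute your own
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB")

result = pipe(
    prompt="a watercolor painting of a mountain village",
    image=init_image,
    strength=0.7,          # how much noise to add to the original image (0-1)
    guidance_scale=7.5,
).images[0]
result.save("img2img.png")
```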

Inpainting (Partial Image Redrawing)

This is another variant of the original text-to-image capability that can edit local image details. These features are now implemented in various image editors, for example:

  • Turning an elephant facing away from the viewer to face forward
  • Removing unwanted people from vacation photos
  • Changing clothing colors
  • And much more...

Here's a brief overview of how inpainting works (a usage sketch follows):

  • Input an original image and a Mask
  • Areas outside the mask remain unchanged through technical processing
  • Areas inside the mask are noised, combined with text embeddings, and denoised
  • Generate the final image
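And an inpainting sketch with diffusers (model ID, file names, and prompt are placeholders; white pixels in the mask mark the region to redraw):

```python
# Inpainting usage sketch with diffusers.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",   # example model ID; substitute your own
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("vacation_photo.png").convert("RGB")
mask = Image.open("mask.png").convert("L")    # white = area to redraw

result = pipe(
    prompt="an empty beach at sunset",
    image=init_image,
    mask_image=mask,
).images[0]
result.save("inpainted.png")
```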

Outpainting (Image Expansion)

Similar to inpainting, outpainting uses mask guidance to generate expanded images beyond the original boundaries.


ControlNet

AI image generation can be unstable. ControlNet partially addresses stability issues through control mechanisms like object edges, human pose skeletons, and depth maps.


LoRA (Fine-tuned Small Models)

LoRA stands for Low-Rank Adaptation, an efficient model fine-tuning technique that allows different fine-tuning weights to be easily combined with base models. In the Stable Diffusion community, sharing custom fine-tuned LoRA models is very popular for generating images in specific styles.
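Conceptually, LoRA freezes the original weight matrix and trains only a small low-rank correction on top of it. A minimal sketch of the idea (not Stable Diffusion's actual implementation):

```python
# Sketch of a LoRA-wrapped linear layer: W x + (B A) x, with W frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze the original weights
        # Low-rank factors: only these (few) parameters are trained.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because the low-rank update is tiny compared with the base weights, LoRA files stay small and can be mixed and matched with the same base model, which is why sharing them is so popular in the Stable Diffusion community.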

The popular anime model NovelAI from last year was created by fine-tuning on a large collection of anime images:


Current Limitations

  • Probabilistic Nature: Since Diffusion Models are fundamentally probabilistic, even with identical prompts, output quality depends on "luck". Users often call this "gacha pulling" - whether you get the desired image comes down to chance.
  • Single-Layer Output: Generated images are complete compositions without layers. This makes it difficult for artists who work with hundreds of layers to make adjustments or modifications. Current workflows can only use AI for creative guidance or minor tasks like background removal and decorative elements.
  • Copyright Issues: Models learn many artists' styles during training. Since Stable Diffusion is open-source, users can easily perform style fine-tuning, creating numerous copyright concerns that particularly affect existing artists.
  • Industry Resistance: Companies like NetEase and Bilibili faced user backlash when implementing AI images in games (promotional banners, character designs, etc.). While these companies quickly apologized and claimed to stop using AI (though they still use it for minor elements), some artists and fans stigmatize AI image generation as "stitching together corpse pieces," creating serious division.
