A Gentle Introduction to Guidance in Diffusion Models
Diffusion models have become a cornerstone of modern generative AI, capable of creating stunningly realistic images, audio, and more. While generating random high-quality samples is impressive, the real magic happens when we can control what the model generates. This control is achieved through a technique called guidance.
This post gives an overview of guidance techniques in diffusion models. We will begin with the foundational concepts of classifier guidance and classifier-free guidance. We will then dive into advanced applications like multi-condition guidance, history guidance, and inner guidance.
A Quick Recap of Diffusion Models
Let’s assume you have a basic understanding of how diffusion or flow-matching models work. In short, training a diffusion model involves two processes:
- Forward Process: We start with a clean data point $x_0$ (like an image) and incrementally add Gaussian noise over a series of timesteps $t = 1, \dots, T$. By the final timestep $T$, the data becomes indistinguishable from pure Gaussian noise. (A minimal code sketch of this closed-form noising step follows this list.)
- Reverse Process: We train a neural network to reverse this process. During inference, the model starts with random noise and gradually “denoises” it over the same timesteps to produce a clean sample.
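For concreteness, the forward process has a convenient closed form: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Here is a minimal PyTorch sketch of that noising step, assuming `alpha_bar` is a precomputed tensor of cumulative noise-schedule products:

```python
import torch

def forward_noise(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    noise = torch.randn_like(x0)
    # Gather the per-sample coefficient and broadcast it over the data dimensions.
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise
    return x_t, noise
```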
The reverse process can be modeled as a conditional probability distribution, where the model predicts the data at the previous timestep given the current noisy sample, $p_\theta(x_{t-1} \mid x_t)$.
While there are many ways to parameterize a diffusion model (e.g., predicting the noise $\epsilon$, predicting the clean image $x_0$, or predicting a velocity), the score-based view is the most convenient one for understanding guidance.
Here, the gradient of the log probability of the data distribution, $\nabla_{x_t} \log p(x_t)$, is called the score, and the network $s_\theta(x_t, t)$ is trained to approximate it.
Intuitively, this score function points toward regions of higher data density: it tells us in which direction to nudge a noisy sample $x_t$ so that it looks more like real data.

The denoising update rule can then be expressed using this learned score:
$$
\begin{equation}
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, s_\theta(x_t, t) \right)
\end{equation}
$$
where $\alpha_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ are the noise-schedule coefficients, and $s_\theta(x_t, t)$ is the network's prediction at timestep $t$.
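To make Equation 1 concrete, here is a minimal sketch of a single reverse step. The names `model`, `alpha`, `alpha_bar`, and `sigma` are assumed placeholders for your own schedule and network; this illustrates the update rule, not a production sampler:

```python
import torch

def reverse_step(model, x_t, t, alpha, alpha_bar, sigma):
    """One denoising step x_t -> x_{t-1} using the mean from Equation 1."""
    s = model(x_t, t)                                    # s_theta(x_t, t)
    coeff = (1 - alpha[t]) / torch.sqrt(1 - alpha_bar[t])
    mean = (x_t - coeff * s) / torch.sqrt(alpha[t])
    if t > 0:                                            # no noise is added at the final step
        return mean + sigma[t] * torch.randn_like(x_t)
    return mean
```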
Further Discussion on Diffusion
Here, I want to share two insightful perspectives on diffusion models that reframe our understanding of how they work, drawing from concepts in recent literature that I find particularly compelling.
1) Noise as Frequency-Domain Masking. The process of adding noise can be intuitively understood as a form of masking in the frequency domain. This perspective draws a compelling parallel to Masked Autoencoders (MAE), a successful self-supervised learning (SSL) technique where masking parts of the input forces the model to learn robust semantic representations. This might explain why diffusion models develop strong internal representations. While SSL methods are still arguably superior in pure representation learning, the potential for diffusion models to bridge the gap is immense. Works like REPA have shown that better representations improve generation. The inverse—that the act of generation improves representation—remains a promising frontier yet to be fully realized.
2) Denoising as Autoregression on the Noise Axis. While diffusion and autoregressive models are often treated as distinct families, the denoising process is fundamentally a form of autoregression along the noise-level axis. Think of a standard autoregressive model, which predicts the next element in a sequence given the previous one: $p(\text{token}_i \mid \text{token}_{i-1})$. A diffusion model does something analogous, except that its “sequence” is indexed by noise level rather than position: at every step it predicts a slightly cleaner sample from the current, noisier one, $p(x_{t-1} \mid x_t)$.
Introducing Conditions for Controllable Generation
To control the generation process, we need to make it conditional. We want to sample not from the general data distribution $p(x)$, but from a conditional distribution $p(x \mid y)$, where $y$ is our condition: a class label, a text prompt, an image, and so on.
Intuitively, we can just modify our score function to be conditional and train the network to approximate $\nabla_{x_t} \log p(x_t \mid y)$ directly. This works, but asking a single generative model to both denoise and fully capture the condition is demanding, and we can often do better by bringing in a separate classifier.
Why? Generating a sample that perfectly matches a condition is often much harder than simply classifying it. For instance, a model that can generate a perfect image of a “Siberian Husky” must implicitly be a perfect classifier for Siberian Huskies. It’s more likely that a dedicated classifier model will be better at recognizing the condition than a generative model will be at producing it.
This observation is the key insight behind classifier guidance.
Classifier Guidance
Instead of relying solely on the generative model, we can leverage the power of a separate, pre-trained classifier to “guide” the diffusion process.
We start with the conditional score $\nabla_{x_t} \log p(x_t \mid y)$. By Bayes' rule, $p(x_t \mid y) \propto p(x_t)\, p(y \mid x_t)$, so taking the log and differentiating with respect to $x_t$ gives $\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$.
This elegant equation shows that the conditional score is just the unconditional score plus a guidance term. This guidance term, $\nabla_{x_t} \log p(y \mid x_t)$, is precisely the gradient of a classifier's log-likelihood with respect to the noisy input, which we can obtain from any differentiable classifier trained on noisy samples.
To control the strength of this guidance, we introduce a guidance scale hyperparameter $\lambda$ that multiplies the classifier term, so the guided score becomes the model's score plus $\lambda \, \nabla_{x_t} \log p(y \mid x_t)$.
Let’s take a look at the weighted formulation in the probability space:
$$
\begin{equation}
\widetilde{p}_\theta (x_t | y) \propto p_\theta (x_t | y) \cdot p_\theta (y | x_t)^\lambda
\end{equation}
$$
So its physical meaning is to re-weight the distribution so that samples the classifier assigns a higher probability of belonging to class $y$ become more likely; a larger $\lambda$ sharpens this effect and pushes generation more aggressively toward the condition.
Let’s look at how the guidance term is calculated in practice, using OpenAI’s guided-diffusion code as a reference:
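The heart of that code is a `cond_fn` that backpropagates the classifier's log-probability of the target class through the noisy input. The snippet below is a condensed sketch in the spirit of that reference (the `classifier` and `classifier_scale` arguments are placeholders), not a verbatim copy:

```python
import torch
import torch.nn.functional as F

def cond_fn(x, t, y, classifier, classifier_scale=1.0):
    """Compute lambda * grad_x log p(y | x_t), the classifier-guidance term."""
    with torch.enable_grad():
        x_in = x.detach().requires_grad_(True)
        logits = classifier(x_in, t)               # classifier trained on noisy inputs
        log_probs = F.log_softmax(logits, dim=-1)
        selected = log_probs[range(len(logits)), y.view(-1)]  # log p(y | x_t) per sample
        grad = torch.autograd.grad(selected.sum(), x_in)[0]
    return grad * classifier_scale
```

During sampling, this gradient is added to the model's predicted mean (scaled by the step's variance) so that each denoising step drifts toward the target class.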
However, classifier guidance has some clear drawbacks:
- Extra Work: It requires training and maintaining a separate classifier model that must be compatible with noisy inputs.
- Limited Scope: It can only guide based on the classes or attributes the classifier was trained on.
- Inflexibility: Extending it to more complex conditions like free-form text or image-to-image translation is difficult and cumbersome.
Classifier-Free Guidance (CFG)
Given the limitations of classifier guidance, researchers sought a way to get the same control without needing a separate classifier. This led to the development of classifier-free guidance (CFG), a clever and powerful technique that is now standard in deep generative models.
Rearranging Equation 7, we get $\nabla_{x_t} \log p(y \mid x_t) = \nabla_{x_t} \log p(x_t \mid y) - \nabla_{x_t} \log p(x_t)$: the classifier term is just the difference between a conditional and an unconditional score.
We can substitute this back into Equation 7 and add a guidance scale $w$, which yields the classifier-free guided score $\tilde{s}_\theta(x_t, y) = (1 + w)\, s_\theta(x_t, y) - w\, s_\theta(x_t)$. The classifier has disappeared entirely; all we need are a conditional and an unconditional score estimate.
What if we could teach a single neural network to predict both?
That’s exactly what CFG does. We train one diffusion model $s_\theta(x_t, y)$ and, during training, randomly replace the condition $y$ with a special null token $\varnothing$ for a small fraction of the samples.
This teaches the model to act in two modes:
- When given a condition $y$, it predicts the conditional score $s_\theta(x_t, y) \approx \nabla_{x_t} \log p(x_t \mid y)$.
- When given the null token $\varnothing$, it predicts the unconditional score $s_\theta(x_t, \varnothing) \approx \nabla_{x_t} \log p(x_t)$.
As illustrated below, this formulation intuitively pushes the generation process more firmly in the direction of the condition.

Classifier-free guidance elegantly solves the problems of its predecessor:
- Simplicity: It only requires one model, simplifying the training and inference pipeline.
- Flexibility: It works seamlessly with any type of conditioning signal, including complex ones like text embeddings (from models like CLIP) or images, making it incredibly versatile.
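To make the two-pass procedure concrete, here is a minimal sketch of a single CFG prediction. Note that `guidance_scale` here plays the role of $1 + w$ in the formula above; many implementations use this equivalent parameterization, and in practice the two forward passes are usually batched together:

```python
import torch

def cfg_prediction(model, x_t, t, cond, null_cond, guidance_scale):
    """Classifier-free guidance: two forward passes, one linear combination."""
    pred_cond = model(x_t, t, cond)           # conditional prediction s_theta(x_t, y)
    pred_uncond = model(x_t, t, null_cond)    # unconditional prediction s_theta(x_t)
    # Push the prediction away from the unconditional one, toward the condition.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```

With `guidance_scale = 1` this reduces to the plain conditional model; larger values trade sample diversity for stronger adherence to the condition.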
Further Discussion on CFG
From the perspective of compute, CFG leverages additional compute during inference to achieve superior generation quality. This philosophy is conceptually analogous to the advanced reasoning techniques now common in Large Language Models (LLMs). To solve highly complex tasks, modern LLMs don’t just produce an answer; they generate intermediate “reasoning” steps (e.g., chain-of-thought), effectively using more inference-time computation to “think” through the problem. This very strategy has led to groundbreaking achievements, including LLMs solving problems at the level of the International Mathematical Olympiad (IMO).
This parallel highlights what could be considered a fundamental principle in AI: performance scales with effective computation. Throughout the history of AI, the models and methods that can most effectively utilize greater computational resources—whether during training or inference—have consistently prevailed and defined the state of the art. While current CFG implementations are straightforward, often just doubling the number of function evaluations per step, the underlying concept holds vast potential. From my personal perspective, the idea of using guidance as a mechanism for inference-time scaling in diffusion models is still in its infancy. In the future, generating targets of unprecedented complexity may involve more sophisticated guidance strategies that dynamically allocate computational effort to refine the output, moving far beyond the simple two-pass system used today.
Multi-Condition Guidance
We will show that CFG is easily extended to the multi-condition case. For simplicity, we start with two conditions and take InstructPix2Pix as an example. It performs an image editing task and receives two conditions: an image condition $c_I$ (the image to be edited) and a text condition $c_T$ (the editing instruction).
Note that each condition has its own null token, so during training the model learns to operate with either condition, both, or neither.
From Equation 9, we can see that estimating the two-condition score requires three network evaluations per step: one conditioned on both $c_I$ and $c_T$, one conditioned on $c_I$ alone, and one fully unconditional.
People may notice that Equation 12 admits multiple possible formulations, such as switching the positions of $c_I$ and $c_T$: the decomposition of the joint conditional is not unique, and different orderings lead to slightly different (but equally valid) guidance formulas.
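To make the two-condition case concrete, here is a sketch of the InstructPix2Pix-style combination: three forward passes per step, with each condition getting its own guidance scale (argument names are illustrative):

```python
import torch

def two_condition_cfg(model, x_t, t, c_img, c_txt, null_img, null_txt, s_img, s_txt):
    """Two-condition guidance with an image condition and a text condition."""
    pred_uncond = model(x_t, t, null_img, null_txt)  # neither condition
    pred_img = model(x_t, t, c_img, null_txt)        # image condition only
    pred_full = model(x_t, t, c_img, c_txt)          # both conditions
    return (pred_uncond
            + s_img * (pred_img - pred_uncond)   # pull toward the input image
            + s_txt * (pred_full - pred_img))    # then toward the edit instruction
```

Swapping which condition is “peeled off” first gives the alternative formulations mentioned above.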
Now, let’s consider the general case of $n$ conditions $c_1, \dots, c_n$. The same argument applies: the joint conditional score decomposes into an unconditional score plus a chain of per-condition guidance terms, and each term can be weighted with its own guidance scale.
History Guidance
DiT-based full-attention video diffusion models can typically only generate a fixed number of frames at a time. To create long videos, we must generate them in chunks, with each new window conditioned on the previously generated frames (the “history”). A naive approach would be to simply use CFG with the history as the condition. However, History Guidance offers a far more elegant and powerful solution.
The key innovation of History Guidance is the use of a vector of noise timesteps instead of a single scalar value. This means each frame in a video chunk can have its own independent noise level. This simple but powerful change allows us to flexibly define which frames are “conditions” and which are “targets” within a single framework:
- Frames to generate: these are given a high noise level, so they start the denoising process from scratch.
- History frames: these conditioning frames are assigned a noise level of zero, making them clean, known inputs.
This turns complex tasks into simple configurations. For example, video prediction is just generating future frames from a clean past. Video interpolation is generating middle frames when the first and last frames are clean conditions.
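As a hypothetical illustration of this “tasks as configurations” view, the sketch below builds the per-frame noise-level vector for a 16-frame chunk (the frame counts and the convention that $T$ means pure noise are assumptions made for the example):

```python
import torch

def make_noise_levels(num_frames, generate_idx, T):
    """Per-frame noise levels: T (pure noise) for frames to generate, 0 for clean history."""
    k = torch.zeros(num_frames, dtype=torch.long)  # default: clean, known frames
    k[generate_idx] = T                            # frames to be generated from scratch
    return k

# Video prediction: the first 4 frames are clean history, the rest are generated.
k_predict = make_noise_levels(16, generate_idx=list(range(4, 16)), T=1000)

# Video interpolation: first and last frames are clean, the middle is generated.
k_interp = make_noise_levels(16, generate_idx=list(range(1, 15)), T=1000)
```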

Suppose we want to estimate the score of the frames being generated, conditioned on the history frames. Because the model accepts a vector of per-frame noise levels, this score can be evaluated under many different history configurations, and those evaluations can be combined in a CFG-like fashion.
This composition effectively allows us to mix scores conditioned on 1) different history subsequences, and 2) history frames that are partially noisy. This opens up two principal “axes” for controlling the generation process. For simplicity, composition along these two axes is introduced separately.
Time Axis: Temporal History Guidance (HG-t)
This approach involves composing guidance from different subsequences of the (clean) history by setting $k_{\mathcal{H}_i} = 0$ for the frames in each chosen subsequence, and then combining the resulting conditional scores with CFG-style weights.
Frequency Axis: Fractional History Guidance (HG-f)
A major failure mode of using naive CFG for video is that high guidance scales often produce overly static videos. This occurs because vanilla CFG encourages consistency with the history, leading to the trivial solution of simply copying the most recent history frame. HG-f cleverly solves this by guiding the process with fractionally noisy history frames. Fractionally noising the history retains only its low-frequency information, leaving high-frequency details (e.g., fine textures and fast motions) unconstrained by the guidance. This makes videos more dynamic while maintaining consistency, which is mainly associated with low-frequency content. In other words, HG-f keeps the structure of the guidance term but replaces the clean history with a partially noised version of it.
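One way to picture HG-f in code is the sketch below: the history is partially re-noised before being used as the conditioning signal in an otherwise ordinary guidance step. The model interface (`history`, `history_level`) is hypothetical, and this is a simplification of the idea rather than the paper's exact formulation:

```python
import torch

def fractional_history_guidance(model, x_t, t_vec, history, k_frac, alpha_bar, w):
    """Guide the target frames with a fractionally noised history (sketch of the idea)."""
    # Partially noise the history: this keeps its low-frequency structure but
    # discards high-frequency detail, so guidance constrains layout, not motion.
    ab = alpha_bar[k_frac]
    noisy_hist = ab.sqrt() * history + (1 - ab).sqrt() * torch.randn_like(history)
    pred_uncond = model(x_t, t_vec, history=None)                         # no history
    pred_hist = model(x_t, t_vec, history=noisy_hist, history_level=k_frac)
    return pred_uncond + w * (pred_hist - pred_uncond)
```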
Inner-Guidance
Until now, all of the guidance techniques above have rested on a fundamental assumption: the conditions are external signals, independent of the diffusion model itself. Inner-Guidance, introduced in VideoJAM, targets the setting where the condition (for example, the model's own motion prediction) is produced by the model during sampling, so this independence no longer holds and the guidance formulation has to account for it.
Citation
References
- Classifier guidance: Dhariwal & Nichol, “Diffusion Models Beat GANs on Image Synthesis”, NeurIPS 2021.
- Classifier-free guidance (CFG): Ho & Salimans, “Classifier-Free Diffusion Guidance”, 2022.
- InstructPix2Pix: Brooks et al., “InstructPix2Pix: Learning to Follow Image Editing Instructions”, CVPR 2023.
- VideoJAM: Chefer et al., “VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models”, 2025.
- History guidance: Song et al., “History-Guided Video Diffusion”, 2025.
- MeanFlow: Geng et al., “Mean Flows for One-step Generative Modeling”, 2025.
- https://medium.com/@baicenxiao/understand-classifier-guidance-and-classifier-free-guidance-in-diffusion-model-via-python-e92c0c46ec18
- Composing diffusion models: Liu et al., “Compositional Visual Generation with Composable Diffusion Models”, ECCV 2022.
- MAR: Li et al., “Autoregressive Image Generation without Vector Quantization”, NeurIPS 2024.
- Diffusion Forcing: Chen et al., “Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion”, NeurIPS 2024.
- REPA: Yu et al., “Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think”, 2024.