Research Note: Flow Matching for Generative Modeling (Paper link)

summary: Extending the CNF perspective, this paper proposes a brand-new simulation-free training framework, CFM: by constructing conditional probability paths and their corresponding microscopic vector fields, a neural network is trained to directly regress that vector field. The paper further proposes Optimal Transport Conditional Flow Matching (OT-CFM): by deliberately designing the simplest linear probability path, it achieves the theoretically shortest (geodesic) generation path with a physically constant flow velocity \(v\), greatly improving the training and sampling efficiency of the generative model.

idea: Is OT-CFM necessarily the optimal generation path? Does microscopic theoretical perfection imply macroscopic global optimality? Although OT-CFM designs the simplest straight-line path with constant velocity at the single-conditional-sample (microscopic) level, this "optimality" is inevitably challenged when training over the entire dataset (macroscopic). Because the macroscopic vector field \(u_t(x)\) is a weighted average of countless microscopic vector fields, in complex high-dimensional distributions the straight-line trajectories toward different targets unavoidably exhibit extensive Trajectory Crossing. At the crossing points, the average velocity and direction fitted by the neural network necessarily bend and compromise. This means the macroscopic trajectories followed during actual inference-time sampling tend to deviate from the ideal straight lines and constant speed: under the global averaging effect, the microscopic "optimum" in fact degrades into a "suboptimum." Is reducing macroscopic trajectory crossing the core of future optimization? Or is there an even better generation path?


1. Background

  • Vector Field: A mathematical construct that assigns a specific vector (which has both direction and magnitude) to every single point in a given space. You can intuitively think of it as the flow of water in a river. In the context of generative models like Flow Matching, a vector field acts as the “wind” that pushes random noise particles along specific trajectories over time until they form a clear, real image.
  • Ordinary Differential Equation (ODE) : If a vector field is the “wind,” an ODE represents the fundamental physics rules that calculate the exact continuous trajectory a single particle (like a pixel of random noise) will take as it is blown by that “wind” from a starting point to a final destination (a real image). Mathematically, it is typically written as \(\frac{dx(t)}{dt} = v_t(x(t))\), meaning the rate of change (velocity) of a particle \(x\) at time \(t\) is perfectly determined by the vector field \(v_t\) at its exact current location.
  • Continuous Normalizing Flow (CNF) : A type of generative model that uses a neural network to learn the underlying vector field of an ODE . By solving this ODE over continuous time, a CNF smoothly transforms a simple, easy-to-sample distribution (like pure Gaussian noise) into a complex, high-dimensional data distribution (like real images). While theoretically elegant, traditional CNFs are computationally expensive to train because they require sequentially simulating the ODE step-by-step.
  • Diffusion model (Probability Flow ODE) : See the derivation in Consistency Model Blog. The Consistency Model learns to map any point on the curved PF-ODE path directly to the data. In contrast, Optimal Transport Conditional Flow Matching (OT-CFM) constructs a fundamentally simpler straight-line path directly connecting the noise \(x_0\) and the data \(x_1\).
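To make the ODE picture concrete, here is a minimal sketch (my own illustration, not from the paper) that integrates \(\frac{dx(t)}{dt} = v_t(x(t))\) with a forward-Euler solver; the vector field is a hypothetical linear "wind" that blows every particle toward a fixed target point.

```python
import numpy as np

def euler_integrate(vector_field, x0, n_steps=100):
    """Integrate dx/dt = v_t(x) from t=0 to t=1 with forward Euler."""
    x, dt = np.array(x0, dtype=float), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(t, x)
    return x

# Hypothetical vector field: a "wind" blowing every particle toward
# the target point (1, 1), with speed proportional to the distance left.
target = np.array([1.0, 1.0])

def toward_target(t, x):
    return target - x

x_final = euler_integrate(toward_target, [0.0, 0.0])
```

Because this toy field decays exponentially toward the target, the exact solution at \(t=1\) is \((1 - e^{-1})\) of the way there; the Euler estimate lands close to that value.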

2. Continuous Normalizing Flow (CNF)

Let \(x \in \mathbb{R}^d\) denote a data point. A Continuous Normalizing Flow (CNF) transforms a simple prior distribution \(p_0\) into a complex target distribution \(p_1\) using a neural-parameterized vector field \(v_t(x; \theta)\).

1. ODE and Flow formulation The vector field \(v_t\) defines a diffeomorphic flow \(\phi_t\) over time \(t \in [0, 1]\) via the following Ordinary Differential Equation (ODE):

\[\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)) \tag{1}\]

with the initial condition:

\[\phi_0(x) = x \tag{2}\]

2. Push-Forward Equation The macroscopic evolution of the probability density path \(p_t\) is strictly defined by the push-forward operator:

\[p_t = [\phi_t]_* p_0 \tag{3}\]

This indicates that the flow \(\phi_t\) transports the initial density \(p_0\) to generate \(p_t\).

3. Change of Variables (Likelihood computation) To evaluate the exact probability density \(p_t(x)\) at any given point, the push-forward operator is mathematically expanded using the change of variables formula:

\[p_t(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right] \tag{4}\]

The Jacobian determinant \(\det[\dots]\) precisely accounts for the continuous volume deformation induced by the inverse flow \(\phi_t^{-1}\). Ultimately, a vector field \(v_t\) is said to validly generate \(p_t\) if the pair satisfies the continuity equation \(\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v_t) = 0\).

For brevity, the derivation of Equation (4) is omitted here.
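As an illustrative sanity check of Equation (4) (my own example, not from the paper), consider a hypothetical 1D flow \(\phi_t(x) = s(t)\,x\) that rescales space over time: the change-of-variables density must coincide with the Gaussian pdf obtained by pushing \(\mathcal{N}(0,1)\) through the rescaling.

```python
import numpy as np

def gauss_pdf(x, mean=0.0, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical 1D flow phi_t(x) = s(t) * x that rescales space over time.
def s(t):
    return 1.0 + t            # scale grows from 1 to 2 on [0, 1]

def p_t(x, t):
    """Density via Eq. (4): p_0(phi_t^{-1}(x)) * |d phi_t^{-1} / dx|."""
    inv = x / s(t)            # phi_t^{-1}(x)
    jac = 1.0 / s(t)          # derivative of the inverse map
    return gauss_pdf(inv) * jac

# Pushing N(0, 1) through x -> s(t) * x yields N(0, s(t)^2), so the
# change-of-variables density must match that Gaussian pdf directly.
```

The formula and the analytic pushed-forward Gaussian agree pointwise, which is exactly the content of the push-forward equation in this simple affine case.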


3. Conditional Flow Matching (CFM)

To bypass the highly expensive ODE simulations in Continuous Normalizing Flows (CNFs), the Flow Matching framework proposes to directly regress a neural network toward a target vector field. However, before deriving the objective function, we must first clarify the strict mathematical relationship between the macroscopic probability distribution and the microscopic conditional flows.

1. Marginal and Conditional Distributions (Law of Total Probability) In generative tasks, the target dataset \(q(x_1)\) consists of numerous specific real images. Here, \(x_1\) in the formula specifically represents a single, specific real image.

  • Conditional probability density \(p_t(x \mid x_1)\): Embodying the microscopic perspective, it defines the probability density of a particle located at state \(x\) at time \(t\), strictly conditioned on its final trajectory targeting a specific data sample \(x_1\).
  • Marginal probability density \(p_t(x)\): Embodying the macroscopic perspective, it denotes the aggregate probability density of a particle at state \(x\) at time \(t\). Governed by the Law of Total Probability, it is obtained by marginalizing the microscopic conditional densities over all possible target states \(x_1\) across the entire data distribution:
\[p_t(x) = \int p_t(x \mid x_1)q(x_1)dx_1 \tag{5}\]

2. Construction of the Marginal Vector Field Similarly, the wind direction ( vector field ) in the high-dimensional space is also divided into macroscopic and microscopic levels:

  • Conditional vector field \(u_t(x \mid x_1)\): The local, microscopic wind direction specifically guiding pure noise particles toward that specific target image \(x_1\).
  • Marginal vector field \(u_t(x)\): Based on the mathematical continuity equation (omitted here), the macroscopic overall wind direction is essentially the weighted average of all microscopic wind directions passing through that point. At point \(x\), the proportion of particle densities heading to various destinations strictly determines the weights of the total wind direction:
\[u_t(x) = \frac{\int u_t(x \mid x_1)p_t(x \mid x_1)q(x_1)dx_1}{p_t(x)} \tag{6}\]
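A tiny numerical illustration of Equations (5) and (6) (an assumed toy setup, not from the paper): a 1D dataset with two equally likely targets \(x_1 = \pm 1\), hypothetical straight-line Gaussian conditional paths, and the density-weighted averaging that produces the marginal field. At a symmetric crossing point the two opposing conditional "winds" cancel exactly.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Toy dataset q(x1): two equally likely 1D targets.
targets = np.array([-1.0, 1.0])
weights = np.array([0.5, 0.5])

def cond_density(x, t, x1):
    # Hypothetical straight-line Gaussian path: mean t*x1, std (1 - t).
    return gauss_pdf(x, t * x1, 1.0 - t)

def cond_field(x, t, x1):
    # Straight-line conditional velocity pointing at x1.
    return (x1 - x) / (1.0 - t)

def marginal_field(x, t):
    """Eq. (6): density-weighted average of the conditional fields."""
    dens = np.array([cond_density(x, t, x1) for x1 in targets]) * weights
    flds = np.array([cond_field(x, t, x1) for x1 in targets])
    return np.sum(dens * flds) / np.sum(dens)
```

At \(x = 0, t = 0.5\) the two conditional velocities are \(\pm 2\) with equal weights, so the marginal field is zero: this is precisely the averaging compromise at trajectory crossings discussed in the idea section.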

3. Objective Function Transformation and Gradient Equivalence (Theorem 2) Ideally, we want the neural network \(v_t(x; \theta)\) to directly fit the macroscopic marginal vector field, optimizing the original Flow Matching (FM) objective:

\[\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, p_t(x)} \lVert v_t(x) - u_t(x) \rVert^2 \tag{7}\]

However, since both \(u_t(x)\) and \(p_t(x)\) involve integrals over the entire high-dimensional data distribution, the FM objective is intractable to compute in practice.

To break this deadlock, the CFM framework proposes training the network to fit the single-sample microscopic conditional vector field instead, leading to the Conditional Flow Matching (CFM) objective:

\[\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x \mid x_1)} \lVert v_t(x) - u_t(x \mid x_1) \rVert^2 \tag{8}\]

The core breakthrough of CFM theory is that optimizing the computationally tractable \(\mathcal{L}_{CFM}(\theta)\) is exactly equivalent, in terms of gradients, to optimizing the intractable \(\mathcal{L}_{FM}(\theta)\), i.e.,

\[\nabla_\theta \mathcal{L}_{FM}(\theta) = \nabla_\theta \mathcal{L}_{CFM}(\theta) \tag{9}\]

Derivation Process: First, we fully expand the \(L_2\) norm inside \(\mathcal{L}_{FM}(\theta)\):

\[\lVert v_t(x) - u_t(x) \rVert^2 = \lVert v_t(x) \rVert^2 - 2\langle v_t(x), u_t(x)\rangle + \lVert u_t(x) \rVert^2\]

The term \(\lVert u_t(x) \rVert^2\) is independent of the network parameters \(\theta\), so its gradient vanishes. The term \(\lVert v_t(x) \rVert^2\) does depend on \(\theta\), but by Equation (5) its expectation is identical under \(p_t(x)\) and under the joint \(q(x_1)p_t(x \mid x_1)\), so it contributes the same gradient to both objectives. Only the cross-term requires a separate argument. Its expectation under the marginal distribution \(p_t(x)\) can be written as:

\[\mathbb{E}_{p_t(x)}[\langle v_t(x), u_t(x)\rangle] = \int \langle v_t(x), u_t(x) \rangle p_t(x) dx\]

Next, substituting the previously defined marginal vector field formula (the weighted average formula) into \(u_t(x)\):

\[= \int \langle v_t(x), \frac{\int u_t(x \mid x_1)p_t(x \mid x_1)q(x_1)dx_1}{p_t(x)} \rangle p_t(x) dx\]

Here, the marginal density \(p_t(x)\) in the denominator mathematically cancels with the outer \(p_t(x)\) term of the expectation, simplifying the equation to:

\[= \int \int \langle v_t(x), u_t(x \mid x_1)\rangle p_t(x \mid x_1)q(x_1) dx_1 dx = \mathbb{E}_{q(x_1), p_t(x \mid x_1)}[\langle v_t(x), u_t(x \mid x_1)\rangle]\]

This perfectly matches the parameter-dependent term obtained after expanding \(\mathcal{L}_{CFM}(\theta)\). Therefore, minimizing the microscopic CFM objective naturally equates to minimizing the macroscopic FM objective .
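The cancellation argument above can be checked numerically on a discretised toy problem (my own construction, not from the paper): with arbitrary conditional densities and fields on a 1D grid, the marginal-expectation form of the cross-term and its double-integral form agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 61)                 # discretised state space
dx = xs[1] - xs[0]
q = np.array([0.3, 0.7])                    # q(x1) over a 2-point toy dataset

# Arbitrary (random) conditional densities and fields on the grid.
p_cond = rng.random((2, 61)) + 0.1
p_cond /= p_cond.sum(axis=1, keepdims=True) * dx   # normalise each path
u_cond = rng.normal(size=(2, 61))           # u_t(x | x1), arbitrary values
v = rng.normal(size=61)                     # candidate network output v_t(x)

# Marginal quantities via Eqs. (5) and (6).
p_marg = (q[:, None] * p_cond).sum(axis=0)
u_marg = (q[:, None] * p_cond * u_cond).sum(axis=0) / p_marg

lhs = np.sum(v * u_marg * p_marg) * dx                        # E_{p_t}[<v, u_t>]
rhs = np.sum(q[:, None] * p_cond * v[None, :] * u_cond) * dx  # double integral
```

The two sides match because \(p_t(x)\) in the denominator of Eq. (6) cancels against the \(p_t(x)\) weighting of the expectation, exactly as in the derivation.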


4. Conditional Probability Paths and Vector Fields

Having established the equivalence between the intractable global FM objective and the tractable CFM objective, we now face the fundamental question: what do the microscopic conditional paths \(p_t(x \mid x_1)\) and their generating vector fields \(u_t(x \mid x_1)\) actually look like in mathematical practice?

To answer this, the Flow Matching framework introduces a universal analytical formulation based on Gaussian distributions.

1. Proactive Construction of Gaussian Paths A common misconception is that the Gaussian distribution is passively derived from a physical noise process (as seen in the Stochastic Differential Equations of Diffusion Models). In contrast, Flow Matching proactively defines the conditional path as a Gaussian distribution by design. This is a deliberate mathematical construction rather than a physical coincidence.

For any specific target data sample \(x_1\), the conditional probability path is explicitly defined by a time-dependent mean \(\mu_t(x_1)\) and a time-dependent standard deviation \(\sigma_t(x_1)\):

\[p_t(x \mid x_1) = \mathcal{N}(x \mid \mu_t(x_1), \sigma_t(x_1)^2 I) \tag{10}\]

This specific Gaussian design is fundamentally crucial for two reasons:

  • Boundary Control: It allows us to perfectly satisfy the generation boundary conditions. By defining \(\mu_0 = 0\) and \(\sigma_0 = 1\), the starting distribution represents pure standard noise. By defining \(\mu_1 = x_1\) and \(\sigma_1 \to 0\), the final distribution collapses into a Dirac delta function centered exactly at the target image.
  • The Reparameterization Trick: The Gaussian assumption enables us to express the complex probability evolution as a simple, deterministic kinematic equation. By sampling an initial base noise particle \(x_0 \sim \mathcal{N}(0, I)\), the absolute spatial coordinate \(x\) at any time \(t\) is simply given by a linear affine transformation:
\[x = \sigma_t(x_1)x_0 + \mu_t(x_1) \tag{11}\]
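A minimal sketch of the reparameterization trick in Equation (11) (my own illustration), using a simple linear schedule \(\mu_t = t x_1\), \(\sigma_t = 1 - (1 - \sigma_{min})t\) as one concrete choice; at \(t = 1\) the sample lands, up to \(\sigma_{min}\)-scale noise, exactly on the target.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_path_point(x1, t, sigma_min=1e-5):
    """Sample x ~ p_t(x | x1) via Eq. (11), using a linear schedule
    (mu_t = t * x1, sigma_t = 1 - (1 - sigma_min) * t) as an example."""
    x0 = rng.normal(size=np.shape(x1))      # base noise x0 ~ N(0, I)
    sigma_t = 1.0 - (1.0 - sigma_min) * t
    mu_t = t * np.asarray(x1)
    return sigma_t * x0 + mu_t

x1 = np.array([2.0, -1.0])
x_at_end = sample_path_point(x1, t=1.0)     # sigma_1 = sigma_min, so x is near x1
```

At \(t = 0\) the same call returns a pure standard-normal sample, realising the boundary conditions described above.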

2. Theorem 3: The General Conditional Vector Field By formulating the path as the linear affine transformation above, we can now rigorously determine the exact analytical vector field \(u_t(x \mid x_1)\) that generates it. Theorem 3 provides the elegant, closed-form universal solution:

\[u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}(x - \mu_t(x_1)) + \mu_t'(x_1) \tag{12}\]

where \(\sigma_t'(x_1)\) and \(\mu_t'(x_1)\) represent the exact mathematical derivatives with respect to time \(t\).

Derivation of Theorem 3: This formula originates directly from fundamental kinematics. By taking the time derivative of the explicit position equation \(x = \sigma_t(x_1)x_0 + \mu_t(x_1)\), we obtain the instantaneous velocity of the particle (representing the Lagrangian specification of the flow):

\[\frac{dx}{dt} = \sigma_t'(x_1)x_0 + \mu_t'(x_1)\]

However, a valid vector field must be a function of the current spatial state \(x\), independent of the unobservable initial noise identity \(x_0\) (representing the Eulerian specification). Thanks to our proactive Gaussian affine design, we can algebraically reverse the transformation to explicitly solve for \(x_0\):

\[x_0 = \frac{x - \mu_t(x_1)}{\sigma_t(x_1)}\]

Substituting this \(x_0\) back into the velocity equation mathematically yields the exact conditional vector field:

\[u_t(x \mid x_1) = \sigma_t'(x_1) \left[ \frac{x - \mu_t(x_1)}{\sigma_t(x_1)} \right] + \mu_t'(x_1)\]

Rearranging the terms directly produces the universal equation in Theorem 3. This powerful closed-form equation serves as the mathematical engine for constructing any specific flow mapping. By simply defining different \(\mu_t\) and \(\sigma_t\) schedules, we can effortlessly instantiate either the curved paths of Diffusion Models or the mathematically optimal straight paths of Flow Matching.
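Theorem 3 can be sanity-checked numerically (an illustrative sketch, assuming the linear schedule as one instance): the field of Equation (12), evaluated along a trajectory \(x(t) = \sigma_t x_0 + \mu_t\), should match the finite-difference velocity of that trajectory.

```python
import numpy as np

# One concrete schedule (assumed for illustration): the linear path
# mu_t = t * x1, sigma_t = 1 - (1 - s_min) * t.
s_min = 1e-5
mu = lambda t, x1: t * x1
sigma = lambda t: 1.0 - (1.0 - s_min) * t
dmu = lambda t, x1: x1                  # d(mu)/dt
dsigma = lambda t: -(1.0 - s_min)       # d(sigma)/dt

def u_theorem3(x, t, x1):
    """Conditional vector field from Eq. (12)."""
    return dsigma(t) / sigma(t) * (x - mu(t, x1)) + dmu(t, x1)

# Compare against the finite-difference velocity of the explicit
# trajectory x(t) = sigma_t * x0 + mu_t for a fixed noise sample x0.
x0, x1, t, h = 0.7, 2.0, 0.4, 1e-6
traj = lambda t: sigma(t) * x0 + mu(t, x1)
fd_velocity = (traj(t + h) - traj(t - h)) / (2 * h)
field_velocity = u_theorem3(traj(t), t, x1)
```

The agreement reflects exactly the Lagrangian/Eulerian argument above: substituting \(x_0 = (x - \mu_t)/\sigma_t\) into the particle velocity reproduces the field at the particle's current position.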


5. Optimal Transport Conditional Flow Matching (OT-CFM)

Equipped with the universal mathematical engine from Theorem 3, we can now derive the most elegant and computationally efficient instance of the Flow Matching framework: Optimal Transport Conditional Flow Matching (OT-CFM) .

1. The Optimal Transport Schedule Instead of relying on the complex trigonometric or square-root schedules traditionally used in Diffusion Models, OT-CFM constructs the simplest possible probability path: a straight line. We proactively define the mean \(\mu_t\) and standard deviation \(\sigma_t\) to change strictly linearly over time \(t \in [0, 1]\):

\[\mu_t(x_1) = t x_1\] \[\sigma_t(x_1) = 1 - (1 - \sigma_{min})t\]

where \(\sigma_{min}\) is a small noise parameter (e.g., \(10^{-5}\)) that prevents the distribution from collapsing to a point mass (a singularity) at \(t=1\).

2. The Constant Velocity Vector Field To find the exact vector field guiding this straight-line path, we calculate the time derivatives of our explicitly linear schedule:

\[\mu_t'(x_1) = x_1\] \[\sigma_t'(x_1) = -(1 - \sigma_{min})\]

By directly substituting these derivatives and the affine schedule into the universal equation from Theorem 3 (Equation 12), we obtain:

\[u_t(x \mid x_1) = \frac{-(1 - \sigma_{min})}{1 - (1 - \sigma_{min})t}(x - t x_1) + x_1\]

Through elegant algebraic simplification, this reduces precisely to the final OT-CFM vector field :

\[u_t(x \mid x_1) = \frac{x_1 - (1 - \sigma_{min})x}{1 - (1 - \sigma_{min})t} \tag{13}\]
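The algebraic simplification leading to Equation (13) can be made explicit by combining both terms of the substituted expression over the common denominator \(1 - (1 - \sigma_{min})t\):

\[u_t(x \mid x_1) = \frac{-(1 - \sigma_{min})(x - t x_1) + \bigl(1 - (1 - \sigma_{min})t\bigr)x_1}{1 - (1 - \sigma_{min})t} = \frac{-(1 - \sigma_{min})x + (1 - \sigma_{min})t x_1 + x_1 - (1 - \sigma_{min})t x_1}{1 - (1 - \sigma_{min})t} = \frac{x_1 - (1 - \sigma_{min})x}{1 - (1 - \sigma_{min})t}\]

The two \((1 - \sigma_{min})t x_1\) terms cancel, leaving exactly the compact form of Equation (13).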

3. The Final OT-CFM Objective Finally, we substitute this closed-form analytical vector field into our tractable CFM objective (Equation 8). This produces the exact, highly efficient loss function used in practical implementations (e.g., in PyTorch):

\[\mathcal{L}_{OT-CFM}(\theta) = \mathbb{E}_{t, q(x_1), p_t(x \mid x_1)} \left\lVert v_t(x) - \frac{x_1 - (1 - \sigma_{min})x}{1 - (1 - \sigma_{min})t} \right\rVert^2 \tag{14}\]
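A minimal Monte Carlo sketch of Equation (14) (illustrative, in NumPy rather than PyTorch): with a toy one-point dataset the conditional and marginal fields coincide, so plugging the analytic target field in as the "network" drives the empirical loss to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
s_min = 1e-5

def ot_target(x, t, x1):
    """Closed-form OT-CFM target field, Eq. (13)."""
    return (x1 - (1.0 - s_min) * x) / (1.0 - (1.0 - s_min) * t)

def ot_cfm_loss(v, x1_batch, n=1024):
    """Monte Carlo estimate of Eq. (14) for a 1D toy dataset."""
    x1 = x1_batch[rng.integers(len(x1_batch), size=n)]   # sample q(x1)
    t = rng.uniform(0.0, 1.0, size=n)                    # sample t ~ U[0, 1]
    x0 = rng.normal(size=n)                              # base noise
    x = (1.0 - (1.0 - s_min) * t) * x0 + t * x1          # sample p_t(x | x1)
    return np.mean((v(x, t) - ot_target(x, t, x1)) ** 2)

# With a single data point the marginal field equals the conditional
# field, so the analytic target is also the optimal "network".
data = np.array([1.5])
loss = ot_cfm_loss(lambda x, t: ot_target(x, t, 1.5), data)
```

In real training, `v` would be a neural network \(v_t(x;\theta)\) and the loss would be minimised by gradient descent over minibatches of \((t, x_0, x_1)\) triples.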

4. Physical Intuition: Why “Optimal Transport”? The strength of this design lies in its physical and geometric optimality. By enforcing a linear affine schedule, OT-CFM guarantees two fundamental properties for the particle dynamics:

  • Geometric Optimality (Straight Lines) : Particles are compelled to travel from the pure noise distribution directly to the target data manifold along the shortest possible geometric paths.
  • Physical Optimality (Constant Velocity) : The particles traverse these straight trajectories at a steady, unchanging speed, completely devoid of any acceleration or deceleration.

In classical mechanics and Optimal Transport theory, moving probability mass along the shortest path at a constant speed minimizes the total kinetic energy (also known as the dynamic transport cost). This makes each conditional path an optimal transport map between its endpoints, mirroring the dynamical (Benamou–Brenier) formulation of the Wasserstein distance. Note, however, that per-sample conditional optimality does not by itself guarantee that the resulting marginal flow is the optimal transport between the two distributions, which is precisely the tension raised in the idea section above.


6. Conclusion (Paper)

We introduced Flow Matching, a new simulation-free framework for training Continuous Normalizing Flow models, relying on conditional constructions to effortlessly scale to very high dimensions. Furthermore, the FM framework provides an alternative view on diffusion models, and suggests forsaking the stochastic/diffusion construction in favor of more directly specifying the probability path, allowing us to, e.g., construct paths that allow faster sampling and/or improve generation. We experimentally showed the ease of training and sampling when using the Flow Matching framework, and in the future, we expect FM to open the door to allowing a multitude of probability paths (e.g., non-isotropic Gaussians or more general kernels altogether).