<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Nvidia on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/nvidia/</link><description>Recent content in Nvidia on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 14 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/nvidia/index.xml" rel="self" type="application/rss+xml"/><item><title>NVIDIA AnyFlow — video diffusion distillation that is not tied to a step count</title><link>https://ice-ice-bear.github.io/posts/2026-05-14-nvidia-anyflow-wan-t2v/</link><pubDate>Thu, 14 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-14-nvidia-anyflow-wan-t2v/</guid><description>&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;&lt;a class="link" href="https://nvlabs.github.io/AnyFlow" target="_blank" rel="noopener"
 &gt;AnyFlow&lt;/a&gt;, released by &lt;a class="link" href="https://www.nvidia.com/" target="_blank" rel="noopener"
 &gt;NVIDIA&lt;/a&gt;, is a framework that distills video &lt;a class="link" href="https://en.wikipedia.org/wiki/Diffusion_model" target="_blank" rel="noopener"
 &gt;diffusion models&lt;/a&gt; so they are &lt;strong&gt;not locked to a fixed inference step count&lt;/strong&gt;. Conventional few-step distilled models are pinned — a 4-step model does 4 steps, an 8-step model does 8. AnyFlow runs anywhere from 1 step to dozens from a single set of weights, and quality climbs steadily as you add steps. Starting from the &lt;a class="link" href="https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers" target="_blank" rel="noopener"
 &gt;&lt;code&gt;nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers&lt;/code&gt;&lt;/a&gt; model card, this post looks at the &lt;strong&gt;On-Policy Flow Map Distillation&lt;/strong&gt; underneath it, and why it departs from conventional &lt;a class="link" href="https://arxiv.org/abs/2303.01469" target="_blank" rel="noopener"
 &gt;consistency distillation&lt;/a&gt;.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;graph TD
 Base["Wan2.1-T2V-14B &amp;lt;br/&amp;gt; (flow matching DiT, 50+ steps)"]
 Base --&gt; Problem["problem: few-step distillation &amp;lt;br/&amp;gt; pins the step count + loses test-time scaling"]
 Problem --&gt; AnyFlow["AnyFlow &amp;lt;br/&amp;gt; On-Policy Flow Map Distillation"]
 AnyFlow --&gt; FM["flow map &amp;lt;br/&amp;gt; z_t to z_r arbitrary-interval transition"]
 AnyFlow --&gt; BS["Flow Map Backward Simulation &amp;lt;br/&amp;gt; decompose Euler rollout into shortcut segments"]
 FM --&gt; Result["any-step inference &amp;lt;br/&amp;gt; (1, 4, 8, 16, 32 steps)"]
 BS --&gt; Result
 Result --&gt; Tasks["T2V / I2V / V2V &amp;lt;br/&amp;gt; bidirectional + causal"]&lt;/pre&gt;&lt;h2 id="the-base-model--wan21"&gt;The base model — Wan2.1
&lt;/h2&gt;&lt;p&gt;AnyFlow is not trained from scratch; it is a distillation layer on top of Alibaba&amp;rsquo;s open-source video model &lt;a class="link" href="https://github.com/Wan-Video/Wan2.1" target="_blank" rel="noopener"
 &gt;Wan2.1&lt;/a&gt;. The base, &lt;a class="link" href="https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers" target="_blank" rel="noopener"
 &gt;&lt;code&gt;Wan-AI/Wan2.1-T2V-14B-Diffusers&lt;/code&gt;&lt;/a&gt;, is a 14B-parameter &lt;a class="link" href="https://arxiv.org/abs/2212.09748" target="_blank" rel="noopener"
 &gt;Diffusion Transformer&lt;/a&gt; built on the &lt;a class="link" href="https://arxiv.org/abs/2210.02747" target="_blank" rel="noopener"
 &gt;Flow Matching&lt;/a&gt; framework: it takes text through a multilingual &lt;a class="link" href="https://huggingface.co/docs/transformers/model_doc/t5" target="_blank" rel="noopener"
 &gt;T5 encoder&lt;/a&gt; and injects the condition via &lt;a class="link" href="https://en.wikipedia.org/wiki/Attention_%28machine_learning%29" target="_blank" rel="noopener"
 &gt;cross-attention&lt;/a&gt; in every transformer block. Temporal compression is handled by &lt;strong&gt;Wan-VAE&lt;/strong&gt;, a 3D causal &lt;a class="link" href="https://en.wikipedia.org/wiki/Variational_autoencoder" target="_blank" rel="noopener"
 &gt;VAE&lt;/a&gt; designed specifically for video.&lt;/p&gt;
&lt;p&gt;Wan2.1&amp;rsquo;s weakness is the weakness of diffusion models generally: &lt;strong&gt;it is slow&lt;/strong&gt;. Producing one 480P five-second clip takes roughly 50 steps of &lt;a class="link" href="https://en.wikipedia.org/wiki/Ordinary_differential_equation" target="_blank" rel="noopener"
 &gt;ODE&lt;/a&gt; integration, and at 14B parameters each step is heavy. That is what few-step distillation is for — and that is where the conventional approach shows its limits.&lt;/p&gt;
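&lt;p&gt;To make that cost concrete, here is a minimal sketch of what standard flow-matching inference looks like: plain Euler integration of the ODE from noise to data, where every one of the roughly 50 steps is a full forward pass through the 14B DiT. This is not Wan2.1&amp;rsquo;s actual sampler; &lt;code&gt;velocity_model&lt;/code&gt; and its signature are placeholders for illustration.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def euler_sample(velocity_model, cond, shape, num_steps=50, device="cuda"):
    """Plain Euler integration of a flow-matching ODE from noise (t=1) to data (t=0)."""
    z = torch.randn(shape, device=device)          # start from pure noise, z_1
    ts = torch.linspace(1.0, 0.0, num_steps + 1)   # uniform time grid from 1 down to 0
    for i in range(num_steps):
        t, r = ts[i].item(), ts[i + 1].item()
        v = velocity_model(z, t, cond)             # one expensive 14B DiT forward pass
        z = z + (r - t) * v                        # Euler step along the predicted velocity
    return z                                       # clean latent; Wan-VAE decodes it to frames&lt;/code&gt;&lt;/pre&gt;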
&lt;h2 id="the-problem--why-few-step-distillation-is-pinned"&gt;The problem — why few-step distillation is pinned
&lt;/h2&gt;&lt;p&gt;The standard tool for few-step sampling is distillation in the &lt;a class="link" href="https://arxiv.org/abs/2303.01469" target="_blank" rel="noopener"
 &gt;consistency model&lt;/a&gt; family. The core idea is to learn a mapping that jumps straight from any noisy point &lt;code&gt;z_t&lt;/code&gt; to the clean output &lt;code&gt;z_0&lt;/code&gt; — an endpoint consistency mapping. The catch is that this &lt;strong&gt;replaces the original &lt;a class="link" href="https://arxiv.org/abs/2011.13456" target="_blank" rel="noopener"
 &gt;probability-flow ODE&lt;/a&gt; trajectory wholesale with a consistency-sampling trajectory&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Two things break as a result. First, the model is optimized for a particular step count and degrades at other budgets. Second, and more damaging — &lt;strong&gt;test-time scaling disappears&lt;/strong&gt;. Ordinary diffusion sampling gets better as you add steps; consistency-distilled models do not improve, and can even get worse, with more steps. That is the price of discarding the ODE trajectory&amp;rsquo;s &amp;ldquo;more compute means more accuracy&amp;rdquo; property. The &lt;a class="link" href="https://arxiv.org/abs/2605.13724" target="_blank" rel="noopener"
 &gt;AnyFlow paper&lt;/a&gt; takes exactly this failure as its starting point.&lt;/p&gt;
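&lt;p&gt;For contrast, here is a minimal sketch of what multi-step sampling with an endpoint consistency mapping typically looks like; the &lt;code&gt;consistency_fn&lt;/code&gt; signature and the linear re-noising schedule are illustrative assumptions, not any particular implementation. Each extra step is just another endpoint prediction from a freshly re-noised state, not a finer traversal of the probability-flow ODE, which is why added steps buy so little.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def consistency_multistep(consistency_fn, cond, shape, ts=(1.0, 0.66, 0.33, 0.0), device="cuda"):
    """Endpoint-style sampling: predict z_0, re-noise to the next t, repeat."""
    z = torch.randn(shape, device=device)                 # z at t = 1
    for t, t_next in zip(ts[:-1], ts[1:]):
        z0_pred = consistency_fn(z, t, cond)              # always jumps straight to the endpoint z_0
        if t_next == 0.0:
            return z0_pred
        noise = torch.randn_like(z0_pred)
        z = (1.0 - t_next) * z0_pred + t_next * noise     # re-noise on a linear schedule
    return z&lt;/code&gt;&lt;/pre&gt;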
&lt;h2 id="anyflows-answer--on-policy-flow-map-distillation"&gt;AnyFlow&amp;rsquo;s answer — on-policy flow map distillation
&lt;/h2&gt;&lt;p&gt;AnyFlow&amp;rsquo;s shift compresses to one line: &lt;strong&gt;drop the endpoint mapping (&lt;code&gt;z_t → z_0&lt;/code&gt;) and learn a flow-map transition over arbitrary time intervals (&lt;code&gt;z_t → z_r&lt;/code&gt;).&lt;/strong&gt; Because it learns transitions between any two points on the trajectory rather than a single endpoint &lt;code&gt;z_0&lt;/code&gt;, the same model handles whatever way inference chooses to slice the steps. That is the technical basis of &amp;ldquo;any-step.&amp;rdquo;&lt;/p&gt;
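&lt;p&gt;Mechanically, &amp;ldquo;any-step&amp;rdquo; looks something like the sketch below, where &lt;code&gt;flow_map(z, t, r, cond)&lt;/code&gt; is a hypothetical stand-in for the distilled network predicting the &lt;code&gt;z_t → z_r&lt;/code&gt; transition; the sampler is free to slice the interval from 1 to 0 into however many segments the step budget allows.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def anyflow_sample(flow_map, cond, shape, num_steps=4, device="cuda"):
    """Any-step sampling: one checkpoint, any slicing of the time axis."""
    z = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, num_steps + 1)   # any grid works; uniform shown here
    for i in range(num_steps):
        # Arbitrary-interval transition z_t to z_r: with num_steps=1 this is a
        # single endpoint jump, with 32 it walks the same trajectory in finer increments.
        z = flow_map(z, ts[i].item(), ts[i + 1].item(), cond)
    return z&lt;/code&gt;&lt;/pre&gt;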
&lt;p&gt;The key training technique is &lt;strong&gt;Flow Map Backward Simulation&lt;/strong&gt;. It decomposes a full &lt;a class="link" href="https://en.wikipedia.org/wiki/Euler_method" target="_blank" rel="noopener"
 &gt;Euler rollout&lt;/a&gt; into several shortcut flow-map segments, so the model trains on the intermediate states it produces itself — that is, &lt;strong&gt;on-policy&lt;/strong&gt;. This decomposition addresses two error sources at once (a training-step sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Discretization error&lt;/strong&gt; — the integration error that accumulates when few-step sampling takes large jumps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exposure bias&lt;/strong&gt; — the mismatch between training and inference distributions that compounds in causal (autoregressive) generation&lt;/li&gt;
&lt;/ul&gt;
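&lt;p&gt;The model card does not spell out the exact loss or rollout schedule, so the sketch below is only one plausible reading of the description above: each segment starts from the state the student itself produced (on-policy), and a fine-grained Euler rollout of the frozen teacher over the same interval supplies the target. &lt;code&gt;student&lt;/code&gt;, &lt;code&gt;teacher_velocity&lt;/code&gt;, and the segment boundaries are illustrative assumptions.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;import torch

def teacher_rollout(teacher_velocity, z, t, r, cond, substeps=10):
    """Fine-grained Euler rollout of the frozen base model from z_t to z_r."""
    ts = torch.linspace(t, r, substeps + 1)
    for i in range(substeps):
        z = z + (ts[i + 1] - ts[i]) * teacher_velocity(z, ts[i].item(), cond)
    return z

def backward_simulation_loss(student, teacher_velocity, cond, shape,
                             segments=(1.0, 0.75, 0.5, 0.25, 0.0), device="cuda"):
    """Decompose one full rollout into shortcut segments and train on-policy."""
    z = torch.randn(shape, device=device)
    loss = torch.zeros((), device=device)
    for t, r in zip(segments[:-1], segments[1:]):
        with torch.no_grad():
            target = teacher_rollout(teacher_velocity, z, t, r, cond)  # where the original ODE actually goes
        pred = student(z, t, r, cond)      # the student's one-jump shortcut over the same interval
        loss = loss + torch.mean((pred - target) ** 2)
        z = pred.detach()                  # on-policy: the next segment starts from the student's own output
    return loss&lt;/code&gt;&lt;/pre&gt;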
&lt;p&gt;This decomposition is the decisive difference from &lt;a class="link" href="https://arxiv.org/abs/2303.01469" target="_blank" rel="noopener"
 &gt;consistency distillation&lt;/a&gt;. Consistency distillation &lt;strong&gt;replaces&lt;/strong&gt; the original trajectory; AnyFlow &lt;strong&gt;preserves the original &lt;a class="link" href="https://en.wikipedia.org/wiki/Ordinary_differential_equation" target="_blank" rel="noopener"
 &gt;ODE&lt;/a&gt; trajectory and decomposes it into segments&lt;/strong&gt;. Because the trajectory is left intact, the &amp;ldquo;more steps means more accurate&amp;rdquo; property survives — AnyFlow matches or beats consistency-based methods in the few-step regime, and uniformly lifts quality across the whole trajectory as steps increase.&lt;/p&gt;
&lt;h2 id="what-it-supports--architectures-and-tasks"&gt;What it supports — architectures and tasks
&lt;/h2&gt;&lt;p&gt;AnyFlow is not a single model but a lineup released as a &lt;a class="link" href="https://huggingface.co/collections/nvidia/anyflow" target="_blank" rel="noopener"
 &gt;HuggingFace collection&lt;/a&gt;.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Model&lt;/th&gt;
 &lt;th&gt;Tasks&lt;/th&gt;
 &lt;th&gt;Architecture&lt;/th&gt;
 &lt;th&gt;Resolution&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers" target="_blank" rel="noopener"
 &gt;AnyFlow-Wan2.1-T2V-14B-Diffusers&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;T2V&lt;/td&gt;
 &lt;td&gt;bidirectional&lt;/td&gt;
 &lt;td&gt;480P&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers" target="_blank" rel="noopener"
 &gt;AnyFlow-Wan2.1-T2V-1.3B-Diffusers&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;T2V&lt;/td&gt;
 &lt;td&gt;bidirectional&lt;/td&gt;
 &lt;td&gt;480P&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers" target="_blank" rel="noopener"
 &gt;AnyFlow-FAR-Wan2.1-14B-Diffusers&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;T2V / I2V / V2V&lt;/td&gt;
 &lt;td&gt;causal&lt;/td&gt;
 &lt;td&gt;480P&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a class="link" href="https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers" target="_blank" rel="noopener"
 &gt;AnyFlow-FAR-Wan2.1-1.3B-Diffusers&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;T2V / I2V / V2V&lt;/td&gt;
 &lt;td&gt;causal&lt;/td&gt;
 &lt;td&gt;480P&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The &lt;code&gt;FAR&lt;/code&gt; variants put AnyFlow on top of &lt;a class="link" href="https://sites.google.com/view/showlab" target="_blank" rel="noopener"
 &gt;Show Lab&lt;/a&gt;&amp;rsquo;s &lt;a class="link" href="https://github.com/showlab/FAR" target="_blank" rel="noopener"
 &gt;FAR&lt;/a&gt; (Long-Context Autoregressive Video Modeling, &lt;a class="link" href="https://arxiv.org/abs/2503.19325" target="_blank" rel="noopener"
 &gt;arXiv 2503.19325&lt;/a&gt;) — a next-frame-prediction causal video model — and handle Image-to-Video and Video-to-Video alongside &lt;a class="link" href="https://en.wikipedia.org/wiki/Text-to-video_model" target="_blank" rel="noopener"
 &gt;Text-to-Video&lt;/a&gt; in one model. AnyFlow is validated on both bidirectional (Wan2.1 proper) and causal (FAR) architectures and across scales from 1.3B to 14B. The backward simulation that targets exposure bias matters most on the causal side.&lt;/p&gt;
&lt;h2 id="trying-it--diffusers"&gt;Trying it — Diffusers
&lt;/h2&gt;&lt;p&gt;The &lt;a class="link" href="https://github.com/huggingface/diffusers" target="_blank" rel="noopener"
 &gt;🤗 Diffusers&lt;/a&gt; integration is done, so the barrier to entry is low. The checkpoint loads like any other Diffusers pipeline, and step-count control goes through the dedicated &lt;code&gt;WanAnyFlowPipeline&lt;/code&gt; shown below.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;diffusers.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;export_to_video&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;far.pipelines.pipeline_wan_anyflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WanAnyFlowPipeline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WanAnyFlowPipeline&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cuda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;CG game concept digital art, a majestic elephant running towards a herd.&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;480&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;832&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_frames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;81&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# raise 4 -&amp;gt; 8 -&amp;gt; 16 and quality goes up&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;cuda&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;export_to_video&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;output.mp4&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;code&gt;num_inference_steps&lt;/code&gt; is the whole point. From the same checkpoint, changing only this value lets you pick any point on the speed-quality curve — something a few-step distilled model simply cannot do. Training, inference, and &lt;a class="link" href="https://github.com/Vchitect/VBench" target="_blank" rel="noopener"
 &gt;VBench&lt;/a&gt; evaluation configs live in the &lt;a class="link" href="https://github.com/NVlabs/AnyFlow" target="_blank" rel="noopener"
 &gt;NVlabs/AnyFlow&lt;/a&gt; repository, and running in &lt;a class="link" href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format" target="_blank" rel="noopener"
 &gt;&lt;code&gt;bfloat16&lt;/code&gt;&lt;/a&gt; with &lt;a class="link" href="https://huggingface.co/docs/accelerate" target="_blank" rel="noopener"
 &gt;accelerate&lt;/a&gt; and &lt;a class="link" href="https://huggingface.co/docs/transformers" target="_blank" rel="noopener"
 &gt;transformers&lt;/a&gt; is recommended.&lt;/p&gt;
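&lt;p&gt;Assuming the snippet above has already run (so &lt;code&gt;pipeline&lt;/code&gt;, &lt;code&gt;torch&lt;/code&gt;, and &lt;code&gt;export_to_video&lt;/code&gt; are in scope), the quickest way to see the slider in action is to sweep &lt;code&gt;num_inference_steps&lt;/code&gt; with everything else held fixed:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python" data-lang="python"&gt;# Same checkpoint, same prompt, same seed; only the step budget changes.
for steps in (1, 4, 8, 16):
    frames = pipeline(
        prompt="CG game concept digital art, a majestic elephant running towards a herd.",
        height=480, width=832, num_frames=81,
        num_inference_steps=steps,
        generator=torch.Generator("cuda").manual_seed(0),
    ).frames[0]
    export_to_video(frames, f"output_{steps}steps.mp4", fps=16)&lt;/code&gt;&lt;/pre&gt;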
&lt;p&gt;The license needs care. The GitHub code is &lt;a class="link" href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank" rel="noopener"
 &gt;Apache 2.0&lt;/a&gt;, but the &lt;strong&gt;model weights on HuggingFace are under the NVIDIA One-Way Noncommercial License (NSCLv1)&lt;/strong&gt; — noncommercial use only. That contrasts with the Wan2.1 base itself, which is Apache 2.0.&lt;/p&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;AnyFlow is interesting not just because it is &amp;ldquo;a faster video model.&amp;rdquo; The work &lt;strong&gt;rewrites the default of distillation itself&lt;/strong&gt;. For the past few years the implicit premise of few-step distillation has been that the inference budget is fixed at training time — &lt;a class="link" href="https://arxiv.org/abs/2310.04378" target="_blank" rel="noopener"
 &gt;LCM&lt;/a&gt;, &lt;a class="link" href="https://arxiv.org/abs/2303.01469" target="_blank" rel="noopener"
 &gt;consistency models&lt;/a&gt;, and assorted step-distilled checkpoints were all shipped that way. AnyFlow dissolves that premise simply by learning a flow map instead of an endpoint mapping. The result is that &amp;ldquo;speed or quality&amp;rdquo; stops being a fixed choice made at deployment and becomes &lt;strong&gt;a slider at inference time&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The deeper insight is about &lt;em&gt;what gets preserved&lt;/em&gt;. Consistency distillation bought speed by discarding the original ODE trajectory, and in doing so lost test-time scaling — one of diffusion&amp;rsquo;s core assets. AnyFlow chooses instead to preserve the trajectory and cut it into segments, keeping that asset intact. It is a design that asks &amp;ldquo;what must be preserved&amp;rdquo; before asking &amp;ldquo;what can be thrown away to approximate.&amp;rdquo; That on-policy backward simulation catches discretization error and exposure bias with a single mechanism is the same theme: not two separate patches, but one property that falls out naturally once the trajectory is decomposed correctly.&lt;/p&gt;
&lt;p&gt;The limitations are clear too. The public model card and project page do not yet state quantitative &lt;a class="link" href="https://github.com/Vchitect/VBench" target="_blank" rel="noopener"
 &gt;VBench&lt;/a&gt; scores (the comparisons are qualitative and relative), resolution is capped at 480P, and the weight license is noncommercial. Still, the direction is unmistakable: the next round of differentiation in video generation comes not from model size but from &lt;strong&gt;how wide a speed-quality spectrum a single set of weights can cover&lt;/strong&gt;. AnyFlow is the first to move that spectrum from deployment time to inference time.&lt;/p&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Models &amp;amp; code&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers" target="_blank" rel="noopener"
 &gt;nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers&lt;/a&gt; — the model card this post covers&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/collections/nvidia/anyflow" target="_blank" rel="noopener"
 &gt;AnyFlow HuggingFace collection&lt;/a&gt; — full 1.3B/14B, bidirectional/causal lineup&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/NVlabs/AnyFlow" target="_blank" rel="noopener"
 &gt;NVlabs/AnyFlow&lt;/a&gt; — training, inference, evaluation code (Apache 2.0)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://nvlabs.github.io/AnyFlow" target="_blank" rel="noopener"
 &gt;AnyFlow project page&lt;/a&gt; · &lt;a class="link" href="https://nvlabs.github.io/AnyFlow/demo" target="_blank" rel="noopener"
 &gt;demo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers" target="_blank" rel="noopener"
 &gt;Wan-AI/Wan2.1-T2V-14B-Diffusers&lt;/a&gt; · &lt;a class="link" href="https://github.com/Wan-Video/Wan2.1" target="_blank" rel="noopener"
 &gt;Wan-Video/Wan2.1&lt;/a&gt; — the base model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Papers&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2605.13724" target="_blank" rel="noopener"
 &gt;AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation (arXiv 2605.13724)&lt;/a&gt; — Gu, Fang, Jiang, Mao, Han, Cai, Shou (NVIDIA / Show Lab NUS / MIT, 2026)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2503.19325" target="_blank" rel="noopener"
 &gt;Long-Context Autoregressive Video Modeling with Next-Frame Prediction — FAR (arXiv 2503.19325)&lt;/a&gt; — Gu, Mao, Shou (2025) — base of the causal variants&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2303.01469" target="_blank" rel="noopener"
 &gt;Consistency Models (arXiv 2303.01469)&lt;/a&gt; — the distillation paradigm AnyFlow contrasts with&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2310.04378" target="_blank" rel="noopener"
 &gt;Latent Consistency Models (arXiv 2310.04378)&lt;/a&gt; — a representative few-step distillation case&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2210.02747" target="_blank" rel="noopener"
 &gt;Flow Matching for Generative Modeling (arXiv 2210.02747)&lt;/a&gt; — the generative framework under Wan2.1 and AnyFlow&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2011.13456" target="_blank" rel="noopener"
 &gt;Score-Based Generative Modeling through SDEs (arXiv 2011.13456)&lt;/a&gt; — origin of the probability-flow ODE&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://arxiv.org/abs/2212.09748" target="_blank" rel="noopener"
 &gt;Scalable Diffusion Models with Transformers — DiT (arXiv 2212.09748)&lt;/a&gt; — the Wan2.1 backbone architecture&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; tools&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/huggingface/diffusers" target="_blank" rel="noopener"
 &gt;🤗 Diffusers&lt;/a&gt; — the library the model is integrated into&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/showlab/FAR" target="_blank" rel="noopener"
 &gt;FAR&lt;/a&gt; · &lt;a class="link" href="https://github.com/guandeh17/Self-Forcing" target="_blank" rel="noopener"
 &gt;Self-Forcing&lt;/a&gt; · &lt;a class="link" href="https://github.com/WZDTHU/TiM" target="_blank" rel="noopener"
 &gt;TiM&lt;/a&gt; — prior work AnyFlow credits as the foundations it builds on&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/Vchitect/VBench" target="_blank" rel="noopener"
 &gt;VBench&lt;/a&gt; — video generation evaluation benchmark&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://en.wikipedia.org/wiki/Diffusion_model" target="_blank" rel="noopener"
 &gt;Diffusion model&lt;/a&gt; · &lt;a class="link" href="https://en.wikipedia.org/wiki/Text-to-video_model" target="_blank" rel="noopener"
 &gt;Text-to-video model&lt;/a&gt; — conceptual background&lt;/li&gt;
&lt;/ul&gt;</description></item></channel></rss>