FastWan Generates a 5-Second Video in 5 Seconds via Sparse Distillation

FastWan generates a 5-second video in about 5 seconds on a single H200 GPU, and the team at UC San Diego’s Hao AI Lab did it by training the model to be fast rather than patching speed on at inference time. Released under Apache-2.0 with full weights, training recipes, and datasets, FastWan introduces Sparse Distillation: a single training process that fuses sparse attention with aggressive step reduction. The 1.3B variant runs its denoising loop in under one second on an H200, roughly a 97× speedup over the dense FlashAttention-2 baseline (95.21s down to 0.98s).


Diagram of FastWan's sparse distillation architecture combining a sparse student model, a frozen real score network, and a trainable fake score network
Image credit: Hao AI Lab @ UC San Diego

Why Video Diffusion Is Slow

Text-to-video diffusion transformers (DiTs) are expensive for two compounding reasons. First, they denoise iteratively: a standard sampler runs 50 steps, each a full forward pass. Second, 3D attention over a video’s space-time tokens scales quadratically with the token count, so attention dominates the compute as resolution and duration grow. The two problems multiply: 50 dense-attention passes per clip.
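
The compounding can be made concrete with a back-of-the-envelope count. The sketch below is only a cost model: it counts the query-key pairs one attention layer scores and multiplies by the number of sampling steps. The latent grid sizes are hypothetical placeholders, not Wan's actual token dimensions, and the function names are made up for the example.

```python
def attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs scored by one dense 3D attention layer.

    Every space-time token attends to every other token, so the count
    is the square of the total number of tokens.
    """
    tokens = frames * height * width
    return tokens * tokens


def sampling_pairs(frames: int, height: int, width: int, steps: int) -> int:
    """Pairs scored per attention layer over a whole sampling run."""
    return steps * attention_pairs(frames, height, width)


# Hypothetical latent grids (frames, height, width), for illustration only.
short_clip = sampling_pairs(10, 30, 40, steps=50)
long_clip = sampling_pairs(20, 30, 40, steps=50)  # same resolution, twice as long
few_steps = sampling_pairs(10, 30, 40, steps=3)   # same clip, few-step sampler

print("doubling duration multiplies attention work by", long_clip / short_clip)
print("cutting 50 steps to 3 divides it by", short_clip / few_steps)
```

Token count enters quadratically and step count linearly, which is why the two costs multiply rather than add.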

The obvious fixes attack each axis separately. Distillation compresses 50 steps down to 1–4. Sparse attention prunes the attention map so each pass touches fewer tokens. The catch is that they fight each other: most sparse-attention methods exploit redundancy across the many denoising steps to decide what to prune. Collapse the steps from 50 to 3 and that redundancy disappears — the sparsity heuristics break exactly when you need them most.

How Sparse Distillation Works

FastWan’s answer is to stop treating the two as separate stages and co-train them. Sparse Distillation jointly optimizes a few-step sparse student to match the output distribution of a full-step dense teacher, in one process. It rests on two components:

  • VSA (Video Sparse Attention) — a learnable sparse-attention kernel that drops in as a replacement for FlashAttention. Rather than relying on profiling or fixed heuristics, VSA learns data-dependent sparsity patterns during training (FastWan trains at 0.8 sparsity). Because the pattern is learned rather than inferred from multi-step redundancy, it survives distillation — the team calls it the first sparse-attention mechanism fully compatible with distillation.
  • DMD (Distribution Matching Distillation) — the step-compression engine. It uses three networks: the trainable sparse student, a frozen real-score network (full attention) that anchors the target distribution, and a trainable fake-score network that estimates the student’s own distribution so the gap can be minimized.
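
The interplay of the three DMD networks can be illustrated on a one-dimensional toy problem. This is a sketch under heavy simplifying assumptions, not the FastWan training code: each "network" is a single number, both distributions are unit-variance Gaussians whose scores are known in closed form, and every name (REAL_MEAN, train, fake_mean) is invented for the example. Real DMD also evaluates the scores on noised samples, which the toy skips.

```python
import random

REAL_MEAN = 2.0  # frozen target distribution N(2, 1); hypothetical toy value


def real_score(x: float) -> float:
    """Frozen real score: d/dx log N(x; REAL_MEAN, 1)."""
    return -(x - REAL_MEAN)


def fake_score(x: float, fake_mean: float) -> float:
    """Trainable fake score: d/dx log N(x; fake_mean, 1)."""
    return -(x - fake_mean)


def train(steps: int = 5000, lr: float = 0.01, seed: int = 0):
    rng = random.Random(seed)
    theta = -3.0     # student parameter: samples are x = theta + noise
    fake_mean = 0.0  # fake-score parameter: tracks the student's distribution
    for _ in range(steps):
        # Student update: descend (s_fake - s_real) * dx/dtheta, with dx/dtheta = 1.
        x = theta + rng.gauss(0.0, 1.0)
        grad = fake_score(x, fake_mean) - real_score(x)
        theta -= lr * grad
        # Fake-score update: one SGD step fitting fake_mean to a fresh student sample.
        x_new = theta + rng.gauss(0.0, 1.0)
        fake_mean += lr * (x_new - fake_mean)
    return theta, fake_mean


theta, fake_mean = train()
print(f"student mean {theta:.2f}, fake-score mean {fake_mean:.2f}, target {REAL_MEAN}")
```

The student never sees target samples directly. It only follows the difference between the two scores, which vanishes once the fake score (tracking the student) agrees with the frozen real score.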

Training the two together is the key move: VSA adapts during distillation instead of being bolted on afterward, so the student learns a sparsity pattern that holds up at 3 inference steps.
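
To picture what a sparsity of 0.8 means at the attention level, here is a toy top-k block-sparse attention in plain Python. It is a generic coarse-to-fine scheme chosen for illustration, not VSA's actual kernel: scoring blocks by mean pooling, the function names, and the parameters are all assumptions, and nothing in it is learned.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def block_sparse_attention(q, k, v, block_size, sparsity):
    """Each query block attends only to its top-scoring key blocks.

    q, k, v are lists of equal-length vectors; len(q) must be a
    multiple of block_size. sparsity is the fraction of key blocks skipped.
    """
    n_blocks = len(q) // block_size
    blocks = [range(b * block_size, (b + 1) * block_size) for b in range(n_blocks)]
    keep = max(1, round(n_blocks * (1.0 - sparsity)))
    scale = 1.0 / math.sqrt(len(q[0]))
    # Coarse stage: one pooled vector per block gives cheap block-level scores.
    q_pooled = [mean_pool([q[i] for i in blk]) for blk in blocks]
    k_pooled = [mean_pool([k[i] for i in blk]) for blk in blocks]
    out = [None] * len(q)
    for qb, blk in enumerate(blocks):
        scores = [dot(q_pooled[qb], kp) for kp in k_pooled]
        chosen = sorted(range(n_blocks), key=lambda kb: scores[kb], reverse=True)[:keep]
        key_ids = [j for kb in sorted(chosen) for j in blocks[kb]]
        # Fine stage: ordinary softmax attention restricted to the chosen keys.
        for i in blk:
            w = softmax([scale * dot(q[i], k[j]) for j in key_ids])
            out[i] = [sum(wj * v[j][d] for wj, j in zip(w, key_ids))
                      for d in range(len(v[0]))]
    return out


rng = random.Random(0)
q, k, v = ([[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)] for _ in range(3))
# 20 tokens in 5 blocks of 4; at 0.8 sparsity each query block keeps 1 key block.
sparse_out = block_sparse_attention(q, k, v, block_size=4, sparsity=0.8)
```

With sparsity set to 0.0 the function reduces to dense attention, which makes it easy to check. The point of training with sparsity in the loop is that the model can adapt to whichever blocks end up being kept.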

Bar chart of FastWan denoising time on a single H200 dropping from 95.21 seconds with FlashAttention-2 to 0.98 seconds with VSA, DMD, and torch.compile
Image credit: Hao AI Lab @ UC San Diego

The Numbers

On the 1.3B model, the published H200 denoising times (DiT only) stack up as: 95.21s with FlashAttention-2, 2.88s after adding DMD, 1.49s with DMD on FlashAttention-3 plus torch.compile, and 0.98s with VSA + DMD + torch.compile. End-to-end, FastWan2.1-T2V-1.3B produces a 5-second 480p clip in about 5 seconds on an H200 (about 1s of which is denoising) and roughly 21 seconds on a consumer RTX 4090 (2.8s of denoising). It supports 3-step inference and hits up to 16 FPS generation throughput on a single H100. The larger FastWan2.2-TI2V-5B renders a 5-second 720p clip in about 16 seconds on one H200.
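
A quick check of the arithmetic behind these figures, using only the numbers quoted above. The dictionary keys are shorthand invented for the example, and the end-to-end times are the article's rounded "about" values, so the shares are approximate.

```python
# Denoising time in seconds for the 1.3B model on one H200, as quoted.
denoise_s = {
    "fa2": 95.21,              # FlashAttention-2 baseline
    "fa2_dmd": 2.88,           # after adding DMD
    "fa3_compile": 1.49,       # FlashAttention-3 + torch.compile row
    "vsa_dmd_compile": 0.98,   # VSA + DMD + torch.compile
}

baseline = denoise_s["fa2"]
speedup = {name: baseline / t for name, t in denoise_s.items()}
for name, s in speedup.items():
    print(f"{name}: {s:.1f}x vs. baseline")

# Share of end-to-end wall-clock time spent in the denoising loop.
end_to_end = {"H200": (1.0, 5.0), "RTX 4090": (2.8, 21.0)}  # (denoise s, total s)
denoise_share = {gpu: d / total for gpu, (d, total) in end_to_end.items()}
for gpu, share in denoise_share.items():
    print(f"{gpu}: denoising is about {share:.0%} of end-to-end time")
```

Most of the gain comes from the first step (adding DMD). Once denoising falls to about a second, the rest of the pipeline accounts for the bulk of the end-to-end time on both GPUs.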

Reproducibility is unusually concrete: the 1.3B model was distilled on 64 H200 GPUs for 4,000 steps, which comes to about 768 GPU-hours, or roughly $2,600 at quoted cloud rates. Every input is synthetic: 600k 480p videos and 250k 720p videos generated by Wan2.1-14B, plus 32k from Wan2.2-5B, so the recipe does not depend on licensing third-party footage.
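
The training budget can be unpacked with simple arithmetic. The sketch below uses only the quoted figures and assumes all 64 GPUs ran concurrently for the whole job. The variable names are illustrative.

```python
gpus = 64
gpu_hours = 768
train_steps = 4_000
quoted_cost_usd = 2_600

wall_clock_hours = gpu_hours / gpus                       # hours the job ran
seconds_per_step = wall_clock_hours * 3600 / train_steps  # average step time
usd_per_gpu_hour = quoted_cost_usd / gpu_hours            # implied hourly rate

print(f"wall-clock time: {wall_clock_hours:.0f} hours")
print(f"average step:    {seconds_per_step:.1f} seconds")
print(f"implied rate:    ${usd_per_gpu_hour:.2f} per GPU-hour")
```

Under that assumption the whole distillation run fits in about half a day of wall-clock time.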

What This Means

FastWan reframes “fast video generation” as a training problem rather than an inference trick. Training-free accelerators have to respect whatever the model already learned; by folding sparsity into distillation, FastWan lets the model learn a representation that is fast by construction. The practical upshot is that near-real-time video generation is coming within reach of hardware people actually own, including the RTX 4090 and Apple Silicon, although the 4090 still needs roughly 21 seconds end-to-end for a 5-second clip. The entire recipe is open under Apache-2.0, so the result can be reproduced rather than taken on faith from a closed demo.
