Wan2.2: Alibaba’s Open‑Source Breakthrough in AI Video Generation

The Wan2.2 project (hosted at Wan‑Video/Wan2.2) marks another leap forward in large‑scale, open, consumer‑accessible video generation from the Wan‑AI team at Alibaba Cloud (GitHub). Released on July 28, 2025, the upgrade brings major technical advances over the previous Wan2.1 release (Hugging Face).

🚀 Key Innovations in Wan2.2

1. Mixture‑of‑Experts (MoE) Architecture

Wan2.2’s A14B model uses a Mixture‑of‑Experts (MoE) mechanism that combines two specialized expert sub‑models: one for early high‑noise denoising and another for later fine‑detail refinement. While the model totals roughly 27 billion parameters, only about 14 billion are active per inference step, so inference cost stays close to that of a single 14B expert while capacity and performance increase (Hugging Face).
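
Conceptually, the two experts act as a single denoiser whose weights are swapped partway through the diffusion trajectory. The sketch below is a simplification: the boundary timestep t_b and the exact hand‑off criterion (reportedly derived from a signal‑to‑noise‑ratio threshold) are not spelled out here.

$$
\hat{\epsilon}_\theta(x_t, t, c) =
\begin{cases}
\epsilon_{\text{high-noise}}(x_t, t, c), & t \ge t_b \quad \text{(early steps: global layout and motion)} \\
\epsilon_{\text{low-noise}}(x_t, t, c), & t < t_b \quad \text{(late steps: fine-detail refinement)}
\end{cases}
$$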

2. Cinematic‑Level Aesthetic Control

The training pipeline includes finely labeled aesthetic datasets (lighting, composition, color maps, contrast), enabling precise control over cinematic styles in generated video outputs (Hugging Face).

3. Expanded & Diverse Training Data

Compared to Wan2.1, Wan2.2 is trained with 65.6% more images and 83.2% more videos, significantly improving generalization across motion patterns, semantics, and visual quality. As a result, it outperforms both leading open‑source and closed‑source models on the Wan‑Bench 2.0 benchmark (Hugging Face).

4. Lightweight 5B TI2V Model for 720p@24fps

The TI2V‑5B variant integrates text‑to‑video (T2V) and image‑to‑video (I2V) capabilities within one high‑compression model. Its custom Wan2.2‑VAE achieves a 64× compression ratio, enabling 720p output at 24 fps in under nine minutes on an RTX 4090, which places it among the fastest models at that resolution on consumer hardware (Hugging Face).

Available Models & Features

| Model | Configuration | Supports | Notes |
| --- | --- | --- | --- |
| A14B (MoE) | 2×14B experts | Text‑to‑Video, Image‑to‑Video | Excellent quality; memory cost similar to a single expert (Hugging Face, Wan AI) |
| TI2V‑5B | Dense 5B + high‑compression VAE | Unified T2V & I2V | Efficient; 720p@24fps on a single consumer GPU (Hugging Face) |

Getting Started

git clone https://github.com/Wan-Video/Wan2.2.git
cd Wan2.2
pip install -r requirements.txt

Model download is supported via the Hugging Face or ModelScope CLI. Examples: Wan2.2-T2V-A14B, Wan2.2-I2V-A14B, and Wan2.2-TI2V-5B. The TI2V‑5B model supports both T2V and I2V at 720p (Hugging Face).
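
For instance, the TI2V‑5B checkpoint can be pulled with the Hugging Face CLI; the Wan-AI repository ID and the ModelScope invocation below follow the naming used on the model cards and in the project README, so verify them before downloading.

pip install "huggingface_hub[cli]"
# Download the TI2V-5B weights into a local folder
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
# Equivalent download via the ModelScope CLI
modelscope download Wan-AI/Wan2.2-TI2V-5B --local_dir ./Wan2.2-TI2V-5B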

Inference scripts offer options for prompt extension and memory offloading to optimize VRAM use. ComfyUI and Diffusers integration is already available as of the July 28 release (GitHub).
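
As an illustration, a single‑GPU TI2V‑5B run with the memory‑saving options might look like the sketch below; the script name, flags, and resolution mirror the examples in the repository's README rather than a verified interface, so check generate.py --help for the exact options.

# Text-to-video on one consumer GPU, offloading model weights and the T5 encoder to CPU
python generate.py \
  --task ti2v-5B \
  --size 1280*704 \
  --ckpt_dir ./Wan2.2-TI2V-5B \
  --offload_model True \
  --t5_cpu \
  --prompt "A slow cinematic pan across a rain-soaked neon street at night"
# Prompt extension can reportedly be enabled with --use_prompt_extend (see the README for supported methods)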

Community & Development

Wan2.2 is open to community projects and has already been integrated into ComfyUI workflows and Hugging Face Spaces. Ongoing development includes plans for multi‑GPU inference support, additional model checkpoints, and fuller integration with ComfyUI and Diffusers (GitHub).

Community discussion on GitHub includes questions about video length limits and MoE tuning. The project is under active development, with new issues and contributions appearing daily (GitHub).

Why It Matters

  • Open‑source leadership: Wan2.2 is fully Apache‑2.0 licensed and accessible to developers, researchers, and creators worldwide (GitHub).
  • Consumer‑grade access: Even the capable 5B model runs efficiently on GPUs like the RTX 4090, broadening access beyond high‑end clusters.
  • Visual quality leaps: The MoE architecture and aesthetic‑labeled training data yield richer, more cinematic, and more precisely controlled video output.

Summary

Wan2.2 delivers a major upgrade over Wan2.1 in video generative performance, efficiency, and aesthetic control. With both high‑fidelity MoE models and a streamlined high‑compression variant, it balances quality with accessibility. For anyone interested in working with open video generation models, Wan2.2 is a standout release.