Introducing V-JEPA 2: Meta’s Self-Supervised Video World Model for Understanding, Prediction, and Planning

Meta’s FAIR research team has just released V-JEPA 2, a cutting-edge world model trained on large-scale video data that enables AI agents to understand, predict, and even plan in the physical world. Building on the original JEPA (Joint Embedding Predictive Architecture) framework, V-JEPA 2 is pretrained on over 1 million hours of internet video and 1 million images, learning rich spatio-temporal representations without any manual annotations (arxiv.org).

How V-JEPA 2 Works

  • Joint-Embedding Predictive Architecture
    V-JEPA 2 encodes video into a shared latent space and trains a predictor to output the embeddings of masked or future spatio-temporal regions from the visible context. Because the objective is scored in embedding space rather than pixel space, the model is pushed to capture high-level semantics such as object motion and interactions instead of low-level appearance details.
  • Scale of Pretraining
    Trained on more than 1 million hours of unlabeled video sourced from the web, V-JEPA 2 learns the dynamics of the physical world without any human supervision (arxiv.org).
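
The predictive objective above can be sketched in a few lines. This is a toy NumPy illustration, not the actual architecture: the linear "encoders" and "predictor" and all dimensions are invented stand-ins for the model's ViT-based video encoder, and the target encoder is simply a frozen copy rather than a true EMA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: T tokens, each a D-dim patch embedding.
T, D = 8, 16

def encode(patches, W):
    """Toy linear 'encoder' standing in for the ViT video encoder."""
    return patches @ W

# Fake video patches for the visible (context) and masked (target) regions.
context_patches = rng.normal(size=(T, D))
target_patches = rng.normal(size=(T, D))

W_enc = rng.normal(size=(D, D)) / np.sqrt(D)   # online encoder weights
W_tgt = W_enc.copy()                           # target encoder (frozen copy here)
W_pred = rng.normal(size=(D, D)) / np.sqrt(D)  # predictor weights

# JEPA objective: predict the target-region embeddings from the context
# embeddings, and score the prediction in latent space -- no pixel
# reconstruction anywhere.
z_context = encode(context_patches, W_enc)
z_target = encode(target_patches, W_tgt)       # treated as constant (stop-gradient)
z_pred = z_context @ W_pred

loss = np.mean((z_pred - z_target) ** 2)       # L2 distance in embedding space
print(f"latent prediction loss: {loss:.4f}")
```

Minimizing this latent-space loss (rather than reconstructing pixels) is what lets the model ignore unpredictable low-level detail and focus capacity on semantics.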

State-of-the-Art Performance

V-JEPA 2 sets new records on several benchmarks:

  • Motion Understanding: 77.3% top-1 accuracy on Something-Something v2, outperforming previous task-specific models.
  • Action Anticipation: 39.7% recall@5 on Epic-Kitchens-100.
  • Video Question Answering: When aligned with an 8B-parameter language model, V-JEPA 2 achieves 84.0 on PerceptionTest and 76.9 on TempCompass (arxiv.org).

Extending to Robotic Planning: V-JEPA 2-AC

Beyond perception, Meta introduced V-JEPA 2-AC, an action-conditioned variant fine-tuned on only 62 hours of unlabeled robot video from the DROID dataset. Without any task-specific rewards or extra data collection, V-JEPA 2-AC enables zero-shot planning on real Franka robot arms, successfully executing pick-and-place tasks in previously unseen environments (arxiv.org).
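
Conceptually, this kind of planning rolls candidate action sequences forward through the action-conditioned world model in latent space and executes the sequence whose predicted future embedding lands closest to the embedding of a goal image. The sketch below uses simple random shooting (the paper's planner is a cross-entropy-method optimizer); the linear latent dynamics and every dimension are invented stand-ins for the real model:

```python
import numpy as np

rng = np.random.default_rng(1)

D, A, H, K = 12, 4, 5, 256  # latent dim, action dim, horizon, num candidates

# Invented linear latent dynamics standing in for the learned
# action-conditioned predictor.
W_z = np.eye(D) * 0.9
W_a = rng.normal(size=(A, D)) * 0.5

def rollout(z0, actions):
    """Predict the latent state after applying a sequence of actions."""
    z = z0
    for a in actions:
        z = z @ W_z + a @ W_a
    return z

z_current = rng.normal(size=D)  # embedding of the current camera frame
z_goal = rng.normal(size=D)     # embedding of the goal image

# Random shooting: sample K candidate action sequences, score each by the
# latent-space distance between its predicted endpoint and the goal, and
# keep the best one.
candidates = rng.normal(size=(K, H, A))
costs = np.array([np.sum((rollout(z_current, seq) - z_goal) ** 2)
                  for seq in candidates])
best = candidates[np.argmin(costs)]

print("first action of best plan:", best[0])
```

In a receding-horizon loop, the robot would execute only `best[0]`, observe the new frame, re-encode, and replan, which is what makes goal-image planning work without any reward signal.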

New Benchmarks for Physical Reasoning

Alongside the model release, Meta published three new benchmarks, IntPhys 2, Minimal Video Pairs (MVPBench), and CausalVQA, focused on causal and counterfactual reasoning in physical-world video and designed to evaluate an AI's ability to answer "what-if" and "why" questions from footage (about.fb.com).

Try It Yourself

A variety of V-JEPA 2 variants—differing in model size (ViT-L, ViT-H, ViT-G) and input resolution—are available on Hugging Face:

https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6

You can experiment with video classification, action anticipation, and world-model planning tasks using these pretrained checkpoints.