Meta’s FAIR research team has just released V-JEPA 2, a cutting-edge world model trained on large-scale video data that enables AI agents to understand, predict, and even plan in the physical world. Building on the original JEPA (Joint Embedding Predictive Architecture) framework, V-JEPA 2 is pretrained on over 1 million hours of internet video and 1 million images, learning rich spatio-temporal representations without any manual annotations (arxiv.org).
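To make the core idea concrete, here is a minimal sketch of a JEPA-style objective in PyTorch: instead of reconstructing pixels, a small predictor is trained to match the embeddings of masked video patches produced by a separate target encoder. All module names, sizes, and the simple regression loss are illustrative assumptions for this sketch, not Meta's actual architecture (which uses large ViT encoders and an EMA-updated target).

```python
# Sketch of a JEPA-style masked latent-prediction loss (illustrative only).
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the video encoder: a patch embed plus one transformer block."""
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(768, dim)  # 768 = flattened patch values (assumed)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, patches):               # patches: (B, N, 768)
        return self.block(self.proj(patches)) # (B, N, dim)

def jepa_loss(context_enc, target_enc, predictor, patches, mask):
    """Regress predicted features onto target features for the masked tokens.

    mask: (B, N) boolean, True where a patch is hidden from the context encoder.
    """
    # Target encoder sees the full clip; gradients are stopped (in practice the
    # target weights track the context encoder via an exponential moving average).
    with torch.no_grad():
        targets = target_enc(patches)

    # Context encoder sees only the visible patches (masked ones zeroed for brevity).
    visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
    context = context_enc(visible)

    # Predictor maps context features to estimates of the masked targets.
    preds = predictor(context)
    return (preds - targets)[mask].pow(2).mean()

B, N = 2, 16                                  # 2 clips, 16 patch tokens each
patches = torch.randn(B, N, 768)
mask = torch.rand(B, N) < 0.5                 # hide roughly half the tokens

context_enc, target_enc = TinyEncoder(), TinyEncoder()
target_enc.load_state_dict(context_enc.state_dict())
predictor = nn.Linear(128, 128)

print(jepa_loss(context_enc, target_enc, predictor, patches, mask).item())
```

The key design choice this illustrates: the loss lives in representation space, so the model never has to predict pixel-level detail that is irrelevant for understanding or planning.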
V-JEPA 2 sets new records on several benchmarks:

- Motion understanding: 77.3% top-1 accuracy on Something-Something v2
- Human action anticipation: 39.7 recall-at-5 on Epic-Kitchens-100
- Video question answering: state-of-the-art results on benchmarks such as PerceptionTest and TempCompass when the encoder is aligned with a language model (arxiv.org)
Beyond perception, Meta introduced V-JEPA 2-AC, an action-conditioned variant fine-tuned on only 62 hours of unlabeled robot video from the DROID dataset. Without task-specific rewards or any additional data collection, V-JEPA 2-AC enables zero-shot planning on real Franka robot arms, successfully executing pick-and-place tasks in previously unseen environments (arxiv.org).
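Planning here means energy minimization in representation space: given a goal image, candidate action sequences are scored by how close the predictor's rolled-out latent lands to the goal embedding, and the best sequence's first action is executed. Below is a toy sketch of such a loop using the cross-entropy method (CEM); `encode`, `predict`, and the linear dynamics map `A` are stand-ins I introduce for illustration, not the released model's API.

```python
# Schematic goal-image planning with an action-conditioned world model via CEM.
import torch

D, ACT = 1024, 7                          # latent and action sizes (assumed)
A = 0.01 * torch.randn(ACT, D)            # toy action-to-latent dynamics map

def encode(obs):
    """Stand-in for the frozen video encoder: returns a (D,) feature."""
    return obs.mean(dim=-1)

def predict(z, actions):
    """Stand-in for the action-conditioned predictor: toy linear rollout."""
    for a in actions:                      # actions: (horizon, ACT)
        z = z + a @ A
    return z

def plan_cem(z0, z_goal, horizon=5, pop=64, elites=8, iters=10):
    """Cross-entropy method: sample sequences, score, refit to elites, repeat."""
    mean = torch.zeros(horizon, ACT)
    std = torch.ones(horizon, ACT)
    for _ in range(iters):
        cand = mean + std * torch.randn(pop, horizon, ACT)
        # Energy of a candidate: distance from predicted latent to goal latent.
        energy = torch.stack([(predict(z0, seq) - z_goal).norm() for seq in cand])
        elite = cand[energy.topk(elites, largest=False).indices]
        mean, std = elite.mean(0), elite.std(0) + 1e-4
    return mean[0]                         # execute first action, then replan (MPC)

z0 = encode(torch.randn(D, 16))           # current observation's embedding
z_goal = encode(torch.randn(D, 16))       # goal image's embedding
print(plan_cem(z0, z_goal))
```

Because scoring happens entirely in the learned latent space, the robot needs no reward function or demonstrations in the new scene, only a goal image.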
Alongside the model release, Meta published three new benchmarks (IntPhys 2, MVPBench, and CausalVQA) targeting physical plausibility, causal, and counterfactual reasoning in real-world video, designed to evaluate an AI's ability to answer "what-if" and "why" questions from footage (about.fb.com).
Several V-JEPA 2 variants, differing in encoder size (ViT-L, ViT-H, ViT-g) and input resolution, are available on Hugging Face:
https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6
You can experiment with video classification, action anticipation, and world-model planning tasks using these pretrained checkpoints.
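As a starting point, a checkpoint can be loaded for feature extraction roughly as follows. This sketch assumes the Hugging Face `transformers` integration and a checkpoint ID following the collection's naming scheme; verify the exact model IDs and processor usage against each model card.

```python
# Minimal sketch: load a V-JEPA 2 checkpoint and extract video features.
# The checkpoint ID and processor call are assumptions; see the model cards.
import torch
from transformers import AutoModel, AutoVideoProcessor

ckpt = "facebook/vjepa2-vitl-fpc64-256"  # assumed ID: ViT-L, 64 frames, 256px
processor = AutoVideoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# Dummy clip: 64 RGB frames at 256x256, channels-last.
video = torch.randint(0, 256, (64, 256, 256, 3), dtype=torch.uint8)
inputs = processor(list(video.numpy()), return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level spatio-temporal features, usable for classification probes,
# action-anticipation heads, or as the state encoder for planning.
print(outputs.last_hidden_state.shape)
```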