Seaweed APT2 is a groundbreaking streaming video generation model designed for real-time interactive applications. Building on its predecessor (APT1), APT2 employs an autoregressive adversarial post-training paradigm that allows it to generate continuous video frames with minimal latency. At its core, the model produces a single latent frame—equivalent to four video frames—using only one network function evaluation (1NFE), making ultra-low-latency streaming possible on modern GPUs.
In performance benchmarks, the 8-billion-parameter APT2 achieves real-time, nonstop video generation at 736×416 resolution and 24 fps on a single NVIDIA H100 GPU—far outpacing existing diffusion-based approaches. When scaled to higher resolutions, APT2 can stream 1280×720 video at 24 fps across eight H100 GPUs, sustaining one-minute continuous generation without dropping a frame. This leap in throughput opens the door to live content creation, gaming, virtual production, and telepresence experiences where latency and continuity are paramount.
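Combined with the one-latent-equals-four-frames ratio above, these figures imply a hard per-step latency budget. A quick back-of-envelope check (plain arithmetic, not taken from the paper):

```python
FPS = 24               # target playback rate
FRAMES_PER_LATENT = 4  # one latent frame decodes to four video frames

# Each 1NFE generator step must complete before its four frames are due:
step_budget_ms = FRAMES_PER_LATENT / FPS * 1000
print(f"per-step latency budget: {step_budget_ms:.1f} ms")  # 166.7 ms

# One minute of nonstop 24 fps video therefore requires this many steps:
steps_per_minute = 60 * FPS // FRAMES_PER_LATENT
print(f"generator steps per minute: {steps_per_minute}")    # 360
```

In other words, every generator evaluation (plus decoding) has roughly 167 ms to finish, and a one-minute stream must sustain 360 consecutive evaluations without a single one missing its deadline.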
Beyond raw speed, Seaweed APT2 supports interactive control. In a virtual human demo, users supply an initial portrait frame, then drive real-time pose changes, watching the character move fluidly at 24 fps on a single H100 GPU. Similarly, camera-controlled world exploration showcases how APT2 ingests camera displacement and orientation embeddings to render panoramic scenes on the fly. These interactive proofs-of-concept highlight the model’s potential for immersive virtual environments and live storytelling.
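One way to picture that interactive loop is the sketch below. `encode_pose` and `generator_step` are hypothetical placeholders with toy dynamics, since APT2's actual interfaces are not public; only the call pattern—one user input, one generator step, one latent frame—reflects the description above.

```python
import math

def encode_pose(pose):
    """Map a raw user pose vector to a conditioning embedding.
    Placeholder: the real model uses learned control embeddings."""
    norm = math.sqrt(sum(x * x for x in pose)) or 1.0
    return [x / norm for x in pose]

def generator_step(latent, cond):
    """One 1NFE step: predict the next latent frame from the previous
    latent and the control embedding. Toy linear mixing, illustrative
    of the call pattern only, not the network."""
    return [0.9 * prev + 0.1 * c for prev, c in zip(latent, cond)]

# Interactive loop: each user input drives exactly one generator step,
# which yields one latent frame (= four video frames at 24 fps).
latent = [0.0] * 8                 # stand-in for the encoded portrait frame
for t in range(3):                 # three simulated pose updates
    pose = [float(t + 1)] * 8      # stand-in for live user input
    latent = generator_step(latent, encode_pose(pose))
```

The same structure applies to the camera-controlled demo: swap the pose embedding for camera displacement and orientation embeddings, and the loop is otherwise unchanged.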
Under the hood, APT2’s architecture resembles a large language model with block causal attention and a KV cache, enabling constant-time autoregressive inference. A generator network predicts the next latent frame, while a matching discriminator evaluates frame fidelity using relativistic GAN losses and R1/R2 regularization. Both networks initialize from a pretrained bidirectional video diffusion model, then undergo autoregressive adversarial post-training (AAPT) to transform into a high-throughput streaming pipeline. Detailed comparisons show APT2 maintains visual fidelity far longer than competing diffusion-forcing methods, which begin to degrade after just a few seconds of generation.
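The block-causal attention and KV-cache mechanics can be illustrated with a minimal plain-Python sketch. `block_causal_mask` and `generate_block` are illustrative names with placeholder tensors, not APT2's actual code; only the masking pattern and the append-only cache reflect the design described above.

```python
def block_causal_mask(num_blocks, block_size):
    """mask[i][j] is True where query token i may attend to key token j:
    tokens see everything inside their own latent frame (block) and all
    earlier blocks, but never future blocks -- LLM-style causality
    applied at block granularity."""
    total = num_blocks * block_size
    return [[(j // block_size) <= (i // block_size) for j in range(total)]
            for i in range(total)]

# With a KV cache, each autoregressive step appends only the new block's
# keys/values instead of re-encoding the whole history, so the per-step
# cost stays constant as the stream grows.
kv_cache = []  # one (key, value) entry per generated latent frame

def generate_block(step, cache):
    key = value = f"kv-{step}"   # placeholder for the block's key/value tensors
    cache.append((key, value))   # future steps attend over this cached history
    return f"latent-{step}"      # placeholder for the predicted latent frame

latents = [generate_block(s, kv_cache) for s in range(3)]
print(latents)  # ['latent-0', 'latent-1', 'latent-2']
```

The full-attention-within-a-block pattern is what lets each latent frame stay internally coherent, while the causal structure across blocks is what makes cached streaming inference possible.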
Despite its impressive capabilities, APT2 has limitations: fast-motion scenarios can still challenge the single-evaluation design, and sliding-window attention may struggle with very long-distance dependencies. Occasional physics violations and subject drift appear in extended streams. The Seaweed research team plans further work on human-preference alignment, memory extension, and robustness improvements. To explore the full details and see video samples, visit the project page or read the research paper on arXiv.