Alibaba Unveils Qwen3-Omni Series: Revolutionizing Multimodal AI with Advanced Capabilities

Alibaba’s Qwen3-Omni series marks a significant advancement in multimodal AI, offering three specialized models tailored for diverse applications. These models—Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner—are designed to handle text, images, audio, and video with enhanced efficiency and performance.

Key Features

  • State-of-the-Art Multimodal Support: Combines early text-first pretraining with mixed multimodal training to achieve strong results across audio, video, and text tasks, with ASR and voice-conversation performance comparable to Gemini 2.5 Pro.
  • Multilingual Capabilities: Supports 119 text languages and 19 speech input languages, with 10 speech output languages. This makes it ideal for global applications and real-time interactions.
  • Innovative Architecture: Uses a MoE-based Thinker–Talker design with AuT pretraining for efficient processing and low-latency responses; a multi-codebook design further reduces latency in streaming speech generation.
  • Real-Time Interaction: Delivers immediate text or speech responses, supporting natural turn-taking in audio/video interactions.
  • Flexible Control: Behavior can be customized via system prompts, making the models adaptable to a range of use cases (see the sketch after this list).
  • Open-Source Audio Captioner: The Captioner model is now open source, addressing a critical gap in audio captioning with detailed, low-hallucination outputs.
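Since the models are distributed through Hugging Face, a short example helps make the flexible-control point concrete. The sketch below runs the Instruct model on a mixed audio-and-text turn with a custom system prompt. It assumes the transformers integration follows the pattern Qwen published for its earlier Omni releases; the class names Qwen3OmniMoeForConditionalGeneration and Qwen3OmniMoeProcessor, the qwen_omni_utils helper, and the return_audio flag are taken from that pattern and should be verified against the model card, and the audio path is hypothetical.

```python
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with Qwen's Omni examples

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

conversation = [
    # The system prompt is the customization hook described above.
    {"role": "system", "content": [
        {"type": "text", "text": "You are a concise assistant. Answer in one sentence."},
    ]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "clips/question.wav"},  # hypothetical local file
        {"type": "text", "text": "Transcribe the clip, then answer it."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

# return_audio=False requests text only; omit it to also receive synthesized speech.
text_ids = model.generate(**inputs, max_new_tokens=256, return_audio=False)
print(processor.batch_decode(
    text_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```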

Model Overview

  • Qwen3-Omni-30B-A3B-Instruct: Contains both the thinker and talker components; accepts audio, video, and text input and produces audio or text output.
  • Qwen3-Omni-30B-A3B-Thinking: Focused on chain-of-thought reasoning; accepts audio, video, and text input and produces text output.
  • Qwen3-Omni-30B-A3B-Captioner: Specialized for detailed audio captioning, open-sourced for community use (see the sketch below).
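To make the Captioner's role concrete, the sketch below produces a detailed caption for a single clip. It relies on the same assumed transformers classes, helper, and generate flags as the Instruct example above; no system prompt is used, since captioning is the model's sole task, and the clip path is again hypothetical.

```python
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Captioner"  # the open-sourced captioner checkpoint

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

# A bare audio turn is enough: the model captions whatever it hears.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio": "clips/street_scene.wav"},  # hypothetical clip
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

caption_ids = model.generate(**inputs, max_new_tokens=512, return_audio=False)
print(processor.batch_decode(
    caption_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```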

For more details, refer to the Qwen3-Omni Technical Report and the Captioner Cookbook.

Explore the models on Hugging Face: Instruct, Thinking, and Captioner.