Alibaba Unveils Qwen3-Omni Series: Revolutionizing Multimodal AI with Advanced Capabilities

September 23, 2025Provided by Utku Ege Tuluk

Alibaba’s Qwen3-Omni series marks a significant advancement in multimodal AI, offering three specialized models tailored for diverse applications. These models—Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner—are designed to handle text, images, audio, and video with enhanced efficiency and performance.

Key Features

State-of-the-Art Multimodal Support: Combines early text-first pretraining with mixed modal training, achieving strong results across audio, video, and text tasks. It outperforms many benchmarks, including ASR and voice conversation metrics comparable to Gemini 2.5 Pro.
Multilingual Capabilities: Supports 119 text languages and 19 speech input languages, with 10 speech output languages. This makes it ideal for global applications and real-time interactions.
Innovative Architecture: Utilizes a MoE-based Thinker–Talker design with AuT pretraining, enabling efficient processing and low-latency responses. The multi-codebook design further optimizes performance.
Real-Time Interaction: Delivers immediate text or speech responses, supporting natural turn-taking in audio/video interactions.
Flexible Control: Allows customization via system prompts for tailored behavior, making it adaptable to various use cases.
Open-Source Audio Captioner: The Captioner model is now open source, addressing a critical gap in audio captioning with detailed, low-hallucination outputs.

Model Overview

Model Name	Description
Qwen3-Omni-30B-A3B-Instruct	Combines both thinker and talker components for audio, video, and text input, with audio/text output.
Qwen3-Omni-30B-A3B-Thinking	Focuses on chain-of-thought reasoning for text-based tasks, supporting audio/video/text input with text output.
Qwen3-Omni-30B-A3B-Captioner	Specialized for detailed audio captioning, open-sourced for community use.

For more details, refer to the Qwen3-Omni Technical Report and the Captioner Cookbook.

Explore the models on Hugging Face, Thinking, and Captioner.

Key Features

Model Overview

New York University