Alibaba’s Qwen3-Omni series marks a significant advancement in multimodal AI, offering three specialized models tailored for diverse applications. These models—Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner—are designed to handle text, images, audio, and video with enhanced efficiency and performance.
Key Features
State-of-the-Art Multimodal Support: Combines early text-first pretraining with mixed multimodal training, achieving strong results across audio, video, and text tasks, with ASR and voice-conversation performance comparable to Gemini 2.5 Pro.
Multilingual Capabilities: Supports 119 text languages, 19 speech-input languages, and 10 speech-output languages, making it well suited to global applications and real-time interaction.
Innovative Architecture: Uses a MoE-based Thinker–Talker design with AuT (audio transformer) pretraining for strong general representations, while a multi-codebook speech design keeps generation latency low.
Real-Time Interaction: Delivers immediate text or speech responses, supporting natural turn-taking in audio/video interactions.
Flexible Control: Allows customization via system prompts for tailored behavior, making it adaptable to various use cases (see the sketch after this list).
Open-Source Audio Captioner: The Captioner model is now open source, addressing a critical gap in audio captioning with detailed, low-hallucination outputs.
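As a concrete example of that prompt-level control, here is a minimal sketch that steers the model's persona and answer format through a system message. It assumes the Instruct model is served behind an OpenAI-compatible endpoint (for example, a local vLLM server); the base URL and served model name below are placeholders, not a confirmed deployment recipe.

```python
# Steering Qwen3-Omni behavior through a system prompt.
# Assumption: the model sits behind an OpenAI-compatible endpoint
# (e.g., a local vLLM server); base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder endpoint
    api_key="EMPTY",                      # local servers usually ignore the key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    messages=[
        # The system prompt tailors persona, language, and response length.
        {
            "role": "system",
            "content": (
                "You are a concise support agent. Answer in at most two "
                "sentences, and always end by asking whether the user "
                "needs anything else."
            ),
        },
        {"role": "user", "content": "How do I reset my router?"},
    ],
)
print(response.choices[0].message.content)
```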
Model Overview
| Model Name | Description |
| --- | --- |
| Qwen3-Omni-30B-A3B-Instruct | Contains both the Thinker and Talker components; accepts audio, video, and text input and produces audio and text output. |
| Qwen3-Omni-30B-A3B-Thinking | Contains the Thinker with chain-of-thought reasoning; accepts audio, video, and text input and produces text output. |
| Qwen3-Omni-30B-A3B-Captioner | Fine-tuned for detailed, low-hallucination audio captioning; open-sourced for community use. |
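To make the Captioner concrete, below is a hedged sketch of local inference following the transformers usage pattern published for the Qwen omni models. The class names, the qwen_omni_utils helper, and the generate flags are drawn from that pattern and should be treated as assumptions; they may differ across transformers versions.

```python
# Hedged sketch: detailed audio captioning with the open-source Captioner.
# Assumptions: a recent transformers build with Qwen3-Omni support, plus the
# qwen-omni-utils helper package from the Qwen repository. Class names and
# keyword arguments follow the published Qwen omni usage pattern and may
# differ in your installed versions.
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # assumption: helper from the Qwen repo

model_id = "Qwen/Qwen3-Omni-30B-A3B-Captioner"
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(model_id)

# The Captioner takes a single audio clip; no text prompt is required.
conversation = [
    {"role": "user", "content": [{"type": "audio", "audio": "sample.wav"}]}
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

# Text-only output: a detailed, low-hallucination caption of the clip.
# return_audio=False follows the Qwen omni generate convention (assumption).
output_ids = model.generate(**inputs, max_new_tokens=512, return_audio=False)
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```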