Alibaba Unveils Qwen3-Omni Series: Revolutionizing Multimodal AI with Advanced Capabilities
Alibaba’s Qwen3-Omni series marks a significant advancement in multimodal AI, offering three specialized models tailored for diverse applications. These models—Qwen3-Omni-30B-A3B-Instruct, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner—are designed to handle text, images, audio, and video with enhanced efficiency and performance.
Key Features
- State-of-the-Art Multimodal Support: Combines early text-first pretraining with mixed modal training, achieving strong results across audio, video, and text tasks. It outperforms many benchmarks, including ASR and voice conversation metrics comparable to Gemini 2.5 Pro.
- Multilingual Capabilities: Supports 119 text languages and 19 speech input languages, with 10 speech output languages. This makes it ideal for global applications and real-time interactions.
- Innovative Architecture: Utilizes a MoE-based Thinker–Talker design with AuT pretraining, enabling efficient processing and low-latency responses. The multi-codebook design further optimizes performance.
- Real-Time Interaction: Delivers immediate text or speech responses, supporting natural turn-taking in audio/video interactions.
- Flexible Control: Allows customization via system prompts for tailored behavior, making it adaptable to various use cases.
- Open-Source Audio Captioner: The Captioner model is now open source, addressing a critical gap in audio captioning with detailed, low-hallucination outputs.
Model Overview
| Model Name | Description |
|---|---|
| Qwen3-Omni-30B-A3B-Instruct | Combines both thinker and talker components for audio, video, and text input, with audio/text output. |
| Qwen3-Omni-30B-A3B-Thinking | Focuses on chain-of-thought reasoning for text-based tasks, supporting audio/video/text input with text output. |
| Qwen3-Omni-30B-A3B-Captioner | Specialized for detailed audio captioning, open-sourced for community use. |
For more details, refer to the Qwen3-Omni Technical Report and the Captioner Cookbook.
Explore the models on Hugging Face, Thinking, and Captioner.


沪公网安备31011502017015号