Voxtral Mini 3B & Small 24B — Frontier Open‑Source Speech Understanding by Mistral AI
Introduction
On July 15, 2025, Mistral AI released Voxtral, a family of open-source, multilingual speech understanding models that the company describes as state-of-the-art. The family comes in two sizes: a compact Mini (3B) for local or edge use and a larger Small (24B) for production environments. Both are released under the Apache 2.0 license and are available via Hugging Face and Mistral’s API (Wikipedia, Mistral AI).
What Makes Voxtral Stand Out
- Open, affordable, production-ready: Voxtral aims to bridge the gap between error-prone open-source ASR systems and expensive proprietary APIs, offering high-quality transcription and understanding at less than half the cost of comparable commercial APIs (Mistral AI).
- Flexible deployment: Use locally on devices or edge hardware (Mini 3B), or deploy at scale in cloud or enterprise infrastructure (Small 24B), with optimized endpoints for transcription-specific workloads (Mistral AI).
- API & Le Chat integration: Voxtral is available through Mistral’s API, starting at $0.001 per minute of audio, and is being rolled out in voice mode within Le Chat, enabling real-time transcription, Q&A, and summaries (Mistral AI).
Capabilities
🧠 Context & Intelligence
- Handles a context of up to 32k tokens, enough for roughly 30 minutes of audio for transcription or 40 minutes for understanding tasks such as Q&A and summarization (Mistral AI).
- Goes beyond transcription: supports spoken Q&A, summarization, and function calling triggered directly by voice, so spoken requests can drive downstream workflows (Mistral AI).
- Full multilingual support with automatic detection and high performance across major languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, Arabic, etc.) (Mistral AI).
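The roughly 30-minute transcription limit listed above implies that longer recordings have to be split before they are transcribed. Here is a minimal sketch of that planning step, assuming a simple non-overlapping, fixed-length split. The function name `plan_chunks` and the splitting strategy are illustrative assumptions, not part of Mistral’s tooling or documentation.

```python
from typing import List, Tuple

# Approximate per-request transcription limit stated in the article (minutes).
MAX_TRANSCRIPTION_MINUTES = 30.0


def plan_chunks(
    duration_min: float,
    max_min: float = MAX_TRANSCRIPTION_MINUTES,
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in minutes that cover the whole recording.

    Hypothetical helper: chunks do not overlap, and each is at most
    `max_min` minutes long.
    """
    if max_min <= 0:
        raise ValueError("max_min must be positive")
    duration = float(duration_min)
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + max_min, duration)
        chunks.append((start, end))
        start = end
    return chunks
```

For a 75-minute recording, `plan_chunks(75)` yields three segments: two full 30-minute chunks and a final 15-minute one. A real pipeline would probably cut at pauses rather than at fixed offsets so that words are not split across chunks.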
🥇 Benchmarks & Accuracy
- Transcription: Voxtral Small outperforms OpenAI’s Whisper large-v3, GPT‑4o mini, Gemini 2.5 Flash, and ElevenLabs Scribe across short-form, long-form and multilingual benchmarks—including LibriSpeech, Common Voice, FLEURS, and more (Mistral AI).
- Audio Understanding: On Q&A benchmarks and speech translation (e.g. FLEURS-Translation), Voxtral Small matches or surpasses GPT-4o mini and Gemini 2.5 Flash (Mistral AI).
- Text comprehension: Retains the full text understanding strengths of its language model backbone (Mistral Small 3.1) (Mistral AI).
Pricing
Pricing through Mistral’s API (the model weights can also be downloaded from Hugging Face) (Wikipedia, Mistral AI):
| Model | Audio Input | Text Output |
|---|---|---|
| Voxtral Mini 3B | $0.001/min | $0.04/M tokens |
| Voxtral Small 24B | $0.004/min | $0.10/M tokens |

This is less than half the price of comparable commercial offerings such as the Whisper API or ElevenLabs Scribe.
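To make the per-minute and per-token rates concrete, the sketch below estimates the bill for a given workload using only the figures from the table above. The dictionary keys and the `estimate_cost` helper are hypothetical labels chosen for illustration; they are not official API model identifiers, and actual billing may differ from this simple sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    audio_per_min: float            # USD per minute of audio input
    text_per_million_tokens: float  # USD per million text output tokens


# Rates copied from the pricing table; keys are illustrative labels only.
PRICES = {
    "voxtral-mini-3b": Price(audio_per_min=0.001, text_per_million_tokens=0.04),
    "voxtral-small-24b": Price(audio_per_min=0.004, text_per_million_tokens=0.10),
}


def estimate_cost(model: str, audio_minutes: float, output_tokens: int = 0) -> float:
    """Estimate cost in USD as audio cost plus text output cost."""
    if audio_minutes < 0 or output_tokens < 0:
        raise ValueError("audio_minutes and output_tokens must be non-negative")
    price = PRICES[model]
    audio_cost = audio_minutes * price.audio_per_min
    text_cost = output_tokens / 1_000_000 * price.text_per_million_tokens
    return audio_cost + text_cost
```

For example, `estimate_cost("voxtral-mini-3b", 600, 500_000)` comes to roughly $0.62 for ten hours of audio plus half a million output tokens, while one hour of audio on the Small model costs about $0.24 before any text output.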
Getting Started
- Download or API:
  - Download the weights for both models from Hugging Face.
  - Use the API for integration; a dedicated endpoint optimized for transcription is available.
- Use in Le Chat Voice Mode:
  - Upload or record audio to get transcriptions, Q&A, and summaries on web or mobile (Mistral AI).
- Enterprise Features:
  - Private on-prem deployment, multi-GPU scaling, fine-tuning, speaker segmentation, emotion detection, word-level timestamps, and non-speech audio recognition are in the pipeline (Mistral AI).
Why It Matters
Voxtral democratises voice AI by combining transcription, semantic understanding, multilingual fluency, custom workflow triggers, and long-form context—all in a cost-effective, open-source package. Its versatility makes it ideal for applications ranging from voice agents and podcasts to support systems and business intelligence.
What’s Next
- Live webinar: On August 6, 2025, Mistral will host a session (in collaboration with Inworld.ai) demonstrating voice-to-voice agents with Voxtral and Inworld TTS (Mistral AI).
- Feature roadmap: Mistral expects to add speaker diarization, emotion analysis, timestamps, non-speech recognition, and expanded context windows soon (Mistral AI).
- Hiring: Mistral is actively expanding its audio team to further advance voice intelligence (Mistral AI).

