Voxtral Mini 3B & Small 24B — Frontier Open‑Source Speech Understanding by Mistral AI

July 16, 2025 · Provided by Utku Ege Tuluk

Introduction

On July 15, 2025, Mistral AI launched Voxtral, a new family of speech understanding models offering state-of-the-art, multilingual, and open-source voice AI. Available in two sizes—a compact Mini 3B for local or edge use and a larger Small (24B) for production environments—both are offered under an Apache 2.0 license and accessible via Hugging Face and Mistral’s API (Wikipedia, Mistral AI).

What Makes Voxtral Stand Out

Open, affordable, production-ready: Voxtral bridges the gap between error-prone open-source ASR systems and expensive proprietary APIs. According to Mistral, it offers high-quality transcription and understanding at less than half the price of comparable commercial APIs (Mistral AI).
Flexible deployment: Use locally on devices or edge hardware (Mini 3B), or deploy at scale in cloud or enterprise infrastructure (Small 24B), with optimized endpoints for transcription-specific workloads (Mistral AI).
API & Le Chat integration: Voxtral is available through Mistral’s API, starting at $0.001 per minute, and is being rolled out in Le Chat’s voice mode, enabling real-time transcription, Q&A, and summaries (Mistral AI).

Capabilities

🧠 Context & Intelligence

Handles a context of up to 32k tokens, enough for roughly 30 minutes of audio when transcribing or about 40 minutes when answering questions about it (Mistral AI).
Goes beyond transcription: supports spoken Q&A, summarization, and function calling directly from voice, so a spoken request can trigger a workflow without a separate text step (Mistral AI).
Full multilingual support with automatic detection and high performance across major languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, Arabic, etc.) (Mistral AI).
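
The roughly 30-minute transcription limit mentioned above means longer recordings have to be split before they are sent. The sketch below shows one simple way to plan that split. It is an illustration only: `plan_segments` is a hypothetical helper, the 30-minute figure is the approximate one quoted in this article, and equal-length windows are an assumption (a real pipeline would more likely cut at pauses in speech).

```python
import math

# Approximate limit quoted in this article; treat it as an assumption, not an exact API constraint.
MAX_TRANSCRIPTION_MINUTES = 30.0


def plan_segments(total_minutes: float,
                  max_minutes: float = MAX_TRANSCRIPTION_MINUTES) -> list[tuple[float, float]]:
    """Split a recording into equal (start, end) windows, each at most max_minutes long.

    Hypothetical helper for illustration; times are in minutes.
    """
    if total_minutes <= 0:
        return []
    count = math.ceil(total_minutes / max_minutes)  # fewest windows that respect the limit
    length = total_minutes / count                  # spread the audio evenly across them
    return [(i * length, min((i + 1) * length, total_minutes)) for i in range(count)]


# A 75-minute recording needs three windows of 25 minutes each.
print(plan_segments(75))
```

Equal windows avoid ending up with one very short final segment, at the price of possibly cutting mid-sentence.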

🥇 Benchmarks & Accuracy

Transcription: In Mistral’s reported benchmarks, Voxtral Small outperforms OpenAI’s Whisper large-v3, GPT‑4o mini, Gemini 2.5 Flash, and ElevenLabs Scribe across short-form, long-form, and multilingual test sets, including LibriSpeech, Common Voice, and FLEURS (Mistral AI).
Audio Understanding: On spoken Q&A benchmarks and speech translation (e.g. FLEURS-Translation), Voxtral Small matches or surpasses GPT‑4o mini and Gemini 2.5 Flash (Mistral AI).
Text comprehension: Retains the full text understanding strengths of its language model backbone (Mistral Small 3.1) (Mistral AI).

Pricing

Both models are available through Mistral’s API at the prices below; the open weights can also be downloaded from Hugging Face:

Model	Audio input	Text output
Voxtral Mini 3B	$0.001/min	$0.04/M tokens
Voxtral Small 24B	$0.004/min	$0.10/M tokens

Prices are per minute of audio input and per million output tokens (Wikipedia, Mistral AI).

According to Mistral, the $0.001-per-minute entry price is less than half that of comparable commercial offerings such as the Whisper API or ElevenLabs Scribe.
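
To make the pricing concrete, the sketch below estimates a bill from the per-minute and per-token figures in the table above. The dictionary keys and the `estimate_cost` helper are hypothetical names chosen for this example, not official model identifiers, and the workload (1,000 hours of audio, 10 million output tokens) is invented for illustration.

```python
# Prices in USD, copied from the table in this article.
# The keys are illustrative labels, not official API model names.
PRICES = {
    "voxtral-mini-3b": {"audio_per_min": 0.001, "text_per_million_tokens": 0.04},
    "voxtral-small-24b": {"audio_per_min": 0.004, "text_per_million_tokens": 0.10},
}


def estimate_cost(model: str, audio_minutes: float, output_tokens: int) -> float:
    """Return the estimated cost in USD for a given amount of audio and text output."""
    price = PRICES[model]
    audio_cost = audio_minutes * price["audio_per_min"]
    text_cost = output_tokens / 1_000_000 * price["text_per_million_tokens"]
    return audio_cost + text_cost


# Hypothetical workload: 1,000 hours (60,000 minutes) of audio, 10 million output tokens.
for model in PRICES:
    print(model, round(estimate_cost(model, 60_000, 10_000_000), 2))
```

For this workload the audio charge dominates: about $60.40 in total for Mini and $241.00 for Small, of which only $0.40 and $1.00 come from text output.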

Getting Started

Download or API:
- Access both models on Hugging Face.
- Use the API for integration; a dedicated endpoint optimized for transcription is available.
Use in Le Chat Voice Mode:
- Upload or record audio on web or mobile to get transcriptions, Q&A, and summaries (Mistral AI).
Enterprise Features:
- Private on-prem deployment, multi-GPU scaling, fine-tuning, speaker segmentation, emotion detection, word-level timestamps, and non-speech audio recognition are in the pipeline (Mistral AI).
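
For the API route listed above, a transcription call is typically a multipart file upload. The sketch below builds such a request with the Python standard library only. Everything API-specific in it is an assumption that this article does not confirm and that should be checked against Mistral’s API reference: the endpoint URL, the model name, the form field names, and the Bearer authorization header. `build_transcription_request` and `transcribe` are hypothetical helper names.

```python
import json
import uuid
import urllib.request

# ASSUMPTIONS (not stated in this article; verify against Mistral's API reference):
API_URL = "https://api.mistral.ai/v1/audio/transcriptions"  # assumed endpoint
MODEL = "voxtral-mini-latest"                                # assumed model name


def build_transcription_request(audio: bytes, filename: str, api_key: str,
                                model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying the audio file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Form field "model" (assumed field name).
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # Form field "file" (assumed field name) with the raw audio bytes.
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f"Content-Type: application/octet-stream\r\n\r\n").encode(),
        audio,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def transcribe(audio: bytes, filename: str, api_key: str) -> dict:
    """Send the request and return the parsed JSON response (shape not assumed here)."""
    request = build_transcription_request(audio, filename, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending keeps the first function easy to inspect and test without network access or a real API key.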

Why It Matters

Voxtral democratises voice AI by combining transcription, semantic understanding, multilingual fluency, custom workflow triggers, and long-form context—all in a cost-effective, open-source package. Its versatility makes it ideal for applications ranging from voice agents and podcasts to support systems and business intelligence.

What’s Next

Live webinar: On August 6, 2025, Mistral will host a session (in collaboration with Inworld.ai) demonstrating voice-to-voice agents with Voxtral and Inworld TTS (Mistral AI).
Feature roadmap: Mistral expects to add speaker diarization, emotion analysis, timestamps, non-speech recognition, and expanded context windows soon (Mistral AI).
Hiring: Mistral is actively expanding its audio team to further advance voice intelligence (Mistral AI).