Voxtral Mini 3B & Small 24B — Frontier Open‑Source Speech Understanding by Mistral AI

Introduction

On July 15, 2025, Mistral AI launched Voxtral, a family of open-source, multilingual speech understanding models that the company describes as state-of-the-art. It comes in two sizes: a compact Mini (3B) for local or edge use and a larger Small (24B) for production environments. Both are released under the Apache 2.0 license and are available on Hugging Face and through Mistral’s API (Wikipedia, Mistral AI).

What Makes Voxtral Stand Out

  • Open, affordable, production-ready: Voxtral aims to bridge the gap between error-prone open-source ASR systems and expensive proprietary APIs. Mistral says it offers high-quality transcription and understanding at less than half the cost of comparable commercial alternatives (Mistral AI).
  • Flexible deployment: Use locally on devices or edge hardware (Mini 3B), or deploy at scale in cloud or enterprise infrastructure (Small 24B), with optimized endpoints for transcription-specific workloads (Mistral AI).
  • API & Le Chat integration: Voxtral is available through Mistral’s API, starting at $0.001 per minute of audio, and is being rolled out in voice mode within Le Chat, enabling real-time transcription, Q&A, and summaries (Mistral AI).

Capabilities

🧠 Context & Intelligence

  • Handles a context of up to 32K tokens, enough for roughly 30 minutes of audio for transcription or 40 minutes for conversational understanding (Mistral AI).
  • Goes beyond transcription: supports spoken Q&A, summarization, and function calling directly from voice, so a spoken request can trigger a backend function or API call without a separate parsing step (Mistral AI).
  • Full multilingual support with automatic detection and high performance across major languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, Arabic, etc.) (Mistral AI).
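A recording longer than the roughly 30-minute transcription window has to be split before it is sent to the model. The sketch below plans such a split. The function name, the exact 30-minute cap, and the 5-second overlap between neighbouring segments are illustrative assumptions, not values or tooling published by Mistral.

```python
def plan_segments(total_seconds, max_seconds=30 * 60, overlap_seconds=5.0):
    """Return (start, end) pairs, in seconds, that cover the whole recording.

    ASSUMPTIONS: the 30-minute cap approximates the transcription window
    quoted above; the overlap is a hypothetical safety margin so that words
    cut at a boundary appear in full in at least one segment.
    """
    total_seconds = float(total_seconds)
    if total_seconds <= 0:
        return []
    if not 0 <= overlap_seconds < max_seconds:
        raise ValueError("overlap must be non-negative and shorter than a segment")

    segments = []
    start = 0.0
    while True:
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        if end >= total_seconds:
            break
        # Step back slightly so consecutive segments share a short overlap.
        start = end - overlap_seconds
    return segments


if __name__ == "__main__":
    # A 75-minute recording is planned as three overlapping segments.
    for start, end in plan_segments(75 * 60):
        print(f"{start / 60:6.2f} min -> {end / 60:6.2f} min")
```

Each segment can then be transcribed on its own and the texts merged, with the overlap used to de-duplicate words at the joins.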

🥇 Benchmarks & Accuracy

  • Transcription: Voxtral Small outperforms OpenAI’s Whisper large-v3, GPT‑4o mini, Gemini 2.5 Flash, and ElevenLabs Scribe across short-form, long-form and multilingual benchmarks—including LibriSpeech, Common Voice, FLEURS, and more (Mistral AI).
  • Audio Understanding: On spoken Q&A and speech translation benchmarks (e.g. FLEURS-Translation), Voxtral Small matches or surpasses GPT‑4o mini and Gemini 2.5 Flash (Mistral AI).
  • Text comprehension: Retains the full text understanding strengths of its language model backbone (Mistral Small 3.1) (Mistral AI).

Pricing

API pricing is listed below; the open weights can also be downloaded from Hugging Face:

Model               Audio Input    Text Output
Voxtral Mini 3B     $0.001/min     $0.04/M tokens
Voxtral Small 24B   $0.004/min     $0.10/M tokens
(Wikipedia, Mistral AI)

According to Mistral, these prices are less than half those of comparable commercial offerings such as the Whisper API or ElevenLabs Scribe.
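To make the per-minute and per-token prices concrete, here is a small cost estimator built only from the figures in the table above. The dictionary keys and the function name are hypothetical labels chosen for this sketch, not official API model identifiers.

```python
from decimal import Decimal

# Audio input price in USD per minute, copied from the pricing table.
# The keys are illustrative labels, not official model identifiers.
AUDIO_PRICE_PER_MIN = {
    "voxtral-mini-3b": Decimal("0.001"),
    "voxtral-small-24b": Decimal("0.004"),
}

# Text output price in USD per million tokens, from the same table.
TEXT_PRICE_PER_M_TOKENS = {
    "voxtral-mini-3b": Decimal("0.04"),
    "voxtral-small-24b": Decimal("0.10"),
}


def estimate_cost(model, audio_minutes, output_tokens=0):
    """Estimate the USD cost of one job: audio input plus generated text."""
    audio_cost = AUDIO_PRICE_PER_MIN[model] * Decimal(str(audio_minutes))
    text_cost = TEXT_PRICE_PER_M_TOKENS[model] * Decimal(output_tokens) / Decimal(1_000_000)
    return audio_cost + text_cost


if __name__ == "__main__":
    # Ten hours of audio transcribed with the Mini model.
    print(estimate_cost("voxtral-mini-3b", 600))
    # One hour of audio plus 50,000 output tokens with the Small model.
    print(estimate_cost("voxtral-small-24b", 60, 50_000))
```

Decimal is used instead of float so that small per-minute prices add up without rounding drift.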

Getting Started

  1. Download or API:
    • Access both models on Hugging Face.
    • Use the API for integration; a dedicated endpoint optimized for transcription is available.
  2. Use in Le Chat Voice Mode:
    • Upload or record audio, get transcriptions, Q&A, and summaries—via web or mobile (Mistral AI).
  3. Enterprise Features:
    • Private on-prem deployment, multi-GPU scaling, fine-tuning, speaker segmentation, emotion detection, word-level timestamps, and non-speech audio recognition are in the pipeline (Mistral AI).
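For the API route in step 1, a transcription call is typically an authenticated multipart file upload. The sketch below only builds such a request with the Python standard library and does not send it. The endpoint URL, the model name, and the form field names are assumptions that this article does not state; check them against Mistral’s API documentation before use.

```python
import urllib.request
import uuid

# ASSUMPTIONS (not given in this article): endpoint URL, model name, and the
# "model" / "file" form field names. Verify all of them in Mistral's API docs.
ASSUMED_URL = "https://api.mistral.ai/v1/audio/transcriptions"
ASSUMED_MODEL = "voxtral-mini-latest"


def build_transcription_request(audio, filename, api_key,
                                url=ASSUMED_URL, model=ASSUMED_MODEL):
    """Build (but do not send) a multipart/form-data POST with one audio file."""
    boundary = uuid.uuid4().hex

    # First form field: the model to use.
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode("utf-8")

    # Second form field: the audio file, sent as raw bytes.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")

    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = model_part + file_header + audio + closing

    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


if __name__ == "__main__":
    request = build_transcription_request(b"RIFF....", "clip.wav", "YOUR_API_KEY")
    print(request.get_method(), request.full_url, len(request.data), "bytes")
```

Passing the returned object to urllib.request.urlopen would perform the actual network call; Mistral’s official client libraries are likely a more convenient choice in practice.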

Why It Matters

Voxtral democratises voice AI by combining transcription, semantic understanding, multilingual fluency, custom workflow triggers, and long-form context—all in a cost-effective, open-source package. Its versatility makes it ideal for applications ranging from voice agents and podcasts to support systems and business intelligence.

What’s Next

  • Live webinar: On August 6, 2025, Mistral will host a session (in collaboration with Inworld.ai) demonstrating voice-to-voice agents with Voxtral and Inworld TTS (Mistral AI).
  • Feature roadmap: Mistral expects to add speaker diarization, emotion analysis, timestamps, non-speech recognition, and expanded context windows soon (Mistral AI).
  • Hiring: Mistral is actively expanding its audio team to further advance voice intelligence (Mistral AI).