On March 26, 2026, Mistral AI released Voxtral TTS — a 4-billion-parameter open-weight text-to-speech model that the company says outperforms ElevenLabs Flash v2.5 in human preference tests while matching ElevenLabs v3 in lifelike interactions. Built on Ministral 3B, Voxtral TTS supports nine languages, clones voices from just three seconds of audio, and achieves 70ms latency — making it one of the most capable open TTS models available today.
Voxtral TTS is a transformer-based, autoregressive, flow-matching model composed of three components:
The system uses an in-house codec with an 8,192-vocabulary semantic vector quantizer and 36-dimensional, 21-level acoustic finite scalar quantization at a 12.5 Hz frame rate. It processes voice prompts of 5–25 seconds and generates up to 2 minutes of audio natively, with smart interleaving for longer content.
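The codec figures above imply a very compact token stream, which a little arithmetic makes concrete. The sketch below is a back-of-the-envelope estimate, not Mistral's implementation: it assumes each 12.5 Hz frame carries exactly one semantic code plus 36 acoustic values with 21 levels each, and it ignores any entropy coding. All function and constant names are hypothetical.

```python
import math

# Figures quoted in the article.
FRAME_RATE_HZ = 12.5        # codec frames per second
SEMANTIC_VOCAB = 8192       # semantic vector-quantizer vocabulary size
ACOUSTIC_DIMS = 36          # finite scalar quantization dimensions
ACOUSTIC_LEVELS = 21        # quantization levels per dimension
MAX_NATIVE_SECONDS = 120    # "up to 2 minutes of audio natively"


def frames_for(seconds: float) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ)


def bits_per_frame() -> float:
    """Upper bound on information per frame.

    Assumption: one semantic code and ACOUSTIC_DIMS acoustic scalars per frame.
    """
    semantic_bits = math.log2(SEMANTIC_VOCAB)
    acoustic_bits = ACOUSTIC_DIMS * math.log2(ACOUSTIC_LEVELS)
    return semantic_bits + acoustic_bits


def bitrate_kbps() -> float:
    """Implied codec bitrate in kilobits per second (1 kbit = 1000 bits)."""
    return bits_per_frame() * FRAME_RATE_HZ / 1000


if __name__ == "__main__":
    print("frames for 2 minutes:", frames_for(MAX_NATIVE_SECONDS))
    print("bits per frame: %.1f" % bits_per_frame())
    print("implied bitrate: %.2f kbps" % bitrate_kbps())
```

Under these assumptions, two minutes of audio corresponds to 1,500 frames, and the implied bitrate is a little over 2 kbps.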
On a single NVIDIA H200 with a 500-character input and 10-second voice reference, Voxtral TTS achieves:
According to Mistral, human evaluations rate Voxtral TTS as more natural than ElevenLabs Flash v2.5 at a similar time-to-first-audio. The company also reports parity with the larger ElevenLabs v3 model, including support for emotional steering across neutral, happy, and sarcastic tones.
Voxtral TTS can adapt to new voices with as little as three seconds of reference audio, capturing accent subtleties, intonation patterns, natural pauses, and emotional nuance. The model ships with 20 preset voices and supports nine languages: English (with American, British, and French dialect variants), French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — with zero-shot cross-lingual voice adaptation.
The model requires 16 GB or more of GPU memory for deployment and outputs audio in WAV, PCM, FLAC, MP3, AAC, and Opus formats. It supports both streaming and batch inference, making it production-ready for real-time voice agent workflows.
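A rough estimate helps put the 16 GB figure in context. The sketch below computes the memory taken by the weights alone, assuming a round 4 billion parameters stored at 16 bits each. The article does not state the storage precision, so that is an assumption, and the helper name is hypothetical. Activations, caches, and codec state need additional memory on top of this.

```python
PARAMS = 4_000_000_000  # "4-billion-parameter" model, taken as a round figure


def weight_gib(params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Assumption: weights are stored in a 16-bit format such as bf16 or fp16.
    print("16-bit weights: %.2f GiB" % weight_gib(PARAMS, 2))
    # For comparison only: the same weights at 32-bit precision.
    print("32-bit weights: %.2f GiB" % weight_gib(PARAMS, 4))
```

At 16 bits the weights come to roughly 7.5 GiB, which leaves headroom on a 16 GB card for everything else the runtime needs.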
Voxtral TTS is available today on Hugging Face under a CC BY-NC 4.0 license, with self-hosting supported via vLLM Omni. It can also be accessed through Mistral’s API at $0.016 per 1,000 characters, as well as through Mistral Studio and Le Chat. A research paper is available at mistral.ai.
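To compare the hosted API against self-hosting, it helps to turn the quoted price into a per-request cost. This is a minimal sketch that assumes billing scales linearly with character count, with no minimum charge or rounding, and that characters are counted the way Python's `len` counts them. Neither assumption is confirmed by the article, and the function name is hypothetical.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS = Decimal("0.016")


def api_cost_usd(text: str) -> Decimal:
    """Estimated API cost in USD for synthesizing `text`.

    Assumption: strictly linear pricing on len(text), no minimums or rounding.
    """
    return PRICE_PER_1K_CHARS * len(text) / 1000


if __name__ == "__main__":
    sample = "a" * 500  # same length as the 500-character benchmark input
    print("cost for 500 characters: $%s" % api_cost_usd(sample))
```

On this estimate, a 500-character request costs under one cent, and a million characters costs $16.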
Voxtral TTS marks a significant milestone in open-weight audio AI. While ElevenLabs has dominated the TTS space with its proprietary models, Mistral is now offering comparable quality with open weights that developers can self-host, fine-tune, and integrate without per-character API costs. The 4B parameter size keeps hardware requirements modest — a single consumer GPU with 16 GB VRAM is sufficient — opening the door for edge deployment, on-device applications, and privacy-sensitive use cases like healthcare and financial services.
Combined with Mistral’s earlier Voxtral Transcribe 2 for speech-to-text, the company now offers a complete open-weight audio pipeline for building voice-first applications.
