Voxtral TTS: Mistral’s Open-Weight Text-to-Speech Model Rivals ElevenLabs

On March 26, 2026, Mistral AI released Voxtral TTS — a 4-billion-parameter open-weight text-to-speech model that the company says outperforms ElevenLabs Flash v2.5 in human preference tests while matching ElevenLabs v3 in lifelike interactions. Built on Ministral 3B, Voxtral TTS supports nine languages, clones voices from just three seconds of audio, and achieves 70ms latency — making it one of the most capable open TTS models available today.


Voxtral TTS performance benchmark chart comparing latency and quality metrics
Image credit: Mistral AI

Architecture and Technical Details

Voxtral TTS is a transformer-based model that pairs an autoregressive backbone with flow matching, organized into three components:

  • 3.4B-parameter transformer decoder backbone — handles text understanding and semantic token generation
  • 390M flow-matching acoustic transformer — converts semantic tokens into acoustic latents using 16 function evaluations
  • 300M neural audio codec — a symmetric encoder-decoder that produces 24 kHz audio output

The system uses an in-house codec with an 8,192-vocabulary semantic vector quantizer and 36-dimensional, 21-level acoustic finite scalar quantization at a 12.5 Hz frame rate. It processes voice prompts of 5–25 seconds and generates up to 2 minutes of audio natively, with smart interleaving for longer content.
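As a back-of-the-envelope sketch of what those codec figures imply (using only numbers stated above; the bitrate calculation is illustrative and ignores any entropy coding):

```python
import math

# Figures from the article's codec description
FRAME_RATE_HZ = 12.5        # codec frame rate
SEMANTIC_VOCAB = 8_192      # semantic vector-quantizer vocabulary
FSQ_DIMS = 36               # acoustic finite scalar quantization dimensions
FSQ_LEVELS = 21             # quantization levels per dimension
MAX_NATIVE_SECONDS = 120    # "up to 2 minutes of audio natively"

# A full-length native generation spans this many codec frames
frames = int(FRAME_RATE_HZ * MAX_NATIVE_SECONDS)       # 1500 frames

# Raw information content per frame
semantic_bits = math.log2(SEMANTIC_VOCAB)              # 13.0 bits
acoustic_bits = FSQ_DIMS * math.log2(FSQ_LEVELS)       # ~158 bits

# Implied bitrate of the quantized representation
bitrate_kbps = FRAME_RATE_HZ * (semantic_bits + acoustic_bits) / 1000
print(f"{frames} frames for 2 min; ~{bitrate_kbps:.2f} kbps quantized")
```

At roughly 2 kbps, the quantized representation is far more compact than the 24 kHz waveform the codec decoder reconstructs from it, which is what makes a 12.5 Hz frame rate workable for autoregressive generation.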

Voxtral TTS architecture diagram showing the three-component pipeline
Image credit: Mistral AI

Performance and Benchmarks

On a single NVIDIA H200 with a 500-character input and 10-second voice reference, Voxtral TTS achieves:

  • 70ms latency at concurrency 1 with a real-time factor of 0.103 (roughly 9.7x real-time speed)
  • 331ms latency at concurrency 16, delivering 879 characters/second/GPU throughput
  • 1,430 characters/second/GPU at concurrency 32

Human evaluations show Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar time-to-first-audio. It also performs at parity with the larger ElevenLabs v3 model, including support for emotional steering across neutral, happy, and sarcastic tones.
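The benchmark numbers above can be sanity-checked with some quick arithmetic (all figures from the single-H200 results; nothing here is measured independently):

```python
# Real-time factor: compute time divided by audio duration
rtf = 0.103
speedup = 1 / rtf                       # ~9.7x faster than real time
synth_time_for_10s = rtf * 10           # ~1.03 s to generate 10 s of audio

# Throughput scaling with concurrency (characters/second/GPU)
throughput = {16: 879, 32: 1430}
scaling = throughput[32] / throughput[16]   # ~1.63x for 2x concurrency
print(f"~{speedup:.1f}x real time; doubling concurrency from 16 to 32 "
      f"yields {scaling:.2f}x throughput")
```

The sub-linear scaling from concurrency 16 to 32 is typical of batched autoregressive inference, where larger batches trade per-request latency for aggregate throughput.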

Win rate comparison chart showing Voxtral TTS vs ElevenLabs models
Image credit: SiliconANGLE

Voice Cloning and Language Support

Voxtral TTS can adapt to new voices with as little as three seconds of reference audio, capturing accent subtleties, intonation patterns, natural pauses, and emotional nuance. The model ships with 20 preset voices and supports nine languages: English (with American, British, and French dialect variants), French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic — with zero-shot cross-lingual voice adaptation.

The model requires 16 GB or more of GPU memory for deployment and outputs audio in WAV, PCM, FLAC, MP3, AAC, and Opus formats. It supports both streaming and batch inference, making it production-ready for real-time voice agent workflows.

Availability and Pricing

Voxtral TTS is available today on Hugging Face under a CC BY-NC 4.0 license, with self-hosting supported via vLLM Omni. It can also be accessed through Mistral’s API at $0.016 per 1,000 characters, as well as through Mistral Studio and Le Chat. A research paper is available at mistral.ai.
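To put the API price in perspective, a minimal cost estimate from the published rate (the helper function and the character counts are illustrative, not part of Mistral's API):

```python
# Published rate from the article
PRICE_PER_1K_CHARS = 0.016  # USD per 1,000 characters

def api_cost(n_chars: int) -> float:
    """Estimated USD cost to synthesize n_chars characters via the hosted API."""
    return n_chars / 1_000 * PRICE_PER_1K_CHARS

# One million characters -- roughly the length of a long novel -- costs $16
print(f"${api_cost(1_000_000):.2f} per million characters")
```

For high-volume workloads, that per-character cost is the figure to weigh against the fixed cost of self-hosting the open weights on a 16 GB GPU.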

What This Means

Voxtral TTS marks a significant milestone in open-weight audio AI. While ElevenLabs has dominated the TTS space with its proprietary models, Mistral is now offering comparable quality with open weights that developers can self-host, fine-tune, and integrate without per-character API costs. The 4B parameter size keeps hardware requirements modest — a single consumer GPU with 16 GB VRAM is sufficient — opening the door for edge deployment, on-device applications, and privacy-sensitive use cases like healthcare and financial services.

Combined with Mistral’s earlier Voxtral Transcribe 2 for speech-to-text, the company now offers a complete open-weight audio pipeline for building voice-first applications.
