Pocket TTS

January 21, 2026
Provided by Utku Ege Tuluk

Pocket TTS is a 100-million-parameter open-source text-to-speech (TTS) model developed by Kyutai that runs efficiently on CPUs in real time and supports high-quality voice cloning. (eWeek)

Key Highlights

⚡ CPU-First, Real-Time TTS

Designed to run on common laptop CPUs without requiring a GPU. (eWeek)
Runs faster than real time, e.g., a reported 6× real-time throughput on consumer hardware such as a MacBook Air CPU. (byteiota | From Bits to Bytes)
Typical latency: initial audio output in ~200 ms after text input. (Vision Agents)
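
To make the throughput and latency figures above concrete, here is a small arithmetic sketch. The 6× and ~200 ms values are the reported numbers quoted above, not measurements; the function names are illustrative and not part of any Pocket TTS API.

```python
def synthesis_time_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of speech
    at a real-time factor of `rtf` (audio seconds per compute second)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


def measured_realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor from a timing run; values above 1 are faster than real time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Reported figures from the text above (assumed here, not measured).
REPORTED_RTF = 6.0
REPORTED_FIRST_AUDIO_SECONDS = 0.2

one_minute = synthesis_time_seconds(60.0, REPORTED_RTF)
print(f"60 s of speech at 6x real time: about {one_minute:.0f} s of compute")
print(f"Reported time to first audio: {REPORTED_FIRST_AUDIO_SECONDS * 1000:.0f} ms")
```

To check these figures on your own machine, time a synthesis run and pass the audio length and elapsed time to `measured_realtime_factor`.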

🗣️ Voice Cloning

Can clone a voice from a short (≈5 s) audio sample, capturing tonal qualities, accent, and acoustic characteristics. (eWeek)
Cloned speech maintains a high level of similarity to the reference voice. (Kyutai)
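
Since cloning works from a reference clip of about 5 seconds, it can be useful to check a clip's length before using it. The sketch below uses only Python's standard `wave` module and assumes an uncompressed WAV file; the helper names and the 5-second default threshold are illustrative choices, not requirements documented by Kyutai.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, computed as frames / sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, min_seconds: float = 5.0) -> bool:
    """True if the clip is at least `min_seconds` long (threshold is an assumption)."""
    return wav_duration_seconds(path) >= min_seconds
```

For example, `is_long_enough("my_voice.wav")` (a hypothetical file name) returns `False` for a 3-second clip.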

📦 Compact and Efficient

At 100 M parameters, Pocket TTS is much smaller than typical commercial or research TTS models but still delivers high-quality output. (Kyutai)
Its small footprint makes it well suited to edge devices, laptops, and offline use. (byteiota | From Bits to Bytes)
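
As a rough sense of scale, the weight storage for 100 M parameters can be estimated from the bytes used per parameter. This is back-of-the-envelope arithmetic only: the precision Pocket TTS actually ships in is not stated here, and the estimate excludes activations and runtime overhead.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weights_size_mb(num_params: int, dtype: str = "float32") -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weights_size_mb(100_000_000, dtype):.0f} MB")
# float32: ~400 MB
# bfloat16: ~200 MB
# int8: ~100 MB
```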

📡 Technical Innovation — Continuous Audio Language Models

The model uses Kyutai’s Continuous Audio Language Models (CALM) framework, which directly predicts audio signals rather than relying on intermediate discrete audio token representations. (arXiv)
This continuous approach reduces computational overhead and enables CPU-optimized inference. (byteiota | From Bits to Bytes)
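
The difference between discrete tokens and continuous predictions can be shown with a toy example. This is not Kyutai's implementation and says nothing about CALM's architecture or its compute savings. It only shows that mapping a frame to the nearest entry of a codebook (vector quantization) rounds it, whereas a continuous representation keeps the vector as is. The 2-D frame and 4-entry codebook are made up for illustration.

```python
import math


def nearest_code(frame, codebook):
    """Discrete path: index of the codebook vector closest to `frame`."""
    return min(range(len(codebook)), key=lambda i: math.dist(frame, codebook[i]))


# Toy data: a 4-entry codebook and one 2-D latent standing in for an audio frame.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frame = (0.8, 0.3)

# Discrete tokens: the frame is replaced by its nearest codebook entry.
token = nearest_code(frame, codebook)
quantization_error = math.dist(frame, codebook[token])

# Continuous representation: the vector itself is the target, so nothing is rounded.
continuous_error = math.dist(frame, frame)

print(f"token index: {token}, quantization error: {quantization_error:.3f}")
print(f"continuous error: {continuous_error:.3f}")
```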

📜 Open Science & Accessibility

Fully open-source under an MIT license, including training code and model weights. (eWeek)
Trained on a large dataset, reported as 88,000 hours of public audio data. (LinkedIn)
Accessible Python API and CLI tools make integration straightforward. (Hugging Face)

🔒 Privacy and Cost Benefits

By enabling local, offline TTS, Pocket TTS avoids sending text or audio to remote APIs, preserving user privacy. (byteiota | From Bits to Bytes)
Running locally also avoids the usage costs and rate limits typical of commercial services. (byteiota | From Bits to Bytes)

Typical Use Cases

Voice assistants and interactive agents that speak naturally without cloud connectivity. (Vision Agents)
Personal voice preservation, e.g., for users who want to keep their own voice for accessibility purposes. (eWeek)
Game development with multiple character voices generated locally. (eWeek)
Offline narration projects like audiobooks. (LinkedIn)

How to Try or Use (from community sources)

You can install and run Pocket TTS via common Python tooling:

pip install pocket-tts   # install the package and its CLI
uvx pocket-tts serve     # start local server (uvx, part of uv, runs the tool without a prior install)

Then generate speech locally from the command line or from Python code. (Hugging Face)
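
Before scripting against the tool, it can help to confirm that the `pocket-tts` executable is actually on your PATH. The sketch below uses only the Python standard library. It assumes the CLI accepts the conventional `--help` flag, which is not confirmed here, and the helper names are illustrative.

```python
import shutil
import subprocess
from typing import Optional


def find_cli(name: str = "pocket-tts") -> Optional[str]:
    """Full path of the executable if it is on PATH, otherwise None."""
    return shutil.which(name)


def cli_help(name: str = "pocket-tts") -> Optional[str]:
    """Help text printed by the CLI, or None if it is not installed.
    Assumes the tool supports `--help` (a common convention, not verified here)."""
    exe = find_cli(name)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--help"], capture_output=True, text=True, timeout=60
    )
    return result.stdout or result.stderr


help_text = cli_help()
print(help_text if help_text is not None else "pocket-tts not found on PATH; install it first")
```

The help output lists the subcommands and options your installed version actually supports, which is safer than relying on remembered flags.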