GLM-TTS is an open-source text-to-speech (TTS) synthesis system built on large language models (LLMs). It is designed to produce expressive, high-quality speech from text, with features such as zero-shot voice cloning and emotion control. (Hugging Face)
Key Features
- Zero-shot voice cloning: clone a speaker's voice from only ~3-10 seconds of sample audio.
- Emotion-expressive speech: reinforcement learning is used to improve emotional expressiveness and prosody.
- High synthesis quality: produces speech with low error rates, as measured by character error rate (CER) comparisons.
- Phoneme-level control: phoneme input can be mixed with plain text for more precise control over pronunciation.
- Streaming inference: supports real-time generation, useful for interactive applications.
- Bilingual support: optimized for mixed Chinese and English text.
Architecture
GLM-TTS uses a two-stage pipeline:
- An LLM (based on the Llama architecture) converts input text into a sequence of discrete speech tokens.
- A flow-matching model converts those tokens into mel-spectrograms, which a vocoder then renders as waveform audio. (Hugging Face)
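The two-stage data flow can be sketched with stand-in components. Everything below is illustrative (toy functions, not the project's actual API); the point is only the token → mel-spectrogram → waveform hand-off between stages:

```python
import math

# Stage 1 (stand-in): an "LLM" mapping input text to discrete speech-token
# IDs. A real system runs an autoregressive transformer; this toy version
# just hashes characters into a small codebook.
def text_to_speech_tokens(text: str, codebook_size: int = 1024) -> list[int]:
    return [hash(ch) % codebook_size for ch in text]

# Stage 2 (stand-in): a "flow-matching" decoder mapping speech tokens to a
# mel-spectrogram (one frame per token, n_mels bins per frame)...
def tokens_to_mel(tokens: list[int], n_mels: int = 80) -> list[list[float]]:
    return [[math.sin(t + m) for m in range(n_mels)] for t in tokens]

# ...followed by a "vocoder" expanding each mel frame to `hop` audio samples.
def mel_to_waveform(mel: list[list[float]], hop: int = 256) -> list[float]:
    return [frame[0] for frame in mel for _ in range(hop)]

def synthesize(text: str) -> list[float]:
    tokens = text_to_speech_tokens(text)   # LLM stage
    mel = tokens_to_mel(tokens)            # flow-matching stage
    return mel_to_waveform(mel)            # vocoder stage

audio = synthesize("hello")
print(len(audio))  # 5 tokens -> 5 mel frames -> 5 * 256 = 1280 samples
```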
Reinforcement Learning
The system employs a multi-reward reinforcement learning framework based on GRPO (Group Relative Policy Optimization) to align the LLM's generation with natural prosody, emotional expression, and similarity to the target voice. (Hugging Face)
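GRPO's core mechanism, group-relative advantages computed over a batch of sampled outputs, combines naturally with multiple rewards. A minimal sketch (the reward names and equal weights are illustrative assumptions, not the project's exact configuration):

```python
import statistics

def grpo_advantages(reward_groups: list[dict[str, float]],
                    weights: dict[str, float]) -> list[float]:
    """Combine several rewards per sampled output into one scalar, then
    normalize within the group (zero mean, unit std) -- GRPO's substitute
    for a learned value baseline."""
    scalars = [sum(weights[k] * r[k] for k in weights) for r in reward_groups]
    mu = statistics.mean(scalars)
    sigma = statistics.pstdev(scalars) or 1.0  # guard against zero spread
    return [(s - mu) / sigma for s in scalars]

# Example: 3 sampled utterances scored on prosody, emotion, and
# speaker similarity (hypothetical reward heads).
group = [
    {"prosody": 0.8, "emotion": 0.6, "similarity": 0.90},
    {"prosody": 0.5, "emotion": 0.7, "similarity": 0.80},
    {"prosody": 0.9, "emotion": 0.9, "similarity": 0.95},
]
weights = {"prosody": 1.0, "emotion": 1.0, "similarity": 1.0}
print(grpo_advantages(group, weights))
```

Outputs with above-average combined reward receive positive advantages and are reinforced; the rest are pushed down, with no critic network needed.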
Quick Usage
The project's GitHub repository provides installation and inference scripts with examples: clone the code, install the dependencies, and run inference locally.
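Setup follows the usual clone-and-install pattern. The repository URL, script name, and flags below are assumptions for illustration; consult the project's README for the exact commands:

```shell
# Hypothetical paths and script names -- verify against the actual repo.
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt
# Run inference: a text prompt plus a short reference clip for voice cloning.
python inference.py --text "Hello there." --prompt_audio ref.wav --output out.wav
```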