Introducing MagentaRT: Google’s Open-Weights Real-Time Music Generation Model

Today, Google’s Magenta team unveiled Magenta RealTime (MagentaRT), an 800-million-parameter autoregressive transformer model designed to generate high-fidelity, 48 kHz stereo music in real time with low latency and full user control (magenta.withgoogle.com). Licensed permissively (with some bespoke terms), MagentaRT is the open-weights counterpart to Google’s proprietary Lyria RealTime model. The team aims for it eventually to run locally on consumer hardware; for now, it runs faster than real time on free-tier Colab TPUs (v2-8), generating 2 seconds of audio in about 1.25 seconds, a real-time factor of roughly 1.6 (magenta.withgoogle.com).
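The quoted speed follows directly from the chunk timing: real-time factor is the duration of audio produced divided by the wall-clock time it took to generate. A quick check of the arithmetic:

```python
audio_seconds = 2.0    # length of each generated block of audio
wall_seconds = 1.25    # reported generation time on a Colab TPU v2-8
rtf = audio_seconds / wall_seconds
print(rtf)  # 1.6 → generation runs 1.6× faster than playback
```

Any value above 1.0 means the model keeps ahead of playback, which is what makes continuous live streaming feasible.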

MagentaRT builds on the architecture of MusicLM by performing block autoregression: it generates audio in sequential 2-second chunks of codec tokens, each conditioned on the previous 10 seconds of output and a style embedding. By dynamically adjusting the style embedding—a weighted mix of text or audio prompts—users can morph genres, instruments, and sonic textures in real time (magenta.withgoogle.com).
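The "weighted mix" of prompts can be pictured as a convex combination of embedding vectors. The sketch below is illustrative, not the actual MagentaRT API: `blend_styles` is a hypothetical helper, and the embeddings would in practice come from MusicCoCa rather than the toy vectors shown here.

```python
import numpy as np

def blend_styles(embeddings, weights):
    """Blend style embeddings (text- or audio-derived) by weight.

    Weights are normalized so the result is a convex combination,
    letting a user crossfade smoothly between prompts.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                      # normalize to sum to 1
    e = np.stack(embeddings)             # shape: (n_prompts, dim)
    return (w[:, None] * e).sum(axis=0)  # shape: (dim,)

# Toy example: mix a "jazz" prompt at 70% with a "techno" prompt at 30%.
jazz = np.array([1.0, 0.0])
techno = np.array([0.0, 1.0])
mixed = blend_styles([jazz, techno], [0.7, 0.3])
print(mixed)  # [0.7 0.3]
```

Sliding the weights over time is what produces the real-time genre morphing described above.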

Under the hood, MagentaRT leverages the new SpectroStream codec (a successor to SoundStream) for higher fidelity and the MusicCoCa joint music+text embedding model, influenced by MuLan and the CoCa family (magenta.withgoogle.com). The model was trained on approximately 190,000 hours of mostly instrumental stock music, giving it strong capabilities in Western instrumental styles, though it currently has limited vocal and non-Western coverage.

Try it Yourself:
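To get a feel for the block-autoregressive loop before diving into the official Colab demo, here is a minimal sketch of the streaming pattern: generate a 2-second chunk, append it to a sliding 10-second context window, repeat. Everything here is hypothetical scaffolding—`generate_chunk` stands in for the real model call, whose actual signature this sketch does not claim to match.

```python
from collections import deque

CONTEXT_SECONDS = 10   # model conditions on the last 10 s of audio
CHUNK_SECONDS = 2      # each generation step emits 2 s
SAMPLE_RATE = 48_000   # MagentaRT outputs 48 kHz audio

def generate_chunk(context, style):
    """Placeholder for the real model call (hypothetical signature)."""
    return [0.0] * (CHUNK_SECONDS * SAMPLE_RATE)  # silent stand-in chunk

def stream(style, n_chunks):
    """Block autoregression: each chunk sees only the last 10 s of output."""
    context = deque(maxlen=CONTEXT_SECONDS * SAMPLE_RATE)
    out = []
    for _ in range(n_chunks):
        chunk = generate_chunk(list(context), style)
        context.extend(chunk)   # slide the 10 s window forward
        out.extend(chunk)
    return out

audio = stream(style=None, n_chunks=3)  # 3 chunks → 6 s of audio
print(len(audio) / SAMPLE_RATE)  # 6.0
```

The bounded `deque` is the key design point: because the context never exceeds 10 seconds, memory and latency stay constant no matter how long the session runs—which is also why long-form structure is hard, as noted in the limitations below.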

Limitations and Future Work:

  • Style Coverage: Primarily trained on Western instrumental music; vocal generation is non-lexical and unconditioned on lyrics (magenta.withgoogle.com).
  • Latency & Context: Style changes incur at least a 2-second delay, and the context window is limited to 10 seconds, so long-form musical structure isn’t maintained (magenta.withgoogle.com).
  • Next Steps: On-device inference for mobile/desktop, personal fine-tuning, and higher-quality, lower-latency next-gen models are on the roadmap (magenta.withgoogle.com).

MagentaRT marks a significant step toward interactive, high-quality AI music performance, enabling artists, developers, and researchers to explore new creative frontiers—live, in the moment, and with open weights.