ACE-Step 1.5, released on January 28, 2026, is an open-source music generation model jointly developed by ACE Studio and StepFun. The model delivers commercial-grade music quality on consumer hardware: it generates a full song in under 2 seconds on an A100 GPU, or in under 10 seconds on an RTX 3090, while requiring less than 4 GB of VRAM. On SongEval, the standard benchmark for overall music quality, ACE-Step 1.5 outperforms Suno v5, a leading commercial music AI service.
At the heart of ACE-Step 1.5 lies a novel two-stage pipeline that separates high-level creative planning from low-level audio synthesis. A Language Model (LM) ranging from 0.6B to 4B parameters functions as an “omni-capable planner,” using Chain-of-Thought reasoning to transform a simple user prompt into a comprehensive song blueprint — complete with style descriptors, lyrics, and arrangement metadata. This blueprint then guides a Diffusion Transformer (DiT), which renders the final audio.
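The plan-then-render split described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the project's actual code: `SongBlueprint`, `toy_planner`, `toy_renderer`, and `generate` are hypothetical names, and both stages are trivial stand-ins for the real LM and DiT.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SongBlueprint:
    """Hypothetical container for the planner's output (not the project's real schema)."""
    style: str
    lyrics: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Type aliases for the two stages; both are stand-ins for real models.
Planner = Callable[[str], SongBlueprint]
Renderer = Callable[[SongBlueprint], List[float]]


def toy_planner(prompt: str) -> SongBlueprint:
    """Stand-in for the LM: expands a short prompt into a structured blueprint."""
    return SongBlueprint(
        style=f"style inferred from: {prompt}",
        lyrics="[verse]\n(placeholder lyrics)\n[chorus]\n(placeholder lyrics)",
        metadata={"structure": "verse-chorus"},
    )


def toy_renderer(blueprint: SongBlueprint) -> List[float]:
    """Stand-in for the DiT: returns one silent sample per lyric line."""
    return [0.0] * len(blueprint.lyrics.splitlines())


def generate(prompt: str, planner: Planner, renderer: Renderer) -> List[float]:
    """Stage 1 plans, stage 2 renders; the blueprint is the only interface between them."""
    blueprint = planner(prompt)
    return renderer(blueprint)


audio = generate("an upbeat synth-pop song about summer", toy_planner, toy_renderer)
```

Because the renderer sees only the blueprint, either stage could in principle be swapped independently, which fits the article's description of interchangeable LM sizes and DiT variants.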
Rather than relying on external reward models, ACE-Step 1.5 employs intrinsic reinforcement learning for alignment, which the team says avoids the preference biases that plague models trained on human feedback. The result is more predictable stylistic adherence across 50+ languages.
ACE-Step 1.5 supports an unusually wide range of generation and editing workflows.
The model is available in multiple DiT configurations (base, SFT, turbo) to trade off quality against generation speed, and the LM component can be swapped between 0.6B and 4B parameter versions depending on available hardware.
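As a rough illustration of that trade-off, the sketch below picks a DiT variant and an LM size from a VRAM budget and a speed preference. Everything here is an assumption made for the example: `choose_config` and `GenerationConfig` are hypothetical names, the mapping of "turbo" to speed and "sft" to quality is a guess based on the variant names, and the VRAM threshold is supplied by the caller because the article gives no per-configuration memory requirements.

```python
from dataclasses import dataclass

DIT_VARIANTS = ("base", "sft", "turbo")  # variant names as listed in the article
LM_SIZES = ("0.6B", "4B")                # smallest and largest LM sizes mentioned


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical pairing of a DiT variant with an LM size."""
    dit_variant: str
    lm_size: str


def choose_config(vram_gb: float, prefer_speed: bool,
                  large_lm_min_vram_gb: float) -> GenerationConfig:
    """Pick a configuration from a memory budget and a speed preference.

    large_lm_min_vram_gb is a caller-supplied threshold, not a documented requirement.
    """
    # Assumption: "turbo" favours speed and "sft" favours quality.
    dit = "turbo" if prefer_speed else "sft"
    # Fall back to the smaller LM when the budget is below the caller's threshold.
    lm = "4B" if vram_gb >= large_lm_min_vram_gb else "0.6B"
    return GenerationConfig(dit_variant=dit, lm_size=lm)


# Usage: the 16 GB threshold is purely illustrative.
laptop = choose_config(vram_gb=8, prefer_speed=True, large_lm_min_vram_gb=16)
workstation = choose_config(vram_gb=24, prefer_speed=False, large_lm_min_vram_gb=16)
```

Keeping the threshold as an explicit argument avoids baking unverified hardware numbers into the helper.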
The commercial music generation landscape has been dominated by closed platforms — Suno, Udio, and ElevenLabs’ Eleven Music — that keep their weights proprietary and charge subscription fees. ACE-Step 1.5 is a significant counter-move: MIT-licensed, locally runnable on a mid-range gaming GPU, and trained on legally licensed and royalty-free music. Users retain full control over generated outputs and can fine-tune the model on their own catalogues.
The benchmark claims are striking. According to the paper, ACE-Step 1.5 achieves a SongEval score of 8.09, surpassing Suno v5, while its other metrics, AudioBox (7.42), Style Alignment (6.47), and Lyric Alignment (8.35), are competitive with commercial leaders. Generation is reported to be 10–120× faster than comparable open models.
The team acknowledges current limitations: output quality can vary with random seed and duration settings, certain genres (notably Chinese rap) underperform, repainting transitions can sound unnatural, and fine-grained musical parameter control remains coarse. Vocal synthesis quality is noted as an area for future improvement.
ACE-Step 1.5 is available on GitHub, Hugging Face, and ModelScope, with weights, code, and an interactive demo all publicly accessible.
