ACE-Step 1.5: Open-Source Music Generation That Rivals Commercial AI

ACE-Step 1.5, released on January 28, 2026, is a powerful open-source music generation model jointly developed by ACE Studio and StepFun. The model delivers commercial-grade music quality on consumer hardware, generating a full song in under 2 seconds on an A100 GPU or under 10 seconds on an RTX 3090, while requiring less than 4GB of VRAM. On SongEval, a benchmark for overall music quality, ACE-Step 1.5 outperforms Suno v5, a leading commercial music AI service.

ACE-Step 1.5 model architecture showing the Language Model planner and Diffusion Transformer components
Image credit: ACE-Step 1.5 paper, arXiv 2602.00744

A Hybrid Architecture: Planner Meets Synthesizer

At the heart of ACE-Step 1.5 lies a novel two-stage pipeline that separates high-level creative planning from low-level audio synthesis. A Language Model (LM) ranging from 0.6B to 4B parameters functions as an “omni-capable planner,” using Chain-of-Thought reasoning to transform a simple user prompt into a comprehensive song blueprint — complete with style descriptors, lyrics, and arrangement metadata. This blueprint then guides a Diffusion Transformer (DiT), which renders the final audio.
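The two-stage flow can be illustrated with a minimal, purely conceptual sketch. All names here (`SongBlueprint`, `plan_song`, `synthesize`) are hypothetical stand-ins, not the actual ACE-Step API; the point is the separation of concerns: the planner emits a structured blueprint, and the synthesizer consumes only that blueprint.

```python
from dataclasses import dataclass, field

@dataclass
class SongBlueprint:
    """Structured plan the LM planner hands to the DiT (fields are illustrative)."""
    style: str
    lyrics: str
    arrangement: dict = field(default_factory=dict)

def plan_song(prompt: str) -> SongBlueprint:
    """Stand-in for the LM planner. The real planner uses Chain-of-Thought
    reasoning to expand the prompt; here we stub the result."""
    return SongBlueprint(
        style=f"derived-from: {prompt}",
        lyrics="[verse] ...\n[chorus] ...",
        arrangement={"bpm": 120, "key": "C major",
                     "sections": ["intro", "verse", "chorus"]},
    )

def synthesize(blueprint: SongBlueprint) -> bytes:
    """Stand-in for the Diffusion Transformer: renders audio from the blueprint,
    never from the raw user prompt."""
    return f"AUDIO<{blueprint.style}|{blueprint.arrangement['bpm']}bpm>".encode()

audio = synthesize(plan_song("upbeat synthwave with female vocals"))
```

Because the DiT sees only the blueprint, the creative decisions (lyrics, structure, tempo) are fixed before any audio is rendered, which is what allows the planner and synthesizer to be swapped or scaled independently.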

Rather than relying on external reward models, ACE-Step 1.5 employs intrinsic reinforcement learning for alignment, which the team says avoids the preference biases that plague models trained on human feedback. The result is more predictable stylistic adherence across 50+ languages.

Map of ACE-Step 1.5 multi-modal applications including cover generation, vocal-to-BGM, and audio repainting
Image credit: ACE-Step 1.5 paper, arXiv 2602.00744

Capabilities and Editing Tools

ACE-Step 1.5 supports an unusually wide range of generation and editing workflows:

  • Text-to-music: Generate compositions from 10 seconds up to 10 minutes with natural language prompts.
  • Cover generation: Reinterpret an existing track in a new style using reference audio input.
  • Audio repainting: Selectively regenerate specific bars or segments without touching the rest of the composition.
  • Vocal-to-BGM: Automatically produce an accompaniment from a vocal track.
  • LoRA fine-tuning: Train personalized style adapters from just a few reference songs.
  • Metadata control: Specify BPM, key signature, time signature, and 1,000+ instrument and style combinations.
  • Batch generation: Produce up to 8 songs simultaneously.

The model is available in multiple DiT configurations (base, SFT, turbo) to trade off quality against generation speed, and the LM component can be swapped between 0.6B and 4B parameter versions depending on available hardware.
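The controls described above can be sketched as a request-building helper. This is a hypothetical illustration, not the real ACE-Step interface; the field names and the validation bounds simply mirror the limits stated in this article (10 seconds to 10 minutes of audio, up to 8 songs per batch, base/SFT/turbo DiT variants, 0.6B or 4B LM).

```python
def build_request(prompt, duration_s=60, bpm=None, key=None,
                  time_signature=None, batch=1, dit="turbo", lm="0.6B"):
    """Hypothetical generation-request builder; the actual API may differ."""
    if not 10 <= duration_s <= 600:
        raise ValueError("duration must be between 10 s and 10 min")
    if not 1 <= batch <= 8:
        raise ValueError("at most 8 songs per batch")
    if dit not in {"base", "sft", "turbo"}:
        raise ValueError("DiT variant must be base, sft, or turbo")
    if lm not in {"0.6B", "4B"}:
        raise ValueError("LM is available in 0.6B and 4B variants")
    return {"prompt": prompt, "duration_s": duration_s, "bpm": bpm,
            "key": key, "time_signature": time_signature,
            "batch": batch, "dit": dit, "lm": lm}

req = build_request("lo-fi hip hop, rainy mood", duration_s=90,
                    bpm=85, key="A minor", time_signature="4/4", batch=4)
```

Picking `dit="turbo"` with the 0.6B LM would correspond to the fastest, lowest-VRAM configuration; `dit="base"` with the 4B LM to the highest-quality one.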

What This Means for Open-Source AI Music

The commercial music generation landscape has been dominated by closed platforms — Suno, Udio, and ElevenLabs’ Eleven Music — that keep their weights proprietary and charge subscription fees. ACE-Step 1.5 is a significant counter-move: MIT-licensed, locally runnable on a mid-range gaming GPU, and trained on legally licensed and royalty-free music. Users retain full control over generated outputs and can fine-tune the model on their own catalogues.

The benchmark claims are striking. According to the paper, ACE-Step 1.5 achieves a SongEval score of 8.09, surpassing Suno v5, while other metrics — AudioBox (7.42), Style Alignment (6.47), Lyric Alignment (8.35) — are competitive with commercial leaders. Generation is reported to be 10–120× faster than comparable open-source models.

The team acknowledges current limitations: output quality can vary with random seed and duration settings, certain genres (notably Chinese rap) underperform, repainting transitions can sound unnatural, and fine-grained musical parameter control remains coarse. Vocal synthesis quality is noted as an area for future improvement.

ACE-Step 1.5 is available on GitHub, Hugging Face, and ModelScope, with weights, code, and an interactive demo all publicly accessible.