Baidu Open-Sources ERNIE-Image, an 8B Diffusion Transformer

Baidu has open-sourced ERNIE-Image, a new text-to-image diffusion model that the company claims reaches state-of-the-art quality among open-weight systems despite using only 8 billion parameters. The model is released under the Apache 2.0 license and is available directly on Hugging Face, along with a distilled “Turbo” variant that generates images in just eight inference steps.
Technical Details
ERNIE-Image is built on a single-stream Diffusion Transformer (DiT) with 8B parameters, paired with a lightweight Prompt Enhancer that rewrites short prompts into richer, more structured descriptions before passing them to the diffusion backbone. Baidu describes the design as a deliberate compact-but-competitive alternative to much larger open models.
Two checkpoints are available:
- ERNIE-Image (SFT) — the main general-purpose model, running roughly 50 inference steps at a guidance scale of 4.0.
- ERNIE-Image-Turbo — a distilled version trained with Distribution Matching Distillation and reinforcement learning, producing comparable images in only 8 steps.
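The two checkpoints differ mainly in their sampling settings. The sketch below shows how either one might be driven through the Diffusers library, assuming the model ships as a standard `DiffusionPipeline`; the repo id `baidu/ERNIE-Image` and the prompt are illustrative placeholders, not confirmed names, so check the Hugging Face model card for the exact identifiers.

```python
def sampling_config(variant: str) -> dict:
    """Return the sampling settings reported for each checkpoint."""
    if variant == "turbo":
        # distilled variant: comparable output in only 8 steps
        return {"num_inference_steps": 8}
    # SFT default: ~50 steps at guidance scale 4.0
    return {"num_inference_steps": 50, "guidance_scale": 4.0}

if __name__ == "__main__":
    # Heavy imports kept inside the entry point; requires a GPU and the
    # actual model weights to run.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "baidu/ERNIE-Image",        # hypothetical repo id
        torch_dtype=torch.bfloat16,
    ).to("cuda")

    image = pipe(
        "a poster with the headline 'OPEN WEIGHTS' in bold type",
        width=1024,
        height=1024,
        **sampling_config("sft"),
    ).images[0]
    image.save("ernie_image.png")
```

Swapping `sampling_config("sft")` for `sampling_config("turbo")` is the only change needed to trade the 50-step schedule for the distilled 8-step one.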
Supported resolutions include 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768 and 1200×896. Baidu says the model runs comfortably on a single consumer GPU with 24 GB of VRAM, putting it within reach of enthusiasts and small studios rather than datacenter-only territory.
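All of the listed resolution buckets land close to one megapixel, which is why they fit the same VRAM budget while covering different aspect ratios. A quick check (the helper name is illustrative):

```python
# Resolution buckets listed for ERNIE-Image (width, height).
RESOLUTIONS = [
    (1024, 1024), (848, 1264), (1264, 848),
    (768, 1376), (896, 1200), (1376, 768), (1200, 896),
]

def describe(width: int, height: int) -> dict:
    """Summarize a bucket by aspect ratio and pixel count."""
    return {
        "aspect": round(width / height, 3),
        "megapixels": round(width * height / 1e6, 2),
    }

for w, h in RESOLUTIONS:
    print(f"{w}x{h}: {describe(w, h)}")
```

Every bucket comes out between roughly 1.05 and 1.08 megapixels, so changing aspect ratio does not change the memory or compute cost in any significant way.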
Benchmarks and Strengths
Baidu highlights three areas where ERNIE-Image performs especially well: dense text rendering inside images (posters, signs, comic panels), complex multi-object instruction following, and structured layouts such as storyboards and multi-panel compositions. The reported numbers back this up:
- GenEval overall: 0.8728 with the prompt enhancer enabled (0.8856 on the single-object split, its best category).
- LongTextBench average: 0.9733 — a strong score on a benchmark specifically designed to test long-form text rendering in generated images.
- OneIG-EN overall: 0.5750 for the SFT model, 0.5656 for the Turbo variant.
The gap between the Turbo and SFT models is small across most benchmarks, which is notable given Turbo’s roughly 6× speedup in sampling.
What This Means
The open-weight text-to-image space has been dominated lately by much larger models or closed commercial systems. An 8B DiT that fits on a 24 GB GPU, ships under Apache 2.0, and is competitive on layout-heavy tasks fills a real gap — especially for users who need reliable text rendering in images, which has historically been a weak spot for open diffusion models. The Turbo variant also makes ERNIE-Image practical for interactive or batch workflows where 50-step sampling is a bottleneck.
For Baidu, the release also fits a broader pattern: the company has been steadily pushing its ERNIE family into the open-weight conversation alongside new multimodal efforts like ERNIE 5.0. Making ERNIE-Image permissively licensed and trivially deployable via the Diffusers library is a clear bid for mindshare among developers who would otherwise reach for Stable Diffusion or Flux variants.
Related Coverage
- Ernie offers 3.5 and 4 options — earlier coverage of Baidu’s ERNIE family.
- Ernie Bot (文心一言) Now Available to General Public — Baidu’s original consumer ERNIE launch.
- Baidu Says Its AI as Good as ChatGPT in Big Claim for China — background on Baidu’s positioning in the global AI race.


