Baidu Open-Sources ERNIE-Image, an 8B Diffusion Transformer

Baidu has open-sourced ERNIE-Image, a new text-to-image diffusion model that the company claims reaches state-of-the-art quality among open-weight systems despite using only 8 billion parameters. The model is released under the Apache 2.0 license and is available directly on Hugging Face, along with a distilled “Turbo” variant that generates images in just eight inference steps.


[Image: mosaic of sample images generated by Baidu's ERNIE-Image, showing diverse styles including posters, photographs, and illustrated scenes. Credit: Baidu / Hugging Face]

Technical Details

ERNIE-Image is built on a single-stream Diffusion Transformer (DiT) with 8B parameters, paired with a lightweight Prompt Enhancer that rewrites short prompts into richer, more structured descriptions before passing them to the diffusion backbone. Baidu describes the design as a deliberately compact but competitive alternative to much larger open models.
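
The two-stage flow can be sketched as follows; `enhance` is a hand-written stand-in for the learned Prompt Enhancer, and all names here are illustrative rather than part of Baidu's API:

```python
# Sketch of ERNIE-Image's two-stage design: a Prompt Enhancer rewrites the
# user's short prompt before the diffusion backbone sees it. The real
# enhancer is a learned model; this stub only illustrates the pipeline shape.

def enhance(prompt: str) -> str:
    """Stand-in for the Prompt Enhancer: expand a terse prompt into a richer one."""
    return f"{prompt}; detailed composition, coherent layout, legible text"

def generate(prompt: str, backbone):
    """Run a (hypothetical) diffusion backbone on the enhanced prompt."""
    return backbone(enhance(prompt))

# A dummy backbone shows that the enhanced prompt is what actually gets sampled:
print(generate("a movie poster", backbone=lambda p: p))
# → a movie poster; detailed composition, coherent layout, legible text
```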

Two checkpoints are available:

  • ERNIE-Image (SFT) — the main general-purpose model, running roughly 50 inference steps at a guidance scale of 4.0.
  • ERNIE-Image-Turbo — a distilled version trained with Distribution Matching Distillation and reinforcement learning, producing comparable images in only 8 steps.
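
Assuming deployment through the Diffusers library (which the article notes Baidu supports), the two checkpoints differ mainly in their sampling settings. The repo ids below are assumptions, and Turbo's guidance scale is not reported:

```python
# Reported sampling settings for the two checkpoints. Turbo's guidance scale
# is not published; distilled models typically run without classifier-free
# guidance, so it is omitted here rather than guessed.
SAMPLING = {
    "sft": {"num_inference_steps": 50, "guidance_scale": 4.0},
    "turbo": {"num_inference_steps": 8},
}

def load_pipeline(variant: str = "sft"):
    """Load a checkpoint via Diffusers. Imports are deferred so the settings
    above stay usable without torch/diffusers installed. Repo ids are assumed."""
    import torch
    from diffusers import DiffusionPipeline
    repo = {"sft": "baidu/ERNIE-Image", "turbo": "baidu/ERNIE-Image-Turbo"}[variant]
    return DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.bfloat16).to("cuda")
```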

Supported resolutions include 1024×1024, 848×1264, 1264×848, 768×1376, 896×1200, 1376×768 and 1200×896. Baidu says the model runs comfortably on a single consumer GPU with 24 GB of VRAM, putting it within reach of enthusiasts and small studios rather than datacenter-only territory.
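
Because only seven output sizes are supported, a small helper can snap an arbitrary request to the nearest supported aspect ratio. The resolution list is taken from the article; the helper itself is illustrative:

```python
# Supported (width, height) pairs listed for ERNIE-Image.
SUPPORTED = [
    (1024, 1024),
    (848, 1264), (1264, 848),
    (768, 1376), (1376, 768),
    (896, 1200), (1200, 896),
]

def closest_resolution(target_w: int, target_h: int) -> tuple:
    """Return the supported resolution whose aspect ratio best matches the target."""
    target = target_w / target_h
    return min(SUPPORTED, key=lambda wh: abs(wh[0] / wh[1] - target))

print(closest_resolution(1920, 1080))  # 16:9 request → (1376, 768)
```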

Benchmarks and Strengths

Baidu highlights three areas where ERNIE-Image performs especially well: dense text rendering inside images (posters, signs, comic panels), complex multi-object instruction following, and structured layouts such as storyboards and multi-panel compositions. The reported numbers back this up:

  • GenEval overall: 0.8728 with the prompt enhancer enabled (0.8856 on the best single-object split).
  • LongTextBench average: 0.9733 — a strong score on a benchmark specifically designed to test long-form text rendering in generated images.
  • OneIG-EN overall: 0.5750 for the SFT model, 0.5656 for the Turbo variant.

The gap between the Turbo and SFT models is small across most benchmarks, which is notable given Turbo’s roughly 6× speedup in sampling.

What This Means

The open-weight text-to-image space has been dominated lately by much larger models or closed commercial systems. An 8B DiT that fits on a 24 GB GPU, ships under Apache 2.0, and is competitive on layout-heavy tasks fills a real gap — especially for users who need reliable text rendering in images, which has historically been a weak spot for open diffusion models. The Turbo variant also makes ERNIE-Image practical for interactive or batch workflows where 50-step sampling is a bottleneck.

For Baidu, the release also fits a broader pattern: the company has been steadily pushing its ERNIE family into the open-weight conversation alongside new multimodal efforts like ERNIE 5.0. Making ERNIE-Image permissively licensed and trivially deployable via the Diffusers library is a clear bid for mindshare among developers who would otherwise reach for Stable Diffusion or Flux variants.
