ERNIE Image

Baidu's 8B Open-Source Text-to-Image Diffusion Transformer

2026-04-23

Product Introduction

  1. Overview: ERNIE Image is an 8-billion-parameter (8B) text-to-image generation model developed by Baidu. It is built on a single-stream Diffusion Transformer (DiT) architecture and is released under the permissive Apache 2.0 license, which allows commercial use.
  2. Value: It bridges the gap between massive proprietary models and local execution, allowing users to generate professional-grade graphics, legible typography, and complex layouts on a single consumer GPU (24GB VRAM) without API dependencies or usage quotas.

Main Features

  1. 8B Single-Stream Diffusion Transformer: A robust architecture designed for high-fidelity image synthesis and precise instruction following, supporting up to 2048×2048 resolution.
  2. LLM-Powered Prompt Enhancer: Includes a lightweight Large Language Model (LLM) that automatically expands simple user inputs into descriptive, structured prompts to improve visual output quality.
  3. State-of-the-Art Benchmarking: Outperforms many larger models in spatial reasoning and text accuracy, scoring 0.8856 on the GenEval benchmark and 0.9733 on LongTextBench, which measures the accuracy of text rendered inside images.

Problems Solved

  1. Challenge: Most diffusion models struggle with rendering legible text and maintaining structured layouts in posters or infographics.
  2. Audience: Graphic designers, comic artists, AI researchers, and developers looking for a customizable, locally hostable image generator.
  3. Scenario: Creating marketing posters with specific headlines, multi-panel comic strips with consistent logic, and complex scenes involving multiple objects with spatial relationships.

Unique Advantages

  1. Vs Competitors: Unlike many 'open' models that are restricted to non-commercial use, ERNIE Image's Apache 2.0 license allows full commercial freedom.
  2. Innovation: Optimized for the 'LongText' problem, it is one of the few open-weight models that can reliably generate paragraphs of readable text within a visual composition.

Frequently Asked Questions (FAQ)

  1. What are the hardware requirements for ERNIE Image? It requires a single consumer GPU with at least 24GB of VRAM (e.g., NVIDIA RTX 3090/4090) to run locally.
  2. Can ERNIE Image be used for commercial projects? Yes, the model weights are released under the Apache 2.0 license, which permits commercial use, modification, and distribution.
  3. Which languages does the prompt enhancer support? The model and its prompt enhancer are optimized for English (EN) and Chinese (ZH) generation, and also support Japanese (JA).
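
The 24GB figure in the hardware question above can be sanity-checked with some rough arithmetic. The sketch below estimates the memory needed just to hold 8B parameters at a few common precisions. The precisions listed are assumptions for illustration; the source does not say which one ERNIE Image ships with, and the helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 8e9  # 8B parameters, as stated for ERNIE Image

# Bytes per parameter for common precisions.
# Assumption: the source does not state which precision is actually used.
PRECISIONS = {"fp32": 4, "bf16/fp16": 2, "int8": 1}

for name, nbytes in PRECISIONS.items():
    gib = weight_memory_gib(NUM_PARAMS, nbytes)
    print(f"{name:>9}: {gib:5.1f} GiB of weights")
```

Under these assumptions, 16-bit weights come to roughly 15 GiB, which fits on a 24GB card, while 32-bit weights (about 30 GiB) would not. This counts only the diffusion transformer's weights; activations, the prompt enhancer, and any other components need additional memory, so treat it as a lower bound rather than a measured requirement.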
