Product Introduction
- Dream 7B is a 7-billion-parameter open-source diffusion large language model developed jointly by the HKU NLP Group and Huawei Noah’s Ark Lab, designed to advance text generation through non-autoregressive architectures.
- The core value of Dream 7B lies in outperforming existing diffusion language models while matching or exceeding top-tier autoregressive (AR) models of similar size, such as LLaMA3 8B and Qwen2.5 7B, on general, mathematical, and coding tasks, with additional strengths in planning and flexible inference.
Main Features
- Superior Performance Across Tasks: Dream 7B achieves state-of-the-art results among diffusion language models in general language understanding, mathematical reasoning, and code generation, validated on benchmarks against models of similar size (7B-8B parameters). It was pretrained on 580 billion tokens drawn from diverse corpora, including Dolma v1.7, OpenCoder, and DCLM-Baseline.
- Advanced Planning Abilities: The model demonstrates exceptional performance in constraint-based planning tasks such as Countdown and Sudoku, outperforming even much larger models like DeepSeek V3 671B in few-shot settings. This capability stems from its bidirectional contextual modeling and iterative refinement process.
- Flexible Inference Mechanisms: Unlike autoregressive models limited to left-to-right generation, Dream 7B supports arbitrary-order synthesis (e.g., infilling, completion) and dynamic quality-speed trade-offs via adjustable diffusion timesteps, enabling users to prioritize speed or output quality.
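Both the arbitrary-order generation and the quality-speed dial come from the same decoding loop: start from a fully masked sequence and progressively commit tokens over a fixed number of refinement passes. The Python sketch below is a toy illustration under assumed mechanics, not Dream's actual decoder: `toy_denoiser` is a random stand-in for the network, and confidence-ordered unmasking is one common strategy for diffusion LM decoding.

```python
import random

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def toy_denoiser(seq):
    """Stand-in for the model: propose a token and a confidence score
    for every masked position, conditioned on both sides of the text."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(seq) if tok == MASK}

def diffusion_decode(length, steps, fixed=None):
    """Unmask a sequence over `steps` refinement passes.
    Fewer steps -> more tokens committed per pass (faster, rougher);
    more steps -> fewer tokens per pass (slower, higher quality).
    `fixed` pins tokens at chosen positions, which is all infilling needs."""
    seq = [MASK] * length
    for pos, tok in (fixed or {}).items():
        seq[pos] = tok
    per_pass = max(1, (length - len(fixed or {})) // steps)
    for _ in range(steps):
        proposals = toy_denoiser(seq)
        if not proposals:
            break
        # Commit the highest-confidence positions first.
        ranked = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for pos, (tok, _) in ranked[:per_pass]:
            seq[pos] = tok
    # Fill any positions left over by integer division.
    for pos, (tok, _) in toy_denoiser(seq).items():
        seq[pos] = tok
    return seq

# Infilling: positions 0 and 5 are pinned; the middle is generated.
print(diffusion_decode(length=6, steps=3, fixed={0: "the", 5: "."}))
```

Because every pass sees the whole sequence, the pinned tokens constrain what gets generated around them, which is exactly what left-to-right AR decoding cannot do.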
Problems Solved
- Limitations of Autoregressive Models: Dream 7B addresses key weaknesses of AR models, including challenges in complex reasoning, long-term planning, and maintaining coherence across extended contexts, which are critical for applications like autonomous agents and decision-making systems.
- Target User Groups: The model is tailored for AI researchers, developers working on embodied AI systems, and enterprises requiring scalable language models for applications demanding high reasoning fidelity and flexible text generation.
- Use Case Scenarios: Typical applications include multi-step planning systems (e.g., robotics task sequencing), constrained text generation (e.g., code synthesis with strict syntax rules), and scenarios requiring real-time quality-compute trade-offs, such as interactive AI assistants.
Unique Advantages
- Architectural Differentiation: As a diffusion model, Dream 7B uses parallel sequence refinement and bidirectional context integration, contrasting with the sequential token-by-token generation of AR models. This enables superior global coherence and constraint satisfaction.
- Innovative Training Techniques: The model incorporates AR weight initialization (from Qwen2.5 7B) and a novel context-adaptive token-level noise rescheduling mechanism that assigns each token its own noise level based on its surrounding context, rather than applying a single noise level to the whole sequence (see the sketch after this list).
- Competitive Efficiency: Despite its non-AR architecture, Dream 7B achieves training efficiency comparable to AR models through optimized initialization and pretraining protocols; pretraining completed in 256 hours on 96 NVIDIA H800 GPUs without loss spikes.
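To make the noise-rescheduling idea concrete, here is a deliberately simplified sketch. It is an assumption about the shape of the mechanism, not the paper's exact formulation: a sequence is first corrupted at a single sequence-level noise level, then each masked token is assigned its own effective level based on how much visible context surrounds it.

```python
import random

def mask_sequence(tokens, t):
    """Standard discrete-diffusion corruption: mask each token
    independently with probability t (the sequence-level noise level)."""
    return [tok if random.random() > t else None for tok in tokens]

def contextual_noise_levels(masked, window=4):
    """Schematic token-level rescheduling: a masked token surrounded by
    many visible neighbors is easier to recover, so it gets a lower
    effective noise level; one in a heavily masked region stays high."""
    levels = {}
    for i, tok in enumerate(masked):
        if tok is not None:
            continue
        lo, hi = max(0, i - window), min(len(masked), i + window + 1)
        neighbors = [masked[j] for j in range(lo, hi) if j != i]
        visible = sum(n is not None for n in neighbors)
        levels[i] = 1.0 - visible / max(1, len(neighbors))
    return levels

tokens = "dream models denoise text in parallel".split()
masked = mask_sequence(tokens, t=0.5)
print(masked)
print(contextual_noise_levels(masked))  # per-token effective noise
```

One plausible use of these per-token levels is to replace the single sequence-level noise value when weighting the denoising objective, so each position is trained at a level that reflects how much context it actually has.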
Frequently Asked Questions (FAQ)
- How does Dream 7B differ from traditional autoregressive models like GPT-4? Dream 7B uses a diffusion architecture that refines text iteratively from noise, enabling bidirectional context modeling and flexible generation orders, whereas AR models generate tokens strictly left-to-right. This allows Dream to excel in planning and constrained reasoning tasks.
- What makes Dream 7B effective for planning tasks? The model’s diffusion process inherently supports multi-constraint optimization through parallel token updates, as demonstrated by its superior performance on Sudoku and Countdown benchmarks compared to AR models of similar size.
- Can users customize Dream 7B’s inference behavior? Yes, the model allows dynamic adjustment of the number of diffusion steps (refinement passes over the sequence; fewer steps mean more tokens are committed per pass) to balance speed and quality. For example, reducing steps accelerates inference for real-time applications, while increasing steps enhances output precision for critical tasks.
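As a concrete illustration of the steps dial, the snippet below follows the usage pattern published with the open Dream release; the checkpoint name `Dream-org/Dream-v0-Instruct-7B` and the `diffusion_generate` method and its parameters are taken from that repository and should be verified against the current version.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "Dream-org/Dream-v0-Instruct-7B"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()

messages = [{"role": "user", "content": "Plan a three-step study schedule."}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", return_dict=True, add_generation_prompt=True
)
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")

output = model.diffusion_generate(
    input_ids,
    attention_mask=attention_mask,
    max_new_tokens=128,
    steps=128,  # lower (e.g. 32) for speed; keep near max_new_tokens for quality
    temperature=0.2,
    top_p=0.95,
    return_dict_in_generate=True,
)
print(tokenizer.decode(
    output.sequences[0][input_ids.shape[1]:], skip_special_tokens=True
))
```

Since each refinement pass is one forward pass over the full sequence, halving `steps` roughly halves inference compute, at some cost in output quality.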