Product Introduction
- DreamActor-M1 is a diffusion transformer (DiT)-based human animation framework developed by ByteDance, designed to synthesize highly expressive and realistic human videos by animating reference images with behaviors captured from driving videos.
- The core value of DreamActor-M1 lies in its hybrid guidance system, which integrates motion, scale, and appearance controls to achieve fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence in generated animations.
Main Features
- DreamActor-M1 utilizes hybrid motion guidance combining implicit facial representations, 3D head spheres, and 3D body skeletons to enable precise control of facial expressions, head movements, and body poses while preserving identity and fidelity (a toy sketch of how these signals could be assembled follows this list).
- The framework employs a progressive training strategy with multi-resolution datasets to handle diverse scales, from portrait close-ups to full-body animations, ensuring adaptability to complex poses and varying image compositions.
- DreamActor-M1 integrates complementary visual references and motion patterns from sequential frames to maintain temporal consistency for occluded or unseen regions during dynamic movements, such as clothing folds or hair motion.
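To make the hybrid guidance concrete, here is a minimal PyTorch sketch of how the three control signals described above might be assembled into per-frame tokens. This is an illustrative assumption, not the released architecture: the module names (HybridMotionGuidance, face_encoder, pose_encoder), the dimensions, and the choice to render the head sphere and body skeleton into a single control image are all hypothetical.

```python
# A minimal sketch (not the official implementation) of assembling the three
# hybrid guidance signals into control tokens for the diffusion transformer.
import torch
import torch.nn as nn

class HybridMotionGuidance(nn.Module):
    def __init__(self, face_dim=512, token_dim=1024):
        super().__init__()
        # Implicit facial representation: a learned encoder maps a cropped
        # face frame to expression tokens, decoupling expression from identity.
        self.face_encoder = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(face_dim), nn.GELU(),
            nn.Linear(face_dim, token_dim),
        )
        # The 3D head sphere and body skeleton are rendered to a control image
        # and encoded with a lightweight convolutional pose encoder.
        self.pose_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, token_dim, 3, stride=2, padding=1),
        )

    def forward(self, face_crop, sphere_and_skeleton):
        # face_crop: (B, 3, H, W) face region cropped from the driving frame
        # sphere_and_skeleton: (B, 3, H, W) rendering of the head sphere
        # overlaid on the body skeleton for the same frame
        face_tokens = self.face_encoder(face_crop).unsqueeze(1)   # (B, 1, D)
        pose_feat = self.pose_encoder(sphere_and_skeleton)        # (B, D, h, w)
        pose_tokens = pose_feat.flatten(2).transpose(1, 2)        # (B, h*w, D)
        return face_tokens, pose_tokens
```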
Problems Solved
- Existing human animation methods often lack fine-grained control over facial expressions and body movements, suffer from limited scalability across portrait-to-full-body resolutions, and fail to maintain coherence in long-term sequences.
- The product targets content creators, filmmakers, and digital artists requiring high-fidelity human animations for applications like virtual influencers, film production, or interactive media.
- Typical use cases include generating lip-synced character dialogues in multiple languages, transferring motion from reference videos to custom avatars, and producing shape-aware animations with adjustable bone lengths for stylized characters (see the retargeting sketch after this list).
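As an aside on the bone-length adjustment mentioned above, the following toy sketch shows the general retargeting idea: keep each bone's direction from the driving skeleton but rescale its length to the target character's proportions. The kinematic chain, function names, and the purely positional treatment are illustrative assumptions, not DreamActor-M1's actual pipeline.

```python
# Toy bone-length retargeting: preserve joint directions from the driving
# skeleton, but impose the target character's bone lengths.
import numpy as np

# Hypothetical kinematic chain: each joint's parent index (-1 = root),
# e.g. pelvis, spine, neck, head, shoulder, elbow.
PARENTS = [-1, 0, 1, 2, 1, 4]

def retarget(driving_joints, target_bone_lengths):
    """driving_joints: (J, 3) joint positions from the driving video.
    target_bone_lengths: (J,) desired length of the bone ending at joint j
    (index 0, the root, is unused)."""
    out = np.zeros_like(driving_joints)
    out[0] = driving_joints[0]  # keep the root in place
    for j in range(1, len(PARENTS)):
        p = PARENTS[j]
        direction = driving_joints[j] - driving_joints[p]
        direction /= (np.linalg.norm(direction) + 1e-8)  # keep orientation
        out[j] = out[p] + direction * target_bone_lengths[j]  # rescale length
    return out

# Example: reuse the driving proportions but stretch the arm bones by 30%.
joints = np.random.rand(len(PARENTS), 3)
lengths = np.zeros(len(PARENTS))
for j in range(1, len(PARENTS)):
    lengths[j] = np.linalg.norm(joints[j] - joints[PARENTS[j]])
lengths[4:] *= 1.3  # shoulder and elbow bones
stylized = retarget(joints, lengths)
```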
Unique Advantages
- Unlike conventional methods that rely on a single control signal, DreamActor-M1 combines 3D skeletal data, implicit facial encoding, and visual references in a unified DiT architecture for robust multi-modal guidance.
- The framework innovates with cross-attention mechanisms (Face Attn, Ref Attn) to fuse facial motion tokens and appearance references directly into the denoising process, enabling simultaneous expression fidelity and temporal stability (a block-level sketch follows this list).
- Competitive advantages include state-of-the-art results on benchmarks for identity preservation, motion expressiveness, and temporal coherence, validated through quantitative and qualitative comparisons against leading animation methods.
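A minimal sketch of what a DiT block with the Face Attn and Ref Attn branches described above could look like, assuming standard PyTorch attention primitives. The layer ordering, normalization placement, and dimensions are assumptions for illustration, not the released design.

```python
# A sketch of a DiT block that fuses reference appearance tokens ("Ref Attn")
# and implicit facial motion tokens ("Face Attn") into the denoising stream.
import torch
import torch.nn as nn

class HybridDiTBlock(nn.Module):
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # "Ref Attn"
        self.face_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # "Face Attn"
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, ref_tokens, face_tokens):
        # x: (B, N, D) noisy video latent tokens
        # ref_tokens: (B, M, D) appearance tokens from the reference image
        # face_tokens: (B, K, D) implicit facial motion tokens
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norms[1](x)
        x = x + self.ref_attn(h, ref_tokens, ref_tokens, need_weights=False)[0]
        h = self.norms[2](x)
        x = x + self.face_attn(h, face_tokens, face_tokens, need_weights=False)[0]
        return x + self.mlp(self.norms[3](x))
```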
Frequently Asked Questions (FAQ)
- How does DreamActor-M1 handle animations at different scales (e.g., portrait vs. full-body)? The framework uses progressive training on multi-resolution datasets and a 3D VAE-based latent space to adaptively process inputs ranging from facial close-ups to full-body sequences without quality degradation (a toy encoder illustrating this shared latent space appears after this FAQ).
- Can DreamActor-M1 preserve the identity of characters across long videos? Yes, the hybrid guidance system maintains identity by injecting reference image tokens through concatenated self-attention layers and enforcing spatial-temporal constraints during the diffusion process (see the concatenation sketch after this FAQ).
- Does the system support audio-driven facial animation? DreamActor-M1 extends to audio synchronization by aligning lip movements with speech inputs through additional motion pattern integration, supporting multilingual lip-sync with minimal retraining.
- How are occluded regions handled during complex motions? Complementary visual references and motion priors from sequential frames are injected into the DiT blocks to infer plausible appearances for unseen areas, maintaining coherence in clothing or accessory dynamics; the concatenation sketch after this FAQ illustrates the same token-injection idea.
- What distinguishes DreamActor-M1 from other diffusion-based animation tools? The combination of 3D head spheres, skeletal controls, and scale-adaptive training within a DiT architecture provides unmatched granularity in motion transfer and robustness across diverse character designs.
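On the multi-scale question above, a toy 3D VAE encoder illustrates how clips at different resolutions can share one spatio-temporal latent space for the DiT to denoise. The strides, channel widths, and example resolutions are assumptions, not the actual DreamActor-M1 VAE.

```python
# A minimal video-VAE-style encoder: 4x temporal and 8x spatial downsampling,
# producing one latent grid regardless of the input resolution.
import torch
import torch.nn as nn

class TinyVideoVAEEncoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(128, latent_dim, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):
        # video: (B, 3, T, H, W); a portrait close-up and a full-body clip
        # pass through the same encoder, which is what makes multi-resolution
        # progressive training in one latent space possible.
        return self.net(video)

enc = TinyVideoVAEEncoder()
print(enc(torch.randn(1, 3, 16, 512, 512)).shape)  # torch.Size([1, 16, 4, 64, 64])
print(enc(torch.randn(1, 3, 16, 640, 960)).shape)  # torch.Size([1, 16, 4, 80, 120])
```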
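And on identity preservation and occlusion handling, this sketch shows the concatenated self-attention idea: reference-image tokens (and tokens from neighboring frames, serving as priors for currently occluded regions) are prepended to the video tokens as read-only context. Token counts and dimensions are illustrative assumptions.

```python
# Concatenated self-attention: reference and motion-prior tokens are attended
# to alongside the video tokens, then discarded from the output.
import torch
import torch.nn as nn

dim, heads = 1024, 16
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

video_tokens = torch.randn(1, 2048, dim)  # noisy latent tokens for a clip
ref_tokens = torch.randn(1, 256, dim)     # tokens from the reference image
prev_tokens = torch.randn(1, 256, dim)    # tokens from earlier frames (priors
                                          # for regions occluded right now)

# Self-attention over the concatenation; only the video part is kept, so the
# context tokens inform the denoised frames without being modified themselves.
seq = torch.cat([ref_tokens, prev_tokens, video_tokens], dim=1)
out, _ = attn(seq, seq, seq, need_weights=False)
video_out = out[:, ref_tokens.shape[1] + prev_tokens.shape[1]:]
print(video_out.shape)  # torch.Size([1, 2048, 1024])
```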
