
MiMo-V2.5 Voice

Bilingual ASR for dialects, code-switching, and songs

2026-04-25

Product Introduction

  1. Definition: MiMo-V2.5 Voice is a sophisticated suite of speech technologies developed by Xiaomi, encompassing both the MiMo-V2.5-ASR (Automatic Speech Recognition) and the MiMo-V2.5-TTS (Text-to-Speech) series. The ASR component is an 8B-parameter open-source model designed for high-accuracy transcription, while the TTS series utilizes advanced generative AI to produce natural, emotive, and highly controllable synthetic speech.

  2. Core Value Proposition: MiMo-V2.5 Voice exists to bridge the gap between robotic synthetic speech and human-like vocal performance. By providing developers with an 8B foundation model that supports Mandarin, English, and eight Chinese dialects, and a TTS system capable of "Director Mode" style control, it empowers ML engineers and researchers to build real-world voice applications that handle code-switching, complex emotions, and even singing with professional-grade fidelity.

Main Features

  1. 8B Open-Source ASR Model (MiMo-V2.5-ASR): This high-capacity speech recognition model is engineered to transcribe complex audio environments. It supports Mandarin, English, and eight distinct Chinese dialects, and its architecture is optimized for code-switched speech (mixing languages within an utterance) and niche tasks such as song lyric transcription, making it well suited to linguistically diverse regions.

  2. MiMo-V2.5-TTS Tri-Model Architecture: The system is divided into three specialized models for distinct use cases:

    • MiMo-V2.5-TTS (Standard): Provides out-of-the-box high-quality built-in voices with support for singing mode.
    • MiMo-V2.5-TTS-VoiceDesign: Uses zero-shot generation to create entirely new, custom voices based solely on text descriptions, eliminating the need for audio samples.
    • MiMo-V2.5-TTS-VoiceClone: Enables precise replication of any target voice using small audio samples to produce high-similarity synthetic output.

  3. Multi-Granularity Style & Director Mode: Unlike traditional TTS that offers limited emotion toggles, MiMo-V2.5 supports "Director Mode." This allows users to provide natural language instructions (e.g., "gentle but tired," "repressed anger") to control speech at the paragraph, sentence, word, or even character level. It supports complex audio tags for inhales, sighs, laughter, and coughs, ensuring the output matches the nuanced requirements of film-level content generation.

  4. Natural Language & Tag-Based Control: The API supports two distinct control methods for style. Natural Language Control is passed through the user role to describe the desired tone, while Audio Tag Control uses specific markers (e.g., (唱歌) ("sing"), [breath]) within the assistant role's text to trigger fine-grained acoustic features and rhythm adjustments.
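
The two control channels above can be sketched as a message-building helper, assuming the OpenAI-style chat message format the document says the API follows. The role conventions (user for style, assistant for tagged text) come from the source; any model name or endpoint you pair this with would be an assumption, so none is hard-coded here.

```python
def build_tts_messages(style_instruction, tagged_text):
    """Combine the two style-control channels:
    Natural Language Control goes in the user role,
    Audio Tag Control lives inline in the assistant text."""
    return [
        # Natural language description of the desired delivery.
        {"role": "user", "content": style_instruction},
        # Text to synthesize, with inline tags such as (唱歌) or [breath]
        # triggering fine-grained acoustic features.
        {"role": "assistant", "content": tagged_text},
    ]

messages = build_tts_messages(
    "Gentle but tired, slowing down at the end of each sentence.",
    "[breath] It has been a long day. (sigh) Let's rest now.",
)
```

Keeping style and content in separate roles means the same tagged script can be re-rendered with a different delivery by swapping only the user message.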

Problems Solved

  1. Pain Point: Unnatural and Monotonous AI Voices: Traditional TTS often sounds "uncanny" or lacks emotional depth. MiMo-V2.5 solves this through multi-emotion mixing and "multi-style switching," allowing a single segment to transition naturally from an announcement tone to a whisper or a roar.

  2. Target Audience:

    • ML Engineers & Researchers: Seeking high-parameter, open-source models for speech-to-text and text-to-speech experimentation.
    • AI Agent Developers: Building conversational bots (e.g., via Hermes Agent) that require low-latency, emotive vocal responses.
    • Content Creators & Game Developers: Requiring diverse character voices and specific dialect support (Northeast, Sichuan, Cantonese, etc.) without hiring multiple voice actors.
    • Enterprise Solutions: Companies needing automated customer service that can handle code-switched (English/Mandarin) dialogue.

  3. Use Cases:

    • AI Storytelling/Audiobooks: Using Director Mode to specify character traits, accents, and emotional fluctuations for different roles.
    • Voice Cloning for Personalization: Replicating a user's voice for personalized digital assistants.
    • Globalized Customer Support: Utilizing the ASR's dialect and code-switching capabilities to understand and respond to regional users accurately.
    • Singing Synthesis: Generating vocal performances for lyrics in Chinese and English.

Unique Advantages

  1. Differentiation (Instruction Following): Most competing TTS models require rigid parameter tuning. MiMo-V2.5 differentiates itself with superior instruction-following capabilities, where a single natural language sentence can dictate speed, resonance, and emotional subtext, effectively acting as a "vocal actor" rather than just a synthesizer.

  2. Key Innovation (Zero-Shot Voice Design): The VoiceDesign model is a significant innovation, allowing developers to create unique timbres (e.g., "a deep-voiced mature woman with a bone-chilling sense of oppression") without needing to source, record, or upload audio files.

  3. Integration and Compatibility: The platform is built for modern developer workflows, offering OpenAI-compatible API structures, streaming support (in compatibility mode), and seamless integration with top-tier Agent frameworks like Hermes Agent.

Frequently Asked Questions (FAQ)

  1. Does MiMo-V2.5-TTS support dialect synthesis? Yes. The model supports a wide variety of dialects, including Northeast dialect, Sichuan dialect, Henan dialect, and Cantonese. These can be activated using either natural language descriptions or specific audio tags at the start of the text.

  2. What is the difference between Voice Design and Voice Cloning? Voice Design (MiMo-V2.5-TTS-VoiceDesign) creates a brand-new voice based on your written description of the character's age, gender, and personality. Voice Cloning (MiMo-V2.5-TTS-VoiceClone) requires you to upload an existing audio sample (MP3 or WAV) to replicate that specific person's voice.
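
The contrast can be made concrete with two illustrative request payloads. The model names come from the source; the field names (`voice_description`, `reference_audio`, `text`) are assumptions for illustration only, so consult the official API reference for the real schema.

```python
# VoiceDesign: a new voice is created from a written description alone.
voice_design_request = {
    "model": "MiMo-V2.5-TTS-VoiceDesign",
    "voice_description": "a calm, deep-voiced mature woman",  # no audio needed
    "text": "Welcome back.",
}

# VoiceClone: an existing MP3/WAV sample is replicated (field name assumed).
voice_clone_request = {
    "model": "MiMo-V2.5-TTS-VoiceClone",
    "reference_audio": "speaker_sample.wav",  # small sample of the target voice
    "text": "Welcome back.",
}
```

The practical difference is the input artifact: a description string for VoiceDesign versus an audio file for VoiceClone.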

  3. How do I control the speed and emotion of the speech? You can use Natural Language Control by describing the pace in the user message role (e.g., "speak at an extremely fast pace, like a machine gun") or use Audio Tag Control by inserting markers like (Lazy) or [pauses for a moment] directly into the text content for the assistant role.

  4. What audio formats are supported for output? The MiMo-V2.5 API supports standard formats including wav and pcm16. For streaming calls, pcm16 is recommended to allow for seamless audio chunk splicing on the client side.
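
A minimal sketch of the client-side splicing the FAQ recommends: because raw pcm16 frames are headerless, streamed chunks can simply be concatenated and wrapped in a WAV container at the end. The chunk source is simulated here; how chunks arrive from the actual streaming response is an assumption left out of the sketch.

```python
import io
import wave

def splice_pcm16_to_wav(chunks, sample_rate=24000, channels=1):
    """Concatenate raw pcm16 chunks and wrap them in a WAV container.
    pcm16 has no per-chunk header, so joining the bytes is seamless."""
    pcm = b"".join(chunks)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16-bit samples = 2 bytes each
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

# Simulated stream: three chunks of 100 silent 16-bit mono samples each.
chunks = [b"\x00\x00" * 100 for _ in range(3)]
wav_bytes = splice_pcm16_to_wav(chunks)
```

The sample rate default above is an assumption; match it to whatever rate the API actually returns, or the spliced audio will play at the wrong pitch.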

  5. Is there a free trial for the MiMo-V2.5 series? During the Public Beta phase, the MiMo-V2.5 series is free for a limited time. Token Plan users also receive preferential rates, off-peak discounts, and periodic credit resets.
