
Voxtral Transcribe 2 by Mistral

Real-time speech-to-text with speaker diarization

2026-02-05

Product Introduction

  1. Definition: Voxtral Transcribe 2 is a next-generation speech-to-text AI model suite (technical category: end-to-end automatic speech recognition system) comprising two specialized variants: Voxtral Mini Transcribe V2 for batch processing and Voxtral Realtime for live applications.
  2. Core Value Proposition: It delivers ultra-low-latency, high-accuracy transcription with speaker diarization at industry-leading cost efficiency, enabling real-time voice applications and scalable enterprise deployments where privacy and speed are critical.

Main Features

  1. Voxtral Realtime Streaming Architecture:

    • How it works: Uses a novel streaming architecture (not chunk-based) to transcribe audio incrementally as input arrives, avoiding offline processing bottlenecks.
    • Technical specs: Latency configurable down to under 200 ms; 4B-parameter model optimized for edge deployment; released as open weights under the Apache 2.0 license.
    • Supported tech: Native multilingual support across 13 languages including English, Chinese, Hindi, and Arabic.
  2. Voxtral Mini Transcribe V2 Batch Processing:

    • How it works: Processes long-form audio (up to 3 hours) with context biasing and speaker diarization, using transformer-based acoustic modeling.
    • Technical specs: Achieves roughly 4% word error rate (WER) on the FLEURS benchmark, supports word-level timestamps, and reports a lower diarization error rate (DER) than competitors across 5+ benchmarks.
    • Supported tech: Noise-robust processing for challenging environments (e.g., factory floors, call centers).
  3. Enterprise-Grade Diarization & Context Biasing:

    • How it works: Assigns speaker labels with precise timestamps using clustering algorithms and accepts 100+ custom phrases (e.g., technical terms, names) to bias transcriptions.
    • Technical specs: When speech overlaps, the model prioritizes one speaker in the transcript. Context biasing for languages other than English is experimental.
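
The WER figures quoted above are easier to interpret with the metric spelled out. The following is a minimal sketch of the standard word error rate calculation (word-level edit distance divided by the number of reference words). It is not Mistral's evaluation code: the function name `word_error_rate`, the lowercase-and-split normalization, and the sample sentences are assumptions made for illustration, and published benchmarks usually apply more elaborate text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences, not taken from any benchmark.
reference = "please schedule the quarterly review for next tuesday morning"
hypothesis = "please schedule the quarterly revue for next tuesday"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

In this sample the hypothesis has one substituted word and one dropped word against a nine-word reference, so the WER is 2/9. A 4% WER corresponds to roughly one such error per 25 reference words.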

Problems Solved

  1. Pain Point: Traditional ASR systems incur delays of 500 ms or more, which is a barrier for voice agents and live subtitling. Voxtral Realtime targets this gap with latency configurable to under 200 ms.
  2. Target Audience:
    • Developers building real-time voice apps (e.g., conversational AI, contact centers)
    • Compliance officers requiring HIPAA/GDPR-compliant meeting transcription
    • Media producers needing multilingual subtitles with low latency
  3. Use Cases:
    • Real-time sentiment analysis during customer support calls
    • Automated minute-taking for multilingual corporate meetings
    • Live broadcast captioning with context biasing for technical terminology
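
For a use case such as automated minute-taking, word-level timestamps and speaker segments have to be combined into speaker-attributed turns. The sketch below shows one common way to do this: assign each word to the speaker segment it overlaps most in time. The `Word` and `SpeakerSegment` types, the `assign_speakers` function, and the sample data are hypothetical and do not reflect the actual Voxtral response format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class SpeakerSegment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(word: Word, segment: SpeakerSegment) -> float:
    """Length in seconds of the time range shared by a word and a segment."""
    return max(0.0, min(word.end, segment.end) - max(word.start, segment.start))


def assign_speakers(words, segments):
    """Group words into (speaker, text) turns by maximum time overlap."""
    turns = []
    for word in words:
        best = max(segments, key=lambda s: overlap(word, s), default=None)
        # Words that overlap no segment are labeled "unknown".
        if best is not None and overlap(word, best) > 0:
            speaker = best.speaker
        else:
            speaker = "unknown"
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1] = (speaker, f"{turns[-1][1]} {word.text}")
        else:
            turns.append((speaker, word.text))
    return turns


# Hypothetical sample data for illustration only.
words = [
    Word("Shall", 0.0, 0.3), Word("we", 0.3, 0.5), Word("start", 0.5, 0.9),
    Word("Yes", 1.2, 1.5), Word("please", 1.5, 1.9),
]
segments = [
    SpeakerSegment("speaker_1", 0.0, 1.0),
    SpeakerSegment("speaker_2", 1.1, 2.0),
]
for speaker, text in assign_speakers(words, segments):
    print(f"{speaker}: {text}")
```

With the sample data this yields two turns, "Shall we start" for speaker_1 and "Yes please" for speaker_2. Where speakers overlap, a maximum-overlap rule attributes each word to a single speaker, which mirrors the one-speaker-at-a-time behavior described for the diarization feature.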

Unique Advantages

  1. Differentiation:
    • Vs. GPT-4o/Deepgram: 30% lower WER.
    • Vs. ElevenLabs Scribe v2: 1/5 the cost.
    • Vs. AssemblyAI: 3x faster batch processing with native diarization.
  2. Key Innovation:
    • Sub-200ms configurable latency architecture (patent-pending) enabling new voice-agent use cases.
    • Open-weight edge deployment for privacy-first industries (e.g., healthcare, finance).

Frequently Asked Questions (FAQ)

  1. What languages does Voxtral Transcribe 2 support?
    Voxtral Transcribe 2 supports 13 languages including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch with best-in-class non-English accuracy.
  2. How does Voxtral ensure data privacy?
    Both models support on-premise or private-cloud deployment, and Voxtral Realtime is additionally released as open weights under Apache 2.0. This enables HIPAA/GDPR-compliant processing without third-party data exposure.
  3. What is the cost of Voxtral Transcribe 2?
    Voxtral Mini costs $0.003/minute for batch transcription, and Voxtral Realtime costs $0.006/minute for live streaming. That is up to 80% cheaper than competitors such as Deepgram Nova.
  4. Can Voxtral handle overlapping speech in meetings?
    Yes, with one caveat: the diarization engine assigns speaker labels during overlaps, but it primarily transcribes one speaker at a time. It is well suited to meeting transcripts, with 95%+ attribution accuracy.
  5. How accurate is Voxtral in noisy environments?
    It maintains <5% WER in high-noise scenarios (e.g., call centers, industrial sites) via noise-robust acoustic modeling trained on diverse audio datasets.
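
The per-minute prices quoted in the FAQ translate into budget figures with simple arithmetic. The sketch below uses only the two listed prices ($0.003/minute batch, $0.006/minute realtime). The function name `transcription_cost`, the mode keys, and the 1,000-hour monthly volume are illustrative assumptions, and actual billing granularity may differ.

```python
from decimal import Decimal

# Prices as listed in the FAQ above, in USD per minute of audio.
PRICE_PER_MINUTE_USD = {
    "batch": Decimal("0.003"),     # Voxtral Mini Transcribe V2
    "realtime": Decimal("0.006"),  # Voxtral Realtime
}


def transcription_cost(minutes: int, mode: str) -> Decimal:
    """Return the cost in USD for the given audio duration and mode."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return PRICE_PER_MINUTE_USD[mode] * minutes


# Assumed workload: 1,000 hours of audio per month.
monthly_minutes = 1_000 * 60
for mode in ("batch", "realtime"):
    cost = transcription_cost(monthly_minutes, mode)
    print(f"{mode}: ${cost:.2f} per month")
```

Under these assumptions, 1,000 hours of audio comes to $180 in batch mode and $360 in realtime mode, and a single 3-hour recording (the stated batch maximum) costs $0.54. `Decimal` is used rather than floats so that small per-minute prices multiply without rounding drift.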
