Product Introduction
- EVI 3 is Hume AI’s third-generation speech-language model that integrates transcription, language processing, and speech generation into a unified system for highly expressive, emotionally intelligent voice interactions.
- The core value of EVI 3 lies in its ability to generate any voice or personality from a text prompt while delivering human-like expressiveness, real-time responsiveness, and superior emotional understanding compared to existing models like GPT-4o.
Main Features
- EVI 3 uses a single autoregressive model to process both text (T) and voice (V) tokens, enabling seamless integration of language instructions and vocal style customization through system prompts (see the first sketch after this list).
- The model streams user speech and generates responses at conversational latency (under 300ms on optimized hardware) while maintaining audio quality equivalent to Hume’s text-to-speech model, Octave.
- EVI 3 dynamically incorporates real-time context from parallel systems, such as web search or reasoning models, into its responses, allowing it to “think fast and slow” during interactions (see the second sketch after this list).
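
To make the system-prompt and streaming flow above concrete, here is a minimal sketch of one session over a WebSocket: it sets a voice and personality with a system prompt, streams a user turn up as audio, and collects the spoken reply. The endpoint URL, the api_key query parameter, and the session_settings/audio_input/audio_output/assistant_end message shapes are assumptions modeled on the pattern of Hume’s EVI documentation, not details confirmed by this overview; check the current API reference before relying on any of them.

```python
# Minimal sketch of an EVI session, assuming a Hume-style WebSocket API.
# HUME_API_KEY is assumed to be set in the environment.
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

SYSTEM_PROMPT = (
    "You are a calm, encouraging interview coach. Speak in a warm, "
    "unhurried voice and keep answers under three sentences."
)

async def run_session(audio_path: str) -> None:
    # Assumed endpoint and auth scheme.
    url = f"wss://api.hume.ai/v0/evi/chat?api_key={os.environ['HUME_API_KEY']}"
    async with websockets.connect(url) as ws:
        # 1. Set the voice and personality for this session via a system prompt.
        await ws.send(json.dumps({
            "type": "session_settings",
            "system_prompt": SYSTEM_PROMPT,
        }))

        # 2. Stream one user turn up as base64-encoded audio chunks.
        #    (End of speech is assumed to be detected from the audio itself.)
        with open(audio_path, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(json.dumps({
                    "type": "audio_input",
                    "data": base64.b64encode(chunk).decode("ascii"),
                }))

        # 3. Collect the spoken reply as it streams back.
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "audio_output":
                audio_bytes = base64.b64decode(msg["data"])  # play or buffer this
                print(f"received {len(audio_bytes)} bytes of audio")
            elif msg.get("type") == "assistant_end":
                break

if __name__ == "__main__":
    asyncio.run(run_session("user_turn.wav"))
```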
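The “think fast and slow” behavior can be sketched as a slower parallel task whose result is injected into the live session as context while the fast conversational loop keeps running. The session_settings.context message shape is likewise an assumption modeled on Hume’s EVI documentation, and slow_web_search and inject_context are hypothetical helpers for illustration.

```python
# Sketch: a slower parallel system (stubbed web search) finishes mid-conversation
# and its result is pushed into the session so EVI 3 can use it in the next turn.
import asyncio
import json

async def slow_web_search(query: str) -> str:
    """Stand-in for a real search or reasoning model running in parallel."""
    await asyncio.sleep(2.0)  # simulate a slow external call
    return f"Search result for {query!r}: EVI 3 entered early access in 2025."

async def inject_context(ws, query: str) -> None:
    """Run the slow system alongside the live session, then push its output."""
    result = await slow_web_search(query)
    await ws.send(json.dumps({
        "type": "session_settings",
        "context": {"text": result, "type": "temporary"},  # assumed field names
    }))

# Usage inside a live session (ws is the connected EVI WebSocket):
#   asyncio.create_task(inject_context(ws, "EVI 3 release date"))
# The audio-streaming loop continues uninterrupted while this task runs.
```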
Problems Solved
- EVI 3 addresses the limited expressiveness and rigid voice options of traditional voice AI by enabling instant generation of custom voices and personalities without requiring fine-tuning datasets.
- The product targets developers and enterprises building voice-enabled applications requiring emotional nuance, such as customer service bots, interactive entertainment, or AI coaching tools.
- Typical use cases include real-time multilingual voice agents, emotionally responsive AI companions, and dynamic role-playing scenarios (e.g., acting as a pirate or simulating job interview stress).
Unique Advantages
- Unlike GPT-4o and similar models, EVI 3 processes speech and language through a unified architecture rather than a pipeline of separate transcription, LLM, and TTS components, reducing latency and improving contextual coherence.
- Its reinforcement learning framework trains the model to identify and replicate preferred vocal qualities from any speaker, achieving superior emotion/style modulation (e.g., “sultry” or “exhilarated” tones) compared to competitors.
- In blind evaluations, EVI 3 outperformed GPT-4o, Gemini, and Sesame across seven metrics, including empathy (15% higher), naturalness (22% higher), and interruption handling, while maintaining faster practical latency (1.2s average vs. GPT-4o’s 2.6s).
Frequently Asked Questions (FAQ)
- How does EVI 3 generate custom voices without pre-recorded samples? EVI 3 infers vocal characteristics and personalities directly from text prompts using Hume’s text-to-speech platform, which has already created over 100,000 unique voices through style descriptors and semantic context (see the first sketch after this FAQ).
- What hardware is required to achieve sub-300ms latency? The model achieves benchmark latency on NVIDIA H100 GPUs, though real-world performance depends on network conditions and implementation; Hume provides optimized API endpoints for web and mobile integration.
- Does EVI 3 support languages beyond English? While currently optimized for English, the model is being trained for French, German, Italian, and Spanish, with full multilingual support planned before its official API release in late 2025.
- How does EVI 3 recognize emotions in user speech? The model analyzes tone, rhythm, and timbre independently of language content, achieving 89% accuracy in blind tests for recognizing nine core emotions (e.g., anger, joy) compared to GPT-4o’s 72% (see the second sketch after this FAQ).
- When will API access be available? Developers can test EVI 3 through Hume’s iOS app and web demo immediately, with enterprise API access rolling out in Q3 2025 via an early access program prioritizing high-volume use cases.
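
As a concrete illustration of prompt-driven voice creation (the pirate role-play mentioned above), the sketch below requests speech in a voice described only in text, with no recorded sample. The REST endpoint, the utterances/description request fields, and the response layout are assumptions in the style of Hume’s Octave TTS documentation; verify them against the current API reference.

```python
# Sketch: generate speech in a voice inferred from a text description alone.
import base64
import os

import requests  # pip install requests

resp = requests.post(
    "https://api.hume.ai/v0/tts",  # assumed endpoint
    headers={"X-Hume-Api-Key": os.environ["HUME_API_KEY"]},  # assumed auth header
    json={
        "utterances": [{
            "text": "Arr, welcome aboard! Mind the plank, matey.",
            # No recorded sample: the voice is inferred from this description.
            "description": "A gravelly old pirate captain, boisterous and warm.",
        }]
    },
    timeout=30,
)
resp.raise_for_status()
audio_b64 = resp.json()["generations"][0]["audio"]  # assumed response field
with open("pirate.wav", "wb") as f:
    f.write(base64.b64decode(audio_b64))
```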
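Finally, for the emotion-recognition question above, here is a self-contained sketch of how an application might surface the top-scoring vocal expressions in one user turn. It assumes incoming user_message events carry a models.prosody.scores name-to-score map, following the pattern of Hume’s EVI documentation; the exact field names are assumptions.

```python
# Sketch: pick the strongest vocal expressions out of one user turn.
import json

def top_emotions(raw_message: str, k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-scoring vocal expressions in a user_message event."""
    msg = json.loads(raw_message)
    if msg.get("type") != "user_message":
        return []
    scores = msg.get("models", {}).get("prosody", {}).get("scores", {})
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Example with a stubbed message (scores are illustrative, not real output):
sample = json.dumps({
    "type": "user_message",
    "models": {"prosody": {"scores": {"Joy": 0.81, "Anger": 0.04, "Calmness": 0.35}}},
})
print(top_emotions(sample))  # [('Joy', 0.81), ('Calmness', 0.35), ('Anger', 0.04)]
```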
