Mercury 2

Fastest reasoning LLM built for instant production AI

2026-02-25

Product Introduction

  1. Definition: Mercury 2 is a diffusion-based large language model (LLM) engineered for real-time reasoning. It replaces traditional autoregressive decoding with parallel refinement technology, enabling simultaneous token generation.
  2. Core Value Proposition: It eliminates latency bottlenecks in production AI loops (e.g., agentic workflows, RAG pipelines) by delivering reasoning-grade quality at speeds exceeding 1,000 tokens/sec, making real-time AI interactions economically viable.

Main Features

  1. Parallel Refinement:
    • How it works: Instead of sequential left-to-right decoding, Mercury 2 generates multiple tokens concurrently through iterative refinement, converging on a response in 10–20 diffusion steps (vs. the 100+ sequential decoding steps an autoregressive model needs for a response of similar length). A toy sketch of this decoding style appears after this list.
    • Technology: Leverages diffusion probabilistic models adapted for language tasks, optimized via NVIDIA Blackwell GPUs.
  2. Ultra-Low Latency:
    • Achieves 1,009 tokens/sec throughput under high concurrency, measured at the 95th percentile.
    • Maintains stable throughput during peak loads, critical for voice interfaces and interactive tools.
  3. Tunable Reasoning & Tool Integration:
    • Supports dynamic reasoning adjustment (e.g., chain-of-thought complexity) via step control.
    • Features native tool use, 128K context windows, and schema-aligned JSON output for seamless API integration.
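
The parallel-refinement loop can be pictured with a toy mask-based diffusion decoder. This is a purely illustrative sketch, not Mercury 2's published algorithm: the `TARGET` list stands in for the model's per-position predictions, and the reveal schedule is invented for demonstration.

```python
import random

MASK = "<mask>"
# Stand-in for the model's per-position predictions; a real diffusion LLM
# scores every position with one transformer forward pass per step.
TARGET = "the cat sat on the mat".split()

def denoise_step(seq: list[str], step: int, total_steps: int) -> list[str]:
    """Unmask a growing fraction of positions in parallel each step."""
    budget = max(1, len(seq) * (step + 1) // total_steps)
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    for i in random.sample(masked, min(budget, len(masked))):
        seq[i] = TARGET[i]  # in a real model: the denoiser's prediction here
    return seq

seq = [MASK] * len(TARGET)
total_steps = 5  # Mercury 2 reportedly converges in 10-20 steps
for step in range(total_steps):
    seq = denoise_step(seq, step, total_steps)
    print(f"step {step + 1}: {' '.join(seq)}")
```

Because each step updates many positions at once, the number of model invocations scales with the step count (10–20) rather than the output length, which is where the throughput gain comes from.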

Problems Solved

  1. Pain Point: Autoregressive decoding in conventional LLMs creates compounded latency in multi-step AI loops (e.g., agents, RAG), degrading user experience and limiting workflow complexity.
  2. Target Audience:
    • Developer Tools Engineers (e.g., Zed’s code-suggestion workflows).
    • Voice AI Developers (e.g., Happyverse AI’s real-time avatars).
    • Enterprise Automation Teams (e.g., Skyvern’s document processing agents).
    • RAG Pipeline Architects (e.g., SearchBlox’s retrieval systems).
  3. Use Cases:
    • Agentic Workflows: 20+ inference calls per task without a prohibitive cumulative latency penalty (see the back-of-envelope numbers after this list).
    • Real-Time Voice Interfaces: Sub-second responses matching natural speech cadence.
    • Interactive Coding: In-flow code edits/refactors (e.g., Zed).
    • Multi-Hop RAG: Reasoning-enhanced search within sub-second budgets.
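
To see why call count stops being the bottleneck, here is a back-of-envelope comparison using the document's own figures (roughly 1,000 tokens/sec for Mercury 2, and a 5x slower sequential baseline); the 300-tokens-per-call response size is an assumption for illustration.

```python
# Rough generation time for a 20-call agentic task. Throughputs follow the
# document's claims (~1,000 tok/s, 5x faster than sequential baselines);
# 300 output tokens per call is an assumed, illustrative figure.
calls, tokens_per_call = 20, 300
for name, tok_per_sec in [("Mercury 2", 1000), ("sequential baseline", 200)]:
    seconds = calls * tokens_per_call / tok_per_sec
    print(f"{name}: {seconds:.0f}s of generation across the task")
# Mercury 2: 6s vs. sequential baseline: 30s
```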

Unique Advantages

  1. Differentiation:
    • 5x faster than sequential models while matching the quality benchmarks of speed-optimized LLMs (e.g., GPT-5.2).
    • Pricing of $0.75 per 1M output tokens undercuts competitors that need equivalent compute for similar tasks (a rough per-task cost estimate follows this list).
  2. Key Innovation:
    • First reasoning diffusion LLM: Applies image-diffusion principles to language, enabling parallel token generation. This shifts the quality-latency tradeoff curve, allowing complex reasoning in previously infeasible real-time scenarios.
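
At the quoted output price, even long agentic loops stay cheap; the token counts below are assumptions for illustration.

```python
# Per-task output cost at the quoted $0.75 per 1M output tokens.
# The 20 calls x 300 tokens workload is an assumed, illustrative figure.
price_per_output_token = 0.75 / 1_000_000
calls, tokens_per_call = 20, 300
print(f"${calls * tokens_per_call * price_per_output_token:.4f} per task")  # $0.0045
```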

Frequently Asked Questions (FAQ)

  1. How does Mercury 2 achieve 1,000+ tokens/sec?
    By replacing sequential decoding with parallel-refinement diffusion, which generates and refines many tokens per step instead of one at a time; serving is further optimized on NVIDIA Blackwell GPUs.
  2. Is Mercury 2 compatible with existing AI stacks?
    Yes. It is OpenAI API-compatible, so it can drop into current pipelines without code rewrites (see the integration sketch after this FAQ).
  3. What applications benefit most from Mercury 2?
    Latency-sensitive use cases like voice AI, agentic loops, and real-time coding tools where >200ms delays break user immersion.
  4. How does diffusion improve reasoning quality?
    Parallel refinement allows more compute per token within fixed latency budgets, enabling complex chain-of-thought reasoning traditionally limited by sequential bottlenecks.
  5. What hardware optimizes Mercury 2 performance?
    NVIDIA Blackwell GPUs deliver peak throughput, though the model runs efficiently on standard cloud infrastructure.
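
Given the claimed OpenAI API compatibility, integration should look like a standard client pointed at a new endpoint. This is a minimal sketch: the base URL, model id, and environment variable below are placeholders rather than confirmed values, and JSON mode is assumed to work from the "schema-aligned JSON output" claim above.

```python
import os
from openai import OpenAI  # official openai-python client, v1+

# Placeholder endpoint, key variable, and model id -- not confirmed values.
client = OpenAI(
    base_url="https://api.example-mercury-host.com/v1",
    api_key=os.environ["MERCURY_API_KEY"],
)

resp = client.chat.completions.create(
    model="mercury-2",  # placeholder model id
    messages=[{"role": "user", "content": 'Answer as JSON: {"sum": <2+2>}'}],
    response_format={"type": "json_object"},  # JSON mode, assumed supported
)
print(resp.choices[0].message.content)
```

Relative to a stock OpenAI integration, only `base_url`, `api_key`, and the model id change, which is what "no code rewrites" means in practice.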
