Step 3.7 Flash

A Flash-speed agent model that can see and act

2026-05-30

Product Introduction

  1. Definition: Step 3.7 Flash is a high-efficiency, open-weight multimodal AI model released under the Apache 2.0 license and designed for real-world agentic applications. It is a "Flash-tier" model, meaning it is optimized for speed and cost-efficiency while maintaining strong performance, with approximately 11 billion active parameters and a 256K-token context window.
  2. Core Value Proposition: It exists to provide a cost-effective, high-throughput foundation for building autonomous agents that can see, think, and act by combining native multimodal understanding, reliable tool use, web search, and coding capabilities, achieving up to 400 tokens per second (TPS).

Main Features

  1. Native Multimodal Understanding & Acting: The model natively processes and understands images—including product UIs, documents, charts, and natural scenes—and can then write code or call tools to act on that visual information. This is powered by a built-in vision transformer (ViT) and integrated tool-calling architecture.
  2. Web & Visual Search Enhancement: Step 3.7 Flash features enhanced search capabilities for both text and images. Its web search can access more sources and perform deeper follow-up queries. Its visual search can recognize long-tail entities and newly emerged concepts that other systems often miss, compensating for parametric knowledge limitations through test-time tool use.
  3. Reliable Tool Use & Orchestration: The model is engineered for robust, long-horizon task execution. It can reliably drive terminals, browsers, Office tools, and search engines, maintaining coherence over extended runs with less agentic drift, fewer broken tool calls, and fewer failed executions compared to previous versions.
  4. Agent Ecosystem Compatibility: It is designed for seamless integration into existing agent development workflows. Step 3.7 Flash works with mainstream agent harnesses like Claude Code, KiloCode, Hermes Agent, and OpenClaw, reducing integration costs and minimizing workflow rewiring for developers.
  5. High-Efficiency Architecture: With ~11B active parameters out of a 196B total parameter count, the model employs a Mixture of Experts (MoE) architecture to achieve Flash-tier operational efficiency. This allows for high token throughput (up to 400 TPS) and lower inference costs while maintaining competitive benchmark performance.
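
The sparse-activation idea in feature 5 can be illustrated with a small sketch. The first function only restates the arithmetic from the figures above (~11B active out of 196B total). The top-k router is a generic Mixture of Experts illustration: the expert count, the value of k, and the renormalization step are assumptions for demonstration, since the source does not describe Step 3.7 Flash's actual routing.

```python
import math


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of the model's parameters that participate in each token."""
    return active_params_b / total_params_b


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token (generic MoE routing).

    Returns (expert_index, weight) pairs; weights are renormalized over the
    selected experts so they sum to 1. This is an illustrative scheme, not
    the model's documented router.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


if __name__ == "__main__":
    # Figures taken from the text: ~11B active of 196B total parameters.
    print(f"active share per token: {active_fraction(11, 196):.1%}")
    # Hypothetical router scores for 4 experts; only 2 of them run.
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
```

Only the selected experts run for a given token, so per-token compute tracks the active parameter count rather than the total.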

Problems Solved

  1. Pain Point: The high cost and latency of running large, frontier AI models for continuous, long-horizon agentic tasks (like coding, data analysis, and process automation) make them impractical for many real-world applications.
  2. Target Audience: Enterprise developers building AI agents for customer support, data analysis, and internal automation; AI engineers and researchers working on agentic systems and tool-augmented models; and companies that need efficient multimodal AI for document processing, visual search, and coding assistance.
  3. Use Cases:
    • Agentic Coding: Automating software engineering tasks, bug fixes, and code reviews within integrated development environments.
    • Enterprise Process Automation: Handling multi-step workflows involving documents, spreadsheets, databases, and web applications (e.g., generating production schedules from specifications).
    • Multimodal Research & Analysis: Conducting deep-dive research by synthesizing information from web articles, academic PDFs, charts, and images.
    • Visual Task Completion: Operating graphical user interfaces (GUIs) for mobile or desktop applications, testing frontend code, or extracting information from complex screenshots.

Unique Advantages

  1. Differentiation: Unlike many open-weight models that sacrifice capability for size, Step 3.7 Flash delivers performance competitive with much larger "pro-level" models (like GPT 5.5, Claude Opus 4.7) on agentic and coding benchmarks, but at a fraction of the computational cost and latency. It specifically outperforms other Flash-sized models like DeepSeek V4 Flash in key agentic areas (e.g., Toolathlon, ClawEval).
  2. Key Innovation: Advisor Mode is a strategy for cost-effective agentic execution. Step 3.7 Flash runs the main task loop and consults a larger, more expensive "advisor" model (such as a frontier model) only at critical decision points (e.g., planning, recovery from errors). This approach can achieve ~97% of a top model's performance at roughly one-ninth the cost per task.
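
The Advisor Mode control flow described above can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: `Step`, its `critical` flag, `run_with_advisor`, and the `worker` and `advisor` callables are all hypothetical names standing in for calls to the small model and the larger advisor model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    description: str
    critical: bool  # hypothetical flag: planning or error-recovery step


def run_with_advisor(
    steps: List[Step],
    worker: Callable[[str, str], str],
    advisor: Callable[[str], str],
) -> Tuple[List[str], int]:
    """Run every step with the cheap worker model.

    The expensive advisor is consulted only on steps marked critical, and
    its advice is passed to the worker as a hint. Returns the worker outputs
    and the number of advisor calls made.
    """
    outputs: List[str] = []
    advisor_calls = 0
    for step in steps:
        hint = ""
        if step.critical:
            hint = advisor(step.description)
            advisor_calls += 1
        outputs.append(worker(step.description, hint))
    return outputs, advisor_calls


if __name__ == "__main__":
    # Stub models so the sketch runs without any API access.
    def fake_worker(task: str, hint: str) -> str:
        return f"did '{task}'" + (f" using advice '{hint}'" if hint else "")

    def fake_advisor(task: str) -> str:
        return f"plan for {task}"

    plan = [
        Step("draft a plan", critical=True),
        Step("edit the file", critical=False),
        Step("run the tests", critical=False),
        Step("recover from a failed test", critical=True),
    ]
    results, calls = run_with_advisor(plan, fake_worker, fake_advisor)
    print(results)
    print("advisor calls:", calls)
```

In this toy run the advisor is called on two of four steps. The cost saving in practice depends on how rarely critical steps occur and on the price gap between the two models, neither of which is specified here.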

Frequently Asked Questions (FAQ)

  1. What is the difference between Step 3.7 Flash and Step 3.5 Flash? Step 3.7 Flash is a significant upgrade, adding native multimodal vision capabilities, more reliable tool use, enhanced search, and major gains on agentic benchmarks (e.g., +5% on SWE-Bench Pro, +6.1% on Terminal-Bench 2.1). It also introduces Advisor Mode for cost optimization.
  2. How does Step 3.7 Flash achieve high speed (400 TPS)? It utilizes a Mixture of Experts (MoE) architecture with ~11B active parameters out of 196B total, allowing it to activate only a subset of its neural network for each token, drastically increasing processing speed and efficiency compared to dense models of similar capability.
  3. Can Step 3.7 Flash run locally on a workstation? Yes, it can be deployed on high-memory workstations such as the NVIDIA DGX Station, AMD Ryzen AI Max+ 395 systems, or an Apple Mac Studio/MacBook Pro with at least 128GB of unified memory, using inference engines like vLLM, llama.cpp, or Hugging Face Transformers.
  4. What benchmarks prove Step 3.7 Flash's agentic capability? Key benchmarks include SWE-Bench Pro (56.3%), Terminal-Bench 2.1 (59.6%), Toolathlon (49.5%), ClawEval-1.1 (67.1%), and GDPval (45.8%). These test its coding, terminal command, multi-tool orchestration, and general task execution abilities.
  5. Is Step 3.7 Flash good for visual tasks without internet search? Yes, for pure visual perception and reasoning (e.g., V*, HR-Bench), it can use a Python tool interface to manipulate images (cropping, zooming) and achieves scores like 95.3% on V*, rivaling models five times its size. For visual recognition requiring world knowledge, it effectively uses its Visual Search tool.
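
For the local-deployment question (FAQ 3), a rough back-of-the-envelope estimate shows how the 196B total parameter count relates to a 128GB memory budget. The helper below is a hypothetical calculation that counts weight storage only; it ignores the KV cache, activations, and runtime overhead, and the bit widths shown are generic examples rather than formats confirmed for this model.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and framework overhead, so real memory
    use will be higher.
    """
    return total_params_billion * bits_per_weight / 8


if __name__ == "__main__":
    total_b = 196  # total parameters in billions, as stated in the text
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gb(total_b, bits):.0f} GB")
```

By this arithmetic, 16-bit weights need about 392 GB, 8-bit about 196 GB, and 4-bit about 98 GB. This suggests that fitting within 128GB would require roughly 4-bit quantization; that is an inference from the numbers, not a statement from the source.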
