
Step 3.5 Flash

Frontier open-source MoE model built for OpenClaw agents

2026-03-05

Product Introduction

  1. Definition: Step 3.5 Flash is a 196-billion-parameter sparse Mixture-of-Experts (MoE) large language model (LLM) engineered for high-efficiency inference, activating only 11 billion parameters per token.
  2. Core Value Proposition: It delivers frontier reasoning capabilities and robust agentic performance while minimizing computational overhead, enabling real-time interaction for complex tasks like coding automation, multi-step research, and edge-cloud orchestration.

Main Features

  1. Sparse Mixture-of-Experts (MoE) Architecture

    • How it works: Uses expert routing to dynamically select specialized subnetworks per token, limiting active parameters to 11B of the total 196B.
    • Technologies: Hybrid dense/sparse layer stacking for optimized memory use; gating mechanisms for expert selection.
  2. Multi-Token Prediction (MTP-3)

    • How it works: Predicts 3 future tokens in parallel via dedicated output heads, enabling speculative decoding.
    • Impact: Achieves 100–350 tokens/sec throughput, critical for latency-sensitive agentic workflows.
  3. Hybrid Long-Context Attention

    • How it works: Combines Sliding Window Attention (SWA) and Full Attention layers at a 3:1 ratio.
    • Technologies: 256K context window with Head-wise Gated Attention for stability; SWA layers augmented to 96 query heads for enhanced representation.
  4. Native Agentic Integration (OpenClaw)

    • How it works: Exposes tool use, code execution, and multi-agent orchestration through native OpenClaw compatibility.
    • Capabilities: Supports Python runtime, GUI automation (Step-GUI), and cloud-edge task delegation.
  5. Scalable Reinforcement Learning (MIS-PO)

    • How it works: Uses Metropolis Independence Sampling for stable off-policy RL, filtering divergent trajectories.
    • Impact: Enables continuous self-improvement in math (AIME: 97.3 → 99.8) and coding (SWE-bench: 74.4%).
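The expert routing described in feature 1 can be sketched as a toy top-k gating layer. The dimensions, expert count, and `k=2` below are illustrative only, not Step 3.5 Flash's actual configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:        (tokens, d_model) activations
    gate_w:   (d_model, n_experts) router weights
    experts:  list of (d_model, d_model) expert weight matrices
    """
    logits = x @ gate_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        gates = softmax(logits[t, sel])          # renormalize over selected experts
        for g, e in zip(gates, sel):
            out[t] += g * (x[t] @ experts[e])    # weighted sum of expert outputs
    return out

rng = np.random.default_rng(0)
d, n_exp = 8, 4
x = rng.normal(size=(3, d))
gate_w = rng.normal(size=(d, n_exp))
experts = [rng.normal(size=(d, d)) for _ in range(n_exp)]
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (3, 8)
```

Only the selected experts' weights participate in each token's forward pass, which is what keeps active parameters at a fraction of the total.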
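The speculative-decoding step behind MTP-3 (feature 2) can be illustrated with a greedy toy verifier: draft heads propose several tokens at once, and the main model accepts the longest prefix it agrees with. The arithmetic "models" here are stand-ins, not the real networks:

```python
def verify_draft(target_next, prompt, draft_tokens):
    """Greedy speculative verification: accept the longest prefix of the
    draft that the target model would itself have produced, then append
    the target's own token at the first mismatch."""
    accepted = []
    ctx = list(prompt)
    for d in draft_tokens:
        t = target_next(ctx)
        if t != d:
            accepted.append(t)   # correction token from the target model
            return accepted
        accepted.append(d)
        ctx.append(d)
    return accepted

# Toy "target": continues a +2 sequence. Draft guesses three tokens ahead
# (MTP-3 style) but gets the last one wrong.
target_next = lambda ctx: ctx[-1] + 2
print(verify_draft(target_next, [1, 3, 5], [7, 9, 12]))  # [7, 9, 11]
```

Because the target verifies all draft tokens in one pass instead of generating them one by one, every accepted draft token is nearly free throughput.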
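The 3:1 hybrid attention layout in feature 3 amounts to a layer schedule plus a windowed causal mask. This sketch assumes a simple interleave and an arbitrary window size; the model's real schedule and window are not specified here:

```python
import numpy as np

def layer_schedule(n_layers, swa_per_full=3):
    """3:1 interleave: three sliding-window layers per full-attention layer."""
    return ["full" if (i + 1) % (swa_per_full + 1) == 0 else "swa"
            for i in range(n_layers)]

def causal_mask(seq_len, window=None):
    """Boolean attention mask: True where query i may attend key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                  # causal: no attending to the future
    if window is not None:
        mask &= j > i - window     # keep only the `window` most recent keys
    return mask

print(layer_schedule(8))           # ['swa', 'swa', 'swa', 'full', ...]
m = causal_mask(6, window=3)
print(m.sum(axis=1))               # each query sees at most 3 keys
```

SWA layers keep memory linear in the window rather than the full 256K context, while the periodic full-attention layers preserve long-range information flow.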
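The trajectory filtering in MIS-PO (feature 5) can be sketched with a plain Metropolis Independence Sampling acceptance rule over importance weights. MIS-PO's internals are not public; this is a generic MIS sketch under the assumption that the weight is an importance ratio between current and behavior policies:

```python
import random

def mis_filter(trajectories, weight_fn, seed=0):
    """Metropolis Independence Sampling over off-policy trajectories.

    A candidate is accepted with probability min(1, w_new / w_current),
    where w = pi(traj) / mu(traj) is its importance weight under the
    current policy pi vs. the behavior policy mu. Divergent (low-weight)
    trajectories are rarely kept, stabilizing the policy update.
    """
    rng = random.Random(seed)
    kept, current_w = [], None
    for traj in trajectories:
        w = weight_fn(traj)
        if current_w is None or w >= current_w or rng.random() < w / current_w:
            kept.append(traj)
            current_w = w
    return kept

# Toy weights: the last trajectory has diverged far from the current policy.
weights = {"a": 1.0, "b": 2.0, "c": 4.0, "divergent": 1e-12}
print(mis_filter(["a", "b", "c", "divergent"], weights.get))  # ['a', 'b', 'c']
```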

Problems Solved

  1. Pain Point: Computational inefficiency of monolithic LLMs in real-time agentic applications.

    • Solution: Sparse activation reduces inference cost by 6–18.9x vs. comparable frontier models (e.g., DeepSeek V3.2).
    • Target Audience: AI engineers deploying LLMs on consumer hardware (NVIDIA DGX Spark, Mac M4 Max).
  2. Pain Point: Fragile tool-use in agent workflows causing logic errors or context collapse.

    • Solution: Integrated Python runtime and toolchain orchestration (e.g., 51% success on Terminal-Bench 2.0).
    • Use Case: Automated stock trading with 80+ tool calls for data aggregation, visualization, and alerting.
  3. Pain Point: Inaccessible frontier AI for private, secure deployment.

    • Solution: Local execution via INT4/INT8 quantization (20 tok/s on edge devices) and 256K context support.
    • Target Audience: Enterprise developers requiring GDPR-compliant, on-premise agent systems.
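The INT4 quantization that enables local execution can be sketched as symmetric per-tensor rounding into the signed 4-bit range. Real GGUF quantization is block-wise and more elaborate; this minimal version only shows the core idea:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor INT4 quantization: map floats to [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 4)).astype(np.float32)
q, s = quantize_int4(w)
err = np.abs(w - dequantize(q, s)).max()
print(err < s)  # reconstruction error stays below one quantization step
```

Storing 4-bit codes plus one scale per tensor (or per block, in practice) is what shrinks a 196B-parameter checkpoint enough to fit consumer memory.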

Unique Advantages

  1. Differentiation vs. Competitors:

    • Outperforms models 2–5x larger (e.g., 81.0 avg. score vs. Gemini 3.0 Pro’s 80.7) with 11B active parameters.
    • 3.9x faster decoding than GLM-4.7 and 18.9x cheaper than Kimi K2.5 in long-context scenarios.
  2. Key Innovations:

    • MTP-3 + SWA Synergy: Parallel token verification enables sub-100ms response times for agent loops.
    • Edge-Cloud Orchestration: Step-GUI integration for mobile task execution (57% success on AndroidDaily Hard).
    • MIS-PO Framework: Eliminates RL collapse in 10k+ step reasoning tasks.

Frequently Asked Questions (FAQ)

  1. How does Step 3.5 Flash achieve high speed with 196B parameters?
    Its sparse MoE design activates only 11B parameters per token, while MTP-3 parallelizes token generation, enabling 350 tok/s throughput on NVIDIA Hopper GPUs.

  2. Can Step 3.5 Flash run locally on consumer hardware?
    Yes, via INT4-quantized GGUF weights optimized for llama.cpp, supporting 20 tok/s on NVIDIA DGX Spark and M4 Max Macs with 256K context.

  3. How does it compare to GPT-5.2 in agentic tasks?
    It scores 88.2 on Agent τ²-Bench vs. GPT-5.2’s 85.5 and achieves 65.3% on ResearchRubrics, rivaling Gemini DeepResearch (63.7%) at lower latency.

  4. What makes OpenClaw integration critical for agents?
    Native support for tool-chaining, Python execution, and multi-agent routing (e.g., Master/Search/Verify agents) enables complex workflows like automated repo documentation or data pipelines.

  5. Is Step 3.5 Flash suitable for real-time coding?
    Yes, it scores 86.4 on LiveCodeBench-V6 and 74.4% on SWE-bench Verified, handling WebGL rendering, Three.js pipelines, and financial data modeling via Claude Code integration.
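The tool-chaining described in FAQ 4 boils down to a dispatch loop: the model emits structured tool calls, the runtime executes them, and each result feeds the next step. The tool names, `"$prev"` convention, and JSON shape below are hypothetical illustrations, not OpenClaw's actual API:

```python
import json

# Hypothetical tool registry; names and argument schemas are illustrative.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
    "upper": lambda args: args["text"].upper(),
}

def run_agent(tool_calls):
    """Execute a model-emitted sequence of tool calls; a step may reference
    the previous step's result via the placeholder "$prev"."""
    prev, log = None, []
    for call in tool_calls:
        args = {k: (prev if v == "$prev" else v) for k, v in call["args"].items()}
        prev = TOOLS[call["tool"]](args)     # dispatch to the named tool
        log.append((call["tool"], prev))
    return prev, log

# A two-step chain the model might emit: compute, then format the result.
calls = json.loads('[{"tool": "add", "args": {"a": 2, "b": 3}},'
                   ' {"tool": "upper", "args": {"text": "sum=5"}}]')
result, log = run_agent(calls)
print(result)  # SUM=5
```

Production agent runtimes add schema validation, error recovery, and parallel calls on top of this loop, which is where the 80+ call workflows cited above become feasible.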
