Product Introduction
- Definition: Step 3.5 Flash is a 196-billion-parameter sparse Mixture-of-Experts (MoE) large language model (LLM) engineered for high-efficiency inference. It activates only 11 billion parameters per token during inference.
- Core Value Proposition: It delivers frontier reasoning capabilities and robust agentic performance while minimizing computational overhead, enabling real-time interaction for complex tasks like coding automation, multi-step research, and edge-cloud orchestration.
Main Features
Sparse Mixture-of-Experts (MoE) Architecture
- How it works: Uses expert routing to dynamically select specialized subnetworks per token, limiting active parameters to 11B of the total 196B.
- Technologies: Hybrid dense/sparse layer stacking for optimized memory use; gating mechanisms for expert selection.
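The gating mechanism described above can be sketched in a few lines. The expert count, top-k value, and single-matrix "experts" below are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Sketch of sparse MoE routing: only top_k experts run per token.

    x: (d,) token hidden state; expert_weights: list of (d, d) matrices
    standing in for full expert FFNs; gate_weights: (d, n_experts) router.
    """
    logits = x @ gate_weights                 # router score for each expert
    top = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax renormalized over the selected experts
    # Only the selected experts do any work; the rest stay idle,
    # which is how 196B total parameters yield 11B active per token.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.standard_normal(d)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate = rng.standard_normal((d, n_experts))
y = moe_layer(x, experts, gate, top_k=2)
```

The key property is that compute scales with `top_k`, not `n_experts`, while total capacity scales with `n_experts`.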
Multi-Token Prediction (MTP-3)
- How it works: Predicts 3 future tokens in parallel via dedicated output heads, enabling speculative decoding.
- Impact: Achieves 100–350 tokens/sec throughput, critical for latency-sensitive agentic workflows.
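Speculative decoding with MTP heads amounts to: draft several tokens cheaply in parallel, verify them against the main model, and keep the longest agreeing prefix. A toy sketch (the 3-token draft length matches MTP-3; the callables and toy "main model" are hypothetical):

```python
def speculative_step(draft_tokens, verify_token, context):
    """Accept draft tokens until the first disagreement with the main model.

    draft_tokens: tokens proposed in parallel by the MTP heads.
    verify_token: callable returning the main model's token for a context.
    """
    accepted = []
    for t in draft_tokens:
        if verify_token(context + accepted) == t:
            accepted.append(t)               # draft confirmed, zero extra decode steps
        else:
            accepted.append(verify_token(context + accepted))  # fall back to main model
            break
    return accepted

# Toy deterministic "main model" that always emits a fixed target sequence.
target = [1, 2, 9, 4]
def main_model(ctx):
    return target[len(ctx)]

drafts = [1, 2, 3]                           # the third draft token is wrong
out = speculative_step(drafts, main_model, [])
```

Because verification of all drafts happens in one forward pass, accepted drafts cost roughly one decode step instead of three, which is where the throughput gain comes from.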
Hybrid Long-Context Attention
- How it works: Combines Sliding Window Attention (SWA) and Full Attention layers at a 3:1 ratio.
- Technologies: 256K context window with Head-wise Gated Attention for stability; SWA layers augmented to 96 query heads for enhanced representation.
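A minimal sketch of the 3:1 interleaving, using a toy window size (the real model's window and 256K context are far larger, and the exact layer schedule is an assumption):

```python
import numpy as np

WINDOW = 4  # toy sliding window; illustrative only

def attention_mask(seq_len, layer_idx, swa_ratio=3):
    """Per-layer causal mask: swa_ratio SWA layers per full-attention layer."""
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if layer_idx % (swa_ratio + 1) == swa_ratio:
        return causal                          # every 4th layer: full attention
    # SWA layers: each query attends only to the most recent WINDOW keys,
    # so memory and compute stay linear in sequence length.
    rows, cols = np.indices((seq_len, seq_len))
    return causal & (rows - cols < WINDOW)

m_swa = attention_mask(8, layer_idx=0)         # windowed layer
m_full = attention_mask(8, layer_idx=3)        # full-attention layer
```

The occasional full-attention layers preserve long-range information flow, while the SWA majority keeps the KV cache small over the long context.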
Native Agentic Integration (OpenClaw)
- How it works: Embeds tool-use, code execution, and multi-agent orchestration via seamless OpenClaw compatibility.
- Capabilities: Supports Python runtime, GUI automation (Step-GUI), and cloud-edge task delegation.
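The OpenClaw message format is not documented here, so the dictionary shapes and `TOOLS` registry below are assumptions; the sketch only shows the generic tool-use loop that such an integration implies:

```python
# Hypothetical tool registry; a real runtime would expose sandboxed tools.
TOOLS = {
    "python": lambda code: str(eval(code)),    # toy sandbox: expressions only
    "search": lambda query: f"results for {query!r}",
}

def run_agent(model_step, user_msg, max_turns=8):
    """Generic tool-use loop: the model either answers or requests a tool."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        msg = model_step(history)
        if msg.get("tool") is None:
            return msg["content"]              # final answer, loop ends
        result = TOOLS[msg["tool"]](msg["args"])   # execute the requested tool
        history.append({"role": "tool", "content": result})
    raise RuntimeError("agent exceeded max_turns")

# Scripted stand-in for the model: one tool call, then a final answer.
def scripted(history):
    if len(history) == 1:
        return {"tool": "python", "args": "2 + 2"}
    return {"tool": None, "content": history[-1]["content"]}

answer = run_agent(scripted, "what is 2+2?")
```

Multi-agent orchestration (e.g., Master/Search/Verify roles) is this same loop with several `model_step` functions routing work to one another.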
Scalable Reinforcement Learning (MIS-PO)
- How it works: Uses Metropolis Independence Sampling for stable off-policy RL, filtering divergent trajectories.
- Impact: Enables continuous self-improvement in math (AIME: 97.3 → 99.8) and coding (SWE-bench: 74.4%).
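Metropolis Independence Sampling accepts each off-policy candidate with probability min(1, w_cand / w_current), where w is the importance weight of a trajectory under the target policy; divergent trajectories get tiny weights and are rarely accepted. A toy sketch under that reading (the log-probability functions are stand-ins, not the actual MIS-PO objective):

```python
import math
import random

def mis_filter(trajectories, target_logp, behavior_logp, seed=0):
    """Metropolis independence sampling over off-policy trajectories."""
    rng = random.Random(seed)
    weight = lambda t: math.exp(target_logp(t) - behavior_logp(t))
    current = trajectories[0]
    kept = [current]
    for cand in trajectories[1:]:
        # Accept with prob min(1, w_cand / w_current); divergent candidates
        # (tiny importance weight) almost never replace the current sample.
        if rng.random() < min(1.0, weight(cand) / weight(current)):
            current = cand
        kept.append(current)
    return kept

# Toy setup: target prefers trajectories near 0; behavior policy is flat.
target_logp = lambda t: -abs(t)
behavior_logp = lambda t: 0.0
trajectories = [0.0, 5.0, 0.1]   # 5.0 stands in for a divergent trajectory
kept = mis_filter(trajectories, target_logp, behavior_logp, seed=0)
```

Unlike plain importance weighting, the accept/reject step bounds the influence of any single trajectory, which is what keeps long off-policy RL runs from collapsing.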
Problems Solved
Pain Point: Computational inefficiency of monolithic LLMs in real-time agentic applications.
- Solution: Sparse activation reduces inference cost by 6–18.9x relative to comparable frontier models (e.g., DeepSeek V3.2).
- Target Audience: AI engineers deploying LLMs on consumer hardware (NVIDIA DGX Spark, Mac M4 Max).
Pain Point: Fragile tool-use in agent workflows causing logic errors or context collapse.
- Solution: Integrated Python runtime and toolchain orchestration (e.g., 51% success on Terminal-Bench 2.0).
- Use Case: Automated stock trading with 80+ tool calls for data aggregation, visualization, and alerting.
Pain Point: Inaccessible frontier AI for private, secure deployment.
- Solution: Local execution via INT4/INT8 quantization (20 tok/s on edge devices) and 256K context support.
- Target Audience: Enterprise developers requiring GDPR-compliant, on-premise agent systems.
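The kind of INT4/INT8 quantization used for local deployment can be illustrated with a symmetric per-tensor scheme. This is a simplification for intuition; production GGUF quantization is per-block and more elaborate:

```python
import numpy as np

def quantize(w, bits=4):
    """Symmetric quantization: w ~= scale * q, with q on an integer grid."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for INT4, 127 for INT8
    scale = np.abs(w).max() / qmax             # one scale for the whole tensor
    # int8 is just the storage container; values stay in the INT4 range.
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize(w, bits=4)
err = np.abs(w - dequantize(q, s)).max()       # bounded by half a grid step
```

Storing 4-bit integers plus one scale instead of 16-bit floats is roughly a 4x memory reduction, which is what brings a 196B-total model within reach of single-box hardware.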
Unique Advantages
Differentiation vs. Competitors:
- Outperforms models 2–5x larger (e.g., 81.0 avg. score vs. Gemini 3.0 Pro’s 80.7) with 11B active parameters.
- 3.9x faster decoding than GLM-4.7 and 18.9x cheaper than Kimi K2.5 in long-context scenarios.
Key Innovations:
- MTP-3 + SWA Synergy: Parallel token verification enables sub-100ms response times for agent loops.
- Edge-Cloud Orchestration: Step-GUI integration for mobile task execution (57% success on AndroidDaily Hard).
- MIS-PO Framework: Eliminates RL collapse in 10k+ step reasoning tasks.
Frequently Asked Questions (FAQ)
How does Step 3.5 Flash achieve high speed with 196B parameters?
Its sparse MoE design activates only 11B parameters per token, while MTP-3 parallelizes token generation, enabling up to 350 tok/s throughput on NVIDIA Hopper GPUs.
Can Step 3.5 Flash run locally on consumer hardware?
Yes, via INT4-quantized GGUF weights optimized for llama.cpp, supporting 20 tok/s on NVIDIA DGX Spark and M4 Max Macs with the full 256K context.
How does it compare to GPT-5.2 in agentic tasks?
It scores 88.2 on Agent τ²-Bench vs. GPT-5.2's 85.5 and achieves 65.3% on ResearchRubrics, rivaling Gemini DeepResearch (63.7%) at lower latency.
What makes OpenClaw integration critical for agents?
Native support for tool-chaining, Python execution, and multi-agent routing (e.g., Master/Search/Verify agents) enables complex workflows such as automated repo documentation or data pipelines.
Is Step 3.5 Flash suitable for real-time coding?
Yes, it scores 86.4 on LiveCodeBench-V6 and 74.4% on SWE-bench Verified, handling WebGL rendering, Three.js pipelines, and financial data modeling via Claude Code integration.