
Step 3.7 Flash Review

Step 3.7 Flash is a high-efficiency, multimodal Mixture-of-Experts (MoE) vision-language model designed for real-world agentic workflows, developed by StepFun.

Shipped May 31, 2026 · AI · Freemium
1. Released on May 28, 2026, Step 3.7 Flash is a 198-billion-parameter sparse MoE model.
2. It features a 256k context window and activates approximately 11 billion parameters per token during inference.
3. The model achieved a second-place finish on SWE-Bench PRO with a score of 56.3.
4. Step 3.7 Flash leads the ClawEval-1.1 benchmark with a score of 67.1 for workflow integrity and tool orchestration.

Stork Quadrant

Dead Man Walking · 0/100

An LLM can do most of what this tool's UI promises. No moat, no agent presence.

This is a Chinese inference-speed model competing in the most crowded lane in AI. It has no proprietary data, no regulatory moat, no network effects, and no ownership of a high-trust workflow. Speed and price are the pitch, and both erode within months as every major lab ships faster, cheaper models. This will get commoditized.

Claude Sonnet 4.6, scored 2026-05-31

Defensibility · 0/100

  • Physical-world coupling
  • Regulatory moat
  • Network liquidity
  • Proprietary refreshing data
  • High-trust catastrophic workflows
  • Multi-party coordination
  • Brand / community / taste

An LLM alone could replace

  • Generate text responses to prompts — any frontier LLM does this
  • Analyze images and describe or reason about visual content — GPT-4o, Gemini Flash do this today
  • Execute agentic tasks like browsing or form-filling — Operator, Claude, Gemini already compete here
  • Answer questions quickly at low latency — commodity inference optimization, not a moat

Agent-Readiness · 0/100

  • Verified MCP
  • Listed on agent surfaces
  • Usage-based pricing
  • Headless agent auth
  • Public OpenAPI
  • Active changelog
  • llms.txt

How to defend

Pick a vertical where Chinese-language enterprise compliance or specific regional data access matters, and own that workflow end-to-end with liability attached. Otherwise, become an API layer that agents call rather than a product users visit.

  • Ship an MCP server and list it on Stork — biggest single point gain (+25).
  • Get listed in the Anthropic MCP registry, Cursor, or Claude Desktop (+20).
  • Add a usage-based or per-call tier; per-seat-only pricing dies when agents replace seats (+15).
  • Expose API-key auth with a self-serve sandbox tier; remove sales-call gates (+15).
  • Publish an OpenAPI spec at /openapi.json or /.well-known/openapi (+10).
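As a rough illustration of how the listed point gains add up, the sketch below sums them against the current score of 0/100. It assumes the gains are simply additive and capped at 100, which the page does not state. The dictionary keys are shortened paraphrases of the recommendations, and `projected_score` is a hypothetical helper name.

```python
# Point gains as listed in the recommendations above.
# Assumption: gains are additive and the score is capped at 100.
RECOMMENDED_GAINS = {
    "Ship an MCP server and list it on Stork": 25,
    "Get listed on agent surfaces": 20,
    "Add usage-based or per-call pricing": 15,
    "Expose API-key auth with a self-serve sandbox": 15,
    "Publish an OpenAPI spec": 10,
}


def projected_score(current: int, gains: dict[str, int], cap: int = 100) -> int:
    """Return the score after applying every gain, never exceeding the cap."""
    return min(cap, current + sum(gains.values()))


if __name__ == "__main__":
    print(projected_score(0, RECOMMENDED_GAINS))
```

Under that additive assumption, completing all five steps would move the score from 0 to 85.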

Step 3.7 Flash at a Glance

Best For
product-hunt
Pricing
freemium
Key Features
Released on May 28, 2026, Step 3.7 Flash is a 198-billion-parameter sparse MoE model. · It features a 256k context window and activates approximately 11 billion parameters per token during inference. · The model achieved a second-place finish on SWE-Bench PRO with a score of 56.3.
Alternatives
Google Gemini (as an agent), AskUI Vision Agent, Skygen, OpenAI Operator

About Step 3.7 Flash

Founded
2023


What is Step 3.7 Flash?

Step 3.7 Flash is a high-efficiency, multimodal Mixture-of-Experts (MoE) vision-language model developed by StepFun for AI developers and enterprise users who build and deploy AI agents. Released on May 28, 2026, it is a 198-billion-parameter sparse MoE model that activates approximately 11 billion parameters per token during inference, which keeps per-token compute well below that of a dense model of the same size. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image and video understanding. The model supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) to balance speed, cost, and reasoning depth. It targets agentic workflows that need multimodal perception, search, and multi-step reasoning at production scale.
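The parameter figures quoted above can be sanity-checked with a few lines of arithmetic. This sketch only uses the numbers from the overview; the constant and function names are illustrative, and the 11B active figure is approximate in the source.

```python
# Figures quoted in the overview, in billions of parameters.
LANGUAGE_BACKBONE_B = 196.0
VISION_ENCODER_B = 1.8
TOTAL_PARAMS_B = 198.0   # headline figure
ACTIVE_PARAMS_B = 11.0   # approximate, per token


def component_total_b() -> float:
    """Sum of the language backbone and the vision encoder."""
    return LANGUAGE_BACKBONE_B + VISION_ENCODER_B


def active_fraction() -> float:
    """Share of the headline parameter count that is active per token."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


if __name__ == "__main__":
    print(f"backbone + encoder = {component_total_b():.1f}B")
    print(f"active share per token = {active_fraction():.1%}")
```

The two components sum to 197.8B, which rounds to the headline 198B, and roughly 5.6% of the parameters are active for any given token.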


Quick Facts

Attribute | Value
Developer | StepFun
Business Model | Freemium, Usage-based
Pricing | Freemium, Usage-based (Step 3.7 Flash input: $0.00020 per 1k tokens, output: $0.00115 per 1k tokens)
Platforms | API, Web (StepFun Open Platform)
API Available | Yes
Integrations | NVIDIA NIM, SGLang, NVIDIA TensorRT-LLM, vLLM, Hugging Face, OpenRouter, ModelScope
Founded | 2023
HQ | Shanghai, China


Key Features of Step 3.7 Flash

Step 3.7 Flash's features target high-performance agentic applications and build on its multimodal Mixture-of-Experts architecture. Together they cover perception, reasoning, and action across different data types and operating environments.

  • 198-billion-parameter sparse Mixture-of-Experts (MoE) model, activating approximately 11 billion parameters per token.
  • Native image and video understanding via an integrated 1.8B-parameter vision encoder.
  • Supports a 256k context window for extensive information processing.
  • Offers three selectable reasoning levels (low, medium, high) to optimize for speed, cost, or cognitive depth.
  • Reliable interaction with external APIs, browsers, terminals, and Office tools for complex task execution.
  • Open-source availability under the Apache 2.0 License on platforms like Hugging Face and ModelScope.
  • Full inference stack support from NVIDIA, including availability as an NVIDIA NIM inference microservice.
  • Advisor Mode functionality, allowing a smaller executor model to escalate complex tasks to a larger advisor model for cost efficiency.
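The Advisor Mode idea in the last item, a small executor that escalates hard tasks to a larger advisor, can be sketched as a simple two-tier cascade. The page does not describe how StepFun decides when to escalate, so the self-reported confidence score, the 0.7 threshold, and every name below (`Answer`, `run_with_advisor`, `toy_executor`, `toy_advisor`) are assumptions made purely for illustration. The two "models" are local stubs, not calls to any real API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0; a hypothetical self-reported score
    model: str


# A "model" here is any callable from prompt to Answer (a stand-in, not a real client).
Model = Callable[[str], Answer]


def run_with_advisor(prompt: str, executor: Model, advisor: Model,
                     threshold: float = 0.7) -> Answer:
    """Try the cheap executor first; escalate to the advisor only when it is unsure."""
    draft = executor(prompt)
    if draft.confidence >= threshold:
        return draft
    return advisor(prompt)


def toy_executor(prompt: str) -> Answer:
    # Stub: pretends to be confident only on short prompts.
    confidence = 0.9 if len(prompt) < 40 else 0.3
    return Answer(f"executor: {prompt}", confidence, "executor")


def toy_advisor(prompt: str) -> Answer:
    # Stub: the larger model always answers with high confidence.
    return Answer(f"advisor: {prompt}", 0.95, "advisor")


if __name__ == "__main__":
    easy = run_with_advisor("rename a variable", toy_executor, toy_advisor)
    hard = run_with_advisor(
        "refactor the payment module across twelve files and keep the API stable",
        toy_executor, toy_advisor)
    print(easy.model, hard.model)
```

The cost saving in a scheme like this comes from the advisor being called only for the fraction of requests that fall below the threshold; how that fraction behaves in practice depends on the escalation signal actually used.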


Who Should Use Step 3.7 Flash?

Step 3.7 Flash is engineered for professionals and organizations requiring advanced multimodal AI capabilities for agentic workflows, particularly those focused on automation, complex data interpretation, and application development.

  • **AI Developers:** For building and deploying next-generation AI applications, including multimodal agents with reliable tool use and orchestration.
  • **Enterprise Users:** For parsing massive financial reports, running multi-step search loops with cross-source verification, and operating concurrent coding agents in high-throughput pipelines.
  • **Engineers/Researchers:** For agentic coding, independently tracing multi-file repositories, identifying bugs from issue reports, and generating functional code patches.
  • **Content Creators:** For applications requiring text-to-speech, voice cloning, creative writing, and advanced language learning functionalities.
  • **Individuals Seeking Personal AI Assistance:** For knowledge acquisition, information finding, and general multimodal interaction.


Step 3.7 Flash Pricing & Plans

Step 3.7 Flash operates on a freemium and usage-based pricing model, allowing users to access a free tier before incurring costs based on token consumption. Specific rate limits are applied to concurrency, requests per minute (RPM), and tokens per minute (TPM), with a request timeout of 10 minutes. Users requiring higher limits can contact platform@stepfun.com.

  • **Freemium:** A free tier is available for initial access and limited usage.
  • **Step 1 (32K):** Input: $0.00205 per 1k tokens, Output: $0.00959 per 1k tokens.
  • **Step 3.5 Flash:** Input: $0.000096 per 1k tokens, Output: $0.000288 per 1k tokens.
  • **Step 3.5 Flash 2603:** Input: $0.000100 per 1k tokens, Output: $0.000300 per 1k tokens.
  • **Step 3.7 Flash:** Input: $0.00020 per 1k tokens, Output: $0.00115 per 1k tokens.
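To turn the per-1k-token rates above into a per-request estimate, a small calculator like the following may help. It copies the listed prices verbatim and ignores the free tier and any rate limits. `PRICES_PER_1K` and `estimate_cost` are illustrative names, not part of any StepFun SDK.

```python
from decimal import Decimal

# USD per 1k tokens as (input rate, output rate), copied from the price list above.
PRICES_PER_1K = {
    "Step 1 (32K)": (Decimal("0.00205"), Decimal("0.00959")),
    "Step 3.5 Flash": (Decimal("0.000096"), Decimal("0.000288")),
    "Step 3.5 Flash 2603": (Decimal("0.000100"), Decimal("0.000300")),
    "Step 3.7 Flash": (Decimal("0.00020"), Decimal("0.00115")),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated USD cost of one request, ignoring any free-tier allowance."""
    in_rate, out_rate = PRICES_PER_1K[model]
    return (in_rate * input_tokens + out_rate * output_tokens) / 1000


if __name__ == "__main__":
    # Example: a 200k-token prompt with a 4k-token answer on Step 3.7 Flash.
    cost = estimate_cost("Step 3.7 Flash", 200_000, 4_000)
    print(f"${cost:.4f}")
```

`Decimal` is used instead of floats so that very small per-token rates do not pick up binary rounding noise. At the listed rates, Step 3.7 Flash works out to $0.20 per million input tokens and $1.15 per million output tokens.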


Step 3.7 Flash vs Competitors

Step 3.7 Flash is positioned as a leading multimodal agentic model, competing in the 'Flash' model market against established and emerging AI solutions. Its strengths lie in native multimodal perception, robust tool orchestration, and competitive performance in coding and visual intelligence benchmarks.

1. Google Gemini (as an agent)

Gemini is a multimodal AI model capable of understanding and operating across various data types, including images, video, and text, enabling sophisticated reasoning and direct UI control.

Similar to Step 3.7 Flash, Gemini offers real-time perception and action capabilities, particularly strong in multimodal understanding and complex decision-making. Its freemium access is typically via API for developers, allowing for the creation of custom agents.

2. AskUI Vision Agent

AskUI Vision Agent specializes in automating desktop and mobile workflows by visually understanding and interacting with graphical user interfaces at the operating system level.

This is a direct competitor focusing on the 'see and act' aspect for digital interfaces, translating visual data into low-level commands. Its specialization in GUI automation provides a focused alternative to a general 'flash-speed' agent model.

3. Skygen

Skygen is an AI desktop automation agent that provides real-time visibility and runs tasks across various applications, websites, and cloud computers.

Skygen aligns closely with Step 3.7 Flash's description of a 'flash-speed agent model that can see and act' within digital environments, emphasizing real-time operation and broad application interaction. It offers a freemium model, similar to the described pricing of Step 3.7 Flash.

4. OpenAI Operator

OpenAI Operator is designed to execute multi-step actions directly within a web browser, enabling autonomous completion of complex web tasks.

While its pricing is listed as a paid 'Pro' tier rather than freemium, OpenAI Operator offers a direct functional comparison by focusing on agents that 'see' (perceive web interfaces) and 'act' (perform tasks) at speed within a browser environment.

5. Agno AI Agents

Agno AI Agents is a framework built for performance, enabling the creation of lightning-fast, production-ready AI agents with minimal startup times and a tiny footprint.

Agno directly addresses the 'flash-speed' aspect, offering a framework to build agents that are exceptionally fast and efficient. While its 'see' capability is more about perceiving digital states for action rather than explicit visual recognition, its emphasis on rapid, production-grade agent deployment makes it a strong competitor for high-performance autonomous tasks.

Frequently Asked Questions

What is Step 3.7 Flash?

Step 3.7 Flash is a high-efficiency, multimodal Mixture-of-Experts (MoE) vision-language model developed by StepFun that enables AI Developers and Enterprise users to build and deploy advanced AI agents. It provides advanced perception, search, and reasoning capabilities at production scale for agentic workflows.

Is Step 3.7 Flash free?

Step 3.7 Flash operates on a freemium model, offering a free tier. For usage beyond the free tier, it is usage-based, with input tokens priced at $0.00020 per 1k tokens and output tokens at $0.00115 per 1k tokens.

What are the main features of Step 3.7 Flash?

Key features of Step 3.7 Flash include its 198-billion-parameter sparse MoE architecture, native image and video understanding via a 1.8B-parameter vision encoder, a 256k context window, three selectable reasoning levels, and reliable interaction with external APIs and tools. It also supports NVIDIA inference stacks and offers an Advisor Mode for cost-efficient agentic operations.

Who should use Step 3.7 Flash?

Step 3.7 Flash is primarily intended for AI Developers, Enterprise Users, Engineers/Researchers, and Content Creators who require advanced multimodal AI agents for tasks such as building AI applications, automating complex workflows, agentic coding, and processing diverse data types.

How does Step 3.7 Flash compare to alternatives?

Step 3.7 Flash distinguishes itself with native multimodal support (images and video), outperforming competitors like DeepSeek V4 Flash in this aspect. It demonstrates strong coding performance, scoring 56.3 on SWE-Bench PRO, and leads the ClawEval-1.1 benchmark for tool orchestration. Its Advisor Mode offers a cost-effective alternative to models like Claude Opus 4.6 for similar performance levels.
