TL;DR / Key Takeaways
- Xiaomi just launched an AI model that reportedly generates over 1,000 tokens per second on commodity GPUs, far above the 50-60 tokens per second typical of frontier models such as GPT-4.
- This breakthrough in 'model-system codesign' could fundamentally change real-time AI applications.
The Thousand-Token Barrier Is Broken
Xiaomi, in collaboration with systems partner TileRT, has unveiled the **MiMo V2.5 Pro UltraSpeed** model, a 1-trillion-parameter Mixture-of-Experts (MoE) AI. This new contender shatters previous benchmarks for large language model inference speed. Its headline claim: generating text at over 1,000 tokens per second, with some demonstrations peaking near 1,200 TPS.
To put this into perspective, current frontier models like GPT-4 or Claude Opus typically deliver around 50-60 tokens per second, which often produces noticeable lag on complex reasoning tasks. At 1,000 tokens per second, MiMo V2.5 Pro UltraSpeed is roughly 17 to 20 times faster, an order-of-magnitude leap that redefines the practical limits of real-time AI interaction.
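The speedup is simple arithmetic on the figures quoted above. A minimal Python sketch, using only the 1,000 and 1,200 TPS claims and the 50-60 TPS baseline from this article (the function name is illustrative):

```python
def speedup_range(fast_tps, slow_low, slow_high):
    """Return (min, max) speedup of fast_tps over a baseline TPS range."""
    # The smallest speedup is measured against the fastest baseline.
    return fast_tps / slow_high, fast_tps / slow_low


low, high = speedup_range(1000, 50, 60)
print(f"{low:.1f}x to {high:.1f}x")            # 16.7x to 20.0x

peak_low, peak_high = speedup_range(1200, 50, 60)
print(f"{peak_low:.1f}x to {peak_high:.1f}x")  # 20.0x to 24.0x
```

These ratios compare raw generation rates only. They say nothing about output quality or about how the baseline models were measured.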
What makes this achievement particularly disruptive is its modest hardware footprint. Rather than relying on custom silicon or a multi-server cluster, this trillion-parameter model runs on a single server equipped with just eight commodity GPUs, a result Xiaomi attributes to model-system codesign.
Inside the Three-Layer Speed Stack
Xiaomi’s MiMo V2.5 Pro UltraSpeed achieves its 1,000+ tokens per second through an "extreme model-system codesign" that attacks latency from three coordinated angles. The first layer tackles memory bandwidth, a critical bottleneck for a 1-trillion-parameter Mixture-of-Experts model. Xiaomi applied MXFP4 quantization, compressing the MoE expert parameters to 4 bits. This significantly reduces memory pressure, while Quantization-Aware Training (QAT) keeps accuracy nearly identical by maintaining higher precision in the core routing layers.
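To make the 4-bit idea concrete, here is a minimal NumPy sketch of MXFP4-style block quantization as described in the OCP Microscaling format: 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 float. This illustrates the number format only, not Xiaomi's quantization code. The function names are hypothetical, and the footprint figures assume every parameter is stored in the same format, which the article does not claim.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign is a separate bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # elements that share one 8-bit power-of-two scale


def mxfp4_roundtrip(x):
    """Quantize a 1-D array to MXFP4-style blocks, then dequantize it."""
    x = np.asarray(x, dtype=np.float64)
    assert x.size % BLOCK == 0, "length must be a multiple of the block size"
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    # Align the block maximum with FP4's largest exponent (2**2).
    scale = 2.0 ** (np.floor(np.log2(safe)) - 2)
    scaled = np.clip(np.abs(blocks) / scale, 0.0, 6.0)  # saturate at 6
    # Nearest grid point; ties go to the lower magnitude in this sketch.
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(blocks) * FP4_GRID[idx] * scale).reshape(x.shape)


def bits_per_weight():
    """4 bits per element plus one 8-bit scale amortized over the block."""
    return 4 + 8 / BLOCK


def weight_gigabytes(n_params, bits):
    """Storage for n_params weights at the given bits per weight, in GB."""
    return n_params * bits / 8 / 1e9


print(bits_per_weight())             # 4.25
print(weight_gigabytes(1e12, 16))    # 2000.0
print(weight_gigabytes(1e12, 4.25))  # 531.25
```

Under these simplifying assumptions, a trillion weights shrink from about 2 TB at 16 bits to roughly 0.53 TB. The real footprint would differ, because the article says routing layers stay at higher precision and does not give the split between expert and non-expert parameters.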
Second, the model changes how tokens are predicted with DFlash speculative decoding. Unlike standard speculative methods, whose draft model guesses tokens one at a time, DFlash predicts an entire block of hidden tokens simultaneously in a single parallel forward pass. The main model then verifies the block, which lets generation take "massive eight-token leaps forward." For coding tasks, the main model accepts an average of 6.3 out of every eight tokens DFlash guesses, accelerating output dramatically.
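The accept-or-reject step can be sketched with generic greedy speculative decoding. This is not DFlash itself, whose internals the article does not describe. It is a toy illustration in which `verify_block` and `toy_target` are hypothetical names, and the loop stands in for what a real system would compute in one batched forward pass of the main model.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def verify_block(context: Sequence[Token],
                 draft: Sequence[Token],
                 target_next: NextToken) -> List[Token]:
    """Keep the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the bad guess
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


def toy_target(ctx: Sequence[Token]) -> Token:
    """Stand-in 'main model': always predicts the previous token plus one."""
    return (ctx[-1] + 1) % 100


# The draft block is right for three tokens, then wrong at the fourth.
print(verify_block([0], [1, 2, 3, 9, 5, 6, 7, 8], toy_target))  # [1, 2, 3, 4]
```

The output always matches what the main model alone would have produced under greedy decoding, so the draft affects speed rather than content. If the reported 6.3-of-8 acceptance rate holds, each verification pass yields about six tokens instead of one.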
Finally, the third layer eliminates microsecond-level pauses inherent in GPU execution. TileRT, Xiaomi’s systems partner, developed a persistent GPU kernel runtime that remains resident on the GPU instead of being relaunched for each operation. Using warp specialization, it assigns fixed roles to different groups of GPU threads (warps), so data movement, computation, and communication proceed simultaneously. The goal is an execution pipeline that never stalls between steps.
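A back-of-the-envelope model shows why microsecond pauses matter at this speed. The Python sketch below converts a token rate into a per-token time budget and estimates how much of it repeated kernel launches would consume. The launch count and per-launch overhead are illustrative assumptions, not measurements from Xiaomi or TileRT, and the function names are hypothetical.

```python
def per_token_budget_us(tokens_per_second):
    """Time available to produce one token, in microseconds."""
    return 1_000_000 / tokens_per_second


def launch_overhead_share(tokens_per_second, launches_per_token, overhead_us):
    """Fraction of the per-token budget spent on kernel launch overhead."""
    budget = per_token_budget_us(tokens_per_second)
    return launches_per_token * overhead_us / budget


print(per_token_budget_us(1000))              # 1000.0 microseconds per token
# Assumed: 100 separate launches per token at 5 microseconds each.
print(launch_overhead_share(1000, 100, 5.0))  # 0.5
```

Under these assumed numbers, launch overhead alone would eat half the budget at 1,000 tokens per second. A kernel that stays resident on the GPU avoids paying that cost repeatedly.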
Real-World Tests: Blazing Speed, Brittle Code
Xiaomi's MiMo V2.5 Pro UltraSpeed demonstrates striking raw throughput in controlled tests. On one hard LeetCode challenge, the model peaked at 3,451 tokens per second, well above the roughly 1,200 TPS peaks cited in other demonstrations and far beyond what has been typical for a 1-trillion-parameter model. In another test, it built a functional Three.js game in under a minute, showing that it can turn prompts into working applications very quickly.
Yet, this blazing speed often comes with significant caveats. When tackling more complex, multi-step tasks, the MiMo V2.5 Pro UltraSpeed frequently exhibited critical failures. Attempts to generate a comprehensive, Khan Academy-style math explainer webpage, for instance, led to frozen outputs and completely dropped context, halting generation entirely after only a couple of minutes. Even when simplified, the resulting code often featured broken functionality, with only initial sections working reliably while later components remained non-functional or empty.
The MiMo V2.5 Pro UltraSpeed clearly prioritizes raw generation speed, representing a unique engineering feat in token throughput. While its performance on narrow, high-speed coding tasks is unparalleled, its overall capability and reliability do not yet rival the nuanced understanding or consistent output of frontier models like Claude Opus or GPT-4. This trade-off highlights a divergent path in AI development, focusing on velocity over sustained, complex reasoning. For those interested in the underlying architecture and its performance, further details are available at the Xiaomi MiMo Home.
Why 'Model-System Codesign' Changes the Game
At its core, MiMo V2.5 Pro UltraSpeed's pace stems from extreme model-system codesign. This philosophy involves optimizing the model's architecture and the underlying hardware runtime together, so that neither leaves performance on the table. It’s how Xiaomi got a 1-trillion-parameter Mixture-of-Experts model to generate text at roughly one millisecond per token on standard hardware.
Such an integrated approach challenges the market for expensive, specialized AI accelerators. Instead of custom silicon, Xiaomi and TileRT demonstrated 1,000+ tokens/second inference on a single standard server equipped with eight commodity GPUs. This extracts more from existing hardware and could make very fast large-model inference available at a fraction of the cost.
The resulting millisecond latency unlocks a new class of applications previously confined to theoretical discussions. These include:
- Real-time trading algorithms that react to market shifts instantly
- Autonomous coding agents generating production-ready code within seconds
- Instant fraud detection systems operating at transaction speed, preventing losses before they occur
This paradigm shift suggests that future AI breakthroughs may not exclusively rely on ever-larger, more specialized chips, but rather on smarter, more efficient integration across the entire system stack.
Frequently Asked Questions
What is Xiaomi MiMo V2.5 Pro UltraSpeed?
It is a 1-trillion-parameter Mixture-of-Experts AI model developed by Xiaomi and TileRT, capable of generating text at over 1,000 tokens per second on standard, commodity hardware.
How does the MiMo UltraSpeed model achieve such high speeds?
It uses a three-part strategy called 'extreme model-system codesign': MXFP4 quantization to reduce memory usage, DFlash speculative decoding to predict token blocks in parallel, and a TileRT persistent GPU kernel runtime to eliminate microsecond-level pauses in GPU execution.
What hardware is required to run the MiMo UltraSpeed model?
The reported speeds were achieved on a single standard server equipped with eight commodity GPUs, not specialized or custom-built AI hardware.
Is the MiMo UltraSpeed model as capable as models like GPT-4 or Claude Opus?
While exceptionally fast, tests show it currently has limitations. It can produce broken or incomplete outputs on complex tasks, indicating a trade-off between raw speed and the reasoning capabilities of leading frontier models.
