
The Scaling Lie: AI's Real Growth Engine

Everyone thinks bigger models mean smarter AI, but the data shows that's a lie. Discover the hidden engine driving AI capabilities to a vertical takeoff, leaving old scaling laws in the dust.

18 min read · Stork.AI

AI's Two Truths and a Lie

Scaling is supposedly over. Prominent researchers argue that large language models are hitting a wall, that deep learning has exhausted its tricks, that we’ve already scraped “one internet” and there’s nothing left to feed the beast. Gary Marcus called the shot in 2022: deep learning is “hitting a wall.” Ilya Sutskever warned about peak data and the hard ceiling on simply shoveling more text into transformers.

Benchmarks tell a different story. Objective measurements of what AI systems actually do are not slowing; they’re bending upward. METR’s data on autonomous agents shows the length of tasks generalist systems can complete has doubled every 7 months for six years, then recently sped up to every 4 months—a log-quadratic curve that looks less like a plateau and more like a rocket.

This is the scaling paradox. Two statements are both true and apparently incompatible:
- Returns from vanilla scaling of transformers are diminishing
- Real-world capabilities and utility are accelerating

From GPT-2 to GPT-3 to GPT-4, the story looked straightforward: scale is all you need. Models grew roughly 10x–100x in parameters and compute, and capabilities climbed in lockstep. Parameter count and FLOPs became the scoreboard; progress meant bigger models, bigger clusters, bigger training runs.

Then the scoreboard broke. Despite diminishing returns from raw parameter and data scaling, systems started blowing through “grand challenge” benchmarks. ARC-AGI took four years to crawl from 0% to 5% accuracy, then went from 5% to near saturation in a matter of months, forcing a rapid rollout from ARC-AGI 1 to 2 and now 3. Benchmarks that were supposed to last a decade now expire in a product cycle.

Resolving that paradox is not academic; it points directly at where AGI is actually coming from. If scale alone is no longer the main driver, the real action moves to everything wrapped around the model: test-time compute, tool use, agent scaffolding, post-training tricks. The future of AGI looks less like a single giant brain and more like an accelerating stack of interacting systems—and that future is arriving faster than either camp expected.

The Wall of Diminishing Returns

Illustration: The Wall of Diminishing Returns

Skeptics argue the scaling party is already over. Gary Marcus has spent years insisting that deep learning has “hit a wall,” and by 2022 he was calling current methods a dead end even as GPT-4 loomed. Ilya Sutskever, hardly a doomer, echoed a narrower but sharper concern: “peak data”. We have one internet, he warned, and large language models have already chewed through most of it.

Early on, the story sounded much simpler: “scale is all you need.” GPT-2 to GPT-3 brought roughly a 100x jump in parameters and compute; GPT-3 to GPT-4 repeated the trick at a similar order of magnitude. Performance curves looked almost smooth: more tokens, more FLOPs, better benchmarks. Parameter count and training compute became the scoreboard for AI progress.

That scoreboard now looks misleading. Empirical scaling laws show clear diminishing returns when you keep the vanilla transformer architecture fixed and just crank up parameters and data. Each extra 10x in compute buys a smaller accuracy gain on standard language modeling and reasoning benchmarks. At internet scale, those curves start bending uncomfortably flat.

Sutskever’s “one internet” line captures the hard limit. Web-scale corpora top out in the low trillions of usable tokens, and frontier models already train on hundreds of billions to low trillions of them. There is no second internet to scrape. You can clean, dedupe, and augment, but you cannot conjure another planet’s worth of text.

Within that constraint, the math backs the critics. Classic scaling law papers show loss improving roughly as a power law with model size and data, which means the curve flattens as you go big. Doubling parameters from 10 billion to 20 billion moves the needle; doubling from 1 trillion to 2 trillion barely nudges perplexity. The cost curve, meanwhile, grows brutally steep.
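To make the flattening concrete, here is a toy calculation using a Chinchilla-style power law for loss versus parameter count. The constants loosely echo published fits but are purely illustrative; nothing here is fitted to any particular model.

```python
# Toy illustration of diminishing returns under a power-law scaling curve.
# L(N) = E + A / N**alpha  -- Chinchilla-style form; the constants are
# illustrative placeholders, not fitted values from any real model.
E, A, alpha = 1.69, 406.4, 0.34

def loss(n_params: float) -> float:
    """Hypothetical pre-training loss as a function of parameter count."""
    return E + A / (n_params ** alpha)

for small, big in [(10e9, 20e9), (1e12, 2e12)]:
    gain = loss(small) - loss(big)
    print(f"{small:.0e} -> {big:.0e} params: loss drops by {gain:.4f}")
# Doubling at 10B buys a visibly larger drop (~0.034) than doubling at
# 1T (~0.007), while the compute bill for the second doubling is far larger.
```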

So the narrow technical claim stands: returns from simply adding more parameters and more data to vanilla transformers are diminishing. Deep learning did not “hit a wall” in the sense of zero progress, but one particular wall is real. The old paradigm—bigger model, bigger dataset, predictable gains—no longer explains why capabilities keep exploding.

The Curve That Breaks the Narrative

Scaling pessimists run into a problem when they hit the METR charts. While Gary Marcus declared in 2022 that deep learning was “hitting a wall,” METR’s data on machine autonomy shows the opposite: a curve that bends up, not down. Over the past six years, the length of tasks completed by generalist agents has doubled roughly every seven months.

That pace recently accelerated to a doubling every four months. Plot those points and you do not get a gentle exponential; you get what METR describes as a log-quadratic trajectory, a growth curve that steepens over time. On a linear graph, the right-hand side looks almost vertical.
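To get a feel for the difference between a steady exponential and an accelerating one, here is a small simulation contrasting a fixed 7-month doubling with a regime that switches to 4-month doublings. The starting task length and the switch-over point are invented for illustration, not METR’s actual fit.

```python
# Compare a fixed 7-month doubling of agent task length with a regime
# that shifts to 4-month doublings partway through. All constants are
# illustrative, anchored only to the doubling rates METR reports.
start_minutes = 5.0          # hypothetical task length at month 0

def fixed_doubling(t, doubling_months=7):
    return start_minutes * 2 ** (t / doubling_months)

def accelerating(t, switch=24):
    # 7-month doublings for the first `switch` months, 4-month after that.
    if t <= switch:
        return fixed_doubling(t, 7)
    return fixed_doubling(switch, 7) * 2 ** ((t - switch) / 4)

for t in range(0, 49, 6):
    print(f"month {t:2d}: fixed {fixed_doubling(t):8.1f} min | "
          f"accelerating {accelerating(t):8.1f} min")
# On a log scale the first column is a straight line; the second bends
# upward after the switch, which is the qualitative shape behind the
# "log-quadratic" description.
```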

This matters because autonomy is not a vanity benchmark. METR tracks how long an AI agent can operate in messy, open-ended environments before failing—multi-step software tasks, complex web interactions, tool use. When that median task length doubles every few months, you are watching systems cross thresholds of economic relevance.

The timing undercuts the “wall” narrative. That four-month doubling regime kicked in roughly three years after Marcus’s warning and after Ilya Sutskever’s concerns about “peak data” and “one internet.” If classic scaling laws were truly stalling overall progress, you would expect the autonomy curve to flatten, not spike.

Instead, progress on autonomy aligns with other runaway metrics. Reasoning benchmarks like ARC-AGI went from near-zero performance to near-saturation in months, forcing rapid releases of ARC-AGI-1, -2, and now a planned interactive -3. Benchmark designers keep moving the goalposts because current models keep blasting through them.

For a deeper look at why some capabilities explode while others lag, see work like Moravec's Paradox and Restrepo's Model: Limits of AGI Automation. But on the core question—whether AI progress is slowing—the METR curve is blunt. Capability, measured as real task completion, is not coasting; it is compounding on a near-vertical climb.

When Grand Challenges Last for Months

Grand challenges used to mean careers, not quarters. When François Chollet released the ARC-AGI benchmark in 2019, he framed it as a test of abstract reasoning that should resist brute-force scaling. Early systems barely scratched it: roughly 0–5% accuracy after four years of incremental model upgrades, clever search tricks, and hand-tuned solvers.

Then large multimodal models and better tool scaffolding arrived, and the curve snapped. Performance jumped from that low single-digit plateau to near-saturation in a matter of months, not decades. A benchmark deliberately engineered to stump pattern-matching systems suddenly looked like low-hanging fruit.

ARC-AGI’s puzzles are tiny 2D grids where models must infer a transformation rule from a handful of input-output examples, then apply it to a new case. No natural language, no internet-scale priors, just pure systematic generalization. Chollet’s whole point was to create something that rewarded “intelligence” over memorization or scale.
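For readers who have never seen the format, here is a toy ARC-style task in a few lines of Python. The grids and the hidden rule (a simple horizontal mirror) are invented for illustration and are far easier than real ARC-AGI puzzles.

```python
# A toy ARC-style task: each cell is a color index, and the hidden rule
# here is "mirror the grid left-to-right". Real ARC-AGI rules are far
# less obvious; this only shows the input/output structure.
train_pairs = [
    ([[1, 0, 0],
      [2, 0, 0]],
     [[0, 0, 1],
      [0, 0, 2]]),
]
test_input = [[0, 3, 0],
              [4, 0, 0]]

def apply_rule(grid):
    """Candidate hypothesis: horizontal mirror."""
    return [list(reversed(row)) for row in grid]

# Check the hypothesis against the demonstration pair, then apply it.
assert all(apply_rule(x) == y for x, y in train_pairs)
print(apply_rule(test_input))   # [[0, 3, 0], [0, 0, 4]]
```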

Reality answered with a speedrun. Once researchers wrapped foundation models with test-time search, program synthesis, and explicit tool use, ARC-AGI’s difficulty collapsed. The benchmark that crawled from 0% to 5% in four years sprinted from 5% to near-human performance in less than a year of serious attention.

So the community is now doing what game designers do when players break the meta: patching and escalating. Chollet and collaborators rolled out a stricter ARC-AGI 1 leaderboard, then started work on ARC-AGI 2, and now an even more ambitious ARC-AGI 3. Each iteration exists because the previous “grand challenge” aged out in record time.

ARC-AGI 3 reportedly goes interactive. Instead of static before/after grids, models face environments that react to their actions, forcing multi-step hypothesis testing and feedback-driven exploration. That shift mirrors a broader trend in AI evaluation:

1. Moving from static test sets to interactive tasks
2. From one-shot answers to multi-step reasoning traces
3. From closed-world puzzles to open-ended tool use

Critics call this “moving the goalposts.” Practitioners call it survival. When a supposedly decade-long benchmark compresses into a few training runs and some clever agent scaffolding, the only option is to build harder goals fast enough to stay ahead of the curve.

It's Not One Thing, It's Everything

Illustration: It's Not One Thing, It's Everything

The scaling pessimists and the acceleration data can both be right because they describe different curves. The slowdown shows up in one narrow vector: vanilla pre-training on a fixed transformer architecture, where more parameters and “one internet” of data buy less gain each generation. Meanwhile, the actual capability frontier keeps moving because researchers stack many orthogonal upgrades on top of that baseline.

Test-time compute now acts like a second scaling law. Techniques such as Chain of Thought prompting, tree search, and tool-augmented reasoning let the same base model spend far more FLOPs per question, trading latency for accuracy. On tasks like math word problems and planning, simply allowing multi-step reasoning can jump success rates by double digits without changing the underlying weights.
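Here is a minimal sketch of one such recipe, self-consistency voting over sampled reasoning traces. The `sample_answer` function is a mock stand-in for an actual LLM call, not any real client library.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Stand-in for sampling a chain-of-thought from an LLM and parsing out
    its final answer. Here it is a mock that is right ~70% of the time."""
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Spend extra inference compute on a frozen model: sample many
    independent reasoning traces and majority-vote their final answers."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))   # almost always "42"
```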

Architectural innovation quietly rewires how models spend their capacity. Mixture of Experts routes tokens through small subsets of specialized sub-networks, effectively increasing parameter count by 5–10x while keeping inference costs near a dense model. Structured State Space Models (SSMs) and hybrid transformer-SSM stacks push sequence length and stability, enabling models that can handle sprawling contexts and long-horizon plans.
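A stripped-down sketch of top-2 expert routing shows the core idea. Real MoE layers add load-balancing losses, capacity limits, and fused kernels, none of which appear here; the dimensions and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each expert is a small feed-forward block; here just one weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs.
    Only k of the n_experts matrices touch any given token, which is how
    MoE grows total parameters without growing per-token compute."""
    logits = x @ router                                # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        weights = np.exp(logits[t, top[t]])
        weights /= weights.sum()                       # softmax over the k picks
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)   # (4, 64)
```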

Around those cores, agentic scaffolding turns static models into systems that behave more like workers than autocomplete engines. Multi-agent orchestrators, tool routers, and planning frameworks let LLMs call APIs, write and execute code, search the web, and iteratively refine outputs. Benchmarks like ARC-AGI jump not because the base model suddenly “understands” more, but because it now sits inside a problem-solving loop.
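The scaffolding itself can be surprisingly small. Below is a skeletal agent loop with a mocked model and a single calculator tool; the protocol strings and function names are invented for illustration and do not correspond to any particular framework.

```python
# A skeletal agent loop: the model proposes either a tool call or a final
# answer; the scaffold executes tools and feeds results back. `call_model`
# and the tool registry are mocks, not any particular framework's API.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def call_model(history: list[str]) -> str:
    """Stand-in for an LLM call. Here it hard-codes a two-step script."""
    if not any(line.startswith("OBSERVATION") for line in history):
        return "TOOL calculator 17 * 23"
    return "FINAL 17 * 23 = 391"

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"TASK {task}"]
    for _ in range(max_steps):
        action = call_model(history)
        if action.startswith("FINAL"):
            return action.removeprefix("FINAL ").strip()
        _, tool_name, arg = action.split(" ", 2)
        history.append(f"OBSERVATION {TOOLS[tool_name](arg)}")
    return "gave up"

print(run_agent("Multiply 17 by 23"))   # 17 * 23 = 391
```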

Training recipes have also become a fast-moving frontier. Post-training stacks now mix:
- RLHF-style preference optimization
- DPO and other direct alignment methods (sketched below)
- Synthetic data generation and self-play
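As promised above, here is a minimal sketch of the DPO objective on a single preference pair, assuming you already have summed log-probabilities from the trainable policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization on one preference pair.
    Inputs are summed log-probs of the chosen / rejected responses under
    the trainable policy and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))   # -log(sigmoid(margin))

# The loss shrinks as the policy raises the chosen response's likelihood
# relative to the reference, and grows if it drifts the other way.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))   # policy prefers chosen: ~0.60
print(dpo_loss(-15.0, -12.0, -14.0, -13.0))   # policy prefers rejected: ~0.80
```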

Each layer shapes behavior and reliability far beyond what raw cross-entropy loss ever delivered. Synthetic data, especially, lets models bootstrap from their own strengths, creating curriculum-style training corpora orders of magnitude larger than curated human datasets.

Sam Altman summed up GPT-4’s leap as “not one thing… hundreds of little improvements.” That line is not PR; it is a description of a new regime where progress comes from a combinatorial pileup of optimizations. When critics declare “scaling is over” or boosters insist “scale is all you need,” both flatten a messy empirical story into a slogan and miss how those hundreds of tweaks now move the curve.

The Trillion-Token Elephant in the Room

Call it the trillion-token elephant: modern LLMs only look smart after gorging on data sets so huge they make the entire written history of humanity feel small. GPT-scale models train on hundreds of billions to trillions of tokens, often sweeping up most of the public internet plus synthetic data just to squeeze out a few points on benchmarks.

Human brains do not play this game. A child hears on the order of tens of millions of words in early life, not trillions, and by adolescence can juggle multiple languages, social nuance, and abstract reasoning with a sample budget that would barely register in a modern training run.

Researchers who defend this gap often retreat to evolution. They argue that humans arrive with baked-in inductive biases—visual priors, language instincts, social heuristics—distilled by billions of years of natural selection, while LLMs start as blank slates that must learn everything from scratch.

That story breaks the moment you look at skills evolution never saw coming. No hominid lineage practiced multivariable calculus, Python programming, or writing long-form legal contracts, yet teenagers routinely learn all three from a few textbooks, lectures, and problem sets—thousands of examples, not billions.

High school students internalize the core of differential equations in a semester. A motivated 20-year-old can go from zero to employable software engineer after perhaps a few thousand LeetCode problems and some project work, far below what current models need to reach comparable coding performance.

Humans also perform one-shot and few-shot learning that today’s systems only mimic through prompt tricks. See a novel gadget once, and you can usually guess how to hold it, which parts move, and how to avoid breaking it; LLMs and vision models typically require large curated data sets to reach similar robustness.

Sample efficiency is not a rounding error in the scaling story; it is the next hard wall. Forecasts like The case for AGI by 2030 | 80,000 Hours hinge less on raw FLOPs and more on whether researchers can close this yawning gap between brain-like learning and today’s data-hungry architectures.

To Understand the World, Squeeze It

Compression is quietly becoming the new scaling law. When you force a model to squeeze terabytes of messy internet text, code, images, and video into a few billion parameters, you are not just shaving bits—you are pressuring it to discover structure. That pressure turns “predict the next token” into “reverse engineer the world that produced these tokens.”

Ilya Sutskever has pushed a simple but brutal idea: efficient compression demands an internal model of the data-generating process. To shrink a corpus without losing predictive power, a system must infer how language, physics, social norms, and software actually work. At high compression ratios, shallow statistics break and only genuine regularities survive.
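The link between prediction and compression is literal, not metaphorical: via arithmetic coding, a token the model assigns probability p costs about -log2(p) bits to store. A toy calculation with made-up probabilities shows how better prediction shrinks the archive.

```python
import math

def bits_to_encode(token_probs):
    """Shannon code length: a token the model assigns probability p can be
    encoded in about -log2(p) bits; arithmetic coding gets within a couple
    of bits of this bound over a whole sequence."""
    return sum(-math.log2(p) for p in token_probs)

# Probabilities a weak vs. a strong model might assign to the same
# 5-token continuation (illustrative numbers only).
weak   = [0.05, 0.10, 0.02, 0.08, 0.05]
strong = [0.60, 0.45, 0.30, 0.70, 0.50]

print(f"weak model:   {bits_to_encode(weak):.1f} bits")    # ~21 bits
print(f"strong model: {bits_to_encode(strong):.1f} bits")  # ~5 bits
# Better next-token prediction *is* better compression; driving the bit
# count down forces the model to capture real structure in the data.
```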

Zip files can’t do this because they only exploit local redundancy. A transformer trained on hundreds of billions of tokens must instead infer latent causes: why objects fall, why functions call other functions, why legal contracts share weird phrasings. Those inferred causes form a crude but increasingly rich world model that generalizes beyond the training set.

Think of it as climbing an abstraction hierarchy. At the bottom, models memorize n-grams and syntax templates. Higher up, they learn reusable concepts—loops, metaphors, negotiation patterns—that let them compress entire families of situations into a single representation.

At the top of that hierarchy sits something that starts to look like reasoning. When a model solves a novel coding bug or passes a previously unseen logic puzzle, it is cashing in those compressed abstractions. The same internal structures that made the data smaller also make new situations legible.

This is why post-training tricks like chain-of-thought, tool use, and agents hit so hard. They expose and amplify abstractions the model already learned to compress its training data, turning static representations into dynamic problem solvers. Better prompts don’t magically add IQ; they just give that compressed world model room to unspool.

As sample efficiency becomes the bottleneck, compression-driven understanding turns into the main growth engine. Whoever learns to cram more of the world into fewer bits—without losing the ability to act on it—wins the next phase of AI.

The Compression Engine in Action

Illustration: The Compression Engine in Action

Compression-first AI already exists in the wild, and its loudest proof point right now is DeepSeek. While Western labs argue about “peak data,” a Chinese team quietly trained a frontier-scale model on roughly 14 trillion tokens for about $5 million in compute — a budget that would barely cover a couple of OpenAI keynotes.

DeepSeek’s core trick attacks the costliest part of large language models: text tokens. Instead of feeding the transformer raw characters or subwords, the system uses OCR-style visual representations of text to slash sequence length. By packing information into denser visual tokens, DeepSeek reports roughly 7–20x token reduction, turning what used to be a trillion-token problem into a few hundred billion effective steps.

Token reduction alone would not be enough without solving the memory bottleneck. Transformers pay a quadratic tax on context length because every new token must attend to everything that came before. DeepSeek introduces Multi-Head Latent Attention, a scheme that compresses the KV cache — the stored keys and values each attention head needs — into a smaller latent space, then reconstructs just enough detail on the fly to preserve accuracy.
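A stripped-down sketch of the low-rank idea behind latent attention looks like this: cache one small latent vector per token and re-expand keys and values on demand. It deliberately simplifies away DeepSeek’s per-head details, RoPE handling, and training tricks; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

W_down = rng.standard_normal((d_model, d_latent)) * 0.02       # compress
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

def cache_token(h: np.ndarray) -> np.ndarray:
    """Store only a d_latent vector per token instead of full keys+values."""
    return h @ W_down                                  # shape (d_latent,)

def expand(latent: np.ndarray):
    """Rebuild per-head keys and values from the cached latent on demand."""
    k = (latent @ W_up_k).reshape(n_heads, d_head)
    v = (latent @ W_up_v).reshape(n_heads, d_head)
    return k, v

h = rng.standard_normal(d_model)
latent = cache_token(h)
k, v = expand(latent)
full = n_heads * d_head * 2          # floats cached per token under naive KV
print(f"cached floats per token: {latent.size} vs {full} "
      f"({full / latent.size:.0f}x smaller)")
```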

That KV compression matters because cache memory dominates inference and training costs at long context lengths, as the back-of-envelope below shows. By shrinking those tensors, DeepSeek can:
- Fit longer contexts into the same GPU RAM
- Run more parallel sequences per GPU
- Cut the FLOPs per effective training token
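Here is that back-of-envelope with round numbers for a hypothetical 40-layer dense model; swap in your own model’s dimensions to taste.

```python
# Back-of-envelope KV-cache size for a hypothetical dense model.
# KV cache = 2 (keys + values) * layers * kv_heads * head_dim
#            * sequence_length * bytes_per_value * batch.
layers, kv_heads, head_dim = 40, 32, 128
seq_len, batch, bytes_fp16 = 128_000, 1, 2

cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16 * batch
print(f"naive fp16 KV cache: {cache_bytes / 1e9:.1f} GB per sequence")  # ~83.9 GB

# Compress each token's cached state ~8x (latent caching, grouped heads, etc.)
print(f"with 8x KV compression: {cache_bytes / 8 / 1e9:.1f} GB")        # ~10.5 GB
```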

Stack these gains and the economics flip. A training run that might have demanded tens of millions of dollars under a vanilla transformer recipe suddenly fits into a single-digit million budget, while still chewing through internet-scale data. The model does not violate scaling laws; it bends them by changing what a “token” and a “step” mean.

DeepSeek functions as a working demo of the “compression forces understanding” thesis. By aggressively compressing both its inputs and its internal state, it squeezes more structure out of the same raw web scrape, turning a hard data ceiling into an algorithmic floor for the next generation of systems.

Why Efficiency is Inevitable

Efficiency stops being optional once three hard constraints show up at the same time: compute, power, and data. Each one by itself would nudge AI toward better sample efficiency; together they turn it into the main battlefield for progress.

Start with compute. Even companies sitting on thousands of H100s run into ceilings: supply chain bottlenecks, datacenter buildout time, and inference costs that balloon with every parameter. When a frontier model update costs hundreds of millions of dollars in training runs, the only sustainable way forward is to squeeze more capability out of every floating-point operation.

Power makes that pressure worse. Training and running state-of-the-art models already draw megawatts per site, and global AI electricity demand is on track to rival small countries. Regulators and investors increasingly care about intelligence per watt, not just raw benchmark scores, forcing architectures, compilers, and training recipes that deliver more cognition for the same energy budget.

Data forms the third vise. Ilya Sutskever’s “we have but one internet” line captures the basic reality: the web-scale text that powered GPT-3 and GPT-4 style models does not grow fast enough to support endless 10x jumps. You can scrape YouTube transcripts and code repos forever, but you hit duplicated content, spam, and noise long before you hit new concepts.

That so-called “data wall” is really a compression wall. Models already see the same patterns again and again; the next gains come from extracting more structure from what we already have. Better tokenization, curriculum learning, synthetic data, and self-play all boil down to the same move: represent the world’s information more compactly so each real example teaches more.

Three forcing functions now steer research roadmaps:

1. Compute constraint: fewer FLOPs per unit of capability
2. Power constraint: fewer joules per inference and per training run
3. Data constraint: fewer real-world examples per skill learned

For a deeper dive into how these constraints intersect with AGI timelines and why experts disagree so sharply, see Why do people disagree about when powerful AI will arrive?.

Beyond Scale: Redefining Intelligence

Scale dominated the AI story for a decade: more parameters, more data, more GPUs, more benchmark wins. Now the center of gravity is shifting toward learning efficiency—how quickly a system can turn experience into competence, not how many trillions of tokens it has chewed through.

Emergent abilities once served as the scoreboard. Hit 10x more compute, watch new skills pop out of the loss curve like magic tricks. But those “surprises” mostly reflected blunt-force coverage of the training distribution, not the underlying primitive we actually care about: rapid generalization from scarce, messy, or novel data.

Benchmarks like ARC-AGI exposed the mismatch. Models crawled from 0% to 5% over four years, then jumped from 5% to near-saturation in a matter of months once researchers leaned on better search, tool use, and agent scaffolding. METR’s autonomy data shows task lengths doubling every 4–7 months, even as returns from vanilla pre-training flatten, signaling that smarter use of experience now moves the needle more than scale alone.

Metrics follow incentives. For years, labs bragged about:
- Parameter counts (billions to trillions)
- Training FLOPs (10x per flagship release)
- Benchmark leaderboards (MMLU, GSM8K, ARC)

Next-generation systems will advertise something different: shots-to-mastery, update latency, and how few interactions they need to surpass a human domain expert. Sample efficiency becomes not just a research curiosity, but the defining performance spec.
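Nobody has standardized these metrics yet, but a “shots-to-mastery” score could be as simple as counting examples until a held-out evaluation crosses an expert-level bar. The sketch below, including the mock learner, the incremental-update API, and the 90% threshold, is entirely hypothetical.

```python
def shots_to_mastery(learner, examples, eval_fn, threshold=0.90):
    """Hypothetical sample-efficiency metric: feed labeled examples one at a
    time and report how many the learner consumed before its held-out score
    first crosses `threshold`. Returns None if it never gets there."""
    for n, example in enumerate(examples, start=1):
        learner.update(example)              # assumed incremental-update API
        if eval_fn(learner) >= threshold:
            return n
    return None

class MockLearner:
    """Toy stand-in whose accuracy climbs with every example it sees."""
    def __init__(self):
        self.seen = 0
    def update(self, example):
        self.seen += 1
    def accuracy(self):
        return min(1.0, 0.5 + 0.03 * self.seen)

learner = MockLearner()
print(shots_to_mastery(learner, range(1000), lambda m: m.accuracy()))   # 14
```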

So the question for AGI flips. Instead of “How big is it?” the more revealing test becomes “How fast can it learn anything?” An AGI that ingests a single textbook and outperforms veteran engineers the next day, or watches minutes of video from a new factory line and rewrites the control software in real time, represents a fundamentally different creature than today’s frozen giants. Intelligence stops looking like a static model and starts looking like an always-on learning process racing the clock.

Frequently Asked Questions

What is the 'scaling paradox' in AI?

It's the observation that while traditional scaling laws (adding more data and compute to models) are yielding diminishing returns, the actual real-world capabilities of AI systems are accelerating faster than ever.

Is AI really 'hitting a wall' as some experts claim?

The data suggests otherwise. While the single vector of vanilla scaling is slowing, progress on benchmarks like ARC-AGI and autonomy metrics from METR shows a dramatic acceleration, indicating progress is shifting to other areas.

Why is data compression suddenly so important for AGI?

The theory is that true understanding comes from efficient compression. To compress data effectively, a model must learn the underlying structure of the world that generated it, leading to more generalizable intelligence and overcoming data bottlenecks.

What is the METR autonomy curve?

It's a trend tracked by the research organization METR showing that the complexity of tasks AI agents can complete has been doubling at an accelerating rate, recently speeding up from every 7 months to every 4 months.

Tags

#AGI #Scaling Paradox #LLM #AI Research #Futurism