The Test That Was Built to Break AI
Rumors of a secret GPT-5 breakthrough started with a chart: a claimed 75–76% score on the new ARC-AGI-2 benchmark, comfortably above the roughly 60% scored by the average human test taker. The story, amplified on X and YouTube, framed it as the moment an AI finally beat humans on a test explicitly built to gatekeep AGI.
ARC-AGI comes from François Chollet, the creator of Keras and a longtime Google AI researcher who has spent years arguing that scaling up language models is not the same as building general intelligence. His ARC (Abstraction and Reasoning Corpus) benchmark, and its newer ARC-AGI-2 variant, target the kind of fluid reasoning humans use to solve puzzles they have never seen before.
Instead of trivia questions or textbook problems, ARC-AGI presents tiny colored grids and asks the model to infer the hidden rule. Each task includes just a handful of input-output examples—usually three to five—then a final test input where the model must generate the correct output grid from scratch. No instructions, no labels, no multiple choice.
The benchmark measures fluid intelligence: pattern discovery, compositional reasoning, and generalization from almost no data. It uses a strict Pass@2 metric—models get at most two attempts per task, with no partial credit and an eye on compute cost per solution.
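To make the format concrete, here is a minimal sketch—in Python, with invented grids and an invented rule—of what a single ARC-style task looks like to a solver: a few demonstration pairs plus a test input, with the solver expected to return a complete output grid.

```python
# A minimal, illustrative sketch of the ARC-AGI task format.
# Grids are small 2-D arrays of integers, each integer a color (0-9).
# The grids and the rule below are invented; real tasks live at arcprize.org.

Grid = list[list[int]]

task = {
    "train": [  # demonstration pairs showing the hidden rule
        {"input": [[0, 1], [0, 0]], "output": [[1, 0], [0, 0]]},
        {"input": [[0, 0], [0, 2]], "output": [[0, 0], [2, 0]]},
        {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]},
    ],
    "test": [  # the solver must produce the output grid for this input from scratch
        {"input": [[0, 0], [4, 0]]},
    ],
}

def solve(train_pairs: list[dict], test_input: Grid) -> Grid:
    """A solver must infer the rule from the demonstrations alone.
    Here the invented rule is simply 'mirror the grid left to right'."""
    return [list(reversed(row)) for row in test_input]

print(solve(task["train"], task["test"][0]["input"]))  # [[0, 0], [0, 4]]
```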
That design makes ARC-AGI brutally hard for large language models. LLMs excel when they can lean on memorized patterns from web-scale text, but ARC-AGI’s puzzles are hand-designed to be novel, visual rather than linguistic, and deliberately unlike anything in common training corpora.
Standard leaderboards like MMLU, GSM8K, or HumanEval often blur the line between reasoning and recall. Benchmarks leak into training data; model vendors fine-tune directly on similar question formats; scores creep upward in ways that may say more about data contamination than genuine understanding.
ARC-AGI pushes in the opposite direction. Tasks are “human-easy/AI-hard,” with human solvers effectively near 100% when given time, while early frontier models scraped single digits on ARC-AGI-2. That gap is why a claimed 75% GPT-5 score, even if unverified, set off alarms: if true, it would signal an AI not just parroting knowledge, but cracking brand-new rules the way people do.
Thinking in Grids: What Makes ARC So Hard
Colored squares on a grid do not sound like a Turing test, but ARC-AGI turns that kids’ toy aesthetic into a razor for AI. Each puzzle shows a handful of tiny input grids and matching output grids, then asks the model to transform a new grid using the same hidden rule: maybe mirror the blue blocks, grow a red shape by one pixel, or delete everything except the largest connected component.
Humans glance at these examples and almost immediately start narrating structure: “Oh, the yellow line marks the center,” or “the pattern repeats every three cells.” For current models, those same 10×10 or 20×20 grids are a combinatorial minefield. Every colored pixel multiplies the number of possible transformations that could fit the data, and nothing in a language model’s pretraining corpus looks much like this.
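One of the hypothetical rules above—keep only the largest connected component—is easy to write down once a human spots it, which is exactly the point. A rough Python sketch of that single rule (not taken from any ARC solution set, and assuming 4-way connectivity with 0 as background):

```python
from collections import deque

def largest_component_only(grid):
    """Keep only the largest 4-connected blob of non-background cells."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []  # cells of the largest component found so far

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 0 and not seen[r][c]:
                # Flood-fill one connected component.
                component, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    component.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and grid[ny][nx] != 0 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(component) > len(best):
                    best = component

    out = [[0] * cols for _ in range(rows)]
    for y, x in best:
        out[y][x] = grid[y][x]
    return out
```

The asymmetry is that a person sees this rule in seconds, while a model must first consider it among thousands of equally plausible transformations.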
ARC’s creator François Chollet designed it as a pure test of fluid intelligence: the ability to reason in novel situations, discover patterns, and recombine concepts on the fly. That stands in contrast to crystallized intelligence, which leans on memorized facts and familiar templates—where large language models shine by regurgitating and remixing web-scale text.
On ARC-AGI-2, there is no dataset overlap to exploit: memorizing the public training tasks buys nothing on the separate, hidden evaluation set. Models see just 3–5 input-output pairs per task and must generalize to a new example. No gradients update, no fine-tuning occurs; everything happens at test time, inside the model’s existing weights and whatever scaffolding sits around them.
To keep systems honest, ARC-AGI-2 uses a Pass@2 metric: a model gets at most two guesses per task. There is no partial credit for “almost right,” and no opportunity to shotgun thousands of samples until one sticks. The benchmark also tracks efficiency, counting how much compute each attempt burns, which punishes brute-force enumeration of candidate programs.
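The scoring itself is simple and unforgiving. A minimal sketch of what a Pass@2 evaluator amounts to (function names invented for illustration):

```python
# A minimal sketch of Pass@2 scoring; names are invented for illustration.
# A task counts as solved only if one of at most two submitted grids matches
# the hidden target exactly—same shape, same color in every cell.

def exact_match(predicted, target):
    return predicted == target

def pass_at_2(attempts, target):
    """attempts: up to two candidate output grids for a single task."""
    return any(exact_match(a, target) for a in attempts[:2])

def benchmark_score(results):
    """results: list of (attempts, target) pairs, one entry per task."""
    solved = sum(pass_at_2(attempts, target) for attempts, target in results)
    return solved / len(results)
```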
Humans, by contrast, routinely solve these puzzles in a few minutes, often with a single clean insight. That gap—between human “obvious” and machine “opaque”—exposes how far today’s best models still lag on genuine abstraction, even as they dominate exams built on crystallized knowledge.
The 'Unhobbling' Revolution Nobody Saw Coming
Unhobbling sounds like a niche alignment term, but Leopold Aschenbrenner uses it to name something brutally simple: current models are smart, yet artificially crippled. His 2024 “Situational Awareness” paper argues that a huge fraction of near-term gains will come not from bigger models, but from removing those shackles.
His analogy lands hard. Asking an LLM to solve a tough math problem in one shot is like demanding a human blurt out the answer instantly, no scratch paper, no revisions. Chain-of-thought prompting acted as that scratchpad, turning “chatbots that guess” into systems that can walk through multi-step reasoning and suddenly ace far harder problems.
Today’s frontier models remain heavily hobbled. Aschenbrenner calls out that they:
- Have no robust long-term memory
- Can’t use a computer or filesystem fluidly
- Rarely “think before they speak” with extended internal deliberation
- Mostly operate in short, single-threaded chats instead of persistent projects
Unhobbling means fixing those constraints with scaffolding: tool use, planning loops, external memory, multi-agent orchestration, and more test-time compute. Crucially, it changes what you can do with the same base weights, which is why Aschenbrenner classifies it as algorithmic progress rather than just UX polish.
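In code, the simplest version of that scaffolding is just a loop: let the model think, take actions with tools, write results to an external scratchpad, and keep going until it commits to an answer. The sketch below only illustrates the pattern—`call_model` and `run_python` are hypothetical stand-ins, not any vendor’s real API.

```python
def call_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for a frontier-model API call")

def run_python(code: str) -> str:
    raise NotImplementedError("stand-in for a sandboxed code-execution tool")

def solve_with_scaffolding(task: str, max_steps: int = 8) -> str:
    """Same base weights, but wrapped in a planning loop, tool use, and memory."""
    memory: list[str] = []  # external scratchpad the model cannot keep on its own
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\nNotes so far:\n" + "\n".join(memory)
            + "\nReply with FINAL: <answer> or CODE: <python to run>."
        )
        reply = call_model(prompt)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        if reply.startswith("CODE:"):
            memory.append("Tool result: " + run_python(reply[len("CODE:"):]))
        else:
            memory.append("Thought: " + reply)
    return "no answer within budget"
```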
You can already see this logic in the numbers being circulated. Poetic’s meta-system reportedly pushes a GPT-5 variant from roughly human-level ARC-AGI-2 performance (~60%) to around 75–76%, and lifts Grok-4-style models from ~56–57% to ~72% on similar reasoning tests, all without a bigger base model—though, as the leaderboard section below shows, those figures remain unverified. Google’s Gemini 3 line is described the same way: from sub-30% to the mid-40s, then to and past human baselines on ARC-style tasks via successive unhobbling passes.
That dynamic reframes timelines. If unhobbling alone can deliver 10–20 point jumps on benchmarks that were supposed to require the next generation of models, you no longer have to wait for GPT-6-scale training runs to see step changes. OpenAI’s own “Introducing GPT-5” messaging leans on similar themes: more tools, more context, more agency layered on top of raw scale.
Aschenbrenner’s forecast is blunt: by 2027, continued unhobbling turns today’s chatbot into something that behaves much more like an agent and a coworker than a talking search box.
Inside Poetic: The 'Manager AI' Strategy
Poetic sits at the center of the GPT-5 ARC story. TheAIGRID’s video credits the company with building an “unhobbling” scaffold around a frontier OpenAI model, not training a new brain from scratch. Their claim: a meta-system that pushes GPT-5 to a reported 75–76% on ARC-AGI-2 without scaling up the underlying weights.
At the core of Poetic’s approach sits a “Manager AI.” Instead of firing a single giant model call at each puzzle, the manager inspects the grid, proposes a high-level plan, then decomposes it into subproblems. Each subproblem routes to a specialized worker model—some tuned for pattern recognition, others for code generation, search, or verification.
Crucially, this manager does not just prompt and pray. It can:
- Write and execute code against the puzzle grid
- Inspect intermediate outputs and compare them to the target
- Branch into alternative strategies when a path looks wrong
- Decide when to stop once a correct solution appears
That loop—plan, act, check, revise—turns ARC from a one-shot guessing game into an iterative search. The system can run dozens of cheap worker calls instead of hammering a single expensive frontier model. Poetic argues this saves massive compute on hard reasoning tasks, because the manager halts early whenever a candidate output matches the required grid exactly.
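Poetic has not published its architecture, so any code is speculation, but the loop described above has a recognizable shape: propose plans, dispatch cheap workers, verify candidates against the demonstration pairs, and halt the moment one program explains all of them. A sketch with invented names:

```python
def manager_propose_plans(task):
    raise NotImplementedError("stand-in for a planning-model call")

def worker_write_program(task, plan):
    raise NotImplementedError("stand-in for a cheaper code-writing worker")

def run_program(program, grid):
    raise NotImplementedError("stand-in for sandboxed program execution")

def solve_task(task, max_plans: int = 5, workers_per_plan: int = 3):
    """task: {'train': [(input_grid, output_grid), ...], 'test_input': grid}"""
    for plan in manager_propose_plans(task)[:max_plans]:
        for _ in range(workers_per_plan):
            program = worker_write_program(task, plan)
            # Cheap verification: does the candidate reproduce every training pair?
            if all(run_program(program, x) == y for x, y in task["train"]):
                # Halt early—no further model calls once a consistent program exists.
                return run_program(program, task["test_input"])
    return None  # spend the remaining Pass@2 attempt on a fallback guess
```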
Contrast that with the standard monolithic LLM setup. In the baseline world, you send one prompt to one big model, get one answer, and pay full price even if the output fails. There is no explicit decomposition, no persistent scratchpad, no self-correction beyond a user hitting “try again.”
Poetic’s multi-agent, self-correcting architecture effectively externalizes what chain-of-thought only hints at. Instead of coaxing a single model into thinking step by step, the manager orchestrates a team, allocates test-time compute where needed, and prunes dead ends. On a benchmark like ARC-AGI-2, that kind of structured meta-reasoning can matter more than another 10 billion parameters.
Reality Check: The Real ARC-AGI Leaderboard
Reality hits as soon as you open the actual ARC Prize leaderboard. The viral 75% GPT-5 score simply does not exist there, or anywhere else that is independently verified. Instead, the public numbers paint a far more grounded—and still astonishing—picture of where current models stand.
On the main ARC-AGI-2 board, baseline GPT-5 posts a Pass@2 score of just 9.9%. That puts it in the same struggling cohort as other frontier models: Claude Opus 4 at 8.6%, various Gemini 3 variants in the low double digits, and many systems languishing between 2% and 6%. Grok-4 “Thinking” leads that early table with 16.0%, hardly the stuff of AGI victory laps.
Scroll further and the supposed miracle model appears in a different guise: GPT-5.2, a newer OpenAI system that suddenly changes the curve. On the official ARC-AGI-2 “systems” leaderboard, GPT-5.2 clocks in around 53–54% Pass@2. That score is more than five times GPT-5’s 9.9% and roughly triples GPT-5.1’s reported 17.6%, while comfortably beating previous stars like Gemini 3 Pro at roughly 45%.
Humans, however, still own this benchmark. ARC-AGI-2’s human baseline sits around 60% for average test takers, and the validated task sets approach 98–100% when you only count tasks solved by at least two of the nine or ten people who attempted them. The entire point of ARC is that these grid puzzles feel “obvious” to humans yet remain brutally opaque to machines.
That context makes the 75–76% claim look more like marketing than measurement. No public leaderboard entry, paper, or ARC Prize update shows any GPT-5 variant, Poetic system, or Grok configuration breaking the human-average 60% line, let alone smashing it by 15 points. If such a run exists, it lives off-book, unverifiable, and outside the norms of competitive benchmarks.
None of this diminishes how shocking the verified 53–54% GPT-5.2 result actually is. A single model family jumping from sub-20% to above 50% on ARC-AGI-2 in one generation represents a step-change in abstract reasoning performance. Human-level remains out of reach, but the gap just narrowed far faster than almost anyone predicted.
Why 54% Is More Impressive Than 100%
ARC-AGI progress never looked like a smooth curve. For years, state-of-the-art models hovered between 0% and 6% on ARC-style puzzles, effectively showing no fluid intelligence despite monstrous training runs. They could ace bar exams and coding interviews, then faceplant on a 5×5 grid of colored squares.
That’s why 54% matters more than a hypothetical 100%. Hitting the mid-50s on ARC-AGI-2, as GPT-5.2 reportedly does, means models jumped from “basically broken” to “solving more than half the problems an average human can.” That is a qualitative phase change, not a marginal benchmark bump.
ARC-AGI-2 uses Pass@2: two guesses, no partial credit, cost-sensitive evaluation. Prior frontier models like GPT-5, Claude Opus 4, and Grok-4 Thinking clustered in the single digits to low teens. A leap to ~53–54% more than triples those scores, while average humans sit around 60% and curated human baselines hit 98–100%.
Crucially, that leap did not come from just scaling model size. It came from unhobbling: better search, scratchpad reasoning, tool use, and manager-style orchestration around the base model. Poetic’s “manager AI” approach—routing tasks, decomposing problems, iterating solutions—embodies the algorithmic progress Leopold Aschenbrenner flagged as the next big driver of capability.
Aschenbrenner’s thesis was simple: models are far more capable than their naive one-shot outputs suggest. Add structured thought, memory, and tools, and you unlock dormant intelligence. ARC’s jump from 0–6% to >50% is the graph version of that argument.
Sam Altman has repeatedly pointed to ARC as a “real” AGI yardstick, precisely because it resists memorization and prompt engineering hacks. OpenAI insiders reportedly track ARC curves more closely than splashy standardized tests. When that line bends sharply upward, people building AGI pay attention.
Anyone can browse the public leaderboards and methodology at the official ARC Prize site, arcprize.org. The headline isn’t perfection; it’s that the curve finally moved.
Beyond Scaling: The New Path to AGI
Scaling laws had a good run. For most of the past five years, progress in large language models followed a simple recipe: more parameters, more data, more compute. GPT-3 to GPT-4 to GPT-5 looked like a straight line on a log-log chart, with performance curves that neatly fit power-law equations.
ARC-AGI-2 quietly breaks that story. Models like GPT-5.2 jump from low double digits on earlier ARC-style tasks to roughly 53–54% on ARC-AGI-2 not because someone trained a trillion-parameter behemoth, but because researchers changed how models think at test time. System design and algorithms, not raw scale, delivered the step change.
François Chollet, who created the original ARC benchmark, has argued this for years. In his view, true general intelligence cannot live in a static, pre-trained blob of weights that only regurgitates correlations. It requires systems that can build and revise hypotheses on the fly, explore solution spaces, and adapt their strategy as they encounter new tasks.
That philosophy shows up directly in ARC’s design. Each puzzle gives just 3–5 input-output examples and then a completely new test grid; no internet-scale training set can bail you out. To solve these, a model must perform test-time learning: infer rules, search over candidate transformations, and self-correct under tight compute budgets.
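At its crudest, test-time learning on ARC looks like search: enumerate candidate transformations, keep only those consistent with every demonstration pair, and stop when the compute budget runs out. The toy sketch below uses a deliberately tiny, invented set of primitives; real solvers search a far richer program space.

```python
def identity(g):   return [row[:] for row in g]
def mirror_lr(g):  return [list(reversed(row)) for row in g]
def flip_ud(g):    return [row[:] for row in reversed(g)]
def transpose(g):  return [list(col) for col in zip(*g)]

CANDIDATE_RULES = [identity, mirror_lr, flip_ud, transpose]

def solve_by_search(train_pairs, test_input, budget: int = 100):
    """train_pairs: list of (input_grid, output_grid); budget caps rule checks."""
    checks = 0
    for rule in CANDIDATE_RULES:
        consistent = True
        for x, y in train_pairs:
            checks += 1
            if checks > budget or rule(x) != y:
                consistent = False
                break
        if consistent:
            return rule(test_input)  # first rule that explains all demonstrations
    return None  # nothing fit within the budget
```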
“Unhobbling” is what happens when you take that seriously and wrap a powerful base model in scaffolding that lets it behave more like a scientist than an autocomplete engine. Leopold Aschenbrenner’s “Situational Awareness” paper calls out things like chain-of-thought prompting, tool use, and long-horizon planning as simple tweaks that unlock latent capability. Poetic’s manager-LLM architecture is that idea turned into a product.
Instead of one giant forward pass, Poetic orchestrates multiple models, tools, and retries under a manager AI that decides how to spend compute. That is an architectural innovation, not a scaling one. The reported jumps—Grok-4 “Thinking” going from ~56–57% to ~72% on internal reasoning tests, or Gemini 3 variants climbing from under 30% toward human level on ARC-style tasks—are attributed to exactly this kind of system-level unhobbling.
If that pattern holds, AGI might arrive less as a single colossal model and more as a tightly integrated stack of adaptive components. Brute force built the engines; clever architecture may finish the car.
The Goalposts Are Moving: ARC-AGI-3 and Beyond
ARC-AGI-2 is already brutal, but its creators are not standing still. The ARC Prize team is quietly working on ARC-AGI-3, a next-generation benchmark slated for around 2026, designed explicitly to break models that only look smart on static tests.
Instead of colored grids as fixed puzzles, ARC-AGI-3 will drop models into an unknown environment and ask them to figure out what matters. Think less “solve this pattern” and more “you’re in a strange microworld with objects and rules; discover how it works and then achieve a goal.”
That shift turns passive pattern-matching into interactive reasoning. Models will need to poke at the environment, run experiments, and update their hypotheses when something breaks, much closer to how humans learn a new tool, game, or interface.
The new benchmark targets skills today’s frontier models mostly fake with clever prompting. To succeed, an AI will need to do four things, sketched in code after this list:
- Explore efficiently instead of randomly clicking around
- Set its own subgoals without being hand-held
- Build and revise a world model from sparse feedback
- Plan multi-step sequences of actions and execute them reliably
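ARC-AGI-3 has not been released, so no code can describe it; the skeleton below is pure speculation, meant only to show the loop those four requirements imply: act, observe, update a world model, and re-plan when predictions break. Every interface here (`env.reset`, `env.step`, `propose_plan`) is invented.

```python
def propose_plan(world_model, observation):
    raise NotImplementedError("stand-in for an exploration / planning policy")

def run_agent(env, max_steps: int = 200):
    world_model = {}           # the agent's evolving guess at how the world works
    observation = env.reset()  # no instructions—just an initial observation
    plan = []

    for _ in range(max_steps):
        if not plan:
            # Set a subgoal: prefer actions whose outcomes are still unknown.
            plan = propose_plan(world_model, observation)
        action = plan.pop(0)
        next_observation, reward, done = env.step(action)

        # Update the world model from sparse feedback; if it mispredicted,
        # throw away the current plan and force a re-plan on the next step.
        predicted = world_model.get((observation, action))
        world_model[(observation, action)] = (next_observation, reward)
        if predicted is not None and predicted != (next_observation, reward):
            plan = []

        observation = next_observation
        if done:
            return reward
```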
ARC-AGI-3 also attacks one of the biggest crutches in current evaluations: dense instructions. Instead of a natural-language spec telling the model exactly what to do, the system will often have to infer the task from a few examples, partial rewards, or even just “make something good happen.”
That makes it a test of agency, not just reasoning. A system that can autonomously decide, “I should map this space, catalog object behaviors, then search for a path to the goal,” looks a lot closer to the “AI coworker” Leopold Aschenbrenner predicted than to a chatbox that waits for prompts.
If ARC-AGI-2 measures whether a model can solve a hard puzzle when you spoon-feed it the rules, ARC-AGI-3 asks whether it can walk into a new world and teach itself the rules. Crossing that gap—from problem solver to adaptable agent—is the next real hurdle on the road to AGI.
How 'Unhobbled' AI Will Change Your Workflow
Unhobbling stops being abstract the moment an AI stops acting like a chat window and starts behaving like a colleague who owns part of your job. Leopold Aschenbrenner’s bet is specific: by 2027, most knowledge workers will interact daily with agents that plan, remember, and execute, not just answer questions. That shift turns “prompt engineering” into something closer to management and collaboration.
Picture a project manager agent embedded in your company’s Slack and Jira. You give it a goal—“ship the new onboarding flow by March 15”—and it decomposes the work, files tickets, negotiates dependencies with other agents, and pings humans only for approvals or decisions that need judgment. It tracks burndown charts, predicts slippage using historical velocity, and automatically drafts stakeholder updates.
A software engineer might hand an unhobbled system a messy monolith and say: “Find performance bottlenecks and propose a migration plan to services.” The agent crawls the repo, builds call graphs, runs profiling in a staging environment, opens pull requests with refactors, and writes regression tests. Human engineers review and steer, but the drudge work of spelunking through legacy code and wiring boilerplate mostly disappears.
Market analysts could offload entire research projects instead of one-off queries. An agent with live web tools and API access might:
- Scrape earnings calls and 10-Ks across a sector
- Track price, sentiment, and volume data in real time
- Run scenario analysis and Monte Carlo simulations on cash-flow models
- Synthesize a 20-page brief with charts, caveats, and recommended trades
Benchmarks like ARC-AGI-2 and datasets in the GPT-5 Benchmarks Repository quietly power this shift, but the surface experience feels mundane: fewer tabs, fewer meetings, fewer status documents you write by hand. The magic comes from unhobbling constraints Aschenbrenner calls out—short context windows, lack of tools, no long-term memory, no planning loop—and wrapping models in scaffolding that fixes them.
Your job, meanwhile, stops being “type a clever prompt, get a clever answer.” You will need to define objectives crisply, negotiate trade-offs, and review plans the way you would with a junior teammate. Collaboration looks like setting guardrails, checking reasoning, and integrating agents into existing workflows instead of babysitting a chatbot.
The Real AI Race Is About Systems, Not Size
Viral hype around a secret GPT-5 quietly “passing” ARC-AGI-2 at 75% turned out to be wrong. Yet the story accidentally landed on a deeper truth: the frontier no longer lives inside a single giant model; it lives in the systems wrapped around it.
ARC Prize’s own leaderboard shows GPT-5 at 9.9% and GPT-5.2 around 53–54%, far from the claimed 75–76%. That gap between rumor and reality highlights how much of today’s progress comes from better orchestration, search, and tooling rather than a magic new trillion-parameter brain.
Foundation models still matter; GPT-5.2 roughly tripled GPT-5.1’s 17.6% ARC-AGI-2 score. But the biggest jumps now come from “unhobbling” those models with scaffolds: manager AIs, tool use, long-term memory, and multi-step planning that squeeze far more effective reasoning out of the same underlying weights.
That shift quietly rewrites the competitive landscape. You no longer need to own a hyperscale data center to compete; you need to design the smartest agentic stack on top of whatever API access you can buy.
A small lab can take an off-the-shelf model and bolt on:
- A planner that decomposes problems into subgoals
- A tool router that calls code, search, and specialized solvers
- A verifier that cross-checks and iterates on answers
On ARC-like tasks, those additions can mean the difference between single-digit and human-adjacent performance.
Poetic’s rumored “manager AI” fits this arc: a controller that decides which model to call, how many samples to generate, and when to re-try or escalate. Whether or not its GPT-5 numbers hold up, the architecture points in the right direction: systems that treat LLMs as components, not oracles.
That is the real race: who can build the most capable, cost-efficient reasoning systems per dollar of compute, not who can announce the largest raw parameter count. Model size still buys you headroom, but unhobbling determines how much of that headroom turns into usable capability.
Watch unhobbling as the throughline from chatbots to co-workers. The fastest path from today’s LLMs to tomorrow’s agents runs through systems engineering, not just bigger GPUs.
Frequently Asked Questions
What is the ARC-AGI-2 benchmark?
It's a test designed by François Chollet to measure an AI's 'fluid intelligence'—its ability to solve novel, abstract reasoning puzzles with very few examples, something humans find easy but current AI struggles with.
What does 'unhobbling' an AI mean?
Coined by Leopold Aschenbrenner, 'unhobbling' refers to improving an AI's performance by removing its limitations, not by making the base model larger. This is done by building smarter systems around it, like adding memory, tools, or step-by-step reasoning frameworks.
Did GPT-5 actually pass the human-level benchmark?
No. Despite viral claims, official leaderboards show GPT-5.2 scoring around 54% on ARC-AGI-2—a significant leap, but still below the ~60% scored by the average human test taker (and far below the near-100% ceiling of human panels). The progress comes from 'unhobbling' techniques, not just the base model's power.
Who is Leopold Aschenbrenner?
He is a former OpenAI researcher known for his detailed 2024 paper, 'Situational Awareness: The Decade Ahead,' which discusses the rapid strategic progression towards AGI and popularizes concepts like 'unhobbling'.