AI's Brain Just Got a Massive Upgrade

The era of simple AI chatbots is over. A new wave of models that can 'think' continuously is here, and it’s about to change everything from science to robotics.

The Architect of Transformers Says It's Time to Move On

One of the architects of the modern AI boom now wants to retire his own creation. Llion Jones, one of the eight authors behind Google’s 2017 “Attention Is All You Need” paper, is arguing that the transformer era is running out of road and that it is “time to move beyond transformers.” From his new perch as CTO and co‑founder of Sakana AI, Jones is backing a radically different architecture called Continuous Thought Machines that treats thinking as a process, not a single shot.

Transformers turned next‑word prediction into a universal interface, powering GPT‑4, Gemini, Claude, and almost every major large language model. But simply scaling them—more parameters, more data, more GPUs—has started to hit diminishing returns: recent “limits at scale” work suggests marginal gains shrink even as training costs explode into the tens or hundreds of millions of dollars per frontier model. The core criticism: these systems still struggle with multi‑step reasoning, exhibit brittle logic, and falter on tasks that require planning over long horizons rather than regurgitating patterns.

That critique carries a different weight coming from someone who helped design attention in the first place. When an original transformer architect says the field needs new blueprints, it signals that major labs are already hunting for post‑transformer paradigms instead of assuming scaling curves will bail them out. Jones and Sakana are betting on neuroevolution and dynamical systems—searching for networks that evolve internal state over time, closer to how biological brains operate.

Continuous Thought Machines, as described in Sakana’s work, give each “neuron” a tiny memory and local update rule, then let thousands of these mini‑brains interact over many internal steps. Instead of a single forward pass from prompt to answer, the model runs internal “ticks” where it revisits the problem, refines intermediate representations, and can even change its mind before emitting an output. That shift turns computation from static pattern matching into an ongoing process.
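
To make those internal “ticks” concrete, here is a minimal NumPy sketch of the idea, under stated assumptions: the weight names, the leaky update rule, and the readout are illustrative inventions, not Sakana’s actual CTM code. Each neuron carries a small persistent state vector, and the network revisits the same input many times before anything is read out.

```python
import numpy as np

# Illustrative sketch only, not Sakana's CTM implementation.
# Every neuron keeps a small persistent state vector ("memory")
# that is updated over many internal ticks on the same input.
rng = np.random.default_rng(0)
n_neurons, state_dim, n_ticks = 256, 4, 32

W_in = rng.normal(scale=0.1, size=(n_neurons, state_dim))    # input drive per neuron
W_rec = rng.normal(scale=0.1, size=(n_neurons, n_neurons))   # neuron-to-neuron coupling
state = np.zeros((n_neurons, state_dim))                      # per-neuron memory

def tick(state, x):
    """One internal thinking step: each neuron updates its own memory using
    the fixed input and the current activity of every other neuron."""
    activity = np.tanh(state.mean(axis=1))                    # one scalar per neuron
    drive = x[:, None] * W_in + (W_rec @ activity)[:, None]   # input + recurrent influence
    return 0.9 * state + 0.1 * np.tanh(drive)                 # slow, persistent update

x = rng.normal(size=n_neurons)      # a fixed encoding of "the problem"
for _ in range(n_ticks):            # the model mulls the same input repeatedly
    state = tick(state, x)

answer = state.mean()               # read out only after the internal ticks finish
```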

This is the emerging fault line: models that just predict the next token versus systems that process information over time. Jones’s pivot marks the start of a new race to build AI that doesn’t just autocomplete our sentences, but actually thinks between them.

Sakana AI's Radical Bet: The Continuous Thought Machine


Sakana AI is betting that Continuous Thought Machines are what comes after transformers. Co‑founded by Llion Jones, one of the eight authors behind “Attention Is All You Need,” the Tokyo‑based startup just raised a Series B to pursue CTM as a clean break from the architecture that powered GPT‑4, Gemini, and Claude.

Instead of firing once and forgetting, CTM treats thinking as an ongoing internal process. A standard transformer runs a single forward pass over your prompt, produces an output token, and then discards almost all internal state; CTM keeps “mulling” over a problem, updating its internal dynamics across many small steps before it commits to an answer.

Each CTM “neuron” behaves less like a dumb multiplier and more like a mini‑brain with its own memory. Neurons carry a tiny state vector that persists over time, so they can remember what happened a few ticks ago, update themselves, and influence future computation based on that evolving history.

Sakana’s paper describes the model as a synchronized swarm of these stateful units. Instead of treating activations as one‑off numbers, CTM tracks how neuron activities rise and fall together; those synchronization patterns—who “dances” in phase with whom—become the core representational currency, analogous to rhythmic firing in biological neural circuits.

That makes CTM fundamentally different from the stateless neurons in today’s transformer stacks. Mainstream LLMs fake deliberation by stacking more layers or sampling more tokens, but each layer still just computes f(x) and moves on; no individual unit carries a memory of its own past behavior.

CTM also bakes in explicit “thinking time.” The system can run for a variable number of internal ticks—short for easy tasks, longer for hard ones—before exposing an output, mirroring how humans take extra cycles on a tricky maze or math puzzle.

Sakana frames this not as a performance tweak but as a wholesale reimagining of what a model is. Instead of bigger feed‑forward bricks, CTM proposes a continuously evolving dynamical system as the basic substrate of machine reasoning.

Inside CTM: Neurons with Memories and Minds of Their Own

CTM starts by redefining what a neuron is allowed to be. Instead of a simple “I saw this, I output that” unit, each CTM neuron carries its own internal state—a tiny scratchpad that persists across time steps. Thousands of these mini-brains update their memories every tick, like tiny creatures keeping diaries of what they just saw and what they expect to see next.

Those diaries matter because CTM does not think in single snapshots. The model runs through multiple internal ticks, updating each neuron’s state again and again before committing to an answer. Hard problems trigger more ticks, so the system effectively chooses how long to think, rather than being locked to one forward pass per input.

Representation looks different too. Instead of treating meaning as a static vector, CTM encodes its “thoughts” in how neuron activities rise and fall together over time—synchronization as representation. When two neurons’ activations pulse in lockstep, CTM treats that coordinated rhythm as a sign they are jointly encoding some concept.

Picture a stadium of dancers performing a tightly choreographed routine. Any single dancer’s pose means little; the meaning emerges from who moves with whom and when. CTM leans on these temporal patterns of synchrony, using them as the substrate for concepts, plans, and intermediate reasoning steps.
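
One hedged way to picture “synchronization as representation” in code: record each neuron’s activity across the internal ticks and hand a downstream readout the pairwise correlation matrix rather than any single activation snapshot. The function below is an illustrative sketch, not Sakana’s actual readout mechanism.

```python
import numpy as np

def synchronization_matrix(traces):
    """traces: (n_ticks, n_neurons) history of neuron activities.
    Returns an (n_neurons, n_neurons) matrix of pairwise correlations,
    i.e. who rises and falls together over the internal ticks."""
    centered = traces - traces.mean(axis=0, keepdims=True)
    normed = centered / (np.linalg.norm(centered, axis=0, keepdims=True) + 1e-8)
    return normed.T @ normed

rng = np.random.default_rng(1)
traces = rng.normal(size=(32, 256))               # stand-in for a recorded tick history
sync = synchronization_matrix(traces)             # "who dances in phase with whom"
features = sync[np.triu_indices_from(sync, k=1)]  # e.g. flatten the upper triangle for a readout
```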

Getting neurons to behave like this is not something you script by hand. Sakana AI leans on neuroevolution, using evolutionary algorithms to search over neuron update rules, connectivity patterns, and dynamical behaviors. Instead of pure gradient descent sculpting a fixed architecture, evolution proposes weird new mini-brain designs, and only the most capable survive.

That is a sharp break from mainstream large language models, where almost everything—from attention patterns to layer shapes—flows from gradient descent on a transformer stack. Here, gradient descent becomes one tool inside a larger search process that can mutate, recombine, and discard neuron behaviors wholesale. The result is a zoo of specialized neuron types with surprisingly rich dynamics.
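
The generic loop behind that kind of search is easy to sketch, with the caveat that the toy fitness function, Gaussian mutations, and truncation selection below are assumptions for illustration rather than Sakana’s setup: score candidate update rules, keep the best, mutate them, and repeat.

```python
import numpy as np

# Toy neuroevolution loop (illustrative assumptions throughout):
# each "genome" is just two parameters of a leaky neuron update rule.
rng = np.random.default_rng(2)

def fitness(params):
    """Score how well a (leak, gain) update rule tracks a noisy target."""
    leak, gain = params
    state, target, score = 0.0, 1.0, 0.0
    for _ in range(100):
        x = target + rng.normal(scale=0.5)                  # noisy observation
        state = (1 - leak) * state + leak * np.tanh(gain * x)
        score -= (state - target) ** 2                       # penalize tracking error
    return score

population = [rng.uniform(0.01, 1.0, size=2) for _ in range(32)]
for generation in range(20):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:8]                                     # only the most capable survive
    children = [np.clip(p + rng.normal(scale=0.05, size=2), 0.01, 2.0)
                for p in parents for _ in range(3)]          # mutated offspring
    population = parents + children                          # next generation of rules

best_rule = max(population, key=fitness)
```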

This shift toward dynamic, stateful computation echoes broader work on continual and nested learning coming out of Google and others. Readers tracking these trends can check out Google’s November roundup, “The latest AI news we announced in November,” for how major labs are also probing architectures that think over time rather than in one-shot bursts. Together, they point toward AI systems that feel less like static calculators and more like evolving, always-on thought processes.

Why 'Thinking Longer' Unlocks Deeper Reasoning

Brains get more interesting when they stop answering instantly and start looping. Continuous Thought Machines build that loop in at the hardware-of-thought level, giving the model explicit “internal ticks” where it can update its own hidden state, reconsider partial plans, and only then speak. Those ticks look a lot like a clock cycle for cognition: discrete, countable reasoning steps that run entirely inside the network, without emitting intermediate text or tool calls.

Each tick advances the internal dynamics of thousands of tiny stateful neurons. Instead of a single forward pass from input to output, CTM runs the same neural circuitry over and over, letting information propagate, settle, and sometimes reverse. More ticks literally mean more thinking time, and the system can dial that up for harder problems, just as humans linger on a tricky puzzle.

That extra runway shows up most clearly on tasks where transformers usually hit a wall. In maze-solving experiments, CTM agents can plan paths through mazes larger than any they saw during training, effectively extrapolating their strategy instead of memorizing layouts. Each internal tick lets the model mentally “walk” a few more steps, backtrack from dead ends, and propagate constraints across the grid.

Standard transformers struggle here because they compress the entire maze and solution into one or two passes of attention. Context length and parameter count become hard limits. CTM’s iterative loop decouples depth of reasoning from model size: a small network can still take 50, 100, or 500 ticks if the problem demands it, trading time for insight.
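
Here is what “trading time for insight” can look like, sketched under an assumed settle-or-budget halting rule (CTM’s actual stopping behavior is learned, not hard-coded like this): the same small circuit keeps ticking until its state stops changing, so easy inputs halt early and harder ones consume more of the budget.

```python
import numpy as np

def think(state, x, step_fn, max_ticks=500, tol=1e-4):
    """Run internal ticks until the state settles or the budget runs out.
    Returns the final state and the number of ticks actually spent."""
    for t in range(max_ticks):
        new_state = step_fn(state, x)
        if np.linalg.norm(new_state - state) < tol:   # thoughts have stabilized
            return new_state, t + 1
        state = new_state
    return state, max_ticks

# Toy step function: relax toward a transformed view of the input.
step = lambda s, x: 0.9 * s + 0.1 * np.tanh(x)
final_state, ticks_used = think(np.zeros(16), np.ones(16), step)
print(ticks_used)   # harder (slower-settling) inputs would consume more ticks
```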

Researchers also pushed CTM on toy algorithmic tasks. The model learned simple algorithms like “flip the answer” rules in math puzzles and sorting numbers into ascending order. Critically, it did this procedurally: numbers move into place over successive ticks, mirroring textbook sorting passes rather than one-shot pattern matching.

That procedural flavor connects CTM directly to the industry’s obsession with deliberate, multi-step reasoning. OpenAI’s o1 family, Google’s “chain-of-thought” prompting, and tool-using agents all bolt extra loops around transformers. CTM bakes the loop into the architecture itself, turning multi-step reasoning from a prompt hack into a first-class computational primitive.

DeepSeek's Efficiency Revolution for Long Context


Radically new brain-inspired architectures like CTM grab the headlines, but a quieter revolution may matter just as much: making today’s transformers dramatically cheaper to scale. That is where DeepSeek Sparse Attention (DSA) comes in, not by replacing transformers, but by hacking away at their most painful bottleneck.

Standard self-attention suffers from brutal math. For a context of N tokens, attention costs scale as O(N²) because every token compares itself to every other token. Push context from 8,000 to 1,000,000 tokens and you don’t just add cost, you explode it by a factor of 15,625.

That quadratic wall kills many dreams about “infinite context” models that remember whole codebases, multi-day chats, or massive research archives. Even with GPU clusters, attending over hundreds of thousands of tokens in full precision drains memory, power, and latency budgets. You can feel that cost every time long-context models slow to a crawl.

DeepSeek’s answer: don’t attend to everything, attend to what matters. DSA bolts a new module, the so-called lightning indexer, onto the transformer stack so each token can quickly triage the past instead of naively re-reading it.

The lightning indexer acts like a per-token search engine. For each new token, it rapidly scans all previous tokens, assigns a relevance score, and selects only the top K candidates for full attention. K stays small and fixed—dozens or hundreds—while N can balloon into the millions.

Think of it as reading only the highlighted notes in a textbook instead of re-reading every line on every page before you answer a question. You still ground your answer in the book, but you skip the irrelevant chapters and margin doodles that would have wasted time.

Under the hood, this turns attention from quadratic to roughly O(N·K), which behaves linearly as long as K stays capped. That shift unlocks extremely long contexts on today’s hardware, making “frontier intelligence” less about throwing more GPUs at the problem and more about being smarter about where models look.
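
A rough NumPy sketch of that top-K pattern follows; the real DSA indexer is a learned, heavily optimized component, and the shapes, names, and toy “cheap features” here are assumptions. The structural point stands: a lightweight scoring pass selects K candidates, and the expensive attention math runs only over that subset.

```python
import numpy as np

def sparse_attention(q, keys, values, index_keys, k=128):
    """q: (d,) query; keys/values: (N, d); index_keys: (N, d_small) cheap features.
    A cheap scoring pass picks the top-K tokens, then full attention runs on them."""
    index_q = q[: index_keys.shape[1]]                  # toy stand-in for an indexer query
    scores = index_keys @ index_q                        # cheap relevance score per past token
    top = np.argpartition(scores, -k)[-k:]               # keep only the top-K candidates
    logits = keys[top] @ q / np.sqrt(q.shape[0])         # exact attention over the subset
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[top]

rng = np.random.default_rng(3)
N, d, d_small = 100_000, 64, 16
keys, values = rng.normal(size=(N, d)), rng.normal(size=(N, d))
index_keys = keys[:, :d_small]                           # stand-in for learned cheap features
out = sparse_attention(rng.normal(size=d), keys, values, index_keys)
```

With K capped at a few hundred, the heavyweight attention step touches only K keys per query no matter how large N grows, which is what makes very long contexts plausible on current hardware.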

Making Million-Token Context a Reality

Million-token context used to sound like a marketing fantasy. DeepSeek Sparse Attention (DSA) turns it into a budgeting question. By making compute and memory scale roughly linearly with sequence length instead of quadratically, DSA slashes the cost of looking back over huge histories, from chat logs to codebases.

Traditional attention makes every token compare itself to every other token. At 128K tokens, that already means more than 16 billion pairwise comparisons per layer; at 1 million tokens, you hit a trillion-plus interactions and hardware falls over. DSA’s lightning indexer short-circuits this by scoring relevance and only attending to the top-K tokens that matter.
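
As a back-of-the-envelope check on those numbers (K here is an assumed selection budget, not a published DeepSeek figure):

```python
# Rough comparison of pairwise-comparison counts per attention layer.
N_small, N_large, K = 128_000, 1_000_000, 128
dense_small = N_small ** 2    # ~1.6e10 comparisons at a 128K-token context
dense_large = N_large ** 2    # ~1.0e12 comparisons at a million tokens
sparse_large = N_large * K    # ~1.3e8 full-attention comparisons with top-K selection
print(f"{dense_small:.1e} {dense_large:.1e} {sparse_large:.1e}")
```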

Linear-ish scaling changes what engineers dare to ship. Context windows of 256K or 512K tokens move from “demo once on an A100 cluster” to “run daily for customers without catching fire.” One-million-token contexts stop being science projects and start looking like a viable SKU for enterprise copilots and research tools.

Entire software repositories can now fit into a single context: every microservice, every migration, every flaky test. A long-context model can trace a bug from a recent stack trace back through years of commits, design docs, and issue threads, and propose a fix that respects all of it. Complex refactors across hundreds of files become a single reasoning pass instead of a fragile chain of prompts.

Reinforcement learning agents benefit even more. With million-token histories, an RL system can condition on:

- Months of gameplay trajectories
- Full trading logs across regimes
- Long-horizon robotics runs with rare failures

That depth lets agents learn from edge cases without truncating away the setup that caused them. Long-context modeling also supercharges scientific assistants like those described in Accelerating Science with GPT-5 – OpenAI, which can keep entire experiment logs, literature reviews, and raw data in active memory. DSA-style efficiency becomes a core enabler for the next wave of context-aware AI agents that reason over whole worlds, not snippets.

GPT-5's New Job: Supercharging Scientific Discovery

GPT-5 is quietly auditioning for a new role: lab partner to some of the smartest humans on the planet. OpenAI’s latest research program drops the model into real labs at Oxford, Cambridge, Harvard, and other top institutions, not to summarize textbooks, but to wrestle with live, unsolved problems.

According to OpenAI’s “Accelerating Science with GPT-5” report, researchers used the model on frontier questions in biology, chemistry, and physics. These were not benchmark puzzles or synthetic tasks; they were the same messy, high-stakes problems that typically soak up months of postdoc time and grant money.

GPT-5’s job description looks less like “robot scientist” and more like super-fast, knowledgeable research partner. Scientists prompted it to propose hypotheses, design experiments, critique methods, and comb through massive literatures that no human can fully track. The model generated candidate mechanisms, suggested alternative controls, and rephrased dense math or proofs into clearer, checkable steps.

OpenAI stresses that humans remained firmly in the driver’s seat. Every GPT-5 suggestion went through domain experts who filtered, corrected, and sometimes discarded its ideas. The system acted as a force multiplier: accelerating literature review, surfacing obscure but relevant papers, and enumerating edge cases that busy researchers might miss.

Early anecdotes from the study read like productivity hacks for the scientific method. One group used GPT-5 to:

- Scan hundreds of papers for conflicting results
- Propose unified explanations for the discrepancies
- Draft new experimental setups to test those explanations

Another team leaned on GPT-5 to explore combinatorial design spaces that explode beyond human working memory—optimizing parameters, materials, or molecular structures across thousands of possibilities. The model did the tedious search; humans decided which directions actually made sense.

Crucially, OpenAI does not pitch GPT-5 as an oracle that “solves science.” Instead, the paper frames it as augmented cognition for labs: a system that collapses days of reading into minutes, generates dozens of plausible next steps, and frees human researchers to spend more time on judgment, intuition, and hands-on experiments.

Unlocking Medical Mysteries and Solving Ancient Math Problems


Science acceleration sounds abstract until GPT-5 starts rewriting lab notebooks and number theory papers in real time.

OpenAI’s own case studies read like speculative fiction. In one experiment, immunologists fed GPT-5 an unpublished chart from a human study: a time series showing a strange spike and crash in a specific immune cell population following treatment. No one on the team had a satisfying mechanistic explanation for the pattern.

GPT-5 didn’t just summarize the chart; it proposed a novel biological mechanism. The model suggested that a transient surge in a particular cytokine could trigger a short-lived expansion of a T cell subtype, followed by exhaustion and contraction, and even pointed to specific signaling pathways and prior papers that fit the curve shape. Researchers flagged the hypothesis, ran follow-up analyses, and later confirmed that the suggested pathway lined up with additional experimental data.

That workflow matters more than the single win. GPT-5 effectively jumped from “data description” to “mechanistic theory,” the step human scientists usually guard as core creative work. OpenAI reports that across multiple biology projects, GPT-5 moved from just cleaning datasets to proposing testable mechanisms, ranking candidate explanations, and suggesting which experiments to run first.

Math provided an even starker example. Two mathematicians working on a decades-old Erdős problem had pushed a combinatorics proof to a stubborn bottleneck. They had a stack of partial arguments and failed lemmas but no clean way through one critical step.

GPT-5 ingested the entire scratchpad: LaTeX proofs, dead-end attempts, and informal notes. Instead of brute-forcing algebra, the model highlighted a hidden symmetry in how a certain extremal configuration behaved under a transformation the authors had treated as irrelevant. That pattern-breaking insight suggested a different induction parameter and a new way to partition the objects in question, which the mathematicians then formalized into a valid proof step.

OpenAI frames this not as “AI proves Erdős,” but as GPT-5 acting like a third collaborator who never gets tired of re-reading the same 40-page draft. The system surfaces non-obvious restructurings that human co-authors then check, repair, or discard.

Versatility shows up outside whiteboards and wet labs too. In robotics, GPT-5 reviewed motion-planning and control algorithms, identified edge cases where safety guarantees silently failed, and proposed alternative formulations that closed those gaps—turning a text model into a roaming bug detector for physical systems.

The New Scientific Method: Human + AI

New workflows start to look less like lone geniuses and more like mixed human–machine labs. Researchers in the GPT‑5 experiments didn’t ask the model for “an answer”; they treated it as a search engine for ideas, running hundreds of candidate hypotheses, tweaks, and edge cases while they steered the overall agenda.

Humans still frame the problem space. They decide which biological pathway matters, which conjecture in number theory is even worth probing, and which experimental knobs the model can touch. That human intuition about what is interesting, plausible, or ethically acceptable does not emerge from gradient descent.

Once the goal is set, GPT‑5 becomes a force multiplier. It rapidly expands the search space: proposing alternate mechanisms for a disease, suggesting unorthodox parameter regimes for an experiment, or surfacing obscure papers across immunology, statistics, and topology that share a hidden structure. Think of it as a tireless postdoc who never stops reading.

A pattern emerges across the medical and math case studies. Humans:

- Specify constraints and success criteria
- Curate data, priors, and domain assumptions
- Interrogate the model’s reasoning line by line
- Decide which outputs justify real‑world experiments

GPT‑5, by contrast, shines when:

- Generating novel hypotheses at scale
- Connecting distant subfields via analogies and shared formalisms
- Stress‑testing ideas with counterexamples and adversarial scenarios
- Automating tedious symbolic or statistical checks

This division of labor assumes expert oversight because the model still hallucinates. GPT‑5 can fabricate citations, overfit to quirks of the prompt, or confidently recommend an experiment that violates a hidden constraint in the underlying biology or math.

Prompt sensitivity also turns into a methodological risk. Slight changes in how a question is posed can swing the model from a correct derivation to a subtle but fatal algebraic or conceptual error, especially in multi‑step chains of thought. Researchers in these projects therefore used strict prompt templates, redundant runs, and cross‑checks with traditional tools.
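
As a hedged illustration of that discipline (the templates, agreement threshold, and `ask_model` stub below are invented placeholders, not the tooling used in these projects), one pattern is to wrap each question in several fixed prompt templates, run each template multiple times, and only accept an answer that stays stable across the variants.

```python
from collections import Counter

TEMPLATES = [
    "State only the final numeric answer. Question: {q}",
    "Work step by step, then give the final numeric answer on the last line. Question: {q}",
    "You are a careful reviewer. Solve the problem and report just the final number. Question: {q}",
]

def ask_model(prompt: str) -> str:
    """Placeholder for a real model call; swap in your own LLM client here."""
    return "42"   # stub answer so the sketch runs end to end

def cross_checked_answer(question: str, runs_per_template: int = 3, agreement: float = 0.7) -> str:
    answers = []
    for template in TEMPLATES:                     # vary the phrasing deliberately
        for _ in range(runs_per_template):         # redundant runs per phrasing
            answers.append(ask_model(template.format(q=question)).strip())
    counts = Counter(answers)
    best, n = counts.most_common(1)[0]
    if n / len(answers) < agreement:               # no stable consensus: flag for human review
        raise ValueError(f"Unstable answers across prompts: {dict(counts)}")
    return best

print(cross_checked_answer("What is 6 * 7?"))
```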

Viewed optimistically, this is a new scientific method: humans supply judgment and values, while systems like GPT‑5 industrialize the generation and falsification of ideas. For more examples of this hybrid workflow across labs, check out The Latest AI News and AI Breakthroughs that Matter Most: 2025.

What These Breakthroughs Mean for 2025

Suddenly, AI progress no longer runs on a single rail. Continuous Thought Machines, Deepseek Sparse Attention, and GPT‑5’s science co‑pilot sketches point to three orthogonal axes of change: new brain‑like architectures, brutal efficiency hacks for long context, and models that stop chatting and start doing real science.

CTM from Sakana AI, driven by transformer co‑author Llion Jones, rips up the “one forward pass, one answer” rule. Its neurons carry their own state, synchronize like oscillators, and iterate through internal ticks until a solution emerges, enabling maze solving, algorithmic sorting, and reinforcement learning agents that think multiple times before acting.

DeepSeek’s Sparse Attention attacks a different bottleneck: cost. Standard attention scales quadratically with sequence length; at 1M tokens that becomes borderline absurd for both memory and FLOPs. DeepSeek’s lightning indexer prunes context down to the top‑K relevant tokens, making million‑token windows behave more like linear‑cost operations instead of a compute explosion.

OpenAI’s GPT‑5 science work shifts the question from “how big is your model?” to “what can it actually discover?” In their own benchmarks, GPT‑5 helped generate hypotheses, design experiments, and debug code for real‑world tasks in biology, chemistry, and mathematics, turning LLMs into collaborators that can close full research loops rather than just autocomplete PDFs.

Taken together, these moves mark a break with the last five years of “just scale it” culture. Architectural bets like CTM, efficiency plays like DSA, and domain‑targeted deployments like GPT‑5‑for‑science signal a more pluralistic strategy: specialized systems, tailored reasoning modules, and workflows where humans and models occupy distinct roles.

Expect the next 6–12 months to be dominated by hybrids. Frontier stacks from OpenAI, Google, and others will likely keep transformers for language but bolt on:

- CTM‑style recurrent modules for long‑horizon reasoning
- Sparse‑attention layers for multi‑million‑token context
- Domain agents tuned specifically for lab work, code, or theorem search

These papers do not read like isolated academic curiosities; they read like roadmaps. CTM sketches a post‑transformer control system, Deepseek shows how to stretch context windows without melting GPUs, and GPT‑5’s science agent outlines how those systems plug into real labs and research groups. Together, they look less like demos and more like blueprints for the next generation of AI infrastructure that will quietly underpin 2025’s biggest breakthroughs.

Frequently Asked Questions

What are Continuous Thought Machines (CTM)?

CTM is a new AI architecture proposed by Sakana AI that moves beyond single-pass transformers. It uses neurons with memory and iterative 'thinking time' to solve problems step-by-step, mimicking human reasoning more closely.

How is CTM different from AI like ChatGPT?

While models like ChatGPT generate responses in a single forward pass per token, CTMs internally refine their thoughts over multiple steps before producing an answer. This allows them to tackle more complex, multi-step reasoning tasks.

Is GPT-5 already being used for scientific research?

Yes, according to an OpenAI paper, a pre-release version of GPT-5 is being used in collaboration with top universities to accelerate real-world research in biology, mathematics, and computer science, acting as an expert research partner.

What makes DeepSeek's new attention mechanism so efficient?

DeepSeek Sparse Attention (DSA) uses a 'lightning indexer' to identify and focus only on the most relevant parts of a long context. This avoids the massive computational cost of standard attention, allowing models to handle million-token contexts far more efficiently.

Tags

#AI Research #Transformers #GPT-5 #Sakana AI #DeepSeek
