The First 'AGI' Model Is Here.

A startup claims the world's first AGI-capable model, but the real story is how new vision models are already changing everything. Discover why your AI skills are about to become obsolete.


The Bombshell Claim: AGI Is Already Here?

“The world’s first AGI-capable model.” That is how Integral AI introduced its new system this morning, claiming not just another large language model but an architecture that can “autonomously plan, learn, and act across modalities” without task‑specific fine‑tuning. The company says the model handles text, code, images, and live tools in a single loop, and markets it explicitly as AGI-capable, not just “advanced.”

At the center of the announcement sits Integral AI founder Daniel Kwan, a former senior researcher on Google’s Brain and DeepMind teams, where he reportedly worked on large‑scale reinforcement learning and multimodal transformers. Kwan’s résumé—publications on policy‑gradient methods, early work on transformer‑based agents, and stints on internal Gemini prototypes—gives Integral a level of technical credibility most AI startups can’t fake.

Integral claims its system runs a 400‑billion‑parameter backbone with a Mixture‑of‑Experts layout, similar in spirit to Nvidia’s Neotron 3 and other sparse models, but wired into an “agentic controller” that can call tools, browse the web, and operate software interfaces. The company is already demoing the model solving multi‑step spreadsheet audits, refactoring large codebases, and walking through unfamiliar UIs using only screen pixels and text instructions.

Markets reacted instantly but unevenly. On X, several prominent researchers compared the AGI language to earlier overhyped launches, pointing to OpenAI’s and Google’s more cautious phrasing around GPT‑5‑class and Gemini models. Early benchmark snippets Integral shared—MMLU, GSM8K, and custom “knowledge work” suites—show strong but not obviously superhuman scores, feeding a wave of skepticism from academics and independent evaluators.

Investors and enterprise buyers, however, did not dismiss it outright. Tool‑calling agents that can reliably operate real software are exactly what Fortune 500 automation teams want, and Integral claims pilot customers already run the model on finance, legal, and operations workflows. If the demos survive third‑party replication, “AGI-capable” stops being just a slide‑deck adjective and starts to look like a new product category.

That leaves a blunt question hanging over the entire industry: is Integral AI front‑running the term AGI for attention, or did an ex‑Google insider just quietly ship a world‑first system that behaves less like a chatbot and more like a junior colleague?

Decoding 'AGI-Capable': Hype vs. Horizon

Illustration: Decoding 'AGI-Capable': Hype vs. Horizon

Integral AI hangs its “AGI-capable” claim on a narrow, technical idea: a model that can learn autonomously from its environment instead of relying on massive, pre‑curated datasets. In their framing, the system observes raw streams of images, interfaces, documents, and sensor data, then updates its own internal policies on the fly, more like a reinforcement‑learning agent than a static large language model. The company argues that once you can continuously adapt like this, you have the substrate on which artificial general intelligence could emerge.

That definition quietly sidesteps what most researchers mean by AGI. In mainstream AI research, AGI implies human‑level general intelligence: the ability to flexibly understand, plan, and act across almost any domain, with robustness, transfer, and common sense comparable to a person. By that standard, “AGI-capable” sounds more like “architecturally interesting” than “machines are now our cognitive peers.”

Where Integral AI is directionally aligned with the field is in its push toward models that can perceive, reason, and act as unified agents. The company describes a single system that ingests:

- Text, images, and video
- GUI states and API responses
- Possibly real‑world sensor or robot data

and then chooses actions: clicking through interfaces, calling tools, issuing code, or updating a plan. That is the same agentic, multimodal stack companies like OpenAI, Google, and Zhipu (with GLM‑4.6V at 106B parameters plus a 9B Flash variant) are racing to build.
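
In rough code, that perceive‑reason‑act pattern looks something like the sketch below. This is a generic agent loop illustrating the claim, not Integral AI's actual implementation; the tool names and the dummy decision function are placeholders.

```python
from typing import Any, Callable, Dict, List, Tuple

Action = Dict[str, Any]

def agent_loop(decide: Callable[[str, Any, List], Action],
               tools: Dict[str, Callable[..., Any]],
               goal: str, max_steps: int = 20) -> List[Tuple[Action, Any]]:
    """Generic perceive-reason-act loop: observe the environment, pick an
    action, execute it with a tool, and repeat until the agent says 'done'."""
    history: List[Tuple[Action, Any]] = []
    observation = tools["observe"]()               # screenshot, GUI state, API response...
    for _ in range(max_steps):
        action = decide(goal, observation, history)    # click, tool call, code edit, plan update
        if action["name"] == "done":
            break
        result = tools[action["name"]](**action.get("args", {}))
        history.append((action, result))
        observation = tools["observe"]()           # perceive the new state and go again
    return history

# Dummy wiring so the sketch runs end to end.
tools = {"observe": lambda: "blank screen",
         "click": lambda x, y: f"clicked ({x}, {y})"}
decide = (lambda goal, obs, hist:
          {"name": "done"} if hist else {"name": "click", "args": {"x": 10, "y": 20}})
print(agent_loop(decide, tools, "open the settings panel"))
```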

The gap appears when you look at evidence. Integral AI’s public demos so far resemble early‑stage research prototypes: short clips of UI navigation, toy robotics, and constrained puzzle solving, without hard numbers. There are no standardized benchmarks, no head‑to‑head results on suites like MMLU, MMBench, or AgentBench, and no ablation studies showing that autonomous learning beats conventional fine‑tuning.

That disconnect between rhetoric and receipts matters. Claiming the “world’s first AGI-capable model” sets expectations of a GPT‑4‑class system that can robustly handle arbitrary tasks, adapt online, and explain its reasoning. Shipping a handful of underwhelming demos instead suggests a familiar story: the underlying research might be real, but the marketing has already sprinted several laps ahead of the science.

China's Visionary Leap with GLM-4.6V

China’s AI ecosystem just produced a concrete counterpoint to vague “AGI-capable” claims: Zhipu AI’s GLM-4.6V, a multimodal model that already ships with serious visual and reasoning chops. Where Integral AI’s announcement leans on a bold promise of autonomous learning, GLM-4.6V plants a flag in something easier to verify: benchmarks, parameters, and working code.

GLM-4.6V arrives as an open-source multimodal vision-language model that ingests text, images, screenshots, and full document pages in a single pass. It does not just caption images; it parses dense PDFs, cluttered UIs, diagrams, and math plots while keeping long-range context intact.

Zhipu ships two variants aimed at different deployment realities. The full GLM-4.6V clocks in at roughly 106 billion parameters for cloud-scale workloads, while GLM-4.6V-Flash trims down to about 9 billion parameters for low-latency, on-device or edge scenarios.

Both models support context windows in the 128K-token range, which matters for real-world documents that span dozens or hundreds of pages. That capacity enables tasks like end-to-end contract review, technical paper analysis, or multi-screen app walkthroughs without chopping content into lossy fragments.

On benchmarks, Zhipu pitches GLM-4.6V as state-of-the-art among open visual language models at similar parameter scales. Internal and third-party tests highlight strong scores on:

- Document understanding
- Screenshot and GUI analysis
- Diagram and chart interpretation
- Visual question answering and math reasoning

What sets GLM-4.6V apart from many Western rivals is its native joint reasoning across modalities. You can feed a screenshot, a scanned form, and a text query together, and the model tracks layout, text, and visual cues as a single reasoning problem instead of bolting OCR on top of an LLM.

That design makes GLM-4.6V a credible open competitor to Google’s Gemini vision stack and OpenAI’s GPT-4.1/4.2V tier. Developers get a model they can self-host, fine-tune, and wire into agents for UI automation, enterprise search, or compliance workflows without surrendering everything to closed APIs.

Why Your Prompts Are About to Become Obsolete

Prompts are quietly turning into legacy UI. Models like GLM‑4.6V do not just read your words; they see your screen, parse your PDFs, and track structure across 100,000+ tokens of mixed text and images. That changes what you “say” to an AI from verbose prose to something closer to a product spec.

Instead of crafting a paragraph-long request, you hand the model a screenshot of your analytics dashboard and type: “Automate this based on monthly trends and send me anomalies on Slack.” GLM‑4.6V can inspect the chart axes, legend, filters, and even UI chrome to infer the underlying data model. Your text becomes a goal, and the screenshot becomes the context the model actually reasons over.

The key enabler is native multimodal function calling. Rather than forcing you to OCR an image or manually describe a layout, GLM‑4.6V passes raw images, diagrams, or document pages directly into tools and agents. A single call can bundle:

- A 20‑page scanned contract
- A product screenshot
- A short text instruction

That package flows through a toolchain that can search, rewrite, execute code, or trigger external APIs, all grounded in what the model “saw.”
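
As a rough sketch of what such a call could look like, here is a request that pairs a dashboard screenshot with a short instruction and exposes one callable tool. It assumes GLM‑4.6V is served behind an OpenAI‑compatible chat endpoint; the base URL, model name, and `send_slack_alert` tool are illustrative placeholders, not Zhipu's documented API.

```python
import base64
from openai import OpenAI

# Assumption: the provider exposes GLM-4.6V behind an OpenAI-compatible endpoint.
client = OpenAI(base_url="https://example-provider.com/v1", api_key="YOUR_KEY")

def as_data_url(path: str) -> str:
    """Read a local screenshot and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Hypothetical tool the model may call when it spots an anomaly.
tools = [{
    "type": "function",
    "function": {
        "name": "send_slack_alert",
        "description": "Post an anomaly summary to a Slack channel",
        "parameters": {
            "type": "object",
            "properties": {"summary": {"type": "string"}},
            "required": ["summary"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # illustrative model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Automate this based on monthly trends and send me anomalies on Slack."},
            {"type": "image_url",
             "image_url": {"url": as_data_url("dashboard.png")}},
        ],
    }],
    tools=tools,
)
# The reply is either a direct answer or a tool call the agent runtime executes.
print(response.choices[0].message)
```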

Prompt engineering, as a craft of elaborate incantations, starts to look outdated. You no longer need to spell out, “In the top‑right card labeled ‘MRR,’ identify month‑over‑month deltas…” when the model can visually locate the MRR widget and read its numbers. The hard part shifts from phrasing to scoping: defining constraints, data sources, permissions, and acceptable failure modes.

Interaction moves from chatty back‑and‑forth to goal-setting for autonomous agents. You point at a Figma board and say, “Turn this flow into a working onboarding experience and wire it to our Stripe sandbox.” The agent uses GLM‑4.6V’s vision stack to understand layout, hierarchy, and copy, then calls code tools, design systems, and deployment pipelines without you narrating every step.

As models get better at joint visual‑text reasoning, prompts become more like mission briefs. You supply artifacts—screenshots, whiteboard photos, dashboards—and a concise objective. The system handles the translation from what you show it to what needs to run.

The Economics of AI Just Flipped

Illustration: The Economics of AI Just Flipped

High-end multimodal AI currently punishes anyone who touches video. Frontier APIs from OpenAI, Anthropic, and Google charge per token, and video pipelines explode token counts: every frame or sampled keyframe becomes tokens, every caption and transcript chunk adds up. Run a few hours of 1080p footage through GPT‑4o or Claude 3.5 Sonnet and you can watch your bill jump into the hundreds of dollars.

GLM‑4.6V attacks that problem from two angles: open weights and aggressive pricing. Zhipu AI offers the 106B‑parameter cloud model at rates that undercut Western rivals by a wide margin, with some Chinese providers quoting under $0.30 per million input tokens and $0.90 per million output. When you are chewing through tens of millions of tokens per day on surveillance feeds, UI recordings, or customer support screen captures, that delta becomes a budget line.

Then there is GLM‑4.6V‑Flash, the 9B‑parameter sibling tuned for local and edge deployment. Teams can run it on a couple of high‑end GPUs or a well‑specced workstation, pay once for hardware, and process essentially unlimited screenshots, PDFs, and diagrams. For continuous workloads—security cameras, industrial monitoring, gameplay analytics—local inference flips the economics from per‑call rent to fixed‑cost infrastructure.
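
To see how quickly those per-token rates compound against a one-off hardware spend, a back-of-the-envelope comparison helps. The rates below are the figures quoted above; the daily token volumes and GPU box price are illustrative assumptions.

```python
# Back-of-the-envelope cost comparison: metered API vs. local GLM-4.6V-Flash.
INPUT_RATE = 0.30 / 1_000_000   # $ per input token (quoted figure)
OUTPUT_RATE = 0.90 / 1_000_000  # $ per output token (quoted figure)

daily_input_tokens = 50_000_000   # assumed: continuous screenshot / UI-recording feed
daily_output_tokens = 5_000_000   # assumed

api_cost_per_day = daily_input_tokens * INPUT_RATE + daily_output_tokens * OUTPUT_RATE
api_cost_per_year = api_cost_per_day * 365

local_hardware_cost = 8_000  # assumed one-off cost for a workstation-class GPU box

print(f"API:   ${api_cost_per_day:,.2f}/day, ${api_cost_per_year:,.2f}/year")
print(f"Local: ${local_hardware_cost:,.2f} one-off (plus power and maintenance)")
```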

This price pressure lands in a market where OpenAI and Anthropic still behave like premium SaaS vendors. Their multimodal tiers bundle:

- Higher per‑token prices for image and video inputs
- Strict rate limits
- Opaque overage policies

GLM‑4.6V and similar models from Qwen, LLaVA, and NVIDIA NeMo invite another strategy: own the stack, rent only when you must. That undercuts incumbents on large, predictable workloads and relegates proprietary APIs to niche, “only if we need frontier performance” roles.

Cheaper, powerful vision‑language models also change who gets to build complex AI systems. A startup in Jakarta can fine‑tune GLM‑4.6V‑Flash on local invoices and shipping forms without a seven‑figure API budget. A two‑person indie studio can ship an in‑game coach that reads your HUD and minimap in real time, running entirely on the player’s PC.

As multimodal models become both accessible and good enough, the constraint shifts from money to imagination. The next wave of AI products—autonomous UI testers, always‑on factory inspectors, document‑native copilots—no longer belongs exclusively to companies that can afford frontier tokens at scale.

Nvidia's Quiet Revolution: Power on Your PC

Nvidia’s latest move toward local AI power is Neotron 3, a 30B-parameter Mixture-of-Experts language model with open weights. Built for speed and efficiency, it targets the gap between tiny on-device models and cloud-bound frontier systems. Nvidia claims Neotron 3 outperforms other ~30B models like GPT-4.1-OSS and Qwen 3 30B on standard benchmarks while staying lean enough for practical deployment.

Mixture-of-Experts, or MoE, flips the usual dense-model economics. Instead of activating every parameter for every token, Neotron 3 uses 128 experts with only 6 active per token, so most of the 31.6B parameters stay idle on any given step. You get the capacity of a much larger model with the compute footprint of something closer to a mid-size LLM.
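
A toy version of that routing step, using the 128-experts / 6-active numbers from above, looks like the sketch below. This is generic top-k MoE gating for illustration, not Nvidia's actual Neotron 3 code.

```python
import numpy as np

# Toy top-k Mixture-of-Experts routing: 128 experts, only 6 active per token.
NUM_EXPERTS, TOP_K, HIDDEN = 128, 6, 256

rng = np.random.default_rng(0)
router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))         # router projection
experts = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN))  # one weight matrix per expert

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w                               # score every expert for this token
    top = np.argsort(logits)[-TOP_K:]                       # keep only the 6 highest-scoring
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the winners
    # Only the selected experts run; the other 122 stay idle for this token.
    return sum(w * (token @ experts[i]) for w, i in zip(gate, top))

out = moe_layer(rng.normal(size=HIDDEN))
print(out.shape)  # (256,)
```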

That architecture matters if you want strong AI running directly on your own hardware. MoE lets Neotron 3 hit high throughput on modern GPUs while keeping latency low enough for interactive use: coding assistants, local copilots, or private document chat that never leaves your machine. You trade a bit of absolute frontier performance for predictable, controllable speed.

Privacy and sovereignty sit at the center of this shift. A model like Neotron 3 can live on a workstation, an edge server, or a small business NAS, handling:

- Sensitive contracts and emails
- Source code and build logs
- Internal analytics and dashboards

No prompts or embeddings need to transit a vendor’s cloud. That stands in sharp contrast to cloud-only “world’s first AGI-capable” claims from players like Integral AI, which pitch massive centralized systems instead of personal infrastructure; see “Integral AI Unveils World’s First AGI-Capable Model” (Business Wire) for that vision.

Neotron 3 signals where Nvidia thinks the market goes next: not just hyperscale data centers, but PC-class AGI-era tooling, where individuals and small teams run serious models locally, on their own terms.

GPT-5.2's Surprising Pivot to 'Economic Value'

GPT‑5.2 landed with a thud for a lot of everyday users. Social feeds filled with side‑by‑side comparisons calling it “mid,” “regressed,” or “no better than 5.1” for creative writing, coding tricks, or casual chat. Yet inside enterprises, early adopters quietly reported something different: a model that suddenly felt eerily competent at knowledge work.

OpenAI’s own charts explain the disconnect. Instead of chasing marginal gains on academic benchmarks, GPT‑5.2 spikes on GDP‑V—short for “Gross Domestic Product‑Valuable,” a synthetic benchmark that measures how well a model performs economically useful tasks. On that axis, OpenAI claims GPT‑5.2 roughly doubles 5.1’s score, one of the largest single‑generation jumps they have shown.

GDP‑V tests the stuff that actually shows up on a balance sheet: drafting RFPs, structuring reports, wrangling messy spreadsheets, and turning vague bullet points into executive‑ready decks. GPT‑5.2 reflects that bias. It is tuned to build PowerPoint presentations from raw briefs, clean and reconcile data in Excel, and reason through multi‑step business workflows with fewer hallucinations and less hand‑holding.

Creative writing, quirky brainstorming, and open‑ended chat feel flatter because they were not the target. Users who treat GPT‑5.2 like a more powerful GPT‑4 for fiction, fan art prompts, or philosophical back‑and‑forths run straight into its new personality: more conservative, more literal, more “consultant” than “co‑writer.” For a CFO, that is a feature. For a novelist, it feels like a downgrade.

This pivot exposes where the market has moved. Frontier models now cost tens of millions of dollars to train and run; they cannot justify that burn rate on free chatbots and bedtime stories. OpenAI is explicitly optimizing for sectors that move GDP: finance, consulting, legal, operations, enterprise software, and internal automation.

You can see the strategic lock‑in forming. A model that is world‑class at:

  • PowerPoint and board packs
  • Excel modeling and scenario analysis
  • Policy, contract, and compliance workflows

slots directly into Microsoft 365, customer CRMs, and internal tools. GPT‑5.2 is less a general‑purpose chatbot upgrade and more a signal that the “world’s first AGI‑capable” race now runs through quarterly earnings.

The Rise of AI Super-Agents

Illustration: The Rise of AI Super-Agents

Power is shifting from raw models to the super-agents wrapped around them. Manis 1.6 and Poetic show how thin layers of orchestration, memory, and self-critique can turn generic LLMs into systems that look suspiciously like autonomous coworkers rather than chatbots waiting for prompts.

Manis 1.6 leans into this by chaining multiple tools and sub-agents around a base model. It breaks a request into atomic tasks, routes each to specialized routines, and then fuses the results, so “research this market and draft a launch plan” becomes hours of automated browsing, clustering, and writing with minimal human steering.

Poetic goes even further on the reasoning front. Built on top of existing LLMs, it smashed the ARC-AGI benchmark not by training a new frontier model, but by adding a clever reasoning scaffold and self-auditing loop that forces the system to test and refine its own hypotheses before committing to an answer.

ARC-AGI is notoriously hostile to pattern-matching; it demands abstract reasoning over small visual puzzles. Poetic wraps the base model in a process that (see the sketch after this list):

- Enumerates candidate rules
- Simulates each rule on examples
- Discards inconsistent hypotheses
- Iterates until a passing rule set emerges
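
A skeletal version of that enumerate-simulate-filter loop is shown below. The `propose_rules` callable stands in for the LLM asked to hypothesize transformations; the grid types and the loop structure illustrate the pattern, not Poetic's actual architecture.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Rule = Callable[[Grid], Grid]

def solve_arc_task(train_pairs: List[Tuple[Grid, Grid]],
                   propose_rules: Callable[[List[Tuple[Grid, Grid]]], List[Rule]],
                   max_rounds: int = 5) -> Optional[Rule]:
    """Enumerate candidate rules, simulate each one on the training pairs,
    discard any that fail, and iterate until a consistent rule survives."""
    for _ in range(max_rounds):
        for rule in propose_rules(train_pairs):                    # enumerate hypotheses
            if all(rule(inp) == out for inp, out in train_pairs):  # simulate and filter
                return rule                                        # consistent rule found
        # In a real scaffold, the failures would be summarized and fed back
        # into the next round of proposals instead of simply retrying.
    return None

# Tiny usage example with a trivial "identity" proposer.
identity: Rule = lambda g: [row[:] for row in g]
found = solve_arc_task([([[1, 2]], [[1, 2]])], lambda pairs: [identity])
print(found is identity)  # True
```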

That architecture pushed Poetic’s ARC-AGI performance well beyond typical LLM baselines, hinting that AGI-capable behavior may come from better “brains around the brain,” not just bigger parameter counts. Product design choices — how you decompose tasks, verify outputs, and let agents call tools — start to matter as much as the underlying weights.

This is why “AGI is likely to come out of product design” feels less like a slogan and more like a roadmap. Agentic scaffolding turns static models into systems that plan, remember, and self-correct, from retrieval-augmented research agents to code refactorers that run tests, bisect failures, and patch regressions on their own.

Users already experience this as autonomous work, not conversation. Poetic-style agents chew through benchmark suites and eval harnesses; Manis-like platforms manage multi-hour workflows that span browsers, CLIs, and cloud APIs, then hand you a finished report, dashboard, or codebase diff.

Tied to models like GLM-4.6V and Neotron 3, these super-agents can see, read, and act across screenshots, PDFs, and local files without constant prompting. The chatbot UI becomes a job ticket: you describe the outcome, the agent decomposes, executes, audits, and only bothers you when a real decision needs a human.

Sorting Signal from Noise in the AI Gold Rush

Marketing departments shout about AGI-capable models; engineers quietly ship systems that actually change workflows. GLM-4.6V, Neotron 3, and agentic platforms like Poetic all point in the same direction: practical, automated, multimodal AI that behaves less like a chatbot and more like infrastructure.

Multimodal capability now means more than “can see images.” GLM-4.6V ingests screenshots, PDFs, and diagrams alongside text, runs long-context reasoning over 100K+ tokens, and drives agents that click through UIs or parse entire contracts. Prompting shrinks from paragraphs of instructions into a single high-level goal the system decomposes on its own.

At the same time, efficient local models are breaking cloud AI’s monopoly. NVIDIA’s Neotron 3 squeezes a 30B-parameter Mixture-of-Experts model into hardware budgets that used to cap out at 7B, with 128 experts and only 6 active per token. GLM-4.6V-Flash pushes vision-language reasoning into a 9B-parameter package that can sit on a workstation or edge box instead of a hyperscaler GPU farm.

Agentic stacks ride on top of this substrate. Systems like Manis 1.6 or Poetic orchestrate multiple models, tools, and retrieval pipelines into persistent “AI super-agents” that remember context, schedule tasks, and operate across apps. The leap in value comes less from a single IQ jump in a base model and more from wiring those models into tools, memory, and autonomy.

Contrast that with the splashy “world’s first AGI” headlines. Integral AI’s “world’s first AGI-capable” claim and similar pitches, like the startup profiled in “Ex-Google veteran’s startup claims to have built world-first AGI model,” remain largely unverified narratives. GLM-4.6V’s benchmark wins, Neotron 3’s efficiency numbers, and GPT-5.2’s GDP-value focus are measurable.

The industry is still far from general intelligence that can learn any task the way a human does. It is much closer to something more commercially explosive: stacked, automated, multimodal systems that quietly turn “use an AI” into “AI just did it.”

Your Next Move in the New AI Landscape

Start by getting your hands dirty with the new open-source multimodal stack. Spin up GLM‑4.6V‑Flash (9B) locally via Ollama or vLLM, and pair it with an open visual encoder like SigLIP or CLIP to prototype screenshot agents, PDF readers, and GUI bots without burning through GPT‑5.2 tokens at $10+ per long video or document job.
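
As a starting point, the snippet below assumes you have already launched a local OpenAI-compatible server for the weights (for example with vLLM's `vllm serve` command) and simply talks to it from Python. The checkpoint path and the registered model name are placeholders for whatever you actually pull.

```python
# Assumes a local OpenAI-compatible server is already running, e.g.:
#   vllm serve <path-or-hub-id-of-your-GLM-4.6V-Flash-checkpoint> --port 8000
import base64
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

with open("invoice.png", "rb") as f:  # any screenshot or scanned page
    image_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

resp = local.chat.completions.create(
    model="glm-4.6v-flash",  # placeholder: use whatever name the server registered
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract every field on this invoice as JSON."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(resp.choices[0].message.content)
```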

Developers should redesign inputs around files, not chat boxes. Build flows where users drag in:

- 200‑page PDFs
- Figma exports
- Excel screenshots
- Short video clips

Then let the model handle layout, tables, and diagrams directly instead of forcing users to copy‑paste text.

Tech leaders need to stop thinking “one model, one prompt” and start thinking model orchestration. For a production workflow, wire together a small local model (Neotron 3 at 30B parameters) for cheap routing and classification, a stronger cloud model for hard reasoning, and specialized tools for search, RAG, and code execution.
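
A skeletal router for that pattern might look like the sketch below: a small local model labels each task, cheap work stays local, and hard reasoning escalates to a cloud model. The endpoints, model names, and the one-word difficulty heuristic are all stand-ins for whatever you actually deploy.

```python
from openai import OpenAI

# Placeholders: a local OpenAI-compatible server and a hosted cloud API.
local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
cloud = OpenAI(api_key="YOUR_CLOUD_KEY")

LOCAL_MODEL = "neotron-3-30b"  # illustrative names, not official identifiers
CLOUD_MODEL = "gpt-5.2"

def classify(task: str) -> str:
    """Ask the small local model whether the task is 'easy' or 'hard'."""
    r = local.chat.completions.create(
        model=LOCAL_MODEL,
        messages=[{"role": "user",
                   "content": f"Label this task 'easy' or 'hard'. Reply with one word.\n\n{task}"}],
    )
    return r.choices[0].message.content.strip().lower()

def run(task: str) -> str:
    """Route cheap work to the local model, escalate hard reasoning to the cloud."""
    client, model = (cloud, CLOUD_MODEL) if "hard" in classify(task) else (local, LOCAL_MODEL)
    r = client.chat.completions.create(model=model,
                                       messages=[{"role": "user", "content": task}])
    return r.choices[0].message.content

print(run("Rename these CSV headers to snake_case: 'Order ID', 'Ship Date'"))
```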

If you run a startup, your moat is no longer “we use GPT‑5.2.” Your moat is the agentic system design: how your stack breaks problems into steps, chooses tools, calls models, and recovers from failure. Instrument every agent with logging, traces, and per‑step cost so you can see why a workflow costs $0.03 or $3.
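
One lightweight way to get that visibility is to record a trace entry per agent step with its model, token counts, latency, and derived cost. The rates and step names below are illustrative; swap in your real contract prices and instrumentation hooks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Illustrative per-million-token rates (input, output); use your real prices.
RATES: Dict[str, Tuple[float, float]] = {"local-small": (0.0, 0.0),
                                         "cloud-large": (2.50, 10.00)}

@dataclass
class Step:
    name: str
    model: str
    input_tokens: int
    output_tokens: int
    seconds: float

    @property
    def cost(self) -> float:
        rin, rout = RATES[self.model]
        return (self.input_tokens * rin + self.output_tokens * rout) / 1_000_000

@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    def log(self, **kwargs) -> None:
        self.steps.append(Step(**kwargs))

    def report(self) -> None:
        for s in self.steps:
            print(f"{s.name:<18} {s.model:<12} {s.seconds:5.2f}s  ${s.cost:.4f}")
        print(f"{'TOTAL':<18} {'':<12} {'':>6}   ${sum(s.cost for s in self.steps):.4f}")

trace = Trace()
trace.log(name="classify_ticket", model="local-small",
          input_tokens=1_200, output_tokens=10, seconds=0.4)
trace.log(name="draft_response", model="cloud-large",
          input_tokens=6_000, output_tokens=900, seconds=3.1)
trace.report()  # shows why this run cost cents, not dollars
```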

Enthusiasts should deliberately practice beyond prompt engineering. Clone a repo like AutoGen, CrewAI, or an open Poetic‑style agent, then swap in GLM‑4.6V for vision and a local Neotron 3 instance for text to see how multi‑agent coordination actually behaves under load.

Rethink every workflow that still assumes text-only input. Contract review means annotated PDFs, not pasted clauses. Customer support means logs, screenshots, and call transcripts. Analytics means CSVs, dashboards, and chart images, all fed into one multimodal context window.

Staying ahead now means you understand how to:

- Choose the right open model for cost and latency
- Design agents that call tools, browse, and plan autonomously
- Tune guardrails, memory, and feedback loops

Prompt engineering becomes a small part of a larger job: architecting systems that can watch, read, decide, and act.

Frequently Asked Questions

What is an 'AGI-capable' model?

'AGI-capable' is a term used to describe AI systems that can learn new tasks autonomously without pre-existing datasets, particularly in robotics or agentic settings. It is distinct from true AGI, which implies human-level intelligence across all cognitive tasks.

How does GLM-4.6V change AI prompting?

GLM-4.6V changes prompting by moving beyond text. Its native multimodal tool-calling allows users to provide images, documents, and screenshots directly as context, enabling the AI to 'see' and act on visual information without manual text descriptions.

Why are local LLMs like NVIDIA's Neotron 3 important?

Local LLMs are important for privacy, speed, and cost-control. By running on-device, they keep sensitive data from being sent to the cloud, reduce latency, and eliminate API-based token costs for frequent use.

What is the significance of Poetic beating the ARC-AGI benchmark?

Poetic's success shows that breakthroughs aren't just about bigger models, but smarter architecture. By building a reasoning and self-auditing layer on top of existing LLMs, it achieved superior performance at less than half the cost, proving the power of agentic scaffolding.

Tags

#AGI, #LLM, #Multimodal AI, #Open Source, #AI News
