OpenAI's Code Red: Garlic is Coming
Sam Altman has declared 'Code Red' as Google's Gemini threatens to dethrone ChatGPT. OpenAI's secret counter-attack, a new model codenamed Garlic, is its last best hope to win the AI war.
Sam Altman Hits the Panic Button
Code Red hit OpenAI like a fire alarm in a data center. Sam Altman told employees the company was going “code red,” a label usually reserved for existential threats, and ordered teams to reorient around one goal: make ChatGPT meaningfully better, fast. Side projects, experimental features, and moonshot bets suddenly took a back seat to shoring up the core chatbot that made OpenAI a household name.
Google’s Gemini 3 created the crisis moment. After a shaky first-generation Gemini rollout, Gemini 3 landed as a brutal rebuttal to the “scaling is over” narrative, posting frontier-level performance and shipping directly into Google’s massive distribution channels. Google quietly jumped from roughly 450 million to around 650 million active Gemini users in a few months, while OpenAI’s own growth, hovering near a billion users, finally started to look mortal instead of inevitable.
Gemini 3 did more than win benchmarks; it flipped the storyline. For the first time, OpenAI looked like the complacent incumbent and Google like the hungry challenger, powered by its TPU fleet and decades of infrastructure work. SemiAnalysis reported that OpenAI hadn’t completed a successful, broadly deployed full-scale pretraining run for a new frontier model since GPT-4o in May 2024, while Google was scaling massive models on custom silicon.
Altman’s Code Red memo reportedly focused less on IQ points and more on experience. He pushed teams to improve personalization, speed, reliability, and the range of questions ChatGPT can confidently answer day to day. Internally, the priority shifted from flashy demos to the unglamorous plumbing that decides whether people actually stick with a chatbot as a default tool.
That pivot marks a quiet but profound strategy change. For years, OpenAI chased headline features: multimodality, agents, voice, app stores, splashy keynotes. Under Code Red, the mandate looks closer to classic platform defense:
- Make ChatGPT feel faster than Gemini 3
- Make it feel more tailored than Gemini 3
- Make it break less often than Gemini 3
OpenAI is no longer just trying to invent the future of AI. Code Red signals a company suddenly forced to defend the present.
The 'Scaling is Dead' Heresy
Scaling heresy started as a whisper and hardened into dogma. Over the past year, Ilya Sutskever, Andrej Karpathy, and Yann LeCun all argued that simply stacking more GPUs and tokens onto existing LLM architectures had hit diminishing returns. Bigger no longer meant smarter; it just meant more expensive.
Researchers pointed to a supposed “wall” in pre‑training. Once models reached GPT‑4 class scale, each extra dollar of compute seemed to buy less capability, especially on hard reasoning and planning tasks. The new consensus: progress now required fresh algorithms, new architectures, and maybe entirely different training paradigms.
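To make the diminishing-returns argument concrete, here is a minimal Python sketch using the loss fit published in the Chinchilla paper (Hoffmann et al., 2022). The constants are that paper's fitted estimates; splitting each 10x of compute evenly between parameters and tokens is a simplifying assumption for illustration, not anyone's actual training recipe.

```python
# Minimal sketch of why "each extra dollar of compute buys less capability".
# Chinchilla loss fit (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the paper's published estimates. Treating compute as
# C ~= 6 * N * D and growing params and tokens together is a simplifying
# assumption for illustration only.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

n, d = 1e9, 2e10          # ~1B params, ~20B tokens as a starting point
for step in range(5):
    c = 6 * n * d         # rough FLOP count for one training run
    print(f"compute ~{c:.1e} FLOPs -> loss {loss(n, d):.3f}")
    n *= 10**0.5          # sqrt(10) more parameters
    d *= 10**0.5          # sqrt(10) more tokens, i.e. 10x compute per step
```

Loss keeps falling toward the irreducible term E, but each additional 10x of compute buys a smaller absolute drop. That shrinking delta is the curve the "wall" camp pointed to; Google's implicit counterargument is that even small loss improvements can still unlock large capability jumps.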
Sutskever framed it as an epoch shift on the Dwarkesh Patel podcast: 2012–2020 as the “age of research,” 2020–2025 as the “age of scaling,” and now a return to research because 100x more compute would not yield 100x better models. Karpathy echoed the line that current LLMs are “running out of space to grow.” LeCun went further, calling autoregressive text models a dead end and pushing for energy-based and world-model approaches.
That narrative hardened inside labs and on X, where memes cast “scaling is over” as common sense. When leading figures repeat that more data and more compute no longer move the needle, organizations stop betting on brute-force scaling. They redirect budgets from massive training runs into safety, tooling, and smaller, more specialized systems.
SemiAnalysis reported that OpenAI had not completed a successful full-scale pre-training run for a broadly deployed new frontier model since GPT-4o in May 2024—over 18 months ago. Internally, that looked like empirical proof of the wall: training got harder, bugs more catastrophic, and infrastructure limits more binding.
Google quietly disagreed. While rivals talked about ceilings, Google poured money into its TPUv5 fleet, high-bandwidth interconnects, and data pipelines tuned specifically for gargantuan multi-trillion-parameter mixture-of-experts models. Gemini 3 arrived as a blunt counterargument: scaling, done right, still works.
That mismatch in belief created a blind spot. Competitors assumed everyone had hit the same wall; Google knew it had just climbed over its own. When Gemini 3 started beating OpenAI on key coding and reasoning benchmarks, the “scaling is dead” narrative stopped looking like wisdom and started looking like a self‑own.
Google's Gemini Shatters the Wall
Gemini 3 blew a hole in the “scaling is dead” narrative by doing the one thing skeptics said was tapped out: getting dramatically better by getting dramatically bigger. Google’s flagship model pushed past GPT-4-class systems on a swath of public benchmarks, from coding and math to multimodal reasoning, and did it while running interactively at consumer-facing latencies. For developers who had treated Gemini 1 and 1.5 as sidegrades, Gemini 3 finally felt like a clean generational jump.
Under the hood, Gemini 3 rides on Google’s vertically integrated AI stack: custom TPU silicon, hyperscale data centers, and a training pipeline tuned over nearly a decade. SemiAnalysis reports that while OpenAI has not completed a broadly deployed full-scale pretraining run since GPT-4o in May 2024, Google has continued stacking ever-larger training runs on its TPU fleet. That continuity matters because scaling laws only pay off if you can actually keep scaling.
Google’s TPU v5 and emerging v6/v7 generations give it a cost and throughput edge that standard GPU shops struggle to match. TPUs integrate high-bandwidth memory, interconnect, and matrix units in a package built explicitly for transformer-style workloads, reducing both power draw and networking overhead. When you can string together hundreds of thousands of these chips in tightly coupled pods, “just add more compute” stops being a meme and becomes a roadmap.
Strategically, that silicon advantage lets Google run more experiments, longer training schedules, and larger context windows without setting money on fire. Gemini 3’s massive mixture-of-experts configuration—routing tokens through specialized subnetworks—demands brutal amounts of inter-chip communication. TPUs, designed in lockstep with Google’s software stack, make that feasible at production scale.
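For readers unfamiliar with the term, the sketch below shows the core idea of mixture-of-experts routing in plain numpy: each token is sent to only its top-2 experts, so most parameters stay idle per token. The expert shapes, the top-2 rule, and the dense matrices here are illustrative assumptions, not Gemini 3's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D_MODEL = 8, 2, 64

# Illustrative parameters: a router matrix plus one small dense "expert" each.
router_w = rng.normal(size=(D_MODEL, N_EXPERTS))
experts = [rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Route each token to its top-2 experts and mix their outputs."""
    logits = tokens @ router_w                      # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]   # indices of each token's top-2 experts
    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):
        chosen = top[i]
        gate = np.exp(logits[i, chosen])
        gate /= gate.sum()                          # softmax renormalized over the top-2
        for weight, e in zip(gate, chosen):
            out[i] += weight * (token @ experts[e]) # expert output, weighted by its gate
    return out

tokens = rng.normal(size=(4, D_MODEL))
print(moe_layer(tokens).shape)  # (4, 64): same shape out, but only 2 of 8 experts ran per token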
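```

At production scale the experts live on different chips, so every token has to be shuffled to wherever its chosen experts sit and back again. That all-to-all traffic is the communication burden tightly coupled TPU pods are built to absorb.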
Market reaction came fast. Google claims Gemini usage jumped from roughly 450 million to 650 million active users in a matter of months, largely off the back of Gemini Advanced and Gemini for Workspace. For the first time, developers who defaulted to OpenAI started seriously porting agents, copilots, and chatbots into the Google AI ecosystem.
That shift shows up in tooling. Cloud customers now see Gemini 3 options wired into Vertex AI, Google Docs, Gmail, Android, and Chrome, turning model choice into a default setting rather than a research project. For startups watching burn rates, cheaper inference on TPUs plus competitive quality makes Gemini 3 an easy A/B test against GPT-4.1.
Investors and rivals noticed. Coverage like “OpenAI's Altman Declares 'Code Red' to Improve ChatGPT as Google Threatens AI Lead” framed Gemini 3 as the first real threat to ChatGPT’s cultural and technical dominance. Sam Altman’s internal “code red” memo simply confirmed what the benchmarks already implied: Google had scaled straight through the wall everyone else insisted was solid.
Inside OpenAI's All-Hands-On-Deck Scramble
Code red inside OpenAI does not mean fire drills and slogans; it means a hard reset of priorities. According to reporting from the Wall Street Journal and internal memos, Sam Altman ordered teams to halt anything that doesn’t directly make ChatGPT faster, more reliable, or more addictive to use every day.
Projects that once looked like OpenAI’s next revenue engines are suddenly on ice. Work on experimental ads, shopping integrations, and lightweight enterprise side bets has been paused or slowed so engineers and researchers can move back to the core model stack.
Product managers who spent the past year sketching “AI-native” productivity tools now answer to a simpler mandate: defend daily active users. That means fewer experiments in adjacent apps and more heads-down work on the latency, uptime, and guardrails of OpenAI’s flagship chatbot.
Altman reportedly told staff that ChatGPT’s “day-to-day experience” lags where it needs to be, especially with Google’s Gemini 3 closing the gap. So performance work has become the new growth strategy: shaving hundreds of milliseconds off response times, hardening infrastructure, and tuning prompts and routing so users hit the best model path by default.
Personalization sits at the center of this sprint. Teams are racing to deepen user profiles, remember more context across sessions, and adapt tone and format so ChatGPT feels less like a generic assistant and more like a bespoke AI companion that understands your habits, documents, and workflows.
Internally, engineers describe an “all-hands” reshuffle that looks a lot like a wartime footing. Researchers who were exploring longer-term ideas have been reassigned to near-term improvements in reasoning reliability, multi-step tool use, and reducing the number of “I can’t help with that” dead ends.
Metrics have shifted accordingly. Instead of celebrating flashy demos, leadership now tracks:
- Daily and weekly active users
- Session length and task completion
- Drop-off rates when ChatGPT answers incorrectly or too slowly
Code red, in practice, means OpenAI is treating every flaky response, slow answer, or irrelevant reply as an existential bug. With Garlic waiting in the wings, the company wants the runway of a loyal, engaged user base before it rolls out whatever comes next.
Unveiling 'Garlic': The Gemini Killer
Garlic is the kind of codename you pick when you’re trying to ward off something scary. According to a detailed scoop from The Information, OpenAI quietly started training “Garlic” this fall as its first true post-GPT-4 frontier model, explicitly framed internally as a response to Google’s Gemini 3 surge and TPU-driven scaling wins. Mark Chen, OpenAI’s chief research officer, reportedly told staff that Garlic is now the company’s top research priority.
Rather than chasing size for its own sake, Garlic targets the exact pre-training bottlenecks Gemini just bulldozed through. Google proved you can still scale if your compute stack is ruthless enough; OpenAI is betting you can close that gap with smarter pre-training recipes: more efficient data curation, curriculum-style training, and aggressive mixture-of-experts routing to keep costs in check. Internal docs cited by The Information describe Garlic as “GPT-4.5-class compute, Gemini-3-class efficiency.”
Where Gemini 3 flexed on web benchmarks and multimodal tasks, Garlic reportedly focuses on high-value workloads: coding, long-horizon reasoning, and tool use. On OpenAI’s internal coding suite—heavily weighted toward multi-file refactors and agentic workflows—Garlic already edges out Gemini 3 Pro and Anthropic’s Opus 4.5 in early runs, despite not being fully trained. One internal chart shared with researchers showed Garlic ahead by mid-single-digit percentage points on pass@1 coding metrics at comparable temperature.
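Pass@1 itself is a standard code-generation metric rather than anything OpenAI-specific. The sketch below shows the widely used unbiased estimator from the HumanEval paper (Chen et al., 2021), where pass@1 reduces to the fraction of sampled solutions that pass the tests; the (n, c) pairs are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per problem, c: samples that pass the tests,
    k: budget being scored. Returns P(at least one of k samples passes).
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 is just the per-problem pass rate, averaged over problems.
results = [(10, 4), (10, 0), (10, 7)]   # hypothetical (n, c) pairs per problem
print(sum(pass_at_k(n, c, 1) for n, c in results) / len(results))  # ~0.367
```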
Reasoning benchmarks tell a similar story. Garlic reportedly beats Gemini 3 and Opus 4.5 on OpenAI’s private math-and-logic mix, including synthetic chain-of-thought tasks designed to punish shallow pattern matching. Staff who saw the numbers described Garlic as “comfortably ahead of GPT-4.1” and “trading blows with Gemini 3 Ultra” on difficult multi-step prompts, even before the final training stages and reinforcement learning passes.
Architecture-wise, Garlic looks like an evolution, not a reboot. People familiar with the work describe a GPT-4.1-style backbone with heavier sparsity, better retrieval hooks, and tighter integration with OpenAI’s tool-calling stack. The goal: a model that can act as the default brain for agents, search-style workflows, and code copilots without the latency spikes that plague today’s largest systems.
Naming is where the speculation starts. Internally, Garlic is just a codename, but executives are reportedly debating whether to surface it as GPT-5.2—a quiet but sharp upgrade—or brand it GPT-5.5 and market it as the company’s full-scale answer to Gemini 3. Timelines floating around OpenAI point to an aggressive window: a staged release to enterprise customers in Q4, and broad availability by year’s end if training and safety evaluations stay on track.
The Return to Pre-Training's Brutal Frontier
Muscle memory is suddenly a strategic asset again at OpenAI. Chief research officer Mark Chen has reportedly told staff that the company let its pre-training expertise atrophy while it chased reinforcement learning from human feedback, safety work, and flashy product features—and that era is now over. Inside Code Red, pre-training moved from a background process to the main event.
For roughly 18 months after GPT-4o’s training run wrapped in May 2024, OpenAI did not complete a new full-scale frontier pre-train that shipped broadly, according to SemiAnalysis. That gap coincided with a pivot toward RLHF, tool use, and productization: ChatGPT, voice modes, agents, and enterprise features. Those bets brought users and revenue, but they also dulled a core competency just as Google proved that raw scaling still moves the ceiling.
Now OpenAI is rebuilding that muscle with an almost old-school, “frontier lab circa 2020” mentality. Chen has framed pre-training as the hardest, most leverage-rich part of the stack, and Code Red gives him political cover to hire accordingly. Internally, leaders talk about assembling a “superstar team” of systems engineers, optimization specialists, and data pipeline experts whose sole mandate is to push one more order of magnitude.
The rationale is simple and brutal: whoever owns pre-training efficiency owns the frontier. OpenAI believes its secret sauce lives in places outsiders can’t easily see—data curation recipes, curriculum schedules, optimizer tweaks, mixture-of-experts routing, and training-time alignment tricks. Those are precisely the knobs that determine whether a $1 of compute yields a modest bump or a Gemini 3-class jump.
Executives also think the market has misread their silence as stagnation. While Google flaunts TPUv7 and parameter counts, OpenAI is betting on less obvious edges: better loss scaling at trillion-token regimes, denser knowledge packing into smaller models, and architectures that survive catastrophic training failures. In internal briefings around Garlic, Chen has pointed staff to reports like “OpenAI Developing 'Garlic' Model to Counter Google's Recent Gains” as the public tip of a much larger iceberg.
Code Red, in practice, means compute reallocation, cancelled side projects, and a hiring funnel that routes top candidates straight into pre-training. If Garlic lands and matches the internal hype, OpenAI wants the industry to relearn an old lesson: alignment tricks and UX polish matter, but the real moat still starts at the first token of the corpus.
Smarter Isn't Enough: The User Experience War
Sam Altman’s internal memo reportedly hammered a simple point: for “99% of users,” the day-to-day experience matters more than abstract IQ points on a benchmark chart. That’s a brutal reframing of the frontier-model arms race. If Gemini 3 and Garlic are roughly interchangeable for most prompts, whoever makes the interaction feel smoother, faster, and more personal wins.
For typical users asking for email drafts, summaries, or code snippets, today’s large language models already feel “smart enough.” They don’t need a PhD-level theorem prover; they need an assistant that doesn’t stall, glitch, or forget context. Marginal gains in reasoning matter far less than whether ChatGPT, Gemini, or Claude feels like a dependable tool rather than a moody genius.
That shifts the battleground to scaffolding: everything wrapped around the core model. Altman reportedly singled out:
- Personalization features
- Speed
- Reliability
- Broader question coverage
Those are product problems, not just research problems, and they decide which icon users tap 20 times a day.
Speed becomes a UX feature on par with accuracy. Google touts Gemini 3’s responsiveness on its TPUv7 stack; OpenAI needs Garlic and its serving infrastructure to match or beat that latency, especially on mobile. A 400-millisecond difference in response time can decide whether an assistant feels instantaneous or sluggish.
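A rough, back-of-the-envelope sketch of why a few hundred milliseconds matters: perceived chat latency is roughly time to first token plus streaming time for the rest of the reply. Every number below is an illustrative assumption, not a measurement of Gemini or ChatGPT.

```python
# Perceived reply time ~= time to first token (TTFT) + tokens / streaming rate.
# All figures are illustrative assumptions, not measurements of any model.

def reply_seconds(ttft_ms: float, tokens: int, tokens_per_sec: float) -> float:
    return ttft_ms / 1000 + tokens / tokens_per_sec

fast = reply_seconds(ttft_ms=300, tokens=250, tokens_per_sec=120)   # ~2.4 s
slow = reply_seconds(ttft_ms=700, tokens=250, tokens_per_sec=80)    # ~3.8 s
print(f"fast: {fast:.1f}s  slow: {slow:.1f}s  gap: {slow - fast:.1f}s")
```

Users feel both the first-token delay and the streaming rate, so a 400-millisecond gap in time to first token alone can separate "instant" from "noticeably waiting."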
Reliability runs deeper than uptime. Users want fewer “I can’t help with that” dead ends, fewer hallucinated citations, and consistent behavior across web, desktop, and phone. Google claims 650 million Gemini users; OpenAI hovers near 1 billion for ChatGPT. At that scale, one bad outage or broken feature ripples across classrooms, offices, and call centers.
Personalization is the next moat. Whoever turns a generic chatbot into a persistent, context-aware agent that remembers preferences, projects, and style wins the loyalty war—long before anyone notices who edged ahead on the next MMLU leaderboard.
The Moat: Can Brand Loyalty Beat Distribution?
ChatGPT sits in a rare tier of tech brands whose names turned into verbs almost overnight. People “ChatGPT” homework prompts, emails, and code the way they “Google” questions. That linguistic lock-in matters: it encodes OpenAI’s chatbot as the default mental model for AI assistants, even as rivals quietly outperform it on benchmarks.
Brand gravity collides head-on with Google’s distribution machine. Google can surface Gemini everywhere users already live: the Search box, Chrome’s URL bar, Docs sidebars, and Android’s system UI. OpenAI, by contrast, largely lives in a web app, a mobile app, and a scattered ecosystem of API integrations and third-party wrappers.
Google’s advantage compounds through defaults. Billions of people will meet generative AI through:
- A Gemini answer above 10 blue links
- A Gemini panel in Chrome
- A Gemini suggestion in Gmail or Docs
Most of those users will never type “chatgpt.com” or compare Gemini to GPT-4. They will just accept whatever the search bar or compose box gives them.
OpenAI’s moat looks strongest with early adopters and power users. Developers, researchers, and AI-native professionals already juggle ChatGPT, Claude, Gemini, and open models like Llama or Mistral, often via “router” tools that auto-pick the best model. For this crowd, brand matters, but latency, context length, tool use, and raw reasoning quality decide which tab stays pinned.
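The "router" tools mentioned above mostly boil down to a scoring rule over quality, cost, and latency. Below is a minimal sketch of that logic; the model tiers, prices, latencies, and thresholds are hypothetical placeholders, not real benchmarks, pricing, or any vendor's API.

```python
# Minimal sketch of a model "router" that picks a backend per request.
# Model names, costs, and latency figures are hypothetical placeholders.

MODELS = {
    "frontier": {"quality": 0.95, "cost_per_1k_tokens": 0.010, "p50_latency_s": 2.5},
    "mid":      {"quality": 0.85, "cost_per_1k_tokens": 0.002, "p50_latency_s": 1.0},
    "small":    {"quality": 0.70, "cost_per_1k_tokens": 0.0004, "p50_latency_s": 0.4},
}

def route(task: str, latency_budget_s: float) -> str:
    """Pick the highest-quality model that fits the latency budget,
    escalating to the frontier tier for hard task types."""
    hard_tasks = {"multi_file_refactor", "long_horizon_plan", "math_proof"}
    candidates = {k: v for k, v in MODELS.items()
                  if v["p50_latency_s"] <= latency_budget_s}
    if not candidates:
        return "small"                      # degrade gracefully
    if task in hard_tasks and "frontier" in candidates:
        return "frontier"
    return max(candidates, key=lambda k: candidates[k]["quality"])

print(route("summarize_email", latency_budget_s=1.2))      # -> "mid"
print(route("multi_file_refactor", latency_budget_s=5.0))  # -> "frontier"
```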
Mass-market users behave differently. History says most people stick to defaults even when better tools exist: Chrome beat Firefox because Google controlled Search, not because Firefox got worse. If Gemini becomes the ambient assistant across Search, Android, and Chrome, OpenAI must convince users to seek out a separate app for marginally better answers.
Sam Altman’s bet on “day-to-day experience” implicitly acknowledges this split. Power users will chase the best model; everyone else will stick with whatever feels fast, familiar, and free. ChatGPT’s brand gives OpenAI time, but Google’s distribution gives Gemini reach—and in consumer tech, reach usually trains the next generation of habits.
This Isn't a Duel, It's a Royal Rumble
Code Red at OpenAI makes for a dramatic headline, but framing this as a clean OpenAI vs. Google duel misses the real story. AI now looks more like a crowded title card: OpenAI, Google, Anthropic, Meta, Mistral, Apple, xAI, and a fast-growing long tail of Chinese labs and open-source collectives. Each optimizes for a slightly different definition of “intelligence,” and that fragmentation is accelerating the pace of change.
Anthropic leans hard into constitutional AI, selling reliability and safety as enterprise features. Claude 3.5 models increasingly show up in regulated industries that care less about raw benchmark wins and more about auditability, refusal behavior, and stable APIs. Its pitch is simple: fewer surprises, better guardrails, strong coding and reasoning without Gemini’s or GPT’s brand baggage.
Meta, meanwhile, turned Llama into the default open-source substrate. Llama 3.1 and its 8B/70B variants now power thousands of startups, internal corporate tools, and on-device experiments. Meta trades frontier leadership for distribution: if developers build on Llama by default, Meta quietly shapes the ecosystem even when nobody touches its official apps.
Mistral plays the efficiency game. Its 7B–22B-class models punch above their weight on throughput and latency, especially on commodity GPUs. European data centers, cost-sensitive SaaS vendors, and scrappy infra startups increasingly reach for Mistral when GPT-4-class quality is overkill and every millisecond and dollar matters.
Zoom out, and Sam Altman’s Code Red and Google’s Gemini 3 push act as a forcing function for everyone else. As “Google Takes a Swing at the AI Crown” details, TPU economics and massive pre-training runs reset expectations for scale. That in turn pressures Anthropic to differentiate on safety, Meta to double down on permissive licenses, and Mistral to squeeze more performance per FLOP.
Users do not see a duel; they see a royal rumble of overlapping ecosystems. The real winner may be the emergent behavior of all these models locked in a feedback loop of competition, imitation, and one-upmanship.
Why This Cutthroat Battle is Great News For You
Code red at OpenAI and a TPU-fueled charge at Google sound terrifying if you’re a rival lab. If you’re a user, it’s a jackpot. Arms races in tech historically end with more capable products, faster iteration, and a brutal race to undercut prices.
Fierce competition already turned “LLM access” from a $20-per-month novelty into a commodity. OpenAI, Google, Anthropic, Meta, Mistral, and open-source projects now fight to offer more context, better tools, and higher rate limits for the same or less money. Enterprise buyers quietly push even harder, squeezing per-seat costs and demanding usage-based discounts.
Model quality jumps faster when no one feels safe. Gemini 3 forced OpenAI into Garlic, a renewed pre-training push after more than a year without a major frontier release beyond GPT-4o. Anthropic answered GPT-4 with Claude 3.5 and 4.5; Meta keeps dropping larger Llama checkpoints for free, raising the floor for everyone.
Expect the next 6–12 months to deliver not just “GPT-5 vs Gemini 4” headlines, but concrete upgrades users can touch:
- Longer context windows as default, not premium
- Faster response times via better inference stacks and custom silicon
- More robust tools: code execution, browsing, and file handling that actually work at scale
- Higher reliability in multi-step tasks and agents
Price pressure will intensify. Google can subsidize Gemini through Search and Cloud, while Microsoft can bundle OpenAI models into 365 and Azure. That cross-subsidy dynamic historically drove down effective prices in cloud compute and storage; it will likely do the same for tokens, API calls, and “AI seat” licenses.
User experience will sharpen because Sam Altman explicitly made “day-to-day experience” the battlefield. Expect richer personalization, memory that survives across sessions, and workflows that look more like assistants embedded in email, docs, and IDEs than a blank chat box. ChatGPT’s brand moat only holds if the product feels obviously better every week.
Most importantly, no lab can stall. Any slowdown in pre-training, inference optimization, or UX polish becomes a headline and a churn event. That urgency means users get faster iteration cycles, more experimentation, and a constant stream of features competitors are too scared not to ship.
Frequently Asked Questions
What is OpenAI's 'Code Red'?
It's an internal initiative declared by CEO Sam Altman to urgently improve ChatGPT's performance and core technology in direct response to the competitive threat posed by Google's Gemini 3 model.
What is the 'Garlic' AI model?
'Garlic' is the internal codename for a new AI model being developed by OpenAI. It is designed specifically to counter Google's recent pre-training advancements and reportedly performs well against Gemini 3 in internal tests.
Is AI model scaling dead?
While some experts, including OpenAI co-founder Ilya Sutskever, who has since left the company, suggested scaling was hitting its limits, Google's Gemini 3 proved that significant gains are still possible. OpenAI's leadership now asserts that scaling is not dead and is refocusing on it.
Why is Google's Gemini 3 a major threat to ChatGPT?
Gemini 3 demonstrated massive performance gains, suggesting Google's custom TPU architecture gives them a key advantage in scaling models. This, combined with Google's vast user base and distribution channels, presents the first major challenge to OpenAI's market leadership.