Google Killed the Context Window
Stop feeding your LLM endless tokens. Google just revealed a 'Context Engineering' pattern that treats context like compiled code, and it changes everything about building scalable AI agents.
The Million-Token Myth Is Dead
Million-token context windows were supposed to be the cheat code for large language models. Vendors raced to advertise 128K, 200K, even 1M-token prompts as if sheer capacity would unlock reliable coding copilots, autonomous agents, and fully searchable knowledge bases. Reality has been less cinematic: more tokens often just means slower, pricier, and still-confused models.
Google’s latest research on context engineering calls the bluff. The company argues that “throwing more tokens at the problem buys time, but it doesn’t change the shape of the curve” for cost, latency, and reliability. You can briefly outrun complexity with a 1M-token window, but real workloads—RAG results, multi-agent logs, tool outputs, and user history—will catch up fast.
Three hard limits keep smashing the “just stuff the prompt” strategy. First is the cost-and-latency spiral: every extra 100K tokens pushes inference time and cloud bills up, often by 2–3x in production-scale apps. Second is “lost in the middle”, a documented effect where models ignore crucial instructions buried in overloaded prompts. Third is physics: even million-token windows overflow once you start chaining tools and agents over hours or days.
Google’s answer is not “2M tokens,” it is an architectural pivot. Instead of treating context as a giant mutable chat log, the company frames it as a compiled view over a richer, stateful system. Raw data—sessions, memories, artifacts—acts as source code; a pipeline of processors compiles a minimal, task-specific working context for each model call.
That shift turns context from a brute-force buffering problem into a systems-design problem. Google’s Agent Development Kit (ADK) bakes this into a four-layer stack: working context, session, memory, and artifacts, each with clear responsibilities and lifecycles. Context stops being “whatever fits in the window” and becomes an explicit product of code, policies, and scopes.
This article walks through how that framework actually works in production. Using ADK, we will break down the context architecture, the different context objects and permissions, and a real document-research agent that shows why the million-token era is already over.
Why Your LLM Is Drowning in Data
Bigger context windows promise omniscience but mostly deliver sticker shock. Every extra 100,000 tokens adds real dollars and seconds, and production teams feel both. Google’s own context-engineering blog describes a cost/latency spiral: more tokens per call, multiplied by thousands of concurrent users, turns “nice demo” into “blown budget” fast.
Latency scales just as brutally. That million-token prompt means slower decoding, longer tool chains, and UX that drifts from “assistant” to “ticketing system.” In multi-agent setups, each agent call fans out into more model calls, so one bloated context infects an entire pipeline.
Then comes “lost in the middle”. LLMs don’t attend uniformly across giant prompts; they often overweight the beginning and end, and quietly ignore the center. Google’s blog and follow-up work show accuracy dropping when key facts sit buried in the middle of a long sequence, even when the model technically “sees” everything.
Picture a production agent’s prompt after a few minutes of use. Up top: a fresh user question. At the bottom: a recent tool error and a retry instruction. Wedged in the middle: the actual policy constraint or critical system instruction. The model happily optimizes around the edges and hallucinates past the thing that actually matters.
Real workloads make this worse. A single turn can include:
- 20–50 RAG passages
- 5–10 tool call logs
- Dozens of prior chat messages
Stuff all of that into a “1M-token-ready” model and the context window still chokes after a handful of rich interactions. Even ADK demos show how quickly artifacts, memory, and session state would explode if you naively streamed everything into the prompt.
Hard physical limits finish the job. Compute and memory costs grow superlinearly with context length; pushing from 128K to 1M tokens already demands exotic infrastructure. Go much higher and you’re fighting GPU RAM, bandwidth, and training stability, not just clever prompt design.
These aren’t lab curiosities. They are the reasons production teams quietly cap history, truncate RAG results, and aggressively summarize. Until context stops being a raw stream and becomes a compiled view, large agents will keep drowning in their own data.
Google's 'Context Compiler' Changes Everything
Google’s new context compiler idea rips out the old mental model of a context window as a bottomless chat log. Instead of a mutable stream buffer, context becomes a compiled view over a much richer state: sessions, memories, and artifacts that live outside any single prompt. Each model call sees only a carefully constructed slice, not the whole haystack.
Think like a compiler engineer, not a prompt tweaker. Raw interaction data becomes the source code; a pipeline of processors acts as the compiler; the final prompt sent to Gemini is the optimized executable. Google’s own blog, Architecting efficient context-aware multi-agent framework for production, spells this out as a hard requirement for production-scale agents, not a nice-to-have abstraction.
Under this model, the developer’s job shifts from “how do I phrase this prompt?” to “how do I architect the pipeline that builds context?” You design how sessions store structured events, how memories encode long-term knowledge, and how artifacts like PDFs or CSVs get referenced by ID instead of dumped into the window. You stop hand-curating mega-prompts and start defining flows, processors, and scopes.
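Here’s that compile step as a minimal sketch in plain Python. The processor names and data shapes are illustrative, not ADK’s API; the point is the shape of the pipeline: raw state in, minimal prompt out.

```python
from dataclasses import dataclass

@dataclass
class RawState:
    """The 'source code': everything the system knows, none of it prompt-ready."""
    session_events: list[dict]   # full structured history
    memories: list[str]          # long-term facts and preferences
    artifact_ids: list[str]      # large files, referenced by ID only

def summarize_session(state: RawState, ctx: list[str]) -> list[str]:
    # Keep the last few turns verbatim; collapse the rest into one line.
    recent = state.session_events[-3:]
    older = len(state.session_events) - len(recent)
    if older > 0:
        ctx.append(f"[{older} earlier events summarized away]")
    ctx.extend(e["text"] for e in recent)
    return ctx

def inject_memories(state: RawState, ctx: list[str]) -> list[str]:
    ctx.extend(state.memories[:5])   # only the handful that matter for this call
    return ctx

def reference_artifacts(state: RawState, ctx: list[str]) -> list[str]:
    # Artifacts travel as IDs, never as inline file contents.
    ctx.extend(f"[artifact:{aid}]" for aid in state.artifact_ids)
    return ctx

PIPELINE = [summarize_session, inject_memories, reference_artifacts]

def compile_context(state: RawState) -> str:
    """The 'compiler': raw state in, minimal working context out."""
    ctx: list[str] = []
    for processor in PIPELINE:
        ctx = processor(state, ctx)
    return "\n".join(ctx)
```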
ADK bakes this into its APIs. It exposes distinct context layers—working context, session, memory, artifacts—and forces you to wire them together through explicit processors and state prefixes like `app:`, `user:`, and `temp:`. That separation of storage from presentation means you can log thousands of events while still emitting a lean, targeted prompt under 5,000 tokens.
Power comes from scoping by default. Every model invocation receives only the minimum context required for its task, assembled just-in-time by the compiler pipeline. If an agent needs more information, it calls tools to fetch artifacts or query memory, instead of preloading everything “just in case.”
This compiled-context approach hits all three pain points at once. Token counts drop, so cost and latency fall with them. “Lost in the middle” shrinks because the middle mostly disappears—irrelevant history never enters the working context. Physical context limits stop being a hard ceiling and become an optimization budget you control with code.
Principle #1: Separate Storage from Presentation
Context engineering starts with a clean break between where information lives and how the model sees it. Google’s Agent Development Kit (ADK) bakes this into its first principle: separate storage from presentation. That sounds abstract, but it’s the difference between a system you can evolve and one that collapses under its own transcript.
ADK draws a hard line between the Session and the Working Context. The Session acts as the persistent, authoritative ledger: every user message, tool invocation, tool result, and system event lands here as structured state. The Working Context is a disposable snapshot built for a single LLM call, then thrown away.
Think of the Session as your database and the Working Context as a SQL query result. You never mutate the database to match each query; you change the query. Same idea here: you keep a complete, append-only Session and compile different Working Contexts from it depending on the task, model, or latency budget.
That separation becomes critical the moment your product changes. Want to swap Gemini 3 Pro for a smaller model, or switch from a verbose chat-style prompt to a terse tool-execution prompt? You update the processors that build the Working Context, not the Session schema or historical data. Past interactions remain intact, even as you radically change how you frame them to the model.
The ADK docs formalize this with distinct context objects and state prefixes. Session-backed state lives under durable prefixes like `app:` and `user:`, while ephemeral, invocation-only data lives under `temp:`. Only the Working Context reads across those namespaces and compiles a minimal view for the current call.
Practical impact shows up fast in multi-agent systems. One agent might see a Working Context with only the last 3 user turns and a tool summary; another might get a synthesized brief of a 200-message Session plus RAG hits. Both derive from the same underlying record, but each call pays only for the tokens it actually needs.
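A sketch of that split, with illustrative event shapes rather than ADK’s real types: one append-only session acting as the database, two query-like views compiled from it for different calls.

```python
# One authoritative session log (the "database")...
session = [
    {"role": "user", "text": "Compare Q1 and Q2 revenue."},
    {"role": "tool", "name": "load_report", "text": "report_q2_2024 loaded"},
    {"role": "model", "text": "Q2 revenue grew 12% over Q1."},
    # ...hundreds more structured events...
]

# ...and two "query results" compiled from it for different calls.
def chat_view(session: list[dict], turns: int = 3) -> str:
    """Verbose view for a conversational call: the last few user turns."""
    users = [e["text"] for e in session if e["role"] == "user"]
    return "\n".join(users[-turns:])

def tool_view(session: list[dict]) -> str:
    """Terse view for a tool-execution call: the latest tool result only."""
    tools = [f'{e["name"]}: {e["text"]}' for e in session if e["role"] == "tool"]
    return tools[-1] if tools else ""
```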
Principles #2 & #3: Build an AI Assembly Line
Context windows used to grow by string concatenation: user message, agent reply, tool output, repeat until the bill explodes. Google’s ADK rips that out and replaces it with Explicit Transformations: a named, ordered pipeline that compiles raw state into a working prompt. Context stops being a blob of text and becomes a build artifact.
Instead of `prompt = history + docs + tools`, you define processors like `summarize_session`, `select_relevant_artifacts`, `inject_instructions`, and `budget_tokens`. Each step has a name, a contract, and a place in the pipeline. You can log every stage, diff outputs between runs, and swap processors without touching the rest of the system.
Google’s context blog and ADK docs describe this as a flow of processors that transform four layers of state: session, memory, artifacts, and working context. A document-research agent, for example, might:
- Fetch prior decisions from memory
- Rank candidate PDFs from artifacts
- Compress citations into a 2,000-token summary
- Emit a minimal prompt for Gemini 3 Pro
Because processors are explicit, teams can unit-test them. You can assert that `select_relevant_artifacts` never pulls more than 5 documents, or that `summarize_session` stays under 1,000 tokens. Debugging stops being “why did the model hallucinate?” and starts being “which processor injected the wrong data?”
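Those assertions can be literal pytest tests. The processors below are trivial stand-ins (real ones would rank by relevance and summarize properly), but the contracts are exactly the kind worth pinning down:

```python
# Stand-in processors; real ones would rank by relevance and summarize properly.
def select_relevant_artifacts(candidates: list[str], limit: int = 5) -> list[str]:
    return candidates[:limit]

def summarize_session(events: list[str], budget: int = 1000) -> str:
    words = " | ".join(events).split()
    return " ".join(words[:budget])   # crude proxy: one token per word

def test_select_never_exceeds_five():
    docs = [f"doc_{i}" for i in range(50)]
    assert len(select_relevant_artifacts(docs)) <= 5

def test_summary_respects_token_budget():
    events = [f"event number {i}" for i in range(5000)]
    assert len(summarize_session(events).split()) <= 1000
```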
Scoped by Default attacks the other half of the mess: who sees what. Instead of dumping the entire multi-agent transcript into every tool call, ADK routes minimal, role-specific context via typed context objects. Tools, callbacks, and instruction providers each get a constrained view.
Tool functions receive a ToolContext that can read and write session state, save or load artifacts, and search memory. Callback handlers get a CallbackContext with state and artifacts but no memory search, preventing side-channel chaos. Instruction providers see a read-only context so they cannot secretly mutate state while generating system prompts.
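In code, the scoping shows up in the signatures. A sketch of the pattern follows; import paths match the ADK docs at the time of writing, so verify them against your installed version, and the key names are invented.

```python
from google.adk.tools import ToolContext
from google.adk.agents.callback_context import CallbackContext
from google.adk.agents.readonly_context import ReadonlyContext

def fetch_notes(topic: str, tool_context: ToolContext) -> dict:
    """Tool: widest scope -- session state, artifacts, and memory search."""
    tool_context.state["temp:last_topic"] = topic   # scratch state for this turn
    # tool_context.search_memory(...) and tool_context.save_artifact(...)
    # are also available here, and only here (see the ADK docs).
    return {"status": "ok", "topic": topic}

def before_agent(callback_context: CallbackContext) -> None:
    """Callback: state and artifacts, but no memory search."""
    runs = callback_context.state.get("temp:agent_runs", 0)
    callback_context.state["temp:agent_runs"] = runs + 1

def build_instruction(readonly_context: ReadonlyContext) -> str:
    """Instruction provider: read-only view; cannot mutate anything."""
    style = readonly_context.state.get("user:summary_style", "detailed")
    return f"Answer in a {style} style."
```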
Scoped context turns multi-agent systems into an AI assembly line. One stage reads artifacts and produces a structured summary; another stage, with a narrower scope, turns that summary into user-facing prose; a third stage logs outcomes back to memory. No single agent hauls around the entire 50,000-token history.
Together, Explicit Transformations and Scoped by Default produce context flows that are predictable under load. You know exactly which processors run, which state they can touch, and how many tokens they emit. That discipline is what makes a million-token model optional instead of mandatory.
The Four Layers of a Context-Aware Agent
Context-aware agents in Google Agent Development Kit (ADK) run on a four-layer stack that treats context as a product, not a side effect. Each layer answers a different question: what the model sees now, what actually happened, what should persist, and where the heavy assets live.
At the top sits the working context. This is the tight, temporary payload that ADK compiles and ships to Gemini for a single model call: selected messages, tool outputs, snippets from memory, and a few artifact references. ADK discards it immediately after the invocation, so nothing in working context is authoritative by itself.
Beneath that lives the session, the long-running, canonical log of an interaction. Every user message, model response, tool call, and tool result lands here as a structured event object, not a flat transcript. When ADK rebuilds a working context, it queries this session timeline and applies processors that filter, summarize, or re-thread events for the current task.
Longer-lived knowledge moves into memory. Memory holds user preferences (tone, languages, notification habits), durable facts (company policies, product specs), and distilled decisions that should outlive any single chat. ADK surfaces this via memory search APIs, so an agent can pull in only the 3–10 items that matter instead of replaying 10,000 tokens of history.
Artifacts solve the “giant file in the prompt” problem. Artifacts are large binary or text objects—PDFs, CSVs, images, audio files—stored once and addressed by stable names or IDs, not pasted into the prompt. Tools read and write artifacts, and the working context only carries lightweight references plus extracted chunks when necessary.
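A sketch of the save-once, reference-everywhere pattern, using the artifact API as described in the ADK docs (the filename and state key are made up, and the async method signatures may differ across ADK versions):

```python
from google.genai import types
from google.adk.tools import ToolContext

async def stash_report(pdf_bytes: bytes, tool_context: ToolContext) -> dict:
    """Store the heavy object once; hand back only a reference."""
    part = types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf")
    version = await tool_context.save_artifact("report_q2_2024.pdf", part)
    # Only this tiny reference (never the PDF) goes near the working context.
    tool_context.state["temp:current_doc_id"] = "report_q2_2024.pdf"
    return {"artifact": "report_q2_2024.pdf", "version": version}

async def analyze_current(tool_context: ToolContext) -> dict:
    """Reload by name later; the prompt never carried the bytes."""
    name = tool_context.state["temp:current_doc_id"]
    part = await tool_context.load_artifact(name)
    return {"analyzed": name, "bytes": len(part.inline_data.data)}
```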
Together, these four layers form a context pipeline, not a monolith. A document-research agent, for example, might: read the session to find the latest user question, search memory for prior decisions, load a PDF artifact by ID, then compile a working context with only a few relevant paragraphs and citations. Cost and latency scale with that compiled slice, not with the raw corpus.
Google’s own guidance pushes teams to treat these layers as first-class design surfaces. The ADK Context Documentation spells out how working context, session, memory, and artifacts map to concrete types, permissions, and processors so multi-agent systems stay fast, cheap, and grounded as workloads grow.
Mastering State: The Secret to Persistent AI
State makes an AI feel persistent instead of goldfish-brained, and ADK bakes that directly into its context API with a deceptively simple prefix system. Rather than handing agents one amorphous blob of “memory,” ADK slices state into `temp:`, `user:`, and `app:` namespaces that map cleanly to how real software actually runs.
Start with temp:, the scratchpad. Anything you write under `temp:` lives for a single invocation and then disappears. A tool can stash a parsed CSV under `temp:parsed_doc`, another tool can read it 200 ms later, and once the agent replies, ADK wipes it—no risk of polluting long-term history with intermediate junk.
Move up a level and `user:` becomes the agent’s actual memory of a person. Keys like `user:reading_level`, `user:last_projects`, or `user:blocked_sources` persist across sessions as long as you plug ADK into a backing store. A research assistant built in the demo can remember which papers a user already summarized last week and avoid re-fetching or re-explaining them.
At the top sits `app:`, global state for the entire deployment. Feature flags (`app:enable_vision`), system-wide rate limits, or references to a shared embeddings index all live here. Every agent instance and every user can read these values, so you can flip a config once and watch behavior change across hundreds of concurrent sessions.
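All three prefixes can appear in a single tool. A sketch with illustrative key names — only the prefix semantics are the point, and the `ToolContext` import path should be checked against your ADK version:

```python
from google.adk.tools import ToolContext  # import path per the ADK docs

def summarize(text: str, tool_context: ToolContext) -> dict:
    state = tool_context.state

    # temp: scratch for this invocation only; ADK discards it after the reply.
    state["temp:parsed_doc"] = text.splitlines()

    # user: persists across sessions once a durable session service is wired in.
    audience = state.get("user:reading_level", "expert")

    # app: deployment-wide configuration, identical for every user.
    max_chars = state.get("app:summary_max_chars", 500)

    summary = " ".join(state["temp:parsed_doc"])[:max_chars]
    return {"summary": summary, "audience": audience}
```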
Together, these three prefixes give you a concrete state hierarchy without inventing a custom framework. You get:
- `temp:` for per-turn wiring between tools
- `user:` for per-identity memory
- `app:` for cross-user configuration
That hierarchy directly encodes ADK’s design principles. Separate storage from presentation: state lives under `temp:`, `user:`, or `app:`, while the working context is a compiled slice of those keys. Explicit transformations: tools and processors read and write specific prefixes instead of mutating a giant prompt. Scoped by default: `temp:` never leaks past a turn, `user:` never accidentally becomes global, and `app:` never silently turns into user-specific behavior.
Code in Action: A Context-Aware Research Bot
ADK turns all that context theory into a working research bot. Yeyu Lab’s demo builds a document assistant that can search, open, and analyze files without ever shoving entire PDFs into the prompt. Instead, it compiles just enough context for each Gemini 1.5 or Gemini 2.0 call.
Core to the design is an on-demand loading pattern. The `list_documents` tool returns only lightweight metadata: document IDs, titles, maybe byte sizes and timestamps. The model sees a compact table of options, not 200 pages of raw text.
When the user picks something like “analyze the quarterly report,” the agent calls `analyze_document`. That tool pulls the full file from artifacts storage, runs chunking or summarization, and only then surfaces distilled results to the model. The LLM never receives the original document inline; it just gets the processed slices it asked for.
Tools coordinate through `temp:` state instead of dragging payloads across the wire. `list_documents` writes `temp:current_doc_id = "report_q2_2024"`; `analyze_document` reads that same key and knows exactly which artifact to load. No base64 blobs, no 50,000-token JSONs bouncing between tools.
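The handoff looks roughly like this — a sketch of the pattern, not the demo’s exact code, with invented metadata values:

```python
from google.adk.tools import ToolContext

def list_documents(tool_context: ToolContext) -> dict:
    """Return lightweight metadata only -- never raw document text."""
    docs = [{"id": "report_q2_2024", "title": "Q2 Quarterly Report", "pages": 48}]
    # Handoff via state: the next tool finds the ID here, not in its arguments.
    tool_context.state["temp:current_doc_id"] = docs[0]["id"]
    return {"documents": docs}

def analyze_document(tool_context: ToolContext) -> dict:
    """Heavy lifting happens here; the model sees only the distilled result."""
    doc_id = tool_context.state["temp:current_doc_id"]   # no payload crossed the wire
    # ...load the artifact by ID, chunk, summarize...
    return {"doc_id": doc_id, "summary": f"Three-sentence summary of {doc_id}."}
```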
That `temp:` scope lives for a single invocation, which keeps the working context minimal. A typical turn might include the user request, a short system prompt, a compact document list, and one `current_doc_id` string, yet still feel like a rich multi-step workflow. ADK handles the plumbing so the model focuses on reasoning.
Longer-term behavior rides on `user:` state. When someone says, “I prefer brief summaries,” a tool or callback writes `user:summary_style = "brief"`. Future calls—tomorrow, next week, on a different device—can read that key and automatically produce 3-sentence abstracts instead of 3-page breakdowns.
Preferences can stack up without bloating prompts. You might track:
- `user:domain_focus = "finance"`
- `user:citation_format = "APA"`
- `user:summary_style = "brief"`
Each invocation compiles only the relevant subset into the working context. No one replays a 200-turn chat log just to remember that the user hates bullet points.
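One natural consumer is an instruction provider: ADK lets an agent’s `instruction` be a callable that receives a read-only context, so preferences flow into the system prompt without anyone replaying history. A sketch, with assumed import paths and invented key names:

```python
from google.adk.agents import LlmAgent
from google.adk.agents.readonly_context import ReadonlyContext

def research_instructions(ctx: ReadonlyContext) -> str:
    """Compile only the relevant user: keys into the system prompt."""
    style = ctx.state.get("user:summary_style", "detailed")
    fmt = ctx.state.get("user:citation_format", "none")
    focus = ctx.state.get("user:domain_focus", "general")
    return (
        f"Summaries must be {style}. Use {fmt} citations. "
        f"Prioritize {focus} sources."
    )

agent = LlmAgent(
    name="research_bot",
    model="gemini-2.0-flash",           # illustrative model choice
    instruction=research_instructions,  # evaluated fresh for each invocation
)
```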
Intelligence in this agent comes from efficient tool use, not a monster context window. The model issues precise tool calls, hops through `temp:` and `user:` state, and touches artifacts only when necessary. ADK effectively kills the idea that you need to reread your entire history every time just to act smart.
From Prompt Engineer to Systems Architect
Prompt engineering once meant clever incantations and brittle hacks. Context Engineering, as framed by Google’s context blog and the Google Agent Development Kit (ADK), upgrades that role into something closer to distributed systems architecture for language models.
Instead of obsessing over a single mega-prompt, developers now design pipelines: how sessions, state, memory, and artifacts flow through processors into a compiled working context. ADK’s four-layer stack—working context, session log, long-term memory, and external artifacts—turns “what’s in the prompt?” into “what system produced this view, and why?”
That shift marks a clear maturation of AI development. You define:
- What data lives where
- Which transformations run when
- Which agent or tool sees which scope
Result: behavior stops feeling magical and starts feeling debuggable.
Reliability jumps because every context slice has an explicit recipe. If an agent hallucinates, you inspect the processors and prefixes that built its view, not a 40,000-token blob. ADK’s state prefixes (`app:`, `user:`, `temp:`) and scoped contexts (tool, callback, invocation) give you levers to reproduce bugs, write tests, and reason about failure modes.
Predictability improves because agents no longer slurp entire histories. They request artifacts by ID, search memory with controlled queries, and write to constrained state segments. Costs drop as well: you stream only the minimal compiled context per call instead of replaying weeks of logs into a million-token window.
For teams targeting production, this looks like a blueprint for multi-agent backends. One agent orchestrates, specialist agents own tools and memories, and ADK’s context framework guarantees each call sees just-enough information. Google’s own docs, including Introduction to Conversational Context: Session, State, and Memory, read less like prompt tips and more like an API design guide.
Context Engineering, as implemented in ADK, turns LLM apps into systems you architect, not spells you tweak. That is the prerequisite for serious, regulated, multi-agent deployments.
Your Next Move with Google's ADK
Ready to try this for real? Start by installing the Google Agent Development Kit (ADK), grabbing a Gemini API key, and reading Google’s context-engineering explainer on the Google Developers blog: Architecting efficient context-aware multi-agent framework for production. That post defines the four-layer stack—working context, session, memory, artifacts—and the three principles your code should reflect.
Next, go straight to the official ADK context documentation at google.github.io/adk-docs/context. Focus on how `InvocationContext`, `ToolContext`, and `CallbackContext` gate access to session, memory, and artifacts, and how the `app:`, `user:`, and `temp:` prefixes implement scoped state. Treat these APIs as your system boundary, not just helper classes.
Then pull down Yeyu Lab’s demo from GitHub: context_demo. Run the document assistant, watch how artifacts get saved and referenced by ID instead of shoved into prompts, and trace how tools read/write state via prefixes rather than hiding data in free-form text. This is your reference implementation for a context-aware multi-agent workflow.
For a first project, refactor an existing RAG app to match this pattern. Replace your “stuff everything into the prompt” logic with the following (a runnable skeleton follows the list):
- A session log of structured events
- Artifact storage for PDFs, CSVs, and long documents
- Memory search instead of re-sending the same facts
- `temp:` state for passing intermediate results between tools
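Here is a framework-agnostic skeleton of that refactor. The storage backends are stand-in lists and dicts, and the relevance filter is deliberately naive; swap in real session, artifact, and memory services as you go.

```python
class ContextAwareRagAgent:
    """Skeleton of the refactor; backends are stand-in lists and dicts."""

    def __init__(self):
        self.session_log: list[str] = []       # structured events, append-only
        self.artifacts: dict[str, bytes] = {}  # PDFs/CSVs stored once, fetched by ID
        self.memory: list[str] = []            # searchable facts, not replayed history
        self.temp: dict = {}                   # per-turn wiring between tools

    def compile_prompt(self, question: str) -> str:
        """Build a minimal working context instead of dumping everything."""
        recent = self.session_log[-5:]                       # last few events only
        words = question.lower().split()
        facts = [m for m in self.memory
                 if any(w in m.lower() for w in words)][:3]  # naive relevance filter
        doc_ids = self.temp.get("candidate_docs", [])
        return "\n".join([
            f"Question: {question}",
            "Recent events: " + "; ".join(recent),
            "Relevant facts: " + "; ".join(facts),
            "Documents by ID: " + ", ".join(doc_ids),        # IDs only, never contents
        ])
```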
You are not just tuning prompts anymore; you are designing distributed systems that happen to talk in natural language. Developers who internalize context as a compiled view—and who architect around state, artifacts, and explicit transformations—will be the ones shipping production-ready AI while everyone else keeps chasing a bigger context window.
Frequently Asked Questions
What is Context Engineering?
Context Engineering is a new architectural pattern from Google for AI agents. It treats context not as a single stream of text, but as a 'compiled view' created from various data sources like session history, memory, and external files, optimized for each specific model call.
Why are large context windows a problem for AI agents?
While seemingly powerful, large context windows lead to a spiral of high costs and slow latency. They also suffer from the 'lost in the middle' problem, where models struggle to find relevant information in a sea of noise, ultimately limiting scalability and reliability.
How does Google's Agent Development Kit (ADK) implement these ideas?
ADK provides a framework with built-in primitives for Context Engineering. It separates persistent storage (Session, Memory, Artifacts) from the temporary 'Working Context' sent to the LLM, using tools and state management to load only what's necessary, when it's necessary.
Does Context Engineering replace Retrieval-Augmented Generation (RAG)?
It complements and refines it. RAG is about retrieving data; Context Engineering is about how that retrieved data (and all other context) is structured, managed, and presented to the model. It provides a more robust, scalable system for RAG-style workflows.