This AI Trick Saves 97% of Your Tokens
Stop wasting tokens on bloated context windows. Discover the two powerful MCP servers that slash AI coding costs and prevent 'context rot' for radically more efficient development.
The High Cost of a Cluttered Mind
Large language models behave like overworked interns: give them too much to read and their answers fall apart. Researchers and practitioners now call this “context rot”—performance degrades as you stuff more text into the prompt window, even when that text is technically relevant. Past a certain point, more context doesn’t make models smarter; it makes them confused.
Developers still routinely paste entire docs pages, API references, and search results into a single prompt. A single modern framework guide can run 20,000+ tokens; multiply that by a few pages and you blow past 100,000 tokens on one request. That means you’re paying for the model to skim huge walls of boilerplate, TOCs, and repeated headers it will mostly ignore.
Those wasted tokens show up directly on your bill. At current API pricing, hammering an LLM with 100k-token prompts several times a day can quietly add hundreds of dollars a month to a team’s experimentation budget. Worse, bigger prompts take longer to process, so every query feels like waiting on a slow build.
Accuracy drops too. When you dump five overlapping docs pages into the context window, the model must juggle conflicting examples, deprecated syntax, and version-specific edge cases. Ask for a Tailwind v4 pattern and it might confidently regurgitate Tailwind v3 snippets it saw earlier in the same prompt, because the signal-to-noise ratio collapsed.
Naive retrieval also breaks agent workflows. Tool-using agents call search multiple times per task, so each step can add another 10,000–20,000 tokens of raw HTML and markdown. By step three, your “helpful assistant” drags around a bloated context history that obscures the few lines of code or config that actually matter.
The real challenge: giving an AI agent “perfect knowledge” of your stack without overwhelming its limited attention span. That means surfacing just the right 500–5,000 tokens—current SDK methods, your auth edge cases, that one migration note—instead of the entire internet. Systems that can do this reliably don’t just trim 50–90% of your context; they keep models sharp when it counts.
The 97% Context Killers: Ref.tools & Exa
Context rot has a new enemy: the Model Context Protocol. MCP is a brutally simple idea—stop shoving everything into the context window and instead give models tools that fetch exactly what they need, when they need it. Instead of a 100,000‑token firehose, MCP turns context into an API call.
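Under the hood, each of those API calls is a small JSON-RPC request rather than pasted text. The sketch below shows the `tools/call` shape defined by the MCP spec; the tool name and arguments are hypothetical placeholders for illustration, not the actual ref.tools or Exa interfaces:

```jsonc
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",            // standard MCP method for invoking a server-side tool
  "params": {
    "name": "search_documentation",  // hypothetical tool name, for illustration only
    "arguments": {
      "query": "Tailwind v4 design tokens with CSS variables",
      "max_results": 3               // hypothetical argument; real servers define their own
    }
  }
}
```

The response comes back as structured content the model reads directly, so the only tokens that ever enter the context window are the snippets the server chose to return.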
Two MCP servers in particular form a kind of precision strike team: ref.tools and Exa. Ref.tools handles documentation, both public and private, while Exa focuses on high‑quality, low‑latency search for code and technical content. Together they replace brute‑force copy‑paste with targeted retrieval.
Ref.tools acts like a documentation surgeon. It indexes public docs, GitHub repos, PDFs, and internal sites, then returns only the few thousand tokens that actually matter for the current task, not the 20,000‑token blob you’d get from naive scraping. It also tracks search history in a session so the model doesn’t keep rediscovering the same pages.
Exa plays the complementary role for code and engineering research. Instead of broad web search, it prioritizes developer‑relevant sources and structures results so an AI agent can quickly extract APIs, patterns, and examples. For refactors, SDK migrations, or framework upgrades, that speed and focus mean fewer calls, fewer tokens, and less hallucinated guesswork.
Ray Fernando’s video pushes a bold number: a 97% reduction in context window usage on a complex refactor using these two MCPs. Previously, he would slam nearly 100k tokens of SDK docs, auth rules, and database provider details into the prompt. With ref.tools and Exa, the model pulls only the slices of Tailwind v4, ShadCN, and app‑specific code it actually needs.
That combo turns context into a surgical strike. The agent first queries ref.tools and Exa to understand Tailwind v4 design tokens, then scans the Anime Leak codebase for hard‑coded Tailwind v3 patterns and inconsistent themes. Instead of a bloated, fragile prompt, you get a tight loop: ask, fetch, apply, repeat—minimal tokens, maximal signal.
Ref.tools: The AI's Smart Librarian
Ref.tools behaves less like a search bar and more like a librarian for agents, built to keep large language models from drowning in documentation. Instead of blasting the model with entire pages, it performs what its creators call agentic search: multi-step, tool-driven querying that adapts to what the model is trying to do over time.
At the core is context-aware filtering. Ref.tools slices sprawling docs into small chunks, then selectively returns only the most relevant ~5,000 tokens for a given task, not the 20,000+ tokens a naive crawler might dump into your context window. On real-world queries, users report 50–70% token savings versus basic RAG, and up to 95–99% reductions compared to brute-force “paste the docs” workflows.
Session awareness is where it starts to feel built for agents rather than humans. Each search session tracks previous queries and answers, so ref.tools avoids sending duplicates and near-duplicates. When an AI assistant iterates with multiple tool calls—“how do I auth?”, “now show me pagination”, “now error handling”—ref.tools steers away from already-used passages instead of re-burning tokens on the same paragraphs.
Indexing spans both public and private worlds. Out of the box, ref.tools can crawl and index:
- Public product docs and API references
- Private GitHub repos
- PDFs and other uploaded files
- Arbitrary websites behind a single URL
That unified index becomes a single source of truth for your AI assistant, so it can answer “How does our billing middleware wrap Stripe?” by pulling from your GitHub, then immediately pivot to official Stripe docs without switching tools.
Crucially, ref.tools optimizes for natural language queries from agents, not human keyword hacking. An assistant can ask, “What are the required parameters for the Figma post comment endpoint, and show a minimal TypeScript example?” and ref.tools resolves that into targeted lookups across its index, then returns only the code blocks and explanation fragments that matter.
Because it speaks the Model Context Protocol, ref.tools plugs directly into Claude, Zed, Cursor, and other MCP-aware environments. Configure the MCP server once with an API key, and every new project in your editor can tap the same indexed docs without reconfiguration. For deeper technical specifics and setup guides, Ref.tools - Documentation Search for AI Coding Assistants walks through the full agent-centric workflow.
Exa: The Speed-Reader for Code
Exa plays the opposite role to ref.tools: where ref.tools is your meticulous in-house librarian, Exa is the street-smart speed-reader for the entire coding internet. Wired into Claude via MCP, it specializes in high-signal, low-latency search across public technical content, from docs and blog posts to GitHub issues and Stack Overflow threads.
While ref.tools indexes your PDFs, private repos, and vendor docs, Exa attacks the open web with ranking tuned for code. Ask for “Tailwind v4 CSS variables design tokens customization” or “Shadcn UI + Next.js route handlers,” and Exa surfaces pages that actually solve the problem instead of generic SEO sludge. You get fewer links, but each one earns its place in your context window.
Speed matters when you’re chaining tools. MCP agents often fire multiple queries per task—scan the codebase, check framework docs, verify API usage. Exa responds fast enough that a fast model like Claude Haiku can loop through several research steps without ballooning latency or burning thousands of junk tokens on irrelevant pages.
Ref.tools shines when the answer lives in your world: your SDK, your auth rules, your internal design system. Exa shines when you need the world’s knowledge: a niche library, a breaking change in Tailwind v4, or a subtle bug buried in a GitHub discussion from last week. One keeps your private context razor-sharp; the other keeps you from being trapped inside your own repo.
Used together, they cover every surface area of a modern stack:
- ref.tools: private docs, vendor docs, internal PDFs, GitHub repos
- Exa: public web, framework docs, community examples, recent fixes
That combo means your AI assistant pulls only what matters—from your own systems and the wider ecosystem—while still slashing context usage by well over 90% compared to naive “paste the docs” workflows.
Wiring It Up: The Command-Line Method
The command line is the fastest way to bolt these MCPs onto your workflow, whether you live in VS Code’s integrated terminal or Anthropic’s Claude Code. You only need the `claude` CLI, a ref.tools account, and an API key from Exa.
Start with ref.tools. After you create an account, head to its MCP settings page, generate an API key, and copy it along with the MCP server URL the dashboard shows. In your terminal, wire it up with a single command:
- `claude mcp add --transport http ref.tools YOUR_REF_MCP_URL --header "Authorization: Bearer YOUR_REF_API_KEY"`
That `--header` flag matters: remote MCP servers authenticate through HTTP-style headers rather than environment variables. The CLI writes the entry to a local MCP config file; add `--scope user` if you want every Claude Code project on your machine to see it, not just the current one.
Next, add Exa for high-speed code search. Grab an API key from Exa’s dashboard, then run:
- `claude mcp add --transport http exa YOUR_EXA_MCP_URL --header "x-api-key: YOUR_EXA_API_KEY"`
Ref.tools and Exa use different header names, so copying the exact string from each provider’s dashboard avoids subtle 401 errors. If the provider gives you a prebuilt command, you can paste it directly into the terminal; just replace the placeholder key with your real one.
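For reference, the stored entries end up as plain JSON in the CLI’s MCP config. The exact file name and fields depend on your Claude Code version and chosen scope, so treat this as a sketch of the general shape, with placeholder URLs standing in for whatever each provider’s dashboard gives you:

```jsonc
{
  "mcpServers": {
    "ref.tools": {
      "type": "http",                 // remote server reached over HTTP
      "url": "YOUR_REF_MCP_URL",      // placeholder: copy from the ref.tools dashboard
      "headers": { "Authorization": "Bearer YOUR_REF_API_KEY" }
    },
    "exa": {
      "type": "http",
      "url": "YOUR_EXA_MCP_URL",      // placeholder: copy from the Exa dashboard
      "headers": { "x-api-key": "YOUR_EXA_API_KEY" }
    }
  }
}
```

Seeing raw keys sitting in a plain-text file is exactly why the next step matters.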
Security is non‑negotiable here. Those MCP configs live in your home directory or local project folder, which means `git add .` can accidentally vacuum them into your repo. Add patterns like the following to `.gitignore`:
- `.claude-mcp*`
- `mcp.config.*`
- `*.local.json`
Keep API keys in local config only, never in shared code or CI logs.
To confirm everything actually works, ask the CLI what it sees:
- `claude mcp list`
You should spot `ref.tools` and `exa` in the active servers list, each marked as available. If either is missing or shows as unreachable, recheck the header name, key value, and that you didn’t paste extra quotes or whitespace.
The 'One-Click' Cursor IDE Integration
Cursor turns MCP setup from a terminal ritual into a UI shortcut. Instead of editing dotfiles, you open the IDE, hit settings, and wire in ref.tools and Exa in under a minute. No shell, no guessing where your config lives.
Open Cursor, click the gear icon, and jump into Tools & MCPs. This panel lists every active tool and any custom MCP servers you’ve already added, so you can see at a glance what your AI has access to.
To hook in ref.tools, scroll to “Custom MCP servers” and hit “Add custom MCP server.” Cursor pops a form with a name, URL, and an optional JSON config block where you can paste the exact snippet ref.tools generates. That JSON usually includes the MCP server URL plus headers for authentication.
Grab those details from the ref.tools dashboard under the “MCP” tab. You’ll see a prebuilt config with:
- Server URL
- Protocol version
- Headers with an `Authorization` field
Paste that JSON into Cursor’s config box, then drop your ref.tools API key into the designated field if Cursor separates keys from headers. Cursor stores it locally, so your key never needs to live in source control.
Exa follows the same pattern. Head to the Exa dashboard, open the API section, and generate a key if you don’t have one. Copy the MCP URL and any sample JSON config they provide, then add a second custom MCP server in Cursor with those values and your Exa API key.
Under the hood, Cursor speaks the same Model Context Protocol as your CLI setup, just with a friendlier wrapper. If you want to sanity-check what’s happening, Model Context Protocol - Official Documentation breaks down the JSON schema Cursor consumes. Once both servers connect, Cursor’s AI can call ref.tools for docs and Exa for code search automatically, without you touching a terminal.
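If you prefer to see the file itself, Cursor persists these entries in an `mcp.json` file (per-user or per-project, depending on where you add them). The sketch below mirrors the config shape from the CLI section; the URLs are placeholders for whatever each dashboard provides, and field names can shift between Cursor releases:

```jsonc
{
  "mcpServers": {
    "ref.tools": {
      "url": "YOUR_REF_MCP_URL",      // placeholder from the ref.tools MCP tab
      "headers": { "Authorization": "Bearer YOUR_REF_API_KEY" }
    },
    "exa": {
      "url": "YOUR_EXA_MCP_URL",      // placeholder from the Exa dashboard
      "headers": { "x-api-key": "YOUR_EXA_API_KEY" }
    }
  }
}
```

Whether you paste JSON or fill in the form fields, the result is the same: both servers show up as callable tools the next time the AI plans a task.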
The Tailwind V4 Refactor Gauntlet
Refactor gauntlets don’t get more brutal than a framework jump mid-stream. Ray Fernando’s test case: upgrade an existing production app to Tailwind v4, align it with shadcn/ui, and unify a messy, half-forked design system without breaking the UX. The app, Anime Leak, already ships real features—image uploads, AI-generated “leaking” anime overlays, galleries, sharing—so regressions are not theoretical.
Tailwind v4 rewires how you think about styling: CSS variables, design tokens, and a new configuration story that wants a coherent system instead of ad-hoc utility soup. That alone demands careful reading of the latest Tailwind docs, migration notes, and examples. Now mix in a forked codebase with legacy Tailwind v3 classes, light-mode-first layouts, and a dark-themed landing page from a different author.
Perfect stress test material, because success requires two kinds of reasoning at once. The agent has to internalize a new design-token-based Tailwind mental model from documentation. Then it must scan dozens of components, pages, and layout files to infer the app’s de facto design system and reconcile it with Tailwind v4 and shadcn/ui.
Ray hands this to Claude’s Haiku 4.5 model running as an agent in Claude Code, with a very explicit brief. The prompt: use the `ref` MCP to read Tailwind v4 and design-system docs, and use the `exa` MCP to search broadly across real-world code and patterns. Only after that research phase should it traverse the Anime Leak repo and propose a unified token and theme strategy.
The instruction goes further: treat Tailwind v3 hard-coded classes as suspects to be normalized into v4-style tokens and variables. Respect the existing shadcn/ui primitives, but bring color, spacing, and typography into one consistent hierarchy that works across light and dark modes. No hand-holding, no pre-curated snippets.
Stakes sit squarely on context discipline. A naive setup would slam 50,000–100,000 tokens of Tailwind docs, shadcn docs, and app code into the window and hope the model doesn’t melt. Here, the question is sharper: can a tool-aware agent, constrained by `ref` and `exa`, stream just-enough documentation and just-enough code slices to stay under a few thousand tokens at a time—and still ship a correct, end-to-end Tailwind v4 refactor plan?
Watching the AI Cook: Tokens vs. Terabytes
Context windows usually feel like a ceiling. Here, they turned into a rounding error. Using ref.tools and Exa through MCP, the Tailwind V4 refactor agent pulled everything it needed—Tailwind docs, ShadCN patterns, and the Anime Leak codebase—using roughly 2,800 tokens end-to-end.
On a model with a 200,000-token context window, that 2,800-token footprint represents about 1.4% of the available space. Flip the ratio: the system left 98.6% of the window untouched, a 97%+ reduction compared to the classic “paste half the docs site into chat” workflow.
Contrast that with the old way the creator describes: shoving ~100,000 tokens of raw documentation into the model just to get started. A couple of SDK guides, auth rules, and database docs, and you were already halfway to max context before writing a single line of code.
Ref.tools and Exa invert that pattern. Instead of preloading everything, the agent calls these MCP servers to run targeted searches, fetch only the relevant slices, and stream them back as needed. No 20,000-token HTML blobs, just trimmed excerpts aligned with the current subtask.
You can see the payoff in the plan the agent generates once it finishes its reconnaissance. After reading Tailwind V4 docs via ref.tools and scanning the repo with Exa, it proposes a stepwise strategy rather than a vague refactor wish list.
The plan breaks down into concrete passes, for example:
- Audit existing Tailwind V3 utility usage and custom classes
- Map legacy tokens and colors to Tailwind V4 design tokens and CSS variables
- Align ShadCN components with the new shared design system
- Update config, layouts, and critical UI flows for consistent light/dark behavior
Each step traces directly back to context the agent actually read: Tailwind V4’s new design tokens model, ShadCN’s component patterns, and the current Anime Leak theming. Because the MCPs only surfaced those specific sections, the model did not waste tokens on marketing pages, changelog noise, or unrelated APIs.
That focus matters for quality as much as for cost. With only 2,800 carefully chosen tokens in play instead of a 100k-token slurry, the agent can keep the entire refactor plan, key Tailwind rules, and live code snippets simultaneously “in mind” without context rot. The result feels less like autocomplete and more like a lead engineer walking through a migration checklist.
Agentic Workflows Just Leveled Up
Agentic workflows stop being a parlor trick once you can pull 2,800 hyper-relevant tokens instead of spraying 100,000 at the wall. Ref.tools and Exa don’t just save money; they radically expand the surface area of problems you can hand off to an AI without choking its context window into uselessness.
Multi-step agents used to hit a hard ceiling: a couple of docs, a medium-size codebase, and everything turned to mush. With token-efficient MCPs, you can chain dozens of research hops—framework docs, SDK examples, internal RFCs, GitHub issues—while still staying under 10,000 tokens of live context.
That opens the door to workflows that look a lot more like real software projects. An agent can now:
- Map an unfamiliar monorepo
- Compare three competing libraries
- Align with an internal design system
- Propose a migration plan with explicit trade-offs
Cursor’s Plan Mode is where this becomes obvious. Instead of jumping straight to code, the agent can spend 20–30 tool calls purely on reconnaissance: scanning Tailwind v4 docs through ref.tools, trawling code patterns via Exa, and building a step-by-step refactor plan—without detonating your context budget.
Previously, that level of upfront planning meant either manual work or burning through hundreds of thousands of tokens on naive RAG. With ref.tools routinely cutting context by 50–70%, and scenarios like Ray Fernando’s Tailwind refactor landing around 2,800 tokens instead of ~100,000, Plan Mode suddenly scales to “weekend project” complexity, not just “single file fix.”
This is the quiet shift from autocomplete to AI partner. Code-completion models guess the next line; MCP-driven agents can justify why a migration path makes sense, cite the exact API changes, and point to the three files in your repo that violate the new contract.
Once context stops being the bottleneck, the limiting factor becomes process design, not token math. You start thinking in terms of playbooks—“greenfield feature spec,” “SDK upgrade,” “design system unification”—and wiring agents to run them end-to-end. For a sense of how fast this ecosystem is expanding, Awesome MCP Servers - Curated List already tracks dozens of specialized backends ready to plug into these workflows.
Build Your New AI Coding Stack
Context bloat is now a choice, not a constraint. A stack built around ref.tools and Exa gives you an AI pair programmer that reads terabytes while your model sees only the ~3,000 tokens that matter.
Ref.tools acts as your agentic search layer: it indexes public docs, private PDFs, and entire GitHub repos, then feeds your model only the most relevant ~5,000 tokens per query instead of spraying 20,000+ raw tokens from scraped pages. In practice, that means 50–70% fewer tokens on typical tasks and up to 95–99% savings on gnarly documentation hunts.
Exa complements that by doing high‑quality, code‑aware search across the web at speed. Instead of jamming SDK docs, auth rules, and provider guides directly into your prompt, your agent calls Exa to find the right snippets, then uses ref.tools to hydrate them into precise, minimal context.
You get three compounding wins at once:
- Massive token reduction (from 100k‑token frenzies down to ~2,800 tokens in the Tailwind v4 refactor)
- Better model behavior (less context rot, more focused reasoning)
- Faster feature delivery (agents spend time coding, not hallucinating docs)
Best part: this stack rides on the open Model Context Protocol (MCP), so it works across models and editors. Claude, xAI, OpenAI, local models, VS Code, Cursor, Zed, cloud IDEs—if it speaks MCP, it can use these tools.
Set it up once, then let every new project inherit the benefits. Configure ref.tools and Exa at the user level, keep API keys out of your repos, and your next “read the docs + refactor the codebase” task becomes a single agentic prompt instead of a weekend.
Install them now:
- ref.tools: https://ref.tools
- Exa: https://exa.ai
- MCP spec: https://modelcontextprotocol.io
Frequently Asked Questions
What is an MCP (Model Context Protocol) server?
An MCP server is a specialized service that acts as an intelligent data source for AI models. Instead of raw web searches, it provides focused, relevant, and token-efficient context for specific tasks, like searching documentation.
What is 'context rot' in LLMs?
Context rot is the degradation of an LLM's performance when its context window is filled with excessive or irrelevant information. This 'noise' makes the model less accurate and 'dumber' for the specific task at hand.
How do ref.tools and Exa actually save tokens?
Ref.tools uses intelligent, model-centric search to find and extract only the most relevant snippets from documentation. Exa provides high-quality, fast search for coding tasks. Together, they prevent dumping thousands of unnecessary tokens into the context.
Which code editors support these MCPs?
These MCPs can be used in any environment that supports the Model Context Protocol. The video demonstrates setup in terminal-based tools like 'Claude Code' and AI-native IDEs like Cursor, which has built-in support.