The RAG Stack Devs Actually Use

Building reliable RAG systems often means wrestling a complex mess of competing tools. Discover the minimalist Python stack—MongoDB, Pydantic AI, and Docling—that delivers powerful hybrid search without the headache.

tutorials
Hero image for: The RAG Stack Devs Actually Use

Your RAG Is Probably Lying To You

RAG systems feel magical right up until they confidently answer the wrong question. Most failures trace back to a single culprit: pure semantic search. You embed your documents, run a vector similarity query, and hope the closest chunk contains the exact fact you need. Too often, it doesn’t.

Embedding models excel at fuzzy meaning, not surgical precision. They compress paragraphs into 768‑ or 1,536‑dimensional vectors that emphasize high‑level concepts over exact tokens. That tradeoff means they routinely blur details like part numbers, dates, legal citations, and function names into “similar enough” neighborhoods.

Ask a typical RAG stack for “What was our revenue in 2025?” and watch it hallucinate with confidence. The embedding for that question sits very close to chunks discussing “revenue in 2023” or “projected revenue over the next three years.” Semantic similarity says those are near‑identical; your system happily returns 2023 numbers as if they came from 2025.

That behavior is not a bug in embeddings; it is the design. Models trained on general text optimize for conceptual alignment: “revenue growth in 2023” and “revenue forecast for 2025” share most of their signal. The model treats the year as a minor detail, even though for finance, law, or engineering, the year often is the answer.

The same failure mode hits other high‑precision queries. Ask for:

  • “Section 3.2.1 of the contract”
  • “Error code 0x80070005”
  • “function init_user_session in auth.py”

A pure semantic search often surfaces nearby concepts—“termination clause,” “Windows access denied errors,” “user session initialization”—instead of the exact clause, error code, or function definition you actually need.

Enterprise teams feel this acutely. Compliance tools must distinguish §230 from §320. Support copilots must differentiate SKU-AX12B from SKU-AX21B. Internal engineering bots must not mix v1.2.3 and v1.3.2 release notes. A single swapped digit can mean a different product, statute, or API.

Core problem: RAG stacks chase conceptual understanding while underinvesting in precision. Without mechanisms that respect exact strings, numbers, and identifiers, your “knowledge assistant” behaves more like an opinionated autocomplete than a trustworthy retrieval system.

The Hybrid Search Fix That Just Works

Illustration: The Hybrid Search Fix That Just Works

Hybrid search quietly fixes most of what’s broken in naive RAG. Instead of betting everything on fuzzy embeddings or clinging to brittle keyword filters, it runs semantic search and keyword search together on every query, then fuses the results. You get conceptual understanding and literal string matching in a single pass.

Semantic search excels at “find me things like this,” even if the wording changes. Ask, “How does this contract handle early termination?” and vector search will surface clauses that never use the phrase “early termination” but describe the same idea. Keyword search, by contrast, nails exact phrases, IDs, numbers, and rare domain terms that embeddings routinely smear into noise.

Hybrid search keeps both engines online all the time. For each question, the system computes a query embedding, runs a vector similarity search, and also runs a text or keyword query over the same corpus. A rank-fusion step—MongoDB ships a $rankFusion operator specifically for this—merges the two ranked lists into a single, more reliable set of chunks.

That “use both, always” rule matters more than which LLM or embedding model you pick. Pure semantic pipelines hallucinate specifics: invoice numbers, error codes, function names. Pure keyword pipelines miss paraphrases and higher-level questions like “compare the risk sections across these three contracts.” Hybrid search covers both ends without asking users to toggle modes or craft special syntax.

Cole Medin’s reference stack shows how little machinery you actually need. A Python agent built with PydanticAI, MongoDB, and Docling ingests PDFs, Microsoft Word docs, markdown, and more, then stores chunked text plus embeddings in a single MongoDB collection. Every query fans out to both MongoDB Atlas Vector Search and classic text search, then gets fused before the LLM ever sees context.

That one architectural decision—always doing hybrid retrieval—dramatically stabilizes RAG behavior across almost any use case: legal review, customer support, internal wikis, even messy engineering docs. You stop debating “semantic vs. keyword” as a binary choice and start treating them as complementary signals. Accuracy goes up, variance goes down, and your RAG stops lying as often.

The Minimalist Stack for Maximum Power

Most RAG tutorials drown you in microservices and orchestration frameworks. This stack goes the opposite direction: three moving parts, end to end. MongoDB, Pydantic AI, and Docling form a compact pipeline that still scales to millions of chunks and dozens of concurrent agents.

MongoDB sits at the center as a unified store for everything: raw documents, chunked passages, embeddings, and metadata. One collection can hold chunk text, a 1,536‑dimension embedding vector, source file info, and page numbers, all indexed with Atlas Vector Search and classic text search. That single database then powers hybrid search, fuzzy matching, and Reciprocal Rank Fusion without a separate vector DB.
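
A rough sketch of what that looks like in code (field names, the database URI, and the index definition are illustrative, not taken from the reference repo; check Atlas's current vectorSearch index docs for the exact definition format):

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://<your-atlas-uri>")
chunks = client["rag"]["chunks"]

# One document per chunk: text, embedding, and metadata live side by side.
chunks.insert_one({
    "text": "FY2025 revenue grew year over year, driven by...",
    "embedding": [0.0] * 1536,  # placeholder; real vectors come from your embedding model
    "source": {"file": "annual_report_2025.pdf", "page": 12},
    "heading": "Financial Highlights",
    "doc_type": "report",
})

# Atlas Vector Search index over the embedding field, with doc_type usable as a filter.
chunks.create_search_index(SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={"fields": [
        {"type": "vector", "path": "embedding", "numDimensions": 1536, "similarity": "cosine"},
        {"type": "filter", "path": "doc_type"},
    ]},
))
```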

Pydantic AI handles the agent brain without dragging in a full-blown workflow engine. You define tools as plain Python functions, wire them into an agent, and let the framework handle model calls, retries, and structured outputs. Its type‑first design means retrieval results from MongoDB arrive as validated models instead of fragile dicts, mirroring patterns in the official Pydantic AI – RAG Example.

Docling closes the loop on ingestion, which is where most RAG projects quietly fail. It parses PDFs, Microsoft Word documents, markdown, and even audio transcripts into structured text with headings, tables, and layout cues. That structure feeds directly into Docling’s hybrid chunking so you store semantically coherent sections instead of random 500‑token slices.

Together, these three tools form a golden triangle for production RAG. MongoDB provides durable storage and fast hybrid retrieval, Pydantic AI orchestrates agents and tools cleanly, and Docling guarantees reliable input data. You get a stack that fits in a single Python repo, runs locally or in the cloud, and adapts to almost any use case without swapping components every month.

MongoDB: Your Database and Vector Store in One

MongoDB sits at the center of this stack, acting as both the primary document database and the vector store. Instead of shuttling embeddings off to a separate service, MongoDB Atlas Vector Search lets you attach high‑dimensional vectors directly to the same documents that hold your parsed content, metadata, and chunk references. One collection stores everything: chunk text, embedding arrays, document IDs, headings, and timestamps.

That single‑system design quietly kills an entire class of headaches. No Pinecone, Weaviate, or bespoke vector cluster to provision, secure, and monitor; no sync jobs to keep your operational data and embeddings from drifting out of date. When Docling ingests a new Microsoft Word file or PDF and the agent generates embeddings, those writes land in one place, under one consistency model, with one backup story.

Operationally, that matters more than any benchmark chart. Teams already running MongoDB for their app data can bolt on Atlas Vector Search without adding new infrastructure or vendors to their threat model. Role‑based access control, VPC peering, encryption at rest, and auditing apply equally to raw documents and their embeddings, so security teams do not need to chase two different permission schemes.

Data consistency also becomes boring in the best way. You avoid dual writes to “documents DB” and “vector DB,” avoid race conditions between ingestion and indexing, and avoid background sync workers that inevitably fail at 2 a.m. A single write transaction can update the chunk text, its embedding, and any denormalized metadata, which keeps RAG answers aligned with the actual source of truth.

Hybrid search lives directly in MongoDB’s aggregation pipeline. A typical query stages a `$vectorSearch` on the embeddings field to grab semantically relevant chunks, then blends it with lexical operators like `$search`, `$text`, or `$regex` for keyword‑level precision. You can run both in one pipeline and fuse the results with custom scoring logic or MongoDB’s `$rankFusion` operator.

That fusion step gives you fine‑grained control. You can boost exact phrase matches for error codes or IDs, down‑weight older documents, or filter by fields like `doc_type` and `tenant_id` before the LLM ever sees a token. All of it runs server‑side, close to the data, which keeps latency low and makes the “simple stack” claim more than marketing.
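
A minimal sketch of that pipeline, assuming a MongoDB/Atlas release recent enough to ship `$rankFusion` (index names, field paths, and the 2:1 weighting below are placeholders, not values from the reference project):

```python
def hybrid_search(chunks, query_text: str, query_vector: list[float], limit: int = 10):
    """Run vector and full-text search in one aggregation and fuse the ranks server-side."""
    pipeline = [
        {"$rankFusion": {
            "input": {"pipelines": {
                # Semantic arm: nearest neighbours on the embedding field.
                "vector": [{"$vectorSearch": {
                    "index": "vector_index",
                    "path": "embedding",
                    "queryVector": query_vector,
                    "numCandidates": 100,
                    "limit": limit,
                }}],
                # Lexical arm: Atlas Search text query over the chunk text.
                "lexical": [
                    {"$search": {"index": "text_index",
                                 "text": {"query": query_text, "path": "text"}}},
                    {"$limit": limit},
                ],
            }},
            # Nudge exact lexical matches (error codes, IDs) above purely semantic hits.
            "combination": {"weights": {"vector": 1, "lexical": 2}},
        }},
        {"$limit": limit},
        {"$project": {"text": 1, "source": 1, "heading": 1}},
    ]
    return list(chunks.aggregate(pipeline))
```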

Why Pydantic AI Is Replacing LangChain

Illustration: Why Pydantic AI Is Replacing LangChain

Pydantic AI slips into this stack as the agent framework, but its secret weapon is lineage. It comes from Pydantic, the data validation library that quietly powers thousands of Python backends, FastAPI apps, and internal tools. That heritage means strong typing, strict schemas, and predictable behavior instead of vibes-based prompt hacking.

Where LangChain sprawls, Pydantic AI trims. LangChain ships with dozens of abstractions—chains, runnables, executors, retrievers—that can feel like a DSL on top of Python. Pydantic AI stays closer to the language: you write normal functions, define clear input and output models, and let the framework handle the LLM calls and tool wiring.

A Pydantic AI “tool” looks like idiomatic Python, not framework magic. Conceptually, a MongoDB hybrid-search tool might resemble:

  • A Pydantic model describing the tool’s arguments (query string, limit, filters)
  • A plain async function that runs a MongoDB aggregation with vector and keyword stages
  • A Pydantic model for the return type (chunks, scores, metadata)

The framework then exposes that function to the model as a typed tool, so the LLM calls it with structured arguments instead of raw JSON blobs.
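
A hedged sketch of that shape, using Motor for async MongoDB access (the tool name, the `Deps` fields, and the `embed` helper are illustrative; the semantic-only pipeline stands in for the full hybrid query shown earlier in the MongoDB section):

```python
from dataclasses import dataclass
from typing import Awaitable, Callable

from motor.motor_asyncio import AsyncIOMotorCollection
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext


class Chunk(BaseModel):
    """Validated shape of every retrieval result handed back to the model."""
    text: str
    source: str
    score: float | None = None


@dataclass
class Deps:
    chunks: AsyncIOMotorCollection                  # collection of chunk documents
    embed: Callable[[str], Awaitable[list[float]]]  # async text -> embedding helper


agent = Agent("openai:gpt-4o", deps_type=Deps)


@agent.tool
async def search_knowledge_base(
    ctx: RunContext[Deps],
    query: str,
    limit: int = Field(10, ge=1, le=20),
) -> list[Chunk]:
    """Search the knowledge base and return the top matching chunks."""
    query_vector = await ctx.deps.embed(query)
    pipeline = [
        # Semantic arm only, for brevity; swap in the $rankFusion pipeline for hybrid search.
        {"$vectorSearch": {"index": "vector_index", "path": "embedding",
                           "queryVector": query_vector,
                           "numCandidates": 100, "limit": limit}},
        {"$project": {"text": 1, "source": 1}},
    ]
    docs = await ctx.deps.chunks.aggregate(pipeline).to_list(length=limit)
    return [Chunk(text=d["text"], source=str(d.get("source", ""))) for d in docs]
```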

Type-safety becomes a real feature, not marketing copy. If your tool expects a `limit: int = Field(ge=1, le=20)`, Pydantic AI validates it before your code ever hits MongoDB. If your agent must return a `Response` model with `answer: str` and `sources: list[str]`, you catch violations immediately instead of debugging half-parsed model output at 2 a.m.

Transparency might be the biggest differentiator. Pydantic AI avoids hidden planners and opaque routing graphs that decide when to call which tool. You can still build multi-step agents, but you keep explicit control over when the model searches, when it writes, and how it loops, using normal Python control flow.

For many RAG projects—dashboards, internal knowledge bots, coding assistants—Pydantic AI hits a sweet spot. You get structured tools, streaming, retries, and multi-turn state without swallowing a massive framework or reverse-engineering a black box. Most teams do not need LangChain’s full graph engine to ship a reliable hybrid-search agent; they need something they can read, debug, and extend in a single file.

Stop Fighting PDFs: Ingesting Data with Docling

RAG systems live or die on their ingestion pipeline. If your PDFs get mangled, tables vanish, and headings blur into body text, no amount of clever hybrid search will save you. Garbage chunks in means misleading retrieval out, no matter how fancy your embeddings or MongoDB queries look.

Docling targets that bottleneck directly. The library parses messy real-world files — PDF, Microsoft Word, markdown, HTML, even images and audio transcripts — into a structured document model instead of a flat text dump. That structure preserves pages, headings, lists, tables, captions, and metadata so your downstream embeddings actually understand where a passage came from.

Under the hood, Docling runs layout analysis to separate columns, detect reading order, and keep tables intact instead of shredding them into incoherent lines. It exposes clean text plus rich metadata like page numbers, section titles, and element types, which you can store right alongside embeddings in MongoDB. When your agent answers a question, you can point back to “page 37, Methods section” instead of a mystery chunk.

For hybrid RAG, this metadata becomes retrieval fuel. You can index fields like `section`, `doc_type`, or `heading` and combine them with Atlas Vector Search in a single aggregation pipeline. Ask for “latency benchmarks in the appendix,” and your query can filter by section metadata before running semantic search, dramatically cutting noise.
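
A sketch of that pre-filtered query (the `section` field and index name are assumptions about your own schema; `$vectorSearch` accepts a `filter` clause only on fields indexed as filter fields in the vector index):

```python
def appendix_search(chunks, query_vector: list[float], limit: int = 5):
    """Semantic search restricted to appendix chunks via a metadata pre-filter."""
    pipeline = [
        {"$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,   # embedding of "latency benchmarks in the appendix"
            "numCandidates": 200,
            "limit": limit,
            # Narrow the candidate set before similarity ranking kicks in.
            "filter": {"section": {"$eq": "Appendix"}},
        }},
        {"$project": {"text": 1, "section": 1, "source": 1}},
    ]
    return list(chunks.aggregate(pipeline))
```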

Chunking is where Docling quietly earns its keep. Naive fixed-size chunks ignore document structure and slice through paragraphs, code blocks, or tables. Docling’s hybrid chunking strategy mixes semantic boundaries (headings, paragraphs) with size constraints so you get pieces that are both contextually coherent and embedding-friendly.
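
A minimal parsing-plus-chunking sketch (the API calls follow Docling's current docs; the file name and the metadata pulled out of each chunk are illustrative):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Parse a messy real-world file into Docling's structured document model.
doc = DocumentConverter().convert("annual_report_2025.pdf").document

# Hybrid chunking: semantic boundaries (headings, paragraphs) plus size constraints.
chunker = HybridChunker()
records = []
for chunk in chunker.chunk(dl_doc=doc):
    records.append({
        "text": chunk.text,
        # Headings and provenance travel with the chunk as plain metadata fields.
        "headings": list(getattr(chunk.meta, "headings", None) or []),
        "source": "annual_report_2025.pdf",
    })
```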

That hybrid approach shines on long technical reports and manuals. A single 100-page PDF can yield hundreds of chunks, each aligned to logical units like “2.3 Authentication Flow” instead of arbitrary token windows. Your LLM sees self-contained sections with intact diagrams, bullet lists, and surrounding explanation, which cuts hallucinations and improves answer grounding.

Docling’s design is intentionally backend-agnostic, so the same ingestion pipeline works whether you store embeddings in MongoDB, Postgres, or OpenSearch. For an example outside this stack, see the official Docling – RAG with OpenSearch Example, which uses the same parsing and chunking primitives against a different search engine.

From Raw Files to Smart Answers: The Full Flow

Raw documents enter this system once, then never stay “raw” again. Files from PDFs, Microsoft Word, markdown, or HTML flow through Docling, which normalizes layout, extracts clean text, and preserves structure like headings, lists, and tables as metadata.

Docling’s output feeds a hybrid chunker that slices content along semantic and structural boundaries. Each chunk gets an ID, source document reference, position (page, section), and any tags you care about—product name, customer, environment—stored as plain fields alongside text.

From there, a dedicated embedding model (OpenAI, Cohere, or an in-house model) converts every chunk into a fixed-length vector, typically 768–3072 dimensions. Code then writes a single MongoDB document per chunk: `{ text, embedding, metadata… }`, indexed by Atlas Vector Search plus regular text and keyword indexes.

That one-time ingestion pipeline looks like:

  • Files → Docling (parse + structure)
  • Docling output → hybrid chunking
  • Chunks → embedding model
  • Chunks + embeddings → MongoDB collection
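
Picking up where the chunking sketch left off, the embed-and-store step might look like this (model name and field names are placeholders; any embedding provider works as long as ingestion and querying use the same model):

```python
from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
chunks_coll = MongoClient("mongodb+srv://<your-atlas-uri>")["rag"]["chunks"]


def embed_and_store(records: list[dict]) -> None:
    """records: dicts with at least a 'text' key, as produced by the chunking step."""
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small",  # 1,536-dimension vectors
        input=[r["text"] for r in records],
    )
    for record, item in zip(records, resp.data):
        record["embedding"] = item.embedding
    chunks_coll.insert_many(records)  # one MongoDB document per chunk
```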

When a user types a question, a Pydantic AI agent takes over. It validates the request into a strict schema, enriches it with system settings (temperature, tools), and generates a query embedding using the same model as ingestion.

The agent sends a hybrid search to MongoDB: one stage for vector similarity, one for text/keyword search, fused via `$rankFusion` or a custom scoring pipeline. MongoDB returns the top 10–20 ranked chunks, complete with source metadata so answers remain traceable.

Pydantic AI wraps those chunks into a retrieval-augmented prompt and calls the LLM. The model answers using only the supplied context, while the agent enforces structure—JSON outputs, citations, or tool calls—before streaming the final response back to the user in real time.
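
Stripped of the retrieval tool wiring shown earlier, the structured-output contract alone looks roughly like this (`output_type` and `result.output` follow recent Pydantic AI releases; older versions call them `result_type` and `result.data`):

```python
import asyncio

from pydantic import BaseModel
from pydantic_ai import Agent


class Answer(BaseModel):
    answer: str
    sources: list[str]


# The agent validates every response against the Answer schema before returning it.
agent = Agent(
    "openai:gpt-4o",
    output_type=Answer,
    system_prompt="Answer only from the retrieved context and cite your sources.",
)


async def main() -> None:
    result = await agent.run("What was Neuroflow's revenue in 2025?")
    print(result.output.answer, result.output.sources)


asyncio.run(main())
```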

Hybrid Search in Action: Real-World Queries

Illustration: Hybrid Search in Action: Real-World Queries

Hybrid search stops being theoretical the moment you throw an ugly, specific query at it. Cole Medin’s demo agent runs on a small MongoDB collection populated via Docling, then answers questions live so you can see semantic and keyword search fight for control on every request. MongoDB runs both modes in parallel inside a single aggregation pipeline and fuses the results with Reciprocal Rank Fusion, so whichever mode ranks a chunk higher gets more influence.

Ask for “Neuroflow revenue 2025” and you watch keyword search carry the whole exchange. The query embedding for “2025” barely helps, because most embedding models treat years as generic tokens. MongoDB’s text search, however, locks onto the literal “2025” and nearby financial phrases, surfacing a single table row and sentence that mention “Neuroflow,” “revenue,” and “2025” together.

Pure semantic search on that same question tends to drag in 2024 or 2023 forecasts, or generic “future revenue” commentary, because the vector space clusters all forward-looking financial language. Hybrid search fixes that by letting lexical search veto semantically similar but numerically wrong chunks. The agent then feeds those precise lines into the LLM, which can safely quote the 2025 figure without hallucinating a number.

Change the query to “timeline for the Converse Pro launch” and the roles reverse. The source deck uses headings like “Launch Plan” and “Go-to-Market Schedule,” never the word “timeline.” A naive keyword engine would miss the relevant section entirely or fall back to any stray “launch” mention.

Semantic search, via MongoDB Atlas Vector Search, understands that “timeline,” “schedule,” and “rollout plan” live in the same conceptual neighborhood. It returns the chunk containing the launch plan bullets, dates, and phases, even though the literal phrasing does not match the query. The agent then summarizes that section into a clean narrative answer instead of just dumping slide text.

Hybrid search shows its full power on fuzzier analytics questions like “revenue breakdown by service line.” Here, the best answer lives in a table where headers say “ARR,” “Professional Services,” and “Platform Fees,” and the body contains dollar amounts and percentages. Users do not always mirror those exact labels in their questions.

Keyword search anchors on “revenue,” “ARR,” and numeric patterns like “$1.2M” and “35%,” ensuring the right table surfaces. Semantic search understands that “service line” maps to columns like “Professional Services” or “Implementation,” not just any occurrence of “service.” Fused together, they pull the exact table chunk, so the LLM can output a structured breakdown instead of a vague summary.

Merging Worlds: How Rank Fusion Works

Hybrid search raises an immediate question: how do you merge two ranked lists that speak completely different scoring languages? Vector similarity spits out cosine distances or dot products; keyword search returns BM25 or text scores. Naively normalizing and averaging those numbers usually fails, because each algorithm’s score distribution shifts as your corpus grows.

Reciprocal Rank Fusion (RRF) sidesteps that mess by ignoring raw scores entirely. Instead, it only cares about where a document appears in each list. A chunk that shows up as rank 1 in vector search and rank 20 in keyword search should still beat a chunk that appears at rank 10 in both.

RRF assigns each document a fused score using a tiny formula: 1 / (k + rank), summed across all result lists. With k typically set to 60, a document ranked 1st in a list gets 1 / 61, rank 2 gets 1 / 62, and so on. Documents absent from a list simply contribute nothing from that list.

This approach has two important properties for RAG. First, it heavily rewards documents that appear near the top of either ranking, even if they only appear in one. Second, it automatically boosts documents that appear in both lists, because their reciprocal scores add together. No manual weight tuning or per-index calibration required.
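
The fusion itself is only a few lines of Python (k = 60 matches the common default mentioned above; the document IDs stand in for MongoDB `_id`s):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one list ordered by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A chunk ranked high in either list floats up; one that appears in both gets both
# contributions added together, with no score normalization or weight tuning.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
keyword_hits = ["chunk_c", "chunk_a", "chunk_d"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```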

Hybrid RAG benefits directly from those traits. Semantic search can surface conceptually similar passages that never mention a keyword, while keyword search nails exact IDs, error codes, and quoted strings. RRF fuses both signals so that a PDF chunk containing “Error 0x80070005” and a semantically related explanation from a Microsoft Word troubleshooting doc can both float to the top.

MongoDB bakes this into Atlas via the $rankFusion stage, which implements RRF inside the aggregation pipeline. You can run one $vectorSearch stage, one $search or $searchMeta stage, then hand both outputs to $rankFusion without leaving the database. No custom Python merge logic, no extra network hops.

For developers building similar stacks with Pydantic AI and Docling, RRF becomes a one-line configuration choice, not an algorithmic research project. For a deeper walkthrough of agent orchestration around this pattern, see How to Build a Powerful RAG Knowledge Base Agent with Pydantic AI.

Build Your First Hybrid Agent Today

Hybrid RAG stops being a research paper buzzword and becomes a pattern you can actually ship. With MongoDB, PydanticAI, and Docling, you get a stack that stays small enough to reason about while covering almost every RAG use case you care about: dense semantic search, exact keyword lookup, and robust ingestion for PDFs, Microsoft Word, markdown, and more.

You do not juggle a separate vector database, a brittle parsing script, and an overcomplicated orchestration framework. One MongoDB cluster stores your chunks, metadata, and embeddings; Docling turns messy files into structured text; PydanticAI wires it all into an agent that calls tools, runs hybrid search, and returns grounded answers instead of hallucinations.

This is not a whiteboard architecture. Cole Medin’s video walks through a working Python agent, end to end, hitting MongoDB’s Atlas Vector Search, running hybrid search with rank fusion, and answering live queries across mixed document types in real time. Latency stays low, complexity stays manageable, and you can trace every step from upload to response.

You can clone exactly what you saw on screen. Grab the GitHub repo Cole published for the MongoDB agent at https://github.com/coleam00/MongoDB-RAG-Agent and run it locally with a handful of environment variables and a MongoDB Atlas cluster. From there, swap in your own documents, tweak chunk sizes, or extend the agent with new tools.

Treat this as a template, not a toy. The same pattern scales from a single internal handbook to thousands of support tickets, contracts, or research papers, without forcing you into a bespoke stack for every new feature. With a minimal hybrid RAG setup that actually works, you can spend less time wrestling infrastructure and more time shipping AI features that users trust.

Frequently Asked Questions

What is hybrid search in RAG?

Hybrid search combines semantic (vector) search, which understands concepts and relationships, with keyword search, which finds exact terms and phrases. This approach provides more accurate and relevant results by leveraging the strengths of both methods for every query.

Why use MongoDB for a RAG system?

MongoDB acts as both a standard NoSQL document database and a vector database through Atlas Vector Search. This consolidates the tech stack, simplifying architecture, deployment, and data management by eliminating the need for a separate, dedicated vector store.

What makes this stack simpler than LangChain or LlamaIndex?

This stack prioritizes simplicity and direct control. Pydantic AI offers a more 'Pythonic,' type-safe approach to building agents without the heavy abstractions of larger frameworks, while MongoDB's integrated nature reduces operational complexity.

Can this stack handle enterprise file formats like PDF and DOCX?

Yes. The stack uses Docling, a powerful library specifically designed for parsing and extracting text from various common file formats, including PDFs, Microsoft Word documents, Markdown, and more, making it ideal for real-world enterprise data.

Tags

#Python #RAG #MongoDB #Pydantic #AI Agents
