Module 10, Episode 1: RAG in Production and Agentic Systems — The Architecture of Enterprise AI

The Failure Mode Nobody Benchmarks

Every major capability benchmark — MMLU (Massive Multitask Language Understanding, a standard AI benchmark), GPQA (Graduate-Level Google-Proof Q&A, a benchmark testing expert-level reasoning), SWE-bench (a benchmark measuring AI performance on real software engineering tasks), and ARC-AGI (the Abstraction and Reasoning Corpus, a benchmark designed to test novel reasoning) — measures what a model knows and how well it reasons in isolation. None of them measure whether the system around the model will retrieve the right information, route it correctly, log the agent's reasoning traces, or stop before it executes an irreversible action. This is the central problem of enterprise AI in 2026: the models are remarkably capable; the systems surrounding them are frequently not.

The post-mortem on failed AI deployments rarely says "the model wasn't smart enough." It says the retrieval pulled irrelevant chunks, the reranker wasn't in the pipeline, the agent called a tool that didn't exist, or no one noticed that a document retrieved from an external API had been adversarially crafted to manipulate the model's reasoning. Shipping a RAG (Retrieval-Augmented Generation, the technique of grounding a language model's answers in documents retrieved at query time) system that survives production — with real users, evolving documents, latency budgets, and access control — takes months. The gap is a system design problem, not a model problem.

That framing organizes this chapter. The architecture decisions that determine whether an enterprise AI deployment succeeds or fails are not the ones that appear in model release announcements. They happen in chunking strategies, retrieval fusion formulas, reranking latency budgets, and the governance checkpoints you build around agent action loops. The failure modes are specific, predictable, and almost entirely absent from the benchmarks organizations use to evaluate vendor claims. Knowing them is what separates an organization that builds production-grade AI from one that builds an expensive demo that erodes trust every time it confidently returns the wrong answer.


Why Naive RAG Fails in Production

The standard RAG tutorial — and most vendor demonstrations — describe a pipeline that is seductively simple. Ingest documents, split them into chunks, embed those chunks into vectors, store them in a vector database, embed the user's query at runtime, retrieve the top-k chunks by cosine similarity, and pass them to the LLM as context. This pipeline works for demos. In production, naive RAG pipelines fail at retrieval roughly 40% of the time — and when retrieval fails, the LLM still generates a confident, well-structured answer, grounded in the wrong documents.
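
To make that baseline concrete, here is a minimal sketch of the naive pipeline, assuming the sentence-transformers package; the model name, chunk size, and overlap are illustrative choices rather than recommendations.

```python
# Minimal sketch of the naive RAG pipeline described above (not production code).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split on a fixed window of whitespace tokens, semantically indifferent."""
    tokens = text.split()
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]

def build_index(documents: list[str]):
    chunks = [c for doc in documents for c in fixed_size_chunks(doc)]
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, vectors

def retrieve(query: str, chunks: list[str], vectors, k: int = 5) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ q                  # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

# The retrieved chunks are pasted into the LLM prompt as-is; every failure mode
# discussed below starts here.
```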

The failure happens at three distinct points.

The first is chunking artifacts. Fixed-size chunking — splitting documents every 512 or 1,024 tokens with some overlap — is computationally convenient and semantically indifferent. It splits sentences mid-thought, tables mid-row, and code mid-function. The retrieved chunk is technically relevant but practically useless. Consider what happens to a regulatory document when a fixed-size chunker runs through it: the chunk boundary falls in the middle of a numbered list defining the conditions under which a reporting obligation applies. The first half of the list appears in chunk 14, the second in chunk 15. A query about those reporting conditions retrieves chunk 14 with high cosine similarity, and the model produces an answer that is half-right — which in compliance contexts means wrong.

Chunking is the highest-leverage index-time decision in any RAG system. The wrong chunk size or strategy degrades retrieval quality more than almost any other factor. The production solution is to respect document structure. Semantic chunking splits where embedding similarity drops, so chunks vary in size but boundaries follow natural structure. Structure-aware chunking splits code by function, markdown by header, and transcripts by speaker. Domain-aware chunking goes further: patents benefit from 1,000–1,500 token chunks to preserve complete claims and technical descriptions, while chat logs perform better with 200–400 token chunks to maintain conversational context.
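
As a sketch of the structure-aware idea, here is an illustrative markdown chunker that splits at headers, with a paragraph fallback for oversized sections; the size threshold and the whitespace token count are simplifying assumptions.

```python
import re

def chunk_markdown_by_header(text: str, max_tokens: int = 1000) -> list[str]:
    """Split at markdown headers so a section is never cut mid-list or mid-table."""
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if len(section.split()) <= max_tokens:
            chunks.append(section.strip())
        else:
            # Oversized sections fall back to paragraph boundaries, not mid-sentence cuts.
            for para in section.split("\n\n"):
                if para.strip():
                    chunks.append(para.strip())
    return [c for c in chunks if c]
```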

The second failure is the semantic gap. Vector similarity measures whether two pieces of text occupy proximate positions in embedding space, not whether one answers the other. If a user asks "Compare the Q3 2025 revenue of the Cloud division vs. the Q3 2024 baseline," a standard vector search might return the 2024 data because the semantic distance between "2024" and "2025" is negligible to an embedding model. To the LLM, that one-digit difference is the difference between a correct insight and a hallucination. The vector space cannot represent the categorical importance of specific tokens — product codes, legal terms, dates, named entities — because embeddings compress meaning into a fixed-dimensional vector and, in doing so, smear the precision that keyword search preserves naturally.

The third failure is context window pollution. When a retriever returns ten chunks and only two are relevant, the LLM receives the other eight as noise. The model averages across all context, producing a mediocre answer. This is not a hallucination problem; the model is doing exactly what it should with the context it receives. The problem is the retriever's inability to distinguish signal from noise before the generation stage.

Industry analysis in 2026 consistently shows that when RAG fails, the failure point is retrieval 73% of the time, not generation. This is the single most important diagnostic fact for any Chief AI Officer evaluating an enterprise AI deployment. The reflex to improve model quality — upgrading from one frontier model to another — addresses the wrong bottleneck. The more powerful the model, the more confidently it will synthesize whatever the retriever gives it, including irrelevant or misleading context.

The hierarchical document representation pattern addresses the chunking-context tradeoff directly. In parent-child chunking, small child chunks sit nested within larger parent chunks. When a child chunk matches the query, the LLM receives the full parent — precision at retrieval, depth at generation. The child chunks are small enough for high-precision matching; the parent chunk is large enough to give the model the surrounding context it needs to generate an accurate, coherent answer. This prevents the half-a-list problem described above, because the child might match on specific terms while the parent delivers the full enumerated structure.
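
A minimal sketch of parent-child chunking, with illustrative token counts: children are indexed for matching, and retrieval expands to the enclosing parent before generation.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    parent_id: int   # which parent this child belongs to

def build_parent_child_index(document: str, parent_size: int = 1200, child_size: int = 250):
    """Split a document into large parents, each subdivided into small children."""
    tokens = document.split()
    parents, children = [], []
    for p_start in range(0, len(tokens), parent_size):
        parent_tokens = tokens[p_start:p_start + parent_size]
        parent_id = len(parents)
        parents.append(" ".join(parent_tokens))
        for c_start in range(0, len(parent_tokens), child_size):
            child_text = " ".join(parent_tokens[c_start:c_start + child_size])
            children.append(Chunk(text=child_text, parent_id=parent_id))
    return parents, children

def expand_to_parent(best_child: Chunk, parents: list[str]) -> str:
    # The child matched the query; the LLM receives the full parent, so an
    # enumerated list or table arrives whole.
    return parents[best_child.parent_id]
```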


Hybrid Search and the Reranking Stage: The Production Standard

The semantic gap problem has a known, well-tested solution: run two retrieval systems in parallel and fuse their results. Despite the neural revolution, BM25 (a classical keyword-ranking algorithm standing for Best Match 25) remains undefeated for finding specific product codes, legal terminology, or unique acronyms. Modern production systems use hybrid search, running vector search for semantic meaning and BM25 for lexical exactness in parallel.

BM25 is a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization. It applies a saturation function to term frequency, which prevents common words from dominating results. It is decades old, not neural, and straightforwardly interpretable — irreplaceable for exact-match retrieval of the kind of specific, structured information enterprises actually care about: contract clause identifiers, product SKUs, drug names, regulatory article numbers.
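
The saturation behavior is easier to see as a formula than as prose. Below is a minimal per-term BM25 scoring sketch with the standard k1 and b free parameters at typical defaults; in a real system, idf and the average document length come from corpus statistics.

```python
def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float, idf: float,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Per-term BM25 contribution: idf weighting with saturating term frequency."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation in action: raising tf from 1 to 2 adds far more than raising it from
# 10 to 20, so a term repeated hundreds of times cannot dominate the score.
for tf in (1, 2, 10, 20):
    print(tf, round(bm25_term_score(tf, doc_len=100, avg_doc_len=120.0, idf=2.0), 3))
```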

The fusion of BM25 and dense vector results is handled by Reciprocal Rank Fusion (RRF), a technique that combines ranked lists from different retrieval systems without requiring their scores to be on the same scale. RRF uses a standard constant (k=60) to score each result by its rank across both search types. A document appearing in the top 5 of both search types receives a substantial boost — filtering out the semantic noise that often plagues pure vector databases. The elegance of RRF is that it requires only rank positions, not calibrated scores from both systems, which would be difficult to compare because BM25 scores and cosine similarities are in different units and at different scales. A document that ranks third in BM25 and fourth in dense retrieval is almost certainly relevant; RRF surfaces it accordingly.
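
RRF itself is a few lines of code. A sketch with invented document IDs and ranked lists, to show that the fused order depends only on rank positions:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists using only rank positions; scores never need calibrating."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Invented ranked lists from the two retrievers.
bm25_hits   = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
vector_hits = ["doc_3", "doc_2", "doc_7", "doc_8", "doc_5"]

# doc_7 and doc_2 appear high in both lists, so they lead the fused ranking.
print(reciprocal_rank_fusion([bm25_hits, vector_hits])[:3])  # ['doc_7', 'doc_2', 'doc_3']
```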

IBM Research's "BlendedRAG" technical report demonstrated that combining vector search, sparse vector search, and full-text search achieves optimal recall for RAG. That is the empirical foundation for hybrid search as a default, not an optimization.

Hybrid retrieval handles the recall problem — ensuring that the right chunks are somewhere in the candidate set. It does not solve the precision problem — ensuring that the right chunks appear at the top of that set. That requires reranking.

The core insight of reranking is architectural. A bi-encoder retriever is fast because it precomputes document embeddings at index time: the query arrives, gets embedded, and cosine similarity against precomputed document vectors is essentially an instantaneous lookup. But encoding the query and document independently throws away interaction signals: the embedding model must compress all of a document's semantics into a single vector with no knowledge of the query, because those embeddings were created before any user query arrived.

A reranking model — also known as a cross-encoder — takes a query and document pair and outputs a similarity score used to reorder documents by relevance. The cross-encoder sees both the query and the document simultaneously, processing them through a single transformer computation where query tokens and document tokens attend to each other. This joint encoding is computationally expensive — a cross-encoder cannot precompute anything and must run a full transformer forward pass for every candidate pair at query time. For a corpus of one million documents, that is not feasible. Applied to a filtered candidate set of 50–100 documents — the output of the hybrid retrieval stage — it is both feasible and transformative.

Empirically, full cross-encoders outperform bi-encoders by up to 10 nDCG (Normalized Discounted Cumulative Gain, a standard ranking quality metric) points on MS MARCO (Microsoft's large-scale dataset for passage retrieval research), with up to 5–7 point nDCG improvements over strong sparse retrievers like SPLADE-v3 (a learned sparse retrieval model). For a retrieval system that determines the grounding of every answer a model produces, the difference between a bi-encoder's approximate ranking and a cross-encoder's precise ranking translates directly into answer accuracy — and in regulated industries, into compliance exposure.

The two-stage retrieve-then-rerank pattern is now the production standard. Stage 1 retrieves broadly: cast a wide net using hybrid search to retrieve 50 or 100 candidates. Stage 2 ranks precisely: use a query-aware model to score each candidate against the specific query and reorder them. The two stages have different jobs and should use different tools. Cohere's Rerank API (currently at rerank-v4.0-pro), Pinecone's built-in reranking (Pinecone is a managed vector database service), and open-source models like the ms-marco-MiniLM family on HuggingFace are all production-deployed implementations of this pattern. Reranking typically adds 50 to 500 milliseconds of latency: a lightweight cross-encoder like TinyBERT adds around 50 milliseconds for 20 documents; Cohere Rerank adds around 200 milliseconds including the API call. Those are acceptable costs given the accuracy gains — and they are dwarfed by the latency of the generation step itself.
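
A sketch of the second stage using an open-source cross-encoder from the ms-marco-MiniLM family via the sentence-transformers package; the specific model and the top_n cutoff are illustrative choices.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly, which is
    # expensive per pair; that is why it only sees the filtered candidate set
    # produced by hybrid retrieval, never the full corpus.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```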


Measuring What You've Built: RAGAS and the Retrieval-Generation Diagnostic

Building a production RAG pipeline without a systematic evaluation framework is like tuning a database query without looking at explain plans. You cannot tell what you have fixed, and you cannot detect regressions when document corpora change or embedding models are upgraded.

RAGAS (Retrieval Augmented Generation Assessment, a framework for automated reference-free evaluation of RAG systems) emerged from research on evaluation methods that do not require human-annotated reference answers. Because human-annotated datasets are rarely available when building RAG systems, RAGAS focuses on metrics that are fully self-contained, concentrating on three quality aspects — beginning with faithfulness, the idea that the answer should be grounded in the given context.

The core RAGAS metrics function as a diagnostic panel, not a single score. Retriever metrics include contextual recall, precision, and relevancy — used for evaluating top-k values and embedding models. Generator metrics include faithfulness and answer relevancy — used for evaluating the LLM and prompt template. Faithfulness measures whether the generated response contains hallucinations relative to the retrieval context. Answer relevancy measures how directly the generated response addresses the input. Contextual relevancy measures how relevant the retrieval context is to the input. Contextual recall measures whether the retrieval context contains all the information required to produce the ideal output. Contextual precision measures whether the retrieval context is ranked in the correct order, with higher-relevancy results appearing first.

The separation of retriever metrics from generator metrics is the practical utility of the framework. If your faithfulness score is low but your contextual recall is high, the retriever is bringing in the right information and the model is hallucinating anyway — that is a prompt engineering or model selection problem. If your contextual recall is low but your faithfulness score is high, the model is faithfully answering based on what it received, but the retriever is missing relevant chunks — that is a chunking or hybrid search problem.
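
A minimal evaluation sketch with the ragas library follows. The exact column names and import paths have shifted across versions, and an LLM judge must be configured separately, so treat this as the shape of the workflow rather than a drop-in script; the example data is invented.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question":     ["What are the Q3 2025 reporting conditions?"],
    "answer":       ["Reports are due within 30 days of quarter close..."],
    "contexts":     [["Retrieved chunk text for this query...", "Another retrieved chunk..."]],
    "ground_truth": ["Reports are due within 30 days; late filings incur..."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
# Per-metric scores: low contextual recall points at retrieval; low faithfulness
# points at generation.
print(result)
```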

A faithfulness score of 0.75 means that one in four claims in the model's output is not supported by the retrieved context. That is not an acceptable error rate for an HR policy assistant, a legal discovery tool, or a medical reference system.

A golden test set of 100–500 query-document pairs with human relevance labels, run offline before every major pipeline change, is the baseline evaluation discipline. Online monitoring logs query, retrieved chunks, scores, answer, latency, and implicit user signals. Tracking the top-score distribution over time reveals when corpus drift has moved the document distribution away from what the embedding model was trained on. Embedding models are trained on specific data distributions, and when your document corpus evolves — new products, regulatory updates, organizational restructuring — the embedding space stops reflecting your corpus accurately. RAGAS metrics, tracked continuously, are your early warning system for this kind of silent degradation.


ReAct and the Architecture of Tool-Using Agents

Retrieval-augmented generation, even in its most sophisticated form, is fundamentally a read-only operation. The model reads documents and synthesizes answers. Agentic systems are different in kind: the model takes actions, and actions have consequences in the world.

The conceptual foundation of modern tool-using agents is the ReAct framework (Reasoning plus Acting, a method for interleaving explicit reasoning traces with tool calls), published by Shunyu Yao and colleagues at Google Research in 2022 and presented at ICLR (the International Conference on Learning Representations) in 2023. ReAct explores the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources to gather additional information.

The practical mechanics of ReAct are straightforward. The model receives a task. It generates a thought — an explicit reasoning trace, in natural language, describing its understanding of the problem and its plan for the next step. Based on that thought, it selects and calls a tool: a search API, a database query, a calculator, a code execution environment. It receives an observation — the output of the tool call. It generates another thought, incorporating the observation into its updated reasoning. The cycle repeats until the task is complete.
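
A schematic version of that loop, with llm() and the tool registry left as placeholders; the point is the interleaved trace, not any particular SDK.

```python
from typing import Callable

def react_loop(task: str, llm: Callable[[str], dict],
               tools: dict[str, Callable[[str], str]], max_steps: int = 7) -> str:
    trace = [f"Task: {task}"]
    for _ in range(max_steps):
        # 1. Thought: the model reasons in natural language about the next step.
        #    llm() is assumed to return {"thought", "action", "input"} or {"thought", "answer"}.
        step = llm("\n".join(trace))
        trace.append(f"Thought: {step['thought']}")
        if "answer" in step:                   # the model decides the task is complete
            return step["answer"]
        # 2. Action: call the named tool with the model-generated input.
        tool = tools.get(step["action"])
        observation = tool(step["input"]) if tool else f"Error: unknown tool {step['action']!r}"
        # 3. Observation: feed the result back into the context for the next thought.
        trace.append(f"Action: {step['action']}({step['input']})")
        trace.append(f"Observation: {observation}")
    return "Stopped: step limit reached without an answer."
```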

On question answering (HotpotQA, a multi-hop question answering benchmark) and fact verification (FEVER, a dataset for fact extraction and verification), ReAct overcomes the hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, generating human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision-making benchmarks — ALFWorld (a text-based household task environment) and WebShop (a simulated online shopping benchmark) — ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.

What made ReAct foundational was not the performance gains on those specific benchmarks. The conceptual move was what mattered: treating reasoning and acting as a unified loop rather than separate systems. The reasoning trace is not scaffolding before the action; it is part of the action system. This interleaving is what makes the agent's behavior interpretable, debuggable, and governable. You can read a reasoning trace and understand why the agent called the tool it called.

Contemporary frontier models have internalized this pattern. Claude 4's extended thinking mode, GPT-5's tool-calling architecture, and Gemini 2.5's multi-step reasoning all implement variants of the Thought-Action-Observation loop that ReAct described. The framework went from a research paper to the implicit operating model of every major production agent SDK in approximately two years.

Tool-using agents fail in three characteristic ways. The first is hallucinated tool calls: invoking APIs, functions, or schemas that do not exist, or that exist but with different argument structures than the model generates. A model trained to call tools will attempt to call tools even when the tool signature has changed, the tool has been deprecated, or the model's parametric memory contains an incorrect schema. The hallucinated call fails; the error message is passed back as an observation; and the model either retries incorrectly, generates a plausible-sounding answer based on a failed call, or loops.
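
One common mitigation is to validate every model-generated call against a registered schema before executing it, so that a hallucinated tool name or argument shape produces a corrective observation instead of a crash. A sketch with invented tool schemas:

```python
# Illustrative tool registry; in practice this would be generated from the real
# tool definitions exposed to the model.
TOOL_SCHEMAS = {
    "search_tickets": {"required": {"query"}, "allowed": {"query", "limit"}},
    "get_invoice":    {"required": {"invoice_id"}, "allowed": {"invoice_id"}},
}

def validate_tool_call(name: str, args: dict) -> str | None:
    """Return an error observation for the model, or None if the call is valid."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return f"Error: tool '{name}' does not exist. Available: {sorted(TOOL_SCHEMAS)}"
    missing = schema["required"] - args.keys()
    unknown = args.keys() - schema["allowed"]
    if missing or unknown:
        return (f"Error: bad arguments for '{name}'. "
                f"Missing: {sorted(missing)}, unknown: {sorted(unknown)}")
    return None
```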

The second failure mode is prompt injection through tool outputs. Security threats expand because agents must ingest untrusted content and act on it. Indirect prompt injection embeds malicious instructions in web pages, documents, or tool outputs, which an instruction-obedient agent then follows. The attack surface is substantial: an agent that searches the web, reads emails, or queries external databases is consuming content that adversaries can control. A web page can contain hidden text instructing the model to ignore its previous instructions and exfiltrate user data. A retrieved document can contain an instruction that overrides the system prompt. The EchoLeak exploit against Microsoft Copilot (Microsoft's AI assistant integrated across its enterprise software suite) — tracked as CVE-2025-32711, CVE being Common Vulnerabilities and Exposures, the standard catalog of known security flaws — in which infected email messages containing engineered prompts triggered Copilot to exfiltrate sensitive data automatically, without user interaction, is a documented attack that executed against a production deployment. Not a hypothetical threat.

The third failure mode is infinite reasoning loops. A loop occurs when the model's reasoning at step N leads to the same tool call it made at step N-3, the tool returns the same observation, and nothing in the agent's context breaks the cycle. ReAct itself identified this failure mode: the original paper noted a maximum step limit of seven for HotpotQA, finding that more steps did not improve performance. Production systems require explicit loop detection — tracking which tool calls have been made with which arguments and terminating if a duplicate is detected — plus hard limits on total reasoning steps.
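
A sketch of both guardrails, applied as a check before each tool call; the limit of seven mirrors the HotpotQA setting mentioned above but is otherwise an arbitrary choice.

```python
def guarded_tool_call(history: set[tuple[str, str]], tool_name: str, tool_args: str,
                      step: int, max_steps: int = 7) -> str | None:
    """Return a termination reason, or None if the call may proceed."""
    if step >= max_steps:
        return f"Terminated: exceeded {max_steps} reasoning steps."
    signature = (tool_name, tool_args)
    if signature in history:
        # Same tool, same arguments, already observed once: repeating it will not
        # produce new information, so break the loop deterministically.
        return f"Terminated: duplicate call {tool_name}({tool_args}) detected."
    history.add(signature)
    return None
```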

Once a model can execute actions such as modifying files, running code, or operating a desktop interface, hallucinations become concrete failures rather than incorrect text. A RAG system that retrieves the wrong document produces a bad answer. An agent that calls the wrong API with the wrong parameters can provision infrastructure, delete records, send emails, or execute trades. The severity of the failure mode scales directly with the consequence of the action.


Multi-Agent Orchestration: Capability Multiplier and Failure Surface Multiplier

A single tool-using agent introduces a defined failure surface. A network of agents — each with its own reasoning loop, tool access, and output-as-input relationships — multiplies that failure surface in ways that are non-linear and difficult to audit after the fact.

Multi-agent architectures are appealing because they model how complex tasks decompose. An orchestrator agent receives a high-level goal and delegates sub-tasks to specialist agents: one that retrieves and synthesizes research, one that writes code, one that validates outputs, one that manages external API calls. The orchestrator collects their outputs and synthesizes a result. The capability argument is straightforward: parallelism, specialization, and the ability to complete tasks that exceed a single model's context window.

Multi-agent orchestration often adds coordination risk with limited performance gains. Each agent in the network is a point of potential failure, and one agent's failure propagates to all agents downstream that depend on its output. A specialist agent that retrieves incorrect information poisons the orchestrator's synthesis. An agent that misinterprets the orchestrator's instruction generates output that the next specialist agent treats as correct, amplifying the error. Unlike traditional software bugs that produce clear error codes, these failures are often semantic: a hallucinated fact, a misinterpreted instruction, or a corrupted memory entry that looks perfectly normal to standard monitoring.

The governance challenge compounds at every layer. In a single-agent ReAct loop, you have one reasoning trace to inspect. In a multi-agent system, you have a tree of reasoning traces — each agent's internal deliberation, its tool calls, the observations it received, and the outputs it passed to the orchestrator. Reconstructing why the system produced a particular output requires traversing that entire tree, matching observations to tool calls, and identifying where in the cascade the error was introduced. Without observability tools — LangSmith (a debugging and monitoring platform for LLM applications), Weights & Biases Traces (an experiment tracking and observability tool), or equivalent — post-hoc debugging of multi-agent failures is close to impossible.

Gartner's March 2026 data forecasts that by 2030, half of AI agent deployment failures will trace back to insufficient runtime enforcement by governance platforms. Agent governance is harder than model governance precisely because agents act in time, across systems, with consequences that may be distributed across multiple external services before any human reviews the reasoning trace.

Three governance requirements follow directly. First, audit trails of reasoning traces must be complete and tamper-evident. Every thought the agent generated, every tool it called, every observation it received, and every decision it made on the basis of that observation must be logged in a way that cannot be altered after the fact. This requires quantifiable indicators such as injection block rate, exfiltration recall, and hallucination-to-action ratio, alongside cryptographically verifiable action provenance enabling repeatable, auditable governance.

Second, human-in-the-loop checkpoints for irreversible actions must be structural, not advisory. Advisory checkpoints — flags in the UI that alert a human — are bypassed under time pressure, disabled when they generate too many alerts, or simply ignored. Structural checkpoints halt execution and require a human approval signal before the action proceeds. This is the mechanism that prevents a reasoning error from becoming a production incident.
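
A sketch of what "structural" means in code. The irreversible-action list, the approval transport, and the dispatcher are all assumptions here; the essential property is that execution blocks on the approval call rather than merely flagging the action.

```python
from typing import Callable

# Actions the organization has classified as irreversible (illustrative list).
IRREVERSIBLE = {"delete_records", "send_email", "execute_trade", "provision_infrastructure"}

def execute_action(action: str, args: dict,
                   request_approval: Callable[[str, dict], bool],
                   dispatch: Callable[[str, dict], str]) -> str:
    """Run a tool call, blocking on human approval for irreversible actions."""
    if action in IRREVERSIBLE:
        # Structural checkpoint: the agent cannot proceed past this line on its own
        # reasoning. A denial is a normal, logged outcome, not an exception.
        if not request_approval(action, args):   # blocks until a human responds
            return f"Action '{action}' denied by reviewer; halting this branch."
    return dispatch(action, args)
```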

The third requirement is explicit tool permission scoping. A real incident at Replit (a cloud-based software development environment) in July 2025 involved an AI agent that deleted a production database containing 1,200-plus records despite explicit instructions for "code and action freeze." The problem was not model intelligence — it was over-permissioning at the protocol level. Proper OAuth scopes (OAuth being the industry-standard protocol for delegated access control) might have prevented this. Agents should be granted the minimum permissions necessary to accomplish their task — the principle of least privilege applied to AI systems — and tool access should be audited at the permission level, not assumed from the model's stated behavior.

Structured autonomy levels are a practical governance tool: they define, for each agent in a system, which categories of action it can take unilaterally, which require a confirmation step, and which are prohibited regardless of the reasoning trace. This is the agentic equivalent of role-based access control — and it is just as necessary.
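
A sketch of structured autonomy levels enforced at the tool boundary; the agent names, action names, and policy table are illustrative, and the default-deny posture is the important part.

```python
from enum import Enum

class Autonomy(Enum):
    ALLOW = "allow"       # agent may act unilaterally
    CONFIRM = "confirm"   # structural human checkpoint required
    DENY = "deny"         # prohibited regardless of the reasoning trace

AGENT_POLICY = {
    "research_agent": {"web_search": Autonomy.ALLOW, "db_write": Autonomy.DENY},
    "ops_agent":      {"db_write": Autonomy.CONFIRM, "provision_infrastructure": Autonomy.CONFIRM},
}

def check_permission(agent: str, action: str) -> Autonomy:
    # Default-deny: an action absent from the policy is prohibited, which is the
    # least-privilege posture the Replit incident argues for.
    return AGENT_POLICY.get(agent, {}).get(action, Autonomy.DENY)
```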


The Production Reality: What "Agentic" Actually Means in Enterprise Deployments

A significant gap separates the rhetoric of agentic AI from what most enterprise deployments look like. Understanding that gap is necessary for setting realistic expectations, making sound vendor evaluation decisions, and knowing where genuine autonomy begins and where it is theater.

Deloitte's 2025 study found that while 30% of organizations explore agentic options and 38% run pilots, only 14% have production-ready solutions — and a mere 11% actively use these systems operationally. The most common production deployment pattern is not a fully autonomous agent operating within a ReAct loop with broad tool access. It is a rigid pipeline with LLM-generated text at specific nodes.

A concrete example: a document processing workflow receives incoming contracts, extracts key clauses using a prompt, routes the extracted clauses to a risk classification step using a second prompt, and generates a summary report using a third prompt. Each step is deterministic in structure — the routing logic is hard-coded, the tool calls are fixed, the sequence is invariant. The LLM generates text at each node, but it does not decide what to do next. A human — or a traditional software system — makes those decisions. This is a workflow with LLM components, not an agent. It has the characteristics that make it governable: deterministic routing, predictable scope, auditable individual steps.
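
A sketch of that pattern, with llm() standing in for whatever completion client is in use and the prompts invented: the LLM fills in text at fixed nodes, while the sequence and the routing decision are ordinary code.

```python
from typing import Callable

def process_contract(contract_text: str, llm: Callable[[str], str]) -> dict:
    # Node 1: extraction. The LLM generates text but does not choose what happens next.
    clauses = llm(f"Extract the key clauses from this contract:\n{contract_text}")
    # Node 2: risk classification, again as a fixed node.
    risk = llm(f"Classify the risk level (low/medium/high) of these clauses:\n{clauses}")
    # Deterministic routing: code, not the model, decides the branch.
    needs_legal_review = "high" in risk.lower()
    # Node 3: summary report.
    summary = llm(f"Summarize these clauses for a business reader:\n{clauses}")
    return {"clauses": clauses, "risk": risk,
            "needs_legal_review": needs_legal_review, "summary": summary}
```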

LLMs produce probabilistic text. Businesses need deterministic outcomes: consistent schemas, repeatable steps, and auditable records. That gap creates familiar failure modes. Most mature engineering teams minimize the scope of probabilistic reasoning in their pipelines — using the LLM for the things it is genuinely better at (natural language understanding, synthesis, classification) and using deterministic code for everything else (routing, validation, state management, error handling).

Genuine autonomy — a system that encounters an unexpected situation and decides how to respond by generating novel tool calls, revising its own plan, and operating across extended time horizons without human checkpoints — is rare in production and expensive to govern. Gartner's 2025 forecasts warn that over 40% of agentic AI projects may be canceled by 2027. The cancellations will not be because the models failed. They will be because the governance infrastructure was not built, the failure modes were not anticipated, and the first production incident eroded the organizational trust needed to continue.

The argument here is for precision about what kind of agent you are building, what risks it introduces at each level of autonomy, and what governance infrastructure must exist before autonomy is extended. An agentic architecture preserves persistent state across interactions, formulates and revises executable plans via typed tool interfaces, incorporates feedback from its environment to adapt behavior, and enforces governance constraints at runtime. That last clause — enforces governance constraints at runtime — is the one that most production deployments skip, because it requires engineering investment that does not appear in demos and does not show up in capability evaluations.

A concrete diagnostic framework for assessing your own deployments emerges from everything above. For every system your organization calls "agentic," ask six questions: Does it have a complete, tamper-evident reasoning trace? Can you reconstruct exactly why it took each action it took? Are there structural (not advisory) human checkpoints before irreversible actions? Is every tool permission explicitly scoped and audited? Are your RAGAS metrics — context precision, recall, faithfulness, answer relevancy — tracked continuously, not just at deployment? And when did you last run a red-team exercise specifically designed to test prompt injection through tool outputs?

If the answer to any of these is no, or "we'll add that later," the gap between your system's apparent capability and its actual production safety is exactly where your next incident will originate. The models are not the constraint. The architecture is.


The measure of a production-grade AI system is not its benchmark score — it is the quality of the decisions made about what surrounds the model. Every RAG pipeline is a set of bets: on chunking strategy, on the balance between sparse and dense retrieval, on whether the reranker latency budget is justified by the accuracy gain, on how much context to pass to the generation stage. Every agentic deployment is a set of commitments: to audit completeness, to tool permission scope, to the specific boundaries of autonomous action. GPT-5, Claude 4, and Gemini 2.5 are extraordinarily capable systems. They will faithfully execute whatever your architecture asks them to execute — including retrieving the wrong document, calling a hallucinated API, or completing the reasoning loop that nobody put a hard stop on. The organization that understands this builds systems worth trusting.