ZenML

Practical Challenges in Building Production RAG Systems

Prolego
View original source

A detailed technical discussion between Prolego engineers about the practical challenges of implementing Retrieval Augmented Generation (RAG) systems in production. The conversation covers key challenges including document processing, chunking strategies, embedding techniques, and evaluation methods. The team shares real-world experiences about how RAG implementations differ from tutorial examples, particularly in handling complex document structures and different data formats.

Industry

Tech

Technologies

Overview

This case study is derived from a panel discussion at Prolego, a consulting firm specializing in generative AI solutions, featuring engineers Kevin DeWalt, Justin, Ben, and Cam who share their practical experiences building RAG (Retrieval Augmented Generation) applications for enterprise clients. The discussion serves as a valuable reality check against the proliferation of simplified RAG tutorials, highlighting the significant gap between demo projects and production-ready implementations.

The conversation opens with a practical demonstration of RAG using ChatGPT, where the host shows how asking a simple question like “Can I bring my small dog Charlie to work?” yields generic responses without context, but becomes useful when augmented with actual company policy documents. This illustrates the core value proposition of RAG: customizing general-purpose LLMs like GPT-4 with organization-specific data without requiring model fine-tuning.

The Gap Between Tutorials and Production Reality

One of the central themes of the discussion is how dramatically real-world RAG implementations differ from the simplified examples found in online tutorials. The engineers emphasize that most tutorials work with clean, well-structured text documents and carefully crafted queries that conveniently demonstrate the happy path. In contrast, production environments present messy, heterogeneous data sources including PDFs, Word documents, SharePoint sites, PowerPoint presentations, wikis, and emails.

The team notes that this is “yet another example of why we won’t be replacing data scientists and programmers with AI anytime” soon, acknowledging that the fundamental data engineering challenges remain core to successful RAG deployments.

Document Parsing and Chunking Challenges

Perhaps the most surprising insight from the discussion is that the most challenging aspect of RAG is often the step that appears simplest in architecture diagrams: converting source documents into meaningful chunks. Justin explains that documents like policies, procedures, and regulations have inherent hierarchical structure with sections and subsections, and naive chunking approaches can lose critical context.

The team provides a concrete example: a policy might state “this rule always applies except in the following conditions” followed by a list of exceptions. If you chunk sentence by sentence or even paragraph by paragraph, you might capture only the exception without the context that it’s an exception to a particular rule. Another example from life insurance applications involves statements like “patient does not have cancer, diabetes, heart disease” where missing the “does not have” qualifier would be catastrophic.

In terms of practical workflow, Ben describes spending “a couple of days” getting document parsing libraries operational, writing heuristics to capture different document elements, sorting elements by position on the page, and using font or color information to identify headings and structure. The goal is to create a flat representation suitable for machine learning operations while preserving hierarchical relationships through keys or metadata.

The engineers recommend using a flat dictionary or list structure where each element’s key encodes the document hierarchy (e.g., filename + chapter + subsection + paragraph), allowing both efficient batch processing and the ability to map back to original document structure when needed.

Embedding and Retrieval Complexity

The discussion reveals that the “most similar” retrieval step involves considerable nuance beyond simply computing cosine similarity between query and document embeddings. Context size presents challenges at multiple levels: both the embedding model’s context window for generating meaningful embeddings, and the generative LLM’s context window which limits how much retrieved information can be passed through.

Justin explains the chunking tradeoff clearly: too small (individual sentences) loses context and produces embeddings that don’t capture meaning well; too large (entire documents) tries to squeeze too much information into a single embedding representation, degrading retrieval quality. Finding the “sweet spot” requires experimentation with chunk sizes, overlap windows, and techniques that combine subsection text with parent section context.

The team discusses various retrieval enhancement techniques:

Ben mentions a Salesforce research paper on iterative summarization that progressively condenses text while preserving named entities and key information, improving the signal-to-noise ratio of retrieved content.

Evaluation Challenges

The team emphasizes that RAG evaluation is significantly more complex than traditional ML evaluation metrics. Justin uses an analogy of grading a student essay: you can check grammar (the easy part), but evaluating thoroughness, appropriate length, and factual correctness is much harder.

A key concept discussed is faithfulness as distinct from correctness. A system might produce a correct answer but not based on the retrieved context—particularly when using powerful models like GPT-4 that have extensive world knowledge. Faithfulness measures whether the generated response is actually derived from the provided context rather than the model’s parametric knowledge. This distinction is crucial for enterprise deployments where answers must be traceable to authoritative source documents.

The engineers acknowledge that many LLM evaluations rely on multiple-choice formats that are easy to score but don’t reflect real-world usage patterns. Ben explains that faithfulness evaluation requires LLMs to evaluate other LLMs, generating and assessing statements about the relationship between questions, answers, and context—there’s no simple numeric metric that captures everything.

Another evaluation challenge is the distribution of query difficulty. An evaluation set with “easy” questions that reliably retrieve correct documents will show good metrics, but users in production may ask questions in unexpected ways, use different vocabulary, or pose genuinely harder questions that the system wasn’t designed for.

Practical Recommendations

The team offers several pragmatic insights for production RAG deployments:

Technical Architecture Insights

The canonical RAG architecture discussed involves splitting source documents into chunks, generating embeddings via an embedding model, storing embeddings in a searchable format, embedding user queries through the same model, finding similar document embeddings, converting those back to text, and including that text in the prompt sent to a generative LLM.

The team notes that while this architecture looks simple, each box in the diagram conceals significant complexity. The embedding models may be pre-trained and commoditized, but everything around them—parsing, chunking, retrieval logic, context assembly, and evaluation—requires substantial engineering judgment and domain expertise.

This discussion provides a valuable counterweight to the “RAG is easy” narrative, demonstrating that production RAG applications are fundamentally data engineering and systems design challenges that happen to involve LLMs, rather than LLM projects that require a bit of data handling.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Building Custom Agents at Scale: Notion's Multi-Year Journey to Production-Ready Agentic Workflows

Notion 2026

Notion, a knowledge work platform serving enterprise customers, spent multiple years (2022-2026) iterating through four to five complete rebuilds of their agent infrastructure before shipping Custom Agents to production. The core problem was enabling users to automate complex workflows across their workspaces while maintaining enterprise-grade reliability, security, and cost efficiency. Their solution involved building a sophisticated agent harness with progressive tool disclosure, SQL-like database abstractions, markdown-based interfaces optimized for LLM consumption, and a comprehensive evaluation framework. The result was a production system handling over 100 tools, serving majority-agent traffic for search, and enabling workflows like automated bug triaging, email processing, and meeting notes capture that fundamentally changed how their company and customers operate.

chatbot question_answering summarization +52

Enterprise Code Search and Bug Investigation with Multi-Agent AI Systems

Wix 2026

Wix developed two interconnected AI systems to address the challenge of searching and understanding code across thousands of repositories and services in a large organization. The first system, OctoCode, is an MCP-based tool with 90,000 downloads and 5,000 weekly active users that helps developers search repositories, understand dependencies, and navigate complex codebases. The second system, Bilbo, is an enterprise service that orchestrates multiple AI agents to investigate bugs and perform deep research across the organization's technical stack, integrating with GitLab, databases, logs, documentation, and other internal systems. Both systems employ sophisticated prompt engineering, context management, sub-agent architectures, and custom tooling protocols to handle the complexity of enterprise-scale code search and investigation while managing token limits and maintaining response quality.

code_generation code_interpretation question_answering +31