Company
Microsoft
Title
Building Production-Grade RAG Systems for Financial Document Analysis
Industry
Finance
Year
2023
Summary (short)
Microsoft's team shares their experience implementing a production RAG system for analyzing financial documents, including analyst reports and SEC filings. They tackled complex challenges around metadata extraction, chart/graph analysis, and evaluation methodologies. The system needed to handle tens of thousands of documents, each containing hundreds of pages with tables, graphs, and charts spanning different time periods and fiscal years. Their solution incorporated multi-modal models for image analysis, custom evaluation frameworks, and specialized document processing pipelines.
## Overview

This case study comes from a presentation by Microsoft development team leads (Kobi, a development group manager, and Mia, an IT development team lead) discussing their real-world experiences building a production RAG (Retrieval Augmented Generation) system for the financial industry. The system was designed to handle analyst reports, SEC filings (10-K forms), and other financial documents at scale—tens of thousands of documents, each potentially containing hundreds of pages with complex content including tables, charts, and graphs.

The presentation explicitly challenges the common perception that RAG systems are easy to build. While marketing materials and tutorials often show RAG implementations in "8 lines of code" or promise systems built "in minutes," the speakers emphasize that production-grade RAG systems are actually "very, very messy and very, very hard."

## The Context Window Debate

The presenters begin by addressing why RAG remains necessary despite increasingly large context windows in modern LLMs. GPT-4 supports 128K tokens, while Gemini 1.5 Pro can handle 2 million tokens (approximately 4,000 pages). However, they identify three critical reasons why simply dumping all documents into the context window is problematic:

- **Cost considerations**: Sending thousands of pages with every query would be prohibitively expensive, especially when queries might need multiple interactions to resolve.
- **Latency impact**: Processing large context windows significantly degrades user experience and also affects throughput on the generation side, as tokenization of large inputs adds overhead on both input and output processing.
- **Lost in the Middle phenomenon**: This is perhaps the most interesting finding the team shares. They reference research from 2023 demonstrating that LLMs have difficulty accurately answering questions when the relevant information is positioned in the middle of a long context. The probability of a correct answer depends heavily on where the answer appears within the context.

The team replicated this experiment using a 212-page document (approximately 118K tokens), inserting the same sentence about a conference location at page 5 versus page 212. When the information was at page 5, the model correctly identified that the conference would be at the David InterContinental hotel. When the same information was at page 212, the model claimed it had no knowledge about the conference despite having access to the same information. This finding alone justifies the need for precise retrieval rather than brute-force context stuffing.

## System Architecture

The team describes their RAG architecture as consisting of two primary subsystems:

The **Data Ingestion pipeline** handles document processing, parsing, embedding generation, and storage. Documents go through a document parser (they used Azure Document Intelligence), which provides structural analysis of each page—identifying text, tables, and graphical elements. Each element type receives specialized processing before being cleaned and stored in a vector database.

The **Generation pipeline** handles queries, retrieves relevant chunks from the index with additional processing, and generates responses back to users.

The presentation focuses primarily on the ingestion side, where they made significant improvements to achieve state-of-the-art results compared to their baseline system.
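The ingestion flow described above lends itself to a simple routing skeleton. The sketch below is illustrative only: the `Chunk` dataclass, the page/element field names, and the injected `serialize_table`/`describe_figure` helpers are assumptions standing in for the Azure Document Intelligence output and the team's type-specific processors, not their actual code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Chunk:
    text: str            # what gets embedded: raw text, a serialized table, or a chart description
    element_type: str    # "text" | "table" | "figure"
    metadata: dict = field(default_factory=dict)  # company, ticker, fiscal period, page, ...

def ingest_document(parsed_pages: list[dict],
                    doc_metadata: dict,
                    serialize_table: Callable[[dict], str],
                    describe_figure: Callable[[dict], str]) -> list[Chunk]:
    """Route each parsed element to type-specific processing and attach document-level metadata."""
    chunks: list[Chunk] = []
    for page in parsed_pages:                        # structural output of the document parser
        for element in page["elements"]:
            if element["type"] == "table":
                text = serialize_table(element)      # tables get their own flattening/cleaning step
            elif element["type"] == "figure":
                text = describe_figure(element)      # multimodal description (see Challenge 2 below)
            else:
                text = element["content"]            # plain text blocks
            chunks.append(Chunk(text=text,
                                element_type=element["type"],
                                metadata={**doc_metadata, "page": page["page_number"]}))
    return chunks  # next: embed each chunk and upsert it, metadata included, into the vector index
```

Carrying the document-level metadata on every chunk is what makes the metadata-based filtering discussed next possible at query time.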
## Challenge 1: Metadata Extraction and Filtering

A critical challenge in financial document RAG is that similarity-based retrieval alone is insufficient. Consider a query like "What is the projected revenue for Caterpillar in the 2023 Q1 filing?" The corpus may contain dozens of nearly identical SEC filings from the same company across different quarters. Similarity search will find relevant content, but without proper filtering by document metadata (company name, filing period, fiscal year), the system might return information from the wrong time period.

The team's solution involved extracting metadata from documents, particularly from the first few pages where key information typically appears—document titles, dates, company names, stock ticker symbols, and reporting periods.

An interesting complexity emerged around fiscal year handling. Different companies have different fiscal year calendars (Microsoft's fiscal year starts in July, for example). A query asking about "fiscal year 2024 Q1" for Microsoft actually refers to July 2023, not January 2024. The team demonstrated how they used prompts to convert calendar dates to fiscal periods based on company-specific calendars. However, they also showed how fragile these prompts can be—a seemingly minor change in the prompt (asking the model to return only the final answer rather than showing its reasoning) caused the model to return completely incorrect fiscal period conversions.

This example highlights a recurring theme: small changes in prompts or processing can cause dramatic shifts in system behavior. The team emphasizes that such changes might easily pass code review without anyone noticing the potential for degraded accuracy.

## Challenge 2: Charts and Graphs Processing

Financial documents are rich with visual information—charts, graphs, tables, and diagrams. The team identified two main challenges: storing visual information in a way that enables effective retrieval, and then being able to answer questions based on that visual content. For example, a question like "Which cryptocurrency is least held by companies?" might only be answerable by examining a pie chart—the information doesn't appear in the document's text.

Their approach uses multimodal models to describe graphical elements during ingestion. For each chart or graph found on a page, they prompt a multimodal model to describe the chart's axes, data points, trends, and key insights. This description is then stored as JSON objects that can be indexed and retrieved via similarity search.

The team acknowledges that this approach works well for simpler graphs but becomes more challenging with complex visualizations containing many data points, multiple series, and overlapping elements. For complex cases, they also store the original images so that during generation, a multimodal model can be invoked to analyze the visual directly—though this adds cost and latency.

## The Image Classifier Solution

An unexpected operational challenge emerged during processing: their document corpus contained massive numbers of images that weren't relevant charts or graphs—photos, decorative elements, headers that had been converted to images during format conversions, and other non-informational visuals. Sending all of these through expensive multimodal processing was both time-consuming and costly.

Their solution was to develop a simple image classifier that serves as a filter before multimodal processing. This classifier's sole purpose is to determine whether an image contains relevant informational content (charts, graphs, tables, diagrams) or is irrelevant (photos, decorations, logos). Only images classified as informational proceed to the more expensive multimodal description step. This is a practical example of the kind of optimization that becomes necessary only when operating at production scale with real-world document corpora.
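A hedged sketch of this gating step is shown below, under a few assumptions: the talk does not name the models involved, so the classifier is left as a stub to be replaced by whatever lightweight model is trained for the job, and the description call uses an OpenAI-style multimodal chat endpoint with an illustrative model name.

```python
import base64
from openai import OpenAI  # assumes an OpenAI/Azure OpenAI-style multimodal endpoint

client = OpenAI()

def is_informational(image_bytes: bytes) -> bool:
    """Cheap filter run before any multimodal call.

    Placeholder for the team's lightweight image classifier, which separates
    charts, graphs, tables, and diagrams from photos, logos, and decorations.
    """
    raise NotImplementedError("plug in a trained image classifier here")

def describe_figure(image_bytes: bytes) -> str:
    """Ask a multimodal model for a structured, indexable description of a chart."""
    b64 = base64.b64encode(image_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice; the talk only says "a multimodal model"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this chart as JSON with fields: chart_type, axes, "
                    "series, key_data_points, and a short summary of the trends."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def process_page_images(images: list[bytes]) -> list[str]:
    """Only images classified as informational reach the expensive description step."""
    return [describe_figure(img) for img in images if is_informational(img)]
```

The returned descriptions are what get embedded and indexed; for complex charts the original image can additionally be stored so a multimodal model can be re-invoked at generation time, as described above.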
## Evaluation-Driven Development

Perhaps the most emphasized aspect of the presentation is the critical importance of robust evaluation. The team argues that evaluation must be built from "day one"—you cannot develop a production RAG system without understanding how to measure its performance.

They illustrate the complexity of RAG evaluation with an example. Given an expert answer stating "4.7 million dollars" with two reasons (new product launches and new markets), the system returned "approximately 5 million dollars" with only one reason (product launches). Is this answer acceptable?

The team describes an iterative approach to evaluation prompt design:

- **Level 1 - Binary correctness**: Simply asking the model if the answer is correct or incorrect. This is too coarse—it would mark the answer as incorrect due to minor numerical differences.
- **Level 2 - Claim decomposition**: Breaking down both the expert answer and the system answer into individual claims, then identifying missing claims and contradictions. This reveals that one reason was missing.
- **Level 3 - Numerical tolerance**: For financial contexts, adding specific handling for numerical statements to understand whether differences are material. Is "approximately 5 million" a contradiction of "4.7 million" or an acceptable approximation? In many contexts, this might be acceptable rounding.

The team emphasizes that each domain requires its own evaluation calibration. Financial applications might need strict numerical accuracy for certain types of figures while tolerating approximations for others. This evaluation methodology—borrowed from data science and model development practices—needs to become standard practice for software engineers building LLM applications.

They advocate for automated evaluation systems that provide appropriate segmentation (evaluating performance separately on tables, charts, text, and different document types) and enable confident comparison between system versions. Without this foundation, teams cannot safely make changes or improvements to their RAG systems.
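To make the Level 2 and Level 3 ideas concrete, here is a hedged sketch of a claim-decomposition judge prompt paired with a deterministic numeric tolerance check. The prompt wording, the JSON schema, and the 10% tolerance are illustrative assumptions, not the team's actual calibration.

```python
import math

JUDGE_PROMPT = """You are evaluating a RAG system's answer against an expert answer.

Expert answer:
{expert}

System answer:
{candidate}

1. List each individual factual claim made in the expert answer.
2. For each claim, say whether the system answer SUPPORTS, OMITS, or CONTRADICTS it.
3. Treat two numeric values as matching if they differ by less than {tol:.0%}.

Return JSON: {{"claims": [{{"claim": "...", "verdict": "..."}}]}}"""

def numbers_match(expected: float, actual: float, rel_tol: float = 0.10) -> bool:
    """Deterministic check applied alongside the LLM judge for material figures."""
    return math.isclose(expected, actual, rel_tol=rel_tol)

# The example from the talk: "approximately 5 million" vs the expert's "4.7 million"
# is about 6% apart, so it passes a 10% tolerance but fails a stricter 2% one.
print(numbers_match(4.7e6, 5.0e6, rel_tol=0.10))  # True
print(numbers_match(4.7e6, 5.0e6, rel_tol=0.02))  # False

prompt = JUDGE_PROMPT.format(
    expert="Projected revenue is 4.7 million dollars, driven by new product launches and new markets.",
    candidate="Approximately 5 million dollars, driven by new product launches.",
    tol=0.10,
)  # send `prompt` to an LLM judge; the missing "new markets" claim should come back as OMITS
```

Segmenting such verdicts by chunk type (tables, charts, plain text) then yields the per-category comparison between system versions that the team advocates.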
## Key Takeaways

The presentation concludes by reinforcing that "the devil is in the details." What appears simple in tutorials and demonstrations becomes extraordinarily complex in production environments with real data at scale. The core message is that evaluation-driven development is not optional—it's the engine that enables improvement and ensures that changes don't inadvertently degrade system performance.

The Q&A touched on the challenge of creating evaluation datasets, acknowledging that this is one of the central difficulties in the field. While synthetic data generation can help bootstrap datasets and semi-automate the process, human curation remains essential for quality. Initial datasets often require multiple rounds of refinement as edge cases and quality issues are discovered during analysis.