**Company:** Box
**Title:** Enterprise Data Extraction Evolution from Simple RAG to Multi-Agent Architecture
**Industry:** Tech
**Year:** 2025

**Summary:** Box, a B2B unstructured data platform serving Fortune 500 companies, initially built a straightforward LLM-based metadata extraction system that successfully processed 10 million pages but ran into limitations with complex documents, OCR quality, and scale requirements. The team evolved from a simple pre-process/extract/post-process pipeline to a multi-agent architecture that handles document complexity, field grouping, and quality feedback loops, yielding a more robust and more easily evolved system that better serves enterprise customers' diverse document processing needs.
## Overview

Box, a B2B enterprise content and unstructured data platform, presents a compelling case study in how production LLM systems must evolve beyond simple architectures to handle real-world enterprise requirements. The company serves over 115,000 organizations, including many Fortune 500 companies, and manages an exabyte of customer data. Ben Kuss, sharing lessons from Box's AI journey, describes how the company became one of the first AI deployments for many large enterprises because of its existing trust relationship with customers who were otherwise hesitant about AI adoption. The case study focuses on Box's metadata extraction feature, which pulls structured information from unstructured documents. Kuss deliberately chose this example because it represents "the least agentic looking type of functionality", with no chatbot or conversational interface, which makes the lessons about agentic architecture particularly instructive.

## The Initial Architecture and Early Success

Box began integrating AI into its products in 2023 with a straightforward architecture for data extraction. The initial pipeline was elegantly simple: take a document, define the fields to extract, perform pre-processing and OCR, then pass everything to a large language model to extract the specified fields (a minimal sketch of this shape appears at the end of this section). The approach worked remarkably well at first, and the team successfully processed 10 million pages for their first customer deployment.

The team experienced what might be called the "honeymoon phase" of generative AI deployment: the technology seemed to solve the problem completely, and the simplicity and apparent universality of the solution led to significant optimism about handling any document type. That confidence was soon challenged when the team began telling customers to bring any data they wanted to process.
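To make the starting point concrete, here is a minimal sketch of that single-pass pipeline, assuming the OpenAI Python SDK. The function name, prompt, and model choice are illustrative assumptions, not Box's actual implementation, and the OCR step is represented by pre-extracted text.

```python
# Minimal single-pass extraction pipeline in the spirit of Box's first
# architecture (illustrative only; all names here are hypothetical).
# Requires the OpenAI SDK (`pip install openai`) and OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()

def extract_fields(document_text: str, fields: list[str]) -> dict:
    """One LLM call that pulls every requested field from the document."""
    prompt = (
        "Extract the following fields from the document below. "
        "Respond with a JSON object mapping each field name to its value, "
        "or null if a field is not present.\n\n"
        f"Fields: {', '.join(fields)}\n\n"
        f"Document:\n{document_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable output
    )
    return json.loads(response.choices[0].message.content)

# Works well on short, clean documents.
ocr_text = "Invoice #123 ... Total: $4,200 ... Due: 2025-06-01"
print(extract_fields(ocr_text, ["invoice_number", "total_amount", "due_date"]))
```

Everything rides on a single model call over a single text representation, which is exactly why the production failure modes described next (very long documents, noisy OCR, hundreds of fields) were so disruptive.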
## The Trough of Disillusionment: Production Challenges

As customers began submitting increasingly diverse and complex documents, several critical problems emerged that the simple architecture couldn't handle:

**Context Window Limitations**: Customers brought documents of 300+ pages that exceeded the context windows available at the time. The team implemented "enterprise RAG" approaches for document chunking and retrieval, but this added significant complexity to what was supposed to be a simple extraction pipeline.

**OCR Quality Issues**: Real-world documents included handwritten annotations, crossed-out text, and multilingual content that degraded OCR accuracy. The system needed to handle cases where the text representation didn't match what a human could visually interpret from the document.

**Attention Overwhelm**: Some customers requested extraction of 200-500 different fields from a single document, overwhelming the model's attention, especially on complex documents. The initial fix of simply "chunking fields into groups" failed when semantically related fields (such as customer names and customer addresses) needed to be processed together for coherent extraction.

**Confidence Scoring Requirements**: Enterprise customers accustomed to traditional ML systems expected confidence scores, a feature that generative AI doesn't naturally provide. The team implemented "LLM as judge" approaches in which a second model evaluated extraction accuracy, but customers rightly questioned why they should accept wrong answers when the system could identify its own errors.

This period represented Box's "trough of disillusionment" with generative AI, where the elegant initial solution proved inadequate for production-scale enterprise requirements.

## The Pivot to Agentic Architecture

After watching Andrew Ng's deep learning course featuring Harrison Chase, the team began exploring agentic approaches despite internal skepticism. Some engineers argued that data extraction was "just a function" and didn't warrant the complexity of agent-based systems. Nevertheless, the team proceeded with a complete rearchitecture. The new multi-agent architecture separated the problem into specialized sub-agents (a sketch of how they might compose follows the list):

**Field Grouping Agent**: Instead of using heuristic-based approaches to chunk fields, an intelligent agent learned to group semantically related fields together. This solved the problem where naive chunking separated related concepts (such as customer information and customer addresses), causing extraction errors.

**Intelligent Extraction Agent**: Rather than prescribing a fixed extraction strategy, this agent could dynamically determine whether to analyze OCR text, examine page images directly, or combine both approaches based on document characteristics. This flexibility allowed the system to adapt to documents where OCR was unreliable.

**Quality Feedback Loop Agent**: Beyond simply providing confidence scores, this component could take action on detected errors. When an extraction appeared incorrect, the agent could retry with different techniques, use multiple models for voting-based consensus, or apply specialized approaches for problematic fields.
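Here is a minimal sketch of how the three sub-agents might compose into one workflow. Everything in it is a hypothetical illustration, not Box's implementation: `call_llm` stands in for any chat-completion call returning parsed JSON (it could be wired to the same SDK call as the earlier sketch), and the prompts, document attributes (`doc.profile`, `doc.ocr_text`, `doc.page_images`), and retry policy are all assumptions. A production system would likely run these as nodes in an orchestration framework such as LangGraph.

```python
# Hypothetical composition of the three sub-agents (illustrative only).
def call_llm(prompt: str) -> dict:
    """Stand-in for a chat-completion call returning parsed JSON
    (wire this to the same kind of SDK call as the earlier sketch)."""
    raise NotImplementedError

def group_fields(fields: list[str]) -> list[list[str]]:
    """Field Grouping Agent: cluster semantically related fields so that,
    e.g., customer_name and customer_address stay in the same group."""
    out = call_llm(
        "Group these fields so semantically related ones stay together. "
        'Return JSON {"groups": [[field, ...], ...]}. '
        f"Fields: {fields}"
    )
    return out["groups"]

def choose_strategy(doc) -> str:
    """Intelligent Extraction Agent: decide whether to read OCR text,
    page images, or both, based on document characteristics."""
    out = call_llm(
        f"Document profile: {doc.profile}. Should extraction use "
        '"ocr_text", "page_image", or "both"? Return JSON {"strategy": ...}'
    )
    return out["strategy"]

def run_extraction(doc, group: list[str], strategy: str) -> dict:
    """Extract one field group from the chosen input representation."""
    source = doc.ocr_text if strategy == "ocr_text" else doc.page_images
    return call_llm(f"Extract fields {group} from: {source}. Return JSON.")

def judge_ok(doc, group: list[str], answers: dict) -> bool:
    """Quality Feedback Loop Agent, step 1: LLM-as-judge review."""
    verdict = call_llm(
        f"Fields {group} were extracted as {answers}. Given the document "
        f'{doc.ocr_text[:2000]}, is this correct? Return JSON {{"ok": bool}}'
    )
    return verdict["ok"]

def extract(doc, fields: list[str], max_retries: int = 2) -> dict:
    """Top-level workflow: group fields, extract each group, and let the
    feedback loop act on judged errors instead of just reporting them."""
    results: dict = {}
    for group in group_fields(fields):
        strategy = choose_strategy(doc)
        answers = run_extraction(doc, group, strategy)
        retries = 0
        while not judge_ok(doc, group, answers) and retries < max_retries:
            # Step 2: act on the error, e.g. add the page image as input;
            # a real system might also fall back to multi-model voting.
            strategy = "both"
            answers = run_extraction(doc, group, strategy)
            retries += 1
        results.update(answers)
    return results
```

In a real deployment each of these functions would be a node in a processing graph with its own model, prompt, and observability, which is what makes adding a new specialized agent (say, one for lease agreements) a localized change rather than a rearchitecture.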
## Production Benefits of the Agentic Approach

The multi-agent architecture delivered several operational benefits beyond solving the immediate technical problems:

**Cleaner Abstraction**: The agentic approach shifted the engineering mindset from building large-scale distributed systems to thinking about document processing the way a person or a team would approach it. Rather than designing a monolithic OCR-and-conversion pipeline, engineers could reason about individual document workflows.

**Rapid Evolution**: When customers encountered problems with new document types, the team could quickly develop specialized agents or add supervisory layers to the processing graph. This modular approach replaced the previous pattern of building an entirely new distributed system for each edge case.

**Specialized Agent Development**: For specific document types such as lease agreements, the team could deploy specialized agents with custom routines, handling domain-specific requirements without disrupting the core architecture.

**Engineering Culture Transformation**: By building and thinking in terms of agentic workflows, engineers developed intuitions that helped them better support customers building their own LangGraph-powered agents that integrate with Box's tools.

## LLMOps Lessons and Recommendations

The case study offers several practical insights for teams deploying LLMs in production:

**Anticipate Complexity Escalation**: What works in an initial deployment with cooperative test cases will likely fail when exposed to the full diversity of production data. Enterprise customers will bring documents and requirements that stress every assumption in the system design.

**Build Agentic Early**: Kuss's key recommendation is to adopt agentic architectures from the beginning rather than trying to extend simple pipelines. The refactoring cost of moving from heuristic-based approaches to agent-based systems was significant, and starting with agentic patterns would have provided more flexibility from the outset.

**Embrace Multi-Model Strategies**: The production system uses multiple models for voting, specialized extraction, and quality evaluation. This multi-model approach provides robustness that single-model architectures cannot achieve.

**Consider Observability from the Agent's Perspective**: When debugging or improving the system, reasoning about how an agent "decided" to approach a document provides a clearer mental model than tracing through a complex distributed system.

## Critical Assessment

While the case study presents a positive outcome, several aspects warrant consideration. The transition from a simple architecture to a multi-agent system likely involved significant engineering investment and temporary disruption to customer service; the case study doesn't discuss the costs or timeline of this transition. The claim that agentic architectures are universally superior should also be tempered: for truly simple extraction tasks, the overhead of agent orchestration may not be justified, and the benefits primarily manifested when dealing with complex, varied enterprise documents.

The mention of processing 10 million pages for the first customer suggests high-volume workloads, but the case study doesn't address latency, cost per extraction, or scalability considerations for the more complex agentic architecture. Additionally, while the LLM-as-judge approach to confidence scoring is presented as a solution, the reliability and calibration of LLM judges remain active research areas. Customers' frustration at receiving incorrect answers alongside "low confidence" flags reflects a real limitation of current approaches.

## Technology Stack and Tools

The presentation references LangGraph as a framework customers use to build their own agent systems, suggesting Box's internal architecture may use similar graph-based agent orchestration patterns. The system incorporates OCR processing, RAG techniques for handling long documents, and multi-model evaluation strategies.

## Conclusion

Box's journey from simple LLM-based extraction to a multi-agent architecture illustrates a common pattern in enterprise LLMOps: initial success with generative AI gives way to complex production challenges that demand more sophisticated approaches. The key insight is that agentic architectures, while seemingly more complex upfront, provide the modularity and flexibility needed to handle real-world enterprise requirements and to evolve with changing customer needs. For organizations building LLM-powered features, the recommendation to "build agentic early" reflects hard-won experience with the cost of retrofitting simple systems to handle production complexity.
