Box, a B2B unstructured data platform serving Fortune 500 companies, initially built a straightforward LLM-based metadata extraction system that successfully processed 10 million pages but ran into limitations with complex documents, OCR challenges, and scale requirements. They evolved from a simple pre-process-extract-post-process pipeline to a sophisticated multi-agent architecture that intelligently handles document complexity, field grouping, and quality feedback loops, resulting in a more robust system that evolves easily and better serves enterprise customers' diverse document processing needs.
Box, a B2B enterprise content and unstructured data platform, presents a compelling case study in how production LLM systems must evolve beyond simple architectures to handle real-world enterprise requirements. The company serves over 115,000 organizations including many Fortune 500 companies, managing an exabyte of customer data. Ben Kuss, sharing lessons from Box’s AI journey, describes how the company became one of the first AI deployments for many large enterprises due to their existing trust relationship with customers who were otherwise hesitant about AI adoption.
The case study focuses on Box’s metadata extraction feature—pulling structured information from unstructured documents—which Kuss deliberately chose because it represents “the least agentic looking type of functionality” with no chatbot or conversational interface, making the lessons about agentic architecture particularly instructive.
Box began integrating AI into their products in 2023 with a straightforward architecture for data extraction. The initial pipeline was elegantly simple: take a document, define the fields to extract, perform pre-processing and OCR, then pass everything to a large language model to extract the specified fields. This approach worked remarkably well initially, with the team successfully processing 10 million pages for their first customer deployment.
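The initial pipeline can be sketched in a few lines. This is a minimal illustration of the pattern described above, not Box's actual implementation: `run_ocr` and `call_llm` are hypothetical stand-ins for an OCR engine and a model endpoint.

```python
# A minimal sketch of the initial single-pass extraction pipeline:
# pre-process + OCR the document, then ask one model for all fields.
# `run_ocr` and `call_llm` are hypothetical placeholders.
import json

def run_ocr(document: bytes) -> str:
    # Placeholder: a production system would invoke a real OCR engine.
    return document.decode("utf-8", errors="ignore")

def call_llm(prompt: str) -> str:
    # Placeholder model call; returns a canned JSON answer for the sketch.
    return json.dumps({"invoice_number": "INV-001", "total": "100.00"})

def extract_metadata(document: bytes, fields: list[str]) -> dict:
    """Single-pass extraction: OCR, then one prompt covering every field."""
    text = run_ocr(document)
    prompt = (
        "Extract these fields from the document, answering as JSON: "
        + ", ".join(fields)
        + "\n\n"
        + text
    )
    return json.loads(call_llm(prompt))

result = extract_metadata(b"Invoice INV-001 ... Total: 100.00",
                          ["invoice_number", "total"])
```

The appeal is obvious: one prompt, one model call, any schema. The rest of the case study is about where this simplicity breaks down.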
The team experienced what might be called the “honeymoon phase” of generative AI deployment—the technology seemed to solve their problem completely. The simplicity and apparent universality of the solution led to significant optimism about handling any document type. That optimism was soon tested once the team invited customers to bring any documents they wanted processed.
As customers began submitting increasingly diverse and complex documents, several critical problems emerged that the simple architecture couldn’t handle:
Context Window Limitations: Customers brought 300+ page documents that exceeded the context windows available at the time. The team implemented “enterprise RAG” approaches to handle document chunking and retrieval, but this added significant complexity to what was supposed to be a simple extraction pipeline.
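The chunk-and-retrieve workaround can be sketched as follows. The word-overlap scoring is a toy stand-in for a real embedding-based retriever, and the function names are illustrative, not from Box's system:

```python
# A rough sketch of the "enterprise RAG" workaround: split a long document
# into chunks and retrieve only the passages most relevant to the fields
# being extracted, so the prompt fits the context window.

def chunk_pages(pages: list[str], pages_per_chunk: int = 2) -> list[str]:
    """Join consecutive pages into fixed-size chunks."""
    return ["\n".join(pages[i:i + pages_per_chunk])
            for i in range(0, len(pages), pages_per_chunk)]

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by naive term overlap with the query; keep the top k."""
    terms = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(terms & set(c.lower().split())),
        reverse=True,
    )
    return scored[:k]

# A 300-page document where only a couple of pages matter for the query.
pages = [f"page {i} boilerplate" for i in range(300)]
pages[41] = "page 41: effective date 2023-01-01"
pages[42] = "page 42: termination clause applies"

relevant = retrieve(chunk_pages(pages), "effective date of the agreement")
```

Even this toy version shows the added moving parts (chunk sizing, ranking, top-k selection) that turned a one-call pipeline into a retrieval system.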
OCR Quality Issues: Real-world documents included handwritten annotations, crossed-out text, and multilingual content that degraded OCR accuracy. The system needed to handle cases where the text representation didn’t match what a human could visually interpret from the document.
Attention Overwhelm: Some customers requested extraction of 200-500 different fields from single documents, overwhelming the model’s attention mechanisms, especially on complex documents. The initial approach of simply “chunking fields into groups” failed when semantically related fields (like customer names and customer addresses) needed to be processed together for coherent extraction.
Confidence Scoring Requirements: Enterprise customers accustomed to traditional ML systems expected confidence scores—a feature that generative AI doesn’t naturally provide. The team implemented “LLM as judge” approaches where a second model would evaluate extraction accuracy, but customers rightly questioned why they should accept wrong answers when the system could identify its own errors.
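The LLM-as-judge pattern amounts to a second model call that grades the first. A hedged sketch, with `call_judge` as a hypothetical placeholder (here approximated by a simple text-support check rather than a real second model):

```python
# A sketch of the "LLM as judge" pattern: a second evaluation pass scores
# each extracted value, and the score is surfaced as a confidence signal.
# `call_judge` is a placeholder, not an API named in the talk.

def call_judge(document_text: str, field: str, value: str) -> float:
    # Placeholder: a real judge prompt would ask a second model whether
    # `value` for `field` is supported by `document_text`. Here we just
    # check that the value literally appears in the source text.
    return 1.0 if value in document_text else 0.1

def score_extraction(document_text: str, extraction: dict) -> dict:
    """Attach a per-field confidence score to an extraction result."""
    return {field: call_judge(document_text, field, value)
            for field, value in extraction.items()}

doc = "Invoice INV-001 issued to Acme Corp, total 100.00"
scores = score_extraction(doc, {"invoice_number": "INV-001",
                                "customer": "Globex"})
```

The customer complaint in the text maps directly onto this sketch: if the judge can flag `customer` as low confidence, why does the system return the wrong value instead of acting on that signal? That gap is what the later feedback-loop agent addresses.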
This period represented Box’s “trough of disillusionment” with generative AI, where the elegant initial solution proved inadequate for production-scale enterprise requirements.
After watching Andrew Ng’s deep learning course featuring Harrison Chase, the team began exploring agentic approaches despite internal skepticism. Some engineers argued that data extraction was “just a function” and didn’t warrant the complexity of agent-based systems. Nevertheless, the team proceeded with a complete rearchitecture.
The new multi-agent architecture separated problems into specialized sub-agents:
Field Grouping Agent: Instead of using heuristic-based approaches to chunk fields, an intelligent agent learned to group semantically related fields together. This solved the problem where naive chunking would separate related concepts (like customer information and customer addresses), causing extraction errors.
Intelligent Extraction Agent: Rather than prescribing a fixed extraction strategy, the agent could dynamically determine whether to analyze OCR text, examine page images directly, or combine approaches based on document characteristics. This flexibility allowed the system to adapt to documents where OCR was unreliable.
Quality Feedback Loop Agent: Beyond simply providing confidence scores, this component could take action on detected errors. When extraction appeared incorrect, the agent could retry with different techniques, use multiple models for voting-based consensus, or apply specialized approaches for problematic fields.
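The three sub-agents above can be compressed into one illustrative loop. Everything below is a stub-level sketch of the described architecture, not Box's code: the grouping heuristic, strategy names, and helper functions are all assumptions made for the example.

```python
# A compressed sketch of the three sub-agents wired together: group
# related fields, extract each group, and let a quality agent retry
# low-confidence groups with a different strategy. The model calls are
# stubbed; the control flow, not the stubs, is the point.

def group_fields(fields: list[str]) -> list[list[str]]:
    # Stand-in for the field-grouping agent: keep fields sharing a prefix
    # (e.g. customer_name / customer_address) in the same group.
    groups: dict[str, list[str]] = {}
    for f in fields:
        groups.setdefault(f.split("_")[0], []).append(f)
    return list(groups.values())

def extract(text: str, group: list[str], strategy: str) -> dict:
    # Stand-in for the extraction agent; `strategy` would select between
    # OCR text, page images, or a combination in the real system.
    return {f: f"value-of-{f}" for f in group if f.split("_")[0] in text}

def judge(result: dict, group: list[str]) -> bool:
    # Stand-in for the quality agent: was every requested field recovered?
    return all(f in result for f in group)

def run(text: str, fields: list[str]) -> dict:
    answers: dict = {}
    for group in group_fields(fields):
        result = extract(text, group, strategy="ocr_text")
        if not judge(result, group):
            # Feedback loop: retry the whole group with another strategy.
            result = extract(text, group, strategy="page_image")
        answers.update(result)
    return answers

out = run("customer invoice",
          ["customer_name", "customer_address", "invoice_total"])
```

Note how the grouping step keeps `customer_name` and `customer_address` in one extraction call, avoiding the coherence failures that naive field chunking produced.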
The multi-agent architecture delivered several operational benefits that extended beyond solving the immediate technical problems:
Cleaner Abstraction: The agentic approach shifted the engineering mindset from building “large-scale distributed systems” to thinking about document processing the way a person or team would approach it. Rather than designing a monolithic OCR and conversion pipeline, engineers could reason about individual document workflows.
Rapid Evolution: When customers encountered problems with new document types, the team could quickly develop specialized agents or add supervisory layers to the processing graph. This modular approach replaced the previous pattern of building entirely new distributed systems for each edge case.
Specialized Agent Development: For specific document types like lease agreements, the team could deploy specialized agents with custom routines, allowing the system to handle domain-specific requirements without disrupting the core architecture.
Engineering Culture Transformation: By having engineers build and think in terms of agentic workflows, the team developed intuitions that helped them better support customers building their own LangGraph-powered agents that integrate with Box’s tools.
The case study provides several practical insights for teams deploying LLMs in production:
Anticipate Complexity Escalation: What works in initial deployment with cooperative test cases will likely fail when exposed to the full diversity of production data. Enterprise customers will bring documents and requirements that stress every assumption in the system design.
Build Agentic Early: Kuss’s key recommendation is to adopt agentic architectures from the beginning rather than trying to extend simple pipelines. The refactoring cost of moving from heuristic-based approaches to agent-based systems was significant, and starting with agentic patterns would have provided more flexibility from the outset.
Embrace Multi-Model Strategies: The production system uses multiple models for voting, specialized extraction, and quality evaluation. This multi-model approach provides robustness that single-model architectures cannot achieve.
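Voting-based consensus, one of the multi-model strategies mentioned, is straightforward to sketch. The model callables below are stubs; in production each would be a different LLM endpoint:

```python
# A small sketch of voting-based consensus for a single field: query
# several models, take the majority answer, and report the agreement
# ratio as a confidence signal.
from collections import Counter

def vote(candidates: list[str]) -> tuple[str, float]:
    """Return the majority answer and its agreement ratio."""
    counts = Counter(candidates)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(candidates)

# Stub models: the third "misreads" a zero as the letter O.
models = [lambda doc: "INV-001", lambda doc: "INV-001", lambda doc: "INV-OO1"]
answers = [m("...") for m in models]
answer, confidence = vote(answers)
```

Disagreement among models both corrects individual errors and yields a confidence-like score without relying on a single model's self-assessment.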
Consider Observability from the Agent Perspective: When debugging or improving the system, thinking about how an agent “decided” to approach a document provides clearer mental models than tracing through complex distributed systems.
While the case study presents a positive outcome, several aspects warrant consideration:
The transition from a simple architecture to a multi-agent system likely involved significant engineering investment and temporary disruption to customer service. The case study doesn’t discuss the costs or timeline of this transition.
The claim that agentic architectures are universally superior should be tempered—for truly simple extraction tasks, the overhead of agent orchestration may not be justified. The benefits primarily manifested when dealing with complex, varied enterprise documents.
The mention of processing 10 million pages for the first customer suggests high-volume throughput, but the case study doesn’t address latency, cost-per-extraction, or scalability considerations for the more complex agentic architecture.
Additionally, while the LLM-as-judge approach for confidence scoring is mentioned as a solution, the fundamental challenge of LLM reliability and calibration remains an active research area. Customers’ frustration with receiving incorrect answers alongside “low confidence” flags represents a real limitation of current approaches.
The presentation references LangGraph as a framework customers use for building their own agent systems, suggesting Box’s internal architecture may use similar graph-based agent orchestration patterns. The system incorporates OCR processing, RAG techniques for handling long documents, and multi-model evaluation strategies.
Box’s journey from simple LLM-based extraction to multi-agent architecture illustrates a common pattern in enterprise LLMOps: initial success with generative AI gives way to complex production challenges that require more sophisticated approaches. The key insight is that agentic architectures, while seemingly more complex upfront, provide the modularity and flexibility needed to handle real-world enterprise requirements and evolve with changing customer needs. For organizations building LLM-powered features, the recommendation to “build agentic early” reflects hard-won experience with the costs of retrofitting simple systems to handle production complexity.