Box's journey with AI-powered document data extraction represents a compelling case study in the evolution from simple LLM applications to sophisticated agentic AI systems in production environments. As an enterprise content platform storing over an exabyte of data for more than 115,000 customers, including roughly two-thirds of the Fortune 500, Box's experience provides valuable insights into the challenges and solutions for deploying LLMs at scale in enterprise settings.
## Company Background and Initial Challenge
Box operates as an unstructured content platform that has specialized in secure enterprise content management for over 15 years. The company positioned itself as a trusted AI deployment partner for enterprises concerned about data security and leakage, often becoming its customers' first AI deployment. The core challenge they addressed was the longstanding enterprise problem of extracting structured data from unstructured content - a critical need given that approximately 90% of enterprise data exists in unstructured formats like documents, contracts, and project proposals.
Historically, this extraction process relied on Intelligent Document Processing (IDP) solutions that required specialized AI models, extensive training data, and custom ML development. These traditional approaches were brittle, expensive, and limited in scope, leading most enterprises to avoid automating their most critical unstructured data processes.
## Initial LLM Implementation and Early Success
Box's initial approach with generative AI models (starting in 2023) followed a straightforward pattern that many organizations adopt: preprocessing documents through OCR, then using single-shot prompts to extract desired fields from the processed text. This approach leveraged multiple models from different vendors (including OpenAI's GPT models, Google's Gemini, Meta's Llama, and Anthropic's Claude) to improve reliability and enable performance comparison.
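This first-generation pipeline is simple to sketch. The snippet below illustrates the pattern rather than Box's actual code: the lease field schema is made up, the model name is a placeholder, and an equivalent chat-completion client from any of the vendors above would slot in the same way.

```python
import json
from openai import OpenAI  # any vendor's chat-completion client works similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical field schema: the structured fields a customer wants pulled out.
FIELDS = {
    "effective_date": "Date the agreement takes effect (ISO 8601)",
    "landlord_name": "Full legal name of the landlord",
    "tenant_name": "Full legal name of the tenant",
    "monthly_rent": "Monthly rent amount with currency",
}

def extract_fields(ocr_text: str) -> dict:
    """Single-shot extraction: one prompt over the full OCR'd document text."""
    field_list = "\n".join(f"- {name}: {desc}" for name, desc in FIELDS.items())
    prompt = (
        "Extract the following fields from the document below. "
        "Return a JSON object keyed by field name; use null if a field is absent.\n\n"
        f"Fields:\n{field_list}\n\nDocument:\n{ocr_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```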
The early results were remarkably successful, with off-the-shelf models outperforming even specialized traditional ML models. The solution offered flexibility across document types, strong performance on standard extraction tasks, and relatively straightforward implementation. This initial success validated the potential of generative AI for document processing and excited both the Box team and their customers.
## Scaling Challenges and Limitations
However, as customer requirements became more complex, significant limitations emerged. The single-shot approach struggled with several critical challenges that are common in production LLM deployments:
**Document Complexity**: When customers requested extraction from 300-page lease documents with 300+ fields, or complex digital assets requiring risk assessments, the models began to lose accuracy and coherence. The attention limitations of even advanced models became apparent when handling both large documents and numerous complex extraction targets simultaneously.
**OCR and Format Variability**: Real-world document processing revealed the persistent challenges of OCR accuracy, particularly with scanned documents, handwritten annotations, crossed-out text, and various PDF formats. These preprocessing failures cascaded into poor AI performance regardless of model quality.
**Multilingual Requirements**: Box's international customer base required processing documents in multiple languages, adding another layer of complexity that strained the single-shot approach.
**Field Relationship Complexity**: The system struggled with maintaining relationships between related fields (such as matching parties in a contract with their corresponding addresses), often producing logically inconsistent extractions.
**Accuracy Assessment**: Unlike traditional ML models that provide confidence scores, large language models lack reliable self-assessment capabilities. Even when implementing LLM-as-a-judge approaches, the system could only identify potential errors without fixing them, leaving enterprise customers with uncertainty they could not act on.
## Agentic Architecture Solution
Rather than waiting for better models or reverting to traditional approaches, Box developed an agentic AI architecture that orchestrates multiple AI components in a directed graph workflow. This decision was initially controversial within the engineering team, many of whom preferred conventional solutions like improved OCR, post-processing regex checks, or fine-tuned models.
The agentic approach maintains the same input/output interface (documents in, extracted data out) while completely transforming the internal processing pipeline. The architecture includes several key components (a simplified orchestration sketch follows the component descriptions below):
**Field Preparation and Grouping**: The system intelligently groups related fields that should be processed together to maintain logical consistency. For example, contract parties and their addresses are handled as unified units rather than independent extractions.
**Multi-Step Processing**: Rather than single-shot extraction, the system breaks complex documents into manageable chunks and processes them through multiple focused queries, allowing for better attention and accuracy on each subset.
**Validation and Cross-Checking Tools**: The architecture incorporates multiple validation mechanisms, including OCR verification, visual page analysis, and multi-model voting systems where different vendors' models vote on difficult extraction decisions.
**Iterative Refinement**: Using LLM-as-a-judge not just for assessment but for feedback-driven improvement, the system can iterate on extractions until quality thresholds are met.
**Multi-Model Ensemble**: The system leverages multiple models from different vendors, using voting mechanisms to resolve disagreements and improve overall accuracy.
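To make these components concrete, the following is a highly simplified orchestration sketch, not Box's production system: the grouping heuristic, chunk size, voting rule, judge prompt, and refinement limit are all illustrative placeholders, and the two OpenAI model names merely stand in for what would in practice be a mix of vendors.

```python
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()
# Illustrative stand-ins: a production system would mix vendors (OpenAI, Anthropic, Google, ...).
VOTING_MODELS = ["gpt-4o-mini", "gpt-4o"]
MAX_REFINEMENT_ROUNDS = 3
CHUNK_CHARS = 12_000  # placeholder chunk size, not a tuned value

def ask_json(model: str, prompt: str) -> dict:
    """One JSON-mode chat call; a thin wrapper shared by every step below."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

def group_fields(fields: dict) -> list[dict]:
    """Field preparation: keep logically related fields (e.g. party + address) together."""
    related = {k: v for k, v in fields.items() if "name" in k or "address" in k}
    rest = {k: v for k, v in fields.items() if k not in related}
    return [g for g in (related, rest) if g]

def chunk_document(text: str) -> list[str]:
    """Multi-step processing: split long documents into attention-friendly pieces."""
    return [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]

def extract(model: str, excerpt: str, fields: dict) -> dict:
    field_list = "\n".join(f"- {k}: {v}" for k, v in fields.items())
    return ask_json(model, (
        "Extract these fields from the document excerpt. Return a JSON object; "
        f"use null if a field is absent.\n\nFields:\n{field_list}\n\nExcerpt:\n{excerpt}"
    ))

def vote(candidates: list[dict], field: str):
    """Multi-model ensemble: majority vote across models for a single field."""
    values = [json.dumps(c.get(field)) for c in candidates]
    return json.loads(Counter(values).most_common(1)[0][0])

def judge(text: str, extraction: dict) -> list[str]:
    """Validation / LLM-as-a-judge: return the names of fields that look wrong."""
    verdicts = ask_json(VOTING_MODELS[0], (
        'For each field answer "supported" or "suspect", judging only from the document. '
        f"Return JSON {{field: verdict}}.\n\nExtraction:\n{json.dumps(extraction)}\n\n"
        f"Document:\n{text[:CHUNK_CHARS]}"
    ))
    return [f for f, v in verdicts.items() if v == "suspect"]

def extract_document(text: str, fields: dict) -> dict:
    result: dict = {}
    for group in group_fields(fields):                      # field grouping
        for excerpt in chunk_document(text):                # focused, chunked queries
            candidates = [extract(m, excerpt, group) for m in VOTING_MODELS]
            for f in group:
                value = vote(candidates, f)
                if value is not None and f not in result:
                    result[f] = value
    for _ in range(MAX_REFINEMENT_ROUNDS):                  # iterative refinement
        suspect = {f: fields[f] for f in judge(text, result) if f in fields}
        if not suspect:
            break
        result.update(extract(VOTING_MODELS[0], text[:CHUNK_CHARS], suspect))
    return result
```

Even in this toy form, the shape matches the description above: focused queries over grouped fields and chunks, a cross-model vote on disagreements, and a judge-driven loop that re-extracts only the fields it distrusts.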
## Technical Architecture Details
The agentic framework resembles systems like LangGraph, providing sophisticated orchestration capabilities while maintaining clean abstraction layers. Box separated the concerns of agentic workflow design from distributed system scaling, allowing different teams to optimize each aspect independently. This separation proved crucial for handling both individual complex documents and the scale requirements of processing 100 million documents daily.
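One way to picture that separation (the class and method names below are hypothetical, belonging neither to Box nor to LangGraph) is a workflow declared as a plain graph of nodes and edges, handed to a runner that decides how the nodes execute: in-process during development, or dispatched to a distributed worker fleet at production scale, without the graph definition changing.

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict  # the document-processing state passed between nodes

@dataclass
class Workflow:
    """Agentic concern: which steps exist and how they connect (a directed graph)."""
    nodes: dict[str, Callable[[State], State]] = field(default_factory=dict)
    edges: dict[str, str] = field(default_factory=dict)
    entry: str = ""

    def add_node(self, name: str, fn: Callable[[State], State],
                 next_node: str | None = None):
        self.nodes[name] = fn
        if next_node:
            self.edges[name] = next_node
        if not self.entry:
            self.entry = name
        return self

class LocalRunner:
    """Distributed-systems concern: how nodes execute. This one runs in-process;
    a production runner could dispatch each node to a worker queue instead,
    without touching the Workflow definition at all."""
    def run(self, wf: Workflow, state: State) -> State:
        name = wf.entry
        while name:
            state = wf.nodes[name](state)
            name = wf.edges.get(name, "")
        return state

# Usage: the same workflow object could later be handed to a distributed runner.
wf = (Workflow()
      .add_node("ocr", lambda s: {**s, "text": f"<ocr of {s['doc_id']}>"}, "extract")
      .add_node("extract", lambda s: {**s, "fields": {"tenant_name": "..."}}))
print(LocalRunner().run(wf, {"doc_id": "lease-123"}))
```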
The architecture's modularity enabled rapid iteration and improvement. When new challenges arose, solutions often involved adding new nodes to the workflow graph or adjusting existing prompt strategies rather than redesigning the entire system. This flexibility became a significant operational advantage, reducing time-to-solution for new requirements.
## Expansion to Advanced Capabilities
The agentic foundation enabled Box to develop more sophisticated capabilities beyond basic extraction. They launched "deep research" functionality that allows customers to conduct comprehensive analysis across their document repositories, similar to the deep research features that OpenAI and Google's Gemini offer over internet content.
This deep research capability uses a complex directed graph workflow that includes document search, relevance assessment, outline generation, and comprehensive analysis - capabilities that would have been difficult to achieve without the agentic foundation already in place.
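Box does not publish the exact graph, but the stages it names map naturally onto a pipeline like the hedged sketch below, where `search_repository` stands in for the platform's existing content search and every other step is plain prompting.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return r.choices[0].message.content

def deep_research(question: str, search_repository) -> str:
    """Hypothetical deep-research pipeline over a document repository.

    `search_repository(query) -> list[(doc_id, snippet)]` is assumed to be an
    existing content-search function supplied by the caller.
    """
    # 1. Document search: turn the question into repository queries.
    queries = ask(
        f"Propose 3 short search queries for: {question}. One per line."
    ).splitlines()
    hits = [hit for q in queries for hit in search_repository(q)]

    # 2. Relevance assessment: keep only documents the model judges on-topic.
    relevant = [
        (doc_id, snippet) for doc_id, snippet in hits
        if "yes" in ask(
            f"Is this snippet relevant to '{question}'? Answer yes or no.\n\n{snippet}"
        ).lower()
    ]

    # 3. Outline generation, then 4. comprehensive analysis grounded in the snippets.
    context = "\n\n".join(f"[{d}] {s}" for d, s in relevant)
    outline = ask(f"Draft a report outline answering '{question}' using only:\n{context}")
    return ask(
        f"Write the report following this outline, citing [doc ids]:\n{outline}\n\n"
        f"Sources:\n{context}"
    )
```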
## Production Deployment and API Strategy
Box maintains an API-first approach, exposing their agentic capabilities through agent APIs that customers can integrate into their workflows. The system supports multi-tenancy and enterprise security requirements while providing the flexibility needed for diverse use cases across their customer base.
The production deployment includes comprehensive evaluation strategies combining traditional eval sets with LLM-as-a-judge approaches and challenge sets designed to test edge cases and prepare for increasingly complex future requirements. This evaluation framework helps ensure consistent quality as the system evolves.
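The presentation does not detail the harness itself; a minimal version of the three layers described (gold-labeled eval sets, an LLM judge for unlabeled cases, and a separately tracked challenge tier) might look roughly like the following, with the scoring rules and judge prompt as assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

def exact_match_score(predicted: dict, expected: dict) -> float:
    """Traditional eval set: fraction of labeled fields reproduced exactly."""
    hits = sum(1 for k, v in expected.items() if predicted.get(k) == v)
    return hits / max(len(expected), 1)

def judge_score(document: str, predicted: dict) -> float:
    """LLM-as-a-judge: usable where no gold labels exist, but adds its own biases."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content":
            "Rate 0-10 how well this extraction is supported by the document. "
            f"Reply with a number only.\n\nExtraction:\n{json.dumps(predicted)}\n\n"
            f"Document:\n{document}"}],
    )
    return float(r.choices[0].message.content.strip()) / 10

def run_suite(extract_fn, cases: list[dict]) -> dict:
    """Each case has a 'document', an optional 'expected' label set, and a 'tier'
    such as 'golden' or 'challenge' (harder edge cases tracked separately)."""
    scores: dict[str, list[float]] = {}
    for case in cases:
        predicted = extract_fn(case["document"])
        score = (exact_match_score(predicted, case["expected"])
                 if "expected" in case else judge_score(case["document"], predicted))
        scores.setdefault(case["tier"], []).append(score)
    return {tier: sum(v) / len(v) for tier, v in scores.items()}
```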
## Engineering and Organizational Lessons
Box's experience highlighted the importance of helping engineering teams transition to "agentic-first" thinking. This cultural shift required moving beyond traditional software development patterns to embrace AI-orchestrated workflows as first-class architectural patterns.
The company deliberately chose to avoid fine-tuning approaches, citing the operational complexity of maintaining fine-tuned models across multiple vendors and model versions. Instead, they rely on prompt engineering, prompt caching, and agentic orchestration to achieve the required performance and reliability.
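Box does not name the specific mechanisms involved, but the pattern can be sketched in hedged form: a single engineered prompt shared across providers avoids maintaining fine-tuned variants per vendor and model version, and keeping the instruction prefix stable lets providers that cache prompt prefixes skip reprocessing it on every document. The client calls and model names below are placeholders for whichever vendors are in play.

```python
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

# One shared prompt (prompt engineering) instead of per-vendor fine-tuned models.
EXTRACTION_PROMPT = (
    "Extract tenant_name and monthly_rent from the lease below. "
    "Return a JSON object; use null for missing fields.\n\n{document}"
)

def call_openai(prompt: str) -> str:
    r = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

def call_anthropic(prompt: str) -> str:
    r = anthropic_client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text

PROVIDERS = {"openai": call_openai, "anthropic": call_anthropic}

def extract(document: str, provider: str = "openai") -> str:
    # The instruction prefix stays identical across calls, so providers that cache
    # prompt prefixes can avoid reprocessing it for each new document.
    return PROVIDERS[provider](EXTRACTION_PROMPT.format(document=document))
```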
## Critical Assessment and Limitations
While Box's presentation emphasizes the successes of their agentic approach, several limitations and considerations should be noted:
**Complexity and Latency**: The agentic approach inherently introduces more complexity and longer processing times compared to single-shot extraction. While Box mentions this trade-off, the specific performance impacts and cost implications aren't detailed.
**Vendor Lock-in Concerns**: Despite supporting multiple models, the architecture's dependence on external LLM providers creates potential risks around API availability, pricing changes, and model deprecation.
**Evaluation Challenges**: While Box implemented comprehensive evaluation strategies, the fundamental challenge of assessing LLM accuracy in complex extraction tasks remains. The reliance on LLM-as-a-judge approaches introduces potential biases and failure modes.
**Scalability Questions**: While Box claims to handle 100 million documents daily, the presentation doesn't provide specific details about the infrastructure requirements, costs, or performance characteristics of the agentic approach at this scale.
## Industry Impact and Implications
Box's evolution from simple LLM prompting to agentic AI workflows represents a maturation pattern likely to be repeated across many enterprise AI applications. The case study demonstrates that initial LLM success often reveals deeper requirements that demand more sophisticated orchestration and validation approaches.
The emphasis on maintaining clean abstraction layers between agentic logic and distributed systems concerns provides a valuable architectural pattern for other organizations building production LLM applications. This separation of concerns enables independent optimization of AI capabilities and operational scalability.
Box's experience also illustrates the organizational challenges of adopting agentic AI approaches, requiring engineering teams to develop new mental models and architectural patterns. The success of such implementations depends not just on technical capabilities but on the organization's ability to adapt to AI-first development approaches.
The case study reinforces the importance of comprehensive evaluation frameworks and the challenges of assessing LLM performance in complex, real-world applications. Box's multi-layered approach combining traditional eval sets, LLM judges, and challenge datasets provides a model for rigorous AI system validation in production environments.