**Company:** Dropbox

**Title:** Building a Universal Search Product with RAG and AI Agents

**Industry:** Tech

**Year:** 2025

**Summary (short):** Dropbox developed Dash, a universal search and knowledge management product that addresses the challenges of fragmented business data across multiple applications and formats. The solution combines retrieval-augmented generation (RAG) and AI agents to provide powerful search capabilities, content summarization, and question-answering features. They implemented a custom Python interpreter for AI agents and developed a sophisticated RAG system that balances latency, quality, and data freshness requirements for enterprise use.
## Overview

Dropbox Dash represents a significant production deployment of LLM-powered capabilities for enterprise knowledge management. The product aims to solve a common business problem: knowledge workers spend excessive time searching for information scattered across multiple applications, formats, and data modalities. Dash is positioned as a "universal search" product that combines AI-powered search with granular access controls, summarization, question-answering, and draft generation capabilities.

The case study provides valuable insights into the engineering decisions and trade-offs involved in building an enterprise-grade LLM application, particularly around retrieval-augmented generation (RAG) and AI agent architectures. While the article comes from Dropbox's engineering blog and naturally presents their work favorably, it does offer substantive technical details about their approach.

## The Core Problem

The challenges Dropbox identified for enterprise AI are threefold:

- **Data Diversity**: Businesses handle many data types, including emails, documents, meeting notes, and task management data, each with unique structures and contexts
- **Data Fragmentation**: Relevant information is spread across multiple applications and services, requiring aggregation and synthesis
- **Data Modalities**: Business data exists in multiple forms, including text, images, audio, and video, requiring multi-modal processing capabilities

These challenges are genuine concerns for any enterprise search or knowledge management system, and they directly impact the design of both the retrieval and generation components of a RAG system.

## RAG Implementation Details

### Retrieval System Architecture

Dropbox made a deliberate architectural choice for their retrieval system that diverges from the common approach of using purely vector-based semantic search. Their system combines:

- **Traditional Information Retrieval (IR)**: A lexical search system that indexes documents by their textual features
- **On-the-fly Chunking**: Documents are chunked at query time rather than pre-chunked during indexing, which ensures retrieval of only the relevant sections
- **Reranking**: A larger embedding-based model re-sorts the initial results to place the most relevant chunks at the top

The article candidly discusses the trade-offs they considered:

- **Latency vs. Quality**: Advanced semantic search methods offer better quality but at a higher latency cost. Dropbox targeted sub-2-second response times for over 95% of queries.
- **Data Freshness vs. Scalability**: Frequent re-indexing for fresh data can hurt system throughput and spike latency. They implemented periodic data syncs and webhooks where appropriate.
- **Budget vs. User Experience**: High-quality solutions with advanced embeddings and re-ranking require significant compute resources.

Their choice of traditional IR with on-the-fly chunking and reranking is interesting because it suggests that pure vector search wasn't meeting their latency requirements while maintaining quality. This is a practical consideration that many production systems face.
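
To make the retrieval flow described above more concrete, here is a minimal sketch of a lexical-first pipeline with query-time chunking and embedding-based reranking. Dropbox hasn't published code for Dash, so the function names, the toy term-overlap scoring, and the injected `embed` callable are illustrative assumptions rather than their actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    doc_id: str
    text: str
    lexical_score: float


def lexical_search(query: str, index: dict[str, str], k: int = 50) -> list[tuple[str, float]]:
    """Toy stand-in for the lexical IR stage: score documents by query-term overlap."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in index.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((doc_id, float(overlap)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def chunk_on_the_fly(doc_id: str, text: str, score: float, window: int = 60) -> list[Chunk]:
    """Chunk a retrieved document at query time instead of pre-chunking at indexing time."""
    words = text.split()
    return [
        Chunk(doc_id, " ".join(words[i : i + window]), score)
        for i in range(0, len(words), window)
    ]


def rerank(query: str, chunks: list[Chunk], embed: Callable[[str], list[float]], top_n: int = 8) -> list[Chunk]:
    """Re-sort candidate chunks with a larger embedding model (cosine similarity)."""
    query_vec = embed(query)

    def cosine(chunk: Chunk) -> float:
        chunk_vec = embed(chunk.text)
        dot = sum(a * b for a, b in zip(query_vec, chunk_vec))
        norm = (sum(a * a for a in query_vec) ** 0.5) * (sum(b * b for b in chunk_vec) ** 0.5)
        return dot / norm if norm else 0.0

    return sorted(chunks, key=cosine, reverse=True)[:top_n]


def retrieve(query: str, index: dict[str, str], embed: Callable[[str], list[float]]) -> list[Chunk]:
    """Lexical retrieval -> on-the-fly chunking -> embedding rerank."""
    candidates: list[Chunk] = []
    for doc_id, score in lexical_search(query, index):
        candidates.extend(chunk_on_the_fly(doc_id, index[doc_id], score))
    return rerank(query, candidates, embed)
```

The point of this shape is that the expensive embedding model only ever sees the small candidate set produced by the cheap lexical stage, which is one plausible way to meet a sub-2-second latency target while still getting semantic ordering at the top of the results.
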
### Model Selection and Evaluation

Dropbox conducted rigorous evaluation of their RAG system using several public benchmark datasets:

- Google's Natural Questions (real user queries with large documents)
- MuSiQue (multi-hop questions requiring information linking)
- Microsoft's Machine Reading Comprehension (short passages and multi-document queries from Bing logs)

Their evaluation metrics included:

- **LLM Judge for Answer Correctness**: Passing retrieved evidence through an LLM to score final answer accuracy
- **LLM Judge for Completeness**: Measuring whether all relevant aspects of the question are addressed
- **Source Precision, Recall, and F1**: Evaluating how accurately they retrieved the key passages needed for correct answers

This use of LLM-based evaluation judges is now a common practice in production LLM systems, though the article doesn't discuss potential issues with LLM judge reliability or calibration. The system is described as model-agnostic, allowing flexibility in LLM selection and adaptation to rapid developments in the field.

## AI Agent Architecture

For complex, multi-step tasks that RAG alone cannot handle, Dropbox developed an AI agent system. Their definition of AI agents focuses on "multi-step orchestration systems that can dynamically break down user queries into individual steps."

### Two-Stage Approach

**Stage 1 - Planning**: The LLM breaks down a user query into a sequence of high-level steps, expressed as code statements in a custom domain-specific language (DSL) that resembles Python. This approach forces the LLM to express its reasoning as structured, executable code rather than free-form text.

**Stage 2 - Execution**: The generated code is validated through static analysis and then executed. If the LLM references functionality that doesn't exist, a second LLM call is used to implement the missing code.

This two-stage approach allows the agent to maintain clarity in its overall plan while being adaptable to new query types.

### Custom Interpreter and Security

A notable aspect of their implementation is the development of a custom Python interpreter built from scratch specifically for executing LLM-generated code. This interpreter includes:

- **Static Analysis**: Examining code without execution to identify security risks, missing functionality, and correctness errors
- **Dry Runs**: Testing code paths before actual execution
- **Runtime Type Enforcement**: Ensuring data and objects are of expected types

The decision to build a minimal interpreter rather than using the full Python runtime is explicitly security-motivated. By implementing only the required functionality, they avoid inheriting security vulnerabilities present in full-featured interpreters.

### Testing and Debugging Benefits

The code-based approach to agent planning offers several operational advantages:

- **Debuggability**: When failures occur, they can identify which specific step failed rather than getting generic "can't answer" responses
- **Deterministic Testing**: Individual components like date resolution can be tested independently
- **Easier Evaluation**: Responses can be evaluated by checking return types rather than semantic similarity

This approach represents a thoughtful solution to the challenge of testing LLM-based systems, where output variability across model versions typically makes traditional testing difficult.
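
The article doesn't share Dropbox's DSL or interpreter, but the two-stage plan-then-execute loop can be sketched with Python's standard `ast` module standing in for their static analysis, and plain `exec` standing in for their from-scratch interpreter. Everything here (the tool registry, the prompts, and the helper names) is a hypothetical illustration of the control flow, not their code.

```python
import ast
from typing import Callable

# Tools the generated plan is allowed to call; anything else counts as missing functionality.
REGISTERED_TOOLS: dict[str, Callable] = {
    "search_documents": lambda query: [f"document matching {query!r}"],
    "summarize": lambda passages: " / ".join(passages),
}


def plan(llm: Callable[[str], str], user_query: str) -> str:
    """Stage 1: ask the LLM to express its plan as code calling the available tools."""
    prompt = (
        "Break the request into steps, written as Python calls to these tools: "
        f"{sorted(REGISTERED_TOOLS)}. Request: {user_query}"
    )
    return llm(prompt)


def missing_functions(plan_code: str) -> set[str]:
    """Static analysis: walk the AST and flag calls to functions that aren't registered."""
    tree = ast.parse(plan_code)
    called = {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return called - set(REGISTERED_TOOLS)


def execute(llm: Callable[[str], str], plan_code: str) -> dict:
    """Stage 2: validate the plan, fill in missing tools via a second LLM call, then run it."""
    for name in missing_functions(plan_code):
        source = llm(f"Implement a Python function named {name} as used in:\n{plan_code}")
        namespace: dict = {}
        exec(source, namespace)    # the real system uses a custom interpreter, not exec
        REGISTERED_TOOLS[name] = namespace[name]
    scope: dict = dict(REGISTERED_TOOLS)
    exec(plan_code, scope)         # likewise sandboxed and type-checked in the real system
    return scope
```

Because the plan is ordinary code, each step can be unit-tested in isolation and a failure points at a specific statement rather than surfacing as a generic "can't answer" response, which is exactly the debuggability argument the article makes.
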
## LLMOps Considerations

Several LLMOps lessons emerge from this case study:

### Trade-off Management

The article is refreshingly honest about the trade-offs involved in production LLM systems. They explicitly discuss that larger models provide more precise results but introduce latency that may not meet user expectations. The 2-second latency target for 95% of queries represents a concrete SLA that drove many of their architectural decisions.

### Model Agnosticism

Their decision to build a model-agnostic system is a practical LLMOps consideration. It allows them to swap models as the field evolves rapidly, and potentially offer customers choice in which models are used. However, they also note that "the same prompts can't be used for different LLMs," meaning model agnosticism comes with the cost of maintaining multiple prompt variants.

### Evaluation Strategy

The combination of public benchmarks and custom metrics using LLM judges represents a pragmatic approach to evaluation. Their emphasis on end-to-end quality measurement acknowledges that component-level metrics may not capture the true user experience.

### Security Architecture

The custom interpreter approach demonstrates how security considerations can drive architectural decisions in LLMOps. Rather than retrofitting security onto an existing execution environment, they built a minimal runtime that limits the attack surface by design.

## Future Directions

Dropbox outlines several future directions, including multi-turn conversations, self-reflective agents that evaluate their own performance, continuous fine-tuning for specific business needs, and multi-language support. These represent common evolution paths for production LLM systems moving from initial deployment toward more sophisticated capabilities.

## Balanced Assessment

While the case study provides valuable technical insights, it's worth noting some limitations:

- No concrete performance metrics or customer outcomes are shared beyond the 95th percentile latency target
- The evaluation methodology using LLM judges isn't detailed enough to assess its reliability
- There's no discussion of failure modes, hallucination rates, or how they handle edge cases
- The article doesn't address costs or infrastructure requirements

Overall, the Dropbox Dash case study offers a solid example of production LLMOps practices, particularly around the integration of RAG with agent-based architectures, the thoughtful approach to interpreter security, and the pragmatic handling of latency-quality trade-offs. The code-generation approach to agent planning is particularly interesting as it provides more structured, debuggable output compared to free-form reasoning approaches.
