Building a Universal Search Product with RAG and AI Agents

Dropbox 2025
Dropbox developed Dash, a universal search and knowledge management product that addresses the challenges of fragmented business data across multiple applications and formats. The solution combines retrieval-augmented generation (RAG) and AI agents to provide powerful search capabilities, content summarization, and question-answering features. They implemented a custom Python interpreter for AI agents and developed a sophisticated RAG system that balances latency, quality, and data freshness requirements for enterprise use.

Industry

Tech

Overview

Dropbox Dash represents a significant production deployment of LLM-powered capabilities for enterprise knowledge management. The product aims to solve a common business problem: knowledge workers spend excessive time searching for information scattered across multiple applications, formats, and data modalities. Dash is positioned as a “universal search” product that combines AI-powered search with granular access controls, summarization, question-answering, and draft generation capabilities.

The case study provides valuable insights into the engineering decisions and trade-offs involved in building an enterprise-grade LLM application, particularly around retrieval-augmented generation (RAG) and AI agent architectures. While the article comes from Dropbox’s engineering blog and naturally presents their work favorably, it does offer substantive technical details about their approach.

The Core Problem

The challenges Dropbox identified for enterprise AI are threefold:

These challenges are genuine concerns for any enterprise search or knowledge management system, and they directly impact the design of both the retrieval and generation components of a RAG system.

RAG Implementation Details

Retrieval System Architecture

Dropbox made a deliberate architectural choice for their retrieval system that diverges from the common approach of using purely vector-based semantic search. Their system combines:

The article candidly discusses the trade-offs they considered:

Their choice of traditional IR with on-the-fly chunking and reranking is interesting: it suggests that pure vector search could not meet their latency requirements while maintaining answer quality. This is a practical constraint that many production systems face.
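A minimal sketch of that two-stage retrieval shape, assuming nothing about Dropbox's actual stack: the lexical scorer below is a toy stand-in for a real IR engine (e.g. BM25), and the chunking and reranking steps happen at query time, only over documents the first stage retrieved.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float

def lexical_score(query: str, text: str) -> float:
    # Toy stand-in for a traditional IR scorer such as BM25.
    terms = text.lower().split()
    return sum(terms.count(q) for q in set(query.lower().split())) / (len(terms) or 1)

def chunk_text(text: str, size: int = 40) -> list[str]:
    # On-the-fly chunking: documents are split only after retrieval, at query time.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, docs: dict[str, str],
             top_docs: int = 3, top_chunks: int = 2) -> list[Chunk]:
    # Stage 1: rank whole documents with the traditional IR scorer.
    ranked = sorted(docs, key=lambda d: lexical_score(query, docs[d]),
                    reverse=True)[:top_docs]
    # Stage 2: chunk only the retrieved documents, then rerank the chunks.
    chunks = [Chunk(d, c, lexical_score(query, c))
              for d in ranked for c in chunk_text(docs[d])]
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:top_chunks]
```

Deferring chunking until after document-level retrieval is one way to keep the index small and the freshest document text in play, at the cost of some per-query compute.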

Model Selection and Evaluation

Dropbox conducted rigorous evaluation of their RAG system using several public benchmark datasets:

Their evaluation metrics included:

The use of LLM-based evaluation judges is now common practice in production LLM systems, though the article does not discuss potential issues with judge reliability or calibration. The system is described as model-agnostic, allowing flexibility in LLM selection and adaptation to rapid developments in the field.
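The general shape of an LLM-judge harness can be sketched as follows; the prompt and scoring scale here are hypothetical, not Dropbox's, and `call_llm` is any text-in/text-out completion function, which is what keeps the harness model-agnostic.

```python
import json

# Hypothetical judge prompt; the article does not publish the actual prompts.
JUDGE_PROMPT = (
    "Rate how faithful the answer is to the context on a 1-5 scale. "
    'Reply as JSON: {{"score": <int>, "reason": "<why>"}}\n'
    "Context: {context}\nQuestion: {question}\nAnswer: {answer}"
)

def judge(question: str, context: str, answer: str, call_llm) -> int:
    # call_llm is a plain completion function, swappable per model.
    verdict = json.loads(call_llm(JUDGE_PROMPT.format(
        context=context, question=question, answer=answer)))
    return verdict["score"]

def pass_rate(dataset, answer_fn, call_llm, threshold: int = 4) -> float:
    # End-to-end metric: share of examples the judge scores at or above threshold.
    scores = [judge(q, c, answer_fn(q, c), call_llm) for q, c in dataset]
    return sum(s >= threshold for s in scores) / len(scores)
```

Scoring the final answer rather than individual components is one concrete way to implement the end-to-end quality measurement the article emphasizes.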

AI Agent Architecture

For complex, multi-step tasks that RAG alone cannot handle, Dropbox developed an AI agent system. Their definition of AI agents focuses on “multi-step orchestration systems that can dynamically break down user queries into individual steps.”

Two-Stage Approach

Stage 1 - Planning: The LLM breaks down a user query into a sequence of high-level steps, expressed as code statements in a custom domain-specific language (DSL) that resembles Python. This approach forces the LLM to express its reasoning as structured, executable code rather than free-form text.

Stage 2 - Execution: The generated code is validated through static analysis and then executed. If the LLM references functionality that doesn’t exist, a second LLM call is used to implement the missing code. This two-stage approach allows the agent to maintain clarity in its overall plan while being adaptable to new query types.
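The plan-then-execute loop described above can be sketched roughly as below. Everything here is an assumption-laden simplification: `plan_llm` and `implement_llm` stand in for the two LLM calls, the static-analysis pass only looks for undefined function calls, and Python's `exec` stands in for Dropbox's custom interpreter.

```python
import ast

def missing_calls(plan_src: str, available: set[str]) -> list[str]:
    # Static-analysis pass: collect calls to functions the runtime does not provide.
    return [node.func.id for node in ast.walk(ast.parse(plan_src))
            if isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id not in available]

def run_agent(query: str, plan_llm, implement_llm, tools: dict):
    # Stage 1: planning -- the LLM emits its plan as executable statements.
    plan = plan_llm(query)
    # Stage 2: execution -- a second LLM call implements anything the plan
    # references that does not yet exist, then the plan runs.
    for name in missing_calls(plan, set(tools)):
        tools[name] = implement_llm(name)
    env = dict(tools)
    exec(plan, {"__builtins__": {}}, env)  # stand-in for the custom interpreter
    return env.get("result")
```

Keeping the plan as code means the fallback path is precise: the validator names exactly which functionality is missing, so the second LLM call has a narrow, well-defined job.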

Custom Interpreter and Security

A notable aspect of their implementation is the development of a custom Python interpreter built from scratch specifically for executing LLM-generated code. This interpreter includes:

The decision to build a minimal interpreter rather than using the full Python runtime is explicitly security-motivated. By implementing only the required functionality, they avoid inheriting security vulnerabilities present in full-featured interpreters.
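One way to picture "implementing only the required functionality" is an AST allowlist that rejects any construct the plan language does not need, before anything runs. This is a sketch of the idea, not Dropbox's interpreter; the allowed node set here is illustrative.

```python
import ast

# Allowlist of AST node types needed to express simple agent plans; anything
# outside it (imports, attribute access, loops, lambdas, ...) is rejected
# before execution, shrinking the attack surface by construction.
ALLOWED = (ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
           ast.Call, ast.Constant, ast.keyword, ast.List, ast.Tuple)

def check_program(src: str) -> None:
    for node in ast.walk(ast.parse(src)):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"disallowed construct: {type(node).__name__}")
```

The security property is inverted relative to sandboxing a full runtime: instead of enumerating what LLM-generated code must not do, the runtime enumerates the little it can do.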

Testing and Debugging Benefits

The code-based approach to agent planning offers several operational advantages:

This approach represents a thoughtful solution to the challenge of testing LLM-based systems, where output variability across model versions typically makes traditional testing difficult.

LLMOps Considerations

Several LLMOps lessons emerge from this case study:

Trade-off Management

The article is refreshingly honest about the trade-offs involved in production LLM systems. They explicitly discuss that larger models provide more precise results but introduce latency that may not meet user expectations. The 2-second latency target for 95% of queries represents a concrete SLA that drove many of their architectural decisions.
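A "2 seconds for 95% of queries" target is just a p95 check over observed latencies; a minimal nearest-rank version, with the 2000 ms target as a parameter:

```python
import math

def p95_ms(latencies_ms: list[float]) -> float:
    # Nearest-rank p95: the latency at or below which 95% of queries complete.
    ordered = sorted(latencies_ms)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def meets_sla(latencies_ms: list[float], target_ms: float = 2000) -> bool:
    return p95_ms(latencies_ms) <= target_ms
```

A tail-percentile SLA like this is stricter than an average-latency target: a few very slow queries can sink it even when the mean looks healthy, which is exactly why it pushes architectural decisions toward the fast path.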

Model Agnosticism

Their decision to build a model-agnostic system is a practical LLMOps consideration. It allows them to swap models as the field evolves rapidly, and potentially offer customers choice in which models are used. However, they also note that “the same prompts can’t be used for different LLMs,” meaning model agnosticism comes with the cost of maintaining multiple prompt variants.
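The maintenance cost of per-model prompts usually ends up as some form of prompt registry keyed by model. The registry below is entirely hypothetical (the model names and templates are made up) but shows the shape of the trade-off: one task, several variants to keep in sync.

```python
# Hypothetical registry: one prompt variant per (task, model) pair, since the
# article notes that the same prompts can't be reused across different LLMs.
PROMPTS = {
    ("summarize", "model-a"): "Summarize the following document.\n\n{doc}",
    ("summarize", "model-b"): "<task>summarize</task>\n<doc>{doc}</doc>",
}

def render(task: str, model: str, **fields) -> str:
    if (task, model) not in PROMPTS:
        raise KeyError(f"no prompt variant for {task!r} on {model!r}")
    return PROMPTS[(task, model)].format(**fields)
```

Swapping a model then means adding a row and re-running evaluations, rather than rewriting application code.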

Evaluation Strategy

The combination of public benchmarks and custom metrics using LLM judges represents a pragmatic approach to evaluation. Their emphasis on end-to-end quality measurement acknowledges that component-level metrics may not capture the true user experience.

Security Architecture

The custom interpreter approach demonstrates how security considerations can drive architectural decisions in LLMOps. Rather than retrofitting security onto an existing execution environment, they built a minimal runtime that limits attack surface by design.

Future Directions

Dropbox outlines several future directions including multi-turn conversations, self-reflective agents that evaluate their own performance, continuous fine-tuning for specific business needs, and multi-language support. These represent common evolution paths for production LLM systems moving from initial deployment toward more sophisticated capabilities.

Balanced Assessment

While the case study provides valuable technical insights, it’s worth noting some limitations:

Overall, the Dropbox Dash case study offers a solid example of production LLMOps practices, particularly around the integration of RAG with agent-based architectures, the thoughtful approach to interpreter security, and the pragmatic handling of latency-quality trade-offs. The code-generation approach to agent planning is particularly interesting as it provides more structured, debuggable output compared to free-form reasoning approaches.
