## Overview
Dropbox's case study on their Dash AI assistant provides valuable insights into the evolution of production LLM systems from simple retrieval-augmented generation (RAG) pipelines to complex agentic systems. The company describes how Dash initially functioned as a traditional enterprise search system combining semantic and keyword search across indexed documents. However, as users began requesting more sophisticated interactions—moving from simple information retrieval queries like "what is the status of the identity project" to action-oriented requests like "open the editor and write an executive summary of the projects that I own"—Dropbox recognized the need to transform Dash into an agentic AI capable of planning, reasoning, and taking action.
This transformation introduced what Dropbox calls "context engineering," which they define as the process of structuring, filtering, and delivering the right context at the right time so the model can plan intelligently without being overwhelmed. The case study is particularly valuable because it candidly discusses the challenges and performance degradation they encountered as they scaled their system, and the pragmatic solutions they developed. While the post comes from Dropbox's engineering blog and naturally presents their solutions in a positive light, the technical details and problem descriptions align with challenges commonly reported in the LLM production community, lending credibility to their account.
## The Core Challenge: Analysis Paralysis in Agentic Systems
As Dropbox added new capabilities to Dash—including contextual search and assisted editing—they observed an unexpected pattern: more tools often meant slower and less accurate decision-making. In their architecture, a "tool" refers to any external function the model can call, such as search, lookup, or summarization operations. Each new capability expanded the model's decision space, creating more choices and opportunities for confusion. The problem manifested as the model spending excessive time deciding how to act rather than actually taking action—a phenomenon they describe as "analysis paralysis" in AI systems.
The company experimented with the Model Context Protocol (MCP), an open standard for defining and describing the tools a server provides. While MCP helped by standardizing how tools are described and what inputs they accept, Dropbox discovered practical limitations. Each tool definition, including its description and parameters, consumes tokens from the model's context window—the finite space available for the model to read and reason about information—so every additional tool directly impacts both cost and performance. More critically, they noticed that overall accuracy degraded for longer-running jobs, with tool calls adding substantial extra context that contributed to what has been termed "context rot" in the industry.
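To make the token cost concrete, the sketch below shows a hypothetical MCP-style tool definition (a name, description, and JSON Schema for inputs) and a rough character-based estimate of how much context it adds to every request. The connector name and the ~4-characters-per-token heuristic are illustrative assumptions, not figures from Dropbox's post.

```python
import json

# Hypothetical tool definition in the JSON Schema style MCP servers use.
# Every field here is serialized into the model's context on each request.
search_tool = {
    "name": "confluence_search",
    "description": (
        "Search Confluence pages by keyword. Returns the top matching pages "
        "with title, URL, last-modified date, and a short snippet."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Keyword query"},
            "space": {"type": "string", "description": "Optional space key"},
            "limit": {"type": "integer", "description": "Max results (1-50)"},
        },
        "required": ["query"],
    },
}

# Rough back-of-envelope estimate (~4 characters per token).
serialized = json.dumps(search_tool)
approx_tokens = len(serialized) // 4
print(f"~{approx_tokens} tokens for one tool definition")
# Multiply by dozens of connectors, each exposing several tools, and the
# definitions alone can crowd out a meaningful share of the context window.
```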
This experience led Dropbox to a fundamental realization: building effective agentic AI isn't simply about adding more capabilities; it's about helping the model focus on what matters most. Their solution centered on curating context through three core strategies that balance capability with clarity.
## Strategy One: Limiting Tool Definitions
Dropbox's first major insight was that giving the model too many tool options led to demonstrably worse results. Dash connects to numerous applications that customers use for work, and each application provides its own retrieval tools—search functions, find-by-ID operations, find-by-name lookups, and so forth. In principle, this meant that to service a single user request, Dash might need to consult Confluence for documentation, Google Docs for meeting notes, and Jira for project status, each through separate API calls.
They experimented with exposing these diverse retrieval tools directly to the model but found significant reliability issues. The model often needed to call multiple tools but didn't do so consistently or efficiently. Their solution was radical simplification: replacing all retrieval options with a single, purpose-built tool backed by what they call the "Dash universal search index." Instead of expecting the model to understand and choose between dozens of APIs, they created one unified interface that handles retrieval across all services.
The key principle here is consolidation over proliferation. By giving the model one consistent way to retrieve information, they made its reasoning clearer, its plans more efficient, and its context use more focused. This design philosophy also influenced their Dash MCP server implementation, which brings Dash's retrieval capabilities to MCP-compatible applications like Claude, Cursor, and Goose through just one tool. The server connects to the systems users already work with and securely searches across their applications while keeping tool descriptions lean so more of the context window remains focused on the user's actual request.
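A minimal sketch of this consolidation pattern is shown below. The `dash_search` tool name, the `UnifiedIndex` class, and the result fields are illustrative stand-ins rather than Dropbox's actual interfaces, but they capture the shift from many per-application tools to a single unified entry point.

```python
from dataclasses import dataclass


@dataclass
class Result:
    source: str   # e.g. "confluence", "jira", "gdocs"
    title: str
    snippet: str
    score: float


class UnifiedIndex:
    """Stand-in for a pre-built index spanning all connected applications."""

    def search(self, query: str, limit: int = 10) -> list[Result]:
        # A real implementation would query the unified index; stubbed here.
        return []


# The model sees exactly one retrieval tool, regardless of how many
# applications are connected behind the index.
UNIFIED_TOOL = {
    "name": "dash_search",
    "description": "Search all connected work applications in one call.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def handle_tool_call(index: UnifiedIndex, query: str) -> list[Result]:
    # One consistent entry point: the model never chooses between
    # per-application APIs, so its plans stay short and its context lean.
    return index.search(query)
```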
From an LLMOps perspective, this represents an important tradeoff decision. While exposing many specialized tools might seem to give the model more granular control, in practice it can degrade performance. The consolidation approach requires more sophisticated backend infrastructure—building and maintaining a unified index across multiple data sources—but simplifies the model's decision-making significantly. This is a valuable lesson for teams building production LLM systems: sometimes the best way to make your model smarter is to give it fewer, better-designed choices.
## Strategy Two: Filtering Context for Relevance
Dropbox's second key insight addresses what happens after retrieval: not everything retrieved from multiple APIs is actually useful for the task at hand. When they experimented with pulling data from several tools simultaneously, they still needed mechanisms to rank and filter results so only the most relevant information reached the model.
Their solution involved building the Dash index to unify data from multiple sources, then layering a knowledge graph on top to connect people, activity, and content across those sources. The knowledge graph maps the relationships between these elements so the system can understand how different pieces of information are connected, and those relationships enable ranking results based on what matters most for each specific query and each individual user. The result is that the model only sees content Dash's platform has already determined to be relevant, making every piece of context meaningful.
Importantly, they build the index and knowledge graph in advance rather than at query time. This architectural decision means Dash can focus on retrieval at runtime instead of rebuilding context dynamically, which makes the entire process faster and more efficient. From an LLMOps operational perspective, this represents a significant infrastructure investment—maintaining a continuously updated knowledge graph across multiple data sources requires substantial engineering—but the payoff comes in improved model performance and reduced inference costs.
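The sketch below illustrates the kind of relevance filtering this implies, assuming hypothetical ranking signals (semantic similarity, knowledge-graph affinity to the requesting user, recency) and a simple token budget. The weights and the ~4-characters-per-token estimate are placeholders for illustration, not details from the case study.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    semantic_score: float   # similarity between the query and the content
    graph_affinity: float   # e.g. authored by the user or shared with their team
    recency: float          # freshness signal in [0, 1]


def rank_and_trim(candidates: list[Candidate], token_budget: int) -> list[str]:
    """Rank retrieved candidates and keep only what fits the context budget."""
    scored = sorted(
        candidates,
        key=lambda c: 0.5 * c.semantic_score
        + 0.3 * c.graph_affinity
        + 0.2 * c.recency,
        reverse=True,
    )
    context, used = [], 0
    for c in scored:
        cost = len(c.text) // 4  # crude ~4 chars/token estimate
        if used + cost > token_budget:
            break
        context.append(c.text)
        used += cost
    # Only this trimmed, ranked slice reaches the model's context.
    return context
```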
The underlying principle is that everything retrieved shapes the model's reasoning, so relevance filtering is critical to guiding it efficiently. Sending only what's essential improves both performance and the quality of the entire agentic workflow. This strategy also implicitly addresses the context rot problem they mentioned earlier: by pre-filtering based on relevance rather than dumping all retrieved information into the context, they prevent the accumulation of marginally relevant information that can confuse the model over longer interactions.
## Strategy Three: Specialized Agents for Complex Tasks
The third discovery Dropbox made involves task complexity: some tools are so complex that the model needs extensive context and examples to use them effectively. They encountered this challenge as they continued expanding the Dash Search tool. Query construction turned out to be a difficult task requiring understanding user intent, mapping that intent to index fields, rewriting queries for better semantic matching, and handling edge cases like typos, synonyms, and implicit context.
As the search tool grew more capable, the model needed increasingly detailed instructions to use it correctly. These instructions began consuming a significant portion of the context window, leaving less room for reasoning about the overall task. In other words, the model was spending more attention on how to search than on what to do with the results—another form of the attention allocation problem that plagued their initial implementations.
Their solution was architectural: moving search into its own specialized agent. The main planning agent decides when a search is needed and delegates the actual query construction to a specialist with its own focused prompt. This separation keeps the planner focused on overall planning and execution while the search agent handles the specifics of retrieval with the detailed context it requires.
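A minimal sketch of this delegation pattern follows, assuming generic `LLM` and `Index` interfaces; the prompts and function names are illustrative, not Dropbox's implementation, but they show how the lengthy query-construction instructions stay confined to the specialist's context.

```python
from typing import Protocol


class LLM(Protocol):
    def complete(self, system: str, user: str) -> str: ...


class Index(Protocol):
    def search(self, query: str, limit: int) -> list[str]: ...


# The planner's prompt stays short and high-level.
PLANNER_PROMPT = (
    "You are a planning agent. Break the user's request into steps. "
    "When a step needs information, call delegate_search(goal) and use its results."
)

# The detailed query-construction instructions live only in the specialist's prompt.
SEARCH_AGENT_PROMPT = (
    "You are a search specialist. Rewrite the retrieval goal into an index query: "
    "map intent to index fields, expand synonyms, correct typos, and add implicit "
    "filters such as the requesting user or a time range."
)


def delegate_search(goal: str, llm: LLM, index: Index) -> list[str]:
    """The single entry point the planner sees; query-building details stay out of its context."""
    query = llm.complete(system=SEARCH_AGENT_PROMPT, user=goal)
    # Only the ranked results flow back to the planner.
    return index.search(query, limit=5)
```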
This multi-agent approach represents a sophisticated LLMOps pattern for managing complexity in production systems. Rather than trying to stuff all necessary instructions into a single mega-prompt, they've effectively decomposed the problem into specialized components. The main agent maintains a simpler, clearer prompt focused on high-level planning, while specialized agents handle complex subtasks with their own tailored prompts and context. This architecture likely also provides better maintainability and debuggability—when search behavior needs adjustment, they can modify the search agent's prompt without touching the planning agent's logic.
The key lesson they draw is that when a tool demands too much explanation or context to be used effectively, it's often better to turn it into a dedicated agent with a focused prompt. This principle could extend to other complex capabilities beyond search—any sufficiently complex operation that requires extensive instruction might benefit from agent specialization.
## Production LLMOps Considerations
Several aspects of this case study deserve critical examination from an LLMOps perspective. First, while Dropbox presents impressive solutions, the text doesn't quantify the improvements achieved. They mention that "more tools often meant slower, less accurate decision making" and that "overall accuracy of Dash degraded for longer-running jobs," but don't provide specific metrics on accuracy improvements, latency reductions, or cost savings from their optimizations. This is understandable for a public blog post where companies often avoid disclosing detailed performance metrics, but it means we should view the claims as qualitative observations rather than rigorously validated improvements.
Second, the infrastructure investments required for their solutions are substantial. Building and maintaining a universal search index across multiple enterprise applications, constructing and updating a knowledge graph of relationships, and orchestrating multiple specialized agents all require significant engineering resources. While these investments clearly made sense for Dropbox given Dash's importance to their product strategy, smaller teams or organizations with different constraints might not be able to replicate this approach. The case study would benefit from discussing the resource tradeoffs more explicitly.
Third, the text briefly mentions experiments with code-based tools and references a previous blog post on the topic, noting that other companies are approaching similar problems. This suggests the field of agentic AI context management is still rapidly evolving, and Dropbox's current solutions may continue to change. They acknowledge this explicitly in the "Looking forward" section, noting that "context engineering for agentic AI systems is still an emerging discipline" and that they're "continuing to learn and iterate."
The discussion of the Model Context Protocol (MCP) is particularly interesting from a standards perspective. Dropbox experimented with MCP but found limitations related to token consumption from tool definitions. They conclude that "MCP continues to serve as a robust protocol, but effective scaling depends on reducing tool proliferation, investing in specialized agents, and enabling the LLM to generate code-based tools when appropriate." This suggests that while standards like MCP provide valuable structure, production systems still require careful architectural decisions that go beyond simply adopting a standard protocol.
## Context Management as a Core LLMOps Discipline
Perhaps the most valuable contribution of this case study is its framing of context management as a distinct engineering discipline for production LLM systems. Dropbox explicitly states their guiding principle: "better context leads to better outcomes. It's about giving the model the right information, at the right time, in the right form." This moves beyond the common focus on prompt engineering—which typically deals with how to phrase instructions—to address the broader question of what information should be included in the model's context at all.
Their experience demonstrates that context is expensive in multiple dimensions: it affects cost through token consumption, speed through increased processing requirements, and quality through the amount of attention the model can allocate to the actual task. They found that "leaner contexts don't just save resources; they also make the model smarter"—a counterintuitive insight that contradicts the common assumption that more context always helps.
Looking forward, Dropbox indicates they're applying these lessons to other parts of Dash's context management, including user and company profiles, as well as long-term and short-term memory. They believe "there's even more performance to unlock by refining these areas, especially as we experiment with smaller and faster models." This suggests their context engineering approach may become even more critical as they explore model optimization strategies where context window efficiency matters even more.
The company also notes that while their discussion centered on retrieval-based tools, action-oriented tools exhibit many of the same limitations. They mention that effective scaling depends on "reducing tool proliferation, investing in specialized agents, and enabling the LLM to generate code-based tools when appropriate"—an approach that parallels their consolidation of retrieval tools into the unified Dash retrieval system.
## Critical Assessment and Broader Implications
From a balanced perspective, this case study represents a mature, thoughtful approach to production LLM engineering that goes beyond basic RAG implementations. Dropbox has clearly invested significantly in infrastructure and experimentation, and their willingness to discuss problems like analysis paralysis and context rot adds credibility to their account. The solutions they describe—tool consolidation, relevance filtering, and specialized agents—represent reasonable engineering tradeoffs for a system at their scale.
However, readers should recognize that these solutions emerged from Dropbox's specific context: a company with substantial engineering resources building an enterprise AI assistant that integrates with numerous third-party applications. The universal search index and knowledge graph approach requires infrastructure that many organizations may not have the resources to build. Additionally, the effectiveness of their solutions likely depends on factors they don't discuss in detail, such as the quality of their knowledge graph construction, the effectiveness of their relevance ranking algorithms, and the specific prompts they use for specialized agents.
The case study would be stronger with quantitative metrics demonstrating improvement, discussion of failure modes and limitations of their approach, and more detail on the resource investments required. The company's acknowledgment that context engineering remains an "emerging discipline" and their ongoing experimentation suggest that these solutions should be viewed as works in progress rather than final answers.
Nevertheless, the core insights—that more tools can degrade rather than improve agentic AI performance, that relevance filtering is critical for retrieval quality, and that complex capabilities may benefit from specialized agents—represent valuable lessons for the LLMOps community. These patterns are likely to be relevant across a range of production LLM applications, even if the specific implementations differ based on organizational constraints and use cases.
The emphasis on context engineering as a distinct discipline, separate from but complementary to prompt engineering, represents an important conceptual contribution. As organizations move from simple LLM applications to more complex agentic systems, the challenges Dropbox describes around context management, tool proliferation, and attention allocation will become increasingly common. Their solutions provide one set of patterns for addressing these challenges, and the broader community will benefit from seeing how different organizations tackle similar problems with different approaches and constraints.