Company
DoorDash
Title
Building a Collaborative Multi-Agent AI Ecosystem for Enterprise Knowledge Access
Industry
Tech
Year
2025
Summary (short)
DoorDash developed an internal agentic AI platform to address the challenge of fragmented knowledge spread across experimentation platforms, metrics hubs, dashboards, wikis, and team communications. The solution evolved from deterministic workflows through single agents to hierarchical deep agents and exploratory agent swarms, built on foundational capabilities including hybrid vector search with RRF-based re-ranking, schema-aware SQL generation with pre-cached examples, multi-stage zero-data query validation, and LLM-as-judge evaluation frameworks. The platform integrates with Slack and Cursor to meet users in their existing workflows, enabling business teams and developers to access complex data and insights without context-switching, democratizing data access across the organization while maintaining rigorous guardrails and provenance tracking.
DoorDash has built a sophisticated internal agentic AI platform that serves as a unified cognitive layer over the company's distributed knowledge infrastructure. The platform addresses a fundamental challenge in large organizations: critical business knowledge is scattered across experimentation platforms, metrics repositories, dashboards, wikis, and institutional wisdom embedded in team communications. Historically, answering complex business questions required significant context-switching between searching wikis, querying Slack, writing SQL, and filing Jira tickets. The platform aims to consolidate this fragmented knowledge landscape into a cohesive, accessible system that can be queried naturally.

## Evolutionary Architecture Approach

One of the most thoughtful aspects of DoorDash's approach is their explicit recognition that sophisticated multi-agent systems cannot be built overnight. They advocate for an evolutionary path with four distinct stages, each building upon the previous one while introducing new levels of autonomy and intelligence.

**Workflows as Foundation**: The first stage involves deterministic workflows represented as directed acyclic graphs. These are sequential, pre-defined pipelines optimized for single, repeatable purposes. DoorDash uses workflows for high-stakes tasks where consistency and governance are paramount, such as automating Finance and Strategy internal reporting. A typical workflow might pull data from Google Docs, Google Sheets, Snowflake queries, and Slack threads to generate recurring reports on business operations, year-over-year trends, and daily growth metrics. These characteristics—reliability, speed, and auditability—make workflows the system of record for critical routine operations. DoorDash explicitly notes that while self-service tools exist, they can be sub-optimal because they assume users have the technical skills and domain knowledge to query and interpret data correctly, which creates skillset gaps and risks of misinterpretation.

**Single Agents with Dynamic Reasoning**: The second stage introduces agents that use LLM-driven policies to make dynamic decisions about which tools to call and what actions to take next. These agents leverage the ReAct cognitive architecture, which implements a think-act-observe loop where the LLM externalizes its reasoning. DoorDash notes an interesting evolution here: early agents used external "scratchpads" for chain-of-thought reasoning, but this capability is increasingly being integrated directly into model training pipelines during post-training fine-tuning. A practical example they provide involves investigating a drop in Midwest conversions: the agent would query a metrics glossary to define conversions, identify Midwest states, formulate precise data warehouse queries, and then hypothesize causes by querying experimentation platforms, incident logs, and marketing calendars. The primary limitation of single agents is context pollution: as they perform more steps, their context window fills with intermediate thoughts, degrading reasoning quality, increasing costs, and limiting their ability to handle long-running tasks.
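As a concrete illustration of the think-act-observe pattern, here is a minimal sketch of a ReAct-style loop. The `call_llm` stub, the tool registry, and the prompt format are illustrative assumptions, not DoorDash's implementation.

```python
# Minimal ReAct-style think-act-observe loop (illustrative sketch only).
# `call_llm` and the tool registry are hypothetical stand-ins.
from typing import Callable

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to whichever LLM backs the agent."""
    raise NotImplementedError

TOOLS: dict[str, Callable[[str], str]] = {
    "metrics_glossary": lambda q: "conversion = orders / sessions",  # stubbed tools
    "warehouse_query": lambda sql: "rows: ...",
}

def react_agent(question: str, max_steps: int = 8) -> str:
    scratchpad: list[str] = []  # externalized Thought / Action / Observation history
    for _ in range(max_steps):
        prompt = (
            f"Question: {question}\n"
            + "\n".join(scratchpad)
            + "\nRespond with either 'Action: <tool>: <input>' or 'Final Answer: <answer>'."
        )
        step = call_llm(prompt)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            _, tool_name, tool_input = [s.strip() for s in step.split(":", 2)]
            observation = TOOLS.get(tool_name, lambda _: "unknown tool")(tool_input)
            scratchpad += [step, f"Observation: {observation}"]
        else:
            scratchpad.append(step)  # treat anything else as a Thought
    return "Stopped: step budget exhausted"  # guard against runaway loops
```

The explicit step budget reflects the context-pollution limitation noted above: every iteration appends to the scratchpad, so long-running tasks quickly exhaust the context window.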
**Deep Agents for Hierarchical Decomposition**: To overcome single-agent limitations, DoorDash implements "deep agents"—a collaborative cognitive architecture involving multiple agents organized hierarchically. The core principle is specialization and delegation. While planner-worker models are common, DoorDash describes more sophisticated three-tiered systems: a manager agent decomposes complex requests into subtasks, a progress agent tracks completion and dependencies, and specialist decision agents execute individual actions. Some implementations also include a reflection agent that reviews outcomes and provides error feedback for dynamic plan adjustment. A critical enabling component is a persistent workspace or shared memory layer that acts as more than just a virtual file system—it allows one agent to create artifacts (datasets, code) that other agents can access hours or days later, enabling collaboration on problems too large for any single agent's context window.

**Agent Swarms as the Frontier**: At the most advanced level, DoorDash is exploring agent swarms—dynamic networks of peer agents collaborating asynchronously without centralized control. Unlike hierarchical systems, swarms operate through distributed intelligence: no single agent has a complete picture, but coherent solutions emerge through local interactions. DoorDash compares this to ant colonies rather than corporate org charts. The primary challenges are governance and explainability, since emergent behavior makes it difficult to trace exact decision paths. DoorDash's research indicates that true swarm behavior requires a robust agent-to-agent (A2A) protocol that handles agent discovery, asynchronous state management, and lifecycle events beyond simple messaging.

## Platform Architecture and Core Capabilities

The platform's sophistication lies in the foundational infrastructure that enables all of these agentic patterns.

**Hybrid Search with Advanced Re-ranking**: At the core is a high-performance, multi-stage search engine built on vector databases. The algorithm combines traditional BM25 keyword search with dense semantic search, followed by a re-ranker based on reciprocal rank fusion (RRF). This hybrid approach is crucial because critical information at DoorDash spans wikis, experimentation results, and thousands of dashboards. The search engine forms the foundation for all retrieval-augmented generation (RAG) functionality, ensuring agents can quickly ground their reasoning in accurate contextual information.
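Reciprocal rank fusion is a score-free way to merge the BM25 and dense-retrieval result lists. The short sketch below uses a smoothing constant of k=60, the value commonly used in the RRF literature rather than anything DoorDash specifies, and stands in for their production re-ranker only in spirit.

```python
# Reciprocal rank fusion (RRF) over two ranked result lists: an illustrative
# sketch of the general technique, not DoorDash's production re-ranker.
from collections import defaultdict

def rrf_fuse(bm25_ranked: list[str], dense_ranked: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of document IDs by summing 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in (bm25_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example with hypothetical document IDs: an item ranked moderately in both
# lists can outrank items that appear in only one of them.
bm25 = ["wiki/conversion-definition", "dash/midwest-cx", "jira/INC-123"]
dense = ["dash/midwest-cx", "experiment/cx-holdout", "wiki/conversion-definition"]
print(rrf_fuse(bm25, dense)[:3])
```

Because RRF operates on ranks alone, it requires no score normalization between the keyword and semantic retrievers, which is one reason it is a popular fusion choice.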
**Schema-Aware SQL Generation**: DoorDash has developed what they call their "secret sauce" for accurate SQL generation through a combination of techniques. The process begins with identifying appropriate data sources using RRF-based hybrid search with custom lemmatization fine-tuned for table names. Once tables are identified, agents use a custom DescribeTable AI tool with pre-cached examples. This tool provides agents with compact, engine-agnostic column definitions enriched with example values for each column, stored in an in-memory cache. This pre-caching significantly improves filtering accuracy for dimensional attributes like countries and product types by giving agents concrete examples for WHERE clauses.

**Zero-Data Statistical Query Validation**: Trust is maintained through a rigorous multi-stage validation process that DoorDash calls "Zero-Data Statistical Query Validation and Autocorrection." This includes automated linting for code style and markdown enforcement, but its core is EXPLAIN-based checking of query correctness and performance against engines like Snowflake and Trino. For deeper validation, the system can check statistical metadata about query results—such as row counts or mean values—to proactively identify issues like empty result sets or zero-value columns, all without exposing sensitive data to the AI model. If issues are found, the agent autonomously uses the feedback to correct its query. The system also learns from negative user feedback, allowing agents to modify responses and improve over time.
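The validate-and-autocorrect flow described above can be sketched as a small loop. The `run_query` and `generate` callables, the specific checks, and the retry budget are assumptions made for illustration; DoorDash has not published its implementation.

```python
# Illustrative validate-and-autocorrect loop for generated SQL. The callables,
# checks, and retry policy are assumptions, not DoorDash's actual code.
from typing import Callable

def validate_sql(run_query: Callable[[str], list[tuple]], sql: str) -> list[str]:
    issues: list[str] = []
    # Stage 1: EXPLAIN catches syntax errors and pathological plans without
    # reading any rows (supported by engines such as Snowflake and Trino).
    try:
        run_query(f"EXPLAIN {sql}")
    except Exception as exc:
        return [f"EXPLAIN failed: {exc}"]
    # Stage 2: statistical metadata check. Only aggregate counts are inspected,
    # so no row-level data is ever shown to the model ("zero-data").
    (row_count,), = run_query(f"SELECT COUNT(*) FROM ({sql}) AS q")
    if row_count == 0:
        issues.append("query returned an empty result set")
    return issues

def generate_validated_sql(generate: Callable[..., str],
                           run_query: Callable[[str], list[tuple]],
                           question: str,
                           max_attempts: int = 3) -> str:
    sql = generate(question)
    for _ in range(max_attempts):
        issues = validate_sql(run_query, sql)
        if not issues:
            return sql
        # Feed structured issues back so the agent can self-correct its query.
        sql = generate(question, feedback=issues)
    raise RuntimeError("no valid query within the attempt budget")
```

Because only EXPLAIN plans and aggregate statistics are inspected, the model never sees row-level data, which is the "zero-data" property DoorDash emphasizes.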
**LLM-as-Judge Evaluation Framework**: DoorDash built an automated evaluation framework that systematically runs predefined question-and-answer scenarios against agents. An LLM judge grades each response for accuracy with detailed rationale. They also leverage open-source frameworks like DeepEval to measure nuanced metrics including faithfulness and contextual relevance. Results are automatically compiled into reports, providing scalable benchmarking, regression detection, and accelerated iteration. DoorDash considers this continuous automated oversight non-negotiable for deploying AI into critical business functions.

**Integration Strategy**: A key insight is that platform power means nothing without accessibility. While a conversational web UI provides a central hub for discovering agents and reviewing chat history, the real productivity gains come from integrations with Slack and the Cursor IDE. By allowing users to invoke agents directly within their existing workflows—analysts pulling data into Slack conversations or engineers generating code without leaving their editor—DoorDash eliminates context-switching friction and makes the agentic platform a natural extension of daily work.

## Key LLMOps Lessons and Operational Principles

DoorDash emphasizes several critical lessons learned during their journey.

**Building on solid foundations is paramount.** It is tempting to jump to advanced multi-agent designs, but these systems only amplify inconsistencies in underlying components. By first creating robust single-agent primitives (schema-aware SQL generation, multi-stage document retrieval), DoorDash ensures that its multi-agent systems are trustworthy. This also means **using the right tool for the job**: deterministic workflows for certified tasks where reliability is paramount, and dynamic deep-agent capabilities for exploratory work where the path is uncertain.

**Guardrails and provenance are non-negotiable features.** Trust is earned through transparency and reliability. DoorDash implements multi-layered guardrails: common guardrails that apply across the platform (such as EXPLAIN-based validation for all generated SQL), guardrails for LLM behavior correction that ensure outputs adhere to company policy, and custom agent-specific guardrails (for example, preventing a Jira agent from closing tickets in specific projects). Every action is logged with full provenance, allowing users to trace answers back to source queries, documents, and agent interactions. This makes the system both auditable and easier to debug.

**Memory and context management are product choices, not just technical ones.** Persisting every intermediate step can bloat context, reduce accuracy, and increase costs. DoorDash is deliberate about what state is passed between agents, often sharing only final artifacts rather than full conversational history. To keep latency and costs predictable, they budget the loop by enforcing strict step and time limits and by implementing circuit breakers that prevent agentic plans from thrashing.

## Technical Implementation Details

At the implementation level, DoorDash visualizes the architecture as a computational graph, using frameworks like LangGraph to decompose complex architectures into executable nodes with defined transitions. The resulting execution graph resembles a finite state machine, with states representing task steps and transition rules governing movement between them.

The system is designed around **open standards**. Model Context Protocol (MCP) standardizes how agents access tools and data, forming the bedrock of single-agent capabilities and ensuring secure, auditable interactions with internal knowledge bases and operational tools. DoorDash is exploring the agent-to-agent (A2A) protocol to standardize inter-agent communication, which they see as key to unlocking deep agents and swarms at scale.
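To make the computational-graph framing concrete, below is a minimal plan/execute/reflect loop expressed with LangGraph's StateGraph API. The state fields, node bodies, and routing logic are assumptions chosen for illustration rather than DoorDash's actual graph.

```python
# Minimal sketch of a plan -> execute -> reflect graph in LangGraph. Node
# bodies are stubbed; the state schema and routing are illustrative only.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    plan: list[str]
    results: list[str]
    done: bool

def plan_node(state: AgentState) -> dict:
    # A manager-style node would call an LLM to decompose the request.
    return {"plan": [f"answer: {state['question']}"], "done": False}

def execute_node(state: AgentState) -> dict:
    # Specialist nodes would call tools (SQL, search) for the next subtask.
    step = state["plan"][len(state["results"])]
    return {"results": state["results"] + [f"completed {step}"]}

def reflect_node(state: AgentState) -> dict:
    # A reflection node checks progress and decides whether to stop.
    return {"done": len(state["results"]) >= len(state["plan"])}

graph = StateGraph(AgentState)
graph.add_node("plan", plan_node)
graph.add_node("execute", execute_node)
graph.add_node("reflect", reflect_node)
graph.set_entry_point("plan")
graph.add_edge("plan", "execute")
graph.add_edge("execute", "reflect")
graph.add_conditional_edges(
    "reflect",
    lambda s: "end" if s["done"] else "continue",
    {"continue": "execute", "end": END},
)

app = graph.compile()
print(app.invoke({"question": "Why did Midwest conversions drop?",
                  "plan": [], "results": [], "done": False}))
```

The conditional edge out of the reflection node is what turns an otherwise DAG-like pipeline into a finite state machine with a loop, mirroring the reflect-and-adjust behavior described above.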
## Deployment Phases and Current Status

DoorDash's journey follows a phased approach. Phase 1 (launched) established the agentic platform foundation and marketplace with core single-agent primitives. Phase 2 (in preview) is rolling out the marketplace and implementing their first deep-agent systems for complex analyses, which they call the "AI Network." Phase 3 (exploration) involves integrating A2A protocol support to enable asynchronous tasks and dynamic swarm collaboration.

## Balanced Assessment

While DoorDash's technical implementation is impressive and their architectural thinking is sophisticated, several considerations warrant attention. The case study is written by the engineering team building the platform, which naturally emphasizes successes and capabilities. Actual adoption rates, user satisfaction metrics, and comparisons of agent-generated insights against traditional analyst workflows are not provided. The claim that the approach "democratizes data access" should be validated against actual usage patterns: do non-technical users genuinely find the system accessible, or is adoption concentrated among already-technical staff?

The evolutionary approach DoorDash advocates is sensible and well articulated, but the case study doesn't discuss failure modes, rollback strategies, or situations where simpler solutions (such as improved documentation or better self-service tooling) might have been more effective. The exploration of agent swarms is positioned as frontier research, but the practical business value, and the point at which such complexity is justified rather than over-engineering, isn't addressed. The reliance on open standards like MCP and A2A is forward-thinking, but these standards are still evolving, so DoorDash's investment in them carries adoption risk. The zero-data statistical query validation is innovative, but the case study doesn't discuss false positive rates or how the system handles edge cases where statistical checks might be misleading.

The integration strategy with Slack and Cursor is pragmatic and shows good product thinking about meeting users where they work. However, the security and governance implications of allowing AI agents to operate within production communication channels and development environments deserve more attention than they receive.

Overall, DoorDash presents a thoughtful, technically sophisticated approach to building production agentic AI systems with a strong emphasis on foundations, guardrails, and evolutionary development. The case study serves as a valuable reference architecture for organizations considering similar systems, though readers should view specific capability claims with appropriate skepticism and focus on the underlying architectural principles and operational lessons, which likely have broader applicability.
