## Overview and Business Context
DoorDash's case study presents a comprehensive example of deploying LLMs in a production support environment where accuracy, reliability, and quality control are paramount. The company sought to improve support for Dashers—the independent contractors who deliver orders from merchants to customers. Although the delivery process appears straightforward, it involves numerous complexities that can require support intervention, particularly for new Dashers. The existing automated support system relied on flow-based resolutions with pre-built paths, which meant only a small subset of issues could be resolved quickly through automation. DoorDash also maintained a knowledge base of support articles, but three critical problems limited the articles' effectiveness: relevant articles were hard to find, extracting information from lengthy articles was time-consuming, and the articles were available only in English despite many Dashers preferring other languages. This created an ideal use case for a RAG system that could retrieve information from the knowledge base and generate tailored responses to resolve Dasher issues efficiently.
## Core Technical Architecture: RAG Implementation
The RAG system implementation follows a multi-stage process designed to handle conversational context and retrieve relevant information. When a Dasher presents an issue to the chatbot, that issue typically spans multiple messages and follow-up questions, so the system begins by condensing the entire conversation into a summary that identifies the core problem. This conversation summarization step is critical: the system must accurately track the Dasher's evolving issue as the dialogue progresses, and the quality of the summary directly affects the retrieval system's ability to find relevant information.
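As a rough illustration of this step, the sketch below condenses a multi-turn conversation into a single issue statement. The `call_llm` wrapper, prompt wording, and conversation format are assumptions for the sketch, not DoorDash's actual implementation.

```python
# Illustrative sketch only: condensing a multi-turn Dasher conversation into a
# single issue summary before retrieval. `call_llm` is a hypothetical wrapper
# around whichever chat-completion API is in use.

def summarize_issue(conversation: list[dict], call_llm) -> str:
    """Collapse a multi-turn conversation into one concise issue statement."""
    transcript = "\n".join(f"{turn['role']}: {turn['text']}" for turn in conversation)
    prompt = (
        "Summarize the Dasher's current issue in one or two sentences, "
        "giving the most recent messages the most weight:\n\n" + transcript
    )
    return call_llm(prompt)

# Example usage with a stubbed LLM call:
conversation = [
    {"role": "dasher", "text": "The store is closed but the app won't let me unassign."},
    {"role": "bot", "text": "Have you tried reporting the store as closed?"},
    {"role": "dasher", "text": "Yes, but it keeps asking me to wait."},
]
summary = summarize_issue(
    conversation,
    call_llm=lambda p: "Dasher cannot unassign from an order at a closed store.",
)
```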
Using the distilled issue summary, the system searches historical data to find the top N most similar cases that were previously resolved using knowledge base articles. Each identified similar case corresponds to specific KB articles that proved helpful for that issue type. These relevant articles are then integrated into the prompt template along with the conversation context and issue summary. This enriched template allows the LLM to generate tailored responses that leverage both the conversational context and the retrieved knowledge base content to provide precise and informed support to Dashers.
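A minimal sketch of the retrieval and prompt-assembly steps follows, assuming a hypothetical index of previously resolved cases with precomputed embedding vectors. The scoring, data shapes, and template wording are illustrative, not DoorDash's production code.

```python
# Hypothetical sketch of retrieval against historical resolved cases plus
# prompt assembly. Each case record maps to the KB articles that resolved it.
import numpy as np

def top_n_similar_cases(summary_vec, case_index, n=3):
    """Rank historical resolved cases by cosine similarity to the issue summary."""
    scored = []
    for case in case_index:  # each case: {"vector": [...], "kb_articles": [...]}
        v = np.asarray(case["vector"], dtype=float)
        s = np.asarray(summary_vec, dtype=float)
        score = float(np.dot(s, v) / (np.linalg.norm(s) * np.linalg.norm(v)))
        scored.append((score, case))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [case for _, case in scored[:n]]

def build_prompt(summary, conversation_text, similar_cases):
    """Fold retrieved KB articles, the issue summary, and the conversation into one prompt."""
    articles = "\n\n".join(a for case in similar_cases for a in case["kb_articles"])
    return (
        "You are a Dasher support assistant. Answer using only the articles below.\n\n"
        f"Knowledge base articles:\n{articles}\n\n"
        f"Issue summary: {summary}\n\n"
        f"Conversation so far:\n{conversation_text}\n\nResponse:"
    )
```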
The RAG architecture demonstrates a practical approach to grounding LLM responses in authoritative company information rather than relying solely on the LLM's pre-training knowledge, which can be outdated or incorrect. The system's reliance on similarity search against historical resolved cases represents an interesting hybrid approach that combines retrieval from actual support history with knowledge base content retrieval.
## Critical Production Challenges Identified
DoorDash identified several significant challenges when deploying LLMs in their production support environment that required systematic solutions. The first major challenge was groundedness and relevance of responses in the RAG system. The LLM chatbot occasionally generated responses that diverged from intended context, and while these responses sounded natural and legitimate, users might not realize they were inaccurate. This issue stemmed from outdated or incorrect DoorDash-related information that was included during the LLM's training phase. Because LLMs typically train on publicly available text from platforms like Quora, Reddit, and Twitter, there was a heightened risk of propagating erroneous information about DoorDash operations and policies. This hallucination problem posed serious risks in a production support context where incorrect information could lead to operational failures or policy violations.
The second challenge involved context summarization accuracy. Before retrieving relevant information, the system must accurately understand the Dasher's issue, which becomes particularly complex in multi-turn conversations where the issue evolves and changes as dialogue progresses. The quality and presentation of the summary directly affects the retrieval system's ability to find correct resolutions, making this summarization component a critical dependency for the entire RAG pipeline.
Language consistency represented the third challenge, particularly important given DoorDash's multilingual Dasher population. Because LLMs primarily train on English data, they occasionally overlooked instructions to respond in different languages, especially when the prompt itself was in English. While this issue occurred infrequently and diminished as LLMs scaled, it still required systematic attention in production.
The fourth challenge involved ensuring consistency between actions and responses. The LLM not only responds to users but can also perform actions through API calls, and these function calls must remain consistent with the response text to avoid confusing or contradictory user experiences.
Finally, latency posed operational challenges, with response times varying from sub-second to tens of seconds depending on the model used and prompt size. Generally, larger prompts led to slower responses, creating tension between providing comprehensive context and maintaining acceptable user experience.
## LLM Guardrail System: Two-Tier Quality Control
To address the hallucination and quality challenges, DoorDash developed an LLM Guardrail system that functions as an online monitoring tool evaluating each LLM output before it reaches users. The guardrail system checks response grounding in RAG information to prevent hallucinations, maintains response coherence with previous conversation context, and filters responses that violate company policies.
The primary focus of the guardrail system is hallucination detection, where LLM-generated responses are either unrelated or only partially related to knowledge base articles. DoorDash initially tested a more sophisticated guardrail model but found that increased response times and heavy token usage made it prohibitively expensive for production deployment. This cost-performance tradeoff led them to adopt a pragmatic two-tier approach combining a cost-effective shallow check developed in-house with an LLM-based evaluator as backup.
The first quality check layer employs semantic similarity comparison between the generated response and KB article segments. This shallow check uses a sliding window technique to measure similarities between LLM responses and relevant article segments. The reasoning is straightforward: if a response closely matches content from an article, it's less likely to be a hallucination. This approach provides fast, low-cost screening that can catch obvious issues without invoking expensive LLM evaluation.
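A minimal sketch of such a shallow check is shown below, assuming a hypothetical `embed(text)` sentence-embedding function. The window size and similarity threshold are placeholder values, not DoorDash's tuned settings.

```python
# Sliding-window grounding check: compare the response against overlapping
# segments of the retrieved article and pass if any segment is similar enough.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_grounded(response: str, article_sentences: list[str], embed,
                window: int = 3, threshold: float = 0.75) -> bool:
    """Return True if the response closely matches some window of the article."""
    resp_vec = embed(response)
    for start in range(max(1, len(article_sentences) - window + 1)):
        segment = " ".join(article_sentences[start:start + window])
        if cosine(resp_vec, embed(segment)) >= threshold:
            return True  # close match to article content: unlikely hallucination
    return False         # no segment is similar enough: escalate to the LLM evaluator
```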
When the shallow check flags a response as potentially problematic, the system escalates to the second layer—an LLM-powered evaluator. The system constructs a comprehensive prompt including the initial response, relevant KB articles, and conversation history, then passes this to an evaluation model. This evaluation model assesses whether the response is properly grounded in the provided information and, when issues are found, provides rationale for further debugging and investigation.
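The sketch below shows one way such an evaluator could be prompted and its verdict parsed. The prompt wording, verdict format, and `call_llm` wrapper are assumptions for illustration.

```python
# Second-tier check: ask an evaluation model whether a flagged response is
# grounded in the retrieved articles, keeping its rationale for debugging.

def evaluate_grounding(response: str, kb_articles: list[str],
                       conversation_text: str, call_llm) -> dict:
    articles_text = "\n\n".join(kb_articles)
    prompt = (
        "Given the knowledge base articles and conversation below, decide whether "
        "the candidate response is fully supported by the articles. Answer with "
        "'GROUNDED' or 'NOT GROUNDED', followed by a brief rationale.\n\n"
        f"Articles:\n{articles_text}\n\n"
        f"Conversation:\n{conversation_text}\n\n"
        f"Candidate response:\n{response}\n"
    )
    verdict = call_llm(prompt)
    return {
        "grounded": verdict.strip().upper().startswith("GROUNDED"),
        "rationale": verdict,  # kept for further debugging and investigation
    }
```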
Responses must pass all guardrail tests before being shown to end users. If a response fails guardrail checks, the system can either retry generation or default to human agent support. The article notes latency as a drawback of this approach, since the end-to-end process includes generating a response, applying the guardrail, and possibly retrying with a new guardrail check. However, given the relatively small proportion of problematic responses, strategically defaulting to human agents proves an effective way to ensure a quality user experience while maintaining high automation rates.
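Putting the two tiers together, the control flow might look roughly like the following. The retry budget and fallback behavior are assumptions, since the article does not specify them.

```python
# Orchestration sketch: run the cheap check first, escalate to the LLM
# evaluator, retry generation a bounded number of times, otherwise fall back
# to a human agent (signaled here by returning None).

def guarded_reply(generate, shallow_check, llm_check, max_attempts: int = 2):
    for _ in range(max_attempts):
        response = generate()
        if shallow_check(response):
            return response                      # passed the cheap similarity check
        if llm_check(response)["grounded"]:
            return response                      # passed the LLM-based evaluator
        # otherwise regenerate and re-check
    return None  # caller routes the conversation to a human agent
```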
The results demonstrate the guardrail system's effectiveness: it successfully reduced overall hallucinations by 90% and cut down potentially severe compliance issues by 99%. These metrics highlight the critical importance of production guardrails when deploying LLMs in high-stakes support environments where incorrect information could have serious operational or compliance consequences.
## LLM Judge: Quality Monitoring and Evaluation
Beyond real-time guardrails, DoorDash developed an LLM Judge system for ongoing quality monitoring and evaluation. The team recognized that standard metrics like Dasher feedback, human engagement rate, and delivery speed, while useful for measuring overall performance, didn't provide actionable feedback for improving the chatbot system. To develop more granular quality insights, they manually reviewed thousands of chat transcripts between the LLM and Dashers, categorizing LLM chatbot quality into five distinct aspects: retrieval correctness, response accuracy, grammar and language accuracy, coherence to context, and relevance to the Dasher's request.
For each quality aspect, DoorDash built monitors either by prompting a more sophisticated LLM or creating rules-based regular expression metrics. The overall quality assessment for each aspect is determined by prompting the LLM Judge with open-ended questions. The answers to these open-ended questions are processed and summarized into common issues and failure patterns. High-frequency issues identified through this process are then converted into prompts or rules for ongoing automated monitoring, creating a feedback loop where manual review insights become automated checks.
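In the spirit of these monitors, the sketch below pairs one regex rule with one LLM-prompted check. The specific pattern, quality aspect, and prompt are hypothetical examples, not DoorDash's actual monitors.

```python
# Two illustrative monitors: a cheap rule-based check and an LLM-judged check.
import re

# Rule-based monitor: flag responses that leak prompt-template placeholders.
PLACEHOLDER_PATTERN = re.compile(r"\{\{.*?\}\}|\[INSERT[^\]]*\]")

def placeholder_leak(response: str) -> bool:
    return bool(PLACEHOLDER_PATTERN.search(response))

# LLM-based monitor: ask a stronger model to grade one quality aspect.
def judge_coherence(response: str, conversation_text: str, call_llm) -> str:
    prompt = (
        "Rate how coherent the response is with the preceding conversation "
        "(COHERENT / PARTIALLY COHERENT / INCOHERENT), then explain briefly.\n\n"
        f"Conversation:\n{conversation_text}\n\nResponse:\n{response}"
    )
    return call_llm(prompt)
```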
This approach represents an interesting evolution in LLM evaluation methodology: starting with open-ended qualitative assessment to identify issues, then converting those insights into quantitative multiple-choice monitoring that can scale. Beyond the automated evaluation system, DoorDash maintains a dedicated human review team that evaluates random transcript samples, with continuous calibration between human review and automated systems ensuring comprehensive coverage and preventing evaluation drift.
## Quality Improvement Pipeline
The LLM Judge insights feed into a systematic quality improvement pipeline addressing several root causes of quality issues. DoorDash identifies their system's quality challenges as stemming from insufficient knowledge base content, inaccurate retrieval, model hallucination, and suboptimal prompts. Human support agents play a crucial role as subject matter experts, meticulously reviewing LLM responses and guiding automated process enhancements—demonstrating the importance of human-in-the-loop approaches for production LLM systems.
For knowledge base improvements, the team recognized that the KB serves as foundational truth for LLM responses, making it critical to offer complete, accurately phrased articles. LLM Judge quality evaluation enabled thorough reviews and KB updates to eliminate misleading terminology. Additionally, they're developing a developer-friendly KB management portal to streamline the process for updating and expanding articles, acknowledging that LLM quality depends heavily on the quality of retrieved information.
Retrieval improvements focus on two key processes. Query contextualization involves simplifying queries to single, concise prompts while providing context through comprehensive conversation history. Article retrieval involves selecting optimal embedding models from available choices within their vector store to enhance retrieval accuracy. This highlights the importance of embedding model selection and vector store optimization in production RAG systems.
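One plausible way to compare candidate embedding models is to measure top-k hit rate on labeled query-to-article pairs, as sketched below. The `embed_with` helper and the evaluation setup are assumptions, since the article does not describe how DoorDash compared models.

```python
# Hypothetical offline evaluation: which embedding model most often places the
# correct KB article in the top-k retrieved results?
import numpy as np

def hit_rate_at_k(embed_with, labeled_pairs, articles, k=3):
    """Fraction of queries whose correct article appears in the top-k results."""
    article_vecs = [np.asarray(embed_with(a), dtype=float) for a in articles]
    hits = 0
    for query, correct_article in labeled_pairs:
        q = np.asarray(embed_with(query), dtype=float)
        scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                  for v in article_vecs]
        top_k = sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)[:k]
        hits += int(articles.index(correct_article) in top_k)
    return hits / len(labeled_pairs)
```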
Prompt improvements follow several principled approaches developed through experimentation. The team breaks down complex prompts into smaller, manageable parts and employs parallel processing where feasible to manage latency. They avoid negative language in prompts because models typically struggle with negation, instead clearly outlining desired actions and providing illustrative examples. They implement chain-of-thought prompting to encourage the model to process and display its reasoning, which aids in identifying and correcting logic errors and hallucinations by making the model's reasoning process transparent.
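The sketch below illustrates two of these practices: splitting work into smaller sub-prompts run in parallel, and appending a chain-of-thought instruction that asks the model to show its reasoning. The task split and `call_llm` wrapper are illustrative assumptions.

```python
# Run independent sub-prompts concurrently instead of one large prompt, to
# manage latency; append a chain-of-thought instruction where reasoning helps.
from concurrent.futures import ThreadPoolExecutor

def run_subprompts(call_llm, sub_prompts: list[str]) -> list[str]:
    """Execute independent sub-prompts concurrently and collect their outputs."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(call_llm, sub_prompts))

# Chain-of-thought style suffix appended to prompts where reasoning matters:
COT_SUFFIX = "Think through the relevant policy step by step, then state the final answer."
```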
## Regression Prevention and Testing
To maintain prompt quality and prevent model performance regression, DoorDash uses an open-source evaluation tool called Promptfoo, which functions similarly to unit testing in software development. This tool allows rapid prompt refinement and systematic evaluation of model responses across test cases. A suite of predefined tests is triggered by any prompt change, and deployment is blocked for prompts that fail. This represents a critical LLMOps practice: treating prompts as code that requires testing and version control.
Newly identified issues are systematically added to Promptfoo test suites, ensuring continuous improvement and preventing regression when prompts or models are updated. This creates a growing test suite that encodes learnings from production issues, similar to how software teams build regression test suites over time.
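Promptfoo itself is typically driven by a declarative test configuration; the Python sketch below is only an analog of the same idea (a fixed suite of prompt cases and assertions that must all pass before a change ships), with hypothetical test cases rather than Promptfoo's actual syntax.

```python
# Illustrative regression harness: every prompt change must pass the full
# suite before deployment; production issues become new cases over time.

TEST_CASES = [
    {"input": "The store is closed, what do I do?",
     "must_contain": ["report the store as closed"],
     "must_not_contain": ["contact the customer's bank"]},
]

def run_prompt_regression(generate_response, cases=TEST_CASES) -> bool:
    """Return True only if every case passes; a False result blocks deployment."""
    for case in cases:
        out = generate_response(case["input"]).lower()
        if any(s.lower() not in out for s in case["must_contain"]):
            return False
        if any(s.lower() in out for s in case["must_not_contain"]):
            return False
    return True
```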
## Production Outcomes and Scale
The case study reports that the system now autonomously assists thousands of Dashers daily, streamlining basic support requests while maximizing the value of human contributions. This collaborative approach allows human support representatives to focus on solving more complex problems rather than handling routine inquiries. The quality monitoring and iterative improvement pipeline transformed an initial prototype into a robust chatbot solution that serves as a cornerstone for further automation capabilities.
## Critical Assessment and Tradeoffs
While DoorDash presents their system as successful, the case study reveals several important tradeoffs and limitations inherent in production LLM deployment. The two-tier guardrail system, while cost-effective, introduces latency that impacts user experience. The decision to default to human agents when guardrails fail represents a pragmatic tradeoff between automation and quality, but the article doesn't quantify what percentage of queries fall back to humans or how this affects overall automation rates.
The reliance on LLM Judge for quality evaluation introduces potential circularity—using LLMs to evaluate LLM outputs—though the human review calibration process helps mitigate this concern. The article also doesn't detail specific models used for different components (generation vs. evaluation), making it difficult to assess the actual cost and performance characteristics of the system.
The article's claim of 90% hallucination reduction and 99% compliance issue reduction is impressive but lacks baseline context. Without knowing the initial hallucination and compliance violation rates, it's difficult to assess whether these percentages represent moving from unacceptable to acceptable performance or from good to excellent performance.
The system's handling of multi-turn conversations and context summarization represents a known challenge in conversational AI, and while the article acknowledges this importance, it doesn't provide details on how summarization accuracy is measured or what techniques proved most effective. The multilingual support challenge is mentioned but not deeply explored—the article notes the issue "occurs infrequently and its occurrence diminishes as the LLM scales" but doesn't explain how DoorDash specifically addresses this for their diverse Dasher population.
## Future Directions and Broader Implications
DoorDash acknowledges that their LLM chatbot represents a shift from traditional flow-based systems, introducing inherent uncertainty from the underlying language models. They emphasize that ensuring high-quality responses is paramount for any high-volume LLM application and that continuing to develop precise quality assessment methods will help identify and narrow performance gaps. They recognize that while the chatbot handles routine inquiries effectively, complex support scenarios will continue to require human expertise.
The case study illustrates several critical LLMOps practices for production deployment: comprehensive guardrail systems for real-time quality control, systematic quality monitoring through LLM Judge, human-in-the-loop review and calibration, systematic testing and regression prevention, and continuous improvement pipelines feeding insights from production back into system refinement. The approach demonstrates the maturity required to move from LLM prototypes to production systems serving thousands of users daily in a domain where accuracy and reliability directly impact business operations and user experience.