## Overview
This case study documents DoorDash's summer 2025 intern projects, with the most prominent LLMOps-related work being a GenAI-powered, context-aware shopping engine developed by the Core Consumer ML team. The primary challenge was context loss during in-app item searches: specific user intent, such as a query for "fresh vegetarian sushi," would not translate into appropriately personalized recommendations. The team explored multiple retrieval methodologies before settling on a hybrid approach that balanced latency, cost, and personalization requirements.
## Problem Context and Business Challenge
DoorDash's existing search infrastructure struggled with preserving user context when customers searched for items with specific attributes. When users entered queries with nuanced requirements—such as dietary restrictions, taste preferences, or freshness requirements—the system would often return generic, duplicate, or irrelevant items. This led to decision fatigue among users and potentially reduced conversion rates. The core technical challenge was maintaining low latency (under two seconds for page loads) while providing highly contextualized, personalized recommendations that considered both historical user preferences and real-time session context.
## Methodology Exploration and Trade-offs
The team conducted a thorough evaluation of several retrieval methodologies, demonstrating a thoughtful approach to LLMOps architecture selection. Each approach presented distinct trade-offs:
**Deep Neural Network-Based Retrieval:** This approach would predict click-through or conversion rate for user-item pairs, offering semantic matching within a single unified model. While it promised relevant recommendations and a unified inference pipeline, it proved impractical: the extensive data collection and processing requirements, together with the resource investment needed to train and maintain such a model at scale, made the approach prohibitive.
**LLM-Only Pipeline:** The team explored relying solely on an LLM for context understanding, candidate generation, and ranking in a single call. This approach provided rich, highly contextualized outputs that significantly enhanced clarity and relevance of recommendations. However, severe latency issues of more than 20 seconds made it unsuitable for real-time consumer-facing environments. The practical deployment was also constrained by extremely large prompt sizes and processing overhead, making this approach non-viable for production despite its quality advantages.
**Embedding-Based Retrieval (EBR):** This method converts search queries and items into numerical vectors (embeddings) and uses simple business-rule filters to quickly fetch relevant items from a vector database. EBR achieved near-instantaneous retrieval with minimal computational overhead, making it attractive from a performance perspective. However, it offered limited personalization and often returned duplicates or irrelevant items by failing to consider nuanced user preferences such as dietary restrictions or taste profiles. The lack of contextual understanding was a critical limitation.
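To make the EBR building block concrete, the sketch below shows the basic pattern of embedding items offline, indexing them in FAISS, and applying a simple business-rule filter after the vector search. The embedding model, catalog schema, and dietary-tag filter are illustrative assumptions rather than details from DoorDash's system.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder embedding model; DoorDash's actual model is not named in the case study.
model = SentenceTransformer("all-MiniLM-L6-v2")

items = [
    {"id": 1, "name": "Salmon avocado roll", "tags": {"pescatarian"}},
    {"id": 2, "name": "Cucumber avocado roll", "tags": {"vegetarian", "vegan"}},
    {"id": 3, "name": "Spicy tuna roll", "tags": {"pescatarian", "spicy"}},
]

# Precompute normalized item embeddings and load them into a flat inner-product
# index (inner product on unit vectors is cosine similarity).
item_vecs = model.encode([i["name"] for i in items], normalize_embeddings=True)
index = faiss.IndexFlatIP(item_vecs.shape[1])
index.add(np.asarray(item_vecs, dtype=np.float32))

def ebr_search(query: str, k: int = 2, required_tag: str | None = None):
    q = model.encode([query], normalize_embeddings=True).astype(np.float32)
    scores, idxs = index.search(q, len(items))  # over-fetch, then rule-filter
    results = []
    for score, idx in zip(scores[0], idxs[0]):
        item = items[int(idx)]
        if required_tag and required_tag not in item["tags"]:
            continue  # simple business-rule filter, e.g. a dietary restriction
        results.append((item["name"], float(score)))
        if len(results) == k:
            break
    return results

print(ebr_search("fresh vegetarian sushi", required_tag="vegetarian"))
```

Even in this toy form, the limitation described above is visible: the tag filter is the only personalization, and nothing in the vector lookup itself understands nuance like "fresh" or a user's taste history.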
**Hybrid EBR Plus LLM Pipeline:** The winning approach combined rapid nearest-neighbor search using FAISS for candidate retrieval with refined LLM-based reranking and filtering. This architecture balanced quick retrieval with the LLM's sophisticated context and relevance evaluation capabilities, significantly enhancing personalization, precision, and contextual depth while maintaining acceptable latency. The main drawback was increased system complexity and orchestration challenges caused by reliance on multiple models and stages, but these were manageable compared to the benefits.
## Implementation Architecture
The hybrid approach required careful orchestration across multiple stages to achieve the desired balance of speed, personalization, and scalability. The implementation followed a multi-stage pipeline architecture:
**Stage 1: Context and Embedding Fetching:** The system compressed user profiles in real time, incorporating dietary preferences, historical interactions, and contextual session data. An LLM-based embedding model converted item descriptions into vector representations for efficient semantic matching, and these embeddings were precomputed to minimize runtime latency while keeping the item catalog data fresh.
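A minimal sketch of what this stage could look like, assuming a generic sentence-transformer embedder and a hypothetical profile/session schema; the write-up does not specify DoorDash's actual models or field names:

```python
import json
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

def precompute_item_embeddings(catalog: list[dict]) -> dict[int, np.ndarray]:
    """Batch job: embed item descriptions ahead of time so the online request
    path only has to embed the incoming query."""
    texts = [f"{item['name']}. {item['description']}" for item in catalog]
    vecs = embedder.encode(texts, normalize_embeddings=True)
    return {item["id"]: vec for item, vec in zip(catalog, vecs)}

def compress_user_context(profile: dict, session: dict) -> str:
    """Fold stable preferences and live session signals into a short summary
    that fits comfortably inside a downstream LLM prompt."""
    return json.dumps({
        "dietary": profile.get("dietary_preferences", []),
        "frequent_cuisines": profile.get("top_cuisines", [])[:3],
        "recent_searches": session.get("recent_queries", [])[-3:],
        "current_query": session.get("query", ""),
    })
```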
**Stage 2: Candidate Retrieval via EBR:** The team employed query expansion to improve retrieval quality, injecting user-specific terms such as dietary restrictions and taste preferences (for example, "vegan," "savory," or "spicy") into the query embedding so that user preferences were reflected directly in the retrieval process. The system then computed semantic similarity between the expanded query embedding and item embeddings using FAISS, which provided fast approximate nearest-neighbor search. This stage was critical for keeping latency low while ensuring that a reasonable candidate set reached the more expensive LLM reranking stage.
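One plausible reading of the query-expansion step is sketched below: preference terms are appended to the raw query text before embedding, and the expanded vector drives the FAISS lookup. Expanding at the text level (rather than in vector space), along with the function names, is an assumption for illustration.

```python
import numpy as np

def expand_query(query: str, user_prefs: list[str]) -> str:
    # "fresh vegetarian sushi" + ["vegan", "savory"] -> "fresh vegetarian sushi vegan savory"
    return " ".join([query] + user_prefs)

def retrieve_candidates(query: str, user_prefs: list[str], index, embedder,
                        id_lookup: list[int], k: int = 50) -> list[int]:
    expanded = expand_query(query, user_prefs)
    q = embedder.encode([expanded], normalize_embeddings=True).astype(np.float32)
    _, idxs = index.search(q, k)  # FAISS approximate nearest-neighbor lookup
    return [id_lookup[int(i)] for i in idxs[0] if i != -1]  # -1 = no neighbor found
```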
**Stage 3: LLM-Based Reranking and Carousel Generation:** The LLM reranked the candidates retrieved in Stage 2, significantly improving recommendation accuracy by evaluating semantic relevance and contextual alignment. Beyond reranking, the LLM also generated carousel titles and item descriptions dynamically, enhancing the user experience with contextual explanations. This stage supplied the contextual intelligence that pure embedding-based approaches lacked, but it was constrained to a small candidate set to manage cost and latency.
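A sketch of how the combined reranking and title-generation call might be framed, with `call_llm` standing in for whichever model DoorDash actually used (the case study does not name it) and the prompt wording invented for illustration:

```python
import json

def build_rerank_prompt(user_context: str, query: str, candidates: list[dict]) -> str:
    listing = "\n".join(
        f"- id={c['id']}: {c['name']} ({c['description']})" for c in candidates
    )
    return (
        "You are ranking food items for a delivery app.\n"
        f"User context: {user_context}\n"
        f"Search query: {query}\n"
        f"Candidate items:\n{listing}\n\n"
        "Return JSON with keys 'ranked_ids' (best first, drop irrelevant items) "
        "and 'carousel_title' (short and specific to the user's intent)."
    )

def rerank_and_title(call_llm, user_context: str, query: str, candidates: list[dict]):
    # call_llm is a hypothetical stand-in for the actual model client.
    raw = call_llm(build_rerank_prompt(user_context, query, candidates))
    parsed = json.loads(raw)  # production code would need schema validation and retries
    by_id = {c["id"]: c for c in candidates}
    ranked = [by_id[i] for i in parsed["ranked_ids"] if i in by_id]
    return ranked, parsed["carousel_title"]
```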
## LLMOps Challenges and Solutions
The hybrid approach introduced specific implementation challenges that required careful LLMOps considerations:
**Latency Management:** DoorDash's strict latency constraints demanded careful optimization of the entire pipeline. The hybrid approach required parallelizing retrieval and LLM operations to maintain low latency. The team achieved end-to-end latency of approximately six seconds for the full recommendation pipeline, with store page load times consistently under two seconds. This represented a dramatic improvement over the 20+ second latency observed with LLM-only approaches while still providing contextualized recommendations far superior to pure EBR.
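The case study does not describe how the parallelization was structured. One plausible shape, sketched below with hypothetical `retriever` and `reranker` interfaces, is to build each personalized carousel concurrently so that page latency is bounded by the slowest carousel rather than the sum of all of them:

```python
import asyncio

async def build_carousel(theme_query: str, profile: dict, retriever, reranker) -> dict:
    candidates = await retriever.search(theme_query, profile)        # fast EBR stage
    items = await reranker.rerank(theme_query, profile, candidates)  # LLM stage
    return {"theme": theme_query, "items": items}

async def build_page(themes: list[str], profile: dict, retriever, reranker) -> list[dict]:
    # Each carousel's retrieval + rerank is independent of the others,
    # so run them concurrently instead of serially.
    return list(await asyncio.gather(
        *(build_carousel(t, profile, retriever, reranker) for t in themes)
    ))
```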
**Cost Optimization:** Strategic management of LLM usage and scale was crucial for economic feasibility. By using EBR to reduce the candidate set from potentially thousands of items to a manageable subset, the system minimized the number of items requiring expensive LLM evaluation. This two-stage approach allowed selective deployment of LLM capabilities only where they provided maximum value, significantly reducing compute resources and associated costs compared to approaches that applied LLMs to all possible items.
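A back-of-the-envelope illustration of this argument, using entirely made-up token counts, shows why shrinking the candidate set dominates LLM cost: prompt size, and therefore per-request cost, grows roughly linearly with the number of candidates the reranker sees.

```python
def rerank_prompt_tokens(num_candidates: int,
                         tokens_per_item: int = 40,    # made-up average
                         fixed_overhead: int = 300) -> int:
    # Prompt size scales linearly with the number of candidates shown to the LLM.
    return fixed_overhead + num_candidates * tokens_per_item

full_catalog = rerank_prompt_tokens(5_000)  # ~200k tokens per request
ebr_filtered = rerank_prompt_tokens(50)     # ~2.3k tokens per request
print(full_catalog / ebr_filtered)          # ~87x fewer prompt tokens per request
```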
**Context Balancing:** The system needed to appropriately balance real-time context signals and stable user preferences. Advanced query expansion techniques and embedding optimizations were necessary to inject user-specific information into the retrieval stage without degrading performance. The architecture maintained separate representations for stable user preferences (dietary restrictions, historical patterns) and dynamic session context (current search query, time of day, recent interactions), allowing the system to weight these factors appropriately during both retrieval and reranking stages.
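One simple way to realize this weighting at the retrieval stage is to blend a precomputed stable-preference embedding with the live session-query embedding. The scheme below is an illustrative assumption; the write-up only states that the two signal types were represented separately and weighted appropriately.

```python
import numpy as np

def blended_query_vector(session_vec: np.ndarray,
                         stable_pref_vec: np.ndarray,
                         session_weight: float = 0.7) -> np.ndarray:
    # Weight the live session signal against the stable preference signal,
    # then re-normalize so the result still works with cosine / inner-product search.
    v = session_weight * session_vec + (1.0 - session_weight) * stable_pref_vec
    return v / np.linalg.norm(v)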
## Experimentation and Validation
The team conducted extensive experimentation to validate the hybrid retrieval engine's performance and impact across multiple dimensions:
**Latency Achievements:** Initial single-method baseline tests revealed the impracticality of the LLM-only approach with latency exceeding 20 seconds. The optimized hybrid solution achieved end-to-end latency of approximately six seconds for the complete recommendation pipeline, with store page load times consistently under two seconds—fully compliant with DoorDash's strict performance guidelines. This demonstrated that the hybrid architecture successfully balanced the competing demands of personalization quality and response time.
**Cost Efficiency:** Through strategic use of EBR combined with selective LLM calls, the team significantly minimized compute resources and associated costs. While specific cost metrics were not provided in the case study, the approach proved economically scalable by limiting expensive LLM operations to a pre-filtered candidate set rather than the entire item catalog. This made the personalization engine viable for production deployment at DoorDash's scale.
**Enhanced User Experience:** User experience demonstrations illustrated the transformation from static, generic menus to dynamic, personalized item carousels. Preliminary feedback indicated notable improvements in user satisfaction, with increased perceived relevance and ease of navigation. The system's ability to maintain user intent throughout the shopping journey represented a tangible improvement over the previous experience.
## Production Considerations and LLMOps Maturity
While the case study represents an intern project and does not provide extensive details on full production deployment, several aspects suggest thoughtful consideration of LLMOps best practices:
**Architecture Modularity:** The staged pipeline design allows for independent optimization and evolution of components. The EBR stage can be improved with better embeddings or query expansion techniques, while the LLM reranking stage can adopt newer models or prompting strategies without requiring a complete system redesign.
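The modularity claim can be made concrete with narrow stage interfaces, sketched below with hypothetical names: as long as the retriever and reranker honor the same contracts, either can be swapped without touching the orchestration code.

```python
from typing import Protocol

class CandidateRetriever(Protocol):
    def search(self, query: str, user_context: dict, k: int) -> list[dict]: ...

class Reranker(Protocol):
    def rerank(self, query: str, user_context: dict,
               candidates: list[dict]) -> list[dict]: ...

def recommend(query: str, user_context: dict,
              retriever: CandidateRetriever, reranker: Reranker) -> list[dict]:
    # Swapping the embedding backend or the reranking LLM only means providing
    # a new object that satisfies the corresponding protocol.
    candidates = retriever.search(query, user_context, k=50)
    return reranker.rerank(query, user_context, candidates)
```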
**Performance Monitoring:** The team established clear latency targets and validated performance against these requirements before considering the solution viable. This demonstrates awareness of production SLAs and the importance of continuous monitoring.
**Prompt Engineering:** The LLM was used for both reranking (evaluating item relevance) and generation (creating carousel titles and descriptions), suggesting careful prompt design to handle multiple tasks effectively. However, the case study does not provide details on prompt templates, few-shot examples, or prompt versioning strategies.
**Scalability Design:** The use of precomputed embeddings and efficient vector search via FAISS demonstrates consideration for scaling to large item catalogs and high query volumes. The architecture separates expensive batch operations (embedding generation) from real-time inference paths.
## Critical Assessment and Limitations
While this case study presents an impressive summer intern project, several caveats and limitations warrant attention:
**Limited Production Evidence:** The case study represents an intern project with preliminary results rather than a fully deployed production system with extensive A/B testing and business metrics. Actual conversion rate improvements, cart additions, or revenue impact are not quantified. The "preliminary feedback" on user satisfaction lacks the rigor of controlled experiments.
**Cost-Benefit Validation:** While the case study claims cost efficiency, no specific cost comparisons or ROI calculations are provided. The six-second latency for personalized recommendations may still represent a significant compute expense at scale, and whether this translates to measurable business value remains unclear.
**Generalization Questions:** The evaluation appears focused on specific query types (like "fresh vegetarian sushi") but does not address how the system performs across the full diversity of DoorDash queries. Performance on simple queries, ambiguous queries, or queries outside the training distribution of the embedding model is not discussed.
**Model Selection and Tuning:** The case study does not specify which LLM was used, how it was selected, whether it was fine-tuned for DoorDash's specific use case, or how prompts were engineered. These details would be critical for practitioners attempting to replicate the approach.
**Maintenance Burden:** The hybrid architecture introduces operational complexity with multiple models (embedding model, LLM), data pipelines (profile compression, embedding updates), and orchestration logic. The long-term maintenance costs and risks of this complexity are not addressed.
## Related Projects
The document also mentions other intern projects that are less directly relevant to LLMOps but worth noting briefly:
**Budget Signal Optimization for Sponsored Brand Ads:** This project focused on improving efficiency in ad serving by implementing real-time budget-capped flags using Kafka, Flink, and CockroachDB. It achieved a 43% reduction in search processor latency and 45% reduction in discarded candidates. While not directly GenAI-related, it demonstrates the type of real-time signal processing that could support LLM-based advertising systems.
**Service Slowdown Communication System:** This system combines streaming data, anomaly detection, and in-app notifications to communicate delivery disruptions. It uses Apache Kafka and Spark Structured Streaming with Delta tables, applying anomaly detection logic to identify true slowdowns versus normal variance. This represents operational ML rather than generative AI but shows DoorDash's broader data infrastructure capabilities.
**DashPass End-to-End Testing:** This project established automated E2E testing using the Maestro test framework, Kaizen flows, and dynamic variables to simulate complex user flows for subscription signups. While not GenAI-focused, it represents the kind of testing infrastructure that would be essential for validating LLM-powered features in production.
**User-Generated Content Feedback Loop:** This project built a notification system to alert users when their reviews achieve high visibility. It used ETL pipelines, engagement platforms, and journey orchestration but did not involve LLMs. However, such feedback systems could potentially be enhanced with LLM-generated personalized messages in future iterations.
**Third-Party API Improvements:** This project focused on standardizing error handling and refactoring API endpoints for CPG advertising campaigns, with plans to introduce E2E testing. While not LLMOps-related, it demonstrates the importance of API stability and testing—considerations equally relevant for LLM-powered API services.
## Conclusion and Industry Context
DoorDash's hybrid retrieval engine represents a pragmatic approach to deploying LLMs in production for personalized recommendations. The solution demonstrates mature thinking about trade-offs between model capabilities, latency requirements, and cost constraints—key considerations for any LLMOps implementation. The multi-stage architecture pattern of using fast, cheap methods for candidate generation followed by expensive, sophisticated models for refinement is increasingly common in production LLM systems and represents a sensible design pattern.
However, as an intern project with preliminary results, the case study leaves many questions unanswered about production readiness, actual business impact, and operational sustainability. The lack of concrete metrics on conversion improvements, detailed cost analysis, or failure mode handling limits our ability to fully assess the solution's effectiveness. Organizations considering similar hybrid architectures should conduct thorough experimentation with their specific use cases and validate both technical performance and business value before full deployment.
The case study does highlight the growing sophistication of LLMOps implementations even at the intern level, suggesting that hybrid architectures combining traditional ML techniques with large language models are becoming a standard playbook for production systems requiring both speed and intelligence.