Company
DoorDash
Title
Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration
Industry
E-commerce
Year
2025
Summary (short)
DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
DoorDash's LLMOps case study represents a comprehensive implementation of large language models in production to solve personalization and catalog management challenges at massive scale. The company, led by their AI and machine learning team including Sadep (head of AI/ML for new business verticals), Rohan (ML engineer focusing on catalog problems), and Siman (personalization specialist), has evolved from a restaurant delivery platform into a multi-vertical marketplace encompassing grocery, convenience, alcohol, flowers, pets, retail, and electronics.

The core challenge DoorDash faced was the paradigm shift from restaurant menus with approximately 100 items to retail stores like Best Buy with potentially 100,000+ items. This expansion created three-sided marketplace complexity involving merchants, consumers, and dashers (shoppers), where each interaction point requires intelligent automation. Traditional machine learning approaches that relied primarily on user interaction data became insufficient, particularly for cold-start scenarios involving new customers, new products, or new merchant categories.

**Product Knowledge Graph Construction**

DoorDash's foundation for LLM implementation centers on their product knowledge graph, which serves as the backbone for all downstream applications. Previously, this process was entirely human-driven: manual processing of noisy merchant data through spreadsheets and human expertise to create standardized SKUs (stock keeping units). This approach was slow, expensive, and error-prone.

The company implemented a progressive automation strategy using fine-tuned LLMs for attribute extraction. They discovered that smaller, fine-tuned models could achieve performance comparable to larger models like GPT-4 while providing significant cost and speed advantages.
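As a concrete illustration of the extraction step, here is a minimal sketch of how a fine-tuned model's JSON attribute output might be validated before entering a knowledge graph. The field names and schema are hypothetical, not DoorDash's actual taxonomy:

```python
import json

# Hypothetical required attributes for a grocery SKU; the real
# DoorDash knowledge-graph schema is not public.
REQUIRED_FIELDS = {"brand", "product_name", "size", "category"}

def validate_extraction(llm_output: str) -> dict:
    """Parse a fine-tuned model's JSON attribute output and check
    that every required knowledge-graph field is present."""
    attrs = json.loads(llm_output)
    missing = REQUIRED_FIELDS - attrs.keys()
    if missing:
        raise ValueError(f"missing attributes: {sorted(missing)}")
    return attrs

raw = '{"brand": "Acme", "product_name": "Cake Flour", "size": "2 lb", "category": "Baking"}'
print(validate_extraction(raw)["product_name"])  # Cake Flour
```

Guarding the model output with a schema check like this is one way to keep an imperfect extractor from silently corrupting downstream catalog data.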
Their progression started with GPT-4 and basic prompts, moved to fine-tuning GPT-4, and then to fine-tuning smaller models such as GPT-4o mini and eventually Llama models. This approach reduced costs, increased processing speed, and maintained acceptable accuracy.

For complex categorization tasks involving approximately 3,000 product categories, DoorDash implemented a RAG (Retrieval Augmented Generation) system rather than relying on large context windows. They create vector indices of categories using tools like Faiss and embeddings from GPT or open-source alternatives, perform retrieval to identify top candidate categories, and then use LLMs to make the final categorization decision. This approach improved both accuracy and response time while reducing hallucination compared to long-context approaches.

**LLM Agents for Complex Data Processing**

When RAG systems proved insufficient for highly ambiguous product data, DoorDash implemented LLM agents capable of tool calling and external reasoning. For example, when processing cryptic merchant data like "GRPR NTR 4VB," agents can search internal databases, manufacturer websites, or third-party data sources to accurately identify the product. This automated the human operator workflow while maintaining accuracy and significantly improving processing speed.

The agent implementation required careful attention to hallucination risks and robust evaluation frameworks. DoorDash emphasizes strong evaluation protocols and prompt engineering to maintain reliability when agents make external calls and synthesize information from multiple sources.

**Scaling LLM Inference in Production**

DoorDash implemented several strategies for scaling LLM inference across their product catalog of over 100 million SKUs.
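The retrieve-then-decide categorization flow described above can be sketched as follows. This toy version substitutes a bag-of-words similarity for the real Faiss index and embedding model, and stubs out the final LLM call; the category names are invented for illustration:

```python
from collections import Counter
import math

# Toy category list; the real taxonomy has roughly 3,000 categories.
CATEGORIES = ["baking flour", "energy drinks", "pet food", "fresh produce", "nutrition bars"]

def embed(text: str) -> Counter:
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    # Retrieval step: in production this would be a Faiss index lookup.
    q = embed(query)
    return sorted(CATEGORIES, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def categorize(product: str) -> str:
    candidates = retrieve(product)
    # Stand-in for the final LLM call: constraining the model to choose
    # among only the retrieved candidates is what bounds hallucination.
    return candidates[0]

print(categorize("cake baking flour 2lb"))  # baking flour
```

The key design point survives the simplification: the LLM never sees all 3,000 categories, only a short retrieved shortlist, which improves both latency and reliability.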
Their model cascading approach uses a funnel strategy: heuristics for simple cases, in-house fine-tuned models for moderate complexity, and API calls to frontier models reserved for only the most complex cases. This funnel handles approximately 88% of SKUs before requiring an expensive large-model call, resulting in substantial cost savings and latency improvements.

They also employ output distillation, where larger models generate training data for smaller, more efficient models that can be deployed in production with improved throughput and reduced costs while maintaining acceptable accuracy.

**Hierarchical RAG-Powered Personalization**

For personalization, DoorDash developed a hierarchical RAG system that addresses the cold-start problem by combining traditional collaborative filtering with LLM world knowledge. The system works by selecting terminal nodes in their product taxonomy (e.g., "cake flour"), using RAG to retrieve relevant product types from their extensive catalog, employing LLMs to recommend complementary items based on world knowledge, mapping recommendations back to their catalog structure, and recursively expanding child nodes until reaching terminal categories.

This approach enables personalization even for new customers or products without historical interaction data. The system can recommend relevant items like butter, sugar, baking powder, and vanilla extract alongside cake flour, even if no previous customer has purchased that combination on the platform.

**LLM-Based Evaluation Systems**

DoorDash uses LLMs to evaluate their recommendation systems, employing either more powerful models or councils of LLMs with chain-of-thought reasoning to assess recommendation quality. They use a scoring rubric (3 for highly relevant, 2 for somewhat relevant, 1 for not relevant) and treat the accompanying explanations as feedback loops to improve both traditional ML models and RAG-based recommendations.
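A minimal sketch of how a council's 3/2/1 scores and explanations might be aggregated. The verdicts below are hypothetical; in production each would come from a separate chain-of-thought LLM call:

```python
from statistics import mean

# Hypothetical judge outputs for one candidate recommendation,
# following the 3/2/1 rubric (highly / somewhat / not relevant).
council_verdicts = [
    {"score": 3, "explanation": "Butter is a core baking complement to cake flour."},
    {"score": 3, "explanation": "Frequently purchased together for baking."},
    {"score": 2, "explanation": "Relevant, though substitutes exist."},
]

def aggregate(verdicts: list[dict]) -> tuple[float, list[str]]:
    """Average the council's scores and collect the explanations,
    which feed back into model refinement."""
    scores = [v["score"] for v in verdicts]
    return mean(scores), [v["explanation"] for v in verdicts]

score, feedback = aggregate(council_verdicts)
print(round(score, 2))  # 2.67
```

Averaging is only one aggregation choice; majority voting or requiring unanimity are equally plausible, and the explanations are arguably the more valuable output since they drive the feedback loop.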
This creates a continuous improvement cycle where LLM evaluations inform model refinements.

**Semantic IDs and Future Directions**

DoorDash is investing in semantic IDs as their next major advancement: dense representations that encapsulate semantic meaning from product hierarchies and item context rather than traditional numerical embeddings. These semantic IDs will enable prompt-based recommendations where LLM outputs correspond directly to meaningful product identifiers, simplifying RAG systems and catalog mapping while maintaining semantic coherence. This approach promises to address current limitations, including expanding catalog spaces, LLM hallucination, limited catalog context in LLM knowledge, and the complex mapping between LLM outputs and internal product catalogs.

**Contextual and Agentic Applications**

DoorDash envisions future applications where LLMs enable contextual adaptation of the app experience. They demonstrated a concept for tornado emergency preparedness: the system could automatically detect weather alerts, understand user profiles (including family composition), and proactively assemble a customized shopping experience with appropriate emergency supplies, baby care items for families with children, and non-perishable foods. This agentic approach would combine long-term and short-term memory, real-time context awareness, multi-tool integration (weather APIs, inventory systems, delivery platforms), self-reflection capabilities for continuous learning, and conversational interfaces for dynamic user interaction.

**Technical Challenges and Learnings**

Throughout their implementation, DoorDash identified several key challenges and learnings. Fine-tuning proves essential for production deployment, particularly for domain-specific tasks where accuracy requirements are high (such as allergen information, where health implications are critical).
RAG systems require careful balance between retrieval recall and generation precision. Model cascading and distillation are crucial for cost-effective scaling. LLM evaluation systems can effectively replace expensive human evaluation while providing faster feedback loops. Semantic representations offer promising solutions to catalog mapping and cold-start challenges.

The company's approach demonstrates a mature understanding of LLMOps principles, emphasizing the importance of layering generative AI on top of traditional ML foundations rather than replacing existing systems entirely. Their progressive implementation strategy, from simple attribute extraction to complex agentic workflows, illustrates a practical path for enterprise LLM adoption while maintaining production reliability and cost efficiency.

DoorDash's case study is a comprehensive example of LLMOps at scale, showing how large language models can be integrated into existing ML infrastructure to solve real-world business problems while meeting the operational requirements of a high-volume consumer platform serving millions of users across diverse product categories.
