DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with more than 100 million SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and a foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
DoorDash’s LLMOps case study is a comprehensive example of running large language models in production to solve personalization and catalog management challenges at massive scale. The work is presented by members of their AI and machine learning team: Sadep (head of AI/ML for new business verticals), Rohan (an ML engineer focused on catalog problems), and Siman (a personalization specialist). The company has evolved from a restaurant delivery platform into a multi-vertical marketplace spanning grocery, convenience, alcohol, flowers, pets, retail, and electronics.
The core challenge DoorDash faced was the paradigm shift from restaurant menus with approximately 100 items to retail stores like Best Buy with potentially 100,000+ items. This expansion created a three-sided marketplace complexity involving merchants, consumers, and dashers (shoppers), where each interaction point requires intelligent automation. Traditional machine learning approaches that relied primarily on user interaction data became insufficient, particularly for cold-start scenarios involving new customers, new products, or new merchant categories.
Product Knowledge Graph Construction
DoorDash’s foundation for LLM implementation centers on their product knowledge graph, which serves as the backbone for all downstream applications. Previously, this process was entirely human-driven, involving manual processing of noisy merchant data through spreadsheets and human expertise to create standardized SKUs (stock keeping units). This approach was slow, expensive, and error-prone.
The company implemented a progressive automation strategy using fine-tuned LLMs for attribute extraction. They found that smaller, fine-tuned models could match the performance of larger models like GPT-4 while offering significant cost and speed advantages. Their progression started with GPT-4 and basic prompts, moved to fine-tuning GPT-4, and then to fine-tuning smaller models such as GPT-4o mini and eventually Llama models. This path reduced costs, increased processing speed, and maintained acceptable accuracy.
For complex categorization tasks involving approximately 3,000 product categories, DoorDash implemented a RAG (Retrieval Augmented Generation) system rather than relying on large context windows. They create vector indices of categories using tools like Faiss and embeddings from GPT or open-source alternatives, perform retrieval to identify top candidate categories, and then use LLMs to make final categorization decisions. This approach improved both accuracy and response time while reducing hallucination compared to long-context approaches.
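The retrieval-then-decide flow can be sketched as follows. This is a minimal illustration, not DoorDash's implementation: a seeded random unit vector stands in for real GPT or open-source embeddings, a plain NumPy dot product stands in for a Faiss index, and a placeholder substitutes for the final LLM call; the category names are invented.

```python
import zlib
import numpy as np

# Toy embedding stand-in: a seeded random unit vector per string keeps
# the sketch deterministic. Production would embed with a real model
# and index the vectors with Faiss.
def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# Illustrative slice of the ~3,000-category taxonomy.
CATEGORIES = ["Baking Flour", "Granola Bars", "Pet Food", "Laundry Detergent"]
index = np.stack([embed(c) for c in CATEGORIES])  # rows are unit vectors

def retrieve_top_k(product_text: str, k: int = 2) -> list[str]:
    # Cosine similarity against the category index (the Faiss step).
    scores = index @ embed(product_text)
    top = np.argsort(scores)[::-1][:k]
    return [CATEGORIES[i] for i in top]

def categorize(product_text: str) -> str:
    candidates = retrieve_top_k(product_text)
    # Production passes this shortlist to an LLM for the final decision;
    # this placeholder simply takes the top retrieval hit.
    return candidates[0]
```

Constraining the LLM to a retrieved shortlist, rather than stuffing all 3,000 categories into the context window, is what drives the accuracy and latency gains described above.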
LLM Agents for Complex Data Processing
When RAG systems proved insufficient for highly ambiguous product data, DoorDash implemented LLM agents capable of tool calling and external reasoning. For example, when processing cryptic merchant data like “GRPR NTR 4VB,” agents can perform searches against internal databases, manufacturer websites, or third-party data sources to accurately identify products. This automated the human operator workflow while maintaining accuracy and significantly improving processing speed.
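The agent workflow amounts to a tool-calling loop that escalates through data sources until one identifies the product. The sketch below is illustrative only: the tool names, lookup dictionaries, and the expanded product name are invented, and dictionary lookups stand in for real database queries and web searches.

```python
# Hypothetical data sources; in production these would be an internal
# database query, a manufacturer-website search, and third-party APIs.
INTERNAL_DB = {"GRPR NTR 4VB": None}  # internal lookup misses this SKU
MANUFACTURER_SITE = {"GRPR NTR 4VB": "GreenPro Nature 4-pack Vitamin B"}

TOOLS = [
    ("internal_db", INTERNAL_DB.get),
    ("manufacturer_site", MANUFACTURER_SITE.get),
]

def resolve_sku(raw: str) -> dict:
    """Try each tool in order, recording a trace the way an agent's
    scratchpad would, until one source identifies the product."""
    trace = []
    for name, lookup in TOOLS:
        result = lookup(raw)
        trace.append((name, result))
        if result is not None:
            return {"product": result, "source": name, "trace": trace}
    return {"product": None, "source": None, "trace": trace}
```

Keeping the trace of which tool answered is what makes the evaluation and hallucination checks described below tractable: every resolved product can be attributed to a concrete source.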
The agent implementation required careful attention to hallucination risks and robust evaluation frameworks. DoorDash emphasizes the importance of strong evaluation protocols and prompt engineering to maintain reliability when agents make external calls and synthesize information from multiple sources.
Scaling LLM Inference in Production
DoorDash implemented several strategies for scaling LLM inference across their product catalog of over 100 million SKUs. Their model cascading approach uses a funnel strategy: starting with heuristics for simple cases, applying in-house fine-tuned models for moderate complexity, and reserving API calls to frontier models only for the most complex cases. This approach handles approximately 88% of SKUs before requiring expensive large model calls, resulting in substantial cost savings and latency improvements.
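The funnel can be expressed as a confidence-gated cascade. The stages and thresholds below are assumptions for illustration: keyword rules stand in for the heuristics, and canned (label, confidence) pairs stand in for the in-house fine-tuned model and the frontier-model API call.

```python
def heuristic_stage(sku: str):
    # Cheap rules resolve the easy bulk of SKUs at near-zero cost.
    rules = {"banana": "Produce", "milk": "Dairy"}
    for keyword, label in rules.items():
        if keyword in sku.lower():
            return label, 0.99
    return None, 0.0

def small_model_stage(sku: str):
    # Stand-in for an in-house fine-tuned model with a confidence score.
    return "General Grocery", 0.7

def frontier_model_stage(sku: str):
    # Stand-in for an expensive frontier-model API call; always answers.
    return "Specialty Item", 1.0

def classify(sku: str, threshold: float = 0.9) -> str:
    # Walk down the funnel; only low-confidence cases fall through to
    # the next (more expensive) stage.
    for stage in (heuristic_stage, small_model_stage, frontier_model_stage):
        label, confidence = stage(sku)
        if label is not None and confidence >= threshold:
            return label
    return label  # last stage's answer is accepted unconditionally
```

Under this structure, only the SKUs that every cheaper stage is unsure about (roughly 12% in DoorDash's reported numbers) ever reach the frontier model.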
They also employ output distillation, where larger models are used to generate training data for smaller, more efficient models that can be deployed in production with improved throughput and reduced costs while maintaining acceptable accuracy.
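A distillation pipeline of this shape boils down to using teacher outputs as supervised labels. The sketch below is a hypothetical illustration: a keyword rule stands in for the expensive teacher model, and the JSONL prompt/completion record format is an assumption, not DoorDash's actual schema.

```python
import json

def teacher_label(sku_text: str) -> str:
    # Placeholder for a frontier-model call that produces the label.
    return "Beverages" if "cola" in sku_text.lower() else "Other"

def build_training_records(skus: list[str]) -> list[str]:
    # Each teacher output becomes one JSONL fine-tuning record for the
    # smaller student model.
    records = []
    for sku in skus:
        records.append(json.dumps({
            "prompt": f"Categorize this product: {sku}",
            "completion": teacher_label(sku),
        }))
    return records
```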
Hierarchical RAG-Powered Personalization
For personalization, DoorDash developed a sophisticated hierarchical RAG system that addresses the cold-start problem by combining traditional collaborative filtering with the world knowledge of LLMs. The system selects terminal nodes in the product taxonomy (e.g., “cake flour”), uses RAG to retrieve relevant product types from the extensive catalog, asks an LLM to recommend complementary items based on world knowledge, maps those recommendations back to the catalog structure, and recursively expands child nodes until reaching terminal categories.
This approach enables personalization even for new customers or products without historical interaction data. The system can recommend relevant items like butter, sugar, baking powder, and vanilla extract for cake flour purchases, even if no previous customer has made such combinations on their platform.
LLM-Based Evaluation Systems
DoorDash uses LLMs for evaluation of their recommendation systems, employing either more powerful models or councils of LLMs with chain-of-thought reasoning to assess recommendation quality. They implement a scoring system (3 for highly relevant, 2 for somewhat relevant, 1 for not relevant) and use explanations as feedback loops to improve both traditional ML models and RAG-based recommendations. This creates a continuous improvement cycle where LLM evaluations inform model refinements.
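A rubric-based LLM judge of this kind can be sketched as follows, using the 3/2/1 scale from the case study. The rubric wording, response format, and `call_llm` function are assumptions for illustration; the canned response stands in for a real model call (or a council of models with chain-of-thought reasoning).

```python
RUBRIC = (
    "Rate the recommendation for the anchor item: "
    "3 = highly relevant, 2 = somewhat relevant, 1 = not relevant. "
    "Reply as '<score>|<one-sentence explanation>'."
)

def call_llm(prompt: str) -> str:
    # Placeholder: a deterministic canned answer instead of an API call.
    return "3|Butter is a direct baking complement to cake flour."

def judge(anchor: str, recommendation: str) -> tuple[int, str]:
    raw = call_llm(
        f"{RUBRIC}\nAnchor: {anchor}\nRecommendation: {recommendation}"
    )
    score_str, explanation = raw.split("|", 1)
    # The explanation is kept as the feedback signal for model refinement.
    return int(score_str), explanation.strip()
```

The explanation half of each judgment is what closes the loop: it feeds back into both the traditional ML models and the RAG-based recommender.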
Semantic IDs and Future Directions
DoorDash is investing in semantic IDs as their next major advancement: dense representations that encapsulate semantic meaning from product hierarchies and item context rather than traditional numerical embeddings. These semantic IDs will enable prompt-based recommendations where LLM outputs directly correspond to meaningful product identifiers, simplifying RAG systems and catalog mapping while maintaining semantic coherence.
This approach promises to address current limitations including the challenge of expanding catalog spaces, LLM hallucination issues, limited catalog context in LLM knowledge, and complex mapping between LLM outputs and internal product catalogs.
Contextual and Agentic Applications
DoorDash envisions future applications where LLMs enable contextual adaptation of their app experience. They demonstrated a concept for tornado emergency preparedness, where the system could automatically detect weather alerts, understand user profiles (including family composition), and proactively create customized shopping experiences with appropriate emergency supplies, baby care items for families with children, and non-perishable foods.
This agentic approach would combine long-term and short-term memory, real-time context awareness, multi-tool integration (weather APIs, inventory systems, delivery platforms), self-reflection capabilities for continuous learning, and conversational interfaces for dynamic user interaction.
Technical Challenges and Learnings
Throughout their implementation, DoorDash has identified several key challenges and learnings. Fine-tuning proves essential for production deployment, particularly for domain-specific tasks where accuracy requirements are high (such as allergen information where health implications are critical). RAG systems require careful balance between retrieval recall and generation precision. Model cascading and distillation are crucial for cost-effective scaling. LLM evaluation systems can effectively replace expensive human evaluation while providing faster feedback loops. Semantic representations offer promising solutions to catalog mapping and cold-start challenges.
The company’s approach demonstrates a mature understanding of LLMOps principles, emphasizing the importance of layering generative AI on top of traditional ML foundations rather than replacing existing systems entirely. Their progressive implementation strategy, from simple attribute extraction to complex agentic workflows, illustrates a practical path for enterprise LLM adoption while maintaining production reliability and cost efficiency.
DoorDash’s case study represents a comprehensive example of LLMOps at scale, showing how large language models can be effectively integrated into existing ML infrastructure to solve real-world business problems while maintaining the operational requirements of a high-volume consumer platform serving millions of users across diverse product categories.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.
LinkedIn developed Hiring Assistant, an AI agent designed to transform the recruiting workflow by automating repetitive tasks like candidate sourcing, evaluation, and engagement across 1.2+ billion profiles. The system addresses the challenge of recruiters spending excessive time on pattern-recognition tasks rather than high-value decision-making and relationship building. Using a plan-and-execute agent architecture with specialized sub-agents for intake, sourcing, evaluation, outreach, screening, and learning, Hiring Assistant combines real-time conversational interfaces with large-scale asynchronous execution. The solution leverages LinkedIn's Economic Graph for talent insights, custom fine-tuned LLMs for candidate evaluation, and cognitive memory systems that learn from recruiter behavior over time. The result is a globally available agentic product that enables recruiters to work with greater speed, scale, and intelligence while maintaining human-in-the-loop control for critical decisions.