Company
DoorDash
Title
LLM-Assisted Personalization Framework for Multi-Vertical Retail Discovery
Industry
E-commerce
Year
2025
Summary (short)
DoorDash developed an LLM-assisted personalization framework to help customers discover products across their expanding catalog of hundreds of thousands of SKUs spanning multiple verticals including grocery, convenience, alcohol, retail, flowers, and gifting. The solution combines traditional machine learning approaches like two-tower embedding models and multi-task learning rankers with LLM capabilities for semantic understanding, collection generation, query rewriting, and knowledge graph augmentation. The framework balances three core consumer value dimensions—familiarity (showing relevant favorites), affordability (optimizing for price sensitivity and deals), and novelty (introducing new complementary products)—across the entire personalization stack from retrieval to ranking to presentation. While specific quantitative results are not provided, the case study presents this as a production system deployed across multiple discovery surfaces including category pages, checkout aisles, personalized carousels, and search.
## Overall Summary

DoorDash's case study presents their approach to building a production-scale LLM-assisted personalization framework for multi-vertical retail discovery. As DoorDash expanded beyond restaurant delivery into grocery, convenience, alcohol, retail, flowers, and gifting, they faced the challenge of helping customers discover relevant products across a catalog of hundreds of thousands of SKUs. Their solution is a hybrid architecture that combines traditional machine learning techniques with large language models, using each approach where it provides the most value. The framework is organized around three consumer value dimensions—familiarity, affordability, and novelty—and deploys LLMs at specific points in the personalization pipeline where semantic understanding, flexibility, and reasoning are most needed.

This case study is notable for its pragmatic approach to LLM integration in a large-scale production system. Rather than replacing existing ML infrastructure, DoorDash augments traditional recommender systems with LLMs to handle specific tasks that benefit from language understanding and generation. The presentation at KDD 2025's PARIS Workshop shows how they've operationalized this hybrid approach across multiple discovery surfaces on their platform.

## Technical Architecture and LLMOps Approach

The core architecture is a five-step pipeline in which LLMs and traditional ML systems work in concert: attribute blending, collection prospecting, item retrieval and ranking, collection targeting, and presentation. This decomposition of the personalization problem allows different techniques to be applied at different stages.

The fundamental philosophy articulated in the case study is that classical recommender systems handle reliable retrieval and ranking at scale, while LLMs inject semantic understanding and agility where text, concepts, cold-start situations, and long-tail intent matter most. This division of labor is a key LLMOps consideration: LLMs bring capabilities that complement, rather than replace, traditional ML approaches.

## LLM Integration Points in Production

DoorDash deploys LLMs at several specific points in the personalization stack. For collection generation, LLMs create topical collections that organize products semantically, surfacing coherent groups of items whose conceptual relationships go beyond what behavioral signals alone would capture.

For query understanding and rewriting, LLMs help disambiguate user intent. The case study gives the example of the search term "ragu," which could refer to pasta sauce, a restaurant, or a brand. By combining dietary preferences, brand affinities, price sensitivity, and past shopping habits with LLM-powered query understanding, the system can better interpret what users actually mean and rank results accordingly.

LLMs also generate natural language explanations for recommendations, which presumably helps with user trust and transparency, though the case study doesn't specify how these explanations appear in the interface or what effect they have on user behavior.

Another integration point is knowledge graph augmentation: DoorDash uses LLMs to enrich their product knowledge graphs with semantic relationships.
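The case study doesn't show how this enrichment is implemented. As a loose illustration only (the function, prompt, and JSON schema below are assumptions, not DoorDash's actual pipeline), an LLM can be asked to propose candidate edges between a restaurant dish and a set of known retail categories, with the output constrained to existing graph nodes so that hallucinated categories are dropped:

```python
# Hypothetical sketch of LLM-based knowledge graph enrichment.
# `call_llm` stands in for whatever chat-completion client is available.
import json
from typing import Callable

def propose_graph_edges(
    dish: str,
    candidate_categories: list[str],
    call_llm: Callable[[str], str],
) -> list[dict]:
    """Ask an LLM which retail categories relate to a restaurant dish."""
    prompt = (
        "A customer frequently orders the restaurant dish below. From the "
        "candidate retail categories, return the semantically related ones "
        "(ingredients, kits, condiments) as JSON: "
        '[{"category": "...", "relation": "...", "confidence": 0.0}].\n'
        f"Dish: {dish}\n"
        f"Candidate categories: {', '.join(candidate_categories)}"
    )
    raw = call_llm(prompt)
    try:
        edges = json.loads(raw)
    except json.JSONDecodeError:
        return []  # a production system would log, retry, or fall back here
    allowed = set(candidate_categories)
    # Keep only edges pointing at known categories to avoid hallucinated nodes.
    return [e for e in edges if isinstance(e, dict) and e.get("category") in allowed]
```

Edges that pass this kind of validation would then be merged back into the product knowledge graph.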
This is particularly valuable for cross-vertical novelty recommendations—for instance, translating a user's restaurant ordering history (frequent ramen orders) into relevant retail grocery suggestions (instant ramen kits, Asian condiments). The LLM can reason about these cross-domain relationships in ways that pure collaborative filtering might miss.

The system also uses LLMs to summarize past orders into vector context. This is an interesting application where the LLM acts as a compression and abstraction mechanism, converting sparse transactional history into dense semantic representations that can be more efficiently processed by downstream components.

## Scaling Challenges and Solutions

A critical aspect of this case study from an LLMOps perspective is how DoorDash addresses the scaling challenges of deploying LLMs against a catalog of millions of items across thousands of merchants. Naive prompting or brute-force generation would be impractical at this scale, both from a latency and cost perspective.

Their primary solution is Hierarchical Retrieval-Augmented Generation (RAG). Rather than attempting to include the entire catalog in LLM context windows, they narrow the context using category trees and structured retrieval before calling the LLM. This approach keeps prompts compact, inference fast, and recommendations precise even as the catalog grows. The hierarchical aspect is particularly important—by using the natural hierarchy of product categories, they can progressively narrow down the relevant subset of the catalog before engaging the LLM for semantic reasoning.

This hierarchical RAG implementation represents a thoughtful approach to the context length and computational cost challenges that often plague LLM deployments. By doing intelligent pre-filtering using traditional retrieval techniques, they ensure that the LLM only needs to reason over a manageable subset of items, reducing both latency and API costs.

## Semantic IDs as a Reusable Abstraction

The second major infrastructure pattern DoorDash developed is Semantic IDs. These are described as compact, meaning-rich embeddings that encode catalog hierarchy. Semantic IDs serve multiple purposes in the system: they enable cold-start personalization for new users or items where behavioral signals are sparse, they power free-text-to-product retrieval (e.g., "show me cozy fall candles"), they enable intent-aligned recommendations for complex tasks like gifting or recipe generation, and they provide a shared semantic layer that can be reused across recommendations, search, and future agentic workflows.

The Semantic IDs concept represents an interesting LLMOps pattern where the system creates an intermediate representation that bridges traditional collaborative filtering approaches and LLM-based semantic reasoning. By encoding products into this semantic space, they create a common language that different components of the system can use, reducing redundancy and improving consistency across surfaces.

From an operational perspective, this abstraction likely simplifies the deployment and maintenance of their LLM-powered features. Rather than each use case needing its own custom integration with the product catalog, they can all work through this shared semantic layer. This kind of reusable abstraction is crucial for making LLM deployments maintainable at scale.
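Neither the retrieval stack nor the prompt format is described in the source, but the hierarchical narrowing pattern above can be sketched roughly as follows (category paths, the embedding index, and all names are assumptions; in production the similarity step would typically run against an approximate-nearest-neighbor index rather than a dense scan):

```python
# Minimal sketch of hierarchical RAG over a product catalog: prune by category
# tree first, rank by embedding similarity second, and only then hand a compact
# shortlist to the LLM. Structures and names are illustrative.
from dataclasses import dataclass

import numpy as np

@dataclass
class Product:
    sku: str
    name: str
    category_path: tuple[str, ...]  # e.g. ("grocery", "pantry", "noodles")
    embedding: np.ndarray           # precomputed semantic embedding

def hierarchical_candidates(
    catalog: list[Product],
    category_prefix: tuple[str, ...],
    query_embedding: np.ndarray,
    top_k: int = 20,
) -> list[Product]:
    """Step 1: restrict to a category subtree. Step 2: rank survivors by similarity."""
    subtree = [
        p for p in catalog
        if p.category_path[: len(category_prefix)] == category_prefix
    ]
    ranked = sorted(
        subtree,
        key=lambda p: float(np.dot(p.embedding, query_embedding)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(user_context: str, shortlist: list[Product]) -> str:
    """Step 3: the LLM only ever sees the compact shortlist, never the full catalog."""
    items = "\n".join(f"- {p.sku}: {p.name}" for p in shortlist)
    return (
        f"Shopper context: {user_context}\n"
        f"Candidate products:\n{items}\n"
        "Select and order the items that best fit this shopper, with a one-line reason for each."
    )
```

The point of this structure is that prompt size is bounded by `top_k` rather than by catalog size, which is what keeps inference latency and cost stable as the catalog grows.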
## Traditional ML Foundation

While this is fundamentally an LLMOps case study, it's important to note that the foundation of DoorDash's personalization system remains traditional machine learning. For the familiarity dimension, they use a two-tower embedding model that learns customer and item representations from sparse order histories, engagement sequences, numerical and context features, and pre-trained embeddings. At serving time, they perform dot product scoring against an item-embedding index for efficient top-N recall, blending in recency, popularity, and reorder signals.

For ranking, they deploy multi-task rankers with a mixture-of-experts design that optimizes for multiple outcomes simultaneously: click-through, add-to-cart, in-session conversion, and delayed conversion. These models share a common representation but specialize per surface, allowing them to balance relevance with exploration appropriately for different contexts.

This traditional ML foundation is what enables the system to operate at the scale and latency requirements of a production e-commerce platform. The LLMs augment this foundation rather than replacing it, which is a key architectural decision that makes the system practical for real-time serving.

## Affordability and Value Optimization

For the affordability dimension, DoorDash models price sensitivity, bulk and size preferences, and stock-up behavior. These signals feed into a Value-to-Consumer optimization objective that upranks items delivering the best value for each customer. They also operate a Deals Generation Engine that actively pairs the right discounts with the right customers within budget, efficiency, and marketplace constraints.

While the case study doesn't explicitly state that LLMs are involved in the affordability dimension, it's plausible that LLMs help with understanding product equivalences and substitutions that inform value optimization—for example, recognizing that a customer who buys premium pasta sauce might be interested in a deal on a similar but more affordable brand.

## Novelty and Cross-Vertical Discovery

The novelty dimension is where LLM capabilities seem particularly valuable. DoorDash distinguishes between intra-vertical novelty (surfacing new and complementary items based on co-purchase patterns within a category) and cross-vertical novelty (translating restaurant history into retail discovery).

The cross-vertical novelty use case demonstrates the reasoning capabilities that LLMs bring to the system. By combining consumer clusters with food and retail knowledge graphs, the system can make conceptual leaps that would be difficult to learn from behavioral signals alone. Someone who orders ramen frequently from restaurants might indeed be interested in instant ramen kits or Asian condiments when shopping for groceries, but this connection requires understanding the semantic relationship between restaurant dishes and grocery ingredients. This is where the LLM's ability to reason about conceptual relationships and transfer knowledge across domains becomes valuable. The knowledge graph augmentation mentioned earlier likely plays a crucial role here, with the LLM helping to populate and enrich connections between restaurant menu items and retail products.
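How those graph-backed suggestions get assembled isn't spelled out in the case study. As a hypothetical sketch under stated assumptions (edge structure, confidence weights, and function names are all invented for illustration), cross-vertical candidates could be produced by walking the enriched dish-to-category edges from a customer's restaurant order history:

```python
# Hypothetical sketch: restaurant order history -> retail category candidates,
# using LLM-enriched knowledge graph edges. Weights and structures are illustrative.
from collections import Counter, defaultdict

def cross_vertical_candidates(
    order_history: list[str],                                # restaurant dishes the user ordered
    dish_to_categories: dict[str, list[tuple[str, float]]],  # enriched edges: dish -> (category, confidence)
    top_k: int = 5,
) -> list[str]:
    """Score retail categories by how often, and how confidently, they connect to past dishes."""
    dish_counts = Counter(order_history)
    scores: defaultdict[str, float] = defaultdict(float)
    for dish, count in dish_counts.items():
        for category, confidence in dish_to_categories.get(dish, []):
            scores[category] += count * confidence
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: a frequent ramen customer gets Asian condiments and instant noodles ranked highest.
print(cross_vertical_candidates(
    ["tonkotsu ramen"] * 6 + ["pad thai"] * 2,
    {
        "tonkotsu ramen": [("instant noodles", 0.9), ("asian condiments", 0.8)],
        "pad thai": [("rice noodles", 0.9), ("asian condiments", 0.7)],
    },
))
```

In a design like this, the LLM's contribution sits upstream in creating and validating the edges, while the serving-time traversal stays cheap and deterministic.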
## Production Deployment Considerations

The case study describes the system as deployed across multiple discovery surfaces: category pages where relevant items rise to the top, checkout aisles where complementary items help customers complete baskets, personalized carousels surfacing relevant collections on home and store pages, and search results that reflect individual user intent.

From an LLMOps perspective, deploying across these multiple surfaces creates several operational challenges. Each surface likely has different latency requirements—search might need to be nearly instantaneous, while carousel generation could potentially be pre-computed or cached. The case study doesn't provide specific details on how they handle these different latency profiles, but the hierarchical RAG approach and use of Semantic IDs as a shared layer suggest they've thought carefully about making LLM inference efficient enough for real-time serving.

The mention of making LLM-powered personalization "scalable, cost-effective, and reusable" indicates that cost management is a key concern. This is typical for production LLM deployments, especially in contexts where millions of predictions need to be made daily across a large user base. The focus on compact prompts and targeted LLM usage suggests they're actively managing inference costs.

## Evaluation and Validation

One notable gap in this case study is the lack of specific quantitative results or evaluation metrics. The presentation doesn't provide A/B test results, uplift metrics, or other quantitative evidence of the system's effectiveness. This is somewhat common in conference presentations and blog posts where companies may be reluctant to share specific business metrics, but it does limit our ability to assess the actual impact of the LLM integration.

The case study also doesn't discuss how they evaluate the quality of LLM-generated content (collections, explanations, query rewrites) or how they guard against typical LLM failure modes like hallucination, inconsistency, or bias. For a production system, these evaluation and quality assurance processes would be critical LLMOps components, but they're not covered in the available material.

## Tradeoffs and Critical Assessment

From a balanced perspective, this case study represents a pragmatic and thoughtful approach to integrating LLMs into an existing production ML system. The hybrid architecture that uses LLMs where they add specific value while relying on traditional ML for scalable retrieval and ranking is sound engineering. The focus on making LLM integration efficient through hierarchical RAG and reusable abstractions like Semantic IDs shows operational maturity.

However, several caveats are worth noting. First, without quantitative results, it's difficult to assess whether the LLM integration actually delivers meaningful business value or user experience improvements over a purely traditional ML approach. The case study reads somewhat like a technical showcase of what's possible rather than a rigorous evaluation of what's effective. Second, the complexity added by integrating LLMs into the system is non-trivial. The five-step pipeline with LLMs involved at multiple points creates additional dependencies, potential failure modes, and maintenance burden. Whether this complexity is justified depends on the incremental value delivered, which isn't quantified in the case study.
Third, the case study doesn't discuss some important operational concerns like monitoring and debugging of LLM components in production, handling of LLM failures or degradations, cost management in detail, or how they update and iterate on LLM components without disrupting the overall system.

Finally, there's some promotional language in the case study ("paradigm shift," "more than a vision; it's our daily mission") that should be viewed skeptically. The actual technical content is more incremental and evolutionary than revolutionary—it's a sensible integration of LLM capabilities into an existing personalization stack, not a fundamental reimagining of how personalization works.

## Key LLMOps Lessons

Despite these caveats, the case study does offer several valuable lessons for LLMOps practitioners. The principle of using each approach where it shines—traditional ML for scalable retrieval and ranking, LLMs for semantic understanding and reasoning—is a sound architectural philosophy that many organizations could benefit from. Too often, the conversation around LLMs treats them as a replacement for existing ML systems rather than a complement.

The focus on building scalable abstractions like hierarchical RAG and Semantic IDs is another important lesson. These abstractions make LLM capabilities more accessible across the organization and help manage the complexity and cost of LLM deployment. Creating reusable components that can serve multiple use cases is crucial for making LLM investments sustainable.

The emphasis on anchoring on clear objectives (familiarity, affordability, novelty) provides a framework for making decisions about where and how to deploy LLMs. This kind of principled approach helps avoid the trap of using LLMs simply because they're novel or fashionable, and instead focuses on where they deliver actual value against defined objectives.

Overall, this case study represents a mature, production-focused approach to LLM integration that prioritizes pragmatism and operational efficiency over novelty. While more quantitative evaluation would strengthen the case, the technical architecture and operational patterns described offer valuable insights for organizations building similar systems.
