DoorDash implemented an advanced search system using LLMs to better understand and process complex food delivery search queries. They combined LLMs with knowledge graphs for query segmentation and entity linking, using retrieval-augmented generation (RAG) to constrain outputs to their controlled vocabulary. The system improved popular dish carousel trigger rates by 30%, increased whole page relevance by over 2%, and led to higher conversion rates while maintaining high precision in query understanding.
DoorDash, the food delivery platform, developed an LLM-powered query understanding system to improve their search retrieval capabilities. The core challenge they faced was that users often conduct precise searches with compound requirements—queries like “vegan chicken sandwich” that combine multiple constraints. Traditional embedding-based retrieval systems struggled with these queries because they would return results based on document similarity rather than strict attribute matching, potentially returning chicken sandwiches (not vegan) alongside vegan chicken sandwiches. For dietary restrictions in particular, this imprecision is problematic since users expect strict enforcement of such preferences.
The solution integrates LLMs into their search pipeline for query segmentation and entity linking, leveraging their existing knowledge graph infrastructure to constrain outputs and maintain high precision. This represents a production-grade implementation of LLMs within a search system serving a major consumer platform.
DoorDash’s search architecture follows a two-track approach: one pipeline for processing documents (items and stores) and another for processing queries. On the document side, they have built comprehensive knowledge graphs for both food items and retail products. These graphs define hierarchical relationships between entities and annotate documents with rich metadata including dietary preferences, flavors, product categories, and quantities.
For example, a retail item like “Non-Dairy Milk & Cookies Vanilla Frozen Dessert - 8 oz” carries metadata tags such as Dietary Preference: “Dairy-free”, Flavor: “Vanilla”, Product Category: “Ice cream”, and Quantity: “8 oz”. This metadata is ingested into the search index and makes rich attribute-based retrieval possible.
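As a rough illustration of why this metadata matters, attribute-tagged documents support exact matching rather than similarity-based retrieval. The schema, field names, and helper below are hypothetical, not DoorDash's actual index format:

```python
# Hypothetical sketch of a knowledge-graph-annotated index document;
# field names are illustrative, not DoorDash's actual schema.
document = {
    "title": "Non-Dairy Milk & Cookies Vanilla Frozen Dessert - 8 oz",
    "metadata": {
        "dietary_preference": "Dairy-free",
        "flavor": "Vanilla",
        "product_category": "Ice cream",
        "quantity": "8 oz",
    },
}

def matches_attributes(doc: dict, required: dict) -> bool:
    """Return True only if every required attribute matches exactly."""
    meta = doc["metadata"]
    return all(meta.get(field) == value for field, value in required.items())

# A dairy-free ice cream query matches on attributes, not string similarity.
print(matches_attributes(document, {"dietary_preference": "Dairy-free",
                                    "product_category": "Ice cream"}))
```

Unlike embedding similarity, this kind of attribute filter cannot accidentally admit a near-miss document that violates a dietary constraint.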
The query understanding module uses LLMs for two critical tasks: query segmentation and entity linking.
Traditional query segmentation methods rely on statistical approaches like pointwise mutual information (PMI) or n-gram analysis. These methods struggle with complex queries containing multiple overlapping entities or ambiguous relationships. For instance, in “turkey sandwich with cranberry sauce,” traditional methods may not correctly determine whether “cranberry sauce” is a separate item or an attribute of the sandwich.
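PMI itself is straightforward to compute, which is part of its appeal; the toy example below scores bigrams over a tiny made-up query log to show how “cranberry sauce” emerges as a cohesive segment while “with cranberry” does not. The corpus and scoring are illustrative only:

```python
import math
from collections import Counter

# Toy query log for illustrating PMI-based segmentation scoring.
queries = [
    ["turkey", "sandwich"],
    ["turkey", "sandwich", "combo"],
    ["cranberry", "sauce"],
    ["turkey", "sandwich", "with", "cranberry", "sauce"],
    ["sandwich", "with", "fries"],
]
unigrams = Counter(tok for q in queries for tok in q)
bigrams = Counter(tuple(q[i:i + 2]) for q in queries for i in range(len(q) - 1))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1: str, w2: str) -> float:
    """Pointwise mutual information: high values suggest w1 w2 form a segment."""
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_xy / (p_x * p_y))

# "cranberry sauce" co-occurs more tightly than "with cranberry".
print(pmi("cranberry", "sauce") > pmi("with", "cranberry"))  # True
```

The weakness the article describes is visible here: PMI can tell that “cranberry sauce” is a unit, but it has no notion of whether that unit is a standalone item or an attribute of the sandwich.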
DoorDash’s approach prompts LLMs not just to segment queries but to classify each segment under their taxonomy categories. Rather than producing arbitrary segments like ["small", "no-milk", "vanilla ice cream"], the model outputs structured mappings: {Quantity: "small", Dietary_Preference: "no-milk", Flavor: "vanilla", Product_Category: "ice cream"}. This structured output approach improves segmentation accuracy because the categories provide additional context about possible relationships between segments.
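A minimal sketch of handling such structured output, assuming a JSON-returning model; the taxonomy, prompt template, and canned response are illustrative, not DoorDash's actual schema:

```python
import json

# Hypothetical taxonomy of allowed segment categories.
TAXONOMY = {"Quantity", "Dietary_Preference", "Flavor", "Product_Category"}

SEGMENTATION_PROMPT = """Segment the query into spans and label each span with
one of these taxonomy categories: {categories}.
Return JSON only.

Query: {query}"""

def parse_segmentation(llm_response: str) -> dict:
    """Parse the model's structured output and validate it against the taxonomy."""
    segments = json.loads(llm_response)
    unknown = set(segments) - TAXONOMY
    if unknown:
        raise ValueError(f"segments outside taxonomy: {unknown}")
    return segments

# Stand-in for a real model call on "small no-milk vanilla ice cream".
fake_response = ('{"Quantity": "small", "Dietary_Preference": "no-milk", '
                 '"Flavor": "vanilla", "Product_Category": "ice cream"}')
print(parse_segmentation(fake_response))
```

Validating category names at parse time is the first line of defense against the mislabeling problem discussed next.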
Once queries are segmented, the system maps these segments to concepts in the knowledge graph. A segment like “no-milk” should link to the “dairy-free” concept to enable retrieval that isn’t restricted to exact string matching. However, LLMs can hallucinate concepts that don’t exist in the knowledge graph or mislabel segments entirely.
To address this, DoorDash employs retrieval-augmented generation (RAG) to constrain the model’s output to their controlled vocabulary: candidate concepts are first retrieved from the knowledge graph for each segment, and the LLM is then prompted to select only from those candidates.
This approach reduces hallucinations by ensuring the model selects from concepts already in the knowledge graph rather than generating arbitrary outputs.
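The constraint logic can be sketched as follows, with a toy overlap scorer standing in for embedding retrieval and a callable standing in for the LLM; all names here are hypothetical:

```python
# Minimal sketch of RAG-constrained entity linking. Assumptions: a toy
# token-overlap scorer replaces embedding retrieval, and `choose` replaces
# an LLM that must pick from the candidate list it is shown.
KNOWLEDGE_GRAPH_CONCEPTS = ["dairy-free", "gluten-free", "vegan", "vanilla"]

def retrieve_candidates(segment: str, k: int = 3) -> list:
    """Cheap stand-in for embedding retrieval: rank concepts by token overlap."""
    def score(concept: str) -> int:
        return len(set(segment.replace("-", " ").split())
                   & set(concept.replace("-", " ").split()))
    return sorted(KNOWLEDGE_GRAPH_CONCEPTS, key=score, reverse=True)[:k]

def link_entity(segment: str, choose) -> str:
    """Constrain the model's answer to retrieved knowledge-graph concepts."""
    candidates = retrieve_candidates(segment)
    answer = choose(segment, candidates)  # LLM picks one candidate
    if answer not in candidates:          # reject hallucinated concepts
        raise ValueError(f"'{answer}' is not a retrieved knowledge-graph concept")
    return answer

# A model that answers from the candidate list maps "no-milk" to "dairy-free";
# any concept outside the knowledge graph is rejected outright.
print(link_entity("no-milk", lambda seg, cands: "dairy-free"))
```

The key property is that the final check makes hallucinated concepts a hard failure rather than a silent retrieval error.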
The article notes that the hallucination rate on segmentation is less than 1%, but even this low rate requires mitigation given the importance of attributes like dietary preferences, so several safeguards are layered on top of the RAG constraints.
The article provides valuable insights into production trade-offs when deploying LLMs for search. Using LLMs for batch inference on fixed query sets provides highly accurate results but has significant drawbacks: precomputed annotations cannot cover novel or tail queries in real time, and scaling batch inference across the full query space is costly.
To balance these concerns, DoorDash employs a hybrid approach combining LLM-based batch processing with methods that generalize well to unseen queries, including embedding retrieval, traditional statistical models (like BM25), and rule-based systems. This hybrid architecture leverages the deep contextual understanding of LLMs while maintaining real-time adaptability for novel queries.
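One way to picture the hybrid routing, assuming a precomputed cache of LLM annotations for head queries; the names and structures below are illustrative, not DoorDash's actual design:

```python
# Sketch of hybrid query understanding: precomputed LLM annotations serve
# head queries, while generalizing methods handle unseen (tail) queries.
BATCH_LLM_CACHE = {
    "vegan chicken sandwich": {"Dietary_Preference": "vegan",
                               "Product_Category": "chicken sandwich"},
}

def realtime_fallback(query: str) -> dict:
    """Stand-in for embedding retrieval / BM25 / rule-based understanding."""
    return {"raw_query": query}

def understand_query(query: str) -> dict:
    key = query.strip().lower()
    if key in BATCH_LLM_CACHE:        # head query: precise batch LLM annotation
        return BATCH_LLM_CACHE[key]
    return realtime_fallback(key)     # novel query: generalizing methods

print(understand_query("Vegan Chicken Sandwich"))
print(understand_query("spicy ramen"))
```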
The effectiveness of the query understanding system depends on integration with downstream components, particularly the rankers that order retrieved documents by relevance. The new query understanding signals needed to be made available to the rankers, which then had to adapt to both the new signals and changed patterns of consumer engagement that the retrieval improvements introduced.
The structured query understanding enables specific retrieval logic—for example, enforcing dietary restrictions as MUST conditions (strict matching) while treating softer attributes like flavor as SHOULD conditions (preferred but not required). This granular control over retrieval logic would not be possible with purely embedding-based approaches.
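A sketch of how structured segments might be compiled into such clauses, using an Elasticsearch-style bool query as an assumed target format; the field names are hypothetical:

```python
# Compile structured query segments into MUST (strict) and SHOULD (soft)
# retrieval clauses, in the shape of an Elasticsearch bool query.
def build_retrieval_query(segments: dict) -> dict:
    must, should = [], []
    for category, value in segments.items():
        clause = {"term": {f"metadata.{category.lower()}": value}}
        if category == "Dietary_Preference":   # strict enforcement
            must.append(clause)
        else:                                  # soft preference
            should.append(clause)
    return {"bool": {"must": must, "should": should}}

query = build_retrieval_query({"Dietary_Preference": "vegan",
                               "Flavor": "spicy",
                               "Product_Category": "chicken sandwich"})
print(query)
```

Under this scheme a non-vegan chicken sandwich is excluded outright, while a non-spicy vegan one is merely ranked lower.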
The implementation demonstrated significant improvements across multiple metrics: popular dish carousel trigger rates rose by 30%, whole page relevance improved by over 2%, and conversion rates increased, all while precision in query understanding remained high.
The validated LLM integration opens several future possibilities: query rewriting and search path recommendations, surfacing popular items for new users, improving retrieval recall and precision through more granular attributes, and deeper consumer behavior understanding for personalization (e.g., inferring that a consumer prefers spicy dishes and Latin American cuisines).
This case study represents a well-architected integration of LLMs into a production search system. The approach is notable for several reasons: it leverages existing infrastructure (knowledge graphs) to constrain LLM outputs, it acknowledges and addresses the hallucination problem systematically, it considers production trade-offs between accuracy and scalability, and it validates results with both offline evaluation and online A/B testing.
The hybrid approach—combining LLM batch processing with real-time generalization methods—reflects a pragmatic production mindset. The system doesn’t rely solely on LLMs but uses them where they add the most value (complex query understanding) while maintaining traditional methods for scalability and real-time processing.
One limitation worth noting is that the article focuses primarily on batch inference for query understanding rather than real-time LLM inference, which would introduce additional latency and cost considerations. The long-term maintenance burden of keeping the knowledge graph and query understanding in sync is also a consideration for production systems at this scale.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
Toyota Motor North America (TMNA) and Toyota Connected built a generative AI platform to help dealership sales staff and customers access accurate vehicle information in real-time. The problem was that customers often arrived at dealerships highly informed from internet research, while sales staff lacked quick access to detailed vehicle specifications, trim options, and pricing. The solution evolved from a custom RAG-based system (v1) using Amazon Bedrock, SageMaker, and OpenSearch to retrieve information from official Toyota data sources, to a planned agentic platform (v2) using Amazon Bedrock AgentCore with Strands agents and MCP servers. The v1 system achieved over 7,000 interactions per month across Toyota's dealer network, with citation-backed responses and legal compliance built in, while v2 aims to enable more dynamic actions like checking local vehicle availability.