Using LLMs to Enhance Search Discovery and Recommendations

Instacart 2024

Instacart integrated LLMs into their search stack to enhance product discovery and user engagement. They developed two content generation techniques: a basic approach using LLM prompting and an advanced approach incorporating domain-specific knowledge from query understanding models and historical engagement data. The system generates complementary and substitute product recommendations, with content generated offline in batch and served at query time via a key-value lookup. The implementation resulted in significant improvements in user engagement and revenue, while addressing challenges in content quality, ranking, and evaluation.

Industry

E-commerce

Overview

Instacart, a major grocery e-commerce platform operating a four-sided marketplace, developed an LLM-powered system to enhance their search experience with discovery-oriented content. The case study from 2024 details how they moved beyond traditional search relevance to incorporate inspirational content that helps users find products they might not have explicitly searched for but would find valuable.

The core business problem was that while Instacart’s search was effective at returning directly relevant results, user research revealed a desire for more inspirational and discovery-driven content. The existing “Related Items” section was limited in its approach: for narrow queries like “croissant,” it would return loosely related items such as “cookies” simply because they shared a department category. Additionally, the system failed to suggest complementary products that would naturally pair with search results (e.g., suggesting soy sauce and rice vinegar for a “sushi” search).

LLM Integration Strategy

Instacart’s approach to integrating LLMs into their production search stack was deliberate and multi-faceted. They identified two key advantages of LLMs for this use case: rich world knowledge that eliminates the need for building extensive knowledge graphs, and improved debuggability through transparent reasoning processes that allow developers to quickly identify and correct errors by adjusting prompts.

The team built upon their earlier success with “Ask Instacart,” which handled natural language-style queries, and extended LLM capabilities to enhance search results for all queries, not just broad-intent ones.

Content Generation Techniques

Basic Generation

The basic generation technique involves instructing the LLM to act as an AI assistant for online grocery shopping. The prompt asks the LLM to generate three shopping lists for each query: one list of substitute items and two lists of complementary, bought-together products.

The output is structured as JSON with categories for substitutes, complementary items, and themed collections. For example, an “ice cream” query would generate substitute frozen treats, complementary toppings and sauces, and themed lists like “Sweet Summer Delights.”
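As a rough illustration, a minimal version of this step might look like the sketch below. The prompt wording, the `llm_client` interface, and the exact JSON schema are assumptions for illustration; Instacart has not published its actual prompt.

```python
import json

# Illustrative prompt template; the real wording used by Instacart is not public.
BASIC_PROMPT = """You are an AI assistant helping users shop for groceries online.
For the search query "{query}", produce three shopping lists as JSON:
one list of substitute items, and two lists of complementary,
frequently-bought-together products. Give each list a short, user-facing title.
Return only valid JSON with keys "substitutes" and "complementary"."""

def generate_basic_content(llm_client, query: str) -> dict:
    """Call the LLM once per query and parse the structured response.
    `llm_client.complete` is a stand-in for any chat-completion API."""
    response = llm_client.complete(BASIC_PROMPT.format(query=query))
    return json.loads(response)

# Plausible shape of a parsed response for the query "ice cream":
# {
#   "substitutes": {"title": "Frozen Treats", "items": ["frozen yogurt", "sorbet"]},
#   "complementary": [
#     {"title": "Toppings & Sauces", "items": ["hot fudge", "sprinkles"]},
#     {"title": "Sweet Summer Delights", "items": ["waffle cones", "fresh berries"]}
#   ]
# }
```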

Advanced Generation

The advanced generation technique emerged from recognizing that basic generation often misinterpreted user intent or generated overly generic recommendations. For instance, a search for “Just Mayo” (a vegan mayonnaise brand) would be misinterpreted as generic mayonnaise, and “protein” would return common protein sources rather than the protein bars and powders that users actually converted on.

To address this, Instacart augmented prompts with domain-specific signals drawn from their query understanding models and historical engagement data. The prompt format for advanced generation explicitly includes annotations such as “:BODYARMOR” for brand queries and “:pizza, :frozen” for attributed product searches, along with the product categories users previously purchased for the query. This fusion of LLM world knowledge with Instacart-specific context significantly improved recommendation accuracy.
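A sketch of how those signals might be fused into a prompt follows. The `brand:`/`attribute:` annotation prefixes and the field names are assumptions (the post shows only fragments like “:BODYARMOR”), so treat this as one plausible encoding rather than Instacart’s actual format.

```python
from dataclasses import dataclass, field

@dataclass
class QuerySignals:
    """Domain-specific context for one query; field names are illustrative."""
    query: str
    brand: str | None = None                                 # from the QU model
    attributes: list[str] = field(default_factory=list)      # e.g. ["pizza", "frozen"]
    purchased_categories: list[str] = field(default_factory=list)  # past conversions

def build_advanced_prompt(base_template: str, signals: QuerySignals) -> str:
    """Annotate the query with QU signals and purchase history before
    handing it to the LLM, fusing world knowledge with Instacart context."""
    annotations = []
    if signals.brand:
        annotations.append(f"brand:{signals.brand}")         # e.g. brand:BODYARMOR
    annotations.extend(f"attribute:{a}" for a in signals.attributes)
    context = (
        f"Query: {signals.query}\n"
        f"Annotations: {', '.join(annotations) or 'none'}\n"
        f"Categories previously purchased for this query: "
        f"{', '.join(signals.purchased_categories) or 'none'}"
    )
    return base_template.format(context=context)  # template has a {context} slot
```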

Sequential Search Term Analysis

An innovative extension involved analyzing what users typically search for and purchase after their initial query. By examining next converted search terms, the system provides richer context to the LLM. For “sour cream,” instead of only considering sour cream products, the system incorporates data showing that users frequently purchase tortilla chips or baked potatoes afterward.

The implementation mines frequently co-occurring sequences of consecutive search terms to extract high-quality signals, filtering out noise from partial or varied shopping sessions. This methodology led to an 18% improvement in engagement rate with inspirational content.
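The mining step can be approximated with simple pair counting over ordered session logs, as in the hedged sketch below; the support threshold and data shape are assumptions, not the production logic.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def mine_next_converted_terms(sessions, min_support=50, top_k=5):
    """Count which converted search terms follow each query across sessions,
    keeping only frequent pairs so partial or idiosyncratic sessions drop out.

    `sessions` is an iterable of ordered converted-search-term lists, e.g.
    [["sour cream", "tortilla chips"], ["sour cream", "baked potatoes"], ...]
    """
    pair_counts: Counter = Counter()
    for terms in sessions:
        pair_counts.update(pairwise(terms))

    candidates: dict[str, list[tuple[int, str]]] = {}
    for (query, following), count in pair_counts.items():
        if count >= min_support:  # support threshold filters session noise
            candidates.setdefault(query, []).append((count, following))

    # Keep the top-k most frequent follow-up terms per query.
    return {
        q: [term for _, term in sorted(pairs, reverse=True)[:top_k]]
        for q, pairs in candidates.items()
    }
```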

Data Pipeline and Serving Architecture

The production system implements an offline batch processing architecture optimized for latency and cost:

Data Preparation: A batch job extracts historical search queries from logs and enriches them with necessary metadata, including query understanding (QU) signals, consecutive search terms, and other relevant signals.

Prompt Generation: Using predefined prompt templates as base structures, the system populates templates with enriched queries and associated metadata, creating contextually-rich prompts for each specific query.

LLM Response Generation: A batch job invokes the LLM and stores responses in a key-value store, with the query as key and the LLM response (containing substitute and complementary recommendations) as value.
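In pseudocode, this batch step reduces to a loop that writes one cached response per query; the `llm_client` and `kv_store` interfaces below are generic stand-ins for whatever services Instacart actually uses.

```python
def run_llm_batch(llm_client, kv_store, enriched_prompts: dict[str, str]) -> None:
    """Generate recommendations offline, once per historical query, and cache
    them in a key-value store keyed by the raw query string."""
    for query, prompt in enriched_prompts.items():
        response = llm_client.complete(prompt)  # substitutes + complementary lists
        kv_store.set(f"llm_content:{query}", response)
```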

Response-to-Product Mapping: Each item in the LLM-generated list is treated as a search query and passed through Instacart’s existing search engine to retrieve the best matching products from the catalog.
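This reuse of the existing engine might look as simple as the following; `search_engine.search` is a placeholder for Instacart’s internal retrieval API.

```python
def map_items_to_products(search_engine, llm_items: list[str], k: int = 10) -> dict:
    """Resolve each LLM-generated item name to real catalog products by
    issuing it as an ordinary search query."""
    return {item: search_engine.search(item, limit=k) for item in llm_items}
```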

Post-processing: The pipeline removes duplicates and similar products, filters irrelevant items, and applies diversity-based reranking to ensure users see varied options.
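One simple way to implement deduplication plus diversity-based reranking is to cap how many products any single category contributes, as sketched below; the field names and the per-category cap are illustrative assumptions.

```python
def postprocess(products: list[dict], max_per_category: int = 2) -> list[dict]:
    """Drop duplicate products and rerank for diversity by limiting how many
    items each category can place, assuming `products` is sorted by relevance."""
    seen_ids: set = set()
    per_category: dict[str, int] = {}
    diversified = []
    for p in products:
        if p["product_id"] in seen_ids:
            continue  # duplicate already kept
        if per_category.get(p["category"], 0) >= max_per_category:
            continue  # category already well represented
        seen_ids.add(p["product_id"])
        per_category[p["category"]] = per_category.get(p["category"], 0) + 1
        diversified.append(p)
    return diversified
```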

Runtime Serving: When a user issues a query, the system retrieves standard search results and also looks up the LLM-content table, displaying the inspirational products in carousels with suitable titles.
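At request time the only LLM-related work is a cache lookup, roughly as follows (again with stand-in client interfaces):

```python
import json

def serve_search(query: str, search_engine, kv_store) -> dict:
    """Blend standard search results with precomputed inspirational carousels;
    queries never seen by the offline pipeline simply get no carousels."""
    results = search_engine.search(query)
    cached = kv_store.get(f"llm_content:{query}")
    carousels = json.loads(cached) if cached else []
    return {"results": results, "inspiration_carousels": carousels}
```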

Content Ranking and Page Optimization

With increased content on the page, Instacart faced interface clutter and operational complexity challenges. They developed a “Whole Page Ranker” model that determines optimal positions for new content on the page. This model balances showing highly relevant content to users while maintaining revenue objectives, dynamically adjusting layout based on content type and relevance.
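The post does not describe the Whole Page Ranker’s internals, but the trade-off it encodes can be illustrated with a toy greedy layout that weighs relevance against expected revenue, discounted by position:

```python
def placement_score(block: dict, position: int,
                    w_relevance: float = 1.0, w_revenue: float = 0.5) -> float:
    """Toy scoring: higher slots amplify both relevance and revenue value.
    Weights and the position discount are illustrative, not Instacart's model."""
    discount = 1.0 / (1 + position)
    return discount * (w_relevance * block["relevance"]
                       + w_revenue * block["expected_revenue"])

def layout_page(blocks: list[dict], num_slots: int) -> list[dict]:
    """Greedily assign the best-scoring content block to each page slot."""
    remaining = list(blocks)
    layout = []
    for position in range(min(num_slots, len(blocks))):
        best = max(remaining, key=lambda b: placement_score(b, position))
        remaining.remove(best)
        layout.append(best)
    return layout
```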

Evaluation Approach: LLM as a Judge

A significant challenge was developing robust evaluation methods for discovery-oriented content, as traditional relevance metrics don’t directly apply—discovery content aims to inspire rather than directly answer queries. With the volume of searches and catalog diversity, scalable assessment methods were essential.

Instacart adopted the “LLM as a Judge” paradigm for quality evaluation. The evaluation prompt positions the LLM as an expert in e-commerce recommendation systems and tasks it with evaluating curator-generated content consisting of complementary or substitute search terms. The goal is to assess whether the generated content would encourage users to make purchases, with the LLM providing quality scores.
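A hedged sketch of such a judge follows; the rubric, scoring scale, and JSON contract are assumptions based on the description above, not Instacart’s published prompt.

```python
import json

# Illustrative judge prompt; the real rubric and scale are not public.
JUDGE_PROMPT = """You are an expert in e-commerce recommendation systems.
A curator generated the following complementary or substitute search terms
for the query "{query}":

{generated_terms}

Would this content encourage a user to make a purchase? Respond with JSON:
{{"score": <integer 1-5>, "reason": "<one sentence>"}}"""

def judge_content(llm_client, query: str, generated_terms: list[str]) -> dict:
    """Score one piece of generated discovery content with an LLM judge."""
    prompt = JUDGE_PROMPT.format(query=query,
                                 generated_terms="\n".join(generated_terms))
    return json.loads(llm_client.complete(prompt))
```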

Business Alignment and Results

The team emphasized aligning content generation with business metrics, particularly revenue. The generated content needed to meet user needs while supporting business growth objectives. While specific metrics beyond the 18% engagement improvement aren’t detailed, the post claims substantial improvements in user engagement and revenue.

Technical Considerations and Limitations

The case study honestly acknowledges limitations. The advanced generation approach, while effective, is still restrictive because context is bounded by products users engage with for the current query. This introduces bias that limits truly inspirational content generation; hence the development of the sequential search term analysis approach.

The reliance on offline batch processing means content freshness is limited to daily updates. The system also depends heavily on the quality of Query Understanding models for accurate intent annotation, and the product mapping step introduces potential for irrelevant recalls that require post-processing to filter.

Architectural Insights

The architecture demonstrates a pragmatic approach to LLM deployment: rather than serving LLM calls at request time with associated latency and cost concerns, Instacart pre-computes content offline for known queries. This approach trades off real-time personalization for cost efficiency and consistent latency. The key-value store serving model allows for rapid lookup during user sessions while the batch pipeline handles the computationally expensive LLM inference.

The integration with existing search infrastructure is notable: LLM outputs are treated as queries themselves to leverage existing product matching capabilities, avoiding the need to build entirely new retrieval systems.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Reinforcement Learning for Code Generation and Agent-Based Development Tools

Cursor 2025

This case study examines Cursor's implementation of reinforcement learning (RL) for training coding models and agents in production environments. The team discusses the unique challenges of applying RL to code generation compared to other domains like mathematics, including handling larger action spaces, multi-step tool calling processes, and developing reward signals that capture real-world usage patterns. They explore various technical approaches including test-based rewards, process reward models, and infrastructure optimizations for handling long context windows and high-throughput inference during RL training, while working toward more human-centric evaluation metrics beyond traditional test coverage.

Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash 2025

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.
