ZenML

Enhancing E-commerce Search with LLM-Powered Semantic Retrieval

Picnic 2024

Picnic, an e-commerce grocery delivery company, implemented LLM-enhanced search retrieval to improve product and recipe discovery across multiple languages and regions. They used GPT-3.5-turbo for prompt-based product description generation and OpenAI's text-embedding-3-small model for embedding generation, combined with OpenSearch for efficient retrieval. The system employs precomputation and caching strategies to maintain low latency while serving millions of customers across different countries.

Industry

E-commerce

Overview

Picnic is a European grocery delivery company operating across the Netherlands, Germany, and France, delivering groceries ordered through a mobile application directly to customers’ doorsteps. The company faces a significant challenge: accommodating tens of thousands of products in a mobile interface while serving customers with diverse languages, cultural preferences, and culinary expectations. With millions of unique search terms being used by customers, developing a search system that delivers accurate, fast results became a prime candidate for LLM-enhanced solutions.

The case study illustrates a practical application of LLMs not for generative content creation, but for enhancing an existing machine learning task—specifically, search retrieval. This represents an increasingly common pattern in LLMOps where large language models are used to augment traditional systems rather than replace them entirely.

The Problem Space

The search retrieval challenge at Picnic is multifaceted. Users exhibit a wide variety of behaviors that make simple lookup-based search insufficient: queries arrive in multiple languages, reflect diverse culinary expectations, and often express broad intent (such as "daughter's birthday party") rather than naming a specific product.

These challenges are compounded by rising customer expectations. As users become accustomed to interacting with advanced language models in their daily lives, they expect similar sophistication from e-commerce search experiences.

Technical Architecture and LLM Integration

Core Approach: Prompt-Based Product Description Generation

At the heart of Picnic’s solution is a technique they call “prompt-based product description generation.” This approach transforms search terms into rich descriptions that can be semantically compared against their entire product and recipe catalog. The fundamental insight is that converting a raw search query into a contextual description allows for better semantic matching with product information.

For example, a search query like “daughter’s birthday party” can be transformed by the LLM into a description of products typically associated with such an event, enabling retrieval of relevant items that might not contain those exact keywords.
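The transformation step can be sketched as follows. This is a minimal illustration, not Picnic's actual implementation: the prompt wording is an assumption, and the LLM is stubbed so the sketch is self-contained (in production this would be a GPT-3.5-turbo chat completion call).

```python
def build_expansion_prompt(query: str) -> str:
    """Prompt asking the model to describe products implied by a query.
    The wording here is illustrative, not Picnic's actual prompt."""
    return (
        "A customer of an online grocery store searched for "
        f"'{query}'. Describe the products they are most likely "
        "looking for, as a short list of typical grocery items."
    )

def expand_query(query: str, llm_complete) -> str:
    """`llm_complete` is any callable prompt -> text; in production this
    would be a GPT-3.5-turbo chat completion."""
    return llm_complete(build_expansion_prompt(query))

# Stubbed LLM so the sketch runs offline:
stub_llm = lambda prompt: "birthday cake, candles, balloons, soft drinks, snacks"
description = expand_query("daughter's birthday party", stub_llm)
print(description)
```

The resulting description, rather than the raw query, is what gets embedded and compared against the catalog.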

Model Selection

Picnic chose OpenAI’s GPT-3.5-turbo for the prompt generation task, noting that it performs comparably to GPT-4-turbo while being significantly faster. This is a pragmatic production decision—choosing the model that provides sufficient quality while meeting latency and cost requirements. For embeddings, they use OpenAI’s text-embedding-3-small model, which produces 1536-dimensional vectors. The choice of the smaller embedding model was driven by a practical constraint: OpenSearch has a maximum dimensionality limit of 1536 for efficient retrieval, which aligns exactly with the output size of text-embedding-3-small.

Precomputation Strategy

One of the most critical LLMOps decisions in this case study is the precomputation approach. Rather than calling LLM APIs at query time—which would introduce unacceptable latency for a search-as-you-type experience—Picnic precomputes embeddings for both search terms and product/recipe content.

The rationale is compelling: by analyzing historical search data, they can identify and precompute embeddings for approximately 99% of search terms that customers use. This eliminates the need for real-time LLM inference in the vast majority of cases, allowing the system to deliver results in milliseconds rather than the seconds that typical LLM responses require.
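The 99% figure falls out naturally from the long-tail shape of search traffic: a modest set of frequent terms covers almost all queries. A minimal sketch of that selection plus the offline embedding pass, with an invented search log and a stubbed embedding function (production would batch-call text-embedding-3-small):

```python
from collections import Counter

def terms_to_precompute(search_log: list[str], coverage: float = 0.99) -> list[str]:
    """Pick the most frequent terms until they cover ~`coverage` of traffic."""
    counts = Counter(search_log)
    total = len(search_log)
    covered, selected = 0, []
    for term, n in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append(term)
        covered += n
    return selected

def precompute(terms, embed):
    """Build the term -> embedding cache consulted at query time."""
    return {t: embed(t) for t in terms}

# Invented log: a few head terms dominate, one term is pure tail.
log = ["milk"] * 50 + ["bread"] * 40 + ["daughter's birthday party"] * 9 + ["rare typo"]
cache = precompute(terms_to_precompute(log), embed=lambda t: [0.0] * 1536)
print(sorted(cache))  # the tail term "rare typo" is excluded
```

Only queries that miss this cache would ever need a synchronous embedding call.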

This approach reflects a mature understanding of LLMOps trade-offs. While it requires upfront computational investment and ongoing maintenance as the product catalog changes, it enables a user experience that would be impossible with synchronous LLM calls.

Infrastructure: OpenSearch Integration

The system uses OpenSearch as the core retrieval engine. The architecture employs two indexes: one holding precomputed embeddings of historical search terms, and one holding embeddings of products and recipes.

This separation allows for efficient matching of incoming queries against precomputed search term embeddings, followed by semantic retrieval of relevant products using vector similarity.
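Illustrative OpenSearch index mappings for this two-index layout are sketched below. Field names are assumptions; the 1536 dimension matches text-embedding-3-small and the OpenSearch limit mentioned earlier.

```python
# Shared vector field definition: 1536 dims per text-embedding-3-small.
KNN_VECTOR = {"type": "knn_vector", "dimension": 1536}

# Index 1: precomputed embeddings of historical search terms.
search_terms_index = {
    "settings": {"index.knn": True},
    "mappings": {"properties": {
        "term": {"type": "keyword"},
        "embedding": KNN_VECTOR,
    }},
}

# Index 2: product/recipe content embeddings.
products_index = {
    "settings": {"index.knn": True},
    "mappings": {"properties": {
        "product_id": {"type": "keyword"},
        "name": {"type": "text"},
        "embedding": KNN_VECTOR,
    }},
}

# Query-time flow: look up the incoming query in the term index, then run
# a k-NN search on the product index with the cached vector:
def knn_query(vector, k=10):
    return {"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}}
```

The exact-match lookup stays cheap because search terms are keyword fields, while the expensive vector similarity runs only against the product index.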

Caching and Reliability

Beyond precomputation, Picnic implements additional caching mechanisms throughout the system. This serves multiple purposes: it keeps latency low for repeated queries, reduces API costs, and insulates the system from outages, rate limits, and latency spikes in the external provider.

The emphasis on managing third-party dependencies through caching is a crucial LLMOps consideration. When your production system depends on external API calls, you must have strategies to handle outages, rate limits, and latency spikes gracefully.
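A minimal sketch of this pattern: serve embeddings from a cache where possible, and degrade gracefully when the external provider fails. The fallback behavior shown (returning `None` so the caller can fall back to literal keyword search) is an assumption, not Picnic's documented design.

```python
class EmbeddingCache:
    def __init__(self, embed_fn):
        self._embed = embed_fn   # external API call in production
        self._store = {}

    def get(self, term):
        if term in self._store:
            return self._store[term]
        try:
            vec = self._embed(term)
        except Exception:
            return None          # provider outage / rate limit: caller falls back
        self._store[term] = vec
        return vec

# Simulated provider that fails for one term:
calls = []
def flaky_embed(term):
    calls.append(term)
    if term == "down":
        raise TimeoutError
    return [0.1] * 1536

cache = EmbeddingCache(flaky_embed)
cache.get("milk"); cache.get("milk")   # second hit served from cache
print(len(calls))                      # → 1 (only one upstream call)
print(cache.get("down"))               # → None (graceful degradation)
```

In production the cache would be shared and persistent (e.g. a key-value store) rather than in-process, but the control flow is the same.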

Quality Assurance and Sanity Checks

The pipeline includes numerous sanity checks, such as verifying that embeddings are consistent and of the appropriate length. The case study explicitly acknowledges that outputs from language models can vary with updates and model iterations—a critical observation for production systems. These checks help maintain system integrity as underlying models evolve.
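Such checks can be as simple as validating dimensionality and numeric sanity before an embedding is written to the index. A sketch, with illustrative names:

```python
import math

EXPECTED_DIM = 1536  # text-embedding-3-small output size

def check_embedding(vec) -> bool:
    """Reject embeddings with the wrong length or degenerate values,
    e.g. after a silent model or API change upstream."""
    return (
        len(vec) == EXPECTED_DIM
        and all(isinstance(x, float) and math.isfinite(x) for x in vec)
    )

assert check_embedding([0.1] * 1536)
assert not check_embedding([0.1] * 768)            # wrong dimensionality
assert not check_embedding([float("nan")] * 1536)  # degenerate output
```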

Evaluation and Deployment Strategy

Offline Optimization

The development process begins with extensive offline optimization, where the team experiments with prompts, models, and retrieval parameters against historical search data.

This phase is conducted without affecting the production environment, allowing for safe experimentation. However, the team acknowledges a limitation: offline evaluation relies on historical search results, and the ground truth derived from past behavior may not perfectly reflect optimal outcomes. They recommend using offline evaluation primarily for initial parameter tuning.
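One common way to run such offline evaluation, consistent with the "initial parameter tuning" use the team recommends, is to treat historically clicked products as imperfect ground truth and score the new retrieval with recall@k. The data below is invented for illustration:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of historically relevant items found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Historical clicks for one query vs. what the candidate system retrieves:
clicked = {"cake-01", "candles-07"}
retrieved = ["cake-01", "balloons-12", "candles-07", "napkins-03"]
print(recall_at_k(retrieved, clicked, k=3))  # → 1.0
```

The caveat in the case study applies directly: a high score only means the new system reproduces past behavior, which is why these metrics guide tuning but final validation happens online.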

Online A/B Testing

Following offline optimization, new features undergo controlled A/B testing with real users. This phase collects data on how actual customers interact with the enhanced search compared to the existing system. The case study emphasizes that this iterative approach ensures changes move in the right direction incrementally.

The metrics center on how real customers engage with and convert from the enhanced search results relative to the existing system.

Scaling and Continuous Improvement

Once A/B testing validates a feature, it’s scaled across the entire user base with careful monitoring to manage increased load and maintain system stability. The scaling phase also generates more data, enabling further refinements and personalization.

The team notes that initial A/B tests are just the beginning—there are “millions of ways to configure search results” including ranking changes, mixing recipes with articles, and hybrid approaches combining literal search with LLM-based semantic search.

Production Considerations and Trade-offs

Several aspects of this case study highlight mature LLMOps thinking:

Latency vs. Capability Trade-off: Rather than accepting LLM latency limitations, Picnic architecturally avoided them through precomputation. They acknowledge that a future iteration could “fully unleash the power of LLMs” with a redesigned user interface that accommodates slower response times, but for now they prioritize speed.

Cost Management: The team explicitly mentions that “prompting and embedding millions of search terms is resource-intensive” with associated API costs. The precomputation strategy helps control these costs by doing the work upfront rather than per-query.

Model Selection Pragmatism: Choosing GPT-3.5-turbo over GPT-4-turbo based on equivalent performance for their use case demonstrates practical model selection rather than defaulting to the most powerful option.

Dependency Management: The emphasis on caching to handle third-party dependencies reflects real-world concerns about relying on external APIs for critical production systems.

Limitations and Balanced Assessment

While the case study presents a thoughtful approach to LLM-enhanced search, it's worth noting some limitations: the system depends on a single external provider for both prompting and embeddings, the precomputed embeddings require ongoing maintenance as the catalog and search behavior evolve, and the offline evaluation is anchored to historical behavior that may not reflect optimal outcomes.

Overall, this case study represents a practical, production-focused application of LLMs to enhance search retrieval, with particular emphasis on meeting latency requirements through precomputation and managing external API dependencies through caching.
