Semantic Caching for E-commerce Search Optimization

Walmart 2024

Walmart implemented semantic caching to enhance their e-commerce search functionality, moving beyond traditional exact-match caching to understand query intent and meaning. The system achieved unexpectedly high cache hit rates of around 50% for tail queries (compared to anticipated 10-20%), while handling the challenges of latency and cost optimization in a production environment. The solution enables more relevant product recommendations and improves the overall customer search experience.

Industry

E-commerce

Overview

Walmart, one of the world’s largest retailers, has been implementing semantic caching and generative AI technologies to transform their e-commerce search capabilities. This case study, based on insights from Rohit Chatter (Chief Software Architect at Walmart), demonstrates how large-scale e-commerce operations can leverage LLM-adjacent technologies to improve search relevance, reduce latency, and manage costs while handling millions of customer queries.

The core innovation discussed centers on moving beyond traditional exact-match caching to a semantic approach that understands the meaning and intent behind customer search queries. This represents a significant operational challenge given the scale of Walmart’s e-commerce operations and the need to balance performance, cost, and user experience.

The Problem with Traditional Caching

Traditional caching mechanisms in search systems rely on exact-match inputs to retrieve data. When a user searches for a specific term, the system checks if that exact query has been cached and returns the stored results if found. While this approach works well for common, frequently repeated queries, it falls short when dealing with the natural variations in human language.

Consider that customers might search for the same product using countless different phrasings: “running shoes,” “jogging sneakers,” “athletic footwear for running,” etc. A traditional cache would treat each of these as entirely distinct queries, requiring separate API calls or database lookups, missing opportunities to reuse relevant cached results.

For tail queries—the long tail of less common, more specific searches—traditional caching becomes particularly inefficient. These queries are by definition less frequently repeated in their exact form, leading to very low cache hit rates and forcing more expensive computation or API calls for each unique phrasing.

Semantic Caching: The Technical Approach

Semantic caching addresses these limitations by focusing on the meaning behind queries rather than their exact string representation. The technical implementation involves several key components:

Vector Embeddings: Every query and product SKU must be converted into vector embeddings—numerical representations that capture semantic meaning. Similar concepts end up with similar vector representations, enabling the system to recognize that “running shoes” and “jogging sneakers” are semantically related.
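
As a minimal sketch of this step, assuming the sentence-transformers library and the publicly available all-MiniLM-L6-v2 model (the case study does not say which embedding model Walmart uses):

```python
# Minimal embedding sketch; the model choice is an assumption, not from the case study.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

queries = ["running shoes", "jogging sneakers", "garden hose"]
vectors = model.encode(queries, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
print(np.dot(vectors[0], vectors[1]))  # high score: semantically related queries
print(np.dot(vectors[0], vectors[2]))  # low score: unrelated queries
```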

Vector Search: Rather than looking for exact matches, the caching system performs similarity searches in the embedding space. When a new query comes in, it’s embedded and compared against cached query embeddings to find semantically similar previous queries whose results might be relevant.

Threshold-Based Matching: The system must determine how similar two queries need to be before the cached results are considered applicable. This involves tuning similarity thresholds to balance between reusing relevant results and avoiding returning inappropriate matches.
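
Putting the three components together, a simplified cache lookup might look like the sketch below. It uses a brute-force similarity scan and an illustrative threshold; a production deployment would use a vector database or ANN index, and the case study does not disclose Walmart's actual threshold. Here embed() and compute_results() stand in for the embedding model and the expensive search backend.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # illustrative value; real thresholds are tuned per system

cache_embeddings = []  # one normalized vector per previously seen query
cache_results = []     # search results stored alongside each embedding

def semantic_cache_lookup(query, embed, compute_results):
    q_vec = embed(query)
    if cache_embeddings:
        sims = np.stack(cache_embeddings) @ q_vec   # cosine similarity on normalized vectors
        best = int(np.argmax(sims))
        if sims[best] >= SIMILARITY_THRESHOLD:
            return cache_results[best]              # hit: reuse a similar query's results
    results = compute_results(query)                # miss: expensive computation or API call
    cache_embeddings.append(q_vec)
    cache_results.append(results)
    return results
```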

Results and Performance

The results reported by Walmart are impressive, though it’s worth noting these come from a promotional context (a conversation published on Portkey’s blog, which sells related infrastructure). According to Chatter, semantic caching delivered a cache hit rate of approximately 50% for tail queries—significantly exceeding their initial expectations of 10-20%.

This represents a substantial improvement in operational efficiency. Tail queries, which traditionally had very low cache hit rates with exact-match systems, can now leverage cached results from semantically similar previous queries. The implications for cost savings (fewer LLM API calls or expensive computations) and latency improvements (faster responses from cache hits) are significant at Walmart’s scale.

Beyond caching, Walmart is applying generative AI to understand the intent behind customer queries and present more relevant product groupings. The example provided illustrates this well: a query like “football watch party” returns not just snacks and chips, but also party drinks, Super Bowl apparel, and televisions.

This represents a move from keyword-matching search to intent-understanding search. Traditional search systems might only return products that contain the exact words “football watch party” in their descriptions. A generative AI approach can infer the occasion behind the query and assemble groupings of related products (drinks, apparel, electronics) even when those items never mention the query terms.

The case study suggests this approach can significantly reduce zero search result queries—instances where customers search for something and find nothing relevant, leading to frustration and abandoned shopping sessions.

Challenges and Trade-offs in Production

The case study is refreshingly honest about the challenges involved in implementing these technologies at scale, which provides valuable LLMOps insights:

Latency Challenges: Semantic caching requires more computation than traditional caching. Each query must be embedded (requiring a model inference), and then a similarity search must be performed across the cache. This adds overhead compared to a simple hash lookup in a traditional cache. Walmart is working toward achieving sub-second response times, indicating that latency remains a work in progress.
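
To make the overhead concrete, the sketch below contrasts the two lookup paths; the cache size, embedding dimension, and embed() call are assumptions for illustration, not figures from Walmart's system.

```python
import numpy as np

exact_cache = {"running shoes": ["sku-101", "sku-205"]}

# Simulated semantic cache: 100k cached query embeddings, 384 dimensions.
cached_vectors = np.random.rand(100_000, 384).astype(np.float32)

def exact_lookup(query):
    # Constant-time hash lookup: microseconds, no model in the loop.
    return exact_cache.get(query)

def semantic_lookup(query, embed):
    # One model inference (typically milliseconds) plus a scan over the cache;
    # production systems bound the scan with an approximate nearest-neighbor index.
    q_vec = embed(query)
    sims = cached_vectors @ q_vec
    return int(np.argmax(sims))
```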

Cost Challenges: The infrastructure requirements for semantic caching are more demanding than those of traditional caching: every query needs an embedding model inference, the vector representations need to be stored, and similarity search requires its own compute, all of which go well beyond a simple key-value store.

Hybrid Approach: The case study suggests that a hybrid approach using both simple (exact-match) caching and semantic caching together may be optimal. This allows the system to serve frequent, repeated queries from the cheap exact-match cache while falling back to the more expensive semantic lookup for query variations and tail queries, as sketched below.
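
A rough sketch of that hybrid flow, with hypothetical exact_cache and semantic_cache objects standing in for whatever stores a real deployment would use:

```python
def hybrid_lookup(query, embed, compute_results, exact_cache, semantic_cache):
    # 1. Exact-match cache: the fastest path, no embedding needed.
    if query in exact_cache:
        return exact_cache[query]

    # 2. Semantic cache: embed the query and look for a similar cached query.
    q_vec = embed(query)
    hit = semantic_cache.lookup(q_vec)       # returns None below the similarity threshold
    if hit is not None:
        return hit

    # 3. Full computation (search backend or LLM call), then populate both caches.
    results = compute_results(query)
    exact_cache[query] = results
    semantic_cache.store(q_vec, results)
    return results
```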

Considerations and Critical Assessment

While the case study presents compelling results, several considerations are worth noting:

Source Context: This case study was published on Portkey’s blog, and Portkey sells LLM gateway infrastructure including caching solutions. While the insights from Walmart appear genuine, the framing and selection of what to highlight may be influenced by Portkey’s commercial interests.

Specific Metrics Limited: While the 50% cache hit rate for tail queries is mentioned, other important metrics (latency improvements, cost savings in absolute terms, customer experience improvements, A/B test results) are not provided. The case study is more conceptual than data-rich.

Implementation Details Sparse: Technical details about the specific embedding models used, the vector database infrastructure, the similarity thresholds employed, or the scale of the deployment are not provided. This limits the actionability of the case study for practitioners looking to implement similar solutions.

Future-Looking Statements: Some of Chatter’s comments about working toward sub-second response times and future AR/VR integration suggest that the technology is still evolving and may not yet be fully mature in production.

LLMOps Implications

From an LLMOps perspective, this case study highlights several important themes:

Cost Management: At the scale of a retailer like Walmart, LLM API costs can quickly become prohibitive. Semantic caching represents a practical approach to reducing these costs by reusing results from semantically similar queries. This is a key LLMOps pattern that applies broadly to any high-volume LLM application.

Latency Optimization: Production search systems have strict latency requirements—customers expect results in milliseconds, not seconds. The tension between the power of semantic approaches and their computational overhead is a recurring theme in LLMOps. Hybrid architectures that use simpler methods when possible and fall back to more expensive approaches when needed represent a mature pattern.

Infrastructure Complexity: Semantic caching requires specialized infrastructure (embedding models, vector databases, similarity search engines) that adds operational complexity. Organizations must weigh the benefits against the increased complexity of their stack.

Continuous Improvement: The mention of working toward sub-second response times and exploring future technologies suggests that LLMOps is not a “set and forget” discipline. Continuous monitoring, optimization, and evolution of the system are required.

This case study provides a valuable glimpse into how a major e-commerce player is operationalizing LLM-adjacent technologies in production, with honest acknowledgment of both the benefits achieved and the challenges that remain.
