Company
Walmart
Title
Hybrid AI System for Large-Scale Product Categorization
Industry
E-commerce
Year
2024
Summary (short)
Walmart developed Ghotok, an AI system that combines predictive and generative models to improve product categorization across its digital platforms. The system addresses the challenge of accurately mapping relationships between product categories and product types across a catalog of over 400 million SKUs. Using a staged ensemble of smaller predictive models and larger generative models, along with multi-tier caching and careful deployment strategies, Ghotok reduces false positives and improves the efficiency of product categorization while maintaining millisecond-range response times in production.
## Overview

Walmart, one of the world's largest retailers, operates a digital storefront with over 400 million SKUs. The company developed an AI system called Ghotok (named after a Bengali matchmaker) to improve the accuracy of product categorization on its e-commerce platform. The core challenge is matching products to the correct categories and product types so that customers can efficiently find what they're looking for, just as they would in a well-organized physical store.

The system is particularly interesting from an LLMOps perspective because it demonstrates a pragmatic hybrid approach: traditional predictive AI (smaller, million-parameter models) combined with generative AI (larger, billion-parameter models) in an ensemble architecture designed for production scale. The case study provides insights into cost optimization, latency management, and handling AI limitations like hallucination in real-world deployments.

## The Problem Space

Walmart's product catalog features two key hierarchies: Categories (which customers see and navigate on the website) and Product Types (an internal classification that captures a product's purpose). The relationship between these hierarchies is many-to-many, meaning a single category can map to multiple product types and vice versa. With millions of potential category-to-product-type pairs, manual curation is impractical, and miscategorization leads to poor customer experience.

A concrete example illustrates the challenge: a screwdriver could be classified under "Oral Care Accessories" (for dental use) or "Screwdriver Tool" (for electronics). Context matters enormously for customer relevance, and the system must understand these semantic distinctions at scale.

## The Hybrid Ensemble Architecture

Ghotok's approach is notable for its cost-conscious design, which leverages the strengths of different AI paradigms. The team explicitly acknowledges that "ML inference using Generative AI technology is costly" and designs around this constraint.

The architecture works in stages. First, predictive AI models (traditional machine learning models with millions of parameters) are trained on domain-specific features. These models are faster and cheaper to run at inference time. The training process uses a human-labeled dataset for hyperparameter tuning, with evaluation metrics including precision, recall, F1 score, true positive rate (TPR), and false positive rate (FPR). By setting confidence thresholds based on a fixed false positive rate, the predictive models filter candidate pairs from millions down to thousands.

Only after this filtering step does the generative AI component come into play. The billion-parameter generative models are applied to the pre-filtered set of candidate pairs, using their superior semantic understanding to further eliminate false positives. This staged approach is a pragmatic solution to the cost-performance tradeoff inherent in deploying large language models at scale.
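The case study does not include code, so the sketch below is only a schematic reconstruction of the described flow: pick a predictive-model threshold at a fixed false positive rate on human-labeled validation data, then send only the survivors to the expensive generative model. The `MAX_FPR` value and the `predictive_model.score` / `generative_model.is_valid_pair` interfaces are hypothetical.

```python
from sklearn.metrics import roc_curve

MAX_FPR = 0.05  # hypothetical budget; the real figure is not disclosed

def pick_threshold_at_fpr(y_true, scores, max_fpr=MAX_FPR):
    """Pick the most permissive score threshold whose false positive rate,
    measured on a human-labeled validation set, stays within the budget."""
    fpr, _, thresholds = roc_curve(y_true, scores)
    within = fpr <= max_fpr
    return thresholds[within][-1]  # lowest threshold still under budget

def categorize(pairs, predictive_model, generative_model, threshold):
    """Stage 1: cheap million-parameter filter over millions of pairs.
    Stage 2: billion-parameter LLM check over the surviving thousands."""
    scores = predictive_model.score(pairs)  # fast, inexpensive inference
    survivors = [p for p, s in zip(pairs, scores) if s >= threshold]
    # Only the pre-filtered candidates reach the costly generative model.
    return [p for p in survivors if generative_model.is_valid_pair(p)]
```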
## Prompt Engineering Techniques

The case study reveals specific prompt engineering strategies that improved the generative AI component's performance. Chain-of-thought prompting proved instrumental in ensuring the model followed a logical reasoning path from input to output, improving both the contextual relevance and the logical consistency of the generated reasoning. Additionally, the team employed what they call "symbol tuning" - adjusting how different parts of the input are weighted by the model.

A key finding was that representing each node as the entire path string from root to leaf in the category tree, rather than just the node name, significantly improved results. Furthermore, instructing the model to give higher importance to the leaf node during relevance assessment led to marked improvements. This insight about hierarchical path representation is valuable for anyone working with tree-structured data and LLMs.

Interestingly, the team notes that hallucination - often cited as a major concern with generative AI - "does not pose a significant issue" for their use case. This is because they leverage the LLM's semantic comprehension selectively, specifically to eliminate false positives from the predictive AI layer rather than to generate new content. This constrained use case naturally mitigates hallucination risks.
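The exact Ghotok prompts are not published; the sketch below merely illustrates, in hypothetical form, how the three reported techniques - chain-of-thought instructions, full root-to-leaf path strings, and extra weight on the leaf node - could be combined. All path values and wording are invented for illustration.

```python
def build_prompt(category_path: list[str], product_type_path: list[str]) -> str:
    """Hypothetical prompt assembly: full root-to-leaf paths rendered as
    strings, explicit emphasis on the leaf node, and a chain-of-thought
    instruction before the final verdict."""
    category = " > ".join(category_path)
    product_type = " > ".join(product_type_path)
    return (
        f"Category path: {category}\n"
        f"Product type path: {product_type}\n"
        "When judging relevance, give the highest importance to the last "
        "(leaf) node of each path.\n"
        "Think step by step: describe what each leaf node means, explain "
        "whether a product of this type belongs in this category, and then "
        "answer with VALID or INVALID on the final line."
    )

# Example probe inspired by the screwdriver ambiguity in the case study
# (hypothetical paths):
print(build_prompt(
    ["Home Improvement", "Tools", "Screwdrivers"],
    ["Personal Care", "Oral Care", "Oral Care Accessories"],
))
```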
## Production Deployment and Infrastructure

The integration of Ghotok into Walmart's backend systems demonstrates careful attention to production requirements. With several thousand categories, each connected to hundreds of product types, the offline data comprises millions of rows, and meeting typical service level agreements (SLAs) in the millisecond range requires sophisticated serving infrastructure.

The team implemented a two-tier LRU (Least Recently Used) caching system. The L1 cache is small but provides access times of one or two cycles; the L2 cache is larger but slightly slower. A lookup searches L1 first, then L2, and queries primary storage only if neither cache contains the requested data. The caches store mappings from each Category to its set of associated Product Types, optimizing for the specific query patterns of the product filtering workflow.
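A minimal sketch of such a two-tier lookup follows. The case study describes the pattern but not the implementation, so the tier sizes and the `fetch_from_store` backend here are hypothetical stand-ins.

```python
from collections import OrderedDict

class LRUCache:
    """Simple LRU cache: evicts the least recently used entry when full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict[str, set[str]] = OrderedDict()

    def get(self, key: str):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key: str, value: set[str]):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

# Hypothetical tier sizes; the case study gives no numbers.
l1 = LRUCache(capacity=1_000)     # small, fastest tier
l2 = LRUCache(capacity=100_000)   # larger, slightly slower tier

def product_types_for(category: str, fetch_from_store) -> set[str]:
    """L1 first, then L2, then primary storage; warm both tiers on a miss."""
    value = l1.get(category)
    if value is None:
        value = l2.get(category)
        if value is None:
            value = fetch_from_store(category)  # slow path: primary storage
            l2.put(category, value)
        l1.put(category, value)
    return value
```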
## Exception Handling and Human-in-the-Loop

Despite the success of the predictive-generative ensemble in lowering false positive rates, the team acknowledges that production deployments inevitably encounter edge cases. Their solution is an exception handling tool that combines machine learning with human intervention, enabling "swift and seamless resolution" of issues the automated system cannot handle. This human-in-the-loop component is crucial for maintaining system quality over time and represents a mature approach to LLMOps: recognizing that fully autonomous systems are often impractical and that human oversight remains valuable.

## Evaluation Approach

The case study emphasizes the importance of human-labeled data for both training and evaluation. The team uses this data to select the best hyperparameters for the predictive models and to determine the confidence thresholds for filtering. Notably, they highlight that this approach "dispenses with the requirement for customer engagement data (which is noisy as sometimes customers might click on items by mistake or out of curiosity)." Evaluation on a validation set showed that the ensemble of predictive and generative AI models achieved the best performance, though the case study does not report specific quantitative results.

## Key Lessons for LLMOps Practitioners

Several lessons emerge from Walmart's experience. First, a staged architecture that uses cheaper predictive models as a filter before applying expensive generative models is a practical pattern for cost optimization at scale. Second, chain-of-thought prompting and careful attention to input representation (such as full path strings rather than bare node names) can significantly improve LLM performance. Third, constraining LLM usage to specific tasks (such as false positive elimination) can naturally mitigate hallucination concerns. Fourth, production deployments benefit from multi-tier caching strategies to meet latency requirements. Finally, human-in-the-loop exception handling remains valuable even alongside sophisticated AI systems.

The Ghotok system represents a thoughtful integration of traditional ML and generative AI, designed with production constraints in mind. While the case study is primarily a technical overview from Walmart's engineering team and naturally presents the work in a positive light, the architectural decisions and lessons learned appear grounded in practical engineering experience rather than marketing hyperbole.