## Overview
Instacart developed a production-scale contextual retrieval system using BERT-like transformer models to power product recommendations across multiple shopping surfaces. The company operates in a complex e-commerce environment where customers often place large basket orders representing weekly shopping needs for entire families. The challenge was to create an efficient, real-time recommendation system that could understand user intent within a shopping session and provide relevant suggestions. For example, when a user adds pancake mix, views bacon, and adds eggs, the system should recognize breakfast intent and recommend complementary breakfast items.
Prior to this implementation, Instacart maintained disparate retrieval systems for different recommendation surfaces across both ads and organic content. These legacy systems were ad-hoc combinations of product co-occurrence patterns, similarity measures, and popularity signals, but they failed to properly leverage sequential contextual information from user sessions. The new unified system replaced these legacy approaches while serving diverse optimization goals: organic content optimizes for user engagement and transaction revenue, while sponsored content additionally considers advertiser value and ad revenue.
## Production Architecture and Problem Formulation
The system operates as a multi-step ranking pipeline with retrieval, ranking, and re-ranking layers. The contextual retrieval system sits at the retrieval layer, reacting in real-time to user actions within shopping sessions. The team formulated the problem as a next-product prediction task: given a sequence of products that users interacted with (through cart adds, product page views, etc.), the model predicts probabilities for products the user may interact with next. Formally, the model predicts p(Pᵢ | Pₜ₁, Pₜ₂, …) for i in [1, N], where N is the catalog size.
Once probabilities are predicted across all product IDs, the system retrieves the top K products based on predicted probabilities for downstream ranking stages. The production system serves multiple surfaces including search results, item details pages (with "Items to add next" carousels), cart pages, and pre/post-checkout recommendation modules. This centralized approach significantly reduced maintenance overhead by allowing the deprecation of many legacy systems.
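The retrieval step described above amounts to a top-K selection over the model's predicted distribution. A minimal sketch, with a toy hand-written probability table standing in for the model's softmax output (the function name and product IDs are illustrative, not Instacart's):

```python
import heapq

def retrieve_top_k(probs: dict[str, float], k: int) -> list[str]:
    """Return the k product IDs with the highest predicted probability.

    `probs` maps product ID -> p(product | session); in production this
    would come from the model's softmax over the whole vocabulary.
    """
    return heapq.nlargest(k, probs, key=probs.get)

# Toy breakfast-intent session: scores are made up for illustration
probs = {"syrup": 0.40, "butter": 0.25, "orange_juice": 0.20, "dog_food": 0.01}
candidates = retrieve_top_k(probs, k=2)
# candidates == ["syrup", "butter"]
```

The retrieved candidates are then handed to the downstream ranking and re-ranking stages rather than shown to users directly.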
## Model Architecture and Training Approach
The team adopted a Masked Language Model (MLM) approach borrowed from NLP, specifically inspired by BERT4Rec and Transformers4Rec. However, their implementation operates at significantly larger scale—their production models handle an order of magnitude more products than the tens of thousands shown in prior research, with a catalog containing millions of products across thousands of retailers. For their initial production deployment, they restricted model vocabulary to under one million products, selected based on product popularity and business rules, with out-of-vocabulary products mapped to an OOV token.
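The vocabulary restriction can be sketched as a popularity-ranked cutoff with an out-of-vocabulary fallback. This is a simplified illustration (the post also mentions business rules feeding vocabulary selection, which are omitted here; all names are hypothetical):

```python
OOV = "<OOV>"

def build_vocab(product_counts: dict[str, int], max_size: int) -> dict[str, int]:
    """Keep the `max_size` most popular products; everything else maps to OOV.

    Popularity-only selection is a simplification of the described
    popularity-plus-business-rules process.
    """
    top = sorted(product_counts, key=product_counts.get, reverse=True)[:max_size]
    vocab = {OOV: 0}
    for pid in top:
        vocab[pid] = len(vocab)
    return vocab

def encode(session: list[str], vocab: dict[str, int]) -> list[int]:
    """Map a session's product IDs to token IDs, using OOV for unknowns."""
    return [vocab.get(pid, vocab[OOV]) for pid in session]

counts = {"milk": 900, "eggs": 800, "rare_spice": 3}
vocab = build_vocab(counts, max_size=2)          # keeps milk and eggs
tokens = encode(["milk", "rare_spice", "eggs"], vocab)
# tokens == [1, 0, 2]  (rare_spice falls back to the OOV token)
```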
The architecture closely mirrors BERT but operates on product ID sequences rather than text tokens. During training, they use the MLM approach on historical sequences of product IDs from user sessions. At inference time, the encoded session representation from the transformer block predicts probabilities over all product IDs. The team experimented with different language model architectures including XLNet and BERT, ultimately converging on a simple BERT-like model based on offline evaluation performance.
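The MLM training objective over product ID sequences can be illustrated with a minimal masking routine: hide a fraction of tokens and keep the originals as labels the model must recover from the surrounding session context. This is a sketch of the standard BERT-style scheme, not Instacart's exact preprocessing (the mask ID and masking rate details here are assumptions):

```python
import random

MASK_ID = -1  # hypothetical mask token id; a real vocab would reserve one

def mask_sequence(token_ids, mask_prob=0.15, rng=None):
    """BERT-style masking: hide ~mask_prob of tokens, recording them as labels.

    Returns (inputs, labels) where labels[i] holds the original token at
    masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

inputs, labels = mask_sequence([11, 42, 7, 99], mask_prob=0.5,
                               rng=random.Random(0))
```

At inference time, appending a mask token at the end of the live session and reading off the predicted distribution for that position yields the next-product probabilities described earlier.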
A notable engineering decision was to use only product ID sequences, without additional user or product features, in the preliminary production version. This simplification proved sufficient to demonstrate significant impact, though the team acknowledges plans to incorporate more contextual features in future iterations. The model uses a sequence length of 20 for both training and inference: the team found that the last 3-5 products in the interaction sequence have outsized influence on recommendations, with earlier products contributing progressively less as the sequence grows.
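Capping the context at the 20 most recent interactions is a one-line preprocessing step; a sketch (constant name is illustrative):

```python
MAX_SEQ_LEN = 20

def truncate_session(events: list[str], max_len: int = MAX_SEQ_LEN) -> list[str]:
    """Keep only the most recent `max_len` interactions.

    Since the last few products carry most of the signal, dropping older
    events loses little while bounding training and inference cost.
    """
    return events[-max_len:]

long_session = [f"product_{i}" for i in range(35)]
recent = truncate_session(long_session)
# keeps product_15 .. product_34, the 20 most recent events
```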
## Evaluation Methodology and Metrics
The primary offline evaluation metric is Recall@K, which measures the percentage of times the actual next product (last token in test sequences) appears in the top K predictions from the model. This metric provides a practical measure of how effectively the model predicts which products users might interact with next. The team conducted rigorous ablation studies to validate the importance of sequence information, using two distinct approaches.
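Recall@K as described reduces to a hit-rate computation over held-out sequences. A minimal sketch, with a trivial stand-in for the model's retrieval call (the function names and toy data are assumptions):

```python
def recall_at_k(test_sequences, predict_top_k, k):
    """Fraction of test sequences whose true next product (the last token)
    appears in the model's top-k predictions given the preceding tokens.

    `predict_top_k(prefix, k)` stands in for the production retrieval call
    and returns a list of candidate product IDs.
    """
    hits = 0
    for seq in test_sequences:
        prefix, target = seq[:-1], seq[-1]
        if target in predict_top_k(prefix, k):
            hits += 1
    return hits / len(test_sequences)

# Toy "model" that ignores context and always recommends popular items
popular = ["milk", "eggs", "bread"]
predict = lambda prefix, k: popular[:k]
score = recall_at_k([["a", "milk"], ["b", "jam"]], predict, k=2)
# score == 0.5: "milk" is in the top 2, "jam" is not
```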
In the first evaluation approach, they trained models on randomized token sequences (where product order was shuffled) and compared performance against control models trained on proper sequences. Even with randomized training data producing seemingly meaningful recommendations, the Recall@K metrics were 10-40% worse depending on K value. This demonstrated that sequence information during training has meaningful impact on prediction quality.
The second evaluation kept the control model unchanged but randomized token sequences in the test dataset (keeping only the last product in its original position while shuffling preceding products). This evaluation showed 20-45% degradation in metrics depending on K, indicating that proper sequence information at inference time is critical for recommendation quality. These rigorous experiments provided strong empirical evidence for the value of sequence modeling in their production context.
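The test-time randomization in the second ablation can be sketched directly: shuffle the context while pinning the final (target) product in place, so only order information is destroyed:

```python
import random

def shuffle_context(sequence, rng):
    """Randomize a test sequence's context while keeping the final
    (target) product in its original position, as in the second ablation:
    the model is unchanged, only the order of preceding products is lost."""
    context, target = sequence[:-1], sequence[-1]
    rng.shuffle(context)   # sequence[:-1] copies, so the original is untouched
    return context + [target]

rng = random.Random(7)
shuffled = shuffle_context(["pancake_mix", "bacon", "eggs", "syrup"], rng)
# shuffled[-1] == "syrup"; the first three items are a permutation of the rest
```

Running Recall@K on sequences transformed this way versus the originals isolates how much the model relies on ordering at inference time.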
## Production Impact and Business Results
The deployment delivered substantial business impact across multiple dimensions. The unified retrieval system enabled deprecation of old ad-hoc systems, reducing technical debt and maintenance burden. Initial offline evaluation showed significant uplift over prior systems, translating to outsized impact across transaction volume and the ad marketplace. Most notably, when launched on cart recommendations, the system achieved a 30% lift in user cart additions—a substantial improvement for a high-traffic production surface.
The system's ability to serve both ads and organic surfaces from a single retrieval layer represents significant operational efficiency. Previously, separate systems maintained for different surfaces created redundancy and inconsistency. The unified approach provides consistent user experience while allowing downstream ranking layers to optimize for surface-specific goals (user engagement, transaction revenue, advertiser value, ad revenue).
## Production Challenges and Ongoing Work
The team candidly discusses several challenges inherent in applying language models to their production use case. Catalog size poses a significant challenge—with millions of products across thousands of retailers, vocabulary restrictions become necessary. Their initial approach of limiting vocabulary to under one million products based on popularity and business rules means less popular products may not surface appropriately. They're exploring Approximate Nearest Neighbor (ANN) approaches to scale the model to millions of products.
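The ANN direction mentioned above replaces an exact similarity search with an approximate one. The sketch below shows the exact brute-force baseline that an ANN index (e.g. HNSW or IVF) would approximate with sub-linear lookup; the embeddings and product names are made-up toy values, not Instacart's:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_products(session_vec, product_vecs, k):
    """Exact nearest-neighbor search over product embeddings.

    An ANN index returns (approximately) this same ranking without
    scoring every product, which is what makes millions of items viable.
    """
    ranked = sorted(product_vecs,
                    key=lambda pid: cosine(session_vec, product_vecs[pid]),
                    reverse=True)
    return ranked[:k]

vecs = {"syrup": [1.0, 0.1], "butter": [0.9, 0.3], "dog_food": [-1.0, 0.5]}
top = nearest_products([1.0, 0.0], vecs, k=2)
# top == ["syrup", "butter"]
```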
Canonical product identity presents unique challenges in multi-retailer environments. Some products share common identity across retailers (branded items), while others like non-branded produce may not. Since their preliminary model uses only product ID sequences, non-popular product IDs may fail to surface in recommendations even when popular at specific retailers. They reference text-content inclusive approaches like TiGER as potential solutions.
Popularity bias emerges naturally from training data distribution—as in most retail environments, the majority of user interactions concentrate on a small catalog subset. The preliminary model exhibits bias toward popular products in recommendations, which they address by retrieving top-K products filtered by retailer. They acknowledge this as a common challenge in production recommender systems requiring ongoing attention.
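The retailer-filtering mitigation amounts to intersecting the globally ranked candidates with one retailer's catalog before taking the top K. A minimal sketch under that reading (function and product names are illustrative):

```python
def top_k_for_retailer(ranked_candidates, retailer_catalog, k):
    """Filter globally ranked candidates down to one retailer's catalog.

    Because filtering happens before the final cut, a product that is
    popular locally can surface even if it ranks low globally.
    """
    available = [pid for pid in ranked_candidates if pid in retailer_catalog]
    return available[:k]

ranked = ["national_soda", "local_salsa", "national_chips", "local_bread"]
catalog = {"local_salsa", "local_bread"}
picks = top_k_for_retailer(ranked, catalog, k=2)
# picks == ["local_salsa", "local_bread"]
```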
Context expansion represents ongoing work. Their initial production version focuses on cart adds and product view sequences, but they're working to incorporate additional context like user search queries. This expansion would provide richer signals for understanding user intent and improving recommendation relevance.
## LLMOps and Production ML Considerations
From an LLMOps perspective, this case study demonstrates several important production considerations. The team made pragmatic engineering tradeoffs—starting with a simpler model using only product IDs rather than attempting to incorporate all possible features from the beginning. This allowed them to validate the core approach and deliver business value quickly while establishing infrastructure for future enhancement.
The emphasis on offline evaluation metrics (Recall@K) and rigorous ablation studies before production deployment reflects mature ML operations practices. They validated not just overall model performance but specifically tested the value of sequence information through controlled experiments, providing evidence-based justification for their architectural choices.
The multi-stage pipeline architecture (retrieval, ranking, re-ranking) represents a scalable production pattern for recommendation systems. The sequence model serves specifically at the retrieval layer, generating candidate products for downstream optimization. This separation of concerns allows different stages to optimize for different objectives while maintaining computational efficiency.
The team's discussion of ongoing challenges—catalog scale, canonical identity, popularity bias, context expansion—reflects realistic production constraints rather than presenting an overly optimistic view. They acknowledge that their "preliminary version" represents an initial production deployment with clear paths for enhancement, demonstrating iterative development practices appropriate for production ML systems.
The unified system serving multiple surfaces (search, item details, cart, checkout) while supporting both ads and organic content represents significant systems engineering. Maintaining consistency across surfaces while allowing customization for different business objectives requires careful abstraction and interface design. The ability to deprecate multiple legacy systems indicates successful production adoption and organizational alignment.
Overall, this case study illustrates practical application of transformer-based language models to production recommendation systems at significant scale, with honest discussion of both successes and ongoing challenges. The 30% lift in cart additions demonstrates real business impact, while the architectural decisions and evaluation methodology reflect mature LLMOps practices for deploying sequence models in production e-commerce environments.