Company
Netflix
Title
Foundation Model for Personalized Recommendation at Scale
Industry
Media & Entertainment
Year
2025
Summary (short)
Netflix developed a foundation model for personalized recommendations to address the maintenance complexity and inefficiency of operating numerous specialized recommendation models. The company built a large-scale transformer-based model inspired by LLM paradigms that processes hundreds of billions of user interactions from over 300 million users, employing autoregressive next-token prediction with modifications for recommendation-specific challenges. The foundation model enables centralized member preference learning that can be fine-tuned for specific tasks, used directly for predictions, or leveraged through embeddings, while demonstrating clear scaling law benefits as model and data size increase, ultimately improving recommendation quality across multiple downstream applications.
## Overview

Netflix's foundation model for personalized recommendation represents a significant architectural shift in production recommendation systems, moving from maintaining numerous specialized models to a unified, large-scale foundation model approach inspired by the success of Large Language Models (LLMs). The case study, published in March 2025, addresses the operational challenges of managing increasingly complex recommendation infrastructure serving over 300 million users generating hundreds of billions of interactions.

The motivation for this transition stems from practical LLMOps challenges: high maintenance costs of multiple specialized models (such as "Continue Watching" and "Today's Top Picks"), difficulty transferring innovations across models, and limitations in leveraging long-term user interaction histories due to serving latency and training cost constraints. Netflix explicitly draws from the NLP-to-LLM paradigm shift, adopting two key insights: a data-centric approach prioritizing large-scale quality data over feature engineering, and leveraging semi-supervised learning through next-token prediction objectives on unlabeled data.

## Data Engineering and Tokenization

The production data pipeline processes interaction data at massive scale comparable to LLM token volumes. Netflix implements a sophisticated interaction tokenization strategy analogous to Byte Pair Encoding (BPE) in NLP, where raw user actions are merged into meaningful tokens while preserving critical information like watch duration and engagement type. This represents a careful production tradeoff between sequence compression (for computational efficiency) and information retention (for prediction quality).
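As a toy illustration of this BPE-style merging idea (the event schema and merge rule here are invented for illustration, not Netflix's actual pipeline), adjacent raw events on the same title can be collapsed into a single interaction token that preserves total watch duration and the actions taken:

```python
from dataclasses import dataclass

@dataclass
class Event:
    item_id: int
    action: str      # hypothetical action types: "play", "pause", "trailer", ...
    duration_s: int  # seconds of engagement for this event

def tokenize(events):
    """Merge adjacent raw events on the same title into one interaction
    token, summing watch duration and keeping the action list — a sketch
    of trading sequence length for retained information."""
    tokens = []
    for ev in events:
        if tokens and tokens[-1]["item_id"] == ev.item_id:
            tokens[-1]["duration_s"] += ev.duration_s
            tokens[-1]["actions"].append(ev.action)
        else:
            tokens.append({"item_id": ev.item_id,
                           "duration_s": ev.duration_s,
                           "actions": [ev.action]})
    return tokens
```

A session of play/pause/resume events on one title thus becomes one token, shortening the sequence the transformer must attend over without discarding the total engagement signal.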
The tokenization process addresses a fundamental production constraint: active users generate thousands of interaction events, exceeding typical transformer context windows, while inference services require millisecond-level latency—far more stringent than the seconds tolerated in many LLM applications. Netflix employs two production solutions: sparse attention mechanisms using low-rank compression to extend context windows to several hundred events while maintaining efficiency, and sliding window sampling during training that exposes the model to different segments of user history across epochs without requiring impractically large context windows. At inference time, KV caching enables efficient multi-step decoding while meeting the low-latency requirements.

Each interaction token contains heterogeneous information, including action attributes (locale, time, duration, device) and content metadata (item ID, genre, release country). Unlike LLMs with a single embedding space, Netflix embeds most of these features directly and learns them end to end. Timestamps receive special processing to support both absolute and relative notions of time. The system categorizes features into request-time features (available at prediction time, like login time and device) and post-action features (available only after an interaction, like the show watched and its duration), combining both to predict the next interaction.

## Model Architecture and Training Objectives

The foundation model employs autoregressive next-token prediction similar to GPT, effectively leveraging vast unlabeled user interaction data—an approach that has seen repeated success in recommendation systems research. However, Netflix makes critical modifications recognizing fundamental differences between language and recommendation tasks. Unlike LLM pretraining, where tokens receive equal weight, Netflix's model differentiates interaction importance: a 5-minute trailer view shouldn't count the same as a 2-hour movie watch.
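One simple way to realize this kind of importance weighting—sketched here with invented per-position weights, not Netflix's actual objective—is a weighted cross-entropy over next-token predictions, where each position's loss is scaled by an importance signal such as watch duration:

```python
import numpy as np

def weighted_nll(logits, targets, weights):
    """Importance-weighted negative log-likelihood: each sequence position
    contributes in proportion to its weight (e.g. derived from watch
    duration), so a full movie watch moves the loss more than a brief
    trailer view. logits: (T, V); targets: length-T class indices."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float((weights * nll).sum() / weights.sum())
```

With uniform logits the weighted loss reduces to log(V) regardless of the weights; with non-uniform weights, errors on heavily weighted interactions dominate the gradient.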
The more challenging problem involves aligning long-term user satisfaction with specific interactions. Netflix addresses this through a multi-token prediction objective: at each step the model predicts the next n tokens rather than a single token, encouraging it to capture longer-term dependencies and avoid myopic predictions. The system also employs multiple auxiliary prediction objectives beyond the primary item ID prediction. For example, deriving genres from item sequences creates auxiliary targets serving several purposes: regularization that reduces overfitting on noisy item ID predictions, additional insight into user intentions and long-term preferences, and hierarchical prediction, where predicting auxiliary targets like genre or original language first narrows the candidate list before item ID prediction.

## Production Challenges: Entity Cold-Starting

A production challenge unique to recommendation foundation models (versus language models) is entity cold-starting. Netflix continuously adds new titles, and the model must estimate member preferences for them before any engagement occurs. This necessitates two critical production capabilities:

**Incremental Training**: Foundation models trained on extensive datasets including every member's interaction history make frequent full retraining impractical, yet the catalog and member preferences continually evolve. Unlike LLMs, whose stable token vocabularies enable straightforward incremental training, recommendation models require new embeddings for new titles, necessitating expanded embedding layers and output components. Netflix warm-starts new models by reusing parameters from previous models and initializing parameters for new titles through methods like adding random noise to the average embedding or forming weighted combinations based on metadata similarity. While the choice of initialization matters less as fine-tuning data accumulates, this approach enables practical production deployment cycles.
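A minimal sketch of the warm-start idea described above (the shapes and `noise_scale` parameter are hypothetical): existing title embeddings are carried over unchanged, while rows for new titles are initialized near the mean of the existing embeddings plus small random noise:

```python
import numpy as np

def init_new_embeddings(old_emb, n_new, rng, noise_scale=0.01):
    """Warm-start an expanded embedding table: reuse all existing title
    embeddings as-is, and initialize each new title as the average of
    existing embeddings plus small random noise (one of the schemes the
    case study mentions; metadata-similarity weighting is another)."""
    mean = old_emb.mean(axis=0)
    noise = noise_scale * rng.standard_normal((n_new, old_emb.shape[1]))
    return np.vstack([old_emb, mean + noise])
```

The noise breaks symmetry so new titles can differentiate during incremental fine-tuning, while the mean keeps them in a plausible region of the embedding space from the start.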
**Handling Unseen Entities**: Even with incremental training, efficient learning on new entities isn't guaranteed, and some entities may not appear in the training data between fine-tuning cycles. Netflix addresses this by combining learnable item ID embeddings with learnable metadata-based embeddings. Each title's metadata (genres, storylines, tones) generates embeddings that are concatenated and then combined with the ID-based embedding through an attention-based mixing layer weighted by entity "age." This allows new titles with limited interaction data to rely more on metadata while established titles lean on their ID-based embeddings. Introducing randomness during training encourages metadata learning rather than dependence on ID embeddings, ensuring that newly launched or pre-launch titles have reasonable embeddings even without user interaction data.

## Downstream Production Applications

The foundation model serves production systems through three primary mechanisms, each with distinct LLMOps considerations:

**Direct Predictive Use**: The model includes multiple predictor heads trained for different tasks, such as forecasting member preferences across genres. These can be applied directly to diverse business needs, simplifying the production architecture by consolidating specialized models.

**Embedding Utilization**: The model generates member and entity embeddings (videos, games, genres) calculated in batch jobs and stored for offline and online applications. These serve as features in other models or enable candidate generation, such as retrieving appealing titles for a user; high-quality title embeddings also support title-to-title recommendations. However, a critical production challenge emerges: embedding spaces have arbitrary, uninterpretable dimensions and are incompatible across training runs.
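Stepping back to the cold-start mixing described under "Handling Unseen Entities": the attention-based mixing layer isn't specified in detail, but its effect can be approximated by a scalar, age-dependent gate (the gating formula and `tau` time constant here are invented for illustration):

```python
import numpy as np

def mix_embeddings(id_emb, meta_emb, age_days, tau=30.0):
    """Simplified stand-in for the age-weighted mixing layer: a scalar
    gate shifts weight from the metadata-based embedding toward the
    learned ID embedding as a title accumulates interaction history.
    tau is a hypothetical time constant controlling the transition."""
    alpha = age_days / (age_days + tau)  # 0 at launch, approaches 1 with age
    return alpha * id_emb + (1.0 - alpha) * meta_emb
```

A pre-launch title (age 0) is represented purely by its metadata embedding; a long-established title is represented almost entirely by its interaction-trained ID embedding.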
These run-to-run incompatibilities create significant operational burden for downstream consumers, who must adapt to each retraining and redeployment and risk bugs from invalidated assumptions about embedding structure. Netflix addresses this by applying an orthogonal low-rank transformation to stabilize the user/item embedding space, ensuring that dimensions keep consistent meaning across foundation model retraining and redeployment—a crucial production stability consideration often overlooked in foundation model discussions.

**Fine-Tuning for Specialized Applications**: Teams can integrate the full model or subgraphs of it into their own models and fine-tune them with application-specific data, using less data and compute than training from scratch while matching the performance of previous models. This democratizes access to foundation model capabilities across Netflix's organization, despite the significant resources required to train the initial model.

## Scaling and Production Performance

Netflix's experiments confirm that scaling laws apply to recommendation foundation models, with consistent improvements as data and model size increase. The case study presents empirical evidence of the relationship between model parameter count and relative performance improvement, demonstrating the scaling law in recommendation modeling. This is important production validation that investment in larger models and more data yields predictable returns. Successful scaling requires robust evaluation that effectively differentiates model performance, efficient training algorithms, and substantial computing resources. Scaling encompasses data (user engagement, external reviews, multimedia assets, high-quality embeddings), model parameters, and context window length.
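The embedding-stabilization step can be illustrated with classic orthogonal Procrustes alignment—a full-rank sketch of the idea, since the low-rank variant Netflix actually uses isn't detailed in the case study: after retraining, the new embedding matrix is rotated so that it lines up as closely as possible with the previous run's space.

```python
import numpy as np

def orthogonal_align(new_emb, ref_emb):
    """Orthogonal Procrustes: find the orthogonal map Q minimizing
    ||new_emb @ Q - ref_emb||_F and apply it, so dimensions of the
    retrained space keep (approximately) the same meaning as the
    reference space that downstream consumers already depend on."""
    u, _, vt = np.linalg.svd(new_emb.T @ ref_emb)
    return new_emb @ (u @ vt)
```

Because Q is orthogonal, distances and inner products within the new space are preserved exactly; only the arbitrary orientation of the axes is fixed to match the reference run.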
## Critical Assessment and Production Considerations

While Netflix presents compelling results, the case study should be evaluated with several considerations in mind:

**Infrastructure Requirements**: The scale described—hundreds of billions of interactions from 300+ million users—represents infrastructure investment accessible primarily to large technology companies. How well the approach transfers to smaller organizations with fewer resources remains an open question the case study does not address.

**Claimed Benefits vs. Demonstrated Results**: The case study reports "promising results from downstream integrations" but provides limited quantitative metrics on actual production performance improvements, A/B testing results, or business impact. The scaling law plot shows relative improvement, but absolute performance gains aren't specified.

**Complexity Transfer**: While the foundation model consolidates multiple specialized models, reducing some maintenance burden, it introduces new complexities: managing incremental training pipelines, keeping embedding spaces stable across versions, coordinating deployments across downstream consumers, and maintaining substantially larger computational infrastructure. Whether net operational complexity decreases isn't thoroughly addressed.

**Metadata Dependency Trade-offs**: The cold-start solution relying on metadata embeddings assumes high-quality, comprehensive metadata exists for all content. Its effectiveness for truly novel content without clear metadata parallels, or in domains with sparse metadata, isn't explored.

**Latency Constraints**: The millisecond-level latency requirements drive significant architectural decisions (sparse attention, limited context windows, KV caching).
These constraints may limit the model's ability to exercise its full capacity compared to less latency-constrained applications, potentially reducing the benefits of scaling relative to LLMs in other domains.

**Embedding Stability Solution**: While the orthogonal low-rank transformation is presented as solving downstream compatibility issues, the technical details of the approach and its impact on model quality aren't elaborated. This is a critical production detail that would benefit from deeper exploration.

The case study is an important contribution to understanding foundation model deployment in production recommendation systems, demonstrating the practical application of LLM paradigms to a non-language domain while highlighting challenges unique to recommendation. Its explicit discussion of production constraints—latency requirements, cold-start challenges, and embedding stability—provides valuable insight for practitioners considering similar approaches, even as quantitative validation of the claimed benefits would strengthen the case.
