ZenML

Integrating Foundation Models into Production Personalization Systems

Netflix 2025

Netflix developed a centralized foundation model for personalization to replace the multiple specialized models powering their homepage recommendations. Rather than maintaining numerous individual models, they created one powerful transformer-based model trained at scale on comprehensive user interaction histories and content data. The challenge then became how to integrate this large foundation model effectively into existing production systems. Netflix experimented with and deployed three distinct integration approaches: serving embeddings via an Embedding Store, using the model as a subgraph within downstream models, and fine-tuning it directly for specific applications. Each approach carries different tradeoffs in latency, computational cost, freshness, and implementation complexity, and all three are now used in production, matched to the specific requirements of different Netflix personalization use cases.

Industry

Media & Entertainment

Overall Summary

Netflix built a centralized foundation model for personalization with the goal of consolidating learning from multiple specialized models that previously powered different aspects of their homepage. The Netflix homepage traditionally relied on several specialized models, each requiring significant time and resources to maintain and improve. Their foundation model approach centralizes member preference learning by training one powerful transformer-based model on comprehensive user interaction histories and content data at large scale, then distributing its learnings to other models and applications.

This case study focuses specifically on the production integration challenges—a gap in the literature that Netflix identified. While there is extensive research on training and inference of large-scale transformer models, there is limited practical guidance on effectively integrating these models into existing production systems. Netflix experimented with three distinct integration patterns, each now used in production for different use cases based on varying application needs including latency requirements, tech stack constraints, and different levels of commitment to leveraging the full power of the foundation model.

Technical Architecture and Integration Approaches

Embedding-Based Integration

The first and most straightforward integration approach involves generating embeddings from the foundation model and serving them through Netflix’s Embedding Store infrastructure. The transformer architecture naturally produces comprehensive user profile and item representations. Netflix extracts the hidden state of the last user event as the profile embedding and weights from the item tower as item embeddings.
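
A minimal sketch of that read-out, using numpy arrays as stand-ins for the model's tensors (shapes and names are illustrative, not Netflix's actual code):

```python
import numpy as np

def extract_embeddings(hidden_states, item_tower_weights):
    """Read profile and item embeddings out of the model.

    hidden_states: (seq_len, d) per-event hidden states from the
    transformer; the hidden state of the last user event serves as
    the profile embedding.
    item_tower_weights: (num_items, d) weight matrix of the item
    tower, whose rows serve as item embeddings.
    """
    profile_embedding = hidden_states[-1]   # last user event
    item_embeddings = item_tower_weights    # one row per title
    return profile_embedding, item_embeddings

# Toy shapes: 10 events, model dimension 8, catalog of 5 titles.
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 8))
w = rng.normal(size=(5, 8))
profile, items = extract_embeddings(h, w)
```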

The production pipeline for this approach involves a sophisticated refresh cycle. The foundation model is pre-trained from scratch monthly, then fine-tuned daily based on the latest data. The daily fine-tuning process also expands the entity ID space to include newly launching titles. After the daily fine-tuned model is ready, batch inference runs to refresh profile and item embeddings, which are then published to the Embedding Store.

A critical technical innovation in this approach is embedding stabilization (detailed in their published paper). When pre-training retrains the model from scratch, embedding spaces between different runs become completely different due to random initialization. Additionally, embeddings drift during daily fine-tuning despite warm-starting from the previous day’s model. The stabilization technique maps embeddings generated each day into the same embedding space, enabling downstream models to consume pre-computed embeddings as features consistently.
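
The blog does not spell out the stabilization algorithm, but a common way to map one embedding space onto another is orthogonal Procrustes alignment over entities shared between runs. The sketch below is illustrative only and may differ from the technique in Netflix's paper:

```python
import numpy as np

def stabilize(new_emb, ref_emb):
    """Map today's embeddings into a reference embedding space.

    Solves the orthogonal Procrustes problem: find the rotation R
    minimizing ||new_emb @ R - ref_emb||_F, using embeddings of
    anchor entities present in both runs. The optimal R is U @ Vt,
    where U, S, Vt is the SVD of the cross-covariance matrix.
    """
    u, _, vt = np.linalg.svd(new_emb.T @ ref_emb)
    rotation = u @ vt
    return new_emb @ rotation, rotation

# Simulate a from-scratch retrain: same geometry, randomly rotated.
rng = np.random.default_rng(1)
ref = rng.normal(size=(100, 16))        # yesterday's embeddings
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
new = ref @ q                           # today's rotated embeddings
aligned, rot = stabilize(new, ref)      # aligned is back in ref space
```

Because the simulated drift is a pure rotation, alignment recovers the reference embeddings exactly; real drift also has a non-rotational component, so stabilization reduces rather than eliminates the discrepancy.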

The Embedding Store itself is a specialized feature store built by Netflix’s platform team. It handles versioning and timestamping of embeddings automatically and provides various interfaces for both offline and online access. This infrastructure makes producing and consuming embeddings straightforward for application teams.

The embedding approach offers several advantages. It provides a low barrier to entry for teams wanting to leverage the foundation model, as integrating embeddings as features into existing pipelines is well-supported. Compared to the other integration approaches, embeddings have a relatively small impact on training and inference costs. They can serve as powerful features for other models or for candidate generation, helping retrieve appealing titles for users or facilitating title-to-title recommendations.
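
As an illustration of embedding-based candidate generation, the sketch below ranks titles by dot-product similarity against a profile embedding. The data and function are hypothetical, and a production system would use an approximate nearest-neighbor index rather than brute-force scoring:

```python
import numpy as np

def top_k_titles(profile_emb, item_embs, k=3):
    """Return indices of the k items scoring highest against the
    profile embedding under dot-product similarity."""
    scores = item_embs @ profile_emb
    return np.argsort(scores)[::-1][:k]

# Toy 4-dim space with 6 titles; the profile points along axis 0.
profile = np.array([1.0, 0.0, 0.0, 0.0])
items = np.array([
    [0.9, 0.1, 0.0, 0.0],    # item 0: most similar
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],    # item 2: second
    [-1.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.8, 0.0],    # item 4: third
    [0.0, 0.0, 0.0, 1.0],
])
ranked = top_k_titles(profile, items, k=3)   # [0, 2, 4]
```

The same machinery applied with a title's embedding as the query instead of a profile embedding yields title-to-title recommendations.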

However, there are notable limitations. The time gap between embedding computation and downstream model inference introduces staleness, impacting recommendation freshness. This prevents applications from fully unlocking the foundation model’s benefits, particularly for use cases requiring real-time adaptability. While embeddings may not leverage the full power of the foundation model, they represent a pragmatic starting point.

Netflix learned that embeddings are a low-cost, high-leverage way of using the foundation model. The investment in resilient embedding generation frameworks and embedding stores proved so valuable that they expanded their infrastructure to build a near-real-time embedding generation framework. This new framework updates embeddings based on user actions during sessions, making embeddings and downstream models more adaptive. Though the near-real-time framework cannot handle very large models, it represents an important direction for addressing staleness and improving recommendation adaptiveness.

Subgraph Integration

The second approach uses the foundation model as a subgraph within the downstream model’s computational graph. The foundation model’s decoder stack becomes part of the application model’s full graph, processing raw user interaction sequences and outputting representations that feed into the downstream model.

This deeper integration allows applications to fine-tune the foundation model subgraph as part of their own training process, potentially achieving better performance than static embeddings. There is no time gap or staleness between foundation model inference and application model inference, ensuring the most up-to-date learnings are utilized. Applications can also leverage specific layers from the foundation model that may not be exposed through the Embedding Store, uncovering more application-specific value.

However, subgraph integration introduces significant complexities and tradeoffs. Application models must generate all features necessary for the subgraph as part of their feature generation process, adding time, compute, and complexity to their jobs. Merging the foundation model as a subgraph increases the application model size and inference time. To mitigate these challenges, the foundation model team provides reusable code and jobs that make feature generation more compute-efficient. For inference optimization, they split the subgraph to ensure it runs only once per profile per request and is shared across all items in the request.
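
The request-level split can be sketched as follows, with a toy stand-in for the decoder subgraph. The call counter verifies that the expensive user-side computation runs once per request regardless of how many candidate items are scored; all names and shapes here are hypothetical:

```python
import numpy as np

class FoundationSubgraph:
    """Stand-in for the foundation model's decoder stack: maps a raw
    interaction sequence to a profile representation. Illustrative
    only, not Netflix's actual model."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(d, d))
        self.calls = 0   # tracks how often the expensive path runs

    def __call__(self, event_seq):
        self.calls += 1
        return np.tanh(event_seq[-1] @ self.w)

def score_request(subgraph, event_seq, item_embs):
    """Split inference: run the user subgraph once per request, then
    score every candidate item against the cached representation."""
    profile = subgraph(event_seq)   # runs once, shared across items
    return item_embs @ profile      # cheap per-item scoring

rng = np.random.default_rng(2)
sub = FoundationSubgraph(d=8)
events = rng.normal(size=(20, 8))       # raw interaction sequence
candidates = rng.normal(size=(50, 8))   # 50 items in the request
scores = score_request(sub, events, candidates)
```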

Netflix positions this approach for high-impact use cases where metric improvements compensate for increased cost and complexity. It allows deeper integration and enables applications to harness the full power of the foundation model, but requires careful consideration of the tradeoff between metric wins, compute cost, and development time.

Direct Fine-Tuning Integration

The third approach resembles fine-tuning LLMs with domain-specific data. The foundation model is trained on a next-token prediction objective, with tokens representing different user interactions. Since different interactions have varying importance to different surfaces on Netflix’s website, the foundation model can be fine-tuned on product-specific data and used directly to power those products.

For example, the “Trending now” row might benefit from emphasizing recent interactions on trending titles over older interactions. During fine-tuning, application teams can choose full parameter fine-tuning or freeze certain layers. They can also add different output heads with different objectives. Netflix built a fine-tuning framework to make it easy for application teams to develop custom fine-tuned versions of the foundation model.
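
The choice between full-parameter fine-tuning and freezing layers, plus attaching a new task-specific output head, can be sketched as below. This is a toy illustration of the pattern, not Netflix's fine-tuning framework:

```python
import numpy as np

class FineTunableModel:
    """Toy stand-in for fine-tuning the foundation model: freeze the
    pretrained decoder and train only a new product-specific head.
    Names and shapes are illustrative assumptions."""
    def __init__(self, d, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        # Pretrained layers, frozen during this fine-tuning run.
        self.decoder = {"w": rng.normal(size=(d, d)), "trainable": False}
        # Fresh output head with a surface-specific objective.
        self.head = {"w": np.zeros((d, n_outputs)), "trainable": True}

    def trainable_params(self):
        """Only parameters flagged trainable receive gradient updates."""
        return [p for p in (self.decoder, self.head) if p["trainable"]]

    def forward(self, x):
        hidden = np.tanh(x @ self.decoder["w"])
        return hidden @ self.head["w"]

model = FineTunableModel(d=8, n_outputs=4)
params = model.trainable_params()   # only the head is trainable
```

Flipping the `trainable` flags corresponds to the full-parameter versus frozen-layer choice the text describes; adding further head entries corresponds to attaching additional objectives.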

This approach offers the ability to adapt the foundation model to application-specific data and objectives, optimizing it for particular use cases. A valuable side benefit is that it provides a de facto baseline for new models and applications. Instead of designing new model stacks and spending months on feature engineering, new applications can directly utilize fine-tuned foundation models.

The tradeoff is that this approach leads to more models and pipelines to maintain across the organization. The latency and Service Level Agreements (SLAs) of fine-tuned models must be carefully optimized for specific application use cases.

Production Operations and Infrastructure

Netflix’s approach demonstrates sophisticated MLOps infrastructure supporting multiple integration patterns. The monthly pre-training and daily fine-tuning cycle balances model freshness with computational efficiency. The batch inference pipeline for embedding generation operates at scale, refreshing embeddings daily for the entire Netflix catalog and user base.

The Embedding Store serves as critical infrastructure, handling the operational complexity of versioning, timestamping, and serving embeddings at scale with both offline and online interfaces. This abstraction allows application teams to focus on using embeddings rather than managing their lifecycle.

For subgraph integration, Netflix provides reusable components and optimized code paths to reduce the implementation burden on application teams. Splitting the subgraph and sharing computations across items in a request demonstrates practical inference optimization strategies for large models.

The fine-tuning framework represents another infrastructure investment that democratizes access to the foundation model’s capabilities. By providing standardized APIs and workflows, Netflix lowers the barrier for teams to experiment with and deploy fine-tuned versions.

Ongoing Innovation and Future Directions

Netflix continues to refine these integration approaches over time. The Machine Learning Platform team is developing near-real-time embedding inference capabilities to address staleness issues with the embedding approach, though this currently has limitations for very large models. They are also working on a smaller distilled version of the foundation model to reduce inference latency for the subgraph approach, demonstrating awareness of the latency-accuracy tradeoffs in production systems.

The company is refining and standardizing APIs used across these approaches to make them easier for application teams to adopt. This focus on developer experience and reducing integration friction is a key aspect of their LLMOps strategy.

Critical Assessment

The case study presents Netflix’s integration approaches in a generally positive light, which is expected from a technical blog post by the company. However, several aspects deserve balanced consideration.

The embedding staleness issue is acknowledged but may be more significant than suggested. Daily refresh cycles could be inadequate for capturing rapid shifts in user preferences or trending content, particularly during major content launches or cultural events. The near-real-time framework is positioned as a solution, but its inability to handle very large models is a substantial limitation that may restrict its applicability to the full foundation model.

The subgraph approach’s complexity is mentioned but potentially understated. The requirement for application teams to generate all features for the subgraph and manage increased model complexity could create significant technical debt and maintenance burden. The claim that reusable code mitigates this complexity needs validation through actual adoption metrics and developer feedback.

The fine-tuning approach’s proliferation of multiple fine-tuned models across the organization could lead to model sprawl, versioning challenges, and increased operational overhead. While positioned as providing a “de facto baseline,” this may actually lower the incentive for teams to develop truly specialized models when appropriate.

The lack of quantitative results is notable. The case study provides no concrete metrics on accuracy improvements, latency impacts, cost implications, or A/B test results comparing the three approaches. This makes it difficult to assess the actual production value delivered by each integration pattern.

The monthly pre-training from scratch seems computationally expensive and potentially wasteful. The rationale for this choice versus continual learning or less frequent retraining is not explained. Similarly, the daily fine-tuning cycle’s computational cost and environmental impact are not discussed.

That said, the case study provides valuable insights into practical production challenges that are indeed underrepresented in research literature. The three integration patterns represent pragmatic solutions to real constraints faced when deploying large models. The focus on infrastructure like the Embedding Store and fine-tuning framework demonstrates mature MLOps thinking. The acknowledgment of tradeoffs and the “no one-size-fits-all” philosophy is more nuanced than typical vendor claims.

Conclusion

This case study illustrates Netflix’s mature approach to integrating foundation models into production personalization systems. The three integration patterns—embeddings, subgraph, and fine-tuning—provide a framework for different use cases with varying requirements. The supporting infrastructure, including the Embedding Store, reusable components for subgraph integration, and fine-tuning frameworks, demonstrates significant investment in making these large models accessible and practical for application teams. While the lack of quantitative results and some potentially understated complexities limit the ability to fully assess the approach’s effectiveness, the focus on practical production challenges and the acknowledgment of tradeoffs provides valuable insights for organizations facing similar large model deployment challenges.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Foundation Model for Unified Personalization at Scale

Netflix 2025

Netflix developed a unified foundation model based on transformer architecture to consolidate their diverse recommendation systems, which previously consisted of many specialized models for different content types, pages, and use cases. The foundation model uses autoregressive transformers to learn user representations from interaction sequences, incorporating multi-token prediction, multi-layer representation, and long context windows. By scaling from millions to billions of parameters over 2.5 years, they demonstrated that scaling laws apply to recommendation systems, achieving notable performance improvements while creating high leverage across downstream applications through centralized learning and easier fine-tuning for new use cases.

Multi-Agent System for Misinformation Detection and Correction at Scale

Meta 2025

This case study presents a sophisticated multi-agent LLM system designed to identify, correct, and find the root causes of misinformation on social media platforms at scale. The solution addresses the limitations of pre-LLM era approaches (content-only features, no real-time information, low precision/recall) by deploying specialized agents including an Indexer (for sourcing authentic data), Extractor (adaptive retrieval and reranking), Classifier (discriminative misinformation categorization), Corrector (reasoning and correction generation), and Verifier (final validation). The system achieves high precision and recall by orchestrating these agents through a centralized coordinator, implementing comprehensive logging, evaluation at both individual agent and system levels, and optimization strategies including model distillation, semantic caching, and adaptive retrieval. The approach prioritizes accuracy over cost and latency given the high stakes of misinformation propagation on platforms.
