ZenML

Integrating Foundation Models into Production Personalization Systems

Netflix 2025

Netflix developed a centralized foundation model for personalization to replace the multiple specialized models powering their homepage recommendations. Rather than maintaining numerous individual models, they created one powerful transformer-based model trained at scale on comprehensive user interaction histories and content data. The challenge then became how to integrate this large foundation model effectively into existing production systems. Netflix experimented with and deployed three distinct integration approaches: serving embeddings via an Embedding Store, using the model as a subgraph within downstream models, and fine-tuning it directly for specific applications. Each approach carries different tradeoffs in latency, computational cost, freshness, and implementation complexity, and all three are now used in production, matched to the specific requirements of different Netflix personalization use cases.

Industry

Media & Entertainment

Overall Summary

Netflix built a centralized foundation model for personalization with the goal of consolidating learning from multiple specialized models that previously powered different aspects of their homepage. The Netflix homepage traditionally relied on several specialized models, each requiring significant time and resources to maintain and improve. Their foundation model approach centralizes member preference learning by training one powerful transformer-based model on comprehensive user interaction histories and content data at large scale, then distributing its learnings to other models and applications.

This case study focuses specifically on the production integration challenges—a gap in the literature that Netflix identified. While there is extensive research on training and inference of large-scale transformer models, there is limited practical guidance on effectively integrating these models into existing production systems. Netflix experimented with three distinct integration patterns, each now used in production for different use cases based on varying application needs including latency requirements, tech stack constraints, and different levels of commitment to leveraging the full power of the foundation model.

Technical Architecture and Integration Approaches

Embedding-Based Integration

The first and most straightforward integration approach involves generating embeddings from the foundation model and serving them through Netflix’s Embedding Store infrastructure. The transformer architecture naturally produces comprehensive user profile and item representations. Netflix extracts the hidden state of the last user event as the profile embedding and weights from the item tower as item embeddings.
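
A minimal sketch of that read-out, using numpy arrays as stand-ins for the model's tensors (shapes and names are illustrative, not Netflix's actual code):

```python
import numpy as np

def extract_embeddings(hidden_states, item_tower_weights):
    """Read profile and item embeddings out of the model.

    hidden_states: (seq_len, d) per-event hidden states from the
    transformer; the hidden state of the last user event serves as
    the profile embedding.
    item_tower_weights: (num_items, d) weight matrix of the item
    tower, whose rows serve as item embeddings.
    """
    profile_embedding = hidden_states[-1]   # last user event
    item_embeddings = item_tower_weights    # one row per title
    return profile_embedding, item_embeddings

# Toy shapes: 10 events, model dimension 8, catalog of 5 titles.
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 8))
w = rng.normal(size=(5, 8))
profile, items = extract_embeddings(h, w)
```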

The production pipeline for this approach involves a sophisticated refresh cycle. The foundation model is pre-trained from scratch monthly, then fine-tuned daily based on the latest data. The daily fine-tuning process also expands the entity ID space to include newly launching titles. After the daily fine-tuned model is ready, batch inference runs to refresh profile and item embeddings, which are then published to the Embedding Store.

A critical technical innovation in this approach is embedding stabilization (detailed in their published paper). When pre-training retrains the model from scratch, embedding spaces between different runs become completely different due to random initialization. Additionally, embeddings drift during daily fine-tuning despite warm-starting from the previous day’s model. The stabilization technique maps embeddings generated each day into the same embedding space, enabling downstream models to consume pre-computed embeddings as features consistently.
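
The blog does not spell out the stabilization algorithm, but a common way to map one embedding space onto another is orthogonal Procrustes alignment over entities shared between runs. The sketch below is illustrative only and may differ from the technique in Netflix's paper:

```python
import numpy as np

def stabilize(new_emb, ref_emb):
    """Map today's embeddings into a reference embedding space.

    Solves the orthogonal Procrustes problem: find the rotation R
    minimizing ||new_emb @ R - ref_emb||_F, using embeddings of
    anchor entities present in both runs. The optimal R is U @ Vt,
    where U, S, Vt is the SVD of the cross-covariance matrix.
    """
    u, _, vt = np.linalg.svd(new_emb.T @ ref_emb)
    rotation = u @ vt
    return new_emb @ rotation, rotation

# Simulate a from-scratch retrain: same geometry, randomly rotated.
rng = np.random.default_rng(1)
ref = rng.normal(size=(100, 16))        # yesterday's embeddings
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
new = ref @ q                           # today's rotated embeddings
aligned, rot = stabilize(new, ref)      # aligned is back in ref space
```

Because the simulated drift is a pure rotation, alignment recovers the reference embeddings exactly; real drift also has a non-rotational component, so stabilization reduces rather than eliminates the discrepancy.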

The Embedding Store itself is a specialized feature store built by Netflix’s platform team. It handles versioning and timestamping of embeddings automatically and provides various interfaces for both offline and online access. This infrastructure makes producing and consuming embeddings straightforward for application teams.

The embedding approach offers several advantages. It provides a low barrier to entry for teams wanting to leverage the foundation model, as integrating embeddings as features into existing pipelines is well-supported. Compared to the other integration approaches, embeddings have a relatively small impact on training and inference costs. They can serve as powerful features for other models or for candidate generation, helping retrieve appealing titles for users or facilitating title-to-title recommendations.
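
As an illustration of embedding-based candidate generation, the sketch below ranks titles by dot-product similarity against a profile embedding. The data and function are hypothetical, and a production system would use an approximate nearest-neighbor index rather than brute-force scoring:

```python
import numpy as np

def top_k_titles(profile_emb, item_embs, k=3):
    """Return indices of the k items scoring highest against the
    profile embedding under dot-product similarity."""
    scores = item_embs @ profile_emb
    return np.argsort(scores)[::-1][:k]

# Toy 4-dim space with 6 titles; the profile points along axis 0.
profile = np.array([1.0, 0.0, 0.0, 0.0])
items = np.array([
    [0.9, 0.1, 0.0, 0.0],    # item 0: most similar
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],    # item 2: second
    [-1.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.8, 0.0],    # item 4: third
    [0.0, 0.0, 0.0, 1.0],
])
ranked = top_k_titles(profile, items, k=3)   # [0, 2, 4]
```

The same machinery applied with a title's embedding as the query instead of a profile embedding yields title-to-title recommendations.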

However, there are notable limitations. The time gap between embedding computation and downstream model inference introduces staleness, impacting recommendation freshness. This prevents applications from fully unlocking the foundation model’s benefits, particularly for use cases requiring real-time adaptability. While embeddings may not leverage the full power of the foundation model, they represent a pragmatic starting point.

Netflix learned that embeddings are a low-cost, high-leverage way of using the foundation model. The investment in resilient embedding generation frameworks and embedding stores proved so valuable that they expanded their infrastructure to build a near-real-time embedding generation framework. This new framework updates embeddings based on user actions during sessions, making embeddings and downstream models more adaptive. Though the near-real-time framework cannot handle very large models, it represents an important direction for addressing staleness and improving recommendation adaptiveness.

Subgraph Integration

The second approach uses the foundation model as a subgraph within the downstream model’s computational graph. The foundation model’s decoder stack becomes part of the application model’s full graph, processing raw user interaction sequences and outputting representations that feed into the downstream model.

This deeper integration allows applications to fine-tune the foundation model subgraph as part of their own training process, potentially achieving better performance than static embeddings. There is no time gap or staleness between foundation model inference and application model inference, ensuring the most up-to-date learnings are utilized. Applications can also leverage specific layers from the foundation model that may not be exposed through the Embedding Store, uncovering more application-specific value.

However, subgraph integration introduces significant complexities and tradeoffs. Application models must generate all features necessary for the subgraph as part of their feature generation process, adding time, compute, and complexity to their jobs. Merging the foundation model as a subgraph increases the application model size and inference time. To mitigate these challenges, the foundation model team provides reusable code and jobs that make feature generation more compute-efficient. For inference optimization, they split the subgraph to ensure it runs only once per profile per request and is shared across all items in the request.
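
The request-level split can be sketched as follows, with a toy stand-in for the decoder subgraph. The call counter verifies that the expensive user-side computation runs once per request regardless of how many candidate items are scored; all names and shapes here are hypothetical:

```python
import numpy as np

class FoundationSubgraph:
    """Stand-in for the foundation model's decoder stack: maps a raw
    interaction sequence to a profile representation. Illustrative
    only, not Netflix's actual model."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(d, d))
        self.calls = 0   # tracks how often the expensive path runs

    def __call__(self, event_seq):
        self.calls += 1
        return np.tanh(event_seq[-1] @ self.w)

def score_request(subgraph, event_seq, item_embs):
    """Split inference: run the user subgraph once per request, then
    score every candidate item against the cached representation."""
    profile = subgraph(event_seq)   # runs once, shared across items
    return item_embs @ profile      # cheap per-item scoring

rng = np.random.default_rng(2)
sub = FoundationSubgraph(d=8)
events = rng.normal(size=(20, 8))       # raw interaction sequence
candidates = rng.normal(size=(50, 8))   # 50 items in the request
scores = score_request(sub, events, candidates)
```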

Netflix positions this approach for high-impact use cases where metric improvements compensate for increased cost and complexity. It allows deeper integration and enables applications to harness the full power of the foundation model, but requires careful consideration of the tradeoff between metric wins, compute cost, and development time.

Direct Fine-Tuning Integration

The third approach resembles fine-tuning LLMs with domain-specific data. The foundation model is trained on a next-token prediction objective, with tokens representing different user interactions. Since different interactions have varying importance to different surfaces on Netflix’s website, the foundation model can be fine-tuned on product-specific data and used directly to power those products.

For example, the “Trending now” row might benefit from emphasizing recent interactions on trending titles over older interactions. During fine-tuning, application teams can choose full parameter fine-tuning or freeze certain layers. They can also add different output heads with different objectives. Netflix built a fine-tuning framework to make it easy for application teams to develop custom fine-tuned versions of the foundation model.
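
The choice between full-parameter fine-tuning and freezing layers, plus attaching a new task-specific output head, can be sketched as below. This is a toy illustration of the pattern, not Netflix's fine-tuning framework:

```python
import numpy as np

class FineTunableModel:
    """Toy stand-in for fine-tuning the foundation model: freeze the
    pretrained decoder and train only a new product-specific head.
    Names and shapes are illustrative assumptions."""
    def __init__(self, d, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        # Pretrained layers, frozen during this fine-tuning run.
        self.decoder = {"w": rng.normal(size=(d, d)), "trainable": False}
        # Fresh output head with a surface-specific objective.
        self.head = {"w": np.zeros((d, n_outputs)), "trainable": True}

    def trainable_params(self):
        """Only parameters flagged trainable receive gradient updates."""
        return [p for p in (self.decoder, self.head) if p["trainable"]]

    def forward(self, x):
        hidden = np.tanh(x @ self.decoder["w"])
        return hidden @ self.head["w"]

model = FineTunableModel(d=8, n_outputs=4)
params = model.trainable_params()   # only the head is trainable
```

Flipping the `trainable` flags corresponds to the full-parameter versus frozen-layer choice the text describes; adding further head entries corresponds to attaching additional objectives.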

This approach offers the ability to adapt the foundation model to application-specific data and objectives, optimizing it for particular use cases. A valuable side benefit is that it provides a de facto baseline for new models and applications. Instead of designing new model stacks and spending months on feature engineering, new applications can directly utilize fine-tuned foundation models.

The tradeoff is that this approach leads to more models and pipelines to maintain across the organization. The latency and Service Level Agreements (SLAs) of fine-tuned models must be carefully optimized for specific application use cases.

Production Operations and Infrastructure

Netflix’s approach demonstrates sophisticated MLOps infrastructure supporting multiple integration patterns. The monthly pre-training and daily fine-tuning cycle balances model freshness with computational efficiency. The batch inference pipeline for embedding generation operates at scale, refreshing embeddings daily for the entire Netflix catalog and user base.

The Embedding Store serves as critical infrastructure, handling the operational complexity of versioning, timestamping, and serving embeddings at scale with both offline and online interfaces. This abstraction allows application teams to focus on using embeddings rather than managing their lifecycle.

For subgraph integration, Netflix provides reusable components and optimized code paths to reduce the implementation burden on application teams. Splitting the subgraph and sharing computations across items in a request demonstrates practical inference optimization strategies for large models.

The fine-tuning framework represents another infrastructure investment that democratizes access to the foundation model’s capabilities. By providing standardized APIs and workflows, Netflix lowers the barrier for teams to experiment with and deploy fine-tuned versions.

Ongoing Innovation and Future Directions

Netflix continues to refine these integration approaches over time. The Machine Learning Platform team is developing near-real-time embedding inference capabilities to address staleness issues with the embedding approach, though this currently has limitations for very large models. They are also working on a smaller distilled version of the foundation model to reduce inference latency for the subgraph approach, demonstrating awareness of the latency-accuracy tradeoffs in production systems.

The company is refining and standardizing APIs used across these approaches to make them easier for application teams to adopt. This focus on developer experience and reducing integration friction is a key aspect of their LLMOps strategy.

Critical Assessment

The case study presents Netflix’s integration approaches in a generally positive light, which is expected from a technical blog post by the company. However, several aspects deserve balanced consideration.

The embedding staleness issue is acknowledged but may be more significant than suggested. Daily refresh cycles could be inadequate for capturing rapid shifts in user preferences or trending content, particularly during major content launches or cultural events. The near-real-time framework is positioned as a solution, but its inability to handle very large models is a substantial limitation that may restrict its applicability to the full foundation model.

The subgraph approach’s complexity is mentioned but potentially understated. The requirement for application teams to generate all features for the subgraph and manage increased model complexity could create significant technical debt and maintenance burden. The claim that reusable code mitigates this complexity needs validation through actual adoption metrics and developer feedback.

The fine-tuning approach’s proliferation of multiple fine-tuned models across the organization could lead to model sprawl, versioning challenges, and increased operational overhead. While positioned as providing a “de facto baseline,” this may actually lower the incentive for teams to develop truly specialized models when appropriate.

The lack of quantitative results is notable. The case study provides no concrete metrics on accuracy improvements, latency impacts, cost implications, or A/B test results comparing the three approaches. This makes it difficult to assess the actual production value delivered by each integration pattern.

The monthly pre-training from scratch seems computationally expensive and potentially wasteful. The rationale for this choice versus continual learning or less frequent retraining is not explained. Similarly, the daily fine-tuning cycle’s computational cost and environmental impact are not discussed.

That said, the case study provides valuable insights into practical production challenges that are indeed underrepresented in research literature. The three integration patterns represent pragmatic solutions to real constraints faced when deploying large models. The focus on infrastructure like the Embedding Store and fine-tuning framework demonstrates mature MLOps thinking. The acknowledgment of tradeoffs and the “no one-size-fits-all” philosophy is more nuanced than typical vendor claims.

Conclusion

This case study illustrates Netflix’s mature approach to integrating foundation models into production personalization systems. The three integration patterns—embeddings, subgraph, and fine-tuning—provide a framework for different use cases with varying requirements. The supporting infrastructure, including the Embedding Store, reusable components for subgraph integration, and fine-tuning frameworks, demonstrates significant investment in making these large models accessible and practical for application teams. While the lack of quantitative results and some potentially understated complexities limit the ability to fully assess the approach’s effectiveness, the focus on practical production challenges and the acknowledgment of tradeoffs provides valuable insights for organizations facing similar large model deployment challenges.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

Foundation Model for Unified Personalization at Scale

Netflix 2025

Netflix developed a unified foundation model based on transformer architecture to consolidate their diverse recommendation systems, which previously consisted of many specialized models for different content types, pages, and use cases. The foundation model uses autoregressive transformers to learn user representations from interaction sequences, incorporating multi-token prediction, multi-layer representation, and long context windows. By scaling from millions to billions of parameters over 2.5 years, they demonstrated that scaling laws apply to recommendation systems, achieving notable performance improvements while creating high leverage across downstream applications through centralized learning and easier fine-tuning for new use cases.

Multi-Agent System for Misinformation Detection and Correction at Scale

Meta 2025

This case study presents a sophisticated multi-agent LLM system designed to identify, correct, and find the root causes of misinformation on social media platforms at scale. The solution addresses the limitations of pre-LLM era approaches (content-only features, no real-time information, low precision/recall) by deploying specialized agents including an Indexer (for sourcing authentic data), Extractor (adaptive retrieval and reranking), Classifier (discriminative misinformation categorization), Corrector (reasoning and correction generation), and Verifier (final validation). The system achieves high precision and recall by orchestrating these agents through a centralized coordinator, implementing comprehensive logging, evaluation at both individual agent and system levels, and optimization strategies including model distillation, semantic caching, and adaptive retrieval. The approach prioritizes accuracy over cost and latency given the high stakes of misinformation propagation on platforms.
