Company: Netflix
Title: Hierarchical Multi-Task Learning for Intent Prediction in Recommender Systems
Industry: Media & Entertainment
Year: 2025

Summary (short): Netflix developed FM-Intent, a novel recommendation model that enhances their existing foundation model by incorporating hierarchical multi-task learning to predict user session intent alongside next-item recommendations. The problem addressed was that while their foundation model successfully predicted what users might watch next, it lacked understanding of underlying user intents (such as discovering new content versus continuing existing viewing, genre preferences, and content type preferences). FM-Intent solves this by establishing a hierarchical relationship where intent predictions inform item recommendations, using Transformer encoders to process interaction metadata and attention-based aggregation to combine multiple intent signals. The solution demonstrated a statistically significant 7.4% improvement in next-item prediction accuracy compared to the previous state-of-the-art baseline (TransAct) in offline experiments, and has been successfully integrated into Netflix's production recommendation ecosystem for applications including personalized UI optimization, analytics, and enhanced recommendation signals.
## Overview

Netflix presents FM-Intent as an evolution of their recommendation foundation model that incorporates hierarchical multi-task learning to predict user session intent. This case study is particularly relevant to LLMOps as it demonstrates how Netflix operationalized a complex machine learning system that extends their foundation model capabilities in production. While the case study refers to a "foundation model," it's important to note that this appears to be a specialized recommendation model rather than a large language model in the traditional sense. However, the operational challenges and approaches described - particularly around multi-task learning, model architecture design, production integration, and evaluation - are highly relevant to understanding how sophisticated ML systems are deployed at scale.

The core motivation for FM-Intent stems from Netflix's recognition that their existing recommendation foundation model, while successful at predicting next-item interactions through large-scale learning from user interaction histories, could be enhanced by explicitly modeling user intent. The company identified that understanding not just what users might watch next, but why they are engaging with the platform - their underlying intent - could lead to more nuanced and effective recommendations. This represents a shift from purely behavioral prediction to incorporating cognitive modeling of user motivations.

## Problem Definition and User Intent Framework

Netflix's approach to user intent is grounded in observable interaction metadata within their ecosystem. The company identified several key dimensions of user intent that manifest through implicit signals. The action type dimension captures whether users intend to discover new content versus continue previously started content - for instance, when a member plays a follow-up episode of a show they were already watching, this signals "continue watching" intent.
Genre preference encompasses pre-defined labels such as Action, Thriller, or Comedy that indicate content preferences during a session, with Netflix noting that these preferences can shift significantly between sessions even for the same user. The movie versus show type dimension distinguishes whether users seek single longer viewing experiences or multiple shorter episodes. Finally, time-since-release captures whether users prefer newly released content, recent content released within weeks or months, or evergreen catalog titles. These dimensions serve as proxies for latent user intent that is not directly observable but crucial for relevant recommendations.

This framework demonstrates a thoughtful operationalization of an abstract concept - user intent - into measurable signals that can be incorporated into a production system. From an LLMOps perspective, this illustrates the importance of defining clear objectives and measurable proxies when extending model capabilities beyond straightforward prediction tasks.

## Model Architecture and Technical Implementation

FM-Intent employs a hierarchical multi-task learning architecture comprising three major components. The first component constructs rich input features by combining interaction metadata, with each interaction represented by combined categorical embeddings and numerical features to create comprehensive representations of user behavior. This feature engineering approach is critical for production systems, as it determines what signals the model can leverage.

The second component handles user intent prediction by processing input feature sequences through a Transformer encoder that generates predictions for multiple intent signals. The Transformer encoder effectively models long-term user interests through multi-head attention mechanisms, allowing the model to capture dependencies across user interaction sequences.
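To make the first two components concrete, here is a minimal numpy sketch of combining categorical embeddings with numerical features and encoding the interaction sequence with self-attention. All names, dimensions, and the bare single-head attention (no projections, no masking) are illustrative assumptions, not Netflix's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary sizes and dimensions -- illustrative only.
n_items, n_genres, d_emb, d_num = 100, 10, 16, 4
item_emb = rng.normal(size=(n_items, d_emb))
genre_emb = rng.normal(size=(n_genres, d_emb))

def build_features(item_ids, genre_ids, numeric):
    """Combine categorical embeddings with numerical features per interaction."""
    return np.concatenate(
        [item_emb[item_ids], genre_emb[genre_ids], numeric], axis=-1
    )

def self_attention(x):
    """Single-head self-attention over the interaction sequence -- a minimal
    stand-in for the Transformer encoder described in the case study."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

seq_len = 8
feats = build_features(
    rng.integers(0, n_items, seq_len),
    rng.integers(0, n_genres, seq_len),
    rng.normal(size=(seq_len, d_num)),
)
encoded = self_attention(feats)  # one contextual encoding per interaction
session_repr = encoded[-1]       # last position summarizes the session
```

In the real system the encoding would then feed multiple intent-prediction heads; here it simply produces one vector per interaction plus a session-level summary.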
For each prediction task, the intent encoding is transformed into prediction scores via fully-connected layers. A particularly notable innovation is the attention-based aggregation of individual intent predictions, which generates a comprehensive intent embedding that captures the relative importance of different intent signals for each user. This aggregation mechanism provides not just predictions but interpretable insights into what drives user behavior, which is valuable for both personalization and explanation capabilities.

The third component combines input features with the user intent embedding to make next-item recommendations. This hierarchical structure is where FM-Intent distinguishes itself from conventional multi-task learning approaches. Rather than treating intent prediction and item prediction as parallel tasks with shared representations, FM-Intent establishes a clear hierarchy where intent predictions are conducted first and their results feed into the next-item prediction task. This creates an explicit information flow that ensures next-item recommendations are informed by predicted user intent, establishing a more coherent recommendation pipeline.

From an LLMOps perspective, this hierarchical architecture introduces complexity in terms of training, inference, and monitoring. The model requires careful orchestration to ensure intent predictions are completed before item predictions, which has implications for latency and throughput in production serving. The authors note that FM-Intent uses a much smaller dataset for training compared to their production foundation model due to its complex hierarchical prediction architecture, suggesting that scaling these types of hierarchical models presents challenges that required tradeoffs between model sophistication and training data scale.

## Evaluation Methodology and Results

Netflix conducted comprehensive offline experiments on sampled user engagement data to evaluate FM-Intent's performance.
The evaluation compared FM-Intent against several state-of-the-art sequential recommendation models including LSTM, GRU, Transformer, TransAct, and their production model baseline (referred to as FM-Intent-V0). The authors note that they added fully-connected layers to LSTM, GRU, and Transformer baselines to enable intent prediction, while using original implementations for other baselines. This modification of baselines to support comparable functionality is an important methodological decision that ensures fair comparison.

The results showed that FM-Intent demonstrated a statistically significant 7.4% improvement in next-item prediction accuracy compared to the best baseline (TransAct). The authors note that most baseline models showed limited performance because they either cannot predict user intent or cannot incorporate intent predictions into next-item recommendations. Importantly, the production model baseline (FM-Intent-V0) performed well but lacked the ability to predict and leverage user intent. The authors clarify that FM-Intent-V0 was trained with a smaller dataset for fair comparison with other models, while the actual production model is trained with a much larger dataset.

This evaluation approach demonstrates several important practices for production ML systems. First, the comparison against both research baselines and the existing production system provides context for both scientific advancement and practical business value. Second, the acknowledgment that comparison models were trained on smaller datasets for fairness, while noting that production models use larger datasets, illustrates the gap between experimental and production settings that LLMOps practitioners must navigate. Third, the emphasis on statistical significance suggests rigorous experimental methodology, though the case study does not provide details on confidence intervals, test set sizes, or multiple comparison corrections.
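The case study does not specify which accuracy metric Netflix used, but offline next-item evaluation in sequential recommendation is commonly reported as hit-rate@k over held-out interactions. A minimal sketch, with the metric choice and all values assumed for illustration:

```python
import numpy as np

def hit_rate_at_k(scores, true_items, k=10):
    """Fraction of test sessions whose held-out next item appears among the
    model's top-k scored items. scores has shape (n_sessions, n_items)."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == np.asarray(true_items)[:, None]).any(axis=1)
    return float(hits.mean())

# Tiny worked example: 3 sessions, 5 candidate items.
scores = np.array([[0.1, 0.9, 0.2, 0.0, 0.3],
                   [0.5, 0.1, 0.4, 0.8, 0.2],
                   [0.3, 0.2, 0.9, 0.1, 0.6]])
print(hit_rate_at_k(scores, true_items=[1, 3, 4], k=2))  # all 3 held-out items rank in the top 2 -> 1.0
```

A relative improvement like the reported 7.4% would then be computed between two models' values of such a metric on the same held-out set.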
## Qualitative Analysis and Interpretability

Beyond quantitative metrics, Netflix conducted qualitative analysis of the user intent embeddings generated by FM-Intent. Using K-means++ clustering with K=10, they identified distinct clusters of users with similar intents. The visualization revealed meaningful user segments with distinct viewing patterns, including users who primarily discover new content versus those who continue watching recent or favorite content, genre enthusiasts such as anime or kids content viewers, and users with specific viewing patterns such as rewatchers versus casual viewers.

This clustering analysis serves multiple purposes in a production system. It provides interpretability and trust in the model's learned representations by demonstrating that the intent embeddings capture semantically meaningful patterns. It enables user segmentation for analytics and business intelligence, informing decisions about content acquisition and production. It also validates that the model is learning genuine patterns rather than spurious correlations. From an LLMOps perspective, this type of qualitative analysis is crucial for building confidence in complex models and identifying potential failure modes or biases that might not be apparent from aggregate metrics alone.

## Production Integration and Applications

Netflix reports that FM-Intent has been successfully integrated into their recommendation ecosystem and can be leveraged for several downstream applications. For personalized UI optimization, predicted user intent can inform the layout and content selection on the Netflix homepage, emphasizing different rows based on whether users are in discovery mode, continue-watching mode, or exploring specific genres. For analytics and user understanding, intent embeddings and clusters provide valuable insights into viewing patterns and preferences that inform content acquisition and production decisions.
As enhanced recommendation signals, intent predictions serve as features for other recommendation models, improving their accuracy and relevance. For search optimization, real-time intent predictions help prioritize search results based on the user's current session intent.

This multi-application approach is characteristic of how foundation model capabilities are operationalized in production. Rather than deploying FM-Intent as a single-purpose model, Netflix has positioned it as a platform capability that feeds into multiple downstream systems. This creates both opportunities and challenges from an LLMOps perspective. The opportunities include amortizing the cost of training and serving the model across multiple use cases and creating consistent user understanding across different product surfaces. The challenges include managing dependencies where changes to FM-Intent could impact multiple downstream systems, ensuring consistent serving latency across different applications with varying requirements, and monitoring model performance across diverse use cases that may have different success metrics.

## LLMOps Considerations and Tradeoffs

Several important LLMOps considerations emerge from this case study, though the authors do not explicitly discuss them in depth. The hierarchical architecture introduces sequential dependencies in the inference pipeline, where intent predictions must complete before item predictions can begin. This likely increases serving latency compared to simpler architectures, requiring careful optimization to meet production SLA requirements. The multi-task learning approach requires balancing multiple loss functions corresponding to different intent prediction tasks and next-item prediction, which introduces hyperparameter complexity and potential instability during training.
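The case study does not give FM-Intent's loss formulation, but a hierarchical multi-task objective is typically a weighted sum of per-task losses, which is where the hyperparameter balancing problem arises. A numpy sketch with entirely hypothetical task heads and weights:

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy of one example's logits against an integer class target."""
    z = logits - logits.max()                 # subtract max for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def multitask_loss(task_outputs, task_weights):
    """Weighted sum over (logits, target) pairs -- a single scalar objective
    that trades off the intent tasks against next-item prediction."""
    return sum(w * cross_entropy(logits, target)
               for (logits, target), w in zip(task_outputs, task_weights))

rng = np.random.default_rng(1)
outputs = [
    (rng.normal(size=2), 0),     # action type: discover vs. continue
    (rng.normal(size=20), 7),    # genre preference
    (rng.normal(size=2), 1),     # movie vs. show
    (rng.normal(size=500), 42),  # next-item prediction
]
loss = multitask_loss(outputs, task_weights=[0.2, 0.2, 0.1, 0.5])  # weights illustrative
```

Each weight is a hyperparameter; mis-set weights can let one task dominate the shared representation, which is exactly the training instability the text flags.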
The authors mention that FM-Intent uses a much smaller dataset compared to their production foundation model due to architectural complexity, suggesting that there are practical limits to model sophistication given computational constraints. This highlights a common tradeoff in production ML systems between model capability and operational feasibility. The case study does not detail how Netflix addressed this constraint - whether through model compression, distillation, more efficient architectures, or simply accepting higher training and serving costs for the benefits provided.

The integration of FM-Intent into multiple downstream applications creates a complex dependency graph that must be carefully managed. Changes to the model - whether retraining with new data, architectural modifications, or hyperparameter adjustments - could potentially impact personalized UI rendering, search results, other recommendation models, and analytics pipelines. This requires robust versioning, testing, and rollback capabilities. The case study mentions comprehensive offline experiments but does not discuss online A/B testing methodology, staged rollouts, or monitoring strategies for detecting production issues.

## Model Monitoring and Observability

While the case study does not explicitly discuss monitoring and observability, the multi-output nature of FM-Intent creates interesting monitoring challenges and opportunities. The model produces predictions for multiple intent dimensions (action type, genre preference, movie/show type, time-since-release) in addition to next-item recommendations. Each of these outputs could be monitored independently for distribution drift, prediction quality degradation, or anomalous patterns. The intent embeddings could be monitored for clustering stability over time, with significant changes in cluster composition potentially indicating shifts in user behavior or data quality issues.
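One lightweight instantiation of the distribution-drift checks described above is a Population Stability Index over a model output, such as a per-task intent score. The metric choice and the conventional ~0.2 alert threshold are assumptions for illustration, not details from the case study:

```python
import numpy as np

def psi(baseline, live, bins=10):
    """Population Stability Index between a baseline and a live distribution
    of some model output; values above ~0.2 are a common drift flag."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    lo, hi = edges[0], edges[-1]
    # Clip both samples into the baseline's range so every value lands in a bin.
    b = np.histogram(np.clip(baseline, lo, hi), edges)[0] / len(baseline) + 1e-6
    l = np.histogram(np.clip(live, lo, hi), edges)[0] / len(live) + 1e-6
    return float(np.sum((l - b) * np.log(l / b)))

rng = np.random.default_rng(0)
base = rng.normal(size=5000)
print(psi(base, base))              # identical distributions -> prints 0.0
print(psi(base, base + 1.0) > 0.2)  # a one-sigma shift flags as drift -> prints True
```

In a multi-output model like FM-Intent, a check like this would run per prediction head, so drift in one intent dimension can be localized before it degrades the downstream next-item task.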
The hierarchical architecture means that errors can compound - if intent predictions degrade, this will likely impact next-item prediction quality even if the item prediction component itself is functioning correctly. This requires causal monitoring approaches that can attribute performance changes to specific model components. The qualitative clustering analysis performed during model development suggests that similar ongoing monitoring could be valuable in production, periodically re-clustering user intent embeddings to detect emerging user segments or shifts in existing segments.

## Scalability and Training Infrastructure

The training of FM-Intent on Netflix's user engagement data at scale presents significant infrastructure challenges, though the case study provides limited details on this aspect. The hierarchical multi-task learning approach requires careful orchestration of gradient computation and backpropagation through multiple stages. The Transformer encoder architecture, while powerful for modeling sequential behavior, is computationally expensive particularly for long user interaction sequences. The attention mechanism scales quadratically with sequence length, which could be prohibitive for users with very long viewing histories.

The authors note that FM-Intent uses smaller training datasets than their full production foundation model, suggesting that computational constraints influenced this decision. This raises questions about whether FM-Intent represents a parallel system to their main foundation model or an evolution that will eventually replace it. The engineering effort required to train, validate, and deploy a hierarchical multi-task model of this complexity is substantial, requiring coordination between research, engineering, and product teams.

## Generalization and Limitations

While the case study presents impressive results, it's important to consider limitations and areas where the approach might face challenges.
The evaluation is conducted entirely on Netflix data for Netflix's specific recommendation use case, so generalization to other domains or platforms is uncertain. Different platforms may have different manifestations of user intent that don't align with Netflix's framework of action type, genre preference, movie/show type, and time-since-release. The hierarchical architecture creates a strong inductive bias that intent should inform item recommendations, which seems intuitive but may not always hold. There could be cases where item-level signals provide better predictions independent of higher-level intent, and the hierarchical structure might constrain the model's ability to leverage these signals. The attention-based aggregation of intent predictions adds interpretability but introduces additional parameters and complexity that must be learned from data.

The case study acknowledges that FM-Intent was trained on smaller datasets than their production foundation model due to architectural complexity, raising questions about whether the performance gains would persist at larger scale or whether the complexity introduces optimization challenges that limit scalability. The 7.4% improvement over the TransAct baseline is statistically significant and likely meaningful for business metrics, but it's not clear how this translates to user-perceivable improvements in recommendation quality or engagement metrics.

## Conclusion and Industry Implications

Netflix's FM-Intent represents a sophisticated approach to enhancing recommendation systems through hierarchical multi-task learning for intent prediction. The case study demonstrates how a major streaming platform operationalized a complex ML architecture that goes beyond simple next-item prediction to model underlying user motivations. The successful production integration across multiple downstream applications illustrates the value of building platform capabilities that can be leveraged across different product surfaces.
From an LLMOps perspective, this case study highlights several important themes. The balance between model sophistication and operational feasibility is a constant consideration, with architectural complexity imposing real constraints on training data scale and serving infrastructure. Multi-task and hierarchical architectures introduce coordination challenges in training, evaluation, and monitoring that require careful engineering. Qualitative analysis and interpretability are crucial complements to quantitative metrics for building confidence in complex models and enabling their effective use in downstream applications. Finally, production ML systems often serve as platforms feeding multiple use cases, which creates opportunities for impact but also dependencies that must be carefully managed.

While the improvements demonstrated are meaningful, practitioners should approach claims with appropriate skepticism and recognize that Netflix's scale, data, and engineering resources may not be representative of other organizations. The tradeoffs between model complexity and operational simplicity will differ across contexts, and simpler approaches may be more appropriate in many situations. Nonetheless, FM-Intent provides a valuable example of sophisticated ML system design and deployment at one of the world's leading recommendation platforms.
