LinkedIn developed a large foundation model called "Brew XL" with 150 billion parameters to unify all personalization and recommendation tasks across their platform, addressing the limitations of task-specific models that operate in silos. The solution involved training a massive language model on user interaction data through "promptification" techniques, then distilling it down to smaller, production-ready models (3B parameters) that could serve high-QPS recommendation systems with sub-second latency. The system demonstrated zero-shot capabilities for new tasks, improved performance on cold-start users, and achieved 7x latency reduction with 30x throughput improvement through optimization techniques including distillation, pruning, quantization, and sparsification.
LinkedIn developed a comprehensive LLMOps solution to address the fragmentation and inefficiencies of their existing recommendation systems. The company recognized that their traditional approach of building separate, task-specific models for different surfaces (feed recommendations, job matching, search, etc.) was time-consuming, resource-intensive, and prevented leveraging the most advanced architectures across all use cases. The initiative aimed to create a single, unified foundation model that could handle all personalization tasks on the LinkedIn platform while providing zero-shot capabilities for new applications.
The technical challenge was substantial: LinkedIn needed to process tens of thousands of queries per second (QPS) with sub-second latency requirements (400-500 milliseconds), while maintaining the quality and personalization that users expect from recommendation systems. The solution involved building what they called “Brew XL,” a 150 billion parameter foundation model, and then creating a sophisticated production pipeline to distill and optimize it for real-world deployment.
The core innovation lies in what LinkedIn calls “promptification” - the process of converting all user interaction data, profiles, and contextual information into structured prompts that a language model can process. This approach transforms traditional recommendation problems into natural language tasks where the model receives instructions about what to recommend, information about the user’s profile and interaction history, and then generates predictions about user preferences or behaviors.
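A minimal sketch of what promptification could look like, assuming a plain-text template. The `Interaction` fields, the `promptify` function, and the template wording are all hypothetical illustrations; the source does not describe LinkedIn's actual prompt format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    """One past user action; the field names are illustrative assumptions."""
    timestamp: int   # e.g. Unix seconds
    item_text: str   # textual description of the item
    action: str      # e.g. "applied", "skipped"


def promptify(instruction: str, profile: str,
              history: List[Interaction], candidate: str) -> str:
    """Render one recommendation example as a plain-text prompt."""
    lines = [
        f"Instruction: {instruction}",
        f"Member profile: {profile}",
        "Past interactions:",
    ]
    # Present the history oldest-first so the most recent action is last.
    for event in sorted(history, key=lambda e: e.timestamp):
        lines.append(f"- {event.item_text} -> {event.action}")
    lines.append(f"Candidate: {candidate}")
    lines.append("Will the member engage? Answer:")
    return "\n".join(lines)


example_prompt = promptify(
    "Predict whether the member will apply to the candidate job.",
    "Data engineer, 5 years of experience",
    [Interaction(2, "ML engineer role", "skipped"),
     Interaction(1, "Analytics engineer role", "applied")],
    "Senior data engineer role",
)
```

The model's prediction would then be read from whatever it generates after the final `Answer:` line.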
The development pipeline follows a sophisticated multi-stage approach starting with open-source models as the foundation. The team implements “upcycling” techniques to control model size and balance throughput versus quality trade-offs. The training process involves continuous pre-training, fine-tuning, instruction fine-tuning, and alignment phases to create the large Brew XL model with 150 billion parameters.
One of the most critical insights from LinkedIn’s experience is that the “go big then go small” approach is essential for success. They extensively tested whether smaller models could be trained directly from scratch but found that this approach consistently underperformed. The large model serves as a teacher that possesses the reasoning capacity and world knowledge necessary to handle complex personalization tasks, which can then be distilled into smaller, more efficient models.
The distillation process represents a significant LLMOps challenge and innovation. Rather than attempting to distill directly from the 150B parameter model to the final production size, LinkedIn implements a gradual, step-by-step distillation approach. They systematically reduce model size through intermediate steps - from 150B to 8B, then to 3B, and finally to 1B parameters. This gradual approach proves much more effective than aggressive single-step distillation, maintaining model quality while achieving the necessary size reductions.
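The staged schedule can be written down alongside a standard soft-label distillation loss. This is a sketch under assumptions: the source does not say which loss LinkedIn uses, so the temperature-softened KL divergence below is the common textbook formulation, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float],
                      student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Teacher -> student sizes from the text, in billions of parameters.
# Each stage's student becomes the next stage's teacher.
DISTILLATION_STAGES = [(150, 8), (8, 3), (3, 1)]
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge; each stage in `DISTILLATION_STAGES` would minimize it with the previous stage's output model as the teacher.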
The distillation process is complemented by gradual pruning techniques. Pruning involves reducing the number of attention heads in transformers, decreasing MLP layers, and optimizing the overall architecture based on the observation that transformer models contain significant redundancy. However, aggressive pruning early in the process leads to substantial performance degradation (up to 1% quality loss), so LinkedIn implements an iterative approach: small pruning steps followed by distillation, repeated multiple times to achieve the desired compression without quality loss.
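The alternating prune-then-distill loop might be sketched as follows. The importance scores, the `recover` callback standing in for a distillation pass, and both function names are assumptions for illustration; the source does not specify how LinkedIn ranks heads or layers for removal.

```python
from typing import Callable, List


def prune_lowest(importance: List[float], n_remove: int) -> List[int]:
    """Return positions kept after dropping the n_remove least important entries."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    dropped = set(ranked[:n_remove])
    return [i for i in range(len(importance)) if i not in dropped]


def gradual_prune(importance: List[float], target_heads: int, step: int,
                  recover: Callable[[List[int]], None]) -> List[int]:
    """Alternate small pruning steps with a recovery (distillation) phase."""
    kept = list(range(len(importance)))
    while len(kept) > target_heads:
        n_remove = min(step, len(kept) - target_heads)
        scores = [importance[i] for i in kept]
        survivors = prune_lowest(scores, n_remove)
        kept = [kept[j] for j in survivors]
        # Placeholder: a real pipeline would distill into the pruned model
        # here and likely re-estimate importance before the next step.
        recover(kept)
    return kept


rounds: List[List[int]] = []
final_heads = gradual_prune([0.9, 0.1, 0.5, 0.3, 0.7, 0.2],
                            target_heads=2, step=2, recover=rounds.append)
```

With six heads, a target of two, and a step of two, the loop runs two prune-and-recover rounds rather than removing four heads at once.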
LinkedIn implements sophisticated quantization strategies using FP8 precision for both model parameters and activations. However, they discovered that applying FP8 quantization uniformly across all layers significantly degrades model performance. Their solution involves mixed precision approaches where different components of the model operate at different precision levels based on their sensitivity to quantization effects.
A critical insight for recommendation systems is that the language model head - the final layer that produces probability distributions over potential recommendations - must remain in FP32 precision. Using lower precision (FP16, BF16, or FP8) in this component causes numerical instability and poor calibration, making it impossible to effectively distinguish between different recommended items. This finding highlights the importance of understanding domain-specific requirements when optimizing LLMs for production deployment.
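One mechanism behind this sensitivity can be illustrated with plain rounding: scores that differ slightly in FP32 can collapse to the same value at half precision, which erases the ranking between the corresponding items. The logit values below are hypothetical, and this sketch shows only the rounding effect, not LinkedIn's own analysis.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value: float) -> float:
    """Round a Python float to IEEE 754 single precision and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Two hypothetical candidate logits that differ by 0.001.
logit_a, logit_b = 10.001, 10.002

# Between 8 and 16, FP16 values are spaced 2**-7 apart, so both logits
# round to the same number; FP32 spacing there is 2**-20, so they stay distinct.
same_in_fp16 = to_fp16(logit_a) == to_fp16(logit_b)
same_in_fp32 = to_fp32(logit_a) == to_fp32(logit_b)
```

Here `same_in_fp16` is true and `same_in_fp32` is false: the two candidates tie at half precision but remain ordered at single precision.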
Attention is the most computationally expensive component of transformers, so it is a central target of LinkedIn’s optimization strategy. They implement sparsification techniques based on the insight that not every token needs to attend to every other token, especially in recommendation contexts where the task structure is well understood.
For multi-item scoring scenarios, LinkedIn processes up to 500 candidate items simultaneously while preventing them from attending to each other through specialized attention masks. Items only attend to user profile information and interaction history, not to other candidate items. This approach requires custom kernel development in serving frameworks like SGLang and vLLM, demonstrating the level of systems engineering required for production LLM deployment.
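The masking rule described above can be sketched as a boolean matrix. This is an illustrative construction, assuming the shared context (profile plus history) comes first and the candidate items follow back to back; the function name and layout are hypothetical, and a production kernel would not materialize the mask this way.

```python
from typing import List


def multi_item_mask(context_len: int, item_lens: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    total = context_len + sum(item_lens)
    mask = [[False] * total for _ in range(total)]

    # Context tokens (profile + history) use ordinary causal attention.
    for q in range(context_len):
        for k in range(q + 1):
            mask[q][k] = True

    # Each candidate item sees the full context and, causally, only its own tokens.
    start = context_len
    for length in item_lens:
        for q in range(start, start + length):
            for k in range(context_len):
                mask[q][k] = True
            for k in range(start, q + 1):
                mask[q][k] = True
        start += length
    return mask


# Three context tokens followed by two candidate items of two tokens each.
mask = multi_item_mask(3, [2, 2])
```

In this example the second item's tokens (positions 5 and 6) can read positions 0 to 2 and their own span, but never positions 3 and 4, which belong to the first item. Because no item depends on another, each item's score should match what it would get if scored alone against the same context.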
The serving infrastructure must balance multiple constraints including maintaining KV cache efficiency, supporting high throughput requirements, and preserving the chronological ordering that proves most effective for recommendation tasks. While the team experimented with more sophisticated approaches like retrieval-augmented generation (RAG) to select relevant historical interactions, they found that chronological ordering combined with appropriate positive/negative sampling provides the best balance of effectiveness and serving efficiency.
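A minimal sketch of chronological history selection, assuming a simple cap on the number of events. The `recent_history` helper and the `(timestamp, text)` event shape are hypothetical, and the positive/negative sampling mentioned above is not modeled because the source gives no detail on it.

```python
from typing import List, Tuple


def recent_history(events: List[Tuple[int, str]], max_events: int) -> List[str]:
    """Keep the newest max_events (timestamp, text) pairs, ordered oldest first."""
    if max_events <= 0:
        return []
    ordered = sorted(events, key=lambda e: e[0])
    return [text for _, text in ordered[-max_events:]]


history = recent_history([(3, "viewed job C"), (1, "viewed job A"), (2, "applied to job B")], 2)
```

Unlike a retrieval step, this selection needs no per-request relevance scoring, which is consistent with the serving-efficiency argument in the text.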
LinkedIn’s results demonstrate significant improvements in several key areas. For cold-start users - those with limited interaction history - the foundation model approach shows increasing advantages as the number of user interactions decreases. This improvement stems from the model’s ability to leverage world knowledge and patterns learned from the broader user base, addressing one of the most challenging problems in recommendation systems.
The zero-shot generalization capabilities prove particularly valuable for rapid feature deployment. The model demonstrates competitive or superior performance on four completely out-of-domain tasks that were never seen during training, matching or exceeding the performance of task-specific production models. This capability significantly reduces the time-to-market for new recommendation features, as teams can leverage the foundation model immediately rather than collecting data and training specialized models.
LinkedIn’s comprehensive experimentation reveals several scaling relationships that inform LLMOps decisions. Data scaling shows consistent performance improvements as more historical interaction data is incorporated, suggesting that longer retention periods and more comprehensive logging strategies can directly improve model quality.
Model size scaling experiments using Mixtral architectures demonstrate clear performance gains when moving from 7B to 8x22B parameters, reinforcing the importance of the large teacher model in their distillation pipeline. Perhaps most importantly for recommendation systems, context length emerges as a critical factor. Longer contexts allow the model to consider more user interaction history, leading to better personalization. However, they observe performance degradation at very long contexts, likely due to the base model’s limited training on extended sequences rather than the information becoming less relevant.
The complexity of LinkedIn’s LLMOps pipeline necessitates extensive automation to make the system manageable and enable rapid experimentation. Their automation framework handles everything from model training and evaluation to result aggregation and reporting. When researchers modify quantization parameters or other optimization settings, the entire pipeline executes automatically, with results populated into shared tracking systems.
This automation proves crucial for the iterative optimization process required to balance the multiple constraints of production deployment: model quality, inference latency, throughput requirements, and resource costs. The team leverages open-source tools like Lightning and various serving frameworks but invests heavily in integration and workflow automation to create a cohesive development environment.
While LinkedIn presents impressive results, several challenges and limitations merit consideration. The approach requires substantial computational resources, both for training the large teacher models and for the extensive experimentation needed to optimize the distillation and serving pipeline. The 150 billion parameter teacher model represents a significant infrastructure investment that may not be accessible to all organizations.
The serving optimizations, while effective, require deep technical expertise and custom kernel development. The attention sparsification and mixed precision strategies are highly specialized and may not generalize easily to other domains or architectures. Additionally, the approach is specifically designed for LinkedIn’s user engagement patterns and data characteristics, and the generalization to other platforms or recommendation domains remains to be demonstrated.
The cold-start improvements, while significant, are demonstrated within LinkedIn’s ecosystem and user base. The effectiveness of the approach for entirely new domains or user populations outside the training distribution requires further validation. Similarly, the zero-shot capabilities are tested on tasks that, while not seen during training, remain within the general domain of professional networking and content recommendation.
LinkedIn’s LLMOps implementation represents a sophisticated approach to productionizing large language models for recommendation systems, but it requires significant investment in infrastructure, expertise, and ongoing optimization to achieve the reported performance improvements.