## Overview
Meta's Generative Ads Recommendation Model (GEM) represents a landmark implementation of LLM-scale foundation models applied to the recommendation systems domain, specifically for ads ranking and personalization across Facebook and Instagram. Published in November 2025, this case study illustrates how Meta adapted techniques typically reserved for language models to tackle the unique challenges of production ads recommendation systems serving billions of users daily. The deployment demonstrates sophisticated LLMOps practices including distributed training at massive scale, knowledge transfer mechanisms, and continuous online model updates—all while maintaining strict latency requirements for real-time ad serving.
The business context is compelling: Meta's ads recommendation system must process billions of user-ad interactions daily across multiple surfaces (Facebook Feed, Instagram, Business Messaging), learning from extremely sparse meaningful signals like clicks and conversions buried within vast amounts of impression data. Traditional recommendation models struggled to capture the complex, long-term user behavior patterns and cross-platform interactions necessary for optimal ad targeting. GEM addresses these limitations by operating at LLM scale—trained on thousands of GPUs, with architectural innovations that yield favorable scaling laws so that performance gains remain cost-effective as data and compute grow.
## Core Technical Challenges in Production
Meta frames three fundamental challenges that required rethinking their recommendation architecture from an LLMOps perspective. First, the model must handle a large, dynamic feature space spanning all Meta apps, where billions of daily interactions produce extremely sparse meaningful signals; it must generalize across diverse users and behaviors despite severe class imbalance. Second, it must process heterogeneous multimodal data: advertiser goals, creative formats (text, images, video), measurement signals, and user behaviors across multiple delivery channels. This heterogeneity adds significant modeling complexity and requires a unified representation of multi-source inputs. Third, it must train efficiently at scale, on thousands of GPUs, with advanced parallelism strategies and system-level optimization to keep hardware utilization cost-effective.
The sparsity challenge is particularly acute in ads recommendation compared to other domains. While billions of impressions occur daily, conversion events (purchases, sign-ups) represent tiny fractions of interactions. GEM must learn meaningful patterns from this imbalanced distribution while avoiding overfitting to noise. The multimodal complexity goes beyond typical recommendation systems—integrating not just user history but advertiser intent, creative content understanding, cross-platform behavioral patterns, and business outcomes measured across different time horizons.
## Architectural Innovations for Scalability
GEM's architecture represents a fundamental reimagining of recommendation model design, achieving a 4x efficiency improvement over Meta's original ads ranking models for a given amount of data and compute. The architecture divides features into two categories: sequence features (activity history over time) and non-sequence features (static attributes like age, location, and ad format). Customized attention mechanisms process each group independently while enabling cross-feature learning, improving both accuracy and scalability.
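To make the two-path design concrete, here is a minimal PyTorch sketch of the idea. All module names and dimensions are illustrative assumptions, not Meta's actual design: sequence features pass through self-attention, non-sequence features through a dense stack, and a cross-attention step lets the static attributes query the processed history.

```python
import torch
import torch.nn as nn

class TwoPathFeatureEncoder(nn.Module):
    """Hypothetical sketch: separate paths for sequence and non-sequence features."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Sequence path: self-attention over the user's event history.
        self.seq_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Non-sequence path: dense processing of static attributes
        # (age, location, ad format), embedded as a short "set" of tokens.
        self.static_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Cross-feature learning: static tokens attend to the event sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, seq_feats: torch.Tensor, static_feats: torch.Tensor):
        # seq_feats: (batch, seq_len, d_model); static_feats: (batch, n_static, d_model)
        seq_out, _ = self.seq_attn(seq_feats, seq_feats, seq_feats)
        static_out = self.static_mlp(static_feats)
        # Static attributes query the processed sequence for relevant history.
        crossed, _ = self.cross_attn(static_out, seq_out, seq_out)
        return seq_out, static_out + crossed

encoder = TwoPathFeatureEncoder()
seq = torch.randn(2, 50, 128)      # 50 behavioral events per user
static = torch.randn(2, 8, 128)    # 8 static attribute embeddings
seq_repr, static_repr = encoder(seq, static)
```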
For non-sequence feature interaction modeling, GEM enhances the "Wukong architecture" using stackable factorization machines with cross-layer attention connections. This design allows the model to learn which feature combinations matter most for prediction. Critically, each Wukong block scales both vertically (for deeper, more complex interactions) and horizontally (for broader feature coverage), enabling discovery of increasingly sophisticated user-ad patterns without architectural redesign. This scalability characteristic is essential for production deployment where feature spaces constantly expand as new ad formats, targeting options, and measurement signals are introduced.
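The published Wukong paper pairs a factorization-machine branch with a linear compression branch in each stackable block. The sketch below is a simplified rendering of that pattern, omitting the cross-layer attention connections mentioned above; all dimensions are chosen for illustration only.

```python
import torch
import torch.nn as nn

class WukongStyleBlock(nn.Module):
    """Simplified stackable interaction block, loosely after the Wukong paper."""
    def __init__(self, n_features: int, d_embed: int, k_fm: int = 16, k_lcb: int = 16):
        super().__init__()
        assert k_fm + k_lcb == n_features, "keep width stable so blocks stack residually"
        # FM branch: compress the n x n pairwise-interaction matrix,
        # then mix it back into k_fm output "features".
        self.compress = nn.Parameter(torch.randn(n_features, k_fm) * 0.02)
        self.fm_mlp = nn.Sequential(
            nn.LayerNorm(n_features * k_fm),
            nn.Linear(n_features * k_fm, k_fm * d_embed),
            nn.ReLU(),
        )
        # Linear compression branch: carries raw features past the FM branch.
        self.lcb = nn.Parameter(torch.randn(n_features, k_lcb) * 0.02)
        self.k_fm, self.d = k_fm, d_embed

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, n, d)
        b = x.shape[0]
        inter = x @ x.transpose(1, 2)                      # (b, n, n) pairwise dot products
        fm = self.fm_mlp((inter @ self.compress).flatten(1)).view(b, self.k_fm, self.d)
        lcb = (x.transpose(1, 2) @ self.lcb).transpose(1, 2)
        return torch.cat([fm, lcb], dim=1)                 # (b, k_fm + k_lcb, d)

# Vertical scaling = stacking more blocks; horizontal = raising k_fm / k_lcb.
n, d = 32, 64
blocks = nn.ModuleList(WukongStyleBlock(n, d) for _ in range(3))
x = torch.randn(8, n, d)
for blk in blocks:
    x = x + blk(x)                                         # residual connection
```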
The offline sequence feature modeling represents perhaps the most significant departure from traditional recommendation architectures. User behavior sequences can span thousands of events—clicks, views, time spent, scrolling patterns—across both organic content and ads. Traditional architectures struggle with such long sequences due to computational and memory constraints. GEM employs a pyramid-parallel structure, stacking multiple parallel interaction modules in a pyramid formation to capture complex user-ad relationships at scale. Meta built new, scalable offline feature infrastructure that processes sequences of thousands of events with minimal storage cost, enabling GEM to learn from a much longer history of user interactions. This extended temporal modeling helps uncover patterns throughout the user's purchase journey that shorter-context models miss entirely.
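Meta describes the pyramid-parallel structure only at a high level, so the following is one plausible reading rather than the actual design: each level runs several interaction modules in parallel over the sequence and then pools it, so higher (narrower) levels see coarser, longer-range structure. The module choice and pooling scheme are assumptions.

```python
import torch
import torch.nn as nn

class PyramidLevel(nn.Module):
    """One level: parallel interaction modules, then sequence downsampling."""
    def __init__(self, d_model: int, n_parallel: int, n_heads: int = 4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_parallel)
        )
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq_len, d_model)
        # Parallel interaction modules over the same sequence, averaged.
        x = torch.stack([b(x) for b in self.branches]).mean(dim=0)
        # Halve the sequence length before the next (narrower) level.
        return self.pool(x.transpose(1, 2)).transpose(1, 2)

# Pyramid: wide at the bottom (many parallel modules), narrow at the top.
levels = nn.Sequential(
    PyramidLevel(64, n_parallel=4),
    PyramidLevel(64, n_parallel=2),
    PyramidLevel(64, n_parallel=1),
)
events = torch.randn(2, 1024, 64)   # ~1k behavioral events per user
summary = levels(events)            # (2, 128, 64): progressively coarsened history
```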
The InterFormer component addresses a critical limitation in existing approaches: compressing user behavior sequences into compact vectors for downstream tasks risks losing engagement signals. GEM's InterFormer preserves full sequence information while enabling efficient cross-feature learning through parallel summarization with interleaving structure—alternating between sequence learning (via custom transformer architecture) and cross-feature interaction layers. This progressive refinement maintains access to the complete user journey, enabling efficient scaling to higher layer counts without losing critical behavioral signals. In production, this means the model can reason about both recent interactions and long-term behavioral patterns simultaneously when scoring each ad impression.
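A minimal sketch of the interleaving idea as described, with stand-in layer internals (not Meta's actual InterFormer): each block alternates a sequence-learning step with a bidirectional cross-feature step, and residual updates keep both streams at full resolution into the next block.

```python
import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """One interleaved step: sequence learning, then cross-feature exchange."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.seq_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Cross-feature exchange: features attend to the sequence and vice versa.
        self.feat_to_seq = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.seq_to_feat = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, seq: torch.Tensor, feats: torch.Tensor):
        seq = self.seq_layer(seq)                    # refine the event sequence
        f, _ = self.feat_to_seq(feats, seq, seq)     # features read the sequence
        s, _ = self.seq_to_feat(seq, feats, feats)   # sequence reads the features
        # Residual updates: both streams survive to the next block at full size,
        # so no one-shot compression discards engagement signals.
        return seq + s, feats + f

seq, feats = torch.randn(2, 500, 64), torch.randn(2, 8, 64)
for block in [InterleavedBlock() for _ in range(4)]:  # scales to more layers
    seq, feats = block(seq, feats)
```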
Multi-domain learning with domain-specific optimization tackles the challenge of learning across Meta's diverse surfaces (Facebook, Instagram, Business Messaging) which exhibit distinct user behaviors and interaction patterns. Traditional approaches either train isolated models per surface (losing cross-platform insights) or train unified models treating all surfaces identically (ignoring platform-specific behaviors). GEM learns from cross-surface user interactions while ensuring predictions remain tailored to each surface's unique characteristics. For example, insights from Instagram video ad engagement inform Facebook Feed ad predictions, while each domain's predictions optimize for surface-specific objectives like clicks versus conversions. This cross-domain transfer with domain adaptation is particularly relevant for LLMOps, as it demonstrates how foundation models can serve multiple downstream applications with varying objectives without requiring complete retraining.
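The standard way to realize this pattern is a shared trunk with domain-specific heads. The sketch below, with assumed shapes and domain names, shows how a gradient step on one surface also updates parameters shared with the others, which is the mechanism behind cross-surface transfer.

```python
import torch
import torch.nn as nn

DOMAINS = ["fb_feed", "instagram", "business_messaging"]

class MultiDomainModel(nn.Module):
    """Shared trunk + per-surface heads (shapes are illustrative)."""
    def __init__(self, d_in: int = 256, d_hidden: int = 128):
        super().__init__()
        # Shared trunk: cross-surface patterns live here.
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        # One head per surface, so predictions stay tailored to each domain's
        # objective (e.g., clicks on one surface, conversions on another).
        self.heads = nn.ModuleDict({d: nn.Linear(d_hidden, 1) for d in DOMAINS})

    def forward(self, x: torch.Tensor, domain: str) -> torch.Tensor:
        return self.heads[domain](self.trunk(x))

model = MultiDomainModel()
loss_fn = nn.BCEWithLogitsLoss()
# A gradient step on an Instagram batch also updates the shared trunk,
# so what it learns there informs Facebook Feed predictions too.
x, y = torch.randn(32, 256), torch.randint(0, 2, (32, 1)).float()
loss = loss_fn(model(x, "instagram"), y)
loss.backward()
```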
## Post-Training Knowledge Transfer at Scale
GEM delivers production impact only through efficient knowledge transfer to hundreds of user-facing vertical models (VMs). This aspect represents sophisticated LLMOps practice—the foundation model serves as a "teacher" that continuously improves downstream production models serving actual traffic. Meta employs both direct and hierarchical knowledge transfer strategies. Direct transfer propagates GEM's knowledge to major VMs within the same data spaces where GEM was trained. Hierarchical transfer distills knowledge from GEM into domain-specific foundation models, which then teach VMs, driving broad improvements across the entire ads model fleet. The combination achieves 2x the effectiveness of standard knowledge distillation—a significant accomplishment given the already mature state of distillation techniques in industry.
The knowledge distillation implementation addresses a critical production challenge: stale supervision caused by delays in foundation model training and evaluation, plus domain mismatches between GEM's predictions and VMs' surface-specific objectives. Such outdated or misaligned signals between teacher and student can degrade accuracy over time. Meta introduces a "Student Adapter" during training—a lightweight component that refines the teacher's outputs using the most recent ground-truth data. It learns a transformation that better aligns teacher predictions with observed outcomes, ensuring student models receive up-to-date, domain-relevant supervision throughout training. This adaptation mechanism is crucial for production systems where data distributions shift constantly due to seasonal patterns, trending content, and evolving user behaviors.
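Meta does not publish the adapter's internals, so the sketch below assumes one simple form: a small MLP that learns a correction on the teacher's logits against the freshest labels, whose adapted outputs then serve as the student's distillation targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentAdapter(nn.Module):
    """Assumed form: a small residual correction on the teacher's logit."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, teacher_logits: torch.Tensor) -> torch.Tensor:
        return teacher_logits + self.net(teacher_logits)

adapter, student = StudentAdapter(), nn.Linear(256, 1)
teacher_logits = torch.randn(32, 1)   # outputs of a frozen GEM forward pass
x, labels = torch.randn(32, 256), torch.randint(0, 2, (32, 1)).float()

# The adapter is fit against the freshest labels, re-aligning possibly stale
# teacher supervision; the student distills from the adapted targets plus
# the hard labels.
adapted = adapter(teacher_logits)
adapter_loss = F.binary_cross_entropy_with_logits(adapted, labels)
student_logits = student(x)
distill_loss = F.binary_cross_entropy_with_logits(
    student_logits, torch.sigmoid(adapted.detach())   # soft, adapted targets
)
hard_loss = F.binary_cross_entropy_with_logits(student_logits, labels)
(adapter_loss + distill_loss + hard_loss).backward()
```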
Representation learning complements distillation by generating semantically aligned features supporting efficient knowledge transfer. Rather than only transferring prediction targets (as in standard distillation), GEM transfers learned representations—intermediate embeddings capturing user intent, ad relevance signals, and cross-feature interactions. These rich representations enable VMs to leverage GEM's understanding without adding inference overhead. In production, this means vertical models can benefit from the foundation model's capacity and training data scale while maintaining strict latency requirements for real-time ad serving.
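One common mechanism for representation transfer, offered here as a hedged sketch rather than Meta's confirmed approach, is an auxiliary alignment loss that pulls an intermediate student embedding toward the teacher's precomputed embedding, with a projection bridging the capacity gap.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_student, d_teacher = 64, 512
student_encoder = nn.Linear(256, d_student)
proj = nn.Linear(d_student, d_teacher)        # bridges the dimension gap

x = torch.randn(32, 256)
teacher_emb = torch.randn(32, d_teacher)      # precomputed offline by the teacher

student_emb = student_encoder(x)
# Cosine alignment: the student mimics the direction of the teacher's
# representation, inheriting its semantics without running the teacher
# at serving time (so no added inference overhead).
align_loss = 1 - F.cosine_similarity(proj(student_emb), teacher_emb).mean()
align_loss.backward()
```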
Parameter sharing enables efficient knowledge reuse by allowing VMs to selectively incorporate components from foundation models. This lets smaller, latency-sensitive VMs leverage rich representations and pre-learned patterns without incurring full computational cost. The selective sharing is particularly important for Meta's deployment context where different surfaces and ad formats have varying latency budgets and computational constraints. Some VMs serving high-traffic surfaces might use only lightweight GEM components, while lower-traffic surfaces with more relaxed latency constraints might use deeper integration with GEM's architecture.
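In stock PyTorch, selective sharing can be as simple as loading a filtered state dict. The sketch below, with hypothetical module names, copies only a lightweight embedding component from a foundation model into a vertical model and freezes it, leaving the expensive deep stack behind.

```python
import torch
import torch.nn as nn

foundation = nn.ModuleDict({
    "embedding": nn.Embedding(10_000, 64),
    "deep_stack": nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)]),
})
vertical_model = nn.ModuleDict({
    "embedding": nn.Embedding(10_000, 64),   # same shape as the shared component
    "head": nn.Linear(64, 1),
})

# Copy only the shared component; the expensive deep stack stays behind.
shared = {k: v for k, v in foundation.state_dict().items()
          if k.startswith("embedding")}
vertical_model.load_state_dict(shared, strict=False)

# Optionally freeze the borrowed weights so fine-tuning touches only the head.
vertical_model["embedding"].weight.requires_grad_(False)
```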
## Training Infrastructure and LLMOps at Scale
GEM operates at a scale usually reserved for modern LLMs, requiring a complete overhaul of Meta's training infrastructure. The re-engineered training stack delivers a 23x increase in effective training FLOPs using 16x more GPUs while improving model FLOPS utilization (MFU) by 1.43x; the two factors are consistent, since 16 × 1.43 ≈ 23. This simultaneous improvement in throughput and efficiency is remarkable—typically, scaling to more GPUs introduces communication overhead that degrades per-GPU efficiency. The achievement reflects the sophisticated system-level optimization essential for cost-effective LLM-scale model training in production.
The distributed training strategy employs multi-dimensional parallelism carefully orchestrated across dense and sparse model components. Dense model parts (transformer layers, attention mechanisms) use Hybrid Sharded Distributed Parallel (HSDP), which optimizes memory usage and reduces communication costs, enabling efficient distribution of dense parameters across thousands of GPUs. Sparse components—primarily large embedding tables for user and item features—employ a two-dimensional approach combining data parallelism and model parallelism, optimized for synchronization efficiency and memory locality. The distinction is critical because recommendation models differ from pure language models in having massive embedding tables (billions of user IDs, item IDs, and categorical features) that don't fit in a single GPU's memory and exhibit different access patterns than dense layers.
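Meta's internal stack is not public, but the dense/sparse split can be approximated with public PyTorch APIs: FSDP's HYBRID_SHARD strategy shards dense parameters within a node and replicates across nodes, while a large embedding table is row-sharded by hand. The sketch assumes a torchrun launch with NCCL available, and omits the all-to-all exchange a real sharded lookup would need.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")   # assumes a torchrun multi-GPU launch
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Dense component: transformer-style layers, sharded within a node and
# replicated across nodes (an HSDP-like layout).
dense = nn.TransformerEncoderLayer(512, 8, batch_first=True).cuda()
dense = FSDP(dense, sharding_strategy=ShardingStrategy.HYBRID_SHARD)

# Sparse component: a (toy) embedding table too large for one GPU, split
# row-wise so each rank owns 1/world of the IDs (model parallelism).
total_ids = 1_000_000
local_shard = nn.Embedding(total_ids // world, 64).cuda()
```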
System-level optimizations focus on saturating GPU compute throughput and reducing training bottlenecks. Meta developed custom in-house GPU kernels designed for variable-length (jagged) user sequences and computation fusion, leveraging the latest GPU hardware features. The jagged tensor handling is essential for production recommendation systems where different users have vastly different interaction history lengths—a challenge not present in typical LLM training, where sequences are uniformly batched. Graph-level compilation in PyTorch 2.0 automates key optimizations, including activation checkpointing for memory savings and operator fusion for improved execution efficiency. Memory compression techniques such as FP8 quantization for activations and unified embedding formats reduce the memory footprint without significantly impacting model quality.
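Two of these optimizations have public-PyTorch analogues, sketched below: jagged user sequences represented as nested tensors (avoiding padding), and graph-level compilation plus activation checkpointing. Meta's custom kernels and FP8 paths are internal and not shown here.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Users have different history lengths; a jagged nested tensor stores the
# three sequences back-to-back instead of padding all to length 300.
histories = [torch.randn(n, 64) for n in (12, 300, 87)]   # 3 users
jagged = torch.nested.nested_tensor(histories, layout=torch.jagged)

mlp = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))

# Graph-level compilation: operator fusion and code generation.
compiled = torch.compile(mlp)
out = compiled(torch.randn(8, 64))

# Activation checkpointing: trade recompute for memory on deep stacks.
x = torch.randn(8, 64, requires_grad=True)
y = checkpoint(mlp, x, use_reentrant=False)
y.sum().backward()
```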
Particularly innovative is the development of GPU communication collectives operating without utilizing Streaming Multiprocessor (SM) resources via NCCLX (Meta's fork of NVIDIA's NCCL). This eliminates contention between communication and compute workloads, improving overlap and GPU utilization. In large-scale distributed training, communication overhead often becomes the bottleneck as model size and GPU count increase. By offloading communication from compute resources, Meta achieves better overlap—GPUs can continue computing while communication happens in parallel, improving overall throughput.
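NCCLX itself is internal, but the overlap pattern it improves can be illustrated with stock torch.distributed: issue a collective asynchronously, continue computing, and synchronize only when the result is needed. The sketch assumes a torchrun launch with an initialized NCCL process group.

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")   # assumes a torchrun multi-GPU launch
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

grads = torch.randn(4096, 4096, device=device)
work = dist.all_reduce(grads, async_op=True)   # communication starts now

# ...compute proceeds while the all-reduce is in flight...
activations = torch.randn(4096, 4096, device=device)
hidden = activations @ activations.T

work.wait()   # block only at the point where the reduced grads are needed
```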
Reducing training overhead and job startup time is crucial for maintaining high effective training time (ETT)—the proportion of training time spent processing new data versus initialization, checkpointing, and compilation. Meta reduced job startup time by 5x through optimizing trainer initialization, data reader setup, checkpointing, and PyTorch 2.0 compilation. Notably, PyTorch 2.0 compilation time was reduced by 7x via caching strategies. For production LLMOps, these optimizations directly impact iteration speed and cost—faster job startup means researchers can run more experiments per day and recover from failures more quickly.
GPU efficiency is optimized across all stages of the model lifecycle, demonstrating mature LLMOps practices. During the exploration phase, Meta accelerates iteration using lightweight model variants at much lower cost than full-sized models. These variants support over half of all experiments, enabling faster idea validation with minimal resource overhead. This multi-fidelity approach to experimentation is essential for cost-effective research at scale—most ideas don't pan out, so testing them on expensive full-scale models wastes resources. During post-training, the model runs forward passes to generate knowledge (labels, embeddings) for downstream models. Unlike typical LLM pipelines, Meta performs continuous online training to refresh the foundation model as new data arrives. Traffic sharing between training and post-training knowledge generation, and between the foundation model and downstream models, reduces computational demand. This continuous learning aspect distinguishes production recommendation systems from many LLM deployments where models are trained once and served without frequent updates.
## Production Deployment and Business Impact
GEM launched across Facebook and Instagram earlier in 2025, delivering measurable business impact: a 5% increase in ad conversions on Instagram and a 3% increase on Facebook Feed in Q2. These gains are substantial given the maturity and optimization of Meta's existing ads recommendation systems. For a system already serving billions of users and generating tens of billions in annual revenue, single-digit percentage improvements represent significant business value. In Q3, architectural improvements doubled the performance benefit obtained from a given amount of additional data and compute, enabling continued scaling at an attractive ROI. This improvement in scaling efficiency is perhaps more important than the absolute performance gains—it validates the investment in larger models and more training compute going forward.
The continuous online training aspect is particularly noteworthy from an LLMOps perspective. Unlike many foundation model deployments where models are trained offline and served statically, GEM requires continuous updates to remain effective as user behaviors, ad inventory, and advertiser objectives evolve. Meta's infrastructure supports this continuous training while managing the complexity of propagating updates across hundreds of downstream vertical models. The system must handle model versioning, gradual rollout, A/B testing, and rollback mechanisms—standard LLMOps concerns amplified by the scale and business criticality of the ads system.
The hierarchical model architecture with foundation models teaching domain-specific models which then teach vertical models creates a sophisticated dependency graph requiring careful orchestration. When GEM is updated, those changes must propagate through intermediate models to production-serving models, with each stage requiring validation, testing, and performance monitoring. The post-training knowledge transfer framework enables this propagation efficiently, but the operational complexity is substantial.
## Critical Assessment and Limitations
While Meta's presentation is impressive, several aspects warrant balanced assessment. First, the reported conversion increases (5% Instagram, 3% Facebook Feed) lack context about baseline performance, statistical significance, and measurement methodology. A/B testing ads systems is notoriously difficult due to network effects, spillover, and interference between test and control groups. The text doesn't detail the experimental design or confidence intervals, making it difficult to assess the robustness of these results.
Second, the 4x efficiency improvement and 2x knowledge transfer effectiveness compared to previous systems are presented without detailing what those previous systems were or the methodology for measuring efficiency. Efficiency metrics in machine learning can be defined many ways (compute per unit accuracy gain, latency per prediction, cost per conversion), and the specific definition matters for interpreting these claims. The comparison may be against significantly older baselines, making the relative improvement appear larger than it would against more recent alternatives.
Third, the case study focuses heavily on training infrastructure and model architecture but provides limited detail about inference serving, latency requirements, and real-time prediction constraints. Production recommendation systems must return predictions in milliseconds for each ad impression. How GEM's knowledge transfers to latency-constrained vertical models, what inference optimizations are employed, and how model updates deploy without disrupting serving are underexplored. These operational concerns are central to LLMOps but receive less attention than training methodology.
Fourth, the cost of operating this system remains unclear. Training on thousands of GPUs continuously, maintaining hundreds of vertical models, and performing constant online updates represents substantial infrastructure investment. Whether the 3-5% conversion improvements justify this cost depends on Meta's specific business context and existing margins. For most organizations, replicating this approach would be economically infeasible, limiting the generalizability of the case study.
Fifth, the multi-domain learning and cross-surface transfer are presented as clear wins, but the text doesn't discuss potential negative transfer or interference effects. Learning from Instagram video ad engagement to improve Facebook Feed predictions assumes behavioral patterns transfer meaningfully between surfaces, which may not always hold. Certain user segments might exhibit completely different behaviors across platforms, and the joint optimization could degrade performance for those segments relative to surface-specific models.
## Future Directions and Broader Implications
Meta outlines ambitious future directions for GEM including learning from Meta's entire ecosystem across all modalities (text, images, audio, video), extending learnings to cover all major surfaces, developing unified engagement models that rank both organic content and ads jointly, and incorporating inference-time scaling for compute allocation optimization. The vision of unified organic and ads ranking is particularly interesting from an LLMOps perspective, as it would require aligning different objective functions (user engagement for organic, business outcomes for ads) in a single model framework.
The mention of "agentic, insight-driven advertiser automation" suggests Meta envisions GEM powering more sophisticated advertiser-facing tools, potentially using the foundation model to provide strategic recommendations or automated campaign optimization beyond just ad ranking. This would extend the LLMOps challenge from prediction serving to decision-making and planning tasks.
The case study demonstrates several important LLMOps principles: the value of foundation models for transferring knowledge across related tasks, the importance of system-level optimization for cost-effective large-scale training, the need for sophisticated post-training techniques to bridge foundation models and production applications, and the operational complexity of maintaining continuously updating model hierarchies. For practitioners, the most transferable lessons likely concern the knowledge distillation and transfer learning techniques rather than the specific architectural choices or training scale, which depend heavily on Meta's unique context and resources.
The architectural innovations around sequence modeling, cross-feature learning, and multi-domain optimization represent genuine advances in recommendation system design applicable beyond Meta's specific use case. The InterFormer approach to preserving full sequence information while enabling efficient cross-feature learning could inform other sequential prediction problems. The multi-domain learning framework with domain-specific optimization addresses a common challenge in organizations serving multiple products or user segments.
Overall, this case study illustrates production LLMOps at perhaps the most extreme scale in the recommendation systems domain, with sophisticated approaches to training infrastructure, knowledge transfer, and continuous model updates. While the specific implementation may not be replicable for most organizations, the principles and techniques offer valuable insights for anyone building large-scale production ML systems. The balanced view recognizes both the genuine technical achievements and the limitations of the presentation, providing context for understanding what aspects might transfer to other settings versus what is specific to Meta's unique scale and resources.