Company
Pinterest
Title
Enhancing Ads Engagement with Multi-gate Mixture-of-Experts and Knowledge Distillation
Industry
Tech
Year
2025
Summary (short)
Pinterest improved their ads engagement modeling by implementing a Multi-gate Mixture-of-Experts (MMoE) architecture combined with knowledge distillation. The team addressed short data retention windows by distilling knowledge from production models into new models, and kept the added computational cost in check with mixed precision inference and lightweight gate layers. The solution delivered significant improvements in both offline accuracy and online engagement metrics while achieving a 40% reduction in inference latency.
Pinterest's case study demonstrates a sophisticated approach to scaling and improving their ads engagement modeling system through advanced model architectures and optimization techniques. It is particularly interesting because it shows how a major tech company deploys complex neural network architectures in production while balancing performance improvements against infrastructure costs.

The core problem Pinterest faced was that their existing DCNv2-style recommendation system wasn't seeing proportional gains from simply adding more layers and parameters. They needed a more efficient way to scale model performance while managing computational resources, which led them to explore the Multi-gate Mixture-of-Experts (MMoE) architecture.

The implementation journey involved several key technical components and considerations:

Architecture Selection and Optimization:
* They experimented with various expert architectures, including DCNv2, MaskNet, and FinalMLP
* Through careful experimentation, they found that DCNv2-based experts provided the best return on investment
* They balanced the number of experts against performance gains, recognizing diminishing returns beyond a certain expert count
* Gate layers were kept intentionally lightweight without sacrificing performance, which helped reduce infrastructure costs (see the MMoE sketch below)

Infrastructure and Production Considerations:
* To absorb the increased computational demands of running multiple experts, they implemented mixed precision inference (see the second sketch below)
* This optimization resulted in a 40% reduction in inference latency without compromising model performance
* The team carefully weighed the trade-offs between model complexity and serving costs
* They leveraged their existing mixed precision inference infrastructure, integrating cleanly with their production systems

Knowledge Distillation Implementation:
* To address limited data retention periods (typically a few months to a year), they distilled knowledge from production models into new experimental models
* They found pairwise-style loss functions particularly effective for their use case (see the loss sketch below)
* An important observation: knowledge distillation should be limited to batch training, as including it in incremental training led to significant overfitting
* The technique proved valuable not just for new model deployments but also for feature upgrades and computation-graph improvements

Production Deployment Strategy:
* The team carefully considered when and how to apply knowledge distillation in different scenarios
* They developed specific strategies for retraining models when warm starting from checkpoints wasn't available
* The rollout plan covered the transition from experimental to production models

Evaluation and Monitoring:
* The team established that a 0.1% improvement in offline accuracy is considered significant in their domain
* They evaluated comprehensively across different view types (RelatedPins and Search)
* The evaluation framework covered both offline and online metrics
* Results showed significant improvements in both categories
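To make the gating pattern concrete, here is a minimal PyTorch sketch of an MMoE with DCNv2-style experts and deliberately lightweight single-layer gates, as described above. The module names, dimensions, expert count, and task count are illustrative assumptions, not Pinterest's actual implementation:

```python
import torch
import torch.nn as nn


class CrossLayer(nn.Module):
    """One DCNv2-style cross layer: x_{l+1} = x0 * (W x_l + b) + x_l."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl


class DCNv2Expert(nn.Module):
    """A small DCNv2-style expert: stacked cross layers plus an MLP branch."""

    def __init__(self, dim: int, num_cross: int = 2, hidden: int = 256):
        super().__init__()
        self.cross = nn.ModuleList(CrossLayer(dim) for _ in range(num_cross))
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xl = x
        for layer in self.cross:
            xl = layer(x, xl)
        return xl + self.mlp(x)


class MMoE(nn.Module):
    """Multi-gate Mixture-of-Experts: shared experts, one gate per task."""

    def __init__(self, dim: int, num_experts: int = 4, num_tasks: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(DCNv2Expert(dim) for _ in range(num_experts))
        # Gates are intentionally a single linear layer each: cheap to serve,
        # yet enough to learn a per-task mixture over the shared experts.
        self.gates = nn.ModuleList(nn.Linear(dim, num_experts) for _ in range(num_tasks))
        self.heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_tasks))

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        logits = []
        for gate, head in zip(self.gates, self.heads):
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)     # (B, E, 1)
            mixed = (weights * expert_out).sum(dim=1)                  # (B, D)
            logits.append(head(mixed))                                 # (B, 1)
        return logits
```

The design choice worth noting is that each gate is a single linear layer, so the per-task overhead stays negligible and serving cost is dominated by the shared experts.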
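Running several experts per request multiplies the matmul cost, which is what mixed precision inference claws back. Below is a minimal sketch of the common PyTorch approach (autocast at inference time); the model and batch are hypothetical and this is not Pinterest's serving stack:

```python
import torch

# Hypothetical setup: `MMoE` is the module from the sketch above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MMoE(dim=128).eval().to(device)
features = torch.randn(512, 128, device=device)

# fp16 on GPU; CPU autocast is better supported with bfloat16.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

with torch.inference_mode():
    # autocast runs matmul-heavy expert/gate layers in reduced precision
    # while keeping numerically sensitive ops (e.g. softmax) in fp32.
    with torch.autocast(device_type=device, dtype=amp_dtype):
        task_logits = model(features)
```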
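The write-up reports that pairwise-style losses worked best for distillation but does not spell out the formulation. The sketch below assumes a RankNet-style variant in which the teacher's soft pairwise preferences supervise the student's pairwise score differences, so the student learns the teacher's ordering rather than its raw scores:

```python
import torch
import torch.nn.functional as F


def pairwise_distillation_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor) -> torch.Tensor:
    """Assumed RankNet-style pairwise distillation over a batch of scores.

    For every pair (i, j) in the batch, the teacher's soft preference
    sigmoid(t_i - t_j) is the target for the student's pairwise logit
    (s_i - s_j). Inputs are 1-D tensors of shape (B,).
    """
    teacher_logits = teacher_logits.detach()  # no gradients into the teacher
    s_diff = student_logits.unsqueeze(1) - student_logits.unsqueeze(0)  # (B, B)
    t_diff = teacher_logits.unsqueeze(1) - teacher_logits.unsqueeze(0)  # (B, B)
    soft_targets = torch.sigmoid(t_diff)
    return F.binary_cross_entropy_with_logits(s_diff, soft_targets)


# Per the case study, this term would be added only during batch training
# (including it in incremental training led to significant overfitting), e.g.:
# loss = engagement_loss + distill_weight * pairwise_distillation_loss(s, t)
```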
One of the most impressive aspects of this implementation is how the team balanced theoretical improvements with practical constraints. They didn't just implement MMoE in its basic form but adapted it to their specific needs, and the addition of knowledge distillation to work around data retention limitations shows sophisticated thinking about real-world ML operations challenges.

The case study also demonstrates good MLOps practices:
* Careful experimentation and validation before production deployment
* Attention to infrastructure costs and optimization
* Robust evaluation metrics
* A clear understanding of trade-offs between model complexity and performance
* Integration with existing infrastructure and systems

The results are particularly noteworthy given the scale at which Pinterest operates: a 40% reduction in inference latency while maintaining or improving model performance is a significant achievement in a production environment.

Limitations and considerations that practitioners should note:
* The approach requires significant computational resources to train multiple experts
* System complexity grows with the number of experts and gates
* Knowledge distillation needs careful handling to avoid overfitting
* Finding the right balance of experts and architecture choices may require substantial experimentation

This case study provides valuable insights for organizations looking to scale their ML models while maintaining efficiency and managing infrastructure costs. It is particularly relevant for companies running large-scale recommendation systems and those facing similar data retention challenges.
