ZenML

Large Language Models for Search Relevance at Scale

Pinterest 2025
View original source

Pinterest's search relevance team integrated large language models into their search pipeline to improve semantic relevance prediction for over 6 billion monthly searches across 45 languages and 100+ countries. They developed a cross-encoder teacher model using fine-tuned open-source LLMs that achieved 12-20% performance improvements over existing models, then used knowledge distillation to create a production-ready bi-encoder student model that could scale efficiently. The solution incorporated visual language model captions, user engagement signals, and multilingual capabilities, ultimately improving search relevance metrics internationally while producing reusable semantic embeddings for other Pinterest surfaces.

Industry

Tech

Technologies

Pinterest’s Large Language Model Integration for Search Relevance

This case study presents Pinterest’s comprehensive approach to integrating large language models into their search infrastructure, demonstrating a sophisticated production deployment that handles massive scale while maintaining performance and cost efficiency. Pinterest operates as a visual discovery platform where users search for inspiration across diverse topics including recipes, home decor, travel, and fashion, processing over 6 billion searches monthly across billions of pins in more than 45 languages and 100+ countries.

Technical Architecture and Model Design

Pinterest’s search relevance system follows a traditional recommendation pipeline with query understanding, retrieval, ranking, and blending stages. The LLM integration specifically targets the semantic relevance modeling component during the re-ranking stage. Their core innovation involves a two-tier architecture consisting of a sophisticated teacher model and a production-optimized student model.

The teacher model employs a cross-encoder architecture that concatenates search queries and pin text representations, feeding them jointly into a large language model to capture rich interactions between query and content. This approach contrasts with bi-encoder methods where query and document embeddings are computed separately. The cross-encoder outputs embeddings that feed into multilayer perceptron layers producing five-dimensional vectors corresponding to relevance levels ranging from most relevant to most irrelevant.

Pinterest experimented extensively with different open-source language models, finding that larger and more advanced models consistently improved performance. Their results demonstrated that an 8-billion parameter Llama model achieved 12% improvement over multilingual BERT-based approaches and 20% improvement over their previous SearchSage embedding model. This experimentation process highlights the importance of systematic model evaluation in production LLM deployments.

Content Representation and Feature Engineering

A critical aspect of Pinterest’s approach involves building comprehensive text representations for each pin. Beyond traditional metadata like titles and descriptions, they incorporate several innovative features that demonstrate thoughtful integration of multimodal and user behavior signals. Visual language model-generated synthetic image captions provide direct extraction of visual information, enabling the text-based relevance models to understand image content without requiring multimodal architectures.

User engagement signals serve as powerful content annotations, including board titles from user-curated collections where pins have been saved and historical search queries that led to high engagement with specific pins. These user action-based features function as crowdsourced annotations that help models understand content semantics from actual usage patterns rather than just metadata.

Ablation studies revealed the value of each feature component, with vision-generated captions providing a solid baseline and sequential addition of other text features yielding consistent performance improvements. The user action-based features proved particularly valuable, demonstrating how implicit user feedback can enhance content understanding in production systems.

Knowledge Distillation for Production Scaling

The production deployment strategy centers on knowledge distillation to address the fundamental tension between model performance and computational efficiency. While the cross-encoder teacher model provides superior relevance prediction, its computational requirements make direct deployment prohibitively expensive at Pinterest’s scale.

The student model employs a bi-encoder architecture where pin and query representations are computed separately, enabling significant optimization opportunities. Pin embeddings are computed entirely offline and cached, with re-computation only occurring when pin features meaningfully change. This approach leverages the fact that Pinterest’s content is relatively stable compared to the query stream.

Query embeddings are computed in real-time but benefit from Pinterest’s search query patterns. Search queries tend to be short, limiting tokenization overhead, and exhibit high repetition rates enabling an 85% cache hit rate. The bi-encoder architecture allows these optimizations while the distillation process transfers much of the cross-encoder’s semantic understanding to the more efficient student model.

The training process for the student model demonstrates sophisticated semi-supervised learning techniques. While the teacher model trains on limited human-annotated data, the student model leverages daily search logs to generate training data at massive scale. This approach scales training data approximately 100x across different domains, languages, and countries where the multilingual teacher model produces reliable labels.

Multilingual and Global Deployment

Pinterest’s global reach necessitated careful attention to multilingual capabilities and cross-cultural transfer learning. Their use of multilingual language models enabled knowledge transfer from predominantly US-based human annotation data to international markets. Online results demonstrated relevance improvements in Germany and France despite training primarily on US data, highlighting the cross-lingual transfer capabilities of modern language models.

The multilingual approach involved using the same model architecture across all languages rather than developing language-specific models. This decision simplified deployment and maintenance while leveraging the inherent multilingual capabilities of foundation models. The success of this approach suggests that well-designed multilingual LLM systems can achieve broad international coverage without extensive localization efforts.

Production Results and Performance

The deployment achieved measurable improvements across multiple metrics including NDCG (Normalized Discounted Cumulative Gain) and precision at 8, with gains observed not only in the US but also internationally. Search fulfillment metrics, measuring user engagement with search results, also improved across non-US markets, validating the cross-lingual transfer effectiveness.

Beyond direct search relevance improvements, the system generated valuable semantic representations as a byproduct. The pin and query embeddings produced by the distilled model proved useful across other Pinterest surfaces including related pins and home feed recommendations. This demonstrates how thoughtfully designed LLM systems can provide multi-purpose value beyond their primary objective.

Operational Considerations and Lessons Learned

Pinterest’s experience highlights several key operational insights for production LLM deployment. The knowledge distillation approach proved essential for achieving the cost-performance balance required at web scale, with the bi-encoder student model enabling efficient serving while retaining much of the teacher model’s semantic understanding.

The integration of multiple signal types, particularly vision-generated captions and user behavior signals, demonstrates the value of comprehensive feature engineering in LLM applications. Rather than relying solely on raw text, Pinterest’s approach shows how auxiliary signals can significantly enhance model performance.

The multilingual deployment success suggests that well-designed foundation model approaches can achieve broad coverage without extensive per-market customization, though this likely depends on having sufficient multilingual capability in the base models and thoughtful architecture design.

The case study also illustrates the importance of systematic experimentation with different foundation models, as performance varied significantly across model choices and sizes. This reinforces the need for thorough evaluation processes when selecting base models for production applications.

Overall, Pinterest’s LLM integration represents a sophisticated production deployment that addresses real-world constraints while achieving meaningful performance improvements. Their two-tier architecture, comprehensive feature engineering, and careful attention to multilingual deployment provide valuable insights for organizations seeking to deploy LLMs at scale in recommendation and search systems.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Foundation Model for Unified Personalization at Scale

Netflix 2025

Netflix developed a unified foundation model based on transformer architecture to consolidate their diverse recommendation systems, which previously consisted of many specialized models for different content types, pages, and use cases. The foundation model uses autoregressive transformers to learn user representations from interaction sequences, incorporating multi-token prediction, multi-layer representation, and long context windows. By scaling from millions to billions of parameters over 2.5 years, they demonstrated that scaling laws apply to recommendation systems, achieving notable performance improvements while creating high leverage across downstream applications through centralized learning and easier fine-tuning for new use cases.

content_moderation classification summarization +37

Large-Scale Personalization and Product Knowledge Graph Enhancement Through LLM Integration

DoorDash 2025

DoorDash faced challenges in scaling personalization and maintaining product catalogs as they expanded beyond restaurants into new verticals like grocery, retail, and convenience stores, dealing with millions of SKUs and cold-start scenarios for new customers and products. They implemented a layered approach combining traditional machine learning with fine-tuned LLMs, RAG systems, and LLM agents to automate product knowledge graph construction, enable contextual personalization, and provide recommendations even without historical user interaction data. The solution resulted in faster, more cost-effective catalog processing, improved personalization for cold-start scenarios, and the foundation for future agentic shopping experiences that can adapt to real-time contexts like emergency situations.

customer_support question_answering classification +64