
Fine-tuning and Scaling LLMs for Search Relevance Prediction

Faire 2024

Faire, an e-commerce marketplace, tackled the challenge of evaluating search relevance at scale by transitioning from manual human labeling to automated LLM-based assessment. They first implemented a GPT-based solution and later improved it using fine-tuned Llama models. Their best-performing model, Llama3-8b, achieved a 28% improvement in relevance prediction accuracy over their previous GPT model, while significantly reducing costs through self-hosted inference that can handle 70 million predictions per day on 16 GPUs.

Industry

E-commerce

Overview

Faire is a global wholesale marketplace that connects hundreds of thousands of independent brands and retailers worldwide. Search functionality is critical to their platform, as it serves as the primary mechanism for retailers to discover and purchase products. The challenge they faced was that irrelevant search results not only frustrated users but also undermined trust in Faire’s ability to match retailers with appropriate brands.

The core problem was measuring semantic relevance at scale. Traditional human labeling was expensive, slow (with a one-month delay between measurement and available labels), and couldn’t keep up with the evolving search system—particularly as personalized retrieval sources increased the variation of query-product pairs shown to different retailers.

Problem Definition and Relevance Framework

Before any modeling work began, the team established a clear definition of relevance using the ESCI framework from the Amazon KDD Cup 2022. This framework breaks down relevance into four tiers: Exact (the product fully matches the query intent), Substitute (the product does not fully match but can serve the same need), Complement (the product does not match but could be used alongside an exact match), and Irrelevant (the product is unrelated to the query).

This multi-tiered approach provides flexibility for downstream applications—search engine optimization might only use exact matches for high precision, while retrieval and ranking systems might focus on removing irrelevant matches to prioritize broader recall.
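To make that flexibility concrete, the sketch below shows how the four ESCI tiers could be collapsed into different binary targets for downstream consumers; the enum and function names are illustrative assumptions rather than Faire's code.

```python
from enum import Enum

class ESCILabel(str, Enum):
    EXACT = "exact"            # product fully matches the query intent
    SUBSTITUTE = "substitute"  # does not fully match, but serves the same need
    COMPLEMENT = "complement"  # does not match, but pairs with an exact match
    IRRELEVANT = "irrelevant"  # unrelated to the query

def is_exact_match(label: ESCILabel) -> bool:
    """High-precision target, e.g. for SEO-style use cases: exact matches only."""
    return label is ESCILabel.EXACT

def is_not_irrelevant(label: ESCILabel) -> bool:
    """High-recall target, e.g. for retrieval and ranking: drop only irrelevant results."""
    return label is not ESCILabel.IRRELEVANT
```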

The team developed labeling guidelines with decision trees, backed by quality audits, to achieve over 90% agreement among human labelers. This investment in clear problem definition and high-quality labeled data proved essential for model performance.

Evolution of the Solution

Phase 1: Human Labeling

The initial approach involved working with a data annotation vendor to label sample query-product pairs monthly. This established ground truth and allowed iteration on guidelines for edge cases. However, the process was expensive and had significant lag time, making relevance measurements less actionable.

Phase 2: Fine-tuned GPT Model

The team framed the multi-class classification as a text completion problem, fine-tuning a leading GPT model to predict ESCI labels. The prompt concatenated search query text with product information (name, description, brand, category), and the model completed the text with one of the four relevance labels.
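A minimal sketch of how such a prompt might be assembled is shown below; the field layout and wording are assumptions rather than Faire's exact template.

```python
LABELS = ["exact", "substitute", "complement", "irrelevant"]

def build_prompt(query: str, product: dict) -> str:
    """Concatenate the search query with product metadata; the model completes with an ESCI label."""
    return (
        f"Query: {query}\n"
        f"Product name: {product['name']}\n"
        f"Product description: {product['description']}\n"
        f"Brand: {product['brand']}\n"
        f"Category: {product['category']}\n"
        f"Relevance ({' / '.join(LABELS)}):"
    )

# The fine-tuned model completes the prompt with one of the four relevance labels.
example = build_prompt(
    "ceramic planter",
    {"name": "Speckled Ceramic Pot", "description": "Hand-glazed 6-inch planter",
     "brand": "Acme Pottery", "category": "Home & Garden > Planters"},
)
```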

This approach achieved 0.56 Krippendorff’s Alpha and could label approximately 300,000 query-product pairs per hour. While this enabled daily relevance measurement, costs remained a limiting factor for scaling to the tens of millions of predictions needed.
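Krippendorff's Alpha scores agreement between the model's predictions and human labels on the same items; a minimal sketch using the open-source `krippendorff` package (treating the model and the human annotations as two raters, with an assumed integer coding of the labels):

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Illustrative integer coding of the four ESCI labels.
CODE = {"exact": 0, "substitute": 1, "complement": 2, "irrelevant": 3}

def alpha_vs_humans(model_labels: list[str], human_labels: list[str]) -> float:
    """Treat the model and the human annotators as two raters over the same items."""
    reliability_data = np.array([
        [CODE[l] for l in model_labels],
        [CODE[l] for l in human_labels],
    ], dtype=float)
    return krippendorff.alpha(reliability_data=reliability_data,
                              level_of_measurement="nominal")

# Perfect agreement on a small sample yields an alpha of 1.0.
print(alpha_vs_humans(["exact", "irrelevant", "substitute"],
                      ["exact", "irrelevant", "substitute"]))
```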

Phase 3: Open-Source Llama Fine-tuning

The hypothesis was that semantic search relevance, despite its nuances, is a specific language understanding problem that may not require models with hundreds of billions of parameters. The team focused on Meta’s Llama family due to its benchmark performance and commercial licensing.

Technical Implementation Details

Fine-tuning Approach

The fine-tuning centered on smaller base models: Llama2-7b, Llama2-13b, and Llama3-8b. A significant advantage was that these models fit into the memory of a single A100 GPU, enabling rapid prototyping and iteration.
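As a hedged illustration of what single-A100 fine-tuning of a model in this size class can look like, the sketch below assumes Hugging Face transformers with PEFT (LoRA); the model id, data file, and hyperparameters are illustrative choices rather than Faire's documented configuration.

```python
# Illustrative only: parameter-efficient (LoRA) fine-tuning of a small Llama base model
# on a single A100. Model id, file names, and hyperparameters are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Meta-Llama-3-8B"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto")
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

def tokenize(example):
    # Prompt and ESCI label concatenated into a single text-completion training string.
    return tokenizer(example["prompt"] + " " + example["label"],
                     truncation=True, max_length=512)

train_ds = load_dataset("json", data_files="train.jsonl")["train"].map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-relevance",
                           num_train_epochs=2,  # two epochs, as described in the experiments
                           per_device_train_batch_size=4,
                           bf16=True),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Because LoRA trains only a small set of adapter weights, an 8B-parameter model stays within a single A100's memory, which is what makes this kind of rapid iteration practical.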

Key technical decisions included:

Dataset Experiments

The team tested three dataset sizes: Small (11k samples), Medium (50k samples), and Large (250k samples). The existing production GPT model was fine-tuned on the Small dataset, while new Llama models were trained on Medium and Large datasets for two epochs. A hold-out dataset of approximately 5k records was used for evaluation.

Training time scaled with model size—the largest model (Llama2-13b) took about five hours to complete training on the Large dataset.

Performance Results

The best-performing model, Llama3-8b trained on the Large dataset, achieved a 28% improvement in Krippendorff's Alpha over the existing production GPT model.
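Assuming the 28% figure is a relative improvement over the 0.56 Krippendorff's Alpha reported for the fine-tuned GPT model, that would place Llama3-8b at roughly 0.56 × 1.28 ≈ 0.72.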

Production Inference Setup

The selected Llama3-8b model is hosted on Faire's GPU cluster for batch predictions. The application requires scoring tens of millions of product-query pairs daily, which demanded a number of throughput optimizations.

These optimizations enabled throughput of 70 million predictions per day during backfill operations, representing a substantial improvement in both cost and capability compared to the previous API-based GPT solution.
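At that rate, 70 million predictions per day across 16 GPUs works out to roughly 50 predictions per second per GPU. As a hedged sketch of high-throughput offline scoring, the example below assumes vLLM for batched generation; the engine choice, model path, and prompt are illustrative assumptions rather than a stated part of Faire's stack.

```python
# Illustrative only: offline batched scoring of query-product prompts with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="llama-relevance")  # assumed path to the fine-tuned model
params = SamplingParams(temperature=0.0, max_tokens=3)  # deterministic, short label output

def score_batch(prompts: list[str]) -> list[str]:
    """Run one large batch of prompts and return the predicted ESCI labels."""
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text.strip() for out in outputs]

# In a daily batch job, tens of millions of prompts would be streamed through in chunks.
labels = score_batch([
    "Query: ceramic planter\nProduct name: Speckled Ceramic Pot\n"
    "Relevance (exact / substitute / complement / irrelevant):",
])
```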

Cost and Operational Benefits

A critical advantage of the self-hosted approach was that it leveraged GPUs Faire had already procured for general deep learning development.

Current and Future Applications

The current use of relevance predictions is primarily offline.

The team has also identified several areas for future exploration.

Key Takeaways for LLMOps

This case study demonstrates several important LLMOps principles: invest early in a clear problem definition and high-quality labeled data, test whether a smaller fine-tuned open-source model can outperform a larger proprietary one on a narrow task, and use self-hosted inference on existing hardware to control cost at high prediction volumes.
