Company
Zalando
Title
Multimodal LLM-as-a-Judge for Large-Scale Product Retrieval Evaluation
Industry
E-commerce
Year
2025
Summary (short)
Zalando, a major e-commerce platform, faced the challenge of evaluating product retrieval systems at scale across multiple languages and diverse customer queries. Traditional human relevance assessments required substantial time and resources, making large-scale continuous evaluation impractical. The company developed a novel framework leveraging Multimodal Large Language Models (MLLMs) that automatically generate context-specific annotation guidelines and conduct relevance assessments by analyzing both text and images. Evaluated on 20,000 examples, the approach achieved accuracy comparable to human annotators while being up to 1,000 times cheaper and significantly faster (20 minutes versus weeks for humans), enabling continuous monitoring of high-frequency search queries in production and faster identification of areas requiring improvement.
## Overview

Zalando's case study presents a production deployment of Multimodal Large Language Models (MLLMs) for evaluating product retrieval systems at scale. The company operates a large e-commerce platform serving customers across multiple languages and markets, where search functionality is fundamental to business success. Customers using search typically exhibit higher intent to purchase, leading to greater engagement and conversion rates. However, evaluating whether search results are relevant to customer queries at scale presents significant operational challenges.

Traditionally, Zalando relied on human annotators to assess the relevance of query-product pairs, a process that is time-consuming, expensive, and difficult to scale for continuous evaluation across diverse markets and product categories. The company needed a solution that could handle large-scale evaluation efficiently while maintaining quality comparable to human assessment. Their approach introduces a framework that positions LLMs not just as content generators but as evaluation judges—a pattern increasingly referred to as "LLM-as-a-Judge" in the MLOps community.

## The Production Problem and Business Context

Search quality directly impacts customer experience and conversion rates on e-commerce platforms. Customers may struggle to articulate their needs precisely in search queries, and even when they express intent clearly, retrieval systems may fail to interpret it correctly. This results in irrelevant search results that degrade user experience and reduce conversions. For a platform operating at Zalando's scale across multiple languages and markets, continuously monitoring and improving search quality requires evaluating massive numbers of query-product pairs.

The evaluation challenge is particularly complex in e-commerce because it involves both textual information (product descriptions, attributes, query text) and visual information (product images). A customer searching for "women's long sleeve t-shirt with green stripes" needs results that match multiple semantic requirements including product type, visual attributes, target demographic, and pattern. Human annotators can assess these multimodal signals but at significant cost and with limited scalability.

## Technical Architecture and LLMOps Implementation

Zalando's framework, titled "Retrieve, Annotate, Evaluate, Repeat," implements a four-stage pipeline designed for production deployment:

**Query Extraction**: The system extracts query-product pairs from search logs, leveraging real production traffic data. This grounds the evaluation in actual customer behavior rather than synthetic test sets, ensuring the assessment reflects real-world usage patterns.

**Guideline Generation**: For each query, an LLM generates custom, context-specific annotation guidelines. This is a crucial design decision that addresses a common challenge in LLMOps: how to ensure consistent evaluation criteria across diverse queries without manually creating guidelines for every possible search term. The LLM analyzes the query (for example, "women's long sleeve t-shirt with green stripes") and identifies key requirements such as assortment category, sleeve length, product type, and pattern. It assigns importance levels to each requirement, creating a structured framework for relevance assessment. This automated guideline generation enables the system to adapt to new queries without human intervention, a key requirement for production scalability.
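The paper does not publish the exact prompt or schema used for this step, so the following is only a minimal sketch of how automated guideline generation could look, assuming the OpenAI Python SDK; the prompt wording, the JSON structure, and the `generate_guidelines` helper are illustrative assumptions rather than Zalando's implementation.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDELINE_PROMPT = """You are preparing annotation guidelines for judging whether an
e-commerce product is relevant to a customer search query.

Query: "{query}"

List the requirements implied by the query (e.g. product type, target group,
sleeve length, color, pattern) and assign each an importance level of
"must_have" or "nice_to_have". Return JSON with the key "requirements",
a list of {{"name": ..., "value": ..., "importance": ...}} objects."""


def generate_guidelines(query: str, model: str = "gpt-4o") -> dict:
    """Generate query-specific annotation guidelines as structured JSON."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GUIDELINE_PROMPT.format(query=query)}],
        response_format={"type": "json_object"},  # constrain output to valid JSON
        temperature=0,  # deterministic guidelines so repeated runs stay consistent
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    print(generate_guidelines("women's long sleeve t-shirt with green stripes"))
```

In a production pipeline, guidelines like these would typically be cached per query so that repeated evaluations of the same high-frequency query reuse identical criteria.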
**Multimodal Annotation**: MLLMs assign relevancy scores to search results based on both textual and visual product descriptions. The framework uses three relevance labels: "highly relevant," "acceptable substitute," and "irrelevant." The multimodal capability is essential because e-commerce products are inherently visual—a customer searching for "black sneakers" needs both the product type and color to match, which requires analyzing product images alongside textual attributes. The system leverages GPT-4o's vision capabilities to generate visual descriptions of product images, which are then incorporated into the relevance assessment process (a sketch of this step follows the pipeline description below).

**Evaluation and Storage**: Each labeled query-product pair is stored for continuous retrieval system evaluation and comparison across different configurations. This creates a persistent evaluation dataset that enables longitudinal tracking of search quality and A/B testing of different retrieval algorithms.

The framework's modular design incorporates caching and parallel processing, critical production engineering considerations for handling high-throughput evaluation. The ability to cache guideline generation results and process annotations in parallel enables the system to evaluate 20,000 query-product pairs in approximately 20 minutes—a dramatic improvement over the weeks required for human annotation.
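The annotation prompt and orchestration code are likewise not published. The sketch below shows, under that caveat, how a single query-product pair might be judged with GPT-4o using both the product image and textual attributes, and how many pairs could be annotated in parallel. The three labels come from the paper; the prompt wording, the `annotate_pair` and `annotate_batch` helpers, and the thread-pool size are assumptions for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()
LABELS = ["highly relevant", "acceptable substitute", "irrelevant"]

ANNOTATION_PROMPT = """Judge the relevance of the product below for the query.

Query: "{query}"
Annotation guidelines: {guidelines}
Product title and attributes: {product_text}

Consider the attached product image as well. Return JSON:
{{"label": <one of {labels}>, "reason": <short explanation>}}"""


def annotate_pair(query: str, guidelines: dict, product_text: str, image_url: str) -> dict:
    """Assign one of three relevance labels to a single query-product pair."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": ANNOTATION_PROMPT.format(
                    query=query, guidelines=json.dumps(guidelines),
                    product_text=product_text, labels=LABELS)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


def annotate_batch(pairs: list[dict], max_workers: int = 16) -> list[dict]:
    """Annotate many query-product pairs concurrently (the API calls are I/O-bound)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: annotate_pair(**p), pairs))
```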
## Model Selection and Comparative Analysis

Zalando evaluated multiple LLM variants to understand the cost-quality tradeoffs: GPT-4o, GPT-4 Turbo, and GPT-3.5 Turbo. This comparative analysis reflects mature LLMOps practice—recognizing that different models offer different tradeoffs between capability, cost, and latency, and that production deployments benefit from empirical evaluation of these tradeoffs on domain-specific tasks.

The paper reports agreement metrics comparing LLM annotations against human annotators across English and German queries, demonstrating the framework's multilingual capability. The evaluation methodology measures agreement between LLM annotations and human majority votes, as well as inter-annotator agreement between human annotators themselves. This rigorous comparison establishes that LLM-based assessments achieve quality comparable to human judgment, a critical validation for production deployment.

## Error Analysis: A Nuanced View of Human vs. LLM Performance

One of the most valuable contributions from an LLMOps perspective is Zalando's detailed error analysis comparing human and LLM annotation mistakes. While the overall error rates are comparable, the error distributions differ significantly, revealing complementary strengths and weaknesses.

Human annotators primarily make errors related to brands (incorrectly marking products from wrong brands as relevant), products, and categories. These errors appear driven by annotation fatigue—the repetitive nature of relevance assessment leads to decreased attention to specific details. LLMs rarely make these types of errors, as they consistently apply the same evaluation criteria without fatigue.

Conversely, LLMs exhibit two distinct error patterns. First, they tend to be overly strict in their judgments, lacking the flexible interpretation that humans naturally apply. For example, when a query asks for "black Levi's jeans" and the retrieved product is dark grey Levi's jeans, humans might reasonably consider this an acceptable substitute, while the LLM judges it as irrelevant due to the color mismatch. Second, LLMs suffer from "understanding" errors where they misinterpret query intent—for instance, interpreting "On Vacation" as a literal phrase about holiday activities rather than recognizing it as a brand name.

This error analysis leads to a pragmatic production strategy: using LLMs to handle the "bulk work" of annotating straightforward queries where their consistency and lack of fatigue provide advantages, while reserving human expertise for complex cases involving style judgments, trend interpretation, or ambiguous queries where domain knowledge and flexible reasoning are essential. This human-LLM collaboration pattern represents a mature approach to LLMOps that leverages the complementary strengths of automated and human evaluation.

## Production Deployment and Impact

The framework has been deployed in production at Zalando, enabling regular monitoring of high-frequency search queries and identification of low-performing queries for targeted improvement. This continuous assessment provides operational visibility into search quality and enables rapid identification of areas requiring algorithmic adjustments. The business impact is measured across multiple dimensions:

**Cost Efficiency**: MLLM-based assessments are reported to be up to 1,000 times cheaper than human labor. While this claim should be interpreted carefully (it likely reflects direct annotation costs rather than full system development and operational costs), it nonetheless indicates substantial cost savings that make large-scale continuous evaluation economically feasible.

**Speed**: The 20-minute evaluation time for 20,000 pairs versus weeks for human annotation enables much faster iteration cycles for search algorithm improvements. This velocity advantage is crucial for maintaining competitive search quality in dynamic e-commerce environments where product catalogs and customer preferences constantly evolve.

**Multilingual Adaptability**: The framework's support for multiple languages addresses a critical operational requirement for e-commerce platforms operating in diverse markets. Training separate models or creating separate annotation guidelines for each language would be prohibitively expensive; the LLM-based approach handles multilingual evaluation through the model's inherent multilingual capabilities.

## Technical Considerations and Limitations

Zalando explicitly acknowledges that semantic relevance is "necessary but not sufficient" for high customer engagement. Production search systems must balance relevance with other factors including personal preferences, product availability, price expectations, and seasonality. The paper focuses specifically on semantic relevance evaluation, while noting that production ranking incorporates additional features beyond pure relevance.

This acknowledgment reflects mature understanding of LLMOps deployment: LLMs excel at certain tasks (semantic understanding and consistency-based evaluation) but cannot replace the full complexity of production ranking systems that must optimize for multiple business objectives and personalization signals. The LLM-based evaluation framework serves as one component in a broader search quality infrastructure rather than a complete solution.

The framework's modular design enables it to support multiple search engines and accommodate updates to retrieval algorithms, suggesting that Zalando likely experiments with different retrieval approaches and uses the evaluation framework to compare their performance. This speaks to the operational maturity of treating evaluation as a first-class system component that must be maintained and evolved alongside the systems it evaluates.
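The paper does not show how the stored judgments are turned into per-configuration comparisons. Purely as an illustration, a comparison over stored annotations might look like the sketch below, where the label weights and the `compare_configurations` helper are assumptions, not part of Zalando's framework.

```python
from collections import Counter

# Hypothetical weights for the three relevance labels; the paper does not
# specify how labels are aggregated into a per-configuration score.
LABEL_WEIGHTS = {"highly relevant": 1.0, "acceptable substitute": 0.5, "irrelevant": 0.0}


def relevance_score(annotations: list[dict]) -> float:
    """Mean weighted relevance over annotated query-product pairs."""
    if not annotations:
        return 0.0
    return sum(LABEL_WEIGHTS[a["label"]] for a in annotations) / len(annotations)


def compare_configurations(results_by_config: dict[str, list[dict]]) -> None:
    """Print per-configuration scores and label distributions side by side."""
    for config, annotations in results_by_config.items():
        counts = Counter(a["label"] for a in annotations)
        print(f"{config}: score={relevance_score(annotations):.3f} labels={dict(counts)}")


# Example with stored annotations from two hypothetical retrieval configurations:
compare_configurations({
    "ranker_v1": [{"label": "highly relevant"}, {"label": "irrelevant"}],
    "ranker_v2": [{"label": "highly relevant"}, {"label": "acceptable substitute"}],
})
```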
## Future Directions and System Evolution

Zalando outlines several areas for future enhancement that reveal thoughtful consideration of the framework's current limitations:

**Deeper Human-LLM Collaboration**: Developing systematic approaches to identify cases that require human judgment and route them appropriately, while letting LLMs handle straightforward cases. This suggests potential development of confidence scoring or uncertainty detection mechanisms that could trigger human review.

**Broadening Relevance Dimensions**: Extending the framework to assess multiple dimensions of relevance including personal preferences, seasonal trends, and emerging product categories. This would move toward more holistic evaluation but introduces additional complexity around how to weight different relevance dimensions.

**Adaptability Across Market Segments**: Testing and refining the framework's ability to handle domain-specific language and visual cues that vary between product categories like fashion versus beauty products. This reflects recognition that different product domains may require adapted evaluation approaches.

**Automated Trend Detection**: Leveraging real-time data and LLMs' ability to capture emerging terminology and styles to make the framework more responsive to evolving trends. This is particularly relevant in fashion e-commerce where styles and terminology evolve rapidly.

## LLMOps Maturity Indicators

Several aspects of this case study indicate mature LLMOps practices:

- The use of real production traffic data for evaluation grounds the system in actual business needs rather than synthetic benchmarks.
- The rigorous comparative evaluation against human annotators with detailed error analysis demonstrates empirical validation rather than assuming LLM superiority.
- The acknowledgment of complementary human-LLM strengths and design for hybrid workflows shows pragmatic understanding of automation boundaries.
- The modular, scalable architecture with caching and parallel processing addresses production engineering requirements.
- The explicit comparison of different model versions (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo) indicates attention to cost-quality tradeoffs.
- The recognition that semantic relevance is one component of a broader ranking system shows systems thinking rather than a point-solution mentality.

## Critical Assessment

While the case study presents compelling results, several considerations warrant balanced assessment. The cost comparison of "up to 1,000 times cheaper" likely reflects direct annotation costs rather than full system costs including LLM API expenses, engineering development, infrastructure, and ongoing maintenance. The true cost advantage, while likely still substantial, may be less dramatic when accounting for these factors.

The framework's reliance on commercial LLM APIs (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo) creates dependencies on external providers for a critical evaluation capability. This introduces considerations around API availability, pricing changes, model versioning, and data privacy that production deployments must manage. The error analysis revealing the LLMs' tendency toward overly strict judgments and occasional misunderstanding of brand names or domain-specific terminology suggests limitations in handling nuanced or specialized queries that may require ongoing refinement.
The paper focuses on semantic relevance evaluation but notes that production ranking incorporates many additional signals. The relationship between improving semantic relevance evaluation and actual business metrics like click-through rates, conversion, or customer satisfaction is implied but not directly demonstrated, though this may reflect the paper's focus on the evaluation methodology rather than end-to-end business impact.

Overall, this case study exemplifies a sophisticated LLMOps deployment that thoughtfully addresses real production challenges with appropriate validation, error analysis, and recognition of system boundaries. It represents a valuable example of the LLM-as-a-Judge pattern applied to e-commerce search evaluation at scale.
