Booking.com's implementation of an LLM-as-a-judge framework represents a sophisticated approach to one of the most challenging aspects of deploying generative AI applications at scale: comprehensive, automated evaluation. As a major online travel platform, Booking.com has been actively deploying LLM-powered applications across various use cases including property descriptions, text summarization, and customer interaction systems. The company recognized that traditional machine learning evaluation approaches were insufficient for their generative AI applications due to the inherent complexity of evaluating open-ended text generation tasks.
The core challenge that prompted this LLMOps solution was the difficulty in evaluating LLM outputs where no single ground truth exists or where obtaining human evaluation at scale would be prohibitively expensive and time-consuming. Unlike traditional ML models that often have clear binary or numerical targets, LLM applications frequently generate creative or contextual content where quality assessment requires nuanced understanding of attributes like clarity, factual accuracy, instruction-following, and readability. Manual evaluation by human experts, while theoretically possible, would be impractical for production monitoring given the volume and velocity of LLM-generated content in Booking.com's systems.
The LLM-as-a-judge framework operates on the principle of using a more powerful LLM to evaluate the outputs of target LLMs in production. This approach requires human involvement only during the initial setup phase to create high-quality "golden datasets" with reliable annotations. Once established, the judge LLM can continuously assess production outputs with minimal human oversight. The framework addresses three critical LLM challenges that Booking.com identified: hallucination (generating factually incorrect information with high confidence), failure to follow detailed instructions, and the non-negligible computational and financial costs associated with LLM inference.
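The core pattern is simple to sketch in code. The snippet below shows a minimal pointwise judging call for the instruction-following metric, assuming an OpenAI-style client; the prompt wording, metric, and model choice are illustrative rather than Booking.com's actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an evaluation judge. Given an instruction and a model's
response, decide whether the response follows the instruction.
Answer with exactly one word: PASS or FAIL.

Instruction: {instruction}
Response: {response}
Verdict:"""

def judge_instruction_following(instruction: str, response: str) -> str:
    """Ask a stronger 'judge' model for a binary verdict on a target model's output."""
    completion = client.chat.completions.create(
        model="gpt-4.1",  # a powerful backbone model, per the case study
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(instruction=instruction, response=response),
        }],
        temperature=0.0,  # deterministic verdicts for stable monitoring
    )
    return completion.choices[0].message.content.strip()
```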
The technical architecture centers on a carefully orchestrated development and deployment cycle. Development begins with creating golden datasets that accurately represent the production data distribution, using either a basic annotation protocol with a single annotator or an advanced protocol with multiple annotators and inter-annotator agreement metrics. To qualify as golden, a dataset must meet strict quality criteria: representative sampling of the production distribution and high-quality labels verified through the annotation protocol.
For golden dataset creation, Booking.com established standardized annotation protocols that emphasize clear metric definitions, pilot annotations for validation, and comprehensive quality reviews. The basic protocol works with a single annotator but requires careful quality control, while the advanced protocol employs multiple annotators (3+) with aggregation functions to handle disagreements. The company recommends binary or categorical metrics over continuous scores, as LLMs demonstrate better performance on discrete classification tasks. Dataset sizes typically range from 500 to 1,000 examples to ensure sufficient data for both validation and testing phases.
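As an illustration of the advanced protocol's mechanics, the sketch below shows a simple majority-vote aggregation over three or more annotators, plus an inter-annotator agreement check using Cohen's kappa; the tie-handling rule (escalation for expert review) is an assumption, since the case study does not specify one.

```python
from collections import Counter

from sklearn.metrics import cohen_kappa_score

def aggregate_labels(annotations: list[str]) -> str | None:
    """Majority-vote aggregation over 3+ annotator labels for one example.

    Returns the winning label, or None when no strict majority exists,
    signalling that the example should be escalated for expert review.
    """
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None

print(aggregate_labels(["accurate", "accurate", "hallucinated"]))  # -> "accurate"
print(aggregate_labels(["accurate", "hallucinated", "unclear"]))   # -> None

# Inter-annotator agreement between two annotators across four examples
annotator_a = ["accurate", "hallucinated", "accurate", "accurate"]
annotator_b = ["accurate", "hallucinated", "hallucinated", "accurate"]
print(cohen_kappa_score(annotator_a, annotator_b))  # -> 0.5 (moderate agreement)
```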
The judge LLM development process follows an iterative approach using manual prompt engineering techniques. Booking.com typically employs powerful models like GPT-4.1 or Claude 4.0 Sonnet as the backbone for their judge LLMs, serving both as sanity checks for task feasibility and as upper bounds for achievable performance. The prompt engineering process incorporates chain-of-thought (CoT) reasoning to enhance explainability and includes few-shot examples while carefully avoiding overfitting to the golden dataset.
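A judge prompt following this recipe combines a metric definition, chain-of-thought reasoning, and a small number of worked examples. The template below is an illustrative sketch for a factual-accuracy judge, not the production prompt:

```python
FACTUAL_ACCURACY_JUDGE = """You are evaluating property descriptions for factual
accuracy against a source document.

First, reason step by step about whether every claim in the description is
supported by the source. Then output a final verdict on its own line, as
"Verdict: ACCURATE" or "Verdict: HALLUCINATED".

### Example
Source: "The hotel has 24 rooms and a rooftop terrace."
Description: "Enjoy one of our 30 rooms and relax on the rooftop terrace."
Reasoning: The description claims 30 rooms, but the source says 24. The rooftop
terrace is supported. One unsupported claim is enough to fail the description.
Verdict: HALLUCINATED

### Task
Source: "{source}"
Description: "{description}"
Reasoning:"""
```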
The evaluation methodology splits golden datasets into validation and test sets, typically using a 50/50 split. Performance assessment relies on appropriate metrics such as macro F1-score for classification tasks, with iterative error analysis driving prompt refinements. Once strong models achieve satisfactory performance, the team develops more cost-efficient versions using weaker models while maintaining comparable accuracy levels. This dual-model approach enables high-quality evaluation during development phases and cost-effective monitoring in production environments.
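Under this methodology, validating a judge reduces to standard classification scoring. A minimal sketch using scikit-learn follows, where `load_golden_dataset` and `judge` are hypothetical stand-ins for the annotated dataset loader and a judge-LLM call like the one sketched earlier:

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Golden dataset: (input, human label) pairs from the annotation protocol
examples, labels = load_golden_dataset()  # hypothetical loader

# 50/50 split: validation drives prompt iteration, test stays held out
X_val, X_test, y_val, y_test = train_test_split(
    examples, labels, test_size=0.5, random_state=42, stratify=labels)

# judge(x) returns the judge LLM's categorical verdict for example x
predictions = [judge(x) for x in X_val]
print("macro F1:", f1_score(y_val, predictions, average="macro"))
```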
Production deployment involves continuous monitoring dashboards that track multiple quality metrics across different aspects of LLM performance. These include entity extraction accuracy, instruction-following compliance, user frustration indicators, context relevance, and task-specific metrics like location resolution and topic extraction accuracy. The monitoring system includes automated alerting mechanisms that notify application owners when performance anomalies are detected across any tracked metrics.
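The case study does not detail the alerting logic, but the idea can be captured by a threshold check over a rolling window of judge verdicts, as in the sketch below; the window size, pass-rate floor, and notification hook are assumptions:

```python
from collections import deque

def notify_owner(metric: str, rate: float) -> None:
    """Hypothetical notification hook; in practice this would page the app owner."""
    print(f"ALERT: {metric} pass rate dropped to {rate:.1%}")

class MetricAlert:
    """Alert when the rolling pass rate of a judged metric falls below a floor."""

    def __init__(self, name: str, window: int = 500, floor: float = 0.95):
        self.name = name
        self.floor = floor
        self.verdicts: deque[bool] = deque(maxlen=window)

    def record(self, passed: bool) -> None:
        self.verdicts.append(passed)
        if len(self.verdicts) < self.verdicts.maxlen:
            return  # wait for a full window before alerting
        rate = sum(self.verdicts) / len(self.verdicts)
        if rate < self.floor:
            notify_owner(self.name, rate)

# Each production response is judged, then its verdict is recorded per metric
instruction_following = MetricAlert("instruction_following")
instruction_following.record(passed=True)
```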
The framework's practical implementation at Booking.com demonstrates several sophisticated LLMOps practices. The company has developed both pointwise judges (which assign absolute scores to individual responses) and comparative judges (which rank multiple responses). Pointwise judges serve dual purposes for both development ranking and production monitoring, while comparative judges provide stronger ranking signals during system development phases. This flexibility allows teams to choose the most appropriate evaluation approach based on their specific use cases and operational constraints.
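The distinction between the two judge types is easiest to see as two function signatures. The sketch below mirrors the earlier OpenAI-style call; prompts and model name are again illustrative:

```python
from openai import OpenAI

client = OpenAI()

def ask_judge(question: str) -> str:
    """One-word verdict from the judge model; model name is illustrative."""
    out = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
    )
    return out.choices[0].message.content.strip()

def pointwise_judge(prompt: str, response: str) -> str:
    """Absolute verdict ("PASS"/"FAIL") on one response: works both for
    offline ranking and for per-response production monitoring."""
    return ask_judge(
        f"Prompt: {prompt}\nResponse: {response}\n"
        "Does the response satisfy the prompt? Answer PASS or FAIL."
    )

def comparative_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Relative preference ("A"/"B"/"TIE") between two candidates: a stronger
    ranking signal in development, but unusable on a single live output."""
    return ask_judge(
        f"Prompt: {prompt}\nResponse A: {response_a}\nResponse B: {response_b}\n"
        "Which response better satisfies the prompt? Answer A, B, or TIE."
    )
```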
Booking.com has also invested in automated prompt engineering capabilities inspired by research like DeepMind's OPRO (Optimization by PROmpting), which automates the traditionally manual and time-intensive prompt development process. This automation addresses one of the significant bottlenecks in judge LLM development, compressing a cycle that typically takes anywhere from a day to a week. However, the company maintains human oversight to ensure final prompts don't contain use-case-specific examples that might undermine generalizability.
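OPRO's core loop is straightforward to sketch: show an optimizer LLM previous prompts with their validation scores and ask it to propose a better one. The version below is schematic, with the scoring function (e.g., the macro-F1 evaluation above) and the optimizer-LLM call supplied by the caller:

```python
def optimize_prompt(seed_prompt, score_fn, generate_fn, steps=10):
    """OPRO-style loop: show the optimizer LLM past (prompt, score) pairs
    and ask it to propose a prompt expected to score higher.

    score_fn:    maps a candidate prompt to a validation score (e.g., macro F1).
    generate_fn: calls the optimizer LLM with a meta-prompt, returns its text.
    """
    history = [(seed_prompt, score_fn(seed_prompt))]
    for _ in range(steps):
        # Present the trajectory worst-to-best, as OPRO does, so the
        # strongest prompts sit closest to the generation point.
        trajectory = "\n\n".join(
            f"Score {score:.3f}:\n{prompt}"
            for prompt, score in sorted(history, key=lambda pair: pair[1])
        )
        meta_prompt = (
            "Below are evaluation prompts and their validation scores, "
            f"ordered worst to best:\n\n{trajectory}\n\n"
            "Write a new prompt that will achieve a higher score."
        )
        candidate = generate_fn(meta_prompt)
        history.append((candidate, score_fn(candidate)))
    return max(history, key=lambda pair: pair[1])[0]
```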
The technical infrastructure supporting this framework handles the complexity of managing multiple LLM models, prompt versions, and evaluation pipelines in production. While specific infrastructure details aren't extensively covered in the case study, the implementation clearly requires sophisticated MLOps capabilities including model versioning, A/B testing frameworks, monitoring dashboards, and automated alerting systems.
Looking toward future developments, Booking.com is exploring synthetic data generation for golden dataset creation, which could significantly reduce annotation overhead while maintaining quality standards. The company is also working on evaluation methodologies specifically designed for LLM-based agents, which present additional challenges around multi-step reasoning, tool usage, and long-term goal completion assessment.
The case study reveals several important LLMOps considerations that other organizations should note. The framework requires significant upfront investment in annotation processes and infrastructure development, but provides scalable evaluation capabilities that become increasingly valuable as LLM deployment expands. The dual-model approach (strong models for development, efficient models for production) represents a practical compromise between evaluation quality and operational costs. The emphasis on human-in-the-loop validation during golden dataset creation ensures that automated evaluation maintains alignment with business objectives and user expectations.
From a broader LLMOps perspective, this case study demonstrates the critical importance of establishing robust evaluation frameworks early in generative AI adoption. The framework addresses fundamental challenges around quality assurance, performance monitoring, and cost management that are essential for successful production deployment of LLM applications. The systematic approach to prompt engineering, dataset creation, and continuous monitoring provides a replicable model for other organizations facing similar challenges in operationalizing generative AI systems.
The technical sophistication demonstrated in this implementation reflects Booking.com's mature approach to machine learning operations, extending traditional MLOps practices to accommodate the unique challenges of generative AI systems. The framework's success in enabling automated evaluation at scale while maintaining quality standards represents a significant achievement in practical LLMOps implementation.